
Podcast editing and cleanup: the voice post-production workflow

Editing and processing are two different jobs. Here is the order that gets a clean, consistent episode: cut, clean, EQ, compress, level, normalize, check.

By Hanna Eng · Audio engineer, Abbey Road Institute Paris

Updated 1 June 2026 · 10 min read

Normalize a podcast to -16 LUFS integrated, with a true-peak ceiling of -1 dBTP. Apple recommends -16 LUFS and Spotify normalizes toward -14 LUFS, and neither figure changes with channel count. If you deliver mono, some engineers aim about 3 LU lower (around -19 LUFS) to offset dual-speaker summing, a convention rather than a platform rule. A single -16 LUFS master plays safely across Apple Podcasts, Spotify and YouTube: it matches Apple's figure, sits 2 LU under Spotify's, and the -1 dBTP ceiling keeps peaks from clipping.

Most podcast audio problems come from doing things in the wrong order: processing a take before cutting it, or normalizing before the noise is gone. This guide separates editing from processing and walks the full chain in the order a sound engineer actually uses, with the loudness targets that get an episode accepted everywhere.

Podcast loudness targets

Setting | Value
Master, stereo | -16 LUFS
Master, mono | -19 LUFS
True-peak ceiling | -1 dBTP
Spotify normalization | -14 LUFS
Apple recommendation | -16 LUFS
Background music vs voice | 18 to 20 dB below
Dialogue edit crossfade | a few ms

Source: Spotify (loudness normalization), Apple Podcasts for Creators, Podnews (LUFS and LKFS for podcasters)

Editing vs processing: two jobs, do not mix them

Editing a podcast means cutting and assembling content: removing mistakes, dead silence and hesitations so the episode flows. Cleaning and processing means working the signal: noise, reverb, EQ, dynamics and level. Do the editorial cut first, then the audio processing, so you never spend time treating audio you end up deleting.

The full workflow in 7 steps

A clean podcast edit follows seven steps: import and organize the tracks, edit the content, clean the defects, EQ and compress the voice, balance levels between speakers, normalize to the loudness target, then check and export. Following this order gives a consistent, repeatable result from one episode to the next.

Step 1: import and organize the tracks

Import each speaker on their own track, plus dedicated music and effects tracks. If you recorded a double-ender, align the tracks using the sync clap at the start. A tidy session lets you treat each voice independently, which is the basis of a clean edit.
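As a rough illustration of clap-based alignment, the sketch below finds the first loud transient in each track and trims whichever track has the later clap, so both claps land on the same sample. It assumes samples are floats between -1 and 1 and that the clap is the first sound to reach the threshold. The names `clap_index` and `trim_to_align` and the 0.5 threshold are hypothetical, not taken from any DAW.

```python
def clap_index(samples, threshold=0.5):
    """Index of the first sample at or above the threshold (assumed to be the clap)."""
    for i, s in enumerate(samples):
        if abs(s) >= threshold:
            return i
    raise ValueError("no sample reaches the threshold; lower it or check the track")


def trim_to_align(track_a, track_b, threshold=0.5):
    """Trim the start of whichever track has the later clap so both claps line up."""
    offset = clap_index(track_b, threshold) - clap_index(track_a, threshold)
    if offset > 0:
        # track_b started recording earlier: drop its extra lead-in
        return list(track_a), list(track_b[offset:])
    # track_a started earlier (or both already match when offset is 0)
    return list(track_a[-offset:]), list(track_b)


# Toy example: the guest pressed record three samples before the host
host = [0.0, 0.0, 0.9, 0.1]
guest = [0.0, 0.0, 0.0, 0.0, 0.0, 0.8, 0.1]
host_aligned, guest_aligned = trim_to_align(host, guest)
print(clap_index(host_aligned), clap_index(guest_aligned))
```

In a real session the DAW does this visually on the waveform; the sketch only shows the arithmetic behind the nudge.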

Step 2: the content edit (rough cut)

Cut the big problems first: tangents, repetitions, technical errors and long silences. Reduce or mark the most distracting hesitations without trying to remove every filler word: the goal is clarity, not a robotic perfection that sounds artificial. This is the longest step and often takes several times the running time of the raw audio.

Step 3: clean the voice (noise, reverb, sibilance, plosives)

Cleanup removes what gets in the way of the voice: constant background noise, room reverb, harsh sibilance, P and B plosives, mouth clicks and loud breaths. Treat one issue at a time and stay light: an aggressive de-noise makes the voice sound muffled, like it is underwater.

  • Reduce constant background noise without dulling the voice (gentle, not maximum).
  • Remove room reverb and echo, the number one flaw of home recordings (dialogue de-reverb in iZotope RX).
  • De-ess harsh S sounds, soften plosives, and remove mouth clicks.

Step 4: EQ and compress the voice

Start with a high-pass filter around 80 to 100 Hz to remove useless low end, then add a small presence lift between 2 and 5 kHz for intelligibility. Compression then reduces the gap between loud and quiet passages. The logical order is EQ, then compression, then de-esser, then limiter.
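To make the high-pass stage concrete, here is a minimal sketch of a second-order high-pass filter in plain Python, using the widely published biquad formulas. The 90 Hz cutoff sits inside the 80 to 100 Hz range above. The Q of 0.7071, the 48 kHz sample rate and the function names are assumptions chosen for illustration; a plugin EQ will differ in slope and implementation.

```python
import math


def highpass_coefficients(cutoff_hz, sample_rate, q=0.7071):
    """Second-order high-pass biquad coefficients, normalized so a[0] is 1."""
    w0 = 2 * math.pi * cutoff_hz / sample_rate
    alpha = math.sin(w0) / (2 * q)
    cos_w0 = math.cos(w0)
    a0 = 1 + alpha
    b = [(1 + cos_w0) / 2 / a0, -(1 + cos_w0) / a0, (1 + cos_w0) / 2 / a0]
    a = [1.0, -2 * cos_w0 / a0, (1 - alpha) / a0]
    return b, a


def highpass(samples, cutoff_hz=90.0, sample_rate=48000):
    """Filter a list of float samples, attenuating content below the cutoff."""
    b, a = highpass_coefficients(cutoff_hz, sample_rate)
    x1 = x2 = y1 = y2 = 0.0  # previous inputs and outputs
    out = []
    for x in samples:
        y = b[0] * x + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out
```

A 1 kHz tone passes through this filter almost unchanged, while a 30 Hz rumble comes out heavily attenuated, which is the behaviour the paragraph above describes as removing useless low end.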

Step 5: balance levels between speakers

Before normalizing, match the perceived volume of each voice to the others and to the music. A guest louder or quieter than the host tires the listener. Set relative levels track by track, then check the transitions into jingles and the intro to avoid volume jumps.
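One way to get a starting point for speaker balance is to compare the RMS level of each voice and compute the gain that would match them. The sketch below does that on lists of float samples. RMS is only a rough stand-in for perceived volume, so treat the result as a first guess to refine by ear; the function names are hypothetical.

```python
import math


def rms_db(samples):
    """RMS level of the samples in dB relative to full scale."""
    mean_square = sum(s * s for s in samples) / len(samples)
    return 10 * math.log10(mean_square) if mean_square > 0 else float("-inf")


def matching_gain_db(reference, other):
    """Gain in dB to apply to `other` so its RMS level matches `reference`."""
    return rms_db(reference) - rms_db(other)


def apply_gain(samples, gain_db):
    """Scale samples by a gain expressed in dB."""
    factor = 10 ** (gain_db / 20)
    return [s * factor for s in samples]


# Toy example: the guest was recorded at half the host's amplitude
host = [0.5, -0.5] * 100
guest = [0.25, -0.25] * 100
gain = matching_gain_db(host, guest)
print(round(gain, 2), "dB of gain suggested for the guest track")
```

Measure on passages where each person is actually speaking; long silences in a track drag the RMS figure down and exaggerate the suggested gain.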

Step 6: normalize to the right loudness (LUFS)

Target -16 LUFS integrated, with a true-peak ceiling of -1 dBTP. Apple recommends -16 LUFS and Spotify normalizes toward -14 LUFS, and neither figure changes with channel count. If you deliver mono, some engineers aim about 3 LU lower (around -19 LUFS) to offset dual-speaker summing, a convention rather than a platform rule. A single master at -16 LUFS and -1 dBTP suits every platform without distortion.
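The arithmetic behind this step is simple enough to sketch. Given an integrated loudness and a true-peak reading from a loudness meter, the code below works out the gain needed to reach the target and flags whether the peaks would then exceed the ceiling. It assumes a static gain shifts both readings by the same number of dB; `plan_normalization` and its field names are hypothetical, and the measurements themselves must come from a real meter.

```python
def plan_normalization(measured_lufs, measured_peak_dbtp,
                       target_lufs=-16.0, ceiling_dbtp=-1.0):
    """Return the gain to reach the target and whether limiting is needed."""
    gain_db = target_lufs - measured_lufs
    peak_after_gain = measured_peak_dbtp + gain_db
    overshoot_db = max(0.0, peak_after_gain - ceiling_dbtp)
    return {
        "gain_db": gain_db,
        "peak_after_gain_dbtp": peak_after_gain,
        "needs_limiter": overshoot_db > 0,
        "overshoot_db": overshoot_db,
    }


# A quiet mix: -21.5 LUFS integrated with peaks at -4 dBTP
plan = plan_normalization(-21.5, -4.0)
print(plan)
```

When `needs_limiter` is true, raising the level alone would push peaks past the ceiling, so a limiter (or more compression earlier in the chain) has to absorb the overshoot before the final gain is applied.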

Step 7: quality control and export

Listen to the whole episode on at least two systems: headphones for detail and a phone or consumer speakers for the real listener experience. Check for residual noise, level jumps and hard cuts, then export to WAV or MP3 in the format the platforms expect.

Which software to edit and clean a podcast

To start, Audacity is free and capable. To go further, professional digital audio workstations (DAWs) and dedicated dialogue-repair tools offer adaptive noise reduction, machine-learning de-reverb and precise de-essing. The choice depends on your level and how much time you want to spend on processing, not on the number of filters.

How long does editing a podcast take

Expect roughly three to five hours of editing per finished hour for a careful edit. A thirty-minute episode often means two to three hours of post-production, cleanup and mixing included, and more when there are several guests or heavy cleanup.

AI-assisted vs manual editing: a neutral hybrid

AI tools have become good at the repetitive parts of a podcast edit: detecting filler words, trimming long silences, and a first pass of noise and reverb reduction. They are not a replacement for editorial judgment, which decides what to cut, how a conversation should breathe, and where a pause carries meaning. The sensible approach is hybrid: let software do the mechanical sweep, then let a human make the calls that affect how the episode sounds and feels.

Common industry options for the automated pass include Descript and Cleanvoice for filler-word and silence removal, Adobe Podcast for one-click noise and reverb reduction, and Auphonic for automatic leveling and loudness normalization. These are neutral examples, not endorsements. Hanna processes audio in Pro Tools and iZotope RX, where each correction is applied and checked by ear rather than left entirely to a model.

  • Use AI for the mechanical sweep: filler detection, silence trimming, a first noise and reverb pass.
  • Keep editorial and tonal decisions human: what to cut, where to leave a breath, how the conversation flows.
  • Always listen back: AI can clip into neighbouring words or flatten a delivery if left unchecked.
  • Hanna works in Pro Tools and iZotope RX, with each move verified by ear.

Task | AI-assisted | Manual / by ear
Filler-word and silence removal | Fast first pass | Final judgment on what stays
Noise and reverb reduction | One-click starting point | Targeted, restrained correction
Leveling and loudness | Automatic normalization | Balance between speakers and music
Editorial and tonal calls | Not suitable | Always human

Source: Descript (filler words), Cleanvoice (filler words), Auphonic (leveling and loudness), Adobe Podcast

Music and voice balance: ducking the background bed

When music plays under speech, keep the background bed roughly 18 to 20 dB below the speaking voice so it adds atmosphere without masking consonants. The cleanest way to hold that gap is ducking: the music level automatically dips whenever someone talks and lifts again in the gaps, using volume automation or a sidechain compressor.

Set the gap by ear on real speech, not on the loudest peak, and check it on phone speakers where intelligibility suffers first. Intros and outros, where music sits alone, can be louder than the bed that runs under dialogue.

  • Background music roughly 18 to 20 dB below the speaking voice under dialogue.
  • Use ducking (volume automation or sidechain) so music dips under speech and lifts in the gaps.
  • Judge the balance on phone speakers, where weak intelligibility shows up first.
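The ducking behaviour described above can be sketched as a gain envelope for the music track. The code below takes one speech flag per frame and moves the music gain toward the ducked level in fixed steps, so the bed dips and recovers smoothly instead of jumping. It assumes the unducked music sits at roughly the same level as the voice, so -18 dB of ducking gives the 18 dB gap. The frame flags, the 6 dB step and the function name are illustrative assumptions, not a description of any particular sidechain compressor.

```python
def ducking_gains_db(speech_flags, duck_db=-18.0, step_db=6.0):
    """Music gain per frame: ramps down under speech and back up in the gaps."""
    gains = []
    current = 0.0  # 0 dB means the music plays at its unducked level
    for speaking in speech_flags:
        target = duck_db if speaking else 0.0
        if current > target:
            current = max(target, current - step_db)  # ramp down under speech
        elif current < target:
            current = min(target, current + step_db)  # ramp back up in a gap
        gains.append(current)
    return gains


# One frame of silence, four frames of speech, then silence again
flags = [False, True, True, True, True, False, False, False]
print(ducking_gains_db(flags))
```

A smaller step gives a slower, gentler dip; in a DAW the same trade-off is set with the attack and release times of the sidechain compressor or the slope of the volume automation.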

Crossfades: cutting filler words without choppiness

A hard cut between two pieces of speech can leave a faint click or pop and an abrupt change in room tone. A short crossfade of a few milliseconds at the edit point blends the two sides, removes the click, and hides the seam left by a removed word or pause. This is what lets you cut filler words and still sound natural rather than chopped.

For longer gaps, lay matching room tone (the quiet ambience of the recording space) under the join so the silence between words sounds like the same room, not a digital hole. Use crossfades on every speech edit and keep them short; a long crossfade on dialogue smears consonants and sounds unnatural.

  • Apply a short crossfade (a few milliseconds) at each speech edit to kill clicks and blend the seam.
  • Use matching room tone under longer joins so the gap sounds like the same room.
  • Keep dialogue crossfades short: long ones smear consonants.
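A short crossfade is just a weighted blend of the last few milliseconds of one clip with the first few of the next. The sketch below shows a linear version on lists of float samples. The 5 ms default, the 48 kHz sample rate and the function name are assumptions for illustration; DAWs also offer other fade shapes.

```python
def crossfade(outgoing, incoming, fade_ms=5.0, sample_rate=48000):
    """Join two clips, blending them linearly over a short overlap."""
    n = int(sample_rate * fade_ms / 1000)
    n = min(n, len(outgoing), len(incoming))  # never fade longer than a clip
    if n == 0:
        return list(outgoing) + list(incoming)
    head = list(outgoing[:-n])  # untouched part of the first clip
    tail = list(incoming[n:])   # untouched part of the second clip
    start = len(outgoing) - n
    blended = []
    for i in range(n):
        t = (i + 1) / (n + 1)   # weight rises from just above 0 to just below 1
        blended.append(outgoing[start + i] * (1 - t) + incoming[i] * t)
    return head + blended + tail


# Two one-second clips joined with a 5 ms overlap at 48 kHz
clip_a = [0.2] * 48000
clip_b = [-0.1] * 48000
joined = crossfade(clip_a, clip_b)
print(len(joined))
```

The joined clip is shorter than the two inputs combined by the length of the overlap, which is why very long fades eat into the words on either side of the edit.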

File organization for a clean handoff

A tidy project is faster to edit and far easier to revisit for a correction or a future episode. Name and group everything before you start cutting, and keep the deliverables separate from the working files.

  • Keep the original raw recordings untouched in their own folder; edit on copies.
  • One track per speaker, plus dedicated music and effects tracks, clearly labelled.
  • Name files consistently, for example ShowName_EpNN_Speaker_raw and ShowName_EpNN_master.
  • Keep deliverables (final WAV and MP3) separate from session and working files.
  • Note the loudness target and export settings with the project so future episodes match.
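A small helper keeps names consistent from one episode to the next. The sketch below builds file names following the ShowName_EpNN pattern given above. The two-digit zero padding, the `.wav` default extension and the function names are assumptions; adapt them to your own convention.

```python
def raw_name(show, episode, speaker, ext="wav"):
    """File name for an untouched raw recording of one speaker."""
    return f"{show}_Ep{episode:02d}_{speaker}_raw.{ext}"


def master_name(show, episode, ext="wav"):
    """File name for the finished master of an episode."""
    return f"{show}_Ep{episode:02d}_master.{ext}"


print(raw_name("ShowName", 7, "Host"))
print(master_name("ShowName", 7, ext="mp3"))
```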

Frequently asked questions

What software should I use to edit a podcast?

Audacity is a free, capable starting point. For deeper cleanup, professional DAWs and dedicated dialogue-repair tools (such as iZotope RX) add adaptive noise reduction and de-reverb. The right tool depends on your level and how much processing time you want, not the feature count.

How do I remove background noise from a podcast?

Use a noise-reduction or spectral-repair tool, applied gently. Profile the constant background (hiss, hum, fan) and reduce it a few dB at a time rather than in one aggressive pass, which makes the voice sound muffled and underwater. Clean before you EQ and compress.

How do I remove echo or reverb from a recording?

Use a de-reverb tool such as iZotope RX Dialogue De-reverb. It reduces room reflections but cannot fully remove heavy reverb, so the real fix is recording in a soft, quiet room. De-reverb is a repair for moderate cases, not a substitute for a good capture.

In what order should I apply EQ, compression and normalization?

EQ first (high-pass around 80 to 100 Hz, then a small presence lift), then compression to even out the dynamics, then a de-esser, then a limiter, and finally normalize the whole episode to your loudness target. Cleaning noise and reverb comes before all of this.
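To show what the compression stage in that chain does to the dynamics, here is a minimal sketch of a hard-knee compressor's static gain curve: below the threshold nothing changes, and above it the overshoot is divided by the ratio. The -18 dB threshold and 3:1 ratio are example values chosen for illustration, not recommendations from this guide, and real compressors add attack, release and often a soft knee.

```python
def compressor_gain_db(level_db, threshold_db=-18.0, ratio=3.0):
    """Gain change in dB that a hard-knee compressor applies at a given input level."""
    if level_db <= threshold_db:
        return 0.0  # below the threshold the signal is left alone
    over = level_db - threshold_db
    return -(over - over / ratio)  # keep only 1/ratio of the overshoot


def compressed_level_db(level_db, threshold_db=-18.0, ratio=3.0):
    """Output level in dB after compression."""
    return level_db + compressor_gain_db(level_db, threshold_db, ratio)


# A loud passage at -6 dB and a quiet one at -20 dB
print(compressed_level_db(-6.0), compressed_level_db(-20.0))
```

With these example settings the 14 dB gap between the two passages shrinks to 6 dB, which is the "evening out" the answer above refers to.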

What LUFS should I normalize a podcast to?

Target -16 LUFS integrated, with a true peak of -1 dBTP. Apple recommends -16 LUFS and Spotify normalizes to -14 LUFS, and neither figure changes with channel count. If you deliver mono, some engineers aim about 3 LU lower (around -19 LUFS) to offset dual-speaker summing, a convention rather than a platform rule. A single -16 LUFS master works across Apple Podcasts, Spotify and YouTube, since it matches Apple's figure and sits 2 LU under Spotify's.

How long does it take to edit a one-hour podcast?

Plan for roughly three to five hours of editing per finished hour, and more for heavy cleanup or multiple guests. That is the same as three to five minutes of work per finished minute once you include cutting, noise cleanup, leveling and the final loudness pass.
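For planning, the ratio translates into a quick estimate. The sketch below turns a finished running time into a low and high figure using the three-to-five range above; the function name is hypothetical and the result does not account for extra guests or heavy cleanup.

```python
def editing_time_hours(finished_minutes, low_ratio=3.0, high_ratio=5.0):
    """Estimated editing time in hours as a (low, high) pair."""
    return (finished_minutes * low_ratio / 60,
            finished_minutes * high_ratio / 60)


low, high = editing_time_hours(45)
print(f"A 45-minute episode: about {low} to {high} hours of editing")
```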

How do I cut filler words without making the audio choppy?

Remove the word, then put a short crossfade of a few milliseconds at the edit point. The crossfade blends the two sides, removes the click an abrupt cut can leave, and hides the seam so the speech stays fluid. For longer gaps, lay matching room tone under the join so the silence sounds like the same room. Keep dialogue crossfades short, because a long one smears consonants and sounds unnatural.

How loud should background music be under the voice?

Keep the background music bed roughly 18 to 20 dB below the speaking voice so it adds atmosphere without masking consonants. Use ducking, with volume automation or a sidechain compressor, so the music dips automatically when someone talks and lifts again in the gaps. Set the level by ear on real speech and check it on phone speakers, where intelligibility suffers first.

Should I use AI tools to edit my podcast?

AI is good at the mechanical sweep: detecting filler words, trimming silences, and a first noise and reverb pass. Common options include Descript and Cleanvoice for fillers and silences, Adobe Podcast for noise and reverb, and Auphonic for leveling and loudness. The editorial and tonal calls, what to cut and where to leave a breath, should stay human, and every automated edit needs a listen-back. Hanna works this way in Pro Tools and iZotope RX, applying and checking each correction by ear.

How many days does podcast editing take to turn around?

Turnaround is counted in business days and depends on the episode: its length, the number of speakers, how much cleanup the raw audio needs, and whether music and chapters are involved. A short, clean solo episode turns around faster than a long multi-guest conversation that needs heavy noise and reverb work. Agree the delivery window up front so it fits your publishing schedule.
