Why stripping 'um' from a recording is harder than it sounds

A 30-second voice memo with three "ums" feels like a 10-second edit. Run the obvious pipeline (transcribe the file, find the filler words, cut them) and the result is usually worse than the original: clicks at every splice, a hiss that no longer matches, and some of the ums still there. A new local-first tool called erm, published by developer Doug Calobrisi on his site on May 2, 2026, treats that gap between expectation and output as an engineering problem worth solving in public. The post walks through the three failure modes that make naive filler removal sound broken, and the small set of choices that make a hand-rolled pipeline work.

The naive version looks simple: hand the audio to Whisper with word-level timestamps, regex-match "um", "uh", and "er", and ask ffmpeg to splice the remaining audio back together. Calobrisi reports that this approach reaches around 60% recall on real recordings. Three failure modes, by his account, explain what goes wrong.
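To make the naive approach concrete, here is a minimal sketch of its planning step. It assumes the transcript has already been flattened into hypothetical `(text, start, end)` tuples; the function names and the exact regex are illustrative, not taken from erm. It also shows the weakness directly: a filler that never appears in the transcript never produces a cut.

```python
import re

# Hypothetical word record: (text, start_seconds, end_seconds).
# The pattern is an illustrative guess at "um", "uh", "er" and stretched variants.
FILLER = re.compile(r"^(um+|uh+|er+)$", re.IGNORECASE)


def naive_cuts(words):
    """Return (start, end) spans for every word that is a bare filler."""
    cuts = []
    for text, start, end in words:
        token = text.strip().strip(".,!?;:").lower()
        if FILLER.match(token):
            cuts.append((start, end))
    return cuts


def keep_spans(cuts, duration):
    """Invert the cut list into the spans a butt-join would keep."""
    spans, cursor = [], 0.0
    for start, end in sorted(cuts):
        if start > cursor:
            spans.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < duration:
        spans.append((cursor, duration))
    return spans


words = [("So", 0.0, 0.2), (" um,", 0.3, 0.6), (" the", 0.7, 0.8), (" plan", 0.8, 1.2)]
cuts = naive_cuts(words)      # only fillers the transcript actually contains
spans = keep_spans(cuts, 1.5)  # what would be joined back together
```

Everything downstream of this step inherits its blind spots, which is why the recall figure depends so heavily on what the transcriber chooses to write down.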

First, Whisper was trained on clean prose, so in Calobrisi's testing it often leaves fillers out of the transcript entirely. There is no token to match against, and no cut to plan. Second, hard cuts in a waveform produce steps the ear hears as clicks, even when the timing looks correct on paper. Third, every room has a low background hiss, and that hiss is never identical at two different points in the recording. Butt-joining two snippets gives the listener a tonal shift at every edit, which is more noticeable than the um ever was.

erm's design works around each of those. Detection runs on faster-whisper, the local CTranslate2 port of Whisper, with an anti-cleanup prompt so the model does not silently scrub the very words the pipeline is trying to find. The default model is medium.en for speed; large-v3 is recommended when filler recall matters more than latency. The detector then runs three passes. The first looks for gap-fillers that fall inside Whisper-marked silences longer than 350 milliseconds. The second handles fillers that have been glued onto adjacent words and need to be split on a brief amplitude dip. The third checks over-long words, testing a tail of voiced sound against a held-vowel pitch signature so slow speech is not mistaken for a stretched "um". These are erm's parameters, not universal thresholds, and Calobrisi flags them as the kinds of values a fork can tune.
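As a rough illustration of the first pass only, the sketch below transcribes with faster-whisper and lists the silences between words that exceed the 350-millisecond threshold. The prompt text, the function names, and the tuple layout are assumptions made for this example, not erm's actual code, and deciding whether a gap really contains a filler would still need the audio itself.

```python
GAP_SECONDS = 0.350  # threshold quoted in the article; erm's value, tunable


def transcribe_words(path, model_name="medium.en"):
    """Word-level transcript via faster-whisper (the package must be installed)."""
    from faster_whisper import WhisperModel  # imported lazily so the rest runs without it

    model = WhisperModel(model_name)
    # The prompt wording is a guess at what an "anti-cleanup" prompt could look like.
    segments, _info = model.transcribe(
        path,
        word_timestamps=True,
        initial_prompt="Um, uh, so, er, I mean, like, you know...",
    )
    return [(w.word, w.start, w.end) for seg in segments for w in seg.words]


def gap_candidates(words, min_gap=GAP_SECONDS):
    """Return (start, end) silences between consecutive words longer than min_gap."""
    gaps = []
    for (_, _, prev_end), (_, next_start, _) in zip(words, words[1:]):
        if next_start - prev_end > min_gap:
            gaps.append((prev_end, next_start))
    return gaps
```

Calling `gap_candidates(transcribe_words("input.wav"))` would yield the stretches worth inspecting for an untranscribed "um"; the other two passes need amplitude and pitch analysis that this sketch does not attempt.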

The second half of the problem is the splice. erm slides each cut endpoint up to 60 milliseconds toward a local amplitude minimum, then snaps to the nearest zero-crossing so the stitched waveform is continuous. Adjacent cuts that would leave a sliver shorter than 120 milliseconds are merged. Splices use an ffmpeg crossfade rather than a butt-join, with the fade length scaling with the cut size inside a 50-to-120-millisecond band and capped so it never reaches back into a real word.
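The three splice rules can be sketched in plain Python. This is an illustration under stated assumptions rather than erm's implementation: the 5-millisecond envelope window, the `scale` factor, and the `room_s` parameter (how much non-speech audio is available next to the cut) are all hypothetical, and the crossfade itself is left to ffmpeg.

```python
def local_energy(samples, center, half_window):
    """Mean absolute amplitude around `center`, a crude envelope estimate."""
    lo = max(0, center - half_window)
    hi = min(len(samples), center + half_window + 1)
    return sum(abs(s) for s in samples[lo:hi]) / (hi - lo)


def refine_endpoint(samples, index, rate, max_shift_s=0.060, env_s=0.005):
    """Slide a cut point to the quietest spot nearby, then snap to a zero-crossing."""
    radius = int(max_shift_s * rate)
    half = max(1, int(env_s * rate / 2))
    lo = max(0, index - radius)
    hi = min(len(samples) - 1, index + radius)
    # Step 1: lowest-envelope position within the allowed shift.
    quiet = min(range(lo, hi + 1), key=lambda i: local_energy(samples, i, half))
    # Step 2: nearest sign change, searching outward but staying inside the window.
    for offset in range(hi - lo + 1):
        for i in (quiet - offset, quiet + offset):
            if lo < i <= hi and samples[i - 1] * samples[i] <= 0:
                return i
    return quiet  # no zero-crossing in range; fall back to the quiet point


def merge_slivers(cuts, min_keep_s=0.120):
    """Merge neighboring cuts when the audio kept between them is under min_keep_s."""
    merged = []
    for start, end in sorted(cuts):
        if merged and start - merged[-1][1] < min_keep_s:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def fade_length(cut_s, room_s, lo=0.050, hi=0.120, scale=0.25):
    """Crossfade length: grows with the cut, clamped to 50-120 ms, capped by room_s."""
    return min(max(lo, min(hi, cut_s * scale)), room_s)
```

For example, `merge_slivers([(1.0, 1.3), (1.35, 1.6)])` collapses the two cuts into one because only 50 milliseconds of audio would survive between them, and `fade_length(2.0, 0.08)` returns 0.08 because the cap on available room wins over the band.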

The final piece is room tone. The tool loops a real quiet stretch from the source recording under the entire output at low volume, so the background is identical at every splice and small mismatches are masked. There is also a denoiser knob. ffmpeg's denoiser smooths the very volume and pitch features the detector depends on, so erm exposes four modes (none, pre, post, and a fourth combination) and the choice changes which signal detection actually runs on. Trade-offs are visible to the user rather than buried.
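The room-tone idea can be sketched on raw sample lists. The half-window hop, the `gain` value, and the function names are assumptions for illustration; the article does not say how erm picks its quiet stretch or how loud the bed is.

```python
import math


def quietest_window(samples, window):
    """Return the start index of the lowest-energy window of the given length."""
    if window <= 0 or len(samples) < window:
        raise ValueError("recording is shorter than the requested window")
    best_start, best_energy = 0, float("inf")
    # Hop by half a window: a sketch-level search, not an exhaustive one.
    for start in range(0, len(samples) - window + 1, max(1, window // 2)):
        energy = sum(s * s for s in samples[start:start + window])
        if energy < best_energy:
            best_start, best_energy = start, energy
    return best_start


def room_tone_bed(samples, window, length):
    """Loop the quietest stretch of the source until it covers `length` samples."""
    start = quietest_window(samples, window)
    tone = samples[start:start + window]
    repeats = math.ceil(length / window)
    return (tone * repeats)[:length]


def mix(output, bed, gain=0.5):
    """Lay the looped room tone under the edited output at reduced volume."""
    return [o + gain * b for o, b in zip(output, bed)]
```

Because the same bed runs under every splice, the background on either side of an edit matches by construction. A real implementation would likely also smooth the loop boundary so the bed itself does not click.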

The privacy posture is part of the design. faster-whisper runs locally, the audio never leaves the machine, and the CLI is a single line, uvx erm input.wav, with a cleaned .wav and a JSON cut list as output. For podcasters, interviewers, and anyone who records voice notes and dislikes cloud pipelines, that is the practical appeal: a tool a reader can run, read, and fork.

None of this resolves the deeper problem. Calobrisi is honest that no automated pipeline produces a take that sounds like a human edited it, and the post is structured around limits as much as solutions. The interesting question is not whether erm is the best filler remover. It is one local CLI in a small category. The question is whether the craft of explaining what makes audio editing hard, in the open, becomes a recipe other small tools can copy. The parameters are tunable, and the next iteration is likely to come from someone who wanted a different threshold.

