Free Auto-Generate Captions from Audio or Video — Browser-Based Whisper Guide
A deep guide to VideoBuff's auto-caption feature: the technical setup, the step-by-step workflow, and how it compares with cloud-based alternatives. OpenAI Whisper derivatives (Whisper Small / Whisper Large v3 Turbo) run inside the browser and turn audio into Japanese or English captions without uploading anything to a server. That makes it useful wherever data sensitivity matters: meeting transcripts, podcast post-production, interview captioning, and short-form (9:16) social video. Includes Japanese-to-English translated captions, SRT / VTT / TXT export, and a live preview that streams captions as they're decoded. Faster on WebGPU, with a WASM fallback where it's unavailable.
Why in-browser — how it works and what it protects
Most caption services pipe your audio up to the cloud before running an AI model. That's convenient, but if you're working with internal meetings, interviews, or anything sensitive, handing the recording to a third party — encrypted or not — is something you may want to avoid. NDA-bound interview audio, conversations in healthcare or legal contexts, unreleased internal discussions — for many of these, "do not leave the device" is a hard requirement.
VideoBuff's auto-caption feature runs ONNX builds of OpenAI Whisper inside the browser via the Hugging Face Transformers.js runtime. Audio stays in browser memory and never touches the network: zero uploads, no external API calls, no server logs. The first run pulls the model weights (about 600 MB to 1 GB depending on the tier) into the browser cache (Cache API / IndexedDB); after that, everything works offline. As long as you keep using VideoBuff regularly, the cache is retained; the browser evicts it only under storage pressure (LRU) or after several months of inactivity. WebGPU-capable browsers (Chrome / Edge) use the GPU for fast inference, with WASM as the fallback elsewhere.
Under the hood the feature runs Whisper's encoder-decoder Transformer in-browser. Audio is chunked into 30-second windows with a 5-second overlap (stride) on each side, so words at chunk boundaries aren't dropped and caption timestamps stay continuous.
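For a concrete sense of the flow, here's a minimal Transformers.js sketch under stated assumptions: the checkpoint ID, the decoded audioFloat32 input, and the options shown are illustrative, not VideoBuff's internal code.

```ts
import { pipeline } from '@huggingface/transformers';

// Prefer WebGPU when an adapter is available; otherwise fall back to WASM.
const adapter = await navigator.gpu?.requestAdapter();
const device = adapter ? 'webgpu' : 'wasm';

// Illustrative checkpoint; not necessarily the exact build VideoBuff ships.
const transcriber = await pipeline(
  'automatic-speech-recognition',
  'onnx-community/whisper-small',
  { device },
);

// audioFloat32: decoded mono 16 kHz samples from the selected clip.
const result = await transcriber(audioFloat32, {
  chunk_length_s: 30,   // 30-second windows, as described above
  stride_length_s: 5,   // 5-second overlap at each boundary
  language: 'japanese',
  task: 'transcribe',
  return_timestamps: true,
});
// result.chunks: [{ timestamp: [startSec, endSec], text }, ...]
```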
This is a good fit for sensitive sources, organizations with compliance constraints that forbid third-party uploads, or simply anyone who'd rather keep audio on their machine.
Fast vs High Quality
The modal offers two tiers.
Fast (~600 MB) A multilingual model built on Whisper Small (244M parameters). Good enough for casual conversation, short-form videos, and draft transcripts such as meeting minutes. Pick this when you want a smaller download or value speed over peak accuracy.
High quality (~1 GB) A multilingual model built on Whisper Large v3 Turbo (809M parameters; Whisper Large v3 with the decoder shrunk from 32 layers to 4, roughly 6× faster than the original Large). Use this for interviews packed with technical jargon, recordings with heavy proper nouns, or any caption that's going to ship. It also stays steadier on long-form material.
Both models cache in the browser, so the second run starts instantly. WebGPU-capable browsers parallelize inference on the GPU and finish a 30-second clip in seconds; WASM fallback is several times slower.
If you pick "Translate to English", VideoBuff automatically switches to the high-quality model even if you'd selected Fast. Whisper's translate task needs a multilingual checkpoint, and of the two we ship, only Turbo delivers usable translation quality.
For a rough accuracy sense: clean speech with low environmental noise typically lands at 5–10% character error rate (CER) for Japanese and around 5% word error rate (WER) for English on the high-quality model. Multiple overlapping speakers, dense jargon, or noisy recordings degrade those numbers — assume some manual editing afterwards.
Translation captions (Japanese audio → English captions)
When you choose "Japanese" as the audio language, a "Translate to English" option appears in the subtitle-language selector. This produces English captions from Japanese audio — useful for outward-facing videos or global distribution.
Under the hood we use Whisper's "translate" task. Note the model only supports translation TO English; English audio → Japanese captions, or any non-English target (e.g. Japanese audio → Chinese captions), is not supported by the underlying model.
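The translate path differs only in the checkpoint tier and the task flag. A sketch continuing the one above, with the same caveats about model IDs:

```ts
// VideoBuff forces the high-quality tier here, since translation quality
// on the small checkpoint isn't usable.
const translator = await pipeline(
  'automatic-speech-recognition',
  'onnx-community/whisper-large-v3-turbo', // illustrative checkpoint
  { device: 'webgpu' },
);

const english = await translator(audioFloat32, {
  language: 'japanese', // source language of the audio
  task: 'translate',    // Whisper's translate task always targets English
  chunk_length_s: 30,
  stride_length_s: 5,
  return_timestamps: true,
});
```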
The translated captions go onto a text track just like transcription captions. If you want both Japanese and English captions, run the feature twice and stack the two text tracks.
Editing and downloading
Once generation finishes, the modal shows the resulting caption list. Click a row to edit the text inline, or hit × to remove it. Doing a quick pass here saves a lot of clean-up after the fact.
If you need a subtitle file, download as SRT, VTT, or TXT. SRT is the format YouTube and most players accept, VTT plugs into the HTML5 <track> element, and TXT is a plain transcript.
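SRT is simple enough to serialize by hand. A minimal sketch, assuming a Cue shape rather than VideoBuff's internal caption type:

```ts
type Cue = { startMs: number; endMs: number; text: string };

// SRT: a 1-based index, an HH:MM:SS,mmm time range, the caption text,
// and a blank line between entries.
function toSrt(cues: Cue[]): string {
  const pad = (n: number, w = 2) => String(n).padStart(w, '0');
  const ts = (ms: number) =>
    `${pad(Math.floor(ms / 3_600_000))}:${pad(Math.floor(ms / 60_000) % 60)}:` +
    `${pad(Math.floor(ms / 1_000) % 60)},${pad(ms % 1_000, 3)}`;
  return cues
    .map((c, i) => `${i + 1}\n${ts(c.startMs)} --> ${ts(c.endMs)}\n${c.text}\n`)
    .join('\n');
}
```

VTT differs mainly in using a dot instead of a comma in timestamps and a leading "WEBVTT" header.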
"Add to timeline" drops the captions onto the timeline as a dedicated text track. Font size and line length are adjusted automatically based on the canvas aspect ratio (16:9 / 9:16, etc.). After placement, each caption behaves like any TextClip — position, color, outline, font are all editable from the Inspector.
Live preview
During inference, finalized captions stream into the modal in real time. You don't have to wait for the full result to gauge whether things are going well — you can spot accuracy issues at the start and decide to rerun with different settings.
This is especially helpful for long-form audio (10 minutes or more), where the perceived wait shrinks dramatically.
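One way this kind of streaming can be wired up is Transformers.js's WhisperTextStreamer helper; whether VideoBuff uses exactly this mechanism is an assumption, and appendToPreview is a placeholder:

```ts
import { pipeline, WhisperTextStreamer } from '@huggingface/transformers';

const transcriber = await pipeline(
  'automatic-speech-recognition',
  'onnx-community/whisper-small', // illustrative checkpoint
);

// Stand-in for whatever appends a finalized fragment to the modal's list.
const appendToPreview = (fragment: string) => console.log(fragment);

// The streamer fires callback_function as the decoder finalizes text,
// so the UI fills in while inference is still running.
const streamer = new WhisperTextStreamer(transcriber.tokenizer, {
  callback_function: appendToPreview,
});

// audioFloat32 as in the earlier sketches.
await transcriber(audioFloat32, {
  chunk_length_s: 30,
  stride_length_s: 5,
  return_timestamps: true,
  streamer,
});
```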
Performance tips
The first run downloads the model, which takes some time. Every subsequent run starts instantly from the browser cache. Inference itself is faster on WebGPU-capable browsers (Chrome / Edge).
For long recordings, trimming away unnecessary parts before captioning shortens the actual processing time. If your source has lots of silence or noise — for instance, a meeting recording — running Auto-trim silences first reduces the number of useless segments you'd otherwise have to clean up.
How VideoBuff differs from other auto-caption tools
Auto-captioning isn't rare anymore — CapCut, Vrew, Adobe Premiere Pro Speech to Text, Descript, and many others all offer it. VideoBuff occupies a particular intersection.
Stays in the browser CapCut, Vrew, and Descript send the audio up to their servers. Adobe Premiere Pro's Speech to Text is local but requires a paid Creative Cloud subscription and a desktop install. VideoBuff needs neither — no install, no account, just a URL.
Integrated with the editor Stand-alone web-based Whisper demos like whisper-web are caption-generation tools only; you export SRT and bring it into a separate editor. VideoBuff merges the two — generated captions appear directly as TextClips on the timeline, ready for font, color, position, and transition tweaks in the Inspector.
Free, with no per-clip or per-month cap Most cloud providers offer a small free tier (e.g. 10 minutes) and meter usage beyond that into a $10–30/month plan. Adobe Premiere Pro requires Creative Cloud (~$22/month). VideoBuff is free with no usage cap; the only cost is your own machine's compute.
Privacy as a verifiable fact Cloud services may state in their policies that "we don't train on your data" or "we delete after N days", but by design the audio still has to be transferred and stored; that exposure can't be engineered down to zero. VideoBuff's "no network upload" is observable in the DevTools Network tab: it's a property of how the tool works, not a promise. That matters for NDA-bound material, healthcare, legal contexts, and internal meetings.
Tradeoffs Running locally means you don't get the absolute peak accuracy of best-in-class commercial ASR (Google Speech-to-Text's latest_long, AssemblyAI Universal-2, etc.). High-end features like custom proper-noun dictionaries, speaker diarization, and sentiment tagging aren't here either. If you need maximum accuracy AND keep audio off third-party servers, the next step is a self-hosted Whisper deployment on your own GPU — outside the scope of an in-browser tool.
FAQ
Q. Can I use the captions commercially? A. Yes. There's no per-caption fee or usage cap. VideoBuff itself uses MIT-licensed dependencies and MIT-licensed Whisper weights, so nothing in the licensing chain prevents you from shipping the captioned output as a commercial deliverable. See the "Runtime ML Models" section of the OSS Licenses page for specifics.
Q. How long an audio clip can I caption? A. There's no hard cap, but in practice it's limited by available browser memory and processing time. Audio under 30 minutes runs fine in most environments. For more than an hour, trim unnecessary sections first via Auto-trim silences to keep the input manageable.
Q. Does it work in browsers without WebGPU? A. Yes, via the WASM (WebAssembly) fallback. It's several times slower than the WebGPU path but produces the same output. Safari is rolling out WebGPU support gradually; for guaranteed GPU inference today, prefer Chrome or Edge.
Q. What translation directions are supported? A. Whisper supports translation into English only — any source language to English (e.g. Japanese audio → English captions). The reverse (English → Japanese) or non-English target languages aren't supported. For multi-language captioning, generate Japanese or English captions first, then translate them externally with DeepL, Claude, or similar.
Q. Where is the audio stored? A. Only in browser memory, and discarded once captioning finishes. It's never sent to VideoBuff servers, Hugging Face, or any other external service. The model weights, on the other hand, are downloaded once from Hugging Face and stored in the browser's Cache API / IndexedDB (same behavior as a normal cached web resource).
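If you want to see how much of your storage quota that cache occupies, the standard Storage API gives a rough number:

```ts
// Standard Storage API; cached model weights count toward this origin's usage.
const { usage = 0, quota = 0 } = await navigator.storage.estimate();
console.log(`~${(usage / 1e6).toFixed(0)} MB used of ~${(quota / 1e9).toFixed(1)} GB quota`);
```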
Q. Can I edit captions on a mobile browser? A. Editing text and downloading SRT/VTT/TXT works on mobile, but given the model download size (600 MB to 1 GB) and inference cost, a desktop browser is recommended. Mobile devices may struggle with the initial download and the battery hit.
Q. Does this work for vertical (9:16) short-form video? A. Yes. Font size and max characters per line are auto-adjusted based on the project's canvas aspect ratio. On a 9:16 canvas the splitter shortens each caption line so it stays inside the frame.
Q. How do I improve caption accuracy? A. Three things help: (1) pick the High Quality model (Whisper Large v3 Turbo); (2) record with the mic close to the speaker and minimal background noise; (3) for existing recordings, run noise reduction and loudness normalization first — either inside VideoBuff via audio mixing or in our sister tool AudioBuff — before captioning.
When the auto-caption button is missing (advanced)
Captions are conceptually tied to audio, so the Inspector's "Auto Caption" section only appears when the selected clip has usable audio attached.
Audio clip selected Always shown. Generated captions go onto the timeline at the audio clip's startMs.
Video clip selected (with a linked audio clip on the timeline) The default state right after importing a video. The linked audio is used as the source and captions are placed at the video clip's startMs. This is the standard case and needs no thought.
Video clip selected (unlinked, or the linked audio has been deleted) The auto-caption section is hidden in this state. When the video and audio aren't bound together, it's ambiguous which timeline position the captions should align with (the selection rules are sketched after this list). To caption such material, select the audio clip instead; the captions then land at the audio's position, staying naturally in sync with the sound. If you want the captions tied to the video side, re-link via the Inspector's link button, or move the audio back beside the video and link them again.
Audio-only files (m4a, mp3, etc.) Audio clips imported on their own work the same way — useful for meeting recordings, podcast post-production, and any audio-only source.
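Summing up the four states, a hypothetical sketch of the selection logic; type and field names are illustrative, not VideoBuff's actual internals:

```ts
type Clip = { id: string; kind: 'audio' | 'video'; startMs: number; linkedAudioId?: string };

function autoCaptionTarget(
  selected: Clip,
  allClips: Clip[],
): { source: Clip; placeAtMs: number } | null {
  if (selected.kind === 'audio') {
    // Audio clip selected: always available; captions go at its startMs.
    return { source: selected, placeAtMs: selected.startMs };
  }
  const linked = allClips.find((c) => c.id === selected.linkedAudioId);
  if (!linked) return null; // unlinked video: the section is hidden
  // Linked video: transcribe the linked audio, place at the video's startMs.
  return { source: linked, placeAtMs: selected.startMs };
}
```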
For heavier audio work before captioning (noise reduction, loudness normalization, EQ, etc.), our sister service AudioBuff handles those tasks: clean the audio there, then import the result into VideoBuff.