Free Auto-Generate Captions from Audio or Video — Browser-Based Whisper Guide

An in-depth guide to VideoBuff's auto-caption feature, covering the technical setup, the step-by-step workflow, and how it compares with cloud-based alternatives. OpenAI Whisper Large v3 Turbo runs inside the browser and turns audio into Japanese or English captions without uploading anything to a server. This is useful for meeting transcripts, podcast post-production, interview captioning, and short-form (9:16) social videos where data sensitivity matters. The feature includes Japanese-to-English translated captions, SRT / VTT / TXT export, and a live preview that streams captions as they're decoded. Inference is faster on WebGPU and falls back to WASM where WebGPU is unavailable.

Why in-browser — how it works and what it protects

Most caption services pipe your audio up to the cloud before running an AI model. That's convenient, but if you're working with internal meetings, interviews, or anything sensitive, handing the recording to a third party — encrypted or not — is something you may want to avoid. NDA-bound interview audio, conversations in healthcare or legal contexts, unreleased internal discussions — for many of these, "do not leave the device" is a hard requirement.

VideoBuff's auto-caption feature runs an ONNX build of OpenAI Whisper Large v3 Turbo inside the browser via the Hugging Face Transformers.js runtime. Audio stays in browser memory and is never sent over the network (zero uploads, no external API calls, no server logs). The first run downloads the model weights (~1 GB) into the browser cache (Cache API / IndexedDB); after that, everything works offline. As long as you keep using VideoBuff regularly the cache is retained: the browser only evicts it via LRU under storage pressure or after several months of inactivity. WebGPU-capable browsers (Chrome / Edge) use the GPU for fast inference, with WASM as the fallback elsewhere.
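
The WebGPU-or-WASM choice described above can be sketched as a small helper. This is a minimal illustration, not VideoBuff's actual code: `pickAsrDevice` and `NavigatorLike` are hypothetical names, and the check only tests whether the WebGPU API is exposed.

```typescript
// Hypothetical helper: decide which inference backend to request.
// `NavigatorLike` stands in for the browser's `navigator` object so the
// logic can also be exercised outside a browser.
type AsrDevice = "webgpu" | "wasm";

interface NavigatorLike {
  gpu?: unknown; // present in browsers that expose the WebGPU API
}

function pickAsrDevice(nav: NavigatorLike): AsrDevice {
  // `navigator.gpu` only proves the API is exposed. A fuller check would also
  // await `navigator.gpu.requestAdapter()` and fall back when it resolves to null.
  return nav.gpu !== undefined && nav.gpu !== null ? "webgpu" : "wasm";
}
```

In a browser, the result could be passed as the `device` option when creating a Transformers.js `automatic-speech-recognition` pipeline. The exact model identifier VideoBuff loads is not stated here, so it is left out.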

Under the hood the feature runs Whisper's encoder-decoder Transformer in-browser. Audio is split into 30-second windows that overlap their neighbors by 5 seconds on each side, so speech that straddles a chunk boundary is not dropped from the captions.
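
To make the windowing concrete, here is a rough sketch of how 30-second windows with 5 seconds of overlap on each side could be planned. It is a simplification under stated assumptions: `planChunkWindows` and its field names are hypothetical, and the real pipeline merges overlapping output by matching decoded tokens rather than by cutting at fixed times.

```typescript
interface ChunkWindow {
  startS: number;     // where the window starts in the source audio
  endS: number;       // where the window ends (clamped to the audio length)
  keepStartS: number; // start of the region this window is "responsible" for
  keepEndS: number;   // end of that region
}

function planChunkWindows(totalS: number, chunkS = 30, overlapS = 5): ChunkWindow[] {
  // Each window advances by the chunk length minus the overlap on both sides.
  const stepS = chunkS - 2 * overlapS;
  if (totalS <= 0 || stepS <= 0) return [];

  const windows: ChunkWindow[] = [];
  for (let startS = 0; startS < totalS; startS += stepS) {
    const endS = Math.min(startS + chunkS, totalS);
    const isFirst = startS === 0;
    const isLast = startS + chunkS >= totalS;
    windows.push({
      startS,
      endS,
      // The first window has no left overlap and the last has no right overlap.
      keepStartS: isFirst ? startS : startS + overlapS,
      keepEndS: isLast ? endS : endS - overlapS,
    });
    if (isLast) break;
  }
  return windows;
}
```

For a 70-second clip this yields three windows starting at 0, 20, and 40 seconds, whose kept regions tile the clip without gaps.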

This is a good fit for sensitive sources, organizations with compliance constraints that forbid third-party uploads, or simply anyone who'd rather keep audio on their machine.

The model — Whisper Large v3 Turbo

Auto-captioning runs on a single multilingual model: Whisper Large v3 Turbo. It has 809M parameters and is Whisper Large v3 with the decoder shrunk from 32 to 4 layers, which makes it roughly 6× faster than the original Large while keeping comparable quality. Accuracy is practical for Japanese and English audio, and the model stays stable on long-form interviews and material with proper nouns.

Earlier versions offered two tiers (Whisper Small "fast" and Turbo "high quality"). We collapsed them after concluding that the lightweight tier's benefits (smaller download, lower-spec friendliness) didn't outweigh the cost of users having to run transcription a second time when the lightweight tier's accuracy fell short. One model means one fewer thing to choose: open the modal, click Start, get the best result.

The model file (~1 GB) downloads to the browser cache on first run and starts instantly on subsequent runs. WebGPU-capable browsers parallelize inference on the GPU and finish a 30-second clip in seconds; WASM fallback is several times slower.

For a rough accuracy sense: clean speech with low environmental noise typically lands at 5–10% character error rate (CER) for Japanese and around 5% word error rate (WER) for English. Multiple overlapping speakers, dense jargon, or noisy recordings degrade those numbers — assume some manual editing afterwards.
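
For readers who want to measure these rates on their own material, the following is a minimal sketch of how CER and WER are commonly computed: edit distance between a reference transcript and the recognized text, divided by the reference length. The function names are illustrative, and real evaluations usually normalize punctuation and casing first, which this sketch skips.

```typescript
// Levenshtein distance over arbitrary token arrays (characters or words).
function editDistance<T>(ref: readonly T[], hyp: readonly T[]): number {
  // Single-row dynamic programming over the classic edit-distance table.
  const row: number[] = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    let diag = row[0]; // table[i-1][j-1]
    row[0] = i;
    for (let j = 1; j <= hyp.length; j++) {
      const above = row[j]; // table[i-1][j]
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      row[j] = Math.min(above + 1, row[j - 1] + 1, diag + cost);
      diag = above;
    }
  }
  return row[hyp.length];
}

// Character error rate: suited to Japanese, which has no word spacing.
function charErrorRate(reference: string, hypothesis: string): number {
  const ref = Array.from(reference); // split by code point
  return ref.length === 0 ? 0 : editDistance(ref, Array.from(hypothesis)) / ref.length;
}

// Word error rate: suited to English and other space-delimited languages.
function wordErrorRate(reference: string, hypothesis: string): number {
  const split = (s: string): string[] => s.trim().split(/\s+/).filter((w) => w.length > 0);
  const ref = split(reference);
  return ref.length === 0 ? 0 : editDistance(ref, split(hypothesis)) / ref.length;
}
```

As an example, one wrong word in a four-word English reference gives a WER of 0.25.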

Translation captions (Japanese audio → English captions)

When you choose "Japanese" as the audio language, a "Translate to English" option appears in the subtitle-language selector. This produces English captions from Japanese audio — useful for outward-facing videos or global distribution.

Under the hood we use Whisper's "translate" task. Note that the model only supports translation into English; other directions (English audio → Japanese captions, or Japanese audio → Chinese captions) are not supported by the underlying model.
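
As a rough sketch, the language and translation choices could map onto speech-recognition options like this. The option names (`language`, `task`, `chunk_length_s`, `stride_length_s`, `return_timestamps`) follow the Transformers.js speech-recognition pipeline; `buildAsrOptions` itself and the way VideoBuff wires its selectors to it are assumptions.

```typescript
type AudioLanguage = "japanese" | "english";

interface AsrOptions {
  language: AudioLanguage;
  task: "transcribe" | "translate";
  chunk_length_s: number;
  stride_length_s: number;
  return_timestamps: boolean | "word";
}

// Hypothetical helper: turn the modal's selections into pipeline options.
function buildAsrOptions(audioLanguage: AudioLanguage, translateToEnglish: boolean): AsrOptions {
  // Whisper's translate task always targets English, so it is only
  // meaningful when the spoken language is something other than English.
  const task = translateToEnglish && audioLanguage !== "english" ? "translate" : "transcribe";
  return {
    language: audioLanguage,
    task,
    chunk_length_s: 30,
    stride_length_s: 5,
    return_timestamps: true,
  };
}
```

Note that `language` names the spoken language of the audio in both cases; the `task` alone decides whether the output is a transcript or an English translation.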

The translated captions are placed on the same kind of text track. If you want both Japanese and English captions, run the feature twice and stack the two text tracks.

Editing and downloading

Once generation finishes, the modal shows the resulting caption list. Click a row to edit the text inline, or hit × to remove it. Doing a quick pass here saves a lot of clean-up after the fact.

If you need a subtitle file, download as SRT, VTT, or TXT. SRT is the format YouTube and most players accept, VTT plugs into the HTML5 <track> element, and TXT is a plain transcript.
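
The two timed formats differ mainly in their header, cue numbering, and the millisecond separator. The sketch below shows one way to serialize a caption list to each; the `Caption` shape and function names are assumptions, not VideoBuff's internal types.

```typescript
interface Caption {
  startMs: number;
  endMs: number;
  text: string;
}

// Formats milliseconds as HH:MM:SS<sep>mmm. SRT uses "," and WebVTT uses ".".
function formatTimestamp(ms: number, fractionSeparator: "," | "."): string {
  const total = Math.max(0, Math.round(ms));
  const pad = (n: number, width: number): string => String(n).padStart(width, "0");
  const h = Math.floor(total / 3600000);
  const m = Math.floor((total % 3600000) / 60000);
  const s = Math.floor((total % 60000) / 1000);
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)}${fractionSeparator}${pad(total % 1000, 3)}`;
}

// SRT: numbered cues separated by blank lines.
function toSrt(captions: readonly Caption[]): string {
  return captions
    .map((c, i) => `${i + 1}\n${formatTimestamp(c.startMs, ",")} --> ${formatTimestamp(c.endMs, ",")}\n${c.text}\n`)
    .join("\n");
}

// WebVTT: a "WEBVTT" header line, then cues separated by blank lines.
function toVtt(captions: readonly Caption[]): string {
  const cues = captions.map(
    (c) => `${formatTimestamp(c.startMs, ".")} --> ${formatTimestamp(c.endMs, ".")}\n${c.text}\n`,
  );
  return ["WEBVTT\n", ...cues].join("\n");
}
```

A TXT export would simply join the `text` fields with newlines and drop the timing.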

"Add to timeline" drops the captions onto the timeline as a dedicated text track. Font size and line length are adjusted automatically based on the canvas aspect ratio (16:9 / 9:16, etc.). After placement, each caption behaves like any TextClip — position, color, outline, font are all editable from the Inspector.

Live preview

During inference, finalized captions stream into the modal in real time. You don't have to wait for the full result to gauge whether things are going well — you can spot accuracy issues at the start and decide to rerun with different settings.

This is especially helpful for long-form audio (10 minutes or more), where the perceived wait shrinks dramatically.

Note that the live preview shows raw streaming output from overlapping chunks (30-second windows with 5-second overlap), so sentences may appear fragmented or duplicated mid-stream. When inference finishes, the Whisper tokenizer's _decode_asr step reassembles and deduplicates all chunks into a clean final result, so fragmentation in the live view does not affect the final captions.
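
The deduplication idea can be illustrated with a deliberately simplified sketch: find the longest run of words that ends one chunk and begins the next, and keep it only once. This is not the actual merge logic, which works on decoded tokens and timestamps and tolerates small mismatches; `mergeOverlappingWords` is a hypothetical name and requires exact matches.

```typescript
// Joins two chunks' word lists, dropping the duplicated overlap between them.
function mergeOverlappingWords(previous: readonly string[], next: readonly string[]): string[] {
  const maxOverlap = Math.min(previous.length, next.length);
  // Try the longest candidate overlap first, then shrink until the tail of
  // `previous` matches the head of `next`.
  for (let size = maxOverlap; size > 0; size--) {
    const tail = previous.slice(previous.length - size);
    if (tail.every((word, i) => word === next[i])) {
      return [...previous, ...next.slice(size)];
    }
  }
  // No shared words at the boundary: plain concatenation.
  return [...previous, ...next];
}
```

For instance, merging "we ship on friday" with "on friday at noon" keeps "on friday" once and yields "we ship on friday at noon".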

Karaoke captions — words light up as they're spoken

In the auto-caption modal, toggle "Place as karaoke captions" under Advanced. This uses Whisper's word-timestamp feature to generate captions that highlight each word as it's spoken during playback — the staple "kinetic captions" look from short-form video, available entirely client-side, automatic, and free.

Standard captioning (toggle OFF) places one TextClip per Whisper segment. Karaoke mode instead emits short per-line TextClips with per-word timing data baked in. Both preview and exported video drive the highlight from playhead position, switching the active character to the accent color at the right moment.

Four style presets are available. Pick one at generation time, or change it per clip later from the Inspector:

  • Classic: sung words stay accent-colored — traditional karaoke fill
  • Pop: active char briefly scales up — bouncy animation
  • Glow: active char gets an accent-colored halo around it
  • Bar: a progressive accent-colored underline grows under sung words

Accent color is set at generation time and applied to every clip; per-clip overrides are available in the Inspector under Karaoke color / Karaoke style.

How it works and what to expect — we pass return_timestamps: 'word' to Whisper and let it estimate word boundaries via cross-attention dynamic time warping. Whisper itself doesn't emit sub-word timestamps, so VideoBuff splits each word's duration evenly across its characters as an approximation. Japanese audio works well because Whisper emits morpheme-sized "words" that line up naturally with phrase rhythm.
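
The even split across characters, and the lookup that decides which character is active at a given playhead position, can be sketched as follows. The type and function names are illustrative assumptions; only the "divide the word's duration evenly" rule comes from the description above.

```typescript
interface WordTiming {
  text: string;
  startMs: number;
  endMs: number;
}

interface CharTiming {
  char: string;
  startMs: number;
  endMs: number;
}

// Approximates per-character timing by dividing a word's duration evenly.
function splitWordEvenly(word: WordTiming): CharTiming[] {
  const chars = Array.from(word.text); // code points, so kana and kanji count once each
  if (chars.length === 0) return [];
  const stepMs = (word.endMs - word.startMs) / chars.length;
  return chars.map((char, i) => ({
    char,
    startMs: word.startMs + i * stepMs,
    endMs: word.startMs + (i + 1) * stepMs,
  }));
}

// Returns the index of the character under the playhead, or -1 if none is active.
function activeCharIndex(chars: readonly CharTiming[], playheadMs: number): number {
  return chars.findIndex((c) => playheadMs >= c.startMs && playheadMs < c.endMs);
}
```

Because both the preview and the export can call the same two functions with the same playhead value, a shared formula like this is what keeps the two renderings aligned.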

Accuracy depends on the material — fast speech or heavy background noise can shift word timestamps by ±100–300ms. Mixed speech-and-music tracks are not ideal. When recognition errors or timing drift are noticeable, edit or delete affected lines inside the WordTimingPreview before placing them on the timeline.

Preview ↔ export parity — the preview (DOM / CSS) and the exported video (Canvas 2D) share the same per-character timing formula and accent-color resolution logic, so they look near-identical. The karaoke effect is baked directly into the exported MP4.

Performance tips

The first run downloads the model, which takes some time. Every subsequent run starts instantly from the browser cache. Inference itself is faster on WebGPU-capable browsers (Chrome / Edge).

For long recordings, trimming away unnecessary parts before captioning shortens the actual processing time. If your source has lots of silence or noise — for instance, a meeting recording — running Auto-trim silences first reduces the number of useless segments you'd otherwise have to clean up.

How VideoBuff differs from other auto-caption tools

Auto-captioning isn't rare anymore — CapCut, Vrew, Adobe Premiere Pro Speech to Text, Descript, and many others all offer it. VideoBuff occupies a particular intersection.

Stays in the browser

CapCut, Vrew, and Descript send the audio up to their servers. Adobe Premiere Pro's Speech to Text is local but requires a paid Creative Cloud subscription and a desktop install. VideoBuff needs neither: no install, no account, just a URL.

Integrated with the editor

Stand-alone web-based Whisper demos like whisper-web are caption-generation tools only; you export SRT and bring it into a separate editor. VideoBuff merges the two: generated captions appear directly as TextClips on the timeline, ready for font, color, position, and transition tweaks in the Inspector.

Free, with no per-clip or per-month cap

Most cloud providers offer a small free tier (e.g. 10 minutes) and meter beyond that into a $10–30/month plan. Adobe Premiere Pro requires Creative Cloud (~$22/month). VideoBuff is free with no usage cap; the cost is entirely whatever compute your browser provides.

Privacy as a verifiable fact

Cloud services may state in their policies that "we don't train on your data" or "we delete after N days", but transfer and storage themselves can't be reduced to zero technically. VideoBuff's "no network upload" is observable in the DevTools Network tab; it's a property of how it works, not a promise. That matters for NDA-bound material, healthcare, legal contexts, and internal meetings.

Tradeoffs

Running locally means you don't get the absolute peak accuracy of best-in-class commercial ASR (Google Speech-to-Text's latest_long, AssemblyAI Universal-2, etc.). High-end features like custom proper-noun dictionaries, speaker diarization, and sentiment tagging aren't here either. If you need maximum accuracy and also need to keep audio off third-party servers, the next step is a self-hosted Whisper deployment on your own GPU, which is outside the scope of an in-browser tool.

FAQ

Q. Can I use the captions commercially? A. Yes. There's no per-caption fee or usage cap. VideoBuff itself uses MIT-licensed dependencies and MIT-licensed Whisper weights, so nothing in the licensing chain prevents you from shipping the captioned output as a commercial deliverable. See the "Runtime ML Models" section of the OSS Licenses page for specifics.

Q. How long an audio clip can I caption? A. There's no hard cap, but in practice it's limited by available browser memory and processing time. Audio under 30 minutes runs fine in most environments. For more than an hour, trim unnecessary sections first via Auto-trim silences to keep the input manageable.
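
For a back-of-the-envelope sense of the memory side, the sketch below estimates the size of the decoded audio and the number of 30-second windows for a given duration. It assumes the audio is held as 16 kHz mono Float32 samples (the input format Whisper expects); how VideoBuff actually buffers audio is not documented here, and the estimate excludes the model weights and runtime buffers.

```typescript
const WHISPER_SAMPLE_RATE_HZ = 16000; // Whisper models take 16 kHz mono input
const BYTES_PER_FLOAT32_SAMPLE = 4;

// Size of the decoded audio alone, under the Float32 assumption above.
function estimatePcmBytes(durationS: number): number {
  return Math.ceil(durationS * WHISPER_SAMPLE_RATE_HZ) * BYTES_PER_FLOAT32_SAMPLE;
}

// Number of windows when 30 s chunks overlap by 5 s on each side (20 s step).
function estimateChunkCount(durationS: number, chunkS = 30, overlapS = 5): number {
  if (durationS <= 0) return 0;
  const stepS = chunkS - 2 * overlapS;
  return durationS <= chunkS ? 1 : 1 + Math.ceil((durationS - chunkS) / stepS);
}
```

Under these assumptions a 30-minute recording is about 115 MB of samples split across 90 windows, which is why processing time, rather than the audio buffer itself, tends to be the practical limit.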

Q. Does it work in browsers without WebGPU? A. Yes, via the WASM (WebAssembly) fallback. It's several times slower than the WebGPU path but produces the same output. Safari is rolling out WebGPU support gradually; for guaranteed GPU inference today, prefer Chrome or Edge.

Q. What translation directions are supported? A. Whisper supports translation into English only — any source language to English (e.g. Japanese audio → English captions). The reverse (English → Japanese) or non-English target languages aren't supported. For multi-language captioning, generate Japanese or English captions first, then translate them externally with DeepL, Claude, or similar.

Q. Where is the audio stored? A. Only in browser memory, and discarded once captioning finishes. It's never sent to VideoBuff servers, Hugging Face, or any other external service. The model weights, on the other hand, are downloaded once from Hugging Face and stored in the browser's Cache API / IndexedDB (same behavior as a normal cached web resource).

Q. Can I edit captions on a mobile browser? A. Editing text and downloading SRT/VTT/TXT works on mobile, but given the model download size (~1 GB) and inference cost, a desktop browser is recommended. Mobile devices may struggle with the initial download and the battery hit.

Q. Does this work for vertical (9:16) short-form video? A. Yes. Font size and max characters per line are auto-adjusted based on the project's canvas aspect ratio. On a 9:16 canvas the splitter shortens each caption line so it stays inside the frame.
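
A naive version of such a splitter might look like the sketch below. Every number in it is an illustrative assumption rather than VideoBuff's actual setting, and a production splitter would prefer to break at word or phrase boundaries instead of at a fixed character count.

```typescript
// Hypothetical per-line character budget by canvas shape (values are made up).
function maxCharsPerLine(canvasWidth: number, canvasHeight: number): number {
  const aspect = canvasWidth / canvasHeight;
  if (aspect < 1) return 12; // portrait, e.g. 9:16
  if (aspect < 1.5) return 18; // square-ish, e.g. 1:1 or 4:3
  return 28; // landscape, e.g. 16:9
}

// Fixed-width split by character count.
function wrapCaption(text: string, maxChars: number): string[] {
  const width = Math.max(1, Math.floor(maxChars)); // guard against zero or negative widths
  const chars = Array.from(text); // code points, so CJK text is counted per character
  const lines: string[] = [];
  for (let i = 0; i < chars.length; i += width) {
    lines.push(chars.slice(i, i + width).join(""));
  }
  return lines;
}
```

With these example values, the same caption would be cut into lines of at most 12 characters on a 1080×1920 canvas and 28 characters on a 1920×1080 canvas.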

Q. How do I improve caption accuracy? A. Two things help: (1) record with the mic close to the speaker and minimal background noise; (2) for existing recordings, run noise reduction and loudness normalization first — either inside VideoBuff via audio mixing or in our sister tool AudioBuff — before captioning. VideoBuff uses a single high-accuracy model (Whisper Large v3 Turbo) for everyone, so there's no model tier to switch.

When the auto-caption button is missing (advanced)

Captions are conceptually tied to audio, so the Inspector's "Auto Caption" section only appears when the selected clip has usable audio attached.

Audio clip selected

Always shown. Generated captions go onto the timeline at the audio clip's startMs.

Video clip selected (with a linked audio clip on the timeline)

The default state right after importing a video. The linked audio is used as the source and captions are placed at the video clip's startMs. This is the standard case and needs no extra steps.

Video clip selected (unlinked, or the linked audio has been deleted)

The auto-caption section is hidden in this state. When the video and audio aren't bound together, it's ambiguous which timeline position the captions should align with. To caption such material, select the audio clip instead; the captions then land at the audio's position and stay in sync with the sound. If you want the captions on the video side, either re-link via the Inspector's link button, or move the audio back beside the video and link them again.

Audio-only files (m4a, mp3, etc.)

Audio clips imported on their own work the same way, which is useful for meeting recordings, podcast post-production, and any audio-only source.
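
The cases above boil down to a small decision rule, sketched here under assumptions: `TimelineClip`, its `linkedAudio` field, and both function names are hypothetical stand-ins for whatever VideoBuff uses internally. Only `startMs` and the three outcomes come from the description.

```typescript
interface TimelineClip {
  kind: "audio" | "video";
  startMs: number;
  linkedAudio?: TimelineClip; // hypothetical: the audio clip linked to a video clip, if any
}

interface CaptionTarget {
  source: TimelineClip; // clip whose audio is transcribed
  offsetMs: number;     // timeline position where captions start
}

interface TimedCaption {
  startMs: number;
  endMs: number;
  text: string;
}

// Returns null when the Auto Caption section should be hidden.
function resolveCaptionTarget(selected: TimelineClip): CaptionTarget | null {
  if (selected.kind === "audio") {
    return { source: selected, offsetMs: selected.startMs };
  }
  if (selected.linkedAudio) {
    // Linked video: transcribe the linked audio, align to the video clip.
    return { source: selected.linkedAudio, offsetMs: selected.startMs };
  }
  return null; // unlinked video: no unambiguous position to align with
}

// Shifts clip-relative caption times to absolute timeline positions.
function placeOnTimeline(captions: readonly TimedCaption[], offsetMs: number): TimedCaption[] {
  return captions.map((c) => ({ ...c, startMs: c.startMs + offsetMs, endMs: c.endMs + offsetMs }));
}
```

Returning `null` for the unlinked case mirrors the hidden section: the caller has nothing to render until the user selects the audio clip or re-links the pair.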

For heavier audio work — noise reduction, loudness normalization, EQ, etc. — before captioning, our sister service AudioBuff handles those tasks; clean the audio there and import the result into VideoBuff.

Try it now

No download, no account. Open your browser and start editing right away.
