Whisper Transcription Guide: Models, Accuracy, and Offline Use
Understand how Whisper works, how tiny through large-v3-turbo differ, how to evaluate multilingual accuracy, and how to run speech-to-text locally on Mac, iPhone, and iPad.

Key takeaways
- Larger models are often more robust, but cost more memory, energy, and time.
- Real recordings from your own languages and environments are more useful than one public benchmark.
- Evaluation should include critical-entity errors, hallucinations, runtime, and memory, not word error rate (WER) alone.
How Whisper turns speech into text
Whisper is OpenAI’s general-purpose speech recognition model family. It converts audio into a log-Mel spectrogram, encodes the acoustic features with a Transformer encoder, and decodes them into text token by token across many languages and recording conditions. It can still fail on music, overlapping speakers, uncommon names, numbers, and long silences, so timestamps and access to the source audio remain essential for consequential work.
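As a rough illustration of the audio front end, the sketch below works out how many spectrogram frames one processing window contains and how many windows a recording needs. The constants (16 kHz audio, a 160-sample hop, 30-second windows) match the reference Whisper implementation; the function names are made up for this example, and the no-overlap assumption is a simplification of how real decoders advance through audio.

```python
import math

# Constants used by the reference Whisper audio front end.
SAMPLE_RATE = 16_000   # Hz; input audio is resampled to this rate
HOP_LENGTH = 160       # samples between spectrogram frames (10 ms)
CHUNK_SECONDS = 30     # the model processes fixed 30-second windows


def frames_per_chunk() -> int:
    """Number of spectrogram frames in one 30-second window."""
    return CHUNK_SECONDS * SAMPLE_RATE // HOP_LENGTH


def chunks_needed(duration_seconds: float) -> int:
    """Windows needed to cover a recording.

    Assumes back-to-back windows with no overlap (a simplification);
    even an empty recording is padded out to one window.
    """
    return max(1, math.ceil(duration_seconds / CHUNK_SECONDS))
```

Under these assumptions one window holds 3000 frames, and a 95-second recording needs 4 windows. Short or quiet segments are padded to the full window length, which is one reason long silences can yield spurious text.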
Choosing tiny, base, small, medium, large, or turbo
Tiny and base suit constrained devices and quick drafts; small offers a useful multilingual balance; medium and large can help with difficult accents and noise; turbo (large-v3-turbo) keeps the large encoder but uses a much shallower decoder for faster local transcription. Parameter count alone does not predict on-device behavior, because quantization, inference software, and Apple Silicon acceleration materially change speed and memory use.
- Quick voice memos: test a small model or turbo first
- Multilingual interviews: compare small, medium, and large
- Older devices: prioritize memory and stability
- Critical records: use a stronger model and human review
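To make the memory trade-off concrete, here is a minimal sketch that estimates the size of the weights alone at a given precision. The parameter counts are the approximate figures published in the openai/whisper README; the helper names and the memory budget are hypothetical. Real model files are somewhat larger (quantization formats store extra scale data), and runtime memory also includes activations and caches, so treat the result as a lower bound.

```python
# Approximate parameter counts in millions, per the openai/whisper README.
PARAMS_MILLIONS = {
    "tiny": 39,
    "base": 74,
    "small": 244,
    "medium": 769,
    "large": 1550,
    "turbo": 809,
}


def weight_megabytes(model: str, bits_per_weight: int) -> float:
    """Rough size of the weights only: parameters x bytes per weight."""
    params = PARAMS_MILLIONS[model] * 1_000_000
    return params * bits_per_weight / 8 / 1_000_000


def models_within(budget_mb: float, bits_per_weight: int) -> list[str]:
    """Models whose weights alone fit a (hypothetical) memory budget."""
    return [
        name
        for name in PARAMS_MILLIONS
        if weight_megabytes(name, bits_per_weight) <= budget_mb
    ]
```

For example, small at 16 bits per weight comes to about 488 MB, while turbo at 4 bits comes to about 405 MB. The same budget can therefore admit a much larger model once it is quantized, which is why the shortlist should be measured on the target device rather than read off a parameter table.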
Measure accuracy in a way that matches the task
Word error rate (WER) is common for space-delimited languages, while character error rate (CER) is often more useful for Chinese and Japanese, which are written without spaces between words. A useful test set includes clean speech, distant meetings, names, numbers, code-switching, and overlapping voices. Track high-impact mistakes separately: changing “not approved” to “approved” matters far more than dropping a filler word.
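The metrics above can be sketched with a standard edit-distance calculation. This is a minimal illustration, not a full scoring tool: it lowercases and splits on whitespace without any further text normalization, and the `missing_critical` helper and its phrase list are hypothetical names introduced here to show how high-impact errors might be tracked apart from the overall rate.

```python
def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance: minimum substitutions, insertions, deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference token
                curr[j - 1] + 1,     # insert a hypothesis token
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, ignoring spaces."""
    ref = list(reference.replace(" ", ""))
    hyp = list(hypothesis.replace(" ", ""))
    if not ref:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref, hyp) / len(ref)


def missing_critical(reference: str, hypothesis: str,
                     phrases: list[str]) -> list[str]:
    """Critical phrases present in the reference but absent from the output."""
    ref_l, hyp_l = reference.lower(), hypothesis.lower()
    return [p for p in phrases
            if p.lower() in ref_l and p.lower() not in hyp_l]
```

With the reference "the request was not approved" and the output "the request was approved", `wer` returns 0.2, a figure that looks minor, while `missing_critical(..., ["not approved"])` flags the reversal of meaning. Reporting both keeps a single dropped word from hiding inside an otherwise good score.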
Run Whisper completely offline
Download the model before disconnecting, then test a new recording with Wi-Fi and cellular data disabled. Macs usually sustain long or batch workloads better, while iPhone and iPad make capture and mobile processing convenient. Local inference avoids a transcription-server upload but still consumes storage, battery, and memory and does not automatically disable backups or analytics.
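When running that offline test, it helps to record how long the device takes relative to the length of the audio. The sketch below measures a real-time factor; `transcribe` is a hypothetical stand-in for whatever local inference call your application or library exposes, not a specific API.

```python
import time
from typing import Callable


def real_time_factor(transcribe: Callable[[str], str],
                     audio_path: str,
                     audio_seconds: float) -> tuple[str, float]:
    """Run one transcription; return (text, processing time / audio length).

    `transcribe` is a placeholder for a local inference function that
    takes a file path and returns the transcript.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    start = time.perf_counter()
    text = transcribe(audio_path)
    elapsed = time.perf_counter() - start
    return text, elapsed / audio_seconds
```

A factor below 1.0 means the device transcribes faster than real time. Repeating the measurement on a long recording, and again after the device has warmed up, gives a better picture of sustained performance than a single short clip.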
Whisper alongside Parakeet, SenseVoice, and Voxtral
Whisper has broad language coverage and a mature ecosystem. Parakeet emphasizes high-throughput recognition for its supported European languages, SenseVoice is compelling for Chinese, Japanese, and Korean, and Voxtral combines transcription with audio understanding. Route by language and task, but benchmark every candidate on the same recordings before setting a default.
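One way to turn a shared benchmark into routing defaults is sketched below. It assumes you have already scored every candidate on the same recordings and collected per-recording error rates by language; the function name and data layout are hypothetical, and the mean error rate is only one possible selection criterion.

```python
from statistics import mean


def pick_defaults(results: dict[str, dict[str, list[float]]]) -> dict[str, str]:
    """Map each language to the candidate with the lowest mean error rate.

    results[language][model] holds error rates measured on the same
    set of recordings for every model in that language.
    """
    defaults = {}
    for language, by_model in results.items():
        defaults[language] = min(by_model, key=lambda m: mean(by_model[m]))
    return defaults
```

In practice the choice would also weigh runtime, memory, and critical-entity errors, so the lowest mean error rate is better treated as a starting shortlist than as a final decision.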
Frequently asked questions
Can Whisper run fully offline?
Yes. Once model weights are stored on the device, inference can run locally. Check separately whether the application uses cloud sync, analytics, or backups.
Which Whisper model is best for multilingual transcription?
Start with small or turbo, then compare medium and large on representative recordings. The best choice depends on languages, terminology, hardware, and acceptable waiting time.
Is the largest Whisper model always the most accurate?
No. Larger models are often more robust, but language, audio conditions, quantization, and decoding settings can change the result.
Can Whisper output be used directly for legal or medical records?
Not without qualified human review. Verify key entities against the source and follow applicable consent, retention, and professional requirements.