What Whisper is, and why it's the best dictation engine in 2026
Whisper is OpenAI's open-source automatic speech recognition model, released September 2022, trained on 680,000 hours of multilingual audio scraped from the public web. It's MIT-licensed, multilingual (99 languages), and runs entirely on consumer hardware. The whisper.cpp port by Georgi Gerganov made it fast enough for real-time use on phones and laptops without dedicated GPUs.
Why Whisper beats older speech recognition:
- Training scale: 680,000 hours vs Dragon's hand-curated thousands
- Robustness: Handles accents, background noise, code-switching, mixed languages naturally
- 99 languages with one model — no language packs to install
- No vocabulary training step — recognizes domain terms out of the box
- Open weights = no vendor lock-in, MIT license
What Whisper isn't:
- Not a speaker diarization tool: it transcribes what was said, not who said it (use VAD + downstream models for that)
- Not optimized for real-time streaming out of the box (we patch this)
- Not equally accurate across all 99 languages — best on English, very strong on German/French/Spanish, weaker on low-resource languages
Whisper is the de facto standard for offline speech recognition in 2026. VoicePad's job is to package it correctly for live dictation.
How VoicePad packages Whisper for live dictation
Whisper alone is a transcription engine. Live dictation needs five extra layers that VoicePad provides:
1. Voice Activity Detection (VAD) — Silero VAD model gates the audio. No speech in, no Whisper invocation, no hallucinations on silence. Threshold tuned to 0.008 after field testing.
2. Hallucination filter v2 — Whisper has a known failure mode where silence or near-silence triggers phantom outputs ("Thank you", "Subscribe to my channel", etc.). VoicePad's filter uses a dual-gate architecture: exact-phrase blacklist (387 real examples) + prefix pattern matching. Catches what VAD misses.
3. Custom dictionary — 915-entry curated EN+DE dictionary for proper nouns, technical terms, and disambiguation. Post-processes Whisper output to fix consistent errors (e.g. "ChatGPT" not "chat GPT", "VoicePad" not "voice pad").
4. Text injection layer — Different per OS. Whisper outputs text; getting it into WhatsApp, ChatGPT, Notion, or any text field requires OS-specific work (see stack table below).
5. Activation UI — Float Orb on mobile, hotkey on desktop. Tap, speak, release. The interaction model that makes Whisper feel like dictation instead of transcription.
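The dual-gate filter from item 2 can be sketched roughly as follows. This is a minimal illustration, not VoicePad's actual filter: the phrase set and prefix patterns are hypothetical samples (the real blacklist is described as 387 entries), and the names `PHRASE_BLACKLIST`, `PREFIX_PATTERNS`, `normalize`, and `should_drop` are made up for the example.

```python
import re

# Hypothetical sample entries; the real list is not reproduced here.
PHRASE_BLACKLIST = {
    "thank you",
    "thanks for watching",
    "subscribe to my channel",
}

# Hypothetical prefix patterns for phantom outputs that vary after a fixed start.
PREFIX_PATTERNS = [
    re.compile(r"^(subtitles|captions) by\b"),
    re.compile(r"^thanks for (watching|listening)\b"),
]


def normalize(text: str) -> str:
    """Lowercase, replace punctuation (except apostrophes) with spaces, collapse whitespace."""
    text = re.sub(r"[^\w\s']", " ", text.lower())
    return " ".join(text.split())


def should_drop(text: str) -> bool:
    """Return True when a transcript looks like a silence hallucination."""
    norm = normalize(text)
    if not norm:
        return True  # nothing left after normalization
    if norm in PHRASE_BLACKLIST:
        return True  # gate 1: the whole utterance is a known phantom phrase
    # gate 2: the utterance starts with a known phantom pattern
    return any(pattern.match(norm) for pattern in PREFIX_PATTERNS)
```

Gate 1 matches only when the entire utterance equals a blacklisted phrase, so a real sentence that merely starts with "Thank you" still gets through.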
Without these five layers, Whisper is a great library and a bad dictation app. VoicePad is the integration work.
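As an illustration of the custom dictionary step (item 3), here is a small sketch of dictionary post-processing with a single compiled regex. The three entries are samples, and `DICTIONARY`, `build_pattern`, and `apply_dictionary` are hypothetical names; how VoicePad stores and applies its 915 entries is not specified in this article.

```python
import re

# Sample entries only: spoken-form variant (lowercase) -> canonical spelling.
DICTIONARY = {
    "chat gpt": "ChatGPT",
    "voice pad": "VoicePad",
    "whats app": "WhatsApp",
}


def build_pattern(entries: dict) -> re.Pattern:
    """Compile one case-insensitive regex that matches any variant as whole words."""
    # Longest variants first so a longer phrase wins over a shorter one it contains.
    variants = sorted(entries, key=len, reverse=True)
    alternation = "|".join(re.escape(v) for v in variants)
    return re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)


PATTERN = build_pattern(DICTIONARY)


def apply_dictionary(text: str) -> str:
    """Replace every known variant in the transcript with its canonical spelling."""
    return PATTERN.sub(lambda m: DICTIONARY[m.group(0).lower()], text)
```

For example, `apply_dictionary("I asked chat GPT about voice pad.")` returns `"I asked ChatGPT about VoicePad."`. The word-boundary anchors keep the replacement from firing inside longer words.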
Whisper model sizes: which one for what?
| Model | Parameters | Disk | RAM | Speed (rel.) | Best for |
|---|---|---|---|---|---|
| Tiny | 39M | ~75 MB | ~390 MB | 32× | Phone, low-end laptop |
| Base | 74M | ~140 MB | ~500 MB | 16× | Phone, mid laptop |
| Small (default) | 244M | ~480 MB | ~1 GB | 6× | Most users, balanced |
| Medium (Pro) | 769M | ~1.5 GB | ~2.6 GB | 2× | Pro users, desktop |
| Large-v3 (planned) | 1550M | ~3 GB | ~5.5 GB | 1× | Workstation |
VoicePad defaults to Small because it's the sweet spot: ~96% English / ~94% German accuracy with under-a-second latency on a 5-year-old laptop or mid-range phone. Pro users on desktops get Medium for the extra 2-3 percentage points on edge cases.
Speed numbers are relative to Large on a CPU. On Apple Silicon with CoreML and Apple Neural Engine, Small runs at ~10× real-time. On a desktop GPU (CUDA), Medium runs at ~5× real-time.
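To make the table concrete, the sketch below picks the largest model that fits a memory budget and converts a real-time factor into processing time. The RAM figures come from the table above; the `headroom` share and the names `ModelSpec`, `pick_model`, and `processing_seconds` are assumptions for illustration, not VoicePad's selection logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    name: str
    ram_mb: int          # approximate RAM from the table
    relative_speed: int  # speed relative to Large, from the table


# Ordered from largest to smallest so the first fit is the most accurate one.
MODELS = [
    ModelSpec("large-v3", 5500, 1),
    ModelSpec("medium", 2600, 2),
    ModelSpec("small", 1000, 6),
    ModelSpec("base", 500, 16),
    ModelSpec("tiny", 390, 32),
]


def pick_model(free_ram_mb: int, headroom: float = 0.5) -> str:
    """Return the largest model whose RAM need fits within a share of free memory.

    `headroom` is an assumed safety margin: only that fraction of free RAM
    is offered to the model.
    """
    budget = free_ram_mb * headroom
    for spec in MODELS:
        if spec.ram_mb <= budget:
            return spec.name
    return "tiny"  # fall back to the smallest model


def processing_seconds(audio_seconds: float, realtime_factor: float) -> float:
    """Time needed to transcribe a clip at a given multiple of real time."""
    return audio_seconds / realtime_factor
```

At ~10× real-time, a 30-second clip takes about 3 seconds: `processing_seconds(30, 10)` gives `3.0`.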
Per-platform technical stack
Developers want to see the real wiring. Here's VoicePad's stack on each OS:
| Layer | Windows | macOS | iOS | Android |
|---|---|---|---|---|
| Whisper runtime | whisper.cpp | WhisperKit | WhisperKit | whisper.cpp (JNI) |
| Acceleration | CPU + optional CUDA | Metal + Apple NE | CoreML + Apple NE | CPU (NEON SIMD) |
| VAD | Silero (ONNX) | Silero (CoreML) | Silero (CoreML) | Silero (ONNX NNAPI) |
| Audio capture | WASAPI | AVAudioEngine | AVAudioEngine | AudioRecord |
| Text injection | Win32 SendInput + HWND | Accessibility API | Keyboard extension | AccessibilityService |
| Activation UI | Floating window | Menu bar + hotkey | Keyboard extension | Float Orb (overlay) |
| Codebase | Python + PyInstaller | Swift | Swift | Kotlin (KMP) |
Shared logic between iOS and Android lives in Kotlin Multiplatform (KMP). Native shells on each platform handle the OS-specific bits.
Why this matters: Most "cross-platform" dictation apps are Electron wrappers that ship the same JavaScript everywhere. VoicePad is genuinely native per OS. That's why it works in WhatsApp on Android (which fights generic input methods) and in ChatGPT on iOS (which has its own keyboard quirks).
Whisper dictation alternatives: honest comparison
| Tool | Engine | Platforms | Live dictation | License | Notes |
|---|---|---|---|---|---|
| VoicePad AI | Whisper (cpp + Kit) | Win+Mac+iOS+Android | ✅ | Commercial | This page |
| whisper.cpp | Whisper | All (CLI) | ❌ batch | MIT | Great library, no UX |
| MacWhisper | Whisper | macOS only | ❌ batch | Commercial | File transcription |
| Superwhisper | Whisper | Mac+iOS+Win | ✅ | $249 lifetime | Premium, no Android |
| Wispr Flow | Proprietary (not Whisper) | Mac+Win+iOS+Android | ✅ | $15/mo | Cloud-only |
| OpenWhispr | Whisper | Mac+Win+Linux | ✅ | Open source | Free, less polish |
| Handl | Whisper | Mac+Win+Linux | ✅ | Open source | Free, no mobile |
| VoiceInk | Whisper | macOS only | ✅ | OSS / paid | Mac-focused |
Important: Wispr Flow doesn't run Whisper at runtime. Despite what some marketing suggests, its production pipeline is proprietary cloud inference, so every word you speak is sent to the company's servers. For privacy-sensitive use cases, this matters.
Superwhisper is the closest cross-platform competitor. Excellent product, premium price ($249), expanding to Windows. VoicePad's positioning: same engine quality, scrappy early-stage pricing, and Android coverage (which Superwhisper doesn't have).
Accuracy: what to actually expect
Whisper's accuracy varies by language, audio quality, and model size. Honest numbers from VoicePad's internal testing and public benchmarks:
| Condition | Small | Medium | Large-v3 |
|---|---|---|---|
| Clean English, quiet room | 96-98% | 97-99% | 98-99% |
| Clean German, quiet room | 94-97% | 96-98% | 97-99% |
| English with background noise | 92-95% | 94-97% | 95-98% |
| German with regional accent | 88-93% | 92-96% | 94-97% |
| Mixed code-switching (DE + EN) | 85-90% | 90-94% | 93-96% |
WER (word error rate) benchmark for German on Whisper Small: ~4.2% on clean speech (Uni Mannheim test set), comparable to or better than Dragon NaturallySpeaking's German pack, which costs €500 and runs only on Windows.
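For readers who want to check these figures on their own recordings, WER is the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. The sketch below is a plain-Python version; the lowercase-and-split normalization is a simplifying assumption, and real benchmarks usually apply more careful text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

Word accuracy is roughly `1 - WER`, so a 4.2% WER corresponds to about 95.8% accuracy, which falls inside the 94-97% range the table lists for clean German on Small.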
Limits to know:
- Whisper performs worse on very short utterances (under 1 second). VAD helps but doesn't eliminate this.
- Heavy accents that aren't well-represented in the 680k training hours show 5-10% higher WER.
- Real-time streaming is patched in via chunking, not native. Latency is ~1-2 seconds for Small on a modern phone, ~3-4 seconds on older hardware.
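The chunking approach mentioned in the last point can be sketched as overlapping windows over the sample buffer. This is an illustration only: the 5-second chunk and 1-second overlap are assumed values, `chunk_spans` is a hypothetical helper, and stitching the overlapping transcripts back together is not shown. The 16 kHz default matches the sample rate Whisper expects.

```python
from typing import List, Tuple


def chunk_spans(
    total_samples: int,
    sample_rate: int = 16000,
    chunk_s: float = 5.0,
    overlap_s: float = 1.0,
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of overlapping chunks covering the audio."""
    chunk = int(chunk_s * sample_rate)
    step = chunk - int(overlap_s * sample_rate)
    if step <= 0:
        raise ValueError("overlap must be shorter than the chunk")

    spans = []
    start = 0
    while start < total_samples:
        end = min(start + chunk, total_samples)
        spans.append((start, end))
        if end == total_samples:
            break  # the final chunk reached the end of the audio
        start += step
    return spans
```

Each span would be passed to the model in turn; the overlap gives words cut at a chunk boundary a second chance to be recognized whole. Longer chunks give the model more context but add latency, which is the trade-off behind the figures above.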
Try Whisper dictation on your own setup
Free standard tier (Whisper Small) on all platforms. Founding members get lifetime Pro access (Medium + Large-v3 + WiFi sync) while spots remain.