Whisper Dictation — On-Device Speech Recognition for Windows, Mac, iOS, Android

What Whisper is, and why it's the best dictation engine in 2026

Whisper is OpenAI's open-source automatic speech recognition model, released September 2022, trained on 680,000 hours of multilingual audio scraped from the public web. It's MIT-licensed, multilingual, and runs entirely on consumer hardware. The whisper.cpp port by Georgi Gerganov made it fast enough for real-time use on phones and laptops without dedicated GPUs.

Why Whisper beats older speech recognition:

Training scale: 680,000 hours of audio, versus the hand-curated thousands of hours behind engines like Dragon
Robustness: handles accents, background noise, and code-switching between languages
Multilingual: one model, no language packs to install
No vocabulary training step: recognizes many domain terms out of the box
Open weights: MIT license, no vendor lock-in

What Whisper isn't:

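Not a speaker diarization tool: it does not label who is speaking (pair VAD with a downstream diarization model)
Not optimized for real-time streaming out of the box (VoicePad works around this with chunking)
Not equally accurate in every language: accuracy is best on English and German; other languages have higher error rates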

Whisper is the de facto standard for offline speech recognition in 2026. VoicePad's job is to package it correctly for live dictation.

How VoicePad packages Whisper for live dictation

Whisper alone is a transcription engine. Live dictation needs five extra layers that VoicePad provides:

1. Voice Activity Detection (VAD) — Silero VAD model gates the audio. No speech in, no Whisper invocation, no hallucinations on silence. Threshold tuned to 0.008 after field testing.

2. Hallucination filter v2 — Whisper has a known failure mode where silence or near-silence triggers phantom outputs ("Thank you", "Subscribe to my channel", etc.). VoicePad's filter uses a dual-gate architecture: exact-phrase blacklist (387 real examples) + prefix pattern matching. Catches what VAD misses.

3. Custom dictionary — 915-entry curated EN+DE dictionary for proper nouns, technical terms, and disambiguation. Post-processes Whisper output to fix consistent errors (e.g. "ChatGPT" not "chat GPT", "VoicePad" not "voice pad").

4. Text injection layer — Different per OS. Whisper outputs text; getting it into WhatsApp, ChatGPT, Notion, or any text field requires OS-specific work (see stack table below).

5. Activation UI — Float Orb on mobile, hotkey on desktop. Tap, speak, release. The interaction model that makes Whisper feel like dictation instead of transcription.
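
The dual-gate idea in layer 2 can be illustrated with a short sketch. This is a minimal example under stated assumptions, not VoicePad's actual filter: the phrases, the patterns, and the names `EXACT_BLACKLIST`, `PREFIX_PATTERNS`, and `is_hallucination` are all hypothetical stand-ins for the real 387-entry list.

```python
import re

# Hypothetical sample entries; the real blacklist is not reproduced here.
EXACT_BLACKLIST = {
    "thank you",
    "thanks for watching",
    "subscribe to my channel",
}

# Hypothetical prefix patterns for phantom outputs whose tail varies.
PREFIX_PATTERNS = [
    re.compile(r"^subtitles by\b"),
    re.compile(r"^thanks for watching\b"),
]


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def is_hallucination(transcript: str) -> bool:
    """Return True if the transcript should be discarded."""
    norm = normalize(transcript)
    if not norm:
        return True  # empty output carries no dictation content
    if norm in EXACT_BLACKLIST:
        return True  # gate 1: the whole utterance is a known phantom phrase
    # gate 2: the utterance starts with a known phantom prefix
    return any(p.match(norm) for p in PREFIX_PATTERNS)
```

Matching the whole utterance in gate 1, rather than searching for a substring, keeps a sentence like "Thank you for the report" intact. The trade-off is that a user who dictates only "Thank you" would be filtered too, which is one reason a filter like this works best behind a VAD gate rather than instead of one.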

Without these five layers, Whisper is a great library and a bad dictation app. VoicePad is the integration work.
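
As an illustration of layer 3, dictionary post-processing can be as simple as a case-insensitive phrase replacement. The sketch below is an assumption about how such a step could look, not VoicePad's code; the two entries come from the examples above, and `DICTIONARY`, `build_pattern`, and `apply_dictionary` are hypothetical names.

```python
import re

# Two sample entries taken from the examples in the text; the real
# dictionary has 915 curated EN+DE entries.
DICTIONARY = {
    "chat gpt": "ChatGPT",
    "voice pad": "VoicePad",
}


def build_pattern(entries):
    """Compile one regex that matches any dictionary key as a whole phrase."""
    # Longest keys first so a longer phrase wins over a shorter one
    # that shares its prefix.
    keys = sorted(entries, key=len, reverse=True)
    alternation = "|".join(re.escape(k) for k in keys)
    return re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)


PATTERN = build_pattern(DICTIONARY)


def apply_dictionary(text: str) -> str:
    """Replace every matched phrase with its canonical spelling."""
    return PATTERN.sub(lambda m: DICTIONARY[m.group(0).lower()], text)
```

For example, `apply_dictionary("I asked chat GPT about voice pad.")` returns `"I asked ChatGPT about VoicePad."`. The word boundaries prevent replacements inside longer words.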

Whisper model sizes: which one for what?

Model	Parameters	Disk	RAM	Speed (rel.)	Best for
Tiny	39M	~75 MB	~390 MB	32×	Phone, low-end laptop
Base	74M	~140 MB	~500 MB	16×	Phone, mid laptop
Small (default)	244M	~480 MB	~1 GB	6×	Most users, balanced
Medium (Pro)	769M	~1.5 GB	~2.6 GB	2×	Pro users, desktop
Large-v3 (planned)	1550M	~3 GB	~5.5 GB	1×	Workstation

VoicePad defaults to Small because it's the sweet spot: ~96% English / ~94% German accuracy, with sub-second latency on Apple Silicon and roughly 1-2 seconds on a 5-year-old laptop or mid-range phone. Pro users on desktops get Medium for a few extra percentage points, mostly on edge cases such as regional accents and code-switching.
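
If you want to choose a model size programmatically, one simple heuristic is to pick the largest model whose RAM footprint fits a memory budget. The sketch below uses the RAM column from the table above; the `headroom` parameter and the `pick_model` function are hypothetical, and this is not how VoicePad selects its default.

```python
# Approximate RAM requirements in MB, from the model table above,
# ordered smallest to largest.
MODELS = [
    ("tiny", 390),
    ("base", 500),
    ("small", 1000),
    ("medium", 2600),
    ("large-v3", 5500),
]


def pick_model(free_ram_mb: int, headroom: float = 0.5) -> str:
    """Pick the largest model that fits in a fraction of free RAM.

    headroom is the share of free memory the model may use (assumed
    value); the rest is left for the OS and other apps.
    """
    budget = free_ram_mb * headroom
    choice = MODELS[0][0]  # fall back to the smallest model
    for name, ram_mb in MODELS:
        if ram_mb <= budget:
            choice = name
    return choice
```

With 4,000 MB free and the default headroom, this returns `"small"`. RAM is only one constraint; inference speed on the target CPU matters just as much for dictation.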

Speed numbers are relative to Large on a CPU. On Apple Silicon with CoreML and Apple Neural Engine, Small runs at ~10× real-time. On a desktop GPU (CUDA), Medium runs at ~5× real-time.
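
To make the "× real-time" figures concrete: a model running at 10× real-time transcribes 10 seconds of audio per second of compute. The helper below is a rough sketch of that arithmetic with a hypothetical function name; it ignores model load time, VAD, and text injection overhead.

```python
def inference_seconds(audio_seconds: float, realtime_factor: float) -> float:
    """Estimate compute time for a clip when the model runs at N x real-time."""
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_seconds / realtime_factor


# A 5-second utterance on Small at ~10x real-time (Apple Silicon figure above)
small_latency = inference_seconds(5.0, 10.0)   # 0.5 seconds

# The same utterance on Medium at ~5x real-time (desktop CUDA figure above)
medium_latency = inference_seconds(5.0, 5.0)   # 1.0 second
```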

Per-platform technical stack

Developers want to see the real wiring. Here's VoicePad's stack on each OS:

Layer	Windows	macOS	iOS	Android
Whisper runtime	whisper.cpp	WhisperKit	WhisperKit	whisper.cpp (JNI)
Acceleration	CPU + optional CUDA	Metal + Apple NE	CoreML + Apple NE	CPU (NEON SIMD)
VAD	Silero (ONNX)	Silero (CoreML)	Silero (CoreML)	Silero (ONNX NNAPI)
Audio capture	WASAPI	AVAudioEngine	AVAudioEngine	AudioRecord
Text injection	Win32 SendInput + HWND	Accessibility API	Keyboard extension	InputMethodService (IME)
Activation UI	Floating window	Menu bar + hotkey	Keyboard extension	Float Orb (overlay)
Codebase	Python + PyInstaller	Swift	Swift	Kotlin (KMP)

Shared logic between iOS and Android lives in Kotlin Multiplatform (KMP). Native shells on each platform handle the OS-specific bits.

Why this matters: Most "cross-platform" dictation apps are Electron wrappers that ship the same JavaScript everywhere. VoicePad is genuinely native per OS. That's why it works in WhatsApp on Android (which fights generic input methods) and in ChatGPT on iOS (which has its own keyboard quirks).

Whisper dictation alternatives: honest comparison

Tool	Engine	Platforms	Live dictation	License	Notes
VoicePad AI	Whisper (cpp + Kit)	Win+Mac+iOS+Android	✅	Commercial	This page
whisper.cpp	Whisper	All (CLI)	❌ batch	MIT	Great library, no UX
MacWhisper	Whisper	macOS only	❌ batch	Commercial	File transcription
Superwhisper	Whisper	Mac+iOS+Win	✅	$249 lifetime	Premium, no Android
Wispr Flow	Proprietary (not Whisper)	Mac+Win+iOS+Android	✅	$15/mo	Cloud-only
OpenWhispr	Whisper	Mac+Win+Linux	✅	Open source	Free, less polish
Handl	Whisper	Mac+Win+Linux	✅	Open source	Free, no mobile
VoiceInk	Whisper	macOS only	✅	OSS / paid	Mac-focused

Important: Wispr Flow doesn't run Whisper at runtime. Despite some marketing, their production pipeline is proprietary cloud inference. Every word you speak is sent to their servers. For privacy-sensitive use cases, this matters.

Superwhisper is the closest cross-platform competitor. Excellent product, premium price ($249), expanding to Windows. VoicePad's positioning: same engine quality, scrappy early-stage pricing, and Android coverage (which Superwhisper doesn't have).

Accuracy: what to actually expect

Whisper's accuracy varies by language, audio quality, and model size. Honest numbers from VoicePad's internal testing and public benchmarks:

Condition	Small	Medium	Large-v3
Clean English, quiet room	96-98%	97-99%	98-99%
Clean German, quiet room	94-97%	96-98%	97-99%
English with background noise	92-95%	94-97%	95-98%
German with regional accent	88-93%	92-96%	94-97%
Mixed code-switching (DE + EN)	85-90%	90-94%	93-96%

On German, Whisper Small reaches a word error rate (WER) of ~4.2% on clean speech (Uni Mannheim test set). That is comparable to or better than Dragon NaturallySpeaking's German pack, which costs €500 and runs only on Windows.
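
For reference, WER counts word-level substitutions, deletions, and insertions against a reference transcript, divided by the number of reference words. The sketch below is a generic textbook implementation using edit distance, not VoicePad's benchmark harness; real evaluations usually normalize punctuation and numbers first, which is omitted here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)
```

If the accuracy table above is read as 100% minus WER (an assumption), a 4.2% WER corresponds to about 95.8% word accuracy, which sits inside the 94-97% range listed for clean German on Small.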

Limits to know:

Whisper performs worse on very short utterances (under 1 second). VAD helps but doesn't eliminate this.
Heavy accents that aren't well-represented in the 680k training hours show 5-10% higher WER.
Real-time streaming is patched in via chunking, not native. Latency is ~1-2 seconds for Small on a modern phone, ~3-4 seconds on older hardware.

Frequently asked questions

Does VoicePad actually run Whisper, or is it just inspired by it?

Actually runs Whisper. On Windows and Android: whisper.cpp (Georgi Gerganov's C/C++ port). On macOS and iOS: WhisperKit (Argmax's Swift port optimized for Apple Silicon). Model weights are the same OpenAI Whisper weights you'd get from huggingface.co/openai. We don't fine-tune or modify the model.

Why not just use whisper.cpp directly?

You can, and you should if you only need batch transcription on the CLI. VoicePad adds VAD, hallucination filtering, dictionary post-processing, per-OS text injection, and a usable activation UI. Building those layers across four operating systems took us two years.

Can I use my own Whisper model (fine-tuned, custom)?

Not in the current build. Roadmap item for v2.2 — bring-your-own GGUF support so you can drop in domain-fine-tuned models (medical, legal, specialty technical vocab). If you have a use case, email alex@voicepad.tech and we'll prioritize.

How does VoicePad compare to Apple's built-in Dictation?

Apple Dictation uses Apple's proprietary speech engine, not Whisper. Independent benchmarks put it at roughly 3× the error rate of Whisper-based tools on the same test sets. Depending on OS version, language, and settings, Apple's Dictation may also send audio to Apple's servers; with VoicePad, audio never leaves the device.

What about latency?

On Apple Silicon (M1 or later) with Whisper Small, dictation feels real-time: under 1 second from end of speech to text. On a 5-year-old Intel laptop: 1-2 seconds. On a mid-range Android phone: 1-2 seconds. The Float Orb shows a processing indicator so you know it's working.

Is the Whisper dictation runtime open source?

The Whisper model weights are MIT-licensed (OpenAI). whisper.cpp is MIT (Georgi Gerganov). WhisperKit is MIT (Argmax). VoicePad's app code that wraps these is closed-source for now, but we plan to open-source specific components, starting with the hallucination filter and the EN+DE dictionary.

Do you support real-time streaming or only chunk-based?

Chunk-based with smart boundaries. VoicePad listens continuously while the Orb is active, runs VAD on a rolling buffer, and invokes Whisper on detected speech segments. Latency is the segment length + Whisper inference time. Pure-streaming Whisper exists (Whisper-Streaming project on GitHub) but adds complexity without a major UX win for dictation use cases.
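
The chunking step described here can be sketched as follows. This is a simplified illustration, not VoicePad's implementation: it assumes some VAD has already produced one speech probability per audio frame, and the `threshold` and `min_silence_frames` values are hypothetical defaults rather than VoicePad's tuned settings.

```python
from typing import Iterable, List, Tuple


def speech_segments(
    probs: Iterable[float],
    threshold: float = 0.5,       # hypothetical; not VoicePad's tuned value
    min_silence_frames: int = 3,  # trailing silence that closes a segment
) -> List[Tuple[int, int]]:
    """Return (start, end) frame index pairs for speech, end exclusive."""
    segments: List[Tuple[int, int]] = []
    start = None   # frame index where the current segment began
    silence = 0    # consecutive non-speech frames inside a segment
    i = -1         # stays -1 if probs is empty
    for i, p in enumerate(probs):
        if p >= threshold:
            if start is None:
                start = i
            silence = 0  # short pauses inside a sentence do not split it
        elif start is not None:
            silence += 1
            if silence >= min_silence_frames:
                # Close the segment at the last speech frame.
                segments.append((start, i - silence + 1))
                start, silence = None, 0
    if start is not None:  # stream ended while still inside a segment
        segments.append((start, i + 1 - silence))
    return segments
```

Each returned segment would then be handed to Whisper as one chunk. A larger `min_silence_frames` yields longer, more complete sentences per chunk at the cost of waiting longer after the speaker stops, which is the latency trade-off described above.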
