Whisper dictation that runs entirely on your device

VoicePad packages OpenAI Whisper into a real-time dictation app for Windows, macOS, iOS, and Android. It's the same engine that powers the best transcription tools, repackaged for live voice typing into any text field, with no cloud round trip and no subscription.

Built on whisper.cpp + WhisperKit · MIT-licensed engine · Sub-5% WER on clean German · No telemetry

Founding spots open for lifetime free access

What Whisper is, and why it's the best dictation engine in 2026

Whisper is OpenAI's open-source automatic speech recognition model, released September 2022, trained on 680,000 hours of multilingual audio scraped from the public web. It's MIT-licensed, multilingual (99 languages), and runs entirely on consumer hardware. The whisper.cpp port by Georgi Gerganov made it fast enough for real-time use on phones and laptops without dedicated GPUs.

Why Whisper beats older speech recognition:

  • Training scale: 680,000 hours vs Dragon's hand-curated thousands
  • Robustness: Handles accents, background noise, code-switching, mixed languages naturally
  • 99 languages with one model — no language packs to install
  • No vocabulary training step: recognizes most domain terms out of the box
  • Open weights = no vendor lock-in, MIT license

What Whisper isn't:

  • Not a speaker diarization tool: it doesn't label who is speaking (pair VAD with downstream diarization models for that)
  • Not optimized for real-time streaming out of the box (we patch this)
  • Not equally accurate across all 99 languages — best on English, very strong on German/French/Spanish, weaker on low-resource languages

Whisper is the de facto standard for offline speech recognition in 2026. VoicePad's job is to package it correctly for live dictation.

How VoicePad packages Whisper for live dictation

Whisper alone is a transcription engine. Live dictation needs five extra layers that VoicePad provides:

1. Voice Activity Detection (VAD) — Silero VAD model gates the audio. No speech in, no Whisper invocation, no hallucinations on silence. Threshold tuned to 0.008 after field testing.

2. Hallucination filter v2 — Whisper has a known failure mode where silence or near-silence triggers phantom outputs ("Thank you", "Subscribe to my channel", etc.). VoicePad's filter uses a dual-gate architecture: exact-phrase blacklist (387 real examples) + prefix pattern matching. Catches what VAD misses.

3. Custom dictionary — 915-entry curated EN+DE dictionary for proper nouns, technical terms, and disambiguation. Post-processes Whisper output to fix consistent errors (e.g. "ChatGPT" not "chat GPT", "VoicePad" not "voice pad").

4. Text injection layer — Different per OS. Whisper outputs text; getting it into WhatsApp, ChatGPT, Notion, or any text field requires OS-specific work (see stack table below).

5. Activation UI — Float Orb on mobile, hotkey on desktop. Tap, speak, release. The interaction model that makes Whisper feel like dictation instead of transcription.

Without these five layers, Whisper is a great library and a bad dictation app. VoicePad is the integration work.
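
To make layers 2 and 3 concrete, here is a minimal Python sketch of a dual-gate hallucination filter followed by dictionary post-processing. It is an illustration under stated assumptions, not VoicePad's shipped code: the phrase list, prefix patterns, dictionary entries, and all function names are hypothetical placeholders.

```python
import re

# Hypothetical sample data; the real blacklist and dictionary are far larger.
BLACKLIST = {"thank you", "thanks for watching", "subscribe to my channel"}
PREFIX_PATTERNS = [re.compile(r"^subtitles by\b"), re.compile(r"^thanks for watching\b")]
DICTIONARY = {"chat gpt": "ChatGPT", "voice pad": "VoicePad"}


def normalize(text: str) -> str:
    """Lowercase, collapse whitespace, and trim punctuation at the edges."""
    return re.sub(r"\s+", " ", text.lower()).strip(" .!?,")


def is_hallucination(text: str) -> bool:
    """Gate 1: exact-phrase match. Gate 2: prefix pattern match."""
    norm = normalize(text)
    if norm in BLACKLIST:
        return True
    return any(p.match(norm) for p in PREFIX_PATTERNS)


def apply_dictionary(text: str) -> str:
    """Replace known mis-transcriptions, case-insensitively, on word boundaries."""
    for wrong, right in DICTIONARY.items():
        text = re.sub(rf"\b{re.escape(wrong)}\b", right, text, flags=re.IGNORECASE)
    return text


def postprocess(text: str) -> str:
    """Return cleaned text, or an empty string if the segment looks hallucinated."""
    if is_hallucination(text):
        return ""
    return apply_dictionary(text)
```

With this sample data, `postprocess("I asked chat GPT about voice pad")` returns `"I asked ChatGPT about VoicePad"`, while a segment consisting only of "Thank you." is dropped. The exact-phrase gate only fires when the whole segment matches, so "Thank you for the update" passes through untouched.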

Whisper model sizes: which one for what?

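| Model              | Parameters | Disk    | RAM     | Speed (rel.) | Best for               |
|--------------------|------------|---------|---------|--------------|------------------------|
| Tiny               | 39M        | ~75 MB  | ~390 MB | 32×          | Phone, low-end laptop  |
| Base               | 74M        | ~140 MB | ~500 MB | 16×          | Phone, mid laptop      |
| Small (default)    | 244M       | ~480 MB | ~1 GB   |              | Most users, balanced   |
| Medium (Pro)       | 769M       | ~1.5 GB | ~2.6 GB |              | Pro users, desktop     |
| Large-v3 (planned) | 1550M      | ~3 GB   | ~5.5 GB |              | Workstation            |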

VoicePad defaults to Small because it's the sweet spot: ~96% English / ~94% German accuracy with roughly 1-2 seconds of latency on a 5-year-old laptop or mid-range phone, and under a second on Apple Silicon. Pro users on desktops get Medium for the extra 2-3 percentage points on edge cases.

Speed numbers are relative to Large on a CPU. On Apple Silicon with CoreML and Apple Neural Engine, Small runs at ~10× real-time. On a desktop GPU (CUDA), Medium runs at ~5× real-time.
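
For a feel of how these numbers translate into a choice, here is a small Python sketch that picks the largest model fitting a RAM budget and estimates inference time from a real-time factor. The RAM figures are copied from the table above; `ModelSpec`, `pick_model`, `inference_seconds`, and the 50% headroom default are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    name: str
    ram_mb: int  # approximate RAM need, from the table above


# Ordered smallest to largest.
MODELS = [
    ModelSpec("tiny", 390),
    ModelSpec("base", 500),
    ModelSpec("small", 1000),
    ModelSpec("medium", 2600),
    ModelSpec("large-v3", 5500),
]


def pick_model(free_ram_mb: int, headroom: float = 0.5) -> str:
    """Pick the largest model whose RAM need fits within a fraction of free RAM.

    The headroom fraction is an assumed safety margin, not a measured value.
    Falls back to the smallest model if nothing fits.
    """
    budget = free_ram_mb * headroom
    fitting = [m for m in MODELS if m.ram_mb <= budget]
    return fitting[-1].name if fitting else MODELS[0].name


def inference_seconds(audio_seconds: float, realtime_factor: float) -> float:
    """At N x real-time, a clip takes its duration divided by N to transcribe."""
    return audio_seconds / realtime_factor
```

For example, at ~10× real-time a 5-second utterance takes about `inference_seconds(5, 10)` = 0.5 seconds to transcribe, and a machine with 4 GB of free RAM lands on Small under the assumed 50% headroom.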

Per-platform technical stack

Developers want to see the real wiring. Here's VoicePad's stack on each OS:

| Layer           | Windows                | macOS             | iOS                | Android              |
|-----------------|------------------------|-------------------|--------------------|----------------------|
| Whisper runtime | whisper.cpp            | WhisperKit        | WhisperKit         | whisper.cpp (JNI)    |
| Acceleration    | CPU + optional CUDA    | Metal + Apple NE  | CoreML + Apple NE  | CPU (NEON SIMD)      |
| VAD             | Silero (ONNX)          | Silero (CoreML)   | Silero (CoreML)    | Silero (ONNX NNAPI)  |
| Audio capture   | WASAPI                 | AVAudioEngine     | AVAudioEngine      | AudioRecord          |
| Text injection  | Win32 SendInput + HWND | Accessibility API | Keyboard extension | AccessibilityService |
| Activation UI   | Floating window        | Menu bar + hotkey | Keyboard extension | Float Orb (overlay)  |
| Codebase        | Python + PyInstaller   | Swift             | Swift              | Kotlin (KMP)         |

Shared logic between iOS and Android lives in Kotlin Multiplatform (KMP). Native shells on each platform handle the OS-specific bits.

Why this matters: Most "cross-platform" dictation apps are Electron wrappers that ship the same JavaScript everywhere. VoicePad is genuinely native per OS. That's why it works in WhatsApp on Android (which fights generic input methods) and in ChatGPT on iOS (which has its own keyboard quirks).

Whisper dictation alternatives: honest comparison

| Tool        | Engine                    | Platforms           | Live dictation | License       | Notes               |
|-------------|---------------------------|---------------------|----------------|---------------|---------------------|
| VoicePad AI | Whisper (cpp + Kit)       | Win+Mac+iOS+Android |                | Commercial    | This page           |
| whisper.cpp | Whisper                   | All (CLI)           | ❌ batch       | MIT           | Great library, no UX |
| MacWhisper  | Whisper                   | macOS only          | ❌ batch       | Commercial    | File transcription  |
| Superwhisper | Whisper                  | Mac+iOS+Win         |                | $249 lifetime | Premium, no Android |
| Wispr Flow  | Proprietary (not Whisper) | Mac+Win+iOS+Android |                | $15/mo        | Cloud-only          |
| OpenWhispr  | Whisper                   | Mac+Win+Linux       |                | Open source   | Free, less polish   |
| Handl       | Whisper                   | Mac+Win+Linux       |                | Open source   | Free, no mobile     |
| VoiceInk    | Whisper                   | macOS only          |                | OSS / paid    | Mac-focused         |

Important: Wispr Flow doesn't run Whisper at runtime. Despite some marketing, their production pipeline is proprietary cloud inference. Every word you speak is sent to their servers. For privacy-sensitive use cases, this matters.

Superwhisper is the closest cross-platform competitor. Excellent product, premium price ($249), expanding to Windows. VoicePad's positioning: same engine quality, scrappy early-stage pricing, and Android coverage (which Superwhisper doesn't have).

Accuracy: what to actually expect

Whisper's accuracy varies by language, audio quality, and model size. Honest numbers from VoicePad's internal testing and public benchmarks:

| Condition                      | Small  | Medium | Large-v3 |
|--------------------------------|--------|--------|----------|
| Clean English, quiet room      | 96-98% | 97-99% | 98-99%   |
| Clean German, quiet room       | 94-97% | 96-98% | 97-99%   |
| English with background noise  | 92-95% | 94-97% | 95-98%   |
| German with regional accent    | 88-93% | 92-96% | 94-97%   |
| Mixed code-switching (DE + EN) | 85-90% | 90-94% | 93-96%   |

WER (word error rate) benchmark for German on Whisper Small: ~4.2% on clean speech (Uni Mannheim test set), comparable to or better than Dragon NaturallySpeaking's German pack — which costs €500 and runs Windows-only.
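
If you want to check numbers like these on your own recordings, WER is simple to compute: it is the word-level edit distance between a reference transcript and the recognizer's output, divided by the number of reference words. The Python sketch below shows one standard way to do it. The `wer` function is our illustrative helper, not part of any Whisper library, and it only lowercases and splits on whitespace, whereas published benchmarks usually apply heavier text normalization.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)
```

One wrong word in a four-word reference gives a WER of 0.25. If the accuracy percentages in the table are read as 1 − WER (an assumption on our part about how they relate), a ~4.2% WER corresponds to roughly 95.8% word accuracy.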

Limits to know:

  • Whisper performs worse on very short utterances (under 1 second). VAD helps but doesn't eliminate this.
  • Heavy accents that aren't well-represented in the 680k training hours show 5-10% higher WER.
  • Real-time streaming is patched in via chunking, not native. Latency is ~1-2 seconds for Small on a modern phone, ~3-4 seconds on older hardware.

Frequently asked questions

Does VoicePad actually run Whisper, or is it just inspired by it?
Actually runs Whisper. On Windows and Android: whisper.cpp (Georgi Gerganov's C/C++ port). On macOS and iOS: WhisperKit (Argmax's Swift port optimized for Apple Silicon). Model weights are the same OpenAI Whisper weights you'd get from huggingface.co/openai. We don't fine-tune or modify the model.
Why not just use whisper.cpp directly?
You can, and you should if you only need batch transcription on the CLI. VoicePad adds VAD, hallucination filtering, dictionary post-processing, per-OS text injection, and a usable activation UI. Building those layers yourself on four operating systems took us two years.
Can I use my own Whisper model (fine-tuned, custom)?
Not in the current build. Roadmap item for v2.2 — bring-your-own GGUF support so you can drop in domain-fine-tuned models (medical, legal, specialty technical vocab). If you have a use case, email alex@voicepad.tech and we'll prioritize.
How does VoicePad compare to Apple's built-in Dictation?
Apple Dictation uses Apple's proprietary speech engine, not Whisper. Independent benchmarks put it at roughly 3× the error rate of Whisper-based tools on the same test sets. Depending on device, language, and settings, Apple Dictation may send audio to Apple's servers; with VoicePad, audio never leaves the device.
What about latency?
On Apple Silicon (M1 or later) with Whisper Small: dictation feels real-time, sub-1-second end-of-speech to text. On a 5-year-old Intel laptop: 1-2 seconds. On a mid-range Android phone: 1-2 seconds. The Float Orb shows a processing indicator so you know it's working.
Is the Whisper dictation runtime open source?
The Whisper model weights are MIT-licensed (OpenAI). whisper.cpp is MIT (Georgi Gerganov). WhisperKit is MIT (Argmax). VoicePad's app code that wraps these is closed-source for now — but we're planning to open-source specific components first, starting with the hallucination filter and the EN+DE dictionary.
Do you support real-time streaming or only chunk-based?
Chunk-based with smart boundaries. VoicePad listens continuously while the Orb is active, runs VAD on a rolling buffer, and invokes Whisper on detected speech segments. Latency is the segment length + Whisper inference time. Pure-streaming Whisper exists (Whisper-Streaming project on GitHub) but adds complexity without a major UX win for dictation use cases.
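
The chunking idea can be sketched in a few lines of Python. This is a simplified illustration, not our production code: a plain RMS energy gate stands in for the Silero VAD model, and the threshold, the hangover length, and the names `rms` and `speech_segments` are all hypothetical.

```python
from typing import Iterable, Iterator, List

Frame = List[float]  # one short frame of mono samples in [-1.0, 1.0]


def rms(frame: Frame) -> float:
    """Root-mean-square energy of a frame."""
    return (sum(s * s for s in frame) / len(frame)) ** 0.5 if frame else 0.0


def speech_segments(
    frames: Iterable[Frame],
    threshold: float = 0.01,    # assumed energy threshold (stand-in for a VAD score)
    hangover_frames: int = 10,  # trailing silent frames that close a segment
) -> Iterator[List[Frame]]:
    """Yield runs of speech frames, each closed after enough trailing silence."""
    segment: List[Frame] = []
    silent_run = 0
    for frame in frames:
        if rms(frame) >= threshold:
            segment.append(frame)
            silent_run = 0
        elif segment:
            silent_run += 1
            segment.append(frame)  # keep short pauses inside the segment
            if silent_run >= hangover_frames:
                yield segment[:-silent_run]  # drop the trailing silence
                segment, silent_run = [], 0
    if segment:
        # Flush whatever is left when the stream ends.
        yield segment[:-silent_run] if silent_run else segment
```

Each yielded segment would then be handed to a Whisper call (not shown). Silence before speech never opens a segment, so the recognizer is not invoked on it, and short pauses inside a sentence stay within one segment instead of splitting it.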

Try Whisper dictation on your own setup

Free standard tier (Whisper Small) on all platforms. Founding members get lifetime Pro access (Medium + Large-v3 + WiFi sync) while spots remain.

Try it free

No credit card · Instant access · 4 platforms · No subscription