What "offline" actually means for speech to text
Most speech recognition services — Google Speech-to-Text, Amazon Transcribe, Microsoft Azure Speech — require an internet connection. Your voice audio is sent to remote servers, processed there, and results are sent back. This has three problems:
- Privacy: Your voice data is transmitted to and stored on third-party servers
- Availability: No internet = no functionality. Airplane, basement, rural area = unusable
- Cost: Per-minute billing adds up. Heavy users can pay hundreds of dollars per month
Offline speech to text solves all three. The speech recognition model runs locally on your device, so audio never leaves your hardware. It works without any network connection, and you pay once instead of per minute.
VoicePad implements this using OpenAI's Whisper, an open-source speech recognition model that many transcription services also build on, packaged to run directly on your Windows PC, Mac, iPhone, or Android phone.
How VoicePad delivers offline speech to text
Running Whisper offline requires more than just the model. VoicePad adds five layers that make it actually usable for real-time dictation:
1. Model optimization: Whisper weights are converted to platform-optimized formats: GGML model files for whisper.cpp (Windows/Android) and CoreML for WhisperKit (macOS/iOS). This enables hardware acceleration where the platform offers it, without requiring CUDA or manual setup.
2. Voice Activity Detection (VAD): Silero VAD runs continuously, detecting when you're actually speaking. Whisper only processes audio segments that contain speech, which saves CPU cycles and reduces hallucinations on silence.
3. Hallucination filtering: Whisper has a known issue: it sometimes outputs phantom text on silence ("Thank you", "Subscribe", etc.). VoicePad's filter catches these with a 387-phrase blacklist plus pattern matching.
4. Smart chunking: For a real-time feel, audio is processed in short segments (1-3 seconds) rather than waiting for complete utterances. This adds engineering complexity but makes the experience feel instant.
5. Text injection: Transcribed text needs to reach your target app (WhatsApp, Word, ChatGPT). VoicePad handles this differently per OS: Win32 SendInput on Windows, Accessibility API on Mac, keyboard extension on iOS, AccessibilityService on Android.
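To make step 3 concrete, here is a minimal Python sketch of a blacklist-plus-pattern filter. The three phrases, the repetition pattern, and the function names are assumptions for illustration only; VoicePad's actual 387-phrase list and its patterns are not published in this article.

```python
import re

# Hypothetical blacklist: a tiny stand-in for the real phrase list.
BLACKLIST = {"thank you", "thanks for watching", "subscribe"}

# Hypothetical pattern: the same word four or more times in a row.
REPEAT = re.compile(r"\b(\w+)(?:\s+\1\b){3,}", re.IGNORECASE)


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def is_hallucination(text: str) -> bool:
    """Return True if a transcribed segment should be discarded."""
    norm = normalize(text)
    if not norm:
        return True  # empty output: nothing worth injecting
    if norm in BLACKLIST:
        return True  # the whole segment is a known phantom phrase
    return REPEAT.search(norm) is not None  # runaway repetition
```

Because the blacklist is compared against the whole normalized segment, a sentence such as "Thank you for the report" passes through, while a lone "Thank you." is dropped.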
The result: speak, see text appear in any app, no internet involved at any step.
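The interplay of VAD gating and chunking can be sketched in Python as follows. This is a simplified stand-in, not VoicePad's code: a root-mean-square energy threshold replaces Silero VAD, and `transcribe` is a hypothetical callback rather than a real Whisper API. The 2-second chunk length and the threshold value are assumed; 16 kHz mono is the input rate Whisper models expect.

```python
import math
from typing import Callable, Iterable, List

SAMPLE_RATE = 16_000      # Whisper models take 16 kHz mono audio
CHUNK_SECONDS = 2.0       # assumed; inside the 1-3 second range above
ENERGY_THRESHOLD = 0.01   # hypothetical RMS gate standing in for Silero VAD


def rms(samples: List[float]) -> float:
    """Root-mean-square level of a chunk (0.0 for an empty chunk)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def chunks(samples: List[float], size: int) -> Iterable[List[float]]:
    """Yield consecutive fixed-size slices; the last one may be shorter."""
    for start in range(0, len(samples), size):
        yield samples[start:start + size]


def dictate(samples: List[float],
            transcribe: Callable[[List[float]], str]) -> List[str]:
    """Run the recognizer only on chunks that pass the energy gate."""
    size = int(SAMPLE_RATE * CHUNK_SECONDS)
    results = []
    for chunk in chunks(samples, size):
        if rms(chunk) < ENERGY_THRESHOLD:
            continue  # treated as silence: the recognizer is never called
        results.append(transcribe(chunk))
    return results
```

Fixed-size chunks can cut a word in half, so a production pipeline would more likely split on pauses reported by the VAD or overlap neighbouring chunks; the sketch only shows why silent audio never reaches the model.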
Model sizes and system requirements
| Model | Download | RAM needed | Speed | Best for |
|---|---|---|---|---|
| Whisper Tiny | ~75 MB | ~400 MB | Fastest | Older phones, low-spec devices |
| Whisper Small (default) | ~480 MB | ~1 GB | Fast | Most users — best accuracy/speed balance |
| Whisper Medium (Pro) | ~1.5 GB | ~2.6 GB | Moderate | Power users, desktop systems |
| Whisper Large-v3 (planned) | ~3 GB | ~5.5 GB | Slow | Workstations, maximum accuracy |
Minimum requirements:
- Windows: Windows 10 or later, 4 GB RAM, any CPU from the last 5 years
- macOS: macOS 12+, works on Intel and Apple Silicon (M1+ runs ~3x faster)
- iOS: iPhone 8 or later, iOS 15+
- Android: Android 8+, 3 GB RAM minimum, 4 GB+ recommended
No dedicated GPU required. Whisper Small runs in real-time on a 5-year-old laptop CPU. On Apple Silicon or with CUDA on Windows, performance is significantly better.
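As a rough illustration of how the table translates into a choice, the Python sketch below picks the largest model whose approximate RAM figure fits the memory available. The RAM values come from the table above ("~1 GB" is taken as 1000 MB); the function name `pick_model` and the 500 MB headroom are assumptions, and the sketch ignores that Medium is a Pro feature and Large-v3 is only planned.

```python
from typing import Optional

# (model name, approximate RAM needed in MB), ordered smallest to largest.
MODELS = [
    ("tiny", 400),
    ("small", 1000),
    ("medium", 2600),
    ("large-v3", 5500),
]


def pick_model(free_ram_mb: int, headroom_mb: int = 500) -> Optional[str]:
    """Return the largest model that fits, or None if even Tiny does not."""
    budget = free_ram_mb - headroom_mb  # leave room for the OS and other apps
    best = None
    for name, needed_mb in MODELS:
        if needed_mb <= budget:
            best = name  # later entries are larger, so keep overwriting
    return best
```

With these numbers, a device with about 4 GB free would land on Medium, and one with about 1.2 GB free would fall back to Tiny.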
Technical stack per platform
| Component | Windows | macOS | iOS | Android |
|---|---|---|---|---|
| Whisper runtime | whisper.cpp | WhisperKit | WhisperKit | whisper.cpp (JNI) |
| Acceleration | CPU + optional CUDA | Metal + Neural Engine | CoreML + Neural Engine | CPU (NEON SIMD) |
| VAD engine | Silero (ONNX) | Silero (CoreML) | Silero (CoreML) | Silero (ONNX) |
| Audio capture | WASAPI | AVAudioEngine | AVAudioEngine | AudioRecord |
| Text injection | Win32 SendInput | Accessibility API | Keyboard extension | AccessibilityService |
| Activation | System tray + hotkey | Menu bar + hotkey | Keyboard button | Float Orb overlay |
Each platform uses native APIs for best performance. No Electron wrapper, no web views, no cross-platform compromises that sacrifice speed.
Offline speech to text: VoicePad vs alternatives
| Solution | Truly offline | Platforms | Live dictation | Price |
|---|---|---|---|---|
| VoicePad AI | Yes, 100% | Win+Mac+iOS+Android | Yes | $0-50 one-time |
| Windows Voice Typing | Partial (Enhanced = cloud) | Windows only | Yes | Free |
| Apple Dictation | Partial (Enhanced = cloud) | Apple only | Yes | Free |
| Google Voice Typing | No | Android/Gboard | Yes | Free |
| Dragon NaturallySpeaking | Yes | Windows only | Yes | $500+ |
| whisper.cpp (CLI) | Yes | All (technical) | No (batch) | Free |
| Otter.ai | No | Web/Mobile | Yes | $17/mo |
Note: Windows and Apple's "Enhanced" dictation modes send audio to servers. Only their basic modes are truly offline — and those have lower accuracy. VoicePad is always offline with Whisper-level accuracy.
Accuracy comparison: offline vs cloud
| Condition | VoicePad (Whisper Small) | Google Cloud Speech | Windows Offline |
|---|---|---|---|
| Clean English | 96-98% | 97-99% | 88-92% |
| Clean German | 94-97% | 95-98% | 82-88% |
| Accented English | 92-96% | 93-97% | 80-88% |
| Background noise | 90-95% | 92-96% | 75-85% |
| Technical vocabulary | 90-95% | 88-94% | 70-82% |
Whisper Small comes within one to two percentage points of Google Cloud Speech on most conditions and edges ahead on technical vocabulary. Both are well ahead of Windows' offline mode. Where cloud wins: very noisy environments and rare languages with limited Whisper training data.
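Accuracy percentages for speech recognition are commonly derived from word error rate (WER), with accuracy taken as 1 - WER. This article does not state how its figures were measured, so the Python sketch below only illustrates the usual calculation; the function name is made up for the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            cur.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                cur[j - 1] + 1,      # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = cur
    return prev[-1] / len(ref)
```

Under that convention, one wrong word in a ten-word sentence gives a WER of 0.1, or 90% accuracy, and a figure of 96-98% corresponds to roughly two to four word errors per hundred words.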
VoicePad's custom dictionary adds another 1-2 percentage points by correcting errors that recur consistently (proper nouns, technical terms, brand names).
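One simple way to picture a custom dictionary is as a find-and-replace pass over the transcript. The Python sketch below assumes that approach; the entries and the function name `apply_dictionary` are hypothetical, and the article does not describe how VoicePad's dictionary actually works.

```python
import re
from typing import Dict

# Hypothetical user entries: what the model tends to write -> what was meant.
CORRECTIONS: Dict[str, str] = {
    "voice pad": "VoicePad",
    "whisper kit": "WhisperKit",
    "cuda": "CUDA",
}


def apply_dictionary(text: str, corrections: Dict[str, str]) -> str:
    """Replace whole-word, case-insensitive matches with the preferred form."""
    # Longest keys first, so multi-word entries win over shorter overlaps.
    for wrong in sorted(corrections, key=len, reverse=True):
        pattern = re.compile(r"\b" + re.escape(wrong) + r"\b", re.IGNORECASE)
        replacement = corrections[wrong]
        # A function replacement avoids backslash interpretation in the target.
        text = pattern.sub(lambda _match, r=replacement: r, text)
    return text
```

The word-boundary anchors keep "cuda" from being rewritten inside a word like "barracuda", which is the kind of collateral damage a naive string replace would cause.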
Try offline speech to text
Free standard tier on all platforms. Founding members get lifetime Pro access (Whisper Medium + WiFi Sync) while spots remain.