@hmcs/stt
The Speech-to-Text MOD (@hmcs/stt) provides voice recognition using OpenAI's Whisper model. It includes a UI control panel for managing Whisper models and configuring recognition settings.
Overview
STT (Speech-to-Text) converts spoken audio from your microphone into text. The recognition engine runs locally — no cloud API required. The MOD supports:
- Single-shot recognition — Speak a sentence and get the transcription
- Push-to-Talk (PTT) — Hold a key to record, release to transcribe
- 6 model sizes — From tiny (fast, less accurate) to large-v3 (slow, most accurate)
- Language auto-detection — Or manual language override
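The push-to-talk flow described above (hold to record, release to transcribe) can be pictured as a three-state machine. The TypeScript sketch below is illustrative only: `PttState`, `PttEvent`, and `nextState` are hypothetical names, and this is not how the MOD or the engine is actually implemented.

```typescript
// Illustrative push-to-talk state machine; not the MOD's actual implementation.
type PttState = "idle" | "recording" | "transcribing";
type PttEvent = "keyDown" | "keyUp" | "transcriptReady";

// Pure transition function: events that do not apply in the current state are ignored.
function nextState(state: PttState, event: PttEvent): PttState {
  if (state === "idle" && event === "keyDown") return "recording";
  if (state === "recording" && event === "keyUp") return "transcribing";
  if (state === "transcribing" && event === "transcriptReady") return "idle";
  return state;
}

// Walk through one hold-and-release cycle, including a repeated keyDown
// such as a keyboard auto-repeat would produce.
const events: PttEvent[] = ["keyDown", "keyDown", "keyUp", "transcriptReady"];
let state: PttState = "idle";
for (const event of events) {
  state = nextState(state, event);
  console.log(`${event} -> ${state}`);
}
```

Ignoring events that do not apply keeps the sketch robust against repeated key-down events while the key is held.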
Prerequisites
Install the STT MOD:
hmcs mod install @hmcs/stt
Requirements:
- Microphone access (the OS may prompt for permission on first use)
- Disk space for Whisper models (75 MB for tiny, up to 3 GB for large-v3)
GPU acceleration (optional):
- macOS: Metal acceleration via the stt-metal feature flag
- NVIDIA GPUs: CUDA acceleration via the stt-cuda feature flag
Control Panel
Open the control panel from the system tray by selecting "Speech to Text".
The control panel lets you:
- Download Whisper models with progress display
- Delete downloaded models to free disk space
- View downloaded models and their file sizes
- Configure default language and model size
Available Models
| Model | Size | Speed | Accuracy |
|---|---|---|---|
| tiny | ~75 MB | Fastest | Basic |
| base | ~150 MB | Fast | Good |
| small | ~500 MB | Moderate | Better |
| medium | ~1.5 GB | Slow | High |
| large-v3-turbo | ~1.6 GB | Moderate | Very High |
| large-v3 | ~3 GB | Slowest | Highest |
Start with the base model for a good balance of speed and accuracy. Upgrade to small or medium if recognition quality isn't sufficient.
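If disk space is the limiting factor, the table can also be read as a simple lookup. The TypeScript sketch below is a minimal illustration under stated assumptions: `WhisperModel` and `pickModel` are hypothetical names that are not part of the STT SDK, and the sizes are the approximate figures from the table.

```typescript
// Hypothetical helper, not part of @hmcs/stt or its SDK.
// Sizes are the approximate values from the table, in megabytes.
interface WhisperModel {
  name: string;
  sizeMb: number;
}

// Ordered from smallest to largest, following the table.
const MODELS: WhisperModel[] = [
  { name: "tiny", sizeMb: 75 },
  { name: "base", sizeMb: 150 },
  { name: "small", sizeMb: 500 },
  { name: "medium", sizeMb: 1500 },
  { name: "large-v3-turbo", sizeMb: 1600 },
  { name: "large-v3", sizeMb: 3000 },
];

// Return the largest model whose download fits in the given disk budget,
// or undefined when even the smallest model does not fit.
function pickModel(budgetMb: number): WhisperModel | undefined {
  let best: WhisperModel | undefined;
  for (const model of MODELS) {
    if (model.sizeMb <= budgetMb) {
      best = model; // later entries are larger, so keep overwriting
    }
  }
  return best;
}

console.log(pickModel(200)?.name); // base
console.log(pickModel(50)?.name); // undefined
```

The sketch considers size only. Speed also matters in practice: according to the table, large-v3-turbo is faster than medium despite being slightly larger.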
Agent Integration
Transcribed speech can be fed into an external OpenClaw agent via the @hmcs/openclaw-plugin bridge. See AI Integration for setup details.
SDK Reference
For programmatic STT access, see the STT SDK reference.
Notes
- Whisper models are stored in ~/.homunculus/stt_models/.
- The STT engine (homunculus_microphone) is built into the Desktop Homunculus engine, not the MOD itself. The MOD provides the UI and user-facing controls.
- Language defaults to auto-detection. Override in the control panel or via the SDK.