@hmcs/stt

The Speech-to-Text MOD (@hmcs/stt) provides voice recognition using OpenAI's Whisper model. It includes a UI control panel for managing Whisper models and configuring recognition settings.

Overview

STT (Speech-to-Text) converts spoken audio from your microphone into text. The recognition engine runs locally — no cloud API required. The MOD supports:

  • Single-shot recognition — Speak a sentence and get the transcription
  • Push-to-Talk (PTT) — Hold a key to record, release to transcribe
  • 6 model sizes — From tiny (fast, less accurate) to large-v3 (slow, most accurate)
  • Language auto-detection — Or manual language override
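
The push-to-talk flow described above (hold to record, release to transcribe) can be sketched as a small state holder. This is only an illustration of the control flow: the `Recorder` interface, `PushToTalk` class, and `FakeRecorder` stand-in are hypothetical names invented for this sketch and are not part of the @hmcs/stt SDK. See the STT SDK reference for the real API.

```typescript
// Hypothetical recorder interface; NOT the @hmcs/stt SDK API.
interface Recorder {
  start(): void;
  // Stops recording and resolves with the transcription.
  stop(): Promise<string>;
}

class PushToTalk {
  private recording = false;
  private readonly recorder: Recorder;
  private readonly onText: (text: string) => void;

  constructor(recorder: Recorder, onText: (text: string) => void) {
    this.recorder = recorder;
    this.onText = onText;
  }

  // Call on key-down. Repeated key-down events (OS key repeat) are ignored.
  keyDown(): void {
    if (this.recording) return;
    this.recording = true;
    this.recorder.start();
  }

  // Call on key-up. Stops recording and delivers the transcription.
  async keyUp(): Promise<void> {
    if (!this.recording) return;
    this.recording = false;
    this.onText(await this.recorder.stop());
  }
}

// Stand-in recorder used only to make the sketch runnable.
class FakeRecorder implements Recorder {
  starts = 0;
  start(): void {
    this.starts += 1;
  }
  async stop(): Promise<string> {
    return "hello world";
  }
}

const ptt = new PushToTalk(new FakeRecorder(), (text) => console.log(text));
ptt.keyDown();
void ptt.keyUp(); // logs the stand-in transcription once stop() resolves
```

Guarding on the `recording` flag matters because most operating systems send repeated key-down events while a key is held; without the guard, each repeat would restart the recording.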

Prerequisites

Install the STT MOD:

hmcs mod install @hmcs/stt

Requirements:

  • Microphone access (the OS may prompt for permission on first use)
  • Disk space for Whisper models (75 MB for tiny, up to 3 GB for large-v3)

GPU acceleration (optional):

  • macOS: Metal acceleration via the stt-metal feature flag
  • NVIDIA GPUs: CUDA acceleration via the stt-cuda feature flag

Control Panel

Open the control panel from the system tray menu item "Speech to Text".

The control panel lets you:

  • Download Whisper models with progress display
  • Delete downloaded models to free disk space
  • View downloaded models and their file sizes
  • Configure default language and model size

Available Models

Model            Size      Speed      Accuracy
tiny             ~75 MB    Fastest    Basic
base             ~150 MB   Fast       Good
small            ~500 MB   Moderate   Better
medium           ~1.5 GB   Slow       High
large-v3-turbo   ~1.6 GB   Moderate   Very High
large-v3         ~3 GB     Slowest    Highest

Tip: Start with the base model for a good balance of speed and accuracy. Upgrade to small or medium if recognition quality isn't sufficient.
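
When disk space is the limiting factor, the sizes in the Available Models table can be turned into a simple lookup. The sketch below is a minimal illustration, not part of the MOD: `MODEL_SIZES_MB` and `modelsThatFit` are hypothetical names, and the sizes are the approximate figures from the table converted to MB (1 GB taken as 1000 MB).

```typescript
// Approximate download sizes in MB, copied from the Available Models table.
const MODEL_SIZES_MB = [
  { name: "tiny", sizeMb: 75 },
  { name: "base", sizeMb: 150 },
  { name: "small", sizeMb: 500 },
  { name: "medium", sizeMb: 1500 },
  { name: "large-v3-turbo", sizeMb: 1600 },
  { name: "large-v3", sizeMb: 3000 },
];

// Returns the names of all models whose approximate size fits in the
// given amount of free disk space, smallest first.
function modelsThatFit(freeMb: number): string[] {
  return MODEL_SIZES_MB.filter((m) => m.sizeMb <= freeMb).map((m) => m.name);
}

console.log(modelsThatFit(600)); // [ 'tiny', 'base', 'small' ]
```

Because the table sizes are approximate, leave some headroom rather than treating the result as an exact fit.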

Agent Integration

Transcribed speech can be fed into an external OpenClaw agent via the @hmcs/openclaw-plugin bridge. See AI Integration for setup details.

SDK Reference

For programmatic STT access, see the STT SDK reference.

Notes

  • Whisper models are stored in ~/.homunculus/stt_models/.
  • The STT engine (homunculus_microphone) is built into the Desktop Homunculus engine, not the MOD itself. The MOD provides the UI and user-facing controls.
  • Language defaults to auto-detection. Override in the control panel or via the SDK.
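
To check from a script which models are already on disk, the storage directory named in the notes can be listed directly with Node's standard library. This is a minimal sketch that makes no assumption about how model files are named; `listModelFiles` is a hypothetical helper, not an SDK function, and it simply reports every regular file in the directory with its size.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Directory named in the notes: ~/.homunculus/stt_models/
const MODEL_DIR = path.join(os.homedir(), ".homunculus", "stt_models");

// Lists regular files in a directory with their sizes in MB.
// Returns an empty list if the directory does not exist.
function listModelFiles(dir: string = MODEL_DIR): { file: string; sizeMb: number }[] {
  if (!fs.existsSync(dir)) return [];
  return fs
    .readdirSync(dir, { withFileTypes: true })
    .filter((entry) => entry.isFile())
    .map((entry) => ({
      file: entry.name,
      sizeMb: fs.statSync(path.join(dir, entry.name)).size / (1024 * 1024),
    }));
}

for (const { file, sizeMb } of listModelFiles()) {
  console.log(`${file}: ${sizeMb.toFixed(1)} MB`);
}
```

The control panel shows the same information; a script like this is only useful for automation, for example warning when the directory grows beyond a disk budget.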