# Transcription Deployment
## Quick start — same machine (CPU)
```shell
# docker/.env
TRANSCRIPTION_LIVE_ENABLED=true
TRANSCRIPTION_POST_ENABLED=true
WHISPER_MODEL=base
WHISPER_DEVICE=cpu
WHISPER_COMPUTE_TYPE=int8  # 2–3x faster than float32 on CPU
```

```shell
# Start STT profiles alongside the core stack
docker compose --profile stt-post --profile stt-live up -d
```

Model weights (~150 MB for base) are downloaded on first start and cached in the `whisper_models` Docker volume.
## Whisper model selection
| Model | Size | CPU speed | GPU speed | Recommended for |
|---|---|---|---|---|
| tiny | 75 MB | Fastest | — | Dev/testing only |
| base | 150 MB | Fast | — | Default — CPU deployments |
| small | 250 MB | Moderate | Fast | Better accuracy on CPU |
| medium | 770 MB | Slow | Very fast | GPU deployments |
| large-v3 | 1.5 GB | Very slow | Excellent | High-accuracy GPU |
Use .en variants (base.en, small.en) for English-only — about 10% faster and more accurate.
Set in docker/.env:

```shell
WHISPER_MODEL=base.en
```

## Live transcription latency
Perceived delay = PAUSE_THRESHOLD_SECONDS + Whisper processing time.
| Hardware | Model | Whisper processing | Total perceived delay |
|---|---|---|---|
| CPU 2–4 cores | tiny.en | ~2–3s | 4–5s |
| CPU 4–8 cores | base.en | ~1–2s | 3–4s |
| CPU 8+ cores | small.en | ~1.5–2.5s | 3–4s |
| GPU 6–8 GB (RTX 3060, T4) | medium | ~0.3–0.7s | 2–2.5s |
| GPU 16–24 GB (RTX 3090/4090, A10) | large-v3 | ~0.2–0.5s | 2s |
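The totals in the table follow directly from the formula above; a quick sanity check in Python (the numbers here are the table's approximate midpoints, not measurements):

```python
def perceived_delay(pause_threshold_s: float, whisper_processing_s: float) -> float:
    """Total perceived delay = silence wait before flushing + Whisper processing time."""
    return pause_threshold_s + whisper_processing_s

# base.en on a 4-8 core CPU with the default 1.5s pause threshold:
print(perceived_delay(1.5, 2.0))  # → 3.5
```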
> **TIP**
>
> `PAUSE_THRESHOLD_SECONDS=1.5` means the agent waits 1.5 s of silence before sending audio to Whisper. The speaking indicator appears immediately (before Whisper is involved), so the UI never feels frozen even with longer processing times.
## GPU acceleration (same machine)
Requirements:
- NVIDIA GPU (compute capability 5.0+)
- nvidia-container-toolkit installed on the host
```shell
# docker/.env
WHISPER_MODEL=large-v3
WHISPER_DEVICE=cuda
WHISPER_COMPUTE_TYPE=float16
```
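For reference, a GPU override file like the one used in the next command typically does little more than reserve a device for the Whisper service. A hedged sketch of what docker-compose.stt.gpu.yml likely contains — the repo's actual file (including the service name) may differ:

```yaml
# Illustrative only — check the repo's actual docker-compose.stt.gpu.yml.
services:
  whisper:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```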
```shell
# Start with GPU override file
docker compose \
  -f docker-compose.yml \
  -f docker-compose.stt.gpu.yml \
  --profile stt-post --profile stt-live \
  up -d
```

## Dedicated GPU machine
Run Whisper and the STT workers on a separate machine — useful when your main server is CPU-only but you have a GPU workstation or cloud instance available.
### On the STT machine
```shell
cd docker
cp .env.stt.example .env.stt

# Edit .env.stt — set REDIS_URL, MINIO_ENDPOINT, LIVEKIT_URL, credentials
nano .env.stt

# CPU
docker compose -f docker-compose.stt.yml --env-file .env.stt up -d

# GPU
docker compose \
  -f docker-compose.stt.yml \
  -f docker-compose.stt.gpu.yml \
  --env-file .env.stt up -d
```

### On the main machine
```shell
# docker/.env
WHISPER_URL=http://STT_MACHINE_IP:3281
TRANSCRIPTION_POST_ENABLED=true
TRANSCRIPTION_LIVE_ENABLED=true
```

```shell
docker compose up -d api
```

The stt-worker and stt-live-agent services run on the STT machine; the main API just points its WHISPER_URL at the remote service.
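Before flipping the flags on the main machine, it can be worth confirming the remote Whisper port is reachable at all. A small probe — a hypothetical helper, and note it only checks that the TCP port accepts connections, not that a model is loaded:

```python
import socket

def port_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. port_reachable("STT_MACHINE_IP", 3281) before starting the API
```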
### Secure the connection
Redis and MinIO must never be exposed on a public IP without encryption. Use one of:

- WireGuard or Tailscale (recommended) — private encrypted tunnel
- SSH tunnel — `ssh -L 6379:localhost:6379 user@main-host`
- Cloud VPC — security groups allowing only the STT machine's private IP
## Environment variables reference
| Variable | Default | Description |
|---|---|---|
| TRANSCRIPTION_LIVE_ENABLED | false | Enable live transcription API endpoints |
| TRANSCRIPTION_POST_ENABLED | false | Enable post-call transcription API endpoints |
| WHISPER_MODEL | base | Model: tiny, base, small, medium, large-v3 |
| WHISPER_DEVICE | cpu | cpu or cuda |
| WHISPER_COMPUTE_TYPE | int8 | int8 (CPU) or float16 (GPU) |
| STT_LANGUAGE | en | ISO 639-1 code or auto for detection |
| PAUSE_THRESHOLD_SECONDS | 1.5 | Silence before flushing to Whisper |
| SHORT_PAUSE_SECONDS | 0.3 | Short pause — only flush if last chunk ended with punctuation |
| MAX_CHUNK_SECONDS | 30.0 | Force-flush audio after this duration |
| SPEECH_RMS_THRESHOLD | 200 | RMS energy threshold to detect speech |
| WHISPER_MAX_CONCURRENT | 2 | Max parallel Whisper requests per room |
| WHISPER_URL | http://whisper:8080 | URL for the Whisper REST service |
| PORT_WHISPER | 3281 | Host port for the Whisper REST API |
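How a worker might read these settings can be sketched as below — the actual services may parse them differently; the helper name is hypothetical and the defaults mirror the table:

```python
import os

def env_float(name: str, default: float) -> float:
    """Read a float setting from the environment, falling back to the default."""
    raw = os.environ.get(name)
    return float(raw) if raw is not None else default

PAUSE_THRESHOLD_SECONDS = env_float("PAUSE_THRESHOLD_SECONDS", 1.5)
MAX_CHUNK_SECONDS = env_float("MAX_CHUNK_SECONDS", 30.0)
SPEECH_RMS_THRESHOLD = env_float("SPEECH_RMS_THRESHOLD", 200)
```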

