# Transcription Deployment
## Quick start — same machine (CPU)
```shell
# docker/.env
TRANSCRIPTION_LIVE_ENABLED=true
TRANSCRIPTION_POST_ENABLED=true
WHISPER_MODEL=base
WHISPER_DEVICE=cpu
WHISPER_COMPUTE_TYPE=int8  # 2–3x faster than float32 on CPU
```

```shell
# Start STT profiles alongside the core stack
docker compose --profile stt-post --profile stt-live up -d
```

Model weights (~150 MB for base) are downloaded on first start and cached in the `whisper_models` Docker volume.
## Whisper model selection
| Model | Size | CPU speed | GPU speed | Recommended for |
|---|---|---|---|---|
| tiny | 75 MB | Fastest | — | Dev/testing only |
| base | 150 MB | Fast | — | Default — CPU deployments |
| small | 250 MB | Moderate | Fast | Better accuracy on CPU |
| medium | 770 MB | Slow | Very fast | GPU deployments |
| large-v3 | 1.5 GB | Very slow | Excellent | High-accuracy GPU |
Use .en variants (base.en, small.en) for English-only — about 10% faster and more accurate.
Set in docker/.env:

```shell
WHISPER_MODEL=base.en
```

## Live transcription latency
Perceived delay = PAUSE_THRESHOLD_SECONDS + Whisper processing time.
| Hardware | Model | Whisper processing | Total perceived delay |
|---|---|---|---|
| CPU 2–4 cores | tiny.en | ~2–3s | 4–5s |
| CPU 4–8 cores | base.en | ~1–2s | 3–4s |
| CPU 8+ cores | small.en | ~1.5–2.5s | 3–4s |
| GPU 6–8 GB (RTX 3060, T4) | medium | ~0.3–0.7s | 2–2.5s |
| GPU 16–24 GB (RTX 3090/4090, A10) | large-v3 | ~0.2–0.5s | 2s |
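The totals in the table follow directly from the formula above; a quick sanity check in Python (the numbers here are the table's approximate midpoints, not measurements):

```python
def perceived_delay(pause_threshold_s: float, whisper_processing_s: float) -> float:
    """Total perceived delay = silence wait before flushing + Whisper processing time."""
    return pause_threshold_s + whisper_processing_s

# base.en on a 4-8 core CPU with the default 1.5s pause threshold:
print(perceived_delay(1.5, 2.0))  # → 3.5
```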
> **TIP**
>
> `PAUSE_THRESHOLD_SECONDS=1.5` means the agent waits 1.5 s of silence before sending audio to Whisper. The speaking indicator appears immediately (before Whisper is involved), so the UI never feels frozen even with longer processing times.
## GPU acceleration (same machine)
Requirements:
- NVIDIA GPU (compute capability 5.0+)
- nvidia-container-toolkit installed on the host
```shell
# docker/.env
WHISPER_MODEL=large-v3
WHISPER_DEVICE=cuda
WHISPER_COMPUTE_TYPE=float16
```
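For reference, a GPU override file like the one used in the next command typically does little more than reserve a device for the Whisper service. A hedged sketch of what docker-compose.stt.gpu.yml likely contains — the repo's actual file (including the service name) may differ:

```yaml
# Illustrative only — check the repo's actual docker-compose.stt.gpu.yml.
services:
  whisper:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```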
```shell
# Start with GPU override file
docker compose \
  -f docker-compose.yml \
  -f docker-compose.stt.gpu.yml \
  --profile stt-post --profile stt-live \
  up -d
```

## Dedicated GPU machine
Run Whisper and the STT workers on a separate machine — useful when your main server is CPU-only but you have a GPU workstation or cloud instance available.
### On the STT machine
```shell
cd docker
cp .env.stt.example .env.stt

# Edit .env.stt — set REDIS_URL, MINIO_ENDPOINT, LIVEKIT_URL, credentials
nano .env.stt

# CPU
docker compose -f docker-compose.stt.yml --env-file .env.stt up -d

# GPU
docker compose \
  -f docker-compose.stt.yml \
  -f docker-compose.stt.gpu.yml \
  --env-file .env.stt up -d
```

### On the main machine
```shell
# docker/.env
WHISPER_URL=http://STT_MACHINE_IP:3281
TRANSCRIPTION_POST_ENABLED=true
TRANSCRIPTION_LIVE_ENABLED=true
```

```shell
docker compose up -d api
```

The stt-worker and stt-live-agent services run on the STT machine; the main API just points its WHISPER_URL at the remote service.
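Before flipping the flags on the main machine, it can be worth confirming the remote Whisper port is reachable at all. A small probe — a hypothetical helper, and note it only checks that the TCP port accepts connections, not that a model is loaded:

```python
import socket

def port_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. port_reachable("STT_MACHINE_IP", 3281) before starting the API
```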
### Secure the connection
Redis and MinIO must never be exposed on a public IP without encryption. Use one of:

- WireGuard or Tailscale (recommended) — private encrypted tunnel
- SSH tunnel — `ssh -L 6379:localhost:6379 user@main-host`
- Cloud VPC — security groups allowing only the STT machine's private IP
## Environment variables reference
| Variable | Default | Description |
|---|---|---|
| TRANSCRIPTION_LIVE_ENABLED | false | Enable live transcription API endpoints |
| TRANSCRIPTION_POST_ENABLED | false | Enable post-call transcription API endpoints |
| WHISPER_MODEL | base | Model: tiny, base, small, medium, large-v3 |
| WHISPER_DEVICE | cpu | cpu or cuda |
| WHISPER_COMPUTE_TYPE | int8 | int8 (CPU) or float16 (GPU) |
| STT_LANGUAGE | en | ISO 639-1 code or auto for detection |
| PAUSE_THRESHOLD_SECONDS | 1.5 | Silence before flushing to Whisper |
| SHORT_PAUSE_SECONDS | 0.3 | Short pause — only flush if last chunk ended with punctuation |
| MAX_CHUNK_SECONDS | 30.0 | Force-flush audio after this duration |
| SPEECH_RMS_THRESHOLD | 200 | RMS energy threshold to detect speech |
| WHISPER_MAX_CONCURRENT | 2 | Max parallel Whisper requests per room |
| WHISPER_URL | http://whisper:8080 | URL for the Whisper REST service |
| PORT_WHISPER | 3281 | Host port for the Whisper REST API |
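How a worker might read these settings can be sketched as below — the actual services may parse them differently; the helper name is hypothetical and the defaults mirror the table:

```python
import os

def env_float(name: str, default: float) -> float:
    """Read a float setting from the environment, falling back to the default."""
    raw = os.environ.get(name)
    return float(raw) if raw is not None else default

PAUSE_THRESHOLD_SECONDS = env_float("PAUSE_THRESHOLD_SECONDS", 1.5)
MAX_CHUNK_SECONDS = env_float("MAX_CHUNK_SECONDS", 30.0)
SPEECH_RMS_THRESHOLD = env_float("SPEECH_RMS_THRESHOLD", 200)
```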

