Architecture
RTCstack is intentionally thin. It manages tokens, rooms, and transcription state; it never touches media.
Two-Path Design

    Browser / Mobile
    │
    ├── HTTP POST /v1/token ──► RTCstack API ──► LiveKit JWT
    │   (once, before call)
    │
    └── WSS + UDP (WebRTC) ───────────────────► LiveKit SFU
        (during call, no RTCstack in path)

Path 1 - Token handshake (one-shot REST request): Your app backend calls POST /v1/token. RTCstack signs a LiveKit JWT with the correct role grants and returns it. After this single round-trip, RTCstack is entirely out of the picture for media.

Path 2 - The call itself (WebSocket + UDP, direct to LiveKit): The SDK connects to wss://yourdomain.com/livekit (Caddy-proxied). Audio/video travels over UDP directly from browser to LiveKit SFU. RTCstack never sees, touches, or forwards media packets.
Full Service Map

                  ┌───────────────────────────┐
                  │ Your App Backend          │
                  │ (issues signed requests)  │
                  └─────────────┬─────────────┘
                                │ HMAC-signed HTTP
                  ┌─────────────▼─────────────┐
                  │ RTCstack API (Fastify)    ├─── Redis (state/webhooks/transcripts)
                  │ POST /v1/token            │
                  │ GET /v1/rooms             │
                  │ POST /v1/recording/start  │
                  │ POST /v1/transcription/*  │
                  └─────────────┬─────────────┘
                                │ livekit-server-sdk
    ┌───────────────────────────▼───────────────────────────┐
    │ LiveKit SFU                                           │
    │ - WebSocket signalling                                │
    │ - UDP media (RTP/RTCP)                                │
    │ - Selective Forwarding Unit                           │
    └──────┬────────────────────┬───────────────────┬───────┘
           │                    │                   │
    ┌──────▼──────┐    ┌────────▼───────┐  ┌────────▼───────┐
    │ LiveKit     │    │ coturn TURN    │  │ stt-live-agent │
    │ Egress      │    │ TLS:443 relay  │  │ (Python)       │
    │ (recording) │    │ (firewall      │  │ subscribes to  │
    │ → MinIO/S3  │    │  fallback)     │  │ audio tracks   │
    └──────┬──────┘    └────────────────┘  └────────┬───────┘
           │                                        │
    ┌──────▼────────────┐                ┌──────────▼───────┐
    │ stt-worker        │                │ Whisper REST     │
    │ (Node.js)         │── HTTP /asr ──►│ (faster-whisper) │
    │ post-call queue   │                └──────────┬───────┘
    └──────┬────────────┘                           │
           │                                transcribed text
    ┌──────▼────────┐                    ┌──────────▼───────┐
    │ Redis         │◄──── segments ─────│ LiveKit          │
    │ (segments,    │                    │ data channel     │
    │  job queue)   │                    │ publish_data()   │
    └───────────────┘                    └──────────┬───────┘
                                                    │
                                         SDK transcriptReceived
                                         speakingStarted/Stopped
                                                    │
                                         ┌──────────▼───────┐
                                         │ <TranscriptPanel>│
                                         │ or custom UI     │
                                         └──────────────────┘

Transcription Architecture
Transcription is a first-class feature with two independent modes:
Live transcription path

    1. POST /v1/rooms/:roomId/transcription/start
          ↓
    2. API writes session state to Redis
          ↓
    3. stt-live-agent (Python LiveKit agent) joins room
          → subscribes to every audio track
          → RMS energy detection fires when speaker is active
          ↓
    4. Agent publishes { type: "speaking", speakerId, speaker }
       via LiveKit data channel → SDK emits speakingStarted
          ↓
    5. Silence detected → audio chunk sent to Whisper /asr
          ↓
    6. Whisper returns text → clean_transcript() filters hallucinations
          ↓
    7. Agent publishes { type: "transcript", text, speaker, speakerId }
       via LiveKit data channel → SDK emits transcriptReceived
          ↓
    8. UI: <TranscriptPanel /> or custom handlers

Key latency numbers:
- Speaking indicator: ~immediate (RMS detection, no Whisper involved)
- Final transcript: PAUSE_THRESHOLD_SECONDS (default 1.5 s) + Whisper processing (~0.5–2 s on CPU)
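
The speech-detection and pause steps of the live path, plus the latency estimate above, can be sketched roughly as follows. This is an illustrative sketch, not the actual stt-live-agent code: the function names, the `RMS_THRESHOLD` value, and the 20 ms frame length are all assumptions. Only `PAUSE_THRESHOLD_SECONDS` and the Whisper timing range come from the text.

```python
import math
from typing import Iterable, Iterator, List, Sequence, Tuple

PAUSE_THRESHOLD_SECONDS = 1.5  # default quoted in the text
RMS_THRESHOLD = 500.0          # assumption: a value suited to 16-bit PCM samples
FRAME_SECONDS = 0.02           # assumption: 20 ms audio frames

Frame = Sequence[int]
Event = Tuple[str, List[Frame]]


def rms(frame: Frame) -> float:
    """Root-mean-square energy of one frame of PCM samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def segment_utterances(
    frames: Iterable[Frame],
    frame_seconds: float = FRAME_SECONDS,
    rms_threshold: float = RMS_THRESHOLD,
    pause_seconds: float = PAUSE_THRESHOLD_SECONDS,
) -> Iterator[Event]:
    """Yield ("speaking", []) when speech starts, then ("chunk", frames)
    once silence has lasted at least pause_seconds."""
    buffered: List[Frame] = []
    silent_frames = 0
    speaking = False
    for frame in frames:
        if rms(frame) >= rms_threshold:
            if not speaking:
                speaking = True
                yield ("speaking", [])  # would trigger the "speaking" message
            silent_frames = 0
            buffered.append(frame)
        elif speaking:
            buffered.append(frame)
            silent_frames += 1
            if silent_frames * frame_seconds >= pause_seconds:
                yield ("chunk", buffered)  # would be sent to Whisper /asr
                buffered, silent_frames, speaking = [], 0, False
    # Note: speech still buffered when the stream ends is not flushed here.


def estimate_final_latency(
    pause_seconds: float = PAUSE_THRESHOLD_SECONDS,
    whisper_range: Tuple[float, float] = (0.5, 2.0),
) -> Tuple[float, float]:
    """Pause threshold plus the Whisper processing range, in seconds."""
    low, high = whisper_range
    return (pause_seconds + low, pause_seconds + high)
```

With the defaults quoted above, `estimate_final_latency()` returns `(2.0, 3.5)`, so a final transcript on CPU would arrive roughly 2 to 3.5 seconds after the speaker stops. Lowering the pause threshold shortens that delay at the cost of splitting utterances on shorter pauses.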
Post-call transcription path

    1. POST /v1/recordings/:recordingId/transcribe
          ↓
    2. API pushes job to Redis stt:queue
          ↓
    3. stt-worker pops job, fetches MP4 from MinIO
          ↓
    4. stt-worker POSTs recording to Whisper /asr
          ↓
    5. Whisper returns timestamped segments
          ↓
    6. stt-worker writes segments + full text to Redis
          ↓
    7. GET /v1/transcriptions/:transcriptionId → { status, segments, text }

Component Responsibilities
| Component | Responsibility | Does NOT do |
|---|---|---|
| RTCstack API | Token signing, room/recording/transcription management | Media, WebRTC, real-time events |
| LiveKit SFU | WebSocket signalling, media routing, recording triggers | Auth, room metadata, webhooks |
| stt-live-agent | Audio capture, VAD, Whisper requests, data channel publishing | Storage, API calls |
| stt-worker | Post-call job queue, MinIO fetch, Whisper requests, Redis writes | Real-time anything |
| Whisper (faster-whisper) | Audio → text, timestamps | Speaker ID, VAD |
| RTCstack SDK | Wraps LiveKit JS client, exposes clean event API | Network transport |
| Caddy | TLS termination, reverse proxy for WSS + API | Business logic |
| coturn | TURN relay for NAT traversal | SFU work |
| Redis | Webhooks, transcription segments, replay-attack window | File persistence |
| MinIO/S3 | Recording file storage | Processing |
Security Boundaries
All API requests require two-layer authentication:
- X-Api-Key header: identifies the calling service
- X-RTCstack-Signature header: HMAC-SHA256 of METHOD\nPATH\nTIMESTAMP\nSHA256(body); prevents replay attacks (5-minute window)
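
A backend could compute the signature along these lines. This is a minimal sketch under several assumptions that the text does not confirm: that both digests are hex-encoded, that the timestamp is Unix seconds, and that it travels in a header (the `X-RTCstack-Timestamp` name is hypothetical). Check the API reference for the exact encoding before relying on it.

```python
import hashlib
import hmac
import time
from typing import Dict, Optional


def sign_request(secret: str, method: str, path: str, body: bytes, timestamp: str) -> str:
    """HMAC-SHA256 over METHOD, PATH, TIMESTAMP and SHA256(body), newline-joined.

    Assumption: the body hash and the final signature are hex-encoded.
    """
    body_hash = hashlib.sha256(body).hexdigest()
    canonical = "\n".join([method.upper(), path, timestamp, body_hash])
    return hmac.new(secret.encode(), canonical.encode(), hashlib.sha256).hexdigest()


def is_fresh(timestamp: str, now: float, window_seconds: int = 300) -> bool:
    """Server-side style check: reject timestamps outside the 5-minute window."""
    return abs(now - float(timestamp)) <= window_seconds


def build_headers(
    api_key: str,
    secret: str,
    method: str,
    path: str,
    body: bytes,
    now: Optional[float] = None,
) -> Dict[str, str]:
    """Assemble the two authentication headers plus a timestamp header."""
    timestamp = str(int(time.time() if now is None else now))
    return {
        "X-Api-Key": api_key,
        "X-RTCstack-Timestamp": timestamp,  # hypothetical header name
        "X-RTCstack-Signature": sign_request(secret, method, path, body, timestamp),
    }
```

Because the timestamp is part of the signed string, a captured request cannot be replayed with a fresh timestamp without knowing the secret, and an unmodified replay fails the freshness check once the window has passed. A verifier should compare signatures with `hmac.compare_digest` rather than `==` to avoid timing leaks.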
LiveKit tokens are short-lived JWTs (configurable TTL, default 6 hours). The tokenRefresher option in createCall() allows transparent refresh before reconnects.
Scaling Notes
- API is stateless: run multiple replicas behind a load balancer. Redis is the only shared state.
- LiveKit is the SFU: horizontal scaling follows LiveKit's own clustering docs.
- Egress containers are CPU-intensive: run dedicated nodes for recording workloads.
- coturn is UDP-heavy: a separate network interface is recommended for high-volume TURN.
- Whisper is the transcription bottleneck: a dedicated GPU machine dramatically improves latency. See Deployment → Dedicated GPU Machine.

