Architecture

RTCstack is intentionally thin. It manages tokens, rooms, and transcription state; it never touches media.

Two-Path Design

Browser / Mobile
      │
      ├─── HTTP POST /v1/token ──► RTCstack API ──► LiveKit JWT
      │         (once, before call)
      │
      └─── WSS + UDP (WebRTC) ──────────────────► LiveKit SFU
                (during call, no RTCstack in path)

Path 1: Token handshake (one-shot REST request). Your app backend calls POST /v1/token. RTCstack signs a LiveKit JWT with the correct role grants and returns it. After this single round trip, RTCstack plays no further part in the media path.

Path 2: The call itself (WebSocket + UDP, direct to LiveKit). The SDK connects to wss://yourdomain.com/livekit (Caddy-proxied) for signalling. Audio and video travel over UDP directly between the browser and the LiveKit SFU, or through the coturn TURN relay when a firewall blocks UDP. RTCstack never sees, touches, or forwards media packets.

Full Service Map

                    ┌────────────────────────────┐
                    │  Your App Backend          │
                    │  (issues signed requests)  │
                    └─────────────┬──────────────┘
                                  │ HMAC-signed HTTP
                    ┌─────────────▼──────────────┐
                    │  RTCstack API (Fastify)    │◄── Redis (state/webhooks/transcripts)
                    │  POST /v1/token            │
                    │  GET  /v1/rooms            │
                    │  POST /v1/recording/start  │
                    │  POST /v1/transcription/*  │
                    └─────────────┬──────────────┘
                                  │ livekit-server-sdk
  ┌───────────────────────────────▼───────────────────────────────────┐
  │                            LiveKit SFU                            │
  │  - WebSocket signalling                                           │
  │  - UDP media (RTP/RTCP)                                           │
  │  - Selective Forwarding Unit                                      │
  └─────┬─────────────────────────────────┬───────────────────┬───────┘
        │                                 │                   │
  ┌─────▼───────┐                 ┌───────▼────────┐  ┌───────▼───────┐
  │ LiveKit     │                 │  coturn TURN   │  │ stt-live-agent│
  │ Egress      │                 │  TLS:443 relay │  │  (Python)     │
  │ (recording) │                 │  (firewall     │  │  subscribes to│
  │ → MinIO/S3  │                 │   fallback)    │  │  audio tracks │
  └──────┬──────┘                 └────────────────┘  └───────┬───────┘
         │                                                    │
  ┌──────▼────────────┐                            ┌──────────▼────────┐
  │  stt-worker       │                            │   Whisper REST    │
  │  (Node.js)        ├───────── HTTP /asr ────────►   (faster-whisper)│
  │  post-call queue  │                            └──────────┬────────┘
  └──────┬────────────┘                                       │
         │                                            transcribed text
  ┌──────▼────────┐                                ┌──────────▼────────┐
  │  Redis        │◄──────────── segments ─────────┤  LiveKit          │
  │  (segments,   │                                │  data channel     │
  │   job queue)  │                                │  publish_data()   │
  └───────────────┘                                └──────────┬────────┘
                                                              │
                                                   SDK transcriptReceived
                                                   speakingStarted/Stopped
                                                              │
                                                   ┌──────────▼────────┐
                                                   │  <TranscriptPanel>│
                                                   │  or custom UI     │
                                                   └───────────────────┘

Transcription Architecture

Transcription is a first-class feature with two independent modes:

Live transcription path

1. POST /v1/rooms/:roomId/transcription/start
          ↓
2. API writes session state to Redis
          ↓
3. stt-live-agent (Python LiveKit agent) joins room
   - subscribes to every audio track
   - RMS energy detection fires when a speaker is active
          ↓
4. Agent publishes { type: "speaking", speakerId, speaker }
   via LiveKit data channel → SDK emits speakingStarted
          ↓
5. Silence detected → audio chunk sent to Whisper /asr
          ↓
6. Whisper returns text → clean_transcript() filters hallucinations
          ↓
7. Agent publishes { type: "transcript", text, speaker, speakerId }
   via LiveKit data channel → SDK emits transcriptReceived
          ↓
8. UI: <TranscriptPanel /> or custom handlers
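
The two message shapes published in steps 4 and 7 can be handled with a small dispatcher. The following is a minimal Python sketch, not the SDK's or the agent's actual code: the field names come from the steps above, while the UTF-8 JSON wire encoding and the helper names (encode_speaking, encode_transcript, dispatch) are assumptions made for illustration.

```python
import json
from typing import Callable, Dict

# Field names follow steps 4 and 7 above.
# Assumption: payloads travel as UTF-8 encoded JSON.


def encode_speaking(speaker_id: str, speaker: str) -> bytes:
    """Build a step-4 style "speaking" payload (hypothetical helper)."""
    message = {"type": "speaking", "speakerId": speaker_id, "speaker": speaker}
    return json.dumps(message).encode("utf-8")


def encode_transcript(text: str, speaker_id: str, speaker: str) -> bytes:
    """Build a step-7 style "transcript" payload (hypothetical helper)."""
    message = {
        "type": "transcript",
        "text": text,
        "speaker": speaker,
        "speakerId": speaker_id,
    }
    return json.dumps(message).encode("utf-8")


def dispatch(payload: bytes, handlers: Dict[str, Callable[[dict], None]]) -> bool:
    """Decode a payload and call the handler registered for its type.

    Returns False for malformed payloads or unknown message types.
    """
    try:
        message = json.loads(payload.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return False
    if not isinstance(message, dict):
        return False
    message_type = message.get("type")
    if not isinstance(message_type, str):
        return False
    handler = handlers.get(message_type)
    if handler is None:
        return False
    handler(message)
    return True
```

A receiver would register one handler per type, for example mapping "speaking" to whatever raises speakingStarted and "transcript" to whatever raises transcriptReceived. Returning False rather than raising keeps one bad packet from interrupting the stream.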

Key latency numbers:

  • Speaking indicator: near-immediate (RMS detection, no Whisper involved)
  • Final transcript: PAUSE_THRESHOLD_SECONDS (default 1.5s) + Whisper processing (~0.5–2s on CPU), so roughly 2–3.5s after the speaker stops with the defaults
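
To make the interaction between RMS detection and the pause threshold concrete, here is a minimal Python sketch of energy-based segmentation. It is an illustration under stated assumptions, not the stt-live-agent implementation: only PAUSE_THRESHOLD_SECONDS and its 1.5s default come from the text above, while RMS_THRESHOLD, FRAME_MS, and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

PAUSE_THRESHOLD_SECONDS = 1.5  # default quoted in the latency list above
RMS_THRESHOLD = 500.0          # hypothetical level for 16-bit PCM samples
FRAME_MS = 20                  # hypothetical frame length in milliseconds


def rms(frame: Sequence[int]) -> float:
    """Root-mean-square energy of one frame of PCM samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def segment(
    frames: Sequence[Sequence[int]],
    pause_seconds: float = PAUSE_THRESHOLD_SECONDS,
    rms_threshold: float = RMS_THRESHOLD,
    frame_ms: int = FRAME_MS,
) -> List[Tuple[str, object]]:
    """Turn a frame sequence into ("speaking", index) and ("chunk", frames) events.

    A "speaking" event fires on the first loud frame of an utterance.
    A "chunk" event fires once silence has lasted pause_seconds; the chunk
    is what would be sent to Whisper.
    """
    pause_frames = math.ceil(pause_seconds * 1000 / frame_ms)
    events: List[Tuple[str, object]] = []
    chunk: List[Sequence[int]] = []
    silent_frames = 0
    speaking = False
    for index, frame in enumerate(frames):
        if rms(frame) >= rms_threshold:
            if not speaking:
                speaking = True
                events.append(("speaking", index))
            silent_frames = 0
            chunk.append(frame)
        elif speaking:
            silent_frames += 1
            chunk.append(frame)  # keep trailing silence inside the chunk
            if silent_frames >= pause_frames:
                events.append(("chunk", chunk))
                chunk, silent_frames, speaking = [], 0, False
    # Audio still buffered here is not flushed; a real agent would
    # likely flush it when the track ends.
    return events
```

The sketch shows why the speaking indicator is near-immediate while the final transcript is not: the "speaking" event needs a single loud frame, whereas the chunk is only released after the full pause threshold has elapsed.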

Post-call transcription path

1. POST /v1/recordings/:recordingId/transcribe
          ↓
2. API pushes job to Redis stt:queue
          ↓
3. stt-worker pops job, fetches MP4 from MinIO
          ↓
4. stt-worker POSTs the recording to Whisper /asr
          ↓
5. Whisper returns timestamped segments
          ↓
6. stt-worker writes segments + full text to Redis
          ↓
7. GET /v1/transcriptions/:transcriptionId → { status, segments, text }
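
Steps 5 to 7 imply that the worker turns timestamped segments into a record with status, segments, and text fields. The Python sketch below shows one way that assembly could look. The segment shape (start, end, text), the "completed" status value, and the build_result name are assumptions for illustration; stt-worker itself is a Node.js service and its real schema is not shown here.

```python
from typing import List, TypedDict


class Segment(TypedDict):
    # Assumed segment shape: times in seconds plus the recognised text.
    start: float
    end: float
    text: str


def build_result(segments: List[Segment]) -> dict:
    """Assemble a { status, segments, text } record from raw segments.

    Segments are sorted by start time, whitespace is trimmed, and empty
    segments are dropped before the full text is joined.
    """
    ordered = sorted(segments, key=lambda s: s["start"])
    cleaned = [
        {"start": s["start"], "end": s["end"], "text": s["text"].strip()}
        for s in ordered
        if s["text"].strip()
    ]
    return {
        "status": "completed",  # hypothetical status value
        "segments": cleaned,
        "text": " ".join(s["text"] for s in cleaned),
    }
```

Keeping both the per-segment list and the joined text means a client can render a timeline or a plain transcript from the same response.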

Component Responsibilities

| Component | Responsibility | Does NOT do |
| --- | --- | --- |
| RTCstack API | Token signing, room/recording/transcription management | Media, WebRTC, real-time events |
| LiveKit SFU | WebSocket signalling, media routing, recording triggers | Auth, room metadata, webhooks |
| stt-live-agent | Audio capture, VAD, Whisper requests, data channel publishing | Storage, API calls |
| stt-worker | Post-call job queue, MinIO fetch, Whisper requests, Redis writes | Real-time anything |
| Whisper (faster-whisper) | Audio → text, timestamps | Speaker ID, VAD |
| RTCstack SDK | Wraps LiveKit JS client, exposes clean event API | Network transport |
| Caddy | TLS termination, reverse proxy for WSS + API | Business logic |
| coturn | TURN relay for NAT traversal | SFU work |
| Redis | Webhooks, transcription segments, replay-attack window | File persistence |
| MinIO/S3 | Recording file storage | Processing |

Security Boundaries

All API requests require two-layer authentication:

  1. X-Api-Key header: identifies the calling service
  2. X-RTCstack-Signature header: HMAC-SHA256 of METHOD\nPATH\nTIMESTAMP\nSHA256(body). Because the timestamp is part of the signed string, the API can reject requests outside a 5-minute window, which limits replay attacks
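
The signing scheme above can be sketched in a few lines of Python using only the standard library. This is a hedged illustration, not RTCstack's reference implementation: the canonical string layout and the 5-minute window come from the text, but hex encoding for both the body hash and the signature, a Unix timestamp in seconds, and the function names are assumptions. How the timestamp is transmitted is not stated above, so the sketch leaves transport out.

```python
import hashlib
import hmac
import time
from typing import Optional

REPLAY_WINDOW_SECONDS = 300  # the 5-minute window described above


def canonical_string(method: str, path: str, timestamp: int, body: bytes) -> str:
    """Build METHOD\\nPATH\\nTIMESTAMP\\nSHA256(body).

    Assumption: the body hash is hex encoded.
    """
    body_hash = hashlib.sha256(body).hexdigest()
    return "\n".join([method.upper(), path, str(timestamp), body_hash])


def sign(secret: bytes, method: str, path: str, timestamp: int, body: bytes) -> str:
    """Return the HMAC-SHA256 signature as hex (encoding is an assumption)."""
    message = canonical_string(method, path, timestamp, body).encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify(
    secret: bytes,
    method: str,
    path: str,
    timestamp: int,
    body: bytes,
    signature: str,
    now: Optional[int] = None,
) -> bool:
    """Check freshness first, then compare signatures in constant time."""
    current = int(time.time()) if now is None else now
    if abs(current - timestamp) > REPLAY_WINDOW_SECONDS:
        return False  # outside the replay window
    expected = sign(secret, method, path, timestamp, body)
    return hmac.compare_digest(expected, signature)
```

A backend would compute sign(...) over the exact bytes it sends and attach the result as X-RTCstack-Signature next to X-Api-Key. Using hmac.compare_digest on the verifying side avoids leaking information through comparison timing.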

LiveKit tokens are short-lived JWTs (configurable TTL, default 6 hours). The tokenRefresher option in createCall() allows transparent refresh before reconnects.

Scaling Notes

  • API is stateless: run multiple replicas behind a load balancer. Redis is the only shared state.
  • LiveKit is the SFU: horizontal scaling follows LiveKit's own clustering docs.
  • Egress containers are CPU-intensive: run dedicated nodes for recording workloads.
  • coturn is UDP-heavy: a separate network interface is recommended for high-volume TURN.
  • Whisper is the transcription bottleneck: a dedicated GPU machine dramatically improves latency. See Deployment → Dedicated GPU Machine.