Architecture

RTCstack is intentionally thin. It manages tokens, rooms, recordings, and transcription state; it never touches media.

Two-Path Design

Browser / Mobile
      │
      ├─── HTTP POST /v1/token ──► RTCstack API ──► LiveKit JWT
      │         (once, before call)
      │
      └─── WSS + UDP (WebRTC) ──────────────────► LiveKit SFU
                (during call, no RTCstack in path)

Path 1 (token handshake; one-shot REST request): Your app backend calls POST /v1/token once, before the call starts. RTCstack signs a LiveKit JWT with the correct role grants and returns it. After this single round-trip, RTCstack is out of the media path entirely.

Path 2 (the call itself; WebSocket + UDP, direct to LiveKit): The SDK connects to wss://yourdomain.com/livekit (Caddy-proxied) for signalling. Audio and video travel over UDP directly between the browser and the LiveKit SFU, or through the coturn TURN relay when a firewall blocks direct UDP. RTCstack never sees, touches, or forwards media packets.
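
To make the token handshake concrete, here is a minimal sketch of what signing a LiveKit-style JWT involves, written with the Python standard library only. It is an illustration under assumptions, not RTCstack's implementation (the service map shows the API using livekit-server-sdk). The role names "host" and "viewer", the ROLE_GRANTS mapping, and the function name mint_livekit_token are hypothetical; the claim layout (iss, sub, nbf, exp, and a video grant object) follows LiveKit's access-token format.

```python
import base64
import hashlib
import hmac
import json
import time

# Hypothetical role-to-grant mapping; RTCstack's real roles are not listed here.
ROLE_GRANTS = {
    "host": {"roomJoin": True, "canPublish": True, "canSubscribe": True, "roomAdmin": True},
    "viewer": {"roomJoin": True, "canPublish": False, "canSubscribe": True},
}


def b64url(data: bytes) -> str:
    """Base64url without padding, as used by the JWT compact form."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def mint_livekit_token(api_key, api_secret, room, identity, role,
                       ttl_seconds=6 * 3600, now=None):
    """Build an HS256 JWT carrying a LiveKit-style `video` grant.

    ttl_seconds defaults to 6 hours, the default TTL named in this document.
    """
    issued = int(time.time()) if now is None else int(now)
    header = {"alg": "HS256", "typ": "JWT"}
    claims = {
        "iss": api_key,          # LiveKit API key
        "sub": identity,         # participant identity
        "nbf": issued,
        "exp": issued + ttl_seconds,
        "video": {"room": room, **ROLE_GRANTS[role]},
    }
    encoded_header = b64url(json.dumps(header, separators=(",", ":")).encode())
    encoded_claims = b64url(json.dumps(claims, separators=(",", ":")).encode())
    signing_input = encoded_header + "." + encoded_claims
    signature = hmac.new(api_secret.encode(), signing_input.encode("ascii"),
                         hashlib.sha256).digest()
    return signing_input + "." + b64url(signature)
```

The client would pass the returned string to the SDK when joining; nothing in the token refers back to RTCstack, which is why the API can stay out of Path 2.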

Full Service Map

                    ┌───────────────────────────┐
                    │  Your App Backend          │
                    │  (issues signed requests)  │
                    └────────────┬──────────────┘
                                 │ HMAC-signed HTTP
                    ┌────────────▼──────────────┐
                    │  RTCstack API (Fastify)    │◄── Redis (state/webhooks/transcripts)
                    │  POST /v1/token            │
                    │  GET  /v1/rooms            │
                    │  POST /v1/recording/start  │
                    │  POST /v1/transcription/*  │
                    └────────────┬──────────────┘
                                 │ livekit-server-sdk
     ┌───────────────────────────▼────────────────────┐
     │               LiveKit SFU                       │
     │  - WebSocket signalling                         │
     │  - UDP media (RTP/RTCP)                         │
     │  - Selective Forwarding Unit                    │
     └──┬───────────────────────────────────────┬──┬──┘
        │                                       │  │
  ┌─────▼───────┐                 ┌─────────────▼──┤  ┌───────────────┐
  │ LiveKit     │                 │  coturn TURN   │  │ stt-live-agent│
  │ Egress      │                 │  TLS:443 relay │  │  (Python)     │
  │ (recording) │                 │  (firewall     │  │  subscribes to│
  │ → MinIO/S3  │                 │   fallback)    │  │  audio tracks │
  └──────┬──────┘                 └────────────────┘  └──────┬────────┘
         │                                                    │
  ┌──────▼────────────┐                            ┌──────────▼────────┐
  │  stt-worker       │                            │   Whisper REST    │
  │  (Node.js)        ├───────── HTTP /asr ────────►   (faster-whisper)│
  │  post-call queue  │                            └──────────┬────────┘
  └──────┬────────────┘                                       │
         │                                              transcribed text
  ┌──────▼────────┐                               ┌──────────▼────────┐
  │  Redis        │◄──────────── segments ─────────  LiveKit           │
  │  (segments,   │                               │  data channel      │
  │   job queue)  │                               │  publish_data()    │
  └───────────────┘                               └──────────┬────────┘
                                                             │
                                                     SDK transcriptReceived
                                                     speakingStarted/Stopped
                                                             │
                                                  ┌──────────▼────────┐
                                                  │  <TranscriptPanel>│
                                                  │  or custom UI     │
                                                  └───────────────────┘

Transcription Architecture

Transcription is a first-class feature with two independent modes:

Live transcription path

1. POST /v1/rooms/:roomId/transcription/start
          ↓
2. API writes session state to Redis
          ↓
3. stt-live-agent (Python LiveKit agent) joins room
   — subscribes to every audio track
   — RMS energy detection fires when speaker is active
          ↓
4. Agent publishes { type: "speaking", speakerId, speaker }
   via LiveKit data channel → SDK emits speakingStarted
          ↓
5. Silence detected → audio chunk sent to Whisper /asr
          ↓
6. Whisper returns text → clean_transcript() filters hallucinations
          ↓
7. Agent publishes { type: "transcript", text, speaker, speakerId }
   via LiveKit data channel → SDK emits transcriptReceived
          ↓
8. UI: <TranscriptPanel /> or custom handlers
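
Steps 3 to 5 above (RMS energy detection, then cutting a chunk once silence lasts long enough) can be sketched as follows. This is a simplified illustration, not the agent's actual code: the RMS_THRESHOLD value, the UtteranceSegmenter class, the fixed frame duration, and the event tuples are all assumptions. Only PAUSE_THRESHOLD_SECONDS and its 1.5s default come from this document.

```python
import math
from array import array

PAUSE_THRESHOLD_SECONDS = 1.5  # default named in this document
RMS_THRESHOLD = 500.0          # hypothetical energy threshold for 16-bit PCM


def rms(frame: bytes) -> float:
    """Root-mean-square energy of signed 16-bit PCM samples (native byte order)."""
    samples = array("h", frame)
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


class UtteranceSegmenter:
    """Tracks one audio track and emits events as frames arrive."""

    def __init__(self, frame_seconds, pause_seconds=PAUSE_THRESHOLD_SECONDS,
                 threshold=RMS_THRESHOLD):
        self.frame_seconds = frame_seconds
        self.pause_seconds = pause_seconds
        self.threshold = threshold
        self.buffer = bytearray()
        self.speaking = False
        self.silence = 0.0

    def feed(self, frame: bytes):
        """Return events for this frame: ("speaking", None) or ("chunk", audio)."""
        events = []
        if rms(frame) >= self.threshold:
            if not self.speaking:
                # First loud frame: the speaking indicator can fire immediately.
                self.speaking = True
                events.append(("speaking", None))
            self.silence = 0.0
            self.buffer.extend(frame)
        elif self.speaking:
            # Quiet frame inside an utterance: keep it, and count the silence.
            self.silence += self.frame_seconds
            self.buffer.extend(frame)
            if self.silence >= self.pause_seconds:
                # Pause threshold reached: hand the chunk to Whisper.
                events.append(("chunk", bytes(self.buffer)))
                self.buffer.clear()
                self.speaking = False
                self.silence = 0.0
        return events
```

In this sketch a "speaking" event corresponds to step 4 and a "chunk" event to step 5; quiet frames that arrive while nobody is speaking are dropped rather than buffered.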

Key latency numbers:

Speaking indicator: ~immediate (RMS detection, no Whisper involved)
Final transcript: PAUSE_THRESHOLD_SECONDS (default 1.5s) + Whisper processing (~0.5–2s on CPU), so roughly 2–3.5s after the speaker stops talking
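
The two data-channel messages from steps 4 and 7 of the live path can be modelled with a small encoder and dispatcher. This is a sketch under assumptions: it supposes the payloads are UTF-8 JSON objects with exactly the fields shown in those steps, and the helper names encode_message and dispatch are hypothetical. The real SDK is a JavaScript wrapper around the LiveKit client; the Python below only illustrates the routing from message type to event name.

```python
import json

# Message "type" field -> SDK event name, as described in this document.
EVENT_FOR_TYPE = {
    "speaking": "speakingStarted",
    "transcript": "transcriptReceived",
}


def encode_message(kind, speaker_id, speaker, text=None) -> bytes:
    """Serialize an agent message; `text` is only present on transcripts."""
    message = {"type": kind, "speakerId": speaker_id, "speaker": speaker}
    if text is not None:
        message["text"] = text
    return json.dumps(message).encode("utf-8")


def dispatch(payload: bytes, handlers: dict) -> bool:
    """Decode one payload and call the matching handler.

    Returns True if a handler ran, False for malformed or unknown messages.
    """
    try:
        message = json.loads(payload.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return False
    if not isinstance(message, dict):
        return False
    event = EVENT_FOR_TYPE.get(message.get("type"))
    handler = handlers.get(event)
    if handler is None:
        return False
    handler(message)
    return True
```

Ignoring unknown message types, rather than raising, lets other data-channel traffic share the room without breaking transcript handling.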

Post-call transcription path

1. POST /v1/recordings/:recordingId/transcribe
          ↓
2. API pushes job to Redis stt:queue
          ↓
3. stt-worker pops job, fetches MP4 from MinIO
          ↓
4. stt-worker POSTs the recording to Whisper /asr
          ↓
5. Whisper returns timestamped segments
          ↓
6. stt-worker writes segments + full text to Redis
          ↓
7. GET /v1/transcriptions/:transcriptionId → { status, segments, text }
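
The worker side of this flow (steps 3 to 6) reduces to one loop iteration: pop a job, fetch the file, transcribe, store the result. The sketch below uses in-memory stand-ins so it runs anywhere: a deque plays the role of the Redis stt:queue and a dict plays the role of the Redis result store. The job fields (transcriptionId, recordingId), the status strings ("processing", "done", "failed"), and the injected fetch_recording and transcribe callables are assumptions for illustration. Only the { status, segments, text } result shape comes from this document.

```python
from collections import deque


def run_worker_once(queue, results, fetch_recording, transcribe):
    """Process one queued job; return its transcription id, or None if idle."""
    if not queue:
        return None
    job = queue.popleft()
    transcription_id = job["transcriptionId"]
    # Mark the job as started so a status request has something to report.
    results[transcription_id] = {"status": "processing", "segments": [], "text": ""}
    try:
        audio = fetch_recording(job["recordingId"])   # e.g. the MP4 from MinIO
        segments = transcribe(audio)                  # e.g. POST to Whisper /asr
        full_text = " ".join(s["text"].strip() for s in segments)
        results[transcription_id] = {
            "status": "done",
            "segments": segments,
            "text": full_text,
        }
    except Exception as exc:
        # Record the failure instead of leaving the job stuck in "processing".
        results[transcription_id] = {
            "status": "failed",
            "segments": [],
            "text": "",
            "error": str(exc),
        }
    return transcription_id
```

Passing the fetch and transcribe steps in as callables keeps the queue logic testable without MinIO or Whisper running.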

Component Responsibilities

Component	Responsibility	Does NOT do
RTCstack API	Token signing, room/recording/transcription management	Media, WebRTC, real-time events
LiveKit SFU	WebSocket signalling, media routing, recording triggers	Auth, room metadata, webhooks
stt-live-agent	Audio capture, VAD, Whisper requests, data channel publishing	Storage, RTCstack API calls
stt-worker	Post-call job queue, MinIO fetch, Whisper requests, Redis writes	Real-time anything
Whisper (faster-whisper)	Audio → text, timestamps	Speaker ID, VAD
RTCstack SDK	Wraps LiveKit JS client, exposes clean event API	Network transport
Caddy	TLS termination, reverse proxy for WSS + API	Business logic
coturn	TURN relay for NAT traversal	SFU work
Redis	Webhooks, transcription session state and segments, stt:queue job queue, replay-attack window	File persistence
MinIO/S3	Recording file storage	Processing

Security Boundaries

All API requests require two-layer authentication:

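X-Api-Key header: identifies the calling service
X-RTCstack-Signature header: HMAC-SHA256 of METHOD\nPATH\nTIMESTAMP\nSHA256(body). The signature covers the method, path, and body hash, so a tampered request fails verification; the signed timestamp limits replays to a 5-minute window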
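
A minimal sketch of computing and checking that signature follows. The canonical string layout and the 5-minute window come from the text above; everything else is an assumption made for illustration: hex encoding for both the body hash and the final signature, a Unix-seconds timestamp, an upper-cased method, and the function names themselves. Check the API reference for the exact encoding and for how the timestamp is transmitted.

```python
import hashlib
import hmac
import time

REPLAY_WINDOW_SECONDS = 300  # the 5-minute window described above


def canonical_string(method, path, timestamp, body: bytes) -> str:
    """METHOD\\nPATH\\nTIMESTAMP\\nSHA256(body), with the hash hex-encoded (assumed)."""
    body_hash = hashlib.sha256(body).hexdigest()
    return "\n".join([method.upper(), path, str(timestamp), body_hash])


def sign_request(secret: str, method, path, timestamp, body: bytes) -> str:
    """HMAC-SHA256 over the canonical string, hex-encoded (assumed)."""
    message = canonical_string(method, path, timestamp, body).encode("utf-8")
    return hmac.new(secret.encode("utf-8"), message, hashlib.sha256).hexdigest()


def verify_request(secret, method, path, timestamp, body, signature, now=None) -> bool:
    """Reject stale timestamps first, then compare signatures in constant time."""
    current = time.time() if now is None else now
    try:
        age = abs(current - int(timestamp))
    except (TypeError, ValueError):
        return False
    if age > REPLAY_WINDOW_SECONDS:
        return False
    expected = sign_request(secret, method, path, timestamp, body)
    return hmac.compare_digest(expected, signature)
```

Because the timestamp is part of the signed string, an attacker cannot refresh a captured request by editing the timestamp alone; the signature would no longer match.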

LiveKit tokens are short-lived JWTs (configurable TTL, default 6 hours). The tokenRefresher option in createCall() allows transparent refresh before reconnects.

Scaling Notes

API is stateless — run multiple replicas behind a load balancer. Redis is the only shared state.
LiveKit is the SFU — horizontal scaling follows LiveKit's own clustering docs.
Egress containers are CPU-intensive — run dedicated nodes for recording workloads.
coturn is UDP-heavy — separate network interface recommended for high-volume TURN.
Whisper is the transcription bottleneck — a dedicated GPU machine dramatically improves latency. See Deployment → Dedicated GPU Machine.
