Transcription Overview

Live and post-call transcription are core features of RTCstack, not an external API or a paid add-on. Whisper runs inside your Docker stack alongside everything else; no audio ever leaves your server.

Two modes

|                     | Live                                        | Post-call                             |
|---------------------|---------------------------------------------|---------------------------------------|
| When                | During the call                             | After the recording ends              |
| Latency             | 1–3 seconds per utterance                   | Minutes (whole file at once)          |
| Output              | SDK events → UI in real time                | Redis segments + plain text           |
| Trigger             | API call to start/stop                      | API call on a recording ID            |
| Speaker attribution | Yes (per audio track)                       | Yes (via diarization or speaker tags) |
| Use case            | Live captions, meeting notes, accessibility | Archives, searchable transcripts      |

Both modes share the same Whisper service and are independently enabled via environment variables.
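As a minimal illustration, the live-mode toggle is the variable listed under Requirements below (the post-call mode has its own variable, not shown here; check the setup guide for its exact name):

```shell
# docker/.env (fragment)
TRANSCRIPTION_LIVE_ENABLED=true
```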

How live transcription works

Participant speaks
       ↓
stt-live-agent detects audio energy (RMS)
       ↓ (immediately)
→ speakingStarted event in SDK  →  "···" in UI
       ↓
1.5s silence → audio chunk flushed to Whisper
       ↓ (~0.5–2s)
Whisper returns text
       ↓
→ transcriptReceived event in SDK  →  final text in UI

The speaking indicator fires before Whisper is involved; it's just RMS energy detection. The real-time feel comes from showing the indicator immediately, then replacing it with transcribed text once Whisper responds.
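The RMS check itself is cheap enough to run on every audio frame. As a rough sketch (the threshold value here is an illustrative assumption, not RTCstack's actual tuning):

```typescript
// Root-mean-square energy of one frame of audio samples in [-1, 1].
function rms(samples: Float32Array): number {
  let sum = 0
  for (const s of samples) sum += s * s
  return Math.sqrt(sum / samples.length)
}

// Assumed tuning value; real deployments would calibrate this per mic/codec.
const SPEAKING_THRESHOLD = 0.02

// True when the frame's energy is above the silence floor.
function isSpeaking(samples: Float32Array): boolean {
  return rms(samples) > SPEAKING_THRESHOLD
}
```

In practice the agent would also debounce this signal (the 1.5 s silence window above) before flushing a chunk to Whisper.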

SDK events

```typescript
// Speaking indicator: fires immediately when mic energy is detected
call.on('speakingStarted', (speakerId: string, speakerName: string) => {
  showTypingIndicator(speakerName)
})

// Fired when a transcript segment arrives (also clears the speaking indicator)
call.on('transcriptReceived', (segment: TranscriptSegment) => {
  appendTranscript(segment.speaker, segment.text)
})

// Speaking indicator cleared
call.on('speakingStopped', (speakerId: string) => {
  removeTypingIndicator(speakerId)
})
```
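The handlers above reference a `TranscriptSegment` type. One plausible shape, inferred from the fields used on this page (the exact SDK type may differ, and the timestamp unit is an assumption):

```typescript
// Sketch of the segment payload, not the canonical SDK definition.
interface TranscriptSegment {
  speakerId: string
  speaker: string   // display name
  text: string
  timestamp: number // assumed: milliseconds since epoch
}
```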

Built-in UI components

React

```tsx
import { TranscriptPanel } from '@rtcstack/ui-react'

// Standalone
<TranscriptPanel maxItems={100} showSpeakerName />

// Or built into VideoConference with one prop
<VideoConference call={call} showTranscript />
```

Vue 3

```vue
<TranscriptPanel :max-items="100" :show-speaker-name="true" />

<!-- Or -->
<VideoConference :call="call" :show-transcript="true" />
```

Vanilla JS

```typescript
mountVideoConference(el, {
  call,
  showTranscript: true,  // default: true
})
```

Custom transcript rendering (any framework)

Use the SDK events directly and build your own UI:

```typescript
call.on('speakingStarted', (speakerId, name) => {
  // Create a "typing" bubble for this speaker
})

call.on('transcriptReceived', ({ speaker, speakerId, text, timestamp }) => {
  // Replace the bubble with real text, or append to existing
})
```

Requirements

  • Docker Compose with the stt-live profile enabled
  • Whisper model weights (~150 MB for base), downloaded automatically on first start
  • TRANSCRIPTION_LIVE_ENABLED=true in docker/.env
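Assuming a standard Compose setup, the profile from the first requirement is enabled at startup with the stock Docker Compose CLI (service names and any extra flags depend on your stack):

```shell
# Bring the stack up with the live-transcription services included
docker compose --profile stt-live up -d
```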

See Live Transcription Setup for the full configuration guide.