Live Transcription

Per-speaker, real-time transcription during a call. The stt-live-agent joins each room as a Python LiveKit agent, subscribes to all audio tracks, and sends audio chunks to Whisper. Results are published back into the room via LiveKit's data channel and surfaced through SDK events.
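The wire format on the data channel is internal to the stack; the SDK already surfaces these messages as typed events, so you normally never parse them yourself. If you do inspect raw data messages, a type guard like the following sketch can help. Every field name here is an assumption for illustration, not the SDK's actual wire format:

```typescript
// Hypothetical shape of a transcript message on the data channel.
// All field names are assumptions; rely on the SDK events instead.
interface TranscriptMessage {
  type: 'transcript'
  speakerId: string
  speaker: string
  text: string
  startMs: number
}

// Narrow an unknown data-channel payload to the hypothetical shape above.
function isTranscriptMessage(value: unknown): value is TranscriptMessage {
  const v = value as Partial<TranscriptMessage> | null
  return (
    typeof v === 'object' &&
    v !== null &&
    v.type === 'transcript' &&
    typeof v.speakerId === 'string' &&
    typeof v.text === 'string'
  )
}
```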

Enable

bash
# docker/.env
TRANSCRIPTION_LIVE_ENABLED=true
WHISPER_MODEL=base          # base is default; use small/medium for better accuracy
PAUSE_THRESHOLD_SECONDS=1.5 # flush audio after 1.5s silence
bash
docker compose --profile stt-live up -d

Start a session

http
POST /v1/rooms/:roomId/transcription/start
Content-Type: application/json

{
  "language": "en"
}

Response:

json
{ "transcriptionId": "tr_abc123", "status": "active" }
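The endpoint can be called from any HTTP client. A minimal request-builder sketch in TypeScript, where `API_BASE` and the absence of an auth header are illustrative assumptions, not part of the documented API:

```typescript
// Illustrative base URL; substitute your deployment's API origin.
const API_BASE = 'https://api.yourapp.com'

// Build the start-transcription request for a room.
function buildStartRequest(roomId: string, language = 'en') {
  return {
    url: `${API_BASE}/v1/rooms/${encodeURIComponent(roomId)}/transcription/start`,
    init: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ language }),
    },
  }
}

// Usage:
// const { url, init } = buildStartRequest('my-room')
// const res = await fetch(url, init)
// const { transcriptionId, status } = await res.json()
```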

Or from the SDK (requires apiUrl + roomName in CallOptions):

typescript
const call = createCall({
  url: 'wss://...',
  token: '...',
  roomName: 'my-room',
  apiUrl: 'https://api.yourapp.com',
})

await call.connect()
await call.startTranscription()

Stop a session

http
POST /v1/rooms/:roomId/transcription/stop

Or:

typescript
await call.stopTranscription()

SDK integration

Events

typescript
import type { TranscriptSegment } from '@rtcstack/sdk'

// `call` is the instance returned by createCall() above

call.on('speakingStarted', (speakerId: string, speakerName: string) => {
  // Speaker is talking; no text yet. Show a typing indicator.
  console.log(speakerName, 'is speaking...')
})

call.on('speakingStopped', (speakerId: string) => {
  // Clear the speaking indicator here.
  // Note: this fires when the transcript arrives, not when the mic goes silent.
})

call.on('transcriptReceived', (segment: TranscriptSegment) => {
  console.log(`[${segment.speaker}] ${segment.text}`)
  // segment.speakerId - participant identity
  // segment.timestamp - Date when the event was received
  // segment.startMs   - millisecond offset within the audio chunk
})
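The three events above compose naturally into a small framework-free store; a sketch under the assumption that only the fields used here matter (the segment type below is narrowed, not the SDK's full TranscriptSegment):

```typescript
// Narrowed segment type for this sketch.
type Segment = { speakerId: string; speaker: string; text: string }

// Minimal transcript store driven by the three events above.
class TranscriptStore {
  readonly segments: Segment[] = []
  readonly speaking = new Map<string, string>() // speakerId -> speakerName

  onSpeakingStarted(speakerId: string, speakerName: string) {
    this.speaking.set(speakerId, speakerName)
  }

  onSpeakingStopped(speakerId: string) {
    this.speaking.delete(speakerId)
  }

  onTranscriptReceived(segment: Segment) {
    this.segments.push(segment)
  }
}

// Wiring (assuming `call` from createCall() as above):
// const store = new TranscriptStore()
// call.on('speakingStarted', (id, name) => store.onSpeakingStarted(id, name))
// call.on('speakingStopped', (id) => store.onSpeakingStopped(id))
// call.on('transcriptReceived', (seg) => store.onTranscriptReceived(seg))
```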

React hooks

tsx
import { useTranscription, useSpeakingIndicators } from '@rtcstack/ui-react'

function MyTranscript() {
  const segments = useTranscription()        // TranscriptSegment[], live
  const speaking = useSpeakingIndicators()   // Map<speakerId, speakerName>

  return (
    <div>
      {segments.map((seg, i) => (
        <p key={i}><strong>{seg.speaker}:</strong> {seg.text}</p>
      ))}
      {[...speaking.entries()].map(([id, name]) => (
        <p key={id} style={{ opacity: 0.5 }}><strong>{name}:</strong> ···</p>
      ))}
    </div>
  )
}

Or just use the pre-built panel:

tsx
import { TranscriptPanel } from '@rtcstack/ui-react'

<TranscriptPanel showSpeakerName maxItems={200} />

Vue 3 composables

vue
<script setup lang="ts">
import { useTranscription, useSpeakingIndicators } from '@rtcstack/ui-vue'

const segments = useTranscription()       // Ref<TranscriptSegment[]>
const speaking = useSpeakingIndicators()  // Ref<Map<speakerId, speakerName>>
</script>

Or the pre-built component:

vue
<TranscriptPanel :show-speaker-name="true" :max-items="200" />

Vanilla JS

typescript
import { mountVideoConference } from '@rtcstack/ui-vanilla'

mountVideoConference(el, { call, showTranscript: true })

// Or wire up events directly
call.on('speakingStarted', (id, name) => { /* show indicator */ })
call.on('transcriptReceived', (seg) => { /* append text */ })

Fetch transcript via API

Poll while the session is active, or retrieve the full transcript after stopping:

http
GET /v1/rooms/:roomId/transcription

Response:

json
{
  "transcriptionId": "tr_abc123",
  "status": "active",
  "language": "en",
  "segments": [
    {
      "startMs": 1500,
      "endMs": 4200,
      "speakerName": "Alice",
      "speakerId": "PA_abc",
      "text": "Can everyone hear me?"
    },
    {
      "startMs": 5100,
      "endMs": 8300,
      "speakerName": "Bob",
      "speakerId": "PA_xyz",
      "text": "Yes, loud and clear."
    }
  ]
}
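For display or export, the segments array can be rendered offline. A small sketch where `ApiSegment` mirrors the response fields above and `mmss` is a hypothetical helper:

```typescript
// Mirrors the segment fields in the GET response above.
interface ApiSegment {
  startMs: number
  endMs: number
  speakerName: string
  speakerId: string
  text: string
}

// Format a millisecond offset as "mm:ss".
function mmss(ms: number): string {
  const totalSeconds = Math.floor(ms / 1000)
  const m = Math.floor(totalSeconds / 60)
  const s = totalSeconds % 60
  return `${String(m).padStart(2, '0')}:${String(s).padStart(2, '0')}`
}

// Render the transcript response as plain text, one line per segment.
function renderTranscript(segments: ApiSegment[]): string {
  return segments
    .map((seg) => `[${mmss(seg.startMs)}] ${seg.speakerName}: ${seg.text}`)
    .join('\n')
}
```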

Tuning latency

Perceived delay = PAUSE_THRESHOLD_SECONDS + Whisper processing time.

| PAUSE_THRESHOLD_SECONDS | Feel | Trade-off |
| --- | --- | --- |
| 0.8s | Very snappy | More partial chunks, lower accuracy |
| 1.5s | Default; good balance | |
| 3.0s | Waits for natural pauses | Better accuracy on long sentences |

For Whisper processing time, see Model Selection.
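As a back-of-the-envelope sketch of the formula above, assuming you have measured Whisper's processing time yourself (the stack does not report it):

```typescript
// Perceived delay = pause threshold + Whisper processing time.
// whisperMs is a measured value for your model and hardware.
function estimatedDelayMs(pauseThresholdSeconds: number, whisperMs: number): number {
  return pauseThresholdSeconds * 1000 + whisperMs
}

// With the default 1.5s threshold and a measured 800ms Whisper pass:
// estimatedDelayMs(1.5, 800) -> 2300
```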

Environment variables

| Variable | Default | Description |
| --- | --- | --- |
| TRANSCRIPTION_LIVE_ENABLED | false | Enable live transcription endpoints |
| PAUSE_THRESHOLD_SECONDS | 1.5 | Silence duration before flushing audio to Whisper |
| SHORT_PAUSE_SECONDS | 0.3 | Very short pause: only flush if previous chunk ended with punctuation |
| MAX_CHUNK_SECONDS | 30.0 | Force-flush audio after this many seconds regardless of silence |
| SPEECH_RMS_THRESHOLD | 200 | RMS energy threshold for speech detection |
| WHISPER_MAX_CONCURRENT | 2 | Max parallel Whisper requests per room |
| STT_LANGUAGE | en | ISO 639-1 language code, or auto |
| WHISPER_MODEL | base | Whisper model size |
| WHISPER_URL | http://whisper:8080 | Whisper service URL |