Live Transcription

Per-speaker, real-time transcription during a call. The stt-live-agent joins each room as a Python LiveKit agent, subscribes to all audio tracks, and sends audio chunks to Whisper. Results are published back into the room via LiveKit's data channel and surfaced through SDK events.
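The wire format on the data channel is internal to the stack; the SDK already surfaces these messages as typed events, so you normally never parse them yourself. If you do inspect raw data messages, a type guard like the following sketch can help. Every field name here is an assumption for illustration, not the SDK's actual wire format:

```typescript
// Hypothetical shape of a transcript message on the data channel.
// All field names are assumptions; rely on the SDK events instead.
interface TranscriptMessage {
  type: 'transcript'
  speakerId: string
  speaker: string
  text: string
  startMs: number
}

// Narrow an unknown data-channel payload to the hypothetical shape above.
function isTranscriptMessage(value: unknown): value is TranscriptMessage {
  const v = value as Partial<TranscriptMessage> | null
  return (
    typeof v === 'object' &&
    v !== null &&
    v.type === 'transcript' &&
    typeof v.speakerId === 'string' &&
    typeof v.text === 'string'
  )
}
```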

Enable

bash
# docker/.env
TRANSCRIPTION_LIVE_ENABLED=true
WHISPER_MODEL=base          # base is default; use small/medium for better accuracy
PAUSE_THRESHOLD_SECONDS=1.5 # flush audio after 1.5s silence
bash
docker compose --profile stt-live up -d

Start a session

http
POST /v1/rooms/:roomId/transcription/start
Content-Type: application/json

{
  "language": "en"
}

Response:

json
{ "transcriptionId": "tr_abc123", "status": "active" }
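The endpoint can be called from any HTTP client. A minimal request-builder sketch in TypeScript, where `API_BASE` and the absence of an auth header are illustrative assumptions, not part of the documented API:

```typescript
// Illustrative base URL; substitute your deployment's API origin.
const API_BASE = 'https://api.yourapp.com'

// Build the start-transcription request for a room.
function buildStartRequest(roomId: string, language = 'en') {
  return {
    url: `${API_BASE}/v1/rooms/${encodeURIComponent(roomId)}/transcription/start`,
    init: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ language }),
    },
  }
}

// Usage:
// const { url, init } = buildStartRequest('my-room')
// const res = await fetch(url, init)
// const { transcriptionId, status } = await res.json()
```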

Or from the SDK (requires apiUrl + roomName in CallOptions):

typescript
const call = createCall({
  url: 'wss://...',
  token: '...',
  roomName: 'my-room',
  apiUrl: 'https://api.yourapp.com',
})

await call.connect()
await call.startTranscription()

Stop a session

http
POST /v1/rooms/:roomId/transcription/stop

Or:

typescript
await call.stopTranscription()

SDK integration

Events

typescript
import type { TranscriptSegment } from '@rtcstack/sdk'

// `call` is the instance returned by createCall() above

call.on('speakingStarted', (speakerId: string, speakerName: string) => {
  // Speaker is talking; no text yet. Show a typing indicator.
  console.log(speakerName, 'is speaking...')
})

call.on('speakingStopped', (speakerId: string) => {
  // Clear the speaking indicator here.
  // Note: this fires when the transcript arrives, not when the mic goes silent.
})

call.on('transcriptReceived', (segment: TranscriptSegment) => {
  console.log(`[${segment.speaker}] ${segment.text}`)
  // segment.speakerId - participant identity
  // segment.timestamp - Date when the event was received
  // segment.startMs   - millisecond offset within the audio chunk
})
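The three events above compose naturally into a small framework-free store; a sketch under the assumption that only the fields used here matter (the segment type below is narrowed, not the SDK's full TranscriptSegment):

```typescript
// Narrowed segment type for this sketch.
type Segment = { speakerId: string; speaker: string; text: string }

// Minimal transcript store driven by the three events above.
class TranscriptStore {
  readonly segments: Segment[] = []
  readonly speaking = new Map<string, string>() // speakerId -> speakerName

  onSpeakingStarted(speakerId: string, speakerName: string) {
    this.speaking.set(speakerId, speakerName)
  }

  onSpeakingStopped(speakerId: string) {
    this.speaking.delete(speakerId)
  }

  onTranscriptReceived(segment: Segment) {
    this.segments.push(segment)
  }
}

// Wiring (assuming `call` from createCall() as above):
// const store = new TranscriptStore()
// call.on('speakingStarted', (id, name) => store.onSpeakingStarted(id, name))
// call.on('speakingStopped', (id) => store.onSpeakingStopped(id))
// call.on('transcriptReceived', (seg) => store.onTranscriptReceived(seg))
```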

React hooks

tsx
import { useTranscription, useSpeakingIndicators } from '@rtcstack/ui-react'

function MyTranscript() {
  const segments = useTranscription()        // TranscriptSegment[], live
  const speaking = useSpeakingIndicators()   // Map<speakerId, speakerName>

  return (
    <div>
      {segments.map((seg, i) => (
        <p key={i}><strong>{seg.speaker}:</strong> {seg.text}</p>
      ))}
      {[...speaking.entries()].map(([id, name]) => (
        <p key={id} style={{ opacity: 0.5 }}><strong>{name}:</strong> ···</p>
      ))}
    </div>
  )
}

Or just use the pre-built panel:

tsx
import { TranscriptPanel } from '@rtcstack/ui-react'

<TranscriptPanel showSpeakerName maxItems={200} />

Vue 3 composables

vue
<script setup lang="ts">
import { useTranscription, useSpeakingIndicators } from '@rtcstack/ui-vue'

const segments = useTranscription()       // Ref<TranscriptSegment[]>
const speaking = useSpeakingIndicators()  // Ref<Map<speakerId, speakerName>>
</script>

Or the pre-built component:

vue
<TranscriptPanel :show-speaker-name="true" :max-items="200" />

Vanilla JS

typescript
import { mountVideoConference } from '@rtcstack/ui-vanilla'

mountVideoConference(el, { call, showTranscript: true })

// Or wire up events directly
call.on('speakingStarted', (id, name) => { /* show indicator */ })
call.on('transcriptReceived', (seg) => { /* append text */ })

Fetch transcript via API

Poll while the session is active, or retrieve the full transcript after stopping:

http
GET /v1/rooms/:roomId/transcription

Response:

json
{
  "transcriptionId": "tr_abc123",
  "status": "active",
  "language": "en",
  "segments": [
    {
      "startMs": 1500,
      "endMs": 4200,
      "speakerName": "Alice",
      "speakerId": "PA_abc",
      "text": "Can everyone hear me?"
    },
    {
      "startMs": 5100,
      "endMs": 8300,
      "speakerName": "Bob",
      "speakerId": "PA_xyz",
      "text": "Yes, loud and clear."
    }
  ]
}
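For display or export, the segments array can be rendered offline. A small sketch where `ApiSegment` mirrors the response fields above and `mmss` is a hypothetical helper:

```typescript
// Mirrors the segment fields in the GET response above.
interface ApiSegment {
  startMs: number
  endMs: number
  speakerName: string
  speakerId: string
  text: string
}

// Format a millisecond offset as "mm:ss".
function mmss(ms: number): string {
  const totalSeconds = Math.floor(ms / 1000)
  const m = Math.floor(totalSeconds / 60)
  const s = totalSeconds % 60
  return `${String(m).padStart(2, '0')}:${String(s).padStart(2, '0')}`
}

// Render the transcript response as plain text, one line per segment.
function renderTranscript(segments: ApiSegment[]): string {
  return segments
    .map((seg) => `[${mmss(seg.startMs)}] ${seg.speakerName}: ${seg.text}`)
    .join('\n')
}
```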

Tuning latency

Perceived delay = PAUSE_THRESHOLD_SECONDS + Whisper processing time.

| PAUSE_THRESHOLD_SECONDS | Feel | Trade-off |
| --- | --- | --- |
| 0.8s | Very snappy | More partial chunks, lower accuracy |
| 1.5s | Default; good balance | |
| 3.0s | Waits for natural pauses | Better accuracy on long sentences |

For Whisper processing time, see Model Selection.
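As a back-of-the-envelope sketch of the formula above, assuming you have measured Whisper's processing time yourself (the stack does not report it):

```typescript
// Perceived delay = pause threshold + Whisper processing time.
// whisperMs is a measured value for your model and hardware.
function estimatedDelayMs(pauseThresholdSeconds: number, whisperMs: number): number {
  return pauseThresholdSeconds * 1000 + whisperMs
}

// With the default 1.5s threshold and a measured 800ms Whisper pass:
// estimatedDelayMs(1.5, 800) -> 2300
```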

Environment variables

| Variable | Default | Description |
| --- | --- | --- |
| TRANSCRIPTION_LIVE_ENABLED | false | Enable live transcription endpoints |
| PAUSE_THRESHOLD_SECONDS | 1.5 | Silence duration before flushing audio to Whisper |
| SHORT_PAUSE_SECONDS | 0.3 | Very short pause: only flush if previous chunk ended with punctuation |
| MAX_CHUNK_SECONDS | 30.0 | Force-flush audio after this many seconds regardless of silence |
| SPEECH_RMS_THRESHOLD | 200 | RMS energy threshold for speech detection |
| WHISPER_MAX_CONCURRENT | 2 | Max parallel Whisper requests per room |
| STT_LANGUAGE | en | ISO 639-1 language code, or auto |
| WHISPER_MODEL | base | Whisper model size |
| WHISPER_URL | http://whisper:8080 | Whisper service URL |