# Live Transcription
Per-speaker, real-time transcription during a call. The stt-live-agent joins each room as a Python LiveKit agent, subscribes to all audio tracks, and sends chunks to Whisper. Results are published back into the room via LiveKit's data channel and surfaced through SDK events.
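Because results arrive as short per-speaker chunks, UIs often coalesce consecutive segments from the same speaker before rendering. A minimal sketch, assuming the `TranscriptSegment` fields shown in the examples on this page (`speakerId`, `speaker`, `text`, `startMs`); the helper itself is illustrative, not part of the SDK:

```typescript
// Shape assumed from the examples on this page.
interface TranscriptSegment {
  speakerId: string
  speaker: string
  text: string
  startMs: number
}

// Coalesce consecutive segments from the same speaker into one display line.
function mergeBySpeaker(segments: TranscriptSegment[]): TranscriptSegment[] {
  const merged: TranscriptSegment[] = []
  for (const seg of segments) {
    const last = merged[merged.length - 1]
    if (last && last.speakerId === seg.speakerId) {
      last.text = `${last.text} ${seg.text}`
    } else {
      merged.push({ ...seg })
    }
  }
  return merged
}
```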
## Enable

```bash
# docker/.env
TRANSCRIPTION_LIVE_ENABLED=true
WHISPER_MODEL=base            # base is the default; use small/medium for better accuracy
PAUSE_THRESHOLD_SECONDS=1.5   # flush audio after 1.5s of silence
```

```bash
docker compose --profile stt-live up -d
```

## Start a session
```http
POST /v1/rooms/:roomId/transcription/start
Content-Type: application/json

{
  "language": "en"
}
```

Response:

```json
{ "transcriptionId": "tr_abc123", "status": "active" }
```

Or from the SDK (requires `apiUrl` + `roomName` in `CallOptions`):
```typescript
const call = createCall({
  url: 'wss://...',
  token: '...',
  roomName: 'my-room',
  apiUrl: 'https://api.yourapp.com',
})

await call.connect()
await call.startTranscription()
```

## Stop a session
```http
POST /v1/rooms/:roomId/transcription/stop
```

Or:

```typescript
await call.stopTranscription()
```

## SDK integration
### Events
```typescript
import { createCall } from '@rtcstack/sdk'
import type { TranscriptSegment } from '@rtcstack/sdk'

call.on('speakingStarted', (speakerId: string, speakerName: string) => {
  // Speaker is talking; no text yet. Show a typing indicator.
  console.log(speakerName, 'is speaking...')
})

call.on('speakingStopped', (speakerId: string) => {
  // Clear the speaking indicator here.
  // This fires when the transcript arrives, not when the mic goes silent.
})

call.on('transcriptReceived', (segment: TranscriptSegment) => {
  console.log(`[${segment.speaker}] ${segment.text}`)
  // segment.speakerId → participant identity
  // segment.timestamp → Date when the event was received
  // segment.startMs   → millisecond offset within the audio chunk
})
```

### React hooks
```tsx
import { useTranscription, useSpeakingIndicators } from '@rtcstack/ui-react'

function MyTranscript() {
  const segments = useTranscription()       // TranscriptSegment[], live
  const speaking = useSpeakingIndicators()  // Map<speakerId, speakerName>
  return (
    <div>
      {segments.map((seg, i) => (
        <p key={i}><strong>{seg.speaker}:</strong> {seg.text}</p>
      ))}
      {[...speaking.entries()].map(([id, name]) => (
        <p key={id} style={{ opacity: 0.5 }}><strong>{name}:</strong> ···</p>
      ))}
    </div>
  )
}
```

Or just use the pre-built panel:
```tsx
import { TranscriptPanel } from '@rtcstack/ui-react'

<TranscriptPanel showSpeakerName maxItems={200} />
```

### Vue 3 composables
```vue
<script setup lang="ts">
import { useTranscription, useSpeakingIndicators } from '@rtcstack/ui-vue'

const segments = useTranscription()       // Ref<TranscriptSegment[]>
const speaking = useSpeakingIndicators()  // Ref<Map<speakerId, speakerName>>
</script>
```

Or the pre-built component:

```vue
<TranscriptPanel :show-speaker-name="true" :max-items="200" />
```

### Vanilla JS
```typescript
import { mountVideoConference } from '@rtcstack/ui-vanilla'

mountVideoConference(el, { call, showTranscript: true })

// Or wire up events directly
call.on('speakingStarted', (id, name) => { /* show indicator */ })
call.on('transcriptReceived', (seg) => { /* append text */ })
```

## Fetch transcript via API
Poll while the session is active, or retrieve the full transcript after stopping:
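A sketch of an incremental poller built on the endpoint below. The `Bearer` auth header and the helper names are assumptions (adjust to your deployment); the response shape follows the example response on this page:

```typescript
interface Segment {
  startMs: number
  endMs: number
  speakerName: string
  speakerId: string
  text: string
}

interface TranscriptionResponse {
  transcriptionId: string
  status: string
  language: string
  segments: Segment[]
}

// Fetch the current transcript. Auth scheme is an assumption, not part of the API spec.
async function fetchTranscript(apiUrl: string, roomId: string, token: string): Promise<TranscriptionResponse> {
  const res = await fetch(`${apiUrl}/v1/rooms/${roomId}/transcription`, {
    headers: { Authorization: `Bearer ${token}` },
  })
  if (!res.ok) throw new Error(`transcript fetch failed: ${res.status}`)
  return res.json()
}

// When polling, keep only segments that start at or after the last end we have seen.
function newSegments(body: TranscriptionResponse, lastEndMs: number): Segment[] {
  return body.segments.filter((s) => s.startMs >= lastEndMs)
}
```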
```http
GET /v1/rooms/:roomId/transcription
```

```json
{
  "transcriptionId": "tr_abc123",
  "status": "active",
  "language": "en",
  "segments": [
    {
      "startMs": 1500,
      "endMs": 4200,
      "speakerName": "Alice",
      "speakerId": "PA_abc",
      "text": "Can everyone hear me?"
    },
    {
      "startMs": 5100,
      "endMs": 8300,
      "speakerName": "Bob",
      "speakerId": "PA_xyz",
      "text": "Yes, loud and clear."
    }
  ]
}
```

## Tuning latency
Perceived delay = `PAUSE_THRESHOLD_SECONDS` + Whisper processing time.
| `PAUSE_THRESHOLD_SECONDS` | Feel | Trade-off |
|---|---|---|
| 0.8s | Very snappy | More partial chunks, lower accuracy |
| 1.5s | Default; good balance | |
| 3.0s | Waits for natural pauses | Better accuracy on long sentences |
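As a quick sanity check, the formula above can be evaluated directly; the Whisper processing time used here (600 ms) is only a placeholder, since it depends on model size and hardware:

```typescript
// Perceived delay = silence threshold + Whisper processing time.
// whisperMs is model- and hardware-dependent; 600 ms is a placeholder.
function perceivedDelayMs(pauseThresholdSeconds: number, whisperMs: number): number {
  return pauseThresholdSeconds * 1000 + whisperMs
}

perceivedDelayMs(1.5, 600) // 2100 ms with the default threshold
```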
For Whisper processing time, see Model Selection.
## Environment variables
| Variable | Default | Description |
|---|---|---|
| `TRANSCRIPTION_LIVE_ENABLED` | `false` | Enable live transcription endpoints |
| `PAUSE_THRESHOLD_SECONDS` | `1.5` | Silence duration before flushing audio to Whisper |
| `SHORT_PAUSE_SECONDS` | `0.3` | Very short pause: only flush if the previous chunk ended with punctuation |
| `MAX_CHUNK_SECONDS` | `30.0` | Force-flush audio after this many seconds, regardless of silence |
| `SPEECH_RMS_THRESHOLD` | `200` | RMS energy threshold for speech detection |
| `WHISPER_MAX_CONCURRENT` | `2` | Max parallel Whisper requests per room |
| `STT_LANGUAGE` | `en` | ISO 639-1 language code, or `auto` |
| `WHISPER_MODEL` | `base` | Whisper model size |
| `WHISPER_URL` | `http://whisper:8080` | Whisper service URL |

