Streaming ASR over WebSocket

Streaming ASR is intended for voice agents, IVR, and live captioning. You open a WebSocket, send audio frames as they arrive, and receive partial and final transcripts in real time.

Endpoint

shell
wss://asr.shunyalabs.ai/ws

Connection lifecycle

Step 1: Open the connection & send init

The first message after connecting is a JSON config that authenticates and sets audio parameters. You can't change these mid-stream.

json
{
  "api_key": "sk-your-api-key",
  "model": "zero-indic",
  "language": "hi",
  "sample_rate": 8000,
  "dtype": "ulaw",
  "chunk_size_sec": 2.0,
  "silence_threshold_sec": 0.8
}

Init fields

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| api_key | string | required | Bearer token |
| session_id | string | auto | Custom session ID for tracking |
| model | string | zero-indic | ASR model |
| language | string | auto | ISO code or full name |
| sample_rate | int | 16000 | Audio sample rate (Hz) |
| dtype | string | float32 | ulaw, alaw, int16, float32 |
| chunk_size_sec | float | 2.0 | Audio buffer size before returning results |
| silence_threshold_sec | float | 0.8 | Silence duration before finalizing a segment |
| inactivity_timeout | float | 300 | Max idle time before timeout |
| max_connection_duration | float | 3600 | Max total connection time |
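
As a rough illustration of the init fields above, the init message could be built and checked on the client before it is sent. This is a minimal sketch: `build_init_message` is a hypothetical helper, not part of any SDK, and it assumes that fields left out of the JSON fall back to the defaults listed in the table.

```python
import json

# Values taken from the dtype row of the init fields table.
ALLOWED_DTYPES = {"ulaw", "alaw", "int16", "float32"}


def build_init_message(api_key, *, model="zero-indic", language=None,
                       sample_rate=16000, dtype="float32", **extra):
    """Return the JSON text to send as the first WebSocket message.

    Hypothetical helper: it only validates what the table states.
    """
    if not api_key:
        raise ValueError("api_key is required")
    if dtype not in ALLOWED_DTYPES:
        raise ValueError(f"unsupported dtype: {dtype!r}")
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")

    config = {
        "api_key": api_key,
        "model": model,
        "sample_rate": sample_rate,
        "dtype": dtype,
    }
    if language is not None:
        # Assumption: omitting language leaves the server default ("auto").
        config["language"] = language
    # Optional fields such as chunk_size_sec or silence_threshold_sec.
    config.update(extra)
    return json.dumps(config)
```

For the telephony example above, the call would be `build_init_message(key, language="hi", sample_rate=8000, dtype="ulaw", chunk_size_sec=2.0, silence_threshold_sec=0.8)`.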

Step 2: Stream audio frames

Send raw binary audio frames. The encoding must match dtype:

| dtype | Description | Use for |
| --- | --- | --- |
| ulaw | G.711 μ-law, 8-bit | Telephony (8 kHz), SIP |
| alaw | G.711 A-law, 8-bit | European telephony |
| int16 | 16-bit signed PCM | General recording |
| float32 | 32-bit IEEE float | Pre-processed audio |

Note: WAV headers are auto-detected
If your audio has a WAV header, the sample rate and dtype are extracted automatically from the first binary frame, so you don't need to set them in the init message.
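
Because the frame encoding has to match dtype, it can help to work out how many bytes one frame should contain. The sketch below derives frame sizes from the bit widths in the table. The 20 ms frame duration and the mono assumption are illustrative choices, not requirements stated by the API, and both helper names are hypothetical.

```python
# Bytes per sample, from the bit widths in the dtype table.
BYTES_PER_SAMPLE = {"ulaw": 1, "alaw": 1, "int16": 2, "float32": 4}


def frame_bytes(sample_rate, dtype, frame_ms=20):
    """Bytes in one mono frame lasting frame_ms milliseconds."""
    samples = sample_rate * frame_ms // 1000
    return samples * BYTES_PER_SAMPLE[dtype]


def iter_frames(audio, sample_rate, dtype, frame_ms=20):
    """Split a bytes buffer of raw audio into frames of at most frame_bytes()."""
    size = frame_bytes(sample_rate, dtype, frame_ms)
    for start in range(0, len(audio), size):
        yield audio[start:start + size]
```

At 8 kHz μ-law a 20 ms frame is 160 bytes; at 16 kHz int16 it is 640 bytes. The generator never yields an empty frame, which matters because an empty binary frame is one of the end-of-audio signals.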

Step 3: Handle server events

Event table

| Event | When | Key fields |
| --- | --- | --- |
| ready | Connection accepted | session_id |
| speech_start | Speech onset detected | offset_ms, segment_id |
| partial | Interim transcription (best-guess, not final) | text, language, segment_id, offset_ms, duration_ms, is_final: false |
| speech_end | Silence detected | offset_ms, segment_id |
| final_segment | Segment finalized | text, language, segment_id, offset_ms, status, is_final: true |
| end_of_transcript | All segments delivered | total_segments, total_audio_duration_sec |
| done | Session complete | total_segments, total_audio_duration_sec |
| error | Error occurred | message, code |
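
One way to consume these events is to keep the latest partial per segment and replace it once the matching final_segment arrives. The sketch below assumes the event name is carried in a `type` field (as in the full examples) and that final segments arrive in order. `TranscriptAssembler` is a hypothetical name, not part of the API.

```python
import json


class TranscriptAssembler:
    """Collects server events into a running transcript (illustrative only)."""

    def __init__(self):
        self.partials = {}  # segment_id -> latest interim text
        self.finals = {}    # segment_id -> finalized text, in arrival order
        self.done = False

    def handle(self, raw):
        """Process one JSON text message and return its event type."""
        msg = json.loads(raw)
        kind = msg.get("type")
        if kind == "partial":
            # Interim text: overwrite the previous guess for this segment.
            self.partials[msg["segment_id"]] = msg["text"]
        elif kind == "final_segment":
            # Final text supersedes any partial for the same segment.
            self.partials.pop(msg["segment_id"], None)
            self.finals[msg["segment_id"]] = msg["text"]
        elif kind == "done":
            self.done = True
        elif kind == "error":
            raise RuntimeError(f"{msg.get('code')}: {msg.get('message')}")
        return kind

    def text(self):
        """Finalized transcript so far, joined in arrival order."""
        return " ".join(self.finals.values())
```

Events that carry no text (ready, speech_start, speech_end, end_of_transcript) pass through untouched, so the caller can still react to the returned event type.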

End-of-audio signals

Three equivalent ways to tell the server you're done sending:

| Method | Format |
| --- | --- |
| Text message | "END" or "END_OF_AUDIO" |
| JSON message | {"type": "end"} |
| Empty binary frame | Buffer.alloc(0) |
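
For Python clients, the three signals above might be produced as follows. With the `websockets` library, `send()` transmits a `str` as a text frame and `bytes` as a binary frame, so the payload type selects the frame type. The helper name is hypothetical, and sending the JSON form as a text message is an assumption.

```python
import json


def end_of_audio_payload(method="text"):
    """Return the payload for one of the three documented end-of-audio signals."""
    if method == "text":
        return "END"  # "END_OF_AUDIO" is the documented alternative
    if method == "json":
        return json.dumps({"type": "end"})  # assumed to go out as a text message
    if method == "binary":
        return b""  # empty binary frame, the Python analogue of Buffer.alloc(0)
    raise ValueError(f"unknown method: {method!r}")
```

Inside a sender coroutine this would be used as `await ws.send(end_of_audio_payload("binary"))`.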

Error codes

| Code | Description |
| --- | --- |
| AUTH_FAILED | Invalid API key (WS close 4001) |
| TIMEOUT | Inactivity or max connection duration exceeded |
| CAPACITY_FULL | Server at max concurrent sessions |
| PROTOCOL_ERROR | Invalid init config or message format |
| INTERNAL_ERROR | Unexpected server error |
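
A client will usually want to decide which of these codes are worth a reconnect. The sketch below is one possible policy, not documented behavior: treating CAPACITY_FULL, TIMEOUT, and INTERNAL_ERROR as retryable is an assumption, and the backoff parameters are arbitrary. AUTH_FAILED and PROTOCOL_ERROR are treated as permanent because retrying with the same key or config would presumably fail again.

```python
import random

# Assumption: which codes are transient is NOT specified by the API docs.
RETRYABLE_CODES = {"CAPACITY_FULL", "TIMEOUT", "INTERNAL_ERROR"}


def retry_delay(code, attempt, base=0.5, cap=30.0):
    """Seconds to wait before reconnecting, or None if retrying is pointless.

    Uses exponential backoff with full jitter: a random delay between 0 and
    min(cap, base * 2**attempt), where attempt starts at 0.
    """
    if code not in RETRYABLE_CODES:
        return None
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)
```

The jitter spreads out reconnect attempts so that many clients rejected with CAPACITY_FULL at the same moment do not all return at once.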

Full example

node
import WebSocket from "ws";

const ws = new WebSocket("wss://asr.shunyalabs.ai/ws");

ws.on("open", () => {
  // The init config must be the first message on the connection.
  ws.send(JSON.stringify({
    api_key: process.env.SHUNYALABS_API_KEY,
    model: "zero-indic",
    language: "hi",
    sample_rate: 8000,
    dtype: "ulaw",
  }));

  // Start streaming only once the socket is open; ws.send() throws while
  // the socket is still connecting.
  // audioSource is a placeholder for your own readable stream of raw audio
  // (e.g. telephony media, mic, file); it is not defined in this snippet.
  audioSource.on("data", (chunk) => ws.send(chunk));
  audioSource.on("end",  () => ws.send("END"));
});

ws.on("message", (raw) => {
  const msg = JSON.parse(raw.toString());
  switch (msg.type) {
    case "ready":         console.log("Session:", msg.session_id); break;
    case "speech_start":  console.log("Speech start @", msg.offset_ms); break;
    case "partial":       console.log("Partial:", msg.text); break;
    case "final_segment": console.log("Final:", msg.text); break;
    case "done":          console.log("Done:", msg.total_segments, "segments"); ws.close(); break;
    case "error":         console.error("Error:", msg.code, msg.message); break;
  }
});

python
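import asyncio
import json
import os

import websockets

async def transcribe_stream(audio_iter):
    """Stream frames from an async iterator of bytes and print transcripts."""
    async with websockets.connect("wss://asr.shunyalabs.ai/ws") as ws:
        # The init config must be the first message on the connection.
        await ws.send(json.dumps({
            "api_key": os.environ["SHUNYALABS_API_KEY"],
            "model": "zero-indic",
            "language": "hi",
            "sample_rate": 16000,
            "dtype": "int16",
        }))

        async def sender():
            # Forward raw binary frames, then signal end of audio.
            async for frame in audio_iter:
                await ws.send(frame)
            await ws.send("END")

        async def receiver():
            async for raw in ws:
                msg = json.loads(raw)
                if msg["type"] == "partial":
                    print(" partial:", msg["text"])
                elif msg["type"] == "final_segment":
                    print(" final:  ", msg["text"])
                elif msg["type"] == "error":
                    print(" error:  ", msg["code"], msg["message"])
                    return
                elif msg["type"] == "done":
                    return

        # Send and receive concurrently so transcripts arrive while audio streams.
        await asyncio.gather(sender(), receiver())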
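
The Python example expects `audio_iter` to be an async iterator of byte frames. A minimal sketch of one such iterator, for audio already held in memory, is shown below. `paced_frames` is a hypothetical helper, and pacing frames at roughly real-time speed is an assumption made to mimic a live source, not a documented requirement.

```python
import asyncio


async def paced_frames(audio, frame_size, frame_sec=0.02):
    """Yield fixed-size frames from a bytes buffer, pausing between frames.

    frame_size is in bytes, e.g. 640 for 20 ms of 16 kHz int16 mono audio.
    """
    for start in range(0, len(audio), frame_size):
        yield audio[start:start + frame_size]
        # Sleep for one frame duration to approximate a live audio source.
        await asyncio.sleep(frame_sec)
```

With raw 16 kHz int16 audio in a variable `pcm`, the example above could then be run as `asyncio.run(transcribe_stream(paced_frames(pcm, 640)))`.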