Streaming ASR over WebSocket
For voice agents, IVR, and live captioning. You open a WebSocket, send audio frames as they arrive, and receive partial and final transcripts in real time.
Endpoint
```shell
wss://asr.shunyalabs.ai/ws
```
Connection lifecycle
Step 1: Open the connection & send init
The first message after connecting is a JSON config that authenticates and sets audio parameters. You can't change these mid-stream.
```json
{
  "api_key": "sk-your-api-key",
  "model": "zero-indic",
  "language": "hi",
  "sample_rate": 8000,
  "dtype": "ulaw",
  "chunk_size_sec": 2.0,
  "silence_threshold_sec": 0.8
}
```
Init fields
| Field | Type | Default | Description |
|---|---|---|---|
| api_key | string | required | Bearer token |
| session_id | string | auto | Custom session ID for tracking |
| model | string | zero-indic | ASR model |
| language | string | auto | ISO code or full name |
| sample_rate | int | 16000 | Audio sample rate (Hz) |
| dtype | string | float32 | ulaw, alaw, int16, float32 |
| chunk_size_sec | float | 2.0 | Audio buffer size before returning results |
| silence_threshold_sec | float | 0.8 | Silence duration before finalizing a segment |
| inactivity_timeout | float | 300 | Max idle time before timeout |
| max_connection_duration | float | 3600 | Max total connection time |
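A small helper can build the init message and catch obvious mistakes before the socket opens. The following is a minimal sketch, not part of the API: `build_init` and `ALLOWED_DTYPES` are hypothetical names, the defaults are copied from the table above, and it assumes that omitting `language` falls back to the documented `auto` default. The server may validate differently.

```python
import json

# Encodings listed in the init-field table above.
ALLOWED_DTYPES = {"ulaw", "alaw", "int16", "float32"}


def build_init(api_key, *, model="zero-indic", language=None,
               sample_rate=16000, dtype="float32",
               chunk_size_sec=2.0, silence_threshold_sec=0.8):
    """Return the JSON text for the first WebSocket message (hypothetical helper)."""
    if not api_key:
        raise ValueError("api_key is required")
    if dtype not in ALLOWED_DTYPES:
        raise ValueError(f"unsupported dtype: {dtype!r}")
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    config = {
        "api_key": api_key,
        "model": model,
        "sample_rate": sample_rate,
        "dtype": dtype,
        "chunk_size_sec": chunk_size_sec,
        "silence_threshold_sec": silence_threshold_sec,
    }
    # Assumption: leaving `language` out selects the documented "auto" default.
    if language is not None:
        config["language"] = language
    return json.dumps(config)
```

For the telephony config shown earlier, the call would be `build_init(key, language="hi", sample_rate=8000, dtype="ulaw")`, sent as the first message on the connection.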
Step 2: Stream audio frames
Send raw binary audio frames. The encoding must match dtype:
| dtype | Description | Use for |
|---|---|---|
| ulaw | G.711 μ-law, 8-bit | Telephony (8 kHz), SIP |
| alaw | G.711 A-law, 8-bit | European telephony |
| int16 | 16-bit signed PCM | General recording |
| float32 | 32-bit IEEE float | Pre-processed audio |
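The sample widths in the table determine how many bytes each frame carries. The sketch below works out frame sizes and slices a buffer accordingly. It assumes mono audio and a caller-chosen frame duration, since this page does not specify a channel count or a required frame size; `frame_bytes` and `split_frames` are illustrative names.

```python
# Bytes per sample for each dtype in the table above.
BYTES_PER_SAMPLE = {"ulaw": 1, "alaw": 1, "int16": 2, "float32": 4}


def frame_bytes(sample_rate, dtype, frame_ms, channels=1):
    """Size in bytes of one audio frame lasting `frame_ms` milliseconds."""
    samples = sample_rate * frame_ms // 1000
    return samples * BYTES_PER_SAMPLE[dtype] * channels


def split_frames(audio, size):
    """Yield consecutive `size`-byte slices of `audio`; the last may be shorter."""
    for start in range(0, len(audio), size):
        yield audio[start:start + size]
```

For example, 20 ms of 8 kHz μ-law is 160 bytes, while 20 ms of 16 kHz int16 is 640 bytes.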
WAV headers are auto-detected
If your audio has a WAV header, the sample rate and dtype are extracted automatically from the first binary frame, so you don't need to set them in the init.
Step 3: Handle server events
Event table
| Event | When | Key fields |
|---|---|---|
| ready | Connection accepted | session_id |
| speech_start | Speech onset detected | offset_ms, segment_id |
| partial | Interim transcription (best-guess, not final) | text, language, segment_id, offset_ms, duration_ms, is_final: false |
| speech_end | Silence detected | offset_ms, segment_id |
| final_segment | Segment finalized | text, language, segment_id, offset_ms, status, is_final: true |
| end_of_transcript | All segments delivered | total_segments, total_audio_duration_sec |
| done | Session complete | total_segments, total_audio_duration_sec |
| error | Error occurred | message, code |
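Because partials are interim and only `final_segment` events are authoritative, a client typically overwrites the latest partial and appends finals. The class below is a rough sketch of that bookkeeping. It assumes every event arrives as a JSON text message with a `type` field (as in the examples on this page) and that `segment_id` uniquely identifies a segment; `TranscriptAssembler` is a hypothetical name.

```python
import json


class TranscriptAssembler:
    """Collects finalized segments and tracks the latest partial (illustrative)."""

    def __init__(self):
        self.finals = {}   # segment_id -> finalized text
        self.partial = ""  # latest interim text, replaced on every update
        self.done = False

    def handle(self, raw):
        """Apply one server message and return its event type."""
        msg = json.loads(raw)
        kind = msg.get("type")
        if kind == "partial":
            self.partial = msg.get("text", "")
        elif kind == "final_segment":
            self.finals[msg["segment_id"]] = msg.get("text", "")
            self.partial = ""  # the interim guess is superseded
        elif kind == "done":
            self.done = True
        elif kind == "error":
            raise RuntimeError(f"{msg.get('code')}: {msg.get('message')}")
        return kind

    def text(self):
        # Segments are joined in arrival order (dicts preserve insertion order).
        return " ".join(self.finals.values())
```

Other events (`ready`, `speech_start`, `speech_end`, `end_of_transcript`) pass through unchanged, so the caller can still react to the returned event type.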
End-of-audio signals
Three equivalent ways to tell the server you're done sending:
| Method | Format |
|---|---|
| Text message | "END" or "END_OF_AUDIO" |
| JSON message | {"type": "end"} |
| Empty binary frame | Buffer.alloc(0) |
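In Python, the three signals in the table map onto plain values. The sketch below assumes a client such as the `websockets` library, which sends `str` as a text frame and `bytes` as a binary frame; `end_signal` is a hypothetical helper name.

```python
import json

# One payload per row of the table above.
END_SIGNALS = {
    "text": "END",                        # plain text message
    "json": json.dumps({"type": "end"}),  # JSON message, sent as text
    "binary": b"",                        # empty binary frame
}


def end_signal(method="text"):
    """Return the payload that marks end of audio for the chosen method."""
    return END_SIGNALS[method]
```

With an open connection, `await ws.send(end_signal("binary"))` is the Python counterpart of sending `Buffer.alloc(0)` from Node.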
Error codes
| Code | Description |
|---|---|
| AUTH_FAILED | Invalid API key (WS close 4001) |
| TIMEOUT | Inactivity or max connection duration exceeded |
| CAPACITY_FULL | Server at max concurrent sessions |
| PROTOCOL_ERROR | Invalid init config or message format |
| INTERNAL_ERROR | Unexpected server error |
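A client usually has to decide which of these codes justify a reconnect. The policy below is an assumption for illustration, not something this page specifies: it treats `CAPACITY_FULL`, `INTERNAL_ERROR`, and `TIMEOUT` as retryable and treats `AUTH_FAILED` and `PROTOCOL_ERROR` as permanent until the key or config is fixed. `should_retry` and `backoff_delay` are hypothetical names.

```python
import random

# Assumed policy (not from the API docs): codes that may succeed on a later attempt.
RETRYABLE = {"CAPACITY_FULL", "INTERNAL_ERROR", "TIMEOUT"}


def should_retry(code):
    """True if reconnecting could plausibly help for this error code."""
    return code in RETRYABLE


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random.random):
    """Exponential backoff with full jitter; `attempt` starts at 0."""
    return min(cap, base * 2 ** attempt) * rng()
```

The jitter spreads reconnects out in time, which matters most for `CAPACITY_FULL`, where many clients may be retrying at once.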
Full example
```javascript
import WebSocket from "ws";

const ws = new WebSocket("wss://asr.shunyalabs.ai/ws");

ws.on("open", () => {
  // The init config must be the first message on the connection.
  ws.send(JSON.stringify({
    api_key: process.env.SHUNYALABS_API_KEY,
    language: "hi",
    sample_rate: 8000,
    dtype: "ulaw",
    model: "zero-indic",
  }));
});

ws.on("message", (raw) => {
  const msg = JSON.parse(raw.toString());
  switch (msg.type) {
    case "ready":
      console.log("Session:", msg.session_id);
      // Start streaming only after the session is accepted; calling
      // ws.send() before the socket is open would throw.
      // `audioSource` is a placeholder for your own Readable stream
      // (e.g. telephony media, mic, file).
      audioSource.on("data", (chunk) => ws.send(chunk));
      audioSource.on("end", () => ws.send("END"));
      break;
    case "speech_start": console.log("Speech start @", msg.offset_ms); break;
    case "partial": console.log("Partial:", msg.text); break;
    case "final_segment": console.log("Final:", msg.text); break;
    case "done": console.log("Done:", msg.total_segments, "segments"); break;
    case "error": console.error("Error:", msg.code, msg.message); break;
  }
});
```
```python
import asyncio, json, os, websockets


async def transcribe_stream(audio_iter):
    """Stream frames from an async iterator of bytes and print transcripts."""
    async with websockets.connect("wss://asr.shunyalabs.ai/ws") as ws:
        # The init config must be the first message on the connection.
        await ws.send(json.dumps({
            "api_key": os.environ["SHUNYALABS_API_KEY"],
            "model": "zero-indic",
            "language": "hi",
            "sample_rate": 16000,
            "dtype": "int16",
        }))

        async def sender():
            async for frame in audio_iter:
                await ws.send(frame)
            await ws.send("END")  # end-of-audio signal

        async def receiver():
            async for raw in ws:
                msg = json.loads(raw)
                if msg["type"] == "partial":
                    print(" partial:", msg["text"])
                elif msg["type"] == "final_segment":
                    print(" final: ", msg["text"])
                elif msg["type"] == "error":
                    raise RuntimeError(f"{msg['code']}: {msg['message']}")
                elif msg["type"] == "done":
                    return

        # Send and receive concurrently so partials arrive while audio streams.
        await asyncio.gather(sender(), receiver())
```
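`transcribe_stream` expects an async iterator of byte frames. One possible source is sketched below: an async generator that reads a headerless audio file in fixed-size frames. The name `file_frames`, the 640-byte default (20 ms of 16 kHz int16 mono, matching the init in the Python example), and the optional pacing delay are all assumptions for illustration; the file read itself is blocking, which is acceptable for a sketch but not for a busy event loop.

```python
import asyncio


async def file_frames(path, frame_size=640, pace_sec=0.0):
    """Yield raw audio from `path` in `frame_size`-byte frames (last may be shorter)."""
    with open(path, "rb") as f:
        while True:
            frame = f.read(frame_size)
            if not frame:
                break
            yield frame
            # Optional delay to approximate real-time pacing; 0 yields control only.
            await asyncio.sleep(pace_sec)
```

It could then be passed straight in, for example `asyncio.run(transcribe_stream(file_frames("call.raw")))`, where `call.raw` is a hypothetical file of 16 kHz int16 PCM.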