Streaming ASR over WebSocket

Use streaming ASR for voice agents, IVR, and live captioning. You open a WebSocket, send audio frames as they arrive, and receive partial and final transcripts in real time.

Endpoint

wss://asr.shunyalabs.ai/ws

Connection lifecycle

Step 1: Open the connection & send init

The first message after connecting is a JSON config that authenticates and sets audio parameters. You can't change these mid-stream.

{
  "api_key": "sk-your-api-key",
  "model": "zero-indic",
  "language": "hi",
  "sample_rate": 8000,
  "dtype": "ulaw",
  "chunk_size_sec": 2.0,
  "silence_threshold_sec": 0.8
}

Init fields

Field	Type	Default	Description
`api_key`	string	required	Bearer token
`session_id`	string	auto	Custom session ID for tracking
`model`	string	`zero-indic`	ASR model
`language`	string	auto	ISO code or full name
`sample_rate`	int	16000	Audio sample rate (Hz)
`dtype`	string	`float32`	`ulaw`, `alaw`, `int16`, `float32`
`chunk_size_sec`	float	2.0	Audio buffer size, in seconds, before results are returned
`silence_threshold_sec`	float	0.8	Silence duration, in seconds, before a segment is finalized
`inactivity_timeout`	float	300	Max idle time before timeout
`max_connection_duration`	float	3600	Max total connection time
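
The init fields above can be assembled and sanity-checked on the client before connecting. The following is a minimal sketch: `build_init_message` is a hypothetical helper, not part of any SDK, and the only validation it does is against the `dtype` values listed in the table. Optional fields are left out when unset, on the assumption that the server then applies its documented default.

```python
import json

# dtype values copied from the init fields table.
ALLOWED_DTYPES = {"ulaw", "alaw", "int16", "float32"}

def build_init_message(api_key, *, model="zero-indic", language=None,
                       sample_rate=16000, dtype="float32",
                       chunk_size_sec=2.0, silence_threshold_sec=0.8,
                       session_id=None):
    """Return the JSON text to send as the first WebSocket message."""
    if not api_key:
        raise ValueError("api_key is required")
    if dtype not in ALLOWED_DTYPES:
        raise ValueError(f"unsupported dtype: {dtype!r}")
    config = {
        "api_key": api_key,
        "model": model,
        "sample_rate": sample_rate,
        "dtype": dtype,
        "chunk_size_sec": chunk_size_sec,
        "silence_threshold_sec": silence_threshold_sec,
    }
    # Leave optional fields out entirely when the caller did not set them.
    if language is not None:
        config["language"] = language
    if session_id is not None:
        config["session_id"] = session_id
    return json.dumps(config)
```

For a telephony stream you would call it as `build_init_message(key, language="hi", sample_rate=8000, dtype="ulaw")` and send the result as the first text message.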

Step 2: Stream audio frames

Send raw binary audio frames. The encoding must match the `dtype` you set in the init message:

`dtype`	Description	Use for
`ulaw`	G.711 μ-law, 8-bit	Telephony (8 kHz), SIP
`alaw`	G.711 A-law, 8-bit	European telephony
`int16`	16-bit signed PCM	General recording
`float32`	32-bit IEEE float	Pre-processed audio
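
The sample widths in the table determine how many bytes each frame carries. As a rough sketch, assuming mono audio (the channel count is not stated here) and a frame duration of your choosing, the arithmetic looks like this. `frame_bytes` is a hypothetical helper name.

```python
# Bytes per sample, derived from the bit widths in the dtype table.
BYTES_PER_SAMPLE = {"ulaw": 1, "alaw": 1, "int16": 2, "float32": 4}

def frame_bytes(sample_rate, dtype, frame_ms):
    """Size in bytes of one mono frame lasting frame_ms milliseconds."""
    samples = sample_rate * frame_ms // 1000
    return samples * BYTES_PER_SAMPLE[dtype]

print(frame_bytes(8000, "ulaw", 20))      # 160
print(frame_bytes(16000, "int16", 20))    # 640
print(frame_bytes(16000, "float32", 20))  # 1280
```

Slicing your audio on multiples of the sample width avoids splitting a sample across two frames.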

WAV headers are auto-detected

If your audio has a WAV header, the sample rate and dtype are extracted automatically from the first binary frame, so you don't need to set them in the init message.
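
To see what a WAV header actually carries, you can inspect one locally with Python's standard `wave` module. This sketch only illustrates the header contents (sample rate, sample width, channel count); it makes no claim about how the server performs its detection. `make_silent_wav` and `describe_wav` are hypothetical helper names.

```python
import io
import wave

def make_silent_wav(sample_rate=16000, seconds=1):
    """Build an in-memory 16-bit mono WAV file containing silence."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)            # 2 bytes per sample = int16
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * sample_rate * seconds)
    return buf.getvalue()

def describe_wav(data):
    """Read the audio parameters stored in a WAV header."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        return {
            "sample_rate": wav.getframerate(),
            "sample_width_bytes": wav.getsampwidth(),
            "channels": wav.getnchannels(),
        }

print(describe_wav(make_silent_wav()))
# {'sample_rate': 16000, 'sample_width_bytes': 2, 'channels': 1}
```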

Step 3: Handle server events

Event table

Event	When	Key fields
`ready`	Connection accepted	`session_id`
`speech_start`	Speech onset detected	`offset_ms`, `segment_id`
`partial`	Interim transcription (best-guess, not final)	`text`, `language`, `segment_id`, `offset_ms`, `duration_ms`, `is_final: false`
`speech_end`	Silence detected	`offset_ms`, `segment_id`
`final_segment`	Segment finalized	`text`, `language`, `segment_id`, `offset_ms`, `status`, `is_final: true`
`end_of_transcript`	All segments delivered	`total_segments`, `total_audio_duration_sec`
`done`	Session complete	`total_segments`, `total_audio_duration_sec`
`error`	Error occurred	`message`, `code`
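
One way to consume these events is to keep the latest `partial` per segment for display and replace it when the matching `final_segment` arrives. The sketch below assumes the event name is carried in a `type` field, as in the full examples on this page. `TranscriptAssembler` is a hypothetical class, and joining segments with a space is an arbitrary choice.

```python
class TranscriptAssembler:
    """Collects partial and final text keyed by segment_id."""

    def __init__(self):
        self.partials = {}   # segment_id -> latest interim text
        self.finals = {}     # segment_id -> finalized text, in arrival order
        self.done = False

    def handle(self, msg):
        kind = msg.get("type")
        if kind == "partial":
            self.partials[msg["segment_id"]] = msg["text"]
        elif kind == "final_segment":
            seg = msg["segment_id"]
            self.partials.pop(seg, None)   # the final text supersedes the partial
            self.finals[seg] = msg["text"]
        elif kind == "done":
            self.done = True
        elif kind == "error":
            raise RuntimeError(f"{msg.get('code')}: {msg.get('message')}")
        # speech_start, speech_end, ready, end_of_transcript: nothing to store

    def text(self):
        """Finalized transcript so far, in the order segments were finalized."""
        return " ".join(self.finals.values())
```

Call `handle()` with each decoded JSON message and read `text()` whenever you need the transcript so far.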

End-of-audio signals

Three equivalent ways to tell the server you're done sending:

Method	Format
Text message	`"END"` or `"END_OF_AUDIO"`
JSON message	`{"type": "end"}`
Empty binary frame	Zero-length binary message, e.g. `Buffer.alloc(0)` in Node
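
In Python the three signals map to the following payloads. This is a small sketch with a hypothetical helper name, `end_of_audio_signal`. With the `websockets` library, passing a `str` to `ws.send()` sends a text message and passing `bytes` sends a binary one, so the return type decides which kind of frame goes out.

```python
import json

def end_of_audio_signal(style="text"):
    """Return the payload for one of the three end-of-audio methods."""
    if style == "text":
        return "END"                          # text message
    if style == "json":
        return json.dumps({"type": "end"})    # JSON text message
    if style == "binary":
        return b""                            # empty binary frame
    raise ValueError(f"unknown style: {style!r}")
```

For example, `await ws.send(end_of_audio_signal("binary"))` sends the empty binary frame.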

Error codes

Code	Description
`AUTH_FAILED`	Invalid API key (WS close 4001)
`TIMEOUT`	Inactivity or max connection duration exceeded
`CAPACITY_FULL`	Server at max concurrent sessions
`PROTOCOL_ERROR`	Invalid init config or message format
`INTERNAL_ERROR`	Unexpected server error
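
A client will usually want to treat these codes differently. The split below is an assumption, not documented server behavior: it treats `AUTH_FAILED` and `PROTOCOL_ERROR` as problems that a retry will not fix, and the remaining codes as worth retrying with exponential backoff. `should_retry` and `backoff_delay` are hypothetical helpers, and the delay values are arbitrary.

```python
# Assumed policy: codes that point at a bad request are not retried.
NON_RETRYABLE = {"AUTH_FAILED", "PROTOCOL_ERROR"}
RETRYABLE = {"TIMEOUT", "CAPACITY_FULL", "INTERNAL_ERROR"}

def should_retry(code):
    """True if reconnecting might succeed under the assumed policy."""
    return code in RETRYABLE

def backoff_delay(attempt, base=0.5, cap=30.0):
    """Exponential backoff in seconds: base * 2**attempt, capped at cap."""
    return min(cap, base * (2 ** attempt))

print(should_retry("CAPACITY_FULL"))  # True
print(should_retry("AUTH_FAILED"))    # False
print(backoff_delay(3))               # 4.0
```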

Full example

node

import WebSocket from "ws";
const ws = new WebSocket("wss://asr.shunyalabs.ai/ws");

ws.on("open", () => {
  ws.send(JSON.stringify({
    api_key: process.env.SHUNYALABS_API_KEY,
    language: "hi",
    sample_rate: 8000,
    dtype: "ulaw",
    model: "zero-indic",
  }));
});

ws.on("message", (raw) => {
  const msg = JSON.parse(raw.toString());
  switch (msg.type) {
    case "ready":         console.log("Session:", msg.session_id); break;
    case "speech_start":  console.log("Speech start @", msg.offset_ms); break;
    case "partial":       console.log("Partial:", msg.text); break;
    case "final_segment": console.log("Final:", msg.text); break;
    case "done":          console.log("Done:", msg.total_segments, "segments"); break;
    case "error":         console.error("Error:", msg.code, msg.message); break;
  }
});

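// Stream audio from a source (e.g. telephony media, mic, file).
// `audioSource` is a placeholder for any Readable stream of raw audio bytes.
// Start streaming only once the socket is open: ws.send() throws while the
// socket is still connecting. This listener runs after the init handler above.
ws.once("open", () => {
  audioSource.on("data", (chunk) => ws.send(chunk));
  audioSource.on("end",  () => ws.send("END"));
});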

python

import asyncio, json, os, websockets

async def transcribe_stream(audio_iter):
    async with websockets.connect("wss://asr.shunyalabs.ai/ws") as ws:
        await ws.send(json.dumps({
            "api_key": os.environ["SHUNYALABS_API_KEY"],
            "model": "zero-indic",
            "language": "hi",
            "sample_rate": 16000,
            "dtype": "int16",
        }))

        async def sender():
            async for frame in audio_iter:
                await ws.send(frame)
            await ws.send("END")

        async def receiver():
            async for raw in ws:
                msg = json.loads(raw)
                if msg["type"] == "partial":
                    print(" partial:", msg["text"])
                elif msg["type"] == "final_segment":
                    print(" final:  ", msg["text"])
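                elif msg["type"] == "error":
                    # Surface server errors instead of silently ignoring them
                    raise RuntimeError(f'{msg["code"]}: {msg["message"]}')
                elif msg["type"] == "done":
                    return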

        await asyncio.gather(sender(), receiver())
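
The Python example expects `audio_iter` to be an async iterator of raw audio frames. A minimal sketch of one, assuming you already hold headerless 16 kHz `int16` mono audio in memory, is shown below. `pcm_frames` is a hypothetical helper; the 640-byte default corresponds to 20 ms at that format, and `pace_sec` is an optional delay for simulating real-time delivery.

```python
import asyncio

async def pcm_frames(pcm, frame_bytes=640, pace_sec=0.0):
    """Yield fixed-size slices of a bytes buffer as an async iterator."""
    for start in range(0, len(pcm), frame_bytes):
        yield pcm[start:start + frame_bytes]
        # sleep(0) still yields control so the receiver task can run
        await asyncio.sleep(pace_sec)
```

With the function above defined, a run could look like `asyncio.run(transcribe_stream(pcm_frames(pcm_bytes, pace_sec=0.02)))`, where `pcm_bytes` is your own audio buffer.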