Speech-to-Text (ASR)

Zero STT is Shunya's speech recognition family. It offers one API surface for batch and streaming, a choice of four models tuned for different domains, and an intelligence layer that adds diarization, emotion, intent, and more on top of the transcript.

How it fits together

Endpoints

Mode	Endpoint	Use for
Batch	`POST https://asr.shunyalabs.ai/v1/audio/transcriptions`	Uploaded files, post-processing, async jobs.
Streaming	`wss://asr.shunyalabs.ai/ws`	Live transcription, voice agents, IVR.
Health	`GET https://asr.shunyalabs.ai/health`	Liveness checks. No auth.
Languages	`GET https://asr.shunyalabs.ai/languages`	Returns supported language names, ISO codes, and scripts.
Speakers	`/v1/speakers/*`	Register, list, identify, delete voice profiles for speaker identification.
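As a rough illustration of the unauthenticated health endpoint in the table above, the sketch below builds the `GET /health` request with Python's standard library. The helper names (`health_request`, `is_alive`) are hypothetical, and treating HTTP 200 as "alive" is an assumption, since the response body is not described here.

```python
import urllib.request

BASE_URL = "https://asr.shunyalabs.ai"


def health_request() -> urllib.request.Request:
    """Build the liveness probe. The endpoints table says no auth is needed."""
    return urllib.request.Request(f"{BASE_URL}/health", method="GET")


def is_alive(timeout: float = 5.0) -> bool:
    """Return True if /health answers with HTTP 200 (assumed success signal)."""
    try:
        with urllib.request.urlopen(health_request(), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError and HTTPError both derive from OSError.
        return False
```

Keeping request construction separate from the network call makes the probe easy to check without touching the service.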

Batch vs Streaming

Same models, same intelligence layer, different transports. Pick batch when you have a complete audio file in hand. Pick streaming when audio is arriving live and you want partial transcripts while the speaker is still talking.

Batch

HTTP POST, transcribe an audio file

Accepts multipart/form-data. Required fields: file (or url) and model.

Models available

zero-indic, General Indian languages (Hindi, Tamil, Telugu, Kannada, Marathi, Bengali, etc.)
zero-med, Medical/clinical audio, auto-applies medical terminology correction via MedGemma
zero-codeswitch, Code-switched speech (Hinglish, Tanglish, etc.), auto-restores English words to Latin script via Gemini
zero-universal, 99-language Whisper model, English, European, Asian, and African languages
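Because `model` is a required field, it can help to validate the name on the client before sending anything. The following is a minimal sketch; the `MODELS` table and the `check_model` helper are illustrative names, not part of the API, and the descriptions are shortened from the list above.

```python
# Model identifiers taken from the list above; descriptions are abbreviated.
MODELS = {
    "zero-indic": "General Indian languages",
    "zero-med": "Medical/clinical audio",
    "zero-codeswitch": "Code-switched speech (Hinglish, Tanglish, etc.)",
    "zero-universal": "English, European, Asian, and African languages",
}


def check_model(name: str) -> str:
    """Return the model name unchanged, or raise if it is not a known model."""
    if name not in MODELS:
        raise ValueError(f"unknown model {name!r}; expected one of {sorted(MODELS)}")
    return name
```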

Properties

Endpoint: POST /v1/audio/transcriptions
Host: asr.shunyalabs.ai
Content type: multipart/form-data
Auth: Bearer <API_KEY>
Required: file (or url) and model
Default response: verbose_json

Flow

1. Upload file or url → 2. Server transcribes → 3. Single JSON response

Streaming

WebSocket, real-time partials + finals

Real-time streaming transcription over WebSocket. Supports binary mode (raw PCM/ulaw/alaw bytes) and JSON mode (base64-encoded audio frames).
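To make the difference between the two modes concrete: binary mode sends the raw bytes as they are, while JSON mode wraps base64 text in a JSON message. The sketch below shows only the encoding step. The envelope key `"audio"` is a placeholder assumption, because the actual JSON frame schema is not given in this section.

```python
import base64
import json


def json_audio_frame(chunk: bytes, key: str = "audio") -> str:
    """Wrap raw audio bytes as a base64 string inside a JSON text frame.

    The key name is hypothetical; check the API reference for the real schema.
    """
    return json.dumps({key: base64.b64encode(chunk).decode("ascii")})


def decode_json_audio_frame(frame: str, key: str = "audio") -> bytes:
    """Inverse of json_audio_frame, useful for local round-trip checks."""
    return base64.b64decode(json.loads(frame)[key])
```

Base64 turns every 3 bytes into 4 characters, so JSON mode carries roughly a third more data than binary mode for the same audio.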

Audio formats (dtype)

ulaw: G.711 mu-law (8-bit), Telephony (8 kHz)
alaw: G.711 A-law (8-bit), Telephony (8 kHz)
int16: 16-bit signed PCM, General recording
float32: 32-bit IEEE float, Pre-processed audio
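If your pipeline produces float samples but you would rather stream `int16`, the conversion can be sketched as below. This assumes samples normalized to [-1.0, 1.0] and little-endian byte order; the byte order the server expects is not stated here, so treat that as an assumption. The function name is illustrative.

```python
import struct
from typing import Sequence


def float_to_int16_pcm(samples: Sequence[float]) -> bytes:
    """Convert floats in [-1.0, 1.0] to 16-bit signed PCM (little-endian assumed)."""
    # Clip first so out-of-range values saturate instead of overflowing.
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    ints = [int(round(s * 32767)) for s in clipped]
    return struct.pack(f"<{len(ints)}h", *ints)
```

Each `int16` sample takes 2 bytes against 4 for `float32`, so the converted stream is half the size.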

Properties

Endpoint: wss://asr.shunyalabs.ai/ws
Init message: JSON config (first frame)
Sample rate: 16000 Hz (default)
Chunk size: 2.0 s (default)
Silence threshold: 0.8 s (default)
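The defaults above translate into concrete byte counts once a dtype is chosen. The sketch below works through that arithmetic under two assumptions: the audio is mono, and "chunk size" is read as the duration of audio in one chunk. The `chunk_bytes` helper is hypothetical.

```python
# Bytes per sample follow directly from the dtype definitions:
# G.711 mu-law and A-law are 8-bit, int16 is 16-bit, float32 is 32-bit.
BYTES_PER_SAMPLE = {"ulaw": 1, "alaw": 1, "int16": 2, "float32": 4}


def chunk_bytes(dtype: str = "int16", sample_rate: int = 16000,
                chunk_seconds: float = 2.0, channels: int = 1) -> int:
    """Bytes in one chunk: samples per chunk x bytes per sample x channels."""
    samples = round(sample_rate * chunk_seconds)
    return samples * BYTES_PER_SAMPLE[dtype] * channels
```

With the defaults (16000 Hz, 2.0 s, `int16`) this gives 32000 samples, or 64000 bytes per chunk; an 8 kHz `ulaw` telephony stream at the same duration is 16000 bytes.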

Connection flow

1. Connect to /ws → 2. Send JSON init → 3. Stream audio frames → 4. Send "END" → 5. Receive events until done
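The client side of this flow (steps 2 to 4) can be sketched as a generator that yields frames in the order they should be sent, independent of any particular WebSocket library. The config keys used in the usage note are hypothetical, and sending `"END"` as a plain text frame is an assumption based on the flow above.

```python
import json
from typing import Iterator, Union


def client_frames(config: dict, audio: bytes,
                  frame_bytes: int) -> Iterator[Union[str, bytes]]:
    """Yield the frames a binary-mode client sends, in order."""
    # Step 2: the JSON init config must be the first frame.
    yield json.dumps(config)
    # Step 3: raw audio, split into fixed-size binary frames.
    for start in range(0, len(audio), frame_bytes):
        yield audio[start:start + frame_bytes]
    # Step 4: signal that no more audio is coming.
    yield "END"
```

A caller might run `for frame in client_frames({"dtype": "int16", "sample_rate": 16000}, pcm, 4096)` and pass each item to its WebSocket client's send call, with `str` items going out as text frames and `bytes` items as binary frames.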

Server → client events

ready, speech_start, partial, speech_end, final_segment, end_of_transcript, done, error
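On the receiving side, a client typically loops over incoming messages until it sees `done`. The sketch below collects `final_segment` text and stops on `done` or `error`. The message shape is an assumption: it supposes each event is a JSON object with a `"type"` field holding the event name, a `"text"` field on final segments, and a `"message"` field on errors. None of those field names are confirmed by this section.

```python
import json
from typing import Iterable, List


def collect_transcript(messages: Iterable[str]) -> str:
    """Join the text of final segments, stopping at the 'done' event."""
    segments: List[str] = []
    for raw in messages:
        event = json.loads(raw)
        kind = event.get("type")  # hypothetical field name
        if kind == "final_segment":
            segments.append(event.get("text", ""))  # hypothetical field name
        elif kind == "error":
            raise RuntimeError(event.get("message", "server reported an error"))
        elif kind == "done":
            break
        # ready, speech_start, partial, speech_end, end_of_transcript:
        # ignored here; a live UI would render partials as they arrive.
    return " ".join(s for s in segments if s)
```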

Source: Shunyalabs ASR Gateway API Reference (31 March), "Base Call" and "WebSocket Streaming API" sections, reproduced verbatim.

Pick a model

Every request takes a model field. There are four to choose from:

zero-indic

Hindi, Tamil, Telugu, Kannada, Marathi, Bengali and 50+ Indian languages. The default for Indic content.

zero-universal

204-language Whisper-class model. English, European, Asian, African. Auto-detects language when you don't know it.

zero-med

Clinical / medical speech with drug, procedure, and diagnosis vocabulary. HIPAA-cleared. Auto-applies medical terminology correction.

zero-codeswitch

Native handling of mixed Hindi-English (Hinglish), Tamil-English (Tanglish) and similar blends. Gemini-backed code-switch restoration.

Your first request

The minimum viable request needs three things: a file, a model, and a bearer token.

curl -X POST https://asr.shunyalabs.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer $SHUNYALABS_API_KEY" \
  -F "file=@meeting.wav" \
  -F "model=zero-indic"

Or pass a URL instead of uploading a file:

curl -X POST https://asr.shunyalabs.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer $SHUNYALABS_API_KEY" \
  -F "url=https://example.com/call.wav" \
  -F "model=zero-indic"
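For callers who prefer not to shell out to curl, the same request can be assembled with Python's standard library. This is a sketch, not an official client: the `build_request` helper is hypothetical, the multipart body is built by hand, and it assumes exactly one of `file` or `url` should be sent.

```python
import urllib.request
import uuid
from typing import Optional

API_URL = "https://asr.shunyalabs.ai/v1/audio/transcriptions"


def build_request(api_key: str, model: str, *, url: Optional[str] = None,
                  file_bytes: Optional[bytes] = None,
                  filename: str = "audio.wav") -> urllib.request.Request:
    """Build a multipart/form-data POST carrying model plus file or url."""
    if (url is None) == (file_bytes is None):
        raise ValueError("pass exactly one of url or file_bytes")
    boundary = uuid.uuid4().hex
    parts = []

    def field(name: str, value: str) -> None:
        # A plain text form field.
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; '
            f'name="{name}"\r\n\r\n{value}\r\n'.encode()
        )

    field("model", model)
    if url is not None:
        field("url", url)
    else:
        # A file field: headers, blank line, raw bytes, CRLF.
        head = (
            f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
            f'filename="{filename}"\r\n'
            "Content-Type: application/octet-stream\r\n\r\n"
        ).encode()
        parts.append(head + file_bytes + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())

    return urllib.request.Request(
        API_URL,
        data=b"".join(parts),
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

Passing the result to `urllib.request.urlopen` sends the request; the body of the response can then be parsed as JSON, since the batch endpoint returns a single JSON response.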