Speech-to-Text (ASR)
Zero STT is Shunya's speech recognition family. It offers one API surface for batch and streaming, a choice of four models tuned for different domains, and an intelligence layer that adds diarization, emotion, intent, and more on top of the transcript.
How it fits together
Endpoints
| Mode | Endpoint | Use for |
|---|---|---|
| Batch | POST https://asr.shunyalabs.ai/v1/audio/transcriptions | Uploaded files, post-processing, async jobs. |
| Streaming | wss://asr.shunyalabs.ai/ws | Live transcription, voice agents, IVR. |
| Health | GET https://asr.shunyalabs.ai/health | Liveness checks. No auth. |
| Languages | GET https://asr.shunyalabs.ai/languages | Returns supported language names, ISO codes, and scripts. |
| Speakers | /v1/speakers/* | Register, list, identify, delete voice profiles for speaker identification. |
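The utility endpoints in the table can be exercised with nothing beyond the Python standard library. The sketch below only builds the requests. The `build_get` and `send` helpers are illustrative names, not part of any Shunyalabs SDK, and passing a Bearer token to `/languages` is an assumption, since the table states "No auth" only for `/health`.

```python
import urllib.request

BASE_URL = "https://asr.shunyalabs.ai"


def build_get(path, api_key=None):
    """Build (but do not send) a GET request for a gateway path."""
    req = urllib.request.Request(BASE_URL + path, method="GET")
    if api_key is not None:
        # Assumption: authenticated endpoints accept the same Bearer
        # header that the batch endpoint uses.
        req.add_header("Authorization", f"Bearer {api_key}")
    return req


def send(req, timeout=10):
    """Send a prepared request and return (status, body bytes). Needs network."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status, resp.read()


# /health is documented as needing no auth.
health_req = build_get("/health")

# Whether /languages needs a token is not stated; sending one is an assumption.
languages_req = build_get("/languages", api_key="YOUR_API_KEY")
```

Calling `send(health_req)` performs the actual liveness check. Keeping request construction separate from sending makes the URLs and headers easy to inspect before anything goes over the wire.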
Batch vs Streaming
Same models, same intelligence layer, different transports. Pick batch when you have a complete audio file in hand. Pick streaming when audio is arriving live and you want partial transcripts while the speaker is still talking.
The batch endpoint accepts multipart/form-data. Required fields: file (or url) and model.
- zero-indic: General Indian languages (Hindi, Tamil, Telugu, Kannada, Marathi, Bengali, etc.)
- zero-med: Medical/clinical audio, auto-applies medical terminology correction via MedGemma
- zero-codeswitch: Code-switched speech (Hinglish, Tanglish, etc.), auto-restores English words to Latin script via Gemini
- zero-universal: 99-language Whisper model, English, European, Asian, and African languages
- Endpoint: POST /v1/audio/transcriptions
- Host: asr.shunyalabs.ai
- Content type: multipart/form-data
- Auth: Bearer <API_KEY>
- Required: file (or url) and model
- Default response: verbose_json
Real-time streaming transcription over WebSocket. Supports binary mode (raw PCM/ulaw/alaw bytes) and JSON mode (base64-encoded audio frames).
- ulaw: G.711 mu-law (8-bit), Telephony (8 kHz)
- alaw: G.711 A-law (8-bit), Telephony (8 kHz)
- int16: 16-bit signed PCM, General recording
- float32: 32-bit IEEE float, Pre-processed audio
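The four encodings differ in how many bytes each sample occupies, which determines how much data a binary-mode stream carries per second. The sketch below tabulates those widths and shows one way to turn normalized float samples into `int16` PCM bytes. Little-endian byte order and the helper names are assumptions, as the reference does not state them.

```python
import struct

# Bytes per sample, following from the bit widths listed above.
BYTES_PER_SAMPLE = {"ulaw": 1, "alaw": 1, "int16": 2, "float32": 4}


def bytes_per_second(encoding, sample_rate):
    """Raw audio bandwidth for a mono stream in the given encoding."""
    return BYTES_PER_SAMPLE[encoding] * sample_rate


def float_to_int16_pcm(samples):
    """Convert floats in [-1.0, 1.0] to 16-bit signed PCM bytes.

    Assumption: little-endian byte order ("<h"); the reference does not
    specify endianness.
    """
    # Clip first so out-of-range input cannot overflow the 16-bit range.
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    ints = [int(round(s * 32767)) for s in clipped]
    return struct.pack(f"<{len(ints)}h", *ints)
```

For example, telephony audio in `ulaw` at 8 kHz works out to 8000 bytes per second, while `int16` at 16 kHz is 32000 bytes per second.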
- Endpoint: wss://asr.shunyalabs.ai/ws
- Init message: JSON config (first frame)
- Sample rate: 16000 Hz (default)
- Chunk size: 2.0 s (default)
- Silence threshold: 0.8 s (default)
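To make the streaming defaults concrete, the sketch below builds a possible init message, computes how many bytes one default chunk holds, and wraps audio for JSON mode. Every JSON key name here (`model`, `encoding`, `sample_rate`, `chunk_size`, `silence_threshold`, `audio`) is a hypothetical placeholder, because this section gives only the default values, not the field names. Check the WebSocket Streaming API reference for the real schema.

```python
import base64
import json

# Default values taken from the list above; the key names are hypothetical.
DEFAULTS = {"sample_rate": 16000, "chunk_size": 2.0, "silence_threshold": 0.8}


def build_init_message(model, encoding="int16", **overrides):
    """Serialize the JSON config sent as the first WebSocket frame."""
    config = {"model": model, "encoding": encoding, **DEFAULTS, **overrides}
    return json.dumps(config)


def chunk_bytes(sample_rate=16000, chunk_seconds=2.0, bytes_per_sample=2):
    """Size in bytes of one mono chunk at the given rate and sample width."""
    return int(sample_rate * chunk_seconds * bytes_per_sample)


def json_audio_frame(pcm_bytes):
    """Wrap raw audio for JSON mode as base64 (frame shape is an assumption)."""
    return json.dumps({"audio": base64.b64encode(pcm_bytes).decode("ascii")})
```

With the defaults, one chunk of `int16` audio is 16000 × 2.0 × 2 = 64000 bytes. In binary mode those bytes would be sent as-is; in JSON mode they would be base64-encoded first, which grows the payload by roughly a third.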
Source: Shunyalabs ASR Gateway API Reference (31 March), "Base Call" and "WebSocket Streaming API" sections, reproduced verbatim.
Pick a model
Every request takes a model field. There are four to choose from:
- zero-indic: Hindi, Tamil, Telugu, Kannada, Marathi, Bengali and 50+ Indian languages. The default for Indic content.
- zero-universal: 204-language Whisper-class model. English, European, Asian, African. Auto-detects language when you don't know it.
- zero-med: Clinical / medical speech with drug, procedure, and diagnosis vocabulary. HIPAA-cleared. Auto-applies medical terminology correction.
- zero-codeswitch: Native handling of mixed Hindi-English (Hinglish), Tamil-English (Tanglish) and similar blends. Gemini-backed code-switch restoration.
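One simple way to keep model selection explicit in client code is a small lookup from content type to model name. The model identifiers come from this page; the domain labels (`indic`, `universal`, `medical`, `codeswitch`) and the `pick_model` helper are hypothetical conveniences, not API values.

```python
# Model identifiers are from this page; the dictionary keys are
# hypothetical labels chosen for this sketch.
MODELS = {
    "indic": "zero-indic",            # Indian languages
    "universal": "zero-universal",    # multilingual, auto-detects language
    "medical": "zero-med",            # clinical / medical speech
    "codeswitch": "zero-codeswitch",  # Hinglish, Tanglish, similar blends
}


def pick_model(domain):
    """Return the model name for a content domain, or raise ValueError."""
    try:
        return MODELS[domain]
    except KeyError:
        known = ", ".join(sorted(MODELS))
        raise ValueError(f"unknown domain {domain!r}; expected one of: {known}") from None
```

Failing loudly on an unknown domain avoids silently sending a request with the wrong `model` field.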
Your first request
The minimum viable request needs three things: a file, a model, and a bearer token.
curl -X POST https://asr.shunyalabs.ai/v1/audio/transcriptions \
-H "Authorization: Bearer $SHUNYALABS_API_KEY" \
-F "file=@meeting.wav" \
-F "model=zero-indic"

Or pass a URL instead of uploading a file:

curl -X POST https://asr.shunyalabs.ai/v1/audio/transcriptions \
-H "Authorization: Bearer $SHUNYALABS_API_KEY" \
-F "url=https://example.com/call.wav" \
-F "model=zero-indic"
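The same two requests can be assembled in Python with only the standard library. This is a minimal sketch rather than an official client. `encode_multipart` and `build_transcription_request` are illustrative helpers, the `application/octet-stream` part type is a generic choice, and rejecting requests that carry both `file` and `url` is an assumption based on the "file (or url)" wording.

```python
import io
import urllib.request
import uuid

API_URL = "https://asr.shunyalabs.ai/v1/audio/transcriptions"


def encode_multipart(fields, files=None):
    """Encode text fields and {name: (filename, bytes)} files as form-data."""
    boundary = uuid.uuid4().hex
    body = io.BytesIO()
    for name, value in fields.items():
        body.write(f"--{boundary}\r\n".encode())
        body.write(f'Content-Disposition: form-data; name="{name}"\r\n\r\n'.encode())
        body.write(f"{value}\r\n".encode())
    for name, (filename, data) in (files or {}).items():
        body.write(f"--{boundary}\r\n".encode())
        body.write(
            f'Content-Disposition: form-data; name="{name}"; '
            f'filename="{filename}"\r\n'.encode()
        )
        body.write(b"Content-Type: application/octet-stream\r\n\r\n")
        body.write(data)
        body.write(b"\r\n")
    body.write(f"--{boundary}--\r\n".encode())
    return body.getvalue(), f"multipart/form-data; boundary={boundary}"


def build_transcription_request(api_key, model, audio=None,
                                filename="audio.wav", url=None):
    """Build (but do not send) a batch transcription request."""
    # Assumption: exactly one of file or url should be supplied.
    if (audio is None) == (url is None):
        raise ValueError("provide exactly one of audio bytes or url")
    fields = {"model": model}
    files = None
    if url is not None:
        fields["url"] = url
    else:
        files = {"file": (filename, audio)}
    body, content_type = encode_multipart(fields, files)
    req = urllib.request.Request(API_URL, data=body, method="POST")
    req.add_header("Authorization", f"Bearer {api_key}")
    req.add_header("Content-Type", content_type)
    return req
```

To send, pass the result to `urllib.request.urlopen(req)` and parse the response body as JSON, since the default response format is `verbose_json`. For the file upload case, read the audio first, for example `build_transcription_request(key, "zero-indic", audio=open("meeting.wav", "rb").read(), filename="meeting.wav")`.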