Speech-to-Text (ASR)

Zero STT is Shunya's speech recognition family. One API surface for batch and streaming, a choice of four models tuned for different domains, and an intelligence layer that adds diarization, emotion, intent, and more on top of the transcript.

How it fits together

Endpoints

Mode | Endpoint | Use for
Batch | POST https://asr.shunyalabs.ai/v1/audio/transcriptions | Uploaded files, post-processing, async jobs.
Streaming | wss://asr.shunyalabs.ai/ws | Live transcription, voice agents, IVR.
Health | GET https://asr.shunyalabs.ai/health | Liveness checks. No auth.
Languages | GET https://asr.shunyalabs.ai/languages | Returns supported language names, ISO codes, and scripts.
Speakers | /v1/speakers/* | Register, list, identify, delete voice profiles for speaker identification.
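
The two GET endpoints are the quickest way to confirm connectivity before sending audio. The following is a minimal sketch, assuming curl is installed. The helper names (asr_url, asr_health, asr_languages) are illustrative and not part of the API. The table states that /health needs no auth; it does not say whether /languages does, so the sketch sends the bearer token there as a precaution.

```shell
# Base host taken from the endpoint table.
ASR_BASE="https://asr.shunyalabs.ai"

# Hypothetical helper: map a short name to a documented URL.
asr_url() {
  case "$1" in
    batch)     printf '%s\n' "$ASR_BASE/v1/audio/transcriptions" ;;
    health)    printf '%s\n' "$ASR_BASE/health" ;;
    languages) printf '%s\n' "$ASR_BASE/languages" ;;
    stream)    printf '%s\n' "wss://asr.shunyalabs.ai/ws" ;;
    *)         echo "unknown endpoint: $1" >&2; return 1 ;;
  esac
}

# Liveness check. No Authorization header is sent.
asr_health() {
  curl -fsS "$(asr_url health)"
}

# Supported languages. Sending the token here is an assumption.
asr_languages() {
  curl -fsS -H "Authorization: Bearer $SHUNYALABS_API_KEY" "$(asr_url languages)"
}
```

Calling asr_health from a deploy script or monitor gives a non-zero exit status on HTTP errors, because curl -f treats 4xx and 5xx responses as failures.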

Batch vs Streaming

Same models, same intelligence layer, different transports. Pick batch when you have a complete audio file in hand. Pick streaming when audio is arriving live and you want partial transcripts as the speaker is still talking.

Batch
HTTP POST, transcribe an audio file

Accepts multipart/form-data. Required fields: file (or url) and model.

  • zero-indic: General Indian languages (Hindi, Tamil, Telugu, Kannada, Marathi, Bengali, etc.)
  • zero-med: Medical/clinical audio; auto-applies medical terminology correction via MedGemma
  • zero-codeswitch: Code-switched speech (Hinglish, Tanglish, etc.); auto-restores English words to Latin script via Gemini
  • zero-universal: 99-language Whisper model covering English, European, Asian, and African languages
Endpoint: POST /v1/audio/transcriptions
Host: asr.shunyalabs.ai
Content type: multipart/form-data
Auth: Bearer <API_KEY>
Required: file (or url) and model
Default response: verbose_json

  1. Upload file or url
  2. Server transcribes
  3. Single JSON response
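
The required-field rule (an audio source as either file or url, plus model) can be checked on the client before any bytes are uploaded. The following is a rough sketch, assuming curl is installed. The function names audio_form_field and transcribe are hypothetical, and treating any value that starts with http:// or https:// as a remote URL is a convenience of this sketch rather than documented API behavior.

```shell
# Hypothetical helper: print the curl -F argument for the audio source.
# Values starting with http:// or https:// are sent as "url";
# anything else is treated as a local path and uploaded as "file".
audio_form_field() {
  case "$1" in
    "")                 echo "audio source is required" >&2; return 1 ;;
    http://*|https://*) printf 'url=%s\n' "$1" ;;
    *)                  printf 'file=@%s\n' "$1" ;;
  esac
}

# Sketch of a batch call: transcribe <source> <model>
transcribe() {
  # Reject a missing model before touching the network.
  [ -n "$2" ] || { echo "model is required" >&2; return 1; }
  src_field=$(audio_form_field "$1") || return 1
  curl -fsS -X POST "https://asr.shunyalabs.ai/v1/audio/transcriptions" \
    -H "Authorization: Bearer $SHUNYALABS_API_KEY" \
    -F "$src_field" \
    -F "model=$2"
}
```

With this in place, `transcribe meeting.wav zero-indic` uploads a local file and `transcribe https://example.com/call.wav zero-indic` passes a URL, both through the same code path.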
Streaming
WebSocket, real-time partials + finals

Real-time streaming transcription over WebSocket. Supports binary mode (raw PCM/ulaw/alaw bytes) and JSON mode (base64-encoded audio frames).

  • ulaw: G.711 mu-law (8-bit), Telephony (8 kHz)
  • alaw: G.711 A-law (8-bit), Telephony (8 kHz)
  • int16: 16-bit signed PCM, General recording
  • float32: 32-bit IEEE float, Pre-processed audio
Endpoint: wss://asr.shunyalabs.ai/ws
Init message: JSON config (first frame)
Sample rate: 16000 Hz (default)
Chunk size: 2.0 s (default)
Silence threshold: 0.8 s (default)
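
When sizing audio buffers, it helps to know how many bytes one chunk of audio occupies for a given encoding and sample rate. The following is a small sketch of that arithmetic, assuming mono audio. The function names are hypothetical, and whether "chunk size" describes the frames a client sends or the server's processing window is not stated here, so treat the result as a buffer-sizing aid only.

```shell
# Bytes per sample for each documented encoding.
bytes_per_sample() {
  case "$1" in
    ulaw|alaw) echo 1 ;;   # G.711, 8-bit
    int16)     echo 2 ;;   # 16-bit signed PCM
    float32)   echo 4 ;;   # 32-bit IEEE float
    *)         echo "unknown encoding: $1" >&2; return 1 ;;
  esac
}

# chunk_bytes <encoding> <sample_rate_hz> <chunk_ms>
# Milliseconds keep the arithmetic in integers (2.0 s is 2000 ms).
chunk_bytes() {
  bps=$(bytes_per_sample "$1") || return 1
  echo $(( $2 * $3 / 1000 * bps ))
}

chunk_bytes int16 16000 2000   # prints 64000
chunk_bytes ulaw 8000 2000     # prints 16000
```

At the defaults (16000 Hz, 2.0 s), one chunk of int16 audio is 64000 bytes, while the same duration of 8 kHz telephony audio in ulaw is 16000 bytes.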

  1. Connect to /ws
  2. Send JSON init
  3. Stream audio frames
  4. Send "END"
  5. Receive events until done

Events: ready, speech_start, partial, speech_end, final_segment, end_of_transcript, done, error
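
A streaming client typically dispatches on the event type as messages arrive. The following sketch shows one way to structure that dispatch. The event names come from the list above, but the actions attached to them are assumptions inferred from the names alone, and handle_event is a hypothetical function. Extracting the event type from each JSON message is left to whatever WebSocket client and JSON parser are in use.

```shell
# handle_event <event_type>: print the action a client might take.
# The actions are assumptions inferred from the event names.
handle_event() {
  case "$1" in
    ready)             echo "start-sending-audio" ;;
    speech_start)      echo "mark-utterance-start" ;;
    partial)           echo "update-live-caption" ;;
    speech_end)        echo "mark-utterance-end" ;;
    final_segment)     echo "append-to-transcript" ;;
    end_of_transcript) echo "flush-transcript" ;;
    # "done" is quoted because it is a shell reserved word.
    "done")            echo "close-connection" ;;
    error)             echo "report-and-close" ;;
    *)                 echo "ignore" ;;
  esac
}
```

Unknown event types fall through to "ignore", so the client keeps working if further event types appear.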

Source: Shunyalabs ASR Gateway API Reference (31 March), "Base Call" and "WebSocket Streaming API" sections, reproduced verbatim.

Pick a model

Every request takes a model field. There are four to choose from:

zero-indic: Hindi, Tamil, Telugu, Kannada, Marathi, Bengali and 50+ Indian languages. The default for Indic content.

zero-universal: 204-language Whisper-class model. English, European, Asian, African. Auto-detects language when you don't know it.

zero-med: Clinical / medical speech with drug, procedure, and diagnosis vocabulary. HIPAA-cleared. Auto-applies medical terminology correction.

zero-codeswitch: Native handling of mixed Hindi-English (Hinglish), Tamil-English (Tanglish) and similar blends. Gemini-backed code-switch restoration.
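
If the model is chosen programmatically, the choice can be reduced to a small lookup. The following sketch is illustrative: pick_model and its content-kind labels (medical, codeswitch, indic) are hypothetical names invented for this example, and only the four model identifiers are taken from this page. Falling back to zero-universal for anything else follows from its description as the model that auto-detects language.

```shell
# pick_model <content_kind>: map a kind of audio to a model name.
# The content-kind labels are made up for this example.
pick_model() {
  case "$1" in
    medical)    echo "zero-med" ;;
    codeswitch) echo "zero-codeswitch" ;;
    indic)      echo "zero-indic" ;;
    # Unknown or unspecified content: use the auto-detecting model.
    *)          echo "zero-universal" ;;
  esac
}

pick_model medical   # prints zero-med
```

The result can be passed straight to the model form field, for example -F "model=$(pick_model indic)".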

Your first request

Minimum viable: file, model, bearer token.

shell
curl -X POST https://asr.shunyalabs.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer $SHUNYALABS_API_KEY" \
  -F "file=@meeting.wav" \
  -F "model=zero-indic"

Or pass a URL instead of uploading a file:

shell
curl -X POST https://asr.shunyalabs.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer $SHUNYALABS_API_KEY" \
  -F "url=https://example.com/call.wav" \
  -F "model=zero-indic"