ASR configuration
Every parameter you can pass to POST /v1/audio/transcriptions, in one place. Two are required (file or url, and model); the rest have safe defaults.
Required
| Field | Type | Description |
|---|---|---|
| file | file upload | Audio file (WAV, MP3, M4A, OGG, FLAC, WebM). One of file or url is required. |
| url | string | Public audio URL. Cannot be combined with file. |
| model | string | One of zero-indic, zero-universal, zero-med, zero-codeswitch. |
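Because file and url are mutually exclusive, a client wrapper can choose the field from a single argument. The following is a minimal sketch rather than an official client: the function name transcribe_source is hypothetical, and it assumes url is sent as an ordinary multipart form field like the other parameters, which this page does not state explicitly.

```shell
# Hypothetical helper: send either a local file or a public URL, never both.
# Assumes SHUNYALABS_API_KEY is set and that `url` is passed as a form field.
transcribe_source() {
  src=${1:-}
  model=${2:-zero-indic}
  case "$src" in
    http://*|https://*)
      # Remote audio: send the address as plain text.
      # --form-string stops curl from interpreting characters such as ; or @.
      set -- --form-string "url=$src" ;;
    *)
      # Local audio: the @ prefix tells curl to upload the file contents.
      if [ ! -f "$src" ]; then
        echo "no such file: $src" >&2
        return 1
      fi
      set -- -F "file=@$src" ;;
  esac
  curl -X POST https://asr.shunyalabs.ai/v1/audio/transcriptions \
    -H "Authorization: Bearer $SHUNYALABS_API_KEY" \
    "$@" \
    -F "model=$model"
}
```

Calling transcribe_source hindi-call.wav uploads the file, while transcribe_source https://example.com/call.wav zero-universal sends the (placeholder) address instead.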
Language & output
| Parameter | Type, default | Description |
|---|---|---|
| language_code | string, "auto" | Language hint. ISO 639-1 code (hi, ta, kn, bn, mr, te, gu, pa, ml, or, ur, en) or full name (Hindi, Tamil). Use auto to detect. |
| response_format | string, "verbose_json" | verbose_json returns the full response (segments, NLP, timing, language). json returns a minimal, OpenAI-compatible {"text": "..."}. |
| output_script | string, "auto" | Transliterate to a different script without changing the language. Devanagari, Bengali, Telugu, Tamil, Kannada, Latin, ITRANS. Powered by aksharamukha, no LLM, no latency cost. |
| output_language | string, none | Translate the transcript via Gemini. The result lives in nlp_analysis.translation. Accepts ISO codes or full names. For a script change only, use output_script. |
Why setting language_code matters
On clips shorter than 5 seconds, language detection is error-prone and the model can "translate" instead of transcribe. Locking the language avoids both.
Example: transcribe Hindi audio and romanise the script to Latin:
```shell
curl -X POST https://asr.shunyalabs.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer $SHUNYALABS_API_KEY" \
  -F "file=@hindi-call.wav" \
  -F "model=zero-indic" \
  -F "language_code=hi" \
  -F "output_script=Latin"
```
Returns: "namaste mohammad ji ye ek zaruri call hai" (Hindi pronunciation, written in Latin letters).
Audio pre-processing
| Parameter | Type, default | Description |
|---|---|---|
| use_vad_chunking | bool, true | Split audio at natural speech pauses using Voice Activity Detection. Silence is stripped before the model sees it. Set false only for continuous dictation with no pauses. |
| chunk_size | int seconds, 30 | Applies when use_vad_chunking=false. Range 1-60. Sets a fixed segment length. |
| enable_denoising | bool, false | Runs noise reduction before transcription. Skipped automatically for files > 5 MB to avoid latency. Use for call-centre and field recordings. |
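Since chunk_size only takes effect when VAD chunking is off and must stay within 1-60, it can help to validate the value before sending a request. The helper below is a sketch under that reading of the table; fixed_chunk_fields is a hypothetical name, and it only prints the two form fields (one per line) rather than calling the API.

```shell
# Hypothetical helper: print the form fields for fixed-length chunking.
# Rejects values outside the documented 1-60 second range.
fixed_chunk_fields() {
  size=${1:-30}
  case "$size" in
    *[!0-9]*)
      echo "chunk_size must be a whole number of seconds" >&2
      return 1 ;;
  esac
  if [ "$size" -lt 1 ] || [ "$size" -gt 60 ]; then
    echo "chunk_size must be between 1 and 60" >&2
    return 1
  fi
  # chunk_size is only read when VAD chunking is switched off.
  printf '%s\n' "use_vad_chunking=false" "chunk_size=$size"
}
```

Each printed line maps to one -F argument on the curl command, for example -F "use_vad_chunking=false" -F "chunk_size=20".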
Segmentation & alignment
| Parameter | Type, default | Description |
|---|---|---|
| word_timestamps | bool, false | Adds per-word start, end, and score to every segment. Uses CTC forced alignment on an ONNX model, runs on GPU, no external call. |
| enable_diarization | bool, false | Speaker-level segmentation. Every segment gets a speaker: SPEAKER_XX label; the top-level transcript is prefixed with speaker tags; the speakers array lists all unique speakers detected. Segments are capped at 30 s each for quality. |
| enable_speaker_identification + project | bool, false | Resolves anonymous SPEAKER_XX labels to registered names. Requires enable_diarization=true and pre-registered voice profiles. project scopes the speaker library, useful for per-customer isolation. See the speaker registration API. |
| enable_emotion_diarization | bool, false | Detects the dominant emotion in each segment and adds an emotion field. Works alongside standard diarization. |
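Speaker identification depends on diarization, so the two flags are easiest to keep together in one wrapper. This sketch assumes voice profiles are already registered; the function name transcribe_with_speakers is hypothetical, and requiring a project name is a choice made here for illustration, since the table does not say whether project is optional.

```shell
# Hypothetical wrapper: diarize a recording and resolve speakers to
# registered names within one project's speaker library.
transcribe_with_speakers() {
  audio=${1:-}
  project=${2:-}
  if [ -z "$audio" ] || [ -z "$project" ]; then
    echo "usage: transcribe_with_speakers <audio-file> <project>" >&2
    return 2
  fi
  # enable_speaker_identification requires enable_diarization=true,
  # so both flags are always sent together.
  curl -X POST https://asr.shunyalabs.ai/v1/audio/transcriptions \
    -H "Authorization: Bearer $SHUNYALABS_API_KEY" \
    -F "file=@$audio" \
    -F "model=zero-indic" \
    -F "enable_diarization=true" \
    -F "enable_speaker_identification=true" \
    --form-string "project=$project"
}
```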
Intelligence layer
| Parameter | Type, default | Description |
|---|---|---|
| enable_intent_detection + intent_choices | bool, false | Classifies the overall transcript intent via Gemini. Optionally constrain to a list of allowed intents with intent_choices (JSON array). Result in nlp_analysis.intent: label, confidence, reasoning. |
| enable_summarization + summary_max_length | bool, false | Generates a concise transcript summary via Gemini. summary_max_length is an approximate word count (default 150). Result in nlp_analysis.summary. |
| enable_sentiment_analysis | bool, false | Returns nlp_analysis.sentiment with a label (positive/negative/neutral), a numeric score, and a short explanation. |
| enable_keyterm_normalization + keyterm_keywords | bool, false | Normalises domain-specific terms the ASR model might render informally. Optionally focus on specific terms with keyterm_keywords (JSON array). Output preserves the original language. Result in nlp_analysis.normalized_text. |
Example: classify the call into one of four intents:
```shell
curl -X POST https://asr.shunyalabs.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer $SHUNYALABS_API_KEY" \
  -F "file=@call.wav" \
  -F "model=zero-indic" \
  -F "enable_intent_detection=true" \
  -F 'intent_choices=["complaint","inquiry","service_request","compliment"]'
```
Redaction
| Parameter | Type, default | Description |
|---|---|---|
| enable_profanity_hashing | bool, false | Replaces profane words with **** in-place in both segments[].text and the top-level text. Uses Gemini for detection. |
| hash_keywords | JSON array, none | Masks a specific list of words/phrases using regex (case-insensitive, no LLM). Independent of profanity hashing. Use for account numbers, card numbers, OTP, and any custom sensitive terms. |
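Parameters such as hash_keywords, intent_choices, and keyterm_keywords take a JSON array, which is awkward to type by hand when the terms come from a script. The helper below is one possible way to build that array in plain shell; json_array is a hypothetical name, and the escaping is deliberately limited to backslashes and double quotes.

```shell
# Hypothetical helper: print its arguments as a JSON array of strings.
# Only backslashes and double quotes are escaped, so terms containing
# control characters (tabs, newlines) are not handled.
json_array() {
  out='['
  sep=''
  for item in "$@"; do
    # Escape backslashes first, then double quotes.
    esc=$(printf '%s' "$item" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
    out="$out$sep\"$esc\""
    sep=','
  done
  printf '%s]\n' "$out"
}

json_array "account number" "card number" "OTP"
# prints ["account number","card number","OTP"]
```

The result can be passed straight into a request, for example --form-string "hash_keywords=$(json_array "account number" "OTP")". Using curl's --form-string instead of -F keeps curl from interpreting any special characters inside the terms.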
Example: mask account numbers, card numbers, and OTPs in the transcript:
```shell
curl -X POST https://asr.shunyalabs.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer $SHUNYALABS_API_KEY" \
  -F "file=@call.wav" \
  -F "model=zero-indic" \
  -F 'hash_keywords=["account number","card number","OTP"]'
```
Legacy / compatibility
| Parameter | Type, default | Description |
|---|---|---|
| task | string, "transcribe" | OpenAI-compatible field. Only "transcribe" is supported today. |
Putting it all together
```shell
curl -X POST https://asr.shunyalabs.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer $SHUNYALABS_API_KEY" \
  -F "file=@call.wav" \
  -F "model=zero-indic" \
  -F "language_code=hi" \
  -F "enable_diarization=true" \
  -F "enable_speaker_identification=true" \
  -F "project=support_team" \
  -F "enable_emotion_diarization=true" \
  -F "word_timestamps=true" \
  -F "enable_intent_detection=true" \
  -F 'intent_choices=["complaint","inquiry","service_request"]' \
  -F "enable_summarization=true" \
  -F "enable_sentiment_analysis=true" \
  -F "enable_denoising=true"
```
A sane default set
For a contact-centre transcription with agent-assist, start with: model=zero-indic, enable_diarization=true, enable_intent_detection=true, enable_sentiment_analysis=true. Add word_timestamps=true if you need precise search.
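As a starting point, that set could be wrapped in a small function like the one below. This is a sketch, not an official client: the name transcribe_contact_centre and the "timestamps" switch are hypothetical, and SHUNYALABS_API_KEY is assumed to be set in the environment.

```shell
# Hypothetical wrapper around the starting set described above.
# Pass "timestamps" as the second argument to add word_timestamps=true.
transcribe_contact_centre() {
  audio=${1:-}
  extra=${2:-}
  set -- \
    -F "file=@$audio" \
    -F "model=zero-indic" \
    -F "enable_diarization=true" \
    -F "enable_intent_detection=true" \
    -F "enable_sentiment_analysis=true"
  if [ "$extra" = "timestamps" ]; then
    # Optional: per-word timing for precise search.
    set -- "$@" -F "word_timestamps=true"
  fi
  curl -X POST https://asr.shunyalabs.ai/v1/audio/transcriptions \
    -H "Authorization: Bearer $SHUNYALABS_API_KEY" \
    "$@"
}
```

For example, transcribe_contact_centre call.wav sends the four defaults, and transcribe_contact_centre call.wav timestamps adds per-word timing.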