ASR configuration

Every parameter you can pass to `POST /v1/audio/transcriptions`, in one place. Two are required (`file` or `url`, and `model`); the rest have safe defaults.

Required

Parameter	Type	Description
`file`	file upload	Audio file (WAV, MP3, M4A, OGG, FLAC, WebM). One of `file` or `url` required.
`url`	string	Public audio URL. Can't be combined with `file`.
`model`	string	One of `zero-indic`, `zero-universal`, `zero-med`, `zero-codeswitch`.
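
The either-or rule for `file` and `url` is easy to check on the client before a request is sent. A minimal Python sketch, assuming a hypothetical helper named `check_required` (it is not part of any SDK); the model names are the four listed above, and the file upload itself is left to whichever HTTP client you use:

```python
VALID_MODELS = {"zero-indic", "zero-universal", "zero-med", "zero-codeswitch"}


def check_required(model, file_path=None, url=None):
    """Return the required text fields, or raise ValueError.

    Hypothetical client-side helper: it only mirrors the rules in the
    table above and never calls the API.
    """
    # Exactly one audio source must be given: both or neither is an error.
    if (file_path is None) == (url is None):
        raise ValueError("pass exactly one of file or url")
    if model not in VALID_MODELS:
        raise ValueError(f"unknown model: {model!r}")
    fields = {"model": model}
    if url is not None:
        fields["url"] = url
    return fields
```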

Language & output

Parameter	Type, default	Description
`language_code`	string, `"auto"`	Language hint. ISO 639-1 (`hi`, `ta`, `kn`, `bn`, `mr`, `te`, `gu`, `pa`, `ml`, `or`, `ur`, `en`) or full name (`Hindi`, `Tamil`). Use `auto` to detect.
`response_format`	string, `"verbose_json"`	`verbose_json` returns the full response (segments, NLP, timing, language). `json` returns a minimal, OpenAI-compatible `{"text": "..."}`.
`output_script`	string, `"auto"`	Transliterate to a different script without changing language. `Devanagari`, `Bengali`, `Telugu`, `Tamil`, `Kannada`, `Latin`, `ITRANS`. Powered by aksharamukha, no LLM, no latency cost.
`output_language`	string, none	Translate the transcript via Gemini. Result lives in `nlp_analysis.translation`. Accepts ISO codes or full names. For script change only, use `output_script`.
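
Because a translation is returned under `nlp_analysis.translation` rather than replacing the transcript, client code has to choose which string to show. A sketch, assuming the `verbose_json` body has already been parsed into a dict and that `translation` is a plain string (its exact shape is not specified here); `display_text` is a hypothetical name:

```python
def display_text(response, prefer_translation=True):
    """Pick the translated text when present, else the top-level transcript."""
    nlp = response.get("nlp_analysis") or {}
    translation = nlp.get("translation")
    if prefer_translation and translation:
        return translation
    return response.get("text", "")
```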

Why setting language_code matters

On clips shorter than 5 seconds, automatic language detection is error-prone, and the model can "translate" the speech instead of transcribing it. Setting `language_code` explicitly avoids both problems.
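
One way to act on this is to measure clip length before uploading and flag short clips that would be sent with automatic detection. A sketch for PCM WAV input only, using Python's standard `wave` module; the 5-second threshold comes from the paragraph above, while `needs_language_lock` is a hypothetical helper:

```python
import wave

SHORT_CLIP_SECONDS = 5.0  # threshold taken from the paragraph above


def wav_duration_seconds(path):
    """Duration of a PCM WAV file: frame count divided by sample rate."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def needs_language_lock(path, language_code="auto"):
    """True when a short clip would be sent with automatic detection."""
    if language_code != "auto":
        return False
    return wav_duration_seconds(path) < SHORT_CLIP_SECONDS
```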

Example: transcribe Hindi audio and romanise the output into Latin script:

curl -X POST https://asr.shunyalabs.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer $SHUNYALABS_API_KEY" \
  -F "file=@hindi-call.wav" \
  -F "model=zero-indic" \
  -F "language_code=hi" \
  -F "output_script=Latin"

Returns: "namaste mohammad ji ye ek zaruri call hai" (Hindi pronunciation, written in Latin letters).

Audio pre-processing

Parameter	Type, default	Description
`use_vad_chunking`	bool, `true`	Split audio at natural speech pauses using Voice Activity Detection. Silence is stripped before the model sees it. Set `false` only for continuous dictation with no pauses.
`chunk_size`	int seconds, `30`	Applies when `use_vad_chunking=false`. Range 1-60. Sets a fixed segment length.
`enable_denoising`	bool, `false`	Runs noise reduction before transcription. Skipped automatically for files larger than 5 MB to avoid added latency. Use for call-centre and field recordings.
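
The interactions in this table (`chunk_size` only applies without VAD chunking, and denoising is skipped for large files) can be checked before upload. A sketch under stated assumptions: "5 MB" is read as 5 × 1024 × 1024 bytes, booleans are sent as the strings `true`/`false`, and `preprocessing_fields` is a hypothetical helper name:

```python
DENOISE_SKIP_BYTES = 5 * 1024 * 1024  # assumption: "5 MB" read as 5 MiB


def preprocessing_fields(file_size_bytes, use_vad_chunking=True,
                         chunk_size=30, enable_denoising=False):
    """Build the pre-processing form fields and collect warnings."""
    warnings = []
    fields = {"use_vad_chunking": "true" if use_vad_chunking else "false"}
    if not use_vad_chunking:
        # chunk_size is only meaningful with fixed-length segmentation.
        if not 1 <= chunk_size <= 60:
            raise ValueError("chunk_size must be between 1 and 60 seconds")
        fields["chunk_size"] = str(chunk_size)
    if enable_denoising:
        fields["enable_denoising"] = "true"
        if file_size_bytes > DENOISE_SKIP_BYTES:
            warnings.append("file is over 5 MB; denoising will be skipped")
    return fields, warnings
```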

Segmentation & alignment

Parameter	Type, default	Description
`word_timestamps`	bool, `false`	Adds per-word `start`, `end`, and `score` to every segment. Uses CTC forced alignment on an ONNX model, runs on GPU, no external call.
`enable_diarization`	bool, `false`	Speaker-level segmentation. Every segment gets a `speaker: SPEAKER_XX` label; the top-level transcript is prefixed with speaker tags; `speakers` array lists all unique speakers detected. Segments are capped at 30 s each for quality.
`enable_speaker_identification` + `project`	bool, `false`	Resolves anonymous `SPEAKER_XX` labels to registered names. Requires `enable_diarization=true` and pre-registered voice profiles. `project` scopes the speaker library, which is useful for per-customer isolation. See the speaker registration API.
`enable_emotion_diarization`	bool, `false`	Detects the dominant emotion in each segment and adds an `emotion` field. Works alongside standard diarization.
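
Diarized output is often easier to read once consecutive segments from the same speaker are merged into turns. A sketch that relies only on the `speaker` and `text` segment fields described above; `speaker_turns` is a hypothetical helper that works on an already-parsed response dict, and the `UNKNOWN` placeholder is an assumption for segments without a label:

```python
def speaker_turns(response):
    """Merge consecutive same-speaker segments into (speaker, text) turns."""
    turns = []
    for segment in response.get("segments", []):
        speaker = segment.get("speaker", "UNKNOWN")
        text = segment.get("text", "").strip()
        if not text:
            continue  # skip empty segments
        if turns and turns[-1][0] == speaker:
            # Same speaker as the previous turn: extend it.
            turns[-1] = (speaker, turns[-1][1] + " " + text)
        else:
            turns.append((speaker, text))
    return turns
```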

Intelligence layer

Parameter	Type, default	Description
`enable_intent_detection` + `intent_choices`	bool, `false`	Classifies the overall transcript intent via Gemini. Optionally constrain to a list of allowed intents with `intent_choices` (JSON array). Result in `nlp_analysis.intent`: `label`, `confidence`, `reasoning`.
`enable_summarization` + `summary_max_length`	bool, `false`	Generates a concise transcript summary via Gemini. `summary_max_length` is an approximate word count (default 150). Result in `nlp_analysis.summary`.
`enable_sentiment_analysis`	bool, `false`	Returns `nlp_analysis.sentiment` with a `label` (positive/negative/neutral), a numeric `score`, and a short `explanation`.
`enable_keyterm_normalization` + `keyterm_keywords`	bool, `false`	Normalises domain-specific terms the ASR model might render informally. Optionally focus on specific terms with `keyterm_keywords` (JSON array). Output preserves the original language. Result in `nlp_analysis.normalized_text`.
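
Downstream code usually wants to act on an intent only when the classifier is reasonably sure. A sketch that reads `nlp_analysis.intent` and falls back to a default label below a cutoff. The 0.6 cutoff, the assumption that `confidence` is a number between 0 and 1, and the names `route_intent` and `needs_review` are illustrative and not defined by the API:

```python
def route_intent(response, min_confidence=0.6, fallback="needs_review"):
    """Return the intent label, or a fallback when it is missing or weak."""
    intent = (response.get("nlp_analysis") or {}).get("intent") or {}
    label = intent.get("label")
    confidence = intent.get("confidence")
    if label is None or not isinstance(confidence, (int, float)):
        return fallback
    return label if confidence >= min_confidence else fallback
```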

Example: classify the call into one of four intents:

curl -X POST https://asr.shunyalabs.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer $SHUNYALABS_API_KEY" \
  -F "file=@call.wav" \
  -F "model=zero-indic" \
  -F "enable_intent_detection=true" \
  -F 'intent_choices=["complaint","inquiry","service_request","compliment"]'

Redaction

Parameter	Type, default	Description
`enable_profanity_hashing`	bool, `false`	Replaces profane words with `****` in-place in both `segments[].text` and the top-level `text`. Uses Gemini for detection.
`hash_keywords`	JSON array, none	Masks a specific list of words/phrases using regex (case-insensitive, no LLM). Independent of profanity hashing. Use for account numbers, card numbers, OTP, and any custom sensitive terms.

Example: mask the listed keywords (here the phrases "account number", "card number", and "OTP") wherever they appear in the transcript:

curl -X POST https://asr.shunyalabs.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer $SHUNYALABS_API_KEY" \
  -F "file=@call.wav" \
  -F "model=zero-indic" \
  -F 'hash_keywords=["account number","card number","OTP"]'
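
To preview what a keyword list would mask before sending any audio, the matching can be approximated locally. This is only an illustration of case-insensitive regex masking, not the service's implementation; the `****` replacement string, the whole-word boundary handling, and the name `mask_keywords` are assumptions:

```python
import re


def mask_keywords(text, keywords, mask="****"):
    """Replace each keyword, matched case-insensitively, with a mask."""
    if not keywords:
        return text
    # Longest first so a long phrase wins over a shorter overlapping one.
    ordered = sorted(keywords, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(k) for k in ordered) + r")(?!\w)",
        re.IGNORECASE,
    )
    return pattern.sub(mask, text)
```

Note that a keyword list like the one above masks the phrases themselves; masking the digits that follow them would need a different pattern.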

Legacy / compatibility

Parameter	Type, default	Description
`task`	string, `"transcribe"`	OpenAI-compatible field. Only `"transcribe"` is supported today.

Putting it all together

curl -X POST https://asr.shunyalabs.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer $SHUNYALABS_API_KEY" \
  -F "file=@call.wav" \
  -F "model=zero-indic" \
  -F "language_code=hi" \
  -F "enable_diarization=true" \
  -F "enable_speaker_identification=true" \
  -F "project=support_team" \
  -F "enable_emotion_diarization=true" \
  -F "word_timestamps=true" \
  -F "enable_intent_detection=true" \
  -F 'intent_choices=["complaint","inquiry","service_request"]' \
  -F "enable_summarization=true" \
  -F "enable_sentiment_analysis=true" \
  -F "enable_denoising=true"
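
The same request can be assembled programmatically. Multipart form values are strings, so booleans and arrays need encoding first. A sketch, assuming booleans are sent as `true`/`false` and arrays as JSON text, as in the curl examples above; `encode_form_fields` is a hypothetical helper, and the resulting dict can be passed as form data by any HTTP client alongside the file upload:

```python
import json


def encode_form_fields(params):
    """Convert Python values to the string form used in the curl examples."""
    if params.get("enable_speaker_identification") and not params.get("enable_diarization"):
        raise ValueError("enable_speaker_identification requires enable_diarization=true")
    fields = {}
    for name, value in params.items():
        if value is None:
            continue  # leave unset so the server default applies
        if isinstance(value, bool):
            fields[name] = "true" if value else "false"
        elif isinstance(value, (list, tuple)):
            # Array parameters such as intent_choices are sent as JSON text.
            fields[name] = json.dumps(list(value), separators=(",", ":"))
        else:
            fields[name] = str(value)
    return fields
```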

A sane default set

For contact-centre transcription with agent assist, start with `model=zero-indic`, `enable_diarization=true`, `enable_intent_detection=true`, and `enable_sentiment_analysis=true`. Add `word_timestamps=true` if you need precise, word-level search.