ASR configuration

Every parameter you can pass to POST /v1/audio/transcriptions, in one place. Two are required (file or url, and model); the rest have safe defaults.

Required

| Field | Type | Description |
| --- | --- | --- |
| file | file upload | Audio file (WAV, MP3, M4A, OGG, FLAC, WebM). One of file or url required. |
| url | string | Public audio URL. Can't be combined with file. |
| model | string | One of zero-indic, zero-universal, zero-med, zero-codeswitch. |
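
Because file and url are mutually exclusive, a script that accepts either kind of input can choose the form field from a single argument. The sketch below is a hypothetical helper, not part of the API: the name source_field and the example.com URL are illustrative, and it assumes that anything starting with http:// or https:// is a remote URL.

```shell
# Hypothetical helper: print the form field for an audio source.
# Assumption: arguments starting with http:// or https:// are remote URLs;
# everything else is treated as a local file path.
source_field() {
  case "$1" in
    http://*|https://*) printf 'url=%s\n' "$1" ;;
    *)                  printf 'file=@%s\n' "$1" ;;
  esac
}

# Usage (sends a request, so it is left commented out here):
# curl -X POST https://asr.shunyalabs.ai/v1/audio/transcriptions \
#   -H "Authorization: Bearer $SHUNYALABS_API_KEY" \
#   -F "$(source_field https://example.com/call.wav)" \
#   -F "model=zero-universal"
```

Exactly one of the two fields is produced per call, so a request built this way never combines file with url.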

Language & output

| Parameter | Type, default | Description |
| --- | --- | --- |
| language_code | string, "auto" | Language hint. ISO 639-1 (hi, ta, kn, bn, mr, te, gu, pa, ml, or, ur, en) or full name (Hindi, Tamil). Use auto to detect. |
| response_format | string, "verbose_json" | verbose_json returns the full response (segments, NLP, timing, language). json returns a minimal, OpenAI-compatible {"text": "..."}. |
| output_script | string, "auto" | Transliterate to a different script without changing language. Devanagari, Bengali, Telugu, Tamil, Kannada, Latin, ITRANS. Powered by Aksharamukha, no LLM, no latency cost. |
| output_language | string, none | Translate the transcript via Gemini. Result lives in nlp_analysis.translation. Accepts ISO codes or full names. For script change only, use output_script. |

Why setting language_code matters
On clips shorter than 5 seconds, language detection is error-prone and the model can "translate" instead of transcribe. Setting language_code explicitly avoids both.

Example: transcribe Hindi audio and romanise the script to Latin:

```shell
curl -X POST https://asr.shunyalabs.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer $SHUNYALABS_API_KEY" \
  -F "file=@hindi-call.wav" \
  -F "model=zero-indic" \
  -F "language_code=hi" \
  -F "output_script=Latin"
```

Returns: "namaste mohammad ji ye ek zaruri call hai" (Hindi pronunciation, written in Latin letters).

Audio pre-processing

| Parameter | Type, default | Description |
| --- | --- | --- |
| use_vad_chunking | bool, true | Split audio at natural speech pauses using Voice Activity Detection. Silence is stripped before the model sees it. Set false only for continuous dictation with no pauses. |
| chunk_size | int seconds, 30 | Applies when use_vad_chunking=false. Range 1-60. Sets a fixed segment length. |
| enable_denoising | bool, false | Runs noise reduction before transcription. Skipped automatically for files > 5 MB to avoid latency. Use for call-centre and field recordings. |
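
Two of the constraints above can be checked on the client before a request is sent: the 1-60 range for chunk_size, and the 5 MB threshold beyond which denoising is skipped. The following is a minimal sketch with hypothetical helper names (valid_chunk_size, denoising_applies). It assumes "5 MB" means 5 × 1024 × 1024 bytes, which the table does not specify.

```shell
# Hypothetical check: is this a valid chunk_size (whole seconds, 1-60)?
valid_chunk_size() {
  case "$1" in
    ''|*[!0-9]*)
      echo "chunk_size must be a whole number of seconds" >&2
      return 1 ;;
  esac
  if [ "$1" -lt 1 ] || [ "$1" -gt 60 ]; then
    echo "chunk_size must be between 1 and 60" >&2
    return 1
  fi
}

# Hypothetical check: succeeds when the file is small enough for
# enable_denoising to run.
# Assumption: the 5 MB limit is 5 * 1024 * 1024 bytes.
denoising_applies() {
  bytes=$(($(wc -c < "$1")))   # arithmetic expansion strips wc's padding
  [ "$bytes" -le $((5 * 1024 * 1024)) ]
}

# Usage (sends a request, so it is left commented out here):
# valid_chunk_size 20 && curl -X POST https://asr.shunyalabs.ai/v1/audio/transcriptions \
#   -H "Authorization: Bearer $SHUNYALABS_API_KEY" \
#   -F "file=@dictation.wav" -F "model=zero-indic" \
#   -F "use_vad_chunking=false" -F "chunk_size=20"
```

Note that chunk_size is only sent together with use_vad_chunking=false, since it has no effect otherwise.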

Segmentation & alignment

| Parameter | Type, default | Description |
| --- | --- | --- |
| word_timestamps | bool, false | Adds per-word start, end, and score to every segment. Uses CTC forced alignment on an ONNX model, runs on GPU, no external call. |
| enable_diarization | bool, false | Speaker-level segmentation. Every segment gets a speaker: SPEAKER_XX label; the top-level transcript is prefixed with speaker tags; speakers array lists all unique speakers detected. Segments are capped at 30 s each for quality. |
| enable_speaker_identification + project | bool, false | Resolves anonymous SPEAKER_XX labels to registered names. Requires enable_diarization=true and pre-registered voice profiles. project scopes the speaker library, useful for per-customer isolation. See the Speaker registration API. |
| enable_emotion_diarization | bool, false | Detects the dominant emotion in each segment and adds an emotion field. Works alongside standard diarization. |

Intelligence layer

| Parameter | Type, default | Description |
| --- | --- | --- |
| enable_intent_detection + intent_choices | bool, false | Classifies the overall transcript intent via Gemini. Optionally constrain to a list of allowed intents with intent_choices (JSON array). Result in nlp_analysis.intent: label, confidence, reasoning. |
| enable_summarization + summary_max_length | bool, false | Generates a concise transcript summary via Gemini. summary_max_length is an approximate word count (default 150). Result in nlp_analysis.summary. |
| enable_sentiment_analysis | bool, false | Returns nlp_analysis.sentiment with a label (positive/negative/neutral), a numeric score, and a short explanation. |
| enable_keyterm_normalization + keyterm_keywords | bool, false | Normalises domain-specific terms the ASR model might render informally. Optionally focus on specific terms with keyterm_keywords (JSON array). Output preserves the original language. Result in nlp_analysis.normalized_text. |

Example: classify the call into one of four intents:

```shell
curl -X POST https://asr.shunyalabs.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer $SHUNYALABS_API_KEY" \
  -F "file=@call.wav" \
  -F "model=zero-indic" \
  -F "enable_intent_detection=true" \
  -F 'intent_choices=["complaint","inquiry","service_request","compliment"]'
```
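
Hand-quoting a JSON array inside a form field gets awkward once the list comes from a variable. One possible approach is a small helper that builds the array from shell arguments. The name json_array is hypothetical, and the sketch assumes the values contain no double quotes, backslashes, or control characters, because it does no escaping.

```shell
# Hypothetical helper: turn shell arguments into a JSON array of strings.
# Assumption: values contain no double quotes, backslashes, or control
# characters; nothing is escaped here.
json_array() {
  out=""
  for item in "$@"; do
    out="${out:+$out,}\"$item\""   # add a comma only between items
  done
  printf '[%s]\n' "$out"
}

# Usage (sends a request, so it is left commented out here):
# curl -X POST https://asr.shunyalabs.ai/v1/audio/transcriptions \
#   -H "Authorization: Bearer $SHUNYALABS_API_KEY" \
#   -F "file=@call.wav" -F "model=zero-indic" \
#   -F "enable_intent_detection=true" \
#   -F "intent_choices=$(json_array complaint inquiry service_request)"
```

The same helper would apply to the other parameters documented as JSON arrays, keyterm_keywords and hash_keywords.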

Redaction

| Parameter | Type, default | Description |
| --- | --- | --- |
| enable_profanity_hashing | bool, false | Replaces profane words with **** in-place in both segments[].text and the top-level text. Uses Gemini for detection. |
| hash_keywords | JSON array, none | Masks a specific list of words/phrases using regex (case-insensitive, no LLM). Independent of profanity hashing. Use for account numbers, card numbers, OTP, and any custom sensitive terms. |

Example: mask account numbers, card numbers, and OTPs in the transcript:

```shell
curl -X POST https://asr.shunyalabs.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer $SHUNYALABS_API_KEY" \
  -F "file=@call.wav" \
  -F "model=zero-indic" \
  -F 'hash_keywords=["account number","card number","OTP"]'
```

Legacy / compatibility

| Parameter | Type, default | Description |
| --- | --- | --- |
| task | string, "transcribe" | OpenAI-compatible field. Only "transcribe" is supported today. |

Putting it all together

```shell
curl -X POST https://asr.shunyalabs.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer $SHUNYALABS_API_KEY" \
  -F "file=@call.wav" \
  -F "model=zero-indic" \
  -F "language_code=hi" \
  -F "enable_diarization=true" \
  -F "enable_speaker_identification=true" \
  -F "project=support_team" \
  -F "enable_emotion_diarization=true" \
  -F "word_timestamps=true" \
  -F "enable_intent_detection=true" \
  -F 'intent_choices=["complaint","inquiry","service_request"]' \
  -F "enable_summarization=true" \
  -F "enable_sentiment_analysis=true" \
  -F "enable_denoising=true"
```

A sane default set
For a contact-centre transcription with agent-assist, start with: model=zero-indic, enable_diarization=true, enable_intent_detection=true, enable_sentiment_analysis=true. Add word_timestamps=true if you need precise search.
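
The starter set above could be wrapped in a reusable function along the following lines. This is a sketch, not an official client: the function name transcribe_contact_centre and the WORD_TIMESTAMPS and CURL variables are hypothetical conveniences, and only the endpoint and form fields come from this page.

```shell
# Hypothetical wrapper for the contact-centre starter set.
# $1 is the path to a local audio file.
# Set WORD_TIMESTAMPS=true to add per-word timing for precise search.
# CURL can be overridden (for example CURL=echo) to print the command
# instead of sending it.
transcribe_contact_centre() {
  ${CURL:-curl} -sS -X POST https://asr.shunyalabs.ai/v1/audio/transcriptions \
    -H "Authorization: Bearer $SHUNYALABS_API_KEY" \
    -F "file=@$1" \
    -F "model=zero-indic" \
    -F "enable_diarization=true" \
    -F "enable_intent_detection=true" \
    -F "enable_sentiment_analysis=true" \
    -F "word_timestamps=${WORD_TIMESTAMPS:-false}"
}

# Usage (sends a request, so it is left commented out here):
# transcribe_contact_centre call.wav
# WORD_TIMESTAMPS=true transcribe_contact_centre call.wav
```

Sending word_timestamps=false matches the documented default, so the request is equivalent to the four-parameter set unless WORD_TIMESTAMPS is set.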