ASR features
The intelligence layer on top of Zero STT. Every feature is optional and is switched on by its own request parameter, usually a boolean flag, and features combine freely. All results are returned in the same JSON response as the transcript.
1. Diarization
"Who spoke when." Adds speaker: SPEAKER_XX to every segment and a top-level speakers array. The full text field is prefixed with [SPEAKER_XX] tags per segment.
Request:
curl -X POST https://asr.shunyalabs.ai/v1/audio/transcriptions \
-H "Authorization: Bearer $SHUNYALABS_API_KEY" \
-F "file=@meeting.wav" \
-F "model=zero-indic" \
-F "enable_diarization=true"
Response:
{
"text": "[SPEAKER_00] नमस्ते, आप कैसे हैं? [SPEAKER_01] मैं ठीक हूँ, धन्यवाद।",
"segments": [
{ "start": 0.5, "end": 3.2, "text": "नमस्ते, आप कैसे हैं?", "speaker": "SPEAKER_00" },
{ "start": 4.1, "end": 6.8, "text": "मैं ठीक हूँ, धन्यवाद।", "speaker": "SPEAKER_01" }
],
"speakers": ["SPEAKER_00", "SPEAKER_01"]
}
- Segments are capped at 30 seconds each to maintain transcription quality.
- Works with any number of speakers, but performs best with 2-6 distinct voices.
- Combine with enable_speaker_identification to get real names instead of anonymous labels.
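Once a diarized response is in hand, the segments array is usually the easiest thing to work with. The following is a minimal Python sketch, not part of the API, that sums talk time per speaker. The helper name talk_time_by_speaker and the "UNKNOWN" fallback label are assumptions made for illustration; the field names come from the sample response above.

```python
from collections import defaultdict

def talk_time_by_speaker(response: dict) -> dict:
    """Sum segment durations (end - start) for each speaker label."""
    totals = defaultdict(float)
    for seg in response.get("segments", []):
        # Assumption: segments without a speaker field are grouped under "UNKNOWN".
        speaker = seg.get("speaker", "UNKNOWN")
        totals[speaker] += seg["end"] - seg["start"]
    # Round to milliseconds to hide floating-point noise.
    return {spk: round(sec, 3) for spk, sec in totals.items()}

# Same shape as the diarization response shown above (text shortened).
sample = {
    "segments": [
        {"start": 0.5, "end": 3.2, "text": "...", "speaker": "SPEAKER_00"},
        {"start": 4.1, "end": 6.8, "text": "...", "speaker": "SPEAKER_01"},
    ],
    "speakers": ["SPEAKER_00", "SPEAKER_01"],
}
print(talk_time_by_speaker(sample))
```

Because segments are capped at 30 seconds, a long monologue shows up as several consecutive segments with the same speaker label; summing per label handles that case without extra logic.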
2. Speaker identification
Turns anonymous SPEAKER_00 labels into registered names. Requires diarization on and at least one registered voice profile.
Step 1: register a speaker (use a 5-15 second clip of the speaker alone, no background music):
curl -X POST https://asr.shunyalabs.ai/v1/speakers/register \
-H "Authorization: Bearer $SHUNYALABS_API_KEY" \
-F "name=Priya" \
-F "file=@priya_sample.wav" \
-F "project=support_team"
Step 2: transcribe with identification on:
curl -X POST https://asr.shunyalabs.ai/v1/audio/transcriptions \
-H "Authorization: Bearer $SHUNYALABS_API_KEY" \
-F "file=@call.wav" \
-F "model=zero-indic" \
-F "enable_diarization=true" \
-F "enable_speaker_identification=true" \
-F "project=support_team"
Unrecognised speakers stay as SPEAKER_XX. The project parameter isolates per-customer or per-team voice libraries.
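Since unrecognised speakers keep their SPEAKER_XX label, a client can tell matched and unmatched voices apart by the label format alone. Below is a hedged Python sketch of that check. It assumes that registered names appear in the top-level speakers array in place of the anonymous label and that anonymous labels are always SPEAKER_ followed by digits; neither detail is confirmed beyond the examples on this page, and split_speakers is a hypothetical helper name.

```python
import re

# Assumption: anonymous labels are "SPEAKER_" plus digits, as in the examples above.
ANON_LABEL = re.compile(r"^SPEAKER_\d+$")

def split_speakers(response: dict) -> tuple[list[str], list[str]]:
    """Return (identified_names, anonymous_labels) from the speakers array."""
    identified, anonymous = [], []
    for label in response.get("speakers", []):
        if ANON_LABEL.match(label):
            anonymous.append(label)
        else:
            identified.append(label)
    return identified, anonymous

# Illustrative response: one registered voice matched, one did not.
sample = {"speakers": ["Priya", "SPEAKER_01"]}
print(split_speakers(sample))
```

A non-empty anonymous list is a reasonable signal that another voice profile may need to be registered in the same project.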
3. Emotion diarization
Detects the dominant emotion in each segment's audio and adds an emotion field to the segment. It works independently of speaker diarization, though the two are commonly used together.
Request:
curl -X POST https://asr.shunyalabs.ai/v1/audio/transcriptions \
-H "Authorization: Bearer $SHUNYALABS_API_KEY" \
-F "file=@call.wav" \
-F "model=zero-indic" \
-F "enable_diarization=true" \
-F "enable_emotion_diarization=true"
Response:
{
"segments": [
{ "start": 0.5, "end": 3.2, "text": "...", "speaker": "SPEAKER_00", "emotion": "angry" },
{ "start": 4.1, "end": 6.8, "text": "...", "speaker": "SPEAKER_01", "emotion": "neutral" }
]
}
4. Intent detection
Classifies the overall transcript intent using Gemini. Pass intent_choices to constrain to your taxonomy, or leave off for free-form detection.
Request:
curl -X POST https://asr.shunyalabs.ai/v1/audio/transcriptions \
-H "Authorization: Bearer $SHUNYALABS_API_KEY" \
-F "file=@call.wav" \
-F "model=zero-indic" \
-F "enable_intent_detection=true" \
-F 'intent_choices=["complaint","inquiry","service_request","compliment"]'
Response:
{
"nlp_analysis": {
"intent": {
"label": "service_request",
"confidence": 0.92,
"reasoning": "Caller is requesting roadside assistance for a broken-down vehicle"
}
}
}
5. Sentiment analysis
Overall sentiment of the transcript. Returns a label, a numeric score (-1 to 1), and an explanation.
Request:
curl -X POST https://asr.shunyalabs.ai/v1/audio/transcriptions \
-H "Authorization: Bearer $SHUNYALABS_API_KEY" \
-F "file=@call.wav" \
-F "model=zero-indic" \
-F "enable_sentiment_analysis=true"
Response:
{
"nlp_analysis": {
"sentiment": {
"label": "negative",
"score": -0.72,
"explanation": "Customer expresses frustration about the vehicle being stranded."
}
}
}
6. Summarization
Concise summary of the transcript. summary_max_length sets an approximate word cap (default 150).
Request:
curl -X POST https://asr.shunyalabs.ai/v1/audio/transcriptions \
-H "Authorization: Bearer $SHUNYALABS_API_KEY" \
-F "file=@call.wav" \
-F "model=zero-indic" \
-F "enable_summarization=true" \
-F "summary_max_length=50"
Response:
{
"nlp_analysis": {
"summary": "Customer called about a vehicle breakdown. Agent confirmed the complaint was registered and promised a technician within the hour."
}
}
7. Keyterm normalization
Cleans up domain-specific terms that the ASR model might render informally, for example emi → EMI or nach mandate → NACH mandate. Preserves the original language. Optionally, focus on a specific glossary with keyterm_keywords.
Request:
curl -X POST https://asr.shunyalabs.ai/v1/audio/transcriptions \
-H "Authorization: Bearer $SHUNYALABS_API_KEY" \
-F "file=@call.wav" \
-F "model=zero-indic" \
-F "enable_keyterm_normalization=true" \
-F 'keyterm_keywords=["EMI","NACH mandate","bounce charge"]'
Normalised text is returned in nlp_analysis.normalized_text, alongside the original transcript.
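Because every feature is optional, a client cannot assume that any particular nlp_analysis key is present. The Python sketch below shows one defensive way to read the fields documented so far (intent, sentiment, summary, normalized_text), falling back to the raw transcript when no normalized text was requested. The helper name nlp_fields and the flattened output shape are assumptions for illustration only, and the sample transcript is invented.

```python
def nlp_fields(response: dict) -> dict:
    """Flatten the optional nlp_analysis block into a dict of plain values."""
    nlp = response.get("nlp_analysis") or {}
    intent = nlp.get("intent") or {}
    sentiment = nlp.get("sentiment") or {}
    return {
        "intent": intent.get("label"),
        "sentiment": sentiment.get("label"),
        "sentiment_score": sentiment.get("score"),
        "summary": nlp.get("summary"),
        # Prefer normalized text; fall back to the original transcript.
        "display_text": nlp.get("normalized_text") or response.get("text"),
    }

# Illustrative response with only sentiment and normalization enabled.
sample = {
    "text": "emi bounce ho gaya",
    "nlp_analysis": {
        "normalized_text": "EMI bounce ho gaya",
        "sentiment": {"label": "negative", "score": -0.72},
    },
}
print(nlp_fields(sample))
```

Features that were not enabled simply come back as None, so downstream code can test for presence instead of catching KeyError.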
8. Translation (output_language)
Translates the full transcript into a target language via Gemini. Set output_language to an ISO 639-1 code (en, hi) or a full language name (English).
Request:
curl -X POST https://asr.shunyalabs.ai/v1/audio/transcriptions \
-H "Authorization: Bearer $SHUNYALABS_API_KEY" \
-F "file=@call.wav" \
-F "model=zero-indic" \
-F "output_language=en"
Response:
{
"nlp_analysis": {
"translation": "Hello, this is an urgent call."
}
}
9. Profanity hashing
Masks profane words with **** in-place in both the top-level text and segment text.
Request:
curl -X POST https://asr.shunyalabs.ai/v1/audio/transcriptions \
-H "Authorization: Bearer $SHUNYALABS_API_KEY" \
-F "file=@call.wav" \
-F "model=zero-indic" \
-F "enable_profanity_hashing=true"
10. Custom keyword redaction (hash_keywords)
Regex-based masking of specific terms or phrases. No LLM, fast, deterministic. Use for PII and domain-sensitive tokens.
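To get a feel for what deterministic keyword masking does before sending any audio, the idea can be approximated locally. The Python sketch below is not the service's implementation: the function mask_keywords, the case-insensitive matching, and the longest-keyword-first ordering are all assumptions chosen for illustration, and the sample sentence is invented.

```python
import re

def mask_keywords(text: str, keywords: list[str], mask: str = "****") -> str:
    """Replace every occurrence of any keyword with the mask string."""
    if not keywords:
        return text
    # Longest keywords first so a longer phrase wins over a shorter overlap.
    ordered = sorted(keywords, key=len, reverse=True)
    # re.escape keeps keywords literal; case-insensitivity is an assumption.
    pattern = re.compile("|".join(re.escape(k) for k in ordered), re.IGNORECASE)
    return pattern.sub(mask, text)

masked = mask_keywords(
    "Your OTP is 4321 and your card number ends in 99",
    ["card number", "OTP"],
)
print(masked)  # Your **** is 4321 and your **** ends in 99
```

The sketch masks only the keyword itself and leaves adjacent values such as digits untouched.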
Request:
curl -X POST https://asr.shunyalabs.ai/v1/audio/transcriptions \
-H "Authorization: Bearer $SHUNYALABS_API_KEY" \
-F "file=@call.wav" \
-F "model=zero-indic" \
-F 'hash_keywords=["account number","card number","OTP","aadhaar"]'
Response:
{
"text": "आपका **** 4321 है और आपका **** कल भेजा गया था",
"segments": [
{ "start": 0.0, "end": 5.0, "text": "आपका **** 4321 है और आपका **** कल भेजा गया था" }
]
}
11. Word timestamps
Per-word timing with a confidence score. Adds a words array to every segment. Runs via ONNX CTC forced alignment on GPU, with no external call and negligible latency overhead.
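One common use of the per-word score is flagging words that may need human review. The Python sketch below assumes the response shape documented in this section (each word carrying word, start, end and score); the 0.8 threshold and the helper name low_confidence_words are arbitrary choices for illustration, not recommended values.

```python
def low_confidence_words(response: dict, threshold: float = 0.8) -> list[dict]:
    """Collect word entries whose score falls below the threshold."""
    flagged = []
    for seg in response.get("segments", []):
        # Segments have no words array unless word_timestamps was requested.
        for word in seg.get("words", []):
            if word.get("score", 0.0) < threshold:
                flagged.append(word)
    return flagged

# Same shape as this section's sample response.
sample = {
    "segments": [
        {
            "start": 0.51,
            "end": 5.70,
            "text": "नमस्ते मोहम्मद जी",
            "words": [
                {"word": "नमस्ते", "start": 0.532, "end": 0.932, "score": 0.85},
                {"word": "मोहम्मद", "start": 1.012, "end": 1.412, "score": 0.72},
                {"word": "जी", "start": 1.492, "end": 1.652, "score": 0.91},
            ],
        }
    ]
}
for w in low_confidence_words(sample):
    print(w["word"], w["start"], w["end"], w["score"])
```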
Request:
curl -X POST https://asr.shunyalabs.ai/v1/audio/transcriptions \
-H "Authorization: Bearer $SHUNYALABS_API_KEY" \
-F "file=@call.wav" \
-F "model=zero-indic" \
-F "word_timestamps=true"
Response:
{
"segments": [
{
"start": 0.51,
"end": 5.70,
"text": "नमस्ते मोहम्मद जी",
"words": [
{ "word": "नमस्ते", "start": 0.532, "end": 0.932, "score": 0.85 },
{ "word": "मोहम्मद", "start": 1.012, "end": 1.412, "score": 0.72 },
{ "word": "जी", "start": 1.492, "end": 1.652, "score": 0.91 }
]
}
]
}
Combining features
Every feature above can be enabled on the same request. Here's a full contact-centre configuration:
curl -X POST https://asr.shunyalabs.ai/v1/audio/transcriptions \
-H "Authorization: Bearer $SHUNYALABS_API_KEY" \
-F "file=@call.wav" \
-F "model=zero-indic" \
-F "language_code=hi" \
-F "enable_diarization=true" \
-F "enable_speaker_identification=true" \
-F "project=support_team" \
-F "enable_emotion_diarization=true" \
-F "word_timestamps=true" \
-F "enable_intent_detection=true" \
-F 'intent_choices=["complaint","inquiry","service_request"]' \
-F "enable_summarization=true" \
-F "enable_sentiment_analysis=true" \
-F 'hash_keywords=["account number","card number","OTP"]' \
-F "enable_denoising=true"
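When the same configuration is assembled in code instead of on the command line, the main detail to get right is that list-valued parameters such as intent_choices and hash_keywords travel as JSON array strings inside ordinary form fields. The Python sketch below builds only the non-file fields; build_form_fields is a hypothetical helper, and the audio file itself would still be attached through whatever multipart support your HTTP client provides.

```python
import json

def build_form_fields(model, flags=(), list_params=None, extras=None):
    """Build the non-file multipart fields for a transcription request."""
    fields = {"model": model}
    # Boolean feature flags are sent as the literal string "true".
    for flag in flags:
        fields[flag] = "true"
    # List-valued parameters become compact JSON array strings,
    # matching the curl examples above.
    for name, values in (list_params or {}).items():
        fields[name] = json.dumps(values, ensure_ascii=False, separators=(",", ":"))
    # Plain string parameters such as language_code or project.
    fields.update(extras or {})
    return fields

fields = build_form_fields(
    "zero-indic",
    flags=["enable_diarization", "enable_intent_detection", "enable_summarization"],
    list_params={"intent_choices": ["complaint", "inquiry", "service_request"]},
    extras={"language_code": "hi", "project": "support_team"},
)
for name, value in fields.items():
    print(f"{name}={value}")
```

Passing ensure_ascii=False keeps non-ASCII keywords readable in the encoded string, which is useful when a glossary contains Indic-script terms.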