ASR features

The intelligence layer on top of Zero STT. Each feature is opt-in through its own request parameter (most are boolean flags), none is required, and they combine freely. All results come back in the same JSON response as the transcript.

Easier to browse?
Open the Intelligence overview for a card-based UI with jump links and collapsible request/response examples for every feature.

1. Diarization

"Who spoke when." Adds speaker: SPEAKER_XX to every segment and a top-level speakers array. The full text field is prefixed with [SPEAKER_XX] tags per segment.

Request:

shell
curl -X POST https://asr.shunyalabs.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer $SHUNYALABS_API_KEY" \
  -F "file=@meeting.wav" \
  -F "model=zero-indic" \
  -F "enable_diarization=true"

Response:

json
{
  "text": "[SPEAKER_00] नमस्ते, आप कैसे हैं? [SPEAKER_01] मैं ठीक हूँ, धन्यवाद।",
  "segments": [
    { "start": 0.5, "end": 3.2, "text": "नमस्ते, आप कैसे हैं?", "speaker": "SPEAKER_00" },
    { "start": 4.1, "end": 6.8, "text": "मैं ठीक हूँ, धन्यवाद।", "speaker": "SPEAKER_01" }
  ],
  "speakers": ["SPEAKER_00", "SPEAKER_01"]
}

  • Segments are capped at 30 seconds each to maintain transcription quality.
  • Works with any number of speakers, but performs best with 2-6 distinct voices.
  • Combine with enable_speaker_identification to get real names instead of anonymous labels.
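
As a rough sketch of consuming the diarized response, the Python below totals talk time per speaker from the segments array. It assumes the response has already been parsed into a dict shaped like the example above; talk_time_by_speaker is a hypothetical helper name, not part of any SDK.

```python
from collections import defaultdict

def talk_time_by_speaker(response: dict) -> dict:
    """Sum segment durations (in seconds) per speaker label."""
    totals = defaultdict(float)
    for seg in response.get("segments", []):
        # Assumption: segments without a speaker field are grouped under "UNKNOWN".
        speaker = seg.get("speaker", "UNKNOWN")
        totals[speaker] += seg["end"] - seg["start"]
    # Round to milliseconds to hide floating-point noise.
    return {spk: round(sec, 3) for spk, sec in totals.items()}

# Sample data copied from the diarization response shown above.
response = {
    "segments": [
        {"start": 0.5, "end": 3.2, "text": "...", "speaker": "SPEAKER_00"},
        {"start": 4.1, "end": 6.8, "text": "...", "speaker": "SPEAKER_01"},
    ],
    "speakers": ["SPEAKER_00", "SPEAKER_01"],
}
print(talk_time_by_speaker(response))
```

Because segments are capped at 30 seconds, a long monologue arrives as several segments with the same speaker label, which is why the helper accumulates rather than overwrites.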

2. Speaker identification

Turns anonymous SPEAKER_00 labels into registered names. Requires enable_diarization=true and at least one registered voice profile.

Step 1: register a speaker (use a 5-15 second clip of the speaker alone, no background music):

shell
curl -X POST https://asr.shunyalabs.ai/v1/speakers/register \
  -H "Authorization: Bearer $SHUNYALABS_API_KEY" \
  -F "name=Priya" \
  -F "file=@priya_sample.wav" \
  -F "project=support_team"

Step 2: transcribe with identification on:

shell
curl -X POST https://asr.shunyalabs.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer $SHUNYALABS_API_KEY" \
  -F "file=@call.wav" \
  -F "model=zero-indic" \
  -F "enable_diarization=true" \
  -F "enable_speaker_identification=true" \
  -F "project=support_team"

Unrecognised speakers stay as SPEAKER_XX. The project parameter isolates per-customer or per-team voice libraries.
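
Since a response can mix registered names with leftover SPEAKER_XX labels, client code may want to tell the two apart. The sketch below does this with a regular expression. It assumes that identified names appear in the same speakers array as the anonymous labels, which this page implies but does not show; split_speakers is a hypothetical helper.

```python
import re

# Anonymous labels follow the SPEAKER_XX pattern described above.
ANON_LABEL = re.compile(r"^SPEAKER_\d+$")

def split_speakers(speakers):
    """Return (identified_names, anonymous_labels) from a speakers array."""
    identified = [s for s in speakers if not ANON_LABEL.match(s)]
    anonymous = [s for s in speakers if ANON_LABEL.match(s)]
    return identified, anonymous

# Hypothetical speakers array: one registered voice, one unrecognised.
print(split_speakers(["Priya", "SPEAKER_01"]))
```

A non-empty anonymous list can be a useful signal that a voice profile is missing from the project.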

3. Emotion diarization

Detects the dominant emotion in each segment's audio and adds an emotion field to the segment. It works independently of speaker diarization, but the two are commonly used together.

Request:

shell
curl -X POST https://asr.shunyalabs.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer $SHUNYALABS_API_KEY" \
  -F "file=@call.wav" \
  -F "model=zero-indic" \
  -F "enable_diarization=true" \
  -F "enable_emotion_diarization=true"

Response:

json
{
  "segments": [
    { "start": 0.5, "end": 3.2, "text": "...", "speaker": "SPEAKER_00", "emotion": "angry" },
    { "start": 4.1, "end": 6.8, "text": "...", "speaker": "SPEAKER_01", "emotion": "neutral" }
  ]
}
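
One way to use the per-segment emotion field is to count emotions per speaker, for example to see whether the caller or the agent was angry. This is a minimal sketch over the response shape shown above; emotions_by_speaker is a hypothetical helper, and the "UNKNOWN" and "unknown" fallbacks are assumptions for segments that lack a field.

```python
from collections import Counter, defaultdict

def emotions_by_speaker(segments):
    """Count how often each emotion label occurs for each speaker."""
    counts = defaultdict(Counter)
    for seg in segments:
        speaker = seg.get("speaker", "UNKNOWN")  # diarization may be off
        emotion = seg.get("emotion", "unknown")
        counts[speaker][emotion] += 1
    return {spk: dict(c) for spk, c in counts.items()}

# Sample data copied from the emotion diarization response shown above.
segments = [
    {"start": 0.5, "end": 3.2, "text": "...", "speaker": "SPEAKER_00", "emotion": "angry"},
    {"start": 4.1, "end": 6.8, "text": "...", "speaker": "SPEAKER_01", "emotion": "neutral"},
]
print(emotions_by_speaker(segments))
```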

4. Intent detection

Classifies the overall transcript intent using Gemini. Pass intent_choices to constrain the result to your own taxonomy, or omit it for free-form detection.

Request:

shell
curl -X POST https://asr.shunyalabs.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer $SHUNYALABS_API_KEY" \
  -F "file=@call.wav" \
  -F "model=zero-indic" \
  -F "enable_intent_detection=true" \
  -F 'intent_choices=["complaint","inquiry","service_request","compliment"]'

Response:

json
{
  "nlp_analysis": {
    "intent": {
      "label": "service_request",
      "confidence": 0.92,
      "reasoning": "Caller is requesting roadside assistance for a broken-down vehicle"
    }
  }
}
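
A constrained taxonomy makes the label easy to act on. The sketch below routes a call to a queue based on the detected intent and falls back to manual review when confidence is low. The queue names, the 0.8 threshold, and the route_call helper are all illustrative assumptions, not values recommended by this API.

```python
def route_call(response, queues, min_confidence=0.8, fallback="manual_review"):
    """Pick a queue from the detected intent, or the fallback if unsure."""
    intent = response.get("nlp_analysis", {}).get("intent")
    # Missing intent or low confidence: do not trust the label.
    if not intent or intent.get("confidence", 0.0) < min_confidence:
        return fallback
    # With free-form detection the label may not be in the map at all.
    return queues.get(intent["label"], fallback)

# Sample data copied from the intent detection response shown above.
response = {
    "nlp_analysis": {
        "intent": {
            "label": "service_request",
            "confidence": 0.92,
            "reasoning": "Caller is requesting roadside assistance for a broken-down vehicle",
        }
    }
}
# Hypothetical mapping from intent labels to internal queues.
queues = {"service_request": "dispatch", "complaint": "escalations"}
print(route_call(response, queues))
```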

5. Sentiment analysis

Overall sentiment of the transcript. Returns a label, a numeric score (-1 to 1), and an explanation.

Request:

shell
curl -X POST https://asr.shunyalabs.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer $SHUNYALABS_API_KEY" \
  -F "file=@call.wav" \
  -F "model=zero-indic" \
  -F "enable_sentiment_analysis=true"

Response:

json
{
  "nlp_analysis": {
    "sentiment": {
      "label": "negative",
      "score": -0.72,
      "explanation": "Customer expresses frustration about the vehicle being stranded."
    }
  }
}
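
Because the score is numeric, it can drive simple rules without parsing the explanation. The sketch below flags a call for follow-up when the score falls at or below a cut-off. The -0.5 cut-off and the needs_escalation helper are assumptions chosen for illustration.

```python
def needs_escalation(response, threshold=-0.5):
    """Return True when the overall sentiment score is at or below threshold."""
    sentiment = response.get("nlp_analysis", {}).get("sentiment")
    if sentiment is None:
        # Sentiment analysis was not enabled for this request.
        return False
    return sentiment["score"] <= threshold

# Sample data copied from the sentiment analysis response shown above.
response = {
    "nlp_analysis": {
        "sentiment": {
            "label": "negative",
            "score": -0.72,
            "explanation": "Customer expresses frustration about the vehicle being stranded.",
        }
    }
}
print(needs_escalation(response))
```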

6. Summarization

Concise summary of the transcript. summary_max_length sets an approximate word cap (default 150).

Request:

shell
curl -X POST https://asr.shunyalabs.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer $SHUNYALABS_API_KEY" \
  -F "file=@call.wav" \
  -F "model=zero-indic" \
  -F "enable_summarization=true" \
  -F "summary_max_length=50"

Response:

json
{
  "nlp_analysis": {
    "summary": "Customer called about a vehicle breakdown. Agent confirmed the complaint was registered and promised a technician within the hour."
  }
}

7. Keyterm normalization

Cleans up domain-specific terms that the ASR model might render informally, for example emi → EMI and nach mandate → NACH mandate. Preserves the original language. Optionally, focus on a specific glossary with keyterm_keywords.

Request:

shell
curl -X POST https://asr.shunyalabs.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer $SHUNYALABS_API_KEY" \
  -F "file=@call.wav" \
  -F "model=zero-indic" \
  -F "enable_keyterm_normalization=true" \
  -F 'keyterm_keywords=["EMI","NACH mandate","bounce charge"]'

The normalised text is returned in nlp_analysis.normalized_text, alongside the original transcript.

8. Translation (output_language)

Translates the full transcript into a target language via Gemini. Set output_language to an ISO 639-1 code (en, hi) or a full language name (English).

Request:

shell
curl -X POST https://asr.shunyalabs.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer $SHUNYALABS_API_KEY" \
  -F "file=@call.wav" \
  -F "model=zero-indic" \
  -F "output_language=en"

Response:

json
{
  "nlp_analysis": {
    "translation": "Hello, this is an urgent call."
  }
}

9. Profanity hashing

Masks profane words with **** in place, in both the top-level text and each segment's text.

Request:

shell
curl -X POST https://asr.shunyalabs.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer $SHUNYALABS_API_KEY" \
  -F "file=@call.wav" \
  -F "model=zero-indic" \
  -F "enable_profanity_hashing=true"

10. Custom keyword redaction (hash_keywords)

Regex-based masking of specific terms or phrases. No LLM, fast, deterministic. Use for PII and domain-sensitive tokens.

Request:

shell
curl -X POST https://asr.shunyalabs.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer $SHUNYALABS_API_KEY" \
  -F "file=@call.wav" \
  -F "model=zero-indic" \
  -F 'hash_keywords=["account number","card number","OTP","aadhaar"]'

Response:

json
{
  "text": "आपका **** 4321 है और आपका **** कल भेजा गया था",
  "segments": [
    { "start": 0.0, "end": 5.0, "text": "आपका **** 4321 है और आपका **** कल भेजा गया था" }
  ]
}
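
To illustrate what deterministic, regex-based masking looks like, here is a local sketch of the general technique. It is not the service's implementation: how the server builds its patterns (case sensitivity, word boundaries, overlapping terms) is not documented on this page, so the choices below are assumptions.

```python
import re

def mask_keywords(text, keywords, mask="****"):
    """Replace every occurrence of any keyword with a fixed mask."""
    if not keywords:
        return text
    # Longest keywords first, so a longer phrase wins over a shorter overlapping one.
    ordered = sorted(keywords, key=len, reverse=True)
    # re.escape keeps keywords literal even if they contain regex metacharacters.
    pattern = re.compile("|".join(re.escape(k) for k in ordered), re.IGNORECASE)
    return pattern.sub(mask, text)

masked = mask_keywords(
    "Your account number is 4321 and your OTP was sent yesterday",
    ["account number", "card number", "OTP"],
)
print(masked)
```

Note that the digits after the masked phrase survive, as in the response above: keyword masking hides the listed terms themselves, not the values that follow them.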

11. Word timestamps

Per-word timing with a confidence score. Adds a words array to every segment. Runs via ONNX CTC forced alignment on GPU, with no external call and negligible latency overhead.

Request:

shell
curl -X POST https://asr.shunyalabs.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer $SHUNYALABS_API_KEY" \
  -F "file=@call.wav" \
  -F "model=zero-indic" \
  -F "word_timestamps=true"

Response:

json
{
  "segments": [
    {
      "start": 0.51,
      "end": 5.70,
      "text": "नमस्ते मोहम्मद जी",
      "words": [
        { "word": "नमस्ते", "start": 0.532, "end": 0.932, "score": 0.85 },
        { "word": "मोहम्मद", "start": 1.012, "end": 1.412, "score": 0.72 },
        { "word": "जी", "start": 1.492, "end": 1.652, "score": 0.91 }
      ]
    }
  ]
}
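
The per-word score can be used to surface words worth a human check. The following sketch collects words whose score falls below a cut-off, together with their timings. The 0.8 cut-off and the low_confidence_words helper are illustrative assumptions.

```python
def low_confidence_words(response, min_score=0.8):
    """Return (word, start, end) for every word scored below min_score."""
    flagged = []
    for seg in response.get("segments", []):
        # Segments carry a words array only when word_timestamps is enabled.
        for w in seg.get("words", []):
            if w["score"] < min_score:
                flagged.append((w["word"], w["start"], w["end"]))
    return flagged

# Sample data copied from the word timestamps response shown above.
response = {
    "segments": [
        {
            "start": 0.51,
            "end": 5.70,
            "text": "नमस्ते मोहम्मद जी",
            "words": [
                {"word": "नमस्ते", "start": 0.532, "end": 0.932, "score": 0.85},
                {"word": "मोहम्मद", "start": 1.012, "end": 1.412, "score": 0.72},
                {"word": "जी", "start": 1.492, "end": 1.652, "score": 0.91},
            ],
        }
    ]
}
print(low_confidence_words(response))
```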

Combining features

Every feature above can be enabled on the same request. Here's a full contact-centre configuration:

shell
curl -X POST https://asr.shunyalabs.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer $SHUNYALABS_API_KEY" \
  -F "file=@call.wav" \
  -F "model=zero-indic" \
  -F "language_code=hi" \
  -F "enable_diarization=true" \
  -F "enable_speaker_identification=true" \
  -F "project=support_team" \
  -F "enable_emotion_diarization=true" \
  -F "word_timestamps=true" \
  -F "enable_intent_detection=true" \
  -F 'intent_choices=["complaint","inquiry","service_request"]' \
  -F "enable_summarization=true" \
  -F "enable_sentiment_analysis=true" \
  -F 'hash_keywords=["account number","card number","OTP"]' \
  -F "enable_denoising=true"

Latency trade-off

Each LLM-backed feature adds a Gemini call on top of the ASR path; hash_keywords and word_timestamps, as noted in their sections, make no such call. Enable only what you consume; don't pay for a summary you won't read.
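
In the curl examples above, booleans are sent as the string true and list parameters as JSON-array strings. When building requests programmatically, a small helper can apply that encoding consistently. The sketch below only prepares the form fields. build_form_fields is a hypothetical helper, and sending "false" for a disabled flag is an assumption, since this page only shows flags set to true.

```python
import json

def build_form_fields(model, **options):
    """Encode options as multipart form field strings."""
    fields = {"model": model}
    for key, value in options.items():
        if isinstance(value, bool):
            # Check bool before anything else: bool is a subclass of int.
            fields[key] = "true" if value else "false"
        elif isinstance(value, (list, tuple)):
            # Lists become compact JSON arrays, matching the curl examples.
            fields[key] = json.dumps(list(value), ensure_ascii=False, separators=(",", ":"))
        else:
            fields[key] = str(value)
    return fields

fields = build_form_fields(
    "zero-indic",
    language_code="hi",
    enable_diarization=True,
    enable_summarization=True,
    intent_choices=["complaint", "inquiry", "service_request"],
    hash_keywords=["account number", "card number", "OTP"],
)
print(fields)
```

The resulting dict can be passed as the form data of a multipart POST in the HTTP client of your choice, with the audio attached separately as the file field and the Authorization header set as in the curl examples.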