ASR features

The intelligence layer on top of Zero STT. Each feature is enabled with its own request parameter, usually a boolean flag; none is required, and they combine freely. All results come back in the same JSON response as the transcript.

Easier to browse?

Open the Intelligence overview for a card-based UI with jump links and collapsible request/response examples for every feature.

1. Diarization

"Who spoke when." Adds speaker: SPEAKER_XX to every segment and a top-level speakers array. The full text field is prefixed with [SPEAKER_XX] tags per segment.

Request:

curl -X POST https://asr.shunyalabs.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer $SHUNYALABS_API_KEY" \
  -F "file=@meeting.wav" \
  -F "model=zero-indic" \
  -F "enable_diarization=true"

Response:

{
  "text": "[SPEAKER_00] नमस्ते, आप कैसे हैं? [SPEAKER_01] मैं ठीक हूँ, धन्यवाद।",
  "segments": [
    { "start": 0.5, "end": 3.2, "text": "नमस्ते, आप कैसे हैं?", "speaker": "SPEAKER_00" },
    { "start": 4.1, "end": 6.8, "text": "मैं ठीक हूँ, धन्यवाद।", "speaker": "SPEAKER_01" }
  ],
  "speakers": ["SPEAKER_00", "SPEAKER_01"]
}

Segments are capped at 30 seconds each to maintain transcription quality.
Works with any number of speakers, but works best with 2-6 distinct voices.
Combine with enable_speaker_identification to get real names instead of anonymous labels.
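
As a rough illustration of consuming the diarized response, the sketch below sums talk time per speaker. It runs on the sample response shown above and assumes start and end are in seconds; the helper name talk_time_by_speaker is hypothetical and not part of the API.

```python
from collections import defaultdict

# Sample response copied from the diarization example above.
response = {
    "segments": [
        {"start": 0.5, "end": 3.2, "text": "नमस्ते, आप कैसे हैं?", "speaker": "SPEAKER_00"},
        {"start": 4.1, "end": 6.8, "text": "मैं ठीक हूँ, धन्यवाद।", "speaker": "SPEAKER_01"},
    ],
    "speakers": ["SPEAKER_00", "SPEAKER_01"],
}


def talk_time_by_speaker(resp):
    """Sum segment durations for each speaker label (assumes times are in seconds)."""
    totals = defaultdict(float)
    for seg in resp["segments"]:
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    return dict(totals)


talk_time = talk_time_by_speaker(response)
for speaker in response["speakers"]:
    print(f"{speaker}: {talk_time.get(speaker, 0.0):.1f}s")
```

Iterating over the top-level speakers array keeps the output order stable and also lists any speaker that has no segments.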

2. Speaker identification

Turns anonymous SPEAKER_00 labels into registered names. Requires diarization to be enabled and at least one registered voice profile.

Step 1: register a speaker (use a 5-15 second clip of the speaker alone, no background music):

curl -X POST https://asr.shunyalabs.ai/v1/speakers/register \
  -H "Authorization: Bearer $SHUNYALABS_API_KEY" \
  -F "name=Priya" \
  -F "file=@priya_sample.wav" \
  -F "project=support_team"

Step 2: transcribe with identification on:

curl -X POST https://asr.shunyalabs.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer $SHUNYALABS_API_KEY" \
  -F "file=@call.wav" \
  -F "model=zero-indic" \
  -F "enable_diarization=true" \
  -F "enable_speaker_identification=true" \
  -F "project=support_team"

Unrecognised speakers stay as SPEAKER_XX. The project parameter isolates per-customer or per-team voice libraries.
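
Because unrecognised speakers keep their SPEAKER_XX label, a client can tell identified and anonymous speakers apart by pattern. The sketch below assumes that a recognised speaker's registered name appears in the segment's speaker field and that anonymous labels always look like SPEAKER_ followed by digits; the segments shown are hypothetical, not real API output.

```python
import re

# Assumption: anonymous labels are "SPEAKER_" followed by digits, as in the examples.
ANONYMOUS_LABEL = re.compile(r"SPEAKER_\d+")


def split_speakers(segments):
    """Return (named, anonymous) speaker labels found in the segments, sorted."""
    named, anonymous = set(), set()
    for seg in segments:
        label = seg["speaker"]
        if ANONYMOUS_LABEL.fullmatch(label):
            anonymous.add(label)
        else:
            named.add(label)
    return sorted(named), sorted(anonymous)


# Hypothetical segments for illustration only.
segments = [
    {"speaker": "Priya", "text": "..."},
    {"speaker": "SPEAKER_01", "text": "..."},
    {"speaker": "Priya", "text": "..."},
]

named, anonymous = split_speakers(segments)
print("identified:", named)
print("still anonymous:", anonymous)
```

A non-empty anonymous list is a hint that a voice profile may be missing from the project.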

3. Emotion diarization

Detects the dominant emotion in each segment's audio and adds an emotion field to the segment. Works independently of speaker diarization, but the two are commonly used together.

Request:

curl -X POST https://asr.shunyalabs.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer $SHUNYALABS_API_KEY" \
  -F "file=@call.wav" \
  -F "model=zero-indic" \
  -F "enable_diarization=true" \
  -F "enable_emotion_diarization=true"

Response:

{
  "segments": [
    { "start": 0.5, "end": 3.2, "text": "...", "speaker": "SPEAKER_00", "emotion": "angry" },
    { "start": 4.1, "end": 6.8, "text": "...", "speaker": "SPEAKER_01", "emotion": "neutral" }
  ]
}
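
One way to use the emotion field is to tally emotions per speaker. This is a minimal sketch on the sample segments above; the "UNKNOWN" fallback for requests made without diarization is an assumption of this example, not documented behaviour.

```python
from collections import Counter, defaultdict

# Segments copied from the emotion diarization example above.
segments = [
    {"start": 0.5, "end": 3.2, "text": "...", "speaker": "SPEAKER_00", "emotion": "angry"},
    {"start": 4.1, "end": 6.8, "text": "...", "speaker": "SPEAKER_01", "emotion": "neutral"},
]


def emotions_by_speaker(segs):
    """Count how many segments carry each emotion label, per speaker."""
    counts = defaultdict(Counter)
    for seg in segs:
        # Fallback label is hypothetical: used here if no speaker field is present.
        speaker = seg.get("speaker", "UNKNOWN")
        counts[speaker][seg["emotion"]] += 1
    return counts


emotion_counts = emotions_by_speaker(segments)
for speaker, counter in emotion_counts.items():
    print(speaker, dict(counter))
```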

4. Intent detection

Classifies the overall transcript intent using Gemini. Pass intent_choices to constrain the result to your taxonomy, or omit it for free-form detection.

Request:

curl -X POST https://asr.shunyalabs.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer $SHUNYALABS_API_KEY" \
  -F "file=@call.wav" \
  -F "model=zero-indic" \
  -F "enable_intent_detection=true" \
  -F 'intent_choices=["complaint","inquiry","service_request","compliment"]'

Response:

{
  "nlp_analysis": {
    "intent": {
      "label": "service_request",
      "confidence": 0.92,
      "reasoning": "Caller is requesting roadside assistance for a broken-down vehicle"
    }
  }
}
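
A typical use of the confidence value is to act on the label only when the score is high enough. The sketch below shows one such routing rule; the 0.8 threshold and the "manual_review" fallback are illustrative assumptions, not recommendations from the API.

```python
# Sample response copied from the intent detection example above.
response = {
    "nlp_analysis": {
        "intent": {
            "label": "service_request",
            "confidence": 0.92,
            "reasoning": "Caller is requesting roadside assistance for a broken-down vehicle",
        }
    }
}


def route(resp, min_confidence=0.8):
    """Return the intent label, or "manual_review" if it is missing or low-confidence."""
    intent = resp.get("nlp_analysis", {}).get("intent")
    if intent is None or intent["confidence"] < min_confidence:
        return "manual_review"
    return intent["label"]


print(route(response))
print(route(response, min_confidence=0.95))
```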

5. Sentiment analysis

Overall sentiment of the transcript. Returns a label, a numeric score (-1 to 1), and an explanation.

Request:

curl -X POST https://asr.shunyalabs.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer $SHUNYALABS_API_KEY" \
  -F "file=@call.wav" \
  -F "model=zero-indic" \
  -F "enable_sentiment_analysis=true"

Response:

{
  "nlp_analysis": {
    "sentiment": {
      "label": "negative",
      "score": -0.72,
      "explanation": "Customer expresses frustration about the vehicle being stranded."
    }
  }
}
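
Since the score is numeric, it can be aggregated across calls. The sketch below averages it over a small batch; the first response is the sample above, while the other two are made-up values for illustration.

```python
from statistics import mean


def average_sentiment(responses):
    """Mean sentiment score over responses that include one, or None if there are none."""
    scores = [
        r["nlp_analysis"]["sentiment"]["score"]
        for r in responses
        if "sentiment" in r.get("nlp_analysis", {})
    ]
    return mean(scores) if scores else None


responses = [
    # From the example above.
    {"nlp_analysis": {"sentiment": {"label": "negative", "score": -0.72}}},
    # Hypothetical values, not real API output.
    {"nlp_analysis": {"sentiment": {"label": "positive", "score": 0.4}}},
    {"nlp_analysis": {"sentiment": {"label": "neutral", "score": 0.1}}},
]

avg = average_sentiment(responses)
print(f"average sentiment: {avg:.3f}")
```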

6. Summarization

Concise summary of the transcript. summary_max_length sets an approximate word cap (default 150).

Request:

curl -X POST https://asr.shunyalabs.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer $SHUNYALABS_API_KEY" \
  -F "file=@call.wav" \
  -F "model=zero-indic" \
  -F "enable_summarization=true" \
  -F "summary_max_length=50"

Response:

{
  "nlp_analysis": {
    "summary": "Customer called about a vehicle breakdown. Agent confirmed the complaint was registered and promised a technician within the hour."
  }
}

7. Keyterm normalization

Cleans up domain-specific terms the ASR model might render informally, for example emi → EMI and nach mandate → NACH mandate. Preserves the original language. Optionally, pass keyterm_keywords to focus on a specific glossary.

Request:

curl -X POST https://asr.shunyalabs.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer $SHUNYALABS_API_KEY" \
  -F "file=@call.wav" \
  -F "model=zero-indic" \
  -F "enable_keyterm_normalization=true" \
  -F 'keyterm_keywords=["EMI","NACH mandate","bounce charge"]'

Normalised text returns in nlp_analysis.normalized_text, alongside the original transcript.

8. Translation (`output_language`)

Translates the full transcript into a target language via Gemini. Set output_language to an ISO 639-1 code (en, hi) or a full language name (English).

Request:

curl -X POST https://asr.shunyalabs.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer $SHUNYALABS_API_KEY" \
  -F "file=@call.wav" \
  -F "model=zero-indic" \
  -F "output_language=en"

Response:

{
  "nlp_analysis": {
    "translation": "Hello, this is an urgent call."
  }
}

9. Profanity hashing

Masks profane words with **** in-place in both the top-level text and segment text.

Request:

curl -X POST https://asr.shunyalabs.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer $SHUNYALABS_API_KEY" \
  -F "file=@call.wav" \
  -F "model=zero-indic" \
  -F "enable_profanity_hashing=true"

10. Custom keyword redaction (`hash_keywords`)

Regex-based masking of specific terms or phrases. No LLM, fast, deterministic. Use for PII and domain-sensitive tokens.

Request:

curl -X POST https://asr.shunyalabs.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer $SHUNYALABS_API_KEY" \
  -F "file=@call.wav" \
  -F "model=zero-indic" \
  -F 'hash_keywords=["account number","card number","OTP","aadhaar"]'

Response:

{
  "text": "आपका **** 4321 है और आपका **** कल भेजा गया था",
  "segments": [
    { "start": 0.0, "end": 5.0, "text": "आपका **** 4321 है और आपका **** कल भेजा गया था" }
  ]
}
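
To make the idea of regex-based masking concrete, here is a local sketch of literal keyword masking. It is not the service's implementation: the case-insensitive matching, the longest-first ordering, and the function name mask_keywords are all assumptions of this example, and it does not attempt any cross-language matching such as the Hindi output shown above.

```python
import re


def mask_keywords(text, keywords, mask="****"):
    """Replace every literal occurrence of a keyword with the mask string."""
    if not keywords:
        return text
    # Longest keywords first, so a long phrase wins over a shorter overlapping one.
    ordered = sorted(keywords, key=len, reverse=True)
    # re.escape treats each keyword as literal text rather than a regex.
    pattern = re.compile("|".join(re.escape(k) for k in ordered), re.IGNORECASE)
    return pattern.sub(mask, text)


masked = mask_keywords(
    "Your account number is 4321 and your otp was sent yesterday",
    ["account number", "card number", "OTP", "aadhaar"],
)
print(masked)
# prints: Your **** is 4321 and your **** was sent yesterday
```

As in the response above, only the keyword itself is masked; the digits that follow it are left in place.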

11. Word timestamps

Per-word timing with a confidence score. Adds a words array to every segment. Runs ONNX CTC forced alignment on GPU with no external call, so the added latency is negligible.

Request:

curl -X POST https://asr.shunyalabs.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer $SHUNYALABS_API_KEY" \
  -F "file=@call.wav" \
  -F "model=zero-indic" \
  -F "word_timestamps=true"

Response:

{
  "segments": [
    {
      "start": 0.51,
      "end": 5.70,
      "text": "नमस्ते मोहम्मद जी",
      "words": [
        { "word": "नमस्ते", "start": 0.532, "end": 0.932, "score": 0.85 },
        { "word": "मोहम्मद", "start": 1.012, "end": 1.412, "score": 0.72 },
        { "word": "जी", "start": 1.492, "end": 1.652, "score": 0.91 }
      ]
    }
  ]
}
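
The per-word score can be used to flag words worth a human check. This sketch runs on the sample response above; the 0.8 threshold is an arbitrary illustration, not a documented cutoff.

```python
# Sample response copied from the word timestamps example above.
response = {
    "segments": [
        {
            "start": 0.51,
            "end": 5.70,
            "text": "नमस्ते मोहम्मद जी",
            "words": [
                {"word": "नमस्ते", "start": 0.532, "end": 0.932, "score": 0.85},
                {"word": "मोहम्मद", "start": 1.012, "end": 1.412, "score": 0.72},
                {"word": "जी", "start": 1.492, "end": 1.652, "score": 0.91},
            ],
        }
    ]
}


def low_confidence_words(resp, threshold=0.8):
    """Collect (word, start, end, score) for words scoring below the threshold."""
    flagged = []
    for seg in resp["segments"]:
        # Segments without a words array are skipped rather than treated as errors.
        for w in seg.get("words", []):
            if w["score"] < threshold:
                flagged.append((w["word"], w["start"], w["end"], w["score"]))
    return flagged


flagged = low_confidence_words(response)
for word, start, end, score in flagged:
    print(f"{word} [{start:.3f}-{end:.3f}] score={score}")
```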

Combining features

Every feature above can be enabled on the same request. Here's a full contact-centre configuration:

curl -X POST https://asr.shunyalabs.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer $SHUNYALABS_API_KEY" \
  -F "file=@call.wav" \
  -F "model=zero-indic" \
  -F "language_code=hi" \
  -F "enable_diarization=true" \
  -F "enable_speaker_identification=true" \
  -F "project=support_team" \
  -F "enable_emotion_diarization=true" \
  -F "word_timestamps=true" \
  -F "enable_intent_detection=true" \
  -F 'intent_choices=["complaint","inquiry","service_request"]' \
  -F "enable_summarization=true" \
  -F "enable_sentiment_analysis=true" \
  -F 'hash_keywords=["account number","card number","OTP"]' \
  -F "enable_denoising=true"
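
When a request like this is assembled in code rather than by hand, the list-valued parameters need to be serialised as JSON array strings, as in the curl examples. The helper below is a hypothetical sketch of that step; it only builds the form fields and sends nothing.

```python
import json


def build_form_fields(model, flags, list_params, extra=None):
    """Return multipart form fields as a dict of strings."""
    fields = {"model": model}
    # Boolean feature flags are sent as the string "true".
    fields.update({name: "true" for name in flags})
    # List parameters become compact JSON arrays, matching the curl examples.
    for name, values in list_params.items():
        fields[name] = json.dumps(values, separators=(",", ":"), ensure_ascii=False)
    fields.update(extra or {})
    return fields


fields = build_form_fields(
    model="zero-indic",
    flags=[
        "enable_diarization",
        "enable_speaker_identification",
        "enable_emotion_diarization",
        "word_timestamps",
        "enable_intent_detection",
        "enable_summarization",
        "enable_sentiment_analysis",
    ],
    list_params={
        "intent_choices": ["complaint", "inquiry", "service_request"],
        "hash_keywords": ["account number", "card number", "OTP"],
    },
    extra={"language_code": "hi", "project": "support_team"},
)

for name, value in fields.items():
    print(f"{name}={value}")
```

The resulting dict can be passed as the form data of a multipart POST in the HTTP client of your choice, with the audio file attached separately.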

Latency trade-off

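Each LLM-backed feature (intent detection and translation both run on Gemini) adds a Gemini call on top of the ASR path, whereas regex keyword redaction and word timestamps make no external call. Enable only what you consume; don't pay for a summary you won't read.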
