Word Timestamps

Word-level timestamps provide the start and end time of each word in the transcription, along with a probability score indicating recognition confidence. When enabled, each segment in the response includes a words array with one entry per word.

How to Enable

Set the include_word_timestamps field to "true" in the form data of your transcription request:

"include_word_timestamps": "true"

Request

Don’t forget to replace YOUR_API_KEY with your own secret key.

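import requests

url = "https://tb2.shunyalabs.ai/v1/transcriptions"
headers = {"X-API-Key": "YOUR_API_KEY"}

# Enable word-level timestamps (sent as a form field).
data = {"include_word_timestamps": "true"}

# Send the request while the file is still open; requests reads the
# file object during the upload, so the call must sit inside the with block.
with open("sample.wav", "rb") as f:
    response = requests.post(
        url,
        headers=headers,
        files={"file": f},
        data=data,
    )

print(response.json())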

Example Output

{
  "success": true,
  "text": "Hello, how are you doing today?",
  "segments": [
    {
      "start": 0.0,
      "end": 3.5,
      "text": "Hello, how are you doing today?",
      "speaker": "SPEAKER_00",
      "words": [
        { "word": "Hello", "start": 0.0, "end": 0.5, "probability": 0.96 },
        { "word": "how", "start": 0.8, "end": 1.0, "probability": 0.94 },
        { "word": "are", "start": 1.1, "end": 1.3, "probability": 0.92 },
        { "word": "you", "start": 1.4, "end": 1.7, "probability": 0.95 },
        { "word": "doing", "start": 1.8, "end": 2.2, "probability": 0.89 },
        { "word": "today", "start": 2.3, "end": 2.8, "probability": 0.91 }
      ]
    }
  ]
}
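
To work with the words directly, it can help to flatten the nested response into one record per word. The following is a minimal sketch, assuming the response has the shape shown in the example output above; the helper name iter_words is hypothetical and not part of the API, and the speaker field is treated as optional.

```python
def iter_words(result):
    """Yield one dict per word, carrying the parent segment's speaker label.

    iter_words is a hypothetical helper; it assumes the response shape
    shown in the example output.
    """
    for segment in result.get("segments", []):
        speaker = segment.get("speaker")  # None if the segment has no speaker
        for word in segment.get("words", []):
            yield {
                "speaker": speaker,
                "word": word["word"],
                "start": word["start"],
                "end": word["end"],
                "probability": word["probability"],
            }


# Trimmed copy of the example output, used here in place of response.json().
result = {
    "success": True,
    "text": "Hello, how are you doing today?",
    "segments": [
        {
            "start": 0.0,
            "end": 3.5,
            "text": "Hello, how are you doing today?",
            "speaker": "SPEAKER_00",
            "words": [
                {"word": "Hello", "start": 0.0, "end": 0.5, "probability": 0.96},
                {"word": "how", "start": 0.8, "end": 1.0, "probability": 0.94},
            ],
        }
    ],
}

for w in iter_words(result):
    print(f'{w["start"]:5.2f}-{w["end"]:5.2f}  {w["speaker"]}  {w["word"]}')
```

In real use, pass response.json() instead of the hard-coded result dictionary.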

Understanding Confidence Scores

Each word includes a probability value between 0.0 and 1.0 indicating recognition confidence:

≥ 0.9: Very high confidence
0.8 to < 0.9: High confidence
0.7 to < 0.8: Moderate confidence
< 0.7: Low confidence (may require review)
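
The bands above can be applied in code to flag words for review. This is a minimal sketch, not part of the API: the function names confidence_band and words_needing_review are hypothetical, and treating each band's lower bound as inclusive (so 0.8 counts as high and 0.7 as moderate) is an assumption.

```python
def confidence_band(probability):
    """Map a word probability to one of the four bands listed above.

    Assumption: each band's lower bound is inclusive.
    """
    if probability >= 0.9:
        return "very high"
    if probability >= 0.8:
        return "high"
    if probability >= 0.7:
        return "moderate"
    return "low"


def words_needing_review(segments, threshold=0.7):
    """Return every word whose probability falls below the threshold."""
    return [
        word
        for segment in segments
        for word in segment.get("words", [])
        if word["probability"] < threshold
    ]


# Illustrative data in the same shape as the "segments" field of a response.
segments = [
    {
        "words": [
            {"word": "Hello", "start": 0.0, "end": 0.5, "probability": 0.96},
            {"word": "doing", "start": 1.8, "end": 2.2, "probability": 0.89},
            {"word": "today", "start": 2.3, "end": 2.8, "probability": 0.65},
        ]
    }
]

for word in segments[0]["words"]:
    print(word["word"], confidence_band(word["probability"]))

flagged = words_needing_review(segments)
print([w["word"] for w in flagged])
```

The 0.65 probability in the sample data is made up to show a flagged word; adjust the threshold to suit how strict your review process needs to be.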

Best Practices

Enable only when needed: word timestamps add roughly 5–10% to processing time
Use for quality control by reviewing words with low confidence
Combine with speaker diarization to track who said each word
Balance detail vs performance depending on your use case

Use Cases

Video subtitles with precise word-level timing
Karaoke and lyrics synchronization
Quality assurance and transcript review
Audio editing and content navigation
Accessibility for hearing-impaired users
Language learning and pronunciation analysis
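
As an illustration of the subtitle use case, word timings can be grouped into SubRip (SRT) cues. This is a minimal sketch under stated assumptions: srt_timestamp and words_to_srt are hypothetical helpers, the input is a flat list of word entries in the shape shown in the example output, and splitting cues at a fixed word count is a simplification chosen for brevity.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words, max_words=3):
    """Group consecutive words into numbered SRT cues of at most max_words."""
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        text = " ".join(w["word"] for w in chunk)
        # A cue spans from the first word's start to the last word's end.
        start = srt_timestamp(chunk[0]["start"])
        end = srt_timestamp(chunk[-1]["end"])
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    # Cues are separated by a blank line.
    return "\n".join(cues)


# Word entries copied from the example output.
sample_words = [
    {"word": "Hello", "start": 0.0, "end": 0.5, "probability": 0.96},
    {"word": "how", "start": 0.8, "end": 1.0, "probability": 0.94},
    {"word": "are", "start": 1.1, "end": 1.3, "probability": 0.92},
    {"word": "you", "start": 1.4, "end": 1.7, "probability": 0.95},
    {"word": "doing", "start": 1.8, "end": 2.2, "probability": 0.89},
    {"word": "today", "start": 2.3, "end": 2.8, "probability": 0.91},
]

print(words_to_srt(sample_words))
```

A production subtitle pipeline would more likely split cues on pauses, punctuation, or line length rather than a fixed word count.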