Term glossary
Plain-English definitions for terms in these docs. STT = speech-to-text (audio → text). TTS = text-to-speech (text → audio).
Speech-to-Text (STT)
Audio in, text out. Shunya product family: Zero STT. Also called ASR (automatic speech recognition).
- STT (speech-to-text)
- Converting spoken audio into written text. Shunya endpoint: POST /v1/audio/transcriptions.
- ASR (automatic speech recognition)
- Another name for STT, same meaning.
- Batch STT
- Upload a complete audio file; receive the full transcript when done.
- Streaming STT
- Send live audio over WebSocket; receive partial transcripts while the person is still speaking.
- Transcript
- The text returned by STT, either as plain text or as JSON with segments and timestamps.
- Translation
- Converting text from one language to another. Often used after STT. Shunya product: Vāķ.
- Sample rate
- Audio samples per second. 16 kHz for recordings; 8 kHz for phone lines.
- PCM
- Raw uncompressed audio, common for low-latency streaming STT.
- μ-law / A-law (G.711)
- 8 kHz phone codecs. Use ulaw or alaw when audio comes from PSTN or SIP.
- VAD (voice activity detection)
- Detects when someone is speaking vs silent.
- Diarization
- Labels who spoke when, e.g. SPEAKER_00, SPEAKER_01.
- Speaker identification
- Maps anonymous speaker labels to enrolled names.
- Code-switching
- Mixing languages in one utterance (e.g. Hinglish). Model: zero-codeswitch.
- Transliteration
- The same words written in a different script. Set output_script.
- Word timestamps
- Start and end time for each word in the transcript.
- WER (word error rate)
- STT accuracy metric; lower is better. Zero STT composite WER: 3.10%.
- Intent detection
- Classifies what the speaker wanted, based on the transcript.
- Sentiment analysis
- Whether speech sounds positive, negative, or neutral.
- NER (named entity recognition)
- Extracts names, dates, amounts, and places from a transcript.
- Keyterm normalization
- Standardizes domain terms in the transcript.
- Profanity hashing
- Masks offensive or sensitive words in STT output.
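
To make the WER entry above concrete, here is a minimal sketch of the standard word-level calculation: the number of substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. The function name `word_error_rate` is illustrative, and the sketch does no text normalization (casing, punctuation), which real evaluations usually apply first. It is not a description of how the Zero STT figure was measured.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match (cost 0) or substitution (cost 1)
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    ref = "please transfer five hundred rupees to my savings account"
    hyp = "please transfer five hundred rupees to savings account"
    print(f"WER: {word_error_rate(ref, hyp):.2%}")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.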
Text-to-Speech (TTS)
Text in, audio out. Shunya product family: Zero TTS, covering voices, styles, and cloning.
- TTS (text-to-speech)
- Converting written text into spoken audio. Shunya endpoint: POST /v1/audio/speech.
- Batch TTS
- Send text in one request; receive a complete audio file (MP3, WAV, etc.).
- Streaming TTS
- Send text over WebSocket; receive audio chunks as they are generated.
- Voice
- Which speaker reads the text. 46 voices are available across 23 languages.
- Cross-lingual voice
- A voice speaking a language outside its native set. Any Shunya voice can read any supported language.
- Expression style
- Tone tag in text: <Happy>, <News>, etc. The tag is not spoken aloud.
- Voice cloning
- Generate speech in a custom voice from a 3-6 second audio sample.
- Prosody
- Rhythm and intonation of spoken output.
- Sample rate (TTS output)
- 16 kHz for MP3/WAV; 8 kHz for telephony formats.
- μ-law / A-law (TTS output)
- Phone-line audio from TTS. Set response_format to mulaw or alaw.
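
As a rough illustration of how the batch TTS terms above fit together, the sketch below builds (but does not send) a request to POST /v1/audio/speech with a telephony output format. Only the endpoint path, the `response_format` parameter, and the `Authorization: Bearer <key>` header come from this glossary. The base URL, the JSON body encoding, and the `input` field name are assumptions made for illustration; check the API reference for the real request shape.

```python
import json
import urllib.request

# Hypothetical placeholder; substitute the real API base URL.
BASE_URL = "https://api.example.com"


def build_speech_request(
    api_key: str, text: str, response_format: str = "mulaw"
) -> urllib.request.Request:
    """Build a batch TTS request object without sending it."""
    payload = {
        "input": text,                       # assumed field name for the text
        "response_format": response_format,  # parameter named in this glossary
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),  # assumed JSON body
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_speech_request("YOUR_API_KEY", "Your order has shipped.", "alaw")
    print(req.get_method(), req.full_url)
```

Passing the returned object to `urllib.request.urlopen` would send it; the sketch stops short of that so it can be inspected without network access.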
Deployment & security
- Sovereign AI
- AI where data and models stay under your organization's control and jurisdiction.
- Air-gapped
- Deployment with no connection to the public internet.
- CPU-compatible / CPU-first
- Models that run on standard CPUs without a GPU.
- SOC 2 Type II
- Third-party audit of security controls over time.
- ISO 27001:2022
- International information security management standard.
- HIPAA
- US healthcare privacy law. Requires a BAA (business associate agreement) for PHI (protected health information).
- FHIR / HL7
- Healthcare data standards for EHR integration.
Infrastructure
- Triton Inference Server
- NVIDIA inference server used to serve models at scale.
- RTFx (real-time factor)
- How fast batch audio is processed; higher means faster.
- WebSocket
- Persistent connection used for streaming STT and TTS.
- Bearer token
- API key sent in the request header as Authorization: Bearer <key>.
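
The RTFx entry above only says that higher means faster. A minimal sketch of the arithmetic, assuming the commonly used definition of RTFx as audio duration divided by processing time (an assumption, since this glossary does not spell out the formula), looks like this. The function name `rtfx` is illustrative.

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Seconds of audio processed per second of compute (assumed definition)."""
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


if __name__ == "__main__":
    # A 10-minute file transcribed in 12 seconds.
    print(rtfx(600.0, 12.0))
```

Under this definition, a value above 1 means audio is processed faster than real time.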