Term glossary

Plain-English definitions for terms in these docs. STT = speech-to-text (audio → text). TTS = text-to-speech (text → audio).

Speech-to-Text (STT)

Audio in, text out. Shunya product family: Zero STT. Also called ASR (automatic speech recognition).

STT (speech-to-text)
Converting spoken audio into written text. Shunya endpoint: POST /v1/audio/transcriptions.
ASR (automatic speech recognition)
Another name for STT, same meaning.
Batch STT
Upload a complete audio file; receive the full transcript when done.
Streaming STT
Send live audio over WebSocket; receive partial transcripts while the person is still speaking.
Transcript
The text returned by STT, either as plain text or as JSON with segments and timestamps.
Translation
Converting text from one language to another. Often used after STT. Shunya product: Vāķ.
Sample rate
Number of audio samples per second. Typical values: 16 kHz for recordings; 8 kHz for phone lines.
PCM
Pulse-code modulation: raw, uncompressed audio, commonly used for low-latency streaming STT.
μ-law / A-law (G.711)
8 kHz phone codecs. Use ulaw or alaw when audio comes from PSTN or SIP.
VAD (voice activity detection)
Detects when someone is speaking and when the audio is silent.
Diarization
Labels who spoke when, e.g. SPEAKER_00, SPEAKER_01.
Speaker identification
Maps anonymous speaker labels to enrolled names.
Code-switching
Mixing languages in one utterance (e.g. Hinglish). Model: zero-codeswitch.
Transliteration
The same words written in a different script. Set with output_script.
Word timestamps
Start and end time for each word in the transcript.
WER (word error rate)
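STT accuracy metric: the number of word errors (substitutions, deletions, insertions) divided by the number of words in a reference transcript. Lower is better. Zero STT composite WER: 3.10%.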
Intent detection
Classifies, from the transcript, what the speaker wanted.
Sentiment analysis
Classifies whether speech sounds positive, negative, or neutral.
NER (named entity recognition)
Extracts names, dates, amounts, and places from a transcript.
Keyterm normalization
Standardizes domain terms in the transcript.
Profanity hashing
Masks offensive or sensitive words in STT output.
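
The WER entry above can be made concrete with a short calculation. This is a minimal sketch of the standard word-level edit-distance definition, not a description of how the Zero STT figure was measured. The function name word_error_rate is hypothetical, and the sketch assumes whitespace tokenization with no casing or punctuation normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over 6 reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.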

Text-to-Speech (TTS)

Text in, audio out. Shunya product family: Zero TTS, which covers voices, styles, and cloning.

TTS (text-to-speech)
Converting written text into spoken audio. Shunya endpoint: POST /v1/audio/speech.
Batch TTS
Send text in one request; receive a complete audio file (MP3, WAV, etc.).
Streaming TTS
Send text over WebSocket; receive audio chunks as they are generated.
Voice
Which speaker reads the text. 46 voices are available across 23 languages.
Cross-lingual voice
A voice speaking a language outside its native set. Any Shunya voice can read any supported language.
Expression style
Tone tag placed in the text, such as <Happy> or <News>. The tag is not spoken aloud.
Voice cloning
Generate speech in a custom voice from a 3- to 6-second audio sample.
Prosody
Rhythm and intonation of spoken output.
Sample rate (TTS output)
16 kHz for MP3/WAV; 8 kHz for telephony formats.
μ-law / A-law (TTS output)
Phone-line audio from TTS. Set response_format to mulaw or alaw.
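
The sample-rate and μ-law entries in both sections come down to simple arithmetic, sketched below. The 16-bit sample width for PCM is an assumption (these docs do not state a bit depth); G.711 uses 8 bits per sample. The companding functions show the continuous μ-law curve only. Real G.711 codecs use a segmented 8-bit approximation of it, so this is an illustration of the idea rather than a codec. All function names are hypothetical.

```python
import math

MU = 255  # companding constant used by mu-law


def bytes_per_second(sample_rate_hz: int, bytes_per_sample: int, channels: int = 1) -> int:
    """Raw data rate of an uncompressed or G.711 audio stream."""
    return sample_rate_hz * bytes_per_sample * channels


def mu_law_compress(x: float) -> float:
    """Continuous mu-law curve: maps a sample in [-1, 1] to [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def mu_law_expand(y: float) -> float:
    """Inverse of mu_law_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


if __name__ == "__main__":
    # Assumed 16-bit mono PCM at 16 kHz: 32,000 bytes per second.
    print(bytes_per_second(16_000, 2))
    # G.711 at 8 kHz, one byte per sample: 8,000 bytes per second (64 kbit/s).
    print(bytes_per_second(8_000, 1))
    # Quiet samples get proportionally more of the output range than loud ones.
    print(mu_law_compress(0.01), mu_law_compress(0.5))
```

The curve spends most of its output range on quiet samples, which is why 8 bits per sample is enough for intelligible phone speech.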

Deployment & security

Sovereign AI
AI where data and models stay under your organization's control and jurisdiction.
Air-gapped
Deployment with no connection to the public internet.
CPU-compatible / CPU-first
Models that run on standard CPUs without a GPU.
SOC 2 Type II
Third-party audit of security controls over time.
ISO 27001:2022
International information security management standard.
HIPAA
US healthcare privacy law. Handling protected health information (PHI) requires a business associate agreement (BAA).
FHIR / HL7
Healthcare data standards for EHR integration.

Infrastructure

Triton Inference Server
NVIDIA inference server used to serve models at scale.
RTFx (inverse real-time factor)
How fast batch audio is processed: seconds of audio handled per second of processing time. Higher means faster.
WebSocket
Persistent connection used for streaming STT and TTS.
Bearer token
API key sent in the HTTP request header Authorization: Bearer <key>.
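
Two entries in this section reduce to a few lines of code, sketched here under stated assumptions. The rtfx function assumes the common definition of RTFx as audio duration divided by processing time. The request builder only constructs a request object and sends nothing. The host api.example.com is a placeholder, not a Shunya URL, and both function names are hypothetical.

```python
import urllib.request


def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Seconds of audio processed per second of wall-clock time (assumed definition)."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def build_authorized_request(url: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying a bearer token."""
    request = urllib.request.Request(url, method="POST")
    request.add_header("Authorization", f"Bearer {api_key}")
    return request


if __name__ == "__main__":
    # One hour of audio transcribed in 36 seconds is an RTFx of 100.
    print(rtfx(3600, 36))

    # Placeholder host; only the path comes from the TTS entry in this glossary.
    req = build_authorized_request("https://api.example.com/v1/audio/speech", "YOUR_API_KEY")
    print(req.get_header("Authorization"))
```

Keep the key out of source control, for example by reading it from an environment variable before calling the builder.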