Term glossary

Plain-English definitions for terms in these docs. STT = speech-to-text (audio → text). TTS = text-to-speech (text → audio).

Speech-to-Text (STT)

Audio in, text out. Shunya product family: Zero STT. Also called ASR (automatic speech recognition).

STT (speech-to-text): Converting spoken audio into written text. Shunya endpoint: POST /v1/audio/transcriptions.
ASR (automatic speech recognition): Another name for STT, same meaning.
Batch STT: Upload a complete audio file; receive the full transcript when done.
Streaming STT: Send live audio over WebSocket; receive partial transcripts while the person is still speaking.
Transcript: The text returned by STT, either as plain text or as JSON with segments and timestamps.
Translation: Converting text from one language to another. Often used after STT. Shunya product: Vāķ.
Sample rate: Audio samples per second. 16 kHz for recordings; 8 kHz for phone lines.
PCM: Raw uncompressed audio, common for low-latency streaming STT.
μ-law / A-law (G.711): 8 kHz phone codecs. Use ulaw or alaw when audio comes from PSTN or SIP.
VAD (voice activity detection): Detects when someone is speaking vs silent.
Diarization: Labels who spoke when, e.g. SPEAKER_00, SPEAKER_01.
Speaker identification: Maps anonymous speaker labels to enrolled names.
Code-switching: Mixing languages in one utterance (e.g. Hinglish). Model: zero-codeswitch.
Transliteration: Rendering the same words in a different script; controlled by output_script.
Word timestamps: Start and end time for each word in the transcript.
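WER (word error rate): STT accuracy metric, computed as word substitutions, deletions, and insertions divided by the number of words in the reference transcript; lower is better. Zero STT composite WER: 3.10%.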
Intent detection: Classifies, from the transcript, what the speaker wanted.
Sentiment analysis: Classifies whether speech sounds positive, negative, or neutral.
NER (named entity recognition): Extracts names, dates, amounts, and places from a transcript.
Keyterm normalization: Standardizes domain terms in the transcript.
Profanity hashing: Masks offensive or sensitive words in STT output.
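
To make the WER entry above concrete, here is a minimal sketch of the standard word-level edit-distance calculation. It illustrates the metric only and is not Shunya's evaluation code: the function name is hypothetical, and it assumes whitespace tokenization with no case or punctuation normalization, both of which real benchmarks usually handle explicitly.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One deletion ("on") and one substitution ("lights" -> "light") over 4 reference words.
print(word_error_rate("turn on the lights", "turn the light"))  # 0.5
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.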

Text-to-Speech (TTS)

Text in, audio out. Shunya product family: Zero TTS, which covers voices, styles, and cloning.

TTS (text-to-speech): Converting written text into spoken audio. Shunya endpoint: POST /v1/audio/speech.
Batch TTS: Send text in one request; receive a complete audio file (MP3, WAV, etc.).
Streaming TTS: Send text over WebSocket; receive audio chunks as they are generated.
Voice: Which speaker reads the text. 46 voices are available across 23 languages.
Cross-lingual voice: A voice speaking a language outside its native set. Any Shunya voice can read any supported language.
Expression style: Tone tag placed in the text, such as <Happy> or <News>. The tag is not spoken aloud.
Voice cloning: Generate speech in a custom voice from a 3-6 second audio sample.
Prosody: Rhythm and intonation of spoken output.
Sample rate (TTS output): 16 kHz for MP3/WAV; 8 kHz for telephony formats.
μ-law / A-law (TTS output): Phone-line audio from TTS. Set response_format to mulaw or alaw.
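
As background for the μ-law entries, the sketch below shows the continuous μ-law companding curve (μ = 255) that G.711 μ-law is based on. It is an illustration under stated assumptions, not Shunya's encoder: real G.711 codecs use an 8-bit piecewise-linear approximation of this curve, and the function names here are hypothetical.

```python
import math

MU = 255  # companding constant for mu-law


def mulaw_compress(x: float) -> float:
    """Map a linear sample in [-1, 1] to its companded value in [-1, 1]."""
    if not -1.0 <= x <= 1.0:
        raise ValueError("sample must be in the range [-1, 1]")
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def mulaw_expand(y: float) -> float:
    """Invert mulaw_compress, recovering the linear sample."""
    if not -1.0 <= y <= 1.0:
        raise ValueError("companded value must be in the range [-1, 1]")
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


# Quiet samples are boosted before quantization, which is why 8 bits per
# sample is enough for intelligible phone-line speech.
quiet = mulaw_compress(0.01)
restored = mulaw_expand(quiet)
print(quiet > 0.2, abs(restored - 0.01) < 1e-9)
```

A-law follows the same idea with a different curve.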

Deployment & security

Sovereign AI: AI where data and models stay under your organization's control and jurisdiction.
Air-gapped: Deployment with no connection to the public internet.
CPU-compatible / CPU-first: Models that run on standard CPUs without a GPU.
SOC 2 Type II: Third-party audit of security controls over time.
ISO 27001:2022: International information security management standard.
HIPAA: US healthcare privacy law. Handling protected health information (PHI) requires a business associate agreement (BAA).
FHIR / HL7: Healthcare data standards for EHR integration.

Infrastructure

Triton Inference Server: NVIDIA inference server used to serve models at scale.
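RTFx (inverse real-time factor): How fast batch audio is processed, measured as audio duration divided by processing time; higher means faster. For example, an RTFx of 100 means 100 seconds of audio are processed per second of compute.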
WebSocket: Persistent connection used for streaming STT and TTS.
Bearer token: API key sent as Authorization: Bearer <key>.
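
The Bearer token entry can be sketched with Python's standard library as follows. The host name, the helper name, and the empty JSON body are placeholders assumed for illustration; only the endpoint path and the Authorization header format come from this glossary. The request is built but not sent.

```python
import urllib.request

BASE_URL = "https://api.example.com"  # hypothetical host; substitute your deployment's URL


def build_request(api_key: str, path: str, body: bytes, content_type: str) -> urllib.request.Request:
    """Build (but do not send) a POST request that carries the API key as a Bearer token."""
    return urllib.request.Request(
        url=BASE_URL + path,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": content_type,
        },
    )


# Placeholder body: the real request fields are not covered by this glossary.
req = build_request("YOUR_API_KEY", "/v1/audio/speech", b"{}", "application/json")
print(req.get_header("Authorization"))
```

Passing the prepared request to `urllib.request.urlopen(req)` would send it. Reading the key from an environment variable rather than hard-coding it keeps it out of source control.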