Text-to-Speech (TTS)

Zero TTS is Shunya's speech synthesis family. 46 speaker voices across 23 Indic languages and English. Every voice can speak every language. 11 expression styles. Voice cloning from a 3-6 second reference clip.

How it fits together

Batch vs Streaming

Two synthesis modes are available. Same model, same voices, different transport and different "when does the first byte of audio leave the server."

Batch

HTTP POST, returns a complete file

Send text via HTTP POST and receive a complete audio file in a single response.

Best for

Pre-rendered voice prompts for IVR and telephony systems.
Notification audio, order updates, alerts, reminders.
Podcast, audiobook, and long-form content generation.
Any use case where audio does not need to start playing before synthesis is complete.

Properties

Transport: HTTP POST
Endpoint: https://tts.shunyalabs.ai/v1/audio/speech
Auth: Bearer <API_KEY>
Required: input (the text to synthesize), model, voice
Default format: mp3

Flow

1. POST text → 2. Server synthesizes → 3. Receive audio

Streaming

WebSocket, chunks arrive in real time

Open a persistent WebSocket connection and receive audio chunks in real time as synthesis happens.

Best for

Voice agents and conversational AI requiring sub-second audio start.
IVR and telephony pipelines.
Real-time audio playback in applications.
Any use case where audio must begin playing before synthesis of the full text is complete.

Properties

Transport: WebSocket
Endpoint: wss://tts.shunyalabs.ai/ws/v1/audio/speech
Also available at: /ws/tts, /ws
Config: TTSConfig
Default format: mp3

Lifecycle

1. Connect → 2. Receive chunks → 3. Done
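
The lifecycle above can be sketched as a small consumer loop. This is a minimal sketch, not the documented wire protocol: it assumes audio arrives as binary frames and that any text frames are control messages, neither of which this page specifies. `collect_audio` and `fake_stream` are hypothetical names; in practice the frames would come from a WebSocket client connected to the streaming endpoint.

```python
import asyncio
from typing import AsyncIterator, Union

Frame = Union[bytes, str]

async def collect_audio(frames: AsyncIterator[Frame]) -> bytes:
    """Concatenate binary frames until the stream ends.

    Assumption: binary frames carry audio and text frames carry control
    messages. The real message format is not specified on this page.
    """
    audio = bytearray()
    async for frame in frames:
        if isinstance(frame, (bytes, bytearray)):
            audio.extend(frame)  # step 2: receive chunks
        # Text frames are skipped in this sketch.
    return bytes(audio)  # step 3: done, the stream has closed

async def fake_stream() -> AsyncIterator[Frame]:
    """Stand-in for a live connection so the sketch runs offline."""
    yield b"\x00\x01"
    yield '{"type": "status"}'  # hypothetical control message
    yield b"\x02\x03"

audio = asyncio.run(collect_audio(fake_stream()))
print(len(audio))
```

A real-time player would hand each chunk to the audio device as it arrives instead of buffering the whole stream, which is what makes the streaming mode useful for voice agents.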

Source: Shunyalabs TTS Developer Documentation v1.0 (March 2026), §2.1 Batch overview and §3.1 Streaming overview, text reproduced verbatim.

Endpoints

Mode	Endpoint	Default format
Batch	`POST https://tts.shunyalabs.ai/v1/audio/speech`	mp3
Streaming	`wss://tts.shunyalabs.ai/ws/v1/audio/speech`	mp3 (pcm recommended)
Health	`GET https://tts.shunyalabs.ai/health`	-

Your first synthesis

curl -X POST https://tts.shunyalabs.ai/v1/audio/speech \
  -H "Authorization: Bearer $SHUNYALABS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"zero-indic","input":"Hello, how are you today?","voice":"Varun"}' \
  --output hello.mp3

import asyncio
from shunyalabs import AsyncShunyaClient
from shunyalabs.tts import TTSConfig

async def main():
    async with AsyncShunyaClient() as client:
        result = await client.tts.synthesize(
            "Hello, how are you today?",
            config=TTSConfig(model="zero-indic", voice="Varun"),
        )
        result.save("hello.mp3")

asyncio.run(main())

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://tts.shunyalabs.ai/v1",
)
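# with_streaming_response writes the audio to disk as it arrives.
# Calling stream_to_file on a plain (non-streaming) response is deprecated
# in the OpenAI Python SDK v1.
with client.audio.speech.with_streaming_response.create(
    model="zero-indic",
    input="Hello, how are you today?",
    voice="Varun",
    response_format="mp3",
) as response:
    response.stream_to_file("output.mp3")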

Key features

46 voices

Male and female speakers per language. Any voice can speak any language; cross-lingual synthesis is built in.

7 audio formats

PCM, WAV, MP3, OGG Opus, FLAC, mulaw, alaw. Pick by use case: telephony, web, or archival.

11 expression styles

Happy, Sad, News, Narrative, Conversational, Enthusiastic, and more. A style is applied by prepending its tag to your text.

Voice cloning

Clone a voice from 3-6 seconds of reference audio. Works across all 23 supported languages.
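
A short sketch of preparing the cloning fields: it checks that a WAV clip falls in the 3-6 second range and base64-encodes it into the `reference_wav` field listed under Required fields. `clone_fields` and `silent_wav` are hypothetical helpers, and the assumption that the API expects the whole WAV file (header included) as base64 is not confirmed on this page.

```python
import base64
import io
import wave

def clone_fields(wav_bytes: bytes, transcript: str) -> dict:
    """Return the voice-cloning fields for a request body."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        seconds = wav.getnframes() / wav.getframerate()
    if not 3.0 <= seconds <= 6.0:
        raise ValueError(f"reference clip is {seconds:.1f}s, expected 3-6s")
    return {
        # Assumption: the full WAV file, header included, is encoded.
        "reference_wav": base64.b64encode(wav_bytes).decode("ascii"),
        "reference_text": transcript,
    }

def silent_wav(seconds: float, rate: int = 16000) -> bytes:
    """Build a silent mono 16-bit WAV in memory, for demonstration only."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    return buf.getvalue()

fields = clone_fields(silent_wav(4.0), "Hello, this is my reference clip.")
print(sorted(fields))
```

The returned dictionary can be merged into a normal synthesis request alongside `model`, `input`, and `voice`.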

Streaming

First audio in under 350 ms. Critical for voice agents.

LLM → TTS pipeline

Pipe OpenAI / Anthropic / Gemini tokens to TTS at sentence boundaries. This is the core pattern for low-latency voice agents.
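
The sentence-boundary pattern can be sketched as a small buffer between the LLM and the synthesizer. This is an illustrative sketch rather than the SDK's implementation: `sentences_from_tokens` is a hypothetical helper, the terminator set (including the Devanagari danda `।`) is an assumption, and the token list stands in for a streamed LLM response.

```python
import re
from typing import Iterable, Iterator

# Assumed terminators: ., !, ?, and the danda used in several Indic scripts.
# A terminator only counts once whitespace follows it, so "3.14" is not split.
SENTENCE_END = re.compile(r"[.!?।]+\s+")

def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in the token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        while True:
            match = SENTENCE_END.search(buffer)
            if match is None:
                break
            sentence = buffer[: match.end()].strip()
            buffer = buffer[match.end():]
            yield sentence
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains at end of stream

tokens = ["Hel", "lo there", ". How", " are you", "? I am", " fine."]
for sentence in sentences_from_tokens(tokens):
    print(sentence)  # each sentence would be sent to TTS here
```

Synthesizing per sentence lets audio for the first sentence start while the LLM is still generating the rest. The simple regex will still split after abbreviations such as "Dr. ", so a production pipeline may want a more careful segmenter.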

Required fields, at a glance

{
  "model": "zero-indic",      // required, only "zero-indic" today
  "input": "Your text here",  // required, up to 10,000 chars
  "voice": "Varun",           // required, see Voices page for full list
  "response_format": "mp3",   // optional, default mp3
  "speed": 1.0,               // optional, 0.25 to 4.0
  "language": "en",           // optional, ISO code for preprocessing
  "trim_silence": false,      // optional, tight audio when true
  "reference_wav": "...",     // optional, base64 for voice cloning
  "reference_text": "..."     // optional, transcript for voice cloning
}
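
The limits in the comments above can be checked client-side before a request is sent. The sketch below is one way to do that; `build_speech_request` is a hypothetical helper, not part of the SDK, and it only constructs a `urllib.request.Request` without sending it. Rejecting empty input is an assumption, and the server remains the source of truth for validation.

```python
import json
import urllib.request

ENDPOINT = "https://tts.shunyalabs.ai/v1/audio/speech"

def build_speech_request(api_key: str, text: str, voice: str,
                         speed: float = 1.0,
                         response_format: str = "mp3") -> urllib.request.Request:
    """Validate documented limits and build (but do not send) a batch request."""
    if not text:
        raise ValueError("input must not be empty")  # assumed, not documented
    if len(text) > 10_000:
        raise ValueError("input exceeds 10,000 characters")
    if not 0.25 <= speed <= 4.0:
        raise ValueError("speed must be between 0.25 and 4.0")
    payload = {
        "model": "zero-indic",  # the only model listed today
        "input": text,
        "voice": voice,
        "response_format": response_format,
        "speed": speed,
    }
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

request = build_speech_request("YOUR_API_KEY", "Hello, how are you today?", "Varun")
print(request.get_method(), request.full_url)
```

Passing the result to `urllib.request.urlopen` and writing the response bytes to a file would complete the batch flow.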