Streaming TTS

Open a persistent WebSocket and receive audio chunks in real time as synthesis happens. Best for voice agents, IVR, and any case where audio has to start playing before the full text is synthesized.

Endpoint

shell
wss://tts.shunyalabs.ai/ws/v1/audio/speech
# Also accepts: /ws/tts, /ws

Connection lifecycle

Your first stream

shell
# Install: npm install -g wscat
wscat -c "wss://tts.shunyalabs.ai/ws/v1/audio/speech" \
  -H "Authorization: Bearer $SHUNYALABS_API_KEY"

# Send synthesis request
> {"model": "zero-indic", "input": "Hello!", "voice": "Varun", "response_format": "pcm"}

# Receive metadata + binary + completion
< {"type": "chunk", "chunk_index": 0, "format": "pcm", "sample_rate": 16000, ...}
< [binary audio data]
< {"type": "completion", "total_chunks": 3, "total_duration_seconds": 0.8, ...}

python
import asyncio
from shunyalabs import AsyncShunyaClient
from shunyalabs.tts import TTSConfig

async def main():
    async with AsyncShunyaClient() as client:
        config = TTSConfig(
            model="zero-indic",
            voice="Varun",
            response_format="pcm",
        )
        with open("output.pcm", "wb") as f:
            async for audio in await client.tts.stream("Hello!", config=config):
                f.write(audio)

asyncio.run(main())
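
The example above writes headerless PCM, which most audio players will not open directly. As a minimal sketch, the standard-library wave module can wrap those bytes in a WAV container. The 16 kHz rate comes from the chunk metadata shown earlier; 16-bit mono samples are an assumption, and pcm_to_wav is a hypothetical helper, not part of the SDK.

```python
import io
import wave

def pcm_to_wav(pcm: bytes, sample_rate: int = 16000,
               channels: int = 1, sample_width: int = 2) -> bytes:
    """Wrap raw PCM samples in a WAV container so ordinary players can open them."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)       # mono (assumed)
        wav.setsampwidth(sample_width)   # bytes per sample; 2 = 16-bit (assumed)
        wav.setframerate(sample_rate)    # 16000 matches the metadata example
        wav.writeframes(pcm)
    return buf.getvalue()

# Example: one second of 16-bit mono silence at 16 kHz.
silence = b"\x00\x00" * 16000
wav_bytes = pcm_to_wav(silence)
```

If the sample width or channel count of your stream differs, pass the real values; wrong values produce audio that plays at the wrong speed or as noise.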

Streaming methods (SDK)

Method                                 What it does
stream(text, config)                   Async generator yielding audio bytes per chunk. Primary streaming method.
stream(text, config, detailed=True)    Yields (chunk_meta, audio_bytes) tuples with chunk_index and timing.
synthesize_stream(text, config)        Collects all chunks internally and returns combined bytes. Convenience wrapper.
stream_to_file(text, path, config)     Streams directly to disk. No in-memory buffer required.

Pick the right variant

All four variants do the same job: they generate audio for some text. Each one is shaped for a different consumer, so pick the one that matches your scenario.

stream(): use when you want to play or process each audio chunk the moment it arrives (voice agents, IVR, real-time playback).

python
async for audio in await client.tts.stream("Hello!", config=config):
    speaker.write(audio)

stream(detailed=True): use when you also need per-chunk metadata (index, timing, format) for logging, analytics, or buffering decisions.

python
async for chunk_meta, audio in await client.tts.stream("Hello!", config=config, detailed=True):
    print(f"chunk {chunk_meta.chunk_index}: {len(audio)} bytes")

synthesize_stream(): use when you need the full audio as a single bytes object and don't care about per-chunk handling. This convenience wrapper consumes the stream for you.

python
audio_bytes = await client.tts.synthesize_stream("Hello!", config=config)

stream_to_file(): use for long-form synthesis (audiobooks, batch scripts) where buffering the full audio in RAM is impractical. It writes each chunk to disk as it arrives, so memory use stays constant regardless of length.

python
await client.tts.stream_to_file("Hello!", "output.pcm", config=config)
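
To make the memory trade-off between the collecting and file-writing variants concrete, here is a self-contained sketch that runs without the SDK. fake_stream is a hypothetical stand-in for client.tts.stream(), and collect and write_to_file only illustrate the two consumption patterns; they are not the SDK's actual implementation.

```python
import asyncio
import os
import tempfile
from typing import AsyncIterator, Tuple

async def fake_stream() -> AsyncIterator[bytes]:
    """Hypothetical stand-in for client.tts.stream(): yields three audio chunks."""
    for chunk in (b"\x01\x02", b"\x03\x04", b"\x05\x06"):
        await asyncio.sleep(0)  # simulate waiting on the network
        yield chunk

async def collect(stream: AsyncIterator[bytes]) -> bytes:
    """Buffer every chunk and return one bytes object (memory grows with length)."""
    return b"".join([chunk async for chunk in stream])

async def write_to_file(stream: AsyncIterator[bytes], path: str) -> int:
    """Write each chunk as it arrives (only one chunk is held in memory at a time)."""
    written = 0
    with open(path, "wb") as f:
        async for chunk in stream:
            written += f.write(chunk)
    return written

async def demo() -> Tuple[bytes, int]:
    path = os.path.join(tempfile.mkdtemp(), "output.pcm")
    combined = await collect(fake_stream())
    bytes_written = await write_to_file(fake_stream(), path)
    return combined, bytes_written

combined, bytes_written = asyncio.run(demo())
```

Both consumers see the same bytes; the difference is only in how much audio sits in memory at once.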

Flush & close commands

For interactive sessions where you send multiple text messages over the same WebSocket, two control messages let you drive the lifecycle:

  • {"type": "flush"}: synthesize whatever buffered text the server has so far, without closing the connection.
  • {"type": "close"}: end the session cleanly.

Flush mid-stream

Client sends:

json
{"type": "flush"}

Server replies (binary audio chunks, then a flushed marker):

json
[binary audio data]
{"type": "flushed", "sequence_id": 1}

Close the session

Client sends:

json
{"type": "close"}
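
If you talk to the WebSocket directly rather than through the SDK, the client has to tell binary audio frames apart from JSON control messages. The following is a minimal sketch of such a dispatcher using only the message types shown on this page (chunk, flushed, completion). SessionState and handle_frame are hypothetical names, and the frame sequence in the example is illustrative rather than a guaranteed server ordering.

```python
import json
from dataclasses import dataclass, field

@dataclass
class SessionState:
    audio: bytearray = field(default_factory=bytearray)  # all audio received so far
    chunks_seen: int = 0                                 # "chunk" metadata messages
    flushes: list = field(default_factory=list)          # sequence_ids of "flushed"
    done: bool = False                                   # set on "completion"

def handle_frame(state: SessionState, frame) -> None:
    """Route one WebSocket frame: bytes are audio, str is a JSON control message."""
    if isinstance(frame, (bytes, bytearray)):
        state.audio.extend(frame)
        return
    msg = json.loads(frame)
    kind = msg.get("type")
    if kind == "chunk":
        state.chunks_seen += 1
    elif kind == "flushed":
        state.flushes.append(msg.get("sequence_id"))
    elif kind == "completion":
        state.done = True
    # Unknown types are ignored here; a real client might log them.

state = SessionState()
for frame in [
    '{"type": "chunk", "chunk_index": 0, "format": "pcm", "sample_rate": 16000}',
    b"\x00\x01\x02\x03",
    '{"type": "flushed", "sequence_id": 1}',
    '{"type": "completion", "total_chunks": 1}',
]:
    handle_frame(state, frame)
```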

Error handling

Connection-level errors are raised when the stream starts; server errors can arrive mid-stream, so keep the whole iteration inside the try block.

python
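from shunyalabs.exceptions import (
    AuthenticationError, ConnectionError, ShunyalabsError,
)

async def collect_audio(client, config):
    chunks = []  # audio received before any error is kept here
    try:
        # Errors can surface on connect or on any later iteration.
        async for audio in await client.tts.stream("Hello!", config=config):
            chunks.append(audio)
    except AuthenticationError:
        print("Invalid API key")
    except ConnectionError:
        # This is the SDK's ConnectionError, which shadows the builtin here.
        print("WebSocket connection failed; check network or endpoint")
    except ShunyalabsError as e:
        print(f"SDK error: {e}")
    return b"".join(chunks)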

Reconnection pattern

python
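import asyncio

from shunyalabs.exceptions import ConnectionError  # SDK class; shadows the builtin

MAX_RECONNECT = 3

async def stream_with_reconnect(client, text, config):
    for attempt in range(MAX_RECONNECT):
        try:
            chunks = []  # start fresh so a failed attempt leaves no partial audio
            async for audio in await client.tts.stream(text, config=config):
                chunks.append(audio)
            return b"".join(chunks)
        except ConnectionError:
            if attempt == MAX_RECONNECT - 1:
                break  # no point waiting after the final attempt
            wait = 2 ** attempt  # exponential backoff: 1s, 2s, ...
            print(f"Reconnecting in {wait}s...")
            await asyncio.sleep(wait)
    raise ConnectionError("Max reconnects exceeded")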
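
For long-running sessions, plain doubling can grow without bound, and many clients retrying in lockstep can hit the server at the same moment. One common refinement, sketched here with the standard library only, is to cap the delay and add random jitter. backoff_delays is a hypothetical helper, and the cap and jitter values are illustrative assumptions rather than recommendations from the service.

```python
import random
from typing import List, Optional

def backoff_delays(attempts: int, base: float = 1.0, cap: float = 30.0,
                   jitter: float = 0.0,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Delays between attempts: base * 2**n, capped, plus optional random jitter."""
    rng = rng or random.Random()
    delays = []
    for n in range(attempts - 1):  # no wait is needed after the final attempt
        delay = min(cap, base * 2 ** n)
        # Stretch each delay by a random fraction in [0, jitter].
        delays.append(delay + rng.uniform(0, jitter * delay))
    return delays

plain = backoff_delays(4)                  # [1.0, 2.0, 4.0]
jittered = backoff_delays(4, jitter=0.5)   # each delay stretched by up to 50%
```

The returned list can replace the `2 ** attempt` expression in a retry loop: sleep for the n-th entry after the n-th failed attempt.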

Tips for streaming

Picking the format

  • PCM: real-time playback. No decoding overhead, lowest latency.
  • mulaw: telephony pipelines. 8 kHz; forward directly to SIP.
  • Avoid FLAC for streaming: the compressed file has to be fully assembled before playback.
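
The format also determines how much playback time a chunk of a given size represents, which matters for buffering decisions. A rough sketch of the arithmetic follows. The 16 kHz PCM rate is taken from the metadata example earlier on this page and 8 kHz mulaw from the list above; 16-bit mono PCM is an assumption, while 8 bits per sample is standard for mu-law (G.711). chunk_duration_seconds is a hypothetical helper.

```python
# Bytes per second of audio for each format under the stated assumptions.
BYTES_PER_SECOND = {
    "pcm": 16000 * 2,   # 16 kHz * 2 bytes per sample (16-bit mono assumed)
    "mulaw": 8000 * 1,  # 8 kHz * 1 byte per sample (8-bit mono)
}

def chunk_duration_seconds(num_bytes: int, fmt: str) -> float:
    """Playback time represented by num_bytes of audio in the given format."""
    return num_bytes / BYTES_PER_SECOND[fmt]

pcm_seconds = chunk_duration_seconds(25600, "pcm")     # 0.8
mulaw_seconds = chunk_duration_seconds(1600, "mulaw")  # 0.2
```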

Minimising time-to-first-audio

  • Pre-open the WebSocket before your LLM call; this saves ~200 ms of connection setup.
  • Use PCM: no client-side decoding overhead.
  • If piping from an LLM, use short sentences as flush units; see LLM → TTS pipeline.
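
The last tip can be sketched as a small regrouping step between the LLM and the WebSocket: accumulate streamed tokens, emit each sentence as soon as it is complete, and follow it with the flush command described earlier. sentence_units is a hypothetical helper, and the punctuation-based splitting rule is a simplifying assumption (it will split on abbreviations such as "e.g. "). How follow-up text messages are framed on the socket is not specified here, so the sketch stops at producing the sentences.

```python
import json
import re
from typing import Iterable, Iterator

# A sentence ends at ., ! or ? followed by whitespace (simplifying assumption).
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def sentence_units(tokens: Iterable[str]) -> Iterator[str]:
    """Regroup streamed LLM tokens into complete sentences."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = SENTENCE_END.split(buffer)
        # Everything except the last part is a finished sentence.
        for sentence in parts[:-1]:
            yield sentence
        buffer = parts[-1]
    if buffer.strip():
        yield buffer.strip()  # trailing text once the token stream ends

FLUSH = json.dumps({"type": "flush"})  # control message from the section above

tokens = ["Hel", "lo there. ", "How are", " you? I am", " fine."]
units = list(sentence_units(tokens))
```

After sending each unit, sending FLUSH asks the server to synthesize it immediately instead of waiting for more text.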

Production readiness checklist

  • ✅ SHUNYALABS_API_KEY read from the environment, not hard-coded
  • ✅ Error handler covers ConnectionError and AuthenticationError
  • ✅ Reconnection logic for long-running sessions
  • ✅ Audio played incrementally per chunk, not buffered
  • ✅ stream_to_file() used for long-form synthesis (audiobooks) to avoid memory pressure