Voice cloning

Generate speech in a custom voice from a short reference audio sample. No custom model training, no upload-and-wait. Cloning works across all 23 supported Indic languages: a reference in Hindi can clone into Tamil, Kannada, or English.

How it works

Two parameters control cloning. Both are optional and work alongside the required `voice` parameter. When `reference_wav` is supplied, the cloned voice characteristics override the preset speaker identity.

Parameter	Required	Purpose
`reference_wav`	Optional	Base64-encoded audio clip. Used to extract speaker identity.
`reference_text`	Optional	Transcript of the reference audio. Improves cloning quality.

Two cloning modes

Pick the mode that fits how much you know about the reference clip.

With transcript (`reference_wav` and `reference_text`). Higher quality. The model uses speaker identity and speaking style from the reference. The transcript gives semantic alignment between reference and target, significantly improving voice similarity.

python

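import base64

import requests

API_KEY = "YOUR_API_KEY"  # placeholder: substitute your own key

# Read the reference clip and base64-encode it for the JSON body.
with open("reference.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

response = requests.post(
    "https://tts.shunyalabs.ai/v1/audio/speech",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "zero-indic",
        "input": "Text to synthesize in the cloned voice.",
        "voice": "Varun",  # still required; the clone overrides it
        "reference_wav": audio_b64,
        "reference_text": "Exact transcript of the reference audio.",
    },
    timeout=120,
)
# Fail loudly on 4xx/5xx instead of writing an error body to disk.
response.raise_for_status()

with open("cloned_output.mp3", "wb") as f:
    f.write(response.content)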

Without transcript (`reference_wav` only). Speaker identity only. Simpler, slightly lower quality. Use when you don't have a clean transcript.

python

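import base64

import requests

API_KEY = "YOUR_API_KEY"  # placeholder: substitute your own key

# Read the reference clip and base64-encode it for the JSON body.
with open("reference.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

response = requests.post(
    "https://tts.shunyalabs.ai/v1/audio/speech",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "zero-indic",
        "input": "Text to synthesize in the cloned voice.",
        "voice": "Varun",  # still required; the clone overrides it
        "reference_wav": audio_b64,  # no reference_text: identity-only mode
    },
    timeout=120,
)
# Fail loudly on 4xx/5xx instead of writing an error body to disk.
response.raise_for_status()

with open("cloned_output.mp3", "wb") as f:
    f.write(response.content)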
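
The two modes differ only in whether `reference_text` is present, so a single helper can cover both. The following is a minimal sketch that assumes the field names and values used in the examples above; `build_clone_payload` is a hypothetical helper name, not part of any official SDK.

```python
import base64
from typing import Optional


def build_clone_payload(
    text: str,
    reference_audio: bytes,
    reference_text: Optional[str] = None,
    voice: str = "Varun",
) -> dict:
    """Build the JSON body for a cloning request (hypothetical helper)."""
    payload = {
        "model": "zero-indic",
        "input": text,
        "voice": voice,  # still required, even though the clone overrides it
        "reference_wav": base64.b64encode(reference_audio).decode("ascii"),
    }
    # Send the transcript only when one is available; omitting it selects
    # the identity-only mode.
    if reference_text is not None:
        payload["reference_text"] = reference_text
    return payload
```

The returned dictionary can be passed as the `json=` argument of `requests.post`, exactly as in the examples above.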

Reference audio requirements

Requirement	Details
Format	WAV, FLAC, or OGG. Auto-converted to 16 kHz mono if different.
Duration	1-6 seconds recommended. Under 1 s is rejected. Over 6 s: only the first 6 s are used.
Sample rate	16 kHz mono recommended. Other rates (≥ 8 kHz) are auto-resampled.
Content	Clear speech. Minimal background noise. Avoid music or overlapping voices.
Size	Max 10 MB raw audio.
Encoding	Base64 string in the `reference_wav` field.
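
Several of these requirements can be checked locally before a request is sent. The sketch below uses Python's standard `wave` module, so it handles PCM WAV files only (FLAC and OGG would need a third-party decoder). The function name `check_reference_wav` is hypothetical, and reading "10 MB" as 10,000,000 bytes is an assumption (the stricter of the two common interpretations).

```python
import io
import wave
from typing import List, Tuple

MIN_SECONDS = 1.0        # shorter clips are rejected
MAX_USED_SECONDS = 6.0   # only the first 6 s are used
MIN_SAMPLE_RATE = 8_000  # Hz
MAX_BYTES = 10_000_000   # assumption: "10 MB" read as 10^7 bytes


def check_reference_wav(raw: bytes) -> Tuple[List[str], List[str]]:
    """Return (errors, notes) for a WAV clip, checked against the table above."""
    errors: List[str] = []
    notes: List[str] = []

    if len(raw) > MAX_BYTES:
        errors.append("file exceeds 10 MB")

    try:
        with wave.open(io.BytesIO(raw), "rb") as w:
            rate = w.getframerate()
            channels = w.getnchannels()
            frames = w.getnframes()
    except (wave.Error, EOFError):
        errors.append("not a readable PCM WAV file")
        return errors, notes

    if rate < MIN_SAMPLE_RATE:
        errors.append(f"sample rate {rate} Hz is below 8 kHz")

    duration = frames / rate if rate else 0.0
    if duration < MIN_SECONDS:
        errors.append(f"clip is {duration:.2f} s; at least 1 s is required")
    elif duration > MAX_USED_SECONDS:
        notes.append("only the first 6 s of the clip will be used")

    if rate != 16_000 or channels != 1:
        notes.append("clip will be auto-converted to 16 kHz mono")

    return errors, notes
```

An empty `errors` list means the clip passes the checks this sketch covers; `notes` flags conditions the service tolerates but that may affect the result.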

Best practices

Use 3-6 seconds of clean, clear speech. This is the sweet spot for quality.
Provide an accurate `reference_text` that matches the reference audio exactly. This significantly improves similarity.
Use the same clip for consistency. If you call cloning repeatedly, use the same reference every time to keep the voice consistent across generations.
Clean your reference: no background music, no echo, no overlapping voices. A quiet office recording with a decent microphone is ideal.
The `voice` parameter is still required, but the cloned voice characteristics override it.
Cross-lingual cloning works: a Hindi reference can clone into Tamil output. This is useful for brand voices across markets.
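
One way to follow the same-clip advice is to load and encode the reference once per process and reuse the result for every request. This is a sketch under that assumption; `load_reference` is a hypothetical helper, and the SHA-256 fingerprint is an optional extra for recording which clip produced which output.

```python
import base64
import hashlib
from functools import lru_cache
from typing import Tuple


@lru_cache(maxsize=None)
def load_reference(path: str) -> Tuple[str, str]:
    """Read a reference clip once; return (base64 audio, SHA-256 fingerprint)."""
    with open(path, "rb") as f:
        raw = f.read()
    audio_b64 = base64.b64encode(raw).decode("ascii")
    fingerprint = hashlib.sha256(raw).hexdigest()
    return audio_b64, fingerprint
```

Because the cache is keyed by path, repeated calls within one process return the identical encoded clip even if the file on disk is later replaced. Storing the fingerprint alongside each generated file makes it easy to confirm afterwards that every generation used the same reference.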

Error cases

Error	Status	Cause
Reference audio too short	400	Audio must be ≥ 1 second.
Invalid base64 encoding	400	`reference_wav` not valid base64.
Unsupported audio format	400	Use WAV, FLAC, or OGG.
`reference_wav` too large	400	Max 10 MB raw audio.
Sample rate too low	400	Min 8 kHz.
`reference_text` without `reference_wav`	400	Cannot provide a transcript without reference audio.
`reference_text` too long	400	Max 500 characters.
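
Some of these 400 responses can be avoided with a client-side check before the request goes out. The sketch below mirrors four rows of the table; `preflight` is a hypothetical helper, counting the 500-character limit with Python's `len()` is an assumption, and so is reading "10 MB" as 10,000,000 bytes. The audio-content checks (duration, format, sample rate) are not covered here.

```python
import base64
import binascii
from typing import List

MAX_REFERENCE_TEXT_CHARS = 500
MAX_RAW_AUDIO_BYTES = 10_000_000  # assumption: "10 MB" read as 10^7 bytes


def preflight(payload: dict) -> List[str]:
    """Return the error-table rows this payload would likely trigger."""
    errors: List[str] = []
    wav = payload.get("reference_wav")
    text = payload.get("reference_text")

    if text is not None and wav is None:
        errors.append("reference_text without reference_wav")
    if text is not None and len(text) > MAX_REFERENCE_TEXT_CHARS:
        errors.append("reference_text too long")

    if wav is not None:
        try:
            # validate=True rejects characters outside the base64 alphabet
            raw = base64.b64decode(wav, validate=True)
        except (binascii.Error, ValueError):
            errors.append("invalid base64 encoding")
        else:
            if len(raw) > MAX_RAW_AUDIO_BYTES:
                errors.append("reference_wav too large")

    return errors
```

An empty list means the payload passed these local checks only; the server remains the authority and can still reject the request.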

Use cases we've seen

Brand voice consistency

Clone a company's brand-voice talent once and use it everywhere: IVR, ads, and notifications, in every language you operate in.

Accessibility

Clone a family member's voice for a user with speech difficulties, so that the user's synthesized speech sounds like a familiar voice.

Audiobook expansion

Author narrates the first chapter; the rest of the book is synthesized in the same voice for rapid multi-language release.

Localized marketing

Record a single English voice sample from your host; generate 23 Indic-language variants of the same ad read.

Ethics reminder

Don't clone someone's voice without their written consent. Synthetic voice of real people is a high-impact capability. Treat it with the same care you would a signature stamp.