Voice cloning

Generate speech in a custom voice from a short reference audio sample. No custom model training, no upload-and-wait. Cloning works across all 23 supported Indic languages: a reference in Hindi can clone into Tamil, Kannada, or English.

How it works

Two parameters control cloning. Both are optional and work alongside the required voice parameter. When reference_wav is supplied, the cloned voice characteristics override the preset speaker identity.

| Parameter | Required | Purpose |
| --- | --- | --- |
| reference_wav | Optional | Base64-encoded audio clip. Extracts speaker identity. |
| reference_text | Optional | Transcript of the reference audio. Improves cloning quality. |
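
As a rough sketch of how the two optional parameters sit alongside the required fields, a small helper can assemble the request body. The function name build_clone_payload is hypothetical and not part of any SDK; the field names, model, and voice values are the ones used on this page.

```python
import base64


def build_clone_payload(text, voice, reference_bytes=None, reference_text=None):
    """Assemble a request body. build_clone_payload is a hypothetical helper."""
    # Fields used by every request on this page.
    payload = {"model": "zero-indic", "input": text, "voice": voice}
    if reference_bytes is not None:
        # reference_wav carries the raw clip as a base64 string.
        payload["reference_wav"] = base64.b64encode(reference_bytes).decode("ascii")
    if reference_text is not None:
        # Optional transcript of the reference clip.
        payload["reference_text"] = reference_text
    return payload
```

Calling it with only text and voice yields a plain preset-voice request; passing the clip bytes (and optionally the transcript) adds the cloning fields.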

Two cloning modes

Pick the mode that fits how much you know about the reference clip.

With a transcript (reference_wav plus reference_text). Higher quality. The model uses speaker identity and speaking style from the reference. The transcript gives semantic alignment between reference and target, significantly improving voice similarity.

python
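import base64
import os

import requests

# Assumption: the key lives in an environment variable; the name is illustrative.
API_KEY = os.environ["SHUNYALABS_API_KEY"]

# Encode the reference clip as a base64 string for the reference_wav field.
with open("reference.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

response = requests.post(
    "https://tts.shunyalabs.ai/v1/audio/speech",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "zero-indic",
        "input": "Text to synthesize in the cloned voice.",
        "voice": "Varun",  # still required; the cloned voice overrides it
        "reference_wav": audio_b64,
        "reference_text": "Exact transcript of the reference audio.",
    },
    timeout=120,
)
# Fail loudly on a 4xx/5xx instead of writing an error body to disk.
response.raise_for_status()

with open("cloned_output.mp3", "wb") as f:
    f.write(response.content)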

Without a transcript (reference_wav only). Speaker identity only. Simpler, slightly lower quality. Use when you don't have a clean transcript.

python
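import base64
import os

import requests

# Assumption: the key lives in an environment variable; the name is illustrative.
API_KEY = os.environ["SHUNYALABS_API_KEY"]

# Encode the reference clip as a base64 string for the reference_wav field.
with open("reference.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

response = requests.post(
    "https://tts.shunyalabs.ai/v1/audio/speech",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "zero-indic",
        "input": "Text to synthesize in the cloned voice.",
        "voice": "Varun",  # still required; the cloned voice overrides it
        "reference_wav": audio_b64,  # no reference_text: identity-only cloning
    },
    timeout=120,
)
# Fail loudly on a 4xx/5xx instead of writing an error body to disk.
response.raise_for_status()

with open("cloned_output.mp3", "wb") as f:
    f.write(response.content)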

Reference audio requirements

| Requirement | Details |
| --- | --- |
| Format | WAV, FLAC, or OGG. Auto-converted to 16 kHz mono if different. |
| Duration | 1-6 seconds recommended. Under 1 s is rejected. Over 6 s: only the first 6 s are used. |
| Sample rate | 16 kHz mono recommended. Other rates (≥ 8 kHz) are auto-resampled. |
| Content | Clear speech. Minimal background noise. Avoid music or overlapping voices. |
| Size | Max 10 MB raw audio. |
| Encoding | Base64 string in the reference_wav field. |
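
The numeric limits above can be checked locally before sending a request. The following is a minimal sketch for PCM WAV files only, using Python's standard wave module; check_reference_wav is a hypothetical helper, and reading "10 MB" as 10 × 1024 × 1024 bytes is an assumption, since the page does not define the unit precisely.

```python
import os
import wave

MAX_BYTES = 10 * 1024 * 1024  # assumption: "10 MB" interpreted as 10 MiB
MIN_SECONDS = 1.0             # shorter clips are rejected
MAX_USED_SECONDS = 6.0        # longer clips are truncated to the first 6 s
MIN_SAMPLE_RATE = 8000        # minimum accepted sample rate in Hz


def check_reference_wav(path):
    """Return a list of problems found in a PCM WAV reference clip."""
    problems = []
    if os.path.getsize(path) > MAX_BYTES:
        problems.append("file is larger than 10 MB")
    # Read the header to get the sample rate and the clip duration.
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        seconds = wav.getnframes() / rate
    if rate < MIN_SAMPLE_RATE:
        problems.append(f"sample rate {rate} Hz is below 8 kHz")
    if seconds < MIN_SECONDS:
        problems.append(f"clip is {seconds:.2f} s; under 1 s is rejected")
    elif seconds > MAX_USED_SECONDS:
        problems.append(f"clip is {seconds:.2f} s; only the first 6 s are used")
    return problems
```

An empty list means the clip passes these checks. FLAC and OGG files would need a different reader, and content quality (noise, music, overlapping voices) still has to be judged by ear.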

Best practices

  • Use 3-6 seconds of clean, clear speech. This is the sweet spot for quality.
  • Provide an accurate reference_text that matches the reference audio exactly. This significantly improves similarity.
  • Use the same clip for consistency. If you call cloning repeatedly, send the same reference every time to keep the voice consistent across generations.
  • Clean your reference: no background music, no echo, no overlapping voices. A quiet office recording with a decent microphone is ideal.
  • The voice parameter is still required, but the cloned voice characteristics override it.
  • Cross-lingual cloning works: a Hindi reference can clone into Tamil output. This is useful for brand voices across markets.

Error cases

| Error | Status | Cause |
| --- | --- | --- |
| Reference audio too short | 400 | Audio must be ≥ 1 second. |
| Invalid base64 encoding | 400 | reference_wav not valid base64. |
| Unsupported audio format | 400 | Use WAV, FLAC, or OGG. |
| reference_wav too large | 400 | Max 10 MB raw audio. |
| Sample rate too low | 400 | Min 8 kHz. |
| reference_text without reference_wav | 400 | Cannot provide a transcript without reference audio. |
| reference_text too long | 400 | Max 500 characters. |
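
Several of these 400 cases can be caught on the client before a request is sent. The sketch below checks a request body against the rules in the table that do not require decoding the audio itself; preflight_errors is a hypothetical helper, the returned strings simply reuse the error names from the table rather than the server's actual response format, and "10 MB" is again assumed to mean 10 × 1024 × 1024 bytes.

```python
import base64
import binascii

MAX_REFERENCE_BYTES = 10 * 1024 * 1024  # assumption: "10 MB" interpreted as 10 MiB
MAX_REFERENCE_TEXT_CHARS = 500


def preflight_errors(payload):
    """Return names of documented 400 cases that can be detected locally."""
    errors = []
    ref_wav = payload.get("reference_wav")
    ref_text = payload.get("reference_text")
    if ref_text is not None and ref_wav is None:
        errors.append("reference_text without reference_wav")
    if ref_text is not None and len(ref_text) > MAX_REFERENCE_TEXT_CHARS:
        errors.append("reference_text too long")
    if ref_wav is not None:
        try:
            # validate=True rejects characters outside the base64 alphabet.
            raw = base64.b64decode(ref_wav, validate=True)
        except (binascii.Error, ValueError):
            errors.append("Invalid base64 encoding")
        else:
            if len(raw) > MAX_REFERENCE_BYTES:
                errors.append("reference_wav too large")
    return errors
```

Duration, sample rate, and audio format are not covered here because they depend on the decoded audio; the server remains the final authority on all of these checks.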

Use cases we've seen

Brand voice consistency

Clone a company's brand-voice talent once and use it everywhere (IVR, ads, notifications) in every language you operate in.

Accessibility

Clone a family member's voice for a user with speech difficulties, so their synthesized speech sounds like their voice.

Audiobook expansion

The author narrates the first chapter; the rest of the book is synthesized in the same voice for rapid multi-language release.

Localized marketing

Record a single English voice sample from your host; generate 23 Indic-language variants of the same ad read.

Ethics reminder
Don't clone someone's voice without their written consent. Synthesizing the voices of real people is a high-impact capability; treat it with the same care you would a signature stamp.