Voice cloning
Generate speech in a custom voice from a short reference audio sample. No custom model training, no upload-and-wait. Cloning works across all 23 supported Indic languages: a reference in Hindi can clone into Tamil, Kannada, or English.
How it works
Two parameters control cloning. Both are optional and work alongside the required voice parameter, but reference_text is accepted only together with reference_wav. When reference_wav is supplied, the cloned voice characteristics override the preset speaker identity.
| Parameter | Required | Purpose |
|---|---|---|
| reference_wav | Optional | Base64-encoded audio clip. Extracts speaker identity. |
| reference_text | Optional | Transcript of the reference audio. Improves cloning quality. |
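As a minimal sketch of how the two parameters fit into a request body, the helper below builds the JSON payload that the examples on this page send. `build_payload` is a hypothetical name for illustration, not part of any SDK; the field names and the `zero-indic` model value are taken from this page.

```python
import base64


def build_payload(text, voice, reference_audio=None, reference_text=None):
    """Build a request body for /v1/audio/speech (hypothetical helper).

    reference_audio is the raw bytes of the reference clip; it is
    base64-encoded here before being placed in reference_wav.
    """
    if reference_text is not None and reference_audio is None:
        # A transcript without reference audio is listed as a 400 error.
        raise ValueError("reference_text requires reference_wav")

    # voice stays in the body even when cloning, because it is required.
    payload = {"model": "zero-indic", "input": text, "voice": voice}
    if reference_audio is not None:
        payload["reference_wav"] = base64.b64encode(reference_audio).decode("ascii")
    if reference_text is not None:
        payload["reference_text"] = reference_text
    return payload
```

Calling `build_payload("Hello", "Varun")` yields a plain preset-voice request, while passing `reference_audio` (and optionally `reference_text`) adds the cloning fields.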
Two cloning modes
Pick the mode that fits how much you know about the reference clip.
With reference_text (higher quality). The model uses both speaker identity and speaking style from the reference. The transcript gives semantic alignment between reference and target, which significantly improves voice similarity.
import base64
import requests

API_KEY = "YOUR_API_KEY"  # placeholder: substitute your own key

# Base64-encode the reference clip for the reference_wav field.
with open("reference.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

response = requests.post(
    "https://tts.shunyalabs.ai/v1/audio/speech",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "zero-indic",
        "input": "Text to synthesize in the cloned voice.",
        "voice": "Varun",
        "reference_wav": audio_b64,
        "reference_text": "Exact transcript of the reference audio.",
    },
    timeout=120,
)
response.raise_for_status()  # stop on a 4xx/5xx instead of saving an error body

with open("cloned_output.mp3", "wb") as f:
    f.write(response.content)

Without reference_text (speaker identity only). Simpler, with slightly lower quality. Use this mode when you don't have a clean transcript.
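import base64
import requests

API_KEY = "YOUR_API_KEY"  # placeholder: substitute your own key

# Base64-encode the reference clip for the reference_wav field.
with open("reference.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

response = requests.post(
    "https://tts.shunyalabs.ai/v1/audio/speech",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "zero-indic",
        "input": "Text to synthesize in the cloned voice.",
        "voice": "Varun",
        "reference_wav": audio_b64,  # no reference_text in this mode
    },
    timeout=120,
)
response.raise_for_status()  # stop on a 4xx/5xx instead of saving an error body

with open("cloned_output.mp3", "wb") as f:
    f.write(response.content)

Reference audio requirements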
| Requirement | Details |
|---|---|
| Format | WAV, FLAC, or OGG. Auto-converted to 16 kHz mono if different. |
| Duration | 1-6 seconds recommended. Under 1 s is rejected. Over 6 s: only the first 6 s are used. |
| Sample rate | 16 kHz mono recommended. Other rates (≥ 8 kHz) are auto-resampled. |
| Content | Clear speech. Minimal background noise. Avoid music or overlapping voices. |
| Size | Max 10 MB raw audio. |
| Encoding | Base64 string in the reference_wav field. |
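The requirements above can be checked locally before uploading. The sketch below covers uncompressed PCM WAV files only, using Python's standard `wave` module. `check_reference_wav` is a hypothetical helper, and it assumes that "10 MB" means 10 MiB measured on the file on disk; the service's own checks remain authoritative.

```python
import os
import wave

MIN_SECONDS = 1.0              # shorter clips are rejected
MAX_USED_SECONDS = 6.0         # only the first 6 s are used
MIN_SAMPLE_RATE = 8_000        # Hz
MAX_BYTES = 10 * 1024 * 1024   # assumption: "10 MB" read as 10 MiB


def check_reference_wav(path):
    """Return a list of issues found in a PCM WAV reference clip."""
    issues = []
    if os.path.getsize(path) > MAX_BYTES:
        issues.append("file is larger than 10 MB")

    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        seconds = wav.getnframes() / rate

    if seconds < MIN_SECONDS:
        issues.append(f"clip is {seconds:.2f} s; at least 1 s is required")
    elif seconds > MAX_USED_SECONDS:
        issues.append(f"clip is {seconds:.2f} s; only the first 6 s will be used")
    if rate < MIN_SAMPLE_RATE:
        issues.append(f"sample rate {rate} Hz is below the 8 kHz minimum")
    return issues
```

An empty list means the clip passed these local checks. FLAC and OGG files would need a different reader, since `wave` only parses WAV.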
Best practices
- Use 3-6 seconds of clean, clear speech. This is the sweet spot for quality.
- Provide a reference_text that matches the reference audio exactly. An accurate transcript significantly improves similarity.
- Reuse the same clip. If you call cloning repeatedly, use the same reference every time to keep the voice consistent across generations.
- Clean your reference: no background music, no echo, no overlapping voices. A quiet office recording with a decent microphone is ideal.
- The voice parameter is still required, but the cloned voice characteristics override it.
- Cross-lingual cloning works: a Hindi reference can clone into Tamil output. This is useful for brand voices across markets.
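Two of the practices above, reusing one clip and cloning across languages, can be combined by encoding the reference once and sharing it across every request. The sketch below only builds request bodies; `payloads_for_texts` is a hypothetical helper, and whether the target language needs its own request field is not covered on this page, so none is added here.

```python
import base64


def payloads_for_texts(reference_audio, reference_text, texts, voice="Varun"):
    """Build one request body per target text, all sharing one reference clip."""
    # Encode once so every request carries byte-identical reference audio.
    audio_b64 = base64.b64encode(reference_audio).decode("ascii")
    return [
        {
            "model": "zero-indic",
            "input": text,
            "voice": voice,  # still required, even though cloning overrides it
            "reference_wav": audio_b64,
            "reference_text": reference_text,
        }
        for text in texts
    ]
```

For example, passing the bytes of a Hindi reference clip with `texts=["नमस्ते, आपका स्वागत है।", "வணக்கம், உங்களை வரவேற்கிறோம்."]` produces one Hindi and one Tamil request body that share the same reference.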
Error cases
| Error | Status | Cause |
|---|---|---|
| Reference audio too short | 400 | Audio must be ≥ 1 second. |
| Invalid base64 encoding | 400 | reference_wav not valid base64. |
| Unsupported audio format | 400 | Use WAV, FLAC, or OGG. |
| reference_wav too large | 400 | Max 10 MB raw audio. |
| Sample rate too low | 400 | Min 8 kHz. |
| reference_text without reference_wav | 400 | Cannot provide a transcript without reference audio. |
| reference_text too long | 400 | Max 500 characters. |
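Several of these errors can be caught on the client before a request is sent. The following is a sketch under stated assumptions: `preflight_errors` is a hypothetical helper, "10 MB" is read as 10 MiB of decoded audio, and the 500-character limit is counted with Python's `len`. Duration, format, and sample-rate errors need an audio decoder and are not checked here.

```python
import base64
import binascii

MAX_AUDIO_BYTES = 10 * 1024 * 1024  # assumption: "10 MB" read as 10 MiB
MAX_TRANSCRIPT_CHARS = 500


def preflight_errors(payload):
    """Return the names of cloning errors that can be detected locally."""
    errors = []
    wav_b64 = payload.get("reference_wav")
    transcript = payload.get("reference_text")

    if transcript is not None and wav_b64 is None:
        errors.append("reference_text without reference_wav")
    if transcript is not None and len(transcript) > MAX_TRANSCRIPT_CHARS:
        errors.append("reference_text too long")
    if wav_b64 is not None:
        try:
            # validate=True rejects characters outside the base64 alphabet.
            raw = base64.b64decode(wav_b64, validate=True)
        except (binascii.Error, ValueError):
            errors.append("Invalid base64 encoding")
        else:
            if len(raw) > MAX_AUDIO_BYTES:
                errors.append("reference_wav too large")
    return errors
```

An empty list means none of these locally detectable errors apply; the server can still return a 400 for the remaining cases in the table.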
Use cases we've seen
Clone a company's brand-voice talent once and use it everywhere (IVR, ads, notifications) in every language you operate in.
Clone a family member's voice for a user with speech difficulties, so their synthesized speech sounds like their voice.
Author narrates the first chapter; the rest of the book is synthesized in the same voice for rapid multi-language release.
Record a single English voice sample from your host; generate 23 Indic-language variants of the same ad read.