# Voice Cloning

## Two Cloning Modes
Choose between providing audio with a transcript (recommended) or audio only.
### Mode 1: Audio + Transcript (Recommended)
Provide both `reference_wav` and `reference_text`. The transcript helps the model align the audio to the spoken content, producing higher-fidelity voice clones.
```python
import asyncio
import base64

from shunyalabs import ShunyaLabs, TTSConfig

# Read and base64-encode the reference audio
with open("reference.wav", "rb") as f:
    ref_audio = base64.b64encode(f.read()).decode("utf-8")

config = TTSConfig(
    model="zero-indic",
    voice="Varun",
    reference_wav=ref_audio,
    reference_text="This is the transcript of the reference audio.",
)

async def main():
    client = ShunyaLabs()
    result = await client.tts.synthesize(
        "Hello! This speech will sound like the reference speaker.",
        config=config,
    )
    result.save("cloned_output.mp3")

asyncio.run(main())
```

### Mode 2: Audio Only
Provide only `reference_wav`, without a transcript. The model will still clone the voice, but quality may be slightly lower than in Mode 1.
```python
import asyncio
import base64

from shunyalabs import ShunyaLabs, TTSConfig

# Read and base64-encode the reference audio
with open("reference.wav", "rb") as f:
    ref_audio = base64.b64encode(f.read()).decode("utf-8")

config = TTSConfig(
    model="zero-indic",
    voice="Varun",
    reference_wav=ref_audio,
)

async def main():
    client = ShunyaLabs()
    result = await client.tts.synthesize(
        "Hello! This speech will sound like the reference speaker.",
        config=config,
    )
    result.save("cloned_output.mp3")

asyncio.run(main())
```
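Both modes start by reading and base64-encoding the reference clip. A small helper can sanity-check the clip's duration with the standard-library `wave` module before encoding. This is a sketch, not part of the SDK, and the 3-second minimum is an assumed rule of thumb rather than a documented requirement:

```python
import base64
import wave

def encode_reference(path: str, min_seconds: float = 3.0) -> str:
    """Base64-encode a reference WAV, rejecting clips too short to clone from."""
    # Inspect the WAV header to compute the clip duration
    with wave.open(path, "rb") as wav:
        duration = wav.getnframes() / wav.getframerate()
    if duration < min_seconds:
        raise ValueError(
            f"Reference clip is {duration:.1f}s; need at least {min_seconds}s"
        )
    # Encode the raw file bytes for the reference_wav field
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")
```

The returned string can be passed directly as `reference_wav` in either mode.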