# Voice Cloning

## Two Cloning Modes

Choose between providing audio with a transcript (recommended) or audio only.

## Mode 1: Audio + Transcript (Recommended)

Provide both `reference_wav` and `reference_text`. The transcript helps the model align the audio to the spoken content, producing higher-fidelity voice clones.

```python
import asyncio
import base64

from shunyalabs import ShunyaLabs, TTSConfig

async def main():
    client = ShunyaLabs()

    # Read the reference audio and base64-encode it
    with open("reference.wav", "rb") as f:
        ref_audio = base64.b64encode(f.read()).decode("utf-8")

    config = TTSConfig(
        model="zero-indic",
        voice="Varun",
        reference_wav=ref_audio,
        reference_text="This is the transcript of the reference audio.",
    )

    result = await client.tts.synthesize(
        "Hello! This speech will sound like the reference speaker.",
        config=config,
    )
    result.save("cloned_output.mp3")

asyncio.run(main())
```

## Mode 2: Audio Only

Provide only `reference_wav`, without a transcript. The model will still clone the voice, but fidelity may be slightly lower than in Mode 1.

```python
import asyncio
import base64

from shunyalabs import ShunyaLabs, TTSConfig

async def main():
    client = ShunyaLabs()

    # Read the reference audio and base64-encode it
    with open("reference.wav", "rb") as f:
        ref_audio = base64.b64encode(f.read()).decode("utf-8")

    config = TTSConfig(
        model="zero-indic",
        voice="Varun",
        reference_wav=ref_audio,
    )

    result = await client.tts.synthesize(
        "Hello! This speech will sound like the reference speaker.",
        config=config,
    )
    result.save("cloned_output.mp3")

asyncio.run(main())
```
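Whichever mode you use, it can help to sanity-check the reference clip before encoding it, so a malformed or overly long file fails fast on your side rather than at the API. A minimal sketch using only the standard library; the 30-second cap here is an assumption for illustration, not a documented limit of the service:

```python
import base64
import wave

def load_reference_wav(path, max_seconds=30.0):
    """Validate a reference clip, then return it base64-encoded
    for the reference_wav field.

    Opening the file with the stdlib wave module both confirms it
    is a readable WAV and lets us compute its duration.
    """
    with wave.open(path, "rb") as w:
        duration = w.getnframes() / w.getframerate()
    if duration > max_seconds:
        raise ValueError(
            f"reference clip is {duration:.1f}s; "
            f"trim it to under {max_seconds:.0f}s"
        )
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")
```

The returned string can be passed directly as `reference_wav` in either mode.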