Voice Cloning

Best Practices

Tips for getting the best results from voice cloning.

Tips for High-Quality Cloning

Use a quiet recording environment. Background noise, echo, and reverb degrade cloning quality. Record in a treated room or use a close-mic setup.
Keep reference audio between 3 and 5 seconds. Clips under 1 second give the model too little data. Clips over 6 seconds yield diminishing returns and increase latency.
Provide a transcript whenever possible. Mode 1 (audio + transcript) consistently outperforms Mode 2 (audio only). The transcript helps the model map the audio to phonemes accurately.
Ensure the reference contains only one speaker. Multi-speaker audio confuses the model. Trim the clip so it contains only the target speaker with no overlapping voices.
Match the reference language to the language you synthesize most often. Cross-lingual synthesis works, but cloning quality is highest when the reference language matches the most frequent synthesis language.
Use lossless or high-bitrate formats. WAV at 16 kHz mono is ideal. Highly compressed MP3 or low-bitrate audio may lose vocal details the model needs for accurate cloning.
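
Several of the tips above (clip length, mono audio, 16 kHz sample rate) can be checked automatically before a clip is submitted for cloning. The following is a minimal sketch using only Python's standard `wave` module. The names `check_reference` and `make_silent_wav` are illustrative and not part of any cloning API. The thresholds come from the tips above, and the checker only handles WAV input.

```python
import io
import wave

# Thresholds taken from the tips above; all names here are illustrative.
MIN_SECONDS = 3.0
MAX_SECONDS = 5.0
IDEAL_RATE_HZ = 16_000


def check_reference(wav_source):
    """Return a list of warnings for a WAV clip (path string or file-like object)."""
    with wave.open(wav_source, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        duration = wav.getnframes() / rate

    warnings = []
    if duration < MIN_SECONDS:
        warnings.append(f"clip is {duration:.2f}s; aim for at least {MIN_SECONDS:.0f}s")
    elif duration > MAX_SECONDS:
        warnings.append(f"clip is {duration:.2f}s; aim for at most {MAX_SECONDS:.0f}s")
    if channels != 1:
        warnings.append(f"clip has {channels} channels; mono is recommended")
    if rate != IDEAL_RATE_HZ:
        warnings.append(f"sample rate is {rate} Hz; {IDEAL_RATE_HZ} Hz is recommended")
    return warnings


def make_silent_wav(seconds, rate=IDEAL_RATE_HZ, channels=1):
    """Build an in-memory 16-bit WAV of silence, used only to demo the checker."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate) * channels)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    # A 1-second stereo clip at 44.1 kHz triggers all three warnings.
    for message in check_reference(make_silent_wav(1.0, rate=44_100, channels=2)):
        print("warning:", message)
```

A check like this cannot detect background noise, multiple speakers, or lossy compression artifacts, so it complements listening to the clip and does not replace it.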