The idea for this project came from my personal experience communicating with non-verbal individuals. Communication aids (AAC devices and apps) for these persons typically rely on generic, robotic voices that sound nothing like the person using them. But if that person has existing voice recordings, even short ones from a phone or messaging app, it is now possible in the age of AI/ML to clone their voice and use it as a personalised text-to-speech system.

In this post I will walk through how I built such a system (AI-assisted, of course) locally using open-source tools, starting from a handful of noisy voice messages and ending with a CLI tool that synthesizes speech in the cloned voice. The setup introduced here should be cross-platform and run on macOS, Linux and Windows.

Evaluating Model Options (Reference Stack)

There are many voice cloning models available, and most of them will not work for this use case, since they require high-quality reference audio to clone a voice. While researching, however, I found the following options worth trying:

Model               Zero-shot cloning   Language support                     Runs on CPU
GPT-SoVITS          Yes                 Chinese, English, Japanese, Korean   Yes
F5-TTS              Yes                 Chinese, English                     Yes (fast via MLX)
Fish Audio S2 Pro   Yes                 Multilingual                         No (4B params, GPU required)
Bark                No (presets only)   Multilingual                         Yes, slow
XTTS v2             Yes                 17 languages                         Yes

XTTS v2 was the clear choice for this implementation. It is the only model that combines zero-shot voice cloning (no training required — just pass a few seconds of reference audio), broad multilingual support, and the ability to run on CPU. No API keys or accounts are required. The table above serves as a reference for the current state of zero-shot cloning; different project requirements (e.g., needing GPU acceleration or specific language support) might lead you to a different choice.

The original Coqui TTS project was abandoned in 2024, but Idiap has continued maintaining it under the package name coqui-tts. It supports 17 languages natively: English, Spanish, French, German, Italian, Portuguese, Polish, Turkish, Russian, Dutch, Czech, Arabic, Chinese, Hungarian, Korean, Japanese and Hindi.

Check that your target language is on that list before starting. Finnish, Estonian, Swedish and many others are not supported. If your language is missing, you would need to fine-tune the model on additional data, which is a significantly more involved project.

The model is approximately 2.1 GB and downloads automatically on first use. It is licensed under the CPML non-commercial licence — it cannot be used in commercial products or services. For personal, research, or non-profit AAC use this is fine, but worth knowing before building anything with broader distribution in mind.

Setting Up the Environment

I used a dedicated conda environment with Python 3.11 to keep dependencies isolated. The version pinning matters for runtime stability — more on that in the gotchas section below.

conda create -n voice_clone python=3.11 -y
conda activate voice_clone
conda install -c conda-forge ffmpeg -y

pip install torch==2.9.1 torchaudio==2.9.1
pip install "torchcodec>=0.9,<0.10"
pip install coqui-tts "transformers>=4.57,<5"
pip install noisereduce librosa soundfile

Hardware requirements: approximately 4 GB of RAM at inference time, plus ~4 GB of disk space for the model and dependencies.

The Audio Pipeline

The quality of the source recordings matters more than quantity. XTTS v2 extracts a speaker embedding from the reference audio, truncating each file to 12 seconds when computing the embedding — any material beyond that point is ignored. A few clean, representative seconds will outperform minutes of noisy material.

In practice, recordings from messaging apps contain background noise, silence and sometimes other speakers. The preprocessing pipeline handles this in three steps.

Step 1 — Convert to WAV

XTTS v2 expects 22050 Hz mono audio:

ffmpeg -i recording.mp4 -ar 22050 -ac 1 -sample_fmt s16 audio.wav
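With several source files, the same conversion is easy to batch. A minimal shell sketch, assuming the originals live in a recordings/ directory (the directory names are my own):

```shell
mkdir -p wav
for f in recordings/*.mp4; do
  [ -e "$f" ] || continue            # skip if the glob matched nothing
  # Keep the base name, swap the extension for .wav
  out="wav/$(basename "${f%.*}").wav"
  ffmpeg -y -i "$f" -ar 22050 -ac 1 -sample_fmt s16 "$out"
done
```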

Step 2 — Noise reduction

Spectral gating with noisereduce removes background noise (room hum, AC units, etc.) while preserving the voice:

import librosa
import noisereducer as nr  # pip package: noisereduce

audio, sr = librosa.load("audio.wav", sr=None)  # keep the file's native rate
reduced = nr.reduce_noise(y=audio, sr=sr, stationary=False, prop_decrease=0.8)

stationary=False estimates a varying noise profile rather than assuming a fixed one, which is the right choice for phone recordings where background conditions shift. stationary=True works better for constant noise sources like a running fan.

Step 3 — Voice activity detection

Silero VAD identifies where speech is present and splits the audio into segments, discarding silence and noise-only sections. It is a 2 MB model that loads in under a second.
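A minimal sketch of this step, assuming the snakers4/silero-vad torch.hub interface and a denoised input file named audio_denoised.wav (both file name and helper name are my own):

```python
def vad_to_seconds(timestamps, sr):
    """Convert Silero VAD sample-offset timestamps into (start, end) pairs in seconds."""
    return [(t["start"] / sr, t["end"] / sr) for t in timestamps]


if __name__ == "__main__":
    # Heavyweight import kept behind the guard; the first run downloads
    # the ~2 MB model from torch.hub.
    import torch

    model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
    get_speech_timestamps, save_audio, read_audio, _, _ = utils

    wav = read_audio("audio_denoised.wav", sampling_rate=16000)  # VAD expects 16 kHz
    speech_ts = get_speech_timestamps(wav, model, sampling_rate=16000)
    for i, (start, end) in enumerate(vad_to_seconds(speech_ts, 16000)):
        print(f"segment {i}: {start:.2f}s - {end:.2f}s")
```

Each entry in speech_ts carries start/end sample offsets; slicing the waveform at those offsets yields the clean segments used below.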

After running the pipeline on seven source files, the output looked like this:

audio_01.wav: 14 segments, 11.8s speech
audio_02.wav:  3 segments,  3.0s speech
audio_03.wav:  1 segment,   0.6s speech

Clean segments:   20
Clean duration:   19.3s
Quality: GOOD — enough audio for XTTS v2 (>= 12s optimal)

If you only have 5–8 seconds of usable audio the model will still work, but expect a weaker voice match — the speaker embedding has less material to generalise from, and the output is more likely to drift toward the model’s default voice characteristics.
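The quality check in the report above is simple duration accounting. A sketch with thresholds taken from this post's numbers (function name and labels are my own):

```python
def rate_reference_audio(segment_durations, optimal_s=12.0, minimum_s=5.0):
    """Label the cleaned reference material by total speech duration.

    >= 12 s is optimal for XTTS v2; 5-12 s still works but the voice
    match will be weaker; below that the embedding has too little to go on.
    """
    total = sum(segment_durations)
    if total >= optimal_s:
        quality = "GOOD"
    elif total >= minimum_s:
        quality = "WEAK"
    else:
        quality = "INSUFFICIENT"
    return total, quality
```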

All cleaned segments are passed as a list to the model. XTTS v2 extracts speaker embeddings from each file independently and averages them, which produces a more stable voice representation than using a single clip.

Synthesis and Post-Processing

Loading the model and synthesising speech is straightforward:

from TTS.api import TTS
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cpu")

wav = tts.tts(
    text="Hello, how are you?",
    speaker_wav=reference_files,   # list of cleaned WAV segments
    language="en",
    gpt_cond_len=12,               # seconds of reference audio used per file
)

On CPU, synthesis takes roughly 10–30 seconds per sentence depending on length. This is acceptable for generating phrases in advance for an AAC application, though not real-time.
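Since generation happens ahead of time, one natural pattern is pre-generating a phrase bank with tts_to_file, which writes straight to disk. A sketch; phrase_filename and the clean/ directory of VAD segments are my own naming:

```python
import glob
import re


def phrase_filename(text, language):
    """Derive a filesystem-safe name for a pre-generated phrase clip."""
    slug = re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")
    return f"{slug[:40]}_{language}.wav"


if __name__ == "__main__":
    from TTS.api import TTS  # heavyweight import kept behind the guard

    reference_files = sorted(glob.glob("clean/*.wav"))  # cleaned VAD segments
    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cpu")
    for phrase in ["Hello, how are you?", "Thank you very much."]:
        tts.tts_to_file(text=phrase, speaker_wav=reference_files, language="en",
                        file_path=phrase_filename(phrase, "en"))
```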

The synthesised audio inherits a faint noise floor from the reference recordings, even after preprocessing. The --denoise flag applies an additional noise reduction pass on the output before saving. The --format flag controls encoding and is independent — combine both to get clean, small files:

python synthesize.py "Hello, how are you?" --denoise 0.5 --format ogg --play
# Saved: hello_en_20260418.ogg (2.1s audio, 1.8s gen)

The size difference between formats is significant — a 2-second clip is roughly 95 KB as WAV but only 10 KB as OGG/Opus, which is natively supported by WhatsApp, Telegram and most messaging apps.
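The WAV figure follows directly from uncompressed-PCM arithmetic, assuming XTTS v2's 24 kHz mono 16-bit output (the helper name is my own):

```python
def pcm_size_kb(duration_s, sample_rate=24000, bytes_per_sample=2, channels=1):
    """Raw PCM payload size in KB, ignoring the ~44-byte WAV header."""
    return duration_s * sample_rate * bytes_per_sample * channels / 1024
```

pcm_size_kb(2.0) gives roughly 94 KB, in line with the observed file size; Opus gets to ~10 KB because it is a lossy codec designed specifically for speech.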

Gotchas Worth Knowing

A few compatibility issues that cost me time:

  • torchcodec version must match PyTorch exactly. torchcodec's compatibility table maps its releases to torch releases (0.9 → torch 2.9, 0.10 → torch 2.10, and so on). Getting this wrong produces a cryptic Symbol not found linker error.
  • Pin transformers to <5. The coqui-tts package imports isin_mps_friendly from transformers.pytorch_utils, which was removed in transformers 5.x.
  • MPS (Apple Silicon GPU) does not work with XTTS v2. The speaker encoder hits a PyTorch MPS limitation with large convolutional filters. Use device="cpu" — there is no workaround on macOS.
  • Licence prompt blocks non-interactive runs. On first model download, coqui-tts asks you to accept the CPML licence interactively. Run the synthesis once interactively before automating, so the prompt is answered and the model is cached.
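The MPS gotcha means device auto-detection needs a special case. A pure-function sketch of my own; at runtime you would feed it torch.cuda.is_available() and torch.backends.mps.is_available():

```python
def pick_device(cuda_available, mps_available):
    """Device selection for XTTS v2: never return "mps", even when offered."""
    if cuda_available:
        return "cuda"
    # MPS is deliberately skipped: the speaker encoder's large convolutional
    # filters hit a PyTorch limitation on Apple Silicon's Metal backend.
    return "cpu"
```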

Results and Limitations

The cloned voice captures the pitch, timbre and general character of the original speaker well enough that listeners familiar with the person recognise it immediately. For an AAC use case — where the goal is that the synthesised voice sounds like this person rather than a generic TTS voice — that is the most important thing.

The main limitation is prosody. XTTS v2 was trained on adult speech datasets, so the model imposes its own learned rhythm and intonation onto the output. The result sounds natural and fluent, but in the model’s cadence rather than the speaker’s. Addressing this properly would require fine-tuning on more data from the same speaker.

For the immediate use case, this is a reasonable trade-off. A voice that sounds like the person — even if the rhythm and intonation are the model’s rather than the speaker’s — is meaningfully better than a generic TTS voice for communication aids.

With the current setup you can synthesise arbitrary Russian phrases on demand, export them as OGG files, and share them directly via messaging apps. The next step will be to package the codebase and publish it in the repository, with a dockerized setup as a potential option for easier deployment.