Build a private text-to-speech application from a local Python project and your own approved voice data.
This repository is a public scaffold extracted from one local F5-TTS practice workflow:
script text -> approved voice profile -> F5-TTS -> WAV voiceover
It introduces two paths:
- zero-shot reference TTS: generate speech from a short reference recording and its exact transcript without training
- fine-tuning TTS: prepare reviewed recordings, train an F5-TTS checkpoint, and register that checkpoint behind the same local API
The installable Python package and CLI are named mytts.
Implemented:
- FastAPI service with local safety defaults
- registry-driven voices and language/gender placeholders
- F5-TTS engine adapter with optional fine-tuned checkpoint path
- synchronous WAV generation contract
- long narration chunking and WAV merge path
- long-recording preparation CLI with review-gated F5 metadata export
- tests that mock synthesis instead of downloading model weights
For each new voice before real cloned output:
- add your first approved reference WAV
- replace its registry transcript with the exact spoken text
- install the optional F5-TTS inference dependencies in a Python 3.11 environment
- run an audio quality pass before starting fine-tuning
approved voice data
-> voice registry
-> F5-TTS engine adapter
-> FastAPI generation service
-> generated WAV + metadata log
Training data follows a separate staged path:
prepared narration script
-> script recording manifest
-> reviewed per-clip recordings and exact text
-> reviewed audio_file|text metadata
-> F5-TTS fine-tuning handoff
long recording
-> FFmpeg normalization
-> faster-whisper segmentation/transcription
-> review_manifest.csv
-> reviewed audio_file|text metadata
-> F5-TTS fine-tuning handoff
.
|-- pyproject.toml package metadata and optional dependency groups
|-- src/mytts/
| |-- app.py FastAPI endpoints and local access controls
| |-- schemas.py API request and response models
| |-- service.py narration chunk generation orchestration
| |-- audio.py WAV part merging
| |-- storage.py output file allocation and metadata log
| |-- voices.py voice registry models and path loading
| |-- cli.py data preparation CLI entrypoint
| |-- data_prep.py normalization, transcription, review export helpers
| `-- engines/
| |-- base.py engine protocol
| `-- f5_tts.py F5-TTS adapter
|-- voices/
| |-- registry.json approved voices and reserved placeholders
| `-- self_male_en/ example English reference clip location
|-- data/raw/ long source recordings
|-- outputs/ generated WAV files and metadata log
`-- tests/ API and data-prep tests with a fake engine
The enabled example voice is self_male_en. self_male_zh, self_female_en, and self_female_zh are disabled registry placeholders so language and gender expansion stays configuration-driven.
Use Python 3.11 for the model stack on Apple Silicon. Python 3.12 is allowed for the lightweight API and test scaffold, but the upstream model ecosystem is easier to reproduce with 3.11.
python3.11 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -e ".[dev,inference,prep]"
Install FFmpeg if it is not already available:
brew install ffmpeg
F5-TTS downloads its pretrained assets on first inference. Before using the project for monetized or distributed work, review the pretrained model license yourself.
Zero-shot reference TTS is the quickest setup path. It is inference from a short reference clip, so it does not train a new model. The API request stays the same whether a voice is running from only a reference clip or from a fine-tuned checkpoint.
Configure the first reference voice:
- Put a clean short reference clip at voices/self_male_en/reference.wav.
- Edit voices/registry.json.
- Replace reference_text with the exact transcript of that reference clip.
Reference clips under roughly 12 seconds are a practical starting point for F5-TTS inference. Use a clean clip with consistent microphone distance and no music.
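Before registering a clip, its basic properties can be checked with the standard library. The sketch below is not part of mytts: the function names are hypothetical, the 12-second limit mirrors the guidance above, and the mono 24 kHz expectation is an assumption based on the conversion settings used elsewhere in this README. The wave module reads PCM WAV files only.

```python
import wave
from pathlib import Path


def describe_wav(path):
    """Return channel count, sample rate, and duration of a PCM WAV file."""
    with wave.open(str(path), "rb") as wav:
        channels = wav.getnchannels()
        sample_rate = wav.getframerate()
        duration = wav.getnframes() / float(sample_rate)
    return {"channels": channels, "sample_rate": sample_rate, "duration_s": duration}


def reference_clip_warnings(path, max_seconds=12.0, expected_rate=24000):
    """Collect warnings for a reference clip; an empty list means no issue found."""
    info = describe_wav(path)
    warnings = []
    if info["channels"] != 1:
        warnings.append(f"expected mono, found {info['channels']} channels")
    if info["sample_rate"] != expected_rate:
        warnings.append(f"expected {expected_rate} Hz, found {info['sample_rate']} Hz")
    if info["duration_s"] > max_seconds:
        warnings.append(f"clip is {info['duration_s']:.1f} s, longer than {max_seconds} s")
    return warnings


if __name__ == "__main__":
    clip = Path("voices/self_male_en/reference.wav")
    if clip.exists():
        for line in reference_clip_warnings(clip) or ["reference clip looks usable"]:
            print(line)
```

A passing check says nothing about noise, music, or microphone distance, so listening to the clip is still required.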
Use the following FFmpeg command to convert an iOS M4A recording to a mono 24 kHz WAV:
ffmpeg -i reference.m4a -ac 1 -ar 24000 voices/self_male_en/reference.wav
Keep checkpoint_path set to null for zero-shot reference cloning:
{
"voice_id": "self_male_en",
"reference_audio": "self_male_en/reference.wav",
"reference_text": "Exact spoken reference transcript.",
"checkpoint_path": null
}
The F5-TTS engine still uses the reference clip and transcript after fine-tuning. A fine-tuned checkpoint_path changes the model weights used for the same registered voice; it does not add uploaded speaker references to the API.
uvicorn mytts.app:app --host 127.0.0.1 --port 8000 --reload
Open the dashboard at http://127.0.0.1:8000/ to paste a narration script, choose the enabled voice, generate a WAV, play it, and download it.
The default mode accepts localhost requests. Set MYTTS_API_KEY and pass X-API-Key to require a key.
Check the service and approved voice profiles:
curl http://127.0.0.1:8000/health
curl http://127.0.0.1:8000/voices
Generate from inline narration text:
curl -X POST http://127.0.0.1:8000/generate \
-H "Content-Type: application/json" \
-d '{
"text": "Here is the narration script.",
"voice_id": "self_male_en",
"language": "en",
"format": "wav",
"speed": 1.0
}'
Generate from a narration script file such as scripts/microsoft_quarter_test.txt:
jq -Rs '{
text: .,
voice_id: "self_male_en",
language: "en",
format: "wav",
speed: 1.0
}' scripts/microsoft_quarter_test.txt > /private/tmp/mytts-request.json
curl -sS -X POST http://127.0.0.1:8000/generate \
-H "Content-Type: application/json" \
--data-binary @/private/tmp/mytts-request.json
Successful generation returns a generated file id and a local file endpoint:
{
"file_id": "generated-id",
"audio_url": "/files/generated-id.wav",
"voice_id": "self_male_en",
"language": "en"
}
The generated WAV is written under outputs/. It can also be fetched from the running API:
curl -O http://127.0.0.1:8000/files/generated-id.wav
Replace generated-id.wav with the filename returned in audio_url.
Endpoints:
GET /health
GET /voices
POST /generate
GET /files/{filename}
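The same calls can be made from Python with only the standard library. This is a minimal client sketch, not code shipped with mytts: the helper names are hypothetical, the payload fields and the audio_url response field are taken from the examples above, and sending X-API-Key on the file download as well is an assumption.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"


def build_generate_request(text, voice_id="self_male_en", language="en",
                           speed=1.0, api_key=None):
    """Build the POST /generate request without sending it."""
    payload = {
        "text": text,
        "voice_id": voice_id,
        "language": language,
        "format": "wav",
        "speed": speed,
    }
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Only needed when the server was started with MYTTS_API_KEY set.
        headers["X-API-Key"] = api_key
    return urllib.request.Request(
        f"{BASE_URL}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def generate_and_download(text, out_path, api_key=None):
    """Generate narration synchronously, then save the WAV named in audio_url."""
    with urllib.request.urlopen(build_generate_request(text, api_key=api_key)) as response:
        result = json.load(response)
    headers = {"X-API-Key": api_key} if api_key else {}
    file_request = urllib.request.Request(BASE_URL + result["audio_url"], headers=headers)
    with urllib.request.urlopen(file_request) as response, open(out_path, "wb") as out:
        out.write(response.read())
    return result
```

With the service running, generate_and_download("Here is the narration script.", "narration.wav") would write the audio to narration.wav. Because generation is synchronous, the first call blocks until the whole narration has been synthesized.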
Generation is synchronous in V1. Narration scripts are split into short internal chunks before F5-TTS inference and merged into one WAV file. Generated text, voice id, language, and timestamps are logged under outputs/metadata.jsonl.
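The merge step can be illustrated with the standard wave module. This is a sketch of the general technique, not the code in src/mytts/audio.py: the function name is hypothetical, and it assumes every part is a PCM WAV with identical channel count, sample width, and sample rate.

```python
import wave


def merge_wav_parts(part_paths, out_path):
    """Concatenate PCM WAV parts that share the same audio parameters."""
    if not part_paths:
        raise ValueError("no WAV parts to merge")
    expected = None
    frames = []
    for part in part_paths:
        with wave.open(str(part), "rb") as wav:
            fmt = (wav.getnchannels(), wav.getsampwidth(), wav.getframerate())
            if expected is None:
                expected = fmt
            elif fmt != expected:
                raise ValueError(f"{part} has parameters {fmt}, expected {expected}")
            frames.append(wav.readframes(wav.getnframes()))
    # Write only after every part has been read and validated.
    with wave.open(str(out_path), "wb") as out:
        out.setnchannels(expected[0])
        out.setsampwidth(expected[1])
        out.setframerate(expected[2])
        for chunk in frames:
            out.writeframes(chunk)
    return out_path
```

Reading all parts into memory keeps the sketch short; a streaming variant would suit very long narrations better.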
The API only accepts voice ids from voices/registry.json. It does not accept arbitrary uploaded speaker references.
If you already have a clean script to read, create a script recording session first:
mytts prepare-script-session \
scripts/finetune_script.txt \
data/prepared/finetune_script \
--max-chars 240
The command ignores Title: lines and Markdown headings, splits long narration into short recording prompts, creates data/prepared/finetune_script/recordings/, and writes review_manifest.csv.
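The filtering and splitting behavior described above can be approximated as follows. This is a rough sketch, not the implementation behind prepare-script-session: the function name is hypothetical, and grouping whole sentences greedily up to the character limit is an assumed strategy.

```python
import re


def split_script_into_prompts(script_text, max_chars=240):
    """Drop Title: lines and Markdown headings, then group sentences into prompts."""
    kept = []
    for line in script_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#") or stripped.startswith("Title:"):
            continue
        kept.append(stripped)
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", " ".join(kept))
    prompts, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            prompts.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        prompts.append(current)
    # A single sentence longer than max_chars is kept whole in this sketch.
    return prompts
```

Keeping sentences intact makes each prompt easier to read aloud in one take, at the cost of uneven prompt lengths.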
Record one clean WAV for each manifest row at the audio_file path shown in that row. The spoken words must match the transcript value exactly; if they differ, re-record the clip or edit the transcript before marking the row reviewed=true. The export step also requires every reviewed row to point to an existing audio file.
If your recordings are numbered 1.m4a, 2.m4a, and so on in the manifest recording folder, convert them to the manifest WAV names with:
mytts convert-numbered-recordings \
data/prepared/finetune_script/review_manifest.csv
The numbered source recordings are preserved. The converted WAV files are mono 24 kHz files at the manifest audio_file paths.
Export the reviewed script session to the F5-TTS metadata contract:
mytts export-f5-metadata \
data/prepared/finetune_script/review_manifest.csv \
data/prepared/finetune_script/f5_metadata.csv
Put long recordings in data/raw/, then run:
mytts prepare-recording data/raw/my_recording.wav data/prepared/my_recording
The command:
- normalizes the source to mono 24 kHz WAV with FFmpeg
- transcribes speech segments with faster-whisper (CPU, int8)
- writes extracted clips
- writes review_manifest.csv
Review every transcript row and set reviewed to true only after the clip and transcript match. Export to the F5-TTS metadata contract after review:
mytts export-f5-metadata \
data/prepared/my_recording/review_manifest.csv \
data/prepared/my_recording/f5_metadata.csv
The exported file uses:
audio_file|text
/absolute/path/to/clip.wav|Reviewed transcript.
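The review gate can be pictured as a filter from the manifest to the pipe-delimited file shown above. The sketch below is not the export-f5-metadata implementation. It assumes the manifest columns are named audio_file, transcript, and reviewed, that unreviewed rows are skipped rather than treated as errors, and that relative audio paths resolve against the manifest's directory; the function name is hypothetical.

```python
import csv
from pathlib import Path


def export_reviewed_rows(manifest_path, metadata_path):
    """Write reviewed manifest rows as audio_file|text lines; return the row count."""
    manifest_path = Path(manifest_path)
    rows = []
    with manifest_path.open(newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            if row.get("reviewed", "").strip().lower() != "true":
                continue  # assumption: unreviewed rows are skipped
            audio = Path(row["audio_file"])
            if not audio.is_absolute():
                audio = manifest_path.parent / audio  # assumption: relative to manifest
            if not audio.is_file():
                raise FileNotFoundError(f"reviewed row points to a missing file: {audio}")
            text = row["transcript"].strip()
            if "|" in text:
                raise ValueError(f"transcript contains the delimiter '|': {text!r}")
            rows.append((audio.resolve(), text))
    with Path(metadata_path).open("w", encoding="utf-8", newline="") as handle:
        handle.write("audio_file|text\n")
        for audio, text in rows:
            handle.write(f"{audio}|{text}\n")
    return len(rows)
```

Rejecting transcripts that contain the pipe character avoids producing lines that a pipe-delimited reader would split incorrectly.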
Fine-tuning starts only after the recording manifest is reviewed and exported. Install F5-TTS from an editable checkout when you are ready to fine-tune. Run the helper from this repo, then run its printed commands from the F5-TTS checkout. The dataset preparation command writes into F5-TTS data/<dataset>_pinyin, which is where its fine-tuning CLI resolves --dataset_name.
mytts finetune-commands data/prepared/finetune_script/f5_metadata.csv --dataset-name my_voice
The helper uses a conservative frame batch size for Apple Silicon. Increase it only after a short run proves the local accelerator has enough memory.
The practice run behind this repository used a reviewed script dataset, Apple Silicon, and a pip-installed F5-TTS CLI from a local .venv. The concrete paths below came from a Python 3.12 virtual environment; if your .venv uses another Python version, replace that path segment or use the commands printed by mytts finetune-commands.
Prepare the dataset into the data directory that the installed CLI resolves:
mkdir -p .venv/lib/python3.12/data/Emilia_ZH_EN_pinyin
cp \
.venv/lib/python3.12/site-packages/f5_tts/infer/examples/vocab.txt \
.venv/lib/python3.12/data/Emilia_ZH_EN_pinyin/vocab.txt
.venv/bin/python \
.venv/lib/python3.12/site-packages/f5_tts/train/datasets/prepare_csv_wavs.py \
data/prepared/finetune_script/f5_metadata.csv \
.venv/lib/python3.12/data/my_voice_pinyin \
--workers 4
On Apple Silicon, start with a short run before committing to more epochs. Run one epoch first so you validate the model stack and dataset before starting a long job:
.venv/bin/f5-tts_finetune-cli \
--exp_name F5TTS_v1_Base \
--dataset_name my_voice \
--finetune \
--learning_rate 1e-5 \
--batch_size_per_gpu 800 \
--epochs 1 \
--num_warmup_updates 10 \
--save_per_updates 37 \
--last_per_updates 37 \
--keep_last_n_checkpoints 1
The F5 CLI resumes from model_last.pt in its dataset checkpoint directory. Its --epochs value is the total epoch target, not the number of extra epochs. For example, after a successful one-epoch run, this command continues to a total of 21 epochs:
.venv/bin/f5-tts_finetune-cli \
--exp_name F5TTS_v1_Base \
--dataset_name my_voice \
--finetune \
--learning_rate 1e-5 \
--batch_size_per_gpu 800 \
--epochs 21 \
--num_warmup_updates 10 \
--save_per_updates 185 \
--last_per_updates 37 \
--keep_last_n_checkpoints 1
For the local .venv layout above, checkpoints are written under:
.venv/lib/python3.12/ckpts/my_voice/
After a checkpoint is produced, set checkpoint_path for the registered voice in voices/registry.json:
{
"voice_id": "self_male_en",
"checkpoint_path": "../.venv/lib/python3.12/ckpts/my_voice/model_last.pt"
}
The API keeps using the same voice id and request payload. To compare zero-shot output against fine-tuned output, switch only checkpoint_path between null and the checkpoint file path.
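Switching checkpoint_path can be scripted when comparing the two modes repeatedly. This is a hedged sketch: set_checkpoint_path is a hypothetical helper, not part of the mytts CLI, and the registry layout is an assumption (either a top-level list of voice entries or an object with a "voices" list). Adjust it to the actual structure of voices/registry.json.

```python
import json
from pathlib import Path


def set_checkpoint_path(registry_path, voice_id, checkpoint_path=None):
    """Set checkpoint_path for one voice; None is written as JSON null."""
    registry_path = Path(registry_path)
    data = json.loads(registry_path.read_text(encoding="utf-8"))
    # Assumption: entries live in a top-level list or under a "voices" key.
    entries = data["voices"] if isinstance(data, dict) else data
    for entry in entries:
        if entry.get("voice_id") == voice_id:
            entry["checkpoint_path"] = checkpoint_path
            break
    else:
        raise KeyError(f"voice_id not found in registry: {voice_id}")
    registry_path.write_text(
        json.dumps(data, indent=2, ensure_ascii=False) + "\n", encoding="utf-8"
    )
    return entry
```

Calling set_checkpoint_path("voices/registry.json", "self_male_en") returns the voice to zero-shot mode, and passing the checkpoint file path as the third argument selects the fine-tuned weights.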
- Prove local English male voice cloning through the API with self_male_en.
- Record or collect clean long English narration files.
- Prepare clips and transcripts with mytts prepare-recording.
- Review transcripts before exporting F5-TTS metadata.
- Fine-tune only after zero-shot output has been evaluated.
- Register and compare the fine-tuned checkpoint against the reference-audio path.
- Add Chinese or female voice profiles by enabling new registry entries only after data and reference assets exist.
pytest
The automated tests use a fake WAV-writing engine. They do not download F5-TTS weights.
Current coverage verifies:
- API validation and registry exposure
- disabled and unsupported voice/language cases
- output-file path safety
- narration chunk generation through a mocked engine
- review-gated F5 metadata export
Project source code is licensed under the MIT License. See LICENSE.
This project integrates with F5-TTS. F5-TTS source code is MIT-licensed, while upstream pretrained F5-TTS model weights are licensed separately under CC-BY-NC. This project license does not relicense third-party model weights, datasets, or voice assets.
This first milestone keeps four layers separate:
voice data -> cloning or fine-tuned model -> local API -> video workflow
Do not train a foundation model from scratch on the M1 for this project. Start with the approved reference clip, listen to generated narration, then decide how much reviewed voice data is worth fine-tuning.
For long recordings, prefer clean narration:
WAV or other lossless source
one speaker
no background music
no reverb
consistent microphone distance
Keep the service single-user while the registry only represents your approved voices. Do not add arbitrary uploaded speaker references to the generation endpoint.