monchewharry/Your_Local_TTS

Your Local TTS

Build a private text-to-speech application from a local Python project and your own approved voice data.

This repository is a public scaffold extracted from one local F5-TTS practice workflow:

script text -> approved voice profile -> F5-TTS -> WAV voiceover

It introduces two paths:

  1. zero-shot reference TTS: generate speech from a short reference recording and its exact transcript without training
  2. fine-tuning TTS: prepare reviewed recordings, train an F5-TTS checkpoint, and register that checkpoint behind the same local API

The installable Python package and CLI are named mytts.

Current status

Implemented:

  • FastAPI service with local safety defaults
  • registry-driven voices and language/gender placeholders
  • F5-TTS engine adapter with optional fine-tuned checkpoint path
  • synchronous WAV generation contract
  • long narration chunking and WAV merge path
  • long-recording preparation CLI with review-gated F5 metadata export
  • tests that mock synthesis instead of downloading model weights

Before generating real cloned output with a new voice:

  • add your first approved reference WAV
  • replace its registry transcript with the exact spoken text
  • install the optional F5-TTS inference dependencies in a Python 3.11 environment
  • run an audio quality pass before starting fine-tuning

Architecture

approved voice data
  -> voice registry
  -> F5-TTS engine adapter
  -> FastAPI generation service
  -> generated WAV + metadata log

Training data follows a separate staged path:

prepared narration script
  -> script recording manifest
  -> reviewed per-clip recordings and exact text
  -> reviewed audio_file|text metadata
  -> F5-TTS fine-tuning handoff

long recording
  -> FFmpeg normalization
  -> faster-whisper segmentation/transcription
  -> review_manifest.csv
  -> reviewed audio_file|text metadata
  -> F5-TTS fine-tuning handoff

Project layout

.
|-- pyproject.toml          package metadata and optional dependency groups
|-- src/mytts/
|   |-- app.py              FastAPI endpoints and local access controls
|   |-- schemas.py          API request and response models
|   |-- service.py          narration chunk generation orchestration
|   |-- audio.py            WAV part merging
|   |-- storage.py          output file allocation and metadata log
|   |-- voices.py           voice registry models and path loading
|   |-- cli.py              data preparation CLI entrypoint
|   |-- data_prep.py        normalization, transcription, review export helpers
|   `-- engines/
|       |-- base.py         engine protocol
|       `-- f5_tts.py       F5-TTS adapter
|-- voices/
|   |-- registry.json       approved voices and reserved placeholders
|   `-- self_male_en/       example English reference clip location
|-- data/raw/               long source recordings
|-- outputs/                generated WAV files and metadata log
`-- tests/                  API and data-prep tests with a fake engine

The enabled example voice is self_male_en. self_male_zh, self_female_en, and self_female_zh are disabled registry placeholders so language and gender expansion stays configuration-driven.

Setup

Use Python 3.11 for the model stack on Apple Silicon. Python 3.12 is allowed for the lightweight API and test scaffold, but the upstream model ecosystem is easier to reproduce with 3.11.

python3.11 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -e ".[dev,inference,prep]"

Install FFmpeg if it is not already available:

brew install ffmpeg

F5-TTS downloads its pretrained assets on first inference. Before using the project for monetized or distributed work, review the pretrained model license yourself.

Path 1: zero-shot reference TTS

Zero-shot reference TTS is the quickest setup path. It is inference from a short reference clip, so it does not train a new model. The API request stays the same whether a voice is running from only a reference clip or from a fine-tuned checkpoint.

Configure the first reference voice:

  1. Put a clean short reference clip at voices/self_male_en/reference.wav.
  2. Edit voices/registry.json.
  3. Replace reference_text with the exact transcript of that reference clip.

Reference clips under roughly 12 seconds are a practical starting point for F5-TTS inference. Use a clean clip with consistent microphone distance and no music.
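
As a rough illustration of that guideline, the sketch below inspects a reference WAV with Python's standard `wave` module. The function names and the 12-second constant are illustrative assumptions, not part of the mytts package, and the check only works for PCM WAV files.

```python
import wave
from pathlib import Path

# Rough guideline from the text above, not a hard F5-TTS limit.
MAX_REFERENCE_SECONDS = 12.0


def describe_wav(path: Path) -> dict:
    """Return channel count, sample rate, and duration for a PCM WAV file."""
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "seconds": wav.getnframes() / rate,
        }


def reference_warnings(path: Path) -> list[str]:
    """Collect readable warnings; an empty list means nothing was flagged."""
    info = describe_wav(path)
    warnings = []
    if info["seconds"] > MAX_REFERENCE_SECONDS:
        warnings.append(
            f"clip is {info['seconds']:.1f}s; consider trimming to about "
            f"{MAX_REFERENCE_SECONDS:.0f}s"
        )
    if info["channels"] != 1:
        warnings.append(f"clip has {info['channels']} channels; mono is expected")
    return warnings
```

Calling `reference_warnings(Path("voices/self_male_en/reference.wav"))` before editing the registry gives a quick sanity check. It cannot judge noise, music, or microphone distance, so listening to the clip is still required.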

Use the following FFmpeg command to convert an iOS M4A recording to a mono 24 kHz WAV:

ffmpeg -i reference.m4a -ac 1 -ar 24000 voices/self_male_en/reference.wav

Keep checkpoint_path set to null for zero-shot reference cloning:

{
  "voice_id": "self_male_en",
  "reference_audio": "self_male_en/reference.wav",
  "reference_text": "Exact spoken reference transcript.",
  "checkpoint_path": null
}

The F5-TTS engine still uses the reference clip and transcript after fine-tuning. A fine-tuned checkpoint_path changes the model weights used for the same registered voice; it does not add uploaded speaker references to the API.

Run the local app

uvicorn mytts.app:app --host 127.0.0.1 --port 8000 --reload

Open the dashboard at http://127.0.0.1:8000/ to paste a narration script, choose the enabled voice, generate a WAV, play it, and download it.

Use the API

In the default mode the service accepts localhost requests without a key. To require a key, set MYTTS_API_KEY and pass the same value in the X-API-Key header.

Check the service and approved voice profiles:

curl http://127.0.0.1:8000/health
curl http://127.0.0.1:8000/voices

Generate from inline narration text:

curl -X POST http://127.0.0.1:8000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Here is the narration script.",
    "voice_id": "self_male_en",
    "language": "en",
    "format": "wav",
    "speed": 1.0
  }'

Generate from a narration script file such as scripts/microsoft_quarter_test.txt:

jq -Rs '{
  text: .,
  voice_id: "self_male_en",
  language: "en",
  format: "wav",
  speed: 1.0
}' scripts/microsoft_quarter_test.txt > /private/tmp/mytts-request.json
curl -sS -X POST http://127.0.0.1:8000/generate \
  -H "Content-Type: application/json" \
  --data-binary @/private/tmp/mytts-request.json

Successful generation returns a generated file id and a local file endpoint:

{
  "file_id": "generated-id",
  "audio_url": "/files/generated-id.wav",
  "voice_id": "self_male_en",
  "language": "en"
}

The generated WAV is written under outputs/. It can also be fetched from the running API:

curl -O http://127.0.0.1:8000/files/generated-id.wav

Replace generated-id.wav with the filename returned in audio_url.
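
The same request can be made from Python. The sketch below uses only the standard library and mirrors the curl examples above. The helper names are hypothetical. It assumes the app is running on 127.0.0.1:8000 and that the file endpoint accepts the same X-API-Key header when a key is configured.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"


def _headers(api_key=None):
    """Build request headers; X-API-Key is only sent when a key is given."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["X-API-Key"] = api_key
    return headers


def build_generate_request(text, voice_id="self_male_en", language="en",
                           speed=1.0, api_key=None):
    """Create the POST /generate request with the payload shown above."""
    payload = {
        "text": text,
        "voice_id": voice_id,
        "language": language,
        "format": "wav",
        "speed": speed,
    }
    return urllib.request.Request(
        f"{BASE_URL}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers=_headers(api_key),
        method="POST",
    )


def generate_and_download(text, out_path, api_key=None, **fields):
    """POST the script, then fetch the WAV named in audio_url."""
    request = build_generate_request(text, api_key=api_key, **fields)
    with urllib.request.urlopen(request) as response:
        result = json.load(response)
    file_request = urllib.request.Request(
        BASE_URL + result["audio_url"], headers=_headers(api_key)
    )
    with urllib.request.urlopen(file_request) as audio, open(out_path, "wb") as out:
        out.write(audio.read())
    return result
```

With the server running, `generate_and_download("Here is the narration script.", "narration.wav")` saves the generated audio locally and returns the JSON response.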

Endpoints:

GET  /health
GET  /voices
POST /generate
GET  /files/{filename}

Generation is synchronous in V1. Narration scripts are split into short internal chunks before F5-TTS inference and merged into one WAV file. Generated text, voice id, language, and timestamps are logged under outputs/metadata.jsonl.
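
To make the merge step concrete, here is a minimal sketch that concatenates WAV parts with the standard `wave` module. It is not the code in audio.py. The function name is hypothetical, and it assumes every part is PCM with the same channel count, sample width, and sample rate.

```python
import wave
from pathlib import Path


def merge_wav_parts(parts: list[Path], output: Path) -> Path:
    """Concatenate PCM WAV parts that share one audio format."""
    if not parts:
        raise ValueError("no parts to merge")
    with wave.open(str(parts[0]), "rb") as first:
        # (nchannels, sampwidth, framerate) of the first part is the reference.
        expected = first.getparams()[:3]
    with wave.open(str(output), "wb") as merged:
        merged.setnchannels(expected[0])
        merged.setsampwidth(expected[1])
        merged.setframerate(expected[2])
        for part in parts:
            with wave.open(str(part), "rb") as chunk:
                if chunk.getparams()[:3] != expected:
                    raise ValueError(f"{part} has a different audio format")
                merged.writeframes(chunk.readframes(chunk.getnframes()))
    return output
```

Rejecting mismatched parts up front avoids a merged file that plays at the wrong speed or pitch.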

The API only accepts voice ids from voices/registry.json. It does not accept arbitrary uploaded speaker references.
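
The allowlist behaviour can be sketched as a lookup that rejects anything not registered and enabled. This is an illustration only: it assumes the registry has already been loaded into a list of entry dictionaries and that each entry carries a hypothetical `enabled` flag. The real schema in voices.py may differ.

```python
class VoiceNotAllowed(ValueError):
    """Raised when a request names a voice that cannot be used."""


def resolve_voice(entries: list[dict], voice_id: str) -> dict:
    """Return the registry entry for voice_id, or refuse the request."""
    for entry in entries:
        if entry.get("voice_id") == voice_id:
            # "enabled" is an assumed field name for the disabled placeholders.
            if not entry.get("enabled", False):
                raise VoiceNotAllowed(f"voice {voice_id!r} is registered but disabled")
            return entry
    raise VoiceNotAllowed(f"voice {voice_id!r} is not in the registry")
```

Because the lookup only reads the registry, a request can never introduce a new speaker reference.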

Prepare script recordings for fine-tuning

If you already have a clean script to read, create a script recording session first:

mytts prepare-script-session \
  scripts/finetune_script.txt \
  data/prepared/finetune_script \
  --max-chars 240

The command ignores Title: lines and Markdown headings, splits long narration into short recording prompts, creates data/prepared/finetune_script/recordings/, and writes review_manifest.csv.
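
The splitting behaviour described above can be approximated as follows. This is a sketch and not the code behind prepare-script-session. The function name is hypothetical, sentence boundaries are detected with a simple punctuation rule, and a single sentence longer than the limit is kept whole.

```python
import re


def script_to_prompts(script: str, max_chars: int = 240) -> list[str]:
    """Split narration into prompts of at most max_chars where possible."""
    # Drop blank lines, Title: lines, and Markdown headings.
    body = [
        line.strip()
        for line in script.splitlines()
        if line.strip() and not line.lstrip().startswith(("Title:", "#"))
    ]
    # Split on whitespace that follows sentence-ending punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", " ".join(body))
    prompts, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the limit, so close the prompt.
            prompts.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        prompts.append(current)
    return prompts
```

Packing whole sentences keeps each recording prompt natural to read aloud, which matters more for training data than filling every prompt to the limit.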

Record one clean WAV for each manifest row at the audio_file path shown in that row. The spoken words must match the transcript value exactly; if they differ, re-record the clip or edit the transcript before marking the row reviewed=true. The export step also requires every reviewed row to point to an existing audio file.

If your recordings are numbered 1.m4a, 2.m4a, and so on in the manifest recording folder, convert them to the manifest WAV names with:

mytts convert-numbered-recordings \
  data/prepared/finetune_script/review_manifest.csv

The numbered source recordings are preserved. The converted WAV files are mono 24 kHz files at the manifest audio_file paths.
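
One way to picture the conversion is as a list of FFmpeg commands, one per manifest row. The sketch below is an assumption-heavy illustration and not the CLI's implementation. It assumes that N.m4a corresponds to the Nth manifest row, that the manifest has an audio_file column, and it reuses the mono 24 kHz FFmpeg flags shown earlier in this README.

```python
import csv
from pathlib import Path


def numbered_conversion_commands(manifest_path: Path,
                                 recordings_dir: Path) -> list[list[str]]:
    """Build one ffmpeg command per manifest row with a numbered source."""
    with manifest_path.open(newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle))
    commands = []
    for number, row in enumerate(rows, start=1):
        source = recordings_dir / f"{number}.m4a"
        if not source.exists():
            continue  # nothing recorded for this row yet
        # Writing to a new path leaves the numbered source untouched.
        commands.append(
            ["ffmpeg", "-i", str(source), "-ac", "1", "-ar", "24000",
             row["audio_file"]]
        )
    return commands
```

Each command can then be executed with `subprocess.run(command, check=True)`. Building the list first makes it easy to print and review before anything is written.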

Export the reviewed script session to the F5-TTS metadata contract:

mytts export-f5-metadata \
  data/prepared/finetune_script/review_manifest.csv \
  data/prepared/finetune_script/f5_metadata.csv

Prepare long recordings for fine-tuning

Put long recordings in data/raw/, then run:

mytts prepare-recording data/raw/my_recording.wav data/prepared/my_recording

The command:

  1. normalizes the source to mono 24 kHz WAV with FFmpeg
  2. transcribes speech segments with faster-whisper CPU int8
  3. writes extracted clips
  4. writes review_manifest.csv

Review every transcript row and set reviewed to true only after the clip and transcript match. Export to the F5-TTS metadata contract after review:

mytts export-f5-metadata \
  data/prepared/my_recording/review_manifest.csv \
  data/prepared/my_recording/f5_metadata.csv

The exported file uses:

audio_file|text
/absolute/path/to/clip.wav|Reviewed transcript.
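
The review gate can be sketched as a filter over the manifest. This is an illustration, not the shipped exporter. The column names audio_file, transcript, and reviewed are taken from the descriptions above but are still assumptions about the exact CSV header, and the function name is hypothetical.

```python
import csv
from pathlib import Path


def export_f5_metadata(manifest_path: Path, output_path: Path) -> int:
    """Write reviewed manifest rows as audio_file|text; return the row count."""
    with manifest_path.open(newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle))
    lines = ["audio_file|text"]
    for row in rows:
        if row["reviewed"].strip().lower() != "true":
            continue  # unreviewed rows never reach the training data
        audio = Path(row["audio_file"]).resolve()
        if not audio.is_file():
            raise FileNotFoundError(f"reviewed row points to a missing file: {audio}")
        text = row["transcript"].strip()
        if "|" in text:
            # A pipe inside the transcript would corrupt the two-column format.
            raise ValueError(f"transcript contains the '|' delimiter: {text!r}")
        lines.append(f"{audio}|{text}")
    output_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines) - 1
```

Failing loudly on a missing file or a stray delimiter is deliberate: a silent skip would shrink the dataset without anyone noticing.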

Path 2: fine-tuning TTS

Fine-tuning starts only after the recording manifest is reviewed and exported. When you are ready to fine-tune, install F5-TTS from an editable checkout. Run the helper below from this repo, then run the commands it prints from the F5-TTS checkout. The dataset preparation command writes into the F5-TTS data/<dataset>_pinyin directory, which is where the fine-tuning CLI resolves --dataset_name.

mytts finetune-commands data/prepared/finetune_script/f5_metadata.csv --dataset-name my_voice

The helper uses a conservative frame batch size for Apple Silicon. Increase it only after a short run proves the local accelerator has enough memory.

Practice-based training run

The practice run behind this repository used a reviewed script dataset, Apple Silicon, and a pip-installed F5-TTS CLI from a local .venv. The concrete paths below came from a Python 3.12 virtual environment; if your .venv uses another Python version, replace that path segment or use the commands printed by mytts finetune-commands.

Prepare the dataset into the data directory that the installed CLI resolves:

mkdir -p .venv/lib/python3.12/data/Emilia_ZH_EN_pinyin
cp \
  .venv/lib/python3.12/site-packages/f5_tts/infer/examples/vocab.txt \
  .venv/lib/python3.12/data/Emilia_ZH_EN_pinyin/vocab.txt

.venv/bin/python \
  .venv/lib/python3.12/site-packages/f5_tts/train/datasets/prepare_csv_wavs.py \
  data/prepared/finetune_script/f5_metadata.csv \
  .venv/lib/python3.12/data/my_voice_pinyin \
  --workers 4

On Apple Silicon, run a single epoch first to validate the model stack and dataset before committing to a long job:

.venv/bin/f5-tts_finetune-cli \
  --exp_name F5TTS_v1_Base \
  --dataset_name my_voice \
  --finetune \
  --learning_rate 1e-5 \
  --batch_size_per_gpu 800 \
  --epochs 1 \
  --num_warmup_updates 10 \
  --save_per_updates 37 \
  --last_per_updates 37 \
  --keep_last_n_checkpoints 1

The F5 CLI resumes from model_last.pt in its dataset checkpoint directory. Its --epochs value is the total epoch target, not the number of extra epochs. For example, after a successful one-epoch run, this command continues to a total of 21 epochs:

.venv/bin/f5-tts_finetune-cli \
  --exp_name F5TTS_v1_Base \
  --dataset_name my_voice \
  --finetune \
  --learning_rate 1e-5 \
  --batch_size_per_gpu 800 \
  --epochs 21 \
  --num_warmup_updates 10 \
  --save_per_updates 185 \
  --last_per_updates 37 \
  --keep_last_n_checkpoints 1
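
Because --epochs is a total, the remaining work is the difference between the target and what has already run. The arithmetic below is a worked sketch. It assumes 37 updates per epoch, which is only inferred from the --save_per_updates 37 value in the one-epoch command and will differ for other datasets and batch sizes.

```python
# Assumption: inferred from --save_per_updates 37 in the one-epoch run above.
UPDATES_PER_EPOCH = 37


def remaining_training(completed_epochs: int, target_epochs: int,
                       save_per_updates: int) -> dict:
    """Summarize how much training a resumed run still has to do."""
    remaining_epochs = max(target_epochs - completed_epochs, 0)
    return {
        "remaining_epochs": remaining_epochs,
        "remaining_updates": remaining_epochs * UPDATES_PER_EPOCH,
        "epochs_between_saves": save_per_updates / UPDATES_PER_EPOCH,
    }


print(remaining_training(completed_epochs=1, target_epochs=21, save_per_updates=185))
# {'remaining_epochs': 20, 'remaining_updates': 740, 'epochs_between_saves': 5.0}
```

Under that assumption, --save_per_updates 185 corresponds to a numbered checkpoint roughly every five epochs, while --last_per_updates 37 refreshes model_last.pt about once per epoch.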

For the local .venv layout above, checkpoints are written under:

.venv/lib/python3.12/ckpts/my_voice/

After a checkpoint is produced, set checkpoint_path for the registered voice in voices/registry.json:

{
  "voice_id": "self_male_en",
  "checkpoint_path": "../.venv/lib/python3.12/ckpts/my_voice/model_last.pt"
}

The API keeps using the same voice id and request payload. To compare zero-shot output against fine-tuned output, switch only checkpoint_path between null and the checkpoint file path.

Suggested workflow

  1. Prove local English male voice cloning through the API with self_male_en.
  2. Record or collect clean long English narration files.
  3. Prepare clips and transcripts with mytts prepare-recording.
  4. Review transcripts before exporting F5-TTS metadata.
  5. Fine-tune only after zero-shot output has been evaluated.
  6. Register and compare the fine-tuned checkpoint against the reference-audio path.
  7. Add Chinese or female voice profiles by enabling new registry entries only after data and reference assets exist.

Tests

pytest

The automated tests use a fake WAV-writing engine. They do not download F5-TTS weights.

Current coverage verifies:

  • API validation and registry exposure
  • disabled and unsupported voice/language cases
  • output-file path safety
  • narration chunk generation through a mocked engine
  • review-gated F5 metadata export
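
A fake engine of the kind described above can be very small. The sketch below is hypothetical and is not the fixture in tests/ or the protocol in engines/base.py. The class names and the `synthesize` signature are assumptions chosen for illustration.

```python
import wave
from pathlib import Path
from typing import Protocol


class SpeechEngine(Protocol):
    """Assumed minimal engine interface: text in, WAV path out."""

    def synthesize(self, text: str, output_path: Path) -> Path: ...


class FakeEngine:
    """Writes silence whose length scales with the text, so tests stay offline."""

    sample_rate = 24000

    def synthesize(self, text: str, output_path: Path) -> Path:
        # 240 frames at 24 kHz is 10 ms of silence per character.
        frames = max(len(text), 1) * 240
        with wave.open(str(output_path), "wb") as wav:
            wav.setnchannels(1)
            wav.setsampwidth(2)  # 16-bit PCM
            wav.setframerate(self.sample_rate)
            wav.writeframes(b"\x00\x00" * frames)
        return output_path
```

Scaling the duration with the text length lets a test assert that longer scripts produce longer audio without loading any model weights.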

License

Project source code is licensed under the MIT License. See LICENSE.

This project integrates with F5-TTS. F5-TTS source code is MIT-licensed, while upstream pretrained F5-TTS model weights are licensed separately under CC-BY-NC. This project license does not relicense third-party model weights, datasets, or voice assets.

Notes

This first milestone keeps four layers separate:

voice data -> cloning or fine-tuned model -> local API -> video workflow

Do not train a foundation model from scratch on the M1 for this project. Start with the approved reference clip, listen to the generated narration, then decide how much reviewed voice data is worth collecting for fine-tuning.

For long recordings, prefer clean narration:

  • WAV or other lossless source
  • one speaker
  • no background music
  • no reverb
  • consistent microphone distance

Keep the service single-user while the registry only represents your approved voices. Do not add arbitrary uploaded speaker references to the generation endpoint.
