Warning
Customain is in transition. As OpenAI and other commercial providers discontinue or reshape their hosted fine-tuning APIs, the project plans to move toward open-weight model workflows and providers such as Modal, Together AI, and Fireworks AI. Expect provider support and setup instructions to change as this migration lands.
Post-training experiment bench for managed and open-weight provider workflows.
Customain is for running post-training experiments, evaluating the resulting models with pluggable metrics, and selecting the best model for a task. It is no longer centered on learning one person's email style; the core project is a generic experimentation pipeline for provider-hosted and open-weight post-training workflows.
Generic JSONL data -> post-training sweeps across providers/models -> eval runs -> weighted model ranking
Customain focuses on the operational loop around post-training:
- Define generic SFT or DPO datasets.
- Sweep models, providers, methods, and hyperparameters.
- Launch provider-hosted or open-weight post-training jobs.
- Run baseline and post-trained models on the same test split.
- Evaluate outputs with pluggable task metrics.
- Rank models with configurable metric weights.
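As a rough illustration of the sweep step, the sketch below expands a few axes into one job spec per combination. The `n_epochs` axis, the `build_sweep` helper, and the job-spec shape are assumptions made for this example, not Customain's actual internals.

```python
from itertools import product

# Hypothetical sweep axes; names and values are illustrative only.
models = [
    {"provider": "openai", "model": "gpt-4.1-mini-2025-04-14"},
    {"provider": "openai", "model": "gpt-4.1-2025-04-14"},
]
methods = ["supervised", "dpo"]
epoch_options = [1, 3]  # assumed hyperparameter, not taken from Customain


def build_sweep(models, methods, epoch_options):
    """Return one job spec per (model, method, n_epochs) combination."""
    return [
        {**model, "method": method, "n_epochs": n_epochs}
        for model, method, n_epochs in product(models, methods, epoch_options)
    ]


jobs = build_sweep(models, methods, epoch_options)
print(len(jobs))  # 2 models * 2 methods * 2 epoch settings = 8
```

Each added axis multiplies the number of provider jobs, so small grids are a sensible starting point.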
The pipeline is currently provider-API first, with an upcoming shift toward open-weight post-training infrastructure. If you want local full training today, use a project built for training infrastructure such as torchtune, Axolotl, LLaMA-Factory, or Unsloth.
| Provider | Models | Methods | Status |
|---|---|---|---|
| OpenAI | GPT-4.1, 4.1-mini, 4.1-nano | SFT, DPO | ✅ Available |
| Together AI | Llama, Mixtral, Qwen + any HF model | -- | 🔜 Planned |
- Python 3.11+
- uv
- API keys for the post-training providers you plan to use
- Optional: Weights & Biases key for experiment tracking
git clone https://github.com/user/customain.git
cd customain
uv sync
Create .secrets/api_keys.json:
{
  "openai_api_key": "sk-...",
  "wandb_api_key": "optional",
  "together_api_key": "optional",
  "together_base_url": "https://api.together.xyz/v1",
  "fireworks_api_key": "optional",
  "fireworks_base_url": "https://api.fireworks.ai/inference/v1"
}
Only configure the providers you use.
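A minimal sketch of how a script might decide which providers are configured from the secrets file. It assumes the literal string "optional" is a placeholder for an unset key; the `configured_providers` and `load_secrets` helpers are hypothetical and not part of Customain.

```python
import json
from pathlib import Path

# Key prefixes taken from the example secrets file above.
PROVIDERS = ("openai", "together", "fireworks")


def configured_providers(secrets: dict) -> list[str]:
    """Return providers whose API key is present and not a placeholder."""
    active = []
    for name in PROVIDERS:
        key = secrets.get(f"{name}_api_key", "")
        # Assumption: "optional" marks a key the user has not filled in.
        if key and key != "optional":
            active.append(name)
    return active


def load_secrets(path: str) -> dict:
    """Read the secrets JSON file from disk."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


example = {"openai_api_key": "sk-...", "together_api_key": "optional"}
print(configured_providers(example))  # ['openai']
```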
Edit ft/training_configs.py:
baseline_models = [
    {"provider": "openai", "model": "gpt-4.1-2025-04-14"},
]
llms = [
    {"provider": "openai", "model": "gpt-4.1-mini-2025-04-14"},
    {"provider": "openai", "model": "gpt-4.1-2025-04-14"},
]
training_methods = ["supervised", "dpo"]
metric_weights = {
    "task_judge": 1.0,
}
Run the pipeline:
uv run python -m ft.run_pipeline --data-dir data/my_experiment
For a small smoke test, use mock files:
uv run python -m ft.run_pipeline \
    --data-dir data/my_experiment \
    --test-run
Skip completed stages when iterating:
uv run python -m ft.run_pipeline \
    --data-dir data/my_experiment \
    --skip 1 2
The pipeline writes:
| File | Purpose |
|---|---|
| ft/_experiments.json | Provider/model/method/job metadata |
| ft/_ft_models_eval_runs.json | Raw generations from baseline and post-trained models |
| ft/_evaluation_results.json | Per-datapoint and average metric scores |
| ft/_model_ranking.json | Weighted ranking used for model selection |
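One plausible reading of the weighted ranking is a weighted mean of each model's average metric scores. The sketch below assumes that formula and a simple in-memory score layout; the real ranking code and the JSON schema may differ, and the scores are made up.

```python
def rank_models(
    avg_scores: dict[str, dict[str, float]],
    metric_weights: dict[str, float],
) -> list[tuple[str, float]]:
    """Sort models by the weighted mean of their average metric scores."""
    total_weight = sum(metric_weights.values())
    if total_weight <= 0:
        raise ValueError("metric_weights must sum to a positive value")
    ranked = []
    for model, scores in avg_scores.items():
        # Metrics missing for a model count as 0.0 in this sketch.
        weighted = sum(w * scores.get(m, 0.0) for m, w in metric_weights.items())
        ranked.append((model, weighted / total_weight))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Made-up scores purely for illustration.
avg_scores = {
    "baseline": {"task_judge": 0.5},
    "candidate": {"task_judge": 0.75},
}
ranking = rank_models(avg_scores, {"task_judge": 1.0})
print(ranking[0][0])  # candidate
```

Dividing by the total weight keeps scores comparable when the weights do not sum to 1.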
Evaluation is pluggable. Drop a new evaluator into ft/evaluation/evaluators/; it will be auto-discovered if it subclasses BaseEvaluator.
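To make the plug-in idea concrete, here is a self-contained sketch of subclass-based discovery. The `BaseEvaluator` below is a stand-in defined only for this example, and the `name` attribute, `score` signature, and `discover` helper are assumptions; the real base class, and the way Customain imports modules from the evaluators folder, may look different.

```python
from abc import ABC, abstractmethod


class BaseEvaluator(ABC):
    """Stand-in base class; the real one in Customain may differ."""

    name: str = ""

    @abstractmethod
    def score(self, generated: str, expected: str) -> float:
        ...


class ExactMatch(BaseEvaluator):
    """Hypothetical evaluator: 1.0 when the output equals the reference."""

    name = "exact_match"

    def score(self, generated: str, expected: str) -> float:
        return 1.0 if generated.strip() == expected.strip() else 0.0


def discover(skip: set[str]) -> dict[str, BaseEvaluator]:
    """Instantiate every direct subclass whose name is not skipped."""
    return {
        cls.name: cls()
        for cls in BaseEvaluator.__subclasses__()
        if cls.name not in skip
    }


evaluators = discover(skip={"bleu"})
print(evaluators["exact_match"].score("42 ", "42"))  # 1.0
```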
The default direction is task-oriented model selection, not similarity scoring. The main generic evaluator is:
| Evaluator | What it measures |
|---|---|
| task_judge | LLM-as-judge score for task quality, instruction following, correctness, completeness, and clarity |
Legacy/specialized evaluators remain available but are skipped by default:
| Evaluator | Use when |
|---|---|
| bleu, meteor, semantic_similarity | You explicitly want reference similarity metrics |
| tone_judge | You explicitly care about style/register matching |
| authorship_classifier | You are running an authorship/style experiment with a trained classifier |
Configure default skips and the model-selection formula in ft/training_configs.py:
skip_evaluators = [
    "authorship_classifier",
    "bleu",
    "meteor",
    "semantic_similarity",
    "tone_judge",
]
metric_weights = {
    "task_judge": 1.0,
}
Evaluators can require any subset of:
- prompt
- expected
- generated
This keeps the evaluation layer independent from Gmail, email tone, or reference-similarity assumptions.
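The subset idea can be sketched as a small helper that hands an evaluator only the fields it declared and fails early when one is absent. The `select_inputs` function and the record layout are hypothetical, chosen for illustration rather than taken from the Customain codebase.

```python
# The three inputs an evaluator may ask for, as listed above.
FIELDS = {"prompt", "expected", "generated"}


def select_inputs(record: dict, required: set[str]) -> dict:
    """Return only the requested fields, or raise if one is unknown or missing."""
    unknown = required - FIELDS
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    missing = [f for f in sorted(required) if record.get(f) is None]
    if missing:
        raise KeyError(f"record is missing: {missing}")
    return {f: record[f] for f in sorted(required)}


# A record with no reference answer still works for a judge-style evaluator.
record = {"prompt": "Summarize the ticket.", "generated": "User cannot log in."}
print(select_inputs(record, {"prompt", "generated"}))
```

An evaluator that needs a reference would request `expected` as well and would be rejected for this record, which is one way to keep reference-free tasks from silently producing meaningless similarity scores.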
The old Gmail preprocessing pipeline can still build SFT/DPO data if you want to experiment on email-style tasks:
uv run python -m gmail_preprocessing_pipeline.run_pipeline --targets sft dpo
That path is now optional project history, not the center of Customain.
This project is licensed under the GNU Affero General Public License v3.0 (AGPLv3).