Synthetic evaluation data and tooling for dikw-core.
This repository is a data factory for retrieval evaluation. It contains curated synthetic datasets, scripts for generating and cleaning data, a MiniMax-backed LLM client with task-level retries, and a small web review UI.
Tracked in Git:
- datasets/: versioned evaluation datasets consumed by dikw-core.
- src/dikw_data/: shared Python library code for config, LLM calls, retries, task IDs, and audit persistence.
- scripts/: generation, repair, cleaning, validation, and dataset maintenance commands.
- web/: local FastAPI review UI.
- configs/: non-secret provider and retry configuration.
- tests/: unit tests for retry and JSON-repair behavior.
Not tracked:
- .env: local API keys.
- .venv/, .uv-cache/, .pytest_cache/, __pycache__/: local runtime state.
- generated/: intermediate LLM outputs, audit databases, quarantined data, and deprecated generated artifacts.
- reports/: local evaluation reports.
Install dependencies with uv:
uv sync
Create a local .env from the example:
Copy-Item .env.example .env
Then set:
ANTHROPIC_API_KEY=your_minimax_key
MiniMax is called through its Anthropic-compatible endpoint. The endpoint,
model, timeout, retry, and concurrency settings live in
configs/minimax.yml.
Current versioned datasets:
- synthetic-diverse-v1: small mixed-domain text retrieval dataset.
- synthetic-diverse-v2: expanded mixed-domain text retrieval dataset covering Chinese history, world history, science, medicine, law, finance, geography, literature, economics, and technology.
- synthetic-multimodal-datasets-v1: multimodal dataset with Markdown text, local PNG image assets, asset-level targets, single-image and multi-image chunk targets, and compatible doc-level query fields.
Dataset details and file formats are documented in
docs/dataset-format.md.
Validate a dataset:
uv run python scripts/validate_dataset.py datasets/synthetic-multimodal-datasets-v1
Run unit tests:
uv run pytest
Generate with the MiniMax-backed pipeline:
uv run python scripts/generate_factbook.py --dataset demo --topic "DIKW knowledge engine"
uv run python scripts/generate_corpus.py --dataset demo --resume
uv run python scripts/generate_candidates.py --dataset demo --resume
uv run python scripts/llm_review.py --dataset demo --resume
All LLM generation scripts support:
- --resume: skip successful tasks and continue unfinished work.
- --retry-failed: retry failed tasks.
- --max-attempts N: override configured retry attempts.
- --concurrency N: override configured concurrency.
- --dry-run: list tasks without calling the model.
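The shared flags above could be wired up roughly as follows. This is a minimal sketch, not the repository's actual parser: `build_common_parser` and `resolve_override` are hypothetical helper names, and the assumption is that an omitted `--max-attempts` or `--concurrency` falls back to the value from the config file.

```python
import argparse
from typing import Optional


def build_common_parser() -> argparse.ArgumentParser:
    """Build a parser carrying the flags shared by the generation scripts.

    Hypothetical helper; the real scripts may define these flags differently.
    """
    parser = argparse.ArgumentParser(description="LLM generation task runner (sketch)")
    parser.add_argument("--dataset", required=True, help="dataset name")
    parser.add_argument("--resume", action="store_true",
                        help="skip successful tasks and continue unfinished work")
    parser.add_argument("--retry-failed", action="store_true",
                        help="retry failed tasks")
    # None means "not given on the command line": use the configured value.
    parser.add_argument("--max-attempts", type=int, default=None,
                        help="override configured retry attempts")
    parser.add_argument("--concurrency", type=int, default=None,
                        help="override configured concurrency")
    parser.add_argument("--dry-run", action="store_true",
                        help="list tasks without calling the model")
    return parser


def resolve_override(cli_value: Optional[int], configured: int) -> int:
    """Prefer the command-line value when present, else the configured one."""
    return configured if cli_value is None else cli_value


if __name__ == "__main__":
    args = build_common_parser().parse_args(
        ["--dataset", "demo", "--resume", "--max-attempts", "5"]
    )
    # 3 and 4 stand in for values that would come from the config file.
    print(resolve_override(args.max_attempts, 3), resolve_override(args.concurrency, 4))
```

Using `None` as the default keeps "flag not given" distinguishable from any explicit value, so the config file stays the single source of defaults.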
Start the local review UI:
uv run uvicorn web.app:app --host 127.0.0.1 --port 8000
Then open:
http://127.0.0.1:8000
The review UI can preview corpus Markdown, render local Markdown images, inspect
LLM generation audit status, run LLM quality review, review generated query
candidates, persist approve/reject/rewrite decisions in
generated/<dataset>/review.sqlite, and export approved items into
datasets/<dataset>/queries.yaml.
The web UI has two separate audit views:
- LLM Generation Audit reads MiniMax call status from
generated/<dataset>/audit.sqlite.
- LLM Quality Review asks MiniMax to review corpus, queries, and target
metadata, then writes review batches to
generated/<dataset>/quality_review.sqlite.
Generation and maintenance workflows are documented in
docs/maintenance.md.
scripts/run_eval.py orchestrates retrieval/synth evaluation of the read-only
dikw-core engine over the datasets in this repo, following
docs/dikw-eval-plan.md. It hands each dataset to the
engine by absolute path, captures the NDJSON EvalReport under reports/, and
writes a gate-able summary.json.
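A "gate-able" summary can be consumed by a small threshold check such as the one below. This is only a sketch: the actual schema of the NDJSON report and of summary.json is not described here, so the flat `{metric: value}` layout, the metric names, and the helpers `parse_ndjson` and `check_gates` are all assumptions made for illustration.

```python
import json


def parse_ndjson(text: str) -> list[dict]:
    """Parse newline-delimited JSON, ignoring blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def check_gates(summary: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return one message per metric that is missing or below its minimum.

    Assumes a flat {metric_name: value} summary (hypothetical layout).
    """
    failures = []
    for metric, minimum in thresholds.items():
        value = summary.get(metric)
        if value is None:
            failures.append(f"{metric}: missing")
        elif value < minimum:
            failures.append(f"{metric}: {value} < {minimum}")
    return failures


if __name__ == "__main__":
    # Illustrative values only, not real evaluation results.
    records = parse_ndjson('{"metric": "hit_at_3", "value": 0.5}\n\n'
                           '{"metric": "mrr", "value": 0.25}\n')
    summary = {r["metric"]: r["value"] for r in records}
    print(check_gates(summary, {"hit_at_3": 0.75, "mrr": 0.25}))
```

A CI job could fail the build whenever the returned list is non-empty.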
One-time prerequisites:
- Install the engine editable, with the CJK extra for Chinese BM25:
uv pip install -e "../dikw-core[cjk]".
- Put provider keys in
.env.eval (gitignored): MINIMAX_API_KEY, GITEE_API_KEY.
Plan a run without spending anything (validates datasets, checks key names, prints the exact commands):
uv run python scripts/run_eval.py --dry-run
Run the default retrieval eval over every dataset (a one-shot server per dataset):
uv run python scripts/run_eval.py --retrieval all
The eval base provider config lives in
configs/eval-base.dikw.yml (MiniMax + Gitee
embeddings + sqlite); run_eval.py materialises it into the gitignored
bases/eval-base/ on first run. Reports and bases stay out of Git.
The shared client in src/dikw_data/llm_client.py
adds a task-level retry layer on top of the Anthropic-compatible SDK:
- Retries 408, 409, 429, 5xx, 529, connection errors, and read timeouts.
- Does not blindly retry authentication errors or schema-level bad requests.
- Uses exponential backoff with optional jitter.
- Repairs malformed JSON once with the same model.
- Persists task status in
generated/<dataset>/audit.sqlite.
- Uses stable task IDs for resume and retry workflows.
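The retry rules listed above can be illustrated with a short sketch. It is not the code in src/dikw_data/llm_client.py: the function names, the base delay, the cap, the "full jitter" strategy, and the hashing scheme for task IDs are all assumptions chosen for the example.

```python
import hashlib
import json
import random

# Status codes named in the list above; 529 already falls inside the 5xx range.
RETRYABLE_STATUS = {408, 409, 429}


def is_retryable(status: int) -> bool:
    """True for 408, 409, 429 and any 5xx; auth and bad-request errors are not retried."""
    return status in RETRYABLE_STATUS or 500 <= status <= 599


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  jitter: bool = True) -> float:
    """Exponential backoff for a 1-based attempt number, capped at `cap` seconds.

    With jitter enabled, a delay is drawn uniformly from [0, exponential delay]
    ("full jitter", an assumed strategy).
    """
    delay = min(cap, base * 2 ** (attempt - 1))
    return random.uniform(0.0, delay) if jitter else delay


def stable_task_id(dataset: str, stage: str, payload: dict) -> str:
    """Derive a deterministic ID from the task inputs (hypothetical scheme).

    Canonical JSON (sorted keys) makes the ID independent of dict ordering,
    so a resumed run maps the same task to the same audit record.
    """
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"),
                           ensure_ascii=False)
    digest = hashlib.sha256(f"{dataset}\n{stage}\n{canonical}".encode("utf-8"))
    return f"{stage}-{digest.hexdigest()[:16]}"


if __name__ == "__main__":
    print(is_retryable(529), is_retryable(401))
    print([backoff_delay(n, jitter=False) for n in range(1, 5)])
    print(stable_task_id("demo", "corpus", {"topic": "DIKW", "index": 1}))
```

Deterministic IDs are what make `--resume` and `--retry-failed` cheap: the runner can look a task up by ID and decide from its stored status whether to call the model again.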
The macro evaluation plan for the dikw-core engine (engineering pipeline,
environment deployment, eval dimensions, and dataset construction) is documented in
docs/dikw-eval-plan.md. It is the basis for
building dikw evaluation datasets.
The current dikw-core runner is doc-level. For multimodal datasets,
expect_any remains for compatibility smoke tests, but real multimodal quality
should be measured with asset/chunk-level metrics such as:
- asset_hit_at_3
- asset_hit_at_10
- asset_mrr
- chunk_hit_at_3
- chunk_hit_at_10
- chunk_mrr
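For reference, hit@k and MRR are usually computed as in the sketch below. It uses the standard textbook definitions over a ranked list of asset or chunk IDs; the exact definitions used for the metrics named above may differ, and the function names and sample IDs are illustrative.

```python
from typing import Iterable, Sequence


def hit_at_k(ranked: Sequence[str], relevant: set[str], k: int) -> float:
    """1.0 if any relevant ID appears in the top-k results, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0


def reciprocal_rank(ranked: Sequence[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant ID (1-based), or 0.0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def mean(values: Iterable[float]) -> float:
    """Arithmetic mean; 0.0 for an empty input."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0


if __name__ == "__main__":
    # Two hypothetical queries: (ranked asset IDs, relevant asset IDs).
    queries = [
        (["a1", "a7", "a3", "a9"], {"a3"}),
        (["a2", "a5", "a8", "a4"], {"a4", "a6"}),
    ]
    print("hit@3:", mean(hit_at_k(r, rel, 3) for r, rel in queries))
    print("mrr:", mean(reciprocal_rank(r, rel) for r, rel in queries))
```

The same functions apply at asset level or chunk level; only the IDs being ranked change.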
Before pushing publicly:
- Confirm .env is ignored and does not appear in committed history.
- Keep generated/ out of Git unless a specific artifact is intentionally promoted into datasets/.
- Run uv run pytest.
- Run scripts/validate_dataset.py for each dataset touched.
- Add a repository license before public reuse.