An AI-native knowledge base engine that turns your documents into Data → Information → Knowledge → Wisdom.
dikw-core is a client/server knowledge base engine. A long-lived dikw serve process owns the base — its directory tree, its index, and its database are self-managed by the server: it ingests sources, synthesizes a knowledge layer, indexes everything for hybrid search, and reconciles the on-disk files with the index. Clients (dikw client …, or any agent over HTTP) talk to it; they never reach into its storage directly. Knowledge is persisted as an open, portable Markdown tree you can move, diff, and version anywhere.
This is different from Obsidian: Obsidian is a local editor for a vault you maintain by hand, with no engine behind it. dikw-core is the engine — it authors, indexes, retrieves, and keeps the base consistent — and the Markdown it manages happens to open in any editor.
Inspired by Karpathy's LLM Wiki pattern and extended end-to-end across the full DIKW pyramid: where Karpathy's pattern stops at a compounding markdown knowledge base (the K layer), dikw-core adds a first-class Wisdom layer for human-authored principles, lessons, and patterns that apply beyond any single source. Its on-disk format shares that lineage with Google's Open Knowledge Format and aims to interoperate with it.
Status: alpha. Under active construction — APIs, on-disk formats, database schema, and CLI will change. See CHANGELOG.md.
flowchart LR
agent["Agent / dikw client"] -- "HTTP + NDJSON" --> engine
subgraph server["dikw serve — self-managed base"]
direction TB
engine["FastAPI engine"]
D["D · sources/<br/>(you curate)"] --> I["I · chunk · FTS · embed"]
I --> K["K · knowledge/<br/>(synth-authored)"]
W["W · wisdom/<br/>(hand-written)"] --> I
engine --- store[("index<br/>SQLite / Postgres")]
end
engine -.retrieve.-> agent
The server owns the base on disk and the index/DB behind it; agents drive it over HTTP and get back ranked chunks + page refs. dikw-core does not synthesize the final answer — retrieve returns evidence and the agent (Claude Code, ChatGPT, your own script) runs its own LLM over it. See GUIDE_FOR_AGENTS.md.
- A server-managed knowledge base — the dikw base — whose on-disk layout is a plain, portable Markdown tree. The server is the authority for what gets written, indexed, and reconciled; the files stay open and any Markdown editor can read them.
- Four explicit DIKW layers, each with its own operations:
- Data — raw sources you curate under
sources/. - Information — parsed, chunked, embedded, indexed (FTS5 + vectors).
- Knowledge — LLM-authored knowledge pages with
[[wikilinks]], filed under a configurablecategorytree (knowledge/<category>/<slug>.md). Engine-written, and your hand-edits are reconciled rather than clobbered. - Wisdom — hand-written principles / lessons / patterns under
wisdom/<author>/, indexed (chunked + embedded) so they surface inretrievealongside K-layer pages.
- Data — raw sources you curate under
- Client / server architecture. A long-lived
dikw serve(FastAPI + NDJSON) hosts the engine; thedikw client …Typer CLI talks to it over HTTP, streams progress events for long ops, and supports cancel / resume. - Pluggable LLM providers (API-first): Anthropic + OpenAI-compatible (covers OpenAI, Azure, Ollama, DeepSeek, Gemini-compat).
- Pluggable storage: SQLite + sqlite-vec (default), Postgres + pgvector (enterprise) — swap by config.
Requires Python 3.12+ and uv.
git clone https://github.com/OpenDIKW/dikw-core
cd dikw-core
uv sync
uv run dikw init my-base --description "my research base"
cd my-base
# Drop some markdown into sources/, then run any single command via
# `dikw client serve-and-run` — it spawns a local server, runs the
# inner command, and tears it down.
uv run dikw client serve-and-run -- ingest --no-embed
uv run dikw client serve-and-run -- retrieve "What does Karpathy mean by deterministic scoping?"For interactive sessions or long iterations, run dikw serve once and
keep using dikw client * against it:
uv run dikw serve --base . # in one terminal
# in another:
uv run dikw client status
uv run dikw client synth # K layer (needs ANTHROPIC_API_KEY or OpenAI-compat)
uv run dikw client retrieve "What does Karpathy mean by deterministic scoping?"Every HTTP-bound command is spelled out as dikw client <verb> — there are no
top-level short aliases, which keeps the local-vs-HTTP boundary unambiguous.
The block above is the from-source flow. To use dikw-core without a
checkout, install the published wheel — the dikw command lands on your PATH:
uv pip install 'dikw-core[postgres]' # or plain pip; drop [postgres] for SQLite-only
dikw init my-base && cd my-base
dikw serve --base .Optional extras: postgres (pgvector backend), cjk (jieba — the
default config selects it, so ingesting Chinese/CJK text errors without it),
otel (OpenTelemetry). Pin a version to match
your server — dikw client hard-fails on a version skew.
Full extras matrix in
docs/getting-started.md.
For container deployment, see examples/docker/
(Dockerfile + compose stack with pgvector/pgvector:0.8.2-pg18) and the
long-form docs/deployment-docker.md.
Local-only commands run in this process:
| command | does |
|---|---|
dikw version |
print the package version |
dikw init <path> |
scaffold a dikw base (sources / knowledge / wisdom / .dikw/ + dikw.yml) |
dikw serve --base <path> |
start the FastAPI + NDJSON server bound to one base |
Everything else lives under dikw client * and talks to a running server.
There are no top-level short aliases — spelling out the client prefix
keeps the local-vs-HTTP boundary unambiguous for both agents and humans:
| command | does |
|---|---|
dikw client status |
counts across DIKW layers |
dikw client info |
raw GET /v1/info passthrough — version, storage backend, auth posture |
dikw client health |
server self-description (base, version, storage, providers) — the first call an agent makes |
dikw client check |
ping the configured LLM + embedding endpoints to verify dikw.yml + keys |
dikw client import <path> |
pre-flight + import local md packages (md + referenced assets) into the server's sources/ |
dikw client ingest [--no-embed] |
parse + chunk + FTS-index + embed the server's sources/ tree |
dikw client retrieve "<q>" |
hybrid search returning ranked chunks + page refs (no LLM call); agent supplies its own synthesis |
dikw client synth [--all] |
LLM turns source docs into K-layer knowledge pages, filed under the configured category tree |
dikw client wisdom write |
write a W-layer page (slug + title + body + metadata) and index it |
dikw client delete <path> |
soft-delete a D/K/W document to <base>/trash/ and purge its rows |
dikw client lint [propose|proposals|apply] |
report broken wikilinks / orphan pages / duplicate titles / disk-index drift; propose + apply structured fixes |
dikw client pages {list,get,links,provenance} |
enumerate pages / read a page body + chunk anchors / walk the K-layer link graph / walk the K↔D provenance edge |
dikw client graph get |
fetch the whole base graph (nodes + edges + unresolved wikilinks) in one read |
dikw client assets get <id> --output <file> |
download a content-addressed asset by sha256 id |
dikw client eval [--dataset] |
run retrieval-quality evaluation against packaged or custom datasets |
dikw client tasks {list,status,events,wait,cancel} |
inspect running / past async tasks on the server |
dikw client serve-and-run -- <cmd> |
one-shot server + inner command + teardown (no long-lived dikw serve needed) |
The dikw auth {login,import,status,list,logout} subgroup is local —
it manages OAuth tokens in <base>/.dikw/auth.json without talking to a
server (used by the openai_codex provider; see docs/providers.md).
Configured via dikw.yml. llm names a wire protocol (which SDK to speak),
not a vendor — the actual vendor is whatever llm_base_url points at:
provider:
llm: anthropic_compat # or: openai_compat
llm_model: claude-sonnet-4-6
llm_base_url: null # set for any Anthropic-protocol-compatible endpoint
llm_api_key_env: ANTHROPIC_API_KEY # required: names the env var holding the LLM key
embedding: openai_compat
embedding_model: text-embedding-3-small
embedding_base_url: https://api.openai.com/v1
embedding_api_key_env: OPENAI_API_KEY # required: names the env var holding the embedding key
embedding_dim: 1536 # required: must match what the endpoint returns
embedding_normalize: true
embedding_distance: cosineanthropic_compat→ uses theanthropicasync SDK withcache_controlon the system prompt, so repeated synth calls hit the prompt cache. Setllm_base_urlto retarget the SDK at any Anthropic-protocol-compatible endpoint (e.g., MiniMax'shttps://api.minimaxi.com/anthropic).openai_compat→ uses theopenaiasync SDK against any base URL that speaks the OpenAI HTTP surface (Azure, Ollama, vLLM, DeepSeek, MiniMax, …).
Each leg names its own env var via llm_api_key_env / embedding_api_key_env,
so keys are vendor-canonical (ANTHROPIC_API_KEY, OPENAI_API_KEY,
DEEPSEEK_API_KEY, MINIMAX_API_KEY, GITEE_API_KEY, …) and multiple
same-protocol vendors coexist in one .env — each base picks which var it reads.
Verify connectivity before a real run — dikw client check pings each provider
with one tiny request and prints endpoint / latency / dim, exiting non-zero on
failure (scriptable in CI):
uv run dikw client check --llm-only # just LLM
uv run dikw client check --embed-only # just embedding
uv run dikw client check # bothThe full vendor cookbook (MiniMax, GLM, Gemini, DeepSeek, Gitee AI, Ollama, …),
a worked MiniMax LLM + Gitee AI embeddings config, and the production
gotchas around batch size, embedding dimensions, retry, and prompt caching live
in docs/providers.md.
Two backends ship, selected in dikw.yml:
storage:
backend: sqlite # sqlite | postgres
# --- sqlite (default): single-user local ---
path: .dikw/index.sqlite
# --- postgres (enterprise): multi-user, pgvector + tsvector ---
# backend: postgres
# dsn: postgresql://user:pw@host:5432/dikw
# schema: dikw
# pool_size: 10- SQLite +
sqlite-vec+ FTS5 — the default. No extras required. - Postgres +
pgvector— install viauv pip install dikw-core[postgres]. Requires thepg_trgmandvectorextensions (standard on thepgvector/pgvector:0.8.2-pg18Docker image). The adapter usestsvector+ GIN for FTS andvector(N)for embeddings; the vector dimension is set at first insert.
Engine code talks only to the Storage Protocol
(storage/base.py); each adapter
implements the same contract and is swappable by changing dikw.yml.
Markdown ships out of the box. A new format is one SourceBackend
subclass + a register() call away — see
domains/data/backends/markdown.py
for the reference impl.
| doc | what it covers |
|---|---|
docs/getting-started.md |
end-to-end walkthrough from a blank directory to a queryable base |
docs/architecture.md |
module map, layer contracts, the seams to extend |
docs/design.md |
approved design doc — the source of truth for intent |
docs/server.md |
server deployment, security posture, the wire contract |
docs/providers.md |
per-vendor config cookbook + production gotchas |
docs/deployment-docker.md |
container deployment (Dockerfile + compose) |
docs/observability.md |
OpenTelemetry traces / metrics / logs |
docs/releasing.md |
maintainer release flow (PyPI + GHCR + GitHub Release) |
docs/adr/ |
architecture decision records |
Issues and pull requests are welcome. Start with
CONTRIBUTING.md for the dev setup (uv sync --all-extras),
the local CI gate (uv run python tools/check.py), and the delivery workflow.
Please also read the Code of Conduct. To report a
security issue, see SECURITY.md.
MIT — see LICENSE.