Why
No federation engine today exposes a code repository as a SQL surface.
Sourcegraph is search, CodeQL is a heavy DSL, ripgrep is grep. dbfy
fits the niche perfectly because the pitch "every existing thing as
a SQL table" maps cleanly onto code: each function, type, parameter
and reference is a row, each file is a row, each diagnostic from the
language server is a row.
Self-dogfooding immediate: dbfy can query its own codebase in the
README showcase. "How many TODOs do we have, grouped by component?"
"Which public symbols have no doc comment?" "Which files have the
most LSP errors AND the most recent commits AND a related incident?"
— all answerable in one SELECT.
Architecture (decided)
This is NOT rows_file material. Code is far too structured. New
dedicated crate dbfy-provider-code (sibling of dbfy-provider-rest)
implementing the ProgrammaticTableProvider trait.
Three backend strategies:
| Backend |
Extracts |
Speed |
Depth |
tree_sitter |
AST: names, kinds, lines, package |
⚡ very fast |
shallow |
lsp |
types, refs, diagnostics, defs |
🐢 5-30s init |
deep |
hybrid (default) |
tree-sitter for fast columns, LSP for semantic ones |
🚀 best of both |
full |
Hybrid is the SOTA bet — same approach Sourcegraph uses internally.
Schema
type: code
root: ./
backend: hybrid # tree_sitter | lsp | hybrid
languages: [rust, python, typescript, go, java]
components:
# Logical bounded-context labels overlaid on path globs.
- { glob: "crates/dbfy-provider*/**", name: providers }
- { glob: "crates/dbfy-frontend-*/**", name: frontend }
- { glob: "bindings/jvm/**", name: jvm }
tables:
files:
columns: [path, language, component, package, lines, last_modified]
symbols:
# one row per declared fn/struct/class/const/impl/trait/interface
columns:
- { name: id, type: string } # stable hash(path:start_line:name)
- { name: path, type: string }
- { name: component, type: string }
- { name: package, type: string } # com.dbfy.auth | crate::auth | dbfy.auth | …
- { name: kind, type: string } # function | struct | class | trait | enum | impl | const
- { name: name, type: string }
- { name: visibility, type: string } # public | private | protected | crate
- { name: signature, type: string } # rich form via LSP: "pub fn verify(p: &str) -> Result<()>"
- { name: lines, type: int64 }
- { name: start_line, type: int64 }
- { name: doc, type: string } # docstring (when LSP returns one)
parameters:
# one row per parameter of a function/method
columns:
- { name: function_id, type: string } # FK -> symbols.id
- { name: position, type: int64 }
- { name: name, type: string }
- { name: type, type: string } # actual declared type via LSP: "&str", "Option<User>"
- { name: is_mutable, type: boolean }
types:
# one row per struct / enum / interface / trait / type alias
columns: [id, path, component, package, kind, name, visibility, fields_count, derived_traits, doc]
references:
# one row per use of a symbol — answers "who calls this?"
# Only LSP can populate this reliably (cross-file resolution).
columns: [from_path, from_symbol, to_symbol, line, column, kind]
# kind: call | type_use | impl_for | trait_use | mention
diagnostics:
# what the language server reports right now
columns: [path, line, severity, code, message, source]
# severity: error | warning | info | hint
comments:
columns: [path, component, line, kind, text]
# kind: TODO | FIXME | NOTE
Example queries this enables
-- Functions with > 5 parameters (cognitive-load smell)
SELECT s.path, s.name, count(p.position) AS arity
FROM repo.symbols s
JOIN repo.parameters p ON p.function_id = s.id
WHERE s.kind = 'function'
GROUP BY 1, 2 HAVING count(*) > 5
ORDER BY arity DESC;
-- Public symbols never called (dead-code candidates)
SELECT s.component, s.path, s.name
FROM repo.symbols s
LEFT JOIN repo.references r ON r.to_symbol = s.id
WHERE s.visibility = 'public' AND r.to_symbol IS NULL
AND s.path NOT LIKE '%/tests/%';
-- Public API without doc (compliance)
SELECT path, name FROM repo.symbols
WHERE visibility = 'public' AND doc = '';
-- The killer: hot files = warnings × commits × incidents
SELECT d.path,
count(distinct d.line) AS warnings,
count(distinct g.sha) AS commits_30d,
count(distinct i.id) AS incidents
FROM repo.diagnostics d
LEFT JOIN git.commits g ON g.path = d.path AND g.date > now() - INTERVAL '30 days'
LEFT JOIN incidents.events i ON i.touched_path = d.path
WHERE d.severity IN ('error','warning')
GROUP BY d.path
ORDER BY warnings + commits_30d * 2 + incidents * 5 DESC;
The last is the centrepiece: code-as-data × git history × incident DB
in a single SELECT. A risk-per-file map nobody can build today
without writing 200 lines of glue.
Pushdown
| Predicate |
Pushdown level |
WHERE language = 'rust' |
filter at filesystem walk (extension + LSP server selection) |
WHERE component = 'auth' |
YAML component-glob matching at walk time |
WHERE path LIKE 'src/%' |
glob filter at walk time |
WHERE kind = 'function' |
tree-sitter query specialisation per kind |
LIMIT N |
early-break of file walk |
Plus L3 zone-map sidecar (.dbfy_idx_code) keyed on
(path, mtime, lsp_version) so unchanged files are skipped on
re-scan, mirroring the rows-file indexing pattern.
LSP lifecycle (the non-trivial part)
dbfy-provider-code ─► LSP supervisor task
├─ rust-analyzer (running, indexed)
├─ pyright (warming up …)
├─ gopls (running, indexed)
└─ tsserver (running, indexed)
- Discovery: LSP binaries auto-detected on
$PATH; YAML can
override per language (lsp.rust: /custom/path/to/rust-analyzer).
- Warm-up: first query blocks until the server is fully indexed
(rust-analyzer can be ~30s on a big repo). Subsequent queries are
~ms.
- Caching:
(file, mtime, lsp_version) → cached result.
- Crash recovery: supervisor respawns crashed servers transparent
to the running query.
- Concurrency: each query batches
documentSymbol /
findReferences calls per file; the LSP server itself parallelises
internally.
VSCode integration (separate, optional)
A VSCode extension where you write *.dbfy.sql files and get inline
results via Code Lens, with click-through to the matching symbol. This
is its own ~3-day effort, queued after the provider is stable.
Effort estimate
| Step |
Effort |
dbfy-provider-code scaffold + tree-sitter backend (5 langs, 4 tables) |
2.5 d |
| LSP backend + supervisor + cache |
3 d |
| Hybrid mode (tree-sitter fast path, LSP enrichment) |
1 d |
| 5 language LSP integrations + binary discovery |
1.5 d |
references table (batched LSP findReferences) |
0.5 d |
| Showcase: codebase × git × incidents |
0.5 d |
| Provider v1 total |
~9 days |
| VSCode extension (separate milestone) |
+3 d |
Acceptance criteria for v1
- 5 languages (Rust, Python, TypeScript, Go, Java) extract correctly into
the schema above
- All five tables (
files, symbols, parameters, references,
diagnostics) populated for at least Rust + Python
- Pushdown works for
language, component, path LIKE, kind,
LIMIT
- LSP server crash mid-query produces a typed error, not a panic
- 4 integration tests against dbfy's own codebase
- Showcase example with the killer "hot files" query produces a
meaningful result on dbfy itself
Targeted milestone: v0.6.
Why
No federation engine today exposes a code repository as a SQL surface.
Sourcegraph is search, CodeQL is a heavy DSL, ripgrep is grep. dbfy
fits the niche perfectly because the pitch "every existing thing as
a SQL table" maps cleanly onto code: each function, type, parameter
and reference is a row, each file is a row, each diagnostic from the
language server is a row.
Self-dogfooding immediate: dbfy can query its own codebase in the
README showcase. "How many TODOs do we have, grouped by component?"
"Which public symbols have no doc comment?" "Which files have the
most LSP errors AND the most recent commits AND a related incident?"
— all answerable in one SELECT.
Architecture (decided)
This is NOT rows_file material. Code is far too structured. New
dedicated crate
dbfy-provider-code(sibling ofdbfy-provider-rest)implementing the
ProgrammaticTableProvidertrait.Three backend strategies:
tree_sitterlsphybrid(default)Hybrid is the SOTA bet — same approach Sourcegraph uses internally.
Schema
Example queries this enables
The last is the centrepiece: code-as-data × git history × incident DB
in a single SELECT. A risk-per-file map nobody can build today
without writing 200 lines of glue.
Pushdown
WHERE language = 'rust'WHERE component = 'auth'WHERE path LIKE 'src/%'WHERE kind = 'function'LIMIT NPlus L3 zone-map sidecar (
.dbfy_idx_code) keyed on(path, mtime, lsp_version)so unchanged files are skipped onre-scan, mirroring the rows-file indexing pattern.
LSP lifecycle (the non-trivial part)
$PATH; YAML canoverride per language (
lsp.rust: /custom/path/to/rust-analyzer).(rust-analyzer can be ~30s on a big repo). Subsequent queries are
~ms.
(file, mtime, lsp_version)→ cached result.to the running query.
documentSymbol/findReferencescalls per file; the LSP server itself parallelisesinternally.
VSCode integration (separate, optional)
A VSCode extension where you write
*.dbfy.sqlfiles and get inlineresults via Code Lens, with click-through to the matching symbol. This
is its own ~3-day effort, queued after the provider is stable.
Effort estimate
dbfy-provider-codescaffold + tree-sitter backend (5 langs, 4 tables)referencestable (batched LSP findReferences)Acceptance criteria for v1
the schema above
files,symbols,parameters,references,diagnostics) populated for at least Rust + Pythonlanguage,component,path LIKE,kind,LIMITmeaningful result on dbfy itself
Targeted milestone: v0.6.