Skip to content

Source: codebase as a SQL surface (LSP + tree-sitter) #15

@frhack

Description

@frhack

Why

No federation engine today exposes a code repository as a SQL surface.
Sourcegraph is search, CodeQL is a heavy DSL, ripgrep is grep. dbfy
fits the niche perfectly because the pitch "every existing thing as
a SQL table"
maps cleanly onto code: each function, type, parameter
and reference is a row, each file is a row, each diagnostic from the
language server is a row.

Self-dogfooding immediate: dbfy can query its own codebase in the
README showcase. "How many TODOs do we have, grouped by component?"
"Which public symbols have no doc comment?" "Which files have the
most LSP errors AND the most recent commits AND a related incident?"

— all answerable in one SELECT.

Architecture (decided)

This is NOT rows_file material. Code is far too structured. New
dedicated crate dbfy-provider-code (sibling of dbfy-provider-rest)
implementing the ProgrammaticTableProvider trait.

Three backend strategies:

Backend Extracts Speed Depth
tree_sitter AST: names, kinds, lines, package ⚡ very fast shallow
lsp types, refs, diagnostics, defs 🐢 5-30s init deep
hybrid (default) tree-sitter for fast columns, LSP for semantic ones 🚀 best of both full

Hybrid is the SOTA bet — same approach Sourcegraph uses internally.

Schema

type: code
root: ./
backend: hybrid              # tree_sitter | lsp | hybrid
languages: [rust, python, typescript, go, java]

components:
  # Logical bounded-context labels overlaid on path globs.
  - { glob: "crates/dbfy-provider*/**", name: providers }
  - { glob: "crates/dbfy-frontend-*/**", name: frontend }
  - { glob: "bindings/jvm/**",          name: jvm }

tables:

  files:
    columns: [path, language, component, package, lines, last_modified]

  symbols:
    # one row per declared fn/struct/class/const/impl/trait/interface
    columns:
      - { name: id,         type: string }      # stable hash(path:start_line:name)
      - { name: path,       type: string }
      - { name: component,  type: string }
      - { name: package,    type: string }      # com.dbfy.auth | crate::auth | dbfy.auth | …
      - { name: kind,       type: string }      # function | struct | class | trait | enum | impl | const
      - { name: name,       type: string }
      - { name: visibility, type: string }      # public | private | protected | crate
      - { name: signature,  type: string }      # rich form via LSP: "pub fn verify(p: &str) -> Result<()>"
      - { name: lines,      type: int64 }
      - { name: start_line, type: int64 }
      - { name: doc,        type: string }      # docstring (when LSP returns one)

  parameters:
    # one row per parameter of a function/method
    columns:
      - { name: function_id, type: string }     # FK -> symbols.id
      - { name: position,    type: int64 }
      - { name: name,        type: string }
      - { name: type,        type: string }     # actual declared type via LSP: "&str", "Option<User>"
      - { name: is_mutable,  type: boolean }

  types:
    # one row per struct / enum / interface / trait / type alias
    columns: [id, path, component, package, kind, name, visibility, fields_count, derived_traits, doc]

  references:
    # one row per use of a symbol — answers "who calls this?"
    # Only LSP can populate this reliably (cross-file resolution).
    columns: [from_path, from_symbol, to_symbol, line, column, kind]
            # kind: call | type_use | impl_for | trait_use | mention

  diagnostics:
    # what the language server reports right now
    columns: [path, line, severity, code, message, source]
            # severity: error | warning | info | hint

  comments:
    columns: [path, component, line, kind, text]
            # kind: TODO | FIXME | NOTE

Example queries this enables

-- Functions with > 5 parameters (cognitive-load smell)
SELECT s.path, s.name, count(p.position) AS arity
  FROM repo.symbols s
  JOIN repo.parameters p ON p.function_id = s.id
 WHERE s.kind = 'function'
 GROUP BY 1, 2 HAVING count(*) > 5
 ORDER BY arity DESC;

-- Public symbols never called (dead-code candidates)
SELECT s.component, s.path, s.name
  FROM repo.symbols s
  LEFT JOIN repo.references r ON r.to_symbol = s.id
 WHERE s.visibility = 'public' AND r.to_symbol IS NULL
   AND s.path NOT LIKE '%/tests/%';

-- Public API without doc (compliance)
SELECT path, name FROM repo.symbols
 WHERE visibility = 'public' AND doc = '';

-- The killer: hot files = warnings × commits × incidents
SELECT d.path,
       count(distinct d.line) AS warnings,
       count(distinct g.sha)  AS commits_30d,
       count(distinct i.id)   AS incidents
  FROM repo.diagnostics d
  LEFT JOIN git.commits g       ON g.path = d.path AND g.date > now() - INTERVAL '30 days'
  LEFT JOIN incidents.events i  ON i.touched_path = d.path
 WHERE d.severity IN ('error','warning')
 GROUP BY d.path
 ORDER BY warnings + commits_30d * 2 + incidents * 5 DESC;

The last is the centrepiece: code-as-data × git history × incident DB
in a single SELECT. A risk-per-file map nobody can build today
without writing 200 lines of glue.

Pushdown

Predicate Pushdown level
WHERE language = 'rust' filter at filesystem walk (extension + LSP server selection)
WHERE component = 'auth' YAML component-glob matching at walk time
WHERE path LIKE 'src/%' glob filter at walk time
WHERE kind = 'function' tree-sitter query specialisation per kind
LIMIT N early-break of file walk

Plus L3 zone-map sidecar (.dbfy_idx_code) keyed on
(path, mtime, lsp_version) so unchanged files are skipped on
re-scan, mirroring the rows-file indexing pattern.

LSP lifecycle (the non-trivial part)

dbfy-provider-code  ─► LSP supervisor task
                          ├─ rust-analyzer   (running, indexed)
                          ├─ pyright         (warming up …)
                          ├─ gopls           (running, indexed)
                          └─ tsserver        (running, indexed)
  • Discovery: LSP binaries auto-detected on $PATH; YAML can
    override per language (lsp.rust: /custom/path/to/rust-analyzer).
  • Warm-up: first query blocks until the server is fully indexed
    (rust-analyzer can be ~30s on a big repo). Subsequent queries are
    ~ms.
  • Caching: (file, mtime, lsp_version) → cached result.
  • Crash recovery: supervisor respawns crashed servers transparent
    to the running query.
  • Concurrency: each query batches documentSymbol /
    findReferences calls per file; the LSP server itself parallelises
    internally.

VSCode integration (separate, optional)

A VSCode extension where you write *.dbfy.sql files and get inline
results via Code Lens, with click-through to the matching symbol. This
is its own ~3-day effort, queued after the provider is stable.

Effort estimate

Step Effort
dbfy-provider-code scaffold + tree-sitter backend (5 langs, 4 tables) 2.5 d
LSP backend + supervisor + cache 3 d
Hybrid mode (tree-sitter fast path, LSP enrichment) 1 d
5 language LSP integrations + binary discovery 1.5 d
references table (batched LSP findReferences) 0.5 d
Showcase: codebase × git × incidents 0.5 d
Provider v1 total ~9 days
VSCode extension (separate milestone) +3 d

Acceptance criteria for v1

  • 5 languages (Rust, Python, TypeScript, Go, Java) extract correctly into
    the schema above
  • All five tables (files, symbols, parameters, references,
    diagnostics) populated for at least Rust + Python
  • Pushdown works for language, component, path LIKE, kind,
    LIMIT
  • LSP server crash mid-query produces a typed error, not a panic
  • 4 integration tests against dbfy's own codebase
  • Showcase example with the killer "hot files" query produces a
    meaningful result on dbfy itself

Targeted milestone: v0.6.

Metadata

Metadata

Assignees

No one assigned

    Labels

    source-requestRequest for a new source kind to be addedtier-1Wow factor — opens a new mark

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions