Source: codebase as a SQL surface (LSP + tree-sitter)

## Why

No federation engine today exposes a code repository as a SQL surface.
Sourcegraph is search, CodeQL is a heavy DSL, ripgrep is grep. dbfy
fits the niche perfectly because the pitch *"every existing thing as
a SQL table"* maps cleanly onto code: each function, type, parameter
and reference is a row, each file is a row, each diagnostic from the
language server is a row.

Self-dogfooding immediate: dbfy can query its own codebase in the
README showcase. *"How many TODOs do we have, grouped by component?"*
*"Which public symbols have no doc comment?"* *"Which files have the
most LSP errors AND the most recent commits AND a related incident?"*
— all answerable in one SELECT.

## Architecture (decided)

This is **NOT** rows_file material. Code is far too structured. New
dedicated crate `dbfy-provider-code` (sibling of `dbfy-provider-rest`)
implementing the `ProgrammaticTableProvider` trait.

Three backend strategies:

| Backend | Extracts | Speed | Depth |
|---|---|---|---|
| `tree_sitter` | AST: names, kinds, lines, package | ⚡ very fast | shallow |
| `lsp` | types, refs, diagnostics, defs | 🐢 5-30s init | deep |
| `hybrid` (default) | tree-sitter for fast columns, LSP for semantic ones | 🚀 best of both | full |

Hybrid is the SOTA bet — same approach Sourcegraph uses internally.

## Schema

```yaml
type: code
root: ./
backend: hybrid              # tree_sitter | lsp | hybrid
languages: [rust, python, typescript, go, java]

components:
  # Logical bounded-context labels overlaid on path globs.
  - { glob: "crates/dbfy-provider*/**", name: providers }
  - { glob: "crates/dbfy-frontend-*/**", name: frontend }
  - { glob: "bindings/jvm/**",          name: jvm }

tables:

  files:
    columns: [path, language, component, package, lines, last_modified]

  symbols:
    # one row per declared fn/struct/class/const/impl/trait/interface
    columns:
      - { name: id,         type: string }      # stable hash(path:start_line:name)
      - { name: path,       type: string }
      - { name: component,  type: string }
      - { name: package,    type: string }      # com.dbfy.auth | crate::auth | dbfy.auth | …
      - { name: kind,       type: string }      # function | struct | class | trait | enum | impl | const
      - { name: name,       type: string }
      - { name: visibility, type: string }      # public | private | protected | crate
      - { name: signature,  type: string }      # rich form via LSP: "pub fn verify(p: &str) -> Result<()>"
      - { name: lines,      type: int64 }
      - { name: start_line, type: int64 }
      - { name: doc,        type: string }      # docstring (when LSP returns one)

  parameters:
    # one row per parameter of a function/method
    columns:
      - { name: function_id, type: string }     # FK -> symbols.id
      - { name: position,    type: int64 }
      - { name: name,        type: string }
      - { name: type,        type: string }     # actual declared type via LSP: "&str", "Option<User>"
      - { name: is_mutable,  type: boolean }

  types:
    # one row per struct / enum / interface / trait / type alias
    columns: [id, path, component, package, kind, name, visibility, fields_count, derived_traits, doc]

  references:
    # one row per use of a symbol — answers "who calls this?"
    # Only LSP can populate this reliably (cross-file resolution).
    columns: [from_path, from_symbol, to_symbol, line, column, kind]
            # kind: call | type_use | impl_for | trait_use | mention

  diagnostics:
    # what the language server reports right now
    columns: [path, line, severity, code, message, source]
            # severity: error | warning | info | hint

  comments:
    columns: [path, component, line, kind, text]
            # kind: TODO | FIXME | NOTE
```

## Example queries this enables

```sql
-- Functions with > 5 parameters (cognitive-load smell)
SELECT s.path, s.name, count(p.position) AS arity
  FROM repo.symbols s
  JOIN repo.parameters p ON p.function_id = s.id
 WHERE s.kind = 'function'
 GROUP BY 1, 2 HAVING count(*) > 5
 ORDER BY arity DESC;

-- Public symbols never called (dead-code candidates)
SELECT s.component, s.path, s.name
  FROM repo.symbols s
  LEFT JOIN repo.references r ON r.to_symbol = s.id
 WHERE s.visibility = 'public' AND r.to_symbol IS NULL
   AND s.path NOT LIKE '%/tests/%';

-- Public API without doc (compliance)
SELECT path, name FROM repo.symbols
 WHERE visibility = 'public' AND doc = '';

-- The killer: hot files = warnings × commits × incidents
SELECT d.path,
       count(distinct d.line) AS warnings,
       count(distinct g.sha)  AS commits_30d,
       count(distinct i.id)   AS incidents
  FROM repo.diagnostics d
  LEFT JOIN git.commits g       ON g.path = d.path AND g.date > now() - INTERVAL '30 days'
  LEFT JOIN incidents.events i  ON i.touched_path = d.path
 WHERE d.severity IN ('error','warning')
 GROUP BY d.path
 ORDER BY warnings + commits_30d * 2 + incidents * 5 DESC;
```

The last is the centrepiece: code-as-data × git history × incident DB
in a single SELECT. A risk-per-file map nobody can build today
without writing 200 lines of glue.

## Pushdown

| Predicate | Pushdown level |
|---|---|
| `WHERE language = 'rust'` | filter at filesystem walk (extension + LSP server selection) |
| `WHERE component = 'auth'` | YAML component-glob matching at walk time |
| `WHERE path LIKE 'src/%'` | glob filter at walk time |
| `WHERE kind = 'function'` | tree-sitter query specialisation per kind |
| `LIMIT N` | early-break of file walk |

Plus L3 zone-map sidecar (`.dbfy_idx_code`) keyed on
`(path, mtime, lsp_version)` so unchanged files are skipped on
re-scan, mirroring the rows-file indexing pattern.

## LSP lifecycle (the non-trivial part)

```
dbfy-provider-code  ─► LSP supervisor task
                          ├─ rust-analyzer   (running, indexed)
                          ├─ pyright         (warming up …)
                          ├─ gopls           (running, indexed)
                          └─ tsserver        (running, indexed)
```

- **Discovery**: LSP binaries auto-detected on `$PATH`; YAML can
  override per language (`lsp.rust: /custom/path/to/rust-analyzer`).
- **Warm-up**: first query blocks until the server is fully indexed
  (rust-analyzer can be ~30s on a big repo). Subsequent queries are
  ~ms.
- **Caching**: `(file, mtime, lsp_version)` → cached result.
- **Crash recovery**: supervisor respawns crashed servers transparent
  to the running query.
- **Concurrency**: each query batches `documentSymbol` /
  `findReferences` calls per file; the LSP server itself parallelises
  internally.

## VSCode integration (separate, optional)

A VSCode extension where you write `*.dbfy.sql` files and get inline
results via Code Lens, with click-through to the matching symbol. This
is its own ~3-day effort, queued **after** the provider is stable.

## Effort estimate

| Step | Effort |
|---|---|
| `dbfy-provider-code` scaffold + tree-sitter backend (5 langs, 4 tables) | 2.5 d |
| LSP backend + supervisor + cache | 3 d |
| Hybrid mode (tree-sitter fast path, LSP enrichment) | 1 d |
| 5 language LSP integrations + binary discovery | 1.5 d |
| `references` table (batched LSP findReferences) | 0.5 d |
| Showcase: codebase × git × incidents | 0.5 d |
| **Provider v1 total** | **~9 days** |
| VSCode extension (separate milestone) | +3 d |

## Acceptance criteria for v1

- 5 languages (Rust, Python, TypeScript, Go, Java) extract correctly into
  the schema above
- All five tables (`files`, `symbols`, `parameters`, `references`,
  `diagnostics`) populated for at least Rust + Python
- Pushdown works for `language`, `component`, `path LIKE`, `kind`,
  `LIMIT`
- LSP server crash mid-query produces a typed error, not a panic
- 4 integration tests against dbfy's own codebase
- Showcase example with the killer "hot files" query produces a
  meaningful result on dbfy itself

Targeted milestone: **v0.6**.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Source: codebase as a SQL surface (LSP + tree-sitter) #15

Why

Architecture (decided)

Schema

Example queries this enables

Pushdown

LSP lifecycle (the non-trivial part)

VSCode integration (separate, optional)

Effort estimate

Acceptance criteria for v1

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Backend	Extracts	Speed	Depth
`tree_sitter`	AST: names, kinds, lines, package	⚡ very fast	shallow
`lsp`	types, refs, diagnostics, defs	🐢 5-30s init	deep
`hybrid` (default)	tree-sitter for fast columns, LSP for semantic ones	🚀 best of both	full

Predicate	Pushdown level
`WHERE language = 'rust'`	filter at filesystem walk (extension + LSP server selection)
`WHERE component = 'auth'`	YAML component-glob matching at walk time
`WHERE path LIKE 'src/%'`	glob filter at walk time
`WHERE kind = 'function'`	tree-sitter query specialisation per kind
`LIMIT N`	early-break of file walk

Step	Effort
`dbfy-provider-code` scaffold + tree-sitter backend (5 langs, 4 tables)	2.5 d
LSP backend + supervisor + cache	3 d
Hybrid mode (tree-sitter fast path, LSP enrichment)	1 d
5 language LSP integrations + binary discovery	1.5 d
`references` table (batched LSP findReferences)	0.5 d
Showcase: codebase × git × incidents	0.5 d
Provider v1 total	~9 days
VSCode extension (separate milestone)	+3 d

Source: codebase as a SQL surface (LSP + tree-sitter) #15

Description

Why

Architecture (decided)

Schema

Example queries this enables

Pushdown

LSP lifecycle (the non-trivial part)

VSCode integration (separate, optional)

Effort estimate

Acceptance criteria for v1

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions