Skip to content

Resync Rust port to graphify-py v0.8.27 (4b17f19)#16

Merged
rblaine95 merged 34 commits into
masterfrom
resync/graphify-py/4b17f19
Jun 2, 2026
Merged

Resync Rust port to graphify-py v0.8.27 (4b17f19)#16
rblaine95 merged 34 commits into
masterfrom
resync/graphify-py/4b17f19

Conversation

@rblaine95
Copy link
Copy Markdown
Member

@rblaine95 rblaine95 commented Jun 1, 2026

Resyncs the Rust workspace from graphify-py v0.8.25 → v0.8.27 (4b17f19),
porting 17 upstream commits and bumping the workspace version to 0.8.27.

Ported

  • detect — anchored .graphifyignore/.gitignore patterns match the
    anchor-relative path directly, no subtree leak (#1087); deterministic
    file-list + per-bucket sort.
  • export — obsidian/canvas filenames capped to 200 UTF-8 bytes with an
    8-char SHA-1 suffix on truncation (#1094); ASCII backup messages.
  • analyze — new find_import_cycles (file-level cycle detection, #961).
  • report## Import Cycles section.
  • extract{parent_dir}_{stem} file-node-id spec (#1033) + root-level
    symbol prefix relativisation (#1096); pnpm-workspace '.' (#1083); TS
    interface extends via extends_type_clause (#1095). extract() now
    honours cache_root as the id-relativisation root.
  • llm — custom OpenAI-compatible provider registry via providers.json
    (load + detect_backend tail + extract/call_llm routing, #1084);
    batched community labelling (#1097).
  • cliprovider add|list|show|remove, a label command, and
    --no-label/--backend on cluster-only; console em-dash/arrow → ASCII
    (#992).

Divergences from graphify-py

  • Rust extract is a one-shot pipeline (already writes GRAPH_REPORT.md and a
    placeholder labels file), so the Python "next: run cluster-only" hint is
    omitted — run graphify label to LLM-name communities.
  • GRAPHIFY_DEBUG traceback hook has no Rust equivalent (extractors return
    FileResult.error strings).
  • The provider command refuses to overwrite a malformed providers.json
    (graphify-py clobbers it); a deliberate, safer choice.

Verification

  • cargo clippy --all-targets --all-features --workspace — clean
  • cargo nextest run --workspace1651 passed, 2 skipped (net_)
  • hk check — clean
  • 80+ new parity tests across the touched crates

CodeRabbit

Four review rounds; substantive findings (anchored-include semantics,
#[serial] on env tests, malformed-registry clobber, max_tokens overflow,
atomic registry write, deterministic-sort key, EnvGuard dedup) all fixed or
documented as parity disputes in the commit history. The final pass hit the
org's CodeRabbit usage limit; remaining items were the documented disputes
(global-wins provider load order — matches graphify-py; per-chunk provider
load — custom-backends only; sha1 0.11 vs digest — verified no duplication).

Summary by CodeRabbit

  • New Features

    • Custom OpenAI-compatible provider registry and CLI (add/list/show/remove) with routing
    • LLM-driven community labeling CLI/API with backend selection and placeholder fallbacks
    • Report adds an “Import Cycles” section
    • Automatic filename truncation to avoid ENAMETOOLONG
  • Bug Fixes

    • Deterministic file discovery and stable output ordering
    • Anchored ignore/include matching fixed with correct subtree semantics
    • Improved TypeScript/JS inheritance and Swift extension merging
  • Documentation

    • Expanded README/USAGE for providers, backend selection, labeling, and detection rules
  • Tests

    • New parity/integration tests for providers, labeling, filename capping, cycles, ignore/include, and extraction
  • Chores

    • Workspace version bumped (0.8.25 → 0.8.27)

rblaine95 added 7 commits June 1, 2026 11:42
Port the v0.8.25 -> v0.8.27 changes from `graphify-py` and bump the
workspace version to 0.8.27.

Ports:

- detect: anchored `.graphifyignore`/`.gitignore` patterns now match the
  anchor-relative path directly (no subtree/basename fallback), so
  `/inbox/` no longer leaks into `src/inbox/` (#1087). Sort the walked
  file list and each file-type bucket for deterministic output.
- export: cap obsidian/canvas note filenames to 200 UTF-8 bytes with an
  8-char SHA-1 suffix on truncation, preventing ENAMETOOLONG on long or
  multibyte labels (#1094).
- analyze: add `find_import_cycles`, collapsing the graph to file-level
  `imports_from`/`re_exports` edges and reporting simple cycles (#961).
- report: render an `## Import Cycles` section from `find_import_cycles`.
- extract: file-level node IDs follow the `{parent_dir}_{stem}` spec so
  AST and semantic nodes coincide (#1033); root-level symbol prefixes are
  relativised the same way (#1096); `pnpm-workspace.yaml` `'.'` resolves
  to the root without crashing (#1083); TypeScript `interface A extends B`
  now emits an `inherits` edge via the `extends_type_clause` branch
  (#1095). `extract()` honours `cache_root` as the id-relativisation root.
- llm: load custom OpenAI-compatible providers from `providers.json`,
  route `extract`/`call_llm` through them, and consult them in
  `detect_backend` after the built-ins (#1084). Add batched community
  labelling (`label_communities` / `generate_community_labels`, #1097).
- cli: `provider add|list|show|remove`, a `label` command, and
  `--no-label`/`--backend` on `cluster-only`; auto-name communities when
  no labels file exists. Replace console em-dash/arrow glyphs with ASCII
  for non-UTF-8 terminals (#992).

Divergences from graphify-py:

- The Rust `extract` is a one-shot pipeline that already writes
  `GRAPH_REPORT.md` and a placeholder labels file, so the Python
  "next: run cluster-only" hint is omitted; run `graphify label` to
  LLM-name communities.
- The `GRAPHIFY_DEBUG` traceback hook has no Rust equivalent (extractors
  return `FileResult.error` strings, not a stack trace).

Glory to the Omnissiah
Close the gaps the `cargo llvm-cov` pass surfaced in the v0.8.27
resync's new code:

- `extract_files_direct` with a custom provider now has a mockito test
  exercising `extract_custom` (llm/extract.rs 45% -> 72% line).
- Community labelling gains an end-to-end test through the public
  `label_communities` / `generate_community_labels` wrappers (via a
  mock custom provider), a `quiet=false` degradation path, and a
  prose-wrapped-JSON parse case (llm/labeling.rs 81% -> 90% line).

Test-only; no production code changes.

By the will of the Machine God
Apply the two valid findings from the CodeRabbit review and document
the two disputed ones.

Fixed:

- `custom_providers_path` falls back to the local path when `$HOME` is
  unset, and `load_custom_providers_from` reads identical local/global
  paths only once, so an unset `$HOME` no longer reads the same file
  twice (CodeRabbit, providers.rs).
- The `provider` command now aborts when `providers.json` exists but is
  unreadable or not a JSON object, instead of silently overwriting it on
  `add`/`remove`. graphify-py clobbers a malformed file via a bare
  `except`; refusing to is a deliberate, safer divergence that prevents
  data loss (CodeRabbit, src/cli/provider.rs).

Disputed (left as-is):

- Provider load order `[local, global]` (global wins): this matches
  graphify-py `_load_custom_providers`, which iterates `(local, global)`
  and assigns `providers[name] = cfg`. The suggested "local wins" would
  diverge from the reference.
- `load_custom_providers` per call in `extract_files_direct` / `call_llm`:
  only reached for custom (non-built-in) backends, since
  `is_builtin_backend` short-circuits built-ins without touching the
  filesystem. The file is tiny and correctness is unaffected; threading a
  process-lifetime cache through the parallel corpus path is not worth the
  added shared state.

Tests cover the new branches (identical-path dedup, malformed-registry
abort).

Ave Deus Mechanicus
Fixed:

- Rename the `ts_inheritance` test helper `has_inherits` to `has_edge`
  since it takes an arbitrary relation argument.
- Use `sort_by_cached_key` for the deterministic file sort in
  `detect::walk` so each path is stringified once, not per comparison.
- Annotate the env-mutating provider/labeling tests with `#[serial]`
  so they are safe under `cargo test` (thread-per-test), not only under
  nextest (process-per-test).
- Make the community-labelling `max_tokens` arithmetic fully saturating.

Disputed (left as-is, parity with graphify-py):

- `is_included` anchored matching uses a direct fnmatch with no ancestor
  walk. This is exactly the #1087 change in graphify-py `_is_included`;
  adding an ancestor walk would revert that fix for the include path.
- `cluster-only` regenerates `Community N` placeholders when an existing
  `.graphify_labels.json` fails to parse. graphify-py does the same, and
  the labels file is a regenerable cache, not user-authored config (unlike
  `providers.json`, whose clobber we did guard against).
- The `ts_inheritance` per-test boilerplate is left inline for readability
  (trivial).

By the will of the Machine God
Fixed (all trivial):

- Atomic registry write in `provider`: write a sibling temp file then
  rename, so a crash mid-write cannot truncate `providers.json`.
- Share one `EnvGuard` across the `graphify-llm` integration tests via
  `tests/common/mod.rs` instead of duplicating it.
- Add a clarifying comment on the `find_import_cycles` edge-orientation
  fallback.

Disputed (left as-is):

- `sha1 = "0.11"` does NOT duplicate `digest`: `cargo tree` shows both
  `sha1 0.11` and `sha2 0.11` resolve to `digest 0.11.3`. Downgrading to
  `sha1 0.10` would pull in `digest 0.10` and create the duplication
  CodeRabbit warned about, so the suggestion is inverted.
- The per-call `load_custom_providers` in `extract_files_direct` /
  `call_llm` only runs for custom backends (built-ins short-circuit via
  `is_builtin_backend` with no filesystem access); threading a cache
  through the parallel corpus path is not worth the shared state.

Glory to the Omnissiah
Replace the post-empty-check index `c.cycle[0]` with a `let else` on
`c.cycle.first()` so the non-empty invariant is explicit and there is no
direct indexing (CodeRabbit, trivial).

Ave Deus Mechanicus
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 1, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 2637718c-01f7-4255-899f-5733c2bbad83

📥 Commits

Reviewing files that changed from the base of the PR and between 26322af and 8089979.

📒 Files selected for processing (4)
  • crates/graphify-detect/src/walk.rs
  • crates/graphify-detect/tests/parity_include.rs
  • src/cli/provider.rs
  • tests/cli_commands.rs

📝 Walkthrough

Walkthrough

Adds file-level import-cycle detection and report rendering; introduces custom OpenAI-compatible LLM providers and batched community labeling (CLI + registry); fixes robust path relativisation and file-node IDs; caps export filenames to avoid ENAMETOOLONG; makes detect deterministic; and adds tests and docs.

Changes

Main Functional Changes

Layer / File(s) Summary
Import cycle detection & tests
crates/graphify-analyze/src/cycles.rs, crates/graphify-analyze/src/lib.rs, crates/graphify-analyze/tests/parity.rs
Adds ImportCycle and find_import_cycles(_bounded), builds file adjacency from import/re-export edges, enumerates elementary cycles with DFS and deterministic ordering, and adds parity tests for cycles and bounded behavior.
Report integration
crates/graphify-report/src/sections/cycles.rs, crates/graphify-report/src/sections/mod.rs, crates/graphify-report/src/render.rs, crates/graphify-report/tests/parity.rs, crates/graphify-report/Cargo.toml
Adds “Import Cycles” section renderer, integrates it into report generation, and adds tests and dependency.
Custom provider registry & detection
crates/graphify-llm/src/providers.rs, crates/graphify-llm/src/backends.rs, crates/graphify-llm/Cargo.toml
Defines CustomProvider, loads local/global providers.json with precedence and safe defaults, adds detect_backend_with to consult built-ins/ollama then custom providers, and adds regex workspace dep.
Custom provider routing & labeling
crates/graphify-llm/src/call.rs, crates/graphify-llm/src/extract.rs, crates/graphify-llm/src/labeling.rs, crates/graphify-llm/tests/*
Routes call_llm/direct extraction through OpenAI-compatible custom providers, implements strict/tolerant community labeling APIs (label_communities*, generate_community_labels*), JSON parsing/cleanup, placeholder fallback, and tests.
Provider CLI and label wiring
src/cli/provider.rs, src/cli/mod.rs, src/cli/args.rs, src/cli/dispatch.rs, src/cli/cluster_only.rs
Implements graphify provider add/list/show/remove; adds label subcommand and --no-label/--backend flags; threads LabelOptions into cmd_cluster_only to preserve/extract/generate labels; updates dispatch.
Extraction robustness & file-node IDs
crates/graphify-extract/src/extractors/multi.rs, crates/graphify-extract/src/ids.rs, crates/graphify-extract/src/lib.rs, crates/graphify-extract/src/postprocess.rs, crates/graphify-extract/src/workspace.rs, crates/graphify-extract/tests/*
Adds relativise_under_root with canonicalize fallback; two-stage ID remap (prefix and symbol rewriting); file_node_id API and re-export; per-file Swift extension name collection/remapping; treats packages: "." as workspace root; and adds/extends extraction tests (file_id, pnpm workspace, TS inheritance).
Deterministic detect & ignore/include semantics
crates/graphify-detect/src/walk.rs, crates/graphify-detect/src/ignore.rs, crates/graphify-detect/tests/*
Switches detection buckets to IndexMap, sorts file lists for deterministic output, changes anchored ignore/include matching to use anchor-relative paths (with descendant coverage), and adds parity tests.
Export filename capping & tests
crates/graphify-export/src/util.rs, crates/graphify-export/src/canvas.rs, crates/graphify-export/src/obsidian.rs, crates/graphify-export/Cargo.toml, crates/graphify-export/tests/filename_cap.rs
Adds cap_filename (truncate to 200 UTF-8 bytes, append 8-hex SHA-1 suffix when truncated), integrates into safe-name paths for Obsidian/Canvas, adds sha1 dep, and parity tests for ASCII/CJK, collisions, wikilinks, and Canvas references.
CLI/text punctuation & misc
src/cli/clone.rs, src/cli/extract.rs, src/cli/global.rs, crates/graphify-prs/src/dashboard.rs, crates/graphify-export/src/json.rs
Replaces Unicode arrows/em-dashes with ASCII ->/hyphen in several messages; adjusts incremental manifest reconstruction to use IndexMap where relevant.
Docs, version & submodule
README.md, USAGE.md, Cargo.toml, graphify-py
Documents labeling behavior and custom providers in README/USAGE, bumps workspace version to 0.8.27, and updates graphify-py submodule commit.
Tests & utilities
crates/graphify-llm/tests/common/mod.rs, tests/cli_commands.rs, various new parity tests
Adds EnvGuard RAII for env mutation in tests, many unit/integration tests covering provider registry, labeling behaviors, include/ignore parity, cluster-only label paths, extraction, and export filename caps.

Estimated code review effort:
🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs:

"I nibble through diffs and bounce with delight,
Cycles, labels, short names — all trimmed just right.
Providers and tests in a neatly packed hop,
Graphs hum with new signals — onward I hop!"

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch resync/graphify-py/4b17f19

coderabbitai[bot]

This comment was marked as resolved.

rblaine95 added 5 commits June 1, 2026 14:38
Applies the still-valid findings from CodeRabbit's review and documents
the verified false positives.

Fixed:

- `find_import_cycles`: visit start files and neighbours in lexicographic
  order so the `cap` truncation collects a deterministic set by graph
  content, not edge-insertion order.
- `find_import_cycles_bounded`: return early for `max_cycle_length == 0`
  so self-loops (length 1) cannot leak past a zero bound, matching
  Python's `len(cycle) <= max_cycle_length` filter.
- Provider registry loader: skip records missing or blanking any of
  `base_url` / `default_model` / `env_key`, mirroring the `provider add`
  contract that already requires all three; a half-formed provider can
  never authenticate or address an endpoint.
- `relativise_under_root` and `EnvGuard::new` gain `#[must_use]`.
- `file_node_id_spec.rs` tests now return `Result` and propagate with
  `?` instead of a blanket `expect_used` allow.
- `cli_commands.rs`: pass temp paths via `.arg(path)` rather than
  `to_str().unwrap()`.

Added parity coverage:

- `re_exports`-only cycle detection (treated identically to
  `imports_from`).
- Zero `max_cycle_length` returns no cycles.
- Provider records missing a required field are skipped.

Disputed (verified false positives, left unchanged):

- cluster-only `--no-label` precedence and malformed-labels overwrite:
  both match `graphify-py/__main__.py:2417-2448` exactly (existing file
  wins over `--no-label`; a parse failure falls back to placeholders and
  the file is rewritten unconditionally). The labels file is a
  regenerable cache, unlike the user-authored `providers.json`.
- `prefix_remap` key normalisation: keys and the `n.source_file` lookup
  are the identical absolute-path string at that point (relativisation
  runs later), so switching to `file_node_id` keys would only add cross
  -file collision risk.
- pnpm `.`-package edge assertion: the upstream test asserts only crash
  safety (no `IndexError`); there is no `main` entry to resolve `my-app`
  to deterministically.

Glory to the Omnissiah
Fixed:

- Custom providers now honour a `max_completion_tokens` field from
  `providers.json` on the extraction path (default 8192), mirroring
  Python's `_resolve_max_tokens(cfg.get("max_completion_tokens", 8192))`
  at `llm.py:720`. Previously the Rust port hardcoded 8192, so a
  hand-tuned custom provider budget was silently ignored.
- `build_file_graph`: drop the dead `&& v_file != src_file` guard in the
  orientation fallback — the preceding arm already rules out
  `v_file == src_file`.
- `EnvGuard` gains a `Default` impl alongside `new`.

Added parity coverage for `max_completion_tokens` parse + default.

Disputed (verified false positives, left unchanged):

- `sha1` / `sha2` digest-major split: `cargo tree` shows both crates at
  `0.11.0` resolving to a single `digest v0.11.3`; there is no split and
  `util.rs`/`json.rs` each use their own crate's identical `Digest`.
- Caching `load_custom_providers()` in a process-global `LazyLock`: would
  capture `$HOME` at first access and make the `#[serial]` env-mutating
  tests order-dependent under a shared-process runner. The redundant read
  is bounded to custom (non-built-in) backends and keeps a freshly written
  `providers.json` visible within a long-running process.

By the will of the Machine God
Fixed:

- `DetectResult.files` is now an `IndexMap` instead of a `HashMap`. The
  buckets are built in the fixed order `code, document, paper, image,
  video` (matching Python's insertion-ordered dict), but `HashMap`
  discarded that order. The order flows through `collect_extract_files`
  into `extract()` and thus into `graph.json` node order (which `to_json`
  does not re-sort) and the manifest, so the previous type made those
  outputs nondeterministic run-to-run and divergent from Python. The
  reconstruction path in `detect_result_from_incremental` and
  `persist_manifest` are updated to match; the latter now passes the map
  straight through (no clone-collect).
- `anchored_file_not_matched_at_depth` gains a positive assertion that
  `/build` still matches `build` at the repo root, alongside the existing
  negative `src/build` case.

Disputed (verified false positives, left unchanged):

- Submodule pointer bump to `4b17f19`: intentional — it is the subject of
  this resync, documented in the first commit and the PR body.
- Caching `load_custom_providers()` in a process-global static: re-raised
  from the prior round; same rationale (captures `$HOME` at first access,
  makes `#[serial]` env-mutating tests order-dependent under a shared
  -process runner, redundant read bounded to custom backends only).
- Extracting a shared setup helper across the `ts_inheritance` tests: a
  stylistic test-DRY nitpick, not a defect; the explicit per-test setup
  keeps each case readable and isolated.

Ave Deus Mechanicus
Fixed:

- Import-cycles report renderer no longer clones the cycle vector to
  close the path; it joins the files and appends the start file inline.

Disputed (verified false positives, left unchanged):

- `--temperature` flag on `provider add`: Python's `provider add` hardcodes
  `"temperature": 0` (`__main__.py:1877`) and exposes no such flag, so the
  Rust port already matches; adding the flag would diverge from the
  reference CLI surface. A custom `temperature` is still honoured when
  present in a hand-edited `providers.json`.
- `sha1` 0.10 downgrade: the workspace pins `sha2 = "0.11"`, so within
  `graphify-export` both `sha1` and `sha2` resolve to a single
  `digest v0.11.3`. Downgrading `sha1` to 0.10 would pair it with the 0.10
  digest while `sha2` stays on 0.11 — creating the split the finding aims
  to remove. The 0.10 `sha*`/`digest` entries in `Cargo.lock` come from
  unrelated transitive deps.
- Caching `load_custom_providers()` and `prefix_remap` keying: re-raised
  from prior rounds; same verified rationale (process-global cache breaks
  `#[serial]` test isolation; `source_file` is the verbatim input-path
  string at remap time, with relativisation running strictly afterward).

Glory to the Omnissiah
Fixed:

- `#[must_use]` on the pure `node_label_map` / `god_node_ids` helpers.

Disputed (verified false positives, left unchanged):

- Making `provider add`'s `--base-url` / `--default-model` / `--env-key`
  clap-required: the handler already validates them (provider.rs:110) and
  returns the byte-identical error message Python uses (`__main__.py:1863`),
  matching the reference's optional-parse + validate pattern. The loader's
  skip-incomplete guard is defense-in-depth for hand-edited registries.
- Merging curated labels when `generate_community_labels` degrades: the
  `else` branch only runs when no labels file exists (nothing to lose) or
  `--force-relabel` was passed. Python behaves identically
  (`__main__.py:2424-2448`): `label` / force-relabel is a deliberate
  regenerate that degrades to placeholders by design, and the labels file
  is a regenerable cache. Merging would make `--force-relabel` not fully
  relabel.

By the will of the Machine God
coderabbitai[bot]

This comment was marked as resolved.

rblaine95 added 6 commits June 1, 2026 15:44
These were dismissed in earlier rounds either on an unverified assumption
(finding 1) or on "matches graphify-py" grounds (findings 2-3), which the
project rules reject: feature parity is the bar, not bug parity.

1. pnpm `.`-package import resolution (#1083 follow-up). The resolver
   canonicalises the entry path (on macOS `/tmp` -> `/private/tmp`), so the
   resolved `imports_from` target id (`make_id1` of the canonical path) did
   not match the input-path key in `id_remap` and the edge dangled off the
   relativised file node. `id_remap` now also maps each input file's
   canonicalised spelling to its node id, so the resolved edge connects.
   The parity test now asserts the import resolves to a REAL node id, not
   just that extraction produced nodes.

2. `cluster-only --no-label` was silently ignored when a labels file
   already existed, because the load-existing branch was checked first.
   The `--no-label` branch now runs first, so the flag always yields
   `Community N` placeholders as documented.

3. `cluster-only` silently overwrote a malformed/unreadable
   `.graphify_labels.json` with placeholders, destroying hand-curated
   edits. The read is now fallible (`read_existing_labels`): on failure it
   warns, falls back to placeholders for the run, and leaves the file on
   disk untouched. This is a deliberate divergence from graphify-py
   (`__main__.py:2418-2448`), which clobbers on parse failure.

Added cluster-only parity tests: `--no-label` overrides curated names;
malformed labels file is preserved.

Ave Deus Mechanicus
Fixed:

- `file_node_id_spec` root-leak guard now normalises the temp-dir name with
  `make_id1` (the same routine ids use) instead of only lowercasing and
  replacing `-`, so it detects a leak regardless of which non-alphanumeric
  characters the temp dir name contains.
- Custom-provider `max_completion_tokens` accepts a whole-number JSON float
  (e.g. `8192.0`) in addition to an integer; negative / non-finite /
  out-of-range values fall back to the 8192 default instead of being
  silently dropped or wrapping.
- USAGE.md custom-providers section now documents that the registry is read
  from both `~/.graphify/providers.json` (global) and `.graphify/providers.json`
  (project-local), that the global entry wins on a name clash, and that an
  entry needs a non-empty base_url/default_model/env_key plus the optional
  `max_completion_tokens` cap.

Added coverage: float + negative `max_completion_tokens` parsing.

Disputed (engineering cost/benefit, not parity):

- Threading a pre-resolved provider from the corpus boundary to avoid the
  per-chunk `load_custom_providers()` read: the read is already gated to
  custom (non-built-in) backends via `&&` short-circuit, so the common path
  has zero overhead, and for custom backends a sub-kilobyte JSON read is
  negligible next to each chunk's LLM HTTP round-trip. Threading an owned
  provider through `CorpusConfig`/`AdaptiveRetryCtx`/retry into a new
  non-public extract variant (the public `extract_files_direct*` API and its
  tests can't change) is disproportionate plumbing for that bounded gain.

Glory to the Omnissiah
Doc/comment polish only:

- Add `// SAFETY:` comments to the `EnvGuard` env-mutating `unsafe` blocks
  (`set`/`unset`/`Drop`) stating the `#[serial]`-execution invariant that
  rules out concurrent process-environment access.
- Make the `detect_backend_with` doc comment a complete sentence.

By the will of the Machine God
Fixed:

- `is_included` anchored patterns now also match subtree descendants
  (`/src` includes `src/main.py`), mirroring `could_contain_included_path`
  and normal allowlist semantics. This is the inverse of the anchored
  *ignore* precision fix (#1087): an ignore pattern must not leak into a
  subtree, but an include directory is meant to pull its whole subtree in.
  `is_included` is a public helper not yet wired into the walk (the walk
  rescues via `could_contain_included_path`, and `.graphifyinclude` is a
  loaded-but-unused stub in graphify-py), so this corrects the helper with
  no change to walk behaviour.
- Community-label JSON sanitizer always slices the first `{` … last `}`
  span, even when the reply already starts with `{`, so a leading-JSON +
  trailing-prose reply parses instead of degrading to placeholders.
  Diverges from graphify-py (`llm.py:1308`), which keeps that data loss.

Added coverage: anchored include root/subtree/depth + anchored-file +
unanchored-at-depth; label reply with trailing prose.

Disputed (engineering judgment, not parity):

- Extracting a shared `tempdir/canonicalize/write/extract` helper across the
  `ts_inheritance` tests: a stylistic test-DRY change with no correctness
  impact. The explicit per-test setup keeps each case self-contained and
  readable; collapsing it trades that for indirection. Per the repo's
  surgical-changes guidance, leaving working tests untouched.

Ave Deus Mechanicus
The headline fix corrects a regression I introduced two commits earlier.

- `cluster-only --no-label` no longer clobbers a hand-curated
  `.graphify_labels.json`. My earlier "finding 2" change reordered the
  `--no-label` branch ahead of the load-existing branch so the flag always
  produced `Community N` placeholders and rewrote the file — silently
  destroying curated labels. That was wrong: an existing labels file already
  means no LLM call, so `--no-label` should be a harmless no-op there. The
  load-existing branch is restored to first position, so `--no-label`
  preserves curated labels and only yields placeholders when no file exists.
  The malformed-file guard (warn + don't overwrite) is kept.

Also fixed:

- `god_node_ids` magic `20` → named `GOD_NODE_CAP` constant.
- `max_completion_tokens_from` doc now states it accepts any finite
  non-negative float, truncated toward zero (`8192.9` → `8192`), matching
  the implementation.

Test changes: replaced the test that asserted `--no-label` wipes curated
labels with one asserting it PRESERVES them, plus a test that `--no-label`
with no file yields placeholders.

Disputed (verified false positives):

- "Print updated vs added in `provider add`": Python `__main__.py:1880`
  always prints `'added'` (upsert, no distinction). Adding an "updated"
  message would diverge from the reference, not match it.
- "Use `NamedTempFile` in `save_registry`": Python writes the registry
  non-atomically (`write_text`); the current same-dir temp + rename already
  exceeds that. `provider` is an interactive, non-concurrent command, so a
  unique temp + new runtime dep is unwarranted.
- Caching the per-chunk `load_custom_providers()` read: re-raised; same
  engineering rationale (gated to custom backends, dwarfed by the LLM call,
  threading through the parallel pipeline is disproportionate).

By the will of the Machine God
Fixed:

- `is_included` anchored subtree check now uses a byte-level prefix test
  instead of allocating a `"{p}/"` String per (pattern, path).
- Document the security rationale for global-over-local provider precedence
  in `load_custom_providers_from` (see disputes).

Disputed (verified, non-parity reasons):

- Flip provider precedence to local-wins: declined for security. A
  project-local `.graphify/providers.json` can come from a cloned/untrusted
  repo; letting it shadow a globally-defined provider name would let the repo
  redirect that backend's `base_url` to an attacker endpoint and exfiltrate
  the API key (resolved from the user's env via `env_key`). global-wins
  blocks that hijack; built-in names are already un-shadowable. Comment added
  so the precedence reads as deliberate.
- Extract a `remap_if_different` helper in `multi.rs`: the two id_remap
  insertions have intentionally different overwrite semantics (the input-path
  id uses `insert` so it wins; the canonical spelling uses `or_insert` so it
  never displaces a real input-path mapping). One helper would change
  behavior on a make_id collision or need a flag that adds noise.
- Validate `base_url` shape in `provider add`: Python only checks non-empty
  (`__main__.py:1862`); the URL is validated by the SSRF guard at request
  time, so early `Url::parse` would add a runtime dep for marginal gain.
- Cache the per-chunk `load_custom_providers()` read: re-raised; same
  engineering rationale as prior rounds.

Glory to the Omnissiah
coderabbitai[bot]

This comment was marked as resolved.

rblaine95 added 8 commits June 1, 2026 17:08
Both findings were valid and are fixed:

- `detect_result_from_incremental` now seeds all canonical file-type buckets
  in fixed order (even when empty) before merging the incremental results, so
  a reconstructed `DetectResult` is structurally identical to a fresh
  `detect` walk. Extracted the canonical kind list into a shared
  `graphify_detect::FILE_TYPE_KINDS` constant so the walk and the CLI
  reconstruction use one source of truth.
- `is_included` anchored subtree matching now handles globbed directory
  stems. This matcher's `*` does not cross `/` (gitignore semantics), so
  `/src*` previously matched `src1` but not `src1/deep.py`; the subtree check
  now falls back to `{p}/**` for globbed stems while keeping the zero-alloc
  byte-prefix fast path for literal stems. Converges with graphify-py, whose
  `fnmatch`-based `_is_included` (`*` crosses `/`) already pulls such
  descendants in.

Added coverage: anchored globbed include dir matches its subtree but not an
unrelated directory.

Ave Deus Mechanicus
All trivial cleanups:

- Drop the unused `clippy::float_cmp` / `unsafe_code` allows from
  `labeling.rs` (the file has no float comparisons or `unsafe`); keep only
  `expect_used`.
- Extract the dense anchored-include predicate into a named
  `anchored_include_matches` helper for readability.
- `save_registry` serializes the `Map` directly instead of cloning it into a
  `Value::Object`.

Disputed (already satisfied):

- Add `#[must_use]` to `cap_filename`: it already carries the attribute
  (`export/util.rs:134`); no change needed.

By the will of the Machine God
Fixes five findings and documents one dispute.

- Add `EnvGuard::scrub_backends` to `graphify-llm`'s test `common`
  module and route the duplicated backend-env-unset loops in
  `labeling.rs` and `provider_registry.rs` through it. The canonical
  list now includes `AWS_CONTAINER_CREDENTIALS_FULL_URI` (read by
  `credentials_appear_configured` but previously omitted from the
  scrub), so a host with that variable set can no longer make
  `detect_backend` pick `bedrock` and pollute the no-backend path.
- Enforce the mock endpoint in `label_communities_real_path_via_custom_provider`
  with an explicit `mock.assert()` rather than relying on the label
  assertions alone.
- Add an `extract_files` helper to `ts_inheritance.rs` and collapse the
  repeated tempdir/canonicalize/write/extract setup across all seven
  inheritance tests onto it.
- Reconcile the README LLM-backend list with the `--backend`
  identifiers (`Claude (Anthropic)`, `Kimi (Moonshot)`) and spell out
  the identifier names so readers know what to pass.

Disputed (not changed): CodeRabbit suggested guarding `provider add`
against overwriting an existing provider with a prompt or `--force`
flag. graphify-py's `provider add` upserts unconditionally by design
(add-or-update); a prompt would diverge from that behaviour, break
non-interactive scripted usage, and fail the existing parity tests.
No underlying bug, so left as-is.

Glory to the Omnissiah
Fixes four findings and documents two disputes.

- Centralise the backend-selection env var list behind a new
  `graphify_llm::backend_selection_env_vars()`. It is built from the
  per-backend `ENV_KEY` constants, the new `bedrock::CREDENTIAL_ENV_VARS`,
  and `ollama::BASE_URL_ENV` (the same names `detect_backend` reads), and
  `credentials_appear_configured` now derives from that const too. The
  test helper `EnvGuard::scrub_backends` calls the shared function, so the
  scrub set can no longer drift from the detection logic.
- Simplify the glob-char scan in `anchored_include_matches` to
  `p.contains('*') || p.contains('?')`.
- Rename the `ids` test helper in `file_node_id_spec.rs` to `node_ids` so
  the per-test `let ids = ...` bindings no longer shadow it.
- Remove the orphaned temp file in `provider::save_registry` when the
  atomic rename fails, preserving the original error.

Disputed (not changed): CodeRabbit flagged `load_custom_providers()` in
`call_llm` and `extract_files_direct_mode` as repeated disk I/O and
suggested a process-lifetime cache. Both calls are gated behind
`!is_builtin_backend(backend)`, so built-in backends never touch disk; only
the custom-provider path reads the (small) registry. A global OnceLock cache
would freeze the first-loaded registry for the whole process, breaking the
`#[serial]` custom-provider tests that each point `HOME` at a different
`providers.json`, and would introduce global mutable state the workspace
guidelines steer away from. The marginal I/O does not justify that risk.

By the will of the Machine God
Fixes two findings and re-documents two standing disputes.

- Rewrite the literal-stem subtree test in `anchored_include_matches`
  as `rel_str.strip_prefix(p).is_some_and(|rest| rest.starts_with('/'))`,
  replacing the manual length/index/`starts_with` checks. Same zero-alloc
  behaviour, clearer intent.
- Swap the `seen` dedup set in `label_communities`'s name builder from
  `IndexSet` to `std::collections::HashSet`. The set is membership-only;
  `names` (a `Vec`) carries the observable order, so insertion-order
  tracking was never needed.

Disputed again (not changed): CodeRabbit re-flagged `load_custom_providers()`
in `call_llm` and `extract_files_direct_mode` as repeated disk I/O. Both
calls are gated behind `!is_builtin_backend(backend)`, so only the
custom-provider path ever reads the (small) registry. A process-lifetime
`OnceLock` cache would freeze the first-loaded registry and introduce the
global mutable state the workspace guidelines steer away from, while the
alternative (threading a preloaded map through the public `call_llm` /
`extract_files_direct` API and the per-chunk retry context) is an invasive
signature change across crates for a trivial perf nit. The cost optimised
away is a few KB of JSON per chunk, custom-provider-only. Left as-is
pending owner direction.

Ave Deus Mechanicus
All four findings this round are no-change determinations.

Disputed (approved won't-fix): CodeRabbit flagged `load_custom_providers()`
as repeated disk I/O in three call sites this pass:
- `call_llm` (call.rs) and `extract_files_direct_mode` (extract.rs) are
  gated behind `!is_builtin_backend`, so only the custom-provider path reads
  the small registry.
- `detect_backend` (backends.rs) loads it ~once per run for auto-selection.
A process-lifetime `OnceLock` cache would add global mutable state (against
the workspace guideline) and stale `serve`/`watch`; threading a preloaded map
is an invasive multi-crate API change for a trivial perf nit. Project owner
approved keeping these as documented disputes.

False positive: CodeRabbit suggested adding
`#![allow(clippy::expect_used, clippy::unwrap_used)]` to
`file_node_id_spec.rs`. That file uses `?` with a `TestResult` return type and
contains zero `.expect()`/`.unwrap()` calls, so the suppression would silence
lints that never fire. Clippy already passes on the file. Not added.

By the will of the Machine God
Adds one regression test; both findings this round are disputes.

- Add `extract_swift_same_file_class_and_extension_keeps_canonical` to
  lock in that a single Swift file declaring both `class Foo` and
  `extension Foo` collapses to exactly one canonical node.

Disputed (false positive): CodeRabbit claimed `merge_swift_extensions`
over-marks same-file class+extension pairs because it matches extension
nodes by `(source_file, label)` instead of the node's own kind. It cannot:
tree-sitter-swift parses both as `class_declaration` and both derive the
same `{stem}_{label}` id, so the extractor's `seen_ids` collapses them to a
single node before postprocessing ever runs. There is only ever one node
per `(source_file, label)`, so the label match is equivalent to graphify-py's
nid-based `swift_extensions` tracking. The new test proves the canonical
node is preserved.

Disputed (parity): CodeRabbit wanted `provider add` to write
`"temperature": 0.0` instead of `0`. graphify-py writes integer `0`
(`__main__.py`), and byte-identical registry output is a resync goal;
`serde_json::to_string_pretty` matches it, and the loader reads the integer
back via `as_f64()`, so there is no functional difference. Changing it would
diverge the Rust registry from the Python one. Left as integer `0`.

Glory to the Omnissiah
Fixes three findings.

- Correct the `scrub_backends` doc comment: it unsets only the built-in
  backend-selection vars from `backend_selection_env_vars()` and does NOT
  clear custom-provider `env_key`s (those are registry-defined, so a test
  relying on one must handle it explicitly).
- Short-circuit `find_import_cycles_bounded` when `top_n == 0`, mirroring
  the existing `max_cycle_length == 0` guard, so a zero-results request
  skips `build_file_graph`. Add a `top_n == 0` parity test alongside the
  zero-max-length one.
- Add descriptive failure messages to the `has_edge` assertions in the
  TypeScript inheritance tests so a failure names the missing edge.

By the will of the Machine God
Both findings this round are disputes.

Disputed (over-abstraction): CodeRabbit suggested making the `has_edge`
test helper in `ts_inheritance.rs` generic over a map-like type. It is a
private, single-file helper that only ever receives `ExtractOutput.edges`
(`Vec<IndexMap<String, Value>>`); a generic signature adds an abstraction
no caller needs (against the workspace's "no abstractions for single-use
code" guideline), and the suggested `serde_json::Map` signature would not
even match the concrete `IndexMap` field type. Left concrete.

Disputed (parity, owner-approved): CodeRabbit wanted `provider add` to
upsert and preserve hand-edited `temperature` / `max_completion_tokens`
fields rather than replace the whole entry. graphify-py's `add` replaces
the entry outright (`existing[name] = {...}`), and byte-identical registry
output is a resync goal. The project owner chose to keep the replace
behaviour for parity. No test in either codebase pins re-add semantics.

Ave Deus Mechanicus
coderabbitai[bot]

This comment was marked as resolved.

rblaine95 added 6 commits June 1, 2026 20:16
The headline change completes the `.graphifyinclude` allowlist, which
`graphify-py` defines but never wires into `detect()` (dead code in the
reference). The Rust port already used `could_contain_included_path` for
directory pruning yet never rescued files at classification time, so the
allowlist was inert end-to-end. Completing a feature the reference leaves
inert is a deliberate divergence approved by the project owner (feature
parity, not bug parity), and is documented in `USAGE.md`.

- `classify_one` now keeps an ignored file when it (or an ancestor
  directory) matches `.graphifyinclude`. The sensitive-file guard still
  runs after the rescue, so an allowlist can never pull secrets into the
  corpus.
- `could_contain_included_path` now mirrors `anchored_include_matches`,
  so anchored stems (`/src`, `/src*`) keep the walker descending through
  the whole subtree instead of stopping one level deep. Adds an
  end-to-end `detect` regression for the rescue.
- `detect_result_from_incremental` sorts each bucket so incremental
  reconstruction is byte-identical to a fresh `detect` walk.
- The Swift extension-merge parity tests now assert the survivor is the
  canonical class (by `source_file`) carrying both the class and merged
  extension methods, not just a node count.
- The `provider.rs` temp-rename comment no longer claims unconditional
  atomicity; overwrite and atomicity semantics are platform- and
  filesystem-dependent.

By the will of the Machine God
All three findings this round are previously-approved disputes verified
against the current code; no changes warranted.

- `crates/graphify-llm/src/call.rs` and `crates/graphify-llm/src/extract.rs`:
  CodeRabbit again flags the per-call `load_custom_providers()` disk read
  and asks for a process-wide cache. Approved won't-fix: a `OnceLock` /
  `LazyLock` global breaks the `#[serial]` HOME-mutating test isolation
  (each test points `HOME` at a different `providers.json`), and threading
  a resolved registry through every call site is disproportionate for a
  path taken only by custom (non-built-in) backends.
- `src/cli/provider.rs`: CodeRabbit asks for a `--temperature` flag on
  `provider add`. Disputed on parity grounds: graphify-py `__main__.py`
  (lines 1872-1878) hardcodes `"temperature": 0` and parses no
  `--temperature` argument, so the Rust `add` is byte-identical to the
  reference. Adding a flag would exceed the feature-parity bar.

Glory to the Omnissiah
All three findings this round are disputes verified against the current
code and the graphify-py reference; no changes warranted.

- `crates/graphify-export/Cargo.toml`: the recurring sha1/sha2 digest
  "version split" is a false positive. `cargo tree` shows `sha1 v0.11.0`
  and `sha2 v0.11.0` resolving through a single `digest v0.11.3`; the
  workspace pins `sha2 = "0.11"`, so `sha1 = "0.11"` already unifies the
  trait family. Downgrading sha1 would *create* the split.
- `crates/graphify-export/src/canvas.rs`: `cap_filename` is parity-matched.
  graphify-py `export.py` (`safe_name` -> `_cap_filename(..., limit=200)`
  then `f"{base}_{count}"`) caps before dedup exactly as Rust does, and the
  200-byte cap reserves 55 bytes under the 255-byte limit for `.md` plus
  the dedup suffix. Overflow would need >1e51 collisions; re-ordering the
  cap would diverge from the reference for no real-world gain.
- `crates/graphify-analyze/src/cycles.rs`: the `top_n * 10` enumeration cap
  matches graphify-py `analyze.py` (#961), which caps `nx.simple_cycles`
  output at `top_n * 10` before `cycles.sort(key=len)`. Both suggested
  changes (sort-then-limit, or length-prioritised enumeration) would defeat
  the documented combinatorial-explosion guard and break byte-identical
  parity with the reference.

Ave Deus Mechanicus
Twenty-third CodeRabbit round: one fix, one dispute.

- `crates/graphify-llm/tests/provider_registry.rs`: add `scrub_backends()`
  to the `call_llm` and `extract_files_direct` custom-provider routing
  tests. Both were already hermetic (an explicit non-builtin backend
  routes straight through the custom path and never consults built-in
  env vars), but scrubbing makes the isolation explicit, matches the
  sibling detection test's setup, and guards against a future refactor
  that grew a built-in fallback. `CUSTOM*_KEY` is a registry env_key and
  is untouched by the scrub.

Disputed (no change):

- `src/cli/cluster_only.rs`: CodeRabbit asked to pretty-print
  `.graphify_labels.json`. Declined: `extract.rs` writes the same file
  with compact `serde_json::to_string`, and graphify-py `__main__.py`
  (line 2448) writes it single-line via `json.dumps` without `indent`.
  Pretty-printing would break both internal consistency and byte-level
  parity with the reference.

By the will of the Machine God
Twenty-fourth CodeRabbit round: three fixes, four disputes.

Fixes:

- `crates/graphify-llm/tests/labeling.rs`: add `scrub_backends()` to the
  custom-provider labeling test, matching the routing tests hardened in
  the previous round (hermetic, explicit-backend setup).
- `crates/graphify-llm/tests/provider_registry.rs`: drop the now-redundant
  `unsafe_code` from the crate-level allow. CodeRabbit flagged the asymmetry
  with `labeling.rs`; the correct direction is removal. The earlier round
  replaced this file's inline `std::env::remove_var` scrub loop with the
  safe `EnvGuard::scrub_backends()` call, so no direct `unsafe` remains here
  (the `unsafe` blocks live in `tests/common/mod.rs`, which carries its own
  inner allow). Clippy confirms it compiles clean.
- `src/cli/cluster_only.rs`: annotate the bare `false` argument to
  `generate_community_labels` as `// quiet`.

Disputed (no change):

- `crates/graphify-extract/tests/ts_inheritance.rs`: a single-use `EdgeMap`
  type alias for `IndexMap<String, Value>` is over-abstraction; the concrete
  type matches the file's style.
- `crates/graphify-export/Cargo.toml`: the workspace MSRV is already 1.95
  (`rust-version = "1.95"`), well above `sha1 = "0.11"`'s 1.85 floor.
- `crates/graphify-analyze/src/cycles.rs`: the requested explanatory comment
  on the edge-orientation fallback already exists (lines 139-143).
- `src/cli/cluster_only.rs`: pretty-printing or conditionally skipping the
  `.graphify_labels.json` write both diverge from graphify-py
  `__main__.py:2448`, which writes single-line via `json.dumps` and rewrites
  unconditionally.

Glory to the Omnissiah
Twenty-fifth CodeRabbit round: one fix, five disputes.

Fix:

- `crates/graphify-detect/tests/parity_include.rs`: add a regression test
  for an anchored `?`-glob include stem (`/src?`), exercising the
  `p.contains('?')` branch of `anchored_include_matches` that the existing
  `/src*` test left uncovered.

Disputed (no change):

- `crates/graphify-detect/src/walk.rs`: `FileType` has exactly five variants
  and `as_str()` is an exhaustive match onto exactly `FILE_TYPE_KINDS`, so
  bucket insertion can only ever hit a pre-seeded key. A runtime assertion
  would guard a compile-time-impossible state.
- `crates/graphify-llm/src/call.rs` and `.../extract.rs`: the per-call
  `load_custom_providers()` read is the standing approved won't-fix (a global
  cache breaks `#[serial]` HOME-mutating test isolation).
- `crates/graphify-llm/src/bedrock.rs`: the closure is correct and
  clippy-clean. `CREDENTIAL_ENV_VARS: &[&str]` means `.iter()` yields `&&str`,
  so the `filter` predicate receives `&&&str` and `**k` is the `&str`.
  CodeRabbit's `|&k|` (binding `&&str`) would not match the `&str` literal
  patterns.
- `crates/graphify-export/Cargo.toml`: the recurring sha1/sha2 "digest split"
  is a false positive. `cargo tree` shows `sha1 v0.11.0` and `sha2 v0.11.0`
  resolving through one `digest v0.11.3`; the workspace pins `sha2 = "0.11"`.

Ave Deus Mechanicus
coderabbitai[bot]

This comment was marked as resolved.

Address two CodeRabbit findings on the `.graphifyinclude` rescue and the
`provider add` pricing inputs.

`ConvertCtx::record` re-checked `is_ignored` on the converted sidecar
path without honouring the source's allowlist verdict. A source rescued
by `.graphifyinclude` but ignored by a broad pattern such as `*` had its
office/Google-Workspace sidecar silently dropped, so the very content
the rescue was meant to keep never reached the corpus. `record` now
takes the source path and bypasses the sidecar ignore check when
`is_included(src)` holds, mirroring the `Direct` path in `classify_one`.
The allowlist is keyed on source paths, so the verdict is taken from the
source rather than the derived `.md` path. An ordinary (non-rescued)
source still has its sidecar filtered through `.graphifyignore`, matching
graphify-py `detect()`.

`provider add` accepted non-finite `--pricing-input` / `--pricing-output`
values. `serde_json` serializes `NaN`/`+-Inf` to `null`, which the loader
reads back as the `0.0` default, so an invalid price was silently lost
rather than rejected. `add` now validates both values with
`f64::is_finite` and returns an error before writing the registry.

Both branches are covered: a docx rescued past a `*` ignore whose sidecar
must survive, and a `--pricing-input nan` add that must fail without
creating a registry file.

Glory to the Omnissiah
@rblaine95 rblaine95 force-pushed the resync/graphify-py/4b17f19 branch from 7f5a5c9 to 8089979 Compare June 2, 2026 05:04
@rblaine95 rblaine95 merged commit f75b074 into master Jun 2, 2026
12 checks passed
@rblaine95 rblaine95 deleted the resync/graphify-py/4b17f19 branch June 2, 2026 05:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant