Fuzzy node-grounding pass: ground 13 labels via normalized exact match #106

Merged: realmarcin merged 1 commit into main from claude/fuzzy-node-grounding on Jun 14, 2026

@realmarcin

Contributor

What

Reaches residual causal-node labels that exact matching missed by applying deterministic morphological normalization (X biosynthesis → X biosynthetic process, X degradation → X catabolic process, plural → singular) and then exact-matching against the kg-microbe ontology index, constrained per node_type. This is still an exact match after a transform, not edit-distance fuzzy matching, and every candidate was hand-verified against the index (canonical name + deprecated flag) before grounding.

Node grounding: 995 → 1011 / 1643 (60% → 61%).

13 verified mappings (mappings/node_grounding.tsv)

| node_type | targets |
|---|---|
| CHEMICAL → CHEBI | glucose, manganese(2+), ethanol, lactate, propionate, xylose, capsular polysaccharide, lipoteichoic acids |
| BIOLOGICAL_PROCESS → GO | horizontal gene transfer, phenazine biosynthesis (→ "phenazine biosynthetic process") |
| CELLULAR_LOCALIZATION → GO-CC | type iv pilus |
| ENVIRONMENTAL_FACTOR → ENVO | habitat, ultraviolet radiation |

Rejected on verification

  • nutrient uptake → GO:0009935 — confirmed deprecated in the index
  • spore coat → GO "spore wall" — distinct bacterial structures
  • plant tissue colonization, intracellular membrane — semantic mismatch / too loose
  • ambiguous labels, mostly chemicals where charge/stereo forms compete: n-acetylglucosamine, electron acceptor, pyocyanin, polyhydroxyalkanoate, immune evasion
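The verification in this PR was done by hand. As a rough illustration only, the two mechanical parts of that check (deprecated flag and canonical name) could be expressed as a pre-filter like the one below. `IndexEntry`, `verify`, and the field names are hypothetical; semantic judgments such as "distinct bacterial structures" would still need a human reviewer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexEntry:
    # Hypothetical shape of one ontology index row.
    term_id: str
    canonical_name: str
    deprecated: bool


def verify(candidate_name: str, entry: IndexEntry) -> tuple[bool, str]:
    """Return (accepted, reason) for a candidate label against an index entry."""
    if entry.deprecated:
        return False, "deprecated in index"
    if candidate_name.strip().lower() != entry.canonical_name.strip().lower():
        return False, "canonical name differs"
    return True, "ok"
```

For example, an entry flagged as deprecated is rejected before its name is even compared, which mirrors the order of the first two rejections listed above.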

Verification

  • just validate-strict: 477 files, 0 errors.
  • Idempotent (re-run grounds 0).
  • Remaining residual is dominated by non-ontological descriptive phrases (maximal growth rate, salt-in strategy, lateral cell-wall elongation) with no ontology home — the diminishing tail.
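The idempotency claim can be pictured with a small sketch: a pass that only touches ungrounded nodes reports zero new groundings when run a second time. The node dictionaries, the `term_id` field, and `apply_mappings` are assumptions made for illustration, not the repository's actual data model.

```python
def apply_mappings(nodes: list[dict], mappings: dict) -> int:
    """Ground ungrounded nodes in place and return how many were newly grounded."""
    newly_grounded = 0
    for node in nodes:
        if node.get("term_id"):
            continue  # already grounded, so a re-run leaves it untouched
        term_id = mappings.get((node["node_type"], node["label"]))
        if term_id:
            node["term_id"] = term_id
            newly_grounded += 1
    return newly_grounded


# Toy data; the EX: identifier is a placeholder.
mappings = {("ENVIRONMENTAL_FACTOR", "habitat"): "EX:0004"}
nodes = [
    {"node_type": "ENVIRONMENTAL_FACTOR", "label": "habitat"},
    {"node_type": "ENVIRONMENTAL_FACTOR", "label": "salt-in strategy"},
]
first_run = apply_mappings(nodes, mappings)
second_run = apply_mappings(nodes, mappings)
```

Here the first pass grounds the one label present in the mapping table, and the second pass finds nothing left to do.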

🤖 Generated with Claude Code

Applies deterministic morphological normalization (X biosynthesis ->
X biosynthetic process, plural -> singular, ...) to residual causal-node
labels, then exact-matches against the kg-microbe ontology index constrained
per node_type. Every candidate hand-verified against the index (canonical
name + deprecated flag) before grounding. Node grounding rises
995 -> 1011 / 1643 (60% -> 61%).

13 verified mappings added to mappings/node_grounding.tsv:
- CHEMICAL -> CHEBI: glucose, manganese(2+), ethanol, lactate, propionate,
  xylose, capsular polysaccharide, lipoteichoic acids
- BIOLOGICAL_PROCESS -> GO: horizontal gene transfer, phenazine biosynthesis
  (normalized to 'phenazine biosynthetic process')
- CELLULAR_LOCALIZATION -> GO-CC: type iv pilus
- ENVIRONMENTAL_FACTOR -> ENVO: habitat, ultraviolet radiation

Rejected on verification: nutrient uptake (GO:0009935 confirmed deprecated),
spore coat (GO 'spore wall' — distinct bacterial structure), plant tissue
colonization and intracellular membrane (semantic mismatch), and ambiguous
chemicals where charge/stereo forms compete (n-acetylglucosamine, electron
acceptor, pyocyanin, polyhydroxyalkanoate, immune evasion).

validate-strict: 477 files, 0 errors. Grounding is idempotent. The remaining
residual is dominated by non-ontological descriptive phrases (maximal growth
rate, salt-in strategy, lateral cell-wall elongation) with no ontology home.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@realmarcin realmarcin merged commit 6f19573 into main Jun 14, 2026
2 checks passed
@realmarcin realmarcin deleted the claude/fuzzy-node-grounding branch June 14, 2026 06:57