Fuzzy node-grounding pass: ground 13 labels via normalized exact match #106
Merged
Applies deterministic morphological normalization (X biosynthesis -> X biosynthetic process, plural -> singular, ...) to residual causal-node labels, then exact-matches against the kg-microbe ontology index constrained per node_type. Every candidate was hand-verified against the index (canonical name + deprecated flag) before grounding.

Node grounding rises 995 -> 1011 / 1643 (60% -> 61%).

13 verified mappings added to mappings/node_grounding.tsv:
- CHEMICAL -> CHEBI: glucose, manganese(2+), ethanol, lactate, propionate, xylose, capsular polysaccharide, lipoteichoic acids
- BIOLOGICAL_PROCESS -> GO: horizontal gene transfer, phenazine biosynthesis (normalized to 'phenazine biosynthetic process')
- CELLULAR_LOCALIZATION -> GO-CC: type iv pilus
- ENVIRONMENTAL_FACTOR -> ENVO: habitat, ultraviolet radiation

Rejected on verification:
- nutrient uptake (GO:0009935 confirmed deprecated)
- spore coat (GO 'spore wall', a distinct bacterial structure)
- plant tissue colonization and intracellular membrane (semantic mismatch)
- ambiguous chemicals where charge/stereo forms compete (n-acetylglucosamine, electron acceptor, pyocyanin, polyhydroxyalkanoate, immune evasion)

validate-strict: 477 files, 0 errors. Grounding is idempotent.

The remaining residual is dominated by non-ontological descriptive phrases (maximal growth rate, salt-in strategy, lateral cell-wall elongation) with no ontology home.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
What
Reaches residual causal-node labels that exact matching missed, by applying deterministic morphological normalization (X biosynthesis → X biosynthetic process, X degradation → X catabolic process, plural → singular) and then exact-matching against the kg-microbe ontology index, constrained per node_type. This is still an exact match after a transform, not edit-distance fuzzy matching, and every candidate was hand-verified against the index (canonical name + deprecated flag) before grounding.

Node grounding: 995 → 1011 / 1643 (60% → 61%).
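The normalize-then-exact-match step described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `Term` record, the `normalize` and `ground` helpers, the in-memory index keyed by `(node_type, name)`, and the placeholder identifiers are all assumptions made for the example. The automatic `deprecated` check only stands in for the manual verification the PR describes.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    curie: str          # ontology identifier (placeholders below are NOT real IDs)
    name: str           # canonical name in the index
    deprecated: bool = False


# Suffix rewrites named in the PR text; the rule table itself is hypothetical.
SUFFIX_RULES = [
    (re.compile(r"\bbiosynthesis$"), "biosynthetic process"),
    (re.compile(r"\bdegradation$"), "catabolic process"),
]


def normalize(label: str) -> list[str]:
    """Return the cleaned label followed by its deterministic rewrites."""
    base = " ".join(label.lower().split())
    candidates = [base]
    for pattern, replacement in SUFFIX_RULES:
        rewritten = pattern.sub(replacement, base)
        if rewritten != base:
            candidates.append(rewritten)
    # Naive plural -> singular: drop one trailing "s", skipping -ss/-is/-us words.
    if base.endswith("s") and not base.endswith(("ss", "is", "us")):
        candidates.append(base[:-1])
    return candidates


def ground(label: str, node_type: str, index: dict[tuple[str, str], Term]) -> Term | None:
    """Exact-match each candidate against the index for this node_type only."""
    for candidate in normalize(label):
        term = index.get((node_type, candidate))
        if term is not None and not term.deprecated:
            return term
    return None


# Toy index for illustration; names as keys and placeholder CURIEs are assumptions.
INDEX = {
    ("BIOLOGICAL_PROCESS", "phenazine biosynthetic process"): Term(
        "GO:placeholder-1", "phenazine biosynthetic process"
    ),
    ("BIOLOGICAL_PROCESS", "nutrient uptake"): Term(
        "GO:0009935", "nutrient uptake", deprecated=True
    ),
}

hit = ground("Phenazine biosynthesis", "BIOLOGICAL_PROCESS", INDEX)
print(hit.name if hit else None)
```

Because the lookup key includes `node_type`, a label that normalizes to a valid name in another ontology branch is not grounded, and a deprecated hit such as the `nutrient uptake` entry is skipped rather than returned.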
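13 verified mappings (mappings/node_grounding.tsv)

- CHEMICAL → CHEBI: glucose, manganese(2+), ethanol, lactate, propionate, xylose, capsular polysaccharide, lipoteichoic acids
- BIOLOGICAL_PROCESS → GO: horizontal gene transfer, phenazine biosynthesis (normalized to 'phenazine biosynthetic process')
- CELLULAR_LOCALIZATION → GO-CC: type iv pilus
- ENVIRONMENTAL_FACTOR → ENVO: habitat, ultraviolet radiation

Rejected on verification

- nutrient uptake → GO:0009935: confirmed deprecated in the index
- spore coat → GO "spore wall": distinct bacterial structures
- plant tissue colonization, intracellular membrane: semantic mismatch / too loose
- n-acetylglucosamine, electron acceptor, pyocyanin, polyhydroxyalkanoate, immune evasion: ambiguous, competing charge/stereo forms

Verification

- just validate-strict: 477 files, 0 errors.
- Grounding is idempotent.

The remaining residual is dominated by non-ontological descriptive phrases (maximal growth rate, salt-in strategy, lateral cell-wall elongation) with no ontology home: the diminishing tail.

🤖 Generated with Claude Code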