UniProt protein-grounding pass: ground 9 protein labels (+ 8-col writer fix)#105

Merged: realmarcin merged 1 commit into main from claude/uniprot-grounding-pass on Jun 14, 2026
@realmarcin

Contributor

What

Grounds residual GENE_OR_PROTEIN causal-node labels via scripts/match_uniprot_to_proteins.py (streams the kg-microbe UniProt index), picks tier-1/2/3 representatives, and applies them with ground_causal_nodes.py --apply.

Node grounding: 985 → 995 / 1643 (59% → 60%).

9 representatives added (mappings/node_grounding.tsv)

All single-protein, tier-1 exact-name match unless noted:
nitrogenase, multicopper oxidase, catalase, urease, luciferase, proteorhodopsin, metal efflux pump (tier-2 suffix), drug efflux pump (tier-2), czc cation-efflux system (tier-3 paren).

The matcher correctly declined multi-protein complexes — photosystem II, type III/IX secretion systems, flagellar motor/basal body, photosynthetic reaction center — rather than grounding a whole machine to one subunit. Those stay residual.
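As a rough illustration of how a deterministic tiered picker like the one described above might behave, here is a minimal sketch. It is not the code in scripts/match_uniprot_to_proteins.py: the function names, the exact definition of each tier (exact name, whole-word suffix, name with a parenthetical removed), the tie-break on accession, and the sample accessions are all assumptions made for illustration. The script's handling of multi-protein complexes is not modeled here.

```python
import re
from typing import Iterable, Optional, Tuple


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so comparisons ignore case and spacing."""
    return " ".join(name.lower().split())


def match_tier(label: str, protein_name: str) -> Optional[int]:
    """Return 1, 2 or 3 for the assumed tiers, or None when nothing matches."""
    lab = normalize(label)
    name = normalize(protein_name)
    if name == lab:
        return 1  # assumed tier 1: exact name match
    if name.endswith(" " + lab):
        return 2  # assumed tier 2: label is a whole-word suffix of the name
    without_parens = normalize(re.sub(r"\([^)]*\)", " ", name))
    if without_parens == lab:
        return 3  # assumed tier 3: match after dropping a parenthetical
    return None


def pick_representative(
    label: str, index: Iterable[Tuple[str, str]]
) -> Optional[Tuple[int, str]]:
    """Stream (accession, protein_name) pairs and keep the best (tier, accession).

    Ties within a tier are broken by the smallest accession string, so the
    result does not depend on the order in which the index is streamed.
    """
    best: Optional[Tuple[int, str]] = None
    for accession, protein_name in index:
        tier = match_tier(label, protein_name)
        if tier is None:
            continue
        candidate = (tier, accession)
        if best is None or candidate < best:
            best = candidate
    return best


# Made-up index rows; the accessions are placeholders, not real UniProt IDs.
SAMPLE_INDEX = [
    ("ACC0002", "Catalase"),
    ("ACC0001", "catalase"),
    ("ACC0003", "Putative metal efflux pump"),
    ("ACC0004", "Czc cation-efflux system (CzcA)"),
]
```

With this sample, a label with no matching row, such as photosystem II, returns None and would stay residual.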

Bug fix (carried in this PR)

match_uniprot_to_proteins.py --apply emitted a 7-column row, but node_grounding.tsv gained a predicate_id column in #83 (PR #101). Left unfixed, --apply would have misaligned every column. The writer now emits 8 columns with predicate_id=skos:closeMatch — a single UniProt sequence represents a generic protein concept, so the match is close, never exact (consistent with the existing protein groundings).
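One way to guard against this kind of column drift is to build each row against the file's header rather than a hard-coded field count. The sketch below assumes that approach; it is not the writer in match_uniprot_to_proteins.py. Only predicate_id and its value skos:closeMatch come from this PR. The other seven column names, the position of predicate_id, and the sample values are placeholders.

```python
import csv
import io
from typing import Dict, List

# Placeholder header: only "predicate_id" is named in the PR. The other seven
# column names and the column order are assumptions for illustration.
HEADER: List[str] = [
    "col_1", "col_2", "col_3", "col_4", "col_5", "col_6", "col_7",
    "predicate_id",
]


def format_row(header: List[str], values: Dict[str, str]) -> str:
    """Serialize one TSV row, refusing to write if the keys do not match the header."""
    missing = [c for c in header if c not in values]
    extra = [c for c in values if c not in header]
    if missing or extra:
        raise ValueError(f"column mismatch: missing={missing} extra={extra}")
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=header, delimiter="\t", lineterminator="\n"
    )
    writer.writerow(values)
    return buf.getvalue()


# A complete 8-column row (placeholder values except predicate_id).
row = {name: f"value_{i}" for i, name in enumerate(HEADER[:-1], start=1)}
row["predicate_id"] = "skos:closeMatch"
line = format_row(HEADER, row)
```

Under this design, a row built with the old 7-column layout raises ValueError instead of silently shifting every field.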

Verification

  • just validate-strict: 477 files, 0 errors.
  • Idempotent (re-run grounds 0).
  • Representatives selected by the script's deterministic tiered picker, against the real UniProt index.
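The idempotency claim can be pictured with a small sketch: an apply step that only adds labels not already grounded will report 0 additions on a second run. The function name and the in-memory dictionaries are hypothetical stand-ins for whatever ground_causal_nodes.py --apply does against mappings/node_grounding.tsv, and the target IDs are placeholders.

```python
from typing import Dict


def apply_groundings(existing: Dict[str, str], proposed: Dict[str, str]) -> int:
    """Add proposed label -> target mappings that are not yet grounded.

    Returns the number of newly grounded labels. Labels already present in
    `existing` are left untouched, which is what makes a re-run a no-op.
    """
    added = 0
    for label, target in sorted(proposed.items()):
        if label in existing:
            continue
        existing[label] = target
        added += 1
    return added


# Placeholder targets; these are not real UniProt accessions.
grounded: Dict[str, str] = {"nitrogenase": "TARGET_A"}
proposed = {"catalase": "TARGET_B", "urease": "TARGET_C", "nitrogenase": "TARGET_A"}

first_run = apply_groundings(grounded, proposed)
second_run = apply_groundings(grounded, proposed)
```

Here the first run grounds the two new labels and the second run grounds none.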

🤖 Generated with Claude Code

…writer)

Runs scripts/match_uniprot_to_proteins.py to match residual GENE_OR_PROTEIN
causal-node labels against the kg-microbe UniProt index, picks tier-1/2/3
representatives, and grounds them via ground_causal_nodes.py --apply.
Node grounding rises 985 -> 995 / 1643 (59% -> 60%).

9 representatives added to mappings/node_grounding.tsv (all single-protein,
tier-1 exact-name unless noted):
  nitrogenase, multicopper oxidase, catalase, urease, luciferase,
  proteorhodopsin, metal efflux pump (tier-2), drug efflux pump (tier-2),
  czc cation-efflux system (tier-3 paren)

The matcher correctly declined multi-protein complexes (photosystem II,
type III/IX secretion system, flagellar motor/basal body, photosynthetic
reaction center) — these stay residual rather than grounding to one subunit.

Bug fix: match_uniprot_to_proteins.py --apply emitted a 7-column row, but
mappings/node_grounding.tsv gained a predicate_id column in #83. The writer
now emits 8 columns with predicate_id=skos:closeMatch (a single UniProt
sequence represents a generic protein concept — never an exact match).

validate-strict: 477 files, 0 errors. Grounding is idempotent.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@realmarcin realmarcin merged commit 5953256 into main Jun 14, 2026
3 checks passed
@realmarcin realmarcin deleted the claude/uniprot-grounding-pass branch June 14, 2026 06:40