Skip to content

fix: guard division-by-zero and uninitialized state in metrics/masking code#1576

Open
mooreneural wants to merge 4 commits into
NVIDIA:mainfrom
mooreneural:mooreneural/BioNeMoore
Open

fix: guard division-by-zero and uninitialized state in metrics/masking code#1576
mooreneural wants to merge 4 commits into
NVIDIA:mainfrom
mooreneural:mooreneural/BioNeMoore

Conversation

@mooreneural
Copy link
Copy Markdown

Fix

  • mlm_memmap.py — codon weight normalization: pos_weight[1:-1] / pos_weight[1:-1].mean() produced NaN when all token weights are zero, which propagated into np.random.binomial and either crashed or silently corrupted masking probabilities. Fixed by skipping normalization when the mean is zero.
  • mlm_memmap.py — random-replacement probability: random_replace_prob / (1 - mask_replace_prob) caused ZeroDivisionError when mask_replace_prob == 1.0, and produced an out-of-range probability (> 1.0) that crashed np.random.binomial for other extreme combos. Fixed by guarding against zero denominator and clamping to [0, 1].
  • dead_latents.py — uninitialized _last_avg_nonzero: DeadLatentTracker.get_stats() referenced self._last_avg_nonzero, which was only assigned inside update(). Calling get_stats() before the first update() raised AttributeError. Fixed by initializing to 0.0 in __init__.
  • evo2_dataset.py — duplicate set entry: ASCII value 45 (-, gap character) appeared twice in VALID_DNA_AND_DEGENERATE. Python sets silently deduplicate so no runtime impact, but it is a copy-paste error. Removed the duplicate.

Test plan

  • process_item with codon_weights all zeros no longer raises or produces NaN
  • process_item with mask_replace_prob=1.0 and random_replace_prob > 0 no longer raises ZeroDivisionError
  • DeadLatentTracker().get_stats() can be called immediately after construction without AttributeError
  • Evo2Dataset.VALID_DNA_AND_DEGENERATE still contains 45 exactly once

…s/masking code

- mlm_memmap.py: normalize codon weights only when mean > 0 to prevent NaN
  propagating into np.random.binomial when all token weights are zero
- mlm_memmap.py: clamp conditional random-replace probability to [0, 1] and
  guard against mask_replace_prob == 1.0 causing ZeroDivisionError
- dead_latents.py: initialize _last_avg_nonzero in __init__ so get_stats()
  is safe to call before the first update()
- evo2_dataset.py: remove duplicate ASCII 45 ('-') entry in VALID_DNA_AND_DEGENERATE

Signed-off-by: Clay Moore <claytonwaynemoore@gmail.com>
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 18, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 18, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 07a8855d-54bb-4fde-a1b7-ff144d3657ca

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@trvachov
Copy link
Copy Markdown
Collaborator

Thanks for the contribution -- running CI and reviewing.

@trvachov
Copy link
Copy Markdown
Collaborator

/ok to test 7c89e10

@mooreneural
Copy link
Copy Markdown
Author

The single failing test (test_infer_phylogenetic_prompt) is a pre-existing CI flake, unrelated to any of the four changes in this PR. The error is:

torch.distributed.DistNetworkError: EADDRINUSE - port 40935 already in use
The test hardcodes --master_port 40935 in its torchrun invocation. Another process on the CI runner was holding that port during this run. Our only change to evo2_megatron was removing a duplicate entry from a Python set literal, which cannot cause a port conflict.

Could someone re-run the mbridge-unit-tests (recipes/evo2_megatron) job? A fresh run should pass.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants