fix: guard division-by-zero and uninitialized state in metrics/masking code by mooreneural · Pull Request #1576 · NVIDIA/bionemo-framework

mooreneural · 2026-05-18T23:25:58Z

Fix

mlm_memmap.py — codon weight normalization: pos_weight[1:-1] / pos_weight[1:-1].mean() produced NaN when all token weights are zero, which propagated into np.random.binomial and either crashed or silently corrupted masking probabilities. Fixed by skipping normalization when the mean is zero.
mlm_memmap.py — random-replacement probability: random_replace_prob / (1 - mask_replace_prob) caused ZeroDivisionError when mask_replace_prob == 1.0, and produced an out-of-range probability (> 1.0) that crashed np.random.binomial for other extreme combos. Fixed by guarding against zero denominator and clamping to [0, 1].
dead_latents.py — uninitialized _last_avg_nonzero: DeadLatentTracker.get_stats() referenced self._last_avg_nonzero, which was only assigned inside update(). Calling get_stats() before the first update() raised AttributeError. Fixed by initializing to 0.0 in __init__.
evo2_dataset.py — duplicate set entry: ASCII value 45 (-, gap character) appeared twice in VALID_DNA_AND_DEGENERATE. Python sets silently deduplicate so no runtime impact, but it is a copy-paste error. Removed the duplicate.

Test plan

process_item with codon_weights all zeros no longer raises or produces NaN
process_item with mask_replace_prob=1.0 and random_replace_prob > 0 no longer raises ZeroDivisionError
DeadLatentTracker().get_stats() can be called immediately after construction without AttributeError
Evo2Dataset.VALID_DNA_AND_DEGENERATE still contains 45 exactly once

…s/masking code - mlm_memmap.py: normalize codon weights only when mean > 0 to prevent NaN propagating into np.random.binomial when all token weights are zero - mlm_memmap.py: clamp conditional random-replace probability to [0, 1] and guard against mask_replace_prob == 1.0 causing ZeroDivisionError - dead_latents.py: initialize _last_avg_nonzero in __init__ so get_stats() is safe to call before the first update() - evo2_dataset.py: remove duplicate ASCII 45 ('-') entry in VALID_DNA_AND_DEGENERATE Signed-off-by: Clay Moore <claytonwaynemoore@gmail.com>

copy-pr-bot · 2026-05-18T23:26:01Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-05-18T23:26:07Z

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 07a8855d-54bb-4fde-a1b7-ff144d3657ca

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

trvachov · 2026-05-27T14:58:11Z

Thanks for the contribution -- running CI and reviewing.

trvachov · 2026-05-27T14:58:18Z

/ok to test 7c89e10

mooreneural · 2026-05-27T18:46:02Z

The single failing test (test_infer_phylogenetic_prompt) is a pre-existing CI flake, unrelated to any of the four changes in this PR. The error is:

torch.distributed.DistNetworkError: EADDRINUSE - port 40935 already in use
The test hardcodes --master_port 40935 in its torchrun invocation. Another process on the CI runner was holding that port during this run. Our only change to evo2_megatron was removing a duplicate entry from a Python set literal, which cannot cause a port conflict.

Could someone re-run the mbridge-unit-tests (recipes/evo2_megatron) job? A fresh run should pass.

mooreneural requested review from cspades, dorotat-nv, jomitchellnv, jstjohn, jwilber, pstjohn, savitha-eng and trvachov as code owners May 18, 2026 23:25

mooreneural added 3 commits May 19, 2026 10:16

Merge branch 'main' into mooreneural/BioNeMoore

9184a26

Merge branch 'main' into mooreneural/BioNeMoore

0ad7a85

Merge branch 'main' into mooreneural/BioNeMoore

7c89e10

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: guard division-by-zero and uninitialized state in metrics/masking code#1576

fix: guard division-by-zero and uninitialized state in metrics/masking code#1576
mooreneural wants to merge 4 commits into
NVIDIA:mainfrom
mooreneural:mooreneural/BioNeMoore

mooreneural commented May 18, 2026

Uh oh!

copy-pr-bot Bot commented May 18, 2026

Uh oh!

coderabbitai Bot commented May 18, 2026 •

edited

Loading

Review skipped

Uh oh!

trvachov commented May 27, 2026

Uh oh!

trvachov commented May 27, 2026

Uh oh!

mooreneural commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

mooreneural commented May 18, 2026

Fix

Test plan

Uh oh!

copy-pr-bot Bot commented May 18, 2026

Uh oh!

coderabbitai Bot commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Uh oh!

trvachov commented May 27, 2026

Uh oh!

trvachov commented May 27, 2026

Uh oh!

mooreneural commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai Bot commented May 18, 2026 •

edited

Loading