fix: guard division-by-zero and uninitialized state in metrics/masking code #1576
mooreneural wants to merge 4 commits into
…s/masking code
- mlm_memmap.py: normalize codon weights only when mean > 0 to prevent NaN
  propagating into np.random.binomial when all token weights are zero
- mlm_memmap.py: clamp conditional random-replace probability to [0, 1] and
  guard against mask_replace_prob == 1.0 causing ZeroDivisionError
- dead_latents.py: initialize _last_avg_nonzero in __init__ so get_stats()
  is safe to call before the first update()
- evo2_dataset.py: remove duplicate ASCII 45 ('-') entry in VALID_DNA_AND_DEGENERATE
Signed-off-by: Clay Moore <claytonwaynemoore@gmail.com>
Thanks for the contribution -- running CI and reviewing.

/ok to test 7c89e10

The single failing test (`test_infer_phylogenetic_prompt`) is a pre-existing CI flake, unrelated to any of the four changes in this PR. The error is:

`torch.distributed.DistNetworkError: EADDRINUSE - port 40935 already in use`

Could someone re-run the `mbridge-unit-tests (recipes/evo2_megatron)` job? A fresh run should pass.
Fix

- `mlm_memmap.py` (codon weight normalization): `pos_weight[1:-1] / pos_weight[1:-1].mean()` produced `NaN` when all token weights are zero, which propagated into `np.random.binomial` and either crashed or silently corrupted masking probabilities. Fixed by skipping normalization when the mean is zero.
- `mlm_memmap.py` (random-replacement probability): `random_replace_prob / (1 - mask_replace_prob)` caused `ZeroDivisionError` when `mask_replace_prob == 1.0`, and produced an out-of-range probability (> 1.0) that crashed `np.random.binomial` whenever `random_replace_prob > 1 - mask_replace_prob`. Fixed by guarding against a zero denominator and clamping to [0, 1].
- `dead_latents.py` (uninitialized `_last_avg_nonzero`): `DeadLatentTracker.get_stats()` referenced `self._last_avg_nonzero`, which was only assigned inside `update()`. Calling `get_stats()` before the first `update()` raised `AttributeError`. Fixed by initializing to `0.0` in `__init__`.
- `evo2_dataset.py` (duplicate set entry): ASCII value `45` (`-`, gap character) appeared twice in `VALID_DNA_AND_DEGENERATE`. Python sets silently deduplicate, so there is no runtime impact, but it is a copy-paste error. Removed the duplicate.

Test plan

- `process_item` with `codon_weights` all zeros no longer raises or produces NaN
- `process_item` with `mask_replace_prob=1.0` and `random_replace_prob > 0` no longer raises `ZeroDivisionError`
- `DeadLatentTracker().get_stats()` can be called immediately after construction without `AttributeError`
- `Evo2Dataset.VALID_DNA_AND_DEGENERATE` still contains `45` exactly once
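
For reviewers who want to see the three guards in isolation, here is a minimal standalone sketch. It is not the code in this diff: `normalize_weights`, `conditional_random_prob`, and `TrackerSketch` are hypothetical stand-ins, the `update()` body and the `avg_nonzero` key are invented for illustration, and returning `0.0` when `mask_replace_prob == 1.0` is an assumption about the intended behavior.

```python
import numpy as np


def normalize_weights(pos_weight: np.ndarray) -> np.ndarray:
    """Scale interior weights to mean 1; skip the step when the mean is not positive."""
    out = pos_weight.astype(float).copy()
    # Guard the empty interior slice too, since .mean() of an empty array is NaN.
    interior_mean = out[1:-1].mean() if out.size > 2 else 0.0
    if interior_mean > 0:
        out[1:-1] = out[1:-1] / interior_mean
    return out


def conditional_random_prob(mask_replace_prob: float, random_replace_prob: float) -> float:
    """P(random replace | token was not replaced by the mask token), clamped to [0, 1]."""
    remaining = 1.0 - mask_replace_prob
    if remaining <= 0.0:
        # Assumption: if every selected token gets the mask token, none are left over.
        return 0.0
    return min(max(random_replace_prob / remaining, 0.0), 1.0)


class TrackerSketch:
    """Hypothetical stand-in for a tracker whose stats are readable before update()."""

    def __init__(self) -> None:
        # Initialized here so get_stats() cannot hit AttributeError.
        self._last_avg_nonzero = 0.0

    def update(self, activations: np.ndarray) -> None:
        # Illustrative metric: average count of nonzero latents per row.
        self._last_avg_nonzero = float((activations != 0).sum(axis=-1).mean())

    def get_stats(self) -> dict:
        return {"avg_nonzero": self._last_avg_nonzero}
```

With these guards, the value handed to `np.random.binomial` is always a finite probability in [0, 1], and `TrackerSketch().get_stats()` returns a default before any data has been seen.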