Multi-GPU training does not data-parallelize: the DDP wrapper is bypassed#49
Multi-GPU training does not data-parallelize: the DDP wrapper is bypassed#49TonyChen06 wants to merge 1 commit into
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughIn ChangesDistributed Gradient Averaging
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
curriculum_learning.py (1)
1118-1123: 🚀 Performance & Scalability | 🔵 Trivial | 🏗️ Heavy liftPer-parameter all-reduce is communication-inefficient.
Issuing a separate
all_reducefor every parameter creates many small collectives with no compute/communication overlap, which scales poorly on large models. Two options:
- Coalesce gradients into flat buckets and reduce in fewer calls (e.g.,
torch._utils._flatten_dense_tensors/torch.distributed.algorithms).- Preferably, drive the backward through the DDP wrapper so DDP's bucketed, overlapped reducer is used instead of this manual path. The root cause is that
compute_lossruns on the unwrapped_get_model(), which bypasses DDP's autograd hooks; routing the loss computation throughself.model(the DDP module) would remove the need for this manual averaging entirely.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@curriculum_learning.py` around lines 1118 - 1123, The manual gradient averaging in `curriculum_learning.py` is doing a separate `dist.all_reduce` for each parameter, which is inefficient. Update the training flow around `compute_loss` and `_get_model()` so the backward pass goes through `self.model` (the DDP wrapper) instead of the unwrapped model, allowing DDP’s bucketed reducer to handle synchronization automatically. If that isn’t possible, at least coalesce parameter grads into fewer buckets before reducing rather than iterating over `self._get_model().parameters()` one by one.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@curriculum_learning.py`:
- Around line 1119-1122: The gradient synchronization loop in self._get_model()
can hang because dist.all_reduce is called only when p.grad is not None, making
the collective sequence differ across ranks. Update the reduction logic so every
rank iterates over the same deterministic set of trainable parameters and always
participates in the same all_reduce calls, materializing a zero gradient for
missing grads before calling dist.all_reduce.
---
Nitpick comments:
In `@curriculum_learning.py`:
- Around line 1118-1123: The manual gradient averaging in
`curriculum_learning.py` is doing a separate `dist.all_reduce` for each
parameter, which is inefficient. Update the training flow around `compute_loss`
and `_get_model()` so the backward pass goes through `self.model` (the DDP
wrapper) instead of the unwrapped model, allowing DDP’s bucketed reducer to
handle synchronization automatically. If that isn’t possible, at least coalesce
parameter grads into fewer buckets before reducing rather than iterating over
`self._get_model().parameters()` one by one.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 9ced9a8f-b67e-4c3c-a6c2-8b68a4378f75
📒 Files selected for processing (1)
curriculum_learning.py
65dd2cb to
26dcf3a
Compare
The model is DDP-wrapped, but the training loop computes loss on the unwrapped module (`_get_model().compute_loss()`), which bypasses DDP's reducer -- so gradients are never synced across ranks. Each rank trains an independent replica on its own data shard and only rank 0's is saved, so multi-GPU runs do not data-parallelize. Fix: average gradients across ranks after backward().
26dcf3a to
4a4e983
Compare
cc @RealLast
♻️ Current situation & Problem
curriculum_learning.pywraps the model inDDP(model), but every training step computes the loss on the unwrapped module:Calling the module instead of the wrapper means DDP's reducer never fires, so gradients are never synchronized across ranks. With
DistributedSamplersharding the data, each rank trains an independent replica on its own 1/N of the data,_save_checkpointkeeps only rank 0, and ranks 1…N−1's training is discarded.⚙️ Release Notes
backward()); previously each rank trained an independent replica and only rank 0 was saved.📚 Documentation
A one-line inline comment at the fix site explains why the manual all-reduce is needed (the
_get_model()unwrap bypasses DDP). No public-interface change.✅ Testing
The toy below replicates the trainer's exact pattern —
DDP(model)wrapped (as in_initialize_model), loss computed onmodel.module(as in_get_model().compute_loss(...)), and a different data shard per rank (asDistributedSamplergives) — then logs the gradient each rank holds and how far the two ranks' weights have drifted apart.Output:
model.module): each rank holds different gradients (synced=False), takes a different step, and the replicas drift apart — divergence grows every step and keeps climbing.+ all-reduce): gradients are identical across ranks and equal the average of the per-rank grads (step 1:(0.04043 + 0.19431) / 2 = 0.11737), so the replicas stay bit-identical (0.000000).The same divergence also reproduces on the actual pipeline; the two ranks diverge after the first step under the current loop and stay identical with the fix:
Code of Conduct & Contributing Guidelines
By creating and submitting this pull request, you agree to follow our Code of Conduct and Contributing Guidelines:
Need help on this PR? Tag
/codesmithwith what you need. Autofix is disabled.Summary by CodeRabbit