Enable C++ PPR Sampling #556
Conversation
Force-pushed from 4d2db2a to cfcb8cb
Force-pushed from 7009d70 to dd118ef
Merge …/GiGL into mkolodner-sc/cpp-infrastructure
/unit_test
GiGL Automation @ 21:35:08 UTC 🔄 → 21:37:04 UTC: ✅ Workflow completed successfully.
GiGL Automation @ 21:35:10 UTC 🔄 → 21:44:18 UTC: ✅ Workflow completed successfully.
GiGL Automation @ 21:35:13 UTC 🔄 → 22:29:26 UTC: ❌ Workflow failed.
    ref_score = reference_ppr[ntype_str][node_id]
    sam_score = ntype_to_sampler_ppr[ntype_str][node_id]
-   assert abs(sam_score - ref_score) < 1e-6, (
+   assert abs(sam_score - ref_score) < 1e-5, (
This needed to be updated from 1e-6 to 1e-5 for the margin of error. With the C++ conversions, my understanding is that we have lost some float precision here, but the 1e-5 margin should still be sufficient for validating that the PPR algorithm is correct.
It's a little odd that we lost precision; have we looked into why?
I think it's fine to update this, I'm just concerned if we don't know why.
The robots have some thoughts about this; if it is floating-point ordering then it's fine imo.
The tolerance in both assertions was loosened from 1e-6 (Python original)
to 1e-5. The justification given in the comments at lines 273-275 and
375-378 — "theoretical bound is ~1.5e-6, so 1e-5 provides a safety margin"
— is true for the new C++ implementation, but the Python implementation
passed at 1e-6 with observed deltas around 1e-7.
The C++ kernel is supposed to be mathematically equivalent. A 10x increase
in observed error suggests the floating-point summation order changed —
likely because unordered_map iteration order differs from Python dict
insertion order, and the residual push at ppr_forward_push.cpp:178-225
sums in nondeterministic order.
The original pure-Python implementation used Python float (float64) for scores end-to-end. The C++ kernel keeps internal scores as double, but extractTopK returns float32 tensors to match PyG's edge-weight convention; that should be the new source of precision loss. 1e-5 accounts for it.
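For scale, a minimal sketch of how much a single double → float32 conversion can move a score, using a pure-Python struct round-trip to stand in for the tensor conversion (roundtrip_f32 and the example value are illustrative, not actual GiGL code):

```python
import struct

def roundtrip_f32(x: float) -> float:
    # Pack a Python float (IEEE-754 binary64) into binary32 and back,
    # mimicking a double score emitted as a float32 tensor entry.
    return struct.unpack("f", struct.pack("f", x))[0]

score = 0.123456789012345               # an illustrative PPR score
err = abs(roundtrip_f32(score) - score)
# A single conversion perturbs a score by at most about |x| * 2**-24
# (~6e-8 for scores near 1), comfortably inside a 1e-5 tolerance.
```

Note this bound is well below the ~1e-6 deltas observed, which is consistent with the conversion alone not being the whole story.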
Wait did we use to always use float64 / double? Or never convert to float?
Hmm, I believe we used to use doubles here. I've investigated this a bit further and tried changing all the floats to doubles in the C++ code, but it is still not within the 1e-6 tolerance. The robots came up with a potential reason similar to yours above:
The delta is purely a consequence of std::unordered_set iterating in a different order than CPython's set for small integer node IDs — different processing order within each push step leads to a different (but valid) residual distribution at convergence.
An example of this:
Iteration 2 queue = {1, 2}, both with residual 0.125
- Order: process 1 before 2 — when 1 pushes, it adds residual to 2 (bumping 2 to 0.15625). Then 2 pushes with 0.15625, so ppr[2] = 0.15625.
- Order: process 2 before 1 — 2 pushes first with 0.125, so ppr[2] = 0.125. The extra 0.03125 from node 1 lands back in 2's residual for a later round.
Both paths converge to the same true PPR eventually, but when convergence is declared (residuals just drop below threshold), the absorbed ppr_hat values can differ by up to the remaining residual magnitude. C++ unordered_set ordering for int keys differs from CPython's set ordering (both leave the iteration order of unordered containers unspecified), giving a different residual distribution at termination — hence the ~1.38e-6 delta.
Given the theoretical bound is 1.5e-6, we should probably have set it to 2e-6 in the first place. Running the tests with that tolerance succeeds.
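To make the order-dependence concrete, here is a minimal, self-contained forward-push sketch (plain Python on a toy graph; forward_push and its order parameter are illustrative, not the actual kernel). Two runs that differ only in which above-threshold node is pushed first produce different scores at a loose threshold, yet both stay within the residual-based bound of a near-exact run:

```python
def forward_push(neighbors, seed, alpha=0.15, eps=1e-3, order=min):
    # Forward-push PPR. `order` picks which above-threshold node is pushed
    # next; it stands in for the C++ unordered_set vs CPython set iteration
    # order. Assumes every node has at least one out-neighbor.
    ppr = {}                 # absorbed probability mass per node
    residual = {seed: 1.0}   # mass not yet absorbed
    while True:
        frontier = [u for u, r in residual.items() if r > eps]
        if not frontier:
            break
        u = order(frontier)
        r = residual.pop(u)
        ppr[u] = ppr.get(u, 0.0) + alpha * r            # absorb alpha * r
        share = (1.0 - alpha) * r / len(neighbors[u])   # push the rest out
        for v in neighbors[u]:
            residual[v] = residual.get(v, 0.0) + share
    return ppr

# Toy graph: node 0 feeds 1 and 2, which then feed each other.
nbrs = {0: [1, 2], 1: [2], 2: [1]}
reference = forward_push(nbrs, 0, eps=1e-12)    # near-exact run
low_first = forward_push(nbrs, 0, order=min)    # push node 1 first
high_first = forward_push(nbrs, 0, order=max)   # push node 2 first
# low_first and high_first disagree on nodes 1 and 2, but both sit within
# the residual bound (sum of leftover residuals) of the reference.
```

Both loose runs are valid answers at eps=1e-3; only the leftover-residual distribution at termination differs, which is exactly the effect described above.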
kmontemayor2-sc
left a comment
Took a first pass, thanks Matt.
I was leaving a good amount of perf-related comments but we probably don't need to address those.
I do think we should address the file structure and variable names, it's hard to know what pk is, for instance.
    for (int32_t s = 0; s < _batchSize; ++s) {
      for (int32_t nt = 0; nt < _numNodeTypes; ++nt) {
Can we give these loop vars better names too? batchNode and nodeType maybe? nt is better but s is kinda magical (in a bad way)
Renamed throughout: s → seedIdx, nt → nodeTypeId, eid → edgeTypeId, dstNt → dstNodeTypeId, peid → neighborEdgeTypeId, pk → packedKey.
    # num_sampled_edges may not exist at all (e.g. in tests or when GLT doesn't
    # populate it), and may lack entries for edge types with zero samples.
    if hasattr(data, "num_sampled_edges"):
        data.num_sampled_edges.pop(edge_type, None)
Hmmm, why did this only surface now?
Scope of work done
PPR-Based Neighbor Sampling for GiGL
Adds C++-based PPR (Personalized PageRank) neighbor sampling as an alternative to k-hop sampling in GiGL's distributed training stack, replacing the previous pure-Python implementation.
C++ Kernel (gigl-core/csrc/sampling/)
- PPRForwardPushState: a pybind11-wrapped C++ class that owns all hot-loop state (scores, residuals, queue, neighbor cache).
- run_in_executor.
- gigl-core is a separately installable package (uv pip install -e gigl-core/) with a py.typed marker and .pyi stub for full mypy coverage.

Python Sampler (gigl/distributed/dist_ppr_sampler.py)
- DistPPRNeighborSampler extends BaseGiGLSampler and integrates with the existing DistNeighborLoader / Graph Store infrastructure.
- Produces (seed_type, "ppr", neighbor_type) edge types with edge_index ([2, N] int64) and edge_attr ([N] float PPR scores).
- max_fetch_iterations: Optional[int] = None caps RPC calls per batch; the algorithm continues to convergence using cached neighbor lists after the budget is exhausted.
- num_neighbors_per_hop defaults to 1000; high-degree hub nodes receive diminishing residual per neighbor, so capping the fetch has negligible effect on PPR accuracy.

Configuration
- PPRSamplerOptions added to sampler_options.py alongside the existing KHopNeighborSamplerOptions; pass via sampler_options= on DistNeighborLoader.

C++ Tests (gigl-core/tests/ppr_forward_push_test.cpp)
- tests/CMakeLists.txt updated to auto-link torch and auto-discover kernel sources under csrc/ (excluding python_* entry points).

Where is the documentation for this feature?: N/A
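Putting the configuration pieces described above together, a hypothetical usage sketch. Class and parameter names are taken from this description; the module paths, constructor signatures, and the loader's other arguments are assumptions:

```python
# Hypothetical sketch only: import paths and signatures are assumed,
# not taken from the actual GiGL source.
from gigl.distributed import DistNeighborLoader
from gigl.distributed.sampler_options import PPRSamplerOptions

opts = PPRSamplerOptions(
    num_neighbors_per_hop=1000,  # default; caps neighbors fetched per node
    max_fetch_iterations=None,   # None = no RPC budget per batch
)
loader = DistNeighborLoader(
    ...,                         # dataset / graph-store arguments elided
    sampler_options=opts,        # selects PPR sampling instead of k-hop
)
# Each yielded batch would then carry (seed_type, "ppr", neighbor_type)
# edge types with edge_index ([2, N] int64) and edge_attr ([N] PPR scores).
```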
Did you add automated tests or write a test plan?
Updated Changelog.md? NO
Ready for code review?: NO