
Enable C++ PPR Sampling#556

Open
mkolodner-sc wants to merge 192 commits into main from mkolodner-sc/cpp_ppr_tracer

Conversation


@mkolodner-sc mkolodner-sc commented Mar 24, 2026

Scope of work done

PPR-Based Neighbor Sampling for GiGL

Adds C++-based PPR (Personalized PageRank) neighbor sampling as an alternative to k-hop sampling in GiGL's distributed training stack, replacing the previous pure-Python implementation.

C++ Kernel (gigl-core/csrc/sampling/)

  • Implements the Forward Push algorithm (Andersen et al., 2006) as PPRForwardPushState — a pybind11-wrapped C++ class that owns all hot-loop state (scores, residuals, queue, neighbor cache).
  • Python drives the async RPC neighbor fetches; C++ handles the residual push loop, keeping the asyncio event loop unblocked via run_in_executor.
  • gigl-core is a separately installable package (uv pip install -e gigl-core/) with a py.typed marker and .pyi stub for full mypy coverage.
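For readers unfamiliar with Forward Push, the hot loop the C++ kernel owns can be sketched in a few lines of Python. This is a simplified single-seed, homogeneous-graph sketch, not the kernel's actual API: the real `PPRForwardPushState` is batched and heterogeneous, and the plain `r >= eps` push condition here is an illustrative simplification.

```python
from collections import defaultdict, deque

def ppr_forward_push(adj, seed, alpha=0.15, eps=1e-6):
    """Approximate single-seed PPR via Forward Push (Andersen et al., 2006).

    adj maps node -> list of out-neighbors. Each push absorbs alpha * r(u)
    into u's score and spreads (1 - alpha) * r(u) over u's neighbors; the
    loop terminates once every residual is below eps.
    """
    scores = defaultdict(float)
    residuals = defaultdict(float)
    residuals[seed] = 1.0
    queue = deque([seed])
    queued = {seed}
    while queue:
        u = queue.popleft()
        queued.discard(u)
        r = residuals[u]
        neighbors = adj.get(u, [])
        if r < eps or not neighbors:
            continue  # below threshold (or dangling node): leave residual as-is
        residuals[u] = 0.0
        scores[u] += alpha * r
        share = (1.0 - alpha) * r / len(neighbors)
        for v in neighbors:
            residuals[v] += share
            if residuals[v] >= eps and v not in queued:
                queue.append(v)
                queued.add(v)
    return dict(scores)
```

In a graph without dangling nodes, absorbed score mass plus leftover residual mass stays at 1 throughout, which is a handy invariant for testing convergence.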

Python Sampler (gigl/distributed/dist_ppr_sampler.py)

  • DistPPRNeighborSampler extends BaseGiGLSampler and integrates with the existing DistNeighborLoader / Graph Store infrastructure.
  • Output format: (seed_type, "ppr", neighbor_type) edge types with edge_index ([2, N] int64) and edge_attr ([N] float PPR scores).
  • max_fetch_iterations: Optional[int] = None caps RPC calls per batch; the algorithm continues to convergence using cached neighbor lists after the budget is exhausted.
  • num_neighbors_per_hop defaults to 1000 — high-degree hub nodes receive diminishing residual per neighbor, so capping the fetch has negligible effect on PPR accuracy.
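To make the `max_fetch_iterations` semantics concrete, here is a hedged sketch of the budget logic in plain Python. `fetch_with_budget`, `fetch_fn`, and the frontier/cache shapes are hypothetical stand-ins rather than the sampler's actual internals; the point is only that RPC rounds stop at the cap while the push loop keeps running on cached (possibly empty) adjacency.

```python
def fetch_with_budget(frontiers, cache, fetch_fn, max_fetch_iterations=None):
    """Fetch neighbor lists round by round, capping the number of RPC rounds.

    frontiers: iterable of node batches needed by successive push rounds.
    cache: dict node -> neighbor list, shared across rounds.
    fetch_fn: stand-in for the async RPC fetch; returns {node: neighbors}.
    Once the budget is spent, still-missing nodes are treated as leaves
    (empty adjacency) so the PPR loop can continue to convergence.
    """
    rpc_rounds = 0
    for frontier in frontiers:
        missing = [n for n in frontier if n not in cache]
        if not missing:
            continue  # everything already cached: no RPC needed this round
        if max_fetch_iterations is None or rpc_rounds < max_fetch_iterations:
            cache.update(fetch_fn(missing))
            rpc_rounds += 1
        else:
            for n in missing:
                cache[n] = []  # budget exhausted: no more RPCs, treat as leaf
    return rpc_rounds
```

With no cap, every round that finds uncached nodes issues one fetch; with a cap of 2, the third round falls back to empty adjacency instead of fetching.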

Configuration

  • PPRSamplerOptions added to sampler_options.py alongside existing KHopNeighborSamplerOptions; pass via sampler_options= on DistNeighborLoader.

C++ Tests (gigl-core/tests/ppr_forward_push_test.cpp)

  • 6 GoogleTest cases covering: initial queue drain, convergence, score absorption, residual distribution, cross-seed deduplication, and top-k limiting.
  • tests/CMakeLists.txt updated to auto-link torch and auto-discover kernel sources under csrc/ (excluding python_* entry points).

Where is the documentation for this feature?: N/A

Did you add automated tests or write a test plan?

Updated Changelog.md? NO

Ready for code review?: NO

@mkolodner-sc mkolodner-sc force-pushed the mkolodner-sc/cpp_ppr_tracer branch from 4d2db2a to cfcb8cb on March 25, 2026 18:59
@mkolodner-sc mkolodner-sc changed the base branch from main to mkolodner-sc/cpp-infrastructure March 25, 2026 19:27
@mkolodner-sc mkolodner-sc force-pushed the mkolodner-sc/cpp_ppr_tracer branch from 7009d70 to dd118ef on March 25, 2026 21:58
Base automatically changed from mkolodner-sc/cpp-infrastructure to main April 29, 2026 01:58
@mkolodner-sc mkolodner-sc changed the title PPR C++ Tracer Bullet Enable C++ PPR Sampling Apr 29, 2026
@mkolodner-sc
Collaborator Author

/unit_test


github-actions Bot commented Apr 29, 2026

GiGL Automation

@ 21:35:08UTC : 🔄 C++ Unit Test started.

@ 21:37:04UTC : ✅ Workflow completed successfully.


github-actions Bot commented Apr 29, 2026

GiGL Automation

@ 21:35:10UTC : 🔄 Scala Unit Test started.

@ 21:44:18UTC : ✅ Workflow completed successfully.


github-actions Bot commented Apr 29, 2026

GiGL Automation

@ 21:35:13UTC : 🔄 Python Unit Test started.

@ 22:29:26UTC : ❌ Workflow failed.
Please check the logs for more details.

ref_score = reference_ppr[ntype_str][node_id]
sam_score = ntype_to_sampler_ppr[ntype_str][node_id]
- assert abs(sam_score - ref_score) < 1e-6, (
+ assert abs(sam_score - ref_score) < 1e-5, (
Collaborator Author


This needed to be updated from 1e-6 to 1e-5 for the margin of error. With the C++ conversion, my understanding is that we have lost some float precision here, but the 1e-5 margin should still be sufficient for validating that the PPR algorithm is correct.

Collaborator


It's a little odd that we lost precision; have we looked into why?

I think it's fine to update this, I'm just concerned if we don't know why.

Collaborator


The robots have some thoughts about this; if it is floating-point ordering then it's fine imo.

  The tolerance in both assertions was loosened from 1e-6 (Python original)
  to 1e-5. The justification given in the comments at lines 273-275 and
  375-378 — "theoretical bound is ~1.5e-6, so 1e-5 provides a safety margin"
  — is true for the new C++ implementation, but the Python implementation
  passed at 1e-6 with observed deltas around 1e-7.

  The C++ kernel is supposed to be mathematically equivalent. A 10x increase
  in observed error suggests the floating-point summation order changed —
  likely because unordered_map iteration order differs from Python dict
  insertion order, and the residual push at ppr_forward_push.cpp:178-225
  sums in nondeterministic order.

Collaborator Author


The original pure-Python implementation used Python float (float64) for scores end-to-end. The C++ kernel keeps internal scores as double, but extractTopK returns float32 tensors to match PyG's edge-weight convention; that should be the new source of precision loss. 1e-5 accounts for it.
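A quick stdlib-only way to measure how much a double-to-float32 round-trip costs for values of this magnitude (`to_float32` is an illustrative helper, not from the codebase):

```python
import struct

def to_float32(x: float) -> float:
    """Round-trip a Python float (IEEE double) through IEEE single precision."""
    return struct.unpack("f", struct.pack("f", x))[0]

score = 0.123456789          # a plausible PPR-score magnitude
cast_error = abs(to_float32(score) - score)
# For normalized values, the relative error of a nearest-rounding float32
# cast is at most 2**-24 (~6e-8), so for scores bounded by 1 the cast alone
# contributes absolute error on the order of 1e-8.
```

Since single precision carries roughly seven significant decimal digits, the cast by itself sits well inside a 1e-5 tolerance.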

Collaborator


Wait did we use to always use float64 / double? Or never convert to float?

Collaborator Author

@mkolodner-sc mkolodner-sc May 1, 2026


Hmm, I believe we used to use doubles here. I've investigated this a bit further and tried changing all the floats to doubles in the cpp code, but it still is not within the 1e-6 tol. The robots came up with a potential reason similar to yours above:

The delta is purely a consequence of std::unordered_set iterating in a different order than CPython's set for small integer node IDs — different processing order within each push step leads to a different (but valid) residual distribution at convergence. 

An example of this:

Iteration 2: queue = {1, 2}, both with residual 0.125

  • Processing 1 before 2: when 1 pushes, it adds residual to 2 (bumping 2 to 0.15625). Then 2 pushes with 0.15625, so ppr[2] = 0.15625.
  • Processing 2 before 1: 2 pushes first with 0.125, so ppr[2] = 0.125. The extra 0.03125 from node 1 lands back in 2's residual for a later round.

Both paths converge to the same true PPR eventually, but when convergence is declared (residuals just drop below threshold), the absorbed ppr_hat values can differ by up to the remaining residual magnitude. C++ unordered_set ordering for int keys differs from CPython's (both standard libraries leave iteration order of unordered containers unspecified), giving a different residual distribution at termination, hence the ~1.38e-6 delta.

Given the theoretical bound is 1.5e-6, we should probably have set it to 2e-6 in the first place. Running the tests with this tolerance leads to success.
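The order-dependence is easy to reproduce in pure Python. The sketch below (all names illustrative, not from the codebase) runs the same simplified forward push twice on a toy graph, processing each round's frontier in two different orders: both runs conserve probability mass exactly, but the scores at termination only agree up to the leftover residual.

```python
def push_in_order(adj, seed, key, alpha=0.15, eps=1e-4):
    """Simplified forward push that processes each round's frontier in the
    order induced by sort key `key`, mimicking container iteration order.
    Returns (scores, residuals) at termination (all residuals < eps)."""
    scores = {n: 0.0 for n in adj}
    resid = {n: 0.0 for n in adj}
    resid[seed] = 1.0
    while True:
        frontier = sorted((n for n in adj if resid[n] >= eps), key=key)
        if not frontier:
            return scores, resid
        for u in frontier:
            r, resid[u] = resid[u], 0.0
            scores[u] += alpha * r                    # absorb alpha * r
            share = (1.0 - alpha) * r / len(adj[u])
            for v in adj[u]:                          # spread the rest
                resid[v] += share

# Toy graph with a 2-cycle between nodes 1 and 2, seeded at node 0.
adj = {0: [1, 2], 1: [2], 2: [1]}
a_scores, a_resid = push_in_order(adj, 0, key=lambda n: n)   # 1 before 2
b_scores, b_resid = push_in_order(adj, 0, key=lambda n: -n)  # 2 before 1
max_delta = max(abs(a_scores[n] - b_scores[n]) for n in adj)
```

Shrinking eps shrinks the order-induced disagreement, consistent with the observed delta sitting near the theoretical residual bound at the test's threshold.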

Collaborator

@kmontemayor2-sc kmontemayor2-sc left a comment


Took a first pass, thanks Matt.

I left a good number of perf-related comments, but we probably don't need to address those.

I do think we should address the file structure and variable names; it's hard to know what pk is, for instance.

Comment thread gigl-core/core/sampling/ppr_forward_push.cpp
Comment thread gigl-core/csrc/sampling/ppr_forward_push.h Outdated
Comment thread gigl-core/core/sampling/ppr_forward_push.h
Comment thread gigl-core/core/sampling/ppr_forward_push.cpp
Comment thread gigl-core/csrc/sampling/ppr_forward_push.cpp Outdated
Comment thread gigl-core/csrc/sampling/ppr_forward_push.cpp Outdated
Comment thread gigl-core/csrc/sampling/python_ppr_forward_push.cpp Outdated
Comment thread gigl/distributed/dist_ppr_sampler.py Outdated
Comment thread gigl/distributed/dist_ppr_sampler.py Outdated
Comment thread gigl/distributed/dist_ppr_sampler.py Outdated
Comment thread gigl-core/core/sampling/ppr_forward_push.cpp Outdated
Comment thread gigl-core/core/sampling/ppr_forward_push.cpp Outdated
Comment thread gigl-core/core/sampling/ppr_forward_push.cpp Outdated
Comment on lines +106 to +107
for (int32_t s = 0; s < _batchSize; ++s) {
for (int32_t nt = 0; nt < _numNodeTypes; ++nt) {
Collaborator


Can we give these loop vars better names too? batchNode and nodeType maybe? nt is better but s is kinda magical (in a bad way)

Collaborator Author


Renamed throughout: s → seedIdx, nt → nodeTypeId, eid → edgeTypeId, dstNt → dstNodeTypeId, peid → neighborEdgeTypeId, pk → packedKey.

Comment thread gigl-core/core/sampling/ppr_forward_push.cpp Outdated
ref_score = reference_ppr[ntype_str][node_id]
sam_score = ntype_to_sampler_ppr[ntype_str][node_id]
- assert abs(sam_score - ref_score) < 1e-6, (
+ assert abs(sam_score - ref_score) < 1e-5, (
Collaborator


Wait did we use to always use float64 / double? Or never convert to float?

Comment thread Makefile
Comment on lines +193 to +196
# num_sampled_edges may not exist at all (e.g. in tests or when GLT doesn't
# populate it), and may lack entries for edge types with zero samples.
if hasattr(data, "num_sampled_edges"):
data.num_sampled_edges.pop(edge_type, None)
Collaborator


Hmmm, why did this only surface now?

Comment thread gigl/distributed/dist_ppr_sampler.py
Comment thread gigl/distributed/dist_ppr_sampler.py Outdated