Various madmatrix improvements and bugfixes by theoheimel · Pull Request #3 · MadGraphTeam/MadGraph7

theoheimel · 2026-05-28T12:32:56Z

GPU:

reduce number of cudaMallocAsync calls in UMAMI interface
allow for fully asynchronous calls of sigmaKin without device or stream synchronization
allow for parallelization over helicities in a single kernel launch instead of using CUDA streams

SIMD:

use one flavor index per SIMD vector instead of per batch
reorder events in UMAMI such that the flavor index is the same for each vector
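The SIMD reordering described above can be illustrated with a small sketch. It is not the UMAMI implementation: the function name `flavor_sorted_slots`, the sentinel `kPadding`, and the choice to pad partially filled vectors are all assumptions made for illustration. The sketch groups events by flavor index so that every block of `vector_size` consecutive slots holds a single flavor.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical sentinel marking a padding slot that holds no real event.
constexpr std::size_t kPadding = static_cast<std::size_t>(-1);

// Returns a slot -> original event index table, grouped by flavor so that
// every block of `vector_size` consecutive slots shares one flavor index.
// Each flavor group is padded with kPadding up to a multiple of vector_size.
std::vector<std::size_t> flavor_sorted_slots(const std::vector<int>& flavor_of_event,
                                             std::size_t vector_size)
{
    assert(vector_size > 0);

    // std::map iterates flavors in ascending order; within one flavor the
    // events keep their original relative order.
    std::map<int, std::vector<std::size_t>> events_by_flavor;
    for (std::size_t i = 0; i < flavor_of_event.size(); ++i) {
        events_by_flavor[flavor_of_event[i]].push_back(i);
    }

    std::vector<std::size_t> slots;
    for (const auto& entry : events_by_flavor) {
        const std::vector<std::size_t>& events = entry.second;
        slots.insert(slots.end(), events.begin(), events.end());
        // Fill up the last, partially used vector of this flavor.
        while (slots.size() % vector_size != 0) {
            slots.push_back(kPadding);
        }
    }
    return slots;
}
```

With flavors `{1, 0, 1, 0, 0}` and `vector_size = 2`, the slots are `{1, 3, 4, kPadding, 0, 2}`: two vectors for flavor 0 (the second half empty) and one for flavor 1. Padding costs some wasted lanes, but a vector never mixes flavors.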

…tructure

The merged standalone_cpp evaluates a flavor by index via process.sigmaKin(iflav), which reads CPPProcess's internal flavor_table and the per-flavor bookkeeping arrays sized by nflavors. The old check_sa-local flavor_arr no longer exists, so the test's patch silently no-op'd it and the two injected non-representative flavors evaluated to wrong/missing values (C++ backend failed; Fortran backend was unaffected).

Rewrite the C++ injection to the new architecture: extend the internal flavor_table (CPPProcess.cc) + nflavors (CPPProcess.h) and maxflavor/pdg_arr (check_sa.cpp), via a multi-line-safe array extender.

Verified the test passes on both backends: s c~ > s c~ reproduces d u~ > d u~ (8.5706e-3), s c~ > c c~ vanishes, masks partial (known) vs all-on (lookup miss). Marks MERGE_TEST_INVESTIGATION.md item #3 RESOLVED.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Qubitol · 2026-06-15T09:42:35Z

+    }};
+    std::size_t total_size = 0;
+    for (auto [ptr, size] : ptrs_and_sizes) {
+        std::size_t aligned_size = (size + 7) / 8 * 8;


Maybe a naive question: is this used to round up to the next multiple of 8 because 8 is sizeof(double)? If that's the case, I suggest two things: use fptype instead of double, and make the numbers explicit, e.g. const std::size_t ROUND = sizeof(fptype); std::size_t aligned_size = (size + ROUND - 1) / ROUND * ROUND;. This helps readability a bit, and the constants are optimised out anyway.
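To make the rounding in this comment concrete, here is a minimal sketch of how aligned sizes could be summed so that several buffers fit into one allocation. The names `round_up`, `Layout` and `plan_single_allocation` are hypothetical and do not come from the PR, and `fptype` is fixed to `double` here only so that the example compiles on its own.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumption: in the real code fptype is chosen at build time (double or float).
using fptype = double;

// Round `size` (in bytes) up to the next multiple of `alignment`.
constexpr std::size_t round_up(std::size_t size, std::size_t alignment)
{
    return (size + alignment - 1) / alignment * alignment;
}

// Byte offset of each buffer inside one shared allocation, plus the total size.
struct Layout {
    std::vector<std::size_t> offsets;
    std::size_t total_size;
};

// Hypothetical helper: place the buffers back to back, each starting at an
// offset that is a multiple of `alignment`.
Layout plan_single_allocation(const std::vector<std::size_t>& sizes,
                              std::size_t alignment = sizeof(fptype))
{
    assert(alignment > 0);
    Layout layout{{}, 0};
    for (std::size_t size : sizes) {
        layout.offsets.push_back(layout.total_size);
        layout.total_size += round_up(size, alignment);
    }
    return layout;
}
```

For sizes of 13, 8 and 3 bytes with an alignment of 8, the offsets are 0, 16 and 24 and the total is 32 bytes. One point to weigh when tying the constant to `sizeof(fptype)`: with `fptype = float` the alignment drops to 4, which is only safe if no buffer in the pool needs stricter alignment.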

Qubitol · 2026-06-15T09:46:35Z

      {
-        for( std::size_t i_diag = 0; i_diag < CPPProcess::ndiagrams; ++i_diag )
+        std::size_t i_sorted = permutation[i_event];
+        std::size_t page_size = MemoryAccessMomentaBase::neppM;


Maybe let's call it simd_vector_size instead of page_size to avoid confusion, since we already have pages to handle the memory. I guess it can be made constant as well.

Qubitol · 2026-06-15T09:47:27Z

+        std::size_t i_sorted = permutation[i_event];
+        std::size_t page_size = MemoryAccessMomentaBase::neppM;
+        std::size_t i_page = i_sorted / page_size;
+        std::size_t i_vector = i_sorted % page_size;


Suggested change

std::size_t i_vector = i_sorted % page_size;

std::size_t i_vector = i_sorted % page_size; // vector lane
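The index arithmetic under discussion splits a flat event index into a vector (page) index and a lane within that vector. A minimal sketch, using the name `simd_vector_size` proposed in the review and the hypothetical helpers `split_index` and `join_index`:

```cpp
#include <cassert>
#include <cstddef>

// Position of one event in a structure-of-vectors layout.
struct PageLane {
    std::size_t i_page;   // which SIMD vector the event lives in
    std::size_t i_vector; // lane inside that vector
};

// Split a flat (sorted) event index into vector index and lane.
inline PageLane split_index(std::size_t i_sorted, std::size_t simd_vector_size)
{
    assert(simd_vector_size > 0);
    return PageLane{i_sorted / simd_vector_size, i_sorted % simd_vector_size};
}

// Inverse mapping: rebuild the flat index from vector index and lane.
inline std::size_t join_index(PageLane pos, std::size_t simd_vector_size)
{
    assert(pos.i_vector < simd_vector_size);
    return pos.i_page * simd_vector_size + pos.i_vector;
}
```

For example, with four events per vector, flat index 10 maps to vector 2, lane 2, and `join_index` maps it back to 10.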

Qubitol

I just added a few comments on nomenclature, to try to improve readability for a non-expert of the code. But the logic looks good to me!

stloufra and others added 29 commits May 11, 2026 12:51

Adjustment of write_combined_cc to parton_grouping

b080be2

Fix loop on wavefunctions: they should loop over nw6

5e265d9

Define a const nw6 in ALOHAOBJ class and change the loop calls.

Fix for M handling in the combined functions

00888d2

Bug fixing for FD gauge and ALOHA obj

a2009a1

Fix of inconsistency in the index of the 'S' particles. Addition of fix for sxxxxx. Fix of the broken_symmetry function to iterate only on the outgoing particles

Easier life with check_sa

b8b4b2c

Same CM energy as fortran and standalone_cpp. Hardcoded momenta block to uncomment. Sensible flavor index.

Bug fix for FPTYPE=f

940f11b

Solution for pointer mismatch and compile error

Bugfix one(1.f) in vxxxx

2f19468

Do not use internal writer formatting for helas file

32a8ac5

Flag -f in standalone to select flavor

56a7d6d

User can specify either --flavor <int> or -f <int> while calling ./check_sa.exe. TODO: check on GPU, add guard for index out of range

Rename Makefile->makefile

a8caffd

Use default misc.compile as a compilation command

5ae6480

Guard compilation command for standalone_mg7 in the same way as misc.compile

ddad296

Rename standalone folders with PROCMG7 prefix

e5a1d6b

Merge remote-tracking branch 'upstream/main' into feat-madmatrix

176325a

single mallocAsync call in umami on GPU

f1431e6

fix common memory allocation

082c053

update run_card.toml

b61c32c

fix unused warning, fix madnis.enable setting

4f11d13

Multiple repeat regex fix

7f7414c

fully async gpu matrix elements

fb610e9

reorder events in simd mode such that the flavor index is the same for all events in a vector

a85f8ca

Merge branch 'main' into feat-madmatrix-theo

2f89900

apply flavor ordering to matrix element outputs

cfd6aed

fix SIMD flavor selection

d147d36

fix flavor sampling and limitation of open files

952ede5

fix async gpu matrix element

412df9d

Merge branch 'feat-madmatrix-theo' of github.com:MadGraphTeam/MadGraph7 into feat-madmatrix-theo

95b9da1

theoheimel requested a review from Qubitol May 28, 2026 12:32

stloufra and others added 3 commits June 10, 2026 10:55

Same energy as fortran in the standalone_mg7 matrix mode

a242b84

remove extra square in transverse mass scale

4faab79

Merge branch 'main' into feat-madmatrix-theo

ae82740

stloufra added 4 commits June 11, 2026 17:57

Umami addition of get meta masses

0664e6c

Additional key, error and case to obtain masses in get_meta()

Addition of RAMBO respecting mass

b462399

Same as in standalone_cpp, will need vectorization in future. Copy of it to dir.

RamboSamplingKernels classic_rambo implementation

52a6d94

Same as the massless RAMBO; for now it uses an internal RNG and is not split into initial- and final-state particle sampling

Massive RAMBO implementation in check_sa

a111d96

New flag -r [c|ml] (classic, massless). The flag applies only to perf; matrix is always massive. Massive RAMBO is host-only, so for now the result is copied to the device, and its RNG is internal. Massless RAMBO is kept for backward compatibility; for now its RNG is external (we pass a buffer of random numbers).

Qubitol reviewed Jun 15, 2026

Qubitol requested changes Jun 15, 2026

stloufra and others added 4 commits June 15, 2026 14:22

PR suggestion implementation

6f599e8

Naming changed to rambo.h and massless_rambo.h; flag (--rambo-massless) instead of switch

Merge pull request #5 from MadGraphTeam/feat-madmatrix-rambo

1dbe28a

Introduction of massive RAMBO in standalone_mg7.

Merge branch 'main' into feat-madmatrix

b539ddb

Merge branch 'main' into feat-madmatrix-theo

149d8aa

theoheimel marked this pull request as draft June 17, 2026 10:04

implement changes suggested by daniele

f9f4ad4

theoheimel marked this pull request as ready for review June 17, 2026 10:21

theoheimel merged commit bd54791 into main Jun 17, 2026
56 of 148 checks passed

theoheimel deleted the feat-madmatrix-theo branch June 17, 2026 10:21

theoheimel merged 41 commits into main from feat-madmatrix-theo

3 participants
