
Exploiting VectorTyped to avoid copies during the NL construction#1401

Open
Iximiel wants to merge 1 commit into plumed:master from Iximiel:feature/dataInNL

Conversation

Member

@Iximiel Iximiel commented Apr 22, 2026

Description

First of all, I added a set of tests that can be useful to understand how the NL splits its work in all the possible situations (serial, serial+OpenMP, MPI, MPI+OpenMP).

Then I exploited the trick PLUMED uses to transfer Vectors with MPI, to avoid creating an extra unsigned array that gets gathered and then copied into the local neighbor list.
In a serial or non-MPI run this skips a significant portion of code, since it works directly on the neighbors_ array. In an MPI run the code still uses a temporary array, but communicates directly into the main neighbors_ array, skipping the last copy.

I think this will lower the RAM footprint that the NL imposes for larger systems.

I have a quick benchmark of 60-second drag races (how many steps complete in 60 seconds) with 4 OpenMP threads:

| Atoms | LC | LC-v | Δ% LC→LC-v | NL | NL-v | Δ% NL→NL-v |
|---|---|---|---|---|---|---|
| 125 | 506813 | 557075 | +9.92% | 445501 | 456190 | +2.40% |
| 1000 | 42082 | 44953 | +6.82% | 38427 | 40167 | +4.53% |
| 8000 | 6705 | 6732 | +0.40% | 3201 | 3342 | +4.40% |
| 27000 | 1559 | 1586 | +1.73% | 501 | 517 | +3.19% |
| 42875 | 946 | 960 | +1.48% | 201 | 221 | +9.95% |

And with 2 MPI processes with 3 OpenMP threads each:

| Atoms | LC | LC-v | Δ% LC→LC-v | NL | NL-v | Δ% NL→NL-v |
|---|---|---|---|---|---|---|
| 125 | 329986 | 335234 | +1.59% | 278411 | 278566 | +0.06% |
| 1000 | 29640 | 29800 | +0.54% | 25404 | 25970 | +2.23% |
| 8000 | 3909 | 3879 | −0.77% | 1941 | 2001 | +3.09% |
| 27000 | 901 | 894 | −0.78% | 281 | 301 | +7.12% |
| 42875 | 531 | 501 | −5.65% | 121 | 121 | 0.00% |

NL runs this:

```
cpu:     COORDINATION GROUPA=@mdatoms GROUPB=@mdatoms SWITCH={RATIONAL R_0=0.5 NN=6 MM=10 D_MAX=2.0} NLIST      NL_CUTOFF=3.5 NL_STRIDE=20
```

LC runs this:

```
cpucl:   COORDINATION GROUPA=@mdatoms GROUPB=@mdatoms SWITCH={RATIONAL R_0=0.5 NN=6 MM=10 D_MAX=2.0} NLISTCELLS NL_CUTOFF=2.0 NL_STRIDE=1
```

**-v** marks this branch's performance. Considering that LC does not make the extra copy, I am surprised it improves at all, since no gain was expected there. NL shows a smaller improvement, but its algorithm runs only every 20 steps, whereas for LC it runs at each step.
(Being lazy, I had an AI compute the % columns; the step counts are correct, I double-checked them.)

With MPI it looks less attractive, but I only did a single run for each of the two tries on my workstation.

If this works for you, it will be the base of the "standard" NL accelerated by the linked-cells algorithm. It is not strictly necessary for that (a somewhat working version already exists), but I believe that rebasing that work on these modifications will enhance its performance a little more.

Target release

I would like my code to appear in release v2.11

Type of contribution
  • changes to code or doc authored by PLUMED developers, or additions of code in the core or within the default modules
  • changes to a module not authored by you
  • new module contribution or edit of a module authored by you
Copyright
  • I agree to transfer the copyright of the code I have written to the PLUMED developers or to the author of the code I am modifying.
  • the module I added or modified contains a COPYRIGHT file with the correct license information. Code should be released under an open source license. I also used the command cd src && ./header.sh mymodulename in order to make sure the headers of the module are correct.
Tests
  • I added a new regtest or modified an existing regtest to validate my changes.
  • I verified that all regtests are passed successfully on GitHub Actions.

Comment thread src/tools/NeighborList.cpp Fixed
@github-advanced-security

You are seeing this message because GitHub Code Scanning has recently been set up for this repository, or this pull request contains the workflow file for the Code Scanning tool.

What Enabling Code Scanning Means:

  • The 'Security' tab will display more code scanning analysis results (e.g., for the default branch).
  • Depending on your configuration and choice of analysis tool, future pull requests will be annotated with code scanning analysis results.
  • You will be able to see the analysis results for the pull request's branch on this overview once the scans have completed and the checks have passed.

For more information about GitHub Code Scanning, check out the documentation.

Member Author

Iximiel commented Apr 28, 2026

@GiovanniBussi I redid the benchmarks: again steps in a fixed time, now with runs of 600 seconds.

mpirun with 2 processes with 3 threads each (I have 6 physical cores):

  • Here, in NL, the communication of the couples puts the result directly in the neighbors_ array, skipping a copy, but it still needs a temporary local array.
  • In Base the extra cost comes from converting the array into pairs, I think.
  • In LC the branch changes the push_back (I think?); note that LC does not exploit OpenMP or MPI in PLMD::NeighborList, but the calculation of which atom is in which cell is distributed with MPI.
  • 3375* appears twice because the 57491 steps for LC were too far out of the trend, so I reran the benchmark for every configuration; it is interesting as a consistency check.
  • I think that the 19941 in the 8000-atom row of the NL-v column is also a lucky shot for my branch, or an unlucky run for master.
| #atoms | NL | NL-v | Δ% NL→NL-v | LC | LC-v | Δ% LC→LC-v | Base | Base-v | Δ% Base→Base-v |
|---|---|---|---|---|---|---|---|---|---|
| 125 | 2779181 | 2781688 | +0.09% | 3294460 | 3368342 | +2.24% | 3700608 | 3691941 | -0.23% |
| 1000 | 254581 | 257540 | +1.16% | 295449 | 300581 | +1.74% | 95715 | 94087 | -1.70% |
| 3375 | 62526 | 63162 | +1.02% | 76009 | 57491 | -24.36% | 9129 | 9110 | -0.21% |
| 3375* | 62203 | 62934 | +1.18% | 75581 | 76189 | +0.80% | 9128 | 9110 | -0.20% |
| 8000 | 18081 | 19941* | +10.29% | 39019 | 38511 | -1.30% | 1622 | 1617 | -0.31% |
| 15625 | 7221 | 7481 | +3.60% | 16153 | 16276 | +0.76% | 431 | 427 | -0.93% |
| 27000 | 3081 | 3161 | +2.60% | 9308 | 9465 | +1.69% | 144 | 144 | +0.00% |
| 42875 | 1401 | 1441 | +2.86% | 5970 | 5939 | -0.52% | 57 | 56 | -1.75% |

With 4 OpenMP threads:

  • NL with only OpenMP skips the extra copy by merging each thread's list of couples directly into the neighbors_ array inside the #pragma omp critical section.
  • I have no idea why there is a boost in performance in LC.
| #atoms | NL | NL-v | Δ% NL→NL-v | LC | LC-v | Δ% LC→LC-v | Base | Base-v | Δ% Base→Base-v |
|---|---|---|---|---|---|---|---|---|---|
| 125 | 4447001 | 4518810 | +1.61% | 4998651 | 5537396 | +10.78% | 7443775 | 7419281 | -0.33% |
| 1000 | 371291 | 387762 | +4.44% | 419098 | 453133 | +8.12% | 164844 | 163544 | -0.79% |
| 3375 | 97551 | 100121 | +2.63% | 124122 | 130720 | +5.32% | 14740 | 13731 | -6.85% |
| 3375* | 97081 | 99262 | +2.25% | 124830 | 130810 | +4.79% | 14745 | 14614 | -0.89% |
| 8000 | 32202 | 33517 | +4.08% | 67039 | 69639 | +3.88% | 2620 | 2613 | -0.27% |
| 15625 | 12482 | 12756 | +2.20% | 28289 | 29102 | +2.87% | 677 | 687 | +1.48% |
| 27000 | 5201 | 5361 | +3.08% | 16367 | 16847 | +2.93% | 230 | 230 | +0.00% |
| 42875 | 2381 | 2481 | +4.20% | 10450 | 10883 | +4.14% | 91 | 91 | +0.00% |

Maybe, in the NL algorithm, creating the extra arrays with some reserved capacity, to avoid too many reallocations in the push_back, might help.

@Iximiel Iximiel force-pushed the feature/dataInNL branch from b1da497 to 58570ca Compare May 4, 2026 08:09
using the even simpler std::array to speed up the NL
@Iximiel Iximiel force-pushed the feature/dataInNL branch from 58570ca to 3afba3f Compare May 4, 2026 08:20
```cpp
                 &local_nl_size[0],
                 &disp[0]);
}
// no need for an else neighbors_.resize(0);
```