Skip to content

Improve parallelism by solving most difficult deals first#216

Draft
tameware wants to merge 6 commits into
dds-bridge:developfrom
tameware:opus-improve-parallelism
Draft

Improve parallelism by solving most difficult deals first#216
tameware wants to merge 6 commits into
dds-bridge:developfrom
tameware:opus-improve-parallelism

Conversation

@tameware

Copy link
Copy Markdown
Collaborator
$ time ./benchmark.sh --branch opus-two-percent --branch opus-improve-parallelism --repeats 5 --max-deals 10000
Building dtest from 'opus-two-percent'...
Building dtest from 'opus-improve-parallelism'...
Restoring 'benchmark'...
DDS dtest benchmark
===================
branch:      branch 'opus-two-percent' (/var/folders/12/xtx6dlwd0mdcxkspvmsszsrc0000gn/T//dds-dtest-branch.Cf0Cly)
compare:     branch 'opus-improve-parallelism' (/var/folders/12/xtx6dlwd0mdcxkspvmsszsrc0000gn/T//dds-dtest-compare.gwQNLJ)
details:     off (summary only)
run order:   interleaved branch, compare
epsilon:     0.5%
hands dir:   /Users/adamw/src/dds/hands
max_deals:   10000
files:       list10000.txt list1000.txt list100.txt list10.txt list1.txt
git branch:  benchmark
repeats:     5


Summary (avg user ms)
==============================================================================
solver file          opus-improve opus-two-per cmp/branch note
------ ------------- ------------ ------------ ---------- ---------------
solve  list10000.txt         2.67         2.63      1.01x opus-two-percent faster
solve  list1000.txt          2.12         2.09      1.01x opus-two-percent faster
solve  list100.txt           2.06         2.05      1.00x equal
solve  list10.txt            4.20         4.30      0.98x opus-improve-parallelism faster
solve  list1.txt            12.00        12.00      1.00x equal
calc   list10000.txt         9.68        10.25      0.94x opus-improve-parallelism faster
calc   list1000.txt          9.63        10.64      0.91x opus-improve-parallelism faster
calc   list100.txt           8.22         8.70      0.95x opus-improve-parallelism faster
calc   list10.txt           14.44        14.38      1.00x equal
calc   list1.txt            38.20        38.00      1.01x opus-two-percent faster
------ ------------- ------------ ------------ ---------- ---------------
TOTAL  elapsed (s)         684.60       716.16      0.96x opus-improve-parallelism faster

Completed 100 runs (100 expected).

tameware and others added 5 commits June 27, 2026 18:51
The heuristic extraction refactor changed weight_alloc_trump_void1's
first branch from `lead_suit == trump` to `suit == trump`. Since that
is exhaustive with the following `else if (suit != trump)`, the three
ruffing branches (using the `24 - rank + ...` formula) became dead
code, and trump ruffs were scored with side-suit discard weights
instead. This mis-ordered ruffs, costing alpha-beta cutoffs.

The effect is small for solve but compounds heavily in calc's warm-TT
iterative deepening: calc explored ~34% more nodes than v2.9. Restoring
the original `lead_suit == trump` pitch branch makes the ruffing
branches reachable again and cuts calc time ~25% (gap to v2.9: 1.37x ->
1.02x). Ordering-only change; double-dummy results are unchanged.

Co-authored-by: Cursor <cursoragent@cursor.com>
Per Copilot.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
The heuristic/quick-tricks refactor introduced static_cast<unsigned char>
wrappers on values that v2.9 used as signed, changing search behavior:

- make_3 / make_3_ctx: winner[]/second_best[] .hand and .rank were cast
  to unsigned char, turning the -1 "no card" sentinel into 255. This broke
  winner[trump].hand == -1 style checks in QuickTricks, losing cutoffs.
- weight_alloc_trump_void2 / _void3: rel_rank[aggr[suit]][...] indexed
  through static_cast<unsigned char>(aggr[suit]), truncating the 13-bit
  aggregate holding to 8 bits and reading the wrong rel_rank row.
- QuickTricksPartnerHand{Trump,NT}: bit_map_rank index cast the signed
  rank through unsigned char.

With these reverted to v2.9's signed handling, the per-move-generation
ordering trace now matches v2.9 exactly (0 divergences on list1), closing
the residual calc gap to parity. Ordering/pruning-only change; double-dummy
results are unchanged and all library tests pass.

Co-authored-by: Cursor <cursoragent@cursor.com>
The parallel board loop handed boards out in index order via an atomic
counter, so a hard board picked near the end left one worker running long
while the others sat idle. Hand out the hardest boards first (longest-
processing-time-first) so the tail consists of cheap boards.

parallel_all_boards_n gains an optional dispatch-order permutation: workers
still pull from the same atomic counter, but the slot is mapped through the
order before becoming a board number, so only the dispatch sequence changes
and result placement is unaffected. The solve path passes no order and is
unchanged.

calc estimates per-deal difficulty with a cheap, trump-independent
structural proxy (deal_fanout, mirroring Scheduler::Fanout) and sorts board
indices by descending difficulty before dispatch.

calc list1000 -n18: ~11.0s -> ~9.6s wall (~13%), user CPU unchanged.

Co-authored-by: Cursor <cursoragent@cursor.com>
CalcDDtableN builds one board per strain for a single deal. deal_fanout is
trump-independent, so all boards share one fanout and the difficulty sort is a
pure no-op there. Gate the sort behind a difficulty_sort flag (default on for
batch CalcAllTablesN) and disable it for the single-deal path.

Co-authored-by: Cursor <cursoragent@cursor.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR aims to improve overall throughput for batch calculations by reducing “tail latency” in parallel workloads: it estimates deal difficulty cheaply, sorts boards hardest-first, and adds an optional dispatch-order mechanism to the parallel board runner.

Changes:

  • Extend parallel_all_boards_n() to optionally dispatch boards in a caller-provided order.
  • Add a cheap per-deal “fanout” estimate and use it to stable-sort batch calc boards hardest-first before parallel execution.
  • Simplify/remove several legacy static_cast<unsigned char>(...) conversions in solver/heuristic code paths.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
library/src/system/parallel_boards.hpp Adds optional order parameter to control dispatch order (hardest-first, etc.).
library/src/system/parallel_boards.cpp Implements ordered dispatch via slot→board mapping.
library/src/calc_tables.cpp Computes per-deal difficulty estimate and dispatches hardest boards first for batch calc.
library/src/heuristic_sorting/heuristic_sorting.cpp Cleans up heuristic code (including rel-rank indexing casts) and adjusts some void/trump logic.
library/src/quick_tricks.cpp Removes redundant casts when indexing with abs_rank[..].rank.
library/src/ab_search.cpp Removes redundant casts when copying abs_rank winner/second-best into Pos.

Comment on lines +45 to +51
// Map a dispatch slot to the board number to process. With an order, hand out
// boards in that sequence (e.g. hardest first); otherwise in index order.
const bool use_order =
(order != nullptr && static_cast<int>(order->size()) == count);
auto board_of = [&](const int slot) -> int {
return use_order ? (*order)[static_cast<unsigned>(slot)] : slot;
};
Only honor the optional dispatch order when it is a valid permutation
of [0, count: each element in range and unique. A malformed order
(duplicates or out-of-range values) now falls back to index order,
preventing invalid board indices from reaching process_board.
EOF
)

Co-authored-by: Cursor <cursoragent@cursor.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Comment on lines +45 to +51
// Map a dispatch slot to the board number to process. With an order, hand out
// boards in that sequence (e.g. hardest first); otherwise in index order.
const bool use_order =
(order != nullptr && static_cast<int>(order->size()) == count);
auto board_of = [&](const int slot) -> int {
return use_order ? (*order)[static_cast<unsigned>(slot)] : slot;
};

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated no new comments.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants