gh-145742: Manually emit _LOAD_FAST_BORROW to reduce stencil bloat #148217
corona10 wants to merge 7 commits into python:main
Conversation
For i686: https://godbolt.org/z/cdjdzev5Y
Some initial feedback on this:
Thanks, @diegorusso. I'll keep working on this based on your feedback.
A couple of other things:
I think you need to split
@markshannon I would like to get feedback on the current implementation before starting to refactor as Diego suggested. :)
The generated code looks nice and efficient. You'll need to split the two parts into different uops and insert them in the fix-up pass in E.g. you might need
Are there benchmarks for this yet?
In general this is a big increase in scope and complexity for the JIT backend, so I'd definitely want to make sure it's justified.

From a maintenance perspective, can I propose an alternate design that I figured would make sense if we ever wanted to do something like this? Basically, stick a directory full of

That seems like a much more maintainable, scalable way of doing this, IMO.
With Ken Jin's x86 benchmarks, this change shows no regression. That is not too surprising, though, since it only reduces binary size.
I agree. If we want to do this in a better way, we may need an assembly generator like ZJIT, or use DynASM. Alternatively, we could go with the approach you suggested.
I don't think we should do this if it isn't faster.
Sadly, that won't work. It isn't the assembly, but the patching that needs customization. Take this example from the PR: We need to compute the offset at JIT time, and somehow communicate to Clang that
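For readers following along: the point above is that the JIT copies a precompiled stencil and then fills "holes" with values that only exist at JIT time, such as a frame-local offset derived from the oparg. A minimal sketch of that patching step, with purely illustrative names (`STENCIL`, `HOLE_OFFSET`, `emit` are not CPython's APIs):

```python
import struct

# Pretend machine code containing one 8-byte hole at offset 0.
# A real stencil would be actual instruction bytes with the hole
# located wherever the compiler left the placeholder.
STENCIL = bytearray(b"\x00" * 8)
HOLE_OFFSET = 0

def emit(oparg: int) -> bytes:
    """Copy the stencil and patch in the frame-local byte offset
    (here assumed to be oparg * 8, i.e. 8 bytes per local slot),
    which is only known at JIT time."""
    code = bytearray(STENCIL)
    struct.pack_into("<Q", code, HOLE_OFFSET, oparg * 8)
    return bytes(code)
```

The difficulty the comment describes is that Clang has to be told, ahead of time, where such a hole will be and how it is encoded, which is what the relocation-style patching below is about.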
This change will be too small to measure, but it is still an improvement. The assembly code is objectively better.
This is just a 12-bit immediate for a load instruction, right? The existing JIT does this all the time. You can encode this by prefixing the symbol with Here's what that looks like for your example. If we add
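For context on the "12-bit immediate" mentioned above: the AArch64 `LDR` (unsigned-offset) instruction carries a 12-bit immediate scaled by the access size (8 for 64-bit loads), so byte offsets 0..32760 in steps of 8 fit the field. A minimal sketch of encoding and re-patching that field, following the encoding in the Arm Architecture Reference Manual; the helper names are illustrative, not CPython APIs:

```python
LDR_X_UNSIGNED = 0xF9400000  # base encoding of LDR Xt, [Xn, #imm]

def encode_ldr(rt: int, rn: int, byte_offset: int) -> int:
    """Encode LDR Xt, [Xn, #byte_offset]. The offset must be 8-byte
    aligned and fit the scaled 12-bit immediate (0..32760)."""
    assert byte_offset % 8 == 0 and 0 <= byte_offset // 8 < (1 << 12)
    imm12 = byte_offset // 8
    return LDR_X_UNSIGNED | (imm12 << 10) | (rn << 5) | rt

def patch_offset(insn: int, byte_offset: int) -> int:
    """Rewrite the imm12 field (bits 10..21) of an already-emitted
    LDR instruction, as a relocation-style fixup would."""
    assert byte_offset % 8 == 0 and 0 <= byte_offset // 8 < (1 << 12)
    return (insn & ~(0xFFF << 10)) | ((byte_offset // 8) << 10)
```

This is the same kind of fixup the linker performs for `LO12`-style relocations, which is why the existing JIT can reuse that machinery instead of hand-emitting assembly.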
I still disagree. The JIT backend's goal has never been to make the "objectively best" assembly; it's to make fast code with the right maintenance tradeoffs. It was (and still is) carefully designed to be as easy-to-reason-about and hard-to-break as possible, while still giving us quite performant code. This PR hacks some really subtle special cases onto that for no performance benefit. And I certainly don't want to make it "easy to add similar improvements", because I'm still not convinced that this is an improvement at all. (I also have a hunch that the additional overhead from this patch will negate any potential wins in practice. But as you note, it'd be impossible to tell, since it's all in the noise anyways. 🙃)