gh-145742: Manually emit _LOAD_FAST_BORROW to reduce stencil bloat #148217
corona10 wants to merge 7 commits into python:main
Conversation
For i686: https://godbolt.org/z/cdjdzev5Y
Some initial feedback on this:
Thanks, @diegorusso. I'll keep working on this based on your feedback.
A couple of other things:
I think you need to split
@markshannon I would like to get feedback on the current implementation before starting to refactor as Diego suggested. :)
The generated code looks nice and efficient. You'll need to split the two parts into different uops and insert them in the fix-up pass in E.g. you might need
Are there benchmarks for this yet?
In general this is a big increase in scope and complexity for the JIT backend, so I'd definitely want to make sure it's justified.

From a maintenance perspective, can I propose an alternate design that I figured would make sense if we ever wanted to do something like this? Basically, stick a directory full of

That seems like a much more maintainable, scalable way of doing this, IMO.
With Ken Jin's x86 benchmarks, this change shows no regression. That is not too surprising, though, since it only reduces binary size.
I agree. If we want to do this in a better way, we may need an assembly generator like ZJIT, or use DynASM. Alternatively, we could go with the approach you suggested.
I don't think we should do this if it isn't faster.
Sadly, that won't work. It isn't the assembly, but the patching that needs customization. Take this example from the PR: We need to compute the offset at JIT time, and somehow communicate to Clang that
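For readers following along: the point above is that the JIT copies a precompiled stencil and then fills "holes" with values that only exist at JIT time, such as a frame-local offset derived from the oparg. A minimal sketch of that patching step, with purely illustrative names (`STENCIL`, `HOLE_OFFSET`, `emit` are not CPython's APIs):

```python
import struct

# Pretend machine code containing one 8-byte hole at offset 0.
# A real stencil would be actual instruction bytes with the hole
# located wherever the compiler left the placeholder.
STENCIL = bytearray(b"\x00" * 8)
HOLE_OFFSET = 0

def emit(oparg: int) -> bytes:
    """Copy the stencil and patch in the frame-local byte offset
    (here assumed to be oparg * 8, i.e. 8 bytes per local slot),
    which is only known at JIT time."""
    code = bytearray(STENCIL)
    struct.pack_into("<Q", code, HOLE_OFFSET, oparg * 8)
    return bytes(code)
```

The difficulty the comment describes is that Clang has to be told, ahead of time, where such a hole will be and how it is encoded, which is what the relocation-style patching below is about.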
This change will be too small to measure, but it is still an improvement. The assembly code is objectively better.
This is just a 12-bit immediate for a load instruction, right? The existing JIT does this all the time. You can encode this by prefixing the symbol with Here's what that looks like for your example. If we add
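For context on the "12-bit immediate" mentioned above: the AArch64 `LDR` (unsigned-offset) instruction carries a 12-bit immediate scaled by the access size (8 for 64-bit loads), so byte offsets 0..32760 in steps of 8 fit the field. A minimal sketch of encoding and re-patching that field, following the encoding in the Arm Architecture Reference Manual; the helper names are illustrative, not CPython APIs:

```python
LDR_X_UNSIGNED = 0xF9400000  # base encoding of LDR Xt, [Xn, #imm]

def encode_ldr(rt: int, rn: int, byte_offset: int) -> int:
    """Encode LDR Xt, [Xn, #byte_offset]. The offset must be 8-byte
    aligned and fit the scaled 12-bit immediate (0..32760)."""
    assert byte_offset % 8 == 0 and 0 <= byte_offset // 8 < (1 << 12)
    imm12 = byte_offset // 8
    return LDR_X_UNSIGNED | (imm12 << 10) | (rn << 5) | rt

def patch_offset(insn: int, byte_offset: int) -> int:
    """Rewrite the imm12 field (bits 10..21) of an already-emitted
    LDR instruction, as a relocation-style fixup would."""
    assert byte_offset % 8 == 0 and 0 <= byte_offset // 8 < (1 << 12)
    return (insn & ~(0xFFF << 10)) | ((byte_offset // 8) << 10)
```

This is the same kind of fixup the linker performs for `LO12`-style relocations, which is why the existing JIT can reuse that machinery instead of hand-emitting assembly.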
I still disagree. The JIT backend's goal has never been to make the "objectively best" assembly; it's to make fast code with the right maintenance tradeoffs. It was (and still is) carefully designed to be as easy-to-reason-about and hard-to-break as possible, while still giving us quite performant code. This PR hacks some really subtle special cases onto that for no performance benefit. And I certainly don't want to make it "easy to add similar improvements", because I'm still not convinced that this is an improvement at all. (I also have a hunch that the additional overhead from this patch will negate any potential wins in practice. But as you note, it'd be impossible to tell, since it's all in the noise anyways. 🙃)