Hi @Kludex — first, thanks for the perf wave you shipped today in 0.0.32: #295 (header find+translate scanning + dropping the per-callback logger.debug), #296 (bound the field-name size before validating), and #300 (rfind lookbehind for partial boundaries). I'd been profiling the streaming MultipartParser._internal_write hot path independently and several of my findings just got obsoleted by yours — which is great. This issue is to surface the 2-3 things that are still novel after 0.0.32, with numbers re-measured against 0.0.32, and to ask whether you'd welcome them as small PRs before I open anything.
Context
I rebuilt my measurements on a fresh clone at tag 0.0.32 (multipart.py, 1925 lines) rather than my stale 0.0.30 baseline, precisely because #295/#296/#300 overlap the area I was looking at. The honest outcome: a couple of my earlier "wins" are now subsumed or no longer worthwhile, and what remains is smaller but still real and orthogonal to what you just landed.
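What's already done by you (so NOT proposed here)
- logger.debug calls in the multipart callback: gone in 0.0.32 (#295, "Speed up multipart header parsing and callback dispatch"). I'm not re-proposing that.
- Inlining the old per-header-byte advance_header_size() call: no longer relevant. #295's data.find() span-scan means it's now called ~3×/part instead of per byte, so the lever I was pulling no longer exists. Dropped.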
Proposal — what's still novel on top of 0.0.32
1. Reorder the per-byte state dispatch so hot states are tested first (the headline). The if/elif ladder in _internal_write is still source-order: START, START_BOUNDARY, HEADER_FIELD_START, … with PART_DATA tested 8th (line 1344) on every part-data byte, after all the cold header/boundary states. Reordering so PART_DATA, HEADER_VALUE, HEADER_FIELD come first is a pure permutation of the arms: same 1925-line file, arm bodies byte-identical (I verified the arm-body multiset is unchanged after extracting and reassembling all 12 MultipartState arms hot-first). Interestingly, #295 strengthens this: now that header bytes jump via find, the ladder is hit predominantly for PART_DATA bytes, so putting PART_DATA first matters more, not less. None of #295/#296/#300 touched the dispatch order.
2. Inline the set_mark / delete_mark closures. Still defined as nested closures (lines 1098/1102) and still called directly in the state arms (1225/1243/1291/1338). Replacing them with direct self.marks[name] = i / self.marks.pop(name, None) is independent of the find-based header rewrite.
3. Drop the residual func = cast("Callable[..., Any]", func) in callback() (still at line 654). typing.cast is a runtime no-op that costs a call on every dispatch; removing it changes nothing observable.
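To make the three proposals concrete, here is a minimal sketch on a toy state machine. It is not the python-multipart source: `State`, `step_source_order`, `step_hot_first`, `Marks`, and the two `dispatch_*` helpers are hypothetical stand-ins, and only four states are modelled. The only real API used is `typing.cast`, which returns its second argument unchanged at runtime.

```python
from enum import IntEnum
from typing import Any, Callable, Dict, cast


class State(IntEnum):
    # Toy stand-in for MultipartState (hypothetical, four states only).
    START = 0
    HEADER_FIELD = 1
    HEADER_VALUE = 2
    PART_DATA = 3


def step_source_order(state: State, byte: int) -> int:
    # Cold states first: a PART_DATA byte pays for three failed comparisons.
    if state == State.START:
        return byte + 1
    elif state == State.HEADER_FIELD:
        return byte + 2
    elif state == State.HEADER_VALUE:
        return byte + 3
    elif state == State.PART_DATA:
        return byte + 4
    raise ValueError(state)


def step_hot_first(state: State, byte: int) -> int:
    # Same arms with identical bodies, permuted so the hottest state is tested first.
    if state == State.PART_DATA:
        return byte + 4
    elif state == State.HEADER_VALUE:
        return byte + 3
    elif state == State.HEADER_FIELD:
        return byte + 2
    elif state == State.START:
        return byte + 1
    raise ValueError(state)


class Marks:
    def __init__(self) -> None:
        self.marks: Dict[str, int] = {}

    def with_closures(self, i: int) -> None:
        # Nested closures: each use costs an extra Python-level function call.
        def set_mark(name: str) -> None:
            self.marks[name] = i

        def delete_mark(name: str) -> None:
            self.marks.pop(name, None)

        set_mark("part_data")
        set_mark("header_field")
        delete_mark("header_field")

    def inlined(self, i: int) -> None:
        # The same dict operations written directly at the call sites.
        self.marks["part_data"] = i
        self.marks["header_field"] = i
        self.marks.pop("header_field", None)


def dispatch_with_cast(func: Any, *args: Any) -> Any:
    # cast() hands back func unchanged, but it is still a real call at runtime.
    func = cast("Callable[..., Any]", func)
    return func(*args)


def dispatch_plain(func: Any, *args: Any) -> Any:
    return func(*args)
```

Each pair is meant to be observably equivalent. The hot-first ladder only changes how many comparisons run before the matching arm, the inlined marks skip a function call, and dropping `cast` removes one call per dispatch.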
Evidence (re-measured vs 0.0.32, not 0.0.30)
Drift-immune, single-process interleaved A/B (load pristine → measure, load variant → measure; each measure = N parses of a multi-part corpus), median-of-medians over 11-15 runs × 3-5 invocations. CPython 3.11.6, single Apple-silicon machine — absolute % will vary by interpreter/CPU (a Linux / 3.13 CodSpeed run will land differently), but the A/B is relative/interleaved so the sign and rough magnitude are robust (per-invocation stdev <3 ms on ~190 ms, invocations agree within ~1.5 pt for the reorder).
| Change | vs 0.0.32 |
| --- | --- |
| State-ladder reorder alone (the standalone PR candidate) | ~+10% (×1.11), stable 9.7–10.2% across invocations |
| All three combined (reorder + mark inline + cast drop) | ~+14% (×1.15–1.19), machine-dependent |
For honesty: the reorder was +25% vs 0.0.30 but shrank to ~+10% vs 0.0.32 — it did not shrink to noise, and it's still by far the biggest single lever among what's left. (The combined figure is more machine-sensitive than the reorder, so I'm quoting it as a range; the reorder-only number is the one I'd stand behind tightly.)
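The interleaved A/B procedure described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the harness actually used: `time_once` and `interleaved_ab` are hypothetical names, `baseline` and `variant` stand for "parse the corpus once with pristine / patched code", and the outer `rounds` loop runs in one process as a simplification of separate invocations.

```python
import statistics
import time
from typing import Callable, List, Tuple


def time_once(fn: Callable[[], object], n_parses: int) -> float:
    """Wall-clock seconds for n_parses back-to-back calls of fn."""
    start = time.perf_counter()
    for _ in range(n_parses):
        fn()
    return time.perf_counter() - start


def interleaved_ab(
    baseline: Callable[[], object],
    variant: Callable[[], object],
    runs: int = 11,
    rounds: int = 3,
    n_parses: int = 200,
) -> Tuple[float, float]:
    """Return (baseline_seconds, variant_seconds) as medians of per-round medians.

    Baseline and variant are measured alternately within each run, so slow
    drift (thermal throttling, background load) affects both sides about equally.
    """
    base_medians: List[float] = []
    var_medians: List[float] = []
    for _ in range(rounds):
        base_times: List[float] = []
        var_times: List[float] = []
        for _ in range(runs):
            base_times.append(time_once(baseline, n_parses))
            var_times.append(time_once(variant, n_parses))
        base_medians.append(statistics.median(base_times))
        var_medians.append(statistics.median(var_times))
    return statistics.median(base_medians), statistics.median(var_medians)
```

Because the two sides are compared only as a ratio of numbers collected under the same conditions, the result is less sensitive to machine drift than two separate back-to-back benchmark runs would be.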
Correctness: differential equivalence vs pristine 0.0.32 across whole / fixed-chunk feeding (chunk sizes 1, 2, 3, 7, 13, 64; chunk=1 is effectively byte-by-byte), including boundary-straddling input and part data laden with \r\n--, over a 95-entry oracle (multi-file, binary, base64 + quoted-printable decoders, odd dispositions, garbage/truncated bodies). Comparing the canonical per-part result and the raised (type, msg): 95/95 iso-functional, 0 divergences (90 parse-OK, 5 parse-or-raise). Repo suite (test_multipart.py + test_file.py) green for the reorder-only and the combined variant. I'm happy to re-run your full chunk-split sweep (whole / byte-by-byte / fixed / random + boundary-edge) and the Starlette test_formparsers.py downstream check before any PR, exactly as #295 did.
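A differential check of this shape can be sketched as below. Everything here is a toy under stated assumptions: `parse_whole` and `parse_streaming` are hypothetical stand-ins for the pristine and patched parsers (they only split on a fixed delimiter), and `divergences` is an illustrative helper, not the 95-entry oracle itself.

```python
from typing import Callable, List, Sequence, Tuple

DELIM = b"\r\n--"  # toy delimiter; real multipart boundaries are longer
Parser = Callable[[Sequence[bytes]], List[bytes]]


def chunked(data: bytes, size: int) -> List[bytes]:
    """Split data into fixed-size chunks (size=1 is byte-by-byte)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def outcome(parse: Parser, chunks: Sequence[bytes]) -> Tuple[str, object]:
    """Canonical result: the parsed parts, or the raised (type, msg)."""
    try:
        return ("ok", parse(chunks))
    except Exception as exc:
        return ("err", (type(exc).__name__, str(exc)))


def divergences(
    reference: Parser,
    candidate: Parser,
    corpus: Sequence[bytes],
    chunk_sizes: Sequence[int] = (1, 2, 3, 7, 13, 64),
) -> List[Tuple[bytes, int]]:
    """Return (body, chunk_count) for every feed where the two parsers disagree."""
    bad = []
    for body in corpus:
        feeds = [[body]] + [chunked(body, n) for n in chunk_sizes]
        for chunks in feeds:
            if outcome(reference, chunks) != outcome(candidate, chunks):
                bad.append((body, len(chunks)))
    return bad


def parse_whole(chunks: Sequence[bytes]) -> List[bytes]:
    # Toy reference: join everything, then split once.
    data = b"".join(chunks)
    if not data:
        raise ValueError("empty body")
    return data.split(DELIM)


def parse_streaming(chunks: Sequence[bytes]) -> List[bytes]:
    # Toy candidate: split incrementally, carrying the unfinished tail so a
    # delimiter that straddles two chunks is still detected.
    parts: List[bytes] = []
    buf = b""
    seen = False
    for chunk in chunks:
        seen = seen or bool(chunk)
        buf += chunk
        *done, buf = buf.split(DELIM)
        parts.extend(done)
    if not seen:
        raise ValueError("empty body")
    return parts + [buf]
```

Comparing the raised `(type, msg)` as well as the successful result means error paths are held to the same equivalence standard as valid input.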
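Per-change breakdown / overlap audit vs your merged PRs
| Change | Status after 0.0.32 |
| --- | --- |
| logger.debug in callback | Already removed by #295; not proposed |
| Per-byte advance_header_size inlining | Subsumed by #295's find; lever gone |
| State-ladder reorder | Untouched by #295/#296/#300; still novel |
| set_mark / delete_mark inlining | Independent of the find-based rewrite; still novel |
| cast(...) in callback | Still present (line 654); still novel |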
Ask
- Do you want any of these? If only one, I'd lead with the state-ladder reorder (~+10%, byte-identical arm bodies, clean to review as a pure permutation).
- One PR or split? I'd lean toward the reorder as a standalone PR, with the mark-inline + cast drop folded into a second small one, but happy to bundle or sequence however you prefer.
- Benchmark methodology: I see CodSpeed is wired into CI (tests/test_benchmarks.py --codspeed). I'll report against your named workloads (simple_form, large_form, file_upload, worstcase_boundary_chars, querystring_large_form) on Py 3.13/3.14 and/or defnull/multipart_bench MB/s, whichever number you want to see in the PR, since my own A/B is just my local proxy and absolute % is CPU-dependent.
- Interaction with #260 (Skip multipart preamble bytes before the first boundary): it adds a PREAMBLE state to this same if/elif ladder. If the reorder lands, that new arm should sit near the bottom (it's cold); happy to rebase around #260 whichever order you merge them.
Anything I'd open would be behaviour-preserving for valid input, target main, keep 100% coverage, pass scripts/check and scripts/test, add a CHANGELOG Unreleased bullet, and carry an AI-assistance disclaimer (this work was done with Claude's help; I review and verify everything). Thanks!