Perf: a few small streaming-parser micro-opts that stack on top of 0.0.32 (#295/#296/#300) — would a PR be welcome?

Hi @Kludex, and first, thanks for the perf wave you shipped today in 0.0.32: #295 (header `find`+`translate` scanning + dropping the per-callback `logger.debug`), #296 (bound the field-name size before validating), and #300 (`rfind` lookbehind for partial boundaries). I'd been profiling the streaming `MultipartParser._internal_write` hot path independently, and several of my findings were made obsolete by yours, which is great. This issue surfaces the **three things that are still novel after 0.0.32**, with numbers **re-measured against 0.0.32**, and asks whether you'd welcome them as small PRs before I open anything.

## Context

I rebuilt my measurements on a fresh clone at tag **0.0.32** (`multipart.py`, 1925 lines) rather than my stale 0.0.30 baseline, precisely because #295/#296/#300 overlap the area I was looking at. The honest outcome: a couple of my earlier "wins" are now subsumed or no longer worthwhile, and what remains is smaller but still real and **orthogonal** to what you just landed.

## What's already done by you (so NOT proposed here)

- The two `logger.debug` calls in the multipart callback — **gone in 0.0.32 (#295)**. I'm not re-proposing that.
- Inlining the old per-header-**byte** `advance_header_size()` call — **no longer relevant**: #295's `data.find()` span-scan means it's now called ~3×/part instead of per byte, so the lever I was pulling no longer exists. Dropped.

## Proposal — what's still novel on top of 0.0.32

**1. Reorder the per-byte state dispatch so hot states are tested first (the headline).** The `if/elif` ladder in `_internal_write` is still source-order: `START`, `START_BOUNDARY`, `HEADER_FIELD_START`, … with `PART_DATA` tested **8th** (line 1344) on *every* part-data byte, after all the cold header/boundary states. Reordering so `PART_DATA`, `HEADER_VALUE`, `HEADER_FIELD` come first is a **pure permutation** of the arms: same 1925-line file, arm bodies byte-identical (I verified the arm-body multiset is unchanged after extracting and reassembling all 12 `MultipartState` arms hot-first). Interestingly #295 *strengthens* this: now that header bytes jump via `find`, the ladder is hit predominantly for `PART_DATA` bytes, so putting `PART_DATA` first matters more, not less. None of #295/#296/#300 touched the dispatch *order*.
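To make the cost model concrete, here is a toy sketch. It assumes each `if`/`elif` test costs one comparison, and it uses a hypothetical five-member `State` enum and an invented trace; it is not the real `MultipartState` ladder, and it counts comparisons only, not wall-clock time.

```python
from enum import IntEnum


class State(IntEnum):
    """Hypothetical stand-in for the parser's state enum (not the real one)."""
    START = 0
    START_BOUNDARY = 1
    HEADER_FIELD = 2
    HEADER_VALUE = 3
    PART_DATA = 4


# Order in which an if/elif ladder would test the states.
SOURCE_ORDER = [State.START, State.START_BOUNDARY, State.HEADER_FIELD,
                State.HEADER_VALUE, State.PART_DATA]
HOT_FIRST = [State.PART_DATA, State.HEADER_VALUE, State.HEADER_FIELD,
             State.START, State.START_BOUNDARY]


def comparisons_until_match(ladder: list, state: State) -> int:
    """Number of `state == X` tests a ladder in this order performs."""
    return ladder.index(state) + 1


def total_comparisons(ladder: list, trace: list) -> int:
    """Sum the ladder tests over every step of a state trace."""
    return sum(comparisons_until_match(ladder, s) for s in trace)


# Invented trace: one pass through the cold states, then 1000 part-data bytes.
trace = [State.START, State.START_BOUNDARY, State.HEADER_FIELD,
         State.HEADER_VALUE] + [State.PART_DATA] * 1000

# The reorder is a pure permutation: same arms, different test order.
assert sorted(SOURCE_ORDER) == sorted(HOT_FIRST)

print(total_comparisons(SOURCE_ORDER, trace))  # 5010
print(total_comparisons(HOT_FIRST, trace))     # 1014
```

In this model the hot-first order does about one comparison per part-data byte instead of five; the measured gain on the real parser is smaller because the arm bodies dominate the per-byte cost.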

**2. Inline the `set_mark` / `delete_mark` closures.** Still defined as nested closures (lines 1098/1102) and still called directly in the state arms (1225/1243/1291/1338). Replacing them with direct `self.marks[name] = i` / `self.marks.pop(name, None)` is independent of the `find`-based header rewrite.
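A minimal sketch of the shape of this change, using a hypothetical `ToyParser` that marks word starts in a byte string. Only the `marks` dict and the two closure names come from the real code; the scanning logic is invented for illustration.

```python
class ToyParser:
    """Hypothetical toy, not python-multipart's parser."""

    def __init__(self) -> None:
        self.marks: dict = {}

    def write_with_closures(self, data: bytes) -> list:
        i = 0
        spans = []

        def set_mark(name: str) -> None:
            self.marks[name] = i  # reads the enclosing loop variable

        def delete_mark(name: str) -> None:
            self.marks.pop(name, None)

        for i, byte in enumerate(data):
            if byte != 0x20 and "word" not in self.marks:
                set_mark("word")
            elif byte == 0x20 and "word" in self.marks:
                spans.append((self.marks["word"], i))
                delete_mark("word")
        return spans

    def write_inlined(self, data: bytes) -> list:
        spans = []
        for i, byte in enumerate(data):
            if byte != 0x20 and "word" not in self.marks:
                self.marks["word"] = i           # was set_mark("word")
            elif byte == 0x20 and "word" in self.marks:
                spans.append((self.marks["word"], i))
                self.marks.pop("word", None)     # was delete_mark("word")
        return spans


sample = b"ab cd  e "
assert ToyParser().write_with_closures(sample) == ToyParser().write_inlined(sample)
```

The inlined form skips one Python-level function call per mark operation and no longer needs the closure cell for `i`; the dict operations themselves are unchanged.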

**3. Drop the residual `func = cast("Callable[..., Any]", func)` in `callback()`** (still at line 654). `typing.cast` is a runtime no-op that costs a call on every dispatch; removing it changes nothing observable.
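For reference, a small self-contained check that `typing.cast` hands back the same object at runtime. The two `dispatch_*` helpers and `on_part_data` are hypothetical stand-ins, not the real `callback()` body.

```python
from typing import Any, Callable, Optional, cast


def on_part_data(chunk: bytes) -> int:
    """Hypothetical callback used only for this demonstration."""
    return len(chunk)


def dispatch_with_cast(func: Optional[Callable[..., Any]], *args: Any) -> Any:
    if func is None:
        return None
    func = cast("Callable[..., Any]", func)  # runtime: returns func unchanged
    return func(*args)


def dispatch_without_cast(func: Optional[Callable[..., Any]], *args: Any) -> Any:
    if func is None:
        return None
    return func(*args)


# cast() does no conversion or checking; it returns its second argument as-is.
assert cast("Callable[..., Any]", on_part_data) is on_part_data
assert dispatch_with_cast(on_part_data, b"hello") == dispatch_without_cast(on_part_data, b"hello")
```

Whether the type checker still accepts the line without the `cast` depends on how `func` is annotated in the real code, which is something `scripts/check` would confirm.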

## Evidence (re-measured vs 0.0.32, not 0.0.30)

Single-process **interleaved A/B**, which keeps slow machine drift from biasing either side (load pristine → measure, load variant → measure; each measure = N parses of a multi-part corpus), median-of-medians over 11-15 runs × 3-5 invocations. CPython 3.11.6 on a single Apple-silicon machine, so **absolute % will vary by interpreter/CPU** (a Linux / 3.13 CodSpeed run will land differently); because the A/B is relative and interleaved, the sign and rough magnitude should be robust (per-invocation stdev <3 ms on ~190 ms; invocations agree within ~1.5 pt for the reorder).
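Roughly, the interleaved A/B scheme looks like the sketch below. It is a simplified illustration rather than the actual harness: `interleaved_ab`, `time_once`, and the two toy workloads are hypothetical names, and a real run would call the pristine and variant parsers on the corpus instead.

```python
import statistics
import time
from typing import Callable


def time_once(work: Callable[[], None], timer: Callable[[], float]) -> float:
    """Wall-clock duration of a single call to `work`."""
    start = timer()
    work()
    return timer() - start


def interleaved_ab(
    baseline: Callable[[], None],
    variant: Callable[[], None],
    runs: int = 11,
    repeats: int = 5,
    timer: Callable[[], float] = time.perf_counter,
) -> tuple:
    """Return (baseline, variant) median-of-medians timings in seconds."""
    base_medians = []
    variant_medians = []
    for _ in range(runs):
        # Alternate A and B inside every run so slow drift (thermal,
        # background load) lands on both sides instead of one.
        base_medians.append(
            statistics.median([time_once(baseline, timer) for _ in range(repeats)])
        )
        variant_medians.append(
            statistics.median([time_once(variant, timer) for _ in range(repeats)])
        )
    return statistics.median(base_medians), statistics.median(variant_medians)


# Toy workloads standing in for "N parses of the corpus".
def toy_baseline() -> None:
    sum(range(20_000))


def toy_variant() -> None:
    sum(range(10_000))


base_s, variant_s = interleaved_ab(toy_baseline, toy_variant, runs=3, repeats=3)
print(f"speed ratio: x{base_s / variant_s:.2f}")
```

The `timer` parameter exists only so the harness can be driven by a fake clock in tests; the default is `time.perf_counter`.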

| Change | vs 0.0.32 |
|---|---|
| **State-ladder reorder alone** (the standalone PR candidate) | **~+10% (×1.11)**, stable 9.7–10.2% across invocations |
| All three combined (reorder + mark inline + cast drop) | ~+14% (×1.15–1.19), machine-dependent |

For honesty: the reorder was **+25% vs 0.0.30** but shrank to **~+10% vs 0.0.32** — it did *not* shrink to noise, and it's still by far the biggest single lever among what's left. (The combined figure is more machine-sensitive than the reorder, so I'm quoting it as a range; the reorder-only number is the one I'd stand behind tightly.)

**Correctness:** differential equivalence vs pristine 0.0.32 across whole / fixed-chunk feeding (chunk sizes 1, 2, 3, 7, 13, 64 — chunk=1 is effectively byte-by-byte) including boundary-straddling and `\r\n--`-laden part data, over a 95-entry oracle (multi-file, binary, base64 + quoted-printable decoders, odd dispositions, garbage/truncated bodies), comparing the canonical per-part result **and** the raised `(type, msg)` — **95/95 iso-functional, 0 divergences** (90 parse-OK, 5 parse-or-raise). Repo suite (`test_multipart.py` + `test_file.py`) green for the reorder-only and the combined variant. I'm happy to re-run your full chunk-split sweep (whole / byte-by-byte / fixed / random + boundary-edge) and the Starlette `test_formparsers.py` downstream check before any PR, exactly as #295 did.
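The differential check has roughly the following shape. This is a sketch under stated assumptions: the two toy CRLF line splitters and the five-entry corpus are hypothetical stand-ins for pristine 0.0.32, the variant, and the 95-entry oracle; only the chunk sizes and the "compare result or `(type, msg)`" idea come from the description above.

```python
from typing import Callable, Iterable, Iterator


def chunked(data: bytes, size: int) -> Iterator[bytes]:
    """Yield consecutive fixed-size slices of `data`."""
    for start in range(0, len(data), size):
        yield data[start:start + size]


def outcome(parse: Callable[[Iterable[bytes]], object], chunks: Iterable[bytes]) -> tuple:
    """Canonical result: either the parsed value or the raised (type, msg)."""
    try:
        return ("ok", parse(chunks))
    except Exception as exc:
        return ("raise", type(exc).__name__, str(exc))


def differential(reference, variant, corpus, chunk_sizes=(1, 2, 3, 7, 13, 64)) -> list:
    """Return every (payload, feeding) where the two parsers disagree."""
    divergences = []
    for payload in corpus:
        feedings = [("whole", [payload])]
        feedings += [(f"chunk={n}", list(chunked(payload, n))) for n in chunk_sizes]
        for label, chunks in feedings:
            if outcome(reference, chunks) != outcome(variant, chunks):
                divergences.append((payload, label))
    return divergences


def parse_reference(chunks: Iterable[bytes]) -> list:
    """Toy reference: buffer everything, then split on CRLF."""
    data = b"".join(chunks)
    if b"\x00" in data:
        raise ValueError("NUL byte in input")
    return data.split(b"\r\n")


def parse_streaming(chunks: Iterable[bytes]) -> list:
    """Toy variant: split incrementally, carrying a partial line across chunks."""
    lines, buf = [], b""
    for chunk in chunks:
        if b"\x00" in chunk:
            raise ValueError("NUL byte in input")
        buf += chunk
        *complete, buf = buf.split(b"\r\n")
        lines.extend(complete)
    lines.append(buf)
    return lines


corpus = [b"", b"a\r\nb", b"--x\r\n--\r\n\r\n--x--", b"bad\x00byte", b"\r\n\r"]
assert differential(parse_reference, parse_streaming, corpus) == []
```

Chunk size 1 is what catches a delimiter split across two writes, which is why the small sizes matter more than the large ones.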

<details>
<summary>Per-change breakdown / overlap audit vs your merged PRs</summary>

| My finding | Status vs 0.0.32 | Reason |
|---|---|---|
| Drop 2 `logger.debug` in callback | **Subsumed by #295** | removed upstream; not re-proposed |
| Inline per-byte `advance_header_size` | **Obsoleted by #295** | header parsing now jumps via `find`; lever gone |
| **Reorder state ladder (PART_DATA first)** | **Novel** | dispatch order untouched by #295/#296/#300 |
| **Inline `set_mark`/`delete_mark`** | **Novel** | closures untouched |
| **Drop residual `cast(...)` in `callback`** | **Novel** | still at line 654 |

</details>

## Ask

1. Do you want any of these? If only one, I'd lead with the **state-ladder reorder** (~+10%, byte-identical arm bodies, clean to review as a pure permutation).
2. **One PR or split?** I'd lean toward the reorder as a standalone PR, with the mark-inline + `cast` drop folded into a second small one — but happy to bundle or sequence however you prefer.
3. **Benchmark methodology** — I see CodSpeed is wired into CI (`tests/test_benchmarks.py --codspeed`). I'll report against your named workloads (`simple_form`, `large_form`, `file_upload`, `worstcase_boundary_chars`, `querystring_large_form`) on Py 3.13/3.14 and/or `defnull/multipart_bench` MB/s — whatever number you want to see in the PR, since my own A/B is just my local proxy and absolute % is CPU-dependent.
4. Heads-up on sequencing: #260 (skip preamble) adds a `PREAMBLE` state to this same `if/elif` ladder. If the reorder lands, that new arm should sit near the bottom (it's cold) — happy to rebase around #260 whichever order you merge them.

Anything I'd open would be behaviour-preserving for valid input, target `main`, keep 100% coverage, pass `scripts/check`/`scripts/test`, add a CHANGELOG `Unreleased` bullet, and carry an AI-assistance disclaimer (this work was done with Claude's help; I review and verify everything). Thanks!

