perf(rust): pre-reserve buffer capacity for struct primitive fields by Geethapranay1 · Pull Request #3580 · apache/fory

Geethapranay1 · 2026-04-17T11:43:35Z

Why?

Struct primitive field writes ran a buffer capacity check on each write.
Repeated checks added overhead on primitive-heavy structs.
This PR removes those repeated checks on the fast path.

What does this PR do?

Computes the maximum byte length of all primitive fields at macro expansion time.
Calls reserve once before processing any fields to ensure memory is available.
Uses direct memory pointer writes (put_*_at) for both fixed-length and variable-length encodings.
Tracks the writer index offset locally during execution without modifying the Vec length.
Commits the final writer index exactly once after processing all contiguous primitive fields.
Aligns the Rust serialization design with the C++ struct_serializer.h.
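
The steps above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the `Writer` and `Point` types, the hand-written `MAX_PRIMITIVE_BYTES` constant (the PR computes this in the derive macro), the exact signatures of `prepare_write`, `put_*_at`, and `finish_write`, and the little-endian layout are all assumptions made for the example.

```rust
/// Hypothetical struct with three primitive fields.
pub struct Point {
    pub id: u32,
    pub x: f64,
    pub flag: u8,
}

/// Upper bound for the primitive block (4 + 8 + 1 bytes).
/// Assumption: written by hand here; a macro would derive it from the field types.
pub const MAX_PRIMITIVE_BYTES: usize = 4 + 8 + 1;

/// Hypothetical writer backed by a Vec<u8>.
pub struct Writer {
    pub buf: Vec<u8>,
}

impl Writer {
    /// Reserve `max` bytes once and return the current write offset.
    pub fn prepare_write(&mut self, max: usize) -> usize {
        self.buf.reserve(max);
        self.buf.len()
    }

    /// Copy `bytes` to `offset` without a capacity check; returns the new offset.
    /// Safety: `offset + bytes.len()` must not exceed the reserved capacity.
    unsafe fn put_bytes_at(&mut self, offset: usize, bytes: &[u8]) -> usize {
        debug_assert!(offset + bytes.len() <= self.buf.capacity());
        unsafe {
            let dst = self.buf.as_mut_ptr().add(offset);
            std::ptr::copy_nonoverlapping(bytes.as_ptr(), dst, bytes.len());
        }
        offset + bytes.len()
    }

    /// Safety: same requirement as `put_bytes_at`.
    pub unsafe fn put_u32_at(&mut self, offset: usize, v: u32) -> usize {
        unsafe { self.put_bytes_at(offset, &v.to_le_bytes()) }
    }

    /// Safety: same requirement as `put_bytes_at`.
    pub unsafe fn put_f64_at(&mut self, offset: usize, v: f64) -> usize {
        unsafe { self.put_bytes_at(offset, &v.to_le_bytes()) }
    }

    /// Safety: same requirement as `put_bytes_at`.
    pub unsafe fn put_u8_at(&mut self, offset: usize, v: u8) -> usize {
        unsafe { self.put_bytes_at(offset, &[v]) }
    }

    /// Commit the final writer index once.
    /// Safety: every byte below `offset` must be initialized and within capacity.
    pub unsafe fn finish_write(&mut self, offset: usize) {
        debug_assert!(offset <= self.buf.capacity());
        unsafe { self.buf.set_len(offset) };
    }
}

/// One reserve, three unchecked writes tracked in a local offset, one commit.
pub fn serialize_point(w: &mut Writer, p: &Point) {
    let mut off = w.prepare_write(MAX_PRIMITIVE_BYTES);
    unsafe {
        off = w.put_u32_at(off, p.id);
        off = w.put_f64_at(off, p.x);
        off = w.put_u8_at(off, p.flag);
        w.finish_write(off);
    }
}
```

The local `off` variable is what replaces the per-write length update; the Vec length is only touched in `finish_write`.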

Related issues

Fix #3569

AI Contribution Checklist

Substantial AI assistance was used in this PR: yes / no
If yes, I included a completed AI Contribution Checklist in this PR description and the required AI Usage Disclosure.
If yes, my PR description includes the required ai_review summary and screenshot evidence of the final clean AI review results from both fresh reviewers on the current PR diff or current HEAD after the latest code changes.

Does this PR introduce any user-facing change?

Does this PR introduce any public API change?
Does this PR introduce any binary protocol compatibility change?

Benchmark

chaokunyang · 2026-04-17T12:05:38Z

@Geethapranay1 Please run benchmarks/rust on your branch and on the apache/main branch, and share both benchmark plots here.

Geethapranay1 · 2026-04-17T13:03:09Z

Benchmarks:

apache/main:

struct plot:

throughput plot:

Benchmark Results

Timing Results (nanoseconds)

Datatype	Operation	fory (ns)	protobuf (ns)	Fastest
Struct	Serialize	110.9	183.1	fory
Struct	Deserialize	65.2	113.0	fory
Sample	Serialize	166.0	932.9	fory
Sample	Deserialize	308.7	1322.6	fory
MediaContent	Serialize	384.0	561.8	fory
MediaContent	Deserialize	514.0	915.9	fory
StructList	Serialize	270.9	924.6	fory
StructList	Deserialize	222.0	752.9	fory
SampleList	Serialize	653.2	6220.3	fory
SampleList	Deserialize	2128.5	6932.7	fory
MediaContentList	Serialize	1122.0	3809.1	fory
MediaContentList	Deserialize	2758.9	5396.7	fory

Throughput Results (ops/sec)

Datatype	Operation	fory TPS	protobuf TPS	Fastest
Struct	Serialize	9,019,572	5,462,988	fory
Struct	Deserialize	15,341,423	8,853,475	fory
Sample	Serialize	6,023,734	1,071,972	fory
Sample	Deserialize	3,239,601	756,086	fory
MediaContent	Serialize	2,603,963	1,779,866	fory
MediaContent	Deserialize	1,945,374	1,091,882	fory
StructList	Serialize	3,691,808	1,081,490	fory
StructList	Deserialize	4,505,316	1,328,286	fory
SampleList	Serialize	1,530,925	160,764	fory
SampleList	Deserialize	469,814	144,244	fory
MediaContentList	Serialize	891,266	262,529	fory
MediaContentList	Deserialize	362,463	185,298	fory

pr branch:

struct plot:

throughput plot:

Benchmark Results

Timing Results (nanoseconds)

Datatype	Operation	fory (ns)	protobuf (ns)	Fastest
Struct	Serialize	109.8	183.0	fory
Struct	Deserialize	60.3	112.5	fory
Sample	Serialize	170.2	924.9	fory
Sample	Deserialize	295.4	1241.1	fory
MediaContent	Serialize	342.9	554.7	fory
MediaContent	Deserialize	478.1	901.8	fory
StructList	Serialize	261.1	897.5	fory
StructList	Deserialize	220.7	713.2	fory
SampleList	Serialize	606.3	5991.8	fory
SampleList	Deserialize	2342.3	6954.9	fory
MediaContentList	Serialize	1142.2	3876.4	fory
MediaContentList	Deserialize	3066.6	5380.4	fory

Throughput Results (ops/sec)

Datatype	Operation	fory TPS	protobuf TPS	Fastest
Struct	Serialize	9,104,980	5,463,884	fory
Struct	Deserialize	16,575,501	8,890,469	fory
Sample	Serialize	5,876,476	1,081,151	fory
Sample	Deserialize	3,385,584	805,737	fory
MediaContent	Serialize	2,915,877	1,802,906	fory
MediaContent	Deserialize	2,091,744	1,108,881	fory
StructList	Serialize	3,829,950	1,114,243	fory
StructList	Deserialize	4,532,064	1,402,171	fory
SampleList	Serialize	1,649,430	166,895	fory
SampleList	Deserialize	426,931	143,784	fory
MediaContentList	Serialize	875,503	257,971	fory
MediaContentList	Deserialize	326,094	185,860	fory

And for final comparison, I also ran repeated struct rounds (N=3) on the same machine and used median values:

Serialize: 165.72 ns -> 159.20 ns (-3.93%)
Deserialize: 88.731 ns -> 74.397 ns (-16.15%)
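
For readers who want to check the arithmetic, the percentages above follow from a plain relative change between the two medians. A small sketch, assuming hypothetical helper names `median` and `percent_change` (the individual round timings were not posted, so only the reported medians are used):

```rust
/// Median of a non-empty list of samples.
pub fn median(mut xs: Vec<f64>) -> f64 {
    xs.sort_by(|a, b| a.total_cmp(b));
    let n = xs.len();
    if n % 2 == 1 {
        xs[n / 2]
    } else {
        (xs[n / 2 - 1] + xs[n / 2]) / 2.0
    }
}

/// Relative change from `before` to `after`, in percent (negative means faster).
pub fn percent_change(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}
```

With the medians reported above, `percent_change(165.72, 159.20)` and `percent_change(88.731, 74.397)` reproduce the quoted -3.93% and -16.15% to two decimal places.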

Geethapranay1 · 2026-04-17T18:27:33Z

added a few more things in the latest commit:

switched from copy_nonoverlapping to write_unaligned for the fixed primitive put_*_at methods. Removes the temporary stack reference overhead.
batched the varint byte writes into wider u16/u32 stores, same way the existing _write_var_uint32 already does it.
fixed a missing 8-byte case in put_var_uint64_at that I accidentally dropped in the earlier commit.
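
As an illustration of the first two points, here is a minimal sketch of a fixed-width store via `write_unaligned` and a varint write that batches the two-byte case into a single `u16` store. The free-function signatures, the little-endian layout, and the 7-bit continuation encoding are assumptions for the example and are not copied from the PR.

```rust
use std::ptr;

/// Fixed-width store with `write_unaligned` instead of `copy_nonoverlapping`.
/// Safety: `dst` must be valid for writes of 4 bytes.
pub unsafe fn put_u32_at(dst: *mut u8, v: u32) {
    // `to_le` keeps the byte order little-endian on any host.
    unsafe { ptr::write_unaligned(dst as *mut u32, v.to_le()) }
}

/// Varint store (7 data bits per byte, high bit set on every byte but the last).
/// Returns the number of bytes written.
/// Safety: `dst` must be valid for writes of 5 bytes.
pub unsafe fn put_var_uint32_at(dst: *mut u8, v: u32) -> usize {
    if v < 0x80 {
        // One byte, no continuation bit.
        unsafe { dst.write(v as u8) };
        1
    } else if v < 0x4000 {
        // Two bytes packed into one u16 store instead of two u8 stores.
        let low = ((v & 0x7F) | 0x80) as u16;
        let high = (v >> 7) as u16;
        let packed = low | (high << 8);
        unsafe { ptr::write_unaligned(dst as *mut u16, packed.to_le()) };
        2
    } else {
        // Longer values: plain byte-by-byte loop, kept simple for the sketch.
        let mut rest = v;
        let mut n = 0;
        while rest >= 0x80 {
            unsafe { dst.add(n).write(((rest & 0x7F) as u8) | 0x80) };
            rest >>= 7;
            n += 1;
        }
        unsafe { dst.add(n).write(rest as u8) };
        n + 1
    }
}
```

The same packing idea extends to three and four bytes with a `u32` store; that is omitted here to keep the example short.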

Reran benchmarks on both branches, same machine, back to back.

main branch:

Benchmark Results

Timing Results (nanoseconds)

Datatype	Operation	fory (ns)	protobuf (ns)	Fastest
Struct	Serialize	115.0	189.8	fory
Struct	Deserialize	62.0	116.6	fory
Sample	Serialize	174.3	962.4	fory
Sample	Deserialize	306.8	1322.7	fory
MediaContent	Serialize	351.7	582.1	fory
MediaContent	Deserialize	497.5	910.1	fory
StructList	Serialize	269.3	939.6	fory
StructList	Deserialize	228.4	736.9	fory
SampleList	Serialize	637.3	6523.7	fory
SampleList	Deserialize	2167.2	7059.6	fory
MediaContentList	Serialize	1158.0	3967.8	fory
MediaContentList	Deserialize	2731.3	5433.4	fory

Throughput Results (ops/sec)

Datatype	Operation	fory TPS	protobuf TPS	Fastest
Struct	Serialize	8,696,408	5,269,259	fory
Struct	Deserialize	16,141,789	8,573,388	fory
Sample	Serialize	5,737,564	1,039,058	fory
Sample	Deserialize	3,259,346	756,029	fory
MediaContent	Serialize	2,843,332	1,717,859	fory
MediaContent	Deserialize	2,009,929	1,098,780	fory
StructList	Serialize	3,713,331	1,064,260	fory
StructList	Deserialize	4,377,325	1,356,999	fory
SampleList	Serialize	1,569,120	153,287	fory
SampleList	Deserialize	461,425	141,651	fory
MediaContentList	Serialize	863,558	252,029	fory
MediaContentList	Deserialize	366,126	184,047	fory

this pr:

Benchmark Results

Timing Results (nanoseconds)

Datatype	Operation	fory (ns)	protobuf (ns)	Fastest
Struct	Serialize	112.7	212.1	fory
Struct	Deserialize	61.9	103.1	fory
Sample	Serialize	183.0	949.8	fory
Sample	Deserialize	301.5	1275.4	fory
MediaContent	Serialize	349.4	567.6	fory
MediaContent	Deserialize	489.0	896.8	fory
StructList	Serialize	247.0	999.5	fory
StructList	Deserialize	225.9	669.3	fory
SampleList	Serialize	641.8	6174.4	fory
SampleList	Deserialize	2169.7	6931.8	fory
MediaContentList	Serialize	1171.5	3914.1	fory
MediaContentList	Deserialize	2733.3	5389.5	fory

Throughput Results (ops/sec)

Datatype	Operation	fory TPS	protobuf TPS	Fastest
Struct	Serialize	8,875,477	4,714,090	fory
Struct	Deserialize	16,157,960	9,698,380	fory
Sample	Serialize	5,463,884	1,052,809	fory
Sample	Deserialize	3,316,970	784,068	fory
MediaContent	Serialize	2,862,213	1,761,773	fory
MediaContent	Deserialize	2,044,906	1,115,088	fory
StructList	Serialize	4,048,911	1,000,520	fory
StructList	Deserialize	4,426,737	1,494,076	fory
SampleList	Serialize	1,558,215	161,959	fory
SampleList	Deserialize	460,893	144,263	fory
MediaContentList	Serialize	853,606	255,487	fory
MediaContentList	Deserialize	365,858	185,546	fory

Struct Serialize: 115.0ns -> 112.7ns (2% faster)
StructList Serialize: 269.3ns -> 247.0ns (8.3% faster)
Serialized data sizes are the same across all types.

Geethapranay1 · 2026-04-19T13:30:06Z

@chaokunyang PTAL

chaokunyang · 2026-04-19T13:33:10Z

The performance gains are too small. Could you dig into why it doesn't bring much gain? 2% is mostly noise and doesn't justify this complexity.

Geethapranay1 · 2026-04-19T14:16:05Z

@chaokunyang I went through some of the Rust compiler internals. The existing Rust path uses Vec::extend_from_slice, which LLVM inlines pretty aggressively, so the capacity check becomes a single well-predicted branch and pre-reserving only saves ~1 predicted branch per field. With 8 fields that's maybe 3-4ns on a 115ns op.

Compared with the C++ buffer.h, the difference is clear. C++ uses a raw data_ pointer with a separate writer_index_ and grow(), but Rust's Vec wraps len/capacity/ptr into one abstraction that LLVM already optimizes well, so there's less room to gain.

So to get real gains I think we'd need to either:
1. replace the Rust Writer's Vec with a raw pointer + writer_index layout like C++ buffer.h, or
2. find a different angle for increasing the gains.

Should I go ahead with option 1 or close this?
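
For context, the per-field path described in this comment looks roughly like the sketch below. The function names are hypothetical; the only detail taken from the discussion is that each write goes through `Vec::extend_from_slice`, which performs its own capacity check and length update per call.

```rust
/// Baseline write: one capacity check and one length update per field.
pub fn write_u32(buf: &mut Vec<u8>, v: u32) {
    buf.extend_from_slice(&v.to_le_bytes());
}

/// Writing several fields repeats that check once per field.
pub fn write_fields(buf: &mut Vec<u8>, fields: &[u32]) {
    for &f in fields {
        write_u32(buf, f);
    }
}
```

Once the buffer has grown past its warm-up size, the check in each call almost never triggers a reallocation, which is why the branch predicts well and the saving from removing it is small.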

chaokunyang · 2026-04-19T15:00:29Z

No need for such a change. You can reserve capacity once and use reset to update the Vec index later.

Geethapranay1 · 2026-04-19T15:34:35Z

That's exactly what this PR does: prepare_write reserves the capacity once, and finish_write updates the Vec length at the end.

I re-ran the benches with my CPU strictly pinned. Single Struct only showed a small change, but I get a solid 7.3% speedup on StructList serialization (164ns -> 152ns).

Geethapranay1 · 2026-04-20T11:19:52Z

@chaokunyang PTAL

Geethapranay1 requested review from chaokunyang and theweipeng as code owners April 17, 2026 11:43

Geethapranay1 force-pushed the perf/rust-struct-buffer-prereserve branch from fbf26e9 to 72a6af6 Compare April 17, 2026 11:45

perf(rust): pre-reserve buffer capacity for struct primitive fields

ddc418d

Geethapranay1 force-pushed the perf/rust-struct-buffer-prereserve branch from 72a6af6 to ddc418d Compare April 17, 2026 18:02

Geethapranay1 force-pushed the perf/rust-struct-buffer-prereserve branch from ddc418d to 30364e0 Compare April 18, 2026 07:17

chaokunyang reviewed Apr 18, 2026

Comment thread rust/fory-core/src/buffer.rs Outdated

refactor(rust): move put_at methods from buffer.rs to unsafe_util.rs

b175a09

Geethapranay1 force-pushed the perf/rust-struct-buffer-prereserve branch from 30364e0 to b175a09 Compare April 18, 2026 10:43

Geethapranay1 wants to merge 2 commits into apache:main from Geethapranay1:perf/rust-struct-buffer-prereserve
