perf(rust): pre-reserve buffer capacity for struct primitive fields #3580

Open
Geethapranay1 wants to merge 2 commits into apache:main from Geethapranay1:perf/rust-struct-buffer-prereserve
Conversation

@Geethapranay1
Contributor

@Geethapranay1 Geethapranay1 commented Apr 17, 2026

Why?

  • Each struct primitive field write ran its own buffer capacity check.
  • These repeated checks added overhead on primitive-heavy structs.
  • This PR removes the repeated checks on the fast path.

What does this PR do?

  • Computes the maximum byte length of all primitive fields at macro expansion time.
  • Calls reserve once before processing any fields so the memory is available up front.
  • Uses direct pointer writes (put_*_at) for both fixed-length and variable-length encodings.
  • Tracks the writer index in a local offset during the writes, without updating the Vec length.
  • Commits the final writer index exactly once, after all contiguous primitive fields have been written.
  • Aligns the Rust serialization design with the C++ struct_serializer.h.
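The steps in this list can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: the `Writer` type, the `Point` struct, the `POINT_MAX_BYTES` constant, and the exact signatures of `prepare_write`, `put_*_at`, and `finish_write` are assumptions made for the example, and the real fory-core definitions may differ.

```rust
use std::mem::size_of;

/// Hypothetical struct with three primitive fields.
pub struct Point {
    pub id: i32,
    pub score: f64,
    pub flag: bool,
}

/// Upper bound on the encoded size of the primitive fields.
/// A derive macro could emit a constant like this at expansion time (assumption).
pub const POINT_MAX_BYTES: usize = size_of::<i32>() + size_of::<f64>() + size_of::<u8>();

/// Hypothetical writer backed by a Vec<u8>.
pub struct Writer {
    buf: Vec<u8>,
}

impl Writer {
    pub fn new() -> Self {
        Writer { buf: Vec::new() }
    }

    pub fn bytes(&self) -> &[u8] {
        &self.buf
    }

    /// Reserve `max_bytes` once and return the offset where writing starts.
    pub fn prepare_write(&mut self, max_bytes: usize) -> usize {
        self.buf.reserve(max_bytes);
        self.buf.len()
    }

    /// # Safety
    /// `offset + 4` must lie within the capacity reserved by `prepare_write`.
    pub unsafe fn put_i32_at(&mut self, offset: usize, v: i32) -> usize {
        debug_assert!(offset + 4 <= self.buf.capacity());
        unsafe { (self.buf.as_mut_ptr().add(offset) as *mut i32).write_unaligned(v.to_le()) };
        offset + 4
    }

    /// # Safety
    /// `offset + 8` must lie within the reserved capacity.
    pub unsafe fn put_f64_at(&mut self, offset: usize, v: f64) -> usize {
        debug_assert!(offset + 8 <= self.buf.capacity());
        let bits = v.to_bits().to_le();
        unsafe { (self.buf.as_mut_ptr().add(offset) as *mut u64).write_unaligned(bits) };
        offset + 8
    }

    /// # Safety
    /// `offset + 1` must lie within the reserved capacity.
    pub unsafe fn put_u8_at(&mut self, offset: usize, v: u8) -> usize {
        debug_assert!(offset < self.buf.capacity());
        unsafe { self.buf.as_mut_ptr().add(offset).write(v) };
        offset + 1
    }

    /// Commit the final writer index exactly once.
    ///
    /// # Safety
    /// Every byte between the old length and `end` must have been written.
    pub unsafe fn finish_write(&mut self, end: usize) {
        debug_assert!(end <= self.buf.capacity());
        unsafe { self.buf.set_len(end) };
    }
}

/// One reserve, three unchecked writes at a local offset, one length update.
pub fn write_point(w: &mut Writer, p: &Point) {
    let mut offset = w.prepare_write(POINT_MAX_BYTES);
    unsafe {
        offset = w.put_i32_at(offset, p.id);
        offset = w.put_f64_at(offset, p.score);
        offset = w.put_u8_at(offset, p.flag as u8);
        w.finish_write(offset);
    }
}
```

The Vec length is left untouched until `finish_write`, so a panic or early return between the writes would leave the buffer at its previous length rather than exposing partially written bytes.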

Related issues

Fixes #3569

AI Contribution Checklist

  • Substantial AI assistance was used in this PR: yes / no
  • If yes, I included a completed AI Contribution Checklist in this PR description and the required AI Usage Disclosure.
  • If yes, my PR description includes the required ai_review summary and screenshot evidence of the final clean AI review results from both fresh reviewers on the current PR diff or current HEAD after the latest code changes.

Does this PR introduce any user-facing change?

  • Does this PR introduce any public API change?
  • Does this PR introduce any binary protocol compatibility change?

Benchmark

@Geethapranay1 Geethapranay1 force-pushed the perf/rust-struct-buffer-prereserve branch from fbf26e9 to 72a6af6 Compare April 17, 2026 11:45
@chaokunyang
Collaborator

@Geethapranay1 Please run benchmarks/rust on your branch and on the apache/main branch, and share both benchmark plots here.

@Geethapranay1
Contributor Author

Benchmarks:

apache/main:

[Figure: struct plot]
[Figure: throughput plot]

Benchmark Results

Timing Results (nanoseconds)

| Datatype | Operation | fory (ns) | protobuf (ns) | Fastest |
|---|---|---|---|---|
| Struct | Serialize | 110.9 | 183.1 | fory |
| Struct | Deserialize | 65.2 | 113.0 | fory |
| Sample | Serialize | 166.0 | 932.9 | fory |
| Sample | Deserialize | 308.7 | 1322.6 | fory |
| MediaContent | Serialize | 384.0 | 561.8 | fory |
| MediaContent | Deserialize | 514.0 | 915.9 | fory |
| StructList | Serialize | 270.9 | 924.6 | fory |
| StructList | Deserialize | 222.0 | 752.9 | fory |
| SampleList | Serialize | 653.2 | 6220.3 | fory |
| SampleList | Deserialize | 2128.5 | 6932.7 | fory |
| MediaContentList | Serialize | 1122.0 | 3809.1 | fory |
| MediaContentList | Deserialize | 2758.9 | 5396.7 | fory |

Throughput Results (ops/sec)

| Datatype | Operation | fory TPS | protobuf TPS | Fastest |
|---|---|---|---|---|
| Struct | Serialize | 9,019,572 | 5,462,988 | fory |
| Struct | Deserialize | 15,341,423 | 8,853,475 | fory |
| Sample | Serialize | 6,023,734 | 1,071,972 | fory |
| Sample | Deserialize | 3,239,601 | 756,086 | fory |
| MediaContent | Serialize | 2,603,963 | 1,779,866 | fory |
| MediaContent | Deserialize | 1,945,374 | 1,091,882 | fory |
| StructList | Serialize | 3,691,808 | 1,081,490 | fory |
| StructList | Deserialize | 4,505,316 | 1,328,286 | fory |
| SampleList | Serialize | 1,530,925 | 160,764 | fory |
| SampleList | Deserialize | 469,814 | 144,244 | fory |
| MediaContentList | Serialize | 891,266 | 262,529 | fory |
| MediaContentList | Deserialize | 362,463 | 185,298 | fory |

pr branch:

[Figure: struct plot]
[Figure: throughput plot]

Benchmark Results

Timing Results (nanoseconds)

| Datatype | Operation | fory (ns) | protobuf (ns) | Fastest |
|---|---|---|---|---|
| Struct | Serialize | 109.8 | 183.0 | fory |
| Struct | Deserialize | 60.3 | 112.5 | fory |
| Sample | Serialize | 170.2 | 924.9 | fory |
| Sample | Deserialize | 295.4 | 1241.1 | fory |
| MediaContent | Serialize | 342.9 | 554.7 | fory |
| MediaContent | Deserialize | 478.1 | 901.8 | fory |
| StructList | Serialize | 261.1 | 897.5 | fory |
| StructList | Deserialize | 220.7 | 713.2 | fory |
| SampleList | Serialize | 606.3 | 5991.8 | fory |
| SampleList | Deserialize | 2342.3 | 6954.9 | fory |
| MediaContentList | Serialize | 1142.2 | 3876.4 | fory |
| MediaContentList | Deserialize | 3066.6 | 5380.4 | fory |

Throughput Results (ops/sec)

| Datatype | Operation | fory TPS | protobuf TPS | Fastest |
|---|---|---|---|---|
| Struct | Serialize | 9,104,980 | 5,463,884 | fory |
| Struct | Deserialize | 16,575,501 | 8,890,469 | fory |
| Sample | Serialize | 5,876,476 | 1,081,151 | fory |
| Sample | Deserialize | 3,385,584 | 805,737 | fory |
| MediaContent | Serialize | 2,915,877 | 1,802,906 | fory |
| MediaContent | Deserialize | 2,091,744 | 1,108,881 | fory |
| StructList | Serialize | 3,829,950 | 1,114,243 | fory |
| StructList | Deserialize | 4,532,064 | 1,402,171 | fory |
| SampleList | Serialize | 1,649,430 | 166,895 | fory |
| SampleList | Deserialize | 426,931 | 143,784 | fory |
| MediaContentList | Serialize | 875,503 | 257,971 | fory |
| MediaContentList | Deserialize | 326,094 | 185,860 | fory |

For a final comparison, I also ran repeated struct rounds (N=3) on the same machine and took the median values:

  • Serialize: 165.72 ns -> 159.20 ns (-3.93%)
  • Deserialize: 88.731 ns -> 74.397 ns (-16.15%)
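The percentages above follow from the usual relative-change formula applied to the two medians. A small sketch of that arithmetic is below; the helper names `median` and `percent_change` are made up for this illustration and are not part of the benchmark harness.

```rust
/// Median of a non-empty set of timing samples (sorts the slice in place).
/// Panics on an empty slice; `total_cmp` gives a total order even for NaN.
pub fn median(samples: &mut [f64]) -> f64 {
    samples.sort_by(|a, b| a.total_cmp(b));
    let n = samples.len();
    if n % 2 == 1 {
        samples[n / 2]
    } else {
        (samples[n / 2 - 1] + samples[n / 2]) / 2.0
    }
}

/// Relative change from `before` to `after`, in percent.
/// Negative values mean `after` is faster (lower time).
pub fn percent_change(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}
```

With the medians reported above, `percent_change(165.72, 159.20)` comes out at about -3.93 and `percent_change(88.731, 74.397)` at about -16.15, matching the figures in the list.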

@Geethapranay1 Geethapranay1 force-pushed the perf/rust-struct-buffer-prereserve branch from 72a6af6 to ddc418d Compare April 17, 2026 18:02
@Geethapranay1
Contributor Author

Geethapranay1 commented Apr 17, 2026

I added a few more things in the latest commit:

  • Switched from copy_nonoverlapping to write_unaligned for the fixed-width primitive put_*_at methods. This removes the overhead of the temporary stack reference.
  • Batched the varint byte writes into wider u16/u32 stores, the same way the existing _write_var_uint32 already does.
  • Fixed a missing 8-byte case in put_var_uint64_at that I accidentally dropped in the earlier commit.
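To make the batched varint idea concrete, here is a hedged sketch. It assumes a LEB128-style encoding (7 payload bits per byte, high bit as the continuation flag) and a free-function signature taking a raw pointer; the real put_var_uint32_at in the PR may be shaped differently. A byte-at-a-time reference encoder, `var_uint32_bytewise` (a name invented here), is included for comparison.

```rust
/// Byte-at-a-time reference encoder (hypothetical helper for comparison).
pub fn var_uint32_bytewise(mut v: u32) -> Vec<u8> {
    let mut out = Vec::with_capacity(5);
    while v >= 0x80 {
        out.push((v as u8 & 0x7F) | 0x80);
        v >>= 7;
    }
    out.push(v as u8);
    out
}

/// Batched encoder: packs the bytes into one integer and issues a single
/// unaligned store where possible. Returns the offset after the varint.
///
/// # Safety
/// `ptr.add(offset)` must be valid for writes of at least 5 bytes, because
/// the 3-byte case stores a full u32 (the extra byte is scratch space).
pub unsafe fn put_var_uint32_at(ptr: *mut u8, offset: usize, v: u32) -> usize {
    let p = unsafe { ptr.add(offset) };
    // Low 7 bits of each group with the continuation bit set.
    let b0 = (v & 0x7F) | 0x80;
    let b1 = ((v >> 7) & 0x7F) | 0x80;
    let b2 = ((v >> 14) & 0x7F) | 0x80;
    let b3 = ((v >> 21) & 0x7F) | 0x80;
    if v < (1 << 7) {
        unsafe { p.write(v as u8) };
        offset + 1
    } else if v < (1 << 14) {
        let packed = (b0 as u16) | (((v >> 7) as u16) << 8);
        unsafe { (p as *mut u16).write_unaligned(packed.to_le()) };
        offset + 2
    } else if v < (1 << 21) {
        let packed = b0 | (b1 << 8) | ((v >> 14) << 16);
        unsafe { (p as *mut u32).write_unaligned(packed.to_le()) };
        offset + 3
    } else if v < (1 << 28) {
        let packed = b0 | (b1 << 8) | (b2 << 16) | ((v >> 21) << 24);
        unsafe { (p as *mut u32).write_unaligned(packed.to_le()) };
        offset + 4
    } else {
        let packed = b0 | (b1 << 8) | (b2 << 16) | (b3 << 24);
        unsafe {
            (p as *mut u32).write_unaligned(packed.to_le());
            p.add(4).write((v >> 28) as u8);
        }
        offset + 5
    }
}
```

The `to_le()` call keeps the byte order fixed regardless of host endianness, so the wider store produces the same bytes the per-byte loop would.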

Reran benchmarks on both branches, same machine, back to back.

main branch:

[Figure: struct plot]

Benchmark Results

Timing Results (nanoseconds)

| Datatype | Operation | fory (ns) | protobuf (ns) | Fastest |
|---|---|---|---|---|
| Struct | Serialize | 115.0 | 189.8 | fory |
| Struct | Deserialize | 62.0 | 116.6 | fory |
| Sample | Serialize | 174.3 | 962.4 | fory |
| Sample | Deserialize | 306.8 | 1322.7 | fory |
| MediaContent | Serialize | 351.7 | 582.1 | fory |
| MediaContent | Deserialize | 497.5 | 910.1 | fory |
| StructList | Serialize | 269.3 | 939.6 | fory |
| StructList | Deserialize | 228.4 | 736.9 | fory |
| SampleList | Serialize | 637.3 | 6523.7 | fory |
| SampleList | Deserialize | 2167.2 | 7059.6 | fory |
| MediaContentList | Serialize | 1158.0 | 3967.8 | fory |
| MediaContentList | Deserialize | 2731.3 | 5433.4 | fory |

Throughput Results (ops/sec)

| Datatype | Operation | fory TPS | protobuf TPS | Fastest |
|---|---|---|---|---|
| Struct | Serialize | 8,696,408 | 5,269,259 | fory |
| Struct | Deserialize | 16,141,789 | 8,573,388 | fory |
| Sample | Serialize | 5,737,564 | 1,039,058 | fory |
| Sample | Deserialize | 3,259,346 | 756,029 | fory |
| MediaContent | Serialize | 2,843,332 | 1,717,859 | fory |
| MediaContent | Deserialize | 2,009,929 | 1,098,780 | fory |
| StructList | Serialize | 3,713,331 | 1,064,260 | fory |
| StructList | Deserialize | 4,377,325 | 1,356,999 | fory |
| SampleList | Serialize | 1,569,120 | 153,287 | fory |
| SampleList | Deserialize | 461,425 | 141,651 | fory |
| MediaContentList | Serialize | 863,558 | 252,029 | fory |
| MediaContentList | Deserialize | 366,126 | 184,047 | fory |

this PR:

[Figure: struct plot]

Benchmark Results

Timing Results (nanoseconds)

| Datatype | Operation | fory (ns) | protobuf (ns) | Fastest |
|---|---|---|---|---|
| Struct | Serialize | 112.7 | 212.1 | fory |
| Struct | Deserialize | 61.9 | 103.1 | fory |
| Sample | Serialize | 183.0 | 949.8 | fory |
| Sample | Deserialize | 301.5 | 1275.4 | fory |
| MediaContent | Serialize | 349.4 | 567.6 | fory |
| MediaContent | Deserialize | 489.0 | 896.8 | fory |
| StructList | Serialize | 247.0 | 999.5 | fory |
| StructList | Deserialize | 225.9 | 669.3 | fory |
| SampleList | Serialize | 641.8 | 6174.4 | fory |
| SampleList | Deserialize | 2169.7 | 6931.8 | fory |
| MediaContentList | Serialize | 1171.5 | 3914.1 | fory |
| MediaContentList | Deserialize | 2733.3 | 5389.5 | fory |

Throughput Results (ops/sec)

| Datatype | Operation | fory TPS | protobuf TPS | Fastest |
|---|---|---|---|---|
| Struct | Serialize | 8,875,477 | 4,714,090 | fory |
| Struct | Deserialize | 16,157,960 | 9,698,380 | fory |
| Sample | Serialize | 5,463,884 | 1,052,809 | fory |
| Sample | Deserialize | 3,316,970 | 784,068 | fory |
| MediaContent | Serialize | 2,862,213 | 1,761,773 | fory |
| MediaContent | Deserialize | 2,044,906 | 1,115,088 | fory |
| StructList | Serialize | 4,048,911 | 1,000,520 | fory |
| StructList | Deserialize | 4,426,737 | 1,494,076 | fory |
| SampleList | Serialize | 1,558,215 | 161,959 | fory |
| SampleList | Deserialize | 460,893 | 144,263 | fory |
| MediaContentList | Serialize | 853,606 | 255,487 | fory |
| MediaContentList | Deserialize | 365,858 | 185,546 | fory |

  • Struct Serialize: 115.0 ns -> 112.7 ns (2% faster)
  • StructList Serialize: 269.3 ns -> 247.0 ns (8.3% faster)
  • Serialized data sizes are the same across all types.

@Geethapranay1 Geethapranay1 force-pushed the perf/rust-struct-buffer-prereserve branch from ddc418d to 30364e0 Compare April 18, 2026 07:17
Comment thread rust/fory-core/src/buffer.rs Outdated
@Geethapranay1 Geethapranay1 force-pushed the perf/rust-struct-buffer-prereserve branch from 30364e0 to b175a09 Compare April 18, 2026 10:43
@Geethapranay1
Contributor Author

@chaokunyang PTAL

@chaokunyang
Collaborator

chaokunyang commented Apr 19, 2026

The performance gains are too small. Could you dig into why this doesn't bring larger gains? A 2% difference is mostly noise and doesn't justify this much complexity.

@Geethapranay1
Contributor Author

@chaokunyang I looked into what the Rust compiler does here. The existing Rust path uses Vec::extend_from_slice, which LLVM inlines pretty aggressively, so the capacity check becomes a single well-predicted branch. Pre-reserving therefore saves only about one predicted branch per field. With 8 fields, that's maybe 3-4 ns on a 115 ns op.

The difference is clear when compared with the C++ buffer.h. C++ uses a raw data_ pointer with a separate writer_index_ and grow(), but Rust's Vec wraps len/capacity/ptr into one abstraction that LLVM already optimizes well, so there's less room to gain.
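For reference, the baseline being described looks roughly like the first function below: each extend_from_slice or push carries its own capacity check. The second function reserves first, so those checks never take the growth path, though they are still present in the source. Both function names and the field set are invented for this sketch, and whether LLVM folds the checks as described depends on the build and is not something the sketch demonstrates.

```rust
/// Baseline: every call checks capacity and may grow the Vec.
pub fn write_fields_checked(buf: &mut Vec<u8>, id: i32, score: f64, flag: bool) {
    buf.extend_from_slice(&id.to_le_bytes()); // capacity check 1
    buf.extend_from_slice(&score.to_le_bytes()); // capacity check 2
    buf.push(flag as u8); // capacity check 3
}

/// Reserve-first variant: one up-front reserve of the known upper bound
/// (4 + 8 + 1 bytes), after which the per-call checks always succeed.
pub fn write_fields_reserved(buf: &mut Vec<u8>, id: i32, score: f64, flag: bool) {
    buf.reserve(4 + 8 + 1);
    buf.extend_from_slice(&id.to_le_bytes());
    buf.extend_from_slice(&score.to_le_bytes());
    buf.push(flag as u8);
}
```

Both functions produce identical bytes; the only difference is when the allocation can happen, which is why the measurable gap between them is expected to be small.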

So to get real gains, I think we'd need to either:

  1. replace the Rust Writer's Vec with a raw pointer + writer_index layout like the C++ buffer.h, or
  2. find a different approach to increasing the gains.

Should I go ahead with option 1, or close this?

@chaokunyang
Collaborator

There's no need for such a change. You can reserve capacity once and use reset to update the Vec index later.

@Geethapranay1
Contributor Author

That's exactly what this PR does: prepare_write reserves the capacity once, and finish_write updates the Vec length at the end.

I re-ran the benchmarks with the process strictly pinned to a CPU core. Single Struct showed only a small change, but I get a solid 7.3% speedup on StructList serialization (164 ns -> 152 ns).

@Geethapranay1
Contributor Author

@chaokunyang PTAL

Development

Successfully merging this pull request may close these issues.

[Rust] Struct Performance Optimization

2 participants