feat: per-session pipelining via seq-ordered data ops#903

Open
dazzling-no-more wants to merge 2 commits into therealaleph:main from dazzling-no-more:feature/tunnel-session-pipelining

Conversation

@dazzling-no-more
Contributor

Summary

Per-session pipelining for full-tunnel mode. Allows up to 2 in-flight data ops per session to overlap one Apps Script round-trip with the next on a single TCP stream, lifting the per-session throughput cap that came from strict request/await/request sequencing.

Wire protocol (forward-compatible — old peers ignore the new fields):

  • BatchOp.seq: Option<u64> — per-session monotonic sequence number, sent only on data ops in pipelined sessions.
  • TunnelResponse.seq: Option<u64> — echoed by the server.
  • TunnelResponse.caps: Option<u32> — advertised on connect / connect_data success replies. Bit 0 = CAPS_PIPELINE_SEQ.
  • u64 rather than u32 so a long-lived TCP session generating ~100 ops/s doesn't wrap the counter after ~1.4 years.
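A minimal sketch of the capability check and the wrap arithmetic implied by the bullets above (the constant value is from the PR text; the function names are illustrative, not the actual code):

```rust
const CAPS_PIPELINE_SEQ: u32 = 1 << 0; // bit 0, per the PR description

/// True if the peer's connect reply advertised pipelining support.
fn peer_supports_pipelining(caps: Option<u32>) -> bool {
    caps.map_or(false, |c| c & CAPS_PIPELINE_SEQ != 0)
}

/// Seconds until a hypothetical u32 seq counter would wrap at a given rate.
fn u32_wrap_seconds(ops_per_sec: u64) -> u64 {
    (u32::MAX as u64 + 1) / ops_per_sec
}
```

At ~100 ops/s, `u32_wrap_seconds(100)` is about 497 days (~1.4 years), which is the motivation for u64.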

Tunnel-node (new): when a data op carries seq, the server processes the entire (write, drain) sequence under a per-session seq lock and only releases it after expected advances. Two ops can travel through different deployments and arrive out of order — the seq lock guarantees they're processed (and replied to) in send order, so a client reorder buffer never sees bytes from seq=N+1 land before seq=N's bytes. Old data ops without seq continue down the legacy unordered path. connect/connect_data success replies set caps: Some(CAPS_PIPELINE_SEQ) — old clients ignore the unknown field, new clients opt into pipelining.
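The per-session ordering guarantee described above can be sketched with a condvar-guarded counter (a simplified, synchronous stand-in for the server's seq lock; every name here is illustrative, not the PR's actual code):

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

// Each data op waits until its seq matches the session's expected counter,
// runs its work, then advances the counter and wakes the other waiters.
struct SeqGate {
    expected: Mutex<u64>,
    turn: Condvar,
}

impl SeqGate {
    fn new() -> Self {
        SeqGate { expected: Mutex::new(0), turn: Condvar::new() }
    }

    /// Block until it is `seq`'s turn, run `f`, then release the turn.
    fn run_in_order<F: FnOnce()>(&self, seq: u64, f: F) {
        let mut exp = self.expected.lock().unwrap();
        while *exp != seq {
            exp = self.turn.wait(exp).unwrap();
        }
        f(); // the real server does the whole (write, drain) under the gate
        *exp += 1;
        self.turn.notify_all();
    }
}

/// Spawn one thread per op in arbitrary arrival order; the gate forces
/// processing back into seq order.
fn ordered_log(arrival_order: &[u64]) -> Vec<u64> {
    let gate = Arc::new(SeqGate::new());
    let log = Arc::new(Mutex::new(Vec::new()));
    let mut handles = Vec::new();
    for &seq in arrival_order {
        let (g, l) = (Arc::clone(&gate), Arc::clone(&log));
        handles.push(thread::spawn(move || {
            g.run_in_order(seq, || l.lock().unwrap().push(seq));
        }));
    }
    for h in handles {
        h.join().unwrap();
    }
    let v = log.lock().unwrap().clone();
    v
}
```

Even if seq=2 arrives first (say, through a faster deployment), its thread parks on the condvar until 0 and 1 have been processed.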

Client (new): tunnel_loop_pipelined keeps up to PIPELINE_DEPTH = 2 data ops in flight per pipelined session. Reordering is implicit — VecDeque<(seq, oneshot::Receiver)> awaited in FIFO order means replies for later seqs sit in their oneshot buffers until our task gets to them. Idle sessions stay at depth=1 (server long-polls; no speculative quota burn). Pipelining only enables when the connect reply advertised caps AND no prior reply has been observed dropping the seq field (mixed-version backend protection).
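A minimal synchronous sketch of that implicit reordering, with std mpsc channels standing in for the oneshot receivers (illustrative, not the PR's actual code):

```rust
use std::collections::VecDeque;
use std::sync::mpsc;

/// Queue two in-flight ops, deliver their replies in reverse, and show
/// that draining the queue front-to-back still yields send order.
fn drain_in_order() -> Vec<(u64, &'static str)> {
    let mut in_flight: VecDeque<(u64, mpsc::Receiver<&'static str>)> = VecDeque::new();
    let mut senders = Vec::new();
    for seq in 0..2u64 {
        let (tx, rx) = mpsc::channel();
        in_flight.push_back((seq, rx)); // queued in send order
        senders.push(tx);
    }
    // Replies arrive reversed: seq=1's bytes land first ...
    senders[1].send("bytes-for-seq-1").unwrap();
    senders[0].send("bytes-for-seq-0").unwrap();
    // ... but they sit in their channel until seq=0 has been consumed,
    // because we only ever await the front of the queue.
    let mut out = Vec::new();
    while let Some((seq, rx)) = in_flight.pop_front() {
        out.push((seq, rx.recv().unwrap()));
    }
    out
}
```

This is why the client needs no explicit reorder buffer: FIFO awaiting of the receivers is the reorder buffer.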

Cross-cutting correctness:

  • Per-batch BatchWait primitive replaces N×M per-job watchers with one watcher per unique SessionInner (deduplicated by Arc::as_ptr). reader_task switched to notify_waiters() so concurrent batches all wake on each push.
  • pipelined_reply_timeout derived as effective_batch * 2 + slack — covers permit-saturated semaphore wait plus the op's own request budget.
  • Active seq ops (had_uplink == true) wait per-session up to ACTIVE_DRAIN_DEADLINE and run a straggler-settle loop, so multi-packet upstream responses land in one drain instead of being split across batches. Idle ops still subscribe to the shared BatchWait for cross-session wake.
  • close ops collected during dispatch and processed AFTER seq jobs ONLY when an earlier same-sid op is already deferred (else run inline) — preserves both [data(seq), close] (data writes before close tears the session down) AND [close, data] (close runs first, data hits the closed session).
  • client_send_closed flag in tunnel_loop_pipelined so a TCP half-close from the local client stops queuing new ops but keeps draining already-queued reply oneshots — valid HTTP request/response with client-side shutdown no longer drops downlink bytes.
  • Effective batch timeout threaded through tunnel_batch_request_with_timeout (3-arg tunnel_batch_request_to kept as backward-compat wrapper) so both h2 (h2_relay_request) and h1 (read_http_response_with_timeout) honor the pipelined-batch floor — avoids inner 30 s default firing before our outer 60 s budget on user-tuned configs.
  • Phase 2's TCP-drain budget reservation capped at TCP_DRAIN_MAX_BYTES per iteration (mirrors the seq path), so concurrent seq jobs continue to see headroom while Phase 2's drain is in flight.
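The timeout derivation in the second bullet can be sketched as follows (the function name and sample values are illustrative; the real constants live in the PR):

```rust
use std::time::Duration;

/// A pipelined op may wait a full batch behind a saturated semaphore
/// before its own batch runs, so the reply deadline covers two batch
/// budgets plus slack.
fn pipelined_reply_timeout(effective_batch: Duration, slack: Duration) -> Duration {
    effective_batch * 2 + slack
}
```

For example, with a 60 s effective batch and 10 s of slack, the reply timeout comes out to 130 s.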

Public API change

TunnelMux::udp_open and udp_data already take impl Into<Bytes> (carried over from the prior zero-copy PR). tunnel_batch_request_to keeps its 3-arg signature; new tunnel_batch_request_with_timeout exposes the explicit timeout for callers that need it. Public structs TunnelResponse and BatchOp gain new optional fields with #[serde(default)] / #[serde(skip_serializing_if = "Option::is_none")] — old wire shapes deserialize cleanly and serialize identically.
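Illustrative wire shapes implied by those serde attributes (the `seq`/`caps` field names are from this PR's description; the surrounding layout is assumed, not taken from the real structs):

```json
{"ok": true}
{"ok": true, "caps": 1}
{"ok": true, "seq": 0, "data": "…"}
```

The first line is an old server's reply: `seq`/`caps` absent, deserialized via `#[serde(default)]`. The second is a new connect-success advertising CAPS_PIPELINE_SEQ; the third a pipelined data reply echoing `seq`. Because of `skip_serializing_if`, an old-shaped value re-serializes without the new fields, i.e. byte-identically.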

Realistic perf delta

For active streaming downloads on a single TCP session, expect 30–80% throughput improvement depending on what fraction of the Apps Script RTT is server-side vs network. Less for idle / long-poll workloads (server-side seq enforcement serializes drains per session). No win on multi-session workloads (those were already parallel across sessions). Documented latency trade-off: a stuck seq op (lost earlier seq from a dropped batch) holds the batch HTTP response open for up to SEQ_WAIT_TIMEOUT = 30 s; the pinned regression test unrelated_seq_session_in_same_batch_is_not_delayed_past_seq_wait ensures unrelated sessions' intrinsic processing time stays sub-second within that window.

Test plan

  • cargo build --bins --lib clean both crates
  • cargo test --lib (client) + cargo test (tunnel-node) pass — 216 + 57 = 273 total
  • Server-side correctness: seq ordering across out-of-order arrivals, missed-notify race in wait_for_seq_turn, write-failure path bumps expected (SeqAdvanceOnDrop), tail preservation under cap, batch budget shared with seq jobs, EOF withholding while buffer non-empty, [data(seq), close] runs data before close, [close, data] runs close first
  • Server-side concurrency: BatchWait wakes all jobs on single push, dedupes watchers per unique inner, wakes across concurrent batches on the same session, mixed ready+idle batch latency, Phase 2 budget reservation capped at TCP_DRAIN_MAX_BYTES
  • Client-side: pipelined loop emits seq=0 first, in-seq-order output when replies arrive reversed, closes on seq mismatch, closes on seq=None (mixed-version detection), drains in-flight on client half-close, pipelined_reply_timeout covers 2× batch_timeout, mark_pipelining_disabled sticky and overrides caps
  • Manual: load a heavy site through full-tunnel mode and confirm no regressions on legacy non-pipelined sessions (mixed connect_data capability detection)
  • Manual: pipelined-capable session under heavy concurrent load — confirm no avoidable disconnects under semaphore saturation

@github-actions github-actions Bot added the type: feature feat: PR — auto-applied by release-drafter label May 8, 2026
@therealaleph
Owner

Reviewed via Anthropic Claude.

Verified locally on top of v1.9.18:

  • cargo build --bins --lib --release: clean
  • cargo test --lib --release: 216/216 ✅
  • (cd tunnel-node && cargo test --release): 57/57 ✅

Wire-protocol design looks careful — #[serde(default)] + #[serde(skip_serializing_if = "Option::is_none")] on the new optional fields means old peers ignore the new fields and old shapes deserialize cleanly. The opt-in cap-bit (CAPS_PIPELINE_SEQ on connect-success) + sticky mark_pipelining_disabled on first seq=None reply is the right shape for mixed-deployment fleets where one Apps Script is running old Code.gs.

Hesitating on immediate merge because of two factors:

  1. Surface area: 3620 additions across the three highest-traffic files in the project (tunnel_client, domain_fronter, tunnel-node). Even with the test count this is the kind of PR where edge-case races only show up under real Iran-ISP latency + h2 connection drops + concurrent sessions on a busy VPS.
  2. Manual checkboxes still unchecked in the test plan: heavy load against a non-pipelined backend (mixed-version), and pipelined session under semaphore-saturation.

Plan: leaving open for 5–7 days of community testing like we did with #359 (drive queue). If two or three users on different ISPs report it stable for streaming downloads + heavy concurrent browsing, I'll squash-merge as v1.9.19.

For testers reading this: pull feature/tunnel-session-pipelining, run cargo build --release --features ui --bin mhrv-rs-ui, switch to Full mode, do a heavy download (Steam / large CDN file) + a few hours of normal browsing, and report back here with:

  • ISP / country
  • mhrv-rs version your tunnel-node deployment is running (so we know if you're testing pure-pipelined or mixed-version)
  • any disconnects, stalls, or out-of-order data the browser noticed (e.g. partial JS bundles, mid-stream HTTPS errors)

Thanks @dazzling-no-more — this is the kind of structural perf work that's hard to land safely. The seq-lock + capability-negotiation design is exactly the right architecture for what is ultimately a not-quite-ordered transport layer.


@w0l4i w0l4i left a comment


Great commit, clever move!
Keep it going champ 💪

therealaleph pushed a commit that referenced this pull request May 10, 2026
Fixes #924 — the canonical tracking thread for the v1.9.15 Full-mode
regression cluster that spanned ~3 weeks and 18+ duplicate reports.

**Root cause** (rigorously bisected to PR #799's `warm()`):
PR #799 added HTTP/2 multiplexing on the relay leg, and gated the h1
socket-pool prewarm behind `ensure_h2().await`. `ensure_h2()` is bounded
by `H2_OPEN_TIMEOUT_SECS = 8s` but can take the full window on a cold
connection. During that window the h1 fallback pool is empty, so any
request that arrives gets:

1. `Err((Relay("h2 unavailable"), No))` immediately → falls back to h1
2. h1 path calls `acquire()` → empty pool → cold `open()` → fresh
   TCP+TLS handshake to `connect_host:443`
3. Same network conditions that stalled h2 also stall h1; cold open
   exceeds the 30s `batch_timeout` enforced in `dispatch_full_tunnel`
4. User sees `batch timed out after 30s` while apps_script mode keeps
   working

**Fix** (two commits, both `domain_fronter.rs`-only):

Commit 1 — `warm h1 pool in parallel with h2`:
Spawn h2 prewarm in a separate task so the h1 prewarm loop runs
concurrently. Full `n` h1 sockets are warm before user traffic, even
when the h2 handshake stalls or hits its 8s timeout. `run_pool_refill`
trims the pool back to `POOL_MIN_H2_FALLBACK = 2` within 5s once h2
lands as the fast path.

Commit 2 — `bound h1 open() + detect dead h2 cells synchronously`:
- `H1_OPEN_TIMEOUT_SECS = 8` wraps the TCP+TLS handshake in `open()`
  so a stuck handshake to a blackholed `connect_host:443` doesn't
  block `acquire()` until the outer 30s batch budget elapses (same
  symptom #924 hits during the warm-race window).
- `H2Cell.dead: Arc<AtomicBool>` flipped by the connection driver task
  when `Connection::await` ends (GOAWAY, network error, normal close).
  `ensure_h2`'s fast path and `run_pool_refill`'s pool-target check
  both consult the flag — known-dead cells are rejected within ≤5s
  instead of waiting for `H2_CONN_TTL_SECS = 540s` to expire or for a
  request to discover the breakage via `ready()` failure.
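A sketch of the dead-flag pattern Commit 2 describes (the `dead` field name is from the commit message; everything else here is illustrative):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Stand-in for the h2 connection cell: the driver task flips `dead`
/// when the connection future ends, and callers check it synchronously
/// instead of waiting for a TTL or a failed request to expose breakage.
struct H2Cell {
    dead: Arc<AtomicBool>,
}

impl H2Cell {
    fn new() -> Self {
        H2Cell { dead: Arc::new(AtomicBool::new(false)) }
    }

    /// Called by the connection driver task when the connection ends
    /// (GOAWAY, network error, normal close).
    fn mark_dead(&self) {
        self.dead.store(true, Ordering::Release);
    }

    /// Fast-path check: a known-dead cell is rejected without any I/O.
    fn usable(&self) -> bool {
        !self.dead.load(Ordering::Acquire)
    }
}
```

The `Arc` matters: the driver task owns one clone of the flag and the pool owns the cell, so death is visible to the pool the moment the driver observes it.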

**API impact**:
`h2_handshake_post_tls`'s return type changes to `(SendRequest, Arc<AtomicBool>)`.
One existing test (`h2_handshake_post_tls_returns_alpn_refused_when_peer_picks_h1`)
tweaks its `Ok` arm to match — panic message unchanged.

**Verified locally on top of v1.9.19**:
- `cargo test --lib --release`: 209/209 (was 208; +1 new test
  `ensure_h2_rejects_dead_cell_within_ttl`)
- `cargo build --release --features ui --bin mhrv-rs-ui`: clean
- `(cd tunnel-node && cargo test --release)`: 36/36

**Live end-to-end** (from PR description):
- 5 cold restarts (warm-up race window): 5/5 pass, 9.6-22.5s
- Concurrent burst (5 simultaneous SOCKS5 streams): 5/5
- Default full.json baseline: 200 OK in 13.3s
- `force_http1: true` sanity: 200 OK in 17.7s

**A/B vs PR #903** (per-session pipelining): commits land in disjoint
functions, cherry-picked clean on top. If #903 lands first, this needs
a mechanical rebase only.

Exemplary debugging work — bisect with concrete probe data, root cause
identification down to specific commit + line, working fix with bounded
timeouts on the two adjacent paths the same stall pattern could recur
through, +1 regression test. The kind of PR that lands fast.

Closes #924.

Reviewed via Anthropic Claude.

Co-Authored-By: rezaisrad <noreply@github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@yyoyoian-pixel
Contributor

yyoyoian-pixel commented May 11, 2026

I suspect these reports were related to the h2 path for full tunnel, which is being reverted. I'm skeptical whether this changes anything or just complicates full-tunnel mode; please provide data showing it is actually helping. Have you tested and seen a noticeable change? Because I did, and I see nothing dramatically changed as the description suggests.

@yyoyoian-pixel
Contributor

YouTube does not work like this. YouTube already opens multiple calls to the Google Video CDN, and batching already handles that: the tunnel node waits 1000 ms if it got data and batch-sends them all.

@dazzling-no-more
Contributor Author

@yyoyoian-pixel
Batching ≠ pipelining. The 1s coalesce window in mux_loop groups ops from different sessions into one HTTP request — it does not let a single session have two ops in flight. Look at tunnel_client.rs:1594 on main: reply_rx.await blocks before the next MuxMsg::Data can be produced. One sid = one round-trip in flight, period. I even called it out in tunnel-node/src/main.rs:60-63: "tunnel_loop is strictly serial (one in-flight op per session)."

On YouTube specifically: yes, it opens ~6 CDN streams, but each becomes its own sid with its own serial loop. That is 6 × 1 in-flight in aggregate, not 6 sharing one batch. The pipelined loop lifts that to 6 × 2.

That said, you may genuinely not see a dramatic change for your workload, and that's expected:

  • Idle sessions stay at depth=1 (line 1721-1722) — normal browsing won't show it.
  • The win is RTT-overlap, not server parallelism (seq lock serializes drains per session).
  • You need ≥2 deployments for the two in-flight batches to actually run in parallel — with h2 reverted in v1.9.21, single-deployment users see less.
  • Multi-session workloads (YouTube, many tabs) were already aggregate-parallel.

The headline scenario is one TCP session doing a sustained download with multiple deployments. A clean test: single-stream curl https://speed.cloudflare.com/__down?bytes=104857600 with 3+ deployments, v1.9.21 base, pipelining on vs off. Anything multi-session is structurally a bad test for what this PR fixes.

Resolve conflict in src/domain_fronter.rs (tunnel_batch_request_with_timeout).

main (therealaleph#1040) intentionally removed the h2 fast path from tunnel batches
because coalesced batches see no h2-multiplexing benefit. The branch's
contribution — threading caller-supplied `batch_timeout` (so pipelined
batches honor `max(configured, PIPELINED_BATCH_TIMEOUT_FLOOR)`) — is
preserved on the h1 header-read path, which is the only path now.

Build: clean on both crates.
Tests: 217/217 client lib, 57/57 tunnel-node.
@yyoyoian-pixel
Contributor

What if session seq 1 drops? There are protocols like that, and sophisticated ones like QUIC do this, but they have very safe recovery. This PR adds 3600 lines of complexity that 99% of users won't even see. Multiple deployments but one single-TCP download barely happens; all good downloaders divide downloads into chunks anyway.

@yyoyoian-pixel
Copy link
Copy Markdown
Contributor

@dazzling-no-more Let's have this tested from behind the censor, for all the apps: browsing, YouTube, video calls, messaging (Android would be enough). Provide some feedback and make sure there is no regression, ideally also with some numbers.

@yyoyoian-pixel
Contributor

@dazzling-no-more I'm fine with moving forward with this if the tests are done and if we rebase on master after #1041 is merged. h2 should not be re-enabled for full-tunnel mode.

@dazzling-no-more
Contributor Author

@yyoyoian-pixel I have already merged main and resolved the conflict. Please test again.

@yyoyoian-pixel
Contributor

@dazzling-no-more #1041 is still open, so we need to wait for that. Also, please share your test results from behind the censor. I think this benefits uploads more than downloads: sending to Apps Script is a lot slower behind the censor than sending from the tunnel node. And if so, we already have a minimum pool of 8 ready connections in the client, so why not a seq pool of 8, as many as the ready connections? Why 2?

@dazzling-no-more
Contributor Author

@yyoyoian-pixel The community needs to test it. I am not in that region.

@dazzling-no-more
Contributor Author

dazzling-no-more commented May 11, 2026

> @dazzling-no-more 1041 is still open so we need to wait for that, also please share your test results behind censor. i think this benefits uploads more than downloads, sending to app script is a lot slower behind censor than from tunnel node. and if so, we already have 8 min pool of ready connections in client, why not having seq pool 8? as much as ready connections? why 2?

@yyoyoian-pixel On "why not having seq pool 8? as much as ready connections? why 2?":

Different layers. The 8-connection pool is cross-session — it fills the bandwidth to Google's edge so any session's batches can get a connection without paying a TLS handshake. Per-session pipeline depth is about how many ops for a single session can be in flight server-side.

Depth>2 doesn't help because the tunnel-node's per-session seq lock serializes the (write, drain) sequence for each session. With depth=2 the second op is always queued at the server while the first is being processed — that's the full overlap. Depth=8 would just queue 6 more ops on the server's seq lock without speeding anything up, and it costs 4× the reorder-buffer memory client-side.

The bandwidth argument doesn't apply within a session either: 30 concurrent slots per deployment already saturate Apps Script's per-account fan-out. Per-session pipelining adds RTT-overlap on top of that, not more bandwidth.
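A back-of-envelope model of that argument (assumptions, not measurements; the formula is mine, not from the PR):

```rust
/// With pipeline depth `d`, a single session completes ops at roughly
/// min(d / rtt, 1 / server_serial) ops per second: depth overlaps the
/// round-trip, but the server's per-session seq lock still spends
/// `server_serial` seconds on each op, no matter how many are queued.
fn ops_per_sec(depth: f64, rtt: f64, server_serial: f64) -> f64 {
    (depth / rtt).min(1.0 / server_serial)
}
```

With rtt = 1.0 s and 0.6 s of server-serial work per op, depth 1 gives 1.0 ops/s while depth 2 and depth 8 both give the same ~1.67 ops/s: past the point where the seq lock is the bottleneck, extra depth only queues.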

@yyoyoian-pixel
Contributor

@dazzling-no-more I don't think your AI auto-reply and I are talking about the same thing. Maybe it's better to add some before/after drawings to the pull request so we can understand what is happening; this is a big change, and I don't think it's right to merge it without properly understanding it. If we support ordering, then we can send N ops in N different batches and let the client handle ordering; apart from memory and CPU, nothing prevents us from doing so. And as I said earlier, downloads don't see any dramatic difference, since tunnel-node to Apps Script is MUCH faster than client to Apps Script; the bottleneck is not the former.

@dazzling-no-more
Copy link
Copy Markdown
Contributor Author

Before — tunnel_loop, depth=1 (strict request/await/request per session)

```mermaid
sequenceDiagram
    autonumber
    participant Client
    participant AS as Apps Script
    participant TN as tunnel-node

    Note over Client,TN: Op A
    Client->>AS: data(A)<br/>~slow client→AS leg
    AS->>TN: ~fast
    TN-->>AS: reply A<br/>~fast
    AS-->>Client: response A<br/>~AS→client leg
    Note right of Client: only NOW can Op B be sent

    Note over Client,TN: Op B (serial)
    Client->>AS: data(B)<br/>~slow client→AS leg
    AS->>TN: ~fast
    TN-->>AS: reply B
    AS-->>Client: response B

    Note right of Client: Total wall clock = 2 × RTT<br/>slow client→AS leg utilized SEQUENTIALLY
```

After — tunnel_loop_pipelined, depth=2 with seq

```mermaid
sequenceDiagram
    autonumber
    participant Client
    participant AS as Apps Script
    participant TN as tunnel-node

    Note over Client,TN: Client fires A and B back-to-back<br/>(no await between)
    Client->>AS: data(A, seq=0)
    Client->>AS: data(B, seq=1)
    Note over Client,AS: BOTH in the slow client→AS leg<br/>at the same time — pipe utilized in PARALLEL

    par
    AS->>TN: forward A
    and
    AS->>TN: forward B
    end

    Note over TN: per-session seq lock<br/>processes A then B in order
    TN-->>AS: reply A (seq=0)
    TN-->>AS: reply B (seq=1)

    par
    AS-->>Client: response A
    and
    AS-->>Client: response B
    end

    Note right of Client: Total wall clock ≈ 1 × RTT + small server-side serial delta<br/>~35–80% faster, scaling with how much<br/>client↔AS leg dominates the round-trip
```

@dazzling-no-more
Contributor Author

Depth > 2 has diminishing returns when client -> AS is not the bottleneck, but real returns when it is.

The win is in the slow leg.

@yyoyoian-pixel
Contributor

> Depth > 2 has diminishing returns when client -> AS is not the bottleneck, but real returns when it is.
>
> The win is in the slow leg

so you want to try > 2? :) ideally as many as the min TCP connections, 8.
