feat: per-session pipelining via seq-ordered data ops#903

Open
dazzling-no-more wants to merge 2 commits into therealaleph:main from dazzling-no-more:feature/tunnel-session-pipelining

Conversation

@dazzling-no-more
Contributor

Summary

Per-session pipelining for full-tunnel mode. Allows up to 2 in-flight data ops per session to overlap one Apps Script round-trip with the next on a single TCP stream, lifting the per-session throughput cap that came from strict request/await/request sequencing.

Wire protocol (forward-compatible — old peers ignore the new fields):

  • BatchOp.seq: Option<u64> — per-session monotonic sequence number, sent only on data ops in pipelined sessions.
  • TunnelResponse.seq: Option<u64> — echoed by the server.
  • TunnelResponse.caps: Option<u32> — advertised on connect / connect_data success replies. Bit 0 = CAPS_PIPELINE_SEQ.
  • u64 rather than u32 so a long-lived TCP session generating ~100 ops/s doesn't wrap the counter after ~1.4 years.
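A minimal sketch of the capability check and the wrap arithmetic implied by the bullets above (the constant value is from the PR text; the function names are illustrative, not the actual code):

```rust
const CAPS_PIPELINE_SEQ: u32 = 1 << 0; // bit 0, per the PR description

/// True if the peer's connect reply advertised pipelining support.
fn peer_supports_pipelining(caps: Option<u32>) -> bool {
    caps.map_or(false, |c| c & CAPS_PIPELINE_SEQ != 0)
}

/// Seconds until a hypothetical u32 seq counter would wrap at a given rate.
fn u32_wrap_seconds(ops_per_sec: u64) -> u64 {
    (u32::MAX as u64 + 1) / ops_per_sec
}
```

At ~100 ops/s, `u32_wrap_seconds(100)` is about 497 days (~1.4 years), which is the motivation for u64.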

Tunnel-node (new): when a data op carries seq, the server processes the entire (write, drain) sequence under a per-session seq lock and only releases it after expected advances. Two ops can travel through different deployments and arrive out of order — the seq lock guarantees they're processed (and replied to) in send order, so a client reorder buffer never sees bytes from seq=N+1 land before seq=N's bytes. Old data ops without seq continue down the legacy unordered path. connect/connect_data success replies set caps: Some(CAPS_PIPELINE_SEQ) — old clients ignore the unknown field, new clients opt into pipelining.
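The per-session ordering guarantee described above can be sketched with a condvar-guarded counter (a simplified, synchronous stand-in for the server's seq lock; every name here is illustrative, not the PR's actual code):

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

// Each data op waits until its seq matches the session's expected counter,
// runs its work, then advances the counter and wakes the other waiters.
struct SeqGate {
    expected: Mutex<u64>,
    turn: Condvar,
}

impl SeqGate {
    fn new() -> Self {
        SeqGate { expected: Mutex::new(0), turn: Condvar::new() }
    }

    /// Block until it is `seq`'s turn, run `f`, then release the turn.
    fn run_in_order<F: FnOnce()>(&self, seq: u64, f: F) {
        let mut exp = self.expected.lock().unwrap();
        while *exp != seq {
            exp = self.turn.wait(exp).unwrap();
        }
        f(); // the real server does the whole (write, drain) under the gate
        *exp += 1;
        self.turn.notify_all();
    }
}

/// Spawn one thread per op in arbitrary arrival order; the gate forces
/// processing back into seq order.
fn ordered_log(arrival_order: &[u64]) -> Vec<u64> {
    let gate = Arc::new(SeqGate::new());
    let log = Arc::new(Mutex::new(Vec::new()));
    let mut handles = Vec::new();
    for &seq in arrival_order {
        let (g, l) = (Arc::clone(&gate), Arc::clone(&log));
        handles.push(thread::spawn(move || {
            g.run_in_order(seq, || l.lock().unwrap().push(seq));
        }));
    }
    for h in handles {
        h.join().unwrap();
    }
    let v = log.lock().unwrap().clone();
    v
}
```

Even if seq=2 arrives first (say, through a faster deployment), its thread parks on the condvar until 0 and 1 have been processed.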

Client (new): tunnel_loop_pipelined keeps up to PIPELINE_DEPTH = 2 data ops in flight per pipelined session. Reordering is implicit — VecDeque<(seq, oneshot::Receiver)> awaited in FIFO order means replies for later seqs sit in their oneshot buffers until our task gets to them. Idle sessions stay at depth=1 (server long-polls; no speculative quota burn). Pipelining only enables when the connect reply advertised caps AND no prior reply has been observed dropping the seq field (mixed-version backend protection).
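A minimal synchronous sketch of that implicit reordering, with std mpsc channels standing in for the oneshot receivers (illustrative, not the PR's actual code):

```rust
use std::collections::VecDeque;
use std::sync::mpsc;

/// Queue two in-flight ops, deliver their replies in reverse, and show
/// that draining the queue front-to-back still yields send order.
fn drain_in_order() -> Vec<(u64, &'static str)> {
    let mut in_flight: VecDeque<(u64, mpsc::Receiver<&'static str>)> = VecDeque::new();
    let mut senders = Vec::new();
    for seq in 0..2u64 {
        let (tx, rx) = mpsc::channel();
        in_flight.push_back((seq, rx)); // queued in send order
        senders.push(tx);
    }
    // Replies arrive reversed: seq=1's bytes land first ...
    senders[1].send("bytes-for-seq-1").unwrap();
    senders[0].send("bytes-for-seq-0").unwrap();
    // ... but they sit in their channel until seq=0 has been consumed,
    // because we only ever await the front of the queue.
    let mut out = Vec::new();
    while let Some((seq, rx)) = in_flight.pop_front() {
        out.push((seq, rx.recv().unwrap()));
    }
    out
}
```

This is why the client needs no explicit reorder buffer: FIFO awaiting of the receivers is the reorder buffer.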

Cross-cutting correctness:

  • Per-batch BatchWait primitive replaces N×M per-job watchers with one watcher per unique SessionInner (deduplicated by Arc::as_ptr). reader_task switched to notify_waiters() so concurrent batches all wake on each push.
  • pipelined_reply_timeout derived as effective_batch * 2 + slack — covers permit-saturated semaphore wait plus the op's own request budget.
  • Active seq ops (had_uplink == true) wait per-session up to ACTIVE_DRAIN_DEADLINE and run a straggler-settle loop, so multi-packet upstream responses land in one drain instead of being split across batches. Idle ops still subscribe to the shared BatchWait for cross-session wake.
  • close ops collected during dispatch and processed AFTER seq jobs ONLY when an earlier same-sid op is already deferred (else run inline) — preserves both [data(seq), close] (data writes before close tears the session down) AND [close, data] (close runs first, data hits the closed session).
  • client_send_closed flag in tunnel_loop_pipelined so a TCP half-close from the local client stops queuing new ops but keeps draining already-queued reply oneshots — valid HTTP request/response with client-side shutdown no longer drops downlink bytes.
  • Effective batch timeout threaded through tunnel_batch_request_with_timeout (3-arg tunnel_batch_request_to kept as backward-compat wrapper) so both h2 (h2_relay_request) and h1 (read_http_response_with_timeout) honor the pipelined-batch floor — avoids inner 30 s default firing before our outer 60 s budget on user-tuned configs.
  • Phase 2's TCP-drain budget reservation capped at TCP_DRAIN_MAX_BYTES per iteration (mirrors the seq path), so concurrent seq jobs continue to see headroom while Phase 2's drain is in flight.
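The timeout derivation in the second bullet can be sketched as follows (the function name and sample values are illustrative; the real constants live in the PR):

```rust
use std::time::Duration;

/// A pipelined op may wait a full batch behind a saturated semaphore
/// before its own batch runs, so the reply deadline covers two batch
/// budgets plus slack.
fn pipelined_reply_timeout(effective_batch: Duration, slack: Duration) -> Duration {
    effective_batch * 2 + slack
}
```

For example, with a 60 s effective batch and 10 s of slack, the reply timeout comes out to 130 s.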

Public API change

TunnelMux::udp_open and udp_data already take impl Into<Bytes> (carried over from the prior zero-copy PR). tunnel_batch_request_to keeps its 3-arg signature; new tunnel_batch_request_with_timeout exposes the explicit timeout for callers that need it. Public structs TunnelResponse and BatchOp gain new optional fields with #[serde(default)] / #[serde(skip_serializing_if = "Option::is_none")] — old wire shapes deserialize cleanly and serialize identically.
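Illustrative wire shapes implied by those serde attributes (the `seq`/`caps` field names are from this PR's description; the surrounding layout is assumed, not taken from the real structs):

```json
{"ok": true}
{"ok": true, "caps": 1}
{"ok": true, "seq": 0, "data": "…"}
```

The first line is an old server's reply: `seq`/`caps` absent, deserialized via `#[serde(default)]`. The second is a new connect-success advertising CAPS_PIPELINE_SEQ; the third a pipelined data reply echoing `seq`. Because of `skip_serializing_if`, an old-shaped value re-serializes without the new fields, i.e. byte-identically.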

Realistic perf delta

For active streaming downloads on a single TCP session, expect 30–80% throughput improvement depending on what fraction of the Apps Script RTT is server-side vs network. Less for idle / long-poll workloads (server-side seq enforcement serializes drains per session). No win on multi-session workloads (those were already parallel across sessions). Documented latency trade-off: a stuck seq op (lost earlier seq from a dropped batch) holds the batch HTTP response open for up to SEQ_WAIT_TIMEOUT = 30 s; the pinned regression test unrelated_seq_session_in_same_batch_is_not_delayed_past_seq_wait ensures unrelated sessions' intrinsic processing time stays sub-second within that window.

Test plan

  • cargo build --bins --lib clean both crates
  • cargo test --lib (client) + cargo test (tunnel-node) pass — 216 + 57 = 273 total
  • Server-side correctness: seq ordering across out-of-order arrivals, missed-notify race in wait_for_seq_turn, write-failure path bumps expected (SeqAdvanceOnDrop), tail preservation under cap, batch budget shared with seq jobs, EOF withholding while buffer non-empty, [data(seq), close] runs data before close, [close, data] runs close first
  • Server-side concurrency: BatchWait wakes all jobs on single push, dedupes watchers per unique inner, wakes across concurrent batches on the same session, mixed ready+idle batch latency, Phase 2 budget reservation capped at TCP_DRAIN_MAX_BYTES
  • Client-side: pipelined loop emits seq=0 first, in-seq-order output when replies arrive reversed, closes on seq mismatch, closes on seq=None (mixed-version detection), drains in-flight on client half-close, pipelined_reply_timeout covers 2× batch_timeout, mark_pipelining_disabled sticky and overrides caps
  • Manual: load a heavy site through full-tunnel mode and confirm no regressions on legacy non-pipelined sessions (mixed connect_data capability detection)
  • Manual: pipelined-capable session under heavy concurrent load — confirm no avoidable disconnects under semaphore saturation

@github-actions github-actions Bot added the type: feature feat: PR — auto-applied by release-drafter label May 8, 2026
@therealaleph
Owner

Reviewed via Anthropic Claude.

Verified locally on top of v1.9.18:

  • cargo build --bins --lib --release: clean
  • cargo test --lib --release: 216/216 ✅
  • (cd tunnel-node && cargo test --release): 57/57 ✅

Wire-protocol design looks careful — #[serde(default)] + #[serde(skip_serializing_if = "Option::is_none")] on the new optional fields means old peers ignore the new fields and old shapes deserialize cleanly. The opt-in cap-bit (CAPS_PIPELINE_SEQ on connect-success) + sticky mark_pipelining_disabled on first seq=None reply is the right shape for mixed-deployment fleets where one Apps Script is running old Code.gs.

Hesitating on immediate merge because of two factors:

  1. Surface area: 3620 additions across the three highest-traffic files in the project (tunnel_client, domain_fronter, tunnel-node). Even with the test count this is the kind of PR where edge-case races only show up under real Iran-ISP latency + h2 connection drops + concurrent sessions on a busy VPS.
  2. Manual checkboxes still unchecked in the test plan: heavy load against a non-pipelined backend (mixed-version), and pipelined session under semaphore-saturation.

Plan: leaving open for 5–7 days of community testing like we did with #359 (drive queue). If two or three users on different ISPs report it stable for streaming downloads + heavy concurrent browsing, I'll squash-merge as v1.9.19.

For testers reading this: pull feature/tunnel-session-pipelining, run cargo build --release --features ui --bin mhrv-rs-ui, switch to Full mode, do a heavy download (Steam / large CDN file) + a few hours of normal browsing, and report back here with:

  • ISP / country
  • mhrv-rs version your tunnel-node deployment is running (so we know if you're testing pure-pipelined or mixed-version)
  • any disconnects, stalls, or out-of-order data the browser noticed (e.g. partial JS bundles, mid-stream HTTPS errors)

Thanks @dazzling-no-more — this is the kind of structural perf work that's hard to land safely. The seq-lock + capability-negotiation design is exactly the right architecture for what is ultimately a not-quite-ordered transport layer.


@w0l4i w0l4i left a comment


Great commit, clever move!
Keep it going champ 💪

therealaleph pushed a commit that referenced this pull request May 10, 2026
Fixes #924 — the canonical tracking thread for the v1.9.15 Full-mode
regression cluster that spanned ~3 weeks and 18+ duplicate reports.

**Root cause** (rigorously bisected to PR #799's `warm()`):
PR #799 added HTTP/2 multiplexing on the relay leg, and gated the h1
socket-pool prewarm behind `ensure_h2().await`. `ensure_h2()` is bounded
by `H2_OPEN_TIMEOUT_SECS = 8s` but can take the full window on a cold
connection. During that window the h1 fallback pool is empty, so any
request that arrives gets:

1. `Err((Relay("h2 unavailable"), No))` immediately → falls back to h1
2. h1 path calls `acquire()` → empty pool → cold `open()` → fresh
   TCP+TLS handshake to `connect_host:443`
3. Same network conditions that stalled h2 also stall h1; cold open
   exceeds the 30s `batch_timeout` enforced in `dispatch_full_tunnel`
4. User sees `batch timed out after 30s` while apps_script mode keeps
   working

**Fix** (two commits, both `domain_fronter.rs`-only):

Commit 1 — `warm h1 pool in parallel with h2`:
Spawn h2 prewarm in a separate task so the h1 prewarm loop runs
concurrently. Full `n` h1 sockets are warm before user traffic, even
when the h2 handshake stalls or hits its 8s timeout. `run_pool_refill`
trims the pool back to `POOL_MIN_H2_FALLBACK = 2` within 5s once h2
lands as the fast path.

Commit 2 — `bound h1 open() + detect dead h2 cells synchronously`:
- `H1_OPEN_TIMEOUT_SECS = 8` wraps the TCP+TLS handshake in `open()`
  so a stuck handshake to a blackholed `connect_host:443` doesn't
  block `acquire()` until the outer 30s batch budget elapses (same
  symptom #924 hits during the warm-race window).
- `H2Cell.dead: Arc<AtomicBool>` flipped by the connection driver task
  when `Connection::await` ends (GOAWAY, network error, normal close).
  `ensure_h2`'s fast path and `run_pool_refill`'s pool-target check
  both consult the flag — known-dead cells are rejected within ≤5s
  instead of waiting for `H2_CONN_TTL_SECS = 540s` to expire or for a
  request to discover the breakage via `ready()` failure.
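A sketch of the dead-flag pattern Commit 2 describes (the `dead` field name is from the commit message; everything else here is illustrative):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Stand-in for the h2 connection cell: the driver task flips `dead`
/// when the connection future ends, and callers check it synchronously
/// instead of waiting for a TTL or a failed request to expose breakage.
struct H2Cell {
    dead: Arc<AtomicBool>,
}

impl H2Cell {
    fn new() -> Self {
        H2Cell { dead: Arc::new(AtomicBool::new(false)) }
    }

    /// Called by the connection driver task when the connection ends
    /// (GOAWAY, network error, normal close).
    fn mark_dead(&self) {
        self.dead.store(true, Ordering::Release);
    }

    /// Fast-path check: a known-dead cell is rejected without any I/O.
    fn usable(&self) -> bool {
        !self.dead.load(Ordering::Acquire)
    }
}
```

The `Arc` matters: the driver task owns one clone of the flag and the pool owns the cell, so death is visible to the pool the moment the driver observes it.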

**API impact**:
`h2_handshake_post_tls`'s return type changes to `(SendRequest, Arc<AtomicBool>)`.
One existing test (`h2_handshake_post_tls_returns_alpn_refused_when_peer_picks_h1`)
tweaks its `Ok` arm to match — panic message unchanged.

**Verified locally on top of v1.9.19**:
- `cargo test --lib --release`: 209/209 (was 208; +1 new test
  `ensure_h2_rejects_dead_cell_within_ttl`)
- `cargo build --release --features ui --bin mhrv-rs-ui`: clean
- `(cd tunnel-node && cargo test --release)`: 36/36

**Live end-to-end** (from PR description):
- 5 cold restarts (warm-up race window): 5/5 pass, 9.6-22.5s
- Concurrent burst (5 simultaneous SOCKS5 streams): 5/5
- Default full.json baseline: 200 OK in 13.3s
- `force_http1: true` sanity: 200 OK in 17.7s

**A/B vs PR #903** (per-session pipelining): commits land in disjoint
functions, cherry-picked clean on top. If #903 lands first, this needs
a mechanical rebase only.

Exemplary debugging work — bisect with concrete probe data, root cause
identification down to specific commit + line, working fix with bounded
timeouts on the two adjacent paths the same stall pattern could recur
through, +1 regression test. The kind of PR that lands fast.

Closes #924.

Reviewed via Anthropic Claude.

Co-Authored-By: rezaisrad <noreply@github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@yyoyoian-pixel
Contributor

yyoyoian-pixel commented May 11, 2026

I suspect these reports were related to the h2 path for full tunnel, which is being reverted. I'm skeptical whether this changes anything or just complicates full-tunnel mode; please provide data showing it is actually helping. Have you tested and seen a noticeable change? Because I did, and I see nothing dramatically changed as the description suggests.

@yyoyoian-pixel
Contributor

YouTube does not work like this. YouTube already opens multiple calls to the Google Video CDN, and batching already handles that: the tunnel node waits 1000 ms if it got data and batch-sends them all.

@dazzling-no-more
Contributor Author

@yyoyoian-pixel
Batching ≠ pipelining. The 1s coalesce window in mux_loop groups ops from different sessions into one HTTP request — it does not let a single session have two ops in flight. Look at tunnel_client.rs:1594 on main: reply_rx.await blocks before the next MuxMsg::Data can be produced. One sid = one round-trip in flight, period. I even called it out in tunnel-node/src/main.rs:60-63: "tunnel_loop is strictly serial (one in-flight op per session)."

On YouTube specifically: yes, it opens ~6 CDN streams, but each becomes its own sid with its own serial loop. That is 6 × 1 in-flight in aggregate, not 6 sharing one batch. The pipelined loop lifts that to 6 × 2.

That said, you may genuinely not see a dramatic change for your workload, and that's expected:

  • Idle sessions stay at depth=1 (line 1721-1722) — normal browsing won't show it.
  • The win is RTT-overlap, not server parallelism (seq lock serializes drains per session).
  • You need ≥2 deployments for the two in-flight batches to actually run in parallel — with h2 reverted in v1.9.21, single-deployment users see less.
  • Multi-session workloads (YouTube, many tabs) were already aggregate-parallel.

The headline scenario is one TCP session doing a sustained download with multiple deployments. A clean test: single-stream curl https://speed.cloudflare.com/__down?bytes=104857600 with 3+ deployments, v1.9.21 base, pipelining on vs off. Anything multi-session is structurally a bad test for what this PR fixes.

Resolve conflict in src/domain_fronter.rs (tunnel_batch_request_with_timeout).

main (therealaleph#1040) intentionally removed the h2 fast path from tunnel batches
because coalesced batches see no h2-multiplexing benefit. The branch's
contribution — threading caller-supplied `batch_timeout` (so pipelined
batches honor `max(configured, PIPELINED_BATCH_TIMEOUT_FLOOR)`) — is
preserved on the h1 header-read path, which is the only path now.

Build: clean on both crates.
Tests: 217/217 client lib, 57/57 tunnel-node.
@yyoyoian-pixel
Contributor

What if session seq 1 drops? There are protocols like that, and sophisticated ones like QUIC do this, but they have very safe recovery. This PR adds 3600 lines of complexity that 99% of users won't even see. Multiple deployments but one single-TCP download barely happens; all good downloaders divide downloads into chunks anyway.

@yyoyoian-pixel
Copy link
Copy Markdown
Contributor

@dazzling-no-more Let's have this tested from behind the censor, for all the apps: browsing, YouTube, video calls, messaging (Android would be enough). Provide some feedback and make sure there is no regression, ideally also with some numbers.

@yyoyoian-pixel
Contributor

@dazzling-no-more I'm fine with moving forward with this if the tests are done and if we rebase on master after #1041 is merged. h2 should not be re-enabled for full-tunnel mode.

@dazzling-no-more
Contributor Author

@yyoyoian-pixel I have already merged main and resolved the conflict. Please test again.

@yyoyoian-pixel
Contributor

@dazzling-no-more #1041 is still open, so we need to wait for that. Also, please share your test results from behind the censor. I think this benefits uploads more than downloads: sending to Apps Script is a lot slower behind the censor than sending from the tunnel node. And if so, we already have a minimum pool of 8 ready connections in the client, so why not a seq pool of 8, as many as the ready connections? Why 2?

@dazzling-no-more
Contributor Author

@yyoyoian-pixel The community needs to test it. I am not in that region.

@dazzling-no-more
Contributor Author

dazzling-no-more commented May 11, 2026

> @dazzling-no-more 1041 is still open so we need to wait for that, also please share your test results behind censor. i think this benefits uploads more than downloads, sending to app script is a lot slower behind censor than from tunnel node. and if so, we already have 8 min pool of ready connections in client, why not having seq pool 8? as much as ready connections? why 2?

@yyoyoian-pixel On "why not having seq pool 8? as much as ready connections? why 2?":

Different layers. The 8-connection pool is cross-session — it fills the bandwidth to Google's edge so any session's batches can get a connection without paying a TLS handshake. Per-session pipeline depth is about how many ops for a single session can be in flight server-side.

Depth>2 doesn't help because the tunnel-node's per-session seq lock serializes the (write, drain) sequence for each session. With depth=2 the second op is always queued at the server while the first is being processed — that's the full overlap. Depth=8 would just queue 6 more ops on the server's seq lock without speeding anything up, and it costs 4× the reorder-buffer memory client-side.

The bandwidth argument doesn't apply within a session either: 30 concurrent slots per deployment already saturate Apps Script's per-account fan-out. Per-session pipelining adds RTT-overlap on top of that, not more bandwidth.
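A back-of-envelope model of that argument (assumptions, not measurements; the formula is mine, not from the PR):

```rust
/// With pipeline depth `d`, a single session completes ops at roughly
/// min(d / rtt, 1 / server_serial) ops per second: depth overlaps the
/// round-trip, but the server's per-session seq lock still spends
/// `server_serial` seconds on each op, no matter how many are queued.
fn ops_per_sec(depth: f64, rtt: f64, server_serial: f64) -> f64 {
    (depth / rtt).min(1.0 / server_serial)
}
```

With rtt = 1.0 s and 0.6 s of server-serial work per op, depth 1 gives 1.0 ops/s while depth 2 and depth 8 both give the same ~1.67 ops/s: past the point where the seq lock is the bottleneck, extra depth only queues.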

@yyoyoian-pixel
Contributor

@dazzling-no-more I don't think your AI auto-reply and I are talking about the same thing. Maybe it's better to add some before/after drawings to the pull request so we can understand what is happening; this is a big change, and I don't think it's right to merge it without properly understanding it. If we support ordering, then we can send N ops in N different batches and let the client handle ordering; apart from memory and CPU, nothing prevents us from doing so. And as I said earlier, downloads don't see any dramatic difference, since tunnel-node to Apps Script is MUCH faster than client to Apps Script; the bottleneck is not the former.

@dazzling-no-more
Copy link
Copy Markdown
Contributor Author

Before — tunnel_loop, depth=1 (strict request/await/request per session)

```mermaid
sequenceDiagram
    autonumber
    participant Client
    participant AS as Apps Script
    participant TN as tunnel-node

    Note over Client,TN: Op A
    Client->>AS: data(A)<br/>~slow client→AS leg
    AS->>TN: ~fast
    TN-->>AS: reply A<br/>~fast
    AS-->>Client: response A<br/>~AS→client leg
    Note right of Client: only NOW can Op B be sent

    Note over Client,TN: Op B (serial)
    Client->>AS: data(B)<br/>~slow client→AS leg
    AS->>TN: ~fast
    TN-->>AS: reply B
    AS-->>Client: response B

    Note right of Client: Total wall clock = 2 × RTT<br/>slow client→AS leg utilized SEQUENTIALLY
```

After — tunnel_loop_pipelined, depth=2 with seq

```mermaid
sequenceDiagram
    autonumber
    participant Client
    participant AS as Apps Script
    participant TN as tunnel-node

    Note over Client,TN: Client fires A and B back-to-back<br/>(no await between)
    Client->>AS: data(A, seq=0)
    Client->>AS: data(B, seq=1)
    Note over Client,AS: BOTH in the slow client→AS leg<br/>at the same time — pipe utilized in PARALLEL

    par
    AS->>TN: forward A
    and
    AS->>TN: forward B
    end

    Note over TN: per-session seq lock<br/>processes A then B in order
    TN-->>AS: reply A (seq=0)
    TN-->>AS: reply B (seq=1)

    par
    AS-->>Client: response A
    and
    AS-->>Client: response B
    end

    Note right of Client: Total wall clock ≈ 1 × RTT + small server-side serial delta<br/>~35–80% faster, scaling with how much<br/>client↔AS leg dominates the round-trip
```

@dazzling-no-more
Contributor Author

Depth > 2 has diminishing returns when client -> AS is not the bottleneck, but real returns when it is.

The win is in the slow leg.

@yyoyoian-pixel
Contributor

> Depth > 2 has diminishing returns when client -> AS is not the bottleneck, but real returns when it is.
>
> The win is in the slow leg

so you want to try > 2? :) ideally as many as the min TCP connections, 8.
