feat: per-session pipelining via seq-ordered data ops #903
dazzling-no-more wants to merge 2 commits into therealaleph:main from
Conversation
|
Reviewed via Anthropic Claude. Verified locally on top of v1.9.18.
Wire-protocol design looks careful — hesitating on immediate merge because of two factors:
Plan: leaving this open for 5–7 days of community testing, like we did with #359 (drive queue). If two or three users on different ISPs report it stable for streaming downloads + heavy concurrent browsing, I'll squash-merge as v1.9.19. For testers reading this: pull
Thanks @dazzling-no-more — this is the kind of structural perf work that's hard to land safely. The seq-lock + capability-negotiation design is exactly the right architecture for what is ultimately a not-quite-ordered transport layer. |
w0l4i left a comment
Great commit, clever move!
keep it going champ 💪
Fixes #924 — the canonical tracking thread for the v1.9.15 Full-mode regression cluster that spanned ~3 weeks and 18+ duplicate reports.

**Root cause** (rigorously bisected to PR #799's `warm()`): PR #799 added HTTP/2 multiplexing on the relay leg and gated the h1 socket-pool prewarm behind `ensure_h2().await`. `ensure_h2()` is bounded by `H2_OPEN_TIMEOUT_SECS = 8s` but can take the full window on a cold connection. During that window the h1 fallback pool is empty, so any request that arrives gets:

1. `Err((Relay("h2 unavailable"), No))` immediately → falls back to h1
2. The h1 path calls `acquire()` → empty pool → cold `open()` → fresh TCP+TLS handshake to `connect_host:443`
3. The same network conditions that stalled h2 also stall h1; the cold open exceeds the 30s `batch_timeout` enforced in `dispatch_full_tunnel`
4. The user sees `batch timed out after 30s` while apps_script mode keeps working

**Fix** (two commits, both `domain_fronter.rs`-only):

Commit 1 — `warm h1 pool in parallel with h2`: spawn the h2 prewarm in a separate task so the h1 prewarm loop runs concurrently. All `n` h1 sockets are warm before user traffic, even when the h2 handshake stalls or hits its 8s timeout. `run_pool_refill` trims the pool back to `POOL_MIN_H2_FALLBACK = 2` within 5s once h2 lands as the fast path.

Commit 2 — `bound h1 open() + detect dead h2 cells synchronously`:

- `H1_OPEN_TIMEOUT_SECS = 8` wraps the TCP+TLS handshake in `open()` so a stuck handshake to a blackholed `connect_host:443` doesn't block `acquire()` until the outer 30s batch budget elapses (the same symptom #924 hits during the warm-race window).
- `H2Cell.dead: Arc<AtomicBool>` is flipped by the connection driver task when `Connection::await` ends (GOAWAY, network error, normal close). `ensure_h2`'s fast path and `run_pool_refill`'s pool-target check both consult the flag — known-dead cells are rejected within ≤5s instead of waiting for `H2_CONN_TTL_SECS = 540s` to expire or for a request to discover the breakage via a `ready()` failure.
**API impact**: `h2_handshake_post_tls`'s return type changes to `(SendRequest, Arc<AtomicBool>)`. One existing test (`h2_handshake_post_tls_returns_alpn_refused_when_peer_picks_h1`) tweaks its `Ok` arm to match — panic message unchanged.

**Verified locally on top of v1.9.19**:

- `cargo test --lib --release`: 209/209 (was 208; +1 new test `ensure_h2_rejects_dead_cell_within_ttl`)
- `cargo build --release --features ui --bin mhrv-rs-ui`: clean
- `(cd tunnel-node && cargo test --release)`: 36/36

**Live end-to-end** (from the PR description):

- 5 cold restarts (warm-up race window): 5/5 pass, 9.6–22.5s
- Concurrent burst (5 simultaneous SOCKS5 streams): 5/5
- Default full.json baseline: 200 OK in 13.3s
- `force_http1: true` sanity: 200 OK in 17.7s

**A/B vs PR #903** (per-session pipelining): the commits land in disjoint functions and cherry-pick cleanly on top. If #903 lands first, this needs only a mechanical rebase.

Exemplary debugging work — a bisect with concrete probe data, root-cause identification down to a specific commit and line, a working fix with bounded timeouts on the two adjacent paths the same stall pattern could recur through, and +1 regression test. The kind of PR that lands fast.

Closes #924. Reviewed via Anthropic Claude.

Co-Authored-By: rezaisrad <noreply@github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
i am suspicious these reports were related to the h2 full-tunnel path, which is being reverted. i'm skeptical whether this changes anything or just complicates full tunnel mode; please provide data showing this is actually helping. have you tested and seen a noticeable change? because i did, and i see nothing dramatically changed as the description suggests. |
|
youtube does not work like this: youtube already opens multiple calls to the google videos cdn, and batching already handles this; the tunnel node waits 1000 ms if we got data and batch-sends them all. |
|
@yyoyoian-pixel On YouTube specifically — yes, it opens ~6 CDN streams, but each becomes its own sid with its own serial loop. 6 × 1 in-flight aggregate, not 6 sharing one batch. The pipelined loop lifts each to 6 × 2. That said, you may genuinely not see a dramatic change for your workload, and that's expected:
The headline scenario is one TCP session doing a sustained download with multiple deployments. A clean test: single-stream `curl https://speed.cloudflare.com/__down?bytes=104857600` with 3+ deployments, v1.9.21 base, pipelining on vs off. Anything multi-session is structurally a bad test for what this PR fixes. |
Resolved the conflict in `src/domain_fronter.rs` (`tunnel_batch_request_with_timeout`). main (therealaleph#1040) intentionally removed the h2 fast path from tunnel batches because coalesced batches see no h2-multiplexing benefit. The branch's contribution — threading the caller-supplied `batch_timeout` (so pipelined batches honor `max(configured, PIPELINED_BATCH_TIMEOUT_FLOOR)`) — is preserved on the h1 header-read path, which is the only path now. Build: clean on both crates. Tests: 217/217 client lib, 57/57 tunnel-node.
|
what if session seq 1 drops? there are protocols like that; a sophisticated one like quic does this, but with very safe recovery. this PR adds 3600 lines of complexity which 99% of users won't even see. multiple deployments but one single tcp download barely ever happens; all good downloaders divide downloads into chunks anyway. |
|
@dazzling-no-more let's have this tested behind the censor, for all the apps: browsing, youtube, video calls, messaging (android would be enough). provide some feedback and make sure there is no regression, ideally also some numbers. |
|
@dazzling-no-more i'm fine with moving forward with this if tests are done and if we rebase on master after #1041 is merged. h2 should not be re-enabled for full tunnel mode. |
|
@yyoyoian-pixel I have already merged and solved the conflict. Test again. |
|
@dazzling-no-more 1041 is still open so we need to wait for that. also please share your test results behind the censor. i think this benefits uploads more than downloads, since sending to app script is a lot slower behind the censor than from the tunnel node. and if so: we already have a min pool of 8 ready connections in the client, so why not have a seq pool of 8? as many as the ready connections? why 2? |
|
@yyoyoian-pixel The community needs to test it. I am not in that region. |
@yyoyoian-pixel

> we already have 8 min pool of ready connections in client, why not having seq pool 8? as much as ready connections? why 2?

Different layers. The 8-connection pool is cross-session — it fills the bandwidth to Google's edge so any session's batches can get a connection without paying a TLS handshake. Per-session pipeline depth is about how many ops for a single session can be in flight server-side.

Depth > 2 doesn't help because the tunnel-node's per-session seq lock serializes the (write, drain) sequence for each session. With depth = 2 the second op is always queued at the server while the first is being processed — that's the full overlap. Depth = 8 would just queue 6 more ops on the server's seq lock without speeding anything up, and it costs 4× the reorder-buffer memory client-side.

The bandwidth argument doesn't apply within a session either: 30 concurrent slots per deployment already saturate Apps Script's per-account fan-out. Per-session pipelining adds RTT overlap on top of that, not more bandwidth. |
|
@dazzling-no-more i don't think your AI auto-reply and I are talking about the same thing. maybe it's better to add some drawings of before and after to the pull request so we can understand what is happening; this is a big change and I don't think it's right to merge it without properly understanding it. if we support ordering then we can send N ops in N different batches and the client handles ordering; except memory and cpu, nothing prevents us from doing so. and as I said earlier, download doesn't make any dramatic difference, since tunnel to app script is MUCH faster than client to app script — the bottleneck is not the former. |
Before — *(diagram; image not rendered in this extract)*
|
Depth > 2 has diminishing returns when client -> AS is not the bottleneck, but real returns when it is. The win is in the slow leg. |
> The win is in the slow leg

so want to try > 2? :) ideally as much as the min tcp connections, 8. |
Summary
Per-session pipelining for full-tunnel mode. Allows up to 2 in-flight `data` ops per session to overlap one Apps Script round-trip with the next on a single TCP stream, lifting the per-session throughput cap that came from strict request/await/request sequencing.

Wire protocol (forward-compatible — old peers ignore the new fields):
- `BatchOp.seq: Option<u64>` — per-session monotonic sequence number, sent only on `data` ops in pipelined sessions.
- `TunnelResponse.seq: Option<u64>` — echoed by the server.
- `TunnelResponse.caps: Option<u32>` — advertised on `connect`/`connect_data` success replies. Bit 0 = `CAPS_PIPELINE_SEQ`.
- `u64` rather than `u32` so a long-lived TCP session generating ~100 ops/s doesn't saturate after ~1.4 years.

Tunnel-node (new): when a `data` op carries `seq`, the server processes the entire (write, drain) sequence under a per-session seq lock and only releases it after `expected` advances. Two ops can travel through different deployments and arrive out of order — the seq lock guarantees they're processed (and replied to) in send order, so a client reorder buffer never sees bytes from seq=N+1 land before seq=N's bytes. Old `data` ops without `seq` continue down the legacy unordered path. `connect`/`connect_data` success replies set `caps: Some(CAPS_PIPELINE_SEQ)` — old clients ignore the unknown field, new clients opt into pipelining.

Client (new):
`tunnel_loop_pipelined` keeps up to `PIPELINE_DEPTH = 2` `data` ops in flight per pipelined session. Reordering is implicit — a `VecDeque<(seq, oneshot::Receiver)>` awaited in FIFO order means replies for later seqs sit in their oneshot buffers until our task gets to them. Idle sessions stay at depth=1 (the server long-polls; no speculative quota burn). Pipelining only enables when the connect reply advertised `caps` AND no prior reply has been observed dropping the `seq` field (mixed-version backend protection).

Cross-cutting correctness:
- A `BatchWait` primitive replaces N×M per-job watchers with one watcher per unique `SessionInner` (deduplicated by `Arc::as_ptr`).
- `reader_task` switched to `notify_waiters()` so concurrent batches all wake on each push.
- `pipelined_reply_timeout` derived as `effective_batch * 2 + slack` — covers the permit-saturated semaphore wait plus the op's own request budget.
- Ops with `had_uplink == true` wait per-session up to `ACTIVE_DRAIN_DEADLINE` and run a straggler-settle loop, so multi-packet upstream responses land in one drain instead of being split across batches. Idle ops still subscribe to the shared `BatchWait` for cross-session wake.
- `close` ops collected during dispatch are processed AFTER seq jobs ONLY when an earlier same-sid op is already deferred (else they run inline) — preserves both `[data(seq), close]` (data writes before close tears the session down) AND `[close, data]` (close runs first, data hits the closed session).
- A `client_send_closed` flag in `tunnel_loop_pipelined` so a TCP half-close from the local client stops queuing new ops but keeps draining already-queued reply oneshots — a valid HTTP request/response with client-side shutdown no longer drops downlink bytes.
- `tunnel_batch_request_with_timeout` (the 3-arg `tunnel_batch_request_to` kept as a backward-compat wrapper) so both h2 (`h2_relay_request`) and h1 (`read_http_response_with_timeout`) honor the pipelined-batch floor — avoids the inner 30 s default firing before our outer 60 s budget on user-tuned configs.
- Capped at `TCP_DRAIN_MAX_BYTES` per iteration (mirrors the seq path), so concurrent seq jobs continue to see headroom while Phase 2's drain is in flight.

Public API change
`TunnelMux::udp_open` and `udp_data` already take `impl Into<Bytes>` (carried over from the prior zero-copy PR). `tunnel_batch_request_to` keeps its 3-arg signature; the new `tunnel_batch_request_with_timeout` exposes the explicit timeout for callers that need it. Public structs `TunnelResponse` and `BatchOp` gain new optional fields with `#[serde(default)]` / `#[serde(skip_serializing_if = "Option::is_none")]` — old wire shapes deserialize cleanly and serialize identically.

Realistic perf delta
For active streaming downloads on a single TCP session, expect 30–80% throughput improvement depending on what fraction of the Apps Script RTT is server-side vs network. Less for idle / long-poll workloads (server-side seq enforcement serializes drains per session). No win on multi-session workloads (those were already parallel across sessions). Documented latency trade-off: a stuck seq op (a lost earlier seq from a dropped batch) holds the batch HTTP response open for up to `SEQ_WAIT_TIMEOUT = 30 s`; the pinned regression test `unrelated_seq_session_in_same_batch_is_not_delayed_past_seq_wait` ensures unrelated sessions' intrinsic processing time stays sub-second within that window.

Test plan
- `cargo build --bins --lib` clean on both crates
- `cargo test --lib` (client) + `cargo test` (tunnel-node) pass — 216 + 57 = 273 total
- `wait_for_seq_turn`, write-failure path bumps `expected` (`SeqAdvanceOnDrop`), tail preservation under cap, batch budget shared with seq jobs, EOF withholding while the buffer is non-empty, `[data(seq), close]` runs data before close, `[close, data]` runs close first
- `BatchWait` wakes all jobs on a single push, dedupes watchers per unique inner, wakes across concurrent batches on the same session, mixed ready+idle batch latency, Phase 2 budget reservation capped at `TCP_DRAIN_MAX_BYTES`
- `seq=0` first, in-seq-order output when replies arrive reversed, closes on seq mismatch, closes on `seq=None` (mixed-version detection), drains in-flight on client half-close, `pipelined_reply_timeout` covers `2× batch_timeout`, `mark_pipelining_disabled` sticky and overrides caps (`connect_data` capability detection)