fix: v1.9.15 full-mode warm-up race during h2 init (#924) #1029
therealaleph merged 3 commits into therealaleph:main from
Conversation
| // h1 pool above is already populated either way. JoinError
| // collapses to "h2 not alive" — same as if ensure_h2 returned
| // None — so we still log a useful line.
| let h2_alive = h2_handle.await.unwrap_or(false);
A panic in the spawned ensure_h2() task is now invisible — unwrap_or(false) swallows
JoinError and we just log "pool pre-warmed with N connection(s)" as if h2 simply wasn't
available. Pre-PR a panic here would have crashed the warm task and shown up; now it's
indistinguishable from a clean ALPN refusal.
Suggest:
let h2_alive = match h2_handle.await {
    Ok(v) => v,
    Err(e) => {
        tracing::warn!("h2 prewarm task failed to join: {}", e);
        false
    }
};
| // a fake address in `fronter_for_test`, so the open will fail
| // — and ensure_h2 returns None. The point is: the stale gen=1
| // SendRequest was NOT served.
| let result = fronter.ensure_h2().await;
This test relies on 127.0.0.1:443 being closed so open_h2's TCP-connect fails fast
with ECONNREFUSED. On most dev/CI hosts that holds, but if a developer has HTTPS bound on
localhost (mkcert/Caddy/local nginx) the connect will succeed and the TLS handshake will
stall up to H2_OPEN_TIMEOUT_SECS = 8s before the assertion finally passes — flaky and
slow.
The test only cares that a dead-flagged cell isn't served, not that the open path actually
runs. Pre-set the failure-backoff so ensure_h2 short-circuits before any network I/O:
*fronter.h2_open_failed_at.lock().await = Some(Instant::now());
That isolates the assertion to the new dead-flag check and makes the test deterministic
regardless of what's running on localhost.
Bonus: a follow-up assertion that *fronter.h2_cell.lock().await is None after the call
would lock down the "cleanup on failed reopen" invariant too.
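A self-contained sketch of that pattern, using a stand-in `Fronter` type and a made-up backoff window rather than the project's real `domain_fronter.rs` code; pre-arming the failure backoff means the assertion never touches the network:

```rust
use std::time::{Duration, Instant};
use tokio::sync::Mutex;

/// Stand-in for the real fronter; only the backoff field matters here.
struct Fronter {
    h2_open_failed_at: Mutex<Option<Instant>>,
}

impl Fronter {
    /// Hypothetical backoff window; the real constant lives in domain_fronter.rs.
    const H2_REOPEN_BACKOFF: Duration = Duration::from_secs(30);

    async fn ensure_h2(&self) -> Option<()> {
        // Fast path: a recent open failure short-circuits before any network I/O.
        if let Some(failed_at) = *self.h2_open_failed_at.lock().await {
            if failed_at.elapsed() < Self::H2_REOPEN_BACKOFF {
                return None;
            }
        }
        // The real code would attempt TCP + TLS + ALPN here; this test never gets that far.
        None
    }
}

#[tokio::test]
async fn dead_cell_is_not_served_and_no_network_io_happens() {
    let fronter = Fronter {
        h2_open_failed_at: Mutex::new(Some(Instant::now())),
    };
    assert!(fronter.ensure_h2().await.is_none());
}
```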
| /// prewarm behind `ensure_h2()` so the h1 pool stayed empty during
| /// the h2 init window.
| ///
| /// Staggered 500 ms apart so we don't burst N TLS handshakes at the
This paragraph predates the parallel-prewarm change. With h2 now spawned alongside h1[0],
startup actually fires two TLS handshakes back-to-back (h2 handshake + h1[0]) before any
500 ms stagger kicks in — the stagger only applies between h1[i] and h1[i+1] for i ≥ 1.
The new behavior is intentional and acceptable, but a future reader chasing "why are we
seeing 2 handshakes simultaneously at boot?" will be misled by the current wording.
Suggest one sentence noting that the spawned h2 handshake races h1[0], and the 500 ms
stagger only applies to h1[1..n].
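One possible wording, shown as a doc-comment sketch rather than a definitive diff:

```rust
/// The h2 prewarm is spawned concurrently, so its TLS handshake races h1[0];
/// the 500 ms stagger only separates h1[1..n].
```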
Bumps Cargo.toml v1.9.19 → v1.9.20 and ships the changelog. Headline fix: the v1.9.15 Full-mode regression tracked in #924 for ~3 weeks is resolved by @rezaisrad's PR #1029, with a bisect-quality root cause (h1 prewarm gated behind the h2 handshake; both stall on cold start under the same network conditions). Affected users can drop the `force_http1: true` workaround now.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Skip H2 for `tunnel_batch_request_to` (the Full-mode batch path). Tunnel batches already coalesce N ops into one HTTP request — H2 stream multiplexing has nothing to multiplex. The H2 try/fallback path on this specific code path introduced three regressions vs v1.9.14:

1. **Long-poll stalls**: idle polls completed at 16-17s (`LONGPOLL_DEADLINE` + network latency) instead of timing out at 10s on H1. Each poll held an Apps Script execution slot 60% longer.
2. **Silent batch drops**: `RequestSent::Maybe` failures dropped the entire batch with no retry — a failure mode H1 doesn't have.
3. **Pool starvation**: `POOL_MIN_H2_FALLBACK = 2` trimmed the H1 pool from 8 → 2 once H2 connected, but tunnel batches still used H1 and needed the full pool.

H2 multiplexing is **kept active for relay mode** (non-full), where each browser request is a separate HTTP call that genuinely benefits from stream multiplexing. r0ar's controlled A/B test in #962 confirmed h2 is strictly better than `force_http1: true` for apps_script-mode users, and that path is unchanged here.

## Changes

- `tunnel_batch_request_to`: remove H2 try/fallback/NonRetryable block, go straight to H1 pool `acquire()` (-54 lines; sketched below).
- `run_pool_refill`: always maintain `POOL_MIN = 8`. Remove the `POOL_MIN_H2_FALLBACK = 2` trim that was starving tunnel batches (-12 lines).

## A/B results (Pixel 6 Pro, 30 batch samples each)

| Metric | H2 (stock v1.9.20) | H1 (this PR) | v1.9.14 (baseline) |
|---|---|---|---|
| 16-17s batches | **8-10/30** | **0/30** | **0/30** |
| 10s timeouts | 0 | 4/30 | 5/30 |
| Active RTTs | 1.4-2.4s | 1.3-2.2s | 1.4-2.3s |

Restores v1.9.14 tunnel performance while keeping all v1.9.15+ improvements (H2 for relay, zero-copy mux, block DoH/QUIC, TLS pool tuning, PR #1029's warm-race fix).

## Verified locally on top of v1.9.20

- `cargo test --lib --release`: 209/209 ✅ (matches v1.9.20 baseline)
- `cargo build --release --features ui --bin mhrv-rs-ui`: clean ✅

## Interaction with PR #1029 (just shipped in v1.9.20)

PR #1029 added `H2Cell.dead: Arc<AtomicBool>` for synchronous dead-cell detection. With this PR removing the H2 path for tunnel batches, the dead-cell flag is no longer consulted on the tunnel batch path — that's intentional (the flag now scopes to relay mode, which is the path it was protecting in practice).

Reviewed via Anthropic Claude.

Co-Authored-By: yyoyoian-pixel <noreply@github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
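For illustration, the shape of the simplified batch path described above, with stub types so it stands alone; the real `tunnel_batch_request_to`, pool, and error types live in the project and differ in signature:

```rust
// Stub types standing in for the project's real H1 pool / batch / response types.
struct H1Pool;
struct Conn;
struct Batch;
struct Response;

#[derive(Debug)]
struct TunnelError(String);

impl H1Pool {
    async fn acquire(&self) -> Result<Conn, TunnelError> {
        Ok(Conn) // real code: reuse a warm socket or open() a new one under POOL_MIN = 8
    }
}

impl Conn {
    async fn send_batch(&mut self, _batch: Batch) -> Result<Response, TunnelError> {
        Ok(Response)
    }
}

// After this PR there is no H2 try / RequestSent::Maybe / fallback block here:
// a tunnel batch is already one coalesced HTTP request, so it goes straight to H1.
async fn tunnel_batch_request_to(pool: &H1Pool, batch: Batch) -> Result<Response, TunnelError> {
    let mut conn = pool.acquire().await?;
    conn.send_batch(batch).await
}
```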
Summary
Fixes #924, the canonical tracking thread for an 18+ duplicate cluster of "Full-tunnel mode broken since v1.9.15" reports.
What affected users see:

- `batch timed out after 30s` and `tunnel connect_data error … batch timed out` in mhrv-rs logs
- `curl: (28) SSL connection timeout` against `127.0.0.1:8086` (the SOCKS5 listener)
- symptoms clear with the `"force_http1": true` kill switch added by PR feat: HTTP/2 multiplexing on relay leg with idempotency-safe h1 fallback #799 itself

Root cause:
Bisect (see below) lands on PR #799, which added HTTP/2 multiplexing on the relay leg. That change is correct in steady state — one multiplexed TCP/TLS socket replacing the legacy 8–80-socket pool fixes the head-of-line blocking that #781/#773 were filed for. The bug is one line of startup ordering inside `warm()`: `ensure_h2().await` — `ensure_h2()` is bounded by `H2_OPEN_TIMEOUT_SECS` (8 s) but can take the full 8 s on a cold first connection, and during that window the h1 pool is empty.

Fallout chain when a request arrives in the warm-up window:
- `ensure_h2()` is still holding `h2_open_lock` (or has yet to populate the cell) — the request gets `Err((Relay("h2 unavailable"), No))` immediately and falls back to h1
- `acquire()` → empty pool → `open()` → fresh TCP+TLS to `connect_host:443`
- that cold handshake burns into the `batch_timeout` enforced in `tunnel_client.rs::dispatch_full_tunnel`

This PR:
- warms the h1 pool in parallel with the h2 handshake; `run_pool_refill` trims the h1 pool back to `POOL_MIN_H2_FALLBACK` (2) once h2 lands
- bounds h1 `open()` and adds synchronous detection of mid-session h2 deaths, closing two adjacent paths where the same stall pattern could recur

Bisect
Bisected `git log v1.9.14..v1.9.15` (4 commits, 2 functional). The kill switch added by PR #799 itself made it possible to confirm the SHA at runtime instead of running a full `git bisect run`:

- `v1.9.14` baseline: works
- `v1.9.15` default (h2 enabled): `curl: (28) SSL connection timeout` after 12 s
- `v1.9.15` + `"force_http1": true`: works

Bisected commit:
`0e678630a8bde9ccb02b9d40dd2de4bc65bdfbf4` ("HTTP/2 multiplexing on relay leg with idempotency-safe h1 fallback", PR #799).

Symptoms
Unique log lines on the failing default-`v1.9.15` run (`comm -23 bad good`):

A `curl` to the same Apps Script deployment with the same `AUTH_KEY` cleanly drives a tunnel-node session.

Changes
Commit 1: `fix: warm h1 pool in parallel with h2 (#924)`

- *n* h1 sockets (6–16 depending on `num_scripts.clamp`) are warm before user traffic, even when the h2 handshake is slow or hits its 8 s timeout (sketched below)
- `run_pool_refill` trims the pool back to `POOL_MIN_H2_FALLBACK` (2) within 5 s when h2 ends up the active fast path
- surplus sockets then age out via `POOL_TTL_SECS` rather than being kept alive
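A minimal sketch of that shape, assuming stand-in `Fronter` methods (`ensure_h2`, `warm_h1_pool`) rather than the real `domain_fronter.rs` API, and using the JoinError handling suggested in the review thread above:

```rust
use std::sync::Arc;

/// Stand-in for the real fronter; only the call shape matters here.
struct Fronter;

impl Fronter {
    async fn ensure_h2(&self) -> Option<()> {
        None // stub: pretend the peer refused ALPN h2
    }
    async fn warm_h1_pool(&self) -> usize {
        8 // stub: pretend eight h1 sockets were opened
    }
}

async fn warm(fronter: Arc<Fronter>) {
    // Spawn the h2 handshake so it no longer gates the h1 prewarm.
    let h2_handle = tokio::spawn({
        let f = Arc::clone(&fronter);
        async move { f.ensure_h2().await.is_some() }
    });

    // The h1 pool fills while the h2 handshake (bounded by its 8 s timeout) is in flight.
    let n = fronter.warm_h1_pool().await;

    // Join without hiding a panic in the spawned task (see the review thread above).
    let h2_alive = match h2_handle.await {
        Ok(alive) => alive,
        Err(e) => {
            eprintln!("h2 prewarm task failed to join: {e}");
            false
        }
    };
    eprintln!("pool pre-warmed with {n} connection(s), h2_alive={h2_alive}");
}
```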
Commit 2: `fix: bound h1 open() with timeout, detect dead h2 cells synchronously`

Two changes, both `domain_fronter.rs`-only:

- `H1_OPEN_TIMEOUT_SECS = 8` wraps the TCP+TLS handshake in `open()`; without it, a hang to `connect_host:443` blocks `acquire()` (and the `warm()` prewarm loop) until the outer 30 s batch budget elapses
- `H2Cell.dead: Arc<AtomicBool>`, flipped by the connection driver task when `Connection::await` ends (GOAWAY, network error, normal close); `ensure_h2`'s fast path and `run_pool_refill`'s pool-target check both consult the flag instead of waiting for `H2_CONN_TTL_SECS` (540 s) to expire or for a request to discover the breakage via `ready()` failure (a generic sketch follows this list)
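A generic, self-contained sketch of the dead-flag pattern; a plain future stands in for the real h2 connection driver, and the real cell plumbing differs:

```rust
use std::future::Future;
use std::sync::{
    atomic::{AtomicBool, Ordering},
    Arc,
};

/// Spawn the connection driver and return a flag that flips as soon as it ends.
/// In the real code the future is the h2 connection task; GOAWAY, a network
/// error, or a normal close all resolve it and mark the cell dead.
fn spawn_driver<F>(conn: F) -> Arc<AtomicBool>
where
    F: Future<Output = ()> + Send + 'static,
{
    let dead = Arc::new(AtomicBool::new(false));
    let flag = Arc::clone(&dead);
    tokio::spawn(async move {
        conn.await;
        flag.store(true, Ordering::SeqCst);
    });
    dead
}

// Callers (ensure_h2's fast path, run_pool_refill's pool-target check) can then
// test liveness synchronously, without awaiting a TTL expiry or a failed ready():
//
//     if cell_dead.load(Ordering::SeqCst) { /* drop the cell and reopen */ }
```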
API impact:

- `h2_handshake_post_tls`'s return type changes to `(SendRequest, Arc<AtomicBool>)`
- one existing test (`h2_handshake_post_tls_returns_alpn_refused_when_peer_picks_h1`) tweaks its `Ok` arm to match — panic message unchanged

Testing
Unit + integration

- a new test covers the dead-cell fast path (`ensure_h2_rejects_dead_cell_within_ttl`)

Live end-to-end
Topology:

- `216.239.38.120` as the `google_ip` front
- `full.json` baseline (no `force_http1`)
- `force_http1: true` (sanity, h2 disabled)

A/B vs PR #903 (per-session pipelining, currently approved-but-unmerged)
- checked against the `pr/903` head
- this PR touches `warm()` / `open()` / `H2Cell`; #903 touches `tunnel_batch_request_to` / `parse_redirect` / `read_http_response_with_timeout`
- ran `pr/903` alone and `pr/903` + this PR

If #903 lands first, this PR will need a small rebase but the overlap is mechanical.
Reproducing
A public-facing tunnel-node is required (Apps Script lives in Google's cloud, can't reach localhost).
Stand up tunnel-node + a public URL:
    docker run -d --name mhrv-tunnel-test \
      -p 127.0.0.1:8080:8080 \
      -e TUNNEL_AUTH_KEY="$(openssl rand -hex 24)" \
      ghcr.io/therealaleph/mhrv-tunnel-node:latest

    cloudflared tunnel --no-autoupdate --protocol http2 --url http://127.0.0.1:8080

(Or use a real VPS / Cloud Run / GitHub Actions — see `assets/github-actions-tunnel/`.)
Deploy `assets/apps_script/CodeFull.gs` as a Web App with:

- `TUNNEL_SERVER_URL` pointing at the URL from step 1
- `TUNNEL_AUTH_KEY` matching the `docker run -e` value
- `AUTH_KEY`
Write `config.full.json` with:

- `"mode": "full"`
- `"google_ip": "216.239.38.120"`
- `"front_domain": "www.google.com"`
- `script_id`
- `auth_key`
Reproduce the regression on `v1.9.15`:

Verify on this branch:
Probe-time variance comes from the Tailscale / Apps Script / cloudflared chain, not the proxy itself. On a direct VPS path, expect 3–5 s.
Out of scope
- `H2_OPEN_TIMEOUT_SECS` (8 s) occasionally hitting on a cold first connection. Variable network behaviour (Apps Script container warm-up, Google edge cache miss, exit-node tunneling latency), not a code bug. The warm-race + dead-flag + open-timeout fixes in this PR mean a slow h2 init is gracefully handled by the warm h1 pool.