fix: v1.9.15 full-mode warm-up race during h2 init (#924) #1029
therealaleph merged 3 commits into therealaleph:main from
Conversation
| // h1 pool above is already populated either way. JoinError
| // collapses to "h2 not alive" — same as if ensure_h2 returned
| // None — so we still log a useful line.
| let h2_alive = h2_handle.await.unwrap_or(false);
A panic in the spawned ensure_h2() task is now invisible — unwrap_or(false) swallows
JoinError and we just log "pool pre-warmed with N connection(s)" as if h2 simply wasn't
available. Pre-PR a panic here would have crashed the warm task and shown up; now it's
indistinguishable from a clean ALPN refusal.
Suggest:
let h2_alive = match h2_handle.await {
    Ok(v) => v,
    Err(e) => {
        tracing::warn!("h2 prewarm task failed to join: {}", e);
        false
    }
};
| // a fake address in `fronter_for_test`, so the open will fail
| // — and ensure_h2 returns None. The point is: the stale gen=1
| // SendRequest was NOT served.
| let result = fronter.ensure_h2().await;
This test relies on 127.0.0.1:443 being closed so open_h2's TCP-connect fails fast
with ECONNREFUSED. On most dev/CI hosts that holds, but if a developer has HTTPS bound on
localhost (mkcert/Caddy/local nginx) the connect will succeed and the TLS handshake will
stall up to H2_OPEN_TIMEOUT_SECS = 8s before the assertion finally passes — flaky and
slow.
The test only cares that a dead-flagged cell isn't served, not that the open path actually
runs. Pre-set the failure-backoff so ensure_h2 short-circuits before any network I/O:
*fronter.h2_open_failed_at.lock().await = Some(Instant::now());
That isolates the assertion to the new dead-flag check and makes the test deterministic
regardless of what's running on localhost.
Bonus: a follow-up assertion that *fronter.h2_cell.lock().await is None after the call
would lock down the "cleanup on failed reopen" invariant too.
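A self-contained sketch of that pattern, using a stand-in `Fronter` type and a made-up backoff window rather than the project's real `domain_fronter.rs` code; pre-arming the failure backoff means the assertion never touches the network:

```rust
use std::time::{Duration, Instant};
use tokio::sync::Mutex;

/// Stand-in for the real fronter; only the backoff field matters here.
struct Fronter {
    h2_open_failed_at: Mutex<Option<Instant>>,
}

impl Fronter {
    /// Hypothetical backoff window; the real constant lives in domain_fronter.rs.
    const H2_REOPEN_BACKOFF: Duration = Duration::from_secs(30);

    async fn ensure_h2(&self) -> Option<()> {
        // Fast path: a recent open failure short-circuits before any network I/O.
        if let Some(failed_at) = *self.h2_open_failed_at.lock().await {
            if failed_at.elapsed() < Self::H2_REOPEN_BACKOFF {
                return None;
            }
        }
        // The real code would attempt TCP + TLS + ALPN here; this test never gets that far.
        None
    }
}

#[tokio::test]
async fn dead_cell_is_not_served_and_no_network_io_happens() {
    let fronter = Fronter {
        h2_open_failed_at: Mutex::new(Some(Instant::now())),
    };
    assert!(fronter.ensure_h2().await.is_none());
}
```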
| /// prewarm behind `ensure_h2()` so the h1 pool stayed empty during
| /// the h2 init window.
| ///
| /// Staggered 500 ms apart so we don't burst N TLS handshakes at the
This paragraph predates the parallel-prewarm change. With h2 now spawned alongside h1[0],
startup actually fires two TLS handshakes back-to-back (h2 handshake + h1[0]) before any
500 ms stagger kicks in — the stagger only applies between h1[i] and h1[i+1] for i ≥ 1.
The new behavior is intentional and acceptable, but a future reader chasing "why are we
seeing 2 handshakes simultaneously at boot?" will be misled by the current wording.
Suggest one sentence noting that the spawned h2 handshake races h1[0], and the 500 ms
stagger only applies to h1[1..n].
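One possible wording, shown as a doc-comment sketch rather than a definitive diff:

```rust
/// The h2 prewarm is spawned concurrently, so its TLS handshake races h1[0];
/// the 500 ms stagger only separates h1[1..n].
```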
Bumps Cargo.toml v1.9.19 → v1.9.20 and ships the changelog. Headline fix: the v1.9.15 Full-mode regression tracked in #924 for ~3 weeks is resolved by @rezaisrad's PR #1029, with a bisect-quality root cause (h1 prewarm gated behind the h2 handshake; both stall on cold start under the same network conditions). Affected users can drop the `force_http1: true` workaround now.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Skip H2 for `tunnel_batch_request_to` (the Full-mode batch path). Tunnel batches already coalesce N ops into one HTTP request — H2 stream multiplexing has nothing to multiplex. The H2 try/fallback path on this specific code path introduced three regressions vs v1.9.14:

1. **Long-poll stalls**: idle polls completed at 16-17s (`LONGPOLL_DEADLINE` + network latency) instead of timing out at 10s on H1. Each poll held an Apps Script execution slot 60% longer.
2. **Silent batch drops**: `RequestSent::Maybe` failures dropped the entire batch with no retry — a failure mode H1 doesn't have.
3. **Pool starvation**: `POOL_MIN_H2_FALLBACK = 2` trimmed the H1 pool from 8 → 2 once H2 connected, but tunnel batches still used H1 and needed the full pool.

H2 multiplexing is **kept active for relay mode** (non-full), where each browser request is a separate HTTP call that genuinely benefits from stream multiplexing. r0ar's controlled A/B test in #962 confirmed h2 is strictly better than `force_http1: true` for apps_script-mode users, and that path is unchanged here.

## Changes

- `tunnel_batch_request_to`: remove H2 try/fallback/NonRetryable block, go straight to H1 pool `acquire()` (-54 lines; sketched below).
- `run_pool_refill`: always maintain `POOL_MIN = 8`. Remove the `POOL_MIN_H2_FALLBACK = 2` trim that was starving tunnel batches (-12 lines).

## A/B results (Pixel 6 Pro, 30 batch samples each)

| Metric | H2 (stock v1.9.20) | H1 (this PR) | v1.9.14 (baseline) |
|---|---|---|---|
| 16-17s batches | **8-10/30** | **0/30** | **0/30** |
| 10s timeouts | 0 | 4/30 | 5/30 |
| Active RTTs | 1.4-2.4s | 1.3-2.2s | 1.4-2.3s |

Restores v1.9.14 tunnel performance while keeping all v1.9.15+ improvements (H2 for relay, zero-copy mux, block DoH/QUIC, TLS pool tuning, PR #1029's warm-race fix).

## Verified locally on top of v1.9.20

- `cargo test --lib --release`: 209/209 ✅ (matches v1.9.20 baseline)
- `cargo build --release --features ui --bin mhrv-rs-ui`: clean ✅

## Interaction with PR #1029 (just shipped in v1.9.20)

PR #1029 added `H2Cell.dead: Arc<AtomicBool>` for synchronous dead-cell detection. With this PR removing the H2 path for tunnel batches, the dead-cell flag is no longer consulted on the tunnel batch path — that's intentional (the flag now scopes to relay mode, which is the path it was protecting in practice).

Reviewed via Anthropic Claude.

Co-Authored-By: yyoyoian-pixel <noreply@github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
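For illustration, the shape of the simplified batch path described above, with stub types so it stands alone; the real `tunnel_batch_request_to`, pool, and error types live in the project and differ in signature:

```rust
// Stub types standing in for the project's real H1 pool / batch / response types.
struct H1Pool;
struct Conn;
struct Batch;
struct Response;

#[derive(Debug)]
struct TunnelError(String);

impl H1Pool {
    async fn acquire(&self) -> Result<Conn, TunnelError> {
        Ok(Conn) // real code: reuse a warm socket or open() a new one under POOL_MIN = 8
    }
}

impl Conn {
    async fn send_batch(&mut self, _batch: Batch) -> Result<Response, TunnelError> {
        Ok(Response)
    }
}

// After this PR there is no H2 try / RequestSent::Maybe / fallback block here:
// a tunnel batch is already one coalesced HTTP request, so it goes straight to H1.
async fn tunnel_batch_request_to(pool: &H1Pool, batch: Batch) -> Result<Response, TunnelError> {
    let mut conn = pool.acquire().await?;
    conn.send_batch(batch).await
}
```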
Summary
Fixes #924, the canonical tracking thread for an 18+ duplicate cluster of "Full-tunnel mode broken since v1.9.15" reports.
What affected users see:

- `batch timed out after 30s` and `tunnel connect_data error … batch timed out` in mhrv-rs logs
- `curl: (28) SSL connection timeout` against `127.0.0.1:8086` (the SOCKS5 listener)
- symptoms clear with the `"force_http1": true` kill switch added by PR feat: HTTP/2 multiplexing on relay leg with idempotency-safe h1 fallback #799 itself

Root cause:
Bisect (see below) lands on PR #799, which added HTTP/2 multiplexing on the relay leg. That change is correct in steady state — one multiplexed TCP/TLS socket replacing the legacy 8–80-socket pool fixes the head-of-line blocking that #781/#773 were filed for. The bug is one line of startup ordering inside `warm()`: `ensure_h2().await` — `ensure_h2()` is bounded by `H2_OPEN_TIMEOUT_SECS` (8 s) but can take the full 8 s on a cold first connection, and during that window the h1 pool is empty.

Fallout chain when a request arrives in the warm-up window:
- `ensure_h2()` is still holding `h2_open_lock` (or has yet to populate the cell) — the request gets `Err((Relay("h2 unavailable"), No))` immediately and falls back to h1
- `acquire()` → empty pool → `open()` → fresh TCP+TLS to `connect_host:443`
- that cold handshake burns into the `batch_timeout` enforced in `tunnel_client.rs::dispatch_full_tunnel`

This PR:
- warms the h1 pool in parallel with the h2 handshake; `run_pool_refill` trims the h1 pool back to `POOL_MIN_H2_FALLBACK` (2) once h2 lands
- bounds h1 `open()` and adds synchronous detection of mid-session h2 deaths, closing two adjacent paths where the same stall pattern could recur

Bisect
Bisected `git log v1.9.14..v1.9.15` (4 commits, 2 functional). The kill switch added by PR #799 itself made it possible to confirm the SHA at runtime instead of running a full `git bisect run`:

- `v1.9.14` baseline: works
- `v1.9.15` default (h2 enabled): `curl: (28) SSL connection timeout` after 12 s
- `v1.9.15` + `"force_http1": true`: works

Bisected commit:
`0e678630a8bde9ccb02b9d40dd2de4bc65bdfbf4` ("HTTP/2 multiplexing on relay leg with idempotency-safe h1 fallback", PR #799).

Symptoms
Unique log lines on the failing default-`v1.9.15` run (`comm -23 bad good`):

A `curl` to the same Apps Script deployment with the same `AUTH_KEY` cleanly drives a tunnel-node session.

Changes
Commit 1: `fix: warm h1 pool in parallel with h2 (#924)`

- *n* h1 sockets (6–16 depending on `num_scripts.clamp`) are warm before user traffic, even when the h2 handshake is slow or hits its 8 s timeout (sketched below)
- `run_pool_refill` trims the pool back to `POOL_MIN_H2_FALLBACK` (2) within 5 s when h2 ends up the active fast path
- surplus sockets then age out via `POOL_TTL_SECS` rather than being kept alive
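A minimal sketch of that shape, assuming stand-in `Fronter` methods (`ensure_h2`, `warm_h1_pool`) rather than the real `domain_fronter.rs` API, and using the JoinError handling suggested in the review thread above:

```rust
use std::sync::Arc;

/// Stand-in for the real fronter; only the call shape matters here.
struct Fronter;

impl Fronter {
    async fn ensure_h2(&self) -> Option<()> {
        None // stub: pretend the peer refused ALPN h2
    }
    async fn warm_h1_pool(&self) -> usize {
        8 // stub: pretend eight h1 sockets were opened
    }
}

async fn warm(fronter: Arc<Fronter>) {
    // Spawn the h2 handshake so it no longer gates the h1 prewarm.
    let h2_handle = tokio::spawn({
        let f = Arc::clone(&fronter);
        async move { f.ensure_h2().await.is_some() }
    });

    // The h1 pool fills while the h2 handshake (bounded by its 8 s timeout) is in flight.
    let n = fronter.warm_h1_pool().await;

    // Join without hiding a panic in the spawned task (see the review thread above).
    let h2_alive = match h2_handle.await {
        Ok(alive) => alive,
        Err(e) => {
            eprintln!("h2 prewarm task failed to join: {e}");
            false
        }
    };
    eprintln!("pool pre-warmed with {n} connection(s), h2_alive={h2_alive}");
}
```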
Commit 2: `fix: bound h1 open() with timeout, detect dead h2 cells synchronously`

Two changes, both `domain_fronter.rs`-only:

- `H1_OPEN_TIMEOUT_SECS = 8` wraps the TCP+TLS handshake in `open()`; without it, a hang to `connect_host:443` blocks `acquire()` (and the `warm()` prewarm loop) until the outer 30 s batch budget elapses
- `H2Cell.dead: Arc<AtomicBool>`, flipped by the connection driver task when `Connection::await` ends (GOAWAY, network error, normal close); `ensure_h2`'s fast path and `run_pool_refill`'s pool-target check both consult the flag instead of waiting for `H2_CONN_TTL_SECS` (540 s) to expire or for a request to discover the breakage via `ready()` failure (a generic sketch follows this list)
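A generic, self-contained sketch of the dead-flag pattern; a plain future stands in for the real h2 connection driver, and the real cell plumbing differs:

```rust
use std::future::Future;
use std::sync::{
    atomic::{AtomicBool, Ordering},
    Arc,
};

/// Spawn the connection driver and return a flag that flips as soon as it ends.
/// In the real code the future is the h2 connection task; GOAWAY, a network
/// error, or a normal close all resolve it and mark the cell dead.
fn spawn_driver<F>(conn: F) -> Arc<AtomicBool>
where
    F: Future<Output = ()> + Send + 'static,
{
    let dead = Arc::new(AtomicBool::new(false));
    let flag = Arc::clone(&dead);
    tokio::spawn(async move {
        conn.await;
        flag.store(true, Ordering::SeqCst);
    });
    dead
}

// Callers (ensure_h2's fast path, run_pool_refill's pool-target check) can then
// test liveness synchronously, without awaiting a TTL expiry or a failed ready():
//
//     if cell_dead.load(Ordering::SeqCst) { /* drop the cell and reopen */ }
```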
API impact:

- `h2_handshake_post_tls`'s return type changes to `(SendRequest, Arc<AtomicBool>)`
- one existing test (`h2_handshake_post_tls_returns_alpn_refused_when_peer_picks_h1`) tweaks its `Ok` arm to match — panic message unchanged

Testing
Unit + integration

- a new test covers the dead-cell fast path (`ensure_h2_rejects_dead_cell_within_ttl`)

Live end-to-end
Topology:

- `216.239.38.120` as the `google_ip` front
- `full.json` baseline (no `force_http1`)
- `force_http1: true` (sanity, h2 disabled)

A/B vs PR #903 (per-session pipelining, currently approved-but-unmerged)
- checked against the `pr/903` head
- this PR touches `warm()` / `open()` / `H2Cell`; #903 touches `tunnel_batch_request_to` / `parse_redirect` / `read_http_response_with_timeout`
- ran `pr/903` alone and `pr/903` + this PR

If #903 lands first, this PR will need a small rebase but the overlap is mechanical.
Reproducing
A public-facing tunnel-node is required (Apps Script lives in Google's cloud, can't reach localhost).
Stand up tunnel-node + a public URL:
    docker run -d --name mhrv-tunnel-test \
      -p 127.0.0.1:8080:8080 \
      -e TUNNEL_AUTH_KEY="$(openssl rand -hex 24)" \
      ghcr.io/therealaleph/mhrv-tunnel-node:latest

    cloudflared tunnel --no-autoupdate --protocol http2 --url http://127.0.0.1:8080

(Or use a real VPS / Cloud Run / GitHub Actions — see `assets/github-actions-tunnel/`.)
Deploy `assets/apps_script/CodeFull.gs` as a Web App with:

- `TUNNEL_SERVER_URL` pointing at the URL from step 1
- `TUNNEL_AUTH_KEY` matching the `docker run -e` value
- `AUTH_KEY`
Write `config.full.json` with:

- `"mode": "full"`
- `"google_ip": "216.239.38.120"`
- `"front_domain": "www.google.com"`
- `script_id`
- `auth_key`
Reproduce the regression on `v1.9.15`:

Verify on this branch:
Probe-time variance comes from the Tailscale / Apps Script / cloudflared chain, not the proxy itself. On a direct VPS path, expect 3–5 s.
Out of scope
- `H2_OPEN_TIMEOUT_SECS` (8 s) occasionally hitting on a cold first connection. Variable network behaviour (Apps Script container warm-up, Google edge cache miss, exit-node tunneling latency), not a code bug. The warm-race + dead-flag + open-timeout fixes in this PR mean a slow h2 init is gracefully handled by the warm h1 pool.