Retry transient pool-acquire timeouts with exponential backoff #104

Merged: swlynch99 merged 1 commit into main from claude/bold-bardeen-fo44yc on Jun 9, 2026
@swlynch99

Summary

Make the worker resilient to transient database connection-pool timeouts by retrying automatically with exponential backoff instead of immediately tearing the worker down.

Key Changes

  • New SharedState methods: Added acquire() and begin() methods that wrap the underlying pool operations with retry logic for transient PoolTimedOut errors
  • Retry mechanism: Implemented a PoolRetry struct that tracks exponential backoff state and decides whether to retry or propagate an error
    • Retries only on sqlx::Error::PoolTimedOut errors
    • Respects a configurable maximum retry count
    • Uses exponential backoff (doubling each time) with a configurable maximum backoff duration
    • Remains responsive to shutdown signals during backoff periods
  • Configuration: Added three new config parameters to Config:
    • pool_acquire_max_retries: Maximum number of retries (default: 5)
    • pool_acquire_backoff: Initial backoff duration (default: 1 second)
    • pool_acquire_max_backoff: Maximum backoff duration (default: 30 seconds)
  • Updated call sites: Replaced all direct pool.acquire() and pool.begin() calls throughout the worker with the new resilient methods
  • Tests: Added unit tests for the next_backoff() helper function to verify correct doubling behavior and overflow handling
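
The retry decision described in this list can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `AcquireError` enum is a hypothetical stand-in for `sqlx::Error`, and the field names, the `on_error` method, and its signature are assumptions made for the example.

```rust
use std::time::Duration;

/// Hypothetical stand-in for the real error type; the PR matches on
/// `sqlx::Error::PoolTimedOut`.
#[derive(Debug, PartialEq)]
enum AcquireError {
    PoolTimedOut,
    Other(String),
}

/// Illustrative retry state. Field and method names are assumptions.
struct PoolRetry {
    retries_left: u32,
    backoff: Duration,
    max_backoff: Duration,
}

impl PoolRetry {
    fn new(max_retries: u32, backoff: Duration, max_backoff: Duration) -> Self {
        Self {
            retries_left: max_retries,
            backoff: backoff.min(max_backoff),
            max_backoff,
        }
    }

    /// Returns `Ok(delay)` if the caller should sleep for `delay` and retry.
    /// Returns the error unchanged if it is not a pool timeout or if the
    /// retry budget is exhausted.
    fn on_error(&mut self, err: AcquireError) -> Result<Duration, AcquireError> {
        if err != AcquireError::PoolTimedOut || self.retries_left == 0 {
            return Err(err);
        }
        self.retries_left -= 1;
        let delay = self.backoff;
        // Double for the next attempt: saturate on overflow, cap at the maximum.
        self.backoff = self.backoff.saturating_mul(2).min(self.max_backoff);
        Ok(delay)
    }
}
```

In this sketch a budget of 0 makes the first timeout propagate immediately, and any error other than a pool timeout is never retried.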

Implementation Details

  • The retry logic is transparent to callers: they use the same API but get automatic resilience
  • Transient pool timeouts (momentary spikes in connection demand) are now handled gracefully without worker teardown
  • Sustained timeouts still escalate to worker teardown as before
  • Shutdown signals interrupt backoff waits to prevent delays during graceful shutdown
  • The backoff calculation uses saturating arithmetic to prevent overflow panics

https://claude.ai/code/session_01D4z4t3D4r4pTAYcXCSK8Ex

Previously, when a long-running worker task (heartbeat, validate_workers,
leader, process, spawn_new_tasks, load_leader_id) hit a transient
`sqlx::Error::PoolTimedOut` while acquiring a pool connection, the error
propagated out of the task, the `ShutdownGuard` raised the shutdown flag
on drop, and the entire worker tore itself down. The supervisor then had
to rebuild it (worker row churn, re-acquiring leadership, re-claiming
orphaned tasks) for what is often a momentary connection-pool spike.

Add `SharedState::acquire`/`SharedState::begin` helpers that retry the
pool operation with exponential backoff on transient pool-acquire
timeouts, only escalating to a worker teardown once the condition is
sustained. The backoff sleep goes through the runtime clock (so DST stays
deterministic) and remains responsive to shutdown so a graceful shutdown
is not delayed by the full backoff period.
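
To illustrate what "responsive to shutdown" means for the backoff sleep, here is a blocking, std-only analogue. It is only a sketch under stated assumptions: the PR sleeps on the async runtime's clock, which this example does not model, and the `Wait` enum, the `backoff_wait` function, and the use of an `mpsc` channel as the shutdown signal are all hypothetical. In async code the same idea would typically be a select between the sleep and a shutdown future.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Outcome of a backoff wait (hypothetical type for this example).
#[derive(Debug, PartialEq)]
enum Wait {
    /// The full delay passed; the caller may retry the pool operation.
    Elapsed,
    /// Shutdown was requested; the caller should stop retrying.
    Shutdown,
}

/// Blocks for up to `delay`, returning early if a shutdown message arrives
/// or the sending side of the channel has been dropped.
fn backoff_wait(shutdown: &Receiver<()>, delay: Duration) -> Wait {
    match shutdown.recv_timeout(delay) {
        Err(RecvTimeoutError::Timeout) => Wait::Elapsed,
        Ok(()) | Err(RecvTimeoutError::Disconnected) => Wait::Shutdown,
    }
}
```

The point of the design is that a pending shutdown cuts the wait short, so a 30-second backoff does not add 30 seconds to a graceful shutdown.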

Retry behaviour is tunable via three new (defaulted, backwards-compatible)
config options: `pool_acquire_max_retries` (default 5),
`pool_acquire_backoff` (default 1s), and `pool_acquire_max_backoff`
(default 30s). Setting `pool_acquire_max_retries` to 0 restores the
previous fail-on-first-timeout behaviour.
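
As a worked example of what these defaults imply, the sketch below models the three options and lists the delays slept before each retry. The option names and default values come from this description; the field types, the `backoff_schedule` helper, and the assumption that the first retry waits exactly the initial backoff are illustrative guesses, not the PR's code.

```rust
use std::time::Duration;

/// Hypothetical subset of `Config`; field types are assumptions.
#[derive(Debug, Clone)]
struct Config {
    pool_acquire_max_retries: u32,
    pool_acquire_backoff: Duration,
    pool_acquire_max_backoff: Duration,
}

impl Default for Config {
    fn default() -> Self {
        Self {
            pool_acquire_max_retries: 5,
            pool_acquire_backoff: Duration::from_secs(1),
            pool_acquire_max_backoff: Duration::from_secs(30),
        }
    }
}

/// Delays slept before each retry, assuming the delay doubles after every
/// attempt and is capped at `pool_acquire_max_backoff`.
fn backoff_schedule(cfg: &Config) -> Vec<Duration> {
    let mut delay = cfg.pool_acquire_backoff.min(cfg.pool_acquire_max_backoff);
    let mut out = Vec::new();
    for _ in 0..cfg.pool_acquire_max_retries {
        out.push(delay);
        delay = delay.saturating_mul(2).min(cfg.pool_acquire_max_backoff);
    }
    out
}
```

Under these assumptions the defaults give delays of 1s, 2s, 4s, 8s, and 16s, so a sustained outage escalates after about 31 seconds of backoff, not counting the time each acquire attempt spends waiting on the pool. With `pool_acquire_max_retries` set to 0 the schedule is empty and the first timeout propagates.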

The task-execution paths (run_task/run_task_impl) already treat pool
timeouts as recoverable by suspending the task, and task_cleanup/
stuck_notify already log-and-continue, so those are left unchanged.

https://claude.ai/code/session_01D4z4t3D4r4pTAYcXCSK8Ex
@swlynch99 swlynch99 enabled auto-merge (squash) June 9, 2026 21:57
@swlynch99 swlynch99 merged commit 892653c into main Jun 9, 2026
7 checks passed
@swlynch99 swlynch99 deleted the claude/bold-bardeen-fo44yc branch June 9, 2026 21:58