Retry transient pool-acquire timeouts with exponential backoff by swlynch99 · Pull Request #104 · iopsystems/durable

swlynch99 · 2026-06-09T21:55:40Z

Summary

Add resilience to transient database connection pool timeouts by implementing automatic retry logic with exponential backoff instead of immediately tearing down the worker.

Key Changes

- New `SharedState` methods: added `acquire()` and `begin()` methods that wrap the underlying pool operations with retry logic for transient `PoolTimedOut` errors
- Retry mechanism: implemented a `PoolRetry` struct that tracks exponential backoff state and decides whether to retry or propagate errors
  - Retries only on `sqlx::Error::PoolTimedOut` errors
  - Respects a configurable maximum retry count
  - Uses exponential backoff (doubling each time) with a configurable maximum backoff duration
  - Remains responsive to shutdown signals during backoff periods
- Configuration: added three new config parameters to `Config`:
  - `pool_acquire_max_retries`: maximum number of retries (default: 5)
  - `pool_acquire_backoff`: initial backoff duration (default: 1 second)
  - `pool_acquire_max_backoff`: maximum backoff duration (default: 30 seconds)
- Updated call sites: replaced all direct `pool.acquire()` and `pool.begin()` calls throughout the worker with the new resilient methods
- Tests: added unit tests for the `next_backoff()` helper function to verify correct doubling behavior and overflow handling
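The doubling and overflow handling that the `next_backoff()` tests cover can be sketched as follows. This is a minimal illustration, not the code from the PR: the signature (current backoff plus a cap, both as `Duration`) is an assumption, since the description names the helper but does not show it.

```rust
use std::time::Duration;

/// Doubles `current` and clamps the result to `max`.
/// Hypothetical signature: the PR names `next_backoff()` but does not show it.
/// `saturating_mul` pins the value at `Duration::MAX` instead of panicking
/// when the doubled value would overflow.
fn next_backoff(current: Duration, max: Duration) -> Duration {
    current.saturating_mul(2).min(max)
}
```

With the defaults listed above, a 1 second backoff becomes 2 seconds, and a 16 second backoff is clamped to the 30 second cap rather than growing to 32 seconds.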

Implementation Details

- The retry logic is transparent to callers: they use the same API but get automatic resilience
- Transient pool timeouts (momentary spikes in connection demand) are now handled gracefully without worker teardown
- Sustained timeouts still escalate to worker teardown as before
- Shutdown signals interrupt backoff waits to prevent delays during graceful shutdown
- The backoff calculation uses saturating arithmetic to prevent overflow panics
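The behaviour described in this section can be illustrated with a small synchronous model. Everything here is a sketch under stated assumptions: `PoolError` stands in for `sqlx::Error` (the real code matches on `sqlx::Error::PoolTimedOut`), the `PoolRetry` fields and `on_error` method are hypothetical, the `sleep` callback stands in for the runtime-clock wait, and returning the timeout error when shutdown interrupts a wait is a guess rather than something the PR states.

```rust
use std::time::Duration;

/// Stand-in for the two cases the PR distinguishes (assumption: the real code
/// matches on `sqlx::Error::PoolTimedOut` versus every other error).
#[derive(Debug, PartialEq)]
enum PoolError {
    TimedOut,
    Other(String),
}

/// Hypothetical backoff state, loosely modelled on the `PoolRetry` struct.
struct PoolRetry {
    retries_left: u32,
    backoff: Duration,
    max_backoff: Duration,
}

impl PoolRetry {
    fn new(max_retries: u32, backoff: Duration, max_backoff: Duration) -> Self {
        Self { retries_left: max_retries, backoff, max_backoff }
    }

    /// Returns the delay to wait before the next attempt, or hands the error
    /// back when it is not a timeout or the retry budget is exhausted.
    fn on_error(&mut self, err: PoolError) -> Result<Duration, PoolError> {
        if err != PoolError::TimedOut || self.retries_left == 0 {
            return Err(err);
        }
        self.retries_left -= 1;
        let delay = self.backoff.min(self.max_backoff);
        // Saturating doubling, clamped to the configured cap.
        self.backoff = self.backoff.saturating_mul(2).min(self.max_backoff);
        Ok(delay)
    }
}

/// Runs `op` until it succeeds, fails with a non-transient error, exhausts
/// its retries, or is interrupted by shutdown. `sleep` returns `false` when
/// shutdown was signalled during the wait (assumed contract).
fn with_retry<T>(
    mut retry: PoolRetry,
    mut op: impl FnMut() -> Result<T, PoolError>,
    mut sleep: impl FnMut(Duration) -> bool,
) -> Result<T, PoolError> {
    loop {
        let err = match op() {
            Ok(value) => return Ok(value),
            Err(err) => err,
        };
        let delay = retry.on_error(err)?;
        if !sleep(delay) {
            // Assumption: an interrupted wait surfaces the timeout to the caller.
            return Err(PoolError::TimedOut);
        }
    }
}
```

Passing the sleep as a callback keeps the model deterministic: a test can record the requested delays or simulate a shutdown without waiting in real time.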

https://claude.ai/code/session_01D4z4t3D4r4pTAYcXCSK8Ex

When a long-running worker task (heartbeat, validate_workers, leader, process, spawn_new_tasks, load_leader_id) hit a transient `sqlx::Error::PoolTimedOut` while acquiring a pool connection, the error propagated out of the task, the `ShutdownGuard` raised the shutdown flag on drop, and the entire worker tore itself down, relying on the supervisor to rebuild it (worker row churn, re-acquiring leadership, re-claiming orphaned tasks) for what is often a momentary connection-pool spike.

Add `SharedState::acquire`/`SharedState::begin` helpers that retry the pool operation with exponential backoff on transient pool-acquire timeouts, only escalating to a worker teardown once the condition is sustained. The backoff sleep goes through the runtime clock (so DST stays deterministic) and remains responsive to shutdown so a graceful shutdown is not delayed by the full backoff period.

Retry behaviour is tunable via three new (defaulted, backwards-compatible) config options: `pool_acquire_max_retries` (default 5), `pool_acquire_backoff` (default 1s), and `pool_acquire_max_backoff` (default 30s). Setting `pool_acquire_max_retries` to 0 restores the previous fail-on-first-timeout behaviour.

The task-execution paths (run_task/run_task_impl) already treat pool timeouts as recoverable by suspending the task, and task_cleanup/stuck_notify already log-and-continue, so those are left unchanged.

https://claude.ai/code/session_01D4z4t3D4r4pTAYcXCSK8Ex
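To make the effect of the three options concrete, the sketch below computes the wait before each retry. The option names and default values come from the description above; the field types, the `backoff_schedule` helper, and the assumption that the first wait equals the initial backoff are illustrative and not taken from the PR.

```rust
use std::time::Duration;

/// Hypothetical mirror of the three new options. Names and defaults follow
/// the PR description; the field types are assumed.
#[derive(Debug, Clone)]
struct Config {
    pool_acquire_max_retries: u32,
    pool_acquire_backoff: Duration,
    pool_acquire_max_backoff: Duration,
}

impl Default for Config {
    fn default() -> Self {
        Self {
            pool_acquire_max_retries: 5,
            pool_acquire_backoff: Duration::from_secs(1),
            pool_acquire_max_backoff: Duration::from_secs(30),
        }
    }
}

/// Lists the wait before each retry (illustrative helper, not from the PR).
/// Assumes the first wait equals the initial backoff and each later wait
/// doubles, clamped to the maximum.
fn backoff_schedule(config: &Config) -> Vec<Duration> {
    let mut delay = config.pool_acquire_backoff.min(config.pool_acquire_max_backoff);
    let mut schedule = Vec::new();
    for _ in 0..config.pool_acquire_max_retries {
        schedule.push(delay);
        delay = delay.saturating_mul(2).min(config.pool_acquire_max_backoff);
    }
    schedule
}
```

Under these assumptions the defaults give waits of 1s, 2s, 4s, 8s, and 16s, so a sustained outage spends 31 seconds in backoff (on top of the pool's own acquire timeout for each attempt) before the worker tears down. With `pool_acquire_max_retries` set to 0 the schedule is empty and the first timeout propagates immediately.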

swlynch99 enabled auto-merge (squash) June 9, 2026 21:57

swlynch99 merged commit 892653c into main Jun 9, 2026
7 checks passed

swlynch99 deleted the claude/bold-bardeen-fo44yc branch June 9, 2026 21:58

swlynch99 merged 1 commit into main from claude/bold-bardeen-fo44yc
