Fix and register error_message_on_failure_to_fetch_snapshot test #7733
The test was added in #7317 but never added to a runner. The original logic was also broken: it stopped the primary before joining, so no StartupSeqnoIsOld response was ever produced and FetchSnapshot was never triggered; the expected retry/giving-up log messages could never appear.

Redesign: use `BackupSnapshotFetch` with `backup_snapshot_fetch_target_rpc_interface="primary_rpc_interface"`. The `primary_rpc_interface` lacks the `SnapshotRead` operator feature, so every fetch request returns HTTP 404, reliably driving the `fetch_from_peer` retry loop to log all three attempt messages and the "Exceeded maximum snapshot fetch retries … giving up" message.

Also:
- Add `args.label` suffix to avoid workspace collision
- Inline txs construction (matches `run_backup_snapshot_download` style)
- Use `network.txs.issue` instead of a separate `txs` variable
- Collect matched patterns before removing, to avoid list mutation during iteration
- Register the test in `schema.py` alongside `run_backup_snapshot_download`

Co-authored-by: eddyashton <6000239+eddyashton@users.noreply.github.com>
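The 404-driven retry loop described above can be modelled in a few lines of Python. This is a sketch only: the real loop lives in CCF's node code, and while the retry count and the give-up message follow the log lines quoted in this thread, the function names and the attempt-message wording here are assumptions.

```python
MAX_SNAPSHOT_FETCH_RETRIES = 3  # the "(3)" in the quoted give-up message


def fetch_snapshot(http_get, log):
    """Model of a fetch_from_peer-style retry loop. Against an interface
    lacking the SnapshotRead feature, every GET returns HTTP 404, so the
    loop exhausts all attempts and emits the give-up message."""
    for attempt in range(1, MAX_SNAPSHOT_FETCH_RETRIES + 1):
        status, body = http_get("/node/snapshot")
        if status == 200:
            return body
        # One log line per failed attempt (wording assumed)
        log(f"Snapshot fetch attempt {attempt} failed with HTTP {status}")
    log(
        f"Exceeded maximum snapshot fetch retries "
        f"({MAX_SNAPSHOT_FETCH_RETRIES}), giving up"
    )
    return None
```

Because the 404 is deterministic (the interface simply has no SnapshotRead operator feature), the test can assert on all three attempt messages plus the final give-up line without any race.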
@copilot Rather than adding a separate runner and network just for this single test, can we rename it to a `test_` function and add it to an existing run?
Per reviewer feedback: rename `run_error_message_on_failure_to_fetch_snapshot` to `test_error_message_on_failure_to_fetch_snapshot(network, args)` and add it to the existing `run_backup_snapshot_download` run alongside `test_backup_snapshot_fetch` and `test_backup_snapshot_fetch_max_size`, rather than having a separate runner and network. Also remove the now-unnecessary standalone `schema.py` `cr.add` entry.

Co-authored-by: eddyashton <6000239+eddyashton@users.noreply.github.com>
Done in 5004695. Renamed to `test_error_message_on_failure_to_fetch_snapshot(network, args)` and added it to `run_backup_snapshot_download`.
@copilot There are 2 kinds of "backup nodes fetching snapshots from primaries": during join, and gossiped during runtime. Your changes have altered this test to find this error during the second phase (`BackupSnapshotFetch`), but the join-time `FetchSnapshot` path is no longer covered.
… coverage
Adds coverage for the initial join-time FetchSnapshot failure path, which
the reviewer identified as missing.
- Add join_target_interface_name parameter to _setup_node in network.py
(defaults to FILE_SERVING_RPC_INTERFACE, preserving existing behaviour)
so tests can direct the join target to a specific interface.
- Add test_join_time_snapshot_fetch_failure(network, args):
1. Adds an intermediate node joined from snapshot (startup_seqno > 0).
2. Joins a fresh node (no snapshot) targeting that intermediate node
via primary_rpc_interface (lacks SnapshotRead). The intermediate
node returns StartupSeqnoIsOld (its startup_seqno > 0), triggering
the FetchSnapshot task.
3. FetchSnapshot GETs /node/snapshot on primary_rpc_interface → HTTP 404
on every attempt, exhausting the 3-attempt retry loop and logging
"Exceeded maximum snapshot fetch retries (3), giving up".
4. Asserts all expected retry and give-up log messages appear.
- Add call in run_backup_snapshot_download before the existing
test_error_message_on_failure_to_fetch_snapshot (BackupSnapshotFetch).
Co-authored-by: eddyashton <6000239+eddyashton@users.noreply.github.com>
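A minimal sketch of what the defaulted `join_target_interface_name` parameter might look like. The constant and interface names mirror the commit message above, but the surrounding classes are simplified stand-ins, not CCF's actual `network.py`.

```python
# Assumed constant values; only the names appear in this PR.
FILE_SERVING_RPC_INTERFACE = "file_serving_rpc_interface"
PRIMARY_RPC_INTERFACE = "primary_rpc_interface"


class Node:
    """Stand-in for the real node object: maps interface names to addresses."""

    def __init__(self, interfaces):
        self.interfaces = interfaces  # interface name -> "host:port"


def join_target_address(
    target_node, join_target_interface_name=FILE_SERVING_RPC_INTERFACE
):
    # Defaulting to FILE_SERVING_RPC_INTERFACE preserves existing
    # behaviour: callers that pass nothing are unaffected, while the
    # join-time failure test can point at primary_rpc_interface, which
    # lacks SnapshotRead and therefore 404s every snapshot fetch.
    return target_node.interfaces[join_target_interface_name]
```

Threading the choice through as a defaulted keyword argument keeps the negative test deterministic without touching any existing caller.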
Added in 21b84d8.
@copilot The setup and shutdown code around these tests might be different, but I feel like the core assertion logic (scanning the logs for the expected messages) should be identical. Can you factor that out into a separate function, called by both tests?
…_messages

Both test_join_time_snapshot_fetch_failure and test_error_message_on_failure_to_fetch_snapshot used identical expected_patterns lists and near-identical log-scanning loops. Factored into _assert_snapshot_fetch_failure_messages(node, timeout_s=0):

- timeout_s=0 (default): single pass, for already-stopped nodes
- positive timeout_s: polls until messages appear, for still-running nodes

Co-authored-by: eddyashton <6000239+eddyashton@users.noreply.github.com>
Done in f57db9c. Factored into `_assert_snapshot_fetch_failure_messages(node, timeout_s=0)`, called by both tests.
Pull request overview
This PR updates CCF’s end-to-end test suite to ensure snapshot-fetch failure log messages are covered and executed, and extends the test infra to allow selecting which RPC interface is used as the join target (enabling deterministic negative test cases).
Changes:
- Added `join_target_interface_name` support to `Network._setup_node()` so tests can direct join traffic to a specific RPC interface.
- Refactored and registered the snapshot-fetch failure log assertion into a shared helper and two runnable tests (join-time and backup-snapshot-fetch paths).
- Wired the new tests into `run_backup_snapshot_download` so they are executed in CI.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `tests/infra/network.py` | Adds an optional `join_target_interface_name` to control the join target RPC interface used by `_setup_node()`. |
| `tests/e2e_operations.py` | Adds/refactors snapshot-fetch failure tests, shared log assertion helper, and registers tests in the backup snapshot download runner. |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
- Renamed `run_error_message_on_failure_to_fetch_snapshot` to `test_error_message_on_failure_to_fetch_snapshot(network, args)` and added it to `run_backup_snapshot_download` (runtime `BackupSnapshotFetch` failure)
- Removed the standalone `schema.py` entry
- Added `join_target_interface_name` param to `_setup_node` in `network.py` (defaults to `FILE_SERVING_RPC_INTERFACE`, no behaviour change for existing callers)
- Added `test_join_time_snapshot_fetch_failure(network, args)`: join-time `FetchSnapshot` failure
- Registered `test_join_time_snapshot_fetch_failure` in `run_backup_snapshot_download`
- Added `_assert_snapshot_fetch_failure_messages(node, timeout_s=0)` helper, used by both tests

Original prompt
`error_message_on_failure_to_fetch_snapshot` test is not run, and currently fails #7715