…ore breaking loop

- Prevent peers from staying in _active_connections indefinitely
- Move push_back of peers_to_disconnect_forcibly before loop break
- Avoid repeated disconnect logs by ensuring proper disconnection flow
…_median_witness_props
…chedule when emergency mode is active
…psis matching diagnostics
…indows

- Track logical block and index file sizes separately from mapped_file.size()
- Add methods to sync and verify logical sizes against actual mapped sizes
- Heal stale mappings by closing and reopening files when size mismatches are detected
- Increment resize count on each resize operation for diagnostics
- Return dlt_block_log head by value to ensure thread safety
- Expose verify_mapping() and resize_count() for external monitoring
- Integrate periodic mapping verification and healing in P2P plugin statistics task
- Improve block read and append logic to use logical sizes and assert correctness
…ling

- Add detailed per-peer and failed-peer stats logging when p2p-stats-enabled is true
- Include block storage diagnostics in P2P stats output with ranges and resize counts
- Implement verify_mapping() to detect and self-heal stale memory-mapped file states
- Track logical file sizes independently from mapped_file.size() to avoid stale size issues
- Run verify_mapping() automatically every 5 minutes from the P2P stats task to maintain integrity
- Document minority fork auto-recovery process triggered by witness plugin resync_from_lib()
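The logical-size-vs-mapped-size idea from the two commits above can be sketched as follows. This is a hypothetical illustration, not the real `dlt_block_log` API: the struct name, fields, and the stand-in for close-and-reopen healing are all assumptions; the point is that the log tracks the bytes it knows it appended separately from what the mapping reports, and `verify_mapping()` heals the mapping when the two disagree.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch: logical size is authoritative; the mapped size can
// drift behind it (observed on Windows), so verify_mapping() self-heals.
struct mapped_block_file {
    uint64_t logical_size = 0;   // bytes we know we appended
    uint64_t mapped_size  = 0;   // stand-in for mapped_file.size()
    uint32_t resizes      = 0;   // diagnostic counter

    void append(uint64_t bytes) {
        logical_size += bytes;
        if (logical_size > mapped_size) {   // grow the mapping, amortized
            mapped_size = logical_size * 2;
            ++resizes;
        }
    }

    // Returns true when the mapping was stale and had to be healed.
    bool verify_mapping() {
        if (mapped_size >= logical_size) return false;
        mapped_size = logical_size;         // stand-in for close + reopen
        ++resizes;
        return true;
    }

    uint32_t resize_count() const { return resizes; }
};
```

A periodic task (here, the P2P stats task every 5 minutes) would simply call `verify_mapping()` and log when it returns true.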
…ontinuity

- Detect gap between dlt_block_log end and fork_db start on update_lib
- Reset dlt_block_log to earliest available fork_db block after gap detection
- Append blocks from fork_db to dlt_block_log after reset for recovery
- Add dlt_block_log::verify_continuity() to identify missing or unreadable blocks
- Log warnings for detected gaps and coverage issues during P2P stats reporting
- Improve dlt_block_log integrity checks by combining mapping verification and full scan
- Introduce ANSI color codes for enhanced P2P log message clarity
…recovery

- Add verify_continuity() method for gap detection and integrity verification
- Implement sophisticated gap detection and automatic DLT block log reset
- Integrate signal-based snapshot plugin triggering after DLT reset
- Enhance diagnostic system with verify_mapping() and resize_count() for monitoring
- Fix Windows memory-mapped file size drift with separate logical size tracking
- Improve read/append logic using logical sizes for correctness and reliability
- Add automatic healing and periodic mapping verification mechanisms
- Strengthen gap warning suppression to prevent redundant logging
- Update reset() method to ensure safe log clearing with comprehensive cleanup
- Integrate gap monitoring and recovery in P2P synchronization and stats tasks
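The core of a `verify_continuity()`-style full scan can be sketched as below. This is a hypothetical simplification (the real method reads blocks from the log and also checks readability; here the readable block numbers are passed in pre-sorted), but it shows the gap-reporting logic: walk the expected contiguous range and emit `[first_missing, last_missing]` runs.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical sketch of continuity verification: `present` is the sorted
// list of block numbers actually readable from the log; report every
// missing run inside the expected range [begin, end].
std::vector<std::pair<uint32_t, uint32_t>>
find_gaps(const std::vector<uint32_t>& present, uint32_t begin, uint32_t end) {
    std::vector<std::pair<uint32_t, uint32_t>> gaps;
    uint32_t expected = begin;
    for (uint32_t n : present) {
        if (n > expected)                       // missing run before n
            gaps.emplace_back(expected, n - 1);
        expected = n + 1;
    }
    if (expected <= end)                        // tail gap up to end
        gaps.emplace_back(expected, end);
    return gaps;
}
```

An empty result confirms integrity; a non-empty one drives the reset-and-refill recovery described above.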
- Implement diagnostics to log block storage state at startup
- Report head block, last irreversible block, earliest block available
- Log dlt_block_log range and block_log end block number
- Display fork_db head, linked and unlinked block ranges and counts
- Detect and log gaps between dlt_block_log end and fork_db start
- Perform full integrity scan of dlt_block_log continuity
- Log any detected gaps or confirm integrity if none found
- Add a check for peers being ahead in DLT mode by verifying all synopsis entries are above the local head block number
- Return an empty result if the peer is ahead, indicating no fork but no blocks to send
- Retain existing logging and exception throw for unreachable fork cases
- Improve log messages with detailed peer synopsis and head block info
- Add tracking of connected peer IPs for cross-referencing with peer_db entries
- Log additional peer metrics including bytes sent, connection direction, user agent, fork revision, head block info, firewall status, and timestamps for connection events
- Improve p2p peer log output with expanded details and clearer formatting
- Cross-reference failed peer_db entries against currently connected IPs
- Annotate peer_db logs when a "failed" peer is actually connected at the moment
…d Windows support

- Add verify_continuity(), verify_mapping(), and resize_count() methods for diagnostics
- Implement intelligent gap detection and automatic recovery with block log reset
- Integrate periodic integrity scanning into P2P stats task for DLT mode nodes
- Add signal-based snapshot plugin integration for fresh snapshot creation
- Enhance gap logging with automatic warning suppression via _dlt_gap_logged state
- Improve Windows compatibility by tracking logical file size and healing stale mappings
- Introduce advanced peer ahead-of-us detection to prevent unnecessary sync attempts
- Provide comprehensive startup diagnostics before synchronization begins
- Use ANSI color codes for enhanced console readability and comprehensive logging
- Strengthen DLT integrity verification and coverage gap monitoring in P2P plugin
- Introduce a disable threshold setting to auto-disable a witness after producing N consecutive blocks from the same node, with 0 disabling this feature
- Track per-witness consecutive block counts and auto-disable witnesses exceeding the threshold by setting their signing key to null on-chain
- Prevent auto-restoration of witnesses that have been auto-disabled, requiring manual intervention or restart to re-enable
- Implement send_witness_disable function to create and broadcast witness_update transactions that disable witnesses safely
- Add detailed logging for auto-disable and disable transaction broadcasting events
- Reset consecutive block counters when a different witness produces a block
- Add configuration options and documentation entries for the new auto-disable feature
- Update witness_guard plugin lifecycle to initialize and use the new threshold and auto-disable logic during runtime block handling
- Changed peer connection time to be returned as seconds since epoch
- Updated conntime field assignment to use the sec_since_epoch() method
- Ensured consistency with other timestamp fields in peer details output
…itnesses

- Add `witness-guard-disable` config option to set threshold for auto-disabling a monitored witness after producing consecutive blocks
- Implement per-witness consecutive block counters, reset when a different witness produces a block
- Broadcast `witness_update_operation` with null signing key to disable witness on reaching threshold, marking witness as auto-disabled
- Suppress auto-restore for auto-disabled witnesses until operator manually restores the signing key
- Clear auto-disabled flag when a non-null signing key is detected on-chain
- Update block handler to track and act on consecutive block counts
- Add detailed documentation for new feature, safety guards, and operator guidance
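The consecutive-block counter described in the two witness-guard commits above can be sketched as a tiny state machine. This is an illustrative assumption, not the plugin's actual code: the struct and method names are made up, and the on-chain `witness_update_operation` broadcast is reduced to a boolean return.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical sketch of the witness-guard streak counter. A threshold of 0
// turns the feature off; otherwise producing `threshold` blocks in a row
// trips the auto-disable (in the real plugin: broadcast a witness_update
// with a null signing key).
struct witness_guard {
    uint32_t threshold;           // 0 disables the feature
    std::string last_producer;
    uint32_t streak = 0;

    // Returns true when `who` should be auto-disabled after this block.
    bool on_block(const std::string& who) {
        if (who == last_producer) {
            ++streak;
        } else {                  // a different witness resets the streak
            last_producer = who;
            streak = 1;
        }
        return threshold != 0 && streak >= threshold;
    }
};
```

Per the commits, once tripped the witness stays disabled (no auto-restore) until the operator manually restores the signing key, which also clears the auto-disabled flag.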
…rialization error

Range replies from on_dlt_get_block_range had no size limit and could exceed MAX_MESSAGE_SIZE (2MB). When a corrupted block in a 200-block range caused fc::raw deserialization to abort (static_variant set_which assert), the entire range was lost and the TCP stream was disconnected.

Fix:
- on_dlt_get_block_range: check pack_size per block against MAX_MESSAGE_SIZE - 64KB headroom; send partial reply when limit hit
- on_message: catch deserialization errors for range replies separately; set range_fallback_mode and send single-block request instead of disconnecting (stream is still aligned after full message read)
- on_dlt_block_reply: continue single-block fetch chain while range_fallback_mode is active; clear flag and transition to FORWARD when caught up
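The per-block size check can be sketched as a simple budget loop. This is a hypothetical stand-in: the real code would call fc::raw::pack_size on each block, while here the packed sizes are passed in directly; the constants match the values named in the commit.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch of the partial-reply budget: stop adding blocks once
// the next one would push the reply past MAX_MESSAGE_SIZE minus headroom.
constexpr std::size_t MAX_MESSAGE_SIZE = 2 * 1024 * 1024;   // 2MB
constexpr std::size_t HEADROOM         = 64 * 1024;         // 64KB

// packed_sizes[i] stands in for fc::raw::pack_size(block i).
// Returns how many leading blocks fit into one reply.
std::size_t blocks_that_fit(const std::vector<std::size_t>& packed_sizes) {
    const std::size_t budget = MAX_MESSAGE_SIZE - HEADROOM;
    std::size_t total = 0, count = 0;
    for (std::size_t s : packed_sizes) {
        if (total + s > budget) break;    // limit hit: send partial reply
        total += s;
        ++count;
    }
    return count;
}
```

The requester then continues from the last block it received, so a partial reply degrades throughput instead of killing the connection.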
When a remote peer responds with dlt_peer_exchange_rate_limited, the local node logged the response but never recorded the rate-limit in its local peer state. Since last_peer_exchange_request_time was only set when receiving an incoming exchange request (not when getting rate-limited on an outbound one), periodic_peer_exchange() kept sending a request to the same peer every 5 seconds for the entire 600-second cooldown window. Now on_dlt_peer_exchange_rate_limited() sets last_peer_exchange_request_time to now() so the local is_peer_exchange_rate_limited() check correctly excludes the peer until the cooldown expires.
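The fix above can be sketched with plain integer timestamps. This is a hypothetical reduction of the peer state (the real code uses fc time types); the key point is that the rate-limited *reply* now updates the same timestamp the local check reads.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch: record the rate-limit locally on the outbound path
// too, so periodic_peer_exchange() skips the peer for the whole cooldown
// instead of retrying every 5 seconds. Times are seconds since epoch.
struct peer_exchange_state {
    static constexpr int64_t COOLDOWN_SEC = 600;
    int64_t last_peer_exchange_request_time = 0;

    bool is_peer_exchange_rate_limited(int64_t now) const {
        return now - last_peer_exchange_request_time < COOLDOWN_SEC;
    }

    // Previously only an incoming exchange request updated the timestamp;
    // now the dlt_peer_exchange_rate_limited reply does as well.
    void on_rate_limited_reply(int64_t now) {
        last_peer_exchange_request_time = now;
    }
};
```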
When a master node broadcasts a block, peers that receive it from another peer before the master's broadcast arrives will relay it back to the master — wasting bandwidth and producing "already on our chain" log noise. The existing `exclude` parameter only filters the direct sender, not peers that received the block indirectly.

Fix: add a per-peer `known_blocks` ring buffer (20 entries) that tracks which blocks each peer is known to have. A block is recorded when we send it to the peer or when the peer sends it to us. Before retransmitting a block, `send_to_all_our_fork_peers` checks the buffer and skips peers that already have the block.

The `block_id` parameter defaults to null for non-block messages (transactions, fork_status, etc.), so echo suppression only applies to block_reply messages. Diagnostic log updated to show echo-filtered peer count: "Relay block_reply to N peers (X skipped: echo)"
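The 20-entry ring buffer can be sketched as below. Hypothetical simplification: real block ids are hash types, here reduced to `uint64_t` (with 0 reserved as the empty-slot marker), and the struct name is made up.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical sketch of the per-peer known_blocks ring: record a block id
// when we send it to the peer or the peer sends it to us; skip the peer on
// retransmission when the id is already present.
struct known_blocks_ring {
    std::array<uint64_t, 20> ids{};   // 0 = empty slot (real code: block_id_type)
    std::size_t next = 0;

    void record(uint64_t id) {
        if (knows(id)) return;                // avoid duplicate entries
        ids[next] = id;
        next = (next + 1) % ids.size();       // overwrite oldest entry
    }

    bool knows(uint64_t id) const {
        return std::find(ids.begin(), ids.end(), id) != ids.end();
    }
};
```

A 20-entry linear scan is deliberately cheap: at typical block rates the buffer covers recent history, and false negatives only cost one redundant relay.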
When a block is applied to our chain (from any source), all peers' expected_next_block values must be advanced to match our new head. Without this, self-produced blocks or blocks from peer A leave peer B's expected_next_block stale, causing false "out of order" warnings on the next incoming block.

Two changes:
1. on_block_applied(): iterate all peers and advance stale expected_next_block values to block_num + 1. This covers blocks from network peers, gap fill, and (via #2) own production.
2. broadcast_block(): call on_block_applied() after sending to peers. This ensures mempool cleanup, fork state tracking, and expected_next_block advancement happen for self-produced blocks (previously these were silently skipped).
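Change #1 is a one-pass advance over the per-peer trackers, sketched here with the peer map reduced to a plain vector (a hypothetical simplification of the real peer state):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sketch of on_block_applied(): after any block reaches our
// chain, advance every peer's expected_next_block that lags the new head,
// so the next incoming block is not misreported as out of order.
void on_block_applied(std::vector<uint32_t>& expected_next_block,
                      uint32_t block_num) {
    for (auto& e : expected_next_block)
        if (e < block_num + 1)
            e = block_num + 1;   // advance only stale trackers
}
```

Peers already expecting a higher block (e.g. ones that fed us the block) are left untouched.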
When expected_next_block is stale (narrow race after fix 8.1+8.2), blocks at or behind head, or blocks that match head+1, are not genuinely out of order — they are just per-peer tracker lag. Demote these cases to dlog. Only emit wlog for genuine gaps where block_num > head + 1, which matches the existing gap-fill trigger threshold.

Applied to both sites:
- on_dlt_block_range_reply (SYNC mode, range blocks)
- on_dlt_block_reply (FORWARD mode, single-block broadcast)
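The demotion rule reduces to a single comparison, sketched here as a hypothetical classifier returning the log level name:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical sketch of the severity rule: only a genuine gap
// (block_num > head + 1) warrants wlog; anything at/behind head or exactly
// head + 1 is per-peer tracker lag and gets dlog.
std::string out_of_order_severity(uint32_t block_num, uint32_t head) {
    return block_num > head + 1 ? "wlog" : "dlog";
}
```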
When the snapshot plugin creates a snapshot, P2P block processing is paused and incoming blocks from peers are silently dropped. After the pause ends, resume_block_processing() detects the gap and requests missing blocks asynchronously — but the witness production loop (250ms tick) could fire before those blocks arrive. For the emergency master, which bypasses all sync checks, this meant producing a block on a stale head that conflicted with the blocks about to arrive from peers. Other nodes saw this as a fork. Add a _catchup_after_pause flag that is set when resume_block_processing() detects peers are ahead, and cleared when transition_to_forward() confirms catchup is complete. The witness plugin checks this flag in maybe_produce_block() and defers production until the gap is filled.
When the snapshot plugin creates a snapshot, P2P block processing is paused and incoming blocks from peers are silently dropped. After the pause ends, resume_block_processing() detects the gap and requests missing blocks asynchronously — but the witness production loop (250ms tick) could fire before those blocks arrive. For the emergency master, which bypasses all sync checks, this meant producing a block on a stale head that conflicted with the blocks about to arrive from peers. Other nodes saw this as a fork. Add a _catchup_after_pause flag that is set unconditionally when resume_block_processing() runs (peer_head_num may be stale after the pause), and cleared either by transition_to_forward() when the gap is filled, or by periodic_task() when no gap actually existed. Also send a proactive hello to all peers in resume_block_processing() to refresh their head info as quickly as possible.
When the snapshot plugin creates a snapshot, P2P block processing is
paused. Previously, incoming blocks from peers were silently dropped,
requiring a gap fill after resume — and the emergency master could
produce a block on a stale head before the gap was detected.
Now block-carrying messages (block_reply, block_range_reply,
gap_fill_reply) are deserialized and pushed into _paused_block_queue
during the pause. Hello and fork_status messages are still processed
normally to keep peer_head_num up to date.
When resume_block_processing() is called, it posts
drain_paused_block_queue() to the P2P thread, which sorts queued
blocks by block_num and applies each via accept_block(). The
_catchup_after_pause flag blocks witness production until the drain
completes and no peer is ahead.
Files changed:
- dlt_p2p_node.hpp: add _paused_block_queue, drain_paused_block_queue()
- dlt_p2p_node.cpp: queue blocks in on_message(), drain in resume,
periodic fallback drain
- p2p_plugin.hpp/cpp: expose is_catching_up_after_pause()
- witness.cpp: defer production while catchup flag is set
When the snapshot plugin creates a snapshot, P2P block processing is
paused and the snapshot thread holds a strong DB read lock for 30-120s.
Two bugs:
1. Write lock deadlock: the emergency master's production loop (250ms
tick) bypasses all sync checks and calls generate_block() →
push_block() → write lock, which deadlocks behind the read lock,
producing 11+ second write lock timeouts (readers=0, waiter
spinning).
2. Fork on stale head: blocks arriving during the pause were silently
dropped. After resume, the emergency master produced a block on a
stale head before gap-fill could deliver the real blocks.
Fix:
- Block queue: block-carrying messages (block_reply, block_range_reply,
gap_fill_reply) are deserialized and pushed into _paused_block_queue
during the pause. Hello/fork_status are still processed to keep
peer_head_num up to date.
- Queue drain: resume_block_processing() posts drain_paused_block_queue()
to the P2P thread, which sorts by block_num and applies each block.
- Production gate: is_catching_up_after_pause() returns true when EITHER
_block_processing_paused OR _catchup_after_pause is set. This prevents
generate_block() during the pause (no write lock deadlock) AND during
post-pause catchup (no stale-head fork).
Files changed:
- dlt_p2p_node.hpp: add _paused_block_queue, drain_paused_block_queue(),
update is_catching_up_after_pause() to check _block_processing_paused
- dlt_p2p_node.cpp: queue blocks in on_message(), drain in resume,
periodic fallback drain
- p2p_plugin.hpp/cpp: expose is_catching_up_after_pause()
- witness.cpp: defer production while gate is active
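The production gate from the commit above can be sketched as two flags behind one predicate. Hypothetical reduction: the real flags live in the P2P node and are read through p2p_plugin, and the witness-side check sits inside maybe_produce_block(); here both sides are collapsed into small free-standing pieces.

```cpp
#include <cassert>

// Hypothetical sketch of the gate: production is deferred while block
// processing is paused (prevents the write-lock deadlock during the
// snapshot) OR while the post-pause queue drain / catchup is running
// (prevents producing on a stale head and forking).
struct p2p_gate {
    bool block_processing_paused = false;
    bool catchup_after_pause     = false;

    bool is_catching_up_after_pause() const {
        return block_processing_paused || catchup_after_pause;
    }
};

// Witness-side check, as in maybe_produce_block():
bool may_produce(const p2p_gate& g) {
    return !g.is_catching_up_after_pause();
}
```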
…erleaving

send_message() had two bugs causing remote peers to see corrupted dlt_block_reply_message payloads:
1. writesome() return value was ignored — partial TCP writes silently dropped the remaining bytes, sending truncated messages (e.g. 123 bytes where 500KB were intended).
2. Two separate writesome() calls (header, data) each yield the fiber via async .wait(). During the yield, another fiber could write to the same socket, interleaving bytes and producing garbage that fails deserialization at the receiver (e.g. huge varint in witness field).

Fix: coalesce header + payload into a single buffer, loop writesome() until all bytes are written, and add a per-peer send guard that drops concurrent messages rather than corrupting the stream.
send_message() previously dropped messages when a send was already in progress for the same peer. If blocks 10, 11, 12 were dispatched in rapid succession, 11 and 12 were silently discarded, breaking sync. Replace the drop-on-contention guard with a per-peer send queue (dlt_peer_state.send_queue). When a fiber is already writing to a peer's socket, new messages are enqueued. The active writer drains the queue after each successful write, preserving message order.

Key details:
- Single-buffer serialization (header + payload coalesced) prevents fiber interleaving mid-message during partial writes
- Partial-write retry loop ensures every byte reaches the wire
- Queue capped at 100 messages per peer (configurable constant)
- Dropped-message counter tracked in stats and logged periodically
- Queue cleared on disconnect; send guard cleaned on close
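The send path described above — single coalesced buffer, partial-write retry loop, bounded FIFO queue — can be sketched as follows. This is a hypothetical model, not the real fiber code: `writer` stands in for writesome() on the socket, messages are plain strings, and "another fiber is writing" is modeled by the `sending` flag.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <stdexcept>
#include <string>

// Hypothetical sketch of the per-peer send path: retry partial writes until
// every byte is on the wire, and queue messages that arrive while a write
// is in flight instead of dropping them.
struct peer_sender {
    static constexpr std::size_t QUEUE_MAX = 100;   // per-peer cap
    std::deque<std::string> send_queue;
    std::size_t dropped = 0;                        // stats counter
    bool sending = false;                           // per-peer send guard

    // writer(data, len) returns bytes written; may be partial like writesome().
    using writer_fn = std::function<std::size_t(const char*, std::size_t)>;

    void write_all(const std::string& buf, const writer_fn& writer) {
        std::size_t off = 0;
        while (off < buf.size()) {                  // partial-write retry loop
            std::size_t n = writer(buf.data() + off, buf.size() - off);
            if (n == 0)                             // stalled connection guard
                throw std::runtime_error("zero-byte write: stalled connection");
            off += n;
        }
    }

    void send(const std::string& msg, const writer_fn& writer) {
        if (sending) {                              // another fiber is writing
            if (send_queue.size() >= QUEUE_MAX) { ++dropped; return; }
            send_queue.push_back(msg);
            return;
        }
        sending = true;
        write_all(msg, writer);
        while (!send_queue.empty()) {               // drain in FIFO order
            write_all(send_queue.front(), writer);
            send_queue.pop_front();
        }
        sending = false;
    }
};
```

The zero-byte guard mirrors the later drain_send_queue fix: throwing lets the existing outer handler log and disconnect instead of spinning.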
…sconnected

When all peers are disconnected/banned (e.g. after snapshot pause), check_sync_catchup() falsely reported "caught up" with zero active peers, causing SYNC→FORWARD→SYNC oscillation every 30s. Forward stagnation also oscillated uselessly because no peers existed to request blocks from.

Add isolation detection with a 60s grace period, after which all peer backoffs are reset to initial values and soft bans are cleared, forcing immediate reconnection attempts. This replaces the oscillating state machine with a single recovery action.
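The isolation detector can be sketched as a timestamped check. Hypothetical reduction: the actual recovery action (resetting backoffs, clearing soft bans) is collapsed into a boolean trigger, and time is an integer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical sketch of isolation recovery: once the node has had zero
// active peers for longer than the grace period, fire one recovery action
// (backoff reset + soft-ban clear) instead of SYNC/FORWARD oscillation.
struct isolation_detector {
    static constexpr int64_t GRACE_SEC = 60;
    int64_t isolated_since = -1;   // -1: not currently isolated

    // Returns true when recovery should fire now.
    bool check(std::size_t active_peers, int64_t now) {
        if (active_peers > 0) { isolated_since = -1; return false; }
        if (isolated_since < 0) { isolated_since = now; return false; }
        if (now - isolated_since >= GRACE_SEC) {
            isolated_since = now;  // rearm for the next grace window
            return true;
        }
        return false;
    }
};
```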
The snapshot TCP server's accept_loop had no sleep in its catch handlers for fc::exception, std::exception, and catch-all. When the process hit the file descriptor limit (EMFILE / "Too many open files"), the loop would spin at full speed, spamming error logs every millisecond and burning CPU for hours. Add fc::usleep(fc::seconds(1)) in all three error paths, matching the pattern already used in the DLT P2P accept_loop.
… gaps

Previously request_gap_fill() was gated to FORWARD mode only, causing nodes stuck in SYNC with a gap to never request missing blocks. The gap silently grew as broadcast blocks arrived and were stored in fork_db as unlinkable. Additionally, gaps larger than GAP_FILL_MAX_BLOCKS (100) were silently skipped instead of being served in chunks. With a gap of 800+ blocks, the node had no recovery path at all.

Changes:
- Remove FORWARD-only guard so gap fill works in both SYNC and FORWARD
- Include SYNCING lifecycle peers in the candidate search (the best peer in SYNC mode is typically in SYNCING state)
- For large gaps, request the first GAP_FILL_MAX_BLOCKS blocks instead of returning; subsequent chunks are requested after the current one completes or times out
- Also trigger gap fill from out-of-order blocks in SYNC mode, not just FORWARD
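The chunking change reduces to a range computation, sketched below under the assumption that the caller re-invokes it for the next chunk once the current request completes or times out (the function name is illustrative):

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical sketch of chunked gap fill: instead of skipping gaps larger
// than the per-request maximum, serve the first chunk now and let the next
// one follow after this request completes or times out.
constexpr uint32_t GAP_FILL_MAX_BLOCKS = 100;

// Returns the inclusive [from, to] range for the next gap-fill request.
std::pair<uint32_t, uint32_t> next_gap_chunk(uint32_t gap_start,
                                             uint32_t gap_end) {
    uint32_t to = gap_end;
    if (gap_end - gap_start + 1 > GAP_FILL_MAX_BLOCKS)
        to = gap_start + GAP_FILL_MAX_BLOCKS - 1;   // first chunk only
    return {gap_start, to};
}
```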
…mpeting forks

When a witness node produces a block that the network rejects (competing fork at the same height), gap fill requests starting from our_head+1 cannot link any blocks because their parent (the network's version of our_head) is unknown to our chain.

By including our_head in the gap fill request (the same P49 logic already used in request_blocks_from_peer), the peer returns its version of our head block. If it's the same as ours, accept_block returns ALREADY_KNOWN with no side effects. If different, fork_db stores it and can link the subsequent blocks, resolving the competing fork.

For the reported case: head=79740489, gap fill now requests 79740489-79740500 instead of 79740490-79740500. The network's #79740489 links to #79740490+, allowing the chain to advance.
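The range change itself is one line, sketched here with the function name invented for illustration:

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical sketch of the P49-style range: start the gap-fill request at
// our_head rather than our_head + 1, so the peer's version of our head block
// comes back too. If it matches ours, accept_block is a no-op
// (ALREADY_KNOWN); if it differs, fork_db can link the blocks after it.
std::pair<uint32_t, uint32_t> gap_fill_range(uint32_t our_head,
                                             uint32_t peer_head) {
    return {our_head, peer_head};   // previously {our_head + 1, peer_head}
}
```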
Two root causes for the immediate SYNC→FORWARD round-trip:
1. transition_to_sync() did not reset _last_block_received_time, so sync_stagnation_check() inherited a stale timestamp and fired on the very next tick (~5s later).
2. check_forward_stagnation() transitioned to SYNC even when no peer had a higher block number. SYNC has nothing to offer in that case — check_sync_catchup() immediately returned to FORWARD, completing the oscillation loop.

Fix (1): reset _last_block_received_time to now() on SYNC entry.
Fix (2): check for peers ahead before transitioning; if none, reset the stagnation timer and stay in FORWARD.
… fork_db

When syncing from LIB, the sync starting block's parent is on the main chain but absent from fork_db (which only tracks blocks near head). The dead-fork detection in _push_block() checked only fork_db for the parent, causing it to incorrectly reject legitimate fork blocks and soft-ban the sync peer.

Add a main-chain parent check via fetch_block_by_id() before declaring a dead fork. If the parent exists on the main chain, seed fork_db with it and let the block proceed through normal push logic. Apply the same check in p2p_plugin's unlinkable_block_exception handler as a safety net.
During P2P sync mode, the accept_block handler was skipping both witness signature and transaction signature verification. Skipping witness signatures allows a malicious peer with a valid fork_id to inject forged blocks that would be accepted without verification. Keep skip_transaction_signatures during sync (transactions inside a block are already committed by the producing witness, so individual signature checks are redundant). But always verify the witness block signature — it is the sole defense against block forgery.
Replace catch-all exception handlers with targeted catches:
- database.cpp: fork_db parent seeding now catches only
fc::assert_exception (duplicate/already-present). Memory errors
and corruption propagate instead of being silently swallowed.
- dlt_p2p_node: add PAUSED_QUEUE_MAX (1000) limit to prevent
unbounded memory growth during long snapshot pauses. Replace
silent catch(...){} with fc::exception catch + wlog so
deserialization failures are visible in diagnostics.
Issue #4: Add comprehensive safety analysis for _pending_tx access during snapshot serialization. Document that strong_read_lock is compatible with weak_write_lock (used by push_transaction), making a theoretical race possible via API accept_transaction. The P2P pause during snapshot eliminates the primary concurrent writer. A full fix requires pausing API transactions or copying _pending_tx outside the read lock scope.

Issue #5: Prevent infinite loop in drain_send_queue when writesome() returns 0 bytes (stalled connection). Throw an fc::exception, which is already caught by the outer handler that logs the error and disconnects the peer.