Bound concurrent snapshot capture under fault storms #441

@bburda

Summary

When several different faults confirm at about the same time, the fault_manager starts one capture thread per fault with no upper bound.

src/ros2_medkit_fault_manager/src/fault_manager_node.cpp (around line 426) starts a std::thread per confirmation that runs snapshot capture and rosbag capture, tracked in capture_threads_. There is a recapture cooldown for the same fault code, but nothing bounds the number of concurrent capture threads across different faults. The rosbag side has a single writer (active_writer_ under writer_mutex_, recording_post_fault_), so concurrent captures also contend there.

Measured with a fresh gateway per fault count: peak memory grows with the number of concurrent faults (about +0.5 MiB at N=1 up to +5.8 MiB at N=16), and above about N=4 the process does not return to its pre-fault memory within a 20 s window. Reproduced with a benchmark harness in selfpatch_demos (fault lane, --faults 1,2,4,8,16).

Proposed solution

  • Replace the unbounded thread-per-fault model with a bounded capture thread pool or a semaphore.
  • Use a bounded queue with a clear policy when it is full (drop oldest or reject), so memory has an upper bound under a fault storm.

Additional context

Capture is already off the service callback thread, so the fault path is not blocked. The problem is unbounded concurrency: N simultaneous faults cost N threads and N buffers. The single rosbag writer already serializes its part; the snapshot path does not.
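
If restructuring into a pool is too invasive, the semaphore option can be approximated with a small non-blocking limiter at the point where the capture thread is spawned. The sketch below is an assumption-laden illustration in C++17 (C++20 `std::counting_semaphore` would be an alternative); `CaptureLimiter` and `Permit` are hypothetical names, and the policy shown is reject-when-full.

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical helper: caps how many captures may run at once.
class CaptureLimiter {
public:
  explicit CaptureLimiter(std::size_t max_concurrent) : max_(max_concurrent) {}

  // RAII token: holding a valid Permit means one capture slot is taken.
  class Permit {
  public:
    Permit() = default;
    explicit Permit(CaptureLimiter * owner) : owner_(owner) {}
    Permit(Permit && other) noexcept : owner_(other.owner_) { other.owner_ = nullptr; }
    Permit & operator=(Permit && other) noexcept
    {
      if (this != &other) {
        release();
        owner_ = other.owner_;
        other.owner_ = nullptr;
      }
      return *this;
    }
    Permit(const Permit &) = delete;
    Permit & operator=(const Permit &) = delete;
    ~Permit() { release(); }

    explicit operator bool() const { return owner_ != nullptr; }

  private:
    void release()
    {
      if (owner_ != nullptr) {
        owner_->active_.fetch_sub(1);
        owner_ = nullptr;
      }
    }
    CaptureLimiter * owner_{nullptr};
  };

  // Never blocks: returns an empty Permit when all slots are in use.
  Permit try_acquire()
  {
    std::size_t current = active_.load();
    while (current < max_) {
      // On failure, compare_exchange_weak reloads `current` and we retry.
      if (active_.compare_exchange_weak(current, current + 1)) {
        return Permit(this);
      }
    }
    return Permit();
  }

  std::size_t active() const { return active_.load(); }

private:
  const std::size_t max_;
  std::atomic<std::size_t> active_{0};
};
```

The confirmation path would call `try_acquire()` before starting a capture thread, move the permit into the thread so the slot is freed when capture finishes, and skip (or log) the capture when the permit is empty. Because acquisition never blocks, the service callback thread stays unblocked as it is today.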

Labels: enhancement