Summary
When several different faults confirm at about the same time, the fault_manager starts one capture thread per fault with no upper bound.
src/ros2_medkit_fault_manager/src/fault_manager_node.cpp (around line 426) starts a std::thread per confirmation that runs snapshot capture and rosbag capture, tracked in capture_threads_. There is a recapture cooldown for the same fault code, but nothing bounds the number of concurrent capture threads across different faults. The rosbag side has a single writer (active_writer_ under writer_mutex_, recording_post_fault_), so concurrent captures also contend there.
Measured with a fresh gateway per fault count: peak memory grows with the number of concurrent faults (about +0.5 MiB at N=1 up to +5.8 MiB at N=16), and above about N=4 the process does not return to its pre-fault memory within a 20 s window. Reproduced with a benchmark harness in selfpatch_demos (fault lane, --faults 1,2,4,8,16).
Proposed solution
- Replace the unbounded thread-per-fault model with a bounded capture thread pool, or gate captures behind a counting semaphore that caps how many run at once.
- Use a bounded queue with a clear policy when it is full (drop oldest or reject), so memory has an upper bound under a fault storm.
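
As a rough illustration of the two points above, here is a minimal sketch of a fixed-size worker pool fed by a bounded queue with a drop-oldest policy. It is not taken from ros2_medkit: the class name BoundedCapturePool, its methods, and the worker and queue sizes are all hypothetical, and a real change would have to fit the existing capture_threads_ bookkeeping and shutdown path.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical helper, not part of ros2_medkit: a fixed number of worker
// threads drain a bounded queue of capture jobs.
class BoundedCapturePool {
public:
  using Job = std::function<void()>;

  BoundedCapturePool(std::size_t workers, std::size_t max_pending)
      : max_pending_(max_pending) {
    assert(workers > 0 && max_pending > 0);
    for (std::size_t i = 0; i < workers; ++i) {
      workers_.emplace_back([this] { run(); });
    }
  }

  // Stops accepting wake-ups, lets workers drain what is queued, then joins.
  ~BoundedCapturePool() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      stopping_ = true;
    }
    cv_.notify_all();
    for (auto& t : workers_) t.join();
  }

  // Enqueue a job. Returns false if the queue was full and the oldest
  // pending job was dropped to make room (drop-oldest policy).
  bool submit(Job job) {
    bool dropped = false;
    {
      std::lock_guard<std::mutex> lock(mutex_);
      if (pending_.size() >= max_pending_) {
        pending_.pop_front();
        ++dropped_count_;
        dropped = true;
      }
      pending_.push_back(std::move(job));
    }
    cv_.notify_one();
    return !dropped;
  }

  // Number of pending jobs discarded so far because the queue was full.
  std::size_t dropped_count() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return dropped_count_;
  }

private:
  void run() {
    for (;;) {
      Job job;
      {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return stopping_ || !pending_.empty(); });
        if (pending_.empty()) return;  // stopping and fully drained
        job = std::move(pending_.front());
        pending_.pop_front();
      }
      job();  // run the capture outside the lock
    }
  }

  const std::size_t max_pending_;
  mutable std::mutex mutex_;
  std::condition_variable cv_;
  std::deque<Job> pending_;
  std::vector<std::thread> workers_;
  std::size_t dropped_count_ = 0;
  bool stopping_ = false;
};
```

With something like this, the confirmation path would call submit() with the snapshot and rosbag capture as the job instead of constructing a std::thread. Thread count is then fixed at the worker count and queued work is capped at max_pending. The trade-off of drop-oldest is that a dropped job means that fault gets no capture, so the drop counter would probably need to be logged or exposed; a reject-newest policy would only change the branch inside submit().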
Additional context
Capture is already off the service callback thread, so the fault path is not blocked. The problem is unbounded concurrency: N simultaneous faults cost N threads and N buffers. The single rosbag writer already serializes its part; the snapshot path does not.