fix: clarify missing-vendor exploration error and add BMC connection circuit breaker#2980
Conversation
Summary by CodeRabbit
Walkthrough
ChangesSite-explorer MissingVendor propagation
BMC circuit breaker and collector sweep
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (1)
crates/health/src/bmc.rs (1)
743-757: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winUse table-driven coverage for the classifier cases.
This is an input-to-boolean classifier with several rows; convert it to the project’s table-driven style so adding future error cases stays cheap and consistent. As per coding guidelines, “Use table-driven test style when writing tests in Rust”.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/health/src/bmc.rs` around lines 743 - 757, Convert the classifier test in non_connection_errors_do_not_trip_the_circuit into table-driven style, since is_connection_error is an input-to-boolean helper with multiple cases. Replace the repeated individual assertions with a single cases table covering the existing HealthError variants (BmcError with UNAUTHORIZED, BmcError with NOT_FOUND, HttpError, and dummy_error), then iterate the table and assert each result. Keep the test name or fold it into a parameterized pattern that clearly matches the classifier behavior.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/api-model/src/site_explorer/mod.rs`:
- Around line 1199-1211: Keep MissingVendor backward-compatible by preserving
support for the legacy unit-variant JSONB shape currently stored in
exploration_report. Update the SiteExplorer error type in site_explorer::mod so
MissingVendor can still deserialize rows written as "MissingVendor" while also
handling the new observed payload, either by adding an explicit legacy decode
path or a migration for stored data. Make sure the serde/thiserror
representation for MissingVendor does not break existing deserialization of
previously persisted reports.
In `@crates/health/src/bmc.rs`:
- Around line 263-267: The `BmcClient` error handling in the `Probing`/half-open
path only trips on connection errors and otherwise returns the error without
closing the circuit, so update that branch to explicitly close the breaker when
a non-connection BMC response is received. Use the existing
`is_connection_error`, `trip_circuit`, and circuit-state handling in `bmc.rs` to
distinguish reachable responses from transport failures, and add a regression
test covering `CircuitState::Probing` with a non-connection `BmcError` such as
`InvalidResponse` or an auth error to assert the circuit transitions back to
closed.
- Around line 576-580: The stream error handling in the BMC SSE path bypasses
the circuit breaker because `self.inner.stream(uri)` is only wrapped for
creation, while item-level connection failures are mapped afterward in the
returned stream. Update the `guarded()` flow in `Bmc::stream` so that fallible
stream-item errors are routed back through the breaker before they are returned,
and keep the `HealthError::from` conversion associated with the guarded path
rather than only applying it in `map_err` on the stream.
In `@crates/health/src/collectors/sensors.rs`:
- Around line 198-214: The suppression in the sensor fetch error path is using
the breaker’s post-call state, which can misclassify failures that occur during
the single probe in Probing. Update the error handling in the sensors collector
around sensor_link.fetch().await so the quiet skip only happens when the
returned error is a breaker fast-fail or a true connection-level failure from
BmcClient, not merely when self.endpoint.bmc.circuit_blocks_requests() is
currently true. Keep the existing debug log and fetch_failures accounting
behavior in the sensor fetch loop, but gate that branch on the error
type/details from e instead of the breaker state.
---
Nitpick comments:
In `@crates/health/src/bmc.rs`:
- Around line 743-757: Convert the classifier test in
non_connection_errors_do_not_trip_the_circuit into table-driven style, since
is_connection_error is an input-to-boolean helper with multiple cases. Replace
the repeated individual assertions with a single cases table covering the
existing HealthError variants (BmcError with UNAUTHORIZED, BmcError with
NOT_FOUND, HttpError, and dummy_error), then iterate the table and assert each
result. Keep the test name or fold it into a parameterized pattern that clearly
matches the classifier behavior.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: bb36629e-5a3c-4923-9c6b-d72720bb9b42
📒 Files selected for processing (5)
crates/api-model/src/site_explorer/mod.rscrates/health/src/bmc.rscrates/health/src/collectors/sensors.rscrates/site-explorer/src/metrics.rscrates/site-explorer/src/redfish.rs
🔍 Container Scan Summary
Per-CVE detail lives in the per-service |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/health/src/bmc.rs (1)
347-381: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick winPublish the tripped hint before opening the circuit.
Line 350 updates the mutex state to
Openbefore Line 381 storescircuit_tripped = true. During that window,check_circuit()can still take the lock-free fast path and allow requests, andnote_reachable()can miss a concurrent success because it observes the stalefalsehint.Proposed fix
fn trip_circuit(&self, error: &HealthError) { let mut state = self.circuit.lock().expect("circuit mutex poisoned"); + // Publish the conservative hint before mutating the state away from + // Closed. A Closed-but-tripped window only causes extra locking; an + // Open-but-untripped window lets requests bypass the breaker. + self.circuit_tripped.store(true, Ordering::Release); match *state { CircuitState::Closed => { *state = CircuitState::Open { until: Instant::now() + CIRCUIT_INITIAL_BACKOFF, backoff: CIRCUIT_INITIAL_BACKOFF, @@ // circuit opened. Leave the existing window untouched. CircuitState::Open { .. } => {} } - // Every branch above leaves the circuit non-`Closed`; publish the hint - // while still holding the lock so the fast path observes it. - self.circuit_tripped.store(true, Ordering::Release); }As per path instructions, this review prioritizes Rust behavior and concurrency findings in
crates/**/*.rs.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/health/src/bmc.rs` around lines 347 - 381, The circuit tripped hint is being published too late in the BMC circuit breaker flow, allowing `check_circuit()` and `note_reachable()` to observe stale state between the mutex update and the atomic flag store. In the `Bmc` circuit-handling logic, set `circuit_tripped` to true before transitioning `CircuitState` to `Open` in the `Closed` and `Probing` failure paths, while keeping the mutex-protected state update and existing ordering guarantees intact so the fast path sees the circuit as tripped immediately.Source: Path instructions
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@crates/health/src/bmc.rs`:
- Around line 347-381: The circuit tripped hint is being published too late in
the BMC circuit breaker flow, allowing `check_circuit()` and `note_reachable()`
to observe stale state between the mutex update and the atomic flag store. In
the `Bmc` circuit-handling logic, set `circuit_tripped` to true before
transitioning `CircuitState` to `Open` in the `Closed` and `Probing` failure
paths, while keeping the mutex-protected state update and existing ordering
guarantees intact so the fast path sees the circuit as tripped immediately.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: eccd673e-0082-4742-8cc6-1f54c55b1282
📒 Files selected for processing (3)
crates/api-model/src/site_explorer/mod.rscrates/health/src/bmc.rscrates/health/src/collectors/sensors.rs
…circuit breaker Two related reliability fixes for unreachable BMCs. site-explorer: when a BMC's Redfish ServiceRoot does not yield a recognized vendor, the recorded exploration error now reports the raw Vendor/Oem string we actually read (and where from), distinguishing a BMC that reported nothing (commonly transient while it is still initializing) from one that reported a vendor we don't support yet. The exploration metric label is split accordingly (vendor_not_reported vs vendor_unrecognized) so the two cases can be alerted on separately. health: add a per-endpoint connection circuit breaker on the shared BmcClient. Previously, when a BMC stopped answering at the network level, every collector sharing the endpoint kept firing requests that each blocked for a full TCP connect timeout -- hundreds per sensor sweep, each logging a warning. The breaker classifies connect/timeout failures (not 404s/auth/decode errors), opens after the first one, and fast-fails subsequent requests with exponential backoff and a single half-open probe to self-heal. The healthy path stays lock-free via an atomic hint. The sensor sweep skips its fan-out while the circuit is open and treats breaker fast-fails as skips rather than per-sensor failures, so a dead endpoint no longer produces a log flood. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ensor sweep Follow-up to the connection circuit breaker, resolving review feedback: - guarded(): a non-connection error (404/auth/decode) now closes the circuit too, not just a success. The BMC responded, so it is reachable -- without this a half-open probe that got a real error would stay fast-failed until the probe deadline. - Sensor sweep: replace the post-call breaker-state suppression (fragile under concurrency -- a probe success could flip the state mid-sweep and re-surface the fast-fails) with an explicit three-way decision (Full/Probe/Skip). When the backoff window elapses the sweep sends a single probe instead of fanning out every sensor, so a dead BMC costs one request rather than a fast-fail burst, and per-sensor failures are logged/counted normally again. - Document that only stream *establishment* runs through the breaker; long-lived SSE item errors are handled by the streaming collector's own reconnect backoff. - Add tests: legacy MissingVendor unit-variant still deserializes (observed: None) and the new form round-trips; non-connection error during a probe closes the circuit; collector_sweep returns Probe once the window elapses. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2ad8766 to
97f9c4d
Compare
|
🌿 Preview your docs: https://nvidia-preview-pull-request-2980.docs.buildwithfern.com/infra-controller |
Latest main added new `MissingVendor` references (the error_code and
operator_mitigation match arms, plus rpc/api-model tests) that still treated it
as a unit variant. Update them for the `MissingVendor { observed }` struct
variant introduced earlier on this branch.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
🧹 Nitpick comments (1)
crates/rpc/src/model/site_explorer.rs (1)
429-429: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick winCorrect migration to struct variant, but consider covering
Some(observed).The construction correctly adopts the new
MissingVendor { observed }shape. However, this test only exercises theNonecase. Given this PR's goal is distinguishing "vendor not reported" from "vendor unrecognized" via theobservedfield, add a case withobserved: Some("...".into())to assert the schema/text propagates the observed value end-to-end.♻️ Suggested additional case
#[test] fn endpoint_report_propagates_operator_error_schema_to_rpc() { let error = EndpointExplorationError::MissingVendor { observed: None }; let expected_schema = error.operator_error_schema(); let report = rpc::site_explorer::EndpointExplorationReport::from(EndpointExplorationReport { last_exploration_error: Some(error), ..Default::default() }); let actual_schema = report .last_exploration_error_schema .expect("report contains operator error schema"); assert_eq!( actual_schema.error_code, expected_schema.error_code.to_string() ); assert_eq!(actual_schema.text, expected_schema.text); assert_eq!(actual_schema.mitigation, expected_schema.mitigation); assert!(report.last_exploration_error.is_some()); } + + #[test] + fn endpoint_report_propagates_observed_vendor_to_rpc() { + let error = EndpointExplorationError::MissingVendor { + observed: Some("AcmeCorp".into()), + }; + let expected_schema = error.operator_error_schema(); + + let report = + rpc::site_explorer::EndpointExplorationReport::from(EndpointExplorationReport { + last_exploration_error: Some(error), + ..Default::default() + }); + + let actual_schema = report + .last_exploration_error_schema + .expect("report contains operator error schema"); + assert_eq!(actual_schema.text, expected_schema.text); + assert!(actual_schema.text.contains("AcmeCorp")); + }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/rpc/src/model/site_explorer.rs` at line 429, The `EndpointExplorationError::MissingVendor` test coverage in `site_explorer` only validates the `observed: None` path, so extend the existing assertions to also cover `observed: Some(...)`. Add a case in the relevant test around `EndpointExplorationError::MissingVendor` that builds the error with a concrete observed vendor string and verifies the resulting schema/text propagation includes that value end-to-end, alongside the current `None` case.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@crates/rpc/src/model/site_explorer.rs`:
- Line 429: The `EndpointExplorationError::MissingVendor` test coverage in
`site_explorer` only validates the `observed: None` path, so extend the existing
assertions to also cover `observed: Some(...)`. Add a case in the relevant test
around `EndpointExplorationError::MissingVendor` that builds the error with a
concrete observed vendor string and verifies the resulting schema/text
propagation includes that value end-to-end, alongside the current `None` case.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 16a08844-3485-4159-88a1-5e5bd1276cb2
📒 Files selected for processing (2)
crates/api-model/src/site_explorer/mod.rscrates/rpc/src/model/site_explorer.rs
🚧 Files skipped from review as they are similar to previous changes (1)
- crates/api-model/src/site_explorer/mod.rs
Summary
Two related reliability fixes for BMCs that become unreachable at the network level.
site-explorer: self-describing missing-vendor error
When a BMC's Redfish ServiceRoot (
/redfish/v1) doesn't yield a recognized vendor, the recorded exploration error now reports the rawVendor/Oemstring we actually read and where it came from. This distinguishes:The exploration metric label is split accordingly —
vendor_not_reportedvsvendor_unrecognized— so the two failure modes can be alerted on separately. The defensiveUnknownarm in the password-rotation path is also documented (it's unreachable from the live exploration path, since the vendor probe rejectsUnknownfirst).health: per-endpoint BMC connection circuit breaker
Previously, when a BMC stopped answering at the network level, every collector sharing the endpoint's
BmcClientkept firing requests that each blocked for a full TCP connect timeout — hundreds per sensor sweep, each logging a warning. This both hammered the dead BMC and flooded the logs.A per-endpoint circuit breaker now sits on the shared
BmcClient:reqwest::Error::is_connect()/is_timeout()) — HTTP 404s, auth, and decode errors are not treated as the BMC being unreachable and don't trip it.AtomicBoolhint, so the common case never takes the mutex. The mutex is only touched on the unhealthy path.Testing
cargo test -p carbide-health --lib bmc::— 16 tests pass, including:BmcClientat a closed local port and asserts a genuine connect failure is classified and trips the breakercargo test -p carbide-site-explorer --lib— passes (incl. unknown→real vendor rotation)cargo clippyclean oncarbide-health,carbide-site-explorer,carbide-api-modelMissingVendor's unit→struct change is serde-backward-compatible (#[serde(default)]deserializes existing records toobserved: None)🤖 Generated with Claude Code