Skip to content

fix: clarify missing-vendor exploration error and add BMC connection circuit breaker#2980

Open
wminckler wants to merge 3 commits into
NVIDIA:mainfrom
wminckler:fix/exploration-vendor-clarity-bmc-circuit-breaker
Open

fix: clarify missing-vendor exploration error and add BMC connection circuit breaker#2980
wminckler wants to merge 3 commits into
NVIDIA:mainfrom
wminckler:fix/exploration-vendor-clarity-bmc-circuit-breaker

Conversation

@wminckler

Copy link
Copy Markdown
Contributor

Summary

Two related reliability fixes for BMCs that become unreachable at the network level.

site-explorer: self-describing missing-vendor error

When a BMC's Redfish ServiceRoot (/redfish/v1) doesn't yield a recognized vendor, the recorded exploration error now reports the raw Vendor/Oem string we actually read and where it came from. This distinguishes:

  • a BMC that reported no vendor at all (commonly transient while it is still initializing/syncing — exploration retries), from
  • a BMC that reported a vendor we don't support yet (the value it sent is preserved).

The exploration metric label is split accordingly — vendor_not_reported vs vendor_unrecognized — so the two failure modes can be alerted on separately. The defensive Unknown arm in the password-rotation path is also documented (it's unreachable from the live exploration path, since the vendor probe rejects Unknown first).

health: per-endpoint BMC connection circuit breaker

Previously, when a BMC stopped answering at the network level, every collector sharing the endpoint's BmcClient kept firing requests that each blocked for a full TCP connect timeout — hundreds per sensor sweep, each logging a warning. This both hammered the dead BMC and flooded the logs.

A per-endpoint circuit breaker now sits on the shared BmcClient:

  • Classifies connect/timeout failures only (reqwest::Error::is_connect()/is_timeout()) — HTTP 404s, auth, and decode errors are not treated as the BMC being unreachable and don't trip it.
  • Opens after the first connect-level failure and fast-fails subsequent requests without touching the network, with exponential backoff (5s → 300s) and a single half-open probe to self-heal.
  • The healthy path is lock-free: a closed circuit is gated by an AtomicBool hint, so the common case never takes the mutex. The mutex is only touched on the unhealthy path.
  • The sensor sweep skips its whole fan-out while the circuit is open, and treats breaker fast-fails as skips rather than per-sensor failures — so a dead endpoint no longer produces a per-sensor log/metric flood.

Testing

  • cargo test -p carbide-health --lib bmc:: — 16 tests pass, including:
    • circuit open/close, single-probe admission, backoff escalation, stale-failure no-op, fast-path-hint coherence
    • an end-to-end test that points a real BmcClient at a closed local port and asserts a genuine connect failure is classified and trips the breaker
    • negative cases: 401/403/404/generic errors do not trip the breaker
  • cargo test -p carbide-site-explorer --lib — passes (incl. unknown→real vendor rotation)
  • cargo clippy clean on carbide-health, carbide-site-explorer, carbide-api-model
  • MissingVendor's unit→struct change is serde-backward-compatible (#[serde(default)] deserializes existing records to observed: None)

🤖 Generated with Claude Code

@wminckler wminckler requested a review from a team as a code owner June 29, 2026 17:35
@coderabbitai

coderabbitai Bot commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Summary by CodeRabbit

  • New Features
    • Added a per-endpoint connection circuit breaker with probe/skip behavior, reducing wasted sensor/exploration work while devices are temporarily unreachable.
  • Bug Fixes
    • Vendor/OEM exploration errors now include the observed raw value when available, while staying backward-compatible with legacy decoding.
    • Improved transport error classification so backoff/circuit-opening triggers only for true connectivity failures.
    • Updated site-explorer metrics to distinguish “vendor not reported” vs “vendor unrecognized.”
  • Tests
    • Added/updated coverage for circuit-breaker transitions, legacy/new error serialization, and operator/error-schema propagation.

Walkthrough

MissingVendor now preserves the observed vendor string through error construction, logging, metrics, and RPC test setup. BmcClient also gains a per-endpoint circuit breaker, and the sensor collector now adjusts iteration behavior based on breaker state.

Changes

Site-explorer MissingVendor propagation

Layer / File(s) Summary
MissingVendor struct variant and error model
crates/api-model/src/site_explorer/mod.rs, crates/rpc/src/model/site_explorer.rs
EndpointExplorationError::MissingVendor changes from a unit variant to observed: Option<String>, with backward-compatible serde defaults, updated error text, operator error handling, legacy deserialization and round-trip tests, and RPC test construction for the new shape.
Vendor capture and metric labels
crates/site-explorer/src/redfish.rs, crates/site-explorer/src/metrics.rs
get_redfish_vendor captures the raw vendor string and returns MissingVendor { observed }; metrics split the label into vendor_not_reported and vendor_unrecognized; Unknown match-arm docs are expanded.

BMC circuit breaker and collector sweep

Layer / File(s) Summary
Circuit state and client fields
crates/health/src/bmc.rs
Circuit timing, CircuitState, and CollectorSweep are defined; BmcClient stores mutex-protected circuit state plus an atomic hint and initializes both in new().
Core breaker logic and classification
crates/health/src/bmc.rs
guarded(), check_circuit(), note_reachable(), trip_circuit(), and is_connection_error() implement breaker admission, state transitions, and transport-only failure classification.
Bmc operations routed through guarded
crates/health/src/bmc.rs
All Bmc operations run through guarded() before the existing auth-refresh retry path.
Sensor collector sweep handling
crates/health/src/collectors/sensors.rs
run_iteration exits on Skip, and Probe mode limits sensor fan-out and derived-metric emission.
Circuit breaker test coverage
crates/health/src/bmc.rs
Tests cover classification, breaker transitions, probe behavior, hint consistency, backoff, and stale-failure handling.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • NVIDIA/infra-controller#2586: Also changes EndpointExplorationError in crates/api-model/src/site_explorer/mod.rs, so schema and downstream handling for the updated MissingVendor shape are closely related.
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title is concise and accurately captures the two main changes: missing-vendor clarification and the BMC circuit breaker.
Description check ✅ Passed The description is directly related to the code changes and explains both reliability fixes in sufficient detail.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (1)
crates/health/src/bmc.rs (1)

743-757: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Use table-driven coverage for the classifier cases.

This is an input-to-boolean classifier with several rows; convert it to the project’s table-driven style so adding future error cases stays cheap and consistent. As per coding guidelines, “Use table-driven test style when writing tests in Rust”.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/health/src/bmc.rs` around lines 743 - 757, Convert the classifier test
in non_connection_errors_do_not_trip_the_circuit into table-driven style, since
is_connection_error is an input-to-boolean helper with multiple cases. Replace
the repeated individual assertions with a single cases table covering the
existing HealthError variants (BmcError with UNAUTHORIZED, BmcError with
NOT_FOUND, HttpError, and dummy_error), then iterate the table and assert each
result. Keep the test name or fold it into a parameterized pattern that clearly
matches the classifier behavior.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/api-model/src/site_explorer/mod.rs`:
- Around line 1199-1211: Keep MissingVendor backward-compatible by preserving
support for the legacy unit-variant JSONB shape currently stored in
exploration_report. Update the SiteExplorer error type in site_explorer::mod so
MissingVendor can still deserialize rows written as "MissingVendor" while also
handling the new observed payload, either by adding an explicit legacy decode
path or a migration for stored data. Make sure the serde/thiserror
representation for MissingVendor does not break existing deserialization of
previously persisted reports.

In `@crates/health/src/bmc.rs`:
- Around line 263-267: The `BmcClient` error handling in the `Probing`/half-open
path only trips on connection errors and otherwise returns the error without
closing the circuit, so update that branch to explicitly close the breaker when
a non-connection BMC response is received. Use the existing
`is_connection_error`, `trip_circuit`, and circuit-state handling in `bmc.rs` to
distinguish reachable responses from transport failures, and add a regression
test covering `CircuitState::Probing` with a non-connection `BmcError` such as
`InvalidResponse` or an auth error to assert the circuit transitions back to
closed.
- Around line 576-580: The stream error handling in the BMC SSE path bypasses
the circuit breaker because `self.inner.stream(uri)` is only wrapped for
creation, while item-level connection failures are mapped afterward in the
returned stream. Update the `guarded()` flow in `Bmc::stream` so that fallible
stream-item errors are routed back through the breaker before they are returned,
and keep the `HealthError::from` conversion associated with the guarded path
rather than only applying it in `map_err` on the stream.

In `@crates/health/src/collectors/sensors.rs`:
- Around line 198-214: The suppression in the sensor fetch error path is using
the breaker’s post-call state, which can misclassify failures that occur during
the single probe in Probing. Update the error handling in the sensors collector
around sensor_link.fetch().await so the quiet skip only happens when the
returned error is a breaker fast-fail or a true connection-level failure from
BmcClient, not merely when self.endpoint.bmc.circuit_blocks_requests() is
currently true. Keep the existing debug log and fetch_failures accounting
behavior in the sensor fetch loop, but gate that branch on the error
type/details from e instead of the breaker state.

---

Nitpick comments:
In `@crates/health/src/bmc.rs`:
- Around line 743-757: Convert the classifier test in
non_connection_errors_do_not_trip_the_circuit into table-driven style, since
is_connection_error is an input-to-boolean helper with multiple cases. Replace
the repeated individual assertions with a single cases table covering the
existing HealthError variants (BmcError with UNAUTHORIZED, BmcError with
NOT_FOUND, HttpError, and dummy_error), then iterate the table and assert each
result. Keep the test name or fold it into a parameterized pattern that clearly
matches the classifier behavior.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: bb36629e-5a3c-4923-9c6b-d72720bb9b42

📥 Commits

Reviewing files that changed from the base of the PR and between 5408f4d and 0b4c2e5.

📒 Files selected for processing (5)
  • crates/api-model/src/site_explorer/mod.rs
  • crates/health/src/bmc.rs
  • crates/health/src/collectors/sensors.rs
  • crates/site-explorer/src/metrics.rs
  • crates/site-explorer/src/redfish.rs

Comment thread crates/api-model/src/site_explorer/mod.rs
Comment thread crates/health/src/bmc.rs
Comment thread crates/health/src/bmc.rs
Comment thread crates/health/src/collectors/sensors.rs Outdated
@github-actions

github-actions Bot commented Jun 29, 2026

Copy link
Copy Markdown

🔍 Container Scan Summary

Service Total Critical High Medium Low Other
boot-artifacts-aarch64 3 0 0 3 0 0
boot-artifacts-x86_64 3 0 0 3 0 0
forge-admin-cli-x86_64 302 6 30 114 8 144
machine-validation-runner 765 30 194 283 37 221
machine_validation 765 30 194 283 37 221
machine_validation-aarch64 765 30 194 283 37 221
nvmetal-carbide 765 30 194 283 37 221
TOTAL 3368 126 806 1252 156 1028

Per-CVE detail lives in the per-service grype-* artifacts (JSON + SARIF). Severity counts only — no CVE IDs published here.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/health/src/bmc.rs (1)

347-381: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Publish the tripped hint before opening the circuit.

Line 350 updates the mutex state to Open before Line 381 stores circuit_tripped = true. During that window, check_circuit() can still take the lock-free fast path and allow requests, and note_reachable() can miss a concurrent success because it observes the stale false hint.

Proposed fix
 fn trip_circuit(&self, error: &HealthError) {
     let mut state = self.circuit.lock().expect("circuit mutex poisoned");
+    // Publish the conservative hint before mutating the state away from
+    // Closed. A Closed-but-tripped window only causes extra locking; an
+    // Open-but-untripped window lets requests bypass the breaker.
+    self.circuit_tripped.store(true, Ordering::Release);
     match *state {
         CircuitState::Closed => {
             *state = CircuitState::Open {
                 until: Instant::now() + CIRCUIT_INITIAL_BACKOFF,
                 backoff: CIRCUIT_INITIAL_BACKOFF,
@@
             // circuit opened. Leave the existing window untouched.
             CircuitState::Open { .. } => {}
         }
-        // Every branch above leaves the circuit non-`Closed`; publish the hint
-        // while still holding the lock so the fast path observes it.
-        self.circuit_tripped.store(true, Ordering::Release);
 }

As per path instructions, this review prioritizes Rust behavior and concurrency findings in crates/**/*.rs.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/health/src/bmc.rs` around lines 347 - 381, The circuit tripped hint is
being published too late in the BMC circuit breaker flow, allowing
`check_circuit()` and `note_reachable()` to observe stale state between the
mutex update and the atomic flag store. In the `Bmc` circuit-handling logic, set
`circuit_tripped` to true before transitioning `CircuitState` to `Open` in the
`Closed` and `Probing` failure paths, while keeping the mutex-protected state
update and existing ordering guarantees intact so the fast path sees the circuit
as tripped immediately.

Source: Path instructions

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@crates/health/src/bmc.rs`:
- Around line 347-381: The circuit tripped hint is being published too late in
the BMC circuit breaker flow, allowing `check_circuit()` and `note_reachable()`
to observe stale state between the mutex update and the atomic flag store. In
the `Bmc` circuit-handling logic, set `circuit_tripped` to true before
transitioning `CircuitState` to `Open` in the `Closed` and `Probing` failure
paths, while keeping the mutex-protected state update and existing ordering
guarantees intact so the fast path sees the circuit as tripped immediately.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: eccd673e-0082-4742-8cc6-1f54c55b1282

📥 Commits

Reviewing files that changed from the base of the PR and between 0b4c2e5 and 2ad8766.

📒 Files selected for processing (3)
  • crates/api-model/src/site_explorer/mod.rs
  • crates/health/src/bmc.rs
  • crates/health/src/collectors/sensors.rs

@wminckler wminckler requested review from poroh and yoks June 29, 2026 19:17
@wminckler wminckler added this to the v2.0 milestone Jun 30, 2026
wminckler and others added 2 commits June 30, 2026 18:56
…circuit breaker

Two related reliability fixes for unreachable BMCs.

site-explorer: when a BMC's Redfish ServiceRoot does not yield a recognized
vendor, the recorded exploration error now reports the raw Vendor/Oem string we
actually read (and where from), distinguishing a BMC that reported nothing
(commonly transient while it is still initializing) from one that reported a
vendor we don't support yet. The exploration metric label is split accordingly
(vendor_not_reported vs vendor_unrecognized) so the two cases can be alerted on
separately.

health: add a per-endpoint connection circuit breaker on the shared BmcClient.
Previously, when a BMC stopped answering at the network level, every collector
sharing the endpoint kept firing requests that each blocked for a full TCP
connect timeout -- hundreds per sensor sweep, each logging a warning. The
breaker classifies connect/timeout failures (not 404s/auth/decode errors),
opens after the first one, and fast-fails subsequent requests with exponential
backoff and a single half-open probe to self-heal. The healthy path stays
lock-free via an atomic hint. The sensor sweep skips its fan-out while the
circuit is open and treats breaker fast-fails as skips rather than per-sensor
failures, so a dead endpoint no longer produces a log flood.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ensor sweep

Follow-up to the connection circuit breaker, resolving review feedback:

- guarded(): a non-connection error (404/auth/decode) now closes the circuit
  too, not just a success. The BMC responded, so it is reachable -- without this
  a half-open probe that got a real error would stay fast-failed until the probe
  deadline.

- Sensor sweep: replace the post-call breaker-state suppression (fragile under
  concurrency -- a probe success could flip the state mid-sweep and re-surface
  the fast-fails) with an explicit three-way decision (Full/Probe/Skip). When the
  backoff window elapses the sweep sends a single probe instead of fanning out
  every sensor, so a dead BMC costs one request rather than a fast-fail burst,
  and per-sensor failures are logged/counted normally again.

- Document that only stream *establishment* runs through the breaker; long-lived
  SSE item errors are handled by the streaming collector's own reconnect backoff.

- Add tests: legacy MissingVendor unit-variant still deserializes (observed:
  None) and the new form round-trips; non-connection error during a probe closes
  the circuit; collector_sweep returns Probe once the window elapses.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@wminckler wminckler force-pushed the fix/exploration-vendor-clarity-bmc-circuit-breaker branch from 2ad8766 to 97f9c4d Compare June 30, 2026 18:57
@github-actions

Copy link
Copy Markdown

Latest main added new `MissingVendor` references (the error_code and
operator_mitigation match arms, plus rpc/api-model tests) that still treated it
as a unit variant. Update them for the `MissingVendor { observed }` struct
variant introduced earlier on this branch.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
crates/rpc/src/model/site_explorer.rs (1)

429-429: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win

Correct migration to struct variant, but consider covering Some(observed).

The construction correctly adopts the new MissingVendor { observed } shape. However, this test only exercises the None case. Given this PR's goal is distinguishing "vendor not reported" from "vendor unrecognized" via the observed field, add a case with observed: Some("...".into()) to assert the schema/text propagates the observed value end-to-end.

♻️ Suggested additional case
     #[test]
     fn endpoint_report_propagates_operator_error_schema_to_rpc() {
         let error = EndpointExplorationError::MissingVendor { observed: None };
         let expected_schema = error.operator_error_schema();
 
         let report =
             rpc::site_explorer::EndpointExplorationReport::from(EndpointExplorationReport {
                 last_exploration_error: Some(error),
                 ..Default::default()
             });
 
         let actual_schema = report
             .last_exploration_error_schema
             .expect("report contains operator error schema");
         assert_eq!(
             actual_schema.error_code,
             expected_schema.error_code.to_string()
         );
         assert_eq!(actual_schema.text, expected_schema.text);
         assert_eq!(actual_schema.mitigation, expected_schema.mitigation);
         assert!(report.last_exploration_error.is_some());
     }
+
+    #[test]
+    fn endpoint_report_propagates_observed_vendor_to_rpc() {
+        let error = EndpointExplorationError::MissingVendor {
+            observed: Some("AcmeCorp".into()),
+        };
+        let expected_schema = error.operator_error_schema();
+
+        let report =
+            rpc::site_explorer::EndpointExplorationReport::from(EndpointExplorationReport {
+                last_exploration_error: Some(error),
+                ..Default::default()
+            });
+
+        let actual_schema = report
+            .last_exploration_error_schema
+            .expect("report contains operator error schema");
+        assert_eq!(actual_schema.text, expected_schema.text);
+        assert!(actual_schema.text.contains("AcmeCorp"));
+    }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/rpc/src/model/site_explorer.rs` at line 429, The
`EndpointExplorationError::MissingVendor` test coverage in `site_explorer` only
validates the `observed: None` path, so extend the existing assertions to also
cover `observed: Some(...)`. Add a case in the relevant test around
`EndpointExplorationError::MissingVendor` that builds the error with a concrete
observed vendor string and verifies the resulting schema/text propagation
includes that value end-to-end, alongside the current `None` case.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@crates/rpc/src/model/site_explorer.rs`:
- Line 429: The `EndpointExplorationError::MissingVendor` test coverage in
`site_explorer` only validates the `observed: None` path, so extend the existing
assertions to also cover `observed: Some(...)`. Add a case in the relevant test
around `EndpointExplorationError::MissingVendor` that builds the error with a
concrete observed vendor string and verifies the resulting schema/text
propagation includes that value end-to-end, alongside the current `None` case.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 16a08844-3485-4159-88a1-5e5bd1276cb2

📥 Commits

Reviewing files that changed from the base of the PR and between 97f9c4d and 01eec58.

📒 Files selected for processing (2)
  • crates/api-model/src/site_explorer/mod.rs
  • crates/rpc/src/model/site_explorer.rs
🚧 Files skipped from review as they are similar to previous changes (1)
  • crates/api-model/src/site_explorer/mod.rs

@wminckler wminckler removed this from the v2.0 milestone Jun 30, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant