Skip to content

feat(switch-controller): add ConfigureCertificate phase with RMS job …#2330

Open
vinodchitraliNVIDIA wants to merge 12 commits into
NVIDIA:mainfrom
vinodchitraliNVIDIA:vc/cm
Open

feat(switch-controller): add ConfigureCertificate phase with RMS job …#2330
vinodchitraliNVIDIA wants to merge 12 commits into
NVIDIA:mainfrom
vinodchitraliNVIDIA:vc/cm

Conversation

@vinodchitraliNVIDIA

@vinodchitraliNVIDIA vinodchitraliNVIDIA commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

…polling

Extend the switch Configuring state machine with a certificate configuration sub-flow that runs before RotateOsPassword and before Validating. The handler submits an async RMS job via Component Manager and polls until completion.

State machine (api-model):

  • Add ConfigureCertificateState { Start, WaitForComplete { job_id } }
  • Nest under ConfiguringState::ConfigureCertificate
  • RotateOsPassword now transitions into ConfigureCertificate(Start)

Switch handler (configuring.rs):

  • Start: derive cert_name from switch.rack_id; build SwitchEndpoint from BMC MAC, NVOS interface, and vault credentials; call CM to start the job
  • WaitForComplete: poll CM for ConfigureSwitchCertificateState until Completed (→ Validating), Failed (→ Error), or in-progress (wait)
  • Skip certificate configuration when rack_id or component manager is absent

Component Manager:

  • Expose configure_switch_certificate(endpoint, cert_name) → job_id
  • Expose get_configure_switch_certificate_job_status(job_id) → job status
  • Extend NvSwitchManager; implement in mock (configurable job status), NSM (unsupported), and RmsBackend (stub until librms RPCs land)
  • Add ConfigureSwitchCertificateState { Started, InProgress, Completed, Failed }

Tests:

  • Integration tests for skip paths, Start → WaitForComplete, success/failure polling, and RotateOsPassword → ConfigureCertificate(Start)
  • Test fixtures for rack_id assignment and versioned state transitions

Docs:

  • Add switch_configure_certificate.md with FSM detail and RMS sequence diagrams
  • Update switch.md transitions and link to the new design doc

Description

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Related Issues (Optional)

2327

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

@vinodchitraliNVIDIA vinodchitraliNVIDIA self-assigned this Jun 9, 2026
@vinodchitraliNVIDIA vinodchitraliNVIDIA requested review from a team and Coco-Ben as code owners June 9, 2026 13:19
@coderabbitai

coderabbitai Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Adds switch and rack certificate-configuration flows, expands shared state and backend contracts, wires the new RPC through api-core and component-manager, updates controller transitions and hostname propagation, and refreshes related tests and documentation.

Changes

Certificate configuration flow

Layer / File(s) Summary
State and config contracts
crates/api-model/src/..., crates/api-db/src/..., crates/api-core/src/cfg/file.rs, crates/rpc/proto/forge.proto, rest-api/flow/internal/nicoapi/nicoproto/nico.proto, Cargo.toml, crates/component-manager/Cargo.toml, crates/rack-controller/Cargo.toml
Adds certificate state variants, job-status types, endpoint hostname fields, mTLS service config, and the new component-management RPC contract.
Component-manager backends
crates/component-manager/src/..., crates/api-test-helper/src/mock_rms.rs, crates/rack/src/rms_client.rs
Extends the component-manager trait, config mapping, mock, NSM, and RMS implementations for starting and polling switch certificate jobs.
API RPC and controller wiring
crates/api-core/src/api.rs, crates/api-core/src/handlers/component_manager.rs, crates/api-core/src/setup.rs, crates/api-core/src/test_support/default_config.rs, crates/api-core/src/tests/common/api_fixtures/mod.rs
Wires the new RPC, service configuration, and test fixtures into api-core and the state-controller service setup.
Switch controller flow
crates/switch-controller/src/..., crates/api-core/src/tests/switch_state_controller/*
Adds switch endpoint resolution, certificate start and polling, FetchInfo handling, maintenance reconfigure transitions, and switch-controller integration tests.
Rack controller flow
crates/rack-controller/src/..., crates/rack/src/firmware_update.rs, crates/site-explorer/src/machine_creator.rs, crates/api-core/src/tests/rack_state_controller/*
Adds rack-side certificate orchestration, hostname propagation into rack inventory, and rack-controller integration tests.
Docs and workspace updates
docs/architecture/state_machines/switch.md, docs/architecture/state_machines/switch_configure_certificate.md, deploy/nico-base/api/config-files/nico-api-config.toml, Cargo.toml
Updates the switch FSM documentation, the dedicated certificate design page, deployment comments, and the workspace dependency tag.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title is concise and accurately summarizes the main change: adding a ConfigureCertificate phase with RMS job polling.
Description check ✅ Passed The description is clearly related to the changeset and matches the new switch certificate flow, tests, and docs.
Docstring Coverage ✅ Passed Docstring coverage is 80.25% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@vinodchitraliNVIDIA vinodchitraliNVIDIA marked this pull request as draft June 9, 2026 13:20
@copy-pr-bot

copy-pr-bot Bot commented Jun 9, 2026

Copy link
Copy Markdown

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@vinodchitraliNVIDIA vinodchitraliNVIDIA added the rack lifecycle Issues that relate to managing the lifecycle of a full rack (compute, switches and powershelves) label Jun 9, 2026
@vinodchitraliNVIDIA vinodchitraliNVIDIA force-pushed the vc/cm branch 2 times, most recently from f555481 to e1e1aea Compare June 9, 2026 14:00
@vinodchitraliNVIDIA vinodchitraliNVIDIA marked this pull request as ready for review June 10, 2026 09:39
@vinodchitraliNVIDIA vinodchitraliNVIDIA force-pushed the vc/cm branch 2 times, most recently from 1cd6d97 to 7314be8 Compare June 10, 2026 10:47
@vinodchitraliNVIDIA

Copy link
Copy Markdown
Contributor Author

@coderabbitai help

@coderabbitai

coderabbitai Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

CodeRabbit commands

These commands are invoked using PR/Issue comments.

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai rate limit to show your current review rate limit status.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai evaluate custom pre-merge check --instructions <custom-checks-instructions> --name <custom-checks-title> [--mode <error|warning>] to validate the custom pre-merge checks instructions. Defaults to error when --mode is omitted.
  • @coderabbitai ignore pre-merge checks to override pre-merge checks and get an approval on PR.
  • @coderabbitai run pre-merge checks to run pre-merge checks on the pull request.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai generate unit tests to generate unit tests for this PR.
  • @coderabbitai emit path instructions to emit generated path-specific review instructions for this repository.
  • @coderabbitai resolve merge conflict to automatically resolve merge conflicts.
  • @coderabbitai autofix to automatically fix issues identified in unresolved review comments.
  • @coderabbitai resolve to resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai generate configuration to create a PR that adds the current resolved configuration as .coderabbit.yaml (or show it if already present).
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore or @coderabbit ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

Status, support, documentation and community

  • Visit our status page to check the current availability of CodeRabbit.
  • Create a ticket on our support page for assistance with any issues or questions.
  • Visit our documentation site for detailed information on how to use CodeRabbit.
  • Join our Discord community to connect with other users and get help from the community.
  • Follow us on X/Twitter for updates and announcements.

@vinodchitraliNVIDIA

Copy link
Copy Markdown
Contributor Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (5)
crates/switch-controller/src/configuring.rs (3)

231-285: ⚖️ Poor tradeoff

Consider treating missing component manager during polling as a transient wait condition.

Lines 236-244 immediately error if component_manager is None during the WaitForComplete phase. However, if the component manager becomes unavailable mid-flight (e.g., due to a configuration change or service restart), this causes an irrecoverable Error state even though the job may still be running in RMS.

Consider whether this should instead produce a wait(...) outcome to allow the controller to retry once the component manager is restored, rather than transitioning to a terminal error state.

If the error is intentional (component manager removal invalidates all pending jobs), add a comment explaining that assumption.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/switch-controller/src/configuring.rs` around lines 231 - 285, The
current handler handle_configure_certificate_wait_for_complete immediately
returns a terminal Error when ctx.services.component_manager is None; change
this to treat a missing component_manager as a transient condition by returning
StateHandlerOutcome::wait(...) (including job_id and switch_id in the message)
so the controller retries polling instead of transitioning to Error, and only
transition to Error on explicit job Failed state or real RPC errors from
get_configure_switch_certificate_job_status; alternatively, if removal of the
component manager is intended to invalidate jobs, add an explicit comment above
the ctx.services.component_manager check stating that assumption and why an
Error transition is correct.

34-52: 💤 Low value

Consider removing unnecessary config_state clone.

Line 40 clones config_state before matching. Since ConfiguringState is a small enum, the clone is not expensive, but it's also not required—you can match on &config_state and clone only the fields needed by async handlers (e.g., job_id).

♻️ Optional refactor
 pub async fn handle_configuring(
     switch_id: &SwitchId,
     state: &mut Switch,
     ctx: &mut StateHandlerContext<'_, SwitchStateHandlerContextObjects>,
 ) -> Result<StateHandlerOutcome<SwitchControllerState>, StateHandlerError> {
-    let config_state = match &state.controller_state.value {
-        SwitchControllerState::Configuring { config_state } => config_state.clone(),
+    match &state.controller_state.value {
+        SwitchControllerState::Configuring { config_state } => match config_state {
+            ConfiguringState::ConfigureCertificate { configure_certificate } => {
+                handle_configure_certificate(switch_id, state, ctx, configure_certificate.clone()).await
+            }
+            ConfiguringState::RotateOsPassword => {
+                handle_rotate_os_password(switch_id, state, ctx).await
+            }
+        },
         _ => unreachable!("handle_configuring called with non-Configuring state"),
-    };
-
-    match config_state {
-        ConfiguringState::ConfigureCertificate { configure_certificate } => {
-            handle_configure_certificate(switch_id, state, ctx, configure_certificate).await
-        }
-        ConfiguringState::RotateOsPassword => {
-            handle_rotate_os_password(switch_id, state, ctx).await
-        }
     }
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/switch-controller/src/configuring.rs` around lines 34 - 52, The code
in handle_configuring unnecessarily clones config_state; instead match on a
reference to avoid the full clone (match &state.controller_state.value {
SwitchControllerState::Configuring { config_state } => match config_state {
ConfiguringState::ConfigureCertificate { configure_certificate } => { clone only
the inner fields needed (e.g., job_id or specific data) and pass those into
handle_configure_certificate(switch_id, state, ctx, cloned_field).await },
ConfiguringState::RotateOsPassword => { call
handle_rotate_os_password(switch_id, state, ctx).await } ... } }, keeping
references except where an async call requires owned data) so remove the
config_state.clone() and only clone the minimal fields when invoking async
handlers like handle_configure_certificate.

166-189: 💤 Low value

Skip transitions suggest certificate configuration is optional.

Lines 166-176 and 179-189 skip certificate configuration (transitioning directly to RotateOsPassword) when rack_id or component_manager are absent, while line 191-197 treats missing bmc_mac_address as a fatal error. This distinction implies:

  • Certificate configuration is optional if the switch is not rack-associated or if no component manager is configured.
  • Certificate configuration is required if both preconditions are met, but then bmc_mac_address becomes mandatory.

Ensure this logic matches the intended lifecycle semantics. If certificate configuration should always succeed or fail atomically when attempted, consider consolidating the checks and treating all missing preconditions uniformly.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/switch-controller/src/configuring.rs` around lines 166 - 189, The
current logic treats missing rack_id or component_manager as optional (skipping
to ConfiguringState::RotateOsPassword) but treats missing bmc_mac_address as
fatal; consolidate these precondition checks so certificate configuration is
handled atomically: in the handler that decides to run certificate configuration
(the block that builds cert_name and inspects ctx.services.component_manager),
check rack_id, ctx.services.component_manager, and the bmc_mac_address together
(referencing rack_id, component_manager, and the bmc_mac_address lookup) and
then either: a) if the intended semantics are "optional", uniformly skip
certificate configuration when any of those are missing and transition to
ConfiguringState::RotateOsPassword, or b) if the intended semantics are
"required", return an error (or a failing StateHandlerOutcome) when any of them
are missing; implement whichever policy by replacing the separate early-returns
with a single consolidated conditional that yields a uniform outcome.
crates/switch-controller/src/endpoint.rs (2)

50-65: ⚖️ Poor tradeoff

Necessary credential clone exposes API design smell.

Line 62 clones nvos_credentials to populate both bmc_credentials and nvos_credentials fields in SwitchEndpoint. This suggests the SwitchEndpoint API may not match the actual use case—if BMC and NVOS always share credentials, consider:

  1. Refactoring SwitchEndpoint to accept a single Credentials field and derive both internally, or
  2. Documenting why both fields exist if they serve different purposes.

If they truly are independent in some scenarios, the clone is acceptable, but the current pattern (fetch one, clone for both) indicates they are not.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/switch-controller/src/endpoint.rs` around lines 50 - 65, The function
switch_endpoint_from_row currently clones nvos_credentials to populate both
SwitchEndpoint.bmc_credentials and .nvos_credentials which signals an API
mismatch; update the API and this constructor to avoid the unnecessary clone by
either (A) refactoring the SwitchEndpoint struct to have a single credentials
field (e.g., credentials) and update switch_endpoint_from_row to set that single
field from the provided Credentials, or (B) change switch_endpoint_from_row's
signature to accept both bmc_credentials and nvos_credentials separately and
assign them directly so no clone is needed; adjust all usages of SwitchEndpoint
and the constructor to match the chosen approach and update any
documentation/comments to explain the intended credential relationship.

69-85: Consider adding a single-switch endpoint DB helper for readability.

resolve_switch_endpoint calls db::switch::find_switch_endpoints_by_ids(..., std::slice::from_ref(switch_id)); there’s no singular find_switch_endpoint_by_id helper in crates/api-db/src/switch.rs (only find_switch_endpoints_by_ids exists), and other call sites use the same singleton-slice pattern. A small wrapper like find_switch_endpoint_by_id(db_pool, switch_id) would remove the slice indirection and make the intent explicit.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/switch-controller/src/endpoint.rs` around lines 69 - 85, The code uses
db::switch::find_switch_endpoints_by_ids(..., std::slice::from_ref(switch_id))
inside resolve_switch_endpoint which makes a singleton-slice indirection; add a
small helper function find_switch_endpoint_by_id(db_pool: &PgPool, switch_id:
&SwitchId) -> Result<Option<SwitchEndpointRow>, _> (or matching existing return
type) in crates/api-db/src/switch.rs that wraps find_switch_endpoints_by_ids and
returns the single row (or Option), then update resolve_switch_endpoint to call
find_switch_endpoint_by_id(db_pool, switch_id) and adapt error handling to use
the returned Option/Result; also replace other call sites that use
std::slice::from_ref pattern to improve readability.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/switch-controller/src/endpoint.rs`:
- Around line 34-48: The function fetch_switch_nvos_credentials silently returns
empty Credentials on Any/None; instead propagate failures or surface a warning
so missing vault data isn't masked: change fetch_switch_nvos_credentials to
return a Result<Credentials, E> (or Option<Credentials>) rather than always
returning Credentials, map
credential_manager.get_credentials(&CredentialKey::SwitchNvosAdmin {
bmc_mac_address: bmc_mac }).await errors/None into an Err with context (or None)
and update callers to handle the error/absence (or log a clear warning) so the
handler/component manager can decide to skip/fail; ensure references to
CredentialKey::SwitchNvosAdmin, credential_manager.get_credentials, and the
Credentials enum are updated accordingly.
- Around line 30-32: The current placeholder_ip() and the usage
unwrap_or_else(placeholder_ip) allow row.nvos_ip == None to silently become
0.0.0.0; change resolve_switch_endpoint (or switch_endpoint_from_row) to treat a
missing row.nvos_ip as an error and return a Result/Err instead of constructing
a SwitchEndpoint with a placeholder IP. Remove the
unwrap_or_else(placeholder_ip) usage, propagate the error to callers like
handle_configure_certificate_start, and ensure those callers map the error into
SwitchControllerState::Error (or otherwise fail-fast) so certificate
configuration is not attempted against an invalid address; keep references to
SwitchEndpoint and row.nvos_ip in error messages for clarity.

In `@docs/architecture/state_machines/switch_configure_certificate.md`:
- Around line 22-37: Remove the misleading self-transition from the
ConfigureCertificate sub-state: delete the transition "WaitForComplete --> Start
: not used (terminal only)" so the sub-state ConfigureCertificate only shows
Start --> WaitForComplete and then ConfigureCertificate --> RotateOsPassword (or
ConfigureCertificate --> Error) — update the Mermaid diagram so WaitForComplete
is represented as a terminal sub-state (no transition back to Start) and keep
references to ConfigureCertificate, Start, and WaitForComplete intact.

---

Nitpick comments:
In `@crates/switch-controller/src/configuring.rs`:
- Around line 231-285: The current handler
handle_configure_certificate_wait_for_complete immediately returns a terminal
Error when ctx.services.component_manager is None; change this to treat a
missing component_manager as a transient condition by returning
StateHandlerOutcome::wait(...) (including job_id and switch_id in the message)
so the controller retries polling instead of transitioning to Error, and only
transition to Error on explicit job Failed state or real RPC errors from
get_configure_switch_certificate_job_status; alternatively, if removal of the
component manager is intended to invalidate jobs, add an explicit comment above
the ctx.services.component_manager check stating that assumption and why an
Error transition is correct.
- Around line 34-52: The code in handle_configuring unnecessarily clones
config_state; instead match on a reference to avoid the full clone (match
&state.controller_state.value { SwitchControllerState::Configuring {
config_state } => match config_state { ConfiguringState::ConfigureCertificate {
configure_certificate } => { clone only the inner fields needed (e.g., job_id or
specific data) and pass those into handle_configure_certificate(switch_id,
state, ctx, cloned_field).await }, ConfiguringState::RotateOsPassword => { call
handle_rotate_os_password(switch_id, state, ctx).await } ... } }, keeping
references except where an async call requires owned data) so remove the
config_state.clone() and only clone the minimal fields when invoking async
handlers like handle_configure_certificate.
- Around line 166-189: The current logic treats missing rack_id or
component_manager as optional (skipping to ConfiguringState::RotateOsPassword)
but treats missing bmc_mac_address as fatal; consolidate these precondition
checks so certificate configuration is handled atomically: in the handler that
decides to run certificate configuration (the block that builds cert_name and
inspects ctx.services.component_manager), check rack_id,
ctx.services.component_manager, and the bmc_mac_address together (referencing
rack_id, component_manager, and the bmc_mac_address lookup) and then either: a)
if the intended semantics are "optional", uniformly skip certificate
configuration when any of those are missing and transition to
ConfiguringState::RotateOsPassword, or b) if the intended semantics are
"required", return an error (or a failing StateHandlerOutcome) when any of them
are missing; implement whichever policy by replacing the separate early-returns
with a single consolidated conditional that yields a uniform outcome.

In `@crates/switch-controller/src/endpoint.rs`:
- Around line 50-65: The function switch_endpoint_from_row currently clones
nvos_credentials to populate both SwitchEndpoint.bmc_credentials and
.nvos_credentials which signals an API mismatch; update the API and this
constructor to avoid the unnecessary clone by either (A) refactoring the
SwitchEndpoint struct to have a single credentials field (e.g., credentials) and
update switch_endpoint_from_row to set that single field from the provided
Credentials, or (B) change switch_endpoint_from_row's signature to accept both
bmc_credentials and nvos_credentials separately and assign them directly so no
clone is needed; adjust all usages of SwitchEndpoint and the constructor to
match the chosen approach and update any documentation/comments to explain the
intended credential relationship.
- Around line 69-85: The code uses db::switch::find_switch_endpoints_by_ids(...,
std::slice::from_ref(switch_id)) inside resolve_switch_endpoint which makes a
singleton-slice indirection; add a small helper function
find_switch_endpoint_by_id(db_pool: &PgPool, switch_id: &SwitchId) ->
Result<Option<SwitchEndpointRow>, _> (or matching existing return type) in
crates/api-db/src/switch.rs that wraps find_switch_endpoints_by_ids and returns
the single row (or Option), then update resolve_switch_endpoint to call
find_switch_endpoint_by_id(db_pool, switch_id) and adapt error handling to use
the returned Option/Result; also replace other call sites that use
std::slice::from_ref pattern to improve readability.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e7794363-4586-4ab4-be8f-9e58bddbf0df

📥 Commits

Reviewing files that changed from the base of the PR and between 51d0530 and 7314be8.

📒 Files selected for processing (15)
  • crates/api-core/src/tests/switch_state_controller/fixtures/switch.rs
  • crates/api-core/src/tests/switch_state_controller/mod.rs
  • crates/api-model/src/component_manager.rs
  • crates/api-model/src/switch/mod.rs
  • crates/component-manager/src/component_manager.rs
  • crates/component-manager/src/mock.rs
  • crates/component-manager/src/nsm.rs
  • crates/component-manager/src/nv_switch_manager.rs
  • crates/component-manager/src/rms.rs
  • crates/switch-controller/src/configuring.rs
  • crates/switch-controller/src/endpoint.rs
  • crates/switch-controller/src/initializing.rs
  • crates/switch-controller/src/lib.rs
  • docs/architecture/state_machines/switch.md
  • docs/architecture/state_machines/switch_configure_certificate.md

Comment thread crates/switch-controller/src/endpoint.rs Outdated
Comment thread docs/architecture/state_machines/switch_configure_certificate.md
@vinodchitraliNVIDIA vinodchitraliNVIDIA force-pushed the vc/cm branch 3 times, most recently from 2f73908 to b742949 Compare June 12, 2026 06:37

@Matthias247 Matthias247 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks reasonable to me.

As discussed offline it might be nice to rethink doing this outside of the state machine for the future, but as an initial implementation this is definitely helpful

Comment thread crates/api-core/src/tests/switch_state_controller/mod.rs
@github-actions

Copy link
Copy Markdown

@vinodchitraliNVIDIA vinodchitraliNVIDIA marked this pull request as draft June 12, 2026 16:52
@vinodchitraliNVIDIA vinodchitraliNVIDIA force-pushed the vc/cm branch 2 times, most recently from 4010226 to 6999d9b Compare June 29, 2026 06:29
@vinodchitraliNVIDIA vinodchitraliNVIDIA marked this pull request as ready for review June 29, 2026 22:08
@github-actions

Copy link
Copy Markdown

🔐 TruffleHog Secret Scan

No secrets or credentials found!

Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉

🔗 View scan details

🕐 Last updated: 2026-06-29 22:10:54 UTC | Commit: 6999d9b

@github-actions

github-actions Bot commented Jun 29, 2026

Copy link
Copy Markdown

🔍 Container Scan Summary

Service Total Critical High Medium Low Other
nico-flow 13 1 2 2 1 7
nico-nsm 5 0 0 4 1 0
nico-psm 13 1 2 2 1 7
nico-rest-api 13 1 2 2 1 7
nico-rest-cert-manager 12 1 2 2 0 7
nico-rest-db 13 1 2 2 1 7
nico-rest-site-agent 12 1 2 2 0 7
nico-rest-site-manager 12 1 2 2 0 7
nico-rest-workflow 13 1 2 2 1 7
TOTAL 106 8 16 20 6 56

Per-CVE detail lives in the per-service grype-* artifacts (JSON + SARIF). Severity counts only — no CVE IDs published here.

@vinodchitraliNVIDIA vinodchitraliNVIDIA force-pushed the vc/cm branch 5 times, most recently from 285578f to 5e2629c Compare June 30, 2026 20:39

@polarweasel polarweasel left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lots of little changes, and a few questions to answer for your own updates.

Comment thread docs/architecture/state_machines/switch_configure_certificate.md Outdated
Comment thread docs/architecture/state_machines/switch_configure_certificate.md Outdated
Comment thread docs/architecture/state_machines/switch_configure_certificate.md
Comment thread docs/architecture/state_machines/switch_configure_certificate.md Outdated
Comment thread docs/architecture/state_machines/switch_configure_certificate.md Outdated
Comment thread docs/architecture/state_machines/rackstatemachine.md Outdated
Comment thread docs/architecture/state_machines/rackstatemachine.md Outdated
Comment thread docs/architecture/state_machines/rackstatemachine.md Outdated
Comment thread docs/architecture/state_machines/rackstatemachine.md Outdated
Comment thread docs/architecture/state_machines/rackstatemachine.md Outdated
@vinodchitraliNVIDIA

Copy link
Copy Markdown
Contributor Author

@polarweasel can u review again

@vinodchitraliNVIDIA vinodchitraliNVIDIA force-pushed the vc/cm branch 3 times, most recently from 1f72fb1 to 1001bf8 Compare July 1, 2026 14:07
vinodchitraliNVIDIA and others added 10 commits July 1, 2026 16:36
…polling

Extend the switch Configuring state machine with a certificate configuration
sub-flow that runs after RotateOsPassword and before Validating. The handler
submits an async RMS job via Component Manager and polls until completion.

   State machine (api-model):
   - Add ConfigureCertificateState { Start, WaitForComplete { job_id } }
   - Nest under ConfiguringState::ConfigureCertificate
   - RotateOsPassword now transitions into ConfigureCertificate(Start)

   Switch handler (configuring.rs):
   - Start: derive cert_name from switch.rack_id; build SwitchEndpoint from BMC
     MAC, NVOS interface, and vault credentials; call CM to start the job
   - WaitForComplete: poll CM for ConfigureSwitchCertificateState until Completed
     (→ Validating), Failed (→ Error), or in-progress (wait)
   - Skip certificate configuration when rack_id or component manager is absent

   Component Manager:
   - Expose configure_switch_certificate(endpoint, cert_name) → job_id
   - Expose get_configure_switch_certificate_job_status(job_id) → job status
   - Extend NvSwitchManager; implement in mock (configurable job status), NSM
     (unsupported), and RmsBackend (stub until librms RPCs land)
   - Add ConfigureSwitchCertificateState { Started, InProgress, Completed, Failed }

   Tests:
   - Integration tests for skip paths, Start → WaitForComplete, success/failure
     polling, and RotateOsPassword → ConfigureCertificate(Start)
   - Test fixtures for rack_id assignment and versioned state transitions

   Docs:
   - Add switch_configure_certificate.md with FSM detail and RMS sequence diagrams
            - Update switch.md transitions and link to the new design doc

Signed-off-by: Vinod Chitrali <vchitrali@nvidia.com>
Co-authored-by: Alex Ball <awball@polarweasel.org>
Signed-off-by: Vinod Chitrali <51107486+vinodchitraliNVIDIA@users.noreply.github.com>
Co-authored-by: Alex Ball <awball@polarweasel.org>
Signed-off-by: Vinod Chitrali <51107486+vinodchitraliNVIDIA@users.noreply.github.com>
Co-authored-by: Alex Ball <awball@polarweasel.org>
Signed-off-by: Vinod Chitrali <51107486+vinodchitraliNVIDIA@users.noreply.github.com>
Co-authored-by: Alex Ball <awball@polarweasel.org>
Signed-off-by: Vinod Chitrali <51107486+vinodchitraliNVIDIA@users.noreply.github.com>
Co-authored-by: Alex Ball <awball@polarweasel.org>
Signed-off-by: Vinod Chitrali <51107486+vinodchitraliNVIDIA@users.noreply.github.com>
Co-authored-by: Alex Ball <awball@polarweasel.org>
Signed-off-by: Vinod Chitrali <51107486+vinodchitraliNVIDIA@users.noreply.github.com>
Co-authored-by: Alex Ball <awball@polarweasel.org>
Signed-off-by: Vinod Chitrali <51107486+vinodchitraliNVIDIA@users.noreply.github.com>
Co-authored-by: Alex Ball <awball@polarweasel.org>
Signed-off-by: Vinod Chitrali <51107486+vinodchitraliNVIDIA@users.noreply.github.com>
Signed-off-by: Vinod Chitrali <vchitrali@nvidia.com>
@amit-pabalkar

Copy link
Copy Markdown

Why do we not just call the ConfigureSwitch certificate API from within the new NVLink Manager task directly and poll results in a follow-up iteration. It could really be called in a loop - we just need to store the poll handle and the last update time on the NVLink switch.
Is there anything major to gain from threading it through the state machine? The usual thing to gain is "there is mutual exclusive access to operations on that switch". However I don't think it matters for this use-case:

  • Either the old cert is valid, and the next cert deployment is not impacting the switch
  • Or the old cert was not valid, and everything else would fail anyway.

@vinodchitraliNVIDIA @tmcroberts97

seems like exactly this was implemented in #2989. Is that correct? So do we not need this PR anymore?

We decided to stick with this current approach of going through state machine. @tmcroberts97 will be reworking his MR 2989 to go through the state machine for the rotation.

vinodchitraliNVIDIA and others added 2 commits July 2, 2026 00:28
Co-authored-by: Alex Ball <awball@polarweasel.org>
Signed-off-by: Vinod Chitrali <51107486+vinodchitraliNVIDIA@users.noreply.github.com>
Co-authored-by: Alex Ball <awball@polarweasel.org>
Signed-off-by: Vinod Chitrali <51107486+vinodchitraliNVIDIA@users.noreply.github.com>
Signed-off-by: Vinod Chitrali <vchitrali@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

rack lifecycle Issues that relate to managing the lifecycle of a full rack (compute, switches and powershelves)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants