feat(switch-controller): add ConfigureCertificate phase with RMS job …#2330
feat(switch-controller): add ConfigureCertificate phase with RMS job …#2330vinodchitraliNVIDIA wants to merge 12 commits into
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughAdds switch and rack certificate-configuration flows, expands shared state and backend contracts, wires the new RPC through api-core and component-manager, updates controller transitions and hostname propagation, and refreshes related tests and documentation. ChangesCertificate configuration flow
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
f555481 to
e1e1aea
Compare
1cd6d97 to
7314be8
Compare
|
@coderabbitai help |
ChatThere are 3 ways to chat with CodeRabbit:
CodeRabbit commands
Other keywords and placeholders
Status, support, documentation and community
|
|
@coderabbitai review |
✅ Action performedReview finished.
|
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (5)
crates/switch-controller/src/configuring.rs (3)
231-285: ⚖️ Poor tradeoffConsider treating missing component manager during polling as a transient wait condition.
Lines 236-244 immediately error if
component_managerisNoneduring theWaitForCompletephase. However, if the component manager becomes unavailable mid-flight (e.g., due to a configuration change or service restart), this causes an irrecoverableErrorstate even though the job may still be running in RMS.Consider whether this should instead produce a
wait(...)outcome to allow the controller to retry once the component manager is restored, rather than transitioning to a terminal error state.If the error is intentional (component manager removal invalidates all pending jobs), add a comment explaining that assumption.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/switch-controller/src/configuring.rs` around lines 231 - 285, The current handler handle_configure_certificate_wait_for_complete immediately returns a terminal Error when ctx.services.component_manager is None; change this to treat a missing component_manager as a transient condition by returning StateHandlerOutcome::wait(...) (including job_id and switch_id in the message) so the controller retries polling instead of transitioning to Error, and only transition to Error on explicit job Failed state or real RPC errors from get_configure_switch_certificate_job_status; alternatively, if removal of the component manager is intended to invalidate jobs, add an explicit comment above the ctx.services.component_manager check stating that assumption and why an Error transition is correct.
34-52: 💤 Low valueConsider removing unnecessary config_state clone.
Line 40 clones
config_statebefore matching. SinceConfiguringStateis a small enum, the clone is not expensive, but it's also not required—you can match on&config_stateand clone only the fields needed by async handlers (e.g.,job_id).♻️ Optional refactor
pub async fn handle_configuring( switch_id: &SwitchId, state: &mut Switch, ctx: &mut StateHandlerContext<'_, SwitchStateHandlerContextObjects>, ) -> Result<StateHandlerOutcome<SwitchControllerState>, StateHandlerError> { - let config_state = match &state.controller_state.value { - SwitchControllerState::Configuring { config_state } => config_state.clone(), + match &state.controller_state.value { + SwitchControllerState::Configuring { config_state } => match config_state { + ConfiguringState::ConfigureCertificate { configure_certificate } => { + handle_configure_certificate(switch_id, state, ctx, configure_certificate.clone()).await + } + ConfiguringState::RotateOsPassword => { + handle_rotate_os_password(switch_id, state, ctx).await + } + }, _ => unreachable!("handle_configuring called with non-Configuring state"), - }; - - match config_state { - ConfiguringState::ConfigureCertificate { configure_certificate } => { - handle_configure_certificate(switch_id, state, ctx, configure_certificate).await - } - ConfiguringState::RotateOsPassword => { - handle_rotate_os_password(switch_id, state, ctx).await - } } }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/switch-controller/src/configuring.rs` around lines 34 - 52, The code in handle_configuring unnecessarily clones config_state; instead match on a reference to avoid the full clone (match &state.controller_state.value { SwitchControllerState::Configuring { config_state } => match config_state { ConfiguringState::ConfigureCertificate { configure_certificate } => { clone only the inner fields needed (e.g., job_id or specific data) and pass those into handle_configure_certificate(switch_id, state, ctx, cloned_field).await }, ConfiguringState::RotateOsPassword => { call handle_rotate_os_password(switch_id, state, ctx).await } ... } }, keeping references except where an async call requires owned data) so remove the config_state.clone() and only clone the minimal fields when invoking async handlers like handle_configure_certificate.
166-189: 💤 Low valueSkip transitions suggest certificate configuration is optional.
Lines 166-176 and 179-189 skip certificate configuration (transitioning directly to
RotateOsPassword) whenrack_idorcomponent_managerare absent, while line 191-197 treats missingbmc_mac_addressas a fatal error. This distinction implies:
- Certificate configuration is optional if the switch is not rack-associated or if no component manager is configured.
- Certificate configuration is required if both preconditions are met, but then
bmc_mac_addressbecomes mandatory.Ensure this logic matches the intended lifecycle semantics. If certificate configuration should always succeed or fail atomically when attempted, consider consolidating the checks and treating all missing preconditions uniformly.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/switch-controller/src/configuring.rs` around lines 166 - 189, The current logic treats missing rack_id or component_manager as optional (skipping to ConfiguringState::RotateOsPassword) but treats missing bmc_mac_address as fatal; consolidate these precondition checks so certificate configuration is handled atomically: in the handler that decides to run certificate configuration (the block that builds cert_name and inspects ctx.services.component_manager), check rack_id, ctx.services.component_manager, and the bmc_mac_address together (referencing rack_id, component_manager, and the bmc_mac_address lookup) and then either: a) if the intended semantics are "optional", uniformly skip certificate configuration when any of those are missing and transition to ConfiguringState::RotateOsPassword, or b) if the intended semantics are "required", return an error (or a failing StateHandlerOutcome) when any of them are missing; implement whichever policy by replacing the separate early-returns with a single consolidated conditional that yields a uniform outcome.crates/switch-controller/src/endpoint.rs (2)
50-65: ⚖️ Poor tradeoffNecessary credential clone exposes API design smell.
Line 62 clones
nvos_credentialsto populate bothbmc_credentialsandnvos_credentialsfields inSwitchEndpoint. This suggests theSwitchEndpointAPI may not match the actual use case—if BMC and NVOS always share credentials, consider:
- Refactoring
SwitchEndpointto accept a singleCredentialsfield and derive both internally, or- Documenting why both fields exist if they serve different purposes.
If they truly are independent in some scenarios, the clone is acceptable, but the current pattern (fetch one, clone for both) indicates they are not.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/switch-controller/src/endpoint.rs` around lines 50 - 65, The function switch_endpoint_from_row currently clones nvos_credentials to populate both SwitchEndpoint.bmc_credentials and .nvos_credentials which signals an API mismatch; update the API and this constructor to avoid the unnecessary clone by either (A) refactoring the SwitchEndpoint struct to have a single credentials field (e.g., credentials) and update switch_endpoint_from_row to set that single field from the provided Credentials, or (B) change switch_endpoint_from_row's signature to accept both bmc_credentials and nvos_credentials separately and assign them directly so no clone is needed; adjust all usages of SwitchEndpoint and the constructor to match the chosen approach and update any documentation/comments to explain the intended credential relationship.
69-85: Consider adding a single-switch endpoint DB helper for readability.
resolve_switch_endpointcallsdb::switch::find_switch_endpoints_by_ids(..., std::slice::from_ref(switch_id)); there’s no singularfind_switch_endpoint_by_idhelper incrates/api-db/src/switch.rs(onlyfind_switch_endpoints_by_idsexists), and other call sites use the same singleton-slice pattern. A small wrapper likefind_switch_endpoint_by_id(db_pool, switch_id)would remove the slice indirection and make the intent explicit.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/switch-controller/src/endpoint.rs` around lines 69 - 85, The code uses db::switch::find_switch_endpoints_by_ids(..., std::slice::from_ref(switch_id)) inside resolve_switch_endpoint which makes a singleton-slice indirection; add a small helper function find_switch_endpoint_by_id(db_pool: &PgPool, switch_id: &SwitchId) -> Result<Option<SwitchEndpointRow>, _> (or matching existing return type) in crates/api-db/src/switch.rs that wraps find_switch_endpoints_by_ids and returns the single row (or Option), then update resolve_switch_endpoint to call find_switch_endpoint_by_id(db_pool, switch_id) and adapt error handling to use the returned Option/Result; also replace other call sites that use std::slice::from_ref pattern to improve readability.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/switch-controller/src/endpoint.rs`:
- Around line 34-48: The function fetch_switch_nvos_credentials silently returns
empty Credentials on Any/None; instead propagate failures or surface a warning
so missing vault data isn't masked: change fetch_switch_nvos_credentials to
return a Result<Credentials, E> (or Option<Credentials>) rather than always
returning Credentials, map
credential_manager.get_credentials(&CredentialKey::SwitchNvosAdmin {
bmc_mac_address: bmc_mac }).await errors/None into an Err with context (or None)
and update callers to handle the error/absence (or log a clear warning) so the
handler/component manager can decide to skip/fail; ensure references to
CredentialKey::SwitchNvosAdmin, credential_manager.get_credentials, and the
Credentials enum are updated accordingly.
- Around line 30-32: The current placeholder_ip() and the usage
unwrap_or_else(placeholder_ip) allow row.nvos_ip == None to silently become
0.0.0.0; change resolve_switch_endpoint (or switch_endpoint_from_row) to treat a
missing row.nvos_ip as an error and return a Result/Err instead of constructing
a SwitchEndpoint with a placeholder IP. Remove the
unwrap_or_else(placeholder_ip) usage, propagate the error to callers like
handle_configure_certificate_start, and ensure those callers map the error into
SwitchControllerState::Error (or otherwise fail-fast) so certificate
configuration is not attempted against an invalid address; keep references to
SwitchEndpoint and row.nvos_ip in error messages for clarity.
In `@docs/architecture/state_machines/switch_configure_certificate.md`:
- Around line 22-37: Remove the misleading self-transition from the
ConfigureCertificate sub-state: delete the transition "WaitForComplete --> Start
: not used (terminal only)" so the sub-state ConfigureCertificate only shows
Start --> WaitForComplete and then ConfigureCertificate --> RotateOsPassword (or
ConfigureCertificate --> Error) — update the Mermaid diagram so WaitForComplete
is represented as a terminal sub-state (no transition back to Start) and keep
references to ConfigureCertificate, Start, and WaitForComplete intact.
---
Nitpick comments:
In `@crates/switch-controller/src/configuring.rs`:
- Around line 231-285: The current handler
handle_configure_certificate_wait_for_complete immediately returns a terminal
Error when ctx.services.component_manager is None; change this to treat a
missing component_manager as a transient condition by returning
StateHandlerOutcome::wait(...) (including job_id and switch_id in the message)
so the controller retries polling instead of transitioning to Error, and only
transition to Error on explicit job Failed state or real RPC errors from
get_configure_switch_certificate_job_status; alternatively, if removal of the
component manager is intended to invalidate jobs, add an explicit comment above
the ctx.services.component_manager check stating that assumption and why an
Error transition is correct.
- Around line 34-52: The code in handle_configuring unnecessarily clones
config_state; instead match on a reference to avoid the full clone (match
&state.controller_state.value { SwitchControllerState::Configuring {
config_state } => match config_state { ConfiguringState::ConfigureCertificate {
configure_certificate } => { clone only the inner fields needed (e.g., job_id or
specific data) and pass those into handle_configure_certificate(switch_id,
state, ctx, cloned_field).await }, ConfiguringState::RotateOsPassword => { call
handle_rotate_os_password(switch_id, state, ctx).await } ... } }, keeping
references except where an async call requires owned data) so remove the
config_state.clone() and only clone the minimal fields when invoking async
handlers like handle_configure_certificate.
- Around line 166-189: The current logic treats missing rack_id or
component_manager as optional (skipping to ConfiguringState::RotateOsPassword)
but treats missing bmc_mac_address as fatal; consolidate these precondition
checks so certificate configuration is handled atomically: in the handler that
decides to run certificate configuration (the block that builds cert_name and
inspects ctx.services.component_manager), check rack_id,
ctx.services.component_manager, and the bmc_mac_address together (referencing
rack_id, component_manager, and the bmc_mac_address lookup) and then either: a)
if the intended semantics are "optional", uniformly skip certificate
configuration when any of those are missing and transition to
ConfiguringState::RotateOsPassword, or b) if the intended semantics are
"required", return an error (or a failing StateHandlerOutcome) when any of them
are missing; implement whichever policy by replacing the separate early-returns
with a single consolidated conditional that yields a uniform outcome.
In `@crates/switch-controller/src/endpoint.rs`:
- Around line 50-65: The function switch_endpoint_from_row currently clones
nvos_credentials to populate both SwitchEndpoint.bmc_credentials and
.nvos_credentials which signals an API mismatch; update the API and this
constructor to avoid the unnecessary clone by either (A) refactoring the
SwitchEndpoint struct to have a single credentials field (e.g., credentials) and
update switch_endpoint_from_row to set that single field from the provided
Credentials, or (B) change switch_endpoint_from_row's signature to accept both
bmc_credentials and nvos_credentials separately and assign them directly so no
clone is needed; adjust all usages of SwitchEndpoint and the constructor to
match the chosen approach and update any documentation/comments to explain the
intended credential relationship.
- Around line 69-85: The code uses db::switch::find_switch_endpoints_by_ids(...,
std::slice::from_ref(switch_id)) inside resolve_switch_endpoint which makes a
singleton-slice indirection; add a small helper function
find_switch_endpoint_by_id(db_pool: &PgPool, switch_id: &SwitchId) ->
Result<Option<SwitchEndpointRow>, _> (or matching existing return type) in
crates/api-db/src/switch.rs that wraps find_switch_endpoints_by_ids and returns
the single row (or Option), then update resolve_switch_endpoint to call
find_switch_endpoint_by_id(db_pool, switch_id) and adapt error handling to use
the returned Option/Result; also replace other call sites that use
std::slice::from_ref pattern to improve readability.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: e7794363-4586-4ab4-be8f-9e58bddbf0df
📒 Files selected for processing (15)
crates/api-core/src/tests/switch_state_controller/fixtures/switch.rscrates/api-core/src/tests/switch_state_controller/mod.rscrates/api-model/src/component_manager.rscrates/api-model/src/switch/mod.rscrates/component-manager/src/component_manager.rscrates/component-manager/src/mock.rscrates/component-manager/src/nsm.rscrates/component-manager/src/nv_switch_manager.rscrates/component-manager/src/rms.rscrates/switch-controller/src/configuring.rscrates/switch-controller/src/endpoint.rscrates/switch-controller/src/initializing.rscrates/switch-controller/src/lib.rsdocs/architecture/state_machines/switch.mddocs/architecture/state_machines/switch_configure_certificate.md
2f73908 to
b742949
Compare
Matthias247
left a comment
There was a problem hiding this comment.
Looks reasonable to me.
As discussed offline it might be nice to rethink doing this outside of the state machine for the future, but as an initial implementation this is definitely helpful
b742949 to
f14edc1
Compare
|
🌿 Preview your docs: https://nvidia-preview-pull-request-2330.docs.buildwithfern.com/infra-controller |
f14edc1 to
3fa9c9c
Compare
4010226 to
6999d9b
Compare
🔐 TruffleHog Secret Scan✅ No secrets or credentials found! Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉 🕐 Last updated: 2026-06-29 22:10:54 UTC | Commit: 6999d9b |
🔍 Container Scan Summary
Per-CVE detail lives in the per-service |
285578f to
5e2629c
Compare
polarweasel
left a comment
There was a problem hiding this comment.
Lots of little changes, and a few questions to answer for your own updates.
e5cfabc to
8a416ea
Compare
8a416ea to
541fb75
Compare
|
@polarweasel can u review again |
1f72fb1 to
1001bf8
Compare
…polling
Extend the switch Configuring state machine with a certificate configuration
sub-flow that runs after RotateOsPassword and before Validating. The handler
submits an async RMS job via Component Manager and polls until completion.
State machine (api-model):
- Add ConfigureCertificateState { Start, WaitForComplete { job_id } }
- Nest under ConfiguringState::ConfigureCertificate
- RotateOsPassword now transitions into ConfigureCertificate(Start)
Switch handler (configuring.rs):
- Start: derive cert_name from switch.rack_id; build SwitchEndpoint from BMC
MAC, NVOS interface, and vault credentials; call CM to start the job
- WaitForComplete: poll CM for ConfigureSwitchCertificateState until Completed
(→ Validating), Failed (→ Error), or in-progress (wait)
- Skip certificate configuration when rack_id or component manager is absent
Component Manager:
- Expose configure_switch_certificate(endpoint, cert_name) → job_id
- Expose get_configure_switch_certificate_job_status(job_id) → job status
- Extend NvSwitchManager; implement in mock (configurable job status), NSM
(unsupported), and RmsBackend (stub until librms RPCs land)
- Add ConfigureSwitchCertificateState { Started, InProgress, Completed, Failed }
Tests:
- Integration tests for skip paths, Start → WaitForComplete, success/failure
polling, and RotateOsPassword → ConfigureCertificate(Start)
- Test fixtures for rack_id assignment and versioned state transitions
Docs:
- Add switch_configure_certificate.md with FSM detail and RMS sequence diagrams
- Update switch.md transitions and link to the new design doc
Signed-off-by: Vinod Chitrali <vchitrali@nvidia.com>
Co-authored-by: Alex Ball <awball@polarweasel.org> Signed-off-by: Vinod Chitrali <51107486+vinodchitraliNVIDIA@users.noreply.github.com>
Co-authored-by: Alex Ball <awball@polarweasel.org> Signed-off-by: Vinod Chitrali <51107486+vinodchitraliNVIDIA@users.noreply.github.com>
Co-authored-by: Alex Ball <awball@polarweasel.org> Signed-off-by: Vinod Chitrali <51107486+vinodchitraliNVIDIA@users.noreply.github.com>
Co-authored-by: Alex Ball <awball@polarweasel.org> Signed-off-by: Vinod Chitrali <51107486+vinodchitraliNVIDIA@users.noreply.github.com>
Co-authored-by: Alex Ball <awball@polarweasel.org> Signed-off-by: Vinod Chitrali <51107486+vinodchitraliNVIDIA@users.noreply.github.com>
Co-authored-by: Alex Ball <awball@polarweasel.org> Signed-off-by: Vinod Chitrali <51107486+vinodchitraliNVIDIA@users.noreply.github.com>
Co-authored-by: Alex Ball <awball@polarweasel.org> Signed-off-by: Vinod Chitrali <51107486+vinodchitraliNVIDIA@users.noreply.github.com>
Co-authored-by: Alex Ball <awball@polarweasel.org> Signed-off-by: Vinod Chitrali <51107486+vinodchitraliNVIDIA@users.noreply.github.com>
Signed-off-by: Vinod Chitrali <vchitrali@nvidia.com>
1001bf8 to
db36b9d
Compare
We decided to stick with this current approach of going through state machine. @tmcroberts97 will be reworking his MR 2989 to go through the state machine for the rotation. |
Co-authored-by: Alex Ball <awball@polarweasel.org> Signed-off-by: Vinod Chitrali <51107486+vinodchitraliNVIDIA@users.noreply.github.com>
Co-authored-by: Alex Ball <awball@polarweasel.org> Signed-off-by: Vinod Chitrali <51107486+vinodchitraliNVIDIA@users.noreply.github.com> Signed-off-by: Vinod Chitrali <vchitrali@nvidia.com>
d418cf0 to
edc7fcb
Compare
…polling
Extend the switch Configuring state machine with a certificate configuration sub-flow that runs before RotateOsPassword and before Validating. The handler submits an async RMS job via Component Manager and polls until completion.
State machine (api-model):
Switch handler (configuring.rs):
Component Manager:
Tests:
Docs:
Description
Type of Change
Related Issues (Optional)
2327
Breaking Changes
Testing
Additional Notes