feat(admin-cli): complete hardware health management#2958
Conversation
Add rack health-report CRUD and expose aggregate health consistently for racks, switches, and power shelves. Keep generated CLI examples valid and copy-pasteable. Signed-off-by: Hasan Khan <hasank@nvidia.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughAdds rack health-report subcommands, shared health formatting helpers, and expanded health-aware rendering for rack, switch, and power-shelf output. CLI help examples and manuals are updated to cover the new commands and revised example identifiers. ChangesRack health-report CLI and unified health display
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/admin-cli/src/switch/show/cmd.rs (1)
324-355: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick winIndent continuation lines for multi-line health fields.
format_health_alerts()andformat_health_sources()join multiple entries with\n, but this loop only prefixes the first line. With more than one alert or report source, the remaining lines render flush-left and bleed into the surrounding sections. Split multiline values and indent continuation lines before writing them. As per path instructions,crates/admin-cli/**: "Review CLI changes for clap behavior, actionable operator-facing error messages, realistic help examples, stable command names, and regenerated reference documentation when command surfaces change."Proposed fix
- for (key, value) in status_data { - writeln!(&mut lines, "\t{key:<sw$}: {value}")?; - } + for (key, value) in status_data { + let mut value_lines = value.lines(); + if let Some(first_line) = value_lines.next() { + writeln!(&mut lines, "\t{key:<sw$}: {first_line}")?; + for continuation in value_lines { + writeln!(&mut lines, "\t{:<sw$} {continuation}", "")?; + } + } else { + writeln!(&mut lines, "\t{key:<sw$}:")?; + } + }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/admin-cli/src/switch/show/cmd.rs` around lines 324 - 355, The multiline health fields in the status rendering loop need continuation-line indentation so extra alerts/sources don’t print flush-left. Update the `status_data` writing logic in `cmd.rs` to detect values from `health_utils::format_health_alerts()` and `health_utils::format_health_sources()` that contain newlines, split them into lines, and prefix all lines after the first with the same indentation as the value column. Keep the existing formatting for single-line fields like `Controller State` and `Fabric Manager`, and ensure the loop that writes with `writeln!` preserves aligned operator-facing output.Source: Path instructions
🧹 Nitpick comments (3)
crates/admin-cli/src/rack/health_report/print_empty_template/args.rs (1)
21-27: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winDrop
after_long_helpfrom this unit command.This subcommand has no arguments or hidden behavior, so the extra example just adds noise to generated help/docs. As per coding guidelines, “Do not add
#[command(after_long_help)]blocks to trivial unit variants; add one only when the example teaches something the--helppage wouldn't otherwise show.”🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/admin-cli/src/rack/health_report/print_empty_template/args.rs` around lines 21 - 27, Remove the unnecessary after_long_help attribute from the print_empty_template unit command in the args definition, since this subcommand has no arguments or special behavior and the extra example adds noise. Update the command metadata on the relevant command enum/variant so the generated help stays minimal and consistent with the other trivial unit variants.Source: Coding guidelines
crates/admin-cli/src/rack/health_report/show/cmd.rs (1)
32-41: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winUse action-oriented error context here.
These
wrap_errmessages are operator-facing, but they do not follow the repo’s requiredwhile attempting to …phrasing, so the rendered chain reads less coherently than the other admin-cli errors.Proposed change
- let context = format!("Failed to list health reports for rack {}", args.rack_id); + let context = format!( + "while attempting to list health reports for rack {}", + args.rack_id + ); @@ - .wrap_err("Failed to display rack health reports")?; + .wrap_err("while attempting to display rack health reports")?;As per coding guidelines, "Write
.context()chains that complete 'while attempting to …' so the rendered error chain reads as a coherent story", and as per path instructions forcrates/admin-cli/**, review CLI changes for actionable operator-facing error messages.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/admin-cli/src/rack/health_report/show/cmd.rs` around lines 32 - 41, Update the operator-facing error contexts in show::cmd so they follow the repo’s “while attempting to …” wording. In the command flow around list_rack_health_reports and health_utils::display_health_reports, replace the current wrap_err messages with action-oriented context that reads naturally in the error chain, using the existing symbols args.rack_id, api_client, list_rack_health_reports, and display_health_reports to keep the messages specific and actionable.Sources: Coding guidelines, Path instructions
crates/admin-cli/src/rack/health_report/remove/cmd.rs (1)
26-37: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winAlign the remove error context with admin-cli conventions.
This context string is also user-facing and should follow the repo’s
while attempting to …style so the error chain stays consistent and easier to scan.Proposed change
- let context = format!( - "Failed to remove health report source {} from rack {}", - args.report_source, args.rack_id - ); + let context = format!( + "while attempting to remove health report source {} from rack {}", + args.report_source, args.rack_id + );As per coding guidelines, "Write
.context()chains that complete 'while attempting to …' so the rendered error chain reads as a coherent story", and as per path instructions forcrates/admin-cli/**, review CLI changes for actionable operator-facing error messages.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/admin-cli/src/rack/health_report/remove/cmd.rs` around lines 26 - 37, The remove-flow error context in the health report command is not using the repo’s standard user-facing phrasing. Update the context built in the remove health report command (the `remove_rack_health_report` call site in the `cmd` handler) so it reads in the “while attempting to …” style, keeping the message actionable and consistent with other `admin-cli` error chains.Sources: Coding guidelines, Path instructions
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/admin-cli/src/rack/health_report/add/args.rs`:
- Around line 47-48: Bind the message argument to the template flow in the add
args parser: in the args type that defines `message` and `template`, make
`message` require `template` so `rack health-report add ... --health-report ...
--message ...` is rejected instead of ignored. Update the clap metadata on the
`message` field in the relevant args struct, and add a negative parse test
covering the invalid `message`-without-`template` case alongside the existing
template-path behavior.
In `@crates/admin-cli/src/rack/show/cmd.rs`:
- Around line 273-276: Remove the unnecessary format! wrappers around the count
values passed into the row! call in the rack summary logic; the relevant cells
in the show cmd flow already accept the numeric lengths directly. Update the row
construction near the health_utils::aggregate_health_status usage to pass
current_compute_trays.len(), current_power_shelves.len(), and
current_nvlink_switches.len() without formatting so the table code stays
consistent and avoids clippy::useless_format.
In `@docs/manuals/nico-admin-cli/commands/rack/rack-health-report-add.md`:
- Around line 12-27: Update the rack-health-report-add command docs to state the
one-of requirement clearly: exactly one of --health-report or --template must be
provided. In the command synopsis and the option descriptions for
--health-report and --template, mark them as mutually exclusive and note that
supplying both or neither is invalid. Keep the guidance aligned with the CLI
parser behavior for the rack-health-report-add command.
---
Outside diff comments:
In `@crates/admin-cli/src/switch/show/cmd.rs`:
- Around line 324-355: The multiline health fields in the status rendering loop
need continuation-line indentation so extra alerts/sources don’t print
flush-left. Update the `status_data` writing logic in `cmd.rs` to detect values
from `health_utils::format_health_alerts()` and
`health_utils::format_health_sources()` that contain newlines, split them into
lines, and prefix all lines after the first with the same indentation as the
value column. Keep the existing formatting for single-line fields like
`Controller State` and `Fabric Manager`, and ensure the loop that writes with
`writeln!` preserves aligned operator-facing output.
---
Nitpick comments:
In `@crates/admin-cli/src/rack/health_report/print_empty_template/args.rs`:
- Around line 21-27: Remove the unnecessary after_long_help attribute from the
print_empty_template unit command in the args definition, since this subcommand
has no arguments or special behavior and the extra example adds noise. Update
the command metadata on the relevant command enum/variant so the generated help
stays minimal and consistent with the other trivial unit variants.
In `@crates/admin-cli/src/rack/health_report/remove/cmd.rs`:
- Around line 26-37: The remove-flow error context in the health report command
is not using the repo’s standard user-facing phrasing. Update the context built
in the remove health report command (the `remove_rack_health_report` call site
in the `cmd` handler) so it reads in the “while attempting to …” style, keeping
the message actionable and consistent with other `admin-cli` error chains.
In `@crates/admin-cli/src/rack/health_report/show/cmd.rs`:
- Around line 32-41: Update the operator-facing error contexts in show::cmd so
they follow the repo’s “while attempting to …” wording. In the command flow
around list_rack_health_reports and health_utils::display_health_reports,
replace the current wrap_err messages with action-oriented context that reads
naturally in the error chain, using the existing symbols args.rack_id,
api_client, list_rack_health_reports, and display_health_reports to keep the
messages specific and actionable.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 2f981dba-ee15-4409-a44d-718e085d85bd
📒 Files selected for processing (38)
crates/admin-cli/src/health_utils.rscrates/admin-cli/src/power_shelf/health_report/add/args.rscrates/admin-cli/src/power_shelf/health_report/remove/args.rscrates/admin-cli/src/power_shelf/health_report/show/args.rscrates/admin-cli/src/power_shelf/show/cmd.rscrates/admin-cli/src/rack/health_report/add/args.rscrates/admin-cli/src/rack/health_report/add/cmd.rscrates/admin-cli/src/rack/health_report/add/mod.rscrates/admin-cli/src/rack/health_report/args.rscrates/admin-cli/src/rack/health_report/mod.rscrates/admin-cli/src/rack/health_report/print_empty_template/args.rscrates/admin-cli/src/rack/health_report/print_empty_template/cmd.rscrates/admin-cli/src/rack/health_report/print_empty_template/mod.rscrates/admin-cli/src/rack/health_report/remove/args.rscrates/admin-cli/src/rack/health_report/remove/cmd.rscrates/admin-cli/src/rack/health_report/remove/mod.rscrates/admin-cli/src/rack/health_report/show/args.rscrates/admin-cli/src/rack/health_report/show/cmd.rscrates/admin-cli/src/rack/health_report/show/mod.rscrates/admin-cli/src/rack/mod.rscrates/admin-cli/src/rack/show/cmd.rscrates/admin-cli/src/rack/tests.rscrates/admin-cli/src/switch/health_report/add/args.rscrates/admin-cli/src/switch/health_report/remove/args.rscrates/admin-cli/src/switch/health_report/show/args.rscrates/admin-cli/src/switch/show/cmd.rsdocs/manuals/nico-admin-cli/commands/power-shelf/power-shelf-health-report-add.mddocs/manuals/nico-admin-cli/commands/power-shelf/power-shelf-health-report-remove.mddocs/manuals/nico-admin-cli/commands/power-shelf/power-shelf-health-report-show.mddocs/manuals/nico-admin-cli/commands/rack/rack-health-report-add.mddocs/manuals/nico-admin-cli/commands/rack/rack-health-report-print-empty-template.mddocs/manuals/nico-admin-cli/commands/rack/rack-health-report-remove.mddocs/manuals/nico-admin-cli/commands/rack/rack-health-report-show.mddocs/manuals/nico-admin-cli/commands/rack/rack-health-report.mddocs/manuals/nico-admin-cli/commands/rack/rack.mddocs/manuals/nico-admin-cli/commands/switch/switch-health-report-add.mddocs/manuals/nico-admin-cli/commands/switch/switch-health-report-remove.mddocs/manuals/nico-admin-cli/commands/switch/switch-health-report-show.md
Tighten rack health-report argument validation and documentation, align multiline switch health output, and normalize operator-facing error context. Signed-off-by: Hasan Khan <hasank@nvidia.com>
There was a problem hiding this comment.
🧹 Nitpick comments (1)
crates/admin-cli/src/rack/health_report/print_empty_template/args.rs (1)
21-21: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winRestore the worked
--helpexamples for this clap command.Replacing the
#[command(after_long_help = "...")]block with a doc comment removes the operator-facing examples fromnico-admin-cli rack health-report print-empty-template --help. This command is part of the public admin CLI surface, so the examples should stay on theArgsstruct rather than only in docs.Suggested fix
+#[derive(clap::Parser, Debug)] +#[command( + after_long_help = "EXAMPLES: + +Print an empty rack health report template to stdout: + $ nico-admin-cli rack health-report print-empty-template + +Write a starter template to a file for editing: + $ nico-admin-cli rack health-report print-empty-template > ./rack-health-report.json + +" +)] /// Arguments for printing an empty rack health report template. pub struct Args;Based on learnings from the repository guidelines, admin-cli clap
Args/Optsstructs should include#[command(after_long_help = "...")]with a workedEXAMPLES:section using the real binary name and copy-pasteable invocations. As per path instructions, review CLI changes for clap behavior and realistic help examples.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/admin-cli/src/rack/health_report/print_empty_template/args.rs` at line 21, The public clap help examples were removed from the Args struct for the print-empty-template command, so restore them on the Args definition using #[command(after_long_help = "...")] rather than only a doc comment. Update the existing rack health_report print_empty_template Args struct to include a worked EXAMPLES section with the real nico-admin-cli binary name and copy-pasteable invocations so `--help` shows the operator-facing examples again.Sources: Coding guidelines, Path instructions
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@crates/admin-cli/src/rack/health_report/print_empty_template/args.rs`:
- Line 21: The public clap help examples were removed from the Args struct for
the print-empty-template command, so restore them on the Args definition using
#[command(after_long_help = "...")] rather than only a doc comment. Update the
existing rack health_report print_empty_template Args struct to include a worked
EXAMPLES section with the real nico-admin-cli binary name and copy-pasteable
invocations so `--help` shows the operator-facing examples again.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 54a89340-09a7-41e3-8b34-b15f22de1732
📒 Files selected for processing (17)
crates/admin-cli/src/power_shelf/show/cmd.rscrates/admin-cli/src/rack/health_report/add/args.rscrates/admin-cli/src/rack/health_report/add/cmd.rscrates/admin-cli/src/rack/health_report/args.rscrates/admin-cli/src/rack/health_report/print_empty_template/args.rscrates/admin-cli/src/rack/health_report/print_empty_template/cmd.rscrates/admin-cli/src/rack/health_report/remove/args.rscrates/admin-cli/src/rack/health_report/remove/cmd.rscrates/admin-cli/src/rack/health_report/show/args.rscrates/admin-cli/src/rack/health_report/show/cmd.rscrates/admin-cli/src/rack/show/cmd.rscrates/admin-cli/src/rack/tests.rscrates/admin-cli/src/switch/show/cmd.rsdocs/manuals/nico-admin-cli/commands/rack/rack-health-report-add.mddocs/manuals/nico-admin-cli/commands/rack/rack-health-report-print-empty-template.mddocs/manuals/nico-admin-cli/commands/rack/rack-health-report-remove.mddocs/manuals/nico-admin-cli/commands/rack/rack-health-report.md
💤 Files with no reviewable changes (1)
- docs/manuals/nico-admin-cli/commands/rack/rack-health-report-print-empty-template.md
✅ Files skipped from review due to trivial changes (2)
- docs/manuals/nico-admin-cli/commands/rack/rack-health-report-remove.md
- docs/manuals/nico-admin-cli/commands/rack/rack-health-report.md
🚧 Files skipped from review as they are similar to previous changes (13)
- crates/admin-cli/src/rack/health_report/show/args.rs
- crates/admin-cli/src/rack/health_report/print_empty_template/cmd.rs
- crates/admin-cli/src/rack/health_report/remove/args.rs
- crates/admin-cli/src/rack/health_report/args.rs
- docs/manuals/nico-admin-cli/commands/rack/rack-health-report-add.md
- crates/admin-cli/src/rack/health_report/add/args.rs
- crates/admin-cli/src/rack/health_report/show/cmd.rs
- crates/admin-cli/src/rack/tests.rs
- crates/admin-cli/src/rack/health_report/remove/cmd.rs
- crates/admin-cli/src/rack/health_report/add/cmd.rs
- crates/admin-cli/src/switch/show/cmd.rs
- crates/admin-cli/src/power_shelf/show/cmd.rs
- crates/admin-cli/src/rack/show/cmd.rs
Keep the no-argument command useful by documenting stdout and file-redirection workflows in generated help. Signed-off-by: Hasan Khan <hasank@nvidia.com>
|
@coderabbitai review |
✅ Action performedReview finished.
|
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
docs/manuals/nico-admin-cli/commands/rack/rack-health-report-print-empty-template.md (1)
12-39: 📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick winTrim the option block to match the real command.
print_empty_template::Argsis flagless, and the implementation only prints a JSON template to stdout. The--extended/--sort-bysynopsis and the measured-boot option text are stale copies from another command, so this page currently advertises flags that do not exist.Suggested cleanup
## SYNOPSIS **nico-admin-cli rack health-report print-empty-template** -\[**--extended**\] \[**--sort-by**\] \[**-h**\|**--help**\] +\[**-h**\|**--help**\] ## DESCRIPTION -Print an empty health report template +Print an empty health report template to stdout. -## OPTIONS - -**--extended** -Extended result output. - -This used by measured boot, where basic output contains just what you -probably care about, and "extended" output also dumps out all the -internal UUIDs that are used to associate instances. - -**--sort-by** *\<SORT_BY\>* \[default: primary-id\] -Sort output by specified field\ - -\ -*Possible values:* - -- primary-id: Sort by the primary id - -- state: Sort by state - **-h**, **--help** Print help (see a summary with -h)As per path instructions, Markdown should match the implemented CLI surface.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/manuals/nico-admin-cli/commands/rack/rack-health-report-print-empty-template.md` around lines 12 - 39, The markdown for the rack health-report print-empty-template command is advertising stale options that do not exist in the implemented CLI surface. Update the command synopsis and OPTIONS section to match `print_empty_template::Args` and the template-printing behavior, removing the copied `--extended` and `--sort-by` entries and the measured-boot explanatory text so the docs only reflect the actual flagless command.Source: Path instructions
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In
`@docs/manuals/nico-admin-cli/commands/rack/rack-health-report-print-empty-template.md`:
- Around line 12-39: The markdown for the rack health-report
print-empty-template command is advertising stale options that do not exist in
the implemented CLI surface. Update the command synopsis and OPTIONS section to
match `print_empty_template::Args` and the template-printing behavior, removing
the copied `--extended` and `--sort-by` entries and the measured-boot
explanatory text so the docs only reflect the actual flagless command.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 354ff92c-d76d-47ae-b395-b43b8e903f63
📒 Files selected for processing (2)
crates/admin-cli/src/rack/health_report/print_empty_template/args.rsdocs/manuals/nico-admin-cli/commands/rack/rack-health-report-print-empty-template.md
🚧 Files skipped from review as they are similar to previous changes (1)
- crates/admin-cli/src/rack/health_report/print_empty_template/args.rs
🔍 Container Scan Summary
Per-CVE detail lives in the per-service |
|
Weird, this PR says it's waiting on a review from @polarweasel but doesn't show in the top right reviewers box. |
…ack-health # Conflicts: # crates/admin-cli/src/power_shelf/show/cmd.rs
|
🌿 Preview your docs: https://nvidia-preview-pull-request-2958.docs.buildwithfern.com/infra-controller |
polarweasel
left a comment
There was a problem hiding this comment.
Just some little nitpicky things...
Co-authored-by: Alex Ball <awball@polarweasel.org> Signed-off-by: Hasan Khan <hkhan7@ualberta.ca>
…how.md Co-authored-by: Alex Ball <awball@polarweasel.org> Signed-off-by: Hasan Khan <hkhan7@ualberta.ca>
…rint-empty-template.md Co-authored-by: Alex Ball <awball@polarweasel.org> Signed-off-by: Hasan Khan <hkhan7@ualberta.ca>
…dd.md Co-authored-by: Alex Ball <awball@polarweasel.org> Signed-off-by: Hasan Khan <hkhan7@ualberta.ca>
…dd.md Co-authored-by: Alex Ball <awball@polarweasel.org> Signed-off-by: Hasan Khan <hkhan7@ualberta.ca>
…dd.md Co-authored-by: Alex Ball <awball@polarweasel.org> Signed-off-by: Hasan Khan <hkhan7@ualberta.ca>
Signed-off-by: Hasan Khan <hasank@nvidia.com>
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
Description
Complete admin-cli health-report management for racks and align aggregate-health rendering across racks, switches, and power shelves.
This change:
show,add,remove, andprint-empty-templatecommands with thehralias;--print-onlypreviews;Related issues
Closes #1383
Type of Change
Breaking Changes
No API or protobuf schema changes are included. Existing component output gains additive aggregate-health fields and columns.
Testing
Passed locally:
cargo test -p nico-admin-cli --quiet(366/366 pass)cargo clippy -p nico-admin-cli --all-targets --all-features -- -D warningscargo fmt --all -- --check--print-onlychecks withjqassertionshralias, generated help examples, and empty-template checksAdditional Notes
Switch and power-shelf health-report CRUD already existed; this change fills the rack gap and makes normal component output render aggregate health consistently.
The focused
carbide-api-corerack-health test cannot build on macOS/aarch64 becausetss-esapi-sysdoes not support that target. The live kind smoke exercised the real RPC and database path instead.The complete CLI-doc check currently exposes unrelated pre-existing generated-reference drift, including the rack state-history entry. This PR keeps only issue-scoped generated documentation changes.