docs: explain phoneHomeEnabled flag and the phone-home flow#3007
docs: explain phoneHomeEnabled flag and the phone-home flow#3007CTOmari360 wants to merge 1 commit into
Conversation
Summary by CodeRabbit
WalkthroughThree documentation files are updated to explain the Phone-home documentation expansion
Estimated code review effort🎯 1 (Trivial) | ⏱️ ~5 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
The phoneHomeEnabled flag was documented as a one-liner ("Enable
phone-home on first boot") with no explanation of what it does, what (if
anything) NICo injects, or how the phone-home callback works. Document
the mechanism, sourced from the code:
- It gates instance readiness: when enabled, the instance is held in a
provisioning state and not reported Ready until the booted OS contacts
NICo's metadata service (the carbide tenant-state gate in
crates/rpc/src/model/instance/status/tenant.rs; proto note on
InstanceOperatingSystemConfig.phone_home_enabled).
- NICo injects nothing: user-data is served verbatim
(crates/agent/src/instance_metadata_endpoint.rs). The OS must include a
cloud-init phone_home module posting to the link-local metadata
endpoint http://169.254.169.254:7777/latest/meta-data/phone_home, which
records the contact timestamp and releases the gate.
- Add usage guidance (delay readiness until cloud-init applies SSH
keys/passwords) and caveats (no module => never ready; custom-iPXE
reboot re-arms the gate).
Adds a Phone-home section to docs/configuration/tenant_management.md and
links the operating-system create/update CLI reference pages to it.
Fixes NVIDIA#2790
Signed-off-by: Omar Refai <omar@refai.org>
07feed4 to
6951cde
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/configuration/tenant_management.md`:
- Around line 594-617: Clarify the precedence of phoneHomeEnabled in the tenant
management docs: update the phone-home section to explicitly state that an
instance-level setting overrides the operating-system default, and that the
OS-level value is only inherited when the instance does not specify one. Make
this change in the phoneHomeEnabled description and keep the wording aligned
with the existing instance create/update and operating-system create/update
references.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 2677088f-c21c-4d49-86c0-a82267d33700
📒 Files selected for processing (3)
docs/configuration/tenant_management.mddocs/manuals/nico-admin-cli/commands/operating-system/operating-system-create.mddocs/manuals/nico-admin-cli/commands/operating-system/operating-system-update.md
✅ Files skipped from review due to trivial changes (1)
- docs/manuals/nico-admin-cli/commands/operating-system/operating-system-create.md
🚧 Files skipped from review as they are similar to previous changes (1)
- docs/manuals/nico-admin-cli/commands/operating-system/operating-system-update.md
| `phoneHomeEnabled` (`--phone-home-enabled`) controls whether NICo waits for the booted operating system to call back -- to "phone home" -- before it reports the instance as ready. It is a property of the operating-system definition (`operating-system create` / `operating-system update`) and can also be set per instance (`instance create` / `instance update`). | ||
|
|
||
| **What it does.** When phone-home is enabled, NICo does not consider the instance fully provisioned until the booted OS contacts NICo's metadata service from inside the guest. The instance is held in a provisioning state -- the transition to `Ready` is gated -- until that callback arrives. When it is disabled, NICo reports the instance ready as soon as provisioning and config sync finish, without waiting for any signal from the OS. | ||
|
|
||
| **What NICo injects: nothing.** Enabling the flag does *not* add anything to your `userData`. NICo serves your cloud-init user-data to the booting host verbatim. To make the host actually phone home, your user-data must include a cloud-init [`phone_home`](https://cloudinit.readthedocs.io/en/latest/reference/modules.html#phone-home) module that POSTs to NICo's link-local metadata endpoint: | ||
|
|
||
| ```yaml | ||
| #cloud-config | ||
| # ... your first-boot setup (SSH keys, passwords, packages, ...) ... | ||
| phone_home: | ||
| url: http://169.254.169.254:7777/latest/meta-data/phone_home | ||
| post: all | ||
| ``` | ||
|
|
||
| cloud-init runs the `phone_home` module in its final stage, after the rest of your configuration has been applied, so the callback fires only once your setup has completed. When NICo receives the POST it records the contact and releases the readiness gate, allowing the instance to become `Ready`. | ||
|
|
||
| **When to use it.** Enable phone-home when "ready" must mean "the guest has finished its own first-boot setup" -- for example, to delay readiness until cloud-init has applied SSH keys and passwords, so that automation watching for `Ready` does not connect before the host is fully configured. Leave it disabled when NICo finishing provisioning is a sufficient ready signal and you do not need a callback from inside the guest. | ||
|
|
||
| **Notes and caveats.** | ||
|
|
||
| - Phone-home only gates *status reporting*. It does not change how the host is provisioned or booted -- the reboot into the provisioned OS happens identically whether or not phone-home is enabled. | ||
| - If you enable phone-home but never include a `phone_home` module in your user-data, the instance stays in its provisioning state indefinitely and never reports ready. | ||
| - The metadata endpoint (`169.254.169.254:7777`) is reachable only from the provisioned host over its link-local metadata link; it is not a tenant-facing API. | ||
| - Rebooting the instance with a one-time custom iPXE override (`instance update --reboot-with-custom-ipxe=true`) re-arms the gate when phone-home is enabled: NICo clears the recorded contact, so the OS must phone home again before the instance is reported ready. |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
rg -n -C 3 --hidden -g '!**/target/**' 'phoneHomeEnabled|phone-home-enabled' .Repository: NVIDIA/infra-controller
Length of output: 50381
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Show the doc section around the reviewed lines.
sed -n '594,617p' docs/configuration/tenant_management.md
echo
echo '---'
echo
# Show the merge logic that resolves phoneHomeEnabled for instance creation/update.
sed -n '2068,2095p' rest-api/api/pkg/api/handler/instance.go
echo
echo '---'
echo
# Show the update-path precedence comment for phoneHomeEnabled.
sed -n '1228,1245p' rest-api/api/pkg/api/model/instance.goRepository: NVIDIA/infra-controller
Length of output: 4532
🏁 Script executed:
#!/bin/bash
set -euo pipefail
sed -n '594,617p' docs/configuration/tenant_management.md
echo '---'
sed -n '2068,2095p' rest-api/api/pkg/api/handler/instance.go
echo '---'
sed -n '1228,1245p' rest-api/api/pkg/api/model/instance.goRepository: NVIDIA/infra-controller
Length of output: 4528
State the precedence for phoneHomeEnabled. The text says the flag can be set both on the operating system and per instance, but it should explicitly say that the instance value overrides the OS default, and that the OS value is inherited only when the instance omits it.
🧰 Tools
🪛 LanguageTool
[style] ~614-~614: ‘whether or not’ might be wordy. Consider a shorter alternative.
Context: ... the provisioned OS happens identically whether or not phone-home is enabled. - If you enable ...
(EN_WORDINESS_PREMIUM_WHETHER_OR_NOT)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/configuration/tenant_management.md` around lines 594 - 617, Clarify the
precedence of phoneHomeEnabled in the tenant management docs: update the
phone-home section to explicitly state that an instance-level setting overrides
the operating-system default, and that the OS-level value is only inherited when
the instance does not specify one. Make this change in the phoneHomeEnabled
description and keep the wording aligned with the existing instance
create/update and operating-system create/update references.
Source: Path instructions
|
|
||
| **What it does.** When phone-home is enabled, NICo does not consider the instance fully provisioned until the booted OS contacts NICo's metadata service from inside the guest. The instance is held in a provisioning state -- the transition to `Ready` is gated -- until that callback arrives. When it is disabled, NICo reports the instance ready as soon as provisioning and config sync finish, without waiting for any signal from the OS. | ||
|
|
||
| **What NICo injects: nothing.** Enabling the flag does *not* add anything to your `userData`. NICo serves your cloud-init user-data to the booting host verbatim. To make the host actually phone home, your user-data must include a cloud-init [`phone_home`](https://cloudinit.readthedocs.io/en/latest/reference/modules.html#phone-home) module that POSTs to NICo's link-local metadata endpoint: |
There was a problem hiding this comment.
@thossain-nv I thought we were injecting phone_home into user data?
There was a problem hiding this comment.
We do - https://github.com/NVIDIA/infra-controller/blob/main/rest-api/api/internal/config/config.go#L100 The default value is https://github.com/NVIDIA/infra-controller/blob/main/rest-api/api/internal/config/config.go#L34
However Helm chart/docs etc are missing instructions on how to override it.
Documents the
phoneHomeEnabledflag, which was previously a one-liner ("Enable phone-home on first boot") across the operating-system and instance docs with no explanation of what it does, what NICo injects, or how the callback works.The explanation is traced from the code, not inferred:
Readyuntil the booted OS phones home. This is the carbide tenant-state gate incrates/rpc/src/model/instance/status/tenant.rs(phone_home_pending = phone_home_enrolled && phone_home_last_contact.is_none()), fed byinstance_config.os.phone_home_enabledincrates/rpc/src/model/instance/status.rs. The proto note onInstanceOperatingSystemConfig.phone_home_enabledsays the same: "the instance will not transition to a Ready state until InstancePhoneHomeLastContact is updated."USER_DATA_CATEGORY => metadata.user_data.clone()incrates/agent/src/instance_metadata_endpoint.rs); no code mutates user-data based on the flag. The OS must supply a cloud-initphone_homemodule that POSTs to the link-local metadata endpointhttp://169.254.169.254:7777/latest/meta-data/phone_home(the canonical example appears inrest-api/openapi/spec.yaml). Theupdate_phone_home_last_contacthandler (crates/api-core/src/handlers/instance.rs) authorizes the calling host/DPU and records the contact timestamp (UPDATE instances SET phone_home_last_contact=now()), which releases the gate.Readyuntil cloud-init finishes applying SSH keys/passwords), and documents the gotchas: enabling the flag without adding aphone_homemodule leaves the instance never-ready, and a one-time custom-iPXE reboot re-arms the gate (clear_phone_home_last_contactwhenboot_with_custom_ipxe && phone_home_enabled).Adds a Phone-home section to
docs/configuration/tenant_management.md(the page that already covers the instance lifecycle) and links theoperating-system create/operating-system updateCLI reference pages to it.Related issues
Fixes #2790
Type of Change
Breaking Changes
Testing
Docs-only change. The added relative link (
../../../../configuration/tenant_management.md#phone-home) resolves to the existing page and anchor.Additional Notes
Scoped to documentation only. The flag descriptions in the two CLI reference pages now point at one authoritative explanation rather than duplicating it.