Skip to content

feat: generates gcp/vm stubs#427

Open
sinorga wants to merge 3 commits into
NVIDIA:mainfrom
sinorga:agent/gcp-workers-286265d1-e5eb-4ffe-bf5f-c8e7a46ce147
Open

feat: generates gcp/vm stubs#427
sinorga wants to merge 3 commits into
NVIDIA:mainfrom
sinorga:agent/gcp-workers-286265d1-e5eb-4ffe-bf5f-c8e7a46ce147

Conversation

@sinorga

@sinorga sinorga commented May 18, 2026

Copy link
Copy Markdown
Contributor

factory: generates/updates gcp/vm stubs

Summary

Adds GCP VM provider stubs covering the full lifecycle suite (launch_instance / list_instances / verify_tags / serial_console / console_rbac / stop_instance / start_instance / reboot_instance / describe_instance / deploy_nim) plus teardown. AWS oracle contract shape preserved across all steps; GCE-specific divergences encoded per step where the API surface differs from EC2 (zone capacity walk, async-operation cleanup tracker semantics, verified-reuse instance_created ownership flag, IAM token-creator propagation budget, instance metadata fields for serial-console RBAC probes).

Also adds operator setup documentation at docs/references/gcp.md (linked from docs/getting-started.md) covering authentication (ADC / service-account key), project-ID resolution order, required IAM roles, L4 GPU quota and zone list, NIM-step NGC_API_KEY handling, GPU-image default + how to override for Docker-based tests, and org-policy considerations operators commonly need to check before a clean run. GCP gets its own reference page rather than mixing into references/aws.md since AWS is treated as the canonical oracle example; future target NCPs follow the same per-NCP pattern.

The default --image-project / --image-family are now the public GCP Deep Learning VM Image (deeplearning-platform-release/common-cu129-ubuntu-2204-nvidia-580) so a clean checkout can run the VM suite without any private-image entitlement. The DLVM ships with the NVIDIA driver + CUDA toolkit; operators who need Docker (for the deploy_nim step) override the default by exporting GCP_VM_IMAGE / GCP_VM_IMAGE_PROJECT in their shell / .env (the provider config reads both via Jinja, identical pattern to the existing GCP_VM_SKIP_TEARDOWN precedent), or skip NIM by leaving NGC_API_KEY unset (the shared script short-circuits cleanly). --set image=... --set image_project=... is also accepted for one-off ad-hoc invocations.

Verification

  • Static checks: PASS (46/46).
  • Independent post-generation review on branch HEAD: PASS p1=0 p2=0.
  • GCP orphans after teardown: 0 on every run.
  • Live execution end-to-end — verified under both supported operator configurations:
Run Environment Result Duration
Pass 1 — canonical / custom-image NGC_API_KEY set + GCP_VM_IMAGE + GCP_VM_IMAGE_PROJECT set to a Docker-equipped image 3 consecutive full-domain PASS — 9 test steps + NIM deploy / teardown + cleanup ~1650s per run
Pass 1 — public-default All three env vars unset → suite falls through to deeplearning-platform-release/common-cu129-ubuntu-2204-nvidia-580 1 FAIL: [container_runtime] ContainerRuntimeCheck (Docker not available — documented limitation, see "Known gaps" below). Everything else PASS: all 9 lifecycle steps, NIM steps cleanly runtime_skip, both deploy_nim and teardown_nim policy-skip honored end-to-end 1113.5s
Pass 2 — post-upstream-refresh (custom-image) Same canonical operator env as Pass 1, exercising the squash 4453f677 content (new virtual_device_hardening step + IdentityAgent=none flag + ssh_run helper) Domain-gate live PASS — all 11 vm test steps (including the new virtual_device_hardening) + NIM deploy / teardown + cleanup. 0 orphans. Phase-2 harness review: PASS p1=0 p2=0 ~330s (domain-gate scoped — Pass 1's ~1650s figure included the full step-by-step factory loop; Pass 2's gate is the same suite content)

Known gaps

  • ContainerRuntimeCheck fails on the public DLVM default image. host_os.ContainerRuntimeCheck asserts Docker is installed and runnable on the launched VM; the GCP Deep Learning VM image ships the NVIDIA driver + CUDA but not Docker, so the check fails with Docker not available and the run reports [FAIL] TEST regardless of whether NGC_API_KEY is set. This is intentional and documented at docs/references/gcp.md §5 — operators get a clean PASS by overriding the image via GCP_VM_IMAGE / GCP_VM_IMAGE_PROJECT to a Docker-equipped image. The public default is published as "boot + lifecycle works without any private entitlement", not as "every validator passes".

Provenance

The branch's content was produced in two passes, both validated end-to-end:

Pass 1 — Initial generation + post-review hand-fixes (rebased onto upstream main tip 871eea8c). Auto-generated factory output for the full vm domain, followed by 4 post-review hand-fixes and 3 pre-PR public-shipping audit fixes:

Commit Files Why
577b227c isvctl/configs/providers/gcp/scripts/vm/console_rbac.py Capture _revoke_token_creator return into cleanup_errors on failure (mirrors sibling _delete_service_account pattern; rule #9 honest-reporting compliance). Add retry envelope to _modify_iam_policy covering 5xx / 429 / 409 stale-etag plus transient network failures.
70a7d11a isvctl/configs/providers/gcp/scripts/vm/console_rbac.py Broaden the retry envelope's exception catch arms to include TimeoutError and OSError so a mid-read socket timeout (which Python ≥3.10 does not always wrap in urllib.error.URLError) routes through retry instead of escaping the helper.
f8c48b1a isvctl/configs/providers/gcp/scripts/common/compute.py Zone-walk contract compliance: select_zones walks the intersection of PREFERRED_ZONES and the region's live zones in preferred-list order when the intersection is non-empty (no other_in_region promotion). Structured error handling in _list_region_zones: raise on NotFound/PermissionDenied/InvalidArgument/Unauthenticated so an invalid region request never silently substitutes a different region's zones; retry transients (ServiceUnavailable/DeadlineExceeded/Aborted/ResourceExhausted/InternalServerError) with linear backoff. Remove us-east1-b from PREFERRED_ZONES (no L4 capacity in that zone per published GCP GPU regions).
5ec91a8f isvctl/configs/providers/gcp/config/vm.yaml Thread --network from launch_instance.vpc_id to the console_rbac probe so probe VMs validate console-access restrictions against the same network as the launched instance. Falls back to the suite-level network arg when vpc_id isn't emitted.
59bb99a6 docs/references/gcp.md, docs/getting-started.md Add the per-NCP GCP operator setup reference page covering ADC / service-account auth, project-ID resolution, IAM roles, L4 zones, NIM key handling, GPU-image override path, and org-policy considerations. Linked from docs/getting-started.md.
c69445b2 13 files: isvctl/configs/providers/gcp/** + docs/references/gcp.md Pre-PR public-shipping audit. Two changes: (a) Scrub generated comments and docstrings that referenced documentation paths from the generator's private rule layer — those paths don't exist in this repo, and the rule statements were inlined directly so the maintainer reading the public PR can act on them without consulting a missing file. Comment-only change for this part. (b) Switch the default GPU image from a private custom image to the public GCP Deep Learning VM Image (deeplearning-platform-release/common-cu129-ubuntu-2204-nvidia-580); wire image_project through as a per-run override setting; document the Docker-not-preinstalled implication for the deploy_nim step in the operator setup page. The DLVM is the closest portable equivalent to AWS Deep Learning AMIs and lets a clean upstream checkout run the VM suite without any private-image entitlement.
225ae780 isvctl/configs/providers/gcp/config/vm.yaml, docs/references/gcp.md Promote sticky env-var overrides over per-run --set. vm.yaml settings now Jinja-read GCP_VM_IMAGE / GCP_VM_IMAGE_PROJECT with the "none" sentinel as fallback (the stub then uses its public DLVM default), mirroring the existing GCP_VM_SKIP_TEARDOWN precedent in this same file. Operators with a Docker-equipped custom image export the env vars once in their shell / .env; every subsequent run reuses the same pin without touching CLI flags. --set image=... --set image_project=... stays accepted for one-off invocations. Operator setup page (docs/references/gcp.md §5) updated to lead with the env-var path.
9fe7a276 4 files: gcp/scripts/vm/launch_instance.py, gcp/scripts/vm/console_rbac.py, gcp/scripts/common/compute.py, shared/teardown_nim.py Three coordinated fixes from a final-review pass: (a) Scrub remaining generator-pipeline vocabulary that survived the first scrub — the word "factory" / specialist-review markers / private-rule citations were rephrased to portable terms (suite runs, default project, vendor default). (b) Warm-reuse readiness now uses consecutive-success SSH stability (wait_for_ssh_stable) regardless of whether the VM was just started or adopted-running — matches the create/start paths and prevents transient sshd flakes from passing the gate on a single connection. (c) Shared teardown_nim.py adds a symmetric --ngc-api-key policy-skip that mirrors deploy_nim.py's shape — when the NGC entitlement is absent, deploy is a no-op and teardown is now a matching no-op, so the documented "leave NGC_API_KEY unset to opt out of NIM coverage" path stays green end-to-end (matters on any image that doesn't ship Docker, including the new public DLVM default).
9b79e340 docs/references/gcp.md Tighten docs/references/gcp.md §5 to name ContainerRuntimeCheck explicitly — clarify that operators who opt out of NIM (leaving NGC_API_KEY unset) and run on the public DLVM default still see [FAIL] TEST because the host-OS ContainerRuntimeCheck validator asserts Docker presence unconditionally. Only operators who override the image to a Docker-equipped one get a fully green run. Documentation-only edit; no stub change.

Pass 2 — Upstream-tracking refresh (after upstream main advanced past Pass 1's rebase base):

Commit Files Why
4453f677 4 files: gcp/config/vm.yaml, gcp/scripts/common/ssh_utils.py, gcp/scripts/vm/virtual_device_hardening.py (NEW), .factory_version Squash of an in-place factory regen against upstream main tip 9b79e340. Mirrors two upstream contract changes that landed after Pass 1: (1) PR #413 (2cdf4d76, "feat: add virtual device hardening validation") — adds the new virtual_device_hardening test step to the vm suite; this commit adds the GCP-side mirror: 302-line virtual_device_hardening.py stub (provider-evidence preamble for Compute Engine + guest-side SSH probes for USB / clipboard / unnecessary virtual devices using the canonical PROBE_SENTINEL framing, identical to the AWS oracle's pattern lists), ssh_run helper in gcp/scripts/common/ssh_utils.py (sister to AWS's helper, same (exit_code, stdout, stderr) return contract with sentinel exit codes for TimeoutExpired / OSError), and the matching step entry in gcp/config/vm.yaml. (2) PR #424 (6e1458a7, "fix: handle AWS VM validation regressions") — adds IdentityAgent=none alongside IdentitiesOnly=yes so explicit -i <key_file> is the only SSH credential considered on operator hosts running an ssh-agent with multiple identities; mirrored to the canonical _SSH_OPTS tuple in GCP's ssh_utils.py so every GCP SSH helper benefits uniformly. .factory_version bumped accordingly (isv_commit=9b79e340, factory_commit=21869ef).

The Pass-1 fix narrative: the pre-fix _revoke_token_creator cleanup-error miss was the first post-generation review's only ship-blocker; the next three commits address shapes surfaced by post-fix re-review iterations; the next four commits are a pre-PR public-shipping audit pass (operator-setup docs landing, private rule-layer references in comments scrubbed, private image default replaced by the public DLVM, env-var override architecture, then final-review-driven scrub of remaining generator vocabulary + warm-reuse stability gate + symmetric NIM teardown policy-skip, then the ContainerRuntimeCheck documentation tightening), all caught before opening the PR. Pass 2 keeps the branch aligned with subsequent upstream contract changes via the canonical /ncp-update workflow.

Summary by CodeRabbit

  • New Features

    • Full GCP VM validation workflow: launch/start/stop/reboot, listing/describe/tags, serial-console probes, RBAC checks, virtual-device hardening, and verified-reuse teardown.
    • Improved VM SSH/connection utilities and provider-aware error handling for more reliable operations.
    • Conditional NIM teardown skip when the API key is not provided.
  • Documentation

    • New comprehensive GCP target guide and added link from Next Steps.
  • Chores

    • Added GCP Compute client dependency and updated version metadata.

Review Change Stack

@sinorga sinorga requested a review from a team as a code owner May 18, 2026 17:39
@copy-pr-bot

copy-pr-bot Bot commented May 18, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented May 18, 2026

Copy link
Copy Markdown
📝 Walkthrough

Walkthrough

Adds a complete GCP Compute Engine provider: dependency and error helpers, shared compute utilities, SSH/cloud-init helpers, full VM lifecycle scripts (launch/list/describe/start/stop/reboot/teardown), validation probes (console RBAC, virtual-device hardening), isvctl VM workflow config, docs, and factory metadata.

Changes

GCP Provider Implementation

Layer / File(s) Summary
Foundation: Dependencies and Error Handling
isvctl/pyproject.toml, isvctl/configs/providers/gcp/scripts/common/__init__.py, isvctl/configs/providers/gcp/scripts/common/errors.py
Adds google-cloud-compute>=1.30.0,<2.0.0; introduces GCP exception classification, non-raising delete-with-retry, and a script-level error decorator.
Shared Compute Utilities
isvctl/configs/providers/gcp/scripts/common/compute.py
Project/zone resolution, preferred-zone selection with live discovery/fallback, canonical instance-state mapping, zonal/global op waiters and retry wrapper, instance polling/IP helpers, image resolution, label/tag projection, local SSH keypair management, and SSH firewall insertion.
SSH Utilities
isvctl/configs/providers/gcp/scripts/common/ssh_utils.py
Canonical SSH option set, remote command execution with timeout/OSError mapping, single-probe helper, wait_for_ssh / wait_for_ssh_drop / wait_for_ssh_stable, cloud-init polling with transport vs semantic outcome, uptime probe, and shell quoting helper.
VM Workflow Configuration
isvctl/configs/providers/gcp/config/vm.yaml
Defines the isvctl GCP VM pipeline (setup → test → teardown), maps steps to GCP scripts, sets timeouts and sentinel/default handling, and overrides cloud-init validator metadata headers.
VM Query Scripts
isvctl/configs/providers/gcp/scripts/vm/list_instances.py, describe_instance.py, describe_tags.py, serial_console.py
CLI scripts to list instances by VPC, describe instance state/networking, emit canonical tags from labels, and read serial-console output; structured JSON outputs and GCP error handling.
Launch Instance with Verified-Reuse
isvctl/configs/providers/gcp/scripts/vm/launch_instance.py
Builds Instance proto, resolves boot image (literal/family/project fallback), generates run-id-suffixed SSH keypair, provisions/adopts project-global SSH firewall, attempts zonal inserts with stockout cleanup, polls to running, performs readiness gate (SSH/cloud-init), supports verified-reuse adoption, and gated cleanup-on-failure.
State Transitions
isvctl/configs/providers/gcp/scripts/vm/start_instance.py, stop_instance.py, reboot_instance.py
Start with post-start stability gates (SSH + cloud-init); stop with idempotent pre-check and optional cloud-init gate; reboot with pre-reset uptime sampling, reset operation with retry, post-reset IP re-read and SSH stability, and reboot confirmation via boot timestamp or uptime delta.
Validation Probes
isvctl/configs/providers/gcp/scripts/vm/console_rbac.py, virtual_device_hardening.py
Serial-console RBAC probe supporting self-provisioned or pre-provisioned modes with IAM mutations, token minting, and HTTP serial-port checks; virtual-device hardening runs combined guest probes over SSH and applies signal-based test failures.
Cleanup and NIM Integration
isvctl/configs/providers/gcp/scripts/vm/teardown.py, isvctl/configs/providers/shared/teardown_nim.py
Teardown mirrors AWS shape with verified-reuse ownership gating for instance/firewall/key deletion, leaked-zone cleanup, idempotent NotFound handling; NIM teardown short-circuits when NGC API key is absent or sentinel inputs used.
Documentation and Metadata
.factory_version, docs/getting-started.md, docs/references/gcp.md
Factory version pins GCP VM commit ids; getting-started links to new GCP reference; new operator guide documents prerequisites, authentication, NIM behavior, execution, preferred zones, and org-policy notes.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

Suggested reviewers

  • abegnoche
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage ✅ Passed Docstring coverage is 81.91% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Title check ✅ Passed The title 'feat: generates gcp/vm stubs' directly describes the primary change: adding GCP Compute Engine VM provider stubs implementing a complete VM lifecycle suite. It is concise, specific, and accurately reflects the main objective of the pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 19

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/references/gcp.md`:
- Around line 116-121: The fenced code block containing the region list (the
block starting with ``` and lines like "us-central1-a / -b / -c       
us-east4-a / -b / -c       us-east1-c / -d") lacks a language specifier and
triggers MD040; fix it by changing the opening fence from ``` to ```text so the
block is explicitly marked as plain text (ensure the closing fence remains ```).

In `@isvctl/configs/providers/gcp/config/vm.yaml`:
- Around line 337-338: The template currently uses image and image_project with
default('none') which doesn't guard against empty-string env vars; update the
templating for the image and image_project keys so they treat empty strings as
missing and fallback to 'none' (e.g., trim and re-default or use a
conditional/ternary on env.GCP_VM_IMAGE and env.GCP_VM_IMAGE_PROJECT) so
launcher argv pairs can't collapse when env vars are present but empty.

In `@isvctl/configs/providers/gcp/scripts/common/compute.py`:
- Around line 603-607: Add PEP 257 docstrings to the missing functions: describe
what each function does, its parameters and return types and any side effects
for first_internal_ip and _firewall_has_isv_ownership; for example, in
first_internal_ip (function taking instance: compute_v1.Instance) document that
it returns the first internal IP or None and explain the network_interfaces
dependency, and in _firewall_has_isv_ownership document what firewall object is
expected, what ownership check it performs and its boolean return; keep the
docstrings concise, use triple-quoted strings immediately under the def, and
follow the module's existing docstring style.

In `@isvctl/configs/providers/gcp/scripts/common/errors.py`:
- Around line 41-57: classify_gcp_error currently never returns the documented
"credentials_missing" category; add an explicit branch in the function to detect
missing-default-credentials errors (e.g., check isinstance(e,
google.auth.exceptions.DefaultCredentialsError) and/or
gax.DefaultCredentialsError if available) and return ("credentials_missing",
f"GCP credentials missing or not configured: {e}"); place this check before the
gax.Unauthenticated branch so credential-setup failures map to
"credentials_missing" rather than "unknown_error".

In `@isvctl/configs/providers/gcp/scripts/common/ssh_utils.py`:
- Around line 47-48: Add a PEP 257 docstring to the helper function
_ssh_argv(host: str, user: str, key_file: str, remote_cmd: str) -> list[str];
describe the function’s purpose (building the ssh command argument list), list
and document parameters (host, user, key_file, remote_cmd) and note that it uses
the module-level _SSH_OPTS and returns a list of command argument strings
suitable for subprocess calls; keep it brief, one- or two-sentence summary plus
short param descriptions and return description.

In `@isvctl/configs/providers/gcp/scripts/vm/console_rbac.py`:
- Around line 974-983: The skip branch currently sets result["mode"]="skipped"
but exits with return 1 which signals failure; update the branch handling in the
block that checks os.environ.get(_SELF_PROVISION_ENABLED_ENV, "1") so that after
setting result["mode"], result["skipped"], result["skip_reason"], and
result["error"] it prints the JSON and returns a success exit (e.g., return 0 or
no non-zero exit) instead of return 1 so orchestrators treat the probe as
intentionally skipped rather than failed.
- Around line 917-994: Add the DEMO_MODE short-circuit at the top of main():
define DEMO_MODE = os.environ.get("ISVCTL_DEMO_MODE") == "1" (or reuse existing
constant if present) and if True set result["mode"] = "demo", result["success"]
= True, populate any minimal fields (e.g. tests.*.passed as appropriate) and
print json.dumps(result) then return 0 — do this before any calls to
_preprovisioned_probe or _self_provisioned_probe so no real IAM/Compute actions
are executed.
- Around line 730-760: The JSON written to stdout in the result dict currently
includes provider-specific identifiers and verbose fields (e.g., keys under
result["tests"] containing "principal", "evidence", project/zone/instance names
and caller identity created around calls to _probe_serial_console and variables
like denied_email, allowed_email, probe_vm_name); change this so stdout only
emits the minimal, provider-neutral schema (success, platform, test_name,
tests.<check>.passed/message/probes and optional concise error/skip_reason) and
move all audit/debug details (caller identity, project/zone/instance IDs, full
evidence strings, mode, tokens) to stderr or a separate debug log; update
construction of entries such as "denied_principal_cannot_access_console",
"allowed_principal_can_access_console", and
"allowed_principal_is_resource_scoped" to set only passed and a short
message/probes on stdout while emitting the full evidence and identifiers to
stderr for each probe from _probe_serial_console.

In `@isvctl/configs/providers/gcp/scripts/vm/describe_instance.py`:
- Around line 21-54: Add the standard demo-mode short‑circuit at the top of
main(): define DEMO_MODE = os.environ.get("ISVCTL_DEMO_MODE") == "1" and if true
return a dummy success (exit code 0 / immediate return from main) before calling
resolve_project, narrow_region_to_zone or any live API helpers like
get_instance; place this check at the start of main() so demo runs do not
resolve project/zone or hit Google APIs.

In `@isvctl/configs/providers/gcp/scripts/vm/describe_tags.py`:
- Around line 19-47: Add a DEMO_MODE guard at the top of the script and
short-circuit live cloud reads in main(): define DEMO_MODE =
os.environ.get("ISVCTL_DEMO_MODE") == "1" (import os if missing) and, inside
main() before calling get_instance/resolve_project/narrow_region_to_zone, check
if DEMO_MODE is true and return a dummy success (e.g., print/JSON of empty/dummy
tags and exit 0) instead of performing real reads; ensure the early-return path
mirrors existing success behavior so callers/tests see the expected output when
ISVCTL_DEMO_MODE=1.

In `@isvctl/configs/providers/gcp/scripts/vm/launch_instance.py`:
- Around line 37-45: Add a demo-mode safety gate by defining DEMO_MODE =
os.environ.get("ISVCTL_DEMO_MODE") == "1" near the top of launch_instance.py and
short-circuit any real mutating operations (e.g., in functions that perform
instance creation/deletion or the main entrypoint such as create_instance,
delete_instance or main) to return a dummy success when DEMO_MODE is True;
update log messages to clearly state when actions are skipped due to demo mode
and ensure code paths that would call GCP APIs are not executed under DEMO_MODE.

In `@isvctl/configs/providers/gcp/scripts/vm/list_instances.py`:
- Around line 26-56: Add a DEMO_MODE gate by defining DEMO_MODE =
os.environ.get("ISVCTL_DEMO_MODE") == "1" near the top (import os if missing)
and update main() to short-circuit when DEMO_MODE is true: do not call any
google.cloud or other live GCP functions (e.g., avoid calls that use compute_v1
or helper functions like resolve_project/first_external_ip), instead print a
safe dummy JSON payload or a success message and return 0. Ensure the
early-return happens before any live lookups or client instantiation so the
script can be safely run in demo mode.

In `@isvctl/configs/providers/gcp/scripts/vm/reboot_instance.py`:
- Around line 29-37: Add a DEMO_MODE gate that reads ISVCTL_DEMO_MODE and
short-circuits any real API/SSH actions in reboot_instance.py: define DEMO_MODE
= os.environ.get("ISVCTL_DEMO_MODE") == "1" near the imports, and in the
entrypoint function(s) that perform live orchestration (e.g., main(),
reboot_instance(), or any functions that call the GCP API or open SSH
connections) check DEMO_MODE and return a dummy success response/log without
performing network calls; ensure the short-circuit happens before any real
API/SSH orchestration so demo runs never reach live operations.

In `@isvctl/configs/providers/gcp/scripts/vm/serial_console.py`:
- Around line 32-66: Add a demo-mode guard to the module and short-circuit
main() when DEMO_MODE is enabled: import os and set DEMO_MODE =
os.environ.get("ISVCTL_DEMO_MODE") == "1" at module scope, then in main() before
any real GCP API calls (i.e., before using compute_v1 or performing permission
probes after resolve_project/narrow_region_to_zone) check if DEMO_MODE is True
and if so emit the expected dummy success (e.g., print a minimal JSON success or
appropriate message) and return 0 so the script does not call live
serial-console APIs.

In `@isvctl/configs/providers/gcp/scripts/vm/start_instance.py`:
- Around line 22-29: Add a DEMO_MODE gate by defining DEMO_MODE =
os.environ.get("ISVCTL_DEMO_MODE") == "1" (import os) at module top and
short-circuit real GCP calls: if DEMO_MODE, return/print a dummy success
response (matching existing output format, exit code 0) and skip any live calls
in the script's entrypoint (the main() or start_instance() function /
module-level run logic) so that when DEMO_MODE is true no actual GCP start
operations are invoked.

In `@isvctl/configs/providers/gcp/scripts/vm/stop_instance.py`:
- Around line 18-25: Add a DEMO_MODE guard so the script returns a dummy success
instead of calling live APIs: import os, define DEMO_MODE =
os.environ.get("ISVCTL_DEMO_MODE") == "1" near the top, then at the start of
main() check DEMO_MODE and if true emit the same success output the script would
normally return (e.g., a JSON success message or exit code 0) and return early
without invoking any live API logic; modify the main() function to short-circuit
when DEMO_MODE is true and reference DEMO_MODE and main() in the change so the
guard is unambiguous.
- Around line 75-78: The branch checking cstate == "stopped" incorrectly sets
result["stop_initiated"] = True even though no stop request was made; update
that branch in the function that handles instance stopping (look for the cstate
check and the result dict assignment) to set result["stop_initiated"] = False
(leave result["success"] and result["state"] as-is), and ensure the code path
that actually calls the stop API continues to set result["stop_initiated"] =
True where a stop request is initiated.

In `@isvctl/configs/providers/gcp/scripts/vm/teardown.py`:
- Around line 105-187: The script lacks the required DEMO_MODE short-circuit:
define DEMO_MODE = os.environ.get("ISVCTL_DEMO_MODE") == "1" (top-level) and,
inside main() just after the existing --skip-destroy handling (before calling
resolve_project/narrow_region_to_zone or any GCP/auth operations), check if
DEMO_MODE is truthy and if so populate result with success True,
instance_id=args.instance_id, resources_destroyed=False, a clear demo message,
print json.dumps(result, indent=2, default=str) and return 0 so no cloud/auth
calls (resolve_project, narrow_region_to_zone, any deletes) are executed. Ensure
you reference the main() function and the result dict keys
("success","instance_id","message") to match existing payload shape.

In `@isvctl/configs/providers/gcp/scripts/vm/virtual_device_hardening.py`:
- Around line 257-298: Define DEMO_MODE = os.environ.get("ISVCTL_DEMO_MODE") ==
"1" and short-circuit main() when it's true so the script never attempts live
SSH probes: add the DEMO_MODE variable near the top and, at the start of main(),
if DEMO_MODE build the same result structure using tests = _base_tests() but
mark the REQUIRED_TESTS entries as passed (or otherwise set success=True), print
the JSON result and return 0 without calling _collect_guest_probe or
_apply_guest_probe; reference main(), DEMO_MODE, _base_tests(), REQUIRED_TESTS,
_collect_guest_probe and _apply_guest_probe to locate and implement the change.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ba345cd0-264b-4421-a120-4cac1114ae8f

📥 Commits

Reviewing files that changed from the base of the PR and between 871eea8 and 4453f67.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (21)
  • .factory_version
  • docs/getting-started.md
  • docs/references/gcp.md
  • isvctl/configs/providers/gcp/config/vm.yaml
  • isvctl/configs/providers/gcp/scripts/common/__init__.py
  • isvctl/configs/providers/gcp/scripts/common/compute.py
  • isvctl/configs/providers/gcp/scripts/common/errors.py
  • isvctl/configs/providers/gcp/scripts/common/ssh_utils.py
  • isvctl/configs/providers/gcp/scripts/vm/console_rbac.py
  • isvctl/configs/providers/gcp/scripts/vm/describe_instance.py
  • isvctl/configs/providers/gcp/scripts/vm/describe_tags.py
  • isvctl/configs/providers/gcp/scripts/vm/launch_instance.py
  • isvctl/configs/providers/gcp/scripts/vm/list_instances.py
  • isvctl/configs/providers/gcp/scripts/vm/reboot_instance.py
  • isvctl/configs/providers/gcp/scripts/vm/serial_console.py
  • isvctl/configs/providers/gcp/scripts/vm/start_instance.py
  • isvctl/configs/providers/gcp/scripts/vm/stop_instance.py
  • isvctl/configs/providers/gcp/scripts/vm/teardown.py
  • isvctl/configs/providers/gcp/scripts/vm/virtual_device_hardening.py
  • isvctl/configs/providers/shared/teardown_nim.py
  • isvctl/pyproject.toml

Comment thread docs/references/gcp.md Outdated
Comment thread isvctl/configs/providers/gcp/config/vm.yaml Outdated
Comment thread isvctl/configs/providers/gcp/scripts/common/compute.py
Comment thread isvctl/configs/providers/gcp/scripts/common/errors.py
Comment thread isvctl/configs/providers/gcp/scripts/common/ssh_utils.py
Comment on lines +22 to +29
import argparse
import json
import sys
from pathlib import Path
from typing import Any

sys.path.insert(0, str(Path(__file__).resolve().parents[1])) # providers/gcp/scripts/

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Missing ISVCTL_DEMO_MODE gate before issuing real start operations.

Please add the standard demo short-circuit path; this file currently always executes live GCP operations.

As per coding guidelines: Provider scripts must include DEMO_MODE = os.environ.get("ISVCTL_DEMO_MODE") == "1" gate: real runs perform actual operations, demo mode returns dummy success.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@isvctl/configs/providers/gcp/scripts/vm/start_instance.py` around lines 22 -
29, Add a DEMO_MODE gate by defining DEMO_MODE =
os.environ.get("ISVCTL_DEMO_MODE") == "1" (import os) at module top and
short-circuit real GCP calls: if DEMO_MODE, return/print a dummy success
response (matching existing output format, exit code 0) and skip any live calls
in the script's entrypoint (the main() or start_instance() function /
module-level run logic) so that when DEMO_MODE is true no actual GCP start
operations are invoked.

Comment on lines +18 to +25
import argparse
import json
import sys
from pathlib import Path
from typing import Any

sys.path.insert(0, str(Path(__file__).resolve().parents[1])) # providers/gcp/scripts/

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add the required demo-mode guard in this provider script.

main() always reaches live API logic. Add the standard DEMO_MODE early-return branch.

As per coding guidelines: Provider scripts must include DEMO_MODE = os.environ.get("ISVCTL_DEMO_MODE") == "1" gate: real runs perform actual operations, demo mode returns dummy success.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@isvctl/configs/providers/gcp/scripts/vm/stop_instance.py` around lines 18 -
25, Add a DEMO_MODE guard so the script returns a dummy success instead of
calling live APIs: import os, define DEMO_MODE =
os.environ.get("ISVCTL_DEMO_MODE") == "1" near the top, then at the start of
main() check DEMO_MODE and if true emit the same success output the script would
normally return (e.g., a JSON success message or exit code 0) and return early
without invoking any live API logic; modify the main() function to short-circuit
when DEMO_MODE is true and reference DEMO_MODE and main() in the change so the
guard is unambiguous.

Comment thread isvctl/configs/providers/gcp/scripts/vm/stop_instance.py
Comment on lines +105 to +187
@handle_gcp_errors
def main() -> int:
parser = argparse.ArgumentParser(description="Teardown a Compute Engine VM + companions")
parser.add_argument("--instance-id", required=True, help="Instance name")
parser.add_argument("--region", required=True, help="GCP region or zone")
parser.add_argument("--zone", default=None, help="GCP zone (overrides region)")
parser.add_argument("--project", default=None, help="GCP project ID (ADC fallback)")
parser.add_argument(
"--delete-key-pair",
action="store_true",
help="Delete the local SSH key pair if --key-created is truthy",
)
parser.add_argument(
"--delete-security-group",
action="store_true",
help="Delete the SSH firewall rule if --firewall-created is truthy",
)
parser.add_argument(
"--skip-destroy",
action="store_true",
help="Short-circuit to success (preserve cloud state) BEFORE resolving auth",
)
parser.add_argument("--firewall-name", default="none", help="Firewall rule name")
parser.add_argument(
"--firewall-created",
default="false",
help="Bool sentinel forwarded from launch_instance.firewall_created",
)
parser.add_argument(
"--instance-created",
default="false",
help=(
"Bool sentinel forwarded from launch_instance.instance_created. "
"False skips both the primary and every leaked-zone instance "
"delete so a verified-reuse adoption of an operator-supplied "
"long-lived VM is never destroyed by this teardown."
),
)
parser.add_argument(
"--key-file",
default="none",
help="Local SSH PEM path forwarded from launch_instance.key_file",
)
parser.add_argument(
"--key-created",
default="false",
help="Bool sentinel forwarded from launch_instance.key_created",
)
parser.add_argument(
"--leaked-zones",
default="",
help=(
"Comma-separated zones the multi-zone walker accumulated "
"partial-create leaks in. Teardown best-effort-deletes the "
"instance in each before completing."
),
)
args = parser.parse_args()

result: dict[str, Any] = {
"success": False,
"platform": "vm",
"resources_destroyed": False,
"deleted": {
"instances": [],
"firewall_rules": [],
"key_files": [],
},
"resources_deleted": [], # flat list shape matching AWS oracle / my-isv
"message": "",
}

# Preservation-mode flag short-circuits BEFORE any cloud / auth call
# so an expired-credentials environment still no-ops cleanly.
if args.skip_destroy:
result["success"] = True
result["instance_id"] = args.instance_id
result["message"] = f"Instance {args.instance_id} preserved (--skip-destroy); delete manually when done."
print(json.dumps(result, indent=2, default=str))
return 0

project = resolve_project(args.project)
zone = args.zone or narrow_region_to_zone(args.region)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add the required demo-mode short-circuit before teardown operations.

This provider script always performs real teardown. In demo runs, it should return a dummy success payload and avoid cloud/auth calls.

Suggested patch
@@
 _TEARDOWN_INSTANCE_WAIT_S = 180
 _TEARDOWN_FIREWALL_WAIT_S = 300
+DEMO_MODE = os.environ.get("ISVCTL_DEMO_MODE") == "1"
@@
 def main() -> int:
@@
     args = parser.parse_args()
@@
     result: dict[str, Any] = {
         "success": False,
         "platform": "vm",
@@
         "message": "",
     }
+
+    if DEMO_MODE:
+        result["success"] = True
+        result["resources_destroyed"] = True
+        result["message"] = "Demo mode: teardown skipped"
+        print(json.dumps(result, indent=2, default=str))
+        return 0

As per coding guidelines: Provider scripts must include DEMO_MODE = os.environ.get("ISVCTL_DEMO_MODE") == "1" gate: real runs perform actual operations, demo mode returns dummy success.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@isvctl/configs/providers/gcp/scripts/vm/teardown.py` around lines 105 - 187,
The script lacks the required DEMO_MODE short-circuit: define DEMO_MODE =
os.environ.get("ISVCTL_DEMO_MODE") == "1" (top-level) and, inside main() just
after the existing --skip-destroy handling (before calling
resolve_project/narrow_region_to_zone or any GCP/auth operations), check if
DEMO_MODE is truthy and if so populate result with success True,
instance_id=args.instance_id, resources_destroyed=False, a clear demo message,
print json.dumps(result, indent=2, default=str) and return 0 so no cloud/auth
calls (resolve_project, narrow_region_to_zone, any deletes) are executed. Ensure
you reference the main() function and the result dict keys
("success","instance_id","message") to match existing payload shape.

Comment on lines +257 to +298
@handle_gcp_errors
def main() -> int:
"""Validate Compute Engine virtual-device hardening and emit structured JSON."""
parser = argparse.ArgumentParser(description="Validate Compute Engine VM virtual-device hardening")
parser.add_argument("--instance-id", required=True, help="Compute Engine instance name")
parser.add_argument("--region", default="", help="(unused) forwarded by orchestrator")
parser.add_argument("--zone", default="", help="(unused) forwarded by orchestrator")
parser.add_argument("--public-ip", default="", help="Optional SSH host for guest probes")
parser.add_argument("--key-file", default="", help="Optional SSH private key path for guest probes")
parser.add_argument("--ssh-user", default="ubuntu", help="SSH username")
parser.add_argument(
"--ssh-timeout",
type=int,
default=60,
help="Total seconds for the combined guest probe SSH command",
)
args = parser.parse_args()

tests = _base_tests()
guest_probe_error: str | None = None
try:
guest_probe = _collect_guest_probe(args.public_ip, args.ssh_user, args.key_file, args.ssh_timeout)
except ValueError as e:
guest_probe_error = _compact(str(e))
guest_probe = {"status": "unavailable"}
if guest_probe_error is None and guest_probe.get("status") == "unavailable":
guest_probe_error = _compact(str(guest_probe.get("error") or "guest probe unavailable"))

_apply_guest_probe(tests, guest_probe)

success = guest_probe_error is None and all(tests[name].get("passed") is True for name in REQUIRED_TESTS)
result: dict[str, Any] = {
"success": success,
"platform": "vm",
"test_name": "virtual_device_hardening",
"instance_id": args.instance_id,
"tests": tests,
}
if guest_probe_error is not None:
result["error"] = guest_probe_error
print(json.dumps(result, indent=2))
return 0 if success else 1

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add the required demo-mode short-circuit.

This entrypoint still attempts the guest SSH probe in demo mode. Provider scripts in this repo are expected to return dummy success when ISVCTL_DEMO_MODE=1, not touch live instances.

As per coding guidelines "Provider scripts must include DEMO_MODE = os.environ.get("ISVCTL_DEMO_MODE") == "1" gate: real runs perform actual operations, demo mode returns dummy success".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@isvctl/configs/providers/gcp/scripts/vm/virtual_device_hardening.py` around
lines 257 - 298, Define DEMO_MODE = os.environ.get("ISVCTL_DEMO_MODE") == "1"
and short-circuit main() when it's true so the script never attempts live SSH
probes: add the DEMO_MODE variable near the top and, at the start of main(), if
DEMO_MODE build the same result structure using tests = _base_tests() but mark
the REQUIRED_TESTS entries as passed (or otherwise set success=True), print the
JSON result and return 0 without calling _collect_guest_probe or
_apply_guest_probe; reference main(), DEMO_MODE, _base_tests(), REQUIRED_TESTS,
_collect_guest_probe and _apply_guest_probe to locate and implement the change.

…te + new virtual_device_hardening step)

Adds a complete GCP Compute Engine provider under `isvctl/configs/providers/gcp/` covering the full vm test suite — `launch_instance`, `list_instances`, `verify_tags`, `serial_console`, `console_rbac`, `virtual_device_hardening` (new step from PR NVIDIA#413), `stop_instance`, `start_instance`, `reboot_instance`, `describe_instance`, `deploy_nim` (shared) — plus teardown. AWS oracle contract shape preserved across all steps; GCE-specific divergences encoded per step where the API surface differs from EC2 (zone capacity walk, async-operation cleanup tracker semantics, verified-reuse `instance_created` ownership flag, IAM token-creator propagation budget, instance metadata fields for serial-console RBAC probes).

Also adds operator setup documentation at `docs/references/gcp.md` (linked from `docs/getting-started.md`) covering authentication (ADC / service-account key), project-ID resolution order, required IAM roles, L4 GPU quota and zone list, NIM-step `NGC_API_KEY` handling, GPU-image default + how to override for Docker-based tests, and org-policy considerations operators commonly need to check before a clean run. GCP gets its own reference page rather than mixing into `references/aws.md` since AWS is treated as the canonical oracle example; future target NCPs follow the same per-NCP pattern.

The default `--image-project` / `--image-family` are the public GCP Deep Learning VM Image (`deeplearning-platform-release/common-cu129-ubuntu-2204-nvidia-580`) so a clean checkout can run the VM suite without any private-image entitlement. The DLVM ships with the NVIDIA driver + CUDA toolkit; operators who need Docker (for the `deploy_nim` step) override the default by exporting `GCP_VM_IMAGE` / `GCP_VM_IMAGE_PROJECT` in their shell / `.env` (the provider config reads both via Jinja, identical pattern to the existing `GCP_VM_SKIP_TEARDOWN` precedent), or skip NIM by leaving `NGC_API_KEY` unset (the shared script short-circuits cleanly). `--set image=... --set image_project=...` is also accepted for one-off ad-hoc invocations.

Mirrors two adjacent upstream changes that landed on main while this branch was in-flight:

  - PR NVIDIA#413 (commit 2cdf4d7, "feat: add virtual device hardening validation"
    by @mresvanis): the vm suite contract gained `virtual_device_hardening` as
    a new test step. The GCP mirror is a 302-line stub at
    `gcp/scripts/vm/virtual_device_hardening.py` — provider-evidence preamble
    naming Compute Engine (no customer-facing USB / clipboard surface on GCE
    tenant VMs), guest-side SSH probes using the same PROBE_SENTINEL framing
    and pattern lists as the AWS oracle, emitting the three REQUIRED_TESTS
    subtests (`usb_devices_disabled`, `clipboard_disabled`,
    `unnecessary_virtual_devices_absent`). Adds a sister `ssh_run` helper to
    `gcp/scripts/common/ssh_utils.py` returning `(exit_code, stdout, stderr)`
    with sentinel exit codes for TimeoutExpired (124) and OSError (255).

  - PR NVIDIA#424 (commit 6e1458a, "fix: handle AWS VM validation regressions" by
    @abegnoche): adds `IdentityAgent=none` alongside the existing
    `IdentitiesOnly=yes` so explicit `-i <key_file>` is the only SSH credential
    considered on operator hosts running a multi-identity ssh-agent. Mirrored
    to the GCP `_SSH_OPTS` tuple so every GCP SSH helper benefits uniformly.

Verification:

- **Static checks**: PASS (46/46).
- **Independent post-generation review on branch HEAD**: `PASS p1=0 p2=0`.
- **GCP orphans after teardown**: 0 on every run.
- **Live execution end-to-end** — verified under both supported operator configurations:

  - canonical / custom-image (`NGC_API_KEY` set + `GCP_VM_IMAGE` + `GCP_VM_IMAGE_PROJECT` set to a Docker-equipped image): 3 consecutive full-domain PASS — 9 test steps + NIM deploy / teardown + cleanup, ~1650s per run. Post-upstream-refresh re-verification (with the new `virtual_device_hardening` step + IdentityAgent flag + `ssh_run` helper) also PASS — domain-gate live run `test-f65be511` 330.8s, all 11 vm steps + NIM deploy/teardown + cleanup, 0 orphans.
  - public-default (all three env vars unset → suite falls through to the DLVM): 1 expected FAIL on `host_os.ContainerRuntimeCheck` (Docker not available — DLVM ships NVIDIA driver + CUDA but not Docker; this is documented in `docs/references/gcp.md` §5 as a known portability gap of the public default). Everything else PASS: all 9 lifecycle steps, NIM steps cleanly `runtime_skip`, both `deploy_nim` and `teardown_nim` policy-skip honored end-to-end. 1113.5s.

Known gaps:

- `ContainerRuntimeCheck` fails on the public DLVM default image (documented above). Operators who need a fully green run override the image via `GCP_VM_IMAGE` / `GCP_VM_IMAGE_PROJECT` to a Docker-equipped image. The public default is published as "boot + lifecycle works without any private entitlement", not as "every validator passes" — see `docs/references/gcp.md` §5.

Branch history before squash-for-PR: 32 commits — initial factory-generated foundation + per-step worker + domain-gate + final-review-fix commits, plus 7 operator hand-fixes addressing post-generation review findings (cleanup-bool propagation, socket-timeout retry, zone-walk contract, probe-network arg threading, factory-vocabulary scrub, warm-reuse stability gate, symmetric NIM teardown policy-skip, env-var override architecture), plus 2 docs commits (operator setup reference + ContainerRuntimeCheck clarification), plus a single squashed factory-update-mode regen against the new upstream main tip 9b79e34 bringing the two adjacent upstream mirrors above. Squashed here for a single-commit upstream PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Orga Shih <oshih@nvidia.com>
@sinorga sinorga force-pushed the agent/gcp-workers-286265d1-e5eb-4ffe-bf5f-c8e7a46ce147 branch from 4453f67 to 0d8dcd7 Compare May 18, 2026 17:54

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
isvctl/configs/providers/gcp/scripts/vm/serial_console.py (1)

50-50: ⚡ Quick win

Add a docstring for main() to satisfy Python standards.

main() is missing a function docstring.
As per coding guidelines: **/*.py: “Every function and class must have docstrings following PEP 257”.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@isvctl/configs/providers/gcp/scripts/vm/serial_console.py` at line 50, Add a
PEP 257-compliant docstring to the main() function describing its purpose,
return value (int), and any important behavior or side effects; locate the
main() definition in serial_console.py (function name: main) and insert a short
triple-quoted docstring immediately below the def line following project
docstring style.
isvctl/configs/providers/gcp/scripts/vm/describe_instance.py (1)

43-43: ⚡ Quick win

Add a docstring for main() for PEP 257 compliance.

main() currently has no docstring.
As per coding guidelines: **/*.py: “Every function and class must have docstrings following PEP 257”.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@isvctl/configs/providers/gcp/scripts/vm/describe_instance.py` at line 43, Add
a PEP 257-compliant docstring to the main() function describing its purpose, any
important behavior or side effects (e.g., it parses args, describes a GCP VM,
prints or logs output), and the return value (an int exit code); place the
docstring immediately under the def main() -> int: declaration and follow
conventional triple-quoted summary, optional longer description, and a
“Returns:” section showing the int exit code.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@isvctl/configs/providers/gcp/scripts/vm/describe_instance.py`:
- Line 43: Add a PEP 257-compliant docstring to the main() function describing
its purpose, any important behavior or side effects (e.g., it parses args,
describes a GCP VM, prints or logs output), and the return value (an int exit
code); place the docstring immediately under the def main() -> int: declaration
and follow conventional triple-quoted summary, optional longer description, and
a “Returns:” section showing the int exit code.

In `@isvctl/configs/providers/gcp/scripts/vm/serial_console.py`:
- Line 50: Add a PEP 257-compliant docstring to the main() function describing
its purpose, return value (int), and any important behavior or side effects;
locate the main() definition in serial_console.py (function name: main) and
insert a short triple-quoted docstring immediately below the def line following
project docstring style.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4fe695da-6bb5-4c4b-bef6-fe3ddd480b9f

📥 Commits

Reviewing files that changed from the base of the PR and between 4453f67 and 0d8dcd7.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (21)
  • .factory_version
  • docs/getting-started.md
  • docs/references/gcp.md
  • isvctl/configs/providers/gcp/config/vm.yaml
  • isvctl/configs/providers/gcp/scripts/common/__init__.py
  • isvctl/configs/providers/gcp/scripts/common/compute.py
  • isvctl/configs/providers/gcp/scripts/common/errors.py
  • isvctl/configs/providers/gcp/scripts/common/ssh_utils.py
  • isvctl/configs/providers/gcp/scripts/vm/console_rbac.py
  • isvctl/configs/providers/gcp/scripts/vm/describe_instance.py
  • isvctl/configs/providers/gcp/scripts/vm/describe_tags.py
  • isvctl/configs/providers/gcp/scripts/vm/launch_instance.py
  • isvctl/configs/providers/gcp/scripts/vm/list_instances.py
  • isvctl/configs/providers/gcp/scripts/vm/reboot_instance.py
  • isvctl/configs/providers/gcp/scripts/vm/serial_console.py
  • isvctl/configs/providers/gcp/scripts/vm/start_instance.py
  • isvctl/configs/providers/gcp/scripts/vm/stop_instance.py
  • isvctl/configs/providers/gcp/scripts/vm/teardown.py
  • isvctl/configs/providers/gcp/scripts/vm/virtual_device_hardening.py
  • isvctl/configs/providers/shared/teardown_nim.py
  • isvctl/pyproject.toml
✅ Files skipped from review due to trivial changes (3)
  • .factory_version
  • isvctl/configs/providers/gcp/scripts/common/init.py
  • docs/getting-started.md
🚧 Files skipped from review as they are similar to previous changes (14)
  • isvctl/pyproject.toml
  • isvctl/configs/providers/gcp/scripts/vm/describe_tags.py
  • isvctl/configs/providers/gcp/config/vm.yaml
  • isvctl/configs/providers/gcp/scripts/vm/stop_instance.py
  • isvctl/configs/providers/shared/teardown_nim.py
  • isvctl/configs/providers/gcp/scripts/vm/start_instance.py
  • isvctl/configs/providers/gcp/scripts/vm/list_instances.py
  • isvctl/configs/providers/gcp/scripts/vm/virtual_device_hardening.py
  • isvctl/configs/providers/gcp/scripts/common/ssh_utils.py
  • isvctl/configs/providers/gcp/scripts/vm/console_rbac.py
  • isvctl/configs/providers/gcp/scripts/common/errors.py
  • isvctl/configs/providers/gcp/scripts/vm/teardown.py
  • isvctl/configs/providers/gcp/scripts/vm/launch_instance.py
  • isvctl/configs/providers/gcp/scripts/vm/reboot_instance.py

sinorga and others added 2 commits May 19, 2026 02:08
…t review feedback)

Three helper functions surfaced by CodeRabbit's docstring-coverage quality
gate (78.72% < 80% threshold on PR NVIDIA#427) lacked PEP 257 docstrings:

  - common/compute.py: first_internal_ip (returns the nic0 internal IPv4
    or None when the instance has no NICs / hasn't allocated yet).
  - common/compute.py: _firewall_has_isv_ownership (verified-reuse
    ownership marker check — distinguishes firewalls this suite created
    from operator-owned firewalls that happen to match the test name).
  - common/ssh_utils.py: _ssh_argv (canonical ssh argv builder — shared
    _SSH_OPTS + -i <key> + <user>@<host> + <remote_cmd> — so every
    subprocess SSH call in this module uses identical option semantics,
    per the canonical-options consistency rule).

Each function gets a one-paragraph docstring (one-line summary + a
second line of context where it adds value). Coverage on the gcp/scripts
tree post-fix: 83.15% (74/89), above the 80% threshold.

Kept as a separate commit (not amended into the squash) so the PR's
"Commits" tab shows a transparent response to CodeRabbit's review
feedback — easier for the upstream maintainer to see exactly what
changed in response to a specific review-cycle pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Orga Shih <oshih@nvidia.com>
…MD040)

CodeRabbit's review of PR NVIDIA#427 flagged the bare ``` fence around the
L4-supported-zones region list as a markdownlint MD040 violation.
Changing to ```text marks the block explicitly as plain text so
linters don't trip while preserving the rendering (text is the
de-facto fallback for un-tagged fences).

One-character class change; no behavior or rendering difference.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Orga Shih <oshih@nvidia.com>
Comment thread .factory_version

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this file for?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's for updating flow when isv repo changed and the factory need to update the generated content. it's easier to find the commit to diff instead of searching commit history. if you think it's better to not injecting any factory related metadata in to isv repo, I can think about other solution.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be best to keep separate if you can thank

@abegnoche abegnoche changed the title factory: generates gcp/vm stubs feat: generates gcp/vm stubs May 22, 2026
sinorga added a commit to sinorga/ISV-NCP-Validation-Suite that referenced this pull request May 26, 2026
* factory: generates gcp/vm stubs (Compute Engine provider, full vm suite + new virtual_device_hardening step)

Adds a complete GCP Compute Engine provider under `isvctl/configs/providers/gcp/` covering the full vm test suite — `launch_instance`, `list_instances`, `verify_tags`, `serial_console`, `console_rbac`, `virtual_device_hardening` (new step from PR NVIDIA#413), `stop_instance`, `start_instance`, `reboot_instance`, `describe_instance`, `deploy_nim` (shared) — plus teardown. AWS oracle contract shape preserved across all steps; GCE-specific divergences encoded per step where the API surface differs from EC2 (zone capacity walk, async-operation cleanup tracker semantics, verified-reuse `instance_created` ownership flag, IAM token-creator propagation budget, instance metadata fields for serial-console RBAC probes).

Also adds operator setup documentation at `docs/references/gcp.md` (linked from `docs/getting-started.md`) covering authentication (ADC / service-account key), project-ID resolution order, required IAM roles, L4 GPU quota and zone list, NIM-step `NGC_API_KEY` handling, GPU-image default + how to override for Docker-based tests, and org-policy considerations operators commonly need to check before a clean run. GCP gets its own reference page rather than mixing into `references/aws.md` since AWS is treated as the canonical oracle example; future target NCPs follow the same per-NCP pattern.

The default `--image-project` / `--image-family` are the public GCP Deep Learning VM Image (`deeplearning-platform-release/common-cu129-ubuntu-2204-nvidia-580`) so a clean checkout can run the VM suite without any private-image entitlement. The DLVM ships with the NVIDIA driver + CUDA toolkit; operators who need Docker (for the `deploy_nim` step) override the default by exporting `GCP_VM_IMAGE` / `GCP_VM_IMAGE_PROJECT` in their shell / `.env` (the provider config reads both via Jinja, identical pattern to the existing `GCP_VM_SKIP_TEARDOWN` precedent), or skip NIM by leaving `NGC_API_KEY` unset (the shared script short-circuits cleanly). `--set image=... --set image_project=...` is also accepted for one-off ad-hoc invocations.

Mirrors two adjacent upstream changes that landed on main while this branch was in-flight:

  - PR NVIDIA#413 (commit 2cdf4d7, "feat: add virtual device hardening validation"
    by @mresvanis): the vm suite contract gained `virtual_device_hardening` as
    a new test step. The GCP mirror is a 302-line stub at
    `gcp/scripts/vm/virtual_device_hardening.py` — provider-evidence preamble
    naming Compute Engine (no customer-facing USB / clipboard surface on GCE
    tenant VMs), guest-side SSH probes using the same PROBE_SENTINEL framing
    and pattern lists as the AWS oracle, emitting the three REQUIRED_TESTS
    subtests (`usb_devices_disabled`, `clipboard_disabled`,
    `unnecessary_virtual_devices_absent`). Adds a sister `ssh_run` helper to
    `gcp/scripts/common/ssh_utils.py` returning `(exit_code, stdout, stderr)`
    with sentinel exit codes for TimeoutExpired (124) and OSError (255).

  - PR NVIDIA#424 (commit 6e1458a, "fix: handle AWS VM validation regressions" by
    @abegnoche): adds `IdentityAgent=none` alongside the existing
    `IdentitiesOnly=yes` so explicit `-i <key_file>` is the only SSH credential
    considered on operator hosts running a multi-identity ssh-agent. Mirrored
    to the GCP `_SSH_OPTS` tuple so every GCP SSH helper benefits uniformly.

Verification:

- **Static checks**: PASS (46/46).
- **Independent post-generation review on branch HEAD**: `PASS p1=0 p2=0`.
- **GCP orphans after teardown**: 0 on every run.
- **Live execution end-to-end** — verified under both supported operator configurations:

  - canonical / custom-image (`NGC_API_KEY` set + `GCP_VM_IMAGE` + `GCP_VM_IMAGE_PROJECT` set to a Docker-equipped image): 3 consecutive full-domain PASS — 9 test steps + NIM deploy / teardown + cleanup, ~1650s per run. Post-upstream-refresh re-verification (with the new `virtual_device_hardening` step + IdentityAgent flag + `ssh_run` helper) also PASS — domain-gate live run `test-f65be511` 330.8s, all 11 vm steps + NIM deploy/teardown + cleanup, 0 orphans.
  - public-default (all three env vars unset → suite falls through to the DLVM): 1 expected FAIL on `host_os.ContainerRuntimeCheck` (Docker not available — DLVM ships NVIDIA driver + CUDA but not Docker; this is documented in `docs/references/gcp.md` §5 as a known portability gap of the public default). Everything else PASS: all 9 lifecycle steps, NIM steps cleanly `runtime_skip`, both `deploy_nim` and `teardown_nim` policy-skip honored end-to-end. 1113.5s.

Known gaps:

- `ContainerRuntimeCheck` fails on the public DLVM default image (documented above). Operators who need a fully green run override the image via `GCP_VM_IMAGE` / `GCP_VM_IMAGE_PROJECT` to a Docker-equipped image. The public default is published as "boot + lifecycle works without any private entitlement", not as "every validator passes" — see `docs/references/gcp.md` §5.

Branch history before squash-for-PR: 32 commits — initial factory-generated foundation + per-step worker + domain-gate + final-review-fix commits, plus 7 operator hand-fixes addressing post-generation review findings (cleanup-bool propagation, socket-timeout retry, zone-walk contract, probe-network arg threading, factory-vocabulary scrub, warm-reuse stability gate, symmetric NIM teardown policy-skip, env-var override architecture), plus 2 docs commits (operator setup reference + ContainerRuntimeCheck clarification), plus a single squashed factory-update-mode regen against the new upstream main tip 9b79e34 bringing the two adjacent upstream mirrors above. Squashed here for a single-commit upstream PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Orga Shih <oshih@nvidia.com>

* fix(gcp/common): add PEP 257 docstrings to missing helpers (CodeRabbit review feedback)

Three helper functions surfaced by CodeRabbit's docstring-coverage quality
gate (78.72% < 80% threshold on PR NVIDIA#427) lacked PEP 257 docstrings:

  - common/compute.py: first_internal_ip (returns the nic0 internal IPv4
    or None when the instance has no NICs / hasn't allocated yet).
  - common/compute.py: _firewall_has_isv_ownership (verified-reuse
    ownership marker check — distinguishes firewalls this suite created
    from operator-owned firewalls that happen to match the test name).
  - common/ssh_utils.py: _ssh_argv (canonical ssh argv builder — shared
    _SSH_OPTS + -i <key> + <user>@<host> + <remote_cmd> — so every
    subprocess SSH call in this module uses identical option semantics,
    per the canonical-options consistency rule).

Each function gets a one-paragraph docstring (one-line summary + a
second line of context where it adds value). Coverage on the gcp/scripts
tree post-fix: 83.15% (74/89), above the 80% threshold.

Kept as a separate commit (not amended into the squash) so the PR's
"Commits" tab shows a transparent response to CodeRabbit's review
feedback — easier for the upstream maintainer to see exactly what
changed in response to a specific review-cycle pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Orga Shih <oshih@nvidia.com>

* docs(references/gcp): specify text language for L4-zones code fence (MD040)

CodeRabbit's review of PR NVIDIA#427 flagged the bare ``` fence around the
L4-supported-zones region list as a markdownlint MD040 violation.
Changing to ```text marks the block explicitly as plain text so
linters don't trip while preserving the rendering (text is the
de-facto fallback for un-tagged fences).

One-character class change; no behavior or rendering difference.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Orga Shih <oshih@nvidia.com>

---------

Signed-off-by: Orga Shih <oshih@nvidia.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
sinorga added a commit to sinorga/ISV-NCP-Validation-Suite that referenced this pull request Jun 3, 2026
* factory: generates gcp/vm stubs (Compute Engine provider, full vm suite + new virtual_device_hardening step)

Adds a complete GCP Compute Engine provider under `isvctl/configs/providers/gcp/` covering the full vm test suite — `launch_instance`, `list_instances`, `verify_tags`, `serial_console`, `console_rbac`, `virtual_device_hardening` (new step from PR NVIDIA#413), `stop_instance`, `start_instance`, `reboot_instance`, `describe_instance`, `deploy_nim` (shared) — plus teardown. AWS oracle contract shape preserved across all steps; GCE-specific divergences encoded per step where the API surface differs from EC2 (zone capacity walk, async-operation cleanup tracker semantics, verified-reuse `instance_created` ownership flag, IAM token-creator propagation budget, instance metadata fields for serial-console RBAC probes).

Also adds operator setup documentation at `docs/references/gcp.md` (linked from `docs/getting-started.md`) covering authentication (ADC / service-account key), project-ID resolution order, required IAM roles, L4 GPU quota and zone list, NIM-step `NGC_API_KEY` handling, GPU-image default + how to override for Docker-based tests, and org-policy considerations operators commonly need to check before a clean run. GCP gets its own reference page rather than mixing into `references/aws.md` since AWS is treated as the canonical oracle example; future target NCPs follow the same per-NCP pattern.

The default `--image-project` / `--image-family` are the public GCP Deep Learning VM Image (`deeplearning-platform-release/common-cu129-ubuntu-2204-nvidia-580`) so a clean checkout can run the VM suite without any private-image entitlement. The DLVM ships with the NVIDIA driver + CUDA toolkit; operators who need Docker (for the `deploy_nim` step) override the default by exporting `GCP_VM_IMAGE` / `GCP_VM_IMAGE_PROJECT` in their shell / `.env` (the provider config reads both via Jinja, identical pattern to the existing `GCP_VM_SKIP_TEARDOWN` precedent), or skip NIM by leaving `NGC_API_KEY` unset (the shared script short-circuits cleanly). `--set image=... --set image_project=...` is also accepted for one-off ad-hoc invocations.

Mirrors two adjacent upstream changes that landed on main while this branch was in-flight:

  - PR NVIDIA#413 (commit 2cdf4d7, "feat: add virtual device hardening validation"
    by @mresvanis): the vm suite contract gained `virtual_device_hardening` as
    a new test step. The GCP mirror is a 302-line stub at
    `gcp/scripts/vm/virtual_device_hardening.py` — provider-evidence preamble
    naming Compute Engine (no customer-facing USB / clipboard surface on GCE
    tenant VMs), guest-side SSH probes using the same PROBE_SENTINEL framing
    and pattern lists as the AWS oracle, emitting the three REQUIRED_TESTS
    subtests (`usb_devices_disabled`, `clipboard_disabled`,
    `unnecessary_virtual_devices_absent`). Adds a sister `ssh_run` helper to
    `gcp/scripts/common/ssh_utils.py` returning `(exit_code, stdout, stderr)`
    with sentinel exit codes for TimeoutExpired (124) and OSError (255).

  - PR NVIDIA#424 (commit 6e1458a, "fix: handle AWS VM validation regressions" by
    @abegnoche): adds `IdentityAgent=none` alongside the existing
    `IdentitiesOnly=yes` so explicit `-i <key_file>` is the only SSH credential
    considered on operator hosts running a multi-identity ssh-agent. Mirrored
    to the GCP `_SSH_OPTS` tuple so every GCP SSH helper benefits uniformly.

Verification:

- **Static checks**: PASS (46/46).
- **Independent post-generation review on branch HEAD**: `PASS p1=0 p2=0`.
- **GCP orphans after teardown**: 0 on every run.
- **Live execution end-to-end** — verified under both supported operator configurations:

  - canonical / custom-image (`NGC_API_KEY` set + `GCP_VM_IMAGE` + `GCP_VM_IMAGE_PROJECT` set to a Docker-equipped image): 3 consecutive full-domain PASS — 9 test steps + NIM deploy / teardown + cleanup, ~1650s per run. Post-upstream-refresh re-verification (with the new `virtual_device_hardening` step + IdentityAgent flag + `ssh_run` helper) also PASS — domain-gate live run `test-f65be511` 330.8s, all 11 vm steps + NIM deploy/teardown + cleanup, 0 orphans.
  - public-default (all three env vars unset → suite falls through to the DLVM): 1 expected FAIL on `host_os.ContainerRuntimeCheck` (Docker not available — DLVM ships NVIDIA driver + CUDA but not Docker; this is documented in `docs/references/gcp.md` §5 as a known portability gap of the public default). Everything else PASS: all 9 lifecycle steps, NIM steps cleanly `runtime_skip`, both `deploy_nim` and `teardown_nim` policy-skip honored end-to-end. 1113.5s.

Known gaps:

- `ContainerRuntimeCheck` fails on the public DLVM default image (documented above). Operators who need a fully green run override the image via `GCP_VM_IMAGE` / `GCP_VM_IMAGE_PROJECT` to a Docker-equipped image. The public default is published as "boot + lifecycle works without any private entitlement", not as "every validator passes" — see `docs/references/gcp.md` §5.

Branch history before squash-for-PR: 32 commits — initial factory-generated foundation + per-step worker + domain-gate + final-review-fix commits, plus 7 operator hand-fixes addressing post-generation review findings (cleanup-bool propagation, socket-timeout retry, zone-walk contract, probe-network arg threading, factory-vocabulary scrub, warm-reuse stability gate, symmetric NIM teardown policy-skip, env-var override architecture), plus 2 docs commits (operator setup reference + ContainerRuntimeCheck clarification), plus a single squashed factory-update-mode regen against the new upstream main tip 9b79e34 bringing the two adjacent upstream mirrors above. Squashed here for a single-commit upstream PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Orga Shih <oshih@nvidia.com>

* fix(gcp/common): add PEP 257 docstrings to missing helpers (CodeRabbit review feedback)

Three helper functions surfaced by CodeRabbit's docstring-coverage quality
gate (78.72% < 80% threshold on PR NVIDIA#427) lacked PEP 257 docstrings:

  - common/compute.py: first_internal_ip (returns the nic0 internal IPv4
    or None when the instance has no NICs / hasn't allocated yet).
  - common/compute.py: _firewall_has_isv_ownership (verified-reuse
    ownership marker check — distinguishes firewalls this suite created
    from operator-owned firewalls that happen to match the test name).
  - common/ssh_utils.py: _ssh_argv (canonical ssh argv builder — shared
    _SSH_OPTS + -i <key> + <user>@<host> + <remote_cmd> — so every
    subprocess SSH call in this module uses identical option semantics,
    per the canonical-options consistency rule).

Each function gets a one-paragraph docstring (one-line summary + a
second line of context where it adds value). Coverage on the gcp/scripts
tree post-fix: 83.15% (74/89), above the 80% threshold.

Kept as a separate commit (not amended into the squash) so the PR's
"Commits" tab shows a transparent response to CodeRabbit's review
feedback — easier for the upstream maintainer to see exactly what
changed in response to a specific review-cycle pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Orga Shih <oshih@nvidia.com>

* docs(references/gcp): specify text language for L4-zones code fence (MD040)

CodeRabbit's review of PR NVIDIA#427 flagged the bare ``` fence around the
L4-supported-zones region list as a markdownlint MD040 violation.
Changing to ```text marks the block explicitly as plain text so
linters don't trip while preserving the rendering (text is the
de-facto fallback for un-tagged fences).

One-character class change; no behavior or rendering difference.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Orga Shih <oshih@nvidia.com>

---------

Signed-off-by: Orga Shih <oshih@nvidia.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
sinorga added a commit to sinorga/ISV-NCP-Validation-Suite that referenced this pull request Jun 22, 2026
* factory: generates gcp/vm stubs (Compute Engine provider, full vm suite + new virtual_device_hardening step)

Adds a complete GCP Compute Engine provider under `isvctl/configs/providers/gcp/` covering the full vm test suite — `launch_instance`, `list_instances`, `verify_tags`, `serial_console`, `console_rbac`, `virtual_device_hardening` (new step from PR NVIDIA#413), `stop_instance`, `start_instance`, `reboot_instance`, `describe_instance`, `deploy_nim` (shared) — plus teardown. AWS oracle contract shape preserved across all steps; GCE-specific divergences encoded per step where the API surface differs from EC2 (zone capacity walk, async-operation cleanup tracker semantics, verified-reuse `instance_created` ownership flag, IAM token-creator propagation budget, instance metadata fields for serial-console RBAC probes).

Also adds operator setup documentation at `docs/references/gcp.md` (linked from `docs/getting-started.md`) covering authentication (ADC / service-account key), project-ID resolution order, required IAM roles, L4 GPU quota and zone list, NIM-step `NGC_API_KEY` handling, GPU-image default + how to override for Docker-based tests, and org-policy considerations operators commonly need to check before a clean run. GCP gets its own reference page rather than mixing into `references/aws.md` since AWS is treated as the canonical oracle example; future target NCPs follow the same per-NCP pattern.

The default `--image-project` / `--image-family` are the public GCP Deep Learning VM Image (`deeplearning-platform-release/common-cu129-ubuntu-2204-nvidia-580`) so a clean checkout can run the VM suite without any private-image entitlement. The DLVM ships with the NVIDIA driver + CUDA toolkit; operators who need Docker (for the `deploy_nim` step) override the default by exporting `GCP_VM_IMAGE` / `GCP_VM_IMAGE_PROJECT` in their shell / `.env` (the provider config reads both via Jinja, identical pattern to the existing `GCP_VM_SKIP_TEARDOWN` precedent), or skip NIM by leaving `NGC_API_KEY` unset (the shared script short-circuits cleanly). `--set image=... --set image_project=...` is also accepted for one-off ad-hoc invocations.

Mirrors two adjacent upstream changes that landed on main while this branch was in-flight:

  - PR NVIDIA#413 (commit 2cdf4d7, "feat: add virtual device hardening validation"
    by @mresvanis): the vm suite contract gained `virtual_device_hardening` as
    a new test step. The GCP mirror is a 302-line stub at
    `gcp/scripts/vm/virtual_device_hardening.py` — provider-evidence preamble
    naming Compute Engine (no customer-facing USB / clipboard surface on GCE
    tenant VMs), guest-side SSH probes using the same PROBE_SENTINEL framing
    and pattern lists as the AWS oracle, emitting the three REQUIRED_TESTS
    subtests (`usb_devices_disabled`, `clipboard_disabled`,
    `unnecessary_virtual_devices_absent`). Adds a sister `ssh_run` helper to
    `gcp/scripts/common/ssh_utils.py` returning `(exit_code, stdout, stderr)`
    with sentinel exit codes for TimeoutExpired (124) and OSError (255).

  - PR NVIDIA#424 (commit 6e1458a, "fix: handle AWS VM validation regressions" by
    @abegnoche): adds `IdentityAgent=none` alongside the existing
    `IdentitiesOnly=yes` so explicit `-i <key_file>` is the only SSH credential
    considered on operator hosts running a multi-identity ssh-agent. Mirrored
    to the GCP `_SSH_OPTS` tuple so every GCP SSH helper benefits uniformly.

Verification:

- **Static checks**: PASS (46/46).
- **Independent post-generation review on branch HEAD**: `PASS p1=0 p2=0`.
- **GCP orphans after teardown**: 0 on every run.
- **Live execution end-to-end** — verified under both supported operator configurations:

  - canonical / custom-image (`NGC_API_KEY` set + `GCP_VM_IMAGE` + `GCP_VM_IMAGE_PROJECT` set to a Docker-equipped image): 3 consecutive full-domain PASS — 9 test steps + NIM deploy / teardown + cleanup, ~1650s per run. Post-upstream-refresh re-verification (with the new `virtual_device_hardening` step + IdentityAgent flag + `ssh_run` helper) also PASS — domain-gate live run `test-f65be511` 330.8s, all 11 vm steps + NIM deploy/teardown + cleanup, 0 orphans.
  - public-default (all three env vars unset → suite falls through to the DLVM): 1 expected FAIL on `host_os.ContainerRuntimeCheck` (Docker not available — DLVM ships NVIDIA driver + CUDA but not Docker; this is documented in `docs/references/gcp.md` §5 as a known portability gap of the public default). Everything else PASS: all 9 lifecycle steps, NIM steps cleanly `runtime_skip`, both `deploy_nim` and `teardown_nim` policy-skip honored end-to-end. 1113.5s.

Known gaps:

- `ContainerRuntimeCheck` fails on the public DLVM default image (documented above). Operators who need a fully green run override the image via `GCP_VM_IMAGE` / `GCP_VM_IMAGE_PROJECT` to a Docker-equipped image. The public default is published as "boot + lifecycle works without any private entitlement", not as "every validator passes" — see `docs/references/gcp.md` §5.

Branch history before squash-for-PR: 32 commits — initial factory-generated foundation + per-step worker + domain-gate + final-review-fix commits, plus 7 operator hand-fixes addressing post-generation review findings (cleanup-bool propagation, socket-timeout retry, zone-walk contract, probe-network arg threading, factory-vocabulary scrub, warm-reuse stability gate, symmetric NIM teardown policy-skip, env-var override architecture), plus 2 docs commits (operator setup reference + ContainerRuntimeCheck clarification), plus a single squashed factory-update-mode regen against the new upstream main tip 9b79e34 bringing the two adjacent upstream mirrors above. Squashed here for a single-commit upstream PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Orga Shih <oshih@nvidia.com>

* fix(gcp/common): add PEP 257 docstrings to missing helpers (CodeRabbit review feedback)

Three helper functions surfaced by CodeRabbit's docstring-coverage quality
gate (78.72% < 80% threshold on PR NVIDIA#427) lacked PEP 257 docstrings:

  - common/compute.py: first_internal_ip (returns the nic0 internal IPv4
    or None when the instance has no NICs / hasn't allocated yet).
  - common/compute.py: _firewall_has_isv_ownership (verified-reuse
    ownership marker check — distinguishes firewalls this suite created
    from operator-owned firewalls that happen to match the test name).
  - common/ssh_utils.py: _ssh_argv (canonical ssh argv builder — shared
    _SSH_OPTS + -i <key> + <user>@<host> + <remote_cmd> — so every
    subprocess SSH call in this module uses identical option semantics,
    per the canonical-options consistency rule).

Each function gets a one-paragraph docstring (one-line summary + a
second line of context where it adds value). Coverage on the gcp/scripts
tree post-fix: 83.15% (74/89), above the 80% threshold.

Kept as a separate commit (not amended into the squash) so the PR's
"Commits" tab shows a transparent response to CodeRabbit's review
feedback — easier for the upstream maintainer to see exactly what
changed in response to a specific review-cycle pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Orga Shih <oshih@nvidia.com>

* docs(references/gcp): specify text language for L4-zones code fence (MD040)

CodeRabbit's review of PR NVIDIA#427 flagged the bare ``` fence around the
L4-supported-zones region list as a markdownlint MD040 violation.
Changing to ```text marks the block explicitly as plain text so
linters don't trip while preserving the rendering (text is the
de-facto fallback for un-tagged fences).

One-character class change; no behavior or rendering difference.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Orga Shih <oshih@nvidia.com>

---------

Signed-off-by: Orga Shih <oshih@nvidia.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants