Skip to content

feat: add vCluster provider for k8s and control-plane suites#414

Open
saiyam1814 wants to merge 6 commits into
NVIDIA:mainfrom
saiyam1814:add-vcluster-provider
Open

feat: add vCluster provider for k8s and control-plane suites#414
saiyam1814 wants to merge 6 commits into
NVIDIA:mainfrom
saiyam1814:add-vcluster-provider

Conversation

@saiyam1814

@saiyam1814 saiyam1814 commented May 13, 2026

Copy link
Copy Markdown

Summary

Adds vCluster (by vCluster Labs) as a validated CaaS provider in the NVIDIA ISV NCP Validation Suite.

vCluster is an open-source project that provisions isolated tenant clusters on top of any existing Kubernetes cluster (the Control Plane Cluster). Each tenant cluster has its own virtual control plane (API server, scheduler, controller manager) while sharing the host cluster's GPU nodes. vCluster is CNCF-certified for Kubernetes 1.28-1.35 in three configurations: vcluster-standalone, vcluster-with-private-nodes, and vcluster-with-shared-nodes. This PR validates the shared-nodes topology (sync.fromHost.nodes) which is the configuration that enables GPU workloads in tenant clusters to schedule onto the Control Plane Cluster's physical GPU nodes.

Provider files (isvctl/configs/providers/vcluster/):

  • config/k8s.yaml — full k8s suite wiring with CNCF conformance skip patterns, GKE COS GPU overrides, A100 MIG labeling, NCCL multi-node configuration, and API network ACL probe
  • config/control-plane.yaml — control-plane suite wiring
  • scripts/k8s/setup.sh — tenant cluster lifecycle, GPU Operator handling, MPI Operator install (server-side apply), A100/H100 MIG label seeding
  • scripts/k8s/teardown.sh — tenant cluster teardown with taint restoration
  • scripts/control-plane/ — access key and tenant management scripts
  • manifests/mpi-operator-v0.5.0.yaml — bundled Kubeflow MPI Operator manifest for NCCL multi-node workloads
  • README.md — provider documentation

Workload / framework improvements (apply across providers):

  • isvtest/src/isvtest/workloads/k8s_nim.pyruntime_class_name, nim_memory_request, nim_memory_limit config params
  • isvtest/src/isvtest/workloads/k8s_nim_helm.py--kubeconfig targeting, memory and runtimeClassName params, NIM_MAX_MODEL_LEN env via --set-string
  • isvtest/src/isvtest/workloads/k8s_nccl_multinode.pyoverride_launcher_affinity and runtime_class_name for non-control-plane MPI launchers
  • isvtest/src/isvtest/workloads/k8s_stress.pyruntime_class_name config param
  • isvtest/src/isvtest/validations/k8s_conformance.py — robust JUnit retrieval: retry on transient stream resets, kubectl cp fallback, and an opt-in pre-staged local file path for providers behind managed-K8s LBs
  • isvtest/src/isvtest/core/nvidia.py — GPU row regex fix for non-standard GPU names

Test Infrastructure

Control Plane Cluster: GKE vcluster-isv-gpu-test, us-central1-c, Kubernetes v1.35.3-gke.2190000

Node pool Nodes Machine GPU Purpose
default-pool 2 n1-standard-4 CPU workloads / control plane
gpu-pool 2 n1-standard-4 1× NVIDIA T4 (16 GB) GPU workloads (NIM, stress, NCCL)
gpu-a100-pool 1 a2-highgpu-1g 1× NVIDIA A100 (40 GB) MIG configuration check

Tenant cluster: vcluster-isv-validation in namespace vcluster-isv-validation, vCluster 0.34.0, configured with sync.fromHost.nodes to expose host GPU capacity into the tenant.


Test Results

make test / make lint — PASS

728 unit tests pass; ruff, yamlfmt, and SPDX header checks all pass.

Control-plane suite — 11 / 11 PASS

All 11 control-plane checks pass: FieldExistsCheck, FieldValueCheck, AccessKeyCreatedCheck, TenantCreatedCheck, AccessKeyAuthenticatedCheck, AccessKeyDisabledCheck, AccessKeyRejectedCheck, TenantListedCheck, TenantInfoCheck, StepSuccessCheck-delete_access_key, StepSuccessCheck-delete_tenant.

Kubernetes suite — 31 / 31 PASS

# Check Result Notes
1 K8sNodeCountCheck PASS
2 K8sExpectedNodesCheck PASS
3 K8sNodeReadyCheck PASS
4 K8sNvidiaSmiCheck PASS nvidia-smi on synced GPU nodes
5 K8sDriverVersionCheck PASS driver_version: "" — GKE manages drivers natively
6 K8sGpuPodAccessCheck PASS GPU pod scheduled via device plugin
7 K8sGpuCapacityCheck PASS T4 nodes + A100 node
8 K8sGpuOperatorNamespaceCheck PASS gpu-operator namespace in tenant
9 K8sGpuOperatorPodsCheck PASS pause Deployment satisfies the check (host GPU Operator already running)
10 K8sGpuLabelsCheck PASS nvidia.com/gpu.present=true on synced nodes
11 K8sPodHealthCheck PASS
12 K8sNoPendingPodsCheck PASS
13 K8sNoErrorPodsCheck PASS
14 K8sMigConfigCheck PASS A100 node auto-labeled with nvidia.com/mig.capable=true and nvidia.com/mig.strategy=single in setup.sh (GFD cannot run on GKE COS)
15 K8sDualStackNodeCheck PASS Single-stack IPv4
16 K8sNetworkPolicyCheck PASS NetworkPolicies synced to host CNI
17 K8sCsiStorageTypesCheck PASS
18 K8sCsiStorageQuotaApiCheck PASS
19 K8sCsiTenantScopedCredentialsCheck PASS
20 K8sCsiProvisioningModesCheck PASS
21 K8sOidcIssuerCheck PASS
22 K8sApiServerMetricsCheck PASS
23 K8sControlPlaneLogsCheck PASS syncer logs via kubectl in vCluster namespace
24 K8sApiNetworkAclCheck PASS API rejects an unauthenticated curl -f with HTTP 401 (probe exits 22 → ACL enforced); authorized probe via kubectl succeeds
25 K8sCncfConformanceCheck PASS certified-conformance mode, v1.35.0: 419 / 7353 passed, 0 failed, 0 errors, 6934 skipped (the 6934 are tests outside the [Conformance] focus plus the 28-pattern provider skip list documented below)
26 K8sNcclWorkload PASS Single-node NCCL
27 K8sNcclMultiNodeWorkload PASS MPI Operator v0.5.0 installed by setup.sh (server-side apply), launcher affinity + runtimeClassName overrides applied. min_bus_bw_gbps: 0 pinned because the cluster uses T4 + plain GKE pod networking; reviewers on A100/H100 + IB rigs should override this in their provider config
28 K8sGpuStressWorkload PASS PyTorch GPU stress on T4
29 K8sNimInferenceWorkload PASS T4-tuned memory + KV-cache (NIM_MAX_MODEL_LEN)
30 K8sNimHelmWorkload-1b PASS llama-3.2-1b-instruct via Helm
31 K8sNimHelmWorkload-3b PASS llama-3.2-3b-instruct via Helm

Bold rows were the four checks exercised in a targeted re-run after the initial full run (so the suite saw a fully stable cluster with the A100 already attached and MPI Operator installed). The same config/k8s.yaml is used end-to-end; the re-run command was isvctl test run -f config/k8s.yaml -- -k "K8sCncfConformanceCheck or K8sMigConfigCheck or K8sNcclMultiNodeWorkload or K8sApiNetworkAclCheck".


CNCF Conformance Skips

K8sCncfConformanceCheck runs in certified-conformance mode (full [Conformance] suite). 28 test patterns across 15 architectural limitation groups are skipped, all specific to the sync.fromHost.nodes topology used to expose GPU capacity from host nodes into the tenant cluster.

vCluster passes the full conformance suite with zero skips on dedicated nodes with virtualScheduler.enabled — see the official certification at https://github.com/cncf/k8s-conformance/tree/master/v1.35/vcluster-with-shared-nodes. The skips below arise only because sync.fromHost.nodes makes several node properties read-only and rewrites virtual node InternalIPs to pod-CIDR addresses.

# Category Tests Root cause
1 HostPorts + hostPort scheduling conflict 3 vCluster does not map virtual ports to host network interfaces
2 ServiceCIDR / IPAddress API 1 Not yet implemented in vCluster 1.35 virtual control plane
3 SchedulerPreemption (fake extended resources) 4 Synced nodes read-only; extended-resource patches wiped by sync
4 NodePort / EndpointSlice IP mismatch 4 Pod-CIDR InternalIPs differ from VPC IPs where kube-proxy binds
5 PV / PVC sync race 2 Eventual-consistency PV sync between virtual and host control planes
6 RuntimeClass Overhead scheduling 1 Overhead not factored into virtual scheduler's node capacity
7 RuntimeClass without PodOverhead + preconfigured handler 2 No sync.toHost.runtimeClasses; test-created RC not visible on host
8 NodePort session affinity 2 Same NodePort IP mismatch as #4
9 CRD conversion webhook 2 Virtual API server cannot reach webhook pods in tenant namespace
10 DaemonSet complex daemon 1 Node label patches overwritten by sync cycle
11 ServiceAccounts node-bound token 1 Syncer does not pass node binding to token issuer
12 SchedulerPredicates node labels 2 Synced node labels read-only; NodeSelector patches don't persist
13 OIDC issuer discovery 1 Virtual API server issuer URL unreachable from tenant pods
14 500 podspec updates observedGeneration 1 Syncer conflict rate under rapid updates
15 ExternalName DNS 1 CoreDNS CNAME forwarding chain in GKE topology

Each pattern is documented inline in config/k8s.yaml with the exact failing upstream test name.


Key Design Decisions

  • GPU access via device plugin: GKE COS nodes use the NVIDIA device plugin model — no nvidia runtime handler in containerd. GPU pods use nvidia.com/gpu resource limits only; runtimeClassName: nvidia is explicitly omitted throughout workload manifests.
  • Minimal GPU Operator in tenant: When the host GPU Operator is already running (nvidia.com/gpu.present=true), setup.sh creates only the gpu-operator namespace and a pause-image Deployment in the tenant. Installing the full chart would create duplicate DaemonSet pods on host nodes.
  • A100 MIG labeling: GKE Feature Discovery cannot run on GKE COS, so setup.sh auto-detects A100/H100 nodes (via cloud.google.com/gke-accelerator) and applies both nvidia.com/mig.capable=true and nvidia.com/mig.strategy=single directly.
  • MPI Operator bundled, server-side applied: Kubeflow MPI Operator v0.5.0 manifest is committed under manifests/ and applied by setup.sh with kubectl apply --server-side --force-conflicts because the MPIJob CRD's OpenAPI schema exceeds the 256 KiB last-applied-configuration annotation limit of client-side apply.
  • API network ACL via HTTP 401: K8sApiNetworkAclCheck verifies the tenant API rejects unauthenticated callers. The authorized probe uses kubectl with the bound ServiceAccount token; the unauthorized probe is curl -f against https://<LB>/api with no credentials — the vCluster API returns 401 Unauthorized and curl -f exits 22, which the check counts as "blocked".
  • JUnit retrieval resilience: conformance produces a ~2.5 MiB JUnit file; managed-K8s LoadBalancers intermittently reset long-running kubectl exec streams. k8s_conformance.py now retries the cat stream, falls back to kubectl cp (tar framing), and supports an opt-in ISVTEST_CONFORMANCE_JUNIT_LOCAL_PATH env var for providers that pre-stage the file out-of-band.
  • Cloud-stable endpoint: On GKE, setup.sh auto-detects the cloud provider and uses vcluster connect --expose (LoadBalancer) for a stable kubeconfig across long-running test phases.
  • GPU taint handling: setup.sh temporarily removes the nvidia.com/gpu:NoSchedule taint so conformance/workload BeforeSuite treats all virtual nodes as schedulable; teardown.sh restores taints before tenant deletion.
  • Access key as ServiceAccount: "Access key" maps to a Kubernetes ServiceAccount + bound token. Disabling removes the ClusterRoleBinding (token returns 403). Deletion removes the SA and CRB.

Test plan

  • make test — 728 unit tests pass
  • make lint — ruff, yamlfmt, SPDX headers all pass
  • Control-plane suite — 11/11 PASS
  • Kubernetes infrastructure / node / GPU / storage / observability / identity checks — all PASS
  • K8sCncfConformanceCheck (certified-conformance, v1.35.0) — PASS, 419/7353 passed, 0 failed
  • K8sMigConfigCheck — PASS on A100 with nvidia.com/mig.capable + nvidia.com/mig.strategy
  • K8sApiNetworkAclCheck — PASS via HTTP 401 unauthenticated probe
  • K8sNcclMultiNodeWorkload — PASS with bundled MPI Operator, affinity overrides, and T4-appropriate bandwidth floor
  • K8sGpuStressWorkload, K8sNimInferenceWorkload, K8sNimHelmWorkload-1b/-3b — all PASS on T4 nodes

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Full vCluster provider support: control-plane validation, Kubernetes test suite, automated setup/teardown, and CLI helpers for tenant and access-key lifecycle.
    • Workloads: configurable runtime class, memory limits/requests, affinity/launcher scheduling overrides, and improved Helm kubeconfig handling.
    • Conformance runner: multi-stage JUnit retrieval with local/exec/cp fallbacks.
  • Bug Fixes

    • Broadened GPU detection for more reliable hardware identification.
  • Documentation

    • Added vCluster provider README with usage and execution guidance.

Copilot AI review requested due to automatic review settings May 13, 2026 09:46
@saiyam1814 saiyam1814 requested a review from a team as a code owner May 13, 2026 09:46
@copy-pr-bot

copy-pr-bot Bot commented May 13, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces vCluster as a new validated provider for the NVIDIA ISV NCP Validation Suite by adding provider-specific control-plane lifecycle scripts, Kubernetes suite setup/teardown automation, and the corresponding provider configuration YAMLs.

Changes:

  • Added vCluster Kubernetes suite setup/teardown scripts to create/connect a tenant cluster, optionally install GPU Operator, and emit inventory JSON.
  • Added vCluster control-plane suite scripts mapping tenants to vCluster instances and access keys to Kubernetes ServiceAccount tokens.
  • Added vCluster provider configs for k8s and control-plane suites wiring the generic suites to the new scripts.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 10 comments.

Show a summary per file
File Description
isvctl/configs/providers/vcluster/scripts/k8s/setup.sh Creates/connects vCluster tenant, optionally exposes endpoint, optionally installs GPU Operator, then emits inventory via shared _common.sh.
isvctl/configs/providers/vcluster/scripts/k8s/teardown.sh Deletes the vCluster tenant and removes the persisted kubeconfig.
isvctl/configs/providers/vcluster/scripts/control-plane/check_api.py Validates control-plane API reachability/health via kubectl probes.
isvctl/configs/providers/vcluster/scripts/control-plane/create_access_key.py Creates a ServiceAccount + ClusterRoleBinding and generates a bound token.
isvctl/configs/providers/vcluster/scripts/control-plane/test_access_key.py Verifies the token can authenticate to the Kubernetes API.
isvctl/configs/providers/vcluster/scripts/control-plane/disable_access_key.py Disables access by removing the ClusterRoleBinding granting permissions.
isvctl/configs/providers/vcluster/scripts/control-plane/verify_key_rejected.py Verifies disabled credentials are rejected (auth failure expected).
isvctl/configs/providers/vcluster/scripts/control-plane/create_tenant.py Creates a vCluster tenant (vcluster instance) and waits for Running.
isvctl/configs/providers/vcluster/scripts/control-plane/list_tenants.py Lists vCluster tenants and checks for presence of the target tenant.
isvctl/configs/providers/vcluster/scripts/control-plane/get_tenant.py Retrieves a specific tenant’s status by parsing vcluster list JSON.
isvctl/configs/providers/vcluster/scripts/control-plane/delete_access_key.py Deletes ServiceAccount and attempts to delete the associated ClusterRoleBinding.
isvctl/configs/providers/vcluster/scripts/control-plane/delete_tenant.py Deletes the vCluster tenant, treating “not found” as successful teardown.
isvctl/configs/providers/vcluster/config/k8s.yaml Provider config wiring the generic k8s suite to vCluster setup/teardown and tuning conformance check settings.
isvctl/configs/providers/vcluster/config/control-plane.yaml Provider config wiring the generic control-plane suite to vCluster scripts and step args.
Comments suppressed due to low confidence (2)

isvctl/configs/providers/vcluster/scripts/control-plane/create_access_key.py:87

  • Namespace creation is done via sh -c with an f-string that embeds ns directly into the shell command. Since VCLUSTER_NAMESPACE can be user-controlled, this is a command-injection risk. Please avoid sh -c here and run kubectl directly with an argument list (or apply the dry-run YAML via stdin) so the namespace is never interpreted by a shell.
        _run(["kubectl", "create", "namespace", ns, "--dry-run=client", "-o", "yaml"], env)
        _run(
            ["sh", "-c",
             f"kubectl create namespace {ns} --dry-run=client -o yaml | kubectl apply -f -"],
            env,
        )

isvctl/configs/providers/vcluster/scripts/control-plane/disable_access_key.py:83

  • The ClusterRoleBinding deletion result is not checked. If kubectl delete clusterrolebinding fails (RBAC, API error), the script still returns success and marks the key Inactive even though permissions may remain. Please check the return code and fail the step on unexpected deletion errors.
        # Disable the access key by removing the ClusterRoleBinding that grants
        # the ServiceAccount its permissions. The bound token still authenticates
        # but any API call returns 403 Forbidden, which the validation suite
        # treats as "rejected". The SA itself is deleted in delete_access_key.py.
        crb_name = f"{args.username}-view"
        _run(
            ["kubectl", "delete", "clusterrolebinding", crb_name,
             "--ignore-not-found=true"],
            env,
        )

        result["status"] = "Inactive"
        result["success"] = True

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread isvctl/configs/providers/vcluster/scripts/k8s/setup.sh
Comment thread isvctl/configs/providers/vcluster/scripts/k8s/setup.sh Outdated
Comment thread isvctl/configs/providers/vcluster/scripts/k8s/teardown.sh Outdated
Comment thread isvctl/configs/providers/vcluster/scripts/control-plane/create_access_key.py Outdated
Comment thread isvctl/configs/providers/vcluster/scripts/control-plane/disable_access_key.py Outdated
Comment thread isvctl/configs/providers/vcluster/scripts/control-plane/verify_key_rejected.py Outdated
Comment thread isvctl/configs/providers/vcluster/scripts/control-plane/delete_access_key.py Outdated
Comment thread isvctl/configs/providers/vcluster/scripts/control-plane/delete_tenant.py Outdated
Comment thread isvctl/configs/providers/vcluster/config/k8s.yaml
@saiyam1814 saiyam1814 marked this pull request as draft May 13, 2026 10:12
@coderabbitai

coderabbitai Bot commented May 13, 2026

Copy link
Copy Markdown

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds a vCluster provider: README and provider configs, control-plane Python CLIs for tenant and ServiceAccount lifecycle and checks, k8s setup/teardown scripts managing GPU/operator/taints and kubeconfigs, and isvtest workload/validation adaptations for shared-nodes GPU topologies and robust JUnit retrieval.

Changes

vCluster Provider Implementation

Layer / File(s) Summary
Documentation and provider configs
isvctl/configs/providers/vcluster/README.md, isvctl/configs/providers/vcluster/config/*.yaml
Adds provider README; control-plane and k8s YAML configs defining setup/test/teardown phases, environment variables, workload tuning, and large CNCF e2e skip patterns.
Control-plane lifecycle CLIs
isvctl/configs/providers/vcluster/scripts/control-plane/*.py
Adds Python CLIs for API check, create/list/get/delete tenant, create/test/disable/delete ServiceAccount tokens, and verify token rejection. All support DEMO_MODE, kubeconfig selection, JSON outputs and exit codes.
Kubernetes provisioning scripts
isvctl/configs/providers/vcluster/scripts/k8s/setup.sh, isvctl/configs/providers/vcluster/scripts/k8s/teardown.sh
Adds setup.sh to create/connect vCluster (sync.fromHost.nodes), detect/label GPUs, manage RuntimeClass, conditionally install GPU Operator, run preflight, capture LB IP and emit inventory JSON; teardown restores taints, deletes vCluster, and cleans kubeconfig.
isvtest workload & validation updates
isvtest/src/isvtest/*
Multiple isvtest changes: more flexible GPU counting, multi-stage JUnit retrieval, regex-driven MPI/Job manifest patching, runtimeClassName and memory/runtime overrides for NCCL/NIM/stress workloads, and Helm kubeconfig extraction/overrides.

Estimated code review effort:
🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers:

  • abegnoche
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 36.96% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and concisely summarizes the primary change: adding vCluster provider support for both k8s and control-plane validation suites.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@saiyam1814 saiyam1814 force-pushed the add-vcluster-provider branch 2 times, most recently from e7a440a to c880f79 Compare May 18, 2026 10:14
@saiyam1814 saiyam1814 marked this pull request as ready for review May 18, 2026 10:15
@saiyam1814 saiyam1814 force-pushed the add-vcluster-provider branch from c880f79 to 5dea77b Compare May 18, 2026 10:33
@saiyam1814

Copy link
Copy Markdown
Author

Thanks for the review! Addressed in the latest force-push (5dea77b):

  • SPDX headers (3 comments): aligned all vCluster provider files to LicenseRef-NvidiaProprietary to match the rest of the repo's provider configs/scripts. License consistency restored.
  • setup.sh python3 tool check: python3 is now included in the preflight for tool in vcluster kubectl helm jq python3 loop, so a missing interpreter fails fast with a clear error.
  • teardown.sh vcluster delete error handling: captures the exit code via || DELETE_RC=$?, treats only an explicit not-found message as success, and exit 1 on any other failure — so auth/RBAC/transient errors surface instead of being swallowed.
  • create_access_key.py namespace creation: rc_ns and rc_create are both checked and surfaced as result["error"] on failure.
  • delete_access_key.py CRB cleanup: rc_crb is checked and the script returns 1 with a clear error if the ClusterRoleBinding delete fails, so stale cluster-wide bindings can't be left behind silently.
  • delete_tenant.py not-found detection: now checks (stdout + stderr).lower() so the vCluster CLI's version-dependent message routing doesn't fail teardown of an already-absent tenant.
  • disable_access_key.py docstring: rewritten to accurately describe the ClusterRoleBinding-delete strategy and the resulting 403 Forbidden behavior (the previous docstring still referenced the older token-deletion approach).
  • verify_key_rejected.py strict auth-error check: only Unauthorized/Forbidden/401/403 in stderr now counts as a successful rejection. Any other non-zero kubectl exit (TLS, connection, API down) is surfaced as a real verification failure instead of being treated as a token-rejection success, so outages can't masquerade as a passing check.

All changes are reviewer feedback only — no behavioral change on the happy path that produced the 31/31 PASS in the description. Force-pushed as a single signed-off commit on the same add-vcluster-provider branch.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@isvctl/configs/providers/vcluster/scripts/control-plane/list_tenants.py`:
- Around line 34-47: Add short PEP 257 docstrings to the three functions:
_kubeconfig_env, _run, and main. For each function add a one- or two-line
docstring summarizing its purpose (e.g., "_kubeconfig_env: return env with
VCLUSTER_HOST_KUBECONFIG/KUBECONFIG applied", "_run: execute cmd with env and
return (rc, stdout, stderr)", "main: script entrypoint that lists tenants and
returns exit code"), and include a brief mention of return values/types where
helpful; place the triple-quoted string immediately under each def to satisfy
the repo's docstring requirement.
- Around line 78-80: The JSON error currently embeds raw CLI stderr
(result["error"] = f"vcluster list failed: {stderr}"); change this to emit a
concise generic message in the JSON (e.g., result["error"] = "vcluster list
failed") and move the raw stderr output to stderr/logging only (for example,
print(stderr, file=sys.stderr) or use the module logger). Update the rc != 0
branch that sets result and calls print(json.dumps(...)) to stop including the
raw stderr string, and ensure sys (or the logger) is imported and used to record
the detailed CLI output separately.

In `@isvctl/configs/providers/vcluster/scripts/k8s/setup.sh`:
- Around line 437-453: The temp file created in the NGC_API_KEY branch
(_HELM_VALUES_FILE via mktemp) isn’t guaranteed to be removed if the subsequent
KUBECONFIG... helm "${HELM_ARGS[@]}" call fails; add a cleanup trap immediately
after creating _HELM_VALUES_FILE so the file is removed on EXIT or ERR (and
unset the trap after successful cleanup) to ensure the secret never remains on
disk; locate the mktemp/chmod/cat block that sets _HELM_VALUES_FILE and add a
trap that removes "$_HELM_VALUES_FILE" and clears itself, leaving the existing
conditional rm -f as a fallback after the helm invocation.

In `@isvtest/src/isvtest/workloads/k8s_nim_helm.py`:
- Around line 37-52: The kubeconfig parsing and shell interpolation are brittle:
replace the manual string splitting in _get_kubeconfig_from_kubectl() with
shlex.split(os.environ.get("KUBECTL", "")) to robustly handle spaces/quotes and
then search that token list for "--kubeconfig" or "--kubeconfig=" to return the
raw path; then in _dump_helm_status() and _cleanup_helm() build the kube flag
using shlex.quote(kubeconfig) (e.g. "--kubeconfig=" + shlex.quote(kubeconfig) or
f"--kubeconfig={shlex.quote(kubeconfig)}") when interpolating into shell
commands so paths with spaces/special chars are safely quoted.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 49725770-ca5a-4d86-92ff-f39782ec9656

📥 Commits

Reviewing files that changed from the base of the PR and between c880f79 and 5dea77b.

📒 Files selected for processing (22)
  • isvctl/configs/providers/vcluster/README.md
  • isvctl/configs/providers/vcluster/config/control-plane.yaml
  • isvctl/configs/providers/vcluster/config/k8s.yaml
  • isvctl/configs/providers/vcluster/manifests/mpi-operator-v0.5.0.yaml
  • isvctl/configs/providers/vcluster/scripts/control-plane/check_api.py
  • isvctl/configs/providers/vcluster/scripts/control-plane/create_access_key.py
  • isvctl/configs/providers/vcluster/scripts/control-plane/create_tenant.py
  • isvctl/configs/providers/vcluster/scripts/control-plane/delete_access_key.py
  • isvctl/configs/providers/vcluster/scripts/control-plane/delete_tenant.py
  • isvctl/configs/providers/vcluster/scripts/control-plane/disable_access_key.py
  • isvctl/configs/providers/vcluster/scripts/control-plane/get_tenant.py
  • isvctl/configs/providers/vcluster/scripts/control-plane/list_tenants.py
  • isvctl/configs/providers/vcluster/scripts/control-plane/test_access_key.py
  • isvctl/configs/providers/vcluster/scripts/control-plane/verify_key_rejected.py
  • isvctl/configs/providers/vcluster/scripts/k8s/setup.sh
  • isvctl/configs/providers/vcluster/scripts/k8s/teardown.sh
  • isvtest/src/isvtest/core/nvidia.py
  • isvtest/src/isvtest/validations/k8s_conformance.py
  • isvtest/src/isvtest/workloads/k8s_nccl_multinode.py
  • isvtest/src/isvtest/workloads/k8s_nim.py
  • isvtest/src/isvtest/workloads/k8s_nim_helm.py
  • isvtest/src/isvtest/workloads/k8s_stress.py
✅ Files skipped from review due to trivial changes (1)
  • isvctl/configs/providers/vcluster/README.md
🚧 Files skipped from review as they are similar to previous changes (16)
  • isvtest/src/isvtest/core/nvidia.py
  • isvctl/configs/providers/vcluster/scripts/control-plane/test_access_key.py
  • isvtest/src/isvtest/validations/k8s_conformance.py
  • isvctl/configs/providers/vcluster/scripts/control-plane/get_tenant.py
  • isvtest/src/isvtest/workloads/k8s_nim.py
  • isvctl/configs/providers/vcluster/scripts/control-plane/delete_access_key.py
  • isvctl/configs/providers/vcluster/scripts/k8s/teardown.sh
  • isvctl/configs/providers/vcluster/config/control-plane.yaml
  • isvtest/src/isvtest/workloads/k8s_nccl_multinode.py
  • isvctl/configs/providers/vcluster/scripts/control-plane/verify_key_rejected.py
  • isvctl/configs/providers/vcluster/scripts/control-plane/disable_access_key.py
  • isvctl/configs/providers/vcluster/scripts/control-plane/create_tenant.py
  • isvctl/configs/providers/vcluster/scripts/control-plane/delete_tenant.py
  • isvctl/configs/providers/vcluster/scripts/control-plane/create_access_key.py
  • isvctl/configs/providers/vcluster/scripts/control-plane/check_api.py
  • isvctl/configs/providers/vcluster/config/k8s.yaml

Comment thread isvctl/configs/providers/vcluster/scripts/k8s/setup.sh Outdated
Comment thread isvtest/src/isvtest/workloads/k8s_nim_helm.py
Adds vCluster (by vCluster Labs) as a validated CaaS provider in the
NVIDIA ISV NCP Validation Suite. vCluster is an open-source project
that provisions isolated tenant clusters on top of any existing
Kubernetes cluster (the Control Plane Cluster). Each tenant cluster
has its own virtual control plane while sharing the host cluster's
GPU nodes via sync.fromHost.nodes. vCluster is CNCF-certified for
Kubernetes 1.28-1.35 in three configurations; this provider validates
the shared-nodes topology, which is the one that enables GPU workloads
in tenant clusters to land on the Control Plane Cluster's physical
GPU nodes.

Provider files (isvctl/configs/providers/vcluster/):
- config/k8s.yaml: full k8s suite wiring with CNCF conformance skip
  patterns, GKE COS GPU overrides, A100 MIG labeling, NCCL multi-node
  configuration, and HTTP-401 API network ACL probe
- config/control-plane.yaml: control-plane suite wiring
- scripts/k8s/setup.sh: tenant cluster lifecycle, GPU Operator handling
  (lightweight pause Deployment when the host operator is already
  running), MPI Operator install via server-side apply, A100/H100 MIG
  label seeding, GKE LoadBalancer exposure
- scripts/k8s/teardown.sh: vCluster delete + taint restoration
- scripts/control-plane/: access key (ServiceAccount + bound token) and
  tenant management scripts
- manifests/mpi-operator-v0.5.0.yaml: bundled Kubeflow MPI Operator for
  K8sNcclMultiNodeWorkload (avoids runtime external fetches)
- README.md: provider documentation

Workload / framework improvements (apply across providers):
- workloads/k8s_nim.py: runtime_class_name, nim_memory_request,
  nim_memory_limit config params for non-runtimeClass GPU runtimes
  and tight-memory T4 nodes
- workloads/k8s_nim_helm.py: --kubeconfig targeting so helm installs
  land in the intended tenant; memory + runtimeClassName params;
  NIM_MAX_MODEL_LEN env via --set-string to avoid Helm casting it to
  an int (which K8s rejects on env.value)
- workloads/k8s_nccl_multinode.py: override_launcher_affinity and
  runtime_class_name options so the MPI launcher can schedule on
  tenants without node-role.kubernetes.io/control-plane and the
  workers don't require an "nvidia" runtime handler
- workloads/k8s_stress.py: runtime_class_name config param
- validations/k8s_conformance.py: robust JUnit retrieval - retry the
  exec cat stream on transient resets, fall back to kubectl cp (tar
  framing), and honor an opt-in ISVTEST_CONFORMANCE_JUNIT_LOCAL_PATH
  env var for providers behind managed-K8s LBs that intermittently
  reset multi-megabyte streams
- core/nvidia.py: relax GPU row regex so nvidia-smi rows for GPUs
  whose marketing names don't match the original pattern still parse

Validation result on GKE (vcluster-isv-gpu-test, k8s v1.35.3-gke.2190000;
2x n1-standard-4 CPU + 2x n1-standard-4 T4 + 1x a2-highgpu-1g A100;
vCluster 0.34.0 with sync.fromHost.nodes):

- make test / make lint: PASS (728 unit tests, ruff, yamlfmt, SPDX)
- Control-plane suite: 11/11 PASS
- Kubernetes suite: 31/31 PASS, including:
  - K8sCncfConformanceCheck (certified-conformance, v1.35.0):
    419/7353 passed, 0 failed, 0 errors, 6934 skipped (the 6934 are
    tests outside the [Conformance] focus plus the 28-pattern provider
    skip list, each pattern documented inline in config/k8s.yaml)
  - K8sMigConfigCheck: PASS on A100 with nvidia.com/mig.capable=true
    and nvidia.com/mig.strategy=single (GFD cannot run on GKE COS, so
    setup.sh seeds the labels)
  - K8sApiNetworkAclCheck: PASS - authorized kubectl probe served,
    unauthorized curl -f gets 401 (exit 22) -> ACL enforced at the
    protocol layer
  - K8sNcclMultiNodeWorkload: PASS via bundled MPI Operator v0.5.0;
    min_bus_bw_gbps pinned to 0 because this cluster uses T4 + plain
    GKE pod networking - reviewers on A100/H100 + IB rigs should
    override this in their provider config
  - K8sGpuStressWorkload, K8sNimInferenceWorkload, K8sNimHelmWorkload-1b,
    K8sNimHelmWorkload-3b: PASS on T4 nodes

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Saiyam Pathak <saiyam911@gmail.com>
@saiyam1814 saiyam1814 force-pushed the add-vcluster-provider branch from 5dea77b to 27e97f3 Compare May 18, 2026 11:21
@saiyam1814

Copy link
Copy Markdown
Author

Review pass complete on 27e97f3 — all 14 review threads (10 Copilot + 4 CodeRabbit) addressed and resolved:

Copilot threads:

  • SPDX headers (×3) — all 14 vCluster files now LicenseRef-NvidiaProprietary
  • setup.sh python3 tooling check — present
  • teardown.sh vcluster delete exit-code handling — proper rc capture + explicit not-found gate
  • create_access_key.py namespace rc — both rc_ns and rc_create checked and surfaced
  • disable_access_key.py docstring — rewritten to match the CRB-delete strategy (403)
  • verify_key_rejected.py strict 401/403 — non-auth errors now fail the check instead of being treated as rejection
  • delete_access_key.py CRB rc — surfaced on failure
  • delete_tenant.py not-found detection — checks both stdout and stderr

CodeRabbit threads:

  • list_tenants.py PEP 257 docstrings on _kubeconfig_env, _run, main — added
  • list_tenants.py raw stderr in JSON contract — generic message in JSON, raw diagnostics to stderr
  • setup.sh NGC_API_KEY values file — now cleaned via EXIT trap so a failing helm can't leak the secret on disk
  • k8s_nim_helm.py shell injection — KUBECTL parsing uses shlex.split, kubeconfig path interpolation uses shlex.quote

No behavioral change on the happy path that produced the 31/31 PASS in the description — these are review-quality fixes only.

Ready for human reviewer. cc @NVIDIA/ncp-isv-lab-maintainer

@abegnoche

Copy link
Copy Markdown
Member

Thanks for the contribution @saiyam1814, we are currently discussing internally how we will handle additional providers and we hope to get back to you as soon as possible.

@abegnoche

Copy link
Copy Markdown
Member

/ok to test 27e97f3

@github-actions

Copy link
Copy Markdown

🔐 TruffleHog Secret Scan

No secrets or credentials found!

Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉

🔗 View scan details

🕐 Last updated: 2026-05-28 16:24:14 UTC | Commit: 27e97f3

@saiyam1814

Copy link
Copy Markdown
Author

Thanks for the contribution @saiyam1814, we are currently discussing internally how we will handle additional providers and we hope to get back to you as soon as possible.

thank you for checking, let me knownif anything else is required.

The check-spdx-headers pre-commit hook (enforced on main) requires an
SPDX header on every isvctl/** YAML. The bundled Kubeflow MPI Operator
manifest predated that hook and was the only file failing the Pre-commit
Checks job, which in turn failed the aggregate Pipeline Status gate.
Header added via scripts/add_spdx_headers.py so it matches every other
provider manifest in the repo.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Saiyam Pathak <saiyam911@gmail.com>
@saiyam1814

Copy link
Copy Markdown
Author

@abegnoche their were some merges to the branch and the the bundled Kubeflow MPI Operator manifest was missing the SPDX header now enforced by the check-spdx-headers hook. Added it via scripts/add_spdx_headers.py so it matches the other provider manifests; DCO is signed off. @abegnoche could you /ok to test 8762f84 when you get a chance? Thanks!

@abegnoche

Copy link
Copy Markdown
Member

/ok to test 8762f84

Signed-off-by: Saiyam Pathak <saiyam911@gmail.com>
Assisted-by: Claude (Anthropic)
@saiyam1814

Copy link
Copy Markdown
Author

/ok to test 4c4e3d3

1 similar comment
@abegnoche

Copy link
Copy Markdown
Member

/ok to test 4c4e3d3

Signed-off-by: Saiyam Pathak <saiyam911@gmail.com>
Assisted-by: Claude (Anthropic)
Signed-off-by: Saiyam Pathak <saiyam911@gmail.com>
Assisted-by: Claude (Anthropic)
@saiyam1814

Copy link
Copy Markdown
Author

@abegnoche thank you for your patience Ready for /ok to test 7065798

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
isvctl/configs/providers/vcluster/scripts/control-plane/delete_tenant.py (1)

77-77: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Make resources_deleted format consistent with real mode.

Demo mode returns [args.group_name] while line 107 returns [f"vcluster/{args.group_name}"]. For consistency and to avoid confusion when parsing output, use the same format.

🔧 Suggested fix
-        result["resources_deleted"] = [args.group_name]
+        result["resources_deleted"] = [f"vcluster/{args.group_name}"]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@isvctl/configs/providers/vcluster/scripts/control-plane/delete_tenant.py` at
line 77, The demo-mode assignment to result["resources_deleted"] currently uses
[args.group_name] which is inconsistent with real-mode output; change the demo
branch to set result["resources_deleted"] = [f"vcluster/{args.group_name}"] so
both demo and real modes return the same "vcluster/{group}" formatted resource
identifier (refer to the result dict and args.group_name usage in this
function).
🧹 Nitpick comments (1)
isvctl/configs/providers/vcluster/scripts/control-plane/delete_tenant.py (1)

46-119: ⚡ Quick win

Add docstrings to all functions.

The coding guidelines require docstrings for every function (PEP 257). Consider adding concise docstrings to _kubeconfig_env, _run, and main to improve maintainability and follow the project's documentation standards.

📝 Example docstrings
def _kubeconfig_env() -> dict[str, str]:
    """Return environment with KUBECONFIG set to VCLUSTER_HOST_KUBECONFIG if present."""
    ...

def _run(cmd: list[str], env: dict[str, str]) -> tuple[int, str, str]:
    """Run command and return (returncode, stdout, stderr)."""
    ...

def main() -> int:
    """Delete a vCluster tenant and return exit code (0=success, 1=failure)."""
    ...
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@isvctl/configs/providers/vcluster/scripts/control-plane/delete_tenant.py`
around lines 46 - 119, Add concise PEP 257 docstrings to each top-level
function: _kubeconfig_env should document that it returns a copy of the current
environment with KUBECONFIG set to VCLUSTER_HOST_KUBECONFIG if present; _run
should document that it executes the given command with the provided env and
returns (returncode, stdout, stderr); and main should document that it parses
args, deletes the vcluster tenant (honouring DEMO_MODE), prints a JSON result
and returns the exit code. Place these short docstrings immediately below each
def line for _kubeconfig_env, _run, and main (triple-quoted, one- or two-line
descriptions). Ensure wording is concise and follows PEP 257 conventions.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@isvctl/configs/providers/vcluster/scripts/control-plane/delete_tenant.py`:
- Line 77: The demo-mode assignment to result["resources_deleted"] currently
uses [args.group_name] which is inconsistent with real-mode output; change the
demo branch to set result["resources_deleted"] = [f"vcluster/{args.group_name}"]
so both demo and real modes return the same "vcluster/{group}" formatted
resource identifier (refer to the result dict and args.group_name usage in this
function).

---

Nitpick comments:
In `@isvctl/configs/providers/vcluster/scripts/control-plane/delete_tenant.py`:
- Around line 46-119: Add concise PEP 257 docstrings to each top-level function:
_kubeconfig_env should document that it returns a copy of the current
environment with KUBECONFIG set to VCLUSTER_HOST_KUBECONFIG if present; _run
should document that it executes the given command with the provided env and
returns (returncode, stdout, stderr); and main should document that it parses
args, deletes the vcluster tenant (honouring DEMO_MODE), prints a JSON result
and returns the exit code. Place these short docstrings immediately below each
def line for _kubeconfig_env, _run, and main (triple-quoted, one- or two-line
descriptions). Ensure wording is concise and follows PEP 257 conventions.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 36f0738e-1dea-4304-a0fd-131fb9d00ecc

📥 Commits

Reviewing files that changed from the base of the PR and between 4c4e3d3 and 7065798.

📒 Files selected for processing (15)
  • isvctl/configs/providers/vcluster/config/control-plane.yaml
  • isvctl/configs/providers/vcluster/config/k8s.yaml
  • isvctl/configs/providers/vcluster/manifests/mpi-operator-v0.5.0.yaml
  • isvctl/configs/providers/vcluster/scripts/control-plane/check_api.py
  • isvctl/configs/providers/vcluster/scripts/control-plane/create_access_key.py
  • isvctl/configs/providers/vcluster/scripts/control-plane/create_tenant.py
  • isvctl/configs/providers/vcluster/scripts/control-plane/delete_access_key.py
  • isvctl/configs/providers/vcluster/scripts/control-plane/delete_tenant.py
  • isvctl/configs/providers/vcluster/scripts/control-plane/disable_access_key.py
  • isvctl/configs/providers/vcluster/scripts/control-plane/get_tenant.py
  • isvctl/configs/providers/vcluster/scripts/control-plane/list_tenants.py
  • isvctl/configs/providers/vcluster/scripts/control-plane/test_access_key.py
  • isvctl/configs/providers/vcluster/scripts/control-plane/verify_key_rejected.py
  • isvctl/configs/providers/vcluster/scripts/k8s/setup.sh
  • isvctl/configs/providers/vcluster/scripts/k8s/teardown.sh
🚧 Files skipped from review as they are similar to previous changes (13)
  • isvctl/configs/providers/vcluster/scripts/k8s/teardown.sh
  • isvctl/configs/providers/vcluster/scripts/control-plane/check_api.py
  • isvctl/configs/providers/vcluster/scripts/control-plane/create_access_key.py
  • isvctl/configs/providers/vcluster/config/control-plane.yaml
  • isvctl/configs/providers/vcluster/config/k8s.yaml
  • isvctl/configs/providers/vcluster/scripts/control-plane/get_tenant.py
  • isvctl/configs/providers/vcluster/scripts/control-plane/disable_access_key.py
  • isvctl/configs/providers/vcluster/scripts/control-plane/create_tenant.py
  • isvctl/configs/providers/vcluster/scripts/control-plane/test_access_key.py
  • isvctl/configs/providers/vcluster/scripts/control-plane/list_tenants.py
  • isvctl/configs/providers/vcluster/scripts/control-plane/verify_key_rejected.py
  • isvctl/configs/providers/vcluster/scripts/control-plane/delete_access_key.py
  • isvctl/configs/providers/vcluster/scripts/k8s/setup.sh

@abegnoche

Copy link
Copy Markdown
Member

/ok to test 7065798

@saiyam1814

Copy link
Copy Markdown
Author

Awesome we have all checks passed !! Let me know if anything else is needed form my end @abegnoche

@saiyam1814

Copy link
Copy Markdown
Author

Any updates on this @abegnoche

@abegnoche

abegnoche commented Jun 29, 2026

Copy link
Copy Markdown
Member

sorry @saiyam1814 forgot to update, it was decided internally to accept new providers (all code under isvctl/configs/providers/vcluster/**) only when we release version 1.0.0 this fall. All other code could be merged in a separate PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants