feat: add vCluster provider for k8s and control-plane suites by saiyam1814 · Pull Request #414 · NVIDIA/ai-cloud-validation

saiyam1814 · 2026-05-13T09:46:12Z

Summary

Adds vCluster (by vCluster Labs) as a validated CaaS provider in the NVIDIA ISV NCP Validation Suite.

vCluster is an open-source project that provisions isolated tenant clusters on top of any existing Kubernetes cluster (the Control Plane Cluster). Each tenant cluster has its own virtual control plane (API server, scheduler, controller manager) while sharing the host cluster's GPU nodes. vCluster is CNCF-certified for Kubernetes 1.28-1.35 in three configurations: vcluster-standalone, vcluster-with-private-nodes, and vcluster-with-shared-nodes. This PR validates the shared-nodes topology (sync.fromHost.nodes) which is the configuration that enables GPU workloads in tenant clusters to schedule onto the Control Plane Cluster's physical GPU nodes.

Provider files (isvctl/configs/providers/vcluster/):

config/k8s.yaml — full k8s suite wiring with CNCF conformance skip patterns, GKE COS GPU overrides, A100 MIG labeling, NCCL multi-node configuration, and API network ACL probe
config/control-plane.yaml — control-plane suite wiring
scripts/k8s/setup.sh — tenant cluster lifecycle, GPU Operator handling, MPI Operator install (server-side apply), A100/H100 MIG label seeding
scripts/k8s/teardown.sh — tenant cluster teardown with taint restoration
scripts/control-plane/ — access key and tenant management scripts
manifests/mpi-operator-v0.5.0.yaml — bundled Kubeflow MPI Operator manifest for NCCL multi-node workloads
README.md — provider documentation

Workload / framework improvements (apply across providers):

isvtest/src/isvtest/workloads/k8s_nim.py — runtime_class_name, nim_memory_request, nim_memory_limit config params
isvtest/src/isvtest/workloads/k8s_nim_helm.py — --kubeconfig targeting, memory and runtimeClassName params, NIM_MAX_MODEL_LEN env via --set-string
isvtest/src/isvtest/workloads/k8s_nccl_multinode.py — override_launcher_affinity and runtime_class_name for non-control-plane MPI launchers
isvtest/src/isvtest/workloads/k8s_stress.py — runtime_class_name config param
isvtest/src/isvtest/validations/k8s_conformance.py — robust JUnit retrieval: retry on transient stream resets, kubectl cp fallback, and an opt-in pre-staged local file path for providers behind managed-K8s LBs
isvtest/src/isvtest/core/nvidia.py — GPU row regex fix for non-standard GPU names

Test Infrastructure

Control Plane Cluster: GKE vcluster-isv-gpu-test, us-central1-c, Kubernetes v1.35.3-gke.2190000

Node pool	Nodes	Machine	GPU	Purpose
`default-pool`	2	n1-standard-4	—	CPU workloads / control plane
`gpu-pool`	2	n1-standard-4	1× NVIDIA T4 (16 GB)	GPU workloads (NIM, stress, NCCL)
`gpu-a100-pool`	1	a2-highgpu-1g	1× NVIDIA A100 (40 GB)	MIG configuration check

Tenant cluster: vcluster-isv-validation in namespace vcluster-isv-validation, vCluster 0.34.0, configured with sync.fromHost.nodes to expose host GPU capacity into the tenant.

Test Results

`make test` / `make lint` — PASS

728 unit tests pass; ruff, yamlfmt, and SPDX header checks all pass.

Control-plane suite — 11 / 11 PASS

All 11 control-plane checks pass: FieldExistsCheck, FieldValueCheck, AccessKeyCreatedCheck, TenantCreatedCheck, AccessKeyAuthenticatedCheck, AccessKeyDisabledCheck, AccessKeyRejectedCheck, TenantListedCheck, TenantInfoCheck, StepSuccessCheck-delete_access_key, StepSuccessCheck-delete_tenant.

Kubernetes suite — 31 / 31 PASS

#	Check	Result	Notes
1	K8sNodeCountCheck	PASS
2	K8sExpectedNodesCheck	PASS
3	K8sNodeReadyCheck	PASS
4	K8sNvidiaSmiCheck	PASS	`nvidia-smi` on synced GPU nodes
5	K8sDriverVersionCheck	PASS	`driver_version: ""` — GKE manages drivers natively
6	K8sGpuPodAccessCheck	PASS	GPU pod scheduled via device plugin
7	K8sGpuCapacityCheck	PASS	T4 nodes + A100 node
8	K8sGpuOperatorNamespaceCheck	PASS	`gpu-operator` namespace in tenant
9	K8sGpuOperatorPodsCheck	PASS	pause Deployment satisfies the check (host GPU Operator already running)
10	K8sGpuLabelsCheck	PASS	`nvidia.com/gpu.present=true` on synced nodes
11	K8sPodHealthCheck	PASS
12	K8sNoPendingPodsCheck	PASS
13	K8sNoErrorPodsCheck	PASS
14	K8sMigConfigCheck	PASS	A100 node auto-labeled with `nvidia.com/mig.capable=true` and `nvidia.com/mig.strategy=single` in `setup.sh` (GFD cannot run on GKE COS)
15	K8sDualStackNodeCheck	PASS	Single-stack IPv4
16	K8sNetworkPolicyCheck	PASS	NetworkPolicies synced to host CNI
17	K8sCsiStorageTypesCheck	PASS
18	K8sCsiStorageQuotaApiCheck	PASS
19	K8sCsiTenantScopedCredentialsCheck	PASS
20	K8sCsiProvisioningModesCheck	PASS
21	K8sOidcIssuerCheck	PASS
22	K8sApiServerMetricsCheck	PASS
23	K8sControlPlaneLogsCheck	PASS	syncer logs via `kubectl` in vCluster namespace
24	K8sApiNetworkAclCheck	PASS	API rejects an unauthenticated `curl -f` with HTTP 401 (probe exits 22 → ACL enforced); authorized probe via `kubectl` succeeds
25	K8sCncfConformanceCheck	PASS	`certified-conformance` mode, `v1.35.0`: 419 / 7353 passed, 0 failed, 0 errors, 6934 skipped (the 6934 are tests outside the `[Conformance]` focus plus the 28-pattern provider skip list documented below)
26	K8sNcclWorkload	PASS	Single-node NCCL
27	K8sNcclMultiNodeWorkload	PASS	MPI Operator v0.5.0 installed by `setup.sh` (server-side apply), launcher affinity + `runtimeClassName` overrides applied. `min_bus_bw_gbps: 0` pinned because the cluster uses T4 + plain GKE pod networking; reviewers on A100/H100 + IB rigs should override this in their provider config
28	K8sGpuStressWorkload	PASS	PyTorch GPU stress on T4
29	K8sNimInferenceWorkload	PASS	T4-tuned memory + KV-cache (`NIM_MAX_MODEL_LEN`)
30	K8sNimHelmWorkload-1b	PASS	`llama-3.2-1b-instruct` via Helm
31	K8sNimHelmWorkload-3b	PASS	`llama-3.2-3b-instruct` via Helm

Bold rows were the four checks exercised in a targeted re-run after the initial full run (so the suite saw a fully stable cluster with the A100 already attached and MPI Operator installed). The same config/k8s.yaml is used end-to-end; the re-run command was isvctl test run -f config/k8s.yaml -- -k "K8sCncfConformanceCheck or K8sMigConfigCheck or K8sNcclMultiNodeWorkload or K8sApiNetworkAclCheck".

CNCF Conformance Skips

K8sCncfConformanceCheck runs in certified-conformance mode (full [Conformance] suite). 28 test patterns across 15 architectural limitation groups are skipped, all specific to the sync.fromHost.nodes topology used to expose GPU capacity from host nodes into the tenant cluster.

vCluster passes the full conformance suite with zero skips on dedicated nodes with virtualScheduler.enabled — see the official certification at https://github.com/cncf/k8s-conformance/tree/master/v1.35/vcluster-with-shared-nodes. The skips below arise only because sync.fromHost.nodes makes several node properties read-only and rewrites virtual node InternalIPs to pod-CIDR addresses.

#	Category	Tests	Root cause
1	HostPorts + hostPort scheduling conflict	3	vCluster does not map virtual ports to host network interfaces
2	ServiceCIDR / IPAddress API	1	Not yet implemented in vCluster 1.35 virtual control plane
3	SchedulerPreemption (fake extended resources)	4	Synced nodes read-only; extended-resource patches wiped by sync
4	NodePort / EndpointSlice IP mismatch	4	Pod-CIDR InternalIPs differ from VPC IPs where kube-proxy binds
5	PV / PVC sync race	2	Eventual-consistency PV sync between virtual and host control planes
6	RuntimeClass Overhead scheduling	1	Overhead not factored into virtual scheduler's node capacity
7	RuntimeClass without PodOverhead + preconfigured handler	2	No `sync.toHost.runtimeClasses`; test-created RC not visible on host
8	NodePort session affinity	2	Same NodePort IP mismatch as #4
9	CRD conversion webhook	2	Virtual API server cannot reach webhook pods in tenant namespace
10	DaemonSet complex daemon	1	Node label patches overwritten by sync cycle
11	ServiceAccounts node-bound token	1	Syncer does not pass node binding to token issuer
12	SchedulerPredicates node labels	2	Synced node labels read-only; NodeSelector patches don't persist
13	OIDC issuer discovery	1	Virtual API server issuer URL unreachable from tenant pods
14	500 podspec updates observedGeneration	1	Syncer conflict rate under rapid updates
15	ExternalName DNS	1	CoreDNS CNAME forwarding chain in GKE topology

Each pattern is documented inline in config/k8s.yaml with the exact failing upstream test name.

Key Design Decisions

GPU access via device plugin: GKE COS nodes use the NVIDIA device plugin model — no nvidia runtime handler in containerd. GPU pods use nvidia.com/gpu resource limits only; runtimeClassName: nvidia is explicitly omitted throughout workload manifests.
Minimal GPU Operator in tenant: When the host GPU Operator is already running (nvidia.com/gpu.present=true), setup.sh creates only the gpu-operator namespace and a pause-image Deployment in the tenant. Installing the full chart would create duplicate DaemonSet pods on host nodes.
A100 MIG labeling: GKE Feature Discovery cannot run on GKE COS, so setup.sh auto-detects A100/H100 nodes (via cloud.google.com/gke-accelerator) and applies both nvidia.com/mig.capable=true and nvidia.com/mig.strategy=single directly.
MPI Operator bundled, server-side applied: Kubeflow MPI Operator v0.5.0 manifest is committed under manifests/ and applied by setup.sh with kubectl apply --server-side --force-conflicts because the MPIJob CRD's OpenAPI schema exceeds the 256 KiB last-applied-configuration annotation limit of client-side apply.
API network ACL via HTTP 401: K8sApiNetworkAclCheck verifies the tenant API rejects unauthenticated callers. The authorized probe uses kubectl with the bound ServiceAccount token; the unauthorized probe is curl -f against https://<LB>/api with no credentials — the vCluster API returns 401 Unauthorized and curl -f exits 22, which the check counts as "blocked".
JUnit retrieval resilience: conformance produces a ~2.5 MiB JUnit file; managed-K8s LoadBalancers intermittently reset long-running kubectl exec streams. k8s_conformance.py now retries the cat stream, falls back to kubectl cp (tar framing), and supports an opt-in ISVTEST_CONFORMANCE_JUNIT_LOCAL_PATH env var for providers that pre-stage the file out-of-band.
Cloud-stable endpoint: On GKE, setup.sh auto-detects the cloud provider and uses vcluster connect --expose (LoadBalancer) for a stable kubeconfig across long-running test phases.
GPU taint handling: setup.sh temporarily removes the nvidia.com/gpu:NoSchedule taint so conformance/workload BeforeSuite treats all virtual nodes as schedulable; teardown.sh restores taints before tenant deletion.
Access key as ServiceAccount: "Access key" maps to a Kubernetes ServiceAccount + bound token. Disabling removes the ClusterRoleBinding (token returns 403). Deletion removes the SA and CRB.

Test plan

make test — 728 unit tests pass
make lint — ruff, yamlfmt, SPDX headers all pass
Control-plane suite — 11/11 PASS
Kubernetes infrastructure / node / GPU / storage / observability / identity checks — all PASS
K8sCncfConformanceCheck (certified-conformance, v1.35.0) — PASS, 419/7353 passed, 0 failed
K8sMigConfigCheck — PASS on A100 with nvidia.com/mig.capable + nvidia.com/mig.strategy
K8sApiNetworkAclCheck — PASS via HTTP 401 unauthenticated probe
K8sNcclMultiNodeWorkload — PASS with bundled MPI Operator, affinity overrides, and T4-appropriate bandwidth floor
K8sGpuStressWorkload, K8sNimInferenceWorkload, K8sNimHelmWorkload-1b/-3b — all PASS on T4 nodes

🤖 Generated with Claude Code

Summary by CodeRabbit

New Features
- Full vCluster provider support: control-plane validation, Kubernetes test suite, automated setup/teardown, and CLI helpers for tenant and access-key lifecycle.
- Workloads: configurable runtime class, memory limits/requests, affinity/launcher scheduling overrides, and improved Helm kubeconfig handling.
- Conformance runner: multi-stage JUnit retrieval with local/exec/cp fallbacks.
Bug Fixes
- Broadened GPU detection for more reliable hardware identification.
Documentation
- Added vCluster provider README with usage and execution guidance.

copy-pr-bot · 2026-05-13T09:46:17Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Copilot

Pull request overview

This PR introduces vCluster as a new validated provider for the NVIDIA ISV NCP Validation Suite by adding provider-specific control-plane lifecycle scripts, Kubernetes suite setup/teardown automation, and the corresponding provider configuration YAMLs.

Changes:

Added vCluster Kubernetes suite setup/teardown scripts to create/connect a tenant cluster, optionally install GPU Operator, and emit inventory JSON.
Added vCluster control-plane suite scripts mapping tenants to vCluster instances and access keys to Kubernetes ServiceAccount tokens.
Added vCluster provider configs for k8s and control-plane suites wiring the generic suites to the new scripts.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 10 comments.

Show a summary per file

File	Description
isvctl/configs/providers/vcluster/scripts/k8s/setup.sh	Creates/connects vCluster tenant, optionally exposes endpoint, optionally installs GPU Operator, then emits inventory via shared `_common.sh`.
isvctl/configs/providers/vcluster/scripts/k8s/teardown.sh	Deletes the vCluster tenant and removes the persisted kubeconfig.
isvctl/configs/providers/vcluster/scripts/control-plane/check_api.py	Validates control-plane API reachability/health via kubectl probes.
isvctl/configs/providers/vcluster/scripts/control-plane/create_access_key.py	Creates a ServiceAccount + ClusterRoleBinding and generates a bound token.
isvctl/configs/providers/vcluster/scripts/control-plane/test_access_key.py	Verifies the token can authenticate to the Kubernetes API.
isvctl/configs/providers/vcluster/scripts/control-plane/disable_access_key.py	Disables access by removing the ClusterRoleBinding granting permissions.
isvctl/configs/providers/vcluster/scripts/control-plane/verify_key_rejected.py	Verifies disabled credentials are rejected (auth failure expected).
isvctl/configs/providers/vcluster/scripts/control-plane/create_tenant.py	Creates a vCluster tenant (vcluster instance) and waits for Running.
isvctl/configs/providers/vcluster/scripts/control-plane/list_tenants.py	Lists vCluster tenants and checks for presence of the target tenant.
isvctl/configs/providers/vcluster/scripts/control-plane/get_tenant.py	Retrieves a specific tenant’s status by parsing `vcluster list` JSON.
isvctl/configs/providers/vcluster/scripts/control-plane/delete_access_key.py	Deletes ServiceAccount and attempts to delete the associated ClusterRoleBinding.
isvctl/configs/providers/vcluster/scripts/control-plane/delete_tenant.py	Deletes the vCluster tenant, treating “not found” as successful teardown.
isvctl/configs/providers/vcluster/config/k8s.yaml	Provider config wiring the generic k8s suite to vCluster setup/teardown and tuning conformance check settings.
isvctl/configs/providers/vcluster/config/control-plane.yaml	Provider config wiring the generic control-plane suite to vCluster scripts and step args.

Comments suppressed due to low confidence (2)

isvctl/configs/providers/vcluster/scripts/control-plane/create_access_key.py:87

Namespace creation is done via sh -c with an f-string that embeds ns directly into the shell command. Since VCLUSTER_NAMESPACE can be user-controlled, this is a command-injection risk. Please avoid sh -c here and run kubectl directly with an argument list (or apply the dry-run YAML via stdin) so the namespace is never interpreted by a shell.

        _run(["kubectl", "create", "namespace", ns, "--dry-run=client", "-o", "yaml"], env)
        _run(
            ["sh", "-c",
             f"kubectl create namespace {ns} --dry-run=client -o yaml | kubectl apply -f -"],
            env,
        )

isvctl/configs/providers/vcluster/scripts/control-plane/disable_access_key.py:83

The ClusterRoleBinding deletion result is not checked. If kubectl delete clusterrolebinding fails (RBAC, API error), the script still returns success and marks the key Inactive even though permissions may remain. Please check the return code and fail the step on unexpected deletion errors.

        # Disable the access key by removing the ClusterRoleBinding that grants
        # the ServiceAccount its permissions. The bound token still authenticates
        # but any API call returns 403 Forbidden, which the validation suite
        # treats as "rejected". The SA itself is deleted in delete_access_key.py.
        crb_name = f"{args.username}-view"
        _run(
            ["kubectl", "delete", "clusterrolebinding", crb_name,
             "--ignore-not-found=true"],
            env,
        )

        result["status"] = "Inactive"
        result["success"] = True

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

coderabbitai · 2026-05-13T10:32:44Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Adds a vCluster provider: README and provider configs, control-plane Python CLIs for tenant and ServiceAccount lifecycle and checks, k8s setup/teardown scripts managing GPU/operator/taints and kubeconfigs, and isvtest workload/validation adaptations for shared-nodes GPU topologies and robust JUnit retrieval.

Changes

vCluster Provider Implementation

Layer / File(s)	Summary
Documentation and provider configs `isvctl/configs/providers/vcluster/README.md`, `isvctl/configs/providers/vcluster/config/*.yaml`	Adds provider README; control-plane and k8s YAML configs defining setup/test/teardown phases, environment variables, workload tuning, and large CNCF e2e skip patterns.
Control-plane lifecycle CLIs `isvctl/configs/providers/vcluster/scripts/control-plane/*.py`	Adds Python CLIs for API check, create/list/get/delete tenant, create/test/disable/delete ServiceAccount tokens, and verify token rejection. All support DEMO_MODE, kubeconfig selection, JSON outputs and exit codes.
Kubernetes provisioning scripts `isvctl/configs/providers/vcluster/scripts/k8s/setup.sh`, `isvctl/configs/providers/vcluster/scripts/k8s/teardown.sh`	Adds setup.sh to create/connect vCluster (sync.fromHost.nodes), detect/label GPUs, manage RuntimeClass, conditionally install GPU Operator, run preflight, capture LB IP and emit inventory JSON; teardown restores taints, deletes vCluster, and cleans kubeconfig.
isvtest workload & validation updates `isvtest/src/isvtest/*`	Multiple isvtest changes: more flexible GPU counting, multi-stage JUnit retrieval, regex-driven MPI/Job manifest patching, runtimeClassName and memory/runtime overrides for NCCL/NIM/stress workloads, and Helm kubeconfig extraction/overrides.

Estimated code review effort:
🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers:

abegnoche

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 36.96% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and concisely summarizes the primary change: adding vCluster provider support for both k8s and control-plane validation suites.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

saiyam1814 · 2026-05-18T10:33:59Z

Thanks for the review! Addressed in the latest force-push (5dea77b):

SPDX headers (3 comments): aligned all vCluster provider files to LicenseRef-NvidiaProprietary to match the rest of the repo's provider configs/scripts. License consistency restored.
setup.sh python3 tool check: python3 is now included in the preflight for tool in vcluster kubectl helm jq python3 loop, so a missing interpreter fails fast with a clear error.
teardown.sh vcluster delete error handling: captures the exit code via || DELETE_RC=$?, treats only an explicit not-found message as success, and exit 1 on any other failure — so auth/RBAC/transient errors surface instead of being swallowed.
create_access_key.py namespace creation: rc_ns and rc_create are both checked and surfaced as result["error"] on failure.
delete_access_key.py CRB cleanup: rc_crb is checked and the script returns 1 with a clear error if the ClusterRoleBinding delete fails, so stale cluster-wide bindings can't be left behind silently.
delete_tenant.py not-found detection: now checks (stdout + stderr).lower() so the vCluster CLI's version-dependent message routing doesn't fail teardown of an already-absent tenant.
disable_access_key.py docstring: rewritten to accurately describe the ClusterRoleBinding-delete strategy and the resulting 403 Forbidden behavior (the previous docstring still referenced the older token-deletion approach).
verify_key_rejected.py strict auth-error check: only Unauthorized/Forbidden/401/403 in stderr now counts as a successful rejection. Any other non-zero kubectl exit (TLS, connection, API down) is surfaced as a real verification failure instead of being treated as a token-rejection success, so outages can't masquerade as a passing check.

All changes are reviewer feedback only — no behavioral change on the happy path that produced the 31/31 PASS in the description. Force-pushed as a single signed-off commit on the same add-vcluster-provider branch.

coderabbitai

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@isvctl/configs/providers/vcluster/scripts/control-plane/list_tenants.py`:
- Around line 34-47: Add short PEP 257 docstrings to the three functions:
_kubeconfig_env, _run, and main. For each function add a one- or two-line
docstring summarizing its purpose (e.g., "_kubeconfig_env: return env with
VCLUSTER_HOST_KUBECONFIG/KUBECONFIG applied", "_run: execute cmd with env and
return (rc, stdout, stderr)", "main: script entrypoint that lists tenants and
returns exit code"), and include a brief mention of return values/types where
helpful; place the triple-quoted string immediately under each def to satisfy
the repo's docstring requirement.
- Around line 78-80: The JSON error currently embeds raw CLI stderr
(result["error"] = f"vcluster list failed: {stderr}"); change this to emit a
concise generic message in the JSON (e.g., result["error"] = "vcluster list
failed") and move the raw stderr output to stderr/logging only (for example,
print(stderr, file=sys.stderr) or use the module logger). Update the rc != 0
branch that sets result and calls print(json.dumps(...)) to stop including the
raw stderr string, and ensure sys (or the logger) is imported and used to record
the detailed CLI output separately.

In `@isvctl/configs/providers/vcluster/scripts/k8s/setup.sh`:
- Around line 437-453: The temp file created in the NGC_API_KEY branch
(_HELM_VALUES_FILE via mktemp) isn’t guaranteed to be removed if the subsequent
KUBECONFIG... helm "${HELM_ARGS[@]}" call fails; add a cleanup trap immediately
after creating _HELM_VALUES_FILE so the file is removed on EXIT or ERR (and
unset the trap after successful cleanup) to ensure the secret never remains on
disk; locate the mktemp/chmod/cat block that sets _HELM_VALUES_FILE and add a
trap that removes "$_HELM_VALUES_FILE" and clears itself, leaving the existing
conditional rm -f as a fallback after the helm invocation.

In `@isvtest/src/isvtest/workloads/k8s_nim_helm.py`:
- Around line 37-52: The kubeconfig parsing and shell interpolation are brittle:
replace the manual string splitting in _get_kubeconfig_from_kubectl() with
shlex.split(os.environ.get("KUBECTL", "")) to robustly handle spaces/quotes and
then search that token list for "--kubeconfig" or "--kubeconfig=" to return the
raw path; then in _dump_helm_status() and _cleanup_helm() build the kube flag
using shlex.quote(kubeconfig) (e.g. "--kubeconfig=" + shlex.quote(kubeconfig) or
f"--kubeconfig={shlex.quote(kubeconfig)}") when interpolating into shell
commands so paths with spaces/special chars are safely quoted.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 49725770-ca5a-4d86-92ff-f39782ec9656

📥 Commits

Reviewing files that changed from the base of the PR and between c880f79 and 5dea77b.

📒 Files selected for processing (22)

isvctl/configs/providers/vcluster/README.md
isvctl/configs/providers/vcluster/config/control-plane.yaml
isvctl/configs/providers/vcluster/config/k8s.yaml
isvctl/configs/providers/vcluster/manifests/mpi-operator-v0.5.0.yaml
isvctl/configs/providers/vcluster/scripts/control-plane/check_api.py
isvctl/configs/providers/vcluster/scripts/control-plane/create_access_key.py
isvctl/configs/providers/vcluster/scripts/control-plane/create_tenant.py
isvctl/configs/providers/vcluster/scripts/control-plane/delete_access_key.py
isvctl/configs/providers/vcluster/scripts/control-plane/delete_tenant.py
isvctl/configs/providers/vcluster/scripts/control-plane/disable_access_key.py
isvctl/configs/providers/vcluster/scripts/control-plane/get_tenant.py
isvctl/configs/providers/vcluster/scripts/control-plane/list_tenants.py
isvctl/configs/providers/vcluster/scripts/control-plane/test_access_key.py
isvctl/configs/providers/vcluster/scripts/control-plane/verify_key_rejected.py
isvctl/configs/providers/vcluster/scripts/k8s/setup.sh
isvctl/configs/providers/vcluster/scripts/k8s/teardown.sh
isvtest/src/isvtest/core/nvidia.py
isvtest/src/isvtest/validations/k8s_conformance.py
isvtest/src/isvtest/workloads/k8s_nccl_multinode.py
isvtest/src/isvtest/workloads/k8s_nim.py
isvtest/src/isvtest/workloads/k8s_nim_helm.py
isvtest/src/isvtest/workloads/k8s_stress.py

✅ Files skipped from review due to trivial changes (1)

isvctl/configs/providers/vcluster/README.md

🚧 Files skipped from review as they are similar to previous changes (16)

isvtest/src/isvtest/core/nvidia.py
isvctl/configs/providers/vcluster/scripts/control-plane/test_access_key.py
isvtest/src/isvtest/validations/k8s_conformance.py
isvctl/configs/providers/vcluster/scripts/control-plane/get_tenant.py
isvtest/src/isvtest/workloads/k8s_nim.py
isvctl/configs/providers/vcluster/scripts/control-plane/delete_access_key.py
isvctl/configs/providers/vcluster/scripts/k8s/teardown.sh
isvctl/configs/providers/vcluster/config/control-plane.yaml
isvtest/src/isvtest/workloads/k8s_nccl_multinode.py
isvctl/configs/providers/vcluster/scripts/control-plane/verify_key_rejected.py
isvctl/configs/providers/vcluster/scripts/control-plane/disable_access_key.py
isvctl/configs/providers/vcluster/scripts/control-plane/create_tenant.py
isvctl/configs/providers/vcluster/scripts/control-plane/delete_tenant.py
isvctl/configs/providers/vcluster/scripts/control-plane/create_access_key.py
isvctl/configs/providers/vcluster/scripts/control-plane/check_api.py
isvctl/configs/providers/vcluster/config/k8s.yaml

Adds vCluster (by vCluster Labs) as a validated CaaS provider in the NVIDIA ISV NCP Validation Suite. vCluster is an open-source project that provisions isolated tenant clusters on top of any existing Kubernetes cluster (the Control Plane Cluster). Each tenant cluster has its own virtual control plane while sharing the host cluster's GPU nodes via sync.fromHost.nodes. vCluster is CNCF-certified for Kubernetes 1.28-1.35 in three configurations; this provider validates the shared-nodes topology, which is the one that enables GPU workloads in tenant clusters to land on the Control Plane Cluster's physical GPU nodes. Provider files (isvctl/configs/providers/vcluster/): - config/k8s.yaml: full k8s suite wiring with CNCF conformance skip patterns, GKE COS GPU overrides, A100 MIG labeling, NCCL multi-node configuration, and HTTP-401 API network ACL probe - config/control-plane.yaml: control-plane suite wiring - scripts/k8s/setup.sh: tenant cluster lifecycle, GPU Operator handling (lightweight pause Deployment when the host operator is already running), MPI Operator install via server-side apply, A100/H100 MIG label seeding, GKE LoadBalancer exposure - scripts/k8s/teardown.sh: vCluster delete + taint restoration - scripts/control-plane/: access key (ServiceAccount + bound token) and tenant management scripts - manifests/mpi-operator-v0.5.0.yaml: bundled Kubeflow MPI Operator for K8sNcclMultiNodeWorkload (avoids runtime external fetches) - README.md: provider documentation Workload / framework improvements (apply across providers): - workloads/k8s_nim.py: runtime_class_name, nim_memory_request, nim_memory_limit config params for non-runtimeClass GPU runtimes and tight-memory T4 nodes - workloads/k8s_nim_helm.py: --kubeconfig targeting so helm installs land in the intended tenant; memory + runtimeClassName params; NIM_MAX_MODEL_LEN env via --set-string to avoid Helm casting it to an int (which K8s rejects on env.value) - workloads/k8s_nccl_multinode.py: override_launcher_affinity and runtime_class_name options so the MPI launcher can schedule on tenants without node-role.kubernetes.io/control-plane and the workers don't require an "nvidia" runtime handler - workloads/k8s_stress.py: runtime_class_name config param - validations/k8s_conformance.py: robust JUnit retrieval - retry the exec cat stream on transient resets, fall back to kubectl cp (tar framing), and honor an opt-in ISVTEST_CONFORMANCE_JUNIT_LOCAL_PATH env var for providers behind managed-K8s LBs that intermittently reset multi-megabyte streams - core/nvidia.py: relax GPU row regex so nvidia-smi rows for GPUs whose marketing names don't match the original pattern still parse Validation result on GKE (vcluster-isv-gpu-test, k8s v1.35.3-gke.2190000; 2x n1-standard-4 CPU + 2x n1-standard-4 T4 + 1x a2-highgpu-1g A100; vCluster 0.34.0 with sync.fromHost.nodes): - make test / make lint: PASS (728 unit tests, ruff, yamlfmt, SPDX) - Control-plane suite: 11/11 PASS - Kubernetes suite: 31/31 PASS, including: - K8sCncfConformanceCheck (certified-conformance, v1.35.0): 419/7353 passed, 0 failed, 0 errors, 6934 skipped (the 6934 are tests outside the [Conformance] focus plus the 28-pattern provider skip list, each pattern documented inline in config/k8s.yaml) - K8sMigConfigCheck: PASS on A100 with nvidia.com/mig.capable=true and nvidia.com/mig.strategy=single (GFD cannot run on GKE COS, so setup.sh seeds the labels) - K8sApiNetworkAclCheck: PASS - authorized kubectl probe served, unauthorized curl -f gets 401 (exit 22) -> ACL enforced at the protocol layer - K8sNcclMultiNodeWorkload: PASS via bundled MPI Operator v0.5.0; min_bus_bw_gbps pinned to 0 because this cluster uses T4 + plain GKE pod networking - reviewers on A100/H100 + IB rigs should override this in their provider config - K8sGpuStressWorkload, K8sNimInferenceWorkload, K8sNimHelmWorkload-1b, K8sNimHelmWorkload-3b: PASS on T4 nodes Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Saiyam Pathak <saiyam911@gmail.com>

saiyam1814 · 2026-05-18T11:24:05Z

Review pass complete on 27e97f3 — all 14 review threads (10 Copilot + 4 CodeRabbit) addressed and resolved:

Copilot threads:

SPDX headers (×3) — all 14 vCluster files now LicenseRef-NvidiaProprietary
setup.sh python3 tooling check — present
teardown.sh vcluster delete exit-code handling — proper rc capture + explicit not-found gate
create_access_key.py namespace rc — both rc_ns and rc_create checked and surfaced
disable_access_key.py docstring — rewritten to match the CRB-delete strategy (403)
verify_key_rejected.py strict 401/403 — non-auth errors now fail the check instead of being treated as rejection
delete_access_key.py CRB rc — surfaced on failure
delete_tenant.py not-found detection — checks both stdout and stderr

CodeRabbit threads:

list_tenants.py PEP 257 docstrings on _kubeconfig_env, _run, main — added
list_tenants.py raw stderr in JSON contract — generic message in JSON, raw diagnostics to stderr
setup.sh NGC_API_KEY values file — now cleaned via EXIT trap so a failing helm can't leak the secret on disk
k8s_nim_helm.py shell injection — KUBECTL parsing uses shlex.split, kubeconfig path interpolation uses shlex.quote

No behavioral change on the happy path that produced the 31/31 PASS in the description — these are review-quality fixes only.

Ready for human reviewer. cc @NVIDIA/ncp-isv-lab-maintainer

abegnoche · 2026-05-28T15:49:26Z

Thanks for the contribution @saiyam1814, we are currently discussing internally how we will handle additional providers and we hope to get back to you as soon as possible.

abegnoche · 2026-05-28T16:23:23Z

/ok to test 27e97f3

github-actions · 2026-05-28T16:24:15Z

🔐 TruffleHog Secret Scan

✅ No secrets or credentials found!

Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉

🔗 View scan details

_{🕐 Last updated: 2026-05-28 16:24:14 UTC | Commit: 27e97f3}

saiyam1814 · 2026-06-05T09:55:53Z

Thanks for the contribution @saiyam1814, we are currently discussing internally how we will handle additional providers and we hope to get back to you as soon as possible.

thank you for checking, let me knownif anything else is required.

The check-spdx-headers pre-commit hook (enforced on main) requires an SPDX header on every isvctl/** YAML. The bundled Kubeflow MPI Operator manifest predated that hook and was the only file failing the Pre-commit Checks job, which in turn failed the aggregate Pipeline Status gate. Header added via scripts/add_spdx_headers.py so it matches every other provider manifest in the repo. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Saiyam Pathak <saiyam911@gmail.com>

saiyam1814 · 2026-06-09T05:46:03Z

@abegnoche their were some merges to the branch and the the bundled Kubeflow MPI Operator manifest was missing the SPDX header now enforced by the check-spdx-headers hook. Added it via scripts/add_spdx_headers.py so it matches the other provider manifests; DCO is signed off. @abegnoche could you /ok to test 8762f84 when you get a chance? Thanks!

abegnoche · 2026-06-10T20:18:42Z

/ok to test 8762f84

Signed-off-by: Saiyam Pathak <saiyam911@gmail.com> Assisted-by: Claude (Anthropic)

saiyam1814 · 2026-06-11T08:05:21Z

/ok to test 4c4e3d3

abegnoche · 2026-06-11T15:34:43Z

/ok to test 4c4e3d3

Signed-off-by: Saiyam Pathak <saiyam911@gmail.com> Assisted-by: Claude (Anthropic)

saiyam1814 · 2026-06-11T17:07:05Z

@abegnoche thank you for your patience Ready for /ok to test 7065798

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

isvctl/configs/providers/vcluster/scripts/control-plane/delete_tenant.py (1)
77-77: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Make resources_deleted format consistent with real mode.

Demo mode returns [args.group_name] while line 107 returns [f"vcluster/{args.group_name}"]. For consistency and to avoid confusion when parsing output, use the same format.
🔧 Suggested fix
-        result["resources_deleted"] = [args.group_name]
+        result["resources_deleted"] = [f"vcluster/{args.group_name}"]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@isvctl/configs/providers/vcluster/scripts/control-plane/delete_tenant.py` at
line 77, The demo-mode assignment to result["resources_deleted"] currently uses
[args.group_name] which is inconsistent with real-mode output; change the demo
branch to set result["resources_deleted"] = [f"vcluster/{args.group_name}"] so
both demo and real modes return the same "vcluster/{group}" formatted resource
identifier (refer to the result dict and args.group_name usage in this
function).

🧹 Nitpick comments (1)

isvctl/configs/providers/vcluster/scripts/control-plane/delete_tenant.py (1)

46-119: ⚡ Quick win

Add docstrings to all functions.

The coding guidelines require docstrings for every function (PEP 257). Consider adding concise docstrings to _kubeconfig_env, _run, and main to improve maintainability and follow the project's documentation standards.

📝 Example docstrings

def _kubeconfig_env() -> dict[str, str]:
    """Return environment with KUBECONFIG set to VCLUSTER_HOST_KUBECONFIG if present."""
    ...

def _run(cmd: list[str], env: dict[str, str]) -> tuple[int, str, str]:
    """Run command and return (returncode, stdout, stderr)."""
    ...

def main() -> int:
    """Delete a vCluster tenant and return exit code (0=success, 1=failure)."""
    ...

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@isvctl/configs/providers/vcluster/scripts/control-plane/delete_tenant.py`
around lines 46 - 119, Add concise PEP 257 docstrings to each top-level
function: _kubeconfig_env should document that it returns a copy of the current
environment with KUBECONFIG set to VCLUSTER_HOST_KUBECONFIG if present; _run
should document that it executes the given command with the provided env and
returns (returncode, stdout, stderr); and main should document that it parses
args, deletes the vcluster tenant (honouring DEMO_MODE), prints a JSON result
and returns the exit code. Place these short docstrings immediately below each
def line for _kubeconfig_env, _run, and main (triple-quoted, one- or two-line
descriptions). Ensure wording is concise and follows PEP 257 conventions.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@isvctl/configs/providers/vcluster/scripts/control-plane/delete_tenant.py`:
- Line 77: The demo-mode assignment to result["resources_deleted"] currently
uses [args.group_name] which is inconsistent with real-mode output; change the
demo branch to set result["resources_deleted"] = [f"vcluster/{args.group_name}"]
so both demo and real modes return the same "vcluster/{group}" formatted
resource identifier (refer to the result dict and args.group_name usage in this
function).

---

Nitpick comments:
In `@isvctl/configs/providers/vcluster/scripts/control-plane/delete_tenant.py`:
- Around line 46-119: Add concise PEP 257 docstrings to each top-level function:
_kubeconfig_env should document that it returns a copy of the current
environment with KUBECONFIG set to VCLUSTER_HOST_KUBECONFIG if present; _run
should document that it executes the given command with the provided env and
returns (returncode, stdout, stderr); and main should document that it parses
args, deletes the vcluster tenant (honouring DEMO_MODE), prints a JSON result
and returns the exit code. Place these short docstrings immediately below each
def line for _kubeconfig_env, _run, and main (triple-quoted, one- or two-line
descriptions). Ensure wording is concise and follows PEP 257 conventions.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 36f0738e-1dea-4304-a0fd-131fb9d00ecc

📥 Commits

Reviewing files that changed from the base of the PR and between 4c4e3d3 and 7065798.

📒 Files selected for processing (15)

isvctl/configs/providers/vcluster/config/control-plane.yaml
isvctl/configs/providers/vcluster/config/k8s.yaml
isvctl/configs/providers/vcluster/manifests/mpi-operator-v0.5.0.yaml
isvctl/configs/providers/vcluster/scripts/control-plane/check_api.py
isvctl/configs/providers/vcluster/scripts/control-plane/create_access_key.py
isvctl/configs/providers/vcluster/scripts/control-plane/create_tenant.py
isvctl/configs/providers/vcluster/scripts/control-plane/delete_access_key.py
isvctl/configs/providers/vcluster/scripts/control-plane/delete_tenant.py
isvctl/configs/providers/vcluster/scripts/control-plane/disable_access_key.py
isvctl/configs/providers/vcluster/scripts/control-plane/get_tenant.py
isvctl/configs/providers/vcluster/scripts/control-plane/list_tenants.py
isvctl/configs/providers/vcluster/scripts/control-plane/test_access_key.py
isvctl/configs/providers/vcluster/scripts/control-plane/verify_key_rejected.py
isvctl/configs/providers/vcluster/scripts/k8s/setup.sh
isvctl/configs/providers/vcluster/scripts/k8s/teardown.sh

🚧 Files skipped from review as they are similar to previous changes (13)

isvctl/configs/providers/vcluster/scripts/k8s/teardown.sh
isvctl/configs/providers/vcluster/scripts/control-plane/check_api.py
isvctl/configs/providers/vcluster/scripts/control-plane/create_access_key.py
isvctl/configs/providers/vcluster/config/control-plane.yaml
isvctl/configs/providers/vcluster/config/k8s.yaml
isvctl/configs/providers/vcluster/scripts/control-plane/get_tenant.py
isvctl/configs/providers/vcluster/scripts/control-plane/disable_access_key.py
isvctl/configs/providers/vcluster/scripts/control-plane/create_tenant.py
isvctl/configs/providers/vcluster/scripts/control-plane/test_access_key.py
isvctl/configs/providers/vcluster/scripts/control-plane/list_tenants.py
isvctl/configs/providers/vcluster/scripts/control-plane/verify_key_rejected.py
isvctl/configs/providers/vcluster/scripts/control-plane/delete_access_key.py
isvctl/configs/providers/vcluster/scripts/k8s/setup.sh

abegnoche · 2026-06-11T18:43:28Z

/ok to test 7065798

saiyam1814 · 2026-06-13T06:16:25Z

Awesome we have all checks passed !! Let me know if anything else is needed form my end @abegnoche

saiyam1814 · 2026-06-29T10:37:14Z

Any updates on this @abegnoche

abegnoche · 2026-06-29T20:29:33Z

sorry @saiyam1814 forgot to update, it was decided internally to accept new providers (all code under isvctl/configs/providers/vcluster/**) only when we release version 1.0.0 this fall. All other code could be merged in a separate PR.

Copilot AI review requested due to automatic review settings May 13, 2026 09:46

saiyam1814 requested a review from a team as a code owner May 13, 2026 09:46

Copilot started reviewing on behalf of saiyam1814 May 13, 2026 09:46 View session

Copilot AI reviewed May 13, 2026

View reviewed changes

saiyam1814 marked this pull request as draft May 13, 2026 10:12

saiyam1814 force-pushed the add-vcluster-provider branch 2 times, most recently from e7a440a to c880f79 Compare May 18, 2026 10:14

saiyam1814 marked this pull request as ready for review May 18, 2026 10:15

saiyam1814 force-pushed the add-vcluster-provider branch from c880f79 to 5dea77b Compare May 18, 2026 10:33

coderabbitai Bot reviewed May 18, 2026

View reviewed changes

saiyam1814 force-pushed the add-vcluster-provider branch from 5dea77b to 27e97f3 Compare May 18, 2026 11:21

style: apply repo pre-commit formatting (ruff, yamlfmt)

4c4e3d3

Signed-off-by: Saiyam Pathak <saiyam911@gmail.com> Assisted-by: Claude (Anthropic)

saiyam1814 added 3 commits June 11, 2026 22:29

Merge remote-tracking branch 'upstream/main' into add-vcluster-provider

75be22f

test: cover kubectl cp JUnit fallback in conformance tests

972fd62

Signed-off-by: Saiyam Pathak <saiyam911@gmail.com> Assisted-by: Claude (Anthropic)

chore: migrate vcluster provider files to Apache-2.0 headers

7065798

Signed-off-by: Saiyam Pathak <saiyam911@gmail.com> Assisted-by: Claude (Anthropic)

coderabbitai Bot reviewed Jun 11, 2026

View reviewed changes

Uh oh!

Conversation

saiyam1814 commented May 13, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test Infrastructure

Test Results

make test / make lint — PASS

Control-plane suite — 11 / 11 PASS

Kubernetes suite — 31 / 31 PASS

CNCF Conformance Skips

Key Design Decisions

Test plan

Summary by CodeRabbit

Uh oh!

copy-pr-bot Bot commented May 13, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

❌ Failed checks (1 warning)

Uh oh!

saiyam1814 commented May 18, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

saiyam1814 commented May 18, 2026

Uh oh!

abegnoche commented May 28, 2026

Uh oh!

abegnoche commented May 28, 2026

Uh oh!

github-actions Bot commented May 28, 2026

🔐 TruffleHog Secret Scan

Uh oh!

saiyam1814 commented Jun 5, 2026

Uh oh!

saiyam1814 commented Jun 9, 2026

Uh oh!

abegnoche commented Jun 10, 2026

Uh oh!

saiyam1814 commented Jun 11, 2026

Uh oh!

abegnoche commented Jun 11, 2026

Uh oh!

saiyam1814 commented Jun 11, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

abegnoche commented Jun 11, 2026

Uh oh!

saiyam1814 commented Jun 13, 2026

Uh oh!

saiyam1814 commented Jun 29, 2026

Uh oh!

abegnoche commented Jun 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

saiyam1814 commented May 13, 2026 •

edited by coderabbitai Bot

Loading

`make test` / `make lint` — PASS

coderabbitai Bot commented May 13, 2026 •

edited

Loading

abegnoche commented Jun 29, 2026 •

edited

Loading