feat: add vCluster provider for k8s and control-plane suites #414
saiyam1814 wants to merge 6 commits into
Conversation
Pull request overview
This PR introduces vCluster as a new validated provider for the NVIDIA ISV NCP Validation Suite by adding provider-specific control-plane lifecycle scripts, Kubernetes suite setup/teardown automation, and the corresponding provider configuration YAMLs.
Changes:
- Added vCluster Kubernetes suite setup/teardown scripts to create/connect a tenant cluster, optionally install GPU Operator, and emit inventory JSON.
- Added vCluster control-plane suite scripts mapping tenants to vCluster instances and access keys to Kubernetes ServiceAccount tokens.
- Added vCluster provider configs for the `k8s` and `control-plane` suites, wiring the generic suites to the new scripts.
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| isvctl/configs/providers/vcluster/scripts/k8s/setup.sh | Creates/connects vCluster tenant, optionally exposes endpoint, optionally installs GPU Operator, then emits inventory via shared _common.sh. |
| isvctl/configs/providers/vcluster/scripts/k8s/teardown.sh | Deletes the vCluster tenant and removes the persisted kubeconfig. |
| isvctl/configs/providers/vcluster/scripts/control-plane/check_api.py | Validates control-plane API reachability/health via kubectl probes. |
| isvctl/configs/providers/vcluster/scripts/control-plane/create_access_key.py | Creates a ServiceAccount + ClusterRoleBinding and generates a bound token. |
| isvctl/configs/providers/vcluster/scripts/control-plane/test_access_key.py | Verifies the token can authenticate to the Kubernetes API. |
| isvctl/configs/providers/vcluster/scripts/control-plane/disable_access_key.py | Disables access by removing the ClusterRoleBinding granting permissions. |
| isvctl/configs/providers/vcluster/scripts/control-plane/verify_key_rejected.py | Verifies disabled credentials are rejected (auth failure expected). |
| isvctl/configs/providers/vcluster/scripts/control-plane/create_tenant.py | Creates a vCluster tenant (vcluster instance) and waits for Running. |
| isvctl/configs/providers/vcluster/scripts/control-plane/list_tenants.py | Lists vCluster tenants and checks for presence of the target tenant. |
| isvctl/configs/providers/vcluster/scripts/control-plane/get_tenant.py | Retrieves a specific tenant’s status by parsing vcluster list JSON. |
| isvctl/configs/providers/vcluster/scripts/control-plane/delete_access_key.py | Deletes ServiceAccount and attempts to delete the associated ClusterRoleBinding. |
| isvctl/configs/providers/vcluster/scripts/control-plane/delete_tenant.py | Deletes the vCluster tenant, treating “not found” as successful teardown. |
| isvctl/configs/providers/vcluster/config/k8s.yaml | Provider config wiring the generic k8s suite to vCluster setup/teardown and tuning conformance check settings. |
| isvctl/configs/providers/vcluster/config/control-plane.yaml | Provider config wiring the generic control-plane suite to vCluster scripts and step args. |
Comments suppressed due to low confidence (2)
isvctl/configs/providers/vcluster/scripts/control-plane/create_access_key.py:87
- Namespace creation is done via `sh -c` with an f-string that embeds `ns` directly into the shell command. Since `VCLUSTER_NAMESPACE` can be user-controlled, this is a command-injection risk. Please avoid `sh -c` here and run `kubectl` directly with an argument list (or apply the dry-run YAML via stdin) so the namespace is never interpreted by a shell.
_run(["kubectl", "create", "namespace", ns, "--dry-run=client", "-o", "yaml"], env)
_run(
["sh", "-c",
f"kubectl create namespace {ns} --dry-run=client -o yaml | kubectl apply -f -"],
env,
)
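One way to act on this comment, as a minimal sketch rather than the PR's actual fix: render the Namespace manifest with one `kubectl` call and feed it to a second call on stdin, so no shell ever sees the namespace string. The function name `ensure_namespace` and the injectable `run` parameter are illustrative and not taken from the PR; the script's own `_run` helper may have a different signature.

```python
import subprocess
from typing import Callable

# Hypothetical type alias: anything with the calling convention of subprocess.run.
Runner = Callable[..., subprocess.CompletedProcess]


def ensure_namespace(ns: str, env: dict, run: Runner = subprocess.run) -> int:
    """Create a namespace idempotently without invoking a shell."""
    # Render the manifest client-side. `ns` is a single argv element, so
    # shell metacharacters in it are passed to kubectl literally.
    rendered = run(
        ["kubectl", "create", "namespace", ns, "--dry-run=client", "-o", "yaml"],
        env=env, capture_output=True, text=True,
    )
    if rendered.returncode != 0:
        return rendered.returncode
    # Send the rendered YAML to `kubectl apply` on stdin instead of a pipe.
    applied = run(
        ["kubectl", "apply", "-f", "-"],
        env=env, input=rendered.stdout, capture_output=True, text=True,
    )
    return applied.returncode
```

Passing `run` as a parameter is only there so the sketch can be exercised without a cluster; a real script would call `subprocess.run` (or its own helper) directly.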
isvctl/configs/providers/vcluster/scripts/control-plane/disable_access_key.py:83
- The ClusterRoleBinding deletion result is not checked. If `kubectl delete clusterrolebinding` fails (RBAC, API error), the script still returns success and marks the key Inactive even though permissions may remain. Please check the return code and fail the step on unexpected deletion errors.
# Disable the access key by removing the ClusterRoleBinding that grants
# the ServiceAccount its permissions. The bound token still authenticates
# but any API call returns 403 Forbidden, which the validation suite
# treats as "rejected". The SA itself is deleted in delete_access_key.py.
crb_name = f"{args.username}-view"
_run(
["kubectl", "delete", "clusterrolebinding", crb_name,
"--ignore-not-found=true"],
env,
)
result["status"] = "Inactive"
result["success"] = True
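A possible shape for the requested check, sketched under assumptions: `run` is taken to return `(returncode, stdout, stderr)` as the review comments elsewhere describe `_run`, and the initial `"Active"` status and the error wording are placeholders, not values from the PR.

```python
import sys
from typing import Callable

# Assumed helper signature: run(cmd, env) -> (returncode, stdout, stderr).
RunFn = Callable[[list, dict], tuple]


def disable_access_key(username: str, env: dict, run: RunFn) -> dict:
    """Remove the key's ClusterRoleBinding and report failure if kubectl fails."""
    result = {"success": False, "status": "Active"}  # placeholder initial state
    crb_name = f"{username}-view"
    rc, _out, err = run(
        ["kubectl", "delete", "clusterrolebinding", crb_name,
         "--ignore-not-found=true"],
        env,
    )
    if rc != 0:
        # --ignore-not-found already covers the "missing" case, so any
        # non-zero exit here is an unexpected error (RBAC, API failure).
        result["error"] = "failed to delete ClusterRoleBinding"
        print(err, file=sys.stderr)  # keep raw CLI output out of the JSON result
        return result
    result["status"] = "Inactive"
    result["success"] = True
    return result
```

With this structure the step only reports `Inactive` once the binding is confirmed gone or was never present.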
📝 Walkthrough
Adds a vCluster provider: README and provider configs, control-plane Python CLIs for tenant and ServiceAccount lifecycle and checks, k8s setup/teardown scripts managing GPU/operator/taints and kubeconfigs, and isvtest workload/validation adaptations for shared-nodes GPU topologies and robust JUnit retrieval.
Changes: vCluster Provider Implementation
🚥 Pre-merge checks: 4 passed, 1 failed (1 warning)
e7a440a to c880f79
c880f79 to 5dea77b
Thanks for the review! Addressed in the latest force-push (
All changes are reviewer feedback only — no behavioral change on the happy path that produced the 31/31 PASS in the description. Force-pushed as a single signed-off commit on the same |
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@isvctl/configs/providers/vcluster/scripts/control-plane/list_tenants.py`:
- Around line 34-47: Add short PEP 257 docstrings to the three functions:
_kubeconfig_env, _run, and main. For each function add a one- or two-line
docstring summarizing its purpose (e.g., "_kubeconfig_env: return env with
VCLUSTER_HOST_KUBECONFIG/KUBECONFIG applied", "_run: execute cmd with env and
return (rc, stdout, stderr)", "main: script entrypoint that lists tenants and
returns exit code"), and include a brief mention of return values/types where
helpful; place the triple-quoted string immediately under each def to satisfy
the repo's docstring requirement.
- Around line 78-80: The JSON error currently embeds raw CLI stderr
(result["error"] = f"vcluster list failed: {stderr}"); change this to emit a
concise generic message in the JSON (e.g., result["error"] = "vcluster list
failed") and move the raw stderr output to stderr/logging only (for example,
print(stderr, file=sys.stderr) or use the module logger). Update the rc != 0
branch that sets result and calls print(json.dumps(...)) to stop including the
raw stderr string, and ensure sys (or the logger) is imported and used to record
the detailed CLI output separately.
In `@isvctl/configs/providers/vcluster/scripts/k8s/setup.sh`:
- Around line 437-453: The temp file created in the NGC_API_KEY branch
(_HELM_VALUES_FILE via mktemp) isn’t guaranteed to be removed if the subsequent
KUBECONFIG... helm "${HELM_ARGS[@]}" call fails; add a cleanup trap immediately
after creating _HELM_VALUES_FILE so the file is removed on EXIT or ERR (and
unset the trap after successful cleanup) to ensure the secret never remains on
disk; locate the mktemp/chmod/cat block that sets _HELM_VALUES_FILE and add a
trap that removes "$_HELM_VALUES_FILE" and clears itself, leaving the existing
conditional rm -f as a fallback after the helm invocation.
In `@isvtest/src/isvtest/workloads/k8s_nim_helm.py`:
- Around line 37-52: The kubeconfig parsing and shell interpolation are brittle:
replace the manual string splitting in _get_kubeconfig_from_kubectl() with
shlex.split(os.environ.get("KUBECTL", "")) to robustly handle spaces/quotes and
then search that token list for "--kubeconfig" or "--kubeconfig=" to return the
raw path; then in _dump_helm_status() and _cleanup_helm() build the kube flag
using shlex.quote(kubeconfig) (e.g. "--kubeconfig=" + shlex.quote(kubeconfig) or
f"--kubeconfig={shlex.quote(kubeconfig)}") when interpolating into shell
commands so paths with spaces/special chars are safely quoted.
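The `shlex`-based approach suggested for `k8s_nim_helm.py` could look roughly like the sketch below. The function names are illustrative (the PR's helpers are `_get_kubeconfig_from_kubectl`, `_dump_helm_status`, and `_cleanup_helm`), and the command string is passed in as an argument here purely for testability; the real helper reads it from the `KUBECTL` environment variable.

```python
import shlex
from typing import Optional


def kubeconfig_from_kubectl(kubectl_cmd: str) -> Optional[str]:
    """Extract the raw kubeconfig path from a kubectl command string."""
    # shlex.split honours quotes, so a path containing spaces stays one token.
    tokens = shlex.split(kubectl_cmd)
    for i, tok in enumerate(tokens):
        if tok.startswith("--kubeconfig="):
            return tok.split("=", 1)[1]
        if tok == "--kubeconfig" and i + 1 < len(tokens):
            return tokens[i + 1]
    return None


def kubeconfig_flag(kubeconfig: Optional[str]) -> str:
    """Build a flag that is safe to interpolate into a shell command string."""
    if not kubeconfig:
        return ""
    return f"--kubeconfig={shlex.quote(kubeconfig)}"
```

For example, `kubeconfig_from_kubectl('kubectl --kubeconfig "/tmp/my cfg/kc.yaml"')` yields the unquoted path, and `kubeconfig_flag` re-quotes it only at the point where it enters a shell string.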
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 49725770-ca5a-4d86-92ff-f39782ec9656
📒 Files selected for processing (22)
- isvctl/configs/providers/vcluster/README.md
- isvctl/configs/providers/vcluster/config/control-plane.yaml
- isvctl/configs/providers/vcluster/config/k8s.yaml
- isvctl/configs/providers/vcluster/manifests/mpi-operator-v0.5.0.yaml
- isvctl/configs/providers/vcluster/scripts/control-plane/check_api.py
- isvctl/configs/providers/vcluster/scripts/control-plane/create_access_key.py
- isvctl/configs/providers/vcluster/scripts/control-plane/create_tenant.py
- isvctl/configs/providers/vcluster/scripts/control-plane/delete_access_key.py
- isvctl/configs/providers/vcluster/scripts/control-plane/delete_tenant.py
- isvctl/configs/providers/vcluster/scripts/control-plane/disable_access_key.py
- isvctl/configs/providers/vcluster/scripts/control-plane/get_tenant.py
- isvctl/configs/providers/vcluster/scripts/control-plane/list_tenants.py
- isvctl/configs/providers/vcluster/scripts/control-plane/test_access_key.py
- isvctl/configs/providers/vcluster/scripts/control-plane/verify_key_rejected.py
- isvctl/configs/providers/vcluster/scripts/k8s/setup.sh
- isvctl/configs/providers/vcluster/scripts/k8s/teardown.sh
- isvtest/src/isvtest/core/nvidia.py
- isvtest/src/isvtest/validations/k8s_conformance.py
- isvtest/src/isvtest/workloads/k8s_nccl_multinode.py
- isvtest/src/isvtest/workloads/k8s_nim.py
- isvtest/src/isvtest/workloads/k8s_nim_helm.py
- isvtest/src/isvtest/workloads/k8s_stress.py
✅ Files skipped from review due to trivial changes (1)
- isvctl/configs/providers/vcluster/README.md
🚧 Files skipped from review as they are similar to previous changes (16)
- isvtest/src/isvtest/core/nvidia.py
- isvctl/configs/providers/vcluster/scripts/control-plane/test_access_key.py
- isvtest/src/isvtest/validations/k8s_conformance.py
- isvctl/configs/providers/vcluster/scripts/control-plane/get_tenant.py
- isvtest/src/isvtest/workloads/k8s_nim.py
- isvctl/configs/providers/vcluster/scripts/control-plane/delete_access_key.py
- isvctl/configs/providers/vcluster/scripts/k8s/teardown.sh
- isvctl/configs/providers/vcluster/config/control-plane.yaml
- isvtest/src/isvtest/workloads/k8s_nccl_multinode.py
- isvctl/configs/providers/vcluster/scripts/control-plane/verify_key_rejected.py
- isvctl/configs/providers/vcluster/scripts/control-plane/disable_access_key.py
- isvctl/configs/providers/vcluster/scripts/control-plane/create_tenant.py
- isvctl/configs/providers/vcluster/scripts/control-plane/delete_tenant.py
- isvctl/configs/providers/vcluster/scripts/control-plane/create_access_key.py
- isvctl/configs/providers/vcluster/scripts/control-plane/check_api.py
- isvctl/configs/providers/vcluster/config/k8s.yaml
Adds vCluster (by vCluster Labs) as a validated CaaS provider in the
NVIDIA ISV NCP Validation Suite. vCluster is an open-source project
that provisions isolated tenant clusters on top of any existing
Kubernetes cluster (the Control Plane Cluster). Each tenant cluster
has its own virtual control plane while sharing the host cluster's
GPU nodes via sync.fromHost.nodes. vCluster is CNCF-certified for
Kubernetes 1.28-1.35 in three configurations; this provider validates
the shared-nodes topology, which is the one that enables GPU workloads
in tenant clusters to land on the Control Plane Cluster's physical
GPU nodes.
Provider files (isvctl/configs/providers/vcluster/):
- config/k8s.yaml: full k8s suite wiring with CNCF conformance skip
patterns, GKE COS GPU overrides, A100 MIG labeling, NCCL multi-node
configuration, and HTTP-401 API network ACL probe
- config/control-plane.yaml: control-plane suite wiring
- scripts/k8s/setup.sh: tenant cluster lifecycle, GPU Operator handling
(lightweight pause Deployment when the host operator is already
running), MPI Operator install via server-side apply, A100/H100 MIG
label seeding, GKE LoadBalancer exposure
- scripts/k8s/teardown.sh: vCluster delete + taint restoration
- scripts/control-plane/: access key (ServiceAccount + bound token) and
tenant management scripts
- manifests/mpi-operator-v0.5.0.yaml: bundled Kubeflow MPI Operator for
K8sNcclMultiNodeWorkload (avoids runtime external fetches)
- README.md: provider documentation
Workload / framework improvements (apply across providers):
- workloads/k8s_nim.py: runtime_class_name, nim_memory_request,
nim_memory_limit config params for non-runtimeClass GPU runtimes
and tight-memory T4 nodes
- workloads/k8s_nim_helm.py: --kubeconfig targeting so helm installs
land in the intended tenant; memory + runtimeClassName params;
NIM_MAX_MODEL_LEN env via --set-string to avoid Helm casting it to
an int (which K8s rejects on env.value)
- workloads/k8s_nccl_multinode.py: override_launcher_affinity and
runtime_class_name options so the MPI launcher can schedule on
tenants without node-role.kubernetes.io/control-plane and the
workers don't require an "nvidia" runtime handler
- workloads/k8s_stress.py: runtime_class_name config param
- validations/k8s_conformance.py: robust JUnit retrieval - retry the
exec cat stream on transient resets, fall back to kubectl cp (tar
framing), and honor an opt-in ISVTEST_CONFORMANCE_JUNIT_LOCAL_PATH
env var for providers behind managed-K8s LBs that intermittently
reset multi-megabyte streams
- core/nvidia.py: relax GPU row regex so nvidia-smi rows for GPUs
whose marketing names don't match the original pattern still parse
Validation result on GKE (vcluster-isv-gpu-test, k8s v1.35.3-gke.2190000;
2x n1-standard-4 CPU + 2x n1-standard-4 T4 + 1x a2-highgpu-1g A100;
vCluster 0.34.0 with sync.fromHost.nodes):
- make test / make lint: PASS (728 unit tests, ruff, yamlfmt, SPDX)
- Control-plane suite: 11/11 PASS
- Kubernetes suite: 31/31 PASS, including:
- K8sCncfConformanceCheck (certified-conformance, v1.35.0):
419/7353 passed, 0 failed, 0 errors, 6934 skipped (the 6934 are
tests outside the [Conformance] focus plus the 28-pattern provider
skip list, each pattern documented inline in config/k8s.yaml)
- K8sMigConfigCheck: PASS on A100 with nvidia.com/mig.capable=true
and nvidia.com/mig.strategy=single (GFD cannot run on GKE COS, so
setup.sh seeds the labels)
- K8sApiNetworkAclCheck: PASS - authorized kubectl probe served,
unauthorized curl -f gets 401 (exit 22) -> ACL enforced at the
protocol layer
- K8sNcclMultiNodeWorkload: PASS via bundled MPI Operator v0.5.0;
min_bus_bw_gbps pinned to 0 because this cluster uses T4 + plain
GKE pod networking - reviewers on A100/H100 + IB rigs should
override this in their provider config
- K8sGpuStressWorkload, K8sNimInferenceWorkload, K8sNimHelmWorkload-1b,
K8sNimHelmWorkload-3b: PASS on T4 nodes
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Saiyam Pathak <saiyam911@gmail.com>
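To illustrate the `--set-string` point from the commit message: Helm's `--set` parses a bare numeric value as an integer, while a Kubernetes `env[].value` must be a string, so the value is passed with `--set-string` instead. The sketch below only builds the argument list; the `env[N].name` / `env[N].value` key path and the example value are assumptions for illustration, not the NIM chart's confirmed values schema.

```python
def helm_env_args(name: str, value, index: int = 0) -> list:
    """Build Helm CLI args that set one container env var as a string."""
    return [
        # The variable name is not numeric, so plain --set is fine here.
        "--set", f"env[{index}].name={name}",
        # --set-string keeps a value like 8192 as the string "8192" rather
        # than an integer, which the API server would reject for env.value.
        "--set-string", f"env[{index}].value={value}",
    ]
```

A caller would append these to its existing `helm install` argument list, for instance `helm_env_args("NIM_MAX_MODEL_LEN", 8192)`.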
5dea77b to 27e97f3
Review pass complete on Copilot threads:
CodeRabbit threads:
No behavioral change on the happy path that produced the 31/31 PASS in the description — these are review-quality fixes only. Ready for human reviewer. cc @NVIDIA/ncp-isv-lab-maintainer |
|
Thanks for the contribution @saiyam1814. We are currently discussing internally how we will handle additional providers, and we hope to get back to you as soon as possible.
/ok to test 27e97f3 |
🔐 TruffleHog Secret Scan✅ No secrets or credentials found! Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉 🕐 Last updated: 2026-05-28 16:24:14 UTC | Commit: 27e97f3 |
Thank you for checking, let me know if anything else is required.
The check-spdx-headers pre-commit hook (enforced on main) requires an SPDX header on every isvctl/** YAML. The bundled Kubeflow MPI Operator manifest predated that hook and was the only file failing the Pre-commit Checks job, which in turn failed the aggregate Pipeline Status gate. Header added via scripts/add_spdx_headers.py so it matches every other provider manifest in the repo. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Saiyam Pathak <saiyam911@gmail.com>
|
@abegnoche there were some merges to the branch, and the bundled Kubeflow MPI Operator manifest was missing the SPDX header now enforced by the check-spdx-headers hook. Added it via scripts/add_spdx_headers.py so it matches the other provider manifests; DCO is signed off. Could you /ok to test 8762f84 when you get a chance? Thanks!
|
/ok to test 8762f84 |
Signed-off-by: Saiyam Pathak <saiyam911@gmail.com> Assisted-by: Claude (Anthropic)
|
/ok to test 4c4e3d3 |
Signed-off-by: Saiyam Pathak <saiyam911@gmail.com> Assisted-by: Claude (Anthropic)
Signed-off-by: Saiyam Pathak <saiyam911@gmail.com> Assisted-by: Claude (Anthropic)
|
@abegnoche thank you for your patience. Ready for /ok to test 7065798
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
isvctl/configs/providers/vcluster/scripts/control-plane/delete_tenant.py (1)
77-77: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Make `resources_deleted` format consistent with real mode.
Demo mode returns `[args.group_name]` while line 107 returns `[f"vcluster/{args.group_name}"]`. For consistency and to avoid confusion when parsing output, use the same format.
🔧 Suggested fix
- result["resources_deleted"] = [args.group_name]
+ result["resources_deleted"] = [f"vcluster/{args.group_name}"]
🧹 Nitpick comments (1)
isvctl/configs/providers/vcluster/scripts/control-plane/delete_tenant.py (1)
46-119: ⚡ Quick win
Add docstrings to all functions.
The coding guidelines require docstrings for every function (PEP 257). Consider adding concise docstrings to `_kubeconfig_env`, `_run`, and `main` to improve maintainability and follow the project's documentation standards.
📝 Example docstrings
def _kubeconfig_env() -> dict[str, str]:
    """Return environment with KUBECONFIG set to VCLUSTER_HOST_KUBECONFIG if present."""
    ...

def _run(cmd: list[str], env: dict[str, str]) -> tuple[int, str, str]:
    """Run command and return (returncode, stdout, stderr)."""
    ...

def main() -> int:
    """Delete a vCluster tenant and return exit code (0=success, 1=failure)."""
    ...
Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@isvctl/configs/providers/vcluster/scripts/control-plane/delete_tenant.py`:
- Line 77: The demo-mode assignment to result["resources_deleted"] currently
uses [args.group_name] which is inconsistent with real-mode output; change the
demo branch to set result["resources_deleted"] = [f"vcluster/{args.group_name}"]
so both demo and real modes return the same "vcluster/{group}" formatted
resource identifier (refer to the result dict and args.group_name usage in this
function).
---
Nitpick comments:
In `@isvctl/configs/providers/vcluster/scripts/control-plane/delete_tenant.py`:
- Around line 46-119: Add concise PEP 257 docstrings to each top-level function:
_kubeconfig_env should document that it returns a copy of the current
environment with KUBECONFIG set to VCLUSTER_HOST_KUBECONFIG if present; _run
should document that it executes the given command with the provided env and
returns (returncode, stdout, stderr); and main should document that it parses
args, deletes the vcluster tenant (honouring DEMO_MODE), prints a JSON result
and returns the exit code. Place these short docstrings immediately below each
def line for _kubeconfig_env, _run, and main (triple-quoted, one- or two-line
descriptions). Ensure wording is concise and follows PEP 257 conventions.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 36f0738e-1dea-4304-a0fd-131fb9d00ecc
📒 Files selected for processing (15)
- isvctl/configs/providers/vcluster/config/control-plane.yaml
- isvctl/configs/providers/vcluster/config/k8s.yaml
- isvctl/configs/providers/vcluster/manifests/mpi-operator-v0.5.0.yaml
- isvctl/configs/providers/vcluster/scripts/control-plane/check_api.py
- isvctl/configs/providers/vcluster/scripts/control-plane/create_access_key.py
- isvctl/configs/providers/vcluster/scripts/control-plane/create_tenant.py
- isvctl/configs/providers/vcluster/scripts/control-plane/delete_access_key.py
- isvctl/configs/providers/vcluster/scripts/control-plane/delete_tenant.py
- isvctl/configs/providers/vcluster/scripts/control-plane/disable_access_key.py
- isvctl/configs/providers/vcluster/scripts/control-plane/get_tenant.py
- isvctl/configs/providers/vcluster/scripts/control-plane/list_tenants.py
- isvctl/configs/providers/vcluster/scripts/control-plane/test_access_key.py
- isvctl/configs/providers/vcluster/scripts/control-plane/verify_key_rejected.py
- isvctl/configs/providers/vcluster/scripts/k8s/setup.sh
- isvctl/configs/providers/vcluster/scripts/k8s/teardown.sh
🚧 Files skipped from review as they are similar to previous changes (13)
- isvctl/configs/providers/vcluster/scripts/k8s/teardown.sh
- isvctl/configs/providers/vcluster/scripts/control-plane/check_api.py
- isvctl/configs/providers/vcluster/scripts/control-plane/create_access_key.py
- isvctl/configs/providers/vcluster/config/control-plane.yaml
- isvctl/configs/providers/vcluster/config/k8s.yaml
- isvctl/configs/providers/vcluster/scripts/control-plane/get_tenant.py
- isvctl/configs/providers/vcluster/scripts/control-plane/disable_access_key.py
- isvctl/configs/providers/vcluster/scripts/control-plane/create_tenant.py
- isvctl/configs/providers/vcluster/scripts/control-plane/test_access_key.py
- isvctl/configs/providers/vcluster/scripts/control-plane/list_tenants.py
- isvctl/configs/providers/vcluster/scripts/control-plane/verify_key_rejected.py
- isvctl/configs/providers/vcluster/scripts/control-plane/delete_access_key.py
- isvctl/configs/providers/vcluster/scripts/k8s/setup.sh
|
/ok to test 7065798 |
|
Awesome, all checks have passed! Let me know if anything else is needed from my end @abegnoche
|
Any updates on this, @abegnoche?
|
sorry @saiyam1814 forgot to update, it was decided internally to accept new providers (all code under |
Summary
Adds vCluster (by vCluster Labs) as a validated CaaS provider in the NVIDIA ISV NCP Validation Suite.
vCluster is an open-source project that provisions isolated tenant clusters on top of any existing Kubernetes cluster (the Control Plane Cluster). Each tenant cluster has its own virtual control plane (API server, scheduler, controller manager) while sharing the host cluster's GPU nodes. vCluster is CNCF-certified for Kubernetes 1.28-1.35 in three configurations: vcluster-standalone, vcluster-with-private-nodes, and vcluster-with-shared-nodes. This PR validates the shared-nodes topology (`sync.fromHost.nodes`), which is the configuration that enables GPU workloads in tenant clusters to schedule onto the Control Plane Cluster's physical GPU nodes.
Provider files (`isvctl/configs/providers/vcluster/`):
- `config/k8s.yaml`: full k8s suite wiring with CNCF conformance skip patterns, GKE COS GPU overrides, A100 MIG labeling, NCCL multi-node configuration, and API network ACL probe
- `config/control-plane.yaml`: control-plane suite wiring
- `scripts/k8s/setup.sh`: tenant cluster lifecycle, GPU Operator handling, MPI Operator install (server-side apply), A100/H100 MIG label seeding
- `scripts/k8s/teardown.sh`: tenant cluster teardown with taint restoration
- `scripts/control-plane/`: access key and tenant management scripts
- `manifests/mpi-operator-v0.5.0.yaml`: bundled Kubeflow MPI Operator manifest for NCCL multi-node workloads
- `README.md`: provider documentation
Workload / framework improvements (apply across providers):
- `isvtest/src/isvtest/workloads/k8s_nim.py`: `runtime_class_name`, `nim_memory_request`, `nim_memory_limit` config params
- `isvtest/src/isvtest/workloads/k8s_nim_helm.py`: `--kubeconfig` targeting, memory and `runtimeClassName` params, `NIM_MAX_MODEL_LEN` env via `--set-string`
- `isvtest/src/isvtest/workloads/k8s_nccl_multinode.py`: `override_launcher_affinity` and `runtime_class_name` for non-control-plane MPI launchers
- `isvtest/src/isvtest/workloads/k8s_stress.py`: `runtime_class_name` config param
- `isvtest/src/isvtest/validations/k8s_conformance.py`: robust JUnit retrieval: retry on transient stream resets, `kubectl cp` fallback, and an opt-in pre-staged local file path for providers behind managed-K8s LBs
- `isvtest/src/isvtest/core/nvidia.py`: GPU row regex fix for non-standard GPU names
Test Infrastructure
Control Plane Cluster: GKE `vcluster-isv-gpu-test`, `us-central1-c`, Kubernetes `v1.35.3-gke.2190000`
Node pools: `default-pool`, `gpu-pool`, `gpu-a100-pool`
Tenant cluster: `vcluster-isv-validation` in namespace `vcluster-isv-validation`, vCluster `0.34.0`, configured with `sync.fromHost.nodes` to expose host GPU capacity into the tenant.
Test Results
`make test` / `make lint`: PASS
728 unit tests pass; ruff, yamlfmt, and SPDX header checks all pass.
Control-plane suite: 11 / 11 PASS
All 11 control-plane checks pass: `FieldExistsCheck`, `FieldValueCheck`, `AccessKeyCreatedCheck`, `TenantCreatedCheck`, `AccessKeyAuthenticatedCheck`, `AccessKeyDisabledCheck`, `AccessKeyRejectedCheck`, `TenantListedCheck`, `TenantInfoCheck`, `StepSuccessCheck-delete_access_key`, `StepSuccessCheck-delete_tenant`.
Kubernetes suite: 31 / 31 PASS
- `nvidia-smi` on synced GPU nodes
- `driver_version: ""`: GKE manages drivers natively
- `gpu-operator` namespace in tenant
- `nvidia.com/gpu.present=true` on synced nodes
- `nvidia.com/mig.capable=true` and `nvidia.com/mig.strategy=single` in `setup.sh` (GFD cannot run on GKE COS)
- `kubectl` in vCluster namespace
- `curl -f` with HTTP 401 (probe exits 22 → ACL enforced); authorized probe via `kubectl` succeeds
- `certified-conformance` mode, `v1.35.0`: 419 / 7353 passed, 0 failed, 0 errors, 6934 skipped (the 6934 are tests outside the `[Conformance]` focus plus the 28-pattern provider skip list documented below)
- `setup.sh` (server-side apply), launcher affinity + `runtimeClassName` overrides applied. `min_bus_bw_gbps: 0` pinned because the cluster uses T4 + plain GKE pod networking; reviewers on A100/H100 + IB rigs should override this in their provider config
- `NIM_MAX_MODEL_LEN`)
- `llama-3.2-1b-instruct` via Helm
- `llama-3.2-3b-instruct` via Helm
CNCF Conformance Skips
`K8sCncfConformanceCheck` runs in `certified-conformance` mode (full `[Conformance]` suite). 28 test patterns across 15 architectural limitation groups are skipped, all specific to the `sync.fromHost.nodes` topology used to expose GPU capacity from host nodes into the tenant cluster.
vCluster passes the full conformance suite with zero skips on dedicated nodes with `virtualScheduler.enabled`; see the official certification at https://github.com/cncf/k8s-conformance/tree/master/v1.35/vcluster-with-shared-nodes. The skips below arise only because `sync.fromHost.nodes` makes several node properties read-only and rewrites virtual node InternalIPs to pod-CIDR addresses.
- `sync.toHost.runtimeClasses`; test-created RC not visible on host
Each pattern is documented inline in `config/k8s.yaml` with the exact failing upstream test name.
Key Design Decisions
- `nvidia` runtime handler in containerd. GPU pods use `nvidia.com/gpu` resource limits only; `runtimeClassName: nvidia` is explicitly omitted throughout workload manifests.
- `nvidia.com/gpu.present=true`), `setup.sh` creates only the `gpu-operator` namespace and a pause-image Deployment in the tenant. Installing the full chart would create duplicate DaemonSet pods on host nodes.
- `setup.sh` auto-detects A100/H100 nodes (via `cloud.google.com/gke-accelerator`) and applies both `nvidia.com/mig.capable=true` and `nvidia.com/mig.strategy=single` directly.
- `manifests/` and applied by `setup.sh` with `kubectl apply --server-side --force-conflicts` because the MPIJob CRD's OpenAPI schema exceeds the 256 KiB `last-applied-configuration` annotation limit of client-side apply.
- `kubectl` with the bound ServiceAccount token; the unauthorized probe is `curl -f` against `https://<LB>/api` with no credentials: the vCluster API returns 401 Unauthorized and `curl -f` exits 22, which the check counts as "blocked".
- `kubectl exec` streams. `k8s_conformance.py` now retries the `cat` stream, falls back to `kubectl cp` (tar framing), and supports an opt-in `ISVTEST_CONFORMANCE_JUNIT_LOCAL_PATH` env var for providers that pre-stage the file out-of-band.
- `setup.sh` auto-detects the cloud provider and uses `vcluster connect --expose` (LoadBalancer) for a stable kubeconfig across long-running test phases.
- `setup.sh` temporarily removes the `nvidia.com/gpu:NoSchedule` taint so conformance/workload BeforeSuite treats all virtual nodes as schedulable; `teardown.sh` restores taints before tenant deletion.
Test plan
- `make test`: 728 unit tests pass
- `make lint`: ruff, yamlfmt, SPDX headers all pass
- `K8sCncfConformanceCheck` (certified-conformance, v1.35.0): PASS, 419/7353 passed, 0 failed
- `K8sMigConfigCheck`: PASS on A100 with `nvidia.com/mig.capable` + `nvidia.com/mig.strategy`
- `K8sApiNetworkAclCheck`: PASS via HTTP 401 unauthenticated probe
- `K8sNcclMultiNodeWorkload`: PASS with bundled MPI Operator, affinity overrides, and T4-appropriate bandwidth floor
- `K8sGpuStressWorkload`, `K8sNimInferenceWorkload`, `K8sNimHelmWorkload-1b/-3b`: all PASS on T4 nodes
🤖 Generated with Claude Code
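The JUnit retrieval strategy described under Key Design Decisions (retry the stream, then fall back to a copy, with an opt-in local file) can be sketched as below. This is not the code in `k8s_conformance.py`: the callables `exec_cat` and `kubectl_cp`, the retry count, and the choice to check the environment variable first are all assumptions made for illustration.

```python
import os
import time
from typing import Callable

# Hypothetical callables: each returns the JUnit XML text or raises OSError.
Fetch = Callable[[], str]


def fetch_junit(exec_cat: Fetch, kubectl_cp: Fetch,
                retries: int = 3, delay_s: float = 0.0) -> str:
    """Return JUnit XML using local override, retried stream, then copy."""
    # Opt-in: a provider may pre-stage the file out-of-band.
    local = os.environ.get("ISVTEST_CONFORMANCE_JUNIT_LOCAL_PATH")
    if local:
        with open(local, encoding="utf-8") as fh:
            return fh.read()
    # Retry the streamed read; connection resets are OSError subclasses.
    for _attempt in range(retries):
        try:
            return exec_cat()
        except OSError:
            time.sleep(delay_s)
    # Last resort: the copy-based path (tar framing in the real tool).
    return kubectl_cp()
```

Keeping the two fetch paths as injected callables makes the ordering easy to unit-test without a cluster.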