feat: add nvidia gpu support to coco pattern by butler54 · Pull Request #82 · validatedpatterns/coco-pattern

butler54 · 2026-04-24T05:44:03Z

No description provided.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Replace git branch references (repoURL/targetRevision/path) with released Helm chart references (chart/chartVersion) for trustee, sandboxed-containers, and sandboxed-policies in values-baremetal.yaml.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add tdx.enabled flag (default true) to the baremetal chart to conditionally set the kvm_intel.tdx=1 kernel argument. Without this, the kvm_intel module does not activate TDX and NFD cannot detect it.
Enable the intel-dcap application in values-baremetal.yaml for PCCS/QGS attestation services.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…mplates
Address PR review feedback:
- Remove detect-runtime-class.yaml (OSC operator manages RuntimeClass)
- Remove bm-kernel-params.yaml and kernel-params-mco.yaml (config should be provided via initdata or pod annotations to avoid inconsistencies)
- Remove commented-out runtimeclass templates for AMD SNP and Intel TDX
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: Chris Butler <chris.butler@redhat.com>

Conflicts resolved:
- _helpers.tpl: kept runtimeClassName override support from baremetal
- kbs-access/values.yaml: merged main's structure with runtimeClassName param
- kbs-access/secure-pod.yaml: accepted deletion (replaced by secure-deployment.yaml)
- kbs-access/secure-deployment.yaml: added runtimeClassName values override support
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Add Kyverno chart and coco-kyverno-policies to baremetal values
- Update trustee chart to 0.3.* with kbs.admin.format v1.1
- Remove bypassAttestation (proper attestation via init_data)
- Remove explicit runtimeClassName overrides (auto-detected by platform)
- Add syncPolicy prune to hello-openshift and kbs-access
- Reset default clusterGroupName to simple
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The policy only fired on Pod/Deployment CREATE, so pods created before the initdata ConfigMap existed never got the cc_init_data annotation. Adding UPDATE allows Kyverno to inject the annotation when a Deployment is updated (e.g. by ArgoCD sync), triggering a rolling restart with the correct initdata.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…e generation
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Adds RAW_HASH field to both initdata and debug-initdata ConfigMaps.
- PCR8_HASH = SHA256(zeros || SHA256(toml)): used by Azure vTPM attestation
- RAW_HASH = SHA256(toml): used by baremetal TDX/SNP attestation
Both are needed because Azure and baremetal present initdata differently in their attestation evidence. A single Trustee attestation server must accept both formats to support multi-platform deployments.
Future: integrate veritas for comprehensive reference value generation.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
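The two hash definitions above can be sketched in Python as follows. This is a minimal illustration, not the script used by the chart: the function names are hypothetical, `zeros` is assumed to be the 32-byte all-zero initial value of a SHA-256 PCR, and the input is assumed to be the initdata TOML bytes exactly as they are measured.

```python
import hashlib


def raw_hash(toml_bytes: bytes) -> str:
    """RAW_HASH = SHA256(toml), as hex."""
    return hashlib.sha256(toml_bytes).hexdigest()


def pcr8_hash(toml_bytes: bytes) -> str:
    """PCR8_HASH = SHA256(zeros || SHA256(toml)), as hex.

    Assumption: 'zeros' is 32 zero bytes, i.e. an unextended SHA-256 PCR.
    """
    initial_pcr = bytes(32)                          # 32 zero bytes
    measurement = hashlib.sha256(toml_bytes).digest()  # binary digest, not hex
    return hashlib.sha256(initial_pcr + measurement).hexdigest()


if __name__ == "__main__":
    sample = b'algorithm = "sha256"\n'  # placeholder content, not real initdata
    print("RAW_HASH :", raw_hash(sample))
    print("PCR8_HASH:", pcr8_hash(sample))
```

Note that the extend step concatenates the binary digest, not its hex text; hashing the hex string would give a different value.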

Temporarily uses the butler54/trustee-chart feature/baremetal-attestation branch instead of the released chart. This branch includes:
- Baremetal TDX and SNP attestation rules
- Conditional pcr-stash (no error on baremetal without vTPM)
- Raw init_data hash (zero-padded) for baremetal attestation
- TDX QCNL config with use_secure_cert: false for local PCCS
Revert to chartVersion after merging and releasing the trustee chart.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The kbs-access-app container image is ~1GB, which causes container creation timeouts with the default 2GB kata VM memory.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The autogen Deployment rule causes admission failures when the initdata ConfigMap hasn't been propagated to the workload namespace yet. By targeting Pods only (autogen-controllers: none), Deployments are admitted without ConfigMap resolution. Pods get cc_init_data injected at creation time when the ConfigMap is available. A rollout restart picks up new initdata values.
Also removes the UPDATE operation: only CREATE is needed, since a rollout restart creates new Pods.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
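The admission behaviour described above can be modelled with a small Python sketch. It is only an illustration of the matching logic, not Kyverno code: the function and parameter names are hypothetical, the annotation key is written as it appears in the commit message, and what happens when the lookup fails for a Pod is an assumption (here the error simply propagates).

```python
from typing import Callable, Dict

# Annotation key as written in the commit message (the real key may be longer).
INITDATA_ANNOTATION = "cc_init_data"


def mutate(kind: str,
           operation: str,
           annotations: Dict[str, str],
           lookup_initdata: Callable[[], str]) -> Dict[str, str]:
    """Return the annotations an object would carry after admission.

    Only Pod CREATE requests match, so the ConfigMap lookup is never
    attempted for Deployments or for UPDATE requests.
    """
    result = dict(annotations)  # do not modify the caller's dict
    if kind == "Pod" and operation == "CREATE":
        # Assumption: a failed lookup raises and the request is not mutated.
        result[INITDATA_ANNOTATION] = lookup_initdata()
    return result
```

Because the lookup is passed in as a callable, the sketch makes the key point visible: a Deployment is admitted even when the ConfigMap does not exist yet, since the lookup is never invoked for it.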

Without braces, bash treats $initial_pcr followed by the hex hash as a single undefined variable name, producing the SHA-256 of the empty string instead of the correct PCR extend value.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
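One symptom of this bug is easy to check for: when the extend input is empty, the result is the well-known SHA-256 digest of the empty string. A small Python sketch (the helper name is hypothetical and not part of the pattern) that flags such a value:

```python
import hashlib

# SHA-256 of zero bytes of input; a computed PCR value equal to this
# suggests the shell variable expanded to nothing.
EMPTY_SHA256 = hashlib.sha256(b"").hexdigest()


def is_empty_input_digest(hex_digest: str) -> bool:
    """True if hex_digest is the SHA-256 of the empty string."""
    return hex_digest.strip().lower() == EMPTY_SHA256


if __name__ == "__main__":
    print(EMPTY_SHA256)
```

A check like this could be run against a generated PCR8_HASH before it is published as a reference value.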

Enables NVIDIA confidential GPU (H100/H200) on bare metal deployments with full CoCo integration. Addresses three documented gaps in the Red Hat OSC 1.12 documentation:
- Gap 1: Pin GPU Operator to v26.3.0 (v26.3.1 breaks kata state machine)
- Gap 2: Include kataSandboxDevicePlugin in ClusterPolicy (required for nvidia.com/pgpu resource advertisement)
- Gap 3: Add imperative job to re-reconcile KataConfig after GPU Operator labels nodes (kata-cc-nvidia-gpu RuntimeClass creation)
New charts:
- charts/all/nvidia-gpu: ClusterPolicy CR and IOMMU MachineConfig
- charts/coco-supported/gpu-workload: CUDA vectorAdd sample deployment
Also extends Kyverno initdata injection to support the kata-cc-nvidia-gpu runtime class and propagate initdata to the gpu-workload namespace.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: Chris Butler <chris.butler@redhat.com>

Rewrite ClusterPolicy to match the official Red Hat OCP 4.21.9+ documentation for NVIDIA confidential GPU support. Key changes:
- Remove hardcoded cc-manager v0.1.0 (was from old v25.3.x line, caused IntelRootPort crash); let GPU Operator manage its versions
- Remove hardcoded CC_CAPABLE_DEVICE_IDS and sandbox device plugin image/version; operator fills in correct defaults
- Disable host-side components not needed for CC passthrough (driver, dcgm, toolkit, migManager); driver runs inside kata VM via initrd
- Add kataSandboxDevicePlugin env vars (P_GPU_ALIAS, NVSWITCH_ALIAS)
- Add vfioManager BIND_NVSWITCHES env var
- Add both amd_iommu=on and intel_iommu=on to IOMMU MachineConfig
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The background controller crashes with leader election timeouts at the default 128Mi limit when processing multiple generate policies for cc_init_data propagation across namespaces.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

GPU attestation fails at Trustee ("contraindicated") while TDX CPU attestation passes. Switch to a permissive agent policy to allow exec into the kata VM for further investigation.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Driver 580.105.08 inside the kata VM provides CUDA 12.8+ runtime. The old cuda11.7.1 sample fails with "CUDA driver version is insufficient for CUDA runtime version".
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The NVIDIA cuda-sample images (cuda11.7.1, cuda12.5.0) have CUDA runtime version mismatches with the driver inside the kata VM GPU initrd (580.105.08). Switch to the Red Hat gpu-verifier:ubi9 image, which bundles CUDA samples built for the correct driver version.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: Chris Butler <chris.butler@redhat.com>

butler54 and others added 18 commits March 10, 2026 11:22

feat: add bare metal support for Intel TDX and AMD SEV-SNP

bad2552

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Merge branch 'main' into baremetal-tp-releases-squashed

188c674

feat: update to OSC 1.12 / Trustee 1.1.0

fbce1aa

Signed-off-by: Chris Butler <chris.butler@redhat.com>

fix: set clusterGroupName to baremetal for deployment testing

a601af0

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

feat: add intel-device-plugins-operator subscription for SGX/TDX quot…

27c71e5

…e generation
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix: enable TDX config in trustee to point QCNL at local PCCS service

e462936

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix: increase kata VM memory for kbs-access to 8192MB

070ca0e

The kbs-access-app container image is ~1GB, which causes container creation timeouts with the default 2GB kata VM memory.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

butler54 requested a review from a team April 24, 2026 05:44

butler54 and others added 8 commits April 27, 2026 21:04

feat: segregated gpu bm from bm for ease of testing

bbcb622

Signed-off-by: Chris Butler <chris.butler@redhat.com>

feat: correct BM deployment

40f41ae

Signed-off-by: Chris Butler <chris.butler@redhat.com>

chore: gitignore changes

0e56588

Signed-off-by: Chris Butler <chris.butler@redhat.com>


butler54 wants to merge 26 commits into validatedpatterns:main from butler54:feat/nvidia-confidential-gpu
