feat: add nvidia gpu support to coco pattern #82
Open
butler54 wants to merge 26 commits into validatedpatterns:main from
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace git branch references (repoURL/targetRevision/path) with released Helm chart references (chart/chartVersion) for trustee, sandboxed-containers, and sandboxed-policies in values-baremetal.yaml.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add tdx.enabled flag (default true) to baremetal chart to conditionally set kvm_intel.tdx=1 kernel argument. Without this, the kvm_intel module does not activate TDX and NFD cannot detect it.
Enable intel-dcap application in values-baremetal.yaml for PCCS/QGS attestation services.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…mplates
Address PR review feedback:
- Remove detect-runtime-class.yaml (OSC operator manages RuntimeClass)
- Remove bm-kernel-params.yaml and kernel-params-mco.yaml (config should be provided via initdata or pod annotations to avoid inconsistencies)
- Remove commented-out runtimeclass templates for AMD SNP and Intel TDX
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Chris Butler <chris.butler@redhat.com>
Conflicts resolved:
- _helpers.tpl: kept runtimeClassName override support from baremetal
- kbs-access/values.yaml: merged main's structure with runtimeClassName param
- kbs-access/secure-pod.yaml: accepted deletion (replaced by secure-deployment.yaml)
- kbs-access/secure-deployment.yaml: added runtimeClassName values override support
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add Kyverno chart and coco-kyverno-policies to baremetal values
- Update trustee chart to 0.3.* with kbs.admin.format v1.1
- Remove bypassAttestation (proper attestation via init_data)
- Remove explicit runtimeClassName overrides (auto-detected by platform)
- Add syncPolicy prune to hello-openshift and kbs-access
- Reset default clusterGroupName to simple
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The policy only fired on Pod/Deployment CREATE, so pods created before the initdata ConfigMap existed never got the cc_init_data annotation. Adding UPDATE allows Kyverno to inject the annotation when a Deployment is updated (e.g. by ArgoCD sync), triggering a rolling restart with the correct initdata.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…e generation Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds RAW_HASH field to both initdata and debug-initdata ConfigMaps.
- PCR8_HASH = SHA256(zeros || SHA256(toml)), used by Azure vTPM attestation
- RAW_HASH = SHA256(toml), used by baremetal TDX/SNP attestation
Both are needed because Azure and baremetal present initdata differently in their attestation evidence. A single Trustee attestation server must accept both formats to support multi-platform deployments.
Future: integrate veritas for comprehensive reference value generation.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
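The two formulas in the commit message above can be sketched in a few lines of Python. This is an illustration only, not the pattern's own script: the function names are hypothetical, `zeros` is assumed to be 32 zero bytes (the usual initial value of a SHA-256 PCR bank), and the exact byte string hashed as `toml` (raw file versus an encoded form) is not stated in the commit, so the sample input is a placeholder.

```python
import hashlib


def raw_hash(toml_bytes: bytes) -> str:
    """RAW_HASH = SHA256(toml), as described for baremetal TDX/SNP."""
    return hashlib.sha256(toml_bytes).hexdigest()


def pcr8_hash(toml_bytes: bytes) -> str:
    """PCR8_HASH = SHA256(zeros || SHA256(toml)), as described for Azure vTPM.

    Assumption: 'zeros' is a 32-byte all-zero initial PCR value.
    """
    initial_pcr = bytes(32)  # assumed initial PCR contents
    measurement = hashlib.sha256(toml_bytes).digest()
    return hashlib.sha256(initial_pcr + measurement).hexdigest()


if __name__ == "__main__":
    # Placeholder initdata content, not taken from the PR.
    sample = b'algorithm = "sha256"\nversion = "0.1.0"\n'
    print("RAW_HASH :", raw_hash(sample))
    print("PCR8_HASH:", pcr8_hash(sample))
```

Because the PCR form hashes the binary digest rather than the TOML itself, the two values always differ for the same input, which is why a reference-value set that serves both platforms needs to carry both.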
Temporarily uses butler54/trustee-chart feature/baremetal-attestation branch instead of released chart. This branch includes:
- Baremetal TDX and SNP attestation rules
- Conditional pcr-stash (no error on baremetal without vTPM)
- Raw init_data hash (zero-padded) for baremetal attestation
- TDX QCNL config with use_secure_cert: false for local PCCS
Revert to chartVersion after merging and releasing trustee chart.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The kbs-access-app container image is ~1GB which causes container creation timeouts with the default 2GB kata VM memory.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The autogen Deployment rule causes admission failures when the initdata ConfigMap hasn't been propagated to the workload namespace yet. By targeting Pods only (autogen-controllers: none), Deployments are admitted without ConfigMap resolution. Pods get cc_init_data injected at creation time when the ConfigMap is available. A rollout restart picks up new initdata values.
Also removes the UPDATE operation; only CREATE is needed since a rollout restart creates new Pods.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Without braces, bash treats $initial_pcr followed by the hex hash as a single undefined variable name, producing SHA-256 of empty string instead of the correct PCR extend value.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
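The symptom described above is easy to recognise, because the SHA-256 of empty input is a fixed, well-known constant. The sketch below is a hypothetical Python cross-check, not the shell script from the PR (which is not shown here): `pcr_extend` is an illustrative name, and it assumes both values are passed as hex strings. It computes the extend value and refuses to return the empty-input digest.

```python
import hashlib

# SHA-256 of zero bytes of input; seeing this value in a reference-value
# ConfigMap suggests the hash input was accidentally empty.
EMPTY_SHA256 = "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"


def pcr_extend(initial_pcr_hex: str, measurement_hex: str) -> str:
    """Return SHA256(initial_pcr || measurement) as a hex string.

    Raises ValueError when the combined input is empty, which is the
    failure mode the commit describes for the unbraced bash variable.
    """
    data = bytes.fromhex(initial_pcr_hex) + bytes.fromhex(measurement_hex)
    digest = hashlib.sha256(data).hexdigest()
    if digest == EMPTY_SHA256:
        raise ValueError("hash input was empty; check variable expansion")
    return digest


if __name__ == "__main__":
    zeros = "00" * 32  # assumed 32-byte initial PCR value
    measurement = hashlib.sha256(b"example initdata").hexdigest()
    print(pcr_extend(zeros, measurement))
```

Comparing a generated hash against `EMPTY_SHA256` is a cheap guard that could also be applied to the output of the shell script itself.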
Enables NVIDIA confidential GPU (H100/H200) on bare metal deployments with full CoCo integration. Addresses three documented gaps in the Red Hat OSC 1.12 documentation:
- Gap 1: Pin GPU Operator to v26.3.0 (v26.3.1 breaks kata state machine)
- Gap 2: Include kataSandboxDevicePlugin in ClusterPolicy (required for nvidia.com/pgpu resource advertisement)
- Gap 3: Add imperative job to re-reconcile KataConfig after GPU Operator labels nodes (kata-cc-nvidia-gpu RuntimeClass creation)
New charts:
- charts/all/nvidia-gpu: ClusterPolicy CR and IOMMU MachineConfig
- charts/coco-supported/gpu-workload: CUDA vectorAdd sample deployment
Also extends Kyverno initdata injection to support kata-cc-nvidia-gpu runtime class and propagate initdata to gpu-workload namespace.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Chris Butler <chris.butler@redhat.com>
Signed-off-by: Chris Butler <chris.butler@redhat.com>
Rewrite ClusterPolicy to match the official Red Hat OCP 4.21.9+ documentation for NVIDIA confidential GPU support. Key changes:
- Remove hardcoded cc-manager v0.1.0 (was from old v25.3.x line, caused IntelRootPort crash); let GPU Operator manage its versions
- Remove hardcoded CC_CAPABLE_DEVICE_IDS and sandbox device plugin image/version; operator fills in correct defaults
- Disable host-side components not needed for CC passthrough (driver, dcgm, toolkit, migManager); driver runs inside kata VM via initrd
- Add kataSandboxDevicePlugin env vars (P_GPU_ALIAS, NVSWITCH_ALIAS)
- Add vfioManager BIND_NVSWITCHES env var
- Add both amd_iommu=on and intel_iommu=on to IOMMU MachineConfig
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The background controller crashes with leader election timeouts at the default 128Mi limit when processing multiple generate policies for cc_init_data propagation across namespaces.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
GPU attestation fails at trustee ("contraindicated") while TDX CPU
attestation passes. Switch to permissive agent policy to allow exec
into the kata VM for further investigation.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Driver 580.105.08 inside the kata VM provides CUDA 12.8+ runtime. The old cuda11.7.1 sample fails with "CUDA driver version is insufficient for CUDA runtime version".
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The NVIDIA cuda-sample images (cuda11.7.1, cuda12.5.0) have CUDA runtime version mismatches with the driver inside the kata VM GPU initrd (580.105.08). Switch to the Red Hat gpu-verifier:ubi9 image which bundles CUDA samples built for the correct driver version.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Chris Butler <chris.butler@redhat.com>
No description provided.