[Staging] Add SLES support for AMD gpu-operator#371

Closed
Priyankasaggu11929 wants to merge 109 commits into ROCm:staging from Priyankasaggu11929:enable-sles-support

Conversation

@Priyankasaggu11929

(based on comment #365 (review) from the original PR)

Motivation

This PR adds support for SUSE Linux Enterprise Server (SLES) 15 SP5+ to the AMD GPU operator.

Technical Details

  • 781c5b5 - add support for detecting SLES nodes and automatically selecting appropriate AMD GPU driver versions

  • 0170a9a - add a SLES Dockerfile template (DockerfileTemplate.sles) for building AMD GPU drivers on SLES. The GIM Dockerfile template for SLES is not included yet; I will tackle it once this PR goes through.

    • also embed the template via go:embed and add SLES case logic
  • c2dce44 - docs: update example/deviceconfig_example.yaml <- dropped

  • 4da60d3 - use "registry.suse.com" as the default base image registry if OS == "sles"

    • a user-specified BaseImageRegistry still takes precedence
    • also extend tests in internal/kmmmodule/kmmmodule_test.go to cover the above changes in the resolveDockerfile func
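
The registry precedence described for 4da60d3 could be sketched roughly as below. This is an illustration only, not the code from the PR: the function names `defaultBaseImageRegistry` and `resolveBaseImageRegistry` are hypothetical, and the `docker.io` fallback for non-SLES operating systems is an assumption that does not come from this description.

```go
package main

import "fmt"

// defaultBaseImageRegistry returns the registry to use when the user has
// not set one. Only the "sles" case comes from the PR description; the
// "docker.io" fallback is an assumption made for this sketch.
func defaultBaseImageRegistry(osName string) string {
	if osName == "sles" {
		return "registry.suse.com"
	}
	return "docker.io"
}

// resolveBaseImageRegistry (hypothetical name) lets a user-specified
// registry take precedence over the per-OS default.
func resolveBaseImageRegistry(userRegistry, osName string) string {
	if userRegistry != "" {
		return userRegistry
	}
	return defaultBaseImageRegistry(osName)
}

func main() {
	// No user value on SLES: the SUSE registry is selected.
	fmt.Println(resolveBaseImageRegistry("", "sles"))
	// A user value wins regardless of the OS.
	fmt.Println(resolveBaseImageRegistry("registry.example.com", "sles"))
}
```

Keeping the per-OS default in its own function means a new distribution only needs one extra case, while the precedence rule stays in a single place.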

Test Plan

  • b625441 - tests: update internal/utils_test.go for added support for SLES 15 SP*

Test Result

  • truncated output of make unit-test after adding the new tests in b625441

    > make unit-test
    ...
    ...
    === RUN   TestSLESDefaultDriverVersionsMapper
    === RUN   TestSLESDefaultDriverVersionsMapper/SLES_15_SP6
    === RUN   TestSLESDefaultDriverVersionsMapper/SLES_15_SP7
    === RUN   TestSLESDefaultDriverVersionsMapper/SLES_15_SP5
    === RUN   TestSLESDefaultDriverVersionsMapper/SLES_15_SP4
    === RUN   TestSLESDefaultDriverVersionsMapper/SLES_15_base
    === RUN   TestSLESDefaultDriverVersionsMapper/SLES_15_with_dash_format
    --- PASS: TestSLESDefaultDriverVersionsMapper (0.00s)
        --- PASS: TestSLESDefaultDriverVersionsMapper/SLES_15_SP6 (0.00s)
        --- PASS: TestSLESDefaultDriverVersionsMapper/SLES_15_SP7 (0.00s)
        --- PASS: TestSLESDefaultDriverVersionsMapper/SLES_15_SP5 (0.00s)
        --- PASS: TestSLESDefaultDriverVersionsMapper/SLES_15_SP4 (0.00s)
        --- PASS: TestSLESDefaultDriverVersionsMapper/SLES_15_base (0.00s)
        --- PASS: TestSLESDefaultDriverVersionsMapper/SLES_15_with_dash_format (0.00s)
    PASS
    coverage: 48.6% of statements
    ok  	github.com/ROCm/gpu-operator/internal	0.019s	coverage: 48.6% of statements
    === RUN   TestAPIs
    Running Suite: Controller Suite - /home/psaggu/work-suse/amd-gpu-operator-work/oct-23-pr/gpu-operator/internal/controllers
    ==========================================================================================================================
    Random Seed: 1761223798
    
    Will run 15 of 15 specs
    •••••••••••••••
    
    Ran 15 of 15 Specs in 0.008 seconds
    SUCCESS! -- 15 Passed | 0 Failed | 0 Pending | 0 Skipped
    --- PASS: TestAPIs (0.01s)
    PASS
    coverage: 7.9% of statements
    ok  	github.com/ROCm/gpu-operator/internal/controllers	(cached)	coverage: 7.9% of statements
    === RUN   TestAPIs
    Running Suite: KMMModule Suite - /home/psaggu/work-suse/amd-gpu-operator-work/oct-23-pr/gpu-operator/internal/kmmmodule
    =======================================================================================================================
    Random Seed: 1761223798
    
    Will run 5 of 5 specs
    testing multiple valid homogeneous nodes
    testing multiple valid heterogeneous nodes
    testing multiple valid heterogeneous nodes + one unsupported node
    testing multiple unsupported nodes
    testing empty node list
    •<moduleName>
    <amdgpu>
    •<moduleName>
    <amdgpu>
    •••
    
    Ran 5 of 5 Specs in 0.005 seconds
    SUCCESS! -- 5 Passed | 0 Failed | 0 Pending | 0 Skipped
    --- PASS: TestAPIs (0.01s)
    PASS
    coverage: 32.3% of statements
    ok  	github.com/ROCm/gpu-operator/internal/kmmmodule	(cached)	coverage: 32.3% of statements
    
    •••••••••••••••
    
    Ran 15 of 15 Specs in 0.008 seconds
    SUCCESS! -- 15 Passed | 0 Failed | 0 Pending | 0 Skipped
    

  • output from tests added as part of 4da60d3

    ❯ go test ./internal/kmmmodule/... -v -ginkgo.focus="resolveDockerfile" -ginkgo.v
    === RUN   TestAPIs
    Running Suite: KMMModule Suite - /home/psaggu/work-suse/amd-gpu-operator-work/oct-23-pr/gpu-operator/internal/kmmmodule
    =======================================================================================================================
    Random Seed: 1761548380
    
    Will run 3 of 8 specs
    SSSS
    ------------------------------
    resolveDockerfile should use correct default registry when not specified by user
    /home/psaggu/work-suse/amd-gpu-operator-work/oct-23-pr/gpu-operator/internal/kmmmodule/kmmmodule_test.go:683
    • [0.000 seconds]
    ------------------------------
    resolveDockerfile should respect user-specified BaseImageRegistry for all OS types
    /home/psaggu/work-suse/amd-gpu-operator-work/oct-23-pr/gpu-operator/internal/kmmmodule/kmmmodule_test.go:702
    • [0.000 seconds]
    ------------------------------
    resolveDockerfile should return error for unsupported OS
    /home/psaggu/work-suse/amd-gpu-operator-work/oct-23-pr/gpu-operator/internal/kmmmodule/kmmmodule_test.go:727
    • [0.000 seconds]
    ------------------------------
    S
    
    Ran 3 of 8 Specs in 0.000 seconds
    SUCCESS! -- 3 Passed | 0 Failed | 0 Pending | 5 Skipped
    --- PASS: TestAPIs (0.00s)
    PASS
    ok  	github.com/ROCm/gpu-operator/internal/kmmmodule	0.022s
    

Submission Checklist

yansun1996 and others added 28 commits November 19, 2025 13:15
Signed-off-by: yansun1996 <Yan.Sun3@amd.com>
Signed-off-by: yansun1996 <Yan.Sun3@amd.com>
* add suspend and resume functionality for remediation workflows

* minor updates to docs

* minor refactoring to avoid duplicate k8s get calls

* add default configmap

* fix helm chart issues

* address code review comments

* move remediation configs and scripts into separate files

* add jq package to utils_container
Signed-off-by: yansun1996 <Yan.Sun3@amd.com>
Co-authored-by: Yuva Shankar <11082310+yuva29@users.noreply.github.com>
… dashboard

Signed-off-by: yansun1996 <Yan.Sun3@amd.com>
…fter partitioning

Signed-off-by: yansun1996 <Yan.Sun3@amd.com>
…071)

Signed-off-by: yansun1996 <Yan.Sun3@amd.com>
Signed-off-by: yansun1996 <Yan.Sun3@amd.com>
* make max parallel workflows configurable for auto remediation

* add zero value in default CR

* address review comments

(cherry picked from commit f023a5c)
Signed-off-by: yansun1996 <Yan.Sun3@amd.com>
(cherry picked from commit 3f1a1ee2ea08f7675a6aba6cd60ed2f06ca7bdc6)
Signed-off-by: yansun1996 <Yan.Sun3@amd.com>
* remediation e2e tests for suspend and resume actions

* add e2e test for recoverypolicy cr

* use init container image from dev.env

(cherry picked from commit 3e0f7aa)
Signed-off-by: yansun1996 <Yan.Sun3@amd.com>
Signed-off-by: yansun1996 <Yan.Sun3@amd.com>
* GPUOP-525 update auto node remediation documentation

* address review comments

(cherry picked from commit 8e3f3e0)
* customize auto node remediation options

* address review comments

* commit generated files

* support custom labels and taints in workflow

* handle custom drain policy

* update documentation

* fix e2e test

(cherry picked from commit 8dd5196)
Signed-off-by: yansun1996 <Yan.Sun3@amd.com>
Signed-off-by: yansun1996 <Yan.Sun3@amd.com>
ci-penbot-01 and others added 23 commits April 1, 2026 17:11
* upgrade Argo workflow CRDs and controller to v4.0.3 (#1235)

* upgrade Argo workflow CRDs and controller to v4.0.3

* update controller image version to v4.0.3

(cherry picked from commit 155a669)

* Update amd-gpu-operator.clusterserviceversion.yaml

---------

Co-authored-by: Uday Bhaskar <udayb@amd.com>
Co-authored-by: Praveen Kumar Shanmugam <58961022+spraveenio@users.noreply.github.com>
…OCm#500)

* [Fix] GPUOP-607 fail the ANR workflow when imagePullBackOff



* Update internal/controllers/remediation/scripts/test.sh



* Update internal/controllers/remediation/scripts/test.sh



---------



(cherry picked from commit 344e480)

Signed-off-by: yansun1996 <Yan.Sun3@amd.com>
Co-authored-by: Yan Sun <Yan.Sun3@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…m#501)

* GPUOP-618 fix helm upgrade issue with latest Argo CRDs (#1283)

(cherry picked from commit fe9ec91)

* Apply suggestion from @biluriuday

---------

Co-authored-by: Uday Bhaskar <udayb@amd.com>
* anr - fixes for applylabels step

* multiple anr fixes

(cherry picked from commit b33e4c9)

Co-authored-by: Uday Bhaskar <udayb@amd.com>
…ition (#1281) (ROCm#503)

(cherry picked from commit 9314824)

Signed-off-by: yansun1996 <Yan.Sun3@amd.com>
Co-authored-by: Yan Sun <Yan.Sun3@amd.com>
Signed-off-by: yansun1996 <Yan.Sun3@amd.com>
* enable npd and anr e2e sims

* increase validation check duration

(cherry picked from commit 67defdf)

Co-authored-by: Uday Bhaskar <udayb@amd.com>
Fix two bugs in DeviceConfig node assignment management:

1. buildNodeAssignments now logs and skips node assignment conflicts
   instead of returning a fatal error. A CR-level conflict should not
   block the entire operator — the runtime validateNodeAssignments
   check already handles this per-CR during reconciliation.

2. Remove premature updateNodeAssignments call during finalization
   that freed nodes from the in-memory map before the finalizer was
   removed. Node cleanup is now handled solely via the NotFound path
   after CR garbage collection, preventing other DeviceConfigs from
   claiming nodes mid-finalization.

Also adds DRA driver DaemonSet cleanup to the finalization path,
which was previously only handled during normal reconciliation.

(cherry picked from commit a945553)

Co-authored-by: Nitish Bhat <bhatnitish@gmail.com>
…d and it's E2Es (#1267) (ROCm#508)

* DCM: mount default ConfigMap when spec.configManager.config is omitted

When DeviceConfig.spec.configManager.config is nil or has an empty name,
the DCM DaemonSet now always mounts a ConfigMap volume named
default-dcm-config (configurable by setting spec.configManager.config.name).

Add E2E coverage (TestDCMDefaultConfigMapWhenConfigOmitted), cluster_test
helpers, SIM skips for GPU-only partition tests, and align E2E_DCM_IMAGE
in dev.env with v1.4.1.

* Helm default CM + operator EnsureDefaultDCMConfigMap + E2E/docs

* changes

* address comments

* comments

* dcm changes

(cherry picked from commit e9c1e91)

Co-authored-by: nikhilsk <47417007+nikhilsk@users.noreply.github.com>
…ation (#1295) (ROCm#509)

(cherry picked from commit 65785ed)

Co-authored-by: bhatturu <bhatturu@amd.com>
…4) (ROCm#526)

(cherry picked from commit fa1328d092487fa7482c7d3166bbd5fd5fe6d74d)

Co-authored-by: Srivatsa Sangli <58572624+sangli-pensando@users.noreply.github.com>
…nual test examples (#1364) (#1365) (ROCm#527)

Add privileged SCC permissions to all ClusterRole definitions in manual/scheduled test documentation to support OpenShift deployments.


(cherry picked from commit 9915f721319cd7bf8fcb2ac581092473c0c3dc56)

Co-authored-by: Yan Sun <Yan.Sun3@amd.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
* GPUOP-640 update remediation documentation

* fix argo helm chart version for openshift

(cherry picked from commit 1f246648b635f53b937c8003aa15278a67b9a008)


(cherry picked from commit a968197d9974e22087eefe3cad2b7a184bf848c9)

Co-authored-by: Uday Bhaskar <udayb@amd.com>
* metricsclient cli change

* add test/e2e dependency to e2e sim

* increase timeout

(cherry picked from commit 5b1e46a)

Co-authored-by: Praveen Kumar Shanmugam <58961022+spraveenio@users.noreply.github.com>
…rt (#1337) (ROCm#522)

* Add workflow and workflow-triggered pod collection to techsupport

Enhance the techsupport_dump.sh script to collect workflow CRs and
workflow-triggered pods when auto node remediation feature is enabled.
This helps with debugging workflow-based node remediation issues.

Changes:
- Add WORKFLOW_RESOURCES variable for workflow CRs
- Collect workflow CRs (get, describe, yaml/json output)
- Collect workflow-triggered pods identified by workflows.argoproj.io/workflow label
- Add per-node log collection for workflow-triggered pods
- Include error resilience with || true for ephemeral workflow pods



* Make pod_logs function resilient to ephemeral pod failures

Add error handling (|| true) to kubectl logs commands in pod_logs
function to prevent script termination when collecting logs from
ephemeral/terminated workflow pods. With set -e enabled, failed log
collection would previously abort the entire techsupport run before
reaching error handlers.

Changes:
- Add '2>&1 || true' to current container logs command
- Add '2>&1 || true' to previous container logs command
- Ensures individual pod log failures don't terminate script execution
- Critical for short-lived workflow pods that may be deleted during collection



* Add workflow controller pod collection to techsupport

Collect information and logs from the workflow controller pod
(identified by label app=amd-gpu-operator-workflow-controller)
in addition to workflow CRs and workflow-triggered pods.

Changes:
- Add workflow controller pod collection in cluster-wide section
  - kubectl get/describe output in both text and JSON/YAML format
- Add workflow controller pod log collection per node
- Maintains error resilience with || true for optional feature



---------


(cherry picked from commit 70b0104)

Co-authored-by: Yan Sun <Yan.Sun3@amd.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
…clude "useSourceImage" for example DeviceConfig (ROCm#523)

* Include DeviceConfig driver.useSourceImage in OCP(olm) install docs

Signed-off-by: Landon LaSmith <LLaSmith@redhat.com>

* Airgapped: Sync driver version with OpenShift(OLM) documentation

Signed-off-by: Landon LaSmith <LLaSmith@redhat.com>

---------

Signed-off-by: Landon LaSmith <LLaSmith@redhat.com>
* Add DeviceConfig collection in testmonitor

* Fix DRA Tests
Add ServiceAccount, ClusterRole, and ClusterRoleBinding for the DRA
driver so it can run on OpenShift clusters. The ClusterRole grants:
- privileged SCC (required for OpenShift)
- resourceslices CRUD (to publish GPU resources)
- resourceclaims get (to process allocation requests)
- nodes get (to look up node info for ResourceSlice ownership)

Also add the DRA driver service account to the OLM bundle's
extra-service-accounts list so OLM-managed installs create the SA.
# Conflicts:
#	bundle/manifests/amd-gpu-operator.clusterserviceversion.yaml
…d (#1388)

* Create DeviceClass from operator code on OpenShift when DRA is enabled

On OpenShift, operator-sdk cannot deploy DeviceClass resources via the
OLM bundle. This adds handleDeviceClass to the reconciler which creates
the gpu.amd.com DeviceClass using an unstructured client when running on
OpenShift with DRA driver enabled. The DeviceClass is cluster-scoped and
shared, so it is created once (AlreadyExists is handled gracefully) and
never deleted on DeviceConfig finalization.

* Use deviceClassName constant instead of hardcoded string

Address review feedback: extract "gpu.amd.com" into a const and use it
throughout handleDeviceClass.
…opriate AMD GPU driver versions

* add new `slesCMNameMapper` to parse SLES version strings like 'SUSE Linux Enterprise Server 15 SP6' to 'sles-15.6'
* add `SLESDefaultDriverVersionsMapper` to select driver versions
  - SLES 15 SP6/SP7 -> driver 7.0.2 (ref: https://repo.radeon.com/amdgpu-install/7.0.2/sle/)
  - SLES 15 SP5 -> driver 6.2.2 (ref: https://repo.radeon.com/amdgpu-install/6.2.2/sle/)
* register both 'sles' and 'suse' identifiers in mappers

Co-authored-by: alex-isv <alex.zacharow@suse.com>
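
For readers who want a concrete picture of the mapping described in this commit message, here is a minimal sketch. It is not the PR's implementation: `slesName` and `defaultSLESDriverVersion` are hypothetical stand-ins for `slesCMNameMapper` and `SLESDefaultDriverVersionsMapper`, and three behaviours are assumptions. A release with no service pack is rendered as `sles-15`, a dash is accepted before `SP`, and releases without a listed driver (for example SP4) report "not found".

```go
package main

import (
	"fmt"
	"regexp"
)

// slesVersionRe captures the major version and an optional service pack.
// Accepting "-" as well as " " before "SP" is an assumption of this sketch.
var slesVersionRe = regexp.MustCompile(`(?i)SUSE Linux Enterprise Server (\d+)(?:[ -]SP(\d+))?`)

// slesName turns "SUSE Linux Enterprise Server 15 SP6" into "sles-15.6".
// A release without a service pack becomes "sles-15" (assumed).
func slesName(osImage string) (string, bool) {
	m := slesVersionRe.FindStringSubmatch(osImage)
	if m == nil {
		return "", false
	}
	if m[2] == "" {
		return "sles-" + m[1], true
	}
	return fmt.Sprintf("sles-%s.%s", m[1], m[2]), true
}

// Driver versions as listed in the commit message above.
var slesDriverVersions = map[string]string{
	"sles-15.5": "6.2.2",
	"sles-15.6": "7.0.2",
	"sles-15.7": "7.0.2",
}

// defaultSLESDriverVersion returns the default driver for an OS image
// string, or false when the release has no listed driver (assumed).
func defaultSLESDriverVersion(osImage string) (string, bool) {
	name, ok := slesName(osImage)
	if !ok {
		return "", false
	}
	v, found := slesDriverVersions[name]
	return v, found
}

func main() {
	fmt.Println(slesName("SUSE Linux Enterprise Server 15 SP6"))
	fmt.Println(defaultSLESDriverVersion("SUSE Linux Enterprise Server 15 SP5"))
}
```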
…sles"

* a user-specified `BaseImageRegistry` still takes precedence

* also extend tests in `internal/kmmmodule/kmmmodule_test.go` to cover the above changes in the `resolveDockerfile` func
@yansun1996
Member

Closing this PR and keeping the one for the main branch; the staging branch has been retired.

@yansun1996 yansun1996 closed this May 8, 2026