Add SLES support for AMD gpu-operator #365

Open
Priyankasaggu11929 wants to merge 4 commits into ROCm:main from
Priyankasaggu11929:enable-sles-support
Conversation

@Priyankasaggu11929

@Priyankasaggu11929 Priyankasaggu11929 commented Oct 23, 2025

Motivation

This PR adds support for SUSE Linux Enterprise Server (SLES) 15 SP5+ to the AMD GPU operator.

Technical Details

  • 781c5b5 - add support for detecting SLES nodes and automatically selecting appropriate AMD GPU driver versions

  • 0170a9a - add SLES Dockerfile template (DockerfileTemplate.sles) for building AMD GPU drivers on SLES (currently, I've skipped adding the GIM Dockerfile template for SLES, will tackle it once this goes through).

    • also embed the template via go:embed and add SLES case logic
  • c2dce44 - docs: update example/deviceconfig_example.yaml <- dropped

  • 4da60d3 - use "registry.suse.com" as the default base image registry if OS == "sles"

    • although, a user-specified BaseImageRegistry still takes precedence
    • also extend tests in internal/kmmmodule/kmmmodule_test.go to cover the above changes in the resolveDockerfile func
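
The registry precedence described for 4da60d3 can be sketched roughly as follows. This is only an illustration, not the operator's actual resolveDockerfile code: the helper name pickBaseImageRegistry and the "docker.io" fallback are assumptions made for the example.

```go
package main

import "fmt"

// Assumed fallback for non-SLES hosts; the operator's real default may differ.
const defaultBaseImageRegistry = "docker.io"

// Registry named in the PR description for SLES base images.
const slesBaseImageRegistry = "registry.suse.com"

// pickBaseImageRegistry is a hypothetical helper. A non-empty user-specified
// registry always wins; otherwise SLES nodes get registry.suse.com and every
// other OS gets the assumed fallback.
func pickBaseImageRegistry(osName, userRegistry string) string {
	if userRegistry != "" {
		return userRegistry
	}
	if osName == "sles" {
		return slesBaseImageRegistry
	}
	return defaultBaseImageRegistry
}

func main() {
	fmt.Println(pickBaseImageRegistry("sles", ""))
	fmt.Println(pickBaseImageRegistry("sles", "registry.example.com"))
	fmt.Println(pickBaseImageRegistry("ubuntu", ""))
}
```

Checking the user value first keeps the override rule in one place, so adding another OS-specific default later would not change the precedence.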

Test Plan

  • b625441 - tests: update internal/utils_test.go for added support for SLES 15 SP*

Test Result

  • truncated output of make unit-test after the newly added tests in b625441

    > make unit-test
    ...
    ...
    === RUN   TestSLESDefaultDriverVersionsMapper
    === RUN   TestSLESDefaultDriverVersionsMapper/SLES_15_SP6
    === RUN   TestSLESDefaultDriverVersionsMapper/SLES_15_SP7
    === RUN   TestSLESDefaultDriverVersionsMapper/SLES_15_SP5
    === RUN   TestSLESDefaultDriverVersionsMapper/SLES_15_SP4
    === RUN   TestSLESDefaultDriverVersionsMapper/SLES_15_base
    === RUN   TestSLESDefaultDriverVersionsMapper/SLES_15_with_dash_format
    --- PASS: TestSLESDefaultDriverVersionsMapper (0.00s)
        --- PASS: TestSLESDefaultDriverVersionsMapper/SLES_15_SP6 (0.00s)
        --- PASS: TestSLESDefaultDriverVersionsMapper/SLES_15_SP7 (0.00s)
        --- PASS: TestSLESDefaultDriverVersionsMapper/SLES_15_SP5 (0.00s)
        --- PASS: TestSLESDefaultDriverVersionsMapper/SLES_15_SP4 (0.00s)
        --- PASS: TestSLESDefaultDriverVersionsMapper/SLES_15_base (0.00s)
        --- PASS: TestSLESDefaultDriverVersionsMapper/SLES_15_with_dash_format (0.00s)
    PASS
    coverage: 48.6% of statements
    ok  	github.com/ROCm/gpu-operator/internal	0.019s	coverage: 48.6% of statements
    === RUN   TestAPIs
    Running Suite: Controller Suite - /home/psaggu/work-suse/amd-gpu-operator-work/oct-23-pr/gpu-operator/internal/controllers
    ==========================================================================================================================
    Random Seed: 1761223798
    
    Will run 15 of 15 specs
    •••••••••••••••
    
    Ran 15 of 15 Specs in 0.008 seconds
    SUCCESS! -- 15 Passed | 0 Failed | 0 Pending | 0 Skipped
    --- PASS: TestAPIs (0.01s)
    PASS
    coverage: 7.9% of statements
    ok  	github.com/ROCm/gpu-operator/internal/controllers	(cached)	coverage: 7.9% of statements
    === RUN   TestAPIs
    Running Suite: KMMModule Suite - /home/psaggu/work-suse/amd-gpu-operator-work/oct-23-pr/gpu-operator/internal/kmmmodule
    =======================================================================================================================
    Random Seed: 1761223798
    
    Will run 5 of 5 specs
    testing multiple valid homogeneous nodes
    testing multiple valid heterogeneous nodes
    testing multiple valid heterogeneous nodes + one unsupported node
    testing multiple unsupported nodes
    testing empty node list
    •<moduleName>
    <amdgpu>
    •<moduleName>
    <amdgpu>
    •••
    
    Ran 5 of 5 Specs in 0.005 seconds
    SUCCESS! -- 5 Passed | 0 Failed | 0 Pending | 0 Skipped
    --- PASS: TestAPIs (0.01s)
    PASS
    coverage: 32.3% of statements
    ok  	github.com/ROCm/gpu-operator/internal/kmmmodule	(cached)	coverage: 32.3% of statements
    
    •••••••••••••••
    
    Ran 15 of 15 Specs in 0.008 seconds
    SUCCESS! -- 15 Passed | 0 Failed | 0 Pending | 0 Skipped
    

  • output from tests added as part of 4da60d3

    ❯ go test ./internal/kmmmodule/... -v -ginkgo.focus="resolveDockerfile" -ginkgo.v
    === RUN   TestAPIs
    Running Suite: KMMModule Suite - /home/psaggu/work-suse/amd-gpu-operator-work/oct-23-pr/gpu-operator/internal/kmmmodule
    =======================================================================================================================
    Random Seed: 1761548380
    
    Will run 3 of 8 specs
    SSSS
    ------------------------------
    resolveDockerfile should use correct default registry when not specified by user
    /home/psaggu/work-suse/amd-gpu-operator-work/oct-23-pr/gpu-operator/internal/kmmmodule/kmmmodule_test.go:683
    • [0.000 seconds]
    ------------------------------
    resolveDockerfile should respect user-specified BaseImageRegistry for all OS types
    /home/psaggu/work-suse/amd-gpu-operator-work/oct-23-pr/gpu-operator/internal/kmmmodule/kmmmodule_test.go:702
    • [0.000 seconds]
    ------------------------------
    resolveDockerfile should return error for unsupported OS
    /home/psaggu/work-suse/amd-gpu-operator-work/oct-23-pr/gpu-operator/internal/kmmmodule/kmmmodule_test.go:727
    • [0.000 seconds]
    ------------------------------
    S
    
    Ran 3 of 8 Specs in 0.000 seconds
    SUCCESS! -- 3 Passed | 0 Failed | 0 Pending | 5 Skipped
    --- PASS: TestAPIs (0.00s)
    PASS
    ok  	github.com/ROCm/gpu-operator/internal/kmmmodule	0.022s
    

Submission Checklist

@Priyankasaggu11929
Author

Hello @yansun1996, I’ve opened this PR to get early feedback on the approach for adding support for SLES 15 SP6/SP7.
Please review and let me know if/where any changes are needed.

Also, please note: I haven’t tested these changes yet on a SLES 15 host with an AMD GPU. That is in the works!

@yansun1996
Member

yansun1996 commented Oct 23, 2025

Hi @Priyankasaggu11929, thanks for raising the PR. We will review it.

Please also let us know once you have done some verification on a real AMD GPU hardware-based cluster. Thanks!

@Priyankasaggu11929
Author

Yes, I'll keep posting updates. Thank you!

Comment thread example/deviceconfig_example.yaml Outdated
# IMPORTANT for SLES: Base images must come from registry.suse.com
# Uncomment and set for SLES 15 SP5/SP6 deployments:
#imageBuild:
# baseImageRegistry: "registry.suse.com"
Member

one minor suggestion,

since the controller will be able to parse the OS image and detect that the workers are SLES based, you can let the controller set the baseImageRegistry for the detected SLES based worker nodes.

PTAL at this function resolveDockerfile

Author

addressed in 4da60d3

Set the default to "registry.suse.com" when OS == "sles", while still giving precedence to a user-defined spec.driver.imageBuild.baseImageRegistry (for example, "custom-image-registry"). I added some minor tests to verify the behavior.

With the above, I dropped the docs changes in example/deviceconfig_example.yaml

Please review again. Thank you!

Member

@yansun1996 yansun1996 left a comment

One minor suggestion; the rest of the PR looks good.
Let us know when you have finished the verification with hardware.

Member

@yansun1996 yansun1996 left a comment

Thanks @Priyankasaggu11929, good job. Please open the same PR against the staging branch as well; we manage PRs in this order: staging ---> main ---> release-vx.x.x

Once you have confirmed that verification on an AMD GPU setup is done, we can discuss further details of a release plan with SLES support with the product team.

@Priyankasaggu11929
Author

> thanks @Priyankasaggu11929 good job, please open another same PR against the staging branch, we're managing PR in this way staging ---> main ---> release-vx.x.x

Created PR for staging branch - #371

> once you confirmed the verification on AMD GPU setup is done, we can discuss with product team about further details for a release plan with SLES support

Thank you so much!

Regarding "the verification on AMD GPU setup" - I'm still in discussion for getting the required lab infra access, so there are no updates as of now on this, but I will post updates as soon as I am able to run some tests.

leslie-qiwa pushed a commit to leslie-qiwa/gpu-operator that referenced this pull request Feb 6, 2026
* Automate the helm README build in sanity

* address comments
…opriate AMD GPU driver versions

* add new `slesCMNameMapper` to parse SLES version strings like 'SUSE Linux Enterprise Server 15 SP6' to 'sles-15.6'
* add `SLESDefaultDriverVersionsMapper` to select driver versions
  - SLES 15 SP6/SP7 -> driver 7.0.2 (ref: https://repo.radeon.com/amdgpu-install/7.0.2/sle/)
  - SLES 15 SP5 -> driver 6.2.2 (ref: https://repo.radeon.com/amdgpu-install/6.2.2/sle/)
* register both 'sles' and 'suse' identifiers in mappers

Co-authored-by: alex-isv <alex.zacharow@suse.com>
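
A rough sketch of the version-string mapping this commit message describes ('SUSE Linux Enterprise Server 15 SP6' to 'sles-15.6'). The function name slesNameToKey, the regular expression, and the handling of strings without an SP suffix are assumptions for illustration, not the PR's actual slesCMNameMapper.

```go
package main

import (
	"fmt"
	"regexp"
)

// Compiled once at package scope. Matches "... Server 15", "... Server 15 SP6"
// and the dash form "... Server 15-SP6".
var slesVersionRe = regexp.MustCompile(`(?i)SUSE Linux Enterprise Server (\d+)(?:[ -]SP(\d+))?`)

// slesNameToKey is a hypothetical stand-in for the mapper in the PR.
// It turns an OS image string into a key such as "sles-15.6".
func slesNameToKey(osImage string) (string, error) {
	m := slesVersionRe.FindStringSubmatch(osImage)
	if m == nil {
		return "", fmt.Errorf("not a SLES OS image string: %q", osImage)
	}
	if m[2] == "" {
		// Assumption: without a service pack, only the major version is kept.
		return "sles-" + m[1], nil
	}
	return "sles-" + m[1] + "." + m[2], nil
}

func main() {
	for _, s := range []string{
		"SUSE Linux Enterprise Server 15 SP6",
		"SUSE Linux Enterprise Server 15-SP7",
		"SUSE Linux Enterprise Server 15",
	} {
		key, err := slesNameToKey(s)
		fmt.Println(key, err)
	}
}
```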
…sles"

* although, use-specified `BaseImageRegistry` still takes precedence

* also extend tests in `internal/kmmodule/kmmodule_test.go` to test above changes in `resolveDockerfile` func
@Priyankasaggu11929
Author

Hello @yansun1996, I have updated the PR today with latest changes.

We have tested the PR changes (with the SUSE built amdgpu driver container image for latest version v7.0.3) on a machine with AMD Radeon Pro V520 (7362) GPU device and can confirm that the AMD gpu-operator is able to detect SLES nodes and publish the available GPU devices to workloads requesting GPUs (across all kernel versions of SLES 15 SP7 codestream).

Requesting your review again on the PR changes.

(Also, please let me know if staging branch is still the way to submit these changes, in that case, I'll refresh the other PR too - #371)


I used the following patch for gpu-operator to detect Radeon Pro V520 (7362) device.

> cat v520-device-support.patch 
diff --git a/hack/k8s-patch/template-patch/gpu-nfd-default-rule.yaml b/hack/k8s-patch/template-patch/gpu-nfd-default-rule.yaml
index 7d0269cd..531bc06b 100644
--- a/hack/k8s-patch/template-patch/gpu-nfd-default-rule.yaml
+++ b/hack/k8s-patch/template-patch/gpu-nfd-default-rule.yaml
@@ -64,6 +64,11 @@ spec:
             matchExpressions:
                 vendor: {op: In, value: ["1002"]}
                 device: {op: In, value: ["73ae"]} # Radeon Pro V620 MxGPU
+      - matchFeatures:
+          - feature: pci.device
+            matchExpressions:
+                vendor: {op: In, value: ["1002"]}
+                device: {op: In, value: ["7362"]} # Radeon Pro V520
   - name: amd-gpu
     labels:
       feature.node.kubernetes.io/amd-gpu: "true"
@@ -185,6 +190,11 @@ spec:
             matchExpressions:
               vendor: {op: In, value: ["1002"]}
               device: {op: In, value: ["73a1"]} # V620
+      - matchFeatures:
+          - feature: pci.device
+            matchExpressions:
+              vendor: {op: In, value: ["1002"]}
+              device: {op: In, value: ["7362"]} # V520
       - matchFeatures:
           - feature: pci.device
             matchExpressions:
diff --git a/helm-charts-k8s/templates/gpu-nfd-default-rule.yaml b/helm-charts-k8s/templates/gpu-nfd-default-rule.yaml
index 7d0269cd..531bc06b 100644
--- a/helm-charts-k8s/templates/gpu-nfd-default-rule.yaml
+++ b/helm-charts-k8s/templates/gpu-nfd-default-rule.yaml
@@ -64,6 +64,11 @@ spec:
             matchExpressions:
                 vendor: {op: In, value: ["1002"]}
                 device: {op: In, value: ["73ae"]} # Radeon Pro V620 MxGPU
+      - matchFeatures:
+          - feature: pci.device
+            matchExpressions:
+                vendor: {op: In, value: ["1002"]}
+                device: {op: In, value: ["7362"]} # Radeon Pro V520
   - name: amd-gpu
     labels:
       feature.node.kubernetes.io/amd-gpu: "true"
@@ -185,6 +190,11 @@ spec:
             matchExpressions:
               vendor: {op: In, value: ["1002"]}
               device: {op: In, value: ["73a1"]} # V620
+      - matchFeatures:
+          - feature: pci.device
+            matchExpressions:
+              vendor: {op: In, value: ["1002"]}
+              device: {op: In, value: ["7362"]} # V520
       - matchFeatures:
           - feature: pci.device
             matchExpressions:

@yansun1996
Member

Hi @Priyankasaggu11929 thanks for the update and verification, let me discuss with the team about this PR, will get back to you.

@Priyankasaggu11929
Author

thanks for the update and verification, let me discuss with the team about this PR, will get back to you.

Thank you.

@Priyankasaggu11929
Author

Hello @yansun1996, could you help with some information on the following:

We were waiting for a new amdgpu driver modules release that supports SLES 16.0, and just noticed that the modules are available under tag v31.20: https://repo.radeon.com/amdgpu/31.20/sle/16.0/main/x86_64/

Previously, we had been using the v7.0.3 release as the latest release: https://repo.radeon.com/amdgpu/7.0.3/sle/
and were monitoring the latest branch here: https://repo.radeon.com/amdgpu/latest/sle/

What is the difference between the 7.x.x and 31.x.x versioning schemes? And which release stream is the recommended one to use?

@yansun1996
Member

Hi @Priyankasaggu11929 ,

I will cherry-pick your commits for our internal CI (not publicly available yet).

As for your question about v7.x.x versus v30.x.x: the driver versioning scheme changed recently.

Previously, for driver versions <= v7.0.x, each amdgpu release followed the ROCm release version.

In recent amdgpu driver releases, however, the version has diverged from the ROCm release and now uses 30.x.x.

So, as of now, the amdgpu driver and the ROCm runtime libraries each have their own release versions.

For more information, please check https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/user-kernel-space-compat-matrix.html#user-and-amd-gpu-driver-amdgpu-support-matrix

Comment on lines +100 to +104
var slesCSDPrebuiltDriverImages = map[string]map[string]string{
	"15.7": {
		"7.0.3": "registry.suse.com/third-party/amd/amdgpu-driver:sles-15.7-7.0.3",
	},
}
Member

Bug — data tables are inconsistent. This prebuilt-image table only registers 15.7 → 7.0.3, but SLESDefaultDriverVersionsMapper (utils.go) defaults to 7.0.2 for SP6 and 6.2.2 for SP5/base. With those defaults, the lookup in getKM (around line 572) silently misses and SUSE_PREBUILT_DRIVER_IMG is never injected. The new Dockerfile template has no fallback — FROM ${SUSE_PREBUILT_DRIVER_IMG} AS driver-source resolves to FROM AS driver-source, which fails the build.

Net effect: with the defaults this PR ships, SP5 and SP6 nodes (which the PR claims to support) cannot build. Either add prebuilt entries for the SP5/SP6 default driver versions, narrow SLESDefaultDriverVersionsMapper to only return versions present here, or surface an explicit error on lookup miss.
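
One way to catch this kind of drift early is a consistency check between the two tables. The sketch below is illustrative only: slesDefaultDriverVersions is a hypothetical flattening of the defaults quoted in this review (SP5 → 6.2.2, SP6 → 7.0.2, SP7 → 7.0.3), and missingPrebuiltImages is not a function from the PR.

```go
package main

import (
	"fmt"
	"sort"
)

// Prebuilt-image table as quoted in the review.
var slesCSDPrebuiltDriverImages = map[string]map[string]string{
	"15.7": {
		"7.0.3": "registry.suse.com/third-party/amd/amdgpu-driver:sles-15.7-7.0.3",
	},
}

// Hypothetical table of default driver versions per codestream, flattened
// from the defaults described in the review comments.
var slesDefaultDriverVersions = map[string]string{
	"15.5": "6.2.2",
	"15.6": "7.0.2",
	"15.7": "7.0.3",
}

// missingPrebuiltImages returns every "codestream/driver" default that has
// no entry in the prebuilt-image table, sorted for stable output.
func missingPrebuiltImages(defaults map[string]string, images map[string]map[string]string) []string {
	var missing []string
	for codestream, driver := range defaults {
		// Indexing a nil inner map is safe and simply reports "not found".
		if _, ok := images[codestream][driver]; !ok {
			missing = append(missing, codestream+"/"+driver)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	fmt.Println(missingPrebuiltImages(slesDefaultDriverVersions, slesCSDPrebuiltDriverImages))
}
```

A check along these lines could run as a unit test so that adding a new default without a matching image fails before it reaches a build pod.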

Comment on lines +569 to +582
// Inject SUSE_PREBUILT_DRIVER_IMG build arg for SLES nodes.
if strings.HasPrefix(osName, "sles-") {
	csVersion := strings.TrimPrefix(osName, "sles-") // e.g. "15.7"
	if driverVersions, ok := slesCSDPrebuiltDriverImages[csVersion]; ok {
		if prebuiltImg, ok := driverVersions[driversVersion]; ok {
			kmmBuild.BuildArgs = append(kmmBuild.BuildArgs,
				kmmv1beta1.BuildArg{
					Name:  "SUSE_PREBUILT_DRIVER_IMG",
					Value: prebuiltImg,
				},
			)
		}
	}
}
Member

Silent-skip behavior. When the codestream or driver-version key is missing from slesCSDPrebuiltDriverImages, this block falls through with no build arg appended and no error logged. Combined with the table on line 100, default-configured SLES 15 SP5 / SP6 nodes will reach the build pod without SUSE_PREBUILT_DRIVER_IMG set, and the build fails with a confusing FROM error from buildkit/podman.

Suggest returning an error here when osName starts with sles- but no prebuilt image is registered for the (codestream, driver-version) pair, so the failure mode is actionable.
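
A minimal sketch of that suggestion, assuming the lookup is pulled into a helper. The name slesPrebuiltImage and the error wording are hypothetical; only the table contents and the "sles-" prefix convention come from the snippet under review.

```go
package main

import (
	"fmt"
	"strings"
)

// Prebuilt-image table as quoted in the review.
var slesCSDPrebuiltDriverImages = map[string]map[string]string{
	"15.7": {
		"7.0.3": "registry.suse.com/third-party/amd/amdgpu-driver:sles-15.7-7.0.3",
	},
}

// slesPrebuiltImage is a hypothetical helper. For non-SLES nodes it returns
// an empty image and no error; for SLES nodes it returns either the
// registered image or an explicit error instead of silently skipping.
func slesPrebuiltImage(osName, driverVersion string) (string, error) {
	if !strings.HasPrefix(osName, "sles-") {
		return "", nil
	}
	csVersion := strings.TrimPrefix(osName, "sles-") // e.g. "15.7"
	img, ok := slesCSDPrebuiltDriverImages[csVersion][driverVersion]
	if !ok {
		return "", fmt.Errorf(
			"no prebuilt SLES driver image registered for codestream %q and driver version %q",
			csVersion, driverVersion)
	}
	return img, nil
}

func main() {
	fmt.Println(slesPrebuiltImage("sles-15.7", "7.0.3"))
	fmt.Println(slesPrebuiltImage("sles-15.6", "7.0.2"))
}
```

The caller would append the SUSE_PREBUILT_DRIVER_IMG build arg only when the returned image is non-empty, and propagate the error otherwise.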

Comment thread internal/utils.go
Comment on lines +243 to +256
			if err == nil && spVersion >= 7 {
				return "7.0.3", nil // Latest stable version for SP7+
			}
			if err == nil && spVersion >= 6 {
				return "7.0.2", nil // Latest stable version for SP6
			}
			if err == nil && spVersion >= 5 {
				return "6.2.2", nil // Stable version for SP5
			}
		}
		// Default for SLES 15 without SP info
		return "6.2.2", nil
	}
	return "", fmt.Errorf("unsupported SLES version: %s. Supported versions include SLES 15 SP5 and above", fullImageStr)
Member

Contract mismatch: the fallthrough at line 254 returns 6.2.2 for any SLES 15 input that doesn't match the SP≥5 conditions — including SP4 and below. The new test SLES 15 SP4 in utils_test.go asserts this returns 6.2.2 with no error. But the error message on line 256 advertises Supported versions include SLES 15 SP5 and above. Either reject spVersion < 5 explicitly to match the documented contract, or revise the error string + supported-versions claim to match the code. Right now the function silently accepts SP versions it claims not to support.

Minor: hoist regexp.MustCompile to package scope — it is currently recompiled on every call.
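
Both points could look roughly like the sketch below, which takes the first option (reject spVersion < 5). It assumes the caller has already established that the string describes SLES 15; the function name slesDefaultDriverVersion and the regular expression are hypothetical, and keeping 6.2.2 for strings without SP info simply mirrors the existing fallback.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// Compiled once at package scope instead of on every call.
// Matches "15 SP6" as well as the dash form "15-SP6".
var slesSPRe = regexp.MustCompile(`15[ -]SP(\d+)`)

// slesDefaultDriverVersion is a hypothetical variant of the reviewed logic.
// Assumption: fullImageStr is already known to describe SLES 15.
func slesDefaultDriverVersion(fullImageStr string) (string, error) {
	m := slesSPRe.FindStringSubmatch(fullImageStr)
	if m == nil {
		// No SP info: keep the existing default from the reviewed code.
		return "6.2.2", nil
	}
	spVersion, err := strconv.Atoi(m[1])
	if err != nil {
		return "", fmt.Errorf("cannot parse SLES service pack in %q: %w", fullImageStr, err)
	}
	switch {
	case spVersion >= 7:
		return "7.0.3", nil
	case spVersion >= 6:
		return "7.0.2", nil
	case spVersion >= 5:
		return "6.2.2", nil
	}
	// SP4 and below are rejected so the behavior matches the error message.
	return "", fmt.Errorf("unsupported SLES version: %s. Supported versions include SLES 15 SP5 and above", fullImageStr)
}

func main() {
	fmt.Println(slesDefaultDriverVersion("SUSE Linux Enterprise Server 15 SP7"))
	fmt.Println(slesDefaultDriverVersion("SUSE Linux Enterprise Server 15 SP4"))
}
```

With this shape, the existing SLES 15 SP4 test case would need to expect an error rather than 6.2.2.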

// render base image registry
baseImageRegistry := defaultBaseImageRegistry
if devConfig.Spec.Driver.ImageBuild.BaseImageRegistry != "" {
// user-specified registry takes precendence
Member

Typo: precendence → precedence.

Member

@yansun1996 yansun1996 left a comment

PTAL the comments @Priyankasaggu11929 , thanks
