Add SLES support for AMD gpu-operator#365
Add SLES support for AMD gpu-operator#365Priyankasaggu11929 wants to merge 4 commits intoROCm:mainfrom
Conversation
|
Hello @yansun1996, I’ve opened this PR to get early feedback on the approach for adding support for SLES 15 SP6/SP7. Also please note - I haven’t tested these changes yet on a SLES 15 host with an AMD GPU. That is in works! |
Hi @Priyankasaggu11929 thanks for raising the PR, we will review this PR. Please also let us know when you did some verification on the real AMD GPU hardware based cluster. thanks ! |
Yes, I'll keep posting updates. Thank you! |
| # IMPORTANT for SLES: Base images must come from registry.suse.com | ||
| # Uncomment and set for SLES 15 SP5/SP6 deployments: | ||
| #imageBuild: | ||
| # baseImageRegistry: "registry.suse.com" |
There was a problem hiding this comment.
one minor suggestion,
since the controller will be able to parse the OS image and detect that the workers are SLES based, you can let the controller set the baseImageRegistry for the detected SLES based worker nodes.
PTAL at this function resolveDockerfile
There was a problem hiding this comment.
addressed in 4da60d3
set default to "registry.suse.com" in case of OS == "sles" but still giving precedence if a user defines spec.driver.imageBuild.baseImageRegistry = "custom-image-regisry". I added some minor tests to verify the behavior.
With above, I dropped the docs changes in example/deviceconfig_example.yaml
Please review again. Thank you!
yansun1996
left a comment
There was a problem hiding this comment.
one minor suggestion, the rest of the PR looks good
Let us know when you finished the verification with hardware
d46ce29 to
4da60d3
Compare
yansun1996
left a comment
There was a problem hiding this comment.
thanks @Priyankasaggu11929 good job, please open another same PR against the staging branch, we're managing PR in this way staging ---> main ---> release-vx.x.x
once you confirmed the verification on AMD GPU setup is done, we can discuss with product team about further details for a release plan with SLES support
Created PR for staging branch - #371
Thank you so much! Regarding "the verification on AMD GPU setup" - I'm still in discussion for getting the required lab infra access, so there are no updates as of now on this, but I will post updates as soon as I am able to run some tests. |
* Automate the helm README build in sanity * address comments
…opriate AMD GPU driver versions * add new `slesCMNameMapper` to parse SLES version strings like 'SUSE Linux Enterprise Server 15 SP6' to 'sles-15.6' * add `SLESDefaultDriverVersionsMapper` to select driver versions - SLES 15 SP6/SP7 -> driver 7.0.2 (ref: https://repo.radeon.com/amdgpu-install/7.0.2/sle/) - SLES 15 SP5 -> driver 6.2.2 (ref: https://repo.radeon.com/amdgpu-install/6.2.2/sle/) * register both 'sles' and 'suse' identifiers in mappers Co-authored-by: alex-isv <alex.zacharow@suse.com>
4da60d3 to
666be77
Compare
… SUSE AMD GPU driver image
…sles" * although, use-specified `BaseImageRegistry` still takes precedence * also extend tests in `internal/kmmodule/kmmodule_test.go` to test above changes in `resolveDockerfile` func
666be77 to
7dfec5e
Compare
|
Hello @yansun1996, I have updated the PR today with latest changes. We have tested the PR changes (with the SUSE built amdgpu driver container image for latest version v7.0.3) on a machine with Requesting your review again on the PR changes. (Also, please let me know if staging branch is still the way to submit these changes, in that case, I'll refresh the other PR too - #371) I used the following patch for gpu-operator to detect |
Hi @Priyankasaggu11929 thanks for the update and verification, let me discuss with the team about this PR, will get back to you. |
Thank you. |
|
Hello @yansun1996, could you help with some information on the following: We were waiting for a new amdgpu driver modules release that supports Previously, we had been using the What is the difference between the |
Hi @Priyankasaggu11929 , I will cherry-pick your commits for internal CI (not public available yet) as for your question: v7.x.x and v30.x.x there is a recent driver version change happened. Previously in driver version <= v7.0.x you will see that each amdgpu release has followed the ROCm release version However in recent amdgpu driver releases, the release version has diverged from ROCm release, start to use 30.x.x So as of now, amdgpu driver has its own release and ROCm runtime libs has their release version for more information please check https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/user-kernel-space-compat-matrix.html#user-and-amd-gpu-driver-amdgpu-support-matrix |
| var slesCSDPrebuiltDriverImages = map[string]map[string]string{ | ||
| "15.7": { | ||
| "7.0.3": "registry.suse.com/third-party/amd/amdgpu-driver:sles-15.7-7.0.3", | ||
| }, | ||
| } |
There was a problem hiding this comment.
Bug — data tables are inconsistent. This prebuilt-image table only registers 15.7 → 7.0.3, but SLESDefaultDriverVersionsMapper (utils.go) defaults to 7.0.2 for SP6 and 6.2.2 for SP5/base. With those defaults, the lookup in getKM (around line 572) silently misses and SUSE_PREBUILT_DRIVER_IMG is never injected. The new Dockerfile template has no fallback — FROM ${SUSE_PREBUILT_DRIVER_IMG} AS driver-source resolves to FROM AS driver-source, which fails the build.
Net effect: with the defaults this PR ships, SP5 and SP6 nodes (which the PR claims to support) cannot build. Either add prebuilt entries for the SP5/SP6 default driver versions, narrow SLESDefaultDriverVersionsMapper to only return versions present here, or surface an explicit error on lookup miss.
| // Inject SUSE_PREBUILT_DRIVER_IMG build arg for SLES nodes. | ||
| if strings.HasPrefix(osName, "sles-") { | ||
| csVersion := strings.TrimPrefix(osName, "sles-") // e.g. "15.7" | ||
| if driverVersions, ok := slesCSDPrebuiltDriverImages[csVersion]; ok { | ||
| if prebuiltImg, ok := driverVersions[driversVersion]; ok { | ||
| kmmBuild.BuildArgs = append(kmmBuild.BuildArgs, | ||
| kmmv1beta1.BuildArg{ | ||
| Name: "SUSE_PREBUILT_DRIVER_IMG", | ||
| Value: prebuiltImg, | ||
| }, | ||
| ) | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
Silent-skip behavior. When the codestream or driver-version key is missing from slesCSDPrebuiltDriverImages, this block falls through with no build arg appended and no error logged. Combined with the table on line 100, default-configured SLES 15 SP5 / SP6 nodes will reach the build pod without SUSE_PREBUILT_DRIVER_IMG set, and the build fails with a confusing FROM error from buildkit/podman.
Suggest returning an error here when osName starts with sles- but no prebuilt image is registered for the (codestream, driver-version) pair, so the failure mode is actionable.
| if err == nil && spVersion >= 7 { | ||
| return "7.0.3", nil // Latest stable version for SP7+ | ||
| } | ||
| if err == nil && spVersion >= 6 { | ||
| return "7.0.2", nil // Latest stable version for SP6 | ||
| } | ||
| if err == nil && spVersion >= 5 { | ||
| return "6.2.2", nil // Stable version for SP5 | ||
| } | ||
| } | ||
| // Default for SLES 15 without SP info | ||
| return "6.2.2", nil | ||
| } | ||
| return "", fmt.Errorf("unsupported SLES version: %s. Supported versions include SLES 15 SP5 and above", fullImageStr) |
There was a problem hiding this comment.
Contract mismatch: the fallthrough at line 254 returns 6.2.2 for any SLES 15 input that doesn't match the SP≥5 conditions — including SP4 and below. The new test SLES 15 SP4 in utils_test.go asserts this returns 6.2.2 with no error. But the error message on line 256 advertises Supported versions include SLES 15 SP5 and above. Either reject spVersion < 5 explicitly to match the documented contract, or revise the error string + supported-versions claim to match the code. Right now the function silently accepts SP versions it claims not to support.
Minor: hoist regexp.MustCompile to package scope — it is currently recompiled on every call.
| // render base image registry | ||
| baseImageRegistry := defaultBaseImageRegistry | ||
| if devConfig.Spec.Driver.ImageBuild.BaseImageRegistry != "" { | ||
| // user-specified registry takes precendence |
There was a problem hiding this comment.
Typo: precendence → precedence.
yansun1996
left a comment
There was a problem hiding this comment.
PTAL the comments @Priyankasaggu11929 , thanks
Motivation
This PR aim at adding support for SUSE Linux Enterprise Server (SLES) 15 SP5+ to the AMD GPU operator.
Technical Details
781c5b5 - add support for detecting SLES nodes and automatically selecting appropriate AMD GPU driver versions
slesCMNameMapperto parse SLES version strings like 'SUSE Linux Enterprise Server 15 SP6' to 'sles-15.6'SLESDefaultDriverVersionsMapperto select driver versions0170a9a - add SLES Dockerfile template (
DockerfileTemplate.sles) for building AMD GPU drivers on SLES (currently, I've skipped adding the GIM Dockerfile template for SLES, will tackle it once this goes through).c2dce44 - docs: update example/deviceconfig_example.yaml<- dropped4da60d3 - use "registry.suse.com" as the default base image registry if OS == "sles"
BaseImageRegistrystill takes precedenceinternal/kmmodule/kmmodule_test.goto test above changes inresolveDockerfilefuncTest Plan
Test Result
truncated output of
make unit-testafter new added tests in b625441output from tests added as part of 4da60d3
Submission Checklist