Auto node remediation changes #543

Merged
sajmera-pensando merged 12 commits into ROCm:main from biluriuday:cp-anr-rocm1 on May 6, 2026

@biluriuday
Contributor

This PR brings in multiple enhancements, code fixes, and documentation updates for the Auto Node Remediation feature.

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

yansun1996 previously approved these changes May 5, 2026
Member

@yansun1996 left a comment

LGTM

biluriuday added 12 commits May 6, 2026 05:09
* update controller-manager serviceaccount rbac

* set deviceconfig as owner for the default workflow template

* update correct config map structure in docs

(cherry picked from commit d38ee15)
* create remediation config map using utils image

* add make targets

* update documentation

* update e2e tests

* ignore lint for pytests docs

(cherry picked from commit 04e3886)
(cherry picked from commit 35b4fabf1eb49dcd5d34b8bc9b53b29b9b21595f)
(cherry picked from commit 420d676d07cee939f55cea7f7cca2c9a289f5e88)
(cherry picked from commit 6829f3082f49e70c7acfc7e888d18de6ad946b43)
(cherry picked from commit be661f5b3013b04b4658eadfba881ad3716a7216)
* handle reboot step failure in workflow

* fix node selector and affinity rules

* update documentation

* use boot id to detect successful reboot

* ignore tests folder from docs lint

(cherry picked from commit 1c647b6169de2f320a4b5bf164a49f3450b44bbf)
* anr docs update and minor fixes

* update sed command to overcome Openshift permission issue

(cherry picked from commit 90b20f956109f7203dcd2eee481b032f814868dc)
* make reboot wait time configurable via deviceconfig

* fix helm-e2e test

* address review comments

(cherry picked from commit 21f7023f4c23d92958732d08239ed9378a3dbc2a)
(cherry picked from commit b035ce92918a194eee89613e848cd3121c4f763b)
@sajmera-pensando merged commit ca2c1b3 into ROCm:main on May 6, 2026
3 checks passed
@biluriuday deleted the cp-anr-rocm1 branch on May 7, 2026 08:37

3 participants