Skip to content

Fix GPU attach/detach handling and CM response processing#44

Open
NekoHK wants to merge 11 commits into
CoHDI:mainfrom
NekoHK:github-sync-ado-dev
Open

Fix GPU attach/detach handling and CM response processing#44
NekoHK wants to merge 11 commits into
CoHDI:mainfrom
NekoHK:github-sync-ado-dev

Conversation

@NekoHK

@NekoHK NekoHK commented Jun 15, 2026

Copy link
Copy Markdown
Collaborator

Background

GPU attach/detach processing needed improvements around driver detection, GPU load checks, kubelet plugin restart behavior, and empty device responses.

CM response handling also needed to correctly process machine lookup results and removal responses.

Changes

  • added missing unit tests for upstream syncer and garbage collection
  • added CRO MT improvement tests
  • refactored GPU driver type detection
  • added fallback detection for the NVIDIA kernel module
  • made the GPU operator namespace configurable by environment variable
  • replaced kubelet plugin daemonset restart with pod restart
  • improved GPU load check and drain handling
  • handled GPU detach cases where no devices are found
  • updated CM machine lookup and response handling

NekoHK added 11 commits June 15, 2026 20:33
Signed-off-by: Ko Kai <ko.kai@jp.fujitsu.com>
Signed-off-by: Ko Kai <ko.kai@jp.fujitsu.com>
Signed-off-by: Ko Kai <ko.kai@jp.fujitsu.com>
Signed-off-by: Ko Kai <ko.kai@jp.fujitsu.com>
Signed-off-by: Ko Kai <ko.kai@jp.fujitsu.com>
Signed-off-by: Ko Kai <ko.kai@jp.fujitsu.com>
Signed-off-by: Ko Kai <ko.kai@jp.fujitsu.com>
Signed-off-by: Ko Kai <ko.kai@jp.fujitsu.com>
Signed-off-by: Ko Kai <ko.kai@jp.fujitsu.com>
Signed-off-by: Ko Kai <ko.kai@jp.fujitsu.com>
Signed-off-by: Ko Kai <ko.kai@jp.fujitsu.com>
@NekoHK NekoHK force-pushed the github-sync-ado-dev branch from ba02cdc to 2ec315c Compare June 15, 2026 11:34

@mgazz mgazz left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the submission. The PR looks good, just a couple of suggestions to improve readibility.

var composableResourceLog = ctrl.Log.WithName("composable_resource_controller")

func nvidiaGpuOperatorNamespace() string {
if ns := os.Getenv("NVIDIA_GPU_OPERATOR_NAMESPACE"); ns != "" {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would update the documentation explaining that, if the gpu operator was installed in a custom namespace, this variable needs to be set. The default value is gpu-operator

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also log the default value when the variable is not set. Something like

log.Info("GPU_OPERATOR_NAMESPACE not set, using default", "namespace", ns)

if device.Detail.ResourceOPStatus[:1] == "0" {
resourceOPStatus := string(device.Detail.ResourceOPStatus)
if len(resourceOPStatus) == 0 {
return fmt.Errorf("the target gpu '%s' has empty status in CM", instance.Status.DeviceID)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would expand this message with a reference to the machine (machineID) making it easier to inspect the resources involved in the attachment


err = fmt.Errorf("failed to process CM scaleup request. http returned status: '%d', cm return code: '%s', error message: '%s'", errBody.Status, errBody.Detail.Code, errBody.Detail.Message)
clientLog.Error(err, "failed to process CM scaleup request", "ComposableResource", instance.Name)
if response.StatusCode < 200 || response.StatusCode >= 300 {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For readability I would isolate this check in a separated function.

func isHTTPSuccess(statusCode int) bool {
    return statusCode >= 200 && statusCode < 300 
}

The function can be used as following:

if !isHTTPSuccess(response.StatusCode) {
....
}

Comment on lines +293 to 299
if resourceOPStatus[:1] == "0" {
// The target device exists and has no error, return OK.
return nil
} else if device.Detail.ResourceOPStatus[:1] == "1" {
} else if resourceOPStatus[:1] == "1" {
return fmt.Errorf("the target gpu '%s' is showing a Warning status in CM", instance.Status.DeviceID)
} else if device.Detail.ResourceOPStatus[:1] == "2" {
} else if resourceOPStatus[:1] == "2" {
return fmt.Errorf("the target gpu '%s' is showing a Critical status in CM", instance.Status.DeviceID)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would use constants to explain the various status.

const ( 
ResourceStatusOK       = "0"
ResourceStatusWarning  = "1"
ResourceStatusCritical = "2"
)
if resourceOPStatus[:1] == ResourceStatusOK {
   return nil
} else if resourceOPStatus[:1] == ResourceStatusWarning { ....

@NekoHK NekoHK force-pushed the github-sync-ado-dev branch 2 times, most recently from e6fd6a8 to 2ec315c Compare June 16, 2026 07:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants