Fix GPU attach/detach handling and CM response processing#44
Conversation
Signed-off-by: Ko Kai <ko.kai@jp.fujitsu.com>
Signed-off-by: Ko Kai <ko.kai@jp.fujitsu.com>
Signed-off-by: Ko Kai <ko.kai@jp.fujitsu.com>
Signed-off-by: Ko Kai <ko.kai@jp.fujitsu.com>
Signed-off-by: Ko Kai <ko.kai@jp.fujitsu.com>
Signed-off-by: Ko Kai <ko.kai@jp.fujitsu.com>
Signed-off-by: Ko Kai <ko.kai@jp.fujitsu.com>
Signed-off-by: Ko Kai <ko.kai@jp.fujitsu.com>
Signed-off-by: Ko Kai <ko.kai@jp.fujitsu.com>
Signed-off-by: Ko Kai <ko.kai@jp.fujitsu.com>
Signed-off-by: Ko Kai <ko.kai@jp.fujitsu.com>
ba02cdc to
2ec315c
Compare
mgazz
left a comment
There was a problem hiding this comment.
Thank you for the submission. The PR looks good, just a couple of suggestions to improve readibility.
| var composableResourceLog = ctrl.Log.WithName("composable_resource_controller") | ||
|
|
||
| func nvidiaGpuOperatorNamespace() string { | ||
| if ns := os.Getenv("NVIDIA_GPU_OPERATOR_NAMESPACE"); ns != "" { |
There was a problem hiding this comment.
I would update the documentation explaining that, if the gpu operator was installed in a custom namespace, this variable needs to be set. The default value is gpu-operator
There was a problem hiding this comment.
I would also log the default value when the variable is not set. Something like
log.Info("GPU_OPERATOR_NAMESPACE not set, using default", "namespace", ns)| if device.Detail.ResourceOPStatus[:1] == "0" { | ||
| resourceOPStatus := string(device.Detail.ResourceOPStatus) | ||
| if len(resourceOPStatus) == 0 { | ||
| return fmt.Errorf("the target gpu '%s' has empty status in CM", instance.Status.DeviceID) |
There was a problem hiding this comment.
I would expand this message with a reference to the machine (machineID) making it easier to inspect the resources involved in the attachment
|
|
||
| err = fmt.Errorf("failed to process CM scaleup request. http returned status: '%d', cm return code: '%s', error message: '%s'", errBody.Status, errBody.Detail.Code, errBody.Detail.Message) | ||
| clientLog.Error(err, "failed to process CM scaleup request", "ComposableResource", instance.Name) | ||
| if response.StatusCode < 200 || response.StatusCode >= 300 { |
There was a problem hiding this comment.
For readability I would isolate this check in a separated function.
func isHTTPSuccess(statusCode int) bool {
return statusCode >= 200 && statusCode < 300
}The function can be used as following:
if !isHTTPSuccess(response.StatusCode) {
....
}| if resourceOPStatus[:1] == "0" { | ||
| // The target device exists and has no error, return OK. | ||
| return nil | ||
| } else if device.Detail.ResourceOPStatus[:1] == "1" { | ||
| } else if resourceOPStatus[:1] == "1" { | ||
| return fmt.Errorf("the target gpu '%s' is showing a Warning status in CM", instance.Status.DeviceID) | ||
| } else if device.Detail.ResourceOPStatus[:1] == "2" { | ||
| } else if resourceOPStatus[:1] == "2" { | ||
| return fmt.Errorf("the target gpu '%s' is showing a Critical status in CM", instance.Status.DeviceID) |
There was a problem hiding this comment.
I would use constants to explain the various status.
const (
ResourceStatusOK = "0"
ResourceStatusWarning = "1"
ResourceStatusCritical = "2"
)if resourceOPStatus[:1] == ResourceStatusOK {
return nil
} else if resourceOPStatus[:1] == ResourceStatusWarning { ....e6fd6a8 to
2ec315c
Compare
Background
GPU attach/detach processing needed improvements around driver detection, GPU load checks, kubelet plugin restart behavior, and empty device responses.
CM response handling also needed to correctly process machine lookup results and removal responses.
Changes