-
Notifications
You must be signed in to change notification settings - Fork 644
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Raise a warning instead of an error if GPU mode labeler fails #1106
Conversation
LGTM. I am assuming that GFD just becomes a no-op when run on a node where pci devices aren't detected, yes? |
In WSL2, NVML is available so GFD behaves like normal, except for the gpu.mode label which requires information about the PCI device under /sys. |
@@ -215,7 +215,8 @@ func isMPSCapable(manager resource.Manager) (bool, error) { | |||
func newGPUModeLabeler(devices []resource.Device) (Labeler, error) { | |||
classes, err := getDeviceClasses(devices) | |||
if err != nil { | |||
return nil, err | |||
klog.Warningf("Failed to create GPU mode labeler: failed to get device classes: %v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The one question is whether we should return an nvidia.com/gpu.mode="unknown"
label here instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not opposed. Do we have any precedent for using unknown
for other labels where we are unable to gather the required information?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already return unknown
when getting the mode below. We also use this for the machinetype label here
machineType, err := getMachineType(machineTypePath) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I clearly did not look at the code below carefully enough. I have updated this to return unknown
instead.
For environments like WSL2 where GPUs do not show up as PCI devices, creating the GPU mode labeler will fail. This change allows GFD to proceed even if it is unable to get the GPU mode label. Signed-off-by: Christopher Desiniotis <cdesiniotis@nvidia.com>
a62dbde
to
bd00e7d
Compare
@cdesiniotis do we need to backport this? |
No, as long as this change makes it into the next device-plugin release (0.18.0) and our upcoming GPU Operator release picks it up we should be fine. |
For environments like WSL2 where GPUs do not show up as PCI devices, creating the GPU mode labeler will fail. This change allows GFD to proceed even if it is unable to get the GPU mode label.