Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Raise a warning instead of an error if GPU mode labeler fails #1106

Merged
merged 1 commit into from
Jan 10, 2025

Conversation

cdesiniotis
Copy link
Contributor

For environments like WSL2 where GPUs do not show up as PCI devices, creating the GPU mode labeler will fail. This change allows GFD to proceed even if it is unable to get the GPU mode label.

@tariq1890
Copy link
Contributor

LGTM. I am assuming that GFD just becomes a no-op when run on a node where pci devices aren't detected, yes?

@cdesiniotis
Copy link
Contributor Author

LGTM. I am assuming that GFD just becomes a no-op when run on a node where pci devices aren't detected, yes?

In WSL2, NVML is available so GFD behaves like normal, except for the gpu.mode label which requires information about the PCI device under /sys.

@@ -215,7 +215,8 @@ func isMPSCapable(manager resource.Manager) (bool, error) {
func newGPUModeLabeler(devices []resource.Device) (Labeler, error) {
classes, err := getDeviceClasses(devices)
if err != nil {
return nil, err
klog.Warningf("Failed to create GPU mode labeler: failed to get device classes: %v", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The one question is whether we should return an nvidia.com/gpu.mode="unknown" label here instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not opposed. Do we have any precedent for using unknown for other labels where we are unable to gather the required information?

Copy link
Member

@elezar elezar Jan 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already return unknown when getting the mode below. We also use this for the machinetype label here

machineType, err := getMachineType(machineTypePath)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I clearly did not look at the code below carefully enough. I have updated this to return unknown instead.

For environments like WSL2 where GPUs do not show up as PCI devices, creating
the GPU mode labeler will fail. This change allows GFD to proceed even if it
is unable to get the GPU mode label.

Signed-off-by: Christopher Desiniotis <cdesiniotis@nvidia.com>
@cdesiniotis cdesiniotis force-pushed the gpu-mode-labeler-optional branch from a62dbde to bd00e7d Compare January 7, 2025 21:22
@cdesiniotis cdesiniotis requested a review from elezar January 7, 2025 21:23
@elezar
Copy link
Member

elezar commented Jan 9, 2025

@cdesiniotis do we need to backport this?

@cdesiniotis
Copy link
Contributor Author

@cdesiniotis do we need to backport this?

No, as long as this change makes it into the next device-plugin release (0.18.0) and our upcoming GPU Operator release picks it up we should be fine.

@cdesiniotis cdesiniotis merged commit f7dc5f1 into main Jan 10, 2025
8 checks passed
@cdesiniotis cdesiniotis deleted the gpu-mode-labeler-optional branch January 10, 2025 00:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants