-
Notifications
You must be signed in to change notification settings - Fork 95
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add support for GPU resources #1701
Conversation
Thanks for your contribution. We'll review the PR |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You took the right direction, but I'm concerned that a GPU device only considers Nvidia hardware. Users who have gpu.intel.com/i915
or gpu.intel.com/xe
for example, cannot use this analyser.
I don't yet have a clean suggestion that works, but I'd explore combining filters and outcomes. |
I think at this stage we can document the limitation and work with just Nvidia, expanding the scope in a future PR if there's demand. |
The more generic approach would be fetching by resource name. However it would have to be a function This would cover any custom resource added by any device plugin. I started on this, but I was having a hard time getting the regex to work as I wanted it to |
#1162 is related |
@miguelvr we were reviewing this and issue #1162 from above and wanted to get your input on this approach. I think that would be a bit more flexible to address the other open issue but want to confirm it would resolve your issue as well. I think it's what you're doing, but just a bit more flexible for the future. I'd like you to review that though and let us know what you think :) |
Description, Motivation and Context
nvidia.com/gpu
is a common resource for many AI applications, but currently, Troubleshoot does not allow users to check the availability of GPUs in their cluster.I've considered taking a more generic approach with a function
resourceName(<resource>)
, but decided against it because it would either require taking in the units as a parameterresourceName(<resource>, <units>)
or default to something likeDecimalSI
, which could lead to errors for certain resources.For simplicity, I decided to go with the more explicit approach.
Checklist
Does this PR introduce a breaking change?