Access NVIDIA GPUs in K8s in a non-privileged container #605

Open
pintohutch opened this issue Mar 18, 2024 · 6 comments
@pintohutch

Hello - I'm trying to see if it's possible to deploy NVIDIA DCGM on K8s with the securityContext.privileged field set to false for security reasons.

I was able to get this working by setting the container's resource requests and security context as follows:

          resources:
            requests:
              nvidia.com/gpu: "1"
            limits:
              nvidia.com/gpu: "1"
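          # SYS_ADMIN is typically required for DCGM's profiling (DCP) metrics;
          # all other capabilities are dropped.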
          securityContext:
            capabilities:
              add:
                - SYS_ADMIN
              drop:
                - ALL

However, this is not ideal for a few reasons:

  1. We sacrifice an entire GPU just for monitoring, which is an over-allocation, as DCGM does not need the full compute capacity of a GPU.
  2. This prevents other workloads from using an expensive resource.
  3. The Kubernetes scheduler will only place the monitoring pod on nodes that have unallocated GPU capacity.
  4. The container only seems to have access to one GPU device instead of all of the devices available on the node.

Is there any way to give the container access to the GPU devices without reserving them via nvidia.com/gpu resource requests?
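
For illustration, the rough shape of what I'm hoping is possible is sketched below. This assumes the NVIDIA container runtime is the node's default runtime and that it is configured to honor NVIDIA_VISIBLE_DEVICES for unprivileged containers - both assumptions on my part - and the image tag is only an example:

    containers:
      - name: dcgm
        image: nvcr.io/nvidia/cloud-native/dcgm:3.3.0-1-ubuntu22.04  # example tag, not verified
        env:
          # Ask the NVIDIA runtime to expose every GPU on the node,
          # bypassing the device plugin's nvidia.com/gpu resource.
          - name: NVIDIA_VISIBLE_DEVICES
            value: all
          - name: NVIDIA_DRIVER_CAPABILITIES
            value: utility
        securityContext:
          privileged: false
          capabilities:
            add:
              - SYS_ADMIN
            drop:
              - ALL
        # Note: no nvidia.com/gpu requests or limits at all.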

Thanks for any help you can provide.

@elezar self-assigned this Mar 19, 2024
@github-actions bot

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

@github-actions bot added the lifecycle/stale label Jun 18, 2024
@pintohutch

Hey @elezar - I see that you're assigned to this. Is this feasible in any way that you know of?

@github-actions bot removed the lifecycle/stale label Jun 19, 2024
@github-actions bot

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

@github-actions bot added the lifecycle/stale label Sep 17, 2024
@pintohutch

Hey @elezar gentle ping :)

@github-actions bot removed the lifecycle/stale label Sep 18, 2024
@github-actions bot

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

@github-actions bot added the lifecycle/stale label Dec 17, 2024
@chipzoller

Isn't this more appropriate for either the DCGM or DCGM Exporter repositories? If this refers to deploying DCGM Exporter, the DaemonSet used to do so is neither privileged nor does it request any GPUs. It uses node labels to schedule the DCGM Exporter pods only onto nodes that have NVIDIA GPUs.
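
For reference, the relevant portion of that DaemonSet looks roughly like the sketch below. The nvidia.com/gpu.present node label is the one published by GPU Feature Discovery; the exact label key and the image tag may differ in your deployment:

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: dcgm-exporter
    spec:
      selector:
        matchLabels:
          app: dcgm-exporter
      template:
        metadata:
          labels:
            app: dcgm-exporter
        spec:
          # Only schedule onto nodes that actually have an NVIDIA GPU.
          nodeSelector:
            nvidia.com/gpu.present: "true"
          containers:
            - name: dcgm-exporter
              image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.1-ubuntu22.04  # example tag
              ports:
                - name: metrics
                  containerPort: 9400
              securityContext:
                privileged: false
                # SYS_ADMIN may still be needed if profiling (DCP) metrics are enabled.
              # No nvidia.com/gpu resource requests or limits are set.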

@github-actions bot removed the lifecycle/stale label Dec 20, 2024