Access NVIDIA GPUs in K8s in a non-privileged container #605

Open
pintohutch opened this issue Mar 18, 2024 · 6 comments
@pintohutch

Hello - I'm trying to see if it's possible to deploy NVIDIA DCGM on K8s with the securityContext.privileged field set to false for security reasons.

I was able to get this working by setting the container's resource requests and security context as follows:

          resources:
            requests:
              nvidia.com/gpu: "1"
            limits:
              nvidia.com/gpu: "1"
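          # SYS_ADMIN is typically required for DCGM's profiling (DCP) metrics;
          # all other capabilities are dropped.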
          securityContext:
            capabilities:
              add:
                - SYS_ADMIN
              drop:
                - ALL

However, this is not ideal for a few reasons:

  1. We sacrifice an entire GPU just for monitoring, which is an over-allocation, as DCGM does not need the full compute capacity of a GPU.
  2. This prevents other workloads from using an expensive resource.
  3. The Kubernetes scheduler will only place the monitoring pod on nodes that have unallocated GPU capacity.
  4. The container only seems to have access to one GPU device instead of all of the devices available on the node.

Is there any way to give the container access to the GPU devices without reserving them via nvidia.com/gpu resource requests?
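
For illustration, the rough shape of what I'm hoping is possible is sketched below. This assumes the NVIDIA container runtime is the node's default runtime and that it is configured to honor NVIDIA_VISIBLE_DEVICES for unprivileged containers - both assumptions on my part - and the image tag is only an example:

    containers:
      - name: dcgm
        image: nvcr.io/nvidia/cloud-native/dcgm:3.3.0-1-ubuntu22.04  # example tag, not verified
        env:
          # Ask the NVIDIA runtime to expose every GPU on the node,
          # bypassing the device plugin's nvidia.com/gpu resource.
          - name: NVIDIA_VISIBLE_DEVICES
            value: all
          - name: NVIDIA_DRIVER_CAPABILITIES
            value: utility
        securityContext:
          privileged: false
          capabilities:
            add:
              - SYS_ADMIN
            drop:
              - ALL
        # Note: no nvidia.com/gpu requests or limits at all.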

Thanks for any help you can provide.

@elezar self-assigned this Mar 19, 2024
@github-actions bot

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

@github-actions bot added the lifecycle/stale label Jun 18, 2024
@pintohutch

Hey @elezar - I see that you're assigned to this. Is this feasible in any way that you know of?

@github-actions bot removed the lifecycle/stale label Jun 19, 2024
@github-actions bot

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

@github-actions bot added the lifecycle/stale label Sep 17, 2024
@pintohutch

Hey @elezar gentle ping :)

@github-actions bot removed the lifecycle/stale label Sep 18, 2024
@github-actions bot

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

@github-actions bot added the lifecycle/stale label Dec 17, 2024
@chipzoller

Isn't this more appropriate for either the DCGM or DCGM Exporter repositories? If this refers to deploying DCGM Exporter, the DaemonSet used to do so is neither privileged nor does it request any GPUs. It uses node labels to schedule the DCGM Exporter pods only onto nodes that have NVIDIA GPUs.
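
For reference, the relevant portion of that DaemonSet looks roughly like the sketch below. The nvidia.com/gpu.present node label is the one published by GPU Feature Discovery; the exact label key and the image tag may differ in your deployment:

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: dcgm-exporter
    spec:
      selector:
        matchLabels:
          app: dcgm-exporter
      template:
        metadata:
          labels:
            app: dcgm-exporter
        spec:
          # Only schedule onto nodes that actually have an NVIDIA GPU.
          nodeSelector:
            nvidia.com/gpu.present: "true"
          containers:
            - name: dcgm-exporter
              image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.1-ubuntu22.04  # example tag
              ports:
                - name: metrics
                  containerPort: 9400
              securityContext:
                privileged: false
                # SYS_ADMIN may still be needed if profiling (DCP) metrics are enabled.
              # No nvidia.com/gpu resource requests or limits are set.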

@github-actions bot removed the lifecycle/stale label Dec 20, 2024