Failed to send command to MPS daemon #762

Open
RonanQuigley opened this issue Jun 10, 2024 · 5 comments

1. Quick Debug Information

  • OS/Version (e.g. RHEL 8.6, Ubuntu 22.04): Ubuntu 22.04
  • Kernel Version: 5.15.0-112-generic
  • Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): containerd://1.6.12
  • K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): K8s

2. Issue or feature description

I'm struggling to understand how to enable MPS from the provided README. I'm using the nvidia-device-plugin helm chart, version 0.15.0 (not the gpu-operator chart).

Am I supposed to do something after enabling MPS via the config map? I've also tried going onto the relevant GPU worker node and starting MPS manually via nvidia-cuda-mps-control -d, but that made no difference:

[2024-06-10 15:16:40.777 Control 111377] Starting control daemon using socket /tmp/nvidia-mps/control
[2024-06-10 15:16:40.777 Control 111377] To connect CUDA applications to this daemon, set env CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
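
For reference, connecting a CUDA application to that manually started daemon would look roughly like this (a minimal sketch; the path is taken straight from the log above):

# point CUDA applications at the manually started MPS daemon
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps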

Logs from the nvidia-device-plugin-ctr container in the nvidia-device-plugin pod:

Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": true,
    "mpsRoot": "/run/nvidia/mps",
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "useNodeFeatureAPI": null,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {},
    "mps": {
      "failRequestsGreaterThanOne": true,
      "resources": [
        {
          "name": "nvidia.com/gpu",
          "devices": "all",
          "replicas": 20
        }
      ]
    }
  }
}
I0610 15:26:41.022164      39 main.go:279] Retrieving plugins.
I0610 15:26:41.022191      39 factory.go:104] Detected NVML platform: found NVML library
I0610 15:26:41.022226      39 factory.go:104] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0610 15:26:41.076279      39 main.go:301] Failed to start plugin: error waiting for MPS daemon: error checking MPS daemon health: failed to send command to MPS daemon: exit status 1
I0610 15:26:41.076311      39 main.go:208] Failed to start one or more plugins. Retrying in 30s...
# values.yaml
nodeSelector:
  nvidia.com/gpu: "true"

gfd:
  enabled: true
  nameOverride: gpu-feature-discovery
  namespaceOverride: <NAMESPACE>
  nodeSelector:
    nvidia.com/gpu: "true"

nfd:
  master:
    nodeSelector:
      nvidia.com/gpu: "true"
    tolerations:
    - key: "nvidia.com/gpu"
      operator: "Exists"
      effect: "NoSchedule"
  worker:
    nodeSelector:
      nvidia.com/gpu: "true"

config:
  name: nvidia-device-plugin-config

# nvidia-device-plugin-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config
  namespace: <NAMESPACE>
data:
  config: |-
    version: v1
    sharing:
      mps:
        renameByDefault: false
        resources:
          - name: nvidia.com/gpu
            replicas: 20
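
For completeness, this is roughly how the chart gets installed against that ConfigMap (a sketch; the nvdp release and repo names follow the chart's README and may differ from my actual setup):

# install/upgrade the device plugin chart with the values above
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --version 0.15.0 \
  --namespace <NAMESPACE> \
  -f values.yaml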

Additional information that might help better understand your environment and reproduce the bug:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.67                 Driver Version: 550.67         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    Off |   00000000:BE:00.0 Off |                    0 |
| N/A   67C    P0            279W /  350W |    3809MiB /  46068MiB |     97%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

klueska (Contributor) commented Jun 10, 2024

I haven't read your issue in detail, but maybe this will help:
https://docs.google.com/document/d/1H-ddA11laPQf_1olwXRjEDbzNihxprjPr74pZ4Vdf2M/edit

RonanQuigley (Author) commented Jun 10, 2024

> Furthermore, the presence of the nvidia.com/mps.capable=true label triggers the creation of a daemonset to manage the MPS control daemon.

Thanks. I did read this doc before posting the issue; the problem is that this never happens.
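
In case it helps reproduce: a quick way to confirm the label never appears is to show its column directly (a sketch using standard kubectl):

# show the MPS label (if any) for each node
kubectl get nodes -L nvidia.com/mps.capable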

RonanQuigley (Author) commented

So I don't know why, but if I reboot the offending machines after enabling MPS via the config map, then the MPS control daemon pods start up.

It'd be good to get to the bottom of why this is, as it took me hours to figure out and others might be hitting the same problem. Any ideas on what I can look at?
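
For anyone else hitting this, a lighter-weight thing to try before a full reboot might be restarting the plugin daemonset so it re-processes the config (an untested assumption on my part; the daemonset name depends on your release):

# force the device plugin to restart and re-read the config map
kubectl -n <NAMESPACE> rollout restart daemonset nvidia-device-plugin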

github-actions (bot) commented Sep 10, 2024

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

haitwang-cloud commented Dec 10, 2024

@RonanQuigley I'm encountering the same issue; it began with k8s-device-plugin v0.15 and, based on my testing today, continues in v0.17:

[2024-12-10 06:38:01.120 Control    64] Accepting connection...
[2024-12-10 06:38:01.120 Control    64] User did not send valid credentials
[2024-12-10 06:38:01.120 Control    64] Accepting connection...
[2024-12-10 06:38:01.120 Control    64] NEW CLIENT 0 from user 1001: Server is not ready, push client to pending list
[2024-12-10 06:38:01.120 Control    64] Starting new server 92 for user 1001
[2024-12-10 06:38:01.127 Control    64] Accepting connection...
[2024-12-10 06:38:01.158 Control    64] Server encountered a fatal exception. Shutting down
[2024-12-10 06:38:01.170 Control    64] Server 92 exited with status 1
[2024-12-10 06:38:01.170 Control    64] Starting new server 95 for user 1001
[2024-12-10 06:38:01.177 Control    64] Accepting connection...
[2024-12-10 06:38:01.210 Control    64] Server encountered a fatal exception. Shutting down
[2024-12-10 06:38:01.222 Control    64] Server 95 exited with status 1
[2024-12-10 06:38:01.223 Control    64] Starting new server 98 for user 1001
[2024-12-10 06:38:01.229 Control    64] Accepting connection...
[2024-12-10 06:38:01.260 Control    64] Server encountered a fatal exception. Shutting down
[2024-12-10 06:38:01.272 Control    64] Server 98 exited with status 1
[2024-12-10 06:38:01.272 Control    64] Starting new server 101 for user 1001
[2024-12-10 06:38:01.279 Control    64] Accepting connection...
[2024-12-10 06:38:01.311 Control    64] Server encountered a fatal exception. Shutting down
[2024-12-10 06:38:01.323 Control    64] Server 101 exited with status 1
[2024-12-10 06:38:01.324 Control    64] Starting new server 104 for user 1001
[2024-12-10 06:38:01.330 Control    64] Accepting connection...
[2024-12-10 06:38:01.361 Control    64] Server encountered a fatal exception. Shutting down
[2024-12-10 06:38:01.373 Control    64] Server 104 exited with status 1
[2024-12-10 06:38:01.373 Control    64] Starting new server 107 for user 1001
[2024-12-10 06:38:01.377 Control    64] Accepting connection...
[2024-12-10 06:38:01.406 Control    64] Server encountered a fatal exception. Shutting down
[2024-12-10 06:38:01.418 Control    64] Server 107 exited with status 1
[2024-12-10 06:38:01.418 Control    64] Removed Shm file at 
[2024-12-10 06:38:15.078 Control    64] Accepting connection...
[2024-12-10 06:38:15.078 Control    64] NEW UI
[2024-12-10 06:38:15.078 Control    64] Cmd:get_default_active_thread_percentage
[2024-12-10 06:38:15.078 Control    64] 25.0
[2024-12-10 06:38:15.078 Control    64] UI closed
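
The last few lines ("NEW UI ... UI closed") are just me querying the control daemon by piping a command on its stdin, which is standard nvidia-cuda-mps-control usage:

# ask the running control daemon for its default active thread percentage
echo get_default_active_thread_percentage | nvidia-cuda-mps-control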
