Failed to send command to MPS daemon #762

Open
RonanQuigley opened this issue Jun 10, 2024 · 5 comments

1. Quick Debug Information

  • OS/Version (e.g. RHEL 8.6, Ubuntu 22.04): Ubuntu 22.04
  • Kernel Version: 5.15.0-112-generic
  • Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): containerd://1.6.12
  • K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): K8s

2. Issue or feature description

I'm struggling to understand how to enable MPS from the provided README. I'm using the nvidia-device-plugin helm chart, version 0.15.0 (not the gpu-operator chart).

Am I supposed to do something after enabling MPS via the config map? I've also tried going onto the relevant GPU worker node and starting MPS manually via nvidia-cuda-mps-control -d, but that made no difference:

[2024-06-10 15:16:40.777 Control 111377] Starting control daemon using socket /tmp/nvidia-mps/control
[2024-06-10 15:16:40.777 Control 111377] To connect CUDA applications to this daemon, set env CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
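
For reference, connecting a CUDA application to that manually started daemon would look roughly like this (a minimal sketch; the path is taken straight from the log above):

# point CUDA applications at the manually started MPS daemon
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps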

Logs from the nvidia-device-plugin-ctr container in the nvidia-device-plugin pod:

Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": true,
    "mpsRoot": "/run/nvidia/mps",
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "useNodeFeatureAPI": null,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {},
    "mps": {
      "failRequestsGreaterThanOne": true,
      "resources": [
        {
          "name": "nvidia.com/gpu",
          "devices": "all",
          "replicas": 20
        }
      ]
    }
  }
}
I0610 15:26:41.022164      39 main.go:279] Retrieving plugins.
I0610 15:26:41.022191      39 factory.go:104] Detected NVML platform: found NVML library
I0610 15:26:41.022226      39 factory.go:104] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0610 15:26:41.076279      39 main.go:301] Failed to start plugin: error waiting for MPS daemon: error checking MPS daemon health: failed to send command to MPS daemon: exit status 1
I0610 15:26:41.076311      39 main.go:208] Failed to start one or more plugins. Retrying in 30s...
# values.yaml
nodeSelector:
  nvidia.com/gpu: "true"

gfd:
  enabled: true
  nameOverride: gpu-feature-discovery
  namespaceOverride: <NAMESPACE>
  nodeSelector:
    nvidia.com/gpu: "true"

nfd:
  master:
    nodeSelector:
      nvidia.com/gpu: "true"
    tolerations:
    - key: "nvidia.com/gpu"
      operator: "Exists"
      effect: "NoSchedule"
  worker:
    nodeSelector:
      nvidia.com/gpu: "true"

config:
  name: nvidia-device-plugin-config

# nvidia-device-plugin-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config
  namespace: <NAMESPACE>
data:
  config: |-
    version: v1
    sharing:
      mps:
        renameByDefault: false
        resources:
          - name: nvidia.com/gpu
            replicas: 20
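
For completeness, this is roughly how the chart gets installed against that ConfigMap (a sketch; the nvdp release and repo names follow the chart's README and may differ from my actual setup):

# install/upgrade the device plugin chart with the values above
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --version 0.15.0 \
  --namespace <NAMESPACE> \
  -f values.yaml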

Additional information that might help better understand your environment and reproduce the bug:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.67                 Driver Version: 550.67         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    Off |   00000000:BE:00.0 Off |                    0 |
| N/A   67C    P0            279W /  350W |    3809MiB /  46068MiB |     97%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

klueska (Contributor) commented Jun 10, 2024

I haven't read your issue in detail, but maybe this will help:
https://docs.google.com/document/d/1H-ddA11laPQf_1olwXRjEDbzNihxprjPr74pZ4Vdf2M/edit

RonanQuigley (Author) commented Jun 10, 2024

> Furthermore, the presence of the nvidia.com/mps.capable=true label triggers the creation of a daemonset to manage the MPS control daemon.

Thanks. I did read this doc before posting the issue; the problem is that this never happens.
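
In case it helps reproduce: a quick way to confirm the label never appears is to show its column directly (a sketch using standard kubectl):

# show the MPS label (if any) for each node
kubectl get nodes -L nvidia.com/mps.capable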

RonanQuigley (Author) commented

So I don't know why, but if I reboot the offending machines after enabling MPS via the config map, then the MPS control daemon pods start up.

It'd be good to get to the bottom of why this is, as it took me hours to figure out and others might be hitting the same problem. Any ideas on what I can look at?
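
For anyone else hitting this, a lighter-weight thing to try before a full reboot might be restarting the plugin daemonset so it re-processes the config (an untested assumption on my part; the daemonset name depends on your release):

# force the device plugin to restart and re-read the config map
kubectl -n <NAMESPACE> rollout restart daemonset nvidia-device-plugin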

github-actions (bot) commented Sep 10, 2024

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

haitwang-cloud commented Dec 10, 2024

@RonanQuigley I'm encountering the same issue; it began with k8s-device-plugin v0.15 and, based on my testing today, continues in v0.17:

[2024-12-10 06:38:01.120 Control    64] Accepting connection...
[2024-12-10 06:38:01.120 Control    64] User did not send valid credentials
[2024-12-10 06:38:01.120 Control    64] Accepting connection...
[2024-12-10 06:38:01.120 Control    64] NEW CLIENT 0 from user 1001: Server is not ready, push client to pending list
[2024-12-10 06:38:01.120 Control    64] Starting new server 92 for user 1001
[2024-12-10 06:38:01.127 Control    64] Accepting connection...
[2024-12-10 06:38:01.158 Control    64] Server encountered a fatal exception. Shutting down
[2024-12-10 06:38:01.170 Control    64] Server 92 exited with status 1
[2024-12-10 06:38:01.170 Control    64] Starting new server 95 for user 1001
[2024-12-10 06:38:01.177 Control    64] Accepting connection...
[2024-12-10 06:38:01.210 Control    64] Server encountered a fatal exception. Shutting down
[2024-12-10 06:38:01.222 Control    64] Server 95 exited with status 1
[2024-12-10 06:38:01.223 Control    64] Starting new server 98 for user 1001
[2024-12-10 06:38:01.229 Control    64] Accepting connection...
[2024-12-10 06:38:01.260 Control    64] Server encountered a fatal exception. Shutting down
[2024-12-10 06:38:01.272 Control    64] Server 98 exited with status 1
[2024-12-10 06:38:01.272 Control    64] Starting new server 101 for user 1001
[2024-12-10 06:38:01.279 Control    64] Accepting connection...
[2024-12-10 06:38:01.311 Control    64] Server encountered a fatal exception. Shutting down
[2024-12-10 06:38:01.323 Control    64] Server 101 exited with status 1
[2024-12-10 06:38:01.324 Control    64] Starting new server 104 for user 1001
[2024-12-10 06:38:01.330 Control    64] Accepting connection...
[2024-12-10 06:38:01.361 Control    64] Server encountered a fatal exception. Shutting down
[2024-12-10 06:38:01.373 Control    64] Server 104 exited with status 1
[2024-12-10 06:38:01.373 Control    64] Starting new server 107 for user 1001
[2024-12-10 06:38:01.377 Control    64] Accepting connection...
[2024-12-10 06:38:01.406 Control    64] Server encountered a fatal exception. Shutting down
[2024-12-10 06:38:01.418 Control    64] Server 107 exited with status 1
[2024-12-10 06:38:01.418 Control    64] Removed Shm file at 
[2024-12-10 06:38:15.078 Control    64] Accepting connection...
[2024-12-10 06:38:15.078 Control    64] NEW UI
[2024-12-10 06:38:15.078 Control    64] Cmd:get_default_active_thread_percentage
[2024-12-10 06:38:15.078 Control    64] 25.0
[2024-12-10 06:38:15.078 Control    64] UI closed
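
The last few lines ("NEW UI ... UI closed") are just me querying the control daemon by piping a command on its stdin, which is standard nvidia-cuda-mps-control usage:

# ask the running control daemon for its default active thread percentage
echo get_default_active_thread_percentage | nvidia-cuda-mps-control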
