
GKE GPU nodes: nvidia-smi not found, likely missing env PATH and LD_LIBRARY_PATH #176

Open
MeCode4Food opened this issue Jan 7, 2025 · 1 comment


MeCode4Food commented Jan 7, 2025

I am trying to run a VS Code notebook on my GKE cluster's Kubeflow platform. The NVIDIA drivers are already installed on the node(s), as the following test pod shows:

apiVersion: v1
kind: Pod
metadata:
  name: my-gpu-pod
spec:
  containers:
  - name: my-gpu-container
    image: nvidia/cuda:11.0.3-runtime-ubuntu20.04
    command: ["/bin/bash", "-c", "--"]
    args: ["nvidia-smi; while true; do sleep 600; done;"]
    resources:
      limits:
        nvidia.com/gpu: 1
❯ kubectl logs -f my-gpu-pod
Tue Jan  7 02:42:11 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   36C    P8              9W /   70W |       1MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

However, inside the notebook container, nvidia-smi cannot be found:

(base) jovyan@ck-test-vscode-notebook-gpu-ok-0:~$ nvidia-smi
bash: nvidia-smi: command not found
❯ kubectl get pods -n kubeflow-user-example-com ck-test-vscode-notebook-gpu-ok-0 -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    istio.io/rev: default
    kubectl.kubernetes.io/default-container: ck-test-vscode-notebook-gpu-ok
    kubectl.kubernetes.io/default-logs-container: ck-test-vscode-notebook-gpu-ok
    poddefault.admission.kubeflow.org/poddefault-access-ml-pipeline: "43967"
    prometheus.io/path: /stats/prometheus
    prometheus.io/port: "15020"
    prometheus.io/scrape: "true"
    sidecar.istio.io/status: '{"initContainers":["istio-init"],"containers":["istio-proxy"],"volumes":["workload-socket","credential-socket","workload-certs","istio-envoy","istio-data","istio-podinfo","istio-token","istiod-ca-cert"],"imagePullSecrets":null,"revision":"default"}'
  creationTimestamp: "2025-01-06T14:25:37Z"
  generateName: ck-test-vscode-notebook-gpu-ok-
  labels:
    access-ml-pipeline: "true"
    app: ck-test-vscode-notebook-gpu-ok
    apps.kubernetes.io/pod-index: "0"
    controller-revision-hash: ck-test-vscode-notebook-gpu-ok-68cf86c894
    notebook-name: ck-test-vscode-notebook-gpu-ok
    security.istio.io/tlsMode: istio
    service.istio.io/canonical-name: ck-test-vscode-notebook-gpu-ok
    service.istio.io/canonical-revision: latest
    statefulset: ck-test-vscode-notebook-gpu-ok
    statefulset.kubernetes.io/pod-name: ck-test-vscode-notebook-gpu-ok-0
  name: ck-test-vscode-notebook-gpu-ok-0
  namespace: kubeflow-user-example-com
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: StatefulSet
    name: ck-test-vscode-notebook-gpu-ok
    uid: -
  resourceVersion: "25035037"
  uid: -
spec:
  containers:
  - env:
    - name: NB_PREFIX
      value: /notebook/kubeflow-user-example-com/ck-test-vscode-notebook-gpu-ok
    image: kubeflownotebookswg/codeserver-python:v1.8.0
    imagePullPolicy: IfNotPresent
    name: ck-test-vscode-notebook-gpu-ok
    ports:
    - containerPort: 8888
      name: notebook-port
      protocol: TCP
    resources:
      limits:
        cpu: 600m
        memory: 1288490188800m
        nvidia.com/gpu: "1"
      requests:
        cpu: 500m
        memory: 1Gi
        nvidia.com/gpu: "1"
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /dev/shm
      name: dshm
    - mountPath: /home/jovyan
      name: ck-test-vscode-notebook-gpu-ok-workspace
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-fzmnf
      readOnly: true
    - mountPath: /var/run/secrets/kubeflow/pipelines
      name: volume-kf-pipeline-token
      readOnly: true
    workingDir: /home/jovyan
  - args:
    - proxy
    - sidecar
    - --domain
    - $(POD_NAMESPACE).svc.cluster.local
    - --proxyLogLevel=warning
    - --proxyComponentLogLevel=misc:error
    - --log_output_level=default:info
    env:
    - name: JWT_POLICY
      value: third-party-jwt
    - name: PILOT_CERT_PROVIDER
      value: istiod
    - name: CA_ADDR
      value: istiod.istio-system.svc:15012
    - name: POD_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    - name: POD_NAMESPACE
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.namespace
    - name: INSTANCE_IP
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: status.podIP
    - name: SERVICE_ACCOUNT
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: spec.serviceAccountName
    - name: HOST_IP
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: status.hostIP
    - name: ISTIO_CPU_LIMIT
      valueFrom:
        resourceFieldRef:
          divisor: "0"
          resource: limits.cpu
    - name: PROXY_CONFIG
      value: |
        {}
    - name: ISTIO_META_POD_PORTS
      value: |-
        [
            {"name":"notebook-port","containerPort":8888,"protocol":"TCP"}
        ]
    - name: ISTIO_META_APP_CONTAINERS
      value: ck-test-vscode-notebook-gpu-ok
    - name: GOMEMLIMIT
      valueFrom:
        resourceFieldRef:
          divisor: "0"
          resource: limits.memory
    - name: GOMAXPROCS
      valueFrom:
        resourceFieldRef:
          divisor: "0"
          resource: limits.cpu
    - name: ISTIO_META_CLUSTER_ID
      value: Kubernetes
    - name: ISTIO_META_NODE_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: spec.nodeName
    - name: ISTIO_META_INTERCEPTION_MODE
      value: REDIRECT
    - name: ISTIO_META_WORKLOAD_NAME
      value: ck-test-vscode-notebook-gpu-ok
    - name: ISTIO_META_OWNER
      value: kubernetes://apis/apps/v1/namespaces/kubeflow-user-example-com/statefulsets/ck-test-vscode-notebook-gpu-ok
    - name: ISTIO_META_MESH_ID
      value: cluster.local
    - name: TRUST_DOMAIN
      value: cluster.local
    image: docker.io/istio/proxyv2:1.20.2
    imagePullPolicy: IfNotPresent
    name: istio-proxy
    ports:
    - containerPort: 15090
      name: http-envoy-prom
      protocol: TCP
    readinessProbe:
      failureThreshold: 4
      httpGet:
        path: /healthz/ready
        port: 15021
        scheme: HTTP
      periodSeconds: 15
      successThreshold: 1
      timeoutSeconds: 3
    resources:
      limits:
        cpu: "2"
        memory: 1Gi
      requests:
        cpu: 100m
        memory: 128Mi
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
      privileged: false
      readOnlyRootFilesystem: true
      runAsGroup: 1337
      runAsNonRoot: true
      runAsUser: 1337
    startupProbe:
      failureThreshold: 600
      httpGet:
        path: /healthz/ready
        port: 15021
        scheme: HTTP
      periodSeconds: 1
      successThreshold: 1
      timeoutSeconds: 3
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/workload-spiffe-uds
      name: workload-socket
    - mountPath: /var/run/secrets/credential-uds
      name: credential-socket
    - mountPath: /var/run/secrets/workload-spiffe-credentials
      name: workload-certs
    - mountPath: /var/run/secrets/istio
      name: istiod-ca-cert
    - mountPath: /var/lib/istio/data
      name: istio-data
    - mountPath: /etc/istio/proxy
      name: istio-envoy
    - mountPath: /var/run/secrets/tokens
      name: istio-token
    - mountPath: /etc/istio/pod
      name: istio-podinfo
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-fzmnf
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostname: ck-test-vscode-notebook-gpu-ok-0
  initContainers:
  - args:
    - istio-iptables
    - -p
    - "15001"
    - -z
    - "15006"
    - -u
    - "1337"
    - -m
    - REDIRECT
    - -i
    - '*'
    - -x
    - ""
    - -b
    - '*'
    - -d
    - 15090,15021,15020
    - --log_output_level=default:info
    image: docker.io/istio/proxyv2:1.20.2
    imagePullPolicy: IfNotPresent
    name: istio-init
    resources:
      limits:
        cpu: "2"
        memory: 1Gi
      requests:
        cpu: 100m
        memory: 128Mi
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        add:
        - NET_ADMIN
        - NET_RAW
        drop:
        - ALL
      privileged: false
      readOnlyRootFilesystem: false
      runAsGroup: 0
      runAsNonRoot: false
      runAsUser: 0
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-fzmnf
      readOnly: true
  nodeName: gke-ml-sg-gpu-pool-3-x-y
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext:
    fsGroup: 100
  serviceAccount: default-editor
  serviceAccountName: default-editor
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  - effect: NoSchedule
    key: nvidia.com/gpu
    operator: Exists
  volumes:
  - emptyDir: {}
    name: workload-socket
  - emptyDir: {}
    name: credential-socket
  - emptyDir: {}
    name: workload-certs
  - emptyDir:
      medium: Memory
    name: istio-envoy
  - emptyDir: {}
    name: istio-data
  - downwardAPI:
      defaultMode: 420
      items:
      - fieldRef:
          apiVersion: v1
          fieldPath: metadata.labels
        path: labels
      - fieldRef:
          apiVersion: v1
          fieldPath: metadata.annotations
        path: annotations
    name: istio-podinfo
  - name: istio-token
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          audience: istio-ca
          expirationSeconds: 43200
          path: istio-token
  - configMap:
      defaultMode: 420
      name: istio-ca-root-cert
    name: istiod-ca-cert
  - emptyDir:
      medium: Memory
    name: dshm
  - name: ck-test-vscode-notebook-gpu-ok-workspace
    persistentVolumeClaim:
      claimName: ck-test-vscode-notebook-gpu-ok-workspace
  - name: kube-api-access-fzmnf
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
  - name: volume-kf-pipeline-token
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          audience: pipelines.kubeflow.org
          expirationSeconds: 7200
          path: token
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2025-01-06T14:25:49Z"
    status: "True"
    type: PodReadyToStartContainers
  - lastProbeTime: null
    lastTransitionTime: "2025-01-06T14:25:50Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2025-01-06T14:25:52Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2025-01-06T14:25:52Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2025-01-06T14:25:41Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://y
    image: docker.io/kubeflownotebookswg/codeserver-python:v1.8.0
    imageID: docker.io/kubeflownotebookswg/codeserver-python@sha256:bf91bc4c205a8674f4dfe9dd92ed1e63ca2ebd74026e54dc39107c95087962ba
    lastState: {}
    name: ck-test-vscode-notebook-gpu-ok
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2025-01-06T14:25:50Z"
  - containerID: containerd://z
    image: docker.io/istio/proxyv2:1.20.2
    imageID: docker.io/istio/proxyv2@sha256:5786e72bf56c4cdf58e88dad39579a24875d05e213aa9a7bba3c59206f84ab6c
    lastState: {}
    name: istio-proxy
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2025-01-06T14:25:50Z"
  hostIP: x
  hostIPs:
  - ip: x
  initContainerStatuses:
  - containerID: containerd://x
    image: docker.io/istio/proxyv2:1.20.2
    imageID: docker.io/istio/proxyv2@sha256:5786e72bf56c4cdf58e88dad39579a24875d05e213aa9a7bba3c59206f84ab6c
    lastState: {}
    name: istio-init
    ready: true
    restartCount: 0
    started: false
    state:
      terminated:
        containerID: containerd://x
        exitCode: 0
        finishedAt: "2025-01-06T14:25:49Z"
        reason: Completed
        startedAt: "2025-01-06T14:25:49Z"
  phase: Running
  podIP: y
  podIPs:
  - ip: y
  qosClass: Burstable
  startTime: "2025-01-06T14:25:41Z"

This is remedied by adding the NVIDIA driver directories to PATH and LD_LIBRARY_PATH:

(base) jovyan@ck-test-vscode-notebook-gpu-ok-0:~$ nvidia-smi
bash: nvidia-smi: command not found
(base) jovyan@ck-test-vscode-notebook-gpu-ok-0:~$ export PATH=$PATH:/usr/local/nvidia/bin
(base) jovyan@ck-test-vscode-notebook-gpu-ok-0:~$ nvidia-smi
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.
(base) jovyan@ck-test-vscode-notebook-gpu-ok-0:~$ export LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64
(base) jovyan@ck-test-vscode-notebook-gpu-ok-0:~$ nvidia-smi
Tue Jan  7 02:48:50 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       Off |   00000000:00:05.0 Off |                    0 |
| N/A   37C    P8             10W /   70W |       1MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                        
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
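The same check can be run from Python before and after setting LD_LIBRARY_PATH, to confirm whether the dynamic loader can locate the NVML library. A minimal sketch (the `nvml_available` helper is illustrative, not part of the image):

```python
import ctypes

def nvml_available() -> bool:
    """Return True if the dynamic loader can locate libnvidia-ml.

    The lookup honours LD_LIBRARY_PATH, which is why nvidia-smi fails
    until /usr/local/nvidia/lib64 is added to it.
    """
    try:
        ctypes.CDLL("libnvidia-ml.so.1")
        return True
    except OSError:
        return False

print("libnvidia-ml found" if nvml_available()
      else "libnvidia-ml NOT found -- check LD_LIBRARY_PATH")
```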

Note: this issue is also present in the jupyter-pytorch-cuda-full image.
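As a workaround until the images are fixed, the two exports can be persisted in the notebook's home directory, which lives on the workspace PVC and therefore survives pod restarts. A sketch, assuming the same driver paths shown above:

```shell
# Append the GKE NVIDIA driver paths to ~/.bashrc so every new
# terminal in the notebook picks them up automatically.
cat >> "$HOME/.bashrc" <<'EOF'
export PATH="$PATH:/usr/local/nvidia/bin"
export LD_LIBRARY_PATH="/usr/local/nvidia/lib:/usr/local/nvidia/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
EOF
```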

MeCode4Food (Author) commented:

LD_LIBRARY_PATH can easily be added via the PodDefault CRD, but the PATH variable is not easily extended that way. Building a new image that extends PATH resolves the issue, but I wonder whether this could be added to the base image(s), unless there is a way to resolve it without.
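A minimal sketch of the PodDefault approach mentioned above (the resource name and the matchLabels key are illustrative). Only LD_LIBRARY_PATH is set, since injecting PATH as a pod env var would replace the image's PATH rather than extend it:

```yaml
apiVersion: kubeflow.org/v1alpha1
kind: PodDefault
metadata:
  name: nvidia-driver-env        # illustrative name
  namespace: kubeflow-user-example-com
spec:
  desc: Add GKE NVIDIA driver libraries to LD_LIBRARY_PATH
  selector:
    matchLabels:
      nvidia-driver-env: "true"  # label to opt a notebook in
  env:
  - name: LD_LIBRARY_PATH
    value: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
```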
