NCCL on Kubernetes #58

Triggered via schedule January 3, 2025 08:36
Status Failure
Total duration 23m 2s
Artifacts 1

nccl-k8s.yaml

on: schedule
build-mpi-operator-compatible-base / build-mpi-operator-compatible-base: 1m 43s
Matrix: nccl-test

Annotations

4 errors and 1 warning
nccl-test (all_gather_perf_mpi)
The self-hosted runner: eks-mfpzq-runner-5g6p6 lost communication with the server. Verify the machine is running and has a healthy network connection. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error.
nccl-test (broadcast_perf_mpi)
The self-hosted runner: eks-mfpzq-runner-khdp5 lost communication with the server. Verify the machine is running and has a healthy network connection. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error.
nccl-test (reduce_scatter_perf_mpi)
The self-hosted runner: eks-mfpzq-runner-f6mxn lost communication with the server. Verify the machine is running and has a healthy network connection. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error.
nccl-test (all_reduce_perf_mpi)
The job was canceled because "broadcast_perf_mpi" failed.
nccl-test (all_reduce_perf_mpi)
Runner eks-mfpzq-runner-spklg did not respond to a cancelation request with 00:05:00.

Artifacts

Produced during runtime
Name: artifact-mpi-operator-compatible-base-build-amd64
Size: 638 Bytes