MPIJob EFA example doesn't apply #19

Open
kwohlfahrt opened this issue Aug 18, 2023 · 0 comments

kwohlfahrt commented Aug 18, 2023

The MPIJob EFA example here doesn't apply cleanly. It fails with the following error:

Error from server (BadRequest): error when creating "mpijob.yaml": MPIJob in version "v2beta1" cannot be handled as a MPIJob: strict decoding error: unknown field "spec.mpiReplicaSpecs.launcher.template.spec.imagePullPolicy", unknown field "spec.mpiReplicaSpecs.worker.template.spec.imagePullPolicy"

The issue is that imagePullPolicy must be specified on the container, not on the pod spec. Changing the launcher so that it reads like this (and doing the same for the worker) allows it to apply:

  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          #- image: <account>.dkr.ecr.us-west-2.amazonaws.com/cuda-efa-nccl-tests:ubuntu18.04
          - image: public.ecr.aws/w6p6i9i7/aws-efa-nccl-rdma:base-cudnn8-cuda11-ubuntu18.04
            imagePullPolicy: IfNotPresent
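
For anyone patching several manifests, the same change can be sketched in Python. This is only an illustration, not part of the example repo: it assumes the manifest has already been loaded into a plain dict (for instance with a YAML parser), the function name `move_image_pull_policy` is made up, and the image value is a placeholder.

```python
import copy


def move_image_pull_policy(mpijob):
    """Return a copy of the manifest with any pod-level imagePullPolicy
    moved onto each container of every replica spec (hypothetical helper)."""
    fixed = copy.deepcopy(mpijob)
    replica_specs = fixed.get("spec", {}).get("mpiReplicaSpecs", {})
    for replica in replica_specs.values():
        pod_spec = replica.get("template", {}).get("spec", {})
        # Remove the field from the pod spec, where strict decoding rejects it.
        policy = pod_spec.pop("imagePullPolicy", None)
        if policy is None:
            continue
        for container in pod_spec.get("containers", []):
            # Keep a container-level policy if one is already set.
            container.setdefault("imagePullPolicy", policy)
    return fixed


# Minimal stand-in for the broken manifest; the image name is a placeholder.
manifest = {
    "spec": {
        "mpiReplicaSpecs": {
            "Launcher": {
                "replicas": 1,
                "template": {
                    "spec": {
                        "restartPolicy": "OnFailure",
                        "imagePullPolicy": "IfNotPresent",
                        "containers": [{"image": "example-image"}],
                    }
                },
            }
        }
    }
}

fixed = move_image_pull_policy(manifest)
```

The helper works on a deep copy, so the original dict is left untouched and can be compared against the result.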

Edit: actually, even with this fix, I'm unable to get it running. The connection from the launcher is refused by the worker: Connection reset by 172.17.5.245 port 22.
