Update the dockerfile base image so that we can support NCCL #1248

Steboss · 2025-01-14T11:42:27Z

Update the base Docker image to cuda-dl-base from nvcr. This image supports NCCL, and it should be useful in the long term because it can be used on both AWS and GCP.

olupton · 2025-01-14T13:05:11Z

.github/container/Dockerfile.base

GitHub won't let me comment on individual lines lower down this file.

I think that, as well as the change you've made, we want to drop a significant number of the later lines in this Dockerfile, because they relate to components that are already shipped in cuda-dl-base.

e.g.

I think the tcpx plugin code is outdated and unused (@yhtang, @chaserileyroberts) - similarly the autoconfig script for GCP

install-nsight.sh, install-cudnn.sh, install-nccl.sh: all of these install CUDA-ish components that are distributed in the cuda-dl-base container

install-ofed.sh, install-efa.sh: these install bits of the networking stack that already come in the base container; running them on top of cuda-dl-base will actively break things

check-shm.sh: the base container implements this

@DwarKapex can advise on other Dockerfile changes that are needed when changing the base.

I'll start introducing these changes :)

Steboss · 2025-01-14T17:00:58Z

So I ran a check in the cuda-dl-base image, and I can see that:

for NCCL we need to create symlinks for the include and lib directories, so they're mapped into opt/nvidia/nccl
same for cuDNN
we can safely remove install-ofed.sh
and the Amazon EFA support
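
A check like this can be repeated inside the container with a small script. The sketch below is only an illustration: `missing_dirs` is a hypothetical helper, and the `include`/`lib` layout under the prefix is assumed from the description above rather than taken from the actual install scripts.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every expected subdirectory that is missing
# under a prefix. Assumes the <prefix>/include and <prefix>/lib layout
# described above; this is not part of the existing scripts.
missing_dirs() {
  local prefix="$1" sub
  for sub in include lib; do
    # -d follows symlinks, so a symlinked directory counts as present
    if [ ! -d "${prefix}/${sub}" ]; then
      echo "${prefix}/${sub}"
    fi
  done
}
```

Calling `missing_dirs` with the NCCL or cuDNN prefix prints nothing when both directories are visible, and prints the missing paths otherwise.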

For the symlink, we'd just need this part of the install-nccl.sh script (and the counterpart in install-cudnn.sh script):

arch=$(uname -m)-linux-gnu
for nccl_file in $(dpkg -L libnccl2 libnccl-dev | sort -u); do
  # Real files and symlinks are linked into $prefix
  if [[ -f "${nccl_file}" || -h "${nccl_file}" ]]; then
    # Replace /usr with $prefix and remove arch-specific lib directories
    nosysprefix="${nccl_file#"/usr/"}"
    noarchlib="${nosysprefix/#"lib/${arch}"/lib}"
    link_name="${prefix}/${noarchlib}"
    link_dir=$(dirname "${link_name}")
    mkdir -p "${link_dir}"
    ln -s "${nccl_file}" "${link_name}"
  else
    echo "Skipping ${nccl_file}"
  fi
done

@DwarKapex does that sound right to you?
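
Since the same loop is needed for both NCCL and cuDNN, it could be factored into one helper. The following is a rough sketch under stated assumptions, not the code in this PR: `link_into_prefix` is a hypothetical function name, the file list is read from stdin, and the system root, target prefix, and arch directory are passed as arguments. It also uses `ln -sf` instead of `ln -s`, so re-running it replaces existing links rather than failing.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read absolute file paths on stdin and symlink each one
# into <prefix>, dropping the <sysroot> prefix and any arch-specific lib dir.
# Usage: link_into_prefix <sysroot> <prefix> <arch>
link_into_prefix() {
  local sysroot="$1" prefix="$2" arch="$3" path rel
  while IFS= read -r path; do
    # Only real files and symlinks are linked; directories are skipped
    if [ -f "${path}" ] || [ -h "${path}" ]; then
      # Strip the system root, e.g. /usr/include/x.h -> include/x.h
      rel="${path#"${sysroot}/"}"
      # Map lib/<arch>/... to lib/...
      case "${rel}" in
        "lib/${arch}/"*) rel="lib/${rel#"lib/${arch}/"}" ;;
      esac
      mkdir -p "$(dirname "${prefix}/${rel}")"
      # -f replaces an existing link so the helper can be re-run
      ln -sf "${path}" "${prefix}/${rel}"
    else
      echo "Skipping ${path}" >&2
    fi
  done
}
```

For NCCL this would be driven by the same package query as the loop above, for example `dpkg -L libnccl2 libnccl-dev | sort -u | link_into_prefix /usr "${prefix}" "$(uname -m)-linux-gnu"`; the cuDNN counterpart would pass its own package names and prefix.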

Update the dockerfile base image so that we can support NCCL

dedf8c2

Steboss requested review from yhtang and olupton January 14, 2025 11:42

olupton requested changes Jan 14, 2025


remove no needed installs

ad2493a

STEFANO BOSISIO added 3 commits January 15, 2025 09:36

fix installs and symlinks

d95d51c

forgot to take out tcpx

d341fcd

fix way to link files, so cudnn.h is visible

62000ec
