Update the dockerfile base image so that we can support NCCL #1248

Open
wants to merge 5 commits into
base: main


@Steboss Steboss commented Jan 14, 2025

Update the base Docker image to cuda-dl-base from nvcr. This image supports NCCL, and it should be useful in the long term for running on AWS and GCP.

@Steboss Steboss requested review from yhtang and olupton January 14, 2025 11:42
Collaborator


GitHub won't let me comment on individual lines lower down this file.

I think that, as well as the change you've made, we want to drop a significant number of the later lines in this Dockerfile, because they relate to components that are already shipped in cuda-dl-base.

e.g.

  • I think the tcpx plugin code is outdated and unused (@yhtang, @chaserileyroberts); the same goes for the autoconfig script for GCP
  • install-nsight.sh, install-cudnn.sh, install-nccl.sh: all of these install CUDA-ish components that are already distributed in the cuda-dl-base container
  • install-ofed.sh, install-efa.sh: these install bits of the networking stack that already come in the base container; running them on top of cuda-dl-base will actively break things
  • check-shm.sh: the base container already implements this

@DwarKapex can advise on other Dockerfile changes that are needed when changing the base.

Author


I'll start introducing these changes :)

Author

Steboss commented Jan 14, 2025

I ran a check in the cuda-dl-base image, and I can see that:

  • for NCCL we need to create symlinks for the include and lib directories, so that they are mapped in opt/nvidia/nccl
  • the same applies to cuDNN
  • we can safely remove install-ofed.sh
  • we can also remove the Amazon EFA support
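
A check of this kind could be sketched as follows. This is only an illustration, not the commands that were actually run: `report_installed` is a hypothetical helper, and it assumes the components are installed as Debian packages, so that `dpkg-query` can see them. The package names libnccl2 and libnccl-dev are the ones used in the install-nccl.sh excerpt in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether each named Debian package is installed.
# Assumes a dpkg-based image; anything dpkg-query cannot confirm is "missing".
report_installed() {
  local pkg
  for pkg in "$@"; do
    if dpkg-query -W -f='${Status}' "${pkg}" 2>/dev/null \
        | grep -q "install ok installed"; then
      echo "present: ${pkg}"
    else
      echo "missing: ${pkg}"
    fi
  done
}

# Example: run inside the candidate base image.
report_installed libnccl2 libnccl-dev
```

If a package is reported as present, the corresponding install script is a candidate for removal from the Dockerfile.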

For the symlinks, we would only need this part of the install-nccl.sh script (and its counterpart in the install-cudnn.sh script):

# Assumes ${prefix} is set earlier in the script; it is not defined in this excerpt
arch=$(uname -m)-linux-gnu
for nccl_file in $(dpkg -L libnccl2 libnccl-dev | sort -u); do
  # Real files and symlinks are linked into $prefix
  if [[ -f "${nccl_file}" || -h "${nccl_file}" ]]; then
    # Replace /usr with $prefix and remove arch-specific lib directories
    nosysprefix="${nccl_file#"/usr/"}"
    noarchlib="${nosysprefix/#"lib/${arch}"/lib}"
    link_name="${prefix}/${noarchlib}"
    link_dir=$(dirname "${link_name}")
    mkdir -p "${link_dir}"
    ln -s "${nccl_file}" "${link_name}"
  else
    echo "Skipping ${nccl_file}"
  fi
done

@DwarKapex does this sound right to you?
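
To make the path rewriting in the loop above easier to review, here is a minimal sketch that isolates it in a function. `map_to_prefix` is a hypothetical helper, not part of install-nccl.sh, and the prefix `/opt/nvidia/nccl` and the two file paths are assumed example values. It creates no links and only prints the link name the loop would use.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the link path that the loop would create
# for one packaged file, without touching the filesystem.
map_to_prefix() {
  local file="$1" prefix="$2" arch="$3"
  # Drop the leading /usr/ from the packaged path
  local nosysprefix="${file#"/usr/"}"
  # Collapse a leading lib/<arch> directory to plain lib
  local noarchlib="${nosysprefix/#"lib/${arch}"/lib}"
  printf '%s\n' "${prefix}/${noarchlib}"
}

# Example values only; the prefix is an assumption.
map_to_prefix /usr/include/nccl.h /opt/nvidia/nccl x86_64-linux-gnu
map_to_prefix /usr/lib/x86_64-linux-gnu/libnccl.so.2 /opt/nvidia/nccl x86_64-linux-gnu
```

With these inputs the header maps to `/opt/nvidia/nccl/include/nccl.h` and the library to `/opt/nvidia/nccl/lib/libnccl.so.2`, so both end up under arch-independent include and lib directories.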
