Fix(docker): update docker image and dockerfile for new version (#200)
li126com authored Jul 16, 2024
1 parent aa3e9c4 commit 0f87f47
Showing 10 changed files with 93 additions and 82 deletions.
6 changes: 3 additions & 3 deletions README-zh-Hans.md
Original file line number Diff line number Diff line change
@@ -17,9 +17,9 @@
[![使用文档](https://readthedocs.org/projects/internevo/badge/?version=latest)](https://internevo.readthedocs.io/zh_CN/latest/?badge=latest)
[![license](./doc/imgs/license.svg)](./LICENSE)

[📘使用教程](./doc/en/usage.md) |
[🛠️安装指引](./doc/en/install.md) |
[📊框架性能](./doc/en/train_performance.md) |
[📘使用教程](./doc/usage.md) |
[🛠️安装指引](./doc/install.md) |
[📊框架性能](./doc/train_performance.md) |
[🤔问题报告](https://github.com/InternLM/InternEvo/issues/new)

[English](./README.md) |
24 changes: 16 additions & 8 deletions doc/en/install.md
@@ -78,7 +78,10 @@ cd ../../../../
Install Apex (version 23.05):
```bash
cd ./third_party/apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
# if pip >= 23.1 (ref: https://pip.pypa.io/en/stable/news/#v23-1) which supports multiple `--config-settings` with the same key...
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
# otherwise
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --global-option="--cpp_ext" --global-option="--cuda_ext" ./
cd ../../
```

@@ -88,31 +91,36 @@ pip install git+https://github.com/databricks/megablocks@v0.3.2 # MOE need
```

### Environment Image
Users can use the provided dockerfile combined with docker.Makefile to build their own images, or obtain images with InternEvo runtime environment installed from https://hub.docker.com/r/internlm/internlm.
Users can use the provided dockerfile combined with docker.Makefile to build their own images, or obtain images with InternEvo runtime environment installed from https://hub.docker.com/r/internlm/internevo/tags.

#### Image Configuration and Build
The configuration and build of the Dockerfile are implemented through the docker.Makefile. To build the image, execute the following command in the root directory of InternEvo:
``` bash
make -f docker.Makefile BASE_OS=centos7
```
In docker.Makefile, you can customize the basic image, environment version, etc., and the corresponding parameters can be passed directly through the command line. For BASE_OS, ubuntu20.04 and centos7 are respectively supported.
In docker.Makefile, you can customize the basic image, environment version, etc., and the corresponding parameters can be passed directly through the command line. The default is the recommended environment version. For BASE_OS, ubuntu20.04 and centos7 are respectively supported.

#### Pull Standard Image
The standard image based on ubuntu and centos has been built and can be directly pulled:

```bash
# ubuntu20.04
docker pull internlm/internlm:torch1.13.1-cuda11.7.1-flashatten1.0.5-ubuntu20.04
docker pull internlm/internevo:torch2.1.0-cuda11.8.0-flashatten2.2.1-ubuntu20.04
# centos7
docker pull internlm/internlm:torch1.13.1-cuda11.7.1-flashatten1.0.5-centos7
docker pull internlm/internevo:torch2.1.0-cuda11.8.0-flashatten2.2.1-centos7
```

#### Run Container
For the local standard image built with dockerfile or pulled, use the following command to run and enter the container:
```bash
docker run --gpus all -it -m 500g --cap-add=SYS_PTRACE --cap-add=IPC_LOCK --shm-size 20g --network=host --name myinternlm internlm/internlm:torch1.13.1-cuda11.7.1-flashatten1.0.5-centos7 bash
docker run --gpus all -it -m 500g --cap-add=SYS_PTRACE --cap-add=IPC_LOCK --shm-size 20g --network=host --name internevo_centos internlm/internevo:torch2.1.0-cuda11.8.0-flashatten2.2.1-centos7 bash
```

#### Start Training
The default directory in the container is `/InternEvo`, please start training according to the [Usage](./usage.md). The default 7B model starts the single-machine with 8-GPU training command example as follows:
```bash
torchrun --nproc_per_node=8 --nnodes=1 train.py --config configs/7B_sft.py --launcher torch
```
The default directory in the container is `/InternLM`, please start training according to the [Usage](./usage.md).

## Environment Installation (NPU)
For machines with NPU, the version of the installation environment can refer to that of GPU. Use Ascend's torch_npu instead of torch on NPU machines. Additionally, Flash-Attention and Apex are no longer supported for installation on NPU. The corresponding functionalities have been internally implemented in the InternEvo codebase. The following tutorial is only for installing torch_npu.
@@ -135,4 +143,4 @@ pip3 install pyyaml
pip3 install setuptools
wget https://gitee.com/ascend/pytorch/releases/download/v6.0.rc1-pytorch2.1.0/torch_npu-2.1.0.post3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
pip install torch_npu-2.1.0.post3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
```
```
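Note on the Apex install step in doc/en/install.md: the hunk lists two alternative commands and leaves the choice to the reader, depending on whether pip is at least 23.1. A minimal sketch of automating that choice is shown here. The function name `apex_ext_flags` is hypothetical and not part of the repository, and the version comparison assumes a `sort` that supports `-V` (as in GNU coreutils).

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in the repository): print the Apex extension flags
# that match a given pip version string, following the rule stated in the docs.
apex_ext_flags() {
    local pip_version="$1"
    # sort -V orders version strings; if "23.1" sorts first (or ties),
    # then pip_version >= 23.1 and repeated --config-settings keys are accepted.
    if [ "$(printf '%s\n' "23.1" "$pip_version" | sort -V | head -n1)" = "23.1" ]; then
        echo "--config-settings --build-option=--cpp_ext --config-settings --build-option=--cuda_ext"
    else
        # Older pip: fall back to the --global-option form.
        echo "--global-option=--cpp_ext --global-option=--cuda_ext"
    fi
}

# Example call with a fixed version string; in practice the version could be
# read with: pip --version | awk '{print $2}'
apex_ext_flags "23.1"
```

From `third_party/apex`, the result could be spliced into the documented command, for example `pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation $(apex_ext_flags "$(pip --version | awk '{print $2}')") ./`. The substitution is left unquoted on purpose so that each flag becomes a separate argument.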
21 changes: 14 additions & 7 deletions doc/install.md
@@ -78,7 +78,10 @@ cd ../../../../
安装 Apex (version 23.05):
```bash
cd ./third_party/apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
# if pip >= 23.1 (ref: https://pip.pypa.io/en/stable/news/#v23-1) which supports multiple `--config-settings` with the same key...
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
# otherwise
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --global-option="--cpp_ext" --global-option="--cuda_ext" ./
cd ../../
```

Expand All @@ -88,32 +91,36 @@ pip install git+https://github.com/databricks/megablocks@v0.3.2 # MOE相关
```

### 环境镜像
用户可以使用提供的 dockerfile 结合 docker.Makefile 来构建自己的镜像,或者也可以从 https://hub.docker.com/r/internlm/internlm 获取安装了 InternEvo 运行环境的镜像。
用户可以使用提供的 dockerfile 结合 docker.Makefile 来构建自己的镜像,或者也可以从 https://hub.docker.com/r/internlm/internevo/tags 获取安装了 InternEvo 运行环境的镜像。

#### 镜像配置及构造
dockerfile 的配置以及构造均通过 docker.Makefile 文件实现,在 InternEvo 根目录下执行如下命令即可 build 镜像:
``` bash
make -f docker.Makefile BASE_OS=centos7
```
在 docker.Makefile 中可自定义基础镜像,环境版本等内容,对应参数可直接通过命令行传递。对于 BASE_OS 分别支持 ubuntu20.04 和 centos7。
在 docker.Makefile 中可自定义基础镜像,环境版本等内容,对应参数可直接通过命令行传递,默认为推荐的环境版本。对于 BASE_OS 分别支持 ubuntu20.04 和 centos7。

#### 镜像拉取
基于 ubuntu 和 centos 的标准镜像已经 build 完成也可直接拉取使用:

```bash
# ubuntu20.04
docker pull internlm/internlm:torch1.13.1-cuda11.7.1-flashatten1.0.5-ubuntu20.04
docker pull internlm/internevo:torch2.1.0-cuda11.8.0-flashatten2.2.1-ubuntu20.04
# centos7
docker pull internlm/internlm:torch1.13.1-cuda11.7.1-flashatten1.0.5-centos7
docker pull internlm/internevo:torch2.1.0-cuda11.8.0-flashatten2.2.1-centos7
```

#### 容器启动
对于使用 dockerfile 构建或拉取的本地标准镜像,使用如下命令启动并进入容器:
```bash
docker run --gpus all -it -m 500g --cap-add=SYS_PTRACE --cap-add=IPC_LOCK --shm-size 20g --network=host --name myinternlm internlm/internlm:torch1.13.1-cuda11.7.1-flashatten1.0.5-centos7 bash
docker run --gpus all -it -m 500g --cap-add=SYS_PTRACE --cap-add=IPC_LOCK --shm-size 20g --network=host --name internevo_centos internlm/internevo:torch2.1.0-cuda11.8.0-flashatten2.2.1-centos7 bash
```
容器内默认目录即 `/InternLM`,根据[使用文档](./usage.md)即可启动训练。

#### 训练启动
容器内默认目录即 `/InternEvo`,参考[使用文档](./usage.md)可获取具体使用方法。默认7B模型启动单机8卡训练命令样例:
```bash
torchrun --nproc_per_node=8 --nnodes=1 train.py --config configs/7B_sft.py --launcher torch
```

## 环境安装(NPU)
在搭载NPU的机器上安装环境的版本可参考GPU,在NPU上使用昇腾torch_npu代替torch,同时Flash-Attention和Apex不再支持安装,相应功能已由InternEvo代码内部实现。以下教程仅为torch_npu安装。
24 changes: 10 additions & 14 deletions docker.Makefile
@@ -1,12 +1,11 @@
DOCKER_REGISTRY ?= docker.io
DOCKER_ORG ?= my
DOCKER_IMAGE ?= internlm
DOCKER_ORG ?= internlm
DOCKER_IMAGE ?= internevo
DOCKER_FULL_NAME = $(DOCKER_REGISTRY)/$(DOCKER_ORG)/$(DOCKER_IMAGE)

CUDA_VERSION = 11.7.1
GCC_VERSION = 10.2.0

CUDA_VERSION = 11.8.0
CUDNN_VERSION = 8

BASE_RUNTIME =
# ubuntu20.04 centos7
BASE_OS = centos7
@@ -17,9 +16,10 @@ CUDA_CHANNEL = nvidia
INSTALL_CHANNEL ?= pytorch

PYTHON_VERSION ?= 3.10
PYTORCH_VERSION ?= 1.13.1
TORCHVISION_VERSION ?= 0.14.1
TORCHAUDIO_VERSION ?= 0.13.1
PYTORCH_TAG ?= 2.1.0
PYTORCH_VERSION ?= 2.1.0+cu118
TORCHVISION_VERSION ?= 0.16.0+cu118
TORCHAUDIO_VERSION ?= 2.1.0+cu118
BUILD_PROGRESS ?= auto
TRITON_VERSION ?=
GMP_VERSION ?= 6.2.1
@@ -28,18 +28,14 @@ MPC_VERSION ?= 1.2.1
GCC_VERSION ?= 10.2.0
HTTPS_PROXY_I ?=
HTTP_PROXY_I ?=
FLASH_ATTEN_VERSION ?= 1.0.5
FLASH_ATTEN_VERSION ?= 2.2.1
FLASH_ATTEN_TAG ?= v${FLASH_ATTEN_VERSION}

BUILD_ARGS = --build-arg BASE_IMAGE=$(BASE_IMAGE) \
--build-arg PYTHON_VERSION=$(PYTHON_VERSION) \
--build-arg CUDA_VERSION=$(CUDA_VERSION) \
--build-arg CUDA_CHANNEL=$(CUDA_CHANNEL) \
--build-arg PYTORCH_VERSION=$(PYTORCH_VERSION) \
--build-arg TORCHVISION_VERSION=$(TORCHVISION_VERSION) \
--build-arg TORCHAUDIO_VERSION=$(TORCHAUDIO_VERSION) \
--build-arg INSTALL_CHANNEL=$(INSTALL_CHANNEL) \
--build-arg TRITON_VERSION=$(TRITON_VERSION) \
--build-arg GMP_VERSION=$(GMP_VERSION) \
--build-arg MPFR_VERSION=$(MPFR_VERSION) \
--build-arg MPC_VERSION=$(MPC_VERSION) \
@@ -98,7 +94,7 @@ all: devel-image

.PHONY: devel-image
devel-image: BASE_IMAGE := $(BASE_DEVEL)
devel-image: DOCKER_TAG := torch${PYTORCH_VERSION}-cuda${CUDA_VERSION}-flashatten${FLASH_ATTEN_VERSION}-${BASE_OS}
devel-image: DOCKER_TAG := torch${PYTORCH_TAG}-cuda${CUDA_VERSION}-flashatten${FLASH_ATTEN_VERSION}-${BASE_OS}
devel-image:
$(DOCKER_BUILD)

15 changes: 9 additions & 6 deletions docker/Dockerfile-centos
@@ -107,18 +107,18 @@ ENV CXX=${GCC_HOME}/bin/c++


##############################################################################
# Install InternLM development environment, including flash-attention and apex
# Install InternEvo development environment, including flash-attention and apex
##############################################################################
FROM dep as intrenlm-dev
COPY . /InternLM
WORKDIR /InternLM
COPY . /InternEvo
WORKDIR /InternEvo
ARG https_proxy
ARG http_proxy
ARG TORCH_CUDA_ARCH_LIST="8.0;8.6+PTX"
RUN git submodule update --init --recursive \
&& /opt/conda/bin/pip --no-cache-dir install -r requirements/torch.txt \
&& /opt/conda/bin/pip --no-cache-dir install -r requirements/runtime.txt \
&& cd /InternLM/third_party/flash-attention \
&& cd /InternEvo/third_party/flash-attention \
&& /opt/conda/bin/python setup.py install \
&& cd ./csrc \
&& cd fused_dense_lib && /opt/conda/bin/pip install -v . \
@@ -127,6 +127,9 @@ RUN git submodule update --init --recursive \
&& cd ../layer_norm && /opt/conda/bin/pip install -v . \
&& cd ../../../../ \
&& cd ./third_party/apex \
&& /opt/conda/bin/pip --no-cache-dir install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./ \
&& /opt/conda/bin/pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./ \
&& /opt/conda/bin/pip install pytorch-extension \
&& /opt/conda/bin/pip cache purge \
&& rm -rf ~/.cache/pip
&& rm -rf ~/.cache/pip \
&& /opt/conda/bin/conda init \
&& . ~/.bashrc
15 changes: 9 additions & 6 deletions docker/Dockerfile-ubuntu
@@ -88,18 +88,18 @@ ENV CXX=${GCC_HOME}/bin/c++


##############################################################################
# Install InternLM development environment, including flash-attention and apex
# Install InternEvo development environment, including flash-attention and apex
##############################################################################
FROM dep as intrenlm-dev
COPY . /InternLM
WORKDIR /InternLM
COPY . /InternEvo
WORKDIR /InternEvo
ARG https_proxy
ARG http_proxy
ARG TORCH_CUDA_ARCH_LIST="8.0;8.6+PTX"
RUN git submodule update --init --recursive \
&& /opt/conda/bin/pip --no-cache-dir install -r requirements/torch.txt \
&& /opt/conda/bin/pip --no-cache-dir install -r requirements/runtime.txt \
&& cd /InternLM/third_party/flash-attention \
&& cd /InternEvo/third_party/flash-attention \
&& /opt/conda/bin/python setup.py install \
&& cd ./csrc \
&& cd fused_dense_lib && /opt/conda/bin/pip install -v . \
@@ -108,6 +108,9 @@ RUN git submodule update --init --recursive \
&& cd ../layer_norm && /opt/conda/bin/pip install -v . \
&& cd ../../../../ \
&& cd ./third_party/apex \
&& /opt/conda/bin/pip --no-cache-dir install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./ \
&& /opt/conda/bin/pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./ \
&& /opt/conda/bin/pip install pytorch-extension \
&& /opt/conda/bin/pip cache purge \
&& rm -rf ~/.cache/pip
&& rm -rf ~/.cache/pip \
&& /opt/conda/bin/conda init \
&& . ~/.bashrc
23 changes: 13 additions & 10 deletions experiment/Dockerfile-centos
@@ -106,11 +106,11 @@ ENV CXX=${GCC_HOME}/bin/c++


##############################################################################
# Install InternLM development environment, including flash-attention and apex
# Install InternEvo development environment, including flash-attention and apex
##############################################################################
FROM dep as intrenlm-dev
COPY . /InternLM
WORKDIR /InternLM
COPY . /InternEvo
WORKDIR /InternEvo
ARG https_proxy
ARG http_proxy
ARG PYTORCH_VERSION
@@ -134,11 +134,11 @@ RUN /opt/conda/bin/pip --no-cache-dir install \
torch-scatter \
pyecharts \
py-libnuma \
-f https://data.pyg.org/whl/torch-${PYTORCH_VERSION}+cu117.html \
-f https://data.pyg.org/whl/torch-${PYTORCH_VERSION}.html \
&& /opt/conda/bin/pip --no-cache-dir install \
--extra-index-url https://download.pytorch.org/whl/cu117 \
torch==${PYTORCH_VERSION}+cu117 \
torchvision==${TORCHVISION_VERSION}+cu117 \
--extra-index-url https://download.pytorch.org/whl/cu118 \
torch==${PYTORCH_VERSION} \
torchvision==${TORCHVISION_VERSION} \
torchaudio==${TORCHAUDIO_VERSION}

ARG https_proxy
@@ -147,7 +147,7 @@ ARG TORCH_CUDA_ARCH_LIST="8.0;8.6+PTX"
ARG FLASH_ATTEN_TAG

RUN git submodule update --init --recursive \
&& cd /InternLM/third_party/flash-attention \
&& cd /InternEvo/third_party/flash-attention \
&& git checkout ${FLASH_ATTEN_TAG} \
&& /opt/conda/bin/python setup.py install \
&& cd ./csrc \
@@ -157,6 +157,9 @@ RUN git submodule update --init --recursive \
&& cd ../layer_norm && /opt/conda/bin/pip install -v . \
&& cd ../../../../ \
&& cd ./third_party/apex \
&& /opt/conda/bin/pip --no-cache-dir install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./ \
&& /opt/conda/bin/pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./ \
&& /opt/conda/bin/pip install pytorch-extension \
&& /opt/conda/bin/pip cache purge \
&& rm -rf ~/.cache/pip
&& rm -rf ~/.cache/pip \
&& /opt/conda/bin/conda init \
&& . ~/.bashrc
23 changes: 13 additions & 10 deletions experiment/Dockerfile-ubuntu
@@ -87,11 +87,11 @@ ENV CXX=${GCC_HOME}/bin/c++


##############################################################################
# Install InternLM development environment, including flash-attention and apex
# Install InternEvo development environment, including flash-attention and apex
##############################################################################
FROM dep as intrenlm-dev
COPY . /InternLM
WORKDIR /InternLM
COPY . /InternEvo
WORKDIR /InternEvo
ARG https_proxy
ARG http_proxy
ARG PYTORCH_VERSION
@@ -115,11 +115,11 @@ RUN /opt/conda/bin/pip --no-cache-dir install \
torch-scatter \
pyecharts \
py-libnuma \
-f https://data.pyg.org/whl/torch-${PYTORCH_VERSION}+cu117.html \
-f https://data.pyg.org/whl/torch-${PYTORCH_VERSION}.html \
&& /opt/conda/bin/pip --no-cache-dir install \
--extra-index-url https://download.pytorch.org/whl/cu117 \
torch==${PYTORCH_VERSION}+cu117 \
torchvision==${TORCHVISION_VERSION}+cu117 \
--extra-index-url https://download.pytorch.org/whl/cu118 \
torch==${PYTORCH_VERSION} \
torchvision==${TORCHVISION_VERSION} \
torchaudio==${TORCHAUDIO_VERSION}

ARG https_proxy
@@ -128,7 +128,7 @@ ARG TORCH_CUDA_ARCH_LIST="8.0;8.6+PTX"
ARG FLASH_ATTEN_TAG

RUN git submodule update --init --recursive \
&& cd /InternLM/third_party/flash-attention \
&& cd /InternEvo/third_party/flash-attention \
&& git checkout ${FLASH_ATTEN_TAG} \
&& /opt/conda/bin/python setup.py install \
&& cd ./csrc \
@@ -138,6 +138,9 @@ RUN git submodule update --init --recursive \
&& cd ../layer_norm && /opt/conda/bin/pip install -v . \
&& cd ../../../../ \
&& cd ./third_party/apex \
&& /opt/conda/bin/pip --no-cache-dir install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./ \
&& /opt/conda/bin/pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./ \
&& /opt/conda/bin/pip install pytorch-extension \
&& /opt/conda/bin/pip cache purge \
&& rm -rf ~/.cache/pip
&& rm -rf ~/.cache/pip \
&& /opt/conda/bin/conda init \
&& . ~/.bashrc