Commit

update docs
gaoyang07 committed Jan 16, 2024
1 parent 28984c8 commit bf4baaf
Showing 4 changed files with 20 additions and 76 deletions.
70 changes: 7 additions & 63 deletions README.md
@@ -14,14 +14,12 @@
<div> </div>
</div>

[![Documentation Status](https://readthedocs.org/projects/internevo/badge/?version=latest)](https://internevo.readthedocs.io/zh_CN/latest/?badge=latest)
[![license](./doc/imgs/license.svg)](./LICENSE)
[![evaluation](./doc/imgs/compass_support.svg)](https://github.com/internLM/OpenCompass/)
[![Documentation Status](https://readthedocs.org/projects/internlm/badge/?version=latest)](https://internlm.readthedocs.io/zh_CN/latest/?badge=latest)

[📘Usage](./doc/en/usage.md) |
[🛠️Installation](./doc/en/install.md) |
[📊Train Performance](./doc/en/train_performance.md) |
[🤗HuggingFace](https://huggingface.co/internlm) |
[📊Performance](./doc/en/train_performance.md) |
[🤔Reporting Issues](https://github.com/InternLM/InternEvo/issues/new)

[English](./README.md) |
@@ -40,71 +38,17 @@ InternEvo is an open-sourced lightweight training framework aims to support mode

Based on the InternEvo training framework, we are continually releasing a variety of large language models, including the InternLM-7B series and InternLM-20B series, which significantly outperform numerous renowned open-source LLMs such as LLaMA and other leading models in the field.

### Dialogue

You can interact with the InternLM Chat 7B model through a frontend interface by running the following code:

```bash
pip install streamlit==1.24.0
pip install transformers==4.30.2
streamlit run web_demo.py
```

The effect is shown below:

![demo](https://github.com/InternLM/InternLM/assets/9102141/11b60ee0-47e4-42c0-8278-3051b2f17fe4)

### Deployment

We use [LMDeploy](https://github.com/InternLM/LMDeploy) to complete the one-click deployment of InternEvo.

1. First, install LMDeploy:

```shell
python3 -m pip install lmdeploy
```

2. Use the following command for interactive communication with the `internlm-chat-7b` model on localhost:

```shell
lmdeploy chat turbomind InternLM/internlm-chat-7b --model-name internlm-chat-7b
```

3. Besides chatting via command line, you can start lmdeploy `api_server` as below:

```shell
lmdeploy serve api_server InternLM/internlm-chat-7b --model-name internlm-chat-7b
```
For a comprehensive understanding of the `api_server` RESTful API, please refer to [this guide](https://github.com/InternLM/lmdeploy/blob/main/docs/en/restful_api.md). For additional deployment tutorials, see the [LMDeploy repository](https://github.com/InternLM/LMDeploy).
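
As an illustrative sketch (not from the original README), a request against the server might look like the following; the default port `23333` and the OpenAI-style `/v1/chat/completions` route are assumptions, so verify them against the RESTful API guide above for your LMDeploy version:

```shell
# Hypothetical example request; port and route are assumptions -- check the
# LMDeploy RESTful API guide linked above for the exact interface.
curl http://localhost:23333/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "internlm-chat-7b",
        "messages": [{"role": "user", "content": "Hello! Who are you?"}]
      }'
```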

## Training

### Pre-training and Fine-tuning Tutorial
## Quick Start

Please refer to the [Usage Tutorial](./doc/en/usage.md) to get started with InternEvo installation, data processing, pre-training, and fine-tuning.

### Convert to Transformers Format

Models trained with InternEvo can be easily converted to the Hugging Face Transformers format, making it straightforward to integrate them with various open-source projects in the community. With the help of `tools/transformers/convert2hf.py`, the weights saved during training can be converted into the Transformers format with a single command:

```bash
python tools/transformers/convert2hf.py --src_folder origin_ckpt/ --tgt_folder hf_ckpt/ --tokenizer ./tools/V7_sft.model
```

After conversion, the model can be loaded with Transformers using the following code:

```python
>>> from transformers import AutoTokenizer, AutoModel
>>> model = AutoModel.from_pretrained("hf_ckpt/", trust_remote_code=True).cuda()
```

## Training System
For more details, please refer to [internevo.readthedocs.io](https://internevo.readthedocs.io/zh_CN/latest/?badge=latest).

### System Architecture
## System Architecture

Please refer to the [System Architecture document](./doc/en/structure.md) for further details.
Please refer to the [System Architecture document](./doc/en/structure.md) for architecture details.

### Training Performance
## Performance

InternEvo deeply integrates Flash-Attention, Apex and other high-performance model operators to improve training efficiency. By building the Hybrid Zero technique, it achieves efficient overlap of computation and communication, significantly reducing cross-node communication traffic during training. InternEvo supports expanding the 7B model from 8 GPUs to 1024 GPUs, with an acceleration efficiency of up to 90% at the thousand-GPU scale, a training throughput of over 180 TFLOPS, and an average of over 3600 tokens per GPU per second. The following table shows InternEvo's scalability test data at different configurations:

14 changes: 7 additions & 7 deletions doc/en/install.md
@@ -25,16 +25,16 @@ export CXX=${GCC_HOME}/bin/c++
```

### Environment Installation
Clone the project `internlm` and its dependent submodules from the github repository, as follows:
Clone the project `InternEvo` and its dependent submodules from the github repository, as follows:
```bash
git clone git@github.com:InternLM/InternLM.git --recurse-submodules
git clone git@github.com:InternLM/InternEvo.git --recurse-submodules
```

It is recommended to build a Python-3.10 virtual environment using conda and install the required dependencies based on the `requirements/` files:
```bash
conda create --name internlm-env python=3.10 -y
conda activate internlm-env
cd internlm
conda create --name internevo python=3.10 -y
conda activate internevo
cd InternEvo
pip install -r requirements/torch.txt
pip install -r requirements/runtime.txt
```
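
As an optional sanity check (not from the original guide), you can confirm that PyTorch imports and that the GPUs are visible inside the new environment before continuing:

```bash
# Optional check (not part of the official steps): verify that PyTorch imports
# and that CUDA devices are visible inside the new environment.
python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.device_count())"
```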
@@ -62,7 +62,7 @@ cd ../../
Users can use the provided Dockerfile combined with docker.Makefile to build their own images, or obtain images with the InternLM runtime environment pre-installed from https://hub.docker.com/r/internlm/internlm.

#### Image Configuration and Build
The configuration and build of the Dockerfile are implemented through the docker.Makefile. To build the image, execute the following command in the root directory of InternLM:
The configuration and build of the Dockerfile are implemented through the docker.Makefile. To build the image, execute the following command in the root directory of InternEvo:
``` bash
make -f docker.Makefile BASE_OS=centos7
```
Expand All @@ -83,4 +83,4 @@ For the local standard image built with dockerfile or pulled, use the following
```bash
docker run --gpus all -it -m 500g --cap-add=SYS_PTRACE --cap-add=IPC_LOCK --shm-size 20g --network=host --name myinternlm internlm/internlm:torch1.13.1-cuda11.7.1-flashatten1.0.5-centos7 bash
```
The default directory in the container is `/InternLM`, please start training according to the [Usage](./usage.md).
The default directory in the container is `/InternEvo`, please start training according to the [Usage](./usage.md).
10 changes: 5 additions & 5 deletions doc/en/train_performance.md
@@ -1,15 +1,15 @@
## Training Performance


InternLM deeply integrates Flash-Attention, Apex, and other high-performance model operators to improve training efficiency. It achieves efficient overlap of computation and communication, significantly reducing cross-node communication traffic during training by building the Hybrid Zero technique. InternLM supports expanding the 7B model from 8 GPUs to 1024 GPUs, with an acceleration efficiency of up to 90% at the thousand-card scale, a training throughput of over 180 TFLOPS, and an average of over 3600 tokens per GPU per second. The following table shows InternLM's scalability test data at different configurations:
InternEvo deeply integrates Flash-Attention, Apex, and other high-performance model operators to improve training efficiency. It achieves efficient overlap of computation and communication, significantly reducing cross-node communication traffic during training by building the Hybrid Zero technique. InternEvo supports expanding the 7B model from 8 GPUs to 1024 GPUs, with an acceleration efficiency of up to 90% at the thousand-card scale, a training throughput of over 180 TFLOPS, and an average of over 3600 tokens per GPU per second. The following table shows InternEvo's scalability test data at different configurations:

| GPU Number | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
| ---------------- | ---- | ---- | ---- | ---- | ----- | ----- | ----- | ------ |
| TGS (Tokens/GPU/Second) | 4078 | 3939 | 3919 | 3944 | 3928 | 3920 | 3835 | 3625 |
| TFLOPS | 193 | 191 | 188 | 188 | 187 | 185 | 186 | 184 |


We tested the performance of training the 7B model in InternLM using various parallel configurations on a GPU cluster. In each test group, the number of tokens processed per GPU in a single iteration remained consistent. The hardware and parameter configurations used in the tests are shown in the table below:
We tested the performance of training the 7B model in InternEvo using various parallel configurations on a GPU cluster. In each test group, the number of tokens processed per GPU in a single iteration remained consistent. The hardware and parameter configurations used in the tests are shown in the table below:

| Hardware | Model |
| ----------------------- | ----------------------------- |
Expand All @@ -24,13 +24,13 @@ We tested the performance of training the 7B model in InternLM using various par
| micro_bsz | 2 | 4 |
| seq_len | 2048 | 2048 |

The configuration of `zero1` in InternLM determines the allocation range of optimizer states.
The configuration of `zero1` in InternEvo determines the allocation range of optimizer states.
- `zero1=-1` indicates that optimizer states are distributed across all data-parallel nodes (equivalent to Deepspeed Zero-1).
- In the case of `zero1=8, tp=1`, optimizer states are distributed within 8 GPUs in a single node, and the optimizer states remain consistent across different nodes.
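
As a hypothetical illustration, the settings discussed above typically live in the `parallel` section of a training config; the exact field names and schema may differ between InternEvo versions, so consult the example configs shipped with the repository:

```python
# Hypothetical sketch of the parallelism settings discussed above; the exact
# schema may differ between InternEvo versions -- check the bundled example configs.
parallel = dict(
    zero1=8,   # shard optimizer states across 8 GPUs within a node; -1 shards them across all data-parallel ranks
    tensor=1,  # tensor-parallel size (tp)
    pipeline=dict(size=1, interleaved_overlap=True),
    sequence_parallel=False,
)
```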

### Throughput Measurement

Throughput is defined as TGS, the average number of tokens processed per GPU per second. In this test, the training configuration had `pack_sample_into_one=False` and `checkpoint=False`. The test results are shown in the following table. When using `zero1=8, tp=1`, InternLM achieves an acceleration efficiency of `88%` for training the 7B model with a thousand cards.
Throughput is defined as TGS, the average number of tokens processed per GPU per second. In this test, the training configuration had `pack_sample_into_one=False` and `checkpoint=False`. The test results are shown in the following table. When using `zero1=8, tp=1`, InternEvo achieves an acceleration efficiency of `88%` for training the 7B model with a thousand cards.

| Parallel Configuration | 8 GPUs | 16 GPUs | 32 GPUs | 64 GPUs | 128 GPUs | 256 GPUs | 512 GPUs | 1024 GPUs |
| ---------------------- | ------ | ------- | ------- | ------- | -------- | -------- | -------- | --------- |
Expand All @@ -49,7 +49,7 @@ Throughput is defined as TGS, the average number of tokens processed per GPU per
The computational workload of model training is based on the FLOPS calculation method described in the [Megatron](https://deepakn94.github.io/assets/papers/megatron-sc21.pdf) paper. To ensure constant FLOPS during training, the test configuration had `pack_sample_into_one=True`, `dtype=torch.bfloat16`.
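
For reference, the Megatron-style estimate that this measurement follows can be sketched as below; this is an illustrative back-of-the-envelope helper, not InternEvo's internal accounting code:

```python
def megatron_flops_per_iteration(batch_size, seq_len, num_layers, hidden_size,
                                 vocab_size, activation_checkpointing=True):
    """Illustrative FLOPs-per-iteration estimate following the formula in the
    Megatron paper cited above: 96*B*s*l*h^2 * (1 + s/(6h) + V/(16*l*h)) with
    activation recomputation, and 72*... without it. A back-of-the-envelope
    sketch only, not InternEvo's internal accounting."""
    coeff = 96 if activation_checkpointing else 72
    return (coeff * batch_size * seq_len * num_layers * hidden_size ** 2
            * (1 + seq_len / (6 * hidden_size)
               + vocab_size / (16 * num_layers * hidden_size)))
```

Dividing this figure by the measured iteration time, the number of GPUs, and 10^12 gives the per-GPU TFLOPS values reported below.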


When `Activation Ckpt` is enabled, the test results are shown in the table below. InternLM can achieve `>180 TFLOPS` for 7B model training with 1024 GPUs.
When `Activation Ckpt` is enabled, the test results are shown in the table below. InternEvo can achieve `>180 TFLOPS` for 7B model training with 1024 GPUs.

- TGS: Tokens per GPU per Second

2 changes: 1 addition & 1 deletion doc/en/usage.md
@@ -8,7 +8,7 @@ Please refer to the [installation guide](./install.md) for instructions on how t

### Dataset Preparation (Pre-training)

The dataset for the InternLM training task includes a series of `bin` and `meta` files. A `tokenizer` is used to generate the training dataset from the original text files. The tokenizer model is imported by specifying the model parameter path in `tools/tokenizer.py`. Currently, `V7_sft.model` is provided to generate tokens. If you want to use a different model, you can directly modify the model parameter path in `tokenizer.py`.
The dataset for the InternEvo training task includes a series of `bin` and `meta` files. A `tokenizer` is used to generate the training dataset from the original text files. The tokenizer model is imported by specifying the model parameter path in `tools/tokenizer.py`. Currently, `V7_sft.model` is provided to generate tokens. If you want to use a different model, you can directly modify the model parameter path in `tokenizer.py`.

You can run the following command to generate `bin` and `meta` files corresponding to the original data. The parameter `text_input_path` represents the path of the original text data, currently supporting `txt`, `json`, and `jsonl` formats, while `bin_output_path` represents the save path of the generated `bin` files.
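
For example, an invocation might look like the following; the paths are placeholders and the flag names follow the parameters described above, so check `python tools/tokenizer.py --help` for the exact interface:

```bash
# Illustrative example; input and output paths are placeholders.
python tools/tokenizer.py --text_input_path data/raw_corpus.jsonl --bin_output_path data/raw_corpus.bin
```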

