Commit

update docs
gaoyang07 committed Jan 16, 2024
1 parent 28984c8 commit bf4baaf
Showing 4 changed files with 20 additions and 76 deletions.
70 changes: 7 additions & 63 deletions README.md
@@ -14,14 +14,12 @@
<div> </div>
</div>

[![Documentation Status](https://readthedocs.org/projects/internevo/badge/?version=latest)](https://internevo.readthedocs.io/zh_CN/latest/?badge=latest)
[![license](./doc/imgs/license.svg)](./LICENSE)
[![evaluation](./doc/imgs/compass_support.svg)](https://github.com/internLM/OpenCompass/)
[![Documentation Status](https://readthedocs.org/projects/internlm/badge/?version=latest)](https://internlm.readthedocs.io/zh_CN/latest/?badge=latest)

[📘Usage](./doc/en/usage.md) |
[🛠️Installation](./doc/en/install.md) |
[📊Train Performance](./doc/en/train_performance.md) |
[🤗HuggingFace](https://huggingface.co/internlm) |
[📊Performance](./doc/en/train_performance.md) |
[🤔Reporting Issues](https://github.com/InternLM/InternEvo/issues/new)

[English](./README.md) |
@@ -40,71 +38,17 @@ InternEvo is an open-sourced lightweight training framework aims to support mode

Based on the InternEvo training framework, we are continually releasing a variety of large language models, including the InternLM-7B series and InternLM-20B series, which significantly outperform numerous renowned open-source LLMs such as LLaMA and other leading models in the field.

### Dialogue

You can interact with the InternLM Chat 7B model through a frontend interface by running the following code:

```bash
pip install streamlit==1.24.0
pip install transformers==4.30.2
streamlit run web_demo.py
```

The effect is shown below:

![demo](https://github.com/InternLM/InternLM/assets/9102141/11b60ee0-47e4-42c0-8278-3051b2f17fe4)

### Deployment

We use [LMDeploy](https://github.com/InternLM/LMDeploy) to complete the one-click deployment of InternEvo.

1. First, install LMDeploy:

```shell
python3 -m pip install lmdeploy
```

2. Use the following command for interactive communication with the `internlm-chat-7b` model on localhost:

```shell
lmdeploy chat turbomind InternLM/internlm-chat-7b --model-name internlm-chat-7b
```

3. Besides chatting via command line, you can start lmdeploy `api_server` as below:

```shell
lmdeploy serve api_server InternLM/internlm-chat-7b --model-name internlm-chat-7b
```
For a comprehensive understanding of the `api_server` RESTful API, please refer to [this guide](https://github.com/InternLM/lmdeploy/blob/main/docs/en/restful_api.md). For additional deployment tutorials, see the [LMDeploy repository](https://github.com/InternLM/LMDeploy).
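
As an illustrative sketch (not from the original README), a request against the server might look like the following; the default port `23333` and the OpenAI-style `/v1/chat/completions` route are assumptions, so verify them against the RESTful API guide above for your LMDeploy version:

```shell
# Hypothetical example request; port and route are assumptions -- check the
# LMDeploy RESTful API guide linked above for the exact interface.
curl http://localhost:23333/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "internlm-chat-7b",
        "messages": [{"role": "user", "content": "Hello! Who are you?"}]
      }'
```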

## Training

### Pre-training and Fine-tuning Tutorial
## Quick Start

Please refer to the [Usage Tutorial](./doc/en/usage.md) to get started with InternEvo installation, data processing, pre-training, and fine-tuning.

### Convert to Transformers Format

Models trained with InternEvo can be easily converted to the Hugging Face Transformers format, making it straightforward to integrate them with various open-source projects in the community. With the help of `tools/transformers/convert2hf.py`, the weights saved during training can be converted into the Transformers format with a single command:

```bash
python tools/transformers/convert2hf.py --src_folder origin_ckpt/ --tgt_folder hf_ckpt/ --tokenizer ./tools/V7_sft.model
```

After conversion, the model can be loaded with Transformers using the following code:

```python
>>> from transformers import AutoTokenizer, AutoModel
>>> model = AutoModel.from_pretrained("hf_ckpt/", trust_remote_code=True).cuda()
```

## Training System
For more details, please refer to [internevo.readthedocs.io](https://internevo.readthedocs.io/zh_CN/latest/?badge=latest).

### System Architecture
## System Architecture

Please refer to the [System Architecture document](./doc/en/structure.md) for further details.
Please refer to the [System Architecture document](./doc/en/structure.md) for architecture details.

### Training Performance
## Performance

InternEvo deeply integrates Flash-Attention, Apex and other high-performance model operators to improve training efficiency. By building the Hybrid Zero technique, it achieves efficient overlap of computation and communication, significantly reducing cross-node communication traffic during training. InternEvo supports expanding the 7B model from 8 GPUs to 1024 GPUs, with an acceleration efficiency of up to 90% at the thousand-GPU scale, a training throughput of over 180 TFLOPS, and an average of over 3600 tokens per GPU per second. The following table shows InternEvo's scalability test data at different configurations:

14 changes: 7 additions & 7 deletions doc/en/install.md
@@ -25,16 +25,16 @@ export CXX=${GCC_HOME}/bin/c++
```

### Environment Installation
Clone the project `internlm` and its dependent submodules from the github repository, as follows:
Clone the project `InternEvo` and its dependent submodules from the github repository, as follows:
```bash
git clone git@github.com:InternLM/InternLM.git --recurse-submodules
git clone git@github.com:InternLM/InternEvo.git --recurse-submodules
```

It is recommended to build a Python-3.10 virtual environment using conda and install the required dependencies based on the `requirements/` files:
```bash
conda create --name internlm-env python=3.10 -y
conda activate internlm-env
cd internlm
conda create --name internevo python=3.10 -y
conda activate internevo
cd InternEvo
pip install -r requirements/torch.txt
pip install -r requirements/runtime.txt
```
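
As an optional sanity check (not from the original guide), you can confirm that PyTorch imports and that the GPUs are visible inside the new environment before continuing:

```bash
# Optional check (not part of the official steps): verify that PyTorch imports
# and that CUDA devices are visible inside the new environment.
python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.device_count())"
```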
@@ -62,7 +62,7 @@ cd ../../
Users can use the provided Dockerfile combined with docker.Makefile to build their own images, or obtain images with the InternLM runtime environment pre-installed from https://hub.docker.com/r/internlm/internlm.

#### Image Configuration and Build
The configuration and build of the Dockerfile are implemented through the docker.Makefile. To build the image, execute the following command in the root directory of InternLM:
The configuration and build of the Dockerfile are implemented through the docker.Makefile. To build the image, execute the following command in the root directory of InternEvo:
``` bash
make -f docker.Makefile BASE_OS=centos7
```
Expand All @@ -83,4 +83,4 @@ For the local standard image built with dockerfile or pulled, use the following
```bash
docker run --gpus all -it -m 500g --cap-add=SYS_PTRACE --cap-add=IPC_LOCK --shm-size 20g --network=host --name myinternlm internlm/internlm:torch1.13.1-cuda11.7.1-flashatten1.0.5-centos7 bash
```
The default directory in the container is `/InternLM`, please start training according to the [Usage](./usage.md).
The default directory in the container is `/InternEvo`, please start training according to the [Usage](./usage.md).
10 changes: 5 additions & 5 deletions doc/en/train_performance.md
@@ -1,15 +1,15 @@
## Training Performance


InternLM deeply integrates Flash-Attention, Apex, and other high-performance model operators to improve training efficiency. It achieves efficient overlap of computation and communication, significantly reducing cross-node communication traffic during training by building the Hybrid Zero technique. InternLM supports expanding the 7B model from 8 GPUs to 1024 GPUs, with an acceleration efficiency of up to 90% at the thousand-card scale, a training throughput of over 180 TFLOPS, and an average of over 3600 tokens per GPU per second. The following table shows InternLM's scalability test data at different configurations:
InternEvo deeply integrates Flash-Attention, Apex, and other high-performance model operators to improve training efficiency. It achieves efficient overlap of computation and communication, significantly reducing cross-node communication traffic during training by building the Hybrid Zero technique. InternEvo supports expanding the 7B model from 8 GPUs to 1024 GPUs, with an acceleration efficiency of up to 90% at the thousand-card scale, a training throughput of over 180 TFLOPS, and an average of over 3600 tokens per GPU per second. The following table shows InternEvo's scalability test data at different configurations:

| GPU Number | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
| ---------------- | ---- | ---- | ---- | ---- | ----- | ----- | ----- | ------ |
| TGS (Tokens/GPU/Second) | 4078 | 3939 | 3919 | 3944 | 3928 | 3920 | 3835 | 3625 |
| TFLOPS | 193 | 191 | 188 | 188 | 187 | 185 | 186 | 184 |


We tested the performance of training the 7B model in InternLM using various parallel configurations on a GPU cluster. In each test group, the number of tokens processed per GPU in a single iteration remained consistent. The hardware and parameter configurations used in the tests are shown in the table below:
We tested the performance of training the 7B model in InternEvo using various parallel configurations on a GPU cluster. In each test group, the number of tokens processed per GPU in a single iteration remained consistent. The hardware and parameter configurations used in the tests are shown in the table below:

| Hardware | Model |
| ----------------------- | ----------------------------- |
Expand All @@ -24,13 +24,13 @@ We tested the performance of training the 7B model in InternLM using various par
| micro_bsz | 2 | 4 |
| seq_len | 2048 | 2048 |

The configuration of `zero1` in InternLM determines the allocation range of optimizer states.
The configuration of `zero1` in InternEvo determines the allocation range of optimizer states.
- `zero1=-1` indicates that optimizer states are distributed across all data-parallel nodes (equivalent to Deepspeed Zero-1).
- In the case of `zero1=8, tp=1`, optimizer states are distributed within 8 GPUs in a single node, and the optimizer states remain consistent across different nodes.
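
As a hypothetical illustration, the settings discussed above typically live in the `parallel` section of a training config; the exact field names and schema may differ between InternEvo versions, so consult the example configs shipped with the repository:

```python
# Hypothetical sketch of the parallelism settings discussed above; the exact
# schema may differ between InternEvo versions -- check the bundled example configs.
parallel = dict(
    zero1=8,   # shard optimizer states across 8 GPUs within a node; -1 shards them across all data-parallel ranks
    tensor=1,  # tensor-parallel size (tp)
    pipeline=dict(size=1, interleaved_overlap=True),
    sequence_parallel=False,
)
```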

### Throughput Measurement

Throughput is defined as TGS, the average number of tokens processed per GPU per second. In this test, the training configuration had `pack_sample_into_one=False` and `checkpoint=False`. The test results are shown in the following table. When using `zero1=8, tp=1`, InternLM achieves an acceleration efficiency of `88%` for training the 7B model with a thousand cards.
Throughput is defined as TGS, the average number of tokens processed per GPU per second. In this test, the training configuration had `pack_sample_into_one=False` and `checkpoint=False`. The test results are shown in the following table. When using `zero1=8, tp=1`, InternEvo achieves an acceleration efficiency of `88%` for training the 7B model with a thousand cards.

| Parallel Configuration | 8 GPUs | 16 GPUs | 32 GPUs | 64 GPUs | 128 GPUs | 256 GPUs | 512 GPUs | 1024 GPUs |
| ---------------------- | ------ | ------- | ------- | ------- | -------- | -------- | -------- | --------- |
Expand All @@ -49,7 +49,7 @@ Throughput is defined as TGS, the average number of tokens processed per GPU per
The computational workload of model training is based on the FLOPS calculation method described in the [Megatron](https://deepakn94.github.io/assets/papers/megatron-sc21.pdf) paper. To ensure constant FLOPS during training, the test configuration had `pack_sample_into_one=True`, `dtype=torch.bfloat16`.
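
For reference, the Megatron-style estimate that this measurement follows can be sketched as below; this is an illustrative back-of-the-envelope helper, not InternEvo's internal accounting code:

```python
def megatron_flops_per_iteration(batch_size, seq_len, num_layers, hidden_size,
                                 vocab_size, activation_checkpointing=True):
    """Illustrative FLOPs-per-iteration estimate following the formula in the
    Megatron paper cited above: 96*B*s*l*h^2 * (1 + s/(6h) + V/(16*l*h)) with
    activation recomputation, and 72*... without it. A back-of-the-envelope
    sketch only, not InternEvo's internal accounting."""
    coeff = 96 if activation_checkpointing else 72
    return (coeff * batch_size * seq_len * num_layers * hidden_size ** 2
            * (1 + seq_len / (6 * hidden_size)
               + vocab_size / (16 * num_layers * hidden_size)))
```

Dividing this figure by the measured iteration time, the number of GPUs, and 10^12 gives the per-GPU TFLOPS values reported below.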


When `Activation Ckpt` is enabled, the test results are shown in the table below. InternLM can achieve `>180 TFLOPS` for 7B model training with 1024 GPUs.
When `Activation Ckpt` is enabled, the test results are shown in the table below. InternEvo can achieve `>180 TFLOPS` for 7B model training with 1024 GPUs.

- TGS: Tokens per GPU per Second

2 changes: 1 addition & 1 deletion doc/en/usage.md
@@ -8,7 +8,7 @@ Please refer to the [installation guide](./install.md) for instructions on how t

### Dataset Preparation (Pre-training)

The dataset for the InternLM training task includes a series of `bin` and `meta` files. A `tokenizer` is used to generate the training dataset from the original text files. The tokenizer model is imported by specifying the model parameter path in `tools/tokenizer.py`. Currently, `V7_sft.model` is provided to generate tokens. If you want to use a different model, you can directly modify the model parameter path in `tokenizer.py`.
The dataset for the InternEvo training task includes a series of `bin` and `meta` files. A `tokenizer` is used to generate the training dataset from the original text files. The tokenizer model is imported by specifying the model parameter path in `tools/tokenizer.py`. Currently, `V7_sft.model` is provided to generate tokens. If you want to use a different model, you can directly modify the model parameter path in `tokenizer.py`.

You can run the following command to generate `bin` and `meta` files corresponding to the original data. The parameter `text_input_path` represents the path of the original text data, currently supporting `txt`, `json`, and `jsonl` formats, while `bin_output_path` represents the save path of the generated `bin` files.
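
For example, an invocation might look like the following; the paths are placeholders and the flag names follow the parameters described above, so check `python tools/tokenizer.py --help` for the exact interface:

```bash
# Illustrative example; input and output paths are placeholders.
python tools/tokenizer.py --text_input_path data/raw_corpus.jsonl --bin_output_path data/raw_corpus.bin
```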

