## Data Format

### Loading data in the Dataloader

The data preparation process is covered in the [usage tutorial](./usage.md).

This section walks through how data is packed in the Dataloader.

When `use_packed_dataset` is set to `True` in `config.py`, the packed data is built by the `build_pack` function. The following example illustrates the logic.

Suppose `micro_bsz` is 2 and `SEQ_LEN` is 8, and the input data is:
```python
[2323, 442, 252, 341]
[233, 3442, 322, 31, 2514, 49731, 51]
[4326, 427, 465, 22, 314, 9725, 346, 1343]
[24, 2562, 5, 25, 356]
```
`packed_length` equals `micro_bsz * SEQ_LEN`, i.e. 16 here. The inputs are packed into samples of length 16: sentences are concatenated in order, and when a sentence crosses the `packed_length` boundary it is split, with the overflow becoming the start of the next packed sample. Once all text has been packed, the final sample is right-padded with `0` up to `packed_length`. The packed data looks like this:
```python
[2323, 442, 252, 341, 233, 3442, 322, 31, 2514, 49731, 51, 4326, 427, 465, 22, 314]
[9725, 346, 1343, 24, 2562, 5, 25, 356, 0, 0, 0, 0, 0, 0, 0, 0]
```
The `label` of each input sentence consists of its tokens from the second one through the last, with `-100` filled in at the final position. After the data above is packed, the corresponding labels are:
```python
[442, 252, 341, -100, 3442, 322, 31, 2514, 49731, 51, -100, 427, 465, 22, 314, 9725]
[346, 1343, -100, 2562, 5, 25, 356, -100, -100, -100, -100, -100, -100, -100, -100, -100]
```
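To make the rule above concrete, here is a minimal, self-contained sketch of this concatenate-and-split behavior (an illustration only, not the actual `build_pack` implementation; the function name `pack_samples` is invented):
```python
def pack_samples(samples, packed_length):
    """Concatenate token lists into fixed-length samples. A sentence that
    crosses a sample boundary is split, its tail starting the next sample;
    the final sample is right-padded with 0 (illustrative sketch)."""
    packed, buffer = [], []
    for sample in samples:
        buffer.extend(sample)
        while len(buffer) >= packed_length:
            packed.append(buffer[:packed_length])
            buffer = buffer[packed_length:]
    if buffer:  # pad the leftover tail with zeros
        packed.append(buffer + [0] * (packed_length - len(buffer)))
    return packed

samples = [
    [2323, 442, 252, 341],
    [233, 3442, 322, 31, 2514, 49731, 51],
    [4326, 427, 465, 22, 314, 9725, 346, 1343],
    [24, 2562, 5, 25, 356],
]
for row in pack_samples(samples, packed_length=16):
    print(row)  # reproduces the two packed rows shown above
```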
When `use_packed_dataset` is set to `False` in `config.py`, the data is built by the `build_unpack` function. The following example illustrates the logic.

Suppose `micro_bsz` is 2 and `SEQ_LEN` is 8, and the input data is:
```python
[2323, 442, 252, 341]
[233, 3442, 322, 31, 2514, 49731, 51]
[4326, 427, 465, 22, 314, 9725, 346, 1343]
[24, 2562, 5, 25, 356, 3145, 246, 25, 1451, 67, 73, 541, 265]
[4524, 2465, 562, 67, 26, 265, 21, 256, 145, 1345]
[34, 14]
```
Packing here follows three rules:

1. The number of sentences packed together must not exceed `micro_bsz`. Even if the total packed length is still below `micro_bsz * SEQ_LEN`, content from the next sentence must not be packed in; the shortfall is padded with `0`.

2. A single sentence must not exceed `SEQ_LEN`; the excess is truncated and discarded before the sentence is packed with the next one.

3. If the packed length exceeds `micro_bsz * SEQ_LEN`, the excess is truncated and discarded.

Following these rules, the packed data looks like this:
```python
[2323, 442, 252, 341, 233, 3442, 322, 31, 2514, 49731, 51, 0, 0, 0, 0, 0]
[4326, 427, 465, 22, 314, 9725, 346, 1343, 24, 2562, 5, 25, 356, 3145, 246, 25]
[4524, 2465, 562, 67, 26, 265, 21, 256, 34, 14, 0, 0, 0, 0, 0, 0]
```

The corresponding labels are:
```python
[442, 252, 341, -100, 3442, 322, 31, 2514, 49731, 51, -100, -100, -100, -100, -100, -100]
[427, 465, 22, 314, 9725, 346, 1343, -100, 2562, 5, 25, 356, 3145, 246, 25, -100]
[2465, 562, 67, 26, 265, 21, 256, -100, 14, -100, -100, -100, -100, -100, -100, -100]
```
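A minimal sketch of these three rules (again an illustration under the stated assumptions, not the actual `build_unpack` code; the helper name `pack_with_limits` is invented):
```python
def pack_with_limits(samples, micro_bsz, seq_len):
    """Pack at most micro_bsz sentences per sample, truncating each
    sentence to seq_len and the whole sample to micro_bsz * seq_len;
    remaining positions are padded with 0 (illustrative sketch)."""
    packed_length = micro_bsz * seq_len
    packed, chunk, count = [], [], 0
    for sample in samples:
        chunk.extend(sample[:seq_len])      # rule 2: drop tokens beyond SEQ_LEN
        count += 1
        if count == micro_bsz:              # rule 1: at most micro_bsz sentences
            chunk = chunk[:packed_length]   # rule 3: drop any overflow
            packed.append(chunk + [0] * (packed_length - len(chunk)))
            chunk, count = [], 0
    if chunk:  # pad the last, partially filled sample
        packed.append(chunk + [0] * (packed_length - len(chunk)))
    return packed
```
With the six sentences above and `micro_bsz=2`, `SEQ_LEN=8`, this reproduces the three packed rows shown.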
Note: when `use_packed_dataset` is not set, it defaults to `True`. Training is normally run with `use_packed_dataset` set to `True`, which improves both training efficiency and accuracy.
### Fetching Dataloader data

#### Taking data out of the Dataloader
```python
batch_data, actual_batch_size = engine.load_batch(data_iter)
```
Here `batch_data` is a `list` with two elements: the first is the `dict`-typed data `data`, and the second is the `torch.Tensor`-typed label `label`.

The first element `data` contains the three fields `input_ids`, `cu_seqlens`, and `indexes`, with the following types and shapes:
```
batch_data[0]['input_ids'] -> torch.Size([micro_num, micro_bsz * SEQ_LEN]); the token ids of the tokenized input sentences
batch_data[0]['cu_seqlens'] -> a list of size micro_num, each element a torch.Tensor; the boundary indexes of the sentences packed into each sample of length micro_bsz * SEQ_LEN
batch_data[0]['indexes'] -> torch.Size([micro_num, micro_bsz * SEQ_LEN]); the position index of every token in input_ids, counting up from 0
```

The second element `label` has shape:
```
batch_data[1] -> torch.Size([micro_num, micro_bsz * SEQ_LEN])
```
`micro_num`, set in the `config.py` configuration file, is the gradient accumulation size: the gradient update is performed after `micro_num` `forward` + `backward` passes.
`micro_bsz * SEQ_LEN` is the length of the packed data: multiple inputs are concatenated into a single input of length `micro_bsz * SEQ_LEN` to improve training efficiency.
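For context, here is a toy illustration of gradient accumulation (a sketch only, not InternEvo's engine API; the model, data, and loss are stand-ins):
```python
import torch

micro_num = 2  # gradient accumulation size, as set in config.py
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

optimizer.zero_grad()
for _ in range(micro_num):
    x, y = torch.randn(2, 4), torch.randn(2, 1)  # stand-in micro batch
    loss = torch.nn.functional.mse_loss(model(x), y) / micro_num
    loss.backward()   # gradients accumulate across micro batches
optimizer.step()      # one parameter update per micro_num passes
```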
For example, suppose `micro_num` is 2, `micro_bsz` is 2, and `SEQ_LEN` is 8:
```python
batch_data[0]['input_ids']:
tensor([[ 2323, 442, 252, 341, 233, 3442, 322, 31, 2514, 49731, 51, 4326, 427, 465, 22, 314],
        [ 9725, 346, 1343, 24, 2562, 5, 25, 356, 0, 0, 0, 0, 0, 0, 0, 0]])
```

The first row is packed from sentences of lengths 4, 7, and 5, and the second row from sentences of lengths 3 and 5, so:
```python
batch_data[0]['cu_seqlens']:
tensor([[ 0,  4, 11, 16],
        [ 0,  3,  8, 16]])
```
The difference between every two adjacent numbers is the length of the corresponding sentence.
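Sentence lengths can therefore be recovered with a one-line difference (a quick check, not framework code):
```python
import torch

cu_seqlens = torch.tensor([0, 4, 11, 16])
print(cu_seqlens[1:] - cu_seqlens[:-1])  # tensor([4, 7, 5])
```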
```python
batch_data[0]['indexes']:
tensor([[ 0, 1, 2, 3, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4],
        [ 0, 1, 2, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 5, 6, 7]])
```
Each number is the position of the token within its own sentence. If the last sentence is followed by padding, the indexes keep counting up from 0 until the padding ends.
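An `indexes` row can be derived from the corresponding `cu_seqlens` entry; a minimal sketch (the padding region is delimited by the final `cu_seqlens` value, so the same derivation also produces the 0-counting indexes over padding):
```python
import torch

cu_seqlens = torch.tensor([0, 4, 11, 16])
lengths = cu_seqlens[1:] - cu_seqlens[:-1]
indexes = torch.cat([torch.arange(int(n)) for n in lengths])
print(indexes)  # tensor([0, 1, 2, 3, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4])
```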
```python
batch_data[1]:
tensor([[ 442, 252, 341, -100, 3442, 322, 31, 2514, 49731, 51, -100, 427, 465, 22, 314, 9725],
        [ 346, 1343, -100, 2562, 5, 25, 356, -100, -100, -100, -100, -100, -100, -100, -100, -100]])
```
These are the corresponding label values.
#### Processing the data
```python
_data, _label = self._load_accum_batch(data, label)
```
First, the `_load_micro_batch` function reduces the first dimension of `data` and `label` (of size `micro_num`) to 1, fetching each micro batch in turn by advancing the `offset`.
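Conceptually, the micro-batch slicing looks like the following (a simplified sketch with an invented name; the actual `_load_micro_batch` also manages the offset internally):
```python
import torch

def load_micro_batch(data, label, offset):
    """Slice out the offset-th micro batch, keeping the first dim as 1.
    Slicing works for tensor fields and list fields (e.g. cu_seqlens)."""
    micro_data = {key: value[offset:offset + 1] for key, value in data.items()}
    micro_label = label[offset:offset + 1]
    return micro_data, micro_label
```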
Next, the data is further processed by the registered `data_process_func`.

When `use_packed_dataset` is set to `True` in `config.py`, the flow inside `data_process_func` is as follows.

The `packed_data_normalizer` function drops the size-1 first dimension of `data['indexes']` and `data['cu_seqlens']`, and computes the maximum sentence length from the values in `data['cu_seqlens']`, recording it in `data['max_seqlen']`.
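Its effect can be pictured with the following sketch (illustrative, not the real implementation):
```python
import torch

def normalize_packed_data(data):
    """Drop the size-1 leading dim of indexes/cu_seqlens and record the
    longest sentence length in data['max_seqlen'] (illustrative sketch)."""
    data['indexes'] = data['indexes'].squeeze(0)
    data['cu_seqlens'] = data['cu_seqlens'].squeeze(0)
    lengths = data['cu_seqlens'][1:] - data['cu_seqlens'][:-1]
    data['max_seqlen'] = int(lengths.max())
    return data
```
For `cu_seqlens = tensor([0, 4, 11, 16])` this records `max_seqlen = 7`, matching the example below.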
Continuing the example above, suppose the first batch of data is loaded; after `_load_accum_batch`, `data` and `label` are as follows:
```python
data['input_ids']:
tensor([[ 2323, 442, 252, 341, 233, 3442, 322, 31, 2514, 49731, 51, 4326, 427, 465, 22, 314]])
data['cu_seqlens']:
tensor([ 0, 4, 11, 16])
data['indexes']:
tensor([ 0, 1, 2, 3, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4])
data['max_seqlen']:
7

label:
tensor([[ 442, 252, 341, -100, 3442, 322, 31, 2514, 49731, 51, -100, 427, 465, 22, 314, 9725]])
```
If the `tp` parallel mode is `isp` and the tp size (i.e. the sequence parallel size) is greater than 1, the `split_data_sequence_parallel` function is registered in `data_process_func` to split the data along the `sequence` dimension.
Suppose the tp size is 2. Splitting `data['input_ids']`, `data['indexes']`, and `label` above yields the following.

Data on tp rank0:
```python
data['input_ids']:
tensor([[ 2323, 442, 252, 341, 233, 3442, 322, 31]])
data['indexes']:
tensor([ 0, 1, 2, 3, 0, 1, 2, 3])
label:
tensor([[ 442, 252, 341, -100, 3442, 322, 31, 2514]])
```

Data on tp rank1:
```python
data['input_ids']:
tensor([[ 2514, 49731, 51, 4326, 427, 465, 22, 314]])
data['indexes']:
tensor([ 4, 5, 6, 0, 1, 2, 3, 4])
label:
tensor([[ 49731, 51, -100, 427, 465, 22, 314, 9725]])
```
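A minimal sketch of such a sequence split (the helper name `split_sequence_dim` is invented; the real `split_data_sequence_parallel` obtains the rank and world size from the parallel context):
```python
import torch

def split_sequence_dim(tensor, tp_rank, tp_size, dim=-1):
    """Return this rank's contiguous chunk along the sequence dimension."""
    chunk = tensor.size(dim) // tp_size
    return tensor.narrow(dim, tp_rank * chunk, chunk)

input_ids = torch.tensor([[2323, 442, 252, 341, 233, 3442, 322, 31,
                           2514, 49731, 51, 4326, 427, 465, 22, 314]])
print(split_sequence_dim(input_ids, tp_rank=1, tp_size=2))
# tensor([[ 2514, 49731,    51,  4326,   427,   465,    22,   314]])
```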
When `use_packed_dataset` is set to `False` in `config.py`, the flow inside `data_process_func` is as follows.

The `unpack_data` function unpacks the data, restoring `data["input_ids"]` and `label` to the unpacked format and removing the `cu_seqlens` and `indexes` fields from `data`.

After unpacking, `data["input_ids"]` and `label` have shape `torch.Size([micro_bsz, SEQ_LEN])`.
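A rough sketch of this unpacking step (illustrative only; the real `unpack_data` differs in signature and handles whole batches):
```python
import torch

def unpack_row(packed, cu_seqlens, micro_bsz, seq_len, pad_value=0):
    """Scatter one packed row of length micro_bsz * seq_len back into
    [micro_bsz, seq_len] rows, one sentence per row (illustrative)."""
    out = torch.full((micro_bsz, seq_len), pad_value, dtype=packed.dtype)
    for i in range(min(micro_bsz, len(cu_seqlens) - 1)):
        start, end = cu_seqlens[i].item(), cu_seqlens[i + 1].item()
        n = min(end - start, seq_len)
        out[i, :n] = packed[start:start + n]
    return out
```
Applied with `pad_value=0` to `input_ids` and `pad_value=-100` to `label`, this reproduces the example below.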
Using the data above as an example:

Suppose `micro_bsz` is 2 and `SEQ_LEN` is 8, and the input data is:
```python
[2323, 442, 252, 341]
[233, 3442, 322, 31, 2514, 49731, 51]
```

The packed data is:
```python
[2323, 442, 252, 341, 233, 3442, 322, 31, 2514, 49731, 51, 0, 0, 0, 0, 0]
```

The labels are:
```python
[442, 252, 341, -100, 3442, 322, 31, 2514, 49731, 51, -100, -100, -100, -100, -100, -100]
```

After `unpack_data`, `data["input_ids"]` and `label` are respectively:
```python
data["input_ids"]:
tensor([[2323, 442, 252, 341, 0, 0, 0, 0],
        [233, 3442, 322, 31, 2514, 49731, 51, 0]])

label:
tensor([[442, 252, 341, -100, -100, -100, -100, -100],
        [3442, 322, 31, 2514, 49731, 51, -100, -100]])
```
If the `tp` parallel mode is `isp` and the tp size (i.e. the sequence parallel size) is greater than 1, the `split_data_sequence_parallel` function is likewise registered in `data_process_func` to split the data along the `sequence` dimension.

Suppose the tp size is 2. Splitting `data['input_ids']` and `label` above yields the following.

Data on tp rank0:
```python
data["input_ids"]:
tensor([[2323, 442, 252, 341],
        [233, 3442, 322, 31]])

label:
tensor([[442, 252, 341, -100],
        [3442, 322, 31, 2514]])
```

Data on tp rank1:
```python
data["input_ids"]:
tensor([[0, 0, 0, 0],
        [2514, 49731, 51, 0]])

label:
tensor([[-100, -100, -100, -100],
        [49731, 51, -100, -100]])
```
### Data format during the forward pass

Taking the internlm2 model as an example, this section will describe in detail how data flows through the model as it runs.
[To be continued]