
Increasing tp/pp makes the initial loss increase #417

Closed · GuokunWang opened this issue Jan 7, 2025 · 7 comments

@GuokunWang

Using the model conversion and training scripts from example/qwen2_5/README.md, I converted the official Qwen2.5-7B-Instruct model to an mcore model. Then, using the sample data with the sequence length increased from 128 to 8k, I measured the initial loss under different tp/pp settings. The initial loss grows as the tp/pp degrees increase. Below are the loss curves for tp=2/pp=2, tp=4/pp=8, and tp=4/pp=16:

[screenshot: loss at tp=2/pp=2]
[screenshot: loss at tp=4/pp=8]
[screenshot: loss at tp=4/pp=16]

The performance of models trained from these configurations also degrades. How can this be resolved? Many thanks.

@jerryli1981 (Collaborator)

Hello, was the tp=4/pp=8 experiment run on 4 machines with 32 GPUs in total?

@jerryli1981 (Collaborator)

If you loaded the tp4/pp8-sharded checkpoint on a single machine with 8 GPUs, that would certainly be wrong.
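This point can be made concrete with a quick sanity check: in Megatron-style launchers the world size must be divisible by tp × pp, and the quotient is the data-parallel size. A minimal sketch (the helper name is illustrative, not from the repo):

```python
# Sketch: verify that the cluster size matches the parallel layout.
# Megatron-style training requires world_size % (tp * pp) == 0;
# the quotient is the data-parallel size.

def data_parallel_size(world_size: int, tp: int, pp: int) -> int:
    model_parallel = tp * pp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by tp*pp={model_parallel}"
        )
    return world_size // model_parallel

print(data_parallel_size(32, 4, 8))  # 4 nodes x 8 GPUs, tp4/pp8 -> dp=1

# A single 8-GPU machine cannot host tp4/pp8 (8 GPUs < 4*8 model-parallel
# ranks), so loading that sharded checkpoint there is invalid.
```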

@GuokunWang (Author)

> Hello, was the tp=4/pp=8 experiment run on 4 machines with 32 GPUs in total?

It was run on 4 machines with 8 GPUs each. The tp=4/pp=16 experiment was run on 8 machines with 8 GPUs each.

@jerryli1981 (Collaborator)

Hello, we tested Qwen2.5-72B with tp4/pp4 on 4 machines (32 GPUs) and the loss was normal. This is our conversion command:

```
bash hf2mcore_qwen2.5_convertor.sh 72B /mnt/data/jerry.lp/qwen-ckpts/Qwen2.5-72B /mnt/data/jerry.lp/qwen-ckpts/Qwen2.5-72B-tp4-pp4 4 4 bf16 true false
```

@jerryli1981 (Collaborator)

The training script for the sharded checkpoint is as follows; the loss starts around 2.x:

```
sh run_mcore_qwen-no-overlap.sh \
  dlc \
  72B \
  1 \
  1024 \
  1e-5 \
  1e-6 \
  4096 \
  4096 \
  bf16 \
  4 \
  4 \
  1 \
  true \
  true \
  true \
  false \
  false \
  false \
  100000 \
  /mnt/data/jerry.lp/datasets/pretrain/mmap_qwen2_datasets_content_document \
  /mnt/data/jerry.lp/datasets/pretrain/mmap_qwen2_datasets_content_document \
  /mnt/data/jerry.lp/qwen-ckpts/Qwen2.5-72B-tp4-pp4 \
  2088288256 \
  208828825 \
  /mnt/data/jerry.lp/xxx
```
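To compare initial loss across tp/pp runs, the first reported loss can be pulled straight from each training log. A hedged sketch assuming the default Megatron-LM console format (the `lm loss:` field name); the log content below is a synthetic stand-in, not a real run:

```shell
# Sketch: extract the first reported lm-loss from a Megatron-style log.
# "lm loss:" assumes the default Megatron-LM console format; train.log
# here is generated as a stand-in for an actual training log.
printf 'iteration 1/100 | lm loss: 2.31E+00 |\niteration 2/100 | lm loss: 2.28E+00 |\n' > train.log
grep -m 1 -o 'lm loss: [0-9.E+-]*' train.log   # -> lm loss: 2.31E+00
```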

@GuokunWang (Author)

> Hello, we tested Qwen2.5-72B with tp4/pp4 on 4 machines (32 GPUs) and the loss was normal. This is our conversion command: `bash hf2mcore_qwen2.5_convertor.sh 72B /mnt/data/jerry.lp/qwen-ckpts/Qwen2.5-72B /mnt/data/jerry.lp/qwen-ckpts/Qwen2.5-72B-tp4-pp4 4 4 bf16 true false`

I also tested qwen2.5-72B myself with tp=8/pp=2, and the loss was normal.
However, the model whose loss becomes abnormal at larger tp/pp degrees is the 7B model. If possible, could you verify whether qwen2.5-7B shows abnormal loss under larger tp and pp settings? Many thanks.

@lostkevin (Contributor)

The 7B model has 28 layers in total, so it does not support PP8 / PP16; however, megatron only validates the PP setting up front when VPP is enabled.

You can try PP4, PP7, or PP14 and check whether the loss is normal.
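The layer-count argument can be checked mechanically: the default Megatron pipeline split requires `num_layers % pp == 0`, and Qwen2.5-7B has 28 layers. A minimal sketch (the helper name is illustrative, not from the repo):

```python
# Sketch: check whether a layer count divides evenly across pipeline stages.
# The default Megatron-LM pipeline split requires num_layers % pp == 0,
# but this is only validated up front when VPP is enabled.

def pp_is_valid(num_layers: int, pp: int) -> bool:
    """Return True if each pipeline stage gets the same number of layers."""
    return num_layers % pp == 0

QWEN2_5_7B_LAYERS = 28  # from the Qwen2.5-7B model config
for pp in (2, 4, 7, 8, 14, 16):
    status = "ok" if pp_is_valid(QWEN2_5_7B_LAYERS, pp) else "INVALID"
    print(f"pp={pp:2d}: {status}")
```

This flags pp=8 and pp=16 as invalid for 28 layers, matching the failing configurations reported above, while pp=4, 7, and 14 divide evenly.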
