Initial loss increases as the tp/pp sizes increase #417
Comments
Hello, was the tp=4/pp=8 experiment run on 4 machines with 32 GPUs?
If you load the tp4/pp8-split ckpt on a single 8-GPU machine, it will certainly be wrong.
It was run on 4 machines with 8 GPUs each. The tp=4/pp=16 experiment was run on 8 machines with 8 GPUs each.
Hello, we tested qwen2.5-72B with tp4/pp4 on 4 machines (32 GPUs) and the loss is normal. This is our conversion script: bash hf2mcore_qwen2.5_convertor.sh 72B /mnt/data/jerry.lp/qwen-ckpts/Qwen2.5-72B /mnt/data/jerry.lp/qwen-ckpts/Qwen2.5-72B-tp4-pp4 4 4 bf16 true false
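For clarity, here is the same conversion command with the positional arguments spelled out. The interpretation of the two parallelism arguments is inferred from the target directory name (Qwen2.5-72B-tp4-pp4); the meaning of the trailing true/false flags is not stated in this thread.

```bash
# Apparent argument order (inferred, not confirmed here):
#   <model-size> <hf-checkpoint-dir> <output-mcore-dir> <tp> <pp> <precision> <flag> <flag>
bash hf2mcore_qwen2.5_convertor.sh \
  72B \
  /mnt/data/jerry.lp/qwen-ckpts/Qwen2.5-72B \
  /mnt/data/jerry.lp/qwen-ckpts/Qwen2.5-72B-tp4-pp4 \
  4 4 bf16 true false
```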
The training script used after conversion is as follows; the loss starts around 2.x: sh run_mcore_qwen-no-overlap.sh
I have also tested qwen2.5-72B myself with tp=8/pp=2, and the loss is normal.
The 7B model has 28 layers in total, so PP8 / PP16 are not supported. However, Megatron only checks PP validity when VPP is enabled. You can try PP4, PP7, or PP14 and check whether the loss is normal.
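To make the layer constraint concrete: pipeline parallelism splits the transformer layers evenly across PP stages, so the number of layers must be divisible by the PP size. A minimal illustration for the 28-layer 7B model (a standalone sketch, not part of the repo's scripts):

```bash
# Qwen2.5-7B has 28 transformer layers; only PP sizes that divide 28 evenly
# produce a valid split. PP8 and PP16 leave a remainder.
NUM_LAYERS=28
for PP in 2 4 7 8 14 16; do
  if (( NUM_LAYERS % PP == 0 )); then
    echo "PP=${PP}: valid, $((NUM_LAYERS / PP)) layers per stage"
  else
    echo "PP=${PP}: invalid, ${NUM_LAYERS} layers cannot be split evenly"
  fi
done
```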
Using the model conversion script and training script from example/qwen2_5/README.md, I converted the official Qwen2.5-7B-Instruct model to an mcore model, then trained on the example data with the sequence length increased from 128 to 8k, and measured the initial loss under different tp/pp settings. The initial loss grows as the tp/pp sizes increase.
[Loss screenshots for tp=2/pp=2, tp=4/pp=8, and tp=4/pp=16]
The performance of the models trained under these settings also degrades. How can this be resolved? Many thanks.
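Based on the suggestion above, re-splitting the 7B checkpoint with a PP size that divides its 28 layers (e.g. tp=4/pp=4) would look roughly like the following. This is a sketch that mirrors the 72B command earlier in the thread; the paths are placeholders and the trailing flags are copied from that example, so they may need adjusting.

```bash
# Hypothetical re-conversion of Qwen2.5-7B-Instruct with tp=4, pp=4
# (28 layers / 4 stages = 7 layers per stage). Paths are placeholders.
bash hf2mcore_qwen2.5_convertor.sh \
  7B \
  /path/to/Qwen2.5-7B-Instruct \
  /path/to/Qwen2.5-7B-Instruct-tp4-pp4 \
  4 4 bf16 true false
```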