Getting a RuntimeError after training with mezo #11
Comments
Hi, it looks like an error related to multiprocessing. Can you report the PyTorch/Transformers versions here?
Hello, I also forgot to say that the error was encountered in the large-model case. Thanks!
Hi, do you mind posting a full error log? Also, have you tried just using a single GPU?
Hi, yes, I tried both single and multi-GPU.
It's weird, because with a single GPU it shouldn't use the distributed training part of torch. That code path is only triggered when local_rank is not -1. Can you make sure you are not using multiple GPUs and not launching with a distributed launcher?
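A quick way to check what the run actually looks like is to print the signals the trainer typically uses to decide between the single- and multi-GPU code paths. This is a minimal diagnostic sketch, not part of MeZO; it only assumes standard PyTorch and the LOCAL_RANK environment variable that distributed launchers set.

```python
import os
import torch
import torch.distributed as dist

# How many GPUs does this process see?
print("CUDA devices visible:", torch.cuda.device_count())

# Distributed launchers (torchrun, torch.distributed.launch) set LOCAL_RANK;
# a plain single-process run normally leaves it unset.
print("LOCAL_RANK env var:", os.environ.get("LOCAL_RANK", "<unset>"))

# Has a default process group actually been initialized?
print("torch.distributed initialized:",
      dist.is_available() and dist.is_initialized())
```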
This error happens when using Transformers > 4.28. Starting from 4.29, accelerate is required by Transformers, and somehow that makes local_rank no longer -1, which the trainer.py included in MeZO relies on to differentiate between single- and multi-GPU runs. I suggest updating trainer.py to accommodate the new behavior of Transformers 4.29 and newer; Transformers 4.28 does not give any errors.
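A minimal sketch of one possible accommodation, assuming the failure comes from torch.distributed collectives being reached when no process group exists (the function names below are illustrative, not MeZO's actual trainer.py code): instead of relying on local_rank alone, also check whether the default process group is really initialized before touching any distributed op.

```python
import torch
import torch.distributed as dist

def is_distributed_run() -> bool:
    """True only when a real multi-process group is active."""
    return dist.is_available() and dist.is_initialized() and dist.get_world_size() > 1

def maybe_all_reduce(tensor: torch.Tensor) -> torch.Tensor:
    # Guard the collective so a single-GPU run with Transformers >= 4.29
    # never hits "Default process group has not been initialized".
    if is_distributed_run():
        dist.all_reduce(tensor)
    return tensor
```

Alternatively, pinning transformers to 4.28.1, as suggested above, avoids the behavior change entirely.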
Hello,
Thank you for sharing your work! I'm getting the error below after training with the mezo.sh script:
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
The problem persists when I use multiple GPUs. Thanks!