Getting a RuntimeError after training with mezo #11

Open
sowmaster opened this issue Jul 1, 2023 · 6 comments

@sowmaster

Hello,
Thank you for sharing your work! I'm getting the error below after training with the mezo.sh script:

RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

The problem persists when I use multiple GPUs. Thanks!

@gaotianyu1350
Member

Hi,

It looks like an error related to multiprocessing. Can you report the pytorch/transformers versions here?

@sowmaster
Author

sowmaster commented Jul 1, 2023

Hello,
Thanks for the reply. I first used pytorch 1.13 + transformers 4.29.2, then updated to pytorch 2.0.1 + transformers 4.29.2, and the issue persists.
If I make the number of eval steps larger than the number of fine-tuning steps, so that evaluation is only done at the end of training, the error disappears, but the model evaluated at the end has poor performance (~50% on the SST2 task, which is random performance).

Also, I forgot to say that the error was encountered in the large-models case. Thanks!

@gaotianyu1350
Member

Hi,

Do you mind posting the full error log? Also, have you tried using just a single GPU?

@sowmaster
Author

Hi,
Here is the full error log:

Traceback (most recent call last):
  File "../mezo/large_models/run.py", line 533, in <module>
    main()
  File "../mezo/large_models/run.py", line 498, in main
    framework.train(train_samples, dev_samples if dev_samples is not None else eval_samples)
  File "../mezo/large_models/run.py", line 433, in train
    trainer.train(resume_from_checkpoint=last_checkpoint) 
  File "../anaconda3/envs/InstructZero/lib/python3.10/site-packages/transformers/trainer.py", line 1664, in train
    return inner_training_loop(
  File "../mezo/large_models/trainer.py", line 660, in _inner_training_loop
    dist.barrier()
  File "../anaconda3/envs/InstructZero/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3327, in barrier
    default_pg = _get_default_group()
  File "../anaconda3/envs/InstructZero/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 707, in _get_default_group
    raise RuntimeError(
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group. 

Yes, I tried both single- and multi-GPU runs.

@gaotianyu1350
Member

It's weird because with a single GPU it shouldn't touch the distributed training part of torch.

This line

  File "../mezo/large_models/trainer.py", line 660, in _inner_training_loop

will only be triggered when args.local_rank != -1, which means you are using multiple GPUs.

Can you make sure you are not using multiple GPUs and not using torchrun, accelerate, or srun?
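One quick way to check (just a throwaway snippet, nothing from the repo): print the environment variables a distributed launcher would set, right before run.py starts. If any of them show up even though you launched with plain python, the trainer will think it is running in distributed mode.

import os

# Hypothetical sanity check: these env vars are typically set by
# torchrun / accelerate / srun, not by a plain `python run.py ...` launch.
for var in ("LOCAL_RANK", "RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT"):
    print(f"{var} = {os.environ.get(var, '<unset>')}")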

@hibagus

hibagus commented Feb 6, 2024

This error happens when using Transformers > 4.28. Starting from 4.29, accelerate is required by Transformers, and it causes local_rank to no longer be -1, which the trainer.py included in MeZO relies on to differentiate between single- and multi-GPU runs. I suggest updating trainer.py to accommodate the new behavior of Transformers 4.29 or newer.

Transformers 4.28 does not give any errors.
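To illustrate the kind of change I mean (a rough, untested sketch, not a patch from the repo), trainer.py could guard the barrier on whether a process group actually exists, rather than only on args.local_rank:

import torch.distributed as dist

def maybe_barrier():
    # Only synchronize when torch.distributed is usable and a process group
    # has actually been initialized; otherwise this is a no-op, so a
    # single-GPU run under Transformers >= 4.29 no longer crashes here.
    if dist.is_available() and dist.is_initialized():
        dist.barrier()

With a guard like that, the barrier is skipped for single-GPU runs even if accelerate changes local_rank.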
