Getting a RuntimeError after training with mezo #11

Open
sowmaster opened this issue Jul 1, 2023 · 6 comments

@sowmaster

Hello,
Thank you for sharing your work! I'm getting the error below after training with the mezo.sh script:

RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

The problem persists when I use multiple GPUs. Thanks!

@gaotianyu1350
Member

Hi,

It looks like an error related to multiprocessing. Can you report the pytorch/transformers versions here?

@sowmaster
Author

sowmaster commented Jul 1, 2023

Hello,
Thanks for the reply. I first used pytorch 1.13 + transformers 4.29.2, then updated to pytorch 2.0.1 + transformers 4.29.2, and the issue persists.
If I make the number of eval steps larger than the number of fine-tuning steps, so that evaluation is only done at the end of training, the error disappears, but the model evaluated at the end has poor performance (~50% on the SST2 task, which is random performance).

Also, I forgot to say that the error was encountered in the large-models case. Thanks!

@gaotianyu1350
Member

Hi,

Do you mind posting the full error log? Also, have you tried using just a single GPU?

@sowmaster
Author

Hi,
Here is the full error log:

Traceback (most recent call last):
  File "../mezo/large_models/run.py", line 533, in <module>
    main()
  File "../mezo/large_models/run.py", line 498, in main
    framework.train(train_samples, dev_samples if dev_samples is not None else eval_samples)
  File "../mezo/large_models/run.py", line 433, in train
    trainer.train(resume_from_checkpoint=last_checkpoint) 
  File "../anaconda3/envs/InstructZero/lib/python3.10/site-packages/transformers/trainer.py", line 1664, in train
    return inner_training_loop(
  File "../mezo/large_models/trainer.py", line 660, in _inner_training_loop
    dist.barrier()
  File "../anaconda3/envs/InstructZero/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3327, in barrier
    default_pg = _get_default_group()
  File "../anaconda3/envs/InstructZero/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 707, in _get_default_group
    raise RuntimeError(
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group. 

Yes, I tried both single- and multi-GPU runs.

@gaotianyu1350
Member

It's weird because with a single GPU it shouldn't touch the distributed training part of torch.

This line

  File "../mezo/large_models/trainer.py", line 660, in _inner_training_loop

will only be triggered when args.local_rank != -1, which means you are using multiple GPUs.

Can you make sure you are not using multiple GPUs and not using torchrun, accelerate, or srun?
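One quick way to check (just a throwaway snippet, nothing from the repo): print the environment variables a distributed launcher would set, right before run.py starts. If any of them show up even though you launched with plain python, the trainer will think it is running in distributed mode.

import os

# Hypothetical sanity check: these env vars are typically set by
# torchrun / accelerate / srun, not by a plain `python run.py ...` launch.
for var in ("LOCAL_RANK", "RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT"):
    print(f"{var} = {os.environ.get(var, '<unset>')}")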

@hibagus

hibagus commented Feb 6, 2024

This error happens when using Transformers > 4.28. Starting from 4.29, accelerate is required by Transformers, and it causes local_rank to no longer be -1, which the trainer.py included in MeZO relies on to differentiate between single- and multi-GPU runs. I suggest updating trainer.py to accommodate the new behavior of Transformers 4.29 or newer.

Transformers 4.28 does not give any errors.
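To illustrate the kind of change I mean (a rough, untested sketch, not a patch from the repo), trainer.py could guard the barrier on whether a process group actually exists, rather than only on args.local_rank:

import torch.distributed as dist

def maybe_barrier():
    # Only synchronize when torch.distributed is usable and a process group
    # has actually been initialized; otherwise this is a no-op, so a
    # single-GPU run under Transformers >= 4.29 no longer crashes here.
    if dist.is_available() and dist.is_initialized():
        dist.barrier()

With a guard like that, the barrier is skipped for single-GPU runs even if accelerate changes local_rank.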
