
FSDP sample fails model validation #460

Open
shimomut opened this issue Oct 17, 2024 · 2 comments

Comments

@shimomut (Collaborator)

When running the FSDP sample app, it fails during model evaluation with the following error message. Because --validation_freq=500 is specified, the failure occurs at the 500th batch; lowering this value reproduces the error sooner.

This error was reproduced at least on a HyperPod Slurm cluster with 4 x p5 instances.

3: [rank3]: Traceback (most recent call last):
3: [rank3]:   File "/fsx/ubuntu/shimomut/smddp/awsome-distributed-training/3.test_cases/10.FSDP/./train.py", line 284, in <module>
3: [rank3]:     main(args)
3: [rank3]:   File "/fsx/ubuntu/shimomut/smddp/awsome-distributed-training/3.test_cases/10.FSDP/./train.py", line 269, in main
3: [rank3]:     train(model, 
3: [rank3]:     ^^^^^^^^^^^^
3: [rank3]:   File "/fsx/ubuntu/shimomut/smddp/awsome-distributed-training/3.test_cases/10.FSDP/./train.py", line 115, in train
3: [rank3]:     val_loss, val_ppl = eval_model(
3: [rank3]:                         ^^^^^^^^^^^
3: [rank3]:   File "/fsx/ubuntu/shimomut/smddp/awsome-distributed-training/3.test_cases/10.FSDP/./train.py", line 54, in eval_model
3: [rank3]:     for batch_idx, input_data in enumerate(dataloader):
3: [rank3]:                                  ^^^^^^^^^^^^^^^^^^^^^
3: [rank3]:   File "/fsx/ubuntu/shimomut/smddp/awsome-distributed-training/3.test_cases/10.FSDP/pt_fsdp/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 439, in __iter__
3: [rank3]:     return self._get_iterator()
3: [rank3]:            ^^^^^^^^^^^^^^^^^^^^
3: [rank3]:   File "/fsx/ubuntu/shimomut/smddp/awsome-distributed-training/3.test_cases/10.FSDP/pt_fsdp/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 387, in _get_iterator
3: [rank3]:     return _MultiProcessingDataLoaderIter(self)
3: [rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
3: [rank3]:   File "/fsx/ubuntu/shimomut/smddp/awsome-distributed-training/3.test_cases/10.FSDP/pt_fsdp/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1040, in __init__
3: [rank3]:     w.start()
3: [rank3]:   File "/fsx/ubuntu/shimomut/smddp/awsome-distributed-training/3.test_cases/10.FSDP/pt_fsdp/lib/python3.11/multiprocessing/process.py", line 121, in start
3: [rank3]:     self._popen = self._Popen(self)
3: [rank3]:                   ^^^^^^^^^^^^^^^^^
3: [rank3]:   File "/fsx/ubuntu/shimomut/smddp/awsome-distributed-training/3.test_cases/10.FSDP/pt_fsdp/lib/python3.11/multiprocessing/context.py", line 224, in _Popen
3: [rank3]:     return _default_context.get_context().Process._Popen(process_obj)
3: [rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
3: [rank3]:   File "/fsx/ubuntu/shimomut/smddp/awsome-distributed-training/3.test_cases/10.FSDP/pt_fsdp/lib/python3.11/multiprocessing/context.py", line 281, in _Popen
3: [rank3]:     return Popen(process_obj)
3: [rank3]:            ^^^^^^^^^^^^^^^^^^
3: [rank3]:   File "/fsx/ubuntu/shimomut/smddp/awsome-distributed-training/3.test_cases/10.FSDP/pt_fsdp/lib/python3.11/multiprocessing/popen_fork.py", line 19, in __init__
3: [rank3]:     self._launch(process_obj)
3: [rank3]:   File "/fsx/ubuntu/shimomut/smddp/awsome-distributed-training/3.test_cases/10.FSDP/pt_fsdp/lib/python3.11/multiprocessing/popen_fork.py", line 66, in _launch
3: [rank3]:     self.pid = os.fork()
3: [rank3]:                ^^^^^^^^^
3: [rank3]: OSError: [Errno 12] Cannot allocate memory
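The OSError comes from os.fork() while the DataLoader spawns its worker processes, i.e. the node ran out of allocatable memory at that moment. One plausible contributing pattern (an assumption -- I have not confirmed this against the sample's train.py) is that a fresh DataLoader iterator with num_workers > 0 forks a new set of workers on every validation pass, and each fork must reserve the large parent process's address space. With PyTorch, the usual mitigations are DataLoader(..., num_workers=0) or DataLoader(..., persistent_workers=True). The stdlib sketch below illustrates the same idea (fork once and reuse, instead of re-forking per call); all names in it are illustrative, not from the sample:

```python
# Sketch (assumption: repeated forking of loader workers drives the memory
# exhaustion; _square, validate_with_fresh_workers, and
# make_persistent_validator are illustrative names, not from train.py).
import multiprocessing as mp

def _square(x):
    return x * x

def validate_with_fresh_workers(data):
    # Forks a new set of workers on every call -- the pattern that can
    # exhaust memory when the parent process is large (e.g. holds model
    # shards), since each fork reserves the parent's address space.
    with mp.Pool(processes=2) as pool:
        return sum(pool.map(_square, data))

def make_persistent_validator():
    # Fork once, then reuse the same workers across validation passes --
    # analogous to DataLoader(..., persistent_workers=True).
    pool = mp.Pool(processes=2)
    def validate(data):
        return sum(pool.map(_square, data))
    return validate, pool

if __name__ == "__main__":
    data = range(10)
    validate, pool = make_persistent_validator()
    # Both strategies compute the same result; only the forking cost differs.
    assert validate_with_fresh_workers(data) == validate(data) == 285
    pool.close()
    pool.join()
```

Checking kernel overcommit settings (vm.overcommit_memory) and the dataloading worker count on the p5 nodes may also help narrow this down.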

This issue is stale because it has been open for 30 days with no activity.

@github-actions github-actions bot added the stale label Jan 16, 2025
@shimomut (Collaborator, Author)

Now this issue occurs not during the evaluation phase, but during the initialization phase.

@github-actions github-actions bot removed the stale label Jan 18, 2025