python/examples/alpha_zero.py crashes with CUDA_ERROR_NOT_INITIALIZED
#1122
Comments
Hi, this is a pretty complex setup. I'm not sure how we can help, as we don't have a setup like this to reproduce it. Have you tried the simple program on that thread you linked, i.e. tensorflow/tensorflow#57877 (comment)? Did you see that CUDA support on Windows was being removed in TF? See tensorflow/tensorflow#59905. According to that thread, it should still work in WSL. Seems like you need WSL2, but it seems like you are indeed using that. So, yeah, seems like it should work. @tewalds: any ideas?
I don't think this is a complex setup at all! I installed a recent version of I made sure to use the correct dependency versions, even going so far as to track down the missing old version of tensorrt-lib, which should be in PyPI but isn't! I then ran I did run the failing code example in the linked issue, and it did fail in the same way. It seems to me that alpha_zero.py forks processes in a way that CUDA does not support!
Well, first: OpenSpiel is not officially supported on Windows. We don't have Windows machines easily at our disposal, so we don't test things on Windows hosts and have only run things ourselves within WSL a few times. I have no clue how CUDA drivers are supported through WSL. Second, you're using the most recent nightly versions of TF, which we are not testing on our CI regularly (we're only testing 2.12.0, see here). Due to this, the new TF requires a specific/custom older version of tensorrt and tensorrt-lib. Maybe these don't come with CUDA support, or are not getting built properly? 🤷 Third, TF most recently stopped supporting CUDA on Windows. That should not affect you, since you are running within WSL, but I wonder if, in the process of disabling CUDA on native Windows, something else in the code chain is causing the CUDA issues within your setup. (I realize this is unlikely.) Then there's a thread that might be related because it's a forking actor ... ? That feels like a pretty complex setup to me. We'll do our best to help, but without being able to mimic your setup, it will be difficult.
I believe @tewalds might be the only one who has run our Python AlphaZero using CUDA; IIRC it was almost certainly on a native Linux machine, and I believe it was about 3 years ago. 😅 I don't know of any instances of people running the Python TF AlphaZero using CUDA within WSL. I barely know one person who has used it with CUDA, and it was long ago. The more common use is the C++ LibTorch version on native Linux machines, because it's faster. I'd like to know if it currently runs on a Linux machine with CUDA. @tewalds, is it easy for you to try on your desktop? Can you tell me if you run into the same issue?
I have been updating the python AlphaZero to Keras 3, and I'm running into the same thing. I don't think that it's a Windows problem. There's some challenge with Keras 3 and forking. There are a few forum posts about it, but nothing definitive, e.g. https://stackoverflow.com/questions/33748750/cuda-error-initialization-error-when-using-parallel-in-python. I did try changing the start method to "spawn" in spawn.py, but that didn't fix it. I might look into whether it's possible to lazy load the core keras libraries. It didn't seem super easy, but I don't have a lot of ideas. It's not super obvious what's going on, because, for example, this sample code runs correctly:
I just reproduced this on a bare-metal Ubuntu 20.04 machine with TF 2.16.1 and Keras 3.3.3.
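The two ideas raised in the comments above (an explicit "spawn" start method, and lazily importing the framework inside each worker) can be sketched roughly as follows. This is only an illustration, not OpenSpiel's actual spawn.py and not a confirmed fix: one comment reports that switching to "spawn" alone did not help. The names `HEAVY_MODULE`, `heavy_module_loaded`, `worker`, and `run_workers` are hypothetical. The helper that checks `sys.modules` is just a diagnostic for whether the framework was already imported in the parent, which matters most when workers are forked.

```python
import importlib
import multiprocessing as mp
import sys

# Hypothetical: the framework whose import we want to defer to the workers.
HEAVY_MODULE = "tensorflow"


def heavy_module_loaded(name=HEAVY_MODULE):
    """Return True if `name` or any of its submodules is already imported here."""
    return any(m == name or m.startswith(name + ".") for m in sys.modules)


def worker(task_id, results):
    """Import the framework inside the child, then report what happened."""
    try:
        importlib.import_module(HEAVY_MODULE)
        status = "imported"
    except ImportError:
        status = "missing"
    results.put((task_id, status))


def run_workers(num_workers=2):
    """Start workers from an explicit 'spawn' context and collect their reports."""
    ctx = mp.get_context("spawn")  # fresh interpreter per child, no inherited state
    if heavy_module_loaded():
        print(f"note: {HEAVY_MODULE} was already imported in the parent process")
    results = ctx.Queue()
    procs = [ctx.Process(target=worker, args=(i, results)) for i in range(num_workers)]
    for p in procs:
        p.start()
    reports = sorted(results.get() for _ in procs)
    for p in procs:
        p.join()
    return reports
```

With "spawn", `run_workers()` should be called from under an `if __name__ == "__main__":` guard, because each child re-imports the main module on startup.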
I'm running Ubuntu 22.04 WSL2, and I've tried running this with both `tensorflow==2.14.0` and `tf-nightly==2.15.0.dev20231010`. I am using Python 3.11.5, which is supported by the latest version of TensorFlow.

You can correctly install TensorFlow with GPU support via `pip install --extra-index-url https://pypi.nvidia.com tensorflow[and-cuda]`, or install the nightly version with `pip install --extra-index-url https://pypi.nvidia.com tf-nightly[and-cuda]`. Note that, without the `--extra-index-url` flag, the installation will fail, as TensorFlow 2.14.0 depends on specific versions of `tensorrt` and `tensorrt-lib` which are not in the public PyPI repository.

I verified that my graphics card is visible to the WSL2 container:

And I verified that TensorFlow itself runs code correctly with my GPU, by running this code, seeing results, and noting the spike in my GPU's utilization when I run this script:

But even though TensorFlow is working with my graphics card, `alpha_zero.py` fails:

AlphaZero is forking actor, evaluator, and learner processes, and it's these subprocesses which fail, so I believe this is related to tensorflow/tensorflow#57877.
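As a rough analogy for why a forked subprocess can fail even though the parent works, the toy class below models a runtime that is only valid in the process that initialized it. It is a sketch under stated assumptions, not how CUDA or TensorFlow are implemented, and `ForkGuardedRuntime` is a hypothetical name. A forked child inherits the already-initialized object from the parent, and the process-ID check then fails in the child.

```python
import os


class ForkGuardedRuntime:
    """Toy stand-in for a runtime that must be used in the process that created it."""

    def __init__(self):
        # Remember which process performed the initialization.
        self._owner_pid = os.getpid()

    def check(self):
        """Return True in the owning process; raise if called from another one."""
        current = os.getpid()
        if current != self._owner_pid:
            raise RuntimeError(
                f"runtime initialized in pid {self._owner_pid}, used in pid {current}"
            )
        return True
```

If an instance is created in the parent and `check()` is then called in a child started with `os.fork()`, the child raises `RuntimeError`, whereas a child that constructs its own instance passes the check.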