Release GIL when doing standalone solves #359
Conversation
Hi @jberg5, thanks a lot for this suggestion. It is interesting, and we should think about how we best integrate this into ProxSuite. To be clear: there are two interfaces to solve a QP problem with the dense backend: (a) create a qp object, pass the problem data (matrices, vectors) to the `qp.init` method (this does the memory allocation and the preconditioning), and then call `qp.solve`; or (b) use the `solve` function directly, which takes the problem data as input and does everything in one go. (Both interfaces are sketched below.) Currently, only the `qp.solve` path (a) is parallelized (using OpenMP), so the memory allocation + preconditioning is done serially when building the batch of QPs that is then passed to `solve_in_parallel`. I have made some modifications on top of yours, mainly using […]
You can see the […]. However, when I use the problem dimension defined in the script, I get […].
I am not exactly sure why the […]. Are you OK if I push my modifications for the benchmark script here onto the PR, and you check on your machine to compare? Currently, people who are using the […]
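For readers following the thread, here is a minimal sketch of the two dense-backend interfaces described above. The problem data is random and purely illustrative, and it assumes the standard `proxsuite.proxqp.dense` binding names:

```python
import numpy as np
import proxsuite

n, n_eq, n_in = 10, 3, 3
rng = np.random.default_rng(0)
M = rng.standard_normal((n, n))
H = M @ M.T + n * np.eye(n)        # positive-definite Hessian
g = rng.standard_normal(n)
A = rng.standard_normal((n_eq, n))
b = A @ rng.standard_normal(n)     # feasible equality right-hand side
C = rng.standard_normal((n_in, n))
l = -np.ones(n_in)
u = np.ones(n_in)

# (a) object interface: init does memory allocation + preconditioning,
#     solve then runs the ProxQP iterations
qp = proxsuite.proxqp.dense.QP(n, n_eq, n_in)
qp.init(H, g, A, b, C, l, u)
qp.solve()
print(qp.results.x)

# (b) standalone function: allocation, preconditioning, and solve in one call
results = proxsuite.proxqp.dense.solve(H=H, g=g, A=A, b=b, C=C, l=l, u=u)
print(results.x)
```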
Hi @fabinsch, thank you for your comment! I would be happy for you to apply your changes to the benchmark script on top of my branch. Once you do, I'll re-run on my machine and put some results here. Your results with n=50 and n=500 are really interesting. I can confirm that I get similar results running the same problem configuration on my machine. I confess I had only run tests at a scale closer to my use case (n >= 1000), and I don't have a good intuition for why n=500, n_eq = n_in = 200 is so slow in a thread pool. It might be interesting to run this benchmark against a lot of different problem setups and plot the results. Re: […]
Thank you for the answer, I pushed my changes. One concern I have is that our implementation uses MatRef (an Eigen::Ref to the input data), which directly references the underlying NumPy arrays without copying them. If the GIL is released during the execution of the solve, another Python thread can modify those arrays while the solver is still reading them, which is a data race. Existing users of the current API may rely on the GIL protecting them from this, so it might be better to expose the GIL-releasing behavior separately rather than change the default.
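To make this concern concrete, here is a sketch of the hazard; `solve_no_gil` is a stand-in name based on the `_no_gil` methods mentioned later in the thread, not a confirmed spelling:

```python
import threading
import numpy as np
import proxsuite

n = 500
rng = np.random.default_rng(0)
M = rng.standard_normal((n, n))
H = M @ M.T + n * np.eye(n)  # the solver holds an Eigen::Ref (MatRef) into this buffer
g = rng.standard_normal(n)

# With the GIL held for the whole call, no other Python thread can run,
# so the input buffers cannot change mid-solve. With the GIL released:
solver = threading.Thread(
    target=lambda: proxsuite.proxqp.dense.solve_no_gil(H=H, g=g)  # stand-in name
)
solver.start()
g[:] = 0.0   # this write can now interleave with the solver's reads: a data race
solver.join()
```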
@fabinsch I agree with you. It's better to add this as an option, so advanced users can release the GIL while standard users benefit from this safety.
I do agree too. |
@fabinsch @jorisv @jcarpent thanks guys, appreciate the feedback! A separate API sounds good. Just updated my branch; let me know if it looks good to you. Regarding timings, I got very similar results on my machine (I had to reduce the number of problems from 128 to 32 to get everything to fit in one process), where the n=500 case is slower with the `ThreadPoolExecutor` than even solving everything serially, which is a very interesting result. I'll investigate some more. All the other test cases (n = 50, 100, 200, 1000) were faster with the `_no_gil` methods and a `ThreadPoolExecutor`.
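Roughly the shape of the comparison being discussed, as a hedged sketch: the problems are simplified to unconstrained QPs, and `solve_no_gil` again stands in for the GIL-releasing variant:

```python
import time
from concurrent.futures import ThreadPoolExecutor
import numpy as np
import proxsuite

def random_qp(n, seed):
    # random convex (unconstrained) QP: 0.5 x'Hx + g'x
    rng = np.random.default_rng(seed)
    M = rng.standard_normal((n, n))
    return M @ M.T + n * np.eye(n), rng.standard_normal(n)

problems = [random_qp(1000, s) for s in range(32)]

t0 = time.perf_counter()
for H, g in problems:
    proxsuite.proxqp.dense.solve(H=H, g=g)          # serial baseline
print(f"serial:   {time.perf_counter() - t0:.2f}s")

t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    # threading only pays off if the solve call releases the GIL
    list(pool.map(lambda p: proxsuite.proxqp.dense.solve_no_gil(H=p[0], g=p[1]),
                  problems))
print(f"threaded: {time.perf_counter() - t0:.2f}s")
```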
This PR looked good to me, and all tests pass locally. I wanted to run our CI, which failed with an unrelated miniforge error when trying to create the conda env. When I tried to push a commit on top, I accidentally merged it, so I had to open a new PR (#363). It applies exactly the same commits as here, plus an attempt to fix the conda CI to check that all is good.
Addresses #358.
Releases the Global Interpreter Lock (GIL) when using `proxsuite.proxqp.dense.solve`, so Python users can easily parallelize solving problems across cores. Quadprog and OSQP already do something similar. This enables 2x or more faster parallel solving compared to using `proxsuite.proxqp.dense.VectorQP()` and `solve_in_parallel`.

On my M2 Mac, I see the following: the headline result is that solving the problems using the standalone `dense.solve` and a `ThreadPoolExecutor` is more than 2x faster than solving the problems serially now that the GIL gets released, and almost 3x faster than setting up a `VectorQP` and running `solve_in_parallel` (iiuc, the latter is mostly a result of the problem preconditioning happening while building the vector, which is a serial operation). If the GIL weren't released, using a `ThreadPoolExecutor` would only add overhead and take longer than just solving the problems serially.

As an additional benefit, this change doesn't require building ProxSuite with OpenMP, which I think is what has kept `solve_in_parallel` from being released on all platforms.
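For contrast, here is a sketch of the `VectorQP` + `solve_in_parallel` path the description compares against; the `append` call and the keyword names passed to `solve_in_parallel` are assumptions about the bindings:

```python
import numpy as np
import proxsuite

n, n_eq, n_in = 500, 200, 200
rng = np.random.default_rng(0)

qps = proxsuite.proxqp.dense.VectorQP()
for _ in range(32):
    M = rng.standard_normal((n, n))
    H = M @ M.T + n * np.eye(n)
    g = rng.standard_normal(n)
    A = rng.standard_normal((n_eq, n))
    b = A @ rng.standard_normal(n)
    C = rng.standard_normal((n_in, n))
    l = -np.ones(n_in)
    u = np.ones(n_in)

    qp = proxsuite.proxqp.dense.QP(n, n_eq, n_in)
    qp.init(H, g, A, b, C, l, u)   # allocation + preconditioning happen here, serially
    qps.append(qp)                 # assumed binding method

# Only this call fans out across OpenMP threads; the init loop above stays serial,
# which is the overhead the description attributes the ~3x gap to.
# Keyword names here are an assumption about the binding.
proxsuite.proxqp.dense.solve_in_parallel(qps=qps, num_threads=8)
```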