
Release GIL when doing standalone solves #359

Merged: 5 commits on Jan 10, 2025

Conversation

@jberg5 (Contributor) commented Jan 4, 2025

Addresses #358.

Releases the Global Interpreter Lock (GIL) when using proxsuite.proxqp.dense.solve so Python users can easily parallelize solving problems across cores. Quadprog and OSQP already do something similar.

This enables parallel solving that is 2x or more faster than using proxsuite.proxqp.dense.VectorQP() and solve_in_parallel.

On my M2 Mac, I see the following:

python ../benchmark/timings-parallel.py
Problem specs: n=1000 n_eq=200 n_in=200. Generating 64 such problems.
Generated problems. Solving 64 problems with 11 threads.
...
setup_vector_of_qps: 4655.346ms
setup_vector_and_solve_parallel_heuristics_threads: 5850.690ms
solve_serial_dense: 4716.554ms
solve_parallel_dense: 1934.598ms

The headline result is that solving the problems with the standalone dense.solve and a ThreadPoolExecutor is more than 2x faster than solving them serially now that the GIL is released, and almost 3x faster than setting up a VectorQP and running solve_in_parallel (if I understand correctly, the latter is mostly because the problem conditioning happens when building the vector, which is a serial operation). If the GIL were not released, using a ThreadPoolExecutor would only add overhead and would take longer than just solving the problems serially.

As an additional benefit, this change doesn't require building proxsuite with OpenMP, which I think is what has kept solve_in_parallel from being released on all platforms.
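
For reference, here is a minimal sketch of the usage pattern this enables, assuming dense.solve accepts the problem data as (H, g, A, b, C, l, u) and that `problems` is a hypothetical list of such tuples (not part of the actual benchmark script):

```python
# Sketch: parallel standalone solves from Python threads, possible once
# dense.solve releases the GIL while the C++ solver runs.
from concurrent.futures import ThreadPoolExecutor

import proxsuite


def solve_one(problem):
    H, g, A, b, C, l, u = problem
    # With the GIL released inside dense.solve, these calls can overlap across threads.
    return proxsuite.proxqp.dense.solve(H, g, A, b, C, l, u)


def solve_all(problems, num_threads=8):
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(solve_one, problems))
```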

@CLAassistant commented Jan 4, 2025

CLA assistant check
All committers have signed the CLA.

@jberg5 changed the title from "Disable GIL when doing standalone solves" to "Release GIL when doing standalone solves" on Jan 5, 2025
@fabinsch (Collaborator) commented Jan 7, 2025

Hi @jberg5,

thanks a lot for this suggestion. It is interesting, and we should think about how best to integrate it into Proxsuite.

To be clear: there are two interfaces for solving a QP problem with the dense backend: (a) create a qp object, pass the problem data (matrices, vectors) to the qp.init method (which does the memory allocation and the preconditioning), and then call qp.solve; or (b) use the solve function directly, which takes the problem data as input and does everything in one go.

Currently, only the qp.solve method (a) is parallelized (using OpenMP). Therefore the memory allocation + preconditioning is done serially when building a batch of QPs that is then passed to the solve_in_parallel function. The solve function (b) is not parallelized, but it can easily be parallelized in Python using ThreadPoolExecutor (as you showed in the benchmark file).
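
To make the two interfaces concrete for readers of this thread, here is a small sketch on a toy random problem (assuming the usual proxsuite.proxqp.dense API; dimensions and data are made up for illustration):

```python
import numpy as np
import proxsuite

# Toy convex QP data (illustrative only).
n, n_eq, n_in = 10, 3, 3
rng = np.random.default_rng(0)
M = rng.standard_normal((n, n))
H = M.T @ M + np.eye(n)           # positive-definite Hessian
g = rng.standard_normal(n)
A = rng.standard_normal((n_eq, n))
b = A @ rng.standard_normal(n)    # feasible equality constraints
C = rng.standard_normal((n_in, n))
l = -1e2 * np.ones(n_in)
u = 1e2 * np.ones(n_in)

# (a) object interface: allocation + preconditioning in init, then solve
qp = proxsuite.proxqp.dense.QP(n, n_eq, n_in)
qp.init(H, g, A, b, C, l, u)
qp.solve()
x_a = qp.results.x

# (b) standalone solve function: everything in one call
results = proxsuite.proxqp.dense.solve(H, g, A, b, C, l, u)
x_b = results.x
```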

I have made some modifications on top of yours, mainly using BatchQP instead of VectorQP, plus some comments. I ran the benchmark on two different problem dimensions; the timings I obtained on my i7-11850H @ 2.50GHz are:

Problem specs: n=50 n_eq=20 n_in=20. Generating 128 such problems.

Timings:
setup_vector_of_qps: 24.188ms
setup_batch_of_qps: 14.157ms
solve_in_parallel_heuristics_threads: 5.422ms
qp_solve_serial: 36.973ms
solve_in_parallel_1_threads: 31.002ms
solve_in_parallel_3_threads: 10.711ms
solve_in_parallel_5_threads: 6.466ms
solve_in_parallel_7_threads: 5.087ms
solve_in_parallel_9_threads: 4.139ms
solve_in_parallel_11_threads: 4.071ms
solve_in_parallel_13_threads: 3.727ms
solve_in_parallel_15_threads: 4.256ms
solve_in_parallel_heuristics_threads_and_setup_batch_of_qps: 19.579ms
solve_in_parallel_1_threads_and_setup_batch_of_qps: 45.159ms
solve_in_parallel_3_threads_and_setup_batch_of_qps: 24.868ms
solve_in_parallel_5_threads_and_setup_batch_of_qps: 20.623ms
solve_in_parallel_7_threads_and_setup_batch_of_qps: 19.244ms
solve_in_parallel_9_threads_and_setup_batch_of_qps: 18.295ms
solve_in_parallel_11_threads_and_setup_batch_of_qps: 18.228ms
solve_in_parallel_13_threads_and_setup_batch_of_qps: **17.883ms**
solve_in_parallel_15_threads_and_setup_batch_of_qps: 18.412ms
solve_fun_serial: 43.014ms
solve_fun_parallel: **8.487ms**

You can see that solve_fun_parallel (using ThreadPoolExecutor) is faster than any of the solve_in_parallel_X_threads_and_setup_batch_of_qps variants when the setup time for the batch of QPs is much higher than the solve time.

However, when I use the problem dimensions defined in the script, I get:

Problem specs: n=500 n_eq=200 n_in=200. Generating 128 such problems.

Timings:
setup_vector_of_qps: 2553.777ms
setup_batch_of_qps: 1419.743ms
solve_in_parallel_heuristics_threads: 845.065ms
qp_solve_serial: 3525.624ms
solve_in_parallel_1_threads: 2632.830ms
solve_in_parallel_3_threads: 1074.261ms
solve_in_parallel_5_threads: 879.457ms
solve_in_parallel_7_threads: 946.398ms
solve_in_parallel_9_threads: 1007.100ms
solve_in_parallel_11_threads: 1127.983ms
solve_in_parallel_13_threads: 1231.873ms
solve_in_parallel_15_threads: 1237.943ms
solve_in_parallel_heuristics_threads_and_setup_batch_of_qps: **2264.807ms**
solve_in_parallel_1_threads_and_setup_batch_of_qps: 4052.572ms
solve_in_parallel_3_threads_and_setup_batch_of_qps: 2494.003ms
solve_in_parallel_5_threads_and_setup_batch_of_qps: 2299.200ms
solve_in_parallel_7_threads_and_setup_batch_of_qps: 2366.141ms
solve_in_parallel_9_threads_and_setup_batch_of_qps: 2426.843ms
solve_in_parallel_11_threads_and_setup_batch_of_qps: 2547.725ms
solve_in_parallel_13_threads_and_setup_batch_of_qps: 2651.615ms
solve_in_parallel_15_threads_and_setup_batch_of_qps: 2657.685ms
solve_fun_serial: 3885.436ms
solve_fun_parallel: **3589.555ms**

I am not exactly sure why solve_fun_parallel (using ThreadPoolExecutor) performs so poorly in this setup.

Are you OK with me pushing my modifications to the benchmark script onto this PR, so you can re-run it on your machine and compare?

Currently, people who are using proxsuite.proxqp.dense.solve should be able to assume that it is thread-safe, and releasing the GIL on this function might cause some trouble. Maybe it would be a better option to just expose a new function, something like proxsuite.proxqp.dense.solve_no_gil, where you intentionally opt in to releasing the GIL. What do you think?

@jberg5 (Contributor, Author) commented Jan 7, 2025

Hi @fabinsch, thank you for your comment! I would be happy for you to apply your changes to the benchmark script on top of my branch. Once you do, I'll re-run it on my machine and post some results here.

Your results with n=50 and n=500 are really interesting. I can confirm that I get similar results running the same problem configurations on my machine. I confess I had only run tests at a scale closer to my use case (n >= 1000), and I don't have a good intuition for why n=500, n_eq = n_in = 200 is so slow in a thread pool. It might be interesting to run this benchmark against many different problem setups and plot the results.

Re: proxsuite.proxqp.dense.solve_no_gil: as long as the implementation of proxsuite.proxqp.dense.solve is already thread-safe, I don't think we would need a separate solve_no_gil. Releasing the GIL wouldn't impact thread safety; it just means that the Python program can do other things while the C++ crunches the numbers. I can update the changelog and the docs to make the GIL change explicit so users know they can take advantage of the new behavior. Let me know if this addresses your concerns!

@fabinsch (Collaborator) commented Jan 7, 2025

Thank you for the answer, I pushed my changes.

One concern I have is that our implementation uses MatRef (an Eigen::Ref to the input data), which directly references the underlying NumPy arrays without copying them. If the GIL is released during the execution of dense.solve and the user modifies the input arrays (e.g., resizing or changing elements) in Python, this could lead to undefined behavior.
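
To illustrate the hazard (a hedged sketch of what a user should NOT do, assuming H, g, A, b, C, l, u are NumPy arrays set up as in the benchmark): if the solve runs in a worker thread without the GIL and another thread mutates the same arrays, the Eigen::Ref inside the solver can see the buffer change mid-solve.

```python
# Sketch of the aliasing hazard: dense.solve references the NumPy buffers
# without copying, so mutating them while a no-GIL solve is in flight is a data race.
from concurrent.futures import ThreadPoolExecutor

import proxsuite

with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(proxsuite.proxqp.dense.solve, H, g, A, b, C, l, u)
    H[0, 0] = 1e6          # mutates a buffer the solver may still be reading
    result = future.result()
```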

Existing users of dense.solve might rely on it holding the GIL for specific use cases, and changing the behavior could unintentionally break their workflows. That's why I think releasing it should be an intentional choice, either by using a new function or by passing an additional keyword arg like release_gil=True.

What do you think about it @jorisv or @jcarpent ?

@jorisv (Contributor) commented Jan 9, 2025

@fabinsch I agree with you. It's better to add this as an option, so advanced users can release the GIL while standard users benefit from this safety.

@jcarpent (Member) commented Jan 9, 2025

> @fabinsch I agree with you. It's better to add this as an option, so advanced users can release the GIL while standard users benefit from this safety.

I do agree too.

@jberg5 (Contributor, Author) commented Jan 9, 2025

@fabinsch @jorisv @jcarpent thanks guys, appreciate the feedback! A separate API sounds good. I just updated my branch; let me know if it looks good to you.

Regarding timings, I got very similar results on my machine (I had to reduce the number of problems from 128 to 32 to get everything to fit in one process), where the n=500 case is slower with the ThreadPoolExecutor than even solving everything serially, which is a very interesting result. I'll investigate some more. All the other test cases (50, 100, 200, 1000) were faster with the _no_gil methods and ThreadPoolExecutor.
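
For illustration, the opt-in pattern would look roughly like this, assuming the new entry point ends up being exposed as proxsuite.proxqp.dense.solve_no_gil as discussed above (name and signature hypothetical until the final merge):

```python
# Sketch: only the explicit *_no_gil variant is submitted to the thread pool;
# the original dense.solve keeps holding the GIL and its current behavior.
from concurrent.futures import ThreadPoolExecutor

import proxsuite


def solve_batch_no_gil(problems, num_threads=8):
    def _solve(problem):
        H, g, A, b, C, l, u = problem
        return proxsuite.proxqp.dense.solve_no_gil(H, g, A, b, C, l, u)

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(_solve, problems))
```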

@fabinsch merged commit b72e27f into Simple-Robotics:devel on Jan 10, 2025 (32 of 73 checks passed)
@fabinsch (Collaborator) commented

This PR looked good to me; all tests pass locally. I wanted to run our CI, which failed with an unrelated miniforge error when trying to create the conda env. When I tried to push a commit on top, I accidentally merged it.

I had to open a new PR (#363). It applies exactly the same commits as here, plus an attempt to fix the conda CI to check that all is good.
