Improvement of unbalanced datasets in multiprocessing #7

Open
maxgalli opened this issue Sep 1, 2020 · 0 comments

maxgalli commented Sep 1, 2020

As noticed during the last run of benchmark tests, the treatment of unbalanced datasets is suboptimal when running with multiprocessing enabled: if one of the RDataFrames is built on top of a dataset that is much larger than the others, the worker that processes it ends up being a bottleneck for the entire analysis. Several approaches (to be investigated and implemented separately) could fix this issue:

  • combine multiprocessing and multithreading: detect the larger datasets in advance and let the workers that process them run with multiple threads; so that the number of cores used does not increase, the overall number of workers has to decrease accordingly;
  • use only multiprocessing: detect the larger datasets in advance and split each of them into several RDataFrames, so that they are picked up by different workers; the partial results can easily be merged at the end to obtain the proper histograms; this solution also requires a mechanism ensuring that the largest RDataFrames are the first ones sent to the workers.
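
The second option can be sketched in plain Python as follows. This is only an illustration under assumptions, not the project's code: the function names (`split_into_ranges`, `build_tasks`, `process_task`, `merge_partials`), the `max_chunk` threshold and the dataset sizes are all hypothetical, and the event loop is replaced by a toy counter instead of a real RDataFrame. How an actual RDataFrame would be restricted to an entry range or a subset of files is left open here.

```python
from collections import Counter
from multiprocessing import Pool


def split_into_ranges(name, n_entries, max_chunk):
    """Split one dataset into (name, begin, end) ranges of at most max_chunk entries."""
    return [(name, begin, min(begin + max_chunk, n_entries))
            for begin in range(0, n_entries, max_chunk)]


def build_tasks(dataset_sizes, max_chunk):
    """Split oversized datasets, then order all tasks largest-first."""
    tasks = []
    for name, n_entries in dataset_sizes.items():
        tasks.extend(split_into_ranges(name, n_entries, max_chunk))
    # Largest tasks first, so that a big chunk is never the last one to start.
    return sorted(tasks, key=lambda t: t[2] - t[1], reverse=True)


def process_task(task):
    """Stand-in for the per-worker event loop: returns a partial 'histogram'."""
    name, begin, end = task
    # Toy histogram with 4 bins: bin index is the entry index modulo 4.
    return name, Counter(i % 4 for i in range(begin, end))


def merge_partials(partials):
    """Sum the partial histograms that belong to the same original dataset."""
    merged = {}
    for name, hist in partials:
        merged.setdefault(name, Counter()).update(hist)
    return merged


if __name__ == "__main__":
    # Hypothetical sizes: one dataset is much larger than the others.
    sizes = {"big": 1000, "small_a": 100, "small_b": 120}
    tasks = build_tasks(sizes, max_chunk=250)
    with Pool(processes=4) as pool:
        # chunksize=1 hands out tasks one at a time, in the sorted order.
        partials = pool.map(process_task, tasks, chunksize=1)
    print(merge_partials(partials))
```

Sorting the tasks by decreasing size is a simple greedy heuristic: workers that finish a small task early pick up the next one, while the large chunks are already in flight. Merging by summation works here because histogram bin contents are additive across disjoint sets of entries.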