[SYSTEMDS-3548] Optimize IO path Python interface for SystemDS #2189

Open
Nakroma wants to merge 4 commits into main

Conversation

@Nakroma (Contributor) commented Jan 24, 2025

Student project SYSTEMDS-3548 and follow-up to #2154

Contributions/discussion:

  • I followed up on both suggestions from @Baunsgaard in the first PR and tested both chunking the data into smaller parts and fusing operations into fewer Java calls. I was unable to get any real improvements that were reproducible on the larger datasets, although I don't have much experience with Py4J, so this might still have some potential. I did add some of the adjacent code for it (fusing convert, setting only chunks of a FrameBlock, etc.), so at least some of the work there contributes to the project.
  • As it turns out, anything involving the Java gateway is very costly; for example, even a simple if var == jvm.gate.sds.ValueType.String comparison has a big overhead. I was able to shave off another constant amount of time by reducing such gateway accesses to a minimum, see the first graph below (a minimal sketch of this kind of caching follows the graphs).
  • For cases where cols > rows, the current column-wise processing is very slow, so I added row-wise processing for that case to speed it up (see the second graph below, tested on 1k rows x 10k cols). Note that it currently only takes that path in the edge case where all columns share the same data type, because in testing, serializing a row with mixed column types was very costly. I wasn't able to spend much time on this since the deadline is approaching, so I think there is a lot of potential to find an efficient serialization that also covers mixed columns. In the optimal case I'd also expect the time to match the rows > cols case, so there is probably further optimization potential in the current row-wise processing as well (a rough sketch of the row-wise idea follows the commit notes below).
  • Small note: I changed how I compare times. Before I was averaging runs, but now I take the minimum, as suggested by the timeit docs, so the times may differ slightly from the first PR.
  • Fixed a regression from my first PR where exceptions in the threaded function calls wouldn't propagate properly (a short sketch of the propagation pattern also follows below).
  • Fixed a small bug in the perftests so they can read multi-file data (since that's what datagen generates for larger datasets).

[Graph: load_pandas_10k_1k_dense benchmark]
[Graph: load_pandas_1k_10k_dense benchmark]
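
To illustrate the kind of gateway-access reduction described in the second bullet, here is a minimal, hedged sketch: resolve the Java type constants once and reuse the local references so the per-column work stays in pure Python. The class path and helper names are assumptions for illustration and not the actual SystemDS code.

def cache_value_types(jvm):
    # Each attribute lookup on the Py4J gateway is a round trip to the JVM,
    # so resolve the enum constants once up front. The class path below is
    # an assumption and may not match the real SystemDS layout.
    value_type = jvm.org.apache.sysds.common.Types.ValueType
    return {
        "object": value_type.STRING,
        "float64": value_type.FP64,
        "int64": value_type.INT64,
    }

def to_value_type(dtype, cached):
    # A plain dictionary lookup afterwards, with no further JVM calls
    # inside the per-column loop.
    return cached.get(str(dtype), cached["object"])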
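
For the exception-propagation regression mentioned in the list above, the general pattern (not the actual SystemDS code) is to keep the futures and call result() on each one, which re-raises any exception thrown in a worker thread in the calling thread:

from concurrent.futures import ThreadPoolExecutor

def convert_columns_parallel(columns, convert_one):
    # Submit one conversion task per column and collect the futures.
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(convert_one, col) for col in columns]
        # result() re-raises any exception raised inside the worker,
        # so failures surface in the caller instead of being swallowed.
        return [f.result() for f in futures]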

This commit optimizes how the pandas_to_frame_block function accesses Java types. It also fixes a small regression where exceptions from the parallelization threads weren't propagating properly.
IO datagen splits large datasets into multiple files (for example 100k_1k). This commit makes load_pandas.py and load_numpy.py able to read those.
This commit adds basic row-wise processing in the case of cols > rows. It also adds some other small, unused utility methods.
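
As a rough illustration of the row-wise path for the cols > rows case (third bullet above), the sketch below shows the idea for the uniform-dtype edge case. The set_row callable only stands in for the Java-side row setter added in this PR; the actual Py4J argument conversion is omitted.

import pandas as pd

def rows_to_frame(df: pd.DataFrame, set_row) -> None:
    # With few rows and many columns, pushing whole rows across the gateway
    # means one call per row instead of one call per column.
    if df.dtypes.nunique() != 1:
        raise ValueError("sketch only covers the uniform-dtype case")
    values = df.to_numpy()
    for r, row in enumerate(values):
        # set_row stands in for the new Java-side row setter; how the row
        # is converted to a Java array is left out of this sketch.
        set_row(r, row.tolist())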
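
And purely to illustrate the multi-file reading from the perftest fix (assuming a directory of CSV part files; the actual load_pandas.py and load_numpy.py may differ), a typical pattern is to glob the parts and concatenate them:

import glob
import os
import pandas as pd

def load_parts(data_dir: str) -> pd.DataFrame:
    # Assumes the generated dataset is a directory of CSV part files;
    # sorting keeps the row order deterministic across runs.
    parts = sorted(glob.glob(os.path.join(data_dir, "*")))
    frames = [pd.read_csv(p, header=None) for p in parts]
    return pd.concat(frames, ignore_index=True)
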
@Nakroma marked this pull request as ready for review on January 24, 2025, 16:46
@Baunsgaard (Contributor) left a comment


LGTM, thanks for the edits.

The only elements missing are documentation on the newly defined methods. If you could add these, then I will merge it!

@@ -555,6 +555,12 @@ public void reset() {
reset(0, true);
}

public void setRow(int c, Object[] row) {

Please add javadocs for these methods.

Labels: None yet
Projects: Status: In Progress
2 participants