[SYSTEMDS-3548] Optimize IO path Python interface for SystemDS #2189

Open
Nakroma wants to merge 4 commits into main

Conversation

@Nakroma (Contributor) commented Jan 24, 2025

Student project SYSTEMDS-3548 and follow-up to #2154

Contributions/discussion:

  • I followed up on both suggestions from @Baunsgaard in the first PR and tested both chunking the data into smaller parts and fusing operations into fewer Java calls. I was unable to get any real improvements that were reproducible on the larger datasets, although I don't have much experience with Py4J, so this might still have some potential. I did add some of the adjacent code for it (fusing convert, setting only chunks of a FrameBlock, etc.), so at least some of the work there contributes to the project.
  • As it turns out, anything involving the Java gateway is very costly; for example, even a simple if var == jvm.gate.sds.ValueType.String comparison has a big overhead. I was able to shave off another constant amount of time by reducing such gateway accesses to a minimum, see the first graph below (a minimal sketch of this kind of caching follows the graphs).
  • For cases where cols > rows, the current column-wise processing is very slow, so I added row-wise processing for that case to speed it up (see the second graph below, tested on 1k rows x 10k cols). Note that it currently only takes that path in the edge case where all columns share the same data type, because in testing, serializing a row with mixed column types was very costly. I wasn't able to spend much time on this since the deadline is approaching, so I think there is a lot of potential to find an efficient serialization that also covers mixed columns. In the optimal case I'd also expect the time to match the rows > cols case, so there is probably further optimization potential in the current row-wise processing as well (a rough sketch of the row-wise idea follows the commit notes below).
  • Small note: I changed how I compare times. Before I was averaging runs, but now I take the minimum, as suggested by the timeit docs, so the times may differ slightly from the first PR.
  • Fixed a regression from my first PR where exceptions in the threaded function calls wouldn't propagate properly (a short sketch of the propagation pattern also follows below).
  • Fixed a small bug in the perftests so they can read multi-file data (since that's what datagen generates for larger datasets).

[Graph: load_pandas_10k_1k_dense benchmark]
[Graph: load_pandas_1k_10k_dense benchmark]
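
To illustrate the kind of gateway-access reduction described in the second bullet, here is a minimal, hedged sketch: resolve the Java type constants once and reuse the local references so the per-column work stays in pure Python. The class path and helper names are assumptions for illustration and not the actual SystemDS code.

def cache_value_types(jvm):
    # Each attribute lookup on the Py4J gateway is a round trip to the JVM,
    # so resolve the enum constants once up front. The class path below is
    # an assumption and may not match the real SystemDS layout.
    value_type = jvm.org.apache.sysds.common.Types.ValueType
    return {
        "object": value_type.STRING,
        "float64": value_type.FP64,
        "int64": value_type.INT64,
    }

def to_value_type(dtype, cached):
    # A plain dictionary lookup afterwards, with no further JVM calls
    # inside the per-column loop.
    return cached.get(str(dtype), cached["object"])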
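
For the exception-propagation regression mentioned in the list above, the general pattern (not the actual SystemDS code) is to keep the futures and call result() on each one, which re-raises any exception thrown in a worker thread in the calling thread:

from concurrent.futures import ThreadPoolExecutor

def convert_columns_parallel(columns, convert_one):
    # Submit one conversion task per column and collect the futures.
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(convert_one, col) for col in columns]
        # result() re-raises any exception raised inside the worker,
        # so failures surface in the caller instead of being swallowed.
        return [f.result() for f in futures]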

This commit optimizes how the pandas_to_frame_block function accesses Java types. It also fixes a small regression where exceptions from the parallelization threads weren't propagating properly.
IO datagen splits large datasets into multiple files (for example 100k_1k). This commit makes load_pandas.py and load_numpy.py able to read those.
This commit adds basic row-wise processing in the case of cols > rows. It also adds some other small, unused utility methods.
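
As a rough illustration of the row-wise path for the cols > rows case (third bullet above), the sketch below shows the idea for the uniform-dtype edge case. The set_row callable only stands in for the Java-side row setter added in this PR; the actual Py4J argument conversion is omitted.

import pandas as pd

def rows_to_frame(df: pd.DataFrame, set_row) -> None:
    # With few rows and many columns, pushing whole rows across the gateway
    # means one call per row instead of one call per column.
    if df.dtypes.nunique() != 1:
        raise ValueError("sketch only covers the uniform-dtype case")
    values = df.to_numpy()
    for r, row in enumerate(values):
        # set_row stands in for the new Java-side row setter; how the row
        # is converted to a Java array is left out of this sketch.
        set_row(r, row.tolist())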
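
And purely to illustrate the multi-file reading from the perftest fix (assuming a directory of CSV part files; the actual load_pandas.py and load_numpy.py may differ), a typical pattern is to glob the parts and concatenate them:

import glob
import os
import pandas as pd

def load_parts(data_dir: str) -> pd.DataFrame:
    # Assumes the generated dataset is a directory of CSV part files;
    # sorting keeps the row order deterministic across runs.
    parts = sorted(glob.glob(os.path.join(data_dir, "*")))
    frames = [pd.read_csv(p, header=None) for p in parts]
    return pd.concat(frames, ignore_index=True)
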
@Nakroma marked this pull request as ready for review on January 24, 2025, 16:46
@Baunsgaard (Contributor) left a comment


LGTM, thanks for the edits.

The only elements missing are documentation on the newly defined methods. If you could add these, then I will merge it!

@@ -555,6 +555,12 @@ public void reset() {
reset(0, true);
}

public void setRow(int c, Object[] row) {

Please add javadocs for these methods.

Labels: None yet
Projects: Status: In Progress
2 participants