[SYSTEMDS-3548] Optimize IO path Python interface for SystemDS #2189
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Student project
SYSTEMDS-3548
and follow-up to #2154Contributions/discussion:
if var == jvm.gate.sds.ValueType.String
comparison has a big overhead. I was able to optimize another constant time by reducing stuff like that to a minimum, see the first graph below.cols > rows
the current column-wise processing is very slow, so I added row-wise processing for that case to speed it up (see second graph below, tested on 1k rows x 10k cols). Note here that it currently only does that for edge cases where all columns have the same data type. This is because when testing, serializing over a row with different columns was very costly. I wasn't able to spend a lot of time on this, as the deadline is approaching, so I think there is a lot of potential here to find an efficient way to serialize to be able to also use it for mixed columns. I'd also expect in the most optimal case for the time to be the same as as therows > cols
case, so I think there is probably also more optimization potential in the current row-wise processing.