When passing DataFrames as input, the computations for larger DataFrames turn out to be 2-3x slower in total on a single thread (a sketch of the two input modes follows the timings below).
Analyzing this in more detail shows that the slowdown has at least two sources:
a) the per-batch join is ~50% slower. With DataFrame inputs:
INFO:sequila_core.physical_planner.joins.interval_join:ThreadId(1) is done building a hash table from 147 batches, 1194285 rows, took 41 ms
INFO:sequila_core.physical_planner.joins.interval_join:ThreadId(1) finished execution, total processed batches: 1215, total join time: 6242 ms
vs native:
INFO:sequila_core.physical_planner.joins.interval_join:ThreadId(1) is done building a hash table from 147 batches, 1194285 rows, took 49 ms
INFO:sequila_core.physical_planner.joins.interval_join:ThreadId(1) finished execution, total processed batches: 1215, total join time: 4079 ms
b) an additional delay. For native inputs the join time roughly equals the total time:
INFO:sequila_core.physical_planner.joins.interval_join:ThreadId(1) finished execution, total processed batches: 1215, total join time: 4079 ms
7-8
{
"results": [
{
"name": "polars_bio",
"min": 4.342850207933225,
"max": 4.342850207933225,
"mean": 4.342850207933225,
"speedup": 1.0
}
]
}
whereas for DataFrame inputs the join time is well below the total time:
INFO:sequila_core.physical_planner.joins.interval_join:ThreadId(1) finished execution, total processed batches: 1215, total join time: 6379 ms
7-8
{
"results": [
{
"name": "polars_bio_polars_eager",
"min": 9.375052332994528,
"max": 9.375052332994528,
"mean": 9.375052332994528,
"speedup": 1.0
}
]
}
so roughly another 3 s is spent somewhere else...
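For context, here is a minimal sketch of how the two input modes being compared could be exercised. The file paths are placeholders, and the exact pb.overlap signature (accepted input types, output_type values, default column names) should be checked against the polars_bio docs; the point is only the difference between passing paths and passing eager DataFrames:

```python
import time

import polars as pl
import polars_bio as pb

# Placeholder paths for the two interval datasets used in the benchmark.
PATH_A = "intervals_a.parquet"
PATH_B = "intervals_b.parquet"


def timed(fn) -> float:
    """Return the wall-clock time of a single call in seconds."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


# "Native" mode: the engine scans the files itself.
t_native = timed(lambda: pb.overlap(PATH_A, PATH_B, output_type="polars.DataFrame"))

# DataFrame mode: eager Polars DataFrames are passed in (and wrapped into a MemTable).
df_a, df_b = pl.read_parquet(PATH_A), pl.read_parquet(PATH_B)
t_df = timed(lambda: pb.overlap(df_a, df_b, output_type="polars.DataFrame"))

print(f"native: {t_native:.2f} s, dataframe input: {t_df:.2f} s")
```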
An additional observation is that this might be related to threading. If we set the DataFusion target partition number to 2 (a configuration sketch follows the per-thread comparison below), we get:
DataFrame inputs:
INFO:sequila_core.physical_planner.joins.interval_join:ThreadId(29) finished execution, total processed batches: 1215, total join time: 2908 ms
INFO:sequila_core.physical_planner.joins.interval_join:ThreadId(30) finished execution, total processed batches: 1215, total join time: 3591 ms
7-8
{
"results": [
{
"name": "polars_bio_polars_eager",
"min": 3.7248092079535127,
"max": 3.7248092079535127,
"mean": 3.7248092079535127,
"speedup": 1.0
}
]
}
native:
INFO:sequila_core.physical_planner.joins.interval_join:ThreadId(31) finished execution, total processed batches: 1215, total join time: 1740 ms
INFO:sequila_core.physical_planner.joins.interval_join:ThreadId(32) finished execution, total processed batches: 1215, total join time: 2212 ms
7-8
{
"results": [
{
"name": "polars_bio",
"min": 2.328427125001326,
"max": 2.328427125001326,
"mean": 2.328427125001326,
"speedup": 1.0
}
]
}
Three observations:
- total time ≈ max(per-thread join time): 3591 ms vs 3724 ms in total, so the additional "delay" is gone
- a system monitor shows that there are in fact 3 threads running (~300% CPU), i.e. 1 for Polars plus 2 for DataFusion; this is unlike the native approach, where with target partitions = 2 we see a constant ~200% CPU utilization
- because of point 2, the observed speedup is ~2.5x rather than ~2.0x
When comparing per-thread join times to native:
INFO:sequila_core.physical_planner.joins.interval_join:ThreadId(31) finished execution, total processed batches: 1215, total join time: 1740 ms
INFO:sequila_core.physical_planner.joins.interval_join:ThreadId(32) finished execution, total processed batches: 1215, total join time: 2212 ms
The gap:
for 1 thread:
4079 ms vs 6379 ms
for 2 threads:
1740 ms vs 2908 ms
2212 ms vs 3591 ms
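For reference, a hedged sketch of pinning the partition count on a DataFusion session via the datafusion Python bindings; whether and how polars_bio exposes this setting directly is an assumption to verify:

```python
from datafusion import SessionConfig, SessionContext

# Assumption: the interval join runs on a DataFusion session whose parallelism
# is governed by target_partitions; here it is pinned to 2 as in the runs above.
config = SessionConfig().with_target_partitions(2)
ctx = SessionContext(config)
```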
The problem seems to be related to the MemTable that we create from the input DataFrames.
The operation is zero-copy, but there is significant overhead when running queries on top of it (maybe something related to the GIL?).
The current workaround is to use temporary Parquet files for larger DataFrames. Needs further investigation.
See results: https://biodatageeks.org/polars-bio/performance/#dataframes-comparison
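A minimal sketch of the workaround, under the same assumptions about pb.overlap as above: spill larger DataFrames to temporary Parquet files and pass the paths, so the query scans Parquet instead of a MemTable:

```python
import tempfile
from pathlib import Path

import polars as pl
import polars_bio as pb


def overlap_via_parquet(df_a: pl.DataFrame, df_b: pl.DataFrame) -> pl.DataFrame:
    """Join two interval DataFrames by spilling them to temp Parquet files first."""
    with tempfile.TemporaryDirectory() as tmp:
        path_a = Path(tmp) / "a.parquet"
        path_b = Path(tmp) / "b.parquet"
        df_a.write_parquet(path_a)
        df_b.write_parquet(path_b)
        # Eager output so the result is materialized before the temp files are removed.
        return pb.overlap(str(path_a), str(path_b), output_type="polars.DataFrame")
```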