
dataset-selectivity performance regression? #129

Open
jgehrcke opened this issue Jan 5, 2023 · 7 comments

jgehrcke commented Jan 5, 2023

I have seen something on conbench.ursa.dev that I would love to use as an example scenario: do we have a performance regression, or do we perhaps have a methodological weakness?

https://conbench.ursa.dev/benchmarks/4fe411bf67a94bc6aa9787fc0394bd03/

That is, around 2023-01-05 07:49 a run against apache/arrow@e5ec942 measured the benchmark dataset-selectivity (case permutation 10%, nyctaxi_multi_parquet_s3) at almost two seconds in each of three iterations: [1.951955, 1.846497, 1.891674].

In the previous 1–2 weeks it took ~1.2 seconds:

[screenshot: benchmark timing history]


jgehrcke commented Jan 5, 2023

https://conbench.ursa.dev/benchmarks/4fe411bf67a94bc6aa9787fc0394bd03/ does not show a lot of history; it goes back only to 2022-12-28.

Manual history extension: https://conbench.ursa.dev/benchmarks/db5ae59ff5944b2180dc73e2c35e2c43/

[screenshot: extended benchmark timing history]


jgehrcke commented Jan 5, 2023

Data points from 2022-12-22: [1.24206, 1.174114, 1.248703].

Summarized observations:

  • Between 2022-12-13 and 2023-01-04 the duration was ~1.2 seconds, with variability smaller than 0.1 seconds.
  • On 2023-01-05 the duration jumped to ~1.9 seconds, still with small variability.
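The jump described above can be sketched as a simple mean comparison. This is an illustration, not part of the thread or conbench's actual analysis; the data points are the ones quoted in this issue, and the 20% threshold is an arbitrary assumption:

```python
# Illustrative sketch (not conbench's logic): flag a run whose mean
# duration exceeds a recent baseline mean by more than a fixed threshold.
from statistics import mean

baseline = [1.24206, 1.174114, 1.248703]  # 2022-12-22 run, seconds
suspect = [1.951955, 1.846497, 1.891674]  # 2023-01-05 run, seconds

def looks_like_regression(baseline, candidate, threshold=0.20):
    """Return True if the candidate mean exceeds the baseline mean by `threshold`."""
    return mean(candidate) > mean(baseline) * (1 + threshold)

print(looks_like_regression(baseline, suspect))  # True: ~1.90 s vs ~1.22 s
```

With these numbers the ~1.9 s run clears the threshold easily, which is why the data points alone look like a regression.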


jgehrcke commented Jan 5, 2023

Returned to normal in the subsequent run:
https://conbench.ursa.dev/benchmarks/a8b86353d80145bb91d1bd2a1125d81e/

[screenshot: timing history showing recovery]

timestamp 2023-01-05 10:44
data [1.225894, 1.226221, 1.172097]


jgehrcke commented Jan 5, 2023

Notably, this was executed on ursa-i9-9960x. I suppose that the benchmark run https://conbench.ursa.dev/benchmarks/4fe411bf67a94bc6aa9787fc0394bd03/ was affected by a one-off event on that machine at that time, probably something interfering with disk I/O.


jonkeane commented Jan 5, 2023

Yeah, this is a circumstance we have observed in a few places, even with dedicated runners: some benchmark cases have blips like this that initially look like regressions but go away on the next run. I was talking to Austin about this the other day and wrote up (some of) what we talked about in conbench/conbench#572. That is a large(r) project, but one I think would be good to spec out, to see how much it would take to add.

Of course, as I mention there, that project should not preclude looking at this benchmark code to see if there's something we can do to make it more reliable. One question about the disk I/O interference: I thought you crafted this benchmark to use RAM disks to prevent exactly this kind of thing (though I'll admit I didn't follow super closely while I was out, so I might be wrong about that), which should make disk I/O not matter much, yeah?
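One way to tell a one-run blip from a real regression, sketched here as a hypothetical helper (this is not conbench's implementation, just one possible shape of the idea discussed around conbench/conbench#572), is to require the slowdown to persist across consecutive runs before alerting:

```python
# Hypothetical helper, not conbench's actual code: only confirm a
# regression once the elevated per-run mean persists for `persist`
# consecutive runs. A single outlier run is then never reported.
def confirmed_regression(run_means, baseline_mean, threshold=0.20, persist=2):
    """True if the last `persist` run means all exceed baseline by `threshold`."""
    if len(run_means) < persist:
        return False
    limit = baseline_mean * (1 + threshold)
    return all(m > limit for m in run_means[-persist:])

# The 2023-01-05 blip: one slow run followed by a normal one -> not confirmed.
print(confirmed_regression([1.90, 1.21], baseline_mean=1.22))  # False
# A genuine regression: two consecutive slow runs -> confirmed.
print(confirmed_regression([1.90, 1.88], baseline_mean=1.22))  # True
```

The trade-off is latency: a real regression is only reported after `persist` runs, which may be acceptable for benchmarks that run multiple times a day.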


jgehrcke commented Jan 6, 2023

> I was talking to Austin about this the other day and wrote up (some of) what we talked about in conbench/conbench#572 which is a large(r) project, but one I think would be good to spec out and see how much it would take to add.

Cool, replied there!

> that project should not preclude looking at this benchmark code to see if there's something we can do to make it more reliable

Of course, and instead of 'code' I'd more generally say 'method'. Careful thought about what a benchmark is even supposed to measure, and whether it makes sense for that to be affected by disk I/O, is, I suppose, one of the most effective ways to fight instability.

> One question about the disk I/O interference: I thought you crafted this benchmark so that it was using ram disks to prevent something like that from happening (but I will admit I didn't follow super closely while I was out, so I might be wrong about that), which should make disk I/O not super important, yeah?

I did that in 'my' benchmark dataset-serialize. This one here is dataset-selectivity which reads gigabytes of data from disk (and measures how long that takes).
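For context, the RAM-disk approach mentioned for dataset-serialize can be sketched like this. This is an illustration, not the benchmark's actual code; the path `/dev/shm` (a tmpfs mount on most Linux systems) and the payload size are assumptions:

```python
# Illustrative sketch, not the actual dataset-serialize code: stage a
# file on a RAM disk (/dev/shm on most Linux systems) so that reads in
# the measured phase never touch the physical disk. Falls back to the
# regular temp directory where no /dev/shm exists (e.g. macOS).
import os
import tempfile

def stage_on_ramdisk(data: bytes, ramdisk="/dev/shm"):
    """Write `data` to a temp file on the RAM disk and return its path."""
    base = ramdisk if os.path.isdir(ramdisk) else tempfile.gettempdir()
    fd, path = tempfile.mkstemp(dir=base)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return path

path = stage_on_ramdisk(b"x" * 1024)  # 1 KiB of dummy payload
with open(path, "rb") as f:
    assert len(f.read()) == 1024      # measured reads would start here
os.remove(path)
```

For dataset-selectivity this isolation would defeat the purpose, since reading from disk is part of what that benchmark measures.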


jonkeane commented Jan 6, 2023

> I did that in 'my' benchmark dataset-serialize. This one here is dataset-selectivity which reads gigabytes of data from disk (and measures how long that takes).

Ah, right, of course. I got those wires crossed, sorry!
