
dataset-selectivity performance regression? #129

Open
jgehrcke opened this issue Jan 5, 2023 · 7 comments

jgehrcke commented Jan 5, 2023

I have seen something on conbench.ursa.dev that I would love to use as an example scenario: do we have a performance regression, or do we perhaps have a methodological weakness?

https://conbench.ursa.dev/benchmarks/4fe411bf67a94bc6aa9787fc0394bd03/

That is, around 2023-01-05 07:49 a run against apache/arrow@e5ec942 measured the benchmark dataset-selectivity (case permutation 10%, nyctaxi_multi_parquet_s3) at almost two seconds in each of three iterations: [1.951955, 1.846497, 1.891674].

In the previous 1–2 weeks it took ~1.2 seconds:

[screenshot: benchmark timing history]


jgehrcke commented Jan 5, 2023

https://conbench.ursa.dev/benchmarks/4fe411bf67a94bc6aa9787fc0394bd03/ does not show a lot of history; it goes back only to 2022-12-28.

Manual history extension: https://conbench.ursa.dev/benchmarks/db5ae59ff5944b2180dc73e2c35e2c43/

[screenshot: extended benchmark timing history]


jgehrcke commented Jan 5, 2023

Data points from 2022-12-22: [1.24206, 1.174114, 1.248703].

Summarized observations:

  • Between 2022-12-13 and 2023-01-04 the duration was ~1.2 seconds, with variability smaller than 0.1 seconds.
  • On 2023-01-05 the duration jumped to ~1.9 seconds, still with small variability.
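The jump described above can be sketched as a simple mean comparison. This is an illustration, not part of the thread or conbench's actual analysis; the data points are the ones quoted in this issue, and the 20% threshold is an arbitrary assumption:

```python
# Illustrative sketch (not conbench's logic): flag a run whose mean
# duration exceeds a recent baseline mean by more than a fixed threshold.
from statistics import mean

baseline = [1.24206, 1.174114, 1.248703]  # 2022-12-22 run, seconds
suspect = [1.951955, 1.846497, 1.891674]  # 2023-01-05 run, seconds

def looks_like_regression(baseline, candidate, threshold=0.20):
    """Return True if the candidate mean exceeds the baseline mean by `threshold`."""
    return mean(candidate) > mean(baseline) * (1 + threshold)

print(looks_like_regression(baseline, suspect))  # True: ~1.90 s vs ~1.22 s
```

With these numbers the ~1.9 s run clears the threshold easily, which is why the data points alone look like a regression.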


jgehrcke commented Jan 5, 2023

Returned to normal in the subsequent run:
https://conbench.ursa.dev/benchmarks/a8b86353d80145bb91d1bd2a1125d81e/

[screenshot: timing history showing recovery]

timestamp 2023-01-05 10:44
data [1.225894, 1.226221, 1.172097]


jgehrcke commented Jan 5, 2023

Notably, this was executed on ursa-i9-9960x. I suppose that the benchmark run https://conbench.ursa.dev/benchmarks/4fe411bf67a94bc6aa9787fc0394bd03/ was affected by a one-off event on that machine at that time, probably something interfering with disk I/O.


jonkeane commented Jan 5, 2023

Yeah, this is a circumstance we have observed in a few places, even with dedicated runners: some benchmark cases have blips like this that initially look like regressions but go away on the next run. I was talking to Austin about this the other day and wrote up (some of) what we talked about in conbench/conbench#572. That is a large(r) project, but one I think would be good to spec out, to see how much it would take to add.

Of course, as I mention there, that project should not preclude looking at this benchmark code to see if there's something we can do to make it more reliable. One question about the disk I/O interference: I thought you crafted this benchmark to use RAM disks to prevent exactly this kind of thing (though I'll admit I didn't follow super closely while I was out, so I might be wrong about that), which should make disk I/O not matter much, yeah?
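One way to tell a one-run blip from a real regression, sketched here as a hypothetical helper (this is not conbench's implementation, just one possible shape of the idea discussed around conbench/conbench#572), is to require the slowdown to persist across consecutive runs before alerting:

```python
# Hypothetical helper, not conbench's actual code: only confirm a
# regression once the elevated per-run mean persists for `persist`
# consecutive runs. A single outlier run is then never reported.
def confirmed_regression(run_means, baseline_mean, threshold=0.20, persist=2):
    """True if the last `persist` run means all exceed baseline by `threshold`."""
    if len(run_means) < persist:
        return False
    limit = baseline_mean * (1 + threshold)
    return all(m > limit for m in run_means[-persist:])

# The 2023-01-05 blip: one slow run followed by a normal one -> not confirmed.
print(confirmed_regression([1.90, 1.21], baseline_mean=1.22))  # False
# A genuine regression: two consecutive slow runs -> confirmed.
print(confirmed_regression([1.90, 1.88], baseline_mean=1.22))  # True
```

The trade-off is latency: a real regression is only reported after `persist` runs, which may be acceptable for benchmarks that run multiple times a day.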


jgehrcke commented Jan 6, 2023

> I was talking to Austin about this the other day and wrote up (some of) what we talked about in conbench/conbench#572 which is a large(r) project, but one I think would be good to spec out and see how much it would take to add.

Cool, replied there!

> that project should not preclude looking at this benchmark code to see if there's something we can do to make it more reliable

Of course, and instead of 'code' I'd more generally say 'method'. Careful thought about what a benchmark is even supposed to measure, and whether it makes sense for that to be affected by disk I/O, is, I suppose, one of the most effective ways to fight instability.

> One question about the disk I/O interference: I thought you crafted this benchmark so that it was using ram disks to prevent something like that from happening (but I will admit I didn't follow super closely while I was out, so I might be wrong about that), which should make disk I/O not super important, yeah?

I did that in 'my' benchmark dataset-serialize. This one here is dataset-selectivity which reads gigabytes of data from disk (and measures how long that takes).
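For context, the RAM-disk approach mentioned for dataset-serialize can be sketched like this. This is an illustration, not the benchmark's actual code; the path `/dev/shm` (a tmpfs mount on most Linux systems) and the payload size are assumptions:

```python
# Illustrative sketch, not the actual dataset-serialize code: stage a
# file on a RAM disk (/dev/shm on most Linux systems) so that reads in
# the measured phase never touch the physical disk. Falls back to the
# regular temp directory where no /dev/shm exists (e.g. macOS).
import os
import tempfile

def stage_on_ramdisk(data: bytes, ramdisk="/dev/shm"):
    """Write `data` to a temp file on the RAM disk and return its path."""
    base = ramdisk if os.path.isdir(ramdisk) else tempfile.gettempdir()
    fd, path = tempfile.mkstemp(dir=base)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return path

path = stage_on_ramdisk(b"x" * 1024)  # 1 KiB of dummy payload
with open(path, "rb") as f:
    assert len(f.read()) == 1024      # measured reads would start here
os.remove(path)
```

For dataset-selectivity this isolation would defeat the purpose, since reading from disk is part of what that benchmark measures.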


jonkeane commented Jan 6, 2023

> I did that in 'my' benchmark dataset-serialize. This one here is dataset-selectivity which reads gigabytes of data from disk (and measures how long that takes).

Ah, right, of course. I got those wires crossed, sorry!
