-
Notifications
You must be signed in to change notification settings - Fork 380
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add eventloop utilization default metric #518
base: master
Are you sure you want to change the base?
feat: add eventloop utilization default metric #518
Conversation
@Eomm, I find you as a contributor of fastify-elu-scaler. If you have some time and interested, PTAL. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
To measure ELU, you need to get the first utilization value, wait for some timeout, get the second utilization value and then count the "difference". The are at least two ways to implement this.
I don't know which way is more acceptable from user perspective. |
Thank you both for the PR and review. @trevnorris you wrote a nice blog post on this (and I think did a lot of the impl?), any recommendation whether we should use |
@zbjornson I'd recommend using a |
@zbjornson Does summary sliding window work for sum and count parameters? It looks like not. How I can count the average value for some time period? |
@zbjornson what's the status on getting this merged? thanks! |
Hey @zbjornson @ivan-tymoshenko is this PR blocked? Anything we can do to help get this through? 🙏🏽 |
I don't remember why it was blocked. I will take a look this week. |
@trevnorris @glensc @SimenB @johnytiago Please take a look if this implementation makes sence and works for you. |
I don't have much experience measuring elu, but I can see in articles/examples that people use a relatively small timeout for measuring elu around 50-100ms. Maybe @mcollina @trevnorris can help here.
The only posible situattion when
The main question for me here is when we should start and stop measuring elu in this case. |
You're right, I tried it out at 100ms and it looks good.
Here's some repro code that causes it to output small values multiple times. It happens when you've got blocking code in a timeout callback, it doesn't happen when it's in an io callback.
|
I understand. If you have a suggestion on how to measure elu in a more accurate way, you are welcome. |
@ivan-tymoshenko is this mostly complete and just need conflicts resolved, or something is still missing? |
@ivan-tymoshenko Just want to make sure it's understood, ELU as measured by Node isn't approximated. I patched libuv so it tracks it down to the system call. The interval length you choose to get the ELU won't change what it returns. It's always precise. |
Why not just go with implementation like this one? The majority of users just need an average value. What will you do with the whole distribution of loop utilization? This looks more like profiling rather then metric. You aren't interested in load spikes when you are running production code (will you scale on each spike? Or what's the use case?). But in case you are — it looks like out of scope of the |
I tried this out, it works, but it tends to give exactly same data as CPU usage because it gets averaged over a too long period. (The same thing happens with the ELL metrics - which isn't a problem because quantile metrics are provided. https://github.com/siimon/prom-client/blob/master/lib/metrics/eventLoopLag.js#L74-L132) The usefulness of the ELU metric is that it works well when measured over a small timespan (like 1s) so you can respond to performance drops very quickly. Here's an example. 15s sum(rate(idleTime)) (its the same) 1s ELU window. This chart paints the correct picture, it shows that the server is actually not very healthy.
I am very interested in small load spikes. |
Close #506