Would it be possible to expose capacity / limit / requests here #41
Hey @lozbrown I like this idea and it seems possible. Let me play around with this and I will follow up. |
Cool, I have this working and being scraped by the Grafana Agent into our Mimir... pretty useful (annoyingly my work won't let me put a screenshot in). However, it's a multi-app EKS cluster that we use for lots of things, and right now I'm working on monitoring one specific app, so having the capacity metrics at the node level doesn't make it easy to select the ones that are relevant to the pods in use. Furthermore, we have requests and limits set at the pod level, e.g.
It would be useful to get these metrics as well, to create an actual percentage usage per pod. |
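(A minimal sketch of the kind of per-pod percentage query such metrics would enable, assuming a usage metric named ephemeral_storage_pod_usage and a proposed ephemeral_storage_pod_limit_bytes sharing pod_name/pod_namespace labels; the names and labels are assumptions, not confirmed exporter output.)
```promql
# Hedged sketch: per-pod ephemeral storage usage as a percentage of its limit.
# Metric and label names are assumptions, not confirmed exporter output.
100 *
  ephemeral_storage_pod_usage
/ on (pod_name, pod_namespace)
  ephemeral_storage_pod_limit_bytes
```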
cAdvisor exports stuff like this, and I also have stuff like this
but half the labels seem to be missing |
Also, a list of exported metrics in the README would be good; the picture is now out of date, I think. |
Hey @lozbrown I will create a pod percentage used metric this weekend. For the request limits, do you want to report on something like this?
|
Hi @jmcgrath207 It's actually not that important to calculate the percentage, since if you have the capacity in the same units as the usage you can easily calculate things like percentages in observability platforms like Grafana. So node capacity with labels for namespace and pod_name gives you that ability (even though that would be somewhat repetitive when many pods exist on a node).
YES! Something like ephemeral_storage_pod_limit_bytes would allow you to understand how much of the resources you've allocated to your pods they are actually using. Combined with the node capacity metrics, this lets you tell how much you can raise those requests and limits before it starts splitting your apps across more nodes (increasing costs) or allocating more than is available on the nodes (making your pods unschedulable). Some background: we run an EKS auto-scaling cluster with many apps; some of the more important ones, like the Trino query engine, have dedicated node groups with one application pod per node. We had issues where spilling to disk for the query engine caused node eviction, which was bad; we configured limits which prevent that, but we also want to ensure we are using our nodes efficiently and not leaving resources unused that we're paying for, without increasing costs. Thanks very much for all your help with this, it's really useful. |
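(A hedged sketch of the kind of Grafana query this would enable: how much of each node's capacity one app's pods are consuming. The metric names, the node_name/pod_namespace labels, and the "trino" namespace value are all assumptions for illustration.)
```promql
# Hedged sketch: share of each node's ephemeral storage used by one app's pods.
# Metric and label names (ephemeral_storage_pod_usage, node_name, pod_namespace)
# and the "trino" namespace are assumptions for illustration.
100 *
  sum by (node_name) (ephemeral_storage_pod_usage{pod_namespace="trino"})
/
  sum by (node_name) (ephemeral_storage_node_capacity)
```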
looks like someone else might have patched in something similar in the past |
@lozbrown Awesome, thanks for going into detail on this and finding that example. I will start working on it. |
I ran some tests on the e2e grow pod with this configuration on kind k8s 1.27.0
I found that limits and requests are not reflected in the output.
I also checked this and couldn't find anything.
However, I did see this, so I know it functionally works.
However, the dead pods are not evicted.
Submitted an upstream issue for this. To get these values, I am going to have to look into scraping the pod manifests. Let me play around with the concept, and I will follow up. |
I've added a percentage metric in the new release. Let me know how it works out. FWIW, I am going to add support for memory-backed ephemeral storage later this week. |
So, after a while using this, it's very useful. Some thoughts: I use this instead, which gives me used percentage, and
I'd still rather have ephemeral_storage_container_limit_bytes as opposed to ephemeral_storage_container_limit_percentage; then I could do something like the ratio sketched below, which would be how over- or under-allocated my nodes are... It's difficult to do that with the percentage, but the percentages can always be worked out if you have the usage and the limits. I'd also still like the requests. |
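(A hedged sketch of that over/under-allocation view, using the limit metric name proposed above together with the node capacity metric discussed in this thread; whether their labels line up is an assumption.)
```promql
# Hedged sketch: cluster-wide ratio of allocated ephemeral-storage limits to node capacity.
# > 1 means limits are overcommitted; well below 1 suggests under-allocation.
sum(ephemeral_storage_container_limit_bytes)
/
sum(ephemeral_storage_node_capacity)
```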
For However, we do want flexibility for all query types. Yes, we can add For |
Not much of a Go coder, but I think that line would need to be adjusted, as currently it's percentage available rather than percentage used, which to me is unintuitive.

So, to state the obvious: requests control where pods can be scheduled. Let's assume all your nodes have 50GB storage, and it's an autoscaling cluster that will spin up additional nodes to host your apps. Let's assume you set the request on both of your containers to 30GB; this would cause you to have each container on a different node. sum(ephemeral_storage_node_capacity) = 100GB, while only 60GB is requested. I'd consider this to be poor node utilization; in this scenario, reducing the request down to, say, 24GB would cut one node out of your cluster and save you money.

In a money-is-no-object world you work out what you need for each container and set both the limit and the request to that value, but in reality your pods may almost never get up to that value, and not at the same time, so it's OK to over-limit pods. E.g., in the above example the pods would probably be OK most of the time, with the slight risk of them using more than 50GB total and running into problems.

You could also view over/under limiting (and therefore the level of the above risk) in your cluster with
or |
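(A hedged sketch of the utilization view from the worked example above. The request-bytes metric name is hypothetical, not something the exporter currently ships; the node capacity metric is the one quoted in the example.)
```promql
# Hedged sketch with a hypothetical request metric:
# node utilization from requests — the worked example gives 60GB / 100GB = 0.6.
sum(ephemeral_storage_container_request_bytes)
/
sum(ephemeral_storage_node_capacity)
```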
I am closing this issue.
I am not convinced that requests are needed, since they're only used for Kubernetes scheduling decisions. I also don't see a case where not having requests would result in a triage event or cost. |
Hi
I've started looking for metrics here after having a pod evicted after using all the ephemeral storage available.
This seems useful, but to create a meaningful graph in Grafana it would be best to present that as a percentage of the capacity.
It's hard to know what this means without knowing whether that's likely to be an issue soon.
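(A hedged sketch of the percentage-of-capacity view being asked for, assuming the usage metric and the node capacity metric share a node_name label; the metric and label names are assumptions.)
```promql
# Hedged sketch: per-pod usage as a percentage of the hosting node's capacity.
# Assumes ephemeral_storage_pod_usage and ephemeral_storage_node_capacity
# both carry a node_name label (an assumption about the exporter's labels).
100 *
  ephemeral_storage_pod_usage
/ on (node_name) group_left
  ephemeral_storage_node_capacity
```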