
Would it be possible to expose capacity / limit / requests here #41

Closed
lozbrown opened this issue Nov 24, 2023 · 15 comments

@lozbrown

Hi,

I started looking for metrics here after having a pod evicted for using all of the available ephemeral storage.

This exporter seems useful, but to create a meaningful graph in Grafana it would be best to present usage as a percentage of the capacity.

It's hard to know what the raw usage means without knowing whether it's likely to become an issue soon.

@jmcgrath207
Owner

jmcgrath207 commented Nov 24, 2023

Hey @lozbrown

I like this idea and it seems possible. Let me play around with this and I will follow up.

@jmcgrath207 jmcgrath207 self-assigned this Nov 24, 2023
@jmcgrath207
Owner

@lozbrown

Added this to the new release. Let me know how it works out.

#42

@lozbrown
Author

lozbrown commented Nov 29, 2023

Cool, I have this working and being scraped by Grafana Agent into our Mimir instance. Pretty useful.

(Annoyingly, my work won't let me post a screenshot.)

However, it's a multi-app EKS cluster that we use for lots of things, and right now I'm working on monitoring one specific app, so having the capacity metrics at the node level doesn't make it easy to select the ones that are relevant to the pods in use.
(Maybe it's possible in Grafana to match on the node_name labels in the query.)

Furthermore, we have requests and limits set at the pod level, e.g.:

    limits:
      cpu: 6
      memory: 24Gi
      ephemeral-storage: "4Gi"
    requests:
      cpu: 6
      memory: 24Gi
      ephemeral-storage: "2Gi"

It would be useful to get these as metrics too, to calculate an actual percentage usage per pod.

@lozbrown
Author

cAdvisor exports metrics like container_spec_memory_limit_bytes, but nothing similar for ephemeral storage.

I also have entries like this:

    container_fs_limit_bytes{container="",device="/dev/nvme0n1p12",id="/",image="",name="",namespace="",pod=""} 3.6073472e+07 1701272403268
    container_fs_limit_bytes{container="",device="/dev/nvme0n1p3",id="/",image="",name="",namespace="",pod=""} 2.9394944e+07 1701272403268
    container_fs_limit_bytes{container="",device="/dev/nvme1n1p1",id="/",image="",name="",namespace="",pod=""} 2.1113950208e+10 1701272403268
    container_fs_limit_bytes{container="",device="/dev/root",id="/",image="",name="",namespace="",pod=""} 9.47175424e+08 1701272403268
    container_fs_limit_bytes{container="",device="/dev/shm",id="/",image="",name="",namespace="",pod=""} 1.6542400512e+10 1701272403268
    container_fs_limit_bytes{container="",device="/etc",id="/",image="",name="",namespace="",pod=""} 1.6542400512e+10 1701272403268
    container_fs_limit_bytes{container="",device="/etc/cni",id="/",image="",name="",namespace="",pod=""} 1.6542400512e+10 1701272403268

but half the labels (container, pod, namespace) are empty.

@lozbrown
Author

Also, a list of the exported metrics in the README would be good; the picture there is now out of date, I think.

@jmcgrath207
Owner

jmcgrath207 commented Dec 1, 2023

Hey @lozbrown

I will create a pod percentage used metric this weekend.

For the request limits, do you want to report on something like this?

    resources:
      requests:
        ephemeral-storage: "2Gi"
      limits:
        ephemeral-storage: "4Gi"

https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#setting-requests-and-limits-for-local-ephemeral-storage

@lozbrown
Author

lozbrown commented Dec 1, 2023

Hi @jmcgrath207

It's actually not that important to calculate the percentage: if you have the capacity in the same units as the usage, you can easily calculate things like percentages in observability platforms like Grafana.

So node capacity with labels for namespace and pod_name gives you that ability (even though that would be somewhat repetitive when many pods exist on a node).

For the request limits, do you want to report on something like this?

    resources:
      requests:
        ephemeral-storage: "2Gi"
      limits:
        ephemeral-storage: "4Gi"

https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#setting-requests-and-limits-for-local-ephemeral-storage

YES! something like

ephemeral_storage_pod_limit_bytes
ephemeral_storage_pod_request_bytes

would allow you to understand how much of the resources you've allocated to your pods they are actually using.

Combined with the node capacity metrics, you can also tell how far you can raise those requests and limits before it starts splitting your apps across more nodes (increasing costs) or allocating more than is available on the nodes (making your pods unschedulable).

Some background:

We run an autoscaling EKS cluster with many apps; some of the more important ones, like the Trino query engine, have dedicated node groups with one application pod per node. We had issues where the query engine spilling to disk caused node eviction, which was bad. We configured limits that prevent that, but we also want to ensure we are using our nodes efficiently, not leaving resources unused that we're paying for, while not increasing costs.

Thanks very much for all your help with this, it's really useful.

@lozbrown
Author

lozbrown commented Dec 1, 2023

Looks like someone else might have patched in something similar in the past:
sangheee@6ec363d

@jmcgrath207
Owner

@lozbrown Awesome, thanks for going into detail on this and finding that example. I will start working on it.

@jmcgrath207
Owner

@lozbrown

I ran some tests on the e2e grow pod with this configuration on kind k8s 1.27.0

          resources:
            requests:
              ephemeral-storage: "1Ki"
            limits:
              ephemeral-storage: "5Ki"

I found that limits and requests are not reflected in the output.
Ref: kubernetes/kubernetes#83038

kubectl get --raw "/api/v1/nodes/(your-node-name)/proxy/stats/summary"

I also checked this and couldn't find anything.

kubectl get --raw "/api/v1/nodes/(your-node-name)/proxy/metrics/cadvisor"

However, I did see this, so I know it functionally works.

Events:
  Type     Reason               Age    From               Message
  ----     ------               ----   ----               -------
  Normal   Scheduled            4m52s  default-scheduler  Successfully assigned ephemeral-metrics/grow-test-7484cd87cb-q9ngs to ephemeral-metrics-cluster-worker
  Normal   Pulled               4m53s  kubelet            Container image "local.io/local/grow-test:latest" already present on machine
  Normal   Created              4m53s  kubelet            Created container grow-test
  Normal   Started              4m53s  kubelet            Started container grow-test
  Warning  Evicted              4m51s  kubelet            Pod ephemeral local storage usage exceeds the total limit of containers 5Ki.
  Normal   Killing              4m51s  kubelet            Stopping container grow-test
  Warning  ExceededGracePeriod  4m41s  kubelet            Container runtime did not kill the pod within the specified grace period.

However, the dead pods are not cleaned up:

$ kubectl get pods 
NAMESPACE            NAME                                                              READY   STATUS                   RESTARTS       AGE
ephemeral-metrics    grow-test-7484cd87cb-9fhcb                                        1/1     Running                  0              31s
ephemeral-metrics    grow-test-7484cd87cb-bkxnj                                        0/1     ContainerStatusUnknown   1              95s
ephemeral-metrics    grow-test-7484cd87cb-h48r9                                        0/1     ContainerStatusUnknown   1              63s
ephemeral-metrics    grow-test-7484cd87cb-hh7wm                                        0/1     ContainerStatusUnknown   1              3m11s
ephemeral-metrics    grow-test-7484cd87cb-jn6q9                                        0/1     ContainerStatusUnknown   1              2m39s
ephemeral-metrics    grow-test-7484cd87cb-mtnnz                                        0/1     ContainerStatusUnknown   1              4m21s
ephemeral-metrics    grow-test-7484cd87cb-q9ngs                                        0/1     ContainerStatusUnknown   1              3m43s
ephemeral-metrics    grow-test-7484cd87cb-taxed                                        0/1     ContainerStatusUnknown   1              2m7s

I submitted an upstream issue for this:
kubernetes/kubernetes#122160

To get this, I am going to have to look into scraping the pod manifests. Let me play around with the concept, and I will follow up.
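
Roughly, I'm thinking of something along these lines (an untested sketch, assuming an in-cluster client-go client with RBAC to list pods; the names and structure here are illustrative, not the final implementation):

    package main

    import (
        "context"
        "fmt"

        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/rest"
    )

    func main() {
        // Assumes the exporter runs inside the cluster.
        cfg, err := rest.InClusterConfig()
        if err != nil {
            panic(err)
        }
        client, err := kubernetes.NewForConfig(cfg)
        if err != nil {
            panic(err)
        }

        // List pods in all namespaces and read the ephemeral-storage
        // requests/limits straight from each container's spec.
        pods, err := client.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{})
        if err != nil {
            panic(err)
        }
        for _, pod := range pods.Items {
            for _, c := range pod.Spec.Containers {
                if q, ok := c.Resources.Limits[corev1.ResourceEphemeralStorage]; ok {
                    fmt.Printf("%s/%s/%s limit=%d bytes\n", pod.Namespace, pod.Name, c.Name, q.Value())
                }
                if q, ok := c.Resources.Requests[corev1.ResourceEphemeralStorage]; ok {
                    fmt.Printf("%s/%s/%s request=%d bytes\n", pod.Namespace, pod.Name, c.Name, q.Value())
                }
            }
        }
    }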

@jmcgrath207
Owner

I've added a percentage metric in the new release. Let me know how it works out.
#47 (comment)

FWIW, I am going to add support for memory-backed ephemeral storage later this week.
#46

@lozbrown
Author

So, after a while of using this, it's very useful.

Some thoughts:

ephemeral_storage_node_percentage seems to be the percentage available, which to me is unintuitive.

I use this instead:

    ((ephemeral_storage_node_capacity{node_name="$node",container=""} - ephemeral_storage_node_available{node_name="$node",container=""}) / ephemeral_storage_node_capacity{node_name="$node",container=""}) * 100

which gives me the used percentage.

In separate queries I add the following:

    ephemeral_storage_node_capacity{node_name="$node",container=""} - ephemeral_storage_node_available{node_name="$node",container=""}

and

    ephemeral_storage_node_capacity{node_name="$node",container=""}

This gives me a clear view of the usage of my nodes.

I'd still rather have ephemeral_storage_container_limit_bytes as opposed to ephemeral_storage_container_limit_percentage.

Then I could do something like:

    (sum(ephemeral_storage_container_limit_bytes{node_name="$node"}) / ephemeral_storage_node_capacity{node_name="$node",container=""}) * 100

which would show how over- or under-allocated my nodes are. It's difficult to do that with the percentage, but the percentages can always be worked out if you have the usage and the limits.

I'd also still like the requests

@jmcgrath207
Owner

For ephemeral_storage_node_percentage I am struggling to see how those queries you mentioned are different than this action.
https://github.com/jmcgrath207/k8s-ephemeral-storage-metrics/blob/master/main.go#L352

However, we do want flexibility for all query types.

Yes, we can add ephemeral_storage_container_limit_bytes.

For ephemeral_storage_container_request_bytes, can you expand on your use case? I am struggling to see a case where having requests would be helpful. We would also have to account for containers that don't have the request field and set them to zero.
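
In practice that defaulting would probably look something like this rough sketch (package and helper name are hypothetical):

    package metrics

    import (
        corev1 "k8s.io/api/core/v1"
    )

    // ephemeralStorageRequestBytes returns a container's ephemeral-storage
    // request in bytes, or 0 when no request is set on the container.
    func ephemeralStorageRequestBytes(c corev1.Container) int64 {
        if q, ok := c.Resources.Requests[corev1.ResourceEphemeralStorage]; ok {
            return q.Value()
        }
        return 0
    }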

@lozbrown
Author

For ephemeral_storage_node_percentage I am struggling to see how those queries you mentioned are different than this action.
https://github.com/jmcgrath207/k8s-ephemeral-storage-metrics/blob/master/main.go#L352

I'm not much of a Go coder, but I think that line would need to be:

    percentage := ((capacityBytes - availableBytes) / capacityBytes) * 100.0

As it stands, it's the percentage available rather than the percentage used, which to me is unintuitive.

So, to state the obvious, requests control where a pod can be scheduled. Let's assume all your nodes have 50GB of storage, and it's an autoscaling cluster that will spin up additional nodes to host your apps.

Let's assume you set the request on both of your two containers to 30GB; this would cause each container to land on a different node.

sum(ephemeral_storage_node_capacity) = 100GB
sum(ephemeral_storage_container_request_bytes) = 60GB
(sum(ephemeral_storage_container_request_bytes) / sum(ephemeral_storage_node_capacity)) * 100 = 60%

Let's call that metric node_storage_allocation_percentage.

I'd consider this to be poor node utilization; in this scenario, reducing the request down to, say, 24GB would cut one node out of your cluster and save you money.

In a money-is-no-object world you work out what you need for each container and set both the limit and the request to that value, but in reality your pods may almost never get up to that value, and rarely at the same time, so it's OK to over-limit pods.

E.g. in the above example:
request = 24GB
limit = 30GB

would probably be OK most of the time, with a slight risk of the pods using more than 50GB in total and hitting problems.

You could also view over-/under-limiting (and therefore the level of the above risk) in your cluster:

    (sum(ephemeral_storage_container_limit_bytes) / sum(ephemeral_storage_container_request_bytes)) * 100

or

    sum by (node_name) (ephemeral_storage_container_limit_bytes) / sum by (node_name) (ephemeral_storage_container_request_bytes) * 100

@jmcgrath207
Owner

I am closing this issue.

ephemeral_storage_container_limit_percentage is sufficient. The calculation is correct, and you can filter by node name.
https://github.com/jmcgrath207/k8s-ephemeral-storage-metrics/blob/master/main.go#L352

Percentage = (Value/Total Value) x 100
https://www.dummies.com/article/academics-the-arts/math/basic-math/how-to-calculate-percentages-240018/

I am not convinced that requests are needed, since they are only used for Kubernetes scheduling decisions. ephemeral_storage_pod_usage can be used as a source to tune your requests to your liking in your pod manifest.

I also don't see a case where not having a request metric would result in a triage event or added cost.

2 participants