added prometheus rules to the helm chart (#99)
Signed-off-by: Michal Minář <michal.minar@id.ethz.ch>
miminar authored Jul 8, 2024
1 parent 8e468d3 commit 82d1431
Showing 10 changed files with 274 additions and 9 deletions.
4 changes: 3 additions & 1 deletion Makefile
@@ -4,6 +4,7 @@
GITROOT ?= $(shell pwd)
DEPLOYMENT_NAME = ephemeral-metrics
K8S_VERSION ?= 1.27.0
PROMETHEUS_OPERATOR_VERSION ?= v0.65.1

## Location to install dependencies to
LOCALBIN ?= $(shell pwd)/bin
@@ -34,6 +35,7 @@ test-helm-render:
helm template ./chart -f ./chart/test-values.yaml 1> /dev/null

minikube_new:
PROMETHEUS_OPERATOR_VERSION=$(PROMETHEUS_OPERATOR_VERSION) ./scripts/create-minikube.sh

minikube_scale_up:
@@ -107,4 +109,4 @@ prerelease-github:
rm chart/k8s-ephemeral-storage-metrics-*.tgz

github_login:
gh auth login --web --scopes=read:packages,write:packages
33 changes: 32 additions & 1 deletion README.md
@@ -51,6 +51,8 @@ helm upgrade --install my-deployment k8s-ephemeral-storage-metrics/k8s-ephemeral
| pprof | bool | `false` | Enable Pprof |
| prometheus.enable | bool | `true` | |
| prometheus.release | string | `"kube-prometheus-stack"` | |
| prometheus.rules.enable | bool | `false` | Create PrometheusRules firing alerts when out of ephemeral storage |
| prometheus.rules.predictFilledHours | int | `12` | How many hours in the future to predict filling up of a volume |
| serviceMonitor | object | `{"additionalLabels":{},"metricRelabelings":[],"podTargetLabels":[],"relabelings":[],"targetLabels":[]}` | Configure the Service Monitor |
| serviceMonitor.additionalLabels | object | `{}` | Add labels to the ServiceMonitor.Spec |
| serviceMonitor.metricRelabelings | list | `[]` | Set metricRelabelings as per https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/api.md#monitoring.coreos.com/v1.RelabelConfig |
@@ -59,6 +61,35 @@ helm upgrade --install my-deployment k8s-ephemeral-storage-metrics/k8s-ephemeral
| serviceMonitor.targetLabels | list | `[]` | Set targetLabels as per https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/api.md#monitoring.coreos.com/v1.ServiceMonitorSpec |
| tolerations | list | `[]` | |
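
As an illustration, the two `prometheus.rules.*` values from the table could be set in an override file passed to `helm upgrade --install` with `-f`. This is only a sketch: the file name `my-values.yaml` and the 24-hour horizon are assumptions chosen for the example, not recommended settings.

```yaml
# my-values.yaml (hypothetical override file name)
prometheus:
  enable: true
  rules:
    # Create the PrometheusRule alerts (off by default)
    enable: true
    # Predict volume fill-up 24 hours ahead instead of the default 12
    predictFilledHours: 24
```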

## Prometheus alert rules

To prevent multiple kinds of alerts from firing for a single container or
emptyDir volume when both `prometheus.enable` and `prometheus.rules.enable` are
on, add the following [inhibition
rules](https://prometheus.io/docs/alerting/latest/configuration/#inhibition-related-settings)
to your Alertmanager config:

```yaml
- source_matchers:
    - alertname="EphemeralStorageVolumeFilledUp"
  target_matchers:
    - severity="warning"
    - alertname="EphemeralStorageVolumeFillingUp"
  equal:
    - pod_namespace
    - pod_name
    - volume_name
- source_matchers:
    - alertname="ContainerEphemeralStorageUsageAtLimit"
  target_matchers:
    - severity="warning"
    - alertname="ContainerEphemeralStorageUsageReachingLimit"
  equal:
    - pod_namespace
    - pod_name
    - exported_container
```
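
For context, Alertmanager reads inhibition rules from the top-level `inhibit_rules` key of its configuration. Here is a minimal sketch of where the first rule above would sit. The file name `alertmanager.yml` and the placeholder receiver `default` are assumptions for illustration, not part of this chart. With kube-prometheus-stack, this configuration is usually supplied through that chart's `alertmanager.config` value rather than a standalone file.

```yaml
# alertmanager.yml (hypothetical file name; adapt to your deployment)
route:
  receiver: default          # placeholder receiver, assumed for illustration
receivers:
  - name: default
inhibit_rules:
  # While the "filled up" alert fires, mute the matching "filling up"
  # warning for the same pod and volume.
  - source_matchers:
      - alertname="EphemeralStorageVolumeFilledUp"
    target_matchers:
      - severity="warning"
      - alertname="EphemeralStorageVolumeFillingUp"
    equal:
      - pod_namespace
      - pod_name
      - volume_name
```
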
## Contribute
### Start minikube
@@ -90,4 +121,4 @@ Then run a debug against [deployment_test.go](tests/e2e/deployment_test.go)

## License

This project is licensed under the [MIT License](https://opensource.org/licenses/MIT). See the `LICENSE` file for more details.
33 changes: 32 additions & 1 deletion chart/README.md
@@ -34,6 +34,8 @@ helm upgrade --install my-deployment k8s-ephemeral-storage-metrics/k8s-ephemeral
| pprof | bool | `false` | Enable Pprof |
| prometheus.enable | bool | `true` | |
| prometheus.release | string | `"kube-prometheus-stack"` | |
| prometheus.rules.enable | bool | `false` | Create PrometheusRules firing alerts when out of ephemeral storage |
| prometheus.rules.predictFilledHours | int | `12` | How many hours in the future to predict filling up of a volume |
| serviceMonitor | object | `{"additionalLabels":{},"metricRelabelings":[],"podTargetLabels":[],"relabelings":[],"targetLabels":[]}` | Configure the Service Monitor |
| serviceMonitor.additionalLabels | object | `{}` | Add labels to the ServiceMonitor.Spec |
| serviceMonitor.metricRelabelings | list | `[]` | Set metricRelabelings as per https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/api.md#monitoring.coreos.com/v1.RelabelConfig |
@@ -42,6 +44,35 @@ helm upgrade --install my-deployment k8s-ephemeral-storage-metrics/k8s-ephemeral
| serviceMonitor.targetLabels | list | `[]` | Set targetLabels as per https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/api.md#monitoring.coreos.com/v1.ServiceMonitorSpec |
| tolerations | list | `[]` | |

## Prometheus alert rules

To prevent multiple kinds of alerts from firing for a single container or
emptyDir volume when both `prometheus.enable` and `prometheus.rules.enable` are
on, add the following [inhibition
rules](https://prometheus.io/docs/alerting/latest/configuration/#inhibition-related-settings)
to your Alertmanager config:

```yaml
- source_matchers:
    - alertname="EphemeralStorageVolumeFilledUp"
  target_matchers:
    - severity="warning"
    - alertname="EphemeralStorageVolumeFillingUp"
  equal:
    - pod_namespace
    - pod_name
    - volume_name
- source_matchers:
    - alertname="ContainerEphemeralStorageUsageAtLimit"
  target_matchers:
    - severity="warning"
    - alertname="ContainerEphemeralStorageUsageReachingLimit"
  equal:
    - pod_namespace
    - pod_name
    - exported_container
```
## Contribute
### Start minikube
@@ -73,4 +104,4 @@ Then run a debug against [deployment_test.go](tests/e2e/deployment_test.go)

## License

This project is licensed under the [MIT License](https://opensource.org/licenses/MIT). See the `LICENSE` file for more details.
31 changes: 30 additions & 1 deletion chart/README.md.gotmpl
@@ -8,6 +8,35 @@ helm upgrade --install my-deployment k8s-ephemeral-storage-metrics/k8s-ephemeral

{{ template "chart.valuesSection" . }}

## Prometheus alert rules

To prevent multiple kinds of alerts from firing for a single container or
emptyDir volume when both `prometheus.enable` and `prometheus.rules.enable` are
on, add the following [inhibition
rules](https://prometheus.io/docs/alerting/latest/configuration/#inhibition-related-settings)
to your Alertmanager config:

```yaml
- source_matchers:
    - alertname="EphemeralStorageVolumeFilledUp"
  target_matchers:
    - severity="warning"
    - alertname="EphemeralStorageVolumeFillingUp"
  equal:
    - pod_namespace
    - pod_name
    - volume_name
- source_matchers:
    - alertname="ContainerEphemeralStorageUsageAtLimit"
  target_matchers:
    - severity="warning"
    - alertname="ContainerEphemeralStorageUsageReachingLimit"
  equal:
    - pod_namespace
    - pod_name
    - exported_container
```

## Contribute

### Start minikube
@@ -39,4 +68,4 @@ Then run a debug against [deployment_test.go](tests/e2e/deployment_test.go)

## License

This project is licensed under the [MIT License](https://opensource.org/licenses/MIT). See the `LICENSE` file for more details.
125 changes: 125 additions & 0 deletions chart/templates/prometheusrules.yaml
@@ -0,0 +1,125 @@
{{- if .Values.prometheus.enable }}
{{- with $rules := default (dict) .Values.prometheus.rules }}
{{- if $rules.enable | default false }}
{{- with $predictFilledHours := $rules.predictFilledHours | default 12 }}
---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: k8s-ephemeral-storage-container-limits
namespace: "{{ $.Release.Namespace }}"
spec:
groups:
- name: k8s.rules.container_resource
rules:
- alert: ContainerEphemeralStorageUsageAtLimit
annotations:
description: >-
{{ `Ephemeral storage usage of pod/container {{ $labels.pod_name
}}/{{ $labels.exported_container }} in Namespace {{
$labels.pod_namespace }} on Node {{ $labels.node_name }} {{ with
$labels.cluster -}} on Cluster {{ . }} {{- end }} is at {{ $value
}}% of the limit.` }}
summary: Container ephemeral storage usage is at the limit.
expr: |-2
( max by (node_name, pod_namespace, pod_name, exported_container)
(avg_over_time(ephemeral_storage_container_limit_percentage{source="container"}[5m]))
> 85.0)
# ignore pods that haven't been running for some time (e.g. completed jobs)
unless on (pod_namespace, pod_name)
( (label_replace(label_replace(
max_over_time(kube_pod_status_phase{phase="Running"}[2m]),
"pod_namespace", "$1", "namespace", "(.*)"),
"pod_name", "$1", "pod", "(.*)"))
== 0)
for: 1m
labels:
severity: warning

- alert: ContainerEphemeralStorageUsageReachingLimit
annotations:
description: >-
{{ `Based on recent sampling, the ephemeral storage limit of
pod/container {{ $labels.pod_name }}/{{
$labels.exported_container }} in Namespace {{
$labels.pod_namespace }} on Node {{ $labels.node_name }} {{
with $labels.cluster -}} on Cluster {{ . }} {{- end }} is
expected to be reached within ` }}{{ $predictFilledHours }}
{{ ` hours. Currently, {{ $value }}% is used.` }}
summary: Container ephemeral storage usage is reaching the limit.
expr: |-2
( ( max by (node_name, pod_namespace, pod_name, exported_container)
(ephemeral_storage_container_limit_percentage{source="container"})
> 33.3)
and on (pod_namespace, pod_name, exported_container)
( predict_linear( ephemeral_storage_container_limit_percentage{source="container"}[2h]
, {{ $predictFilledHours | float64 }}*3600)
> 99.0)
)
# ignore pods that haven't been running for enough time
unless on (pod_namespace, pod_name)
( (label_replace(label_replace(
min_over_time(kube_pod_status_phase{phase="Running"}[10m]),
"pod_namespace", "$1", "namespace", "(.*)"),
"pod_name", "$1", "pod", "(.*)"))
== 0)
for: 15m
labels:
severity: warning

- alert: EphemeralStorageVolumeFilledUp
annotations:
description: >-
{{ `Ephemeral storage volume "{{ $labels.volume_name }}" of pod
{{ $labels.pod_name }} in Namespace {{ $labels.pod_namespace }}
{{ with $labels.cluster -}} on Cluster {{ . }} {{- end }} is
{{ $value }}% full.` }}
summary: Ephemeral storage volume is filled up.
expr: |-2
( max by (pod_namespace, pod_name, volume_name)
(avg_over_time(ephemeral_storage_container_volume_limit_percentage[5m]))
> 85.0)
# ignore pods that haven't been running for some time (e.g. completed jobs)
unless on (pod_namespace, pod_name)
( (label_replace(label_replace(
max_over_time(kube_pod_status_phase{phase="Running"}[2m]),
"pod_namespace", "$1", "namespace", "(.*)"),
"pod_name", "$1", "pod", "(.*)"))
== 0)
for: 1m
labels:
severity: warning

- alert: EphemeralStorageVolumeFillingUp
annotations:
description: >-
{{ `Based on recent sampling, the ephemeral storage volume "{{
$labels.volume_name }}" of pod {{ $labels.pod_name }} in
Namespace {{ $labels.pod_namespace }} {{ with $labels.cluster
-}} on Cluster {{ . }} {{- end }} is expected to be filled up
within ` }}{{ $predictFilledHours }}{{ ` hours. Currently, {{
$value }}% is used.` }}
summary: Ephemeral storage volume is filling up.
expr: |-2
( ( max by (pod_namespace, pod_name, volume_name)
(ephemeral_storage_container_volume_limit_percentage)
> 33.3)
and ( max by (pod_namespace, pod_name, volume_name)
(predict_linear( ephemeral_storage_container_volume_limit_percentage[2h]
, {{ $predictFilledHours | float64 }}*3600))
> 99)
)
# ignore pods that haven't been running for enough time
unless on (pod_namespace, pod_name)
( (label_replace(label_replace(
min_over_time(kube_pod_status_phase{phase="Running"}[10m]),
"pod_namespace", "$1", "namespace", "(.*)"),
"pod_name", "$1", "pod", "(.*)"))
== 0)
for: 15m
labels:
severity: warning
{{- end }}
{{- end }}
{{- end }}
{{- end }}
5 changes: 5 additions & 0 deletions chart/values.yaml
@@ -52,6 +52,11 @@ max_node_concurrency: 10
prometheus:
enable: true
release: kube-prometheus-stack
rules:
# -- Create PrometheusRules firing alerts when out of ephemeral storage
enable: false
# -- How many hours in the future to predict filling up of a volume
predictFilledHours: 12

# -- Enable Pprof
pprof: false
10 changes: 8 additions & 2 deletions scripts/create-minikube.sh
@@ -1,5 +1,8 @@
#!/bin/bash

readonly CRD_BASE_URL=https://raw.githubusercontent.com/prometheus-operator/prometheus-operator
: "${PROMETHEUS_OPERATOR_VERSION:=v0.65.1}"

minikube delete || true
c=$(docker ps -q) && [[ $c ]] && docker kill $c
docker network prune -f
@@ -10,5 +13,8 @@ minikube start \
--memory=3900MB
minikube addons enable registry

# Add Service Monitor CRD
kubectl apply --server-side -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.65.1/example/prometheus-operator-crd/monitoring.coreos.com_servicemonitors.yaml
# Deploy Service Monitor and Prometheus Rule CRDs
for crd in monitoring.coreos.com_{servicemonitors,prometheusrules}.yaml; do
kubectl apply --server-side -f \
"$CRD_BASE_URL/${PROMETHEUS_OPERATOR_VERSION}/example/prometheus-operator-crd/$crd" || exit 1
done
10 changes: 8 additions & 2 deletions scripts/create_kind.sh
@@ -1,5 +1,8 @@
#!/bin/bash

readonly CRD_BASE_URL=https://raw.githubusercontent.com/prometheus-operator/prometheus-operator
: "${PROMETHEUS_OPERATOR_VERSION:=v0.65.1}"

function trap_func_kind() {
kind export logs
exit 1
@@ -21,5 +24,8 @@ kubectl config set-context "${DEPLOYMENT_NAME}-cluster"
echo "Kubernetes cluster:"
kubectl get nodes -o wide

# Deploy Service Monitor CRD
kubectl apply --server-side -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.65.1/example/prometheus-operator-crd/monitoring.coreos.com_servicemonitors.yaml
# Deploy Service Monitor and Prometheus Rule CRDs
for crd in monitoring.coreos.com_{servicemonitors,prometheusrules}.yaml; do
kubectl apply --server-side -f \
"$CRD_BASE_URL/${PROMETHEUS_OPERATOR_VERSION}/example/prometheus-operator-crd/$crd" || exit 1
done
1 change: 1 addition & 0 deletions scripts/deploy.sh
@@ -102,6 +102,7 @@ function main() {
"dev.grow.image=${internal_registry}/${grow_repo_image}"
"metrics.adjusted_polling_rate=true"
"pprof=true"
"prometheus.rules.enable=true"
)

if [[ $ENV =~ "e2e" ]]; then
31 changes: 30 additions & 1 deletion tests/charts/many-pods/README.md
@@ -49,6 +49,35 @@ helm upgrade --install my-deployment k8s-ephemeral-storage-metrics/k8s-ephemeral
| volumeMounts | list | `[]` | |
| volumes | list | `[]` | |

## Prometheus alert rules

To prevent multiple kinds of alerts from firing for a single container or
emptyDir volume when both `prometheus.enable` and `prometheus.rules.enable` are
on, add the following [inhibition
rules](https://prometheus.io/docs/alerting/latest/configuration/#inhibition-related-settings)
to your Alertmanager config:

```yaml
- source_matchers:
    - alertname="EphemeralStorageVolumeFilledUp"
  target_matchers:
    - severity="warning"
    - alertname="EphemeralStorageVolumeFillingUp"
  equal:
    - pod_namespace
    - pod_name
    - volume_name
- source_matchers:
    - alertname="ContainerEphemeralStorageUsageAtLimit"
  target_matchers:
    - severity="warning"
    - alertname="ContainerEphemeralStorageUsageReachingLimit"
  equal:
    - pod_namespace
    - pod_name
    - exported_container
```
## Contribute
### Start minikube
@@ -80,4 +109,4 @@ Then run a debug against [deployment_test.go](tests/e2e/deployment_test.go)

## License

This project is licensed under the [MIT License](https://opensource.org/licenses/MIT). See the `LICENSE` file for more details.
