ROX-17906: Create alerts for egress proxy availability #107

Status: Open (1 commit, base: master)
20 changes: 20 additions & 0 deletions resources/prometheus/prometheus-rules.yaml
@@ -97,6 +97,26 @@ spec:
            description: "Scanner container `{{ $labels.pod }}/{{ $labels.container }}` in namespace `{{ $labels.namespace }}` has restarted more than 3 times during the last 30 minutes."
            sop_url: "https://gitlab.cee.redhat.com/stackrox/acs-managed-service-runbooks/blob/master/sops/dp-003-rhacs-instance-unavailable.md"

    - name: rhacs-egress-proxy
      rules:
        - alert: RHACSEgressProxyReplicaCount
          expr: kube_deployment_status_replicas_ready{namespace=~"rhacs-.*",deployment="egress-proxy"} < 3
          for: 20m
          labels:
            severity: critical
          annotations:
            summary: "Egress proxy cannot reach desired replica count (3) in namespace `{{ $labels.namespace }}`."
            description: "During the last 30 minutes, the egress-proxy deployment in namespace `{{ $labels.namespace }}` has not reached three (3) replicas. This alert is raised when at least one replica is continuously marked as not ready for at least 20 minutes."
            sop_url: "https://gitlab.cee.redhat.com/stackrox/acs-managed-service-runbooks/blob/master/sops/dp-003-rhacs-instance-unavailable.md"
        - alert: RHACSEgressProxyContainerFrequentlyRestarting
          expr: increase(kube_pod_container_status_restarts_total{namespace=~"rhacs-.*",container="egress-proxy"}[30m]) > 3
          labels:
            severity: warning
          annotations:
            summary: "Egress proxy container `{{ $labels.pod }}/{{ $labels.container }}` in namespace `{{ $labels.namespace }}` restarted more than 3 times."
            description: "Egress proxy container `{{ $labels.pod }}/{{ $labels.container }}` in namespace `{{ $labels.namespace }}` has restarted more than 3 times during the last 30 minutes."
            sop_url: "https://gitlab.cee.redhat.com/stackrox/acs-managed-service-runbooks/blob/master/sops/dp-003-rhacs-instance-unavailable.md"

    - name: rhacs-fleetshard
      rules:
        - alert: RHACSFleetshardOperatorContainerDown
52 changes: 52 additions & 0 deletions resources/prometheus/unit_tests/RHACSEgressProxy.yaml
@@ -0,0 +1,52 @@
rule_files:
  - /tmp/prometheus-rules-test.yaml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      - series: kube_deployment_status_replicas_ready{namespace="rhacs-aaaabbbbccccddddeeee", deployment="egress-proxy", container="kube-rbac-proxy-main"}
        values: "3x10 2x10 1x20"
    alert_rule_test:
      - eval_time: 10m
        alertname: RHACSEgressProxyReplicaCount
        exp_alerts: []
      - eval_time: 20m
        alertname: RHACSEgressProxyReplicaCount
        exp_alerts: []
      - eval_time: 40m
        alertname: RHACSEgressProxyReplicaCount
        exp_alerts:
          - exp_labels:
              alertname: RHACSEgressProxyReplicaCount
              namespace: rhacs-aaaabbbbccccddddeeee
              deployment: egress-proxy
              # not sure why the observed metrics have container=kube-rbac-proxy-main
Review comment (Contributor):

I think this is because the kube metrics are scraped by OpenShift platform monitoring over a secured endpoint, where the authn/z of /metrics is handled by kube-rbac-proxy.

              container: kube-rbac-proxy-main
              severity: critical
            exp_annotations:
              summary: "Egress proxy cannot reach desired replica count (3) in namespace `rhacs-aaaabbbbccccddddeeee`."
              description: "During the last 30 minutes, the egress-proxy deployment in namespace `rhacs-aaaabbbbccccddddeeee` has not reached three (3) replicas. This alert is raised when at least one replica is continuously marked as not ready for at least 20 minutes."
              sop_url: "https://gitlab.cee.redhat.com/stackrox/acs-managed-service-runbooks/blob/master/sops/dp-003-rhacs-instance-unavailable.md"
  - interval: 1m
    input_series:
      - series: kube_pod_container_status_restarts_total{namespace="rhacs-aaaabbbbccccddddeeee", pod="egress-proxy-1234-5678", container="egress-proxy"}
        values: "0+0x10 1+1x10 10+1x20"
    alert_rule_test:
      - eval_time: 10m
        alertname: RHACSEgressProxyContainerFrequentlyRestarting
        exp_alerts: []
      - eval_time: 30m
        alertname: RHACSEgressProxyContainerFrequentlyRestarting
        exp_alerts:
          - exp_labels:
              alertname: RHACSEgressProxyContainerFrequentlyRestarting
              container: egress-proxy
              namespace: rhacs-aaaabbbbccccddddeeee
              pod: egress-proxy-1234-5678
              severity: warning
            exp_annotations:
              summary: "Egress proxy container `egress-proxy-1234-5678/egress-proxy` in namespace `rhacs-aaaabbbbccccddddeeee` restarted more than 3 times."
              description: "Egress proxy container `egress-proxy-1234-5678/egress-proxy` in namespace `rhacs-aaaabbbbccccddddeeee` has restarted more than 3 times during the last 30 minutes."
              sop_url: "https://gitlab.cee.redhat.com/stackrox/acs-managed-service-runbooks/blob/master/sops/dp-003-rhacs-instance-unavailable.md"
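
Note on the RHACSEgressProxyReplicaCount test timings: the rough Python sketch below is not part of this PR. It only illustrates why the test expects no alert at 10m and 20m but a firing alert at 40m. It assumes one sample and one rule evaluation per minute (matching `interval: 1m` and `evaluation_interval: 1m`), and that promtool's `axn` / `a+bxn` shorthand expands to n+1 samples. The helper names `expand` and `firing_minutes` are hypothetical, and the `for` handling is a simplification of what Prometheus does.

```python
import re


def expand(notation):
    """Expand promtool's 'axn' / 'a+bxn' shorthand into a flat list of samples."""
    samples = []
    for token in notation.split():
        m = re.fullmatch(r"(-?\d+)(?:([+-]\d+))?x(\d+)", token)
        if m:
            start = int(m.group(1))
            step = int(m.group(2) or 0)
            count = int(m.group(3))
            # 'a+bxn' yields n+1 samples: a, a+b, ..., a+n*b
            samples.extend(start + i * step for i in range(count + 1))
        else:
            samples.append(int(token))
    return samples


def firing_minutes(samples, threshold=3, hold=20):
    """Minutes at which `value < threshold` with `for: hold` would be firing.

    Simplified model: the alert goes pending at the first minute the
    condition holds and fires once it has held for `hold` minutes.
    """
    active_at = None
    firing = []
    for minute, value in enumerate(samples):
        if value < threshold:
            if active_at is None:
                active_at = minute  # alert becomes pending here
            if minute - active_at >= hold:
                firing.append(minute)
        else:
            active_at = None  # condition cleared, pending state resets
    return firing


replicas = expand("3x10 2x10 1x20")
firing = firing_minutes(replicas)
print(firing[0])  # first firing minute: 31
```

Under these assumptions the ready-replica count first drops below 3 at minute 11, so the alert is still pending at `eval_time: 20m` and has been firing since minute 31 by `eval_time: 40m`.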