ROX-17469: Implemented SLIs/alerts for Central API latencies #117

Open: wants to merge 1 commit into master
170 changes: 170 additions & 0 deletions resources/prometheus/prometheus-rules.yaml
@@ -454,6 +454,78 @@ spec:
)
record: central:sli:availability:extended_avg_over_time28d

# - Queries the 90th percentile of central's handled GRPC/HTTP API request latencies over the last 10 minutes
#   and records, as a 0/1 value, whether it meets the 100ms (0.1s) target.
- expr: |
(histogram_quantile(0.9, sum by(le, namespace, grpc_service, grpc_method) (rate(grpc_server_handling_seconds_bucket{container="central", grpc_method!~"ScanImageInternal|DeleteImages|EnrichLocalImageInternal|RunReport|ScanImage|TriggerExternalBackup|Ping"}[10m]))) > 0) < bool 0.1
Comment (Contributor): TIL, I didn't know about < bool. We can probably use that elsewhere as well to simplify the PromQL.
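For reference, a minimal illustration of the bool modifier (the up metric and job selector are illustrative, not part of this change): a plain comparison filters the vector and keeps the original sample values, while bool maps every series to 1 or 0.

up{job="central"} > 0.5        # filters: keeps matching series with their original values
up{job="central"} > bool 0.5   # returns 1 or 0 for every series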

Comment (Contributor): The incoming gRPC calls are already very sparse for most Centrals. I think we should consider consolidating them if they roughly do the same thing latency-wise, so I would sum here over grpc_service / grpc_method.
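A hedged sketch of that consolidation, keeping the rule's existing filters but aggregating away grpc_service / grpc_method so all sparse methods share one series per namespace:

(histogram_quantile(0.9, sum by(le, namespace) (rate(grpc_server_handling_seconds_bucket{container="central", grpc_method!~"ScanImageInternal|DeleteImages|EnrichLocalImageInternal|RunReport|ScanImage|TriggerExternalBackup|Ping"}[10m]))) > 0) < bool 0.1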

record: central:grpc_server_handling_seconds:rate10m:p90
- expr: |
(histogram_quantile(0.9, sum by(le, namespace, path) (rate(http_incoming_request_duration_histogram_seconds_bucket{container="central", code!~"5.*|4.*", path!~"/api/extensions/scannerdefinitions|/api/graphql|/sso/|/|/api/cli/download/"}[10m]))) > 0) < bool 0.1
record: central:http_incoming_request_duration_histogram_seconds_bucket:rate10m:p90

# - Queries the current central API latency (GRPC and HTTP) SLI by calculating the ratio of successful
#   instances of central:xxx:rate10m:p90 to its total instances over a given period.
# - To get the current SLI for a variable PERIOD, run the following query, where PERIOD is the desired period in
#   PromQL duration format. This query is useful for determining an SLI dynamically, independent of any SLO.
#
# sum_over_time(central:http_incoming_request_duration_histogram_seconds_bucket:rate10m:p90[PERIOD]) / count_over_time(central:http_incoming_request_duration_histogram_seconds_bucket:rate10m:p90[PERIOD])
#
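# - For example, with PERIOD=7d (an illustrative value, not part of this change):
#
#   sum_over_time(central:http_incoming_request_duration_histogram_seconds_bucket:rate10m:p90[7d]) / count_over_time(central:http_incoming_request_duration_histogram_seconds_bucket:rate10m:p90[7d])
#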
- expr: |
sum_over_time(central:grpc_server_handling_seconds:rate10m:p90[28d]) / count_over_time(central:grpc_server_handling_seconds:rate10m:p90[28d])
record: central:grpc_server_handling_seconds:rate10m:p90:sli
- expr: |
sum_over_time(central:http_incoming_request_duration_histogram_seconds_bucket:rate10m:p90[28d]) / count_over_time(central:http_incoming_request_duration_histogram_seconds_bucket:rate10m:p90[28d])
record: central:http_incoming_request_duration_histogram_seconds_bucket:rate10m:p90:sli

# - Queries the error rate, i.e. the ratio of instances of central:xxx:rate10m:p90 that
#   were equal to 0 to the total instances of central:xxx:rate10m:p90 within a period.
- expr: |
1 - central:grpc_server_handling_seconds:rate10m:p90:sli
Comment (Contributor): Is there a reason you changed the order in the naming compared to existing metrics? E.g. central:grpc_server_handling_seconds:rate10m:p90:sli vs central:sli:grpc_server_handling_seconds:rate10m:p90.

record: central:grpc_server_handling_seconds:rate10m:p90:error_rate28d
- expr: |
1 - central:http_incoming_request_duration_histogram_seconds_bucket:rate10m:p90:sli
record: central:http_incoming_request_duration_histogram_seconds_bucket:rate10m:p90:error_rate28d

# - Queries the error rate for a 1h window.
- expr: |
1 - (sum_over_time(central:grpc_server_handling_seconds:rate10m:p90[1h]) / count_over_time(central:grpc_server_handling_seconds:rate10m:p90[1h]))
record: central:grpc_server_handling_seconds:rate10m:p90:error_rate1h
- expr: |
1 - (sum_over_time(central:http_incoming_request_duration_histogram_seconds_bucket:rate10m:p90[1h]) / count_over_time(central:http_incoming_request_duration_histogram_seconds_bucket:rate10m:p90[1h]))
record: central:http_incoming_request_duration_histogram_seconds_bucket:rate10m:p90:error_rate1h

# - Queries the error budget exhaustion (or consumption) for the whole SLO window (28d).
- expr: |
(1 - central:grpc_server_handling_seconds:rate10m:p90:sli) / 0.01
Comment (Contributor): Can you define a scalar recording rule for the target (0.01 / 0.99)? That allows us to change the value in a single place.
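A minimal sketch of that suggestion (the central:slo:latency_error_budget name is hypothetical, not part of this change):

- expr: "0.01"
  record: central:slo:latency_error_budget
# The exhaustion rules could then reference it via scalar(), keeping the target in one place:
#   (1 - central:grpc_server_handling_seconds:rate10m:p90:sli) / scalar(central:slo:latency_error_budget)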

record: central:grpc_server_handling_seconds:rate10m:p90:error_budget_exhaustion28d
- expr: |
(1 - central:http_incoming_request_duration_histogram_seconds_bucket:rate10m:p90:sli) / 0.01
record: central:http_incoming_request_duration_histogram_seconds_bucket:rate10m:p90:error_budget_exhaustion28d

# - The error budget burn rate (a.k.a. burn rate) is the ratio of central:xxx:rate10m:p90:error_rateyyy
#   to the error budget for a period (e.g. 1h, 1d, etc.).
- expr: |
central:grpc_server_handling_seconds:rate10m:p90:error_rate1h / 0.01
record: central:grpc_server_handling_seconds:rate10m:p90:burn_rate1h
- expr: |
central:http_incoming_request_duration_histogram_seconds_bucket:rate10m:p90:error_rate1h / 0.01
record: central:http_incoming_request_duration_histogram_seconds_bucket:rate10m:p90:burn_rate1h
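To make the threshold concrete, a worked example under the 1% budget (99% SLO) used above:

# - Example: burn_rate1h = error_rate1h / 0.01. A sustained burn rate of 1
#   exhausts the 28d budget in exactly 28 days, a burn rate of 28 exhausts it in
#   one day, and the 0.5 alert threshold used further down corresponds to a 0.5%
#   hourly error rate.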

# - A sample count filter that ignores central:xxx:rate10m:p90 instances that have fewer samples than the expected sample count.
Comment (Contributor): How does the filter treat periods where there is no incoming traffic (and the base metrics are therefore undefined)?

# - The expected count of 10m samples of central:xxx:rate10m:p90 over 28 days (i.e. 28d/10m) is 4032.
# - The expected count of 10m samples of central:xxx:rate10m:p90 over an hour (i.e. 1h/10m) is 6.
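# - (Derivation: 28d contains 28 * 24 * 6 = 4032 ten-minute intervals; 1h contains 6.)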
- expr: |
(count_over_time(central:grpc_server_handling_seconds:rate10m:p90[28d]) >= 4032) / count_over_time(central:grpc_server_handling_seconds:rate10m:p90[28d])
record: central:grpc_server_handling_seconds:rate10m:p90:sample_count_filter28d
- expr: |
(count_over_time(central:http_incoming_request_duration_histogram_seconds_bucket:rate10m:p90[28d]) >= 4032) / count_over_time(central:http_incoming_request_duration_histogram_seconds_bucket:rate10m:p90[28d])
record: central:http_incoming_request_duration_histogram_seconds_bucket:rate10m:p90:sample_count_filter28d
- expr: |
(count_over_time(central:grpc_server_handling_seconds:rate10m:p90[1h]) >= 6) / count_over_time(central:grpc_server_handling_seconds:rate10m:p90[1h])
record: central:grpc_server_handling_seconds:rate10m:p90:sample_count_filter1h
- expr: |
(count_over_time(central:http_incoming_request_duration_histogram_seconds_bucket:rate10m:p90[1h]) >= 6) / count_over_time(central:http_incoming_request_duration_histogram_seconds_bucket:rate10m:p90[1h])
record: central:http_incoming_request_duration_histogram_seconds_bucket:rate10m:p90:sample_count_filter1h

- name: rhacs-central.slo
rules:
# Availability SLO
@@ -533,6 +605,11 @@ spec:
severity: critical
namespace: "{{ $labels.namespace }}"
rhacs_instance_id: "{{ $labels.rhacs_instance_id }}"
rhacs_org_name: "{{ $labels.rhacs_org_name }}"
Comment (Contributor): We can't add these labels here in general, because the values originate in Central, and if Central itself is down, they don't exist.

rhacs_org_id: "{{ $labels.rhacs_org_id }}"
rhacs_cluster_name: "{{ $labels.rhacs_cluster_name }}"
rhacs_environment: "{{ $labels.rhacs_environment }}"

- name: az-resources
rules:
- record: strictly_worker_nodes
@@ -639,3 +716,96 @@ spec:
summary: "There is a high risk of over-committing CPU resources on worker nodes in AZ {{ $labels.availability_zone }}."
description: "During the last 5 minutes, the average CPU limit commitment on worker nodes in AZ {{ $labels.availability_zone }} was {{ $value | humanizePercentage }}. This is above the recommended threshold of 200%."
sop_url: "https://gitlab.cee.redhat.com/stackrox/acs-cloud-service/runbooks/-/blob/master/sops/dp-027-cluster-scale-up.md"

- alert: Central latency error budget exhaustion for GRPC API - 90%
annotations:
message: "Latency error budget exhaustion for central's GRPC API. Current exhaustion: {{ $value | humanizePercentage }}."
expr: |
(central:grpc_server_handling_seconds:rate10m:p90:sample_count_filter28d * central:grpc_server_handling_seconds:rate10m:p90:error_budget_exhaustion28d) >= 0.9
labels:
service: central
namespace: "{{ $labels.namespace }}"
rhacs_instance_id: "{{ $labels.namespace }}"
Comment (Contributor): The instance id is not the same as the namespace. The namespace is rhacs-{instance_id}.
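A hedged sketch of deriving the id at query time with PromQL's label_replace (illustrative only, not part of this change):

label_replace(central:grpc_server_handling_seconds:rate10m:p90, "rhacs_instance_id", "$1", "namespace", "rhacs-(.*)")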

grpc_service: "{{ $labels.grpc_service }}"
grpc_method: "{{ $labels.grpc_method }}"
severity: critical
- alert: Central latency error budget exhaustion for GRPC API - 70%
annotations:
message: "Latency error budget exhaustion for central's GRPC API. Current exhaustion: {{ $value | humanizePercentage }}."
expr: |
(central:grpc_server_handling_seconds:rate10m:p90:sample_count_filter28d * central:grpc_server_handling_seconds:rate10m:p90:error_budget_exhaustion28d) >= 0.7
labels:
service: central
namespace: "{{ $labels.namespace }}"
rhacs_instance_id: "{{ $labels.namespace }}"
grpc_service: "{{ $labels.grpc_service }}"
grpc_method: "{{ $labels.grpc_method }}"
severity: warning
- alert: Central latency error budget exhaustion for GRPC API - 50%
annotations:
message: "Latency error budget exhaustion for central's GRPC API. Current exhaustion: {{ $value | humanizePercentage }}."
expr: |
(central:grpc_server_handling_seconds:rate10m:p90:sample_count_filter28d * central:grpc_server_handling_seconds:rate10m:p90:error_budget_exhaustion28d) >= 0.5
labels:
service: central
namespace: "{{ $labels.namespace }}"
rhacs_instance_id: "{{ $labels.namespace }}"
grpc_service: "{{ $labels.grpc_service }}"
grpc_method: "{{ $labels.grpc_method }}"
severity: warning
- alert: Central latency error budget exhaustion for HTTP API - 90%
annotations:
message: "Latency error budget exhaustion for central's HTTP API. Current exhaustion: {{ $value | humanizePercentage }}."
expr: |
(central:http_incoming_request_duration_histogram_seconds_bucket:rate10m:p90:sample_count_filter28d * central:http_incoming_request_duration_histogram_seconds_bucket:rate10m:p90:error_budget_exhaustion28d) >= 0.9
labels:
service: central
namespace: "{{ $labels.namespace }}"
rhacs_instance_id: "{{ $labels.namespace }}"
path: "{{ $labels.path }}"
severity: critical
- alert: Central latency error budget exhaustion for HTTP API - 70%
annotations:
message: "Latency error budget exhaustion for central's HTTP API. Current exhaustion: {{ $value | humanizePercentage }}."
expr: |
(central:http_incoming_request_duration_histogram_seconds_bucket:rate10m:p90:sample_count_filter28d * central:http_incoming_request_duration_histogram_seconds_bucket:rate10m:p90:error_budget_exhaustion28d) >= 0.7
labels:
service: central
namespace: "{{ $labels.namespace }}"
rhacs_instance_id: "{{ $labels.namespace }}"
path: "{{ $labels.path }}"
severity: warning
- alert: Central latency error budget exhaustion for HTTP API - 50%
annotations:
message: "Latency error budget exhaustion for central's HTTP API. Current exhaustion: {{ $value | humanizePercentage }}."
expr: |
(central:http_incoming_request_duration_histogram_seconds_bucket:rate10m:p90:sample_count_filter28d * central:http_incoming_request_duration_histogram_seconds_bucket:rate10m:p90:error_budget_exhaustion28d) >= 0.5
labels:
service: central
namespace: "{{ $labels.namespace }}"
rhacs_instance_id: "{{ $labels.namespace }}"
path: "{{ $labels.path }}"
severity: warning
- alert: Central latency burn rate for GRPC API
annotations:
message: "Latency burn rate for central's GRPC API. Current burn rate per hour: {{ $value | humanize }}."
expr: |
Comment (stehessel, Contributor, Aug 23, 2023): Why did you choose 0.5 as the burn rate threshold? That seems very low to me. Note that by definition, a burn rate of 1 means that the full error budget will be consumed after 28 days. For slow burns we already have the alerts based on the total consumption. I'd keep this one for high burn alerts.
(central:grpc_server_handling_seconds:rate10m:p90:sample_count_filter1h * central:grpc_server_handling_seconds:rate10m:p90:burn_rate1h) >= 0.5
labels:
service: central
namespace: "{{ $labels.namespace }}"
rhacs_instance_id: "{{ $labels.namespace }}"
grpc_service: "{{ $labels.grpc_service }}"
grpc_method: "{{ $labels.grpc_method }}"
severity: warning
- alert: Central latency burn rate for HTTP API
annotations:
message: "Latency burn rate for central's HTTP API. Current burn rate per hour: {{ $value | humanize }}."
expr: |
(central:http_incoming_request_duration_histogram_seconds_bucket:rate10m:p90:sample_count_filter1h * central:http_incoming_request_duration_histogram_seconds_bucket:rate10m:p90:burn_rate1h) >= 0.5
labels:
service: central
namespace: "{{ $labels.namespace }}"
rhacs_instance_id: "{{ $labels.namespace }}"
path: "{{ $labels.path }}"
severity: warning
134 changes: 134 additions & 0 deletions resources/prometheus/unit_tests/RHACSCentralSLISLO.yaml
@@ -186,3 +186,137 @@ tests:
exp_annotations:
message: "High availability burn rate for central. Current burn rate per hour: 59.17."
sop_url: "https://gitlab.cee.redhat.com/stackrox/acs-managed-service-runbooks/blob/master/sops/dp-018-rhacs-central-slo-alerts.md"

# Test central GRPC/HTTP API latency alerts and rules
- interval: 10m
input_series:
- series: central:grpc_server_handling_seconds:rate10m:p90{namespace="rhacs-abc", grpc_service="grpcsvc", grpc_method="grpcmeth"}
values: 1+0x4000
- series: central:http_incoming_request_duration_histogram_seconds_bucket:rate10m:p90{namespace="rhacs-abc", grpc_service="grpcsvc", grpc_method="grpcmeth"}
values: 1+0x4000
alert_rule_test:
# Ensure alert for a 28d window doesn't fire if there aren't enough SLI samples.
- eval_time: 28d
alertname: Central latency error budget exhaustion for GRPC API - 90%
exp_alerts: []
# Ensure alert for a 28d window doesn't fire if there aren't enough SLI samples.
- eval_time: 28d
alertname: Central latency error budget exhaustion for HTTP API - 90%
exp_alerts: []
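Why these stay silent, in promtool's series notation:

# 1+0x4000 expands to 4001 samples (value 1, one per 10m interval), just under
# the 4032 samples the 28d sample-count filter requires, so the exhaustion
# alerts above are suppressed.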
- interval: 10m
input_series:
- series: central:grpc_server_handling_seconds:rate10m:p90{namespace="rhacs-abc", grpc_service="grpcsvc", grpc_method="grpcmeth"}
values: "1+0x3994 0+0x36"
- series: central:http_incoming_request_duration_histogram_seconds_bucket:rate10m:p90{namespace="rhacs-abc", grpc_service="grpcsvc", grpc_method="grpcmeth"}
values: "1+0x3994 0+0x36"
alert_rule_test:
- eval_time: 28d
alertname: Central latency error budget exhaustion for GRPC API - 90%
exp_alerts:
- exp_labels:
alertname: Central latency error budget exhaustion for GRPC API - 90%
namespace: rhacs-abc
rhacs_instance_id: rhacs-abc
service: central
grpc_service: grpcsvc
grpc_method: grpcmeth
severity: critical
exp_annotations:
message: "Latency error budget exhaustion for central's GRPC API. Current exhaustion: 91.77%."
- eval_time: 28d
alertname: Central latency error budget exhaustion for GRPC API - 70%
exp_alerts:
- exp_labels:
alertname: Central latency error budget exhaustion for GRPC API - 70%
namespace: rhacs-abc
rhacs_instance_id: rhacs-abc
service: central
grpc_service: grpcsvc
grpc_method: grpcmeth
severity: warning
exp_annotations:
message: "Latency error budget exhaustion for central's GRPC API. Current exhaustion: 91.77%."
- eval_time: 28d
alertname: Central latency error budget exhaustion for GRPC API - 50%
exp_alerts:
- exp_labels:
alertname: Central latency error budget exhaustion for GRPC API - 50%
namespace: rhacs-abc
rhacs_instance_id: rhacs-abc
service: central
grpc_service: grpcsvc
grpc_method: grpcmeth
severity: warning
exp_annotations:
message: "Latency error budget exhaustion for central's GRPC API. Current exhaustion: 91.77%."
- eval_time: 28d
alertname: Central latency error budget exhaustion for HTTP API - 90%
exp_alerts:
- exp_labels:
alertname: Central latency error budget exhaustion for HTTP API - 90%
namespace: rhacs-abc
rhacs_instance_id: rhacs-abc
service: central
grpc_service: grpcsvc
grpc_method: grpcmeth
severity: critical
exp_annotations:
message: "Latency error budget exhaustion for central's HTTP API. Current exhaustion: 91.77%."
- eval_time: 28d
alertname: Central latency error budget exhaustion for HTTP API - 70%
exp_alerts:
- exp_labels:
alertname: Central latency error budget exhaustion for HTTP API - 70%
namespace: rhacs-abc
rhacs_instance_id: rhacs-abc
service: central
grpc_service: grpcsvc
grpc_method: grpcmeth
severity: warning
exp_annotations:
message: "Latency error budget exhaustion for central's HTTP API. Current exhaustion: 91.77%."
- eval_time: 28d
alertname: Central latency error budget exhaustion for HTTP API - 50%
exp_alerts:
- exp_labels:
alertname: Central latency error budget exhaustion for HTTP API - 50%
namespace: rhacs-abc
rhacs_instance_id: rhacs-abc
service: central
grpc_service: grpcsvc
grpc_method: grpcmeth
severity: warning
exp_annotations:
message: "Latency error budget exhaustion for central's HTTP API. Current exhaustion: 91.77%."
- interval: 10m
input_series:
- series: central:grpc_server_handling_seconds:rate10m:p90{namespace="rhacs-abc", grpc_service="grpc_service", grpc_method="grpc_method"}
values: "1+0x2 0+0x2"
- series: central:http_incoming_request_duration_histogram_seconds_bucket:rate10m:p90{namespace="rhacs-abc", path="path"}
values: "1+0x2 0+0x2"
alert_rule_test:
- eval_time: 1h
alertname: Central latency burn rate for GRPC API
exp_alerts:
- exp_labels:
alertname: Central latency burn rate for GRPC API
namespace: rhacs-abc
rhacs_instance_id: rhacs-abc
service: central
grpc_service: grpc_service
grpc_method: grpc_method
severity: warning
exp_annotations:
message: "Latency burn rate for central's GRPC API. Current burn rate per hour: 50."
- eval_time: 1h
alertname: Central latency burn rate for HTTP API
exp_alerts:
- exp_labels:
alertname: Central latency burn rate for HTTP API
namespace: rhacs-abc
rhacs_instance_id: rhacs-abc
service: central
path: path
severity: warning
exp_annotations:
message: "Latency burn rate for central's HTTP API. Current burn rate per hour: 50."