ROX-18766: Central Instance Limit Alert #278

aaa5kameric · 2024-08-28T10:08:35Z

Description

Every dataplane cluster has a central_instance_limit configured via app-interface. These limits are enforced by ACS Fleet Manager.

The problem: if the limit has been reached for a specific dataplane cluster, the creation of new central tenants is no longer possible on the cluster.

For details, see the central_instance_limit values under CLUSTER_LIST for each environment in the saas.yaml file.

Solution: This PR adds alerting for central instance capacity. It adds two alerts to the rhacs-observability-resources repo that fire based on how many instances are left on a particular cluster:

Fewer than 10 instances left --> Warning alert: `CentralInstanceLimitWarningCapacity`
Fewer than 2 instances left --> Critical alert: `CentralInstanceLimitCriticalCapacity`

To calculate the number of remaining instances, the alert subtracts acs_fleet_manager_cluster_status_capacity_used from acs_fleet_manager_cluster_status_capacity_max. Both metrics are defined in acs-fleet-manager pkg/metrics/metrics.go:

`clusterStatusCapacityMaxMetric`: number of allowed instances per region and instance type
`clusterStatusCapacityUsedMetric`: number of existing instances per region and instance type
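
The two thresholds described above could be expressed roughly as follows. This is a minimal sketch, not the rules from this PR: the `for` duration, the `severity` label values, and the annotation text are assumptions, and the subtraction assumes both metrics carry identical label sets so that Prometheus can match them one to one.

```yaml
groups:
  - name: rhacs-fleetshard
    rules:
      # Remaining capacity = allowed instances minus existing instances.
      - alert: CentralInstanceLimitWarningCapacity
        expr: |
          (acs_fleet_manager_cluster_status_capacity_max
            - acs_fleet_manager_cluster_status_capacity_used) < 10
        for: 5m                # assumed duration
        labels:
          severity: warning    # assumed label value
        annotations:
          summary: "Fewer than 10 central instances left"   # illustrative text
      - alert: CentralInstanceLimitCriticalCapacity
        expr: |
          (acs_fleet_manager_cluster_status_capacity_max
            - acs_fleet_manager_cluster_status_capacity_used) < 2
        for: 5m                # assumed duration
        labels:
          severity: critical   # assumed label value
        annotations:
          summary: "Fewer than 2 central instances left"    # illustrative text
```

With expressions written this way, the warning alert keeps firing while the critical alert is active, because any value below 2 is also below 10.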

Jira Ticket: https://issues.redhat.com/browse/ROX-18766

Testing

Tests for both alert rules, which are defined in prometheus-rules.yaml under name: rhacs-fleetshard, are in CentralInstanceLimit.yaml.
To run them, run the script test-prom-rules.sh for CentralInstanceLimit.yaml.
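
For orientation, a promtool unit test for these alerts might look like the sketch below. It is not the contents of CentralInstanceLimit.yaml: the rule file path, the `cluster_id` label, the sample values, and the `severity` label are all hypothetical, and if the real rules set annotations, matching `exp_annotations` would need to be added.

```yaml
rule_files:
  - prometheus-rules-test.yaml   # hypothetical path to the extracted rule groups

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # 100 allowed and 95 used leaves 5 remaining instances for 10 minutes.
      - series: 'acs_fleet_manager_cluster_status_capacity_max{cluster_id="test"}'
        values: '100x10'
      - series: 'acs_fleet_manager_cluster_status_capacity_used{cluster_id="test"}'
        values: '95x10'
    alert_rule_test:
      # 5 remaining is below 10, so the warning alert is expected to fire.
      - eval_time: 10m
        alertname: CentralInstanceLimitWarningCapacity
        exp_alerts:
          - exp_labels:
              severity: warning    # assumed label value
              cluster_id: test     # hypothetical label
      # 5 remaining is not below 2, so no critical alert is expected.
      - eval_time: 10m
        alertname: CentralInstanceLimitCriticalCapacity
```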

stehessel

Thank you for the succinct PR description, it is easy to understand 👍 . The changes themselves look good to me, but unfortunately there is a critical issue.

The metric acs_fleet_manager_cluster_status_capacity_max is scraped from fleet-manager, which is hosted in the control plane cluster (to be precise, it is https://visual-app-interface.devshift.net/clusters#/openshift/app-sre-prod-04/cluster.yml). So we need to define the alert in the control plane Prometheus instance. This repository is for the data plane Prometheus, which runs on a different cluster and therefore has no access to the fleet-manager metrics. You can verify this by querying the metrics directly in https://obs-prometheus-rhacs-observability.apps.acs-int-us-01.isbr.p1.openshiftapps.com/graph?g0.expr=&g0.tab=1&g0.stacked=0&g0.show_exemplars=0&g0.range_input=1h.

resources/prometheus/prometheus-rules.yaml

minor changes for consistency

4dd657b

aaa5kameric requested a review from a team as a code owner August 28, 2024 10:08

aaa5kameric requested a review from ludydoo August 28, 2024 10:09

stehessel requested changes Aug 29, 2024


stehessel reviewed Aug 29, 2024


aaa5kameric closed this Sep 2, 2024
