Fix broken prometheusRules, add missing Runbook for alert
Signed-off-by: Nicolas Bigler <nicolas.bigler@vshn.ch>
TheBigLee committed Nov 17, 2023
1 parent 71f25eb commit b725c98
Showing 6 changed files with 27 additions and 15 deletions.
10 changes: 7 additions & 3 deletions component/component/common.libsonnet
@@ -208,10 +208,10 @@ local generatePrometheusNonSLORules(serviceName, memoryContainerName, additional
spec: {
groups: [
{
-  name: '%s-general-alerts' % serviceNameLower,
+  name: '%s-storage' % serviceNameLower,
rules: [
{
-  name: '%s-storage' % serviceNameLower,
+
alert: serviceName + 'PersistentVolumeFillingUp',
annotations: {
description: 'The volume claimed by the instance {{ $labels.name }} in namespace {{ $labels.label_appcat_vshn_io_claim_namespace }} is only {{ $value | humanizePercentage }} free.',
@@ -238,9 +238,13 @@ local generatePrometheusNonSLORules(serviceName, memoryContainerName, additional
severity: 'warning',
},
},
+  ],
+  },
+  {
+  name: std.asciiLower(serviceName) + '-memory',
+  rules: [
{
alert: serviceName + 'MemoryCritical',
-  name: std.asciiLower(serviceName) + '-memory',
annotations: {
description: 'The memory claimed by the instance {{ $labels.name }} in namespace {{ $labels.label_appcat_vshn_io_claim_namespace }} has been over 85% for 2 hours.\n Please reducde the load of this instance, or increase the memory.',
// runbook_url: 'TBD',
3 changes: 1 addition & 2 deletions component/component/vshn_postgres.jsonnet
@@ -774,7 +774,7 @@ local prometheusRule = common.GeneratePrometheusNonSLORules(
alert: 'PostgreSQLConnectionsCritical',
annotations: {
description: 'The number of connections to the instance {{ $labels.name }} in namespace {{ $labels.label_appcat_vshn_io_claim_namespace }} have been over 90% of the configured connections for 2 hours.\n Please reduce the load of this instance.',
-  // runbook_url: 'TBD',
+  runbook_url: 'https://hub.syn.tools/appcat/runbooks/vshn-postgresql.html#PostgreSQLConnectionsCritical',
summary: 'Connection usage critical',
},

@@ -787,7 +787,6 @@ local prometheusRule = common.GeneratePrometheusNonSLORules(
},
],
},
-  // new
{
name: 'postgresql-replication',
rules: [
@@ -960,7 +960,7 @@ spec:
name: postgresql-rules
spec:
groups:
-  - name: postgresql-general-alerts
+  - name: postgresql-storage
rules:
- alert: PostgreSQLPersistentVolumeFillingUp
annotations:
@@ -981,7 +981,6 @@ spec:
labels:
severity: critical
syn_team: schedar
-  name: postgresql-storage
- alert: PostgreSQLPersistentVolumeFillingUp
annotations:
description: Based on recent sampling, the volume claimed
@@ -1003,6 +1002,8 @@ spec:
for: 1h
labels:
severity: warning
+  - name: postgresql-memory
+  rules:
- alert: PostgreSQLMemoryCritical
annotations:
description: |-
@@ -1017,14 +1018,14 @@ spec:
labels:
severity: critical
syn_team: schedar
-  name: postgresql-memory
- name: postgresql-connections
rules:
- alert: PostgreSQLConnectionsCritical
annotations:
description: |-
The number of connections to the instance {{ $labels.name }} in namespace {{ $labels.label_appcat_vshn_io_claim_namespace }} have been over 90% of the configured connections for 2 hours.
Please reduce the load of this instance.
+  runbook_url: https://hub.syn.tools/appcat/runbooks/vshn-postgresql.html#PostgreSQLConnectionsCritical
summary: Connection usage critical
expr: label_replace( topk(1, sum(pg_stat_activity_count) by
(pod, namespace) > 90/100 * sum(pg_settings_max_connections)
@@ -1062,7 +1062,7 @@ spec:
name: postgresql-rules
spec:
groups:
-  - name: postgresql-general-alerts
+  - name: postgresql-storage
rules:
- alert: PostgreSQLPersistentVolumeFillingUp
annotations:
@@ -1083,7 +1083,6 @@ spec:
labels:
severity: critical
syn_team: schedar
-  name: postgresql-storage
- alert: PostgreSQLPersistentVolumeFillingUp
annotations:
description: Based on recent sampling, the volume claimed
@@ -1105,6 +1104,8 @@ spec:
for: 1h
labels:
severity: warning
+  - name: postgresql-memory
+  rules:
- alert: PostgreSQLMemoryCritical
annotations:
description: |-
@@ -1119,14 +1120,14 @@ spec:
labels:
severity: critical
syn_team: schedar
-  name: postgresql-memory
- name: postgresql-connections
rules:
- alert: PostgreSQLConnectionsCritical
annotations:
description: |-
The number of connections to the instance {{ $labels.name }} in namespace {{ $labels.label_appcat_vshn_io_claim_namespace }} have been over 90% of the configured connections for 2 hours.
Please reduce the load of this instance.
+  runbook_url: https://hub.syn.tools/appcat/runbooks/vshn-postgresql.html#PostgreSQLConnectionsCritical
summary: Connection usage critical
expr: label_replace( topk(1, sum(pg_stat_activity_count) by
(pod, namespace) > 90/100 * sum(pg_settings_max_connections)
@@ -97,7 +97,7 @@ spec:
name: redis-rules
spec:
groups:
-  - name: redis-general-alerts
+  - name: redis-storage
rules:
- alert: redisPersistentVolumeFillingUp
annotations:
@@ -118,7 +118,6 @@ spec:
labels:
severity: critical
syn_team: schedar
-  name: redis-storage
- alert: redisPersistentVolumeFillingUp
annotations:
description: Based on recent sampling, the volume claimed
@@ -140,6 +139,8 @@ spec:
for: 1h
labels:
severity: warning
+  - name: redis-memory
+  rules:
- alert: redisMemoryCritical
annotations:
description: |-
@@ -154,7 +155,6 @@ spec:
labels:
severity: critical
syn_team: schedar
-  name: redis-memory
providerConfigRef:
name: kubernetes
name: prometheusrule
9 changes: 8 additions & 1 deletion docs/modules/ROOT/pages/runbooks/vshn-postgresql.adoc
@@ -167,8 +167,15 @@ This alert fires when there are issues with statefullset responsible for replica

```
kubectl describe -n vshn-postgresql-<instance> sts <instance>
-  ## for exmaple: kubectl -n vshn-postgresql-test-cluster-always-true-jnlj4 describe sts test-cluster-always-true-jnlj4
+  ## for example: kubectl -n vshn-postgresql-test-cluster-always-true-jnlj4 describe sts test-cluster-always-true-jnlj4

## get events from affected namespace and look for issues
k -n vshn-postgresql-test-cluster-always-true-jnlj4 get events
```

+  [[PostgreSQLConnectionsCritical]]
+  == PostgreSQLConnectionsCritical
+
+  This alert fires when the used connection is over 90% of the configured `max_connections` limit (defaults to 100).
+  It means that either the connection limit is set too low or an application is misbehaving and spawning too many connections.
+  You either need to raise the `max_connections` parameter on the PostgreSQL instance or debug the application, as it might be misbehaving and spawning too many connections.
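
The threshold described in the new runbook section can be illustrated with a minimal sketch. This is not part of the commit: the function name and its parameters are hypothetical, and the real check is evaluated by Prometheus from `pg_stat_activity_count` and `pg_settings_max_connections`. The sketch only covers the 90% comparison, not the 2-hour duration mentioned in the alert description.

```python
def connections_critical(active: int, max_connections: int = 100,
                         threshold_percent: int = 90) -> bool:
    """Return True when active connections exceed the given share of max_connections.

    Hypothetical helper for illustration only. The default of 100 follows the
    runbook text; integer arithmetic avoids rounding issues at the boundary.
    """
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    # Same comparison as `active > 90/100 * max_connections`, scaled by 100.
    return active * 100 > threshold_percent * max_connections


if __name__ == "__main__":
    for active in (50, 90, 91):
        print(active, connections_critical(active))
```

With the default limit of 100, exactly 90 connections does not meet the condition, because the comparison is strictly greater than 90%.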
