Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add/non slo redis alerts #256

Merged
merged 11 commits into from
Nov 20, 2023
Merged

Add/non slo redis alerts #256

merged 11 commits into from
Nov 20, 2023

Conversation

wejdross
Copy link
Member

@wejdross wejdross commented Nov 3, 2023

Enabling Redis System Metrics, because of:

  • REDIS_EXPORTER_INCL_SYSTEM_METRICS | Whether to include system metrics like total_system_memory_bytes, defaults to false.
  • all rules definitions around Internet use them, so I prefer to reuse them
  • I made nice workaround to those metrics, but it was much longer with bunch of regexes and I dropped that afterwards

Redis Storage was simply copied from PostgreSQL storage

@wejdross wejdross added the enhancement New feature or request label Nov 3, 2023
@wejdross wejdross requested review from a team, Kidswiss, TheBigLee and zugao and removed request for a team November 3, 2023 14:41
@wejdross wejdross requested a review from TheBigLee November 7, 2023 08:54
@Kidswiss
Copy link
Contributor

Kidswiss commented Nov 7, 2023

all rules definitions around Internet use them, so I prefer to reuse them

I get why this could be a reason to go with the specific metrics but here's an idea:

If we use the kube-state-metrics like with PostgreSQL for memory and storage then we can put the generation of those rules into jsonnet functions and just parameterize them. Like what I did for the sloth rules.

Then we won't run into the risk of copy/paste errors, and we know all the metrics are from the same source. If we ever need to adjust them, we can do it in one place.

@wejdross
Copy link
Member Author

wejdross commented Nov 7, 2023

Ok, we could abstract that, but we still need to enable additional metrics for both customers and us, especially if we hit clusters monitoring, I'm flexible with both solutions - @TheBigLee be the tiebreaker :D

Those are rules I was talking about, there are few pages aggregating Prometheus rules and they all reuse those additional metrics, this is why I think it's worth enabling them
https://samber.github.io/awesome-prometheus-alerts/rules.html#redis

@Kidswiss
Copy link
Contributor

Kidswiss commented Nov 7, 2023

Sure, I'm not against enabling the additional metrics, they will be useful for additional alerts and dashboards.

I'm just saying we should use the kube-state-metrics for storage and memory, as they are service agnostic.

@wejdross wejdross force-pushed the Add/non_slo_redis_alerts branch from f218b9b to 51c7b6d Compare November 13, 2023 14:22
@wejdross wejdross requested a review from zugao November 13, 2023 14:22
Copy link
Contributor

@Kidswiss Kidswiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rendered composition looks a bit broken. Also, before merging this test it on the lab and also check if the existing dashboards on insights need adjustments.

component/component/common.libsonnet Outdated Show resolved Hide resolved
component/component/vshn_postgres.jsonnet Show resolved Hide resolved
@wejdross wejdross force-pushed the Add/non_slo_redis_alerts branch from 78075db to 70b68d2 Compare November 14, 2023 10:25
@wejdross wejdross requested a review from Kidswiss November 14, 2023 13:30
@wejdross
Copy link
Member Author

wejdross commented Nov 14, 2023

@Kidswiss @zugao
MR is ready to be roasted :D

it compiles and works on Lab

@wejdross wejdross force-pushed the Add/non_slo_redis_alerts branch from 7d03326 to c566519 Compare November 14, 2023 13:36
@TheBigLee TheBigLee force-pushed the Add/non_slo_redis_alerts branch from c566519 to 26a4794 Compare November 16, 2023 16:41
Copy link
Contributor

@Kidswiss Kidswiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some things look a bit broken in the rendered yaml files.

Also please test it on the lab to see if something in the dashboards would also need fixing.

Copy link
Contributor

@zugao zugao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As Simon already pointed out there are changes in postgres files which should not be there. This PR should not change anything in Postgres, at best only the names.

@zugao zugao self-requested a review November 17, 2023 09:58
Copy link
Contributor

@zugao zugao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes are requested

@TheBigLee TheBigLee force-pushed the Add/non_slo_redis_alerts branch 2 times, most recently from 30d4af9 to b725c98 Compare November 17, 2023 12:25
@TheBigLee
Copy link
Member

As Simon already pointed out there are changes in postgres files which should not be there. This PR should not change anything in Postgres, at best only the names.

This has been fixed now

Signed-off-by: Nicolas Bigler <nicolas.bigler@vshn.ch>
@TheBigLee TheBigLee force-pushed the Add/non_slo_redis_alerts branch from b725c98 to 5adf8b8 Compare November 17, 2023 12:26
@TheBigLee
Copy link
Member

Also please test it on the lab to see if something in the dashboards would also need fixing.

What do you mean? The dashboards are not affected by this, as the dashboards use the metrics and not the alertRules. And nothing changed with the metrics themselves.

@TheBigLee
Copy link
Member

I've refactored the prometheus code a bit and put it into a separate file and also added the missing runbook for the memoryCritical alert.

@TheBigLee TheBigLee requested review from Kidswiss and zugao November 17, 2023 12:58
@TheBigLee TheBigLee force-pushed the Add/non_slo_redis_alerts branch from 1930004 to f40f821 Compare November 17, 2023 13:05
Signed-off-by: Nicolas Bigler <nicolas.bigler@vshn.ch>
@TheBigLee TheBigLee force-pushed the Add/non_slo_redis_alerts branch from f40f821 to 11fcebe Compare November 17, 2023 13:05
Copy link
Contributor

@zugao zugao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I haven't tested it.

@TheBigLee TheBigLee merged commit 79ea63f into master Nov 20, 2023
22 checks passed
@TheBigLee TheBigLee deleted the Add/non_slo_redis_alerts branch November 20, 2023 09:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants