ROX-21530: Certificate Expiry Dashboard #279

Closed
wants to merge 9 commits into from


aaa5kameric
Contributor

@aaa5kameric aaa5kameric commented Sep 5, 2024

Description

Two steps precede this PR.

The first step implemented monitoring for certificate expiration, that is, tracking the expiration dates of digital certificates. The Certificate Monitoring PR (ROX-21530-certificate-monitoring) extracts timestamps from certificates and exposes them as metrics to Prometheus.
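
To illustrate the monitoring step, here is a minimal Python sketch of turning a certificate's expiry time into a Prometheus sample. It is not the code from the monitoring PR: the label names (`namespace`, `secret`) and their values are hypothetical placeholders, and only the metric name is taken from this description. It uses the standard-library `ssl.cert_time_to_seconds` to parse a `notAfter` string.

```python
import ssl

# Metric name as given in this PR description.
METRIC = "acs_fleetshard_certificate_expiration_timestamp"


def expiry_metric_line(not_after: str, labels: dict[str, str]) -> str:
    """Render one Prometheus text-format sample for a certificate's expiry."""
    # Parse the certificate "notAfter" string into Unix seconds (UTC).
    expiry = ssl.cert_time_to_seconds(not_after)
    # Label names are hypothetical; values are assumed to need no escaping.
    rendered = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{METRIC}{{{rendered}}} {expiry}"


line = expiry_metric_line(
    "Jan  5 09:34:43 2018 GMT",
    {"namespace": "rhacs-example", "secret": "example-tls"},
)
print(line)
```

The sample value is the expiry time in seconds since the Unix epoch, which is the form the alert thresholds and the dashboard table both work from.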

The second step was alerting. The Certificate Alerting PR (ROX-21530-certificate-alerting) depends on the monitoring step (ROX-21530-certificate-monitoring), which extracts the timestamps and exposes them as metrics. In the alerting step, we defined Prometheus rules and tests (RHACSFleetschardCertificateExpiring.yaml) with the following thresholds on the time left until a certificate expires:

WARNING: <= 7 days (RHACSFleetshardCertificateExpiringSoon)

CRITICAL: <= 1 day (RHACSFleetshardCertificateExpiringCritical)
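
The two thresholds above can be sketched as a small Python function. This only illustrates the threshold logic; the real alerts are Prometheus rules in the alerting PR, and the function name and return values here are hypothetical.

```python
from typing import Optional

DAY = 24 * 60 * 60           # seconds in one day
WARNING_WINDOW = 7 * DAY     # RHACSFleetshardCertificateExpiringSoon
CRITICAL_WINDOW = 1 * DAY    # RHACSFleetshardCertificateExpiringCritical


def expiry_severity(expiry_ts: float, now: float) -> Optional[str]:
    """Return the alert severity for a certificate, or None if no alert applies."""
    remaining = expiry_ts - now  # seconds left; negative once expired
    # Check the tighter window first so that it takes precedence.
    if remaining <= CRITICAL_WINDOW:
        return "critical"
    if remaining <= WARNING_WINDOW:
        return "warning"
    return None
```

For example, a certificate with three days left maps to `"warning"`, and one with twelve hours left maps to `"critical"`. An already expired certificate also falls into the critical branch in this sketch; whether the actual rules treat it that way is an assumption.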

Lastly, the Certificate Expiry table dashboard was created in Grafana from the Prometheus metric acs_fleetshard_certificate_expiration_timestamp. The table is located in the RHACS Dataplane - Cluster Metrics section.

Jira Ticket: https://issues.redhat.com/browse/ROX-21530

Dashboard Screenshots:

Screenshot from 2024-09-05 11-29-39

Screenshot from 2024-09-05 12-09-17

Screenshot from 2024-09-05 12-09-26

Link to draft dashboard: https://grafana-route-rhacs-observability.apps.acs-int-us-01.isbr.p1.openshiftapps.com/d/adwo5gllc4av4b/rhacs-dataplane-cluster-metrics-amina-copy?orgId=1&editview=links

@aaa5kameric aaa5kameric requested a review from a team as a code owner September 5, 2024 09:41
Contributor

@stehessel stehessel left a comment


Thanks for providing the demo dashboard. A few observations:

  • The table loads way too slowly. I believe the reason is that the metric acs_fleetshard_certificate_expiration_timestamp itself is flawed: it seems to retain all certificates even after their namespaces have been deleted. Combined with the ephemeral instances created by the probe, this creates a very large number of time series. I'd suggest fixing this in fleet-shard sync first, namely by no longer reporting metrics for deleted tenants.
  • The table renders a bit plain for my taste. Some suggestions: column filters, a total count as a footer, centered text, a larger table that spans the entire page width, and a new section to separate it visually from the instance tables.
  • A column that shows now() - timestamp would be nice as well. That way it's easier to understand how much time is left until the expiration is reached.
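
As a rough illustration of the first and third points, the Python sketch below drops samples whose namespace no longer exists and adds a time-left column. The sample layout (`namespace`, `expiry_ts`) and the function name are hypothetical. The review suggests fixing the stale series at the source, so the filter here only shows the intended effect on the table. Time left is computed as the expiry timestamp minus the current time, so it is positive while the certificate is still valid.

```python
from datetime import timedelta


def expiry_rows(samples, active_namespaces, now):
    """Build table rows for live tenants, sorted by soonest expiry."""
    rows = []
    for sample in samples:
        # Skip series that belong to deleted tenants.
        if sample["namespace"] not in active_namespaces:
            continue
        remaining = sample["expiry_ts"] - now  # seconds until expiry
        rows.append({
            "namespace": sample["namespace"],
            "expiry_ts": sample["expiry_ts"],
            "time_left": str(timedelta(seconds=remaining)),
        })
    # Certificates closest to expiry come first.
    rows.sort(key=lambda row: row["expiry_ts"])
    return rows


samples = [
    {"namespace": "rhacs-a", "expiry_ts": 1_000_000 + 3 * 86400},
    {"namespace": "rhacs-deleted", "expiry_ts": 1_000_000 + 86400},
    {"namespace": "rhacs-b", "expiry_ts": 1_000_000 + 3600},
]
rows = expiry_rows(samples, {"rhacs-a", "rhacs-b"}, now=1_000_000)
```

With these inputs, the row for `rhacs-deleted` is dropped and `rhacs-b` is listed before `rhacs-a`.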


2 participants