kvserver: expose a per-store replica cpu histogram metric #138672

Closed
angles-n-daemons opened this issue Jan 8, 2025 · 0 comments
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-observability

Comments

Contributor

angles-n-daemons commented Jan 8, 2025

In the interest of better troubleshooting hotspots, we're aiming to expose more observability around range usage and load.

To be considered complete: for each store, whenever updateReplicationGauges is called, the new metric should be updated with a histogram of the replica cpu values observed on that store.
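A minimal sketch of what that update could look like, assuming a fixed-bucket histogram that is rebuilt from scratch on every call. Every name here (`cpuHistogram`, `newCPUHistogram`, `updateReplicaCPUHistogram`) and the choice of bucket bounds are hypothetical and only illustrate the idea. They are not the kvserver metric API, and the real change may accumulate samples rather than reset them.

```go
package main

import "sort"

// cpuHistogram is a hypothetical fixed-bucket histogram of per-replica CPU
// usage (for example, CPU nanoseconds per second). It is an illustration,
// not the metric type used in kvserver.
type cpuHistogram struct {
	bounds []float64 // inclusive upper bounds, ascending
	counts []uint64  // len(bounds)+1 entries; the last one is the overflow bucket
}

// newCPUHistogram copies and sorts the bounds so that record can binary-search them.
func newCPUHistogram(bounds []float64) *cpuHistogram {
	b := append([]float64(nil), bounds...)
	sort.Float64s(b)
	return &cpuHistogram{bounds: b, counts: make([]uint64, len(b)+1)}
}

// record places v in the first bucket whose upper bound is >= v, or in the
// overflow bucket when v exceeds every bound.
func (h *cpuHistogram) record(v float64) {
	i := sort.SearchFloat64s(h.bounds, v)
	h.counts[i]++
}

// reset clears all bucket counts.
func (h *cpuHistogram) reset() {
	for i := range h.counts {
		h.counts[i] = 0
	}
}

// updateReplicaCPUHistogram stands in for the part of updateReplicationGauges
// that would refresh the metric: it takes the CPU value seen for each replica
// on one store and rebuilds that store's histogram (assumption: snapshot
// semantics, one histogram per store).
func updateReplicaCPUHistogram(h *cpuHistogram, replicaCPU []float64) {
	h.reset()
	for _, v := range replicaCPU {
		h.record(v)
	}
}
```

With one such histogram per store, a single very hot replica shows up as a count in the upper buckets instead of being averaged away, which is the motivation for a histogram over a single maximum.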

Epic: CRDB-43150
Jira issue: CRDB-46300

angles-n-daemons added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Jan 8, 2025
angles-n-daemons changed the title from "kvserver: expose a per-store maximum replica cpu metric" to "kvserver: expose a per-store replica cpu histogram metric" Jan 9, 2025
angles-n-daemons added a commit to angles-n-daemons/cockroach that referenced this issue Jan 16, 2025
One of our goals with adding more hotspot telemetry is to better
understand what's happening to a cluster when it has a hotspot. Today this is
possible in real time, but information is limited when trying to understand
hotspots from the past.

We currently have a log for the hot ranges in a cluster, which can be enabled
to periodically report the hot ranges. To limit the output it runs
infrequently, and is therefore likely to miss temporal (short-lived)
hotspots.

In the replica deciders, there already exists some functionality for
determining when a specific replica is the target of an unbalanced
portion of the system's load. This change allows other parts of the
system (namely the hot range logger) to subscribe to the moment that
tipping point is reached.
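The subscription idea can be sketched roughly as below. This is an illustration under stated assumptions, not the code in the referenced commit: the names `loadNotifier`, `hotReplicaEvent`, `Subscribe`, and `observe` are hypothetical, and the "tipping point" is simplified to "replica CPU exceeds a fixed multiple of the store's mean replica CPU", which may differ from the check the replica deciders actually perform.

```go
package main

import "sync"

// hotReplicaEvent is a hypothetical payload delivered to subscribers when a
// replica crosses the imbalance threshold.
type hotReplicaEvent struct {
	RangeID int64
	CPU     float64 // CPU observed for the replica
	MeanCPU float64 // mean replica CPU on the store at that time
}

// loadNotifier is a hypothetical stand-in for the hook added to the replica
// deciders: other components register callbacks and are invoked when a
// replica is taking an unbalanced share of load.
type loadNotifier struct {
	mu              sync.Mutex
	imbalanceFactor float64 // assumed threshold: CPU > factor * mean
	subscribers     []func(hotReplicaEvent)
}

func newLoadNotifier(imbalanceFactor float64) *loadNotifier {
	return &loadNotifier{imbalanceFactor: imbalanceFactor}
}

// Subscribe registers fn to be called on every future tipping-point event.
func (n *loadNotifier) Subscribe(fn func(hotReplicaEvent)) {
	n.mu.Lock()
	defer n.mu.Unlock()
	n.subscribers = append(n.subscribers, fn)
}

// observe reports whether the replica crossed the threshold and, if so,
// notifies all subscribers. Callbacks run outside the lock so that a slow
// subscriber (such as a logger) cannot block Subscribe calls.
func (n *loadNotifier) observe(rangeID int64, cpu, meanCPU float64) bool {
	if cpu <= n.imbalanceFactor*meanCPU {
		return false
	}
	n.mu.Lock()
	subs := append([]func(hotReplicaEvent){}, n.subscribers...)
	n.mu.Unlock()
	ev := hotReplicaEvent{RangeID: rangeID, CPU: cpu, MeanCPU: meanCPU}
	for _, fn := range subs {
		fn(ev)
	}
	return true
}
```

In this sketch, a hot range logger would call `Subscribe` once at startup and write a log entry from its callback, so a short-lived hotspot is reported when it happens instead of waiting for the next periodic run.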

A follow-up change will link the hot range logger to this new
notification system, so that temporal hotspots can be better examined.

Fixes: cockroachdb#138672
Epic: CRDB-43150

Release note: none