Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add initial metrics #651

Open
wants to merge 3 commits into
base: main
Choose a base branch
from
Open

Conversation

ronensc
Copy link

@ronensc ronensc commented Sep 27, 2023

Issue link

What changes have been made

This is another take on #573 to add initial metrics considering the redesign of codeflare-operator.

Verification steps

I checked that the metrics are available in the codeflare-operator operator by running it:

# in codeflare-operator root
go mod edit -replace github.com/project-codeflare/multi-cluster-app-dispatcher=../multi-cluster-app-dispatcher
make build
NAMESPACE=default go run ./main.go -kubeconfig <path-to-kubeconfig>

and in a new shell:

curl -s localhost:8080/metrics | grep mcad

# HELP mcad_allocatable_capacity_cpu Allocatable CPU Capacity (in milicores)
# TYPE mcad_allocatable_capacity_cpu gauge
mcad_allocatable_capacity_cpu 49665
# HELP mcad_allocatable_capacity_gpu Allocatable GPU Capacity
# TYPE mcad_allocatable_capacity_gpu gauge
mcad_allocatable_capacity_gpu 0
# HELP mcad_allocatable_capacity_memory Allocatable Memory Capacity
# TYPE mcad_allocatable_capacity_memory gauge
mcad_allocatable_capacity_memory 2.07376797696e+11

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • Testing is not required for this change

Add dependency: sigs.k8s.io/controller-runtime v0.14.6

go get sigs.k8s.io/controller-runtime@v0.14.6
go mod tidy
@openshift-ci
Copy link

openshift-ci bot commented Sep 27, 2023

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign sutaakar for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Comment on lines 25 to 43
allocatableCapacityCpu := prometheus.NewGaugeFunc(prometheus.GaugeOpts{
Subsystem: "mcad",
Name: "allocatable_capacity_cpu",
Help: "Allocatable CPU Capacity (in milicores)",
}, func() float64 { return controller.GetAllocatableCapacity().MilliCPU })
metrics.Registry.MustRegister(allocatableCapacityCpu)

allocatableCapacityMemory := prometheus.NewGaugeFunc(prometheus.GaugeOpts{
Subsystem: "mcad",
Name: "allocatable_capacity_memory",
Help: "Allocatable Memory Capacity",
}, func() float64 { return controller.GetAllocatableCapacity().Memory })
metrics.Registry.MustRegister(allocatableCapacityMemory)

allocatableCapacityGpu := prometheus.NewGaugeFunc(prometheus.GaugeOpts{
Subsystem: "mcad",
Name: "allocatable_capacity_gpu",
Help: "Allocatable GPU Capacity",
}, func() float64 { return float64(controller.GetAllocatableCapacity().GPU) })
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The calls to GetAllocatableCapacity() are quite expensive to be executed in GaugeFuncs. Should I call it periodically and cache the values like in the original PR?
https://github.com/project-codeflare/multi-cluster-app-dispatcher/pull/573/files#diff-cd034f141bebdabd3ed7e6118e9581ec4646707beaf8438f164ae51f382f2c1dR69-R77

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe caching the values like in the original PR is the right call.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, listing all the Nodes and all the Pods clearly does not scale at all.

We need to reconsider re-activating / improving the cache, intersecting #588. That may provide a smooth transition once we migrate over to controller-runtime.

Otherwise, we can consider a simple caching mechanism, e.g. that would rate limit the call to allocatableCapacity using something like golang.org/x/time/rate.NewLimiter.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made another commit to schedule the expensive call to run once a minute. I wasn't sure how to use golang.org/x/time/rate.NewLimiter without making it look like an overkill.

@ChristianZaccaria
Copy link
Contributor

ChristianZaccaria commented Sep 27, 2023

Nice one! I just followed the verification steps, it works and shows the CPU, GPU, and memory allocatable capacity 👍. Something to note:

  • I had to change the last bit of this line (on the operator) to v1alpha1 as I was getting the following error after running make build:

github.com/project-codeflare/multi-cluster-app-dispatcher/pkg/apis/quotaplugins/quotasubtree/v1: module github.com/project-codeflare/multi-cluster-app-dispatcher@latest found (v1.35.0, replaced by ../multi-cluster-app-dispatcher), but does not contain package github.com/project-codeflare/multi-cluster-app-dispatcher/pkg/apis/quotaplugins/quotasubtree/v1

Does this need to be changed over in the codeflare-operator?

@ChristianZaccaria
Copy link
Contributor

As commented in the mentioned PR, I'm not sure if this would require an ADR as well. cc: @dimakis @anishasthana @astefanutti

@ChristianZaccaria
Copy link
Contributor

I think we would want to be exposing on port 8083 as done for the original PR

Copy link
Author

@ronensc ronensc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ChristianZaccaria Thanks for reviewing and trying this up! :)

Nice one! I just followed the verification steps, it works and shows the CPU, GPU, and memory allocatable capacity 👍. Something to note:

* I had to change the last bit of [this line (on the operator)](https://github.com/project-codeflare/codeflare-operator/blob/c3383c8e19a8a8c94169c218f1ed8c9faa8e12d5/main.go#L30) to `v1alpha1` as I was getting the following error after running `make build`:

github.com/project-codeflare/multi-cluster-app-dispatcher/pkg/apis/quotaplugins/quotasubtree/v1: module github.com/project-codeflare/multi-cluster-app-dispatcher@latest found (v1.35.0, replaced by ../multi-cluster-app-dispatcher), but does not contain package github.com/project-codeflare/multi-cluster-app-dispatcher/pkg/apis/quotaplugins/quotasubtree/v1

Does this need to be changed over in the codeflare-operator?

Good catch! I missed this bit of change I had to make too. I think this is due to #650 and once there is a new relase of MCAD, this change in the operator is mandatory.

I think we would want to be exposing on port 8083 as done for the original PR

IIUC, after the redesign, the exposing port is the business of the operator. Its default value is set here
https://github.com/project-codeflare/codeflare-operator/blob/99d2fc8b4e3bdcada5259f4417a8c2b2ca342430/main.go#L98
A different value can be set by setting metrics.bindAddress in codeflare-operator-config config map

@eranra
Copy link
Contributor

eranra commented Sep 27, 2023

@ronensc wonderful. I will be happy to close #573 --- tell me when :-)

@astefanutti
Copy link
Contributor

I think we would want to be exposing on port 8083 as done for the original PR

IIUC, after the redesign, the exposing port is the business of the operator. Its default value is set here https://github.com/project-codeflare/codeflare-operator/blob/99d2fc8b4e3bdcada5259f4417a8c2b2ca342430/main.go#L98 A different value can be set by setting metrics.bindAddress in codeflare-operator-config config map

That's correct. MCAD is only responsible for the instrumentation and the registration of the metrics. Serving these metrics is the responsibility of the CodeFlare operator.

@rachelt44
Copy link

@ChristianZaccaria any conclusion on the necessity of an ADR? I started working on it.

@ronensc
Copy link
Author

ronensc commented Nov 6, 2023

@ChristianZaccaria @astefanutti Is there anything I can do to make progress on this PR?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants