Target allocator #261

Merged
merged 20 commits into from
Nov 26, 2024
@okankoAMZ okankoAMZ commented Oct 25, 2024

Enhanced CloudWatch Agent Operator for Kubernetes

Summary

This PR introduces significant improvements to the CloudWatch Agent Operator for Kubernetes, addressing limitations of the previous DaemonSet implementation and improving scalability and efficiency.

Key Changes

  1. Deployment as a StatefulSet
  2. Integration of Target Allocator component
  3. Dynamic sharding of Prometheus targets

Detailed Description

Background

The Amazon CloudWatch Agent allows customers to collect and publish metrics in Prometheus format across various compute environments, including Kubernetes clusters. The CloudWatch Agent Operator simplifies the onboarding process for Prometheus scraping.

Previous Limitations

  • Metric duplication due to multiple agent instances scraping the same endpoints
  • Lack of horizontal scaling capability

New Features

  1. StatefulSet Deployment:

  2. Target Allocator Integration:

    • Watches for Prometheus targets in the cluster
    • Dynamically shards targets across multiple CloudWatch Agent replicas

Benefits:

  • Configurable number of agent replicas
  • Automatic distribution of Prometheus scrape targets
  • Improved efficiency and scalability in metric collection
  • Customizable Prometheus scrape configuration via custom resource
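
The sharding of scrape targets across agent replicas can be pictured with a short Go sketch. It assumes plain hash-modulo assignment purely for illustration; this description does not state which allocation strategy the Target Allocator uses, and `assignTargets` is a hypothetical name, not a function from this PR.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// assignTargets maps each scrape target to one of `replicas` agent replicas
// by hashing the target address. Hypothetical helper for illustration only.
func assignTargets(targets []string, replicas int) map[int][]string {
	shards := make(map[int][]string, replicas)
	if replicas <= 0 {
		return shards
	}
	for _, t := range targets {
		h := fnv.New32a()
		h.Write([]byte(t)) // Write on a hash.Hash never returns an error
		idx := int(h.Sum32() % uint32(replicas))
		shards[idx] = append(shards[idx], t)
	}
	// Sort each shard so the result is stable regardless of input order.
	for idx := range shards {
		sort.Strings(shards[idx])
	}
	return shards
}

func main() {
	targets := []string{"10.0.0.1:9090", "10.0.0.2:9090", "10.0.0.3:8080", "10.0.0.4:8080"}
	for replica, owned := range assignTargets(targets, 2) {
		fmt.Printf("agent-%d scrapes %v\n", replica, owned)
	}
}
```

Because every target hashes to exactly one replica, no endpoint is scraped twice, which is the duplication problem listed under Previous Limitations. Plain modulo assignment reshuffles most targets when the replica count changes; a consistent-hashing scheme would limit that movement.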

Automatic Updates

The operator automatically applies changes to the scrape configuration, updating both the Target Allocator and CloudWatch Agent instances.

Testing

TBA

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

okankoAMZ and others added 17 commits September 11, 2024 10:45
NodeJS merging-in from main
Co-authored-by: Mitali Salvi <44349099+mitali-salvi@users.noreply.github.com>
* Ta https server (#2921)

* Added https server, tests, secret marshalling


---------

Co-authored-by: ItielOlenick <67790309+ItielOlenick@users.noreply.github.com>
* Reconciler now removes un-used managed resources for CWA collector
* Adding support for NodeJS auto instrumentation and integ tests (#220)

* Support configurable resources for NodeJS. (#225)

* Supporting JMX annotations (#240)

* Add support for a supplemental YAML configuration for the CloudWatchAgent (#241)

* Changed naming for OTLP container ports from agent JSON (#252)

* Updated Release Notes for 1.8.0 (#251)

* Adjust EKS add-on integration test service count expectations (#256)

* Add integration tests for JMX. (#250)

* Implemented Target Allocator Container (#214)

* Implemented TargetAllocator resource deployments. (#208)

* Update cmd/amazon-cloudwatch-agent-target-allocator/config/config.go

Co-authored-by: Musa <musaasad@amazon.com>

* Update internal/config/main.go

Co-authored-by: Musa <musaasad@amazon.com>

---------

Co-authored-by: Parampreet Singh <50599809+Paramadon@users.noreply.github.com>
Co-authored-by: Musa <musaasad@amazon.com>
Co-authored-by: Mitali Salvi <44349099+mitali-salvi@users.noreply.github.com>
Co-authored-by: Jeffrey Chien <chienjef@amazon.com>
@okankoAMZ okankoAMZ marked this pull request as ready for review November 12, 2024 20:40
@okankoAMZ okankoAMZ force-pushed the target-allocator branch 3 times, most recently from d59fed3 to a1a0f2c Compare November 13, 2024 16:26
lisguo previously approved these changes Nov 14, 2024
cmd/amazon-cloudwatch-agent-target-allocator/Dockerfile (outdated, resolved)
versions.txt (outdated, resolved)
main.go (resolved)
@@ -9,6 +9,16 @@ func ConfigMap(otelcol string) string {
	return DNSName(Truncate("%s", 63, otelcol))
}

// TAConfigMap returns the name for the config map used in the TargetAllocator.
func TAConfigMap(otelcol string) string {
	return DNSName(Truncate("%s-target-allocator", 63, otelcol))
Contributor

nit: %s-targetallocator to be consistent with upstream

Contributor Author

We are following the cloudwatch-agent format here, but I am open to changing it.


// PrometheusConfigMap returns the name for the prometheus config map.
func PrometheusConfigMap(otelcol string) string {
	return DNSName(Truncate("%s-prometheus-config", 63, otelcol))
Contributor

nit: %s-prometheus is probably good enough to be consistent with the other names, and it's better to keep these short to avoid hitting the max length of 63.
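
To make the length concern concrete, here is a minimal Go sketch of how a truncating name helper might behave. `truncateName` is a hypothetical stand-in written for this example; it assumes the helper trims only the instance name so that the fixed suffix always survives, which is not confirmed by the diff shown here.

```go
package main

import (
	"fmt"
	"strings"
)

// truncateName formats `format` (expected to contain exactly one %s) with
// name, trimming name so the whole result is at most max characters.
// Hypothetical helper; the operator's own Truncate may differ.
func truncateName(format string, max int, name string) string {
	fixed := len(fmt.Sprintf(format, "")) // length of the fixed suffix
	budget := max - fixed                 // characters left for the name
	if budget < 0 {
		budget = 0
	}
	if len(name) > budget {
		name = name[:budget]
	}
	return fmt.Sprintf(format, name)
}

func main() {
	long := strings.Repeat("a", 80)
	for _, format := range []string{"%s-target-allocator", "%s-prometheus-config", "%s-prometheus"} {
		name := truncateName(format, 63, long)
		fmt.Printf("%s: %d characters total\n", format, len(name))
	}
}
```

Under this assumption, `%s-prometheus` leaves 52 characters for the instance name, while `%s-prometheus-config` leaves 45, so the shorter suffix preserves more of a long instance name.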

internal/manifests/manifestutils/labels.go (resolved)
)

// Labels return the common labels to all TargetAllocator objects that are part of a managed AmazonCloudWatchAgent.
func Labels(instance v1alpha1.AmazonCloudWatchAgent, name string) map[string]string {
Contributor

nit: Do we need this?
Can't we just call manifestutils.Labels with a new component name like ComponentAmazonCloudWatchAgentTargetAllocator?

Contributor @musa-asad, Nov 17, 2024

We don't need it, we can remove it and use manifestutils.Labels instead if that's preferable.
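
A rough Go sketch of the shared-helper idea discussed in this thread follows. The function name, constants, and label values are assumptions made for illustration; only the `app.kubernetes.io/*` keys are the standard Kubernetes recommended labels, and the real `manifestutils.Labels` signature may differ.

```go
package main

import "fmt"

// Hypothetical component names; the operator's real constants may differ.
const (
	componentAgent           = "amazon-cloudwatch-agent"
	componentTargetAllocator = "amazon-cloudwatch-agent-target-allocator"
)

// commonLabels builds one label set for any managed component, so the
// Target Allocator does not need its own copy of the function.
func commonLabels(instanceName, namespace, component string) map[string]string {
	return map[string]string{
		"app.kubernetes.io/managed-by": "amazon-cloudwatch-agent-operator", // assumed value
		"app.kubernetes.io/instance":   namespace + "." + instanceName,     // assumed format
		"app.kubernetes.io/part-of":    "amazon-cloudwatch-agent",          // assumed value
		"app.kubernetes.io/component":  component,
	}
}

func main() {
	fmt.Println(commonLabels("cwa", "amazon-cloudwatch", componentAgent))
	fmt.Println(commonLabels("cwa", "amazon-cloudwatch", componentTargetAllocator))
}
```

With a single helper, the two components differ only in the component label, which keeps selectors predictable.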

internal/manifests/targetallocator/volume.go (resolved)
Namespace: owner.Namespace,
LabelSelector: labels.SelectorFromSet(selector),
}
// Define lists for different Kubernetes resources
Contributor

Why just these? Are we only looking at the types of resources currently created for the TA?
The controller can create other resources too, such as HPAs, PDBs, routes, and ingresses.

Contributor

Same question
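
The concern raised in this thread can be illustrated with a small Go sketch of stale-resource detection. It is a simplified model with hypothetical types and names, not the reconciler code from this PR: it only compares what was listed against what is desired, so any resource kind that is never listed can never be reported as stale.

```go
package main

import "fmt"

// resourceKey identifies a managed object. Hypothetical type for illustration.
type resourceKey struct {
	Kind string
	Name string
}

// staleResources returns every listed resource that is no longer desired.
// Kinds absent from `existing` (because they were never listed) are invisible
// to this check, which is the gap the review question points at.
func staleResources(existing, desired []resourceKey) []resourceKey {
	want := make(map[resourceKey]struct{}, len(desired))
	for _, d := range desired {
		want[d] = struct{}{}
	}
	var stale []resourceKey
	for _, e := range existing {
		if _, ok := want[e]; !ok {
			stale = append(stale, e)
		}
	}
	return stale
}

func main() {
	existing := []resourceKey{
		{Kind: "ConfigMap", Name: "cwa-target-allocator"},
		{Kind: "Service", Name: "cwa-target-allocator"},
	}
	desired := []resourceKey{{Kind: "ConfigMap", Name: "cwa-target-allocator"}}
	fmt.Println(staleResources(existing, desired))
}
```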

Contributor @mitali-salvi left a comment

Do we have any integration tests that validate the new component being added?

@@ -87,6 +90,9 @@ func (c CollectorWebhook) defaulter(r *AmazonCloudWatchAgent) error {
	if r.Spec.Replicas == nil {
		r.Spec.Replicas = &one
	}
	if r.Spec.TargetAllocator.Enabled && r.Spec.TargetAllocator.Replicas == nil {
		r.Spec.TargetAllocator.Replicas = &one
Contributor

Wondering if this needs to be documented or made configurable later? This means the TA is not scalable at the moment.

Contributor

Yeah I think this can be an enhancement.
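
For readers following this thread, a minimal Go sketch of the defaulting behavior is shown here. The struct types and `applyDefaults` are simplified stand-ins for the CRD types and webhook in the diff, assumed for illustration. Note that a default only applies when the field is unset, so an explicit replica count in the spec would be preserved by this logic.

```go
package main

import "fmt"

// Simplified stand-ins for the CRD spec types; field names mirror the diff,
// everything else is assumed.
type TargetAllocatorSpec struct {
	Enabled  bool
	Replicas *int32
}

type AgentSpec struct {
	Replicas        *int32
	TargetAllocator TargetAllocatorSpec
}

// applyDefaults fills unset replica counts with 1. Separate variables are
// used so the two fields do not alias the same pointer.
func applyDefaults(spec *AgentSpec) {
	if spec.Replicas == nil {
		one := int32(1)
		spec.Replicas = &one
	}
	if spec.TargetAllocator.Enabled && spec.TargetAllocator.Replicas == nil {
		one := int32(1)
		spec.TargetAllocator.Replicas = &one
	}
}

func main() {
	spec := AgentSpec{TargetAllocator: TargetAllocatorSpec{Enabled: true}}
	applyDefaults(&spec)
	fmt.Println(*spec.Replicas, *spec.TargetAllocator.Replicas)
}
```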

@@ -163,6 +169,32 @@ func (c CollectorWebhook) validate(r *AmazonCloudWatchAgent) (admission.Warnings
		return warnings, fmt.Errorf("the OpenTelemetry Collector mode is set to %s, which does not support the attribute 'AdditionalContainers'", r.Spec.Mode)
	}

	// validate target allocation
	if r.Spec.TargetAllocator.Enabled && r.Spec.Mode != ModeStatefulSet {
		warnings = append(warnings, fmt.Sprintf("The Amazon CloudWatch Agent mode is set to %s, we do not recommend enabling Target Allocator when not running as a StatefulSet", r.Spec.Mode))
Contributor

nit: this could be updated to "The Amazon CloudWatch Agent mode is set to %s, which does not support enabling Target Allocator".

The mode could be Deployment, and the message would still mention StatefulSet, which can be confusing.
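
A small Go sketch of the check with the reworded message from this comment is given here. The `Mode` type, its string values, and `targetAllocatorWarnings` are assumptions for illustration; only the `ModeStatefulSet` identifier and the message text come from this thread.

```go
package main

import "fmt"

// Mode is a simplified stand-in for the CRD's mode type; the string values
// are assumed.
type Mode string

const (
	ModeDaemonSet   Mode = "daemonset"
	ModeDeployment  Mode = "deployment"
	ModeStatefulSet Mode = "statefulset"
)

// targetAllocatorWarnings returns a warning when the Target Allocator is
// enabled in a mode other than StatefulSet, using the suggested wording.
func targetAllocatorWarnings(mode Mode, enabled bool) []string {
	if enabled && mode != ModeStatefulSet {
		return []string{fmt.Sprintf(
			"The Amazon CloudWatch Agent mode is set to %s, which does not support enabling Target Allocator",
			mode,
		)}
	}
	return nil
}

func main() {
	for _, mode := range []Mode{ModeDaemonSet, ModeDeployment, ModeStatefulSet} {
		fmt.Println(mode, targetAllocatorWarnings(mode, true))
	}
}
```

The reworded message names only the mode that was actually configured, so it reads the same way whether that mode is Deployment or DaemonSet.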

Namespace: owner.Namespace,
LabelSelector: labels.SelectorFromSet(selector),
}
// Define lists for different Kubernetes resources
Contributor

Same question

@okankoAMZ okankoAMZ merged commit cf01423 into main Nov 26, 2024
8 checks passed
@okankoAMZ okankoAMZ deleted the target-allocator branch November 26, 2024 21:13
6 participants