Self Service Log Ingestion #3518

Closed
8 tasks done
Tracked by #3515
Rotfuks opened this issue Jun 24, 2024 · 32 comments
Labels
team/atlas Team Atlas

Comments

@Rotfuks
Contributor

Rotfuks commented Jun 24, 2024

Motivation

We want customers to be able to ingest whatever data is relevant to them in a self-service way, and this includes logs. So we need to make sure they have a way to add their own data sources for logs.

Todo

Outcome

  • Customers can now add their own sources of events to be monitored by the observability platform
  • There are docs and educational content out there showing them how it's done
@QuentinBisson

We see that we can use pod logs but do we want to force customers to create pod logs for log ingestion? Can we allow them to collect logs at the namespace level (with annotations and so on)?

@Rotfuks
Contributor Author

Rotfuks commented Jul 22, 2024

How much effort is it to create PodLogs for customers? I would love to have something label-based where we can just say "add this label and it's automatically ingested", because that makes it quite flexible and intuitive. It will also help us with multi-tenancy, I believe.

@QuentinBisson

The issue I have is not that pod logs don't make sense, but I would think they should be used only on really rare occasions. Ideally, an annotation/label on the pod or namespace should be enough to get the tenant for most logs, and that would make profile and trace collection easier. I would only use pod logs if the pod needs a custom pipeline, imo.

What I'm not sure about is whether we can get all logs for a namespace if it's annotated, unless the pod has its own label and unless it's equipped with a pod log?

I would think we could do something with drops but I'm not sure. Maybe @TheoBrigitte knows if log sources can exclude data taken from other sources?

@TheoBrigitte
Member

TheoBrigitte commented Aug 26, 2024

When using Alloy as the logging agent installed within a workload cluster, we can configure it to retrieve logs from specific namespaces and/or pods.

This solution makes use of 2 different PodLogs (with mutual exclusion):

Those PodLogs would be configured by us and customers would only deal with labels on their resources.

With this solution we might face a problem with resource usage on the Kubelet: as all the log traffic would go through the Kubernetes API, the network and CPU usage on the Kubelet might be problematic, especially in cases where many or all pods are monitored.
Alloy does not currently provide another way to select targets based on their namespace. The usual loki.source.file does not suffer from the Kubelet resource usage problem, as logs are retrieved directly from the node where Alloy is running, but it does not allow selecting pods by namespace.

I opened an upstream issue requesting that the namespace metadata be added to the discovery.kubernetes component. This would allow us to avoid using PodLogs and suffering from their overhead.

@QuentinBisson

@TheoBrigitte TheoBrigitte self-assigned this Aug 27, 2024
@TheoBrigitte
Member

Did you take a look at this? https://grafana.com/docs/alloy/latest/reference/components/loki/loki.source.kubernetes/

Looking at it, this would be simpler than the local.file_match currently used in our solution, but I do not see the benefit over loki.source.podlogs: you get rid of the need for PodLogs resources, but you also lose the ability to filter on namespace labels, and you still have the network and CPU overhead on the Kubernetes API server.
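
For reference, a minimal sketch of what the loki.source.kubernetes variant could look like, assuming pods opt in through the pod label foo used elsewhere in this thread. The component labels ("pods", "labeled") are illustrative and not taken from our setup. Pod labels can be used for filtering at the discovery.relabel step, but namespace labels are not exposed there.

```alloy
// Sketch only: component labels are illustrative assumptions.
discovery.kubernetes "pods" {
  role = "pod"
}

discovery.relabel "labeled" {
  targets = discovery.kubernetes.pods.targets

  // Keep only pods that carry a non-empty "foo" label.
  // The labels of the pod's namespace are not available here.
  rule {
    source_labels = ["__meta_kubernetes_pod_label_foo"]
    regex         = ".+"
    action        = "keep"
  }
}

// Tails the selected pods through the Kubernetes API, so the API server
// overhead remains.
loki.source.kubernetes "pods" {
  targets    = discovery.relabel.labeled.output
  forward_to = [loki.write.default.receiver]
}

loki.write "default" {
  endpoint {
    url = "https://loki.svc/loki/api/v1/push"
  }
}
```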

@QuentinBisson

I quite like that we do not have to run it as a daemonset though :D

But why do you not have the namespace? I thought those should give you __meta_kubernetes_namespace in the loki.process or relabel phase?

@QuentinBisson

Oh, you meant namespace labels, never mind.

@TheoBrigitte
Member

Using a combination of the loki.relabel and loki.source.podlogs components, it is possible to set the tenant id based on a given label from the pod or its namespace.

In the following example the tenant id is taken from the pod label foo.

Here is the config and the PodLogs resource I used:

  • Alloy config
loki.source.podlogs "default" {
  forward_to = [loki.relabel.default.receiver]
}

loki.relabel "default" {
  forward_to = [loki.write.default.receiver]

  rule {
    action = "replace"
    source_labels = ["foo"]
    target_label  = "__tenant_id__"
    replacement = "$1"
    regex = "(.*)"
  }

  rule {
    action = "labeldrop"
    regex = "^foo$"
  }
}

loki.write "default" {
  endpoint {
    url = "https://loki.svc/loki/api/v1/push"
  }
}
  • PodLogs resource (note: this selects all pods from all namespaces; change the selectors to fit your needs)
apiVersion: monitoring.grafana.com/v1alpha2
kind: PodLogs
metadata:
  name: pod-tenant-id-from-label
spec:
  selector: {}
  namespaceSelector: {}
  relabelings:
  - action: replace
    sourceLabels: ["__meta_kubernetes_pod_label_foo"]
    targetLabel: "foo"
    replacement: "$1"
    regex: "(.*)"

It is also possible to set the tenant id using the loki.process component, whose tenant stage exists for exactly this purpose, but from there only the log entry content is accessible.
More info at https://grafana.com/docs/alloy/latest/reference/components/loki/loki.process/#stagetenant-block
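
As an illustration of the tenant stage, here is a minimal sketch that assumes log lines are JSON and carry a field named tenant. Both the field name and the component label "tenant_from_content" are hypothetical. The stage reads the tenant from data extracted out of the log entry itself.

```alloy
// Sketch only: the JSON field "tenant" and the component label are assumptions.
loki.process "tenant_from_content" {
  forward_to = [loki.write.default.receiver]

  // Parse the log line as JSON and extract its "tenant" field.
  stage.json {
    expressions = { tenant = "tenant" }
  }

  // Use the extracted value as the tenant ID for this log entry.
  stage.tenant {
    source = "tenant"
  }
}

loki.write "default" {
  endpoint {
    url = "https://loki.svc/loki/api/v1/push"
  }
}
```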

@TheoBrigitte
Member

Current prototype idea

image

Improvements we want to explore:

  • Work around the Kubelet traffic limitation by fetching logs from local disk, using either loki.source.kubernetes or some newer features like join or the logs.alloy module
  • Avoid duplicated targets
  • How to provide access to Alloy components like loki.process

@TheoBrigitte
Member

Using loki.source.kubernetes would only allow selecting pods, not namespaces, by label, as the pod's namespace labels are not exposed in this component. That's why I opened an upstream issue asking to expose them: grafana/alloy#1550

@TheoBrigitte
Member

The potential new join feature would not help in our case, as it would only allow enriching metadata in the loki.relabel component, and there would still be no way to pass the resulting targets into loki.source.file.

@QuentinBisson

QuentinBisson commented Sep 23, 2024

What if you enrich then drop logs, instead of trying to discover only those we should "scrape"?

@TheoBrigitte
Member

What if you enrich then drop logs?

There would still be no way to match the resulting targets against a local file, as the loki.relabel and loki.source.file components cannot be connected.

@TheoBrigitte
Member

logs.alloy also does not help, as it's mainly a wrapper around existing Alloy components.

@QuentinBisson

What if you enrich then drop logs?

There would still be no way to match the resulting targets against a local file, as the loki.relabel and loki.source.file components cannot be connected.

If you join based on the labels extracted from loki.source.file and the ones from the component that does Kubernetes discovery, is that not possible? 🤔 It might be interesting to go to the next community meeting.

@TheoBrigitte
Member

What if you enrich then drop logs?

There would still be no way to match the resulting targets against a local file, as the loki.relabel and loki.source.file components cannot be connected.

If you join based on the labels extracted from loki.source.file and the ones from the component that does Kubernetes discovery, is that not possible? 🤔 It might be interesting to go to the next community meeting.

The namespace metadata is only present when using the loki.source.podlogs component, and this component cannot be chained with loki.source.file. The discovery.kubernetes component does not expose namespace metadata, and the join proposal made upstream would only apply at the loki.relabel stage, so it also cannot be linked to loki.source.file.

loki.source.file is only compatible with components exporting targets: https://grafana.com/docs/alloy/latest/reference/compatibility/#targets-exporters, which in our case means discovery.kubernetes or local.file_match. Therefore we cannot access namespace metadata unless it is exposed by discovery.kubernetes directly.
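
To make the chain concrete, here is a rough sketch of a file-based pipeline, assuming Alloy runs on every node with /var/log/pods mounted. Component labels are illustrative, and the pod label foo is reused from the earlier example. Every input to loki.source.file comes from a targets exporter, and the only Kubernetes metadata available for filtering is what discovery.kubernetes attaches to the pod target.

```alloy
// Sketch only: component labels are illustrative assumptions.
discovery.kubernetes "pods" {
  role = "pod"
}

discovery.relabel "labeled" {
  targets = discovery.kubernetes.pods.targets

  // Pod labels are exposed, so selecting pods by label works.
  // There is no equivalent meta label for the namespace's labels.
  rule {
    source_labels = ["__meta_kubernetes_pod_label_foo"]
    regex         = ".+"
    action        = "keep"
  }

  // Build the on-disk path of the container log files for each target.
  rule {
    source_labels = ["__meta_kubernetes_pod_uid", "__meta_kubernetes_pod_container_name"]
    separator     = "/"
    target_label  = "__path__"
    replacement   = "/var/log/pods/*$1/*.log"
  }
}

// Expands the __path__ globs into concrete files on the local node.
local.file_match "pod_files" {
  path_targets = discovery.relabel.labeled.output
}

// Reads from disk, so no log traffic goes through the Kubernetes API.
loki.source.file "pod_files" {
  targets    = local.file_match.pod_files.targets
  forward_to = [loki.write.default.receiver]
}

loki.write "default" {
  endpoint {
    url = "https://loki.svc/loki/api/v1/push"
  }
}
```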

@TheoBrigitte
Member

We can't load components like loki.process dynamically into Alloy.

The way to load dynamic configuration into Alloy is via modules. A module is described by a declare block which only accepts argument and export blocks, meaning there would be no way to pass any of the stage blocks from loki.process. There is a module import example using loki.process here: https://grafana.com/docs/alloy/latest/get-started/modules/#example.
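
A small sketch of what this limitation means in practice. The module name tenant_logs, its arguments, and the tenant value "team-a" are all hypothetical. The stages are fixed inside the module, and a caller can only pass values in through argument blocks and read values out through export blocks.

```alloy
// Sketch only: the module name, arguments, and tenant value are assumptions.
declare "tenant_logs" {
  argument "tenant_id" {
    optional = false
  }

  argument "forward_to" {
    optional = false
  }

  // The stage blocks live inside the module and cannot be supplied by the caller.
  loki.process "set_tenant" {
    forward_to = argument.forward_to.value

    stage.tenant {
      value = argument.tenant_id.value
    }
  }

  export "receiver" {
    value = loki.process.set_tenant.receiver
  }
}

// Instantiating the module: only plain argument values can be set here.
tenant_logs "team_a" {
  tenant_id  = "team-a"
  forward_to = [loki.write.default.receiver]
}

loki.source.podlogs "default" {
  forward_to = [tenant_logs.team_a.receiver]
}

loki.write "default" {
  endpoint {
    url = "https://loki.svc/loki/api/v1/push"
  }
}
```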

@TheoBrigitte
Member

It is currently not possible to use a Kyverno policy to label the kube-system namespace, as Kyverno lacks the permissions to do so:

$ cat kube-system-logging.cpol.yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: kube-system-logging
spec:
  admission: true
  background: true
  mutateExistingOnPolicyUpdate: true
  rules:
    - name: kube-system-enable-logging
      match:
        resources:
          kinds:
            - Namespace
          name: kube-system
      mutate:
        patchesJson6902: '[{"op":"add","path":"/metadata/labels/giantswarm.io~1logging","value":"enabled"}]'
        targets:
        - kind: Namespace
          name: kube-system

$ k apply -f kube-system-logging.cpol.yaml
Error from server: error when creating "kube-system-logging.cpol.yaml": admission webhook "validate-policy.kyverno.svc" denied the request: path: spec.rules[0].mutate.targets.: auth check fails, additional privileges are required for the service account 'system:serviceaccount:kyverno:kyverno-background-controller': failed to get GVR for kind /Namespace; failed to get GVR for kind /Namespace

@TheoBrigitte
Member

Just linking the Alloy internal __tenant_id__ label used for tenant override: https://github.com/grafana/alloy/blob/8f1be0e86b0ced53e73cb30d228aa736b1380d89/internal/component/common/loki/client/client.go#L35

@TheoBrigitte
Member

TheoBrigitte commented Oct 3, 2024

Load testing story

I used Loki canary to load test the logging pipeline and see how the current setup with Promtail compares with Alloy logs. Logs were only ingested from pods in the kube-system namespace. I looked at the Kubernetes API server pods' resource usage and network traffic.

Loki canary was started at 14:08 UTC, generating approximately 10k log lines per minute.

Kubernetes API server usage stayed the same when using Promtail (which scrapes logs from disk). Usage is ~10 times higher when using Alloy (tailing logs through the API server).

Loki stats

Comparison of data being ingested by Loki

  • Before the test
    Log lines: 45/min
    Bandwidth: 45 kB/s

  • Load test with Promtail
    Log lines: 12k/min
    Bandwidth: 4.2 MB/s

  • Load test with Alloy
    Log lines: ~12k/min
    Bandwidth: ~3 MB/s

Image

Kube API server pods

Comparison of API server pod resources (memory is not relevant and stayed the same).

  • Before the test
    CPU: 0.02
    Bandwidth: 3 MB/s

  • Load test with Promtail
    CPU: 0.02
    Bandwidth: 3 MB/s

  • Load test with Alloy
    CPU: ~0.5 (spikes to ~1)
    Bandwidth: 50 MB/s (spikes to 150)

Image
Image

Node resources usage

Adding this for information, but there is nothing relevant here: resource usage stayed roughly the same.

Image

@TheoBrigitte
Member

Since the last results showed concerning performance when tailing logs through the Kubernetes API server, I experimented with a new solution which fetches logs from files on disk and discovers targets using labels. This solution requires an additional container which writes the list of labeled namespaces directly into Alloy's configuration. Here is a high-level overview of what this solution looks like.

Image

The additional container is a 4-line shell script described here: https://github.com/giantswarm/logging-operator/pull/235/files#diff-405f451506c2146b4cf915863a848d9a42f39a3b292a9b6f9ada78b4eac32598R57-R61

@QuentinBisson

Does this sidecar container work well with clustering?

Also, what do you see as concerning? A bit more CPU looks okay to me, because in your tests you fetch 10k log lines from 6 nodes and not 230 like on bigger clusters, right? So the metrics you get will definitely be higher than the actual usage.

@TheoBrigitte
Member

Does this sidecar container work well with clustering?

I haven't tested this new container with clustering mode, but it should work fine with it.

Also, what do you see as concerning? A bit more CPU looks okay to me, because in your tests you fetch 10k log lines from 6 nodes and not 230 like on bigger clusters, right? So the metrics you get will definitely be higher than the actual usage.

Yeah, maybe in the end the performance is not so bad. Anyway, with this solution there's no way to override the tenant id, so I am going with solution 1 (PodLogs).

@TheoBrigitte
Member

Adding a graph to explain the current implementation

Image

@TheoBrigitte
Member

Here is the source for the graph above: self-service-logging-2024-10-21-1114.excalidraw.gz

@TheoBrigitte
Member

We are good to go here

This will be available from CAPI v30.0.0 releases

Last point: announce this to everyone, including customers. How do we proceed? Do we include this in the v30.0.0 release announcement?

@Rotfuks
Contributor Author

Rotfuks commented Oct 24, 2024

Yeah would be good to have it in the v30.0.0 release announcement. Do we have release notes where we add that?
We can also just push the new feature in news-product - do you want to do that post, or should I?

@TheoBrigitte
Member

I'll do the post and make sure we also have this in the v30.0.0 release announcement.

@TheoBrigitte
Member

TheoBrigitte commented Oct 24, 2024

We can only craft the release announcement when the next releases are being worked on. Tenet will ping when this happens. I added a todo as a reminder here.

@Rotfuks
Contributor Author

Rotfuks commented Nov 26, 2024

Taking the release announcement out of scope for now, to unblock this ticket. The release will be handled separately.

@Rotfuks Rotfuks closed this as completed Nov 26, 2024
@github-project-automation github-project-automation bot moved this from Inbox 📥 to Done ✅ in Roadmap Nov 26, 2024
@Rotfuks Rotfuks removed the blocked label Nov 26, 2024