Self Service Log Ingestion #3518

Rotfuks · 2024-06-24T13:51:11Z

Motivation

We want customers to be able to ingest whatever data is relevant for them in a self service way, this also includes logs. So we need to make sure we have a way how they can add their own data sources for logs.

Todo

investigate how exactly we could empower customers to add their own log sources - for example PodLogs https://github.com/giantswarm/giantswarm/issues/29072
Implement any needed changes to make it happen
Create a documentation draft how customers can add their own sources of events/logs to be monitored add Observability Logs data ingestion docs#2308 Update logging architecture docs#2315
Get some feedback from AE about the documentation
Make sure the documentation is published in the new observability platform docs

Outcome

Customers can now add their own sources of events to be monitored by the observability platform
there is docs and educational content out there showing them how it's done

QuentinBisson · 2024-07-22T09:58:58Z

We see that we can use pod logs but do we want to force customers to create pod logs for log ingestion? Can we allow them to collect logs at the namespace level (with annotations and so on)?

Rotfuks · 2024-07-22T14:21:42Z

How much effort is it to create podlogs for customers? I would love to have some label based stuff where we can just say "add this label and it's automatically ingested" because that makes it quite flexible and intuitive. It will also help us with mutli-tenancy I believe.

QuentinBisson · 2024-07-22T19:13:00Z

The issue I have is not that pod logs don't make sense but I would think they should be used on really rare occasions. Ideally, an annotation/label on the pod or namespace should be enough to get most the tenant for most log and that would make profiles and traces collection easier. I would only use pod logs if the pod needs a custom pipeline imo

What i'm not sure is if we can get alls logs for a namespace if it's annotated unless the pod has it's own label and unless it's equipped with a pod log?

I would think we could do something with drops but I'm not sure. Maybe @TheoBrigitte knows if log sources can exclude data taken from other sources?

TheoBrigitte · 2024-08-26T17:31:42Z

When using Alloy as logging agent installed within a workload cluster, we configure it in a way which would allow to retrieves logs from specific namespaces and/or pods.

This solution makes use of 2 differente PodLogs (with mutual exclusion):

One PodLog to select all pods from namespaces with a specific label podlogs_ns.yaml.txt
One PodLog to select all pods with a specific label podlogs_pod.yaml.txt

Those PodLogs would be configured by us and customers would only deal with labels on their resources.

With this solution we might face a problem with resources usage on the Kubeletes, as all the log traffic would happen via the Kubernetes API the network and CPU usage on Kubelet might be problematic especially in cases where many/all pods would be monitored.
Alloy does not currently provide another way to select targets based on their namespace. The usual loki.source.file does not suffer from Kubelet resources usage problem as logs are directly retrieved from the node where Alloy is running, but it does not allow to select pods by namespace.

I opened an upstream issue requesting to add the namespace metadata within the discovery.kubernetes component, this would allow us to avoid using PodLogs and suffering from their overhead.

QuentinBisson · 2024-08-26T18:45:57Z

Did you take a look at this?https://grafana.com/docs/alloy/latest/reference/components/loki/loki.source.kubernetes/

TheoBrigitte · 2024-08-27T12:58:16Z

Did you take a look at this?https://grafana.com/docs/alloy/latest/reference/components/loki/loki.source.kubernetes/

Looking it, this would be simpler than the currently used local.file_match in our solution, but I also do not see the benefit over loki.source.podlogs, you get rid of the need for PodLogs resources but also loose the capability to filter on namespaces labels and you still have the network and CPU overhead on the Kubernetes API server.

QuentinBisson · 2024-08-27T13:16:16Z

I quite like that we do not have to run it as a daemonset though :D

But why do you not have the namespace ? I thought those should give you __meta_kubernetes_namespace in the loki.process or relabel phase?

QuentinBisson · 2024-08-27T13:18:02Z

Oh you meant namespace labels,nevermind

TheoBrigitte · 2024-08-27T16:27:20Z

Using a combination of loki.relabel and loki.source.podlogs components it is possible to set the tenant id based on a given label from the pod or its namespace.

In the following example the tenant id is taken from the pod label foo.

Here is the config and the PodLog resource I used

Alloy config

loki.source.podlogs "default" {
  forward_to = [loki.relabel.default.receiver]
}

loki.relabel "default" {
  forward_to = [loki.write.default.receiver]

  rule {
    action = "replace"
    source_labels = ["foo"]
    target_label  = "__tenant_id__"
    replacement = "$1"
    regex = "(.*)"
  }

  rule {
    action = "labeldrop"
    regex = "^foo$"
  }
}

loki.write "default" {
  endpoint {
    url = "https://loki.svc/loki/api/v1/push"
  }
}

PodLog (note: this will select all pods from all namespaces, change the selectors to fit your need)

apiVersion: monitoring.grafana.com/v1alpha2
kind: PodLogs
metadata:
  name: pod-tenant-id-from-label
spec:
  selector: {}
  namespaceSelector: {}
  relabelings:
  - action: replace
    sourceLabels: ["__meta_kubernetes_pod_label_foo"]
    targetLabel: "foo"
    replacement: "$1"
    regex: "(.*)"

It is also possible to set the tenant id using the loki.process component which has a tenant stage which allow for exactly this; setting the tenant id, but from there only log entry content are accessible.
More info at https://grafana.com/docs/alloy/latest/reference/components/loki/loki.process/#stagetenant-block

TheoBrigitte · 2024-09-19T14:52:16Z

Current prototype idea

Improvements we want to explore:

Workaround the Kubelet traffic limitation by fetching logs from local disk, using either loki.source.kubernetes or some newer features like join or logs.alloy module
Avoid duplicated targets
How to provide access to Alloy components like loki.process

TheoBrigitte · 2024-09-20T16:43:56Z

Using loki.source.kubernetes would only allow to select pods and not namespace using labels, as the pod's namespace labels are not exposed in this component, that's why I opened a upstream issue asking to expose those grafana/alloy#1550

TheoBrigitte · 2024-09-23T11:48:14Z

The potential new join feature would not help in our case as this would only allow enriching metadata in theloki.relabel component but there would still be no way to pass the resulting targets into the loki.source.file.

QuentinBisson · 2024-09-23T11:52:56Z

What if you enrich then drop logs instead of trying to discover only those we should "scrape" ?

TheoBrigitte · 2024-09-23T11:54:11Z

What if you enrich then drop logs ?

There would still be now way match the resulting targets against a local file as the loki.relabel and loki.source.file components cannot be connected

TheoBrigitte · 2024-09-23T11:55:04Z

logs.alloy also does not help as its mainly a wrapper around existing alloy components.

QuentinBisson · 2024-09-23T11:57:45Z

What if you enrich then drop logs ?

There would still be now way match the resulting targets against a local file as the loki.relabel and loki.source.file components cannot be connected

if you join based on the extracted labels from loki.source.file and the one that does kubernetes discovery that's not possible? 🤔 It might be interesting to go to the next community meeting

TheoBrigitte · 2024-09-23T12:05:13Z

What if you enrich then drop logs ?

There would still be now way match the resulting targets against a local file as the loki.relabel and loki.source.file components cannot be connected

if you join based on the extracted labels from loki.source.file and the one that does kubernetes discovery that's not possible? 🤔 It might be interesting to go to the next community meeting

The namespace metadata is only present when using the loki.source.podlogs component, and this component cannot be chained with loki.source.file. The discovery.kubernetes component does not expose namespace metadata and the join proposition made upstream would only happen in loki.relabel stage and also cannot be linked into the loki.source.file.

loki.source.file is only compatible with component exporting targets: https://grafana.com/docs/alloy/latest/reference/compatibility/#targets-exporters, which in our case means discovery.kubernetes or local.file_match, therefore we cannot access namespace metadata unless exposed by discovery.kuberntes directly.

TheoBrigitte · 2024-09-23T13:47:18Z

We can't load components like loki.process dynamically into Alloy.

The way to load dynamic configuration into Alloy is via modules. A module is describe by a declare block which only accept argument and export blocks, meaning there would be no way to pass any of the stage block from loki.process. There is a module import example using loki.process here https://grafana.com/docs/alloy/latest/get-started/modules/#example.

TheoBrigitte · 2024-09-30T19:56:00Z

It is currently not possible to use kyverno policy to label the kube-system namespace, as kyverno lacks permissions to do so

$ cat kube-system-logging.cpol.yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: kube-system-logging
spec:
  admission: true
  background: true
  mutateExistingOnPolicyUpdate: true
  rules:
    - name: kube-system-enable-logging
      match:
        resources:
          kinds:
            - Namespace
          name: kube-system
      mutate:
        patchesJson6902: '[{"op":"add","path":"/metadata/labels/giantswarm.io~1logging","value":"enabled"}]'
        targets:
        - kind: Namespace
          name: kube-system

$ k apply -f kube-system-logging.cpol.yaml
Error from server: error when creating "kube-system-logging.cpol.yaml": admission webhook "validate-policy.kyverno.svc" denied the request: path: spec.rules[0].mutate.targets.: auth check fails, additional privileges are required for the service account 'system:serviceaccount:kyverno:kyverno-background-controller': failed to get GVR for kind /Namespace; failed to get GVR for kind /Namespace

TheoBrigitte · 2024-10-01T11:06:20Z

Just linking the Alloy internal tenant_id label used for tenant override https://github.com/grafana/alloy/blob/8f1be0e86b0ced53e73cb30d228aa736b1380d89/internal/component/common/loki/client/client.go#L35

TheoBrigitte · 2024-10-03T17:34:38Z

Load testing story

I used Loki canary to load test the logging pipeline and see how the current setup with Promtail compares with Alloy logs. Logs where only ingested from pods in the kube-system namespace. I looked at the Kubernetes API server pods usage and network traffic.

Loki canary was started at: 14:08 UTC, generating approximately 10k log lines per minutes

Kubernetes API server usage stayed the same when using Promtail (which scrapes logs from disk). Usage is ~10 times higher when using Alloy (tailing logs through the api server).

Loki stats

Comparison on data being ingested by Loki

Before the test
Log line: 45/mn
Bandwith: 45 kB/s
Load test with Promtail
Log line: 12k lines/mn
Bandwith: 4,2 MB/s
Load test with Alloy
Log line: ~12k lines/mn
Bandwith: ~3 MB/s

Kube API server pods

Comparison on api server pods resources (memory is not relevant and stays the same).

Before the test
CPU: 0.02
Bandwith: 3 MB/s
Load test with Promtail
CPU: 0.02
Bandwith: 3 MB/s
Load test with Alloy
CPU: ~0.5 (spikes ~1)
Bandwith: 50 MB/s (spikes 150)

Node resources usage

Adding this for information but nothing relevant here, resources usage stayed ruffly the same.

TheoBrigitte · 2024-10-04T18:38:57Z

Since the last results showed concerning performances when tailing logs throught Kubernetes API server I experimented with a new solution which allow to fetch logs from files on disk and discover targets using labels. This solution requires an additional container which does update the list of labeled namespaces directly into Alloy's configuration. Here is a high level overview on how this solution looks like.

The additional container is a 4 lines shell script described here https://github.com/giantswarm/logging-operator/pull/235/files#diff-405f451506c2146b4cf915863a848d9a42f39a3b292a9b6f9ada78b4eac32598R57-R61

QuentinBisson · 2024-10-07T07:54:11Z

Does this sidecar container work well with clustering?

Also, what do you see as concerning? a bit more cpu looks okay to me because with your tests, you fetch 10k log lines from 6 nodes and not 230 like on bigger clusters right? so the metrics you get will definitely be higher than the actual usage

TheoBrigitte · 2024-10-07T16:45:01Z

Does this sidecar container work well with clustering?

I haven't tested this new container with clustering mode, but it should work fine with it.

Also, what do you see as concerning? a bit more cpu looks okay to me because with your tests, you fetch 10k log lines from 6 nodes and not 230 like on bigger clusters right? so the metrics you get will definitely be higher than the actual usage

Yeah maybe in the end performances are not so bad. Anyway with this solution there's no way to override the tenant id, so I am going with solution 1 (PodLogs).

TheoBrigitte · 2024-10-11T10:10:07Z

Adding a graph to explain the current implementation

TheoBrigitte · 2024-10-21T09:16:50Z

Here is the source for the graph above self-service-logging-2024-10-21-1114.excalidraw.gz

TheoBrigitte · 2024-10-24T10:09:26Z

We are good to go here

logging-operator implementation is complete and released in v0.13.0
doc is updated and published at https://docs.giantswarm.io/tutorials/observability/data-ingestion/logs/

This will be available from CAPI v30.0.0 releases

CAPA: CAPA: Release v30.0.0. releases#1357
CAPZ: CAPZ: Release v30.0.0. releases#1359
Defined in logging-operator

Last point: announce this to everyone including customers. How do we proceed ? Do we include this in the v30.0.0 release announcement ?

Rotfuks · 2024-10-24T11:14:02Z

Yeah would be good to have it in the v30.0.0 release announcement. Do we have release notes where we add that?
We can also just push the new feature in news-product - do you want to do that post, or should I?

TheoBrigitte · 2024-10-24T11:53:46Z

I'll do the post and make sure we have this also in the v30.0.0 release annoucement

TheoBrigitte · 2024-10-24T14:40:23Z

We can only craft the release announcement when the next releases are being worked on. Tenet will ping when this happens. I added a todo as a reminder here.

Rotfuks · 2024-11-26T13:42:26Z

Taking the release announcement out of the scope for now, to unblock this ticket. Release will be handled seperately.

github-project-automation bot added this to Roadmap Jun 24, 2024

Rotfuks mentioned this issue Jun 24, 2024

Self-Service Data Ingestion #3515

Closed

github-project-automation bot moved this to Inbox 📥 in Roadmap Jun 24, 2024

Rotfuks added the team/atlas Team Atlas label Jun 24, 2024

Rotfuks removed this from Roadmap Jun 24, 2024

github-project-automation bot added this to Roadmap Jun 24, 2024

github-project-automation bot moved this to Inbox 📥 in Roadmap Jun 24, 2024

TheoBrigitte assigned TheoBrigitte and unassigned TheoBrigitte Aug 6, 2024

TheoBrigitte self-assigned this Aug 27, 2024

TheoBrigitte mentioned this issue Oct 3, 2024

Configure Alloy to be use as a self-service tool for logs (source podlogs) giantswarm/logging-operator#233

Merged

This was referenced Oct 4, 2024

add configmap rbac giantswarm/alloy-app#56

Closed

add configmap rbac giantswarm/observability-bundle#249

Closed

Configure Alloy to be used as a self-service too for logs (source files) giantswarm/logging-operator#235

Closed

TheoBrigitte mentioned this issue Oct 5, 2024

Upgrade alloyLogs giantswarm/observability-bundle#250

Closed

TheoBrigitte mentioned this issue Oct 15, 2024

Use PodLogs only for log targets discovery instead of labels giantswarm/logging-operator#241

Merged

This was referenced Oct 22, 2024

Update logging architecture giantswarm/docs#2315

Merged

Release v0.13.0 giantswarm/logging-operator#244

Merged

TheoBrigitte mentioned this issue Oct 24, 2024

CAPI: Request observability-bundle >= 1.7.0. giantswarm/releases#1468

Merged

TheoBrigitte added the blocked label Oct 24, 2024

Rotfuks closed this as completed Nov 26, 2024

github-project-automation bot moved this from Inbox 📥 to Done ✅ in Roadmap Nov 26, 2024

Rotfuks removed the blocked label Nov 26, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Self Service Log Ingestion #3518

Self Service Log Ingestion #3518

Rotfuks commented Jun 24, 2024 •

edited

Loading

QuentinBisson commented Jul 22, 2024

Rotfuks commented Jul 22, 2024

QuentinBisson commented Jul 22, 2024

TheoBrigitte commented Aug 26, 2024 •

edited

Loading

QuentinBisson commented Aug 26, 2024

TheoBrigitte commented Aug 27, 2024

QuentinBisson commented Aug 27, 2024

QuentinBisson commented Aug 27, 2024

TheoBrigitte commented Aug 27, 2024

TheoBrigitte commented Sep 19, 2024

TheoBrigitte commented Sep 20, 2024

TheoBrigitte commented Sep 23, 2024

QuentinBisson commented Sep 23, 2024 •

edited

Loading

TheoBrigitte commented Sep 23, 2024

TheoBrigitte commented Sep 23, 2024

QuentinBisson commented Sep 23, 2024

TheoBrigitte commented Sep 23, 2024

TheoBrigitte commented Sep 23, 2024

TheoBrigitte commented Sep 30, 2024

TheoBrigitte commented Oct 1, 2024

TheoBrigitte commented Oct 3, 2024 •

edited

Loading

TheoBrigitte commented Oct 4, 2024

QuentinBisson commented Oct 7, 2024

TheoBrigitte commented Oct 7, 2024

TheoBrigitte commented Oct 11, 2024

TheoBrigitte commented Oct 21, 2024

TheoBrigitte commented Oct 24, 2024

Rotfuks commented Oct 24, 2024

TheoBrigitte commented Oct 24, 2024

TheoBrigitte commented Oct 24, 2024 •

edited

Loading

Rotfuks commented Nov 26, 2024

Self Service Log Ingestion #3518

Self Service Log Ingestion #3518

Comments

Rotfuks commented Jun 24, 2024 • edited Loading

Motivation

Todo

Outcome

QuentinBisson commented Jul 22, 2024

Rotfuks commented Jul 22, 2024

QuentinBisson commented Jul 22, 2024

TheoBrigitte commented Aug 26, 2024 • edited Loading

QuentinBisson commented Aug 26, 2024

TheoBrigitte commented Aug 27, 2024

QuentinBisson commented Aug 27, 2024

QuentinBisson commented Aug 27, 2024

TheoBrigitte commented Aug 27, 2024

TheoBrigitte commented Sep 19, 2024

TheoBrigitte commented Sep 20, 2024

TheoBrigitte commented Sep 23, 2024

QuentinBisson commented Sep 23, 2024 • edited Loading

TheoBrigitte commented Sep 23, 2024

TheoBrigitte commented Sep 23, 2024

QuentinBisson commented Sep 23, 2024

TheoBrigitte commented Sep 23, 2024

TheoBrigitte commented Sep 23, 2024

TheoBrigitte commented Sep 30, 2024

TheoBrigitte commented Oct 1, 2024

TheoBrigitte commented Oct 3, 2024 • edited Loading

Load testing story

Loki stats

Kube API server pods

Node resources usage

TheoBrigitte commented Oct 4, 2024

QuentinBisson commented Oct 7, 2024

TheoBrigitte commented Oct 7, 2024

TheoBrigitte commented Oct 11, 2024

TheoBrigitte commented Oct 21, 2024

TheoBrigitte commented Oct 24, 2024

Rotfuks commented Oct 24, 2024

TheoBrigitte commented Oct 24, 2024

TheoBrigitte commented Oct 24, 2024 • edited Loading

Rotfuks commented Nov 26, 2024

Rotfuks commented Jun 24, 2024 •

edited

Loading

TheoBrigitte commented Aug 26, 2024 •

edited

Loading

QuentinBisson commented Sep 23, 2024 •

edited

Loading

TheoBrigitte commented Oct 3, 2024 •

edited

Loading

TheoBrigitte commented Oct 24, 2024 •

edited

Loading