Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add distributed tracing for k6-backed checks #923

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

nadiamoe
Copy link
Member

@nadiamoe nadiamoe commented Oct 2, 2024

This PR adds opt-in distributed tracing support in the agent, using the opentelemetry-go sdk and OTLP transport. These traces are meant to help probe operators diagnose and understand private probes, and they are not sent to the stacks where the checks are running, but rather to a single OTLP endpoint configured by the probe operator.

image

For now, tracing is enabled only for k6-backed checks executed by remote runners. However, the mechanism can easily be extended to other check types and local k6 runners.

The code follows the opentelemetry tracing conventions, relying on context.Context to propagate spans across the code: A function creates a span, adds it to the context, and passes that context down so the process can repeat downstream and spans be correctly linked.

Additionally, this PR also adds a convenience tracing package. This package creates a TracerProvider configured to output otlp traces using variables from the environment (See upstream docs), and injects it into the context. Further down the stack, probers can retrieve this TracerProvider from that context. This is inspired by the same pattern above, but allows propagating the TracerProvider down even when no span is currently active. The alternative to this would be to manually pass down a TracerProvider to all objects that initiate traces (i.e. create root spans), but given that we have quite a lot of them already satisfying established interfaces, I thought this option would be less intrusive.

@nadiamoe nadiamoe force-pushed the http-runner-tracing branch 3 times, most recently from f4faba0 to 7f71def Compare October 2, 2024 16:40
@nadiamoe nadiamoe changed the title k6runner/http: add tracing support Add distributed tracing support Oct 2, 2024
@nadiamoe nadiamoe force-pushed the http-runner-tracing branch 5 times, most recently from e1e7ac3 to 6e5d732 Compare October 3, 2024 14:49
@nadiamoe nadiamoe changed the title Add distributed tracing support Add distributed tracing for k6-backed checks Oct 3, 2024
@nadiamoe nadiamoe force-pushed the http-runner-tracing branch 5 times, most recently from c01bb9d to b386066 Compare October 4, 2024 13:57
@nadiamoe nadiamoe marked this pull request as ready for review October 4, 2024 14:10
@nadiamoe nadiamoe requested a review from a team as a code owner October 4, 2024 14:10
@nadiamoe nadiamoe force-pushed the http-runner-tracing branch 4 times, most recently from 4f8df52 to a054992 Compare October 8, 2024 10:06
Copy link
Contributor

@mem mem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you.

Sorry about the comments all over the place. I really dislike the otel API, it's messy, complicated and it has very low discoverability. A lot of the comments come from that place.

internal/tracing/tracing.go Outdated Show resolved Hide resolved
internal/tracing/tracing.go Outdated Show resolved Hide resolved
cmd/synthetic-monitoring-agent/main.go Outdated Show resolved Hide resolved
cmd/synthetic-monitoring-agent/main.go Outdated Show resolved Hide resolved
@@ -60,12 +62,18 @@ func (p Prober) Name() string {
}

func (p Prober) Probe(ctx context.Context, target string, registry *prometheus.Registry, logger logger.Logger) (bool, float64) {
ctx, probeSpan := tracing.TracerProvider(ctx).Tracer("").Start(ctx, "browser check")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I almost want to say the span should start one level up.

The fact that the code is repeated across a bunch of different probers makes me think that's probably correct.

Tracer("") is one of those "ugh!" otel things...

and Tracer(...) is the other one, where Java slipped into Go. What that thing is doing is going to a global (!!!) registry of tracers and pulling one out. I have always found it so confusing that I almost want to say that's what should be in the context instead of the tracer provider.

But that's the reason why I think the span should start one level up, because it would be useful to have the check ID, tenant ID, region ID, probe name, and possibly a couple of bits of info in the trace, and that should be added to the span.

It might make sense for the tracer to have the name of the probe as an attribute, but I never remember which bit of info ends up where. I think the cluster and instance are getting added by Alloy?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, the prober knows its name, you should use that instead of hard coding "browser check"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I almost want to say the span should start one level up.

I'm fine moving it one up, what I'm not sure is if we do want traces for non-k6 checks. The possibility is there, but do we actually have a use for them?

Copy link
Member Author

@nadiamoe nadiamoe Oct 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and Tracer(...) is the other one, where Java slipped into Go. What that thing is doing is going to a global (!!!) registry of tracers and pulling one out. I have always found it so confusing that I almost want to say that's what should be in the context instead of the tracer provider.

I don't think that's true. Each TracerProvider keeps a list of Tracers indexed by name, and we only create one TracerProvider:
https://github.com/open-telemetry/opentelemetry-go/blob/sdk/v1.30.0/sdk/trace/provider.go#L132

That's a bit awful but it means that calling TracerProvider(ctx).Tracer("") is cheap, the tracer is created only once. That's why it is repeated all over the code.

Not trying to say otel design is good or bad, but I try to say this is how they want it to be used.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine moving it one up, what I'm not sure is if we do want traces for non-k6 checks. The possibility is there, but do we actually have a use for them?

I would rather have the consistency.

If traces for protocol checks aren't useful, we can configure a sampling policy to drop those.

internal/k6runner/http.go Outdated Show resolved Hide resolved
Comment on lines +176 to +182
httpClient := http.Client{
Transport: otelhttp.NewTransport(
http.DefaultTransport,
otelhttp.WithTracerProvider(trace.SpanFromContext(ctx).TracerProvider()),
// Span names do not include method and path by default to avoid cardinality explosion with paths containing
// IDs. As this is not the case with this endpoint, we use a custom formatter that includes both.
otelhttp.WithSpanNameFormatter(func(_ string, r *http.Request) string {
return fmt.Sprintf("%s %s", r.Method, r.URL.Path)
}),
otelhttp.WithPropagators(propagation.TraceContext{}), // Send TraceIDs in outgoing requests.
),
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I distrust otel a lot...

Isn't this doing a lot of extra work over and over again?

Wouldn't it be better to create a client once (e.g. sync.OnceValue)

Also, this line:

otelhttp.WithTracerProvider(trace.SpanFromContext(ctx).TracerProvider()),

isn't the tracer provider stored in the context? You don't need to extract the span to extract its provider, you can ask for the provider directly from the context. I guess there are situations where you have multiple tracer provider, but unless I'm missing something this isn't one of them.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this doing a lot of extra work over and over again?

Theoretically, yes. We should be able to store this client somewhere, but figuring out where was more work than just creating it anew.

Copy link
Member Author

@nadiamoe nadiamoe Oct 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't the tracer provider stored in the context? You don't need to extract the span to extract its provider, you can ask for the provider directly from the context. I guess there are situations where you have multiple tracer provider, but unless I'm missing something this isn't one of them.

Also possible. I would do it that way if I was creating the http.Client only once.

My justification is as follows: Storing the span in the context and retrieving the tracerprovider from it is "the otel way" (Example). I will do it like that everywhere I can. Storing the tracerprovider in the context directly is a hack I made up for two reasons:

  • It doesnt make sense, in our codebase, for a root span to be created before a scrape is started (i.e. there is no global span anywhere)
  • To avoid adding the tracerprovider to the constructor of every scraper and prober in the code.

I will only use the tracing hack where necessary, so if at some point we want to drop the hack (the tracing package), we will need to change the least amount of code possible. That's what I tried to convey with the comment here:

// Note: This function is useful to obtain a TracerProvider to generate a root span. If a span already exists in the
// context, the idiomatic way of retrieving it is using:
// trace.SpanFromContext(ctx).TracerProvider()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Theoretically, yes. We should be able to store this client somewhere, but figuring out where was more work than just creating it anew.

Note there's a subtle change here: when using DefaultClient, the client is not getting created over and over again. Clients are reusable. If I remember things correctly, reusing clients allows the underlying code the reuse connections.

My justification is as follows: Storing the span in the context and retrieving the tracerprovider from it is "the otel way"

I know. I'm not debating that. I'm debating that the otel way doesn't buy you anything in this case. In otel it's theoretically possible to have multiple tracer providers (I challenge you to find something in the documentation that explains why you might want to do that) but within the actual use case we have here, I cannot see that happening.

To avoid adding the tracerprovider to the constructor of every scraper and prober in the code.

Hmm...

The current implementation is confusing.

  • The tracerprovider is in the context, but you don't want to use it.
  • Looking at the line of code I flagged, I'm not sure whether that context has or hasn't a span.

The tracerprovider should be added to the scraper and passed down from there to the point where the root span should be created, which is probably the runProber function in scraper.go. There a context is created, which is where the span should be added.

If you do that, the only way to get a tracer provider is from the span, and that resolves the issue of why it's done one way or the other.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current implementation is confusing.

I will grant you that, it's not the clearest. I was hoping the comment would help, but it's true that it creates some ambiguity. Let me try to drop the context thingie and pass the tracerprovider everywhere. It will be a noisy PR.

Copy link
Member Author

@nadiamoe nadiamoe Oct 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've reworked the stuff. At this point it didn't make sense to keep with the fixup!s, so I've rebased the entire thing. Changes roughly are:

  • Objects now receive the TracerProvider normally, not through the context. I can hardly find words for how painful it was to add this on everywhere. I've spent one hour trying to get the thing to compile and another hour debugging nil pointer dereferences. I'm still not sure all of them are solved.
  • Root spans are now initiated by the scraper and the adhoc handler. Probers no longer know what a trace is. This means we lose the ability to record probe errors in a span, as Probers hide that behind the success boolean. We can create a child span from the Prober just with that info, but that would be duplicating it on every prober. Your call.

internal/k6runner/k6runner.go Show resolved Hide resolved
@nadiamoe nadiamoe force-pushed the http-runner-tracing branch 4 times, most recently from d3a6d18 to 3b2a355 Compare October 9, 2024 12:15
@nadiamoe nadiamoe requested a review from mem October 9, 2024 12:41
@nadiamoe nadiamoe force-pushed the http-runner-tracing branch 2 times, most recently from fde200d to 2973a75 Compare October 9, 2024 13:07
defer cancel()

var tracerProvider trace.TracerProvider = noop.NewTracerProvider()
if os.Getenv("OTEL_EXPORTER_OTLP_ENDPOINT") != "" || os.Getenv("OTEL_EXPORTER_OTLP_TRACES_ENDPOINT") != "" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still a problem.

The implicit behavior that otel favors is problematic.

  • Take the URL from a command line option.
  • Ignore the environment variables and just use otlptracehttp.WithEndpointURL.
  • If no option is passed, use the noop tracer provider.
  • Regardless of any of this, set the global tracer provider to noop, too.

That last point might be contentious: just like the code you have here, there are libraries out there which go thru the same dance of looking at the environment and setting up tracing all by themselves, or even worse, allow otel to do all of that on its own. Libraries should not define any of this. They should simply look for a span in the context and act accordingly.

And I don't want to use a code review to beat a dead horse, but I hope that this allows you to realize how problematic the functional options pattern is. There is an option to override the behavior with the environment variables, it's just very well hidden by that pattern.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still a problem.

I don't see the problem. What you suggest is what I would do if this SDK was used in vacuum, but it is not. The entire ecosystem is build around the expectation that you set those environment variables and it works. It's not just the trace exporter, it's the sampling algorithm, parameters, all of it is configured automatically from environment variables. We would be going out of our way to change how one thing is configured, and by doing that we make that thing inconsistent with the rest of the ecosystem, and also with the rest of the things that can be configured in this very project. I really don't think that's a good idea.

you to realize how problematic the functional options pattern is

I do, I don't like it either. Big fan of opts structs myself.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The entire ecosystem is build around the expectation that you set those environment variables and it works.

The ecosystem is built around an ill-defined notion of "ease of use". I have had to debug issues with Otel-using libraries that obtain configuration from the environment "because it's there".

I do not want stuff happening behind our backs, and more importantly I do not want stuff happening behind our user's backs. I don't want users accidentally pushing traces. If they do want to push traces, they can explicitly ask for it.

I have added a commit on top of this PR. If you are not OK with that, remove it.

internal/k6runner/http.go Outdated Show resolved Hide resolved
Comment on lines +176 to +182
httpClient := http.Client{
Transport: otelhttp.NewTransport(
http.DefaultTransport,
otelhttp.WithTracerProvider(trace.SpanFromContext(ctx).TracerProvider()),
// Span names do not include method and path by default to avoid cardinality explosion with paths containing
// IDs. As this is not the case with this endpoint, we use a custom formatter that includes both.
otelhttp.WithSpanNameFormatter(func(_ string, r *http.Request) string {
return fmt.Sprintf("%s %s", r.Method, r.URL.Path)
}),
otelhttp.WithPropagators(propagation.TraceContext{}), // Send TraceIDs in outgoing requests.
),
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Theoretically, yes. We should be able to store this client somewhere, but figuring out where was more work than just creating it anew.

Note there's a subtle change here: when using DefaultClient, the client is not getting created over and over again. Clients are reusable. If I remember things correctly, reusing clients allows the underlying code the reuse connections.

My justification is as follows: Storing the span in the context and retrieving the tracerprovider from it is "the otel way"

I know. I'm not debating that. I'm debating that the otel way doesn't buy you anything in this case. In otel it's theoretically possible to have multiple tracer providers (I challenge you to find something in the documentation that explains why you might want to do that) but within the actual use case we have here, I cannot see that happening.

To avoid adding the tracerprovider to the constructor of every scraper and prober in the code.

Hmm...

The current implementation is confusing.

  • The tracerprovider is in the context, but you don't want to use it.
  • Looking at the line of code I flagged, I'm not sure whether that context has or hasn't a span.

The tracerprovider should be added to the scraper and passed down from there to the point where the root span should be created, which is probably the runProber function in scraper.go. There a context is created, which is where the span should be added.

If you do that, the only way to get a tracer provider is from the span, and that resolves the issue of why it's done one way or the other.

internal/k6runner/k6runner.go Show resolved Hide resolved
internal/k6runner/k6runner.go Show resolved Hide resolved
@nadiamoe nadiamoe force-pushed the http-runner-tracing branch 3 times, most recently from 2661413 to 6ebe479 Compare October 10, 2024 10:13
@nadiamoe nadiamoe requested a review from mem October 10, 2024 11:25
@nadiamoe nadiamoe force-pushed the http-runner-tracing branch from 6ebe479 to bf0a1df Compare October 10, 2024 11:26
defer checkSpan.End()
checkSpan.SetAttributes(
attribute.String("type", s.prober.Name()),
attribute.String("region", s.probe.Region),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be the probe region (as expected), not the tenant region. Not sure where to get the latter from.

@nadiamoe nadiamoe force-pushed the http-runner-tracing branch 2 times, most recently from 6a4187e to 8d91f8f Compare October 10, 2024 13:27
@nadiamoe nadiamoe force-pushed the http-runner-tracing branch from 8d91f8f to 9ff0ca4 Compare October 14, 2024 13:55
Signed-off-by: Marcelo E. Magallon <marcelo.magallon@grafana.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants