-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add distributed tracing for k6-backed checks #923
base: main
Are you sure you want to change the base?
Conversation
f4faba0
to
7f71def
Compare
e1e7ac3
to
6e5d732
Compare
c01bb9d
to
b386066
Compare
4f8df52
to
a054992
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you.
Sorry about the comments all over the place. I really dislike the otel API, it's messy, complicated and it has very low discoverability. A lot of the comments come from that place.
internal/prober/browser/browser.go
Outdated
@@ -60,12 +62,18 @@ func (p Prober) Name() string { | |||
} | |||
|
|||
func (p Prober) Probe(ctx context.Context, target string, registry *prometheus.Registry, logger logger.Logger) (bool, float64) { | |||
ctx, probeSpan := tracing.TracerProvider(ctx).Tracer("").Start(ctx, "browser check") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I almost want to say the span should start one level up.
The fact that the code is repeated across a bunch of different probers makes me think that's probably correct.
Tracer("")
is one of those "ugh!" otel things...
and Tracer(...)
is the other one, where Java slipped into Go. What that thing is doing is going to a global (!!!) registry of tracers and pulling one out. I have always found it so confusing that I almost want to say that's what should be in the context instead of the tracer provider.
But that's the reason why I think the span should start one level up, because it would be useful to have the check ID, tenant ID, region ID, probe name, and possibly a couple of bits of info in the trace, and that should be added to the span.
It might make sense for the tracer to have the name of the probe as an attribute, but I never remember which bit of info ends up where. I think the cluster and instance are getting added by Alloy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, the prober knows its name, you should use that instead of hard coding "browser check"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I almost want to say the span should start one level up.
I'm fine moving it one up, what I'm not sure is if we do want traces for non-k6 checks. The possibility is there, but do we actually have a use for them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and Tracer(...) is the other one, where Java slipped into Go. What that thing is doing is going to a global (!!!) registry of tracers and pulling one out. I have always found it so confusing that I almost want to say that's what should be in the context instead of the tracer provider.
I don't think that's true. Each TracerProvider
keeps a list of Tracers indexed by name, and we only create one TracerProvider
:
https://github.com/open-telemetry/opentelemetry-go/blob/sdk/v1.30.0/sdk/trace/provider.go#L132
That's a bit awful but it means that calling TracerProvider(ctx).Tracer("")
is cheap, the tracer is created only once. That's why it is repeated all over the code.
Not trying to say otel design is good or bad, but I try to say this is how they want it to be used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine moving it one up, what I'm not sure is if we do want traces for non-k6 checks. The possibility is there, but do we actually have a use for them?
I would rather have the consistency.
If traces for protocol checks aren't useful, we can configure a sampling policy to drop those.
httpClient := http.Client{ | ||
Transport: otelhttp.NewTransport( | ||
http.DefaultTransport, | ||
otelhttp.WithTracerProvider(trace.SpanFromContext(ctx).TracerProvider()), | ||
// Span names do not include method and path by default to avoid cardinality explosion with paths containing | ||
// IDs. As this is not the case with this endpoint, we use a custom formatter that includes both. | ||
otelhttp.WithSpanNameFormatter(func(_ string, r *http.Request) string { | ||
return fmt.Sprintf("%s %s", r.Method, r.URL.Path) | ||
}), | ||
otelhttp.WithPropagators(propagation.TraceContext{}), // Send TraceIDs in outgoing requests. | ||
), | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I distrust otel a lot...
Isn't this doing a lot of extra work over and over again?
Wouldn't it be better to create a client once (e.g. sync.OnceValue
)
Also, this line:
otelhttp.WithTracerProvider(trace.SpanFromContext(ctx).TracerProvider()),
isn't the tracer provider stored in the context? You don't need to extract the span to extract its provider, you can ask for the provider directly from the context. I guess there are situations where you have multiple tracer provider, but unless I'm missing something this isn't one of them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this doing a lot of extra work over and over again?
Theoretically, yes. We should be able to store this client somewhere, but figuring out where was more work than just creating it anew.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't the tracer provider stored in the context? You don't need to extract the span to extract its provider, you can ask for the provider directly from the context. I guess there are situations where you have multiple tracer provider, but unless I'm missing something this isn't one of them.
Also possible. I would do it that way if I was creating the http.Client
only once.
My justification is as follows: Storing the span in the context and retrieving the tracerprovider from it is "the otel way" (Example). I will do it like that everywhere I can. Storing the tracerprovider in the context directly is a hack I made up for two reasons:
- It doesnt make sense, in our codebase, for a root span to be created before a scrape is started (i.e. there is no global span anywhere)
- To avoid adding the tracerprovider to the constructor of every scraper and prober in the code.
I will only use the tracing
hack where necessary, so if at some point we want to drop the hack (the tracing
package), we will need to change the least amount of code possible. That's what I tried to convey with the comment here:
synthetic-monitoring-agent/internal/tracing/tracing.go
Lines 55 to 57 in 3b2a355
// Note: This function is useful to obtain a TracerProvider to generate a root span. If a span already exists in the | |
// context, the idiomatic way of retrieving it is using: | |
// trace.SpanFromContext(ctx).TracerProvider() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Theoretically, yes. We should be able to store this client somewhere, but figuring out where was more work than just creating it anew.
Note there's a subtle change here: when using DefaultClient, the client is not getting created over and over again. Clients are reusable. If I remember things correctly, reusing clients allows the underlying code the reuse connections.
My justification is as follows: Storing the span in the context and retrieving the tracerprovider from it is "the otel way"
I know. I'm not debating that. I'm debating that the otel way doesn't buy you anything in this case. In otel it's theoretically possible to have multiple tracer providers (I challenge you to find something in the documentation that explains why you might want to do that) but within the actual use case we have here, I cannot see that happening.
To avoid adding the tracerprovider to the constructor of every scraper and prober in the code.
Hmm...
The current implementation is confusing.
- The tracerprovider is in the context, but you don't want to use it.
- Looking at the line of code I flagged, I'm not sure whether that context has or hasn't a span.
The tracerprovider should be added to the scraper and passed down from there to the point where the root span should be created, which is probably the runProber function in scraper.go. There a context is created, which is where the span should be added.
If you do that, the only way to get a tracer provider is from the span, and that resolves the issue of why it's done one way or the other.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current implementation is confusing.
I will grant you that, it's not the clearest. I was hoping the comment would help, but it's true that it creates some ambiguity. Let me try to drop the context thingie and pass the tracerprovider everywhere. It will be a noisy PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've reworked the stuff. At this point it didn't make sense to keep with the fixup!
s, so I've rebased the entire thing. Changes roughly are:
- Objects now receive the
TracerProvider
normally, not through the context. I can hardly find words for how painful it was to add this on everywhere. I've spent one hour trying to get the thing to compile and another hour debugging nil pointer dereferences. I'm still not sure all of them are solved. - Root spans are now initiated by the scraper and the adhoc handler. Probers no longer know what a trace is. This means we lose the ability to record probe errors in a span, as Probers hide that behind the
success
boolean. We can create a child span from the Prober just with that info, but that would be duplicating it on every prober. Your call.
d3a6d18
to
3b2a355
Compare
fde200d
to
2973a75
Compare
defer cancel() | ||
|
||
var tracerProvider trace.TracerProvider = noop.NewTracerProvider() | ||
if os.Getenv("OTEL_EXPORTER_OTLP_ENDPOINT") != "" || os.Getenv("OTEL_EXPORTER_OTLP_TRACES_ENDPOINT") != "" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is still a problem.
The implicit behavior that otel favors is problematic.
- Take the URL from a command line option.
- Ignore the environment variables and just use
otlptracehttp.WithEndpointURL
. - If no option is passed, use the noop tracer provider.
- Regardless of any of this, set the global tracer provider to noop, too.
That last point might be contentious: just like the code you have here, there are libraries out there which go thru the same dance of looking at the environment and setting up tracing all by themselves, or even worse, allow otel to do all of that on its own. Libraries should not define any of this. They should simply look for a span in the context and act accordingly.
And I don't want to use a code review to beat a dead horse, but I hope that this allows you to realize how problematic the functional options pattern is. There is an option to override the behavior with the environment variables, it's just very well hidden by that pattern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is still a problem.
I don't see the problem. What you suggest is what I would do if this SDK was used in vacuum, but it is not. The entire ecosystem is build around the expectation that you set those environment variables and it works. It's not just the trace exporter, it's the sampling algorithm, parameters, all of it is configured automatically from environment variables. We would be going out of our way to change how one thing is configured, and by doing that we make that thing inconsistent with the rest of the ecosystem, and also with the rest of the things that can be configured in this very project. I really don't think that's a good idea.
you to realize how problematic the functional options pattern is
I do, I don't like it either. Big fan of opts structs myself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The entire ecosystem is build around the expectation that you set those environment variables and it works.
The ecosystem is built around an ill-defined notion of "ease of use". I have had to debug issues with Otel-using libraries that obtain configuration from the environment "because it's there".
I do not want stuff happening behind our backs, and more importantly I do not want stuff happening behind our user's backs. I don't want users accidentally pushing traces. If they do want to push traces, they can explicitly ask for it.
I have added a commit on top of this PR. If you are not OK with that, remove it.
httpClient := http.Client{ | ||
Transport: otelhttp.NewTransport( | ||
http.DefaultTransport, | ||
otelhttp.WithTracerProvider(trace.SpanFromContext(ctx).TracerProvider()), | ||
// Span names do not include method and path by default to avoid cardinality explosion with paths containing | ||
// IDs. As this is not the case with this endpoint, we use a custom formatter that includes both. | ||
otelhttp.WithSpanNameFormatter(func(_ string, r *http.Request) string { | ||
return fmt.Sprintf("%s %s", r.Method, r.URL.Path) | ||
}), | ||
otelhttp.WithPropagators(propagation.TraceContext{}), // Send TraceIDs in outgoing requests. | ||
), | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Theoretically, yes. We should be able to store this client somewhere, but figuring out where was more work than just creating it anew.
Note there's a subtle change here: when using DefaultClient, the client is not getting created over and over again. Clients are reusable. If I remember things correctly, reusing clients allows the underlying code the reuse connections.
My justification is as follows: Storing the span in the context and retrieving the tracerprovider from it is "the otel way"
I know. I'm not debating that. I'm debating that the otel way doesn't buy you anything in this case. In otel it's theoretically possible to have multiple tracer providers (I challenge you to find something in the documentation that explains why you might want to do that) but within the actual use case we have here, I cannot see that happening.
To avoid adding the tracerprovider to the constructor of every scraper and prober in the code.
Hmm...
The current implementation is confusing.
- The tracerprovider is in the context, but you don't want to use it.
- Looking at the line of code I flagged, I'm not sure whether that context has or hasn't a span.
The tracerprovider should be added to the scraper and passed down from there to the point where the root span should be created, which is probably the runProber function in scraper.go. There a context is created, which is where the span should be added.
If you do that, the only way to get a tracer provider is from the span, and that resolves the issue of why it's done one way or the other.
2661413
to
6ebe479
Compare
6ebe479
to
bf0a1df
Compare
internal/scraper/scraper.go
Outdated
defer checkSpan.End() | ||
checkSpan.SetAttributes( | ||
attribute.String("type", s.prober.Name()), | ||
attribute.String("region", s.probe.Region), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to be the probe region (as expected), not the tenant region. Not sure where to get the latter from.
6a4187e
to
8d91f8f
Compare
8d91f8f
to
9ff0ca4
Compare
Signed-off-by: Marcelo E. Magallon <marcelo.magallon@grafana.com>
This PR adds opt-in distributed tracing support in the agent, using the opentelemetry-go sdk and OTLP transport. These traces are meant to help probe operators diagnose and understand private probes, and they are not sent to the stacks where the checks are running, but rather to a single OTLP endpoint configured by the probe operator.
For now, tracing is enabled only for k6-backed checks executed by remote runners. However, the mechanism can easily be extended to other check types and local k6 runners.
The code follows the opentelemetry tracing conventions, relying on
context.Context
to propagate spans across the code: A function creates a span, adds it to the context, and passes that context down so the process can repeat downstream and spans be correctly linked.Additionally, this PR also adds a convenience
tracing
package. This package creates aTracerProvider
configured to output otlp traces using variables from the environment (See upstream docs), and injects it into the context. Further down the stack, probers can retrieve thisTracerProvider
from that context. This is inspired by the same pattern above, but allows propagating theTracerProvider
down even when no span is currently active. The alternative to this would be to manually pass down aTracerProvider
to all objects that initiate traces (i.e. create root spans), but given that we have quite a lot of them already satisfying established interfaces, I thought this option would be less intrusive.