-
Notifications
You must be signed in to change notification settings - Fork 897
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OTEP: Recording exceptions as log based events #4333
base: main
Are you sure you want to change the base?
Conversation
I think this is a related issue: |
b06a09f
to
76c7d85
Compare
1a1ea49
to
5ddfd05
Compare
db27087
to
e9f38aa
Compare
A small doubt:
Although (I think) it's not called out, I'm understanding exceptions should now be explicitly reported as both 1) Span.Event and 2) Log/Event? i.e. coding wise you should do this: currentSpan.recordException(e);
logger.logRecordBuilder
.addException(e); Is this the case? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall I'm very supportive. Just some nits and one mitigation I'd like to see called out/addressed.
|
||
5. An error should be logged with appropriate severity depending on the available context. | ||
|
||
- Errors that don't indicate any issue should be recorded with severity not higher than `Info`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's an example of this? I'm struggling to understand when this would be used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most popular is exception-driven logic - you check that something is available and it throws if it's not. You try to delete something and it throws if someone concurrently deleted it.
You send a request, it times-out, you retry and get 'already exists', etc.
It's not always the best development style, but there are plenty of real-life examples. Adding one to the text.
4. It's not recommended to record the same error as it propagates through the stack trace or | ||
attach the same instance of exception to multiple log records. | ||
|
||
5. An error should be logged with appropriate severity depending on the available context. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the goal of the taxonomy, but think we need to crisp up the language around Info
/Warning
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makes sense, updated each severity bullet with examples and some criteria related to error impact. PTAL
> OTel should provide API like `setException` when creating log record that will record only necessary information depending | ||
> on the configuration and log severity. | ||
|
||
It should not be an instrumentation library concern to decide whether exception stack trace should be recorded or not. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two things:
- SHOULD is normative, so please capatilize (and I think IS a normative statement here).
- This may not be language neutral, so I think SHOULD is the right guidance here. For example, in Rust, stack traces are something you can opt-in on an error. They leave some details to libraries (see Rust Backtrace/Source capabilities on https://docs.rs/thiserror/latest/thiserror/ e.g. or C++ [prior to 23] https://github.com/jeremy-rifkin/cpptrace).
Additionally, in some highly green-thread/async APIs, I've seen custom stack traces created (e.g. Scala's ZIO where they try to preserve logical stack when physical stack is a confusing mess of work-stealing green-threads. We should allow these to interact with exception reporting in OTEL in some fashion.
I agree with the sentiment, I'd expand the wording though to allow languages like Rust/C++ (and Java ecosystem) to provide stack trace compatibility with their library ecosystem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated to use normative language. Also added
The signature of the method is to be determined by each language
and can be overloaded as appropriate including ability to customize stack trace
collection.
It MUST be possible to efficiently set exception information on a log record based on configuration
and without using thesetException
method.
with appropriate severity (or stop reporting them). | ||
- We should provide opt-in mechanism for existing instrumentations to switch to logs. | ||
|
||
2. Recording exceptions as log-based events would result in UX degradation for users |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should also call out the fact that now we have two channels of exporting/batching/recording information of exceptions and Traces. In this new world, you may see a trace before an exception or vice versa, and one may be dropped where the other is not.
We probably need some other mitigatioin should that requiring knowledge of an exception event under a Span is no longer needed (e.g. more aggressively using Span.status and attributes around "transient failures" as we discussed in Semconv SIG.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a clarification in the OTEP that logs are intended to replace span events (gracefully). I.e. there should be just one channel of communication at a time.
Co-authored-by: Joao Grassi <5938087+joaopgrassi@users.noreply.github.com>
Co-authored-by: Trask Stalnaker <trask.stalnaker@gmail.com>
5a4b286
to
61d9551
Compare
Doesn't this potentially make data much harder to query? If the exception is on the span itself I can do something like: select *
from records
where span_name = 'db query' and has_exceptions(events) Whereas if it's emitted as a (non span) event (aka a log?) we would have to either re-assemble during ingest (very difficult) or it ends up as it's own row, which means the query would have to be something like: select *
from records r1
where span_name = 'db query' and exists (
select
from records r2
where r1.trace_id = r2.trace_id and r2.parent_span_id = r1.span_id and has_exceptions(r2.events)
) |
How is this related to the OTEP? Many backends already show log records related to the given trace. It looks like a complain to the trace-log correlation. Or even https://github.com/open-telemetry/opentelemetry-specification/blob/main/oteps/0265-event-vision.md#relationship-to-span-events (different OTEP).
I guess the difficulty depends on the given backend design. |
- OpenTelemetry API and/or SDK in the future may provide opt-in span events -> log-based events conversion, | ||
but that's not enough - instrumentations will have to change their behavior to report errors | ||
as logs with appropriate severity. | ||
- We should provide opt-in mechanism for existing instrumentations to switch to logs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a limited mitigation that will not work for instrumentations that we do not own.
for any component following existing [exception guidance](/specification/trace/exceptions.md). | ||
|
||
**Mitigation:** | ||
- OpenTelemetry API and/or SDK in the future may provide opt-in span events -> log-based events conversion, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the long term, errors recorded on logs will replace span events
(according to Event vision OTEP).
I think this mechanism will be necessary for migrating from Span Events to Exception Log Records and we should probably have a separate OTEP for this.
The OTEP is proposing to move data from the span to related logs, so understanding the impact of this on user experience is important. I guess you could say my question is more in general about https://github.com/open-telemetry/opentelemetry-specification/blob/main/oteps/0265-event-vision.md#relationship-to-span-events but I think exceptions are a particularly important tipping point.
We do as well. But do any of them support doing this programmatically? It's one thing to have the backend know how to retrieve that data, it's another thing to do it programatically e.g. to build a dashboard of error rate by endpoint. If they do I'd like to see what that query looks like before and after this proposal from a general user's perspective. |
**Mitigation:** | ||
- OpenTelemetry API and/or SDK may provide span events -> log events conversion. | ||
See also [Event vision OTEP](./0265-event-vision.md#relationship-to-span-events). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the mitigation supposed to be long term instead of short term for trace-only backends then?
opentelemetry-specification/oteps/0265-event-vision.md
Lines 41 to 46 in c55ed5c
### Relationship to Span Events | |
Events are intended to replace Span Events in the long term. | |
Span Events will be deprecated to signal that users should prefer Events. | |
Interoperability between Events and Span Events will be defined in the short term. |
It seems this section is proposing a long term mitigation, but the link provided specifies that it should be a short term solution.
I also want to confirm what the proposed behavior is based on this scenario. It was not clear to me from reading the OTEP. from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.sdk.trace import ReadableSpan
resource = Resource(attributes={SERVICE_NAME:"test"})
def formatter(span: ReadableSpan) -> str:
out: list[str] = []
out.append(f"{span.name}")
for event in span.events:
if event.name == "exception":
attributes = event.attributes or {}
stacktrace = attributes.get("exception.stacktrace") or ""
out.append(f" {event.name}: {stacktrace}")
out.append("")
return "\n".join(out)
trace_provider = TracerProvider(resource=resource)
processor = BatchSpanProcessor(ConsoleSpanExporter(formatter=formatter))
trace_provider.add_span_processor(processor)
trace.set_tracer_provider(trace_provider)
tracer = trace.get_tracer("test")
def inner() -> None:
with tracer.start_as_current_span("inner"):
raise ValueError('the message')
def outer() -> None:
with tracer.start_as_current_span("outer"):
inner()
try:
with tracer.start_as_current_span("root"):
outer()
except Exception as e:
pass # keep the output clean Currently this gives (cleanup up a bit):
The point is that each span has the exception recorded on it, and they each have a different stack trace. Under this new paradigm, would a log/event be emitted for each span the exception bubbles up through? Or would only a single log/event be emitted for the first span it bubbles though? The current version of this PR states:
If this means that only a single log would be emitted we'd loose information on the fact that this exception bubbled up through the other spans. Even if the span's status code and message get set, I still don't have the stack trace that bubbled up through the event, meaning information is lost relative to the current system. |
> [!NOTE] | ||
> Throughout this OTEP, the terms exception and error are defined as follows: | ||
> | ||
> - *Error* refers to a general concept describing any non-success condition, | ||
> which may manifest as an exception, non-successful status code, or an invalid | ||
> response. | ||
> - *Exception* specifically refers to runtime exceptions and their associated stack traces. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️
|
||
## Motivation | ||
|
||
Today OTel supports recording *exceptions* using span events available through Trace API. Outside of OTel world, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Today OTel supports recording *exceptions* using span events available through Trace API. Outside of OTel world, | |
Today OTel supports recording *exceptions* using span events available through the Trace API. Outside the OTel world, |
## Motivation | ||
|
||
Today OTel supports recording *exceptions* using span events available through Trace API. Outside of OTel world, | ||
*errors* are usually recorded by user apps and libraries using logging libraries |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
*errors* are usually recorded by user apps and libraries using logging libraries | |
*errors* are usually recorded by user apps and libraries by using logging libraries, |
|
||
Today OTel supports recording *exceptions* using span events available through Trace API. Outside of OTel world, | ||
*errors* are usually recorded by user apps and libraries using logging libraries | ||
and may be recorded as OTel logs via logging bridge. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and may be recorded as OTel logs via logging bridge. | |
and may be recorded as OTel logs via a logging bridge. |
*errors* are usually recorded by user apps and libraries using logging libraries | ||
and may be recorded as OTel logs via logging bridge. | ||
|
||
Errors recorded on logs have the following advantages over span events: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Errors recorded on logs have the following advantages over span events: | |
Using logs to record errors has the following advantages over using span events: |
- Background job terminates with an exception | ||
- HTTP framework error handler catches exception thrown by the application code. | ||
|
||
Note: some frameworks use exceptions as a communication mechanism when request fails. For example, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: some frameworks use exceptions as a communication mechanism when request fails. For example, | |
Note: some frameworks use exceptions as a communication mechanism when a request fails. For example, |
- HTTP framework error handler catches exception thrown by the application code. | ||
|
||
Note: some frameworks use exceptions as a communication mechanism when request fails. For example, | ||
Spring users can throw [ResponseStatusException](https://docs.spring.io/spring-framework/docs/current/javadoc-api/org/springframework/web/server/ResponseStatusException.html) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spring users can throw [ResponseStatusException](https://docs.spring.io/spring-framework/docs/current/javadoc-api/org/springframework/web/server/ResponseStatusException.html) | |
Spring users can throw a [ResponseStatusException](https://docs.spring.io/spring-framework/docs/current/javadoc-api/org/springframework/web/server/ResponseStatusException.html) |
|
||
Note: some frameworks use exceptions as a communication mechanism when request fails. For example, | ||
Spring users can throw [ResponseStatusException](https://docs.spring.io/spring-framework/docs/current/javadoc-api/org/springframework/web/server/ResponseStatusException.html) | ||
exception to return unsuccessful status code. Such exceptions represent errors already handled by the application code. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
exception to return unsuccessful status code. Such exceptions represent errors already handled by the application code. | |
exception to return an unsuccessful status code. Such exceptions represent errors already handled by the application code. |
|
||
- Errors that result in application shutdown SHOULD be recorded with severity `Fatal`. | ||
|
||
- The application detects an invalid configuration at startup and shuts down. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- The application detects an invalid configuration at startup and shuts down. | |
Examples: | |
- The application detects an invalid configuration at startup and shuts down. |
to describe the context that the exception was thrown in. | ||
They are also encouraged to define their own error events and enrich them with exception details. | ||
|
||
7. OTel SDK SHOULD record stack traces on exceptions with severity `Error` or higher and SHOULD allow users to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
7. OTel SDK SHOULD record stack traces on exceptions with severity `Error` or higher and SHOULD allow users to | |
7. The OTel SDK SHOULD record stack traces on exceptions with severity `Error` or higher and SHOULD allow users to |
4. It's NOT RECOMMENDED to record the same error as it propagates through the call stack, or | ||
to attach the same instance of an exception to multiple log records. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In python, because exceptions are objects, if you have exceptions bubbling through nested spans, I'm not sure how you'd be able to achieve this while still being able to distinguish which specific exception was raised through the higher spans.
Here is a concrete example:
import asyncio
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.sdk.trace import ReadableSpan
resource = Resource(attributes={SERVICE_NAME:"test"})
def formatter(span: ReadableSpan) -> str:
out: list[str] = []
out.append(f"{span.name}")
for event in span.events:
if event.name == "exception":
attributes = event.attributes or {}
stacktrace = attributes.get("exception.stacktrace") or ""
out.append(f" {event.name}: {stacktrace}")
out.append('')
return "\n".join(out)
trace_provider = TracerProvider(resource=resource)
processor = BatchSpanProcessor(ConsoleSpanExporter(formatter=formatter))
trace_provider.add_span_processor(processor)
trace.set_tracer_provider(trace_provider)
tracer = trace.get_tracer("test")
async def task_1(fail: bool):
with tracer.start_as_current_span("task_1"):
if fail:
raise ValueError('fail')
async def task_2(fail: bool):
with tracer.start_as_current_span("task_2"):
if fail:
raise ValueError('fail')
async def task_3(fail: bool):
with tracer.start_as_current_span("task_3"):
if fail:
raise ValueError('fail')
async def root(fail_1: bool, fail_2: bool, fail_3: bool) -> None:
with tracer.start_as_current_span("root"):
# We want to run all tasks concurrently, attempting to finish as many as possible, even if some fail
results = await asyncio.gather(task_1(fail_1), task_2(fail_2), task_3(fail_3), return_exceptions=True)
# Here is some concrete logic for handling the produced exceptions,
# but note that we could just as easily raise the last error, ignore certain exception types, etc.
for result in results:
if isinstance(result, Exception):
raise result
async def main():
# No failures
await root(False, False, False)
# All failures
try:
await root(True, True, True)
except Exception as e:
pass # keep the output clean
asyncio.run(main())
If you run this script, you get the following output:
task_1
task_2
task_3
root
task_1
exception: Traceback (most recent call last):
File "python3.10/site-packages/opentelemetry/trace/__init__.py", line 589, in use_span
yield span
File "python3.10/site-packages/opentelemetry/sdk/trace/__init__.py", line 1104, in start_as_current_span
yield span
File "scratch_file.py", line 34, in task_1
raise ValueError('fail')
ValueError: fail
task_2
exception: Traceback (most recent call last):
File "python3.10/site-packages/opentelemetry/trace/__init__.py", line 589, in use_span
yield span
File "python3.10/site-packages/opentelemetry/sdk/trace/__init__.py", line 1104, in start_as_current_span
yield span
File "scratch_file.py", line 40, in task_2
raise ValueError('fail')
ValueError: fail
task_3
exception: Traceback (most recent call last):
File "python3.10/site-packages/opentelemetry/trace/__init__.py", line 589, in use_span
yield span
File "python3.10/site-packages/opentelemetry/sdk/trace/__init__.py", line 1104, in start_as_current_span
yield span
File "scratch_file.py", line 46, in task_3
raise ValueError('fail')
ValueError: fail
root
exception: Traceback (most recent call last):
File "python3.10/site-packages/opentelemetry/trace/__init__.py", line 589, in use_span
yield span
File "python3.10/site-packages/opentelemetry/sdk/trace/__init__.py", line 1104, in start_as_current_span
yield span
File "scratch_file.py", line 59, in root
raise result
File "scratch_file.py", line 34, in task_1
raise ValueError('fail')
ValueError: fail
With the current behavior, you can tell by looking in the stack trace of the exception bubbling through the "root"
span that the raised exception was coming from task_1
. However, if we were to only record each exception once, I'm not sure how you'd be able to determine which exception was raised through the "root"
span (e.g., if the code were changed to raise the last exception rather than the first).
While this may feel somewhat contrived (a consequence of trying to keep the example fairly minimal), I am confident that there are lots of python use cases today where, in the face of an exception, you might try a fallback codepath, and if that fallback codepath fails you still raise the error coming from the original execution path. (And there are obviously scenarios where you would want to just raise the error of the fallback codepath.) So, if each of those exceptions end up bubbling through their own spans, and subsequently bubble through a parent span, I think we need to retain a way to know which exception was raised through the parent span.
Given these examples, and the flexibility of python, it's hard for me to imagine how you'd be able to know programmatically whether there was information loss if you don't record the same error as it passes through spans higher in the call stack, therefore I would consider it a regression if we lost the current behavior of recording an event every time an exception passes through a span, even if the same exception has already been recorded as passing through some child span. (And in particular, the events should retain the full stack trace — note that the otel status message isn't enough to distinguish the specific exception in the example above, but the stack trace is enough.)
I see this note below:
It also may be desirable by some vendors/apps to record all exception details at all levels.
As long as this can be done for an arbitrary python application and isn't dependent on the instrumentations or otel SDK behavior, then I think everything is okay, I just want to point out that retaining that option is important, at least for python.
Related to open-telemetry/semantic-conventions#1536
Changes
Recording exceptions as span events is problematic since it
This OTEP provides guidance on how to record exceptions using OpenTelemetry logs focusing on minimizing duplication and providing context to reduce the noise.
If accepted, the follow-up spec changes are expected to replace existing (stable) documents:
Related OTEP(s) #CHANGELOG.md
file updated for non-trivial changesspec-compliance-matrix.md
updated if necessary