Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sort extensions during JSON serialization #1117

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

sargas
Copy link

@sargas sargas commented Jan 18, 2025

This PR proposes to make JSON serialization deterministic, i.e., always produce the same result for the same input. This matches Go's behavior for built-in types, which notably will explicitly sort fields within a map when serializing.

v2/event/event_marshal.go's WriteJson's only non-deterministic behavior is an iteration over a map of extensions, which is performed in random order. This PR takes a similar approach as encoding/json, without their logic for "resolving" key names (since extension names are always strings), and without the use of reflect.Value.

The motivation behind this change is a subtle dependence that the Restate framework has on JSON ordering being deterministic. This framework allows for code blocks that are non-deterministic (e.g., creating a CloudEvent Event with the current time), but then expects when retrying that the journal is "consistent" on replays before any errors. In affect, this means that normal code should be deterministic. This breaks down when a CloudEvent is passed as input to a service and serialized as JSON by default: even when the Event hasn't changed, the serialization will produce different bytes, causing the framework to believe the code is non-deterministic when everything was the same until the final serialization of a CloudEvent.

For this framework, it is easy to work around - we need to manually convert to JSON and normalize it (e.g., by converting to a map and then back to JSON, to make use of golang's deterministic JSON serialization of maps). It is an unfortunate extra step caused by the CloudEvent SDK's use of a non-deterministic serialization and, at least in this use, unexpected difference given the behavior of the standard library.

@sargas sargas requested a review from a team as a code owner January 18, 2025 15:34
for k, v := range ext {
extensionKeyValues = append(extensionKeyValues, keyValue{k, v})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor thing, but this makes a copy of "v", which if large, could be an issue.
What do you think about just putting the ext names into a slice, sort it, then then iterate over the slice and grab the values from the ext map? Then the only copying is the ext key names.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion, that actually simplified the code nicely since I didn't need this keyValue pair type when I'm just working with the extension names.

I've updated the code accordingly

This matches Go's behavior for built-in types, which
uses deterministic ordering (alphabetical for maps and
by order defined in file for structs).

Signed-off-by: Joseph Jon Booker <joe@neoturbine.net>
@sargas sargas force-pushed the deterministic-json-serialization branch from 4c40d06 to 098365f Compare January 19, 2025 04:23
@embano1
Copy link
Member

embano1 commented Jan 19, 2025

@sargas I'm not an expert with Restate, but looking at the Restate SDK/APIs why could one not wrap the creation/serialization of a CloudEvent into a Run() call which will be journaled and replayed. Or is the issue actually during the journaling and how Restate (de)serializes JSON? Can you provide a code example so we can discuss?

I'm a bit wary introducing changes to our core marshaling logic since I don't yet understand 1/ the impact it has to other users (perhaps relying on the current non-deterministic marshaling behavior) and 2/ the performance impact of the proposed code change.

@sargas
Copy link
Author

sargas commented Jan 21, 2025

@embano1

I'm not an expert with Restate, but looking at the Restate SDK/APIs why could one not wrap the creation/serialization of a CloudEvent into a Run() call which will be journaled and replayed. Or is the issue actually during the journaling and how Restate (de)serializes JSON?

The last part is the correct - Restate has an implicit assumption that the serialization (using json.Marshal by default) doesn't change as long as the inputs are the same, which is true for all types in Go by default unless a custom MarshalJSON does otherwise.

Can you provide a code example so we can discuss?

import (
	"context"
	"fmt"
	cloudevents "github.com/cloudevents/sdk-go/v2"
	"github.com/google/uuid"
	"log"
	"math/rand/v2"
	"time"

	restate "github.com/restatedev/sdk-go"
	"github.com/restatedev/sdk-go/server"
)

func main() {
	if err := server.NewRestate().
		Bind(restate.Reflect(FirstService{})).
		Bind(restate.Reflect(SecondService{})).
		Start(context.Background(), ":9080"); err != nil {
		log.Fatal(err)
	}
}

type SecondService struct{}

func (SecondService) Process(_ restate.Context, _ cloudevents.Event) error {
	return nil
}

type FirstService struct{}

func (FirstService) Handle(ctx restate.Context) error {
	event, _ := restate.Run(ctx, func(ctx restate.RunContext) (cloudevents.Event, error) {
		event := cloudevents.NewEvent()
		event.SetID(uuid.NewString())
		event.SetSource("example/uri")
		event.SetType("example.type")
		event.SetTime(time.Now())
		event.SetExtension("a", "b")
		event.SetExtension("c", "d")
		event.SetExtension("e", "f")
		return event, nil
	})

	_, _ = restate.Service[restate.Void](ctx, "SecondService", "Process").Request(event)
	if rand.Float32() < 0.5 {
		return nil
	} else {
		return fmt.Errorf("failing with purposeful error")
	}
}

Requests to trigger FirstService/Handle will create the cloudevents.Event (inside restate.Run, since the current time is being used), pass it to the second service, and half the time go on happily.
Otherwise, if we randomly hit the error condition at the end, then the FirstService.Handle method should be called again - the cloudevents.Event from the first run will be returned from the restate.Run line from the journal, and the same (ignored) null response should be returned from the restate.Service[..](..).Request as that call is the next one in the journal. This may retry any number of times depending on the random number generator, but we shouldn't ever create more than one cloud event value or call the SecondService more than once.

The issue here is that since event serializes randomly when extensions are present, the journal entries created during retries can be inconsistent, leading to an error like:

2025/01/20 22:30:16 ERROR Journal mismatch: Replayed journal entries did not correspond to the user code. The user code has to be deterministic! method=FirstService/Handle invocationID=inv_1gdbLDeS59C52GA3G1nv6lPHPgBoN7LaHn expectedType=*wire.CallEntryMessage expectedMessage="{\"service_name\":\"SecondService\",\"handler_name\":\"Process\",\"parameter\":\"eyJzcGVjdmVyc2lvbiI6IjEuMCIsImlkIjoiMmMzNzJlY2YtNzY2Zi00MzcwLWI2YmMtZGIxNTE1M2M2OTlkIiwic291cmNlIjoiZXhhbXBsZS91cmkiLCJ0eXBlIjoiZXhhbXBsZS50eXBlIiwidGltZSI6IjIwMjUtMDEtMjFUMDQ6MzA6MTYuNTc0NTgxNzM3WiIsImUiOiJmIiwiYSI6ImIiLCJjIjoiZCJ9\",\"Result\":null}" actualType=*wire.CallEntryMessage actualMessage="{\"service_name\":\"SecondService\",\"handler_name\":\"Process\",\"parameter\":\"eyJzcGVjdmVyc2lvbiI6IjEuMCIsImlkIjoiMmMzNzJlY2YtNzY2Zi00MzcwLWI2YmMtZGIxNTE1M2M2OTlkIiwic291cmNlIjoiZXhhbXBsZS91cmkiLCJ0eXBlIjoiZXhhbXBsZS50eXBlIiwidGltZSI6IjIwMjUtMDEtMjFUMDQ6MzA6MTYuNTc0NTgxNzM3WiIsImEiOiJiIiwiYyI6ImQiLCJlIjoiZiJ9\",\"Result\":{\"Value\":\"\"}}"

Base64 decoding the "parameter" in the actual and expected journal messages shows they only differ in the order of the extensions in the JSON-serialized form of the event.

I'm a bit wary introducing changes to our core marshaling logic

I'm completely sympathetic to being cautious here - that's part of why I worded this PR as a proposal. I don't think there's anything wrong with the existing code (a Marshaler has to create valid JSON, but it isn't documented that it needs to be deterministic). My example above can have workarounds added, like dealing with strings or []byte's instead of cloudevents.Event or converting back and forth to a map to get a sort of normalized JSON.

At the same time, something is off about cloudevents.Event - it's rare to see a type that doesn't have deterministic marshaling. Restate is the first place I've seen that is affected by this, but given how widespread this convention is, it may not be the last piece of code with an implicit assumption that objects have a stable serialization.

the impact it has to other users (perhaps relying on the current non-deterministic marshaling behavior)

I'm not sure anyone would expect this, let alone rely on it. I'd expect the opposite is more common (people or software like Restate that assumes it would be deterministic).

the performance impact of the proposed code change.

I don't have an answer to this, but I'll note that this PR still keeps the marshaling simpler then the code used in encoding/json for marshaling maps (since we have string field names). This approach of creating a slice for sorting the names is what the standard library uses anytime the JSON of a map is determined.

@embano1
Copy link
Member

embano1 commented Jan 21, 2025

Thx for the example, I understand the ask better now.

At the same time, something is off about cloudevents.Event - it's rare to see a type that doesn't have deterministic marshaling. Restate is the first place I've seen that is affected by this, but given how widespread this convention is, it may not be the last piece of code with an implicit assumption that objects have a stable serialization.

I disagree, for 1/ the JSON spec is explicit about the unordered behavior of maps (objects) and 2/ even the Go language itself faces this scenario - so I wouldn't say CE is "off" or incorrect per se.

I'm not sure anyone would expect this, let alone rely on it. I'd expect the opposite is more common (people or software like Restate that assumes it would be deterministic).

I don't want to be pedantic, but not only have I been struck by Hyrum's Law couple times, but I see it in my daily job as well. That's why I'm cautious about these changes - not to block them with this excuse, but to raise awareness of the implications if we introduce them because not everyone might be aware that this can be considered a breaking change for (some) users.

Also, as per Go's behavior, developers often "exploit" such randomness e.g., in tests (slightly different scenario, but inherent to our discussion here).

I don't have an answer to this, but I'll note that this PR still keeps the marshaling simpler then the code used in encoding/json for marshaling maps (since we have string field names).

Will review your code again now that I have a better understanding of the issue. I'd thought that introducing ordered behavior requires additional components, such as slices/arrays which allocate and therefore natively have performance implications.

tl;dr I absolutely agree that the current standard behavior in how Go/CE marshals to JSON is an issue, perhaps a wider issue as users might require more deterministic behavior. I just want to understand and discuss the implication if we'd change this behavior for all users. An alternative might be to give users control over the marshaller :) WDYT?

cc/ @duglin

@duglin
Copy link
Contributor

duglin commented Jan 21, 2025

My perspective...

  • if it's random today then code receiving CEs from our SDK MUST be able to handle them being determinate going forward. I have no idea how it couldn't since it could (randomly) always end up being alphabetized
  • performance: if someone has so many CE properties that this one "sort" impacts them I suspect they're sending too much data as CE properties and I suspect this sort probably isn't their only performance concern. CE props are not meant to include all event data, just what's needed to help get it to it's final destination... where they crack open the body
  • I'm anal enough that this one made me smile. I'm so bad that in my xRegistry impl I not only sort things I group the output so that system defined props come first (in order of importance to the user), then a blank line, then sorted extensions. My code doesn't depend on incoming stuff to be this way, but I wanted my output to be "human friendly" and deterministic for internal test validation purposes. If someone ever changes the code so that the output changes (even by one byte) I want that change to be scrutinized because (to me) UX is critical. So, I'm pretty biased here :-)

Now... if someone were to suggest that we demand incoming CE's be sorted.... then that's a different story, obviously.

@duglin
Copy link
Contributor

duglin commented Jan 21, 2025

I will add though... I think Restate is wrong in its processing and in MOST other cases I would tell people to yell at them to fix their code because as @embano1 said... the JSON spec doesn't mandate that things be ordered. I'm just willing to let this ordering happen because I see a potential benefit and no harm to it.

@sargas
Copy link
Author

sargas commented Jan 21, 2025

I disagree, for 1/ the JSON spec is explicit about the unordered behavior of maps (objects) and 2/ even the Go language itself faces this scenario - so I wouldn't say CE is "off" or incorrect per se.

Yeah, I don't think restate is right here either, although I understand their logic in just comparing the opaque serialized output - my companion issue on their end asks about adding some kind of "mime type" so these journal entries can be compared in a more sophisticated way then just byte-by-byte.
I meant that CE is "off" from what Go's json serialization does by default no matter what mixtures of struct's or maps you have, and so would be unexpected for projects that didn't test with unsorted map iteration in a custom MarshalJSON implementation.

I don't want to be pedantic, but not only have I been struck by Hyrum's Law couple times, ...

If clients are depending on extensions being randomly ordered when JSON-serialized despite no documentation saying to expect that, that sounds terrifying to make any changes 😨 I'm not aware of a scenario where a developer would depend on this, but can't rule it out.

I'd thought that introducing ordered behavior requires additional components, such as slices/arrays which allocate and therefore natively have performance implications.

It does, sorry for the confusion - I was comparing to encoding/json in Go's standard library, not the current CE SDK code. Right now:

  • The current CE SDK code does a map iteration for extensions, which is randomly ordered, and adds each extension to the JSON object created as it sees them.
  • The encoding/json module in Go will iterate through a map (using an iterator from reflect module's MapRange, which is documented as having the "the same iteration semantics as a range statement" - so seems random?); figure out what kind of string-like value it can use for the keys; put these into a slice of key, value pairs; sort them; and iterate through that sorted slice to add to the JSON object (code here)
  • This PR will iterate through the map; put the extension names into a slice (we know these are strings from CE standard); sorts them; and then iterates through that sorted slice to add them to the JSON object created for the CE.

So this PR does have some kind of performance cost for sorting extension names and memory cost (currently 16kb per extension, plus overhead for GC?). My point was that the impact is less than what work is normally done performing JSON serialization for maps.

An alternative might be to give users control over the marshaller

Like some kind of static global? That might mess up users' code and tests more. The Marshaler interface doesn't give any flexibility here for customizing the marshaling unfortunately

@embano1
Copy link
Member

embano1 commented Jan 21, 2025

Like some kind of static global? That might mess up users' code and tests more. The Marshaler interface doesn't give any flexibility here for customizing the marshaling unfortunately

Before you wrote this, I also dug into the current implementation and I thought for extensions we'd use encoding/json directly instead of the stream... package. I better understand the issue now on your side.

I ran the existing benchmarks to get a better understanding on the impact:

goos: darwin
goarch: arm64
pkg: github.com/cloudevents/sdk-go/v2/event
cpu: Apple M1 Pro
                                                                                   │ current.txt  │               new.txt               │
                                                                                   │    sec/op    │    sec/op     vs base               │
Marshal/struct_data_v0.3-10                                                          3.553µ ±  4%   3.611µ ±  7%       ~ (p=0.247 n=10)
Marshal/nil_data_v0.3-10                                                             3.308µ ±  7%   3.126µ ±  6%  -5.52% (p=0.003 n=10)
Marshal/string_data_v0.3-10                                                          3.544µ ±  4%   3.659µ ±  7%       ~ (p=0.210 n=10)
Marshal/struct_data_v1.0-10                                                          3.553µ ±  4%   3.557µ ±  4%       ~ (p=0.853 n=10)
Marshal/nil_data_v1.0-10                                                             3.345µ ± 15%   3.588µ ± 11%       ~ (p=0.143 n=10)
Marshal/string_data_v1.0-10                                                          3.406µ ±  2%   3.603µ ±  6%  +5.77% (p=0.019 n=10)
Marshal/base64_json_encoded_data_v1.0-10                                             1.895µ ±  3%   1.850µ ±  6%       ~ (p=0.119 n=10)
Marshal/number_data_v1.0-10                                                          1.960µ ±  8%   1.870µ ±  1%  -4.57% (p=0.000 n=10)
Unmarshal/data,_attributes_and_extensions_and_specversion_with_struct_data_v0.3-10   4.359µ ±  5%   4.366µ ±  2%       ~ (p=0.837 n=10)
Unmarshal/string_data_v1.0-10                                                        4.202µ ±  1%   4.363µ ± 14%  +3.83% (p=0.000 n=10)
Unmarshal/nil_data_v1.0-10                                                           4.058µ ±  4%   4.067µ ± 10%       ~ (p=0.353 n=10)
Unmarshal/struct_data_v0.3-10                                                        4.343µ ± 15%   4.368µ ±  9%       ~ (p=0.631 n=10)
Unmarshal/nil_data_v0.3-10                                                           3.812µ ±  4%   3.884µ ±  2%       ~ (p=0.109 n=10)
Unmarshal/data,_attributes_and_extensions_and_specversion_with_struct_data_v1.0-10   4.537µ ±  9%   4.609µ ±  5%       ~ (p=0.469 n=10)
Unmarshal/base64_json_encoded_data_v1.0-10                                           2.725µ ±  4%   2.695µ ±  2%       ~ (p=0.325 n=10)
Unmarshal/base64_xml_encoded_data_v1.0-10                                            2.837µ ±  4%   2.794µ ±  1%  -1.52% (p=0.003 n=10)
Unmarshal/string_data_v0.3-10                                                        4.203µ ±  5%   4.127µ ±  4%       ~ (p=0.143 n=10)
Unmarshal/struct_data_v1.0-10                                                        4.600µ ± 12%   4.338µ ±  3%  -5.70% (p=0.014 n=10)
geomean                                                                              3.466µ         3.471µ        +0.15%

                                                                                   │ current.txt  │                new.txt                 │
                                                                                   │     B/op     │     B/op      vs base                  │
Marshal/struct_data_v0.3-10                                                          2.228Ki ± 0%   2.465Ki ± 0%  +10.65% (p=0.000 n=10)
Marshal/nil_data_v0.3-10                                                             2.246Ki ± 0%   1.995Ki ± 0%  -11.15% (p=0.000 n=10)
Marshal/string_data_v0.3-10                                                          2.162Ki ± 0%   2.375Ki ± 0%   +9.85% (p=0.000 n=10)
Marshal/struct_data_v1.0-10                                                          2.255Ki ± 0%   2.465Ki ± 0%   +9.31% (p=0.000 n=10)
Marshal/nil_data_v1.0-10                                                             2.247Ki ± 0%   1.995Ki ± 0%  -11.21% (p=0.000 n=10)
Marshal/string_data_v1.0-10                                                          2.216Ki ± 0%   2.418Ki ± 0%   +9.12% (p=0.000 n=10)
Marshal/base64_json_encoded_data_v1.0-10                                               801.0 ± 0%     801.0 ± 0%        ~ (p=1.000 n=10) ¹
Marshal/number_data_v1.0-10                                                          1.563Ki ± 0%   1.563Ki ± 0%        ~ (p=1.000 n=10)
Unmarshal/data,_attributes_and_extensions_and_specversion_with_struct_data_v0.3-10   1.758Ki ± 0%   1.758Ki ± 0%        ~ (p=1.000 n=10) ¹
Unmarshal/string_data_v1.0-10                                                        1.719Ki ± 0%   1.719Ki ± 0%        ~ (p=1.000 n=10) ¹
Unmarshal/nil_data_v1.0-10                                                           1.680Ki ± 0%   1.680Ki ± 0%        ~ (p=1.000 n=10) ¹
Unmarshal/struct_data_v0.3-10                                                        1.734Ki ± 0%   1.734Ki ± 0%        ~ (p=1.000 n=10) ¹
Unmarshal/nil_data_v0.3-10                                                           1.680Ki ± 0%   1.680Ki ± 0%        ~ (p=1.000 n=10) ¹
Unmarshal/data,_attributes_and_extensions_and_specversion_with_struct_data_v1.0-10   1.758Ki ± 0%   1.758Ki ± 0%        ~ (p=1.000 n=10) ¹
Unmarshal/base64_json_encoded_data_v1.0-10                                           1.203Ki ± 0%   1.203Ki ± 0%        ~ (p=1.000 n=10) ¹
Unmarshal/base64_xml_encoded_data_v1.0-10                                            1.227Ki ± 0%   1.227Ki ± 0%        ~ (p=1.000 n=10) ¹
Unmarshal/string_data_v0.3-10                                                        1.719Ki ± 0%   1.719Ki ± 0%        ~ (p=1.000 n=10) ¹
Unmarshal/struct_data_v1.0-10                                                        1.734Ki ± 0%   1.734Ki ± 0%        ~ (p=1.000 n=10) ¹
geomean                                                                              1.718Ki        1.731Ki        +0.75%
¹ all samples are equal

                                                                                   │ current.txt │               new.txt                │
                                                                                   │  allocs/op  │ allocs/op   vs base                  │
Marshal/struct_data_v0.3-10                                                           19.00 ± 0%   21.00 ± 0%  +10.53% (p=0.000 n=10)
Marshal/nil_data_v0.3-10                                                              17.00 ± 0%   18.00 ± 0%   +5.88% (p=0.000 n=10)
Marshal/string_data_v0.3-10                                                           20.00 ± 0%   23.00 ± 0%  +15.00% (p=0.000 n=10)
Marshal/struct_data_v1.0-10                                                           19.00 ± 0%   21.00 ± 0%  +10.53% (p=0.000 n=10)
Marshal/nil_data_v1.0-10                                                              17.00 ± 0%   18.00 ± 0%   +5.88% (p=0.000 n=10)
Marshal/string_data_v1.0-10                                                           19.00 ± 0%   21.00 ± 0%  +10.53% (p=0.000 n=10)
Marshal/base64_json_encoded_data_v1.0-10                                              9.000 ± 0%   9.000 ± 0%        ~ (p=1.000 n=10) ¹
Marshal/number_data_v1.0-10                                                           11.00 ± 0%   11.00 ± 0%        ~ (p=1.000 n=10) ¹
Unmarshal/data,_attributes_and_extensions_and_specversion_with_struct_data_v0.3-10    44.00 ± 0%   44.00 ± 0%        ~ (p=1.000 n=10) ¹
Unmarshal/string_data_v1.0-10                                                         42.00 ± 0%   42.00 ± 0%        ~ (p=1.000 n=10) ¹
Unmarshal/nil_data_v1.0-10                                                            40.00 ± 0%   40.00 ± 0%        ~ (p=1.000 n=10) ¹
Unmarshal/struct_data_v0.3-10                                                         43.00 ± 0%   43.00 ± 0%        ~ (p=1.000 n=10) ¹
Unmarshal/nil_data_v0.3-10                                                            40.00 ± 0%   40.00 ± 0%        ~ (p=1.000 n=10) ¹
Unmarshal/data,_attributes_and_extensions_and_specversion_with_struct_data_v1.0-10    44.00 ± 0%   44.00 ± 0%        ~ (p=1.000 n=10) ¹
Unmarshal/base64_json_encoded_data_v1.0-10                                            25.00 ± 0%   25.00 ± 0%        ~ (p=1.000 n=10) ¹
Unmarshal/base64_xml_encoded_data_v1.0-10                                             25.00 ± 0%   25.00 ± 0%        ~ (p=1.000 n=10) ¹
Unmarshal/string_data_v0.3-10                                                         42.00 ± 0%   42.00 ± 0%        ~ (p=1.000 n=10) ¹
Unmarshal/struct_data_v1.0-10                                                         43.00 ± 0%   43.00 ± 0%        ~ (p=1.000 n=10) ¹
geomean                                                                               25.75        26.56        +3.13%
¹ all samples are equal

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants