Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[v2][adjuster] Enhance Span Hash Adjuster For Spans That Have Already Been Hashed #6393

Open
mahadzaryab1 opened this issue Dec 22, 2024 · 9 comments · May be fixed by #6499
Open

[v2][adjuster] Enhance Span Hash Adjuster For Spans That Have Already Been Hashed #6393

mahadzaryab1 opened this issue Dec 22, 2024 · 9 comments · May be fixed by #6499
Labels
good first issue Good for beginners help wanted Features that maintainers are willing to accept but do not have cycles to implement

Comments

@mahadzaryab1
Copy link
Collaborator

mahadzaryab1 commented Dec 22, 2024

Originally posted by @yurishkuro in #6391 (comment)

  1. Some storage backends (Cassandra, in particular), perform similar deduping by computing a hash before the span is saved and using it as part of the partition key (it creates tombstones if identical span is saved 2 times or more but no dups on read). So we could make this hashing process to be a part of the ingestion pipeline (e.g. in sanitizers) and simply store it as an attribute on the span. Then this adjuster would be "lazy", it will only recompute the hash if it doesn't already exist in the storage.

  2. If we do this on the write path, we would want this to be as efficient as possible, so we would need to implement manual hashing by iterating through the attributes (and pre-sorting them to avoid dependencies) and but manually going through all fields of the Span / SpanEvent / SpanLink. The reason I was reluctant to do that in the past was to avoid unintended bugs if the data model was changed, like a new field added that we'd forget to add to the hash function. To protect against that we probably could use some fuzzing tests, by setting / unsetting each field individually and making sure the hash code changes as a result.

@mahadzaryab1 mahadzaryab1 changed the title Enhance Span Hash Adjuster For Spans That Have Already Been Hashed [v2][adjuster] Enhance Span Hash Adjuster For Spans That Have Already Been Hashed Dec 22, 2024
@yurishkuro yurishkuro added help wanted Features that maintainers are willing to accept but do not have cycles to implement good first issue Good for beginners and removed performance storage/cassandra v2 labels Dec 22, 2024
@yurishkuro
Copy link
Member

@suryaaprakassh What guidance do you need?

@zzzk1
Copy link
Contributor

zzzk1 commented Dec 25, 2024

@yurishkuro Has anyone started this? If not, I’d like to try.

@ADI-ROXX
Copy link
Contributor

@yurishkuro Regarding the issue, I was able to understand the first point that you made that we want to add hash in the span as a new field so that we are calculating hashes of only those that are not yet calculated.

I have doubts regarding the 2nd point that you are making. Can you provide a sample testcase where, if we do not include the 2nd point in the final code, there might be an issue?

I think that if, in the future, a new attribute is added to the hash, it might not be that big of an issue because we just have to change the input to the hash function. That is, while adding a new field in the span, simultaneously change the hashing code so that the issue is resolved.
That is what I am able to see for now. If there is anything that I am missing, can you please point it out (especially on the 2nd point)?

@yurishkuro
Copy link
Member

That is, while adding a new field in the span, simultaneously change the hashing code so that the issue is resolved.

What if we forget to do that? The intent of item (2) is to ensure that we will have some tests fail if we forget to add a new field to the hash function. A solution that does not provide this guarantee is unsafe and can lead to future bugs.

@ADI-ROXX
Copy link
Contributor

ADI-ROXX commented Dec 25, 2024

A solution that does not provide this guarantee is unsafe and can lead to future bugs.

@yurishkuro
For this two cases are possible:

  1. The new field is intended to be used in the hashing function: In this case, if there is a fail in the test case, it signifies that there is some issue in the code.
  2. The new field added is not intended to be used in the hashing function: In this case, if there is a faile in a test case, it falsely signifies that there is some issue in the code.

Instead of adding unit tests, there is a simpler solution that I have that could prevent this issue. Keep an array that stores what all fields have to be used for hashing and what will be their order.

Lets say that following are the fields of a span:
[traceID, spanID, operationName, warnings, process]

and out of these [traceID, spanID, operationName] are the ones that are used for hashing.

So, what we can do is that we can define an array that will contain what field to be used.

So the array might look like the following:

Index-----------Value
0 --------------traceID
1 --------------spanID
2 -------------- operationName

and if we want to add a new field, we can just add it to the array as follows, and it will be adjusted automatically by the hashing function.

Index ------- Value
0 -------------- traceID
1 --------------newField
2 -------------- spanID
3 --------------operationName

@yurishkuro
Copy link
Member

@ADI-ROXX

  1. All public fields are meant to be used in a hash.
  2. Your solution requires reflection, which is inefficient at runtime. I expect the hash functions to access data directly via compile-time functions. However, I am not opposed to the code for such functions to be auto-generated from the underlying data types, and during the generation it's fine to use reflection.

@ADI-ROXX
Copy link
Contributor

Your solution requires reflection, which is inefficient at runtime.

@yurishkuro I understood the concern. The problem being that for all the spans during deployment, we cannot use reflection as consumes the resource.

So the solution I can think of is a combination of my proposed solution and yours.
There will be an array that would contain the order of parameters that have to be used for hashing:

Index-----------Value
0 --------------traceID
1 --------------spanID
2 -------------- operationName

And during the testing, it will be checked whether the hash of the span that comes out of the adjuster is the same or not.

So this will reduce the headache of adding custom tests, and the user just has to make changes in the ordering array, and everything will be done automatically.

Although this might not be efficient during the testing phase, I don't think it matters.

@yurishkuro
Copy link
Member

it's ok to use reflection in the tests. The "array" is irrelevant since it can be constructed via reflection from the actual types.

@adityachopra29
Copy link
Contributor

Hey, I spent some time understanding this issue, and found it interesting.
I think the following steps will be required to solve this:

  • In the processor(or sanitizer), create a hash function with arbitrary number of inputs to return hash function as used by the adjuster in [v2][adjuster] Implement adjuster for deduplicating spans #6391
  • Change all locations in code where the processor is being used
  • Create tests for testing hash function, as described in point 2.

Is this issue still being worked on or can I work on it @chahatsagarmain @yurishkuro ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Good for beginners help wanted Features that maintainers are willing to accept but do not have cycles to implement
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants