[FEA] Investigate if any dual-stream ORC data type other than timestamp can trigger the "desync" decoding bug #17738

Open
kingcrimsontianyu opened this issue Jan 14, 2025 · 1 comment
Assignees
Labels
cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.

Comments

@kingcrimsontianyu
Contributor

kingcrimsontianyu commented Jan 14, 2025

This is an open question remaining after the fix (#17570) for the reported ORC decoding bug (#17155).

What is known so far:

  • The bug can occur with ORC v0.12 timestamp data (decoded by the RLEv2 decode function), which consists of two streams: DATA (encoding the "seconds" component) and SECONDARY (encoding the "nanoseconds" component).
  • The bug is not expected to occur with v0.11 timestamp data (decoded by the RLEv1 decode function). In v0.11 the run length is represented by 7 bits for both the "runs" and the "literals", so the maximum is 127 instead of 512 as in v0.12. With 1024 as the same initial limit on "max data to be consumed", the decoder can always consume enough runs from the SECONDARY stream that its progress stays ahead of the DATA stream's, never behind it.
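
To make the run-length argument above concrete, here is a toy model in Python. It is not libcudf's decoder: it only assumes, for illustration, that a decoder takes whole runs and stops as soon as the next run would push it past the 1024 limit. The function names `consume_whole_runs` and `worst_case_consumed` are hypothetical. Under that assumption, the fewest values a stream can deliver in one pass is `limit - max_run + 1`, so a smaller maximum run length leaves much less room for one stream to fall behind the other.

```python
def consume_whole_runs(run_lengths, limit=1024):
    """Toy model (assumption, not libcudf code): take whole runs while
    the running total stays within `limit`, then stop."""
    consumed = 0
    for length in run_lengths:
        if consumed + length > limit:
            break  # the next run does not fit; stop before it
        consumed += length
    return consumed


def worst_case_consumed(max_run, limit=1024):
    """Smallest total the toy decoder can stop at when every run is
    at most `max_run` values long and the stream has data left."""
    return limit - max_run + 1


# Max run of 512 (the v0.12 figure quoted above): progress can stall at 513.
print(consume_whole_runs([512, 1, 512]), worst_case_consumed(512))

# Max run of 127 (the v0.11 figure quoted above): progress reaches at least 898.
print(consume_whole_runs([127] * 7 + [9, 127]), worst_case_consumed(127))
```

In this sketch the v0.12-style stream may stop after 513 values while the v0.11-style stream always gets to at least 898, which matches the intuition that short runs keep both streams close to the limit. Whether the real decoder behaves this way for other dual-stream types is exactly what remains to be checked.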

Our current question is:

  • Are other data types composed of more than one stream, such as string, char, varchar, binary, and decimal, subject to the same "desync" bug that affected timestamp v0.12?

PS: Relevant sketch demonstrating the difference between v0.12 (marked as V2 on the diagram) and v0.11 (marked as V1): https://sketchtoy.com/71353765

@kingcrimsontianyu kingcrimsontianyu added the feature request New feature or request label Jan 14, 2025
@kingcrimsontianyu kingcrimsontianyu self-assigned this Jan 14, 2025
@kingcrimsontianyu
Contributor Author

cc @GregoryKimball

@GregoryKimball GregoryKimball added libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue labels Jan 21, 2025
@GregoryKimball GregoryKimball moved this to To be revisited in libcudf Jan 21, 2025
@GregoryKimball GregoryKimball removed the status in libcudf Jan 21, 2025