-
-
Notifications
You must be signed in to change notification settings - Fork 44
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reference to line/byte from each segment #26
Comments
This is not completely trivial. pydifact stores its data in lists, and does not create Token objects. This is mainly for performance reasons. Creating an own object for each Token would cause much more memory usage. But OTOH, it's not possible to store information additionally to the Token into the list, without breaking compatibility to all structures using Tokens and Segments. One approach would be, creating a separate list in parallel, where line information is stored with the matching index like in the token list. This seems a bit clumsy to me, but maybe is the most straightforward way to go. Refactoring pydifact to use Token objects (with a linenumber/byte attribute is a no-go IMHO, as it uses to much overhead. |
OMG. Big mistake. I should read my own code. Tokens ARE in fact objects, so we can store line information in them. ;-) |
I also like using
You could easily add some extra methods there for extra convenience like you do now for |
Maybe it would be an option to skip the step of creating Segments as nested lists, and instead directly create the Segment objects using the factory. |
Btw: It would be nice if each segment had a reference to the exact input line (byte position start/end) to ease debugging. That might also help handling malformed files. (I just swapped our self-written edifact parser against pydifact and it was a pretty pleasant experience. However exact byte positions was the one feature we lost in that process.)
Originally posted by @FelixSchwarz in #24 (comment)
The text was updated successfully, but these errors were encountered: