Reference to line/byte from each segment #26

nerdoc · 2020-08-24T17:20:10Z

Btw: It would be nice if each segment had a reference to the exact input line (byte position start/end) to ease debugging. That might also help handling malformed files. (I just swapped our self-written edifact parser against pydifact and it was a pretty pleasant experience. However exact byte positions was the one feature we lost in that process.)

Originally posted by @FelixSchwarz in #24 (comment)

nerdoc · 2020-08-24T17:26:18Z

This is not completely trivial. pydifact stores its data in lists, and does not create Token objects. This is mainly for performance reasons. Creating an own object for each Token would cause much more memory usage. But OTOH, it's not possible to store information additionally to the Token into the list, without breaking compatibility to all structures using Tokens and Segments. One approach would be, creating a separate list in parallel, where line information is stored with the matching index like in the token list. This seems a bit clumsy to me, but maybe is the most straightforward way to go.

Refactoring pydifact to use Token objects (with a linenumber/byte attribute is a no-go IMHO, as it uses to much overhead.
Even the "parallel list" approach could be made "switch-off"-able - to use less memory when parsing bigger EDI files, at the cost of one additional if clause in each iter.
I'll test that a bit.

nerdoc · 2020-08-24T17:42:57Z

OMG. Big mistake. I should read my own code. Tokens ARE in fact objects, so we can store line information in them. ;-)

FelixSchwarz · 2020-08-24T17:46:36Z

I also like using namedtuple. Previously our own parser used something like this

Segment = namedtuple('Segment', ('data', 'line_nr', 'offset_start', 'offset_end', 'raw_data'))

You could easily add some extra methods there for extra convenience like you do now for Token.

nerdoc · 2020-08-24T19:07:51Z

Maybe it would be an option to skip the step of creating Segments as nested lists, and instead directly create the Segment objects using the factory.

nerdoc mentioned this issue May 6, 2021

New release with changes? #39

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reference to line/byte from each segment #26

Reference to line/byte from each segment #26

nerdoc commented Aug 24, 2020

nerdoc commented Aug 24, 2020

nerdoc commented Aug 24, 2020

FelixSchwarz commented Aug 24, 2020

nerdoc commented Aug 24, 2020

Reference to line/byte from each segment #26

Reference to line/byte from each segment #26

Comments

nerdoc commented Aug 24, 2020

nerdoc commented Aug 24, 2020

nerdoc commented Aug 24, 2020

FelixSchwarz commented Aug 24, 2020

nerdoc commented Aug 24, 2020