Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Keyfield vs parent child field #4

Open
broex opened this issue Feb 3, 2016 · 1 comment
Open

Keyfield vs parent child field #4

broex opened this issue Feb 3, 2016 · 1 comment

Comments

@broex
Copy link

broex commented Feb 3, 2016

When using keyfields of parent-child fields there is a difference in the time it takes to diff two csv's.
With a small number of records this is not a big issue, however, with 20.000+ record using the keyfields function takes a lot more time.
Is there a reason the keyfield function takes more time, and is there a way to speed up this process?

@agardiner
Copy link
Owner

Key fields are essential to how csv-diff works, and are always used (if no key field is specified, the first field is used as the key field). csv-diff uses the notion of key field(s) to allow the diff process to identify differences regardless of the relative locations of equivalent records in two files. This is very different to the way a normal diff process works, but it is a key feature as it means that csv-diff can correctly identify a single change (e.g. a move of a parent in a tree) that may result in a large number of lines (e.g. all descendants) ending up in different locations in the two files.

When performing the diff process, each source (i.e. to/from or left/right) is first indexed using the key fields. Next, the diff process iterates over one of the sources, probing for equivalent records in the other using the index created on the key fields. Because of the need to create the indexes, diff time is a function of the size of the inputs, and currently exponential based on source size.

The key difference between using key_fields option vs parent_fields and child_fields comes about when there is more than one field on which to index, and a subset of the fields represents a logical parent record. If you use key_fields with multiple fields, there is always an implicit conversion to a parent field set that includes all but the last field, which is then made the child field; as such, if you have multiple key fields, it is always advisable to use the parent_fields and child_fields options to correctly identify the logical parent (if any).

When a diff is using a notion of parents, it creates multiple indexes, one for each unique parent, witihn which each child is indexed. This is probably a key reason why larger files take much longer to diff.

The current implementation is fairly naive, and leaves plenty of room for improvement - as always, time is the limiting factor.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants