Keyfield vs parent child field #4

broex · 2016-02-03T14:33:37Z

When using keyfields of parent-child fields there is a difference in the time it takes to diff two csv's.
With a small number of records this is not a big issue, however, with 20.000+ record using the keyfields function takes a lot more time.
Is there a reason the keyfield function takes more time, and is there a way to speed up this process?

agardiner · 2016-02-04T12:28:39Z

Key fields are essential to how csv-diff works, and are always used (if no key field is specified, the first field is used as the key field). csv-diff uses the notion of key field(s) to allow the diff process to identify differences regardless of the relative locations of equivalent records in two files. This is very different to the way a normal diff process works, but it is a key feature as it means that csv-diff can correctly identify a single change (e.g. a move of a parent in a tree) that may result in a large number of lines (e.g. all descendants) ending up in different locations in the two files.

When performing the diff process, each source (i.e. to/from or left/right) is first indexed using the key fields. Next, the diff process iterates over one of the sources, probing for equivalent records in the other using the index created on the key fields. Because of the need to create the indexes, diff time is a function of the size of the inputs, and currently exponential based on source size.

The key difference between using key_fields option vs parent_fields and child_fields comes about when there is more than one field on which to index, and a subset of the fields represents a logical parent record. If you use key_fields with multiple fields, there is always an implicit conversion to a parent field set that includes all but the last field, which is then made the child field; as such, if you have multiple key fields, it is always advisable to use the parent_fields and child_fields options to correctly identify the logical parent (if any).

When a diff is using a notion of parents, it creates multiple indexes, one for each unique parent, witihn which each child is indexed. This is probably a key reason why larger files take much longer to diff.

The current implementation is fairly naive, and leaves plenty of room for improvement - as always, time is the limiting factor.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Keyfield vs parent child field #4

Keyfield vs parent child field #4

broex commented Feb 3, 2016

agardiner commented Feb 4, 2016

Keyfield vs parent child field #4

Keyfield vs parent child field #4

Comments

broex commented Feb 3, 2016

agardiner commented Feb 4, 2016