When using key fields or parent-child fields, there is a difference in the time it takes to diff two CSV files.
With a small number of records this is not a big issue; however, with 20,000+ records, using the key fields option takes a lot more time.
Is there a reason the key fields option takes more time, and is there a way to speed up this process?
Key fields are essential to how csv-diff works, and are always used (if no key field is specified, the first field is used as the key field). csv-diff uses the notion of key field(s) to allow the diff process to identify differences regardless of the relative locations of equivalent records in two files. This is very different to the way a normal diff process works, but it is a key feature as it means that csv-diff can correctly identify a single change (e.g. a move of a parent in a tree) that may result in a large number of lines (e.g. all descendants) ending up in different locations in the two files.
When performing the diff process, each source (i.e. to/from or left/right) is first indexed using the key fields. Next, the diff process iterates over one of the sources, probing for equivalent records in the other using the index created on the key fields. Because of the need to create the indexes, diff time is a function of the size of the inputs, and it currently grows exponentially with source size.
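To make the index-then-probe idea concrete, here is a minimal Python sketch. It is not csv-diff's actual code; the names `index_by_key`, `keyed_diff` and `read_rows` are hypothetical, and the second pass that detects added records is an assumption about how additions could be found.

```python
import csv
import io


def index_by_key(rows, key_fields):
    """Map each record's key tuple to the record itself."""
    return {tuple(row[f] for f in key_fields): row for row in rows}


def keyed_diff(left_rows, right_rows, key_fields):
    """Report adds, deletes and updates, ignoring record order."""
    left = index_by_key(left_rows, key_fields)
    right = index_by_key(right_rows, key_fields)
    diffs = {}
    # Iterate over one source and probe the other source's index.
    for key, row in left.items():
        if key not in right:
            diffs[key] = "delete"
        elif row != right[key]:
            diffs[key] = "update"
    # Assumed second pass: keys only present on the right are additions.
    for key in right:
        if key not in left:
            diffs[key] = "add"
    return diffs


def read_rows(text):
    """Parse CSV text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


old = read_rows("id,name\n1,alice\n2,bob\n3,carol\n")
new = read_rows("id,name\n3,carol\n1,alice\n2,robert\n4,dave\n")
result = keyed_diff(old, new, ["id"])
```

In this sketch the records for ids 1 and 3 sit on different lines in the two inputs, yet only the changed record (id 2) and the new record (id 4) are reported, because matching is done by key rather than by position.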
The key difference between using key_fields option vs parent_fields and child_fields comes about when there is more than one field on which to index, and a subset of the fields represents a logical parent record. If you use key_fields with multiple fields, there is always an implicit conversion to a parent field set that includes all but the last field, which is then made the child field; as such, if you have multiple key fields, it is always advisable to use the parent_fields and child_fields options to correctly identify the logical parent (if any).
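The implicit conversion described above can be sketched in a few lines of Python. This only illustrates the rule as stated (all but the last key field become the parent, the last becomes the child); `split_key_fields` is a hypothetical helper, not part of csv-diff, and the field names are made up.

```python
def split_key_fields(key_fields):
    """Split key fields into (parent_fields, child_fields).

    All but the last field form the parent; the last field is the child.
    """
    fields = list(key_fields)
    return fields[:-1], fields[-1:]


parent_fields, child_fields = split_key_fields(["region", "dept", "emp_id"])
```

If the logical parent of a record is only `region`, this default split is wrong, which is why naming the parent and child fields explicitly is the safer choice.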
When a diff is using a notion of parents, it creates multiple indexes, one for each unique parent, within which each child is indexed. This is probably a key reason why larger files take much longer to diff.
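One way to picture the per-parent indexes is a two-level mapping: parent key to an inner index of child key to record. The Python sketch below is an illustration under that assumption, not csv-diff's implementation; `index_by_parent` and the sample fields are hypothetical.

```python
from collections import defaultdict


def index_by_parent(rows, parent_fields, child_fields):
    """Build one child index per unique parent key."""
    index = defaultdict(dict)
    for row in rows:
        parent = tuple(row[f] for f in parent_fields)
        child = tuple(row[f] for f in child_fields)
        # Each unique parent gets its own inner index of children.
        index[parent][child] = row
    return index


rows = [
    {"dept": "eng", "emp": "alice", "role": "dev"},
    {"dept": "eng", "emp": "bob", "role": "qa"},
    {"dept": "ops", "emp": "carol", "role": "sre"},
]
index = index_by_parent(rows, ["dept"], ["emp"])
```

With this layout, the number of inner indexes equals the number of unique parents, so a file with many distinct parents builds many small indexes.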
The current implementation is fairly naive, and leaves plenty of room for improvement - as always, time is the limiting factor.