Add conduit parser #2

Open
wants to merge 37 commits into master

Conversation

@awkure awkure commented May 17, 2021

Sometimes when we want to parse large .xlsx files we run into the problem of exponentially increasing memory consumption. This PR aims to deal with that problem by introducing a conduit-based parser that attempts to run in constant memory.
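To make the intended usage concrete, here is a minimal sketch of how a streaming parser like this could be consumed with conduit so the whole workbook never has to live in memory at once. `Row` and `sourceSheetRows` are hypothetical placeholders, not the API added by this PR; only the conduit combinators (`runConduitRes`, `.|`, `yieldMany`, `lengthC`) are real.

```
import Conduit
import Control.Monad.Trans.Resource (ResourceT)

-- Hypothetical row type: one row as a list of textual cell values.
type Row = [String]

-- Hypothetical streaming source standing in for whatever this PR exposes:
-- it yields one Row at a time instead of materialising the whole workbook,
-- which is what keeps memory usage roughly constant.
sourceSheetRows :: FilePath -> ConduitT () Row (ResourceT IO) ()
sourceSheetRows _path = yieldMany [["a", "1"], ["b", "2"]]  -- stub data

main :: IO ()
main = do
  -- Count rows without ever holding more than one row in memory.
  n <- runConduitRes (sourceSheetRows "big.xlsx" .| lengthC)
  print (n :: Int)
```

Swapping `lengthC` for any other fold or sink (writing CSV, accumulating statistics, etc.) keeps the same constant-memory shape, because only one row is in flight at a time.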

Here are some benchmarks:

[benchmark results image]

And how they deal with memory:

Default parser
[memory profile image]

Conduit parser
[memory profile image]

At larger scale it does appear to run in constant memory; consider a file with 1 million rows:
[memory profile image]

And here is the comparison with the Python xlsx2csv streaming library on a 1-million-row file (spoiler: not much of a difference here):

```
$ time python3 ./bench.py --takeyourtime
2988.90 real      2984.17 user         1.98 sys
$ time ./conduit --takeyourtime
2686.39 real       409.44 user       841.22 sys
```

jappeace added 30 commits May 17, 2021 20:53
This allows us to add stream support per row.
Which should be good enough
too much work, let's keep original design
This made me realize I should undo these changes,
it would make the PR easier to accept as well and we don't really
need it.
also export the various lenses for SheetItem.
awkure added 6 commits May 17, 2021 20:54
Problem: We want the project to be as reproducible as possible.
As a Mac user, I currently cannot build it because Haskell
tooling is not very stable on this system. More concretely, in
my case, while building it in my global environment I got the
following cryptic error:

```
streamin: src/Data/Conduit/Internal/Pipe.hs:(413,5)-(418,38): Non-exhaustive patterns in function go
```

And this was at runtime! I even tried to make a local fork of the
conduit library and fix the error, but the problem is that it built
fine on my machine, and when I tried to use it in this project I
got an even more cryptic and unhelpful error:

```
streamin: internal error: stg_v_ap_get
  (GHC version 8.8.4 for x86_64_apple_darwin)
  Please report this as a GHC bug: https://www.haskell.org/ghc/reportabug
```

So I ran out of patience and decided to build it via nix.

Solution: Add a nix flakes template, a cabal.project for the cabal
environment, and a shell for HLS.
For now the conduit parser is not able to parse all possible xlsx
files; we need to change that and also be able to parse untyped
values. We need to optimize the parser aggressively, since using
conduits is not as fast as loading everything into memory. Docs
also need to be added.
Make criterion collect info about all three parsers
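For reference, a criterion harness along these lines might look roughly like the sketch below; the three parse functions are hypothetical stand-ins for the parsers being compared, not the actual benchmark code in this PR, while the `Criterion.Main` API (`defaultMain`, `bgroup`, `bench`, `nfIO`) is real.

```
import Criterion.Main (bench, bgroup, defaultMain, nfIO)

-- Hypothetical stand-ins for the three parsers being compared; each is
-- assumed to return something forceable, e.g. a row count.
parseInMemory, parseLazy, parseWithConduit :: FilePath -> IO Int
parseInMemory    _ = pure 0  -- placeholder
parseLazy        _ = pure 0  -- placeholder
parseWithConduit _ = pure 0  -- placeholder

main :: IO ()
main = defaultMain
  [ bgroup "parse data/50k.xlsx"
      [ bench "in-memory" $ nfIO (parseInMemory    "data/50k.xlsx")
      , bench "lazy"      $ nfIO (parseLazy        "data/50k.xlsx")
      , bench "conduit"   $ nfIO (parseWithConduit "data/50k.xlsx")
      ]
  ]
```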
@awkure awkure requested a review from markflorisson May 17, 2021 18:04
@awkure awkure self-assigned this May 17, 2021