Add conduit parser #2

Open
wants to merge 37 commits into master

Conversation

@awkure awkure commented May 17, 2021

Sometimes when we want to parse large .xlsx files we run into the problem of exponentially increasing memory consumption. This PR aims to deal with that problem by introducing a conduit-based parser that attempts to run in constant memory.
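To make the intended usage concrete, here is a minimal sketch of how a streaming parser like this could be consumed with conduit so the whole workbook never has to live in memory at once. `Row` and `sourceSheetRows` are hypothetical placeholders, not the API added by this PR; only the conduit combinators (`runConduitRes`, `.|`, `yieldMany`, `lengthC`) are real.

```
import Conduit
import Control.Monad.Trans.Resource (ResourceT)

-- Hypothetical row type: one row as a list of textual cell values.
type Row = [String]

-- Hypothetical streaming source standing in for whatever this PR exposes:
-- it yields one Row at a time instead of materialising the whole workbook,
-- which is what keeps memory usage roughly constant.
sourceSheetRows :: FilePath -> ConduitT () Row (ResourceT IO) ()
sourceSheetRows _path = yieldMany [["a", "1"], ["b", "2"]]  -- stub data

main :: IO ()
main = do
  -- Count rows without ever holding more than one row in memory.
  n <- runConduitRes (sourceSheetRows "big.xlsx" .| lengthC)
  print (n :: Int)
```

Swapping `lengthC` for any other fold or sink (writing CSV, accumulating statistics, etc.) keeps the same constant-memory shape, because only one row is in flight at a time.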

Here are some benchmarks:

[benchmark results image]

And how they deal with memory:

Default parser
[memory profile image]

Conduit parser
[memory profile image]

At larger scale it does appear to run in constant memory; consider a file with 1 million rows:
[memory profile image]

And here is the comparison with the Python xlsx2csv streaming library on a 1-million-row file (spoiler: not much of a difference here):

```
$ time python3 ./bench.py --takeyourtime
2988.90 real      2984.17 user         1.98 sys
$ time ./conduit --takeyourtime
2686.39 real       409.44 user       841.22 sys
```

jappeace added 30 commits May 17, 2021 20:53
This allows us to add stream support per row.
Which should be good enough
too much work, let's keep original design
This made me realize I should undo these changes,
it would make the PR easier to accept as well and we don't really
need it.
also export the various lenses for SheetItem.
awkure added 6 commits May 17, 2021 20:54
Problem: We want the project to be as reproducible as possible.
As a Mac user, I currently cannot build it because Haskell
tooling is not very stable on this system. More concretely, in
my case, while building it in my global environment I got the
following cryptic error:

```
streamin: src/Data/Conduit/Internal/Pipe.hs:(413,5)-(418,38): Non-exhaustive patterns in function go
```

And this was at runtime! I even tried to make a local fork of the
conduit library and fix the error, but the problem is that it built
fine on my machine, and when I tried to use it in this project I
got an even more cryptic and unhelpful error:

```
streamin: internal error: stg_v_ap_get
  (GHC version 8.8.4 for x86_64_apple_darwin)
  Please report this as a GHC bug: https://www.haskell.org/ghc/reportabug
```

So I ran out of patience and decided to build it via nix.

Solution: Add a nix flakes template, a cabal.project for the cabal
environment, and a shell for HLS.
For now the conduit parser is not able to parse all possible xlsx
files; we need to change that and also be able to parse untyped
values. We need to optimize the parser aggressively, since using
conduits is not as fast as loading everything into memory. Docs
also need to be added.
Make criterion collect info about all three parsers
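For reference, a criterion harness along these lines might look roughly like the sketch below; the three parse functions are hypothetical stand-ins for the parsers being compared, not the actual benchmark code in this PR, while the `Criterion.Main` API (`defaultMain`, `bgroup`, `bench`, `nfIO`) is real.

```
import Criterion.Main (bench, bgroup, defaultMain, nfIO)

-- Hypothetical stand-ins for the three parsers being compared; each is
-- assumed to return something forceable, e.g. a row count.
parseInMemory, parseLazy, parseWithConduit :: FilePath -> IO Int
parseInMemory    _ = pure 0  -- placeholder
parseLazy        _ = pure 0  -- placeholder
parseWithConduit _ = pure 0  -- placeholder

main :: IO ()
main = defaultMain
  [ bgroup "parse data/50k.xlsx"
      [ bench "in-memory" $ nfIO (parseInMemory    "data/50k.xlsx")
      , bench "lazy"      $ nfIO (parseLazy        "data/50k.xlsx")
      , bench "conduit"   $ nfIO (parseWithConduit "data/50k.xlsx")
      ]
  ]
```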
@awkure awkure requested a review from markflorisson May 17, 2021 18:04
@awkure awkure self-assigned this May 17, 2021