forked from qrilka/xlsx
Add conduit parser #2
Open
awkure wants to merge 37 commits into master from awkure/add-stream-support-cleanup
Conversation
This allows us to add stream support per row, which should be good enough.
Too much work, let's keep the original design.
This reverts commit 13fe548.
This made me realize I should undo these changes; it would make the PR easier to accept as well, and we don't really need them.
This reverts commit a5b18d1.
This reverts commit d17897f.
Also export the various lenses for SheetItem.
I think it tracks every loop being called
… docs" This reverts commit b8adc00.
Problem: we want the project to be as reproducible as possible. As a Mac user, I currently cannot build it, because Haskell tooling is not very stable on this system. More concretely, while building in my global environment I got the following cryptic error:

```
streamin: src/Data/Conduit/Internal/Pipe.hs:(413,5)-(418,38): Non-exhaustive patterns in function go
```

And this was at runtime! I even tried to make a local fork of the conduit library and fix the error, but although the fork built on my machine, using it in this project produced an even more cryptic and unhelpful error:

```
streamin: internal error: stg_v_ap_get (GHC version 8.8.4 for x86_64_apple_darwin)
    Please report this as a GHC bug: https://www.haskell.org/ghc/reportabug
```

So my patience ran out and I decided to build the project via nix.

Solution: add a nix flakes template, a cabal.project for the cabal environment, and a shell for HLS.
For now the conduit parser is not able to parse all possible xlsx files; we need to change that, and also to be able to parse untyped values. We need to optimize the parser aggressively, since using conduits is not as fast as loading everything into memory. Docs also need to be added.
Make criterion collect info about all three parsers
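For reference, such a comparison typically takes the shape of the minimal criterion harness sketched below; `parseDefault`, `parseConduit`, `parseStream`, and `data.xlsx` are all placeholders, not this PR's actual entry points.

```haskell
import Criterion.Main

-- Placeholders standing in for the three parsers' real entry points;
-- each returns a row count so criterion can force the result with nfIO.
parseDefault, parseConduit, parseStream :: FilePath -> IO Int
parseDefault _ = pure 0
parseConduit _ = pure 0
parseStream  _ = pure 0

main :: IO ()
main = defaultMain
  [ bench "default" $ nfIO (parseDefault "data.xlsx")
  , bench "conduit" $ nfIO (parseConduit "data.xlsx")
  , bench "stream"  $ nfIO (parseStream  "data.xlsx")
  ]
```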
Sometimes when we want to parse large `.xlsx` files we run into the problem of rapidly growing memory consumption. This PR aims to deal with that problem by introducing a conduit parser that attempts to run in constant memory. Here are some benchmarks:
And here is how they deal with memory:

[Memory profiles: default parser vs. conduit parser]
At large scale it does appear to run in constant memory; consider a file with 1 million rows:
And here is the comparison with the Python xlsx2csv streaming library on the 1-million-row file (spoiler: not much of a difference here):
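For readers unfamiliar with the approach, the sketch below shows the general conduit streaming style this PR builds on, not its actual API: a self-contained pipeline that counts the lines of a large file in constant memory. The file name is hypothetical; the combinators are from the `conduit` package itself.

```haskell
import Conduit

-- Count the lines of a large file in constant memory: each chunk is
-- read, decoded, split into lines, counted, and discarded before the
-- next chunk is pulled, so residency stays flat regardless of size.
main :: IO ()
main = do
  n <- runConduitRes $
         sourceFile "million-rows.csv"  -- hypothetical input file
      .| decodeUtf8C
      .| linesUnboundedC
      .| lengthC
  print (n :: Int)
```

This incremental pull-based style is also why the conduit parser trades some raw speed for bounded memory, as noted above.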