Loss of information oftentimes in the last line of a table #109
Comments
This is not expected behaviour.
Yeah, that's also what I thought. Will do! Thanks.
Sorry for the delay. It turns out tabula works fine on the PDFs I used. Although it sometimes cannot accurately convert a table structure into a dataframe or JSON, the plain text information is fully preserved. So I suspect there is a minor problem in the pipeline that parses the output of tabula-py.
I wonder where this pixel shift happens.
I think I figured out what is happening. See pdftotree/pdftotree/TreeExtract.py, lines 256 to 259 in 0686a18.
The heuristic used here is that words are vertically aligned in a table. See pdftotree/pdftotree/utils/pdf/pdf_parsers.py, lines 54 to 66 in 0686a18.
So the table area detected by this heuristic, (146.20799999999997, 90.0, 331.78175999999996, 539.4936), is correct given how a table is detected: it covers all the words in the table. However, it does not include the table border lines.
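To illustrate the point about the detected area, here is a minimal sketch. It is not pdftotree's code. It assumes the area tuple is ordered (top, left, bottom, right) in PDF points, and the helper names `union_bbox` and `pad_bbox`, the sample word boxes, and the 5-point margin are all hypothetical. It shows that a box built only from word positions stops short of a border line drawn just below the last row, while a padded box covers it.

```python
from typing import Iterable, Tuple

# Assumed order: (top, left, bottom, right), in PDF points.
BBox = Tuple[float, float, float, float]


def union_bbox(word_boxes: Iterable[BBox]) -> BBox:
    """Smallest box that covers every word box (text only, no border lines)."""
    tops, lefts, bottoms, rights = zip(*word_boxes)
    return (min(tops), min(lefts), max(bottoms), max(rights))


def pad_bbox(bbox: BBox, margin: float, page_height: float, page_width: float) -> BBox:
    """Grow the box by `margin` on every side, clamped to the page."""
    top, left, bottom, right = bbox
    return (
        max(0.0, top - margin),
        max(0.0, left - margin),
        min(page_height, bottom + margin),
        min(page_width, right + margin),
    )


# Hypothetical word boxes for the first and last cells of a table.
words = [(150.0, 90.0, 160.0, 200.0), (320.0, 300.0, 330.0, 539.0)]
text_area = union_bbox(words)
padded_area = pad_bbox(text_area, margin=5.0, page_height=792.0, page_width=612.0)

# A hypothetical bottom border line drawn at y = 333.0.
border_y = 333.0
border_in_text_area = text_area[0] <= border_y <= text_area[2]
border_in_padded_area = padded_area[0] <= border_y <= padded_area[2]
```

With these made-up numbers, the border line falls outside the text-only area but inside the padded one. A suitable margin for real documents would have to be found by experiment.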
A short-term workaround would be to use
Describe the bug
I tried the plain pdftotree command-line utility on a few PDF files with tables and found that wherever there is a table structure, the last line is usually not captured in the output hOCR file. Is that expected behavior, or does it have something to do with the extract_tables utility?
To Reproduce
Steps to reproduce the behavior:
pdftotree pdf/table.pdf -o hocr/table.hocr
Expected behavior
The last line of the table should be extracted in the output. Currently it is not.
Environment (please complete the following information):
pdftotree Version: 0.5.0
pdfminer.six Version: 20200726
Additional context
The same behavior occurred on a few other files I used.