Loss of information oftentimes in the last line of a table #109

Open
linM24 opened this issue Nov 12, 2020 · 7 comments

linM24 commented Nov 12, 2020

Describe the bug
I've tried the plain pdftotree command line utility on a few pdf files with tables, and found wherever there is a table structure, the last line is usually not captured in the output hOCR file.

Is that expected behavior, or does it have something to do with the extract_tables utility?

To Reproduce
Steps to reproduce the behavior:

  1. sample pdf downloaded from https://www.w3.org/WAI/WCAG20/Techniques/working-examples/PDF20/table.pdf
  2. run pdftotree pdf/table.pdf -o hocr/table.hocr
  3. check hOCR output
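
Step 3 can be automated rather than checked by eye. The following is a minimal sketch, assuming only that the hOCR output is HTML-like markup whose visible text sits in ordinary text nodes. `hocr_text` and `missing_strings` are hypothetical helper names, not part of pdftotree, and the expected strings are whatever the last table row contains in your PDF.

```python
from html.parser import HTMLParser
from typing import List


class _TextCollector(HTMLParser):
    """Collects every non-empty text node, ignoring tags and attributes."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def hocr_text(markup: str) -> str:
    """Return all text nodes in the markup, joined by single spaces."""
    collector = _TextCollector()
    collector.feed(markup)
    collector.close()
    return " ".join(collector.chunks)


def missing_strings(markup: str, expected: List[str]) -> List[str]:
    """Return the expected strings that do not appear in the hOCR text."""
    text = hocr_text(markup)
    return [s for s in expected if s not in text]
```

Reading `hocr/table.hocr` into a string and passing it to `missing_strings` together with a few cell values from the last row would list the cells that were dropped.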

Expected behavior
The last line of the table should be extracted in the output.

Environment (please complete the following information):

  • OS: macOS 10.15.6
  • pdftotree Version: 0.5.0
  • pdfminer.six Version: 20200726

Additional context
Same behaviors occurred on a few other files I used.

@HiromuHota HiromuHota added the bug label Nov 13, 2020
@HiromuHota
Contributor

This is not an expected behaviour.
In addition to the missing last row of the table, I can see some duplicates of cells.
However, this may not be pdftotree's bug as it relies on tabula for the table recognition.
I'd appreciate it if you could try tabula-py directly on the same PDF.

@linM24
Author

linM24 commented Nov 16, 2020

Yea, that's also what I thought.

Will do! Thanks

@linM24
Author

linM24 commented Dec 11, 2020

> This is not an expected behaviour.
> In addition to the missing last row of the table, I can see some duplicates of cells.
> However, this may not be pdftotree's bug as it relies on tabula for the table recognition.
> I'd appreciate it if you could try tabula-py directly on the same PDF.

Sorry for the delay. It turns out tabula works fine on the PDFs I used. Although sometimes it may not be able to accurately convert a table structure into a dataframe or JSON, the pure text information is fully preserved. So I suspect there might be a minor problem in the pipeline of parsing the output of tabula-py.

@HiromuHota
Contributor

I looked into this issue and confirmed that it is a pdftotree bug in how it specifies the table area.

$ pdftotree table.pdf -o table.hocr -vv
[INFO] pdftotree.core - Digitized PDF detected, building tree structure...
[WARNING] pdftotree.utils.pdf.pdf_parsers - No boxes to get figures from on page 1.
[INFO] pdftotree.core - Tree structure built, creating html...
[DEBUG] pdftotree.TreeExtract - Calling tabula at page: 1 and area: (146.20799999999997, 90.0, 331.78175999999996, 539.4936).
[DEBUG] pdftotree.TreeExtract - Tabula recognized 1 table(s).
[INFO] pdftotree.core - HTML created.
hOCR output to table.hocr

As can be seen in the log message, pdftotree specified a table area as (146.20799999999997, 90.0, 331.78175999999996, 539.4936) (top, left, bottom, right).
This is actually a few pixels smaller than the actual table.
[Screenshot: Screen Shot 2020-12-12 at 16 35 19]
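
One way to test whether the too-small area is what drops the last row is to hand tabula-py the logged area and a slightly enlarged one, then compare the row counts. This is only a sketch: `pad_area` and `count_rows` are hypothetical helpers, the margin is in the same units as the logged area, and the page size has to be supplied by the caller. `count_rows` needs tabula-py and a Java runtime, so the import is done lazily.

```python
def pad_area(area, margin, page_height, page_width):
    """Grow a (top, left, bottom, right) area by `margin` on every side, clamped to the page."""
    top, left, bottom, right = area
    return (
        max(0.0, top - margin),
        max(0.0, left - margin),
        min(page_height, bottom + margin),
        min(page_width, right + margin),
    )


def count_rows(pdf_path, page, area):
    """Run tabula-py on one page and area; return the row count of each table found."""
    import tabula  # imported lazily: needs tabula-py and a Java runtime

    tables = tabula.read_pdf(pdf_path, pages=page, area=list(area), multiple_tables=True)
    return [len(df) for df in tables]
```

Calling `count_rows("table.pdf", 1, area)` and then `count_rows("table.pdf", 1, pad_area(area, 5, page_height, page_width))` with the area from the log should show whether the padded call returns the extra row.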

@HiromuHota
Contributor

I wonder where this pixel shift happens.

@HiromuHota
Contributor

I think I figured out what was happening.
When you run pdftotree without the -mt option, it detects tables heuristically.

# use heuristics to get tables if no model_type is provided
else:
    for page_num in self.elems.keys():
        tables[page_num] = self.get_tables_page_num(page_num)

The heuristic used here is that words are vertically aligned in a table.

tbls, tbl_features = cluster_vertically_aligned_boxes(
    boxes,
    elems.layout.bbox,
    avg_font_pts,
    width,
    char_width,
    boxes_segments,
    boxes_curves,
    boxes_figures,
    page_width,
    combine,
)
return tbls, tbl_features

So the table area detected by this heuristic, (146.20799999999997, 90.0, 331.78175999999996, 539.4936), is correct given how the table is detected: it covers all the words in the table. However, it does not include the table's border lines.
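
The effect is easy to see on toy data. The sketch below is not pdftotree's clustering code: it takes the union of a few word boxes and checks whether a border line falls inside the result. All coordinates are made up for illustration, using the same (top, left, bottom, right) order as the log, with y growing downward.

```python
def union_bbox(boxes):
    """Smallest (top, left, bottom, right) rectangle covering every box."""
    tops, lefts, bottoms, rights = zip(*boxes)
    return (min(tops), min(lefts), max(bottoms), max(rights))


def contains(area, box):
    """True if `box` lies entirely inside `area` (both top, left, bottom, right)."""
    return (
        area[0] <= box[0]
        and area[1] <= box[1]
        and box[2] <= area[2]
        and box[3] <= area[3]
    )


# Made-up coordinates, for illustration only.
words = [
    (150.0, 95.0, 160.0, 140.0),   # a word in the first row
    (150.0, 300.0, 160.0, 350.0),  # another word in the first row
    (320.0, 95.0, 330.0, 140.0),   # a word in the last row
]
bottom_border = (333.0, 90.0, 334.0, 540.0)  # thin horizontal rule below the last row

text_area = union_bbox(words)
print(text_area)                           # (150.0, 95.0, 330.0, 350.0)
print(contains(text_area, bottom_border))  # False
```

Every word is inside `text_area`, but the rule below the last row is not, which matches the area in the log being slightly smaller than the drawn table.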

@HiromuHota
Contributor

A short-term workaround would be to use the -mt option (probably with vision).
A long-term fix would be either to fix the heuristics or to offload table detection to tabula.
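
As one possible shape for the heuristics fix, the word-based area could be extended to any drawn line that sits within a small tolerance of it. This is a sketch under assumptions, not a patch for pdftotree: `snap_to_rules` is a hypothetical helper, lines are modelled as thin (top, left, bottom, right) rectangles, and the tolerance would need tuning on real documents.

```python
def snap_to_rules(area, rules, tol):
    """Extend a (top, left, bottom, right) area to cover rules within `tol` of it.

    `rules` are thin rectangles (top, left, bottom, right) for the drawn lines.
    A rule is absorbed when it overlaps the current area grown by `tol`.
    Rules are visited once, in the order given.
    """
    top, left, bottom, right = area
    for r_top, r_left, r_bottom, r_right in rules:
        near = (
            r_bottom >= top - tol
            and r_top <= bottom + tol
            and r_right >= left - tol
            and r_left <= right + tol
        )
        if near:
            top, left = min(top, r_top), min(left, r_left)
            bottom, right = max(bottom, r_bottom), max(right, r_right)
    return (top, left, bottom, right)
```

A tolerance that is too generous would also absorb unrelated lines such as page footers, so offloading detection to tabula may still be the more robust option.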
