Loss of information oftentimes in the last line of a table #109

linM24 · 2020-11-12T18:59:14Z

Describe the bug
I've tried the plain pdftotree command line utility on a few PDF files with tables, and found that wherever there is a table structure, the last line is usually not captured in the output hOCR file.

Is this expected behavior, or does it have something to do with the extract_tables utility?

To Reproduce
Steps to reproduce the behavior:

1. Download the sample PDF from https://www.w3.org/WAI/WCAG20/Techniques/working-examples/PDF20/table.pdf
2. Run pdftotree pdf/table.pdf -o hocr/table.hocr
3. Check the hOCR output

Expected behavior
The last line of the table is extracted in the output.

Environment (please complete the following information):

OS: macOS 10.15.6
pdftotree Version: 0.5.0
pdfminer.six Version: 20200726

Additional context
Same behaviors occurred on a few other files I used.


HiromuHota · 2020-11-13T06:45:13Z

This is not an expected behaviour.
In addition to the missing last row of the table, I can see some duplicates of cells.
However, this may not be pdftotree's bug as it relies on tabula for the table recognition.
I'd appreciate it if you could try tabula-py directly on the same PDF.

linM24 · 2020-11-16T18:03:23Z

Yea, that's also what I thought.

Will do! Thanks

linM24 · 2020-12-11T22:18:36Z

> This is not an expected behaviour.
> In addition to the missing last row of the table, I can see some duplicates of cells.
> However, this may not be pdftotree's bug as it relies on tabula for the table recognition.
> I'd appreciate it if you could try tabula-py directly on the same PDF.

Sorry for the delay. It turns out tabula works fine on the PDFs I used. Although sometimes it may not be able to accurately convert a table structure into a dataframe or JSON, the pure text information is fully preserved. So I suspect there might be a minor problem in the pipeline of parsing the output of tabula-py.

HiromuHota · 2020-12-13T00:40:06Z

I looked into this issue and confirmed that it is a pdftotree bug in the way it specifies a table area.

$ pdftotree table.pdf -o table.hocr -vv
[INFO] pdftotree.core - Digitized PDF detected, building tree structure...
[WARNING] pdftotree.utils.pdf.pdf_parsers - No boxes to get figures from on page 1.
[INFO] pdftotree.core - Tree structure built, creating html...
[DEBUG] pdftotree.TreeExtract - Calling tabula at page: 1 and area: (146.20799999999997, 90.0, 331.78175999999996, 539.4936).
[DEBUG] pdftotree.TreeExtract - Tabula recognized 1 table(s).
[INFO] pdftotree.core - HTML created.
hOCR output to table.hocr

As the log message shows, pdftotree specified the table area as (146.20799999999997, 90.0, 331.78175999999996, 539.4936), in (top, left, bottom, right) order.
This is a few pixels smaller than the actual table.
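To illustrate what "a few pixels smaller" means for the area passed to tabula, here is a minimal sketch that grows an area by a fixed margin and clamps it to the page. The `pad_area` helper, the 5-unit margin, and the 792 x 612 page size are assumptions made for illustration; none of them come from pdftotree.

```python
from typing import Tuple

# (top, left, bottom, right), the same order as in the debug log.
Area = Tuple[float, float, float, float]


def pad_area(area: Area, margin: float, page_height: float, page_width: float) -> Area:
    """Grow `area` by `margin` on every side without leaving the page."""
    top, left, bottom, right = area
    return (
        max(0.0, top - margin),
        max(0.0, left - margin),
        min(page_height, bottom + margin),
        min(page_width, right + margin),
    )


# Area reported in the debug log.
detected = (146.20799999999997, 90.0, 331.78175999999996, 539.4936)

# Hypothetical page size and margin, chosen only for this example.
padded = pad_area(detected, margin=5.0, page_height=792.0, page_width=612.0)
print(padded)
```

A fixed margin is a blunt instrument: too small and the border lines are still cut off, too large and neighbouring text may be pulled into the table.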

HiromuHota · 2020-12-13T00:42:58Z

I wonder where this pixel shift happens.

HiromuHota · 2020-12-13T04:46:44Z

I think I figured out what was happening.
When you run pdftotree without the -mt option, it detects tables heuristically.

pdftotree/pdftotree/TreeExtract.py

Lines 256 to 259 in 0686a18

    
    # use heuristics to get tables if no model_type is provided
    else:
        for page_num in self.elems.keys():
            tables[page_num] = self.get_tables_page_num(page_num)

The heuristic used here is that words are vertically aligned in a table.

pdftotree/pdftotree/utils/pdf/pdf_parsers.py

Lines 54 to 66 in 0686a18

    
    tbls, tbl_features = cluster_vertically_aligned_boxes(
        boxes,
        elems.layout.bbox,
        avg_font_pts,
        width,
        char_width,
        boxes_segments,
        boxes_curves,
        boxes_figures,
        page_width,
        combine,
    )
    return tbls, tbl_features
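The idea behind the heuristic, that table cells tend to share a left edge, can be sketched roughly as follows. This is not the implementation of `cluster_vertically_aligned_boxes`; the function names, the tolerance, the minimum row count, and the sample boxes are all assumptions for illustration.

```python
from typing import List, Tuple

# (top, left, bottom, right) for each text box on a page.
Box = Tuple[float, float, float, float]


def group_by_left_edge(boxes: List[Box], tol: float = 2.0) -> List[List[Box]]:
    """Chain boxes whose left edges are within `tol` of the previous one."""
    groups: List[List[Box]] = []
    for box in sorted(boxes, key=lambda b: (b[1], b[0])):
        if groups and box[1] - groups[-1][-1][1] <= tol:
            groups[-1].append(box)
        else:
            groups.append([box])
    return groups


def candidate_columns(boxes: List[Box], tol: float = 2.0, min_rows: int = 3) -> List[List[Box]]:
    """Keep only groups with enough aligned boxes to look like a column."""
    return [g for g in group_by_left_edge(boxes, tol) if len(g) >= min_rows]


# Hypothetical boxes: two columns of three cells each, plus one paragraph.
boxes = [
    (100.0, 90.0, 112.0, 150.0), (120.0, 90.5, 132.0, 140.0), (140.0, 90.0, 152.0, 160.0),
    (100.0, 300.0, 112.0, 350.0), (120.0, 300.0, 132.0, 340.0), (140.0, 301.0, 152.0, 360.0),
    (200.0, 50.0, 212.0, 500.0),
]
columns = candidate_columns(boxes)
print(len(columns))
```

Note that a detector of this kind only ever looks at text boxes, so whatever area it derives is bounded by the words themselves.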

So the table area detected by this heuristic, (146.20799999999997, 90.0, 331.78175999999996, 539.4936), is actually correct given how the table is detected: it covers all the words in the table. However, it does not include the table border lines.
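One way to picture a fix to the heuristic is to extend the text-only area so that it also covers nearby ruling lines. The sketch below does this with plain bounding-box arithmetic; the helper names, the `slack` value, and the ruling-line coordinates are hypothetical and are not taken from pdftotree or from the sample PDF.

```python
from typing import Iterable, Tuple

# (top, left, bottom, right)
Area = Tuple[float, float, float, float]


def union(a: Area, b: Area) -> Area:
    """Smallest area that contains both `a` and `b`."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def is_near(a: Area, b: Area, slack: float) -> bool:
    """True if `b` overlaps `a` after growing `a` by `slack` on every side."""
    return (
        a[0] - slack <= b[2] and b[0] <= a[2] + slack
        and a[1] - slack <= b[3] and b[1] <= a[3] + slack
    )


def extend_to_borders(text_area: Area, rulings: Iterable[Area], slack: float = 10.0) -> Area:
    """Union the text area with every ruling line that lies close to it."""
    area = text_area
    for ruling in rulings:
        if is_near(text_area, ruling, slack):
            area = union(area, ruling)
    return area


# Text-only area, rounded from the debug log.
text_area = (146.208, 90.0, 331.782, 539.494)

# Hypothetical ruling lines: top border, bottom border, unrelated footer rule.
rulings = [
    (144.0, 88.0, 144.5, 541.0),
    (334.0, 88.0, 334.5, 541.0),
    (700.0, 88.0, 700.5, 541.0),
]
table_area = extend_to_borders(text_area, rulings)
print(table_area)
```

Unlike a fixed margin, this only grows the area where there is evidence of a border, so the footer rule in the example is left out.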

HiromuHota · 2020-12-13T04:49:13Z

A short-term workaround would be to use the -mt option (probably with vision).
A long-term fix would be either to fix the heuristics or to offload the table detection to tabula.
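As a rough sketch of the second option, tabula-py can be called either with an explicit area or with no area at all, in which case it guesses the table region itself. The `tabula_options` and `read_tables` helpers are hypothetical; `tabula.read_pdf` and its `pages`, `area`, `guess`, and `multiple_tables` options are part of tabula-py, which requires a Java runtime. Whether tabula's own guess recovers the last row needs to be checked per file.

```python
from typing import Any, Dict, Optional, Tuple

# (top, left, bottom, right), the order tabula expects for `area`.
Area = Tuple[float, float, float, float]


def tabula_options(page: int, area: Optional[Area] = None) -> Dict[str, Any]:
    """Build keyword arguments for tabula.read_pdf.

    With an area, tabula is told to use it as-is; without one, tabula
    is left to guess the table region on the page.
    """
    options: Dict[str, Any] = {"pages": page, "multiple_tables": True}
    if area is not None:
        options["area"] = area
        options["guess"] = False
    return options


def read_tables(pdf_path: str, page: int, area: Optional[Area] = None):
    """Run tabula-py on one page and return its list of DataFrames."""
    import tabula  # imported lazily: needs tabula-py and Java installed

    return tabula.read_pdf(pdf_path, **tabula_options(page, area))


# Options equivalent to the call in the debug log versus tabula's own detection.
heuristic_area = (146.20799999999997, 90.0, 331.78175999999996, 539.4936)
with_area = tabula_options(1, heuristic_area)
without_area = tabula_options(1)
print(with_area)
print(without_area)
```

Comparing `read_tables("table.pdf", 1, heuristic_area)` against `read_tables("table.pdf", 1)` on the sample file would show whether the missing row comes from the area or from tabula itself.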

HiromuHota added the bug label Nov 13, 2020
