The PDF Content Converter is a tool, written natively in Python, for converting PDF text and structural features into a pandas dataframe. It retrieves information about textual content, fonts, positions, character frequencies and surrounding visual PDF elements.
- Pass the path of the PDF file to be converted to `PDFContentConverter`.
- Call the function `pdf2pandas()`. The PDF content is then returned as a pandas dataframe.
- Media boxes of a PDF can be accessed using `get_media_boxes()`, the page count using `get_page_count()` and the document text using `pdf2text()`.
- Using the `convert()` function, the pandas dataframe, textual document content, media boxes and page count are returned as a dictionary.
Example call:
converter = PDFContentConverter(pdf)
result = converter.pdf2pandas()
A more detailed usage example is also given in `Tester.py`.
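The individual accessors listed above can also be gathered in one place. The following is a minimal sketch, not part of the project: `collect_outputs` is a hypothetical helper, and it assumes that the four methods take no arguments, as the example call suggests for `pdf2pandas()`.

```python
def collect_outputs(converter):
    """Call the four accessors described above and gather their results.

    `converter` is assumed to be a PDFContentConverter instance, or any
    object that exposes the same four methods without arguments.
    """
    return {
        "content": converter.pdf2pandas(),          # pandas dataframe
        "media_boxes": converter.get_media_boxes(),  # page box coordinates
        "page_count": converter.get_page_count(),    # number of pages
        "text": converter.pdf2text(),                # document text
    }


# Hypothetical usage (the import path is an assumption):
# from PDFContentConverter import PDFContentConverter
# outputs = collect_outputs(PDFContentConverter("example.pdf"))
```

The helper only relies on the method names, so it can be exercised with a stand-in object when no PDF is at hand.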
- `PDFContentConverter.py`: contains the `PDFContentConverter` class for converting PDF documents.
- `util`:
  - `constants`: paths to input and output data, pdfminer parameters
  - `StorageUtil`: store/load functionalities
- `Tester.py`: Python script for testing the `PDFContentConverter`
- `csv`: example csv output files for tests
- `pdf`: example pdf input files for tests
The output containing the converted PDF data is stored as a pandas dataframe. Each PDF element is stored as one row. The dataframe contains the following columns:
- `id`: unique identifier of the PDF element
- `page`: page number, starting with 0
- `text`: text of the PDF element
- `x_0`: left x coordinate
- `x_1`: right x coordinate
- `y_0`: top y coordinate
- `y_1`: bottom y coordinate
- `pos_x`: center x coordinate
- `pos_y`: center y coordinate
- `abs_pos`: tuple containing a page independent representation of the `(pos_x, pos_y)` coordinates
- `original_font`: font as extracted by pdfminer
- `font_name`: name of the font extracted from `original_font`
- `code`: font code as provided by pdfminer
- `bold`: 1 if the text is bold and 0 otherwise
- `italic`: 1 if the text is italic and 0 otherwise
- `font_size`: size of the text in points
- `masked`: text with numeric content substituted by `#`
- `frequency_hist`: histogram of character type frequencies in a text, stored as a tuple containing the percentages of textual, numerical, text symbolic and other symbols
- `len_text`: number of characters
- `n_tokens`: number of words
- `tag`: tag for key-value pair extractions, indicating keys or values based on simple heuristics
- `box`: box extracted by the pdfminer layout analysis
- `in_element_ids`: IDs of surrounding visual elements such as rectangles or lists, stored as a list `[left, right, top, bottom]`. A value of -1 indicates that there is no adjacent visual element.
- `in_element`: indicates, based on `in_element_ids`, whether an element lies inside a visual rectangle (stored as "rectangle") or not (stored as "none").
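To make the text-derived columns more concrete, here is a rough sketch of how values like `masked`, `frequency_hist`, `len_text` and `n_tokens` could be computed for a single text. This is not the project's implementation. The function names are hypothetical, and the rules are assumptions: each digit is replaced by one `#`, "text symbolic" is taken to mean common punctuation, whitespace is ignored in the histogram, and words are split on whitespace.

```python
import re

# Assumption: "text symbolic" characters are common punctuation marks.
TEXT_SYMBOLS = set(".,;:!?'\"()-")


def mask_numbers(text):
    """Replace every digit with '#' (assumed masking rule)."""
    return re.sub(r"\d", "#", text)


def frequency_hist(text):
    """Return percentages of (textual, numerical, text symbolic, other) characters."""
    chars = [c for c in text if not c.isspace()]  # assumption: skip whitespace
    if not chars:
        return (0.0, 0.0, 0.0, 0.0)
    counts = [0, 0, 0, 0]
    for c in chars:
        if c.isalpha():
            counts[0] += 1
        elif c.isdigit():
            counts[1] += 1
        elif c in TEXT_SYMBOLS:
            counts[2] += 1
        else:
            counts[3] += 1
    return tuple(100.0 * n / len(chars) for n in counts)


def describe_text(text):
    """Bundle the text-derived values under the column names listed above."""
    return {
        "masked": mask_numbers(text),
        "frequency_hist": frequency_hist(text),
        "len_text": len(text),
        "n_tokens": len(text.split()),  # assumption: whitespace tokenization
    }


print(mask_numbers("Total: 1,250.00 EUR"))  # Total: #,###.## EUR
print(frequency_hist("ab12"))               # (50.0, 50.0, 0.0, 0.0)
```

The values in the actual dataframe may differ wherever the converter uses other character classes or tokenization rules.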
Additionally, a dictionary is returned containing the following entries, which can be used to transform the absolute CSV coordinates:
- `x0`: left x page crop box coordinate
- `x1`: right x page crop box coordinate
- `y0`: top y page crop box coordinate
- `y1`: bottom y page crop box coordinate
- `x0page`: left x page coordinate
- `x1page`: right x page coordinate
- `y0page`: top y page coordinate
- `y1page`: bottom y page coordinate
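One possible transformation based on these entries is to rescale absolute coordinates to the range 0 to 1 relative to the page. The sketch below is an illustration only: `to_relative` is a hypothetical helper, and it assumes that `box` is a plain dictionary holding the entries listed above for a single page. How the converter organizes the boxes of several pages is not specified here.

```python
def to_relative(x, y, box):
    """Map absolute coordinates to fractions of the page extent.

    `box` is assumed to be a dict with the keys x0page, x1page,
    y0page and y1page for one page.
    """
    width = box["x1page"] - box["x0page"]
    height = box["y1page"] - box["y0page"]
    if width == 0 or height == 0:
        raise ValueError("page box has zero width or height")
    # Fractions are measured from the left edge (x0page) and from y0page.
    return ((x - box["x0page"]) / width, (y - box["y0page"]) / height)


# Example with made-up page dimensions (612 x 792 points):
page_box = {"x0page": 0, "x1page": 612, "y0page": 0, "y1page": 792}
print(to_relative(306, 396, page_box))  # (0.5, 0.5)
```

Because both differences are taken relative to `y0page`, the result is the fraction of the distance from `y0page` to `y1page` regardless of which of the two values is numerically larger.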
The dataframe and the media box dictionary are both returned in a single dictionary when using `convert()`.
The dataframe is stored as "content", the page characteristics as "media_boxes", the textual content as "text" and the number of pages as "page_count".
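As a small illustration of working with this dictionary, the sketch below counts the elements per page and lists pages without any extracted element. `summarize` is a hypothetical helper. It assumes only the four keys named above and a `page` column that starts at 0, as described in the column list.

```python
from collections import Counter


def summarize(result):
    """Summarize the dictionary returned by convert().

    Only result["content"]["page"] and result["page_count"] are read, so
    any mapping with a "page" column works, including a pandas dataframe.
    """
    per_page = Counter(result["content"]["page"])
    page_count = result["page_count"]
    return {
        "page_count": page_count,
        "n_elements": sum(per_page.values()),
        "elements_per_page": dict(sorted(per_page.items())),
        # Pages are numbered from 0, so check every index below page_count.
        "empty_pages": [p for p in range(page_count) if p not in per_page],
    }


# Hypothetical usage:
# result = converter.convert()
# print(summarize(result))
```

A non-empty `empty_pages` list can hint at scanned or image-only pages for which no text elements were extracted.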
- This work is built on top of the pdfminer project https://github.com/euske/pdfminer.
- Example PDFs are obtained from the ICDAR Table Recognition Challenge 2013 https://roundtrippdf.com/en/data-extraction/pdf-table-recognition-dataset/.
- Michael Benedikt Aigner
- Florian Preis