The PDF Content Converter is a tool for converting PDF text as well as structural features into a pandas dataframe, written natively in Python. It retrieves information about textual content, fonts, positions, character frequencies and surrounding visual PDF elements.
- Pass the path of the PDF file to be converted to `PDFContentConverter`.
- Call the function `pdf2pandas()`. The PDF content is then returned as a pandas dataframe.
- Media boxes of a PDF can be accessed using `get_media_boxes()`, the page count using `get_page_count()` and the document text using `pdf2text()`.
- Using the `convert()` function, the pandas dataframe, textual document content, media boxes and page count are returned as a dictionary.
Example call:
converter = PDFContentConverter(pdf)
result = converter.pdf2pandas()
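The converter object also offers `convert()`, which bundles all outputs into one dictionary. The following is a minimal sketch of how that dictionary might be consumed. The helper `summarize_conversion` and the stand-in class `FakeConverter` are hypothetical and exist only so the snippet runs without a PDF or the library installed; the keys "content", "media_boxes", "text" and "page_count" are the ones documented in this README. The sketch also assumes that the "text" entry is a string.

```python
def summarize_conversion(converter):
    """Call convert() on a converter-like object and summarise the result."""
    result = converter.convert()
    return {
        "n_elements": len(result["content"]),  # len() of a dataframe is its row count
        "n_pages": result["page_count"],
        "n_characters": len(result["text"]),   # assumes the text is a string
        "media_boxes": result["media_boxes"],
    }


class FakeConverter:
    """Hypothetical stand-in that mimics the documented convert() keys."""

    def __init__(self, pdf_path):
        self.pdf_path = pdf_path

    def convert(self):
        return {
            "content": [{"id": 0, "page": 0, "text": "Invoice 42"}],
            "media_boxes": {"x0": 0, "x1": 595, "y0": 842, "y1": 0},
            "text": "Invoice 42",
            "page_count": 1,
        }


summary = summarize_conversion(FakeConverter("example.pdf"))
print(summary)
```

With the real library, an instance of `PDFContentConverter` would take the place of `FakeConverter`, and "content" would be a pandas dataframe rather than a list.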
The converted PDF data is returned as a pandas dataframe.
Each PDF element is stored as one row.
The dataframe contains the following columns:
- `id`: unique identifier of the PDF element
- `page`: page number, starting with 0
- `text`: text of the PDF element
- `x_0`: left x coordinate
- `x_1`: right x coordinate
- `y_0`: top y coordinate
- `y_1`: bottom y coordinate
- `pos_x`: center x coordinate
- `pos_y`: center y coordinate
- `abs_pos`: tuple containing a page-independent representation of the `(pos_x, pos_y)` coordinates
- `original_font`: font as extracted by pdfminer
- `font_name`: name of the font extracted from `original_font`
- `code`: font code as provided by pdfminer
- `bold`: factor, 1 if the text is bold and 0 otherwise
- `italic`: factor, 1 if the text is italic and 0 otherwise
- `font_size`: size of the text in points
- `masked`: text with numeric content substituted as #
- `frequency_hist`: histogram of character type frequencies in a text, stored as a tuple containing percentages of textual, numerical, text symbolic and other symbols
- `len_text`: number of characters
- `n_tokens`: number of words
- `tag`: tag for key-value pair extractions, indicating keys or values based on simple heuristics
- `box`: box extracted by pdfminer Layout Analysis
- `in_element_ids`: IDs of surrounding visual elements such as rectangles or lists, stored as a list `[left, right, top, bottom]`; -1 indicates that there is no adjacent visual element
- `in_element`: indicates, based on `in_element_ids`, whether an element is contained in a visual rectangle (stored as "rectangle") or not (stored as "none")
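To make the text-derived columns more concrete, the sketch below shows one way values like `masked`, `frequency_hist`, `len_text` and `n_tokens` could be computed from a string. It is an illustration, not the library's implementation: the function names are hypothetical, digits are assumed to be masked one by one, tokens are assumed to be whitespace-separated, and "text symbolic" characters are assumed to mean ASCII punctuation.

```python
import re
import string


def mask_numbers(text):
    """Replace every digit with '#' (assumed behaviour of `masked`)."""
    return re.sub(r"\d", "#", text)


def frequency_hist(text):
    """Return percentages of (letters, digits, punctuation, other) characters."""
    if not text:
        return (0.0, 0.0, 0.0, 0.0)
    counts = [0, 0, 0, 0]
    for ch in text:
        if ch.isalpha():
            counts[0] += 1
        elif ch.isdigit():
            counts[1] += 1
        elif ch in string.punctuation:
            counts[2] += 1
        else:
            counts[3] += 1  # whitespace and anything else
    return tuple(100.0 * c / len(text) for c in counts)


text = "Total: 42 EUR"
features = {
    "masked": mask_numbers(text),
    "frequency_hist": frequency_hist(text),
    "len_text": len(text),
    "n_tokens": len(text.split()),  # whitespace-separated words
}
print(features)
```

The actual category boundaries used by the converter may differ, so values produced by this sketch should not be expected to match the dataframe exactly.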
Additionally, a dictionary is returned containing the following entries,
which can be used to transform the absolute CSV coordinates:
- `x0`: Left x page crop box coordinate
- `x1`: Right x page crop box coordinate
- `y0`: Top y page crop box coordinate
- `y1`: Bottom y page crop box coordinate
- `x0page`: Left x page coordinate
- `x1page`: Right x page coordinate
- `y0page`: Top y page coordinate
- `y1page`: Bottom y page coordinate
Both are returned in a dictionary when using `convert()`.
The dataframe is stored as "content", the page characteristics as "media_boxes", the textual content as "text" and the number of pages as "page_count".
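The README does not spell out how the media box entries are meant to transform coordinates, so the following is only one plausible sketch: it computes the center of a bounding box (analogous to `pos_x` and `pos_y`) and scales it into the 0..1 range of the crop box. The function names and sample numbers are hypothetical, and the box is assumed to be a flat mapping with the keys `x0`, `x1`, `y0` and `y1` for a single page.

```python
def center(x_0, x_1, y_0, y_1):
    """Midpoint of a bounding box, analogous to pos_x / pos_y."""
    return ((x_0 + x_1) / 2, (y_0 + y_1) / 2)


def to_relative(pos_x, pos_y, box):
    """Scale a point into the 0..1 range of the crop box.

    min()/max() are used so the result does not depend on whether the
    y axis grows upwards or downwards. A zero-width or zero-height box
    would raise ZeroDivisionError.
    """
    x_lo, x_hi = min(box["x0"], box["x1"]), max(box["x0"], box["x1"])
    y_lo, y_hi = min(box["y0"], box["y1"]), max(box["y0"], box["y1"])
    return ((pos_x - x_lo) / (x_hi - x_lo), (pos_y - y_lo) / (y_hi - y_lo))


# Hypothetical values: an A4-sized crop box in points and one text element.
crop_box = {"x0": 0, "x1": 595, "y0": 842, "y1": 0}
pos_x, pos_y = center(x_0=100, x_1=200, y_0=700, y_1=680)
rel_x, rel_y = to_relative(pos_x, pos_y, crop_box)
print(round(rel_x, 3), round(rel_y, 3))
```

Relative coordinates of this kind make elements comparable across pages of different sizes, which is the sort of page-independent view that `abs_pos` is described as providing.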
- This work is built on top of the pdfminer project https://github.com/euske/pdfminer.