format for raw output of `pdfplumber` #910

petermr · 2023-06-21T16:14:22Z

Overview

Context

The goal is to create large amounts of running text from pdfplumber output. (Tables, other floats, annotations, vector graphics and images are omitted at this stage but can be integrated later.) This is aimed particularly at reports, scientific papers, etc.

process

characters

(Please correct misunderstandings!)
PDF pages contain (roughly):

- characters
- vector graphics
- bitmap images
- (annotations, e.g. hyperlinks in PDFs)

pdfplumber reads PDF pages, interprets the characters, retrieves glyph or font data, and transforms and normalizes coordinates to page coordinates.

text synthesis

pdfplumber uses heuristics to create words and chunks of local running text.
Downstream tools can use more heuristics to create lines, detect sub- and superscripts, join lines, remove headers and footers, manage page numbers, dehyphenate, create lists, and create sections based on content, style and position (e.g. footnotes).

This is a messy, error-prone, and never-ending process. However, at this stage it no longer depends on the PDF representation and is a geometrical-linguistic exercise.
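To illustrate the kind of downstream heuristic meant here, the following is a minimal sketch, not pdfplumber's own method. It assumes word dicts with the `text`, `x0` and `top` keys that pdfplumber's `extract_words()` returns; the function names and the `y_tolerance` value are hypothetical and would need tuning per corpus.

```python
def group_words_into_lines(words, y_tolerance=3.0):
    """Cluster word dicts into text lines by vertical position.

    A word joins the current line when its `top` is within `y_tolerance`
    of the first word on that line (an assumed, tunable threshold).
    """
    lines = []
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        if lines and abs(word["top"] - lines[-1][0]["top"]) <= y_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Within each line, read left to right.
    return [
        " ".join(w["text"] for w in sorted(line, key=lambda w: w["x0"]))
        for line in lines
    ]


def join_lines(lines):
    """Join lines into running text with a naive dehyphenation rule.

    A trailing hyphen is dropped when the next line starts with a
    lowercase letter. This will wrongly merge genuine compounds.
    """
    text = ""
    for line in lines:
        if text.endswith("-") and line[:1].islower():
            text = text[:-1] + line
        elif text:
            text += " " + line
        else:
            text = line
    return text


words = [
    {"text": "warm-", "x0": 60, "top": 100.0},
    {"text": "Global", "x0": 10, "top": 100.5},
    {"text": "ing", "x0": 10, "top": 112.0},
    {"text": "continues", "x0": 30, "top": 112.2},
]
running_text = join_lines(group_words_into_lines(words))
```

Nothing in this step touches the PDF itself: it works only on coordinates and strings, which is why the same logic could in principle be applied to OCR output.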
A very similar process is required for creating running text from OCR. This is more primitive and error-prone, but there are a number of useful tools that could perhaps be used in conjunction with pdfplumber. The OCR community has created an HTML-like format (hOCR) for its output (http://kba.github.io/hocr-spec/1.2/) and there are downstream tools.

value of HTML for `pdfplumber` output text

I find it very useful to transform pdfplumber output into HTML and am gearing up to transform more than 10,000 pages of the IPCC report on climate change. The goal is to automatically read the PDF reports, create running ("flow") text, and extract floats and other blocks. Although there are some specific tweaks, I think that much of this is applicable to other similar corpora.
(I'm not advertising this widely as the copyright allows personal use but not redistribution.)

Suggestion

pdfplumber could output characters (and probably words) into either (a subset of) hOCR or HTML. This creates a clean separation between the raw output and the later synthesised running text.
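As a rough idea of what such output might look like, here is a hedged sketch that serialises word dicts into an hOCR-style fragment. It is not an existing pdfplumber feature. `words_to_hocr` is a hypothetical helper; it assumes the `text`, `x0`, `top`, `x1` and `bottom` keys of pdfplumber word dicts, and it rounds PDF points to integers because hOCR bounding boxes are integer pixel coordinates. A complete hOCR document would also need the surrounding HTML and metadata, which are left out.

```python
from html import escape


def words_to_hocr(words, page_width, page_height):
    """Serialise word dicts into a minimal hOCR-style page fragment."""

    def bbox(x0, top, x1, bottom):
        # hOCR bbox order is left, top, right, bottom (origin at top left).
        return "bbox " + " ".join(str(round(v)) for v in (x0, top, x1, bottom))

    spans = [
        "<span class='ocrx_word' title='{}'>{}</span>".format(
            bbox(w["x0"], w["top"], w["x1"], w["bottom"]),
            escape(w["text"]),  # escape &, <, > and quotes in the word text
        )
        for w in words
    ]
    return "<div class='ocr_page' title='{}'>\n{}\n</div>".format(
        bbox(0, 0, page_width, page_height), "\n".join(spans)
    )


sample_words = [
    {"text": "CO2 & H2O", "x0": 72.2, "top": 100.6, "x1": 120.4, "bottom": 112.1},
]
hocr = words_to_hocr(sample_words, page_width=595, page_height=842)
```

With pdfplumber installed, the `words` argument could come from `page.extract_words()` and the page size from `page.width` and `page.height`. Graphical elements such as lines and rects have no obvious place in this fragment.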

jsvine (Maintainer) · 2023-07-01T20:42:47Z

Thanks @petermr! I like the idea of pdfplumber being able to produce an intermediate, standard format.

hOCR is appealing due to its rigidity relative to HTML (where I worry that we could end up fiddling endlessly with the output, and the output wouldn't match any other program's output). On the other hand, as far as I can tell, hOCR cannot represent graphical elements, such as lines and rects. How essential is it to have the latter?

I also wonder, though: What would this achieve that pdfminer.six's HTML and hOCR output does not?
