Replies: 1 comment
Thanks @petermr! The idea of hOCR is appealing due to its rigidity relative to HTML (where I worry that we could end up fiddling endlessly with the output, and the output wouldn't match any other program's output). On the other hand, as far as I can tell, hOCR cannot represent graphical elements, such as lines and rects. How essential is it to have the latter? I also wonder, though: What would this achieve that |
Overview
Context
The creation of large amounts of running text from `pdfplumber`. (Tables, other floats, annotations, vector graphics and images are omitted at this stage but can be integrated later.) Particularly aimed at reports, scientific papers, etc.
process characters
(Please correct misunderstandings!)
PDF pages contain (roughly):
`pdfplumber` reads PDF pages, interprets the characters, retrieves glyphs or font data, and transforms and normalizes coordinates to page coordinates.
text synthesis
`pdfplumber` uses heuristics to create words and chunks of local running text. Downstream tools can use more heuristics to create lines and sub/superscripts, join lines, remove headers and footers, manage page numbers, dehyphenate, create lists, and create sections based on content, style, and position (e.g. footnotes).
This is a messy, error-prone, and never-ending process. However, at this stage it no longer depends on the PDF representation and is a geometrical-linguistic exercise.
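To make the downstream step concrete, here is a minimal sketch of two of the heuristics mentioned above: grouping words into lines and dehyphenating across line breaks. It assumes word dicts with `text`, `x0` and `top` keys, as returned by `pdfplumber`'s `extract_words()`. The function names and the `y_tolerance` value are my own illustrative choices, not part of `pdfplumber`, and real documents will need more careful rules.

```python
def group_words_into_lines(words, y_tolerance=3.0):
    """Group word dicts into text lines by vertical position.

    `y_tolerance` (in page units) is an assumed threshold: a word whose
    `top` is within this distance of the first word of the current line
    is treated as belonging to that line.
    """
    lines = []
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        if lines and abs(word["top"] - lines[-1][0]["top"]) <= y_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Within each line, order words left to right and join with spaces.
    return [
        " ".join(w["text"] for w in sorted(line, key=lambda w: w["x0"]))
        for line in lines
    ]


def join_lines(lines):
    """Join lines into running text, merging words hyphenated at line ends.

    Naive rule: any line-final hyphen is assumed to be a soft hyphen,
    which is wrong for genuinely hyphenated compounds.
    """
    text = ""
    for line in lines:
        if text.endswith("-"):
            text = text[:-1] + line
        elif text:
            text += " " + line
        else:
            text = line
    return text
```

With a real page, the input would come from something like `page.extract_words()`. Note that none of this touches the PDF itself, which is the point of separating raw output from synthesis.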
A very similar process is required for creating running text from OCR. This is more primitive and error-prone, but there are a number of useful tools that could perhaps be used in conjunction with `pdfplumber`. The OCR community has created an HTML-like format (`hOCR`) for its output (http://kba.github.io/hocr-spec/1.2/) and there are downstream tools.
value of HTML for `pdfplumber` output text
I find it very useful to transform `pdfplumber` output into HTML and am gearing up to transform > 10,000 pages of the IPCC report on Climate Change. The goal is to automatically read the PDF reports and create running ("flow") text, and to extract floats and other blocks. Although there are some specific tweaks, I think that much of this is applicable to other similar corpora. (I'm not advertising this widely as the copyright allows personal use but not redistribution.)
Suggestion
`pdfplumber` could output characters (and probably words) into either (a subset of) `hOCR` or HTML. This creates a clean separation between the raw output and the later synthesised running text.