PDFtoCSV_extract_regesta

This Jupyter notebook extracts retro-digitised and OCRed regesta, for example from the volumes of the Göttinger Papsturkundenwerk, into tabular form so that they can be processed digitally as part of the Academy project Die Formierung Europas durch Überwindung der Spaltung im 12. Jahrhundert.

The script relies on the standardised layout of a printed book page to identify the individual parts of a regest. The Python library PyMuPDF is used to extract each line of text. In a first step, the positions of blocks and lines on the page can be visualised so that the limits of the page regions can be set manually: header, footer, page number, Regestennummer (regest number), date, and the indentation of the first line of a paragraph. These limits take into account whether the page is even or odd. The interactive graphs are drawn with plotly and are not rendered on GitHub; to see them, open the notebook in nbviewer.
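The extraction step can be sketched as follows. This is a minimal illustration, not the notebook's own code: it assumes the nested dictionary that PyMuPDF's `page.get_text("dict")` returns (blocks containing lines containing spans, each with a `bbox`), and the helper names `load_page_dict` and `line_midpoints` are hypothetical.

```python
def load_page_dict(pdf_path, page_index):
    """Read one page with PyMuPDF and return its text dictionary."""
    import fitz  # PyMuPDF; imported lazily so the rest runs without it

    doc = fitz.open(pdf_path)
    try:
        return doc[page_index].get_text("dict")
    finally:
        doc.close()


def line_midpoints(page_dict):
    """Return one record per text line: joined text, bbox and midpoint."""
    records = []
    for block in page_dict.get("blocks", []):
        if block.get("type", 0) != 0:  # type 1 is an image block
            continue
        for line in block.get("lines", []):
            x0, y0, x1, y1 = line["bbox"]
            text = "".join(span["text"] for span in line.get("spans", []))
            records.append({
                "text": text,
                "bbox": (x0, y0, x1, y1),
                "mid_x": (x0 + x1) / 2,
                "mid_y": (y0 + y1) / 2,
            })
    return records


# Hand-built stand-in for a page dictionary (invented values, in points).
sample_page = {
    "blocks": [
        {"type": 0, "lines": [
            {"bbox": (50, 100, 100, 112), "spans": [{"text": "12."}]},
            {"bbox": (120, 100, 300, 112), "spans": [{"text": "1150 "}, {"text": "Jan 1"}]},
        ]},
        {"type": 1},  # an image block, which is skipped
    ]
}
midpoints = line_midpoints(sample_page)
```

The midpoints are what a scatter plot of the page would show; plotting them for a few pages is one way to read off sensible region limits by eye.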

Through page stabilization, all pages are aligned at the intersection of the top and left edges of the type area. This alignment minimizes the deviation in the positional data of individual lines required for categorization.
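One simple way to approximate this alignment is to shift every line so that the top-left corner of the type area lands at the origin. The sketch below assumes the type area can be estimated from the smallest `x0` and `y0` of all line boxes on the page; the notebook may determine it differently, and stray OCR marks in the margin would distort this estimate. The function name `stabilise` is hypothetical.

```python
def stabilise(records):
    """Shift line boxes so the type area's top-left corner sits at (0, 0)."""
    if not records:
        return []
    # Assumption: the leftmost and topmost line edges bound the type area.
    left = min(r["bbox"][0] for r in records)
    top = min(r["bbox"][1] for r in records)
    shifted = []
    for r in records:
        x0, y0, x1, y1 = r["bbox"]
        shifted.append({**r, "bbox": (x0 - left, y0 - top, x1 - left, y1 - top)})
    return shifted


# Two invented pages with the same layout, scanned with different offsets.
page_a = [
    {"text": "12.", "bbox": (50.0, 80.0, 90.0, 92.0)},
    {"text": "Body", "bbox": (70.0, 100.0, 400.0, 112.0)},
]
page_b = [
    {"text": "12.", "bbox": (56.0, 77.0, 96.0, 89.0)},
    {"text": "Body", "bbox": (76.0, 97.0, 406.0, 109.0)},
]
aligned_a = stabilise(page_a)
aligned_b = stabilise(page_b)
```

After the shift, corresponding lines on both pages have identical coordinates, so one set of region limits can serve all pages of the same parity.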

The program assigns each line to a category based on its positional data and returns the processed text in tabular form. A 'Regestennummer' always marks the beginning of a new entry. The resulting table is saved as a CSV file. The main challenges stem from OCR errors, which split the text into spans, lines, and blocks in unpredictable ways. This often causes misclassification and requires manual correction, particularly when distinguishing the frequently italicized Kopfregesten (head regesta), archival tradition, and commentary on the one hand from the regular text of the edition and its footnotes on the other.
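The classification and grouping logic might look roughly like this. It is a sketch under stated assumptions: the limits in `LIMITS` are invented values in points relative to the aligned type area, the category names and the functions `classify`, `build_entries` and `to_csv` are hypothetical, and only three output columns are used.

```python
import csv
import io

# Hypothetical limits in points, measured from the aligned type area.
LIMITS = {
    "header_bottom": 20.0,  # midpoints above this belong to the header
    "footer_top": 600.0,    # midpoints below this belong to the footer
    "number_right": 40.0,   # a regest number ends left of this
    "date_left": 300.0,     # a date starts right of this
}


def classify(line):
    """Assign a category to a line from its bounding box alone."""
    x0, y0, x1, y1 = line["bbox"]
    mid_y = (y0 + y1) / 2
    if mid_y < LIMITS["header_bottom"]:
        return "header"
    if mid_y > LIMITS["footer_top"]:
        return "footer"
    if x1 <= LIMITS["number_right"]:
        return "number"
    if x0 >= LIMITS["date_left"]:
        return "date"
    return "text"


def build_entries(lines):
    """Group classified lines into entries; a number opens a new entry."""
    entries = []
    current = None
    for line in lines:
        category = classify(line)
        text = line["text"].strip()
        if category in ("header", "footer"):
            continue
        if category == "number":
            current = {"number": text, "date": "", "text": ""}
            entries.append(current)
        elif current is None:
            continue  # no open entry yet; skipped in this sketch
        elif category == "date":
            current["date"] = (current["date"] + " " + text).strip()
        else:
            current["text"] = (current["text"] + " " + text).strip()
    return entries


def to_csv(entries):
    """Serialise the entries as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["number", "date", "text"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(entries)
    return buffer.getvalue()


# Invented sample lines in reading order.
sample_lines = [
    {"text": "Running head", "bbox": (100, 0, 200, 10)},
    {"text": "12.", "bbox": (0, 30, 25, 42)},
    {"text": "1150 Jan 1", "bbox": (320, 30, 400, 42)},
    {"text": "First line of the summary", "bbox": (15, 50, 400, 62)},
    {"text": "second line of the summary", "bbox": (0, 64, 200, 76)},
    {"text": "13.", "bbox": (0, 90, 25, 102)},
    {"text": "Another summary", "bbox": (15, 110, 300, 122)},
    {"text": "45", "bbox": (190, 610, 210, 620)},
]
entries = build_entries(sample_lines)
csv_text = to_csv(entries)
```

A purely positional rule like this has obvious weak spots: a very short final line of a paragraph that ends left of `number_right` would be mistaken for a regest number, which is one reason manual correction remains necessary.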

The code can be adapted to process regesta where, for example, the date is centered below the number or where the number is on the left and the date on the right of the same line. The clearer the visual structure of the text, the better the results.
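For the layout with the number on the left and the date on the right of the same line, one possible adaptation is to split the spans of that line at a fixed horizontal position. The sketch below assumes spans shaped like those in PyMuPDF's `get_text("dict")` output and a hypothetical threshold `split_x`; if the OCR merges number and date into a single span, this approach would not separate them.

```python
def split_heading(spans, split_x):
    """Split one heading line into (number, date) by span midpoint."""
    number_parts, date_parts = [], []
    # Sort by left edge so the parts are joined in reading order.
    for span in sorted(spans, key=lambda s: s["bbox"][0]):
        x0, _, x1, _ = span["bbox"]
        target = number_parts if (x0 + x1) / 2 < split_x else date_parts
        target.append(span["text"].strip())
    number = " ".join(p for p in number_parts if p)
    date = " ".join(p for p in date_parts if p)
    return number, date


# Invented spans of a single heading line, deliberately out of order.
heading_spans = [
    {"text": "(1150) ", "bbox": (330, 30, 365, 42)},
    {"text": "12.", "bbox": (0, 30, 25, 42)},
    {"text": "Jan 1", "bbox": (370, 30, 400, 42)},
]
number, date = split_heading(heading_spans, split_x=200)
```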

This tool was developed with the help of ChatGPT.

Example page, plot of line midpoints

Plotted lines of the example page, manually defined areas of interest and classified text parts

Resulting table

Repository: SGensicke/PDFtoCSV_extract_regesta
