Enhancement using pdf-to-svg to get underlined and struck-out text formatting #41

Open
clayms opened this issue Aug 21, 2018 · 4 comments


clayms commented Aug 21, 2018

I have had good results converting a PDF to a series of SVG files (scalable vector graphics, an XML format) with the open source tool mupdf. I then use an XML parser (e.g. Beautiful Soup) to combine all of the text, text formatting, text position, page metadata, and document metadata into a pandas DataFrame.
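As a rough illustration of the parsing step, here is a minimal sketch. It is not the code described above: it uses the standard library's `xml.etree.ElementTree` in place of Beautiful Soup and a plain list of dicts in place of a DataFrame, so that it runs without dependencies. `SAMPLE_SVG` is a hypothetical, simplified page; real mupdf output is laid out differently, and `text_rows` is an illustrative helper name.

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"

# Hypothetical, simplified page. Real mupdf SVG output differs in detail.
SAMPLE_SVG = """<svg xmlns="http://www.w3.org/2000/svg" width="612" height="792">
  <text font-family="Times" font-size="12">
    <tspan x="72" y="100">SECTION 1. DEFINITIONS</tspan>
    <tspan x="72" y="120" font-weight="bold">Deleted text</tspan>
  </text>
  <path d="M 72 116 H 140" stroke="black"/>
</svg>"""


def inherited(span, parent, name, default):
    """Read an attribute from the <tspan>, falling back to its <text> parent."""
    return span.get(name, parent.get(name, default))


def text_rows(svg_source, page):
    """Return one dict per <tspan> with its text, position and font attributes."""
    root = ET.fromstring(svg_source)
    rows = []
    for text in root.iter(SVG_NS + "text"):
        for span in text.iter(SVG_NS + "tspan"):
            rows.append({
                "page": page,
                "text": (span.text or "").strip(),
                # x may hold one value per glyph; keep the first as the start
                "x": float(inherited(span, text, "x", "0").split()[0]),
                "y": float(inherited(span, text, "y", "0").split()[0]),
                "font_family": inherited(span, text, "font-family", None),
                "font_size": float(inherited(span, text, "font-size", "0")),
                "font_weight": inherited(span, text, "font-weight", "normal"),
            })
    return rows


rows = text_rows(SAMPLE_SVG, page=1)
for row in rows:
    print(row)
```

Passing `rows` to `pandas.DataFrame(rows)` would give a table with one row per text span, and repeating the call per page file would build up the whole document.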

I can create an additional pandas DataFrame holding the page coordinates of each <path> in the SVG file and join it with the text DataFrame so that I can identify and tag the specific text that was struck out or underlined, a critical feature in my use case.
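To make the matching idea concrete, here is a hedged sketch in plain Python. It assumes the horizontal `<path>` segments have already been reduced to `(x0, x1, y)` rules and the text to spans with a baseline and font size. The `Span` and `Rule` types, the 50% overlap requirement, and the offset thresholds are all assumptions chosen for illustration, not values from my actual code.

```python
from dataclasses import dataclass


@dataclass
class Span:
    text: str
    x0: float
    x1: float
    baseline: float   # y of the text baseline, in SVG units
    font_size: float


@dataclass
class Rule:
    x0: float
    x1: float
    y: float          # y of a horizontal line taken from a <path>


def classify(span, rule, min_overlap=0.5):
    """Return 'underline', 'strikeout' or None for one span/rule pair."""
    overlap = min(span.x1, rule.x1) - max(span.x0, rule.x0)
    if overlap < min_overlap * (span.x1 - span.x0):
        return None
    # SVG y grows downward, so a positive offset means the rule is below
    # the baseline. Thresholds are assumed fractions of the font size.
    offset = (rule.y - span.baseline) / span.font_size
    if 0.0 <= offset <= 0.3:
        return "underline"
    if -0.6 <= offset <= -0.15:
        return "strikeout"
    return None


def tag_spans(spans, rules):
    """Map each span index to the sorted list of decorations found for it."""
    tags = {}
    for i, span in enumerate(spans):
        found = {classify(span, rule) for rule in rules} - {None}
        tags[i] = sorted(found)
    return tags


spans = [
    Span("kept text", 72, 130, 100, 12),
    Span("deleted text", 72, 140, 120, 12),
    Span("new text", 72, 120, 140, 12),
]
rules = [Rule(72, 140, 116), Rule(72, 120, 141.5)]
tags = tag_spans(spans, rules)
print(tags)
```

The nested loop compares every span with every rule, which is fine for a sketch; on large documents the same comparison can be done on whole arrays at once.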

Using numpy to optimize many of these operations, I can generate the final DataFrame for a 150-page all-text PDF with abundant underlines and strike-outs in about one second on a consumer laptop.

It is then relatively straightforward to construct text-formatting features and base a document hierarchy on them.

Combinations of the following text-formatting features can be used to deduce document hierarchy:

  • Case: UPPER > Title Case > Sentence > lower
  • Font Size: Large > Small
  • Font Weight: Bold > Italic > Normal
  • Underline: Underline > No Underline
  • Line Spacing: Large > Small
  • Alignment: Centered > Left
  • Indentation: No Indent > Indent
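One possible way to turn these features into hierarchy levels is sketched below. It is only an illustration: the dict keys, the `case_of` heuristic, and especially the priority order inside `prominence` (font size first, then weight, case, underline, alignment, indentation) are assumptions, and line spacing is left out for brevity.

```python
CASE_RANK = {"upper": 3, "title": 2, "sentence": 1, "lower": 0}
WEIGHT_RANK = {"bold": 2, "italic": 1, "normal": 0}


def case_of(text):
    """Rough case detection: upper, title, sentence or lower."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return "lower"
    if text.isupper():
        return "upper"
    words = [w for w in text.split() if w[0].isalpha()]
    if len(words) > 1 and all(w[0].isupper() for w in words):
        return "title"
    if letters[0].isupper():
        return "sentence"
    return "lower"


def prominence(line):
    """Tuple that sorts more heading-like lines higher (assumed priority order)."""
    return (
        line["font_size"],
        WEIGHT_RANK[line["weight"]],
        CASE_RANK[case_of(line["text"])],
        int(line["underlined"]),
        int(line["align"] == "center"),
        int(line["indent"] == 0),
    )


def assign_levels(lines):
    """Give level 1 to the most prominent formatting, 2 to the next, and so on."""
    keys = sorted({prominence(line) for line in lines}, reverse=True)
    level_of = {key: i + 1 for i, key in enumerate(keys)}
    return [level_of[prominence(line)] for line in lines]


lines = [
    {"text": "ARTICLE I", "font_size": 14, "weight": "bold",
     "underlined": False, "align": "center", "indent": 0},
    {"text": "General Provisions", "font_size": 12, "weight": "bold",
     "underlined": True, "align": "left", "indent": 0},
    {"text": "This section applies to all parties.", "font_size": 12,
     "weight": "normal", "underlined": False, "align": "left", "indent": 0},
    {"text": "Defined Terms", "font_size": 12, "weight": "bold",
     "underlined": True, "align": "left", "indent": 0},
]
levels = assign_levels(lines)
print(levels)
```

Lines that share the same formatting get the same level, so the two underlined bold headings in the sample end up as siblings under the centered article title.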
clayms changed the title from "Enhancement using svg format to get underlined and struck-out text formatting" to "Enhancement using pdf-to-svg to get underlined and struck-out text formatting" on Aug 21, 2018

clayms commented Aug 22, 2018

See HazyResearch/fonduer#111 (comment) for an example PDF.
The mutool draw command described there converts the PDF to HTML, but it misses the abundant strikeouts and underlines: all of the text that is clearly struck out appears as regularly formatted text in the HTML output.


jbecke commented Aug 26, 2019

I'm interested in your method to convert from a mutool-generated SVG to Fonduer's data model. Are you able to release this code? Thanks!


clayms commented Aug 29, 2019

Let me talk to some people. In the meantime, I could provide some pseudocode outlining the whole process (in more detail than what's above).


jbecke commented Aug 29, 2019

Thanks, pseudocode or a bit more detail than above would be helpful! My email is jbecke@wharton.upenn.edu if you prefer to chat over email.
