Removing text from a PDF based on the font #1248
samuelbradshaw started this conversation in Ask for help with specific PDFs
Hi! I'm trying to extract text from several older sheet music PDFs that don't use modern text encodings. An example is this one:
https://assets.churchofjesuschrist.org/1f/1a/1f1aaeab25e6ef8257f3ed0f862aeb4c1df0edf6/the_morning_breaks.pdf
If you copy-paste the title, for example, you get this: "èÓÒÌÛÎÓÒ¸ ÛÚÓ"
From what I understand after reading closed issues (like #274 and #1083), there's not an easy way to extract this kind of text directly with pdfplumber or similar tools, so the next potential solution I'm looking into is OCR. But some OCR engines get tripped up by music notation when extracting text.
That brings me to my question: Is there a way to remove text or glyphs from a PDF based on the font? In this case, some of the music notation is encoded using music fonts – if I could strip out all of those music glyphs based on their font name, it would make the PDF cleaner to feed into OCR. If another tool is better suited for that, I'd appreciate any pointers.
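To make the idea concrete, here is a rough sketch of one possible approach, not a confirmed pdfplumber recipe: select the characters whose `fontname` matches a music font, then paint white rectangles over their bounding boxes on a rendered page image before handing it to OCR. The font names in `MUSIC_FONTS` and the helper names are hypothetical; the real names would come from inspecting `{c["fontname"] for c in page.chars}` on the actual file. It also assumes the music symbols are font glyphs at all; staff lines and beams are often drawn as vector lines, which this would not touch, and a glyph's bounding box may not cover all of its ink.

```python
import re

# Subset-embedded fonts carry a six-letter prefix, e.g. "ABCDEF+Petrucci".
_SUBSET_PREFIX = re.compile(r"^[A-Z]{6}\+")

# Hypothetical music font names; replace with the names found in the real PDF.
MUSIC_FONTS = {"petrucci", "sonata"}


def base_font_name(fontname):
    """Strip the subset prefix and lowercase the font name."""
    return _SUBSET_PREFIX.sub("", fontname or "").lower()


def is_music_char(char, music_fonts=MUSIC_FONTS):
    """Return True if a pdfplumber-style char dict uses a music font."""
    name = base_font_name(char.get("fontname"))
    return any(name.startswith(font) for font in music_fonts)


def mask_music_glyphs(pdf_path, page_number=0, resolution=300,
                      out_path="page_masked.png"):
    """Render one page and cover music-font glyphs with white boxes.

    Returns the number of characters that were masked.
    """
    import pdfplumber  # third-party; imported here so the helpers above stay stdlib-only

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number]
        music_chars = [c for c in page.chars if is_music_char(c)]
        image = page.to_image(resolution=resolution)
        if music_chars:
            # Each char dict carries x0/top/x1/bottom, which draw_rects accepts.
            image.draw_rects(music_chars, fill="white", stroke="white")
        image.save(out_path)
    return len(music_chars)
```

Calling `mask_music_glyphs("the_morning_breaks.pdf")` would write a PNG with the matched glyphs blanked out, which could then go to an OCR engine. This masks the rendered image rather than editing the PDF itself, so the original file stays untouched.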