Removing text from a PDF based on the font #1248

samuelbradshaw · 2025-01-12T08:32:51Z

Hi! I'm trying to extract text from several older sheet music PDFs that don't use modern text encodings. An example is this one:
https://assets.churchofjesuschrist.org/1f/1a/1f1aaeab25e6ef8257f3ed0f862aeb4c1df0edf6/the_morning_breaks.pdf

If you copy-paste the title, for example, you get this: "èÓÒÌÛÎÓÒ¸ ÛÚÓ"

From what I understand after reading closed issues (like #274 and #1083), there's not an easy way to extract this kind of text directly with pdfplumber or similar tools, so the next potential solution I'm looking into is OCR. But some OCR engines get tripped up by music notation when extracting text.

That brings me to my question: Is there a way to remove text or glyphs from a PDF based on the font? In this case, some of the music notation is encoded using music fonts – if I could strip out all of those music glyphs based on their font name, it would make the PDF cleaner to feed into OCR. If another tool is better suited for that, I'd appreciate any pointers.
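
One possible approach to the question above, sketched with pdfplumber: each character object carries a `fontname`, so the page can be rendered to an image and the bounding boxes of characters set in music fonts painted white before OCR. The name fragments in `MUSIC_FONT_HINTS` are placeholders, not the fonts actually used in the linked PDF, and the helper names are hypothetical. This also assumes the notation really is drawn as font glyphs; staff lines and other vector shapes would not be affected, and the reported bounding boxes may not cover every glyph's full ink.

```python
from collections import Counter

# Placeholder name fragments (an assumption): list the PDF's real font
# names with font_counts() first, then adjust these.
MUSIC_FONT_HINTS = ("petrucci", "sonata", "maestro")


def is_music_char(obj, hints=MUSIC_FONT_HINTS):
    """Return True for a char object whose font name contains a hint."""
    if obj.get("object_type") != "char":
        return False
    # Embedded subsets often look like "ABCDEF+FontName", so match by substring.
    fontname = (obj.get("fontname") or "").lower()
    return any(hint in fontname for hint in hints)


def font_counts(chars):
    """Count characters per font name, to see which fonts a page uses."""
    return Counter(c.get("fontname") for c in chars)


def mask_music_glyphs(pdf_path, out_path, page_number=0, resolution=300):
    """Render one page and paint white boxes over music-font characters."""
    import pdfplumber  # imported here so the helpers above work without it

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number]
        print(font_counts(page.chars))  # inspect the real font names
        music = [c for c in page.chars if is_music_char(c)]
        im = page.to_image(resolution=resolution)
        if music:
            white = (255, 255, 255, 255)
            im.draw_rects(music, fill=white, stroke=white)
        im.save(out_path)
```

Calling `mask_music_glyphs("the_morning_breaks.pdf", "page1.png")` would write a PNG that can be passed to an OCR engine. Masking the rendered image rather than editing the PDF is a deliberate simplification: it leaves the original file untouched, but the result is a raster image, not a cleaned PDF.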
