extracted word is broken #964
Comments
Hi @jnhyperion, and thanks for your interest in this library. Have you tried adjusting the |
@jsvine I've tried already (with the param range from 0.001~3000), and it's not working for my case.
|
Thanks for clarifying. It appears that the issue stems from the PDF including extraneous whitespace characters, in particular a long string of them that overlap with the "Homeowner" text:
import pdfplumber
pdf = pdfplumber.open("./example.pdf")
page = pdf.pages[0]
im = page.to_image()
whitespace_chars = [ c for c in page.chars if c["text"] == " " ]
im.reset().draw_rects(whitespace_chars)
To resolve this, you'll want to filter out those whitespace characters (something that
filtered = page.filter(lambda obj: obj.get("text") != " ")
print(filtered.extract_text(x_tolerance=1))
Returns what I think you want (although perhaps you also want
|
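To illustrate why stray whitespace characters can split a word, here is a minimal, self-contained sketch. It does not use pdfplumber, and it is not the library's actual word-grouping code: `group_words` and `make_chars` are hypothetical helpers, and the coordinates are invented. The assumption is a simplified model in which characters are sorted by `x0`, and a word ends at any space character or at any horizontal gap wider than `x_tolerance`.

```python
def group_words(chars, x_tolerance=1.0):
    """Toy word grouping (an assumption, not pdfplumber's implementation):
    sort characters left to right, then end the current word at every
    space character or at any gap wider than x_tolerance."""
    words, current, last_x1 = [], "", None
    for c in sorted(chars, key=lambda c: c["x0"]):
        if c["text"] == " ":
            # A whitespace character always terminates the current word.
            if current:
                words.append(current)
            current, last_x1 = "", None
            continue
        if current and last_x1 is not None and c["x0"] - last_x1 > x_tolerance:
            # The gap to the previous character is too wide: start a new word.
            words.append(current)
            current = ""
        current += c["text"]
        last_x1 = c["x1"]
    if current:
        words.append(current)
    return words


def make_chars(text, start_x=0.0, width=5.0):
    """Lay out characters side by side with a fixed width (invented geometry)."""
    return [
        {"text": ch, "x0": start_x + i * width, "x1": start_x + (i + 1) * width}
        for i, ch in enumerate(text)
    ]


letters = make_chars("Homeowner")
# Six stray spaces whose positions overlap the first six letters.
stray_spaces = [
    {"text": " ", "x0": 2.5 + i * 5.0, "x1": 3.5 + i * 5.0} for i in range(6)
]
chars = letters + stray_spaces

broken = group_words(chars)
fixed = group_words([c for c in chars if c["text"] != " "])

print(" ".join(broken))
print(" ".join(fixed))
```

In this toy model the overlapping spaces sort in between the letters and break the word apart, while dropping them first leaves the word intact. That is the same idea as the `page.filter(...)` call above.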
Yes, this is certainly the case, as PDFs themselves are quite varied and designed in an enormous range of styles/layouts/etc. The core functions of It's also possible that, if you're just looking for a universal text-extractor, another tool may solve this problem more directly. |
I see, thanks for your explanation. At least my PR #965 resolves this issue and does not introduce other issues (only tested on a few of our PDF files). |
Thanks, I'll close this issue, and continue the discussion in the PR. |
Code to reproduce the problem
PDF file
example.pdf
Expected behavior
extracted line:
VLHDU8SHRR Homeowner Discount .....
Actual behavior
VLHDU8SHRR H o m e o w ner Discount .....
Environment
Additional context
Using the param use_text_flow=True avoids this bug, but it causes other extraction-format bugs, like:
expected:
foo: bar
actual:
bar foo: