tolerance for new word within cell is reversed #840
-
Hmmm, without seeing the actual code/settings you're using, it's hard to assess exactly what's producing your output. But if I set
That's actually the intent! Perhaps it's not worded as clearly as it could be in the documentation? Basically: the tolerance represents how much space between characters you'll tolerate before deciding there's a new word.
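To make that reading concrete, here is a minimal sketch of the idea. It is not pdfplumber's actual code: `gap_starts_new_word` and its parameters are hypothetical names used only for illustration, and the coordinates are made up.

```python
def gap_starts_new_word(prev_x1: float, next_x0: float, x_tolerance: float) -> bool:
    """Hypothetical helper: a new word starts only when the gap exceeds the tolerance."""
    return (next_x0 - prev_x1) > x_tolerance


# Two characters separated by a 2-point horizontal gap (made-up coordinates).
prev_x1, next_x0 = 10.0, 12.0

# A small tolerance does not absorb the gap, so the characters are split.
splits_at_tol_1 = gap_starts_new_word(prev_x1, next_x0, x_tolerance=1)

# A larger tolerance absorbs the same gap, so the characters stay in one word.
splits_at_tol_3 = gap_starts_new_word(prev_x1, next_x0, x_tolerance=3)
```

Under this reading, raising the tolerance merges more and lowering it splits more.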
-
@jsvine Sorry I didn't include code; I was just using the defaults:
Your result actually shows the issue. Setting text_y_tolerance = 1 fixes the vertical cell issue, but it breaks the header: So changing y_tolerance fixes one and breaks the other, no matter which way it is tuned.
-
@jsvine Once again, I'm impressed that there is a possible solution that doesn't require pdfplumber code changes. In this case, though, I think a separate tolerance parameter is in order for the different uses. That is, of course, just my opinion. pdfplumber is an awesome resource and you have clearly made good decisions on trade-offs.
-
close_text_redacted.pdf
In the example PDF, many of the cells combine elements that should be separate (XX, XNS, etc.) because they are too close vertically:
Expected result (\n separation):
I believe this is because the tolerances are reversed in utils/text.py char_begins_new_word:
In this code, increasing the tolerance actually makes it harder for items to be separated. I understand that the same tolerance value is used to decide whether words are on the same line, but that is a comparison between the tops of the elements. Here, the comparison is against the space between the top element and the bottom element. I solved the problem by flipping the sign of the tolerance (I have a PR ready for this). Alternatively, a new tolerance variable is needed for this function.
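As a rough illustration of the two behaviors being contrasted here, the sketch below compares adding the tolerance with subtracting it. This is a simplified stand-in, not the code from utils/text.py: both function names are hypothetical, and the coordinates assume a top-origin page where larger values are lower on the page.

```python
def splits_with_added_tolerance(prev_bottom: float, next_top: float, y_tolerance: float) -> bool:
    """Hypothetical: split only if the next element starts more than y_tolerance below."""
    return next_top > prev_bottom + y_tolerance


def splits_with_flipped_sign(prev_bottom: float, next_top: float, y_tolerance: float) -> bool:
    """Hypothetical: the same comparison with the sign of the tolerance flipped."""
    return next_top > prev_bottom - y_tolerance


# Two stacked elements in a cell, 1 point apart vertically (made-up coordinates).
prev_bottom, next_top = 100.0, 101.0

# Map each tolerance to (added-tolerance result, flipped-sign result).
results = {
    tol: (
        splits_with_added_tolerance(prev_bottom, next_top, tol),
        splits_with_flipped_sign(prev_bottom, next_top, tol),
    )
    for tol in (0, 1, 3)
}
```

With the added tolerance, the closely stacked elements stop being separated as soon as the tolerance reaches the size of the gap; with the flipped sign they stay separated at every tolerance tried here.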