PageLayout to_s() merges TextRuns that overlap #290

Open
rockorequin opened this issue Mar 26, 2019 · 8 comments

@rockorequin

rockorequin commented Mar 26, 2019

I've noticed that some PDFs contain extra, apparently spurious text, e.g. some bank statements (presumably the bank adds it to make the statements hard to parse).

An example is where you have a transparent text run of '6' in one text run and an amount of say '50.00' in a text run that overlaps the '6'. PDF Reader's Page text() method outputs these two as 650.00, so it incorrectly looks like the amount is $650 instead of $50. The overlap also occurs when the '6' ends in the column immediately before the '50.00'.

If I view the PDF in Evince, the spurious text is rendered transparently, so the document looks fine unless I select the text for copy and paste. In the pasted output, the two strings appear with a space between them, i.e. '6 50.00'. That's not ideal, but at least you can recognise that the amount is $50 and not $650.

The PageLayout to_s method is doing the hard work of mapping the TextRun objects and rendering them to a string. It calls local_string_insert to insert each text at its x_pos and y_pos (x_pos and y_pos are converted into columns from the raw x and y coords).
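To illustrate the merging, here is a minimal Ruby sketch of this kind of column-based insertion. It is a simplified model and not pdf-reader's actual code: `Run` and `render_row` are hypothetical names, and the column numbers are made up for the example.

```ruby
# Hypothetical stand-in for a positioned text run (not pdf-reader's TextRun).
Run = Struct.new(:col, :text)

# Write each run into a fixed-width row at its column, overwriting whatever
# is already there. Runs that start past the right edge are skipped.
def render_row(runs, width)
  row = " " * width
  runs.each do |run|
    next if run.col >= width
    row[run.col, run.text.length] = run.text
  end
  row.rstrip
end

# The transparent '6' ends in the column immediately before '50.00'.
puts render_row([Run.new(10, "6"), Run.new(11, "50.00")], 20)
```

Because nothing separates the two runs on the character grid, the rendered row reads as a single number.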

Brainstorming, there might be a couple of ways around this:

  • I have tried moving the text runs that overlap prior to calling PageLayout's to_s method (eg at the end of PageLayout's initialize method) to ensure that there is at least one column between them. This fixes the issue: I get 6 50.00 instead of 650.00, which matches how Evince behaves. I did it by grouping the text runs into a hash of { y column => [ ary of TextRun ] } and then sorting each ary by its start x column. Then I check for overlap by comparing the endx column of one text run against the x column of the text run that follows it. The disadvantage is that you could lose text off the right-hand side of the page, because to_s checks that a text run starts within the expected number of columns on the page before inserting it. Maybe we could remove that check so the text isn't lost.

  • We could add an alternative method to page.text() that returns the TextRun objects directly, eg as a hash of { y_column => [ ary of TextRun ] } or as an Array of [ ary of TextRun ]. If the TextRun object had methods to return its x_col and endx_col as well as the raw x and endx, the caller could figure out for themself where they are located on the page. (As a side benefit, the caller could also see the TextRun attributes like font_size and width. We could even make the TextRun store its font so the user could see which font is applicable, which might help with Parse font of given text #272.)
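The grouping, sorting, and overlap check from the first idea could look roughly like the sketch below. It works on plain structs rather than pdf-reader's TextRun; `GridRun`, `end_col`, and `separate_overlaps` are hypothetical names, and columns are assumed to be integer character positions.

```ruby
# Hypothetical run that already knows its row and start column on the grid.
GridRun = Struct.new(:row, :col, :text) do
  # Last column occupied by this run.
  def end_col
    col + text.length - 1
  end
end

# Group runs by row, sort each row by start column, and shift any run that
# would touch or overlap its left neighbour so that `gap` columns separate them.
def separate_overlaps(runs, gap: 1)
  runs.group_by(&:row).flat_map do |_row, row_runs|
    next_free = 0
    row_runs.sort_by(&:col).map do |run|
      moved = GridRun.new(run.row, [run.col, next_free].max, run.text)
      next_free = moved.end_col + 1 + gap
      moved
    end
  end
end

runs = [GridRun.new(3, 11, "50.00"), GridRun.new(3, 10, "6")]
separate_overlaps(runs).each { |run| puts "#{run.col}: #{run.text}" }
```

Shifting runs to the right is what can push text past the page width, which is the drawback noted in the first bullet.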

This might also be related to #43.

I'm using pdf-reader v2.2.0.

@KrauseFx

KrauseFx commented Sep 6, 2019

I ran into the same problem. I know this isn't a solution, but since I only needed a one-off for a single file, https://github.com/yomurb/yomu handled the overlapping text for me.

@yob
Owner

yob commented Oct 18, 2019

> I have tried moving the text runs that overlap prior to calling PageLayout's to_s method (eg at the end of PageLayout's initialize method) to ensure that there is at least one column between them

I've got some logic for handling overlapping characters nearly ready to merge in #299. However, for now it's only throwing away identical characters that overlap. Maybe we could add an option to Page#text that throws away any overlapping character? The only issue would be deciding which one "wins".

Alternatively, we could add some logic that throws away invisible characters. The work in #301 would help with this - it should allow recording an alpha value on each TextRun.
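As a rough sketch of the "throw away invisible characters" idea, assuming each run carried an alpha value between 0.0 and 1.0: the `alpha` attribute, `AlphaRun`, and `visible_runs` are hypothetical here and do not describe pdf-reader's current API.

```ruby
# Hypothetical run with an opacity value; 0.0 means fully transparent.
AlphaRun = Struct.new(:col, :text, :alpha)

# Keep only runs that are at least `min_alpha` opaque.
def visible_runs(runs, min_alpha: 0.01)
  runs.select { |run| run.alpha >= min_alpha }
end

runs = [AlphaRun.new(10, "6", 0.0), AlphaRun.new(11, "50.00", 1.0)]
puts visible_runs(runs).map(&:text).join(" ")
```

Filtering on visibility also sidesteps the question of which overlapping character "wins", at least when one of the two is transparent.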

> We could add an alternative method to page.text() that returns the TextRun objects directly, eg as a hash of { y_column => [ ary of TextRun ] } or as an Array of [ ary of TextRun ]

I'm very open to this. It'd be a nice "escape hatch" for folks who find Page#text too limiting, and who have the time to write some custom layout code. I helped someone do this last week (see https://gist.github.com/yob/d9e28e39943aec251cb570bf2879bda4), and it'd be nice to have it built in.

@akiotajima

akiotajima commented Jan 20, 2020

yob, thank you for the great, clean code for reading PDFs.

I have a question related to this issue.
When you calculate the new x position in PageState#process_glyph_displacement, you set the font size (the variable fs) from PageState#font_size, which is derived from the Y component of the text matrix.
IMHO, since the font size is also used to get the glyph width in PageTextReader#internal_show_text, it should be

zero, _ = trm_transform(0,0)                   
one, _  = trm_transform(1,1)

instead of

_, zero = trm_transform(0,0)
_, one  = trm_transform(1,1)

because the current code takes the font size from the height (the vertical scale) rather than from the width.

@yob
Owner

yob commented Jan 20, 2020

Hi @akiotajima,

That's quite possible, I'm a bit hazy on why past-me structured the code that way.

I tried changing it and a number of tests fail. It's possible the tests are wrong too, but failing tests mean it's not a straightforward change. Do you have a sample PDF that renders correctly with your change?

I'm also interested in how this relates to the current issue. I guess incorrect horizontal displacement could result in some characters overlapping when they shouldn't?

@akiotajima

akiotajima commented Jan 20, 2020

Hi,
I asked because I ran into some overlapping text when calling page.content.

The example text below (ABC123) is pseudo text, because the original is printed in Japanese characters.
When printed, it reads 'ABC123' (ABC is in a narrow font). However, I get AB123 by calling page.text.

For the actual example, the page.content_raw output is below.

/C2_0 1 Tf 4.44 0 0 6 51.1569 428.4363 Tm <0BBE0869020605CB0FF606FE10B207AA>Tj 6 0 0 6 86.6769 428.4363 Tm <022A00 ...skip

The text in the narrow font (<0BBE... 7AA>) is scaled by 4.44 horizontally according to the first Tm command, and the text in the normal font (<022A00... ) is scaled by 6 according to the second Tm command.

Then I inserted the debug line in PageTextReader#internal_show_text as

puts "x:#{newx}, y:#{newy}, char=#{utf8_chars}, gwidth:#{glyph_width}, scaled_gwidth:#{scaled_glyph_width}" 

I got the output below.

x:81.15690000000001, y:428.4363, char=関            # 06FE
x:87.15690000000001, y:428.4363, char=連            # 10B2
x:93.15690000000001, y:428.4363, char=業            # 07AA
x:86.6769, y:428.4363, char=(                  #=> overlaps '連業' in PageLayout#to_s or when sorting the TextRuns; I did not trace further.
x:92.6769, y:428.4363, char=0
x:95.6769, y:428.4363, char=.
x:98.6769, y:428.4363, char=2
x:101.6769, y:428.4363, char=%
x:107.6769, y:428.4363, char=)

The x positions of the TextRuns are why I got '関(0.2%)' instead of '関連業(0.2%)'.
After I patched PageState#font_size, I get the expected text from the PDF.
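The arithmetic behind this can be checked with a small sketch using the first Tm operands above. It makes several assumptions: the CTM is the identity, so the text matrix alone determines the scale; every glyph advances by exactly one em (width 1000/1000) with no extra spacing; and `transform` and `glyph_starts` are hypothetical helpers, not pdf-reader methods.

```ruby
# Apply a PDF matrix [a b c d e f] to a point:
#   x' = a*x + c*y + e,  y' = b*x + d*y + f
def transform(matrix, x, y)
  a, b, c, d, e, f = matrix
  [a * x + c * y + e, b * x + d * y + f]
end

tm = [4.44, 0, 0, 6, 51.1569, 428.4363] # operands of the first Tm

x0, y0 = transform(tm, 0, 0)
x1, y1 = transform(tm, 1, 1)
horizontal_scale = x1 - x0 # width of one text-space unit
vertical_scale   = y1 - y0 # height of one text-space unit

# Start x of each glyph, assuming a one-em advance per glyph (an assumption).
def glyph_starts(origin_x, scale, count)
  (0...count).map { |i| origin_x + i * scale }
end

puts "horizontal: #{horizontal_scale.round(4)}, vertical: #{vertical_scale.round(4)}"
puts "using vertical scale:   #{glyph_starts(x0, vertical_scale, 8).last(3).map { |x| x.round(4) }}"
puts "using horizontal scale: #{glyph_starts(x0, horizontal_scale, 8).last(3).map { |x| x.round(4) }}"
```

Under these assumptions, advancing by the vertical scale (6) puts the sixth glyph at about 81.1569, which matches the debug output and runs past the 86.6769 where the second Tm starts. Advancing by the horizontal scale (4.44) makes the eight glyphs end at about 86.6769, exactly where the next run begins.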

However, I'm not certain why you wrote PageState#font_size the way it is. For example, many English PDFs change the font height, and perhaps the code needs to account for that.
I'd like to hear your opinion.

@yob
Owner

yob commented Jan 24, 2020

Thanks for the extra details @akiotajima.

To be honest, that code was written so long ago that I've forgotten exactly what I was thinking at the time.

I also have very little experience with PDFs that have Japanese text, so it's quite likely there's a bug that is more significant to Japanese text than English text.

I'm happy to accept pull requests if you have the time to put one together. Ideally, it'd be great to have a spec in spec/data/integration_specs.pdf that tests a simple PDF that exhibits the situation you describe.

If you want to continue the discussion, can we move it to a new github issue? This issue is primarily about characters where the PDF file intentionally overlaps them, however it sounds like you've hit a bug where characters are overlapping when they shouldn't be.

@akiotajima

Hi yob,
Thank you for your kind response.
Sorry for misreading what this overlapping issue is about; moving the discussion to a new issue sounds good.

I'd be happy to create a pull request. However, at the moment I have no tool for creating PDFs and little knowledge of how to build one by hand for the specs.
I wonder if you could wait a while to prepare them.

@yob
Owner

yob commented Jan 30, 2020

> I wonder if you could wait a while to prepare them.

Absolutely, I'm in no hurry.

I'd offer to help create a sample PDF, but it may be difficult without reading Japanese.

I understand it may not be possible due to privacy reasons, but could we take the file you have and strip it back to a single page with the characters you've mentioned (by editing the file and deleting everything else on the page)? If we're lucky, that might leave us with a small PDF that exhibits the issue.
