Extracting filled polygons and saving as new pdf #57

Closed
ddurgaprasad opened this issue Mar 30, 2018 · 4 comments

@ddurgaprasad

My task is to separate straight lines, filled polygons, and text into different PDFs for further analysis.
I have succeeded in extracting straight lines using '_page.edges'. I presume edges are the ones with x0=x1 or y0=y1. Next, the filled polygons need to be saved into separate PDFs. Is it possible to separate out the filled polygons and text?
Sample.pdf
I also noticed that, of the 5 text items in the attachment, pdfminer extracted only two correctly. In one case, 8524 is extracted as 8254. The degree symbol ° is extracted as (cid:176). For this reason, I am thinking of separating out the text and running OCR on it.
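
For the `(cid:176)` symptom specifically, a post-processing pass can sometimes help before resorting to OCR. The sketch below is not part of pdfplumber or pdfminer.six; `replace_cid_tokens` is a hypothetical helper, and it rests on the assumption that the number inside the token is also the Unicode code point of the intended character. That happens to hold for 176 (the degree sign) but is not guaranteed for other fonts or CIDs.

```python
import re

# Matches placeholder tokens of the form "(cid:176)" in extracted text.
CID_TOKEN = re.compile(r"\(cid:(\d+)\)")


def replace_cid_tokens(text):
    """Replace each "(cid:N)" token with chr(N).

    ASSUMPTION: N equals the Unicode code point of the intended
    character. This is true for 176 (the degree sign) but may be
    wrong for other fonts, so check the result against the PDF.
    """
    return CID_TOKEN.sub(lambda match: chr(int(match.group(1))), text)


if __name__ == "__main__":
    print(replace_cid_tokens("45(cid:176)"))
```

This does nothing for the transposed digits (8524 read as 8254), which is a separate problem.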

@jsvine
Owner

jsvine commented Apr 17, 2018

Hi @ddurgaprasad, sounds interesting!

I presume edges are the ones with x0=x1 or y0=y1

Essentially, yes. There's an exception, however: if pdfminer.six (the library powering pdfplumber) reports the object as a curve (rather than a line or rect), its segments will not be recognized as edges, even if they are strictly vertical or strictly horizontal.
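
If straight segments that live inside curve objects matter for your analysis, one way to recover them is to walk each curve's `points` list yourself. This is a minimal sketch, not a pdfplumber feature: `axis_aligned_segments` is a hypothetical helper, and it assumes `points` is a list of `(x, y)` pairs in drawing order, as in the curve dictionaries pdfplumber returns.

```python
def axis_aligned_segments(points, closed=False):
    """Return the strictly horizontal or vertical segments of a point list.

    Each result is a tuple (orientation, start, end), where orientation
    is "h" or "v". Set closed=True to also test the segment that joins
    the last point back to the first.
    """
    pts = list(points)
    pairs = list(zip(pts, pts[1:]))
    if closed and len(pts) > 2:
        pairs.append((pts[-1], pts[0]))

    segments = []
    for start, end in pairs:
        if start == end:
            continue  # skip zero-length segments
        if start[0] == end[0]:
            segments.append(("v", start, end))
        elif start[1] == end[1]:
            segments.append(("h", start, end))
    return segments


if __name__ == "__main__":
    # One vertical, one horizontal, and one sloping segment.
    print(axis_aligned_segments([(0, 0), (0, 5), (3, 5), (4, 7)]))
```

The comparison is exact, so coordinates that differ only by rounding noise will not match; a small tolerance may be needed for real files.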

Is it possible to separate out the filled polygons and text?

pdfplumber does not yet provide an automated way to detect polygons, but you can detect them yourself by inspecting each curve's points attribute. E.g., for your sample PDF, pdf.pages[0].curves[0] (the thinner of the two sloping lines) is this:

{'x0': Decimal('219.785'),
 'y0': Decimal('127.699'),
 'x1': Decimal('336.328'),
 'y1': Decimal('406.652'),
 'width': Decimal('116.543'),
 'height': Decimal('278.953'),
 'linewidth': Decimal('0'),
 'stroke': Decimal('0'),
 'fill': Decimal('1'),
 'evenodd': Decimal('1'),
 'stroking_color': None,
 'non_stroking_color': None,
 'object_type': 'curve',
 'page_number': 1,
 'points': [(Decimal('225.258'), Decimal('466.301')),
  (Decimal('336.328'), Decimal('189.578')),
  (Decimal('330.785'), Decimal('187.348')),
  (Decimal('219.785'), Decimal('464.141'))],
 'top': Decimal('187.348'),
 'bottom': Decimal('466.301'),
 'doctop': Decimal('187.348')}
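
Building on the dictionary above, here is a rough sketch of how such a check might look. The function names and the `min_area` parameter are hypothetical, not part of pdfplumber, and the rule itself is an assumption: a curve counts as a filled polygon when its `fill` value is truthy, it has at least three points, and those points enclose a non-zero area (shoelace formula).

```python
from decimal import Decimal


def polygon_area(points):
    """Absolute area enclosed by the points, via the shoelace formula."""
    pts = list(points)
    total = 0
    # Pair each point with the next one, wrapping around to the first.
    for (xa, ya), (xb, yb) in zip(pts, pts[1:] + pts[:1]):
        total += xa * yb - xb * ya
    return abs(total) / 2


def is_filled_polygon(curve, min_area=0):
    """Heuristic: truthy fill, at least three points, area above min_area."""
    points = curve.get("points") or []
    if not curve.get("fill") or len(points) < 3:
        return False
    return polygon_area(points) > min_area


if __name__ == "__main__":
    # Only the relevant keys of the curve shown above.
    sample = {
        "fill": Decimal("1"),
        "points": [
            (Decimal("225.258"), Decimal("466.301")),
            (Decimal("336.328"), Decimal("189.578")),
            (Decimal("330.785"), Decimal("187.348")),
            (Decimal("219.785"), Decimal("464.141")),
        ],
    }
    print(is_filled_polygon(sample))  # True
```

With a page open in pdfplumber, the same test could be applied as `[c for c in page.curves if is_filled_polygon(c)]`. Writing the matching shapes to a new PDF is a separate step that pdfplumber itself does not handle.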

@ddurgaprasad
Author

@jsvine, thanks for clarifying about polygons. For text extraction, can you suggest any workaround? I am not sure why some characters are transposed.

@jsvine
Owner

jsvine commented Apr 17, 2018

I don't know if there's a workaround, unfortunately. For text extraction, pdfplumber depends on pdfminer.six, which in turn depends on the PDF being created correctly. There's some discussion of the issue here:

@jsvine jsvine closed this as completed Jul 18, 2020
@jsvine
Owner

jsvine commented Jul 18, 2020

Closing old issues. Feel free to reopen.
