Extracting filled polygons and saving as new pdf #57

ddurgaprasad · 2018-03-30T04:20:30Z

My task is to separate out straight lines, filled polygons, and text into different PDFs for further analysis.
I have successfully extracted the straight lines using '_page.edges'. I presume edges are the ones with x0=x1 or y0=y1. Now the filled polygons need to be saved into separate PDFs. Is it possible to separate out the filled polygons and text?
Sample.pdf
I also noticed that, of the 5 text items in the attachment, pdfminer extracted only two correctly. In one case, 8524 is extracted as 8254. The degree symbol ° is extracted as (cid:176). For this reason, I am thinking of separating out the text and using OCR.

jsvine · 2018-04-17T03:51:24Z

Hi @ddurgaprasad, sounds interesting!

> I presume edges are the ones with x0=x1 or y0=y1

Essentially, yes. There's an exception, however: if pdfminer.six (the library powering pdfplumber) says that the object is a curve (rather than a line or rect), its edges will not be recognized as edges, even if they are strictly vertical or strictly horizontal.
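If those axis-aligned pieces of curves matter for your analysis, one option is to walk each curve's points yourself. The following is only a minimal sketch, not part of pdfplumber: the helper name `axis_aligned_segments` and the `tol` parameter are hypothetical, and it assumes each curve is a dict with a `points` list of (x, y) pairs like the one pdfplumber returns.

```python
def axis_aligned_segments(curve, tol=0.0):
    """Return the consecutive point pairs of a curve that form a strictly
    horizontal or vertical segment (within `tol`).

    Hypothetical helper: `curve` is assumed to be a dict with a "points"
    list of (x, y) pairs, as in pdfplumber's curve objects.
    """
    pts = [(float(x), float(y)) for x, y in curve["points"]]
    segments = []
    for (xa, ya), (xb, yb) in zip(pts, pts[1:]):
        if (xa, ya) == (xb, yb):
            continue  # skip zero-length segments
        if abs(xa - xb) <= tol or abs(ya - yb) <= tol:
            segments.append(((xa, ya), (xb, yb)))
    return segments


# Illustrative data only (not taken from the sample PDF).
example_curve = {"points": [(0, 0), (10, 0), (10, 5), (3, 9)]}
segments = axis_aligned_segments(example_curve)
```

Here `segments` holds the horizontal piece from (0, 0) to (10, 0) and the vertical piece from (10, 0) to (10, 5); the sloping last piece is left out.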

> Is it possible to separate out the filled polygons and text?

pdfplumber does not yet provide an automated way to detect polygons, but you can detect them yourself by inspecting each curve's points attribute. E.g., for your sample PDF, pdf.pages[0].curves[0] (the thinner of the two sloping lines) is this:

{'x0': Decimal('219.785'),
 'y0': Decimal('127.699'),
 'x1': Decimal('336.328'),
 'y1': Decimal('406.652'),
 'width': Decimal('116.543'),
 'height': Decimal('278.953'),
 'linewidth': Decimal('0'),
 'stroke': Decimal('0'),
 'fill': Decimal('1'),
 'evenodd': Decimal('1'),
 'stroking_color': None,
 'non_stroking_color': None,
 'object_type': 'curve',
 'page_number': 1,
 'points': [(Decimal('225.258'), Decimal('466.301')),
  (Decimal('336.328'), Decimal('189.578')),
  (Decimal('330.785'), Decimal('187.348')),
  (Decimal('219.785'), Decimal('464.141'))],
 'top': Decimal('187.348'),
 'bottom': Decimal('466.301'),
 'doctop': Decimal('187.348')}
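
A rough sketch of what that inspection could look like, using only the `fill` and `points` keys shown above. The helper names `polygon_area` and `is_filled_polygon` and the `min_area` threshold are hypothetical, not pdfplumber API, and the test treats the points as straight-edged polygon vertices, which is an assumption that will not hold for genuinely curved paths.

```python
from decimal import Decimal


def polygon_area(points):
    """Area of the polygon through `points`, via the shoelace formula."""
    pts = [(float(x), float(y)) for x, y in points]
    total = 0.0
    # Pair each vertex with the next one, wrapping around to the first.
    for (xa, ya), (xb, yb) in zip(pts, pts[1:] + pts[:1]):
        total += xa * yb - xb * ya
    return abs(total) / 2.0


def is_filled_polygon(curve, min_area=0.0):
    """Heuristic: the curve is filled, has at least three distinct points,
    and encloses more than `min_area` (hypothetical threshold)."""
    points = curve.get("points") or []
    if not curve.get("fill"):
        return False
    if len({tuple(p) for p in points}) < 3:
        return False
    return polygon_area(points) > min_area


# The `fill` and `points` values of the curve shown above.
sample_curve = {
    "fill": Decimal("1"),
    "points": [
        (Decimal("225.258"), Decimal("466.301")),
        (Decimal("336.328"), Decimal("189.578")),
        (Decimal("330.785"), Decimal("187.348")),
        (Decimal("219.785"), Decimal("464.141")),
    ],
}
sample_is_polygon = is_filled_polygon(sample_curve)
```

With a page open, something like `[c for c in page.curves if is_filled_polygon(c)]` would then select the candidate polygons. A non-zero `min_area` may help to drop hairline shapes.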

ddurgaprasad · 2018-04-17T05:38:06Z

@jsvine, thanks for clarifying about polygons. For text extraction, do you suggest any workaround? I am not sure why some characters are transposed.

jsvine · 2018-04-17T11:33:43Z

I don't know if there's a workaround, unfortunately. For text extraction, pdfplumber depends on pdfminer.six, which in turn depends on the PDF being created correctly. There's some discussion of the issue here:

jsvine · 2020-07-18T12:06:57Z

Closing old issues. Feel free to reopen.

jsvine closed this as completed Jul 18, 2020
