Extracting filled polygons and saving as new pdf #57
Comments
Hi @ddurgaprasad, sounds interesting!
Essentially, yes. There's an exception, however: If
{'x0': Decimal('219.785'),
'y0': Decimal('127.699'),
'x1': Decimal('336.328'),
'y1': Decimal('406.652'),
'width': Decimal('116.543'),
'height': Decimal('278.953'),
'linewidth': Decimal('0'),
'stroke': Decimal('0'),
'fill': Decimal('1'),
'evenodd': Decimal('1'),
'stroking_color': None,
'non_stroking_color': None,
'object_type': 'curve',
'page_number': 1,
'points': [(Decimal('225.258'), Decimal('466.301')),
(Decimal('336.328'), Decimal('189.578')),
(Decimal('330.785'), Decimal('187.348')),
(Decimal('219.785'), Decimal('464.141'))],
'top': Decimal('187.348'),
'bottom': Decimal('466.301'),
'doctop': Decimal('187.348')}
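An object like the one above can be picked out by its `fill` flag. The following is a minimal sketch, assuming the objects are plain dicts shaped like this dump; `filled_curves` is a hypothetical helper name used for illustration, not part of pdfplumber.

```python
from decimal import Decimal


def filled_curves(objects):
    """Return the objects that are curves with a truthy `fill` flag.

    Assumption: each object is a dict with 'object_type' and 'fill' keys,
    as in the dump above. bool() treats Decimal('0') as False.
    """
    return [
        obj for obj in objects
        if obj.get("object_type") == "curve" and bool(obj.get("fill"))
    ]


# Hypothetical sample data, reduced to the keys the filter looks at.
sample = [
    {"object_type": "curve", "fill": Decimal("1"), "stroke": Decimal("0")},
    {"object_type": "curve", "fill": Decimal("0"), "stroke": Decimal("1")},
    {"object_type": "line", "fill": Decimal("0"), "stroke": Decimal("1")},
]

filled = filled_curves(sample)
print(len(filled))
```

Only the first sample object passes the filter, since it is the only curve whose `fill` value is non-zero.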
@jsvine, thanks for clarifying about polygons. For text extraction, do you suggest any workaround? I am not sure why some characters are transposed.
I don't know if there's a workaround, unfortunately. For text extraction,
Closing old issues. Feel free to reopen.
My task is to separate straight lines, filled polygons, and text into different PDFs for further analysis.
I have successfully extracted the straight lines using '_page.edges'. I presume edges are the objects with x0 = x1 or y0 = y1. Next, the filled polygons need to be saved into separate PDFs. Is it possible to separate out the filled polygons and the text?
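The x0 = x1 or y0 = y1 presumption can be checked directly on the object dicts. This is a rough sketch, assuming each object carries 'x0', 'x1', 'y0' and 'y1' keys as in the dump in this thread; `is_axis_aligned` and its `tolerance` parameter are hypothetical names, not pdfplumber API.

```python
from decimal import Decimal


def is_axis_aligned(obj, tolerance=0):
    """True when the object's bounding box has zero width or zero height."""
    width = abs(obj["x1"] - obj["x0"])
    height = abs(obj["y1"] - obj["y0"])
    return width <= tolerance or height <= tolerance


# Hypothetical objects: a vertical line, a horizontal line, and the
# bounding box of the slanted filled shape from the dump.
objects = [
    {"x0": Decimal("10"), "x1": Decimal("10"),
     "y0": Decimal("5"), "y1": Decimal("50")},
    {"x0": Decimal("10"), "x1": Decimal("90"),
     "y0": Decimal("5"), "y1": Decimal("5")},
    {"x0": Decimal("219.785"), "x1": Decimal("336.328"),
     "y0": Decimal("127.699"), "y1": Decimal("406.652")},
]

straight = [o for o in objects if is_axis_aligned(o)]
other = [o for o in objects if not is_axis_aligned(o)]
print(len(straight), len(other))
```

Objects that fail this check, such as the slanted shape, would need to be handled separately from the straight lines.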
Sample.pdf
I also noticed that, of the 5 text strings in the attachment, pdfminer extracted only two correctly. In one case, 8524 is extracted as 8254. The degree symbol ° is extracted as (cid:176). For this reason, I am thinking of separating out the text and running OCR on it.
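For the (cid:176) case specifically, one possible post-processing step is to replace each `(cid:N)` token with the character whose code point is N. This is only a sketch and rests on a loud assumption: that the CID number happens to equal the Unicode code point, which holds for 176 (°) but is not true in general. `replace_cid_tokens` is a hypothetical helper name.

```python
import re

# Matches tokens such as "(cid:176)" and captures the number.
CID_PATTERN = re.compile(r"\(cid:(\d+)\)")


def replace_cid_tokens(text):
    """Replace each "(cid:N)" token with chr(N).

    Assumption: N is also the intended Unicode code point. This is not
    guaranteed for arbitrary fonts, so the result should be checked.
    """
    return CID_PATTERN.sub(lambda m: chr(int(m.group(1))), text)


print(replace_cid_tokens("45(cid:176)"))
```

This does nothing for the transposed digits (8524 read as 8254), which would still need another approach such as OCR.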