need info working with page.images #1217

Open
mratanusarkar opened this issue Oct 19, 2024 · 1 comment
Labels
feature-request All feature requests receive this label initially, can be upgraded to "enhancement"

Comments

@mratanusarkar

The current page.images[0] dump looks like:

{'x0': 37.4602, 'y0': 180.816, 'x1': 53.6929, 'y1': 196.9833, 'width': 16.2327, 'height': 16.16730000000001, 'stream': <PDFStream(24254): raw=213, {'BitsPerComponent': 1, 'DecodeParms': {'Quality': 65}, 'Filter': /'JBIG2Decode', 'Height': 34, 'ImageMask': True, 'Intent': /'RelativeColorimetric', 'Length': 213, 'Subtype': /'Image', 'Type': /'XObject', 'Width': 34}>, 'srcsize': (34, 34), 'imagemask': True, 'bits': 1, 'colorspace': [None], 'mcid': None, 'tag': None, 'object_type': 'image', 'page_number': 33, 'top': 647.7407000000001, 'bottom': 663.908, 'doctop': 27679.82469999998}
{'x0': 56.6807, 'y0': 471.272, 'x1': 317.3447, 'y1': 795.7760000000001, 'width': 260.664, 'height': 324.5040000000001, 'stream': <PDFStream(145): raw=47341, {'BitsPerComponent': 8, 'ColorSpace': <PDFObjRef:74050>, 'Filter': /'JPXDecode', 'Height': 676, 'Intent': /'RelativeColorimetric', 'Length': 47341, 'Subtype': /'Image', 'Type': /'XObject', 'Width': 543}>, 'srcsize': (543, 676), 'imagemask': None, 'bits': 8, 'colorspace': [[/'Separation', /'Black', /'DeviceCMYK', {'C0': [0, 0, 0, 0], 'C1': [0, 0, 0, 1], 'Domain': [0, 1], 'FunctionType': 2, 'N': 1, 'Range': [0, 1, 0, 1, 0, 1, 0, 1]}]], 'mcid': None, 'tag': None, 'object_type': 'image', 'page_number': 33, 'top': 48.94799999999998, 'bottom': 373.45200000000006, 'doctop': 27081.03199999998}

I need help working with this and extracting the image data. I would like to export it to a PNG image or use Pillow. At this point, getting hold of the images in any format would work, and I can convert and use them as desired.

Could anyone help me get access to the image data from page.images? I am trying to extract and export all images, figures, and diagrams from each page of a PDF.

#1207 helps a bit, but I am struggling with some errors and issues with that!

Some insight on this might even encourage me or someone else to write an image-handling class in pdfplumber.

Thanks!

mratanusarkar added the feature-request label Oct 19, 2024
Repository owner deleted a comment Oct 24, 2024
@jsvine
Owner

jsvine commented Nov 22, 2024

Hi @mratanusarkar — can you provide a minimal, runnable Python script and PDF that reproduces the errors you're encountering?
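
In the meantime, one approach that sidesteps decoding the JBIG2Decode and JPXDecode streams by hand is to crop the page to each image's bounding box and render that region to a PNG. The sketch below is a workaround, not a built-in image exporter: it re-rasterizes the area rather than recovering the original bytes, so anything drawn over the image is captured too. The helper names (image_bbox, native_dpi, export_page_images) and the output file naming are hypothetical. The only pdfplumber calls assumed are pdfplumber.open, page.bbox, page.images, page.crop, to_image(resolution=...) and PageImage.save.

```python
from pathlib import Path


def image_bbox(img, page_bbox):
    """Return the image's (x0, top, x1, bottom) box clamped to the page.

    Clamping matters because cropping to a box that extends past the page
    bounds can raise an error. Returns None if nothing is left after clamping.
    """
    px0, ptop, px1, pbottom = page_bbox
    x0 = max(img["x0"], px0)
    top = max(img["top"], ptop)
    x1 = min(img["x1"], px1)
    bottom = min(img["bottom"], pbottom)
    if x0 >= x1 or top >= bottom:
        return None
    return (x0, top, x1, bottom)


def native_dpi(img, fallback=150.0):
    """Estimate the DPI at which the image was placed on the page.

    'srcsize' is in pixels and 'width' is in PDF points (1/72 inch), so
    pixels / points * 72 approximates the source resolution.
    """
    width_pt = img.get("width") or 0
    src_w_px = (img.get("srcsize") or (0, 0))[0]
    if width_pt <= 0 or src_w_px <= 0:
        return fallback
    return src_w_px / width_pt * 72


def export_page_images(pdf_path, out_dir):
    """Render every entry of page.images to its own PNG file.

    Hypothetical helper: the name and the file naming scheme are
    illustrative and not part of pdfplumber.
    """
    import pdfplumber  # third-party; imported here so the helpers above stay standalone

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for i, img in enumerate(page.images):
                bbox = image_bbox(img, page.bbox)
                if bbox is None:
                    continue  # image lies entirely outside the page
                target = out_dir / f"page{page.page_number}_img{i}.png"
                rendered = page.crop(bbox).to_image(resolution=native_dpi(img))
                rendered.save(str(target))
                written.append(target)
    return written
```

Calling export_page_images("document.pdf", "images") would write one PNG per entry in page.images. For the two images in the dump above, native_dpi works out to roughly 150 in both cases. If the original bytes are needed instead, img["stream"].get_rawdata() returns the undecoded stream; whether those bytes can be opened directly depends on the filter and on the codecs available to the image library in use.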

2 participants