Is there any more docs or help available? #280

Open
markusdd opened this issue Jun 1, 2024 · 2 comments

markusdd commented Jun 1, 2024

Hi,

thanks for providing this lib, but I find myself unable to use it.

I need a small tool to crop a certain area out of a PDF page, rotate it, and save it into a new pdf.

This library seemed like a good match, but the modify example is so simple that it doesn't help me at all, and the documentation isn't very detailed either.

I'm already struggling to understand why the page_ids are a BTreeMap and how to get pages out of that.

Is there any additional help available?


markusdd commented Jun 1, 2024

What I am trying here is probably catastrophically wrong, but how is this supposed to work?
I managed to guess that the pages are somehow at the arcane entry (7,0); now I am trying to get to the MediaBox property to modify it.
When I print the arcane LinkedHashMap, all the keys are lists that look like the ASCII-encoded characters of what the keys actually should be.
I mean... this can't be the recommended way to work with this?
image

So, a simple question for an example: I have a PDF (assume a single page only) that I need to rotate 90 degrees to the right and then crop to a predefined area.


colbyn commented Jun 18, 2024

If it helps, here is my code that parses a PDF and extracts the text of every page:

use std::collections::BTreeMap;

#[derive(Debug, Default)]
pub struct PdfPages {
    pub pages: BTreeMap<usize, Page>,
    pub errors: Vec<lopdf::Error>,
}

impl PdfPages {
    pub fn open(file_path: impl AsRef<std::path::Path>) -> Result<Self, Box<dyn std::error::Error>> {
        let doc = lopdf::Document::load_filtered(file_path, filter_func)?;
        Self::extract_document_pages(doc)
    }
    fn extract_document_pages(doc: lopdf::Document) -> Result<Self, Box<dyn std::error::Error>> {
        let mut pages: Vec<Page> = Default::default();
        let mut errors: Vec<lopdf::Error> = Default::default();
        for (page_num, _) in doc.get_pages().into_iter() {
            match doc.extract_text(&[page_num]) {
                Ok(text) => {
                    pages.push(Page {
                        page_number: page_num as usize,
                        page_content: text,
                    });
                }
                Err(error) => {
                    errors.push(error);
                }
            }
        }
        let pages = pages
            .into_iter()
            .map(|page| (page.page_number, page))
            .collect::<BTreeMap<_, _>>();
        let payload = Self { pages, errors };
        Ok(payload)
    }
}

#[derive(Debug, Clone)]
pub struct Page {
    pub page_number: usize,
    pub page_content: String,
}

static IGNORE: &[&str] = &[
    "Length",
    "BBox",
    "FormType",
    "Matrix",
    "Resources",
    "Type",
    "XObject",
    "Subtype",
    "Filter",
    "ColorSpace",
    "Width",
    "Height",
    "BitsPerComponent",
    "Length1",
    "Length2",
    "Length3",
    "PTEX.FileName",
    "PTEX.PageNumber",
    "PTEX.InfoDict",
    "FontDescriptor",
    "ExtGState",
    "Font",
    "MediaBox",
    "Annot",
];

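// Passed to `load_filtered`, which calls it per object; returning `None` drops the object.
fn filter_func(object_id: (u32, u16), object: &mut lopdf::Object) -> Option<((u32, u16), lopdf::Object)> {
    // Skip whole objects whose type this tool does not need.
    if IGNORE.contains(&object.type_name().unwrap_or_default()) {
        return None;
    }
    // Strip dictionary entries this tool does not need.
    if let Ok(d) = object.as_dict_mut() {
        d.remove(b"Font");
        d.remove(b"Resources");
        d.remove(b"Producer");
        d.remove(b"ModDate");
        d.remove(b"Creator");
        d.remove(b"ProcSet");
        d.remove(b"XObject");
        d.remove(b"MediaBox");
        d.remove(b"Annots");
        // Nothing left in the dictionary, so drop the object entirely.
        if d.is_empty() {
            return None;
        }
    }
    Some((object_id, object.to_owned()))
}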

You use Rust iterators to walk most of the data structures, e.g.:

let pdf_path = std::path::PathBuf::from("tmp/MyPdf.pdf");
let pdf_pages = pdf2text::PdfPages::open(&pdf_path)?;

for page in pdf_pages.pages.values() {
    println!("Page #{}", page.page_number)
}
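
On the BTreeMap question from the top of the thread: as far as I can tell, `doc.get_pages()` (used in the code above) hands back a map from page number to object ID, and an object ID is the same `(u32, u16)` pair that `filter_func` receives. Here is a rough std-only sketch of that shape. `sample_pages` and `page_id` are hypothetical helpers, and the IDs are made-up placeholders rather than values from a real file:

```rust
use std::collections::BTreeMap;

// Same shape as the object IDs passed to `filter_func`: (object number, generation).
type ObjectId = (u32, u16);

// Hypothetical stand-in for the map of pages; the IDs are invented for illustration.
fn sample_pages() -> BTreeMap<u32, ObjectId> {
    BTreeMap::from([(1, (7, 0)), (2, (12, 0)), (3, (15, 0))])
}

// Look up the object ID of one page by its page number.
fn page_id(pages: &BTreeMap<u32, ObjectId>, page_number: u32) -> Option<ObjectId> {
    pages.get(&page_number).copied()
}

fn main() {
    let pages = sample_pages();
    // A BTreeMap iterates in key order, so pages come out in page order.
    for (page_number, (object_number, generation)) in &pages {
        println!("page {page_number} -> object ({object_number}, {generation})");
    }
    // For a single-page document, the first entry is the page you want.
    if let Some((page_number, id)) = pages.iter().next() {
        println!("first page is {page_number} with id {id:?}");
    }
}
```

So an entry like (7,0) is just the ID of the page object, and the map is there so you can go from a page number to that ID.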

Overall the API is very easy to understand, but it's very canonical Rust: you have to know the Rust standard library and the general conventions.
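
One std convention that probably explains the "lists of ASCII characters" from earlier in the thread: the dictionary keys are byte strings (note the `b"Font"` literals in `filter_func`), and a `Vec<u8>` prints as a list of numbers under `{:?}`. Here is a small std-only sketch, with a plain `BTreeMap` standing in for the PDF dictionary. That stand-in, the helper names, and the values are my assumptions, not lopdf's actual types:

```rust
use std::collections::BTreeMap;

// Hypothetical stand-in for a PDF dictionary: byte-string keys, text values.
fn sample_dict() -> BTreeMap<Vec<u8>, String> {
    let mut dict = BTreeMap::new();
    dict.insert(b"Type".to_vec(), "Page".to_string());
    dict.insert(b"MediaBox".to_vec(), "[0 0 612 792]".to_string());
    dict
}

// Turn a raw key into readable text; invalid UTF-8 is replaced, not rejected.
fn key_name(key: &[u8]) -> String {
    String::from_utf8_lossy(key).into_owned()
}

fn main() {
    let dict = sample_dict();
    for (key, value) in &dict {
        // `{:?}` on the raw key prints the bytes as numbers; decoding gives the name.
        println!("{:?} = {} -> {}", key, key_name(key), value);
    }
    // Lookups take a byte-string literal, the same way `d.remove(b"MediaBox")` does.
    let wanted: &[u8] = b"MediaBox";
    println!("{:?}", dict.get(wanted));
}
```

The keys are not garbled; they only look that way when printed with the debug formatter, and decoding them is a one-liner.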
