Is there any more docs or help available? #280

markusdd · 2024-06-01T09:19:50Z

Hi,

thanks for providing this lib, but I find myself unable to use it.

I need a small tool to crop a certain area out of a PDF page, rotate it, and save it into a new pdf.

This library seemed like a good match, but the modify example is so simple that it doesn't help me at all, and the documentation isn't very detailed either.

I'm already struggling to understand why the page_ids are a BTreeMap and how to get pages out of that.

Is there any additional help available?

markusdd · 2024-06-01T14:46:15Z

What I am trying here is probably catastrophically wrong, but how is this supposed to work?
I managed to guess that the pages are somehow at the arcane entry (7,0); now I am trying to somehow get to the MediaBox property to modify it.
When I print the arcane LinkedHashMap, all the keys are lists that look like the ASCII-encoded characters of what the keys actually should be.
I mean... this can't be the recommended way to work with this?

So, a simple question for an example: I have a PDF (assume a single page only), which I need to rotate 90 degrees to the right and then essentially crop a predefined area out of.

colbyn · 2024-06-18T20:07:57Z

If it helps, here is my code that parses a PDF and extracts all page text:

use std::collections::BTreeMap;

#[derive(Debug, Default)]
pub struct PdfPages {
    pub pages: BTreeMap<usize, Page>,
    pub errors: Vec<lopdf::Error>,
}

impl PdfPages {
    pub fn open(file_path: impl AsRef<std::path::Path>) -> Result<Self, Box<dyn std::error::Error>> {
        let doc = lopdf::Document::load_filtered(file_path, filter_func)?;
        Self::extract_document_pages(doc)
    }
    fn extract_document_pages(doc: lopdf::Document) -> Result<Self, Box<dyn std::error::Error>> {
        let mut pages: Vec<Page> = Default::default();
        let mut errors: Vec<lopdf::Error> = Default::default();
        for (page_num, _) in doc.get_pages().into_iter() {
            match doc.extract_text(&[page_num]) {
                Ok(text) => {
                    pages.push(Page {
                        page_number: page_num as usize,
                        page_content: text,
                    });
                }
                Err(error) => {
                    errors.push(error);
                }
            }
        }
        let pages = pages
            .into_iter()
            .map(|page| (page.page_number, page))
            .collect::<BTreeMap<_, _>>();
        let payload = Self { pages, errors };
        Ok(payload)
    }
}

#[derive(Debug, Clone)]
pub struct Page {
    pub page_number: usize,
    pub page_content: String,
}

static IGNORE: &[&str] = &[
    "Length",
    "BBox",
    "FormType",
    "Matrix",
    "Resources",
    "Type",
    "XObject",
    "Subtype",
    "Filter",
    "ColorSpace",
    "Width",
    "Height",
    "BitsPerComponent",
    "Length1",
    "Length2",
    "Length3",
    "PTEX.FileName",
    "PTEX.PageNumber",
    "PTEX.InfoDict",
    "FontDescriptor",
    "ExtGState",
    "Font",
    "MediaBox",
    "Annot",
];

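// Filter applied while loading: skips object types and dictionary keys
// that this text-extraction code does not use.
fn filter_func(object_id: (u32, u16), object: &mut lopdf::Object) -> Option<((u32, u16), lopdf::Object)> {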
    if IGNORE.contains(&object.type_name().unwrap_or_default()) {
        return None;
    }
    if let Ok(d) = object.as_dict_mut() {
        d.remove(b"Font");
        d.remove(b"Resources");
        d.remove(b"Producer");
        d.remove(b"ModDate");
        d.remove(b"Creator");
        d.remove(b"ProcSet");
        d.remove(b"XObject");
        d.remove(b"MediaBox");
        d.remove(b"Annots");
        if d.is_empty() {
            return None;
        }
    }
    Some((object_id, object.to_owned()))
}

You use Rust iterators to iterate over most data structures, e.g.:

let pdf_path = std::path::PathBuf::from("tmp/MyPdf.pdf");
let pdf_pages = pdf2text::PdfPages::open(&pdf_path)?;

for page in pdf_pages.pages.values() {
    println!("Page #{}", page.page_number)
}
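
The same pattern applies to the page map the original question asks about. Here is a rough std-only sketch, assuming the map has the shape `BTreeMap<u32, (u32, u16)>` (page number to object id), which is how `doc.get_pages()` is iterated in the code above. The helper names `sample_pages`, `page_id` and `ids_in_page_order` and the sample ids are made up for illustration and are not part of lopdf:

```rust
use std::collections::BTreeMap;

/// Object id in the same shape as in `filter_func`: (object number, generation).
type ObjectId = (u32, u16);

/// Hypothetical stand-in for the page map: page number -> object id.
fn sample_pages() -> BTreeMap<u32, ObjectId> {
    let mut pages = BTreeMap::new();
    // Insertion order does not matter; a BTreeMap iterates in key order.
    pages.insert(2, (12, 0));
    pages.insert(1, (7, 0));
    pages
}

/// Look up one page's object id by its page number.
fn page_id(pages: &BTreeMap<u32, ObjectId>, page_number: u32) -> Option<ObjectId> {
    pages.get(&page_number).copied()
}

/// Collect the object ids in page order.
fn ids_in_page_order(pages: &BTreeMap<u32, ObjectId>) -> Vec<ObjectId> {
    pages.values().copied().collect()
}
```

So an entry like (7,0) is just the value stored under a page number: `page_id(&pages, 1)` gives you the id to look up, and because the map is sorted by key, iterating it visits the pages in page order.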

Overall the API is very easy to understand, but it's very canonical Rust: you have to know the Rust std and general conventions and whatnot.
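
As for the keys that print as lists of numbers: dictionary keys are byte strings (note the `b"MediaBox"` literals in `filter_func`), so a debug print shows the raw bytes. Here is a small std-only sketch of how to read them, using a plain `BTreeMap<Vec<u8>, String>` as a hypothetical stand-in for the real dictionary type. The helper names and the sample values are assumptions for illustration only:

```rust
use std::collections::BTreeMap;

/// Render a byte-string key as readable text.
fn key_name(key: &[u8]) -> String {
    String::from_utf8_lossy(key).into_owned()
}

/// Hypothetical stand-in for a page dictionary: byte-string keys, text values.
fn sample_dict() -> BTreeMap<Vec<u8>, String> {
    let mut dict = BTreeMap::new();
    dict.insert(b"MediaBox".to_vec(), "[0 0 612 792]".to_string());
    dict.insert(b"Type".to_vec(), "/Page".to_string());
    dict
}

/// List the keys as strings instead of as lists of byte values.
fn readable_keys(dict: &BTreeMap<Vec<u8>, String>) -> Vec<String> {
    dict.keys().map(|k| key_name(k)).collect()
}

/// Look up a value with a byte-string literal as the key.
fn media_box(dict: &BTreeMap<Vec<u8>, String>) -> Option<&String> {
    dict.get(&b"MediaBox"[..])
}
```

Printing `b"MediaBox".to_vec()` with `{:?}` gives `[77, 101, 100, 105, 97, 66, 111, 120]`, which is the kind of output described above; passing the key through `String::from_utf8_lossy` turns it back into `MediaBox`.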
