Is there any more docs or help available? #280
If it helps, here is my code that parses a PDF and extracts all page text:

use std::collections::BTreeMap;

#[derive(Debug, Default)]
pub struct PdfPages {
    pub pages: BTreeMap<usize, Page>,
    pub errors: Vec<lopdf::Error>,
}

impl PdfPages {
    pub fn open(file_path: impl AsRef<std::path::Path>) -> Result<Self, Box<dyn std::error::Error>> {
        let doc = lopdf::Document::load_filtered(file_path, filter_func)?;
        Self::extract_document_pages(doc)
    }

    fn extract_document_pages(doc: lopdf::Document) -> Result<Self, Box<dyn std::error::Error>> {
        let mut pages: Vec<Page> = Default::default();
        let mut errors: Vec<lopdf::Error> = Default::default();
        for (page_num, _) in doc.get_pages().into_iter() {
            match doc.extract_text(&[page_num]) {
                Ok(text) => {
                    pages.push(Page {
                        page_number: page_num as usize,
                        page_content: text,
                    });
                }
                Err(error) => {
                    errors.push(error);
                }
            }
        }
        let pages = pages
            .into_iter()
            .map(|page| (page.page_number, page))
            .collect::<BTreeMap<_, _>>();
        let payload = Self { pages, errors };
        Ok(payload)
    }
}

#[derive(Debug, Clone)]
pub struct Page {
    pub page_number: usize,
    pub page_content: String,
}

static IGNORE: &[&str] = &[
    "Length",
    "BBox",
    "FormType",
    "Matrix",
    "Resources",
    "Type",
    "XObject",
    "Subtype",
    "Filter",
    "ColorSpace",
    "Width",
    "Height",
    "BitsPerComponent",
    "Length1",
    "Length2",
    "Length3",
    "PTEX.FileName",
    "PTEX.PageNumber",
    "PTEX.InfoDict",
    "FontDescriptor",
    "ExtGState",
    "Font",
    "MediaBox",
    "Annot",
];

fn filter_func(object_id: (u32, u16), object: &mut lopdf::Object) -> Option<((u32, u16), lopdf::Object)> {
    if IGNORE.contains(&object.type_name().unwrap_or_default()) {
        return None;
    }
    if let Ok(d) = object.as_dict_mut() {
        d.remove(b"Font");
        d.remove(b"Resources");
        d.remove(b"Producer");
        d.remove(b"ModDate");
        d.remove(b"Creator");
        d.remove(b"ProcSet");
        d.remove(b"XObject");
        d.remove(b"MediaBox");
        d.remove(b"Annots");
        if d.is_empty() {
            return None;
        }
    }
    Some((object_id, object.to_owned()))
}

You use Rust iterators to iterate over most data structures, e.g.:

let pdf_path = std::path::PathBuf::from("tmp/MyPdf.pdf");
let pdf_pages = pdf2text::PdfPages::open(&pdf_path)?;
for page in pdf_pages.pages.values() {
    println!("Page #{}", page.page_number);
}

Overall, the API is very easy to understand, but it's very canonical Rust: you have to know the Rust std and general conventions and whatnot.
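
On the `BTreeMap` question specifically: as far as I know, `Document::get_pages()` returns a `BTreeMap<u32, ObjectId>` that maps 1-based page numbers to page object IDs, where `ObjectId` is `(u32, u16)`. Because the map is sorted by key, iterating it always visits pages in page order. Below is a minimal std-only sketch of working with a map of that shape. `sample_pages` and its IDs are made-up stand-ins for a real document, and the helper names are mine, not part of lopdf.

```rust
use std::collections::BTreeMap;

// Same shape as lopdf's ObjectId: (object number, generation number).
type ObjectId = (u32, u16);

// Hypothetical stand-in for `doc.get_pages()`; the IDs are invented.
fn sample_pages() -> BTreeMap<u32, ObjectId> {
    let mut pages = BTreeMap::new();
    // Inserted out of order on purpose: the map keeps keys sorted.
    pages.insert(3, (31, 0));
    pages.insert(1, (12, 0));
    pages.insert(2, (27, 0));
    pages
}

// All page numbers, in ascending order.
fn page_numbers(pages: &BTreeMap<u32, ObjectId>) -> Vec<u32> {
    pages.keys().copied().collect()
}

// Object ID of a single page, or None if that page number is absent.
fn page_object_id(pages: &BTreeMap<u32, ObjectId>, page_number: u32) -> Option<ObjectId> {
    pages.get(&page_number).copied()
}

// Object IDs for an inclusive range of page numbers, in page order.
fn page_range(pages: &BTreeMap<u32, ObjectId>, first: u32, last: u32) -> Vec<ObjectId> {
    // BTreeMap::range panics when start > end, so guard against that.
    if first > last {
        return Vec::new();
    }
    pages.range(first..=last).map(|(_, id)| *id).collect()
}
```

With a real document you would pass the map from `doc.get_pages()` instead of `sample_pages()`; for example, `page_object_id(&pages, 2)` gives the ID you can then hand to the document's object lookup.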
Hi,
thanks for providing this lib, but I find myself unable to use it.
I need a small tool to crop a certain area out of a PDF page, rotate it, and save it into a new PDF.
This library seemed like a good match, but the modify example is so simple that it doesn't help me at all, and the documentation isn't very verbose either.
I'm already struggling to understand why the page_ids are a BTreeMap and how to get pages out of that.
Is there any additional help available?