extract_text inserts newlines where it shouldn't #292

Open
Heinenen opened this issue Aug 7, 2024 · 1 comment

Comments

@Heinenen
Collaborator

Heinenen commented Aug 7, 2024

Continuing from the discussion #125 (comment).

The responsible code is found at https://github.com/J-F-Liu/lopdf/blob/master/src/parser_aux.rs#L94.

Other PDF viewers handle this with heuristics; for example, see what pdf.js does in https://github.com/mozilla/pdf.js/blob/341a0b6d477d2909fcb14bcbfdf0d2fd37406cb0/src/core/evaluator.js#L2966.
The crux is: if the x- or y-coordinate changes by more than a certain threshold (which indicates a new column or a new line), a newline is inserted.
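
To make the idea concrete, here is a minimal sketch of such a threshold check. It is not a port of the pdf.js code and not how lopdf currently works; `Chunk`, `join_chunks`, and both thresholds are hypothetical names chosen for illustration, and it assumes each piece of text already comes with its position on the page.

```rust
/// A positioned piece of text. Hypothetical type for illustration only;
/// lopdf does not expose anything like this.
struct Chunk {
    x_start: f32, // x-coordinate where the text begins
    x_end: f32,   // x-coordinate where the text ends
    y: f32,       // baseline y-coordinate
    text: String,
}

impl Chunk {
    fn new(x_start: f32, x_end: f32, y: f32, text: &str) -> Self {
        Chunk { x_start, x_end, y, text: text.to_string() }
    }
}

/// Concatenate chunks in order, inserting a newline only when the position
/// jumps by more than a threshold relative to the previous chunk.
fn join_chunks(chunks: &[Chunk], x_threshold: f32, y_threshold: f32) -> String {
    let mut out = String::new();
    let mut prev: Option<&Chunk> = None;
    for chunk in chunks {
        if let Some(p) = prev {
            // Horizontal gap between the end of the previous chunk and the
            // start of this one; a large value suggests a new column.
            let x_jump = (chunk.x_start - p.x_end).abs();
            // Vertical distance between baselines; a large value suggests a new line.
            let y_jump = (chunk.y - p.y).abs();
            if x_jump > x_threshold || y_jump > y_threshold {
                out.push('\n');
            }
        }
        out.push_str(&chunk.text);
        prev = Some(chunk);
    }
    out
}
```

With this, two chunks on the same baseline that touch each other are joined directly, while a chunk that starts on a lower baseline gets a newline in front of it. Sensible threshold values would probably have to depend on the font size, which this sketch ignores.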

@Heinenen
Collaborator Author

Heinenen commented Aug 7, 2024

Another related problem: in some places, lopdf should add whitespace (a single space?) where the PDF doesn't explicitly contain one.

An example probably demonstrates this best:

use lopdf::Document;

#[test]
fn test_extract() {
    let doc = Document::load("extract_text_dkp.pdf").unwrap();
    // Extract the text of page 4 only.
    let text = doc.extract_text(&[4]).unwrap();
    println!("{}", text);
}

prints

4InhaltSeiteSozialismusvorstellungen:Sozialismus - die historische Alternativezum Kapitalismus5Als Arbeits- und Diskussionsgrundlagebeschlossene Abänderungs- oderErgänzungsanträge und beschlosseneAnträge13

This is what the page the text is extracted from looks like:
(image: screenshot of the page)

(Example PDF: extract_text_dkp.pdf, taken from #217 (comment))
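
A minimal sketch of the "single space" idea, assuming the extractor already knows where one text fragment ends and the next begins (for example at a line or cell boundary). `join_with_spaces` is a hypothetical helper, not part of lopdf, and the fragment boundaries in the usage note below are my guess based on the output above.

```rust
/// Join text fragments, inserting exactly one space at each boundary where
/// neither side already has whitespace. Hypothetical helper for illustration.
fn join_with_spaces(fragments: &[&str]) -> String {
    let mut out = String::new();
    for fragment in fragments {
        // Skip empty fragments so they cannot produce extra separators.
        if fragment.is_empty() {
            continue;
        }
        let needs_space = !out.is_empty()
            && !out.ends_with(char::is_whitespace)
            && !fragment.starts_with(char::is_whitespace);
        if needs_space {
            out.push(' ');
        }
        out.push_str(fragment);
    }
    out
}
```

Called with `&["Sozialismus - die historische Alternative", "zum Kapitalismus"]`, this yields "Sozialismus - die historische Alternative zum Kapitalismus" instead of "Alternativezum". It does nothing about words hyphenated across lines, which would need separate handling.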
