`extract_text` inserts newlines where it shouldn't #292

Heinenen · 2024-08-07T10:09:30Z

Continuing the discussion from #125 (comment).

The responsible code is found at https://github.com/J-F-Liu/lopdf/blob/master/src/parser_aux.rs#L94.

Other PDF viewers handle this with heuristics; for example, see what pdf.js does in https://github.com/mozilla/pdf.js/blob/341a0b6d477d2909fcb14bcbfdf0d2fd37406cb0/src/core/evaluator.js#L2966.
The crux of it: if the x- or y-coordinate changes by more than a certain threshold (which indicates a new column or a new line), a newline is inserted.
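
To make the idea concrete, here is a rough sketch of such a threshold check in Rust. It is not what pdf.js or lopdf actually do: the `Fragment` type, the `join_fragments` function, and both threshold parameters are made up for illustration, and it assumes the position of each text fragment has already been computed from the content stream.

```rust
/// Hypothetical positioned text fragment (not a lopdf type).
#[derive(Debug, Clone)]
struct Fragment {
    text: String,
    /// Horizontal position where the fragment starts.
    x_start: f32,
    /// Horizontal position where the fragment ends.
    x_end: f32,
    /// Baseline position of the fragment.
    y: f32,
}

/// Concatenates fragments and inserts a newline only when the position
/// jumps by more than the given thresholds relative to the previous fragment.
fn join_fragments(fragments: &[Fragment], x_threshold: f32, y_threshold: f32) -> String {
    let mut out = String::new();
    let mut prev: Option<&Fragment> = None;

    for frag in fragments {
        if let Some(p) = prev {
            // Vertical distance between the two baselines.
            let dy = (frag.y - p.y).abs();
            // Horizontal distance between the end of the previous
            // fragment and the start of the current one.
            let dx = (frag.x_start - p.x_end).abs();

            // A large vertical jump suggests a new line,
            // a large horizontal jump suggests a new column.
            if dy > y_threshold || dx > x_threshold {
                out.push('\n');
            }
        }
        out.push_str(&frag.text);
        prev = Some(frag);
    }
    out
}
```

With something like this, two fragments that sit next to each other on the same baseline stay on one line, and a newline only appears once the position really moves. Picking sensible thresholds (probably relative to the font size) is the hard part.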

Heinenen · 2024-08-07T22:22:50Z

Another related problem: In some places, lopdf should add whitespace (a single space?) where the PDF doesn't specifically have one.

An example probably demonstrates this best:

use lopdf::Document;

#[test]
fn test_extract() {
    // Load the example PDF attached below.
    let doc = Document::load("extract_text_dkp.pdf").unwrap();
    // Extract the text of page 4 only.
    let text = doc.extract_text(&[4]).unwrap();
    println!("{}", text);
}

prints

4InhaltSeiteSozialismusvorstellungen:Sozialismus - die historische Alternativezum Kapitalismus5Als Arbeits- und Diskussionsgrundlagebeschlossene Abänderungs- oderErgänzungsanträge und beschlosseneAnträge13

This is what the page that the text is extracted from looks like:

(Example PDF: extract_text_dkp.pdf, taken from #217 (comment))
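
A possible way to approach the missing whitespace, sketched in Rust under heavy assumptions: the `Run` type, `join_with_spaces`, and the `gap_factor` parameter are hypothetical and not part of lopdf, and the positions and font size of each run are assumed to be known already. The idea is to insert a single space whenever two consecutive runs are on different lines or are separated by a horizontal gap that is large relative to the font size, unless a space is already there.

```rust
/// Hypothetical positioned text run (not a lopdf type).
struct Run {
    text: String,
    x_start: f32,
    x_end: f32,
    y: f32,
    font_size: f32,
}

/// Concatenates runs, inserting a single space between runs that are
/// visually separated even though the PDF contains no space character.
fn join_with_spaces(runs: &[Run], gap_factor: f32) -> String {
    let mut out = String::new();
    let mut prev: Option<&Run> = None;

    for run in runs {
        if let Some(p) = prev {
            // Treat runs as being on the same line if their baselines
            // differ by less than half the previous run's font size.
            let same_line = (run.y - p.y).abs() < 0.5 * p.font_size;
            // Horizontal gap between the previous run and this one.
            let gap = run.x_start - p.x_end;

            let needs_separator = !same_line || gap > gap_factor * p.font_size;
            // Do not double up if either side already has whitespace.
            let already_spaced = out.ends_with(char::is_whitespace)
                || run.text.starts_with(char::is_whitespace);

            if needs_separator && !already_spaced {
                out.push(' ');
            }
        }
        out.push_str(&run.text);
        prev = Some(run);
    }
    out
}
```

Whether a line change should produce a space or a newline is a separate decision; this sketch always uses a space to keep the two problems apart.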
