Another related problem: In some places, lopdf should add whitespace (a single space?) where the PDF doesn't specifically have one.
An example probably demonstrates this best:
#[test]
fn test_extract() {
    let doc = Document::load("extract_text_dkp.pdf").unwrap();
    let text = doc.extract_text(&[4]).unwrap();
    println!("{}", text);
}
prints
4InhaltSeiteSozialismusvorstellungen:Sozialismus - die historische Alternativezum Kapitalismus5Als Arbeits- und Diskussionsgrundlagebeschlossene Abänderungs- oderErgänzungsanträge und beschlosseneAnträge13
This is what the page that the text is extracted from looks like:
Continuing from the discussion #125 (comment).
The responsible code is found at https://github.com/J-F-Liu/lopdf/blob/master/src/parser_aux.rs#L94.
Other PDF viewers handle this with heuristics; we can see what pdf.js does in https://github.com/mozilla/pdf.js/blob/341a0b6d477d2909fcb14bcbfdf0d2fd37406cb0/src/core/evaluator.js#L2966.
The crux is: if the x- or y-coordinate changes by more than a certain threshold (which indicates a new column or a new line), a newline is inserted.
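To make the idea concrete, here is a rough sketch of such a threshold heuristic. It is not lopdf's or pdf.js's actual code: the `Fragment` struct, the `join_fragments` function, and both threshold parameters are hypothetical names made up for illustration, and it assumes the position and advance width of each text run are already known.

```rust
/// Hypothetical positioned text run; this is NOT a lopdf type.
#[derive(Debug, Clone)]
struct Fragment {
    x: f32,     // left edge of the run
    y: f32,     // baseline of the run
    width: f32, // advance width of the run
    text: String,
}

/// Joins runs in content-stream order, inserting a newline whenever the
/// baseline moves by more than `y_threshold` or the horizontal gap to the
/// end of the previous run exceeds `x_threshold` (both are assumed values).
fn join_fragments(fragments: &[Fragment], x_threshold: f32, y_threshold: f32) -> String {
    let mut out = String::new();
    let mut prev: Option<&Fragment> = None;
    for frag in fragments {
        if let Some(p) = prev {
            // Vertical distance between the two baselines.
            let dy = (frag.y - p.y).abs();
            // Horizontal distance from the end of the previous run.
            let gap = (frag.x - (p.x + p.width)).abs();
            if dy > y_threshold || gap > x_threshold {
                out.push('\n');
            }
        }
        out.push_str(&frag.text);
        prev = Some(frag);
    }
    out
}
```

With runs laid out like the table of contents above, "Inhalt" and "Seite" sit on the same baseline but far apart, so they would be split, while two runs that touch on the same line would stay joined. A real implementation would likely derive the thresholds from the font size rather than take fixed numbers.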