Encoding issue on default backend DoclingParseV2DocumentBackend for PDF #663
Comments
@Seigneurhol I tested this code, and it seems to work for me with docling 2.14 and Python 3.11.
You don't have any problems with accents or special characters?
Yes, you are right. On some documents it works fine, but on others there are encoding issues that don't happen with PyPdfiumDocumentBackend.
@Seigneurhol Can you provide the PDF that gives you problems? I am trying to fix all font-related issues.
@PeterStaar-IBM I get the same error as @zvictor on words with French accents. I can't provide that document for legal reasons. Do you want another one with the same issues, or is @zvictor's document enough?
@Seigneurhol The more debug examples the better. Working now on fixing as many font issues as possible ;)
I can't find another PDF that produces the same encoding errors, but I will post one when I find it :)
Yes, please do!
Bug
When I use the default parser (DoclingParseV2DocumentBackend) to parse a PDF, I get encoding issues such as "ao\u00fbt, facturation \u00e0". The same conversion works fine with PyPdfiumDocumentBackend.
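For readers unsure what the reported string should look like, here is a minimal sketch that decodes it. It assumes the output contains literal backslash-u escape sequences exactly as quoted above, which the report does not confirm. The helper name `decode_escapes` is made up for this illustration and is not part of docling.

```python
import json


def decode_escapes(text: str) -> str:
    """Interpret literal \\uXXXX sequences as the characters they name.

    Assumption: `text` contains no double quotes or other characters
    that would need escaping inside a JSON string literal.
    """
    return json.loads('"' + text + '"')


# The raw string keeps the backslashes literal, as in the report.
reported = r"ao\u00fbt, facturation \u00e0"
print(decode_escapes(reported))
```

If the decoded text is what the PDF actually says, the characters themselves were extracted correctly and only their representation is escaped.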
Steps to reproduce
Use the default DocumentConverter without specifying a backend.
Then read a PDF and convert it to markdown.
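The steps above could be scripted roughly as follows, converting the same PDF with the default backend and with PyPdfiumDocumentBackend and listing the lines that differ. This is a sketch, not the reporter's actual code: the helper names `convert_to_markdown` and `differing_lines` are hypothetical, the PDF path comes from the command line, and the docling imports assume the 2.x module layout.

```python
import difflib
import sys


def convert_to_markdown(pdf_path, use_pypdfium=False):
    """Convert a PDF to Markdown, optionally forcing the pypdfium backend."""
    # Local imports so differing_lines() can be used without docling installed.
    from docling.document_converter import DocumentConverter

    if use_pypdfium:
        from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
        from docling.datamodel.base_models import InputFormat
        from docling.document_converter import PdfFormatOption

        converter = DocumentConverter(
            format_options={
                InputFormat.PDF: PdfFormatOption(backend=PyPdfiumDocumentBackend)
            }
        )
    else:
        # No backend specified: docling picks its default PDF backend.
        converter = DocumentConverter()
    return converter.convert(pdf_path).document.export_to_markdown()


def differing_lines(first, second):
    """Return only the lines that differ, prefixed '- ' (first) or '+ ' (second)."""
    diff = difflib.ndiff(first.splitlines(), second.splitlines())
    return [line for line in diff if line.startswith(("- ", "+ "))]


if __name__ == "__main__" and len(sys.argv) > 1:
    default_md = convert_to_markdown(sys.argv[1])
    pypdfium_md = convert_to_markdown(sys.argv[1], use_pypdfium=True)
    for line in differing_lines(default_md, pypdfium_md):
        print(line)
```

Attaching the differing lines to the issue may help the maintainers even when the PDF itself cannot be shared.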
Docling version
Docling version: 2.14.0
Python version
Python 3.12.3