Encoding issue on default backend DoclingParseV2DocumentBackend for PDF #663

Seigneurhol · 2024-12-30T15:26:33Z

Bug

When I use the default parser (DoclingParseV2DocumentBackend) to parse a PDF, I get an encoding issue: "ao\u00fbt, facturation \u00e0". It works fine with PyPdfiumDocumentBackend.
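One thing that may be worth ruling out first: the sample string above is exactly what correctly decoded text ("août, facturation à") looks like after ASCII-only JSON serialization. The report does not say how the string was captured, so this is only an assumption; the sketch below uses nothing but the standard library and does not touch docling.

```python
import json

# The text as it should read once decoded correctly.
text = "août, facturation à"

# json.dumps escapes non-ASCII characters by default (ensure_ascii=True).
escaped = json.dumps(text)
print(escaped)

# Decoding the escaped form gives back the original characters,
# so nothing is lost at this stage.
assert json.loads(escaped) == text

# With ensure_ascii=False the accents are written as-is.
readable = json.dumps(text, ensure_ascii=False)
print(readable)
```

If the escapes only show up in a JSON response or log line, the parser output itself may be fine; if they show up in the exported Markdown string, the problem lies elsewhere.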

Steps to reproduce

Use the default DocumentConverter without specifying a backend.

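    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import (
        AcceleratorDevice,
        AcceleratorOptions,
        EasyOcrOptions,
        PdfPipelineOptions,
    )
    from docling.document_converter import DocumentConverter, PdfFormatOption

    pipeline_options = PdfPipelineOptions()
    pipeline_options.generate_picture_images = True
    pipeline_options.do_ocr = True
    pipeline_options.ocr_options = EasyOcrOptions()
    pipeline_options.do_table_structure = True
    pipeline_options.table_structure_options.do_cell_matching = True
    pipeline_options.ocr_options.lang = ["fr", "en"]
    pipeline_options.accelerator_options = AcceleratorOptions(
        num_threads=4, device=AcceleratorDevice.AUTO
    )

    # No backend is passed, so the default PDF backend is used
    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_options=pipeline_options,
            )
        }
    )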

Then read a PDF and convert it to markdown.

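    import io

    from docling.datamodel.base_models import DocumentStream

    # `content` (the PDF bytes) and `filename` come from the calling code, not shown here
    doc_stream = io.BytesIO(content)
    input_name = filename if filename else "document.pdf"

    # Convert document
    result = converter.convert(
        DocumentStream(name=input_name, stream=doc_stream)
    )
    markdown = result.document.export_to_markdown()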

Docling version

Docling version: 2.14.0

Python version

Python 3.12.3

trinanjan12 · 2024-12-30T16:51:16Z

@Seigneurhol I tested this code, and it seems to work for me with docling 2.14 and Python 3.11:

    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import (
        AcceleratorDevice,
        AcceleratorOptions,
        EasyOcrOptions,
        PdfPipelineOptions,
    )
    from docling.document_converter import DocumentConverter, PdfFormatOption

    pipeline_options = PdfPipelineOptions()
    pipeline_options.generate_picture_images = True
    pipeline_options.do_ocr = True
    pipeline_options.ocr_options = EasyOcrOptions()
    pipeline_options.do_table_structure = True
    # pipeline_options.table_structure_options.do_cell_matching = True
    pipeline_options.ocr_options.lang = ["fr", "en"]
    pipeline_options.accelerator_options = AcceleratorOptions(
        num_threads=4, device=AcceleratorDevice.AUTO
    )

    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )
    x = converter.convert(source="./tests/data/2305.03393v1-pg9.pdf")

    x.document

Seigneurhol · 2024-12-30T16:54:31Z

You don't have any problems with accents or special characters?

trinanjan12 · 2024-12-30T16:57:43Z

Seigneurhol · 2025-01-02T10:54:00Z

Yes, you are right. On some documents it works fine, but on others there are encoding issues that don't occur with PyPdfiumDocumentBackend.
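To check whether a given document is affected, the same file can be converted once with the default backend and once with PyPdfiumDocumentBackend. The sketch below shows one way to switch backends through the `backend` argument of `PdfFormatOption`; the helper name `build_converter` and its `use_pypdfium` flag are made up for illustration, and the pipeline options are left at their defaults rather than copied from the report.

```python
def build_converter(use_pypdfium: bool = True):
    """Return a DocumentConverter, optionally forcing the pypdfium2 PDF backend."""
    # docling is imported lazily so that this module still loads on machines
    # where docling is not installed; calling the function requires docling.
    from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    option_kwargs = {"pipeline_options": PdfPipelineOptions()}
    if use_pypdfium:
        # Without this key, PdfFormatOption falls back to the default backend.
        option_kwargs["backend"] = PyPdfiumDocumentBackend

    return DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(**option_kwargs)}
    )


# Example usage (hypothetical file name):
# default_md = build_converter(use_pypdfium=False).convert(
#     source="sample.pdf").document.export_to_markdown()
# pdfium_md = build_converter(use_pypdfium=True).convert(
#     source="sample.pdf").document.export_to_markdown()
```

Comparing `default_md` with `pdfium_md` for the same file shows whether the garbled characters come from the default backend.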

PeterStaar-IBM · 2025-01-11T13:58:27Z

@Seigneurhol Can you provide the PDF that gives you problems? I am trying to fix all font-related issues.

zvictor · 2025-01-11T20:47:17Z

I get encoding issues on a simple command:

    docling --from pdf --to md --image-export-mode placeholder \
        https://venda-imoveis.caixa.gov.br/editais/EL00820224CPARE.PDF

That produces artifacts such as Leilªo Pœblico and Alienaçªo FiduciÆria instead of Leilão Público
and Alienação Fiduciária.
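The substitutions in these artifacts can be listed mechanically, which may help when collecting debug examples. The following is a minimal sketch using only the standard library: it aligns the expected and observed text word by word and records which characters were swapped. The function names are made up for illustration, and the two sample strings are simply the four words quoted above.

```python
import unicodedata
from difflib import SequenceMatcher


def differing_words(reference: str, candidate: str) -> list[tuple[str, str]]:
    """Pair up words that differ between two renderings of the same text."""
    ref_words = unicodedata.normalize("NFC", reference).split()
    cand_words = unicodedata.normalize("NFC", candidate).split()
    matcher = SequenceMatcher(a=ref_words, b=cand_words, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Only pair words when both sides replaced the same number of words.
        if tag == "replace" and i2 - i1 == j2 - j1:
            pairs.extend(zip(ref_words[i1:i2], cand_words[j1:j2]))
    return pairs


def char_substitutions(pairs: list[tuple[str, str]]) -> dict[str, str]:
    """Map each expected character to the character that replaced it."""
    mapping = {}
    for good, bad in pairs:
        if len(good) != len(bad):
            continue  # skip words where characters were added or dropped
        for g, b in zip(good, bad):
            if g != b:
                mapping[g] = b
    return mapping


expected = "Leilão Público Alienação Fiduciária"
observed = "Leilªo Pœblico Alienaçªo FiduciÆria"

pairs = differing_words(expected, observed)
subs = char_substitutions(pairs)
print(subs)
```

For this sample the mapping contains ã, ú and á on the expected side and ª, œ and Æ on the observed side. The same two functions could be run on the Markdown exported by two different backends for one PDF.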

Seigneurhol · 2025-01-14T09:22:28Z

@PeterStaar-IBM I get the same error as @zvictor on words with French accents. I can't provide my document for legal reasons. Do you want another one with the same issues, or is @zvictor's document enough?

PeterStaar-IBM · 2025-01-14T12:29:07Z

@Seigneurhol The more debug examples the better. Working now on fixing as many font issues as possible ;)

Seigneurhol · 2025-01-15T15:31:20Z

I can't find another PDF that produces the same encoding errors, but I will post one when I find it :)

PeterStaar-IBM · 2025-01-16T07:57:15Z

Yes, please do!

Seigneurhol added the bug label Dec 30, 2024

cau-git added the PDF parsing label Jan 6, 2025

cau-git assigned PeterStaar-IBM Jan 6, 2025
