Support for Vertical Forms in Docling #676

jnm-ronquillo · 2025-01-05T00:43:54Z

Requested Feature

I'd like Docling to properly parse a vertical form (field labels placed above their values) in a PDF file. For example, given the following code:

from docling.document_converter import DocumentConverter

source = "https://drive.usercontent.google.com/download?id=1kdT8UWLNnJ5XGiXoyyNCImWzoLucBNbq&export=download&authuser=0&confirm=t&uuid=b7768e5f-3e28-49d4-bd80-2f5856c398a1&at=APvzH3qJRJ11Y_63usKNHntbOMKO:1736036888144"
converter = DocumentConverter()
result = converter.convert(source)
print(result.document.export_to_markdown())

The output is as follows:

Name

Last Name

Age

Mario

Bros

30

Occupation Plumber

Address

Marital Status Single

Minecraft 123

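This output makes it difficult for an LLM to answer questions about the structured data, because the labels end up separated from their values.
Another example, this time with OCR and table structure recognition enabled: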

from pathlib import Path

from docling.backend.docling_parse_backend import DoclingParseDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    EasyOcrOptions,
    OcrMacOptions,
    PdfPipelineOptions,
    RapidOcrOptions,
    TesseractCliOcrOptions,
    TesseractOcrOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption

def main():
    input_doc = Path("./NameLastnameAge.pdf")

    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True
    pipeline_options.do_table_structure = True
    pipeline_options.table_structure_options.do_cell_matching = True

    # Any of the OCR options can be used: EasyOcrOptions, TesseractOcrOptions,
    # TesseractCliOcrOptions, OcrMacOptions (macOS only), RapidOcrOptions
    # ocr_options = EasyOcrOptions(force_full_page_ocr=True)
    # ocr_options = TesseractOcrOptions(force_full_page_ocr=True)
    ocr_options = OcrMacOptions(force_full_page_ocr=True)
    # ocr_options = RapidOcrOptions(force_full_page_ocr=True)
    # ocr_options = TesseractCliOcrOptions(force_full_page_ocr=True)
    pipeline_options.ocr_options = ocr_options

    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_options=pipeline_options,
            )
        }
    )

    doc = converter.convert(input_doc).document
    md = doc.export_to_markdown()
    print(md)

if __name__ == "__main__":
    main()

The output:

| Name       | Last Name     | Age            |
|------------|---------------|----------------|
| Mario      | Bros          | 30             |
| Occupation | Address       | Marital Status |
| Plumber    | Minecraft 123 | Single         |

This is not quite the desired output either, because the PDF content is a form, not a table. I have attached the example PDF used:
NameLastnameAge.pdf
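
To make the desired result more concrete, here is a minimal sketch of the kind of label/value pairing I have in mind. It is not a Docling API: `form_rows_to_pairs` is a hypothetical helper that post-processes the Markdown table shown above, and it assumes the rows strictly alternate between a row of labels and a row of values.

```python
def form_rows_to_pairs(markdown_table: str) -> dict[str, str]:
    """Pair each label row with the value row directly beneath it.

    Hypothetical post-processing helper, not part of Docling. Assumes the
    table rows alternate: labels, values, labels, values, ...
    """
    rows = []
    for line in markdown_table.strip().splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Skip the |---|---| separator row that Markdown puts under the header.
        if all(c and set(c) <= set("-:") for c in cells):
            continue
        rows.append(cells)

    pairs: dict[str, str] = {}
    # Even-indexed rows hold labels, odd-indexed rows hold the matching values.
    for labels, values in zip(rows[0::2], rows[1::2]):
        pairs.update(zip(labels, values))
    return pairs


table = """
| Name       | Last Name     | Age            |
|------------|---------------|----------------|
| Mario      | Bros          | 30             |
| Occupation | Address       | Marital Status |
| Plumber    | Minecraft 123 | Single         |
"""

pairs = form_rows_to_pairs(table)
for label, value in pairs.items():
    print(f"{label}: {value}")
```

Something like "Name: Mario", "Last Name: Bros", and so on would be much easier for an LLM to work with than either output above. A workaround like this only helps when the table structure model happens to detect the form as a table, which is why I think native support would be better.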

jnm-ronquillo added the enhancement New feature or request label Jan 5, 2025