Support for Vertical Forms in Docling #676

Open
jnm-ronquillo opened this issue Jan 5, 2025 · 0 comments
Labels
enhancement New feature or request

Requested Feature

I'd like Docling to properly parse a vertical form in a PDF file. For example, take the following code:

from docling.document_converter import DocumentConverter

source = "https://drive.usercontent.google.com/download?id=1kdT8UWLNnJ5XGiXoyyNCImWzoLucBNbq&export=download&authuser=0&confirm=t&uuid=b7768e5f-3e28-49d4-bd80-2f5856c398a1&at=APvzH3qJRJ11Y_63usKNHntbOMKO:1736036888144"
converter = DocumentConverter()
result = converter.convert(source)
print(result.document.export_to_markdown())

The output is as follows:

Name

Last Name

Age

Mario

Bros

30

Occupation Plumber

Address

Marital Status Single

Minecraft 123

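This output makes it difficult for an LLM to answer questions about the structured data, because most labels are separated from their values.
Another example, this time with OCR and table structure recognition enabled: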

from pathlib import Path

from docling.backend.docling_parse_backend import DoclingParseDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    EasyOcrOptions,
    OcrMacOptions,
    PdfPipelineOptions,
    RapidOcrOptions,
    TesseractCliOcrOptions,
    TesseractOcrOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption

def main():
    input_doc = Path("./NameLastnameAge.pdf")

    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True
    pipeline_options.do_table_structure = True
    pipeline_options.table_structure_options.do_cell_matching = True

    # Any of the OCR options can be used: EasyOcrOptions, TesseractOcrOptions,
    # TesseractCliOcrOptions, OcrMacOptions (Mac only), RapidOcrOptions
    # ocr_options = EasyOcrOptions(force_full_page_ocr=True)
    # ocr_options = TesseractOcrOptions(force_full_page_ocr=True)
    ocr_options = OcrMacOptions(force_full_page_ocr=True)
    # ocr_options = RapidOcrOptions(force_full_page_ocr=True)
    # ocr_options = TesseractCliOcrOptions(force_full_page_ocr=True)
    pipeline_options.ocr_options = ocr_options

    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_options=pipeline_options,
            )
        }
    )

    doc = converter.convert(input_doc).document
    md = doc.export_to_markdown()
    print(md)

if __name__ == "__main__":
    main()

The output:

| Name       | Last Name     | Age            |
|------------|---------------|----------------|
| Mario      | Bros          | 30             |
| Occupation | Address       | Marital Status |
| Plumber    | Minecraft 123 | Single         |

This is not quite the desired output either, because the PDF content is a form, not a table. I have attached the example PDF used:
NameLastnameAge.pdf
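
Until something like this is supported natively, a possible stopgap is to post-process the table-shaped Markdown. The sketch below is not part of Docling. It assumes the exported table alternates a row of labels with a row of values, as in the output above, and the helper names `parse_markdown_table` and `pair_label_value_rows` are hypothetical. A form with a different layout would need a different pairing rule.

```python
# Sample input: the Markdown table printed by the second example above.
TABLE_MD = """
| Name       | Last Name     | Age            |
|------------|---------------|----------------|
| Mario      | Bros          | 30             |
| Occupation | Address       | Marital Status |
| Plumber    | Minecraft 123 | Single         |
"""


def parse_markdown_table(md: str) -> list[list[str]]:
    """Return the table's rows as lists of cell strings, skipping the separator row."""
    rows = []
    for line in md.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # ignore anything that is not a table row
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # The header separator row contains only dashes (and optional colons).
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue
        rows.append(cells)
    return rows


def pair_label_value_rows(rows: list[list[str]]) -> dict[str, str]:
    """Pair each even-indexed row (labels) with the row directly below it (values)."""
    fields = {}
    for labels, values in zip(rows[0::2], rows[1::2]):
        for label, value in zip(labels, values):
            fields[label] = value
    return fields


fields = pair_label_value_rows(parse_markdown_table(TABLE_MD))
for label, value in fields.items():
    print(f"{label}: {value}")
```

This prints one `label: value` line per form field, which is closer to what an LLM needs to answer questions about the form. It only works because the label and value rows happen to alternate here, so it is a workaround for this one document and not a general solution.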

@jnm-ronquillo jnm-ronquillo added the enhancement New feature or request label Jan 5, 2025