I'd like Docling to properly parse a vertical form in a PDF file. For example, with the following code:
from docling.document_converter import DocumentConverter
source = "https://drive.usercontent.google.com/download?id=1kdT8UWLNnJ5XGiXoyyNCImWzoLucBNbq&export=download&authuser=0&confirm=t&uuid=b7768e5f-3e28-49d4-bd80-2f5856c398a1&at=APvzH3qJRJ11Y_63usKNHntbOMKO:1736036888144"
converter = DocumentConverter()
result = converter.convert(source)
print(result.document.export_to_markdown())
The output is as follows:
Name
Last Name
Age
Mario
Bros
30
Occupation Plumber
Address
Marital Status Single
Minecraft 123
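This is difficult for an LLM to use when answering questions about structured data, because the labels are not reliably paired with their values (for example, "Address" ends up separated from "Minecraft 123").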
Another example:
from pathlib import Path
from docling.backend.docling_parse_backend import DoclingParseDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    EasyOcrOptions,
    OcrMacOptions,
    PdfPipelineOptions,
    RapidOcrOptions,
    TesseractCliOcrOptions,
    TesseractOcrOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption


def main():
    input_doc = Path("./NameLastnameAge.pdf")

    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True
    pipeline_options.do_table_structure = True
    pipeline_options.table_structure_options.do_cell_matching = True

    # Any of the OCR options can be used: EasyOcrOptions, TesseractOcrOptions,
    # TesseractCliOcrOptions, OcrMacOptions (Mac only), RapidOcrOptions
    # ocr_options = EasyOcrOptions(force_full_page_ocr=True)
    # ocr_options = TesseractOcrOptions(force_full_page_ocr=True)
    ocr_options = OcrMacOptions(force_full_page_ocr=True)
    # ocr_options = RapidOcrOptions(force_full_page_ocr=True)
    # ocr_options = TesseractCliOcrOptions(force_full_page_ocr=True)
    pipeline_options.ocr_options = ocr_options

    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_options=pipeline_options,
            )
        }
    )

    doc = converter.convert(input_doc).document
    md = doc.export_to_markdown()
    print(md)


if __name__ == "__main__":
    main()
The output:
| Name | Last Name | Age |
|------------|---------------|----------------|
| Mario | Bros | 30 |
| Occupation | Address | Marital Status |
| Plumber | Minecraft 123 | Single |
This is not exactly the desired output either, because the PDF content is a form, not a table. I have attached the example PDF used: NameLastnameAge.pdf
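As a stopgap, the table output from the second example could be post-processed into label/value pairs. The following is only a minimal sketch of that idea, assuming the rows strictly alternate between a row of labels and a row of values, as they happen to in this sample. `pair_label_value_rows` is a hypothetical helper written for illustration and is not part of Docling.

```python
def pair_label_value_rows(markdown_table: str) -> dict[str, str]:
    """Pair alternating label/value rows of a Markdown table into a dict.

    Assumption: rows 0, 2, 4, ... contain labels and rows 1, 3, 5, ...
    contain the matching values, column by column.
    """
    rows = []
    for line in markdown_table.strip().splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        # Skip the |---|---| separator row that follows the first row.
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue
        rows.append(cells)

    fields: dict[str, str] = {}
    for labels, values in zip(rows[0::2], rows[1::2]):
        fields.update(zip(labels, values))
    return fields


# The Markdown table printed by the second example above.
md = """
| Name       | Last Name     | Age            |
|------------|---------------|----------------|
| Mario      | Bros          | 30             |
| Occupation | Address       | Marital Status |
| Plumber    | Minecraft 123 | Single         |
"""

for label, value in pair_label_value_rows(md).items():
    print(f"{label}: {value}")
```

This only works when the layout is known in advance, which is why native support for forms in Docling would still be preferable.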