Docling wrongly parses a PDF that contains only one table. Here is the PDF (in Vietnamese): mountain_table.pdf
...
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TesseractCliOcrOptions, TesseractOcrOptions
from docling.backend.docling_parse_v2_backend import DoclingParseV2DocumentBackend
from docling.backend.docling_parse_backend import DoclingParseDocumentBackend
import markdown2

source = "mountain_table.pdf"

# OCR is disabled; the table structure model runs in "accurate" mode.
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = False
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.mode = "accurate"
pipeline_options.table_structure_options.do_cell_matching = True
pipeline_options.ocr_options = TesseractOcrOptions(lang=["vie"])

# Try the v2 parse backend first and fall back to the v1 backend on any error.
try:
    dl_doc = DocumentConverter(format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options,  # pipeline options go here.
            backend=DoclingParseV2DocumentBackend
        )}).convert(source).document
    print('1')
except Exception as e:
    dl_doc = DocumentConverter(format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options,  # pipeline options go here.
            backend=DoclingParseDocumentBackend
        )}).convert(source).document
    print('2')

# Export to Markdown, then render the Markdown table as HTML for inspection.
text = dl_doc.export_to_markdown()
html = markdown2.markdown(text, extras=["tables"])
with open("mountain_table.html", 'w', encoding="utf-8") as f:
    f.write(html)
Here is the result:
It is wrong at the cell "Mount Radenor", and I do not know how to fix this case.
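To pin down exactly which row and column the faulty cell lands in, the Markdown export can be parsed back into rows and searched. This is only a minimal sketch, assuming `export_to_markdown()` emits a standard pipe table with no escaped `|` characters inside cells. `parse_markdown_table` and `find_cells` are hypothetical helpers written for this issue, not part of docling.

```python
from typing import List, Tuple


def parse_markdown_table(markdown: str) -> List[List[str]]:
    """Return the cells of every pipe-table row, skipping the |---| separator row."""
    rows: List[List[str]] = []
    for line in markdown.splitlines():
        line = line.strip()
        if not (line.startswith("|") and line.endswith("|")):
            continue  # not a table row
        # Assumption: cells contain no escaped pipes, so a plain split is enough.
        cells = [cell.strip() for cell in line[1:-1].split("|")]
        # A separator row consists only of dashes and colons.
        if all(cell and set(cell) <= set("-: ") for cell in cells):
            continue
        rows.append(cells)
    return rows


def find_cells(rows: List[List[str]], needle: str) -> List[Tuple[int, int, str]]:
    """Return (row index, column index, cell text) for every cell containing needle."""
    return [
        (r, c, cell)
        for r, row in enumerate(rows)
        for c, cell in enumerate(row)
        if needle in cell
    ]
```

With the `text` variable from the script above, `find_cells(parse_markdown_table(text), "Mount Radenor")` would list where that string ended up. A cell holding more text than the PDF shows would suggest that rows were merged.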
Docling version: docling==2.7.0, docling-parse==2.1.2
Python version: python 3.10
We're also facing the same issue: in some tables in our PDF, some of the rows are getting merged like yours, leading to incorrect parsing.
@gauravmindzk We will fix those in this PR: DS4SD/docling-parse#82
@PeterStaar-IBM all the best :)