Landscape pages are not read #683

Open
mohamed99akram opened this issue Jan 6, 2025 · 4 comments
Assignees
Labels
bug Something isn't working

Comments

@mohamed99akram

Bug

I have a 39-page document: 29 pages are in portrait orientation and the other 10 are in landscape. The text itself is normal (upright, not rotated); only the page orientation differs. Docling doesn't read the landscape pages. All pages contain tables, and on the landscape pages the tables are not read correctly either. On portrait pages, however, tables are read fine.

Steps to reproduce

Take a PDF file that mixes page orientations (some pages portrait, some landscape), then convert the PDF to Markdown.
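The conversion step can be sketched as follows. This is a minimal sketch that assumes the standard `DocumentConverter` entry point from the Docling README; the function name `convert_to_markdown` is a hypothetical wrapper, not part of Docling, and the exact script used for this report is not shown in the issue.

```python
def convert_to_markdown(pdf_path: str) -> str:
    """Convert a PDF to Markdown with Docling's default pipeline."""
    # Imported inside the function so this module still loads
    # on machines where docling is not installed.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    result = converter.convert(pdf_path)
    # The landscape pages are the ones reported as missing from this output.
    return result.document.export_to_markdown()
```

Calling `convert_to_markdown(...)` on a mixed-orientation PDF and comparing the output page by page against the source should show whether the landscape pages are dropped.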

Docling version

2.8.3

Python version

3.10.14

@mohamed99akram mohamed99akram added the bug Something isn't working label Jan 6, 2025
@nikos-livathinos nikos-livathinos self-assigned this Jan 6, 2025
@nikos-livathinos
Collaborator

@mohamed99akram could you please provide a sample document to reproduce the issue?

@JeandeBalzac

Hi Nikos,
I have the same problem.
Here is an example of a landscape PDF. Some pages work fine, others do not work at all.
Marketing.pdf

@cau-git
Contributor

cau-git commented Jan 13, 2025

After a closer look, @JeandeBalzac, your issue does not appear to be connected to page orientation. Many elements are simply identified as figures, and these are exported as bitmap resources in the Markdown / HTML output. The text elements contained in figures are present in the JSON representation of the DoclingDocument but are not exported to the other formats by default.
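For anyone who needs that text today, it could be pulled from the JSON export directly. The sketch below is an illustration, not an official Docling API: it assumes the exported dictionary has a top-level `texts` list whose entries carry a `text` string and a `parent` reference of the form `{"$ref": "#/pictures/N"}`. Please verify this layout against your own JSON output; the helper name `texts_inside_pictures` is hypothetical.

```python
from typing import Any, Dict, List


def texts_inside_pictures(doc_dict: Dict[str, Any]) -> List[str]:
    """Collect text strings whose parent reference points at a picture item.

    Assumed (unverified) layout: doc_dict["texts"] is a list of dicts,
    each with "text" and "parent": {"$ref": "#/pictures/<index>"}.
    """
    found: List[str] = []
    for item in doc_dict.get("texts", []):
        parent_ref = (item.get("parent") or {}).get("$ref", "")
        if parent_ref.startswith("#/pictures/"):
            found.append(item.get("text", ""))
    return found
```

The input would be the parsed JSON export of the converted document, for example loaded with `json.load`.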

@JeandeBalzac

JeandeBalzac commented Jan 18, 2025

Hi. Yes, we are aware that the pages are included as images. However, our goal is to extract text, not images, so this is still a bug for us.
I can also provide another landscape PDF whose output is quite badly garbled. We analyzed the problem: on landscape pages, x and y are swapped, and x behaves differently. It increases from right to left rather than from left to right as on portrait pages. Moreover, the top-left point is no longer the top-left, and the same is true for the bottom-right point.
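The symptoms described above (swapped axes, a reversed x direction, corners that no longer match their names) are what one would expect if boxes are reported in a rotated page frame. The sketch below illustrates that idea only; it is not Docling's implementation. It assumes a top-left origin, a clockwise display rotation in multiples of 90 degrees, and hypothetical names `rotate_point` and `rotate_bbox`.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]  # (left, top, right, bottom)


def rotate_point(x: float, y: float, width: float, height: float,
                 rotation: int) -> Tuple[float, float]:
    """Map a point from the stored page frame to the displayed frame.

    width/height are the stored (unrotated) page size; rotation is the
    clockwise display rotation in degrees. Origin is top-left in both frames.
    """
    if rotation == 0:
        return x, y
    if rotation == 90:
        return height - y, x
    if rotation == 180:
        return width - x, height - y
    if rotation == 270:
        return y, width - x
    raise ValueError("rotation must be 0, 90, 180 or 270")


def rotate_bbox(bbox: BBox, width: float, height: float,
                rotation: int) -> BBox:
    """Rotate a box and re-normalize it so (left, top) is top-left again."""
    left, top, right, bottom = bbox
    x0, y0 = rotate_point(left, top, width, height, rotation)
    x1, y1 = rotate_point(right, bottom, width, height, rotation)
    # After rotation the old top-left corner may be any corner of the box,
    # so the corners are rebuilt from the minima and maxima.
    return min(x0, x1), min(y0, y1), max(x0, x1), max(y0, y1)
```

Without the final min/max step, the first corner of a rotated box is no longer its top-left corner, which matches the behaviour reported here.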
