HTML reader not working correct when figure is included #684

Open
JeandeBalzac opened this issue Jan 6, 2025 · 1 comment
Labels
bug Something isn't working

Comments

@JeandeBalzac

Bug

I have the following HTML (excerpt of a longer HTML page):

Networks

The natural flow of the text in a document is very often reflected in a sequence of PDF printing commands. This inspired us to look at recurrent neural networks and, in par-

Trulli

Fig.1 - Trulli, Puglia, Italy.

Sketch of the network architecture for a generic page model. In each model, we use the sequence of features of each text cell as input. The ordering of the cells is obtained after sorting them according to reading order using the toposort algorithm. The output of the network yields a label classification for each cell. Each model with which we have experimented contains at least the encoder part. The embedding and decoder parts are optional components.

ticular, networks used commonly in NLP. Their unreasonable effectiveness (see (Sejnowski 2020)) in domains such as named entity recognition (Lample et al. 2016), machine translation (Klein et al. 2017) and chemistry (Schwaller et al. 2018) clearly demonstrates that they can capture the underlying structure and trends in (noisy) sequences of data. In the case of traditional NLP, the signals are often the embeddings of each character or word in a string. In chemistry applications, the signals are the embeddings of the characters in the SMILES representation of a chemical compound.

The result:

self_ref='#/texts/30' parent=RefItem(cref='#/groups/1') children=[RefItem(cref='#/texts/31'), RefItem(cref='#/pictures/0'), RefItem(cref='#/texts/33'), RefItem(cref='#/texts/34'), RefItem(cref='#/texts/35')] label=<DocItemLabel.SECTION_HEADER: 'section_header'> prov=[] orig='Networks' text='Networks' level=4

self_ref='#/texts/31' parent=RefItem(cref='#/texts/30') children=[] label=<DocItemLabel.PARAGRAPH: 'paragraph'> prov=[] orig='The natural flow of the text in a document is very often reflected in a sequence of PDF printing commands. This inspired us to look at recurrent neural networks and, in par-' text='The natural flow of the text in a document is very often reflected in a sequence of PDF printing commands. This inspired us to look at recurrent neural networks and, in par-'

self_ref='#/pictures/0' parent=RefItem(cref='#/texts/30') children=[] label=<DocItemLabel.PICTURE: 'picture'> prov=[] captions=[RefItem(cref='#/texts/32')] references=[] footnotes=[] image=None annotations=[]

self_ref='#/texts/33' parent=RefItem(cref='#/texts/30') children=[] label=<DocItemLabel.PARAGRAPH: 'paragraph'> prov=[] orig='Sketch of the network architecture for a generic page model. In each model, we use the sequence of features of each text cell as input. The ordering of the cells is obtained after sorting them according to reading order using the toposort algorithm. The output of the network yields a label classification for each cell. Each model with which we have experimented contains at least the encoder part. The embedding and decoder parts are optional components.' text='Sketch of the network architecture for a generic page model. In each model, we use the sequence of features of each text cell as input. The ordering of the cells is obtained after sorting them according to reading order using the toposort algorithm. The output of the network yields a label classification for each cell. Each model with which we have experimented contains at least the encoder part. The embedding and decoder parts are optional components.'

The picture exists in the docling document, and the picture element even references a caption (#/texts/32). However, the caption item itself does not appear among the children of the section, as you can see in the output above.
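To make the gap explicit, here is a minimal sketch that compares the caption reference on the picture with the children of the section header. It uses plain Python lists copied from the output above, not docling objects, and the helper name captions_outside_parent is hypothetical.

```python
# Reference strings copied from the dump above; plain data, not docling objects.
section_children = [
    "#/texts/31",
    "#/pictures/0",
    "#/texts/33",
    "#/texts/34",
    "#/texts/35",
]
picture_captions = {"#/pictures/0": ["#/texts/32"]}


def captions_outside_parent(children, captions_by_picture):
    """Return caption refs whose picture is a child but which are not children themselves."""
    present = set(children)
    missing = []
    for picture_ref, caption_refs in captions_by_picture.items():
        if picture_ref not in present:
            continue  # picture belongs to another parent; nothing to compare
        missing.extend(ref for ref in caption_refs if ref not in present)
    return missing


missing_captions = captions_outside_parent(section_children, picture_captions)
print(missing_captions)  # ['#/texts/32']
```

The caption is referenced by the picture but is not a sibling of it, which matches what the output shows.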

Steps to reproduce

Convert the HTML above with the docling HTML reader and you will see the same result.

Docling version

docling 2.14.0
docling-core 2.12.1
docling-ibm-models 3.1.0
docling-parse 3.0.0

Python version

Python 3.11
...

@JeandeBalzac JeandeBalzac added the bug Something isn't working label Jan 6, 2025
@JeandeBalzac JeandeBalzac changed the title HTML reader not working correct when figure is provided HTML reader not working correct when figure is included Jan 6, 2025
@JeandeBalzac

I looked at this problem more carefully.
The figure caption is added, but only at the end of the docling structure.
So you can close the bug. It may still be worth adjusting the reader so that the order is correct.
Best
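
As a rough illustration of the ordering that would be expected, the sketch below moves each caption reference directly after its picture in a flat reading-order list. This is plain Python on reference strings, not a docling API; the function name and the observed_order list are assumptions made up for the example (only the fact that the caption ends up last comes from the comment above).

```python
def move_captions_after_pictures(order, captions_by_picture):
    """Return a new list in which each caption ref directly follows its picture ref."""
    present = set(order)
    # Only relocate captions whose picture actually appears in the list.
    caption_refs = {
        ref
        for picture_ref, refs in captions_by_picture.items()
        if picture_ref in present
        for ref in refs
    }
    reordered = []
    for ref in order:
        if ref in caption_refs:
            continue  # re-inserted next to its picture below
        reordered.append(ref)
        reordered.extend(captions_by_picture.get(ref, []))
    return reordered


# Hypothetical order: the caption (#/texts/32) sits at the end of the structure.
observed_order = ["#/texts/31", "#/pictures/0", "#/texts/33", "#/texts/32"]
fixed_order = move_captions_after_pictures(
    observed_order, {"#/pictures/0": ["#/texts/32"]}
)
print(fixed_order)
```

With this input the caption reference lands between the picture and the following paragraph, which is the order a reader of the HTML would expect.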
