HTML reader not working correct when figure is included #684

JeandeBalzac · 2025-01-06T15:45:37Z

Bug

I have the following HTML (an excerpt of a longer HTML page):

Networks

The natural flow of the text in a document is very often reflected in a sequence of PDF printing commands. This inspired us to look at recurrent neural networks and, in par-

Fig.1 - Trulli, Puglia, Italy.

Sketch of the network architecture for a generic page model. In each model, we use the sequence of features of each text cell as input. The ordering of the cells is obtained after sorting them according to reading order using the toposort algorithm. The output of the network yields a label classification for each cell. Each model with which we have experimented contains at least the encoder part. The embedding and decoder parts are optional components.

ticular, networks used commonly in NLP. Their unreasonable effectiveness (see (Sejnowski 2020)) in domains such as named entity recognition (Lample et al. 2016), machine translation (Klein et al. 2017) and chemistry (Schwaller et al. 2018) clearly demonstrates that they can capture the underlying structure and trends in (noisy) sequences of data. In the case of traditional NLP, the signals are often the embeddings of each character or word in a string. In chemistry applications, the signals are the embeddings of the characters in the SMILES representation of a chemical compound.

The result:

self_ref='#/texts/30' parent=RefItem(cref='#/groups/1') children=[RefItem(cref='#/texts/31'), RefItem(cref='#/pictures/0'), RefItem(cref='#/texts/33'), RefItem(cref='#/texts/34'), RefItem(cref='#/texts/35')] label=<DocItemLabel.SECTION_HEADER: 'section_header'> prov=[] orig='Networks' text='Networks' level=4

self_ref='#/texts/31' parent=RefItem(cref='#/texts/30') children=[] label=<DocItemLabel.PARAGRAPH: 'paragraph'> prov=[] orig='The natural flow of the text in a document is very often reflected in a sequence of PDF printing commands. This inspired us to look at recurrent neural networks and, in par-' text='The natural flow of the text in a document is very often reflected in a sequence of PDF printing commands. This inspired us to look at recurrent neural networks and, in par-'

self_ref='#/pictures/0' parent=RefItem(cref='#/texts/30') children=[] label=<DocItemLabel.PICTURE: 'picture'> prov=[] captions=[RefItem(cref='#/texts/32')] references=[] footnotes=[] image=None annotations=[]

self_ref='#/texts/33' parent=RefItem(cref='#/texts/30') children=[] label=<DocItemLabel.PARAGRAPH: 'paragraph'> prov=[] orig='Sketch of the network architecture for a generic page model. In each model, we use the sequence of features of each text cell as input. The ordering of the cells is obtained after sorting them according to reading order using the toposort algorithm. The output of the network yields a label classification for each cell. Each model with which we have experimented contains at least the encoder part. The embedding and decoder parts are optional components.' text='Sketch of the network architecture for a generic page model. In each model, we use the sequence of features of each text cell as input. The ordering of the cells is obtained after sorting them according to reading order using the toposort algorithm. The output of the network yields a label classification for each cell. Each model with which we have experimented contains at least the encoder part. The embedding and decoder parts are optional components.'

The picture exists in the docling output, and the picture element even references a caption (#/texts/32). However, the caption item itself does not appear among the section's children, as you can see.
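To make the gap concrete, here is a minimal sketch that checks the refs from the output above. It uses plain Python lists and dicts copied from that output, not docling objects, and the helper name `orphaned_captions` is made up for illustration.

```python
# Refs copied by hand from the printed items above (plain data, not docling objects).
section_children = [
    "#/texts/31",
    "#/pictures/0",
    "#/texts/33",
    "#/texts/34",
    "#/texts/35",
]
picture_captions = {"#/pictures/0": ["#/texts/32"]}


def orphaned_captions(children, captions_by_picture):
    """Return caption refs that a picture points to but that are not its siblings.

    Hypothetical helper for this report; it is not part of docling.
    """
    present = set(children)
    return [
        ref
        for picture, refs in captions_by_picture.items()
        if picture in present
        for ref in refs
        if ref not in present
    ]


missing = orphaned_captions(section_children, picture_captions)
print(missing)  # ['#/texts/32']
```

The picture points at #/texts/32, but that ref is not listed under #/texts/30, which is why the caption seems to be missing when walking the section.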

Steps to reproduce

Convert the HTML above with the HTML reader and you will see the same result.

Docling version

docling 2.14.0
docling-core 2.12.1
docling-ibm-models 3.1.0
docling-parse 3.0.0

Python version

Python 3.11
...

JeandeBalzac · 2025-01-06T18:56:48Z

I looked at this problem more carefully.
The figure caption is added, but only at the end of the docling structure rather than next to the picture.
So you can close the bug. Maybe you could adjust it so that the order is correct.
Best
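As a possible workaround on the consumer side, the reading order could be rebuilt so that each caption follows its picture. The sketch below works on plain ref strings, not on docling objects. The function name `with_captions_after_pictures` is hypothetical, and the exact position of #/texts/32 in `flat_order` is an assumption based on "at the end of the docling structure".

```python
def with_captions_after_pictures(order, captions_by_picture):
    """Return a new order in which every caption directly follows its picture.

    Hypothetical helper, not docling API. Captions whose picture is not in
    `order` are left where they are.
    """
    in_order = set(order)
    # Only captions that belong to a picture present in `order` get moved.
    moved = {
        ref
        for picture, refs in captions_by_picture.items()
        if picture in in_order
        for ref in refs
    }
    result = []
    for ref in order:
        if ref in moved:
            continue  # emitted right after its picture instead
        result.append(ref)
        result.extend(captions_by_picture.get(ref, []))
    return result


# Assumed flat order: the caption (#/texts/32) trails the other items.
flat_order = [
    "#/texts/31",
    "#/pictures/0",
    "#/texts/33",
    "#/texts/34",
    "#/texts/35",
    "#/texts/32",
]
captions = {"#/pictures/0": ["#/texts/32"]}

fixed_order = with_captions_after_pictures(flat_order, captions)
print(fixed_order)
```

This only reorders a list of refs for display or export; it does not change how the reader builds the document tree.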

JeandeBalzac added the bug (Something isn't working) label Jan 6, 2025

JeandeBalzac changed the title ~~HTML reader not working correct when figure is provided~~ HTML reader not working correct when figure is included Jan 6, 2025
