JeandeBalzac changed the title from "HTML reader not working correct when figure is provided" to "HTML reader not working correct when figure is included" on Jan 6, 2025
Looked at this problem more carefully.
The figure caption is added, but only at the end of the docling structure.
So you can close the bug. Maybe you can adjust the reader so that the order is correct.
Best
Bug
I have the following HTML (an excerpt of a longer HTML page):
Networks
The natural flow of the text in a document is very often reflected in a sequence of PDF printing commands. This inspired us to look at recurrent neural networks and, in par-
Fig.1 - Trulli, Puglia, Italy.
Sketch of the network architecture for a generic page model. In each model, we use the sequence of features of each text cell as input. The ordering of the cells is obtained after sorting them according to reading order using the toposort algorithm. The output of the network yields a label classification for each cell. Each model with which we have experimented contains at least the encoder part. The embedding and decoder parts are optional components.
ticular, networks used commonly in NLP. Their unreasonable effectiveness (see (Sejnowski 2020)) in domains such as named entity recognition (Lample et al. 2016), machine translation (Klein et al. 2017) and chemistry (Schwaller et al. 2018) clearly demonstrates that they can capture the underlying structure and trends in (noisy) sequences of data. In the case of traditional NLP, the signals are often the embeddings of each character or word in a string. In chemistry applications, the signals are the embeddings of the characters in the SMILES representation of a chemical compound.
The result:
self_ref='#/texts/30' parent=RefItem(cref='#/groups/1') children=[RefItem(cref='#/texts/31'), RefItem(cref='#/pictures/0'), RefItem(cref='#/texts/33'), RefItem(cref='#/texts/34'), RefItem(cref='#/texts/35')] label=<DocItemLabel.SECTION_HEADER: 'section_header'> prov=[] orig='Networks' text='Networks' level=4
self_ref='#/texts/31' parent=RefItem(cref='#/texts/30') children=[] label=<DocItemLabel.PARAGRAPH: 'paragraph'> prov=[] orig='The natural flow of the text in a document is very often reflected in a sequence of PDF printing commands. This inspired us to look at recurrent neural networks and, in par-' text='The natural flow of the text in a document is very often reflected in a sequence of PDF printing commands. This inspired us to look at recurrent neural networks and, in par-'
self_ref='#/pictures/0' parent=RefItem(cref='#/texts/30') children=[] label=<DocItemLabel.PICTURE: 'picture'> prov=[] captions=[RefItem(cref='#/texts/32')] references=[] footnotes=[] image=None annotations=[]
self_ref='#/texts/33' parent=RefItem(cref='#/texts/30') children=[] label=<DocItemLabel.PARAGRAPH: 'paragraph'> prov=[] orig='Sketch of the network architecture for a generic page model. In each model, we use the sequence of features of each text cell as input. The ordering of the cells is obtained after sorting them according to reading order using the toposort algorithm. The output of the network yields a label classification for each cell. Each model with which we have experimented contains at least the encoder part. The embedding and decoder parts are optional components.' text='Sketch of the network architecture for a generic page model. In each model, we use the sequence of features of each text cell as input. The ordering of the cells is obtained after sorting them according to reading order using the toposort algorithm. The output of the network yields a label classification for each cell. Each model with which we have experimented contains at least the encoder part. The embedding and decoder parts are optional components.'
The picture exists in docling, and the picture element even references a caption. However, the caption itself is missing from the output, as you can see.
Steps to reproduce
Use the HTML reader on the snippet above and you will see the same result.
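To illustrate the expected behavior independently of docling, here is a minimal stand-alone sketch (standard library only; the class name and labels are made up for this example, not docling's API) of how an HTML reader can emit a figcaption at the position of its figure, so the caption keeps its place in the reading order instead of being appended at the end of the document structure:

```python
from html.parser import HTMLParser

class FigureAwareReader(HTMLParser):
    """Collects (label, text) items in document order, including captions."""

    def __init__(self):
        super().__init__()
        self.items = []   # (label, text) pairs in document order
        self._stack = []  # currently open tags we care about

    def handle_starttag(self, tag, attrs):
        if tag in ("p", "h1", "h2", "h3", "h4", "figure", "figcaption"):
            self._stack.append(tag)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text or not self._stack:
            return
        tag = self._stack[-1]
        if tag == "figcaption":
            # Emit the caption where the figure occurs in the document,
            # not after all remaining text.
            self.items.append(("caption", text))
        elif tag.startswith("h"):
            self.items.append(("section_header", text))
        else:
            self.items.append(("paragraph", text))

html = """
<h4>Networks</h4>
<p>The natural flow of the text ...</p>
<figure>
  <img src="trulli.jpg">
  <figcaption>Fig.1 - Trulli, Puglia, Italy.</figcaption>
</figure>
<p>ticular, networks used commonly in NLP. ...</p>
"""

reader = FigureAwareReader()
reader.feed(html)
for label, text in reader.items:
    print(label, "->", text)
```

With this ordering, the caption item appears between the two paragraphs, right where the figure sits in the source, which is what one would expect the HTML reader to produce for the snippet above.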
Docling version
docling 2.14.0
docling-core 2.12.1
docling-ibm-models 3.1.0
docling-parse 3.0.0
Python version
Python 3.11
...