We now have a custom cleanup function that filters characters by Unicode General Category. A character such as U+E157 makes no sense in document text. To reiterate, what we request is a way to control which characters end up in Docling text nodes.
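A cleanup step of the kind described above might look roughly like the following. This is only a sketch, not the actual function referred to in the comment: the name `strip_unwanted` and the set of dropped categories are assumptions chosen for illustration. It relies on `unicodedata.category` from the standard library, which reports `Co` (private use) for U+E157 and `So` (other symbol) for U+2630.

```python
import unicodedata

# Assumption: categories considered meaningless in document text.
# Co = private use, Cn = unassigned, Cs = surrogate.
# Cc (control) is deliberately left out so that newlines and tabs survive.
DROPPED_CATEGORIES = frozenset({"Co", "Cn", "Cs"})


def strip_unwanted(text: str, dropped: frozenset = DROPPED_CATEGORIES) -> str:
    """Return text without characters whose General Category is in `dropped`."""
    return "".join(
        ch for ch in text if unicodedata.category(ch) not in dropped
    )
```

Applied to the heading text from this issue, `strip_unwanted("Contents\ue157")` would drop the trailing private-use character, while a string containing U+2630 would pass through unchanged. Such a function would have to be run over the text after parsing, which is the post-processing step the request hopes to avoid.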
Requested feature
Upstream tools, e.g., OCR libraries, sometimes output incorrectly encoded HTML. Because of visual similarity, an undesired and incorrect character such as U+E157 (https://www.compart.com/en/unicode/U+E157) may be encoded instead of the intended U+2630 (https://www.compart.com/en/unicode/U+2630). Currently, when Docling parses an HTML document containing such a character, it (or rather, BeautifulSoup) escapes the character. For example, this heading item:
ends up with the .text value: 'Contents\ue157'
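It may be worth checking whether the value really contains an escape sequence. Python's `repr()` renders non-printable code points, including private-use ones, as `\uXXXX`, so the quoted value could simply be the display form of a string that holds the single character U+E157. A minimal sketch, assuming the value is an ordinary `str` equal to the one quoted above:

```python
import unicodedata

# Assumption: this literal stands in for the .text value quoted in the issue.
text = "Contents\ue157"

length = len(text)                            # "Contents" plus one code point
last_category = unicodedata.category(text[-1])
last_printable = text[-1].isprintable()

print(length)          # 9
print(last_category)   # Co (private use)
print(last_printable)  # False, which is why repr() shows an escape
print(repr(text))
```

If the length is 9 rather than 14, the parser has preserved the character as-is, and the task reduces to filtering or replacing it rather than un-escaping it.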
I have not found a straightforward way to control this behavior from within Docling or BeautifulSoup.
Alternatives
I have not found a robust and direct method to process these escapes from within Python. String substitution tricks are possible but at a performance cost.
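As one possible form of the substitution approach, a translation table built once with `str.maketrans` and applied with `str.translate` handles all replacements in a single pass over the string. The mapping below is hypothetical and contains only the pair mentioned in this issue; the function name `fix_lookalikes` is likewise an assumption, and the cost has not been measured here.

```python
# Hypothetical mapping from mis-encoded code points to the intended ones.
# Only the pair from this issue is listed; extend as needed.
REPLACEMENTS = str.maketrans({
    "\ue157": "\u2630",  # private-use lookalike -> U+2630
})


def fix_lookalikes(text: str) -> str:
    """Replace known mis-encoded characters using one translate() pass."""
    return text.translate(REPLACEMENTS)
```

With this table, `fix_lookalikes("Contents\ue157")` would yield the heading text ending in U+2630. It still runs after parsing, so it does not replace the requested control over which characters reach Docling text nodes.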