
Control HTML document Unicode decoding #682

Open
sanmai-NL opened this issue Jan 6, 2025 · 3 comments
Labels
enhancement New feature or request

Comments

@sanmai-NL
Contributor

Requested feature

Sometimes upstream tools, e.g. OCR libraries, output incorrectly encoded HTML. Because of visual similarity, an undesired and incorrect character such as the Private Use Area code point https://www.compart.com/en/unicode/U+E157 is encoded instead of https://www.compart.com/en/unicode/U+2630. Currently, when Docling parses an HTML document containing such a character, it (or rather, BeautifulSoup) passes these characters through unchanged. For example, this heading item:

```html
<h2 id="contents">Contents<a class="headerlink" href="#contents" title="Permanent link"></a></h2>
```

ends up with the `.text` value:

```python
'Contents\ue157'
```

I have not found a straightforward way to control this behavior from within Docling or BeautifulSoup.

Alternatives

I have not found a robust and direct method to handle these characters from within Python. String substitution tricks are possible, but at a performance cost.
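For illustration, one such substitution trick is a `str.translate` table that remaps known mis-encoded code points after parsing. This is only a sketch: the single U+E157 → U+2630 entry is this issue's example, and a real table would have to be curated per upstream OCR tool.

```python
# Hypothetical fix-up table: map mis-encoded Private Use Area
# code points to the characters the OCR tool presumably intended.
# Only the example from this issue is included.
FIXUPS = str.maketrans({0xE157: "\u2630"})


def remap_pua(text: str) -> str:
    """Replace known mis-encoded PUA characters with their intended ones."""
    return text.translate(FIXUPS)


print(repr(remap_pua("Contents\ue157")))  # 'Contents\u2630'
```

`str.translate` runs in C and scales linearly with the text length, so it is cheaper than chained `str.replace` calls, but it still means a full extra pass over every extracted string.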

@sanmai-NL sanmai-NL added the enhancement New feature or request label Jan 6, 2025
@cau-git
Contributor

cau-git commented Jan 7, 2025

@sanmai-NL I am not entirely sure what your request is. The escaped Unicode in the string representation will actually print as a symbol, as in:

```python
>>> s = 'Contents\ue157'
>>> print(s)
Contents
```

How it prints depends on the interpreter.

@sanmai-NL sanmai-NL changed the title Control how HTML document Unicode decoding Control HTML document Unicode decoding Jan 7, 2025
@sanmai-NL
Contributor Author

It's a character we don't want. It's a data quality issue.

@sanmai-NL
Contributor Author

We now have a custom cleanup function that filters characters based on their Unicode General Category. This character makes no sense in document text. To reiterate: what we request is a way to control which characters end up in Docling text nodes.
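A minimal sketch of such a General-Category filter, assuming the goal is to drop private-use ("Co"), surrogate ("Cs"), and unassigned ("Cn") code points while keeping everything else; this is an illustration, not the actual cleanup function mentioned above.

```python
import unicodedata

# Unicode General Categories that make no sense in document text:
# Co = private use, Cs = surrogate, Cn = unassigned.
DISALLOWED_CATEGORIES = {"Co", "Cs", "Cn"}


def clean_text(text: str) -> str:
    """Drop characters whose General Category is in DISALLOWED_CATEGORIES."""
    return "".join(
        ch for ch in text
        if unicodedata.category(ch) not in DISALLOWED_CATEGORIES
    )


print(repr(clean_text("Contents\ue157")))  # 'Contents'
```

U+E157 falls in the Private Use Area, so `unicodedata.category('\ue157')` returns `"Co"` and the character is stripped; ordinary letters, digits, punctuation, and whitespace pass through untouched.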
