
Control HTML document Unicode decoding #682

Open
sanmai-NL opened this issue Jan 6, 2025 · 3 comments
Labels
enhancement New feature or request

Comments

@sanmai-NL
Contributor

Requested feature

Sometimes upstream tools, e.g. OCR libraries, output incorrectly encoded HTML. Because of visual similarity, an undesired and incorrect character such as the Private Use Area code point https://www.compart.com/en/unicode/U+E157 is encoded instead of https://www.compart.com/en/unicode/U+2630. Currently, when Docling parses an HTML document containing such a character, it (or rather, BeautifulSoup) passes these characters through unchanged. For example, this heading item:

```html
<h2 id="contents">Contents<a class="headerlink" href="#contents" title="Permanent link"></a></h2>
```

ends up with the `.text` value:

```python
'Contents\ue157'
```

I have not found a straightforward way to control this behavior from within Docling or BeautifulSoup.

Alternatives

I have not found a robust and direct method to handle these characters from within Python. String substitution tricks are possible, but at a performance cost.
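For illustration, one such substitution trick is a `str.translate` table that remaps known mis-encoded code points after parsing. This is only a sketch: the single U+E157 → U+2630 entry is this issue's example, and a real table would have to be curated per upstream OCR tool.

```python
# Hypothetical fix-up table: map mis-encoded Private Use Area
# code points to the characters the OCR tool presumably intended.
# Only the example from this issue is included.
FIXUPS = str.maketrans({0xE157: "\u2630"})


def remap_pua(text: str) -> str:
    """Replace known mis-encoded PUA characters with their intended ones."""
    return text.translate(FIXUPS)


print(repr(remap_pua("Contents\ue157")))  # 'Contents\u2630'
```

`str.translate` runs in C and scales linearly with the text length, so it is cheaper than chained `str.replace` calls, but it still means a full extra pass over every extracted string.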

@sanmai-NL sanmai-NL added the enhancement New feature or request label Jan 6, 2025
@cau-git
Contributor

cau-git commented Jan 7, 2025

@sanmai-NL I am not entirely sure what your request is. The escaped Unicode in the string representation will actually print as a symbol, as in:

```python
>>> s = 'Contents\ue157'
>>> print(s)
Contents
```

How it prints depends on the interpreter.

@sanmai-NL sanmai-NL changed the title Control how HTML document Unicode decoding Control HTML document Unicode decoding Jan 7, 2025
@sanmai-NL
Contributor Author

It's a character we don't want. It's a data quality issue.

@sanmai-NL
Contributor Author

We now have a custom cleanup function that filters characters based on their Unicode General Category. This character makes no sense in document text. To reiterate: what we request is a way to control which characters end up in Docling text nodes.
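A minimal sketch of such a General-Category filter, assuming the goal is to drop private-use ("Co"), surrogate ("Cs"), and unassigned ("Cn") code points while keeping everything else; this is an illustration, not the actual cleanup function mentioned above.

```python
import unicodedata

# Unicode General Categories that make no sense in document text:
# Co = private use, Cs = surrogate, Cn = unassigned.
DISALLOWED_CATEGORIES = {"Co", "Cs", "Cn"}


def clean_text(text: str) -> str:
    """Drop characters whose General Category is in DISALLOWED_CATEGORIES."""
    return "".join(
        ch for ch in text
        if unicodedata.category(ch) not in DISALLOWED_CATEGORIES
    )


print(repr(clean_text("Contents\ue157")))  # 'Contents'
```

U+E157 falls in the Private Use Area, so `unicodedata.category('\ue157')` returns `"Co"` and the character is stripped; ordinary letters, digits, punctuation, and whitespace pass through untouched.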
