- Support deeper trees by using iteration instead of recursion.
- Fixed HTML comment and processing instruction handling.
- Use lxml-html-clean instead of lxml[html_clean] in setup.py, to avoid jazzband/pip-tools#2004.
- Moved the Git repository to https://github.com/zytedata/html-text.
- Added official support for Python 3.9-3.12.
- Removed support for Python 2.7 and 3.5-3.7.
- Switched the lxml dependency to lxml[html_clean] to support lxml >= 5.2.0.
- Switch from Travis CI to GitHub Actions.
- CI improvements.
- Handle lxml Cleaner exceptions (a workaround for https://bugs.launchpad.net/lxml/+bug/1838497 );
- Python 3.8 support;
- testing improvements.
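The first entry in the list above (iteration instead of recursion) can be illustrated with a minimal sketch. It is not html-text's actual implementation: it uses the standard library's xml.etree.ElementTree rather than lxml, and the names iter_text_chunks and build_deep_tree are hypothetical. It only shows why an explicit stack lets a traversal go deeper than Python's recursion limit.

```python
import xml.etree.ElementTree as ET


def iter_text_chunks(root):
    """Yield text fragments in document order using an explicit stack."""
    # Stack items are either elements still to visit, or plain strings
    # (element tails) that must be emitted after that element's subtree.
    stack = [root]
    while stack:
        item = stack.pop()
        if isinstance(item, str):
            yield item
            continue
        if item.text:
            yield item.text
        # Push the tail first so it is popped only after all children.
        if item.tail:
            stack.append(item.tail)
        # Push children in reverse so the first child is popped first.
        stack.extend(reversed(list(item)))


def build_deep_tree(depth):
    """Build a chain of nested <div> elements (hypothetical test helper)."""
    root = ET.Element("div")
    node = root
    for _ in range(depth):
        node = ET.SubElement(node, "div")
    node.text = "deep"
    return root


sample = ET.fromstring("<p>a<b>b</b>c<i>d</i>e</p>")
print("".join(iter_text_chunks(sample)))  # abcde

# 5000 nested levels is well past the default recursion limit of 1000,
# but the loop above never grows the Python call stack.
print(list(iter_text_chunks(build_deep_tree(5000))))  # ['deep']
```

A recursive version of the same traversal would raise RecursionError on the second tree unless the recursion limit were raised.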
Fixed whitespace handling when guess_punct_space is False: html-text was producing unnecessary spaces after newlines.
Parsel dependency is removed in this release, though parsel is still supported.
- parsel package is no longer required to install and use html-text;
- html_text.etree_to_text function allows extracting text from lxml Elements;
- html_text.cleaner is an lxml.html.clean.Cleaner instance with options tuned for text extraction speed and quality;
- test and documentation improvements;
- Python 3.7 support.
Fixed a regression in the 0.4.0 release: text was empty when html_text.extract_text was called with a node that has text but no children.
This is a backwards-incompatible release: by default, html_text functions now add newlines after elements where appropriate, so that the extracted text looks more like how it is rendered in a browser.
To turn it off, pass the guess_layout=False option to html_text functions.
- guess_layout option to make extracted text look more like how it is rendered in a browser.
- Add tests of layout extraction for real webpages.
- Expose functions that operate on selectors, use .//text() to extract text from selector.
- Packaging fix (include CHANGES.rst)
- Fix unwanted joins of words with inline tags: spaces are added for inline tags too, but a heuristic is used to preserve punctuation without extra spaces.
- Accept parsed html trees.
- Travis-CI and codecov.io integrations added
- First release on PyPI.