XPath3.1: mimic handling of multiple root element nodes #2351
requirements.txt
@@ -55,7 +55,7 @@ beautifulsoup4
 lxml >=4.8.0,<6

 # XPath 2.0-3.1 support - 4.2.0 broke something?
-elementpath==4.1.5
+elementpath==4.4.0
Is it time to upgrade?
Is it time to upgrade?
Sure, if the tests pass, it's OK.
Was this change required to fix this PR?
Since this PR (#2351) uses the fragment=True option, versions up to 4.1.5 won't work, and 4.2.0 has another problem. So the minimum is 4.2.1.
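The version reasoning in the comment above can be checked mechanically. The following is a minimal sketch, assuming plain dotted release numbers; the helper names `parse_version` and `meets_minimum` are hypothetical and are not part of elementpath or of this PR.

```python
def parse_version(text):
    """Turn a plain dotted release such as '4.2.1' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


# Minimum stated in the review comment: 4.1.5 lacks what the PR needs,
# and 4.2.0 is reported to have another problem.
MINIMUM = parse_version("4.2.1")


def meets_minimum(version):
    """Return True when `version` is at least the stated minimum."""
    return parse_version(version) >= MINIMUM


for candidate in ("4.1.5", "4.2.0", "4.2.1", "4.4.0"):
    print(candidate, meets_minimum(candidate))
```

Tuples compare element by element, so the pinned 4.4.0 passes while 4.1.5 and 4.2.0 do not. Pre-release suffixes are not handled in this sketch.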
])
def test_broken_DOM_01(html_content, xpath, answer):
    # Normally a DOM has exactly one root element node, so an exception is raised when that rule is violated.
    with pytest.raises(Exception):
I intentionally added this test to reproduce the problem.
In the future, libxml2 may implement HTML5 (https://gitlab.gnome.org/GNOME/libxml2/-/issues/211). As I wrote in the issue, this problem will then go away and this test will fail. When that day comes, please remove these tests.
@pytest.mark.parametrize("html_content", [DOM_violation_two_html_root_element])
@pytest.mark.parametrize("xpath, answer", [
    ("/html/body/p[1]", "First paragraph."),
    ("/html/body/p[1]", "Browsers parse this part by fixing it but lxml doesn't and returns two root element node"),
This is the critical point. Why do I select one element in the browser's inspector, but lxml returns two? Because there are two html elements and two body elements.
<p>First paragraph.</p>
</body>
</html>
<html>
The second html root element.
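To make the "two html elements and two body elements" point concrete, here is a minimal sketch that counts start tags in the raw markup with the standard library's `html.parser`, which reports tags as they appear and does not repair the tree. The `SAMPLE` string is a hypothetical stand-in modelled on the fixture quoted above (its second paragraph text is invented), and `TagCounter` / `count_start_tags` are illustrative names, not code from this PR.

```python
from html.parser import HTMLParser

# Hypothetical sample: a second <html>/<body> pair follows the first one.
SAMPLE = """<html>
<body>
<p>First paragraph.</p>
</body>
</html>
<html>
<body>
<p>Second paragraph.</p>
</body>
</html>"""


class TagCounter(HTMLParser):
    """Count start tags in raw markup without building or repairing a tree."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names before calling this hook.
        self.counts[tag] = self.counts.get(tag, 0) + 1


def count_start_tags(markup):
    """Return a dict mapping each tag name to its number of start tags."""
    counter = TagCounter()
    counter.feed(markup)
    counter.close()
    return counter.counts


counts = count_start_tags(SAMPLE)
print(counts.get("html", 0), counts.get("body", 0))
```

This only inspects the input text; it says nothing about how lxml or a browser would build the tree from it.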
This reverts commit 66a7dae.
So this is nearly always caused by a missing |
@dgtlmoon As I posted to https://gitlab.gnome.org/GNOME/libxml2/-/issues/716,
OR
OR
In this case, libxml2 and lxml return two html root element nodes.
or
|
please, could you update this with latest |
https://gitlab.gnome.org/GNOME/libxml2/-/issues/211 Yeah, that's super interesting: "The HTML parser in libxml2 was written 20+ years ago. It does not implement HTML5. Maybe it will some day, maybe it won't. Don't use libxml2 to parse HTML for anything serious. If you maintain a downstream project that uses libxml2's HTML parser, please forward this message to your users."
So basically this PR is making HTML5 work with libxml2 in a round-about way |
That is not what I do. There is no HTML5 support in libxml2 yet. The reason for multiple root elements is that the HTML parser doesn't implement the DOM. This PR allows a non-well-formed DOM tree to be parsed in a similar way.
I believe the issue I submitted will be fixed with it. EDIT: add similar |
add precise description
Previous test failed with an unrelated issue. |
I'm not an expert; this is just my opinion. If libxml2's HTML parser implements HTML5, the benefits are a more predictable HTML DOM and better security in general. It is easy to foresee that XPath users will need to change their expressions slightly, though it is difficult for me to list the changes exhaustively. Previously, HTML tag names were lowercase, but HTML5 may contain XML elements, and an XML element name may be upper or lower case (and also follows XML rules), e.g. https://developer.mozilla.org/en-US/docs/Web/SVG/Element/textPath. Also, some browsers don't support namespaces in XPath 1.0 (I don't know how to express this correctly): "//*:svg" and Clark notation (and similar notations) don't work in browsers. So if HTML5 is queried with XPath, users will write "//*:svg" for convenience, and the browser won't understand it.
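As a small illustration of the case-sensitivity and Clark-notation points, here is a sketch using the standard library's `xml.etree.ElementTree`, which supports only a limited XPath subset and is neither elementpath nor a browser engine. The `DOC` string is a hypothetical XHTML document with embedded SVG, written for this example.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

# Hypothetical XHTML document with an embedded, namespaced SVG element.
DOC = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <svg xmlns="http://www.w3.org/2000/svg">
      <text><textPath href="#p">along a path</textPath></text>
    </svg>
  </body>
</html>"""

root = ET.fromstring(DOC)

# Clark notation spells out the namespace URI in braces before the local name.
clark = root.findall(".//{" + SVG_NS + "}textPath")

# The same lookup with a prefix bound through a namespace mapping.
prefixed = root.findall(".//svg:textPath", {"svg": SVG_NS})

# XML names are case-sensitive, so the all-lowercase spelling matches nothing.
lowercase = root.findall(".//{" + SVG_NS + "}textpath")

print(len(clark), len(prefixed), len(lowercase))
```

The mixed-case `textPath` is found only when spelled exactly, and only when the SVG namespace is given, which is the kind of adjustment to expressions the comment anticipates.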
I'm wondering if there's a way to turn this on only when necessary, like doing some check first? Or does it already do that?
The second PR (the one you asked about in the question above) showed the side effects.
The first PR that I submitted, which this one now reverts to, was turned on only when it needed to be. The sole reason it raises an error is that the root element node has another root element node as a sibling. I have reverted to the first PR.
The point where the failures (unrelated to this PR) occurred is
So
EDIT: edit example |
Amazing, thank you so much -> #2623 |
please update with |
Obviously, some web servers provide broken HTML.
lxml and libxml2 fix it. That's good, and indeed great! (We have been happy for decades!)
But the error I want to solve occurs at the point where elementpath describes the DOM structure. This is because, under some conditions, lxml or libxml2 returns multiple root element nodes when using the HTML parser. (This could be a trace of the browser wars. I don't remember the article, but there were four kinds of HTML parser rules because of the four major browsers.)
See also, https://gitlab.gnome.org/GNOME/libxml2/-/issues/716
So I mimicked it.
The test I included describes the point.
fixes #2318