<?xml-stylesheet ?> element breaks relative xpaths #81

Open
uhlikfil opened this issue Nov 10, 2024 · 4 comments

Comments

@uhlikfil

uhlikfil commented Nov 10, 2024

The same XPath query with a relative path from the root child node does not return results if there is a <?xml-stylesheet ?> tag present in the XML document.

Minimal repro code:

import elementpath
from lxml import etree

# xml1 contains the xml-stylesheet tag
xml1 = b"""<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type='text/xsl' href='test.xsl'?>
<root>
    <first>
        <second>
            value
        </second>
    </first>
</root>
"""

# the same as xml1, but without the xml-stylesheet tag
xml2 = b"""<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<root>
    <first>
        <second>
            value
        </second>
    </first>
</root>
"""

root1 = etree.XML(xml1)
root2 = etree.XML(xml2)
query = "first/second"

elementpath.select(root1, query)
# returns []
elementpath.select(root2, query)
# returns [<Element second at ... >]

Is that expected? Why is it happening?

@brunato
Member

brunato commented Dec 15, 2024

Hi,
if you parse with the etree.XML API, the resulting tree is in general treated as a fragment, that is, an XML tree without a document node (a document node would be an ElementTree instance in this case, wrapped in a DocumentNode instance). But xml1 has a root sibling other than the standard XML declaration, so it is interpreted as a document. You can force the PI sibling to be skipped by providing fragment=True, so that the result is the same:

 elementpath.select(root1, query, fragment=True)
 # returns [<Element second at ... >]

Anyway, the behavior may not be what the argument description intends:

:param fragment: if `True` a root element is considered a fragment, if `False` \
a root element is considered the root of an XML document. If `None` is provided, \
the root node kind is preserved.

so something has to be fixed, at least when fragment is False or None.

thank you
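
The root sibling mentioned above can be inspected directly with lxml. The following is a minimal sketch, not part of elementpath: `preceding_root_siblings` is a hypothetical helper name, and the XML strings are shortened versions of xml1 and xml2 from the report. It only shows that lxml keeps a top-level processing instruction as a sibling of the root element, which is the condition described above for treating the tree as a document.

```python
from lxml import etree

# Shortened variants of xml1 / xml2 from the report.
xml_with_pi = b"""<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type='text/xsl' href='test.xsl'?>
<root><first><second>value</second></first></root>
"""
xml_plain = b"""<?xml version="1.0" encoding="UTF-8"?>
<root><first><second>value</second></first></root>
"""


def preceding_root_siblings(root):
    """Return top-level nodes (PIs, comments) placed before the root element.

    Hypothetical helper for illustration only.
    """
    # itersiblings(preceding=True) walks backwards, so restore document order.
    return list(root.itersiblings(preceding=True))[::-1]


root_with_pi = etree.XML(xml_with_pi)
root_plain = etree.XML(xml_plain)

# The XML declaration is not a node; only the stylesheet PI is a root sibling.
pi_targets = [node.target for node in preceding_root_siblings(root_with_pi)]
plain_siblings = preceding_root_siblings(root_plain)
```

Here `pi_targets` holds the target name of each processing instruction found before the root, while `plain_siblings` is empty for the document without the stylesheet PI.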

@brunato
Member

brunato commented Dec 21, 2024

A fix for the fragment argument is available in v4.7.0. The default has changed to None; if False is provided, a document node is added to the tree.

By default the root node kind is not changed, except in cases like xml1 with lxml, where an effective document node is added unless you provide fragment=True.

This default behavior with lxml could be changed, but with the drawback that root siblings could not be selected (in that case an explicit fragment=False would be needed).

I'm waiting for feedback on this; otherwise the issue can be closed.

Thank you

@uhlikfil
Author

Just to be clear: given an XML document containing a PI.

If I want to use a relative query starting from the root (e.g. first/second), I need to set fragment=True. However, with fragment=True I am then not able to select the root node (e.g. //root). Is there a way to make both cases work? The lxml Element.xpath method works in both cases:

import elementpath
from lxml import etree

xml = b"""<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type='text/xsl' href='test.xsl'?>
<root>
    <first>
        <second>
            value
        </second>
    </first>
</root>
"""

root = etree.XML(xml)
relative_query = "first/second"
root_query = "//root"

elementpath.select(root, relative_query, fragment=True)
root.xpath(relative_query)
# both return the same element now thanks to the fragment changes

elementpath.select(root, root_query, fragment=True)
# returns []
root.xpath(root_query)
# returns [<Element root at ...>]

@uhlikfil uhlikfil reopened this Dec 30, 2024
@brunato
Member

brunato commented Jan 6, 2025

Hi,
a fragment doesn't have a document root, so an absolute path (/ or //) necessarily starts at the root element and continues on its children. lxml in this case uses the document position, as elementpath does for non-fragments.

If you have to use both relative and absolute paths, one solution is to pass the item=root argument to the selector: this keeps the XML tree as a document but sets the initial item position to the root Element instead of the document.

import elementpath
from lxml import etree

xml = b"""<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type='text/xsl' href='test.xsl'?>
<root>
    <first>
        <second>
            value
        </second>
    </first>
</root>
"""

root = etree.XML(xml)
relative_query = "first/second"
root_query = "//root"

res1 = elementpath.select(root, relative_query, item=root)
res2 = root.xpath(relative_query)
assert res1 == res2 == [root[0][0]]
# both return [<Element second at ...>]

res1 = elementpath.select(root, root_query, item=root)
res3 = elementpath.select(root, root_query)
res2 = root.xpath(root_query)
assert res1 == res2 == res3 == [root]
# all return [<Element root at ...>]
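
The point about absolute paths starting from the document position can be checked with lxml alone. This is a small sketch under the assumption that the same XML as above is used (shortened here); it relies only on `Element.xpath` and on the standard XPath expansion of `//root` into `/descendant-or-self::node()/child::root`. Since that expansion ends with a `child::` step, the root element can match only when there is a node above it, which is why a fragment without a document node gives no result for `//root`.

```python
from lxml import etree

xml = b"""<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type='text/xsl' href='test.xsl'?>
<root><first><second>value</second></first></root>
"""

root = etree.XML(xml)

# An absolute path is evaluated from the document node, so the
# top-level PI is reachable even though xpath() is called on root.
top_level_pis = root.xpath("/processing-instruction()")

# '//root' is the abbreviated form of the expanded path below.
abbreviated = root.xpath("//root")
expanded = root.xpath("/descendant-or-self::node()/child::root")

# The context element does not matter for an absolute path.
from_nested = root[0].xpath("/root")
```

With the document node in place, `abbreviated`, `expanded`, and `from_nested` all contain the root element, and `top_level_pis` contains the stylesheet PI.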
