Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MathML support in the HTML Sanitizer API #227

Open
fred-wang opened this issue Mar 25, 2024 · 19 comments
Open

MathML support in the HTML Sanitizer API #227

fred-wang opened this issue Mar 25, 2024 · 19 comments

Comments

@fred-wang
Copy link
Contributor

See https://wicg.github.io/sanitizer-api/

Some work has been done to hande mathml/svg namespaces but the spec should likely specify a default safelist, see WICG/sanitizer-api#103 (comment) (IIRC, the API allows web dev to accept more element/attributes that are not in the safelist, though)

So this issue is about discussing what we want to suggest as a default safelist for MathML.

In another issue, I had commented to try and follow MathML Core as much as possible as that's what browsers are expected to implement: WICG/sanitizer-api#167 (comment)

Some more comments:

Firefox has some safe list already but I guess it is not very strict, for example it still allows XLink href or content mathml markup. The bug is https://bugzilla.mozilla.org/show_bug.cgi?id=1787594

For Chromium, I don't remember without checking more. But probably it does not include more than what is in MathML Core, since we never implemented more.

I'm not sure if the sanitzer api is actually being implemented in webkit.

@dginev
Copy link

dginev commented Mar 25, 2024

I see the current sanitization algorithms have a configurable on/off toggle for data- attributes:

  1. If "data-" is a code unit prefix of local name
    and if namespace is null and if config["dataAttributes"] exists and is false:

    1. Remove attr from child.

Similarly to config["dataAttributes"] it would be nice to have config["annotationElements"] (or related name) to direct whether the MathML Core elements annotation and annotation-xml should be kept or removed, whichever non-standard material they may happen to contain.

@fred-wang
Copy link
Contributor Author

Similarly to config["dataAttributes"] it would be nice to have config["annotationElements"] (or related name) to direct whether the MathML Core elements annotation and annotation-xml should be kept or removed, whichever non-standard material they may happen to contain.

Right, I believe we should have both 1) a strict default subset implemented natively in browsers and 2) a way to relax it for web developers using the API. I haven't read the spec for a while, but that's how I had understood the situation for HTML. It would be good if MathML folks could spend some time to ensure this is the case for MathML too.

@polx
Copy link

polx commented Mar 25, 2024

I think that MathML in general is safe. Any element except maybe those who contain script oriented ones should be in the accept-list.

@dginev
Copy link

dginev commented Apr 2, 2024

As requested in the Math WG meeting on March 28, here is a link to the MathML-related CVEs on record, currently 8:

NIST national vulnerabilities database, keyword MathML

I remembered noticing back in the day that cases such as CVE-2021-38193 and CVE-2020-26870 appeared to be examples where switching between parsing contexts hosted exploits. This was the reason I flagged annotation-xml as a potential exploit vector in my previous comment, since depending on how its non-standard contents are processed, an implementation may hit other DOM-related edge cases.

annotation should just be character data, so likely a step safer (and more directly comparable to data-* attributes).

@polx
Copy link

polx commented Apr 30, 2024

Hello @fred-wang and all,

we discussed the subject in the MathML-core meeting yesterday and I think that the following seems to have met everyone's agreement:

We converged on the fact that the skeptiscism about the security of data-* attributes is similar to that about annotation and annotation-xml elements: The problem is not a problem about MathML itself but it may be important to shut-the-tap proactively as it may become a problem of every users. Simplest examples include LaTeX source code fragments. Thus we propose, as @dginev hinted, that:

  1. the sanitzier API should have a switch to wipe out or not the annotation and annotation-xml elements' non-MathML-ingredients.

Other than that we see sanitziation needs to wipe-out:

  1. for any namespace-declaration or external entity references that is not entirely produced synthetically by the browser (in particular the MathML, SVG and HTML namespaces should be considered safe but others probably not)

  2. the actiontype attribute of maction if it is not the standard value statusline or toggle

  3. the src attribute of annotation and annotation-xml (as some parsers may consider this similar to an img element): sanitizers may inline that content instead.

We have also considered it important that this issue carries a few examples of potentials that the sanitizer's inclusion of the MathML elements may bring.

Finally, we have highlighted the potentials of TrustedTypes as an application that may be relevant for the sanitizers. But so far, I see this as a potential only. I would suggest that we request that the MathWG or Math CG be "called back" when TrustedTypes may intersect the sanitizer APIs beyond its current scope (which I understand to be a baseline converter to transform web-content in something that can be exchanged in a way considered safer further than the browser's current page).

Do you agree with the approach proposed in the numbers 1 to 4. Then I suggest we go to the sanitizer API issues and make that suggestion as a safe list.

thanks in advance.

Paul

@fred-wang
Copy link
Contributor Author

Hi,

Sorry for the late reply, I overlooked this was directed to me.

In general I don't have strong opinion on this, the sanitizer API is implemented in a relatively part of browser code that is relatively independent from MathML rendering. It should be fine to go ahead and talk to the people working on the sanitizer API spec, finding a consensus there. I didn't check what was the latest status regarding non-HTML namespace.

Probably the main thing to pay attention is that MathML Core is targetted for browsers while MathML Full is used in other applications. So we would need to decide whether we only accept MathML Core markup or allow MathML Full markup (with maybe more sanitization for security/privacy sensitive markup that will need to be figured out). 1-4 seems to be about things that are not in MathML Core.

Note that Firefox's sanitization currently accepts content markup but at the cost of adding many atomic strings for each content MathML tag: https://bugzilla.mozilla.org/show_bug.cgi?id=1787594#c8

Regarding security/safety in browsers, the one I'm aware of are described in https://w3c.github.io/mathml-core/#security-considerations and https://w3c.github.io/mathml-core/#privacy-considerations ; in particular href is the one that can cause problems (unfortunately the discussions regarding its inclusion in MathML Core is on hold). Note also the case of maction statusline (whose support was removed from browsers).

@dginev
Copy link

dginev commented Sep 3, 2024

@fred-wang

So we would need to decide whether we only accept MathML Core markup or allow MathML Full markup (with maybe more sanitization for security/privacy sensitive markup that will need to be figured out). 1-4 seems to be about things that are not in MathML Core.

Your comment made me wonder why/how the elements annotation, annotation-xml and their container semantics made it into MathML Core.

If they are to be useful, their contents should be able to survive sanitization, at least in some cases. If not, maybe they are better thought of as MathML Full elements?

As a cross-spec thought: SVG has a construct similar to <annotation-xml> called foreignObject, but unlike the annotation elements, foreign objects are also used for rendering. So the HTML parser has defined (and limited) behavior over them, accepting HTML+SVG+MathML content. I wonder if the sanitization considerations for <annotation-xml> should be similar to that case?

Content MathML is indeed the classic use of <annotation-xml> to consider. And a common use of <annotation> is to carry a source format representation (usually TeX). Both kinds of annotations may be useful data structures to expose for JavaScript apps. But that ship may have also sailed at this point, I myself don't have a strong opinion on the direction here. This is part of the conversation I was hoping the group will have during the next charter of the Math WG.

@polx
Copy link

polx commented Nov 25, 2024

Maybe the right thing to do here is to be minimal first so that something comes through. The annotation* family of elements allow arbitrary content whose security needs to be warned against (even TeX can have its security issues, I think).

While I agree we should strive for something useful, and href attributes and annotation* can be useful, they do not appear controllable to me (except maybe having well known media-types?).

@bkardell
Copy link
Collaborator

So we would need to decide whether we only accept MathML Core markup or allow MathML Full markup (with maybe more sanitization for security/privacy sensitive markup that will need to be figured out). 1-4 seems to be about things that are not in MathML Core.

Per the working group meeting today, we resolved that we feel it is ok to begin with MathML-core and we'd like to move that forward

@polx
Copy link

polx commented Nov 28, 2024

Something near this was discussed in, with, and about the fediverse tools:

@polx
Copy link

polx commented Nov 28, 2024

Here is my proposal:

MathML-core considers all but the following elements to be safely exchanged and do not need a sanitization, e.g. when they are converted from a display to an input enviroment.
We recommend the Sanitzer API (ref) to sanitize MathML by keeping all elements and attributes except:

  • the on* attributes (aimed at event handlers),
  • the maction elements if the actiontype attribute is of the value statusline (the element can be replaced by it first child),
  • any annotation or annotation-xml element whose encoding attribute is of a media-type that is is either absent or is not among the trusted types or if it contains an href attribute.

I was unsure about two aspects:

  • we could recommend to inline the content of the annotation's href attribute but I think this is going too far
  • I think that mphantom should be converted to an mrow as otherwise the content carries invisible content. But I am not sure that invisible content is a problem.

Thanks for your feedback.

@bkardell
Copy link
Collaborator

bkardell commented Dec 5, 2024

I think all of the global event handlers (we just use the standard mixins) would be handled by
WICG/sanitizer-api#226 - there is nothing "special" for MathML in terms of events, etc. I think..

If I'm honest though, I do think there's a kind of hard "race" between these two specs, bc in practice it probably has to force some conversations about things that Core Level 1 specifically punted on because we couldn't get the people talking - mainly, for example, links in MathML-Core (see whatwg/html#5248 (comment) and #29) Currently there are a few kinds of disagreements here in implementation see https://wpt.fyi/results/mathml/relations/html5-tree/link-color-001.tentative.html?label=master&label=experimental&aligned&q=mathml%20link for example) but also (unless this has changed) in what can be linked... It feels difficult to reason about standardized sanitization genuinely without addressing that.

Also, maction shouldn't really be dangerous according to MathML-Core, but idk...

@polx
Copy link

polx commented Dec 11, 2024

From the comments on [Sanitizer's 103](https://github.com/WICG/sanitizer-api/issues/103] it seems that the more minimal we provide a set, the more chances it has to succeed.

From your comment @bkardell , I seem to understand that it is unclear if links will be there or not. But can't we make a recommendation independently of this decision?

I agree maction does not seem to be dangerous but considering its use, it seems not too dangerous to declare it unsafe.

@polx
Copy link

polx commented Jan 14, 2025

We have discussed the possible relations of the sanitizer-safe-list and the evolution of the last bits of specs with @bkardell and have come to the following (probably minimal) version of elements and attributes that should be processed by the sanitizer.

We should discuss and decide on this list on the next MathML-core meeting at the end of January 2025.


MathML Safe List

Short Version

MathML-core considers all elements and attributes of MathML-core (as listed in section 2.1 of MathML-core) as safe and not needing a sanitziation except the following elements.

We recommend the Sanitzer API to sanitize MathML by keeping all elements and attributes except the follwing:

  • any common attribute with HTML attributes which need a sanitzation,
  • the maction and mphantom elements (the element can be replaced by their first child), and
  • any annotation or annotation-xml element whose encoding attribute is of a media-type that is is either absent or is not among the trusted types or if it contains an href attribute.

Detailed Version

MathML-core considers the following elements and attributes of MathML-core as safe and not needing sanitization:

Safe "as-is" Elements of MathML-core:
math, merror, mfrac, mi, mmultiscripts, mn, mo, mover, mpadded, mprescripts, mroot, mrow, ms, mspace, msqrt, mstyle, msub, msubsup, msup, mtable, mtd, mtext, mtr, munder, munderover, semantics

Attributes of MathML-core:
dir, displaystyle, mathbackground, mathcolor, mathsize, scriptlevel, encoding, display, linethickness, intent and arg; on mo elements: form, fence, separator, lspace, rspace, stretchy, symmetric, maxsize, minsize, largeop, movablelimits; on mpadded elements: width, height, depth, lspace, voffset, on mspace elements: width, height, depth, on munderover elements accent and accentunder; on mtd elements columnspan and rowspan.

Moreover, the following attributes have their syntax and semantics specified in the HTML specification. The sanitizer behaviour on these attributes should be as is done on HTML elements: on*, id, class, style, data-*, autofocus, nonce,tabindex (for example any javascript should be removed).

The elements of MathML-core which need treatment by the sanitizers are the following:

  • annotation and annotation-xml if their encoding attribute is not considered of a safe type (e.g. if the encoding is text/plain then it could be kept). If removed, the element should be replaced by its first child.
  • maction is replaced by their first child
  • mphantom is removed

@polx
Copy link

polx commented Jan 16, 2025

I have created mathml-safe-list to track the evolution of this document and to consider for furhter inclusion.

There was the error that the annotation and annotation-xml elements should not be replaced but removed. This was corrected.
A clarification on what it meant to replace by the first child was made.

Both are done in the commit 1eb208 of the mathml-doc repository.

@dginev
Copy link

dginev commented Jan 16, 2025

What's wrong with <mphantom>? That is just an <mrow style="display:hidden;">, isn't it?
The MathML Core spec has a note:

... the <mphantom> element is primarily intended as an aid in aligning expressions

Sanitizing that away would quietly break content, wouldn't it? Shouldn't it get at least marked as deprecated if so?

Edit: I see there is also a "compatibility-only" note here. So likely good enough, or only needing that note extended with a warning it may get sanitized away by default.

@bkardell
Copy link
Collaborator

@dginev see also WICG/sanitizer-api#103 (comment) - I'm not sure why it's not linked up already, but basically there are 3 not 2 variants here. One would be the 'default' which would remove <mphantom>, and another (with an empty config) would keep it - (and of course, the unsafe variants that exist today)

@polx
Copy link

polx commented Jan 17, 2025

At the meeting yesterday, the was shortly discussed and consensus emerged against removing mphantom as part of the sanitizer.

@ljmaher
Copy link

ljmaher commented Jan 17, 2025 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants