Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Option to match end before any patterns #139

Open
KapitanOczywisty opened this issue Sep 2, 2020 · 12 comments
Open

Option to match end before any patterns #139

KapitanOczywisty opened this issue Sep 2, 2020 · 12 comments

Comments

@KapitanOczywisty
Copy link

There is persistent problem with injected languages where unclosed parentheses or quotes eat whole document.

I'm proposing to add option e.g. applyEndPatternFirst. This would check such end pattern at every line before anything else. I've looked at code a bit and it seems possible, but I'm not familiar enough with engine to correctly implement this (at least without trashing performance).

Details:

  • On end match - everything before end would be extracted and checked again with sub-patterns and merged with end's tokens, while rest of line would be checked normally. Without match - business as usual.
  • When multiple end patterns with this option will be present, they should be stored and checked from first to last, at every line until any of them matches. There shouldn't be too many of them, since this should be used mainly with injected languages.
  • Successful match will terminate all descendant rules, after matching "before end" part of line.
  • end with option should also match in line where before was found.

Why not use while? While while (sic) is great addition, it can only exclude whole line. Proposed option would allow stopping in mid-line.

@alexdima
Copy link
Member

alexdima commented Sep 2, 2020

@KapitanOczywisty I think this makes sense, but vscode-textmate has as a goal to remain compliant with TM grammars. IMHO this proposal would make it that vscode-textmate diverges from TM and might lead to the creation of grammars that are not portable. I wonder what the exact problem you are running into is and how has it been addressed in grammars such as html, where the closing of </script> tag can occur in a JS comment, statement, etc.

@KapitanOczywisty
Copy link
Author

@alexdima If properly used, this option will only prevent some nasty injection bugs. No support for this option shouldn't cause grammar to break (well... not more than already is). As always there might be someone who uses this in highly improper way, but even now there are slight differences in implementations, e.g. php with sql injections breaks differently in atom and vscode. What is worth to mention, we're importing php grammar from atom, which don't have while support. I suspect that some of vscode's grammars are not "portable" already, at least to the atom.

Fwiw there is nothing stopping other implementations to add this feature.

About problem: There are many injection bugs atm in different languages, e.g. countless issues about embedded SQL in PHP.

Even mentioned </script> is not addressed:
image

What sometimes makes funny disagreements between LSP and grammar:
image

We can fix some of these issues making many duplicated rules, but there are issues which could be fixed only with such feature. Also atom's language-php in vscode uses different grammar for sql, what makes fixes even more fun.

@alexdima
Copy link
Member

alexdima commented Sep 4, 2020

@KapitanOczywisty Grammars (outside of injection) already do this by default, the end pattern is applied first by default and it is possible to configure in the grammar that the end pattern is applied last:

vscode-textmate/src/rule.ts

Lines 535 to 555 in ebcfe99

public compile(grammar: IRuleRegistry & IOnigLib, endRegexSource: string, allowA: boolean, allowG: boolean): CompiledRule {
if (!this._cachedCompiledPatterns) {
this._cachedCompiledPatterns = new RegExpSourceList();
this.collectPatternsRecursive(grammar, this._cachedCompiledPatterns, true);
if (this.applyEndPatternLast) {
this._cachedCompiledPatterns.push(this._end.hasBackReferences ? this._end.clone() : this._end);
} else {
this._cachedCompiledPatterns.unshift(this._end.hasBackReferences ? this._end.clone() : this._end);
}
}
if (this._end.hasBackReferences) {
if (this.applyEndPatternLast) {
this._cachedCompiledPatterns.setSource(this._cachedCompiledPatterns.length() - 1, endRegexSource);
} else {
this._cachedCompiledPatterns.setSource(0, endRegexSource);
}
}
return this._cachedCompiledPatterns.compile(grammar, allowA, allowG);
}

Otherwise, if this is about injection, then is it that both the injection and the rule matches? But it looks to me like we already handle that case with some scoring:

// Decide if `matchResult` or `injectionResult` should win
const matchResultScore = matchResult.captureIndices[0].start;
const injectionResultScore = injectionResult.captureIndices[0].start;
if (injectionResultScore < matchResultScore || (injectionResult.priorityMatch && injectionResultScore === matchResultScore)) {
// injection won!
return injectionResult;
}
return matchResult;

More likely, what I believe is the problem is that a certain rule misses an injection for the end embedded language pattern or that certain rules are written in a way where they cannot be injected (greedily eating up text and not giving a chance to the embedded language end regex to match). But IMHO this cannot be tackled with some new flag, this is a technical flaw of TM.

@KapitanOczywisty
Copy link
Author

KapitanOczywisty commented Sep 4, 2020

@alexdima this is true as long as every rule is properly closed, problems starts when we have parentheses and quotes.

Take this example (not the best practice, but easy to encounter in the wild):

<?php
$query = "SELECT * FROM `table`
WHERE `id` IN ('". implode("','", $ids) ."')";

implode is eaten as part of sql
image

In vscode parentheses in sql seems to not be tokenized so removing quotes fixes issue
image

But in atom sql tokenizes parentheses and we have this beauty
image

There were attempts to fix this, but this made even bigger mess.

Flag will not fix everything, but will allow stopping sql tokenization at IN ('". TM is flawed, but can still be a bit better 😉

Alternatively If there would be any way to alter tokens returned from TM, without separate tokenization in extension, I could write semanticTokenProvider to inject SQL tokens afterwards. Is there some callback like that already?

@alexdima
Copy link
Member

alexdima commented Sep 4, 2020

AFAIK, the "TM way" to fix this is to have a language called "sqlindoublequotes". Then a grammar is written from scratch or adapted such that this language never eats by accident ". The outer language, like php in this case would then inject the " end rule in all the rules of the "sqlindoublequotes".

I still don't understand your proposal though, since at a point the tokenizer reaches the position at '". If the sql language has a regex like /'[^']+'/, then that will match and consume the ". The end rule injected by php has no chance to intercept anything because the " is entirely skipped. The solution is to not have a regex like /'[^']+'/ in SQL, IMHO either by having a begin/end rule for strings where each character is looked at individually to which php can inject, either by having this language aware of its embedder that does not eagerly consume ".

@KapitanOczywisty
Copy link
Author

KapitanOczywisty commented Sep 4, 2020

@alexdima In my proposal end with flag is checked at every line before "normal" rules. On positive match string is trimmed to the found position for second pass with "normal" rules, and then whole block is closed. Descendant rules cannot eat " because it isn't passed to them anymore.

They tried to fix that way (sqlindoublequotes), but this failed as you can see. PHP has 4 variants of writing strings:

<?php
// double quoted
$query = "SELECT * FROM `{$table["name"]}`";
// single quoted (no interpolation)
$query = 'SELECT * FROM `'. $table["name"] .'`';
// heredoc
$query = <<<SQL
SELECT * FROM `{$table["name"]}`
SQL;
// nowdoc (no interpolation)
$query = <<<'SQL'
SELECT * FROM `table`
SQL;

Writing all variants is quite an overhead. It's possible, but also time consuming. I mean I'll do it if there is no other way, but I'd rather not. And there are similar issues in other languages.

Edit: In general any option for second pass would be a huge change, even if this would be callback in JS.

@alexdima
Copy link
Member

alexdima commented Sep 4, 2020

In my proposal end with flag is checked at every line before "normal" rules. On positive match string is trimmed to the found position for second pass with "normal" rules, and then whole block is closed.

That makes sense. Then I am sorry, I misunderstood your proposal. The description applyEndPatternFirst made me think this is about the order of the regexes or their priority. This is something completely new to TM, exactly like we do in Monarch.

@HugoGranstrom
Copy link

HugoGranstrom commented Jun 4, 2021

Any progress on this matter? I'm having similar problems embedding markdown in my language. It works fine if the end pattern is on a separate line from the "markdown content" but as soon as it's on the same line the markdown eats it and takes over the entire document.

var a = md"""
This is some md that works
"""

var b = md"""
This doesn't work"""

# This comment is now highlighted as if it was a markdown header

@AlfishSoftware
Copy link

This is extremely necessary. There should be a way to embed languages generically, without having to account for every possible comment/string/etc of that specific language breaking the container syntax; and having to workaround that by adding a lot of unrelated "hack" rules to fix it.

Syntax of embedded languages should be determined on a second "pass" without breaking the syntax of whatever is delimiting it in the parent language; while still allowing it to override parent escape sequences (e.g. in strings) over the embedded language.

@RedCMD
Copy link

RedCMD commented Oct 5, 2024

related: #207

The problem with adding this is that it will cause VSCode to be incompatible with TextMate2.0, Atom, Sublime etc
Apple should have thought about this in the first place, but it is too late now

Atom had their own potential implementation alwaysMatchEndPattern, but it never went through atom/first-mate#90

VSCode's while is a lot more strict that end
preventing embedded languages from leaking
however it is a VSCode bug #241

@AlfishSoftware
Copy link

AlfishSoftware commented Oct 8, 2024

I tried a pattern with the embedded pattern inside captures to see if I could get the entire content between begin..end to be matched atomically, then apply this include pattern on a second pass. But this wouldn't work because match/begin/end/while patterns themselves are single-line only. Even ignoring that, I still couldn't get it to work for single-line embedded content too.

EDIT: Ah, it must be because of #242

@AlfishSoftware
Copy link

AlfishSoftware commented Oct 8, 2024

EDIT: Moved to a new issue: #243

Original text

Syntax of embedded languages should be determined on a second "pass" without breaking the syntax of whatever is delimiting it in the parent language; while still allowing it to override parent escape sequences (e.g. in strings) over the embedded language.

I think this would be the ideal implementation for the best embedded language support.
It would be to allow a subPatterns field (and an optional replacementPatterns field with it) that uses this second-pass logic. They would be mutually exclusive with patterns. This is how it could work when subPatterns is present:

  • The start..end|while rule is matched first, without considering any sub-patterns or replacement patterns. Let's say the text content between them is all stored into a "subContent" variable.
  • Then apply replacement patterns if they exist. They are basically the same as patterns, except they use match and a replaceWith field (which can have back-references from its match) to specify substitution within "subContent". So the sub-patterns will later operate considering the substitutions in this "subContent" variable. So, for example, if a &lt; to < substitution occurs, then sub-patterns operate on this new text. This allows you to replace escaping syntax from the parent language before the sub-pattern that includes the embedded language.
  • Then apply subPatterns into just the (replaced) "subContent" text atomically, on an inner/sub pass.
  • For any regions that had replacements, apply the replacement scope name on top of whatever scopes come from the sub-patterns. So this way you can inter-mix escaping syntax of both languages.

A theoretical example:

{
  "name": "string.quoted.embedded-code.$1.my-lang",
  "begin": "([\\w-]+)`", // group 1 is the language id
  "beginCaptures": {
    "1": { "name": "entity.other.language.my-lang" }
  },
  "end": "`",
  "contentName": "meta.embedded.block.$1 source.$1",
  "replacementPatterns": [
    // $1 would replace with the char in group 1 below literally
    { "match": "\\\\([`\\\\])", "replaceWith": "$1", "name": "constant.character.escape.my-lang" },
    // $h1 could replace with the unicode char from the hex number matched by group 1
    { "match": "\\\\u(\\h{4})", "replaceWith": "$h1", "name": "constant.character.escape.my-lang" },
    // $d1 same as above, but for decimal numbers
    { "match": "\\\\c\\[(\\d+)\\]", "replaceWith": "$d1", "name": "constant.character.escape.my-lang" },
  ],
  "subPatterns": [
    // "include" could allow back-references from the parent begin/match pattern
    // to support arbitrary languages
    { "include": "source.$1" }
  ]
}

This would let you include any arbitrary embedded language without having to know anything about its syntax, and you could even have escaping in the parent language be recognized and it would just work.

RedCMD referenced this issue in RedCMD/TmLanguage-Syntax-Highlighter Oct 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants