Option to match end before any patterns #139
Comments
@KapitanOczywisty I think this makes sense, but
@alexdima If properly used, this option will only prevent some nasty injection bugs. Lack of support for this option shouldn't cause a grammar to break (well... not more than it already does). As always, someone might use it in a highly improper way, but even now there are slight differences between implementations, e.g. PHP with SQL injections breaks differently in Atom and VS Code. It's worth mentioning that we're importing the PHP grammar from Atom, which doesn't have

Fwiw, nothing stops other implementations from adding this feature.

About the problem: there are many injection bugs at the moment in different languages, e.g. countless issues about embedded SQL in PHP. Even mentioned

What sometimes makes funny disagreements between LSP and grammar:

We can fix some of these issues by writing many duplicated rules, but some issues can only be fixed with a feature like this. Also atom's
@KapitanOczywisty Grammars (outside of injection) already do this by default, the Lines 535 to 555 in ebcfe99
Otherwise, if this is about injection, is the problem that both the injection and the rule match? It looks to me like we already handle that case with some scoring: vscode-textmate/src/grammar.ts Lines 769 to 777 in ebcfe99
More likely, I believe the problem is that a certain rule is missing an injection for the embedded language's end pattern, or that certain rules are written in a way that prevents injection (greedily eating up text and not giving the embedded language's end regex a chance to match). But IMHO this cannot be tackled with a new flag; it is a technical flaw of TM.
@alexdima This is true as long as every rule is properly closed; problems start when we have parentheses and quotes. Take this example (not best practice, but easy to encounter in the wild):

<?php
$query = "SELECT * FROM `table`
WHERE `id` IN ('". implode("','", $ids) ."')";

In VS Code, parentheses in SQL seem not to be tokenized, so removing the quotes fixes the issue. But in Atom, SQL tokenizes parentheses and we have this beauty. There were attempts to fix this, but they made an even bigger mess. The flag will not fix everything, but it will allow stopping SQL tokenization at

Alternatively, if there were any way to alter the tokens returned from TM without separate tokenization in an extension, I could write a semanticTokenProvider to inject SQL tokens afterwards. Is there some callback like that already?
AFAIK, the "TM way" to fix this is to have a language called "sqlindoublequotes". Then a grammar is written from scratch or adapted such that this language never eats by accident

I still don't understand your proposal though, since at a point the tokenizer reaches the position at
@alexdima In my proposal

They tried to fix it that way (

<?php
// double quoted
$query = "SELECT * FROM `{$table["name"]}`";
// single quoted (no interpolation)
$query = 'SELECT * FROM `'. $table["name"] .'`';
// heredoc
$query = <<<SQL
SELECT * FROM `{$table["name"]}`
SQL;
// nowdoc (no interpolation)
$query = <<<'SQL'
SELECT * FROM `table`
SQL;

Writing all the variants is quite an overhead. It's possible, but also time-consuming. I mean, I'll do it if there is no other way, but I'd rather not. And there are similar issues in other languages.

Edit: In general, any option for a second pass would be a huge change, even if it were a callback in JS.
That makes sense. Then I am sorry, I misunderstood your proposal. The description
Any progress on this matter? I'm having similar problems embedding markdown in my language. It works fine if the end pattern is on a separate line from the "markdown content", but as soon as it's on the same line, the markdown eats it and takes over the entire document.

var a = md"""
This is some md that works
"""
var b = md"""
This doesn't work"""
# This comment is now highlighted as if it was a markdown header
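
To make the failing case concrete, here is a minimal sketch (not vscode-textmate code) of what checking the end pattern on every line, before the embedded grammar sees the line, would report for the snippet above. The names `EndPosition` and `findEmbeddedEnd` are hypothetical, and plain JavaScript regular expressions stand in for Oniguruma.

```typescript
// Hypothetical result type: where the embedded block would stop.
interface EndPosition {
  line: number;   // zero-based line index
  column: number; // zero-based column where the end match starts
}

// Scan line by line and test the end pattern first on each line.
// Pass a regex without the "g" flag so no state is kept between calls.
function findEmbeddedEnd(
  lines: string[],
  firstContentLine: number,
  end: RegExp
): EndPosition | null {
  for (let i = firstContentLine; i < lines.length; i++) {
    const text = lines[i] ?? "";
    const m = end.exec(text);
    if (m !== null) {
      return { line: i, column: m.index };
    }
  }
  return null; // never closed: the embedded block runs to the end of the document
}

const source: string[] = [
  'var a = md"""',
  "This is some md that works",
  '"""',
  'var b = md"""',
  'This doesn\'t work"""',
  "# This comment is now highlighted as if it was a markdown header",
];

// Content of `b` starts on line 4; the end is found mid-line at column 17,
// so line 5 would stay with the parent language.
console.log(findEmbeddedEnd(source, 4, /"""/));
```

With this ordering the markdown grammar would only ever receive `This doesn't work`, so it would have no chance to swallow the closing delimiter.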
This is extremely necessary. There should be a way to embed languages generically, without having to account for every possible comment/string/etc. of that specific language breaking the container syntax, and without having to work around that by adding a lot of unrelated "hack" rules. The syntax of embedded languages should be determined in a second "pass", without breaking the syntax of whatever delimits it in the parent language, while still allowing parent escape sequences (e.g. in strings) to take precedence over the embedded language.
related: #207

The problem with adding this is that it would make VSCode incompatible with TextMate2.0, Atom, Sublime, etc.

Atom had their own potential implementation

VSCode's
I tried a pattern with the embedded pattern inside captures to see if I could get the entire content between begin..end to be matched atomically, then apply the include pattern in a second pass. But this wouldn't work, because match/begin/end/while patterns themselves are single-line only. Even ignoring that, I couldn't get it to work for single-line embedded content either. EDIT: Ah, it must be because of #242
EDIT: Moved to a new issue: #243

Original text:
I think this would be the ideal implementation for the best embedded language support.
A theoretical example:

{
"name": "string.quoted.embedded-code.$1.my-lang",
"begin": "([\\w-]+)`", // group 1 is the language id
"beginCaptures": {
"1": { "name": "entity.other.language.my-lang" }
},
"end": "`",
"contentName": "meta.embedded.block.$1 source.$1",
"replacementPatterns": [
// $1 would replace with the char in group 1 below literally
{ "match": "\\\\([`\\\\])", "replaceWith": "$1", "name": "constant.character.escape.my-lang" },
// $h1 could replace with the unicode char from the hex number matched by group 1
{ "match": "\\\\u(\\h{4})", "replaceWith": "$h1", "name": "constant.character.escape.my-lang" },
// $d1 same as above, but for decimal numbers
{ "match": "\\\\c\\[(\\d+)\\]", "replaceWith": "$d1", "name": "constant.character.escape.my-lang" },
],
"subPatterns": [
// "include" could allow back-references from the parent begin/match pattern
// to support arbitrary languages
{ "include": "source.$1" }
]
}

This would let you include any arbitrary embedded language without having to know anything about its syntax. Escaping in the parent language could even be recognized, and it would just work.
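
As a rough illustration of what the theoretical `replacementPatterns` would have to do before the embedded grammar runs, the sketch below decodes the three escape forms from the example. It is an assumption-laden sketch: `unescapeEmbedded` is a hypothetical helper, `[0-9a-fA-F]` replaces Oniguruma's `\h`, and nothing like this exists in vscode-textmate today.

```typescript
// Hypothetical decoder for the escapes in the example grammar:
//   \` and \\   -> the literal character       ("replaceWith": "$1")
//   \uXXXX      -> char from a hex number      ("replaceWith": "$h1")
//   \c[NNN]     -> char from a decimal number  ("replaceWith": "$d1")
function unescapeEmbedded(raw: string): string {
  const escapes = /\\([`\\])|\\u([0-9a-fA-F]{4})|\\c\[(\d+)\]/g;
  return raw.replace(
    escapes,
    (
      _match: string,
      literal: string | undefined,
      hex: string | undefined,
      dec: string | undefined
    ): string => {
      if (literal !== undefined) {
        return literal;
      }
      if (hex !== undefined) {
        return String.fromCharCode(parseInt(hex, 16));
      }
      // Only code units up to 0xFFFF are handled in this sketch.
      return String.fromCharCode(parseInt(dec ?? "0", 10));
    }
  );
}

// The parent string `a\`b\u0041\c[66]` would reach the embedded grammar as "a`bAB".
console.log(unescapeEmbedded("a\\`b\\u0041\\c[66]"));
```

A real implementation would also need to map token offsets in the decoded text back to positions in the original line; that part is left out here.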
There is a persistent problem with injected languages where unclosed parentheses or quotes eat the whole document.

I'm proposing to add an option, e.g. `applyEndPatternFirst`. This would check such an `end` pattern at every line before anything else. I've looked at the code a bit and it seems possible, but I'm not familiar enough with the engine to implement this correctly (at least without trashing performance).

Details:
- `end` match - everything before `end` would be extracted and checked again with sub-patterns and merged with `end`'s tokens, while the rest of the line would be checked normally. Without a match - business as usual.
- If several `end` patterns with this option are present, they should be stored and checked from first to last, at every line, until any of them matches. There shouldn't be too many of them, since this should be used mainly with injected languages.
- `end` with the option should also match in the line where `before` was found.

Why not use `while`? While `while` (sic) is a great addition, it can only exclude a whole line. The proposed option would allow stopping mid-line.
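
The per-line behaviour described under Details could be sketched roughly as follows. This is only an illustration of the idea, not engine code: `EndFirstSplit` and `splitAtEndFirst` are hypothetical names, and a JavaScript `RegExp` stands in for the Oniguruma scanner that vscode-textmate actually uses.

```typescript
// Hypothetical shape of one line after the end-first check.
interface EndFirstSplit {
  matched: boolean;
  before: string; // would be tokenized again with the rule's sub-patterns
  end: string;    // text matched by the end pattern ("" without a match)
  rest: string;   // remainder of the line, tokenized by the parent rule
}

// Look for the end pattern before any sub-pattern is tried.
// Pass a regex without the "g" flag so no state is kept between calls.
function splitAtEndFirst(line: string, end: RegExp): EndFirstSplit {
  const m = end.exec(line);
  if (m === null) {
    // Without a match: business as usual, the whole line stays embedded.
    return { matched: false, before: line, end: "", rest: "" };
  }
  const stop = m.index + m[0].length;
  return {
    matched: true,
    before: line.slice(0, m.index),
    end: m[0],
    rest: line.slice(stop),
  };
}

// Second line of the PHP example from the discussion; the SQL string ends at the first `"`.
const phpLine = 'WHERE `id` IN (\'". implode("\',\'", $ids) ."\')";';
const split = splitAtEndFirst(phpLine, /"/);
console.log(split.before); // WHERE `id` IN ('
console.log(split.rest);   // . implode("','", $ids) ."')";
```

The unclosed `('` ends up in `before`, where the SQL sub-patterns can do whatever they like with it, while `rest` goes back to PHP. The performance concern mentioned above is not addressed by this sketch, since it scans every line once more.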