Background
Tree-sitter uses context-aware tokenization - in a given parse state, Tree-sitter only recognizes tokens that are syntactically valid in that state. This is what allows Tree-sitter to tokenize languages correctly without requiring the grammar author to think about different lexer modes and states. In general, Tree-sitter tends to be permissive in allowing words that are keywords in some places to be used freely as names in other places.
Sometimes this permissiveness causes unexpected error recoveries. Consider this C syntax error:
float // <-- error
int main() {}
Currently, when tree-sitter-c encounters this code, it doesn't detect an error until the word main, because it interprets the word int as a variable, declared with type float. It doesn't see int as a keyword, because the keyword int wouldn't be allowed in that position.
Solution
In order to improve this error recovery, the grammar author needs a way to explicitly indicate that certain keywords are not allowed in certain places. For example, in C, primitive types like int and control-flow keywords like while are not allowed as variable names in declarators.
This PR introduces a new EXCLUDE rule to the underlying JSON schema. From JavaScript, you can use it like this:
declarator: choice(
  $.pointer_declarator,
  $.array_declarator,
  $.identifier.exclude('if', 'int', ...etc)
)
Conceptually, you're saying "a declarator can match an identifier, but not these other tokens".
Implementation
Internally, all Tree-sitter needs to do is to insert the excluded tokens (if, int, etc) into the set of valid lookahead symbols that it uses when tokenizing, in the relevant states. Then, when the lexer sees the string "if", it will recognize it as an if token, not just an identifier. Then, as always when there's an error, the parser will find that there are no valid parse actions for the token if.
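To make the mechanism concrete, here is a rough toy model of it in plain JavaScript. This is not Tree-sitter's actual code: lexWord, step, and the hard-coded sets are all hypothetical names I made up for illustration, and a single "after a type specifier" state stands in for the real parse table.

```javascript
// Toy model only: every name below is hypothetical, not part of Tree-sitter.
const KEYWORDS = new Set(['if', 'int', 'float', 'while']);

// Lex one word, given the token types the lexer may produce in this state.
function lexWord(word, validTokens) {
  // A keyword is recognized only if the current state lists it as valid lookahead.
  if (KEYWORDS.has(word) && validTokens.has(word)) return word;
  if (/^[A-Za-z_]\w*$/.test(word) && validTokens.has('identifier')) return 'identifier';
  return 'ERROR';
}

// State right after a type specifier: the parser only has an action for identifier.
const parseActions = new Set(['identifier']);
const excluded = ['if', 'int'];

// Without EXCLUDE, the lexer's valid set is just the tokens that have parse actions.
const lookaheadWithout = new Set(parseActions);
// With EXCLUDE, the excluded tokens are added to the lexer's valid set only.
const lookaheadWith = new Set([...parseActions, ...excluded]);

// Lex a word, then ask whether the parser has an action for the resulting token.
function step(word, lookahead) {
  const token = lexWord(word, lookahead);
  return { token, accepted: parseActions.has(token) };
}

const without = step('int', lookaheadWithout);
const withExclude = step('int', lookaheadWith);
console.log(without);     // { token: 'identifier', accepted: true }
console.log(withExclude); // { token: 'int', accepted: false }
```

In the first case int is silently accepted as a variable name, which is the late error from the Background section. In the second case the lexer produces an int token, the parser has no action for it, and the error is reported at int itself.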
Alternatives
I could have instead introduced a new field on the entire grammar called keywords. Then, if we added if to the grammar's keywords, the word if would always be treated as its own token, in every parse state.
This is less general, and it wouldn't really work AFAICT. Even in C, there are states where the word int should not be treated as a keyword: for example, inside a string ("int") or as the name of a macro definition (#define int). And in other languages, there are many more cases like this than there are in C. For example, in JavaScript, it's fine to have an object property named if.
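The difference between the two approaches can be sketched with another toy model. Again, this is only an illustration under assumptions: the state names declarator and property_name, and the helpers lexGlobal and lexPerState, are hypothetical and don't correspond to anything in Tree-sitter.

```javascript
// Toy comparison only: state names and helpers are hypothetical.
const globalKeywords = new Set(['if', 'int']);

// Per-state exclusions, as EXCLUDE would give us: only declarators reject these words.
const excludedByState = {
  declarator: new Set(['if', 'int']),
  property_name: new Set(), // e.g. a JavaScript object property; `if` is a fine name here
};

// Global keywords: the word is always its own token, whatever the state.
function lexGlobal(word) {
  return globalKeywords.has(word) ? word : 'identifier';
}

// Per-state exclusion: the word becomes a keyword token only where it is excluded.
function lexPerState(word, state) {
  return excludedByState[state].has(word) ? word : 'identifier';
}

const results = {
  globalProperty: lexGlobal('if'),                      // 'if' (breaks property names)
  perStateProperty: lexPerState('if', 'property_name'), // 'identifier'
  perStateDeclarator: lexPerState('if', 'declarator'),  // 'if'
};
console.log(results);
```

With the global list, if can never be lexed as a plain name, so the property-name case fails. With per-state exclusion, the same word is rejected in declarators and still accepted as a property name.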
Relevant Issues
https://github.com/atom/language-c/issues/308