Tree-sitter - An incremental parsing system for programming tools

Last update: Jan 9, 2023

Related tags

Command-line c rust tree-sitter parsing incremental

Overview

tree-sitter

Tree-sitter is a parser generator tool and an incremental parsing library. It can build a concrete syntax tree for a source file and efficiently update the syntax tree as the source file is edited. Tree-sitter aims to be:

General enough to parse any programming language
Fast enough to parse on every keystroke in a text editor
Robust enough to provide useful results even in the presence of syntax errors
Dependency-free so that the runtime library (which is written in pure C) can be embedded in any application

Comments

[WIP] lib: Add meson build system
Introduction

This adds support for building using the meson meta-build system with ninja build backend. This makes building, installing, and using the tree-sitter library much simpler and faster. The tree-sitter library versions for the web and Rust bindings are also built using meson and ninja.

Additional build dependencies

This requires some additional build dependencies:

python3

meson

ninja

Build performance

There's a fairly significant (~70%) reduction in compile times (on my 4C/8T server) as a result of this change:

master

$ time (for i in $(seq 100); do
    ./script/build-lib
    rm libtree-sitter.a
  done)

real 8m51.558s
user 8m40.819s
sys 0m4.115s

meson-build

$ time (for i in $(seq 100); do
    CC=gcc ./script/build-lib &> /dev/null
    rm -rf build
  done)

real 2m56.688s
user 9m21.832s
sys 0m30.126s

Furthermore, incremental rebuilds should be much faster during development.

Usage

Meson can be configured to build a static library, shared library, or both libraries during the build using -Ddefault_library={shared,static,both}[1]. It is easy to modify the build type (e.g. release, debug, debugoptimized), CFLAGS (in addition to -Dc_args=), as well as install the libraries and public headers to the appropriate platform-specific location[2]. See script/build-lib for example usage as well as the quickstart guide[3] and the Meson reference manual[4] for details.

Future work

This is not yet completed and requires:

[ ] Discussion and approval of the idea

[ ] Updated documentation

[ ] Shared library: discussion of exported symbols and soname/soversion

References

[1] Meson - built-in options

[2] Meson - installing

[3] Meson - quickstart guide

[4] Meson - reference manual
opened by mkrupcale 25
Incorrect parsing behavior when NDEBUG macro is defined

Compiling tree-sitter with clang -Ofast, I see a difference in behaviour: when parsing a big Go file, it produces a different number of nodes and takes much MORE time than with -O0. This is scary, because it suggests that there is UB somewhere. Could you please investigate it?

The problematic file could be found here: https://github.com/JetBrains/jsitter/blob/master/testData/router_go
bug

opened by zajac 22
Allow swapping malloc implementation at runtime

Not being able to swap malloc implementation at runtime is blocking us from merging tree-sitter support into Emacs. It is quite reasonable to expect the system to provide tree-sitter library and matching language libraries. On the other hand, embedding tree-sitter into Emacs is both a maintenance burden and a potential risk: Emacs' tree-sitter might not match the language libraries provided by the system. So we really want to link tree-sitter dynamically, and we need to be able to change malloc at runtime to do that. Could you consider adding this feature to tree-sitter? TIA.

opened by casouri 21
Introduce the 'Tree query' - an API for pattern-matching on syntax trees
Overview

This pull request adds a new data type to the Tree-sitter C library: TSQuery. A query represents one or more patterns of nodes in a syntax tree. You can instantiate a query from a series of S-expressions (similar to those used in Tree-sitter's unit testing system). You can then execute the query on a syntax tree, which lets you efficiently iterate over all of the occurrences of the patterns in the tree. This works in C, Rust, and JavaScript (via Wasm).

Background

Many code analysis tasks involve searching for patterns in syntax trees. Some of these analysis tasks are very common, and it'd be nice to avoid implementing them multiple times. Examples of some common tasks you might want to perform with a Tree-sitter syntax tree:

Computing syntax highlighting

Computing code-folding regions

Finding nested documents to parse separately (JavaScript within HTML, Ruby within ERB, etc)

The Tree-sitter C library is used from several languages, so in order for these analyses to be reusable, they have to be specified in a way that doesn't depend on any particular high-level language's runtime.

Prior Solutions

In https://github.com/tree-sitter/tree-sitter/pull/204, I added a system called property sheets, which let you use CSS to select syntax nodes and assign them properties. It was based on a compile-time step that converted the .css files into finite state machines encoded as JSON.

In https://github.com/tree-sitter/tree-sitter/pull/283, I added a Rust implementation of syntax highlighting based on these property sheets.

But these solutions have some major drawbacks:

It's awkward having to compile the CSS files into JSON files ahead-of-time. It means that the JSON files need to be checked into the git repository (or somewhere) and applications can't easily use the feature for other purposes. Also, the generated JSON files are larger than the source CSS, which is bad for front-end use of Tree-sitter.

We could switch to processing the CSS at runtime, but the CLI currently relies on a Rust library for parsing the CSS, and a bunch of Rust code for transforming it into a DFA. It would be hard to consolidate this into the core C library, and it would bloat the library.

Most immediately-pressing - CSS has limited expressive power. It's awkward to select nodes based on details of their siblings. This turns out to be important for certain syntax highlighting tasks. For example, in this JavaScript code:

var foo = function(a) { /* ... */ }

The variable foo is normally highlighted as a function. The current tree-sitter-highlight crate can't do this, because properties of nodes can't depend on their siblings like that.

The Query Language

Instead of using CSS, compiled into a DFA ahead-of-time, the new TSQuery API uses S-expressions, compiled into an NFA at runtime. Here are some examples of what the query language currently looks like.

To select all of the methods, together with the name of the class:

(class_declaration
  name: (identifier) @the-class-name
  body: (class_body
    (method_definition
      name: (property_identifier) @the-method-name)))

To select variables to which functions or arrow functions are assigned:

(assignment_expression
  (identifier) @function-name
  (function))

(assignment_expression
  (identifier) @function-name
  (arrow_function))

To select all null checks:

(binary_expression
  left: (*) @null-checked-object
  operator: "=="
  right: (null))

The annotations that start with @, like @the-class-name are called captures. These identify which nodes should be returned from the query, and what name should be associated with them when they are returned.

With the exception of captures, the query language is identical to the format in which Tree-sitter's unit tests are written: S-expressions.
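To make the role of captures concrete, here is a minimal sketch of pattern matching over a syntax tree. It is a toy model, not the TSQuery implementation: the node and pattern shapes are hypothetical plain objects, the pattern is written as a JavaScript object rather than parsed from an S-expression, and the names matchNode and runQuery are made up for illustration.

```javascript
// Toy matcher over plain objects; NOT Tree-sitter's TSQuery implementation.
// Hypothetical node shape:    { type, field?, text?, children: [] }
// Hypothetical pattern shape: { type, field?, capture?, children?: [] }

// Returns an object of captures if `node` matches `pattern`, otherwise null.
function matchNode(node, pattern) {
  if (pattern.type !== "*" && pattern.type !== node.type) return null;
  if (pattern.field && pattern.field !== node.field) return null;

  const captures = {};
  let next = 0;
  // Child patterns must match children in order; unmatched children are skipped.
  // The search is greedy and does not backtrack.
  for (const childPattern of pattern.children ?? []) {
    let childCaptures = null;
    while (next < node.children.length && childCaptures === null) {
      childCaptures = matchNode(node.children[next], childPattern);
      next++;
    }
    if (childCaptures === null) return null;
    Object.assign(captures, childCaptures);
  }
  if (pattern.capture) captures[pattern.capture] = node;
  return captures;
}

// Collects the captures of every match anywhere in the tree.
function runQuery(root, pattern) {
  const matches = [];
  const visit = (node) => {
    const captures = matchNode(node, pattern);
    if (captures !== null) matches.push(captures);
    node.children.forEach(visit);
  };
  visit(root);
  return matches;
}

// Hand-built tree for the expression `a == null`.
const tree = {
  type: "program",
  children: [
    {
      type: "binary_expression",
      children: [
        { type: "identifier", field: "left", text: "a", children: [] },
        { type: "==", field: "operator", children: [] },
        { type: "null", field: "right", children: [] },
      ],
    },
  ],
};

// Object form of the null-check pattern shown earlier.
const nullCheck = {
  type: "binary_expression",
  children: [
    { type: "*", field: "left", capture: "null-checked-object" },
    { type: "==", field: "operator" },
    { type: "null", field: "right" },
  ],
};

const matches = runQuery(tree, nullCheck);
console.log(matches.map((m) => m["null-checked-object"].text)); // prints [ 'a' ]
```

Each match is returned as a map from capture name to node, which mirrors the idea that captures decide which nodes come back and under what name.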

Static Verification of Queries

When a query is instantiated, all of the node names and field names are transformed into integer ids, and an error is raised if any of the names are not actually defined in the grammar.
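A rough sketch of that name-to-id step follows. The tables and the resolveNames helper are hypothetical; in practice the names would come from the grammar, and the real library does this in C.

```javascript
// Hypothetical name tables; in practice these would come from the grammar.
const NODE_TYPES = ["identifier", "null", "binary_expression"];
const FIELD_NAMES = ["left", "operator", "right"];

// Maps each name to its index in `table`, failing fast on unknown names.
function resolveNames(names, table, kind) {
  return names.map((name) => {
    const id = table.indexOf(name);
    if (id === -1) throw new Error(`Unknown ${kind} name: ${name}`);
    return id;
  });
}

console.log(resolveNames(["binary_expression", "null"], NODE_TYPES, "node"));
console.log(resolveNames(["left", "right"], FIELD_NAMES, "field"));
```

Resolving names once, when the query is created, means a misspelled node or field name is reported immediately instead of silently never matching.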

Because the queries are so easy to parse, it would be easy to add an even more thorough check that uses the node-types.json to check that all of the parent-child relationships are valid, and could actually occur. I'll implement this in a follow-up PR. My plan is that reusable queries could be stored in a top-level queries folder in the grammar repositories, and the tree-sitter test command could be augmented to check the validity of the queries using the existing node type information.

Trying it Out

You can write and execute queries interactively in the web UI, both on the docs site and via the tree-sitter web-ui command in your own grammar repos.
opened by maxbrunsfeld 21
Sublime Syntax compatibility

Sublime's syntax format is way more powerful than tmLanguage and allows you to control the contexts (an example of JavaScript expressions: https://github.com/borela/naomi/blob/master/syntaxes/fjsx15/expression.sublime-syntax). Is it possible to somehow convert such a syntax into a Tree-sitter grammar?

opened by borela 21
web-tree-sitter failing while other bindings are working

Hey,

running our test suite with the new tree-sitter shows some errors. I'm still investigating, so this is just a heads-up for now.

https://github.com/elm-tooling/elm-language-server/pull/590

opened by razzeee 19
Error loading `tree-sitter.{js,wasm}` in browser

Instead of relying on Parser.Language.load() to properly load in .wasm parsers, having an option like https://github.com/ballercat/wasm-loader for webpack users would give a second, battle-hardened approach to begin using tree-sitter on the web.

Needed because I am experiencing an error where Parser.Language.load is looking in the wrong path for my .wasm file. Parser.Language.load("/tree-sitter-javascript.wasm") sends a web request for /tree-sitter.wasm. No fix is applicable.

opened by c4lliope 19
Reimplement syntax highlighting with tree queries instead of property sheets
Summary

This PR reimplements the tree-sitter-highlight crate (including its C API) and the tree-sitter highlight subcommand. Syntax highlighting is no longer based on property sheets, and instead uses the new 'tree query' API introduced in #444.

The Query Structure

The new syntax highlighting implementation is based on three query files:

queries/highlights.scm

This file contains patterns with capture names that correspond to syntax highlighting styles. Examples are @type, @function.builtin, @punctuation.bracket, etc. The names are dot-separated, and the idea is that users may want to style in a coarse-grained way (e.g. @function) or in a more fine-grained way (e.g. @function.method.builtin), and both will work.

Tie-breaking convention - In the event that two captures in this query both capture a given node with two different capture names, the first pattern is preferred.
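One plausible way for a theme to support both coarse-grained and fine-grained styling is to fall back from the full dot-separated name to successively shorter prefixes. The sketch below assumes that approach; the theme contents and the styleFor helper are hypothetical, and the crate may resolve names differently.

```javascript
// Hypothetical theme: capture names (without the leading "@") mapped to styles.
const exampleTheme = new Map([
  ["function", "blue"],
  ["function.builtin", "cyan"],
  ["type", "green"],
]);

// Tries the full name first, then drops trailing segments until a style is found.
function styleFor(captureName, theme) {
  const parts = captureName.split(".");
  while (parts.length > 0) {
    const style = theme.get(parts.join("."));
    if (style !== undefined) return style;
    parts.pop();
  }
  return null; // no style configured for this capture
}

console.log(styleFor("function.builtin", exampleTheme));        // prints cyan
console.log(styleFor("function.method.builtin", exampleTheme)); // prints blue
```

With this scheme, a theme that only defines function still styles @function.method.builtin, while a theme that defines more specific names gets the finer distinction.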

queries/injections.scm

This file contains patterns that, when matched, cause portions of the syntax tree to have their text re-parsed with a new "injected" grammar. These patterns can specify three fixed captures:

@injection.site - This is the parent node that contains all of the information regarding the injection.

@injection.content - This node will have its text re-parsed using some other grammar. If there are multiple matches with the same site but different content nodes, then all of the content nodes will be parsed together as one nested document. This allows you to parse multiple disjoint ranges of text as one injection, which is important for templating languages like ERB, in which code can be interspersed with other text.

@injection.language (optional) - This node's text will be used to determine which language to use for the injection. For example, in a Ruby HEREDOC, where the heredoc delimiter often indicates the language, you would capture the delimiter as the language.

The following predicates are recognized for patterns in this file:

(set! injection.language "the-language") - This allows you to hard-code a specific language that should be injected, instead of inferring one from the text of a captured node.

(set! injection.include-children) - By default, when an injection.content node is captured, only its own text is included in the injection, not the text of its children. This directive allows you to override that, including all of the text contained by a given node instead.
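The effect of injection.include-children, and of combining several content nodes into one nested document, can be sketched as a computation over byte ranges. The node shape ({ start, end, children }) and both function names are hypothetical, and children are assumed to be sorted and non-overlapping.

```javascript
// Returns the byte ranges of `node` that belong to the injection.
// Without includeChildren, the spans covered by child nodes are cut out.
function injectionRanges(node, includeChildren) {
  if (includeChildren) return [{ start: node.start, end: node.end }];
  const ranges = [];
  let cursor = node.start;
  for (const child of node.children) {
    if (child.start > cursor) ranges.push({ start: cursor, end: child.start });
    cursor = child.end;
  }
  if (cursor < node.end) ranges.push({ start: cursor, end: node.end });
  return ranges;
}

// Several content nodes for the same site are parsed as one document,
// so their ranges are simply concatenated in order.
function combinedRanges(contentNodes, includeChildren) {
  return contentNodes.flatMap((node) => injectionRanges(node, includeChildren));
}

const content = {
  start: 0,
  end: 20,
  children: [
    { start: 5, end: 8, children: [] },
    { start: 12, end: 15, children: [] },
  ],
};

console.log(injectionRanges(content, false)); // three ranges around the children
console.log(injectionRanges(content, true));  // one range covering the whole node
```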

queries/locals.scm

This file allows you to keep track of local variables, ensuring that they are styled the same way in every place where they occur. Patterns in this file can have these captures:

@local.scope - This indicates that the captured node introduces a new scope.

@local.definition - This indicates that the captured node represents a newly-introduced local variable.

@local.reference - This indicates that the captured node may be a reference to a local variable introduced earlier in the current scope, or some surrounding scope.
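The three captures suggest a scope stack: definitions are recorded in the innermost scope, and references search outward. The sketch below assumes the captures have already been flattened into a list of events in document order; that event format and the resolveLocals helper are hypothetical.

```javascript
// Hypothetical events in document order:
//   { kind: "scope-start" } | { kind: "scope-end" }
//   { kind: "definition", name } | { kind: "reference", name }
function resolveLocals(events) {
  const scopes = [new Map()]; // outermost scope first
  const results = [];
  let nextId = 0;
  for (const event of events) {
    if (event.kind === "scope-start") {
      scopes.push(new Map());
    } else if (event.kind === "scope-end") {
      scopes.pop();
    } else if (event.kind === "definition") {
      scopes[scopes.length - 1].set(event.name, nextId++);
    } else if (event.kind === "reference") {
      // Search the current scope first, then each surrounding scope.
      let definitionId = null;
      for (let i = scopes.length - 1; i >= 0 && definitionId === null; i--) {
        if (scopes[i].has(event.name)) definitionId = scopes[i].get(event.name);
      }
      results.push({ name: event.name, definitionId });
    }
  }
  return results;
}

const exampleEvents = [
  { kind: "definition", name: "x" }, // id 0, outer scope
  { kind: "scope-start" },
  { kind: "reference", name: "x" },  // resolves to 0
  { kind: "definition", name: "x" }, // id 1, shadows the outer x
  { kind: "reference", name: "x" },  // resolves to 1
  { kind: "scope-end" },
  { kind: "reference", name: "x" },  // resolves to 0 again
  { kind: "reference", name: "y" },  // not a local: null
];

console.log(resolveLocals(exampleEvents));
```

References that resolve to the same definition id can then be given the same style, and a reference that resolves to null is not a known local.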

Tasks

[x] Get the tests passing

[x] Update the tree-sitter-highlight readme

[x] Add queries for all of the languages which previously had property sheets

[x] Add the ability for queries to match ERROR nodes (🎩 @bfredl for raising this)

[ ] Add some docs about this on the documentation site

Example Highlighting Queries

[x] HTML https://github.com/tree-sitter/tree-sitter-html/pull/12

[x] JS https://github.com/tree-sitter/tree-sitter-javascript/pull/123

[x] Rust https://github.com/tree-sitter/tree-sitter-rust/pull/55

[x] ERB/EJS https://github.com/tree-sitter/tree-sitter-embedded-template/pull/2

[x] Go https://github.com/tree-sitter/tree-sitter-go/pull/33/files

[x] Python https://github.com/tree-sitter/tree-sitter-python/pull/54

[x] Ruby https://github.com/tree-sitter/tree-sitter-ruby/pull/110
opened by maxbrunsfeld 19
Question about files that contain multiple languages

Hey there!

I am late to the party, however I like what you’re doing here @maxbrunsfeld 👏 .

A question that crossed my mind is how tree-sitter could be useful for files that contain multiple languages. A typical example could be a foo.html file that contains HTML, JavaScript, and CSS; or a bar.js file that contains JavaScript mixed with JSX.

I really like the idea of incremental parsing, and I also realize maybe “language detection” isn’t a concern here; but I wonder how, for instance, GitHub would handle syntax highlighting using tree-sitter and detect when to “switch” to a different grammar for various parts of a file.

Any comments?

Thank you. 🙏

opened by exalted 19
Introduce an EXCLUDE rule
Background

Tree-sitter uses context-aware tokenization - in a given parse state, Tree-sitter only recognizes tokens that are syntactically valid in that state. This is what allows Tree-sitter to tokenize languages correctly without requiring the grammar author to think about different lexer modes and states. In general, Tree-sitter tends to be permissive in allowing words that are keywords in some places to be used freely as names in other places.

Sometimes this permissiveness causes unexpected error recoveries. Consider this C syntax error:

float // <-- error

int main() {}

Currently, when tree-sitter-c encounters this code, it doesn't detect an error until the word main, because it interprets the word int as a variable, declared with type float. It doesn't see int as a keyword, because the keyword int wouldn't be allowed in that position.

Solution

In order to improve this error recovery, the grammar author needs a way to explicitly indicate that certain keywords are not allowed in certain places. For example in C, primitive types like int and control-flow keywords like while are not allowed as variable names in declarators.

This PR introduces a new EXCLUDE rule to the underlying JSON schema. From JavaScript, you can use it like this:

declarator: choice(
  $.pointer_declarator,
  $.array_declarator,
  $.identifier.exclude('if', 'int', ...etc)
)

Conceptually, you're saying "a declarator can match an identifier, but not these other tokens".

Implementation

Internally, all Tree-sitter needs to do is to insert the excluded tokens (if, int, etc) into the set of valid lookahead symbols that it uses when tokenizing, in the relevant states. Then, when the lexer sees the string "if", it will recognize it as an if token, not just an identifier. Then, as always when there's an error, the parser will find that there are no valid parse actions for the token if.
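The mechanism can be illustrated with a toy lexer that only recognizes a keyword when that keyword is in the current state's set of valid lookahead tokens. Everything here is hypothetical (the keyword list, the state sets, and lexWord); it models the idea rather than Tree-sitter's generated lexer.

```javascript
const KEYWORDS = new Set(["if", "int", "while", "float"]);

// Context-aware tokenization: a word is a keyword only if the current
// parse state lists that keyword as a valid lookahead token.
function lexWord(word, validTokens) {
  if (KEYWORDS.has(word) && validTokens.has(word)) return word;
  return validTokens.has("identifier") ? "identifier" : "ERROR";
}

// A declarator state in which only an identifier is syntactically valid.
const declaratorState = new Set(["identifier"]);

// The same state once the excluded keywords are added to the lexer's
// lookahead set. The parser still has no action for them.
const declaratorStateWithExclude = new Set(["identifier", "if", "int"]);

console.log(lexWord("int", declaratorState));            // prints identifier
console.log(lexWord("int", declaratorStateWithExclude)); // prints int
```

In the second case the lexer yields an int token, the parser finds no valid action for it in that state, and the error is reported at int rather than later at main.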

Alternatives

I could have instead introduced a new field on the entire grammar called keywords. Then, if we added if to the grammar's keywords, the word if would always be treated as its own token, in every parse state.

This is less general, and it wouldn't really work AFAICT. Even in C, there are states where the word int should not be treated as a keyword: for example, inside of a string ("int"), or as the name of a macro definition (#define int). And in other languages, there are many more cases like this than there are in C. For example in JavaScript, it's fine to have an object property named if.

Relevant Issues

https://github.com/atom/language-c/issues/308
opened by maxbrunsfeld 18
tree-sitter/emscripten mismatch?
I'm seeing an issue that stems from loadWebAssemblyModule (apparently provided by emscripten's support.js) reading from my Wasm parser and calling fetchBinary with paths/URLs that don't make sense.

This currently looks something like the following:

fetch url:
fetch url:
fetch url: nv tableBase
fetch url: ree_sitter_mylang
fetch url: @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
fetch url: @

where:

global.fetch = (url, options) => { console.log("fetch url:", url); ... };

I was able to learn some things by taking my node_modules/web-tree-sitter/tree-sitter.js file and running it through prettier.

Specifically, neededDynLibs is the array of strings shown in the log output above.

I'll try to provide as much info as possible on my setup.

I'm essentially trying to use tree-sitter in a Node.js environment, so while the following helps with that:

https://github.com/tree-sitter/tree-sitter/blob/85877a1def714ed2c7c1300cef72ca6612248998/lib/binding_web/binding.js#L573-L581

Emscripten still uses fetch, which isn't provided by default on Node.js so I'm essentially doing global.fetch = ..., checking for the file: URL scheme, and employing a similar workaround.
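For readers unfamiliar with this kind of workaround, a minimal sketch of a fetch shim that serves file: URLs from disk might look like the following. The names localPathFor and fileFetch are hypothetical, the response object implements only the small subset of the fetch Response interface shown here, and which parts of that interface Emscripten actually needs is an assumption.

```javascript
const fs = require("fs");
const { fileURLToPath } = require("url");

// Converts a file: URL into a local path and rejects every other scheme.
function localPathFor(url) {
  const parsed = new URL(url);
  if (parsed.protocol !== "file:") {
    throw new Error(`Unsupported URL scheme: ${parsed.protocol}`);
  }
  return fileURLToPath(parsed);
}

// Hypothetical stand-in for fetch that reads the file from disk.
async function fileFetch(url) {
  const bytes = await fs.promises.readFile(localPathFor(url));
  return {
    ok: true,
    status: 200,
    // Copy out exactly the bytes of this file; the Buffer may share a pool.
    arrayBuffer: async () =>
      bytes.buffer.slice(bytes.byteOffset, bytes.byteOffset + bytes.byteLength),
  };
}

// Opt in explicitly where fetch is missing:
// global.fetch = fileFetch;
```

Note that relative or empty URLs, like those in the log output above, make new URL throw, so a shim like this also surfaces malformed requests early.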

I don't think this is affecting anything, but wanted to point it out.

I'm using the release assets of tree-sitter 0.15.2 but my Wasm parser was built using tree-sitter build-wasm. The build-wasm command apparently also uses emscripten, and I'm concerned that the version of emscripten's support.js included in tree-sitter.js is incompatible with the version of emscripten I used when invoking tree-sitter build-wasm.

I'm using Nix so can't rely on script/fetch-emscripten (or rather haven't been able to figure out how to make that work as a fixed-output derivation yet) but based on the Travis CI build for 0.15.2 I'm using emscripten 1.38.31.

I've tried different versions of emscripten and have seen different log output, which makes me think this is the issue.

For example, if I use 1.38.32 I see:

fetch url:
fetch url:
fetch url: nv tableBase
fetch url: _mylang
fetch url: @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
fetch url: @
bug
opened by paulyoung 17
Automatic compilation via subcommands and documentation
I've been puzzled from time to time about "automatic compilation" and when it might be triggered.

There is a section in the docs about Automatic Compilation which currently reads:

You might notice that the first time you run tree-sitter test after regenerating your parser, it takes some extra time. This is because Tree-sitter automatically compiles your C code into a dynamically-loadable library. It recompiles your parser as-needed whenever you update it by re-running tree-sitter generate.

Indeed, I have observed running tree-sitter test after tree-sitter generate leading to compilation (easily observable evidence is the appearance of a replacement .so file), but automatic compilation seems to occur under other circumstances as well.

From some empirical tests and reading of source, the following are subcommands I think can trigger compilation:

generate -- in 2023-01 --build was added; if it's present, yes

highlight

query

tags

test

Does it seem worth mentioning in the docs that other subcommands can trigger automatic compilation?

It's a feature which can be convenient but I suspect it has contributed to confusion at times.
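For intuition about when such recompilation could be triggered, one common approach is to compare modification times of the compiled library and its sources. The sketch below assumes that approach. It is not taken from the CLI's source, and needsRecompile is a hypothetical helper.

```javascript
const fs = require("fs");

// Returns true when the library is missing or older than any source file.
function needsRecompile(libraryPath, sourcePaths) {
  if (!fs.existsSync(libraryPath)) return true;
  const libraryTime = fs.statSync(libraryPath).mtimeMs;
  return sourcePaths.some(
    (sourcePath) => fs.statSync(sourcePath).mtimeMs > libraryTime
  );
}
```

Under this model, any subcommand that loads the parser and runs a check like this first would recompile after tree-sitter generate rewrites the sources, which would be consistent with the list of subcommands above.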
opened by sogaiu 5
Incorrect result of ts_node_first_child_for_byte
There are 2 reports of ts_node_first_child_for_byte returning what appears to be an incorrect result:

https://github.com/tree-sitter/tree-sitter-bash/issues/139

https://github.com/sogaiu/tree-sitter-clojure/issues/32

I investigated both issues and posted a writeup to the former issue explaining my findings. ATM, I think ts_node_first_child_for_byte doesn't handle certain specific cases.

I went searching for use of ts_node_first_child_for_byte, but apart from Emacs 29+, I did not find any use. Does anyone know of other uses of this function?
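As a reference point for comparing results, here is a small model of what the function might be expected to return, assuming it is meant to yield the first child whose end lies past the given byte offset. That reading is an assumption, the child shape ({ type, start, end }) is hypothetical, and this is not the C implementation.

```javascript
// Reference model: the first child (in order) that extends beyond `byte`.
function firstChildForByte(children, byte) {
  for (const child of children) {
    if (child.end > byte) return child;
  }
  return null; // no child extends past the offset
}

const exampleChildren = [
  { type: "a", start: 0, end: 3 },
  { type: "b", start: 3, end: 3 }, // zero-width child
  { type: "c", start: 4, end: 9 },
];

console.log(firstChildForByte(exampleChildren, 0)); // child "a"
console.log(firstChildForByte(exampleChildren, 3)); // child "c"
console.log(firstChildForByte(exampleChildren, 9)); // null
```

A model like this makes it easy to check edge cases, such as zero-width children or offsets that fall exactly on a child boundary, against what the library returns.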

cc @casouri @dannyfreeman
bug c-library
opened by sogaiu 0
Allow TextProvider's iterators to generate owned text

This is just #1294 rebased on today's master branch. With the help of https://github.com/emacs-tree-sitter/elisp-tree-sitter/pull/249 thrown on top, I was able to get the test suite of tree-sitter-langs to pass with home-built language DLLs.

opened by domq 1

node version of CLI fails on my M1 MacBook Pro

Unfortunately, I was not able to reproduce this on other similar machines, so it may be unique to mine. On my machine, running npm install --save-dev tree-sitter-cli followed by ./node_modules/.bin/tree-sitter results in the following error:

node:internal/child_process:413
    throw errnoException(err, 'spawn');
    ^

Error: spawn Unknown system error -86
    at ChildProcess.spawn (node:internal/child_process:413:11)
    at spawn (node:child_process:743:9)
    at Object.<anonymous> (~/tree-sitter-hello/node_modules/tree-sitter-cli/cli.js:8:1)
    at Module._compile (node:internal/modules/cjs/loader:1119:14)
    at Module._extensions..js (node:internal/modules/cjs/loader:1173:10)
    at Module.load (node:internal/modules/cjs/loader:997:32)
    at Module._load (node:internal/modules/cjs/loader:838:12)
    at Function.executeUserEntryPoint [as runMain] (node:internal/modules/run_main:81:12)
    at node:internal/main/run_main_module:18:47 {
  errno: -86,
  code: 'Unknown system error -86',
  syscall: 'spawn'
}

The expected output is a help menu from the CLI. Other commands produce the same output.

opened by ilevyor 1

Most recent cargo tree-sitter-cli version 0.20.7 breaks node usage.

To reproduce, run cargo install tree-sitter-cli and follow the instructions for creating a language here: https://tree-sitter.github.io/tree-sitter/creating-parsers

then in a sister directory init a node project with the following in package.json

{
  "name": "simple",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "author": "",
  "license": "ISC",
  "dependencies": {
    "tree-sitter": "^0.20.1",
    "tree-sitter-hello": "file:../tree-sitter-hello/"
  }
}

and the following in index.js

const Parser = require('tree-sitter');
const Grit = require('tree-sitter-hello');

const parser = new Parser();
parser.setLanguage(Grit);

const sourceCode = 'hello';
const tree = parser.parse(sourceCode);
console.log(tree.rootNode.toString());

running node index.js will result in the following error:

~/simple/node_modules/tree-sitter/index.js:259
  setLanguage.call(this, language);
              ^

RangeError: Incompatible language version. Compatible range: 13 - 13. Got: 14

The language version can be found on line 8 of parser.c: #define LANGUAGE_VERSION 14. Using 0.20.6 of the Rust CLI resolves the issue.
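The failure can be understood as a simple range check on the version number embedded in the generated parser. The sketch below reproduces that check outside the binding; the bounds are copied from the error message above, and readLanguageVersion and checkLanguageVersion are hypothetical helpers, not part of the tree-sitter package.

```javascript
// Bounds copied from the error message above; treat them as illustrative.
const MIN_COMPATIBLE_VERSION = 13;
const MAX_COMPATIBLE_VERSION = 13;

// Extracts the LANGUAGE_VERSION define from the text of a generated parser.c.
function readLanguageVersion(parserSource) {
  const match = /^#define LANGUAGE_VERSION (\d+)/m.exec(parserSource);
  return match ? Number(match[1]) : null;
}

// Mirrors the shape of the binding's error for an out-of-range version.
function checkLanguageVersion(version) {
  if (version < MIN_COMPATIBLE_VERSION || version > MAX_COMPATIBLE_VERSION) {
    throw new RangeError(
      `Incompatible language version. Compatible range: ` +
        `${MIN_COMPATIBLE_VERSION} - ${MAX_COMPATIBLE_VERSION}. Got: ${version}`
    );
  }
}

const exampleSource = "#include <tree_sitter/parser.h>\n#define LANGUAGE_VERSION 14\n";
console.log(readLanguageVersion(exampleSource)); // prints 14
```

Running a check like this against parser.c before loading the language makes it clear whether the CLI that generated the parser is newer than the runtime binding supports.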

opened by ilevyor 1