Tree-sitter - An incremental parsing system for programming tools




Tree-sitter is a parser generator tool and an incremental parsing library. It can build a concrete syntax tree for a source file and efficiently update the syntax tree as the source file is edited. Tree-sitter aims to be:

  • General enough to parse any programming language
  • Fast enough to parse on every keystroke in a text editor
  • Robust enough to provide useful results even in the presence of syntax errors
  • Dependency-free so that the runtime library (which is written in pure C) can be embedded in any application


  • [WIP] lib: Add meson build system


    This adds support for building using the meson meta-build system with ninja build backend. This makes building, installing, and using the tree-sitter library much simpler and faster. The tree-sitter library versions for the web and Rust bindings are also built using meson and ninja.

    Additional build dependencies

    This requires some additional build dependencies:

    • python3
    • meson
    • ninja

    Build performance

    There's a fairly significant (~70%) reduction in compile times (on my 4C/8T server) as a result of this change:


    $ time (for i in $(seq 100); do
      rm libtree-sitter.a
    done)
    real    8m51.558s
    user    8m40.819s
    sys     0m4.115s


    $ time (for i in $(seq 100); do
      CC=gcc ./script/build-lib &> /dev/null
      rm -rf build
    done)
    real    2m56.688s
    user    9m21.832s
    sys     0m30.126s

    Furthermore, incremental rebuilds should be much faster during development.


    Meson can be configured to build a static library, shared library, or both libraries during the build using -Ddefault_library={shared,static,both}[1]. It is easy to modify the build type (e.g. release, debug, debugoptimized), CFLAGS (in addition to -Dc_args=), as well as install the libraries and public headers to the appropriate platform-specific location[2]. See script/build-lib for example usage as well as the quickstart guide[3] and the Meson reference manual[4] for details.

    Future work

    This is not yet completed and requires:

    • [ ] Discussion and approval of the idea
    • [ ] Updated documentation
    • [ ] Shared library: discussion of exported symbols and soname/soversion


    1. Meson - built-in options
    2. Meson - installing
    3. Meson - quickstart guide
    4. Meson - reference manual
    opened by mkrupcale 25
  • Incorrect parsing behavior when NDEBUG macro is defined

    Compiling tree-sitter with clang -Ofast, I see a difference in behaviour: when parsing a big Go file, it produces a different number of nodes and consumes much MORE time than with -O0. This is scary, because it suggests that there is UB somewhere. Could you please investigate it?

    The problematic file could be found here:

    opened by zajac 22
  • Allow swapping malloc implementation at runtime

    Not being able to swap malloc implementation at runtime is blocking us from merging tree-sitter support into Emacs. It is quite reasonable to expect the system to provide tree-sitter library and matching language libraries. On the other hand, embedding tree-sitter into Emacs is both a maintenance burden and a potential risk: Emacs' tree-sitter might not match the language libraries provided by the system. So we really want to link tree-sitter dynamically, and we need to be able to change malloc at runtime to do that. Could you consider adding this feature to tree-sitter? TIA.

    opened by casouri 21
  • Introduce the 'Tree query' - an API for pattern-matching on syntax trees


    This pull request adds a new data type to the Tree-sitter C library: TSQuery. A query represents one or more patterns of nodes in a syntax tree. You can instantiate a query from a series of S-expressions (similar to those used in Tree-sitter's unit testing system). You can then execute the query on a syntax tree, which lets you efficiently iterate over all of the occurrences of the patterns in the tree. This works in C, Rust, and JavaScript (via Wasm).


    Many code analysis tasks involve searching for patterns in syntax trees. Some of these analysis tasks are very common, and it'd be nice to avoid implementing them multiple times. Examples of some common tasks you might want to perform with a Tree-sitter syntax tree:

    • Computing syntax highlighting
    • Computing code-folding regions
    • Finding nested documents to parse separately (JavaScript within HTML, Ruby within ERB, etc)

    The Tree-sitter C library is used from several languages, so in order for these analyses to be reusable, they have to be specified in a way that doesn't depend on any particular high-level language's runtime.

    Prior Solutions

    • In an earlier pull request, I added a system called property sheets that lets you use CSS to select syntax nodes and assign them properties. It was based on a compile-time step that converted the .css files into finite state machines encoded as JSON.
    • In another pull request, I added a Rust implementation of syntax highlighting based on these property sheets.

    But these solutions have some major drawbacks:

    1. It's awkward having to compile the CSS files into JSON files ahead of time. It means that the JSON files need to be checked into the git repository (or somewhere) and applications can't easily use the feature for other purposes. Also, the generated JSON files are larger than the source CSS, which is bad for front-end use of Tree-sitter.

    2. We could switch to processing the CSS at runtime, but the CLI currently relies on a Rust library for parsing the CSS, and a bunch of Rust code for transforming it into a DFA. It would be hard to consolidate this into the core C library, and it would bloat the library.

    3. Most immediately-pressing - CSS has limited expressive power. It's awkward to select nodes based on details of their siblings. This turns out to be important for certain syntax highlighting tasks. For example, in this JavaScript code:

      var foo = function(a) { /* ... */ }

      The variable foo is normally highlighted as a function. The current tree-sitter-highlight crate can't do this, because properties of nodes can't depend on their siblings like that.

    The Query Language

    Instead of using CSS, compiled into a DFA ahead-of-time, the new TSQuery API uses S-expressions, compiled into an NFA at runtime. Here are some examples of what the query language currently looks like.

    To select all of the methods, together with the name of the class:

      (class_declaration
        name: (identifier) @the-class-name
        body: (class_body
          (method_definition
            name: (property_identifier) @the-method-name)))

    To select variables to which functions or arrow functions are assigned:

      (identifier) @function-name
      (identifier) @function-name

    To select all null checks:

      (binary_expression
        left: (*) @null-checked-object
        operator: "=="
        right: (null))

    The annotations that start with @, like @the-class-name, are called captures. They identify which nodes should be returned from the query, and what name should be associated with each node when it is returned.

    With the exception of captures, the query language is identical to the format in which Tree-sitter's unit tests are written: S-expressions.
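To make the shape of the query language concrete, here is a minimal sketch of reading one S-expression pattern into a nested structure and listing its captures. It covers only the parsing front end, not the NFA that the query is compiled into, and it is not Tree-sitter's implementation. The helpers tokenize, parseNode, parseQuery, and listCaptures are hypothetical names, and the example pattern is written out in full with an assumed binary_expression wrapper.

```javascript
// Split a pattern into parens, quoted strings, and bare words
// (node names, `field:` labels, `@capture` names).
function tokenize(source) {
  return source.match(/\(|\)|"[^"]*"|[^\s()"]+/g) || [];
}

// Parse one parenthesized node pattern starting at tokens[pos].
// Returns [node, positionAfterClosingParen].
function parseNode(tokens, pos) {
  if (tokens[pos] !== "(") {
    throw new SyntaxError(`Expected "(" at token ${pos}`);
  }
  const node = { type: tokens[pos + 1], field: null, capture: null, children: [] };
  pos += 2;
  let pendingField = null;
  while (pos < tokens.length && tokens[pos] !== ")") {
    const tok = tokens[pos];
    if (tok.endsWith(":")) {
      pendingField = tok.slice(0, -1); // label for the next child
      pos += 1;
    } else if (tok.startsWith("@")) {
      // A capture attaches to the child pattern that precedes it.
      const last = node.children[node.children.length - 1];
      if (!last) throw new SyntaxError(`Capture ${tok} has no node to attach to`);
      last.capture = tok.slice(1);
      pos += 1;
    } else if (tok === "(") {
      const [child, next] = parseNode(tokens, pos);
      child.field = pendingField;
      pendingField = null;
      node.children.push(child);
      pos = next;
    } else {
      // Anything else (e.g. a quoted anonymous token such as "==") is a leaf.
      node.children.push({ type: tok, field: pendingField, capture: null, children: [] });
      pendingField = null;
      pos += 1;
    }
  }
  if (pos >= tokens.length) throw new SyntaxError("Missing closing paren");
  return [node, pos + 1];
}

// Parse a single top-level pattern, allowing a trailing capture on it.
function parseQuery(source) {
  const tokens = tokenize(source);
  const [node, next] = parseNode(tokens, 0);
  if (tokens[next] && tokens[next].startsWith("@")) node.capture = tokens[next].slice(1);
  return node;
}

// Collect every capture in the pattern, depth first.
function listCaptures(node, out = []) {
  if (node.capture) out.push({ capture: node.capture, type: node.type, field: node.field });
  node.children.forEach((child) => listCaptures(child, out));
  return out;
}

// Assumed full form of the null-check pattern.
const pattern = '(binary_expression left: (*) @null-checked-object operator: "==" right: (null))';
const tree = parseQuery(pattern);
const captures = listCaptures(tree);
console.log(captures);
```

The parser keeps field labels and captures as plain properties on each node, which is enough to see how a capture name ends up associated with a specific position in the pattern.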

    Static Verification of Queries

    When a query is instantiated, all of the node names and field names are transformed into integer ids, and an error is raised if any of the names are not actually defined in the grammar.
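As a rough illustration of this name-resolution step, the sketch below maps node and field names to integer ids and throws on the first name that is not defined. The tables NODE_TYPES and FIELD_NAMES and the resolveNames helper are hypothetical stand-ins for what a generated grammar provides; they are not part of the Tree-sitter API.

```javascript
// Hypothetical symbol tables standing in for what a generated grammar defines.
const NODE_TYPES = ["identifier", "class_body", "property_identifier", "null"];
const FIELD_NAMES = ["name", "body", "left", "operator", "right"];

// Build name -> integer id lookups once, up front.
const nodeIds = new Map(NODE_TYPES.map((name, id) => [name, id]));
const fieldIds = new Map(FIELD_NAMES.map((name, id) => [name, id]));

// Resolve every name used by a pattern to its id, or throw on the first
// name the grammar does not define.
function resolveNames(pattern) {
  const lookup = (table, kind, name) => {
    if (!table.has(name)) {
      throw new Error(`Unknown ${kind} name: ${name}`);
    }
    return table.get(name);
  };
  return {
    nodes: pattern.nodes.map((n) => lookup(nodeIds, "node", n)),
    fields: pattern.fields.map((f) => lookup(fieldIds, "field", f)),
  };
}

// A valid pattern resolves to ids instead of strings.
const resolved = resolveNames({ nodes: ["identifier", "null"], fields: ["left", "right"] });
console.log(resolved);

// A misspelled node name is rejected when the query is created.
let failure = "";
try {
  resolveNames({ nodes: ["identifer"], fields: [] });
} catch (err) {
  failure = err.message;
}
console.log(failure);
```

Failing at instantiation time means a typo in a query surfaces immediately rather than as a pattern that silently never matches.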

    Because the queries are so easy to parse, it would be easy to add an even more thorough check that uses the node-types.json to check that all of the parent-child relationships are valid, and could actually occur. I'll implement this in a follow-up PR. My plan is that reusable queries could be stored in a top-level queries folder in the grammar repositories, and the tree-sitter test command could be augmented to check the validity of the queries using the existing node type information.

    Trying it Out

    You can write and execute queries interactively in the web UI, both on the docs site and via the tree-sitter web-ui command in your own grammar repos.


    opened by maxbrunsfeld 21
  • Sublime Syntax compatibility

    Sublime's syntax format is far more powerful than tmLanguage and lets you control the contexts (for example, for JavaScript expressions). Is it possible to somehow convert the syntax into a Tree-sitter grammar?

    opened by borela 21
  • web-tree-sitter failing while other bindings are working


    Running our test suite with the new tree-sitter shows some errors. I'm still investigating, so this is just a heads-up for now.

    opened by razzeee 19
  • Error loading `tree-sitter.{js,wasm}` in browser

    Instead of relying on Parser.Language.load() to properly load in .wasm parsers, having an option like for webpack users would give a second, battle-hardened approach to begin using tree-sitter on the web.

    Needed because I am experiencing an error where Parser.Language.load is looking in the wrong path for my .wasm file. Parser.Language.load("/tree-sitter-javascript.wasm") sends a web request for /tree-sitter.wasm. No fix is applicable.

    opened by c4lliope 19
  • Reimplement syntax highlighting with tree queries instead of property sheets


    This PR reimplements the tree-sitter-highlight crate (including its C API) and the tree-sitter highlight subcommand. Syntax highlighting is no longer based on property sheets, and instead uses the new 'tree query' API introduced in #444.

    The Query Structure

    The new syntax highlighting implementation is based on three query files:

    1. queries/highlights.scm

      This file contains patterns with capture names that correspond to syntax highlighting styles. Examples are @type, @function.builtin, @punctuation.bracket, etc. The names are dot-separated, and the idea is that users may want to style in a coarse-grained way (e.g. @function) or in a more fine-grained way (e.g. @function.method.builtin), and both will work.

      Tie-breaking convention - In the event that two captures in this query both capture a given node with two different capture names, the first pattern is preferred.

    2. queries/injections.scm

      This file contains patterns that, when matched, cause portions of the syntax tree to have their text re-parsed with a new "injected" grammar. These patterns can specify three fixed captures:

      • @injection.site - This is the parent node that contains all of the information regarding the injection.
      • @injection.content - This node will have its text re-parsed using some other grammar. If there are multiple matches with the same site but different content nodes, then all of the content nodes will be parsed together as one nested document. This allows you to parse multiple disjoint ranges of text as one injection, which is important for templating languages like ERB, in which code can be interspersed with other text.
      • @injection.language (optional) - This node's text will be used to determine which language to use for the injection. For example, in a Ruby HEREDOC, where the heredoc delimiter often indicates the language, you would capture the delimiter as the language.

      The following predicates are recognized for patterns in this file:

      • (set! injection.language "the-language") - This allows you to hard-code a specific language that should be injected, instead of inferring one from the text of a captured node.
      • (set! injection.include-children) - By default, when an injection.content node is captured, only its own text is included in the injection, not the text of its children. This directive allows you to override that, including all of the text contained by a given node instead.
    3. queries/locals.scm

      This file allows you to keep track of local variables, ensuring that they are styled the same way in every place where they occur. Patterns in this file can have these captures:

      • @local.scope - This indicates that the captured node introduces a new scope.
      • @local.definition - This indicates that the captured node represents a newly-introduced local variable.
      • @local.reference - This indicates that the captured node may be a reference to a local variable introduced earlier in the current scope, or some surrounding scope.
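The dot-separated capture names and the tie-breaking convention described for queries/highlights.scm can be sketched as follows. This assumes a simple longest-prefix fallback, which may differ from how the tree-sitter-highlight crate actually matches names. THEME, resolveStyle, and pickCapture are hypothetical names used only for illustration.

```javascript
// Style names a hypothetical theme knows about.
const THEME = {
  "function": "blue",
  "function.builtin": "cyan",
  "type": "green",
};

// Resolve a dot-separated capture name to a style by trying the full name
// first, then dropping trailing segments one at a time.
// Assumption: longest-prefix fallback; the real crate may match differently.
function resolveStyle(captureName, theme) {
  const parts = captureName.split(".");
  while (parts.length > 0) {
    const candidate = parts.join(".");
    if (Object.prototype.hasOwnProperty.call(theme, candidate)) {
      return theme[candidate];
    }
    parts.pop();
  }
  return null; // no style known for this capture
}

// Tie-breaking: when several patterns capture the same node, keep the
// capture from the earliest pattern (lowest pattern index).
function pickCapture(captures) {
  return captures.reduce((best, c) => (c.patternIndex < best.patternIndex ? c : best));
}

console.log(resolveStyle("function.builtin", THEME));        // exact match
console.log(resolveStyle("function.method.builtin", THEME)); // falls back to "function"
console.log(resolveStyle("punctuation.bracket", THEME));     // unknown to this theme

const winner = pickCapture([
  { name: "variable", patternIndex: 3 },
  { name: "function", patternIndex: 1 },
]);
console.log(winner.name);
```

With this kind of fallback, a theme that only defines @function still styles nodes captured as @function.method.builtin, which is the coarse-versus-fine behaviour described above.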


    • [x] Get the tests passing
    • [x] Update the tree-sitter-highlight readme
    • [x] Add queries for all of the languages which previously had property sheets
    • [x] Add the ability for queries to match ERROR nodes (🎩 @bfredl for raising this)
    • [ ] Add some docs about this on the documentation site

    Example Highlighting Queries

    • [x] HTML
    • [x] JS
    • [x] Rust
    • [x] ERB/EJS
    • [x] Go
    • [x] Python
    • [x] Ruby
    opened by maxbrunsfeld 19
  • Question about files that contain multiple languages

    Hey there!

    I am late to the party, however I like what you’re doing here @maxbrunsfeld 👏 .

    A question that crossed my mind is how tree-sitter could be useful for files that contain multiple languages. A typical example could be a foo.html file that contains HTML, JavaScript and CSS; or a bar.js file that contains JavaScript mixed with JSX.

    I am really liking the idea of incremental parsing, and I also realize maybe “language detection” isn’t a concern here; but I wonder how, for instance, GitHub would handle syntax highlighting using tree-sitter and detect when to “switch” to a different grammar for various parts of a file.

    Any comments?

    Thank you. 🙏

    opened by exalted 19
  • Introduce an EXCLUDE rule


    Tree-sitter uses context-aware tokenization - in a given parse state, Tree-sitter only recognizes tokens that are syntactically valid in that state. This is what allows Tree-sitter to tokenize languages correctly without requiring the grammar author to think about different lexer modes and states. In general, Tree-sitter tends to be permissive in allowing words that are keywords in some places to be used freely as names in other places.

    Sometimes this permissiveness causes unexpected error recoveries. Consider this C syntax error:

    float // <-- error
    int main() {}

    Currently, when tree-sitter-c encounters this code, it doesn't detect an error until the word main, because it interprets the word int as a variable, declared with type float. It doesn't see int as a keyword, because the keyword int wouldn't be allowed in that position.


    In order to improve this error recovery, the grammar author needs a way to explicitly indicate that certain keywords are not allowed in certain places. For example in C, primitive types like int and control-flow keywords like while are not allowed as variable names in declarators.

    This PR introduces a new EXCLUDE rule to the underlying JSON schema. From JavaScript, you can use it like this:

    declarator: choice(
      $.identifier.exclude('if', 'int', ...etc)
    )

    Conceptually, you're saying "a declarator can match an identifier, but not these other tokens".


    Internally, all Tree-sitter needs to do is to insert the excluded tokens (if, int, etc) into the set of valid lookahead symbols that it uses when tokenizing, in the relevant states. Then, when the lexer sees the string "if", it will recognize it as an if token, not just an identifier. Then, as always when there's an error, the parser will find that there are no valid parse actions for the token if.
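The mechanism described above can be illustrated with a toy model. This is a sketch under simplifying assumptions (one word per token, a hand-written keyword set), not Tree-sitter's lexer or parse tables. KEYWORDS, lex, makeState, and step are hypothetical names.

```javascript
const KEYWORDS = new Set(["if", "int", "while", "float"]);

// Context-aware lexing: a word is reported as a keyword token only when that
// keyword is in the lookahead set for the current parse state; otherwise it
// falls back to a plain identifier.
function lex(word, validLookahead) {
  if (KEYWORDS.has(word) && validLookahead.has(word)) {
    return word; // keyword token, e.g. "int"
  }
  return "identifier";
}

// A parse state: the tokens the parser can act on, plus tokens that are
// lexed but deliberately have no parse action (the excluded ones).
function makeState(acceptable, excluded = []) {
  return {
    acceptable: new Set(acceptable),
    lookahead: new Set([...acceptable, ...excluded]),
  };
}

// Lex one word in the given state and report whether the parser can use it.
function step(state, word) {
  const token = lex(word, state.lookahead);
  return { token, action: state.acceptable.has(token) ? "shift" : "error" };
}

// Declarator position after `float`, without and with the exclusion.
const permissive = makeState(["identifier"]);
const strict = makeState(["identifier"], ["if", "int"]);

console.log(step(permissive, "int")); // lexed as an identifier and shifted
console.log(step(strict, "int"));     // lexed as the int keyword, no valid action
console.log(step(strict, "main"));    // ordinary identifiers still work
```

In the permissive state the word int slips through as a variable name, which is the late error detection described earlier; adding the excluded tokens to the lookahead set makes the error surface at int itself.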


    I could have instead introduced a new field on the entire grammar called keywords. Then, if we added if to the grammar's keywords, the word if would always be treated as its own token, in every parse state.

    This is less general, and it wouldn't really work AFAICT. Even in C, there are states where the word int should not be treated as a keyword: for example, inside of a string ("int"), or as the name of a macro definition (#define int). And in other languages, there are many more cases like this than there are in C. For example in JavaScript, it's fine to have an object property named if.

    Relevant Issues

    opened by maxbrunsfeld 18
  • tree-sitter/emscripten mismatch?

    I'm seeing an issue that stems from loadWebAssemblyModule (apparently provided by emscripten's support.js) reading from my Wasm parser and calling fetchBinary with paths/URLs that don't make sense.

    This currently looks something like the following:

    fetch url:
    fetch url:
    fetch url: nv	tableBase
    fetch url: ree_sitter_mylang
    fetch url: @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
    fetch url: @


    global.fetch = (url, options) => {
      console.log("fetch url:", url);
      // ...
    };

    I was able to learn some things by taking my node_modules/web-tree-sitter/tree-sitter.js file and running it through prettier.

    Specifically, neededDynLibs is the array of strings shown in the log output above.

    I'll try to provide as much info as possible on my setup.

    1. I'm essentially trying to use tree-sitter in a Node.js environment, so while the following helps with that:

    Emscripten still uses fetch, which isn't provided by default on Node.js so I'm essentially doing global.fetch = ..., checking for the file: URL scheme, and employing a similar workaround.

    I don't think this is affecting anything, but wanted to point it out.

    2. I'm using the release assets of tree-sitter 0.15.2 but my Wasm parser was built using tree-sitter build-wasm. The build-wasm command apparently also uses emscripten, and I'm concerned that the version of emscripten's support.js included in tree-sitter.js is incompatible with the version of emscripten I used when invoking tree-sitter build-wasm.

    I'm using Nix so can't rely on script/fetch-emscripten (or rather haven't been able to figure out how to make that work as a fixed-output derivation yet) but based on the Travis CI build for 0.15.2 I'm using emscripten 1.38.31

    I've tried different versions of emscripten and have seen different log output, which makes me think this is the issue.

    For example, if I use 1.38.32 I see:

    fetch url: 
    fetch url: 
    fetch url: nv	tableBase
    fetch url: _mylang
    fetch url: @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
    fetch url: @
    opened by paulyoung 17
  • Automatic compilation via subcommands and documentation

    I've been puzzled from time to time about "automatic compilation" and when it might be triggered.

    There is a section in the docs about Automatic Compilation which currently reads:

    You might notice that the first time you run tree-sitter test after regenerating your parser, it takes some extra time. This is because Tree-sitter automatically compiles your C code into a dynamically-loadable library. It recompiles your parser as-needed whenever you update it by re-running tree-sitter generate.

    Indeed, I have observed running tree-sitter test after tree-sitter generate leading to compilation (easily observable evidence is the appearance of a replacement .so file), but automatic compilation seems to occur under other circumstances as well.

    From some empirical tests and reading of source, the following are subcommands I think can trigger compilation:

    • generate -- in 2023-01 --build was added; if it's present, yes
    • highlight
    • query
    • tags
    • test

    Does it seem worth mentioning in the docs that other subcommands can trigger automatic compilation?

    It's a feature which can be convenient but I suspect it has contributed to confusion at times.

    opened by sogaiu 5
  • Incorrect result of ts_node_first_child_for_byte

    There are 2 reports of ts_node_first_child_for_byte returning what appears to be an incorrect result:


    I investigated both issues and posted a writeup to the former issue explaining my findings. ATM, I think ts_node_first_child_for_byte doesn't handle certain specific cases.

    I went searching for use of ts_node_first_child_for_byte, but apart from Emacs 29+, I did not find any use. Does anyone know of other uses of this function?

    cc @casouri @dannyfreeman

    bug c-library 
    opened by sogaiu 0
  • Allow TextProvider's iterators to generate owned text

    This is just #1294 rebased on today's master branch. With the help of thrown on top, I was able to get the test suite of tree-sitter-langs to pass with home-built language DLLs.

    opened by domq 1
  • node version of CLI fails on my M1 MacBook Pro

    Unfortunately, I was not able to reproduce this on other similar machines, so it may be unique to me. On my machine, running npm install --save-dev tree-sitter-cli followed by ./node_modules/.bin/tree-sitter results in the following error:

        throw errnoException(err, 'spawn');
    Error: spawn Unknown system error -86
        at ChildProcess.spawn (node:internal/child_process:413:11)
        at spawn (node:child_process:743:9)
        at Object.<anonymous> (~/tree-sitter-hello/node_modules/tree-sitter-cli/cli.js:8:1)
        at Module._compile (node:internal/modules/cjs/loader:1119:14)
        at Module._extensions..js (node:internal/modules/cjs/loader:1173:10)
        at Module.load (node:internal/modules/cjs/loader:997:32)
        at Module._load (node:internal/modules/cjs/loader:838:12)
        at Function.executeUserEntryPoint [as runMain] (node:internal/modules/run_main:81:12)
        at node:internal/main/run_main_module:18:47 {
      errno: -86,
      code: 'Unknown system error -86',
      syscall: 'spawn'
    }

    The expected output is a help menu from the CLI. Other commands produce the same error.

    opened by ilevyor 1
  • Most recent cargo tree-sitter-cli version 0.20.7 breaks node usage.

    To reproduce, run cargo install tree-sitter-cli and follow the instructions for creating a language here:

    Then, in a sister directory, init a node project with the following in package.json:

    {
      "name": "simple",
      "version": "1.0.0",
      "description": "",
      "main": "index.js",
      "scripts": {
        "test": "echo \"Error: no test specified\" && exit 1"
      },
      "author": "",
      "license": "ISC",
      "dependencies": {
        "tree-sitter": "^0.20.1",
        "tree-sitter-hello": "file:../tree-sitter-hello/"
      }
    }

    and the following in index.js

    const Parser = require('tree-sitter');
    const Grit = require('tree-sitter-hello');
    const parser = new Parser();
    // Load the generated grammar into the parser before parsing.
    parser.setLanguage(Grit);
    const sourceCode = 'hello';
    const tree = parser.parse(sourceCode);

    running node index.js will result in the following error:

    ~/simple/node_modules/tree-sitter/index.js:259, language);
    RangeError: Incompatible language version. Compatible range: 13 - 13. Got: 14

    The language version can be found on line 8 of parser.c (#define LANGUAGE_VERSION 14). Using version 0.20.6 of the Rust CLI resolves the issue.
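The error message suggests a simple range check between the version stamped into the generated parser and the range the binding accepts. A minimal sketch of such a check, assuming the bounds quoted in the message (13 to 13); the constant and function names are hypothetical, not the binding's actual code:

```javascript
// Range a hypothetical runtime build accepts, taken from the error message
// above ("Compatible range: 13 - 13").
const MIN_COMPATIBLE_VERSION = 13;
const MAX_COMPATIBLE_VERSION = 13;

// Reject a generated parser whose LANGUAGE_VERSION is outside the range.
function checkLanguageVersion(version) {
  if (version < MIN_COMPATIBLE_VERSION || version > MAX_COMPATIBLE_VERSION) {
    throw new RangeError(
      `Incompatible language version. Compatible range: ` +
        `${MIN_COMPATIBLE_VERSION} - ${MAX_COMPATIBLE_VERSION}. Got: ${version}`
    );
  }
  return true;
}

console.log(checkLanguageVersion(13)); // a parser generated by the older CLI

let versionError = "";
try {
  checkLanguageVersion(14); // a parser generated by CLI 0.20.7
} catch (err) {
  versionError = err.message;
}
console.log(versionError);
```

Under this reading, the failure is a mismatch between the CLI that generated parser.c and the runtime that loads it, which is consistent with downgrading the CLI making the error go away.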

    opened by ilevyor 1