tcount

(pronounced "tee-count")

Count your code by tokens, types of syntax tree nodes, and patterns in the syntax tree.

Quick Start

Simply run tcount in your project root to count tokens and files and print the results grouped by Language. E.g.,

tcount

────────────────────────────
 Group        Files  Tokens
────────────────────────────
 Rust            18   10309
 Go               8    4539
 Ruby             6    1301
────────────────────────────

Installation

cargo install --git https://github.com/RRethy/tcount.git

Requirements

Lastest stable Rust compiler.
Mac or Linux (untested on Windows but most functionality should work, --query option likely will not work on Windows)

tcount Cookbook

Note: None of these use --query, see Queries for information on that option

Compare size of each language in pwd

tcount

────────────────────────────
 Group        Files  Tokens
────────────────────────────
 Rust            18   10309
 Go               8    4539
 Ruby             6    1301
────────────────────────────

Top 5 files by token count

tcount --groupby=file --top=5

──────────────────────────────────
 Group              Files  Tokens
──────────────────────────────────
 ./src/count.rs         1    2451
 ./src/language.rs      1    1685
 ./src/main.rs          1    1214
 ./src/output.rs        1    1157
 ./src/cli.rs           1     757
──────────────────────────────────

Compare size of two directories

tcount --groupby=arg go/scc/ rust/tokei/

─────────────────────────────────────────────
 Group                         Files  Tokens 
─────────────────────────────────────────────
 go/scc                          170  479544 
 rust/tokei                      152   39797 
─────────────────────────────────────────────

Compare size of a Go file and a Rust file

tcount --groupby=file foo.go foo.rs

────────────────────────────
 Group        Files  Tokens
────────────────────────────
 foo.rs           1    1214
 foo.go           1     757
────────────────────────────

Count comments for each language

tcount --kind-pattern=".*comment"

──────────────────────────────────────────────────
 Group        Files  Tokens  Pattern(.*comment)
──────────────────────────────────────────────────
 Rust            18   10309                    78
 Go               7    1302                    35
 Ruby             4     802                    12
──────────────────────────────────────────────────

Note: Comment nodes can have different names depending on the parser. For a language, you can look in the node-types.json file in the parser repo to see what names are given to different nodes (e.g. Go Parser Repo's node-types.json)

Track change in project size over time

tcount --format=csv > tcount-$(date +%m-%d-%Y).csv

These CSV files can then be read and graphed using your tool of choice.

Compare size of all Go files vs all Rust files in foo/

tcount --whitelist Go Rust -- foo/

──────────────────────
 Group  Files  Tokens
──────────────────────
 Rust       9    9034
 Go         6    2011
──────────────────────

Supported languages

tcount --list-languages

──────────────────────────────────────────────────────────────────────
 Language           Extensions                        Query Dir Name 
──────────────────────────────────────────────────────────────────────
 Bash               .bash                             bash 
 BibTeX             .bib                              bibtex 
 C                  .h,.c                             c 
 C#                 .csx,.cs                          c_sharp 
 Clojure            .clj                              clojure 
 C++                .cxx,.c++,.h++,.hh,.cc,.cpp,.hpp  cpp 
 CSS                .css                              css 
 Elm                .elm                              elm 
 Erlang             .erl,.hrl                         erlang 
 Go                 .go                               go 
 HTML               .html                             html 
 Java               .java                             java 
 Javascript         .js,.mjs                          javascript 
 JSON               .json                             json 
 Julia              .jl                               julia 
 LaTeX              .tex                              latex 
 Markdown           .md                               markdown 
 OCaml              .ml                               ocaml 
 OCaml Interface    .mli                              ocaml_interface 
 Python             .pyw,.py                          python 
 Tree-sitter Query  .scm                              query 
 Ruby               .rb                               ruby 
 Rust               .rs                               rust 
 Scala              .scala,.sc                        scala 
 Svelte             .svelte                           svelte 
 Typescript         .ts                               typescript 
──────────────────────────────────────────────────────────────────────

Why count tokens instead of lines

Counting lines rewards dense programs. For example,

int nums[4] = { 1, 2, 3, 4 };
int mult[4] = {0};
for (int i = 0; i < 4; i++) {
    mult[i] = nums[i] * 2;
}
printf("[%d] [%d] [%d] [%d]", mult[0], mult[1], mult[2], mult[3]);

nums := []int{1, 2, 3, 4}
mult := make([]int, 4)
for i, n := range nums {
    mult[i] = n * 2
}
fmt.Println(mult)

Are these programs the same size? They are each 6 lines long, but clearly the Go version is considerably smaller than the C version. While this is a contrived example, line counting still rewards dense programs and dense programming languages.

Counting lines rewards short variable names. Is ns shorter than namespace? By bytes it is, when used throughout a project it likely will result in fewer line breaks, but I don't think a program should be considered smaller just because it uses cryptic variable names whenever possible.
Counting lines penalizes line comments mixed with code. Consider the following contrived example,

v.iter() // iterate over the vector
    .map(|n| n * 2) // multiply each number by two
    .collect::Vec<u32>(); // collect the iterator into a vector of u32

Without the comments, it could be written as v.iter().map(|n| n * 2).collect::Vec<32>();.

Short syntactical elements in languages are rewarded. For example:

[1, 2, 3, 4].map { |n| n * 2 }

Compared with the equivalent

[1, 2, 3, 4].map do |n|
    n * 2
end

Counting lines rewards horizontal programming and penalizes vertical programming

Usage

tcount -h

tcount 0.1.0
Count your code by tokens, node kinds, and patterns in the syntax tree.

USAGE:
    tcount [FLAGS] [OPTIONS] [--] [paths]...

FLAGS:
        --count-hidden        Count hidden files
    -h, --help                Prints help information
        --list-languages      Show a list of supported languages for parsing
        --no-dot-ignore       Don't respect .ignore files
        --no-git              Don't respect gitignore and .git/info/exclude files
        --no-parent-ignore    Don't respect ignore files from parent directories
        --show-totals         Show column totals. This is not affected by --top
    -V, --version             Prints version information

OPTIONS:
        --blacklist <blacklist>...          Blacklist of languages not to parse. This is overriden by --whitelist and
                                            must be an exact match
        --format <format>                   One of table|csv [default: table]
        --groupby <groupby>                 One of language|file|arg. "arg" will group by the `paths` arguments provided
                                            [default: language]
    -k, --kind <kind>...                    kinds of nodes in the syntax tree to count. See node-types.json in the
                                            parser's repo to see the names of nodes or use https://tree-
                                            sitter.github.io/tree-sitter/playground.
    -p, --kind-pattern <kind-pattern>...    Patterns of node kinds to count in the syntax tree (e.g. ".*comment" to
                                            match nodes of type "line_comment", "block_comment", and "comment").
                                            Supports Rust regular expressions
        --query <query>...                  Tree-sitter queries to match and count. Captures can also be counted with
                                            --query=query_name@capture_name,capture_name2. See
                                            https://github.com/RRethy/tcount/blob/master/QUERIES.md for more information
        --sort-by <sort-by>                 One of group|numfiles|tokens. "group" will sort based on --groupby value
                                            [default: tokens]
        --top <top>                         How many of the top results to show
        --verbose <verbose>                 Logging level. 0 to not print errors. 1 to print IO and filesystem errors. 2
                                            to print parsing errors. 3 to print everything else. [default: 0]
        --whitelist <whitelist>...          Whitelist of languages to parse. This overrides --blacklist and must be an
                                            exact match

ARGS:
    <paths>...    Files and directories to parse and count. [default: .]

Counting Tree-sitter Queries

See QUERIES.md

Performance

tcount parses each file using a Tree-sitter parser to create a full syntax tree. This takes more time than only counting lines of code/comments so programs like tokei, scc, and cloc will typically be faster than tcount.

Here are some benchmarks using hyperfine to give an overview of how much slower it is than line counting programs:

tcount

Program	Runtime
tcount	19.5 ms ± 1.7 ms
scc	13.0 ms ± 1.4 ms
tokei	7.2 ms ± 1.2 ms
cloc	1.218 s ± 0.011 s

Redis

Program	Runtime
tcount	1.339 s ± 0.125 s
scc	49.9 ms ± 1.6 ms
tokei	79.9 ms ± 5.3 ms
cloc	1.331 s ± 0.016 s

CPython

Program	Runtime
tcount	11.580 s ± 0.199 s
scc	256.7 ms ± 3.0 ms
tokei	512.2 ms ± 96.4 ms
cloc	12.467 s ± 0.139 s

Limitations

tcount does not support nested languages like ERB. This may change in the future.
It's not always clear what is a token, tcount treats any node in the syntax tree without children as a token. This usually works, but in some cases, like strings in the Rust Tree-sitter parser which can have children (escape codes), it may produce slightly expected results.

Why Tree-sitter

Tree-sitter has relatively efficient parsing and has support for many languages without the need to create and maintain individual parsers or lexers. Support for new languages is easy and only requires and up-to-date Tree-sitter parser.

Contributing

To add support for a new language, add it's information to https://github.com/RRethy/tcount/blob/master/src/language.rs and add the language's Tree-sitter parser crate to Cargo.toml.

Acknowledgements

All parsing is done using Tree-sitter parsers

I tried installing, and it fails at this spot. Any other system deps I need? I'm running macos

# cargo install --git https://github.com/RRethy/tcount.git
    Updating git repository `https://github.com/RRethy/tcount.git`
  Installing tcount v0.1.0 (https://github.com/RRethy/tcount.git#71638e65)
    Updating crates.io index
    Updating git repository `https://github.com/latex-lsp/tree-sitter-bibtex`
    Updating git repository `https://github.com/tree-sitter/tree-sitter-c`
    Updating git repository `https://github.com/tree-sitter/tree-sitter-c-sharp`
    Updating git repository `https://github.com/sogaiu/tree-sitter-clojure`
    Updating git repository `https://github.com/tree-sitter/tree-sitter-css`
    Updating git repository `https://github.com/AbstractMachinesLab/tree-sitter-erlang`
    Updating git repository `https://github.com/tree-sitter/tree-sitter-go`
    Updating git repository `https://github.com/tree-sitter/tree-sitter-html`
    Updating git repository `https://github.com/tree-sitter/tree-sitter-json`
    Updating git repository `https://github.com/tree-sitter/tree-sitter-julia`
    Updating git repository `https://github.com/latex-lsp/tree-sitter-latex`
    Updating git repository `https://github.com/nvim-treesitter/tree-sitter-lua`
    Updating git repository `https://github.com/nvim-treesitter/tree-sitter-query`
    Updating git repository `https://github.com/tree-sitter/tree-sitter-ruby`
    Updating git repository `https://github.com/tree-sitter/tree-sitter-scala`
error: failed to compile `tcount v0.1.0 (https://github.com/RRethy/tcount.git#71638e65)`, intermediate artifacts can be found at `/var/folders/rr/bzwz4lzx7mbbymtknwk1y7140000gn/T/cargo-installFHZdFy`

Caused by:
  failed to select a version for the requirement `tree-sitter-c = "^0.16.0"`
  candidate versions found which didn't match: 0.20.1
  location searched: Git repository https://github.com/tree-sitter/tree-sitter-c
  required by package `tcount v0.1.0 (/Users/pthrasher/.cargo/git/checkouts/tcount-92c5348212cf527d/71638e6)

List key patterns of a JSON file for jq.

jqk jqk lists key patterns of a JSON file for jq. Why? jq is a useful command line tool to filter values from a JSON file quickly on a terminal; howev

8 Jun 25, 2023

Extract patterns from unstructured log messages

logu logu is for extracting patterns from (streaming) unstructured log messages. For parsing unstructured logs, it uses the parser from Drain. In simp

78 Oct 21, 2024

A cli tool to automate the building and deployment of Bitcoin nodes

ℹ️ Automate Bitcoin builds, speed up deployment Shran is an open-source cli tool being developed to address the needs of DMG Blockchain Solutions. It

1 Oct 20, 2022

Work in progress NCBI's Common Tree alternative in the terminal

Lomanai Usage lomanai --species 'Mus musculus' --species 'Homo sapiens' # Mammalia # ++Rodentia # | \-Mus musculus # \+Primates # \-Homo sapien

3 Dec 20, 2022

Split text into semantic chunks, up to a desired chunk size. Supports calculating length by characters and tokens

Large language models (LLMs) can be used for many tasks, but often have a limited context size that can be smaller than documents you might want to use. To use documents of larger length, you often have to split your text into chunks to fit within this context size.