Read and modify constituency trees in Rust.

lumberjack

Read and process constituency trees in various formats.

Install:

  • From crates.io:
cargo install lumberjack-utils
  • From GitHub:
cargo install --git https://github.com/sebpuetz/lumberjack

Usage as standalone:

  • Convert a treebank in NEGRA export 4 format to bracketed TueBa V2 format, projectivizing the trees:
lumberjack-conversion --input_file treebank.negra --input_format negra \
    --output_format tueba --output_file treebank.tueba --projectivize
  • Retain only root node, NPs and PPs and print to simple bracketed format:
echo "NP PP" > filter_set.txt
lumberjack-conversion --input_file treebank.simple --input_format simple \
    --output_format simple --output_file treebank.filtered \
    --filter filter_set.txt
  • Convert a treebank from simple bracketed format to CONLLX format and annotate the parent tags of terminals as features:
lumberjack-conversion --input_file treebank.simple --input_format simple \
    --output_format conllx --output_file treebank.conll --parent
  • Apply several modifications in the following order:
  1. Reattach all terminals whose part-of-speech tag starts with $ to the root node.
  2. Remove all nonterminals except the root, Ss, NPs, PPs and VPs.
  3. Assign unique identifiers to terminals based on the closest S.
  4. Insert nodes with the label "label" above terminals that aren't dominated by an NP or PP.
  5. Annotate the label of the parent node on terminals.
  6. Print to CONLLX format with annotations.
echo "S VP NP PP" > filter_set.txt
echo "NP PP" > insert_set.txt
echo "S" > id_set.txt
lumberjack-conversion --input_file treebank.simple --input_format simple \
    --output_format conllx --filter filter_set.txt \
    --insertion_set insert_set.txt --insertion_label label \
    --id_set id_set.txt --reattach '$' \
    --parent parent --output_file treebank.conllx
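
To make the parent annotation step more concrete, here is a minimal sketch on a toy tree type. It is not lumberjack's implementation: `Node`, `t`, `nt` and `parent_labels` are hypothetical names introduced only for illustration, and the `_` placeholder for a terminal without a parent is an assumption.

```rust
/// Toy constituency tree for illustration; not lumberjack's `Tree` type.
#[derive(Debug, Clone, PartialEq)]
enum Node {
    Terminal { form: String, pos: String },
    NonTerminal { label: String, children: Vec<Node> },
}

/// Hypothetical helper: build a terminal from form and part-of-speech tag.
fn t(form: &str, pos: &str) -> Node {
    Node::Terminal { form: form.to_string(), pos: pos.to_string() }
}

/// Hypothetical helper: build a nonterminal from a label and its children.
fn nt(label: &str, children: Vec<Node>) -> Node {
    Node::NonTerminal { label: label.to_string(), children }
}

/// Pair every terminal form with the label of its parent node,
/// in left-to-right order.
fn parent_labels(root: &Node) -> Vec<(String, String)> {
    fn walk(node: &Node, parent: Option<&str>, out: &mut Vec<(String, String)>) {
        match node {
            Node::Terminal { form, .. } => {
                // "_" marks a terminal without a parent (assumed placeholder).
                out.push((form.clone(), parent.unwrap_or("_").to_string()));
            }
            Node::NonTerminal { label, children } => {
                for child in children {
                    walk(child, Some(label.as_str()), out);
                }
            }
        }
    }
    let mut out = Vec::new();
    walk(root, None, &mut out);
    out
}
```

For a tree such as `nt("S", vec![nt("NP", vec![t("She", "PPER")]), t(".", "$.")])`, the terminal `She` is paired with `NP` and the full stop with `S`.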

Usage as a Rust library:

  • Read and projectivize trees from NEGRA format and print to simple bracketed format:
use std::fs::File;
use std::io::BufReader;

use lumberjack::io::{NegraReader, PTBFormat};
use lumberjack::Projectivize;

fn print_negra(path: &str) {
    let file = File::open(path).unwrap();
    let reader = NegraReader::new(BufReader::new(file));
    for tree in reader {
        let mut tree = tree.unwrap();
        tree.projectivize();
        println!("{}", PTBFormat::Simple.tree_to_string(&tree).unwrap());
    }
}
  • Filter non-terminal nodes from trees in a treebank and print to simple bracketed format:
use lumberjack::{io::PTBFormat, Tree, TreeOps, util::LabelSet};

fn filter_nodes(iter: impl Iterator<Item=Tree>, set: LabelSet) {
    for mut tree in iter {
        tree.filter_nonterminals(|tree, nt| set.matches(tree[nt].label())).unwrap();
        println!("{}", PTBFormat::Simple.tree_to_string(&tree).unwrap());
    }
}
  • Convert a treebank in simple bracketed format to CONLLX with constituency structure encoded in the features field:
use conllx::graph::Sentence;
use lumberjack::io::Encode;
use lumberjack::{Tree, TreeOps, UnaryChains};

fn to_conllx(iter: impl Iterator<Item=Tree>) {
    for mut tree in iter {
        tree.collapse_unary_chains().unwrap();
        tree.annotate_absolute().unwrap();
        println!("{}", Sentence::from(&tree));
    }
}
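
The idea behind filtering nonterminals can be sketched without the library. The following is a simplified illustration on a toy tree, under the assumption that children of a removed node are reattached to the nearest retained ancestor and that the root is always kept. `Node`, `filter_children` and `filter_nonterminals` here are hypothetical stand-ins, not lumberjack's types or methods.

```rust
/// Toy tree node for illustration; not lumberjack's `Tree` type.
#[derive(Debug, Clone, PartialEq)]
enum Node {
    Terminal(String),
    NonTerminal(String, Vec<Node>),
}

/// Return the nodes that replace `node` after filtering: terminals and
/// retained nonterminals stay, a removed nonterminal is replaced by its
/// (already filtered) children.
fn filter_children<F>(node: Node, keep: &F) -> Vec<Node>
where
    F: Fn(&str) -> bool,
{
    match node {
        Node::Terminal(form) => vec![Node::Terminal(form)],
        Node::NonTerminal(label, children) => {
            let kept: Vec<Node> = children
                .into_iter()
                .flat_map(|child| filter_children(child, keep))
                .collect();
            if keep(label.as_str()) {
                vec![Node::NonTerminal(label, kept)]
            } else {
                kept
            }
        }
    }
}

/// Filter all nonterminals below the root; the root itself is always retained.
fn filter_nonterminals<F>(root: Node, keep: F) -> Node
where
    F: Fn(&str) -> bool,
{
    match root {
        Node::Terminal(form) => Node::Terminal(form),
        Node::NonTerminal(label, children) => {
            let kept: Vec<Node> = children
                .into_iter()
                .flat_map(|child| filter_children(child, &keep))
                .collect();
            Node::NonTerminal(label, kept)
        }
    }
}
```

Calling `filter_nonterminals(tree, |label| matches!(label, "NP" | "PP"))` keeps the root, NPs and PPs; terminals of a removed VP end up directly under the closest node that was kept.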

Releases (v0.3)
  • v0.3 (May 27, 2019)

    Changes

    • Construct minimal Tree from form and part-of-speech tag of a terminal node.
    • Insertion of NonTerminal nodes as unary-chain above or below another node.
    • Removal of arbitrary nodes.
    • Reattachment of arbitrary nodes.
    • Reattachment of all terminals to a specific node matching a given criterion.
    • Moving Terminal nodes.
    • Insertion of Terminal nodes at arbitrary indices.
    • Cheap insertion of Terminal nodes to the end of the sentence by attaching to the root.
    • Switch to closures in projection and modification methods:
      • take context into account when annotating features.
      • go beyond the node label to determine matches.
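
    As an illustration of why closures are useful here, the sketch below reattaches terminals to the root based on a predicate over form and part-of-speech tag rather than a fixed label set. It uses a toy tree and hypothetical names (`Node`, `t`, `nt`, `detach_matching`, `reattach_to_root`), not lumberjack's API. The toy tree stores no token positions, so moved terminals are simply appended as the last children of the root, and nonterminals left without children are dropped (both are simplifying assumptions).

```rust
/// Toy constituency tree for illustration; not lumberjack's `Tree` type.
#[derive(Debug, Clone, PartialEq)]
enum Node {
    Terminal { form: String, pos: String },
    NonTerminal { label: String, children: Vec<Node> },
}

/// Hypothetical helper: build a terminal from form and part-of-speech tag.
fn t(form: &str, pos: &str) -> Node {
    Node::Terminal { form: form.to_string(), pos: pos.to_string() }
}

/// Hypothetical helper: build a nonterminal from a label and its children.
fn nt(label: &str, children: Vec<Node>) -> Node {
    Node::NonTerminal { label: label.to_string(), children }
}

/// Remove every terminal below `node` for which `pred(form, pos)` holds
/// and collect the removed terminals in `moved`.
fn detach_matching<F>(node: &mut Node, pred: &F, moved: &mut Vec<Node>)
where
    F: Fn(&str, &str) -> bool,
{
    if let Node::NonTerminal { children, .. } = node {
        let mut kept = Vec::with_capacity(children.len());
        for mut child in children.drain(..) {
            let is_match = match &child {
                Node::Terminal { form, pos } => pred(form.as_str(), pos.as_str()),
                Node::NonTerminal { .. } => false,
            };
            if is_match {
                moved.push(child);
                continue;
            }
            detach_matching(&mut child, pred, moved);
            // Drop nonterminals that lost all of their children.
            let now_empty =
                matches!(&child, Node::NonTerminal { children, .. } if children.is_empty());
            if !now_empty {
                kept.push(child);
            }
        }
        *children = kept;
    }
}

/// Reattach all matching terminals as the last children of the root.
fn reattach_to_root<F>(root: &mut Node, pred: F)
where
    F: Fn(&str, &str) -> bool,
{
    let mut moved = Vec::new();
    detach_matching(root, &pred, &mut moved);
    if let Node::NonTerminal { children, .. } = root {
        children.extend(moved);
    }
}
```

    With `reattach_to_root(&mut tree, |_form, pos| pos.starts_with('$'))`, the criterion is expressed over the part-of-speech tag, something a plain set of node labels could not capture.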
Owner
Sebastian Pütz
Rust & Python @enlyze