Neural network transition-based dependency parser (in Rust)

Overview

dpar

Introduction

dpar is a neural network transition-based dependency parser. The original Go version can be found in the oldgo branch.

Dependencies

Build-time

Run-time

  • TensorFlow

Building dpar

To compile and install dpar, run the following in the main project directory:

cargo install --path dpar-utils

To do a debug build and run unit tests, run cargo build in the main project directory. To generate API documentation, run cargo doc.

Comments
  • Add updated parser config

    basic-parse.conf doesn't seem to be up to date with the parser anymore. A second config has been added to account for this. Alternatively, basic-parse.conf could also be replaced by extended-parse.conf.

    opened by DiveFish 5
  • Feature name restricted to ASCII_ALPHANUMERIC

    The feature definition file is restricted to ASCII_ALPHANUMERIC feature names. I don't know if that's wanted, but there are probably cases where people (me) have features with underscores or some other non-ascii-alphanumeric chars.

    A quick fix would be to replace ASCII_ALPHANUMERIC in the grammar definition file with a newly defined char rule:

    feature_name = ${ char+ }
    char = _{ !(WHITESPACE | "|" | ":" ) ~ ANY }
    

    or, more defensively, change feature_name to whitelist the additional non-alphanumeric characters explicitly:

    feature_name = ${ (ASCII_ALPHANUMERIC | "_" | "-" )+ }
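To compare what the two proposals would accept, here is a minimal sketch in plain Rust that mirrors both rules as predicates. The function names are hypothetical and not part of dpar, and the sketch assumes that WHITESPACE in the grammar corresponds to Unicode whitespace.

```rust
/// Permissive rule (mirrors `char+`): one or more characters, none of
/// which is whitespace, '|' or ':'.
/// Assumption: the grammar's WHITESPACE matches Unicode whitespace.
fn is_permissive_feature_name(name: &str) -> bool {
    !name.is_empty()
        && name
            .chars()
            .all(|c| !c.is_whitespace() && c != '|' && c != ':')
}

/// Defensive rule: ASCII letters and digits plus '_' and '-'.
fn is_defensive_feature_name(name: &str) -> bool {
    !name.is_empty()
        && name
            .chars()
            .all(|c| c.is_ascii_alphanumeric() || c == '_' || c == '-')
}
```

Under these assumptions, a name such as head_pos passes both checks, while a name containing non-ASCII letters passes only the permissive one.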
    
    opened by sebpuetz 3
  • Convert feature specification parser to pest.

    This removes an external dependency (ragel) for building the feature parser.

    @sebpuetz: I thought I'd add you as the reviewer for this one, since we talked about Pest last week, so I thought you'd have an above-average interest ;).

    maintenance 
    opened by danieldk 2
  • Replace Numberer by Transitions in the transition systems.

    Transitions is a wrapper of Numberer that has several additional properties that are useful for transition tables.

    • It ensures that the identifier 0 is reserved for unknown transitions.
    • It returns the correct length of the transition table, which includes the special identifier 0.
    • The table can be either fresh or frozen. A fresh table automatically adds transitions that are not known. A frozen table returns the special identifier 0 when a transition is not known.
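The three properties above can be sketched as follows. This is an illustration only, not dpar's actual Transitions type: the struct name, the method names, and the use of String keys are all assumptions.

```rust
use std::collections::HashMap;

/// Hypothetical transition table; identifier 0 is reserved for unknown
/// transitions.
struct TransitionTable {
    ids: HashMap<String, usize>,
    frozen: bool,
}

impl TransitionTable {
    /// Create a fresh (growable) table.
    fn fresh() -> Self {
        TransitionTable {
            ids: HashMap::new(),
            frozen: false,
        }
    }

    /// Stop adding new transitions.
    fn freeze(&mut self) {
        self.frozen = true;
    }

    /// Return the identifier of a transition. A fresh table assigns the
    /// next free identifier to an unseen transition; a frozen table
    /// returns 0 instead.
    fn lookup(&mut self, transition: &str) -> usize {
        if let Some(&id) = self.ids.get(transition) {
            return id;
        }
        if self.frozen {
            return 0;
        }
        let id = self.ids.len() + 1; // skip the reserved identifier 0
        self.ids.insert(transition.to_owned(), id);
        id
    }

    /// Table length, counting the reserved identifier 0.
    fn len(&self) -> usize {
        self.ids.len() + 1
    }
}
```

In this sketch an empty table already has length 1, so an output layer sized with len() always has a slot for the unknown transition.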

    For future consideration: provide a thaw method as well?

    opened by danieldk 2
  • Support pseudo-projective parsing

    Currently, dpar assumes that all dependencies are projective, even though they are read from the non-projective column.

    Support for pseudo-projective parsing should be added to deal with non-projective structures.
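To make the distinction concrete, the following sketch checks whether a tree has crossing arcs, which is the usual test for non-projectivity. It is not taken from dpar; the function name and the encoding (heads[i] is the head of token i + 1, with 0 as the artificial root) are assumptions chosen for illustration.

```rust
/// Return true if no two arcs cross.
/// Assumed encoding: `heads[i]` is the head of token `i + 1`, and
/// head 0 denotes the artificial root.
fn is_projective(heads: &[usize]) -> bool {
    // Store every arc as an ordered (left, right) position pair.
    let arcs: Vec<(usize, usize)> = heads
        .iter()
        .enumerate()
        .map(|(i, &head)| {
            let dependent = i + 1;
            (head.min(dependent), head.max(dependent))
        })
        .collect();

    // Two arcs (a, b) and (c, d) cross when a < c < b < d.
    for &(a, b) in &arcs {
        for &(c, d) in &arcs {
            if a < c && c < b && b < d {
                return false;
            }
        }
    }
    true
}
```

A pseudo-projective approach would transform trees for which this check fails into projective ones before training and restore the original attachments after parsing.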

    enhancement 
    opened by danieldk 2
  • Reduce memory use during training

    We vectorize all the data before optimizing the graph. This worked fine when we were just storing indices, but now that we store embeddings for embedding layers, memory use is getting out of hand (~30GB on TüBa-D/Z).

    I guess we should generate the batches on the fly instead.
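One way to generate batches on the fly is an iterator adapter that only materializes one batch at a time. The sketch below is a generic illustration, not dpar code; Batches and batches are hypothetical names, and the items stand in for vectorized training instances.

```rust
use std::iter::Fuse;

/// Hypothetical adapter that groups the items of an iterator into
/// batches, holding only the current batch in memory.
struct Batches<I: Iterator> {
    inner: Fuse<I>,
    batch_size: usize,
}

/// Wrap an iterator so that it yields batches of at most `batch_size` items.
fn batches<I: Iterator>(inner: I, batch_size: usize) -> Batches<I> {
    assert!(batch_size > 0, "batch size must be positive");
    Batches {
        inner: inner.fuse(),
        batch_size,
    }
}

impl<I: Iterator> Iterator for Batches<I> {
    type Item = Vec<I::Item>;

    fn next(&mut self) -> Option<Self::Item> {
        // Pull at most `batch_size` items from the underlying iterator.
        let batch: Vec<I::Item> = self.inner.by_ref().take(self.batch_size).collect();
        if batch.is_empty() {
            None
        } else {
            Some(batch)
        }
    }
}
```

With this shape, peak memory depends on the batch size rather than on the size of the treebank; the last batch may be smaller than the others.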

    opened by danieldk 1
  • Add input vector for association metrics

    So far, input layers only stored the i32 indices that were lookups into an embedding matrix or a lookup table. A new kind of input vector is now added which makes it possible to use f32 values directly as an input vector. This is convenient for e.g. including association measures like PMI into the training process of the parser. PMIs are retrieved for the parser states that are involved in an attachment.

    Note that the lookup of the PMIs is only a dummy HashMap right now. In the next step, the PMIs will be looked up from a file.
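As a rough illustration of the dummy lookup described above, the sketch below maps (head, dependent) form pairs to f32 scores that could be fed to the network directly. The function name, the key type, and the default of 0.0 for unseen pairs are assumptions, not dpar's actual behaviour.

```rust
use std::collections::HashMap;

/// Build a real-valued input vector from association scores.
/// Assumption: pairs that are missing from the table contribute 0.0.
fn pmi_inputs(table: &HashMap<(String, String), f32>, pairs: &[(&str, &str)]) -> Vec<f32> {
    pairs
        .iter()
        .map(|&(head, dependent)| {
            table
                .get(&(head.to_owned(), dependent.to_owned()))
                .copied()
                .unwrap_or(0.0)
        })
        .collect()
}
```

Unlike the i32 index inputs, these values are used as they are and do not pass through an embedding matrix.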

    opened by DiveFish 1
  • Add retrieval of addressed values to transition system(s)

    This PR adds functionality to retrieve the addresses of the two tokens between which a dependency relation is established. Depending on the transition system, these are tokens from the stack or buffer.
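For illustration, the sketch below shows which addresses are involved in an attachment under the textbook arc-eager system. The types are hypothetical and do not reflect dpar's API; other transition systems differ (arc-standard, for example, attaches the two topmost stack tokens).

```rust
/// Where an addressed token lives (hypothetical types, not dpar's API).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Source {
    Stack,
    Buffer,
}

/// Index 0 is the top of the stack or the front of the buffer.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Address {
    source: Source,
    index: usize,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Attachment {
    head: Address,
    dependent: Address,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ArcEagerTransition {
    Shift,
    Reduce,
    LeftArc,
    RightArc,
}

/// Addresses of the head and dependent for transitions that attach;
/// Shift and Reduce create no dependency relation.
fn attachment_addresses(transition: ArcEagerTransition) -> Option<Attachment> {
    let stack_top = Address { source: Source::Stack, index: 0 };
    let buffer_front = Address { source: Source::Buffer, index: 0 };
    match transition {
        // Left-arc: the front of the buffer becomes the head of the stack top.
        ArcEagerTransition::LeftArc => Some(Attachment {
            head: buffer_front,
            dependent: stack_top,
        }),
        // Right-arc: the stack top becomes the head of the buffer front.
        ArcEagerTransition::RightArc => Some(Attachment {
            head: stack_top,
            dependent: buffer_front,
        }),
        ArcEagerTransition::Shift | ArcEagerTransition::Reduce => None,
    }
}
```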

    opened by DiveFish 1
  • Add FreezableLookup which replaces several lookup tables.

    This change introduces a single lookup table that replaces:

    • TransitionLookup
    • LookupTable
    • MutableLookupTable

    StoredLookupTable is simplified, but still exists as a wrapper around FreezableLookup to simplify serialization.

    FreezableLookup implements three traits:

    • LookupValue: defines the methods to look up an identifier for a value and vice versa. In contrast to the lookups that are replaced, values are borrowed during lookup and only cloned upon insertion.
    • LookupLen: defines the len method.
    • Lookup: inherits LookupValue and LookupLen and is implemented for any type that implements both traits.

    In my first iteration, I had a single lookup trait. However, this sometimes led to type inference problems when the len method was used. Since len does not use the borrowed type in its signature, the type parameter for Lookup<B> could not be inferred. So, instead, we make the len method part of a trait without type parameters.
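The trait split can be sketched roughly as follows. The trait names follow the description above, but the method signatures and the SimpleLookup type are assumptions made for illustration and may differ from the actual FreezableLookup code.

```rust
use std::borrow::Borrow;
use std::collections::HashMap;
use std::hash::Hash;

/// Look up the identifier of a borrowed value (hypothetical signature).
trait LookupValue<B: ?Sized> {
    fn lookup(&self, value: &B) -> Option<usize>;
}

/// `len` lives in a trait without type parameters, so calling it never
/// requires the compiler to infer `B`.
trait LookupLen {
    fn len(&self) -> usize;
}

/// Combined trait, implemented automatically for any type with both.
trait Lookup<B: ?Sized>: LookupValue<B> + LookupLen {}

impl<T, B: ?Sized> Lookup<B> for T where T: LookupValue<B> + LookupLen {}

/// Minimal stand-in for a lookup table (hypothetical type).
struct SimpleLookup {
    ids: HashMap<String, usize>,
}

impl SimpleLookup {
    fn new() -> Self {
        SimpleLookup { ids: HashMap::new() }
    }

    /// Values are cloned only when they are inserted.
    fn insert(&mut self, value: &str) -> usize {
        if let Some(&id) = self.ids.get(value) {
            return id;
        }
        let id = self.ids.len();
        self.ids.insert(value.to_owned(), id);
        id
    }
}

impl<B> LookupValue<B> for SimpleLookup
where
    String: Borrow<B>,
    B: Hash + Eq + ?Sized,
{
    fn lookup(&self, value: &B) -> Option<usize> {
        self.ids.get(value).copied()
    }
}

impl LookupLen for SimpleLookup {
    fn len(&self) -> usize {
        self.ids.len()
    }
}

/// Generic code can require the combined trait and use both methods.
fn describe<L: Lookup<str>>(table: &L, value: &str) -> (Option<usize>, usize) {
    (table.lookup(value), table.len())
}
```

Had len been a method of a single Lookup<B> trait, a bare table.len() call on a type implementing the trait for several B would leave B undetermined, which is the inference problem described above.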

    Fixes #19.

    maintenance 
    opened by danieldk 1
  • Replace error-chain by failure.

    Big, boring change: this replaces the error-chain crate with failure, which is now the preferred error handling crate in the Rust community. It also allows us to upgrade to conllx 0.10 without too much wrapping.

    opened by danieldk 1
  • Rewrite addressed value parser in nom

    Ragel dropped support for all languages outside C/C++/ASM. To change/regenerate the addressed value parser, one has to manually compile an old Ragel version.

    It would be nicer to just switch to nom, which is Rust-native and well-maintained.

    maintenance 
    opened by danieldk 1
  • Replace various lookups by one data structure

    There is a lot of overlap between the transition-specific lookup and feature table lookup. This can be factored out to one class that replaces Numberer.

    Maybe this should be a separate crate, because it is generally useful.

    maintenance 
    opened by danieldk 2
Owner
Daniël de Kok
I love to discover patterns in data using Natural Language Processing and Machine Learning.