Succeeded by SyntaxDot: https://github.com/tensordot/syntaxdot

Overview

Warning: sticker is succeeded by SyntaxDot, which supports many new features:

  • Multi-task learning.
  • Pretrained transformer models, such as BERT and XLM-R.
  • Biaffine parsing in addition to parsing as sequence labeling.
  • Lemmatization.

sticker

sticker is a sequence labeler using neural networks.

Introduction

sticker is a sequence labeler that uses recurrent neural networks, transformers, or dilated convolutional networks. In principle, it can be used to perform any sequence labeling task, but so far the focus has been on:

  • Part-of-speech tagging
  • Topological field tagging
  • Dependency parsing
  • Named entity recognition

Features

  • Input representations:
    • finalfusion embeddings with subword units
    • Bidirectional byte LSTMs
  • Hidden representations:
    • Bidirectional recurrent neural networks (LSTM or GRU)
    • Transformers
    • Dilated convolutions
  • Classification layers:
    • Softmax (best-N)
    • CRF
  • Deployment:
    • Standalone binary that links against libtensorflow
    • Very liberal license
    • Docker containers with models

Status

sticker is almost production-ready and we are preparing for release 1.0.0. Graphs and models created with the current version must continue to work with sticker 1.x.y. There may still be breaking API or configuration file changes until 1.0.0 is released.

Where to go from here

References

sticker uses techniques from or was inspired by the following papers:

Issues

You can report bugs and feature requests in the sticker issue tracker.

License

sticker is licensed under the Blue Oak Model License version 1.0.0. The Tensorflow protocol buffer definitions in tf-proto are licensed under the Apache License version 2.0. The list of contributors is also available.

Credits

  • sticker is developed by Daniël de Kok & Tobias Pütz.
  • The Python precursor to sticker was developed by Erik Schill.
  • Sebastian Pütz and Patricia Fischer reviewed a lot of code across the sticker projects.
Comments
  • Switch to the Keras LSTM/GRU implementation

    Recent versions of Tensorflow Keras will automatically switch between cuDNN and Tensorflow implementations. The trained parameters work regardless of the selected implementation.

    The conditions for using the cuDNN implementation are documented at:

    https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM

    They boil down to: (1) an NVIDIA GPU is available, and (2) certain hyperparameters (e.g. activations) are set to specific values. If the cuDNN implementation is selected, this results in a nice speedup.
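
    A minimal sketch of the distinction (assuming a recent Tensorflow with the unified Keras LSTM layer; the layer size is illustrative):

    ```python
    import tensorflow as tf

    # With the default hyperparameters (tanh activation, sigmoid recurrent
    # activation, no recurrent dropout), Keras transparently selects the
    # fused cuDNN kernel when an NVIDIA GPU is available.
    fast_lstm = tf.keras.layers.LSTM(200, return_sequences=True)

    # Deviating from the defaults (e.g. a different activation) forces the
    # generic Tensorflow implementation; trained parameters stay compatible.
    slow_lstm = tf.keras.layers.LSTM(200, activation="relu", return_sequences=True)
    ```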

    The Tensorflow requirement is bumped to 1.15.0; this setup fails under 1.14.0 with a constant-folding error in Grappler:

    https://github.com/tensorflow/tensorflow/issues/29525

    opened by danieldk 15
  • Linear learning-rate warmup.

    Adds linear learning-rate warmup to sticker train.

    The scheduler now needs to be passed into run_epoch since the warmup is per-batch and not per-epoch.
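
    Conceptually, the per-batch warmup amounts to the following (a hypothetical helper for illustration, not sticker's actual scheduler API):

    ```python
    def warmup_scale(step, warmup_steps):
        """Linear warmup factor: ramps from ~0 to 1 over warmup_steps batches."""
        return min(1.0, (step + 1) / warmup_steps)

    # Per-batch usage inside run_epoch (base_lr is illustrative):
    # lr = base_lr * warmup_scale(global_step, warmup_steps=10_000)
    ```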

    Out of scope for this PR: it may be useful to change ExponentialDecay to decay per-batch instead of per-epoch.

    opened by twuebi 7
  • Add support for automatic mixed precision graph rewrites

    This PR consists of three commits:

    • Update protocol buffer definitions to Tensorflow 1.14.0.
    • Regenerate Rust Tensorflow protocol buffer files.
    • The actual implementation in the graph write scripts and the Rust tagger/trainer (the optimizer wrapping is sketched below).
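
    For reference, the Tensorflow 1.14 rewrite is enabled by wrapping the optimizer (a minimal sketch, not the exact code in this PR):

    ```python
    import tensorflow as tf

    opt = tf.train.AdamOptimizer(learning_rate=1e-3)
    # Grappler then rewrites eligible float32 ops to float16 and adds
    # dynamic loss scaling to avoid gradient underflow.
    opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt)
    ```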

    I have not had the opportunity to test this PR yet, since both GPUs on hopper are in use. I am currently compiling Tensorflow on tesniere with the right compute capabilities.

    opened by danieldk 6
  • Periodical eval and summaries on tensorboard

    For pretraining it's nice to evaluate periodically and save the model on improvements. It's even nicer to plot this periodic evaluation in TensorBoard, since it nicely illustrates how the accuracy fluctuates between pretrain batches.

    For my (non-sticker) experiments, dev accuracy was still moving up and down by ~0.1% after ~300k training steps, although the model had reached its final level (0.02% below the high score) after fewer than 200k training steps.
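
    Writing such periodic dev-set points with the TF 1.x summary API looks roughly like this (the tag name and log directory are illustrative):

    ```python
    import tensorflow as tf

    def log_dev_accuracy(writer, step, accuracy):
        # One scalar per periodic evaluation; TensorBoard plots the points
        # against the global step, making the fluctuation visible.
        summary = tf.Summary(value=[tf.Summary.Value(tag="dev/accuracy",
                                                     simple_value=accuracy)])
        writer.add_summary(summary, global_step=step)
        writer.flush()

    writer = tf.summary.FileWriter("logs/pretrain")
    ```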

    feature 
    opened by sebpuetz 6
  • Slow start with quantized embeddings

    sticker uses the average of all embeddings as the unknown embedding. However, this has a bad side effect for quantized embeddings: since every embedding in the embedding matrix has to be reconstructed to compute the average, loading models takes quite a while.

    Possible workarounds:

    • Precompute the unknown embedding and add it to the embedding matrix. I dislike this somewhat, since sticker would need specially prepared matrices again.
    • Precompute the unknown embedding and store it somewhere outside the embedding matrix. Breaks existing models.
    • Precompute the unknown embedding over a sample of the embeddings.
    • Maybe it's enough to average all subquantizer centroids, since they are the centroids of the embeddings? This would be very cheap; see the sketch below. Problem: the original embeddings may not be distributed evenly over the clusters.
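
    A sketch of that last option, assuming product-quantized embeddings with m subquantizers of k centroids each (illustrative, not finalfusion's actual API):

    ```python
    import numpy as np

    def approx_unknown_embedding(codebooks):
        """Approximate the mean embedding from the subquantizer centroids alone.

        codebooks: array of shape (m, k, d // m). Averaging each subquantizer's
        centroids and concatenating the results matches the true mean only if
        the stored embeddings are spread evenly over the clusters.
        """
        return np.concatenate([codebook.mean(axis=0) for codebook in codebooks])
    ```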

    @twuebi @sebpuetz any opinions?

    opened by danieldk 6
  • Consider moving the sticker repositories to an organization

    We currently have sticker and sticker-python. I have also been pondering whether I should move the models from blob.danieldk.eu to a sticker-models repo (where the repo would contain metadata and the associated releases would store the models).

    In favor:

    • All sticker-related projects are grouped together.
    • Hosting models on GitHub may prove to be more reliable.

    Against:

    • Even stronger coupling to Microsoft GitHub ;). (sticker started as a project purely on https://sourcehut.org/ , which was kinda nice, because sr.ht is open source and has far nicer CI.)
    • GitHub Releases have size limitations (2 GB per file). We do not hit them anymore since I started quantizing embeddings, but it may prove annoying in the future.
    • GitHub has been finicky with downloads before: prior to Releases there was a download feature, which was suddenly cancelled, and there was no download option between that cancellation and the introduction of Releases.
    • The name sticker is already taken, so we'd need an organization with a different name (I already reserved glumarko, which is Esperanto for 'sticker').
    question 
    opened by danieldk 6
  • Adds the transformer model.

    A good chunk of these lines are docs ;)

    This PR adds the transformer model.

    In detail:

    • the feed-forward residual layers
    • the encoder assembly
    • the write script, including command-line args

    I did some preliminary experiments with topological field tagging and dependency parsing; both perform reasonably well.
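
    For reference, the feed-forward residual layer is the standard transformer building block; a minimal Keras sketch, not the PR's exact code:

    ```python
    import tensorflow as tf

    def feed_forward_residual(x, hidden_size, inner_size, dropout=0.1):
        # Position-wise feed-forward block with a residual connection and
        # layer normalization, as in a standard transformer encoder.
        inner = tf.keras.layers.Dense(inner_size, activation="relu")(x)
        inner = tf.keras.layers.Dropout(dropout)(inner)
        out = tf.keras.layers.Dense(hidden_size)(inner)
        out = tf.keras.layers.Dropout(dropout)(out)
        return tf.keras.layers.LayerNormalization()(x + out)
    ```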

    opened by twuebi 6
  • Show averages in batch loss/accuracy?

    While training, the loss/accuracy of the most recently trained batch is shown. However, these numbers tend to jump around quite a bit. Maybe we should show the average so far in the current epoch? This would make trends easier to observe.
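
    The bookkeeping for such an epoch-level average is trivial (a Python sketch for illustration; sticker itself is written in Rust):

    ```python
    class RunningMean:
        """Running average of a per-batch metric, reset at each epoch start."""

        def __init__(self):
            self.total = 0.0
            self.count = 0

        def update(self, value):
            self.total += value
            self.count += 1

        @property
        def mean(self):
            return self.total / max(self.count, 1)
    ```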

    opened by danieldk 5
  • Argparsing

    With the addition of the transformer architecture, we will have three model types, each with their own set of hyperparameters.

    Currently, arguments are disambiguated by their names (--rnn_layers) or by the help string (number of dilated convolution levels). With the new model this is becoming increasingly confusing, and it also raises the question of which model type the defaults apply to.

    I see a few options to make things clearer:

    1. Subcommands to determine the model type:

    Pros:

    • all model-specific arguments that don't belong to the selected model are inactive
    • model-specific --help
    • easy to access via a shell script to run some experiments

    Cons:

    • only write-graph --help plus write-graph <MODEL_TYPE> --help together give the full information
    • arguments are not persistent

    2. Read the hyperparameters from a TOML file:

    Pros:

    • unambiguous structure
    • parsing handled by a library
    • adds persistent metadata to pre-trained models
    • manipulation for experiments through e.g. a Python script

    Cons:

    • slightly more difficult to automate experiments
    • need to handle many config files

    3. Read all hyperparameters from config.py:

    Pros:

    • no parsing
    • persistent storage

    Cons:

    • config.py is loaded via import, so there is no simple switching between configs
    • more difficult to automate experiments
    • need to handle many config files

    4. Specific write-graph scripts for each model type:

    Pros:

    • clear separation
    • viable both for reading configs from a file and for using argparse


    If we stick with a single write-graph, I'd prefer a config.toml (option 2). If we switch to multiple write scripts, sticking to command-line arguments should be fine, although saving the training parameters would be a nice feature. What do you think?
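
    To illustrate option 2, reading hyperparameters from a TOML file is a one-liner with an off-the-shelf parser (the keys below are hypothetical, not sticker's actual schema):

    ```python
    import toml

    config = toml.loads("""
    model = "rnn"

    [rnn]
    layers = 2
    hidden_size = 200
    cell = "lstm"
    """)

    assert config["rnn"]["layers"] == 2
    ```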

    opened by twuebi 5
  • Restore compatibility with sticker 0.4 models

    This change restores compatibility with models trained using sticker 0.4. The only graph incompatibility is the addition of training and validation summaries for TensorBoard; this change makes these ops optional, so that prediction (and training) using old graphs continues to work.

    The ops could be made mandatory in some future version when the sticker 0.4 models have all been replaced.

    maintenance 
    opened by danieldk 4
  • Tensorboard broken with tf 1.14

    https://github.com/tensorflow/tensorflow/commit/826027dbd4277a2636fc2935ed245700fd01e7cd#diff-335a6620279c6fa5d7f70c37c91265a3

    This broke our _create_file_writer_generic_type in vendored.py; the breaking change was introduced with Tensorflow 1.14. I'm looking into it.

    opened by twuebi 4
  • Support of custom features

    Hello! Thank you for this awesome library; I'm very impressed by its quality.

    My task is to extract/segment information from semi-structured text. But not only the text is important: some "external features" matter as well.

    For example, imagine that I want to segment this text about GitHub projects into Category, Project name, URL, and Description.

    I use a BIO scheme to tag each HTML token with a category:

    token=NLP       start_of_p=True   bold=True   center=True    B-Category
    token=Projects  start_of_p=False  bold=True   center=True    I-Category
    token=Project   start_of_p=True   bold=True   center=False   B-Project-name
    token=Name      start_of_p=False  bold=True   center=False   I-Project-name
    token=:         start_of_p=False  bold=False  center=False   I-Project-name

    The final result is something like:

    Pay attention that some features are important, such as text formatting (italic, bold, centered), position in the text, and more...

    Is there any way of using those custom external features when training sticker?

    Thank you!

    opened by bratao 1
  • Make the exponential decay lr schedule available

    Right now, the exponential decay LR schedule is not available for sticker train and sticker pretrain. Once #145 is merged, it would make sense to have it available for both subcommands.

    This may clutter the command line arguments a bit since we then have:

    • Plateau decay

      • lr_scale
      • lr_patience
    • Exponential decay

      • decay_rate
      • decay_exponent

    Maybe it would make sense to move the learning-rate schedule options to the config file.
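
    For illustration, the two schedules reduce to something like this (hypothetical forms inferred from the parameter names, not sticker's actual implementation):

    ```python
    def exponential_decay(base_lr, epoch, decay_rate, decay_exponent):
        # Scale the learning rate by decay_rate every decay_exponent epochs.
        return base_lr * decay_rate ** (epoch / decay_exponent)

    def plateau_decay(lr, epochs_without_improvement, lr_scale, lr_patience):
        # Scale the learning rate by lr_scale when the dev metric has not
        # improved for lr_patience epochs.
        if epochs_without_improvement >= lr_patience:
            return lr * lr_scale
        return lr
    ```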

    opened by twuebi 6
  • Clear the batch accuracy/loss when an epoch is finished

    Once an epoch is done, we start a new line. The batch accuracy/loss that is displayed is then that of the last batch of the epoch. These numbers are confusing when inspecting the outputs for accuracies/losses.

    opened by danieldk 0
  • Epoch summaries on tensorboard

    Add epoch-averaged (train and dev) values of the summarized metrics to tensorboard. After a few epochs it's hard to tell anything from the per-batch graphs.

    feature 
    opened by sebpuetz 3