DeepFrog - NLP Suite
Introduction
DeepFrog aims to be a (partial) successor of the Dutch-NLP suite Frog. Whereas the various NLP modules in Frog wre built on k-NN classifiers, DeepFrog builds on deep learning techniques and can use a variety of neural transformers.
Our deliverables are multi-faceted:
- Fine-tuned neural network models for Dutch NLP that can be compared with Frog and are directly usable with Huggingface's Transformers library for Python (or rust-bert for Rust).
- Training pipelines for the above models (see
training
). - A software tool that integrates multiple models (not just limited to dutch!) and provides a single pipeline solution for end-users.
- with full support for FoLiA XML input/output.
- usage is not limited to the models we provide
Installation
DeepFrog and all its dependencies are included as an extra in LaMachine, which is the easiest way to install it. Within lamachine, do:
lamachine-add deepfrog && lamachine-update
Otherwise, simply install DeepFrog using Rust's package manager:
cargo install deepfrog
No cargo/rust on your system yet? Do sudo apt install cargo
on Debian/ubuntu based systems, brew install rust
on mac, or use rustup.
In order to run the DeepFrog command-line-tool, you need to have the C++ library libtorch installed. Download it from https://pytorch.org/ ; make sure you select package: libtorch there! You don't need the rest of pytorch.
To use our models directly with the Huggingface's Transformers library for Python, you merely need that library, models will be automatically downloaded and installed as you invoke them. The DeepFrog command-line-tool is not used in this workflow.
Models
We aim to make available various models for Dutch NLP.
RobBERT v1 Part-of-Speech (CGN tagset) for Dutch
Model page with instructions: https://huggingface.co/proycon/robbert-pos-cased-deepfrog-nld
Uses pre-trained model RobBERT (a Roberta model), fine-tuned on part-of-speech tags with the full corpus as also used by Frog. Uses the tag set of Corpus Gesproken Nederlands (CGN), this corpus constitutes a subset of the training data.
Test Evaluation:
f1 = 0.9708171206225681
loss = 0.07882563415198372
precision = 0.9708171206225681
recall = 0.9708171206225681
RobBERT v2 Part-of-Speech (CGN tagset) for Dutch
Model page with instructions: https://huggingface.co/proycon/robbert2-pos-cased-deepfrog-nld
Uses pre-trained model RobBERT v2 (a Roberta model), fine-tuned on part-of-speech tags with the full corpus as also used by Frog. Uses the tag set of Corpus Gesproken Nederlands (CGN), this corpus constitutes a subset of the training data.
f1 = 0.9664560038891591
loss = 0.09085878504153627
precision = 0.9659863945578231
recall = 0.9669260700389105
BERT Part-of-Speech (CGN tagset) for Dutch
Model page with instructions: https://huggingface.co/proycon/bert-pos-cased-deepfrog-nld
Uses pre-trained model BERTje (a BERT model), fine-tuned on part-of-speech tags with the full corpus as also used by Frog. Uses the tag set of Corpus Gesproken Nederlands (CGN), this corpus constitutes a subset of the training data.
Test Evaluation:
f1 = 0.9737354085603113
loss = 0.0647074995296342
precision = 0.9737354085603113
recall = 0.9737354085603113
RobBERT SoNaR1 Named Entities for Dutch
Model page with instructions: https://huggingface.co/proycon/robbert-ner-cased-sonar1-nld
Uses pre-trained model RobBERT (a Roberta model), fine-tuned on Named Entities from the SoNaR1 corpus (as also used by Frog). Provides basic PER,LOC,ORG,PRO,EVE,MISC tags.
Test Evaluation (note: this is a simple token-based evaluation rather than entity based!)
f1 = 0.9170731707317074
loss = 0.023864904676364467
precision = 0.9306930693069307
recall = 0.9038461538461539
Note: the tokenisation in this model is English rather than Dutch
RobBERT v2 SoNaR1 Named Entities for Dutch
Model page with instructions: https://huggingface.co/proycon/robbert2-ner-cased-sonar1-nld
Uses pre-trained model RobBERT (v2) (a Roberta model), fine-tuned on Named Entities from the SoNaR1 corpus (as also used by Frog). Provides basic PER,LOC,ORG,PRO,EVE,MISC tags.
f1 = 0.8878048780487806
loss = 0.03555946223787032
precision = 0.900990099009901
recall = 0.875
BERT SoNaR1 Named Entities for Dutch
Model page with instructions: https://huggingface.co/proycon/bert-ner-cased-sonar1-nld
Uses pre-trained model BERTje (a BERT model), fine-tuned on Named Entities from the SoNaR1 corpus (as also used by Frog). Provides basic PER,LOC,ORG,PRO,EVE,MISC tags.
Test Evaluation (note: this is a simple token-based evaluation rather than entity based!)
f1 = 0.9519230769230769
loss = 0.02323892477299803
precision = 0.9519230769230769
recall = 0.9519230769230769