Quickner is a new tool to quickly annotate texts for NER (Named Entity Recognition). It is written in Rust and accessible through a Python API.

Overview

Quickner

A simple, fast, and easy to use NER annotator for Python

PyPI version License PyPI - Downloads Build Status

Showcase

Quickner is a new tool to quickly annotate texts for NER (Named Entity Recognition). It is written in Rust and accessible through a Python API.

Quickner is blazing fast, simple to use, and easy to configure using a TOML file.

Installation

# Create a virtual environment
python3 -m venv env
source env/bin/activate

# Install quickner
pip install quickner # or pip3 install quickner

Usage

Using the config file

from quickner import Quickner, Config

config = Config(path="config.toml") # or Config() if the config file is in the current directory

# Initialize the annotator
quick = Quickner(config=config)

# Annotate the texts using the config file
quick.process() # or annotator.process(True) to save the annotated data to a file

Using Documents

from quickner import Quickner, Document

# Create documents
rust = Document("rust is made by Mozilla")
python = Document("Python was created by Guido van Rossum")
java = Document("Java was created by James Gosling")

# Documents can be added to a list
documents = [rust, python, java]

# Initialize the annotator

quick = Quickner(documents=documents)
quick
>>> Entities: 0 | Documents: 3 | Annotations:
>>> quick.documents
[Document(id="87e03d58b1ba4d72", text=rust is made by Mozilla, label=[]), Document(id="f1da5d23ef88f3dc", text=Python was created by Guido van Rossum, label=[]), Document(id="e4324f9818e7e598", text=Java was created by James Gosling, label=[])]
>>> quick.entities
[]

Using Documents and Entities

from quickner import Quickner, Document, Entity

# Create documents from texts
texts = (
  "rust is made by Mozilla",
  "Python was created by Guido van Rossum",
  "Java was created by James Gosling at Sun Microsystems",
  "Swift was created by Chris Lattner and Apple",
)
documents = [Document(text) for text in texts]

# Create entities
entities = (
  ("Rust", "PL"),
  ("Python", "PL"),
  ("Java", "PL"),
  ("Swift", "PL"),
  ("Mozilla", "ORG"),
  ("Apple", "ORG"),
  ("Sun Microsystems", "ORG"),
  ("Guido van Rossum", "PERSON"),
  ("James Gosling", "PERSON"),
  ("Chris Lattner", "PERSON"),
)
entities = [Entity(*(entity)) for entity in entities]

# Initialize the annotator
quick = Quickner(documents=documents, entities=entities)
quick.process()

>>> quick
Entities: 6 | Documents: 3 | Annotations: PERSON: 2, PL: 3, ORG: 1
>>> quick.documents 
[Document(id=87e03d58b1ba4d72, text=rust is made by Mozilla, label=[(0, 4, PL), (16, 23, ORG)]), Document(id=f1da5d23ef88f3dc, text=Python was created by Guido van Rossum, label=[(0, 6, PL), (22, 38, PERSON)]), Document(id=e4324f9818e7e598, text=Java was created by James Gosling, label=[(0, 4, PL), (20, 33, PERSON)])]

Find documents by label or entity

When you have annotated your documents, you can use the find_documents_by_label and find_documents_by_entity methods to find documents by label or entity.

Both methods return a list of documents, and are not case sensitive.

Example:

# Find documents by label
>>> quick.find_documents_by_label("PERSON")
[Document(id=f1da5d23ef88f3dc, text=Python was created by Guido van Rossum, label=[(0, 6, PL), (22, 38, PERSON)]), Document(id=e4324f9818e7e598, text=Java was created by James Gosling, label=[(0, 4, PL), (20, 33, PERSON)])]

# Find documents by entity
>>> quick.find_documents_by_entity("Guido van Rossum")
[Document(id=f1da5d23ef88f3dc, text=Python was created by Guido van Rossum, label=[(0, 6, PL), (22, 38, PERSON)])]
>>> quick.find_documents_by_entity("rust")
[Document(id=87e03d58b1ba4d72, text=rust is made by Mozilla, label=[(0, 4, PL), (16, 23, ORG)])]
>>> quick.find_documents_by_entity("Chris Lattner")
[Document(id=3b0b3b5b0b5b0b5b, text=Swift was created by Chris Lattner and Apple, label=[(0, 5, PL), (21, 35, PERSON), (40, 45, ORG)])]

Get a Spacy Compatible Generator Object

You can use the spacy method to get a spacy compatible generator object.

The generator object can be used to feed a spacy model with the annotated data, you still need to convert the data into DocBin format.

Example:

# Get a spacy compatible generator object
>>> quick.spacy()
<builtins.SpacyGenerator object at 0x102311440>
# Divide the documents into chunks
>>> chunks = quick.spacy(chunks=2)
>>> for chunk in chunks:
...     print(chunk)
...
[('rust is made by Mozilla', {'entitiy': [(0, 4, 'PL'), (16, 23, 'ORG')]}), ('Python was created by Guido van Rossum', {'entitiy': [(0, 6, 'PL'), (22, 38, 'PERSON')]})]
[('Java was created by James Gosling at Sun Microsystems', {'entitiy': [(0, 4, 'PL'), (20, 33, 'PERSON'), (37, 53, 'ORG')]}), ('Swift was created by Chris Lattner and Apple', {'entitiy': [(0, 5, 'PL'), (21, 34, 'PERSON'), (39, 44, 'ORG')]})]

Single document annotation

You can also annotate a single document with a list of entities.

This is useful when you want to annotate a document with a list of entities is not in the list of entities of the Quickner object.

Example:

from quickner import Document, Entity

# Create a document from a string
# Method 1
rust = Document.from_string("rust is made by Mozilla")
# Method 2
rust = Document("rust is made by Mozilla")

# Create a list of entities
entities = [Entity("Rust", "PL"), Entity("Mozilla", "ORG")]
# Annotate the document with the entities, case_sensitive is set to False by default
>>> rust.annotate(entities, case_sensitive=True)
>>> rust
Document(id="87e03d58b1ba4d72", text=rust is made by Mozilla, label=[(16, 23, ORG)])
>>> rust.annotate(entities, case_sensitive=False)
>>> rust
Document(id="87e03d58b1ba4d72", text=rust is made by Mozilla, label=[(16, 23, ORG), (0, 4, PL)])

Load from file

Initialize the Quickner object from a file containing existing annotations.

Quickner.from_jsonl and Quickner.from_spacy are class methods that return a Quickner object and are able to parse the annotations and entities from a jsonl or spaCy file.

from quickner import Quickner

quick = Quickner.from_jsonl("annotations.jsonl") # load the annotations from a jsonl file
quick = Quickner.from_spacy("annotations.json") # load the annotations from a spaCy file

Configuration

The configuration file is a TOML file with the following structure:

# Configuration file for the NER tool

[general]
# Mode to run the tool, modes are:
# Annotation from the start
# Annotation from already annotated texts
# Load annotations and add new entities

[logging]
level = "debug" # level of logging (debug, info, warning, error, fatal)

[texts]

[texts.input]
filter = false     # if true, only texts in the filter list will be used
path = "texts.csv" # path to the texts file

[texts.filters]
accept_special_characters = ".,-" # list of special characters to accept in the text (if special_characters is true)
alphanumeric = false              # if true, only strictly alphanumeric texts will be used
case_sensitive = false            # if true, case sensitive search will be used
max_length = 1024                 # maximum length of the text
min_length = 0                    # minimum length of the text
numbers = false                   # if true, texts with numbers will not be used
punctuation = false               # if true, texts with punctuation will not be used
special_characters = false        # if true, texts with special characters will not be used

[annotations]
format = "spacy" # format of the output file (jsonl, spaCy, brat, conll)

[annotations.output]
path = "annotations.jsonl" # path to the output file

[entities]

[entities.input]
filter = true         # if true, only entities in the filter list will be used
path = "entities.csv" # path to the entities file
save = true           # if true, the entities found will be saved in the output file

[entities.filters]
accept_special_characters = ".-" # list of special characters to accept in the entity (if special_characters is true)
alphanumeric = false             # if true, only strictly alphanumeric entities will be used
case_sensitive = false           # if true, case sensitive search will be used
max_length = 20                  # maximum length of the entity
min_length = 0                   # minimum length of the entity
numbers = false                  # if true, entities with numbers will not be used
punctuation = false              # if true, entities with punctuation will not be used
special_characters = true        # if true, entities with special characters will not be used

[entities.excludes]
# path = "excludes.csv" # path to entities to exclude from the search

Features Roadmap and TODO

  • Add support for spaCy format
  • Add support for brat format
  • Add support for conll format
  • Add support for jsonl format
  • Add support for loading annotations from a json spaCy file
  • Add support for loading annotations from a jsonl file
  • Find documents with a specific entity/entities and return the documents
  • Add support for loading annotations from a brat file
  • Substring search for entities in the text (case sensitive and insensitive)
  • Partial match for entities, e.g. "Rust" will match "Rustlang"
  • Pattern/regex based entites, e.g. "Rustlang" will match "Rustlang 1.0"
  • Fuzzy match for entities with levenstein distance, e.g. "Rustlang" will match "Rust"
  • Add support for jupyter notebook

License

MOZILLA PUBLIC LICENSE Version 2.0

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update tests as appropriate.

Authors

  • [Omar MHAIMDAT]
You might also like...
Papercraft is a tool to unwrap 3D models.
Papercraft is a tool to unwrap 3D models.

Papercraft Introduction Papercraft is a tool to unwrap paper 3D models, so that you can cut and glue them together and get a real world paper model. T

A build tool for illumos.

Eos Eos is a build tool for illumos. It works by locating build.toml files in the illumos source tree and generating a top-level ninja build specifica

A fast, low-resource Natural Language Processing and Text Correction library written in Rust.

nlprule A fast, low-resource Natural Language Processing and Error Correction library written in Rust. nlprule implements a rule- and lookup-based app

A backend for mdBook written in Rust for generating PDF based on headless chrome and Chrome DevTools Protocol.

A backend for mdBook written in Rust for generating PDF based on headless chrome and Chrome DevTools Protocol.

A HDPSG-inspired symbolic natural language parser written in Rust

Treebender A symbolic natural language parsing library for Rust, inspired by HDPSG. What is this? This is a library for parsing natural or constructed

A Japanese Morphological Analyzer written in pure Rust

Yoin - A Japanese Morphological Analyzer yoin is a Japanese morphological analyze engine written in pure Rust. mecab-ipadic is embedded in yoin. :) $

A simple OpenAI (GPT-3) client written in Rust.

A simple OpenAI (GPT-3) client written in Rust. It works by making HTTP requests to OpenAI's API and consuming the results.

A Markdown to HTML compiler and Syntax Highlighter, built using Rust's pulldown-cmark and tree-sitter-highlight crates.

A blazingly fast( possibly the fastest) markdown to html parser and syntax highlighter built using Rust's pulldown-cmark and tree-sitter-highlight crate natively for Node's Foreign Function Interface.

Checks all your documentation for spelling and grammar mistakes with hunspell and a nlprule based checker for grammar

cargo-spellcheck Check your spelling with hunspell and/or nlprule. Use Cases Run cargo spellcheck --fix or cargo spellcheck fix to fix all your docume

Owner
Omar MHAIMDAT
Data Scientist | Software Engineer | Better programming and Heartbeat contributor
Omar MHAIMDAT
⏮ ⏯ ⏭ A Rust library to easily read forwards, backwards or randomly through the lines of huge files.

EasyReader The main goal of this library is to allow long navigations through the lines of large files, freely moving forwards and backwards or gettin

Michele Federici 81 Dec 6, 2022
lingua-rs Python binding. An accurate natural language detection library, suitable for long and short text alike.

lingua-py lingua-rs Python binding. An accurate natural language detection library, suitable for long and short text alike. Installation pip install l

messense 7 Dec 30, 2022
🛥 Vaporetto is a fast and lightweight pointwise prediction based tokenizer. This is a Python wrapper for Vaporetto.

?? python-vaporetto ?? Vaporetto is a fast and lightweight pointwise prediction based tokenizer. This is a Python wrapper for Vaporetto. Installation

null 17 Dec 22, 2022
Simple NLP in Rust with Python bindings

vtext NLP in Rust with Python bindings This package aims to provide a high performance toolkit for ingesting textual data for machine learning applica

Roman Yurchak 133 Jan 3, 2023
A lightning-fast Sanskrit toolkit. For Python bindings, see `vidyut-py`.

Vidyut मा भूदेवं क्षणमपि च ते विद्युता विप्रयोगः ॥ Vidyut is a lightning-fast toolkit for processing Sanskrit text. Vidyut aims to provide standard co

Ambuda 14 Dec 30, 2022
Viterbi-based accelerated tokenizer (Python wrapper)

?? python-vibrato ?? Vibrato is a fast implementation of tokenization (or morphological analysis) based on the Viterbi algorithm. This is a Python wra

null 20 Dec 29, 2022
A Rust wrapper for the Text synthesization service TextSynth API

A Rust wrapper for the Text synthesization service TextSynth API

ALinuxPerson 2 Mar 24, 2022
A command-line tool and library for generating regular expressions from user-provided test cases

Table of Contents What does this tool do? Do I still need to learn to write regexes then? Current features How to install? 4.1 The command-line tool 4

Peter M. Stahl 5.8k Dec 30, 2022
Difftastic is an experimental structured diff tool that compares files based on their syntax.

Difftastic is an experimental structured diff tool that compares files based on their syntax.

Wilfred Hughes 13.9k Jan 2, 2023