Common stop words in a variety of languages

Overview


About

Stop words are words that carry little meaning on their own and are typically removed as a preprocessing step before text analysis or natural language processing. This crate provides common stop words for a variety of languages, drawing its lists from Stopwords ISO and NLTK.

Usage

Using this crate is straightforward:

// Get the stop words
let words = stop_words::get(stop_words::LANGUAGE::English);

// Print them
for word in words {
    println!("{}", word);
}

The get function accepts either a member of the LANGUAGE enum or a two-letter ISO 639-1 language code, passed as a str or a String.
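One way a single function can accept an enum member, a str, or a String is to take any argument that converts into a String. The sketch below illustrates that pattern only; it is not taken from the crate's source. The Language enum, its two variants, and the language_code function are hypothetical names introduced for illustration.

```rust
/// Hypothetical stand-in for the crate's LANGUAGE enum (assumption:
/// only two variants are shown here, for illustration).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Language {
    English,
    French,
}

/// Convert an enum member into its two-letter ISO 639-1 code.
impl From<Language> for String {
    fn from(language: Language) -> String {
        match language {
            Language::English => String::from("en"),
            Language::French => String::from("fr"),
        }
    }
}

/// Hypothetical helper: accepts anything that converts into a String,
/// which covers the enum above as well as &str and String values.
fn language_code<T: Into<String>>(language: T) -> String {
    language.into()
}

fn main() {
    // All three call styles resolve to a plain language code.
    println!("{}", language_code(Language::English));
    println!("{}", language_code(Language::French));
    println!("{}", language_code("fr"));
    println!("{}", language_code(String::from("en")));
}
```

A generic bound like this keeps the call site short: callers who prefer compile-time checking can pass the enum, while callers who read a language code from configuration can pass the string directly.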

You can find a complete example of how to read in a text file and remove stop words here.
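As a rough illustration of the filtering step itself, here is a minimal sketch that removes stop words from a string. It assumes a small hand-written list in place of the list the crate would return, and the remove_stop_words function is a hypothetical helper, not part of the crate.

```rust
use std::collections::HashSet;

/// Split `text` on anything that is not alphanumeric or an apostrophe,
/// lowercase each token, and drop the tokens found in `stop_words`.
fn remove_stop_words(text: &str, stop_words: &HashSet<&str>) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric() && c != '\'')
        .filter(|token| !token.is_empty())
        .map(|token| token.to_lowercase())
        .filter(|token| !stop_words.contains(token.as_str()))
        .collect()
}

fn main() {
    // Assumption: a tiny hand-written list stands in for the crate's output.
    let stop_words: HashSet<&str> = ["the", "a", "is", "on", "and"]
        .iter()
        .copied()
        .collect();

    let kept = remove_stop_words(
        "The cat is on the mat, and the dog is asleep.",
        &stop_words,
    );

    // prints ["cat", "mat", "dog", "asleep"]
    println!("{:?}", kept);
}
```

Storing the stop words in a HashSet makes each lookup constant time on average, which matters once the input text is longer than a few sentences.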

Language Availability

This crate supports all languages from Stopwords ISO and NLTK. The table below shows coverage by ISO 639-1 code.

Language Coverage Table
ISO 639-1 Code Language Stopwords ISO NLTK
aa Afar
ab Abkhazian
af Afrikaans
ak Akan
sq Albanian
am Amharic
ar Arabic
an Aragonese
hy Armenian
as Assamese
av Avaric
ae Avestan
ay Aymara
az Azerbaijani
ba Bashkir
bm Bambara
eu Basque
be Belarusian
bn Bengali
bh Bihari languages
bi Bislama
bo Tibetan
bs Bosnian
br Breton
bg Bulgarian
my Burmese
ca Catalan; Valencian
cs Czech
ch Chamorro
ce Chechen
zh Chinese
cu Church Slavic; Old Slavonic; Church Slavonic; Old Bulgarian; Old Church Slavonic
cv Chuvash
kw Cornish
co Corsican
cr Cree
cy Welsh
da Danish
de German
dv Divehi; Dhivehi; Maldivian
nl Dutch; Flemish
dz Dzongkha
el Greek, Modern (1453-)
en English
eo Esperanto
et Estonian
ee Ewe
fo Faroese
fa Persian
fj Fijian
fi Finnish
fr French
fy Western Frisian
ff Fulah
ka Georgian
gd Gaelic; Scottish Gaelic
ga Irish
gl Galician
gv Manx
gn Guarani
gu Gujarati
ht Haitian; Haitian Creole
ha Hausa
he Hebrew
hz Herero
hi Hindi
ho Hiri Motu
hr Croatian
hu Hungarian
ig Igbo
is Icelandic
io Ido
ii Sichuan Yi; Nuosu
iu Inuktitut
ie Interlingue; Occidental
ia Interlingua (International Auxiliary Language Association)
id Indonesian
ik Inupiaq
it Italian
jv Javanese
ja Japanese
kl Kalaallisut; Greenlandic
kn Kannada
ks Kashmiri
kr Kanuri
kk Kazakh
km Central Khmer
ki Kikuyu; Gikuyu
rw Kinyarwanda
ky Kirghiz; Kyrgyz
kv Komi
kg Kongo
ko Korean
kj Kuanyama; Kwanyama
ku Kurdish
lo Lao
la Latin
lv Latvian
li Limburgan; Limburger; Limburgish
ln Lingala
lt Lithuanian
lb Luxembourgish; Letzeburgesch
lu Luba-Katanga
lg Ganda
mk Macedonian
mh Marshallese
ml Malayalam
mi Maori
mr Marathi
ms Malay
mg Malagasy
mt Maltese
mn Mongolian
na Nauru
nv Navajo; Navaho
nr Ndebele, South; South Ndebele
nd Ndebele, North; North Ndebele
ng Ndonga
ne Nepali
nn Norwegian Nynorsk; Nynorsk, Norwegian
nb Bokmål, Norwegian; Norwegian Bokmål
no Norwegian
ny Chichewa; Chewa; Nyanja
oc Occitan (post 1500)
oj Ojibwa
or Oriya
om Oromo
os Ossetian; Ossetic
pa Panjabi; Punjabi
pi Pali
pl Polish
pt Portuguese
ps Pushto; Pashto
qu Quechua
rm Romansh
ro Romanian; Moldavian; Moldovan
rn Rundi
ru Russian
sg Sango
sa Sanskrit
si Sinhala; Sinhalese
sk Slovak
sl Slovenian
se Northern Sami
sm Samoan
sn Shona
sd Sindhi
so Somali
st Sotho, Southern
es Spanish; Castilian
sc Sardinian
sr Serbian
ss Swati
su Sundanese
sw Swahili
sv Swedish
ty Tahitian
ta Tamil
tt Tatar
te Telugu
tg Tajik
tl Tagalog
th Thai
ti Tigrinya
to Tonga (Tonga Islands)
tn Tswana
ts Tsonga
tk Turkmen
tr Turkish
tw Twi
ug Uighur; Uyghur
uk Ukrainian
ur Urdu
uz Uzbek
ve Venda
vi Vietnamese
vo Volapük
wa Walloon
wo Wolof
xh Xhosa
yi Yiddish
yo Yoruba
za Zhuang; Chuang
zu Zulu

Releases(v0.7.0)
Owner
Chris McComb
Studying humans who design, teaching computers to do it too. Associate professor at CMU.