Rust TF-IDF

Overview

Library to calculate TF-IDF (Term Frequency - Inverse Document Frequency) for generic documents. The library provides strategies to act on objects that implement certain document traits (NaiveDocument, ProcessedDocument, ExpandableDocument).

For more information on the implemented weighting schemes, see the Wikipedia article on tf-idf.
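As a quick refresher on the underlying arithmetic: tf-idf is the product of a term-frequency weight and an inverse-document-frequency weight. The following is a minimal standalone sketch of one common variant (raw term frequency multiplied by the smoothed idf, ln(1 + N/n_t)); it is plain Rust written for illustration and does not use this crate's API.

```rust
// Standalone sketch of the tf-idf arithmetic (not this crate's API):
// raw term frequency multiplied by the smoothed idf ln(1 + N / n_t).

fn tf(term: &str, doc: &[(&str, usize)]) -> f64 {
    doc.iter()
        .find(|(t, _)| *t == term)
        .map_or(0.0, |(_, count)| *count as f64)
}

fn idf_smooth(term: &str, docs: &[Vec<(&str, usize)>]) -> f64 {
    let n = docs.len() as f64;
    // n_t: the number of documents that contain the term.
    let n_t = docs
        .iter()
        .filter(|d| d.iter().any(|(t, _)| *t == term))
        .count() as f64;
    // The +1 inside the log keeps the idf finite for common terms; a real
    // implementation should still guard against n_t == 0.
    (1.0 + n / n_t).ln()
}

fn main() {
    let docs = vec![
        vec![("a", 3), ("b", 2), ("c", 4)],
        vec![("a", 2), ("d", 5)],
    ];

    // "c" occurs 4 times in doc 0 and appears in 1 of 2 documents.
    let score = tf("c", &docs[0]) * idf_smooth("c", &docs);
    assert!(score > 0.0);
    println!("tf-idf of \"c\" in doc 0: {:.4}", score);
}
```

Note that this uses a smoothed idf, so even a term present in every document gets a small positive weight; the crate's default strategy behaves differently (see the example below, where "a" scores 0).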

Document Types

A document is defined as a collection of terms. The library makes no assumptions about the term type (terms are not normalized in any way).

These document types are of my own design. The terminology isn't standard, but it is fairly straightforward to understand.

  • NaiveDocument - A document is 'naive' if it only knows whether a term is contained within it, but not how many instances of the term it contains.

  • ProcessedDocument - A document is 'processed' if it knows how many instances of each term are contained within it.

  • ExpandableDocument - A document is 'expandable' if it provides a way to access each term contained within it.
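To make the distinctions concrete, here is a rough, self-contained sketch of what the three flavors express. These are simplified signatures invented for exposition (with shortened names), not the crate's actual trait definitions; see the crate documentation for those.

```rust
// Hypothetical, simplified illustrations of the three document flavors.
// Not the crate's real traits.

/// 'Naive': knows only whether a term occurs.
trait NaiveDoc {
    fn contains(&self, term: &str) -> bool;
}

/// 'Processed': knows how many times each term occurs.
trait ProcessedDoc {
    fn term_frequency(&self, term: &str) -> usize;
}

/// 'Expandable': can iterate over every term it contains.
trait ExpandableDoc {
    fn terms(&self) -> Box<dyn Iterator<Item = &str> + '_>;
}

/// A term-count document, analogous to Vec<(T, usize)>.
struct Counts(Vec<(String, usize)>);

impl ProcessedDoc for Counts {
    fn term_frequency(&self, term: &str) -> usize {
        self.0
            .iter()
            .find(|(t, _)| t.as_str() == term)
            .map_or(0, |(_, n)| *n)
    }
}

fn main() {
    let doc = Counts(vec![("a".to_string(), 3), ("b".to_string(), 2)]);
    assert_eq!(doc.term_frequency("a"), 3);
    assert_eq!(doc.term_frequency("z"), 0);
}
```

A 'processed' document trivially yields a 'naive' one (a term is contained iff its frequency is nonzero), which is why strategies written against the weaker traits can serve the stronger ones.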

Example

The simplest way to calculate the tf-idf of a document is with the default implementation. Note that the library provides an implementation of ProcessedDocument for Vec<(T, usize)>.

```rust
use tfidf::{TfIdf, TfIdfDefault};

let mut docs = Vec::new();
let doc1 = vec![("a", 3), ("b", 2), ("c", 4)];
let doc2 = vec![("a", 2), ("d", 5)];

docs.push(doc1);
docs.push(doc2);

assert_eq!(0f64, TfIdfDefault::tfidf("a", &docs[0], docs.iter()));
assert!(TfIdfDefault::tfidf("c", &docs[0], docs.iter()) > 0.5);
```

You can also roll your own tf-idf strategy from the term-frequency and inverse-document-frequency building blocks included in the library.

```rust
use tfidf::{TfIdf, ProcessedDocument};
use tfidf::tf::RawFrequencyTf;
use tfidf::idf::InverseFrequencySmoothIdf;

#[derive(Copy, Clone)]
struct MyTfIdfStrategy;

impl<T> TfIdf<T> for MyTfIdfStrategy where T: ProcessedDocument {
  type Tf = RawFrequencyTf;
  type Idf = InverseFrequencySmoothIdf;
}

let mut docs = Vec::new();
let doc1 = vec![("a", 3), ("b", 2), ("c", 4)];
let doc2 = vec![("a", 2), ("d", 5)];

docs.push(doc1);
docs.push(doc2);

assert!(MyTfIdfStrategy::tfidf("a", &docs[0], docs.iter()) > 0f64);
assert!(MyTfIdfStrategy::tfidf("c", &docs[0], docs.iter()) > 0f64);
```

License

Licensed under either of

 * Apache License, Version 2.0, ([LICENSE-APACHE](LICENSE-APACHE) or http://www.apache.org/licenses/LICENSE-2.0)
 * MIT license ([LICENSE-MIT](LICENSE-MIT) or http://opensource.org/licenses/MIT)

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.

Comments
  • Document implementation for HashMap<String, usize>

    Document implementation for HashMap

    It would probably be useful to have a default implementation of Document and ProcessedDocument for HashMap<String, usize>.

    I did a first attempt, but somehow can't get it to compile:

    impl<T> Document for HashMap<T, usize> {
        type Term = T;
    }
    
    impl<T> ProcessedDocument for HashMap<String, usize> where T : PartialEq {
        fn term_frequency<K>(&self, term: K) -> usize where K : Borrow<T> {
            &self.get(&term).unwrap_or(0)
        }
    
        fn max(&self) -> Option<&T> {
            match self.iter().max_by_key(|&(_, v)| v) {
                Some(&(ref k, _)) => Some(k),
                None => None
            }
        }
    }
    

    The error message I'm getting:

    src/lib.rs:75:6: 75:7 error: the type parameter `T` is not constrained by the impl trait, self type, or predicates [E0207]
    src/lib.rs:75 impl<T> ProcessedDocument for HashMap<String, usize> where T : PartialEq {
                       ^
    src/lib.rs:75:6: 75:7 help: run `rustc --explain E0207` to see a detailed explanation
    error: aborting due to previous error
    

    Any ideas?

    opened by dbrgn 2
  • Support more collections from the std lib, remove unused dependency

    Support more collections from the std lib, remove unused dependency

    Hello @ferristseng . As suggested in the title, this PR adds support for HashMap and BTreeMap based documents.

    I also noticed that num wasn't being used for anything, so I took that dependency out.

    Cheers!

    opened by ZackPierce 0
  • Relicense under dual MIT/Apache-2.0

    Relicense under dual MIT/Apache-2.0

    This issue was automatically generated. Feel free to close without ceremony if you do not agree with re-licensing or if it is not possible for other reasons. Respond to @cmr with any questions or concerns, or pop over to #rust-offtopic on IRC to discuss.

    You're receiving this because someone (perhaps the project maintainer) published a crates.io package with the license as "MIT" xor "Apache-2.0" and the repository field pointing here.

    TL;DR the Rust ecosystem is largely Apache-2.0. Being available under that license is good for interoperation. The MIT license as an add-on can be nice for GPLv2 projects to use your code.

    Why?

    The MIT license requires reproducing countless copies of the same copyright header with different names in the copyright field, for every MIT library in use. The Apache license does not have this drawback. However, this is not the primary motivation for me creating these issues. The Apache license also has protections from patent trolls and an explicit contribution licensing clause. However, the Apache license is incompatible with GPLv2. This is why Rust is dual-licensed as MIT/Apache (the "primary" license being Apache, MIT only for GPLv2 compat), and doing so would be wise for this project. This also makes this crate suitable for inclusion and unrestricted sharing in the Rust standard distribution and other projects using dual MIT/Apache, such as my personal ulterior motive, the Robigalia project.

    Some ask, "Does this really apply to binary redistributions? Does MIT really require reproducing the whole thing?" I'm not a lawyer, and I can't give legal advice, but some Google Android apps include open source attributions using this interpretation. Others also agree with it. But, again, the copyright notice redistribution is not the primary motivation for the dual-licensing. It's stronger protections to licensees and better interoperation with the wider Rust ecosystem.

    How?

    To do this, get explicit approval from each contributor of copyrightable work (as not all contributions qualify for copyright, due to not being a "creative work", e.g. a typo fix) and then add the following to your README:

    ## License
    
    Licensed under either of
    
     * Apache License, Version 2.0, ([LICENSE-APACHE](LICENSE-APACHE) or http://www.apache.org/licenses/LICENSE-2.0)
     * MIT license ([LICENSE-MIT](LICENSE-MIT) or http://opensource.org/licenses/MIT)
    
    at your option.
    
    ### Contribution
    
    Unless you explicitly state otherwise, any contribution intentionally submitted
    for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any
    additional terms or conditions.
    

    and in your license headers, if you have them, use the following boilerplate (based on that used in Rust):

    // Copyright 2016 rust-tfidf Developers
    //
    // Licensed under the Apache License, Version 2.0, <LICENSE-APACHE or
    // http://apache.org/licenses/LICENSE-2.0> or the MIT license <LICENSE-MIT or
    // http://opensource.org/licenses/MIT>, at your option. This file may not be
    // copied, modified, or distributed except according to those terms.
    

    It's commonly asked whether license headers are required. I'm not comfortable making an official recommendation either way, but the Apache license recommends it in their appendix on how to use the license.

    Be sure to add the relevant LICENSE-{MIT,APACHE} files. You can copy these from the Rust repo for a plain-text version.

    And don't forget to update the license metadata in your Cargo.toml to:

    license = "MIT OR Apache-2.0"
    

    I'll be going through projects which agree to be relicensed and have approval by the necessary contributors and doing these changes, so feel free to leave the heavy lifting to me!

    Contributor checkoff

    To agree to relicensing, comment with :

    I license past and future contributions under the dual MIT/Apache-2.0 license, allowing licensees to chose either at their option.
    

    Or, if you're a contributor, you can check the box in this repo next to your name. My scripts will pick this exact phrase up and check your checkbox, but I'll come through and manually review this issue later as well.

    • [x] @ferristseng
    opened by emberian 0
  • a performance question and proposal to expose the `Idf` trait

    a performance question and proposal to expose the `Idf` trait

    Hi! Thanks for publishing this software. It's quite helpful to potentially be able to use your library instead of re-implementing TFIDF myself. I am grateful for the time and attention you've given to this.

    For my use-case, I am working with a large corpus of documents and trying to understand if I can use this library in a way which will have suitable performance.

    Examples in this repo are of the form:

    for term in terms:
      for doc in docs:
        score = compute_tfidf(term, doc) 
    

    where compute_tfidf is either TfIdfDefault::tfidf or MyTfIdfStrategy::tfidf


    1. Is it true that the idf implementations exposed by this crate all require a O(n) linear iteration over the documents/corpus?
    2. Is it possible to use the idf functions on their own, without going through tfidf?

    Presented in pseudocode here, I would like to do the following:

    for term in terms:
      idf = compute_idf(term, docs)
      for doc in docs:
        score = compute_tf(term, doc) * idf
    

    Concretely, I have tried to use the library in the following way, but ran into an error that I don't quite understand yet:

    use tfidf::idf::{InverseFrequencySmoothIdf};
    use tfidf::tf::DoubleHalfNormalizationTf;
    use tfidf::{Tf, Idf};
    
    for term in terms {
      let idf = idf::InverseFrequencyIdf::idf(term, docs)
      for doc in docs {
        let tf = tf::DoubleHalfNormalizationTf::tf(term, doc)
        let tfidf = tf * idf
      }
    }
    
    

    Error 1:

    Unresolved import `tfidf::Idf`
    

    Error 2:

    No function or associated item named `idf` found for struct `InverseFrequencySmoothIdf` in the current scope
    

    Further exploration has led me to discover that this may be occurring simply because Idf isn't exposed. Would it be okay for me to submit a patch which modifies lib.rs to expose Idf?

    by changing:

    pub use prelude::{
      Document, ExpandableDocument, NaiveDocument, NormalizationFactor, ProcessedDocument,
      SmoothingFactor, Tf, TfIdf,
    };
    

    proposed:

    pub use prelude::{
      Document, ExpandableDocument, NaiveDocument, NormalizationFactor, ProcessedDocument,
      SmoothingFactor, Tf, TfIdf, Idf,
    };
    
    opened by btc 2
  • Get TfIdf Vector

    Get TfIdf Vector

    I'm still trying to wrap my head around TF-IDF, therefore this might be a stupid question :)

    I want to compare the similarity between two documents. I already have code in place to extract the words from the documents and to count the words. The result is a HashMap<String, usize>.

    What I want to get now is a vector that contains TF-IDF values for every word that occurs in the documents, so that I can determine the cosine similarity between them.

    Is this possible with the current API? If I understand it correctly, the tfidf function simply calculates the TF-IDF value for a single word, right? Does IDF even make much sense if there are only 2 documents?

    opened by dbrgn 2
Owner
Ferris Tseng