An efficient implementation of Partitioned Label Trees & its variations for extreme multi-label classification

Overview

Omikuji

An efficient implementation of Partitioned Label Trees (Prabhu et al., 2018) and its variations for extreme multi-label classification, written in Rust 🦀 with love 💖 .

Features & Performance

Omikuji has been tested on datasets from the Extreme Classification Repository. All benchmarks below were run on a quad-core Intel® Core™ i7-6700 CPU, allowing as many cores to be utilized as possible. We measured training time and calculated precision at 1, 3, and 5 (P@1, P@3, P@5). (Note that, due to randomness, results may vary from run to run, especially on smaller datasets.)

Parabel, better parallelized

Omikuji provides a more parallelized implementation of Parabel (Prabhu et al., 2018) that trains faster when more CPU cores are available. The original C++ implementation can only utilize as many CPU cores as there are trees (3 by default); Omikuji maintains the same level of precision but trains 1.3x to 1.7x faster on our quad-core machine. Further speed-ups are possible if more CPU cores are available.

| Dataset        | Metric     | Parabel | Omikuji (balanced, cluster.k=2) |
|----------------|------------|---------|---------------------------------|
| EURLex-4K      | P@1        | 82.2    | 82.1                            |
|                | P@3        | 68.8    | 68.8                            |
|                | P@5        | 57.6    | 57.7                            |
|                | Train Time | 18s     | 14s                             |
| Amazon-670K    | P@1        | 44.9    | 44.8                            |
|                | P@3        | 39.8    | 39.8                            |
|                | P@5        | 36.0    | 36.0                            |
|                | Train Time | 404s    | 234s                            |
| WikiLSHTC-325K | P@1        | 65.0    | 64.8                            |
|                | P@3        | 43.2    | 43.1                            |
|                | P@5        | 32.0    | 32.1                            |
|                | Train Time | 959s    | 659s                            |
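
For reference, this Parabel-like setting matches the CLI defaults listed under Usage below (balanced clustering, cluster.k=2, 3 trees), so no extra flags are needed; a minimal sketch, with an illustrative model path:

omikuji train eurlex_train.txt --model_path ./model-parabel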

Regular k-means for shallow trees

Following Bonsai (Khandagale et al., 2019), Omikuji supports using regular k-means instead of balanced 2-means clustering for tree construction, which results in wider, shallower, and unbalanced trees that train more slowly but achieve better precision. Compared to the original Bonsai implementation, Omikuji achieves the same level of precision while training 2.6x to 4.6x faster on our quad-core machine. (Similarly, further speed-ups are possible if more CPU cores are available.)

| Dataset        | Metric     | Bonsai  | Omikuji (unbalanced, cluster.k=100, max_depth=3) |
|----------------|------------|---------|--------------------------------------------------|
| EURLex-4K      | P@1        | 82.8    | 83.0                                             |
|                | P@3        | 69.4    | 69.5                                             |
|                | P@5        | 58.1    | 58.3                                             |
|                | Train Time | 87s     | 19s                                              |
| Amazon-670K    | P@1        | 45.5*   | 45.6                                             |
|                | P@3        | 40.3*   | 40.4                                             |
|                | P@5        | 36.5*   | 36.6                                             |
|                | Train Time | 5,759s  | 1,753s                                           |
| WikiLSHTC-325K | P@1        | 66.6*   | 66.6                                             |
|                | P@3        | 44.5*   | 44.4                                             |
|                | P@5        | 33.0*   | 33.0                                             |
|                | Train Time | 11,156s | 4,259s                                           |

*Precision numbers as reported in the paper; our machine doesn't have enough memory to run the full prediction with their implementation.
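
To reproduce this Bonsai-like configuration, the corresponding flags from the Usage section below can be combined as in this sketch (the model path is illustrative):

omikuji train eurlex_train.txt --cluster.unbalanced --cluster.k 100 --max_depth 3 --model_path ./model-bonsai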

Balanced k-means for balanced shallow trees

Sometimes it's desirable to have trees that are shallow and wide but also balanced, in which case Omikuji also supports the balanced k-means algorithm used by HOMER (Tsoumakas et al., 2008) for clustering.

| Dataset        | Metric     | Omikuji (balanced, cluster.k=100) |
|----------------|------------|-----------------------------------|
| EURLex-4K      | P@1        | 82.1                              |
|                | P@3        | 69.4                              |
|                | P@5        | 58.1                              |
|                | Train Time | 19s                               |
| Amazon-670K    | P@1        | 45.4                              |
|                | P@3        | 40.3                              |
|                | P@5        | 36.5                              |
|                | Train Time | 1,153s                            |
| WikiLSHTC-325K | P@1        | 65.6                              |
|                | P@3        | 43.6                              |
|                | P@5        | 32.5                              |
|                | Train Time | 3,028s                            |
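
Since balanced clustering is the default, this configuration only requires raising cluster.k; a sketch using the flags from the Usage section below (model path illustrative):

omikuji train eurlex_train.txt --cluster.k 100 --model_path ./model-homer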

Layer collapsing for balanced shallow trees

An alternative way to build balanced, shallow, and wide trees is to collapse adjacent layers, similar to the tree compression step used in AttentionXML (You et al., 2019): intermediate layers are removed, and their children take their place as children of their parents. For example, with balanced 2-means clustering, collapsing 5 layers after each layer increases the tree arity from 2 to 2⁵⁺¹ = 64.

| Dataset        | Metric     | Omikuji (balanced, cluster.k=2, collapse 5 layers) |
|----------------|------------|----------------------------------------------------|
| EURLex-4K      | P@1        | 82.4                                               |
|                | P@3        | 69.3                                               |
|                | P@5        | 58.0                                               |
|                | Train Time | 16s                                                |
| Amazon-670K    | P@1        | 45.3                                               |
|                | P@3        | 40.2                                               |
|                | P@5        | 36.4                                               |
|                | Train Time | 460s                                               |
| WikiLSHTC-325K | P@1        | 64.9                                               |
|                | P@3        | 43.3                                               |
|                | P@5        | 32.3                                               |
|                | Train Time | 1,649s                                             |
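
This setting corresponds to the --collapse_every_n_layers flag documented under Usage below; a sketch (model path illustrative):

omikuji train eurlex_train.txt --collapse_every_n_layers 5 --model_path ./model-collapsed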

Build & Install

Omikuji can be easily built & installed with Cargo as a CLI app:

cargo install omikuji --features cli

Or install from the latest source:

cargo install --git https://github.com/tomtung/omikuji.git --features cli

The CLI app will be available as omikuji. For example, to reproduce the results on the EURLex-4K dataset:

omikuji train eurlex_train.txt --model_path ./model
omikuji test ./model eurlex_test.txt --out_path predictions.txt

Python Binding

A simple Python binding is also available for training and prediction. It can be installed via pip:

pip install omikuji

Note that you might still need to install Cargo should compilation become necessary.

You can also install from the latest source:

pip install git+https://github.com/tomtung/omikuji.git -v

The following script demonstrates how to use the Python binding to train a model and make predictions:

import omikuji

# Train
hyper_param = omikuji.Model.default_hyper_param()
# Adjust hyper-parameters as needed
hyper_param.n_trees = 5
model = omikuji.Model.train_on_data("./eurlex_train.txt", hyper_param)

# Serialize & de-serialize
model.save("./model")
model = omikuji.Model.load("./model")
# Optionally densify model weights to trade off between prediction speed and memory usage
model.densify_weights(0.05)

# Predict
feature_value_pairs = [
    (0, 0.101468),
    (1, 0.554374),
    (2, 0.235760),
    (3, 0.065255),
    (8, 0.152305),
    (10, 0.155051),
    # ...
]
label_score_pairs = model.predict(feature_value_pairs)
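
The predict method also accepts an optional top_k argument to limit how many label-score pairs are returned (as used in one of the issue examples further below):

label_score_pairs = model.predict(feature_value_pairs, top_k=5)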

Usage

$ omikuji train --help
omikuji-train
Train a new model

USAGE:
    omikuji train [FLAGS] [OPTIONS] <TRAINING_DATA_PATH>

FLAGS:
        --cluster.unbalanced     Perform regular k-means clustering instead of balanced k-means clustering
    -h, --help                   Prints help information
        --train_trees_1_by_1     Finish training each tree before starting to train the next; limits initial
                                 parallelization but saves memory
        --tree_structure_only    Build the trees without training classifiers; useful when a downstream user needs the
                                 tree structures only
    -V, --version                Prints version information

OPTIONS:
        --centroid_threshold <THRESHOLD>         Threshold for pruning label centroid vectors [default: 0]
        --cluster.eps <EPS>                      Epsilon value for determining clustering convergence [default: 0.0001]
        --cluster.k <K>                          Number of clusters [default: 2]
        --cluster.min_size <SIZE>
            Labels in clusters with sizes smaller than this threshold are reassigned to other clusters instead [default:
            2]
        --collapse_every_n_layers <N>
            Number of adjacent layers to collapse, which increases tree arity and decreases tree depth [default: 0]

        --linear.c <C>                           Cost co-efficient for regularizing linear classifiers [default: 1]
        --linear.eps <EPS>
            Epsilon value for determining linear classifier convergence [default: 0.1]

        --linear.loss <LOSS>
            Loss function used by linear classifiers [default: hinge]  [possible values: hinge, log]

        --linear.max_iter <M>
            Max number of iterations for training each linear classifier [default: 20]

        --linear.weight_threshold <THRESHOLD>
            Threshold for pruning weight vectors of linear classifiers [default: 0.1]

        --max_depth <DEPTH>                      Maximum tree depth [default: 20]
        --min_branch_size <SIZE>
            Number of labels below which no further clustering & branching is done [default: 100]

        --model_path <PATH>
            Optional path of the directory where the trained model will be saved, if provided; if a model with
            compatible settings is already saved in the given directory, the newly trained trees will be added to the
            existing model
        --n_threads <T>
            Number of worker threads. If 0, the number is selected automatically [default: 0]

        --n_trees <N>                            Number of trees [default: 3]

ARGS:
    <TRAINING_DATA_PATH>    Path to training dataset file (in the format of the Extreme Classification Repository)

$ omikuji test --help
omikuji-test
Test an existing model

USAGE:
    omikuji test [OPTIONS] <MODEL_PATH> <TEST_DATA_PATH>

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

OPTIONS:
        --beam_size <beam_size>           Beam size for beam search [default: 10]
        --k_top <K>                       Number of top predictions to write out for each test example [default: 5]
        --max_sparse_density <DENSITY>    Density threshold above which sparse weight vectors are converted to dense
                                          format. Lower values speed up prediction at the cost of more memory usage
                                          [default: 0.1]
        --n_threads <T>                   Number of worker threads. If 0, the number is selected automatically [default:
                                          0]
        --out_path <PATH>                 Path of the file to which predictions will be written, if provided

ARGS:
    <MODEL_PATH>        Path of the directory where the trained model is saved
    <TEST_DATA_PATH>    Path to test dataset file (in the format of the Extreme Classification Repository)

Data format

Our implementation takes dataset files in the same format as those provided by the Extreme Classification Repository. A data file starts with a header line containing three space-separated integers: the total number of examples, the number of features, and the number of labels. After the header line, there is one line per example, starting with comma-separated labels, followed by space-separated feature:value pairs:

label1,label2,...labelk ft1:ft1_val ft2:ft2_val ft3:ft3_val .. ftd:ftd_val
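
For illustration only, a made-up file with 3 examples, 5 features, and 4 labels might look as follows (indices and values are arbitrary; as in the Python example above, feature indices start from 0):

3 5 4
0,2 0:0.83 1:0.12 3:0.24
1 1:0.55 2:0.23 4:0.07
0,1,3 0:0.10 2:0.15 4:0.90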

Trivia

The project name comes from o-mikuji (御神籤), which are predictions about one's future written on strips of paper (labels?) at Shinto shrines and temples in Japan, often tied to branches of pine trees after they are read.

References

  • Y. Prabhu, A. Kag, S. Harsola, R. Agrawal, and M. Varma, “Parabel: Partitioned Label Trees for Extreme Classification with Application to Dynamic Search Advertising,” in Proceedings of the 2018 World Wide Web Conference, 2018, pp. 993–1002.
  • S. Khandagale, H. Xiao, and R. Babbar, “Bonsai - Diverse and Shallow Trees for Extreme Multi-label Classification,” arXiv preprint, Apr. 2019.
  • G. Tsoumakas, I. Katakis, and I. Vlahavas, “Effective and efficient multilabel classification in domains with large number of labels,” ECML, 2008.
  • R. You, S. Dai, Z. Zhang, H. Mamitsuka, and S. Zhu, “AttentionXML: Extreme Multi-Label Text Classification with Multi-Label Attention Based Recurrent Neural Networks,” arXiv preprint, Jun. 2019.

License

Omikuji is licensed under the MIT License.

Comments
  • Issues when training on a large dataset

    Hi Tom! First of all, I want to thank you for your great contribution. This is the best implementation of XMC I've found (that is also feasible to use in production).

    I ran a number of experiments, and my observation is that it works great when the training set is about 1-2M samples, but the task I'm trying to solve has 60M samples in the training set, with 1M labels and 3M features from tf-idf. I always use the default Parabel-like parameters.

    Once I managed to train a model on 60M samples with 260k labels, but the only machine that managed to fit it was a 160-CPU, 3.4TB-RAM GCP instance, which is very expensive.

    I tried a 96-CPU, 1.4TB machine to decrease costs, but it hangs for 3-4 hours on the "Initializing tree trainer" step and then disconnects (I guess it runs out of memory).

    Do you have any tips and tricks on how to run training on a dataset of this size at a reasonable cost? E.g. would it be possible to train in batches on smaller/cheaper machines? Or are there any "magic" hyperparameter settings to achieve this?

    opened by klimentij 6
  • Python import error with fresh pip install

    Hey, really excited to test your code out on a problem I'm working on, but I encountered an error immediately after installing and attempting to import.

    The error:

    Python 3.6.5 (default, Apr 12 2018, 10:53:09)
    Type 'copyright', 'credits' or 'license' for more information
    IPython 7.4.0 -- An enhanced Interactive Python. Type '?' for help.
    
    In [1]: import parabel
    ---------------------------------------------------------------------------
    ModuleNotFoundError                       Traceback (most recent call last)
    <ipython-input-1-e20e1e492a86> in <module>
    ----> 1 import parabel
    
    ~/.pyenv/versions/3.6.5/envs/venv365/lib/python3.6/site-packages/parabel/__init__.py in <module>
          2 __all__ = ["Model", "LossType", "Trainer", "init_rayon_threads"]
          3
    ----> 4 from ._libparabel import lib, ffi
          5
          6 try:
    
    ~/.pyenv/versions/3.6.5/envs/venv365/lib/python3.6/site-packages/parabel/_libparabel.py in <module>
          3
          4 import os
    ----> 5 from parabel._libparabel__ffi import ffi
          6
          7 lib = ffi.dlopen(os.path.join(os.path.dirname(__file__), '_libparabel__lib.so'), 130)
    
    ~/.pyenv/versions/3.6.5/envs/venv365/lib/python3.6/site-packages/parabel/_libparabel__ffi.py in <module>
          1 # auto-generated file
    ----> 2 import _cffi_backend
          3
          4 ffi = _cffi_backend.FFI('parabel._libparabel__ffi',
          5     _version = 0x2601,
    
    ModuleNotFoundError: No module named '_cffi_backend'
    

    Environment

    $ uname -a
    Darwin C02QRF6FG8WP 16.7.0 Darwin Kernel Version 16.7.0: Thu Jun 21 20:07:39 PDT 2018; root:xnu-3789.73.14~1/RELEASE_X86_64 x86_64
    
    opened by travisbrady 4
  • Wheel for Python 3.10

    It would be nice to have a wheel for Python 3.10 on PyPI, especially now that Ubuntu 22.04 has been released with 3.10 as the default Python version. It would make installing Omikuji more straightforward (the version of cargo in the apt repositories is too old for Omikuji, and I struggled for some time to install a new enough cargo version in a Dockerfile).

    opened by juhoinkinen 3
  • Cannot load model if model directory contains symlinks

    I'm trying to adapt Omikuji (via the integration with Annif) into a Data Version Control (DVC) workflow. In DVC, large model files are typically stored in a cache outside the working tree (which is a git repository). There are several ways to keep the working directory synchronized with the cache, but one common solution is the use of symbolic links. This means that model files are moved to the cache directory and replaced with symlinks that point to the original files.

    I noticed that Omikuji has problems loading the model if the files in the model directory (settings.json, tree0.cbor, tree1.cbor ...) aren't regular files but symlinks. Loading the model apparently succeeds, but all the predictions are empty. I was able to demonstrate this without involving DVC by editing the Python example into this:

    import os
    import sys
    import shutil
    import time
    
    import omikuji
    
    if __name__ == "__main__":
        # Adjust hyper-parameters as needed
        hyper_param = omikuji.Model.default_hyper_param()
        hyper_param.n_trees = 2
    
        # Train
        model = omikuji.Model.train_on_data("./eurlex_train.txt", hyper_param)
    
        # Serialize & de-serialize
        model.save("./model")
        
        # create a directory containing symlinks to the saved model files
        shutil.rmtree("./model2", ignore_errors=True)
        os.mkdir("./model2")
        for fn in os.listdir("./model"):
            os.symlink(f"../model/{fn}", f"./model2/{fn}")
    
        # load the model from the directory containing symlinks
        model = omikuji.Model.load("./model2")
    
        # Predict
        feature_value_pairs = [
            (0, 0.101468),
            (1, 0.554374),
            (2, 0.235760),
            (3, 0.065255),
            (8, 0.152305),
            (10, 0.155051),
            # ...
        ]
        label_score_pairs = model.predict(feature_value_pairs, top_k=3)
        print("Dummy prediction results: {}".format(label_score_pairs))
    

    The result of running this:

    INFO [omikuji::data] Loading data from ./eurlex_train.txt
    INFO [omikuji::data] Parsing data
    INFO [omikuji::data] Loaded 15539 examples; it took 0.16s
    INFO [omikuji::model::train] Training model with hyper-parameters HyperParam { n_trees: 2, min_branch_size: 100, max_depth: 20, centroid_threshold: 0.0, collapse_every_n_layers: 0, linear: HyperParam { loss_type: Hinge, eps: 0.1, c: 1.0, weight_threshold: 0.1, max_iter: 20 }, cluster: HyperParam { k: 2, balanced: true, eps: 0.0001, min_size: 2 }, tree_structure_only: false, train_trees_1_by_1: false }
    INFO [omikuji::model::train] Initializing tree trainer
    INFO [omikuji::model::train] Computing label centroids
    Labels 3786 / 3786 [==============================================================] 100.00 % 23820.55/s INFO [omikuji::model::train] Start training forest
    7824 / 7824 [======================================================================] 100.00 % 3112.38/s INFO [omikuji::model::train] Model training complete; it took 3.39s
    INFO [omikuji::model] Saving model...
    INFO [omikuji::model] Saving tree to ./model/tree0.cbor
    INFO [omikuji::model] Saving tree to ./model/tree1.cbor
    INFO [omikuji::model] Model saved; it took 0.08s
    INFO [omikuji::model] Loading model...
    INFO [omikuji::model] Loading model settings from ./model2/settings.json...
    INFO [omikuji::model] Loaded model settings Settings { n_features: 5000, classifier_loss_type: Hinge }...
    INFO [omikuji::model] Model loaded; it took 0.00s
    Dummy prediction results: []
    

    The suspicious parts are the model loading time (it should take more than 0.00s) and the empty list of predictions.

    I wonder if there's a good reason why the model files have to be regular files. Normally it doesn't matter whether a file used for a read operation is a regular file or a symlink; as long as the symlink points to an actual file (with the correct permissions etc.), it should work fine.

    Some information about the system: Ubuntu Linux 20.04 amd64 with an ext4 filesystem; Python 3.8.10; Omikuji 0.4.1 installed in a virtual environment with pip.

    opened by osma 3
  • Slow prediction problem

    First of all, I want to say this repository has been really helpful to my work. It trains really fast and efficiently, from small datasets to large ones. However, I found prediction really slow when using a saved model. When I start predicting, the reported number of examples per second is nearly 4k, but the actual number is about 200. I checked the process status: most of the time the CPU usage of every core is nearly 0%! Sometimes it increases to 20% or 50%, but only for a few seconds. I've tried prediction on data of different scales and with different parameters, and this problem happens every time. I'm using a Google Cloud machine with an Intel CPU and 64GB of memory; I'm not sure if this problem is due to my machine. Thanks again!

    opened by WestbrookGE 3
  • golang binding via c-api possible?

    Thanks a lot for this great lib, it's wonderful!

    This is not an issue per se, more of a question. I would like to use omikuji in my application, which is currently written in Go. I think my options are to use the Python API in a small dedicated prediction server, or to somehow create a Go binding myself. I am not very familiar with Rust; do you think the current c-api can be used to generate the necessary stub C code for such a binding (with cgo)? If you happen to have some pointers or advice on how to do that, I would love to hear it. (Maybe adding a REST server option to omikuji itself could be a fun project too, if I ever find some time to learn Rust...)

    Thanks again!

    opened by trpstra 3
  • thread 'main' panicked at 'Could not determine the UTC offset on this system

    I've been running this on Linux via AWS SageMaker for some experimentation with Twitter hashtags, and I think the new releases changed something related to the simple-logger update. I reverted to 0.3.3 and everything works.

    I really appreciate this implementation, though! It is very quick and efficient for testing out extreme classification pipelines. Thanks for writing it up!

    Backtrace

    thread 'main' panicked at 'Could not determine the UTC offset on this system. Possible causes are that the time crate does not implement "local_offset_at" on your system, or that you are running in a multi-threaded environment and the time crate is returning "None" from "local_offset_at" to avoid unsafe behaviour. See the time crate's documentation for more information. (https://time-rs.github.io/internal-api/time/index.html#feature-flags): IndeterminateOffset', /home/ec2-user/.cargo/registry/src/github.com-1ecc6299db9ec823/simple_logger-1.15.1/src/lib.rs:360:64

    stack backtrace (std panicking/backtrace frames omitted):
      16: <simple_logger::SimpleLogger as log::Log>::log
      17: omikuji::data::DataSet::load_xc_repo_data_file
      18: omikuji::train
      19: omikuji::main
      21: main
      22: __libc_start_main

    Notes

    Seems like the same problem as noted here: https://github.com/ravenclaw900/DietPi-Dashboard/issues/82

    opened by follperson 2
  • Wheel for Python 3.9?

    Hi, and thanks for the great library! We are using Omikuji in one backend of Annif, and it is performing very well in the subject indexing task.

    As the title says, we are wondering if it would be possible to have a wheel for Python 3.9 (for Linux) on PyPI? It would simplify installation into a Docker image. Currently, installing Omikuji on Python 3.9 seems to require having cargo installed as well (at least).

    opened by juhoinkinen 2
  • Installation issues

    Hi,

    I'm having installation issues. I tried installing the package using pip, and with setup.py as well, and I get the same error each time:

    2019-12-27T10:47:45,077 Created temporary directory: C:\Users\myuser\AppData\Local\Temp\pip-wheel-044u2wao
    2019-12-27T10:47:45,077 Destination directory: C:\Users\myuser\AppData\Local\Temp\pip-wheel-044u2wao
    2019-12-27T10:47:45,078 Running command 'c:\users\myuser\appdata\local\programs\python\python36\python.exe' -u -c '... setup.py ...' bdist_wheel -d 'C:\Users\myuser\AppData\Local\Temp\pip-wheel-044u2wao' --python-tag cp36
    2019-12-27T10:47:45,879 running bdist_wheel
    2019-12-27T10:47:45,881 running build
    2019-12-27T10:47:45,881 running build_py
    2019-12-27T10:47:45,882 creating build\lib
    2019-12-27T10:47:45,882 creating build\lib\omikuji
    2019-12-27T10:47:45,883 copying python-wrapper\omikuji\__init__.py -> build\lib\omikuji
    2019-12-27T10:47:45,887 error: [WinError 2] Cannot find the file specified
    2019-12-27T10:47:45,911 ERROR: Failed building wheel for omikuji

    Any ideas on how to deal with that?

    Regards, Jakub

    opened by rabitwhte 2
  • Cannot set collapse_every_n_layers via Python bindings

    Hi,

    I'm trying to figure out how to use the AttentionXML-like hyperparameters mentioned in the top-level README via the Python bindings. This would require setting the collapse_every_n_layers hyperparameter to 5.

    It appears that this hyperparameter is not exposed in the Python bindings (nor in the C API). There is no field called collapse_every_n_layers (or anything similar) in the HyperParam object returned by omikuji.Model.default_hyper_param(). In fact, the only source files that mention this hyperparameter are src/bin/omikuji.rs and src/model/train.rs, so it is only used within the Rust implementation, not in any of the bindings.

    opened by osma 2
  • Regarding code execution

    Dear Sir/Madam, we have tried to install the parabel-rs package as per the instructions given in the repository, but during the build process it is unable to find some directory. We are using Python 3.6+.

    Below is a screenshot of the error:

    [screenshot of the error]

    opened by purviprajapati196 2
  • Predefined tree architecture

    Hey @tomtung!

    Do you think it would be possible to predefine the tree architecture for parabel? It could be useful when labels are hierarchical by definition. Do you have any hints for implementation, like where to start?

    Regards, rabitwhte

    enhancement 
    opened by rabitwhte 6
  • Explainability

    Hey @tomtung,

    great repository! Your implementation of parabel has significantly outperformed several DNN architectures that I tried (on a dataset of 600k samples and 20k labels) while being much faster at the same time (both in training and prediction). Thank you for the Python wrapper as well, since it was easier and faster for me to try.

    Can you think of any way to approach the challenge of explainability? Is there a way to, e.g., for each prediction, get the most important words that determined why the file was classified one way or another?

    Thanks again!

    enhancement 
    opened by rabitwhte 6
Releases (v0.5.0)
  • v0.5.0(Feb 2, 2022)

    What's Changed

    • Drop support for Python 3.6 by @tomtung in https://github.com/tomtung/omikuji/pull/41
    • Update simple_logger to avoid throwing local-time-related error during logger initialization by @tomtung in https://github.com/tomtung/omikuji/pull/40
    • Support loading symlinked model files by @tomtung in https://github.com/tomtung/omikuji/pull/39
    • Upgrade clap to version 3.0 by @tomtung in https://github.com/tomtung/omikuji/pull/42

    Full Changelog: https://github.com/tomtung/omikuji/compare/v0.4.1...v0.5.0

  • v0.4.1(Dec 6, 2021)

    What's Changed

    • Limit number of label candidates per leaf for prediction by @tomtung in https://github.com/tomtung/omikuji/pull/36

    Full Changelog: https://github.com/tomtung/omikuji/compare/v0.4.0...v0.4.1

  • v0.4.0(Dec 4, 2021)

    What's Changed

    • Change weight matrix storage to significantly speed up prediction while using less memory by @tomtung in https://github.com/tomtung/omikuji/pull/33
      • Note: this breaks compatibility of saved model files

    Full Changelog: https://github.com/tomtung/omikuji/compare/v0.3.4...v0.4.0

  • v0.3.4(Dec 4, 2021)

    What's Changed

    • Downgrade sprs to 0.9 to fix training speed regression by @tomtung in https://github.com/tomtung/omikuji/pull/32

    Full Changelog: https://github.com/tomtung/omikuji/compare/v0.3.3...v0.3.4
