Deduplicating Training Data Makes Language Models Better

Overview

This repository contains code to deduplicate language model datasets as described in the paper "Deduplicating Training Data Makes Language Models Better" by Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch and Nicholas Carlini. We release the ExactSubstr deduplication implementation (written in Rust) along with the scripts we used in the paper to perform ExactSubstr deduplication and inspect the results (written in Python). We also release the document clusters resulting from running NearDup deduplication on C4, RealNews, LM1B, and Wiki-40B-en.

This is not an officially supported Google product.

Why deduplicate?

When datasets are created by scraping raw text from the Internet, this will often result in the same sequences being repeated multiple times (e.g., we find a single 50-word sequence that is repeated in the C4 dataset 60,000 times). Training models on deduplicated datasets is faster (because they see fewer total examples) and experimentally results in models with similar or better perplexity than models trained on data that hasn't been deduplicated. Moreover, language models are less likely to exhibit memorization when their training data has been well-deduplicated.

Citing this work

If you use this repository or our deduplicated datasets, you can cite:

@article{lee2021deduplicating,
      title={Deduplicating Training Data Makes Language Models Better}, 
      author={Katherine Lee and Daphne Ippolito and Andrew Nystrom and Chiyuan Zhang and Douglas Eck and Chris Callison-Burch and Nicholas Carlini},
      journal={arXiv preprint arXiv:2107.06499},
      year={2021},
}

Exact Deduplication Code

We provide an implementation of the exact deduplication technique used in the paper. This is very much research code: it is a (very slightly cleaned up) version of exactly what we do in the paper. It assumes that you want to deduplicate something the size of C4 (~300GB) running on a machine with 96 cores and >600GB of RAM. If you only want to use this for reasonably-sized datasets, you should change the number of parallel threads from 96 to something smaller. If your machine is big enough, there should be no upper bound on the size of the dataset it can handle (well, 2^64-1 bytes is the limit, but I think we can all agree that's essentially unlimited).

We build a suffix array (based on Andrew Gallant's suffix array implementation) in src/table.rs. It has some minor changes from the original version that make it so we can't just import this library as a crate. First, we need 64-bit integers. The original implementation says that u32 works for "reasonably sized documents (~4GB)" but we're working with unreasonably sized documents. So we need u64. Second, we don't want UTF8 strings. Everything is a [u8] byte array, because we might be working over token sequences which aren't valid UTF8. The main complication in the rest of src/main.rs is the fact that we want things to run in parallel, and we probably can't fit the entire suffix array into memory. And so all of our algorithms are designed around these constraints.

If you just want to run the rust deduplicator, then you will only need to install Rust:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

If you additionally want to generate datasets to run the rust script on (and you probably do) then you will need python dependencies:

pip3 install numpy scipy tensorflow tensorflow_datasets transformers sentencepiece

Basic Usage

If you just want to reproduce the results of this paper, or deduplicate any language model dataset that's already in the TensorFlow Datasets (TFDS) format, then you can just run the following commands:

cargo build

to compile the rust code, and then run

python3 scripts/load_dataset.py --data_dir $LOAD_DIR --save_dir $SAVE_DIR --name $DATASET --split $SPLIT [--tokenize]

For example, to get the LM1B test set you could run python3 scripts/load_dataset.py --data_dir ~/tensorflow_datasets --save_dir data --name lm1b --split test. This will take just a few seconds to run on the test set, or about an hour if run on the train set instead.

If the dataset is really big, you might want to add the --tokenize flag. This will shrink the dataset by roughly a factor of two by tokenizing it with the GPT-2 tokenizer.
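For intuition on why this roughly halves the size: GPT-2's BPE vocabulary fits in 16 bits and each token covers several bytes of text on average, so storing two bytes per token id takes about half the space of the raw text. Below is a minimal standalone sketch of that arithmetic; the input path is a placeholder, and the exact on-disk format that load_dataset.py produces may differ.

    from transformers import GPT2TokenizerFast
    import numpy as np

    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    text = open("data/example.txt", encoding="utf-8").read()   # placeholder path
    ids = tok(text).input_ids            # GPT-2 vocab (~50k entries) fits in uint16
    packed = np.array(ids, dtype=np.uint16).tobytes()
    print(len(text.encode("utf-8")), "bytes of text ->", len(packed), "bytes of tokens")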

And then to construct the suffix array run

python3 scripts/make_suffix_array.py [path/to/dataset]

For example, if you run python3 scripts/make_suffix_array.py data/lm1b.test, this will create a file data/lm1b.test.table.bin containing the suffix array. Again, this should be fast: a few minutes on 96 cores, or about two hours on the LM1B train set when run single-threaded.

(If you get an error that you have too many open files, that's because this script opens lots of files. You should run ulimit -Sn 1000000 to "fix" the error.)

Querying a suffix array to find duplicated examples

Start by loading and building a suffix array for a dataset as described above.

Once you have the suffix array, you can now query the dataset to find all occurrences of a particular string. To do this, run

python3 scripts/count_occurances.py --suffix [path/to/suffix_array] [--query query_string] [--query_file /path/to/query]

On the LM1B test set, running python3 scripts/count_occurances.py --suffix data/lm1b.test --query " on Tuesday" should return 1288. If you tokenized the dataset, then you should pass --tokenize to count_occurances.py as well, to get the same result (plus or minus tokenization differences).

If you want to confirm that this number is correct (assuming you haven't tokenized), you can run cat /tmp/lm1b.test | grep -ao " on Tuesday" | wc -l and get the same result.
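The same sanity check can be done from Python, assuming the untokenized dataset was written to data/lm1b.test (adjust the path to wherever load_dataset.py saved it):

    data = open("data/lm1b.test", "rb").read()
    print(data.count(b" on Tuesday"))   # should match the count reported above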

Advanced Usage

The above scripts work by calling into the core Rust suffix array deduplicator. If you want to do each step yourself, the following options are available:

Single threaded suffix array construction

To build a suffix array for any particular file, you can run

cargo run save [file_path]

This will create a file called [file_path].table.bin which contains the suffix array for the file provided. This algorithm is linear time, but (a) only runs on a single core, and (b) needs working memory several times the size of the input file (the 64-bit suffix table alone is 8 bytes per input byte), which is prohibitive for large files.

Parallel suffix array construction

To build a suffix array for an extremely large file (e.g., one approaching the amount of RAM available), it is better to run the script

python scripts/make_suffix_array.py [file_path]

This script will build the suffix array in parallel by splitting the single file into chunks, generating suffix arrays for each chunk, and then merging the suffix arrays together to form the full suffix array. Note that in general this algorithm is quadratic, but when the maximum substring length is short relative to the total file length (as it is, when generating suffix arrays for N independent training examples) it will never reach this worst case behavior.
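As a rough illustration of the split-and-merge idea (deliberately ignoring the chunk-boundary handling and disk streaming that the real implementation needs), one can partition the suffix start positions, sort each partition independently, and then k-way merge the sorted pieces:

    import heapq

    def chunked_suffix_array(data: bytes, num_chunks: int = 4):
        # Toy split-and-merge construction, entirely in memory and single-process.
        n = len(data)
        bounds = [n * k // num_chunks for k in range(num_chunks + 1)]
        # Each piece holds the suffixes starting inside one chunk, sorted on its own;
        # in the real tool these pieces are built by parallel jobs.
        pieces = [
            sorted(range(bounds[k], bounds[k + 1]), key=lambda i: data[i:])
            for k in range(num_chunks)
        ]
        # Merge the independently sorted pieces into one globally sorted suffix array.
        return list(heapq.merge(*pieces, key=lambda i: data[i:]))

    data = b"abracadabra abracadabra"
    assert chunked_suffix_array(data) == sorted(range(len(data)), key=lambda i: data[i:])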

The two steps are described below.

Building a piece of a suffix array from a piece of a file

The first step generates a suffix array from a piece of a file. This is implemented by running

cargo run save_part [file_path] [byte_start] [byte_end]

This builds a suffix array for the byte sequence between [byte_start] and [byte_end] of the given file. Multiple of these jobs can be run in parallel to build a suffix array for a file quickly.

Merging suffix array pieces to create a single suffix array

Given several independent suffix array pieces, merging them is now just a matter of calling

cargo run merge_parallel [path_to_partial_suffix_trees,...] [tmp_output_directory]

to generate a collection of ordered suffix array pieces in the output directory. The final step just requires merging these together:

cat [tmp_output_directory]/* > [file_path].table.bin

Finding Duplicates

Given a suffix array file, as generated in the previous section, it can now be queried for interesting statistics. The simplest operation, counting occurrences of a particular substring, takes O(log(N)) time and O(query_length) memory (as shown above with scripts/count_occurances.py). To do this you can run:

cargo run count_occurances /path/to/dataset /path/to/query_file

(Indeed, the Python script is just a wrapper that makes calling this nicer, with the option for tokenization.) This is useful mainly as a command-line interface for interacting with the dataset to find interesting properties. To run more sophisticated analysis, use the tools described below.
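As an aside, the logarithmic-time counting that count_occurances performs can be illustrated with a toy in-memory sketch (Python 3.10+ for the key argument to bisect; this is only an illustration, not the Rust implementation, which streams the table from disk):

    import bisect

    def build_suffix_array(data: bytes):
        # O(n^2 log n) toy construction; fine for small examples only.
        return sorted(range(len(data)), key=lambda i: data[i:])

    def count_occurrences(data: bytes, table, query: bytes) -> int:
        # All suffixes beginning with `query` form one contiguous block of the
        # sorted table; two binary searches find the block's boundaries.
        lo = bisect.bisect_left(table, query, key=lambda i: data[i:i + len(query)])
        hi = bisect.bisect_right(table, query, key=lambda i: data[i:i + len(query)])
        return hi - lo

    data = b"aaabbb on Tuesday, again on Tuesday"
    table = build_suffix_array(data)
    print(count_occurrences(data, table, b" on Tuesday"))   # 2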

Finding duplicates between two documents

Given a document A and another document B, we can find all duplicates between the two by (1) constructing suffix arrays for both, and then (2) linearly walking the suffix arrays in order to find all duplicates of a given length.
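To make the merge-walk concrete, here is a toy sketch under the assumption that both documents fit in memory. It reports shared substrings found at adjacent cross-document suffix pairs; the Rust implementation streams the on-disk tables and does much more careful bookkeeping:

    import heapq

    def build_suffix_array(data: bytes):
        # Toy O(n^2 log n) construction for illustration only.
        return sorted(range(len(data)), key=lambda i: data[i:])

    def common_prefix_len(a: bytes, b: bytes) -> int:
        n = min(len(a), len(b))
        k = 0
        while k < n and a[k] == b[k]:
            k += 1
        return k

    def cross_duplicates(doc_a: bytes, doc_b: bytes, threshold: int):
        # Walk the two suffix arrays in merged lexicographic order; whenever two
        # adjacent suffixes come from different documents, their common prefix is
        # a substring shared by both documents.
        merged = heapq.merge(
            ((doc_a[i:], "A", i) for i in build_suffix_array(doc_a)),
            ((doc_b[j:], "B", j) for j in build_suffix_array(doc_b)),
        )
        hits, prev = [], None
        for suffix, src, pos in merged:
            if prev is not None and prev[1] != src:
                l = common_prefix_len(prev[0], suffix)
                if l >= threshold:
                    b_pos = pos if src == "B" else prev[2]
                    hits.append((b_pos, l))   # (position in doc_b, shared length)
            prev = (suffix, src, pos)
        return hits

    print(cross_duplicates(b"the quick brown fox", b"a quick brown dog", threshold=6))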

Once the suffix array for the dataset has been constructed, this algorithm therefore requires time O(len(dataset) + len(query)) and space O(len(dataset)). It is better to run this algorithm when the number of queries into the dataset is greater than O(len(dataset)/log(len(query))). However, note that the prior code requires disk seeks while this implementation is a linear scan through the suffix array table, so in practice there is at least a factor-of-10 speedup here. As a rough order of magnitude, for a ~100GB dataset, it is faster to run similar_parallel when querying with more than a few megabytes of text. Otherwise it is probably faster to run count_occurances.

Notice that this command also requires that the entire dataset fits in memory. For many datasets this is not a problem, but the C4 dataset is 350 GB and the Pile dataset is 750 GB (both even after tokenization). The machine must therefore have a lot of RAM for this to work.

cargo run similar_parallel [dataset1] [dataset2]

This creates lots of output files containing the positions of all examples in dataset2 that are also in dataset1. (The code could also do the inverse at the same time, if you want to modify it slightly.) However, it spits this out in a not-very-useful form: a list of positions x_i such that dataset2[x_i:x_i+100] is also in dataset1, and these matches probably overlap.

The second step is then to run

cargo run collect_similar [dataset2]

This converts the result into ranges, so that instead we have matches of the form dataset2[x_i:y_i].
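The range-collapsing idea is simple to picture: given sorted match start positions and a fixed match length, overlapping or touching windows are merged into maximal ranges. A toy sketch of just that step (the real collect_similar reads and writes its own on-disk formats):

    def collapse_positions(positions, match_len):
        # Merge sorted start positions of fixed-length matches into (start, end) ranges.
        ranges = []
        for x in sorted(positions):
            if ranges and x <= ranges[-1][1]:   # overlaps or touches the previous range
                ranges[-1] = (ranges[-1][0], max(ranges[-1][1], x + match_len))
            else:
                ranges.append((x, x + match_len))
        return ranges

    print(collapse_positions([10, 15, 400], match_len=100))   # [(10, 115), (400, 500)]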

Finding duplicates within one document

To find duplicates that are contained within one document (for example, to actually deduplicate a dataset as we do in the paper) run the command

cargo run selfsimilar_parallel [dataset]

This will find all repeated substrings contained in the dataset above a given length threshold. Again, run collect_similar to find the indices of repeated examples.
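Conceptually this is the single-dataset version of the walk sketched earlier: any substring that occurs twice shows up as a pair of adjacent suffixes with a long common prefix. A toy illustration (not the Rust implementation, which streams the on-disk table):

    def self_repeats(data: bytes, threshold: int):
        # Report (position, length) of substrings of length >= threshold that occur
        # more than once in data; toy version that builds the table in memory.
        table = sorted(range(len(data)), key=lambda i: data[i:])
        hits = []
        for a, b in zip(table, table[1:]):
            # Length of the common prefix of two adjacent suffixes.
            l = 0
            while data[a + l:a + l + 1] == data[b + l:b + l + 1] != b"":
                l += 1
            if l >= threshold:
                hits.append((min(a, b), l))
        return hits

    print(self_repeats(b"xx the cat sat. the cat ran. zz", threshold=8))
    # [(2, 9), (3, 8)]: the repeated " the cat " run and its "the cat " sub-run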

Approx Deduplication Results

The following CSVs contain three columns: the document ID, a boolean indicating whether or not this document was deleted during deduplication, and a cluster ID. Documents with the same cluster ID were identified as near-duplicates. For C4 and RealNews, the document ID is the url associated with the document. For Wiki-40B, it is the wikidata_id. LM1B coming soon.

Name      Link   Size
C4        link   13GB
RealNews  link   1.4GB
Wiki-40B  link   26MB
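As an example of consuming these files, here is a sketch that groups documents by cluster and lists the largest clusters. The filename, the absence of a header row, and the encoding of the deleted column are all assumptions; adjust them to the actual CSVs:

    import csv
    from collections import defaultdict

    clusters = defaultdict(list)
    with open("c4_near_duplicate_clusters.csv", newline="") as f:   # placeholder filename
        for doc_id, deleted, cluster_id in csv.reader(f):           # assumes exactly three columns
            clusters[cluster_id].append((doc_id, deleted))

    # Largest near-duplicate clusters first.
    for cluster_id, docs in sorted(clusters.items(), key=lambda kv: -len(kv[1]))[:10]:
        print(cluster_id, len(docs))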
Comments
  • Can the tool run on plain text files?

    Hello, I'm trying to deduplicate several plain text files. If I run python scripts/make_suffix_array.py myfile.en, it correctly generates the myfile.en.table.bin file.

    However, if I run cargo selfsimilar_parallel myfile.en, it shows no duplicates.

    myfile.en contains the same string 10 times, so I am wondering whether I have to use the TFDS format or not.

    opened by m-resta 20
  • Accessing the duplicates and their counts

    Hey, thanks for releasing the code!

    I'm a bit confused about how to use the dups_ and sizes_ files. I would like to get a mapping between all duplicate strings and their corresponding number of appearances in the data. From my understanding, this is what you get with the points in those files, but I don't understand how to read them. Any explanation would be helpful! (and a code snippet / reference would be even better!)

    Thanks

    opened by yanaiela 12
  • Error on self deduplication

    I am planning to reproduce the self deduplication result for lm1b. I have already produced the result mentioned in the readme here.

    However, when running selfsimilar_parallel, it shows Final answer 0 and when running collect_similar it throws an error of thread 'main' panicked at 'index out of bounds: the len is 0 but the index is 0', src/main.rs:1244:26. Am I missing something here?

    Log:

    $ python3 scripts/count_occurances.py --suffix dataset_save/lm1b.test --query " on Tuesday"
    b' on Tuesday'
    Number of times present: 1288
    
    
    $ cargo run selfsimilar_parallel dataset_save/lm1b.test
    warning: function is never used: `get_example_index`
       --> src/main.rs:447:4
        |
    447 | fn get_example_index(table:&[u64], position:u64) -> usize{
        |    ^^^^^^^^^^^^^^^^^
        |
        = note: `#[warn(dead_code)]` on by default
    
    warning: unused `Result` that must be used
       --> src/main.rs:367:2
        |
    367 |     tablestream.file.read_exact(&mut tablestream.cache);
        |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        |
        = note: `#[warn(unused_must_use)]` on by default
        = note: this `Result` may be an `Err` variant, which should be handled
    
    warning: unused `Result` that must be used
       --> src/main.rs:379:2
        |
    379 |     file.read_exact(&mut cache);
        |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        |
        = note: this `Result` may be an `Err` variant, which should be handled
    
    warning: 3 warnings emitted
    
        Finished dev [optimized + debuginfo] target(s) in 0.02s
         Running `target/debug/dedup_dataset selfsimilar_parallel dataset_save/lm1b.test`
    Start load!
    Loading ratio is 8
    0 / 453700
    Final answer 0
    
    
    $ cargo run collect_similar dataset_save/lm1b.test
    warning: function is never used: `get_example_index`
       --> src/main.rs:447:4
        |
    447 | fn get_example_index(table:&[u64], position:u64) -> usize{
        |    ^^^^^^^^^^^^^^^^^
        |
        = note: `#[warn(dead_code)]` on by default
    
    warning: unused `Result` that must be used
       --> src/main.rs:367:2
        |
    367 |     tablestream.file.read_exact(&mut tablestream.cache);
        |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        |
        = note: `#[warn(unused_must_use)]` on by default
        = note: this `Result` may be an `Err` variant, which should be handled
    
    warning: unused `Result` that must be used
       --> src/main.rs:379:2
        |
    379 |     file.read_exact(&mut cache);
        |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        |
        = note: this `Result` may be an `Err` variant, which should be handled
    
    warning: 3 warnings emitted
    
        Finished dev [optimized + debuginfo] target(s) in 0.02s
         Running `target/debug/dedup_dataset collect_similar dataset_save/lm1b.test`
    Sorting.
    Sorted.
    thread 'main' panicked at 'index out of bounds: the len is 0 but the index is 0', src/main.rs:1244:26
    note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
    
    opened by zijwang 10
  • Error with table size not being divisible by text size

    Hi, I'm getting an error because the size of the suffix table is not divisible by the length of the text. https://github.com/google-research/deduplicate-text-datasets/blob/8a172b0d815862b1131da203be24d430e121a725/src/main.rs#L479

    I'm running bash scripts/deduplicate_single_file.sh test.txt test_dedup.txt 20 1, where test.txt just contains a few paragraphs from a random Wikipedia article and some duplicate text that I manually added. I'm doing this mainly for debugging purposes (I would like to later make some edits to keep the first occurrence of duplicate samples and throw away the rest). If I run the command on my actual dataset, which is roughly ~70GB, I don't encounter this issue. So I'm wondering what the problem is. Does the code not work with datasets that are too small?

    Thanks!

    Update: I just found out that running the command on the actual 70GB dataset also raised the same error.

    opened by jinyongyoo 7
  • How to dedup between two datasets?

    A practical situation is that given two datasets A and B, we want to remove the data in A that has huge overlap with B. Is there a command that I could use to achieve this? The readme only covers finding duplicates within a single document or between a pair of documents.

    opened by mralexis1 7
  • Why not use Simhash?

    Google has shown that Simhash is practically useful for identifying near-duplicates in web documents belonging to a multi-billion-page repository (Detecting Near-Duplicates for Web Crawling). In your paper, you chose MinHash for approximate matching. Why not use Simhash in this scenario?

    opened by Ethan-yt 3
  • question about deduplication cluster size

    As shown in the attached pictures, the cluster starting at 0x02954cb9 has a size of 3, but when I count it using bytes.count(), it shows 2.

    I tried different datasets and observed the same phenomenon. Did I make a mistake about the size meaning?

    opened by everks 2
  • Unexpected behavior with ending symbols

    Hi again,

    I found that count-occurrences has unexpected behavior if you want to count the last symbols in a sequence. Here are some examples:

    • sequence "aaabbb", query "b": expected 3, but the output is Number of times present: 2
    • sequence "aaabbb", query "bb": expected 2, but the actual output is Number of times present: 0

    Can you fix this? Thanks!

    opened by mitya52 2
  • "failed to fill whole buffer" errors

    Hi,

    I have tried to run the code on a simple string, and count-occurrences fails with a "failed to fill whole buffer" error.

    Here are steps to reproduce:

    1. Run ./target/debug/dedup_dataset make --data-file dup.txt, where the data file dup.txt contains the simple string "aaabbb"
    2. then run ./target/debug/dedup_dataset count-occurrences --data-file dup.txt --query-file query.txt, where query.txt contains
    • "bb" expectation: Number of times present: 2 reality: thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Error { kind: UnexpectedEof, message: "failed to fill whole buffer" }', src/main.rs:275:31;
    • "ab" expectation: Number of times present: 1 reality: thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Error { kind: UnexpectedEof, message: "failed to fill whole buffer" }', src/main.rs:297:31;
    • "b" expectation: Number of times present: 2 reality: Number of times present: 1;

    Maybe I'm doing something wrong? Thanks.

    opened by mitya52 2
  • Should newline char be removed

    Hi, so I noticed that this read here adds a \n char to the end of the query. This then causes an issue with the count if it's not actually an end-of-line. Should there be a .strip() added here?

    arr = open(args.query_file,"rb").read().strip()

    Thanks.

    opened by cperiz 1
  • Fix multiprocessing bug in Windows/Mac OS X

    The multiprocessing pool that gets started uses the default method for launching child processes, which is OS-specific. The default on Unix is "fork", and the resulting process inherits all resources from the parent process. Conversely, the default on Mac OS X/Windows is "spawn", which results in a minimal number of resources being inherited by the child process.

    I changed the code to explicitly use "fork", and I was able to run through the README on a Mac M1. I presume this fix would also help Windows users, though I haven't tested myself.

    While I was at it, I fixed a small typo in the README.

    Thanks for sharing the code and having a really nice README!

    opened by alistairewj 1
  • Off-by-1 error in `collect`?

    Hi, thanks for the great repo!

    I'm using the tool to deduplicate a dataset, and I'm trying to investigate what happens in subsequent steps. I noticed that after running collect, some of the duplicate strings seem to start with control characters, e.g. after running code similar to this:

    >>> data=open("data/my_data.test","rb").read()
    >>> data[left:right]
    

    where left and right are one of the pairs returned by collect, I get something like this:

    b'\x00\x00Enter the username or e-mail you used in your profile. A password reset link will be sent to you by email.'
    

    I clean up the control characters in my main text, so it looks like parts of the separator codes are being leaked. Interestingly, this doesn't happen consistently, but it does happen more on the more frequent strings. Also, the matched documents from my original dataset don't contain control characters.

    Any chance there's some sort of off-by-1 error in collect? Not a huge deal, but I'd like to understand what's happening here.

    opened by ola13 0
  • how to deduplicate huggingface datasets

    Hey there, excellent work on this repo and the paper.

    I wanted to know how I could use this to deduplicate my custom Hugging Face dataset that I have developed and cleaned.

    It has been saved with custom_dataset.save_to_disk("dataset_path")

    and can be loaded with custom_dataset = datasets.load_from_disk("dataset_path")

    opened by StephennFernandes 6
  • Fix to issue #17 limits cmd_merge to be single-threaded

    Hi,

    it looks like the fix for issue #17, which puts some limits on the number of threads in cmd_merge, is a bit too aggressive, resulting in only using a single thread even for big workloads:

    https://github.com/google-research/deduplicate-text-datasets/blob/ad86c7f65ac626581fe3a4277106309bc6b50c23/src/main.rs#L1020-L1023

    texts.len() is equal to nn (the number of input parts); I think you want something like

        let num_threads = std::cmp::min(num_threads, std::cmp::max((texts_len.iter().sum::<usize>() as i64 - 1024)/10, 1));
    

    instead.

    opened by kleinj 2
  • RAM crash when use collect method

    First of all, thanks for releasing the code.

    I have a dataset (mc4) of about 110 GB.

    My machine has 96 CPU cores and 350 GB of RAM.

    I've successfully created a 524 GB suffix array from that dataset.

    I also managed to run the deduplicator (self-similar method with a threshold of 100) with no memory issues, creating about ~140 GB of cache files (20B examples).

    But when I run the collect method, my RAM blows up after a few minutes.

    I traced the code and found that my RAM crashes when this code/step runs: https://github.com/google-research/deduplicate-text-datasets/blob/ad86c7f65ac626581fe3a4277106309bc6b50c23/src/main.rs#L1188

    Is this expected? Do you have a workaround for the issue?

    AFAIK, the collect method just merges all the duplicate sequences found in the dataset and only returns a text file with pairs of byte offsets, CMIIW.

    I'm thinking maybe the text file could be written out as soon as each cache file finishes being processed/read, instead of waiting for all of them to finish (this is just a suggestion; I don't know if it's possible, I'm not an expert in Rust).

    Thank you

    opened by acul3 1
  • Error when running the code

    Hi,

    I'm trying to deduplicate my plain text file, but it shows some errors. I first run

    python scripts/make_suffix_array.py c4-train.00000-of-01024.txt
    

    The output is

    ./target/debug/dedup_dataset make-part --data-file c4-train.00000-of-01024.txt --start-byte 0 --end-byte 114700294
    ./target/debug/dedup_dataset make-part --data-file c4-train.00000-of-01024.txt --start-byte 114600294 --end-byte 229300588
    ./target/debug/dedup_dataset make-part --data-file c4-train.00000-of-01024.txt --start-byte 229200588 --end-byte 343900882
    ./target/debug/dedup_dataset make-part --data-file c4-train.00000-of-01024.txt --start-byte 343800882 --end-byte 458401177
    Waiting for jobs to finish
    Checking all wrote correctly
    FACT 4.0
    FACT 4.0
    FACT 4.0
    FACT 4.0
    Rerunning 0 jobs because they failed.
    Merging suffix trees
    ./target/debug/dedup_dataset merge --output-file tmp/out.table.bin --suffix-path c4-train.00000-of-01024.txt.part.0-114700294 --suffix-path c4-train.00000-of-01024.txt.part.114600294-229300588 --suffix-path c4-train.00000-of-01024.txt.part.229200588-343900882 --suffix-path c4-train.00000-of-01024.txt.part.343800882-458401177 --num-threads 256
    thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 24, kind: Uncategorized, message: "Too many open files" }', src/main.rs:875:125
    note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
    thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 24, kind: Uncategorized, message: "Too many open files" }', src/main.rs:222:77
    [... many more worker threads panic with the same "Too many open files" error, interleaved, at src/main.rs:222:77 and src/main.rs:875:125 ...]
    thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Any { .. }', /home/yiming/.cargo/registry/src/github.com-1ecc6299db9ec823/crossbeam-0.3.2/src/scoped.rs:34:43
    Now merging individual tables
    Cleaning up
    

    Yet, it successfully creates the suffix array files:

    c4-train.00000-of-01024.txt.part.0-114700294
    c4-train.00000-of-01024.txt.part.0-114700294.table.bin
    c4-train.00000-of-01024.txt.part.114600294-229300588
    c4-train.00000-of-01024.txt.part.114600294-229300588.table.bin
    c4-train.00000-of-01024.txt.part.229200588-343900882
    c4-train.00000-of-01024.txt.part.229200588-343900882.table.bin  
    c4-train.00000-of-01024.txt.part.343800882-458401177           
    c4-train.00000-of-01024.txt.part.343800882-458401177.table.bin  
    c4-train.00000-of-01024.txt.table.bin
    

    Then, I run

    cargo run self-similar --data-file c4-train.00000-of-01024.txt --length-threshold 15 --cache-dir cache --num-threads 128
    

    It gives me the error below:

        Finished dev [optimized + debuginfo] target(s) in 5.69s
         Running `target/debug/dedup_dataset self-similar --data-file c4-train.00000-of-01024.txt --length-threshold 15 --cache-dir cache --num-threads 128`
    Start load!
    thread 'main' panicked at 'assertion failed: metadata.len() % (text.len() as u64) == 0', src/main.rs:479:5
    note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
    

    May I ask how to fix this? Thank you!

    Yiming

    opened by MatthewCYM 14