Deduplicating Training Data Makes Language Models Better

Google Research

Last update: Dec 27, 2022

Overview

This repository contains code to deduplicate language model datasets as described in the paper "Deduplicating Training Data Makes Language Models Better" by Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch and Nicholas Carlini. We release the ExactSubstr deduplication implementation (written in Rust) along with the scripts we used in the paper to perform ExactSubstr deduplication and inspect the results (written in Python). We also release the document clusters resulting from running NearDup deduplication on C4, RealNews, LM1B, and Wiki-40B-en.

This is not an officially supported Google product.

Why deduplicate?

When datasets are created by scraping raw text from the Internet, this will often result in the same sequences being repeated multiple times (e.g., we find a single 50 word sequence that is repeated in the C4 dataset 60,000 times). Training models on deduplicated datasets is faster (because they see fewer total examples) and experimentally results in models with similar or better perplexity to models trained on data that hasn't been deduplicated. Moreover, language models are less likely to exhibit memorization when their training data has been well-deduplicated.

Citing this work

If you use this repository or our deduplicated datasets you can cite

@article{lee2021deduplicating,
      title={Deduplicating Training Data Makes Language Models Better}, 
      author={Katherine Lee and Daphne Ippolito and Andrew Nystrom and Chiyuan Zhang and Douglas Eck and Chris Callison-Burch and Nicholas Carlini},
      journal={arXiv preprint arXiv:2107.06499},
      year={2021},
}

Exact Deduplication Code

We provide an implementation of the exact deduplication technique used in the paper. This is very much research code. It is a (very slightly cleaned up) version of exactly what we do in the paper. It assumes that you want to deduplicate something the size of C4 (~300GB) running on a machine with 96 cores and >600GB of RAM. If you only want to use this for reasonably-sized datasets, you should change the number of parallel threads from 96 to something smaller. If your machine is big enough, there should be no upper bound on the size of the dataset it can handle (well, 2^64-1 bytes is the limit, but I think we can all agree that's essentially unlimited).

We build a suffix array (based on Andrew Gallant's suffix array implementation) in src/table.rs. It has some minor changes from the original version that make it so we can't just import this library as a crate. First, we need 64-bit integers. The original implementation says that u32 works for "reasonably sized documents (~4GB)" but we're working with unreasonably sized documents. So we need u64. Second, we don't want UTF8 strings. Everything is a [u8] byte array, because we might be working over token sequences which aren't valid UTF8. The main complication in the rest of src/main.rs is the fact that we want things to run in parallel, and we probably can't fit the entire suffix array into memory. And so all of our algorithms are designed around these constraints.
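To make the data structure concrete, here is a minimal Python sketch of a suffix array over raw bytes. It is not the algorithm used in src/table.rs: the function name is made up for illustration, and the construction is a naive sort that is only practical for tiny inputs. It does show the two properties mentioned above: the input is an arbitrary byte string rather than UTF-8 text, and the result is a list of integer offsets into that byte string.

```python
def build_suffix_array(data):
    """Return the start offsets of all suffixes of `data` in sorted order.

    Naive construction (sort every suffix), for illustration only.
    `data` is a bytes object, so it does not need to be valid UTF-8.
    """
    return sorted(range(len(data)), key=lambda i: data[i:])


data = b"banana"
suffix_array = build_suffix_array(data)
print(suffix_array)  # [5, 3, 1, 0, 4, 2]
for start in suffix_array:
    # Suffixes come out in lexicographic order: a, ana, anana, banana, na, nana
    print(start, data[start:])
```

Because equal substrings end up next to each other in this ordering, both counting a query and finding repeated substrings reduce to scans or binary searches over the array.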

If you just want to run the rust deduplicator, then you will only need to install Rust:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

If you additionally want to generate datasets to run the rust script on (and you probably do) then you will need python dependencies:

pip3 install numpy scipy tensorflow tensorflow_datasets transformers sentencepiece

Basic Usage

If you just want to reproduce the results of this paper, or deduplicate any language modeling dataset that's already in the TensorFlow Datasets (TFDS) format, then you can just run the following commands:

cargo build

to compile the rust code, and then run

python3 scripts/load_dataset.py --data_dir $LOAD_DIR --save_dir $SAVE_DIR --name $DATASET --split $SPLIT [--tokenize]

For example, to get the LM1B test set you could run python3 scripts/load_dataset.py --data_dir ~/tensorflow_datasets --save_dir data --name lm1b --split test. This should take just a few seconds to run on the test set, or about an hour if running with the train set instead.

If the dataset is really big, you might want to add the --tokenize flag. This will shrink the dataset by roughly a factor of two by tokenizing it with the GPT-2 tokenizer.
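One way to see where a saving of this kind can come from: the GPT-2 vocabulary has 50,257 entries, so every token ID fits in an unsigned 16-bit integer, and a token usually covers more than two bytes of text. The sketch below packs token IDs into two bytes each. The little-endian layout, the helper names, and the example IDs are assumptions for illustration; check scripts/load_dataset.py for the format the script actually writes.

```python
import struct


def pack_token_ids(token_ids):
    """Pack token IDs as little-endian unsigned 16-bit integers (assumed layout)."""
    return struct.pack(f"<{len(token_ids)}H", *token_ids)


def unpack_token_ids(raw):
    """Inverse of pack_token_ids."""
    return list(struct.unpack(f"<{len(raw) // 2}H", raw))


ids = [15496, 995, 50256]  # hypothetical token IDs, all below 2**16
raw = pack_token_ids(ids)
print(len(raw))  # 6 bytes for 3 tokens
print(unpack_token_ids(raw) == ids)  # True
```

Byte strings produced this way are generally not valid UTF-8, which is consistent with the deduplicator treating everything as a [u8] byte array.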

And then to construct the suffix array run

python3 scripts/make_suffix_array.py [path/to/dataset]

For example, if you run python3 scripts/make_suffix_array.py data/lm1b.test, this will create a file data/lm1b.test.table.bin containing the suffix array. Again, this should be fast: about two hours on the LM1B train set when run single-threaded, and a few minutes on 96 cores.

(If you get an error that you have too many open files, that's because this script opens lots of files. You should run ulimit -Sn 1000000 to "fix" the error.)
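If you launch the pipeline from your own Python driver rather than an interactive shell, the same soft limit can be inspected and raised with the standard resource module (Unix only). This is a sketch under that assumption; the helper names are hypothetical and the repository scripts do not necessarily do this themselves.

```python
import resource  # Unix-only standard library module


def target_soft_limit(wanted, hard):
    """Largest soft limit that can be requested without exceeding `hard`."""
    if hard == resource.RLIM_INFINITY:
        return wanted
    return min(wanted, hard)


def raise_open_file_limit(wanted=1_000_000):
    """Try to raise the soft RLIMIT_NOFILE of this process; return the limits."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    new_soft = target_soft_limit(wanted, hard)
    if soft != resource.RLIM_INFINITY and new_soft > soft:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
        except (ValueError, OSError) as err:
            # Some systems enforce a lower ceiling than the reported hard limit.
            print("could not raise the limit:", err)
    return resource.getrlimit(resource.RLIMIT_NOFILE)


print(raise_open_file_limit())  # (soft, hard) after the attempt
```

The limit applies to the current process and to children it starts afterwards, so the call has to happen before the Rust binary is spawned.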

Querying a suffix array to find duplicated examples

Start by loading and building a suffix array for a dataset as described above

Once you have the suffix array, you can query the dataset to find all occurrences of a particular string. To do this, run

python3 scripts/count_occurances.py --suffix [path/to/suffix_array] [--query query_string] [--query_file /path/to/query]

On the LM1B test set, running python3 scripts/count_occurances.py --suffix data/lm1b.test --query " on Tuesday" should return 1288. If you tokenized the dataset, then you should pass --tokenize to count_occurances.py as well, to get the same result (plus or minus tokenization differences).

If you want to confirm that the outputted number is correct (assuming you haven't tokenized), you can run cat data/lm1b.test | grep -ao " on Tuesday" | wc -l and get the same result.

Advanced Usage

The above scripts work by calling into the core Rust suffix array deduplicator. If you want to do each step yourself, the following options are available:

Single threaded suffix array construction

To build a suffix array for any particular file, you can run

cargo run save [file_path]

This will create a file called [file_path].table.bin which contains the suffix array for the file provided. This algorithm is linear time, but (a) only runs on a single core, and (b) has memory requirement O(big * len(file)) which is prohibitive for large files.

Parallel suffix array construction

To build a suffix array for an extremely large file (e.g., about as large as the available RAM) it is better to run the script

python scripts/make_suffix_array.py [file_path]

This script will build the suffix array in parallel by splitting the single file into chunks, generating suffix arrays for each chunk, and then merging the suffix arrays together to form the full suffix array. Note that in general this algorithm is quadratic, but when the maximum substring length is short relative to the total file length (as it is, when generating suffix arrays for N independent training examples) it will never reach this worst case behavior.

The two steps are described below.

Building a piece of a suffix array from a piece of a file

The first step generates a suffix array from a piece of a file. This is implemented by running

cargo run save_part [file_path] [byte_start] [byte_end]

This builds a suffix array for the byte sequence between [byte_start] and [byte_end] of the given file. Several of these can be run in parallel to build a suffix array for a file quickly.
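As a rough illustration of how a file might be cut into pieces for this step, the sketch below computes one (byte_start, byte_end) pair per job. The equal-sized chunks and the 100,000-byte overlap between neighbouring pieces are assumptions inferred from the make-part log quoted in the comments further down, not a documented interface, and the function name is made up.

```python
def chunk_ranges(total_bytes, num_jobs, overlap=100_000):
    """Split [0, total_bytes) into num_jobs byte ranges that overlap slightly.

    Assumed scheme: equal-sized chunks, each extended by `overlap` bytes into
    the next one; the last range always ends at the end of the file.
    """
    chunk = total_bytes // num_jobs
    ranges = []
    for job in range(num_jobs):
        start = job * chunk
        if job == num_jobs - 1:
            end = total_bytes
        else:
            end = min((job + 1) * chunk + overlap, total_bytes)
        ranges.append((start, end))
    return ranges


for start, end in chunk_ranges(458401177, 4):
    # Each pair could be passed as [byte_start] [byte_end] to one worker.
    print(start, end)
```

With a 458,401,177-byte file and 4 jobs this yields the same four ranges as that log.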

Merging suffix array pieces to create a single suffix array

Given the several independent suffix arrays, merging them is now just a matter of calling

cargo run merge_parallel [path_to_partial_suffix_trees,...] [tmp_output_directory]

to generate a collection of ordered suffix array pieces in the output directory. The final step just requires merging these together

cat [tmp_output_directory]/* > [file_path].table.bin
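The idea behind the merge can be sketched in a few lines of Python: if every piece is already sorted, a k-way merge that compares the underlying suffixes produces the full suffix array. This is a toy model, not the merge_parallel implementation. In particular, each piece here is sorted by the suffix of the whole text, which sidesteps the boundary handling that the real code has to do, and both function names are hypothetical.

```python
import heapq


def partial_suffix_array(data, start, end):
    """Offsets in [start, end), ordered by the suffix of the whole text."""
    return sorted(range(start, end), key=lambda i: data[i:])


def merge_suffix_arrays(data, parts):
    """k-way merge of already-sorted offset lists into one suffix array."""
    return list(heapq.merge(*parts, key=lambda i: data[i:]))


data = b"abracadabra"
parts = [
    partial_suffix_array(data, 0, 4),
    partial_suffix_array(data, 4, 8),
    partial_suffix_array(data, 8, 11),
]
merged = merge_suffix_arrays(data, parts)
# The merged result matches a suffix array built in one go.
print(merged == sorted(range(len(data)), key=lambda i: data[i:]))  # True
```

Each comparison in the merge looks at two suffixes, which is cheap when shared prefixes are short and expensive when they are long; that is the intuition behind the quadratic worst case mentioned above.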

Finding Duplicates

Given a suffix array file, as generated in the previous section, it can now be queried for interesting statistics. The simplest operation, counting occurrences of particular substrings, takes O(log(N)) time and O(query_length) memory (as shown above with scripts/count_occurances.py). To do this you can run:

cargo run count_occurances /path/to/dataset /path/to/query_file

(Indeed, the python script is just a wrapper that makes calling this nicer, with the option for tokenization.) This is useful mainly as a commandline interface to interact with the dataset to find interesting properties. To run more sophisticated analysis, use the tools described below:
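The logarithmic running time comes from binary search: all suffixes that start with the query form one contiguous block of the suffix array, so two binary searches locate its boundaries. The Python sketch below illustrates this on an in-memory byte string. It is a simplified model with a hypothetical function name and a naive suffix array, not a port of the Rust command.

```python
def count_occurrences(data, suffix_array, query):
    """Count how often `query` occurs in `data` using binary search."""

    def first_index(accept):
        # Smallest index whose truncated suffix satisfies `accept`.
        lo, hi = 0, len(suffix_array)
        while lo < hi:
            mid = (lo + hi) // 2
            start = suffix_array[mid]
            prefix = data[start:start + len(query)]
            if accept(prefix):
                hi = mid
            else:
                lo = mid + 1
        return lo

    lower = first_index(lambda prefix: prefix >= query)
    upper = first_index(lambda prefix: prefix > query)
    return upper - lower


data = b"aaabbb"
suffix_array = sorted(range(len(data)), key=lambda i: data[i:])  # naive build
print(count_occurrences(data, suffix_array, b"b"))   # 3
print(count_occurrences(data, suffix_array, b"bb"))  # 2
print(count_occurrences(data, suffix_array, b"ab"))  # 1
```

Only one query-length slice is held in memory per comparison, which matches the O(query_length) memory bound.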

Finding duplicates between two documents

Given a document A and another document B, we can find all duplicates between the two by (1) constructing suffix arrays for both, and then (2) linearly walking the suffix arrays in order to find all duplicates of a given length.

Once the suffix array for the dataset has been constructed, this algorithm therefore requires time O(len(dataset) + len(query)) and space O(len(dataset)). It is better to run this algorithm when the number of queries into the dataset is greater than O(len(dataset)/log(len(query))). However, note that the prior code requires disk seeks and this implementation is a linear scan through the suffix array table, so in practice there is at least a factor-of-10 speedup here. As a rough order of magnitude, for a dataset of ~100GB, it is faster to run similar_parallel when querying with more than a few megabytes of text. Otherwise it is probably faster to run count_occurances.

Notice that this command also requires that the entire dataset fits in memory. For many datasets this is not a problem, but the C4 dataset is 350 GB and the Pile dataset is 750 GB (both even after tokenization). The machine must therefore have a lot of RAM for this to work.

cargo run similar_parallel [dataset1] [dataset2]

This creates lots of files containing the positions of all examples in dataset2 that are also in dataset1. (The code could also do the inverse at the same time, if you want to modify it slightly.) However, it spits this out in a not-very-useful form: a list of positions x_i such that dataset2[x_i:x_i+100] is also in dataset1. These matches will probably overlap.
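The linear walk can be sketched as a two-pointer scan over the two sorted suffix arrays. The Python below is a toy in-memory version with a naive suffix array, a hypothetical function name, and a configurable window length k in place of the fixed 100. It is meant to show the shape of the output, not the on-disk format that similar_parallel writes.

```python
def positions_shared_with(a, b, k):
    """Sorted offsets x such that b[x:x+k] also occurs somewhere in a."""
    # Only offsets with at least k bytes remaining can start a full window.
    sa_a = sorted(range(len(a) - k + 1), key=lambda i: a[i:])
    sa_b = sorted(range(len(b) - k + 1), key=lambda i: b[i:])

    hits = []
    j = 0  # pointer into sa_a; it only ever moves forward
    for x in sa_b:
        window = b[x:x + k]
        # Skip windows of a that sort before the current window of b.
        while j < len(sa_a) and a[sa_a[j]:sa_a[j] + k] < window:
            j += 1
        if j < len(sa_a) and a[sa_a[j]:sa_a[j] + k] == window:
            hits.append(x)
    return sorted(hits)


a = b"the quick brown fox"
b = b"a quick brown dog"
print(positions_shared_with(a, b, k=6))
```

Every reported offset marks one window, so a long shared passage shows up as a run of consecutive offsets.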

The second step is then to run

cargo run collect_similar [dataset2]

This converts the result into ranges, so that each dataset2[x_i:y_i] is a match.
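Conceptually, this step is an interval merge. The sketch below assumes the input is a list of start offsets, each standing for a window of k bytes, and collapses overlapping or touching windows into half-open ranges. The function name and the choice to merge touching windows are assumptions made for the illustration.

```python
def collapse_to_ranges(positions, k):
    """Merge windows [x, x + k) into disjoint half-open (start, end) ranges."""
    ranges = []
    for x in sorted(positions):
        if ranges and x <= ranges[-1][1]:
            # Overlaps (or touches) the previous range: extend it.
            ranges[-1][1] = max(ranges[-1][1], x + k)
        else:
            ranges.append([x, x + k])
    return [tuple(r) for r in ranges]


print(collapse_to_ranges([0, 1, 2, 10], k=5))  # [(0, 7), (10, 15)]
```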

Finding duplicates within one document

To find duplicates that are contained within one document (for example, to actually deduplicate a dataset as we do in the paper) run the command

cargo run selfsimilar_parallel [dataset]

This will find all repeated substrings contained in the dataset above a given length threshold. Again, run collect_similar to find the indices of repeated examples.
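Why a suffix array makes this cheap: two positions that share their first k bytes are sorted next to each other, or are separated only by other suffixes sharing the same k bytes. It is therefore enough to compare each suffix with its neighbour in the array. The Python sketch below demonstrates this on a tiny input; the function name is hypothetical and the suffix array is built naively.

```python
def repeated_positions(data, k):
    """Offsets x such that data[x:x+k] occurs at least twice in data."""
    suffix_array = sorted(range(len(data)), key=lambda i: data[i:])
    repeated = set()
    for prev, cur in zip(suffix_array, suffix_array[1:]):
        window = data[cur:cur + k]
        # Neighbouring suffixes that agree on a full k-byte window are duplicates.
        if len(window) == k and data[prev:prev + k] == window:
            repeated.update((prev, cur))
    return sorted(repeated)


print(repeated_positions(b"abcdXabcdY", k=4))  # [0, 5]
```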

Approx Deduplication Results

The following CSVs contain three columns: the document ID, a boolean indicating whether or not this document was deleted during deduplication, and a cluster ID. Documents with the same cluster ID were identified as near-duplicates. For C4 and RealNews, the document ID is the url associated with the document. For Wiki-40B, it is the wikidata_id. LM1B coming soon.

Name	Link	Size
C4	link	13GB
RealNews	link	1.4GB
Wiki-40B	link	26MB
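Once a file is downloaded, the three columns can be grouped by cluster ID with the standard csv module. The sketch below assumes a comma-separated file without a header row, the column order given above, and a textual boolean such as True/False. All of these are assumptions to verify against the actual files, and the sample rows are invented.

```python
import csv
import io
from collections import defaultdict


def load_clusters(lines):
    """Map cluster ID -> list of (document ID, was_deleted) pairs."""
    clusters = defaultdict(list)
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        doc_id, deleted, cluster_id = row
        was_deleted = deleted.strip().lower() in ("true", "1")
        clusters[cluster_id].append((doc_id, was_deleted))
    return clusters


# Invented sample rows in the assumed layout: document ID, deleted?, cluster ID
sample = io.StringIO(
    "https://example.com/a,False,7\n"
    "https://example.com/b,True,7\n"
    "https://example.com/c,False,9\n"
)
for cluster_id, members in load_clusters(sample).items():
    print(cluster_id, members)
```

For a real file, pass an open file object (opened with newline="") instead of the StringIO sample.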

Comments

Can the tool run on plain text files?

Hello, I'm trying to deduplicate several plain text files. If I run python scripts/make_suffix_array.py myfile.en it correctly generates the myfile.en.table.bin file.

However, if I run cargo selfsimilar_parallel myfile.en it shows no duplicates.

myfile.en contains 10 times the same string, so I am wondering whether I have to use TFDS format or not.

opened by m-resta 20

Accessing the duplicates and their counts

Hey, thanks for releasing the code!

I'm a bit confused regarding how to use the dups_ and sizes_ files. I would like to get a mapping between all duplicate strings and their corresponding number of appearances in the data. From my understanding, this is what you get with the points in those files, but I don't understand how to read them. Any explanation would be helpful! (And a code snippet / reference would be even better!)

Thanks

opened by yanaiela 12

Error on self deduplication

I am planning to reproduce the self deduplication result for lm1b. I have already produced the result mentioned in the readme here.

However, when running selfsimilar_parallel, it shows Final answer 0 and when running collect_similar it throws an error of thread 'main' panicked at 'index out of bounds: the len is 0 but the index is 0', src/main.rs:1244:26. Am I missing something here?

Log:

$ python3 scripts/count_occurances.py --suffix dataset_save/lm1b.test --query " on Tuesday"
b' on Tuesday'
Number of times present: 1288


$ cargo run selfsimilar_parallel dataset_save/lm1b.test
warning: function is never used: `get_example_index`
   --> src/main.rs:447:4
    |
447 | fn get_example_index(table:&[u64], position:u64) -> usize{
    |    ^^^^^^^^^^^^^^^^^
    |
    = note: `#[warn(dead_code)]` on by default

warning: unused `Result` that must be used
   --> src/main.rs:367:2
    |
367 |     tablestream.file.read_exact(&mut tablestream.cache);
    |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |
    = note: `#[warn(unused_must_use)]` on by default
    = note: this `Result` may be an `Err` variant, which should be handled

warning: unused `Result` that must be used
   --> src/main.rs:379:2
    |
379 |     file.read_exact(&mut cache);
    |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |
    = note: this `Result` may be an `Err` variant, which should be handled

warning: 3 warnings emitted

    Finished dev [optimized + debuginfo] target(s) in 0.02s
     Running `target/debug/dedup_dataset selfsimilar_parallel dataset_save/lm1b.test`
Start load!
Loading ratio is 8
0 / 453700
Final answer 0


$ cargo run collect_similar dataset_save/lm1b.test
warning: function is never used: `get_example_index`
   --> src/main.rs:447:4
    |
447 | fn get_example_index(table:&[u64], position:u64) -> usize{
    |    ^^^^^^^^^^^^^^^^^
    |
    = note: `#[warn(dead_code)]` on by default

warning: unused `Result` that must be used
   --> src/main.rs:367:2
    |
367 |     tablestream.file.read_exact(&mut tablestream.cache);
    |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |
    = note: `#[warn(unused_must_use)]` on by default
    = note: this `Result` may be an `Err` variant, which should be handled

warning: unused `Result` that must be used
   --> src/main.rs:379:2
    |
379 |     file.read_exact(&mut cache);
    |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |
    = note: this `Result` may be an `Err` variant, which should be handled

warning: 3 warnings emitted

    Finished dev [optimized + debuginfo] target(s) in 0.02s
     Running `target/debug/dedup_dataset collect_similar dataset_save/lm1b.test`
Sorting.
Sorted.
thread 'main' panicked at 'index out of bounds: the len is 0 but the index is 0', src/main.rs:1244:26
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

opened by zijwang 10

Error with table size not being divisible by text size

Hi, I'm getting an error because size of the suffix table is not divisible by length of the text. https://github.com/google-research/deduplicate-text-datasets/blob/8a172b0d815862b1131da203be24d430e121a725/src/main.rs#L479

I'm running bash scripts/deduplicate_single_file.sh test.txt test_dedup.txt 20 1 where test.txt just contains a few paragraphs from a random Wikipedia article and some duplicate text that I manually added. I'm doing this mainly for debugging purpose (I would like to later make some edits to keep the first occurrence of duplicate samples and throw away the rest). If I run the command on my actual dataset that is roughly ~70GB big, I'm not encountering such issue. So I'm wondering what the issue is? Does the code not work with datasets that are too small?

Thanks!

Update: I just found out that running the command on the actual 70GB dataset also raised the same error.

opened by jinyongyoo 7

How to dedup between two datasets?

A practical situation is that, given two datasets A and B, we want to remove the data in A that has a large overlap with B. Is there a command that I could use to achieve this? The readme section on finding duplicates only has commands for a single document or a pair of documents.

opened by mralexis1 7

Why not use Simhash?

Google has shown that Simhash is practically useful for identifying near-duplicates in web documents belonging to a multi-billion page repository (Detecting Near-Duplicates for Web Crawling). In your paper, you chose MinHash for approximate matching. Why not use Simhash in this scenario?

opened by Ethan-yt 3

question about deduplication cluster size

As shown in the following picture, the cluster starting at 0x02954cb9 has a size of 3, but when I count it using bytes.count(), it shows 2.

I tried different datasets and observed the same phenomenon. Did I misunderstand what the size means?

opened by everks 2

Unexpected behavior with ending symbols

Hi again,

I found that count-occurrences behaves unexpectedly when you count the last symbols in a sequence. Here are the examples:

sequence "aaabbb", query "b": expect 3, but output is Number of times present: 2

another one is when sequence "aaabbb", query "bb": expected 2, but actual output is Number of times present: 0

Can you fix this? Thanks!
opened by mitya52 2

"failed to fill whole buffer" errors

Hi,

I tried to run the code on a simple string, and count-occurrences fails with a "failed to fill whole buffer" error.

Here are steps to reproduce:

run ./target/debug/dedup_dataset make --data-file dup.txt, data file dup.txt contains simple string "aaabbb"

then run ./target/debug/dedup_dataset count-occurrences --data-file dup.txt --query-file query.txt, where query.txt contains

"bb" expectation: Number of times present: 2 reality: thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Error { kind: UnexpectedEof, message: "failed to fill whole buffer" }', src/main.rs:275:31;

"ab" expectation: Number of times present: 1 reality: thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Error { kind: UnexpectedEof, message: "failed to fill whole buffer" }', src/main.rs:297:31;

"b" expectation: Number of times present: 2 reality: Number of times present: 1;

Maybe I'm doing something wrong? Thanks.

opened by mitya52 2

Should newline char be removed

Hi, I noticed that this read here adds a \n char to the end of the query. This then causes an issue with the count if it's not actually an end-of-line. Should a .strip() be added here?

arr = open(args.query_file,"rb").read().strip()

Thanks.

opened by cperiz 1

Fix multiprocessing bug in Windows/Mac OS X

The multiprocessing pool started uses the default method for launching child processes, which is OS specific. The default on Unix is "fork", and the resulting process inherits all resources from the parent process. Conversely, the default on Mac OS X/Windows is "spawn", which results in a minimal number of resources inherited by the child process.

I changed the code to explicitly use "fork", and I was able to run through the README on a Mac M1. I presume this fix would also help Windows users, though I haven't tested myself.

While I was at it, I fixed a small typo in the README.

Thanks for sharing the code and having a really nice README!

opened by alistairewj 1

Off-by-1 error in `collect`?

Hi, thanks for the great repo!

I'm using the tool to deduplicate a dataset, and I'm trying to investigate what happens in subsequent steps. I noticed that after running collect, some of the duplicate strings seem to start with control characters, e.g. after running code similar to this:

>>> data=open("data/my_data.test","rb").read()
>>> data[left:right]

where left and right are one of the pairs returned by collect, I get something like this:

b'\x00\x00Enter the username or e-mail you used in your profile. A password reset link will be sent to you by email.'

I'm cleaning the control characters up in my main text so it looks like parts of the separator codes are being leaked. Interestingly, this doesn't happen consistently, but it does happen more on the more frequent strings. Also, matched documents from my original dataset don't contain control characters.

Any chance there's some sort of an off-by-1 error in collect? Not a huge deal but I'd like to understand what's happening here
opened by ola13 0

how to deduplicate huggingface datasets

Hey there, excellent work on this repo and the paper.

I wanted to know how I could use this to deduplicate my custom Hugging Face dataset, which I have developed and cleaned myself.

It has been saved with custom_dataset.save_to_disk("dataset_path")

and can be loaded with custom_dataset = datasets.load_from_disk("dataset_path")

opened by StephennFernandes 6

Fix to issue #17 limits cmd_merge to be single-threaded

Hi,

it looks like the fix for issue #17, which puts some limits on the number of threads in cmd_merge, is a bit too aggressive, resulting in only using a single thread even for big workloads:

https://github.com/google-research/deduplicate-text-datasets/blob/ad86c7f65ac626581fe3a4277106309bc6b50c23/src/main.rs#L1020-L1023

texts.len() is equal to nn (the number of input parts), I think you want something like

let num_threads = std::cmp::min(num_threads, std::cmp::max((texts_len.iter().sum::<usize>() as i64 - 1024)/10, 1));

instead.
opened by kleinj 2

RAM crash when use collect method

First of all, thanks for releasing the code.

I have a dataset (mc4) of about 110 GB.

My machine has a 96-core CPU and 350 GB of RAM.

I successfully created a 524 GB suffix array from that dataset.

I also managed to run the deduplicator (self-similar method with a threshold of 100) with no memory issues; it created about ~140 GB of cache files (20B examples).

But when I run the collect method, my RAM blows up after a few minutes.

I traced the code and found that the RAM crash happens when this step runs: https://github.com/google-research/deduplicate-text-datasets/blob/ad86c7f65ac626581fe3a4277106309bc6b50c23/src/main.rs#L1188

Is this expected? Do you have a workaround for the issue?

AFAIK, the collect method just merges all the duplicate sequences found in the dataset and only returns a text file with pairs of bytes; correct me if I'm wrong.

I'm thinking it could maybe write the text file as soon as each cache file has been processed, instead of waiting for all of them to finish (this is just an assumption; I don't know whether it's possible, as I'm not an expert in Rust).

Thank you

opened by acul3 1

Error when running the code

Hi,

I try to deduplicate my plain text file, but it shows some errors. I first run

python scripts/make_suffix_array.py c4-train.00000-of-01024.txt

The output is

./target/debug/dedup_dataset make-part --data-file c4-train.00000-of-01024.txt --start-byte 0 --end-byte 114700294
./target/debug/dedup_dataset make-part --data-file c4-train.00000-of-01024.txt --start-byte 114600294 --end-byte 229300588
./target/debug/dedup_dataset make-part --data-file c4-train.00000-of-01024.txt --start-byte 229200588 --end-byte 343900882
./target/debug/dedup_dataset make-part --data-file c4-train.00000-of-01024.txt --start-byte 343800882 --end-byte 458401177
Waiting for jobs to finish
Checking all wrote correctly
FACT 4.0
FACT 4.0
FACT 4.0
FACT 4.0
Rerunning 0 jobs because they failed.
Merging suffix trees
./target/debug/dedup_dataset merge --output-file tmp/out.table.bin --suffix-path c4-train.00000-of-01024.txt.part.0-114700294 --suffix-path c4-train.00000-of-01024.txt.part.114600294-229300588 --suffix-path c4-train.00000-of-01024.txt.part.229200588-343900882 --suffix-path c4-train.00000-of-01024.txt.part.343800882-458401177 --num-threads 256
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 24, kind: Uncategorized, message: "Too many open files" }', src/main.rs:875:125
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 24, kind: Uncategorized, message: "Too many open files" }', src/main.rs:222:77
[The same "Too many open files" panic repeats for the remaining worker threads, at src/main.rs:222:77 and src/main.rs:875:125, with the output of the threads interleaved.]
:875:125
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 24, kind: Uncategorized, message: "Too many open files" }', src/main.rs:875:125
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 24, kind: Uncategorized, message: "Too many open files" }', src/main.rs:222:77
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 24, kind: Uncategorized, message: "Too many open files" }', src/main.rs:222:77
thread 'thread '<unnamed><unnamed>' panicked at '' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 24, kind: Uncategorized, message: "Too many open files" }called `Result::unwrap()` on an `Err` value: Os { code: 24, kind: Uncategorized, message: "Too many open files" }', ', src/main.rssrc/main.rs::222875::77125

thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 24, kind: Uncategorized, message: "Too many open files" }', src/main.rs:875:125
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 24, kind: Uncategorized, message: "Too many open files" }', src/main.rs:875:125
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 24, kind: Uncategorized, message: "Too many open files" }', src/main.rs:222:77
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 24, kind: Uncategorized, message: "Too many open files" }', src/main.rs:222:77
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 24, kind: Uncategorized, message: "Too many open files" }', src/main.rs:875:125
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 24, kind: Uncategorized, message: "Too many open files" }', src/main.rs:222:77
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 24, kind: Uncategorized, message: "Too many open files" }', src/main.rs:875:125
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 24, kind: Uncategorized, message: "Too many open files" }', thread 'src/main.rs<unnamed>:' panicked at '222called `Result::unwrap()` on an `Err` value: Os { code: 24, kind: Uncategorized, message: "Too many open files" }:', 77src/main.rs
:875:125
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 24, kind: Uncategorized, message: "Too many open files" }thread '', <unnamed>src/main.rs' panicked at ':called `Result::unwrap()` on an `Err` value: Os { code: 24, kind: Uncategorized, message: "Too many open files" }222', :src/main.rs77:
222:77
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 24, kind: Uncategorized, message: "Too many open files" }', src/main.rs:875:125
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 24, kind: Uncategorized, message: "Too many open files" }', src/main.rs:875:thread '125<unnamed>
' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 24, kind: Uncategorized, message: "Too many open files" }', thread 'src/main.rs<unnamed>:' panicked at '222called `Result::unwrap()` on an `Err` value: Os { code: 24, kind: Uncategorized, message: "Too many open files" }:', 77src/main.rs
:875:125
thread 'thread '<unnamed><unnamed>' panicked at '' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 24, kind: Uncategorized, message: "Too many open files" }called `Result::unwrap()` on an `Err` value: Os { code: 24, kind: Uncategorized, message: "Too many open files" }', ', src/main.rssrc/main.rsthread '::<unnamed>thread '222875' panicked at '<unnamed>::called `Result::unwrap()` on an `Err` value: Os { code: 24, kind: Uncategorized, message: "Too many open files" }' panicked at '77125', called `Result::unwrap()` on an `Err` value: Os { code: 24, kind: Uncategorized, message: "Too many open files" }

src/main.rs', :src/main.rs875::222125:
77
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 24, kind: Uncategorized, message: "Too many open files" }', src/main.rs:222:77
thread '<unnamed>' panicked at 'thread 'called `Result::unwrap()` on an `Err` value: Os { code: 24, kind: Uncategorized, message: "Too many open files" }<unnamed>', ' panicked at 'src/main.rscalled `Result::unwrap()` on an `Err` value: Os { code: 24, kind: Uncategorized, message: "Too many open files" }:', 222src/main.rs::77875
:125
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 24, kind: Uncategorized, message: "Too many open files" }', src/main.rs:222:77
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 24, kind: Uncategorized, message: "Too many open files" }', src/main.rs:222:77
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 24, kind: Uncategorized, message: "Too many open files" }', src/main.rs:222:77
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 24, kind: Uncategorized, message: "Too many open files" }', src/main.rs:222:77
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Any { .. }', /home/yiming/.cargo/registry/src/github.com-1ecc6299db9ec823/crossbeam-0.3.2/src/scoped.rs:34:43
Now merging individual tables
Cleaning up
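
For context on the log above: OS error code 24 with the message "Too many open files" is `EMFILE` on Linux and macOS, which a process gets when it reaches its per-process limit on open file descriptors. The sketch below is not part of this repository. It is a minimal, Unix-only illustration using Python's standard `resource` module, and the helper names `describe_open_file_limit` and `raise_soft_limit` are made up for this example. Whether a higher limit is enough to make a 128-thread run succeed is an assumption that has not been verified here.

```python
import errno
import resource


def describe_open_file_limit():
    """Return (errno_name, soft_limit, hard_limit) for open file descriptors."""
    # 24 is the OS error code that appears in the panic messages.
    name = errno.errorcode.get(24, "unknown")
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return name, soft, hard


def raise_soft_limit(target):
    """Try to raise the soft limit toward `target`, capped by the hard limit.

    Hypothetical helper: the new limit applies to this process and to child
    processes started from it, not to shells that are already running.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return soft  # already unlimited, nothing to do
    new_soft = target if hard == resource.RLIM_INFINITY else min(target, hard)
    if new_soft > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


if __name__ == "__main__":
    print(describe_open_file_limit())
```

In a shell, the `ulimit -n` builtin reports and adjusts the same soft limit for commands started from that shell. Passing a smaller `--num-threads` value is another thing to try, on the assumption that fewer threads keep fewer files open at once.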

Despite these errors, it still created the suffix array files:

c4-train.00000-of-01024.txt.part.0-114700294
c4-train.00000-of-01024.txt.part.0-114700294.table.bin
c4-train.00000-of-01024.txt.part.114600294-229300588
c4-train.00000-of-01024.txt.part.114600294-229300588.table.bin
c4-train.00000-of-01024.txt.part.229200588-343900882
c4-train.00000-of-01024.txt.part.229200588-343900882.table.bin
c4-train.00000-of-01024.txt.part.343800882-458401177
c4-train.00000-of-01024.txt.part.343800882-458401177.table.bin
c4-train.00000-of-01024.txt.table.bin

Then I ran:

cargo run self-similar --data-file c4-train.00000-of-01024.txt --length-threshold 15 --cache-dir cache --num-threads 128

It gave me the error below:

    Finished dev [optimized + debuginfo] target(s) in 5.69s
     Running `target/debug/dedup_dataset self-similar --data-file c4-train.00000-of-01024.txt --length-threshold 15 --cache-dir cache --num-threads 128`
Start load!
thread 'main' panicked at 'assertion failed: metadata.len() % (text.len() as u64) == 0', src/main.rs:479:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
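
The assertion in this output requires the size of the table file to be an exact multiple of the size of the text file. The sketch below is not from the repository. It only compares the two file sizes on disk, and the function name `table_to_text_ratio` is hypothetical. Reading the resulting ratio as the number of bytes stored per suffix array entry is an assumption, as is the idea that a table left incomplete by the earlier "Too many open files" panics would fail this check.

```python
import os


def table_to_text_ratio(data_path, table_path):
    """Return table_size // text_size when the sizes divide evenly, else None.

    Mirrors the condition `metadata.len() % (text.len() as u64) == 0`
    from the panic message, using file sizes only.
    """
    text_len = os.path.getsize(data_path)
    table_len = os.path.getsize(table_path)
    if text_len == 0 or table_len % text_len != 0:
        return None
    return table_len // text_len


if __name__ == "__main__":
    data = "c4-train.00000-of-01024.txt"
    print(table_to_text_ratio(data, data + ".table.bin"))
```

A result of `None` for the merged `.table.bin` would match the failed assertion. In that case, rebuilding the suffix array after the open-file errors are resolved seems like a reasonable next step.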

May I ask how to fix this? Thank you!

Yiming

opened by MatthewCYM 14

