Some bunch of test scripts to generate a SQLite DB with 1B rows in fastest possible way

Overview

fast-sqlite3-inserts

To find out the fastest way to create an SQLite DB with 1B random rows.

Current Benchmark

$ ./bench.sh

Sat May  8 19:08:47 IST 2021 [PYTHON] running naive.py (10_000_000) inserts
      855.29 real       158.63 user       258.69 sys

Sat May  8 19:23:02 IST 2021 [PYTHON] running naive_batched.py (10_000_000) inserts
      569.71 real       114.91 user       252.70 sys

Sat May  8 19:32:32 IST 2021 [PYTHON] running sqlite3_opt.py (100_000_000) inserts
      609.06 real       603.59 user         3.55 sys

Sat May  8 19:42:44 IST 2021 [PYTHON] running sqlite3_opt_batched.py (100_000_000) inserts
      517.53 real       508.24 user         7.35 sys

Sat May  8 19:51:24 IST 2021 [PYTHON] running threaded_batched.py (100_000_000) inserts
      697.70 real       515.22 user       170.90 sys

Sat May  8 20:03:04 IST 2021 [PYPY] running sqlite3_opt_batched.py (100_000_000) inserts
      159.70 real       153.46 user         5.81 sys

Sat May  8 20:05:45 IST 2021 [PYPY] running threaded_batched.py (100_000_000) inserts
      324.12 real       224.14 user        84.69 sys

Busy loop time

$ ./busy.sh

Sun May  9 13:16:01 IST 2021 [PYTHON] busy_loop.py (100_000_000) iterations
      351.14 real       347.53 user         3.39 sys

Sun May  9 13:21:52 IST 2021 [PYPY] busy_loop.py (100_000_000) iterations
       81.58 real        77.73 user         3.80 sys

Sun May  9 13:23:14 IST 2021 [RUST] busy.rs (100_000_000) iterations
       17.97 real        16.29 user         1.67 sys

Sun May  9 13:23:32 IST 2021 [RUST] threaded_busy.rs (100_000_000) iterations
        7.18 real        42.52 user         7.20 sys

Rust

Mon May 10 17:25:45 IST 2021 [RUST] basic_async.rs (100_000_000) inserts
      833.36 real      1202.85 user       254.21 sys
Mon May 10 17:39:39 IST 2021 [RUST] basic_prep.rs (100_000_000) inserts
       59.64 real        57.01 user         2.27 sys
Mon May 10 17:40:39 IST 2021 [RUST] basic_batched.rs (100_000_000) inserts
       51.08 real        48.16 user         2.43 sys
Comments
  • Rework the random functions to reduce allocations

    Rework the random functions to reduce allocations

    This pull request globally improves the performance of the program by removing most of the allocations at run time. Basically, the changes are in the lib.rs file.

    • I changed the common.rs module to be lib.rs file, this way cargo will not try to compile it like a binary file.
    • I removed the Vec creation in the get_random_age function.
    • I replaced the format! calls with an AreaCode new type that wraps a TinyStr8. By using this instead of strings. It helps kipping memory locality by directly storing the formatted string on a pointer-sized integer (64 bytes) without any indirection to read the str itself and also removes any possible allocation.

    On my side, the performance gain is around 17.6%.

    opened by Kerollmops 14
  • Switch from (crypto-secure) rand to fastrand

    Switch from (crypto-secure) rand to fastrand

    fastrand should be a lot better suited, less hassle with threads trying to optimize random selections. did not test the threaded build tbh but should be ok.

    opened by red15 7
  • Speed up the commons functions

    Speed up the commons functions

    My 10M (sqlite3_opt_batched) run goes from 42s to 26s.

    Switching to https://docs.python.org/3/library/os.html#os.urandom would make this even faster, but I don't think that's the point of the exercise so leaving it at this. I think it's still nice to have because it reduces the time by a lot and makes it faster for me to benchmark

    result for threaded_batched.py against pypy3 on my i7 for all 100M records:

    real 111.251 user 104.187 sys 4.483 pcpu 97.68

    I seem to be running out of memory though (process keeps getting OOM-killed, wondering if that can be fixed somehow).

    opened by captn3m0 4
  • Fix /usr/bin/time invocation

    Fix /usr/bin/time invocation

    /usr/bin/time doesn't exist on many systems because it is implemented as a shell-builtin.

    Putting a semi-colon at the end of the command forces make to run a shell for the same, and invokes the correct time.

    Added a shebang for bench.sh so hopefully that should also work everywhere?

    Ref: https://stackoverflow.com/a/17550243/368328

    opened by captn3m0 1
  • Add pure SQLite approach, with generate_series

    Add pure SQLite approach, with generate_series

    Based from #11. This PR adds a new method of quickly inserting 100M rows into a SQLite table, using a pure SQLite approach.

    Using the generate_series table valued function, we can insert 100M rows into a table very, very quickly with a single INSERT INTO ... SELECT statement.

    The question of "how fast is this" is difficult to answer. I'm not sure if it's something with my computer/background apps I'm running, but my benchmarks of the previous python/rust solutions have been all over the place. Using this approach, I get 57s. For the pypy threaded_batched approach, I get 5m50s, and with the rust threaded_batched approach, I get 1m9s.

    $ time sqlite3x sqlite3.db  '.read load.sql'
    real 0m57.414s
    user 0m45.714s
    sys 0m6.574s
    
    $ time ~/Downloads/pypy3.8-v7.3.7-osx64/bin/pypy3 threaded_batched.py
    real 5m49.887s
    user 3m4.141s
    sys 1m52.690s
    
    $ time ./target/release/threaded_batched
    real 1m9.807s
    user 0m49.430s
    sys 0m29.562s
    

    Given this, I'm tempted to say that this actually might be the fastest solution, but given my wildly different benchmark times (pypy's 5m50s vs 126s, rust's 1m9s vs 30s), I'd love to see how this runs on your computer, using the full benchmark!

    Why (I think) it's so fast

    One solid reason why I think this approach is so fast is because it's only a single SQL statement that is being ran. All the other approaches are performing millions of insert into ... values (...) SQL statements at a time, which is a fast operation, but doing anything millions of times starts to add up. We see an obvious benefit when we start to batch statements together, but even batching 50 at a time is still 2 million distinct SQL statements that need to be ran.

    Also, by using generate_series, all operations are kept in C, which is very fast. Rust and pypy are also fast, but my guess is that context switching between Rust -> C or pypy -> C takes a non-trivial amount of time that adds up fast.

    Also, on generate_series vs recursive CTEs, the generate_series page mentions that recursive CTEs are slower than table-valued functions.

    Note: generate_series may not be in your sqlite CLI by default, so make sure the sqlite3 CLI that you have contains that

    Would love to hear what you think, and to see the "official" benchmark numbers!

    opened by asg017 4
  • `basic_async` not actually asynchronous ?

    `basic_async` not actually asynchronous ?

    I might be missing something here as I'm still new to async in rust, but as I understand it calling await on a future forces the runtime to serialize the execution of the program, this means that this code:

    conn.execute("PRAGMA cache_size = 1000000;").await?;
    conn.execute("PRAGMA locking_mode = EXCLUSIVE;").await?;
    conn.execute("PRAGMA temp_store = MEMORY;").await?;
    

    is completely synchronized and each call to the database needs to complete before the next call starts. Now, this seems fine as this only happens once outside of the benchmark loop.

    But all the code in the faker function also uses await on each async call. For example this code :

    let stmt_with_area = tx
        .prepare("INSERT INTO user VALUES (NULL, ?, ?, ?)")
        .await?;
    let stmt = tx
        .prepare("INSERT INTO user VALUES (NULL, NULL, ?, ?)")
        .await?;
    

    It seems to me that there is no reason the needs to be serialized and we should join these calls, in something like:

    let (stmt_with_area, stmt ) = tokio::join!(
        tx.prepare("INSERT INTO user VALUES (NULL, ?, ?, ?)"),
        tx.prepare("INSERT INTO user VALUES (NULL, NULL, ?, ?)")
    )
    

    Finally, it also seems that we don't actually need to await the execution of the statements themselves inside the for but that would require saving all the futures in a big vector and then joining them, which seems a bit odd and would consume a lot of memory when running the for loop for 10e6 iterations.

    opened by Zshoham 1
  • Avoid the slow Display::fmt trait for area code generation

    Avoid the slow Display::fmt trait for area code generation

    As described in the other PR #12 I discovered the "slow" Display::fmt being called for formatting the area code using perf.

    This PR does limit the generated area code to numbers above 100_000 where before it could be less, thus losing some randomness, but it's about the same "trick" as used in #2 for the python side.

    opened by red15 2
  • Statically linked with libsqlite3.a with LTO enabled

    Statically linked with libsqlite3.a with LTO enabled

    This PR enables binaries who uses rusqlite to statically linked with libsqlite3.a compiled with LTO using linker-plugin-lto.

    To compile these binaries (excluding basic_async.rs), just run make -j $(nproc). It will compile sqlite3.c using CFLAGS='-O2 -flto'. The generated binaries will be smaller, ~~though I haven't tested the performance yet, I will add the benchmark below as a comment~~ ~~but only provides minor performance improvements (see comments below)~~ It seems that I didn't enable LTO in rust (see comments below).

    To compile basic_async.rs, run cargo build --release --bin basic_async --features async-sql.

    This PR might be related to #14

    Signed-off-by: Jiahao XU [email protected]

    opened by NobodyXu 13
Owner
I git stuff done
null
Async Lightweight HTTP client using system native library if possible. (Currently under heavy development)

Async Lightweight HTTP Client (aka ALHC) What if we need async but also lightweight http client without using such a large library like reqwest, isahc

SteveXMH 7 Dec 15, 2022
🧰 The Rust SQL Toolkit. An async, pure Rust SQL crate featuring compile-time checked queries without a DSL. Supports PostgreSQL, MySQL, SQLite, and MSSQL.

SQLx ?? The Rust SQL Toolkit Install | Usage | Docs Built with ❤️ by The LaunchBadge team SQLx is an async, pure Rust† SQL crate featuring compile-tim

launchbadge 7.6k Dec 31, 2022
Ergonomic bindings to SQLite for Rust

Rusqlite Rusqlite is an ergonomic wrapper for using SQLite from Rust. It attempts to expose an interface similar to rust-postgres. use rusqlite::{para

Rusqlite 1.9k Jan 5, 2023
SQLite clone from scratch in Rust

Rust-SQLite (SQLRite) Rust-SQLite, aka SQLRite , is a simple embedded database modeled off SQLite, but developed with Rust. The goal is get a better u

João Henrique Machado Silva 952 Jan 5, 2023
Simple and handy btrfs snapshoting tool. Supports unattended snapshots, tracking, restoring, automatic cleanup and more. Backed with SQLite.

Description Simple and handy btrfs snapshoting tool. Supports unattended snapshots, tracking, restoring, automatic cleanup and more. Backed with SQLit

Eduard Tolosa 27 Nov 22, 2022
🐸Slippi DB ingests Slippi replays and puts the data into a SQLite database for easier parsing.

The primary goal of this project is to make it easier to analyze large amounts of Slippi data. Its end goal is to create something similar to Ballchasing.com but for Melee.

Max Timkovich 20 Jan 2, 2023
A tool for automated migrations for PostgreSQL, SQLite and MySQL.

Models Models is an implementation for a SQL migration management tool. It supports PostgreSQL, MySQL, and SQLite. Quick Start install the CLI by runn

null 45 Nov 16, 2022
Provides a Rust-based SQLite extension for using Hypercore as the VFS for your databases.

SQLite and Hypercore A Rust library providing SQLite with an virtual file system to enable Hypercore as a means of storage. Contributing The primary r

Jacky Alciné 14 Dec 5, 2022
ChiselStore is an embeddable, distributed SQLite for Rust, powered by Little Raft.

ChiselStore ChiselStore is an embeddable, distributed SQLite for Rust, powered by Little Raft. SQLite is a fast and compact relational database manage

null 516 Jan 2, 2023
cogo rust coroutine database driver (Mysql,Postgres,Sqlite)

cdbc Coroutine Database driver Connectivity.based on cogo High concurrency,based on coroutine No Future<'q,Output=*>,No async fn, No .await , no Poll*

co-rs 10 Nov 13, 2022
Interface to SQLite

SQLite The package provides an interface to SQLite. Example Open a connection, create a table, and insert some rows: let connection = sqlite::open(":m

Stainless Steel 139 Dec 28, 2022
Rusqlite is an ergonomic wrapper for using SQLite from Rust

Rusqlite Rusqlite is an ergonomic wrapper for using SQLite from Rust. It attempts to expose an interface similar to rust-postgres. use rusqlite::{para

Rusqlite 1.9k Jan 7, 2023
SQLite compiled to WASM with pluggable data storage

wasm-sqlite SQLite compiled to WASM with pluggable data storage. Useful to save SQLite in e.g. Cloudflare Durable Objects (example: https://github.com

Markus Ast 36 Dec 7, 2022
Build SQLite virtual file systems (VFS) by implementing a simple Rust trait.

sqlite-vfs Build SQLite virtual file systems (VFS) by implementing a simple Rust trait. Documentation | Example This library is build for my own use-c

Markus Ast 56 Dec 19, 2022
Persist EWW histories into SQLite

EWW History Extension Persist EWW histories into SQLite. Besides EWW, packages below are also supported: elfeed Welcome to open an issue if you have a

null 5 Sep 23, 2022
webmention-receiver - a simple program that receives webmentions, records them to a SQLite database

webmention-receiver is a simple program that receives webmentions, records them to a SQLite database, and allows viewing them via a webpage or RSS feed. It has no conception of an "account", and by default will accept webmentions for any domain. It is possible to configure it to only accept webmentions for a specific set of domains, if you'd prefer.

Wesley Aptekar-Cassels 11 Sep 3, 2022
Grsql is a great tool to allow you set up your remote sqlite database as service and CRUD(create/read/update/delete) it using gRPC.

Grsql is a great tool to allow you set up your remote sqlite database as service and CRUD (create/ read/ update/ delete) it using gRPC. Why Create Thi

Bruce Yuan 33 Dec 16, 2022
XLite - query Excel (.xlsx, .xls) and Open Document spreadsheets (.ods) as SQLite virtual tables

XLite - query Excel (.xlsx, .xls) and Open Document spreadsheets (.ods) as SQLite virtual tables XLite is a SQLite extension written in Rust. The main

Sergey Khabibullin 1.1k Dec 28, 2022
A Rust-based comment server using SQLite and an intuitive REST API.

soudan A Rust-based comment server using SQLite and an intuitive REST API. Soudan is built with simplicity and static sites in mind. CLI usage See sou

Elnu 0 Dec 19, 2022