Perhaps the fastest and most memory efficient way to pull data from PostgreSQL into pandas and numpy. 🚀

Overview

flaco

Code Style CI PyPI PyPI - Wheel Downloads

Perhaps the fastest and most memory efficient way to pull data from PostgreSQL into pandas and numpy. 🚀

Have a gander at the initial benchmarks 🏋

flaco tends to use nearly ~3-5x less memory than standard pandas.read_sql and about ~3x faster. However, it's probably 50x less stable at the moment. 😜

To whet your appetite, here's a memory profile between flaco and pandas.read_sql on a table with 2M rows with columns of various types. (see test_benchmarks.py) *If the test data has null values, you can expect a ~3x saving, instead of the ~5x you see here; therefore (hot tip 🔥 ), supply fill values in your queries where possible via coalesce.

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
    99    140.5 MiB    140.5 MiB           1   @profile
   100                                         def memory_profile():
   101    140.5 MiB      0.0 MiB           1       stmt = "select * from test_table"
   102                                         
   103                                             
   104    140.9 MiB      0.4 MiB           1       with Database(DB_URI) as con:
   105    441.8 MiB    300.9 MiB           1           data = read_sql(stmt, con)
   106    441.8 MiB      0.0 MiB           1           _flaco_df = pd.DataFrame(data, copy=False)
   107                                         
   108                                             
   109    441.8 MiB      0.0 MiB           1       engine = create_engine(DB_URI)
   110   2091.5 MiB   1649.7 MiB           1       _pandas_df = pd.read_sql(stmt, engine)

Example

from flaco.io import read_sql, Database


uri = "postgresql://postgres:postgres@localhost:5432/postgres"
stmt = "select * from my_big_table"

with Database(uri) as con:
    data = read_sql(stmt, con)  # dict of column name to numpy array

# If you have pandas installed, you can create a DataFrame
# with zero copying like this:
import pandas as pd
df = pd.DataFrame(data, copy=False)

# If you know the _exact_ rows which will be returned
# you can supply n_rows to perform a single array 
# allocation without needing to resize during query reading.
with Database(uri) as con:
    data = read_sql(stmt, con, 1_000)

Notes

Is this a drop in replacement for pandas.read_sql?

No. It varies in a few ways:

  • It will return a dict of str numpy.ndarray objects. But this can be passed with zero copies to pandas.DataFrame
  • When querying integer columns, if a null is encountered, the array will be converted to dtype=object and nulls from PostgreSQL will be None. Whereas pandas will convert the underlying array to a float type; where nulls from postgres are basically numpy.nan types.
  • It lacks basically all of the options pandas.read_sql has.

Furthermore, while it's pretty neat this lib can allow faster and less resource intensive use of numpy/pandas against PostgreSQL, it's in early stages of development and you're likely to encounter some sharp edges which include, but not limited to:

  • 📝 Poor/non-existant error messages
  • 💩 Core dumps
  • 🚰 Memory leaks (although I think most are handled now)
  • 🦖 Almost complete lack of exception handling from underlying Rust/C interface
  • 📍 PostgreSQL numeric type should ideally be converted to decimal.Decimal but uses f64 for now; potentially loosing precision. Note, this is exactly what pandas.read_sql does.
  • Might not handle all or custom arbitrary PostgreSQL types. If you encounter such types, either convert them to a supported type like text/json/jsonb (ie select my_field::text ...), or open an issue if a standard type is not supported.

License

Why did you choose such lax licensing? Could you change to a copy left license, please?

...just kidding, no one would ask that. This is dual licensed under Unlicense and MIT.

Comments
  • Improve temporal types

    Improve temporal types

    Remove all use of Utf8Array when handling temporal types, in favor or proper DataType::Timestamp and the like.

    This also removes read_sql_to_numpy as I don't think that's a worthy compatibility to keep.

    opened by milesgranger 0
  • Switch to using Arrow

    Switch to using Arrow

    • Removes ability to read directly into numpy buffers and thus DataFrame (will re-introduce once figure out C Data Interface)
    • Reads sql query into Arrow/Feather/IPC or Parquet file very efficiently, allowing use with other tools which can make good use of these.
    opened by milesgranger 0
  • Bump tokio from 1.12.0 to 1.16.1

    Bump tokio from 1.12.0 to 1.16.1

    Bumps tokio from 1.12.0 to 1.16.1.

    Release notes

    Sourced from tokio's releases.

    Tokio v1.16.1

    1.16.1 (January 28, 2022)

    This release fixes a bug in #4428 with the change #4437.

    #4428: tokio-rs/tokio#4428 #4437: tokio-rs/tokio#4437

    Tokio v1.16.0

    Fixes a soundness bug in io::Take (#4428). The unsoundness is exposed when leaking memory in the given AsyncRead implementation and then overwriting the supplied buffer:

    impl AsyncRead for Buggy {
        fn poll_read(
            self: Pin<&mut Self>,
            cx: &mut Context<'_>,
            buf: &mut ReadBuf<'_>
        ) -> Poll<Result<()>> {
          let new_buf = vec![0; 5].leak();
          *buf = ReadBuf::new(new_buf);
          buf.put_slice(b"hello");
          Poll::Ready(Ok(()))
        }
    }
    

    Also, this release includes improvements to the multi-threaded scheduler that can increase throughput by up to 20% in some cases (#4383).

    Fixed

    • io: soundness don't expose uninitialized memory when using io::Take in edge case (#4428)
    • fs: ensure File::write results in a write syscall when the runtime shuts down (#4316)
    • process: drop pipe after child exits in wait_with_output (#4315)
    • rt: improve error message when spawning a thread fails (#4398)
    • rt: reduce false-positive thread wakups in the multi-threaded scheduler (#4383)
    • sync: don't inherit Send from parking_lot::*Guard (#4359)

    Added

    • net: TcpSocket::linger() and set_linger() (#4324)
    • net: impl UnwindSafe for socket types (#4384)
    • rt: impl UnwindSafe for JoinHandle (#4418)
    • sync: watch::Receiver::has_changed() (#4342)
    • sync: oneshot::Receiver::blocking_recv() (#4334)
    • sync: RwLock blocking operations (#4425)

    Unstable

    ... (truncated)

    Commits

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

    dependencies 
    opened by dependabot[bot] 0
  • Use numpy.datetime64 dtypes

    Use numpy.datetime64 dtypes

    Use direct i64 values from postgres_protocol and calculate into numpy.datetime64 types. Greatly improves memory, but costs more time. (still ~3x faster than pandas, but slower than connectorx now)

    opened by milesgranger 0
  • Datetime, Date, and Time support

    Datetime, Date, and Time support

    Plus some other optimizations here and there. Also benchmark against connectors; future PR to mention some differences in approach with that project.

    Cannot (yet) support timetz as postgres-rust does not support it.

    opened by milesgranger 0
  • Memory and speed improvements

    Memory and speed improvements

    • Make everything non clone-able.
    • Return pointer to Data instead of cloning
    • Reduce complexity of external API; free data when error encountered, or iteration end, inside of next_row

    Still some issues on the rust lib side.. it shouldn't continue growing in memory throughout use; should rather be streaming the rows to the caller.

    Ref: it swells up to approx the size of the table data being read, it should instead level off at a lower memory use I think.

    image

    opened by milesgranger 0
  • Optimize handling of null integers and reuse row container

    Optimize handling of null integers and reuse row container

    • Much more efficient to set expected dtype, and if a null is encountered, cast the array to object.
    • Do a single allocation for row data, and reuse that container throughout iterations.
    opened by milesgranger 0
  • Add CI/CD and support(ish) numeric type

    Add CI/CD and support(ish) numeric type

    Note about numeric type from PostgreSQL: would be best, IMO, to support proper decimal.Decimal conversion, but haven't figured that out just yet, and pandas converts this to f64, so for the meantime, it seems like a suitable alternative.

    opened by milesgranger 0
Releases(v0.6.0)
  • v0.6.0(Oct 28, 2022)

    What's Changed

    • Bump tokio from 1.12.0 to 1.16.1 by @dependabot in https://github.com/milesgranger/flaco/pull/20
    • Switch to using Arrow by @milesgranger in https://github.com/milesgranger/flaco/pull/22
    • Build wheels and PyPI distribution by @milesgranger in https://github.com/milesgranger/flaco/pull/23
    • Remove offset seconds from OffsetDatetime result by @milesgranger in https://github.com/milesgranger/flaco/pull/27
    • Improve temporal types by @milesgranger in https://github.com/milesgranger/flaco/pull/28
    • Support reading into pyarrow.Table by @milesgranger in https://github.com/milesgranger/flaco/pull/31

    Full Changelog: https://github.com/milesgranger/flaco/compare/v0.5.1...v0.6.0

    Source code(tar.gz)
    Source code(zip)
  • v0.6.0-rc4(Oct 9, 2022)

    What's Changed

    • Support reading into pyarrow.Table by @milesgranger in https://github.com/milesgranger/flaco/pull/31

    Full Changelog: https://github.com/milesgranger/flaco/compare/v0.6.0-rc3...v0.6.0-rc4

    Source code(tar.gz)
    Source code(zip)
  • v0.6.0-rc3(Sep 27, 2022)

    What's Changed

    • Improve temporal types by @milesgranger in https://github.com/milesgranger/flaco/pull/28

    Full Changelog: https://github.com/milesgranger/flaco/compare/v0.6.0-rc2...v0.6.0-rc3

    Source code(tar.gz)
    Source code(zip)
  • v0.6.0-rc2(Sep 16, 2022)

    What's Changed

    • Remove offset seconds from OffsetDatetime result by @milesgranger in https://github.com/milesgranger/flaco/pull/27

    Full Changelog: https://github.com/milesgranger/flaco/compare/v0.6.0-rc1...v0.6.0-rc2

    Source code(tar.gz)
    Source code(zip)
  • v0.6.0-rc1(Sep 11, 2022)

    What's Changed

    • Bump tokio from 1.12.0 to 1.16.1 by @dependabot in https://github.com/milesgranger/flaco/pull/20
    • Switch to using Arrow by @milesgranger in https://github.com/milesgranger/flaco/pull/22
    • Build wheels and PyPI distribution by @milesgranger in https://github.com/milesgranger/flaco/pull/23
    • Support for read_sql_to_numpy by @milesgranger in https://github.com/milesgranger/flaco/pull/26

    New Contributors

    • @dependabot made their first contribution in https://github.com/milesgranger/flaco/pull/20

    Full Changelog: https://github.com/milesgranger/flaco/compare/v0.5.1...v0.6.0-rc1

    Source code(tar.gz)
    Source code(zip)
  • v0.5.1(Dec 23, 2021)

    What's Changed

    • Fix offset calculated as datetime64 result, not timedelta64 by @milesgranger in https://github.com/milesgranger/flaco/pull/18

    Full Changelog: https://github.com/milesgranger/flaco/compare/v0.5.0...v0.5.1

    Source code(tar.gz)
    Source code(zip)
  • v0.5.0(Nov 11, 2021)

    What's Changed

    • Small improvements by @milesgranger in https://github.com/milesgranger/flaco/pull/15
    • Use numpy.datetime64 dtypes by @milesgranger in https://github.com/milesgranger/flaco/pull/16
    • Cleanup and formatting by @milesgranger in https://github.com/milesgranger/flaco/pull/17

    Full Changelog: https://github.com/milesgranger/flaco/compare/v0.4.1...v0.5.0

    Source code(tar.gz)
    Source code(zip)
  • v0.4.1(Oct 26, 2021)

  • v0.4.0(Oct 26, 2021)

    What's Changed

    • Parsed Datetime, Date, and Time support by @milesgranger in https://github.com/milesgranger/flaco/pull/14

    Full Changelog: https://github.com/milesgranger/flaco/compare/v0.3.0...v0.4.0

    Source code(tar.gz)
    Source code(zip)
  • v0.3.0.post3(Oct 21, 2021)

  • v0.3.0.post2(Oct 20, 2021)

    What's Changed

    • Reduce allocations by reusing existing enums by @milesgranger in https://github.com/milesgranger/flaco/pull/13

    Full Changelog: https://github.com/milesgranger/flaco/compare/v0.2.0...v0.3.0

    Source code(tar.gz)
    Source code(zip)
  • v0.2.0(Oct 18, 2021)

    What's Changed

    • Add CI/CD and support(ish) numeric type by @milesgranger in https://github.com/milesgranger/flaco/pull/6
    • Run CI on master branch push by @milesgranger in https://github.com/milesgranger/flaco/pull/7
    • Improve error handling, context manager and CI by @milesgranger in https://github.com/milesgranger/flaco/pull/8
    • Optimize handling of null integers and reuse row container by @milesgranger in https://github.com/milesgranger/flaco/pull/9
    • Better error handling by @milesgranger in https://github.com/milesgranger/flaco/pull/10
    • Memory and speed improvements by @milesgranger in https://github.com/milesgranger/flaco/pull/11
    • Handle memory leaks by @milesgranger in https://github.com/milesgranger/flaco/pull/12

    Full Changelog: https://github.com/milesgranger/flaco/compare/v0.1.0...v0.2.0

    Source code(tar.gz)
    Source code(zip)
  • v0.1.0.post3(Oct 13, 2021)

    What's Changed

    • Add CI/CD and support(ish) numeric type by @milesgranger in https://github.com/milesgranger/flaco/pull/6
      • Support x64 Linux and Windows platforms
    • Run CI on master branch push by @milesgranger in https://github.com/milesgranger/flaco/pull/7

    Full Changelog: https://github.com/milesgranger/flaco/compare/v0.1.0...v0.1.0.post3

    Source code(tar.gz)
    Source code(zip)
  • v0.1.0(Oct 6, 2021)

    Initial support for most data types, no TLS support and full of terrible error messages and handling. but in initial benchmarks, it's ~2x as fast and memory-efficient as basic pandas.read_sql.

    Source code(tar.gz)
    Source code(zip)
Owner
Miles Granger
Just a happy engineer.
Miles Granger
PyO3-based Rust binding of NumPy C-API

rust-numpy Rust bindings for the NumPy C-API API documentation Latest release (possibly broken) Current Master Requirements Rust >= 1.41.1 Basically,

PyO3 759 Jan 3, 2023
A highly efficient daemon for streaming data from Kafka into Delta Lake

A highly efficient daemon for streaming data from Kafka into Delta Lake

Delta Lake 172 Dec 23, 2022
ConnectorX - Fastest library to load data from DB to DataFrames in Rust and Python

ConnectorX enables you to load data from databases into Python in the fastest and most memory efficient way.

SFU Database Group 939 Jan 5, 2023
Fastest and safest Rust implementation of parquet. `unsafe` free. Integration-tested against pyarrow

Parquet2 This is a re-write of the official parquet crate with performance, parallelism and safety in mind. The five main differentiators in compariso

Jorge Leitao 237 Jan 1, 2023
Provides a way to use enums to describe and execute ordered data pipelines. 🦀🐾

enum_pipline Provides a way to use enums to describe and execute ordered data pipelines. ?? ?? I needed a succinct way to describe 2d pixel map operat

Ben Greenier 0 Oct 29, 2021
PostQuet: Stream PostgreSQL tables/queries to Parquet files seamlessly with this high-performance, Rust-based command-line tool.

STATUS: IN DEVELOPMENT PostQuet: Streaming PostgreSQL to Parquet Exporter PostQuet is a powerful and efficient command-line tool written in Rust that

Per Arneng 4 Apr 11, 2023
Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing

Apache Arrow Powering In-Memory Analytics Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enabl

The Apache Software Foundation 10.9k Jan 6, 2023
🦖 Evolve your fixed length data files into Apache Arrow tables, fully parallelized!

?? Evolve your fixed length data files into Apache Arrow tables, fully parallelized! ?? Overview ... ?? Installation The easiest way to install evolut

Firelink Data 3 Dec 22, 2023
New generation decentralized data warehouse and streaming data pipeline

World's first decentralized real-time data warehouse, on your laptop Docs | Demo | Tutorials | Examples | FAQ | Chat Get Started Watch this introducto

kamu 184 Dec 22, 2022
This library provides a data view for reading and writing data in a byte array.

Docs This library provides a data view for reading and writing data in a byte array. This library requires feature(generic_const_exprs) to be enabled.

null 2 Nov 2, 2022
A Modern Real-Time Data Processing & Analytics DBMS with Cloud-Native Architecture, built to make the Data Cloud easy

A Modern Real-Time Data Processing & Analytics DBMS with Cloud-Native Architecture, built to make the Data Cloud easy

Datafuse Labs 5k Jan 9, 2023
ndarray: an N-dimensional array with array views, multidimensional slicing, and efficient operations

ndarray The ndarray crate provides an n-dimensional container for general elements and for numerics. Please read the API documentation on docs.rs or t

null 2.6k Jan 7, 2023
Polars is a blazingly fast DataFrames library implemented in Rust using Apache Arrow Columnar Format as memory model.

Polars Python Documentation | Rust Documentation | User Guide | Discord | StackOverflow Blazingly fast DataFrames in Rust, Python & Node.js Polars is

null 11.8k Jan 8, 2023
An AWS Lambda for automatically loading JSON files as they're created into Delta tables

Delta S3 Loader This AWS Lambda serves a singular purpose: bring JSON files from an S3 bucket into Delta Lake. This can be highly useful for legacy or

R. Tyler Croy 4 Jan 12, 2022
Dig into ClickHouse with TUI interface. PRE ALPHA version, everything will be changed.

chdig Dig into ClickHouse with TUI interface. Motivation The idea is came from everyday digging into various ClickHouse issues. ClickHouse has a appro

Azat Khuzhin 5 Feb 10, 2023
A fast, powerful, flexible and easy to use open source data analysis and manipulation tool written in Rust

fisher-rs fisher-rs is a Rust library that brings powerful data manipulation and analysis capabilities to Rust developers, inspired by the popular pan

Syed Vilayat Ali Rizvi 5 Aug 31, 2023
A fast, powerful, flexible and easy to use open source data analysis and manipulation tool written in Rust

fisher-rs fisher-rs is a Rust library that brings powerful data manipulation and analysis capabilities to Rust developers, inspired by the popular pan

null 5 Sep 6, 2023
AppFlowy is an open-source alternative to Notion. You are in charge of your data and customizations

AppFlowy is an open-source alternative to Notion. You are in charge of your data and customizations. Built with Flutter and Rust.

null 30.7k Jan 7, 2023
An example repository on how to start building graph applications on streaming data. Just clone and start building 💻 💪

An example repository on how to start building graph applications on streaming data. Just clone and start building ?? ??

Memgraph 40 Dec 20, 2022