Perhaps the fastest and most memory-efficient way to pull data from PostgreSQL into pandas and numpy.
Have a gander at the initial benchmarks:

flaco tends to use roughly 3-5x less memory than standard `pandas.read_sql`, and is about 3x faster. However, it's probably 50x less stable at the moment.
To whet your appetite, here's a memory profile comparing flaco and `pandas.read_sql` on a table with 2M rows and columns of various types (see `test_benchmarks.py`). *If the test data has null values, you can expect a ~3x saving instead of the ~5x you see here; therefore (hot tip 🔥) prefer non-null data where you can.*
```
Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
    99    140.5 MiB    140.5 MiB           1   @profile
   100                                         def memory_profile():
   101    140.5 MiB      0.0 MiB           1       stmt = "select * from test_table"
   102
   103
   104    140.9 MiB      0.4 MiB           1       with Database(DB_URI) as con:
   105    441.8 MiB    300.9 MiB           1           data = read_sql(stmt, con)
   106    441.8 MiB      0.0 MiB           1           _flaco_df = pd.DataFrame(data, copy=False)
   107
   108
   109    441.8 MiB      0.0 MiB           1       engine = create_engine(DB_URI)
   110   2091.5 MiB   1649.7 MiB           1       _pandas_df = pd.read_sql(stmt, engine)
```
```python
from flaco.io import read_sql, Database


uri = "postgresql://postgres:postgres@localhost:5432/postgres"
stmt = "select * from my_big_table"

with Database(uri) as con:
    data = read_sql(stmt, con)  # dict of column name to numpy array

# If you have pandas installed, you can create a DataFrame
# with zero copying like this:
import pandas as pd
df = pd.DataFrame(data, copy=False)

# If you know the _exact_ rows which will be returned,
# you can supply n_rows to perform a single array
# allocation without needing to resize during query reading.
with Database(uri) as con:
    data = read_sql(stmt, con, 1_000)
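To illustrate the zero-copy claim above without needing a live database, here's a sketch that mimics flaco's output (a dict of column name to numpy array — the column names and values are made up) and builds a `DataFrame` from it with `copy=False`:

```python
import numpy as np
import pandas as pd

# Simulate flaco's return value: a dict of column name -> numpy array.
# (These column names and values are hypothetical.)
data = {
    "id": np.arange(5, dtype=np.int64),
    "value": np.linspace(0.0, 1.0, 5),
}

# copy=False asks pandas to reuse the arrays rather than copy them.
df = pd.DataFrame(data, copy=False)

# Whether memory is actually shared depends on your pandas version
# and copy-on-write settings, so we just report it here:
print(np.shares_memory(df["value"].to_numpy(), data["value"]))
```

Either way, the construction itself is cheap relative to re-reading the data through a driver row by row.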
Is this a drop-in replacement for `pandas.read_sql`?

No. It varies in a few ways:
- It will return a dict of column name to `numpy.ndarray` objects, not a DataFrame. But this can be passed with zero copies to `pandas.DataFrame(data, copy=False)`.
- When querying integer columns, if a null is encountered, the array will be converted to `dtype=object` and nulls from PostgreSQL will be `None`. Whereas pandas will convert the underlying array to a float type, where nulls from postgres are `NaN`.
- It lacks basically all of the options `pandas.read_sql` offers.
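The integer-null difference above can be demonstrated without a database; this is plain pandas/numpy behavior:

```python
import numpy as np
import pandas as pd

# pandas: an integer column containing a null is promoted to float64,
# and the null becomes NaN.
pandas_style = pd.Series([1, 2, None])
print(pandas_style.dtype)  # float64

# flaco (per the notes above): the column becomes an object array
# and the null stays as None.
flaco_style = np.array([1, 2, None], dtype=object)
print(flaco_style[2] is None)  # True
```

The practical consequence: with flaco you keep exact integers and can test for `None` directly, at the cost of losing vectorized numeric operations on that column until you convert it.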
Furthermore, while it's pretty neat this lib can allow faster and less resource-intensive use of numpy/pandas against PostgreSQL, it's in the early stages of development, and you're likely to encounter some sharp edges, which include, but are not limited to:
📝 Poor/non-existent error messages
🚰 Memory leaks (although I think most are handled now)
🦖 Almost complete lack of exception handling from the underlying Rust/C interface
❗ The `numeric` type is converted to `f64` for now, potentially losing precision. Note, this is exactly what `pandas.read_sql` does anyway.
❗ Might not handle all, or custom/arbitrary, PostgreSQL types. If you encounter such a type, either convert it to a supported type like text/json/jsonb (i.e. `select my_field::text ...`), or open an issue if a standard type is not supported.
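As a sketch of that workaround: casting a jsonb column to `::text` means the query returns plain strings, which you can decode back into Python objects yourself (the column and table names here are hypothetical):

```python
import json

# Suppose the query was:
#   select my_jsonb_field::text from my_table
# Each value then arrives as a JSON-encoded string:
raw = '{"a": 1, "b": [2, 3]}'

# Decode it back into a Python object on the client side.
decoded = json.loads(raw)
print(decoded["b"])  # [2, 3]
```

The same pattern applies to other unsupported types: cast to `text` in SQL, then parse in Python as needed.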
Why did you choose such lax licensing? Could you change to a copy left license, please?
...just kidding, no one would ask that. This is dual licensed under Unlicense and MIT.