Perhaps the fastest and most memory-efficient way to pull data from PostgreSQL into pandas and numpy.
Have a gander at the initial benchmarks:

flaco tends to use roughly 3-5x less memory than standard `pandas.read_sql`, and is about 3x faster. However, it's probably 50x less stable at the moment.
To whet your appetite, here's a memory profile comparing flaco and `pandas.read_sql` on a table with 2M rows and columns of various types (see `test_benchmarks.py`). *If the test data has null values, you can expect a ~3x saving instead of the ~5x you see here; therefore (hot tip 🔥) prefer non-null data where you can.*
```
Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
    99    140.5 MiB    140.5 MiB           1   @profile
   100                                         def memory_profile():
   101    140.5 MiB      0.0 MiB           1       stmt = "select * from test_table"
   102
   103
   104    140.9 MiB      0.4 MiB           1       with Database(DB_URI) as con:
   105    441.8 MiB    300.9 MiB           1           data = read_sql(stmt, con)
   106    441.8 MiB      0.0 MiB           1           _flaco_df = pd.DataFrame(data, copy=False)
   107
   108
   109    441.8 MiB      0.0 MiB           1       engine = create_engine(DB_URI)
   110   2091.5 MiB   1649.7 MiB           1       _pandas_df = pd.read_sql(stmt, engine)
```
```python
from flaco.io import read_sql, Database


uri = "postgresql://postgres:postgres@localhost:5432/postgres"
stmt = "select * from my_big_table"

with Database(uri) as con:
    data = read_sql(stmt, con)  # dict of column name to numpy array

# If you have pandas installed, you can create a DataFrame
# with zero copying like this:
import pandas as pd
df = pd.DataFrame(data, copy=False)

# If you know the _exact_ rows which will be returned,
# you can supply n_rows to perform a single array
# allocation without needing to resize during query reading.
with Database(uri) as con:
    data = read_sql(stmt, con, 1_000)
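To illustrate the zero-copy claim above without needing a live database, here's a sketch that mimics flaco's output (a dict of column name to numpy array — the column names and values are made up) and builds a `DataFrame` from it with `copy=False`:

```python
import numpy as np
import pandas as pd

# Simulate flaco's return value: a dict of column name -> numpy array.
# (These column names and values are hypothetical.)
data = {
    "id": np.arange(5, dtype=np.int64),
    "value": np.linspace(0.0, 1.0, 5),
}

# copy=False asks pandas to reuse the arrays rather than copy them.
df = pd.DataFrame(data, copy=False)

# Whether memory is actually shared depends on your pandas version
# and copy-on-write settings, so we just report it here:
print(np.shares_memory(df["value"].to_numpy(), data["value"]))
```

Either way, the construction itself is cheap relative to re-reading the data through a driver row by row.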
Is this a drop-in replacement for `pandas.read_sql`?

No. It varies in a few ways:
- It will return a dict of column name to `numpy.ndarray` objects, not a DataFrame. But this can be passed with zero copies to `pandas.DataFrame(data, copy=False)`.
- When querying integer columns, if a null is encountered, the array will be converted to `dtype=object` and nulls from PostgreSQL will be `None`. Whereas pandas will convert the underlying array to a float type, where nulls from postgres are `NaN`.
- It lacks basically all of the options `pandas.read_sql` offers.
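The integer-null difference above can be demonstrated without a database; this is plain pandas/numpy behavior:

```python
import numpy as np
import pandas as pd

# pandas: an integer column containing a null is promoted to float64,
# and the null becomes NaN.
pandas_style = pd.Series([1, 2, None])
print(pandas_style.dtype)  # float64

# flaco (per the notes above): the column becomes an object array
# and the null stays as None.
flaco_style = np.array([1, 2, None], dtype=object)
print(flaco_style[2] is None)  # True
```

The practical consequence: with flaco you keep exact integers and can test for `None` directly, at the cost of losing vectorized numeric operations on that column until you convert it.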
Furthermore, while it's pretty neat this lib can allow faster and less resource-intensive use of numpy/pandas against PostgreSQL, it's in the early stages of development, and you're likely to encounter some sharp edges, which include, but are not limited to:
📝 Poor/non-existent error messages
🚰 Memory leaks (although I think most are handled now)
🦖 Almost complete lack of exception handling from the underlying Rust/C interface
❗ The `numeric` type is converted to `f64` for now, potentially losing precision. Note, this is exactly what `pandas.read_sql` does anyway.
❗ Might not handle all, or custom/arbitrary, PostgreSQL types. If you encounter such a type, either convert it to a supported type like text/json/jsonb (i.e. `select my_field::text ...`), or open an issue if a standard type is not supported.
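As a sketch of that workaround: casting a jsonb column to `::text` means the query returns plain strings, which you can decode back into Python objects yourself (the column and table names here are hypothetical):

```python
import json

# Suppose the query was:
#   select my_jsonb_field::text from my_table
# Each value then arrives as a JSON-encoded string:
raw = '{"a": 1, "b": [2, 3]}'

# Decode it back into a Python object on the client side.
decoded = json.loads(raw)
print(decoded["b"])  # [2, 3]
```

The same pattern applies to other unsupported types: cast to `text` in SQL, then parse in Python as needed.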
Why did you choose such lax licensing? Could you change to a copy left license, please?
...just kidding, no one would ask that. This is dual licensed under Unlicense and MIT.