ConnectorX - Fastest library to load data from DB to DataFrames in Rust and Python

Overview

Load data from databases to DataFrames, the fastest way.

ConnectorX enables you to load data from databases into Python in the fastest and most memory efficient way.

What you need is one line of code:

import connectorx as cx

cx.read_sql("postgresql://username:password@server:port/database", "SELECT * FROM lineitem")

Optionally, you can accelerate the data loading using parallelism by specifying a partition column.

import connectorx as cx

cx.read_sql("postgresql://username:password@server:port/database", "SELECT * FROM lineitem", partition_on="l_orderkey", partition_num=10)

The function partitions the query by evenly splitting the value range of the specified column into the given number of partitions. ConnectorX assigns one thread to each partition to load and write data in parallel. Currently, we support partitioning on integer columns for SPJA queries.

Check out more detailed usage and examples here. A general introduction of the project can be found in this blog post.

Installation

pip install connectorx
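
After installation, a quick sanity check (a minimal sketch; it assumes the package exposes __version__, as recent releases do):

import connectorx as cx

print(cx.__version__)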

Performance

We compared different Python solutions that provide a read_sql function by loading the lineitem table (8.6 GB) of the TPC-H benchmark at scale factor 10 from Postgres into a DataFrame, using 4-core parallelism.

Time chart, lower is better:

[time chart figure]

Memory consumption chart, lower is better:

[memory chart figure]

In conclusion, ConnectorX uses up to 3x less memory and 21x less time. More benchmark results are available here.

How does ConnectorX achieve lightning speed while keeping the memory footprint low?

We observe that existing solutions copy the data multiple times, to varying degrees, when downloading it. Additionally, implementing a data-intensive application in Python adds extra overhead.

ConnectorX is written in Rust and follows the "zero-copy" principle. This allows it to make full use of the CPU by being cache- and branch-predictor-friendly. Moreover, the architecture of ConnectorX ensures the data is copied exactly once, directly from the source to the destination.

How does ConnectorX download the data?

Upon receiving a query, e.g. SELECT * FROM lineitem, ConnectorX will first issue a LIMIT 1 query (SELECT * FROM lineitem LIMIT 1) to get the schema of the result set.

Then, if partition_on is specified, ConnectorX will issue SELECT MIN($partition_on), MAX($partition_on) FROM (SELECT * FROM lineitem) to determine the range of the partition column. After that, the original query is split into partitions based on the min/max information, e.g. SELECT * FROM (SELECT * FROM lineitem) WHERE $partition_on > 0 AND $partition_on < 10000. ConnectorX will then run a count query to get the size of each partition (e.g. SELECT COUNT(*) FROM (SELECT * FROM lineitem) WHERE $partition_on > 0 AND $partition_on < 10000). If partitioning is not specified, the count query is simply SELECT COUNT(*) FROM (SELECT * FROM lineitem).

Finally, ConnectorX will use the schema info as well as the count info to allocate memory and download data by executing the queries normally.

Once downloading begins, one thread is assigned to each partition so that the data is downloaded in parallel at the partition level. Each thread issues the query for its partition to the database and then writes the returned data to the destination row-wise or column-wise (depending on the database) in a streaming fashion.

Given this mechanism, an index on the partition column is recommended in order to make full use of the parallel downloading power provided by ConnectorX.
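
To make the partitioning step concrete, here is a minimal Python sketch of how such partition queries could be derived from the min/max information (the build_partition_queries helper below is purely illustrative, not ConnectorX's internal API):

def build_partition_queries(query, col, min_val, max_val, num):
    # Split [min_val, max_val] into `num` roughly equal ranges, mirroring
    # the WHERE-clause partition queries described above.
    step = (max_val - min_val + num) // num  # ceiling division
    parts = []
    for i in range(num):
        lower = min_val + i * step
        upper = min(lower + step, max_val + 1)
        parts.append(
            f"SELECT * FROM ({query}) AS CXTMPTAB_PART "
            f"WHERE {lower} <= CXTMPTAB_PART.{col} AND CXTMPTAB_PART.{col} < {upper}"
        )
    return parts

# e.g. 4 partitions of SELECT * FROM lineitem on l_orderkey in [1, 6000000]
for q in build_partition_queries("SELECT * FROM lineitem", "l_orderkey", 1, 6000000, 4):
    print(q)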

Supported Sources & Destinations

Supported protocols, data types and type mappings can be found here. For more planned data sources, please check out our discussion.

Sources

  • Postgres
  • MySQL
  • SQLite
  • Redshift (through Postgres protocol)
  • ClickHouse (through MySQL protocol)
  • SQL Server (no encryption support yet)
  • Oracle
  • ...

Destinations

  • Pandas
  • PyArrow
  • Modin (through Pandas)
  • Dask (through Pandas)
  • Polars (through PyArrow)
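
Each destination is selected through the return_type argument of read_sql, for example (the connection details below are placeholders):

import connectorx as cx

postgres_url = "postgresql://username:password@server:port/database"
query = "SELECT * FROM lineitem"

df = cx.read_sql(postgres_url, query, return_type="pandas")    # pandas.DataFrame
tbl = cx.read_sql(postgres_url, query, return_type="arrow")    # pyarrow.Table
pldf = cx.read_sql(postgres_url, query, return_type="polars")  # polars.DataFrame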

Detailed Usage and Examples

Rust docs: stable nightly

API

connectorx.read_sql(conn: str, query: Union[List[str], str], *, return_type: str = "pandas", protocol: str = "binary", partition_on: Optional[str] = None, partition_range: Optional[Tuple[int, int]] = None, partition_num: Optional[int] = None)

Run the SQL query and download the data from the database into a dataframe.

Parameters

  • conn: str: Connection string URI. Supported URI scheme: (postgres|postgresql|mysql|mssql|sqlite)://username:password@addr:port/dbname.
  • query: Union[str, List[str]]: SQL query or list of SQL queries for fetching data.
  • return_type: str = "pandas": The return type of this function. It can be arrow, pandas, modin, dask or polars.
  • protocol: str = "binary": The protocol used to fetch data from source, default is binary. Check out here to see more details.
  • partition_on: Optional[str]: The column to partition the result.
  • partition_range: Optional[Tuple[int, int]]: The value range of the partition column (see the example following this list).
  • partition_num: Optional[int]: The number of partitions to generate.
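
For instance, partition_range can be supplied together with partition_on to skip the MIN/MAX query when the bounds are already known (a sketch; the range values below are illustrative):

import connectorx as cx

cx.read_sql(
    "postgresql://username:password@server:port/database",
    "SELECT * FROM lineitem",
    partition_on="l_orderkey",
    partition_range=(1, 6000000),
    partition_num=10,
)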

Examples

  • Read a DataFrame from a SQL query using a single thread

    import connectorx as cx
    
    postgres_url = "postgresql://username:password@server:port/database"
    query = "SELECT * FROM lineitem"
    
    cx.read_sql(postgres_url, query)
  • Read a DataFrame in parallel using 10 threads by automatically partitioning the provided SQL on the partition column (partition_range will be automatically queried if not given)

    import connectorx as cx
    
    postgres_url = "postgresql://username:password@server:port/database"
    query = "SELECT * FROM lineitem"
    
    cx.read_sql(postgres_url, query, partition_on="l_orderkey", partition_num=10)
  • Read a DataFrame in parallel using 2 threads by manually providing two partition SQLs (the schemas of all the query results should be the same)

    30000000"] cx.read_sql(postgres_url, queries) ">
    import connectorx as cx
    
    postgres_url = "postgresql://username:password@server:port/database"
    queries = ["SELECT * FROM lineitem WHERE l_orderkey <= 30000000", "SELECT * FROM lineitem WHERE l_orderkey > 30000000"]
    
    cx.read_sql(postgres_url, queries)
  • Read a DataFrame in parallel using 4 threads from a more complex query

    import connectorx as cx
    
    postgres_url = "postgresql://username:password@server:port/database"
    query = f"""
    SELECT l_orderkey,
           SUM(l_extendedprice * ( 1 - l_discount )) AS revenue,
           o_orderdate,
           o_shippriority
    FROM   customer,
           orders,
           lineitem
    WHERE  c_mktsegment = 'BUILDING'
           AND c_custkey = o_custkey
           AND l_orderkey = o_orderkey
           AND o_orderdate < DATE '1995-03-15'
           AND l_shipdate > DATE '1995-03-15'
    GROUP  BY l_orderkey,
              o_orderdate,
              o_shippriority 
    """
    
    cx.read_sql(postgres_url, query, partition_on="l_orderkey", partition_num=4)

Next Plan

Check out our discussions to participate in deciding our next plan!

Historical Benchmark Results

https://sfu-db.github.io/connector-x/dev/bench/

Developer's Guide

Please see Developer's Guide for information about developing ConnectorX.

Comments
  • MySQL source parsing NULL value error


    import polars as pl
    import pandas as pd
    from sqlalchemy import create_engine
    import pyarrow

    print(pl.__version__)
    # 0.8.20
    print(pd.__version__)
    # 1.3.0
    # $ pip list | grep connectorx
    # connectorx                    0.2.0
    print(pyarrow.__version__)
    # '4.0.1'

    # pandas first
    sql = "select ORDER_ID from tables"
    engine = create_engine('mysql+pymysql://root:***@*.*.*.*:*')
    df = pd.read_sql_query(sql, engine)
    print(df.dtypes)
    # ORDER_ID                      int64

    # polars second
    conn = "mysql://root:***@*.*.*.*:*"
    pdf = pl.read_sql(sql, conn)
    ---------------------------------------------------------------------------
    PanicException                            Traceback (most recent call last)
    <timed exec> in <module>
    
    ~/miniconda3/envs/test/lib/python3.8/site-packages/polars/io.py in read_sql(sql, connection_uri, partition_on, partition_range, partition_num)
        556     """
        557     if _WITH_CX:
    --> 558         tbl = cx.read_sql(
        559             conn=connection_uri,
        560             query=sql,
    
    ~/miniconda3/envs/test/lib/python3.8/site-packages/connectorx/__init__.py in read_sql(conn, query, return_type, protocol, partition_on, partition_range, partition_num)
        126             raise ValueError("You need to install pyarrow first")
        127 
    --> 128         result = _read_sql(
        129             conn,
        130             "arrow",
    
    PanicException: Could not retrieve i64 from Value
    
    
    opened by ztsweet 13
  • Update Connection Pooling Crate


    It would be nice if we could get logs from rust over the wire for debug purposes, preferably configurable from the Python client.

    I have a supposedly Postgres-compatible source that fails due to

    RuntimeError: Cannot get metadata for the queries, last error: Some(Error { kind: UnexpectedMessage, cause: None })
    

    In the meantime, any tips for debugging this?

    opened by wseaton 13
  • ModuleNotFoundError: No module named 'connectorx.connectorx_python'


    Hello,

    I tried importing connectorx in a python interpreter after installing it through pipenv, but I am getting this error.

    Python 3.10.0 (default, Dec 16 2021, 16:14:51) [Clang 13.0.0 (clang-1300.0.29.30)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import connectorx as cx
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Users/xxxxx/.local/share/virtualenvs/xxxxx-py-_L81HEc9/lib/python3.10/site-packages/connectorx/__init__.py", line 3, in <module>
        from .connectorx_python import read_sql as _read_sql
    ModuleNotFoundError: No module named 'connectorx.connectorx_python'
    
    opened by zandeck 12
  • Add SSL/TLS support for Postgres connections


    As discussed in #103, currently there's no way to connect to a postgres database that mandates a TLS/SSL connection.

    What this PR does is make it so sslmode=require is functional, and if a root cert and no client certs are passed it falls back to the behavior of sslmode=verify-ca.

    Example conn string:

    conn = f"postgresql://{user}:{pw}@my.database.host:5439/db?sslmode=require&sslrootcert=/tmp/myroot.crt"
    df = cx.read_sql(conn, "select 1;", return_type='pandas', protocol='cursor')
    

    There is a lot of upstream discussion on the topic and I've tried to link to it in doc strings where appropriate. Tested it on Amazon Redshift, but looking for pointers on how to add functional tests for this. All of the code probably isn't idiomatic rust either, so looking for some feedback there too. Thanks!

    opened by wseaton 10
  • Fatal Python error: none_dealloc: deallocating None


    The problem

    Running a query will result in a Fatal Python error: none_dealloc: deallocating None. I have no idea what causes this exception or how to even start looking what causes it. All help is appreciated.

    More context:

    I am working on a Dash application using sqlalchemy to construct the query, which is then compiled for a mysql database. The query runs perfectly as expected (much faster than using the pandas built-in read_sql_query, pd.read_sql_query(query, db.engine)), but several seconds after the query, the entire python application crashes. When the query is not executed, the python application keeps running. When replacing the 2 lines below with the original pandas built-in read_sql_query, there is no problem and Python does not crash at all.

    My Code

    The query parameter is a complex, dynamically created sqlalchemy select object of type <class 'sqlalchemy.sql.selectable.Select'>.

    query_cx = str(query.compile(dialect=mysql.dialect(),compile_kwargs={"literal_binds": True}))
    df = cx.read_sql(DB_URL, query_cx)
    
    opened by TomEversdijk 9
  • read_sql from Oracle to polars truncates timestamp


    What language are you using?

    Python

    What version are you using?

    0.2.5

    What database are you using?

    Oracle

    What dataframe are you using?

    polars, pandas, arrow2

    Can you describe your bug?

    When querying data from an Oracle DB, timestamp columns are truncated to date format (DD-MM-YYYY 00:00:00).

    Example query / code
    query = 'select timestamp_col from sample_db'
    cx.read_sql(conn, query, return_type='polars')
    

    Result: 2022-04-27 00:00:00

    Converting the timestamp to varchar (within the query) and then to datetime (in polars) works:

    query = "select to_char(timestamp_col, 'YYYY-MM-DD HH24:MI:SS') as str_col from sample_db"
    cx.read_sql(conn, query, return_type='polars').with_column(
              pl.col('str_col').str.strptime(pl.Datetime, '%F %T').alias('datetime_col') 
    )
    

    Result: 2022-04-27 18:21:22

    What is the error?

    No error message, but wrong datetime format.

    bug 
    opened by wKollendorf 8
  • Missing conversion rule for bytea in postgres


    Error message:

    Can't run dispatcher: ConnectorX(NoConversionRule("ByteA(true)", "connectorx::destinations::arrow::typesystem::ArrowTypeSystem"))

    Expected: BYTEA in postgresql should be converted to Vec<u8> in Rust

    Is this a time-consuming fix? Happy to assist if this is the case.

    opened by quambene 8
  • How do I connect to Oracle?


    I keep getting this error message. Any idea what might be wrong? My login credentials and DSN are all correct, and the tnsnames.ora file is referenced in the environment variable. However, I just cannot connect.

    This is what I currently have:

    import connectorx as cx

    conn = "oracle://username:mypassword!001@dsn"
    query = "select * from table"

    data = cx.read_sql(conn, query=query)

    This is the error message: "timed out waiting for connection: OCI Error: ORA-12154: TNS:could not resolve the connect identifier specified"

    opened by Fatroundfox 7
  • MSSQL issues


    Hi,

    It's possible I'm missing something, I'd really like to give this a try, but every time I try to connect to MSSQL I get this error:


    PanicException                            Traceback (most recent call last)
    <ipython-input> in <module>
    ----> 1 df = cx.read_sql(conn=conn, query="Select * from test", partition_on="Unique_ID", partition_num=10)

    C:\ProgramData\Miniconda3\lib\site-packages\connectorx\__init__.py in read_sql(conn, query, return_type, protocol, partition_on, partition_range, partition_num, index_col)
        100     raise ValueError("You need to install pandas first")
        101
    --> 102     result = _read_sql(
        103         conn,
        104         "pandas",

    PanicException: byte index 1 is out of bounds of ``

    My connection string that I'm passing to conn looks like: mssql://user:passw@ip:port

    I've updated pandas to the most recent version, and I upgraded connector-x from stable to alpha-5 to see if it made any difference. Thanks!

    opened by ishbooisland 7
  • [Postgres] int4 type conversion failure on partition compare


    Related to #107: I can get queries to run now via the postgres bridge of my database (via the cursor protocol), but I'm running into a subtle type issue. Was wondering if you had any idea of the cause; happy to help contribute a fix.

    thread '<unnamed>' panicked at 'error retrieving column 0: error deserializing column 0: cannot convert between the Rust type `i64` and the Postgres type `int4`', /github/home/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-postgres-0.7.3/src/row.rs:151:25
    

    From what I can tell in the transport mapping, this conversion shouldn't be happening (https://github.com/sfu-db/connector-x/blob/main/connectorx/src/sources/postgres/typesystem.rs#L41).

    The error only happens on the COUNT(*) queries used to estimate the partition size.

    Example:

    [2021-11-02T13:44:35Z DEBUG connectorx::sql] Transformed count query: SELECT count(*) FROM (SELECT cn, account, created_fiscal_yr FROM mydb.mytable WHERE createddate >= '2017-01-01') AS CXTMPTAB_PART WHERE 2019 <= CXTMPTAB_PART.created_fiscal_yr AND CXTMPTAB_PART.created_fiscal_yr < 2022
    

    Maybe there is a code path where the arrow transport is not used? Thanks!

    opened by wseaton 7
  • read_sql from SQL Server to Arrow truncates datetime


    I'm working with MS SQL Server 2019. When reading a datetime field, read_sql correctly captures it in a pandas dataframe. When I convert that dataframe to an arrow table, the datetimes are also retained correctly. However, loading directly into an arrow table truncates the datetime fields to midnight. I'd like to remove the pandas dependency and load directly into an arrow table. Is there a way to do this without truncating the datetimes?

    Example Code:

    import connectorx as cx
    import pyarrow as pa
    con_string = 'mssql://user:[email protected]%5CG:1439/database'
    print('------- Pandas Table -------')
    query = 'SELECT top 5 Datum FROM Termine WHERE Datum>getdate()'
    pandas_table = cx.read_sql(con_string, query, return_type='pandas')
    print(pandas_table)
    print('------- Arrow table from pandas -------')
    arrow_table_from_pandas = pa.Table.from_pandas(pandas_table)
    print(arrow_table_from_pandas)
    print('------- Arrow Table -------')
    arrow_table = cx.read_sql(con_string, query, return_type='arrow')
    print(arrow_table)
    

    Example Output:

    ------- Pandas Table -------
                    Datum
    0 2022-02-06 07:30:00
    1 2022-02-07 00:00:00
    2 2022-02-07 00:00:00
    3 2022-02-07 07:00:00
    4 2022-02-07 07:30:00
    ------- Arrow table from pandas -------
    pyarrow.Table
    Datum: timestamp[ns]
    ----
    Datum: [[2022-02-06 07:30:00.000000000,2022-02-07 00:00:00.000000000,2022-02-07 00:00:00.000000000,2022-02-07 07:00:00.000000000,2022-02-07 07:30:00.000000000]]
    ------- Arrow Table -------
    pyarrow.Table
    Datum: date64[ms]
    ----
    Datum: [[2022-02-06,2022-02-07,2022-02-07,2022-02-07,2022-02-07]]
    
    opened by t-alex-fritz 6
  • # hashtag character causing error on authentication


    What language are you using?

    Python

    What version are you using?

    0.3.1

    What database are you using?

    MySQL

    What dataframe are you using?

    Pandas,

    Can you describe your bug?

    I have a database user with a "#" character in its credentials. I can't connect because of the hashtag; I tried with another user and it works.

    What are the steps to reproduce the behavior?

    Connect to a MySQL database as a user having a "#" in its credentials

    cx.read_sql("mysql://user:aa###aaa@address:3306/db", str(query)).iloc[:, 0].tolist()
    

    What is the error?

    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/home/ubuntu/.cache/pypoetry/virtualenvs/api.eramet-ObXpZerD-py3.9/lib/python3.9/site-packages/connectorx/__init__.py", line 224, in read_sql
        result = _read_sql(
    RuntimeError: parse error: invalid port number
    
    bug 
    opened by Syndorik 1
  • reading mssql datatype decimal(38,20) returns an error


    Hi,

    I'm trying to read data from a MS SQL Server table with a column of data type decimal(38,20).

    I get the following error:

    thread '<unnamed>' panicked at 'Number exceeds maximum value that can be represented.', C:\Users\runneradmin\.cargo\registry\src\github.com-1ecc6299db9ec823\rust_decimal-1.26.1\src\decimal.rs:470:23

    The value in the column that causes the error is this one: 2800003753376.20000000000000000000

    I tried casting the column. It works when the data type is cast to decimal(38,16). Basically, as soon as the scale is higher (>16), it fails.

    Is this a bug?

    opened by sashk8 0
  • can return multi df?


    I use cx as an ETL tool, and it is very quick. But my PC's memory is not large; if the query result is big, my memory can't hold it. cx.read_sql can take partition params (partition_on="l_orderkey", partition_num=10); maybe another param could be added to return the 10 dataframes sequentially? Thanks! (A possible workaround is sketched below.)
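
    One possible workaround today is to issue per-partition queries yourself and read them one at a time (a sketch; the split bounds below are illustrative):

    import connectorx as cx

    conn = "postgresql://username:password@server:port/database"
    bounds = [(1, 1500000), (1500001, 3000000), (3000001, 4500000), (4500001, 6000000)]
    for lo, hi in bounds:
        chunk = cx.read_sql(conn, f"SELECT * FROM lineitem WHERE l_orderkey BETWEEN {lo} AND {hi}")
        # process `chunk` here, then drop the reference before reading the next one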

    opened by wonb168 0
  • Error "Unknown authentication protocol: sha256_password" for queries to ClickHouse

    What language are you using?

    Python

    What version are you using?

    0.3.1

    What database are you using?

    ClickHouse (over MySQL protocol)

    What dataframe are you using?

    Polars

    What are the steps to reproduce the behavior?

    1. Create a user identified with sha256_password in ClickHouse: https://clickhouse.com/docs/en/sql-reference/statements/create/user/
    2. Try to run any query using connectorx
    Example query / code
    import connectorx
    import polars
    
    conn = f'mysql://{user}:{password}@{host}:9004/default'
    
    # both queries fail with the same error
    polars.read_sql('select 1 as c', conn, protocol='text')
    connectorx.read_sql(conn, 'select 1 as c', protocol='text')
    

    What is the error?

    RuntimeError: timed out waiting for connection: DriverError { Unknown authentication protocol: sha256_password }

    bug 
    opened by ivkhokhlachev 0
  • PanicException: called `Result::unwrap()` on an `Err` value: Os { code: 35, kind: WouldBlock, message: "Resource temporarily unavailable" }

    What language are you using?

    Python

    What version are you using?

    connectorx 0.3.1

    What database are you using?

    sqlite3

    What dataframe are you using?

    Pandas

    Can you describe your bug?

    The exception happens while selecting in a loop, after approximately 1500 iterations. If time.sleep(0.01) is added right after the query, the problem goes away.

    What are the steps to reproduce the behavior?

    Example query / code

    result_df = cx.read_sql("sqlite://" + xdb_path, "SELECT * FROM '" + table_name + "'")

    What is the error?

    thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 35, kind: WouldBlock, message: "Resource temporarily unavailable" }', /Users/runner/.cargo/registry/src/github.com-1ecc6299db9ec823/scheduled-thread-pool-0.2.6/src/lib.rs:320:44
    note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
    Exception in thread Thread-1:
    Traceback (most recent call last):
      File "/opt/homebrew/Caskroom/miniforge/base/envs/tf/lib/python3.9/threading.py", line 973, in _bootstrap_inner
        self.run()
      File "/opt/homebrew/Caskroom/miniforge/base/envs/tf/lib/python3.9/threading.py", line 910, in run
        self._target(*self._args, **self._kwargs)
      File "/Users/mac/tensorflow_macos_venv/monitor_cutoff.py", line 663, in get_dataframes
        ResultDFArray[mode_index], FactorArray[mode_index], MinDaysArray[mode_index] = get_combined_data(mode_in_list)
      File "/Users/mac/tensorflow_macos_venv/monitor_cutoff.py", line 382, in get_combined_data
        result_df = cx.read_sql("sqlite://" + xdb_path, "SELECT * FROM '" + table_name + "'")
      File "/opt/homebrew/Caskroom/miniforge/base/envs/tf/lib/python3.9/site-packages/connectorx/__init__.py", line 224, in read_sql
        result = _read_sql(
    pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: Os { code: 35, kind: WouldBlock, message: "Resource temporarily unavailable" }

    bug 
    opened by bitnlp 0
  • Issue Connecting to Oracle


    Hello,

    I am having an issue with the connection string required by connectorx.read_sql(). I am using the template suggested by the documentation: 'oracle://username:password@server:port/database'. However, I get the error: ORA-12154: TNS:could not resolve the connect identifier specified

    I can connect to this same Oracle DB through SQLAlchemy by using the following connection string: oracle+cx_oracle://{user}:{password}@(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST={host})(PORT={port}))(CONNECT_DATA=(SERVICE_NAME={service_name})))

    The portion of the string after the '@' is generated by the cx_Oracle.make_dsn function. My username & password contain only alphanumeric characters.

    Any help would be greatly appreciated. This package seems to have a ton of potential and I'm excited to try it out.

    opened by ciollid2 8
Releases (v0.3.1)
  • v0.3.1(Oct 31, 2022)

    What's Changed

    • Add CLOB and BLOB to Oracle typesystem by @wKollendorf in #287
    • Correcting Spelling in Docs by @venkashank in https://github.com/sfu-db/connector-x/pull/297
    • Update doc by @lBilali in #324 #327 #326 #325 #323
    • Adding bigquery job location in parameters while getting job details by @phanindra-ramesh in https://github.com/sfu-db/connector-x/pull/346
    • Refactor for federated query in https://github.com/sfu-db/connector-x/pull/351
    • Rust version bump to nightly-2022-09-15 https://github.com/sfu-db/connector-x/pull/357
    • Add array support in arrow2 #339
    • Upgrade gcp-bigquery-client to 0.13.0 by @hieudepchai in https://github.com/sfu-db/connector-x/pull/355
    • Bump arrow-rs to 22 by @houqp in https://github.com/sfu-db/connector-x/pull/364
    • Add ltree string type by @auyer in https://github.com/sfu-db/connector-x/pull/382
    • Fix docs typo by @alexander-beedie in https://github.com/sfu-db/connector-x/pull/383
    • Add citext in postgres #372
    • Add long in oracle #375
    • Support MySQL in federated query #296
    • Add appname in mssql connection string #336
    • Fix bug for NULL values in round func on oracle #333

    New Contributors

    @wKollendorf, @houqp, @venkashank, @lBilali, @phanindra-ramesh, @hieudepchai, @auyer made their first contribution

    Source code(tar.gz)
    Source code(zip)
  • v0.3.0(May 11, 2022)

    ConnectorX

    Features

    • Add experimental support on federated query (support PostgreSQL only) #280
    • BigQuery support experimental -> stable #152
    • Upgrade rust version to nightly-2022-04-17

    Bug Fix

    • Arrow2 add blob #261
    • Mssql datetime fix #263
    • Add Mssql encryption option #265
    • Fix Mssql count query construction for queries with OFFSET
    • Add binaryfloat and binarydouble to Oracle #273
    Source code(tar.gz)
    Source code(zip)
  • v0.2.5(Mar 29, 2022)

    ConnectorX

    Features

    • Build wheel for m1 machine through cross compile #238
    • Support client authentication on postgres #222
    • Expose partition_sql and get_meta for Python API
    • Parse the connection URL's scheme as the substring before '+', to be compatible with possible SQLAlchemy-style user input

    Bug Fix

    • Sqlite windows path error #245
    • Oracle convert Number(0,0) to float for round(x, y) #227
    Source code(tar.gz)
    Source code(zip)
  • v0.2.4(Mar 3, 2022)

    ConnectorX 0.2.4

    Bug Fix

    • Fixing deallocating None bug #201
    • Clickhouse support on new version #165
    • Shrink array before return for arrow2 to reduce memory usage when result is small #196
    • Fix Oracle port error #209

    Others

    • Improve Oracle performance by enlarging array size using prepared statement #127
    • Upgrade arrow2 to 0.9 #221
    • Add Postgres Json, TimestampTz support to arrow2 #235 #229
    • Clean up python dependencies (remove ones for benchmarking)
    Source code(tar.gz)
    Source code(zip)
  • v0.2.3(Dec 21, 2021)

    ConnectorX 0.2.3

    Features

    • Add Google BigQuery as source https://github.com/sfu-db/connector-x/issues/152 - need benchmark and more tests
    • Add windows trust_connection for mssql https://github.com/sfu-db/connector-x/discussions/145
    • Support Postgres HStore (only for cursor protocol) https://github.com/sfu-db/connector-x/issues/174

    Others

    • Investigate SQLite in-memory database https://github.com/sfu-db/connector-x/issues/172
    • Make redshift connection URIs automatically use the cursor protocol https://github.com/sfu-db/connector-x/pull/187
    • Test on mariadb https://github.com/sfu-db/connector-x/issues/193
    • Test on azure sql database
    Source code(tar.gz)
    Source code(zip)
  • v0.2.2(Nov 23, 2021)

    ConnectorX 0.2.2

    Features

    • Stream write to destination https://github.com/sfu-db/connector-x/pull/147
    • Support CTE https://github.com/sfu-db/connector-x/pull/161
    • Add money type support in mssql https://github.com/sfu-db/connector-x/issues/141
    • Add support of converting bytea in postgres arrow https://github.com/sfu-db/connector-x/issues/148

    Bug Fixes

    • Test each feature gate https://github.com/sfu-db/connector-x/issues/139
    • Support Python 3.10 https://github.com/sfu-db/connector-x/issues/157
    • Adding instance name to connect mssql https://github.com/sfu-db/connector-x/issues/140
    • Support multiple integer types for the count query in postgres https://github.com/sfu-db/connector-x/pull/153
    • Support unsigned for mysql: https://github.com/sfu-db/connector-x/issues/163
    • Fix issue when limit > count: https://github.com/sfu-db/connector-x/issues/166

    Other Changes

    • Add turbodbc in benchmark https://github.com/sfu-db/connector-x/issues/124
    • Update polars dependency https://github.com/sfu-db/connector-x/pull/154
    Source code(tar.gz)
    Source code(zip)
Owner
SFU Database Group