ConnectorX - Fastest library to load data from DB to DataFrames in Rust and Python

Overview

Load data from databases to DataFrames, the fastest way.

ConnectorX enables you to load data from databases into Python in the fastest and most memory efficient way.

What you need is one line of code:

import connectorx as cx

cx.read_sql("postgresql://username:password@server:port/database", "SELECT * FROM lineitem")

Optionally, you can accelerate the data loading using parallelism by specifying a partition column.

import connectorx as cx

cx.read_sql("postgresql://username:password@server:port/database", "SELECT * FROM lineitem", partition_on="l_orderkey", partition_num=10)

The function partitions the query by evenly splitting the value range of the specified column into the given number of partitions. ConnectorX assigns one thread to each partition to load and write data in parallel. Currently, we support partitioning on integer columns for SPJA queries.

Check out more detailed usage and examples here. A general introduction of the project can be found in this blog post.

Installation

pip install connectorx
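
After installation, a quick sanity check (a minimal sketch; it assumes the package exposes __version__, as recent releases do):

import connectorx as cx

print(cx.__version__)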

Performance

We compared different Python solutions that provide a read_sql function by loading the lineitem table (8.6 GB) of the TPC-H benchmark at scale factor 10 from Postgres into a DataFrame, using 4-core parallelism.

Time chart, lower is better:

[time chart figure]

Memory consumption chart, lower is better:

[memory chart figure]

In conclusion, ConnectorX uses up to 3x less memory and 21x less time. More benchmark results are available here.

How does ConnectorX achieve lightning speed while keeping the memory footprint low?

We observe that existing solutions copy the data multiple times, to varying degrees, when downloading it. Additionally, implementing a data-intensive application in Python adds extra overhead.

ConnectorX is written in Rust and follows the "zero-copy" principle. This allows it to make full use of the CPU by being cache- and branch-predictor-friendly. Moreover, the architecture of ConnectorX ensures the data is copied exactly once, directly from the source to the destination.

How does ConnectorX download the data?

Upon receiving a query, e.g. SELECT * FROM lineitem, ConnectorX will first issue a LIMIT 1 query (SELECT * FROM lineitem LIMIT 1) to get the schema of the result set.

Then, if partition_on is specified, ConnectorX will issue SELECT MIN($partition_on), MAX($partition_on) FROM (SELECT * FROM lineitem) to determine the range of the partition column. After that, the original query is split into partitions based on the min/max information, e.g. SELECT * FROM (SELECT * FROM lineitem) WHERE $partition_on > 0 AND $partition_on < 10000. ConnectorX will then run a count query to get the size of each partition (e.g. SELECT COUNT(*) FROM (SELECT * FROM lineitem) WHERE $partition_on > 0 AND $partition_on < 10000). If partitioning is not specified, the count query is simply SELECT COUNT(*) FROM (SELECT * FROM lineitem).

Finally, ConnectorX will use the schema info as well as the count info to allocate memory and download data by executing the queries normally.

Once downloading begins, one thread is assigned to each partition so that the data is downloaded in parallel at the partition level. Each thread issues the query for its partition to the database and then writes the returned data to the destination row-wise or column-wise (depending on the database) in a streaming fashion.

Given this mechanism, an index on the partition column is recommended in order to make full use of the parallel downloading power provided by ConnectorX.
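
To make the partitioning step concrete, here is a minimal Python sketch of how such partition queries could be derived from the min/max information (the build_partition_queries helper below is purely illustrative, not ConnectorX's internal API):

def build_partition_queries(query, col, min_val, max_val, num):
    # Split [min_val, max_val] into `num` roughly equal ranges, mirroring
    # the WHERE-clause partition queries described above.
    step = (max_val - min_val + num) // num  # ceiling division
    parts = []
    for i in range(num):
        lower = min_val + i * step
        upper = min(lower + step, max_val + 1)
        parts.append(
            f"SELECT * FROM ({query}) AS CXTMPTAB_PART "
            f"WHERE {lower} <= CXTMPTAB_PART.{col} AND CXTMPTAB_PART.{col} < {upper}"
        )
    return parts

# e.g. 4 partitions of SELECT * FROM lineitem on l_orderkey in [1, 6000000]
for q in build_partition_queries("SELECT * FROM lineitem", "l_orderkey", 1, 6000000, 4):
    print(q)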

Supported Sources & Destinations

Supported protocols, data types and type mappings can be found here. For more planned data sources, please check out our discussion.

Sources

  • Postgres
  • MySQL
  • SQLite
  • Redshift (through Postgres protocol)
  • ClickHouse (through MySQL protocol)
  • SQL Server (no encryption support yet)
  • Oracle
  • ...

Destinations

  • Pandas
  • PyArrow
  • Modin (through Pandas)
  • Dask (through Pandas)
  • Polars (through PyArrow)
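
Each destination is selected through the return_type argument of read_sql, for example (the connection details below are placeholders):

import connectorx as cx

postgres_url = "postgresql://username:password@server:port/database"
query = "SELECT * FROM lineitem"

df = cx.read_sql(postgres_url, query, return_type="pandas")    # pandas.DataFrame
tbl = cx.read_sql(postgres_url, query, return_type="arrow")    # pyarrow.Table
pldf = cx.read_sql(postgres_url, query, return_type="polars")  # polars.DataFrame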

Detailed Usage and Examples

Rust docs: stable nightly

API

connectorx.read_sql(conn: str, query: Union[List[str], str], *, return_type: str = "pandas", protocol: str = "binary", partition_on: Optional[str] = None, partition_range: Optional[Tuple[int, int]] = None, partition_num: Optional[int] = None)

Run the SQL query and download the data from the database into a dataframe.

Parameters

  • conn: str: Connection string URI. Supported URI scheme: (postgres|postgresql|mysql|mssql|sqlite)://username:password@addr:port/dbname.
  • query: Union[str, List[str]]: SQL query or list of SQL queries for fetching data.
  • return_type: str = "pandas": The return type of this function. It can be arrow, pandas, modin, dask or polars.
  • protocol: str = "binary": The protocol used to fetch data from source, default is binary. Check out here to see more details.
  • partition_on: Optional[str]: The column to partition the result.
  • partition_range: Optional[Tuple[int, int]]: The value range of the partition column (see the example following this list).
  • partition_num: Optional[int]: The number of partitions to generate.
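
For instance, partition_range can be supplied together with partition_on to skip the MIN/MAX query when the bounds are already known (a sketch; the range values below are illustrative):

import connectorx as cx

cx.read_sql(
    "postgresql://username:password@server:port/database",
    "SELECT * FROM lineitem",
    partition_on="l_orderkey",
    partition_range=(1, 6000000),
    partition_num=10,
)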

Examples

  • Read a DataFrame from a SQL query using a single thread

    import connectorx as cx
    
    postgres_url = "postgresql://username:password@server:port/database"
    query = "SELECT * FROM lineitem"
    
    cx.read_sql(postgres_url, query)
  • Read a DataFrame in parallel using 10 threads by automatically partitioning the provided SQL on the partition column (partition_range will be automatically queried if not given)

    import connectorx as cx
    
    postgres_url = "postgresql://username:password@server:port/database"
    query = "SELECT * FROM lineitem"
    
    cx.read_sql(postgres_url, query, partition_on="l_orderkey", partition_num=10)
  • Read a DataFrame in parallel using 2 threads by manually providing two partition SQLs (the schemas of all the query results should be the same)

    30000000"] cx.read_sql(postgres_url, queries) ">
    import connectorx as cx
    
    postgres_url = "postgresql://username:password@server:port/database"
    queries = ["SELECT * FROM lineitem WHERE l_orderkey <= 30000000", "SELECT * FROM lineitem WHERE l_orderkey > 30000000"]
    
    cx.read_sql(postgres_url, queries)
  • Read a DataFrame in parallel using 4 threads from a more complex query

    import connectorx as cx
    
    postgres_url = "postgresql://username:password@server:port/database"
    query = f"""
    SELECT l_orderkey,
           SUM(l_extendedprice * ( 1 - l_discount )) AS revenue,
           o_orderdate,
           o_shippriority
    FROM   customer,
           orders,
           lineitem
    WHERE  c_mktsegment = 'BUILDING'
           AND c_custkey = o_custkey
           AND l_orderkey = o_orderkey
           AND o_orderdate < DATE '1995-03-15'
           AND l_shipdate > DATE '1995-03-15'
    GROUP  BY l_orderkey,
              o_orderdate,
              o_shippriority 
    """
    
    cx.read_sql(postgres_url, query, partition_on="l_orderkey", partition_num=4)

Next Plan

Check out our discussions to participate in deciding our next plan!

Historical Benchmark Results

https://sfu-db.github.io/connector-x/dev/bench/

Developer's Guide

Please see Developer's Guide for information about developing ConnectorX.

Comments
  • MySQL source parsing NULL value error


    import polars as pl
    import pandas as pd
    from sqlalchemy import create_engine
    import pyarrow

    print(pl.__version__)
    # 0.8.20
    print(pd.__version__)
    # 1.3.0
    # $ pip list | grep connectorx
    # connectorx                    0.2.0
    print(pyarrow.__version__)
    # '4.0.1'

    # pandas first
    sql = "select ORDER_ID from tables"
    engine = create_engine('mysql+pymysql://root:***@*.*.*.*:*')
    df = pd.read_sql_query(sql, engine)
    print(df.dtypes)
    # ORDER_ID                      int64

    # polars second
    conn = "mysql://root:***@*.*.*.*:*"
    pdf = pl.read_sql(sql, conn)
    ---------------------------------------------------------------------------
    PanicException                            Traceback (most recent call last)
    <timed exec> in <module>
    
    ~/miniconda3/envs/test/lib/python3.8/site-packages/polars/io.py in read_sql(sql, connection_uri, partition_on, partition_range, partition_num)
        556     """
        557     if _WITH_CX:
    --> 558         tbl = cx.read_sql(
        559             conn=connection_uri,
        560             query=sql,
    
    ~/miniconda3/envs/test/lib/python3.8/site-packages/connectorx/__init__.py in read_sql(conn, query, return_type, protocol, partition_on, partition_range, partition_num)
        126             raise ValueError("You need to install pyarrow first")
        127 
    --> 128         result = _read_sql(
        129             conn,
        130             "arrow",
    
    PanicException: Could not retrieve i64 from Value
    
    
    opened by ztsweet 13
  • Update Connection Pooling Crate


    It would be nice if we could get logs from rust over the wire for debug purposes, preferably configurable from the Python client.

    I have a supposedly Postgres-compatible source that fails due to

    RuntimeError: Cannot get metadata for the queries, last error: Some(Error { kind: UnexpectedMessage, cause: None })
    

    In the meantime, any tips for debugging this?

    opened by wseaton 13
  • ModuleNotFoundError: No module named 'connectorx.connectorx_python'


    Hello,

    I tried importing connectorx in a python interpreter after installing it through pipenv, but I am getting this error.

    Python 3.10.0 (default, Dec 16 2021, 16:14:51) [Clang 13.0.0 (clang-1300.0.29.30)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import connectorx as cx
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Users/xxxxx/.local/share/virtualenvs/xxxxx-py-_L81HEc9/lib/python3.10/site-packages/connectorx/__init__.py", line 3, in <module>
        from .connectorx_python import read_sql as _read_sql
    ModuleNotFoundError: No module named 'connectorx.connectorx_python'
    
    opened by zandeck 12
  • Add SSL/TLS support for Postgres connections


    As discussed in #103, currently there's no way to connect to a postgres database that mandates a TLS/SSL connection.

    What this PR does is make it so sslmode=require is functional, and if a root cert and no client certs are passed it falls back to the behavior of sslmode=verify-ca.

    Example conn string:

    conn = f"postgresql://{user}:{pw}@my.database.host:5439/db?sslmode=require&sslrootcert=/tmp/myroot.crt"
    df = cx.read_sql(conn, "select 1;", return_type='pandas', protocol='cursor')
    

    There is a lot of upstream discussion on the topic and I've tried to link to it in doc strings where appropriate. Tested it on Amazon Redshift, but looking for pointers on how to add functional tests for this. All of the code probably isn't idiomatic rust either, so looking for some feedback there too. Thanks!

    opened by wseaton 10
  • Fatal Python error: none_dealloc: deallocating None


    The problem

    Running a query will result in a Fatal Python error: none_dealloc: deallocating None. I have no idea what causes this exception or how to even start looking what causes it. All help is appreciated.

    More context:

    I am working on a Dash application using sqlalchemy to construct the query, which is then compiled for a mysql database. The query runs perfectly as expected (much faster than using the pandas built-in read_sql_query, pd.read_sql_query(query, db.engine)), but several seconds after the query, the entire python application crashes. When the query is not executed, the python application keeps running. When replacing the 2 lines below with the original pandas built-in read_sql_query, there is no problem and Python does not crash at all.

    My Code

    The query parameter is a complex, dynamically created sqlalchemy select object of type <class 'sqlalchemy.sql.selectable.Select'>.

    query_cx = str(query.compile(dialect=mysql.dialect(),compile_kwargs={"literal_binds": True}))
    df = cx.read_sql(DB_URL, query_cx)
    
    opened by TomEversdijk 9
  • read_sql from Oracle to polars truncates timestamp


    What language are you using?

    Python

    What version are you using?

    0.2.5

    What database are you using?

    Oracle

    What dataframe are you using?

    polars, pandas, arrow2

    Can you describe your bug?

    When querying data from an Oracle DB, timestamp columns are truncated to date format (DD-MM-YYYY 00:00:00).

    Example query / code
    query = 'select timestamp_col from sample_db'
    cx.read_sql(conn, query, return_type='polars')
    

    Result: 2022-04-27 00:00:00

    Converting the timestamp to varchar (within the query) and then to datetime (in polars) works:

    query = "select to_char(timestamp_col, 'YYYY-MM-DD HH24:MI:SS') as str_col from sample_db"
    cx.read_sql(conn, query, return_type='polars').with_column(
              pl.col('str_col').str.strptime(pl.Datetime, '%F %T').alias('datetime_col') 
    )
    

    Result: 2022-04-27 18:21:22

    What is the error?

    No error message, but wrong datetime format.

    bug 
    opened by wKollendorf 8
  • Missing conversion rule for bytea in postgres


    Error message:

    Can't run dispatcher: ConnectorX(NoConversionRule("ByteA(true)", "connectorx::destinations::arrow::typesystem::ArrowTypeSystem"))

    Expected: BYTEA in postgresql should be converted to Vec<u8> in Rust

    Is this a time-consuming fix? Happy to assist if this is the case.

    opened by quambene 8
  • How do I connect to Oracle?


    I keep getting this error message. Any idea what might be wrong? My login credentials and DSN are all correct, and the tnsnames.ora file is referenced in the environment variable. However, I just cannot connect.

    This is what I currently have:

    import connectorx as cx

    conn = "oracle://username:mypassword!001@dsn"
    query = "select * from table"

    data = cx.read_sql(conn, query=query)

    This is the error message: "timed out waiting for connection: OCI Error: ORA-12154: TNS:could not resolve the connect identifier specified"

    opened by Fatroundfox 7
  • MSSQL issues


    Hi,

    It's possible I'm missing something, I'd really like to give this a try, but every time I try to connect to MSSQL I get this error:


    PanicException                            Traceback (most recent call last)
    <ipython-input> in <module>
    ----> 1 df = cx.read_sql(conn=conn, query="Select * from test", partition_on="Unique_ID", partition_num=10)

    C:\ProgramData\Miniconda3\lib\site-packages\connectorx\__init__.py in read_sql(conn, query, return_type, protocol, partition_on, partition_range, partition_num, index_col)
        100     raise ValueError("You need to install pandas first")
        101
    --> 102     result = _read_sql(
        103         conn,
        104         "pandas",

    PanicException: byte index 1 is out of bounds of ``

    My connection string that I'm passing to conn looks like: mssql://user:passw@ip:port

    I've updated pandas to the most recent version, and I upgraded connector-x from stable to alpha-5 to see if it made any difference. Thanks!

    opened by ishbooisland 7
  • [Postgres] int4 type conversion failure on partition compare


    Related to #107: I can get queries to run now via the postgres bridge of my database (via the cursor protocol), but I'm running into a subtle type issue. Was wondering if you had any idea of the cause; happy to help contribute a fix.

    thread '<unnamed>' panicked at 'error retrieving column 0: error deserializing column 0: cannot convert between the Rust type `i64` and the Postgres type `int4`', /github/home/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-postgres-0.7.3/src/row.rs:151:25
    

    From what I can tell in the transport mapping, this conversion shouldn't be happening (https://github.com/sfu-db/connector-x/blob/main/connectorx/src/sources/postgres/typesystem.rs#L41).

    The error only happens on the COUNT(*) queries used to estimate the partition size.

    Example:

    [2021-11-02T13:44:35Z DEBUG connectorx::sql] Transformed count query: SELECT count(*) FROM (SELECT cn, account, created_fiscal_yr FROM mydb.mytable WHERE createddate >= '2017-01-01') AS CXTMPTAB_PART WHERE 2019 <= CXTMPTAB_PART.created_fiscal_yr AND CXTMPTAB_PART.created_fiscal_yr < 2022
    

    Maybe there is a code path where the arrow transport is not used? Thanks!

    opened by wseaton 7
  • read_sql from SQL Server to Arrow truncates datetime


    I'm working with MS SQL Server 2019. When reading a datetime field, read_sql correctly captures it in a pandas dataframe. When I convert that dataframe to an arrow table, the datetimes are also retained correctly. However, loading directly into an arrow table truncates the datetime fields to midnight. I'd like to remove the pandas dependency and load directly into an arrow table. Is there a way to do this without truncating the datetimes?

    Example Code:

    import connectorx as cx
    import pyarrow as pa
    con_string = 'mssql://user:[email protected]%5CG:1439/database'
    print('------- Pandas Table -------')
    query = 'SELECT top 5 Datum FROM Termine WHERE Datum>getdate()'
    pandas_table = cx.read_sql(con_string, query, return_type='pandas')
    print(pandas_table)
    print('------- Arrow table from pandas -------')
    arrow_table_from_pandas = pa.Table.from_pandas(pandas_table)
    print(arrow_table_from_pandas)
    print('------- Arrow Table -------')
    arrow_table = cx.read_sql(con_string, query, return_type='arrow')
    print(arrow_table)
    

    Example Output:

    ------- Pandas Table -------
                    Datum
    0 2022-02-06 07:30:00
    1 2022-02-07 00:00:00
    2 2022-02-07 00:00:00
    3 2022-02-07 07:00:00
    4 2022-02-07 07:30:00
    ------- Arrow table from pandas -------
    pyarrow.Table
    Datum: timestamp[ns]
    ----
    Datum: [[2022-02-06 07:30:00.000000000,2022-02-07 00:00:00.000000000,2022-02-07 00:00:00.000000000,2022-02-07 07:00:00.000000000,2022-02-07 07:30:00.000000000]]
    ------- Arrow Table -------
    pyarrow.Table
    Datum: date64[ms]
    ----
    Datum: [[2022-02-06,2022-02-07,2022-02-07,2022-02-07,2022-02-07]]
    
    opened by t-alex-fritz 6
  • # hashtag character causing error on authentication


    What language are you using?

    Python

    What version are you using?

    0.3.1

    What database are you using?

    MySQL

    What dataframe are you using?

    Pandas,

    Can you describe your bug?

    I have a database user with a "#" character in its credentials. I can't connect because of the hashtag; I tried with another user and it works.

    What are the steps to reproduce the behavior?

    Connect to a MySQL database as a user having a "#" in its credentials

    cx.read_sql("mysql://user:aa###aaa@address:3306/db", str(query)).iloc[:, 0].tolist()
    

    What is the error?

    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/home/ubuntu/.cache/pypoetry/virtualenvs/api.eramet-ObXpZerD-py3.9/lib/python3.9/site-packages/connectorx/__init__.py", line 224, in read_sql
        result = _read_sql(
    RuntimeError: parse error: invalid port number
    
    bug 
    opened by Syndorik 1
  • reading mssql datatype decimal(38,20) returns an error


    Hi,

    I'm trying to read data from a MS SQL Server table with a column of data type decimal(38,20).

    I get the following error:

    thread '<unnamed>' panicked at 'Number exceeds maximum value that can be represented.', C:\Users\runneradmin\.cargo\registry\src\github.com-1ecc6299db9ec823\rust_decimal-1.26.1\src\decimal.rs:470:23

    The value in the column that causes the error is this one: 2800003753376.20000000000000000000

    I tried casting the column. It works when the data type is cast to decimal(38,16). Basically, as soon as the scale is higher (>16), it fails.

    Is this a bug?

    opened by sashk8 0
  • can return multi df?


    I use cx as an ETL tool, and it is very quick. But my PC's memory is not large; if the query result is big, my memory can't hold it. cx.read_sql can take partition params (partition_on="l_orderkey", partition_num=10); maybe another param could be added to return the 10 dataframes sequentially? Thanks! (A possible workaround is sketched below.)
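
    One possible workaround today is to issue per-partition queries yourself and read them one at a time (a sketch; the split bounds below are illustrative):

    import connectorx as cx

    conn = "postgresql://username:password@server:port/database"
    bounds = [(1, 1500000), (1500001, 3000000), (3000001, 4500000), (4500001, 6000000)]
    for lo, hi in bounds:
        chunk = cx.read_sql(conn, f"SELECT * FROM lineitem WHERE l_orderkey BETWEEN {lo} AND {hi}")
        # process `chunk` here, then drop the reference before reading the next one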

    opened by wonb168 0
  • Error "Unknown authentication protocol: sha256_password" for queries to ClickHouse

    What language are you using?

    Python

    What version are you using?

    0.3.1

    What database are you using?

    ClickHouse (over MySQL protocol)

    What dataframe are you using?

    Polars

    What are the steps to reproduce the behavior?

    1. Create a user identified with sha256_password in ClickHouse: https://clickhouse.com/docs/en/sql-reference/statements/create/user/
    2. Try to run any query using connectorx
    Example query / code
    import connectorx
    import polars
    
    conn = f'mysql://{user}:{password}@{host}:9004/default'
    
    # both queries fail with the same error
    polars.read_sql('select 1 as c', conn, protocol='text')
    connectorx.read_sql(conn, 'select 1 as c', protocol='text')
    

    What is the error?

    RuntimeError: timed out waiting for connection: DriverError { Unknown authentication protocol: sha256_password }

    bug 
    opened by ivkhokhlachev 0
  • PanicException: called `Result::unwrap()` on an `Err` value: Os { code: 35, kind: WouldBlock, message: "Resource temporarily unavailable" }

    What language are you using?

    Python

    What version are you using?

    connectorx 0.3.1

    What database are you using?

    sqlite3

    What dataframe are you using?

    Pandas

    Can you describe your bug?

    The exception happens while selecting in a loop, after approximately 1500 iterations. If time.sleep(0.01) is added right after the query, the problem goes away.

    What are the steps to reproduce the behavior?

    Example query / code

    result_df = cx.read_sql("sqlite://" + xdb_path, "SELECT * FROM '" + table_name + "'")

    What is the error?

    thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 35, kind: WouldBlock, message: "Resource temporarily unavailable" }', /Users/runner/.cargo/registry/src/github.com-1ecc6299db9ec823/scheduled-thread-pool-0.2.6/src/lib.rs:320:44
    note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
    Exception in thread Thread-1:
    Traceback (most recent call last):
      File "/opt/homebrew/Caskroom/miniforge/base/envs/tf/lib/python3.9/threading.py", line 973, in _bootstrap_inner
        self.run()
      File "/opt/homebrew/Caskroom/miniforge/base/envs/tf/lib/python3.9/threading.py", line 910, in run
        self._target(*self._args, **self._kwargs)
      File "/Users/mac/tensorflow_macos_venv/monitor_cutoff.py", line 663, in get_dataframes
        ResultDFArray[mode_index], FactorArray[mode_index], MinDaysArray[mode_index] = get_combined_data(mode_in_list)
      File "/Users/mac/tensorflow_macos_venv/monitor_cutoff.py", line 382, in get_combined_data
        result_df = cx.read_sql("sqlite://" + xdb_path, "SELECT * FROM '" + table_name + "'")
      File "/opt/homebrew/Caskroom/miniforge/base/envs/tf/lib/python3.9/site-packages/connectorx/__init__.py", line 224, in read_sql
        result = _read_sql(
    pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: Os { code: 35, kind: WouldBlock, message: "Resource temporarily unavailable" }

    bug 
    opened by bitnlp 0
  • Issue Connecting to Oracle


    Hello,

    I am having an issue with the connection string required by connectorx.read_sql(). I am using the template suggested by the documentation: 'oracle://username:password@server:port/database'. However, I get the error: ORA-12154: TNS:could not resolve the connect identifier specified

    I can connect to this same Oracle DB through SQLAlchemy by using the following connection string: oracle+cx_oracle://{user}:{password}@(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST={host})(PORT={port}))(CONNECT_DATA=(SERVICE_NAME={service_name})))

    The portion of the string after the '@' is generated by the cx_Oracle.make_dsn function. My username & password contain only alphanumeric characters.

    Any help would be greatly appreciated. This package seems to have a ton of potential and I'm excited to try it out.

    opened by ciollid2 8
Releases (v0.3.1)
  • v0.3.1(Oct 31, 2022)

    What's Changed

    • Add CLOB and BLOB to Oracle typesystem by @wKollendorf in #287
    • Correcting Spelling in Docs by @venkashank in https://github.com/sfu-db/connector-x/pull/297
    • Update doc by @lBilali in #324 #327 #326 #325 #323
    • Adding bigquery job location in parameters while getting job details by @phanindra-ramesh in https://github.com/sfu-db/connector-x/pull/346
    • Refactor for federated query in https://github.com/sfu-db/connector-x/pull/351
    • Rust version bump to nightly-2022-09-15 https://github.com/sfu-db/connector-x/pull/357
    • Add array support in arrow2 #339
    • Upgrade gcp-bigquery-client to 0.13.0 by @hieudepchai in https://github.com/sfu-db/connector-x/pull/355
    • Bump arrow-rs to 22 by @houqp in https://github.com/sfu-db/connector-x/pull/364
    • Add ltree string type by @auyer in https://github.com/sfu-db/connector-x/pull/382
    • Fix docs typo by @alexander-beedie in https://github.com/sfu-db/connector-x/pull/383
    • Add citext in postgres #372
    • Add long in oracle #375
    • Support MySQL in federated query #296
    • Add appname in mssql connection string #336
    • Fix bug for NULL values in round func on oracle #333

    New Contributors

    @wKollendorf, @houqp, @venkashank, @lBilali, @phanindra-ramesh, @hieudepchai, @auyer made their first contribution

    Source code(tar.gz)
    Source code(zip)
  • v0.3.0(May 11, 2022)

    ConnectorX

    Features

    • Add experimental support on federated query (support PostgreSQL only) #280
    • BigQuery support experimental -> stable #152
    • Upgrade rust version to nightly-2022-04-17

    Bug Fix

    • Arrow2 add blob #261
    • Mssql datetime fix #263
    • Add Mssql encryption option #265
    • Fix Mssql count query construction for queries with OFFSET
    • Add binaryfloat and binarydouble to Oracle #273
    Source code(tar.gz)
    Source code(zip)
  • v0.2.5(Mar 29, 2022)

    ConnectorX

    Features

    • Build wheel for m1 machine through cross compile #238
    • Support client authentication on postgres #222
    • Expose partition_sql and get_meta for Python API
    • Parse the connection URL's scheme as the substring before '+', to be compatible with possible SQLAlchemy-style user input

    Bug Fix

    • Sqlite windows path error #245
    • Oracle convert Number(0,0) to float for round(x, y) #227
    Source code(tar.gz)
    Source code(zip)
  • v0.2.4(Mar 3, 2022)

    ConnectorX 0.2.4

    Bug Fix

    • Fixing deallocating None bug #201
    • Clickhouse support on new version #165
    • Shrink array before return for arrow2 to reduce memory usage when result is small #196
    • Fix Oracle port error #209

    Others

    • Improve Oracle performance by enlarging array size using prepared statement #127
    • Upgrade arrow2 to 0.9 #221
    • Add Postgres Json, TimestampTz support to arrow2 #235 #229
    • Clean up python dependencies (remove ones for benchmarking)
    Source code(tar.gz)
    Source code(zip)
  • v0.2.3(Dec 21, 2021)

    ConnectorX 0.2.3

    Features

    • Add Google BigQuery as source https://github.com/sfu-db/connector-x/issues/152 - need benchmark and more tests
    • Add windows trust_connection for mssql https://github.com/sfu-db/connector-x/discussions/145
    • Support Postgres HStore (only for cursor protocol) https://github.com/sfu-db/connector-x/issues/174

    Others

    • Investigate SQLite in-memory database https://github.com/sfu-db/connector-x/issues/172
    • Make redshift connection URIs automatically use the cursor protocol https://github.com/sfu-db/connector-x/pull/187
    • Test on mariadb https://github.com/sfu-db/connector-x/issues/193
    • Test on azure sql database
    Source code(tar.gz)
    Source code(zip)
  • v0.2.2(Nov 23, 2021)

    ConnectorX 0.2.2

    Features

    • Stream write to destination https://github.com/sfu-db/connector-x/pull/147
    • Support CTE https://github.com/sfu-db/connector-x/pull/161
    • Add money type support in mssql https://github.com/sfu-db/connector-x/issues/141
    • Add support of converting bytea in postgres arrow https://github.com/sfu-db/connector-x/issues/148

    Bug Fixes

    • Test each feature gate https://github.com/sfu-db/connector-x/issues/139
    • Support Python 3.10 https://github.com/sfu-db/connector-x/issues/157
    • Adding instance name to connect mssql https://github.com/sfu-db/connector-x/issues/140
    • Support multiple integer types for the count query in postgres https://github.com/sfu-db/connector-x/pull/153
    • Support unsigned for mysql: https://github.com/sfu-db/connector-x/issues/163
    • Fix issue when limit > count: https://github.com/sfu-db/connector-x/issues/166

    Other Changes

    • Add turbodbc in benchmark https://github.com/sfu-db/connector-x/issues/124
    • Update polars dependency https://github.com/sfu-db/connector-x/pull/154
    Source code(tar.gz)
    Source code(zip)
Owner
SFU Database Group