TensorBase is a new big data warehouse built with modern engineering efforts.

Overview

chat on Discord

What is TensorBase

TensorBase is a new big data warehouse built with modern engineering efforts.

TensorBase is built on top of Rust, Apache Arrow, and Arrow DataFusion.

TensorBase aims to change the status quo of big data systems, which today are:

  • inefficient (in the name of 'scalable')
  • hard to use (for end users) and hard to understand (for developers)
  • not evolving with modern infrastructure (OS, hardware, engineering...)

Features

  • Works out of the box ( get started just now )
  • Lightning-fast architectural performance in Rust ( real-world benchmarks )
  • Modern, redesigned columnar storage
  • High-performance network transport server
  • ClickHouse-compatible syntax
  • Green installation with DBA-free ops
  • Reliability and high availability (WIP)
  • Cluster (WIP)
  • Cloud-native adaptation (WIP)
  • Arrow DataLake (...)

Architecture (at 10,000 meters altitude)

arch_base

Quick Start

play_out_of_the_box

Benchmarks

TensorBase is lightning fast. It has shown better performance than ClickHouse on a simple aggregation query over the 1.47-billion-row NYC Taxi dataset.

TensorBase supports the full TPC-H benchmark workflow, from data ingestion to query.

More details on all benchmarks can be found in benchmarks.

Roadmap

Community Newsletters

Working Groups

Working Group - Engineering

This is a working group for engineering-related topics, such as code and features.

Working Group - Database

This is a higher-level working group for database-related topics, such as ideas from papers.

Join these working groups in the Discussions or on the Discord server.

Communications

The WeChat group and other channels are listed on the community page.

Contributing

We have a contributing guide in Contributing.

Documents (WIP)

More documents will be prepared soon.

Read the Documents.

License

TensorBase is distributed under the terms of the Apache License (Version 2.0), a commercially friendly open-source license.

It would be greatly appreciated if:

  • you give this project a star, if you find TensorBase helpful.
  • you add yourself to Who is Using TensorBase, if you use TensorBase in any project, product, or service.
  • you contribute your changes back to TensorBase, if you want them to help more people.

Your encouragement and help can make more people realize the value of the project, and motivate TensorBase's developers and contributors to move forward.

See LICENSE for details.

Comments
  • move to_date to CH module && support int64, datetime type cast

    1. move to_date(LargeString) out of the mix-up in DataFusion #159
    2. support to_date(DateTime) and to_date(Int64)
    3. support passing const function args to to_date, e.g. select to_date(1262304000)
    4. add unit tests & integration tests
    opened by pandaplusplus 15
  • Support timezone in format `'+XX:XX'`

    As discussed in #190, it is straightforward to implement '+XX:XX'-formatted timezones within TimeZoneId, using the lower byte of the i16 to represent the offset in quarter-hours.
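A minimal sketch of how that encoding could look, assuming the offset is stored in quarter-hours; the function names are illustrative, not TensorBase's actual API:

```rust
/// Pack a timezone offset (given in minutes, e.g. +05:30 -> 330) into the
/// low byte of a TimeZoneId-style i16, in quarter-hour units. The high byte
/// is left free for named-timezone ids.
fn encode_offset(high: u8, offset_minutes: i32) -> i16 {
    let quarters = (offset_minutes / 15) as i8; // +05:30 -> 22 quarters
    ((high as i16) << 8) | (quarters as u8 as i16)
}

/// Recover the offset in minutes from the low byte (sign-extended).
fn decode_offset(id: i16) -> i32 {
    ((id & 0xFF) as u8 as i8) as i32 * 15
}
```

An i8 covers ±127 quarters (about ±31 hours), comfortably more than the real-world offset range of ±14:00.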

    opened by frank-king 14
  • REFACTOR(datafusion): extract clickhouse functions in datafusion to a separate enum

    Signed-off-by: Frank King [email protected]

    As discussed in #130, this PR is intended to separate the CH functions from the DF functions for extensibility, and to enable the CH functions to be case-sensitive.

    opened by frank-king 14
  • INSERT fails with clickhouse-client version >= 21.5 without the --compression arg

    I tried to load the cell_towers CSV data from ClickHouse: https://clickhouse.tech/docs/en/getting-started/example-datasets/cell-towers/

    I followed step 5:

    clickhouse-client --query "INSERT INTO cell_towers FORMAT CSVWithNames" < cell_towers.csv

    Here is my operation:

    [[email protected] Downloads]$ head cell_towers.csv 
    radio,mcc,net,area,cell,unit,lon,lat,range,samples,changeable,created,updated,averageSignal
    UMTS,262,2,801,86355,0,13.285512,52.522202,1000,7,1,1282569574,1300155341,0
    GSM,262,2,801,1795,0,13.276907,52.525714,5716,9,1,1282569574,1300155341,0
    GSM,262,2,801,1794,0,13.285064,52.524,6280,13,1,1282569574,1300796207,0
    UMTS,262,2,801,211250,0,13.285446,52.521744,1000,3,1,1282569574,1299466955,0
    UMTS,262,2,801,86353,0,13.293457,52.521515,1000,2,1,1282569574,1291380444,0
    UMTS,262,2,801,86357,0,13.289106,52.53273,2400,3,1,1282569574,1298860769,0
    UMTS,262,3,1107,83603,0,13.349675,52.497575,3102,222,1,1282672189,1300710809,0
    GSM,262,2,776,867,0,13.349711,52.497367,1000,214,1,1282672189,1301575206,0
    GSM,262,3,1107,13971,0,13.349743,52.497437,1000,212,1,1282672189,1300710809,0
    [[email protected] Downloads]$ du -h cell_towers.csv
    3.5G    cell_towers.csv
    [[email protected] Downloads]$ clickhouse-client --port 9528 --query "INSERT INTO cell_towers.cell_towers FORMAT CSVWithNames" < cell_towers.csv
    Received exception from server (version 2021.5.0):
    Code: 4. DB::Exception: Received from localhost:9528. WrappingLangError(ASTError(" --> 1:51\n  |\n1 | INSERT INTO cell_towers.cell_towers FORMAT CSVWithNames\n  |                                                   ^---\n  |\n  = expected table_name_numbers")). Error when AST processing:  --> 1:51
      |
    1 | INSERT INTO cell_towers.cell_towers FORMAT CSVWithNames
      |                                                   ^---
      |
      = expected table_name_numbers.
    
    type/bug component/server 
    opened by pymongo 12
  • basic support of default timezone

    Signed-off-by: Frank King [email protected]

    As discussed in #27, this PR adds a global default timezone with an offset.

    What this PR does:

    • when timestamps are parsed from strings, the offset is subtracted,
    • when timestamps are interpreted as days/hours via the toDayOfMonth/toHour functions etc., the offset is added.

    What this PR does NOT do yet:

    • when timestamps are displayed as strings, the offset is NOT added.
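The two rules can be sketched with plain epoch arithmetic (an illustrative model only, not TensorBase's actual code; names are hypothetical):

```rust
const SECS_PER_HOUR: i64 = 3600;
const SECS_PER_DAY: i64 = 86400;

/// Parsing: a wall-clock value in the default timezone becomes UTC by
/// subtracting the offset.
fn parse_with_offset(local_secs: i64, offset_secs: i64) -> i64 {
    local_secs - offset_secs
}

/// Interpretation (toHour-style): add the offset back, then extract the hour.
fn to_hour(utc_secs: i64, offset_secs: i64) -> i64 {
    (utc_secs + offset_secs).rem_euclid(SECS_PER_DAY) / SECS_PER_HOUR
}
```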
    opened by frank-king 10
  • [Summer 2021] Draft design for MySQL Protocol Server

    This is a draft design for the MySQL protocol server of my Summer 2021 project (#149). It still fails some tests and lacks some features. But it is ready for discussion.

    The major challenge for our MySQL protocol server is that TensorBase's runtime was built with ClickHouse as its only frontend. So 1) the runtime is tightly coupled with ClickHouse's protocol server implementation, e.g., the code in the runtime crate uses Block and Column from the ch module to represent data. 2) Some data encodings are in ClickHouse's format (e.g., BaseChunk), which differs from MySQL's encoding scheme.

    For these problems, my approach is to decouple TensorBase's runtime as much as possible from ClickHouse's frontend. 1) I use Arrow's Array to represent columnar data instead of the original BaseChunk. This is because BaseChunk only exposes its data as Vec<u8>, which makes element-wise data access hard and bug-prone. As we convert data between different protocols, element-wise access becomes important. Moreover, BaseChunk seems to use Arrow's data format, so BaseChunk and Arrow's Array are essentially the same, but the latter has richer features. 2) In the same way, I use Arrow's RecordBatch to represent block data in place of the original Block.

    But I'm confused about our early design decisions: it seems that BaseChunk (Column) and Block are our simplified versions of Array and RecordBatch. We added a lot of code for byte access based on BaseChunk, and I believe a lot of debugging effort has been put into it. Why did we choose them instead of Arrow's implementations? Was it for performance reasons?

    Based on the above changes, I wrote a shim layer with the current server_mysql crate. I also added an async MySQL server and client implementation. Now it is able to execute the following simple queries:

    CREATE DATABASE IF NOT EXISTS test
    USE test
    DROP TABLE IF EXISTS test_tab
    CREATE TABLE test_tab(foo UInt64)
    INSERT INTO test_tab VALUES (1), (2), (3)
    select foo from test_tab
    select sum(foo) from test_tab
    

    A few things need to be done before this draft design is ready to be merged:

    • [ ] Fix the bugs that cause integ_checks to fail.
    • [ ] Decide whether we still need Block and BaseChunk. For now I have kept them to minimize code changes. If we find that the byte-access complexity outweighs their benefits, I can remove them completely.
    • [ ] Add support for the DateTime, Date, Decimal and String data types.
    • [ ] Add support for the InsertFormatInline and InsertFormatCSV methods.
    • [ ] Add integration tests for the MySQL server.
    opened by fandahao17 9
  • Issue encountered attempting to create a table prior to importing a CSV (additional documentation requested)

    Hello,

    I am very interested in your project and I am attempting to begin testing it out. However, the documentation for the tools that exist in m0 (baseops, baseshell) does not seem to be accurate. Subsequently, attempting to use the clickhouse client to create a very simple table using DDL fails. I am not sure what to use for ENGINE, although it appears to be required, and using MergeTree fails. I tried with and without ORDER BY. Any assistance you can provide would be greatly appreciated.

    -Chris Whelan

    TensorBase :) create table sales (title string) ENGINE = MergeTree ORDER BY title;
    
    CREATE TABLE sales
    (
        `title` string
    )
    ENGINE = MergeTree
    ORDER BY title
    
    Query id: 22fd667c-851d-4087-9fb7-5a58128003de
    
    
    0 rows in set. Elapsed: 0.001 sec.
    
    Received exception from server (version 2021.3.0):
    Code: 3. DB::Exception: Received from localhost:9528. WrappingLangError(ASTError). Error when AST processing.
    
    type/discuss component/docs 
    opened by chrisfw 7
  • Cannot join the Slack channel from the links in the README or on the official website; maybe consider opening a new communication channel like Gitter?

    When clicking the Slack Channel link in the README or on the official website, you are redirected to TensorBase's official Slack workspace, https://tensorbase.slack.com/, but without an invitation, so I think newcomers cannot log in.

    I can log in to other Slack workspaces like Kubernetes's, so I guess it's just because TensorBase's Slack link is not an invitation link. See k8s's: there is a button saying "GET MY INVITE" for people who are not yet in the group.

    type/bug community/base-thanks-for-your-help status/wait-for-fix-confirmation-from-reporter 
    opened by BIAOXYZ 7
  • REFACTOR(runtime): use RecordBatch to represent query results

    The current runtime code uses BaseDataBlock to represent query results, which is a simple wrapper around a Vec<u8> buffer. This straightforward representation does not allow easy element-wise data indexing, which is required for the type conversion done by the MySQL server.

    This commit instead uses Arrow::RecordBatch to store query results, which provides good element-wise access API on top of the Arrow data format used by TB.

    Note that our original BaseDataBlock is still useful as a "fast path": where we do not need element-wise data access (e.g., INSERT queries), we can still use BaseDataBlock to store the data buffer directly.
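To illustrate the byte-access burden, here is a hypothetical helper (not TB's real code) showing what element-wise reads on a raw Vec<u8> buffer require:

```rust
/// Read the i-th little-endian i64 out of a raw byte buffer: the kind of
/// manual slicing that a Vec<u8>-backed block forces on every element access.
fn get_i64_from_bytes(buf: &[u8], i: usize) -> i64 {
    let start = i * 8;
    let mut b = [0u8; 8];
    b.copy_from_slice(&buf[start..start + 8]);
    i64::from_le_bytes(b)
}
```

With a typed Arrow array, the equivalent access is a single checked method call, which is the ergonomics this refactor is after.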

    opened by fandahao17 6
  • support bool-valued func in where clause

    As in #172, the SQL parser supports bool-valued functions in the WHERE clause, but TB's pest parser does not. Three possible solutions (bql.pest):

    1. modify comp_expr_cmp ( comp_expr_cmp = { comp_expr_cmp_operand ~ ( comp_op ~ comp_expr_cmp_operand )? } )
    2. add expr after comp_expr_cmp in the comp_expr statement
    3. add func_call_expr after comp_expr_cmp in the comp_expr statement

    I think the third one is better; it has a smaller impact (just function calls) than the others.
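A hedged sketch of what the third option might look like in bql.pest; only the rule names quoted above come from this issue, and the exact shape of the real comp_expr rule is an assumption:

```pest
// Try a bare bool-valued function call as an alternative right after
// comp_expr_cmp, so `WHERE some_bool_func(x)` parses.
comp_expr = { comp_expr_cmp | func_call_expr }
```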

    community/base-thanks-for-your-help 
    opened by ygf11 6
  • Support timezone in metadata of `DateTime`

    After #166, TB supports a global default timezone for the DateTime type, but does not yet support a DateTime type with a tz in its metadata.

    This PR is to support Datetime type with timezone in metadata.

    First, I added a type struct TimeZoneId(u16), a wrapper over a variant of the enum chrono_tz::Tz, to store the timezone in BqlType::DateTime for space efficiency.

    When the user creates a table with a column whose type is, for example, DateTime('Etc/GMT-8'), it is parsed as a timezone id, stored in BqlType, and passed to A-DF as the type Timestamp32(Some(BaseTimeZone(<tz_id>, <offset>))), where the tz_id corresponds to 'Etc/GMT-8'. The offset is calculated ahead of time and stored here for fast lookup.

    The CH functions in A-DF that accept a DateTime with or without a tz use Timestamp32(None) as their signature. During type coercion, Timestamp32 with or without a timezone is "cast" to the same Timestamp32(None) type without any tz offset being applied (i.e., timezones are ignored here).

    Then, in the datetime (CH) functions in A-DF, temporal arguments are treated as timestamps with the timezone stored in the input schema.

    Finally the timezones can be resolved correctly.


    Currently, timezones of data values are ignored; only the tz in the schema is resolved. However, values can have their own timezone, such as:

    select toTimeZone(toDateTime('2021-01-01 00:00:00', 'Etc/GMT-8'), 'UTC')
    

    We can solve this problem in the next PR.

    opened by frank-king 6
  • support primary key

    The primary key plan supports data deduplication.

    First of all, a primary key on a single column is supported. When data is inserted, we check by primary key whether the row already appears in the table. If a row with the same primary key already exists, the insertion is skipped.

    The initial plan is to achieve deduplication by maintaining a deduplication container for each table. When the database is restarted, the primary key column is read from disk and the in-memory container is rebuilt.

    After investigation, Roaring Bitmap is a compressed bitmap index with excellent performance and low memory usage.

    We can use RoaringBitmap and RoaringTreemap in roaring-rs to store ordinary integer primary keys. For string types that cannot be supported by roaring bitmap, we can use HashSet storage.

    Also, where should each table's deduplication container live? Could it be placed in the MetaStore?

    • [x] sql parse
    • [ ] deduplication by primary key when data inserting
    • [ ] recovery
    • [ ] performance test
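A minimal sketch of the per-table deduplication container for the string case, using a std HashSet (names hypothetical; RoaringBitmap/RoaringTreemap from roaring-rs would fill the same role for integer keys):

```rust
use std::collections::HashSet;

/// Per-table deduplication container for string primary keys.
struct DedupContainer {
    seen: HashSet<String>,
}

impl DedupContainer {
    fn new() -> Self {
        Self { seen: HashSet::new() }
    }

    /// Returns true if the key was unseen (row should be inserted),
    /// false if it is a duplicate (insertion should be skipped).
    fn try_insert(&mut self, key: &str) -> bool {
        self.seen.insert(key.to_string())
    }
}
```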
    opened by nautaa 0
  • `select 'abc'` causes the server to get stuck

    In the current version, queries like select 'abc' return no result:

    TensorBase :) select 'abc'
    
    SELECT 'abc'
    
    Ok.
    
    0 rows in set. Elapsed: 0.008 sec. 
    

    When I commented out these three lines in engine::datafusions: https://github.com/tensorbase/tensorbase/blob/14e4802b9c5e9e6e9543f80380213da1fb1c56cf/crates/engine/src/datafusions.rs#L177-L180 the server gets stuck on this query:

    TensorBase :) select 'abc'
    
    SELECT 'abc'
    
    ^C
    

    However, integer literals work:

    TensorBase :) select 1
    
    SELECT 1
    
    ┌─Int64(1)─┐
    │        1 │
    └──────────┘
    
    1 rows in set. Elapsed: 0.005 sec. 
    
    type/bug 
    opened by frank-king 3
  • How is it different from Datafuse?

    In terms of general product strategy and direction? These projects look very similar, and I'm wondering how they differ.

    PS. Best wishes from ClickHouse team :)

    opened by alexey-milovidov 8
Releases (v2021.07.05)