TensorBase is a new big data warehouse built with modern engineering efforts.

Overview

chat on Discord

What is TensorBase

TensorBase is a new big data warehouse built with modern engineering efforts.

TensorBase is built on top of Rust, Apache Arrow, and Arrow DataFusion.

TensorBase aims to change the status quo of big data systems, which today are:

  • inefficient (in the name of 'scalable')
  • hard to use (for end users) and hard to understand (for developers)
  • not evolving with modern infrastructure (OS, hardware, engineering...)

Features

  • Works out of the box ( get started just now )
  • Lightning-fast architectural performance in Rust ( real-world benchmarks )
  • Modern, redesigned columnar storage
  • High-performance network transport server
  • ClickHouse-compatible syntax
  • Green installation with DBA-free ops
  • Reliability and high availability (WIP)
  • Cluster (WIP)
  • Cloud-native adaptation (WIP)
  • Arrow DataLake (...)

Architecture (at 10,000 meters altitude)

arch_base

Quick Start

play_out_of_the_box

Benchmarks

TensorBase is lightning fast. It has shown better performance than ClickHouse on a simple aggregation query over the 1.47-billion-row NYC Taxi dataset.

TensorBase supports the full TPC-H benchmark workflow, from data ingestion to query.

More details on all benchmarks can be found in benchmarks.

Roadmap

Community Newsletters

Working Groups

Working Group - Engineering

This is a working group for engineering-related topics, such as code and features.

Working Group - Database

This is a higher-level working group for database-related topics, such as ideas from papers.

Join these working groups in the Discussions or on the Discord server.

Communications

The WeChat group and other channels are listed on the community page.

Contributing

We have a contributing guide in Contributing.

Documents (WIP)

More documents will be prepared soon.

Read the Documents.

License

TensorBase is distributed under the terms of the Apache License (Version 2.0), a commercially friendly open-source license.

It would be greatly appreciated if:

  • you give this project a star, if you find TensorBase helpful.
  • you add yourself to Who is Using TensorBase, if you use TensorBase in any project, product, or service.
  • you contribute your changes back to TensorBase, if you want them to help more people.

Your encouragement and help can make more people realize the value of the project, and motivate TensorBase's developers and contributors to move forward.

See LICENSE for details.

Comments
  • move to_date to CH module && support int64, datetime type cast

    1. move to_date(LargeString) out of the mix-up in DataFusion #159
    2. support to_date(DateTime) and to_date(Int64)
    3. support passing const function args to to_date, e.g. select to_date(1262304000)
    4. add unit tests & integration tests
    opened by pandaplusplus 15
  • Support timezone in format `'+XX:XX'`

    As discussed in #190, it is straightforward to implement '+XX:XX'-formatted timezones within TimeZoneId, using the lower byte of the i16 to represent the offset in quarter-hours.
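A minimal sketch of how that encoding could look, assuming the offset is stored in quarter-hours; the function names are illustrative, not TensorBase's actual API:

```rust
/// Pack a timezone offset (given in minutes, e.g. +05:30 -> 330) into the
/// low byte of a TimeZoneId-style i16, in quarter-hour units. The high byte
/// is left free for named-timezone ids.
fn encode_offset(high: u8, offset_minutes: i32) -> i16 {
    let quarters = (offset_minutes / 15) as i8; // +05:30 -> 22 quarters
    ((high as i16) << 8) | (quarters as u8 as i16)
}

/// Recover the offset in minutes from the low byte (sign-extended).
fn decode_offset(id: i16) -> i32 {
    ((id & 0xFF) as u8 as i8) as i32 * 15
}
```

An i8 covers ±127 quarters (about ±31 hours), comfortably more than the real-world offset range of ±14:00.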

    opened by frank-king 14
  • REFACTOR(datafusion): extract clickhouse functions in datafusion to a separate enum

    Signed-off-by: Frank King [email protected]

    As discussed in #130, this PR is intended to separate the CH functions from the DF functions for extensibility, and to enable the CH functions to be case-sensitive.

    opened by frank-king 14
  • INSERT fails with clickhouse-client version >= 21.5 without the --compression arg

    I tried to load the cell_towers CSV data from ClickHouse: https://clickhouse.tech/docs/en/getting-started/example-datasets/cell-towers/

    I followed step 5:

    clickhouse-client --query "INSERT INTO cell_towers FORMAT CSVWithNames" < cell_towers.csv

    Here is my operation:

    [[email protected] Downloads]$ head cell_towers.csv 
    radio,mcc,net,area,cell,unit,lon,lat,range,samples,changeable,created,updated,averageSignal
    UMTS,262,2,801,86355,0,13.285512,52.522202,1000,7,1,1282569574,1300155341,0
    GSM,262,2,801,1795,0,13.276907,52.525714,5716,9,1,1282569574,1300155341,0
    GSM,262,2,801,1794,0,13.285064,52.524,6280,13,1,1282569574,1300796207,0
    UMTS,262,2,801,211250,0,13.285446,52.521744,1000,3,1,1282569574,1299466955,0
    UMTS,262,2,801,86353,0,13.293457,52.521515,1000,2,1,1282569574,1291380444,0
    UMTS,262,2,801,86357,0,13.289106,52.53273,2400,3,1,1282569574,1298860769,0
    UMTS,262,3,1107,83603,0,13.349675,52.497575,3102,222,1,1282672189,1300710809,0
    GSM,262,2,776,867,0,13.349711,52.497367,1000,214,1,1282672189,1301575206,0
    GSM,262,3,1107,13971,0,13.349743,52.497437,1000,212,1,1282672189,1300710809,0
    [[email protected] Downloads]$ du -h cell_towers.csv
    3.5G    cell_towers.csv
    [[email protected] Downloads]$ clickhouse-client --port 9528 --query "INSERT INTO cell_towers.cell_towers FORMAT CSVWithNames" < cell_towers.csv
    Received exception from server (version 2021.5.0):
    Code: 4. DB::Exception: Received from localhost:9528. WrappingLangError(ASTError(" --> 1:51\n  |\n1 | INSERT INTO cell_towers.cell_towers FORMAT CSVWithNames\n  |                                                   ^---\n  |\n  = expected table_name_numbers")). Error when AST processing:  --> 1:51
      |
    1 | INSERT INTO cell_towers.cell_towers FORMAT CSVWithNames
      |                                                   ^---
      |
      = expected table_name_numbers.
    
    type/bug component/server 
    opened by pymongo 12
  • basic support of default timezone

    Signed-off-by: Frank King [email protected]

    As discussed in #27, this PR adds a global default timezone with an offset.

    What this PR does:

    • when timestamps are parsed from strings, the offset is subtracted,
    • when timestamps are interpreted as days/hours via the toDayOfMonth/toHour functions etc., the offset is added.

    What this PR does NOT do yet:

    • when timestamps are displayed as strings, the offset is NOT added.
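The two rules can be sketched with plain epoch arithmetic (an illustrative model only, not TensorBase's actual code; names are hypothetical):

```rust
const SECS_PER_HOUR: i64 = 3600;
const SECS_PER_DAY: i64 = 86400;

/// Parsing: a wall-clock value in the default timezone becomes UTC by
/// subtracting the offset.
fn parse_with_offset(local_secs: i64, offset_secs: i64) -> i64 {
    local_secs - offset_secs
}

/// Interpretation (toHour-style): add the offset back, then extract the hour.
fn to_hour(utc_secs: i64, offset_secs: i64) -> i64 {
    (utc_secs + offset_secs).rem_euclid(SECS_PER_DAY) / SECS_PER_HOUR
}
```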
    opened by frank-king 10
  • [Summer 2021] Draft design for MySQL Protocol Server

    This is a draft design for the MySQL protocol server of my Summer 2021 project (#149). It still fails some tests and lacks some features. But it is ready for discussion.

    The major challenge for our MySQL protocol server is that TensorBase's runtime was built with ClickHouse as its only frontend. So 1) the runtime is tightly coupled with ClickHouse's protocol server implementation, e.g., the code in the runtime crate uses Block and Column from the ch module to represent data. 2) Some data encodings are in ClickHouse's format (e.g., BaseChunk), which differs from MySQL's encoding scheme.

    For these problems, my approach is to decouple TensorBase's runtime as much as possible from ClickHouse's frontend. 1) I use Arrow's Array to represent columnar data instead of the original BaseChunk. This is because BaseChunk only exposes its data as Vec<u8>, which makes element-wise data access hard and bug-prone. As we convert data between different protocols, element-wise access becomes important. Moreover, BaseChunk seems to use Arrow's data format, so BaseChunk and Arrow's Array are essentially the same, but the latter has richer features. 2) In the same way, I use Arrow's RecordBatch to represent block data in place of the original Block.

    But I'm confused about our early design decisions: it seems that BaseChunk (Column) and Block are our simplified versions of Array and RecordBatch. We added a lot of code for byte access based on BaseChunk, and I believe a lot of debugging effort has been put into it. Why did we choose them instead of Arrow's implementations? Was it for performance reasons?

    Based on the above changes, I wrote a shim layer with the current server_mysql crate. I also added an async MySQL server and client implementation. Now it is able to execute the following simple queries:

    CREATE DATABASE IF NOT EXISTS test
    USE test
    DROP TABLE IF EXISTS test_tab
    CREATE TABLE test_tab(foo UInt64)
    INSERT INTO test_tab VALUES (1), (2), (3)
    select foo from test_tab
    select sum(foo) from test_tab
    

    A few things need to be done before this draft design is ready to be merged:

    • [ ] Fix the bugs that cause integ_checks to fail.
    • [ ] Decide whether we still need Block and BaseChunk. For now I have kept them to minimize code changes. If we find that the byte-access complexity outweighs their benefits, I can remove them completely.
    • [ ] Add support for the DateTime, Date, Decimal and String data types.
    • [ ] Add support for the InsertFormatInline and InsertFormatCSV methods.
    • [ ] Add integration tests for the MySQL server.
    opened by fandahao17 9
  • Issue encountered attempting to create a table prior to importing a CSV (additional documentation requested)

    Hello,

    I am very interested in your project and I am attempting to begin testing it out. However, the documentation for the tools that exist in m0 (baseops, baseshell) does not seem to be accurate. Subsequently, attempting to use the clickhouse client to create a very simple table using DDL fails. I am not sure what to use for ENGINE, although it appears to be required, and using MergeTree fails. I tried with and without ORDER BY. Any assistance you can provide would be greatly appreciated.

    -Chris Whelan

    TensorBase :) create table sales (title string) ENGINE = MergeTree ORDER BY title;
    
    CREATE TABLE sales
    (
        `title` string
    )
    ENGINE = MergeTree
    ORDER BY title
    
    Query id: 22fd667c-851d-4087-9fb7-5a58128003de
    
    
    0 rows in set. Elapsed: 0.001 sec.
    
    Received exception from server (version 2021.3.0):
    Code: 3. DB::Exception: Received from localhost:9528. WrappingLangError(ASTError). Error when AST processing.
    
    type/discuss component/docs 
    opened by chrisfw 7
  • Cannot join the Slack channel from the links in the README or on the official website; maybe consider opening a new communication channel like Gitter?

    When clicking the Slack Channel link in the README or on the official website, you are redirected to TensorBase's official Slack workspace, https://tensorbase.slack.com/, but without an invitation, so I think newcomers cannot log in.

    I can log in to other Slack workspaces like Kubernetes's, so I guess it's just because TensorBase's Slack link is not an invitation link. See k8s's: there is a button saying "GET MY INVITE" for people who are not yet in the group.

    type/bug community/base-thanks-for-your-help status/wait-for-fix-confirmation-from-reporter 
    opened by BIAOXYZ 7
  • REFACTOR(runtime): use RecordBatch to represent query results

    The current runtime code uses BaseDataBlock to represent query results, which is a simple wrapper around a Vec<u8> buffer. This straightforward representation does not allow easy element-wise data indexing, which is required for the type conversion done by the MySQL server.

    This commit instead uses Arrow::RecordBatch to store query results, which provides good element-wise access API on top of the Arrow data format used by TB.

    Note that our original BaseDataBlock is still useful as a "fast path": where we do not need element-wise data access (e.g., INSERT queries), we can still use BaseDataBlock to store the data buffer directly.
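To illustrate the byte-access burden, here is a hypothetical helper (not TB's real code) showing what element-wise reads on a raw Vec<u8> buffer require:

```rust
/// Read the i-th little-endian i64 out of a raw byte buffer: the kind of
/// manual slicing that a Vec<u8>-backed block forces on every element access.
fn get_i64_from_bytes(buf: &[u8], i: usize) -> i64 {
    let start = i * 8;
    let mut b = [0u8; 8];
    b.copy_from_slice(&buf[start..start + 8]);
    i64::from_le_bytes(b)
}
```

With a typed Arrow array, the equivalent access is a single checked method call, which is the ergonomics this refactor is after.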

    opened by fandahao17 6
  • support bool-valued func in where clause

    As in #172, the SQL parser supports bool-valued functions in the WHERE clause, but TB's pest parser does not. Three possible solutions (bql.pest):

    1. modify comp_expr_cmp ( comp_expr_cmp = { comp_expr_cmp_operand ~ ( comp_op ~ comp_expr_cmp_operand )? } )
    2. add expr after comp_expr_cmp in the comp_expr statement
    3. add func_call_expr after comp_expr_cmp in the comp_expr statement

    I think the third one is better; it has a smaller impact (just function calls) than the others.
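A hedged sketch of what the third option might look like in bql.pest; only the rule names quoted above come from this issue, and the exact shape of the real comp_expr rule is an assumption:

```pest
// Try a bare bool-valued function call as an alternative right after
// comp_expr_cmp, so `WHERE some_bool_func(x)` parses.
comp_expr = { comp_expr_cmp | func_call_expr }
```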

    community/base-thanks-for-your-help 
    opened by ygf11 6
  • Support timezone in metadata of `DateTime`

    After #166, TB supports a global default timezone for the DateTime type, but does not yet support a DateTime type with a tz in its metadata.

    This PR is to support Datetime type with timezone in metadata.

    First, I added a type struct TimeZoneId(u16), a wrapper over a variant of the enum chrono_tz::Tz, to store the timezone in BqlType::DateTime for space efficiency.

    When the user creates a table with a column whose type is, for example, DateTime('Etc/GMT-8'), it is parsed as a timezone id, stored in BqlType, and passed to A-DF as the type Timestamp32(Some(BaseTimeZone(<tz_id>, <offset>))), where the tz_id corresponds to 'Etc/GMT-8'. The offset is calculated ahead of time and stored here for fast lookup.

    The CH functions in A-DF that accept a DateTime with or without a tz use Timestamp32(None) as their signature. During type coercion, Timestamp32 with or without a timezone is "cast" to the same Timestamp32(None) type without any tz offset being applied (i.e., timezones are ignored here).

    Then, in the datetime (CH) functions in A-DF, temporal arguments are treated as timestamps with the timezone stored in the input schema.

    Finally the timezones can be resolved correctly.


    Currently, timezones of data values are ignored; only the tz in the schema is resolved. However, values can have their own timezone, such as:

    select toTimeZone(toDateTime('2021-01-01 00:00:00', 'Etc/GMT-8'), 'UTC')
    

    We can solve this problem in the next PR.

    opened by frank-king 6
  • support primary key

    The primary key plan supports data deduplication.

    First of all, a primary key on a single column is supported. When data is inserted, we check by primary key whether the row already appears in the table. If a row with the same primary key already exists, the insertion is skipped.

    The initial plan is to achieve deduplication by maintaining a deduplication container for each table. When the database is restarted, the primary key column is read from disk and the in-memory container is rebuilt.

    After investigation, Roaring Bitmap is a compressed bitmap index with excellent performance and low memory usage.

    We can use RoaringBitmap and RoaringTreemap in roaring-rs to store ordinary integer primary keys. For string types that cannot be supported by roaring bitmap, we can use HashSet storage.

    Also, where should each table's deduplication container live? Could it be placed in the MetaStore?

    • [x] sql parse
    • [ ] deduplication by primary key when data inserting
    • [ ] recovery
    • [ ] performance test
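A minimal sketch of the per-table deduplication container for the string case, using a std HashSet (names hypothetical; RoaringBitmap/RoaringTreemap from roaring-rs would fill the same role for integer keys):

```rust
use std::collections::HashSet;

/// Per-table deduplication container for string primary keys.
struct DedupContainer {
    seen: HashSet<String>,
}

impl DedupContainer {
    fn new() -> Self {
        Self { seen: HashSet::new() }
    }

    /// Returns true if the key was unseen (row should be inserted),
    /// false if it is a duplicate (insertion should be skipped).
    fn try_insert(&mut self, key: &str) -> bool {
        self.seen.insert(key.to_string())
    }
}
```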
    opened by nautaa 0
  • `select 'abc'` causes the server to get stuck

    In the current version, queries like select 'abc' return no result:

    TensorBase :) select 'abc'
    
    SELECT 'abc'
    
    Ok.
    
    0 rows in set. Elapsed: 0.008 sec. 
    

    When I commented out these three lines in engine::datafusions: https://github.com/tensorbase/tensorbase/blob/14e4802b9c5e9e6e9543f80380213da1fb1c56cf/crates/engine/src/datafusions.rs#L177-L180 the server gets stuck on this query:

    TensorBase :) select 'abc'
    
    SELECT 'abc'
    
    ^C
    

    However, integer literals work:

    TensorBase :) select 1
    
    SELECT 1
    
    ┌─Int64(1)─┐
    │        1 │
    └──────────┘
    
    1 rows in set. Elapsed: 0.005 sec. 
    
    type/bug 
    opened by frank-king 3
  • How is it different from Datafuse?

    In terms of general product strategy and direction? These projects look very similar, and I'm wondering how they differ.

    PS. Best wishes from ClickHouse team :)

    opened by alexey-milovidov 8
Releases (v2021.07.05)