A Modern Real-Time Data Processing & Analytics DBMS with Cloud-Native Architecture, built to make the Data Cloud easy

Overview

Datafuse





Principles

  • Fearless

    • No data races, No unsafe, Minimize unhandled errors
  • High Performance

    • Everything is Parallel
  • High Scalability

    • Everything is Distributed
  • High Reliability

    • Datafuse's primary design goal is reliability

Architecture

Datafuse Architecture

Performance

  • In-memory SIMD vector processing performance only
  • Dataset: 100,000,000,000 (100 Billion)
  • Hardware: AMD Ryzen 7 PRO 4750U, 8 CPU Cores, 16 Threads
  • Rust: rustc 1.55.0-nightly (868c702d0 2021-06-30)
  • Built with link-time optimization and CPU-specific instructions
  • ClickHouse server version 21.4.6 revision 54447
| Query | FuseQuery (v0.4.48-nightly) | ClickHouse (v21.4.6) |
|---|---|---|
| SELECT avg(number) FROM numbers_mt(100000000000) | 4.35 s. (22.97 billion rows/s., 183.91 GB/s.) | ×1.4 slow, (6.04 s.) (16.57 billion rows/s., 132.52 GB/s.) |
| SELECT sum(number) FROM numbers_mt(100000000000) | 4.20 s. (23.79 billion rows/s., 190.50 GB/s.) | ×1.4 slow, (5.90 s.) (16.95 billion rows/s., 135.62 GB/s.) |
| SELECT min(number) FROM numbers_mt(100000000000) | 4.92 s. (20.31 billion rows/s., 162.64 GB/s.) | ×2.7 slow, (13.05 s.) (7.66 billion rows/s., 61.26 GB/s.) |
| SELECT max(number) FROM numbers_mt(100000000000) | 4.77 s. (20.95 billion rows/s., 167.78 GB/s.) | ×3.0 slow, (14.07 s.) (7.11 billion rows/s., 56.86 GB/s.) |
| SELECT count(number) FROM numbers_mt(100000000000) | 2.91 s. (34.33 billion rows/s., 274.90 GB/s.) | ×1.3 slow, (3.71 s.) (26.93 billion rows/s., 215.43 GB/s.) |
| SELECT sum(number+number+number) FROM numbers_mt(100000000000) | 19.83 s. (5.04 billion rows/s., 40.37 GB/s.) | ×12.1 slow, (233.71 s.) (427.87 million rows/s., 3.42 GB/s.) |
| SELECT sum(number) / count(number) FROM numbers_mt(100000000000) | 3.90 s. (25.62 billion rows/s., 205.13 GB/s.) | ×2.5 slow, (9.70 s.) (10.31 billion rows/s., 82.52 GB/s.) |
| SELECT sum(number) / count(number), max(number), min(number) FROM numbers_mt(100000000000) | 8.28 s. (12.07 billion rows/s., 96.66 GB/s.) | ×4.0 slow, (32.87 s.) (3.04 billion rows/s., 24.34 GB/s.) |
| SELECT number FROM numbers_mt(10000000000) ORDER BY number DESC LIMIT 100 | 4.80 s. (2.08 billion rows/s., 16.67 GB/s.) | ×2.9 slow, (13.95 s.) (716.62 million rows/s., 5.73 GB/s.) |
| SELECT max(number), sum(number) FROM numbers_mt(1000000000) GROUP BY number % 3, number % 4, number % 5 | 6.31 s. (158.49 million rows/s., 1.27 GB/s.) | ×1.02 fast, (6.18 s.) (161.84 million rows/s., 1.29 GB/s.) |

Note:

  • ClickHouse system.numbers_mt uses 16-way parallel processing (gist)
  • FuseQuery system.numbers_mt uses 16-way parallel processing (gist)
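
For readers who want to sanity-check the figures in the benchmark, the sketch below recomputes throughput and the slowdown factor from a row count and an elapsed time. It assumes each row is a single 8-byte value and that 1 GB is 10^9 bytes; both are inferences from the ratio of the reported GB/s to rows/s, not something the benchmark states. The function names are hypothetical. Small differences from the published numbers are expected, since the published elapsed times are rounded.

```rust
/// Throughput derived from a row count and an elapsed time.
struct Throughput {
    rows_per_sec: f64,
    gb_per_sec: f64,
}

/// Assumption: one row is one 8-byte value, and 1 GB = 10^9 bytes.
fn throughput(rows: f64, seconds: f64) -> Throughput {
    let rows_per_sec = rows / seconds;
    Throughput {
        rows_per_sec,
        gb_per_sec: rows_per_sec * 8.0 / 1e9,
    }
}

/// How many times slower `other_secs` is than `baseline_secs`.
fn slowdown(baseline_secs: f64, other_secs: f64) -> f64 {
    other_secs / baseline_secs
}

fn main() {
    // avg(number) over 100 billion rows: 4.35 s versus 6.04 s.
    let t = throughput(100_000_000_000.0, 4.35);
    println!(
        "{:.2} billion rows/s, {:.2} GB/s",
        t.rows_per_sec / 1e9,
        t.gb_per_sec
    );
    println!("x{:.1} slower", slowdown(4.35, 6.04));
}
```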

Getting Started

Roadmap

Datafuse is currently in Alpha and is not ready for production use. See Roadmap 2021.

Contributing

License

Datafuse is licensed under Apache 2.0.

Comments
  • [store] refactor: rename store to dfs

    [store] refactor: rename store to dfs

    I hereby agree to the terms of the CLA available at: https://databend.rs/policies/cla/

    Summary

    [store] refactor: rename store to dfs

    Changelog

    • Improvement

    Related Issues

    opened by drmingdrmer 387
  • Rename trait type names from I$Name to $Name

    Rename trait type names from I$Name to $Name

    I hereby agree to the terms of the CLA available at: https://datafuse.rs/policies/cla/

    Summary

    Rename trait type names from I$Name to $Name

    Changelog

    • Renames:

      • ITable to Table
      • IDatabase to Database
    • Removes IDataSource; the struct DataSource is used directly

    • And relevant code

    Related Issues

    Fixes #727

    Test Plan

    No extra ut/stateless_test

    opened by dantengsky 127
  • add toStartOfWeek

    add toStartOfWeek

    I hereby agree to the terms of the CLA available at: https://databend.rs/policies/cla/

    Summary

    Summary about this PR

    Changelog

    • Improvement

    Related Issues

    #853

    Test Plan

    Unit Tests ok

    Stateless Tests ok

    opened by dust1 99
  • ISSUE-2039: rm param

    ISSUE-2039: rm param "database" from get_table_by_id

    I hereby agree to the terms of the CLA available at: https://databend.rs/policies/cla/

    Summary

    • signature of Catalog::get_table_by_id changed to
        fn get_table_by_id(
            &self,
            table_id: MetaId,
            table_version: Option<MetaVersion>,
        ) -> Result<Arc<TableMeta>>;
    

    The "database_name" parameter has been removed.

    NOTE:

    1. Added database_name and table_name to struct common/meta/types/Table.
    2. Added the annotation #[allow(clippy::large_enum_variant)] to pub enum AppliedState to disable the warning that a variant is 400 bytes.
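
To illustrate what a lookup by id alone looks like once the database name is gone from the signature, here is a minimal sketch with an in-memory catalog. MetaId, MetaVersion, TableMeta and the Result alias are hypothetical stand-ins for the real types, and placing database_name and table_name on TableMeta is an assumption made only to keep the example small.

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Hypothetical stand-ins for the real types named in the signature above.
type MetaId = u64;
type MetaVersion = u64;
type Result<T> = std::result::Result<T, String>;

#[derive(Debug)]
struct TableMeta {
    version: MetaVersion,
    // Assumption: the names travel with the table record, so callers
    // no longer need to pass a database name to find the table.
    database_name: String,
    table_name: String,
}

trait Catalog {
    fn get_table_by_id(
        &self,
        table_id: MetaId,
        table_version: Option<MetaVersion>,
    ) -> Result<Arc<TableMeta>>;
}

/// Hypothetical in-memory catalog keyed only by table id.
struct MemoryCatalog {
    tables: HashMap<MetaId, Arc<TableMeta>>,
}

impl Catalog for MemoryCatalog {
    fn get_table_by_id(
        &self,
        table_id: MetaId,
        table_version: Option<MetaVersion>,
    ) -> Result<Arc<TableMeta>> {
        let table = self
            .tables
            .get(&table_id)
            .ok_or_else(|| format!("unknown table id {}", table_id))?;
        match table_version {
            // A requested version must match the stored one.
            Some(v) if v != table.version => Err(format!(
                "table {} is at version {}, not {}",
                table_id, table.version, v
            )),
            _ => Ok(Arc::clone(table)),
        }
    }
}

fn sample_catalog() -> MemoryCatalog {
    let mut tables = HashMap::new();
    tables.insert(
        1,
        Arc::new(TableMeta {
            version: 7,
            database_name: "default".to_string(),
            table_name: "t".to_string(),
        }),
    );
    MemoryCatalog { tables }
}

fn main() {
    let catalog = sample_catalog();
    match catalog.get_table_by_id(1, None) {
        Ok(t) => println!("found {}.{}", t.database_name, t.table_name),
        Err(e) => println!("error: {}", e),
    }
}
```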

    Changelog

    • Improvement
    • Not for changelog (changelog entry is not required)

    Related Issues

    Fixes #2039

    Test Plan

    Unit Tests

    Stateless Tests

    opened by dantengsky 70
  • [query/server/http] Add /v1/query.

    [query/server/http] Add /v1/query.

    I hereby agree to the terms of the CLA available at: https://databend.rs/policies/cla/

    Summary

    Add a new endpoint, /v1/query, to support SQL.

    handler usage

    1. POST a query as JSON in the HttpQueryRequest format; the response is JSON in the QueryResponse format.
    2. Pagination uses long polling.
    3. Result data is embedded in QueryResponse.
    4. The response returns the following 4 URIs. Clients should rely only on the endpoint /v1/query and the URIs returned in the response; other endpoints may change without notice.
      1. next: get the next page (together with progress) with long polling. Page N is released when page N+1 is requested.
      2. state: return a struct like QueryResponse, but with an empty data field. Uses short polling.
      3. kill: kill the running query.
      4. delete: kill the running query and delete it.

    test_async can serve as a demo of their usage.

    pub struct HttpQueryRequest {
        pub sql: String,
    }
    
    pub struct QueryResponse {
        pub id: Option<String>,
        pub columns: Option<DataSchemaRef>,
        pub data: JsonBlockRef,
        pub next_uri: Option<String>,
        pub state_uri: Option<String>,
        pub kill_uri: Option<String>,
        pub delete_uri: Option<String>,
        pub query_state: Option<String>,
        pub request_error: Option<String>,
        pub query_error: Option<String>,
        pub query_progress: Option<ProgressValues>,
    }
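
The pagination flow described above (read a page, then follow next until it is absent) can be sketched without any HTTP library. The example below is only an illustration: QueryResponse is reduced to the two fields the loop needs, rows are plain strings, FakeServer replaces the real endpoint with an in-memory map, and the URIs are made up.

```rust
use std::collections::HashMap;

/// Reduced stand-in for the QueryResponse struct above: only the fields
/// this sketch needs, with rows as plain strings (an assumption).
#[derive(Clone, Debug)]
struct QueryResponse {
    data: Vec<Vec<String>>,
    next_uri: Option<String>,
}

/// Hypothetical in-memory stand-in for the HTTP server:
/// maps a URI to the page it would return.
struct FakeServer {
    pages: HashMap<String, QueryResponse>,
}

impl FakeServer {
    fn get(&self, uri: &str) -> Option<QueryResponse> {
        self.pages.get(uri).cloned()
    }
}

/// Follow `next_uri` from the first response until no further page is advertised.
fn collect_all_rows(server: &FakeServer, first: QueryResponse) -> Vec<Vec<String>> {
    let mut rows = first.data;
    let mut next = first.next_uri;
    while let Some(uri) = next {
        match server.get(&uri) {
            Some(page) => {
                rows.extend(page.data);
                next = page.next_uri;
            }
            // Unknown URI: stop rather than loop forever.
            None => break,
        }
    }
    rows
}

/// Builds a three-page result; the URIs are invented for the example.
fn demo() -> (FakeServer, QueryResponse) {
    let row = |s: &str| vec![s.to_string()];
    let mut pages = HashMap::new();
    pages.insert(
        "/hypothetical/page/1".to_string(),
        QueryResponse {
            data: vec![row("b")],
            next_uri: Some("/hypothetical/page/2".to_string()),
        },
    );
    pages.insert(
        "/hypothetical/page/2".to_string(),
        QueryResponse { data: vec![row("c")], next_uri: None },
    );
    let first = QueryResponse {
        data: vec![row("a")],
        next_uri: Some("/hypothetical/page/1".to_string()),
    };
    (FakeServer { pages }, first)
}

fn main() {
    let (server, first) = demo();
    let rows = collect_all_rows(&server, first);
    println!("collected {} rows", rows.len());
}
```

A real client would additionally inspect query_state and the error fields on every page; those are left out here to keep the loop visible.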
    
    

    internal

    1. The web layer (query_handlers.rs) is separated from a more structured internal implementation in mod v1/query.
      1. /v1/statement is kept for posting raw SQL and may reuse the internal implementation; this will be done with some other changes in another PR.
      2. Hopefully it can even be used to support gRPC later if needed.
    2. Make sure the internal tokio task stops quickly when the query is killed by a SQL command or an HTTP request.
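
One common way to make a worker stop quickly is cooperative cancellation: the worker checks a shared flag between units of work. The sketch below shows the idea with std threads and an AtomicBool instead of tokio tasks, so it is a stand-in for the technique and not the PR's implementation. The function names and the 1 ms sleep that simulates processing one block are assumptions.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Runs up to `max_steps` units of work, checking the kill flag before each one.
/// Returns the number of steps completed.
fn run_query(killed: Arc<AtomicBool>, max_steps: u64) -> u64 {
    let mut done = 0;
    while done < max_steps {
        if killed.load(Ordering::Acquire) {
            break;
        }
        // Stand-in for processing one block of data.
        thread::sleep(Duration::from_millis(1));
        done += 1;
    }
    done
}

/// Starts the worker, kills it shortly afterwards, and returns its progress.
fn run_and_kill(max_steps: u64) -> u64 {
    let killed = Arc::new(AtomicBool::new(false));
    let worker = {
        let flag = Arc::clone(&killed);
        thread::spawn(move || run_query(flag, max_steps))
    };
    thread::sleep(Duration::from_millis(20));
    killed.store(true, Ordering::Release);
    worker.join().expect("worker panicked")
}

fn main() {
    let steps = run_and_kill(10_000);
    println!("stopped after {} of 10000 steps", steps);
}
```

The smaller the unit of work between checks, the faster the worker reacts to a kill request, at the cost of more frequent flag loads.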

    TODO

    Soon (new PR in 1 week or so):

    1. Polish the classification and organization of returned errors.
    2. More tests.
    3. Rewrite /v1/statements.

    Long term:

    1. Add client session support.
    2. Adjust the handler interface to a stable version with a formal doc. Only necessary fields are added currently, including 2 JSON formats and optional URL parameters and headers.
    3. Better control of memory.

    Changelog

    • New Feature

    Related Issues

    Fixes #2604

    Test Plan

    Unit Tests

    pr-feature community-take 
    opened by youngsofun 63
  • [ci] fix gcov install failed

    [ci] fix gcov install failed

    Signed-off-by: Chojan Shang

    I hereby agree to the terms of the CLA available at: https://datafuse.rs/policies/cla/

    Summary

    remove ~/.cargo/bin/ from cache

    Changelog

    • Build/Testing/CI
    • Not for changelog (changelog entry is not required)

    Related Issues

    Fixes #1012

    Test Plan

    No

    pr-build 
    opened by PsiACE 63
  • Implements Feature 630

    Implements Feature 630

    Summary

    It's a baby step of integrating Store with Query, which implements

    • update metadata after appending data parts to the table
    • remote table read_plan
    • remote table read

    Basic statements like insert into ... and select .. from ... can now be executed (and lots of interesting things are left to do).

    Changelog

    • Store: implementations for ITable read_plan and read a5c42b2e5d14d042f3c3d928a35c625ca32f4410

    • Query: implements RemoteTable's read_plan & read b55eacf912bc7985765b870ecd658d505eb75a56

    • Adds remote flag to ReadDataSourcePlan deaea8ea29b4a6d4afb1390cdd3e0d3540b2597c

    • Tweaks stateless test cases ed69c92fc37c01650f17478f4d6e446f828f74ad

    The following issues might be worthy of your concern:

    • Remove trait bound Sync from SendableDataBlockStream ed69c92fc37c01650f17478f4d6e446f828f74ad

      It turns out that, at least for now, we do not need this trait bound, and without the Sync constraint SendableDataBlockStream is more stream-combinator friendly.

    • Keep ITable::read_plan as a non-async method

      This is done by using the runtime of ctx (and a channel). IMHO, changing ITable::read_plan to an async fn may be too harsh at this stage.

    • Add an extra flag to ReadDataSourcePlan and SourceTransform

      So that we can tell when we are operating on a remote table (and fetch the remote table accordingly). It is a temporary workaround; let's postpone it until the Catalog API is ready. SourceTransform::execute and FuseQueryContext are tweaked accordingly. Please see deaea8ea29b4a6d4afb1390cdd3e0d3540b2597c

    Related Issues

    resolves #630

    Test Plan

    • UT & Stateless Tests

    Progress

    • [x] Update meta
    • [x] Flight Service
    • [x] Store Client
    • [x] Remote table - read_plan
    • [x] Remote table - read (read partition)
    • [x] Unit tests & integration tests
    • [x] Multi-Node integration tests
    • [x] Code GC
    • [x] Squash commits
    opened by dantengsky 63
  • [query] refactor: Table::read_plan does not need DatabendQueryContext anymore.

    [query] refactor: Table::read_plan does not need DatabendQueryContext anymore.

    I hereby agree to the terms of the CLA available at: https://databend.rs/policies/cla/

    Summary

    [query] refactor: Table::read_plan does not need DatabendQueryContext anymore.

    Why: Table is a low-level concept in the databend code base and should be a dependency of some other crate such as common/planner.

    This commit is one of the steps to remove the Table dependency on query.

    Trait Table references DatabendQueryContext as an argument type. In this commit it is replaced with another, smaller type, TableIOContext, so that in the future Table can be moved out of crate query.

    TableIOContext provides data-access support, exposes the runtime used by query itself, and provides two resource values: max thread number and node list.

    • Add TableIOContext to provide everything a Table needs to build a plan, among other things.

    • Table::read_plan() uses TableIOContext as its argument in place of DatabendQueryContext.

    • When calling read_plan(), a temporary TableIOContext is built out of DatabendQueryContext.

    • DatabendQueryContext provides two additional supporting methods: get_single_node_table_io_context() and get_cluster_table_io_context().
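
The dependency direction described above can be sketched as follows: the table trait sees only a small context, and the large query context knows how to build one. Everything here is a simplified assumption. QueryContext stands in for DatabendQueryContext, the fields and the return type of read_plan are invented for the example, and only the two method names come from the description.

```rust
/// Hypothetical reduced context: the two resource values mentioned above.
#[derive(Clone, Debug)]
struct TableIOContext {
    max_threads: usize,
    nodes: Vec<String>,
}

/// Hypothetical stand-in for the large query context.
struct QueryContext {
    max_threads: usize,
    local_node: String,
    cluster_nodes: Vec<String>,
}

impl QueryContext {
    /// Temporary context that plans for the local node only.
    fn get_single_node_table_io_context(&self) -> TableIOContext {
        TableIOContext {
            max_threads: self.max_threads,
            nodes: vec![self.local_node.clone()],
        }
    }

    /// Temporary context that plans across every node in the cluster.
    fn get_cluster_table_io_context(&self) -> TableIOContext {
        TableIOContext {
            max_threads: self.max_threads,
            nodes: self.cluster_nodes.clone(),
        }
    }
}

/// The table trait depends only on the small context, not on the query crate.
trait Table {
    /// Returns (node, first row, one past the last row) per partition.
    fn read_plan(&self, io_ctx: &TableIOContext) -> Vec<(String, u64, u64)>;
}

struct NumbersTable {
    total_rows: u64,
}

impl Table for NumbersTable {
    fn read_plan(&self, io_ctx: &TableIOContext) -> Vec<(String, u64, u64)> {
        let n = io_ctx.nodes.len() as u64;
        if n == 0 {
            return Vec::new();
        }
        // Ceiling division so that every row lands in some partition.
        let chunk = (self.total_rows + n - 1) / n;
        io_ctx
            .nodes
            .iter()
            .enumerate()
            .map(|(i, node)| {
                let start = (i as u64 * chunk).min(self.total_rows);
                let end = (start + chunk).min(self.total_rows);
                (node.clone(), start, end)
            })
            .collect()
    }
}

fn main() {
    let ctx = QueryContext {
        max_threads: 4,
        local_node: "a".to_string(),
        cluster_nodes: vec!["a".to_string(), "b".to_string()],
    };
    let io_ctx = ctx.get_cluster_table_io_context();
    let table = NumbersTable { total_rows: 10 };
    println!("{} threads per node", io_ctx.max_threads);
    println!("{:?}", table.read_plan(&io_ctx));
}
```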

    • fix: #2072

    Changelog

    • Improvement

    Related Issues

    • #2046
    • #2059
    opened by drmingdrmer 52
  • Cast functions between String and Date16/Date32/DateTime32

    Cast functions between String and Date16/Date32/DateTime32

    I hereby agree to the terms of the CLA available at: https://databend.rs/policies/cla/

    Summary

    Summary about this PR

    Changelog

    • New Feature

    Related Issues

    Related #853

    Test Plan

    Unit Tests

    Stateless Tests

    pr-feature 
    opened by sundy-li 50
  • ISSUE-1639:Remove session_api.rs

    ISSUE-1639:Remove session_api.rs

    I hereby agree to the terms of the CLA available at: https://datafuse.rs/policies/cla/

    Summary

    Remove common/store-api/session_api.rs

    Changelog

    • Improvement

    Related Issues

    Fixes #1639

    Test Plan

    Unit Tests

    Stateless Tests

    opened by jyz0309 39
  • Consider renaming project. DataFuse is too similar to DataFusion.

    Consider renaming project. DataFuse is too similar to DataFusion.

    This project appears to have similar goals to Apache Arrow DataFusion, contains code from DataFusion, and has a very similar name.

    The names "DataFuse" and "DataFusion" only differ by a few characters and this could cause confusion about the relationship between these projects.

    On behalf of the Apache Arrow DataFusion community, who have put a lot of work into building the DataFusion software and brand over the past three years, I respectfully ask that you consider renaming this project.

    opened by andygrove 35
  • chore(meta): upgrade openraft 0.7.3..0.7.4-alpha.2

    chore(meta): upgrade openraft 0.7.3..0.7.4-alpha.2

    I hereby agree to the terms of the CLA available at: https://databend.rs/dev/policies/cla/

    Summary

    chore(meta): upgrade openraft 0.7.3..0.7.4-alpha.2

    https://github.com/drmingdrmer/openraft/tree/v0.7.4-alpha.2

    Openraft changes:

    • Fix: changing membership should not remove replication to all learners. When changing membership, replications to the learners (non-voters) that are not added as voters should be kept.

      E.g.: with a cluster of voters {0} and learners {1, 2, 3}, changing membership to {0, 1, 2} should not remove replication to node 3.

      Only replications to removed members should be removed.

    • Change: remove AddLearnerError::Exists, which is not actually used
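
The replication rule in the fix above can be expressed with plain set operations. This is an illustration of the stated rule only, not Openraft's code: the function name and the use of u64 node ids are assumptions.

```rust
use std::collections::BTreeSet;

/// Nodes the leader still replicates to after a membership change.
/// Only members removed from the voter set lose replication;
/// learners that were not promoted keep theirs.
fn replication_targets_after(
    leader: u64,
    old_voters: &BTreeSet<u64>,
    learners: &BTreeSet<u64>,
    new_voters: &BTreeSet<u64>,
) -> BTreeSet<u64> {
    // Members that were voters before and are not voters afterwards.
    let removed: BTreeSet<u64> = old_voters.difference(new_voters).copied().collect();
    old_voters
        .iter()
        .chain(learners.iter())
        .chain(new_voters.iter())
        .copied()
        .filter(|id| *id != leader && !removed.contains(id))
        .collect()
}

fn main() {
    // The example from the text: voters {0}, learners {1, 2, 3}, new voters {0, 1, 2}.
    let old_voters = BTreeSet::from([0]);
    let learners = BTreeSet::from([1, 2, 3]);
    let new_voters = BTreeSet::from([0, 1, 2]);
    let targets = replication_targets_after(0, &old_voters, &learners, &new_voters);
    println!("{:?}", targets);
}
```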

    Other changes:

    • Remove adding learner when leader established

    • Fix: #7895

    Changelog

    Related Issues

    pr-chore 
    opened by drmingdrmer 1
  • bug: return error after adding FROM and join

    bug: return error after adding FROM and join

    Search before asking

    • [X] I had searched in the issues and found no similar issues.

    Version

    main

    What's Wrong?

    An error is returned after adding FROM and a join.

    ref clickhouse

    How to Reproduce?

    DROP DATABASE IF EXISTS databend0;
    CREATE DATABASE databend0;
    USE databend0;
    CREATE TABLE t0(c0BOOLEAN BOOL NULL DEFAULT(false));
    CREATE TABLE t1(c0VARCHAR VARCHAR NULL, c1BOOLEAN BOOLEAN NULL DEFAULT(false));
    INSERT INTO t1(c1boolean, c0varchar) VALUES (true, '0');
    
    MySQL [databend0]> SELECT (false and NULL NOT IN (0.1, 0.2, 0.3,0.4)) ::BIGINT FROM t1,t0;
    ERROR 1105 (HY000): Code: 1010, displayText = Can't cast column from nullable data into non-nullable type (while in processor thread 0).
    
    MySQL [databend0]> SELECT (false and NULL NOT IN (0.1, 0.2, 0.3,0.4)) ::BIGINT;
    +--------------------------------------------------+
    | false and null not in(0.1, 0.2, 0.3, 0.4)::int64 |
    +--------------------------------------------------+
    |                                                0 |
    +--------------------------------------------------+
    1 row in set (0.003 sec)
    
    

    Are you willing to submit PR?

    • [ ] Yes I am willing to submit a PR!
    C-bug 
    opened by hanyisong 3
  • chore(hive): improve log if table not exist

    chore(hive): improve log if table not exist

    I hereby agree to the terms of the CLA available at: https://databend.rs/dev/policies/cla/

    Summary

    • Query a nonexistent table: select * from a;

    before:

    ERROR 1105 (HY000): Code: 1002, displayText = remote service threw NoSuchObjectException.
    

    after:

    ERROR 1105 (HY000): Code: 1106, displayText = default.a table not found.
    
    • View tables are not supported
    pr-chore 
    opened by sandflee 1
  • Feature:  customize ser/de of DataValue for min-max statistics

    Feature: customize ser/de of DataValue for min-max statistics

    Summary

    Currently DataValue::String is serialized as Vec<u8> in the min-max statistics and, unfortunately, in JSON format ("[xxx, xxx ... xx]"), which is rather inefficient.

    Either:

    • customize the ser/de of DataValue::String that is used in the min-max statistics
    • or replace the JSON segment format with another format
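
To make the overhead concrete, the sketch below renders the same bytes in both forms and compares their lengths. It uses hand-written formatting instead of a serialization library, and both function names are hypothetical. The string form is only produced for valid UTF-8 that needs no escaping, which sidesteps the cases a real custom ser/de would have to handle.

```rust
/// Renders bytes the way a Vec<u8> appears in JSON:
/// a bracketed list of decimal numbers.
fn json_byte_array(bytes: &[u8]) -> String {
    let parts: Vec<String> = bytes.iter().map(|b| b.to_string()).collect();
    format!("[{}]", parts.join(","))
}

/// Renders bytes as a JSON string when they are valid UTF-8 and need
/// no escaping; returns None otherwise.
fn json_plain_string(bytes: &[u8]) -> Option<String> {
    let s = std::str::from_utf8(bytes).ok()?;
    let needs_escape = s.chars().any(|c| c == '"' || c == '\\' || c.is_control());
    if needs_escape {
        None
    } else {
        Some(format!("\"{}\"", s))
    }
}

fn main() {
    let value = b"hello";
    let as_array = json_byte_array(value);
    println!("{} ({} bytes)", as_array, as_array.len());
    if let Some(as_string) = json_plain_string(value) {
        println!("{} ({} bytes)", as_string, as_string.len());
    }
}
```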
    C-feature 
    opened by dantengsky 0
  • Feature:  consider abandoning buffered read in meta readers

    Feature: consider abandoning buffered read in meta readers

    Summary

    By default, meta readers allocate 1MB of memory for buffered reads, which increases memory pressure (and the likelihood of being OOM-killed) in various cases, e.g. the parallel pruning phase.
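
The cost is easy to see with the standard library's BufReader, which allocates its whole buffer up front regardless of how little data is read. This is only an analogy: whether the meta readers use std's BufReader or an async equivalent is an assumption, and the helper names are hypothetical.

```rust
use std::io::{BufReader, Cursor, Read};

const ONE_MB: usize = 1024 * 1024;

/// Buffer memory held by `readers` concurrent readers of `capacity` bytes each.
fn total_buffer_bytes(readers: usize, capacity: usize) -> usize {
    readers * capacity
}

/// Reads all of `data` through a BufReader with the given capacity.
/// Returns (buffer capacity, bytes actually read).
fn read_with_capacity(data: &[u8], capacity: usize) -> std::io::Result<(usize, usize)> {
    let mut reader = BufReader::with_capacity(capacity, Cursor::new(data));
    let mut out = Vec::new();
    let n = reader.read_to_end(&mut out)?;
    Ok((reader.capacity(), n))
}

fn main() -> std::io::Result<()> {
    // A tiny metadata blob still pays for the full buffer.
    let (capacity, read) = read_with_capacity(b"tiny metadata", ONE_MB)?;
    println!("buffer: {} bytes, payload: {} bytes", capacity, read);
    // With many readers running in parallel, the buffers add up.
    println!("64 readers hold {} bytes", total_buffer_bytes(64, ONE_MB));
    Ok(())
}
```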

    C-feature 
    opened by dantengsky 0
Releases(v0.8.55-nightly)
Owner
Datafuse Labs
The open-source Lakehouse runtime that powers the Modern Data Cloud