Databend aims to be an open source, elastic, and reliable serverless data warehouse

Overview

The Open Source Serverless Data Warehouse for Everyone

Website | Roadmap | Documentation


What is Databend?

Databend aims to be an open source, elastic, and reliable serverless data warehouse. It offers blazing fast queries and combines the elasticity, simplicity, and low cost of the cloud, and is built to make the Data Cloud easy.

Databend design principles:

  1. Elastic: In Databend, storage and compute resources can be scaled on demand.
  2. Serverless: In Databend, you don’t have to think about servers; you pay only for what you actually use.
  3. User-friendly: Databend is an ANSI SQL compliant cloud warehouse that is easy for data scientists and engineers to use.
  4. Secure: All data files and network traffic in Databend are encrypted end-to-end, and Role Based Access Control is provided at the SQL level.

Design Overview

Databend is inspired by ClickHouse, and its computing model is based on Apache Arrow.

Databend consists of three components: the meta service layer and the decoupled compute and storage layers.

Meta Service Layer

The meta service is a layer that serves multiple tenants. In the current implementation, the meta service has the following components:

  • Metadata - Manages all metadata of databases, tables, clusters, transactions, etc.
  • Administration - Stores user info, user management, access control information, usage statistics, etc.
  • Security - Performs authorization and authentication to protect the privacy of users' data.

Compute Layer

The compute layer consists of the clusters that run computing workloads. Each cluster has many nodes, and each node has the following components:

  • Planner - Builds an execution plan from the user's SQL statement.
  • Optimizer - Applies rules such as predicate pushdown and pruning of unused columns.
  • Processors - A vectorized execution engine, built from the planner's instructions.
  • Cache - Caches data and indexes based on the version.

Many clusters can attach to the same database, so they can serve queries from different users in parallel.
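
To make the optimizer rule named above more concrete, here is a minimal sketch of predicate pushdown on a toy plan tree. The `Plan` enum and `push_down` function are hypothetical names invented for this illustration; they are not Databend's planner types, and the predicate is kept as a plain string for brevity.

```rust
/// Toy logical plan (hypothetical; not Databend's planner types).
#[derive(Debug, Clone, PartialEq)]
pub enum Plan {
    /// Table scan with an optional predicate evaluated at the source.
    Scan { table: String, filter: Option<String> },
    /// Filter applied on top of another plan.
    Filter { predicate: String, input: Box<Plan> },
    /// Column projection on top of another plan.
    Projection { columns: Vec<String>, input: Box<Plan> },
}

/// Push a Filter that sits directly above an unfiltered Scan into the Scan.
/// Other shapes are left in place after their inputs are rewritten.
pub fn push_down(plan: Plan) -> Plan {
    match plan {
        Plan::Filter { predicate, input } => match push_down(*input) {
            // The scan has no filter yet: absorb the predicate into it.
            Plan::Scan { table, filter: None } => Plan::Scan {
                table,
                filter: Some(predicate),
            },
            // Anything else keeps the Filter node above it.
            other => Plan::Filter {
                predicate,
                input: Box::new(other),
            },
        },
        Plan::Projection { columns, input } => Plan::Projection {
            columns,
            input: Box::new(push_down(*input)),
        },
        scan @ Plan::Scan { .. } => scan,
    }
}
```

Applied to a projection over a filter over a scan, the rewrite leaves a projection over a scan that carries the predicate, so rows can be discarded at the source.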

Storage Layer

Databend stores data in an efficient, columnar format as Parquet files. For efficient pruning, Databend also creates indexes for each Parquet file to speed up the queries.
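
The text above does not say which kind of index is built per Parquet file. As an illustration only, the sketch below assumes a simple min/max index on one integer column and shows how such an index lets a range query skip files. `FileIndex`, `prune`, and `sample_files` are hypothetical names.

```rust
/// Hypothetical per-file index: the min and max of one integer column.
#[derive(Debug, Clone, PartialEq)]
pub struct FileIndex {
    pub file: String,
    pub min: i64,
    pub max: i64,
}

/// Keep only the files whose [min, max] range overlaps the query range [lo, hi].
/// Files outside the range cannot contain matching rows and are never read.
pub fn prune(files: &[FileIndex], lo: i64, hi: i64) -> Vec<&FileIndex> {
    files
        .iter()
        .filter(|f| f.max >= lo && f.min <= hi)
        .collect()
}

/// Three illustrative files covering consecutive value ranges.
pub fn sample_files() -> Vec<FileIndex> {
    vec![
        FileIndex { file: "part-0.parquet".to_string(), min: 0, max: 99 },
        FileIndex { file: "part-1.parquet".to_string(), min: 100, max: 199 },
        FileIndex { file: "part-2.parquet".to_string(), min: 200, max: 299 },
    ]
}
```

With these sample files, a query for values between 150 and 220 only needs `part-1.parquet` and `part-2.parquet`; `part-0.parquet` is pruned without being opened.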

Getting Started

Roadmap

Databend is currently in Alpha and is not ready to be used in production. See Roadmap 2022.

License

Databend is licensed under Apache 2.0.

Acknowledgement

Document Hosting

Comments
  • [store] refactor: rename store to dfs

    I hereby agree to the terms of the CLA available at: https://databend.rs/policies/cla/

    Summary

    [store] refactor: rename store to dfs

    Changelog

    • Improvement

    Related Issues

    opened by drmingdrmer 387
  • Rename trait type names from I$Name to $Name

    I hereby agree to the terms of the CLA available at: https://datafuse.rs/policies/cla/

    Summary

    Rename trait type names from I$Name to $Name

    Changelog

    • Renames :

      • ITable to Table
      • IDatabase to Database
    • Removes IDataSource, use struct DataSource directly

    • And relevant code

    Related Issues

    Fixes #727

    Test Plan

    No extra ut/stateless_test

    opened by dantengsky 127
  • add toStartOfWeek

    I hereby agree to the terms of the CLA available at: https://databend.rs/policies/cla/

    Summary

    Summary about this PR

    Changelog

    • Improvement

    Related Issues

    #853

    Test Plan

    Unit Tests ok

    Stateless Tests ok

    opened by dust1 99
  • ISSUE-2039: rm param "database" from get_table_by_id

    I hereby agree to the terms of the CLA available at: https://databend.rs/policies/cla/

    Summary

    • signature of Catalog::get_table_by_id changed to
        fn get_table_by_id(
            &self,
            table_id: MetaId,
            table_version: Option<MetaVersion>,
        ) -> Result<Arc<TableMeta>>;
    

    the "database_name" parameter has been removed.

    NOTE:

    1. Added database_name and table_name to struct common/meta/types/Table.
    2. Added the annotation #[allow(clippy::large_enum_variant)] to pub enum AppliedState to silence the warning that a variant is 400 bytes.
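
For readers unfamiliar with this lint, the sketch below shows what it is about. An enum is as large as its largest variant, so one big variant inflates every value; the usual options are to allow the lint, as this PR does, or to box the large payload. `InlineState` and `BoxedState` are hypothetical stand-ins, not the real AppliedState.

```rust
use std::mem::size_of;

/// Large payload stored inline: every value of the enum is at least 400 bytes.
/// The attribute silences Clippy's warning, mirroring the choice in this PR.
#[allow(clippy::large_enum_variant)]
pub enum InlineState {
    None,
    Big([u8; 400]),
}

/// Alternative: box the payload so the enum itself stays pointer-sized,
/// at the cost of a heap allocation for the large variant.
pub enum BoxedState {
    None,
    Big(Box<[u8; 400]>),
}

/// Return (inline size, boxed size) in bytes for comparison.
pub fn sizes() -> (usize, usize) {
    (size_of::<InlineState>(), size_of::<BoxedState>())
}
```

Allowing the lint keeps the code simple and avoids an allocation; boxing is the trade-off to consider if many small variants are stored in bulk.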

    Changelog

    • Improvement
    • Not for changelog (changelog entry is not required)

    Related Issues

    Fixes #2039

    Test Plan

    Unit Tests

    Stateless Tests

    opened by dantengsky 70
  • [query/server/http] Add /v1/query.

    I hereby agree to the terms of the CLA available at: https://databend.rs/policies/cla/

    Summary

    Add new endpoint /v1/query to support sql.

    handler usage

    1. Post a query as JSON in the HttpQueryRequest format; the response is JSON in the QueryResponse format.
    2. Pagination uses long polling.
    3. Result data is embedded in QueryResponse.
    4. The response returns the following 4 URIs. Clients should rely only on the endpoint /v1/query and the URIs returned in the response; other endpoints may change without notice.
      1. next: get the next page (together with progress) with long polling. Page N is released when page N+1 is requested.
      2. state: return a struct like QueryResponse, but with an empty data field. Uses short polling.
      3. kill: kill the running query.
      4. delete: kill the running query and delete it.

    test_async can be a demo for their usage.

    pub struct HttpQueryRequest {
        pub sql: String,
    }
    
    pub struct QueryResponse {
        pub id: Option<String>,
        pub columns: Option<DataSchemaRef>,
        pub data: JsonBlockRef,
        pub next_uri: Option<String>,
        pub state_uri: Option<String>,
        pub kill_uri: Option<String>,
        pub delete_uri: Option<String>,
        pub query_state: Option<String>,
        pub request_error: Option<String>,
        pub query_error: Option<String>,
        pub query_progress: Option<ProgressValues>,
    }
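
The pagination flow described in the handler usage can be sketched from the client side as a loop that follows next_uri until it is absent. This is an illustration under stated assumptions: `Page` is a trimmed, hypothetical view of QueryResponse, `fetch` stands in for the real HTTP call, and the page URI used by `fake_server` is made up for the example.

```rust
/// Trimmed, hypothetical view of a response: only the fields the loop needs.
#[derive(Debug, Clone)]
pub struct Page {
    pub rows: Vec<String>,
    pub next_uri: Option<String>,
}

/// Follow `next_uri` until it is absent, collecting the rows of every page.
/// `fetch` stands in for the HTTP request a real client would send.
pub fn collect_rows<F>(first_uri: &str, mut fetch: F) -> Vec<String>
where
    F: FnMut(&str) -> Page,
{
    let mut rows = Vec::new();
    let mut uri = Some(first_uri.to_string());
    while let Some(current) = uri {
        let page = fetch(&current);
        rows.extend(page.rows);
        // Only URIs handed back by the server are followed.
        uri = page.next_uri;
    }
    rows
}

/// In-memory stand-in for the server; the page URI is invented for this sketch.
pub fn fake_server(uri: &str) -> Page {
    match uri {
        "/v1/query" => Page {
            rows: vec!["r1".to_string()],
            next_uri: Some("/v1/query/q1/page/1".to_string()),
        },
        "/v1/query/q1/page/1" => Page {
            rows: vec!["r2".to_string(), "r3".to_string()],
            next_uri: None,
        },
        _ => Page { rows: Vec::new(), next_uri: None },
    }
}
```

Calling `collect_rows("/v1/query", fake_server)` gathers the rows of both pages in order. A real client would also inspect the state and error fields of each response before continuing.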
    
    

    internal

    1. The web layer (query_handlers.rs) is separated from a more structured internal implementation in mod v1/query.
      1. /v1/statement is kept for posting raw SQL and may reuse the internal implementation; this will be done with some other changes in another PR.
      2. Hopefully it may even be used to support gRPC later if needed.
    2. Make sure the internal tokio task stops quickly when the query is killed by a SQL command or an HTTP request.

    TODO

    Soon (new PR in 1 week or so):

    1. Polish the classification and organization of returned errors.
    2. More tests.
    3. Rewrite /v1/statements.

    Long term:

    1. Add client session support.
    2. Adjust the handler interface to a stable version with a formal doc, including the 2 JSON formats and optional URL parameters and headers. Only necessary fields are added currently.
    3. Better control of memory.

    Changelog

    • New Feature

    Related Issues

    Fixes #2604

    Test Plan

    Unit Tests

    pr-feature community-take 
    opened by youngsofun 63
  • [ci] fix gcov install failed

    Signed-off-by: Chojan Shang [email protected]

    I hereby agree to the terms of the CLA available at: https://datafuse.rs/policies/cla/

    Summary

    remove ~/.cargo/bin/ from cache

    Changelog

    • Build/Testing/CI
    • Not for changelog (changelog entry is not required)

    Related Issues

    Fixes #1012

    Test Plan

    No

    pr-build 
    opened by PsiACE 63
  • Implements Feature 630

    Summary

    It's a baby step of integrating Store with Query, which implements

    • update metadata after appending data parts to the table
    • remote table read_plan
    • remote table read

    Basic statements like insert into ... and select .. from ... could be executed now. (and lots of interesting things are left to do)

    Changelog

    • Store: implementations for ITable read_plan and read a5c42b2e5d14d042f3c3d928a35c625ca32f4410

    • Query: implements RemoteTable's read_plan & read b55eacf912bc7985765b870ecd658d505eb75a56

    • Adds remote flag to ReadDataSourcePlan deaea8ea29b4a6d4afb1390cdd3e0d3540b2597c

    • Tweaks stateless test cases ed69c92fc37c01650f17478f4d6e446f828f74ad

    The following issues might be worthy of your concern:

    • Remove trait bound Sync from SendableDataBlockStream ed69c92fc37c01650f17478f4d6e446f828f74ad

      Turns out, at least for now, we do not need this trait bound, and without Sync constraint, SendableDataBlockStream is more stream-combinator friendly.

    • Keep ITable::read_plan as a non-async method

      This is done by using the runtime of ctx (and a channel). IMHO, changing ITable::read_plan to an async fn may be too harsh at this stage.

    • Add an extra flag to ReadDataSourcePlan and SourceTransform

      This lets us tell when we are operating on a remote table (and fetch the remote table accordingly). It is a temporary workaround; let's postpone it until the Catalog API is ready. SourceTransform::execute and FuseQueryContext are tweaked accordingly. Please see deaea8ea29b4a6d4afb1390cdd3e0d3540b2597c

    Related Issues

    resolves #630

    Test Plan

    • UT & Stateless Tests

    Progress

    • [x] Update meta
    • [x] Flight Service
    • [x] Store Client
    • [x] Remote table - read_plan
    • [x] Remote table - read (read partition)
    • [x] Unit tests & integration tests
    • [x] Multi-Node integration tests
    • [x] Code GC
    • [x] Squash commits
    opened by dantengsky 63
  • [query] refactor: Table::read_plan does not need DatabendQueryContext anymore.

    I hereby agree to the terms of the CLA available at: https://databend.rs/policies/cla/

    Summary

    [query] refactor: Table::read_plan does not need DatabendQueryContext anymore.

    Why: Table is a low level concept in databend code base and should be a dependency of some other crate such as common/planner.

    This commit is one of the steps to remove the Table dependency on query.

    Trait Table references DatabendQueryContext as an argument type. In this commit it is replaced with another smaller type TableIOContext so that in future Table can be moved out of crate query.

    TableIOContext provides data-access support, exposes the runtime used by query itself, and provides two resource values: the max thread number and the node list.

    • Add TableIOContext to provide everything a Table needs to build a plan and more.

    • Table::read_plan() takes TableIOContext as its argument instead of DatabendQueryContext.

    • When calling read_plan(), a temporary TableIOContext is built out of DatabendQueryContext.

    • DatabendQueryContext provides two additional supporting methods: get_single_node_table_io_context() and get_cluster_table_io_context().

    • fix: #2072
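
The idea behind this refactor can be sketched as follows: the table trait depends on a small context type, and the large query context knows how to build one. All types here are simplified stand-ins invented for illustration. Only the two resource values from the summary are modelled; the runtime and data-access support are omitted, and the method names are shortened versions of the ones listed above.

```rust
/// Stand-in for the large query context (hypothetical fields).
pub struct QueryContext {
    pub max_threads: usize,
    pub local_node: String,
    pub cluster_nodes: Vec<String>,
    // A real query context holds much more state that a table never needs.
}

/// Small context carrying only what a table needs to build a read plan.
#[derive(Debug, Clone, PartialEq)]
pub struct TableIoContext {
    pub max_threads: usize,
    pub nodes: Vec<String>,
}

impl QueryContext {
    /// Temporary context that restricts planning to the local node.
    pub fn single_node_table_io_context(&self) -> TableIoContext {
        TableIoContext {
            max_threads: self.max_threads,
            nodes: vec![self.local_node.clone()],
        }
    }

    /// Temporary context that exposes every node of the cluster.
    pub fn cluster_table_io_context(&self) -> TableIoContext {
        TableIoContext {
            max_threads: self.max_threads,
            nodes: self.cluster_nodes.clone(),
        }
    }
}

/// The trait depends only on the small context, not on the query crate.
pub trait Table {
    /// Return the nodes that should each read one partition.
    fn read_plan(&self, io_ctx: &TableIoContext) -> Vec<String>;
}

/// Toy table that plans one partition per available node.
pub struct PartitionedTable;

impl Table for PartitionedTable {
    fn read_plan(&self, io_ctx: &TableIoContext) -> Vec<String> {
        io_ctx.nodes.clone()
    }
}
```

Because `Table` no longer names the query context, it could live in a lower-level crate, which is the stated goal of the change.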

    Changelog

    • Improvement

    Related Issues

    • #2046
    • #2059
    opened by drmingdrmer 52
  • Cast functions between String and Date16/Date32/DateTime32

    I hereby agree to the terms of the CLA available at: https://databend.rs/policies/cla/

    Summary

    Summary about this PR

    Changelog

    • New Feature

    Related Issues

    Related #853

    Test Plan

    Unit Tests

    Stateless Tests

    pr-feature 
    opened by sundy-li 50
  • ISSUE-1639:Remove session_api.rs

    I hereby agree to the terms of the CLA available at: https://datafuse.rs/policies/cla/

    Summary

    Remove common/store-api/session_api.rs

    Changelog

    • Improvement

    Related Issues

    Fixes #1639

    Test Plan

    Unit Tests

    Stateless Tests

    opened by jyz0309 39
  • Consider renaming project. DataFuse is too similar to DataFusion.

    This project appears to have similar goals to Apache Arrow DataFusion, contains code from DataFusion, and has a very similar name.

    The names "DataFuse" and "DataFusion" only differ by a few characters and this could cause confusion about the relationship between these projects.

    On behalf of the Apache Arrow DataFusion community, who have put a lot of work into building the DataFusion software and brand over the past three years, I respectfully ask that you consider renaming this project.

    opened by andygrove 35
  • docs: sample toml files

    I hereby agree to the terms of the CLA available at: https://databend.rs/dev/policies/cla/

    Summary

    Added a note to the toml configuration page about the sample files.

    Closes #9372

    pr-doc 
    opened by soyeric128 1
  • Alter table sql support

    I hereby agree to the terms of the CLA available at: https://databend.rs/dev/policies/cla/

    Summary

    feat: This PR is not for review; it is just a draft PR for running the unit tests of the alter table modification.

    opened by lichuang 1
  • bug: expression branch wrap_nullable

    Search before asking

    • [X] I had searched in the issues and found no similar issues.

    Version

    expression branch

    What's Wrong?

    wrap_nullable wraps the nullable type and gets the inner value. If the inner value is NULL, it uses the default value.

    How to Reproduce?

    CREATE TABLE t1(s String NULL, pat String NULL, pos Int64 NULL, occu Int64 NULL, ro Int64 NULL, mt String NULL) Engine = Fuse;
    
    INSERT INTO t1 (s, pat, pos, occu, ro, mt) VALUES ('dog cat dog', 'dog', NULL, 1, 1, 'c');
    
    SELECT s FROM t1 WHERE REGEXP_INSTR(s, pat, pos, occu, ro, mt) = 4;
    ERROR 1105 (HY000): result row write failed: Code: 1001, displayText = Incorrect arguments to regexp_instr: position must be positive, but got 0.
    

    Expected result: NULL.
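
The difference between the reported behaviour and the expected one can be sketched with Option standing in for nullable values. This is an illustration, not Databend's wrap_nullable: `instr` is a simplified substring search rather than a regular-expression function, and all three function names are hypothetical.

```rust
/// Simplified stand-in for a positional search: the 1-based index of `pat`
/// in `s`, searching from 1-based `pos`; 0 if not found. Not a regex engine.
pub fn instr(s: &str, pat: &str, pos: i64) -> Result<i64, String> {
    if pos < 1 {
        return Err(format!("position must be positive, but got {}", pos));
    }
    let start = (pos - 1) as usize;
    // `get` returns None when `start` is out of range or not a char boundary.
    match s.get(start..) {
        Some(rest) => Ok(rest.find(pat).map(|i| (start + i + 1) as i64).unwrap_or(0)),
        None => Ok(0),
    }
}

/// Reported behaviour: a NULL argument is replaced by the type's default
/// value (0 for the position), which the inner function then rejects.
pub fn instr_default_on_null(
    s: Option<&str>,
    pat: Option<&str>,
    pos: Option<i64>,
) -> Result<Option<i64>, String> {
    instr(s.unwrap_or(""), pat.unwrap_or(""), pos.unwrap_or(0)).map(Some)
}

/// Expected behaviour: any NULL argument yields NULL and the inner
/// function is never called with a made-up default.
pub fn instr_null_propagating(
    s: Option<&str>,
    pat: Option<&str>,
    pos: Option<i64>,
) -> Result<Option<i64>, String> {
    match (s, pat, pos) {
        (Some(s), Some(pat), Some(pos)) => instr(s, pat, pos).map(Some),
        _ => Ok(None),
    }
}
```

With a NULL position, the first wrapper fails with the same kind of "position must be positive" error shown in the reproduction, while the second returns NULL as the issue expects.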

    Are you willing to submit PR?

    • [ ] Yes I am willing to submit a PR!
    C-bug 
    opened by TCeason 0
  • feat: update support subquery

    mysql> create table t1(id1 int, val1 varchar(255));
    Query OK, 0 rows affected (0.01 sec)
    
    mysql> create table t2(id2 int, val2 varchar(255));
    Query OK, 0 rows affected (0.01 sec)
    
    mysql> insert into t1 values (1,'1') ;
    Query OK, 1 row affected (0.01 sec)
    
    mysql> insert into t2 values (1,'2');
    Query OK, 1 row affected (0.01 sec)
    
    mysql> update t1 set val1 = (select val2 from t2 where id1 = id2) where id1 in (select id2 from t2);
    ERROR 1105 (HY000): Code: 1001, displayText = Unsupported physical scalar: SubqueryExpr(SubqueryExpr { typ: Any, subquery: SExpr { plan: EvalScalar(EvalScalar { items: [ScalarItem { scalar: BoundColumnRef(BoundColumnRef { column: ColumnBinding { database_name: Some("default"), table_name: Some("t2"), column_name: "id2", index: 4, data_type: Int32(Int32), visibility: Visible } }), index: 4 }] }), children: [SExpr { plan: LogicalGet(LogicalGet { table_index: 2, columns: {5, 4}, push_down_predicates: None, limit: None, order_by:
    
    opened by zhyass 0
  • Roadmap 2023

    After a full year of research and development in 2022, the functionality and stability of Databend were significantly enhanced, and several users began using it in production. Databend has helped them greatly reduce costs and operational complexity issues.

    This is the Databend roadmap for 2023 (discussion).

    See also:

    Main tasks

    Features

    | Task                                     | Status      | Comments |
    |------------------------------------------|-------------|----------|
    | Update                                   | IN PROGRESS |          |
    | Merge                                    | PLAN        |          |
    | Alter table                              | IN PROGRESS |          |
    | Window function                          | PLAN        |          |
    | Lambda function and high-order functions | PLAN        |          |
    | TimestampTz data type                    | PLAN        |          |
    | Decimal data type                        | PLAN        |          |
    | Materialized view                        | PLAN        |          |
    | Support SET_VAR hints #8833              | PLAN        |          |
    | Parquet reader                           | PLAN        |          |
    | Distributed COPY                         | PLAN        |          |
    | JSON indexing                            | PLAN        |          |
    | DataFrame                                | PLAN        |          |
    | Data Sharing (community version)         | IN PROGRESS |          |
    | Concurrent query enhance                 | PLAN        |          |

    Improvements

    | Task                 | Status      | Comments |
    |----------------------|-------------|----------|
    | New expression #9411 | IN PROGRESS |          |
    | Error message        | PLAN        |          |

    Planner

    | Task                                  | Status      | Comments             |
    |---------------------------------------|-------------|----------------------|
    | Scalar expression normalization       | PLAN        |                      |
    | Column constraint framework           | PLAN        |                      |
    | Functional dependency framework #7438 | PLAN        |                      |
    | Join reorder                          | IN PROGRESS |                      |
    | CBO for distributed plan              | PLAN        |                      |
    | Support TPC-DS                        | PLAN        |                      |
    | Support optimization tracing          | PLAN        | Easy to debug/study. |

    Cache

    | Task                | Status      | Comments |
    |---------------------|-------------|----------|
    | Unified cache layer | IN PROGRESS |          |
    | Meta data cache     | IN PROGRESS |          |
    | Index data cache    | IN PROGRESS |          |
    | Block data cache    | PLAN        |          |

    Data Storage

    | Task                            | Status | Comments                                 |
    |---------------------------------|--------|------------------------------------------|
    | Fuse engine re-clustering       | PLAN   |                                          |
    | Fuse engine orphan data cleanup | PLAN   |                                          |
    | Fuse engine segment tree        | PLAN   | Support large datasets (PB) in one table |

    LakeHouse

    | Task                                | Status      | Comments |
    |-------------------------------------|-------------|----------|
    | Apache Hive                         | IN PROGRESS |          |
    | Apache Iceberg                      | IN PROGRESS |          |
    | Querying external storage (Parquet) | IN PROGRESS |          |

    Distributed Query Execution

    | Task                 | Status      | Comments |
    |----------------------|-------------|----------|
    | Visualized profiling | IN PROGRESS |          |
    | Aggregation spilling | IN PROGRESS |          |

    Resource Quota

    | Task                                     | Status      | Comments |
    |------------------------------------------|-------------|----------|
    | Session-level quota control (CPU/Memory) | IN PROGRESS |          |
    | User-level quota control (CPU/Memory)    | PLAN        |          |

    Integrations

    | Task                                      | Status      | Comments |
    |-------------------------------------------|-------------|----------|
    | Dbt integration                           | IN PROGRESS |          |
    | Airbyte integration                       | IN PROGRESS |          |
    | Datadog Vector integrate with Rust-driver | IN PROGRESS |          |
    | Datax integrate with Java-driver          | IN PROGRESS |          |
    | CDC with Flink                            | PLAN        |          |
    | CDC with Kafka                            | PLAN        |          |

    Meta

    | Task        | Status      | Comments |
    |-------------|-------------|----------|
    | Jepsen test | IN PROGRESS |          |

    Testing

    | Task          | Status      | Comments                            |
    |---------------|-------------|-------------------------------------|
    | SQLlogic Test | IN PROGRESS | Supports more test cases            |
    | SQLancer Test | IN PROGRESS | Supports more types and more cases  |
    | Fuzzer Test   | PLAN        |                                     |

    Releases

    opened by BohuTANG 4
Releases(v0.8.176-nightly)
Owner
Datafuse Labs
The open-source runtime that powers the Modern Data Cloud