New generation decentralized data warehouse and streaming data pipeline

Overview
kamu

World's first decentralized real-time data warehouse, on your laptop

Docs | Demo | Tutorials | Examples | FAQ | Chat

Get Started

About

kamu (pronounced kaˈmju) is an easy-to-use command-line tool for managing, transforming, and collaborating on structured data.

In short, it can be described as:

  • Decentralized data warehouse
  • A peer-to-peer stream processing data pipeline
  • Git for data
  • Blockchain-like ledger for data
  • Or even Kubernetes for data :)

Using kamu, any person or the smallest organization can easily share structured data with the world. Data can be static or flow continuously. In all cases, kamu will ensure that it stays:

  • Reproducible - i.e. you can ask the publisher "Give me the same exact data you gave me a year ago"
  • Verifiable - i.e. you can ask the publisher "Is this the exact data you had a year ago?"

Teams and data communities can then collaborate on cleaning, enriching, and aggregating data by building arbitrarily complex decentralized data pipelines. Following the "data as code" philosophy, kamu doesn't let you touch data manually - instead, you transform it using Streaming SQL (we support multiple frameworks). This ensures that data supply chains are:

  • Autonomous - write query once and run it forever, no more babysitting fragile batch workflows
  • Low latency - get accurate results immediately, as new data arrives
  • Transparent - see where every single piece of data came from, who transformed it, and how
  • Collaborative - collaborate on data just like on Open Source Software
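To make the pipeline philosophy above concrete, here is a sketch of what a kamu dataset manifest looks like. It is modeled on the manifest structure that appears later in this page's issue reports; the dataset name, URL, and column names are hypothetical placeholders, and the exact schema may vary between kamu versions:

```yaml
# Hypothetical sketch of a kamu dataset manifest: a root dataset that
# polls a CSV from a URL and cleans it with an SQL query on ingest.
# All names and the URL are placeholders, not a real published dataset.
kind: DatasetSnapshot
version: 1
content:
  name: example.city.population
  kind: root
  metadata:
    - kind: setPollingSource
      fetch:
        kind: url
        url: https://example.org/population.csv
      read:
        kind: csv
        header: true
      preprocess:
        kind: sql
        engine: spark
        query: >
          SELECT
            CAST(reported_at as TIMESTAMP) as reported_at,
            city,
            CAST(population as BIGINT) as population
          FROM input
      merge:
        kind: ledger
        primaryKey:
          - city
    - kind: setVocab
      eventTimeColumn: reported_at
```

Because the transformation is declared once in the manifest, kamu can re-run it automatically whenever new data arrives - this is what "write query once and run it forever" means in practice.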

Data scientists, analysts, ML/AI researchers, and engineers can then:

  • Access fresh, clean, and trustworthy data in seconds
  • Easily keep datasets up-to-date
  • Safely reuse data created by the hard work of the community

The reuse is achieved by maintaining an unbreakable lineage and provenance trail in tamper-proof metadata, which lets you assess the trustworthiness of data no matter how many hands and transformation steps it went through.

In a larger context, kamu is a reference implementation of Open Data Fabric - a Web 3.0 protocol for providing timely, high-quality, and verifiable data for data science, smart contracts, and web applications.

Open Data Fabric

Use Cases

In general, kamu is a great fit for cases where data is exchanged between several independent parties, and for mission-critical data (of low to moderate frequency and volume) where a high degree of trustworthiness and protection from malicious actors is required.

Examples:

Open Data

To share data outside of your organization today you have limited options:

  • You can publish it on some open data portal, but lose ownership and control of your data
  • You can deploy and operate some open-source data portal (like CKAN or Dataverse), but you probably have neither time nor money to do so
  • You can self-host it as a CSV file on some simple HTTP/FTP server, but then you are making it extremely hard for others to discover and use your data

Let's acknowledge that for the organizations that produce the most valuable data (governments, hospitals, NGOs), publishing data is not part of their core business. They typically don't have the incentives, expertise, or resources to be good publishers.

This is why the goal of kamu is to make data publishing cheap and effortless:

  • It invisibly guides publishers towards best data management practices (preserving history, making data reproducible and verifiable)
  • It adds as little friction as exporting data to CSV
  • It lets you host your data on any storage (FTP, S3, GCS, etc.)
  • It lets you maintain full control and ownership of your data

As opposed to the mere download counter you get on most data portals, kamu brings publishers closer to their communities, allowing them to see who uses their data and how. You no longer send data into "the ether" - you create a closed feedback loop with your consumers.

Science & Research

One of the driving forces behind kamu's design was the ongoing reproducibility crisis in science, which we believe to a large extent is caused by our poor data management practices.

After incidents like the Surgisphere scandal, the sentiment in research is changing from assuming that all research is done in good faith to considering any research unreliable until proven otherwise.

Data portals like Dataverse, Dryad, Figshare, and Zenodo are helping reproducibility by archiving data, but this approach:

  • Results in hundreds of millions of poorly systematized datasets
  • Tends to produce research based on stale and long-outdated data
  • Creates lineage and provenance trail that is very manual and hard to trace (through published papers)

In kamu we believe that the majority of valuable data (weather, census, health records, financial core data) flows continuously, and most of the interesting insights lie around the latest data, so we designed it to bring reproducibility and verifiability to near real-time data.

When using kamu:

  • Your data projects are 100% reproducible using a built-in stable references mechanism
  • Your results can be reproduced and verified by others in minutes
  • All the data prep work (which often accounts for 80% of a data scientist's time) can be shared and reused by others
  • Your data projects will continue to function long after you've moved on, so the work done years ago can continue to produce valuable insights with minimal maintenance on your part
  • Continuously flowing datasets are much easier to systematize than the exponentially growing number of snapshots

Data-driven Journalism

Data-driven journalism is on the rise and has proven to be extremely effective. In a world of misinformation and extremely polarized opinions, data gives us an anchoring point for discussing complex problems and analyzing cause and effect. Data itself is non-partisan and has no secret agenda, and arguments about different interpretations of data are infinitely more productive than ones based on gut feelings.

Unfortunately, too often data has issues that undermine its trustworthiness. And even if the data is correct, it's very easy to pose a question about its sources that takes too long to answer - the data will be dismissed, and gut feelings will step in.

This is why kamu's goal is to make data verifiably trustworthy and to make answering provenance questions a matter of seconds. Only when data cannot be easily dismissed will we start to pay proper attention to it.

And once we agree that source data can be trusted, we can build analyses and real-time dashboards that keep track of complex issues like corruption, inequality, climate, epidemics, refugee crises, etc.

kamu prevents good research from going stale the moment it's published!

Business core data

kamu aims to be the most reliable data management solution that provides recent data while maintaining the highest degree of accountability and tamper-proof provenance, without you having to put all data in some central database.

We're developing it with financial and pharmaceutical use cases in mind, where audit and compliance could be fully automated through our system.

Note that we currently focus on mission-critical data: kamu is not well suited for IoT or other high-frequency, high-volume cases, but it can be a good fit for insights produced from such data that influence your company's decisions and strategy.

Personal analytics

Being data geeks, we use kamu for data-driven decision-making even in our personal lives.

Actually, our largest data pipelines so far were created for personal finance:

  • collect and harmonize data from multiple bank accounts
  • convert currencies
  • analyze stock trading data

We also scrape a lot of websites to make smarter purchasing decisions. kamu lets us keep all this data up-to-date with minimal effort.

Features

kamu connects publishers and consumers of data through a decentralized network and lets people collaborate on extracting insight from data. It offers many perks for everyone who participates in this first-of-a-kind data supply chain:

For Data Publishers
  • Easily share your data with the world without moving it anywhere
  • Retain full ownership and control of your data
  • Close the feedback loop and see who uses your data and how
  • Provide real-time, verifiable and reproducible data that follows the best data management practices
For Data Scientists
  • Ingest any existing dataset from the web
  • Always stay up-to-date by pulling the latest updates from data sources with just one command
  • Use stable data references to make your data projects fully reproducible
  • Collaborate on cleaning and improving existing datasets
  • Create derivative datasets by transforming, enriching, and summarizing data others have published
  • Write query once and run it forever - our pipelines require nearly zero maintenance
  • Built-in support for GIS data
  • Share your results with others in a fully reproducible and reusable form
For Data Consumers
  • Download a dataset from a shared repository
  • Verify that all data comes from trusted sources using 100% accurate lineage
  • Audit the chain of transformations this data went through
  • Validate that downloaded data was not tampered with, using a single command
  • Trust your data by knowing where every single bit of information came from, thanks to our fine-grained provenance
For Data Exploration
  • Explore data and run ad-hoc SQL queries (backed by the power of Apache Spark)
  • Launch a Jupyter notebook with one command
  • Join, filter, and shape your data using SQL
  • Visualize the result using your favorite library
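To give a feel for the workflow described above, a typical session might look like the transcript below. All the commands shown (add, pull, list, sql, notebook) appear elsewhere on this page, but the dataset and manifest names are hypothetical placeholders, and flags may differ between versions - consult `kamu --help` for specifics:

```
# Add a dataset definition from a manifest file, pull the latest data,
# then explore it. Dataset name is a placeholder.
$ kamu add my-dataset.yaml
$ kamu pull my-dataset
$ kamu list
$ kamu sql          # ad-hoc SQL queries backed by Apache Spark
$ kamu notebook     # launch a Jupyter notebook against your datasets
```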

Community

If you like what we're doing, support us by starring the repo - it helps us a lot!

Subscribe to our YouTube channel to get fresh tech talks and deep dives.

Stop by and say "hi" in our Discord Server - we're always happy to chat about data.

If you'd like to contribute, start here.


Comments
  • How to use ODS

    opened by JvD007 5
  • `Thrift Server did not start: TimeoutError` error when launching SQL shell

    And the error changed when I re-executed the same command:

    (base) ➜  my-repo kamu sql
    ⠂ Starting Spark SQL shell
    thread 'main' panicked at 'Thrift Server did not start: TimeoutError { duration: 60s, backtrace: <disabled> }', kamu-core/src/infra/explore/sql_shell_impl.rs:143:14
    (base) ➜  my-repo kamu sql
    ⠚ Starting Spark SQL shell
    thread 'main' panicked at 'Thrift server start script returned non-zero code: ExitStatusError(ExitStatusError(256))', kamu-core/src/infra/explore/sql_shell_impl.rs:130:18
    

    Is there a way to restart the Thrift server and check the detailed log message?

    bug need more info 
    opened by ihainan 3
  • Documentation and Installation question(s)

    Some questions around installation and user documentation:

    • What do we need to install to get kamu up and running without using Docker?

    • Is it possible to give examples of all the functions of kamu-cli? The help is not giving the best answer about what you can do with it. The add and pull are clear, and sql and notebook also.

    • Some examples with Python, SparkR and maybe others

    • Create a dataset on S3, etc.

    TX, Jaco

    documentation 
    opened by JvD007 3
  • [FeatureReq] : New engine - Apache Pulsar

    Hey Folks,

    The project looks awesome! I'd like to propose an app integration / new engine with Apache Pulsar. It's a streaming pub/sub platform with native support for local code for mutations/transforms of data. Each topic also supports AVRO schemas registered with an understanding of data model revisions.

    -J

    enhancement need more info 
    opened by verbunk 3
  • [4/7] Failed to update root dataset

    While testing kamu on a dataset I got the following error; the kamu examples are working fine on my system.

    $ kamu-cli pull hydro.input.3
    [1/7] Checking for updates (hydro.input.3)
    Downloading hydro.input.3: [00:00:00] [##################################################################################################################################] 25.74KB/25.74KB (119.21MB/s, 0s)
    [4/7] Failed to update root dataset (hydro.input.3)
    1 dataset(s) had errors

    Summary of errors:

    hydro.input.3: Ingest error: Engine error: Contract error: Engine did not write a response file, see log files for details: /home/jaco/.kamu/run/spark-jeCbeTBOlV.out.txt /home/jaco/.kamu/run/spark-jeCbeTBOlV.err.txt

    The dataset is located at http://localhost/cameraregister-utrecht-csv.csv. I have added the csv file and the yaml file in the zip file.

    I can't find what is going wrong with the csv file or yaml

    Any help is welcome

    hydro-test.zip (attached)

    opened by JvD007 2
  • Bump axum-core from 0.2.7 to 0.2.8

    Bumps axum-core from 0.2.7 to 0.2.8.

    Release notes

    Sourced from axum-core's releases.

    axum-core - v0.2.8

    Security

    • breaking: Added default limit to how much data Bytes::from_request will consume. Previously it would attempt to consume the entire request body without checking its length. This meant that if a malicious peer sent a large (or infinite) request body, your server might run out of memory and crash.

      The default limit is at 2 MB and can be disabled by adding the new DefaultBodyLimit::disable() middleware. See its documentation for more details.

      This also applies to String which used Bytes::from_request internally.

      (#1346)

    #1346: tokio-rs/axum#1346

    Commits

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

    dependencies 
    opened by dependabot[bot] 1
  • Bump lz4-sys from 1.9.3 to 1.9.4

    Bumps lz4-sys from 1.9.3 to 1.9.4.

    Changelog

    Sourced from lz4-sys's changelog.

    1.24.0:

    • Update to lz4 1.9.4 (lz4-sys 1.9.4) - this fixes CVE-2021-3520, which was a security vulnerability in the core lz4 library
    • export the include directory of lz4 from build.rs

    1.23.3 (March 5, 2022):

    • Update lz4 to 1.9.3
    • Add [de]compress_to_buffer to block API to allow reusing buffers (#16)
    • Windows static lib support
    • Support favor_dec_speed
    • Misc small fixes

    1.23.2:

    • Update lz4 to 1.9.2
    • Remove dependency on skeptic (replace with build-dependency docmatic for README testing)
    • Move to Rust 2018 edition

    1.23.0:

    • Update lz4 to v1.8.2
    • Add lz4 block mode api

    1.22.0:

    • Update lz4 to v1.8.0
    • Remove lz4 redundant dependency to gcc #22 (thanks to Xidorn Quan)

    1.21.1:

    • Fix always rebuild issue #21

    1.21.0:

    • Fix smallest 11-byte stream decoding (thanks to Niklas Hambüchen)
    • Update lz4 to v1.7.5

    1.20.0:

    • Split out separate sys package #16 (thanks to Thijs Cadier)

    1.19.173:

    • Update lz4 to v1.7.3

    1.19.131:

    • Update dependencies for correct work with a changed build environment via rustup override

    1.18.131:

    • Implemented Send for Encoder/Decoder #15 (thanks to Maxime Lenoir)

    ... (truncated)

    Commits

    Dependabot compatibility score


    dependencies 
    opened by dependabot[bot] 1
  • Fix panic on pulling non-existing dataset

    $ kamu list
    ┌─────────────────────────────────┬──────┬────────┬─────────┬──────┐
    │              Name               │ Kind │ Pulled │ Records │ Size │
    ├─────────────────────────────────┼──────┼────────┼─────────┼──────┤
    │ com.cryptocompare.ohlcv.eth-usd │ Root │   -    │       - │    - │
    └─────────────────────────────────┴──────┴────────┴─────────┴──────┘
    
    $ kamu pull zzz
    thread 'main' panicked at 'called `Option::unwrap()` on a `None` value', kamu-core/src/infra/pull_service_impl.rs:165:45
    note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
    
    opened by sergiimk 1
  • Bump async-graphql from 4.0.5 to 4.0.6

    Bumps async-graphql from 4.0.5 to 4.0.6.

    Changelog

    Sourced from async-graphql's changelog.

    [4.0.6] 2022-07-21

    • Limit recursive depth to 256 by default
    Commits

    Dependabot compatibility score


    dependencies 
    opened by dependabot[bot] 1
  • Improve paginator performance on large block counts

    This PR adds two workarounds for poor pagination performance on datasets with high block counts.

    1. Individual writes into minus::Pager seem to have high overhead, so we format the entire block using a temporary buffer and only write to Pager once per block.

    This already improves performance significantly (e.g. kamu log on a 1000-block dataset takes about 3 seconds to render).

    2. On top of that, to limit performance degradation, we introduce a --limit parameter to the log command that defaults to 500 blocks.

    Note:

    • kamu log x > file is blazing fast even with 1K blocks, so all the overhead is in the minus pager library
    • the minus library has support for "dynamic paging", where a separate thread can push more data progressively, but it doesn't work as I expected - it doesn't backpressure the producing thread and seems to only make the problem worse

    @zaychenko-sergei please review

    opened by sergiimk 1
  • Bump openssl-src from 111.21.0+1.1.1p to 111.22.0+1.1.1q

    Bumps openssl-src from 111.21.0+1.1.1p to 111.22.0+1.1.1q.

    Commits

    Dependabot compatibility score


    dependencies 
    opened by dependabot[bot] 1
  • Bump flatbuffers from 2.0.0 to 22.9.29

    Bumps flatbuffers from 2.0.0 to 22.9.29.

    Release notes

    Sourced from flatbuffers's releases.

    v22.9.29

    Changelog

    What's Changed

    New Contributors

    Full Changelog: https://github.com/google/flatbuffers/compare/v22.9.24...v22.9.29

    v22.9.24

    Change Log

    What's Changed

    New Contributors

    ... (truncated)

    Changelog

    Sourced from flatbuffers's changelog.

    22.9.29 (Sept 29 2022)

    • Rust soundness fixes to avoid the crate from being labelled unsafe (#7518).

    22.9.24 (Sept 24 2022)

    • 20 Major releases in a row? Nope, we switched to a new versioning scheme that is based on date.

    • Python supports fixed size arrays now (#7529).

    • Behavior change in how C++ object API uses UnPackTo. The original intent of this was to reduce allocations by reusing an existing object to pack data into. At some point, this logic started to merge the states of the two objects instead of clearing the state of the packee. This change goes back to the original intention, the packed object is cleared when getting data packed into it (#7527).

    • Fixed a bug in C++ alignment that was using sizeof() instead of the intended AlignOf() for structs (#7520).

    • C# has an official NuGet package now (#7496).

    2.0.8 (Aug 29 2022)

    • Fix for --keep-prefix that was generating the wrong include statements for C++ (#7469). The bug was introduced in 2.0.7.

    • Added the Verifier::Options option struct to allow specifying runtime configuration settings for the verifier (#7489). This allows skipping verification of nested flatbuffers, an on-by-default change that was introduced in 2.0.7. This deprecates the existing Verifier constructor, which may be removed in a future version.

    • Refactor of tests/test.cpp that lead to ~10% speedup in compilation of the entire project (#7487).

    2.0.7 (Aug 22 2022)

    • This is the first version with an explicit change log, so all the previous features will not be listed.

    • Verifier now checks that buffers are at least the minimum size required to be a flatbuffers (12 bytes). This includes nested flatbuffers, which previously could be declared valid at size 0.

    • Annotated binaries. Given a flatbuffer binary and a schema (or binary schema)

    ... (truncated)

    Commits

    Dependabot compatibility score


    dependencies 
    opened by dependabot[bot] 0
  • Kamu EventTime column can not have nulls exception

    I am facing an issue with the EventTime column, even though I have followed the date format as mentioned in the yaml file. Please help.

    kind: DatasetSnapshot
    version: 1
    content:
      name: Hiding Name
      kind: root
      metadata:
        - kind: setPollingSource
          fetch:
            kind: url
            url: Hiding URL
          read:
            kind: csv
            separator: ","
            header: true
            nullValue: ""
          preprocess:
            kind: sql
            engine: spark
            query: >
              SELECT CAST(id as BIGINT) as id,
                CAST(UNIX_TIMESTAMP(date, "yyyy-MM-dd") as TIMESTAMP) as date,
                username as username, name as name, tweet as tweet, language as language,
                mentions as mentions, urls as urls, photos as photos,
                replies_count as replies_count, retweets_count as retweets_count,
                likes_count as likes, hashtags as hashtags, link as link,
                retweet as retweet, quote_url as quote_url, video as video,
                thumbnail as thumbnail, reply_to as reply_to
              FROM input
          merge:
            kind: ledger
            primaryKey:
              - id
        - kind: setVocab
          eventTimeColumn: date

    Kamu_Error Sample_data_to_upload.csv

    need more info 
    opened by suresh852456 1
  • SELinux support

    User reported that kamu fails to pull a root dataset when installed on fresh Fedora host:

    [4/7] Failed to update root dataset (ca.bankofcanada.exchange-rates.daily)
    
    Summary of errors:
    ca.bankofcanada.exchange-rates.daily: Ingest error: Engine error: Process error: Process exited with code 1, see log files for details:
    - .kamu/run/spark-DNSwZEEJZl.err.txt
    
    Error: Partial failure
    

    Spark logs:

    Exception in thread "main" java.nio.file.AccessDeniedException: /opt/engine/in-out/request.yaml
    	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:84)
    	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
    
    help wanted 3rd party issue 
    opened by sergiimk 2
  • CKAN examples

    CKAN examples

    Are there any examples of ingesting data via the CKAN API?

    For example, I would like to use this open data set:

    https://ckan.dataplatform.nl/dataset/speeltoestellen/resource/a8e4bb02-f072-424a-8663-5a5f0fe7c7c0

    API call: https://ckan.dataplatform.nl/api/3/action/datastore_search?resource_id=a8e4bb02-f072-424a-8663-5a5f0fe7c7c0&limit=5

    This platform hosts around 1900 open data sources, so it would be nice to have at least one example in GitHub.

    Let me know

    need more info 
    opened by JvD007 1
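A hedged sketch of what such an example might look like. CKAN's DataStore also exposes a CSV dump endpoint (`/datastore/dump/<resource_id>`), which a kamu root dataset can poll directly with the CSV reader. The dataset name, merge strategy, and the assumption that the dump includes CKAN's `_id` column are illustrative, not a verified working manifest:

```yaml
# Hypothetical root dataset pulling the CKAN resource mentioned above.
kind: DatasetSnapshot
version: 1
content:
  name: nl.dataplatform.speeltoestellen
  kind: root
  metadata:
    - kind: setPollingSource
      fetch:
        kind: url
        url: https://ckan.dataplatform.nl/datastore/dump/a8e4bb02-f072-424a-8663-5a5f0fe7c7c0
      read:
        kind: csv
        header: true
      merge:
        kind: snapshot
        primaryKey:
          - _id
```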
  • Null data when reader fails to parse date type

    Null data when reader fails to parse date type

    When reading a CSV and coercing a column into the DATE type, the reader fails to parse values in day/month/year format and silently outputs ALL rows as null.

    Expectation: a readable error is displayed, asking the user to specify the dateFormat.

    bug good first issue 
    opened by onyalcin 0
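Until such an error exists, the workaround is to declare the format explicitly. A minimal sketch, assuming the CSV read step accepts Spark-style `dateFormat` and `schema` options (the column name and format value are illustrative):

```yaml
read:
  kind: csv
  header: true
  # Spark's default date pattern is yyyy-MM-dd; day/month/year input
  # must be declared explicitly or parsing yields nulls:
  dateFormat: dd/MM/yyyy
  schema:
    - observed DATE
```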
  • sparkmagic garbles some dataset types

    sparkmagic garbles some dataset types

    sparkmagic currently transforms SQL queries into Spark code, with the resulting dataframe transferred in stringified form. Since type information is lost, pandas tries to guess the data types and often does so incorrectly. We need to update sparkmagic to include type information in the transmitted results and use it when building the pandas dataframe.

    3rd party issue 
    opened by sergiimk 1
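On the receiving side, the fix could look roughly like this: a minimal sketch (not sparkmagic's actual code) of rebuilding a typed pandas dataframe from stringified values plus a transmitted schema. The function name and schema encoding are hypothetical.

```python
import pandas as pd

def apply_schema(df: pd.DataFrame, schema: dict) -> pd.DataFrame:
    """Coerce stringified columns back to their transmitted types
    instead of letting pandas guess."""
    for col, dtype in schema.items():
        if dtype.startswith("datetime"):
            df[col] = pd.to_datetime(df[col])
        else:
            df[col] = df[col].astype(dtype)
    return df

# After transfer, every value arrives as a string:
raw = pd.DataFrame({"id": ["1", "2"], "event_time": ["2022-01-01", "2022-01-02"]})
typed = apply_schema(raw, {"id": "int64", "event_time": "datetime64[ns]"})
```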
Releases(v0.102.0)
Owner
kamu
Decentralized data supply chain
kamu