Carbonado

An apocalypse-resistant data storage format for the truly paranoid.

Overview

Designed to keep encrypted, durable, compressed, provably replicated, consensus-critical data, without the need for a blockchain or powerful hardware. Decoding and encoding can be done in the browser through WebAssembly, built into remote nodes on P2P networks, kept on S3-compatible cloud storage, or locally on-disk as a single highly portable flat file container format.

Features

Carbonado has features to make it resistant against:

  • Drive failure and data loss
    • Uses bao encoding so it can be uploaded to a remote peer, and random 1KB slices of that data can be periodically checked against a local hash to verify data replication and integrity (see the verification sketch after this list). This way, copies can be distributed geographically; in case of a coronal mass ejection or solar flare, at most half the planet will be affected.
  • Surveillance
    • Files are encrypted at-rest by default using ecies authenticated encryption from secp256k1 keys, which can either be provided or derived from a mnemonic (an encryption sketch also follows below).
  • Theft
    • Decoding is done by the client with their own keys, so it doesn't matter if the devices storing the data are taken or lost, even if the storage media is unencrypted.
  • Digital obsolescence
    • All project code, dependencies, and programs will be vendored into a tarball and made available in Carbonado format with every release.
  • Bit rot and cosmic rays
    • As a final encoding step, forward error correction codes are added using zfec, to augment the ones already used in some filesystems and storage media.
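
A minimal sketch of the slice-verification idea above, assuming the bao crate: encode the data, keep the hash locally, hand the encoded copy to a peer, then challenge the peer for a 1KB slice and verify it against the local hash. The payload and slice offset here are arbitrary; a real challenge would pick the offset at random each round.

```rust
use std::io::{Cursor, Read};

use bao::{decode::SliceDecoder, encode::SliceExtractor};

fn main() -> std::io::Result<()> {
    let data = vec![42u8; 16 * 1024]; // example payload
    // Encode once; keep `hash` locally and hand `encoded` to a remote peer.
    let (encoded, hash) = bao::encode::encode(&data);

    // Challenge: ask the peer for one 1KB slice at a (normally random) offset.
    let (slice_start, slice_len) = (4096, 1024);
    let mut extractor = SliceExtractor::new(Cursor::new(&encoded), slice_start, slice_len);
    let mut slice = Vec::new();
    extractor.read_to_end(&mut slice)?;

    // Verify the returned slice against the local hash; a tampered or missing
    // slice fails to decode here.
    let mut decoder = SliceDecoder::new(Cursor::new(&slice), &hash, slice_start, slice_len);
    let mut verified = Vec::new();
    decoder.read_to_end(&mut verified)?;
    println!("verified {} bytes of the remote copy", verified.len());
    Ok(())
}
```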

All of this is achieved without needing a blockchain; however, blockchains can be useful for periodically checkpointing data in a durable place.
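
And a minimal sketch of the at-rest encryption layer, assuming the ecies crate (ECIES over secp256k1, as referenced above); deriving the key pair from a mnemonic is out of scope here.

```rust
use ecies::{decrypt, encrypt, utils::generate_keypair};

fn main() {
    // Hypothetical key pair; in practice the key is provided or mnemonic-derived.
    let (sk, pk) = generate_keypair();
    let plaintext = b"consensus-critical payload";

    // Encrypt to the secp256k1 public key; only the secret key holder can decode.
    let ciphertext = encrypt(&pk.serialize(), plaintext).expect("encryption failed");
    let recovered = decrypt(&sk.serialize(), &ciphertext).expect("decryption failed");
    assert_eq!(&recovered[..], &plaintext[..]);
}
```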

Checkpoints

Carbonado supports an optional Bitcoin-compatible HD wallet with a specific derivation path that can be used to secure timestamped Carbonado Checkpoints using an on-chain OP_RETURN.

Checkpoints are structured, human-readable YAML files that can be used to reference other Carbonado-encoded files. They can also include an index of all the places the file has been stored, so that multiple locations on the internet can be checked for the presence of Carbonado-encoded data for that hash.
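
As an illustration only, a checkpoint could be modeled as roughly the following serde-serializable shape; the field names are hypothetical, not the actual Carbonado schema.

```rust
use serde::{Deserialize, Serialize};

/// Hypothetical checkpoint shape; field names are illustrative only.
#[derive(Serialize, Deserialize)]
struct Checkpoint {
    /// Bao/BLAKE3 hash of the referenced Carbonado-encoded file (hex).
    hash: String,
    /// Known locations where the encoded file can be retrieved.
    locations: Vec<String>,
    /// Txid of the on-chain OP_RETURN that timestamps this checkpoint, if any.
    anchor_txid: Option<String>,
}

fn main() -> Result<(), serde_yaml::Error> {
    let checkpoint = Checkpoint {
        hash: "replace-with-bao-hash-hex".into(),
        locations: vec!["https://example.com/carbonado/store".into()],
        anchor_txid: None,
    };
    println!("{}", serde_yaml::to_string(&checkpoint)?);
    Ok(())
}
```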

Applications

Contracts

RGB contract consignments must be encoded in a consensus-critical manner that is also resistant to data loss; otherwise, they cannot be imported or spent.

Content

Carbonado includes metadata for MIME type and preview content, which is useful for NFTs and UDAs, especially for taking full possession of data and self-hosting it, or for paying peers to keep it safe, remotely.

Code

Code, dependencies, and programs can be vendored and preserved wherever they are needed. This helps ensure data is accessible, even if there's no longer internet access, or package managers are offline.

Comparisons

Ethereum

On Ethereum, all contract code is replicated by nodes for all addresses at all times. This results in scalability problems, is prohibitively expensive for larger amounts of data, and exposes all data for all contract users, in addition to the possibility that it can be altered for all users, at any time, without their involvement.

Carbonado is specifically designed for encoding RGB contracts, which are to be kept off-chain, encrypted, and safe.

IPFS

IPFS stores data in a database called BadgerDS, encoded in IPLD formats; this is not the same as a simple, portable flat file format that can be transferred and stored out-of-band of any server, service, or node.

Filecoin

Carbonado uses Bao stream verification based on the performant Blake3 hash algorithm, to establish a statistical proof of replication (which can be proven repeatedly over time). Filecoin instead uses zk-SNARKs, which are notoriously computationally expensive, often recommending GPU acceleration. In addition, Filecoin requires a blockchain, whereas Carbonado does not.

Storm

Storm is great, but it has a file size limit of 16MB, and while files can be split into chunks, they're stored directly in an embedded database, and not in flat files. Ideally, Carbonado would be used in conjunction with Storm.

Error correction

Several decisions were made in how error correction is handled. The chunking forward error correction algorithm used is Zfec, the same algorithm used in Tahoe-LAFS. Similar to how RAID 5 and 6 stripe parity bits across a storage array, Zfec encodes data so that only k valid chunks out of m total are needed to reconstruct the original.

This is complicated by the fact that Zfec has no integrity checks built in. Bao is used to verify the integrity of the decoded input, but if the integrity check fails, we can't be sure which chunk failed. There are two ways to handle this: either create a hash for each chunk and persist it in a safe place out-of-band, or try each combination of chunks until one is found that works. The latter approach is used here, since the need for scrubbing should be relatively rare, especially if reliable storage media is used, a CoW filesystem is set to scrub for bit rot, or a complete good copy exists elsewhere. However, if you're down to your last copy and all you have is the hash (the name of the file) and some good chunks, the scrub method in this crate should help, even if it can be computationally intensive.
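
A rough sketch of that combination search, with the Zfec decode step abstracted behind a hypothetical callback; the only thing needed to recognize a good combination is the BLAKE3/Bao hash kept out-of-band (the file name).

```rust
use itertools::Itertools;

/// Try k-sized combinations of the surviving chunks until one decodes to bytes
/// that hash back to the expected value. `decode_k_of_m` stands in for the
/// Zfec decoder and is hypothetical here.
fn scrub_by_search<F>(
    chunks: &[(usize, Vec<u8>)], // (chunk index, chunk bytes) pairs
    k: usize,                    // chunks required (4 in Carbonado)
    expected_hash: blake3::Hash, // hash kept out-of-band
    decode_k_of_m: F,
) -> Option<Vec<u8>>
where
    F: Fn(&[(usize, Vec<u8>)]) -> Option<Vec<u8>>,
{
    for combo in chunks.iter().cloned().combinations(k) {
        if let Some(decoded) = decode_k_of_m(&combo) {
            // Only accept a combination whose decoded bytes verify.
            if blake3::hash(&decoded) == expected_hash {
                return Some(decoded);
            }
        }
    }
    None
}
```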

Running scrub on an input that has no errors actually returns an error; this prevents unnecessary writes of bytes that don't need to be scrubbed, which is useful in append-only datastores and metered cloud storage scenarios.

The values 4/8 were chosen for Zfec's k-of-m parameters, meaning only 4 valid chunks are needed out of the 8 provided, so up to half of the chunks can fail to decode. This doubles the size of the data, on top of the encryption and integrity-checking overhead, but such is the price of paranoia. Also, a non-prime k is needed to align chunk size with Bao slice size.

Bao only supports a fixed chunk size of 1KB, so the smallest a Carbonado file can be is 8KB.

Comments
  • Double-encryption

    Bytes can be encrypted with your own public key, but a second layer could also be added to encrypt for a storage provider. This will result in a separate hash and improve proof of replication, since it reduces the risk of de-duplication (multiple providers colluding to keep data in one place and then relaying slice challenges).

  • Web Storage Provider

    A web storage provider will have a private key in a configuration file, and will use it, along with the public key the file is signed with, to encrypt the file locally. All Carbonado files must be either signed or encrypted.

    It will also store chunks in 8 separate folders, which are recommended to be moved to separate storage volume arrays.

    This makes #11 obsolete because for private files, the key is simply not shared. If a storage provider is told to store a file that's not encrypted, it checks the signature and creates an ECDH key that encrypts the file using a shared secret. If the storage provider is paid to, it will provide the content.

    This will also need to support key blacklisting and whitelisting. Whitelisting will be useful for storage providers who only want to support specific users, and blacklisting is useful for if someone is trying to share bad files using the same key across different storage providers.
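
    A rough sketch of the ECDH step described above, assuming the rust-secp256k1 crate; how the resulting shared secret is fed into a symmetric cipher is left out.

    ```rust
    use rand::rngs::OsRng;
    use secp256k1::{ecdh::SharedSecret, Secp256k1};

    fn main() {
        let secp = Secp256k1::new();
        // Hypothetical keys: the provider's key pair and the uploader's key pair.
        let (provider_sk, provider_pk) = secp.generate_keypair(&mut OsRng);
        let (uploader_sk, uploader_pk) = secp.generate_keypair(&mut OsRng);

        // Both sides derive the same 32-byte secret without exchanging private
        // keys; that secret can then key a symmetric cipher for the stored file.
        let provider_side = SharedSecret::new(&uploader_pk, &provider_sk);
        let uploader_side = SharedSecret::new(&provider_pk, &uploader_sk);
        assert_eq!(provider_side.secret_bytes(), uploader_side.secret_bytes());
    }
    ```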

  • Content Addressability and File Segmenting

    Files over 16MB will be segmented in order to improve computational parallelization and to support streaming very large files.

    Segments are different from chunks in that there will always need to be 4/8 chunks, but there can be many segment increments of 16MB.

    In order to support parallelization, a content catalog is needed in order to refer to the original content that was encoded. This content catalog will be storage frontend-specific. For BitTorrent it'll be a SHA-2 hash, for IPFS it'll be a Blake2b Multihash, and for the HTTP frontend, it'll use a Blake3 hash. In all cases, the client is encouraged to hash the contents received once-over in order to verify it has indeed received the correct data. Content catalogs will be Carbonado-encoded on-disk, with optional encryption in order to preserve privacy at-rest.

    For each frontend supported, a YAML file is used to simplify inspection, and it will contain a list of segments indexed by the Bao hash used to encode them. Additional metadata can also be included such as offset and index within the file to align the contents with IPLD DAGs or BitTorrent chunks. For the rsync frontend, original file metadata can be stored, and the rsync frontend indexes files by a hash of their path. Blake3 hashes will be keyed using the file's public key in order to improve privacy by breaking authoritative content hash tables (such as a sort of Rainbow table used to index files known by state actors).
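
    For the keyed-hash idea, the blake3 crate's keyed mode could be used directly; the sketch below assumes the 32-byte key is derived by hashing the file's public key, which is an illustrative choice rather than the project's actual scheme.

    ```rust
    /// Key the content hash with (a digest of) the file's public key, so the
    /// same content yields different lookup hashes under different keys.
    fn keyed_content_hash(pubkey_bytes: &[u8], content: &[u8]) -> blake3::Hash {
        // Reduce the public key to the 32-byte key that keyed mode expects.
        let key = blake3::hash(pubkey_bytes);
        blake3::keyed_hash(key.as_bytes(), content)
    }
    ```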

  • Geographic Redundancy

    Octants are used to better ensure geographic distribution of data. This is volunteered data, and it won't have as much of an effect on geographic arbitrage if default storage modes are tolerant of some measure of adjacency. By using an octant system, it's easier to select a different storage region to avoid putting all replicas in the same region, without necessarily needing to resort to trigonometry.

    (Diagram: The Carbonado Octants System.) Although O4 is quite sparse, if there are no providers available within this region (or others), an adjacent octant will be chosen.

  • Storage markets

    A market node is used for routing; LN channels are opened to the market node by both storage clients and storage providers. Storage prices rise on a per-node basis; this encourages a more even distribution of load and incentivizes adding storage to the network in a competitive market as prices rise with demand. For instance, price should be a function of supply, increasing exponentially as a storage allocation approaches 100%.

    Storage allocation can be verified by the storage provider node. Speed tests can be used between the storage provider and the storage market. Storage markets can relay data between multiple storage providers, improving replication, with the replication factor managed by the storage market node itself; alternatively, replication can be performed peer-to-peer between storage clients and individual storage providers. The primary function of a market node is to set a market price. Otherwise, fully P2P operation can occur at a fixed price.
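
    For illustration only, an exponential price curve of that kind might look like the following; the constants and function name are hypothetical.

    ```rust
    /// Illustrative only: price per GB rises exponentially as a provider's
    /// allocation (fraction of capacity in use) approaches 100%.
    fn price_per_gb(base_price: f64, allocation: f64, steepness: f64) -> f64 {
        base_price * (steepness * allocation.clamp(0.0, 1.0)).exp()
    }
    ```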
