A Rust crate that reads and writes tfrecord files

Overview

tfrecord-rust

This crate serializes and deserializes the TFRecord data format used by TensorFlow.

Features

  • Provides both the high-level Example type and low-level Vec<u8> byte serialization and deserialization.
  • Supports async/await syntax and works with futures-rs.
  • Interoperates with serde, image, ndarray and tch.
  • TensorBoard support.
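
For readers curious about what the low-level byte path deals with, here is a minimal sketch of the TFRecord framing itself: a little-endian u64 length, a masked CRC-32C of the length, the payload, and a masked CRC-32C of the payload. It uses only the standard library and is not this crate's implementation; the function names crc32c, masked_crc, write_record and read_record are hypothetical and chosen for illustration.

```rust
use std::io::{self, Cursor, Read, Write};

/// Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
/// Slow but dependency-free; real code would use a table or hardware CRC.
fn crc32c(data: &[u8]) -> u32 {
    let mut crc: u32 = !0;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            // All ones if the lowest bit is set, otherwise zero.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0x82F6_3B78 & mask);
        }
    }
    !crc
}

/// TFRecord stores a masked CRC: rotate right by 15 bits, then add a constant.
fn masked_crc(data: &[u8]) -> u32 {
    crc32c(data).rotate_right(15).wrapping_add(0xa282_ead8)
}

/// Write one record: length, CRC of length, payload, CRC of payload.
fn write_record<W: Write>(writer: &mut W, payload: &[u8]) -> io::Result<()> {
    let len = (payload.len() as u64).to_le_bytes();
    writer.write_all(&len)?;
    writer.write_all(&masked_crc(&len).to_le_bytes())?;
    writer.write_all(payload)?;
    writer.write_all(&masked_crc(payload).to_le_bytes())?;
    Ok(())
}

/// Read one record and verify both checksums.
/// `read_exact` is used so that a short read never yields a partial header.
fn read_record<R: Read>(reader: &mut R) -> io::Result<Vec<u8>> {
    let mut len = [0u8; 8];
    reader.read_exact(&mut len)?;
    let mut crc = [0u8; 4];
    reader.read_exact(&mut crc)?;
    if u32::from_le_bytes(crc) != masked_crc(&len) {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "length checksum mismatch"));
    }

    // A sketch only: production code should bound this allocation.
    let mut payload = vec![0u8; u64::from_le_bytes(len) as usize];
    reader.read_exact(&mut payload)?;
    reader.read_exact(&mut crc)?;
    if u32::from_le_bytes(crc) != masked_crc(&payload) {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "payload checksum mismatch"));
    }
    Ok(payload)
}

fn main() -> io::Result<()> {
    // Round-trip a single record through an in-memory buffer.
    let mut buf = Vec::new();
    write_record(&mut buf, b"hello")?;
    let payload = read_record(&mut Cursor::new(buf))?;
    assert_eq!(payload, b"hello".to_vec());
    println!("round-tripped {} bytes", payload.len());
    Ok(())
}
```

The Example type sits one level above this framing: each payload is a serialized protobuf message, which the crate decodes for you.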

Usage

Use this crate in your project

Add this line to the [dependencies] section of your Cargo.toml.

tfrecord = "0.5"

Notice on TensorFlow updates

The crate compiles pre-generated ProtocolBuffer code from TensorFlow. If TensorFlow is updated or you apply custom patches, run the code generation manually. See the Generate ProtocolBuffer code from TensorFlow section for details.

Available Cargo features

Module features

  • full: Enable all features.
  • async_: Enable async/await feature.
  • dataset: Enable the dataset API that can load records from multiple TFRecord files.
  • summary: Enable the summary and event types and writers, mainly for TensorBoard.

Third-party crate support features

  • with-serde: Enable support for the serde crate.
  • with-image: Enable support for the image crate.
  • with-ndarray: Enable support for the ndarray crate.
  • with-tch: Enable support for the tch crate.
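
As a sketch of how the features listed above might be combined, the Cargo.toml entry below enables async support and serde integration. The particular selection is only an illustration; pick the features your project needs.

```toml
[dependencies]
# Feature names are taken from the lists above; the combination is an example.
tfrecord = { version = "0.5", features = ["async_", "with-serde"] }
```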

Documentation

See docs.rs for the API.

Example

File reading example

This is a snippet copied from examples/tfrecord_info.rs.

use tfrecord::{Error, ExampleReader, Feature, RecordReaderInit};

fn main() -> Result<(), Error> {
    // use init pattern to construct the tfrecord reader
    // (INPUT_TFRECORD_PATH is defined elsewhere in the example file)
    let reader: ExampleReader<_> = RecordReaderInit::default().open(&*INPUT_TFRECORD_PATH)?;

    // print header
    println!("example_no\tfeature_no\tname\ttype\tsize");

    // enumerate examples
    for (example_index, result) in reader.enumerate() {
        let example = result?;

        // enumerate features in an example
        for (feature_index, (name, feature)) in example.into_iter().enumerate() {
            print!("{}\t{}\t{}\t", example_index, feature_index, name);

            match feature {
                Feature::BytesList(list) => {
                    println!("bytes\t{}", list.len());
                }
                Feature::FloatList(list) => {
                    println!("float\t{}", list.len());
                }
                Feature::Int64List(list) => {
                    println!("int64\t{}", list.len());
                }
                Feature::None => {
                    println!("none");
                }
            }
        }
    }

    Ok(())
}

Work with async/await syntax

This snippet from examples/tfrecord_info_async.rs demonstrates the integration with async-std.

use futures::stream::TryStreamExt;
use tfrecord::{Error, Feature, RecordStreamInit};

pub async fn _main() -> Result<(), Error> {
    // use init pattern to construct the tfrecord stream
    let stream = RecordStreamInit::default()
        .examples_open(&*INPUT_TFRECORD_PATH)
        .await?;

    // print header
    println!("example_no\tfeature_no\tname\ttype\tsize");

    // enumerate examples
    stream
        .try_fold(0, |example_index, example| {
            async move {
                // enumerate features in an example
                for (feature_index, (name, feature)) in example.into_iter().enumerate() {
                    print!("{}\t{}\t{}\t", example_index, feature_index, name);

                    match feature {
                        Feature::BytesList(list) => {
                            println!("bytes\t{}", list.len());
                        }
                        Feature::FloatList(list) => {
                            println!("float\t{}", list.len());
                        }
                        Feature::Int64List(list) => {
                            println!("int64\t{}", list.len());
                        }
                        Feature::None => {
                            println!("none");
                        }
                    }
                }

                Ok(example_index + 1)
            }
        })
        .await?;

    Ok(())
}

Work with TensorBoard

This is a simplified version of examples/tensorboard.rs that writes summary data to the log_dir directory. After running the example, launch tensorboard --logdir log_dir to view the results in TensorBoard.

use super::*;
use rand::seq::SliceRandom;
use rand_distr::{Distribution, Normal};
use std::{f32::consts::PI, io, thread, time::Duration};
use tfrecord::EventWriterInit;

pub fn _main() -> Result<()> {
    // show log dir
    let prefix = "log_dir/my_prefix";

    // download image files
    println!("downloading images...");
    let images = IMAGE_URLS
        .iter()
        .cloned()
        .map(|url| {
            let mut bytes = vec![];
            io::copy(&mut ureq::get(url).call().into_reader(), &mut bytes)?;
            let image = image::load_from_memory(bytes.as_ref())?;
            Ok(image)
        })
        .collect::<Result<Vec<_>>>()?;

    // init writer
    let mut writer = EventWriterInit::from_prefix(prefix, None)?;
    let mut rng = rand::thread_rng();

    // loop
    for step in 0..30 {
        println!("step: {}", step);

        // scalar
        {
            let value: f32 = (step as f32 * PI / 8.0).sin();
            writer.write_scalar("scalar", step, value)?;
        }

        // histogram
        {
            let normal = Normal::new(-20.0, 50.0).unwrap();
            let values = normal
                .sample_iter(&mut rng)
                .take(1024)
                .collect::<Vec<f32>>();
            writer.write_histogram("histogram", step, values)?;
        }

        // image
        {
            let image = images.choose(&mut rng).unwrap();
            writer.write_image("image", step, image)?;
        }

        thread::sleep(Duration::from_millis(100));
    }

    Ok(())
}

More examples

To read values from event files used by TensorBoard, see the event reader example.

More examples can be found in the examples and tests directories.

Generate ProtocolBuffer code from TensorFlow

The crate relies on ProtocolBuffer documents from TensorFlow and ships code pre-generated from them by default. Most users do not need to run the code generation. It is needed only when TensorFlow is updated or you apply a custom patch.

The build script can obtain the TensorFlow source code in several ways, selected by the TFRECORD_BUILD_METHOD environment variable. The generated code is placed under the prebuild_src directory. The examples below show the usage.

  • Build from a source tarball
export TFRECORD_BUILD_METHOD="src_file:///home/myname/tensorflow-2.2.0.tar.gz"
cargo build --release --features serde,generate_protobuf_src  # with serde
cargo build --release --features generate_protobuf_src        # without serde
  • Build from a source directory
export TFRECORD_BUILD_METHOD="src_dir:///home/myname/tensorflow-2.2.0"
cargo build --release --features serde,generate_protobuf_src  # with serde
cargo build --release --features generate_protobuf_src        # without serde
  • Build from a URL
export TFRECORD_BUILD_METHOD="url://https://github.com/tensorflow/tensorflow/archive/v2.2.0.tar.gz"
cargo build --release --features serde,generate_protobuf_src  # with serde
cargo build --release --features generate_protobuf_src        # without serde
  • Build from a TensorFlow installation on the system. The build script searches the ${install_prefix}/include/tensorflow directory for protobuf documents.
export TFRECORD_BUILD_METHOD="install_prefix:///usr"
cargo build --release --features serde,generate_protobuf_src  # with serde
cargo build --release --features generate_protobuf_src        # without serde
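
To make the shape of the TFRECORD_BUILD_METHOD values above concrete, the following sketch splits such a value into a method and a location. It is not the crate's build script; the BuildMethod enum and parse_build_method function are hypothetical names used only for illustration, and the only facts taken from this README are the four prefixes.

```rust
/// Hypothetical representation of the four methods listed above.
#[derive(Debug, PartialEq)]
enum BuildMethod {
    SrcFile(String),
    SrcDir(String),
    Url(String),
    InstallPrefix(String),
}

/// Split a TFRECORD_BUILD_METHOD-style value into its method and location.
/// Returns None when the prefix is not one of the four shown in the README.
fn parse_build_method(value: &str) -> Option<BuildMethod> {
    if let Some(rest) = value.strip_prefix("src_file://") {
        Some(BuildMethod::SrcFile(rest.to_string()))
    } else if let Some(rest) = value.strip_prefix("src_dir://") {
        Some(BuildMethod::SrcDir(rest.to_string()))
    } else if let Some(rest) = value.strip_prefix("url://") {
        Some(BuildMethod::Url(rest.to_string()))
    } else if let Some(rest) = value.strip_prefix("install_prefix://") {
        Some(BuildMethod::InstallPrefix(rest.to_string()))
    } else {
        None
    }
}

fn main() {
    // The part after the prefix is an ordinary path or URL.
    let method = parse_build_method("src_dir:///home/myname/tensorflow-2.2.0");
    println!("{:?}", method);
}
```

Note that the three slashes in values such as src_dir:///home/myname/... are the two-slash separator followed by an absolute path.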

License

MIT license. See the LICENSE file for the full text.
