Rust library for whisper.cpp compatible Mel spectrograms

Mel Spec

A Rust implementation of mel spectrograms, aligned to the results of the whisper.cpp, PyTorch and librosa reference implementations and suited to streaming audio.

Examples:

See wavey-ai/hush for live demo

Usage

To import the library's main features:

use mel_spec::prelude::*;

Mel filterbank that has parity with librosa:

Mel filterbanks, within 1.0e-7 of librosa and identical to whisper GGML model-embedded filters.

        let file_path = "./testdata/mel_filters.npz";
        let f = File::open(file_path).unwrap();
        let mut npz = NpzReader::new(f).unwrap();
        let filters: Array2<f32> = npz.by_index(0).unwrap();
        let want: Array2<f64> = filters.mapv(|x| f64::from(x));
        let sampling_rate = 16000.0;
        let fft_size = 400;
        let n_mels = 80;
        let htk = false; // named after librosa's `htk` flag
        let norm = true;
        let got = mel(sampling_rate, fft_size, n_mels, htk, norm);
        assert_eq!(got.shape(), vec![80, 201]);
        for i in 0..80 {
            assert_nearby!(got.row(i), want.row(i), 1.0e-7);
        }
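
For context on the boolean flag passed to `mel` above: in librosa, `htk` selects between two Hz-to-mel formulas. The sketch below shows both, assuming this library follows the same convention (the source does not spell this out). The function names are illustrative and are not part of `mel_spec`.

```rust
/// HTK-style mel scale: 2595 * log10(1 + f / 700).
fn hz_to_mel_htk(hz: f64) -> f64 {
    2595.0 * (1.0 + hz / 700.0).log10()
}

/// Slaney-style mel scale (librosa's default, `htk=False`):
/// linear below 1 kHz, logarithmic above.
fn hz_to_mel_slaney(hz: f64) -> f64 {
    let f_sp = 200.0 / 3.0; // Hz per mel in the linear region
    let min_log_hz = 1000.0; // start of the logarithmic region
    let min_log_mel = min_log_hz / f_sp; // 15 mel
    let logstep = (6.4f64).ln() / 27.0; // step size in the log region
    if hz >= min_log_hz {
        min_log_mel + (hz / min_log_hz).ln() / logstep
    } else {
        hz / f_sp
    }
}

fn main() {
    for hz in [0.0, 500.0, 1000.0, 4000.0, 8000.0] {
        println!(
            "{hz:>6} Hz -> htk {:.3}, slaney {:.3}",
            hz_to_mel_htk(hz),
            hz_to_mel_slaney(hz)
        );
    }
}
```

The filterbank's triangular filters are spaced evenly on whichever of these scales is selected, which is why the flag has to match the reference implementation to reach parity.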

Spectrogram using Short Time Fourier Transform

STFT with overlap-and-save that has parity with pytorch and whisper.cpp.

The implementation is suitable for processing streaming audio and will accumulate the correct amount of data before returning FFT results.

        let fft_size = 8;
        let hop_size = 4;
        let mut spectrogram = Spectrogram::new(fft_size, hop_size);

        // Add PCM audio samples
        let frames: Vec<f32> = vec![1.0, 2.0, 3.0];
        if let Some(fft_frame) = spectrogram.add(&frames) {
            // use fft result
        }  
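
To illustrate what "accumulate the correct amount of data" can mean in practice, here is a minimal sketch of a streaming frame buffer. It is not the library's `Spectrogram` type: `FrameBuffer` and `push` are hypothetical names, and the FFT and windowing steps are left out so that only the accumulate-and-hop logic is shown.

```rust
/// Hypothetical streaming frame buffer (not the library's `Spectrogram`).
struct FrameBuffer {
    fft_size: usize,
    hop_size: usize,
    buf: Vec<f32>,
}

impl FrameBuffer {
    fn new(fft_size: usize, hop_size: usize) -> Self {
        assert!(hop_size > 0 && hop_size <= fft_size);
        Self { fft_size, hop_size, buf: Vec::new() }
    }

    /// Append samples and return every complete window that is now available.
    fn push(&mut self, samples: &[f32]) -> Vec<Vec<f32>> {
        self.buf.extend_from_slice(samples);
        let mut frames = Vec::new();
        while self.buf.len() >= self.fft_size {
            frames.push(self.buf[..self.fft_size].to_vec());
            // Drop one hop; the remaining tail overlaps with the next window.
            self.buf.drain(..self.hop_size);
        }
        frames
    }
}

fn main() {
    let mut fb = FrameBuffer::new(8, 4);
    // Three samples are not enough for an 8-sample window.
    assert!(fb.push(&[1.0, 2.0, 3.0]).is_empty());
    // Six more samples complete the first window.
    let frames = fb.push(&[4.0, 5.0, 6.0, 7.0, 8.0, 9.0]);
    println!("{} frame(s): {:?}", frames.len(), frames);
}
```

Callers can feed chunks of any size; a window is only emitted once `fft_size` samples have been buffered, and each subsequent window starts `hop_size` samples later.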

STFT Spectrogram to Mel Spectrogram

MelSpectrogram applies a pre-computed filterbank to an FFT result. Results are identical to whisper.cpp and whisper.py.

        let fft_size = 400;
        let sampling_rate = 16000.0;
        let n_mels = 80;
        let mut mel = MelSpectrogram::new(fft_size, sampling_rate, n_mels);
        // Example input data for the FFT
        let fft_input = Array1::from(vec![Complex::new(1.0, 0.0); fft_size]);
        // Add the FFT data to the MelSpectrogram
        let mel_spec = mel.add(fft_input);
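
As a rough picture of what applying a filterbank involves, the sketch below multiplies a filterbank matrix by a power spectrum and then applies the log scaling used by Whisper's reference Python code (clamp at 1e-10, log10, limit the dynamic range to 8 below the maximum, then `(x + 4) / 4`). This is an assumption about the general technique, not a copy of `MelSpectrogram`; the function names are illustrative, and the reference takes the maximum over the whole spectrogram rather than over a single frame.

```rust
/// Multiply an (n_mels x n_bins) filterbank by a power spectrum of length n_bins.
fn apply_filterbank(filters: &[Vec<f64>], power: &[f64]) -> Vec<f64> {
    filters
        .iter()
        .map(|row| row.iter().zip(power).map(|(w, p)| w * p).sum::<f64>())
        .collect()
}

/// Log-scale and normalise mel energies in the style of Whisper's reference code.
/// The maximum is taken over whatever slice is passed in.
fn log_normalise(mel: &[f64]) -> Vec<f64> {
    let log: Vec<f64> = mel.iter().map(|&x| x.max(1e-10).log10()).collect();
    let max = log.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    log.iter().map(|&x| (x.max(max - 8.0) + 4.0) / 4.0).collect()
}

fn main() {
    // Two toy filters over three frequency bins.
    let filters = vec![vec![1.0, 0.0, 0.0], vec![0.5, 0.5, 0.0]];
    let power = vec![1.0, 100.0, 3.0];
    let mel = apply_filterbank(&filters, &power);
    println!("mel energies: {:?}", mel);
    println!("normalised:   {:?}", log_normalise(&mel));
}
```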

Creating Mel Spectrograms from Audio

The library includes basic audio helpers and a pipeline for processing PCM audio and creating Mel spectrograms that can be sent to whisper.cpp.

It also has voice activity detection that uses edge detection (which might be a novel approach) to identify word/speech boundaries in real time.

        // load the whisper jfk sample
        let file_path = "../testdata/jfk_f32le.wav";
        let file = File::open(&file_path).unwrap();
        let data = parse_wav(file).unwrap();
        let samples = deinterleave_vecs_f32(&data.data, 1);

        let fft_size = 400;
        let hop_size = 160;
        let n_mels = 80;
        let sampling_rate = 16000.0;

        let mel_settings = MelConfig::new(fft_size, hop_size, n_mels, sampling_rate);
        let vad_settings = DetectionSettings::new(1.0, 10, 5, 0, 100);

        let config = PipelineConfig::new(mel_settings, Some(vad_settings));

        let mut pl = Pipeline::new(config);

        let handles = pl.start();

        // chunk size can be anything, 88 is random
        for chunk in samples[0].chunks(88) {
            let _ = pl.send_pcm(chunk);
        }

        pl.close_ingress();

        while let Ok((_, mel_spectrogram)) = pl.rx().recv() {
            // do something with spectrogram
        }

Saving Mel Spectrograms to file

Mel spectrograms can be saved in TGA format, an uncompressed image format supported by macOS and Windows.

As these images directly encode quantized mel spectrogram data, they represent a "photographic negative" of audio data that whisper.cpp can develop and print without the need for direct audio input.

TGA files are used in lieu of actual audio for most of the library tests. These files are lossless in speech-to-text terms: they encode all the information that is available in the model's view of raw audio and will produce identical results.

Note that spectrograms must have an even number of columns in the time domain, otherwise Whisper will hallucinate. The library takes care of this if you use the core methods.

        let file_path = "../testdata/jfk_full_speech_chunk0_golden.tga";
        let dequantized_mel = load_tga_8bit(file_path).unwrap();
        // dequantized_mel can be sent straight to whisper.cpp

❯ ffmpeg -hide_banner -loglevel error -i ~/Downloads/JFKWHA-001-AU_WR.mp3 -f f32le -ar 16000 -acodec pcm_f32le -ac 1 pipe:1  | ./target/debug/tga_whisper -t ../../doc/cutsec_46997.tga
...
whisper_init_state: Core ML model loaded
Got 1
 the quest for peace.

image the quest for peace.

Voice Activity Detection

I had the idea of using the Sobel operator for this as speech in Mel spectrograms is characterised by clear gradients.

The general idea is to outline structure in the spectrogram and then find vertical gaps that are suitable for cutting - to allow passing new spectrograms to the model in near real-time.

It's particularly good at separating speech activity. This is important, because anything resembling white noise is hallucinogenic to Whisper. The Voice Activity Detector module therefore drops frames that look to be gaps in speech.

This is still not perfect and definitely a downside of stream processing, at least with Whisper. However, pre-processing audio as spectrograms should be more robust than pre-processing raw audio: with raw audio it's necessary to look for attack transients to find boundaries, but it's not easy to tell if energy changes are voice or something else. Mel spectrograms already provide a distinctive "voice" signature.
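
For readers unfamiliar with the Sobel operator, here is a minimal sketch of a Sobel gradient-magnitude pass over a 2-D array laid out as rows (mel bins) by columns (time frames). It shows the standard 3x3 kernels only; how the library thresholds and post-processes the result is not shown in this README, and `sobel_magnitude` is an illustrative name rather than a `mel_spec` function.

```rust
/// Sobel gradient magnitude of an image stored as rows (frequency) x columns (time).
/// Border pixels are left at zero.
fn sobel_magnitude(img: &[Vec<f32>]) -> Vec<Vec<f32>> {
    let h = img.len();
    let w = if h > 0 { img[0].len() } else { 0 };
    let mut out = vec![vec![0.0f32; w]; h];
    if h < 3 || w < 3 {
        return out;
    }
    for y in 1..h - 1 {
        for x in 1..w - 1 {
            // Horizontal derivative: responds to changes along the time axis.
            let gx = (img[y - 1][x + 1] + 2.0 * img[y][x + 1] + img[y + 1][x + 1])
                - (img[y - 1][x - 1] + 2.0 * img[y][x - 1] + img[y + 1][x - 1]);
            // Vertical derivative: responds to changes along the frequency axis.
            let gy = (img[y + 1][x - 1] + 2.0 * img[y + 1][x] + img[y + 1][x + 1])
                - (img[y - 1][x - 1] + 2.0 * img[y - 1][x] + img[y - 1][x + 1]);
            out[y][x] = (gx * gx + gy * gy).sqrt();
        }
    }
    out
}

fn main() {
    // A 4x4 image with a step from 0 to 1 between columns 1 and 2.
    let img = vec![vec![0.0, 0.0, 1.0, 1.0]; 4];
    for row in sobel_magnitude(&img) {
        println!("{:?}", row);
    }
}
```

A flat region produces zero everywhere, while an onset or offset in time shows up as a strong response in the columns on either side of the step.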

The graphic below shows part of JFK's speech and uses Sobel edge detection to find possible word/speech boundaries. As you can see, it works pretty well:

image

For reference, the settings used for this example are:

        let settings = DetectionSettings {
            min_energy: 1.0,
            min_y: 3,
            min_x: 5,
            min_mel: 0,
            min_frames: 100,
        };

Voice boundaries for the entire inaugural address can be found in: testdata/jfk_full_speech_chunk0_golden.tga.

It does a good job of distinguishing windows that contain no speech from windows that contain very short expressions. Green means no speech was detected, so it is safe to cut there without splitting a word in half.

A segment in the JFK speech that's noisy and somewhat structured, but not speech (I picked these by finding the wildest hallucinations in the transcript):

energy but no speech: image vad result: image

Word detection will discard this entire frame as the intersections are only a pixel or two wide - it needs at least 5 pixels of contiguous intersection in the time domain (and 3 in the frequency domain - see DetectionSettings above) to count the window as including speech.
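
One possible reading of this rule, sketched under stated assumptions: given a binary edge mask (rows are mel bins, columns are frames), a column qualifies if it contains a vertical run of at least `min_y` set pixels, and the window counts as speech if at least `min_x` consecutive columns qualify. The function `contains_speech` and the mask representation are hypothetical; the library's actual intersection test may differ.

```rust
/// Hypothetical check: does a binary edge mask (rows = mel bins, columns = frames)
/// contain structure at least `min_y` bins tall across `min_x` consecutive frames?
fn contains_speech(mask: &[Vec<bool>], min_x: usize, min_y: usize) -> bool {
    let n_cols = mask.first().map_or(0, |r| r.len());
    let mut run_x = 0; // consecutive qualifying columns seen so far
    for x in 0..n_cols {
        // Longest vertical run of set pixels in this column.
        let (mut best, mut cur) = (0usize, 0usize);
        for row in mask {
            cur = if row[x] { cur + 1 } else { 0 };
            best = best.max(cur);
        }
        run_x = if best >= min_y { run_x + 1 } else { 0 };
        if run_x >= min_x {
            return true;
        }
    }
    false
}

fn main() {
    // 4 bins x 8 frames; rows 0..3 are set for frames 2..7 (five frames wide).
    let mut mask = vec![vec![false; 8]; 4];
    for row in mask.iter_mut().take(3) {
        for x in 2..7 {
            row[x] = true;
        }
    }
    println!("speech: {}", contains_speech(&mask, 5, 3));
}
```

With `min_x: 5` and `min_y: 3`, a blob only a pixel or two wide is rejected, which matches the behaviour described for the noisy segment above.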

A fleeting word: image vad result: image

This passes as speech.

More work needs to be done here, but it is a good start. Hallucinations remain a problem, but they always occur when the model is passed mel spectrograms that don't contain actual speech. TODO: I think there are also probability metrics for tokens returned by the model that might help.

The current state of play: the full JFK speech, with the above voice activity and word boundary settings, processed as a stream and sent to Whisper approximately every second, can be found here:

jfk_transcript_golden.txt

It will be possible to tidy up hallucinations by checking the spectrograms and refining the boundary detection (each segment/line has a corresponding spectrogram saved - see examples).

Discussion

  • Mel spectrograms encode at 6.4 KB/sec (80 * 2 bytes * 40 frames)
  • Float PCM required by the whisper audio APIs is 64 KB/sec at 16 kHz
      - expensive to reprocess
      - resource intensive to keep PCM in-band for overlapping
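
The two figures in the list above can be reproduced with a few lines of arithmetic. The helper name `bytes_per_sec` is illustrative; the PCM figure assumes mono 32-bit float samples, which is what yields 64 KB/sec at 16 kHz.

```rust
/// Bytes per second for a stream of fixed-size values.
fn bytes_per_sec(values_per_frame: u32, bytes_per_value: u32, frames_per_sec: u32) -> u32 {
    values_per_frame * bytes_per_value * frames_per_sec
}

fn main() {
    // Mel: 80 bins * 2 bytes * 40 frames per second (figures from the list above).
    let mel = bytes_per_sec(80, 2, 40);
    // PCM: 1 channel * 4-byte f32 samples * 16 000 samples per second (assumed mono).
    let pcm = bytes_per_sec(1, 4, 16_000);
    println!("mel: {mel} B/s, pcm: {pcm} B/s, ratio: {}x", pcm / mel);
}
```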

whisper.cpp produces mel spectrograms with 1.0e-6 precision. However, these spectrograms are invariant to 8-bit quantisation: we can save them as 8-bit images and not lose useful information - not lose any actual information about the sound wave at all.
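
To make "8-bit quantisation" concrete, here is a minimal sketch of a linear min/max quantiser and its inverse. This is a generic scheme chosen for illustration; the encoding the library uses for its TGA files may differ, and `quantise` and `dequantise` are hypothetical names.

```rust
/// Map values linearly onto 0..=255, returning the bytes plus the range used.
fn quantise(values: &[f32]) -> (Vec<u8>, f32, f32) {
    let min = values.iter().cloned().fold(f32::INFINITY, f32::min);
    let max = values.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let range = (max - min).max(f32::EPSILON); // avoid division by zero
    let q: Vec<u8> = values
        .iter()
        .map(|&v| (((v - min) / range) * 255.0).round() as u8)
        .collect();
    (q, min, max)
}

/// Invert `quantise` using the stored range.
fn dequantise(q: &[u8], min: f32, max: f32) -> Vec<f32> {
    q.iter()
        .map(|&b| min + (b as f32 / 255.0) * (max - min))
        .collect()
}

fn main() {
    let values = [-1.0f32, 0.0, 0.5, 1.0];
    let (q, min, max) = quantise(&values);
    let back = dequantise(&q, min, max);
    println!("bytes: {:?}", q);
    println!("round trip: {:?}", back);
}
```

With this scheme the round-trip error is bounded by half a quantisation step, that is `(max - min) / 510`.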

Heisenberg's Uncertainty Principle puts a limit on how much resolution a spectrogram can have - the more we zoom in on a wave, the more blurry it becomes.

Time stretching by overlapping (whisper uses a 60% overlap) mitigates this, to a point. But after that more precision doesn't mean more accuracy, and may actually cause noise:

Indeed, we only need 1.0e-1 precision to get accurate results, and rounding to 1.0e-1 seems more accurate for some difficult transcriptions.

Consider these samples from the jfk speech used in the original whisper.py tests:

[src/lib.rs:93] &mel_spectrogram[10..20] = [
    0.15811597,
    0.26561865,
    0.07558561,
    0.19564378,
    0.16745868,
    0.21617787,
    -0.29193184,
    0.12279237,
    0.13897367,
    -0.17434756,
]
[src/lib.rs:92] &mel_spectrogram_rounded[10..20] = [
    0.2,
    0.3,
    0.1,
    0.2,
    0.2,
    0.2,
    -0.3,
    0.1,
    0.1,
    -0.2,
]
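
The rounded values above can be reproduced by rounding each sample to one decimal place. The sketch below assumes plain round-half-away-from-zero rounding, which is what `f32::round` does; `round_to_tenths` is an illustrative name.

```rust
/// Round each value to a precision of 1.0e-1.
fn round_to_tenths(values: &[f32]) -> Vec<f32> {
    values.iter().map(|v| (v * 10.0).round() / 10.0).collect()
}

fn main() {
    // The first few samples from the debug output above.
    let samples = [0.15811597f32, 0.26561865, 0.07558561, -0.29193184];
    println!("{:?}", round_to_tenths(&samples));
}
```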

Once quantised, the spectrograms are the same:

image image (top: not rounded, bottom: rounded to 1.0e-1)

A lot has to do with how speech can be encapsulated almost entirely in the frequency domain, and how effectively the mel scale divides those frequencies into 80 bins. 8 bits of 0-255 grayscale is probably overkill even to measure the total power in each of those bins, so it could be compressed even further.
