High-performance QEMU memory and instruction tracing

Cannoli

Cannoli is a high-performance tracing engine for qemu-user. It can record a trace of both the PCs executed and the memory operations performed. It consists of three parts: a small patch to QEMU that exposes locations where code can be injected directly into the JIT, a shared library loaded into QEMU that decides what and how to instrument, and a final library that consumes the stream produced by QEMU in another process, where analysis of the trace can be done.

Cannoli is designed to record this information with minimal interference with QEMU's execution. In practice, this means that QEMU needs to produce a stream of events and hand them off (very quickly) to another process that handles the more complex analysis. Doing the analysis during execution of the QEMU JIT itself would dramatically slow down execution.

Cannoli can handle billions of target instructions per second, can handle multi-threaded qemu-user applications, and allows multiple threads to consume the data from a single QEMU thread to parallelize processing of traces.

Is it fast?

Graph showing 2.2 billion instructions/sec

Performance with a single QEMU thread running the benchmark example on an Intel Xeon Silver 4310 @ 2.1 GHz. The target is mipsel-linux, running a hot loop of unrolled nops to benchmark PC tracing bandwidth (the worst case for us).

Example symbolizer

For an example, check out the symbolizer! Here's the kind of information you can get!

Example symbolizer showing memory accesses and PC executions

TL;DR Getting it running

Build Cannoli

git clone https://github.com/MarginResearch/cannoli
cd cannoli
cargo build --release

Check out QEMU

git clone https://gitlab.com/qemu-project/qemu.git
cd qemu

Switch to the QEMU commit we're currently working against

git checkout 00b1faea41d283e931256aa78aa975a369ec3ae6

Apply the patch from qemu_patches.patch

git am --3way </path/to/cannoli>/qemu_patches.patch

Build QEMU for your desired targets (example mipsel and riscv64)

./configure --target-list=mipsel-linux-user,riscv64-linux-user --extra-ldflags="-ldl" --with-cannoli=</path/to/cannoli>
make -j48

Try out the example symbolizer

In one terminal, start the symbolizer

cd examples/symbolizer
cargo run --release

In another terminal, run the program in QEMU with Cannoli!

cd examples/symbolizer
</path/to/qemu>/build/qemu-mipsel -cannoli </path/to/cannoli>/target/release/<your jitter so>.so ./example_app

Coverage Example

Cannoli can be used to get coverage of binary applications fairly cheaply. There's an example provided that uses terrible symbol resolution, but it gives you a rough idea of what you can do.

Build and run the client:

cd cannoli/examples/coverage
cargo run --release

Invoke QEMU on a binary you want coverage of, using the coverage hooks

QEMU_CANNOLI=cannoli/target/release/libcoverage.so qemu/build/qemu-x86_64 /usr/bin/vlc

This should work even for large, many-threaded applications! The coverage is self-silencing, meaning it disables reporting of coverage (in the JIT) by patching itself out once it executes for the first time. You might still get later events for the same callbacks due to re-JITting of the same code, but it's just meant to be a major filter to cut down on the traffic that you would otherwise get with full tracing.
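Because re-JITting can deliver the same coverage event more than once, a consumer will probably want to deduplicate on its side. The following is a minimal sketch of that idea only; the `Coverage` type and its methods are hypothetical names for illustration and are not part of Cannoli's API or the provided coverage example.

```rust
use std::collections::HashSet;

/// Hypothetical consumer-side coverage set (an assumption for illustration,
/// not a type provided by Cannoli).
#[derive(Default)]
pub struct Coverage {
    /// Every PC that has been reported at least once.
    seen: HashSet<u64>,
}

impl Coverage {
    /// Records `pc` and returns `true` only the first time this PC is seen,
    /// so repeated events caused by re-JITting are ignored.
    pub fn record(&mut self, pc: u64) -> bool {
        self.seen.insert(pc)
    }

    /// Number of distinct PCs observed so far.
    pub fn unique_pcs(&self) -> usize {
        self.seen.len()
    }
}
```

A caller would invoke `record` for every incoming coverage event and only act (log, symbolize, and so on) when it returns `true`.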

What to do

  1. Create an application using the cannoli library to process traces by implementing the Cannoli trait (see one of the examples)
  2. Create a library using the jitter library to filter JIT hooks by implementing hook_inst and hook_mem. This must be a cdylib that produces the .so that you pass into QEMU with -cannoli. For a basic example that hooks everything, see jitter_always
  3. Run your trace-parsing application
  4. Launch QEMU with the -cannoli argument, and a path to the compiled <jitter>.so that you built!

User-experience

To start off, we should cover what you should expect as an end-user.

QEMU patches

As a user you will have to apply a small patch set to QEMU, consisting of about 200 lines of additions. These are all gated with #ifdef CANNOLI, such that if CANNOLI is not defined, QEMU will build identically to having none of the patches in the first place.

The patches aren't too relevant to the user, other than understanding that they add a -cannoli flag to QEMU which expects a path to a shared library. This shared library is loaded into QEMU and is invoked at various points of the JIT.

To apply the patches, simply run something like:

git am qemu_patches.patch

Jitter

The shared library which is loaded into QEMU is called the Cannoli Jitter.

Using this library expects two basic callbacks to be implemented, such that QEMU knows when to hook, and how to hook, certain operations. This is the filter mechanism that prevents JIT code from being produced in the first place if you do not want to hook literally everything.

use jitter::HookType;

/// Called before an instruction is lifted in QEMU.
///
/// The `HookType` dictates the type of hook used for the instruction, and may
/// be `Never`, `Always`, or `Once`
///
/// This may be called from multiple threads
#[no_mangle]
fn hook_inst(_pc: u64, _branch: bool) -> HookType {
    HookType::Always
}

/// Called when a memory access is being lifted in QEMU. Returning `true` will
/// cause the memory access to generate events in the trace buffer.
///
/// This may be called from multiple threads
#[no_mangle]
fn hook_mem(_pc: u64, _write: bool, _size: usize) -> bool {
    true
}

These hooks provide an opportunity for a user to decide whether or not a given instruction or memory access should be hooked. Returning HookType::Always from hook_inst, or true from hook_mem (as in the example above), results in instrumenting the instruction or memory access. Returning HookType::Never or false means that no instrumentation is added to the JIT, and thus QEMU runs with full-speed emulation.

This API is invoked when QEMU lifts target instructions. Lifting, in this case, is the core operation of an emulator, where it disassembles a target instruction and either transforms it into an IL or JITs it to another architecture for execution. Since QEMU caches instructions it has already lifted, these functions are called "rarely" (with respect to how often the instructions themselves execute), and thus this is the location where you should put your smart logic to filter what you hook.
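To make the "called rarely" point concrete, here is a toy model of a translation cache. It is only a sketch under stated assumptions: `ToyJit`, its fields, and `execute` are hypothetical and do not reflect QEMU's real data structures. It merely shows that the filter decision is made once per PC, while the instrumented code may run many times.

```rust
use std::collections::HashMap;

/// Toy model of a translation cache (hypothetical, not QEMU's internals).
pub struct ToyJit {
    /// Maps a PC to the hook decision made when it was first "lifted".
    cache: HashMap<u64, bool>,
    /// How many times the filter callback was consulted.
    pub filter_calls: u64,
    /// How many instrumented executions produced an event.
    pub events: u64,
}

impl ToyJit {
    pub fn new() -> Self {
        ToyJit { cache: HashMap::new(), filter_calls: 0, events: 0 }
    }

    /// "Executes" the instruction at `pc`. The `filter` closure is consulted
    /// only on a cache miss, mimicking a lift-time hook decision.
    pub fn execute<F: Fn(u64) -> bool>(&mut self, pc: u64, filter: F) {
        let cached = self.cache.get(&pc).copied();
        let hooked = match cached {
            Some(decision) => decision,
            None => {
                self.filter_calls += 1;
                let decision = filter(pc);
                self.cache.insert(pc, decision);
                decision
            }
        };
        if hooked {
            self.events += 1;
        }
    }
}
```

Running two PCs a thousand times each through this model consults the filter only twice, which is why expensive filtering logic is affordable at lift time.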

If you hook only a select few instructions, the performance overhead of this tool is effectively zero. Cannoli is designed to provide very low overhead for full tracing; however, if you don't need full tracing, you should filter at this stage. This prevents the JIT from being instrumented in the first place, and provides a filtering mechanism for an end-user.
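As an illustration of filtering at this stage, the sketch below restricts hooks to one address range and to memory writes. It is written as plain functions so it compiles on its own: the local `HookType` enum is a stand-in that only mirrors the variant names mentioned above, and `TARGET_START`, `TARGET_END`, `filter_inst`, and `filter_mem` are hypothetical names. In a real jitter, logic like this would presumably live in the bodies of hook_inst and hook_mem.

```rust
/// Local stand-in mirroring the variant names of `jitter::HookType`, so this
/// sketch builds without the jitter crate (an assumption for illustration).
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum HookType {
    Never,
    Always,
    Once,
}

/// Hypothetical address range of interest, e.g. one function or module.
const TARGET_START: u64 = 0x0040_0000;
const TARGET_END: u64 = 0x0041_0000;

/// Returns `true` if `pc` falls inside the half-open range of interest.
fn in_target(pc: u64) -> bool {
    (TARGET_START..TARGET_END).contains(&pc)
}

/// Instruction filter: instrument only PCs inside the target range.
pub fn filter_inst(pc: u64, _branch: bool) -> HookType {
    if in_target(pc) {
        HookType::Always
    } else {
        HookType::Never
    }
}

/// Memory filter: instrument only writes issued from inside the target range.
pub fn filter_mem(pc: u64, write: bool, _size: usize) -> bool {
    write && in_target(pc)
}
```

Everything outside the range is never instrumented, so that code should run at the emulator's normal speed.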

Cannoli "client"

Cannoli then has a client component. The client's goal is to process the massive stream of data being produced by QEMU. Further, the API for Cannoli has been designed with threading in mind, such that a single thread can be running inside qemu-user, and complex analysis of that stream can be done by threading the analysis while getting maximum single-core performance in QEMU itself.

Cannoli exposes a standard Rust trait-style interface, where you implement Cannoli on your structure.

As an implementer of this trait, you must implement init. This is where you create a structure for both a single-threaded mutable context (Self), as well as a multi-threaded shared immutable context (Self::Context).

You can then optionally implement the callbacks for the Cannoli trait.

These callbacks are relatively self-explanatory, with the exception of the threading aspects. The three main execution callbacks exec, read, and write can be called from multiple threads in parallel. Thus, these are not called sequentially. This is where stateless processing should be done. These also only have immutable access to the Self::Context, as they run in parallel. This is the correct location to do any processing which does not need to know the ordering/sequence of instructions or memory accesses. For example, applying symbols where you convert from a pc into a symbol + address should be done here, such that you can symbolize the trace in parallel.

All of the main callbacks (e.g. exec) provide access to a trace buffer. Pushing values of type Self::Trace to this buffer allows you to sequence data: the pushed events can later be viewed in execution order when the trace is processed in the trace() callback.

This trace is then exposed back to the user fully in-order via the trace callback. The trace callback may be called from various threads (e.g. you might run in a different TID); however, it is guaranteed to always be called sequentially and in-order with respect to execution. Because of this, you get mutable access to self, as well as a reference to the shared Self::Context.
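The parallel-then-sequential idea can be modelled with nothing but the standard library. The sketch below is not the Cannoli trait and makes no claim about how Cannoli schedules work internally; `Symbolized`, `symbolize`, and `process` are hypothetical names. It only shows the shape of the design: a stateless stage that runs on many threads, followed by an ordered stage that sees results in execution order.

```rust
use std::thread;

/// Toy stand-in for the per-event value the stateless stage produces
/// (hypothetical, loosely playing the role of a trace entry).
#[derive(Debug, PartialEq)]
pub struct Symbolized {
    pub pc: u64,
    pub label: String,
}

/// Stateless stage: needs no ordering information, so it is safe to run
/// on many threads at once. The "symbol" here is a made-up placeholder.
fn symbolize(pc: u64) -> Symbolized {
    Symbolized { pc, label: format!("func+{:#x}", pc & 0xfff) }
}

/// Processes each chunk on its own thread, then collects the results
/// strictly in submission order.
pub fn process(chunks: Vec<Vec<u64>>) -> Vec<Symbolized> {
    // Parallel stage: one worker per chunk of PCs.
    let workers: Vec<_> = chunks
        .into_iter()
        .map(|chunk| {
            thread::spawn(move || chunk.into_iter().map(symbolize).collect::<Vec<_>>())
        })
        .collect();

    // Sequential stage: joining in submission order preserves execution
    // order, even if later chunks finished first.
    let mut ordered = Vec::new();
    for worker in workers {
        ordered.extend(worker.join().expect("worker panicked"));
    }
    ordered
}
```

The expensive per-event work happens concurrently, while anything that depends on ordering only ever touches the final, in-order sequence.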

I know this is a weird API, but it effectively allows parallelism of processing the trace until you absolutely need it to be sequential. I hope it's not too confusing for end users, but processing 2 billion instructions/second of data kind of requires threading on the consumer side, otherwise you bottleneck QEMU!

Comments
  • Re-implement libcannoli as a TCG plugin

    As the only two things you want to plugin into are instruction execution and memory accesses you could use the existing TCG plugin infrastructure for your backend and remove the dependency on patching QEMU itself. See https://qemu.readthedocs.io/en/latest/devel/tcg-plugins.html for some examples and the API.

    opened by stsquad 10
  • cargo build error

    cargo --version
    cargo 1.61.0 (a028ae4 2022-04-29)

    Linux CentOS7 X86-64

    ==============================

       Compiling xcursor v0.3.4
       Compiling cexpr v0.6.0
       Compiling gpu-descriptor v0.2.2
       Compiling env_logger v0.9.0
       Compiling mempipe v0.1.0 (/data_sdd/qemu_highPerf/cannoli/mempipe)
    error[E0432]: unresolved import alloc::ffi
      --> mempipe/src/lib.rs:55:12
       |
    55 | use alloc::ffi::{CString, NulError};
       |            ^^^ could not find ffi in alloc

    error[E0554]: #![feature] may not be used on the stable release channel
      --> mempipe/src/lib.rs:29:1
       |
    29 | #![feature(maybe_uninit_uninit_array, array_from_fn)]
       | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

    error[E0554]: #![feature] may not be used on the stable release channel
      --> mempipe/src/lib.rs:30:1
       |
    30 | #![feature(inline_const, alloc_c_string)]
       | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

    error[E0554]: #![feature] may not be used on the stable release channel
      --> mempipe/src/lib.rs:29:12
       |
    29 | #![feature(maybe_uninit_uninit_array, array_from_fn)]
       |            ^^^^^^^^^^^^^^^^^^^^^^^^^

    error[E0554]: #![feature] may not be used on the stable release channel
      --> mempipe/src/lib.rs:29:39
       |
    29 | #![feature(maybe_uninit_uninit_array, array_from_fn)]
       |                                       ^^^^^^^^^^^^^

    opened by KnoooW 2
  • quemu_patches.patch fails

    git am --3way ../cannoli/qemu_patches.patch 
    Applying: Synced with 742848ad987b27fdbeab11323271ca7d196152fb
    Applying: Style cleanup, more comments
    Applying: Added PC support to memops
    Applying: Updated path
    Applying: Added --with-cannoli build flag
    error: sha1 information is lacking or useless (include/tcg/tcg.h).
    error: could not build fake ancestor
    Patch failed at 0005 Added --with-cannoli build flag
    
    opened by Semnodime 1
  • Fix patch to handle updated QEMU as of 2dc7bf63cf77d23b287c8d78628d62…

    Qemu commit 2dc7bf63cf77d23b287c8d78628d62046fba1bf4 increased the size of the features field in the m68k cpu struct, which broke patch application of qemu_patches.patch to latest QEMU. This fixes it up so qemu patches can be applied to latest git qemu as written in the README as of QEMU commit dbc4f48b5a.

    opened by novafacing 1
  • Cannoli sometimes misses instruction events (maybe only on self-modifying code)

    Here's a reproducible test case of Cannoli missing an instruction execution event, with a readme: https://github.com/trishume/cannoli/tree/missing-trace-repro/examples/secrecy

    opened by trishume 1
  • [Discussion/Feature Request] Viability of Cannoli on qemu-system

    A little bit of background here: I am currently trying to port over Cannoli to XEMU, which is running a custom infrastructure for QEMU 6 in order to emulate the original Xbox. During the efforts I made porting the patches over, I noticed a comment that specifically mentioned that while Cannoli could run on qemu-system, it shouldn't as it would be pointless, due to loss of granularity.

    Could you elaborate on what exactly is lost in the switch from qemu-user to qemu-system? And what would it take to actually get feature parity for Cannoli on qemu-system, if at all possible?

    opened by mariaWitch 2
  • Add docker build.

    This Dockerfile builds qemu after applying the patch, then builds cannoli in a separate stage.

    This container will make a CI release possible in a future patch.

    opened by Tim-Nosco 1
  • What if I want to get a count of how many times a particular instruction is called?

    I'd like to count the number of times a particular multiply-accumulate instruction is called (specifically vfmadd231ps) in an executable. Is this possible? If so, how would one go about it?

    opened by philtomson 3
  • additional hooks

    Cool project! With a bit of tooling on top, I'll probably be able to replace many of my use cases for usercorn with a tool that works on more complex targets.

    There are a few hooks I've found valuable to get a complete picture with this kind of tracing:

    • syscall (# + arg registers) - you can just emit a trace event in do_syscall()
    • mmap / munmap / mprotect (if a file is mmaped, I'd like enough information to best-effort mirror the mapping into a tracing tool. filename+offset may be sufficient for most cases? I'd also likely want to know about the initial mappings of the interpreter and executable.)
    • simple register change (e.g. r0, eax, etc)
    • special register change (e.g. MSR, SIMD)

    Register change tracking is the reason I've wanted something more like cannoli for a long time - it would be so much faster to copy individual register writes to a buffer within the JIT, than what I was doing before (diff the register file repeatedly from a C helper)

    opened by lunixbochs 1
Owner
Margin Research