An Intel PT trace converter from `perf.data` to Fuchsia trace format.

Overview

Introduction

Recent Intel processors support "Intel Processor Trace" (Intel PT), which can be used to capture the full execution trace of a program. It is an amazing tool for debugging, optimization and learning how (natively compiled) programs work.

Linux supports Intel PT in perf. This repository contains an exporter of Intel PT traces from perf to Fuchsia trace format for convenient viewing in Perfetto.

Example output (viewed in Perfetto)

Screenshot of an example output viewed

How to use

Basic usage

First collect the trace with perf record. For example, to trace a running process my_prog, you can do:

TRACE_DURATION_SECS=1
PID=$(pgrep --newest --exact my_prog)
sudo perf record -o perf.data -p $PID -e intel_pt/cyc=1/ -- sleep $TRACE_DURATION_SECS

(This records both the kernel space and the user space. perf record can be configured to only record one of those, but this tool doesn't support that right now. It shouldn't be hard to add support for that, but I didn't get around to it.)

The above command will generate a perf.data file. Note that tracing all branches with Intel PT like that can output several hundred MiB per second per core. Be careful with traces longer than a second so that perf.data doesn't clog up your hard drive.
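To get a feel for how quickly the file grows, here is a back-of-the-envelope sketch. The function name estimate_trace_mib and the 300 MiB/s rate are illustrative assumptions, not measurements; the real rate depends heavily on the workload.

```shell
# Rough estimate of perf.data size in MiB.
# Arguments: assumed MiB per second per core, number of cores, duration in seconds.
# All three are values you supply; nothing here is measured.
estimate_trace_mib() {
    echo $(( $1 * $2 * $3 ))
}

# Hypothetical example: 300 MiB/s per core, 4 busy cores, 2 seconds of tracing.
estimate_trace_mib 300 4 2    # prints 2400
```

Comparing such an estimate with the free space reported by df before recording is a cheap sanity check.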

perf record offers more options than just "trace everything". For example, you can select "trace start" and "trace stop" addresses, use snapshot mode to save trace snapshots only on signals and on exit, and more. Refer to man perf record for details.

To decode the trace to .ftf, build the dlfilter library with

cargo build --release
DLFILTER_PATH=$(realpath target/release/libperf2perfetto.so)

And run perf script with the library passed as a --dlfilter. For example with:

#!/bin/bash
# decode_pt.sh

RELSTART=$1
DURATION=$2
DLFILTER_PATH="$3"
OUT_FILENAME="$4"
MODE="$5"

ABSSTART=$(perf script -f --itrace=i0ns -Ftime -i perf.data | head -n1 | tr -d ':[:space:]')
START=$(echo "$ABSSTART + $RELSTART" | bc)
END=$(echo "$START + $DURATION" | bc)

perf script -f --itrace=bei0ns -i perf.data --dlfilter "$DLFILTER_PATH" --time "$START,$END" --dlarg "$OUT_FILENAME" --dlarg "$MODE"

Note the bei0ns. This causes perf to emit "branches" (b), "errors" (e) and "instruction" (i, period 0ns) events. "branches" have to be emitted for this tool to work. "errors" are optional. They will be printed to stderr for your information only and don't affect the tool. "instructions" are optional. If they are emitted, they are used to calculate more exact instruction counts, though it slows decoding down quite significantly. I recommend emitting them.

Usage:

./decode_pt.sh 0.01 0.03 "$DLFILTER_PATH" out_cyc_0.01-0.04.ftf c

This will decode the 10ms-40ms span of the trace (relative to the beginning) from perf.data to out_cyc_0.01-0.04.ftf. The c parameter chooses CPU cycles as the time axis in the output. Other options are t (timestamp) and i (instructions).
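Since decoding can take a while, it may be worth checking the mode letter before starting. The helper below is only a sketch: the name time_axis_for_mode is hypothetical, and how the dlfilter itself reacts to an unrecognized mode is not covered here.

```shell
# Map the mode letter (the second --dlarg) to the time axis it selects.
# Returns non-zero for anything other than c, t or i.
time_axis_for_mode() {
    case "$1" in
        c) echo "CPU cycles" ;;
        t) echo "timestamp" ;;
        i) echo "instructions" ;;
        *) echo "unknown mode: $1 (expected c, t or i)" >&2; return 1 ;;
    esac
}

time_axis_for_mode c    # prints: CPU cycles
```

In decode_pt.sh, a line such as `time_axis_for_mode "$MODE" >/dev/null || exit 1` placed before the perf script calls would stop early on a typo.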

Instruction and cycle counters

The cycle counts emitted in the output are not exact. The cycle information comes from Intel PT packets, so it has the same granularity as the packets. In other words: during decoding, the cycle count is only updated on indirect jumps and some conditional jumps (and periodically every few thousand cycles). The cycle count for a function call is calculated by subtracting the cycle count seen at call from the count seen at ret. If there was no update between them, the count will be 0. If there were no updates for a long time before call, the count will include many cycles that passed before call. So the count can be wrong in both directions.
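The two failure modes can be illustrated with plain arithmetic. This is a toy sketch with made-up counter values, not the tool's actual code; attributed_cycles is a hypothetical name.

```shell
# Cycles attributed to a call = counter value seen at ret - counter value seen at call.
# The arguments are the (possibly stale) counter values known to the decoder
# at call and at ret, in that order.
attributed_cycles() {
    echo $(( $2 - $1 ))
}

# No counter update between call and ret: the call appears to take 0 cycles.
attributed_cycles 1000 1000    # prints 0

# The counter was last updated at 1000, long before the call. The next update
# arrives at ret and reads 5200, so the call is charged 4200 cycles, including
# the cycles that elapsed before it even started.
attributed_cycles 1000 5200    # prints 4200
```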

You can improve that with noretcomp (see below), which will force an update on ret. (But trying to achieve granularity finer than a few hundred cycles is unlikely to get you anywhere due to how out-of-order CPUs work.)

The instruction counts are exact when using i0ns (see above). Without it, instruction count is only updated when the cycle count is updated.

noretcomp

By default perf record enables "return compression", which disables the generation of Intel PT packets on ret instructions. Even though the target of ret can't be deduced offline in general, well-behaved applications don't modify return addresses, and the return target can be deduced from preceding calls. This fact can be used to decrease the number (and thus overhead) of Intel PT packets without losing correctness.

Even if your application is well-behaved, you can consider disabling return compression with noretcomp=1, as in perf record -e intel_pt/cyc=1,noretcomp=1/ .... This will result in more exact instruction and cycle counts, though it will also increase the overhead of tracing (think: from 3% to 5%).

Output file size

To view the trace, head to https://ui.perfetto.dev/ and open the .ftf file.

Very big traces cause severe lag in Perfetto. Keep them under several hundred megabytes and several hundred thousand function calls. When tracing a single thread of Scylla, this translates to about 25ms.

You can split longer spans into manageable chunks like so:

parallel -j42 ./decode_pt.sh {} 0.025 ./libperf2perfetto.so out_cyc_{}.ftf c ::: $(seq 0.100 0.025 0.299)

This will decode the span 100ms-300ms split into 8 files, each covering 25ms.
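To preview which chunks a given range produces before launching the decoders, you can run seq on its own. A small sketch, assuming GNU seq; chunk_starts is a hypothetical helper name, and LC_ALL=C is set so that the decimal separator is always a dot.

```shell
# Print the relative start time of every chunk, one per line.
# Arguments: first start, chunk length, upper bound (all in seconds).
chunk_starts() {
    LC_ALL=C seq "$1" "$2" "$3"
}

# List the chunk start times for the 100ms-300ms span in 25ms steps.
chunk_starts 0.100 0.025 0.299

# Count them.
chunk_starts 0.100 0.025 0.299 | wc -l
```

The upper bound is 0.299 rather than 0.300 because seq includes the end value when a step lands on it, so 0.300 would likely add one more chunk starting at 300ms.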

Drilling down

If you find something interesting in the trace, you can view a trace of individual instructions like so:

perf script -i perf.data  --itrace=i0ns -Fip,time,insn,srcline,sym --xed --tid 23612 --time 13161.554437440,13161.554462914

where --tid and --time refer to the interesting part of the trace. You can find a copy-paste-ready time span in "slice details" in Perfetto. (This does not necessarily match the time axis in Perfetto. What you see there can be something other than time (instructions or cycles), depending on the second --dlarg.) This will output a list of all executed instructions in that span, with source code locations, for detailed inspection:

Screenshot of an example output viewed

Archiving traces and decoding traces from remote machines

perf.data isn't a standalone file. Raw Intel PT data contains only information that can't be deduced offline: the results of conditional branches (taken/not taken), target addresses of indirect jumps, timing information. Decoding the raw trace to something useful (a call-ret trace, an instruction trace, etc.) requires access to all binaries executed when the program was traced.

perf.data doesn't embed those binaries, but it contains the build-ids of the required binaries. When decoding, perf looks for the given build-ids in system directories (your package manager installs debug info there) and in the "buildid cache" (usually located at ~/.debug). If you update the machine or move perf.data to another machine, the necessary build-ids will likely not be present on the system anymore.

If you want the trace to be portable across updates, reboots and machines, you should archive the binaries and store them with the trace, so that you can repopulate the buildid cache before decoding when necessary.

Fortunately perf has a script that packs all the needed binaries into an archive so you can do that easily.

On the remote do:

# First, record the trace.
TRACE_DURATION_SECS=1
PID=$(pgrep --newest --exact my_prog)
sudo perf record -o perf.data --kcore -p $PID -e intel_pt/cyc=1/ -- sleep $TRACE_DURATION_SECS
# Note the added `--kcore`. Instead of a `perf.data` file, this will output a `perf.data/` directory containing a
# `data` file and `kcore_dir/` directory with a copy of the kernel image.
# `perf script` understands this directory scheme. Don't pass `-i perf.data/data` to it, just `-i perf.data`.
#
# (The kernel image is passed separately from the buildid cache mechanism.
# Even if it's in the cache, you still need to manually tell `perf script` to use it using `--kallsyms`,
# or use the `kcore_dir/` directory scheme. I don't know why it doesn't just want to behave like any other
# binaries.)

# And collect all relevant binaries into an archive.
sudo perf archive
# This will create `perf.data.tar.bz2`

Then on your workstation:

# Download the trace and the archive.
rsync -rz --progress --rsync-path="sudo rsync" remote:perf.data .
rsync -r --progress remote:perf.data.tar.bz2 .

# Unpack the binaries into a place searched by perf when decoding. `~/.debug` is the default.
lbzip2 -dc perf.data.tar.bz2 | tar x -C ~/.debug

Now you can perf script -i perf.data as usual.

Note that every user has their own buildid-cache. If you are going to sudo perf script, you have to unpack the archive to /root/.debug, not ~/.debug.

If your perf distribution doesn't have perf archive, just grab tools/perf/perf-archive.sh from the Linux repository.

Troubleshooting

I have encountered some programs (e.g. Firefox on Fedora) that I can't trace because the decoding fails with SIGSEGV. This is a problem with perf, not this dlfilter. It happens when perf tries to read something (symbol names or instructions, I'm not sure) from library segments with PROT_NONE. I'm not sure what causes this.

Related projects

magic-trace

magic-trace provides the same general functionality (export from perf to .ftf). AFAIK the main differences are:

  • magic-trace exposes its own CLI and does the necessary perf invocations under the hood
  • magic-trace has some features for interactive choice of the recording target (PID, symbol to collect snapshots on, etc.) using fzf
  • magic-trace is written in OCaml
  • magic-trace parses the text output of perf script instead of using its binary API (the dlfilter feature)
  • this project shows some additional info in the trace: instructions, cycles, instruction cache footprint
  • the call-stack simulation logic may differ (in particular the handling of gaps in the trace)

I wrote my own converter instead of using magic-trace because I wanted instruction and cycle counts and a separation of recording and decoding (for example, to visualize traces collected on remote machines), and because magic-trace failed with some regex errors the first time I tried it (on a C++ project with elaborate template symbols).

Owner
Michał Chojnowski
A student of Computer Science at University of Warsaw