Overview

reth-indexer

reth-indexer reads directly from the reth db and indexes the data into a postgres database, all decoded, driven by a simple config file with no extra setup, alongside exposing an API ready to query the data.

Disclaimer

This is an R&D project and most likely has missing features and bugs. There are also most likely plenty of optimisations we can do in Rust land. PRs are more than welcome to make this even faster and better.

Why

If you want to get data from the chain you tend to have to use a provider like Infura or Alchemy, and it can get expensive with their usage plans if you are just trying to get event data. On top of that, pulling huge amounts of data fast is not possible for free. Over-the-wire JSON-RPC calls add a lot of overhead and are slow: you have the TLS handshake, you may be calling an API located on the other side of the world, it adds TCP connections to your backend, and scaling this is not easy, mainly because of how over-the-wire JSON-RPC calls work - and with that, your bill with your provider increases.

If you wish to build a big data lake, or even just fetch dynamic events from the chain, the task is near impossible without a third-party paid tool or something like The Graph's hosted services, and most do not let you pull in millions of rows at a time quickly. This data should be fetchable for free, blazing fast, and customisable to your needs. This is what reth-indexer does.

This project aims to solve this by reading directly from the reth node db and indexing the data into a postgres database you can then query fast, with indexes already applied for you. You can also scale this easily by running multiple instances of this program on different boxes and pointing them at the same postgres database (we should build that into the tool directly).

This tool is perfect for all kinds of people, from developers, to data analysts, to ML developers, to anyone who just wants to get a snapshot of event data and use it in a production app or for their own reports.

Features

  • Creates postgres tables for you automatically
  • Creates indexes on the tables for you automatically to allow you to query the data fast
  • Indexes any events from the reth node db
  • Supports indexing any events from multiple contracts or all contracts at the same time
  • Supports filtering on individual event inputs, allowing you to filter on every element of the event
  • Snapshot between from and to block numbers
  • No code required, it is all driven by a json config file that is easy to edit and understand
  • Runs on your own infrastructure so you can scale it as you wish
  • Exposes a ready to go API for you to query the data

Benchmarks

It is very hard to benchmark as it all comes down to the block range and how often your event is emitted, but roughly (this could most likely be sped up with some more optimisations):

  • indexes around 30,000 events a second (depending on how far away the events are in each block)
  • scans around 10,000 blocks which have no events within 400ms using blooms

Head to head

I have only compared what I can right now, but I am happy for others to do their own head-to-head comparisons.

providers:

  • The Graph Hosted (Substreams) - note we are not comparing the legacy hosted service as it is 100x slower than Substreams, so you can do the maths on legacy
Task: indexing rETH (Rocket Pool ETH) Transfer/Approval events, both runs starting on block 11446767 and finishing on block 17576926 (config to run it yourself here)

  • The Graph Hosted (Substreams): 19.5 hours - https://thegraph.com/hosted-service/subgraph/data-nexus/reth-substreams-mainnet
  • reth-indexer: 15.9 minutes
  • reth-indexer is 73.5x faster

Note on benchmarking

We should compare this tool to other resync tools which go from block N, not ones which are already resynced. If you have the resynced information already then the results will always be faster, as the data is already indexed. This tool goes block by block, scanning each block for information, so the bigger the block range the longer it takes to process. How fast it is depends on how many events are present in the blocks.

Indexes

Right now it is just focused on indexing events; it does not currently index ethereum transactions or blocks. That said, it would not be hard to add this functionality on top of the base logic now built - PR welcome.

  • [ ] Indexes blocks
  • [ ] Indexes transactions
  • [ ] Indexes eth transfers
  • [x] Indexes and decodes event logs

Requirements

  • This must be run on the same box that the reth node is running on
  • You must have a postgres database running on the box

How it works

reth-indexer goes block by block using the reth db directly, searching for any events that match the event mappings you have supplied in the config file. It then writes the data to a csv file and bulk copies the data into the postgres database. It uses blooms to disregard blocks it does not need to care about. It uses CSVs and the postgres COPY syntax because COPY can write thousands of records a second, bypassing some of the processing and logging overhead associated with individual INSERT statements; when you are dealing with big data this is a really nice optimisation.

How to use

Syncing

  • git clone this repo on your box - git clone https://github.com/joshstevens19/reth-indexer.git
  • create a reth-indexer-config.json in the root of the project; an example of the structure is in reth-indexer-config-example.json, and you can use cp reth-indexer-config-example.json reth-indexer-config.json to create the file from the template.
  • map your config file (we go through what each property means below)
  • run RUSTFLAGS="-C target-cpu=native" cargo run --profile maxperf --features jemalloc to run the indexer
  • see all the data get synced to your postgres database
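
As a reference, below is a minimal sketch of what a reth-indexer-config.json could look like, assembled from the options documented in the config file section further down (the paths, block numbers, contract address and connection string are the illustrative values used in that section, not a copy of reth-indexer-config-example.json):

{
  "rethDBLocation": "/home/ubuntu/.local/share/reth/mainnet/db",
  "csvLocation": "/tmp/",
  "fromBlockNumber": 17569693,
  "toBlockNumber": 17569794,
  "postgres": {
    "dropTableBeforeSync": true,
    "connectionString": "postgresql://postgres:password@localhost:5432/reth_indexer"
  },
  "eventMappings": [
    {
      "filterByContractAddress": ["0xdAC17F958D2ee523a2206206994597C13D831ec7"],
      "decodeAbiItems": [
        {
          "anonymous": false,
          "inputs": [
            { "indexed": true, "internalType": "address", "name": "from", "type": "address" },
            { "indexed": true, "internalType": "address", "name": "to", "type": "address" },
            { "indexed": false, "internalType": "uint256", "name": "value", "type": "uint256" }
          ],
          "name": "Transfer",
          "type": "event"
        }
      ]
    }
  ]
}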

Advice

reth-indexer goes block by block, which means if you set block 0 to an end block it will have to check all the blocks - it does use blooms, so it is very fast at knowing if a block has nothing we need, but if the contract was not deployed until block x then it is a pointless use of resources. Put in the block number the contract was deployed at as the from block number if you want all the events for that contract. Of course you should use the from and to block numbers as you wish, but this is just a tip.
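
For example, the benchmark above indexes the rETH (Rocket Pool ETH) contract starting from block 11446767 rather than block 0; an illustrative fragment of the relevant config values (contract address and block number taken from the benchmark) would be:

"fromBlockNumber": 11446767,
"filterByContractAddress": ["0xae78736cd615f374d3085123a210448e74fc6393"]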

API

You can also run a basic API alongside this which exposes a REST API for you to query the data. It is not meant to be a fully fledged API but just a simple way to query the data if you wish, and it is not required to run the syncing logic. You can also resync data first and then load the API up to query it.

  • you need the same mappings as what you synced, as that is the source of truth
  • run RUSTFLAGS="-C target-cpu=native" API=true cargo run --profile maxperf --features jemalloc to run the api
  • it will expose endpoints on 127.0.0.1:3030/api/
    • The REST route is the name of the ABI event you are calling, so for:
    {
      "anonymous": false,
      "inputs": [
        {
          "indexed": true,
          "internalType": "address",
          "name": "from",
          "type": "address"
        },
        {
          "indexed": true,
          "internalType": "address",
          "name": "to",
          "type": "address"
        },
        {
          "indexed": false,
          "internalType": "uint256",
          "name": "value",
          "type": "uint256"
        }
      ],
      "name": "Transfer",
      "type": "event"
    }
    • if you wanted data from this event you would query 127.0.0.1:3030/api/transfer and it would return all the data for that event.
  • you can use limit to define the amount you want brought back - 127.0.0.1:3030/api/transfer?limit=100
  • you can use offset to page the results - 127.0.0.1:3030/api/transfer?limit=100&offset=100
  • the results of the REST API are dependent on the event ABI you have supplied, so they always include the fields in the ABI inputs plus the additional fields blockNumber, txHash, blockHash, contractAddress, indexedId.
{
  "events:": [
    {
      "from": "0x8d263F61D0F67A75868F831D83Ef51F24d10A003",
      "to": "0x7a250d5630B4cF539739dF2C5dAcb4c659F2488D",
      "value": 1020203030,
      "blockNumber": 13578900,
      "indexedId": "aae863fb-2d13-4da5-9db7-55707ae93d8a",
      "contractAddress": "0xae78736cd615f374d3085123a210448e74fc6393",
      "txHash": "0xb4702508ef5170cecf95ca82cb3465278fc2ef212eadd08c60498264a216f378",
      "blockHash": "0x8f493854e6d10e4fdd2b5b0d42834d331caa80ad739225e2feb1b89cb9a1dd3c"
    }
  ],
  "pagingInfo": {
    "next": "127.0.0.1:3030/api/transfer?limit=100&offset=200",
    "previous": "127.0.0.1:3030/api/transfer?limit=100&offset=100" // < this will be null if no previous page
  }
}

Searching

The API allows you to filter on every element of the data with query string parameters. For example, if I wanted to filter on the from and to address I would do:

curl "127.0.0.1:3030/api/transfer?from=0x8d263F61D0F67A75868F831D83Ef51F24d10A003&to=0x7a250d5630B4cF539739dF2C5dAcb4c659F2488D" - remember the quotes around the endpoint if using curl as it will only bring in certain query string parameters if you dont.

You can mix and match ANY fields that you want, including the common fields, with no limit on the number of fields either. Bear in mind the postgres database automatically creates an index for the fields which are marked as "indexed": true in the ABI, so if you filter on those fields it will be very fast; if you filter on a field which is not indexed it will not be as fast. You can of course add your own index; a ticket will be created to allow you to pass in a custom index in the future.

Config file

You can see an example config here, but it is important to read the config options below, as there are many different features which can be enabled by the config setup.

rethDBLocation - required

The location of the reth node db on the box.

example: "rethDBLocation": "/home/ubuntu/.local/share/reth/mainnet/db",

csvLocation - required

The location the application uses to write temp csv files; the folder needs to be readable by the user running the program, and the postgres user must also be able to read it. On ubuntu, using /tmp/ is the best option.

example: "csvLocation": "/tmp/",

fromBlockNumber - required

The block number to start indexing from.

example: "fromBlockNumber": 17569693,

toBlockNumber - optional

The block number to stop indexing at. If you want a live indexer leave it blank and it will index all the data and, once caught up to the head, sync live.

example: "toBlockNumber": 17569794,

postgres - required

Holds the postgres connection and settings info

example:

"postgres": {
  "dropTableBeforeSync": true,
  "connectionString": "postgresql://postgres:password@localhost:5432/reth_indexer"
}

dropTableBeforeSync - required

Whether to drop the table before syncing data to it; this is useful if you want to reindex the data. The tables are auto-created for you every time. It is advised to keep this on or you could get duplicate data.

example: "dropTableBeforeSync": true,

connectionString - required

The connection string to connect to the postgres database.

example: "connectionString": "postgresql://postgres:password@localhost:5432/reth_indexer"

eventMappings

An array of event mappings that you want to index; each mapping will create a table in the database and index the events from the reth node db. You can index data based on a contract address, or if you do not supply a contract address it will index matching events from all contracts.

filterByContractAddress - optional

The contract addresses you only want to index events from

example: "filterByContractAddress": ["0xdAC17F958D2ee523a2206206994597C13D831ec7", "0xA0b86991c6218b36c1d19D4a2e9Eb0cE3606eB48"],

syncBackRoughlyEveryNLogs

How often you want to sync the data to postgres. It uses rough maths on KB size per row to work out when to sync the data to postgres; the smaller you set this, the more often it will write to postgres, and the bigger you set it, the less often. If you are syncing millions of rows, or do not care about seeing it update as quickly in the database, it is best to go for a bigger range like 20,000+ - roughly 20,000 logs is 7KB of data.

This config is set per input, so it allows you within a mapping to define, for example, transfer events to sync back on a bigger range than something else you want to see more often in the db.
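
example (using the 20,000+ range suggested above): "syncBackRoughlyEveryNLogs": 20000,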

decodeAbiItems

An array of ABI objects for the events you want to decode the logs for; you only need the ABI objects of the events you care about, and you can paste in as many as you like.

example:

 "decodeAbiItems": [
        {
          "anonymous": false,
          "inputs": [
            {
              "indexed": true,
              "internalType": "address",
              "name": "owner",
              "type": "address"
            },
            {
              "indexed": true,
              "internalType": "address",
              "name": "spender",
              "type": "address"
            },
            {
              "indexed": false,
              "internalType": "uint256",
              "name": "value",
              "type": "uint256"
            }
          ],
          "name": "Approval",
          "type": "event"
        },
        {
          "anonymous": false,
          "inputs": [
            {
              "indexed": true,
              "internalType": "address",
              "name": "from",
              "type": "address"
            },
            {
              "indexed": true,
              "internalType": "address",
              "name": "to",
              "type": "address"
            },
            {
              "indexed": false,
              "internalType": "uint256",
              "name": "value",
              "type": "uint256"
            }
          ],
          "name": "Transfer",
          "type": "event"
        }
      ]
custom regex per input type - (rethRegexMatch)

You can also apply a custom regex on an input type to filter down further to what you care about; this is useful if you only care about a certain address, a certain token id, or a value over x - anything you wish to filter on which shares the same events. This allows you to sync in every direction you wish with unlimited filters.

The example below says: I want all the Transfer events from all contracts where the from is 0x545a25cBbCB63A5b6b65AF7896172629CA763645 or 0x60D5B4d6Df0812b4493335E7c2209f028F3d19eb. You can see how powerful rethRegexMatch is. It supports regex, so you can do any filtering, and it is NOT case sensitive.

  "eventMappings": [
    {
      "decodeAbiItems": [
        {
          "anonymous": false,
          "inputs": [
            {
              "indexed": true,
              "internalType": "address",
              "name": "from",
              "type": "address",
              "rethRegexMatch": "^(0x545a25cBbCB63A5b6b65AF7896172629CA763645|0x60D5B4d6Df0812b4493335E7c2209f028F3d19eb)$"
            },
            {
              "indexed": true,
              "internalType": "address",
              "name": "to",
              "type": "address"
            },
            {
              "indexed": false,
              "internalType": "uint256",
              "name": "value",
              "type": "uint256"
            }
          ],
          "name": "Transfer",
          "type": "event"
        }
      ]
    }
  ]

Comments
  • feat: migrate to use the reth provider

    migrated to use the reth provider; results below, using the benchmarks config in the benchmark folder

    • using reth-provider - Elapsed time: 1335.68s
    • using reth-db directly - Elapsed time: 1235.08s

    So it is 100 seconds slower - 1.5 minutes - I have now verified this, as I just ran the benchmark again on master

    @mattsse

    opened by joshstevens19 3
  • share a `DatabaseProvider` per loop or per execution

    "if you want to improve your benchmarks, you can look at the recently introduced ProviderFactory and DatabaseProvider. You can share a DatabaseProvider per loop or execution, which essentially shares a db transaction when querying all the data. When dealing with large data, creating a tx per each data query adds up"

    https://github.com/paradigmxyz/reth/blob/c236521cff86077c43aa3e01821fbf39df01fa53/examples/db-access.rs#L112

    https://github.com/paradigmxyz/reth/blob/c236521cff86077c43aa3e01821fbf39df01fa53/examples/db-access.rs#L17C1-L30

    opened by joshstevens19 3
  • Make it easier to reproduce results

    First, great project. Very interested.

    I was hoping to reproduce these results, but I found it difficult to understand exactly what was being indexed (specifically which addresses), what topics (although I guess I understand that it was ERC20 token Transfer and Approval events), and which block range (this was easy).

    In the README section about benchmarks, perhaps you could include that information more explicitly in a table so that others can try to reproduce the results themselves on local machines.

    Just an idea. Thanks for the great project.

    opened by tjayrush 2
  • Once it hits the end block make it stay alive and keep resyncing head

    right now once the code reaches the end block it will exit, but we should add a setting in the config like "stayResynced" so once it reaches the top of the head it keeps reindexing the data as it comes in; not a big change, just factoring in a delayed sleep instead of a continue in the while loop and resetting the from and to parameters

    good first issue 
    opened by joshstevens19 2
  • build an axum API around it

    We know the structure of the tables as we create them, so it would be amazing if we created a wrapper around this to expose ready to go rocket API endpoints to query the data as you want...

    enhancement 
    opened by joshstevens19 2
  • run benchmarks against thegraph and other resync tools

    I do not have any benchmarks on this vs thegraph or other resync tools; the benchmark needs to be from the same block, resyncing the data, to see the results. If someone could give me the results for an event on thegraph or another tool then I could write the benchmark test for reth. Bear in mind, to be able to do this nicely you need a reth node running on a box (not everyone has that)

    opened by joshstevens19 2
  • feat: index live blocks when it reaches the head

    https://github.com/joshstevens19/reth-indexer/issues/10

    for now it is polling, but as halljson was saying we could use websockets; polling tends to be a lot more stable though, so for now this approach works quite nicely as you have less overhead and are just speaking to the db once a new record is indexed in the db.

    opened by joshstevens19 0
  • feat: support reth-indexer API

    alongside running the syncer you now have a fully-fledged API to be able to query all the data from the postgres instance without having to learn SQL

    opened by joshstevens19 0
  • API not returning correct values for solidity uint256

    I am trying to sync univ2 swaps and read them back out via the API.

    The following query works (the results match what is on etherscan):

    select * from swap 
    where "tx_hash"='0x8cb77f390961b967d0fbbb4745a1f4aba6e86787372c95acdfafa062c04ee940' 
    and amount1out=106531830891460120;
    
    to: 0xf164fC0Ec4E93095b804a4795bBe1e041497b92a	
    tx_hash: 0x8cb77f390961b967d0fbbb4745a1f4aba6e86787372c95acdfafa062c04ee940	
    block_number: 10207859	
    block_hash: 0x51876f00dab692e4229e42494c82ae53517b4edcb1bb2e27bea4ef534cf4ff8c	
    timestamp: 1591388247
    amount0in: 140
    amount1in: 0
    amount0out: 0
    amount1out: 106531830891460120 // <-- wei equivalent of the "0.10653183089146012" eth value on etherscan
    

    However, when I try to make the same query through the API, the amount1out is incorrect:

    curl "localhost:3030/api/swap?limit=100&tx_hash=0x8cb77f390961b967d0fbbb4745a1f4aba6e86787372c95acdfafa062c04ee940&amount1out=106531830891460120" | jq
    
    {
      "events": [
        {
          "amount0in": 140,
          "amount0out": 0,
          "amount1in": 0,
          "amount1out": 1838348756084981800,  // <-- incorrect
          "block_hash": "0x51876f00dab692e4229e42494c82ae53517b4edcb1bb2e27bea4ef534cf4ff8c",
          "block_number": 10207859,
          "contract_address": "0xa6f3ef841d371a82ca757fad08efc0dee2f1f5e2                        ",
          "indexed_id": "02704168-9a50-45aa-b465-fef8e103e9b1",
          "sender": "0xf164fC0Ec4E93095b804a4795bBe1e041497b92a",
          "timestamp": "1591388247",
          "to": "0xf164fC0Ec4E93095b804a4795bBe1e041497b92a",
          "tx_hash": "0x8cb77f390961b967d0fbbb4745a1f4aba6e86787372c95acdfafa062c04ee940"
        }
      ],
      "pagingInfo": {
        "next": "127.0.0.1:3030/swap?offset=100&limit=100&amount1out=106531830891460120&tx_hash=0x8cb77f390961b967d0fbbb4745a1f4aba6e86787372c95acdfafa062c04ee940",
        "previous": null
      }
    }
    

    I believe it has something to do with storing the values as NUMERIC data type, but I tried to tweak the value types to BIGINT to no avail.

    It's possible that uint256 values may need to be stored as VARCHAR, or perhaps there's a better way.

    opened by halljson 4
  • use readable yaml files for config

    https://github.com/aave/protocol-subgraphs/blob/main/templates/v3.subgraph.template.yaml#L416

    subgraphs use nice .yaml files and they are easy to read; we should do the same

    opened by joshstevens19 0
  • Add ability / instructions to sync / create test rETH node data

    Referencing the TG convo in the paradigm_reth room -- add the ability for users to test and develop using reth-indexer without having to create / run a full or archive reth node. This could take the form of: (a) a small database / data dump, (b) a generator to create some fake data, or (c) links or setup instructions on standing up a minimally viable reth node with much smaller requirements than the current 2TB requirement for archive nodes.

    https://t.me/paradigm_reth/9337

    opened by liangjh 0