ROAPI
ROAPI automatically spins up read-only APIs for static datasets without requiring you to write a single line of code. It builds on top of Apache Arrow and Datafusion. The core of its design can be boiled down to the following:
- Query frontends to translate SQL, GraphQL and REST API queries into Datafusion plans.
- Datafusion for query plan execution.
- Data layer to load datasets from a variety of sources and formats with automatic schema inference.
- Response encoding layer to serialize intermediate Arrow record batch into various formats requested by client.
See below for a high level diagram:
Installation
Install pre-built binary
pip install roapi-http
Check out Github release page for pre-built binaries for each platform. Pre-built docker images are also available at ghcr.io/roapi/roapi-http.
Install from source
cargo install --locked --git https://github.com/roapi/roapi --branch main --bin roapi-http
Usage
Quick start
Spin up APIs for test_data/uk_cities_with_headers.csv
and test_data/spacex_launches.json
:
roapi-http \
--table "uk_cities=test_data/uk_cities_with_headers.csv" \
--table "test_data/spacex_launches.json"
For windows, full scheme(file:// or filesystem://) must filled, and use double quote(") instead of single quote(') to escape windows cmdline limit:
roapi-http \
--table "uk_cities=file://d:/path/to/uk_cities_with_headers.csv" \
--table "file://d:/path/to/test_data/spacex_launches.json"
Or using docker:
docker run -t --rm -p 8080:8080 ghcr.io/roapi/roapi-http:latest --addr 0.0.0.0:8080 \
--table "uk_cities=test_data/uk_cities_with_headers.csv" \
--table "test_data/spacex_launches.json"
Query tables using SQL, GraphQL or REST:
curl -X POST -d "SELECT city, lat, lng FROM uk_cities LIMIT 2" localhost:8080/api/sql
curl -X POST -d "query { uk_cities(limit: 2) {city, lat, lng} }" localhost:8080/api/graphql
curl "localhost:8080/api/tables/uk_cities?columns=city,lat,lng&limit=2"
Get inferred schema for all tables:
curl 'localhost:8080/api/schema'
Config file
You can also configure multiple table sources using YAML config, which supports more advanced format specific table options:
addr: 0.0.0.0:8084
tables:
- name: "blogs"
uri: "test_data/blogs.parquet"
- name: "ubuntu_ami"
uri: "test_data/ubuntu-ami.json"
option:
format: "json"
pointer: "/aaData"
array_encoded: true
schema:
columns:
- name: "zone"
data_type: "Utf8"
- name: "name"
data_type: "Utf8"
- name: "version"
data_type: "Utf8"
- name: "arch"
data_type: "Utf8"
- name: "instance_type"
data_type: "Utf8"
- name: "release"
data_type: "Utf8"
- name: "ami_id"
data_type: "Utf8"
- name: "aki_id"
data_type: "Utf8"
- name: "spacex_launches"
uri: "https://api.spacexdata.com/v4/launches"
option:
format: "json"
- name: "github_jobs"
uri: "https://jobs.github.com/positions.json"
To run serve tables using config file:
roapi-http -c ./roapi.yml
See config documentation for more options including using Google spreadsheet as a table source.
Response serialization
By default, ROAPI encodes responses in JSON format, but you can request different encodings by specifying the ACCEPT
header:
curl -X POST \
-H 'ACCEPT: application/vnd.apache.arrow.stream' \
-d "SELECT launch_library_id FROM spacex_launches WHERE launch_library_id IS NOT NULL" \
localhost:8080/api/sql
REST API query interface
You can query tables through REST API by sending GET
requests to /api/tables/{table_name}
. Query operators are specified as query params.
REST query frontend currently supports the following query operators:
- columns
- sort
- limit
- filter
To sort column col1
in ascending order and col2
in descending order, set query param to: sort=col1,-col2
.
To find all rows with col1
equal to string 'foo'
, set query param to: filter[col1]='foo'
. You can also do basic comparisons with filters, for example predicate 0 <= col2 < 5
can be expressed as filter[col2]gte=0&filter[col2]lt=5
.
GraphQL query interface
To query tables using GraphQL, send the query through POST
request to /api/graphql
endpoint.
GraphQL query frontend supports the same set of operators supported by REST query frontend. Here how is you can apply various operators in a query:
{
table_name(
filter: {
col1: false
col2: { gteq: 4, lt: 1000 }
}
sort: [
{ field: "col2", order: "desc" }
{ field: "col3" }
]
limit: 100
) {
col1
col2
col3
}
}
SQL query interface
To query tables using a subset of standard SQL, send the query through POST
request to /api/sql
endpoint. This is the only query interface that supports table joins.
Features
Query layer:
- REST API GET
- GraphQL
- SQL
- join between tables
- support query on nested struct fields
- index
- protocol
- gRPC
- MySQL
- Postgres
Response serialization:
- JSON
application/json
- Arrow
application/vnd.apache.arrow.stream
- Parquet
application/vnd.apache.parquet
- msgpack
Data layer:
- filesystem
- HTTP/HTTPS
- S3
- GCS
- Google spreadsheet
- MySQL
- Postgres
- Airtable
- Data format
- CSV
- JSON
- NDJSON
- parquet
- xls, xlsx, xlsm, ods: https://github.com/tafia/calamine
- DeltaLake
Misc:
- auto gen OpenAPI doc for rest layer
- query input type conversion based on table schema
- stream arrow encoding response
- authentication layer
Development
The core of ROAPI, including query frontends and data layer, lives in the self-contained columnq crate. It takes queries and outputs Arrow record batches. Data sources will also be loaded and stored in memory as Arrow record batches.
The roapi-http crate wraps columnq
with a HTTP based API layer. It serializes Arrow record batches produced by columnq
into different formats based on client request.
Building ROAPI with simd
optimization requires nightly rust toolchain.
Build Docker image
docker build --rm -t ghcr.io/roapi/roapi-http:latest .