PRQL is a modern language for transforming data — a simpler and more powerful SQL

Last update: Jun 21, 2022

PRQL

Language Docs Discord

GitHub CI Status GitHub contributors Stars

Pipelined Relational Query Language, pronounced "Prequel".

PRQL is a modern language for transforming data — a simpler and more powerful SQL. Like SQL, it's readable, explicit and declarative. Unlike SQL, it forms a logical pipeline of transformations, and supports abstractions such as variables and functions. It can be used with any database that uses SQL, since it transpiles to SQL.

PRQL was discussed on Hacker News and Lobsters earlier this year when it was just a proposal.

Here's a short example of the language; for more examples, visit prql-lang.org. To experiment with PRQL in the browser, check out PRQL Playground.

from employees                                # Each line transforms the previous result.
filter start_date > @2021-01-01               # Clear date syntax.
derive [                                      # `derive` adds columns / variables.
  gross_salary = salary + payroll_tax,
  gross_cost = gross_salary + benefits_cost   # Variables can use other variables.
]
filter gross_cost > 0
group [title, country] (                      # `group` runs a pipeline over each group.
  aggregate [                                 # `aggregate` reduces each group to a row.
    average salary,
    sum     salary,
    average gross_salary,
    sum     gross_salary,
    average gross_cost,
    sum_gross_cost = sum gross_cost,          # `=` sets a column name.
    ct = count,
  ]
)
sort [sum_gross_cost, -country]               # `-country` means descending order.
filter ct > 200
take 20

Resources

To learn more, check out the PRQL Website.

For specific resources, check out:

Contributors

Many thanks to those who've made our progress possible:

Contributors

Core developers

We have a few core developers who are responsible for reviewing code, making decisions on the direction of the language, and project administration:

We welcome others to join who have a track record of contributions.

GitHub

https://github.com/prql/prql
Comments
  • 1. Orthogonal treatment of groups (nee Window functions)

    Edit — this issue evolved into a much broader discussion of how we do groups. The original message is below, and I've tried to summarize the current state towards the end of the issue so it's easier for others to engage.


    More than half the people I speak with about PRQL tell me that window functions are painful in SQL and would be a great feature to add to PRQL. I put window funcitons in the examples in the Readme at first, since I had the same experience in SQL.

    For clarity: a window / analytic / analytical function is a function that takes other rows as an input, but instead of aggregating over many rows, it creates a new value for every row. For example, a moving average or lag. In SQL it's represented by the OVER keyword.

    I haven't thought about this enough, but wanted to post an issue so we could start thinking about how to do it. A couple of initial options:

    • A separate pipeline, with the column at the end, and the function (lag in this case) in the pipeline:
    from prices
    derive price_yesterday: (
      window
      by sec_id
      sort date
      lag 1
      price
    )
    
    • A function with lots of named args, with the function lag as a named arg (but how would this work for functions with 0 or >1 parameters?):
    from prices
    derive price_yesterday: (window by:sec_id sort:date lag:1 price)
    
    • Eliminate window — because we have aggregate, we can't get confused about sum(foo) OVER and sum(foo), so possibly we don't need to have window:
    from prices
    derive price_yesterday: (price | lag rows:1 sort:date by:sec_id)
    

    I don't think this is sufficiently thought through, and I'm sure these examples have inadequacies.

    Here are some example queries we can use as cases: https://cloud.google.com/bigquery/docs/reference/standard-sql/analytic-function-concepts#get_the_most_popular_item_in_each_category

    Reviewed by max-sixty at 2022-04-02 19:31
  • 2. Functions - syntax, scope and limitations

    In issue #437 we ended up with quite a lot of discussion around the syntax for functions. For anyone browsing the issue log that might be quite difficult to find so I'm opening a separate issue for it here. The syntax also gets updated in PR #442 .

    The current syntax is described in: https://github.com/prql/prql/blob/main/reference/src/functions.md

    @aljazerzen also gave some descriptions on Discord which I've reproduced below:

    Currently they are limited to "any expression", but there are plans for them to allow:

    • "late binding variables"
    func net gross = gross * (1 - tax)
    
    from finances
    select [net gross_salaries, net gross_materials, net gross_something]
    

    where tax is column from the finances, but is not resolved until after the function is materialised

    • currying
    func add a b = a + b
    func add_one = add 1
    
    from employees
    select [ age_next_year: add_one age ]
    
    • transform functions
    func take_first col = (sort col | take 1)
    
    from employees
    take_first age
    
    Reviewed by snth at 2022-05-03 10:02
  • 3. Dialect-specific SQL

    Even with the current code, take 20 will produce TOP(20), which is incompatible with some DBs.

    The original plan was to allow a db named arg to queries, which I still think works well, but open to alternatives.

    Reviewed by max-sixty at 2022-03-20 23:02
  • 4. Named args vs assignments

    One issue https://github.com/max-sixty/prql/pull/52 raises is how assignments are parsed; specifically:

    derive gross_salary: salary + payroll_tax,
    

    Should parse into:

    • func: derive
    • assign:
      • lvalue: gross_salary
      • rvalue: salary + payroll_tax

    But named args' syntax conflict with this. For example:

    aggregate by:[title] [sum salary]
    

    should parse into:

    • func: aggregate
    • assign:
      • lvalue: by
      • rvalue: [title]
    • arg: [sum salary]

    ...and not

    • func: aggregate
    • assign:
      • lvalue: by
      • rvalue:[title] [sum salary]

    ...which is what the assign implies.

    Options:

    • change syntax so only one of these uses :
      • Though trying to avoid using = so we can keep the SQL (and human!) use of this
    • force the assign to always take one arg, so gross_salary: salary + payroll_tax needs to be gross_salary: (salary + payroll_tax)
    • restore the rules around whitespace, such that passing named args disallows whitespace (i.e. by:[title] and not by: [title]). But this makes semantics very dependent on small changes in whitespace, which could get confusing. An autoformatter isn't going to be able to deal with this

    References:

    Reviewed by max-sixty at 2022-02-03 19:51
  • 5. Window functions

    Following on from https://github.com/prql/prql/issues/300, we can now implement window functions.

    Much of the infrastructure has already been built by @aljazerzen , including using a similar construction to group. There will be some differences though, and it'll require some work in translation. It will be moderate-hard difficulty.

    Reviewed by max-sixty at 2022-04-26 04:32
  • 6. Date, time & duration literals

    In SQL, creating date objects is a mixture of:

    • A string '2020-01-01'
    • DATE(2016, 12, 25) — BQ
    • make_date(year int, month int, day int) — Postgres

    The strings are dynamically typed so they can be implicitly converted when, for example, they're compared to another date (some quick experimenting suggests this also happens when compared to other strings that might be a date!).

    Some better options:

    • Malloy uses @2020-01-01
    • We could extend our use of number & string combinations from https://github.com/prql/prql/pull/369 (5days) to dates and use d2020-01-01 or D2020-01-01. We can also apply this to timestamps / times / etc with different letters.
    • Very open to other ideas

    Any thoughts?

    Reviewed by max-sixty at 2022-04-17 05:08
  • 7. Starting on the parser

    A couple of people reached out asking whether they could help writing the parser, and potentially there are others who would be interested in contributing.

    My plan was to use nom, because I'd heard good things, had a good first experience, and it's easy to adapt it if we decide to make it more hand-written or use a PEG grammar.

    My initial plan was to try and get some of the initial simple examples parsing. This will be mostly activation energy — it's not the most difficult language to parse. So if we can parse this into an AST, that would be a great success:

    from employees
    filter country = "USA"                           # Each line transforms the previous result.
    derive [                                         # This adds columns / variables.
      gross_salary: salary + payroll_tax,
      gross_cost:   gross_salary + benefits_cost     # Variables can use other variables.
    ]           
    filter gross_cost > 0
    aggregate by:[title, country] [                  # `by` are the columns to group by.
        average salary,                              # These are aggregation calcs run on each group.
        sum     salary,
        average gross_salary,
        sum     gross_salary,
        average gross_cost,
        sum_gross_cost: sum gross_cost,
        count,
    ]
    sort sum_gross_cost
    filter count > 200
    take 20
    

    If anyone wants to have a go, I would suggest:

    • Take one function (e.g. derive)
    • Put a PR in after you have a few lines of code — then people won't do the same thing (though I'm not sure if that will actually be a problem!)
    • Don't worry about making it complete (e.g. it'd be fine to start on derive only for anonymous columns), add comments where stuff is missing
    • Do add a few tests. You can see the very early functions I added have tests

    If parts are unclear, please feel free to reply here.

    Reviewed by max-sixty at 2022-01-27 03:38
  • 8. Notation for specifying sort direction

    I find that having separate order:asc terms breaks the visual flow when reading sort expressions. For example, in the example below at first glance it appears as if one is sorting by four columns and it is difficult to pick out the column names.

    sort age order:desc, name order:asc
    

    I would explore what alternatives there might be (bearing in mind that I'm speaking from the perspective of a user and I have no knowledge how possible or not it is to implement these in the parser).

    To me the simplest, and closest to SQL, would be to dispense with the order token and just affix the direction to the column name like so

    sort age:desc, name:asc
    

    However in the quest for simplicity and terseness I've also considered whether one could overload some symbols/operators. The ones that naturally come to my mind are <, >, +, and -. So

    sort age>, name<
    

    or

    sort age-, name+
    

    Of these, I prefer the latter. I also think it makes logical sense in that one can think of this as always sorting ascending and just sometimes we don't sort the values but rather the negatives of the values which naturally gives the descending order.

    One unaddressed issue is the placement of the NULL values which I haven't found a neat suggestion for yet.

    Reviewed by snth at 2022-04-28 21:47
  • 9. Adding to_json in prql-compiler/src/lib , and exposing to python

    @max-sixty There are some python tests in prql-python/python/tests that would be nice to have run before the Release phase as a check before uploading, I could not get them to run correctly. Pinging @rbpatt2019 also he might know how to integrate running the python tests before Release in .github/workflows/python.yaml

    Reviewed by charlie-sanders at 2022-04-23 14:35
  • 10. dbt-prql

    I've been thinking of ways that people could really start using PRQL practically. Somehow, something needs to transform their PRQL into SQL between the user and the DB.

    Some ideas:

    • An specific interactive environment, a bit like PyPrql have built / are building
    • An editor plugin, similar to what Malloy have built
    • A plugin to a batch tool, such as dbt (https://github.com/prql/prql/issues/13)

    They are all compelling! It might be that the dbt plugin is the approach that can get the most amount of traction the fastest. While it might not take advantage of some of the things that PRQL allows in the long term — auto-complete, type-inference — it also has much lower requirements — just compile the query when otherwise compiling the SQL.

    dbt would be the natural choice given how widely it's used.

    A problem with dbt packages is that IIUC they don't allow for executing arbitrary code — they can only execute macros, which have a very defined scope.

    So a couple of alternatives:

    1. dbt package + import hack — We write something that patches the dbt jinja context to allow a prql macro which compiles the contents to SQL — e.g. {{% prql ... %}}
    • The advantage of this is that it's exactly like running existing dbt models, but people can write PRQL
    • The disadvantage is that it requires extreme python hackery — check out @mtkennerly's approach to patching poetry here. Ours would be a bit easier than this, since we don't need to account for multiple installation approaches, but it wouldn't be simple.
    • We'd be using dbt's internal APIs, which could break on dbt upgrades.
    • I'm not sure whether the dbt folks would approve? (@kwigley @drewbanin ?) I'm sure they'd approve of us building on top of dbt, but not so sure about patching their library's jinja environment...
    1. Wrapper tool — We write a tool which pre-compiles files for dbt; like prql dbt models/ -- dbt run -m foo
    • The advantage is that the implementation would be extremely simple
    • The disadvantage is that it adds a layer of indirection to dbt — It means prql might need to understand dbt's file structure, and run a similar process to that which dbt is already running to walk the file tree — in order to compile .prql files to .sql, so that dbt sees the .sql files. Processes which run dbt now need to run a different command.
    • Up a level, it offloads complexity onto to user, rather than encapsulates it in the tool — something we should try and avoid.
    • But possibly it's the practical approach is the short-term.

    Either of these requires a way of allowing jinja in PRQL — probably we could just treat {{ }} as comments without giving up anything in the language. (Hopefully people don't need as much jinja with PRQL, given we have abstractions like functions, but we don't need to be strict about it)

    This could either be implemented by calling out to PyPrql, or through embedding the rust binary in a python package — this is something dbt do already with https://github.com/dbt-labs/dbt-extractor.

    Reviewed by max-sixty at 2022-04-17 20:02
  • 11. let vs set in new columns

    justinpombrio makes a good point:

    Another suggestion around let: consider splitting it into two operations, for creating a new column and for modifying an existing one. E.g. called let and set. Those are in effect pretty different operations: you need to know which one is happening to know how many columns the table will have, and renaming a table column can with your current system change which operation is happening.

    Splitting them into separate operations would make things easier on the reader: they can tell what's happening without having to know all the column names of the table. And it shouldn't really be harder for the writer, who ought to already know which they're doing.

    I think this is a good idea! dplyr has something similar with mutate & transmute.

    It can mostly be enforced by PRQL. There's a case where we transpile to:

    select *, x+1 as x_plus_one
    

    ...where we don't know whether or not we're overwriting an existing column. But it's a minority of cases, and the contract could stand within PRQL.

    let & set seem reasonable but also open to other syntax.

    Reviewed by max-sixty at 2022-01-24 23:43
  • 12. Move playground within the website

    As discussed on Discord:

    It would be less of a jump, and so fewer people giving up exploring the language, if Playground could share a header and formatting with the rest of the site.

    I don't think it can't be build purely with Hugo given it requires wasm & JS, but it could be copied into a path as part of a build step (as we do in the build).

    It would require adjusting the Playground's CSS to be more similar to the site's.

    This was originally inspired by @alonso's Rulex. Though it looks like they've recently regressed a bit and it's not exactly the same header.

    Reviewed by max-sixty at 2022-06-21 21:49
  • 13. Add a "card" testimonial in the "what people are saying" section?

    Is it possible for us to add a "card" among the tweets section, with something someone said, but hasn't tweeted? In particular, my acquaintance Jeremiah Lowin, founder of Prefect, generously offered a testimonial:

    It starts with FROM, it fixes trailing commas, and it's called PRQL?? If this is a dream, don't wake me up.

    Would be great to have this there along with his name and role, if the formatting will work.

    Reviewed by max-sixty at 2022-06-21 17:01
  • 14. Deduplicate npm packages

    Ideally, we would not have 3 npm packages but a single one.

    I did a little research, and currently it is possible, but not easy.

    This is very similar issue: https://github.com/spruceid/didkit/issues/208

    One way to do it would mean creating two builds and merge results together, described here and here.

    Another (easier) way would be to wait for this to merge: https://github.com/rustwasm/wasm-pack/issues/313 https://github.com/rustwasm/wasm-pack/issues/469

    Reviewed by aljazerzen at 2022-06-18 21:26
  • 15. Date diff with `-` operator

    From @matsonj: https://twitter.com/matsonj/status/1537969665070092295

    Currently we don't handle this, but I think we could without that much trouble:

    derive x = @2022-01-01 - @2021-01-01
    
    SELECT
      DATE '2022-01-01' - DATE '2021-01-01' AS x
    

    ...should be DATEDIFF(DATE '2022-01-01', DATE '2021-01-01')

    Reviewed by max-sixty at 2022-06-18 03:05
A Rust SQL query builder with a pleasant fluent API closely imitating actual SQL

Scooby An SQL query builder with a pleasant fluent API closely imitating actual SQL. Meant to comfortably build dynamic queries with a little bit of s

Jun 8, 2022
Gh-sql - Query GitHub Projects (beta) with SQL
Gh-sql - Query GitHub Projects (beta) with SQL

gh-sql: Query GitHub Projects (beta) with SQL Installation gh extension install KOBA789/gh-sql Features SELECT items DELETE items UPDATE item fields

Jun 5, 2022
A Modern Real-Time Data Processing & Analytics DBMS with Cloud-Native Architecture, built to make the Data Cloud easy
A Modern Real-Time Data Processing & Analytics DBMS with Cloud-Native Architecture, built to make the Data Cloud easy

A Modern Real-Time Data Processing & Analytics DBMS with Cloud-Native Architecture, built to make the Data Cloud easy

Jun 25, 2022
An object-relational in-memory cache, supports queries with an SQL-like query language.

qlcache An object-relational in-memory cache, supports queries with an SQL-like query language. Warning This is a rather low-level library, and only p

Nov 14, 2021
RisingWave is a cloud-native streaming database that uses SQL as the interface language.

RisingWave is a cloud-native streaming database that uses SQL as the interface language. It is designed to reduce the complexity and cost of building real-time applications. RisingWave consumes streaming data, performs continuous queries, and updates results dynamically. As a database system, RisingWave maintains results inside its own storage and allows users to access data efficiently.

Jun 18, 2022
Scalable and fast data store optimised for time series data such as financial data, events, metrics for real time analysis

OnTimeDB Scalable and fast data store optimised for time series data such as financial data, events, metrics for real time analysis OnTimeDB is a time

Apr 5, 2022
Query LDAP and AD with SQL

SQLDAP Ever wanted to query AD or LDAP with SQL like queries ? I'm going to answer this question myself: yes ! Why ? Because I never could remember al

Jun 1, 2022
FeOphant - A SQL database server written in Rust and inspired by PostreSQL.

A PostgreSQL inspired SQL database written in Rust.

May 2, 2022
Distributed, version controlled, SQL database with cryptographically verifiable storage, queries and results. Think git for postgres.

SDB - SignatureDB Distributed, version controlled, SQL database with cryptographically verifiable storage, queries and results. Think git for postgres

Apr 26, 2022
Rust library to parse, deparse and normalize SQL queries using the PostgreSQL query parser

This Rust library uses the actual PostgreSQL server source to parse SQL queries and return the internal PostgreSQL parse tree.

Jun 23, 2022
TDS 7.2+ (mssql / Microsoft SQL Server) async driver for rust

Tiberius A native Microsoft SQL Server (TDS) client for Rust. Supported SQL Server versions Version Support level Notes 2019 Tested on CI 2017 Tested

Jun 22, 2022
GlueSQL is a SQL database library written in Rust

GlueSQL is a SQL database library written in Rust. It provides a parser (sqlparser-rs), execution layer, and optional storage (sled) packaged into a single library.

Jun 19, 2022
Run SQL queries on CSV files

zsql run SQL queries on csv files A terminal utility to easily run SQL queries on CSV files. zsql is shipped as a small single binary powered by rust

Apr 26, 2022
Running SQL-like queries on files.

filesql Running SQL-like queries on files. Features Supported: REPL Basic SQL expressions. INSERT clause. (which inserts data into another file) WHERE

Nov 15, 2021
Tool to automate the visualisation of UML dependencies from a SQL file
Tool to automate the visualisation of UML dependencies from a SQL file

Doteur A simple tool to draw your mysql relations from exports. Help us If you use the tool, and like it, don't forget to add a star to the project on

May 20, 2022
Fully typed SQL query builder for Rust [deprecated]

What is Deuterium? Deuterium is a fancy SQL builder for Rust. It's designed to provide a DSL to easily build SQL queries in safe and typed way. Like R

Apr 10, 2022
Ormlite - An ORM in Rust for developers that love SQL.

ormlite ormlite is an ORM in Rust for developers that love SQL. It provides the following, while staying close to SQL, both in syntax and performance:

Jun 16, 2022
Distributed SQL database in Rust, written as a learning project

toyDB Distributed SQL database in Rust, written as a learning project. Most components are built from scratch, including: Raft-based distributed conse

Jun 22, 2022
X-Engine: A SQL Engine built from scratch in Rust.

XNGIN (pronounced "X Engine") This is a personal project to build a SQL engine from scratch. The project name is inspired by Nginx, which is a very po

May 3, 2022