Pipelined Relational Query Language, pronounced "Prequel"

Related tags

Miscellaneous prql
Overview

PRQL

Pipelined Relational Query Language, pronounced "Prequel".

PRQL is a modern language for transforming data — a simpler and more powerful SQL. Like SQL, it's readable, explicit and declarative. Unlike SQL, it forms a logical pipeline of transformations, and supports abstractions such as variables and functions. It can be used with any database that uses SQL, since it transpiles to SQL.

An example

Here's a fairly simple SQL query:

SELECT TOP 20
    title,
    country,
    AVG(salary) AS average_salary,
    SUM(salary) AS sum_salary,
    AVG(salary + payroll_tax) AS average_gross_salary,
    SUM(salary + payroll_tax) AS sum_gross_salary,
    AVG(salary + payroll_tax + healthcare_cost) AS average_gross_cost,
    SUM(salary + payroll_tax + healthcare_cost) AS sum_gross_cost,
    COUNT(*) as count
FROM employees
WHERE salary + payroll_tax + healthcare_cost > 0 AND country = 'USA'
GROUP BY title, country
ORDER BY sum_gross_cost
HAVING count > 200

Even this simple query demonstrates some of the problems with SQL's lack of abstractions:

  • Unnecessary repetition — the calculations for each measure are repeated, despite deriving from a previous measure. The repetition in the WHERE clause obfuscates the meaning of the expression.
  • Functions have multiple operators — HAVING & WHERE are fundamentally similar operations applied at different stages of the pipeline but SQL's lack of pipeline-based precedence requires it to have two different operators.
  • Operators have multiple functions — the SELECT operator both creates new aggregations, and selects which columns to include.
  • Awkward syntax — when developing the query, commenting out the final line of the SELECT list causes a syntax error because of how commas and handled, and we need to repeat the columns in the GROUP BY clause in the SELECT list.

Here's the same query with PRQL:

0 aggregate split:[title, country] [ # Split are the columns to group by. average salary, # These are the calcs to run on the groups. sum salary, average gross_salary, sum gross_salary, average gross_cost, sum gross_cost, count, ] sort sum_gross_cost # Uses the auto-generated column name. filter count > 200 top 20">
from employees
filter country = "USA"                         # Each line transforms the previous result.
gross_salary = salary + payroll_tax            # This _adds_ a column / variable.
gross_cost   = gross_salary + healthcare_cost  # Variable can use other variables.
filter gross_cost > 0
aggregate split:[title, country] [             # Split are the columns to group by.
    average salary,                            # These are the calcs to run on the groups.
    sum     salary,
    average gross_salary,
    sum     gross_salary,
    average gross_cost,
    sum     gross_cost,
    count,
]
sort sum_gross_cost                            # Uses the auto-generated column name.
filter count > 200
top 20

As well as using variables to reduce unnecessary repetition, the query is also more readable — it flows from top to bottom, each line representing a transformation of the previous line's result. For example, TOP 20 modifies the final result in both queries — but only PRQL represents it as the final transformation. And context is localized — the aggregate function contains both the calculations and the columns to group by.

An example using Functions

Here's another SQL query, which calculates returns from prices on days with valid prices.

SELECT
  date,
  -- Can't use a `WHERE` clause, as it would affect the row that the `LAG` function referenced.
  IF(is_valid_price, price_adjusted / LAG(price_adjusted, 1) OVER 
    (PARTITION BY sec_id ORDER BY date) - 1 + dividend_return, NULL) AS return_total,
  IF(is_valid_price, price_adjusted_usd / LAG(price_adjusted_usd, 1) OVER 
    (PARTITION BY sec_id ORDER BY date) - 1 + dividend_return, NULL) AS return_usd,
  IF(is_valid_price, price_adjusted / LAG(price_adjusted, 1) OVER 
    (PARTITION BY sec_id ORDER BY date) - 1 + dividend_return, NULL) 
    - interest_rate / 252 AS return_excess,
  IF(is_valid_price, price_adjusted_usd / LAG(price_adjusted_usd, 1) OVER 
    (PARTITION BY sec_id ORDER BY date) - 1 + dividend_return, NULL) 
    - interest_rate / 252 AS return_usd_excess
FROM prices

This might seem like a convoluted example, but it's taken from a real query. Indeed, it's also simpler and smaller than the full logic — note that it starts from price_adjusted, whose logic had to be split into a previous query to avoid the SQL becoming even less readable.

Here's the same query with PRQL:

prql version:0.0.1 db:snowflake                       # Version number & database name.

func lag_day x = (
  window x 
  split sec_id 
  sort date
  lag 1
)
func ret x = x / (x | lag_day) - 1 + dividend_return
func excess x = (x - interest_rate) / 252    
func if_valid x = is_valid_price ? x : null

from prices
return_total       = prices_adj   | ret | if_valid    # `|` can be used rather than newlines.
return_usd         = prices_usd   | ret | if_valid
return_excess      = return_total | excess
return_usd_excess  = return_usd   | excess
select [
  date,
  sec_id,
  return_total,
  return_usd,
  return_excess,
  return_usd_excess,
]

Because we define the functions once rather than copying & pasting the code, we get all the benefits of encapsulation and extensibility — we can have reliable & tested functions, whose purpose is explicit, which we can share across queries and colleagues.

Principles

PRQL is intended to be a modern, simple, declarative language for transforming data, with abstractions such as variables & functions. It's intended to replace SQL, but doesn't have ambitions as a general-purpose programming language. While it's at a pre-alpha stage, it has some immutable principles:

  • Pipelined — PRQL is a linear pipeline of transformations — each line of the query is a transformation of the previous line's result. This makes it easy to read, and simple to write.
  • Simple — PRQL serves both sophisticated engineers and analysts without coding experience. By providing simple, clean abstractions, the language can be both powerful and easy to use.
  • Compatible — PRQL transpiles to SQL, so it can be used with any database that uses SQL. Where possible PRQL can unify syntax across databases. PRQL should allow for a gradual onramp — it should be practical to mix SQL into a PRQL query where PRQL doesn't yet have an implementation.
  • Analytical — PRQL's focus is analytical queries; we de-emphasize other SQL features such as inserting data or transactions.
  • Extensible — PRQL can be extended through its abstractions, and can evolve without breaking backward-compatibility, because its queries can specify their PRQL version.

TODOs

  • Write a basic parser
    • Currently writing it using nom.
  • Write a basic complier
    • This should be fairly easy since it's just generating SQL.
  • Demonstrate some more complicated examples — e.g. most of the examples in https://github.com/dbt-labs/dbt-utils could all be covered much better by this.

Notes

Joins

  • Joins are implemented as {join_type} {table} {[conditions]}. For example:

    from employees
    left_join positions [id=employee_id]

    ...is equivalent to...

    SELECT * FROM employees LEFT JOIN positions ON id = employee_id
  • Possibly we could shorten [id=id] to id, and use SQL's USING, but it may be ambiguous with using id as a boolean column.

Functions

  • Functions can take two disjoint types of arguments:

    1. Positional arguments. Callers must pass these.
    2. Named arguments, which can optionally have a default value.
  • So a function like:

    func lag col sort_col split_col=id = (
      window col 
      split split_col
      sort sort_col
      lag 1
    )

    ...is called lag, takes three arguments col, sort_col & split_col, of which the first two much be supplied, the third can optionally be supplied with split_col:sec_id.

Assignments

  • To create a column, we use {column_name} = {calculation} in a pipeline. Technically this is "upserts" the column — it'll either create or overwrite a column, depending on whether it already exists.
  • I'd be open to alternative syntax, given that this syntax is generally a new statement in most programming languages.
    • But I can't think of any syntax that's more familiar than this.
    • Possibly let {column_name} {calculation} would be more consistent with the other keywords?

Lists

  • Currently lists require brackets; there's no implicit list like:

    from employees
    select salary  # fails, would require `select [salary]`
  • For some functions where we're only expecting a single arg, like select, we could accept a single arg not as a list?

Pipelines

  • A line-break generally creates a pipelined transformation. For example:

    from tbl
    select [
      col1,
      col2,
    ]
    filter col1 = col2

    ...is equivalent to:

    from tbl | select [col1, col2] | filter col1 = col2
  • A line-break doesn't created a pipeline in a few cases:

    • Within a list (e.g. the select example above).
    • When the following line is a new statement, by starting with a keyword such as func.

Thinking about

  • The previous result is passed as the final argument of a function; i.e. aggregate would be like; where X is taken from the line above:

    aggregate split=[] calcs X
  • CTE syntax — something like table =?

  • Raw syntax — I think we should have backticks represent raw SQL; i.e. UPPER could be defined as:

    func upper col = `UPPER(`col`)`
    # or with f-string-like syntax
    func upper col = `UPPER({col})`
    # or with " rather than `
    func upper col = "UPPER({col})"
  • Arrays — PRQL is in part inspired by DataPipes.jl, which demonstrates how effective point-free pipelines can be (Chain.jl is similar). One benefit of this is how well it deals with arbitrarily nested pipelines — which are difficult to read in SQL and even in jq. Could we do something similar for nested data in PRQL?

    • Here's a snippet from DataPipes.jl — and we could avoid the macros / do / end):

      @p begin
        text
        strip
        split(__, "\n")
        map() do __
            collect
            map() do __
              __ == chars[begin] ? 1 : 0
            end
        end
        hcat(__...)'
      end
  • Partials — how functional do we want to make the lang? e.g. should we have partial functions? e.g. [now based on an old version of window] potentially we don't need the col in lag here?

    func lag col = window col split:sec_id sort:date lag:1
  • Boolean logic — how should we represent boolean logic like or? With some or function that takes *args (which we don't currently have a design for)? Or implement dyadic operators; either or or ||? (Same for not)

  • from — do we need from? A previous version of this proposal didn't require this — just start with the table name. But some initial feedback was that removing from made it less clear.

  • Readme syntax — we can't get syntax highlighting in GitHub's markdown — is there a solution to this aside from submitting a parser to GitHub / screenshots / creating a website?

    • Currently we use elm as it coincidentally provides the best syntax highlight (open to suggestions for others!).
  • In advance of a full parser & compiler, could we use something like Codex to generate the transformations, and let us explore the space? We can provide our owen examples, by using fine-tuning. Changing examples is easier than changing compilers!

Inspired by

  • dplyr is a beautiful language for manipulating data, in R. It's the most similar language to PRQL.
  • Julia's DataPipes.jl & Chain.jl, which demonstrate how effective point-free pipelines can be, and how line-breaks can work as pipes.
  • Ocaml's elegant and simple syntax.
You might also like...
Ruxnasm is an assembler for Uxntal — a programming language for the Uxn stack-machine by Hundred Rabbits

Ruxnasm is an assembler for Uxntal — a programming language for the Uxn stack-machine by Hundred Rabbits. Ruxnasm strives to be an alternative to Uxnasm, featuring more user-friendly error reporting, warnings, and helpful hints, reminiscent of those seen in modern compilers for languages such as Rust or Elm.

A Simple, But amazing telegram bot, Made using the Rust language!

Telegram bot in Rust A fun Telegram bot made using Rust language.

a simple compiled language i made in rust. it uses intermediate representation (IR) instead of an abstract syntax tree (AST).

a simple compiled language i made in rust. it uses intermediate representation (IR) instead of an abstract syntax tree (AST).

A compiler for the esoteric language ℂ.

The ℂ Programming Language It's a language where the only types are "complex number" and "matrix of complex numbers". In particular, this means you ca

Frame is a markdown language for creating state machines (automata) in 7 programming languages as well as generating UML documentation.

Frame Language Transpiler v0.5.1 Hi! So very glad you are interested in Frame. Frame system design markdown language for software architects and engin

beat saber is a strongly typed, self-documenting and highly performant programming language

beatsaber beat saber is a strongly typed, self-documenting and highly performant programming language. With beat saber we aimed to create a language t

Analogous, indented syntax for the Rust programming language.

Note: After experimenting with this in the wild, I have found representing keywords as symbols to be far less readable in large codebases. Additionall

A cell-based esoteric programming language

Tape A cell-based esoteric programming language Tape is a cell-based, brainfuck-like programming language that has a readable syntax and a non-wasted

 RusTiny -- A Rust implementation of Tiny+ language
RusTiny -- A Rust implementation of Tiny+ language

RusTiny -- A Rust implementation of Tiny+ language 编译器实践 基本要求: 参考《编译原理及实践》的TINY语言编译器(已上传到群中)完成TINY+ 语言(见附录 A)的解释器:即给定满足 TINY+语言的源代码输入,你的解 释器可以给出对其的解释执

Releases(0.3.1)
Owner
Maximilian Roos
Maximilian Roos
tr-lang is a language that aims to bring programming language syntax closer to Turkish.

tr-lang Made with ❤️ in ???? tr-lang is a language that aims to bring programming language syntax closer to Turkish. tr-lang is a stack based language

Kerem Göksu 10 Apr 2, 2022
A language server implementation for the WGSL shading language

wgsl-analyzer wgsl-analyzer is a language server plugin for the WGSL Shading language. It comes with a VS Code plugin located in ./editors/code, but d

null 155 Jan 2, 2023
The SATySFi Language Server

[WIP] SATySFi Language Server This repository is work-in-progress yet. Features Kind Function Done codeAction Add the definition of an undefined comma

monaqa 50 Dec 24, 2022
scraps of a potential language

seaslug small, beautiful, knowable, DOESN'T EXIST YET LOL non-turing complete, verified terminating code placed into well-defined interfaces similar t

Tyler Neely 34 Nov 1, 2022
A language server for lua written in rust

lua-analyzer lua-analyzer is a lsp server for lua. This is mostly for me to learn the lsp protocol and language analysis so suggestions are helpful. T

null 61 Dec 11, 2022
A scripting language that allows complex key remapping on Linux.

Map2 A scripting language that allows complex key remapping on Linux, written in Rust. All of the functionality related to interacting with graphical

Matt 99 Dec 6, 2022
Uindex is a data store, for data that can be parsed as sentences in some context-free language.

Uindex - Universal index Uindex is a data store, for data that can be parsed as sentences in some context-free language.

Enrique Pérez Arnaud 3 Jul 20, 2021
A simple programming language for everyone.

Slang A simple programming language for everyone, made with Rust. State In very early stages. Plan is to create a byte-code compiler and make that exe

Slang, Inc. 11 Jul 1, 2022
A programming language. Better mantra pending.

Dusk Dusk is a programming language I've been working on on and off for the past while now. It's still very much in its infancy (... a quick look thro

Kaylynn Morgan 14 Oct 24, 2022
very cool esoteric language pls use

okfrick has one memory pointer has less than 5 characters hopefully works well is turing complete (possibly) + - increase memeory pointer value ( - st

null 3 Jun 24, 2021