# Factor Expr

Factor Expression + Historical Data = Factor Values

`(TSLogReturn 30 :close)` + `2019-12-27~2020-01-14.pq` = `[0.01, 0.035, ...]`

An extremely fast factor expression and computation library for quantitative trading in Python.

On a server with an E7-4830 CPU (16 cores, 2000 MHz), computing 48 factors over a 12 GB dataset of 24.5M rows x 683 columns takes 150s.

Join [Discussions] for Q&A and feature proposals!

## Features

• Express factors in S-Expressions.
• Compute factors in parallel, over multiple factors and multiple datasets.

## Usage

There are three steps to use this library.

1. Prepare the datasets into files. Currently, only the Parquet format is supported.
2. Define factors using S-Expression.
3. Run `replay` to compute the factors on the dataset.

### 1. Prepare the dataset

A dataset is a table of float64 columns with arbitrary column names. Each row in the dataset represents a tick, e.g. in a daily dataset, each row is one day. For example, here is an OHLC candle dataset representing 2 ticks:

```python
import pandas as pd

df = pd.DataFrame({
    "open": [3.1, 5.8],
    "high": [8.8, 7.7],
    "low": [1.1, 2.1],
    "close": [4.4, 3.4],
})
```

You can use the following code to store the DataFrame into a Parquet file:

`df.to_parquet("data.pq")`

### 2. Define your factors

`Factor Expr` uses S-Expressions to describe factors. For example, on a daily OHLC dataset, the 30-day log return on the column `close` is expressed as:

```python
from factor_expr import Factor

Factor("(TSLogReturn 30 :close)")
```

Note that in `Factor Expr`, column names are referred to by the `:column-name` syntax.
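For illustration only (this is not the library's parser, just a sketch of the syntax), the `:column-name` references in an expression string can be picked out with a short regular expression:

```python
import re

def referenced_columns(sexpr: str) -> list:
    """Extract :column-name references from an S-Expression string.
    Illustrative sketch only -- not the library's actual parser."""
    # A column reference is a colon followed by a name (letters, digits, -, _).
    return re.findall(r":([A-Za-z0-9_-]+)", sexpr)

print(referenced_columns("(+ (/ :close :open) :high)"))  # ['close', 'open', 'high']
```

The real library exposes the same information through `Factor.columns()` (see the API section).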

### 3. Compute the factors on the prepared dataset

Following steps 1 and 2, you can now compute the factors using the `replay` function:

```python
from factor_expr import Factor, replay

result = await replay(
    ["data.pq"],
    [Factor("(TSLogReturn 30 :close)")],
)
```

The first parameter of `replay` is a list of dataset files and the second parameter is a list of factors, so you can compute multiple factors on multiple datasets. Don't worry about performance: `Factor Expr` lets you parallelize the computation over both the factors and the datasets by setting `n_factor_jobs` and `n_data_jobs` in the `replay` function.

The returned result is a pandas DataFrame with the factor expressions as column names and `time` as the index. If multiple datasets are passed in, the results are concatenated in the exact order of the datasets. This is useful if your dataset is scattered across files, e.g. one file per year.
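The order-preserving concatenation can be pictured with plain pandas. A minimal sketch of that guarantee (the per-file result frames below are made up for illustration):

```python
import pandas as pd

# Hypothetical per-file factor results, e.g. one DataFrame per year of data.
res_2019 = pd.DataFrame({"(TSLogReturn 30 :close)": [0.01, 0.02]})
res_2020 = pd.DataFrame({"(TSLogReturn 30 :close)": [0.03, 0.04]})

# replay(["2019.pq", "2020.pq"], ...) stitches the results together in the
# order the files were passed, equivalent to:
combined = pd.concat([res_2019, res_2020], ignore_index=True)
print(combined["(TSLogReturn 30 :close)"].tolist())  # [0.01, 0.02, 0.03, 0.04]
```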

For example, the code above will give you a DataFrame that looks similar to this:

| index | (TSLogReturn 30 :close) |
|-------|-------------------------|
| 0     | 0.23                    |
| ...   | ...                     |

Check out the docstring of `replay` for more information!

## Installation

`pip install factor-expr`

## Supported Functions

Notations:

• `<const>` means a constant, e.g. `3`.
• `<expr>` means either a constant, an S-Expression, or a column name, e.g. `3`, `(+ :close 3)`, or `:open`.

Here's the full list of supported functions. If you don't find the one you need, consider asking on Discussions or creating a PR!

### Arithmetics

• Addition: `(+ <expr> <expr>)`
• Subtraction: `(- <expr> <expr>)`
• Multiplication: `(* <expr> <expr>)`
• Division: `(/ <expr> <expr>)`
• Power: `(^ <const> <expr>)` - compute `<expr> ^ <const>`
• Negation: `(Neg <expr>)`
• Signed Power: `(SPow <const> <expr>)` - compute `sign(<expr>) * abs(<expr>) ^ <const>`
• Natural Logarithm after Absolute: `(LogAbs <expr>)`
• Sign: `(Sign <expr>)`
• Abs: `(Abs <expr>)`

### Logics

Any value larger than 0 is treated as `true`.

• If: `(If <expr> <expr> <expr>)` - if the first argument is larger than 0, returns the second argument, otherwise the third
• And: `(And <expr> <expr>)`
• Or: `(Or <expr> <expr>)`
• Less Than: `(< <expr> <expr>)`
• Less Than or Equal: `(<= <expr> <expr>)`
• Greater Than: `(> <expr> <expr>)`
• Greater Than or Equal: `(>= <expr> <expr>)`
• Equal: `(== <expr> <expr>)`
• Not: `(! <expr>)`

### Window Functions

All the window functions take a window size as the first argument. The computation is done over a look-back window of the size given in `<const>`.

• Sum of the window elements: `(TSSum <const> <expr>)`
• Mean of the window elements: `(TSMean <const> <expr>)`
• Min of the window elements: `(TSMin <const> <expr>)`
• Max of the window elements: `(TSMax <const> <expr>)`
• The index of the min of the window elements: `(TSArgMin <const> <expr>)`
• The index of the max of the window elements: `(TSArgMax <const> <expr>)`
• Stdev of the window elements: `(TSStd <const> <expr>)`
• Skew of the window elements: `(TSSkew <const> <expr>)`
• The rank (ascending) of the current element in the window: `(TSRank <const> <expr>)`
• The value `<const>` ticks back: `(Delay <const> <expr>)`
• The log return from the value `<const>` ticks back to the current value: `(TSLogReturn <const> <expr>)`
• Rolling correlation between two series: `(TSCorrelation <const> <expr> <expr>)`
• Rolling quantile of a series: `(TSQuantile <const> <const> <expr>)`, e.g. `(TSQuantile 100 0.5 <expr>)` computes the median over a window of size 100.
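For intuition, most of these window functions correspond to pandas rolling operations. A sketch cross-checking the semantics of `TSMean` and `TSQuantile` with pandas, under the assumption (consistent with the warm-up behaviour described below) that the window covers the current tick and the preceding `<const> - 1` ticks:

```python
import numpy as np
import pandas as pd

close = pd.Series(np.arange(1.0, 11.0))  # 10 ticks: 1.0 .. 10.0

# (TSMean 5 :close) ~ rolling mean over the last 5 ticks (assumption:
# the window includes the current tick).
ts_mean_5 = close.rolling(5).mean()

# (TSQuantile 5 0.5 :close) ~ rolling median over the last 5 ticks.
ts_median_5 = close.rolling(5).quantile(0.5)

print(ts_mean_5.iloc[4])     # 3.0, the mean of 1..5
print(ts_median_5.iloc[-1])  # 8.0, the median of 6..10
```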

#### Warm-up Period for Window Functions

Factors containing window functions require a warm-up period. For example, `(TSSum 10 :close)` will not generate data until the 10th tick has been replayed. During this warm-up period, `replay` writes `NaN` into the result until the factor starts to produce data, so the length of the factor output is the same as the length of the input dataset. You can use the `trim` parameter to have `replay` trim off the warm-up period before it returns.
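The warm-up behaviour mirrors what pandas does with rolling windows. A sketch of the `NaN` padding and the effect of trimming (pandas is used here only to illustrate; `replay` does this natively via `trim=True`):

```python
import numpy as np
import pandas as pd

close = pd.Series(np.arange(1.0, 13.0))  # 12 ticks: 1.0 .. 12.0

# Like (TSSum 10 :close): no output until the 10th tick is replayed.
ts_sum_10 = close.rolling(10).sum()
assert ts_sum_10.isna().sum() == 9      # 9 warm-up NaNs before the 10th tick
assert len(ts_sum_10) == len(close)     # output length == input length

# trim=True drops the warm-up rows, akin to:
trimmed = ts_sum_10.iloc[9:]
print(trimmed.iloc[0])  # 55.0, the sum of ticks 1..10
```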

## Factors That Fail to Compute

`Factor Expr` guarantees that no `inf`, `-inf`, or `NaN` appears in the result outside of the warm-up period. However, a factor can still fail due to numerical issues. For example, `(Pow 3 (Pow 3 (Pow 3 :volume)))` might overflow to `inf`, and operations on `inf` (e.g. `inf - inf`) then produce `NaN`. `Factor Expr` detects these situations and marks such factors as failed. Failed factors are still returned in the replay result, but every value in that column will be `NaN`. You can easily remove them from the result using `pd.DataFrame.dropna(axis=1, how="all")`.
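Dropping failed (all-`NaN`) factor columns, sketched on a made-up result frame:

```python
import numpy as np
import pandas as pd

# Hypothetical replay result: the second factor failed and is all NaN.
result = pd.DataFrame({
    "(TSLogReturn 30 :close)": [0.01, 0.02, 0.03],
    "(Pow 3 (Pow 3 (Pow 3 :volume)))": [np.nan, np.nan, np.nan],
})

# how="all" drops only columns where every value is NaN, so partially
# NaN columns (e.g. warm-up periods) survive.
cleaned = result.dropna(axis=1, how="all")
print(list(cleaned.columns))  # ['(TSLogReturn 30 :close)']
```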

## I Want to Have a Time Index for the Result

The `replay` function optionally accepts an `index_col` parameter. If you want to set a column from the dataset as the index of the returned result, you can do the following:

```python
from datetime import datetime

import pandas as pd

from factor_expr import Factor, replay

pd.DataFrame({
    "time": [datetime(2021, 4, 23), datetime(2021, 4, 24)],
    "open": [3.1, 5.8],
    "high": [8.8, 7.7],
    "low": [1.1, 2.1],
    "close": [4.4, 3.4],
}).to_parquet("data.pq")

result = await replay(
    ["data.pq"],
    [Factor("(TSLogReturn 30 :close)")],
    index_col="time",
)
```

Note that accessing the `time` column from factor expressions will cause an error: factor expressions can only read `float64` columns.
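Because factor expressions can only read `float64` columns, it may help to cast any numeric input columns explicitly before writing the Parquet file. A sketch (the integer `volume` column here is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "close": [4.4, 3.4],
    "volume": [100, 250],  # int64 -- not readable by factor expressions
})

# Cast factor input columns to float64; leave index columns (e.g. time) alone.
df["volume"] = df["volume"].astype("float64")
print(df.dtypes["volume"])  # float64
```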

## API

There are two components in `Factor Expr`: a `Factor` class and a `replay` function.

### Factor

The `Factor` class is constructed from an S-Expression. It has the following signature:

```python
class Factor:
    def __init__(self, sexpr: str) -> None:
        """Construct a Factor from an S-Expression."""

    def ready_offset(self) -> int:
        """Return the first index after the warm-up period.
        For non-window functions, this is always 0."""

    def __len__(self) -> int:
        """Return how many subtrees are contained in this factor tree.

        Example
        -------
        `(+ (/ :close :open) :high)` has 5 subtrees, namely:
        1. (+ (/ :close :open) :high)
        2. (/ :close :open)
        3. :close
        4. :open
        5. :high
        """

    def __getitem__(self, i: int) -> Factor:
        """Get the i-th subtree from the pre-order traversal of the factor tree.

        Example
        -------
        `f = Factor("(+ (/ :close :open) :high)")` is traversed as:
        0. (+ (/ :close :open) :high)
        1. (/ :close :open)
        2. :close
        3. :open
        4. :high

        Consequently, `f[2]` will give you `Factor(":close")`.
        """

    def depth(self) -> int:
        """How deep this factor tree is.

        Example
        -------
        `(+ (/ :close :open) :high)` has a depth of 2, namely:
        1. (+ (/ :close :open) :high)
        2. (/ :close :open)
        """

    def child_indices(self) -> List[int]:
        """The indices of the children of this factor tree.

        Example
        -------
        The child_indices result of `(+ (/ :close :open) :high)` is [1, 4].
        """

    def replace(self, i: int, other: Factor) -> Factor:
        """Replace the i-th node with another subtree.

        Example
        -------
        `Factor("(+ (/ :close :open) :high)").replace(4, Factor("(- :high :low)"))
        == Factor("(+ (/ :close :open) (- :high :low))")`
        """

    def columns(self) -> List[str]:
        """Return all the columns used by this factor.

        Example
        -------
        `(+ (/ :close :open) :high)` uses [:close, :open, :high].
        """

    def clone(self) -> Factor:
        """Create a copy of this factor."""
```
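The subtree indexing used by `__getitem__`, `child_indices`, and `replace` follows a pre-order traversal. A standalone sketch of that numbering on the example tree (illustrative only; not the library's implementation):

```python
# Pre-order numbering of (+ (/ :close :open) :high), as nested tuples:
# an operator node is (op, child, child, ...); a leaf is a string.
tree = ("+", ("/", ":close", ":open"), ":high")

def preorder(node):
    """Yield nodes in pre-order: the node itself, then each child's subtree."""
    yield node
    if isinstance(node, tuple):
        for child in node[1:]:
            yield from preorder(child)

nodes = list(preorder(tree))
print(nodes[2])    # ':close' -- matches the __getitem__ example above
print(len(nodes))  # 5 subtrees, matching __len__
```

Under this numbering, the children of the root land at indices 1 and 4, which is exactly the `[1, 4]` returned by `child_indices` in the docstring example.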

### replay

`replay` has the following signature:

```python
async def replay(
    files: Iterable[str],
    factors: List[Factor],
    *,
    predicate: Optional[Factor] = None,
    batch_size: int = 40960,
    n_data_jobs: int = 1,
    n_factor_jobs: int = 1,
    pbar: bool = True,
    trim: bool = False,
    index_col: Optional[str] = None,
    verbose: bool = False,
    output: Literal["pandas", "pyarrow", "raw"] = "pandas",
) -> Union[pd.DataFrame, pa.Table]:
    """
    Replay a list of factors on a bunch of data.

    Parameters
    ----------
    files : Iterable[str]
        Paths to the datasets. Currently, only the Parquet format is supported.
    factors : List[Factor]
        A list of Factors to replay on the given set of files.
    predicate : Optional[Factor] = None
        Use a predicate to pre-filter the replay result.
        Any value larger than 0 is treated as True.
    batch_size : int = 40960
        How many rows to replay at a time. Default is 40960 rows.
    n_data_jobs : int = 1
        How many datasets to run in parallel.
        Note that factor-level parallelism is controlled by n_factor_jobs.
    n_factor_jobs : int = 1
        How many factors to run in parallel for **each** dataset,
        e.g. if `n_data_jobs=3` and `n_factor_jobs=5`, you will have
        3 * 5 threads running concurrently.
    pbar : bool = True
        Whether to show a progress bar using tqdm.
    trim : bool = False
        Whether to trim the warm-up period off the result.
    index_col : Optional[str] = None
        Set the index column.
    verbose : bool = False
        If True, failed factors are printed to stderr.
    output : Literal["pandas", "pyarrow", "raw"] = "pandas"
        The return format: a pandas DataFrame ("pandas"), a pyarrow Table
        ("pyarrow"), or un-concatenated pyarrow Tables ("raw").
    """
```
## Known Limitations

• The current implementation of the `TSRank` function is O(n^2). It would be better to use the O(n log n) algorithm from https://github.com/contribu/rollingrank.