# Factor Expr

| Factor Expression | | Historical Data | | Factor Values |
|---|---|---|---|---|
| (TSLogReturn 30 :close) | + | 2019-12-27~2020-01-14.pq | = | [0.01, 0.035, ...] |

Extremely fast factor expression & computation library for quantitative trading in Python.

On a server with an E7-4830 CPU (16 cores, 2000MHz), computing 48 factors over a dataset with 24.5M rows x 683 columns (12GB) takes 150s.

Join Discussions for Q&A and feature proposals!

## Features

- Express factors in S-Expression.
- Compute factors in parallel over multiple factors and multiple datasets.

## Usage

There are three steps to use this library.

- Prepare the datasets into files. Currently, only the Parquet format is supported.
- Define factors using S-Expression.
- Run `replay` to compute the factors on the dataset.

### 1. Prepare the dataset

A dataset is a tabular format with float64 columns and arbitrary column names. Each row in the dataset represents a tick, e.g. for a daily dataset, each row is one day. For example, here is an OHLC candle dataset representing 2 ticks:

```
import pandas as pd

# Two ticks of OHLC data; every column is float64.
df = pd.DataFrame({
    "open": [3.1, 5.8],
    "high": [8.8, 7.7],
    "low": [1.1, 2.1],
    "close": [4.4, 3.4],
})
```

You can use the following code to store the DataFrame into a Parquet file:

`df.to_parquet("data.pq")`

### 2. Define your factors

`Factor Expr` uses S-Expressions to describe a factor. For example, on a daily OHLC dataset, the 30-day log return on the column `close` is expressed as:

```
from factor_expr import Factor
Factor("(TSLogReturn 30 :close)")
```

Note that in `Factor Expr`, column names are referenced using the `:column-name` syntax.
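To make the meaning of `(TSLogReturn 30 :close)` concrete, here is a minimal pure-Python sketch of the quantity it describes: the log of the current value divided by the value a fixed number of ticks back. The helper `ts_log_return` is hypothetical and only illustrates the formula. It is not the library's implementation, and the NaN handling during warm-up is an assumption based on the warm-up behaviour described later in this document.

```python
import math


def ts_log_return(values, n):
    """Sketch: log(values[t] / values[t - n]), NaN where no value n ticks back exists."""
    out = []
    for t, current in enumerate(values):
        if t < n:
            out.append(math.nan)  # warm-up: nothing to compare against yet
        else:
            out.append(math.log(current / values[t - n]))
    return out


close = [4.0, 4.4, 3.4, 5.0]
returns = ts_log_return(close, 2)  # hypothetical 2-tick log return on `close`
```

With a window of 2, the first two entries are NaN and the third compares `3.4` against `4.0`.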

### 3. Compute the factors on the prepared dataset

Following steps 1 and 2, you can now compute the factors using the `replay` function:

```
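from factor_expr import Factor, replay

# `replay` is a coroutine, so it must be awaited from an async context
# (top-level `await` works in Jupyter/IPython).
result = await replay(
    ["data.pq"],
    [Factor("(TSLogReturn 30 :close)")],
)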
```

The first parameter of `replay` is a list of dataset files and the second parameter is a list of Factors. This gives you the ability to compute multiple factors on multiple datasets. Don't worry about the performance! `Factor Expr` lets you parallelize the computation over the factors as well as the datasets by setting `n_factor_jobs` and `n_data_jobs` in the `replay` function.

The returned result is a pandas DataFrame with factors as the column names and `time` as the index. If multiple datasets are passed in, the results are concatenated in the exact order of the datasets. This is useful if your dataset is split across files, e.g. one file for each year.

For example, the code above will give you a DataFrame that looks similar to this:

| index | (TSLogReturn 30 :close) |
|---|---|
| 0 | 0.23 |
| ... | ... |

Check out the docstring of `replay` for more information!
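Because `replay` is declared with `async def`, a bare `await` only works where an event loop is already running, such as a notebook. In a plain script, one option is to wrap the call in a coroutine and hand it to `asyncio.run`. The sketch below uses a hypothetical stand-in, `fake_replay`, so that it runs without the library or any Parquet files. In real code you would call `replay` with `Factor` objects instead.

```python
import asyncio


async def fake_replay(files, factors):
    """Hypothetical stand-in for factor_expr.replay; it only echoes its inputs."""
    await asyncio.sleep(0)  # yield to the event loop, as a real coroutine would
    return {"files": list(files), "factors": list(factors)}


async def main():
    # In real code: `return await replay([...], [Factor("...")])`
    return await fake_replay(["2019.pq", "2020.pq"], ["(TSLogReturn 30 :close)"])


# asyncio.run creates the event loop, runs `main` to completion and closes the loop.
result = asyncio.run(main())
```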

## Installation

`pip install factor-expr`

## Supported Functions

Notations:

- `<const>` means a constant, e.g. `3`.
- `<expr>` means either a constant or an S-Expression or a column name, e.g. `3` or `(+ :close 3)` or `:open`.

Here's the full list of supported functions. If you didn't find one you need, consider asking on Discussions or creating a PR!

### Arithmetics

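- Addition: `(+ <expr> <expr>)`
- Subtraction: `(- <expr> <expr>)`
- Multiplication: `(* <expr> <expr>)`
- Division: `(/ <expr> <expr>)`
- Power: `(^ <const> <expr>)` - compute `<expr> ^ <const>`
- Negation: `(Neg <expr>)`
- Signed Power: `(SPow <const> <expr>)` - compute `sign(<expr>) * abs(<expr>) ^ <const>`
- Natural Logarithm after Absolute: `(LogAbs <expr>)`
- Sign: `(Sign <expr>)`
- Abs: `(Abs <expr>)`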
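The signed power formula above is a common way to apply a fractional exponent to values that may be negative. As a rough illustration, the sketch below evaluates it in plain Python. `signed_power` is a hypothetical helper and not part of the library, and returning 0 for a zero input is an assumption that follows from `sign(0) = 0`.

```python
import math


def signed_power(x, exponent):
    """sign(x) * abs(x) ** exponent, following the Signed Power formula."""
    if x == 0:
        return 0.0  # sign(0) is 0, so the product is 0
    # copysign attaches the sign of x to the magnitude abs(x) ** exponent
    return math.copysign(abs(x) ** exponent, x)


values = [-4.0, 0.0, 9.0]
roots = [signed_power(v, 0.5) for v in values]  # signed square roots
```

A plain `x ** 0.5` would not yield a real number for `-4.0`, which is why the sign is factored out first.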

### Logics

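Any `<expr>` larger than 0 is treated as `true`.

- If: `(If <expr> <expr> <expr>)` - if the first `<expr>` is larger than 0, return the second `<expr>`, otherwise return the third `<expr>`
- And: `(And <expr> <expr>)`
- Or: `(Or <expr> <expr>)`
- Less Than: `(< <expr> <expr>)`
- Less Than or Equal: `(<= <expr> <expr>)`
- Greater Than: `(> <expr> <expr>)`
- Greater Than or Equal: `(>= <expr> <expr>)`
- Equal: `(== <expr> <expr>)`
- Not: `(! <expr>)`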
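The "larger than 0 is true" rule is easiest to see on `If`. The sketch below mimics, in plain Python, what an expression such as `(If (- :close :open) :close :open)` would select tick by tick. `sexpr_if` is a hypothetical helper written for illustration and says nothing about how the library evaluates the expression internally.

```python
def sexpr_if(condition, when_true, when_false):
    """Per tick: pick when_true where condition > 0, otherwise when_false."""
    return [t if c > 0 else f for c, t, f in zip(condition, when_true, when_false)]


close = [4.4, 3.4, 5.0]
open_ = [3.1, 5.8, 5.0]
diff = [c - o for c, o in zip(close, open_)]  # positive, negative, zero
picked = sexpr_if(diff, close, open_)
```

Note that a condition of exactly 0 counts as false, so the last tick takes the third argument.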

### Window Functions

All the window functions take a window size as the first argument. The computation will be done on the look-back window with the size given in `<const>`.

- Sum of the window elements: `(TSSum <const> <expr>)`
- Mean of the window elements: `(TSMean <const> <expr>)`
- Min of the window elements: `(TSMin <const> <expr>)`
- Max of the window elements: `(TSMax <const> <expr>)`
- The index of the min of the window elements: `(TSArgMin <const> <expr>)`
- The index of the max of the window elements: `(TSArgMax <const> <expr>)`
- Stdev of the window elements: `(TSStd <const> <expr>)`
- Skew of the window elements: `(TSSkew <const> <expr>)`
- The rank (ascending) of the current element in the window: `(TSRank <const> <expr>)`
- The value `<const>` ticks back: `(Delay <const> <expr>)`
- The log return of the value `<const>` ticks back to current value: `(TSLogReturn <const> <expr>)`
- Rolling correlation between two series: `(TSCorrelation <const> <expr> <expr>)`
- Rolling quantile of a series: `(TSQuantile <const> <const> <expr>)`, e.g. `(TSQuantile 100 0.5 <expr>)` computes the median of a window sized 100.
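As a rough mental model for the simpler window functions, the sketch below applies a reducer to each full look-back window in plain Python. The helper `rolling` is hypothetical, and the assumption that the window includes the current tick is inferred from the warm-up description in this document rather than from the library's source.

```python
def rolling(values, window, reduce_fn):
    """Apply reduce_fn to each full look-back window, current element included."""
    return [
        reduce_fn(values[end - window:end])
        for end in range(window, len(values) + 1)
    ]


close = [4.4, 3.4, 5.0, 4.0]
ts_sum = rolling(close, 2, sum)                         # like (TSSum 2 :close)
ts_max = rolling(close, 2, max)                         # like (TSMax 2 :close)
ts_mean = rolling(close, 2, lambda w: sum(w) / len(w))  # like (TSMean 2 :close)
```

This version emits one value per full window, so the output is shorter than the input by `window - 1` elements.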

#### Warm-up Period for Window Functions

Factors containing window functions require a warm-up period. For example, `(TSSum 10 :close)` will not generate data until the 10th tick is replayed. In this case, `replay` writes `NaN` into the result during the warm-up period, until the factor starts to produce data. This ensures that the factor output has the same length as the input dataset. You can use the `trim` parameter to let `replay` trim off the warm-up period before it returns.
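The padding and trimming behaviour can be pictured with a small pure-Python rolling sum. This is only a sketch: `ts_sum_padded` is a hypothetical helper, and slicing at `window - 1` is an assumption about what trimming amounts to for a single window function.

```python
import math


def ts_sum_padded(values, window):
    """Rolling sum with NaN during warm-up, so output length matches the input."""
    out = []
    for t in range(len(values)):
        if t + 1 < window:
            out.append(math.nan)  # fewer than `window` ticks seen so far
        else:
            out.append(sum(values[t + 1 - window:t + 1]))
    return out


close = [1.0, 2.0, 3.0, 4.0, 5.0]
padded = ts_sum_padded(close, 3)  # same length as `close`, NaN for the first 2 ticks
ready = 3 - 1                     # assumed first index after the warm-up period
trimmed = padded[ready:]          # what trimming the warm-up period would leave
```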

## Factors Failed to Compute

`Factor Expr` guarantees that no `inf`, `-inf` or `NaN` appears in the result, except during the warm-up period. However, sometimes a factor can fail due to numerical issues. For example, `(Pow 3 (Pow 3 (Pow 3 :volume)))` might overflow and become `inf`, and `inf / inf` will become `NaN`. `Factor Expr` detects these situations and marks these factors as failed. Failed factors are still returned in the replay result, but every value in that column is `NaN`. You can easily remove these failed factors from the result by using `pd.DataFrame.dropna(axis=1, how="all")`.
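The failure handling can be sketched without pandas by treating the result as a dictionary of columns. Both helpers below, `mark_failed` and `drop_failed`, are hypothetical and only mirror the behaviour described above: a column with any non-finite value after the warm-up period becomes all `NaN`, and all-`NaN` columns are then dropped, as `dropna(axis=1, how="all")` would do on a DataFrame.

```python
import math


def mark_failed(column, ready_offset):
    """Return the column unchanged, or all-NaN if any post-warm-up value is non-finite."""
    if any(not math.isfinite(v) for v in column[ready_offset:]):
        return [math.nan] * len(column)
    return column


def drop_failed(result):
    """Drop columns that are entirely NaN, like dropna(axis=1, how="all")."""
    return {
        name: col
        for name, col in result.items()
        if not all(math.isnan(v) for v in col)
    }


result = {
    "ok": mark_failed([math.nan, 0.1, 0.2], ready_offset=1),
    "overflowed": mark_failed([math.nan, math.inf, 0.3], ready_offset=1),
}
kept = drop_failed(result)
```

The `NaN` in the warm-up slot of the `ok` column does not count as a failure, because only values after `ready_offset` are checked.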

## I Want to Have a Time Index for the Result

The `replay` function optionally accepts an `index_col` parameter. If you want to set a column from the dataset as the index of the returned result, you can do the following:

```
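from datetime import datetime

import pandas as pd
from factor_expr import Factor, replay

# Store a dataset that carries a datetime column next to the float64 columns.
pd.DataFrame({
    "time": [datetime(2021, 4, 23), datetime(2021, 4, 24)],
    "open": [3.1, 5.8],
    "high": [8.8, 7.7],
    "low": [1.1, 2.1],
    "close": [4.4, 3.4],
}).to_parquet("data.pq")

# Use the `time` column as the index of the returned result.
result = await replay(
    ["data.pq"],
    [Factor("(TSLogReturn 30 :close)")],
    index_col="time",
)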
```

Note that accessing the `time` column from factor expressions will cause an error. Factor expressions can only read `float64` columns.

## API

There are two components in `Factor Expr`: a `Factor` class and a `replay` function.

### Factor

The factor class takes an S-Expression to construct. It has the following signature:

```
from typing import List


class Factor:
    def __init__(self, sexpr: str) -> None:
        """Construct a Factor using an S-Expression"""

    def ready_offset(self) -> int:
        """Returns the first index after the warm-up period.
        For non-window functions, this will always return 0."""

    def __len__(self) -> int:
        """Returns how many subtrees are contained in this factor tree.

        Example
        -------
        `(+ (/ :close :open) :high)` has 5 subtrees, namely:
        1. (+ (/ :close :open) :high)
        2. (/ :close :open)
        3. :close
        4. :open
        5. :high
        """

    def __getitem__(self, i: int) -> "Factor":
        """Get the i-th subtree of the sequence from the pre-order traversal of the factor tree.

        Example
        -------
        `(+ (/ :close :open) :high)` is traversed as:
        0. (+ (/ :close :open) :high)
        1. (/ :close :open)
        2. :close
        3. :open
        4. :high
        Consequently, f[2] will give you `Factor(":close")`.
        """

    def depth(self) -> int:
        """How deep is this factor tree.

        Example
        -------
        `(+ (/ :close :open) :high)` has a depth of 2, namely:
        1. (+ (/ :close :open) :high)
        2. (/ :close :open)
        """

    def child_indices(self) -> List[int]:
        """The indices for the children of this factor tree.

        Example
        -------
        The child_indices result of `(+ (/ :close :open) :high)` is [1, 4]
        """

    def replace(self, i: int, other: "Factor") -> "Factor":
        """Replace the i-th node with another subtree.

        Example
        -------
        `Factor("(+ (/ :close :open) :high)").replace(4, Factor("(- :high :low)")) == Factor("(+ (/ :close :open) (- :high :low))")`
        """

    def columns(self) -> List[str]:
        """Return all the columns that are used by this factor.

        Example
        -------
        `(+ (/ :close :open) :high)` uses [:close, :open, :high].
        """

    def clone(self) -> "Factor":
        """Create a copy of itself."""
```
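The indexing scheme used by `__len__`, `__getitem__` and `child_indices` follows a pre-order traversal of the expression tree. The sketch below reproduces that numbering in plain Python for the docstring example. All helpers (`parse`, `operands`, `preorder`, `child_indices`) are hypothetical and written for illustration. The sketch treats every argument as a subtree, so it may count differently from the library for window functions, where the window size is a constant rather than an expression.

```python
def parse(sexpr):
    """Parse an S-Expression string into nested lists of tokens."""
    tokens = sexpr.replace("(", " ( ").replace(")", " ) ").split()

    def read(pos):
        if tokens[pos] != "(":
            return tokens[pos], pos + 1  # leaf: a constant or a :column
        node, pos = [], pos + 1
        while tokens[pos] != ")":
            child, pos = read(pos)
            node.append(child)
        return node, pos + 1  # skip the closing parenthesis

    tree, _ = read(0)
    return tree


def operands(tree):
    """Children of a node: every list element after the function name."""
    return tree[1:] if isinstance(tree, list) else []


def preorder(tree):
    """All subtrees in pre-order: the node itself, then each operand's subtrees."""
    out = [tree]
    for child in operands(tree):
        out.extend(preorder(child))
    return out


def child_indices(tree):
    """Pre-order indices of the direct children of the root."""
    indices, next_index = [], 1
    for child in operands(tree):
        indices.append(next_index)
        next_index += len(preorder(child))  # skip over the child's own subtrees
    return indices


tree = parse("(+ (/ :close :open) :high)")
subtrees = preorder(tree)
```

Here `subtrees` has five entries, `subtrees[2]` is `:close`, and `child_indices(tree)` yields `[1, 4]`, matching the docstring examples above.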

### replay

Replay has the following signature:

```
from typing import Iterable, List, Literal, Optional, Union

import pandas as pd
import pyarrow as pa


async def replay(
    files: Iterable[str],
    factors: List[Factor],
    *,
    predicate: Optional[Factor] = None,
    batch_size: int = 40960,
    n_data_jobs: int = 1,
    n_factor_jobs: int = 1,
    pbar: bool = True,
    trim: bool = False,
    index_col: Optional[str] = None,
    verbose: bool = False,
    output: Literal["pandas", "pyarrow", "raw"] = "pandas",
) -> Union[pd.DataFrame, pa.Table]:
    """
    Replay a list of factors on a bunch of data.

    Parameters
    ----------
    files: Iterable[str]
        Paths to the datasets. Currently, only parquet format is supported.
    factors: List[Factor]
        A list of Factors to replay on the given set of files.
    predicate: Optional[Factor] = None
        Use a predicate to pre-filter the replay result. Any value larger than 0 is treated as True.
    batch_size: int = 40960
        How many rows to replay at one time. Default is 40960 rows.
    n_data_jobs: int = 1
        How many datasets to run in parallel. Note that the factor level parallelism is controlled by n_factor_jobs.
    n_factor_jobs: int = 1
        How many factors to run in parallel for **each** dataset.
        e.g. if `n_data_jobs=3` and `n_factor_jobs=5`, you will have 3 * 5 threads running concurrently.
    pbar: bool = True
        Whether to show the progress bar using tqdm.
    trim: bool = False
        Whether to trim the warm-up period off from the result.
    index_col: Optional[str] = None
        Set the index column.
    verbose: bool = False
        If True, failed factors will be printed out in stderr.
    output: Literal["pandas", "pyarrow", "raw"] = "pandas"
        The return format, can be pandas DataFrame ("pandas") or pyarrow Table ("pyarrow") or un-concatenated pyarrow Tables ("raw").
    """
```