Extremely fast factor expression & computation library for quantitative trading in Python.

Overview

Factor Expr

Factor Expression            Historical Data                Factor Values
(TSLogReturn 30 :close)   +  2019-12-27~2020-01-14.pq   =   [0.01, 0.035, ...]

On a server with an E7-4830 CPU (16 cores, 2000MHz), computing 48 factors over a dataset with 24.5M rows x 683 columns (12GB) takes 150s.

Join [Discussions] for Q&A and feature proposals!

Features

  • Express factors as S-Expressions.
  • Compute factors in parallel over multiple factors and multiple datasets.

Usage

There are three steps to use this library.

  1. Store the datasets as files. Currently, only the Parquet format is supported.
  2. Define factors using S-Expressions.
  3. Run replay to compute the factors on the dataset.

1. Prepare the dataset

A dataset is a table with float64 columns and arbitrary column names. Each row in the dataset represents a tick; for a daily dataset, for instance, each row is one day. Here is an OHLC candle dataset with 2 ticks:

import pandas as pd

df = pd.DataFrame({
    "open": [3.1, 5.8],
    "high": [8.8, 7.7],
    "low": [1.1, 2.1],
    "close": [4.4, 3.4]
})

You can use the following code to store the DataFrame into a Parquet file:

df.to_parquet("data.pq")

2. Define your factors

Factor Expr uses S-Expressions to describe factors. For example, on a daily OHLC dataset, the 30-day log return on the column close is expressed as:

from factor_expr import Factor

Factor("(TSLogReturn 30 :close)")

Note that in Factor Expr, column names are referred to using the :column-name syntax.
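To make the shape of an S-Expression concrete, here is a minimal sketch of how such a string could be read into a nested list. It is not Factor Expr's own parser; the helper names tokenize and parse are hypothetical and only illustrate the syntax: an operator name first, followed by constants, :column-name references, or nested expressions.

```python
from typing import Any, List


def tokenize(sexpr: str) -> List[str]:
    """Split an S-Expression into parentheses and atoms."""
    return sexpr.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens: List[str]) -> Any:
    """Consume tokens from the front and build a nested list (illustrative only)."""
    token = tokens.pop(0)
    if token == "(":
        node = []
        while tokens[0] != ")":
            node.append(parse(tokens))
        tokens.pop(0)  # drop the closing ")"
        return node
    if token == ")":
        raise ValueError("unexpected ')'")
    try:
        return float(token)  # constants such as 30
    except ValueError:
        return token  # operator names and :column-name references


tree = parse(tokenize("(TSLogReturn 30 :close)"))
print(tree)
```

The operator comes first, then the window size, then the column reference, which mirrors how the expression reads.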

3. Compute the factors on the prepared dataset

After steps 1 and 2, you can compute the factors using the replay function:

from factor_expr import Factor, replay

result = await replay(
    ["data.pq"],
    [Factor("(TSLogReturn 30 :close)")]
)

The first parameter of replay is a list of dataset files and the second parameter is a list of Factors. This lets you compute multiple factors on multiple datasets. Don't worry about the performance! Factor Expr lets you parallelize the computation over the factors as well as the datasets by setting n_factor_jobs and n_data_jobs in the replay function.

The returned result is a pandas DataFrame with the factors as the column names and time as the index. If multiple datasets are passed in, the results are concatenated in the exact order of the datasets. This is useful if your dataset is split across files, e.g. one file for each year.

For example, the code above will give you a DataFrame that looks similar to this:

index    (TSLogReturn 30 :close)
0        0.23
...      ...

Check out the docstring of replay for more information!
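The snippets above use a top-level await, which works in a notebook or an async REPL. In a plain script the coroutine has to be driven by an event loop. The following is a minimal sketch of that pattern, assuming the replay signature documented in the API section; the wrapper name compute is hypothetical.

```python
import asyncio


async def compute(files, expressions, n_data_jobs=1, n_factor_jobs=1):
    """Hypothetical wrapper: build Factors from strings and replay them."""
    # Imported lazily so this sketch can be loaded without the package installed.
    from factor_expr import Factor, replay

    factors = [Factor(expression) for expression in expressions]
    return await replay(
        files,
        factors,
        n_data_jobs=n_data_jobs,      # datasets processed in parallel
        n_factor_jobs=n_factor_jobs,  # factors processed in parallel per dataset
    )


# In a script, run the coroutine with asyncio.run, for example:
# result = asyncio.run(compute(["data.pq"], ["(TSLogReturn 30 :close)"]))
```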

Installation

pip install factor-expr

Supported Functions

Notations:

  • <const> means a constant, e.g. 3.
  • <expr> means either a constant or an S-Expression or a column name, e.g. 3 or (+ :close 3) or :open.

Here's the full list of supported functions. If you can't find the one you need, consider asking on Discussions or creating a PR!

Arithmetics

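  • Addition: (+ <expr> <expr>)
  • Subtraction: (- <expr> <expr>)
  • Multiplication: (* <expr> <expr>)
  • Division: (/ <expr> <expr>)
  • Power: (^ <const> <expr>) - compute <expr> ^ <const>
  • Negation: (Neg <expr>)
  • Signed Power: (SPow <const> <expr>) - compute sign(<expr>) * abs(<expr>) ^ <const>
  • Natural Logarithm after Absolute: (LogAbs <expr>)
  • Sign: (Sign <expr>)
  • Abs: (Abs <expr>)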
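Two of the less common operators can be spelled out in plain Python. This is only a reference sketch of the formulas as described in the list (sign times absolute value raised to a power, and the natural logarithm of the absolute value); the function names are hypothetical and this is not the library's implementation.

```python
import math


def sign(x: float) -> float:
    """Return -1.0, 0.0 or 1.0 depending on the sign of x."""
    return float((x > 0) - (x < 0))


def spow(x: float, exponent: float) -> float:
    """Signed power: keeps the sign of x while raising its magnitude."""
    return sign(x) * abs(x) ** exponent


def log_abs(x: float) -> float:
    """Natural logarithm of the absolute value (undefined for x == 0)."""
    return math.log(abs(x))


print(spow(-4.0, 0.5))  # the square root of the magnitude, with the sign kept
print(log_abs(-1.0))
```

Signed power is handy when a fractional exponent has to be applied to values that can be negative, where a plain power would not be a real number.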

Logics

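Any <expr> larger than 0 is treated as true.

  • If: (If <expr> <expr> <expr>) - if the first <expr> is larger than 0, return the second <expr>; otherwise return the third <expr>
  • And: (And <expr> <expr>)
  • Or: (Or <expr> <expr>)
  • Less Than: (< <expr> <expr>)
  • Less Than or Equal: (<= <expr> <expr>)
  • Greater Than: (> <expr> <expr>)
  • Greater Than or Equal: (>= <expr> <expr>)
  • Equal: (== <expr> <expr>)
  • Not: (! <expr>)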
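The truthiness rule can be illustrated with a small sketch. It only models the stated convention (values larger than 0 count as true) and the behaviour of If; the helper names are hypothetical, and the values returned by the library's own comparison operators are not shown here.

```python
def is_true(x: float) -> bool:
    """Values strictly larger than 0 are treated as true."""
    return x > 0


def if_(condition: float, then: float, otherwise: float) -> float:
    """Pick the second argument when the first is true, else the third."""
    return then if is_true(condition) else otherwise


signal = [0.5, -1.0, 0.0, 2.0]
picked = [if_(s, 1.0, -1.0) for s in signal]
print(picked)
```

Note that under this rule 0 itself counts as false.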

Window Functions

All the window functions take a window size as the first argument. The computation will be done on the look-back window with the size given in <const>.

  • Sum of the window elements: (TSSum <const> <expr>)
  • Mean of the window elements: (TSMean <const> <expr>)
  • Min of the window elements: (TSMin <const> <expr>)
  • Max of the window elements: (TSMax <const> <expr>)
  • The index of the min of the window elements: (TSArgMin <const> <expr>)
  • The index of the max of the window elements: (TSArgMax <const> <expr>)
  • Stdev of the window elements: (TSStd <const> <expr>)
  • Skew of the window elements: (TSSkew <const> <expr>)
  • The rank (ascending) of the current element in the window: (TSRank <const> <expr>)
  • The value <const> ticks back: (Delay <const> <expr>)
  • The log return of the value <const> ticks back to current value: (TSLogReturn <const> <expr>)
  • Rolling correlation between two series: (TSCorrelation <const> <expr> <expr>)
  • Rolling quantile of a series: (TSQuantile <const> <const> <expr>), e.g. (TSQuantile 100 0.5 <expr>) computes the median of a window sized 100.
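As a rough illustration of what "look-back window" means, the sketch below slides a fixed-size window over a series in plain Python and applies a few of the reductions listed above. The helper lookback_windows is hypothetical, and the index convention used for the arg-max (position inside the window, counted from the oldest element) is an assumption rather than the library's documented behaviour.

```python
from typing import Iterator, List, Sequence


def lookback_windows(xs: Sequence[float], size: int) -> Iterator[Sequence[float]]:
    """Yield every full window of `size` consecutive elements, oldest first."""
    for end in range(size, len(xs) + 1):
        yield xs[end - size:end]


close: List[float] = [4.0, 2.0, 6.0, 8.0, 5.0]

ts_sum = [sum(w) for w in lookback_windows(close, 3)]
ts_mean = [sum(w) / len(w) for w in lookback_windows(close, 3)]
ts_max = [max(w) for w in lookback_windows(close, 3)]
# Assumed convention: index of the maximum inside the window, oldest element = 0.
ts_argmax = [list(w).index(max(w)) for w in lookback_windows(close, 3)]

print(ts_sum, ts_max, ts_argmax)
```

Only full windows are produced here, so each output list is shorter than the input by the window size minus one.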

Warm-up Period for Window Functions

Factors containing window functions require a warm-up period. For example, (TSSum 10 :close) will not generate data until the 10th tick is replayed. In this case, replay writes NaN into the result during the warm-up period, until the factor starts to produce data. This ensures that the factor output has the same length as the input dataset. You can use the trim parameter to let replay trim off the warm-up period before it returns.
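The padding and trimming behaviour can be mimicked in a few lines of plain Python. This is a sketch of the idea only, using a rolling sum as the window function; the function names are hypothetical and do not correspond to anything in the library.

```python
import math
from typing import List


def ts_sum_padded(window: int, xs: List[float]) -> List[float]:
    """Rolling sum that emits NaN until `window` ticks have been seen."""
    out = []
    for t in range(len(xs)):
        if t + 1 < window:
            out.append(math.nan)  # still warming up
        else:
            out.append(float(sum(xs[t + 1 - window:t + 1])))
    return out


def trim_warmup(values: List[float]) -> List[float]:
    """Drop the leading NaN values produced during the warm-up period."""
    start = 0
    while start < len(values) and math.isnan(values[start]):
        start += 1
    return values[start:]


padded = ts_sum_padded(3, [1.0, 2.0, 3.0, 4.0, 5.0])
trimmed = trim_warmup(padded)
print(padded, trimmed)
```

The padded output has the same length as the input, while the trimmed output starts at the first tick with a full window.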

Factors Failed to Compute

Factor Expr guarantees that no inf, -inf or NaN appears in the result, except during the warm-up period. However, sometimes a factor can fail due to numerical issues. For example, (Pow 3 (Pow 3 (Pow 3 :volume))) might overflow and become inf, and a subsequent operation such as inf / inf will produce NaN. Factor Expr detects these situations and marks these factors as failed. The failed factors are still returned in the replay result, but the values in that column are all NaN. You can easily remove these failed factors from the result by using pd.DataFrame.dropna(axis=1, how="all").
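The following sketch shows the described policy in plain Python: a column that contains a non-finite value after its warm-up period is replaced by all NaN, and all-NaN columns can then be dropped, which is what dropna(axis=1, how="all") does on a DataFrame. The helper names and the dict-of-lists layout are hypothetical stand-ins, not the library's internals.

```python
import math
from typing import Dict, List


def mark_failed(column: List[float], ready_offset: int) -> List[float]:
    """Return the column unchanged, or all NaN if it has inf/NaN after warm-up."""
    if all(math.isfinite(v) for v in column[ready_offset:]):
        return list(column)
    return [math.nan] * len(column)


def drop_failed(columns: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Keep only columns that are not entirely NaN."""
    return {
        name: values
        for name, values in columns.items()
        if not all(math.isnan(v) for v in values)
    }


result = {
    "good": mark_failed([math.nan, 1.0, 2.0], ready_offset=1),
    "bad": mark_failed([math.nan, math.inf, 2.0], ready_offset=1),
}
kept = drop_failed(result)
print(list(kept))
```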

I Want to Have a Time Index for the Result

The replay function optionally accepts an index_col parameter. If you want to set a column from the dataset as the index of the returned result, you can do the following:

from datetime import datetime

import pandas as pd
from factor_expr import Factor, replay

pd.DataFrame({
    "time": [datetime(2021, 4, 23), datetime(2021, 4, 24)],
    "open": [3.1, 5.8],
    "high": [8.8, 7.7],
    "low": [1.1, 2.1],
    "close": [4.4, 3.4],
}).to_parquet("data.pq")

result = await replay(
    ["data.pq"],
    [Factor("(TSLogReturn 30 :close)")],
    index_col="time",
)

Note, accessing the time column from factor expressions will cause an error. Factor expressions can only read float64 columns.

API

There are two components in Factor Expr, a Factor class and a replay function.

Factor

The Factor class is constructed from an S-Expression. It has the following signature:

class Factor:
    def __init__(self, sexpr: str) -> None:
        """Construct a Factor using an S-Expression"""

    def ready_offset(self) -> int:
        """Returns the first index after the warm-up period. 
        For non-window functions, this will always return 0."""

    def __len__(self) -> int:
        """Returns how many subtrees contained in this factor tree.

        Example
        -------
        `(+ (/ :close :open) :high)` has 5 subtrees, namely:
        1. (+ (/ :close :open) :high)
        2. (/ :close :open)
        3. :close
        4. :open
        5. :high
        """

    def __getitem__(self, i: int) -> Factor:
        """Get the i-th subtree of the sequence from the pre-order traversal of the factor tree.

        Example
        -------
        `(+ (/ :close :open) :high)` is traversed as:
        0. (+ (/ :close :open) :high)
        1. (/ :close :open)
        2. :close
        3. :open
        4. :high

        Consequently, f[2] will give you `Factor(":close")`.
        """

    def depth(self) -> int:
        """How deep is this factor tree.

        Example
        -------
        `(+ (/ :close :open) :high)` has a depth of 2, namely:
        1. (+ (/ :close :open) :high)
        2. (/ :close :open)
        """

    def child_indices(self) -> List[int]:
        """The indices for the children of this factor tree.

        Example
        -------
        The child_indices result of `(+ (/ :close :open) :high)` is [1, 4]
        """
        
    def replace(self, i: int, other: Factor) -> Factor:
        """Replace the i-th node with another subtree.

        Example
        -------
        `Factor("(+ (/ :close :open) :high)").replace(4, Factor("(- :high :low)")) == Factor("(+ (/ :close :open) (- :high :low))")`
        """

    def columns(self) -> List[str]:
        """Return all the columns that are used by this factor.

        Example
        -------
        `(+ (/ :close :open) :high)` uses [:close, :open, :high].
        """
    
    def clone(self) -> Factor:
        """Create a copy of itself."""
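The indexing scheme in the docstrings above follows a pre-order traversal of the expression tree. The sketch below reproduces the documented examples on a nested-list stand-in for `(+ (/ :close :open) :high)`. It is an illustration of the traversal only: the functions are hypothetical, they do not use the Factor class, and how the library counts constant arguments such as window sizes is not modelled.

```python
from typing import Any, List

# Nested-list stand-in for (+ (/ :close :open) :high): operator first, then arguments.
tree = ["+", ["/", ":close", ":open"], ":high"]


def subtrees(node: Any) -> List[Any]:
    """Pre-order list of subtrees: the node itself, then each argument's subtrees."""
    out = [node]
    if isinstance(node, list):
        for arg in node[1:]:
            out.extend(subtrees(arg))
    return out


def child_indices(node: Any) -> List[int]:
    """Pre-order indices of the direct children of the root."""
    indices, cursor = [], 1
    if isinstance(node, list):
        for arg in node[1:]:
            indices.append(cursor)
            cursor += len(subtrees(arg))
    return indices


def depth(node: Any) -> int:
    """Number of nested operator levels; leaves have depth 0."""
    if not isinstance(node, list):
        return 0
    return 1 + max((depth(arg) for arg in node[1:]), default=0)


def replace(node: Any, i: int, other: Any) -> Any:
    """Return a new tree with the i-th pre-order subtree swapped for `other`."""
    if i == 0:
        return other
    if not isinstance(node, list):
        raise IndexError("subtree index out of range")
    new_node, cursor = [node[0]], 1
    for arg in node[1:]:
        size = len(subtrees(arg))
        if cursor <= i < cursor + size:
            new_node.append(replace(arg, i - cursor, other))
        else:
            new_node.append(arg)
        cursor += size
    return new_node


print(len(subtrees(tree)), subtrees(tree)[2], child_indices(tree), depth(tree))
print(replace(tree, 4, ["-", ":high", ":low"]))
```

The printed values line up with the docstring examples: 5 subtrees, index 2 is :close, child indices [1, 4], and depth 2.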

replay

Replay has the following signature:

async def replay(
    files: Iterable[str],
    factors: List[Factor],
    *,
    predicate: Optional[Factor] = None,
    batch_size: int = 40960,
    n_data_jobs: int = 1,
    n_factor_jobs: int = 1,
    pbar: bool = True,
    trim: bool = False,
    index_col: Optional[str] = None,
    verbose: bool = False,
    output: Literal["pandas", "pyarrow", "raw"] = "pandas",
) -> Union[pd.DataFrame, pa.Table]:
    """
    Replay a list of factors on a bunch of data.

    Parameters
    ----------
    files: Iterable[str]
        Paths to the datasets. Currently, only parquet format is supported.
    factors: List[Factor]
        A list of Factors to replay on the given set of files.
    predicate: Optional[Factor] = None
        Use a predicate to pre-filter the replay result. Any value larger than 0 is treated as True.
    batch_size: int = 40960
        How many rows to replay at one time. Default is 40960 rows.
    n_data_jobs: int = 1
        How many datasets to run in parallel. Note that the factor level parallelism is controlled by n_factor_jobs.
    n_factor_jobs: int = 1
        How many factors to run in parallel for **each** dataset.
        e.g. if `n_data_jobs=3` and `n_factor_jobs=5`, you will have 3 * 5 threads running concurrently.
    pbar: bool = True
        Whether to show the progress bar using tqdm.
    trim: bool = False
        Whether to trim the warm up period off from the result.
    index_col: Optional[str] = None
        Set the index column.
    verbose: bool = False
        If True, failed factors will be printed out in stderr.
    output: Literal["pandas", "pyarrow", "raw"] = "pandas"
        The return format, can be pandas DataFrame ("pandas") or pyarrow Table ("pyarrow") or un-concatenated pyarrow Tables ("raw").
    """
Owner
Weiyuan Wu