A Rust implementation of Glidesort, my stable adaptive quicksort/mergesort hybrid sorting algorithm.

Orson Peters

Last update: Apr 22, 2023

Related tags

Command-line glidesort

Overview

Glidesort

Glidesort is a novel stable sorting algorithm that combines the best-case behavior of Timsort-style merge sorts for pre-sorted data with the best-case behavior of pattern-defeating quicksort for data with many duplicates. It is a comparison-based sort supporting arbitrary comparison operators, and while exceptional on data with patterns it is also very fast for random data.

For sorting n elements with k distinct values glidesort has the following characteristics by default:

Best    Average     Worst       Memory      Stable      Deterministic
n       n log k     n log n     n / 8       Yes         Yes

Glidesort can use as much (up to n) or as little extra memory as you want. If given only O(1) memory the average and worst case become O(n (log n)^2), however in practice its performance is great for all but the most skewed data size / auxiliary space ratios. The default is to allocate up to n elements worth of data, unless this exceeds 1 MiB, in which case we scale this down to n / 2 elements worth of data up until 1 GiB after which glidesort uses n / 8 memory.

Benchmark

Performance varies a lot from machine to machine and dataset to dataset, so your mileage will vary. Nevertheless, an example benchmark from a 2021 Apple M1 machine comparing against [T]::sort and [T]::sort_unstable for various input distributions of u64:

Compiled with rustc 1.69.0-nightly (11d96b593) using --release --features unstable and lto = "thin".

Usage

Use cargo add glidesort and replace a.sort() with glidesort::sort(&mut a). A similar process works for sort_by and sort_by_key.

Glidesort exposes two more families of sorting functions. glidesort::sort_with_buffer(&mut a, buf) asks you to pass a &mut [MaybeUninit<T>] buffer which it will then (exclusively) use as auxiliary space to sort the elements. glidesort::sort_in_vec(&mut v) behaves like normal glidesort but will allocate its auxiliary space at the end of the passed Vec<T>. This allows future sorting calls to re-use the same space and reduce allocations. Both these families also support the _by and _by_key interface.

Visualization

This visualization focuses on demonstrating the advanced merging techniques in glidesort:

glidesort_merge_example.mp4

This visualization shows how glidesort is adaptive to both pre-existing runs as well as many duplicates together:

glidesort_adaptiveness_example.mp4

Note that both visualizations have different small sorting thresholds and auxiliary memory parameters to show the techniques in action on a smaller scale.

Technique overview

If you prefer I also have a recorded talk I gave at FOSDEM 2023 that gives a high level overview of glidesort:

Glidesort uses a novel main loop based on powersort. Powersort is similar to Timsort, using heuristics to find a good order of stably merging sorted runs. Like powersort it does a linear scan over the input, recognizing any ascending or strictly descending sequences. However, unlike powersort it does not eagerly sort sequences that are considered unordered into small sorted blocks. Instead it processes them as-is, unsorted. This process produces logical runs, which may be sorted or unsorted.

Glidesort repeatedly uses a logical merge operation on these logical runs, as powersort would. In a logical merge unsorted runs are simply concatenated into larger unsorted runs. Sorted runs are also concatenated into double sorted runs. Only when merging a sorted and unsorted run finally the unsorted run is sorted using stable quicksort, and when merging double sorted runs glidesort uses interleaved ping-pong merges.

Using this novel hybrid approach glidesort can take advantage of arbitrary sorted runs in the data as well as process data with many duplicate items faster similar to pattern-defeating quicksort.

Stable merging

Glidesort merges multiple sorted runs at the same time, and interleaves their merging loops for better memory-level and instruction-level parallelism as well as hiding data dependencies. For similar reasons it also interleaves independent left-to-right and right-to-left merging loops as bidirectional merges, which are a generalization of quadsorts parity merges. Merging multiple runs at the same time also lets glidesort use ping-pong merging, avoiding unnecessary memcpy calls by using the implicit copy you get from an out-of-place merge. All merging loops are completely branchless, making it fast for random data as well.

Glidesort further uses binary searches to split up large merge operations into smaller merge operations that it then performs at the same time using instruction-level parallelism. This splitting procedure also allows glidesort to use arbitrarily small amounts of memory, as it can choose to split a merge repeatedly until it fits in our scratch space to process.

Stable quicksort

Yes, stable quicksort. Wikipedia will outright tell you that quicksort is unstable, or at least all efficient implementations are. That simply isn't true, all it needs is auxiliary memory. Credit to Igor van den Hoven's fluxsort for demonstrating that stable quicksort can be efficient in practice.

Glidesort uses a novel bidirectional stable partitioning method that interleaves a left-to-right partition scan with a right-to-left partition scan for greater memory-level parallelism and hiding data dependencies. Partitioning is done entirely branchlessly (if the comparison operator is), giving consistent performance on all data.

License

Glidesort is dual-licensed under the Apache License, Version 2.0 and the MIT license.

Comments

Implement optimizations from Stavenga's article

Please, use optimizations from this cool @gerben-stavenga's article: https://blog.reverberate.org/2020/05/29/hoares-rebuttal-bubble-sorts-comeback.html

opened by safinaskar 5
Potentially replace standard library sort?

If this sorting algorithm is stable and strictly better performance-wise than the standard library sort, it seems like the standard library implementation could be replaced completely (like it happened previously with hashbrown and crossbeam-channel)?

opened by GoldsteinE 1
Host visualizations on Github

This is just a hack to host visualization videos on Github.

https://user-images.githubusercontent.com/202547/216675278-e4c8f15c-e42d-4224-b8c7-fdc67fdc2bde.mp4

https://user-images.githubusercontent.com/202547/216675274-6e61689f-a120-4b7c-b1a7-9b5aa5fd013e.mp4

opened by orlp 0
Quadratic-time run scanning

The minimum run length commit seems to introduce quadratic behavior for runs somewhat shorter than sqrt(n / 2), because the run is repeatedly followed and discarded. If so, this would cause worst-case performance of O(n^(3/2)) by multiplying O(n) time to get past each run by O(sqrt(n)) runs that fit in a length-n array. I took the following timings on a length 1e8 array to confirm that this has a practical impact; the input data is just 0, 1, ... r-1 repeated, for run length r. sqrt(1e8 / 2) is about 7071; strangely, performance improves gradually from about 6000 to that number instead of sharply as I'd expected. The "% create" here is a loose estimate from perf top of fraction of time spent in LogicalRun<B,T>::create, and "Time create" is that multiplied by total time.

| Run | Time (s) | % create | Time create |------|----------|----------|------ | 500 | 2.44 | 0.20 | 0.49 | 1000 | 2.96 | 0.25 | 0.74 | 2000 | 3.74 | 0.38 | 1.42 | 4000 | 4.48 | 0.45 | 2.02

opened by mlochbaum 3
Frustrations

Well, I'll be blunt. I find the glidesort codebase hard to work with even by the standards of sorting research. I feel that in your desire to give a polished view of the algorithm to the world, you've actually made it harder for people who want to dig into the details. Given that pdqsort (Rust version included) was basically my entry point into high-performance sorting, seeing the next step like this is tough. Worse, from what I understand of the algorithm, it doesn't seem to be much more complicated than pdqsort, given that it throws out a lot of things like pivot scrambling and heapsort. I'll describe my difficulties as best as I'm able to give you the most information if you'd like to help.

I believe a real commit history would be very useful, and find the decision to publish as a single commit surprising for software presented at an open source conference. You apparently had enough of an implementation to benchmark for the talk in May, without full panic safety infrastructure. Because I can't access any version like this, I have no way to test your claim that panic safety accounts for 10-15% of time taken by the algorithm. Could it be different across processors? I have no insight into how tuning decisions were made, which is often available in the history too.

You've shared benchmarks from an ARM machine that's presumably your M1, and an unspecified AMD processor, as a png. Could you include, or link to, the processor specs and raw data in this repository?

Generally it feels that while sorting concepts are well explained, the way they are implemented isn't. As I understand it the way you use Rust isn't typical, so I expect even fluent Rust readers (I'm not one) could use some help. For example gap_guard.rs makes no attempt to explain what "the gap" is. And much of branchless_merge.rs is taken up by implementation of the BranchlessMergeState structure with no explanation of how this structure will be used.

Other structures have no comments at all. I suppose the names are supposed to be self-documenting. Take enum PartitionStrategy<T> in quicksort. LeftWithPivot is meaningless to me. What goes left? And of course there's a pivot, you're partitioning! Eventually I figured out that the un-named parameter is the pivot value to be used. Is left pivoting the variety used for the left side in a bidirectional partition? Because the block comment at the top never connects to any specific part of the code, I can't tell. Are LeftIfNewPivotEquals and LeftIfNewPivotEqualsCopy identical other than the way they way they store the pivot? The definition of partition_left certainly suggests this, but later less_strategy and geq_strategy recognize only the Copy version.

Where does the recursive median-based strategy for pivot selection come from? Is there a reference? To me it seems obviously questionable because if just two of the three systematically-chosen regions based on a, b, and c have lower median values, then you'll get a low pivot. For example, what happens with an array consisting of three up-down patterns?

What is tracking? I couldn't even google cfg(feature = "tracking").

opened by mlochbaum 4
Provide a way to easily reproduce benchmarks

Hey, glidesort is very impressive.

Could this provide an easy way to run benchmarks on our machines and see these results? (I'm willing to help if necessary).

(EDIT: by chance, I have a 4800 MHz dual-channel system available, and a 2666MHz single-channel one, I'm curious to compare bench results on both.)

opened by marcospb19 8

Owner

Orson Peters

GitHub

Learning rust by coding different sorting algorithms in it

Sorting in Rust Sorting a 32 bit unsigned integer vector by using: Radix Sort Quick Sort Merge Sort Bubble Sort Inbuilt Sort (Probably Tim Sort) This

1 Nov 28, 2021

A cross-platform file sorting program

Cabinet Cross-platform file sorting system that sorts files based on their attributes, such as file type, file name and date modified. Disclaimer: Not

2 Jul 12, 2022

Human numeric sorting program — does what `sort -h` is supposed to do!

7 Sep 25, 2022

FileSorterX is an automatic file sorting application that sorts your files into folders based on their file extension

FileSorterX is an automatic file sorting application that sorts your files into folders based on their file extension. With FileSorterX, you can easily keep your files organized and find what you need quickly.

22 Apr 4, 2023

🐎 A fast implementation of the Aho-Corasick algorithm using the compact double-array data structure. (Python wrapper for daachorse)

python-daachorse daachorse is a fast implementation of the Aho-Corasick algorithm using the compact double-array data structure. This is a Python wrap

11 Nov 30, 2022

Sensorial System's Stable Diffusion codebase

Stable Diffusion XL LoRA Trainer Welcome to the official codebase for the Sensorial System's Stable Diffusion projects. For now, this only hosts the c

8 Mar 2, 2024

Implemented reverse-engineered signature algorithm to successfully register with Apple's caching server.

View as English 项目描述本项目通过逆向得到苹果缓存服务器的签名算法，并可以成功注册缓存服务。算法分为两种运行模式。运行模式直接运行（x64）: 效率较高，但只支持64位CPU。已测试可运行在Windows/Linux/macOS上。模拟器运行: 兼容性极高，支持所有CPU架构

6 Oct 27, 2023

A SIMD implementation of Keccak256 for aarch64, forked from Remco Bloeman's Goldilocks K12 implementation.

keccak256-aarch64 Warning This crate was forked from real cryptographers (Goldilocks by Remco Bloeman), by not a real cryptographer. Do not use this k

6 Oct 24, 2023

Readline Implementation in Rust

RustyLine Readline implementation in Rust that is based on Antirez' Linenoise Supported Platforms Unix (tested on FreeBSD, Linux and macOS) Windows cm

1.1k Dec 29, 2022

Rust implementation of the termbox library

Rustbox Rustbox is a Rust implementation of termbox. Currently, this is just a wrapper of the C library by nsf, though my plan is to convert it to be

462 Dec 19, 2022

A very fast implementation of tldr in Rust

A very fast implementation of tldr in Rust: Simplified, example based and community-driven man pages.

2.9k Dec 31, 2022

A compact implementation of connect four written in rust.

connect-four A compact implementation of connect four written in rust. Run the game At the moment there no pre-built binaries - but you can build it l

12 Jul 31, 2022

Baby's first Rust CLI project. Basic implementation of grep. Written in about 100 SLOC.

minigrep Coding project from Chapter 12 of the The Rust Programming Language book. Usage Compile and run as so minigrep QUERY FILENAME QUERY being the

2 Oct 2, 2021

An implementation of Verifiable Delay Functions in Rust

Verifiable Delay Function (VDF) Implementation in Rust What is a VDF? A Verifiable Delay Function (VDF) is a function that requires substantial time t

147 Dec 12, 2022

A Decimal Implementation written in pure Rust suitable for financial calculations.

Decimal A Decimal implementation written in pure Rust suitable for financial calculations that require significant integral and fractional digits wi

702 Jan 5, 2023

Rust implementation of custom numeric base conversion.

base_custom Use any characters as your own numeric base and convert to and from decimal. This can be taken advantage of in various ways: Mathematics:

5 Dec 28, 2021

Red-blue graph problem solver - Rust implementation

Red-blue graph problem solver - Rust implementation The problem is the following: In a directed graph, each node is colored either red or blue. Furthe

2 Jan 17, 2022

A better customization password dictionary generator implementation by Rust.

abcdict A better customization password dictionary generator implementation by Rust. Features Cli Faster Customize Rules Build & Installation $> cargo

6 Jun 2, 2022

An Rust implementation of the Modified Merkle Patricia Tree described by ETH Yellow Paper

Merkle Patricia Tree的Rust实现博客文章：https://dere.press/2022/01/24/eth-trie/ 本实现参考下列项目： https://ethereum.github.io/yellowpaper/paper.pdf https://github.co

3 Dec 13, 2022

A Rust implementation of Glidesort, my stable adaptive quicksort/mergesort hybrid sorting algorithm.

Related tags

Overview

Glidesort

Benchmark

Usage

Visualization

Technique overview

Stable merging

Stable quicksort

License

Comments

Implement optimizations from Stavenga's article

Potentially replace standard library sort?

Host visualizations on Github

Quadratic-time run scanning

Frustrations

Provide a way to easily reproduce benchmarks

Owner

Orson Peters

Learning rust by coding different sorting algorithms in it

A cross-platform file sorting program

Human numeric sorting program — does what `sort -h` is supposed to do!

FileSorterX is an automatic file sorting application that sorts your files into folders based on their file extension

🐎 A fast implementation of the Aho-Corasick algorithm using the compact double-array data structure. (Python wrapper for daachorse)

Sensorial System's Stable Diffusion codebase

Implemented reverse-engineered signature algorithm to successfully register with Apple's caching server.

A SIMD implementation of Keccak256 for aarch64, forked from Remco Bloeman's Goldilocks K12 implementation.

Readline Implementation in Rust

Rust implementation of the termbox library

A very fast implementation of tldr in Rust

A compact implementation of connect four written in rust.

Baby's first Rust CLI project. Basic implementation of grep. Written in about 100 SLOC.

An implementation of Verifiable Delay Functions in Rust

A Decimal Implementation written in pure Rust suitable for financial calculations.

Rust implementation of custom numeric base conversion.

Red-blue graph problem solver - Rust implementation

A better customization password dictionary generator implementation by Rust.

An Rust implementation of the Modified Merkle Patricia Tree described by ETH Yellow Paper