A faster Arc.

Overview

Trc


Trc is a performant heap-allocated smart pointer for Rust that implements thread reference counting. Trc stands for Thread Reference Counted. Trc provides shared ownership of data, similar to Arc<T> and Rc<T>. Thread reference counting is based on the observation that most objects are only used by one thread, so two reference counts are kept: a non-atomic one for thread-local use, and an atomic one for sharing between threads. The atomic reference count tracks the number of threads using the data.
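The two-count layout can be sketched as follows. These struct definitions are illustrative only, not trc's actual internals (though the name SharedTrcInternal does appear in the crate's source):

```rust
use std::ptr::NonNull;
use std::sync::atomic::AtomicUsize;

// One allocation is shared between all threads holding the data...
struct SharedTrcInternal<T> {
    atomic_count: AtomicUsize, // number of threads using the data
    weak_count: AtomicUsize,   // number of Weak references
    data: T,
}

// ...and each thread gets its own non-atomic local count, so Trc::clone
// within a thread is a plain integer increment rather than an atomic RMW.
struct TrcSketch<T> {
    thread_count: NonNull<usize>,
    shared: NonNull<SharedTrcInternal<T>>,
}
```

This is why cloning within a thread is cheap: only the local `usize` is touched, and the atomic count changes only when the data crosses a thread boundary.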

A cycle between Trc pointers can never be deallocated, as the reference counts will never reach zero. The solution is Weak<T>: a non-owning reference to the data held by a Trc<T>. A Weak<T> breaks reference cycles by adding a layer of indirection, acting as an observer. It cannot access the data directly and must first be upgraded back into a Trc<T>. Weak<T> does not keep the value itself alive (the value can still be dropped); it only keeps the backing allocation alive.

To soundly implement thread safety, Trc<T> does not itself implement Send or Sync. SharedTrc<T> does, however, and it is the only way to safely send a Trc<T> across threads. See SharedTrc for its API, which is similar to that of Weak.

Examples

Example of Trc<T> in a single thread:

use trc::Trc;

let mut trc = Trc::new(100);
assert_eq!(*trc, 100);
*Trc::get_mut(&mut trc).unwrap() = 200;
assert_eq!(*trc, 200);

Example of Trc<T> with multiple threads:

use std::thread;
use trc::Trc;
use trc::SharedTrc;

let trc = Trc::new(100);
let shared = SharedTrc::from_trc(&trc);
let handle = thread::spawn(move || {
    let trc = SharedTrc::to_trc(shared);
    assert_eq!(*trc, 100);
});

handle.join().unwrap();
assert_eq!(*trc, 100);

Example of Weak<T> in a single thread:

use trc::Trc;
use trc::Weak;

let trc = Trc::new(100);
let weak = Trc::downgrade(&trc);
let mut new_trc = Weak::upgrade(&weak).unwrap();
assert_eq!(*new_trc, 100);
drop(trc);
drop(weak);
*Trc::get_mut(&mut new_trc).unwrap() = 200;
assert_eq!(*new_trc, 200);

Example of Weak<T> with multiple threads:

use std::thread;
use trc::Trc;
use trc::Weak;

let trc = Trc::new(100);
let weak = Trc::downgrade(&trc);

let handle = thread::spawn(move || {
    let trc = Weak::upgrade(&weak).unwrap();
    assert_eq!(*trc, 100);
});
handle.join().unwrap();
assert_eq!(*trc, 100);

Benchmarks

Benchmarks via Criterion. As can be seen, Trc's performance really shines when there are many clones. The reason Trc does not do as well for fewer operations is that it must allocate n+1 blocks of memory for n threads, so for 1 thread there are 2 allocations. After the initial allocations, however, Trc performs very well: in the Multiple Clone benchmark below, Arc takes roughly 3x as long as Trc.

Click here for more benchmarks. Multiple different operating systems, CPUs, and architectures are tested.

Clone

| Type | Mean time  |
|------|------------|
| Trc  | 26.913 ns  |
| Arc  | 33.803 ns  |
| Rc   | 11.228 ns  |

Multiple Clone (100 times)

| Type | Mean time   |
|------|-------------|
| Trc  | 423.020 ns  |
| Arc  | 1273.200 ns |
| Rc   | 352.920 ns  |

Deref

| Type | Mean time  |
|------|------------|
| Trc  | 20.802 ns  |
| Arc  | 20.802 ns  |
| Rc   | 9.264 ns   |

Multiple Deref (100 times)

| Type | Mean time  |
|------|------------|
| Trc  | 35.577 ns  |
| Arc  | 35.853 ns  |
| Rc   | 29.454 ns  |

Multiple Threads Drop and Clone (1000 times)

| Type | Mean time |
|------|-----------|
| Trc  | 2.80 ms   |
| Arc  | 4.02 ms   |

1.44x faster: the allocation cost of SharedTrc still dominates at this scale.

Multiple Threads Drop and Clone (5000 times)

| Type | Mean time |
|------|-----------|
| Trc  | 3.480 ms  |
| Arc  | 9.415 ms  |

2.71x faster: the allocation cost of SharedTrc begins to be offset by the cheaper clones.

Multiple Threads Drop and Clone (100000 times)

| Type | Mean time  |
|------|------------|
| Trc  | 18.451 ms  |
| Arc  | 137.980 ms |

7.44x faster: the allocation cost of SharedTrc is now negligible and the clone advantage begins to show.

Multiple Threads Drop and Clone (500000 times)

| Type | Mean time  |
|------|------------|
| Trc  | 71.490 ms  |
| Arc  | 638.180 ms |

8.92x faster: the allocation cost of SharedTrc is negligible and the clone advantage is fully demonstrated.

Trc vs Arc performance

Use

To use Trc, run cargo add trc, or add trc = "1.2.1" to your Cargo.toml dependencies. Alternatively, to track the latest development version, add trc = { git = "https://github.com/EricLBuehler/trc.git" }.

Comments
  • Memory fences and general improvements


    Quoted from u/TDplay from this comment.

    Todo

    • [x] Weak::to_trc race condition
    • [x] Weak::clone race condition
    • [x] impl Clone for SharedTrc
    • [x] Support no_std
    • [x] Trc::cmp documentation
    • [x] Fix Pointer behavior
    • [x] Use different Layout API.
    • [x] Use ptr::drop_in_place instead of ptr::read

    Quote

    https://github.com/EricLBuehler/trc/blob/89f3eb2ffec4f860b13620c82cd0f443ed6dcfeb/src/trc.rs#L245

    Not entirely sure about this part. You seem to decrement and check the weak count, and then construct and drop a Weak (thus calling Weak::drop, which should also decrement and check the weak count).

    It also seems like you only drop the data when the weak count reaches zero, which seems to defeat the point of supporting Weak pointers.

    Another point is to perhaps consider using ptr::drop_in_place instead of ptr::read. It makes the intent more obvious, and it can make some optimisations more obvious to the compiler (since it doesn't need to reason about removing the copy). I think this would also allow you to support dynamically-sized types.
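The suggestion above can be sketched as follows. The `drop_data` helper is hypothetical, not part of trc's API; it only illustrates the drop_in_place pattern:

```rust
use std::ptr;

// Hypothetical helper illustrating the suggestion: drop the value in place
// rather than moving it out with ptr::read. The intent is clearer, the
// compiler has no copy to reason away, and with T: ?Sized it also works
// for dynamically sized types (which ptr::read cannot).
//
// Safety: `data` must point to a valid, initialized T, and the pointee
// must not be used again after this call.
unsafe fn drop_data<T: ?Sized>(data: *mut T) {
    // instead of: let _ = ptr::read(data);  // requires T: Sized
    ptr::drop_in_place(data);
}
```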

    Trc::drop looks fine.

    https://github.com/EricLBuehler/trc/blob/89f3eb2ffec4f860b13620c82cd0f443ed6dcfeb/src/trc.rs#L1148

    Weak::to_trc has a race condition, leading to a possible use-after-free.

    Initial state: We have a SharedTrc in thread A and a Weak in thread B.

    Thread B calls Weak::to_trc

    Weak::to_trc checks the strong count, and reads 1

    Thread A drops the SharedTrc, decrementing the strong count.

    SharedTrc::drop notices that the strong count is now zero, so it deallocates the data.

    Weak::to_trc increments the strong count and constructs a Trc.

The Trc is now passed to user code, which accesses the contents, causing a use-after-free.

    To resolve this, the zero-check and the increment need to be one atomic operation, and the increment needs to not happen if the zero check fails. fetch_update is perfect for this. Sadly, this is slower than fetch_add, but I can't see a way to avoid it, and it seems the standard library developers couldn't either.
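The fix can be sketched as follows. This is a standalone sketch over a bare AtomicUsize strong counter, not trc's actual Weak::upgrade:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Race-free upgrade check: the zero test and the increment are a single
// atomic read-modify-write, so no other thread can drop the last strong
// reference between the check and the increment.
fn try_upgrade(strong: &AtomicUsize) -> bool {
    strong
        .fetch_update(Ordering::Acquire, Ordering::Relaxed, |count| {
            if count == 0 {
                None // allocation is (being) freed: upgrade must fail
            } else {
                Some(count + 1)
            }
        })
        .is_ok()
}
```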

    https://github.com/EricLBuehler/trc/blob/89f3eb2ffec4f860b13620c82cd0f443ed6dcfeb/src/trc.rs#L1191

Clone for Weak technically has a race condition: if a program clones an absurd number of Weaks in different threads, with each thread being pre-empted before it can call abort, it may be possible to overflow the weak count. In practice, I don't think any real program will ever do this, and the actual likelihood of this behaviour is slim.

    That's the big stuff out of the way, on to the small nitpicks:

    Clone is not implemented for SharedTrc, this feels like an oversight.

    You might be able to support no_std. Add this to your lib.rs:

    #![cfg_attr(not(test), no_std)]
    
    extern crate alloc;
    #[cfg(feature = "force_lock")]
    extern crate std;
    

    It will then throw errors at every use of std. Replace these with either core or alloc.
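Typical replacements look like this (an illustrative sketch; `demo` is not part of trc):

```rust
// std-only paths and their no_std replacements:
//   std::sync::atomic::AtomicUsize -> core::sync::atomic::AtomicUsize
//   std::boxed::Box                -> alloc::boxed::Box
extern crate alloc;

use alloc::boxed::Box;
use core::sync::atomic::{AtomicUsize, Ordering};

// core covers language-level items (atomics, ptr, mem); alloc covers
// heap types (Box, Vec). Only the force_lock feature would still need std.
fn demo() -> usize {
    let b: Box<AtomicUsize> = Box::new(AtomicUsize::new(1));
    b.fetch_add(1, Ordering::Relaxed);
    b.load(Ordering::Relaxed)
}
```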

    https://github.com/EricLBuehler/trc/blob/89f3eb2ffec4f860b13620c82cd0f443ed6dcfeb/src/trc.rs#L899

    Documentation for Trc::cmp doesn't seem right.

    https://github.com/EricLBuehler/trc/blob/89f3eb2ffec4f860b13620c82cd0f443ed6dcfeb/src/trc.rs#L793

    Behaviour of the Pointer formatting seems a little strange. Trc is a pointer, so it should probably print the address the Trc points to. This would also be in-line with what the standard library smart pointers do: Arc, Rc, Box, &T, and &mut T all implement Pointer unconditionally.

    https://github.com/EricLBuehler/trc/blob/89f3eb2ffec4f860b13620c82cd0f443ed6dcfeb/src/trc.rs#L1043

    There is no need for the unsafe call to Layout::from_size_align_unchecked, you can just use:

    Layout::new::<SharedTrcInternal<T>>()
    

    If you choose to support dynamically sized types, then you can instead use:

    Layout::for_value
    
    bug documentation enhancement good first issue 
    opened by EricLBuehler 2
  • Switch to Criterion for benchmarking


    Hi, a better way to measure benchmarks would be to use the Criterion crate as it abstracts over a lot of the hard-to-get-right benchmarking code. I can send a PR if you'd like!

    enhancement 
    opened by sigaloid 2
  • More trait implementations and methods


    Methods

    • [x] Replace Weak::from_trc with Trc::downgrade
    • [x] Replace Weak::to_trc with Weak::upgrade
    • [x] Add Trc::get_mut
    • [x] Add Trc::try_unwrap - Note the necessity for a race condition warning
    • [x] Add Trc::into_inner
    • [x] Add Trc::unwrap_or_clone

    Traits

    "D" signifies impl dependant on T implements the trait.

    • [x] D AsFd
    • [x] #11
    • [x] D Error
    • [x] D Hash
    • [x] Unpin
    • [x] UnwindSafe
    • [x] Any
    • [x] From<&[T]>
    documentation enhancement help wanted good first issue 
    opened by EricLBuehler 1
  • Benchmarks shouldn't use `Instant::now()` in a loop


    I see your benchmarks call Instant::now() in the inner loop. That timing function takes much longer than the deref operation being benchmarked, so what you end up benchmarking is how fast you can read the clock, rather than how fast you can deref.

    https://github.com/EricLBuehler/trc/blob/56c73e560fd6ccf4667149f80b56f83e773f3dff/src/main.rs#L67

    A better approach is to hoist the timing outside the loop, so you only call the clock once per 100000 iterations. You should also use std::hint::black_box to ensure that computations don't get hoisted outside the loop or eliminated by the compiler. So, a better benchmark would be:

    use std::rc::Rc;
    use std::time::Instant;

    fn test_deref_rc() -> f64 {
        let rc = Rc::new(100);

        let start = Instant::now();
        for _ in 0..100000 {
            // black_box the pointer and the loaded value so the compiler can
            // neither hoist the deref out of the loop nor discard its result.
            std::hint::black_box(**std::hint::black_box(&rc));
        }
        let end = Instant::now();
        (end - start).as_nanos() as f64 / 100000.
    }
    
    bug enhancement 
    opened by reinerp 1
  • Seems unsound


What prevents me from simply cloning a Trc value, sending it to another thread, and then cloning both values on different threads? Trc should not implement Send and Sync, because it is neither. Rather, it should be able to produce a Send + Sync value that can be sent across threads and then converted back into a Trc.

    bug 
    opened by knopp 1
  • Atomic `Drop` implementations are unsound


    This is largely copied from my Reddit comment.

    SharedTrc::drop:

    fn drop(&mut self) {
        sub_value(&unsafe { self.data.as_ref() }.atomicref, 1);
    
        let atomic = unsafe { self.data.as_mut() }
            .atomicref
            .load(std::sync::atomic::Ordering::Relaxed);
    
        if atomic == 0 {
            unsafe { Box::from_raw(self.data.as_ptr()) };
        }
    }
    

    Trc::drop:

    fn drop(&mut self) {
        *unsafe { self.threadref.as_mut() } -= 1;
    
        if *unsafe { self.threadref.as_ref() } == 0 {
            sub_value(&unsafe { self.shared.as_ref() }.atomicref, 1);
            let weak = unsafe { self.shared.as_mut() }
                .weakcount
                .load(std::sync::atomic::Ordering::Relaxed);
            let atomic = unsafe { self.shared.as_mut() }
                .atomicref
                .load(std::sync::atomic::Ordering::Relaxed);
            if weak == 0 && atomic == 0 {
                unsafe { Box::from_raw(self.threadref.as_ptr()) };
                unsafe { std::ptr::drop_in_place(self.shared.as_ptr()) };
            }
        }
    }
    

    are both unsound.

    First off, your decrement operation and read operation are two different operations. You need to consider the possible interleavings of the operations: I can think of possible interleavings that cause double-free and use-after-free. Consider, for example:

    • Initial state: atomicref is 2.
    • A SharedTrc is dropped in thread A.
    • A SharedTrc is dropped in thread B.
    • Thread A decrements atomicref.
    • Thread B decrements atomicref.
    • Thread A reads atomicref, and gets 0.
    • Thread A deallocates self.data.
    • Thread B reads atomicref. This is a use after free, which is undefined behaviour. The likely (but not guaranteed) outcome is to read 0, so let's suppose this happens.
    • Thread B deallocates self.data. This is a double free, which is undefined behaviour.

    The same issue exists for Trc::drop.

    The solution here is to use the returned value from the fetch_sub operation to determine when to deallocate. If you get 1, then you are the last reference, and should deallocate. There is, of course, the tricky detail that you have to look at the weak count before deallocating - to solve this, look at how std::sync::Arc does it. It uses the weak count to determine when to deallocate. If there are no Weak references, then the weak count is 1. When the strong count reaches zero, the weak count is decremented - this means only the weak count needs to be considered when deallocating.

    Second off, the calls to as_mut() assert to the compiler that we are the only thread with any kind of access to self.data and self.shared. I'm not convinced that this is true, so I'm going to conservatively say this is undefined behaviour. This can be easily fixed, by replacing the calls to as_mut() with calls to as_ref().

    I believe that threadref.as_mut() is fine.

    Third, you haven't performed the necessary synchronisation. Consider the type

    struct MyStruct(AtomicI32);
    impl MyStruct {
        fn new() -> Self {
            Self(AtomicI32::new(0))
        }
        fn inc(&self) {
            self.0.fetch_add(1, Relaxed);
        }
    }
    impl Drop for MyStruct {
        fn drop(&mut self) {
            println!("{}", self.0.get_mut());
        }
    }
    

    Now consider

    • Thread A constructs Trc::new(MyStruct::new()), and stores it in variable x
    • Thread A constructs SharedTrc::from_trc(&x) and sends it to thread B.
    • Thread A calls x.inc();. This is a Relaxed atomic read-write. Let us call this access A.
    • Thread A drops x, resulting in a Relaxed atomic decrement.
    • Thread B drops the SharedTrc, resulting in a Relaxed atomic decrement.
    • Thread B calls MyStruct::drop. This is a non-atomic read. Let us call this access B.

    Since access B is not atomic, and Acquire/Release semantics have not been used, it is not synchronised with access A. This is a data race, which is undefined behaviour.

    To fix this:

    • Decrementing the counter needs to be a Release atomic operation.
    • There needs to be an Acquire fence before calling Drop::drop.

    We can now establish that all counter decrements happen before the fence. Thus, we can establish that access A happens before access B, synchronising the accesses and avoiding the data race.
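Putting the fixes together, a corrected drop can be sketched as follows. This is a minimal single-counter sketch (assuming the weak count has been folded into the strong count as std's Arc does); MiniShared is illustrative, not trc's code:

```rust
use std::ptr::NonNull;
use std::sync::atomic::{fence, AtomicUsize, Ordering};

// Minimal illustration of the corrected pattern: one strong counter,
// fetch_sub-based last-reference detection, Release/Acquire synchronisation.
struct Inner<T> {
    strong: AtomicUsize,
    data: T,
}

struct MiniShared<T> {
    ptr: NonNull<Inner<T>>,
}

unsafe impl<T: Send + Sync> Send for MiniShared<T> {}

impl<T> MiniShared<T> {
    fn new(data: T) -> Self {
        let b = Box::new(Inner { strong: AtomicUsize::new(1), data });
        Self { ptr: NonNull::new(Box::into_raw(b)).unwrap() }
    }
}

impl<T> Clone for MiniShared<T> {
    fn clone(&self) -> Self {
        // Relaxed suffices for increments: a new reference publishes nothing.
        unsafe { self.ptr.as_ref() }.strong.fetch_add(1, Ordering::Relaxed);
        Self { ptr: self.ptr }
    }
}

impl<T> Drop for MiniShared<T> {
    fn drop(&mut self) {
        // Decrement and check are ONE atomic operation: fetch_sub returns the
        // previous value, so exactly one thread observes 1 and deallocates.
        // Release makes this thread's writes to `data` visible to whichever
        // thread ends up deallocating.
        if unsafe { self.ptr.as_ref() }.strong.fetch_sub(1, Ordering::Release) == 1 {
            // The Acquire fence synchronises with every Release decrement
            // above, so all threads' accesses to `data` happen-before the drop.
            fence(Ordering::Acquire);
            unsafe { drop(Box::from_raw(self.ptr.as_ptr())) };
        }
    }
}
```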

    bug 
    opened by TDplay 0
  • So you want to be considered by T-libs?


    Before it is worth putting in the queue for review:

    1. This crate must pass miri. Ideally it would also pass miri with the following flags:
    • [x] #12
    • [x] #13
    • [x] #14
    2. This crate must be sound against all the correctness errors we have encountered in Arc and Rc and solved:
    • [x] https://github.com/rust-lang/rust/issues/108706
    • [x] https://github.com/rust-lang/rust/issues/95334
    • [ ] https://github.com/rust-lang/rust/issues/55005
    • [ ] https://github.com/rust-lang/rust/issues/51780
    • [ ] https://github.com/rust-lang/rust/issues/80365
    • [ ] https://github.com/rust-lang/rust/issues/30031
    • [ ] https://github.com/rust-lang/rust/issues/54908
    3. In addition, you must validate that it remains sound and fast even when used on "weak memory model" platforms. PowerISA (aka POWER aka PowerPC) is the weakest I know of for modern CPUs, but in a pinch, testing on AArch64 should flush out most problems, and is more important because we support aarch64-unknown-linux-gnu as a tier 1 platform. That makes the soundness of the implementation there of consequently higher concern; though we do take correctness seriously for tier 2 platforms, we're just humans with priorities.

    4. Finally, and this is the most subjective evaluation: there has to be a reason to actually include it in std instead of just leaving it as an external crate. It's not good enough for it to simply be faster in a few cases. We often reject new API simply if it is too confusing or niche to choose when to use it.

    bug enhancement 
    opened by workingjubilee 8
  • Inadequate microbenchmarks


    Your benchmarks are unrealistically simple and do not cover common concerns:

    And your benchmarks are... probably inadequate. You are microbenching the most primitive operation. It can still be annoyingly obvious to the optimizer if the result is e.g. unused. You need more holistic tests that demonstrate Trc improves a semi-realistic workload. I mean, strictly speaking, that isn't necessary for acceptance to std, I just think you should, in order to better support your claim.

    btw since your implementation does two allocations in new() rather than just one, you should probably also include a benchmark for Trc::new(). (and perhaps also one for converting between SharedTrc and Trc.)

    And you should make sure the tested functions are actually generating different assembly for these types and operations when compiled. Otherwise, you may be actually microbenchmarking something else, like the quality of LLVM's optimizer in piercing certain abstractions. No, "black box" functions are not actually completely immune to the optimizer's gaze.

    enhancement help wanted good first issue 
    opened by workingjubilee 5