A memory efficient immutable string type that can store up to 24* bytes on the stack

Overview

compact_str

A memory efficient immutable string type that can store up to 24* bytes on the stack.

Continuous Integration Status Cross Platform Status version on crates.io mit license

* 12 bytes for 32-bit architectures


About

A CompactStr is a more memory efficient immutable string type, that can store smaller strings on the stack, and transparently stores longer strings on the heap. They can mostly be used as a drop in replacement for String and are particularly useful in parsing, deserializing, or any other application where you may have smaller strings.

Properties

A CompactStr specifically has the following properties:

  • size_of:: () == size_of:: ()
  • Stores up to 24 bytes on the stack
    • Only up to 23 bytes if the leading character is non-ASCII
    • 12 bytes (or 11 if leading is non-ASCII) if running on a 32 bit architecture
  • Strings longer than 24 bytes are stored on the heap
  • Clone is O(1)

Features

compact_str has the following features:

  1. serde, which implements Deserialize and Serialize from the popular serde crate, for CompactStr.

How it works

Note: this explanation assumes a 64-bit architecture, for 32-bit architectures generally divide any number by 2.

Normally strings are stored on the heap since they're dynamically sized. In Rust a String consists of three fields, each of which are the size of a usize. e.g. its layout is something like the following:

String: [ ptr<8> | len<8> | cap<8> ]

  1. ptr is a pointer to a location on the heap that stores the string
  2. len is the length of the string
  3. cap is the total capacity of the buffer being pointed to

This results in 24 bytes being stored on the stack, 8 bytes for each field. Then the actual string is stored on the heap, usually with additional memory allocated to prevent re-allocating if the string is mutated.

The idea of CompactStr is instead of storing metadata on the stack, just store the string itself. This way for smaller strings we save a bit of memory, and we don't have to heap allocate so it's more performant. A CompactStr is limited to 24 bytes (aka size_of:: () ) so it won't ever use more memory than a String would.

The memory layout of a CompactStr looks something like:

CompactStr: [ len<1> | buffer<23> ]

Memory Layout

Internally a CompactStr has three variants:

  1. Heap allocated, a string >= 24 bytes long
  2. Inline, a string <= 23 bytes long
  3. Packed, a string == 24 bytes long and first character is ASCII

To maximize memory usage, we use a union instead of an enum. In Rust an enum requires at least 1 byte for the discriminant (tracking what variant we are), instead we use a union which allows us to manually define the discriminant. CompactStr defines the discriminant within the first byte, using any extra bits for metadata. Specifically the discriminant has three variants:

  1. 0b11111111 - All 1s, indicates heap allocated
  2. 0b1XXXXXXX - Leading 1, indicates inline, with the trailing 7 bits used to store the length
  3. 0b0XXXXXXX - Leading 0, indicates packed, with the trailing 7 bits being the first character of the string

and specifically the overall memory layout of a CompactStr is:

  1. heap: { _padding: [u8; 8], string: Arc }
  2. inline: { metadata: u8, buffer: [u8; 23] }
  3. packed: { buffer: [u8; 24] }

All variants are 24 bytes long

For heap allocated strings we use an Arc which is only 16 bytes, so we prefix it with 8 bytes of padding to make it equal to the other sizes. This padding is set to all 1's since it doesn't pertain to the actual string at all, and it allows us to define a unique discriminant. You might be wondering though, how can we be sure the other two variants will never have all 1's as their first byte?

  • The inline variant will never have all 1's for it's first byte because we use the trailing 7 bits to store length, all 1's would indicate a length of 127. Our max length is 23, which is < 127, and even on 128-bit architectures we'd only be able to inline 63 bytes, still < our 127 limit.
  • The packed variant will never have all 1's for it's first byte because we define the first byte to be ASCII. All strings in Rust use UTF-8 encoding, and UTF-8 encoding does not support Extended ASCII. Meaning, our first character will have a decimal value <= 127, guaranteeing the first bit to always be 0.

unsafe code

CompactStr uses a bit of unsafe code because accessing fields from a union is inherently unsafe, the compiler can't guarantee what value is actually stored. That being said, uses of unsafe code in this library are quite limited and constrained to only where absolutely necessary, and always documented with // SAFETY: .

Testing

Strings and unicode can be quite messy, even further, we're working with things at the bit level. To guard against bugs, compact_str uses a mix of unit testing for sanity checks and randomized testing for correctness; then automatically runs these tests on 64-bit, 32-bit, big endian, and little endian architectures. Further, every 12 hours 1 billion unicode strings are generated and ensured to roundtrip through CompactStr, and we assert their location on either the stack or the heap.

Similar Crates

Storing strings on the stack is not a new idea, in fact there are a few other crates in the Rust ecosystem that do similar things, an incomplete list:

  1. smol_str - Can inline 22 bytes, Clone is O(1), doesn't adjust for 32-bit archs
  2. smartstring - Can inline 23 bytes, Clone is O(n), is mutable, relies on the memory layout of String

Thanks for readingme!
Comments
  • Add && Impl new trait ToCompactStr

    Add && Impl new trait ToCompactStr

    Similar to std::string::ToString, ToCompactStr convert any value implements fmt::Display into a CompactStr efficiently.

    However, due to the fact that Arc::<[MaybeUninit<T>]> is still unstable, there is no way to implement HeapString efficiently without first creating a String, then copy it onto the buffer and deallocate the original String.

    It seems that the compiler may not be able to optimize these calls, especially with std::format! calling format.

    Signed-off-by: Jiahao XU [email protected]

    opened by NobodyXu 17
  • Smol option

    Smol option

    #22 but up to date with git main. (Missing some extra tests from #22, though.)

    Attempts to avoid the perf cliff seen by #22 by dealing almost exclusively in "ReprNiche", and only using ReprUnion as a view into ReprNiche.

    ( đź‘Ź on figuring out how to use UTF-8 to get a full 24 bytes inline, btw! )

    opened by CAD97 12
  • Is it possible to make Option<CompactStr> same size as Option<String>?

    Is it possible to make Option same size as Option?

    It would be very beneficial for me to reduce size of Option<CompactStr>. I use lots of it in my projects.

    I made some tests, and it looks like Option<ComactStr> takes additional 8 bytes in compaction to Option<String>.

    Rust has null pointer optimization (see https://doc.rust-lang.org/std/option/index.html#representation) but I not sure if it can by used in userland crate.

    sizeofs of popular Strings.

    String: 24 bytes
    Option<String>: 24 bytes
    CompactStr: 24 bytes
    Option<CompactStr>: 32 bytes
    KString: 24 bytes
    Option<KString>: 24 bytes
    SmartString<Compact>: 24 bytes
    Option<SmartString<Compact>>: 32 bytes
    SmolStr: 24 bytes
    Option<SmolStr>: 24 bytes
    
    opened by mateuszkj 11
  • Consider mutable representation for heap variant

    Consider mutable representation for heap variant

    Currently, the heap variant is Arc<str>. This has a couple drawbacks:

    • Arc wastes code/space for weak pointer support which we don't use
    • mutating the string requires a full copy of data
    • (need to re-check this) pointer in Arc<str> points to the start of the ArcInner, not to the start of str, so there's extra offset during deref.

    An ideal representation would be to store *const str, which points:

                                 here
                                   V
    | rc: AtomicUsize | cap: usize | str byte data | spare capacity |
    

    This should allow us to just mutate the string just fine (using Arc::make_mut-like API).

    Admittedly, this increases the complexity of the implementation quit a bit.

    opened by matklad 8
  • Implement fmt::Write for CompactStr

    Implement fmt::Write for CompactStr

    This makes it possible to write!() into a CompactStr. This is useful if you know that the formatted string will be (most often) shorter than 24 bytes.

    opened by Kijewski 5
  • tests: Remove cargo-nextest because it wasn't running anything

    tests: Remove cargo-nextest because it wasn't running anything

    I just realized the workflows that use cargo-nextest are completing so fast because it's not actually running any tests, example. For now I'm just removing nextest because it's not obvious why it's not actually testing anything

    opened by ParkMyCar 4
  • perf: Option<CompactString> same size as CompactString

    perf: Option same size as CompactString

    Changes CompactString such that Option<CompactString> is the same size!

    Specifically size_of::<CompactString>() == size_of::<Option<CompactString>>() == size_of::<String>(). We're able to achieve this by defining an enum NonMaxU8 where the valid values are [0,255), this allows the compiler to use 255 to store the None variant of the Option<_>.

    Combination of #22 and #75

    @CAD97 I pulled in your changes from #75 and then updated some comments and fuzzing tests, what do you think?

    opened by ParkMyCar 3
  • feature: easy integer to CompactStr

    feature: easy integer to CompactStr

    TL;DR: 1234.to_string() but without heap allocation (and without std::fmt).

    My use case are file-descriptor which I need to pass to a Command (and do other things with them before) so I need to have them as a &str instead of an i32/RawFd. It would be great if compact_str would support creation from an integer type (possible with itoa).

    If you think this does not match the goal of the project fell free to close.

    opened by rusty-snake 3
  • Optimize fmt::Write for trivial cases

    Optimize fmt::Write for trivial cases

    This PR adds an optimization to CompactStr's fmt::Write implementation for invocations like that extend the buffer by a single string, e.g. write!(s, "example")

    opened by Kijewski 3
  • feature: Create a new heap variant to replace Arc<str>

    feature: Create a new heap variant to replace Arc

    This PR changes our HeapString implementation from using an Arc<str> to a custom defined ArcString type. This type is more analogous to Arc<[u8]>, specifically it allocates a buffer which may be larger than the "string" it contains. This gets us closer to being a true drop-in replacement for String and allows us to write APIs for mutating a CompactStr

    opened by ParkMyCar 3
  • draft: Optimize Option<CompactStr> to be the same size as CompactStr

    draft: Optimize Option to be the same size as CompactStr

    This is a proof of concept to show that it's possible in stable Rust to get std::mem::size_of<CompactStr> == std::mem::size_of::<Option<CompactStr>>().

    Unfortunately as-is the performance of CompactStr is degraded. We would need to fix this before merging.

    Fixes #19

    opened by ParkMyCar 3
  • [WIP] perf: Refactor underlying buffers for branchless access

    [WIP] perf: Refactor underlying buffers for branchless access

    This PR refactors the underlying InlineString and BoxString structs in a major way to allow for branch-less string accesses. It also renames these structs to InlineBuffer and HeapBuffer, treating these types strictly as "buffers". We also move string operations into Repr (e.g. pushing and popping), which de-dupes code.

    opened by ParkMyCar 0
  • Should From<String> convert to inline if possible?

    Should From convert to inline if possible?

    I was profiling something and noticed that CompactString::clone was hitting the allocator consistently even though the strings were small. Turns out that if you convert a String into CompactString using the From/Into trait it'll do so by taking ownership of the backing allocation as expected. But as a side-effect further clones will behave like String::clone even if the strings would fit inline.

    If you are using CompactString chances are you expect it to be inlineable most of the time so this comes as a performance pitfall depending on how the CompactString was created.

    I suggest changing From<String> to convert to inline if possible to avoid this scenario. Doing so may make From<CompactString> for String slower but I'll argue that an allocation is expected in this case and any re-use is a bonus. Alternatively the Clone implementation can check if the clone should be inline even if the source is boxed.

    One could also argue for full flexibility and that the behavior should be kept and a new function added CompacString::compact, but it's both cumbersome to use and most people will never realize it's thing. If such hyperspecific apis are desirable I suggest a different constructor that keeps the exiting behavior CompactString::from_string_buffer.

    opened by arthurprs 0
Owner
Parker Timmerman
Project names drawn on inspiration from Calvin and Hobbes.
Parker Timmerman
An annotated string type in Rust, made up of string slices

A string type made up of multiple annotated string slices.

Togglebit 3 Dec 29, 2022
Translate C++/Rust type into C type with the same memory layout

clayout, translate C++/Rust type into C type with the same memory layout. Generally, clayout is used together with bpftrace. clayout is developed on d

盏一 11 Nov 17, 2022
An efficient method of heaplessly converting numbers into their string representations, storing the representation within a reusable byte array.

NumToA #![no_std] Compatible with Zero Heap Allocations The standard library provides a convenient method of converting numbers into strings, but thes

Michael Murphy 42 Sep 6, 2022
Stack buffer provides alternatives to Buf{Reader,Writer} allocated on the stack instead of the heap.

StackBuf{Reader,Writer} Stack buffer provides alternatives to BufReader and BufWriter allocated on the stack instead of the heap. Its implementation i

Alex Saveau 14 Nov 20, 2022
Rc version `tokio-rs/bytes`

RcBytes The aim for this crate is to implement a Rc version bytes, which means that the structs in this crate does not implement the Sync and Send. Th

Al Liu 2 Aug 1, 2022
A high-performance SPSC bounded circular buffer of bytes

Cueue A high performance, single-producer, single-consumer, bounded circular buffer of contiguous elements, that supports lock-free atomic batch opera

Thaler Benedek 38 Dec 28, 2022
LinkedBytes is a linked list of Bytes and BytesMut.

LinkedBytes LinkedBytes is a linked list of Bytes and BytesMut (though we use VecDeque to implement it now). It is primarily used to manage Bytes and

Volo 5 Dec 9, 2022
messloc is a drop in replacement for malloc that can transparently recover from memory fragmentation without any changes to application code.

messloc is a drop in replacement for malloc that can transparently recover from memory fragmentation without any changes to application code. Goals Al

null 11 Dec 10, 2022
microtemplate - A fast, microscopic helper crate for runtime string interpolation.

microtemplate A fast, microscopic helper crate for runtime string interpolation. Design Goals Very lightweight: I want microtemplate to do exactly one

_iPhoenix_ 13 Jan 31, 2022
SubStrings, Slices and Random String Access in Rust

SubStrings, Slices and Random String Access in Rust This is a simple way to do it. Description Rust string processing is kind of hard, because text in

JoĂŁo Nuno Carvalho 2 Oct 24, 2021
A simple string interner / symbol table for Rust projects.

Symbol Interner A small Rust crate that provides a naĂŻve string interner. Consult the documentation to learn about the types that are exposed. Install

Ryan Chandler 1 Nov 18, 2021
A string truncator and scroller written in Rust

scissrs A string truncator and scroller written in Rust. Usage scissrs --help covers the definitions of this program's flags.

Skybbles 5 Aug 3, 2022
Count and convert between different indexing schemes on utf8 string slices

Str Indices Count and convert between different indexing schemes on utf8 string slices. The following schemes are currently supported: Chars (or "Unic

Nathan Vegdahl 11 Dec 25, 2022
Rust library to detect bots using a user-agent string

Rust library to detect bots using a user-agent string

Bryan Morgan 8 Dec 21, 2022
A simple string parsing utility library for Rust, supporting no_std contexts.

strp Utility library for parsing data from an input string, or stdin if built with the std feature. Supports no_std contexts when built without the st

iqon 5 Nov 3, 2022
A tool to subscribe to Twitch channels and store them efficiently on disk

twitch-messages A tool to subscribe to Twitch channels and store them efficiently on disk Build the Tools You can start by building the binaries that

Clément Renault 1 Oct 31, 2021
A lambda extension to hot reload parameters from SSM Parameter Store, Secrets Manager, DynamoDB, AppConfig

A lambda extension to hot reload parameters from SSM Parameter Store, Secrets Manager, DynamoDB, AppConfig

Jake Scott 7 Jun 12, 2022
Nix binary cache implemented in rust using libnix-store

harmonia Build Whole application nix-shell --run cargo b C Library Wrapper around libnixstore nix-shell --run make Note: The makefile is only to pro

Helsinki Systems 84 Dec 24, 2022
A stack-allocated box that stores trait objects.

This crate allows saving DST objects in the provided buffer. It allows users to create global dynamic objects on a no_std environment without a global allocator.

Aleksey Sidorov 19 Dec 13, 2022