A high-level diffing library for Rust based on diffs

Overview

Similar: A Diffing Library

Similar is a dependency-free crate for Rust that implements different diffing algorithms and high-level interfaces for them. It is based on the pijul implementation of the Patience algorithm and inherits some ideas from there. It also incorporates the Myers' diff algorithm, which was largely written by Brandon Williams. This library was built for the insta snapshot testing library.

use similar::{ChangeTag, TextDiff};

fn main() {
    let diff = TextDiff::from_lines(
        "Hello World\nThis is the second line.\nThis is the third.",
        "Hallo Welt\nThis is the second line.\nThis is life.\nMoar and more",
    );

    for change in diff.iter_all_changes() {
        let sign = match change.tag() {
            ChangeTag::Delete => "-",
            ChangeTag::Insert => "+",
            ChangeTag::Equal => " ",
        };
        print!("{}{}", sign, change);
    }
}

What's in the box?

  • Myers' diff
  • Patience diff
  • Hunt–McIlroy / Hunt–Szymanski LCS diff
  • Diffing on arbitrary comparable sequences
  • Line, word, character and grapheme level diffing
  • Text and Byte diffing
  • Unified diff generation
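
To illustrate what diffing on arbitrary comparable sequences means, here is a minimal, std-only sketch of a textbook LCS diff. It is not the crate's implementation, and the names `Tag` and `lcs_diff` are made up for this example; it only shows the kind of edit script such an algorithm produces for any element type that supports `==`.

```rust
/// One step of an edit script (hypothetical type for this sketch).
#[derive(Debug, PartialEq, Clone, Copy)]
enum Tag {
    Equal,
    Delete,
    Insert,
}

/// Diff two slices with a textbook O(n*m) LCS table.
/// Illustrative only: this is not how the crate implements its algorithms.
fn lcs_diff<T: PartialEq + Clone>(old: &[T], new: &[T]) -> Vec<(Tag, T)> {
    let (n, m) = (old.len(), new.len());
    // table[i][j] = length of the LCS of old[i..] and new[j..]
    let mut table = vec![vec![0usize; m + 1]; n + 1];
    for i in (0..n).rev() {
        for j in (0..m).rev() {
            table[i][j] = if old[i] == new[j] {
                table[i + 1][j + 1] + 1
            } else {
                table[i + 1][j].max(table[i][j + 1])
            };
        }
    }

    // Walk the table from the top-left corner and emit the edit script.
    let mut out = Vec::new();
    let (mut i, mut j) = (0, 0);
    while i < n && j < m {
        if old[i] == new[j] {
            out.push((Tag::Equal, old[i].clone()));
            i += 1;
            j += 1;
        } else if table[i + 1][j] >= table[i][j + 1] {
            out.push((Tag::Delete, old[i].clone()));
            i += 1;
        } else {
            out.push((Tag::Insert, new[j].clone()));
            j += 1;
        }
    }
    // Whatever is left on either side is a pure deletion or insertion.
    out.extend(old[i..].iter().cloned().map(|v| (Tag::Delete, v)));
    out.extend(new[j..].iter().cloned().map(|v| (Tag::Insert, v)));
    out
}

fn main() {
    let old = vec![1, 2, 3];
    let new = vec![1, 2, 4];
    for (tag, value) in lcs_diff(&old, &new) {
        println!("{:?} {}", tag, value);
    }
}
```

The quadratic table keeps the sketch short; the algorithms listed above exist precisely because real inputs need something cheaper.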

Comments
  • Figure out relationship to diffs-rs

    Currently this does not use diffs-rs, but it tries to stay largely compatible with it. It is a fork rather than a reuse of the code for the following reasons at the moment:

    • I want to replace the difference.rs crate with this crate and my desire is not to introduce additional dependencies
    • The source code of diffs-rs is hard to read and documentation is very, very light
    • The diffs-rs source code is tracked as part of pijul which is not a very common tool and it's rather cumbersome to find the source code and issue tracker of it if one runs into issues.

    The outcome right now is not super great because it should be possible in theory to create a version of this crate which is reduced to exactly what diffs-rs does.

    opened by mitsuhiko 5
  • Crash in UnifiedDiff::to_string method

    This code crashes or produces incorrect output:

        let fst = "\u{18}\n\n";
        let snd = "\n\n\r";
    
        let mut config = similar::TextDiffConfig::default();
        let diff = config.diff_lines(fst, snd);
        let mut output = diff.unified_diff();
        let result = output.context_radius(0).to_string();
        println!("Result:\n{}", result);
    
    opened by sv-91 3
  • Restricting Trait Bounds in `ChangesIter`

    First of all, thank you for this crate.

    The iter_changes method in DiffOp requires old and new to be Index<usize, Output = &'x T>, which makes it unusable in many situations; the doc's example won't compile if I change the type of vector from Vec<&str> to Vec<usize>:

    use similar::{ChangeTag, Algorithm};
    use similar::capture_diff_slices;
    
    let old = vec![1, 2, 3]; // changed from let old = vec!["foo", "bar", "baz"];
    let new = vec![1, 2, 4]; // changed from let new = vec!["foo", "bar", "blah"];
    let ops = capture_diff_slices(Algorithm::Myers, &old, &new);
    let changes: Vec<_> = ops
        .iter()
        .flat_map(|x| x.iter_changes(&old, &new))
        .map(|x| (x.tag(), x.value()))
        .collect();
    assert_eq!(changes, vec![
        (ChangeTag::Equal, 1),
        (ChangeTag::Equal, 2),
        (ChangeTag::Delete, 3),
        (ChangeTag::Insert, 4),
    ]);
    

    I know the Rust compiler makes this almost impossible, but is there any workaround for this?

    enhancement 
    opened by alidn 3
  • Chars diff: DiffOp's length is different from str::len — Bug or expected?

    Hi,

    First, thank you for this great crate.

    I expected the length field of DiffOp to be the same as str::len, i.e. the length of the resulting bytes if the text is encoded in UTF-8. They turned out to be different. The former is instead the same as the number of Unicode scalar values (~ code points). Is this a bug or expected? If it's expected, is there a way to get the "bytes-length" from a DiffOp?

    Minimal working example:

    use similar::{DiffOp, TextDiff};
    
    fn main() {
        let new = "á";
        let diff = TextDiff::from_chars("", new);
    
        let op = diff.ops()[0];
        if let DiffOp::Insert { old_index, new_index, new_len } = op {
            let real_new_len = new.len();
            let char_count = new.chars().count();
            println!("new_len = {new_len}, real_new_len = {real_new_len}, char_count = {char_count}");
        } else {
            unreachable!();
        }
    }
    

    The code above will output:

    new_len = 1, real_new_len = 2, char_count = 1
    

    Tested on v2.1.0 and main branch (236a299ff01b8d4bdfc95c6439c1302c8422ae13).
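
One possible way to recover a byte length from a char-based length is to map the char range back onto the original string. The sketch below uses only the standard library; the helper name `char_range_to_byte_range` is made up for this example, and it assumes the index and length count Unicode scalar values, as observed in the output above.

```rust
use std::ops::Range;

/// Convert a range expressed in chars (Unicode scalar values) into the
/// matching byte range of `s`. Returns `None` if the range reaches past
/// the end of the string. (Hypothetical helper, not part of the crate.)
fn char_range_to_byte_range(s: &str, start: usize, len: usize) -> Option<Range<usize>> {
    // Byte offset of every char boundary, including the end of the string.
    let mut boundaries = s
        .char_indices()
        .map(|(i, _)| i)
        .chain(std::iter::once(s.len()));
    let byte_start = boundaries.nth(start)?;
    let byte_end = if len == 0 {
        byte_start
    } else {
        // `nth` continues from where the previous call stopped.
        boundaries.nth(len - 1)?
    };
    Some(byte_start..byte_end)
}

fn main() {
    let new = "á";
    // new_index = 0 and new_len = 1, as in the example above.
    let bytes = char_range_to_byte_range(new, 0, 1).expect("range is in bounds");
    println!("new_len = 1, byte_len = {}", bytes.len());
}
```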

    opened by danieljl 2
  • pretty_assertions alternative based on similar

    pretty_assertions is pretty popular and based on difference.rs and also abandoned. It would be great to have an alternative or even replace difference.rs in pretty_assertions itself.

    opened by RazrFalcon 2
  • Crash to_string method

    Here is another example of a crash:

        let fst = "\ne\n\n\n\n\n\u{2}\n\n\n\n\n\n\n\u{2}\n\n\u{2}\n\n\n\nA\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n";
        let snd = "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n*\0\n\0\n\n\n\n\n\0";
    
        let mut config = similar::TextDiffConfig::default();
        let diff = config.diff_lines(fst, snd);
        let mut output = diff.unified_diff();
        let result = output.to_string();
        println!("Result:\n{}", result);
    

    And please re-enable context_radius == 0.

    opened by sv-91 1
  • `get_close_matches` type inconsistency

    get_close_matches expects a slice of type &[&std::string::String], but I have no idea how one could construct this type. Would a move to the more modern &[&str], or even just &[String], be possible?

    opened by Daggy1234 1
  • Mapping diff changes to a stream

    I'm trying to integrate similar into https://github.com/helix-editor/helix/pull/228, however I've been having quite a lot of difficulty in mapping at which index in the &str the Change should be applied. I'm trying to use Change::old_index() and Change::new_index() as it sounds like that's what would help me, but it keeps panicking because it unwraps a None value.

    The text is a ropey::Rope, and I'm iterating over it with ropey::iter::Chunks, and then decoding it to a user-selected encoding through encoding_rs, though it is UTF-8 the entire way through.

    /// (from, to, replacement)
    pub type Change = (usize, usize, Option<Tendril>); // https://docs.rs/tendril/0.4.2/tendril/struct.Tendril.html
    
    let iter = self.text.chunks(); // ropey::Rope::chunks()
    let iter_len = iter.clone().count();
    let mut decoder = encoding.new_decoder(); // encoding_rs::Encoding::new_decoder()
    let mut changes: Vec<Change> = Vec::new();
    
    for (i, chunk) in iter.enumerate() {
        // Check if this is the last element in the iterator.
        let is_last = i == iter_len - 1;
        let capacity = Self::calculate_decode_capacity(&mut decoder, chunk.as_bytes());
        let mut buf = String::with_capacity(capacity);
        let mut total_read = 0;
    
        // Loop until the entire chunk has been decoded into `buf`.
        loop {
            let (result, read, ..) =
                decoder.decode_to_string(chunk[total_read..].as_bytes(), &mut buf, is_last);
            // Track how many bytes we have read so far, in case we need to allocate more
            // capacity to `buf`.
            total_read += read;
    
            // Check if we need to allocate more capacity to `buf`, otherwise append
            // to `changes`.
            match result {
                encoding_rs::CoderResult::InputEmpty => {
                    debug_assert_eq!(total_read, chunk.len());
                    let diff = similar::TextDiff::from_unicode_words(chunk, &buf);
                    let diff_ops = diff.ops();
                    let diff_changes = diff_ops
                        .iter()
                        .flat_map(|x| diff.iter_changes(x))
                        .filter_map(|x| {
                            let index = x.old_index().unwrap_or(x.new_index().unwrap());
                            let value = x.value();
    
                            match x.tag() {
                                similar::ChangeTag::Delete => {
                                    Some((index, index + value.chars().count(), None))
                                }
                                similar::ChangeTag::Insert => {
                                    Some((index, index + value.chars().count(), Some(value.into())))
                                }
                                similar::ChangeTag::Equal => None,
                            }
                        });
                    changes.extend(diff_changes);
    
                    break;
                }
                encoding_rs::CoderResult::OutputFull => {
                    debug_assert!(buf.len() > total_read);
                    let needed_capacity =
                        Self::calculate_decode_capacity(&mut decoder, chunk[total_read..].as_bytes());
                    buf.reserve(needed_capacity);
                }
            }
        }
    
        if is_last {
            break;
        }
    }
    

    The code has logic errors, e.g. it keeps the index relative to the chunk rather than to the overall ropey::Rope, but I don't think that matters for the unwrapping problem.
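
A likely cause of the panic is that `Option::unwrap_or` evaluates its argument eagerly: in `x.old_index().unwrap_or(x.new_index().unwrap())` the inner `unwrap()` runs even when `old_index()` is `Some`. The std-only sketch below shows the difference, assuming (as the method names suggest) that a deletion has only an old index and an insertion only a new one. `pick_index` is a hypothetical helper, not part of the crate.

```rust
/// Pick whichever index is present, preferring the old one.
fn pick_index(old_index: Option<usize>, new_index: Option<usize>) -> Option<usize> {
    // `or` never unwraps, so a missing `new_index` is harmless
    // as long as `old_index` is present.
    old_index.or(new_index)
}

fn main() {
    // Assumed shape of a deletion: only the old side has an index.
    let (old_index, new_index) = (Some(3), None);

    // The eager form evaluates `new_index.unwrap()` first and panics:
    // let index = old_index.unwrap_or(new_index.unwrap());

    println!("{:?}", pick_index(old_index, new_index));
}
```

`old_index.unwrap_or_else(|| new_index.unwrap())` would also avoid the eager evaluation, at the cost of still panicking if both sides are `None`.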

    opened by kirawi 1
  • Implement Post-processing Cleanup to Counter Prefix/Suffix Optimization

    Currently both the lcs and myers implementations (and with that the patience implementation) suffer from the optimization that detects a common prefix and suffix. This means that a diff like the following can be generated:

         (
             Value,
             0,
    +    ),
    +    (
    +        Value,
    +        1,
         ),
     ]
    

    vs the nicer

             Value,
             0,
         ),
    +    (
    +        Value,
    +        1,
    +    ),
     ]
    

    I believe the right thing to do would be to shift the inserts down as much as possible.
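
A minimal sketch of such a shift, using only the standard library: an inserted block can slide down by one position whenever its first element equals the element right after the block. The function name `shift_insert_down` is made up for this example, and the sketch assumes the elements following the insert are unchanged (equal) in both sequences.

```rust
/// Slide the insertion `new[start..start + len]` as far down as possible
/// and return its new start. (Hypothetical helper, not part of the crate.)
///
/// Assumes every element the block slides over belongs to an unchanged
/// (equal) run that directly follows the insertion.
fn shift_insert_down<T: PartialEq>(new: &[T], mut start: usize, len: usize) -> usize {
    // Sliding by one swaps the roles of new[start] and new[start + len],
    // which is only valid when the two are identical.
    while len > 0 && start + len < new.len() && new[start] == new[start + len] {
        start += 1;
    }
    start
}

fn main() {
    let new = [
        "    (", "        Value,", "        0,", "    ),",
        "    (", "        Value,", "        1,", "    ),",
        "]",
    ];
    // The first diff above marks lines 3..7 as inserted.
    let start = shift_insert_down(&new, 3, 4);
    for line in &new[start..start + 4] {
        println!("+{}", line);
    }
}
```

For the example above the block moves from lines 3..7 to lines 4..8, which corresponds to the nicer diff.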

    opened by mitsuhiko 1
  • internal: remove unneeded boxes

    You mentioned here that you had some Boxes you thought were unnecessary - I think you were almost there, I pretty much just shuffled around the lifetimes you already had and got something that worked.

    The main hurdle was:

    error[E0700]: hidden type for `impl Trait` captures lifetime that does not appear in bounds
       --> src/text/mod.rs:465:54
        |
    465 |     pub fn iter_all_changes<'x, 'slf>(&'slf self) -> impl Iterator<Item = Change<'x, T>> + 'slf
        |                                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        |
    note: hidden type `FlatMap<std::slice::Iter<'slf, types::DiffOp>, impl Iterator, [closure@src/text/mod.rs:471:36: 471:68]>` captures the lifetime `'new` as defined on the impl at 383:12
       --> src/text/mod.rs:383:12
        |
    383 | impl<'old, 'new, 'bufs, T: DiffableStr + ?Sized + 'old + 'new> TextDiff<'old, 'new, 'bufs, T> {
        |            ^^^^
    
    

    Which states that:

    • the type we're hiding with impl Iterator, FlatMap, is capturing the lifetime 'new
    • but our type signature for impl Iterator doesn't contain the lifetime 'new

    The only lifetime we use in impl Iterator is 'x. So all we need to do is add an additional bound that 'x: 'new, which then means that 'new is implicitly used in our final type signature.

    The rest is just adding lifetime bounds on other functions that call this one until the compiler is happy.

    opened by tommilligan 1
  • Make the change type be generic over any T rather than &T

    This makes the interface of this crate more flexible as the utility methods such as iter_changes now also work if a container does not contain references.

    Fixes #29

    opened by mitsuhiko 0
  • WebAssembly Bindings

    This PR fixes #39 by introducing a wasm crate which exposes a simplified version of similar's API to WebAssembly. The simplified API is defined in wasm/similar.wit and can be used by wit-bindgen to generate bindings for various other languages.

    The code is fairly straightforward, although we needed to pull in ouroboros so the text being diffed and its results could be stored in the same 'static struct (TextDiff).

    For now, I've published the package to wapm.dev under Michael-F-Bryan/similar, but that can be changed by updating the namespace = "..." bit in wasm/Cargo.toml.

    I also added a CI job to automatically publish a new version to WAPM whenever a tagged commit is pushed to this repo. It might be useful if forgetting to publish to WAPM is a concern, otherwise we can easily remove .github/workflows/releases.yml.

    opened by Michael-F-Bryan 2
  • WIT Bindings + WAPM

    I created the Python library snapshottest a few years ago (similar to the insta that you created in Rust!).

    This library requires a string diffing library in order to pretty print the differences, and to do it in python in a fast way I also created the fastdiff library, which is also implemented in Rust.

    I would like to start using your Similar as a dependency in snapshottest (instead of fastdiff), and I was wondering if you may be interested in that. If that's the case, we will need:

    • To create wit bindings in this library
    • Publish it to WAPM

    Let me know your thoughts!

    opened by syrusakbary 1
  • `Index` trait bound makes it impossible to return an owned value

    I'm working on json diffing. I have some representation of two jsons in memory, like a serde_json::Value. Obviously these cannot be directly diffed, since they're not array-like. So to diff them I need to flatten them into an array of json-lines. I'd rather not actually construct the array, however: I don't want the memory usage. Instead, I'd like to generate the lines on the fly.

    I can almost do this by implementing the Index trait. It doesn't quite work, however, since I need to return a reference. Ideally, I'd like to be able to pass in an FnMut(usize) -> T or something along those lines.

    I'm not sure if anybody but me will find this useful, TBH, but I figured there's no harm in asking.

    opened by rocurley 12
  • Support custom comparison with capture_diff_slices_by

    I should first say, thank you for the crate, this is very useful for even comparing binary data.

    This would be similar to capture_diff_slices, but the user could supply an F: FnMut(&T, &T) -> Ordering. This is more of a wishlist item since I don't have a concrete design in mind. Please feel free to close if this is not possible or not on the roadmap.
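
Until something like that exists, one possible workaround is to wrap the elements in a newtype whose comparison traits encode the custom rule, and diff the wrappers instead. The std-only sketch below does this for ASCII case-insensitive strings; `CaseInsensitive` is a made-up type, and implementing `Ord` and `Hash` alongside `Eq` is an assumption about which bounds a diffing function might require.

```rust
use std::cmp::Ordering;
use std::hash::{Hash, Hasher};

/// Wrapper that compares strings ASCII case-insensitively.
/// (Hypothetical type for this sketch.)
#[derive(Debug, Clone, Copy)]
struct CaseInsensitive<'a>(&'a str);

impl PartialEq for CaseInsensitive<'_> {
    fn eq(&self, other: &Self) -> bool {
        self.0.eq_ignore_ascii_case(other.0)
    }
}

impl Eq for CaseInsensitive<'_> {}

impl Ord for CaseInsensitive<'_> {
    fn cmp(&self, other: &Self) -> Ordering {
        // Allocates for simplicity; ordering must agree with `eq`.
        self.0.to_ascii_lowercase().cmp(&other.0.to_ascii_lowercase())
    }
}

impl PartialOrd for CaseInsensitive<'_> {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

impl Hash for CaseInsensitive<'_> {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // Equal values must hash equally, so hash the folded form.
        self.0.to_ascii_lowercase().hash(state);
    }
}

fn main() {
    let old: Vec<_> = ["Foo", "Bar"].iter().map(|s| CaseInsensitive(s)).collect();
    let new: Vec<_> = ["foo", "baz"].iter().map(|s| CaseInsensitive(s)).collect();
    // Any algorithm that relies on these traits now treats "Foo" and "foo" as equal.
    let same: Vec<bool> = old.iter().zip(&new).map(|(a, b)| a == b).collect();
    println!("{:?}", same);
}
```

The wrapper costs a pass over the input to build the wrapped slices, but it keeps the comparison rule in one place and needs no changes to the diffing code.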

    opened by WiSaGaN 3
  • Improved Inline Highlighting

    Currently inline highlighting is always using word (or unicode word if available) level diffing. There are some expectations that these are character level diffs (https://github.com/mitsuhiko/similar-asserts/issues/1).

    There are advantages and disadvantages to both. In the current form I'm not happy with how character diffs look, since similar does not have semantic cleanups like dissimilar has. Even in dissimilar, though, I found that character-level diffs can be awkward on some inputs. This is a meta item to figure out what a good strategy could look like here.

    opened by mitsuhiko 0
Owner
Armin Ronacher
Software developer and Open Source nut. Creator of the Flask framework. Engineering at @getsentry. Other things of interest: @pallets and @rust-lang