A high-level diffing library for Rust based on diffs

Armin Ronacher

Last update: Dec 30, 2022

Overview

Similar: A Diffing Library

Similar is a dependency-free crate for Rust that implements different diffing algorithms and high-level interfaces for them. It is based on the pijul implementation of the Patience algorithm and inherits some ideas from there. It also incorporates the Myers' diff algorithm, which was largely written by Brandon Williams. This library was built for the insta snapshot testing library.

use similar::{ChangeTag, TextDiff};

fn main() {
    let diff = TextDiff::from_lines(
        "Hello World\nThis is the second line.\nThis is the third.",
        "Hallo Welt\nThis is the second line.\nThis is life.\nMoar and more",
    );

    for change in diff.iter_all_changes() {
        let sign = match change.tag() {
            ChangeTag::Delete => "-",
            ChangeTag::Insert => "+",
            ChangeTag::Equal => " ",
        };
        print!("{}{}", sign, change);
    }
}

What's in the box?

Myers' diff
Patience diff
Hunt–McIlroy / Hunt–Szymanski LCS diff
Diffing on arbitrary comparable sequences
Line, word, character and grapheme level diffing
Text and Byte diffing
Unified diff generation

Related Projects

insta snapshot testing library
similar-asserts assertion library

License and Links

Comments

Figure out relationship to diffs-rs
Currently this does not use diffs-rs but it tries to stay largely compatible with it. It is a fork rather than a reuse of the code for the following reasons at the moment:

I want to replace the difference.rs crate with this crate without introducing additional dependencies

The source code of diffs-rs is hard to read and documentation is very, very light

The diffs-rs source code is tracked as part of pijul which is not a very common tool and it's rather cumbersome to find the source code and issue tracker of it if one runs into issues.

The outcome right now is not super great because it should be possible in theory to create a version of this crate which is reduced to exactly what diffs-rs does.
opened by mitsuhiko 5

Crash in UnifiedDiff::to_string method

This code crashes or produces incorrect output:

    let fst = "\u{18}\n\n";
    let snd = "\n\n\r";

    let mut config = similar::TextDiffConfig::default();
    let diff = config.diff_lines(fst, snd);
    let mut output = diff.unified_diff();
    let result = output.context_radius(0).to_string();
    println!("Result:\n{}", result);

opened by sv-91 3

Restricting Trait Bounds in `ChangesIter`

First of all, thank you for this crate.

The iter_changes method in DiffOp requires old and new to be Index<usize, Output = &'x T>, which makes it unusable in many situations; the doc's example won't compile if I change the type of vector from Vec<&str> to Vec<usize>:

use similar::{ChangeTag, Algorithm};
use similar::capture_diff_slices;

let old = vec![1, 2, 3]; // changed from let old = vec!["foo", "bar", "baz"];
let new = vec![1, 2, 4]; // changed from let new = vec!["foo", "bar", "blah"];
let ops = capture_diff_slices(Algorithm::Myers, &old, &new);
let changes: Vec<_> = ops
    .iter()
    .flat_map(|x| x.iter_changes(&old, &new))
    .map(|x| (x.tag(), x.value()))
    .collect();
// The assertion is unchanged from the documentation example.
assert_eq!(changes, vec![
    (ChangeTag::Equal, "foo"),
    (ChangeTag::Equal, "bar"),
    (ChangeTag::Delete, "baz"),
    (ChangeTag::Insert, "blah"),
]);

I know the Rust compiler makes this almost impossible, but is there any workaround for this?
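One possible workaround under the bound quoted above is to diff a vector of references instead of the values themselves, since `Vec<&T>` implements `Index<usize, Output = &T>`. The sketch below uses only the standard library. `lookup` and `as_refs` are hypothetical helpers: `lookup` merely mimics the bound described in this issue and is not part of similar.

```rust
use std::ops::Index;

/// Hypothetical stand-in for a function with the bound described above:
/// the container's *output type* itself must be a reference `&'x T`.
fn lookup<'x, C, T>(container: &C, idx: usize) -> &'x T
where
    C: Index<usize, Output = &'x T> + ?Sized,
    T: 'x,
{
    // `container[idx]` has type `&'x T`; references are `Copy`,
    // so it can be returned directly.
    container[idx]
}

/// Borrow every element so that a slice of `T` becomes a `Vec<&T>`,
/// which satisfies `Index<usize, Output = &T>`.
fn as_refs<T>(values: &[T]) -> Vec<&T> {
    values.iter().collect()
}
```

With `let old = vec![1usize, 2, 3];`, the call `lookup(&as_refs(&old), 2)` type-checks, whereas `lookup(&old, 2)` does not, because `Vec<usize>` yields `usize` rather than `&usize` as its output type.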

enhancement

opened by alidn 3

Chars diff: DiffOp's length is different from str::len — Bug or expected?
Hi,

First, thank you for this great crate.

I expected the length field of DiffOp to be the same as str::len, i.e. the length of the resulting bytes if the text is encoded in UTF-8. They turned out to be different. The former is instead the same as the number of Unicode scalar values (~ code points). Is this a bug or expected? If it's expected, is there a way to get the "bytes-length" from a DiffOp?

Minimal working example:

use similar::{DiffOp, TextDiff};

fn main() {
    let new = "á";
    let diff = TextDiff::from_chars("", new);
    let op = diff.ops()[0];
    if let DiffOp::Insert { old_index, new_index, new_len } = op {
        let real_new_len = new.len();
        let char_count = new.chars().count();
        println!("new_len = {new_len}, real_new_len = {real_new_len}, char_count = {char_count}");
    } else {
        unreachable!();
    }
}

The code above will output:

new_len = 1, real_new_len = 2, char_count = 1

Tested on v2.1.0 and main branch (236a299ff01b8d4bdfc95c6439c1302c8422ae13).
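If the indices and lengths of a chars diff count Unicode scalar values, as the output above suggests, a byte length can be recovered from the original string with the standard library. This is a minimal sketch; `byte_len_of_char_range` is a hypothetical helper, not part of similar.

```rust
/// Convert a range measured in `char`s (Unicode scalar values) into the
/// number of UTF-8 bytes that range occupies in `text`.
fn byte_len_of_char_range(text: &str, char_start: usize, char_len: usize) -> usize {
    text.chars()
        .skip(char_start)      // move to the first char of the range
        .take(char_len)        // keep only the chars inside the range
        .map(char::len_utf8)   // 1 to 4 bytes per char
        .sum()
}
```

For the example above, `byte_len_of_char_range(new, new_index, new_len)` would give the same value as `new.len()`, since the single inserted char is two bytes long in UTF-8.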
opened by danieljl 2
pretty_assertions alternative based on similar

pretty_assertions is pretty popular, but it is based on difference.rs and is also abandoned. It would be great to have an alternative, or even to replace difference.rs in pretty_assertions itself.

opened by RazrFalcon 2

Crash to_string method

Here is another example of a crash:

    let fst = "\ne\n\n\n\n\n\u{2}\n\n\n\n\n\n\n\u{2}\n\n\u{2}\n\n\n\nA\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n";
    let snd = "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n*\0\n\0\n\n\n\n\n\0";

    let mut config = similar::TextDiffConfig::default();
    let diff = config.diff_lines(fst, snd);
    let mut output = diff.unified_diff();
    let result = output.to_string();
    println!("Result:\n{}", result);

And please re-enable context_radius == 0

opened by sv-91 1

`get_close_matches` type inconsistency

get_close_matches expects a slice of type &[&std::string::String], but I have no idea how one could construct this type. Would a move to the more modern &[&str], or even just &[String], be possible?
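For reference, both slice shapes mentioned here can be built from owned strings with the standard library alone. The helper names below are hypothetical, and whether get_close_matches accepts the `&str` form depends on its signature, which is not shown in this issue.

```rust
/// Borrow each owned string; `&string_refs(..)[..]` has type `&[&String]`.
fn string_refs(owned: &[String]) -> Vec<&String> {
    owned.iter().collect()
}

/// View each owned string as a `&str`; `&str_refs(..)[..]` has type `&[&str]`.
fn str_refs(owned: &[String]) -> Vec<&str> {
    owned.iter().map(String::as_str).collect()
}
```

Both vectors only borrow from `owned`, so no string data is copied.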

opened by Daggy1234 1

Mapping diff changes to a stream

I'm trying to integrate similar into https://github.com/helix-editor/helix/pull/228, however I've had quite a lot of difficulty working out at which index in the &str each Change should be applied. I'm trying to use Change::old_index() and Change::new_index(), which sound like what I need, but the code keeps panicking because it unwraps a None value.

The text is a ropey::Rope, and I'm iterating over it with ropey::iter::Chunks, and then decoding it to a user-selected encoding through encoding_rs, though it is UTF-8 the entire way through.

/// (from, to, replacement)
pub type Change = (usize, usize, Option<Tendril>); // https://docs.rs/tendril/0.4.2/tendril/struct.Tendril.html

let iter = self.text.chunks(); // ropey::Rope::chunks()
let iter_len = iter.clone().count();
let mut decoder = encoding.new_decoder(); // encoding_rs::Encoding::new_decoder()
let mut changes: Vec<Change> = Vec::new();

for (i, chunk) in iter.enumerate() {
    // Check if this is the last element in the iterator.
    let is_last = i == iter_len - 1;
    let capacity = Self::calculate_decode_capacity(&mut decoder, chunk.as_bytes());
    let mut buf = String::with_capacity(capacity);
    let mut total_read = 0;

    // Loop until the entire chunk has been decoded into `buf`.
    loop {
        let (result, read, ..) =
            decoder.decode_to_string(chunk[total_read..].as_bytes(), &mut buf, is_last);
        // Track how many bytes we have read so far, in case we need to allocate more
        // capacity to `buf`.
        total_read += read;

        // Check if we need to allocate more capacity to `buf`, otherwise append
        // to `changes`.
        match result {
            encoding_rs::CoderResult::InputEmpty => {
                debug_assert_eq!(total_read, chunk.len());
                let diff = similar::TextDiff::from_unicode_words(chunk, &buf);
                let diff_ops = diff.ops();
                let diff_changes = diff_ops
                    .iter()
                    .flat_map(|x| diff.iter_changes(x))
                    .filter_map(|x| {
                        let index = x.old_index().unwrap_or(x.new_index().unwrap());
                        let value = x.value();

                        match x.tag() {
                            similar::ChangeTag::Delete => {
                                Some((index, index + value.chars().count(), None))
                            }
                            similar::ChangeTag::Insert => {
                                Some((index, index + value.chars().count(), Some(value.into())))
                            }
                            similar::ChangeTag::Equal => None,
                        }
                    });
                changes.extend(diff_changes);

                break;
            }
            encoding_rs::CoderResult::OutputFull => {
                debug_assert!(buf.len() > total_read);
                let needed_capacity =
                    Self::calculate_decode_capacity(&mut decoder, chunk[total_read..].as_bytes());
                buf.reserve(needed_capacity);
            }
        }
    }

    if is_last {
        break;
    }
}

The code has other logic errors, for example it keeps indices relative to the chunk rather than to the overall ropey::Rope, but I don't think that matters for the unwrapping problem.
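One likely cause of the panic is that `Option::unwrap_or` evaluates its argument eagerly, so `x.new_index().unwrap()` runs even when `old_index()` is `Some`. Assuming a deleted change carries no new index (an assumption about the crate, not confirmed in this issue), every deletion would then panic. A standard-library sketch of two alternatives, with hypothetical helper names:

```rust
/// Pick whichever index is present, preferring `old`.
/// `Option::or` never unwraps, so a missing `new` index cannot panic here.
fn pick_index(old: Option<usize>, new: Option<usize>) -> Option<usize> {
    old.or(new)
}

/// Lazy variant of the expression in the report: the closure only runs
/// when `old` is `None`, so `new` is inspected only when it is needed.
fn pick_index_lazy(old: Option<usize>, new: Option<usize>) -> usize {
    old.unwrap_or_else(|| new.expect("a change should have at least one index"))
}
```

In the closure from the report this would read `x.old_index().or(x.new_index())`, followed by whatever handling of `None` fits the caller.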

opened by kirawi 1

Implement Post-processing Cleanup to Counter Prefix/Suffix Optimization
Currently both the lcs and myers implementations (and with that the patience implementation) suffer from the optimization that detects a common prefix and suffix. This means that a diff like the following can be generated:

  (
  Value,
  0,
+ ),
+ (
+ Value,
+ 1,
  ),
  ]

vs the nicer

  (
  Value,
  0,
  ),
+ (
+ Value,
+ 1,
+ ),
  ]

I believe the right thing to do would be to shift the inserts down as much as possible.
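A minimal sketch of that shifting step, assuming an insertion is described by a start index and a length in the new sequence. `shift_insert_down` is a hypothetical helper and not the crate's implementation.

```rust
/// Slide an insertion covering `new[start..start + len]` as far down as
/// possible without changing the old or new sequence the diff describes.
/// Returns the new start index of the insertion.
fn shift_insert_down<T: PartialEq>(new: &[T], mut start: usize, len: usize) -> usize {
    // An empty insertion has nothing to slide.
    if len == 0 {
        return start;
    }
    // If the first inserted item equals the item right after the insertion,
    // the window can move one step: the former first item is then reported
    // as unchanged and the item after the window becomes inserted instead.
    while start + len < new.len() && new[start] == new[start + len] {
        start += 1;
    }
    start
}
```

For the token sequence `(`, `Value,`, `0,`, `),`, `(`, `Value,`, `1,`, `),`, `]` with an insertion of four tokens starting at index 3, the function returns 4, which corresponds to the nicer diff above.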
opened by mitsuhiko 1
internal: remove unneeded boxes
You mentioned here that you had some Boxes you thought were unnecessary - I think you were almost there, I pretty much just shuffled around the lifetimes you already had and got something that worked.

The main hurdle was:

error[E0700]: hidden type for `impl Trait` captures lifetime that does not appear in bounds
   --> src/text/mod.rs:465:54
    |
465 |     pub fn iter_all_changes<'x, 'slf>(&'slf self) -> impl Iterator<Item = Change<'x, T>> + 'slf
    |                                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |
note: hidden type `FlatMap<std::slice::Iter<'slf, types::DiffOp>, impl Iterator, [closure@src/text/mod.rs:471:36: 471:68]>` captures the lifetime `'new` as defined on the impl at 383:12
   --> src/text/mod.rs:383:12
    |
383 | impl<'old, 'new, 'bufs, T: DiffableStr + ?Sized + 'old + 'new> TextDiff<'old, 'new, 'bufs, T> {
    |            ^^^^

Which states that:

the type we're hiding with impl Iterator, Flatmap, is capturing the lifetime 'new

but our type signature for impl Iterator doesn't contain the lifetime 'new

The only lifetime we use in impl Iterator is 'x. So all we need to do is add an additional bound that 'x: 'new, which then means that 'new is implicitly used in our final type signature.

The rest is just adding lifetime bounds on other functions that call this one until the compiler is happy.
opened by tommilligan 1
Make the change type be generic over any T rather than &T

This makes the interface of this crate more flexible as the utility methods such as iter_changes now also work if a container does not contain references.

Fixes #29

opened by mitsuhiko 0
WebAssembly Bindings

This PR fixes #39 by introducing a wasm crate which exposes a simplified version of similar's API to WebAssembly. The simplified API is defined in wasm/similar.wit and can be used by wit-bindgen to generate bindings for various other languages.

The code is fairly straightforward, although we needed to pull in ouroboros so the text being diffed and its results could be stored in the same 'static struct (TextDiff).

For now, I've published the package to wapm.dev under Michael-F-Bryan/similar, but that can be changed by updating the namespace = "..." bit in wasm/Cargo.toml.

I also added a CI job to automatically publish a new version to WAPM whenever a tagged commit is pushed to this repo. It might be useful if forgetting to publish to WAPM is a concern, otherwise we can easily remove .github/workflows/releases.yml.

opened by Michael-F-Bryan 2
WIT Bindings + WAPM
I created the Python library snapshottest a few years ago (similar to the insta that you created in Rust!).

This library requires a string diffing library in order to pretty print the differences, and to do it in python in a fast way I also created the fastdiff library, which is also implemented in Rust.

I would like to start using Similar as a dependency in snapshottest (instead of fastdiff), and I was wondering if you would be interested in that. If so, we will need to:

Create wit bindings in this library

Publish it to WAPM

Let me know your thoughts!
opened by syrusakbary 1
`Index` trait bound makes it impossible to return an owned value

I'm working on json diffing. I have some representation of two jsons in memory, like a serde_json::Value. Obviously these cannot be directly diffed, since they're not array-like. So to diff them I need to flatten them into an array of json-lines. I'd rather not actually construct the array, however: I don't want the memory usage. Instead, I'd like to generate the lines on the fly.

I can almost do this by implementing the Index trait. It doesn't quite work, however, since I need to return a reference. Ideally, I'd like to be able to pass in an FnMut(usize) -> T or something along those lines.

I'm not sure if anybody but me will find this useful, TBH, but I figured there's no harm in asking.
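To illustrate the closure-based shape asked for here, the sketch below compares two virtual sequences whose items are produced on demand. It is hypothetical and uses only the standard library: `common_prefix_len` is not part of similar, and a real diff would need more than a prefix scan.

```rust
/// Length of the common prefix of two virtual sequences whose items are
/// generated by closures instead of being stored in a container.
fn common_prefix_len<T, A, B>(
    old_len: usize,
    new_len: usize,
    mut old_at: A,
    mut new_at: B,
) -> usize
where
    T: PartialEq,
    A: FnMut(usize) -> T,
    B: FnMut(usize) -> T,
{
    let mut n = 0;
    // Each item is created, compared, and dropped again, so only one pair
    // of items is alive at any time.
    while n < old_len && n < new_len && old_at(n) == new_at(n) {
        n += 1;
    }
    n
}
```

Because the closures return owned values, nothing forces the lines to be materialized up front, which is the restriction the `Index` trait imposes by returning a reference.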

opened by rocurley 12
Support custom comparison with capture_diff_slices_by

I should first say, thank you for the crate, this is very useful for even comparing binary data.

This would work like capture_diff_slices, but the user can supply an F: FnMut(&T, &T) -> Ordering. This is more of a wishlist item since I don't have a concrete design in mind. Please feel free to close if this is not possible or not on the roadmap.
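Until something like this exists, one possible workaround is to wrap each element in a newtype whose comparison traits encode the custom rule, and to diff a slice of wrappers. The sketch below is standard library only. `CaseInsensitive` is a hypothetical example type, and whether it satisfies the exact trait bounds of capture_diff_slices is an assumption.

```rust
use std::cmp::Ordering;
use std::hash::{Hash, Hasher};

/// Hypothetical wrapper that compares strings ASCII case-insensitively.
#[derive(Debug, Clone, Copy)]
struct CaseInsensitive<'a>(&'a str);

impl PartialEq for CaseInsensitive<'_> {
    fn eq(&self, other: &Self) -> bool {
        self.0.eq_ignore_ascii_case(other.0)
    }
}

impl Eq for CaseInsensitive<'_> {}

impl Hash for CaseInsensitive<'_> {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // Hash the lowercased bytes so that equal values hash equally.
        for b in self.0.bytes() {
            state.write_u8(b.to_ascii_lowercase());
        }
    }
}

impl PartialOrd for CaseInsensitive<'_> {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

impl Ord for CaseInsensitive<'_> {
    fn cmp(&self, other: &Self) -> Ordering {
        // Compare the lowercased bytes lexicographically.
        let a = self.0.bytes().map(|b| b.to_ascii_lowercase());
        let b = other.0.bytes().map(|b| b.to_ascii_lowercase());
        a.cmp(b)
    }
}
```

The trade-off is that the rule is fixed per type rather than passed per call, so a closure-based capture_diff_slices_by would still be more flexible.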

opened by WiSaGaN 3
Improved Inline Highlighting

Currently inline highlighting always uses word (or unicode word if available) level diffing. There are some expectations that these are character level diffs (https://github.com/mitsuhiko/similar-asserts/issues/1).

There are advantages and disadvantages to both. In its current form I'm not happy with how character diffs look, since similar does not have semantic cleanups like dissimilar has. Though even in dissimilar I found that character level diffs can be awkward on some diffs. This is a meta item to figure out what a good strategy could look like here.

opened by mitsuhiko 0

Owner

Armin Ronacher

Software developer and Open Source nut. Creator of the Flask framework. Engineering at @getsentry. Other things of interest: @pallets and @rust-lang

GitHub https://insta.rs/similar

High-level documentation for rerun

rerun-docs This is the high-level documentation for rerun that is hosted at https://www.rerun.io/docs Other documentation API-level documentation for

9 Feb 19, 2023

A low-ish level tool for easily writing and hosting WASM based plugins.

A low-ish level tool for easily writing and hosting WASM based plugins. The goal of wasm_plugin is to make communicating across the host-plugin bounda

62 Sep 20, 2022

Mononym is a library for creating unique type-level names for each value in Rust.

52 Dec 16, 2022

A low-level I/O ownership and borrowing library

This library introduces OwnedFd, BorrowedFd, and supporting types and traits, and corresponding features for Windows, which implement safe owning and

74 Jan 2, 2023

Rust mid-level IR Abstract Interpreter

MIRAI MIRAI is an abstract interpreter for the Rust compiler's mid-level intermediate representation (MIR). It is intended to become a widely used sta

793 Jan 2, 2023

Low level access to ATmega32U4 registers in Rust

Deprecation Note: This crate will soon be deprecated in favor of avr-device. The approach of generating the svd from hand-written register definitions

9 Jan 27, 2021

A Rust framework to develop and use plugins within your project, without worrying about the low-level details.

VPlugin: A plugin framework for Rust. Website | Issues | Documentation VPlugin is a Rust framework to develop and use plugins on applications and libr

11 Dec 31, 2022

Let Tauri's transparent background rendering window be stacked on Bevy's rendering window in real time to run the game with native-level performance!

Native Bevy with Tauri HUD DEMO Let Tauri's transparent background rendering window be stacked on Bev

3 Mar 25, 2024

Simple procedural macros `tnconst![...]`, `pconst![...]`, `nconst![...]` and `uconst![...]` that returns the type level integer from `typenum` crate.

typenum-consts Procedural macros that take a literal integer (or the result of an evaluation of simple mathematical expressions or an environment vari