Hey! I'm using the str_utils code in another project, and I was reading the SSE code to understand it better.
In ByteChunk for sse2::__m128i the code currently does this:
    #[inline(always)]
    fn sum_bytes(&self) -> usize {
        const ONES: u64 = std::u64::MAX / 0xFF;
        let tmp = unsafe { std::mem::transmute::<Self, (u64, u64)>(*self) };
        let a = tmp.0.wrapping_mul(ONES) >> (7 * 8);
        let b = tmp.1.wrapping_mul(ONES) >> (7 * 8);
        (a + b) as usize
    }
This is a neat trick, but it forces the "vertical" loop to flush the accumulator every 31 iterations to avoid overflow: the multiply leaves the sum of 8 bytes in a single byte, so each byte can hold at most 31 (8 * 31 = 248, which still fits in 255). I'm no expert at this stuff, but some reading recommended using PSADBW(x, 0) ("Compute sum of absolute differences") instead to sum the bytes of the accumulator.
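To make the overflow limit concrete, here is a minimal scalar sketch of the two approaches on a single 64-bit lane. It is only a model, not the library's code: `sum_bytes_sad_model` imitates what PSADBW against zero computes per lane (a plain byte sum), and `lane_of` is a hypothetical helper made up for this illustration.

```rust
/// Multiply trick on one 64-bit lane: the byte total lands in the top byte,
/// so the result is only exact while the total is at most 255.
pub fn sum_bytes_mul(lane: u64) -> u64 {
    const ONES: u64 = u64::MAX / 0xFF; // 0x0101_0101_0101_0101
    lane.wrapping_mul(ONES) >> (7 * 8)
}

/// Scalar model of PSADBW(x, 0) for one lane: a plain sum of the 8 bytes.
/// (Assumption: this models the instruction, it does not call it.)
pub fn sum_bytes_sad_model(lane: u64) -> u64 {
    lane.to_le_bytes().iter().map(|&b| u64::from(b)).sum()
}

/// Hypothetical helper: a lane with every byte set to `count`.
pub fn lane_of(count: u8) -> u64 {
    u64::from_le_bytes([count; 8])
}
```

With every byte at 31, both functions return 248. At 32 per byte the multiply version wraps to 0 while the byte sum is 256, and at 255 per byte the byte sum is 2040, which is why the PSADBW version should be able to let each byte count all the way up to 255.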
So I changed the code to this:
    #[inline(always)]
    fn max_acc() -> usize {
        255
    }
    #[inline(always)]
    fn sum_bytes(&self) -> usize {
        unsafe {
            let zero = sse2::_mm_setzero_si128();
            let diff = sse2::_mm_sad_epu8(*self, zero);
            let (low, high) = std::mem::transmute::<Self, (u64, u64)>(diff);
            (low + high) as usize
        }
    }
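The change yields a (mostly modest) performance improvement on my Ryzen 5800: roughly 7-8% on line_to_byte and line_to_char, 1-2% on most of the others, and a ~1% regression on byte_to_line: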
ropey:master $ taskset 0x1 nice -10 RUSTFLAGS=-C target-cpu=native cargo criterion -- --measurement-time=10 index_convert
Compiling ropey v1.3.2 (/home/seph/3rdparty/ropey)
Finished bench [optimized] target(s) in 37.01s
index_convert/byte_to_char
time: [41.762 ns 41.799 ns 41.837 ns]
change: [-1.0722% -0.8577% -0.6697%] (p = 0.00 < 0.05)
Change within noise threshold.
index_convert/byte_to_line
time: [103.24 ns 103.25 ns 103.27 ns]
change: [+1.1863% +1.2842% +1.3631%] (p = 0.00 < 0.05)
Performance has regressed.
index_convert/char_to_byte
time: [87.674 ns 87.701 ns 87.730 ns]
change: [-1.6249% -1.5190% -1.4211%] (p = 0.00 < 0.05)
Performance has improved.
index_convert/char_to_line
time: [153.53 ns 153.55 ns 153.57 ns]
change: [-1.4996% -1.3924% -1.2970%] (p = 0.00 < 0.05)
Performance has improved.
index_convert/line_to_byte
time: [143.57 ns 143.65 ns 143.77 ns]
change: [-7.6773% -7.5422% -7.3956%] (p = 0.00 < 0.05)
Performance has improved.
index_convert/line_to_char
time: [143.31 ns 143.34 ns 143.39 ns]
change: [-7.9232% -7.8228% -7.7185%] (p = 0.00 < 0.05)
Performance has improved.
Is this code change correct?
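For reference, one way to sanity-check the new `sum_bytes` outside the crate is sketched below. This is a standalone approximation rather than the actual `ByteChunk` impl: it takes a plain `[u8; 16]` (an assumption made so the example is self-contained), uses `std::arch::x86_64` directly, and falls back to a scalar sum on other targets. The function name `sum_bytes_sad` is made up for this sketch.

```rust
/// Sum 16 bytes with PSADBW against zero (SSE2 is baseline on x86_64).
#[cfg(target_arch = "x86_64")]
pub fn sum_bytes_sad(bytes: [u8; 16]) -> usize {
    use std::arch::x86_64::{__m128i, _mm_loadu_si128, _mm_sad_epu8, _mm_setzero_si128};
    unsafe {
        // Unaligned load of the 16 input bytes.
        let v = _mm_loadu_si128(bytes.as_ptr() as *const __m128i);
        // Each 64-bit half now holds the sum of its 8 bytes (at most 2040).
        let sad = _mm_sad_epu8(v, _mm_setzero_si128());
        let (low, high) = std::mem::transmute::<__m128i, (u64, u64)>(sad);
        (low + high) as usize
    }
}

/// Scalar fallback so the sketch also builds on non-x86_64 targets.
#[cfg(not(target_arch = "x86_64"))]
pub fn sum_bytes_sad(bytes: [u8; 16]) -> usize {
    bytes.iter().map(|&b| usize::from(b)).sum()
}
```

Feeding it `[255; 16]` should give 4080, the worst case when `max_acc()` is 255, so the per-half sums stay well inside the 16 bits that PSADBW writes into each 64-bit half.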