I was able to achieve a 1.7x speedup on 6-limb mul_assign
against my previous pull request #163, bringing the total speedup to about 1.84x. This is achieved with a generic code-generation script utilising build.rs
that generates x64 assembly utilising the mulxq
, adoxq
and adcxq
instructions for 2 up to 8 limbs.
~Update: was able to extend work to 18 limbs, achieving 1.34x speedup on Fp832. May be worth looking into translating the old square_in_place
into assembly, but this is a significant amount of work.~
Update: Put behind feature. Build with:
RUSTFLAGS="--emit=asm -C target-feature=+bmi2,+adx" cargo +nightly bench/test/build --features asm
Next steps: more limbs with less data movement using kilic's strategy. Extend strategy to square_in_place
.
Details
I implemented the assembly version of square_in_place
as applying mul_assign
to itself. That's because the goff squaring formula is not really good, while the original squaring formula requires twice the number of registers and hence goes beyond the scope of this PR, and may in fact be slower in the end, as this PR takes massive advantage of the add and carry chains present in mul_assign
.
Update: I cleaned up the messy build.rs
by creating a standalone subcrate and importing the generate function.
Future work
Currently, the maximum number of limbs supported is 8, and could be possibly be extended by current methods to about 10 for mul and 11 for squaring. However, going beyond that may require additional work to optimally move data to and from registers into memory (i.e. L1 cache). One should study how a compiler would reason about this to write a reasonable code generation procedure script to do this.
Edit: our data movement is extremely suboptimal. https://raw.githubusercontent.com/kilic/fp/master/generic/x86_arithmetic.s is able to achieve 149ns on 13 limbs, whereas we achieve 190ns. There's a lot to learn here.
We are also somehow still trailing behind MCL, probably the gold-standard for pairings. Here are the benchmarks, run on a similar i7-7700 to mine. Although our Fp performance totally outclasses it, everything else is lagging behind. It manages to achieve a staggering ~0.7ms BLS12_381 pairing (?). Their webassembly target achieves a sickening 3.5ms on my Chrome browser.
I suspect there could be bottlenecks in adc, sbb, negate. Negate is much slower than sub_assign. This I suspect could be fixed with a little more inline assembly, since it's unclear the compiler knows how to do adc. In particular, the modulus can be hard-coded for negate. Assuming we can shave of 3ns per add/sub, we can probably improve performance by another 5%. (Given this, it may indeed be worthwhile to write a procedural macro and do a cleanup. This may turn out not to be possible. But it is worth trying.)
Areas for Improvement
This however does not explain the discrepancy. Looking at the benchmarks more carefully, for BLS12_381, Fq12 mul is ~6800ns, whereas for MCL it is ~3000ns. We're better for Miller loop, 413us vs 527us (but not the precomputed Miller loop (fixed Q in G12) - 405us (which are we using?)). Nonetheless, those timings were for the BN curve, so really we need to improve the Miller loop (7500Mq) down to about 350us.
But this makes it apparent that the significant weakness is in the final exponentiation (9000Mq). We must get timings down from 850us down to 414us, a 2x speedup. Part of this is probably due to our slow Fq inverse timings. This is clear also by examining pairing formulas, where the number of field ops for the final exponentiation should be roughly similar to the Miller loop, but for us the former has a timing double the latter. I estimate fixing this could give another 35% boost to our pairings.
G2 should also be fixed - 25% slowdown - 2.7us vs 2.1us. for add, 1.85us vs 1.49us for double. Our mul assign is apparently ~3x slower (930us vs 360us).
G1 add 529us vs MCL's 472us. Suggests, as suspected, some inefficiency (probably from the very slow Fq doubling). Our G1 doubling beats it however. Should investigate why. Mul is 25% slower (v.s. 154us), but should be fixable with standalone pippenger.
MCL also implements GLV.
I will try to study MCL and obtain more detailed and up to date benchmarks for it.
Main results:
384-bit mul_assign
bls12_377::fq::bench_fq_mul_assign 51,780 30,340 -21,440 -41.41% x 1.71
bls12_381::fq::bench_fq_mul_assign 52,143 30,280 -21,863 -41.93% x 1.72
sw6::fr::bench_fr_mul_assign 50,664 30,005 -20,659 -40.78% x 1.69
384-bit square_in_place
:
bls12_377::fq::bench_fq_square 43,704 30,089 -13,615 -31.15% x 1.45
bls12_381::fq::bench_fq_square 44,295 31,291 -13,004 -29.36% x 1.42
sw6::fr::bench_fr_square 42,764 30,561 -12,203 -28.54% x 1.40
256-bit mul_assign
:
bls12_377::fr::bench_fr_mul_assign 25,583 16,173 -9,410 -36.78% x 1.58
bls12_381::fr::bench_fr_mul_assign 27,166 17,113 -10,053 -37.01% x 1.59
256-bit square_in_place
:
bls12_377::fr::bench_fr_square 21,929 15,620 -6,309 -28.77% x 1.40
bls12_381::fr::bench_fr_square 24,908 16,834 -8,074 -32.42% x 1.48
384-bit G1:
bls12_377::ec::g1::bench_g1_add_assign 1,048,201 755,061 -293,140 -27.97% x 1.39
bls12_377::ec::g1::bench_g1_add_assign_mixed 749,362 529,950 -219,412 -29.28% x 1.41
bls12_377::ec::g1::bench_g1_double 535,695 418,700 -116,995 -21.84% x 1.28
bls12_377::ec::g1::bench_g1_mul_assign 269,918 202,057 -67,861 -25.14% x 1.34
bls12_381::ec::g1::bench_g1_add_assign 1,059,324 769,470 -289,854 -27.36% x 1.38
bls12_381::ec::g1::bench_g1_add_assign_mixed 758,426 534,781 -223,645 -29.49% x 1.42
bls12_381::ec::g1::bench_g1_double 539,424 419,918 -119,506 -22.15% x 1.28
bls12_381::ec::g1::bench_g1_mul_assign 274,129 206,160 -67,969 -24.79% x 1.33
bls12_381::ec::g1::bench_g1_rand 274,444 204,898 -69,546 -25.34% x 1.34
384-bit pairings:
bls12_377::pairing::pairing::bench_pairing_final_exponentiation 1,292,296 1,087,830 -204,466 -15.82% x 1.19
bls12_377::pairing::pairing::bench_pairing_full 2,235,430 1,845,299 -390,131 -17.45% x 1.21
bls12_377::pairing::pairing::bench_pairing_miller_loop 631,072 510,977 -120,095 -19.03% x 1.24
bls12_381::pairing::pairing::bench_pairing_final_exponentiation 1,072,725 847,028 -225,697 -21.04% x 1.27
bls12_381::pairing::pairing::bench_pairing_full 1,871,067 1,472,014 -399,053 -21.33% x 1.27
bls12_381::pairing::pairing::bench_pairing_miller_loop 546,691 413,627 -133,064 -24.34% x 1.32
SW6 (updated)
sw6::fq::bench_fq_mul_assign 253,632 189,117 -64,515 -25.44% x 1.34
sw6::fq::bench_fq_square 208,612 192,834 -15,778 -7.56% x 1.08
sw6::ec::g1::bench_g1_add_assign 4,087,676 3,464,448 -623,228 -15.25% x 1.18
sw6::ec::g1::bench_g1_add_assign_mixed 2,888,104 2,429,125 -458,979 -15.89% x 1.19
sw6::ec::g1::bench_g1_double 2,608,558 2,417,105 -191,453 -7.34% x 1.08
sw6::ec::g1::bench_g1_mul_assign 1,758,755 1,576,919 -181,836 -10.34% x 1.12
sw6::pairing::pairing::bench_pairing_final_exponentiation 8,648,924 7,737,242 -911,682 -10.54% x 1.12
sw6::pairing::pairing::bench_pairing_full 95,098,636 85,313,158 -9,785,478 -10.29% x 1.11
sw6::pairing::pairing::bench_pairing_miller_loop 86,559,010 77,266,953 -9,292,057 -10.73% x 1.12
Full results
bls12_377::ec::g1::bench_g1_add_assign 1,048,201 755,061 -293,140 -27.97% x 1.39
bls12_377::ec::g1::bench_g1_add_assign_mixed 749,362 529,950 -219,412 -29.28% x 1.41
bls12_377::ec::g1::bench_g1_double 535,695 418,700 -116,995 -21.84% x 1.28
bls12_377::ec::g1::bench_g1_mul_assign 269,918 202,057 -67,861 -25.14% x 1.34
bls12_377::ec::g1::bench_g1_rand 267,739 202,269 -65,470 -24.45% x 1.32
bls12_377::ec::g2::bench_g2_add_assign 4,612 3,798 -814 -17.65% x 1.21
bls12_377::ec::g2::bench_g2_add_assign_mixed 3,232 2,703 -529 -16.37% x 1.20
bls12_377::ec::g2::bench_g2_double 2,158 1,877 -281 -13.02% x 1.15
bls12_377::ec::g2::bench_g2_mul_assign 1,102,020 937,104 -164,916 -14.96% x 1.18
bls12_377::ec::g2::bench_g2_rand 1,080,859 922,946 -157,913 -14.61% x 1.17
bls12_377::fq12::bench_fq12_add_assign 133 131 -2 -1.50% x 1.02
bls12_377::fq12::bench_fq12_double 126 114 -12 -9.52% x 1.11
bls12_377::fq12::bench_fq12_inverse 25,742 28,005 2,263 8.79% x 0.92
bls12_377::fq12::bench_fq12_mul_assign 6,874 5,566 -1,308 -19.03% x 1.23
bls12_377::fq12::bench_fq12_square 4,723 3,882 -841 -17.81% x 1.22
bls12_377::fq12::bench_fq12_sub_assign 119 116 -3 -2.52% x 1.03
bls12_377::fq2::bench_fq2_add_assign 17 16 -1 -5.88% x 1.06
bls12_377::fq2::bench_fq2_double 17 20 3 17.65% x 0.85
bls12_377::fq2::bench_fq2_inverse 13,930 18,062 4,132 29.66% x 0.77
bls12_377::fq2::bench_fq2_mul_assign 249 191 -58 -23.29% x 1.30
bls12_377::fq2::bench_fq2_sqrt 127,995 96,823 -31,172 -24.35% x 1.32
bls12_377::fq2::bench_fq2_square 231 227 -4 -1.73% x 1.02
bls12_377::fq2::bench_fq2_sub_assign 19 19 0 0.00% x 1.00
bls12_377::fq::bench_fq_add_assign 10 10 0 0.00% x 1.00
bls12_377::fq::bench_fq_double 9,108 9,111 3 0.03% x 1.00
bls12_377::fq::bench_fq_from_repr 59 38 -21 -35.59% x 1.55
bls12_377::fq::bench_fq_into_repr 32 31 -1 -3.12% x 1.03
bls12_377::fq::bench_fq_inverse 13,708 17,631 3,923 28.62% x 0.78
bls12_377::fq::bench_fq_mul_assign 51,780 30,340 -21,440 -41.41% x 1.71
bls12_377::fq::bench_fq_negate 13 14 1 7.69% x 0.93
bls12_377::fq::bench_fq_repr_add_nocarry 8 10 2 25.00% x 0.80
bls12_377::fq::bench_fq_repr_div2 6 6 0 0.00% x 1.00
bls12_377::fq::bench_fq_repr_mul2 5 13 8 160.00% x 0.38
bls12_377::fq::bench_fq_repr_num_bits 4 3 -1 -25.00% x 1.33
bls12_377::fq::bench_fq_repr_sub_noborrow 9 11 2 22.22% x 0.82
bls12_377::fq::bench_fq_sqrt 77,501 54,993 -22,508 -29.04% x 1.41
bls12_377::fq::bench_fq_square 43,704 30,089 -13,615 -31.15% x 1.45
bls12_377::fq::bench_fq_sub_assign 11 11 0 0.00% x 1.00
bls12_377::fr::bench_fr_add_assign 7 7 0 0.00% x 1.00
bls12_377::fr::bench_fr_double 10,316 5,832 -4,484 -43.47% x 1.77
bls12_377::fr::bench_fr_from_repr 31 22 -9 -29.03% x 1.41
bls12_377::fr::bench_fr_into_repr 20 19 -1 -5.00% x 1.05
bls12_377::fr::bench_fr_inverse 7,516 5,451 -2,065 -27.47% x 1.38
bls12_377::fr::bench_fr_mul_assign 25,583 16,173 -9,410 -36.78% x 1.58
bls12_377::fr::bench_fr_negate 10 7 -3 -30.00% x 1.43
bls12_377::fr::bench_fr_repr_add_nocarry 6 7 1 16.67% x 0.86
bls12_377::fr::bench_fr_repr_div2 4 4 0 0.00% x 1.00
bls12_377::fr::bench_fr_repr_mul2 4 7 3 75.00% x 0.57
bls12_377::fr::bench_fr_repr_num_bits 4 3 -1 -25.00% x 1.33
bls12_377::fr::bench_fr_repr_sub_noborrow 6 8 2 33.33% x 0.75
bls12_377::fr::bench_fr_sqrt 30,712 22,193 -8,519 -27.74% x 1.38
bls12_377::fr::bench_fr_square 21,929 15,620 -6,309 -28.77% x 1.40
bls12_377::fr::bench_fr_sub_assign 8 7 -1 -12.50% x 1.14
bls12_377::pairing::pairing::bench_pairing_final_exponentiation 1,292,296 1,087,830 -204,466 -15.82% x 1.19
bls12_377::pairing::pairing::bench_pairing_full 2,235,430 1,845,299 -390,131 -17.45% x 1.21
bls12_377::pairing::pairing::bench_pairing_miller_loop 631,072 510,977 -120,095 -19.03% x 1.24
bls12_381::ec::g1::bench_g1_add_assign 1,059,324 769,470 -289,854 -27.36% x 1.38
bls12_381::ec::g1::bench_g1_add_assign_mixed 758,426 534,781 -223,645 -29.49% x 1.42
bls12_381::ec::g1::bench_g1_double 539,424 419,918 -119,506 -22.15% x 1.28
bls12_381::ec::g1::bench_g1_mul_assign 274,129 206,160 -67,969 -24.79% x 1.33
bls12_381::ec::g1::bench_g1_rand 274,444 204,898 -69,546 -25.34% x 1.34
bls12_381::ec::g2::bench_g2_add_assign 4,660 3,785 -875 -18.78% x 1.23
bls12_381::ec::g2::bench_g2_add_assign_mixed 3,267 2,691 -576 -17.63% x 1.21
bls12_381::ec::g2::bench_g2_double 2,175 1,871 -304 -13.98% x 1.16
bls12_381::ec::g2::bench_g2_mul_assign 1,116,571 936,037 -180,534 -16.17% x 1.19
bls12_381::ec::g2::bench_g2_rand 1,084,845 919,647 -165,198 -15.23% x 1.18
bls12_381::fq12::bench_fq12_add_assign 130 125 -5 -3.85% x 1.04
bls12_381::fq12::bench_fq12_double 126 112 -14 -11.11% x 1.12
bls12_381::fq12::bench_fq12_inverse 24,139 25,376 1,237 5.12% x 0.95
bls12_381::fq12::bench_fq12_mul_assign 6,199 4,687 -1,512 -24.39% x 1.32
bls12_381::fq12::bench_fq12_square 4,118 3,150 -968 -23.51% x 1.31
bls12_381::fq12::bench_fq12_sub_assign 119 120 1 0.84% x 0.99
bls12_381::fq2::bench_fq2_add_assign 18 20 2 11.11% x 0.90
bls12_381::fq2::bench_fq2_double 17 19 2 11.76% x 0.89
bls12_381::fq2::bench_fq2_inverse 14,351 18,160 3,809 26.54% x 0.79
bls12_381::fq2::bench_fq2_mul_assign 225 162 -63 -28.00% x 1.39
bls12_381::fq2::bench_fq2_sqrt 124,951 83,086 -41,865 -33.51% x 1.50
bls12_381::fq2::bench_fq2_square 177 142 -35 -19.77% x 1.25
bls12_381::fq2::bench_fq2_sub_assign 19 19 0 0.00% x 1.00
bls12_381::fq::bench_fq_add_assign 10 10 0 0.00% x 1.00
bls12_381::fq::bench_fq_double 9,240 9,062 -178 -1.93% x 1.02
bls12_381::fq::bench_fq_from_repr 59 39 -20 -33.90% x 1.51
bls12_381::fq::bench_fq_into_repr 32 31 -1 -3.12% x 1.03
bls12_381::fq::bench_fq_inverse 14,949 17,811 2,862 19.15% x 0.84
bls12_381::fq::bench_fq_mul_assign 52,143 30,280 -21,863 -41.93% x 1.72
bls12_381::fq::bench_fq_negate 14 14 0 0.00% x 1.00
bls12_381::fq::bench_fq_repr_add_nocarry 8 10 2 25.00% x 0.80
bls12_381::fq::bench_fq_repr_div2 6 6 0 0.00% x 1.00
bls12_381::fq::bench_fq_repr_mul2 6 13 7 116.67% x 0.46
bls12_381::fq::bench_fq_repr_num_bits 4 4 0 0.00% x 1.00
bls12_381::fq::bench_fq_repr_sub_noborrow 9 11 2 22.22% x 0.82
bls12_381::fq::bench_fq_sqrt 61,857 38,010 -23,847 -38.55% x 1.63
bls12_381::fq::bench_fq_square 44,295 31,291 -13,004 -29.36% x 1.42
bls12_381::fq::bench_fq_sub_assign 11 10 -1 -9.09% x 1.10
bls12_381::fr::bench_fr_add_assign 7 7 0 0.00% x 1.00
bls12_381::fr::bench_fr_double 10,284 5,845 -4,439 -43.16% x 1.76
bls12_381::fr::bench_fr_from_repr 31 22 -9 -29.03% x 1.41
bls12_381::fr::bench_fr_into_repr 20 19 -1 -5.00% x 1.05
bls12_381::fr::bench_fr_inverse 7,600 5,440 -2,160 -28.42% x 1.40
bls12_381::fr::bench_fr_mul_assign 27,166 17,113 -10,053 -37.01% x 1.59
bls12_381::fr::bench_fr_negate 10 8 -2 -20.00% x 1.25
bls12_381::fr::bench_fr_repr_add_nocarry 6 7 1 16.67% x 0.86
bls12_381::fr::bench_fr_repr_div2 4 5 1 25.00% x 0.80
bls12_381::fr::bench_fr_repr_mul2 4 7 3 75.00% x 0.57
bls12_381::fr::bench_fr_repr_num_bits 4 3 -1 -25.00% x 1.33
bls12_381::fr::bench_fr_repr_sub_noborrow 7 8 1 14.29% x 0.88
bls12_381::fr::bench_fr_sqrt 27,321 19,871 -7,450 -27.27% x 1.37
bls12_381::fr::bench_fr_square 24,908 16,834 -8,074 -32.42% x 1.48
bls12_381::fr::bench_fr_sub_assign 8 7 -1 -12.50% x 1.14
bls12_381::pairing::pairing::bench_pairing_final_exponentiation 1,072,725 847,028 -225,697 -21.04% x 1.27
bls12_381::pairing::pairing::bench_pairing_full 1,871,067 1,472,014 -399,053 -21.33% x 1.27
bls12_381::pairing::pairing::bench_pairing_miller_loop 546,691 413,627 -133,064 -24.34% x 1.32
sw6::ec::g1::bench_g1_add_assign 4,185,546 4,048,507 -137,039 -3.27% x 1.03
sw6::ec::g1::bench_g1_add_assign_mixed 2,963,674 2,836,688 -126,986 -4.28% x 1.04
sw6::ec::g1::bench_g1_double 2,624,621 2,602,247 -22,374 -0.85% x 1.01
sw6::ec::g1::bench_g1_mul_assign 1,795,032 1,730,592 -64,440 -3.59% x 1.04
sw6::ec::g1::bench_g1_rand 1,798,136 1,731,877 -66,259 -3.68% x 1.04
sw6::ec::g2::bench_g2_add_assign 4,600 3,771 -829 -18.02% x 1.22
sw6::ec::g2::bench_g2_add_assign_mixed 3,206 2,663 -543 -16.94% x 1.20
sw6::ec::g2::bench_g2_double 2,140 1,863 -277 -12.94% x 1.15
sw6::ec::g2::bench_g2_mul_assign 1,098,494 935,481 -163,013 -14.84% x 1.17
sw6::ec::g2::bench_g2_rand 1,076,092 915,453 -160,639 -14.93% x 1.18
sw6::fq3::bench_fq3_add_assign 75 62 -13 -17.33% x 1.21
sw6::fq3::bench_fq3_double 50 55 5 10.00% x 0.91
sw6::fq3::bench_fq3_inverse 50,737 42,375 -8,362 -16.48% x 1.20
sw6::fq3::bench_fq3_mul_assign 2,121 2,046 -75 -3.54% x 1.04
sw6::fq3::bench_fq3_sqrt 3,321,726 3,201,614 -120,112 -3.62% x 1.04
sw6::fq3::bench_fq3_square 1,696 1,645 -51 -3.01% x 1.03
sw6::fq3::bench_fq3_sub_assign 67 66 -1 -1.49% x 1.02
sw6::fq6::bench_fq6_add_assign 144 114 -30 -20.83% x 1.26
sw6::fq6::bench_fq6_double 96 96 0 0.00% x 1.00
sw6::fq6::bench_fq6_inverse 58,863 50,288 -8,575 -14.57% x 1.17
sw6::fq6::bench_fq6_mul_assign 7,118 6,819 -299 -4.20% x 1.04
sw6::fq6::bench_fq6_square 5,248 4,976 -272 -5.18% x 1.05
sw6::fq6::bench_fq6_sub_assign 131 127 -4 -3.05% x 1.03
sw6::fq::bench_fq_add_assign 22 19 -3 -13.64% x 1.16
sw6::fq::bench_fq_double 16,932 17,126 194 1.15% x 0.99
sw6::fq::bench_fq_from_repr 245 248 3 1.22% x 0.99
sw6::fq::bench_fq_into_repr 138 132 -6 -4.35% x 1.05
sw6::fq::bench_fq_inverse 50,218 38,487 -11,731 -23.36% x 1.30
sw6::fq::bench_fq_mul_assign 242,531 240,596 -1,935 -0.80% x 1.01
sw6::fq::bench_fq_negate 24 22 -2 -8.33% x 1.09
sw6::fq::bench_fq_repr_add_nocarry 14 14 0 0.00% x 1.00
sw6::fq::bench_fq_repr_div2 15 7 -8 -53.33% x 2.14
sw6::fq::bench_fq_repr_mul2 9 13 4 44.44% x 0.69
sw6::fq::bench_fq_repr_num_bits 4 3 -1 -25.00% x 1.33
sw6::fq::bench_fq_repr_sub_noborrow 17 17 0 0.00% x 1.00
sw6::fq::bench_fq_sqrt 526,690 518,031 -8,659 -1.64% x 1.02
sw6::fq::bench_fq_square 208,952 210,092 1,140 0.55% x 0.99
sw6::fq::bench_fq_sub_assign 19 19 0 0.00% x 1.00
sw6::fr::bench_fr_add_assign 10 10 0 0.00% x 1.00
sw6::fr::bench_fr_double 9,174 9,065 -109 -1.19% x 1.01
sw6::fr::bench_fr_from_repr 58 38 -20 -34.48% x 1.53
sw6::fr::bench_fr_into_repr 32 31 -1 -3.12% x 1.03
sw6::fr::bench_fr_inverse 13,712 17,668 3,956 28.85% x 0.78
sw6::fr::bench_fr_mul_assign 50,664 30,005 -20,659 -40.78% x 1.69
sw6::fr::bench_fr_negate 13 14 1 7.69% x 0.93
sw6::fr::bench_fr_repr_add_nocarry 8 10 2 25.00% x 0.80
sw6::fr::bench_fr_repr_div2 6 6 0 0.00% x 1.00
sw6::fr::bench_fr_repr_mul2 5 13 8 160.00% x 0.38
sw6::fr::bench_fr_repr_num_bits 4 3 -1 -25.00% x 1.33
sw6::fr::bench_fr_repr_sub_noborrow 9 11 2 22.22% x 0.82
sw6::fr::bench_fr_sqrt 77,926 54,407 -23,519 -30.18% x 1.43
sw6::fr::bench_fr_square 42,764 30,561 -12,203 -28.54% x 1.40
sw6::fr::bench_fr_sub_assign 11 10 -1 -9.09% x 1.10
sw6::pairing::pairing::bench_pairing_final_exponentiation 8,881,217 8,317,250 -563,967 -6.35% x 1.07
sw6::pairing::pairing::bench_pairing_full 97,824,713 86,015,907 -11,808,806 -12.07% x 1.14
sw6::pairing::pairing::bench_pairing_miller_loop 88,591,942 77,475,292 -11,116,650 -12.55% x 1.14