Update Moderate Path Float Parsing Algorithms #775

Open

Alexhuszagh opened this issue May 10, 2021 · 1 comment

There have been a few updates to the moderate path float parsing algorithms in minimal-lexical, which can either provide performance benefits or reduce the amount of static storage required, depending on your use-case. I'll summarize a list of plausible options below; if any seem beneficial to the maintainers of serde-json, I'll be happy to submit a PR.

Quick Background

Just for a quick summary: the float parsing algorithm is broken into three parts:

  • A fast path algorithm, used when the significant digits can be represented exactly in a native float without truncation.
  • A moderate path algorithm, which handles all floats except near-halfway cases using an extended representation of a float.
  • A slow path algorithm, which discerns the proper way to round near-halfway floats using arbitrary-precision arithmetic.

The moderate path is ~66-75% faster than the slow path, so improvements to it, whether in performance or in the share of cases it resolves correctly, can lead to dramatic performance gains.
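
To make the tiers concrete, here is a minimal, self-contained sketch of a Clinger-style fast path for f64. It mirrors the standard technique rather than serde_json's exact internals; fast_path and its bounds are illustrative.

// Exact powers of ten representable in an f64 (10^0 through 10^22).
const POW10: [f64; 23] = [
    1e0, 1e1, 1e2, 1e3, 1e4, 1e5, 1e6, 1e7, 1e8, 1e9, 1e10, 1e11,
    1e12, 1e13, 1e14, 1e15, 1e16, 1e17, 1e18, 1e19, 1e20, 1e21, 1e22,
];

// Fast path: if the digits fit in 53 bits and the power of ten is
// exactly representable, a single multiply or divide rounds correctly.
fn fast_path(mantissa: u64, exp10: i32) -> Option<f64> {
    if mantissa >> 53 != 0 || !(-22..=22).contains(&exp10) {
        return None; // not exact: fall through to the moderate path
    }
    let m = mantissa as f64; // exact conversion: fits in 53 bits
    Some(if exp10 < 0 {
        m / POW10[(-exp10) as usize] // exact divisor, one rounding
    } else {
        m * POW10[exp10 as usize] // exact factors, one rounding
    })
}

fn main() {
    assert_eq!(fast_path(12345, -2), Some(123.45));
    assert_eq!(fast_path(1 << 54, 0), None); // needs more than 53 bits
}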

Interpolate the Cached Power Exponent

Serde uses pre-computed values for the cached float exponents in cached_float80. However, we can interpolate all of these exponents, since each exponent is effectively just ⌊ log2(10) * exp10 ⌋. Using a pre-computed, integral approximation of log2(10), we can calculate the exponent exactly from the index into the cached powers.
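
As a sanity check on this formula, the interpolated exponents reproduce the pre-computed table values exactly. A minimal sketch, with binary_exponent as an illustrative (hypothetical) name:

const LOG2: i64 = 217706; // ceil(log2(10) * 2^16)
const LOG2_SHIFT: i32 = 16;

// Binary exponent of the normalized cached power for 10^exp10: the
// table stores mantissas m in [2^63, 2^64) with 10^exp10 ≈ m * 2^exp,
// hence the -63 bias.
fn binary_exponent(exp10: i64) -> i32 {
    (-63 + ((LOG2 * exp10) >> LOG2_SHIFT)) as i32
}

fn main() {
    // (217706 * 20) >> 16 == 66, and -63 + 66 == 3, matching the
    // pre-computed BASE10_LARGE_EXPONENT entry for 10^20 below.
    assert_eq!(binary_exponent(20), 3);
    // The smallest cached power, 10^-350, likewise yields -1226.
    assert_eq!(binary_exponent(-350), -1226);
}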

The following Python code can be used to generate this magic number, and to verify that it produces the correct result over the entire range of valid exponents:

import math

def get_range(radix, max_exp, bitshift):
    # Scaled integral approximation: num / den >= log2(radix).
    den = 1 << bitshift
    num = int(math.ceil(math.log2(radix) * den))
    # Check every valid exponent, negative and positive, against the
    # exact value, computed with integer arithmetic to avoid float
    # error and underflow.
    for exp in range(-max_exp, max_exp + 1):
        if exp >= 0:
            exp2_exact = (radix**exp).bit_length() - 1
        else:
            exp2_exact = -((radix**-exp).bit_length())
        exp2_guess = num * exp // den
        if exp2_exact != exp2_guess:
            raise ValueError(f'{exp}')
    return num, bitshift

>>> get_range(10, 350, 16)
(217706, 16)

See the appendix for the full changes required to implement this.

Pros:

  • Less storage required.
  • No discernible impact on runtime performance.

Cons:

  • N/A

Correctness Concerns:

  • N/A; it can be proven that the generated exponents are identical to the pre-computed values for all valid inputs.

Add the Eisel-Lemire Algorithm

A fast algorithm for creating correct representations of floats from an extended 128-bit (or 192-bit) representation has been developed and is currently in use in major Google projects like Golang and Wuffs, among others. The Eisel-Lemire algorithm is ~15% faster than the extended-float algorithm for a uniform distribution of randomly-generated floats over the entire float range, and it catches a different set of near-halfway cases than the existing extended-float algorithm.

A few example cases:

  • "9007199254740992.0" (or 1<<53): correctly classified by both.
  • "9007199254740992000.0e-3" (or 1<<53): correctly classified by extended-float only.
  • "9007199254740993.0" (or 1 + (1<<53)): classified by neither, since it is an exact halfway case.
  • "9007199254740994.0" (or 2 + (1<<53)): correctly classified by both.
  • "9007199254740995.0" (or 3 + (1<<53)): correctly classified by Eisel-Lemire only.
  • "9007199254740996.0" (or 4 + (1<<53)): correctly classified by both.
  • "2.470328229206232720e-324" (near-halfway subnormal float): correctly classified by extended-float only.
  • "8.988465674311580536e307" (large near-halfway float): correctly classified by Eisel-Lemire only.

In short, the two algorithms have complementary coverage, and combined they can almost always avoid delegating to the slow path algorithm, leading to major performance benefits. See minimal-lexical/lemire.rs for an example implementation of this algorithm. The general approach, therefore, is to run Eisel-Lemire first and, if it fails, delegate to the extended-float algorithm, as sketched below.
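
A minimal sketch of that dispatch, with placeholder signatures standing in for the real implementations; each moderate-path algorithm is assumed to return None when the result is too close to a halfway point to guarantee correct rounding.

// Moderate-path dispatch: try Eisel-Lemire first, fall back to the
// extended-float algorithm, and only then pay for the slow path.
fn moderate_path(mantissa: u64, exp10: i32) -> f64 {
    eisel_lemire(mantissa, exp10)
        .or_else(|| extended_float(mantissa, exp10))
        .unwrap_or_else(|| slow_path(mantissa, exp10))
}

// Placeholders: see minimal-lexical/lemire.rs for a real implementation.
fn eisel_lemire(_mantissa: u64, _exp10: i32) -> Option<f64> {
    unimplemented!()
}
fn extended_float(_mantissa: u64, _exp10: i32) -> Option<f64> {
    unimplemented!()
}
fn slow_path(_mantissa: u64, _exp10: i32) -> f64 {
    unimplemented!()
}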

Pros:

  • Slightly faster performance than extended-float in some cases.
  • Can be combined with extended-float to minimize delegating to the slow path.
  • The pre-computed powers for Eisel-Lemire can also be reused for extended-float, leading to minor performance improvements.

Cons:

  • Increased storage required (requires an additional 1226 u64s, or ~9.8 KB).

Correctness Concerns:

  • Substantial, but the algorithm is well-established and passes all correctness tests.
  • It passes a curated suite of halfway cases, the large test suite used to validate Go's parser, and Rust's extensive randomly-generated test-cases.

Replace Extended-Float with Lemire

A third option is to entirely remove the extended-float algorithm and replace it with the Eisel-Lemire algorithm. In order to do so, when Eisel-Lemire fails we need to round down to b (the float just below the exact value; b+u is the next float up, and b+h is the halfway point between them) so the slow path algorithm can correctly differentiate between b, b+h, and b+u. Extensive comments and code samples are included in lexical-core/lemire.rs for how to implement this; the final rounding step is sketched below.
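
A minimal sketch of that final rounding step, assuming the slow path's big-integer comparison of the exact decimal against b + h has already produced an Ordering; round_nearest and next_up are illustrative names, not lexical's API.

use std::cmp::Ordering;

// Given the rounded-down float b and the comparison of the exact
// decimal value against the halfway point b + h, pick b or b + u.
fn round_nearest(b: f64, cmp_to_halfway: Ordering) -> f64 {
    match cmp_to_halfway {
        Ordering::Less => b,             // below halfway: keep b
        Ordering::Greater => next_up(b), // above halfway: take b + u
        Ordering::Equal => {
            // Exactly halfway: ties round to the even mantissa.
            if b.to_bits() & 1 == 0 { b } else { next_up(b) }
        }
    }
}

// Next representable f64 above a positive, finite b.
fn next_up(b: f64) -> f64 {
    f64::from_bits(b.to_bits() + 1)
}

fn main() {
    // 2^53 + 1 is exactly halfway; ties-to-even keeps 2^53 (even mantissa).
    let b = 9007199254740992.0_f64; // 1 << 53
    assert_eq!(round_nearest(b, Ordering::Equal), b);
}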

Pros:

  • Slightly faster performance than extended-float in some cases.

Cons:

  • Increased storage required (requires an additional 1226 u64s, or ~9.8 KB).
  • Less correct than extended-float, and therefore delegates to the slow path algorithm more often.

Correctness Concerns:

  • Substantial, but the algorithm is well-established and passes all correctness tests.
  • It passes a curated suite of halfway cases, the large test suite used to validate Go's parser, and Rust's extensive randomly-generated test-cases.

Appendix

Interpolation

The full changes to interpolate the exponent are the following:

diff --git a/src/lexical/cached.rs b/src/lexical/cached.rs
index ef5a9fe..701a897 100644
--- a/src/lexical/cached.rs
+++ b/src/lexical/cached.rs
@@ -5,31 +5,8 @@
 use super::cached_float80;
 use super::float::ExtendedFloat;
 
-// POWERS
-
-/// Precalculated powers that uses two-separate arrays for memory-efficiency.
-#[doc(hidden)]
-pub(crate) struct ExtendedFloatArray {
-    // Pre-calculated mantissa for the powers.
-    pub mant: &'static [u64],
-    // Pre-calculated binary exponents for the powers.
-    pub exp: &'static [i32],
-}
-
-/// Allow indexing of values without bounds checking
-impl ExtendedFloatArray {
-    #[inline]
-    pub fn get_extended_float(&self, index: usize) -> ExtendedFloat {
-        let mant = self.mant[index];
-        let exp = self.exp[index];
-        ExtendedFloat { mant, exp }
-    }
-
-    #[inline]
-    pub fn len(&self) -> usize {
-        self.mant.len()
-    }
-}
+const LOG2: i64 = 217706;
+const LOG2_SHIFT: i32 = 16;
 
 // MODERATE PATH POWERS
 
@@ -37,9 +14,9 @@ impl ExtendedFloatArray {
 #[doc(hidden)]
 pub(crate) struct ModeratePathPowers {
     // Pre-calculated small powers.
-    pub small: ExtendedFloatArray,
+    pub small: &'static [u64],
     // Pre-calculated large powers.
-    pub large: ExtendedFloatArray,
+    pub large: &'static [u64],
     /// Pre-calculated small powers as 64-bit integers
     pub small_int: &'static [u64],
     // Step between large powers and number of small powers.
@@ -52,17 +29,23 @@ pub(crate) struct ModeratePathPowers {
 impl ModeratePathPowers {
     #[inline]
     pub fn get_small(&self, index: usize) -> ExtendedFloat {
-        self.small.get_extended_float(index)
+        let mant = self.small[index];
+        let exp = -63 + ((LOG2 * index as i64) >> LOG2_SHIFT);
+        ExtendedFloat {
+            mant,
+            exp: exp as i32,
+        }
     }
 
     #[inline]
     pub fn get_large(&self, index: usize) -> ExtendedFloat {
-        self.large.get_extended_float(index)
-    }
-
-    #[inline]
-    pub fn get_small_int(&self, index: usize) -> u64 {
-        self.small_int[index]
+        let mant = self.large[index];
+        let biased_e = index as i64 * self.step as i64 - self.bias as i64;
+        let exp = -63 + ((LOG2 * biased_e) >> LOG2_SHIFT);
+        ExtendedFloat {
+            mant,
+            exp: exp as i32,
+        }
     }
 }
 
diff --git a/src/lexical/cached_float80.rs b/src/lexical/cached_float80.rs
index 9beda3d..43e18e8 100644
--- a/src/lexical/cached_float80.rs
+++ b/src/lexical/cached_float80.rs
@@ -10,7 +10,7 @@
 //! integer to calculate exact extended-representation of each value.
 //! These values are all normalized.
 
-use super::cached::{ExtendedFloatArray, ModeratePathPowers};
+use super::cached::ModeratePathPowers;
 
 // LOW-LEVEL
 // ---------
@@ -29,18 +29,6 @@ const BASE10_SMALL_MANTISSA: [u64; 10] = [
     13743895347200000000, // 10^8
     17179869184000000000, // 10^9
 ];
-const BASE10_SMALL_EXPONENT: [i32; 10] = [
-    -63, // 10^0
-    -60, // 10^1
-    -57, // 10^2
-    -54, // 10^3
-    -50, // 10^4
-    -47, // 10^5
-    -44, // 10^6
-    -40, // 10^7
-    -37, // 10^8
-    -34, // 10^9
-];
 const BASE10_LARGE_MANTISSA: [u64; 66] = [
     11555125961253852697, // 10^-350
     13451937075301367670, // 10^-340
@@ -109,74 +97,6 @@ const BASE10_LARGE_MANTISSA: [u64; 66] = [
     11830521861667747109, // 10^290
     13772540099066387756, // 10^300
 ];
-const BASE10_LARGE_EXPONENT: [i32; 66] = [
-    -1226, // 10^-350
-    -1193, // 10^-340
-    -1160, // 10^-330
-    -1127, // 10^-320
-    -1093, // 10^-310
-    -1060, // 10^-300
-    -1027, // 10^-290
-    -994,  // 10^-280
-    -960,  // 10^-270
-    -927,  // 10^-260
-    -894,  // 10^-250
-    -861,  // 10^-240
-    -828,  // 10^-230
-    -794,  // 10^-220
-    -761,  // 10^-210
-    -728,  // 10^-200
-    -695,  // 10^-190
-    -661,  // 10^-180
-    -628,  // 10^-170
-    -595,  // 10^-160
-    -562,  // 10^-150
-    -529,  // 10^-140
-    -495,  // 10^-130
-    -462,  // 10^-120
-    -429,  // 10^-110
-    -396,  // 10^-100
-    -362,  // 10^-90
-    -329,  // 10^-80
-    -296,  // 10^-70
-    -263,  // 10^-60
-    -230,  // 10^-50
-    -196,  // 10^-40
-    -163,  // 10^-30
-    -130,  // 10^-20
-    -97,   // 10^-10
-    -63,   // 10^0
-    -30,   // 10^10
-    3,     // 10^20
-    36,    // 10^30
-    69,    // 10^40
-    103,   // 10^50
-    136,   // 10^60
-    169,   // 10^70
-    202,   // 10^80
-    235,   // 10^90
-    269,   // 10^100
-    302,   // 10^110
-    335,   // 10^120
-    368,   // 10^130
-    402,   // 10^140
-    435,   // 10^150
-    468,   // 10^160
-    501,   // 10^170
-    534,   // 10^180
-    568,   // 10^190
-    601,   // 10^200
-    634,   // 10^210
-    667,   // 10^220
-    701,   // 10^230
-    734,   // 10^240
-    767,   // 10^250
-    800,   // 10^260
-    833,   // 10^270
-    867,   // 10^280
-    900,   // 10^290
-    933,   // 10^300
-];
 const BASE10_SMALL_INT_POWERS: [u64; 10] = [
     1, 10, 100, 1000, 10000, 100000, 1000000, 10000000, 100000000, 1000000000,
 ];
@@ -187,14 +107,8 @@ const BASE10_BIAS: i32 = 350;
 // ----------
 
 const BASE10_POWERS: ModeratePathPowers = ModeratePathPowers {
-    small: ExtendedFloatArray {
-        mant: &BASE10_SMALL_MANTISSA,
-        exp: &BASE10_SMALL_EXPONENT,
-    },
-    large: ExtendedFloatArray {
-        mant: &BASE10_LARGE_MANTISSA,
-        exp: &BASE10_LARGE_EXPONENT,
-    },
+    small: &BASE10_SMALL_MANTISSA,
+    large: &BASE10_LARGE_MANTISSA,
     small_int: &BASE10_SMALL_INT_POWERS,
     step: BASE10_STEP,
     bias: BASE10_BIAS,
dtolnay commented May 23, 2021

Thanks for the update!

Right now I don't have a great sense of what the people relying on this would prefer to optimize for (whether it's performance, binary size, etc.). Absent feedback to the effect that the current implementation is not serving people's needs, I would prefer to mostly stick with the current code.

My own priorities for the float parser in order are:

  1. Correctness
  2. Reasonable performance envelope (say, within 2–5× of "optimal")
  3. Source code simplicity / auditability
  4. Smaller compile time / binary size (there is some complicated tradeoff frontier between these two things)

The exponent interpolation change sounds well aligned with this, whereas the Eisel–Lemire change is probably not impactful for us.

I think the right workflow might be to plan to pull in batches of changes from minimal-lexical on a cadence of something like 18 months or 24 months (other than correctness fixes, which we would take immediately).
