Fix getJsonObject number normalization 0.0 <=> inf bug #1944

thirtiseven · 2024-04-10T09:49:07Z

Fixes #1923

Floating number normalization in getJsonObject kernel using cudf::strings::detail::stod to convert string to double, then convert back to string with ftos_converter.

For numbers with very large exp, like 9.299999257686047e-0005603333574677677 or 9.299999257686047e0005603333574677677, following code will overflow and produce wrong sign:

9.299999257686047e-0005603333574677677 should be 0.0, but cuDF gets Infinity
9.299999257686047e0005603333574677677 should be infinity, but cuDF gets 0.0

cuDF is using int to save exponent value.
For the above 1st double string, it will overflow from negative value to positive value.
For the above 2nd double string, it will overflow from positive value to negative value.
So cuDF gets totally wrong result.

  // check for exponent char
  int exp_ten  = 0;
  int exp_sign = 1;
  if (in_ptr < end) {
    char ch = *in_ptr++;
    if (ch == 'e' || ch == 'E') {
      if (in_ptr < end) {
        ch = *in_ptr;
        if (ch == '-' || ch == '+') {
          exp_sign = (ch == '-' ? -1 : 1);
          ++in_ptr;
        }
        while (in_ptr < end) {
          ch = *in_ptr++;
          if (ch < '0' || ch > '9') break;
          exp_ten = (exp_ten * 10) + (int)(ch - '0');
+          if (exp_ten >= 1e8) break;
        }
      }
    }
  }

Then with a wrong sign, it will turn to 0.0 or inf in a wrong way.

Adding if (exp_ten >= 1e8) break; will prevent it from overflowing, and since the exp here is big enough to be an inf, it doesn't matter what the real value is. I choose 1e8 here because it's the largest power of 10 that can be multiplied by 10 one more time without overflowing, so it's the largest possible limit here. The reason not to choose std::numeric_limits<double>::max_exponent10 (which is 308) here is that there is an exp_off, which could be negative, to calculate the final exp later.

I think it's a cudf bug, but let's hot fix it in a copy of stod in jni, to make getJsonObject happy first.

Will file a cudf issue soon.

Signed-off-by: Haoyang Li <[email protected]>

Signed-off-by: Chong Gao <[email protected]>

…0, V100)" This reverts commit ff2db7e.

Signed-off-by: Haoyang Li <[email protected]>

res-life · 2024-04-11T02:37:05Z

Add some cases in GetJsonObjectTest.java

res-life · 2024-04-11T03:43:46Z

There is only one line change:

+          if (exp_ten >= 1e8) break;

But the following code will adjust exp_ten again:

 exp_ten *= exp_sign;
  exp_ten += exp_off;
  exp_ten += num_digits - 1;

So if exp_off or num_digits have very large values, then the change is not equivalent to the original logic.

Fortunately, JSON parser defined the max digits for JSON numbers, refer to json_parser.cuh

constexpr int max_num_len = 1000;

So this change is OK.
Please add comment like:
Because of max digits for number in JSON is 1000 (consistent with Spark), so the adjustment will not impact result.
1000 = 10^3, max adjustment is 3, 10^8 - 3 > std::numeric_limits::max_exponent10 (which is 308)

res-life · 2024-04-11T05:21:40Z

src/main/cpp/src/string_to_float_cudf.cuh

+ *
+ * This function will also handle scientific notation format.
+ */
+__device__ inline double stod(cudf::string_view const& d_str)


Please describe this is copied from cuDF to do a temporary fix for branch-24.04.
Long term fix will be in cuDF.
The diff is:

+ if (exp_ten >= 1e8) break;

res-life · 2024-04-11T05:25:45Z

src/main/cpp/src/json_parser.cuh

@@ -1516,8 +1517,8 @@ class json_parser {
        // 200.000 => 200.0, 351.980 => 351.98, 12345678900000000000.0
        // => 1.23456789E19 0.0000000000003 => 3.0E-13; 0.003 => 0.003; 0.0003
        // => 3.0E-4 leverage function: `get_current_float_parts`
-        double d_value =
-          cudf::strings::detail::stod(cudf::string_view(current_token_start_pos, number_token_len));
+        double d_value = spark_rapids_jni::detail::stod(


BTW, please update the comments:

// number normalization: // 0.03E-2 => 0.3E-4; large value => infinity; small value => 0.0 // 200.000 => 200.0, 351.980 => 351.98, 12345678900000000000.0 // => 1.23456789E19 0.0000000000003 => 3.0E-13; 0.003 => 0.003; 0.0003 // => 3.0E-4

thirtiseven · 2024-04-11T05:46:27Z

following IT passed:

def json_string_to_float(x):
    if x == '"-Infinity"':
        return float('-inf')
    elif x == '"Infinity"':
        return float('inf')
    else:
        return float(x)

@pytest.mark.parametrize('data_gen', [StringGen('-?[1-9]\d{0,20}\.\d{1,20}', nullable=False),
                                      StringGen('-?[1-9]\d{0,5}\.\d{1,20}', nullable=False),
                                      StringGen('-?[1-9]\d{0,20}\.\d{1,5}', nullable=False),
                                      StringGen('-?[1-9]\d{0,5}\.\d{1,5}', nullable=False),
                                      StringGen('-?[1-9]\d{0,20}E-?\d{1,20}', nullable=False),
                                      StringGen('-?[1-9]\d{0,5}E-?\d{1,20}', nullable=False),
                                      StringGen('-?[1-9]\d{0,20}E-?\d{1,5}', nullable=False),
                                      StringGen('-?[1-9]\d{0,5}E-?\d{1,5}', nullable=False)], ids=idfn)
def test_get_json_object_floating_normalization(data_gen):
    schema = StructType([StructField("jsonStr", StringType())])
    gpu_res = [[row[1]] for row in with_gpu_session(
        lambda spark: unary_op_df(spark, data_gen, length=1000000).selectExpr(
            'a',
            'get_json_object(a,"$")'
            ).collect(),
        conf={'spark.rapids.sql.expression.GetJsonObject': 'true'})]
    cpu_res = [[row[1]] for row in with_cpu_session(
        lambda spark: unary_op_df(spark, data_gen, length=1000000).selectExpr(
            'a',
            'get_json_object(a,"$")'
            ).collect())]
    for i in range(len(gpu_res)):
        if not math.isclose(json_string_to_float(gpu_res[i][0]), json_string_to_float(cpu_res[i][0])):
            print("GPU: ", gpu_res[i][0], json_string_to_float(gpu_res[i][0]), " CPU: ", cpu_res[i][0], json_string_to_float(cpu_res[i][0]))
        assert math.isclose(json_string_to_float(gpu_res[i][0]), json_string_to_float(cpu_res[i][0]))

We will add above tests to 24.06 in a follow-up.

Signed-off-by: Haoyang Li <[email protected]>

thirtiseven · 2024-04-11T07:40:08Z

build

revans2

Just one more comment.

revans2 · 2024-04-11T14:21:15Z

src/main/cpp/src/string_to_float_cudf.cuh

+          ch = *in_ptr++;
+          if (ch < '0' || ch > '9') break;
+          exp_ten = (exp_ten * 10) + (int)(ch - '0');
+          if (exp_ten >= 1e8) break;


Can we have an explanation about why this is here? This is the only change between CUDF and this version so I would like to understand it better. Mainly because it has a magic number of 1e8 in it and I would really like to understand where that came from.

thirtiseven and others added 14 commits April 8, 2024 08:53

Move device code in get_json_object to cu or cuh (NVIDIA#1915)

43b5dd9

Signed-off-by: Haoyang Li <[email protected]>

Refactor/simplify json generator

6b8fa98

Signed-off-by: Chong Gao <[email protected]>

Remove purge non-empty nulls step

76bcd10

Refactor josn parser: remove useless variables

264f47a

Use 64 bits to store JSON nest depth context to save memory

b985891

Refine push/pop/peek logic

1d1268d

Refactor: use less functions; change max path length from 32 to 8

5e315f9

Fix nvbug: get-json-obj get incorrect result on some GPUs(H100, V100)

ff2db7e

Revert "Fix nvbug: get-json-obj get incorrect result on some GPUs(H10…

b7ccd71

…0, V100)" This reverts commit ff2db7e.

Fix a redefined variable

f02ed8a

Update MAX_PATH_DEPTH = 8 in Java to keep consistent

bf1f269

Fix test cases; Change max path depth from 8 to 16

9ee7618

Minor change: add pragma once in header

8bab267

Fix a bug in string_to_float and move it to jni temporarily

bc49cb6

Signed-off-by: Haoyang Li <[email protected]>

thirtiseven requested review from ttnghia and res-life April 10, 2024 09:49

thirtiseven self-assigned this Apr 10, 2024

Merge branch 'branch-24.04' into number_nuormalization_bug_fix

f18d6aa

res-life reviewed Apr 11, 2024

View reviewed changes

Address comments

907a31a

Signed-off-by: Haoyang Li <[email protected]>

res-life approved these changes Apr 11, 2024

View reviewed changes

thirtiseven merged commit abf48f3 into NVIDIA:branch-24.04 Apr 11, 2024
3 checks passed

thirtiseven deleted the number_nuormalization_bug_fix branch April 11, 2024 09:35

thirtiseven mentioned this pull request Apr 11, 2024

[BUG] String to Double return 0.0 for very large number and vice versa rapidsai/cudf#15508

Closed

revans2 mentioned this pull request Apr 11, 2024

[BUG] getJsonObject gets infinity instead of zero for a very small float number #1923

Closed

revans2 reviewed Apr 11, 2024

View reviewed changes

This was referenced Apr 12, 2024

[FEA] [follow-up] Revert the PR for number normalization 0.0 <=> inf bug and use cuDF fix #1963

Closed

[test] [follow-up] add IT to test number normalization for get-json-obj function NVIDIA/spark-rapids#10688

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix getJsonObject number normalization 0.0 <=> inf bug #1944

Fix getJsonObject number normalization 0.0 <=> inf bug #1944

thirtiseven commented Apr 10, 2024 •

edited by res-life

Loading

res-life commented Apr 11, 2024

res-life commented Apr 11, 2024 •

edited

Loading

res-life Apr 11, 2024 •

edited

Loading

thirtiseven Apr 11, 2024

res-life Apr 11, 2024 •

edited

Loading

thirtiseven Apr 11, 2024

thirtiseven commented Apr 11, 2024 •

edited

Loading

thirtiseven commented Apr 11, 2024

revans2 left a comment

revans2 Apr 11, 2024

Fix getJsonObject number normalization 0.0 <=> inf bug #1944

Fix getJsonObject number normalization 0.0 <=> inf bug #1944

Conversation

thirtiseven commented Apr 10, 2024 • edited by res-life Loading

res-life commented Apr 11, 2024

res-life commented Apr 11, 2024 • edited Loading

res-life Apr 11, 2024 • edited Loading

Choose a reason for hiding this comment

thirtiseven Apr 11, 2024

Choose a reason for hiding this comment

res-life Apr 11, 2024 • edited Loading

Choose a reason for hiding this comment

thirtiseven Apr 11, 2024

Choose a reason for hiding this comment

thirtiseven commented Apr 11, 2024 • edited Loading

thirtiseven commented Apr 11, 2024

revans2 left a comment

Choose a reason for hiding this comment

revans2 Apr 11, 2024

Choose a reason for hiding this comment

thirtiseven commented Apr 10, 2024 •

edited by res-life

Loading

res-life commented Apr 11, 2024 •

edited

Loading

res-life Apr 11, 2024 •

edited

Loading

res-life Apr 11, 2024 •

edited

Loading

thirtiseven commented Apr 11, 2024 •

edited

Loading