Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix getJsonObject number normalization 0.0 <=> inf bug #1944

Merged

Conversation

thirtiseven
Copy link
Collaborator

@thirtiseven thirtiseven commented Apr 10, 2024

Fixes #1923

Floating number normalization in getJsonObject kernel using cudf::strings::detail::stod to convert string to double, then convert back to string with ftos_converter.

For numbers with very large exp, like 9.299999257686047e-0005603333574677677 or 9.299999257686047e0005603333574677677, following code will overflow and produce wrong sign:

9.299999257686047e-0005603333574677677 should be 0.0, but cuDF gets Infinity
9.299999257686047e0005603333574677677 should be infinity, but cuDF gets 0.0

cuDF is using int to save exponent value.
For the above 1st double string, it will overflow from negative value to positive value.
For the above 2nd double string, it will overflow from positive value to negative value.
So cuDF gets totally wrong result.

  // check for exponent char
  int exp_ten  = 0;
  int exp_sign = 1;
  if (in_ptr < end) {
    char ch = *in_ptr++;
    if (ch == 'e' || ch == 'E') {
      if (in_ptr < end) {
        ch = *in_ptr;
        if (ch == '-' || ch == '+') {
          exp_sign = (ch == '-' ? -1 : 1);
          ++in_ptr;
        }
        while (in_ptr < end) {
          ch = *in_ptr++;
          if (ch < '0' || ch > '9') break;
          exp_ten = (exp_ten * 10) + (int)(ch - '0');
+          if (exp_ten >= 1e8) break;
        }
      }
    }
  }

Then with a wrong sign, it will turn to 0.0 or inf in a wrong way.

Adding if (exp_ten >= 1e8) break; will prevent it from overflowing, and since the exp here is big enough to be an inf, it doesn't matter what the real value is. I choose 1e8 here because it's the largest power of 10 that can be multiplied by 10 one more time without overflowing, so it's the largest possible limit here. The reason not to choose std::numeric_limits<double>::max_exponent10 (which is 308) here is that there is an exp_off, which could be negative, to calculate the final exp later.

I think it's a cudf bug, but let's hot fix it in a copy of stod in jni, to make getJsonObject happy first.

Will file a cudf issue soon.

@thirtiseven thirtiseven self-assigned this Apr 10, 2024
@res-life
Copy link
Collaborator

Add some cases in GetJsonObjectTest.java

@res-life
Copy link
Collaborator

res-life commented Apr 11, 2024

There is only one line change:

+          if (exp_ten >= 1e8) break;

But the following code will adjust exp_ten again:

 exp_ten *= exp_sign;
  exp_ten += exp_off;
  exp_ten += num_digits - 1;

So if exp_off or num_digits have very large values, then the change is not equivalent to the original logic.

Fortunately, JSON parser defined the max digits for JSON numbers, refer to json_parser.cuh

constexpr int max_num_len = 1000;

So this change is OK.
Please add comment like:
Because of max digits for number in JSON is 1000 (consistent with Spark), so the adjustment will not impact result.
1000 = 10^3, max adjustment is 3, 10^8 - 3 > std::numeric_limits::max_exponent10 (which is 308)

*
* This function will also handle scientific notation format.
*/
__device__ inline double stod(cudf::string_view const& d_str)
Copy link
Collaborator

@res-life res-life Apr 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please describe this is copied from cuDF to do a temporary fix for branch-24.04.
Long term fix will be in cuDF.
The diff is:

+ if (exp_ten >= 1e8) break;

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

@@ -1516,8 +1517,8 @@ class json_parser {
// 200.000 => 200.0, 351.980 => 351.98, 12345678900000000000.0
// => 1.23456789E19 0.0000000000003 => 3.0E-13; 0.003 => 0.003; 0.0003
// => 3.0E-4 leverage function: `get_current_float_parts`
double d_value =
cudf::strings::detail::stod(cudf::string_view(current_token_start_pos, number_token_len));
double d_value = spark_rapids_jni::detail::stod(
Copy link
Collaborator

@res-life res-life Apr 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, please update the comments:

        // number normalization:
        // 0.03E-2 => 0.3E-4; large value => infinity; small value => 0.0
        // 200.000 => 200.0, 351.980 => 351.98, 12345678900000000000.0
        // => 1.23456789E19 0.0000000000003 => 3.0E-13; 0.003 => 0.003; 0.0003
        // => 3.0E-4

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

@thirtiseven
Copy link
Collaborator Author

thirtiseven commented Apr 11, 2024

following IT passed:

def json_string_to_float(x):
    if x == '"-Infinity"':
        return float('-inf')
    elif x == '"Infinity"':
        return float('inf')
    else:
        return float(x)

@pytest.mark.parametrize('data_gen', [StringGen('-?[1-9]\d{0,20}\.\d{1,20}', nullable=False),
                                      StringGen('-?[1-9]\d{0,5}\.\d{1,20}', nullable=False),
                                      StringGen('-?[1-9]\d{0,20}\.\d{1,5}', nullable=False),
                                      StringGen('-?[1-9]\d{0,5}\.\d{1,5}', nullable=False),
                                      StringGen('-?[1-9]\d{0,20}E-?\d{1,20}', nullable=False),
                                      StringGen('-?[1-9]\d{0,5}E-?\d{1,20}', nullable=False),
                                      StringGen('-?[1-9]\d{0,20}E-?\d{1,5}', nullable=False),
                                      StringGen('-?[1-9]\d{0,5}E-?\d{1,5}', nullable=False)], ids=idfn)
def test_get_json_object_floating_normalization(data_gen):
    schema = StructType([StructField("jsonStr", StringType())])
    gpu_res = [[row[1]] for row in with_gpu_session(
        lambda spark: unary_op_df(spark, data_gen, length=1000000).selectExpr(
            'a',
            'get_json_object(a,"$")'
            ).collect(),
        conf={'spark.rapids.sql.expression.GetJsonObject': 'true'})]
    cpu_res = [[row[1]] for row in with_cpu_session(
        lambda spark: unary_op_df(spark, data_gen, length=1000000).selectExpr(
            'a',
            'get_json_object(a,"$")'
            ).collect())]
    for i in range(len(gpu_res)):
        if not math.isclose(json_string_to_float(gpu_res[i][0]), json_string_to_float(cpu_res[i][0])):
            print("GPU: ", gpu_res[i][0], json_string_to_float(gpu_res[i][0]), " CPU: ", cpu_res[i][0], json_string_to_float(cpu_res[i][0]))
        assert math.isclose(json_string_to_float(gpu_res[i][0]), json_string_to_float(cpu_res[i][0]))

We will add above tests to 24.06 in a follow-up.

Signed-off-by: Haoyang Li <[email protected]>
@thirtiseven
Copy link
Collaborator Author

build

@thirtiseven thirtiseven merged commit abf48f3 into NVIDIA:branch-24.04 Apr 11, 2024
3 checks passed
@thirtiseven thirtiseven deleted the number_nuormalization_bug_fix branch April 11, 2024 09:35
Copy link
Collaborator

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one more comment.

ch = *in_ptr++;
if (ch < '0' || ch > '9') break;
exp_ten = (exp_ten * 10) + (int)(ch - '0');
if (exp_ten >= 1e8) break;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have an explanation about why this is here? This is the only change between CUDF and this version so I would like to understand it better. Mainly because it has a magic number of 1e8 in it and I would really like to understand where that came from.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] getJsonObject gets infinity instead of zero for a very small float number
3 participants