cudf-polars/pylibcudf string -> date parsing #16306

Conversation

@brandon-b-miller (Contributor) commented Jul 18, 2024

This PR adds datetime/timestamp parsing from string columns in pylibcudf and cudf-polars.

Closes #16174

@brandon-b-miller added the feature request, non-breaking, and cudf.polars labels on Jul 18, 2024
@github-actions bot added the Python, CMake, and pylibcudf labels on Jul 18, 2024
@brandon-b-miller (Contributor Author)

Polars seems to enforce some kind of handling for ambiguous datetimes through this API: we can either raise or replace them, but all options seem to rely on identifying whether they exist and where. This is problematic for us in the short term in the fully general case, because we'd likely have to write some careful logic and tests to do this properly.

Are there any conditions we can think of that preclude the possibility of this occurring, such as requiring that the format specify UTC for now?

@wence- (Contributor) commented Jul 23, 2024

> Polars seems to enforce some kind of handling for ambiguous datetimes through this API: we can either raise or replace them, but all options seem to rely on identifying whether they exist and where. This is problematic for us in the short term in the fully general case, because we'd likely have to write some careful logic and tests to do this properly.
>
> Are there any conditions we can think of that preclude the possibility of this occurring, such as requiring that the format specify UTC for now?

Paging @mroeschke, who has done a load of this in cudf-classic for pandas compatibility.

If we think the answer is "it's too complicated", let's just not support it at all for now and write it up properly as an issue.

@mroeschke (Contributor)

Both %z and %Z currently raise NotImplementedError in cudf classic.

As you alluded to, a %Z in the format matching UTC exactly is the only safe timezone localization without introspecting the data. Technically any string matching %z could be supported, since it should be treated as an "arbitrary UTC offset", but there's no way to represent that in the dtype today.

@brandon-b-miller (Contributor Author)

> Both %z and %Z currently raise NotImplementedError in cudf classic.
>
> As you alluded to, a %Z in the format matching UTC exactly is the only safe timezone localization without introspecting the data. Technically any string matching %z could be supported, since it should be treated as an "arbitrary UTC offset", but there's no way to represent that in the dtype today.

Thanks @mroeschke, please correct me if I am wrong, but I think this means we should at least be able to proceed with the case of %Z. My reasoning here is that if we assume the data is all normalized to UTC, no instant in time should be ambiguous, so any kind of "handling" should be a no-op.

@mroeschke (Contributor) commented Jul 23, 2024

> Thanks @mroeschke, please correct me if I am wrong, but I think this means we should at least be able to proceed with the case of %Z. My reasoning here is that if we assume the data is all normalized to UTC, no instant in time should be ambiguous, so any kind of "handling" should be a no-op.

> assume the data is all normalized to UTC

Internally within cudf we can assume this. From polars I'm not sure this can be assumed, especially if the data comes directly from user input, e.g.:

df = pl.DataFrame({"a": ["2024-03-10 02:30:00 US/Pacific"]})
df.select(pl.col("a").str.to_datetime(format="%Y-%m-%d %H:%M:%S %Z"))

should raise, since 2024-03-10 02:30:00 does not exist in US/Pacific (it falls in the spring-forward DST gap)

@brandon-b-miller (Contributor Author)

Just looking around for a possible combination of parameters we could support for this API. Starting with a column of strings that represent dates:

  • Off the bat we need to deal with the format parameter, which in polars may be None. I argue we cannot support the None case, because the polars API promises that the format is inferred from the data in that case, which is data introspection we won't do.

  • Next we need to deal with strict, which has two modes, both of which require identifying malformed timestamps; I think both are theoretically possible using cudf::is_timestamp to introspect:

    • if True, fail if any string fails to convert
    • if False, return null for strings that fail to convert (this is not specified in the docs and is my observation from experimenting locally)
  • cache seems like a CPU-only backend option that doesn't have a GPU equivalent

  • ambiguous, as we have discussed a bit above. For this one I think there is possibly a way forward, which comes from the fact that times in libcudf are implicitly all UTC, which has no transitions. This means I think the following scenario is allowable (a sketch follows after this list):

    • format is passed and we can state with certainty that it specifies UTC (the user tells us the data is UTC timestamps)
    • strict may be either True or False. Either way, we'll use is_timestamp with the format to identify the malformed strings, and either do a null replace if strict is False, or an any-of-some-kind reduction on the inverse of the is_timestamp mask to throw if strict is True
    • At this point we have either errored or we have a column of clean UTC timestamps/nulls. This should map cleanly onto the libcudf API, which would normally return undefined results if a malformed string is present, but won't have that issue at this point, because those places have been replaced with nulls, for which the libcudf API also returns nulls.
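
A rough sketch of that flow, assuming pylibcudf exposes is_timestamp/to_timestamps under plc.strings.convert.convert_datetime and that a bools_to_mask-style helper and Column.with_mask are available (bools_to_mask is not exposed at the time of writing; names here are illustrative, not the final implementation):

import pylibcudf as plc

def parse_utc_strings(col: plc.Column, fmt: str, strict: bool) -> plc.Column:
    # Which strings match the fixed, UTC-pinned format?
    valid = plc.strings.convert.convert_datetime.is_timestamp(col, fmt.encode())
    if strict:
        # strict=True: fail if any string does not match the format.
        all_valid = plc.reduce.reduce(
            valid, plc.aggregation.all(), plc.DataType(plc.TypeId.BOOL8)
        )
        if not plc.interop.to_arrow(all_valid).as_py():
            raise ValueError("could not parse all strings with the given format")
    else:
        # strict=False: null out non-matching strings before conversion so that
        # to_timestamps propagates nulls instead of producing undefined values.
        mask, null_count = plc.transform.bools_to_mask(valid)  # hypothetical helper
        col = col.with_mask(mask, null_count)
    return plc.strings.convert.convert_datetime.to_timestamps(
        col, plc.DataType(plc.TypeId.TIMESTAMP_NANOSECONDS), fmt.encode()
    )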

@wence- (Contributor) commented Jul 31, 2024

Thanks Brandon.

> Just looking around for a possible combination of parameters we could support for this API. Starting with a column of strings that represent dates:
>
> * Off the bat we need to deal with the `format` parameter, which in polars may be `None`. I argue we cannot support the `None` case, because the polars API promises that `the format is inferred from the data` in that case, which is data introspection we won't do.

Agreed, here we can raise early with a message that one should provide a format.

> * Next we need to deal with `strict`, which has two modes, both of which require identifying malformed timestamps; I think both are theoretically possible using `cudf::is_timestamp` to introspect:
>
>   * if `True`, fail if any string fails to convert
>   * if `False`, return `null` for strings that fail to convert (this is not specified in the docs and is my observation from experimenting locally)

Agreed that this is what happens. Can you open an issue/PR documenting the "null for unparseable strings" in the strict=False case?

I think this is fine because, like polars, we would raise at collection time if we fail with strict=True.

> * `cache` seems like a CPU-only backend option that doesn't have a GPU equivalent

Agreed.

> * `ambiguous`, as we have discussed a bit above. For this one I think there is possibly a way forward, which comes from the fact that times in libcudf are implicitly all UTC, which has no transitions. This means I think the following scenario is allowable:
>
>   * `format` is passed and we can state with certainty that it specifies UTC (the user tells us the data is UTC timestamps)

Is that true?

I think it is rather that the target dtype being parsed to has no timezone in it.

>   * `strict` may be either `True` or `False`. Either way, we'll use `is_timestamp` with the `format` to identify the malformed strings, and either do a null replace if strict is `False`, or an `any`-of-some-kind reduction on the inverse of the `is_timestamp` mask to throw if strict is `True`
>   * At this point we have either errored or we have a column of clean UTC timestamps/nulls. This should map cleanly onto the libcudf API, which would normally return _undefined_ results if a malformed string is present, but won't have that issue at this point, because those places have been replaced with nulls, for which the libcudf API also returns nulls.

I think this makes sense.

@brandon-b-miller (Contributor Author)

> Is that true?
>
> I think it is rather that the target dtype being parsed to has no timezone in it.

Hopefully this logic seems right: when a user passes a format with any timezone, I'd think any string that doesn't cleanly parse against that format and timezone should error or yield null. This is what polars seems to do on the CPU, at least.

>>> data = pl.DataFrame({'a':['2011-01-01T00:00:00Z-01:00']}).lazy()
>>> data.select(pl.col('a').str.strptime(format='%Y-%m-%dT%H:%M:%SZ', dtype=pl.Datetime('ns'), strict=False)).collect()
shape: (1, 1)
┌──────────────┐
│ a            │
│ ---          │
│ datetime[ns] │
╞══════════════╡
│ null         │
└──────────────┘

So if a user passes, say, '%Y-%m-%dT%H:%M:%SZ', then any data that doesn't fit that format should end up being converted to null, having not passed a theoretical is_timestamp(str, '%Y-%m-%dT%H:%M:%SZ') check. Any elements that haven't been nullified at this point must be properly formed UTC timestamps, and therefore aren't subject to ambiguity, so we shouldn't need handling for the ambiguous kwarg until we somehow accept a wider set of formats later on.
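
For illustration, a hedged sketch of that check using pylibcudf (is_timestamp as used in the diff below; the expected outcome reflects the reasoning above rather than verified output):

import pyarrow as pa
import pylibcudf as plc

col = plc.interop.from_arrow(
    pa.array(["2011-01-01T00:00:00Z", "2011-01-01T00:00:00Z-01:00"])
)
# Per the reasoning above, only the first string should satisfy the check, so
# the second would be nulled out (strict=False) or trigger an error (strict=True).
matches = plc.strings.convert.convert_datetime.is_timestamp(
    col, b"%Y-%m-%dT%H:%M:%SZ"
)
print(plc.interop.to_arrow(matches))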

Comment on lines 843 to 878
not_timestamps = plc.unary.unary_operation(
    plc.strings.convert.convert_datetime.is_timestamp(
        col.obj, format.encode()
    ),
    plc.unary.UnaryOperator.NOT,
)
Contributor:

We can avoid the negation, because what you want is that all(is_timestamp(col)).

So:

is_timestamp = is_timestamp(...)

if strict:
    all_timestamp = plc.reduce.reduce(is_timestamp, plc.aggregation.all(), bool)
    if not plc.interop.to_arrow(all_timestamp).as_py():
        raise ...
else:
    mask = plc.unary.unary_operation(is_timestamp, NOT)
    bitmask, null_count = transform.bools_to_mask(mask) # Need to expose this
    result = col.obj.with_mask(bitmask, null_count)
    to_timestamps(...)

Contributor:

Is this something we want to do?

Contributor Author:

I think it's worth updating to use a reduce so that we can avoid the negation in the strict case. I am a bit hesitant to implement bools_to_mask as part of this PR, purely for scope reasons: I think it would benefit us to take our time with a comprehensive set of tests for bools_to_mask, especially considering the way libcudf pads mask buffers in edge cases and so forth.

@beckernick (Member)

"No introspection" makes sense as a baseline policy, but curious how we handle this with the pandas interface? Do we violate that policy for the sake of UX?

E.g.,

import pandas as pd
import cudf

s = pd.Series(["2024-01-03"])
print(pd.to_datetime(s))
# 0   2024-01-03
# dtype: datetime64[ns]

s = cudf.Series(["2024-01-03"])
print(cudf.to_datetime(s))
# 0   2024-01-03
# dtype: datetime64[ns]

@mroeschke (Contributor)

> curious how we handle this with the pandas interface?

Looks like in cudf Python we get the first string element, try to infer the date format (using a private-ish pandas function), and then use strptime with that inferred date format on the entire column.
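
A minimal sketch of that inference step, assuming the private-ish pandas helper guess_datetime_format (illustrative of the approach, not a copy of the cudf implementation):

from pandas.core.tools.datetimes import guess_datetime_format

# Infer a strptime-style format from the first element, then apply that format
# to the whole column via the string-to-timestamp conversion path.
first_value = "2024-01-03"
fmt = guess_datetime_format(first_value)
print(fmt)  # %Y-%m-%d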

@brandon-b-miller changed the base branch from branch-24.08 to branch-24.10 on Aug 5, 2024
@github-actions bot added the libcudf, Java, and cudf.pandas labels on Aug 5, 2024
@brandon-b-miller removed the libcudf, CMake, Java, and cudf.pandas labels on Aug 5, 2024
@github-actions bot added the CMake label on Aug 10, 2024
@brandon-b-miller changed the title from "Pylibcudf polars date from str" to "cudf-polars/pylibcudf string -> date parsing" on Aug 16, 2024
@brandon-b-miller marked this pull request as ready for review on Aug 16, 2024
@brandon-b-miller requested a review from a team as a code owner on Aug 16, 2024
@brandon-b-miller requested reviews from vyasr and bdice and removed the request for a team on Aug 16, 2024
@bdice (Contributor) commented Aug 17, 2024

This probably needs to be updated since pylibcudf has been moved.

@wence- (Contributor) commented Aug 20, 2024

> This probably needs to be updated since pylibcudf has been moved.

Since this is going to the feature branch (which doesn't, and won't, have the pylibcudf migration), I think not

@@ -856,6 +867,26 @@ def do_evaluate(
else prefix.obj,
)
)
elif self.name == pl_expr.StringFunction.Strptime:
# TODO: ignores ambiguous
Contributor:

What is "ambiguous", and can we add a test?

Contributor Author:

polars strptime enforces some handling for ambiguous datetimes, i.e. local times that could map to two different points in UTC, usually due to an associated timezone, IIUC. Earlier in the general comments of this PR I was attempting to reason that certain passed datetime formats (specifically those that denote UTC) don't need to be checked for these, making handling them unnecessary as long as we only accept those formats. This would combine with only allowing strict=False (which nulls out any timestamps not in the right format) to produce the right result for the limited set of cases we allow, I think.
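
For illustration, a hedged example of the kind of ambiguity polars guards against (the local time below occurs twice in US/Pacific when clocks fall back, so polars needs an ambiguous policy to pick one of the two possible UTC instants; values are illustrative):

import polars as pl

s = pl.Series(["2024-11-03 01:30:00"])
# 'earliest' picks the first of the two possible UTC instants, 'latest' the second.
s.str.to_datetime(
    "%Y-%m-%d %H:%M:%S", time_zone="US/Pacific", ambiguous="earliest"
)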

python/cudf_polars/cudf_polars/dsl/expr.py (outdated review comment, resolved)
@wence- (Contributor) left a review:

Suggestion for the testing utility

python/cudf_polars/cudf_polars/testing/asserts.py (outdated review comment, resolved)
@wence- (Contributor) commented Aug 28, 2024

Can you apply:

diff --git a/python/cudf_polars/cudf_polars/testing/asserts.py b/python/cudf_polars/cudf_polars/testing/asserts.py
index bc8d1a86d6..a79d45899c 100644
--- a/python/cudf_polars/cudf_polars/testing/asserts.py
+++ b/python/cudf_polars/cudf_polars/testing/asserts.py
@@ -199,7 +199,11 @@ def assert_collect_raises(
     except polars_except:
         pass
     except Exception as e:
-        raise AssertionError(f"Polars CPU did not raise {polars_except}") from e
+        raise AssertionError(
+            f"CPU execution RAISED {type(e)}, EXPECTED {polars_except}"
+        ) from e
+    else:
+        raise AssertionError(f"CPU execution DID NOT RAISE {polars_except}")
 
     engine = GPUEngine(raise_on_fail=True)
     try:
@@ -207,4 +211,8 @@ def assert_collect_raises(
     except cudf_except:
         pass
     except Exception as e:
-        raise AssertionError(f"GPU did not raise {cudf_except}") from e
+        raise AssertionError(
+            f"GPU execution RAISED {type(e)}, EXPECTED {polars_except}"
+        ) from e
+    else:
+        raise AssertionError(f"GPU execution DID NOT RAISE {polars_except}")
diff --git a/python/cudf_polars/tests/testing/test_asserts.py b/python/cudf_polars/tests/testing/test_asserts.py
index 72954569ac..8e7f1a09d9 100644
--- a/python/cudf_polars/tests/testing/test_asserts.py
+++ b/python/cudf_polars/tests/testing/test_asserts.py
@@ -7,7 +7,10 @@ import pytest
 
 import polars as pl
 
+from cudf_polars.containers import DataFrame
+from cudf_polars.dsl.ir import Select
 from cudf_polars.testing.asserts import (
+    assert_collect_raises,
     assert_gpu_result_equal,
     assert_ir_translation_raises,
 )
@@ -33,3 +36,55 @@ def test_translation_assert_raises():
     with pytest.raises(AssertionError):
         # This should fail, because we can't translate this query, but it doesn't raise E.
         assert_ir_translation_raises(unsupported, E)
+
+
+def test_collect_assert_raises(monkeypatch):
+    df = pl.LazyFrame({"a": [1, 2, 3], "b": ["a", "b", "c"]})
+
+    with pytest.raises(AssertionError):
+        # This should raise, because polars CPU can run this query
+        assert_collect_raises(
+            df,
+            polars_except=pl.exceptions.InvalidOperationError,
+            cudf_except=pl.exceptions.InvalidOperationError,
+        )
+
+    # Here's an invalid query that gets caught at IR optimisation time.
+    q = df.select(pl.col("a") * pl.col("b"))
+
+    # This exception is raised in preprocessing, so is the same for
+    # both CPU and GPU engines.
+    assert_collect_raises(
+        q,
+        polars_except=pl.exceptions.InvalidOperationError,
+        cudf_except=pl.exceptions.InvalidOperationError,
+    )
+
+    with pytest.raises(AssertionError):
+        # This should raise because the expected GPU error is wrong
+        assert_collect_raises(
+            q,
+            polars_except=pl.exceptions.InvalidOperationError,
+            cudf_except=NotImplementedError,
+        )
+
+    with pytest.raises(AssertionError):
+        # This should raise because the expected CPU error is wrong
+        assert_collect_raises(
+            q,
+            polars_except=NotImplementedError,
+            cudf_except=pl.exceptions.InvalidOperationError,
+        )
+
+    with monkeypatch.context() as m:
+        m.setattr(Select, "evaluate", lambda self, cache: DataFrame([]))
+        # This query should fail, but we monkeypatch a bad
+        # implementation of Select which "succeeds" to check that our
+        # assertion notices this case.
+        q = df.select(pl.col("a") + pl.Series([1, 2]))
+        with pytest.raises(AssertionError):
+            assert_collect_raises(
+                q,
+                polars_except=pl.exceptions.ComputeError,
+                cudf_except=pl.exceptions.ComputeError,
+            )

To add coverage for the new utility (and fix a logic bug)

@wence- merged commit 0bf68d4 into rapidsai:feature/cudf-polars on Aug 28, 2024
80 of 81 checks passed
Labels
CMake (CMake build issue), cudf.polars (Issues specific to cudf.polars), feature request (New feature or request), non-breaking (Non-breaking change), pylibcudf (Issues specific to the pylibcudf package), Python (Affects Python cuDF API)
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

[BUG] Support string to datetime conversion in Polars engine without explicit format
5 participants