Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Series.strptime/2 #626

Merged
merged 14 commits into from
Jun 26, 2023
Merged

Conversation

anthony-khong
Copy link
Contributor

Closes #625

I would be happy to work on strftime after this!

Copy link
Member

@josevalim josevalim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One comment and we can ship it. <3

@anthony-khong
Copy link
Contributor Author

Hi @josevalim, thank you so much for reviewing the PR quickly. I've added a few changes, so if it's okay, could you please re-review the PR?

Your comment made me wonder how exactly the formats differ from Calendar.strftime's format. I spent more time studying Polars' codebase, so I think I understand Polars' strptime a bit better. Here's a summary of the changes since you last saw the PR:

  • It turns out that we can avoid iterating in series.rs by using as_datetime (instead of as_datetime_not_exact. However, it looks like there's a slight change in the function signature between Polars' current main branch and v0.29 - the utc parameter is gone. This implementation probably needs to be changed when Explorer upgrades to v0.30.
  • Add tests to confirm that it's compatible to Chrono's format. In particular, verify formats not covered by Calendar.strftime here.
  • I also noticed that Polars can do format inference, so I added a nil default for the format_string argument and a test to use the inferred format.

I hope the additions are helpful, but if not, I'm happy to undo some of the changes!

@@ -23,6 +23,7 @@ defmodule Explorer.Backend.Series do
@callback to_iovec(s) :: [binary()]
@callback cast(s, dtype) :: s
@callback categorise(s, s) :: s
@callback strptime(s, String.t() | nil) :: s
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want to cast a string to datetime (i.e. nil), you call cast(:datetime). So I think we can remove nil and we document that cast(:date) and cast(:datetime) are also options. :)

@anthony-khong
Copy link
Contributor Author

Hi @josevalim, thank you again for the feedback. That makes total sense. I've removed the possibility of nil format.

Comment on lines 795 to 797
Converts a string series to a datetime series.

For the format string specification, refer to the [chrono crate documentation](https://docs.rs/chrono/latest/chrono/format/strftime/index.html).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Converts a string series to a datetime series.
For the format string specification, refer to the [chrono crate documentation](https://docs.rs/chrono/latest/chrono/format/strftime/index.html).
Converts a string series to a datetime series with a given `format_string`.
For the format string specification, refer to the
[chrono crate documentation](https://docs.rs/chrono/latest/chrono/format/strftime/).
Use `cast(series, :datetime)` if you prefer the format to be inferred (if possible).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've incorporated the suggestion. Thank you!

Copy link
Member

@josevalim josevalim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM but I will let @philss merge with another pass. :)

Copy link
Member

@philss philss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! :shipit:

Maybe we could add support for the :category dtype in the future. But for now it LGTM.

@philss philss merged commit d0932d2 into elixir-explorer:main Jun 26, 2023
4 checks passed
@philss
Copy link
Member

philss commented Jun 26, 2023

@anthony-khong thank you! 💜

@anthony-khong
Copy link
Contributor Author

Thanks for the approval @philss!

Maybe we could add support for the :category dtype in the future. But for now it LGTM.

I'm trying to take a stab at this. Reading the Polars source code, I don't see a direct way to convert the categorical type to datetime. Should we do:

  1. Cast the categorical series to Utf8 first, then do strptime; or
  2. Downcast to CategoricalChunked, do iter_str( ... ) and try to emulate as_datetime?

I think option 1 will guarantee a more consistent behaviour as as_datetime is quite complex, but I'm not sure if there's a downside to casting.

@anthony-khong anthony-khong deleted the parse-datetime branch June 26, 2023 17:03
@philss
Copy link
Member

philss commented Jun 26, 2023

Hey @anthony-khong

I think the option 1. is easier to implement, and could be our first version. But I think we could do something in the lines of option 2., but instead of inter_str(...), we could convert the CategoricalChunked into a series - there is a into_series() method -, apply the transformation, and then work with the indexes of both the logical series of the categorical series, and the indexes of this new series to create the final series. It's a similar dance that we do in the encoding of categorical series to a list, in the encoding file, but simplified:

https://github.com/elixir-nx/explorer/blob/6671ae3cc323d4c1bac2a944cbb991284ad86446/native/explorer/src/encoding.rs#L349

I didn't test though. It may be slower then doing 1., but I think it is more memory efficient. WDYT?

@josevalim
Copy link
Member

If you have strings and they are datetimes, then you want to convert them to datetimes before you convert them to categories. So I am fine with not supporting it.

@anthony-khong
Copy link
Contributor Author

@philss, yes that makes a lot of sense - definitely more memory efficient. That's really cool, thanks for showing me that!

My concern is applying this transformation - it looks quite complex and there are changes between 0.29 and 0.30.

In any case, in light of José's comment, perhaps I should leave it be for now 😅

@philss
Copy link
Member

philss commented Jun 26, 2023

@anthony-khong yeah, I agree that we should leave for now :D

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add parse datetime from string
3 participants