Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: pyarrow unique in group_by context #1076

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

FBruzzesi
Copy link
Member

What type of PR is this? (check all applicable)

  • πŸ’Ύ Refactor
  • ✨ Feature
  • πŸ› Bug Fix
  • πŸ”§ Optimization
  • πŸ“ Documentation
  • βœ… Test
  • 🐳 Other

Related issues

May come in handy for plotly

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below.

I was not able to add tests... I tried to nest a bunch of checks but also the order inside the list type is not guaranteed..
any idea?

@github-actions github-actions bot added the fix label Sep 26, 2024
@MarcoGorelli
Copy link
Member

thanks @FBruzzesi !

I think plotly would only need to get some value from the aggregation, rather than a list dtypes?

perhaps we could allow

df.group_by('a').agg(nw.unique_value('b'))  # raises if there's more than 1 unique value per group
df.group_by('a').agg(nw.unique_value('b', fallback_value='(?)'))

Something like this could help address the mode issue you'd spotted in skrub, iirc they just wanted to get a single value from the mode out, right?

@FBruzzesi
Copy link
Member Author

Not sure if this is the right place for this discussion but here we go πŸ™ƒ

I think plotly would only need to get some value from the aggregation, rather than a list dtypes?

Yes correct!

perhaps we could allow

df.group_by('a').agg(nw.unique_value('b'))  # raises if there's more than 1 unique value per group
df.group_by('a').agg(nw.unique_value('b', fallback_value='(?)'))

Not the biggest fan of this if we are going to support nw.List type - a list is an aggregated value, and surprisingly pandas seems to behave quite well - at least for .unique πŸ™ˆ

Something like this could help address the mode issue you'd spotted in skrub, iirc they just wanted to get a single value from the mode out, right?

Correct again.

@MarcoGorelli
Copy link
Member

surprisingly pandas seems to behave quite well

yeah but it returns object dtype and I fear that'd create more issues for us down the line

@FBruzzesi
Copy link
Member Author

yeah but it returns object dtype and I fear that'd create more issues for us down the line

Yes that's not ideal, and yesterday I had issues converting to list type (e.g. .astype('pyarrow[list]') is not enough).

Maybe let's sleep on this, but I would imagine that someone using narwhals should just be a bit more pedantic and do:

(df
.group_by("a")
.agg(nw.col("b").unique()))
.with_columns(nw.col("b").cast(nw.List(...)))  # force it to be list type
... # now can access .list namespace
)

@MarcoGorelli
Copy link
Member

i'm not sure that people would think to do that explicit cast, and implementing the list namespace would be quite difficult for pandas

we may be able to take inspiration from duckdb here, who have any_value as an aggregate function https://duckdb.org/docs/sql/functions/aggregates.html#any_valuearg

>>> rel = duckdb.read_parquet('../scratch/assets.parquet')
>>> duckdb.sql('select symbol, any_value(date) from rel group by symbol')
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ symbol  β”‚ any_value(date) β”‚
β”‚ varchar β”‚      date       β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ EWJ     β”‚ 2022-01-31      β”‚
β”‚ OGN     β”‚ 2022-01-31      β”‚
β”‚ PRU     β”‚ 2022-01-31      β”‚
β”‚ AEP     β”‚ 2022-01-31      β”‚
β”‚ ALLE    β”‚ 2022-01-31      β”‚
β”‚ IEFM.L  β”‚ 2022-01-31      β”‚
β”‚ EWG     β”‚ 2022-01-31      β”‚
β”‚ SEGA.L  β”‚ 2022-01-31      β”‚
β”‚ IAU     β”‚ 2022-01-31      β”‚
β”‚ XLV     β”‚ 2022-01-31      β”‚
β”‚  Β·      β”‚     Β·           β”‚
β”‚  Β·      β”‚     Β·           β”‚
β”‚  Β·      β”‚     Β·           β”‚
β”‚ CNC     β”‚ 2022-01-31      β”‚
β”‚ CTAS    β”‚ 2022-01-31      β”‚
β”‚ DG      β”‚ 2022-01-31      β”‚
β”‚ IEF     β”‚ 2022-05-31      β”‚
β”‚ IEMG    β”‚ 2022-01-31      β”‚
β”‚ JPEA.L  β”‚ 2022-01-31      β”‚
β”‚ META    β”‚ 2022-01-31      β”‚
β”‚ HIGH.L  β”‚ 2022-03-17      β”‚
β”‚ HST     β”‚ 2022-01-31      β”‚
β”‚ VXX     β”‚ 2022-01-31      β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚    100 rows (20 shown)    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

So, my thinking was that unique_value would be kind of like any_value, but only works if there's only a unique value per group

If it's a top-level function (nw.unique_value) then I think it'd be ok to depart from Polars a bit there, we have other non-Polars function in the top-level narwhals namespace

@FBruzzesi
Copy link
Member Author

Just for clarity, when you say:

Something like this could help address the mode issue you'd spotted in skrub, iirc they just wanted to get a single value from the mode out, right?

does it mean that nw.unique_value('b') can receive an expression (e.g. in the skrub case nw.unique_value(nw.col('b').mode()))?

@MarcoGorelli
Copy link
Member

MarcoGorelli commented Sep 27, 2024

I haven't tried implementing it yet, but yes, I think so

alternatively, we could add:

  • nw.unique_value
  • nw.unique_mode

Alternatively, we could have our own Agg class and do something like

df.group_by('a').agg(nw.Agg.unique_mode('b'))

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants