df.groupby().agg() feature #16084
Thanks for the contribution @NeilGeorge1! Could you add tests covering a few different cases? I know you won't be able to run them on your machine without a GPU, but you should be able to write them using pandas code locally and then simply switch to cudf at the end before committing. Then we can use your tests to validate the behavior here. I would recommend trying a few more complex cases, such as ones with multiple output columns. Also, right now you've made the change to the dask object, but we want the functionality in cudf itself. Have a look at this file, which is where I think you'd want to add the new code.
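A minimal sketch of the kind of test suggested above, written against pandas so it runs without a GPU. The frame, column names, and function names are made up for illustration. Switching `pd` to `cudf` would be a later step, and whether cudf then returns the same result is what such a test is meant to check.

```python
import pandas as pd


def make_frame():
    # Small hypothetical frame: one key column and two value columns.
    return pd.DataFrame(
        {
            "key": ["x", "x", "y", "y"],
            "a": [1, 2, 3, 4],
            "b": [10.0, 20.0, 30.0, 40.0],
        }
    )


def named_agg(df):
    # Named aggregation: each keyword is an output column name,
    # each value is a (source column, aggregation) tuple.
    return df.groupby("key").agg(
        a_min=("a", "min"),
        a_max=("a", "max"),
        b_sum=("b", "sum"),
    )


def test_named_agg_multiple_output_columns():
    result = named_agg(make_frame())
    # Output columns follow keyword order; two outputs share source column "a".
    assert list(result.columns) == ["a_min", "a_max", "b_sum"]
    assert result["a_min"].tolist() == [1, 3]
    assert result["a_max"].tolist() == [2, 4]
    assert result["b_sum"].tolist() == [30.0, 70.0]
```

The case deliberately asks for two aggregations of the same source column, since that is where a plain column-to-function dict stops being enough.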
Yeah, sure @vyasr. I don't have any prior experience writing test cases, but I shall try my best.
I think @galipremsagar or @wence- can probably give you a better answer about how best we want this to look in dask.
Let's get the cudf implementation right first; it may then turn out that we don't have to do anything in dask-dataframe. To do this, I think we want to mimic the way the pandas API behaves (https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.agg.html). I don't think we support arguments and engine to
The behaviour of pandas is:
This preprocessing should happen in I think the way to attack this is to let
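In pandas' named aggregation, each keyword passed to `agg` is an output column name and each value is a `(column, aggfunc)` tuple. One possible shape for the preprocessing step is sketched below in plain Python. The helper name `named_aggs_to_spec` and its return values are hypothetical, chosen for illustration. They are not part of the cudf or pandas API.

```python
def named_aggs_to_spec(**named_aggs):
    """Turn named-aggregation keywords into a per-column spec.

    Hypothetical helper for illustration only.

    Returns a pair (spec, outputs):
      spec    maps each source column to the list of aggregations requested on it
      outputs maps each output name to its (column, aggfunc) pair, in keyword order
    """
    if not named_aggs:
        raise TypeError("expected at least one name=(column, aggfunc) keyword")

    spec = {}
    outputs = {}
    for out_name, pair in named_aggs.items():
        if not (isinstance(pair, tuple) and len(pair) == 2):
            raise TypeError(
                f"{out_name!r} must be a (column, aggfunc) tuple, got {pair!r}"
            )
        column, aggfunc = pair
        aggs = spec.setdefault(column, [])
        # Do not request the same aggregation twice on one column.
        if aggfunc not in aggs:
            aggs.append(aggfunc)
        outputs[out_name] = (column, aggfunc)
    return spec, outputs
```

With `spec` in the familiar column-to-list-of-functions form, an existing dict-based aggregation path could do the computation, and `outputs` would then drive selecting and renaming the result columns.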
Understood @wence- I will start by reading the documentation to grasp the required changes and proceed to implement them. Following this, I will add a few more test cases taking reference from the examples provided in the Given that my PC lacks a GPU, I will then run these tests using only Is there anything else I need to consider?
That sounds about right. If you have queries about the (admittedly underdocumented) internals of the groupby datastructures, please ask here.
Sure thing! Will do, thanks!
-    self, arg, split_every=None, split_out=1, shuffle_method=None
+    self, arg, split_every=None, split_out=1, shuffle_method=None, **kwargs
 ):
+    if kwargs:
+        arg = {col_name: agg_func for col_name, (col, agg_func) in kwargs.items()}
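The tuple unpacking in this hunk suggests the keywords follow pandas' named-aggregation convention, where the keyword is the output name and the value is `(column, aggfunc)`. Under that assumption, the comprehension keys the result by output name and drops the source column. The short sketch below, using made-up keyword values, shows the difference from a mapping keyed by source column.

```python
# Hypothetical keyword arguments in named-aggregation form:
# output name -> (source column, aggregation).
kwargs = {"a_min": ("a", "min"), "b_sum": ("b", "sum")}

# The comprehension from the hunk: keyed by the keyword, `col` is unused.
as_in_diff = {col_name: agg_func for col_name, (col, agg_func) in kwargs.items()}

# An alternative keyed by the source column, keeping every requested aggregation.
by_source = {}
for out_name, (col, agg_func) in kwargs.items():
    by_source.setdefault(col, []).append(agg_func)

print(as_in_diff)  # keys are output names, not columns of the frame
print(by_source)   # keys are the columns that actually exist in the frame
```

A downstream aggregation that treats the dict keys as column names would look for columns called `a_min` and `b_sum` in the first case, which do not exist in the input frame.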
I know you are already revising this PR, but just a quick note: These changes are being made to the legacy dask-cudf API (which will soon be deprecated and then completely removed). The new (default) expression-based API is defined in expr/_groupby.py.
OK, sure. Working on cudf now.
@NeilGeorge1 have you had any luck here? Do you need any pointers?
I'll open a new PR for this.
Description
Added **kwargs Support in df.groupby().agg()
Implemented support for **kwargs in df.groupby().agg() to provide greater flexibility and customization options during aggregation operations.
This PR closes #15967
Checklist