Is it possible to have dplyr's group_by + mutate behavior? #201

Open
tomicapretto opened this issue Jun 25, 2022 · 6 comments

Comments

@tomicapretto

First of all, I really like this package and I've started to use it a lot in my work. As a Pythonista whose first language is R, I really enjoy tidypolars.

In R, we can do something like the following:

library(dplyr)
data(iris)

iris %>%
  group_by(Species) %>%
  mutate(
    result = Petal.Width - mean(Petal.Width)
  )

Since we have a group_by(Species) call, dplyr will subtract the mean that corresponds to each group in the mutate() operation (not the mean across all observations from all species).
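To make the difference concrete, here is a minimal plain-Python sketch of group-wise versus overall de-meaning. It uses only the standard library and a tiny made-up dataset; the helper name `demean_by_group` and the column names are assumptions for illustration, not part of dplyr or tidypolars.

```python
from collections import defaultdict
from statistics import mean

# Toy rows standing in for a data frame (values are made up for illustration).
rows = [
    {"species": "a", "width": 1.0},
    {"species": "a", "width": 3.0},
    {"species": "b", "width": 10.0},
    {"species": "b", "width": 20.0},
]


def demean_by_group(rows, group_key, value_key):
    """Subtract each row's own group mean, like group_by() + mutate()."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    group_means = {key: mean(values) for key, values in groups.items()}
    return [row[value_key] - group_means[row[group_key]] for row in rows]


# Grouped: each value is compared with the mean of its own species.
grouped = demean_by_group(rows, "species", "width")

# Ungrouped: every value is compared with the single overall mean.
overall_mean = mean(row["width"] for row in rows)
ungrouped = [row["width"] - overall_mean for row in rows]

print(grouped)    # [-1.0, 1.0, -5.0, 5.0]
print(ungrouped)  # [-7.5, -5.5, 1.5, 11.5]
```

The grouped result is what the dplyr snippet above produces per species; the ungrouped result is what you would get without the group_by() call.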

As far as I understand, this is still not possible with tidypolars since we don't have a group_by function that behaves in a similar way to the one in dplyr. So my questions are:

  • Is it possible to have this behavior in tidypolars now?
    • If yes, how?
    • If not, is it going to be possible? I could volunteer to try to implement it. I'm not familiar with the existing codebase, but I suspect that Python eager evaluation of function arguments is what makes it harder to have such a feature?
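On the eager-evaluation point: Python does evaluate function arguments before the call, but that is not a blocker if the arguments evaluate to expression objects that are only run later, once the grouping is known. The following is a toy sketch of that idea under stated assumptions. The names `Expr`, `col`, `mean_of` and `mutate` are hypothetical and written for this illustration; this is not how tidypolars or polars is implemented.

```python
from collections import defaultdict
from statistics import mean


class Expr:
    """A deferred computation: wraps a function from columns to a list of values."""

    def __init__(self, fn):
        self.fn = fn  # fn: dict[str, list] -> list

    def __sub__(self, other):
        # Building the difference creates a new Expr; nothing is computed yet.
        return Expr(lambda cols: [a - b for a, b in zip(self.fn(cols), other.fn(cols))])


def col(name):
    """Refer to a column by name without touching any data."""
    return Expr(lambda cols: list(cols[name]))


def mean_of(expr):
    """Mean of an expression, repeated so it lines up with every input row."""
    def fn(cols):
        values = expr.fn(cols)
        return [mean(values)] * len(values)
    return Expr(fn)


def mutate(cols, by=None, **exprs):
    """Evaluate each expression once per group and scatter results back in row order."""
    n = len(next(iter(cols.values())))
    if by is None:
        partitions = [list(range(n))]  # a single group holding every row
    else:
        index = defaultdict(list)
        for i, key in enumerate(cols[by]):
            index[key].append(i)
        partitions = list(index.values())

    out = {name: list(values) for name, values in cols.items()}
    for name, expr in exprs.items():
        result = [None] * n
        for idx in partitions:
            subset = {k: [v[i] for i in idx] for k, v in cols.items()}
            for i, value in zip(idx, expr.fn(subset)):
                result[i] = value
        out[name] = result
    return out


data = {"g": ["a", "b", "a", "b"], "x": [1.0, 10.0, 3.0, 20.0]}
expr = col("x") - mean_of(col("x"))  # evaluated eagerly, but only to an Expr object

by_group = mutate(data, by="g", result=expr)["result"]
overall = mutate(data, result=expr)["result"]

print(by_group)  # [-1.0, -5.0, 1.0, 5.0]
print(overall)   # [-7.5, 1.5, -5.5, 11.5]
```

The same expression object gives different results depending on whether a grouping is supplied, which is the behavior the question asks about.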

Again, thanks for the fantastic library!

@markfairbanks
Owner

markfairbanks commented Jun 25, 2022

In tidypolars I decided to implement .group_by() slightly differently than in the tidyverse: if a function can operate "by group", you use the by arg. This is how you would do it in your example:

import tidypolars as tp
from tidypolars import col

path = (
    "https://gist.githubusercontent.com/netj/8836201/" +
    "raw/6f9306ad21398ea43cba4f7d537619d0e07d5ae3/iris.csv"
)

iris = tp.read_csv(path).rename(species = 'variety')

(
    iris
    .mutate(
        result = col("petal.width") + tp.mean(col("petal.width")),
        by = "species"
    )
)
shape: (150, 6)
┌──────────────┬─────────────┬──────────────┬─────────────┬───────────┬────────┐
│ sepal.length ┆ sepal.width ┆ petal.length ┆ petal.width ┆ species   ┆ result │
│ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---       ┆ ---    │
│ f64          ┆ f64         ┆ f64          ┆ f64         ┆ str       ┆ f64    │
╞══════════════╪═════════════╪══════════════╪═════════════╪═══════════╪════════╡
│ 5.1          ┆ 3.5         ┆ 1.4          ┆ 0.2         ┆ Setosa    ┆ 0.446  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4.9          ┆ 3.0         ┆ 1.4          ┆ 0.2         ┆ Setosa    ┆ 0.446  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4.7          ┆ 3.2         ┆ 1.3          ┆ 0.2         ┆ Setosa    ┆ 0.446  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4.6          ┆ 3.1         ┆ 1.5          ┆ 0.2         ┆ Setosa    ┆ 0.446  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ ...          ┆ ...         ┆ ...          ┆ ...         ┆ ...       ┆ ...    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 6.3          ┆ 2.5         ┆ 5.0          ┆ 1.9         ┆ Virginica ┆ 3.926  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 6.5          ┆ 3.0         ┆ 5.2          ┆ 2.0         ┆ Virginica ┆ 4.026  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 6.2          ┆ 3.4         ┆ 5.4          ┆ 2.3         ┆ Virginica ┆ 4.326  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 5.9          ┆ 3.0         ┆ 5.1          ┆ 1.8         ┆ Virginica ┆ 3.826  │
└──────────────┴─────────────┴──────────────┴─────────────┴───────────┴────────┘

@markfairbanks
Owner

Lots of functions have the by arg so they can operate by group: mutate, filter, slice, summarize, etc.

Basically, if a function can operate "by group" in the tidyverse, you'll be able to use the by arg in tidypolars.

Hope this helps! If you have any other questions let me know.

@tomicapretto
Author

Excellent! Thanks a lot for the prompt and awesome response!

@markfairbanks
Owner

Saw your blog post and I'm glad tidypolars is working out for you!

Figured I would mention that tidypolars has a .drop_null() method. It works like the tidyverse's drop_na() or pandas' .dropna(), though the .filter() approach you used works as well.

You can also use it to drop nulls from specific columns if you want.

# drop nulls from all columns
df.drop_null()

# drop nulls from "x" and "y"
df.drop_null('x', 'y')

@tomicapretto
Author

Awesome! I'll update the post!

@dafxy

dafxy commented Aug 28, 2024

I am reopening this issue. I added a pull request that implements group_by + mutate as a proof of concept. Other functions applied to a grouped Tibble can be implemented by following that example.
