Is it possible to have dplyr's `group_by` + `mutate` behavior? #201

tomicapretto · 2022-06-25T00:49:46Z

First of all, I really like this package and I've started to use it a lot in my work. As a Pythonista whose first language is R, I really enjoy tidypolars.

In R, we can do something like the following:

library(dplyr)
data(iris)

iris %>%
  group_by(Species) %>%
  mutate(
    result = Petal.Width - mean(Petal.Width)
  )

Because of the group_by(Species) call, mutate() subtracts the mean of each row's own species, not the mean across all observations from all species.
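To make the difference concrete, here is a minimal pure-Python sketch of what a grouped mutate computes compared with an ungrouped one. The toy rows and the helper name `demean_by_group` are made up for illustration; they are not part of dplyr or tidypolars.

```python
from collections import defaultdict
from statistics import mean

# Toy data (hypothetical values, not the real iris rows).
rows = [
    {"species": "setosa", "petal_width": 0.2},
    {"species": "setosa", "petal_width": 0.4},
    {"species": "virginica", "petal_width": 1.8},
    {"species": "virginica", "petal_width": 2.2},
]


def demean_by_group(rows, group_key, value_key):
    """Return each row's value minus the mean of its own group, in row order."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    group_means = {name: mean(values) for name, values in groups.items()}
    return [row[value_key] - group_means[row[group_key]] for row in rows]


# Grouped: each value is centered on its own species' mean.
grouped = demean_by_group(rows, "species", "petal_width")

# Ungrouped: every value is centered on the single overall mean.
overall_mean = mean(row["petal_width"] for row in rows)
ungrouped = [row["petal_width"] - overall_mean for row in rows]

print(grouped)
print(ungrouped)
```

The grouped result keeps one value per input row, which is the property that distinguishes a grouped mutate from a grouped summarize.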

As far as I understand, this is not yet possible with tidypolars, since we don't have a group_by function that behaves like the one in dplyr. So my questions are:

Is it possible to have this behavior in tidypolars now?
- If yes, how?
- If not, is it going to be possible? I could volunteer to try to implement it. I'm not familiar with the existing codebase, but I suspect that Python's eager evaluation of function arguments is what makes such a feature harder to implement.

Again, thanks for the fantastic library!

markfairbanks · 2022-06-25T01:12:31Z

In tidypolars I decided to implement .group_by() slightly differently than in the tidyverse - if a function can operate "by group" you use the by arg. So this is how you would do it in your example.

import tidypolars as tp
from tidypolars import col

path = (
    "https://gist.githubusercontent.com/netj/8836201/" +
    "raw/6f9306ad21398ea43cba4f7d537619d0e07d5ae3/iris.csv"
)

iris = tp.read_csv(path).rename(species = 'variety')

(
    iris
    .mutate(
        result = col("petal.width") + tp.mean(col("petal.width")),
        by = "species"
    )
)

shape: (150, 6)
┌──────────────┬─────────────┬──────────────┬─────────────┬───────────┬────────┐
│ sepal.length ┆ sepal.width ┆ petal.length ┆ petal.width ┆ species   ┆ result │
│ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---       ┆ ---    │
│ f64          ┆ f64         ┆ f64          ┆ f64         ┆ str       ┆ f64    │
╞══════════════╪═════════════╪══════════════╪═════════════╪═══════════╪════════╡
│ 5.1          ┆ 3.5         ┆ 1.4          ┆ 0.2         ┆ Setosa    ┆ 0.446  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4.9          ┆ 3.0         ┆ 1.4          ┆ 0.2         ┆ Setosa    ┆ 0.446  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4.7          ┆ 3.2         ┆ 1.3          ┆ 0.2         ┆ Setosa    ┆ 0.446  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4.6          ┆ 3.1         ┆ 1.5          ┆ 0.2         ┆ Setosa    ┆ 0.446  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ ...          ┆ ...         ┆ ...          ┆ ...         ┆ ...       ┆ ...    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 6.3          ┆ 2.5         ┆ 5.0          ┆ 1.9         ┆ Virginica ┆ 3.926  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 6.5          ┆ 3.0         ┆ 5.2          ┆ 2.0         ┆ Virginica ┆ 4.026  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 6.2          ┆ 3.4         ┆ 5.4          ┆ 2.3         ┆ Virginica ┆ 4.326  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 5.9          ┆ 3.0         ┆ 5.1          ┆ 1.8         ┆ Virginica ┆ 3.826  │
└──────────────┴─────────────┴──────────────┴─────────────┴───────────┴────────┘

markfairbanks · 2022-06-25T01:13:55Z

Lots of functions have the by arg so they can operate by group: mutate, filter, slice, summarize, etc.

Basically - if a function can operate "by group" in the tidyverse you'll be able to use the by arg in tidypolars.

Hope this helps! If you have any other questions let me know.

tomicapretto · 2022-06-25T02:22:07Z

Excellent! Thanks a lot for the prompt and awesome response!

markfairbanks · 2022-06-27T12:57:47Z

Saw your blog post and I'm glad tidypolars is working out for you!

Figured I would mention that tidypolars has a .drop_null() method. It works like the tidyverse's drop_na() or pandas' .dropna(), though the .filter() approach you used works as well.

You can also use it to drop nulls from specific columns if you want.

# drop nulls from all columns
df.drop_null()

# drop nulls from "x" and "y"
df.drop_null('x', 'y')

tomicapretto · 2022-06-27T15:03:17Z

Awesome! I'll update the post!

dafxy · 2024-08-28T19:31:24Z

I am reopening this issue. I added a pull request that implements group_by + mutate as a proof of concept. Other functions applied to a grouped Tibble can be implemented by following the same example.

markfairbanks closed this as completed Jun 25, 2022

etiennebacher mentioned this issue Jul 10, 2023

Groups are a problem etiennebacher/tidypolars#5

Closed

etiennebacher mentioned this issue Aug 3, 2023

Manually converting LazyGroupBy to LazyFrame breaks printing pola-rs/r-polars#338

Closed

dafxy mentioned this issue Aug 28, 2024

Implementation of group_by+mutate #242

Open

markfairbanks reopened this Aug 30, 2024
