Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ptype information to recipes #1329

Merged
merged 10 commits into from
Jun 4, 2024
Merged

Add ptype information to recipes #1329

merged 10 commits into from
Jun 4, 2024

Conversation

EmilHvitfeldt
Copy link
Member

@EmilHvitfeldt EmilHvitfeldt commented May 31, 2024

This looks at how we could add ptype information to the recipe.

I think the best plan of action is to take a ptype slice of the entire dataset passed to recipe() and then have helper functions to pull out the needed information.

Because the ptype information that the user/functions need will depend:

  • ptype checking for the data passed to prep() should match the data passed to recipe()
  • ptype checking for for data passed to bake() shouldn't match against outcome roles
  • In essence, everything will depend on the roles, and work with update_role_requirements()

Revdepcheck

I ran revdepcheck on the change to add a ptype field to the recipe object and it came back with no issues

@EmilHvitfeldt EmilHvitfeldt changed the title Input ptype Add ptype information to recipes May 31, 2024
@EmilHvitfeldt
Copy link
Member Author

@juliasilge is this enough for what you need in {vetiver}?

@simonpcouch
Copy link
Contributor

In the original, seeing from Davis:

This isn't a huge deal in practice, because workflows/hardhat typically handle the type consistency ahead of time

And then from Emil:

slightly bigger deal because {workflows}/{hardhat} doesn't deal with this

A bit confused here. From what I can tell in the reprex in #793 (comment), workflows does error in this situation but does so uninformatively. That same uninformative error still occurs with this recipes PR installed—here's the same linked reprex with this PR:

library(recipes)
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#> 
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#> 
#>     step
traning <- tibble(outcome = rnorm(1000), x = rnorm(1000), y = sample(letters, 1000, T))

rec1 <- recipe(outcome ~ ., traning) %>%
  step_bin2factor(where(is.numeric), - all_outcomes())

# Original template `traning` has a character `y`
rec2 <- recipe(outcome ~ ., traning) %>%
  step_bin2factor(all_numeric_predictors())

# But when we prep we have a numeric `y`
testing <- tibble(outcome = rnorm(1000), x = rnorm(1000), y = rnorm(1000))

prep(rec1, training = testing, fresh = TRUE)
#> 
#> ── Recipe ──────────────────────────────────────────────────────────────────────
#> 
#> ── Inputs
#> Number of variables by role
#> outcome:   1
#> predictor: 2
#> 
#> ── Training information
#> Training data contained 1000 data points and no incomplete rows.
#> 
#> ── Operations
#> • Dummy variable to factor conversion for: x and y | Trained
prep(rec2, training = testing, fresh = TRUE)
#> 
#> ── Recipe ──────────────────────────────────────────────────────────────────────
#> 
#> ── Inputs
#> Number of variables by role
#> outcome:   1
#> predictor: 2
#> 
#> ── Training information
#> Training data contained 1000 data points and no incomplete rows.
#> 
#> ── Operations
#> • Dummy variable to factor conversion for: x | Trained
prep(rec1, training = testing, fresh = FALSE)
#> 
#> ── Recipe ──────────────────────────────────────────────────────────────────────
#> 
#> ── Inputs
#> Number of variables by role
#> outcome:   1
#> predictor: 2
#> 
#> ── Training information
#> Training data contained 1000 data points and no incomplete rows.
#> 
#> ── Operations
#> • Dummy variable to factor conversion for: x and y | Trained
prep(rec2, training = testing, fresh = FALSE)
#> 
#> ── Recipe ──────────────────────────────────────────────────────────────────────
#> 
#> ── Inputs
#> Number of variables by role
#> outcome:   1
#> predictor: 2
#> 
#> ── Training information
#> Training data contained 1000 data points and no incomplete rows.
#> 
#> ── Operations
#> • Dummy variable to factor conversion for: x | Trained
rec1$var_info$type
#> [[1]]
#> [1] "double"  "numeric"
#> 
#> [[2]]
#> [1] "string"    "unordered" "nominal"  
#> 
#> [[3]]
#> [1] "double"  "numeric"
rec2$var_info$type
#> [[1]]
#> [1] "double"  "numeric"
#> 
#> [[2]]
#> [1] "string"    "unordered" "nominal"  
#> 
#> [[3]]
#> [1] "double"  "numeric"
library(tidymodels)

wf1 <- workflow(rec1, linear_reg())
wf2 <- workflow(rec2, linear_reg())

fit(wf1, testing)
#> Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]): contrasts can be applied only to factors with 2 or more levels
fit(wf2, testing)
#> Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]): contrasts can be applied only to factors with 2 or more levels

Created on 2024-06-03 with reprex v2.1.0

@simonpcouch
Copy link
Contributor

Could you add some context here on how you imagine this helper being integrated into recipes/workflows?

@EmilHvitfeldt
Copy link
Member Author

EmilHvitfeldt commented Jun 3, 2024

Could you add some context here on how you imagine this helper being integrated into recipes/workflows?

right now we have check_training_set() that is called in recipes to check the incoming data, but it only checks column names. recipes_ptype() added in this PR is planned to be used in check_training_set() to enhance checking of input to include types and attributes.

in short, this PR doesn't close 793, but adds the tools needed for it to be closed.

Copy link
Member

@topepo topepo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great and it will help with some S3 generics that use workflows (it's hard to know what the original data was in a workflow).

I wanted to verify that we computed the ptype after the formula was used to subset the columns in the recipe 👍

@EmilHvitfeldt
Copy link
Member Author

I wanted to verify that we computed the ptype after the formula was used to subset the columns in the recipe 👍

That is happening, but I see I forgot to add a test to verify.

Copy link
Member

@juliasilge juliasilge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In vetiver, we currently require users to pass in their "input prototype data" explicitly, as shown in this test. With the change in this PR, we'd be able to get the ptype out automatically from the recipe itself, which is great!

We would change vetiver_ptype.recipe() so it reads the value off the recipe instead of looking for it from the user in the dots.

Copy link
Contributor

@simonpcouch simonpcouch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Documentation is super well written.

I think the fact that this PR can be so simple is a testament to how well this package is engineered. Heck yeah!

R/developer.R Outdated Show resolved Hide resolved
R/ptype.R Outdated Show resolved Hide resolved
R/ptype.R Outdated Show resolved Hide resolved
R/ptype.R Outdated Show resolved Hide resolved
R/ptype.R Outdated Show resolved Hide resolved
R/ptype.R Outdated Show resolved Hide resolved
R/ptype.R Outdated Show resolved Hide resolved
tests/testthat/test-selections.R Show resolved Hide resolved
@EmilHvitfeldt EmilHvitfeldt merged commit 809a15f into main Jun 4, 2024
9 checks passed
@EmilHvitfeldt EmilHvitfeldt deleted the input-ptype branch June 4, 2024 19:16
Copy link

This pull request has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Jun 19, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants