"Not all variables in the recipe are present" even after using update_role_requirements #1196

Closed
walrossker opened this issue Sep 6, 2023 · 2 comments
@walrossker

The problem

I'd like a variable that is included in the recipe to be ignored when fitting a model, whether or not it is present in the data. As I understand it, that is one use of update_role_requirements(), but it is not working as expected.

Reproducible example

library(tidymodels)
library(forcats)

set.seed(42)

# Create full dataset that does not yet have the stratifying variable
dat <- starwars %>%
  drop_na(gender) %>%
  mutate(human = if_else(species == "Human", "human", "non-human"),
         across(c(where(is.character)), factor),
         across(mass, ~ if_else(.x > 500, NA_real_, .x))) %>%
  select(name, gender, human, height, mass)

# Split into training and testing sets stratified on a new variable
train_test_split <- dat %>%
  mutate(gender_by_human = paste0(gender, "|", human)) %>%
  initial_split(prop = 3/4, strata = gender_by_human)

# Create workflow
rec <- recipe(gender ~ ., data = training(train_test_split)) %>%
  # Change role of stratifying variable (and ID) to "other"
  update_role(c(name, gender_by_human), new_role = "other") %>%
  # Ignore the stratifying variable when baking:
  update_role_requirements(role = "other", bake = FALSE) %>%
  step_impute_knn(all_predictors()) %>%
  step_dummy(all_nominal_predictors())

spec <- logistic_reg() %>%
  set_mode("classification")

wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(spec)

# Assess performance on test set
wf %>% last_fit(train_test_split) %>% collect_metrics()
#> # A tibble: 2 × 4
#>   .metric  .estimator .estimate .config
#>   <chr>    <chr>          <dbl> <chr>
#> 1 accuracy binary         0.818 Preprocessor1_Model1
#> 2 roc_auc  binary         0.865 Preprocessor1_Model1

# Attempt to fit model on the full dataset
wf %>% fit(data = dat)
#> Error in `check_training_set()`:
#> ! Not all variables in the recipe are present in the supplied training set: 'gender_by_human'.
#> Backtrace:
#>      ▆
#>   1. ├─wf %>% fit(data = dat)
#>   2. ├─generics::fit(., data = dat)
#>   3. └─workflows:::fit.workflow(., data = dat)
#>   4.   └─workflows::.fit_pre(workflow, data)
#>   5.     ├─generics::fit(action, workflow = workflow, data = data)
#>   6.     └─workflows:::fit.action_recipe(action, workflow = workflow, data = data)
#>   7.       ├─hardhat::mold(recipe, data, blueprint = blueprint)
#>   8.       └─hardhat:::mold.recipe(recipe, data, blueprint = blueprint)
#>   9.         ├─hardhat::run_mold(blueprint, data = data)
#>  10.         └─hardhat:::run_mold.default_recipe_blueprint(blueprint, data = data)
#>  11.           └─hardhat:::mold_recipe_default_process(...)
#>  12.             ├─recipes::prep(...)
#>  13.             └─recipes:::prep.recipe(...)
#>  14.               └─recipes:::check_training_set(training, x, fresh)
#>  15.                 └─rlang::abort(...)

sessionInfo()
#> R version 4.2.3 (2023-03-15 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19044)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=English_United States.utf8
#> [2] LC_CTYPE=English_United States.utf8
#> [3] LC_MONETARY=English_United States.utf8
#> [4] LC_NUMERIC=C
#> [5] LC_TIME=English_United States.utf8
#>
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base
#>
#> other attached packages:
#>  [1] forcats_0.5.2      yardstick_1.2.0    workflowsets_1.0.1 workflows_1.1.3
#>  [5] tune_1.1.2         tidyr_1.3.0        tibble_3.2.1       rsample_1.2.0
#>  [9] recipes_1.0.8      purrr_1.0.2        parsnip_1.1.1      modeldata_1.2.0
#> [13] infer_1.0.4        ggplot2_3.4.3      dplyr_1.1.2        dials_1.2.0
#> [17] scales_1.2.1       broom_1.0.5        tidymodels_1.1.1
#>
#> loaded via a namespace (and not attached):
#>  [1] foreach_1.5.2       splines_4.2.3       R.utils_2.12.2
#>  [4] prodlim_2019.11.13  GPfit_1.0-8         yaml_2.3.7
#>  [7] globals_0.16.2      ipred_0.9-13        pillar_1.9.0
#> [10] backports_1.4.1     lattice_0.20-45     glue_1.6.2
#> [13] digest_0.6.31       hardhat_1.3.0       colorspace_2.1-0
#> [16] htmltools_0.5.4     Matrix_1.5-3        R.oo_1.25.0
#> [19] timeDate_4022.108   pkgconfig_2.0.3     lhs_1.1.6
#> [22] DiceDesign_1.9      listenv_0.9.0       gower_1.0.1
#> [25] lava_1.7.1          timechange_0.2.0    styler_1.9.1
#> [28] generics_0.1.3      ellipsis_0.3.2      withr_2.5.0
#> [31] furrr_0.3.1         nnet_7.3-18         cli_3.6.1
#> [34] survival_3.5-0      magrittr_2.0.3      evaluate_0.20
#> [37] R.methodsS3_1.8.2   fs_1.6.1            future_1.30.0
#> [40] fansi_1.0.4         parallelly_1.34.0   R.cache_0.16.0
#> [43] MASS_7.3-58.2       class_7.3-21        tools_4.2.3
#> [46] lifecycle_1.0.3     munsell_0.5.0       reprex_2.0.2
#> [49] compiler_4.2.3      rlang_1.1.1         grid_4.2.3
#> [52] iterators_1.0.14    rstudioapi_0.15.0   rmarkdown_2.20
#> [55] gtable_0.3.3        codetools_0.2-19    R6_2.5.1
#> [58] lubridate_1.9.1     knitr_1.42          fastmap_1.1.0
#> [61] future.apply_1.10.0 utf8_1.2.3          parallel_4.2.3
#> [64] Rcpp_1.0.10         vctrs_0.6.3         rpart_4.1.19
#> [67] tidyselect_1.2.0    xfun_0.36
@EmilHvitfeldt
Member

This happens because update_role_requirements() only changes the role requirements at bake() time. Calling wf %>% fit(data = dat) first has to prep() the recipe, and prep() requires gender_by_human, which isn't part of dat.
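
The prep()/bake() distinction can be seen in a minimal sketch on a toy data frame rather than the reprex data. The column names y, x, and id are made up for illustration, and step_normalize() simply stands in for any step. With bake = FALSE on the "other" role, bake() should accept new data that lacks id, while prep() should still fail if id is missing from the training set.

```r
library(recipes)

# Toy training data; `id` plays the part of a non-predictor column
# such as the stratifying variable (hypothetical names throughout)
train <- data.frame(
  y  = c(1, 2, 3, 4),
  x  = c(2, 4, 6, 8),
  id = c("a", "b", "c", "d")
)

rec <- recipe(y ~ ., data = train) %>%
  update_role(id, new_role = "other") %>%
  # Relax the requirement for the "other" role at bake() time only
  update_role_requirements(role = "other", bake = FALSE) %>%
  step_normalize(all_numeric_predictors())

# The same data without the `id` column
no_id <- train[, c("y", "x")]

# prep() still needs every variable the recipe was defined with,
# so training on data without `id` is expected to error
prep_result <- tryCatch(
  prep(rec, training = no_id),
  error = function(e) e
)
inherits(prep_result, "error")

# bake() on new data without `id` works once the recipe has been
# prepped on data that does contain it
prepped <- prep(rec, training = train)
baked <- bake(prepped, new_data = no_id)
names(baked)
```

For the reprex, this suggests that the data passed to fit() needs to contain gender_by_human, for example by adding the same mutate() to dat before fitting. Alternatively, the recipe could be defined on training data that does not include that column at all.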

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Jun 22, 2024