prediction fails when all variables are normalized #1256

Closed
matthewgson opened this issue Nov 5, 2023 · 9 comments

@matthewgson

I'm cross-posting this report to both GitHub repos (tidymodels and recipes); please feel free to close either one if needed.

The problem

I've been exploring tidymodels and encountered a problem where predict() fails specifically when all of the variables are normalized, including the dependent variable. I do not observe the same error when only the predictor variables are normalized.

Reproducible example

library(tidymodels)
library(tidyverse)

data = iris
set.seed(156)
split = initial_split(data, prop = 0.75)
train = training(split)
test = testing(split)

all_recipe = 
  recipe(Sepal.Width ~ Petal.Length, data = train
  ) %>% 
  step_normalize(everything())

x_recipe = 
  recipe(Sepal.Width ~ Petal.Length, data = train
  ) %>% 
  step_normalize(all_predictors())

model = linear_reg() %>% 
  set_engine('lm')

workflow_all = workflow() %>% 
  add_model(model) %>% 
  add_recipe(all_recipe)

workflow_x = workflow() %>% 
  add_model(model) %>% 
  add_recipe(x_recipe)

fit_all = fit(workflow_all, train)
fit_x = fit(workflow_x, train)

predict(fit_all, test)
#> Error in `step_normalize()`:
#> ! The following required column is missing from `new_data` in step
#>   'normalize_EsO9U': Sepal.Width.
#> Backtrace:
#>      ▆
#>   1. ├─stats::predict(fit_all, test)
#>   2. └─workflows:::predict.workflow(fit_all, test)
#>   3.   └─workflows:::forge_predictors(new_data, workflow)
#>   4.     ├─hardhat::forge(new_data, blueprint = mold$blueprint)
#>   5.     └─hardhat:::forge.data.frame(new_data, blueprint = mold$blueprint)
#>   6.       ├─hardhat::run_forge(blueprint, new_data = new_data, outcomes = outcomes)
#>   7.       └─hardhat:::run_forge.default_recipe_blueprint(...)
#>   8.         └─hardhat:::forge_recipe_default_process(...)
#>   9.           ├─recipes::bake(object = rec, new_data = new_data)
#>  10.           └─recipes:::bake.recipe(object = rec, new_data = new_data)
#>  11.             ├─recipes::bake(step, new_data = new_data)
#>  12.             └─recipes:::bake.step_normalize(step, new_data = new_data)
#>  13.               └─recipes::check_new_data(col_names, object, new_data)
#>  14.                 └─cli::cli_abort(...)
#>  15.                   └─rlang::abort(...)
predict(fit_x, test)
#> # A tibble: 38 × 1
#>    .pred
#>    <dbl>
#>  1  3.26
#>  2  3.26
#>  3  3.25
#>  4  3.29
#>  5  3.25
#>  6  3.25
#>  7  3.28
#>  8  3.27
#>  9  3.25
#> 10  3.27
#> # ℹ 28 more rows
@EmilHvitfeldt
Member

Hello @matthewgson! This is happening because of the way {recipes} interacts with {workflows}.

everything() selects both Sepal.Width and Petal.Length in your example. This is fine and dandy in {recipes}'s eyes.

When {workflows} sends data to {recipes}, it only sends the columns that {recipes} has labeled as "predictor". This means that only Petal.Length is sent, but step_normalize() is expecting both Sepal.Width and Petal.Length, hence the error.

This is all expected behavior from {recipes} and {workflows}, but it can be a bit surprising the first time you see it.
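
To see the same failure outside of {workflows}, here is a minimal sketch using the recipe from your reprex (an illustration of the mechanism, not the exact internals): handing bake() only the predictor column, which is effectively what {workflows} does at predict time, reproduces the error.

# sketch: bake() only receives the predictor column, so step_normalize(),
# which was trained on Sepal.Width as well, cannot find that column and aborts
prepped <- prep(all_recipe, training = train)
bake(prepped, new_data = test["Petal.Length"])
# errors: the required column Sepal.Width is missing from `new_data`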

So my general advice is to use the recipes selection functions whenever possible; see the selections help page for more information. The other tidyselect selectors are great too, but avoid everything(), since it selects EVERYTHING: outcomes, predictors, and anything else.

@matthewgson
Author

I don't think this is the intended behavior. The documentation on recipes includes the tidyselect functions, including everything(). At the very least, this approach was intended to be supported in preprocessing, and it actually works without throwing any error when called with prep() and bake().
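
For example (a minimal sketch based on the reprex above), this runs without error and applies the means and sds learned from the training data to the test data:

prepped_all <- prep(all_recipe, training = train)
bake(prepped_all, new_data = test)   # both columns scaled with the training statistics; no error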

It would be disappointing if this were true, because normalizing the dependent variable is also common practice in neural networks for stable convergence.

@EmilHvitfeldt
Member

There is a difference between {recipes} on its own and {recipes} used within tidymodels and {workflows}.

{recipes} is a general preprocessing and feature engineering package, so it allows you to do many things, including modifying the outcome. However, we don't allow that behavior in combination with {workflows}, as it is often a mistake.

When you predict on new data, the outcome variable isn't included in the data set (hence why you are predicting), so a transformation that tries to touch the outcome should fail. Allowing this type of behavior for the few cases where people need it does not, in our opinion, outweigh all the times where it would be a modeling mistake.

Hence, if you want a transformation done on the outcome for use in {tidymodels}, our recommendation is that you do the transformation yourself:

https://stackoverflow.com/questions/76158409/log-transform-outcome-variable-in-tidymodels-workflow/76158558#76158558
https://stackoverflow.com/questions/76347330/broomaugment-errors-with-recipesstep-log/76349194#76349194
https://stackoverflow.com/questions/76378383/problem-when-scoring-new-data-tidymodels
#1195

As for everything() specifically, it is listed in the documentation because it works within {recipes}, but it has to be used with care, as it selects everything. Selecting variables in a supervised modeling setting is usually not a job for everything().

In the following example, we pass in a data set with an outcome, a predictor, and an ID variable. everything() would modify all three, which would be a mistake. That is why we recommend the more specific selectors.

library(recipes)

my_cars <- mtcars[, c(1, 3)]
my_cars$ID <- 1:32

rec <- recipe(my_cars) |>
  update_role(mpg, new_role = "outcome") |>
  update_role(disp, new_role = "predictor")

rec |>
  step_normalize(everything()) |>
  prep() |>
  bake(new_data = NULL)
#> # A tibble: 32 × 3
#>       mpg    disp     ID
#>     <dbl>   <dbl>  <dbl>
#>  1  0.151 -0.571  -1.65 
#>  2  0.151 -0.571  -1.55 
#>  3  0.450 -0.990  -1.44 
#>  4  0.217  0.220  -1.33 
#>  5 -0.231  1.04   -1.23 
#>  6 -0.330 -0.0462 -1.12 
#>  7 -0.961  1.04   -1.01 
#>  8  0.715 -0.678  -0.906
#>  9  0.450 -0.726  -0.800
#> 10 -0.148 -0.509  -0.693
#> # ℹ 22 more rows

rec |>
  step_normalize(all_outcomes(), all_predictors()) |>
  prep() |>
  bake(new_data = NULL)
#> # A tibble: 32 × 3
#>       mpg    disp    ID
#>     <dbl>   <dbl> <int>
#>  1  0.151 -0.571      1
#>  2  0.151 -0.571      2
#>  3  0.450 -0.990      3
#>  4  0.217  0.220      4
#>  5 -0.231  1.04       5
#>  6 -0.330 -0.0462     6
#>  7 -0.961  1.04       7
#>  8  0.715 -0.678      8
#>  9  0.450 -0.726      9
#> 10 -0.148 -0.509     10
#> # ℹ 22 more rows

@matthewgson
Author

I see that there are quite a few nuances to learn about here, and many other users are struggling at this point.
I believe users would use everything() with caution, as you pointed out in the example, because it is everything. But before learning about update_role(), users might use step_rm() in conjunction with everything(), or a formula specification given in recipe() as in my example. IMHO, if this approach should be discouraged, especially with workflow(), a warning message could be used.

When you predict on new data, the outcome variable isn't included in the data set (hence why you are predicting), so a transformation that tries to touch the outcome should fail. Allowing this type of behavior for the few cases where people need it does not, in our opinion, outweigh all the times where it would be a modeling mistake.

Regarding this point, I don't see why prediction should fail. At the very least, it could give a "normalized" prediction output that I would manually transform back to the original scale. The outcome (or truth) may or may not be included (it usually is when the new data comes from the test portion of a train/test split), but because the normalization parameters were learned by recipe() in the initial step (usually from the training data), it should not re-estimate the parameters from the new data but use the 'learned' parameters. I think that was the design philosophy and the reason recipe() requires a data specification in the first place: to prevent data leakage (which I think is fantastic).

I like that predict() can be used with a workflow and unprocessed data so that it handles everything as specified, and I'm hoping the behavior can be made consistent.

@EmilHvitfeldt
Member

Imagine if we made the following code work as you suggested:

library(tidymodels)
library(tidyverse)

data = iris
set.seed(156)
split = initial_split(data, prop = 0.75)
train = training(split)
test = testing(split)

all_recipe = 
  recipe(Sepal.Width ~ Petal.Length, data = train
  ) %>% 
  step_normalize(everything())

x_recipe = 
  recipe(Sepal.Width ~ Petal.Length, data = train
  ) %>% 
  step_normalize(all_predictors())

model = linear_reg() %>% 
  set_engine('lm')

workflow_all = workflow() %>% 
  add_model(model) %>% 
  add_recipe(all_recipe)

workflow_x = workflow() %>% 
  add_model(model) %>% 
  add_recipe(x_recipe)

fit_all = fit(workflow_all, train)
fit_x = fit(workflow_x, train)

predict(fit_all, test)
#> # A tibble: 38 × 1
#>    .pred
#>    <dbl>
#>  1  3.26
#>  2  3.26
#>  3  3.25
#>  4  3.29
#>  5  3.25
#>  6  3.25
#>  7  3.28
#>  8  3.27
#>  9  3.25
#> 10  3.27
#> # ℹ 28 more rows

predict(fit_x, test)
#> # A tibble: 38 × 1
#>    .pred
#>    <dbl>
#>  1  3.26
#>  2  3.26
#>  3  3.25
#>  4  3.29
#>  5  3.25
#>  6  3.25
#>  7  3.28
#>  8  3.27
#>  9  3.25
#> 10  3.27
#> # ℹ 28 more rows

What should the following code produce?

new_values <- data.frame(
  Petal.Length = c(1, 10, 100)
)

predict(fit_all, new_values)

It doesn't include the outcome, but it includes the predictors, which should be enough. For this code to run, recipes would need to flag that Sepal.Width is missing and omit the transformation of that variable because it is an outcome.

This seems rather cumbersome considering the alternative is the following code:

library(tidymodels)
library(tidyverse)

data = iris
set.seed(156)
split = initial_split(data, prop = 0.75)
train = training(split)
test = testing(split)

# The difference ------------------------------------------
outcome_recipe = 
  recipe(Sepal.Width ~ Petal.Length, data = train
  ) %>% 
  step_normalize(all_outcomes()) %>%
  prep()

train <- bake(outcome_recipe, train)
test <- bake(outcome_recipe, test)
# The difference ------------------------------------------

main_recipe = 
  recipe(Sepal.Width ~ Petal.Length, data = train
  ) %>% 
  step_normalize(all_numeric_predictors())

model = linear_reg() %>% 
  set_engine('lm')

workflow_spec = workflow() %>% 
  add_model(model) %>% 
  add_recipe(main_recipe)

fit_all = fit(workflow_spec, train)

predict(fit_all, test)
#> # A tibble: 38 × 1
#>    .pred
#>    <dbl>
#>  1 0.526
#>  2 0.526
#>  3 0.504
#>  4 0.593
#>  5 0.504
#>  6 0.504
#>  7 0.571
#>  8 0.549
#>  9 0.504
#> 10 0.549
#> # ℹ 28 more rows

Above, you use two recipes: one to transform the outcome, and a second, used in your workflow, for your predictors.
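
If you need the predictions back on the original Sepal.Width scale, you can undo the normalization yourself. A minimal sketch (recomputing the training mean and sd directly, i.e. the same statistics that outcome_recipe learned):

train_raw <- training(split)            # the raw, unbaked training data
sw_mean <- mean(train_raw$Sepal.Width)
sw_sd   <- sd(train_raw$Sepal.Width)

predict(fit_all, test) %>%
  mutate(.pred_original = .pred * sw_sd + sw_mean)   # back to the original scale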

@matthewgson
Author

What should the following code produce?

new_values <- data.frame(
  Petal.Length = c(1, 10, 100)
)
predict(fit_all, new_values)

Let me elaborate on my comment above. To be consistent with the current behavior that does not scale the dependent variable, I expect it would do the following (sketched in code below):

  1. Normalize the new variable based on the training parameters ((x - mean)/sd = z)
  2. Make a prediction based on those z values. For example, if the model intercept = 3 and slope = 1, then 3 + 1*z = m
  3. Convert the normalized prediction back (sd*m + mean), where sd and mean come from the training data, OR output the normalized prediction m

Hence, it should not need to check the new data for the outcome before normalizing. However, even when the new data does include the outcome variable, it throws the same error message, which causes confusion.
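
As a sketch with hypothetical numbers (the means, sds, and coefficients below are made up rather than taken from the fitted model):

x_mean <- 3.7;  x_sd <- 1.8    # hypothetical Petal.Length statistics learned from train
y_mean <- 3.05; y_sd <- 0.44   # hypothetical Sepal.Width statistics learned from train

z <- (new_values$Petal.Length - x_mean) / x_sd   # 1. normalize the new predictor
m <- 3 + 1 * z                                   # 2. predict on the normalized scale
m * y_sd + y_mean                                # 3. back-transform to the original scale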

In my opinion, this is what users would expect to happen when they decide to include the dependent variable in scaling. Preparing two recipes for y and x separately may be a cumbersome workflow, but I appreciate your suggestion.

@EmilHvitfeldt
Member

The way we do things now makes the most sense to me and is the most consistent with how we do things elsewhere. Thanks for offering a different perspective, but the existing approach works best for us.

As for the use of everything(), I would reiterate that it is often not the best way to select variables, as it will often select too much. I have created an issue to improve the documentation: #1259

@matthewgson
Author

matthewgson commented Nov 14, 2023

Well... if I'm not mistaken, there are selectors other than everything() that can select the dependent variable: all_nominal(), all_factors(), all_numeric(), all_integer(), and so forth. I don't think it is safe to assume that using everything() or the others is simply an end-user's mistake. Also, the error message is not particularly helpful.


This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Nov 29, 2023