prediction fails when all variables are normalized #1256

Closed
matthewgson opened this issue Nov 5, 2023 · 9 comments

@matthewgson

I'm cross-posting this report to both GitHub repos (tidymodels and recipes); please feel free to close either one if needed.

The problem

I've been exploring tidymodels and encountered a problem where predict() fails specifically when all of the variables are normalized, including the dependent variable. I do not observe the same error when only the predictor variables are normalized.

Reproducible example

library(tidymodels)
library(tidyverse)

data = iris
set.seed(156)
split = initial_split(data, prop = 0.75)
train = training(split)
test = testing(split)

all_recipe = 
  recipe(Sepal.Width ~ Petal.Length, data = train
  ) %>% 
  step_normalize(everything())

x_recipe = 
  recipe(Sepal.Width ~ Petal.Length, data = train
  ) %>% 
  step_normalize(all_predictors())

model = linear_reg() %>% 
  set_engine('lm')

workflow_all = workflow() %>% 
  add_model(model) %>% 
  add_recipe(all_recipe)

workflow_x = workflow() %>% 
  add_model(model) %>% 
  add_recipe(x_recipe)

fit_all = fit(workflow_all, train)
fit_x = fit(workflow_x, train)

predict(fit_all, test)
#> Error in `step_normalize()`:
#> ! The following required column is missing from `new_data` in step
#>   'normalize_EsO9U': Sepal.Width.
#> Backtrace:
#>      ▆
#>   1. ├─stats::predict(fit_all, test)
#>   2. └─workflows:::predict.workflow(fit_all, test)
#>   3.   └─workflows:::forge_predictors(new_data, workflow)
#>   4.     ├─hardhat::forge(new_data, blueprint = mold$blueprint)
#>   5.     └─hardhat:::forge.data.frame(new_data, blueprint = mold$blueprint)
#>   6.       ├─hardhat::run_forge(blueprint, new_data = new_data, outcomes = outcomes)
#>   7.       └─hardhat:::run_forge.default_recipe_blueprint(...)
#>   8.         └─hardhat:::forge_recipe_default_process(...)
#>   9.           ├─recipes::bake(object = rec, new_data = new_data)
#>  10.           └─recipes:::bake.recipe(object = rec, new_data = new_data)
#>  11.             ├─recipes::bake(step, new_data = new_data)
#>  12.             └─recipes:::bake.step_normalize(step, new_data = new_data)
#>  13.               └─recipes::check_new_data(col_names, object, new_data)
#>  14.                 └─cli::cli_abort(...)
#>  15.                   └─rlang::abort(...)
predict(fit_x, test)
#> # A tibble: 38 × 1
#>    .pred
#>    <dbl>
#>  1  3.26
#>  2  3.26
#>  3  3.25
#>  4  3.29
#>  5  3.25
#>  6  3.25
#>  7  3.28
#>  8  3.27
#>  9  3.25
#> 10  3.27
#> # ℹ 28 more rows
@EmilHvitfeldt
Member

Hello @matthewgson! This is happening because of the way {recipes} interacts with {workflows}.

everything() selects both Sepal.Width and Petal.Length in your example. This is fine and dandy in {recipes}'s eyes.

When {workflows} sends data to {recipes}, it only sends the columns that {recipes} has labeled as "predictor". This means that only Petal.Length is sent, but step_normalize() is expecting both Sepal.Width and Petal.Length, hence the error.

This is all expected behavior from {recipes} and {workflows}, but it can be a bit surprising the first time you see it.
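
To see the same failure outside of {workflows}, here is a minimal sketch using the recipe from your reprex (an illustration of the mechanism, not the exact internals): handing bake() only the predictor column, which is effectively what {workflows} does at predict time, reproduces the error.

# sketch: bake() only receives the predictor column, so step_normalize(),
# which was trained on Sepal.Width as well, cannot find that column and aborts
prepped <- prep(all_recipe, training = train)
bake(prepped, new_data = test["Petal.Length"])
# errors: the required column Sepal.Width is missing from `new_data`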

So my general advice is to use the recipes selection functions whenever possible; see the selections help page for more information. The other tidyselect selectors are great too, but avoid everything(), since it selects EVERYTHING: outcomes, predictors, and anything else.

@matthewgson
Author

I don't think this is the intended behavior. The documentation on recipes includes the tidyselect functions, including everything(). At the very least, this approach was intended to be supported in preprocessing, and it actually works without throwing any error when called with prep() and bake().
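
For example (a minimal sketch based on the reprex above), this runs without error and applies the means and sds learned from the training data to the test data:

prepped_all <- prep(all_recipe, training = train)
bake(prepped_all, new_data = test)   # both columns scaled with the training statistics; no error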

It would be disappointing if this were true, because normalizing the dependent variable is also common practice in neural networks for stable convergence.

@EmilHvitfeldt
Member

There is a difference between {recipes} on its own and {recipes} used within tidymodels and {workflows}.

{recipes} is a general preprocessing and feature engineering package, so it allows you to do many things, including modifying the outcome. However, we don't allow that behavior in combination with {workflows}, as it is often a mistake.

When you predict on new data, the outcome variable isn't included in the data set (hence why you are predicting), so a transformation that tries to touch the outcome should fail. Allowing this type of behavior for the few cases where people need it does not, in our opinion, outweigh all the times where it would be a modeling mistake.

Hence, if you want a transformation done on the outcome for use in {tidymodels}, our recommendation is that you do the transformation yourself:

https://stackoverflow.com/questions/76158409/log-transform-outcome-variable-in-tidymodels-workflow/76158558#76158558
https://stackoverflow.com/questions/76347330/broomaugment-errors-with-recipesstep-log/76349194#76349194
https://stackoverflow.com/questions/76378383/problem-when-scoring-new-data-tidymodels
#1195

As for everything() specifically, it is listed in the documentation because it works within {recipes}, but it has to be used with care, as it selects everything. Selecting variables in a supervised modeling setting is usually not a job for everything().

In the following example, we pass in a data set with an outcome, a predictor, and an ID variable. everything() would modify all three, which would be a mistake. That is why we recommend the more specific selectors.

library(recipes)

my_cars <- mtcars[, c(1, 3)]
my_cars$ID <- 1:32

rec <- recipe(my_cars) |>
  update_role(mpg, new_role = "outcome") |>
  update_role(disp, new_role = "predictor")

rec |>
  step_normalize(everything()) |>
  prep() |>
  bake(new_data = NULL)
#> # A tibble: 32 × 3
#>       mpg    disp     ID
#>     <dbl>   <dbl>  <dbl>
#>  1  0.151 -0.571  -1.65 
#>  2  0.151 -0.571  -1.55 
#>  3  0.450 -0.990  -1.44 
#>  4  0.217  0.220  -1.33 
#>  5 -0.231  1.04   -1.23 
#>  6 -0.330 -0.0462 -1.12 
#>  7 -0.961  1.04   -1.01 
#>  8  0.715 -0.678  -0.906
#>  9  0.450 -0.726  -0.800
#> 10 -0.148 -0.509  -0.693
#> # ℹ 22 more rows

rec |>
  step_normalize(all_outcomes(), all_predictors()) |>
  prep() |>
  bake(new_data = NULL)
#> # A tibble: 32 × 3
#>       mpg    disp    ID
#>     <dbl>   <dbl> <int>
#>  1  0.151 -0.571      1
#>  2  0.151 -0.571      2
#>  3  0.450 -0.990      3
#>  4  0.217  0.220      4
#>  5 -0.231  1.04       5
#>  6 -0.330 -0.0462     6
#>  7 -0.961  1.04       7
#>  8  0.715 -0.678      8
#>  9  0.450 -0.726      9
#> 10 -0.148 -0.509     10
#> # ℹ 22 more rows

@matthewgson
Author

I see that there are quite a few nuances to learn about here, and many other users are struggling at this point.
I believe users would use everything() with caution, as you pointed out in the example, because it is everything. But before learning about update_role(), users might use step_rm() in conjunction with everything(), or a formula specification given in recipe() as in my example. IMHO, if this approach should be discouraged, especially with workflow(), a warning message could be used.

When you predict on new data, the outcome variable isn't included in the data set (hence why you are predicting), so a transformation that tries to touch the outcome should fail. Allowing this type of behavior for the few cases where people need it does not, in our opinion, outweigh all the times where it would be a modeling mistake.

Regarding this point, I don't see why prediction should fail. At the very least, it could give a "normalized" prediction output that I would manually transform back to the original scale. The outcome (or truth) may or may not be included (it usually is when the new data comes from the test portion of a train/test split), but because the normalization parameters were learned by recipe() in the initial step (usually from the training data), it should not re-estimate the parameters from the new data but use the 'learned' parameters. I think that was the design philosophy and the reason recipe() requires a data specification in the first place: to prevent data leakage (which I think is fantastic).

I like that predict() can be used with a workflow and unprocessed data so that it handles everything as specified, and I'm hoping the behavior can be made consistent.

@EmilHvitfeldt
Member

Imagine if we made the following code work as you suggested:

library(tidymodels)
library(tidyverse)

data = iris
set.seed(156)
split = initial_split(data, prop = 0.75)
train = training(split)
test = testing(split)

all_recipe = 
  recipe(Sepal.Width ~ Petal.Length, data = train
  ) %>% 
  step_normalize(everything())

x_recipe = 
  recipe(Sepal.Width ~ Petal.Length, data = train
  ) %>% 
  step_normalize(all_predictors())

model = linear_reg() %>% 
  set_engine('lm')

workflow_all = workflow() %>% 
  add_model(model) %>% 
  add_recipe(all_recipe)

workflow_x = workflow() %>% 
  add_model(model) %>% 
  add_recipe(x_recipe)

fit_all = fit(workflow_all, train)
fit_x = fit(workflow_x, train)

predict(fit_all, test)
#> # A tibble: 38 × 1
#>    .pred
#>    <dbl>
#>  1  3.26
#>  2  3.26
#>  3  3.25
#>  4  3.29
#>  5  3.25
#>  6  3.25
#>  7  3.28
#>  8  3.27
#>  9  3.25
#> 10  3.27
#> # ℹ 28 more rows

predict(fit_x, test)
#> # A tibble: 38 × 1
#>    .pred
#>    <dbl>
#>  1  3.26
#>  2  3.26
#>  3  3.25
#>  4  3.29
#>  5  3.25
#>  6  3.25
#>  7  3.28
#>  8  3.27
#>  9  3.25
#> 10  3.27
#> # ℹ 28 more rows

What should the following code produce?

new_values <- data.frame(
  Petal.Length = c(1, 10, 100)
)

predict(fit_all, new_values)

It doesn't include the outcome, but it includes the predictors, which should be enough. For this code to run, recipes would need to flag that Sepal.Width is missing and omit the transformation of that variable because it is an outcome.

This seems rather cumbersome considering the alternative is the following code:

library(tidymodels)
library(tidyverse)

data = iris
set.seed(156)
split = initial_split(data, prop = 0.75)
train = training(split)
test = testing(split)

# The difference ------------------------------------------
outcome_recipe = 
  recipe(Sepal.Width ~ Petal.Length, data = train
  ) %>% 
  step_normalize(all_outcomes()) %>%
  prep()

train <- bake(outcome_recipe, train)
test <- bake(outcome_recipe, test)
# The difference ------------------------------------------

main_recipe = 
  recipe(Sepal.Width ~ Petal.Length, data = train
  ) %>% 
  step_normalize(all_numeric_predictors())

model = linear_reg() %>% 
  set_engine('lm')

workflow_spec = workflow() %>% 
  add_model(model) %>% 
  add_recipe(main_recipe)

fit_all = fit(workflow_spec, train)

predict(fit_all, test)
#> # A tibble: 38 × 1
#>    .pred
#>    <dbl>
#>  1 0.526
#>  2 0.526
#>  3 0.504
#>  4 0.593
#>  5 0.504
#>  6 0.504
#>  7 0.571
#>  8 0.549
#>  9 0.504
#> 10 0.549
#> # ℹ 28 more rows

Above, you use two recipes: one to transform the outcome, and a second, used in your workflow, for your predictors.
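
If you need the predictions back on the original Sepal.Width scale, you can undo the normalization yourself. A minimal sketch (recomputing the training mean and sd directly, i.e. the same statistics that outcome_recipe learned):

train_raw <- training(split)            # the raw, unbaked training data
sw_mean <- mean(train_raw$Sepal.Width)
sw_sd   <- sd(train_raw$Sepal.Width)

predict(fit_all, test) %>%
  mutate(.pred_original = .pred * sw_sd + sw_mean)   # back to the original scale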

@matthewgson
Author

What should the following code produce?

new_values <- data.frame(
  Petal.Length = c(1, 10, 100)
)
predict(fit_all, new_values)

Let me elaborate on my comment above. To be consistent with the current behavior that does not scale the dependent variable, I expect it would do the following (sketched in code below):

  1. Normalize the new variable based on the training parameters ((x - mean)/sd = z)
  2. Make a prediction based on those z values. For example, if the model intercept = 3 and slope = 1, then 3 + 1*z = m
  3. Convert the normalized prediction back (sd*m + mean), where sd and mean come from the training data, OR output the normalized prediction m

Hence, it should not need to check the new data for the outcome before normalizing. However, even when the new data does include the outcome variable, it throws the same error message, which causes confusion.
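
As a sketch with hypothetical numbers (the means, sds, and coefficients below are made up rather than taken from the fitted model):

x_mean <- 3.7;  x_sd <- 1.8    # hypothetical Petal.Length statistics learned from train
y_mean <- 3.05; y_sd <- 0.44   # hypothetical Sepal.Width statistics learned from train

z <- (new_values$Petal.Length - x_mean) / x_sd   # 1. normalize the new predictor
m <- 3 + 1 * z                                   # 2. predict on the normalized scale
m * y_sd + y_mean                                # 3. back-transform to the original scale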

In my opinion, this is what users would expect to happen when they decide to include the dependent variable in scaling. Preparing two recipes for y and x separately may be a cumbersome workflow, but I appreciate your suggestion.

@EmilHvitfeldt
Member

The way we do things now makes the most sense to me and is the most consistent with how we do things elsewhere. Thanks for offering a different perspective, but the existing approach works best for us.

As for the use of everything(), I would reiterate that it is often not the best way to select variables, as it will often select too much. I have created an issue to improve the documentation: #1259

@matthewgson
Author

matthewgson commented Nov 14, 2023

Well... if I'm not mistaken, there are selectors other than everything() that can select the dependent variable: all_nominal(), all_factors(), all_numeric(), all_integer(), and so forth. I don't think it is safe to assume that using everything() or the others is simply an end-user's mistake. Also, the error message is not particularly helpful.


This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Nov 29, 2023