prediction fails when all variables are normalized #1256
Hello @matthewgson! This is happening because of the way {recipes} interacts with {workflows}.

When {workflows} sends data to {recipes}, it only sends the data that is labeled as "predictor" according to {recipes}. This means that only predictor variables are available at prediction time. This is all expected behavior from {recipes} and {workflows}, but it can be a bit surprising the first time you see it. So my general advice is to use the recipes selection functions whenever possible; see the selections documentation for more information. The other tidyselect selectors are great too, but avoid ones that can select the outcome.
I don't think this is the intended behavior. The documentation on the recipe includes tidyselect functions, including everything(). At the least, this suggests the approach was intended to be supported in preprocessing, and it actually works without throwing any error when called with prep() and bake(). It would be disappointing if this were true, because normalizing the dependent variable is also a common practice in NNs for stable convergence.
There is a difference between {recipes} on its own and {recipes} used within tidymodels and {workflows}. {recipes} is a general preprocessing and feature engineering package, so it will allow you to do many things, including modifying the outcome. However, we don't allow that behavior in combination with {workflows}, as it is often a mistake. When you predict on new data, the outcome variable isn't included in the data set (hence why you are predicting), so a transformation that tries to touch the outcome should fail. In our opinion, allowing this type of behavior for the few cases where people need it does not outweigh all the times where it would be a modeling mistake. Hence, if you want a transformation done on the outcome for use in {tidymodels}, the recommendation is that you do the transformation yourself: https://stackoverflow.com/questions/76158409/log-transform-outcome-variable-in-tidymodels-workflow/76158558#76158558

In the following example we pass in a data set with an outcome, a predictor, and an ID variable.

```r
library(recipes)

my_cars <- mtcars[, c(1, 3)]
my_cars$ID <- 1:32

rec <- recipe(my_cars) |>
  update_role(mpg, new_role = "outcome") |>
  update_role(disp, new_role = "predictor")

rec |>
  step_normalize(everything()) |>
  prep() |>
  bake(new_data = NULL)
#> # A tibble: 32 × 3
#>       mpg    disp     ID
#>     <dbl>   <dbl>  <dbl>
#>  1  0.151 -0.571  -1.65
#>  2  0.151 -0.571  -1.55
#>  3  0.450 -0.990  -1.44
#>  4  0.217  0.220  -1.33
#>  5 -0.231  1.04   -1.23
#>  6 -0.330 -0.0462 -1.12
#>  7 -0.961  1.04   -1.01
#>  8  0.715 -0.678  -0.906
#>  9  0.450 -0.726  -0.800
#> 10 -0.148 -0.509  -0.693
#> # ℹ 22 more rows
```

Note how step_normalize(everything()) also normalized the ID variable. Using the role-based selectors instead leaves ID untouched:

```r
rec |>
  step_normalize(all_outcomes(), all_predictors()) |>
  prep() |>
  bake(new_data = NULL)
#> # A tibble: 32 × 3
#>       mpg    disp    ID
#>     <dbl>   <dbl> <int>
#>  1  0.151 -0.571      1
#>  2  0.151 -0.571      2
#>  3  0.450 -0.990      3
#>  4  0.217  0.220      4
#>  5 -0.231  1.04       5
#>  6 -0.330 -0.0462     6
#>  7 -0.961  1.04       7
#>  8  0.715 -0.678      8
#>  9  0.450 -0.726      9
#> 10 -0.148 -0.509     10
#> # ℹ 22 more rows
```
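(For reference, the "do the outcome transformation yourself" recommendation can be sketched with base R alone, outside of any recipe or workflow. This is a minimal illustration, not code from the thread: the normalization parameters are learned from the training data only, and the same parameters are used to invert the predictions.)

```r
# Sketch: standardize the outcome manually, fit, then invert predictions.
# Uses only base R and mtcars; the split is arbitrary for illustration.
set.seed(156)
idx   <- sample(nrow(mtcars), 24)
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]

# Learn the normalization parameters for the outcome on the training set only
y_mean <- mean(train$mpg)
y_sd   <- sd(train$mpg)

train$mpg_scaled <- (train$mpg - y_mean) / y_sd

fit <- lm(mpg_scaled ~ disp, data = train)

# Predictions come back on the normalized scale ...
pred_scaled <- predict(fit, newdata = test)

# ... and are inverted with the *training* mean and sd
pred <- pred_scaled * y_sd + y_mean
```

Because the mean and sd are stored outside the model, new data never needs to contain the outcome column.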
I see that there are quite a few nuances to learn here, and many other users are struggling at this point.

Regarding this point, I don't see why prediction should fail. At the very least it could return a "normalized" prediction output that I then manually transform back to the original scale. The outcome (or truth) may or may not be included in the new data (it usually is when the new data comes from the test portion of a train/test split), but the parameters for normalization were already learned from the training data, so they could still be applied.
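(For context, the failure being discussed can be reproduced with {recipes} alone, without {workflows}. This is a sketch, not code from the thread, and the exact error message may vary by version: because step_normalize() was trained on the outcome, bake() requires the outcome column to be present in new data.)

```r
library(recipes)

rec <- recipe(mpg ~ disp, data = mtcars) |>
  step_normalize(everything()) |>  # also touches the outcome, mpg
  prep()

# New data, as at prediction time, has no outcome column
new_data <- data.frame(disp = c(100, 200))

# bake() needs mpg to apply the outcome normalization, so this errors
try(bake(rec, new_data = new_data))
```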
Imagine if we made the following code work as you suggested:

```r
library(tidymodels)
library(tidyverse)

data = iris

set.seed(156)
split = initial_split(data, prop = 0.75)
train = training(split)
test = testing(split)

all_recipe =
  recipe(Sepal.Width ~ Petal.Length, data = train) %>%
  step_normalize(everything())

x_recipe =
  recipe(Sepal.Width ~ Petal.Length, data = train) %>%
  step_normalize(all_predictors())

model = linear_reg() %>%
  set_engine('lm')

workflow_all = workflow() %>%
  add_model(model) %>%
  add_recipe(all_recipe)

workflow_x = workflow() %>%
  add_model(model) %>%
  add_recipe(x_recipe)

fit_all = fit(workflow_all, train)
fit_x = fit(workflow_x, train)

predict(fit_all, test)
#> # A tibble: 38 × 1
#>    .pred
#>    <dbl>
#>  1  3.26
#>  2  3.26
#>  3  3.25
#>  4  3.29
#>  5  3.25
#>  6  3.25
#>  7  3.28
#>  8  3.27
#>  9  3.25
#> 10  3.27
#> # ℹ 28 more rows

predict(fit_x, test)
#> # A tibble: 38 × 1
#>    .pred
#>    <dbl>
#>  1  3.26
#>  2  3.26
#>  3  3.25
#>  4  3.29
#>  5  3.25
#>  6  3.25
#>  7  3.28
#>  8  3.27
#>  9  3.25
#> 10  3.27
#> # ℹ 28 more rows
```

What should the following code produce?

```r
new_values <- data.frame(
  Petal.Length = c(1, 10, 100)
)

predict(fit_all, new_values)
```

It doesn't include the outcome, but it includes the predictors, which should be enough. For this code to run, recipes would need some way to flag that the normalization of the outcome should be skipped when applied to new data. This seems rather cumbersome considering the alternative is the following code:

```r
library(tidymodels)
library(tidyverse)

data = iris

set.seed(156)
split = initial_split(data, prop = 0.75)
train = training(split)
test = testing(split)

# The difference ------------------------------------------
outcome_recipe =
  recipe(Sepal.Width ~ Petal.Length, data = train) %>%
  step_normalize(all_outcomes()) %>%
  prep()

train <- bake(outcome_recipe, train)
test <- bake(outcome_recipe, test)
# The difference ------------------------------------------

main_recipe =
  recipe(Sepal.Width ~ Petal.Length, data = train) %>%
  step_normalize(all_numeric_predictors())

model = linear_reg() %>%
  set_engine('lm')

workflow_spec = workflow() %>%
  add_model(model) %>%
  add_recipe(main_recipe)

fit_all = fit(workflow_spec, train)

predict(fit_all, test)
#> # A tibble: 38 × 1
#>    .pred
#>    <dbl>
#>  1 0.526
#>  2 0.526
#>  3 0.504
#>  4 0.593
#>  5 0.504
#>  6 0.504
#>  7 0.571
#>  8 0.549
#>  9 0.504
#> 10 0.549
#> # ℹ 28 more rows
```

Above you use two recipes: one to transform the outcome, and a second one, used in your workflow, for your predictors.
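(If you want the predictions from that two-recipe approach back on the original scale, the training mean and sd can be pulled out of the prepped outcome recipe with tidy(). A sketch continuing the example above; it assumes outcome_recipe, fit_all, and test from that code and is not part of the original thread.)

```r
# The first (and only) step of outcome_recipe is step_normalize();
# tidy() returns its learned statistics as a tibble with columns
# terms, statistic ("mean"/"sd"), and value.
norm_stats <- tidy(outcome_recipe, number = 1)

y_mean <- norm_stats$value[norm_stats$statistic == "mean"]
y_sd   <- norm_stats$value[norm_stats$statistic == "sd"]

preds <- predict(fit_all, test)

# Invert the z-score transformation to put Sepal.Width back on its original scale
preds$.pred_original <- preds$.pred * y_sd + y_mean
```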
```r
new_values <- data.frame(
  Petal.Length = c(1, 10, 100)
)

predict(fit_all, new_values)
```

Let me elaborate on my comment above. To be consistent with the current behavior, which does not scale the dependent variable, I would expect it to normalize the predictors and return predictions, on the normalized outcome scale if the outcome was scaled. Hence it should avoid checking the new data for outcome normalization. However, even when the new data does include the outcome variable, it throws the same error message, which causes confusion. In my opinion, this is what users would expect to happen when they decide to include a dependent variable for scaling. Perhaps preparing two recipes for y and x separately is a cumbersome workflow, but I appreciate your suggestion.
The way we do things now makes the most sense to me and is the most consistent with how we do things elsewhere. Thanks for offering a different perspective, but the existing approach works best for us.
Well, if I'm not mistaken, there are other selectors besides the one mentioned that could choose the dependent variable.
This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.
I'm cross-posting this report to both GitHub repos (tidymodels / recipes). Please feel free to close either one if needed.

The problem

I've been exploring tidymodels and I encountered a problem where predict() fails when ALL of the variables are normalized, including the dependent variable. I do not observe the same error when only the predictor variables are normalized.

Reproducible example