add chapter on validation and internal tuning #829

Open · wants to merge 29 commits into `main`
1 change: 1 addition & 0 deletions book/_quarto.yml
@@ -44,6 +44,7 @@ book:
- chapters/chapter12/model_interpretation.qmd
- chapters/chapter13/beyond_regression_and_classification.qmd
- chapters/chapter14/algorithmic_fairness.qmd
- chapters/chapter15/predsets_valid_inttune.qmd
- chapters/references.qmd
appendices:
- chapters/appendices/solutions.qmd # online only
195 changes: 195 additions & 0 deletions book/chapters/appendices/solutions.qmd
@@ -2171,4 +2171,199 @@ prediction$score(msr_3, adult_subset)
We can see that among women there is an even bigger discrepancy than among men.

* The bias mitigation strategies we employed do not optimize for the *false omission rate* metric, but other metrics instead. It might therefore be better to try to achieve fairness via other strategies, using different or more powerful models or tuning hyperparameters.

## Solutions to @sec-predsets-valid-inttune

1. Manually `$train()` a LightGBM classifier from `r ref_pkg("mlr3extralearners")` on the pima task using $1/3$ of the training data for validation.
As the pima task has missing values, select a method from `r ref_pkg("mlr3pipelines")` to impute them.
Explicitly set the evaluation metric to logloss (`"binary_logloss"`), the maximum number of boosting iterations to 1000, the patience parameter to 10, and the step size to 0.01.
After training the learner, inspect the final validation scores as well as the early stopped number of iterations.

We start by loading the packages and creating the task.

```{r}
library(mlr3)
library(mlr3extralearners)
library(mlr3pipelines)

tsk_pima = tsk("pima")
tsk_pima
```

Below, we see that the task has five features with missing values.

```{r}
tsk_pima$missings()
```

Next, we create the LightGBM classifier, but don't specify the validation data yet.
We handle the missing values using a simple median imputation.

```{r}
lrn_lgbm = lrn("classif.lightgbm",
  num_iterations = 1000,
  early_stopping_rounds = 10,
  learning_rate = 0.01,
  eval = "binary_logloss"
)

glrn = as_learner(po("imputemedian") %>>% lrn_lgbm)
glrn$id = "lgbm"
```

After constructing the graphlearner, we now configure the validation data using `r ref("set_validate()")`.
The call below sets the `$validate` field of the LightGBM pipeop to `"predefined"` and of the graphlearner to `0.3`.
Recall that only the graphlearner itself can specify *how* the validation data is generated.
The individual pipeops can either use it (`"predefined"`) or not (`NULL`).

```{r}
set_validate(glrn, validate = 0.3, ids = "classif.lightgbm")
glrn$validate
glrn$graph$pipeops$classif.lightgbm$validate
```

Finally, we train the learner and inspect the validation scores and internally tuned parameters.

```{r}
glrn$train(tsk_pima)

glrn$internal_tuned_values
glrn$internal_valid_scores
```

2. Wrap the learner from exercise 1) in an `AutoTuner` using a three-fold CV for the tuning.
Also change the rule for aggregating the different boosting iterations from averaging to taking the maximum across the folds.
Don't tune any parameters other than the number of boosting iterations (`num_iterations`), which can be done using `tnr("internal")`.
Use the internal validation metric as the tuning measure.
Compare this learner with a `lrn("classif.rpart")` using a 10-fold outer cross-validation with respect to classification accuracy.

We start by setting the number of boosting iterations to an internal tune token with an upper bound of 1000 and the maximum as the aggregation function.
Note that the input to the aggregation function is a list of integer values (the early stopped values for the different resampling iterations), so we need to `unlist()` it first before taking the maximum.

```{r}
library(mlr3tuning)

glrn$param_set$set_values(
  classif.lightgbm.num_iterations = to_tune(
    upper = 1000, internal = TRUE, aggr = function(x) max(unlist(x))
  )
)
```

Now, we change the validation data from `0.3` to `"test"`. Here, we can omit the `ids` argument, as LightGBM is the base learner of the graph.

```{r}
set_validate(glrn, validate = "test")
```

Next, we create the autotuner using the configuration given in the instructions.
As the internal validation measures are calculated by `lightgbm` and not `mlr3`, we need to specify whether the metric should be minimized.

```{r}
at_lgbm = auto_tuner(
  learner = glrn,
  tuner = tnr("internal"),
  resampling = rsmp("cv", folds = 3),
  measure = msr("internal_valid_score",
    select = "classif.lightgbm.binary_logloss", minimize = TRUE)
)
at_lgbm$id = "at_lgbm"
```

Finally, we set up the benchmark design, run it, and evaluate the learners in terms of their classification accuracy.

```{r}
design = benchmark_grid(
  task = tsk_pima,
  learners = list(at_lgbm, lrn("classif.rpart")),
  resamplings = rsmp("cv", folds = 10)
)

bmr = benchmark(design)

bmr$aggregate(msr("classif.acc"))
```

3. Consider the code below:

```{r}
branch_lrn = as_learner(
  ppl("branch", list(
    lrn("classif.ranger"),
    lrn("classif.xgboost",
      early_stopping_rounds = 10,
      eval_metric = "error",
      eta = to_tune(0.001, 0.1, logscale = TRUE),
      nrounds = to_tune(upper = 1000, internal = TRUE)))))

set_validate(branch_lrn, validate = "test", ids = "classif.xgboost")
branch_lrn$param_set$set_values(branch.selection = to_tune())

at = auto_tuner(
  tuner = tnr("grid_search"),
  learner = branch_lrn,
  resampling = rsmp("holdout", ratio = 0.8),
  # cannot use internal validation score because ranger does not have one
  measure = msr("classif.ce"),
  term_evals = 10L,
  store_models = TRUE
)

tsk_sonar = tsk("sonar")$filter(1:100)

rr = resample(
  tsk_sonar, at, rsmp("holdout", ratio = 0.8), store_models = TRUE
)
```

Answer the following questions (ideally without running the code):

3.1 During the hyperparameter optimization, how many observations are used to train the XGBoost algorithm (excluding validation data) and how many for the random forest?
Hint: learners that cannot make use of validation data ignore it.

The outer resampling already removes 20 observations from the data (the outer test set), leaving only 80 data points (the outer train set) for the inner resampling.
Then 16 (0.2 * 80; the test set of the inner holdout resampling) observations are used to evaluate the hyperparameter configurations.
This leaves 64 (80 - 16) observations for training.
For XGBoost, the 16 observations that make up the inner test set are also used for validation, so no more observations from the 64 training points are removed.
Because the random forest does not support validation, the 16 observations from the inner test set will only be used for evaluating the hyperparameter configurations, but not simultaneously for internal validation.
Therefore, both the random forest and XGBoost models use 64 observations for training.
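
To make these numbers explicit, here is a small sketch; it is just plain arithmetic on the resampling ratios from the code above, not something computed by `mlr3`:

```{r}
# Split sizes implied by the two holdout resamplings (ratio = 0.8 each)
n_total = 100                                 # tsk("sonar")$filter(1:100)
n_outer_train = 0.8 * n_total                 # 80 observations enter the AutoTuner
n_inner_test = 0.2 * n_outer_train            # 16 observations evaluate each configuration
n_inner_train = n_outer_train - n_inner_test  # 64 observations train both learners
c(outer_train = n_outer_train, inner_test = n_inner_test,
  inner_train = n_inner_train)
```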

3.2 How many observations would be used to train the final model if XGBoost was selected? What if the random forest was chosen?

In both cases, all 80 observations (the train set from the outer resampling) would be used.
This is because during the final model fit no validation data is generated.

3.3 How would the answers to the last two questions change if we had set the `$validate` field of the graphlearner to `0.25` instead of `"test"`?

In this case, the validation data is no longer identical to the inner resampling test set.
Instead, it is split from the 64 observations that make up the inner training set.
Because this happens before the task enters the graphlearner, both the XGBoost model *and* the random forest only have access to 48 ((1 - 0.25) * 64) observations, and the remaining 16 are used to create the validation data.
Note that the random forest will again ignore the validation data as it does not have the 'validation' property and therefore cannot use it.
Also, the step size (tuned on the inner test set) and the number of boosting iterations (tuned internally on the validation data) would now be tuned on different sets, although both coincidentally have size 16.
Therefore, the answer to question 3.1 would be 48 instead of 64.
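
As a sketch (this configuration is hypothetical and not part of the exercise code), the variant described in the question and the resulting split sizes would look as follows:

```{r}
# Hypothetical variant: the graphlearner itself splits off 25% of its
# training data for validation; the XGBoost pipeop keeps using whatever
# validation data the graphlearner provides ("predefined").
set_validate(branch_lrn, validate = 0.25, ids = "classif.xgboost")
branch_lrn$validate

# Resulting split sizes during tuning:
n_inner_train = 64              # inner training set, see question 3.1
n_valid = 0.25 * n_inner_train  # 16 observations become validation data
n_inner_train - n_valid         # 48 observations remain for training
```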

However, this does not change the answer to 3.2, as, again, no validation is performed during the final model fit.

Note that we would normally recommend setting the validation data to `"test"` when tuning, so this should be thought of as an illustrative example.


4. Look at the (failing) code below:

```{r, error = TRUE}
tsk_sonar = tsk("sonar")
glrn = as_learner(
  po("pca") %>>% lrn("classif.xgboost", validate = 0.3)
)
```

Can you explain *why* the code fails?
Hint: Should the data that xgboost uses for validation be preprocessed according to the *train* or *predict* logic?

If we set the `$validate` field of the XGBoost classifier to `0.3`, the validation data would be generated from the output task of `PipeOpPCA`.
However, this task has been exclusively preprocessed using the train logic, because `PipeOpPCA` does not 'know' that the XGBoost classifier wants to do validation.
Because validation performance is intended to measure how well a model would perform during prediction, the validation data should be preprocessed according to the predict logic.
For this reason, splitting off 30% of the output of `PipeOpPCA` to use as validation data in the XGBoost classifier would be invalid.
Therefore, it is not possible to set the `$validate` field of a `PipeOp` to values other than `"predefined"` or `NULL`.
Only the `GraphLearner` itself can dictate *how* the validation data is created *before* it enters the `Graph`, so the validation data is then preprocessed according to the predict logic.
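
As a minimal sketch (mirroring the `set_validate()` pattern from the first exercise and assuming the same graph), the working alternative is to leave the base learner untouched and configure validation on the graphlearner itself:

```{r}
# The graphlearner controls how the validation data is created *before* it
# enters the graph; the XGBoost pipeop is merely told to use it ("predefined").
glrn = as_learner(po("pca") %>>% lrn("classif.xgboost"))
set_validate(glrn, validate = 0.3, ids = "classif.xgboost")
glrn$validate
glrn$graph$pipeops$classif.xgboost$validate
```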

:::
2 changes: 2 additions & 0 deletions book/chapters/chapter1/introduction_and_overview.qmd
@@ -27,6 +27,8 @@ Before we can show you the full power of `mlr3`, we recommend installing the `r
install.packages("mlr3verse")
```

Chapters that were added after the release of the printed version of this book are marked with a '+'.

## Installation Guidelines {#installguide}

There are many packages in the `mlr3` ecosystem that you may want to use as you work through this book.
2 changes: 1 addition & 1 deletion book/chapters/chapter12/model_interpretation.qmd
@@ -239,7 +239,7 @@ To illustrate this, we will select a random data point to explain.
As we are dealing with people, we will name our observation "Charlie" and first look at the black box predictions:

```{r Charlie, asis='results'}
- Charlie = credit_x[35, ]
+ Charlie = tsk_german$data(rows = 127L, cols = tsk_german$feature_names)
gbm_predict = predictor$predict(Charlie)
gbm_predict
```