## 1.5 Model Selection Process

Slides

Notes

Which model to choose?

  • Logistic regression
  • Decision tree
  • Neural Network
  • Or many others

The validation dataset is not used in training. Both the training and validation datasets have their own feature matrix and target vector y. The model is fitted on the training data and then used to predict the target values for the validation feature matrix. Finally, the predicted values (probabilities) are compared with the actual y values of the validation set.
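This train-then-validate loop can be sketched with scikit-learn. The synthetic dataset, the choice of logistic regression, and AUC as the comparison metric are illustrative assumptions, not part of the notes:

```python
# Sketch: fit on the training set only, then score predictions on the validation set.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the real feature matrix X and target vector y
X, y = make_classification(n_samples=1000, n_features=10, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=1)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)                  # fitted with training data only
y_pred = model.predict_proba(X_val)[:, 1]    # predicted probabilities for the validation set
auc = roc_auc_score(y_val, y_pred)           # compare predictions with the actual y values
print(round(auc, 3))
```

Any metric that compares predicted probabilities with actual targets (AUC, accuracy, log loss) works at this step.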

Multiple comparisons problem (MCP): when many models are compared on the same validation set, one of them can look good just by chance, because all of them are probabilistic.

The test set helps to avoid the MCP. The best model is selected using the training and validation datasets, while the test dataset is used to confirm that the proposed best model really is the best.

  1. Split the dataset into training, validation, and test sets, e.g. 60%, 20%, and 20% respectively
  2. Train the models
  3. Evaluate the models
  4. Select the best model
  5. Apply the best model to the test dataset
  6. Compare the performance metrics of validation and test
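The six steps above can be sketched as follows; the two candidate models, the synthetic data, and AUC as the metric are illustrative assumptions:

```python
# Sketch of the full selection process on a 60% / 20% / 20% split.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=1)

# 1. Split: first carve off 40%, then split that half into validation and test
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=1)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=1)

# 2-3. Train the candidate models and evaluate each on the validation set
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=3, random_state=1),
}
val_scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    val_scores[name] = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])

# 4. Select the best model by validation score
best_name = max(val_scores, key=val_scores.get)

# 5-6. Apply the best model to the test set and compare the two metrics
test_auc = roc_auc_score(y_test, models[best_name].predict_proba(X_test)[:, 1])
print(best_name, round(val_scores[best_name], 3), round(test_auc, 3))
```

If the validation and test metrics are close, the selected model likely generalizes; a large gap suggests the validation score was partly luck.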

NB: it is possible to reuse the validation data. After selecting the best model (step 4), the validation and training datasets can be combined into a single, larger training dataset for the chosen model before it is evaluated on the test set.
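This retraining step can be sketched as below; the logistic regression model and the synthetic data are assumptions carried over for illustration:

```python
# Sketch: refit the chosen model on train + validation combined, then test once.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=1)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=1)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=1)

# Combine training and validation into one full training set (80% of the data)
X_full = np.concatenate([X_train, X_val])
y_full = np.concatenate([y_train, y_val])

final_model = LogisticRegression(max_iter=1000).fit(X_full, y_full)
final_auc = roc_auc_score(y_test, final_model.predict_proba(X_test)[:, 1])
print(round(final_auc, 3))
```

The test set is still touched only once, at the very end, so it remains an honest estimate of performance.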

⚠️ The notes are written by the community.
If you see an error here, please create a PR with a fix.
