Commit

modif notebook
Aurelien Massiot committed Sep 20, 2023
1 parent 13462cb commit c598a68
Showing 1 changed file with 138 additions and 24 deletions.
162 changes: 138 additions & 24 deletions notebook/titanic.ipynb
@@ -145,26 +145,48 @@
{
"cell_type": "code",
"execution_count": null,
-"metadata": {},
+"metadata": {
+"collapsed": false,
+"jupyter": {
+"outputs_hidden": false
+},
+"pycharm": {
+"name": "#%%\n"
+}
+},
"outputs": [],
"source": [
"import sys\n",
-"sys.path.append(\"../src/\")"
+"sys.path.append(\"../\")"
]
},
{
"cell_type": "code",
"execution_count": null,
-"metadata": {},
+"metadata": {
+"collapsed": false,
+"jupyter": {
+"outputs_hidden": false
+},
+"pycharm": {
+"name": "#%%\n"
+}
+},
"outputs": [],
"source": [
-"from feature_engineering import *"
+"from src.feature_engineering import *"
]
},
{
"cell_type": "markdown",
"metadata": {
-"_cell_guid": "8d1efbbb-e0dd-f3d4-acd3-ef9eb5c396e2"
+"collapsed": false,
+"jupyter": {
+"outputs_hidden": false
+},
+"pycharm": {
+"name": "#%% md\n"
+}
},
"source": [
"Having built our helper functions, we can now execute them in order to build our dataset that will be used in the model:"
@@ -173,7 +195,15 @@
{
"cell_type": "code",
"execution_count": null,
-"metadata": {},
+"metadata": {
+"collapsed": false,
+"jupyter": {
+"outputs_hidden": false
+},
+"pycharm": {
+"name": "#%%\n"
+}
+},
"outputs": [],
"source": [
"drop_columns = ['Name', 'SibSp', 'Parch', 'Cabin', 'Ticket', 'Ticket_Letter', 'Pclass', 'Sex', 'Embarked',\n",
@@ -186,7 +216,13 @@
{
"cell_type": "markdown",
"metadata": {
-"_cell_guid": "35d843bc-3607-f11a-55f9-2828cf5eb91e"
+"collapsed": false,
+"jupyter": {
+"outputs_hidden": false
+},
+"pycharm": {
+"name": "#%% md\n"
+}
},
"source": [
"We can see that our final dataset has 55 columns, composed of our target column and 54 predictor variables. Although highly dimensional datasets can result in high variance, I think we should be fine here. "
@@ -196,7 +232,13 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
-"_cell_guid": "09391302-b621-4730-7589-7eb017286e7f"
+"collapsed": false,
+"jupyter": {
+"outputs_hidden": false
+},
+"pycharm": {
+"name": "#%%\n"
+}
},
"outputs": [],
"source": [
@@ -206,7 +248,13 @@
{
"cell_type": "markdown",
"metadata": {
-"_cell_guid": "1066e65e-e578-e896-5c38-1457a947ec6f"
+"collapsed": false,
+"jupyter": {
+"outputs_hidden": false
+},
+"pycharm": {
+"name": "#%% md\n"
+}
},
"source": [
"### Hyperparameter Tuning"
@@ -215,7 +263,13 @@
{
"cell_type": "markdown",
"metadata": {
-"_cell_guid": "32b4e910-cbe5-04c6-4383-c6f02483e595"
+"collapsed": false,
+"jupyter": {
+"outputs_hidden": false
+},
+"pycharm": {
+"name": "#%% md\n"
+}
},
"source": [
"We will use grid search to identify the optimal parameters of our random forest model. Because our training dataset is quite small, we can get away with testing a wider range of hyperparameter values. When I ran this on my 8 GB Windows machine, the process took less than ten minutes. I will not run it here for the sake of saving myself time, but I will discuss the results of this grid search."
@@ -224,7 +278,13 @@
{
"cell_type": "markdown",
"metadata": {
-"_cell_guid": "7f6c54fa-033e-075f-0e86-c9c0b469a03b"
+"collapsed": false,
+"jupyter": {
+"outputs_hidden": false
+},
+"pycharm": {
+"name": "#%% md\n"
+}
},
"source": [
"from sklearn.model_selection import GridSearchCV \n",
@@ -256,7 +316,13 @@
{
"cell_type": "markdown",
"metadata": {
-"_cell_guid": "11038f38-44d4-0cbd-328b-1ad7196819fe"
+"collapsed": false,
+"jupyter": {
+"outputs_hidden": false
+},
+"pycharm": {
+"name": "#%% md\n"
+}
},
"source": [
"Looking at the results of the grid search: \n",
@@ -270,7 +336,13 @@
{
"cell_type": "markdown",
"metadata": {
-"_cell_guid": "c95a0ff7-0f68-7e28-0c9c-36a52808f578"
+"collapsed": false,
+"jupyter": {
+"outputs_hidden": false
+},
+"pycharm": {
+"name": "#%% md\n"
+}
},
"source": [
"### Model Estimation and Evaluation<a name=\"model\"></a>"
@@ -279,7 +351,13 @@
{
"cell_type": "markdown",
"metadata": {
-"_cell_guid": "e494ad2b-92e3-782f-13c1-f53a86602298"
+"collapsed": false,
+"jupyter": {
+"outputs_hidden": false
+},
+"pycharm": {
+"name": "#%% md\n"
+}
},
"source": [
"We are now ready to fit our model using the optimal hyperparameters. The out-of-bag score can give us an unbiased estimate of the model accuracy, and we can see that the score is 83.73%, which is only a little higher than our final leaderboard score."
@@ -289,7 +367,13 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
-"_cell_guid": "5593980a-4145-9594-299c-f4d1a9f01970"
+"collapsed": false,
+"jupyter": {
+"outputs_hidden": false
+},
+"pycharm": {
+"name": "#%%\n"
+}
},
"outputs": [],
"source": [
@@ -310,7 +394,13 @@
{
"cell_type": "markdown",
"metadata": {
-"_cell_guid": "4b44766d-6974-b7f3-b801-eb5d9423ae49"
+"collapsed": false,
+"jupyter": {
+"outputs_hidden": false
+},
+"pycharm": {
+"name": "#%% md\n"
+}
},
"source": [
"Let's take a brief look at our variable importance according to our random forest model. We can see that some of the original columns we predicted would be important in fact were, including gender, fare, and age. But we also see title, name length, and ticket length feature prominently, so we can pat ourselves on the back for creating such useful variables."
@@ -320,7 +410,13 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
-"_cell_guid": "d77e221b-352d-8669-05d9-f7defce05709"
+"collapsed": false,
+"jupyter": {
+"outputs_hidden": false
+},
+"pycharm": {
+"name": "#%%\n"
+}
},
"outputs": [],
"source": [
@@ -332,7 +428,13 @@
{
"cell_type": "markdown",
"metadata": {
-"_cell_guid": "f4fbf72d-a7b6-1d14-73cb-7f763d291272"
+"collapsed": false,
+"jupyter": {
+"outputs_hidden": false
+},
+"pycharm": {
+"name": "#%% md\n"
+}
},
"source": [
"Our last step is to predict the target variable for our test data and generate an output file that will be submitted to Kaggle. "
@@ -342,7 +444,13 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
-"_cell_guid": "14dc0e66-9fc4-86bf-8927-46d366d4bbcf"
+"collapsed": false,
+"jupyter": {
+"outputs_hidden": false
+},
+"pycharm": {
+"name": "#%%\n"
+}
},
"outputs": [],
"source": [
@@ -356,7 +464,13 @@
{
"cell_type": "markdown",
"metadata": {
-"_cell_guid": "ae3dcb9e-8e70-956a-a0eb-a5aa1f188a99"
+"collapsed": false,
+"jupyter": {
+"outputs_hidden": false
+},
+"pycharm": {
+"name": "#%% md\n"
+}
},
"source": [
"## Conclusion\n",
@@ -371,9 +485,9 @@
"_change_revision": 0,
"_is_fork": false,
"kernelspec": {
-"display_name": "Python 3 (ipykernel)",
+"display_name": "PythonIndus",
"language": "python",
-"name": "python3"
+"name": "pythonindus"
},
"language_info": {
"codemirror_mode": {
@@ -385,9 +499,9 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
-"version": "3.11.3"
+"version": "3.10.0"
}
},
"nbformat": 4,
-"nbformat_minor": 1
+"nbformat_minor": 4
}
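
The two import cells in this diff stop putting `../src/` on `sys.path` and importing `feature_engineering` directly; they now put `../` on the path and import `src.feature_engineering`. Below is a minimal sketch of why the second form resolves, assuming the notebook kernel runs from `notebook/` and that `src/` sits beside it. The temporary directory and the `name_length` helper are hypothetical stand-ins for illustration, not the repository's code.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical project layout, built in a temp directory for illustration:
#   <root>/src/feature_engineering.py
#   <root>/notebook/            (where titanic.ipynb would live)
root = Path(tempfile.mkdtemp())
(root / "src").mkdir()
(root / "notebook").mkdir()
(root / "src" / "feature_engineering.py").write_text(
    "def name_length(name):\n"
    "    # Placeholder helper, not taken from the real module.\n"
    "    return len(name)\n"
)

# Stand-in for sys.path.append("../") executed from <root>/notebook/:
# the project root goes on the search path, so `src` is importable
# as a package and its modules are reached as `src.<module>`.
sys.path.append(str(root))
importlib.invalidate_caches()

feature_engineering = importlib.import_module("src.feature_engineering")
result = feature_engineering.name_length("Braund, Mr. Owen Harris")
print(result)
```

In this sketch `src` has no `__init__.py`, so Python treats it as a namespace package; whether the real repository ships one is not visible in this diff. A relative entry such as `../` is resolved against the kernel's current working directory, so the import depends on where the notebook is launched from.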
