diff --git a/notebook/titanic.ipynb b/notebook/titanic.ipynb
index 023e55c..00dac57 100644
--- a/notebook/titanic.ipynb
+++ b/notebook/titanic.ipynb
@@ -145,26 +145,48 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "collapsed": false,
+    "jupyter": {
+     "outputs_hidden": false
+    },
+    "pycharm": {
+     "name": "#%%\n"
+    }
+   },
    "outputs": [],
    "source": [
     "import sys\n",
-    "sys.path.append(\"../src/\")"
+    "sys.path.append(\"../\")"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "collapsed": false,
+    "jupyter": {
+     "outputs_hidden": false
+    },
+    "pycharm": {
+     "name": "#%%\n"
+    }
+   },
    "outputs": [],
    "source": [
-    "from feature_engineering import *"
+    "from src.feature_engineering import *"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {
-    "_cell_guid": "8d1efbbb-e0dd-f3d4-acd3-ef9eb5c396e2"
+    "collapsed": false,
+    "jupyter": {
+     "outputs_hidden": false
+    },
+    "pycharm": {
+     "name": "#%% md\n"
+    }
    },
    "source": [
     "Having built our helper functions, we can now execute them in order to build our dataset that will be used in the model:"
@@ -173,7 +195,15 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "collapsed": false,
+    "jupyter": {
+     "outputs_hidden": false
+    },
+    "pycharm": {
+     "name": "#%%\n"
+    }
+   },
    "outputs": [],
    "source": [
     "drop_columns = ['Name', 'SibSp', 'Parch', 'Cabin', 'Ticket', 'Ticket_Letter', 'Pclass', 'Sex', 'Embarked',\n",
@@ -186,7 +216,13 @@
   {
    "cell_type": "markdown",
    "metadata": {
-    "_cell_guid": "35d843bc-3607-f11a-55f9-2828cf5eb91e"
+    "collapsed": false,
+    "jupyter": {
+     "outputs_hidden": false
+    },
+    "pycharm": {
+     "name": "#%% md\n"
+    }
    },
    "source": [
     "We can see that our final dataset has 55 columns, composed of our target column and 54 predictor variables. Although high-dimensional datasets can result in high variance, I think we should be fine here."
@@ -196,7 +232,13 @@
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
-    "_cell_guid": "09391302-b621-4730-7589-7eb017286e7f"
+    "collapsed": false,
+    "jupyter": {
+     "outputs_hidden": false
+    },
+    "pycharm": {
+     "name": "#%%\n"
+    }
    },
    "outputs": [],
    "source": [
@@ -206,7 +248,13 @@
   {
    "cell_type": "markdown",
    "metadata": {
-    "_cell_guid": "1066e65e-e578-e896-5c38-1457a947ec6f"
+    "collapsed": false,
+    "jupyter": {
+     "outputs_hidden": false
+    },
+    "pycharm": {
+     "name": "#%% md\n"
+    }
    },
    "source": [
     "### Hyperparameter Tuning"
@@ -215,7 +263,13 @@
   {
    "cell_type": "markdown",
    "metadata": {
-    "_cell_guid": "32b4e910-cbe5-04c6-4383-c6f02483e595"
+    "collapsed": false,
+    "jupyter": {
+     "outputs_hidden": false
+    },
+    "pycharm": {
+     "name": "#%% md\n"
+    }
    },
    "source": [
     "We will use grid search to identify the optimal parameters of our random forest model. Because our training dataset is quite small, we can get away with testing a wider range of hyperparameter values. When I ran this on my 8 GB Windows machine, the process took less than ten minutes. I will not run it here for the sake of saving myself time, but I will discuss the results of this grid search."
@@ -224,7 +278,13 @@
   {
    "cell_type": "markdown",
    "metadata": {
-    "_cell_guid": "7f6c54fa-033e-075f-0e86-c9c0b469a03b"
+    "collapsed": false,
+    "jupyter": {
+     "outputs_hidden": false
+    },
+    "pycharm": {
+     "name": "#%% md\n"
+    }
    },
    "source": [
     "from sklearn.model_selection import GridSearchCV \n",
@@ -256,7 +316,13 @@
   {
    "cell_type": "markdown",
    "metadata": {
-    "_cell_guid": "11038f38-44d4-0cbd-328b-1ad7196819fe"
+    "collapsed": false,
+    "jupyter": {
+     "outputs_hidden": false
+    },
+    "pycharm": {
+     "name": "#%% md\n"
+    }
    },
    "source": [
     "Looking at the results of the grid search: \n",
@@ -270,7 +336,13 @@
   {
    "cell_type": "markdown",
    "metadata": {
-    "_cell_guid": "c95a0ff7-0f68-7e28-0c9c-36a52808f578"
+    "collapsed": false,
+    "jupyter": {
+     "outputs_hidden": false
+    },
+    "pycharm": {
+     "name": "#%% md\n"
+    }
    },
    "source": [
     "### Model Estimation and Evaluation"
@@ -279,7 +351,13 @@
   {
    "cell_type": "markdown",
    "metadata": {
-    "_cell_guid": "e494ad2b-92e3-782f-13c1-f53a86602298"
+    "collapsed": false,
+    "jupyter": {
+     "outputs_hidden": false
+    },
+    "pycharm": {
+     "name": "#%% md\n"
+    }
    },
    "source": [
     "We are now ready to fit our model using the optimal hyperparameters. The out-of-bag score can give us an unbiased estimate of the model accuracy, and we can see that the score is 83.73%, which is only a little higher than our final leaderboard score."
@@ -289,7 +367,13 @@
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
-    "_cell_guid": "5593980a-4145-9594-299c-f4d1a9f01970"
+    "collapsed": false,
+    "jupyter": {
+     "outputs_hidden": false
+    },
+    "pycharm": {
+     "name": "#%%\n"
+    }
    },
    "outputs": [],
    "source": [
@@ -310,7 +394,13 @@
   {
    "cell_type": "markdown",
    "metadata": {
-    "_cell_guid": "4b44766d-6974-b7f3-b801-eb5d9423ae49"
+    "collapsed": false,
+    "jupyter": {
+     "outputs_hidden": false
+    },
+    "pycharm": {
+     "name": "#%% md\n"
+    }
    },
    "source": [
     "Let's take a brief look at our variable importance according to our random forest model. We can see that some of the original columns we predicted would be important in fact were, including gender, fare, and age. But we also see title, name length, and ticket length feature prominently, so we can pat ourselves on the back for creating such useful variables."
@@ -320,7 +410,13 @@
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
-    "_cell_guid": "d77e221b-352d-8669-05d9-f7defce05709"
+    "collapsed": false,
+    "jupyter": {
+     "outputs_hidden": false
+    },
+    "pycharm": {
+     "name": "#%%\n"
+    }
    },
    "outputs": [],
    "source": [
@@ -332,7 +428,13 @@
   {
    "cell_type": "markdown",
    "metadata": {
-    "_cell_guid": "f4fbf72d-a7b6-1d14-73cb-7f763d291272"
+    "collapsed": false,
+    "jupyter": {
+     "outputs_hidden": false
+    },
+    "pycharm": {
+     "name": "#%% md\n"
+    }
    },
    "source": [
     "Our last step is to predict the target variable for our test data and generate an output file that will be submitted to Kaggle."
" @@ -342,7 +444,13 @@ "cell_type": "code", "execution_count": null, "metadata": { - "_cell_guid": "14dc0e66-9fc4-86bf-8927-46d366d4bbcf" + "collapsed": false, + "jupyter": { + "outputs_hidden": false + }, + "pycharm": { + "name": "#%%\n" + } }, "outputs": [], "source": [ @@ -356,7 +464,13 @@ { "cell_type": "markdown", "metadata": { - "_cell_guid": "ae3dcb9e-8e70-956a-a0eb-a5aa1f188a99" + "collapsed": false, + "jupyter": { + "outputs_hidden": false + }, + "pycharm": { + "name": "#%% md\n" + } }, "source": [ "## Conclusion\n", @@ -371,9 +485,9 @@ "_change_revision": 0, "_is_fork": false, "kernelspec": { - "display_name": "Python 3 (ipykernel)", + "display_name": "PythonIndus", "language": "python", - "name": "python3" + "name": "pythonindus" }, "language_info": { "codemirror_mode": { @@ -385,9 +499,9 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.11.3" + "version": "3.10.0" } }, "nbformat": 4, - "nbformat_minor": 1 + "nbformat_minor": 4 }