python_ISLR.html

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
<!-- 2020-08-19 Wed 08:09 -->
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>A Python Companion to ISLR</title>
<meta name="generator" content="Org mode" />
<meta name="author" content="Naresh Gurbuxani" />
<style type="text/css">
 <!--/*--><![CDATA[/*><!--*/
  .title  { text-align: center;
             margin-bottom: .2em; }
  .subtitle { text-align: center;
              font-size: medium;
              font-weight: bold;
              margin-top:0; }
  .todo   { font-family: monospace; color: red; }
  .done   { font-family: monospace; color: green; }
  .priority { font-family: monospace; color: orange; }
  .tag    { background-color: #eee; font-family: monospace;
            padding: 2px; font-size: 80%; font-weight: normal; }
  .timestamp { color: #bebebe; }
  .timestamp-kwd { color: #5f9ea0; }
  .org-right  { margin-left: auto; margin-right: 0px;  text-align: right; }
  .org-left   { margin-left: 0px;  margin-right: auto; text-align: left; }
  .org-center { margin-left: auto; margin-right: auto; text-align: center; }
  .underline { text-decoration: underline; }
  #postamble p, #preamble p { font-size: 90%; margin: .2em; }
  p.verse { margin-left: 3%; }
  pre {
    border: 1px solid #ccc;
    box-shadow: 3px 3px 3px #eee;
    padding: 8pt;
    font-family: monospace;
    overflow: auto;
    margin: 1.2em;
  }
  pre.src {
    position: relative;
    overflow: visible;
    padding-top: 1.2em;
  }
  pre.src:before {
    display: none;
    position: absolute;
    background-color: white;
    top: -10px;
    right: 10px;
    padding: 3px;
    border: 1px solid black;
  }
  pre.src:hover:before { display: inline;}
  /* Languages per Org manual */
  pre.src-asymptote:before { content: 'Asymptote'; }
  pre.src-awk:before { content: 'Awk'; }
  pre.src-C:before { content: 'C'; }
  /* pre.src-C++ doesn't work in CSS */
  pre.src-clojure:before { content: 'Clojure'; }
  pre.src-css:before { content: 'CSS'; }
  pre.src-D:before { content: 'D'; }
  pre.src-ditaa:before { content: 'ditaa'; }
  pre.src-dot:before { content: 'Graphviz'; }
  pre.src-calc:before { content: 'Emacs Calc'; }
  pre.src-emacs-lisp:before { content: 'Emacs Lisp'; }
  pre.src-fortran:before { content: 'Fortran'; }
  pre.src-gnuplot:before { content: 'gnuplot'; }
  pre.src-haskell:before { content: 'Haskell'; }
  pre.src-hledger:before { content: 'hledger'; }
  pre.src-java:before { content: 'Java'; }
  pre.src-js:before { content: 'Javascript'; }
  pre.src-latex:before { content: 'LaTeX'; }
  pre.src-ledger:before { content: 'Ledger'; }
  pre.src-lisp:before { content: 'Lisp'; }
  pre.src-lilypond:before { content: 'Lilypond'; }
  pre.src-lua:before { content: 'Lua'; }
  pre.src-matlab:before { content: 'MATLAB'; }
  pre.src-mscgen:before { content: 'Mscgen'; }
  pre.src-ocaml:before { content: 'Objective Caml'; }
  pre.src-octave:before { content: 'Octave'; }
  pre.src-org:before { content: 'Org mode'; }
  pre.src-oz:before { content: 'OZ'; }
  pre.src-plantuml:before { content: 'Plantuml'; }
  pre.src-processing:before { content: 'Processing.js'; }
  pre.src-python:before { content: 'Python'; }
  pre.src-R:before { content: 'R'; }
  pre.src-ruby:before { content: 'Ruby'; }
  pre.src-sass:before { content: 'Sass'; }
  pre.src-scheme:before { content: 'Scheme'; }
  pre.src-screen:before { content: 'Gnu Screen'; }
  pre.src-sed:before { content: 'Sed'; }
  pre.src-sh:before { content: 'shell'; }
  pre.src-sql:before { content: 'SQL'; }
  pre.src-sqlite:before { content: 'SQLite'; }
  /* additional languages in org.el's org-babel-load-languages alist */
  pre.src-forth:before { content: 'Forth'; }
  pre.src-io:before { content: 'IO'; }
  pre.src-J:before { content: 'J'; }
  pre.src-makefile:before { content: 'Makefile'; }
  pre.src-maxima:before { content: 'Maxima'; }
  pre.src-perl:before { content: 'Perl'; }
  pre.src-picolisp:before { content: 'Pico Lisp'; }
  pre.src-scala:before { content: 'Scala'; }
  pre.src-shell:before { content: 'Shell Script'; }
  pre.src-ebnf2ps:before { content: 'ebfn2ps'; }
  /* additional language identifiers per "defun org-babel-execute"
       in ob-*.el */
  pre.src-cpp:before  { content: 'C++'; }
  pre.src-abc:before  { content: 'ABC'; }
  pre.src-coq:before  { content: 'Coq'; }
  pre.src-groovy:before  { content: 'Groovy'; }
  /* additional language identifiers from org-babel-shell-names in
     ob-shell.el: ob-shell is the only babel language using a lambda to put
     the execution function name together. */
  pre.src-bash:before  { content: 'bash'; }
  pre.src-csh:before  { content: 'csh'; }
  pre.src-ash:before  { content: 'ash'; }
  pre.src-dash:before  { content: 'dash'; }
  pre.src-ksh:before  { content: 'ksh'; }
  pre.src-mksh:before  { content: 'mksh'; }
  pre.src-posh:before  { content: 'posh'; }
  /* Additional Emacs modes also supported by the LaTeX listings package */
  pre.src-ada:before { content: 'Ada'; }
  pre.src-asm:before { content: 'Assembler'; }
  pre.src-caml:before { content: 'Caml'; }
  pre.src-delphi:before { content: 'Delphi'; }
  pre.src-html:before { content: 'HTML'; }
  pre.src-idl:before { content: 'IDL'; }
  pre.src-mercury:before { content: 'Mercury'; }
  pre.src-metapost:before { content: 'MetaPost'; }
  pre.src-modula-2:before { content: 'Modula-2'; }
  pre.src-pascal:before { content: 'Pascal'; }
  pre.src-ps:before { content: 'PostScript'; }
  pre.src-prolog:before { content: 'Prolog'; }
  pre.src-simula:before { content: 'Simula'; }
  pre.src-tcl:before { content: 'tcl'; }
  pre.src-tex:before { content: 'TeX'; }
  pre.src-plain-tex:before { content: 'Plain TeX'; }
  pre.src-verilog:before { content: 'Verilog'; }
  pre.src-vhdl:before { content: 'VHDL'; }
  pre.src-xml:before { content: 'XML'; }
  pre.src-nxml:before { content: 'XML'; }
  /* add a generic configuration mode; LaTeX export needs an additional
     (add-to-list 'org-latex-listings-langs '(conf " ")) in .emacs */
  pre.src-conf:before { content: 'Configuration File'; }

  table { border-collapse:collapse; }
  caption.t-above { caption-side: top; }
  caption.t-bottom { caption-side: bottom; }
  td, th { vertical-align:top;  }
  th.org-right  { text-align: center;  }
  th.org-left   { text-align: center;   }
  th.org-center { text-align: center; }
  td.org-right  { text-align: right;  }
  td.org-left   { text-align: left;   }
  td.org-center { text-align: center; }
  dt { font-weight: bold; }
  .footpara { display: inline; }
  .footdef  { margin-bottom: 1em; }
  .figure { padding: 1em; }
  .figure p { text-align: center; }
  .inlinetask {
    padding: 10px;
    border: 2px solid gray;
    margin: 10px;
    background: #ffffcc;
  }
  #org-div-home-and-up
   { text-align: right; font-size: 70%; white-space: nowrap; }
  textarea { overflow-x: auto; }
  .linenr { font-size: smaller }
  .code-highlighted { background-color: #ffff00; }
  .org-info-js_info-navigation { border-style: none; }
  #org-info-js_console-label
    { font-size: 10px; font-weight: bold; white-space: nowrap; }
  .org-info-js_search-highlight
    { background-color: #ffff00; color: #000000; font-weight: bold; }
  .org-svg { width: 90%; }
  /*]]>*/-->
</style>
<script type="text/javascript">
/*
@licstart  The following is the entire license notice for the
JavaScript code in this tag.

Copyright (C) 2012-2019 Free Software Foundation, Inc.

The JavaScript code in this tag is free software: you can
redistribute it and/or modify it under the terms of the GNU
General Public License (GNU GPL) as published by the Free Software
Foundation, either version 3 of the License, or (at your option)
any later version.  The code is distributed WITHOUT ANY WARRANTY;
without even the implied warranty of MERCHANTABILITY or FITNESS
FOR A PARTICULAR PURPOSE.  See the GNU GPL for more details.

As additional permission under GNU GPL version 3 section 7, you
may distribute non-source (e.g., minimized or compacted) forms of
that code without the copy of the GNU GPL normally required by
section 4, provided you include this license notice and a URL
through which recipients can access the Corresponding Source.


@licend  The above is the entire license notice
for the JavaScript code in this tag.
*/
<!--/*--><![CDATA[/*><!--*/
 function CodeHighlightOn(elem, id)
 {
   var target = document.getElementById(id);
   if(null != target) {
     elem.cacheClassElem = elem.className;
     elem.cacheClassTarget = target.className;
     target.className = "code-highlighted";
     elem.className   = "code-highlighted";
   }
 }
 function CodeHighlightOff(elem, id)
 {
   var target = document.getElementById(id);
   if(elem.cacheClassElem)
     elem.className = elem.cacheClassElem;
   if(elem.cacheClassTarget)
     target.className = elem.cacheClassTarget;
 }
/*]]>*///-->
</script>
<script type="text/x-mathjax-config">
    MathJax.Hub.Config({
        displayAlign: "center",
        displayIndent: "0em",

        "HTML-CSS": { scale: 100,
                        linebreaks: { automatic: "false" },
                        webFont: "TeX"
                       },
        SVG: {scale: 100,
              linebreaks: { automatic: "false" },
              font: "TeX"},
        NativeMML: {scale: 100},
        TeX: { equationNumbers: {autoNumber: "AMS"},
               MultLineWidth: "85%",
               TagSide: "right",
               TagIndent: ".8em"
             }
});
</script>
<script type="text/javascript"
        src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.0/MathJax.js?config=TeX-AMS_HTML"></script>
</head>
<body>
<div id="content">
<h1 class="title">A Python Companion to ISLR</h1>
<div id="table-of-contents">
<h2>Table of Contents</h2>
<div id="text-table-of-contents">
<ul>
<li><a href="#org51a3d92">1. Introduction</a></li>
<li><a href="#orge18c8df">2. Statistical Learning</a>
<ul>
<li><a href="#orge487959">2.1. What is Statistical Learning?</a></li>
<li><a href="#orgeb9d610">2.2. Assessing Model Accuracy</a></li>
<li><a href="#org13974b2">2.3. Lab: Introduction to Python</a>
<ul>
<li><a href="#org2037bff">2.3.1. Basic Commands</a></li>
<li><a href="#orgb312b27">2.3.2. Graphics</a></li>
<li><a href="#org5e49b84">2.3.3. Indexing Data</a></li>
<li><a href="#orgd28dc33">2.3.4. Loading Data</a></li>
<li><a href="#org845481c">2.3.5. Additional Graphical and Numerical Summaries</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#org0d58fd3">3. Linear Regression</a>
<ul>
<li><a href="#org0c24ba2">3.1. Simple Linear Regression</a></li>
<li><a href="#org8d955fe">3.2. Multiple Linear Regression</a></li>
<li><a href="#orge002d68">3.3. Other Considerations in the Regression Model</a></li>
<li><a href="#orge5b7046">3.4. The Marketing Plan</a></li>
<li><a href="#orge13beec">3.5. Comparison of Linear Regression with K-Nearest Neighbors</a></li>
<li><a href="#org3834f17">3.6. Lab: Linear Regression</a>
<ul>
<li><a href="#org42774ca">3.6.1. Libraries</a></li>
<li><a href="#orgef16a59">3.6.2. Simple Linear Regression</a></li>
<li><a href="#org3f68780">3.6.3. Multiple Linear Regression</a></li>
<li><a href="#org3477c07">3.6.4. Interaction Terms</a></li>
<li><a href="#org05f453d">3.6.5. Non-linear Transformations of the Predictors</a></li>
<li><a href="#orgd0f74b1">3.6.6. Qualitative Predictors</a></li>
<li><a href="#orgd293012">3.6.7. Calling <code>R</code> from <code>Python</code></a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#org1fea058">4. Classification</a>
<ul>
<li><a href="#org8c46dbf">4.1. An Overview of Classification</a></li>
<li><a href="#org9bda174">4.2. Why Not Linear Regression?</a></li>
<li><a href="#orgda2ee54">4.3. Logistic Regression</a></li>
<li><a href="#orga747611">4.4. Linear Discriminant Analysis</a></li>
<li><a href="#orgf4eb929">4.5. A Comparison of Classification Methods</a></li>
<li><a href="#org1b0d56b">4.6. Lab: Logistic Regression, LDA, QDA, and KNN</a>
<ul>
<li><a href="#org9de03ac">4.6.1. The Stock Market Data</a></li>
<li><a href="#org71ef131">4.6.2. Logistc Regression</a></li>
<li><a href="#orgbcfd3bf">4.6.3. Linear Discriminant Analysis</a></li>
<li><a href="#org1d9b702">4.6.4. Quadratic Discriminant Analysis</a></li>
<li><a href="#orge5a8491">4.6.5. K-Nearest Neightbors</a></li>
<li><a href="#org790d805">4.6.6. An Application to Caravan Insurance Data</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#org6c34bb7">5. Resampling Methods</a>
<ul>
<li><a href="#org3d5343e">5.1. Cross-Validation</a></li>
<li><a href="#org4fb86ce">5.2. The Bootstrap</a></li>
<li><a href="#org0379a7d">5.3. Lab: Cross-Validation and the Bootstrap</a>
<ul>
<li><a href="#org8dcc4da">5.3.1. The Validation Set Approach</a></li>
<li><a href="#org3c66e7c">5.3.2. Leave-One-Out Cross-Validation</a></li>
<li><a href="#orgafb980e">5.3.3. k-Fold Cross-Validation</a></li>
<li><a href="#orge2c382c">5.3.4. The Bootstrap</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#org104cb2b">6. Linear Model Selection and Regularization</a>
<ul>
<li><a href="#org28de162">6.1. Subset Selection</a></li>
<li><a href="#orgc55df27">6.2. Shrinkage Methods</a></li>
<li><a href="#org2c9ebf8">6.3. Dimension Reduction Methods</a></li>
<li><a href="#org27036ec">6.4. Considerations in High Dimensions</a></li>
<li><a href="#org6b80217">6.5. Lab 1: Subset Selection Methods</a>
<ul>
<li><a href="#org20f73c1">6.5.1. Best Subset Selection</a></li>
<li><a href="#orgba9499e">6.5.2. Forward and Backward Stepwise Selection</a></li>
<li><a href="#orga5f0ec4">6.5.3. Choosing Among Models Using the Validation Set Approach and Cross-Validation</a></li>
</ul>
</li>
<li><a href="#orga1868ca">6.6. Lab 2: Ridge Regression and the Lasso</a>
<ul>
<li><a href="#orgef6a96f">6.6.1. Ridge Regression</a></li>
<li><a href="#org70f708b">6.6.2. The Lasso</a></li>
</ul>
</li>
<li><a href="#org3ee433d">6.7. Lab 3: PCR and PLS Regression</a>
<ul>
<li><a href="#org96f12f7">6.7.1. Principal Components Regression</a></li>
<li><a href="#orga100fd3">6.7.2. Partial Least Squares</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#org2bdd141">7. Moving Beyond Linearity</a>
<ul>
<li><a href="#org8076280">7.1. Polynomial Regression</a></li>
<li><a href="#org324bcc2">7.2. Step Functions</a></li>
<li><a href="#org7acaba5">7.3. Basis Functions</a></li>
<li><a href="#org8c2eca0">7.4. Regression Splines</a></li>
<li><a href="#org3bb9aaf">7.5. Lab: Non-linear Modeling</a>
<ul>
<li><a href="#orgac959b8">7.5.1. Polynomial Regression and Step Functions</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#orgd2c3957">8. Tree-Based Models</a>
<ul>
<li><a href="#orgf4b805e">8.1. The Basics of Decision Trees</a>
<ul>
<li><a href="#org0112364">8.1.1. Regression Trees</a></li>
<li><a href="#orge46b2de">8.1.2. Classification Trees</a></li>
<li><a href="#org0bf0e9f">8.1.3. Trees versus Linear Models</a></li>
</ul>
</li>
<li><a href="#org60ad82b">8.2. Bagging, Random Forests, Boosting</a></li>
<li><a href="#org91a2f29">8.3. Lab: Decision Trees</a>
<ul>
<li><a href="#org8aae0c0">8.3.1. Fitting Classification Trees</a></li>
<li><a href="#org434f007">8.3.2. Fitting Regression Trees</a></li>
<li><a href="#org04ccd68">8.3.3. Bagging and Random Forests</a></li>
<li><a href="#org06402c2">8.3.4. Boosting</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#orgd98a3f9">9. Support Vector Machines</a>
<ul>
<li><a href="#org6220cf1">9.1. Maximal Margin Classifier</a>
<ul>
<li><a href="#org5e3604c">9.1.1. What is a Hyperplane?</a></li>
<li><a href="#org05ccb32">9.1.2. Classification Using a Separating Hyperplane</a></li>
<li><a href="#orgf3614f9">9.1.3. The Maximal Margin Classifier</a></li>
<li><a href="#org5d157d6">9.1.4. Construction of the Maximal Margin Classifier</a></li>
<li><a href="#org6025292">9.1.5. The Non-separable Case</a></li>
</ul>
</li>
<li><a href="#org1d8a47d">9.2. Support Vector Classifiers</a>
<ul>
<li><a href="#org9046c36">9.2.1. Overview of the Support Vector Classifier</a></li>
<li><a href="#orge4e3096">9.2.2. Details of the Support Vector Classifier</a></li>
</ul>
</li>
<li><a href="#org1933591">9.3. Support Vector Machines</a>
<ul>
<li><a href="#org513f10d">9.3.1. Classification with Non-linear Decision Boundaries</a></li>
<li><a href="#org4a732e4">9.3.2. The Support Vector Machine</a></li>
<li><a href="#org58a8f30">9.3.3. An Application to the Heart Disease Data</a></li>
</ul>
</li>
<li><a href="#orga42ffa9">9.4. SVMs with More than Two Classes</a>
<ul>
<li><a href="#org639a905">9.4.1. One-Versus-One Classification</a></li>
<li><a href="#org8e5bf9e">9.4.2. One-Versus-All Classification</a></li>
</ul>
</li>
<li><a href="#org7b0e323">9.5. Relationship with Logistic Regression</a></li>
<li><a href="#org65da4b4">9.6. Lab: Support Vector Machines</a>
<ul>
<li><a href="#org914fad5">9.6.1. Support Vector Classifier</a></li>
<li><a href="#org5c3cdef">9.6.2. Support Vector Machine</a></li>
<li><a href="#org86df7af">9.6.3. ROC Curves</a></li>
<li><a href="#org1bf1cd1">9.6.4. SVM with Multiple Classes</a></li>
<li><a href="#orgb372ca7">9.6.5. Application to Gene Expression Data</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#org79ac5fb">10. Unsupervised Learning</a>
<ul>
<li><a href="#orgbec34e7">10.1. The Challenge of Unsupervised Learning</a></li>
<li><a href="#org3b3f9b3">10.2. Principal Component Analysis</a>
<ul>
<li><a href="#orgc831dc8">10.2.1. What are Principal Components?</a></li>
<li><a href="#org1f1971c">10.2.2. Another Interpretation of Principal Components</a></li>
<li><a href="#org9d74a40">10.2.3. More on PCA</a></li>
</ul>
</li>
<li><a href="#org202cb85">10.3. Clustering Methods</a>
<ul>
<li><a href="#org4f92fc9">10.3.1. K-Means Clustering</a></li>
<li><a href="#orgfa686cb">10.3.2. Hierarchical Clustering</a></li>
</ul>
</li>
<li><a href="#org9cfd91f">10.4. Lab 1: Principal Components Analysis</a></li>
<li><a href="#org2e7b44a">10.5. Lab 2: Clustering</a>
<ul>
<li><a href="#org5f24036">10.5.1. K-Means Clustering</a></li>
<li><a href="#org9cb02e0">10.5.2. Hierarchical Clustering</a></li>
</ul>
</li>
<li><a href="#orgd639e92">10.6. Lab 3: NCI60 Data Example</a>
<ul>
<li><a href="#org2247c90">10.6.1. PCA on NCI60 Data</a></li>
<li><a href="#orgaadd3da">10.6.2. Clustering the Observations of the NCI60 Data</a></li>
</ul>
</li>
</ul>
</li>
</ul>
</div>
</div>

<div id="outline-container-org51a3d92" class="outline-2">
<h2 id="org51a3d92"><span class="section-number-2">1</span> Introduction</h2>
<div class="outline-text-2" id="text-1">
<p>
Figure <a href="#orgce502d8">1</a> shows graphs of Wage versus three variables. 
</p>


<div id="orgce502d8" class="figure">
<p><img src="figures/fig1_1.png" alt="fig1_1.png" />
</p>
<p><span class="figure-number">Figure 1: </span><code>Wage</code> data, which contains income survey information for males from the central Atlantic region of the United States.  Left: <code>wage</code> as a function of <code>age</code>.  On average, <code>wage</code> increases with <code>age</code> until about 60 years of age, at which point it begins to decline.  Center: <code>wage</code> as a function of <code>year</code>.  There is a slow but steady increase of approximately $10,000 in the average <code>wage</code> between 2003 and 2009.  Right: Boxplots displaying <code>wage</code> as a function of <code>education</code>, with 1 indicating the lowest level (no highschool diploma) and 5 the highest level (an advanced graduate degree).  On average, <code>wage</code> increases with the level of <code>education</code>.</p>
</div>


<p>
Figure <a href="#orgc0f5bc5">2</a> shows boxplots of previous days' percentage changes in S&amp;P
500 grouped according to today's change <code>Up</code> or <code>Down</code>. 
</p>


<div id="orgc0f5bc5" class="figure">
<p><img src="figures/fig1_2.png" alt="fig1_2.png" />
</p>
<p><span class="figure-number">Figure 2: </span>Left: Boxplots of the previous day's percentage change in the S&amp;P 500 index for the days for which the market increased or decreased, obtained from the <code>Smarket</code> data.  Center and Right: Same as left panel, but the percentage changes for two and three days previous are shown.</p>
</div>
</div>
</div>

<div id="outline-container-orge18c8df" class="outline-2">
<h2 id="orge18c8df"><span class="section-number-2">2</span> Statistical Learning</h2>
<div class="outline-text-2" id="text-2">
<p>
<a id="orgce1fc8b"></a>
</p>
</div>
<div id="outline-container-orge487959" class="outline-3">
<h3 id="orge487959"><span class="section-number-3">2.1</span> What is Statistical Learning?</h3>
<div class="outline-text-3" id="text-2-1">
<p>
Figure <a href="#org8710e1d">3</a> shows scatter plots of <code>sales</code> versus <code>TV</code>, <code>radio</code>,
and <code>newspaper</code> advertising.  In each panel, the figure also includes an OLS
regression line.  
</p>


<div id="org8710e1d" class="figure">
<p><img src="figures/fig2_1.png" alt="fig2_1.png" />
</p>
<p><span class="figure-number">Figure 3: </span>The <code>Advertising</code> data set. The plot displays <code>sales</code>, in thousands of units, as a function of <code>TV</code>, <code>radio</code>, and <code>newspaper</code> budgets, in thousands of dollars, for 200 different markets.  In each plot we show the simple least squares fit of <code>sales</code> to that variable.  In other words, each red line represents a simple model that can be used to predict <code>sales</code> using <code>TV</code>, <code>radio</code>, and <code>newspaper</code>, respectively.</p>
</div>


<p>
Figure <a href="#orga8a0b15">4</a> is a plot of <code>Income</code> versus <code>Years of Education</code> from the
Income data set.  In the left panel, the ``true'' function (given by blue line)
is actually my guess.  
</p>


<div id="orga8a0b15" class="figure">
<p><img src="figures/fig2_2.png" alt="fig2_2.png" />
</p>
<p><span class="figure-number">Figure 4: </span>The <code>Income</code> data set.  Left: The red dots are the observed values of <code>income</code> (in tens of thousands of dollars) and <code>years of education</code> for 30 individuals.  Right: The blue curve represents the true underlying relationship between <code>income</code> and <code>years of education</code>, which is generally unknown (but is known in this case because the data are simulated).  The vertical lines represent the error associated with each observation.  Note that some of the errors are positive (when an observation lies above the blue curve) and some are negative (when an observation lies below the curve).  Overall, these errors have approximately mean zero.</p>
</div>


<p>
Figure <a href="#orgdf20213">5</a> is a plot of <code>Income</code> versus <code>Years of Education</code> and
<code>Seniority</code> from the <code>Income</code> data set.  Since the book does not provide the
true values of <code>Income</code>, ``true'' values shown in the plot are actually third
order polynomial fit.  
</p>


<div id="orgdf20213" class="figure">
<p><img src="figures/fig2_3.png" alt="fig2_3.png" />
</p>
<p><span class="figure-number">Figure 5: </span>The plot displays <code>income</code> as a function of <code>years of education</code> and <code>seniority</code> in the <code>Income</code> data set.  The blue surface represents the true underlying relationship between <code>income</code> and <code>years of education</code> and <code>seniority</code>, which is known since the data are simulated.  The red dots indicate the observed values of these quantities for 30 individuals.</p>
</div>


<p>
Figure <a href="#org4581c1a">6</a> shows an example of the parametric approach applied to
the <code>Income</code> data from previous figure. 
</p>


<div id="org4581c1a" class="figure">
<p><img src="figures/fig2_4.png" alt="fig2_4.png" />
</p>
<p><span class="figure-number">Figure 6: </span>A linear model fit by least squares to the <code>Income</code> data from figure <a href="#orgdf20213">5</a>.  The observations are shown in red, and the blue plane indicates the least squares fit to the data.</p>
</div>


<p>
Figure <a href="#org0f419f0">7</a> provides an illustration of the trade-off between
flexibility and interpretability for some of the methods covered in this book.
</p>


<div id="org0f419f0" class="figure">
<p><img src="figures/figure2_7.png" alt="figure2_7.png" />
</p>
<p><span class="figure-number">Figure 7: </span>A representation of the tradeoff between flexibility and interpretability, using different statistical learning methods.  In general, as the flexibility of a method increases, its interpretability decreases.</p>
</div>


<p>
Figure <a href="#org3614b05">8</a> provides a simple illustration of the clustering problem.
</p>


<div id="org3614b05" class="figure">
<p><img src="figures/fig2_8.png" alt="fig2_8.png" />
</p>
<p><span class="figure-number">Figure 8: </span>A clustering data set involving three groups.  Each group is shown using a different colored symbol.  Left: The three groups are well-separated.  In this setting, a clustering approach should successfully identify the three groups.  Right: There is some overlap among the groups.  Now the clustering taks is more challenging.</p>
</div>
</div>
</div>

<div id="outline-container-orgeb9d610" class="outline-3">
<h3 id="orgeb9d610"><span class="section-number-3">2.2</span> Assessing Model Accuracy</h3>
<div class="outline-text-3" id="text-2-2">
<p>
Figure <a href="#org43f7e4e">9</a> illustrates the tradeoff between training MSE and test
MSE.  We select a ``true function'' whose shape is similar to that shown in the
book.  In the left panel, the orange, blue, and green curves illustrate three possible estimates
for \(f\) given by the black curve.  The orange line is the linear regression
fit, which is relatively inflexible.  The blue and green curves were produced
using <i>smoothing splines</i> from <code>UnivariateSpline</code> function in <code>scipy</code> package.
We obtain different levels of flexibility by varying the parameter <code>s</code>, which
affects the number of knots.  
</p>

<p>
For the right panel, we have chosen polynomial fits.  The degree of polynomial
represents the level of flexibility.  This is because the function
<code>UnivariateSpline</code> does not more than five degrees of freedom.  
</p>

<p>
When we repeat the simulations for figure <a href="#org43f7e4e">9</a>, we see considerable
variation in the right panel MSE plots.  But the overall conclusion remains the
same.   
</p>


<div id="org43f7e4e" class="figure">
<p><img src="figures/fig2_9.png" alt="fig2_9.png" />
</p>
<p><span class="figure-number">Figure 9: </span>Left: Data simulated from \(f\), shown in black.  Three estimates of \(f\) are shown: the linear regression line (orange curve), and two smoothing spline fits (blue and green curves).  Right: Training MSE (grey curve), test MSE (red curve), and minimum possible test MSE over all methods (dashed grey line).</p>
</div>


<p>
Figure <a href="#org9a4ea7b">10</a> provides another example in which the true \(f\) is
approximately linear. 
</p>


<div id="org9a4ea7b" class="figure">
<p><img src="figures/fig2_10.png" alt="fig2_10.png" />
</p>
<p><span class="figure-number">Figure 10: </span>Details are as in figure <a href="#org43f7e4e">9</a> using a different true \(f\) that is much closer to linear.  In this setting, linear regression provides a very good fit to the data.</p>
</div>


<p>
Figure <a href="#orgb45f0cf">11</a> displays an example in which \(f\) is highly
non-linear. The training and test MSE curves still exhibit the same general
patterns.
</p>


<div id="orgb45f0cf" class="figure">
<p><img src="figures/fig2_11.png" alt="fig2_11.png" />
</p>
<p><span class="figure-number">Figure 11: </span>Details are as in figure <a href="#org43f7e4e">9</a>, using a different \(f\) that is far from linear.  In this setting, linear regression provides a very poor fit to the data.</p>
</div>


<p>
Figure <a href="#org25bb645">12</a> displays the relationship between bias, variance, and
test MSE.  This relationship is referred to as <i>bias-variance trade-off</i>.  When
simulations are repeated, we see considerable variation in different graphs,
especially for MSE lines.  But overall shape remains the same. 
</p>


<div id="org25bb645" class="figure">
<p><img src="figures/fig2_12.png" alt="fig2_12.png" />
</p>
<p><span class="figure-number">Figure 12: </span>Squared bias (blue curve), variance (orange curve), \(Var(\epsilon)\) (dashed line), and test MSE (red curve) for the three data sets in figures <a href="#org43f7e4e">9</a> - <a href="#orgb45f0cf">11</a>.  The vertical dotted line indicates the flexibility level corresponding to the smallest test MSE.</p>
</div>


<p>
Figure <a href="#org0d2d113">13</a> provides an example using a simulated data set in
two-dimensional space consisting of predictors \(X_1\) and \(X_2\).  
</p>


<div id="org0d2d113" class="figure">
<p><img src="figures/fig2_13.png" alt="fig2_13.png" />
</p>
<p><span class="figure-number">Figure 13: </span>A simulated data set consisting of 200 observations in two groups, indicated in blue and orange.  The dashed line represents the Bayes decision boundary.  The orange background grid indicates the region in which a test observation will be assigned to the orange class, and blue background grid indicates the region in which a test observation will be assigned to the blue class.</p>
</div>


<p>
Figure <a href="#org4f7b532">14</a> displays the KNN decision boundary, using \(K=10\), when
applied to the simulated data set from figure <a href="#org0d2d113">13</a>.  Even though
the true distribution is not known by the KNN classifier, the KNN decision
making boundary is very close to that of the Bayes classifier.  
</p>


<div id="org4f7b532" class="figure">
<p><img src="figures/fig2_15.png" alt="fig2_15.png" />
</p>
<p><span class="figure-number">Figure 14: </span>The firm line indicates the KNN decision boundary on the data from figure <a href="#org0d2d113">13</a>, using \(K = 10\). The Bayes decision boundary is shown as a dashed line.  The KNN and Bayes decision boundaries are very similar.</p>
</div>


<div id="orgeee6d4c" class="figure">
<p><img src="figures/fig2_16.png" alt="fig2_16.png" />
</p>
<p><span class="figure-number">Figure 15: </span>A comparison of the KNN decision boundaries (solid curves) obtained using \(K=1\) and \(K=100\) on the data from figure <a href="#org0d2d113">13</a>.  With \(K=1\), the decision boundary is overly flexible, while with \(K=100\) it is not sufficiently flexible.  The Bayes decision boundary is shown as dashed line.</p>
</div>


<p>
In figure <a href="#orgff201e6">16</a> we have plotted the KNN test and training errors as
a function of \(\frac{1}{K}\).  As \(\frac{1}{K}\) increases, the method becomes
more flexible.  As in the regression setting, the training error rate
consistently declines as the flexibility increases.  However, the test error
exhibits the characteristic U-shape, declining at first (with a minimum at
approximately \(K=10\)) before increasing again when the method becomes
excessively flexible and overfits. 
</p>


<div id="orgff201e6" class="figure">
<p><img src="figures/fig2_17.png" alt="fig2_17.png" />
</p>
<p><span class="figure-number">Figure 16: </span>The KNN training error rate (blue, 200 observations) and test error rate (orange, 5,000 observations) on the data from figure <a href="#org0d2d113">13</a> as the level of flexibility (assessed using \(\frac{1}{K}\)) increases, or equivalently as the number of neighbors \(K\) decreases.  The black dashed line indicates the Bayes error rate.</p>
</div>
</div>
</div>

<div id="outline-container-org13974b2" class="outline-3">
<h3 id="org13974b2"><span class="section-number-3">2.3</span> Lab: Introduction to Python</h3>
<div class="outline-text-3" id="text-2-3">
</div>
<div id="outline-container-org2037bff" class="outline-4">
<h4 id="org2037bff"><span class="section-number-4">2.3.1</span> Basic Commands</h4>
<div class="outline-text-4" id="text-2-3-1">
<p>
In <code>Python</code> a list can be created by enclosing comma-separated elements by
square brackets.  Length of a list can be obtained using <code>len</code> function.
</p>

<div class="org-src-container">
<pre class="src src-python"><span style="color: #BA36A5;">x</span> = [1, 3, 2, 5]
<span style="color: #0000FF;">print</span>(<span style="color: #006FE0;">len</span>(x))
<span style="color: #BA36A5;">y</span> = 3
<span style="color: #BA36A5;">z</span> = 5
<span style="color: #0000FF;">print</span>(y + z)
</pre>
</div>

<pre class="example">
4
8

</pre>

<p>
To create an array of numbers, use <code>array</code> function in <code>numpy</code> library.  <code>numpy</code>
functions can be used to perform element-wise operations on arrays.
</p>

<div class="org-src-container">
<pre class="src src-python"><span style="color: #0000FF;">import</span> numpy <span style="color: #0000FF;">as</span> np
<span style="color: #BA36A5;">x</span> = np.array([[1, 2], [3, 4]])
<span style="color: #BA36A5;">y</span> = np.array([6, 7, 8, 9]).reshape((2, 2))
<span style="color: #0000FF;">print</span>(x)
<span style="color: #0000FF;">print</span>(y)
<span style="color: #0000FF;">print</span>(x ** 2)
<span style="color: #0000FF;">print</span>(np.sqrt(y))
</pre>
</div>

<pre class="example">
[[1 2]
 [3 4]]
[[6 7]
 [8 9]]
[[ 1  4]
 [ 9 16]]
[[2.44948974 2.64575131]
 [2.82842712 3.        ]]

</pre>


<p>
<code>numpy.random</code> has a number of functions to generate random variables that
follow a given distribution.  Here we create two correlated sets of numbers, <code>x</code>
and <code>y</code>, and use <code>numpy.corrcoef</code> to calculate correlation between them. 
</p>

<div class="org-src-container">
<pre class="src src-python"><span style="color: #0000FF;">import</span> numpy <span style="color: #0000FF;">as</span> np
np.random.seed(911)
<span style="color: #BA36A5;">x</span> = np.random.normal(size=50)
<span style="color: #BA36A5;">y</span> = x + np.random.normal(loc=50, scale=0.1, size=50)
<span style="color: #0000FF;">print</span>(np.corrcoef(x, y))
<span style="color: #0000FF;">print</span>(np.corrcoef(x, y)[0, 1])
<span style="color: #0000FF;">print</span>(np.mean(x))
<span style="color: #0000FF;">print</span>(np.var(y))
<span style="color: #0000FF;">print</span>(np.std(y) ** 2)
</pre>
</div>

<pre class="example">
[[1.         0.99374931]
 [0.99374931 1.        ]]
0.9937493134584551
-0.020219724397254404
0.9330621750073689
0.9330621750073688

</pre>
</div>
</div>

<div id="outline-container-orgb312b27" class="outline-4">
<h4 id="orgb312b27"><span class="section-number-4">2.3.2</span> Graphics</h4>
<div class="outline-text-4" id="text-2-3-2">
<p>
<code>matplotlib</code> library has a number of functions to plot data in <code>Python</code>.  It is
possible to view graphs on screen or save them in file for inclusion in a
document. 
</p>

<div class="org-src-container">
<pre class="src src-python"><span style="color: #0000FF;">import</span> numpy <span style="color: #0000FF;">as</span> np
<span style="color: #0000FF;">import</span> matplotlib               <span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">only if we need to save figure in file</span>
matplotlib.use(<span style="color: #008000;">'Agg'</span>)           <span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">only to save figure in file</span>
<span style="color: #0000FF;">import</span> matplotlib.pyplot <span style="color: #0000FF;">as</span> plt

<span style="color: #BA36A5;">x</span> = np.random.normal(size=100)
<span style="color: #BA36A5;">y</span> = np.random.normal(size=100)
plt.plot(x, y)
plt.xlabel(<span style="color: #008000;">'This is x-axis'</span>)
plt.ylabel(<span style="color: #008000;">'This is y-axis'</span>)
plt.title(<span style="color: #008000;">'Plot of X vs Y'</span>)

plt.savefig(<span style="color: #008000;">'xyPlot.png'</span>)       <span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">only to save figure in a file</span>
</pre>
</div>

<p>
<code>numpy</code> function <code>linspace</code> can be used to create a sequence between a start and
an end of a given length.  
</p>

<div class="org-src-container">
<pre class="src src-python"><span style="color: #0000FF;">import</span> numpy <span style="color: #0000FF;">as</span> np
<span style="color: #0000FF;">import</span> matplotlib.pyplot <span style="color: #0000FF;">as</span> plt

<span style="color: #BA36A5;">x</span> = np.linspace(-np.pi, np.pi, num=50)
<span style="color: #BA36A5;">y</span> = x
<span style="color: #BA36A5;">xx</span>, <span style="color: #BA36A5;">yy</span> = np.meshgrid(x, y)
<span style="color: #BA36A5;">zz</span> = np.cos(yy) / (1 + xx ** 2)

plt.contour(xx, yy, zz)

<span style="color: #BA36A5;">fig</span>, <span style="color: #BA36A5;">ax</span> = plt.subplots()
<span style="color: #BA36A5;">zza</span> = (zz - zz.T) / 2.0
<span style="color: #BA36A5;">CS</span> = ax.contour(xx, yy, zza)
ax.clabel(CS, inline=1)
</pre>
</div>
</div>
</div>

<div id="outline-container-org5e49b84" class="outline-4">
<h4 id="org5e49b84"><span class="section-number-4">2.3.3</span> Indexing Data</h4>
<div class="outline-text-4" id="text-2-3-3">
<p>
To access elements of an array, specify indexes inside square brackets.  It is
possible to access multiple rows and columns. <code>shape</code> method gives number of
rows followed by number of columns. 
</p>

<div class="org-src-container">
<pre class="src src-python"><span style="color: #0000FF;">import</span> numpy <span style="color: #0000FF;">as</span> np

<span style="color: #BA36A5;">A</span> = np.array(np.arange(1, 17))
<span style="color: #BA36A5;">A</span> = A.reshape(4, 4, order=<span style="color: #008000;">'F'</span>)  <span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">column first, Fortran style</span>
<span style="color: #0000FF;">print</span>(A)
<span style="color: #0000FF;">print</span>(A[1, 2])
<span style="color: #0000FF;">print</span>(A[(0,2),:][:,(1,3)])
<span style="color: #0000FF;">print</span>(A[<span style="color: #006FE0;">range</span>(0,3),:][:,<span style="color: #006FE0;">range</span>(1,4)])
<span style="color: #0000FF;">print</span>(A[<span style="color: #006FE0;">range</span>(0, 2), :])
<span style="color: #0000FF;">print</span>(A[:, <span style="color: #006FE0;">range</span>(0, 2)])
<span style="color: #0000FF;">print</span>(A[0,:])
<span style="color: #0000FF;">print</span>(A.shape)
</pre>
</div>

<pre class="example">
[[ 1  5  9 13]
 [ 2  6 10 14]
 [ 3  7 11 15]
 [ 4  8 12 16]]
10
[ 5 15]
[ 5 10 15]
[[ 1  5  9 13]
 [ 2  6 10 14]]
[[1 5]
 [2 6]
 [3 7]
 [4 8]]
(4, 4)
</pre>
</div>
</div>

<div id="outline-container-orgd28dc33" class="outline-4">
<h4 id="orgd28dc33"><span class="section-number-4">2.3.4</span> Loading Data</h4>
<div class="outline-text-4" id="text-2-3-4">
<p>
<code>pandas</code> library provides <code>read_csv</code> function to read files with data in
rectangular shape.  
</p>

<div class="org-src-container">
<pre class="src src-python"><span style="color: #0000FF;">import</span> pandas <span style="color: #0000FF;">as</span> pd
<span style="color: #BA36A5;">Auto</span> = pd.read_csv(<span style="color: #008000;">'data/Auto.csv'</span>)
<span style="color: #0000FF;">print</span>(Auto.head())
<span style="color: #0000FF;">print</span>(Auto.shape)
<span style="color: #0000FF;">print</span>(Auto.columns)
</pre>
</div>

<pre class="example">
    mpg  cylinders  displacement  ... year  origin                       name
0  18.0          8         307.0  ...   70       1  chevrolet chevelle malibu
1  15.0          8         350.0  ...   70       1          buick skylark 320
2  18.0          8         318.0  ...   70       1         plymouth satellite
3  16.0          8         304.0  ...   70       1              amc rebel sst
4  17.0          8         302.0  ...   70       1                ford torino

[5 rows x 9 columns]
(397, 9)
Index(['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
       'acceleration', 'year', 'origin', 'name'],
      dtype='object')
</pre>

<p>
To load data from an <code>R</code> library, use <code>get_rdataset</code> function from
<code>statsmodels</code>.  This function seems to work only if the computer is connected to
the internet. 
</p>

<div class="org-src-container">
<pre class="src src-python"><span style="color: #0000FF;">from</span> statsmodels <span style="color: #0000FF;">import</span> datasets
<span style="color: #BA36A5;">carseats</span> = datasets.get_rdataset(<span style="color: #008000;">'Carseats'</span>, package=<span style="color: #008000;">'ISLR'</span>).data
<span style="color: #0000FF;">print</span>(carseats.shape)
<span style="color: #0000FF;">print</span>(carseats.columns)
</pre>
</div>

<pre class="example">
(400, 11)
Index(['Sales', 'CompPrice', 'Income', 'Advertising', 'Population', 'Price',
       'ShelveLoc', 'Age', 'Education', 'Urban', 'US'],
      dtype='object')

</pre>
</div>
</div>

<div id="outline-container-org845481c" class="outline-4">
<h4 id="org845481c"><span class="section-number-4">2.3.5</span> Additional Graphical and Numerical Summaries</h4>
<div class="outline-text-4" id="text-2-3-5">
<p>
<code>plot</code> method can be directly applied to a <code>pandas</code> dataframe.  
</p>

<div class="org-src-container">
<pre class="src src-python"><span style="color: #0000FF;">import</span> pandas <span style="color: #0000FF;">as</span> pd
<span style="color: #BA36A5;">Auto</span> = pd.read_csv(<span style="color: #008000;">'data/Auto.csv'</span>)
Auto.boxplot(column=<span style="color: #008000;">'mpg'</span>, by=<span style="color: #008000;">'cylinders'</span>, grid=<span style="color: #D0372D;">False</span>)
</pre>
</div>

<p>
<code>hist</code> method can be applied to plot a histogram. 
</p>

<div class="org-src-container">
<pre class="src src-python"><span style="color: #0000FF;">import</span> pandas <span style="color: #0000FF;">as</span> pd
<span style="color: #BA36A5;">Auto</span> = pd.read_csv(<span style="color: #008000;">'data/Auto.csv'</span>)
Auto.hist(column=<span style="color: #008000;">'mpg'</span>)
Auto.hist(column=<span style="color: #008000;">'mpg'</span>, color=<span style="color: #008000;">'red'</span>)
Auto.hist(column=<span style="color: #008000;">'mpg'</span>, color=<span style="color: #008000;">'red'</span>, bins=15)
</pre>
</div>

<p>
For pairs plot, use <code>scatter_matrix</code> method in <code>pandas.plotting</code>.  
</p>

<div class="org-src-container">
<pre class="src src-python"><span style="color: #0000FF;">import</span> pandas <span style="color: #0000FF;">as</span> pd
<span style="color: #0000FF;">from</span> pandas <span style="color: #0000FF;">import</span> plotting
<span style="color: #BA36A5;">Auto</span> = pd.read_csv(<span style="color: #008000;">'data/Auto.csv'</span>)
plotting.scatter_matrix(Auto[[<span style="color: #008000;">'mpg'</span>, <span style="color: #008000;">'displacement'</span>, <span style="color: #008000;">'horsepower'</span>, <span style="color: #008000;">'weight'</span>,
                              <span style="color: #008000;">'acceleration'</span>]])
</pre>
</div>

<p>
On <code>pandas</code> dataframes, <code>describe</code> method produces a summary of each variable. 
</p>

<div class="org-src-container">
<pre class="src src-python"><span style="color: #0000FF;">import</span> pandas <span style="color: #0000FF;">as</span> pd
<span style="color: #BA36A5;">Auto</span> = pd.read_csv(<span style="color: #008000;">'data/Auto.csv'</span>)
<span style="color: #0000FF;">print</span>(Auto.describe())
</pre>
</div>

<pre class="example">
              mpg   cylinders  ...        year      origin
count  397.000000  397.000000  ...  397.000000  397.000000
mean    23.515869    5.458438  ...   75.994962    1.574307
std      7.825804    1.701577  ...    3.690005    0.802549
min      9.000000    3.000000  ...   70.000000    1.000000
25%     17.500000    4.000000  ...   73.000000    1.000000
50%     23.000000    4.000000  ...   76.000000    1.000000
75%     29.000000    8.000000  ...   79.000000    2.000000
max     46.600000    8.000000  ...   82.000000    3.000000

[8 rows x 7 columns]
</pre>
</div>
</div>
</div>
</div>

<div id="outline-container-org0d58fd3" class="outline-2">
<h2 id="org0d58fd3"><span class="section-number-2">3</span> Linear Regression</h2>
<div class="outline-text-2" id="text-3">
</div>
<div id="outline-container-org0c24ba2" class="outline-3">
<h3 id="org0c24ba2"><span class="section-number-3">3.1</span> Simple Linear Regression</h3>
<div class="outline-text-3" id="text-3-1">
<p>
Figure <a href="#org7a8d086">17</a> displays the simple linear regression fit to the
<code>Advertising</code> data, where \(\hat{\beta_0} =\) 0.0475
 and \(\hat{\beta_1} =\) 7.0326.
</p>


<div id="org7a8d086" class="figure">
<p><img src="figures/fig3_1.png" alt="fig3_1.png" />
</p>
<p><span class="figure-number">Figure 17: </span>For the <code>Advertising</code> data, the least squares fit for the regression of <code>sales</code> onto <code>TV</code> is shown.  The fit is found by minimizing the sum of squared errors.  Each grey line represents an error, and the fit makes a compromise by averaging their squares.  In this case a linear fit captures the essence of the relationship, although it is somewhat deficient in the left of the plot.</p>
</div>


<p>
In figure <a href="#orged2c00d">18</a>, we have computed RSS for a number of values of
\(\beta_0\) and \(\beta_1\), using the advertising data with <code>sales</code> as the response
and <code>TV</code> as the predictor. 
</p>


<div id="orged2c00d" class="figure">
<p><img src="figures/fig3_2.png" alt="fig3_2.png" />
</p>
<p><span class="figure-number">Figure 18: </span>Contour and three-dimensional plots of the RSS on the <code>Advertising</code> data, using <code>sales</code> as the response and <code>TV</code> as the predictor.  The red dots correspond to the least squares estimates \(\hat{\beta_0}\) and \(\hat{\beta_1}\).</p>
</div>


<p>
The left-hand panel of figure <a href="#org6d8b0c3">19</a> displays <i>population regression
line</i> and <i>least squares line</i> for a simple simulated example.  The red line in
the left-hand panel displays the <i>true</i> relationship, \(f(X) = 2 + 3X\), while the
blue line is the least squares estimate based on observed data.  In the
right-hand panel of figure <a href="#org6d8b0c3">19</a> we have generated five different
data sets from the model \(Y = 2 + 3X + \epsilon\) and plotted the corresponding
five least squares lines.  
</p>


<div id="org6d8b0c3" class="figure">
<p><img src="figures/fig3_3.png" alt="fig3_3.png" />
</p>
<p><span class="figure-number">Figure 19: </span>A simulated data set.  Left: The red line represents the true relationship, \(f(X) = 2 + 3X\), which is known as the population regression line.  The blue line is the least squares line; it is the least squares estimate for \(f(X)\) based on the observed data, shown in grey circles.  Right: The population regression line is again shown in red, and the least squares line in blue.  In cyan, five least squares lines are shown, each computed on the basis of a separate random set of observations.  Each least squares line is different, but on average, the least squares lines are quite close to the population regression line.</p>
</div>

<p>
For <code>Advertising</code> data, table <a href="#org2845396">1</a> provides details of the least squares model for the
regression of number of units sold on TV advertising budget. 
</p>

<table id="org2845396" border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">
<caption class="t-bottom"><span class="table-number">Table 1:</span> For <code>Advertising</code> data, the coefficients of the least squares model for the regression of number of units sold on TV advertising budget.  An increase of $1,000 on the TV advertising budget is associated with an increase in sales by around 50 units.</caption>

<colgroup>
<col  class="org-left" />

<col  class="org-right" />

<col  class="org-right" />

<col  class="org-right" />

<col  class="org-right" />
</colgroup>
<thead>
<tr>
<th scope="col" class="org-left">&#xa0;</th>
<th scope="col" class="org-right">Coef.</th>
<th scope="col" class="org-right">Std.Err.</th>
<th scope="col" class="org-right">\(t\)</th>
<th scope="col" class="org-right">\(P > \mid t \mid\)</th>
</tr>
</thead>
<tbody>
<tr>
<td class="org-left">Intercept</td>
<td class="org-right">7.0326</td>
<td class="org-right">0.4578</td>
<td class="org-right">15.3603</td>
<td class="org-right">0.0</td>
</tr>

<tr>
<td class="org-left">TV</td>
<td class="org-right">0.0475</td>
<td class="org-right">0.0027</td>
<td class="org-right">17.6676</td>
<td class="org-right">0.0</td>
</tr>
</tbody>
</table>


<p>
Next, in table <a href="#org624665c">2</a>, we report more information about the least squares model.  
</p>


<table id="org624665c" border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">
<caption class="t-bottom"><span class="table-number">Table 2:</span> For the <code>Advertising</code> data, more information about the least squares model for the regression of number of units sold on TV advertising budget.</caption>

<colgroup>
<col  class="org-left" />

<col  class="org-right" />
</colgroup>
<thead>
<tr>
<th scope="col" class="org-left">Quantity</th>
<th scope="col" class="org-right">Value</th>
</tr>
</thead>
<tbody>
<tr>
<td class="org-left">Residual standard error</td>
<td class="org-right">3.259</td>
</tr>

<tr>
<td class="org-left">\(R^2\)</td>
<td class="org-right">0.612</td>
</tr>

<tr>
<td class="org-left">F-statistic</td>
<td class="org-right">312.145</td>
</tr>
</tbody>
</table>
</div>
</div>

<div id="outline-container-org8d955fe" class="outline-3">
<h3 id="org8d955fe"><span class="section-number-3">3.2</span> Multiple Linear Regression</h3>
<div class="outline-text-3" id="text-3-2">
<p>
Table <a href="#orgb4df7bc">3</a>  shows results of two simple linear
regressions, each of which uses a different advertising medium as a predictor.
We find that a $1,000 increase in spending on radio advertising is associated
with an increase in sales by around 202 units.  A $1,000 increase in advertising
spending on on newspapers increases sales by approximately 55 units. 
</p>

<table id="orgb4df7bc" border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">
<caption class="t-bottom"><span class="table-number">Table 3:</span> More simple linear regression models for <code>Advertising</code> data.  Coefficients of the simple linear regression model for number of units sold on Top: radio advertising budget and Bottom: newspaper advertising budget. A $1,000 increase in spending on radio advertising is associated with an average increase sales by around 202 units, while the same increase in spending on newspaper advertising is associated with an average increase of around 55 units.  <code>Sales</code> variable is in thousands of units, and the <code>radio</code> and <code>newspaper</code> variables are in thousands of dollars..</caption>

<colgroup>
<col  class="org-left" />

<col  class="org-right" />

<col  class="org-right" />

<col  class="org-right" />

<col  class="org-right" />
</colgroup>
<thead>
<tr>
<th scope="col" class="org-left">&#xa0;</th>
<th scope="col" class="org-right">Coef.</th>
<th scope="col" class="org-right">Std.Err.</th>
<th scope="col" class="org-right">\(t\)</th>
<th scope="col" class="org-right">\(P > \mid t \mid\)</th>
</tr>
</thead>
<tbody>
<tr>
<td class="org-left">Intercept</td>
<td class="org-right">9.312</td>
<td class="org-right">0.563</td>
<td class="org-right">16.542</td>
<td class="org-right">0.0</td>
</tr>

<tr>
<td class="org-left">radio</td>
<td class="org-right">0.202</td>
<td class="org-right">0.02</td>
<td class="org-right">9.921</td>
<td class="org-right">0.0</td>
</tr>
</tbody>
<tbody>
<tr>
<td class="org-left">Intercept</td>
<td class="org-right">12.351</td>
<td class="org-right">0.621</td>
<td class="org-right">19.876</td>
<td class="org-right">0.0</td>
</tr>

<tr>
<td class="org-left">newspaper</td>
<td class="org-right">0.055</td>
<td class="org-right">0.017</td>
<td class="org-right">3.3</td>
<td class="org-right">0.001</td>
</tr>
</tbody>
</table>


<p>
Figure <a href="#org919687c">20</a> illustrates an example of the least squares fit to a
toy data set with \(p = 2\) predictors. 
</p>


<div id="org919687c" class="figure">
<p><img src="figures/fig3_4.png" alt="fig3_4.png" />
</p>
<p><span class="figure-number">Figure 20: </span>In a three-dimensional setting, with two predictors and one response, the least squares regression line becomes a plane.  The plane is chosen to minimize the sum of the squared vertical distances between each observation (shown in red) and the plane.</p>
</div>


<p>
Table <a href="#org5b6c072">4</a> displays multiple regression coefficient estimates when
TV, radio, and newspaper advertising budgets are used to predict product sales
using <code>Advertising</code> data.
</p>

<table id="org5b6c072" border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">
<caption class="t-bottom"><span class="table-number">Table 4:</span> For the <code>Advertising</code> data, least squares coefficient estimates of the multiple linear regression of number of units sold on radio, TV, and newspaper advertising budgets.</caption>

<colgroup>
<col  class="org-left" />

<col  class="org-right" />

<col  class="org-right" />

<col  class="org-right" />

<col  class="org-right" />
</colgroup>
<thead>
<tr>
<th scope="col" class="org-left">&#xa0;</th>
<th scope="col" class="org-right">Coef.</th>
<th scope="col" class="org-right">Std.Err.</th>
<th scope="col" class="org-right">\(t\)</th>
<th scope="col" class="org-right">\(P > \mid t \mid\)</th>
</tr>
</thead>
<tbody>
<tr>
<td class="org-left">Intercept</td>
<td class="org-right">2.939</td>
<td class="org-right">0.312</td>
<td class="org-right">9.422</td>
<td class="org-right">0.0</td>
</tr>

<tr>
<td class="org-left">TV</td>
<td class="org-right">0.046</td>
<td class="org-right">0.001</td>
<td class="org-right">32.809</td>
<td class="org-right">0.0</td>
</tr>

<tr>
<td class="org-left">radio</td>
<td class="org-right">0.189</td>
<td class="org-right">0.009</td>
<td class="org-right">21.893</td>
<td class="org-right">0.0</td>
</tr>

<tr>
<td class="org-left">newspaper</td>
<td class="org-right">-0.001</td>
<td class="org-right">0.006</td>
<td class="org-right">-0.177</td>
<td class="org-right">0.86</td>
</tr>
</tbody>
</table>

<p>
Table <a href="#org3b4f6f6">5</a> shows the correlation matrix for the three predictor
variables and response variable in table <a href="#org5b6c072">4</a>. 
</p>

<table id="org3b4f6f6" border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">
<caption class="t-bottom"><span class="table-number">Table 5:</span> Correlation matrix for <code>TV</code>, <code>radio</code>, and <code>sales</code> for the <code>Advertising</code> data.</caption>

<colgroup>
<col  class="org-left" />

<col  class="org-right" />

<col  class="org-right" />

<col  class="org-right" />

<col  class="org-right" />
</colgroup>
<thead>
<tr>
<th scope="col" class="org-left">&#xa0;</th>
<th scope="col" class="org-right">TV</th>
<th scope="col" class="org-right">radio</th>
<th scope="col" class="org-right">newspaper</th>
<th scope="col" class="org-right">sales</th>
</tr>
</thead>
<tbody>
<tr>
<td class="org-left">TV</td>
<td class="org-right">1.0</td>
<td class="org-right">0.0548</td>
<td class="org-right">0.0566</td>
<td class="org-right">0.7822</td>
</tr>

<tr>
<td class="org-left">radio</td>
<td class="org-right">0.0548</td>
<td class="org-right">1.0</td>
<td class="org-right">0.3541</td>
<td class="org-right">0.5762</td>
</tr>

<tr>
<td class="org-left">newspaper</td>
<td class="org-right">0.0566</td>
<td class="org-right">0.3541</td>
<td class="org-right">1.0</td>
<td class="org-right">0.2283</td>
</tr>

<tr>
<td class="org-left">sales</td>
<td class="org-right">0.7822</td>
<td class="org-right">0.5762</td>
<td class="org-right">0.2283</td>
<td class="org-right">1.0</td>
</tr>
</tbody>
</table>

<table id="orgef55d94" border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">
<caption class="t-bottom"><span class="table-number">Table 6:</span> More information about the least squares model for the regression of number of units sold on TV, newspaper, and radio advertising budgets in the <code>Advertising</code> data.  Other information about this model was displayed in table <a href="#org5b6c072">4</a>.</caption>

<colgroup>
<col  class="org-left" />

<col  class="org-right" />
</colgroup>
<thead>
<tr>
<th scope="col" class="org-left">Quantity</th>
<th scope="col" class="org-right">Value</th>
</tr>
</thead>
<tbody>
<tr>
<td class="org-left">Residual standard error</td>
<td class="org-right">1.69</td>
</tr>

<tr>
<td class="org-left">\(R^2\)</td>
<td class="org-right">0.897</td>
</tr>

<tr>
<td class="org-left">F-statistic</td>
<td class="org-right">570.0</td>
</tr>
</tbody>
</table>

<p>
Figure <a href="#orgcf78626">21</a> displays a three-dimensional plot of <code>TV</code> and <code>radio</code>
versus <code>sales</code>.  
</p>


<div id="orgcf78626" class="figure">
<p><img src="figures/fig3_5.png" alt="fig3_5.png" />
</p>
<p><span class="figure-number">Figure 21: </span>For the <code>Advertising</code> data, a linear regression fit to <code>sales</code> using <code>TV</code> and <code>radio</code> as predictors.  From the pattern of the residuals, we can see that there is a pronounced non-linear relationship in the data.  The positive residuals tend to lie along the 45-degree line, where TV and Radio budgets are split evenly.  The negative residuals tend to lie away from this line, where budgets are more lopsided.</p>
</div>
</div>
</div>

<div id="outline-container-orge002d68" class="outline-3">
<h3 id="orge002d68"><span class="section-number-3">3.3</span> Other Considerations in the Regression Model</h3>
<div class="outline-text-3" id="text-3-3">
<p>
<code>Credit</code> data set displayed in figure <a href="#orgf02e44c">22</a> records <code>balance</code>
(average credit card debt for a number of individuals) as well as several
quantitative predictors: <code>age</code>, <code>cards</code> (number of credit cards), <code>education</code>
and <code>rating</code> (credit rating).
</p>


<div id="orgf02e44c" class="figure">
<p><img src="figures/fig3_6.png" alt="fig3_6.png" />
</p>
<p><span class="figure-number">Figure 22: </span>The <code>Credit</code> dataset contains information about <code>balance</code>, <code>age</code>, <code>cards</code>, <code>education</code>, <code>income</code>, <code>limit</code>, and <code>rating</code> for a number of potential customers.</p>
</div>


<p>
Table <a href="#org8dbdf5e">7</a> displays the coefficient estimates and other information
associated with the model where <code>gender</code> is the only explanatory variable.
</p>

<table id="org8dbdf5e" border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">
<caption class="t-bottom"><span class="table-number">Table 7:</span> Least squares coefficient estimates associated with the regression of <code>balance</code> onto <code>gender</code> in the <code>Credit</code> data set.</caption>

<colgroup>
<col  class="org-left" />

<col  class="org-right" />

<col  class="org-right" />

<col  class="org-right" />

<col  class="org-right" />
</colgroup>
<thead>
<tr>
<th scope="col" class="org-left">&#xa0;</th>
<th scope="col" class="org-right">Coef.</th>
<th scope="col" class="org-right">Std.Err.</th>
<th scope="col" class="org-right">\(t\)</th>
<th scope="col" class="org-right">\(P > \mid t \mid\)</th>
</tr>
</thead>
<tbody>
<tr>
<td class="org-left">Intercept</td>
<td class="org-right">509.803</td>
<td class="org-right">33.128</td>
<td class="org-right">15.389</td>
<td class="org-right">0.0</td>
</tr>

<tr>
<td class="org-left">Gender[T.Female]</td>
<td class="org-right">19.733</td>
<td class="org-right">46.051</td>
<td class="org-right">0.429</td>
<td class="org-right">0.669</td>
</tr>
</tbody>
</table>


<p>
From table <a href="#org64eba39">8</a> we see that the estimated <code>balance</code> for the
baseline, African American, is $531.0. It is estimated that the
Asian category will have an additional $-18.7 debt, and that the
Caucasian category will have an additional $-12.5 debt compared to
African American category.
</p>

<table id="org64eba39" border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">
<caption class="t-bottom"><span class="table-number">Table 8:</span> Least squares coefficient estimates associated with the regression of <code>balance</code> onto <code>ethnicity</code> in the <code>Credit</code> data set.</caption>

<colgroup>
<col  class="org-left" />

<col  class="org-right" />

<col  class="org-right" />

<col  class="org-right" />

<col  class="org-right" />
</colgroup>
<thead>
<tr>
<th scope="col" class="org-left">&#xa0;</th>
<th scope="col" class="org-right">Coef.</th>
<th scope="col" class="org-right">Std.Err.</th>
<th scope="col" class="org-right">\(t\)</th>
<th scope="col" class="org-right">\(P > \mid t \mid\)</th>
</tr>
</thead>
<tbody>
<tr>
<td class="org-left">Intercept</td>
<td class="org-right">531.0</td>
<td class="org-right">46.319</td>
<td class="org-right">11.464</td>
<td class="org-right">0.0</td>
</tr>

<tr>
<td class="org-left">Ethnicity[T.Asian]</td>
<td class="org-right">-18.686</td>
<td class="org-right">65.021</td>
<td class="org-right">-0.287</td>
<td class="org-right">0.774</td>
</tr>

<tr>
<td class="org-left">Ethnicity[T.Caucasian]</td>
<td class="org-right">-12.503</td>
<td class="org-right">56.681</td>
<td class="org-right">-0.221</td>
<td class="org-right">0.826</td>
</tr>
</tbody>
</table>


<p>
Table <a href="#orgc8f9468">9</a> shows results of regressing <code>sales</code> and <code>TV</code> and <code>radio</code>
when an interaction term is included.  Coefficient of interaction term
<code>TV:radio</code> is highly significant.
</p>


<p>
In figure <a href="#orgd28263d">23</a>, the left panel shows least squares lines when
we predict <code>balance</code> using <code>income</code> (quantitative) and <code>student</code> (qualitative
variables). There is no interaction term between <code>income</code> and <code>student</code>.  The
right panel shows least squares lines when an interaction term is included. 
</p>

<table id="orgc8f9468" border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">
<caption class="t-bottom"><span class="table-number">Table 9:</span> For <code>Advertising</code> data, least squares coefficient estimates associated with the regression of <code>sales</code> onto <code>TV</code> and <code>radio</code>, with an interaction term.</caption>

<colgroup>
<col  class="org-left" />

<col  class="org-right" />

<col  class="org-right" />

<col  class="org-right" />

<col  class="org-right" />
</colgroup>
<thead>
<tr>
<th scope="col" class="org-left">&#xa0;</th>
<th scope="col" class="org-right">Coef.</th>
<th scope="col" class="org-right">Std.Err.</th>
<th scope="col" class="org-right">\(t\)</th>
<th scope="col" class="org-right">\(P > \mid t \mid\)</th>
</tr>
</thead>
<tbody>
<tr>
<td class="org-left">Intercept</td>
<td class="org-right">6.75</td>
<td class="org-right">0.248</td>
<td class="org-right">27.233</td>
<td class="org-right">0.0</td>
</tr>

<tr>
<td class="org-left">TV</td>
<td class="org-right">0.019</td>
<td class="org-right">0.002</td>
<td class="org-right">12.699</td>
<td class="org-right">0.0</td>
</tr>

<tr>
<td class="org-left">radio</td>
<td class="org-right">0.029</td>
<td class="org-right">0.009</td>
<td class="org-right">3.241</td>
<td class="org-right">0.001</td>
</tr>

<tr>
<td class="org-left">TV:radio</td>
<td class="org-right">0.001</td>
<td class="org-right">0.0</td>
<td class="org-right">20.727</td>
<td class="org-right">0.0</td>
</tr>
</tbody>
</table>


<div id="orgd28263d" class="figure">
<p><img src="figures/fig3_7.png" alt="fig3_7.png" />
</p>
<p><span class="figure-number">Figure 23: </span>For the <code>Credit</code> data, the least squares lines are shown for prediction of <code>balance</code> from <code>income</code> for students and non-students.  Left: There is no interaction between <code>income</code> and <code>student</code>.  Right: There is an interaction term between <code>income</code> and <code>students</code>.</p>
</div>

<p>
Figure <a href="#org8309859">24</a> shows a scatter plot of <code>mpg</code> (gas mileage in miles per
gallon) versus <code>horsepower</code> in the <code>Auto</code> data set.  The figure also includes
least squares fit line for linear, second degree, and fifth degree polynomials
in <code>horsepower</code>. 
</p>


<div id="org8309859" class="figure">
<p><img src="figures/fig3_8.png" alt="fig3_8.png" />
</p>
<p><span class="figure-number">Figure 24: </span>The <code>Auto</code> data set.  For a number of cars, <code>mpg</code> and <code>horsepower</code> are shown.  The linear regression fit is shown in orange.  The linear regression fit for a model that includes first- and second-order terms of <code>horsepower</code> is shown as blue curve.  The linear regression fit for a model that includes all polynomials of <code>horsepower</code> up to fifth-degree is shown in green.</p>
</div>


<p>
Table <a href="#org62ea4fc">10</a> shows regression results of a quadratic fit to explain
<code>mpg</code> as a function of <code>horsepower</code> and \(\mathttt{horsepower^2}\).  
</p>


<table id="org62ea4fc" border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">
<caption class="t-bottom"><span class="table-number">Table 10:</span> For the <code>Auto</code> data set, least squares coefficient estimates associated with the regression of <code>mpg</code> onto <code>horsepower</code> and \(\texttt{horsepower^2}\).</caption>

<colgroup>
<col  class="org-left" />

<col  class="org-right" />

<col  class="org-right" />

<col  class="org-right" />

<col  class="org-right" />
</colgroup>
<thead>
<tr>
<th scope="col" class="org-left">&#xa0;</th>
<th scope="col" class="org-right">Coef.</th>
<th scope="col" class="org-right">Std.Err.</th>
<th scope="col" class="org-right">\(t\)</th>
<th scope="col" class="org-right">\(P > \mid t \mid\)</th>
</tr>
</thead>
<tbody>
<tr>
<td class="org-left">Intercept</td>
<td class="org-right">56.9001</td>
<td class="org-right">1.8004</td>
<td class="org-right">31.6037</td>
<td class="org-right">0.0</td>
</tr>

<tr>
<td class="org-left">horsepower</td>
<td class="org-right">-0.4662</td>
<td class="org-right">0.0311</td>
<td class="org-right">-14.9782</td>
<td class="org-right">0.0</td>
</tr>

<tr>
<td class="org-left">\(horsepower^2\)</td>
<td class="org-right">0.0012</td>
<td class="org-right">0.0001</td>
<td class="org-right">10.0801</td>
<td class="org-right">0.0</td>
</tr>
</tbody>
</table>


<p>
The left panel of figure <a href="#orgceb3241">25</a> displays a residual plot from the
linear regression of <code>mpg</code> onto <code>horsepower</code> on the <code>Auto</code> data set.  The red
line is a smooth fit to the residuals, which is displayed in order to make it
easier to identify any trends.  The residuals exhibit a clear U-shape, which
strongly suggests non-linearity in the data.  In contrast, the right hand panel
of figure<a href="#orgceb3241">25</a> displays the residual plot results from the model
which contains a quadratic term in <code>horsepower</code>.  Now there is little pattern in
residuals, suggesting that the quadratic term improves the fit to the data.
</p>


<div id="orgceb3241" class="figure">
<p><img src="figures/fig3_9.png" alt="fig3_9.png" />
</p>
<p><span class="figure-number">Figure 25: </span>Plots of residuals versus predicted (or fitted) values for the <code>Auto</code> data set.  In each plot, the red line is a smooth fit to the residuals, intended to make it easier to identify a trend.  Left: A linear regression of <code>mpg</code> on <code>horsepower</code>.  A strong pattern in the residuals indicates non-linearity in the data.  Right: A linear regression of <code>mpg</code> on <code>horsepower</code> and square of <code>horsepower</code>.  Now there is little pattern in the residuals.</p>
</div>


<p>
Figure <a href="#orgacb607b">26</a> provides an illustration of correlations among
residuals.  In the top panel, we see the residuals from a linear regression fit
to data generated with uncorrelated errors.  There is no evidence of
time-related trend in the residuals.  In contrast, the residuals in the bottom
panel are from a data set in which adjacent errors had a correlation of 0.9.
Now there is a clear pattern in the residuals - adjacent residuals tend to take
on similar values.  Finally, the center panel illustrates a more moderate case
in which the residuals had a correlation of 0.5.  There is still evidence of
tracking, but the pattern is less pronounced. 
</p>


<div id="orgacb607b" class="figure">
<p><img src="figures/fig3_10.png" alt="fig3_10.png" />
</p>
<p><span class="figure-number">Figure 26: </span>Plots of residuals from simulated time series data sets generated with differeing levels of correlation \(\rho\) between error terms for adjacent time points.</p>
</div>


<p>
In the left-hand panel of figure <a href="#org4541f84">27</a>, the magnitude of the
residuals tends to increase with the fitted values.  The right hand panel
displays residual plot after transforming the response using \(\log(Y)\).  The
residuals now appear to have constant variance, although there is some evidence
of a non-linear relationship in the data.
</p>


<div id="org4541f84" class="figure">
<p><img src="figures/fig3_11.png" alt="fig3_11.png" />
</p>
<p><span class="figure-number">Figure 27: </span>Residual plots.  The red line, a smooth fit to the residuals, is intended to make it easier to identify a trend.  The blue lines track \(5^{th}\) and \(95^{th}\) percentiles of the residuals, and emphasize patterns.  Left: The funnel shape indicates heteroscedasticity.  Right: the response has been log transformed, and now there is no evidence of heteroscedasticity.</p>
</div>

<p>
The red point (observation 20) in the left hand panel of figure
<a href="#org8666126">28</a> illustrates a typical outlier.  The red solid line is the
least squares regression fit, while the blue dashed line is the least squares
fit after removal of the outlier.  In this case, removal of outlier has little
effect on the least squares line.  In the center panel of figure
<a href="#org8666126">28</a>, the outlier is clearly visible.  In practice, to decide if
the outlier is sufficiently big to be considered an outlier, we can plot
<i>studentized residuals</i>, computed by dividing each residual \(\epsilon_i\) by its
estimated standard error.  These are shown in the right hand panel. 
</p>


<div id="org8666126" class="figure">
<p><img src="figures/fig3_12.png" alt="fig3_12.png" />
</p>
<p><span class="figure-number">Figure 28: </span>Left: The least squares regression line is shown in red.  The regression line after removing the outlier is is shown in blue.  Center: The residual plot clearly identifies the outlier.  Right: The outlier has a studentized residual of 6; typically we expect values between -3 and 3.</p>
</div>

<p>
Observation 41 in the left-hand panel in figure <a href="#orgad8c967">29</a> has
high leverage, in that the predictor value for this observation is large
relative to the other observations.  The data displayed in figure
<a href="#orgad8c967">29</a> are the same as the data displayed in figure
<a href="#org8666126">28</a>, except for the addition of a single high leverage
observation<sup><a id="fnr.1" class="footref" href="#fn.1">1</a></sup>.  The red solid line is the least squares fit to the data,
while the blue dashed line is the fit produced when observation 41 is
removed. Comparing the left-hand panels of figures <a href="#org8666126">28</a> and
<a href="#orgad8c967">29</a>, we observe that removing the high leverage observation has a
much more substantial impact on least squares line than removing the outlier.
The center panel of figure <a href="#orgad8c967">29</a>, for a data set with two
predictors \(X_1\) and \(X_2\). While most of the observations' predictor values
fall within the region of blue dashed lines, the red observation is well outside this
range. But neither the value for \(X_1\) nor the value for \(X_2\) is unusual.  So
if we examine just \(X_1\) or \(X_2\), we will not notice this high leverage
point. The right-panel of figure <a href="#orgad8c967">29</a> provides a plot of
studentized residuals versus \(h_i\) for the data in the left hand panel.
Observation 41 stands out as having a very high leverage statistic as well as a
high studentized residual.
</p>


<div id="orgad8c967" class="figure">
<p><img src="figures/fig3_13.png" alt="fig3_13.png" />
</p>
<p><span class="figure-number">Figure 29: </span>Left: Observation 41 is a high leverage point, while 20 is not.  The red line is the fit to all the data, and the blue line is the fit with observation 41 removed.  Center: The red observation is not unusual in terms of its \(X_1\) value or its \(X_2\) value, but still falls outside the bulk of the data, and hence has high leverage.  Right: Observation 41 has a high leverage and a high residual.</p>
</div>

<p>
Figure <a href="#org8feb139">30</a> illustrates the concept of collinearity.
</p>


<div id="org8feb139" class="figure">
<p><img src="figures/fig3_14.png" alt="fig3_14.png" />
</p>
<p><span class="figure-number">Figure 30: </span>Scatter plots of the observations from the <code>Credit</code> data set.  Left: A plot of <code>age</code> versus <code>limit</code>.  These two variables not collinear.  Right: A plot of <code>rating</code> versus <code>limit</code>.  There is high collinearity.</p>
</div>


<p>
Figure <a href="#org5d18c5d">31</a> illustrates some of the difficulties that can result
from collinearity.  The left panel is a contour plot of the RSS associated with
different possible coefficient estimates for the regression of <code>balance</code> on
<code>limit</code> and <code>age</code>.  Each ellipse represents a set of coefficients that
correspond to the same RSS, with ellipses nearest to the center taking on the
lowest values of RSS.  The black dot and the associated dashed lines represent
the coefficient estimates that result in the smallest possible RSS.  The axes
for <code>limit</code> and <code>age</code> have been scaled so that the plot includes possible
coefficients that are up to four standard errors on either side of the least
squares estimates.  We see that the true <code>limit</code> coefficient is almost certainly
between 0.15 and 0.20.
</p>

<p>
In contrast, the right hand panel of figure <a href="#org5d18c5d">31</a> displays contour
plots of the RSS associated with possible coefficient estimates for the
regression of <code>balance</code> onto <code>limit</code> and <code>rating</code>, which we know to be highly
collinear.  Now the contours run along a narrow valley; there is a broad range
of values for the coefficient estimates that result in equal values for RSS.  
</p>


<div id="org5d18c5d" class="figure">
<p><img src="figures/fig3_15.png" alt="fig3_15.png" />
</p>
<p><span class="figure-number">Figure 31: </span>Contour plots for the RSS values as a function of the parameters \(\beta\) for various regressions involving the <code>Credit</code> data set.  In each plot, the black dots represent the coefficient values corresponding to the minimum RSS.  Left: A contour plot of RSS for the regression of <code>balance</code> onto <code>age</code> and <code>limit</code>.  The minimum value is well defined.  Right: A contour plot of RSS for the regression of <code>balance</code> onto <code>rating</code> and <code>limit</code>.  Because of the collinearity, there are many pairs \((\beta_{Limit}, \beta_{Rating})\) with a similar value for RSS.</p>
</div>

<p>
Table <a href="#org55ad4fd">11</a> compares the coefficient estimates obtained from two
separate multiple regression models.  The first is a regression of <code>balance</code> on
<code>age</code> and <code>limit</code>.  The second is a regression of <code>balance</code> on <code>rating</code> and
<code>limit</code>.  In the first regression, both <code>age</code> and <code>limit</code> are highly significant
with very small p-values.  In the second, the collinearity between <code>limit</code> and
<code>rating</code> has caused the standard error for the <code>limit</code> coefficient to increase
by a factor of 12 and the p-value to increase to 0.701. In other words, the
importance of the <code>limit</code> variable has been masked due to the presence of
collinearity.  
</p>

<table id="org55ad4fd" border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">
<caption class="t-bottom"><span class="table-number">Table 11:</span> The results for two multiple regression models involving the <code>Credit</code> data set.  The top panel is a regression of <code>balance</code> on <code>age</code> and <code>limit</code>.  The bottom panel is a regression of <code>balance</code> on <code>rating</code> and <code>limit</code>.  The standard error of \(\hat{\beta}_{Limit}\) increases 12-fold in the second regression, due to collinearity.</caption>

<colgroup>
<col  class="org-left" />

<col  class="org-right" />

<col  class="org-right" />

<col  class="org-right" />

<col  class="org-right" />
</colgroup>
<thead>
<tr>
<th scope="col" class="org-left">&#xa0;</th>
<th scope="col" class="org-right">Coef.</th>
<th scope="col" class="org-right">Std.Err.</th>
<th scope="col" class="org-right">\(t\)</th>
<th scope="col" class="org-right">\(P > \mid t \mid\)</th>
</tr>
</thead>
<tbody>
<tr>
<td class="org-left">Intercept</td>
<td class="org-right">-173.411</td>
<td class="org-right">43.828</td>
<td class="org-right">-3.957</td>
<td class="org-right">0.0</td>
</tr>

<tr>
<td class="org-left">Age</td>
<td class="org-right">-2.291</td>
<td class="org-right">0.672</td>
<td class="org-right">-3.407</td>
<td class="org-right">0.001</td>
</tr>

<tr>
<td class="org-left">Limit</td>
<td class="org-right">0.173</td>
<td class="org-right">0.005</td>
<td class="org-right">34.496</td>
<td class="org-right">0.0</td>
</tr>
</tbody>
<tbody>
<tr>
<td class="org-left">Intercept</td>
<td class="org-right">-377.537</td>
<td class="org-right">45.254</td>
<td class="org-right">-8.343</td>
<td class="org-right">0.0</td>
</tr>

<tr>
<td class="org-left">Rating</td>
<td class="org-right">2.202</td>
<td class="org-right">0.952</td>
<td class="org-right">2.312</td>
<td class="org-right">0.021</td>
</tr>

<tr>
<td class="org-left">Limit</td>
<td class="org-right">0.025</td>
<td class="org-right">0.064</td>
<td class="org-right">0.384</td>
<td class="org-right">0.701</td>
</tr>
</tbody>
</table>
</div>
</div>

<div id="outline-container-orge5b7046" class="outline-3">
<h3 id="orge5b7046"><span class="section-number-3">3.4</span> The Marketing Plan</h3>
</div>

<div id="outline-container-orge13beec" class="outline-3">
<h3 id="orge13beec"><span class="section-number-3">3.5</span> Comparison of Linear Regression with K-Nearest Neighbors</h3>
<div class="outline-text-3" id="text-3-5">
<p>
Figure <a href="#orgd2e4567">32</a> illustrates two KNN fits on a data set with \(p = 2\)
predictors. The fit with \(K = 1\) is shown in the left-hand panel, while the
right-hand panel displays the fit with \(K = 9\).  When \(K = 1\), the KNN fit
perfectly interpolates the training observations, and consequently takes the
form of a step function. When \(K = 9\), the KNN fit is still a step function, but
averaging over nine observations results in much smaller regions of constant
prediction, and consequently a smoother fit.  
</p>


<div id="orgd2e4567" class="figure">
<p><img src="figures/fig3_16.png" alt="fig3_16.png" />
</p>
<p><span class="figure-number">Figure 32: </span>Plots of \(\hat{f}(X)\) using KNN regression on two-dimensional data set with 64 observations (brown dots).  Left: \(K = 1\) results in a rough step function fit.  Right: \(K = 9\) produces a much smoother fit.</p>
</div>

<p>
Figure <a href="#org61f207b">33</a> provides an example of KNN regression with data
generated from a one-dimensional regression model.  the black dashed lines
represent \(f(X)\), while the blue curves correspond to the KNN fits using \(K = 1\)
and \(K = 9\).  In this case, the \(K = 1\) predictions are far too variable, while
the smoother \(K = 9\) fit is much closer to \(f(X)\).  
</p>


<div id="org61f207b" class="figure">
<p><img src="figures/fig3_17.png" alt="fig3_17.png" />
</p>
<p><span class="figure-number">Figure 33: </span>Plots of \(\hat{f}(X)\) using KNN regression on a one-dimensional data set with 50 observations.  The true relationship is given by the black dashed line.  Left: The blue curve corresponds to \(K = 1\) and interpolates (i.e., passes directly through) training data.  Right: The blue curve corresponds to \(K = 9\), and represents a smoother fit.</p>
</div>

<p>
Figure <a href="#org5f3fc3e">34</a> represents the linear regression fit to the same
data.  It is almost perfect.  The right hand panel of figure <a href="#org5f3fc3e">34</a>
reveals that linear regression outperforms KNN for this data.  The green line,
plotted as a function of \(\frac{1}{K}\), represents the test set mean squared
error (MSE) for KNN.  The KNN errors are well above the horizontal dashed line,
which is the test MSE for linear regression.
</p>


<div id="org5f3fc3e" class="figure">
<p><img src="figures/fig3_18.png" alt="fig3_18.png" />
</p>
<p><span class="figure-number">Figure 34: </span>The same data set shown in figure <a href="#org61f207b">33</a> is investigated further.  Left: The blue dashed line is the least squares fit to the data.  Since \(f(X)\) is in fact linear (displayed in black line), the least squares regression line provides a very good estimate of \(f(X)\).  Right: The dashed horizontal line represents the least squares test set MSE, while the green line corresponds to the MSE for KNN as a function of \(\frac{1}{K}\).  Linear regression achieves a lower test MSE than does KNN regression, since \(f(X)\) is in fact linear.</p>
</div>


<p>
Figure <a href="#org8b8ad52">35</a> examines the relative performances of least squares
regression and KNN under increasing levels of non-linearity in the relationship
between \(X\) and \(Y\).  In the top row, the true relationship is nearly linear.
In this case, we see that the test MSE for linear regression is still superior
to that of KNN for low values of \(K\) (far right).  However, as \(K\) increases,
KNN outperforms linear regression.  The second row illustrates a more
substantial deviation from linearity.  In this situation, KNN substantially
outperforms linear regression for all values of \(K\).  
</p>


<div id="org8b8ad52" class="figure">
<p><img src="figures/fig3_19.png" alt="fig3_19.png" />
</p>
<p><span class="figure-number">Figure 35: </span>Top Left: In a setting with a slightly non-linear relationship between \(X\) and \(Y\) (solid black line), the KNN fits with \(K = 1\) (blue) and \(K = 9\) (red) are displayed.  Top Right: For the slightly non-linear data,the test set MSE for least squares regression (horizontal) and KNN with various values of \(\frac{1}{K}\) (green) are displayed.  Bottom Left and Bottom Right: As in the top panel, but with a strongly non-linear relationship between \(X\) and \(Y\).</p>
</div>


<p>
Figure <a href="#orgda9b45f">36</a> considers the same strongly non-linear situation as in the lower
panel of figure <a href="#org8b8ad52">35</a>, except that we have added additional <i>noise</i>
predictors that are not associated with the response.  When \(p = 1\) or \(p = 2\),
KNN outperforms linear regression.  But as we increase \(p\), linear regression
becomes superior to KNN.  In fact, increase in dimensionality has only caused a
small increase in linear regression test set MSE, but it has caused a much
bigger increase in the MSE for KNN.
</p>


<div id="orgda9b45f" class="figure">
<p><img src="figures/fig3_20.png" alt="fig3_20.png" />
</p>
<p><span class="figure-number">Figure 36: </span>Test MSE for linear regressions (black horizontal lines) and KNN (green curves) as the number of variables \(p\) increases.  The true function is non-linear in the first variable, as in the lower panel in figure <a href="#org8b8ad52">35</a>, and does not depend upon the additional variables. The performance of linear regression deteriorates slowly in the presense of these additional variables, whereas KNN's performance degrades more quickly as \(p\) increases.</p>
</div>
</div>
</div>

<div id="outline-container-org3834f17" class="outline-3">
<h3 id="org3834f17"><span class="section-number-3">3.6</span> Lab: Linear Regression</h3>
<div class="outline-text-3" id="text-3-6">
</div>
<div id="outline-container-org42774ca" class="outline-4">
<h4 id="org42774ca"><span class="section-number-4">3.6.1</span> Libraries</h4>
<div class="outline-text-4" id="text-3-6-1">
<p>
The <code>import</code> function, along with an optional <code>as</code>, is used to load <i>libraries</i>.
Before a library can be loaded, it must be installed on the system. 
</p>

<div class="org-src-container">
<pre class="src src-python"><span style="color: #0000FF;">import</span> numpy <span style="color: #0000FF;">as</span> np
<span style="color: #0000FF;">import</span> statsmodels.formula.api <span style="color: #0000FF;">as</span> smf
</pre>
</div>
</div>
</div>

<div id="outline-container-orgef16a59" class="outline-4">
<h4 id="orgef16a59"><span class="section-number-4">3.6.2</span> Simple Linear Regression</h4>
<div class="outline-text-4" id="text-3-6-2">
<p>
We load <code>Boston</code> data set from <code>R</code> library <code>MASS</code>.  Then we use <code>ols</code> function
from <code>statsmodels.formula.api</code> to fit simple linear regression model, with
<code>medv</code> as response and <code>lstat</code> as the predictor.
</p>

<p>
Function <code>summary2()</code> gives some basic information about the model.  We can use
<code>dir()</code> to find out what other pieces of information are stored in <code>lm_fit</code>.
The <code>predict()</code> function can be used to produce prediction of <code>medv</code> for a given
value of <code>lstat</code>. 
</p>

<div class="org-src-container">
<pre class="src src-python" id="org7a0c41d"><span style="color: #0000FF;">import</span> statsmodels.formula.api <span style="color: #0000FF;">as</span> smf
<span style="color: #0000FF;">from</span> statsmodels <span style="color: #0000FF;">import</span> datasets

<span style="color: #BA36A5;">boston</span> = datasets.get_rdataset(<span style="color: #008000;">'Boston'</span>, <span style="color: #008000;">'MASS'</span>).data
<span style="color: #0000FF;">print</span>(boston.columns)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'--------'</span>)

<span style="color: #BA36A5;">lm_reg</span> = smf.ols(formula=<span style="color: #008000;">'medv ~ lstat'</span>, data=boston)
<span style="color: #BA36A5;">lm_fit</span> = lm_reg.fit()
<span style="color: #0000FF;">print</span>(lm_fit.summary2())
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'------'</span>)

<span style="color: #0000FF;">print</span>(<span style="color: #006FE0;">dir</span>(lm_fit))
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'------'</span>)

<span style="color: #0000FF;">print</span>(lm_fit.predict(exog=<span style="color: #006FE0;">dict</span>(lstat=[5, 10, 15])))
</pre>
</div>

<pre class="example">
Index(['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis', 'rad', 'tax',
       'ptratio', 'black', 'lstat', 'medv'],
      dtype='object')
--------
                 Results: Ordinary least squares
==================================================================
Model:              OLS              Adj. R-squared:     0.543    
Dependent Variable: medv             AIC:                3286.9750
Date:               2019-05-28 14:10 BIC:                3295.4280
No. Observations:   506              Log-Likelihood:     -1641.5  
Df Model:           1                F-statistic:        601.6    
Df Residuals:       504              Prob (F-statistic): 5.08e-88 
R-squared:          0.544            Scale:              38.636   
-------------------------------------------------------------------
               Coef.   Std.Err.     t      P&gt;|t|    [0.025   0.975]
-------------------------------------------------------------------
Intercept     34.5538    0.5626   61.4151  0.0000  33.4485  35.6592
lstat         -0.9500    0.0387  -24.5279  0.0000  -1.0261  -0.8740
------------------------------------------------------------------
Omnibus:             137.043       Durbin-Watson:          0.892  
Prob(Omnibus):       0.000         Jarque-Bera (JB):       291.373
Skew:                1.453         Prob(JB):               0.000  
Kurtosis:            5.319         Condition No.:          30     
==================================================================

------
['HC0_se', 'HC1_se', 'HC2_se', 'HC3_se', '_HCCM', '__class__', '__delattr__', 
'__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', 
'__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', 
'__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', 
'__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', 
'__subclasshook__', '__weakref__', '_cache', '_data_attr', 
'_get_robustcov_results', '_is_nested', '_wexog_singular_values', 'aic', 
'bic', 'bse', 'centered_tss', 'compare_f_test', 'compare_lm_test', 
'compare_lr_test', 'condition_number', 'conf_int', 'conf_int_el', 'cov_HC0', 
'cov_HC1', 'cov_HC2', 'cov_HC3', 'cov_kwds', 'cov_params', 'cov_type', 
'df_model', 'df_resid', 'eigenvals', 'el_test', 'ess', 'f_pvalue', 'f_test', 
'fittedvalues', 'fvalue', 'get_influence', 'get_prediction', 
'get_robustcov_results', 'initialize', 'k_constant', 'llf', 'load', 'model', 
'mse_model', 'mse_resid', 'mse_total', 'nobs', 'normalized_cov_params', 
'outlier_test', 'params', 'predict', 'pvalues', 'remove_data', 'resid', 
'resid_pearson', 'rsquared', 'rsquared_adj', 'save', 'scale', 'ssr', 
'summary', 'summary2', 't_test', 't_test_pairwise', 'tvalues', 
'uncentered_tss', 'use_t', 'wald_test', 'wald_test_terms', 'wresid']
------
0    29.803594
1    25.053347
2    20.303101
dtype: float64
</pre>

<p>
We will now plot <code>medv</code> and <code>lstat</code> along with least squares regression line.
</p>

<div class="org-src-container">
<pre class="src src-python"><span style="color: #0000FF;">import</span> statsmodels.formula.api <span style="color: #0000FF;">as</span> smf
<span style="color: #0000FF;">from</span> statsmodels <span style="color: #0000FF;">import</span> datasets

<span style="color: #BA36A5;">boston</span> = datasets.get_rdataset(<span style="color: #008000;">'Boston'</span>, <span style="color: #008000;">'MASS'</span>).data
<span style="color: #0000FF;">print</span>(boston.columns)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'--------'</span>)

<span style="color: #BA36A5;">lm_reg</span> = smf.ols(formula=<span style="color: #008000;">'medv ~ lstat'</span>, data=boston)
<span style="color: #BA36A5;">lm_fit</span> = lm_reg.fit()
<span style="color: #0000FF;">print</span>(lm_fit.summary2())
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'------'</span>)

<span style="color: #0000FF;">print</span>(<span style="color: #006FE0;">dir</span>(lm_fit))
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'------'</span>)

<span style="color: #0000FF;">print</span>(lm_fit.predict(exog=<span style="color: #006FE0;">dict</span>(lstat=[5, 10, 15])))
<span style="color: #0000FF;">import</span> statsmodels.api <span style="color: #0000FF;">as</span> sm
<span style="color: #0000FF;">import</span> matplotlib.pyplot <span style="color: #0000FF;">as</span> plt

<span style="color: #BA36A5;">fig</span> = plt.figure()
<span style="color: #BA36A5;">ax</span> = fig.add_subplot(111)
boston.plot(x=<span style="color: #008000;">'lstat'</span>, y=<span style="color: #008000;">'medv'</span>, alpha=0.7, ax=ax)
sm.graphics.abline_plot(model_results=lm_fit, ax=ax, c=<span style="color: #008000;">'r'</span>)

</pre>
</div>

<p>
Next we examine some diagnostic plots.  
</p>

<div class="org-src-container">
<pre class="src src-python"><span style="color: #0000FF;">import</span> statsmodels.formula.api <span style="color: #0000FF;">as</span> smf
<span style="color: #0000FF;">from</span> statsmodels <span style="color: #0000FF;">import</span> datasets

<span style="color: #BA36A5;">boston</span> = datasets.get_rdataset(<span style="color: #008000;">'Boston'</span>, <span style="color: #008000;">'MASS'</span>).data
<span style="color: #0000FF;">print</span>(boston.columns)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'--------'</span>)

<span style="color: #BA36A5;">lm_reg</span> = smf.ols(formula=<span style="color: #008000;">'medv ~ lstat'</span>, data=boston)
<span style="color: #BA36A5;">lm_fit</span> = lm_reg.fit()
<span style="color: #0000FF;">print</span>(lm_fit.summary2())
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'------'</span>)

<span style="color: #0000FF;">print</span>(<span style="color: #006FE0;">dir</span>(lm_fit))
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'------'</span>)

<span style="color: #0000FF;">print</span>(lm_fit.predict(exog=<span style="color: #006FE0;">dict</span>(lstat=[5, 10, 15])))
<span style="color: #0000FF;">import</span> statsmodels.api <span style="color: #0000FF;">as</span> sm
<span style="color: #0000FF;">from</span> statsmodels.nonparametric.smoothers_lowess <span style="color: #0000FF;">import</span> lowess
<span style="color: #0000FF;">import</span> matplotlib.pyplot <span style="color: #0000FF;">as</span> plt
<span style="color: #0000FF;">import</span> numpy <span style="color: #0000FF;">as</span> np

<span style="color: #BA36A5;">fig</span> = plt.figure()
<span style="color: #BA36A5;">ax1</span> = fig.add_subplot(221)
ax1.scatter(lm_fit.fittedvalues, lm_fit.resid, s=5, c=<span style="color: #008000;">'b'</span>, alpha=0.6)
ax1.axhline(y=0, linestyle=<span style="color: #008000;">'--'</span>, c=<span style="color: #008000;">'r'</span>)
<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">resid_lowess_fit = lowess(endog=lm_fit.resid, exog=lm_fit.fittedvalues,</span>
<span style="color: #8D8D84;">#                           </span><span style="color: #8D8D84; font-style: italic;">is_sorted=True)</span>
<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">ax1.plot(resid_lowess_fit[:,0], resid_lowess_fit[:,1]) </span>
ax1.set_xlabel(<span style="color: #008000;">'Fitted values'</span>)
ax1.set_ylabel(<span style="color: #008000;">'Residuals'</span>)
ax1.set_title(<span style="color: #008000;">'Residuals vs Fitted'</span>)

<span style="color: #BA36A5;">ax2</span>=fig.add_subplot(222)
sm.graphics.qqplot(lm_fit.resid, ax=ax2, markersize=3, line=<span style="color: #008000;">'s'</span>,
                   linestyle=<span style="color: #008000;">'--'</span>, fit=<span style="color: #D0372D;">True</span>, alpha=0.4)
ax2.set_ylabel(<span style="color: #008000;">'Standardized residuals'</span>)
ax2.set_title(<span style="color: #008000;">'Normal Q-Q'</span>)

<span style="color: #BA36A5;">influence</span> = lm_fit.get_influence()
<span style="color: #BA36A5;">standardized_resid</span> = influence.resid_studentized_internal
<span style="color: #BA36A5;">ax3</span> = fig.add_subplot(223)
ax3.scatter(lm_fit.fittedvalues, np.sqrt(np.<span style="color: #006FE0;">abs</span>(standardized_resid)), s=5,
            alpha=0.4, c=<span style="color: #008000;">'b'</span>)
ax3.set_xlabel(<span style="color: #008000;">'Fitted values'</span>)
ax3.set_ylabel(r<span style="color: #008000;">'$\sqrt{\mid Standardized\; residuals \mid}$'</span>)
ax3.set_title(<span style="color: #008000;">'Scale-Location'</span>)

<span style="color: #BA36A5;">ax4</span> = fig.add_subplot(224)
sm.graphics.influence_plot(lm_fit, size=2, alpha=0.4, c=<span style="color: #008000;">'b'</span>,  ax=ax4)
ax4.xaxis.label.set_size(10)
ax4.yaxis.label.set_size(10)
ax4.title.set_size(12)
ax4.set_xlim(0, 0.03)
<span style="color: #0000FF;">for</span> txt <span style="color: #0000FF;">in</span> ax4.texts:
    txt.set_visible(<span style="color: #D0372D;">False</span>)
ax4.axhline(y=0, linestyle=<span style="color: #008000;">'--'</span>, color=<span style="color: #008000;">'grey'</span>)

fig.tight_layout()
</pre>
</div>
</div>
</div>

<div id="outline-container-org3f68780" class="outline-4">
<h4 id="org3f68780"><span class="section-number-4">3.6.3</span> Multiple Linear Regression</h4>
<div class="outline-text-4" id="text-3-6-3">
<p>
In order to fit a multiple regression model using least squares, we again use
the <code>ols</code> and <code>fit</code> functions.  The syntax <code>ols(formula='y ~ x1 + x2 + x3')</code> is
used to fit a model with three predictors, <code>x1</code>, <code>x2</code>, and <code>x3</code>.  The
<code>summary2()</code> now outputs the regression coefficients for all three predictors. 
</p>

<p>
<code>statsmodels</code> does not seem to have <code>R</code> like facility to include all variables
using the formula <code>y ~ .</code>.  To include all variables, we either write them
individually, or use code to create a formula.
</p>

<div class="org-src-container">
<pre class="src src-python"><span style="color: #0000FF;">import</span> statsmodels.formula.api <span style="color: #0000FF;">as</span> smf
<span style="color: #0000FF;">from</span> statsmodels <span style="color: #0000FF;">import</span> datasets

<span style="color: #BA36A5;">boston</span> = datasets.get_rdataset(<span style="color: #008000;">'Boston'</span>, <span style="color: #008000;">'MASS'</span>).data

<span style="color: #BA36A5;">lm_reg</span> = smf.ols(formula=<span style="color: #008000;">'medv ~ lstat + age'</span>, data=boston)
<span style="color: #BA36A5;">lm_fit</span> = lm_reg.fit()

<span style="color: #0000FF;">print</span>(lm_fit.summary2())
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'--------'</span>)

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Create formula to include all variables</span>
<span style="color: #BA36A5;">all_columns</span> = <span style="color: #006FE0;">list</span>(boston.columns)
all_columns.remove(<span style="color: #008000;">'medv'</span>)
<span style="color: #BA36A5;">my_formula</span> = <span style="color: #008000;">'medv ~ '</span> + <span style="color: #008000;">' + '</span>.join(all_columns)
<span style="color: #0000FF;">print</span>(my_formula)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'--------'</span>)

<span style="color: #BA36A5;">all_reg</span> = smf.ols(formula=my_formula, data=boston)
<span style="color: #BA36A5;">all_fit</span> = all_reg.fit()
<span style="color: #0000FF;">print</span>(all_fit.summary2())
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'--------'</span>)
</pre>
</div>

<pre class="example">
                 Results: Ordinary least squares
==================================================================
Model:              OLS              Adj. R-squared:     0.549    
Dependent Variable: medv             AIC:                3281.0064
Date:               2019-05-29 10:07 BIC:                3293.6860
No. Observations:   506              Log-Likelihood:     -1637.5  
Df Model:           2                F-statistic:        309.0    
Df Residuals:       503              Prob (F-statistic): 2.98e-88 
R-squared:          0.551            Scale:              38.108   
-------------------------------------------------------------------
               Coef.   Std.Err.     t      P&gt;|t|    [0.025   0.975]
-------------------------------------------------------------------
Intercept     33.2228    0.7308   45.4579  0.0000  31.7869  34.6586
lstat         -1.0321    0.0482  -21.4163  0.0000  -1.1267  -0.9374
age            0.0345    0.0122    2.8256  0.0049   0.0105   0.0586
------------------------------------------------------------------
Omnibus:             124.288       Durbin-Watson:          0.945  
Prob(Omnibus):       0.000         Jarque-Bera (JB):       244.026
Skew:                1.362         Prob(JB):               0.000  
Kurtosis:            5.038         Condition No.:          201    
==================================================================

--------
medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad + tax + 
ptratio + black + lstat
--------
                 Results: Ordinary least squares
==================================================================
Model:              OLS              Adj. R-squared:     0.734    
Dependent Variable: medv             AIC:                3025.6086
Date:               2019-05-29 10:07 BIC:                3084.7801
No. Observations:   506              Log-Likelihood:     -1498.8  
Df Model:           13               F-statistic:        108.1    
Df Residuals:       492              Prob (F-statistic): 6.72e-135
R-squared:          0.741            Scale:              22.518   
-------------------------------------------------------------------
            Coef.    Std.Err.     t      P&gt;|t|    [0.025    0.975] 
-------------------------------------------------------------------
Intercept   36.4595    5.1035    7.1441  0.0000   26.4322   46.4868
crim        -0.1080    0.0329   -3.2865  0.0011   -0.1726   -0.0434
zn           0.0464    0.0137    3.3816  0.0008    0.0194    0.0734
indus        0.0206    0.0615    0.3343  0.7383   -0.1003    0.1414
chas         2.6867    0.8616    3.1184  0.0019    0.9939    4.3796
nox        -17.7666    3.8197   -4.6513  0.0000  -25.2716  -10.2616
rm           3.8099    0.4179    9.1161  0.0000    2.9887    4.6310
age          0.0007    0.0132    0.0524  0.9582   -0.0253    0.0266
dis         -1.4756    0.1995   -7.3980  0.0000   -1.8675   -1.0837
rad          0.3060    0.0663    4.6129  0.0000    0.1757    0.4364
tax         -0.0123    0.0038   -3.2800  0.0011   -0.0197   -0.0049
ptratio     -0.9527    0.1308   -7.2825  0.0000   -1.2098   -0.6957
black        0.0093    0.0027    3.4668  0.0006    0.0040    0.0146
lstat       -0.5248    0.0507  -10.3471  0.0000   -0.6244   -0.4251
------------------------------------------------------------------
Omnibus:             178.041       Durbin-Watson:          1.078  
Prob(Omnibus):       0.000         Jarque-Bera (JB):       783.126
Skew:                1.521         Prob(JB):               0.000  
Kurtosis:            8.281         Condition No.:          15114  
==================================================================
* The condition number is large (2e+04). This might indicate
strong multicollinearity or other numerical problems.
--------
</pre>
</div>
</div>

<div id="outline-container-org3477c07" class="outline-4">
<h4 id="org3477c07"><span class="section-number-4">3.6.4</span> Interaction Terms</h4>
<div class="outline-text-4" id="text-3-6-4">
<p>
The syntax <code>lstat:black</code> tells <code>ols</code> to include an interaction term between
<code>lstat</code> and <code>black</code>.  The syntax <code>lstat*age</code> simultaneously includes <code>lstat,
age,</code> and the interaction term \(\text{lstat} \times \text{age}\) as predictors.
It is a shorthand for <code>lstat + age + lstat:age</code>. 
</p>

<div class="org-src-container">
<pre class="src src-python"><span style="color: #0000FF;">import</span> statsmodels.formula.api <span style="color: #0000FF;">as</span> smf
<span style="color: #0000FF;">from</span> statsmodels <span style="color: #0000FF;">import</span> datasets

<span style="color: #BA36A5;">boston</span> = datasets.get_rdataset(<span style="color: #008000;">'Boston'</span>, <span style="color: #008000;">'MASS'</span>).data

<span style="color: #BA36A5;">my_reg</span> = smf.ols(formula=<span style="color: #008000;">'medv ~ lstat * age'</span>, data=boston)
<span style="color: #BA36A5;">my_fit</span> = my_reg.fit()
<span style="color: #0000FF;">print</span>(my_fit.summary2())
</pre>
</div>

<pre class="example">
                 Results: Ordinary least squares
==================================================================
Model:              OLS              Adj. R-squared:     0.553    
Dependent Variable: medv             AIC:                3277.9547
Date:               2019-05-29 11:48 BIC:                3294.8609
No. Observations:   506              Log-Likelihood:     -1635.0  
Df Model:           3                F-statistic:        209.3    
Df Residuals:       502              Prob (F-statistic): 4.86e-88 
R-squared:          0.556            Scale:              37.804   
-------------------------------------------------------------------
                Coef.   Std.Err.     t     P&gt;|t|    [0.025   0.975]
-------------------------------------------------------------------
Intercept      36.0885    1.4698  24.5528  0.0000  33.2007  38.9763
lstat          -1.3921    0.1675  -8.3134  0.0000  -1.7211  -1.0631
age            -0.0007    0.0199  -0.0363  0.9711  -0.0398   0.0383
lstat:age       0.0042    0.0019   2.2443  0.0252   0.0005   0.0078
------------------------------------------------------------------
Omnibus:             135.601       Durbin-Watson:          0.965  
Prob(Omnibus):       0.000         Jarque-Bera (JB):       296.955
Skew:                1.417         Prob(JB):               0.000  
Kurtosis:            5.461         Condition No.:          6878   
==================================================================
* The condition number is large (7e+03). This might indicate
strong multicollinearity or other numerical problems.
</pre>
</div>
</div>

<div id="outline-container-org05f453d" class="outline-4">
<h4 id="org05f453d"><span class="section-number-4">3.6.5</span> Non-linear Transformations of the Predictors</h4>
<div class="outline-text-4" id="text-3-6-5">
<p>
The <code>ols</code> function can also accommodate non-linear transformations of the
predictors.  For example, given a predictor \(X\), we can create predictor \(X^2\)
using <code>I(X ** 2)</code>.  We now perform a regression of <code>medv</code> onto <code>lstat</code> and
\(\texttt{lstat}^2\). 
</p>

<p>
The near-zero p-value associated with the quadratic term suggests that it leads
to an improve model.  We use <code>anova_lm()</code> function to further quantify the
extent to which the quadratic fit is superior to the linear fit.  The null
hypothesis is that the two models fit the data equally well.  The alternative
hypothesis is that the full model is superior.  Given the large F-statistic and
zero p-value, this provides very clear evidence that the model with quadratic
term is superior.  A plot of residuals versus fitted values shows that, with
quadratic term included, there is no discernible pattern in residuals. 
</p>

<div class="org-src-container">
<pre class="src src-python"><span style="color: #0000FF;">import</span> statsmodels.formula.api <span style="color: #0000FF;">as</span> smf
<span style="color: #0000FF;">from</span> statsmodels <span style="color: #0000FF;">import</span> datasets
<span style="color: #0000FF;">import</span> statsmodels.api <span style="color: #0000FF;">as</span> sm
<span style="color: #BA36A5;">lowess</span> = sm.nonparametric.lowess
<span style="color: #0000FF;">import</span> matplotlib.pyplot <span style="color: #0000FF;">as</span> plt

<span style="color: #BA36A5;">boston</span> = datasets.get_rdataset(<span style="color: #008000;">'Boston'</span>, <span style="color: #008000;">'MASS'</span>).data

<span style="color: #BA36A5;">my_reg</span> = smf.ols(formula=<span style="color: #008000;">'medv ~ lstat'</span>, data=boston)
<span style="color: #BA36A5;">my_fit</span> = my_reg.fit()

<span style="color: #BA36A5;">my_reg2</span> = smf.ols(formula=<span style="color: #008000;">'medv ~ lstat + I(lstat ** 2)'</span>, data=boston)
<span style="color: #BA36A5;">my_fit2</span> = my_reg2.fit()
<span style="color: #0000FF;">print</span>(my_fit.summary2())
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'--------'</span>)

<span style="color: #0000FF;">print</span>(sm.stats.anova_lm(my_fit2))
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'--------'</span>)

<span style="color: #0000FF;">print</span>(sm.stats.anova_lm(my_fit, my_fit2))

<span style="color: #BA36A5;">my_regs</span> = (my_reg, my_reg2)

<span style="color: #BA36A5;">fig</span> = plt.figure(figsize=(8,4))
<span style="color: #BA36A5;">i_reg</span> = 1
<span style="color: #0000FF;">for</span> reg <span style="color: #0000FF;">in</span> my_regs:
    <span style="color: #BA36A5;">ax</span> = fig.add_subplot(1, 2, i_reg)
    <span style="color: #BA36A5;">fit</span> = reg.fit()
    ax.scatter(fit.fittedvalues, fit.resid, s=7, alpha=0.6)
    <span style="color: #BA36A5;">lowess_fit</span> = lowess(fit.resid, fit.fittedvalues)
    ax.plot(lowess_fit[:,0], lowess_fit[:,1], c=<span style="color: #008000;">'r'</span>)
    ax.axhline(y=0, linestyle=<span style="color: #008000;">'--'</span>, color=<span style="color: #008000;">'grey'</span>)
    ax.set_xlabel(<span style="color: #008000;">'Fitted values'</span>)
    ax.set_ylabel(<span style="color: #008000;">'Residuals'</span>)
    ax.set_title(reg.formula)
    <span style="color: #BA36A5;">i_reg</span> += 1

fig.tight_layout()
</pre>
</div>

<pre class="example">
                 Results: Ordinary least squares
==================================================================
Model:              OLS              Adj. R-squared:     0.543    
Dependent Variable: medv             AIC:                3286.9750
Date:               2019-05-29 12:41 BIC:                3295.4280
No. Observations:   506              Log-Likelihood:     -1641.5  
Df Model:           1                F-statistic:        601.6    
Df Residuals:       504              Prob (F-statistic): 5.08e-88 
R-squared:          0.544            Scale:              38.636   
-------------------------------------------------------------------
               Coef.   Std.Err.     t      P&gt;|t|    [0.025   0.975]
-------------------------------------------------------------------
Intercept     34.5538    0.5626   61.4151  0.0000  33.4485  35.6592
lstat         -0.9500    0.0387  -24.5279  0.0000  -1.0261  -0.8740
------------------------------------------------------------------
Omnibus:             137.043       Durbin-Watson:          0.892  
Prob(Omnibus):       0.000         Jarque-Bera (JB):       291.373
Skew:                1.453         Prob(JB):               0.000  
Kurtosis:            5.319         Condition No.:          30     
==================================================================

--------
                  df        sum_sq       mean_sq           F         PR(&gt;F)
lstat            1.0  23243.913997  23243.913997  761.810354  8.819026e-103
I(lstat ** 2)    1.0   4125.138260   4125.138260  135.199822   7.630116e-28
Residual       503.0  15347.243158     30.511418         NaN            NaN
--------
   df_resid           ssr  df_diff     ss_diff           F        Pr(&gt;F)
0     504.0  19472.381418      0.0         NaN         NaN           NaN
1     503.0  15347.243158      1.0  4125.13826  135.199822  7.630116e-28
</pre>
</div>
</div>

<div id="outline-container-orgd0f74b1" class="outline-4">
<h4 id="orgd0f74b1"><span class="section-number-4">3.6.6</span> Qualitative Predictors</h4>
<div class="outline-text-4" id="text-3-6-6">
<p>
We will now examine <code>Carseats</code> data, which is part of the <code>ISLR</code> library.  We
will attempt to predict <code>Sales</code> (child car seat sales) based on a number of
predictors. <code>statsmodels</code> automatically converts string variables into
categorical variables.  If we want <code>statsmodels</code> to treat a numerical variable <code>x</code> as
qualitative predictor, the formula should be <code>y ~ C(x)</code>. Here <code>C()</code> stands for
categorical.  
</p>

<div class="org-src-container">
<pre class="src src-python"><span style="color: #0000FF;">import</span> statsmodels.formula.api <span style="color: #0000FF;">as</span> smf
<span style="color: #0000FF;">from</span> statsmodels <span style="color: #0000FF;">import</span> datasets

<span style="color: #BA36A5;">carseats</span> = datasets.get_rdataset(<span style="color: #008000;">'Carseats'</span>, <span style="color: #008000;">'ISLR'</span>).data
<span style="color: #0000FF;">print</span>(carseats.columns)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'--------'</span>)

<span style="color: #BA36A5;">all_columns</span> = <span style="color: #006FE0;">list</span>(carseats.columns)
all_columns.remove(<span style="color: #008000;">'Sales'</span>)
<span style="color: #BA36A5;">my_formula</span> = <span style="color: #008000;">'Sales ~ '</span> + <span style="color: #008000;">' + '</span>.join(all_columns)
<span style="color: #BA36A5;">my_formula</span> +=  <span style="color: #008000;">' + Income:Advertising + Price:Age'</span>

<span style="color: #0000FF;">print</span>(my_formula)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'--------'</span>)

<span style="color: #BA36A5;">my_reg</span> = smf.ols(formula=my_formula, data=carseats)
<span style="color: #BA36A5;">my_fit</span> = my_reg.fit()
<span style="color: #0000FF;">print</span>(my_fit.summary2())
</pre>
</div>

<pre class="example">
Index(['Sales', 'CompPrice', 'Income', 'Advertising', 'Population', 'Price',
       'ShelveLoc', 'Age', 'Education', 'Urban', 'US'],
      dtype='object')
--------
Sales ~ CompPrice + Income + Advertising + Population + Price + ShelveLoc + 
Age + Education + Urban + US + Income:Advertising + Price:Age
--------
                  Results: Ordinary least squares
====================================================================
Model:                OLS              Adj. R-squared:     0.872    
Dependent Variable:   Sales            AIC:                1157.3378
Date:                 2019-05-29 12:53 BIC:                1213.2183
No. Observations:     400              Log-Likelihood:     -564.67  
Df Model:             13               F-statistic:        210.0    
Df Residuals:         386              Prob (F-statistic): 6.14e-166
R-squared:            0.876            Scale:              1.0213   
--------------------------------------------------------------------
                     Coef.  Std.Err.    t     P&gt;|t|   [0.025  0.975]
--------------------------------------------------------------------
Intercept            6.5756   1.0087   6.5185 0.0000  4.5922  8.5589
ShelveLoc[T.Good]    4.8487   0.1528  31.7243 0.0000  4.5482  5.1492
ShelveLoc[T.Medium]  1.9533   0.1258  15.5307 0.0000  1.7060  2.2005
Urban[T.Yes]         0.1402   0.1124   1.2470 0.2132 -0.0808  0.3612
US[T.Yes]           -0.1576   0.1489  -1.0580 0.2907 -0.4504  0.1352
CompPrice            0.0929   0.0041  22.5668 0.0000  0.0848  0.1010
Income               0.0109   0.0026   4.1828 0.0000  0.0058  0.0160
Advertising          0.0702   0.0226   3.1070 0.0020  0.0258  0.1147
Population           0.0002   0.0004   0.4329 0.6653 -0.0006  0.0009
Price               -0.1008   0.0074 -13.5494 0.0000 -0.1154 -0.0862
Age                 -0.0579   0.0160  -3.6329 0.0003 -0.0893 -0.0266
Education           -0.0209   0.0196  -1.0632 0.2884 -0.0594  0.0177
Income:Advertising   0.0008   0.0003   2.6976 0.0073  0.0002  0.0013
Price:Age            0.0001   0.0001   0.8007 0.4238 -0.0002  0.0004
--------------------------------------------------------------------
Omnibus:                1.281        Durbin-Watson:           2.047 
Prob(Omnibus):          0.527        Jarque-Bera (JB):        1.147 
Skew:                   0.129        Prob(JB):                0.564 
Kurtosis:               3.050        Condition No.:           130576
====================================================================
* The condition number is large (1e+05). This might indicate
strong multicollinearity or other numerical problems.
</pre>
</div>
</div>

<div id="outline-container-orgd293012" class="outline-4">
<h4 id="orgd293012"><span class="section-number-4">3.6.7</span> Calling <code>R</code> from <code>Python</code></h4>
<div class="outline-text-4" id="text-3-6-7">
</div>
</div>
</div>
</div>

<div id="outline-container-org1fea058" class="outline-2">
<h2 id="org1fea058"><span class="section-number-2">4</span> Classification</h2>
<div class="outline-text-2" id="text-4">
</div>
<div id="outline-container-org8c46dbf" class="outline-3">
<h3 id="org8c46dbf"><span class="section-number-3">4.1</span> An Overview of Classification</h3>
<div class="outline-text-3" id="text-4-1">
<p>
In figure <a href="#orgaab6be1">37</a>, we have plotted annual <code>income</code> and monthly
credit card <code>balance</code> for a subset of individuals in <code>Credit</code> data set.  The
left hand panel displays individuals who defaulted in brown, and those who did
not in blue.  We have plotted only a fraction of individuals who did not
default.  It appears that individuals who defaulted tended to have higher credit
card balances than those who did not.  In the right hand panel, we show two
pairs of boxplots.  The first shows the distribution of <code>balance</code> split by the
binary <code>default</code> variable; the second is a similar plot for <code>income</code>.  
</p>


<div id="orgaab6be1" class="figure">
<p><img src="figures/fig4_1.png" alt="fig4_1.png" />
</p>
<p><span class="figure-number">Figure 37: </span>The <code>Default</code> data set.  Left: The annual income and monthly credit card balances of a number of individuals.  The individuals who defaulted on their credit card debt are shown in brown, and those who did not default are shown in blue.  Center: Boxplots of <code>balance</code> as a function of <code>default</code> status.  Right: Boxplots of <code>income</code> as a function of <code>default</code> status.</p>
</div>
</div>
</div>

<div id="outline-container-org9bda174" class="outline-3">
<h3 id="org9bda174"><span class="section-number-3">4.2</span> Why Not Linear Regression?</h3>
</div>

<div id="outline-container-orgda2ee54" class="outline-3">
<h3 id="orgda2ee54"><span class="section-number-3">4.3</span> Logistic Regression</h3>
<div class="outline-text-3" id="text-4-3">
<p>
Using <code>Default</code> data set, in figure <a href="#orgedd7ded">38</a> we show probability of default as a function of
<code>balance</code>.  The left panel shows a model fitted using linear regression.  Some
of the probabilities estimates (for low balance) are outside the \([0, 1]\)
interval.  The right panel shows a model fitted using logistic regression, which
models the probability of default as a function of <code>balance</code>.  Now all
probability estimates are in the \([0, 1]\) interval.
</p>


<div id="orgedd7ded" class="figure">
<p><img src="figures/fig4_2.png" alt="fig4_2.png" />
</p>
<p><span class="figure-number">Figure 38: </span>Classification using <code>Default</code> data.  Left: Estimated probability of <code>default</code> using linear regression.  Some estimated probabilities are negative!  The brown ticks indicate the 0/1 values coded for <code>default</code> (<code>No</code> or <code>Yes</code>).  Right: Predicted probabilities of <code>default</code> using logistic regression.  All probabilities lie between 0 and 1.</p>
</div>

<p>
Table <a href="#orgc4117f6">12</a> shows the coefficient estimates and related
information that result from fitting a logistic regression model on the
<code>Default</code> data in order to predict the probability of <code>default = Yes</code> using <code>balance</code>.
</p>

<table id="orgc4117f6" border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">
<caption class="t-bottom"><span class="table-number">Table 12:</span> For the <code>Default</code> data, estimated coefficients of the logistic regression model that predicts the probability of <code>default</code> using <code>balance</code>.  A one-unit increase in <code>balance</code> is associated with an increase in the log odds of <code>default</code> by 0.0055 units.</caption>

<colgroup>
<col  class="org-left" />

<col  class="org-right" />

<col  class="org-right" />

<col  class="org-right" />

<col  class="org-right" />
</colgroup>
<thead>
<tr>
<th scope="col" class="org-left">&#xa0;</th>
<th scope="col" class="org-right">Coef.</th>
<th scope="col" class="org-right">Std.Err.</th>
<th scope="col" class="org-right">\(z\)</th>
<th scope="col" class="org-right">\(P > \mid z \mid\)</th>
</tr>
</thead>
<tbody>
<tr>
<td class="org-left">Intercept</td>
<td class="org-right">-10.6513</td>
<td class="org-right">0.3612</td>
<td class="org-right">-29.4913</td>
<td class="org-right">0.0</td>
</tr>

<tr>
<td class="org-left">balance</td>
<td class="org-right">0.0055</td>
<td class="org-right">0.0002</td>
<td class="org-right">24.9524</td>
<td class="org-right">0.0</td>
</tr>
</tbody>
</table>

<p>
Table <a href="#org185fd4c">13</a> shows the results of logistic model where <code>default</code>
is a function of the qualitative variable <code>student</code>.  
</p>

<p>
Table <a href="#org4ad6032">14</a> shows the coefficient estimates for a logistic
regression model that uses <code>balance</code>, <code>income</code> (in thousands of dollars), and
<code>student</code> status to predict probability of <code>default</code>.
</p>

<table id="org185fd4c" border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">
<caption class="t-bottom"><span class="table-number">Table 13:</span> For the <code>Default</code> data, estimated coefficients of the logistic regression model that predicts the probability of <code>default</code> using student status.</caption>

<colgroup>
<col  class="org-left" />

<col  class="org-right" />

<col  class="org-right" />

<col  class="org-right" />

<col  class="org-right" />
</colgroup>
<thead>
<tr>
<th scope="col" class="org-left">&#xa0;</th>
<th scope="col" class="org-right">Coef.</th>
<th scope="col" class="org-right">Std.Err.</th>
<th scope="col" class="org-right">\(z\)</th>
<th scope="col" class="org-right">\(P > \mid z \mid\)</th>
</tr>
</thead>
<tbody>
<tr>
<td class="org-left">Intercept</td>
<td class="org-right">-3.5041</td>
<td class="org-right">0.0707</td>
<td class="org-right">-49.5541</td>
<td class="org-right">0.0</td>
</tr>

<tr>
<td class="org-left">student[T.Yes]</td>
<td class="org-right">0.4049</td>
<td class="org-right">0.115</td>
<td class="org-right">3.5202</td>
<td class="org-right">0.0004</td>
</tr>
</tbody>
</table>

<table id="org4ad6032" border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">
<caption class="t-bottom"><span class="table-number">Table 14:</span> For the <code>Default</code> data, estimated coefficients of the logistic regression model that predicts the probability of <code>default</code> using <code>balance</code>, <code>income</code>, and <code>student</code> status.  In fitting this model, <code>income</code> was measured in thousands of dollars.</caption>

<colgroup>
<col  class="org-left" />

<col  class="org-right" />

<col  class="org-right" />

<col  class="org-right" />

<col  class="org-right" />
</colgroup>
<thead>
<tr>
<th scope="col" class="org-left">&#xa0;</th>
<th scope="col" class="org-right">Coef.</th>
<th scope="col" class="org-right">Std.Err.</th>
<th scope="col" class="org-right">\(z\)</th>
<th scope="col" class="org-right">\(P > \mid z \mid\)</th>
</tr>
</thead>
<tbody>
<tr>
<td class="org-left">Intercept</td>
<td class="org-right">-10.869</td>
<td class="org-right">0.4923</td>
<td class="org-right">-22.0793</td>
<td class="org-right">0.0</td>
</tr>

<tr>
<td class="org-left">student[T.Yes]</td>
<td class="org-right">-0.6468</td>
<td class="org-right">0.2363</td>
<td class="org-right">-2.7376</td>
<td class="org-right">0.0062</td>
</tr>

<tr>
<td class="org-left">balance</td>
<td class="org-right">0.0057</td>
<td class="org-right">0.0002</td>
<td class="org-right">24.7365</td>
<td class="org-right">0.0</td>
</tr>

<tr>
<td class="org-left">income</td>
<td class="org-right">0.003</td>
<td class="org-right">0.0082</td>
<td class="org-right">0.3698</td>
<td class="org-right">0.7115</td>
</tr>
</tbody>
</table>

<p>
The left hand panel of figure <a href="#orgbbbf687">39</a> shows average default rates
for students and non-students, respectively, as a function of credit card
balance.  <i>For a fixed value</i> of <code>balance</code> and <code>income</code>, a student is less
likely to default than a non-student.  This is true for all values of balance.
This is consistent with negative coefficient of student in table
<a href="#org4ad6032">14</a>.  But the horizontal lines near the base of the plot, which show the default rates
for students and non-students averaged over all values of <code>balance</code> and
<code>income</code>, suggest the opposite effect: the overall student default rate is
higher than non-student default rate.  Consequently, there is a positive
coefficient for <code>student</code> in the single variable logistic regression output
shown in table <a href="#org185fd4c">13</a>. 
</p>


<div id="orgbbbf687" class="figure">
<p><img src="figures/fig4_3.png" alt="fig4_3.png" />
</p>
<p><span class="figure-number">Figure 39: </span>Confounding in the <code>Default</code> data.  Left: Default rates are shown for students (brown) and non-students (blue).  The solid lines display default rate as a function of <code>balance</code>, while the horizontal lines display the overall default rates.  Right: Boxplots of <code>balance</code> for students and non-students are shown.</p>
</div>
</div>
</div>

<div id="outline-container-orga747611" class="outline-3">
<h3 id="orga747611"><span class="section-number-3">4.4</span> Linear Discriminant Analysis</h3>
<div class="outline-text-3" id="text-4-4">
<p>
In the left panel of figure <a href="#orge8d6d4f">40</a>, two normal density functions
that are displayed, \(f_1(x)\) and \(f_2(x)\), represent two distinct classes.  The
Bayes classifier boundary, shown as vertical dashed line, is estimated using the
function <code>GaussianNB()</code>.  The right hand panel displays a histogram of a random
sample of 20 observations from each class.  The LDA decision boundary is shown
as firm vertical line.
</p>


<div id="orge8d6d4f" class="figure">
<p><img src="figures/fig4_4.png" alt="fig4_4.png" />
</p>
<p><span class="figure-number">Figure 40: </span>Left: Two one-dimensional normal density functions are shown.  The dashed vertical line represents the Bayes decision boundary.  Right: 20 observations were drawn from each of the two classes, and are shown as histograms.  The Bayes decision boundary is again shown as a dashed vertical line.  The solid vertical line represents the LDA decision boundary estimated from the training data.</p>
</div>


<p>
Two examples of multivariate Gaussian distributions with \(p = 2\) are shown in
figure <a href="#org0a2358c">41</a>.  In the upper panel, the height of the surface at
any particular point represents the probability that both \(X_1\) and \(X_2\) fall
in the small region around that point.  If the surface is cut along the \(X_1\)
axis or along the \(X_2\) axis, the resulting cross-section will have the shape of
a one-dimensional normal distribution.  The left-hand panel illustrates an example in
which \(\text{var}(X_1) = \text{var}(X_2)\) and \(\text{cor}(X_1, X_2) = 0\); this surface has a
characteristic <i>bell shape</i>.  However, the bell shape will be distorted if the
predictors are correlated or have unequal variances, as is illustrated in the
right-hand panel of figure <a href="#org0a2358c">41</a>.  In this situation, the base
of the bell will have an elliptical, rather than circular, shape.  The contour
plots in the lower panel are not in the book. 
</p>


<div id="org0a2358c" class="figure">
<p><img src="figures/fig4_5.png" alt="fig4_5.png" />
</p>
<p><span class="figure-number">Figure 41: </span>Two multivariate Gaussian density functions are shown, with \(p = 2\).  Left: The two predictors are uncorrelated.  Right: The two predictors have a correlation of 0.7.  The lower panel shows contour plots of the surfaces drawn in the upper panel.  Here the correlations can be easily seen.</p>
</div>

<p>
Figure <a href="#orgef05e3b">42</a> shows an example of three equally sized Gaussian
classes with class-specific mean vectors and a common covariance matrix. The
dashed lines are the Bayes decision boundaries.  
</p>


<div id="orgef05e3b" class="figure">
<p><img src="figures/fig4_6.png" alt="fig4_6.png" />
</p>
<p><span class="figure-number">Figure 42: </span>An example with three classes. The observation from each class are drawn from a multivariate Gaussian distribution with \(p = 2\), with a class-specific mean vector and a common covariance matrix.  Left: The dashed lines are the Bayes decision boundaries.  Right: 20 observations were generated from each class, and the corresponding LDA decision boundaries are indicated using solid black lines.  The Bayes decision boundaries are once again shown as dashed lines.</p>
</div>

<p>
A <i>confusion matrix</i>, shown for the <code>Default</code> data in table
<a href="#org60939f4">15</a>, is a convenient way to display prediction of default in
comparison to true default.  Table <a href="#org4bdfecb">16</a> shows the error rates
that result when we label any customer with a posterior probability of default
above 20% to the <i>default</i> class.
</p>

<table id="org60939f4" border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">
<caption class="t-bottom"><span class="table-number">Table 15:</span> A confusion matrix compares the LDA predictions to the true default statuses for the training observations in the <code>Default</code> data set.  Elements of the diagonal matrix represent individuals whose default statuses were correctly predicted, while off-diagonal elements represent individuals that were missclassified.</caption>

<colgroup>
<col  class="org-left" />

<col  class="org-right" />

<col  class="org-right" />

<col  class="org-right" />
</colgroup>
<thead>
<tr>
<th scope="col" class="org-left">&#xa0;</th>
<th scope="col" class="org-right">true No</th>
<th scope="col" class="org-right">true Yes</th>
<th scope="col" class="org-right">Total</th>
</tr>
</thead>
<tbody>
<tr>
<td class="org-left">predict No</td>
<td class="org-right">9645</td>
<td class="org-right">254</td>
<td class="org-right">9899</td>
</tr>

<tr>
<td class="org-left">predict Yes</td>
<td class="org-right">22</td>
<td class="org-right">79</td>
<td class="org-right">101</td>
</tr>
</tbody>
<tbody>
<tr>
<td class="org-left">Total</td>
<td class="org-right">9667</td>
<td class="org-right">333</td>
<td class="org-right">10000</td>
</tr>
</tbody>
</table>


<table id="org4bdfecb" border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">
<caption class="t-bottom"><span class="table-number">Table 16:</span> A confusion matrix compares LDA predictions to the true default statuses for the training observations in the <code>Default</code> data set, using a modified threshold value that predicts default for any individuals whose posterior default probability exceeds 20%.</caption>

<colgroup>
<col  class="org-left" />

<col  class="org-right" />

<col  class="org-right" />

<col  class="org-right" />
</colgroup>
<thead>
<tr>
<th scope="col" class="org-left">&#xa0;</th>
<th scope="col" class="org-right">true No</th>
<th scope="col" class="org-right">true Yes</th>
<th scope="col" class="org-right">Total</th>
</tr>
</thead>
<tbody>
<tr>
<td class="org-left">predict No</td>
<td class="org-right">9435</td>
<td class="org-right">140</td>
<td class="org-right">9575</td>
</tr>

<tr>
<td class="org-left">predict Yes</td>
<td class="org-right">232</td>
<td class="org-right">193</td>
<td class="org-right">425</td>
</tr>
</tbody>
<tbody>
<tr>
<td class="org-left">Total</td>
<td class="org-right">9667</td>
<td class="org-right">333</td>
<td class="org-right">10000</td>
</tr>
</tbody>
</table>


<p>
Figure <a href="#orgda94e10">43</a> illustrates the trade-off that results from
modifying the threshold value for the posterior probability of default.  Various
error rates are shown as a function of the threshold value.  Using a threshold
of 0.5 minimizes the overall error rate, shown as a black line.  But when a
threshold of 0.5 is used, the error rate among the individuals who default is
quite high (blue dashed line).  As the threshold is reduced, the error rate
among individuals who default decreases steadily, but the error rate amond
individuals who do not default increases. 
</p>


<div id="orgda94e10" class="figure">
<p><img src="figures/fig4_7.png" alt="fig4_7.png" />
</p>
<p><span class="figure-number">Figure 43: </span>For the <code>Default</code> data set, error rates are shown as a function of the threshold value for the posterior probability that is used to perform the assignment of default.  The black sold line displays the overall error rate.  The blue dashed line represents the fraction of defaulting customers that are incorrectly classified, and the orange dotted line indicates the fraction of errors among the non-defaulting customers.</p>
</div>


<p>
Figure <a href="#orgc1de2a8">44</a> displays the ROC curve for the LDA classifier on
the <code>Default</code> data set.
</p>


<div id="orgc1de2a8" class="figure">
<p><img src="figures/fig4_8.png" alt="fig4_8.png" />
</p>
<p><span class="figure-number">Figure 44: </span>A ROC curve for the LDA classifier on the <code>Default</code> data.  It traces two types of error as we vary the threshold value for the posterior probability of default.  The actual thresholds are not shown.  The true positive rate is the sensitivity: the fraction of defaulters that are correctly identified using a given threshold value.  The false positive rate is the fraction of non-defaulters we incorrectly specify as defaulters, using the same threshold value.  The ideal ROC curve hugs the top left corner, indicating a high true positive rate and a low false positive rate.  The dotted line represents the ``no information'' classifier; this is what we would expect if student status and credit card balance are not associated with the probability of default.</p>
</div>

<p>
Table <a href="#org648409a">17</a> shows the possible results when applying a
classifier (or diagnostic test) to a population.  
</p>

<table id="org648409a" border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">
<caption class="t-bottom"><span class="table-number">Table 17:</span> Possible results when applying a classifier or diagnostic test to a population.</caption>

<colgroup>
<col  class="org-left" />

<col  class="org-left" />

<col  class="org-left" />

<col  class="org-left" />

<col  class="org-left" />
</colgroup>
<thead>
<tr>
<th scope="col" class="org-left">&#xa0;</th>
<th scope="col" class="org-left">&#xa0;</th>
<th scope="col" class="org-left"><i>True class</i></th>
<th scope="col" class="org-left">&#xa0;</th>
<th scope="col" class="org-left">&#xa0;</th>
</tr>

<tr>
<th scope="col" class="org-left">&#xa0;</th>
<th scope="col" class="org-left">&#xa0;</th>
<th scope="col" class="org-left">- or Null</th>
<th scope="col" class="org-left">+ or Non-null</th>
<th scope="col" class="org-left">Total</th>
</tr>
</thead>
<tbody>
<tr>
<td class="org-left"><i>Predicted</i></td>
<td class="org-left">- or Null</td>
<td class="org-left">True Negative (TN)</td>
<td class="org-left">False Negative (FN)</td>
<td class="org-left">N*</td>
</tr>

<tr>
<td class="org-left"><i>class</i></td>
<td class="org-left">+ or Non-null</td>
<td class="org-left">False Positive (FP)</td>
<td class="org-left">True Positive (TP)</td>
<td class="org-left">P*</td>
</tr>
</tbody>
<tbody>
<tr>
<td class="org-left">&#xa0;</td>
<td class="org-left">Total</td>
<td class="org-left">N</td>
<td class="org-left">P</td>
<td class="org-left">&#xa0;</td>
</tr>
</tbody>
</table>
<p>
Table <a href="#org6597316">18</a> lists many of the popular performance measures that
are used in this context.
</p>

<table id="org6597316" border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">
<caption class="t-bottom"><span class="table-number">Table 18:</span> Important measures for classification and diagnostic testing, derived from quantities in table <a href="#org648409a">17</a>.</caption>

<colgroup>
<col  class="org-left" />

<col  class="org-left" />

<col  class="org-left" />
</colgroup>
<thead>
<tr>
<th scope="col" class="org-left">Name</th>
<th scope="col" class="org-left">Definition</th>
<th scope="col" class="org-left">Synonyms</th>
</tr>
</thead>
<tbody>
<tr>
<td class="org-left">False Positive rate</td>
<td class="org-left">FP / N</td>
<td class="org-left">Type I error, 1 - specificity</td>
</tr>

<tr>
<td class="org-left">True Positive rate</td>
<td class="org-left">TP / P</td>
<td class="org-left">1 - Type II error, power, sensitivity, recall</td>
</tr>

<tr>
<td class="org-left">Positive Predicted value</td>
<td class="org-left">TP / P*</td>
<td class="org-left">Precision, 1 - false discovery proportion</td>
</tr>

<tr>
<td class="org-left">Negative Predicted value</td>
<td class="org-left">TN / N*</td>
<td class="org-left">&#xa0;</td>
</tr>
</tbody>
</table>


<p>
Figure <a href="#orgee7808c">45</a> illustrates the performances of LDA and QDA in two
scenarios.  In the left-hand panel, the two Gaussian classes have a common
correlation of 0.7 between \(X_1\) and \(X_2\).  As a result, the Bayes decision
boundary is nearly linear and is accurately approximated by the LDA decision
boundary.  In contrast, the right-hand panel displays a situation in which the
orange class has a correlation of 0.7 between the variables and blue class has a
correlation of -0.7.
</p>


<div id="orgee7808c" class="figure">
<p><img src="figures/fig4_9.png" alt="fig4_9.png" />
</p>
<p><span class="figure-number">Figure 45: </span>Left: The Bayes (purple dashed), LDA (black dotted), and QDA (green sold) decision boundaries for a two-class problem with \(\Sigma_1 = \Sigma_2\).  Right: Details are as given in the left-hand panel, except that \(\Sigma_1 \ne \Sigma_2\).</p>
</div>
</div>
</div>

<div id="outline-container-orgf4eb929" class="outline-3">
<h3 id="orgf4eb929"><span class="section-number-3">4.5</span> A Comparison of Classification Methods</h3>
<div class="outline-text-3" id="text-4-5">
<p>
Figure <a href="#org8cd38c4">46</a> illustrates the performances of the four
classification approaches (KNN, LDA, Logistic, and QDA) when Bayes decision
boundary is linear.
</p>


<div id="org8cd38c4" class="figure">
<p><img src="figures/fig4_10.png" alt="fig4_10.png" />
</p>
<p><span class="figure-number">Figure 46: </span>Boxplots of the test error rates for each of the linear scenarios described in the main text.</p>
</div>
</div>
</div>

<div id="outline-container-org1b0d56b" class="outline-3">
<h3 id="org1b0d56b"><span class="section-number-3">4.6</span> Lab: Logistic Regression, LDA, QDA, and KNN</h3>
<div class="outline-text-3" id="text-4-6">
</div>
<div id="outline-container-org9de03ac" class="outline-4">
<h4 id="org9de03ac"><span class="section-number-4">4.6.1</span> The Stock Market Data</h4>
<div class="outline-text-4" id="text-4-6-1">
<p>
We will begin by examining some numerical and graphical summaries of the
<code>Smarket</code> data, which is part of the <code>ISLR</code> library.
</p>

<div class="org-src-container">
<pre class="src src-python"><span style="color: #0000FF;">from</span> statsmodels <span style="color: #0000FF;">import</span> datasets
<span style="color: #0000FF;">import</span> pandas <span style="color: #0000FF;">as</span> pd

<span style="color: #BA36A5;">smarket</span> = datasets.get_rdataset(<span style="color: #008000;">'Smarket'</span>, <span style="color: #008000;">'ISLR'</span>).data

<span style="color: #0000FF;">print</span>(smarket.columns)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'--------'</span>)
<span style="color: #0000FF;">print</span>(smarket.shape)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'--------'</span>)
<span style="color: #0000FF;">print</span>(smarket.describe())
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'--------'</span>)
<span style="color: #0000FF;">print</span>(smarket.iloc[:,1:8].corr())
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'--------'</span>)
smarket.boxplot(column=<span style="color: #008000;">'Volume'</span>, by=<span style="color: #008000;">'Year'</span>, grid=<span style="color: #D0372D;">False</span>)
</pre>
</div>

<pre class="example">
Index(['Year', 'Lag1', 'Lag2', 'Lag3', 'Lag4', 'Lag5', 'Volume', 'Today',
       'Direction'],
      dtype='object')
--------
(1250, 9)
--------
              Year         Lag1  ...       Volume        Today
count  1250.000000  1250.000000  ...  1250.000000  1250.000000
mean   2003.016000     0.003834  ...     1.478305     0.003138
std       1.409018     1.136299  ...     0.360357     1.136334
min    2001.000000    -4.922000  ...     0.356070    -4.922000
25%    2002.000000    -0.639500  ...     1.257400    -0.639500
50%    2003.000000     0.039000  ...     1.422950     0.038500
75%    2004.000000     0.596750  ...     1.641675     0.596750
max    2005.000000     5.733000  ...     3.152470     5.733000

[8 rows x 8 columns]
--------
            Lag1      Lag2      Lag3      Lag4      Lag5    Volume     Today
Lag1    1.000000 -0.026294 -0.010803 -0.002986 -0.005675  0.040910 -0.026155
Lag2   -0.026294  1.000000 -0.025897 -0.010854 -0.003558 -0.043383 -0.010250
Lag3   -0.010803 -0.025897  1.000000 -0.024051 -0.018808 -0.041824 -0.002448
Lag4   -0.002986 -0.010854 -0.024051  1.000000 -0.027084 -0.048414 -0.006900
Lag5   -0.005675 -0.003558 -0.018808 -0.027084  1.000000 -0.022002 -0.034860
Volume  0.040910 -0.043383 -0.041824 -0.048414 -0.022002  1.000000  0.014592
Today  -0.026155 -0.010250 -0.002448 -0.006900 -0.034860  0.014592  1.000000
--------
</pre>
</div>
</div>

<div id="outline-container-org71ef131" class="outline-4">
<h4 id="org71ef131"><span class="section-number-4">4.6.2</span> Logistc Regression</h4>
<div class="outline-text-4" id="text-4-6-2">
<p>
Next, we will fit a logistic regression model to predict <code>Direction</code> using
<code>Lag1</code> through <code>Lag5</code> and <code>Volume</code>.
</p>

<div class="org-src-container">
<pre class="src src-python"><span style="color: #0000FF;">from</span> statsmodels <span style="color: #0000FF;">import</span> datasets
<span style="color: #0000FF;">import</span> statsmodels.formula.api <span style="color: #0000FF;">as</span> smf
<span style="color: #0000FF;">import</span> numpy <span style="color: #0000FF;">as</span> np
<span style="color: #0000FF;">import</span> pandas <span style="color: #0000FF;">as</span> pd

<span style="color: #BA36A5;">smarket</span> = datasets.get_rdataset(<span style="color: #008000;">'Smarket'</span>, <span style="color: #008000;">'ISLR'</span>).data
<span style="color: #BA36A5;">smarket</span>[<span style="color: #008000;">'direction_cat'</span>] = smarket[<span style="color: #008000;">'Direction'</span>].<span style="color: #006FE0;">apply</span>(<span style="color: #0000FF;">lambda</span> x: <span style="color: #006FE0;">int</span>(x==<span style="color: #008000;">'Up'</span>))

<span style="color: #BA36A5;">logit_model</span> = smf.logit(
    formula=<span style="color: #008000;">'direction_cat ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume'</span>,
    data=smarket)
<span style="color: #BA36A5;">logit_fit</span> = logit_model.fit()

<span style="color: #0000FF;">print</span>(logit_fit.summary2())
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'--------'</span>)
<span style="color: #0000FF;">print</span>(<span style="color: #006FE0;">dir</span>(logit_fit))           <span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">see what information is available from fit</span>
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'--------'</span>)
<span style="color: #0000FF;">print</span>(logit_fit.params)         <span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">coefficients estimates</span>
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'--------'</span>)
<span style="color: #0000FF;">print</span>(logit_fit.summary2().tables[1]) <span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">coefficients estimates, std error, and z</span>
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'--------'</span>)
<span style="color: #0000FF;">print</span>(logit_fit.summary2().tables[1].iloc[:,3]) <span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">P &gt; |z| column only</span>
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'--------'</span>)
<span style="color: #0000FF;">print</span>(logit_fit.predict()[:10]) <span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">probabilities for training data</span>
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'--------'</span>)
<span style="color: #BA36A5;">smarket</span>[<span style="color: #008000;">'predict_direction'</span>] = np.vectorize(
    <span style="color: #0000FF;">lambda</span> x: <span style="color: #008000;">'Up'</span> <span style="color: #0000FF;">if</span> x &gt; 0.5 <span style="color: #0000FF;">else</span> <span style="color: #008000;">'Down'</span>)(logit_fit.predict())
<span style="color: #0000FF;">print</span>(pd.crosstab(smarket[<span style="color: #008000;">'predict_direction'</span>], smarket[<span style="color: #008000;">'Direction'</span>]))
</pre>
</div>

<pre class="example">
Optimization terminated successfully.
         Current function value: 0.691034
         Iterations 4
                         Results: Logit
================================================================
Model:              Logit            Pseudo R-squared: 0.002    
Dependent Variable: direction_cat    AIC:              1741.5841
Date:               2019-06-06 18:56 BIC:              1777.5004
No. Observations:   1250             Log-Likelihood:   -863.79  
Df Model:           6                LL-Null:          -865.59  
Df Residuals:       1243             LLR p-value:      0.73187  
Converged:          1.0000           Scale:            1.0000   
No. Iterations:     4.0000                                      
-----------------------------------------------------------------
               Coef.   Std.Err.     z     P&gt;|z|    [0.025  0.975]
-----------------------------------------------------------------
Intercept     -0.1260    0.2407  -0.5234  0.6007  -0.5978  0.3458
Lag1          -0.0731    0.0502  -1.4566  0.1452  -0.1714  0.0253
Lag2          -0.0423    0.0501  -0.8446  0.3984  -0.1405  0.0559
Lag3           0.0111    0.0499   0.2220  0.8243  -0.0868  0.1090
Lag4           0.0094    0.0500   0.1873  0.8514  -0.0886  0.1073
Lag5           0.0103    0.0495   0.2083  0.8350  -0.0867  0.1074
Volume         0.1354    0.1584   0.8553  0.3924  -0.1749  0.4458
================================================================

--------
['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', 
'__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', 
'__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', 
'__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__',
'__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', 
'_cache', '_data_attr', '_get_endog_name', '_get_robustcov_results', 'aic', 
'bic', 'bse', 'conf_int', 'cov_kwds', 'cov_params', 'cov_type', 'df_model', 
'df_resid', 'f_test', 'fittedvalues', 'get_margeff', 'initialize', 
'k_constant', 'llf', 'llnull', 'llr', 'llr_pvalue', 'load', 'mle_retvals', 
'mle_settings', 'model', 'nobs', 'normalized_cov_params', 'params', 
'pred_table', 'predict', 'prsquared', 'pvalues', 'remove_data', 'resid_dev', 
'resid_generalized', 'resid_pearson', 'resid_response', 'save', 'scale', 
'set_null_options', 'summary', 'summary2', 't_test', 't_test_pairwise', 
'tvalues', 'use_t', 'wald_test', 'wald_test_terms']
--------
Intercept   -0.126000
Lag1        -0.073074
Lag2        -0.042301
Lag3         0.011085
Lag4         0.009359
Lag5         0.010313
Volume       0.135441
dtype: float64
--------
              Coef.  Std.Err.         z     P&gt;|z|    [0.025    0.975]
Intercept -0.126000  0.240737 -0.523394  0.600700 -0.597836  0.345836
Lag1      -0.073074  0.050168 -1.456583  0.145232 -0.171401  0.025254
Lag2      -0.042301  0.050086 -0.844568  0.398352 -0.140469  0.055866
Lag3       0.011085  0.049939  0.221974  0.824334 -0.086793  0.108963
Lag4       0.009359  0.049974  0.187275  0.851445 -0.088589  0.107307
Lag5       0.010313  0.049512  0.208296  0.834998 -0.086728  0.107354
Volume     0.135441  0.158361  0.855266  0.392404 -0.174941  0.445822
--------
Intercept    0.600700
Lag1         0.145232
Lag2         0.398352
Lag3         0.824334
Lag4         0.851445
Lag5         0.834998
Volume       0.392404
Name: P&gt;|z|, dtype: float64
--------
[0.50708413 0.48146788 0.48113883 0.51522236 0.51078116 0.50695646
 0.49265087 0.50922916 0.51761353 0.48883778]
--------
Direction          Down   Up
predict_direction           
Down                145  141
Up                  457  507
</pre>

<p>
We now use data for years 2001 through 2004 to train the model, then use data
for year 2005 to test the model. 
</p>

<div class="org-src-container">
<pre class="src src-python"><span style="color: #0000FF;">from</span> statsmodels <span style="color: #0000FF;">import</span> datasets
<span style="color: #0000FF;">import</span> statsmodels.formula.api <span style="color: #0000FF;">as</span> smf
<span style="color: #0000FF;">import</span> pandas <span style="color: #0000FF;">as</span> pd
<span style="color: #0000FF;">import</span> numpy <span style="color: #0000FF;">as</span> np

<span style="color: #BA36A5;">smarket</span> = datasets.get_rdataset(<span style="color: #008000;">'Smarket'</span>, <span style="color: #008000;">'ISLR'</span>).data
<span style="color: #BA36A5;">smarket</span>[<span style="color: #008000;">'direction_cat'</span>] = smarket[<span style="color: #008000;">'Direction'</span>].<span style="color: #006FE0;">apply</span>(<span style="color: #0000FF;">lambda</span> x:
                                                      <span style="color: #006FE0;">int</span>(x == <span style="color: #008000;">'Up'</span>))
<span style="color: #BA36A5;">smarket_train</span> = smarket.loc[smarket[<span style="color: #008000;">'Year'</span>] &lt; 2005]
<span style="color: #BA36A5;">smarket_test</span> = smarket.loc[smarket[<span style="color: #008000;">'Year'</span>] == 2005].copy()

<span style="color: #BA36A5;">logit_model</span> = smf.logit(
    formula=<span style="color: #008000;">'direction_cat ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume'</span>,
    data=smarket_train)
<span style="color: #BA36A5;">logit_fit</span> = logit_model.fit()

<span style="color: #BA36A5;">prob_up_test</span> = logit_fit.predict(smarket_test)
<span style="color: #BA36A5;">smarket_test.loc</span>[:,<span style="color: #008000;">'direction_predict'</span>] = np.vectorize(
    <span style="color: #0000FF;">lambda</span> x: <span style="color: #008000;">'Up'</span> <span style="color: #0000FF;">if</span> x &gt; 0.5 <span style="color: #0000FF;">else</span> <span style="color: #008000;">'Down'</span>)(prob_up_test)

<span style="color: #BA36A5;">confusion_test</span> = \
    pd.crosstab(smarket_test[<span style="color: #008000;">'direction_predict'</span>], smarket_test[<span style="color: #008000;">'Direction'</span>])
<span style="color: #0000FF;">print</span>(confusion_test)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'--------'</span>)
<span style="color: #0000FF;">print</span>(np.mean(np.mean(smarket_test[<span style="color: #008000;">'direction_predict'</span>] ==
                      smarket_test[<span style="color: #008000;">'Direction'</span>])))
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'--------'</span>)

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Refit logistic regression with only Lag1 and Lag2</span>
<span style="color: #BA36A5;">logit_model</span> = smf.logit(<span style="color: #008000;">'direction_cat ~ Lag1 + Lag2'</span>, data=smarket_train)
<span style="color: #BA36A5;">logit_fit</span> = logit_model.fit()
<span style="color: #BA36A5;">prob_up_test</span> = logit_fit.predict(smarket_test)
<span style="color: #BA36A5;">smarket_test</span>[<span style="color: #008000;">'direction_pred_2var'</span>] = np.vectorize(
    <span style="color: #0000FF;">lambda</span> x: <span style="color: #008000;">'Up'</span> <span style="color: #0000FF;">if</span> x &gt; 0.5 <span style="color: #0000FF;">else</span> <span style="color: #008000;">'Down'</span>)(prob_up_test)

<span style="color: #0000FF;">print</span>(pd.crosstab(smarket_test[<span style="color: #008000;">'direction_pred_2var'</span>],
                  smarket_test[<span style="color: #008000;">'Direction'</span>]))
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'--------'</span>)

<span style="color: #0000FF;">print</span>(np.mean(smarket_test[<span style="color: #008000;">'direction_pred_2var'</span>] == smarket_test[<span style="color: #008000;">'Direction'</span>]))
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'--------'</span>)

<span style="color: #0000FF;">print</span>(logit_fit.predict(exog=<span style="color: #006FE0;">dict</span>(Lag1=[1.2,1.5], Lag2=[1.1,-0.8])))
</pre>
</div>

<pre class="example">
Optimization terminated successfully.
         Current function value: 0.691936
         Iterations 4
Direction          Down  Up
direction_predict          
Down                 77  97
Up                   34  44
--------
0.4801587301587302
--------
Optimization terminated successfully.
         Current function value: 0.692085
         Iterations 3
Direction            Down   Up
direction_pred_2var           
Down                   35   35
Up                     76  106
--------
0.5595238095238095
--------
0    0.479146
1    0.496094
dtype: float64
</pre>
</div>
</div>

<div id="outline-container-orgbcfd3bf" class="outline-4">
<h4 id="orgbcfd3bf"><span class="section-number-4">4.6.3</span> Linear Discriminant Analysis</h4>
<div class="outline-text-4" id="text-4-6-3">
<p>
Now we will perform LDA on <code>Smarket</code> data.
</p>

<div class="org-src-container">
<pre class="src src-python"><span style="color: #0000FF;">from</span> sklearn.discriminant_analysis <span style="color: #0000FF;">import</span> LinearDiscriminantAnalysis <span style="color: #0000FF;">as</span> LDA
<span style="color: #0000FF;">from</span> statsmodels <span style="color: #0000FF;">import</span> datasets
<span style="color: #0000FF;">import</span> pandas <span style="color: #0000FF;">as</span> pd
<span style="color: #0000FF;">import</span> numpy <span style="color: #0000FF;">as</span> np

<span style="color: #BA36A5;">smarket</span> = datasets.get_rdataset(<span style="color: #008000;">'Smarket'</span>, <span style="color: #008000;">'ISLR'</span>).data
<span style="color: #BA36A5;">smarket_train</span> = smarket.loc[smarket[<span style="color: #008000;">'Year'</span>] &lt; 2005]
<span style="color: #BA36A5;">smarket_test</span> = smarket.loc[smarket[<span style="color: #008000;">'Year'</span>] == 2005].copy()

<span style="color: #BA36A5;">lda_model</span> = LDA()
<span style="color: #BA36A5;">lda_fit</span> = lda_model.fit(smarket_train[[<span style="color: #008000;">'Lag1'</span>, <span style="color: #008000;">'Lag2'</span>]],
                        smarket_train[<span style="color: #008000;">'Direction'</span>])

<span style="color: #0000FF;">print</span>(lda_fit.priors_)          <span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Prior probabilities of groups</span>
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'--------'</span>)
<span style="color: #0000FF;">print</span>(lda_fit.means_)           <span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Group means</span>
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'--------'</span>)
<span style="color: #0000FF;">print</span>(lda_fit.scalings_)        <span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Coefficients of linear discriminants</span>
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'--------'</span>)
<span style="color: #BA36A5;">lda_predict_2005</span> = lda_fit.predict(smarket_test[[<span style="color: #008000;">'Lag1'</span>, <span style="color: #008000;">'Lag2'</span>]])
<span style="color: #0000FF;">print</span>(pd.crosstab(lda_predict_2005, smarket_test[<span style="color: #008000;">'Direction'</span>]))
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'--------'</span>)
<span style="color: #0000FF;">print</span>(np.mean(lda_predict_2005 == smarket_test[<span style="color: #008000;">'Direction'</span>]))
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'--------'</span>)
<span style="color: #BA36A5;">lda_predict_prob2005</span> = lda_fit.predict_proba(smarket_test[[<span style="color: #008000;">'Lag1'</span>, <span style="color: #008000;">'Lag2'</span>]])
<span style="color: #0000FF;">print</span>(np.<span style="color: #006FE0;">sum</span>(lda_predict_prob2005[:,0] &gt;= 0.5))
<span style="color: #0000FF;">print</span>(np.<span style="color: #006FE0;">sum</span>(lda_predict_prob2005[:,0] &lt; 0.5))
</pre>
</div>

<pre class="example">
[0.49198397 0.50801603]
--------
[[ 0.04279022  0.03389409]
 [-0.03954635 -0.03132544]]
--------
[[-0.64201904]
 [-0.51352928]]
--------
Direction  Down   Up
row_0               
Down         35   35
Up           76  106
--------
0.5595238095238095
--------
70
182
</pre>
</div>
</div>

<div id="outline-container-org1d9b702" class="outline-4">
<h4 id="org1d9b702"><span class="section-number-4">4.6.4</span> Quadratic Discriminant Analysis</h4>
<div class="outline-text-4" id="text-4-6-4">
<p>
We will now fit a QDA model to the <code>Smarket</code> data.
</p>

<div class="org-src-container">
<pre class="src src-python"><span style="color: #0000FF;">from</span> statsmodels <span style="color: #0000FF;">import</span> datasets
<span style="color: #0000FF;">from</span> sklearn.discriminant_analysis <span style="color: #0000FF;">import</span> QuadraticDiscriminantAnalysis <span style="color: #0000FF;">as</span> QDA
<span style="color: #0000FF;">import</span> pandas <span style="color: #0000FF;">as</span> pd
<span style="color: #0000FF;">import</span> numpy <span style="color: #0000FF;">as</span> np

<span style="color: #BA36A5;">smarket</span> = datasets.get_rdataset(<span style="color: #008000;">'Smarket'</span>, <span style="color: #008000;">'ISLR'</span>).data
<span style="color: #BA36A5;">smarket_train</span> = smarket.loc[smarket[<span style="color: #008000;">'Year'</span>] &lt; 2005]
<span style="color: #BA36A5;">smarket_test</span> = smarket.loc[smarket[<span style="color: #008000;">'Year'</span>] == 2005].copy()

<span style="color: #BA36A5;">qdf</span> = QDA()
qdf.fit(smarket_train[[<span style="color: #008000;">'Lag1'</span>, <span style="color: #008000;">'Lag2'</span>]], smarket_train[<span style="color: #008000;">'Direction'</span>])

<span style="color: #0000FF;">print</span>(qdf.priors_)              <span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Prior probabilities of groups</span>
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'--------'</span>)
<span style="color: #0000FF;">print</span>(qdf.means_)               <span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Group means</span>
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'--------'</span>)
<span style="color: #BA36A5;">predict_direction2005</span> = qdf.predict(smarket_test[[<span style="color: #008000;">'Lag1'</span>, <span style="color: #008000;">'Lag2'</span>]])
<span style="color: #0000FF;">print</span>(pd.crosstab(predict_direction2005, smarket_test[<span style="color: #008000;">'Direction'</span>]))
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'--------'</span>)
<span style="color: #0000FF;">print</span>(np.mean(predict_direction2005 == smarket_test[<span style="color: #008000;">'Direction'</span>]))
</pre>
</div>

<pre class="example">
[0.49198397 0.50801603]
--------
[[ 0.04279022  0.03389409]
 [-0.03954635 -0.03132544]]
--------
Direction  Down   Up
row_0               
Down         30   20
Up           81  121
--------
0.5992063492063492
</pre>
</div>
</div>

<div id="outline-container-orge5a8491" class="outline-4">
<h4 id="orge5a8491"><span class="section-number-4">4.6.5</span> K-Nearest Neightbors</h4>
<div class="outline-text-4" id="text-4-6-5">
<p>
We will now perform KNN, also on the <code>Smarket</code> data.
</p>

<div class="org-src-container">
<pre class="src src-python"><span style="color: #0000FF;">from</span> statsmodels <span style="color: #0000FF;">import</span> datasets
<span style="color: #0000FF;">from</span> sklearn.neighbors <span style="color: #0000FF;">import</span> KNeighborsClassifier
<span style="color: #0000FF;">import</span> pandas <span style="color: #0000FF;">as</span> pd
<span style="color: #0000FF;">import</span> numpy <span style="color: #0000FF;">as</span> np

<span style="color: #BA36A5;">smarket</span> = datasets.get_rdataset(<span style="color: #008000;">'Smarket'</span>, <span style="color: #008000;">'ISLR'</span>).data
<span style="color: #BA36A5;">smarket_train</span> = smarket.loc[smarket[<span style="color: #008000;">'Year'</span>] &lt; 2005]
<span style="color: #BA36A5;">smarket_test</span> = smarket.loc[smarket[<span style="color: #008000;">'Year'</span>] == 2005].copy()

<span style="color: #BA36A5;">knn1</span> = KNeighborsClassifier(n_neighbors=1)
knn1.fit(smarket_train[[<span style="color: #008000;">'Lag1'</span>, <span style="color: #008000;">'Lag2'</span>]], smarket_train[<span style="color: #008000;">'Direction'</span>])
<span style="color: #BA36A5;">smarket_test</span>[<span style="color: #008000;">'predict_dir_knn1'</span>] = knn1.predict(smarket_test[[<span style="color: #008000;">'Lag1'</span>, <span style="color: #008000;">'Lag2'</span>]])
<span style="color: #0000FF;">print</span>(pd.crosstab(smarket_test[<span style="color: #008000;">'predict_dir_knn1'</span>], smarket_test[<span style="color: #008000;">'Direction'</span>]))
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'--------'</span>)
<span style="color: #0000FF;">print</span>(np.mean(smarket_test[<span style="color: #008000;">'predict_dir_knn1'</span>] == smarket_test[<span style="color: #008000;">'Direction'</span>]))
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'--------'</span>)

<span style="color: #BA36A5;">knn3</span> = KNeighborsClassifier(n_neighbors=3)
knn3.fit(smarket_train[[<span style="color: #008000;">'Lag1'</span>, <span style="color: #008000;">'Lag2'</span>]], smarket_train[<span style="color: #008000;">'Direction'</span>])
<span style="color: #BA36A5;">smarket_test</span>[<span style="color: #008000;">'predict_dir_knn3'</span>] = knn3.predict(smarket_test[[<span style="color: #008000;">'Lag1'</span>, <span style="color: #008000;">'Lag2'</span>]])
<span style="color: #0000FF;">print</span>(pd.crosstab(smarket_test[<span style="color: #008000;">'predict_dir_knn3'</span>], smarket_test[<span style="color: #008000;">'Direction'</span>]))
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'--------'</span>)
<span style="color: #0000FF;">print</span>(np.mean(smarket_test[<span style="color: #008000;">'predict_dir_knn3'</span>] == smarket_test[<span style="color: #008000;">'Direction'</span>]))
</pre>
</div>

<pre class="example">
Direction         Down  Up
predict_dir_knn1          
Down                43  58
Up                  68  83
--------
0.5
--------
Direction         Down  Up
predict_dir_knn3          
Down                48  55
Up                  63  86
--------
0.5317460317460317
</pre>
</div>
</div>

<div id="outline-container-org790d805" class="outline-4">
<h4 id="org790d805"><span class="section-number-4">4.6.6</span> An Application to Caravan Insurance Data</h4>
<div class="outline-text-4" id="text-4-6-6">
<p>
Finally, we will apply the KNN approach to the <code>Caravan</code> data set in the <code>ISLR</code>
library. 
</p>

<div class="org-src-container">
<pre class="src src-python"><span style="color: #0000FF;">from</span> statsmodels <span style="color: #0000FF;">import</span> datasets
<span style="color: #0000FF;">from</span> sklearn.neighbors <span style="color: #0000FF;">import</span> KNeighborsClassifier
<span style="color: #0000FF;">from</span> sklearn.linear_model <span style="color: #0000FF;">import</span> LogisticRegression
<span style="color: #0000FF;">import</span> pandas <span style="color: #0000FF;">as</span> pd
<span style="color: #0000FF;">import</span> numpy <span style="color: #0000FF;">as</span> np

<span style="color: #BA36A5;">caravan</span> = datasets.get_rdataset(<span style="color: #008000;">'Caravan'</span>, <span style="color: #008000;">'ISLR'</span>).data
<span style="color: #0000FF;">print</span>(caravan[<span style="color: #008000;">'Purchase'</span>].value_counts())
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'--------'</span>)

<span style="color: #BA36A5;">caravan_scale</span> = caravan.iloc[:,:-1]
<span style="color: #BA36A5;">caravan_scale</span> = (caravan_scale - caravan_scale.mean()) / caravan_scale.std()

<span style="color: #BA36A5;">caravan_test</span> = caravan_scale.iloc[:1000]
<span style="color: #BA36A5;">purchase_test</span> = caravan.iloc[:1000][<span style="color: #008000;">'Purchase'</span>]

<span style="color: #BA36A5;">caravan_train</span> = caravan_scale.iloc[1000:]
<span style="color: #BA36A5;">purchase_train</span> = caravan.iloc[1000:][<span style="color: #008000;">'Purchase'</span>]

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Fit KNN with 1, 3, and 5 neighbors</span>
<span style="color: #BA36A5;">knn1</span> = KNeighborsClassifier(n_neighbors=1)
knn1.fit(caravan_train, purchase_train)
<span style="color: #BA36A5;">purchase_predict_knn1</span> = knn1.predict(caravan_test)

<span style="color: #0000FF;">print</span>(np.mean(purchase_test != purchase_predict_knn1))
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'--------'</span>)
<span style="color: #0000FF;">print</span>(np.mean(purchase_test == <span style="color: #008000;">'Yes'</span>))
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'--------'</span>)
<span style="color: #0000FF;">print</span>(pd.crosstab(purchase_predict_knn1, purchase_test))
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'--------'</span>)

<span style="color: #BA36A5;">knn3</span> = KNeighborsClassifier(n_neighbors=3)
knn3.fit(caravan_train, purchase_train)
<span style="color: #BA36A5;">purchase_predict_knn3</span> = knn3.predict(caravan_test)

<span style="color: #0000FF;">print</span>(np.mean(purchase_test != purchase_predict_knn3))
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'--------'</span>)
<span style="color: #0000FF;">print</span>(np.mean(purchase_test == <span style="color: #008000;">'Yes'</span>))
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'--------'</span>)
<span style="color: #0000FF;">print</span>(pd.crosstab(purchase_predict_knn3, purchase_test))
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'--------'</span>)

<span style="color: #BA36A5;">knn5</span> = KNeighborsClassifier(n_neighbors=5)
knn5.fit(caravan_train, purchase_train)
<span style="color: #BA36A5;">purchase_predict_knn5</span> = knn5.predict(caravan_test)

<span style="color: #0000FF;">print</span>(np.mean(purchase_test != purchase_predict_knn5))
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'--------'</span>)
<span style="color: #0000FF;">print</span>(np.mean(purchase_test == <span style="color: #008000;">'Yes'</span>))
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'--------'</span>)
<span style="color: #0000FF;">print</span>(pd.crosstab(purchase_predict_knn5, purchase_test))
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'--------'</span>)

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Now fit logistic regression</span>
<span style="color: #BA36A5;">logit_model</span> = LogisticRegression(solver=<span style="color: #008000;">'lbfgs'</span>, max_iter=1000)
logit_model.fit(caravan_train, purchase_train)
<span style="color: #BA36A5;">purchase_predict_logit</span> = logit_model.predict(caravan_test)
<span style="color: #0000FF;">print</span>(pd.crosstab(purchase_predict_logit, purchase_test))
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'--------'</span>)

<span style="color: #BA36A5;">purchase_predict_prob_logit</span> = logit_model.predict_proba(caravan_test)
<span style="color: #BA36A5;">purchase_predict_logit_prob25</span> = np.vectorize(
    <span style="color: #0000FF;">lambda</span> x: <span style="color: #008000;">'Yes'</span> <span style="color: #0000FF;">if</span> x &gt; 0.25 <span style="color: #0000FF;">else</span> <span style="color: #008000;">'No'</span>)(purchase_predict_prob_logit[:,1])
<span style="color: #0000FF;">print</span>(pd.crosstab(purchase_predict_logit_prob25, purchase_test))
</pre>
</div>

<pre class="example">
No     5474
Yes     348
Name: Purchase, dtype: int64
--------
0.118
--------
0.059
--------
Purchase   No  Yes
row_0             
No        873   50
Yes        68    9
--------
0.074
--------
0.059
--------
Purchase   No  Yes
row_0             
No        921   54
Yes        20    5
--------
0.066
--------
0.059
--------
Purchase   No  Yes
row_0             
No        930   55
Yes        11    4
--------
Purchase   No  Yes
row_0             
No        934   59
Yes         7    0
--------
Purchase   No  Yes
row_0             
No        917   48
Yes        24   11
</pre>
</div>
</div>
</div>
</div>

<div id="outline-container-org6c34bb7" class="outline-2">
<h2 id="org6c34bb7"><span class="section-number-2">5</span> Resampling Methods</h2>
<div class="outline-text-2" id="text-5">
</div>
<div id="outline-container-org3d5343e" class="outline-3">
<h3 id="org3d5343e"><span class="section-number-3">5.1</span> Cross-Validation</h3>
<div class="outline-text-3" id="text-5-1">
<p>
Figure <a href="#org5f4f05c">47</a> displays the <i>validation set approach</i>, a simple
stategy to estimate the test error associated with fitting a particular
statistical learning method on a set of observations.
</p>


<div id="org5f4f05c" class="figure">
<p><img src="figures/fig5_1.png" alt="fig5_1.png" />
</p>
<p><span class="figure-number">Figure 47: </span>A schematic display of the validation set approach.  A set of \(n\) observations are randomly split into a training set (shown in blue, containing observations 7, 22, and 13, among others) and a validation set (shown in red, and containing observation 91, among others).  The statistical learning method is fit on the training set, and its performance is evaluated on the validation set.</p>
</div>

<p>
In figure <a href="#org0279fbf">48</a>, the left-hand panel shows validation sample MSE as a
function of polynomial order for which a regression model was fit on training
sample.  The two samples are obtained by randomly splitting <code>Auto</code> data set into
two data sets of 196 observations each.  The right-hand panel shows the results
of repeating this exercise 10 times, each time with a different random split of
the observations into training and validation sets.  The model with a quadratic
term has a lower MSE compared to the model with only a linear term.  There is
not much benefit from adding cubic or higher order polynomial terms in the
regression model.
</p>


<div id="org0279fbf" class="figure">
<p><img src="figures/fig5_2.png" alt="fig5_2.png" />
</p>
<p><span class="figure-number">Figure 48: </span>The validation set approach was used in the <code>Auto</code> data set in order to estimate the test error that results from predicting <code>mpg</code> using polynomial functions of <code>horsepower</code>.  Left: Validation error estimates for a single split into training and validation data sets.  Right: The validatioin method was repeated ten times, each time using a different random split of the observations into a training set and a validation set.  This illustrates the variability of of the estimated test MSE that results from this approach.</p>
</div>

<p>
Figure <a href="#org10b187e">49</a> displas the Leave One Out Cross Validation (LOOCV) approach.
</p>


<div id="org10b187e" class="figure">
<p><img src="figures/fig5_3.png" alt="fig5_3.png" />
</p>
<p><span class="figure-number">Figure 49: </span>A schematic display of LOOCV.  A set of \(n\) data points is repeatedly split into a training set (shown in blue) containing all but one observation, and a validation set that contains only that observation (shown in red).  The test error is then estimated by averaging the n resulting MSE's.  The first training set contains all but observation 1, the second training set contains all but observation 2, and so on.</p>
</div>

<p>
The left-hand panel of figure <a href="#orgbaeb79c">50</a> shows test set MSE as a function of
polynomial degree when LOOCV is used on the <code>Auto</code> data set.  We fit linear
regression models to predict <code>mpg</code> using polynomial functions of <code>horsepower</code>.
The right-hand panel of figure <a href="#orgbaeb79c">50</a> shows nine different 10-fold
CV estimates for the <code>Auto</code> data set, each resulting from a different random
split of the observations into ten folds.
</p>


<div id="orgbaeb79c" class="figure">
<p><img src="figures/fig5_4.png" alt="fig5_4.png" />
</p>
<p><span class="figure-number">Figure 50: </span>Cross-validation was used in the <code>Auto</code> data set in order to estimate the test error that results from predicting <code>mpg</code> using polynomial functions of <code>horsepower</code>.  Left: The LOOCV error curve.  Right: 10-fold CV was run nin separate times, each with a different random split of the data into ten parts.  The figure shows the nine slightly different CV error curves.</p>
</div>

<p>
Figure <a href="#org732d47f">51</a> illustrates the <i>k</i>-fold CV approach.
</p>


<div id="org732d47f" class="figure">
<p><img src="figures/fig5_5.png" alt="fig5_5.png" />
</p>
<p><span class="figure-number">Figure 51: </span>A schematic display of 5-fold CV.  A set of \(n\) observations is randomly split into five non-overlapping groups.  Each of these fifths acts as a validation set (shown in red), and the remainder as a training set (shown in blue).  The test error is estimated by averaging the five resulting MSE estimates.</p>
</div>

<p>
In figure <a href="#org8b30885">52</a>, we plot the cross-validation estimates and true
test error rates that result from fitting least squares polynomials to the
simulated data sets illustrated in figures <a href="#org43f7e4e">9</a>,
<a href="#org9a4ea7b">10</a>, and <a href="#orgb45f0cf">11</a> of chapter <a href="#orgce1fc8b">2</a>.  In all
three plots, the two cross validation errors are very similar.  
</p>


<div id="org8b30885" class="figure">
<p><img src="figures/fig5_6.png" alt="fig5_6.png" />
</p>
<p><span class="figure-number">Figure 52: </span>True and estimated test MSE for the simulated data sets in Figures <a href="#org43f7e4e">9</a> (left), <a href="#org9a4ea7b">10</a> (center), and <a href="#orgb45f0cf">11</a> (right).  The true test MSE is shown in blue, the LOOCV estimate is shown in black dashed line, and the 10-fold CV estimate is shown in red dotted line.</p>
</div>

<p>
Figure <a href="#orgd59cb26">53</a> shows Bayesian decision boundary (blue dashed line)
and logistic regression decision boundary (black line) for 1- to 4-degree
polynomials on \(X_1\) and \(X_2\). 
</p>


<div id="orgd59cb26" class="figure">
<p><img src="figures/fig5_7.png" alt="fig5_7.png" />
</p>
<p><span class="figure-number">Figure 53: </span>Logistic regression fits on the two-dimensional classification data displayed in figure <a href="#org0d2d113">13</a>.  The Bayes decision boundary is represented using a blue dashed line.  Estimated decision boundaries from linear, quadratic, cubic, and quartic (degrees 1-4) logistic regressions are displayed in black.</p>
</div>

<p>
The left-hand panel of figure <a href="#orgd60a6bf">54</a> displays in black 10-fold CV
error rates that result from fitting ten logistic regression models to the data,
using polynomial functions of the predictors up to tenth order.  The true test
errors are shown in red, and the training errors are shown in blue.  The
training error tends to decrease as the flexibility of the fit increases.  The
test error is higher than training error.  The 10-fold CV
error rate is a close approximation to the test error rate.  
</p>

<p>
The right-hand panel of figure <a href="#orgd60a6bf">54</a> displays the same three curves
using the KNN approach for classification, as a function of the value of <i>K</i>
(the number of neighbors used in the KNN classifier).  Again, the training error
rate declines as the method becomes more flexible, and so we see that the
training error rate cannot be used to select the optimal value of <i>K</i>.  
</p>


<div id="orgd60a6bf" class="figure">
<p><img src="figures/fig5_8.png" alt="fig5_8.png" />
</p>
<p><span class="figure-number">Figure 54: </span>Test error (red), training error(blue), and 10-fold CV error (black) on the two-dimensional classification data displayed in <a href="#orgd59cb26">53</a>.  Left: Logistic regression using polynomial functions of the predictors.  The order of the polynomials used is displayed on the x-axis.  Right: The KNN classifier with different values of K, the number of neighbors used in the KNN classifier.</p>
</div>
</div>
</div>

<div id="outline-container-org4fb86ce" class="outline-3">
<h3 id="org4fb86ce"><span class="section-number-3">5.2</span> The Bootstrap</h3>
<div class="outline-text-3" id="text-5-2">
<p>
Figure <a href="#orgfe665d3">55</a> illustrates the approach for estimating &alpha; by
repeated simulation of data.  In each panel, we simulated 100 pairs of returns
for the investments <i>X</i> and <i>Y</i>.  We used these returns to estimate
\(\sigma_X^2\), \(\sigma_Y^2\) and \(\sigma_{XY}\), which are then used to estimate &alpha;.
</p>


<div id="orgfe665d3" class="figure">
<p><img src="figures/fig5_9.png" alt="fig5_9.png" />
</p>
<p><span class="figure-number">Figure 55: </span>Each panel displays 100 simulated returns for investments <i>X</i> and <i>Y</i>.  The resulting estimates of &alpha; are displayed in bottom right corner.</p>
</div>

<p>
It is natural to wish to quantify the accuracy of our estimate of &alpha;.  To
estimate the standard deviation of \(\hat{\alpha}\), we repeated the process of
simulating 100 paired observations of <i>X</i> and <i>Y</i>, and estimating &alpha; 1000
times.  We thereby obtain 1000 estimates of &alpha;, which we can call
\({\hat{\alpha}}_1, {\hat{\alpha}}_2, ...,{\hat{\alpha}}_{1000}\). The left-hand panel of
figure <a href="#org80a983b">56</a> displays a histogram of the resulting estimates.  The
mean over all 1000 estimates for &alpha; is 0.599, which is very close
to \(\alpha = 0.6\).  The standard deviation of the estimates is 0.08.
</p>

<p>
The bootstrap approach is illustrated in the center panel of figure
<a href="#org80a983b">56</a>, which displays a histogram of 1000 bootstrap estimates of
&alpha;, each computed using a distinct bootstrap data set.  The panel was
constructed on the basis of a single data set, and hence could be created using
real data. The right-hand panel displays the information in the center and left
panels in a different way, via boxplots of the estimates of &alpha; obtained by
generating 1000 simulated data sets from the true population and using the
boostrap approach.  
</p>


<div id="org80a983b" class="figure">
<p><img src="figures/fig5_10.png" alt="fig5_10.png" />
</p>
<p><span class="figure-number">Figure 56: </span>Left: A histogram of the estimates of &alpha; obtained by generating 1000 simulated data sets from the true population.  Center: A histogram of the estimates of &alpha; obtained from 1000 bootstrap samples from a single data set.  Right: The estimates of &alpha; displayed in the left and center panels are shown as boxplots.  In each panel, the red line indicates the true value of &alpha;.</p>
</div>
</div>
</div>

<div id="outline-container-org0379a7d" class="outline-3">
<h3 id="org0379a7d"><span class="section-number-3">5.3</span> Lab: Cross-Validation and the Bootstrap</h3>
<div class="outline-text-3" id="text-5-3">
</div>
<div id="outline-container-org8dcc4da" class="outline-4">
<h4 id="org8dcc4da"><span class="section-number-4">5.3.1</span> The Validation Set Approach</h4>
<div class="outline-text-4" id="text-5-3-1">
<p>
We use the function <code>choice</code> in numpy.random library to split the set of
observations in <code>Auto</code> data set into two subsets of 196 observations.  Then we
fit regression models on the training data set and calculate validation error on
the validation set.
</p>

<p>
These results show that a model that predicts <code>mpg</code> using a quadratic function
of <code>horsepower</code> performs better than a model that predicts <code>mpg</code> using a linear
function of horsepower.  There is little evidence that a cubic function of
<code>horsepower</code> is better than the quadratic function. 
</p>

<div class="org-src-container">
<pre class="src src-python"><span style="color: #0000FF;">import</span> numpy <span style="color: #0000FF;">as</span> np
<span style="color: #0000FF;">from</span> statsmodels <span style="color: #0000FF;">import</span> datasets
<span style="color: #0000FF;">import</span> statsmodels.formula.api <span style="color: #0000FF;">as</span> smf

<span style="color: #BA36A5;">auto</span> = datasets.get_rdataset(<span style="color: #008000;">'Auto'</span>, <span style="color: #008000;">'ISLR'</span>).data

np.random.seed(911)
<span style="color: #BA36A5;">train_ind</span> = np.random.choice(auto.shape[0], size=<span style="color: #006FE0;">int</span>(auto.shape[0]/2),
                             replace=<span style="color: #D0372D;">False</span>)
<span style="color: #BA36A5;">all_ind</span> = np.arange(auto.shape[0])
<span style="color: #BA36A5;">test_ind</span> = <span style="color: #006FE0;">set</span>(all_ind).difference(<span style="color: #006FE0;">set</span>(train_ind))
<span style="color: #BA36A5;">test_ind</span> = <span style="color: #006FE0;">list</span>(test_ind)
<span style="color: #BA36A5;">auto_train</span> = auto.iloc[train_ind]
<span style="color: #BA36A5;">auto_test</span> = auto.iloc[test_ind]

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Fit first linear model</span>
<span style="color: #BA36A5;">lm_model</span> = smf.ols(formula=<span style="color: #008000;">'mpg ~ horsepower'</span>, data=auto_train)
<span style="color: #BA36A5;">lm_fit</span> = lm_model.fit()
<span style="color: #BA36A5;">mse_train</span> = np.<span style="color: #006FE0;">sum</span>((lm_fit.predict(auto_train) - auto_train[<span style="color: #008000;">'mpg'</span>]) ** 2) / \
    (auto_train.shape[0] - 2)
<span style="color: #0000FF;">print</span>(mse_train)
<span style="color: #0000FF;">print</span>(lm_fit.mse_resid)         <span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">same value</span>
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'--------'</span>)

<span style="color: #BA36A5;">mse_test</span> = np.<span style="color: #006FE0;">sum</span>((lm_fit.predict(auto_test) - auto_test[<span style="color: #008000;">'mpg'</span>]) ** 2) / \
    (auto_test.shape[0] - 2)
<span style="color: #0000FF;">print</span>(mse_test)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'--------'</span>)

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Fit quadratic model</span>
<span style="color: #BA36A5;">lm_model2</span> = smf.ols(<span style="color: #008000;">'mpg ~ horsepower + I(horsepower ** 2)'</span>, data=auto_train)
<span style="color: #BA36A5;">lm_fit2</span> = lm_model2.fit()
<span style="color: #BA36A5;">mse_test2</span> = np.<span style="color: #006FE0;">sum</span>((lm_fit2.predict(auto_test) - auto_test[<span style="color: #008000;">'mpg'</span>]) ** 2) / \
    (auto_test.shape[0] - 3)
<span style="color: #0000FF;">print</span>(mse_test2)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'--------'</span>)

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Fit third order polynomial model</span>
<span style="color: #BA36A5;">lm_model3</span> = smf.ols(<span style="color: #008000;">'mpg ~ horsepower + I(horsepower ** 2) + I(horsepower ** 3)'</span>,
                    data=auto_train)
<span style="color: #BA36A5;">lm_fit3</span> = lm_model3.fit()
<span style="color: #BA36A5;">mse_test3</span> = np.<span style="color: #006FE0;">sum</span>((lm_fit3.predict(auto_test) - auto_test[<span style="color: #008000;">'mpg'</span>]) ** 2) / \
    (auto_test.shape[0] - 4)
<span style="color: #0000FF;">print</span>(mse_test3)
</pre>
</div>

<pre class="example">
23.61593457249045
23.615934572490445
--------
24.868027221207488
--------
20.701029881139203
--------
20.893010200297326

</pre>
</div>
</div>

<div id="outline-container-org3c66e7c" class="outline-4">
<h4 id="org3c66e7c"><span class="section-number-4">5.3.2</span> Leave-One-Out Cross-Validation</h4>
<div class="outline-text-4" id="text-5-3-2">
<p>
Using first principles, it is straightforward to implement leave-one-out
cross-validation. 
</p>

<div class="org-src-container">
<pre class="src src-python"><span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">mseLOOCV.py</span>

<span style="color: #0000FF;">import</span> numpy <span style="color: #0000FF;">as</span> np
<span style="color: #0000FF;">from</span> statsmodels <span style="color: #0000FF;">import</span> datasets
<span style="color: #0000FF;">import</span> statsmodels.formula.api <span style="color: #0000FF;">as</span> smf

<span style="color: #BA36A5;">auto</span> = datasets.get_rdataset(<span style="color: #008000;">'Auto'</span>, <span style="color: #008000;">'ISLR'</span>).data
<span style="color: #BA36A5;">all_ind</span> = np.arange(auto.shape[0])

<span style="color: #BA36A5;">my_formula</span> = <span style="color: #008000;">'mpg ~ horsepower'</span>

<span style="color: #BA36A5;">mse_loocv</span> = []
<span style="color: #BA36A5;">degree</span> = []
<span style="color: #0000FF;">for</span> i_degree <span style="color: #0000FF;">in</span> <span style="color: #006FE0;">range</span>(1, 6):
    <span style="color: #BA36A5;">mse</span> = []
    <span style="color: #0000FF;">for</span> i_obs <span style="color: #0000FF;">in</span> <span style="color: #006FE0;">range</span>(auto.shape[0]):
        <span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">auto_train = auto.loc[all_ind != i_obs]</span>
        <span style="color: #BA36A5;">auto_train</span> = auto.drop(auto.index[i_obs])
        <span style="color: #BA36A5;">auto_test</span> = auto.iloc[i_obs]
        <span style="color: #BA36A5;">lm_model</span> = smf.ols(my_formula, data=auto_train)
        <span style="color: #BA36A5;">lm_fit</span> = lm_model.fit()
        <span style="color: #BA36A5;">hp_predict</span> = lm_fit.predict(
            exog=<span style="color: #006FE0;">dict</span>(horsepower=auto_test[<span style="color: #008000;">'horsepower'</span>]))
        mse.append((hp_predict - auto_test[<span style="color: #008000;">'mpg'</span>]) ** 2)

    mse_loocv.append(np.mean(mse))
    degree.append(i_degree)
    <span style="color: #BA36A5;">my_formula</span> += <span style="color: #008000;">' + I(horsepower **'</span> + <span style="color: #006FE0;">str</span>(i_degree + 1) + <span style="color: #008000;">')'</span>

<span style="color: #0000FF;">for</span> i_degree, mse <span style="color: #0000FF;">in</span> <span style="color: #006FE0;">zip</span>(degree, mse_loocv):
    <span style="color: #0000FF;">print</span>(<span style="color: #008000;">'degree: '</span>, i_degree, <span style="color: #008000;">', mse_loocv:'</span>, <span style="color: #006FE0;">round</span>(mse, 3))
</pre>
</div>

<div class="org-src-container">
<pre class="src src-python"><span style="color: #0000FF;">import</span> sys
sys.path.append(<span style="color: #008000;">'code/chap5/'</span>)

<span style="color: #0000FF;">import</span> mseLOOCV
</pre>
</div>

<pre class="example">
degree:  1 , mse_loocv: 24.232
degree:  2 , mse_loocv: 19.248
degree:  3 , mse_loocv: 19.335
degree:  4 , mse_loocv: 19.424
degree:  5 , mse_loocv: 19.033

</pre>
</div>
</div>

<div id="outline-container-orgafb980e" class="outline-4">
<h4 id="orgafb980e"><span class="section-number-4">5.3.3</span> k-Fold Cross-Validation</h4>
<div class="outline-text-4" id="text-5-3-3">
<p>
Using first principles, it is straightforward to implement <i>k</i>-fold CV.  Once
again, we see little evidence that using cubic or higher order polynomial terms
leads to lower test error than simply using a quadratic fit.
</p>

<div class="org-src-container">
<pre class="src src-python"><span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">mse_kFoldCV.py</span>

<span style="color: #0000FF;">import</span> numpy <span style="color: #0000FF;">as</span> np
<span style="color: #0000FF;">import</span> statsmodels.formula.api <span style="color: #0000FF;">as</span> smf
<span style="color: #0000FF;">from</span> statsmodels <span style="color: #0000FF;">import</span> datasets

<span style="color: #BA36A5;">auto</span> = datasets.get_rdataset(<span style="color: #008000;">'Auto'</span>, <span style="color: #008000;">'ISLR'</span>).data

<span style="color: #BA36A5;">n_folds</span> = 10
<span style="color: #BA36A5;">max_degree</span> = 10

np.random.seed(911)
<span style="color: #BA36A5;">fold_ind</span> = np.random.choice(n_folds, auto.shape[0])
<span style="color: #BA36A5;">all_ind</span> = np.arange(auto.shape[0])
<span style="color: #BA36A5;">degree</span> = []
<span style="color: #BA36A5;">mse_folds</span> = {}

<span style="color: #BA36A5;">my_formula</span> = <span style="color: #008000;">'mpg ~ horsepower'</span>
<span style="color: #0000FF;">for</span> i_degree <span style="color: #0000FF;">in</span> <span style="color: #006FE0;">range</span>(1, max_degree + 1):
    <span style="color: #BA36A5;">mse_folds</span>[i_degree] = []
    <span style="color: #0000FF;">for</span> i_fold <span style="color: #0000FF;">in</span> <span style="color: #006FE0;">range</span>(n_folds):
        <span style="color: #BA36A5;">train_df</span> = auto.loc[i_fold != fold_ind]
        <span style="color: #BA36A5;">test_df</span> = auto.loc[i_fold == fold_ind]
        <span style="color: #BA36A5;">lm_model</span> = smf.ols(my_formula, data=train_df)
        <span style="color: #BA36A5;">lm_fit</span> = lm_model.fit()
        <span style="color: #BA36A5;">mse</span> = np.mean((lm_fit.predict(test_df) - test_df[<span style="color: #008000;">'mpg'</span>]) ** 2)
        mse_folds[i_degree].append(mse)

    degree.append(i_degree)
    <span style="color: #BA36A5;">my_formula</span> += <span style="color: #008000;">' + I(horsepower ** '</span> + <span style="color: #006FE0;">str</span>(i_degree + 1) + <span style="color: #008000;">')'</span>

<span style="color: #BA36A5;">mse_degree</span> = []
<span style="color: #0000FF;">for</span> i_degree <span style="color: #0000FF;">in</span> mse_folds.keys():
    mse_degree.append(np.mean(mse_folds[i_degree]))

<span style="color: #0000FF;">for</span> i_degree, mse_kfold <span style="color: #0000FF;">in</span> <span style="color: #006FE0;">zip</span>(degree, mse_degree):
    <span style="color: #0000FF;">print</span>(<span style="color: #008000;">'degree: '</span>, i_degree, <span style="color: #008000;">', mse_kfold: '</span>, <span style="color: #006FE0;">round</span>(mse_kfold, 3))
</pre>
</div>

<div class="org-src-container">
<pre class="src src-python"><span style="color: #0000FF;">import</span> sys
sys.path.append(<span style="color: #008000;">'cnoode/chap5/'</span>)

<span style="color: #0000FF;">import</span> mse_kFoldCV
</pre>
</div>

<pre class="example">
degree:  1 , mse_kfold:  24.213
degree:  2 , mse_kfold:  19.378
degree:  3 , mse_kfold:  19.477
degree:  4 , mse_kfold:  19.538
degree:  5 , mse_kfold:  19.166
degree:  6 , mse_kfold:  19.183
degree:  7 , mse_kfold:  19.157
degree:  8 , mse_kfold:  23.247
degree:  9 , mse_kfold:  23.258
degree:  10 , mse_kfold:  65.251
</pre>
</div>
</div>

<div id="outline-container-orge2c382c" class="outline-4">
<h4 id="orge2c382c"><span class="section-number-4">5.3.4</span> The Bootstrap</h4>
<div class="outline-text-4" id="text-5-3-4">
</div>
<ol class="org-ol">
<li><a id="org6871c13"></a>Estimating the Accuracy of a Statistic of Interest<br />
<div class="outline-text-5" id="text-5-3-4-1">
<p>
We will first write a function that takes two inputs, data and index, and
calculates the desired statistic &alpha;.  Then we will repeatedly call this
function and store the estimates of &alpha;.
</p>

<div class="org-src-container">
<pre class="src src-python"><span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">alphaBootstrap.py</span>

<span style="color: #0000FF;">import</span> numpy <span style="color: #0000FF;">as</span> np
<span style="color: #0000FF;">import</span> pandas <span style="color: #0000FF;">as</span> pd
<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">from statsmodels import datasets</span>
<span style="color: #0000FF;">import</span> statsmodels.formula.api <span style="color: #0000FF;">as</span> smf


<span style="color: #0000FF;">def</span> <span style="color: #006699;">alphaEst</span>(returns_df, row_index):
    <span style="color: #036A07;">'''Assumes returns_df is a return dataframe with two columns of stock returns,</span>
<span style="color: #036A07;">    row_index is a list of row indexes to be used in calculation.</span>
<span style="color: #036A07;">    Returns alpha estimate using subset of data defined by row_index.'''</span>

    <span style="color: #BA36A5;">cov_xy</span> = np.cov(returns_df.iloc[row_index], rowvar=<span style="color: #D0372D;">False</span>)
    <span style="color: #0000FF;">return</span> (cov_xy[1, 1] - cov_xy[0, 1]) / \
        (cov_xy[0, 0] + cov_xy[1, 1] - 2 * cov_xy[0, 1])


<span style="color: #0000FF;">def</span> <span style="color: #006699;">bootStrap</span>(my_df, myFunc, sample_size, n_bootstrap, all_res=<span style="color: #D0372D;">False</span>):
    <span style="color: #036A07;">''' Assumes my_df is a dataframe and myFunc is a function that can</span>
<span style="color: #036A07;">    estimate a stastic on my_df.  Estimate statistic n_bootstrap times,</span>
<span style="color: #036A07;">    each with a sample of size sample_size.</span>
<span style="color: #036A07;">    Return mean and standard error of statistic.'''</span>
    <span style="color: #BA36A5;">my_stat</span> = []
    <span style="color: #0000FF;">for</span> i <span style="color: #0000FF;">in</span> <span style="color: #006FE0;">range</span>(n_bootstrap):
        <span style="color: #BA36A5;">index</span> = np.random.choice(my_df.shape[0], sample_size)
        my_stat.append(myFunc(my_df, index))

    <span style="color: #0000FF;">if</span> <span style="color: #006FE0;">isinstance</span>(my_stat[0], <span style="color: #006FE0;">float</span>):
        <span style="color: #BA36A5;">my_res</span> = {<span style="color: #008000;">'mean'</span>: np.mean(my_stat), <span style="color: #008000;">'std. error'</span>: np.std(my_stat)}
        <span style="color: #0000FF;">if</span> all_res:
            <span style="color: #BA36A5;">my_res</span>[<span style="color: #008000;">'stats'</span>] = my_stat

    <span style="color: #0000FF;">elif</span> <span style="color: #006FE0;">isinstance</span>(my_stat[0], pd.core.series.Series):
        <span style="color: #BA36A5;">my_stat_dict</span> = {}
        <span style="color: #0000FF;">for</span> ind <span style="color: #0000FF;">in</span> my_stat[0].index:
            <span style="color: #BA36A5;">my_stat_dict</span>[ind] = []
        <span style="color: #0000FF;">for</span> i <span style="color: #0000FF;">in</span> <span style="color: #006FE0;">range</span>(<span style="color: #006FE0;">len</span>(my_stat)):
            <span style="color: #0000FF;">for</span> key <span style="color: #0000FF;">in</span> my_stat_dict.keys():
                my_stat_dict[key].append(my_stat[i][key])
        <span style="color: #BA36A5;">my_res</span> = {}
        <span style="color: #0000FF;">for</span> key <span style="color: #0000FF;">in</span> my_stat_dict.keys():
            <span style="color: #BA36A5;">my_res</span>[key] = {}
            my_res[key][<span style="color: #008000;">'mean'</span>] = np.mean(my_stat_dict[key])
            my_res[key][<span style="color: #008000;">'std. error'</span>] = np.std(my_stat_dict[key])
        <span style="color: #0000FF;">if</span> all_res:
            <span style="color: #BA36A5;">my_res</span>[<span style="color: #008000;">'stats'</span>] = my_stat

    <span style="color: #0000FF;">return</span> my_res


<span style="color: #0000FF;">def</span> <span style="color: #006699;">autoDataCoef</span>(auto_df, row_index):
    <span style="color: #036A07;">'''Assumes auto_df is a dataframe which includes 'mpg' and</span>
<span style="color: #036A07;">    'horsepower' columns.  Fit a linear regression model on auto_df.</span>
<span style="color: #036A07;">    Use row_index to create a subset of auto_df.  Return regression</span>
<span style="color: #036A07;">    coefficients estimated from subset of auto_df.'''</span>
    <span style="color: #BA36A5;">lm_model</span> = smf.ols(<span style="color: #008000;">'mpg ~ horsepower'</span>, data=auto_df.iloc[row_index])
    <span style="color: #BA36A5;">lm_fit</span> = lm_model.fit()
    <span style="color: #0000FF;">return</span> lm_fit.params


<span style="color: #0000FF;">def</span> <span style="color: #006699;">autoDataCoef2</span>(auto_df, row_index):
    <span style="color: #036A07;">'''Assumes auto_df is a dataframe which has columns 'mpg' and</span>
<span style="color: #036A07;">    'horsepower'.  Fit an OLS regression model with mpg as a </span>
<span style="color: #036A07;">    quadratic function of horsepower.  Use subset of auto_df defined</span>
<span style="color: #036A07;">    by row_index.  Return regression coefficient estimates.'''</span>
    <span style="color: #BA36A5;">lm_model</span> = smf.ols(<span style="color: #008000;">'mpg ~ horsepower + I(horsepower ** 2)'</span>,
                       data=auto_df.iloc[row_index])
    <span style="color: #BA36A5;">lm_fit</span> = lm_model.fit()
    <span style="color: #0000FF;">return</span> lm_fit.params
</pre>
</div>

<div class="org-src-container">
<pre class="src src-python"><span style="color: #0000FF;">from</span> statsmodels <span style="color: #0000FF;">import</span> datasets
<span style="color: #0000FF;">import</span> numpy <span style="color: #0000FF;">as</span> np
<span style="color: #0000FF;">import</span> sys
sys.path.append(<span style="color: #008000;">'code/chap5/'</span>)
<span style="color: #0000FF;">from</span> alphaBootstrap <span style="color: #0000FF;">import</span> alphaEst, bootStrap

<span style="color: #BA36A5;">portfolio</span> = datasets.get_rdataset(<span style="color: #008000;">'Portfolio'</span>, <span style="color: #008000;">'ISLR'</span>).data

np.random.seed(911)
<span style="color: #BA36A5;">alpha_boot</span> = bootStrap(portfolio, alphaEst, sample_size=100, n_bootstrap=1000)
<span style="color: #0000FF;">print</span>(alpha_boot)
</pre>
</div>

<pre class="example">
{'mean': 0.5753949845303641, 'std. error': 0.08938513622277834}

</pre>
</div>
</li>

<li><a id="orgb91c806"></a>Estimating the Accuracy of a Linear Regression Model<br />
<div class="outline-text-5" id="text-5-3-4-2">
<p>
We now use bootstrap method to assess the variability of the estimates for
&beta;<sub>0</sub> and &beta;<sub>1</sub>, the intercept and slope terms for the linear regression
model that uses <code>horsepower</code> to predict <code>mpg</code> in the <code>Auto</code> data set.  We will
compare the estimates obtained using the bootstrap to those obtained using the
standar formulas for \(SE(\hat{\beta}_0)\) and \(SE(\hat{\beta}_1)\).
</p>

<div class="org-src-container">
<pre class="src src-python"><span style="color: #0000FF;">from</span> statsmodels <span style="color: #0000FF;">import</span> datasets
<span style="color: #0000FF;">import</span> statsmodels.formula.api <span style="color: #0000FF;">as</span> smf
<span style="color: #0000FF;">import</span> numpy <span style="color: #0000FF;">as</span> np
<span style="color: #0000FF;">import</span> sys
sys.path.append(<span style="color: #008000;">'code/chap5/'</span>)
<span style="color: #0000FF;">from</span> alphaBootstrap <span style="color: #0000FF;">import</span> autoDataCoef, bootStrap

<span style="color: #BA36A5;">auto</span> = datasets.get_rdataset(<span style="color: #008000;">'Auto'</span>, <span style="color: #008000;">'ISLR'</span>).data

np.random.seed(911)
<span style="color: #BA36A5;">mpg_hp_boot</span> = bootStrap(auto, autoDataCoef, sample_size=392, n_bootstrap=1000)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Bootstrap results:'</span>)
<span style="color: #0000FF;">for</span> key <span style="color: #0000FF;">in</span> mpg_hp_boot.keys():
    <span style="color: #0000FF;">print</span>(key, <span style="color: #008000;">':'</span>, mpg_hp_boot[key])
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'--------'</span>)

<span style="color: #BA36A5;">lm_model</span> = smf.ols(<span style="color: #008000;">'mpg ~ horsepower'</span>, data=auto)
<span style="color: #BA36A5;">lm_fit</span> = lm_model.fit()
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Regression results:'</span>)
<span style="color: #0000FF;">print</span>(lm_fit.summary2().tables[1].iloc[:,:4])

</pre>
</div>

<pre class="example">
Bootstrap results:
Intercept : {'mean': 39.94234375950751, 'std. error': 0.8748453071088308}
horsepower : {'mean': -0.15796112230552348, 'std. error': 0.007526082860287968}
--------
Regression results:
                Coef.  Std.Err.          t          P&gt;|t|
Intercept   39.935861  0.717499  55.659841  1.220362e-187
horsepower  -0.157845  0.006446 -24.489135   7.031989e-81

</pre>

<p>
Finally, we compute the bootstrap standard error estimates and the standard
linear regression estimates that result from fitting the quadratic model to the
<code>Auto</code> data.
</p>

<pre class="example">
Bootstrap results:
Intercept : {'mean': 57.02549325815686, 'std. error': 2.012215071375403}
horsepower : {'mean': -0.46840037414346225, 'std. error': 0.03187991112044731}
I(horsepower ** 2) : {'mean': 0.0012391590913923556, 'std. error': 0.00011523595073730878}
--------
Regression results:
                        Coef.  Std.Err.          t          P&gt;|t|
Intercept           56.900100  1.800427  31.603673  1.740911e-109
horsepower          -0.466190  0.031125 -14.978164   2.289429e-40
I(horsepower ** 2)   0.001231  0.000122  10.080093   2.196340e-21
</pre>
</div>
</li>
</ol>
</div>
</div>
</div>
<div id="outline-container-org104cb2b" class="outline-2">
<h2 id="org104cb2b"><span class="section-number-2">6</span> Linear Model Selection and Regularization</h2>
<div class="outline-text-2" id="text-6">
</div>
<div id="outline-container-org28de162" class="outline-3">
<h3 id="org28de162"><span class="section-number-3">6.1</span> Subset Selection</h3>
<div class="outline-text-3" id="text-6-1">
<p>
An application of best subset selection is shown in figure <a href="#org2a509ec">57</a>.
Each plotted point corresponds to a least squares regression model fit using a
different subset of the 10 predictors in the <code>Credit</code> data set.  We have plotted
the RSS and R<sup>2</sup> statistics for each model, as a function of the number of
variables.  The red curve connects the best models for each model size,
according to RSS or R<sup>2</sup>.  Initially, these quantities improve as the number of
varaibles increases. However, from the three-variable model on, there is little
improvement in RSS and R<sup>2</sup> when more predictors are included.
</p>


<div id="org2a509ec" class="figure">
<p><img src="figures/fig6_1.png" alt="fig6_1.png" />
</p>
<p><span class="figure-number">Figure 57: </span>For each possible model containing a subset of the ten predictors in the <code>Credit</code> data set, the RSS and R<sup>2</sup> are displayed.  The red frontier tracks the <i>best</i> model for a given number of predictors, according to RSS and R<sup>2</sup>.</p>
</div>

<p>
Table <a href="#orgfcb7f37">19</a> shows first four selected models for the best subset and
forward subset selection on the <code>Credit</code> data set.  Both best subset selection
and forward stepwise selection choose <code>Rating</code> for the best one-variable model
and then include <code>Income</code> and <code>Student</code> for the two- and three-variable models.
However, best subset selection replaces <code>Rating</code> by <code>Cards</code> in the four-variable
model.  On the other hand forward stepwise selection must maintain <code>Rating</code> in
its four-variable model. 
</p>


<table id="orgfcb7f37" border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">
<caption class="t-bottom"><span class="table-number">Table 19:</span> The first four selected models for best subset selection and forward stepwise selection on the <code>Credit</code> data set.  The first three models are identical, but the fourth models differ.</caption>

<colgroup>
<col  class="org-right" />

<col  class="org-left" />

<col  class="org-left" />
</colgroup>
<thead>
<tr>
<th scope="col" class="org-right">Count</th>
<th scope="col" class="org-left">Best subset</th>
<th scope="col" class="org-left">Forward stepwise</th>
</tr>
</thead>
<tbody>
<tr>
<td class="org-right">1</td>
<td class="org-left">Rating</td>
<td class="org-left">Rating</td>
</tr>

<tr>
<td class="org-right">2</td>
<td class="org-left">Income, Rating</td>
<td class="org-left">Rating, Income</td>
</tr>

<tr>
<td class="org-right">3</td>
<td class="org-left">Income, Rating, Student</td>
<td class="org-left">Rating, Income, Student</td>
</tr>

<tr>
<td class="org-right">4</td>
<td class="org-left">Income, Limit, Cards, Student</td>
<td class="org-left">Rating, Income, Student, Limit</td>
</tr>
</tbody>
</table>

<p>
Figure <a href="#org5e553dc">58</a> displays C<sub>p</sub>, BIC, and adjusted R<sup>2</sup> for the best model
of each size produced by best subset selection on the <code>Credit</code> data set.
</p>


<div id="org5e553dc" class="figure">
<p><img src="figures/fig6_2.png" alt="fig6_2.png" />
</p>
<p><span class="figure-number">Figure 58: </span>C<sub>p</sub>, BIC, and adjusted R<sup>2</sup> are shown for the best models of each size for the <code>Credit</code> data set (the lower frontier in figure <a href="#org2a509ec">57</a>).  C<sub>p</sub> and BIC are estimates of test MSE.  In the middle panel we see that the BIC estimate of test error shows an increase after four variables are selected.  The other two plots are rather flat after four variables are selected.</p>
</div>


<p>
Figure <a href="#org21c9df3">59</a> displays, as a function of <i>d</i>, the BIC, validation set
errors, and cross-validation errors on the <code>Credit</code> data set, for the best
<i>d</i>-variable model.  The validation errors were calculated by randomly selecting
two-thirds of the observations as the training set, and the remainder as the
validation set.  The cross-validation errors were computed using \(k = 10\)
folds.  Depending upon the choice of the random seed, validation errors may be
minimized by six or seven predictors.
</p>


<div id="org21c9df3" class="figure">
<p><img src="figures/fig6_3.png" alt="fig6_3.png" />
</p>
<p><span class="figure-number">Figure 59: </span>For the <code>Credit</code> data set, three quantities are displayed for the best model containing predictors, for <i>d</i> ranging from 1 to 10. The overall <i>best</i> model, based on each of these quantities, is shown as a green cross.  Left: Square root of BIC.  Center: Validation set errors.  Right: Cross-validation errors.</p>
</div>
</div>
</div>

<div id="outline-container-orgc55df27" class="outline-3">
<h3 id="orgc55df27"><span class="section-number-3">6.2</span> Shrinkage Methods</h3>
<div class="outline-text-3" id="text-6-2">
<p>
In figure <a href="#orga41649f">60</a> the ridge regression coefficient estimates for the
<code>Credit</code> data set are displayed.  In the left-hand panel, each curve corresponds
to the ridge regression coefficient estimate for one of the four important
variables, plotted as a function of &lambda;.  At the extreme left side of the
plot, &lambda; is essentially zero, and so the corresponding ridge coefficients
estimates are the same as the usual least square estimates.  But as &lambda;
increases, the ridge coefficients shrink towards zero.  
</p>

<p>
The right-hand panel of figure <a href="#orga41649f">60</a> displays the same ridge
coefficient estimates as the left-hand panel.  But instead of displaying &lambda;
on the <i>x</i>-axis, we now display \(\| \hat{\beta}_{\lambda}^R \|_2 / \|
\hat{\beta} \|_2\), where \(\hat{\beta}\) denotes the vector of least squares
coefficient estimates.  The notation \(\| \beta \|_2\) denotes the \(\ell_2\)
norm of a vector.  This norm is defined as \(\| \beta \|_2 = \sqrt{\sum_{j=1}^p
\beta_j^2}\).  It measures the distance of &beta; from zero.  
</p>


<div id="orga41649f" class="figure">
<p><img src="figures/fig6_4.png" alt="fig6_4.png" />
</p>
<p><span class="figure-number">Figure 60: </span>The standardized ridge regression coefficients are displayed for the <code>Credit</code> data set, as a function of &lambda; and \(\| \hat{\beta}_{\lambda}^R \|_2 / \| \hat{\beta} \|_2\).</p>
</div>

<p>
Ridge regression's advantage over least squares is rooted in <i>bias-variance
tradeoff</i>.  As &lambda; increases, the flexibility of ridge regression fit
decreases, leading to decreased variance but increased bias.  We use a simulated
data set with \(p = 30\) features and \(n = 50\) observations.  Figure
<a href="#orgbead74e">61</a> shows the bias-tradeoff on this simulated data set.
</p>


<div id="orgbead74e" class="figure">
<p><img src="figures/fig6_5.png" alt="fig6_5.png" />
</p>
<p><span class="figure-number">Figure 61: </span>Squared bias, variance, and test mean squared error for the ridge regression predictions on a simulated data set, as a function of &lambda; (left-hand panel) and \(\| \hat{\beta}_\lambda^R \|_2 / \| \hat{\beta} \|_2\).  The horizontal lines show the minimum possible MSE.  The crosses show the ridge regression models for which the MSE is the smallest.</p>
</div>

<p>
In figure <a href="#org8b9f541">62</a>, coefficient plots are generated from applying the
lasso to the <code>Credit</code> data set.  When &lambda; = 0, then the lasso simply gives
the least squares fit.  When &lambda; becomes sufficiently large, the lasso gives
the null model in which all coefficient estimates equal zero.  However, in
between these two extremes, the ridge regression and the lasso regression models
are quite different.  In the right-hand panel of figure <a href="#org8b9f541">62</a>, as we
move from left to right, at first the lasso model only contains the <code>Rating</code>
predictor.  Then <code>Student</code> and <code>Limit</code> enter the model, shortly followed by
<code>Income</code>. Depending upon the value of &lambda;, the lasso can produce a model
involving any number of variables.  In contrast, although the magnitude of the
estimates will depend upon &lambda;, ridge regression will always
include all of the variables in the model.  
</p>


<div id="org8b9f541" class="figure">
<p><img src="figures/fig6_6.png" alt="fig6_6.png" />
</p>
<p><span class="figure-number">Figure 62: </span>The standardized lasso coefficients on the <code>Credit</code> data set are shown as a function of &lambda; and \(\| \hat{\beta}_{\lambda}^L \|_1 / \| \hat{\beta} \|_1\).</p>
</div>

<p>
Figure <a href="#orgac4332a">63</a> illustrates why lasso, unlike ridge regression, results
in coefficient estimates that are exactly zero.  In the left-hand panel, lasso
coefficient constraint region is represented by solid blue diamond.  In the
right-hand panel, ridge regression coefficient constraint region is represented
by solid blue circle. The ellipses centered around \hat{\beta} represent regions
of constant RSS.  As ellipses expand outward from the least squares coefficient
estimates, RSS increases.  Lasso and ridge regression coefficient estimates are
given by the first point at which an ellipse touches the constraint region.
Since lasso constraint has <i>corners</i> at each of the axes, the ellipse will often
intersect the constraint region on an axis.  When this occurs, one of the
coefficients will equal zero.  On the other hand, since ridge regression
constraint has no sharp edges, the intersection will generally not occur on an
axis.  Therefore ridge regression coefficients will usually be non-zero.
</p>


<div id="orgac4332a" class="figure">
<p><img src="figures/fig6_7.png" alt="fig6_7.png" />
</p>
<p><span class="figure-number">Figure 63: </span>Contours of the error and constraint functions for the lasso (left) and ridge regression (right).  The solid blue areas are contraint regions, \(\| \beta_1 \| + \| \beta_2 \| \le s\) and \(\beta_1^2 + \beta_2 ^2 \le s\), while the red ellipses are the contours of the RSS.</p>
</div>


<p>
Figure <a href="#org880bb63">64</a> displays the choice of &lambda; that results from
performing leave-one-out cross-validation on the ridge regression fits from the
<code>Credit</code> data set. The dashed vertical lines indicate the selected value of
&lambda;. 
</p>


<div id="org880bb63" class="figure">
<p><img src="figures/fig6_12.png" alt="fig6_12.png" />
</p>
<p><span class="figure-number">Figure 64: </span>Left: For various values of &lambda;, cross-validation errors that result from applying ridge regression to the <code>Credit</code> data set.  Right: The coefficient estimates as a function of &lambda;.  The vertical dashed lines indicate the value of &lambda; selected by cross-validation.</p>
</div>
</div>
</div>

<div id="outline-container-org2c9ebf8" class="outline-3">
<h3 id="org2c9ebf8"><span class="section-number-3">6.3</span> Dimension Reduction Methods</h3>
<div class="outline-text-3" id="text-6-3">
<p>
Figure <a href="#org885972e">65</a> shows daily changes in 10-year Treasury note yield (<code>10
YR</code>) and 2-year Treasury note yield (<code>2 YR</code>) in year 2018.  The green sold line
represents the first principal component direction of the data.  We can see by
eye that this is the direction along which there is the greatest variability in
the data.  
</p>


<div id="org885972e" class="figure">
<p><img src="figures/fig6_14.png" alt="fig6_14.png" />
</p>
<p><span class="figure-number">Figure 65: </span>Daily changes in 10-year Treasury note yield (<code>10 YR</code>) and 2-year Treasury note yield (<code>2 YR</code>) in year 2018 are shown as red circles.  The green solid line indicates the first principal component, and the blue dashed line indicates the second principal component.</p>
</div>


<p>
In another interpretation of PCA, the first principal component vector defines
the line that is <i>as close as possible</i> to the data.  In figure
<a href="#org5aade76">66</a>, the left-hand panel shows the distances between data points
and the first principal component.  The first principal component has been
chosen so that the projected observations are <i>as close as possible</i> to the
original observations.  
</p>

<p>
In the right-hand panel of figure <a href="#org5aade76">66</a>, the left-hand panel has
been rotated so that the first principal component direction coincides with the
<i>x</i>-axis. 
</p>


<div id="org5aade76" class="figure">
<p><img src="figures/fig6_15.png" alt="fig6_15.png" />
</p>
<p><span class="figure-number">Figure 66: </span>A subset of the Treasury yield data.  Left: The first principal component direction is shown in blue.  It is the dimension along which the data vary the most, and it also defines the line that is closest to all <i>n</i> of the observations.  The distances from each observation to the principal component are represented in using black dashed line segments.  Right: The left-hand panel has been rotated so that the first principal component direction coincides with the x-axis.</p>
</div>

<p>
Figure <a href="#orge325136">67</a> displays <code>10 YR</code> and <code>2 YR</code> versus first principal
component scores.  The plots show a strong relationship between the first
principal component and the two features.  In other words, the first principal
component appears to capture most of the information contained in <code>10 YR</code> and <code>2
YR</code>. 
</p>


<div id="orge325136" class="figure">
<p><img src="figures/fig6_16.png" alt="fig6_16.png" />
</p>
<p><span class="figure-number">Figure 67: </span>Plots of <code>10 YR</code> and <code>2 YR</code> versus first principal component scores. The relationships are strong.</p>
</div>

<p>
Figure <a href="#orgdd891b3">68</a> displays <code>10 YR</code> and <code>2 YR</code> versus second principal
component scores.  The plots show a weak relationship between the second
principal component and the two features. In other words, one only needs the
first principal component to accurately represent <code>10 YR</code> and <code>2 YR</code>. 
</p>


<div id="orgdd891b3" class="figure">
<p><img src="figures/fig6_17.png" alt="fig6_17.png" />
</p>
<p><span class="figure-number">Figure 68: </span>Plots of <code>10 YR</code> and <code>2 YR</code> versus second principal component scores.  The relationships are weak.</p>
</div>
</div>
</div>

<div id="outline-container-org27036ec" class="outline-3">
<h3 id="org27036ec"><span class="section-number-3">6.4</span> Considerations in High Dimensions</h3>
<div class="outline-text-3" id="text-6-4">
<p>
Figure <a href="#orgb04c8c0">69</a> shows <i>p</i> = 1 feature (plus an intercept) in two
cases: when there are 20 observations (left-hand panel), and when there are only
two observations (right-hand panel).  When there are 20 observations, <i>n &gt; p</i>
and the least squares regression line does not perfectly fit the data; instead,
the regression line seeks to approximate the 20 observations as well as
possible.  On the other hand, when there are only two observations, then
regardless of the values of those two observations, the regression line will fit
the data exactly.
</p>


<div id="orgb04c8c0" class="figure">
<p><img src="figures/fig6_22.png" alt="fig6_22.png" />
</p>
<p><span class="figure-number">Figure 69: </span>Left: Least squares regression in the low-dimensional setting.  Right: Least squares regression with <i>n</i> = 2 observations and two parameters to be estimated (an intercept and a coefficient).</p>
</div>

<p>
Figure <a href="#org5ba2c4d">70</a> further illustrates the risk of carelessly applying
least squares when the number of features <i>p</i> is large.  Data were simulated
with <i>n</i> = 20 observations, and regression was performed with between 1 and 20
features, each of which was completely unrelated to the response.  As the number
of features included increases, R<sup>2</sup> increases to 1, and correspondingly training
set MSE decreases to zero.  On the other hand, as the number features included
increases, MSE on an <i>independent test set</i> becomes extremely large. 
</p>


<div id="org5ba2c4d" class="figure">
<p><img src="figures/fig6_23.png" alt="fig6_23.png" />
</p>
<p><span class="figure-number">Figure 70: </span>On a simulated example with <i>n</i> = 20 training observations, features that are completely unrelated to the outcome are added to the model.  Left:  As more features are included, the training R<sup>2</sup> increases to 1.  Center:  As more features are added, the training set MSE decreases to zero.  Right: As more features are included, the test set MSE increases.</p>
</div>

<p>
Figure <a href="#orgeb996a3">71</a> illustrates the performance of the lasso in a simple
simulated example.  There are <i>p</i> = 20, 50, or 2000 features, of which 20 are
truly associated with the outcome.  
</p>


<div id="orgeb996a3" class="figure">
<p><img src="figures/fig6_24.png" alt="fig6_24.png" />
</p>
<p><span class="figure-number">Figure 71: </span>The lasso was performed with <i>n</i> = 100 observations and three values of <i>p</i>, the number of features.  Of the <i>p</i> features, 20 were associated with the response.  The boxplots show the test MSEs that result using four different values of the tuning parameter &lambda;.  For ease of interpretation, rather than reporting &lambda;, the degrees of freedom are reported; for the lasso, this turns out to be simply the number of estimated non-zero coefficients.  When <i>p</i> = 20, the lowest test MSE was obtained with the smallest amount of regularization.  When <i>p</i> = 50, the lowest test MSE was achieved when there was a substantial amount of regularization.  When <i>p</i> = 2000, we see results similar to <i>p</i> = 50, with very slight increase in test MSE with degrees of freedom.</p>
</div>
</div>
</div>

<div id="outline-container-org6b80217" class="outline-3">
<h3 id="org6b80217"><span class="section-number-3">6.5</span> Lab 1: Subset Selection Methods</h3>
<div class="outline-text-3" id="text-6-5">
</div>
<div id="outline-container-org20f73c1" class="outline-4">
<h4 id="org20f73c1"><span class="section-number-4">6.5.1</span> Best Subset Selection</h4>
<div class="outline-text-4" id="text-6-5-1">
<p>
Here we apply the best subset selection approach to the <code>Hitters</code> data.  We wish
to predict a baseball player's <code>Salary</code> on the basis of various statistics
associated with performance in the previous year.
</p>

<p>
We note that the <code>Salary</code> variable is missing for some of the players.  The
<code>dropna()</code> function in <code>pandas</code> module can be used to remove all the rows that
have missing values in any variable. 
</p>

<p>
It is straightforward to consider subsets of different sizes and, given a
criterion, identify best model for each size. Program <code>subsetSelection.py</code>
includes a function for this purpose. Using a given criterion, <code>bestModel</code> uses
exhaustive enumeration to find the best model for a given size. 
</p>

<div class="org-src-container">
<pre class="src src-python"><span style="color: #0000FF;">from</span> statsmodels <span style="color: #0000FF;">import</span> datasets
<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">import pandas as pd</span>
<span style="color: #0000FF;">import</span> numpy <span style="color: #0000FF;">as</span> np
<span style="color: #0000FF;">import</span> sys
sys.path.append(<span style="color: #008000;">'code/chap6/'</span>)
<span style="color: #0000FF;">from</span> subsetSelection <span style="color: #0000FF;">import</span> bestModel, C_p
<span style="color: #0000FF;">from</span> itertools <span style="color: #0000FF;">import</span> combinations
<span style="color: #0000FF;">from</span> operator <span style="color: #0000FF;">import</span> attrgetter
<span style="color: #0000FF;">import</span> matplotlib.pyplot <span style="color: #0000FF;">as</span> plt

<span style="color: #BA36A5;">hitters</span> = datasets.get_rdataset(<span style="color: #008000;">'Hitters'</span>, <span style="color: #008000;">'ISLR'</span>).data
<span style="color: #0000FF;">print</span>(hitters.columns)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'------'</span>)
<span style="color: #0000FF;">print</span>(hitters.shape)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'------'</span>)
<span style="color: #0000FF;">print</span>(np.<span style="color: #006FE0;">sum</span>(hitters[<span style="color: #008000;">'Salary'</span>].isna()))
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'------'</span>)
hitters.dropna(inplace=<span style="color: #D0372D;">True</span>)
<span style="color: #0000FF;">print</span>(hitters.shape)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'------'</span>)

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Prepare inputs for allSubsets function</span>
<span style="color: #BA36A5;">y_var</span> = <span style="color: #008000;">'Salary'</span>
<span style="color: #BA36A5;">x_vars</span> = <span style="color: #006FE0;">list</span>(hitters.columns)
x_vars.remove(y_var)

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Select best model for a given subset size</span>
<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Use smallest RSS as the criterion to find best model</span>
<span style="color: #BA36A5;">best_models</span> = {}
<span style="color: #0000FF;">for</span> p <span style="color: #0000FF;">in</span> <span style="color: #006FE0;">range</span>(1, 7):
    <span style="color: #BA36A5;">best_models</span>[p] = bestModel(y_var, x_vars, hitters, subset_size=p,
                               metric=<span style="color: #008000;">'ssr'</span>, metric_max=<span style="color: #D0372D;">False</span>)

<span style="color: #0000FF;">for</span> p <span style="color: #0000FF;">in</span> best_models.keys():
    <span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Number of variables: '</span> + <span style="color: #006FE0;">str</span>(p))
    <span style="color: #BA36A5;">best_vars</span> = <span style="color: #006FE0;">list</span>(best_models[p][<span style="color: #008000;">'model_vars'</span>])
    best_vars.sort()
    <span style="color: #0000FF;">print</span>(best_vars)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'------'</span>)

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Print r-squared and adjusted r-squared</span>
<span style="color: #0000FF;">for</span> p <span style="color: #0000FF;">in</span> best_models.keys():
    <span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Num vars: '</span> + <span style="color: #006FE0;">str</span>(p) + <span style="color: #008000;">', R-squared: '</span> +
          <span style="color: #006FE0;">str</span>(<span style="color: #006FE0;">round</span>(best_models[p][<span style="color: #008000;">'model'</span>].rsquared, 3)) +
          <span style="color: #008000;">', adjusted R-squared: '</span> +
          <span style="color: #006FE0;">str</span>(<span style="color: #006FE0;">round</span>(best_models[p][<span style="color: #008000;">'model'</span>].rsquared_adj, 3)))

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Plot R-squared, adjusted R-squared, Cp, and BIC versus number of variables</span>
<span style="color: #BA36A5;">fig</span> = plt.figure(figsize=(8, 8))
<span style="color: #BA36A5;">ax1</span> = fig.add_subplot(221)
<span style="color: #BA36A5;">rsq</span> = [best_models[k][<span style="color: #008000;">'model'</span>].rsquared <span style="color: #0000FF;">for</span> k <span style="color: #0000FF;">in</span> best_models.keys()]
ax1.plot(best_models.keys(), rsq)
ax1.set_ylabel(r<span style="color: #008000;">'$R^2$'</span>)

<span style="color: #BA36A5;">ax2</span> = fig.add_subplot(222)
<span style="color: #BA36A5;">adj_rsq</span> = [best_models[k][<span style="color: #008000;">'model'</span>].rsquared_adj <span style="color: #0000FF;">for</span> k <span style="color: #0000FF;">in</span> best_models.keys()]
ax2.plot(best_models.keys(), adj_rsq)
ax2.set_ylabel(r<span style="color: #008000;">'Adjusted $R^2$'</span>)

<span style="color: #BA36A5;">ax3</span> = fig.add_subplot(223)
<span style="color: #BA36A5;">Cp</span> = [C_p(best_models[k][<span style="color: #008000;">'model'</span>]) <span style="color: #0000FF;">for</span> k <span style="color: #0000FF;">in</span> best_models.keys()]
ax3.plot(best_models.keys(), Cp)
ax3.set_ylabel(r<span style="color: #008000;">'$C_p$'</span>)

<span style="color: #BA36A5;">ax4</span> = fig.add_subplot(224)
<span style="color: #BA36A5;">bic</span> = [best_models[k][<span style="color: #008000;">'model'</span>].bic <span style="color: #0000FF;">for</span> k <span style="color: #0000FF;">in</span> best_models.keys()]
ax4.plot(best_models.keys(), bic)
ax4.set_ylabel(<span style="color: #008000;">'BIC'</span>)

<span style="color: #0000FF;">for</span> ax <span style="color: #0000FF;">in</span> fig.axes:
    ax.set_xlabel(<span style="color: #008000;">'Number of variables'</span>)
fig.tight_layout()
</pre>
</div>

<pre class="example">
Index(['AtBat', 'Hits', 'HmRun', 'Runs', 'RBI', 'Walks', 'Years', 'CAtBat',
       'CHits', 'CHmRun', 'CRuns', 'CRBI', 'CWalks', 'League', 'Division',
       'PutOuts', 'Assists', 'Errors', 'Salary', 'NewLeague'],
      dtype='object')
------
(322, 20)
------
59
------
(263, 20)
------
Number of variables: 1
['CRBI']
Number of variables: 2
['CRBI', 'Hits']
Number of variables: 3
['CRBI', 'Hits', 'PutOuts']
Number of variables: 4
['CRBI', 'Division', 'Hits', 'PutOuts']
Number of variables: 5
['AtBat', 'CRBI', 'Division', 'Hits', 'PutOuts']
Number of variables: 6
['AtBat', 'CRBI', 'Division', 'Hits', 'PutOuts', 'Walks']
------
Num vars: 1, R-squared: 0.321, adjusted R-squared: 0.319
Num vars: 2, R-squared: 0.425, adjusted R-squared: 0.421
Num vars: 3, R-squared: 0.451, adjusted R-squared: 0.445
Num vars: 4, R-squared: 0.475, adjusted R-squared: 0.467
Num vars: 5, R-squared: 0.491, adjusted R-squared: 0.481
Num vars: 6, R-squared: 0.509, adjusted R-squared: 0.497
</pre>

<p>
The above results are output of program <code>subsetSelection.py</code>, which uses
<code>statsmodels</code>.  An advantage of using <code>statsmodels</code> is that a number of metrics
(e.g., RSS, R-squared, adjusted R-squared, BIC, AIC, etc.) are built-in.
Therefore, any of these can be used as the criterion to select the ``best''
model.  But, when number of variables is large, <code>statsmodels</code> can be slow.  The
program <code>subsetSelectionSklearn.py</code> implements best subset selection using
<code>sklearn</code>, which is faster than <code>statsmodels</code>.  A disadvantage of using
<code>sklearn</code> is that most of the metrics are not implemented.  These can be
implemented in a straightforward fashion. 
</p>

<p>
In the next code block, we use <code>subsetSelectionSklearn.py</code> to find best models
using exhaustive enumeration.  As before, for any given subset size, best model
is defined as the model which minimizes RSS. 
</p>

<div class="org-src-container">
<pre class="src src-python"><span style="color: #0000FF;">from</span> statsmodels <span style="color: #0000FF;">import</span> datasets
<span style="color: #0000FF;">import</span> pandas <span style="color: #0000FF;">as</span> pd
<span style="color: #0000FF;">import</span> sys
sys.path.append(<span style="color: #008000;">'code/chap6/'</span>)
<span style="color: #0000FF;">from</span> subsetSelectionSklearn <span style="color: #0000FF;">import</span> getVarLookup, bestSubset, testStats

<span style="color: #BA36A5;">hitters</span> = datasets.get_rdataset(<span style="color: #008000;">'Hitters'</span>, <span style="color: #008000;">'ISLR'</span>).data
<span style="color: #BA36A5;">hitters</span> = hitters.dropna()

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Prepare inputs for sklearn LinearRegression()</span>
<span style="color: #BA36A5;">y_var</span> = <span style="color: #008000;">'Salary'</span>
<span style="color: #BA36A5;">var_categoric</span> = [<span style="color: #008000;">'League'</span>, <span style="color: #008000;">'Division'</span>, <span style="color: #008000;">'NewLeague'</span>]

<span style="color: #BA36A5;">var_numeric</span> = <span style="color: #006FE0;">list</span>(hitters.columns)
var_numeric.remove(y_var)
<span style="color: #0000FF;">for</span> name <span style="color: #0000FF;">in</span> var_categoric:
    var_numeric.remove(name)
<span style="color: #BA36A5;">X_numeric</span> = hitters[var_numeric]

<span style="color: #BA36A5;">X_categoric</span> = hitters[var_categoric]
<span style="color: #BA36A5;">X_categoric_dummies</span> = pd.get_dummies(X_categoric)
<span style="color: #BA36A5;">var_categoric_dummies</span> = <span style="color: #006FE0;">list</span>(X_categoric_dummies.columns)

<span style="color: #BA36A5;">X</span> = pd.concat((X_numeric, X_categoric_dummies), axis=1)
<span style="color: #BA36A5;">y</span> = hitters[y_var]

<span style="color: #BA36A5;">x_var_names</span> = <span style="color: #006FE0;">list</span>(hitters.columns)
x_var_names.remove(y_var)

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Select best model for a given subset size</span>
<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Best model is defined as model with the lowest RSS</span>
<span style="color: #BA36A5;">best_models</span> = {}
<span style="color: #BA36A5;">best_model_stats</span> = {}
<span style="color: #0000FF;">for</span> p <span style="color: #0000FF;">in</span> <span style="color: #006FE0;">range</span>(1, 19):
    <span style="color: #BA36A5;">best_models</span>[p] = bestSubset(y, var_numeric, var_categoric,
                                var_categoric_dummies, X, subset_size=p)
    <span style="color: #BA36A5;">best_model_stats</span>[p] = testStats(X, y, best_models[p])
<span style="color: #0000FF;">for</span> p <span style="color: #0000FF;">in</span> best_models.keys():
    <span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Subset size: '</span> + <span style="color: #006FE0;">str</span>(p) + <span style="color: #008000;">', '</span> +
          best_models[p][<span style="color: #008000;">'metric_name'</span>] + <span style="color: #008000;">': '</span> + 
          <span style="color: #006FE0;">str</span>(<span style="color: #006FE0;">round</span>(best_models[p][<span style="color: #008000;">'metric'</span>])))
    <span style="color: #0000FF;">print</span>(best_models[p][<span style="color: #008000;">'x_var_names'</span>])
    <span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Adjusted R-squared: '</span> +
          <span style="color: #006FE0;">str</span>(<span style="color: #006FE0;">round</span>(best_model_stats[p][<span style="color: #008000;">'adj_rsq'</span>], 3)) +
          <span style="color: #008000;">', Cp: '</span> + <span style="color: #006FE0;">str</span>(<span style="color: #006FE0;">round</span>(best_model_stats[p][<span style="color: #008000;">'C_p'</span>])) +
          <span style="color: #008000;">', AIC: '</span> + <span style="color: #006FE0;">str</span>(<span style="color: #006FE0;">round</span>(best_model_stats[p][<span style="color: #008000;">'AIC'</span>])) +
          <span style="color: #008000;">', BIC: '</span> + <span style="color: #006FE0;">str</span>(<span style="color: #006FE0;">round</span>(best_model_stats[p][<span style="color: #008000;">'BIC'</span>])))

</pre>
</div>

<pre class="example">
Subset size: 1, RSS: 36179679.0
('CRBI',)
Adjusted R-squared: 0.319, Cp: 138323.0, AIC: 3862.0, BIC: 3869.0
Subset size: 2, RSS: 30646560.0
('Hits', 'CRBI')
Adjusted R-squared: 0.421, Cp: 118042.0, AIC: 3820.0, BIC: 3831.0
Subset size: 3, RSS: 29249297.0
('Hits', 'CRBI', 'PutOuts')
Adjusted R-squared: 0.445, Cp: 113486.0, AIC: 3810.0, BIC: 3825.0
Subset size: 4, RSS: 27970852.0
('Hits', 'CRBI', 'PutOuts', 'Division')
Adjusted R-squared: 0.467, Cp: 109382.0, AIC: 3800.0, BIC: 3818.0
Subset size: 5, RSS: 27149899.0
('AtBat', 'Hits', 'CRBI', 'PutOuts', 'Division')
Adjusted R-squared: 0.481, Cp: 107018.0, AIC: 3795.0, BIC: 3816.0
Subset size: 6, RSS: 26194904.0
('AtBat', 'Hits', 'Walks', 'CRBI', 'PutOuts', 'Division')
Adjusted R-squared: 0.497, Cp: 104144.0, AIC: 3787.0, BIC: 3812.0
Subset size: 7, RSS: 25906548.0
('Hits', 'Walks', 'CAtBat', 'CHits', 'CHmRun', 'PutOuts', 'Division')
Adjusted R-squared: 0.501, Cp: 103805.0, AIC: 3786.0, BIC: 3815.0
Subset size: 8, RSS: 25136930.0
('AtBat', 'Hits', 'Walks', 'CHmRun', 'CRuns', 'CWalks', 'PutOuts', 'Division')
Adjusted R-squared: 0.514, Cp: 101636.0, AIC: 3780.0, BIC: 3813.0
Subset size: 9, RSS: 24814051.0
('AtBat', 'Hits', 'Walks', 'CAtBat', 'CRuns', 'CRBI', 'CWalks', 'PutOuts', 
'Division')
Adjusted R-squared: 0.518, Cp: 101166.0, AIC: 3779.0, BIC: 3815.0
Subset size: 10, RSS: 24500402.0
('AtBat', 'Hits', 'Walks', 'CAtBat', 'CRuns', 'CRBI', 'CWalks', 'PutOuts', 
'Assists', 'Division')
Adjusted R-squared: 0.522, Cp: 100731.0, AIC: 3778.0, BIC: 3817.0
Subset size: 11, RSS: 24387345.0
('AtBat', 'Hits', 'Walks', 'CAtBat', 'CRuns', 'CRBI', 'CWalks', 'PutOuts', 
'Assists', 'League', 'Division')
Adjusted R-squared: 0.523, Cp: 101058.0, AIC: 3778.0, BIC: 3821.0
Subset size: 12, RSS: 24333232.0
('AtBat', 'Hits', 'Runs', 'Walks', 'CAtBat', 'CRuns', 'CRBI', 'CWalks', 
'PutOuts', 'Assists', 'League', 'Division')
Adjusted R-squared: 0.522, Cp: 101610.0, AIC: 3780.0, BIC: 3826.0
Subset size: 13, RSS: 24289148.0
('AtBat', 'Hits', 'Runs', 'Walks', 'CAtBat', 'CRuns', 'CRBI', 'CWalks', 
'PutOuts', 'Assists', 'Errors', 'League', 'Division')
Adjusted R-squared: 0.521, Cp: 102200.0, AIC: 3781.0, BIC: 3831.0
Subset size: 14, RSS: 24248660.0
('AtBat', 'Hits', 'HmRun', 'Runs', 'Walks', 'CAtBat', 'CRuns', 'CRBI', 
'CWalks', 'PutOuts', 'Assists', 'Errors', 'League', 'Division')
Adjusted R-squared: 0.52, Cp: 102803.0, AIC: 3783.0, BIC: 3836.0
Subset size: 15, RSS: 24235177.0
('AtBat', 'Hits', 'HmRun', 'Runs', 'Walks', 'CAtBat', 'CHits', 'CRuns', 'CRBI', 
'CWalks', 'PutOuts', 'Assists', 'Errors', 'League', 'Division')
Adjusted R-squared: 0.518, Cp: 103509.0, AIC: 3785.0, BIC: 3842.0
Subset size: 16, RSS: 24219377.0
('AtBat', 'Hits', 'HmRun', 'Runs', 'RBI', 'Walks', 'CAtBat', 'CHits', 'CRuns', 
'CRBI', 'CWalks', 'PutOuts', 'Assists', 'Errors', 'League', 'Division')
Adjusted R-squared: 0.516, Cp: 104206.0, AIC: 3787.0, BIC: 3847.0
Subset size: 17, RSS: 24209447.0
('AtBat', 'Hits', 'HmRun', 'Runs', 'RBI', 'Walks', 'CAtBat', 'CHits', 'CRuns', 
'CRBI', 'CWalks', 'PutOuts', 'Assists', 'Errors', 'League', 'Division', 'NewLeague')
Adjusted R-squared: 0.514, Cp: 104926.0, AIC: 3788.0, BIC: 3853.0
Subset size: 18, RSS: 24201837.0
('AtBat', 'Hits', 'HmRun', 'Runs', 'RBI', 'Walks', 'Years', 'CAtBat', 'CHits', 
'CRuns', 'CRBI', 'CWalks', 'PutOuts', 'Assists', 'Errors', 'League', 'Division', 'NewLeague')
Adjusted R-squared: 0.513, Cp: 105654.0, AIC: 3790.0, BIC: 3858.0
</pre>

<p>
Using BIC, the best subset has six variables.  Using AIC or C<sub>p</sub>, the best subset has 10
variables.  Finally, using adjusted R-squared, the best subset has 11 variables.
</p>
</div>
</div>

<div id="outline-container-orgba9499e" class="outline-4">
<h4 id="orgba9499e"><span class="section-number-4">6.5.2</span> Forward and Backward Stepwise Selection</h4>
<div class="outline-text-4" id="text-6-5-2">
<p>
Forward and backward stepwise selection are implemented in functions
<code>forwardStepSelect</code> and <code>backwardStepSelect</code>. Given a data set and a metric,
these functions add or eliminate a variable at every step. 
</p>

<div class="org-src-container">
<pre class="src src-python"><span style="color: #0000FF;">from</span> statsmodels <span style="color: #0000FF;">import</span> datasets
<span style="color: #0000FF;">import</span> pandas <span style="color: #0000FF;">as</span> pd
<span style="color: #0000FF;">import</span> sys
sys.path.append(<span style="color: #008000;">'code/chap6/'</span>)
<span style="color: #0000FF;">from</span> subsetSelection <span style="color: #0000FF;">import</span> forwardStepSelect, backwardStepSelect
<span style="color: #0000FF;">from</span> subsetSelectionSklearn <span style="color: #0000FF;">import</span> getVarLookup, bestSubset

<span style="color: #BA36A5;">hitters</span> = datasets.get_rdataset(<span style="color: #008000;">'Hitters'</span>, <span style="color: #008000;">'ISLR'</span>).data
<span style="color: #BA36A5;">hitters</span> = hitters.dropna()

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Prepare inputs for subsetSelectionSklearn</span>
<span style="color: #BA36A5;">y_var</span> = <span style="color: #008000;">'Salary'</span>
<span style="color: #BA36A5;">var_categoric</span> = [<span style="color: #008000;">'League'</span>, <span style="color: #008000;">'Division'</span>, <span style="color: #008000;">'NewLeague'</span>]

<span style="color: #BA36A5;">var_numeric</span> = <span style="color: #006FE0;">list</span>(hitters.columns)
var_numeric.remove(y_var)
<span style="color: #0000FF;">for</span> name <span style="color: #0000FF;">in</span> var_categoric:
    var_numeric.remove(name)
<span style="color: #BA36A5;">X_numeric</span> = hitters[var_numeric]

<span style="color: #BA36A5;">X_categoric</span> = hitters[var_categoric]
<span style="color: #BA36A5;">X_categoric_dummies</span> = pd.get_dummies(X_categoric)
<span style="color: #BA36A5;">var_categoric_dummies</span> = <span style="color: #006FE0;">list</span>(X_categoric_dummies.columns)

<span style="color: #BA36A5;">X</span> = pd.concat((X_numeric, X_categoric_dummies), axis=1)
<span style="color: #BA36A5;">y</span> = hitters[y_var]

<span style="color: #BA36A5;">x_var_names</span> = <span style="color: #006FE0;">list</span>(hitters.columns)
x_var_names.remove(y_var)

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Select best model for a given subset size</span>
<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Best model is defined as model with the lowest RSS</span>
<span style="color: #BA36A5;">best_model7</span> = bestSubset(y, var_numeric, var_categoric,
                         var_categoric_dummies, X, subset_size=7)

<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Best models for subset size 7'</span>)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'------'</span>)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Best model from exhaustive enumeration'</span>)
<span style="color: #BA36A5;">best_model7_res</span> = pd.DataFrame({<span style="color: #008000;">'variable'</span>: [<span style="color: #008000;">'intercept'</span>],
                                <span style="color: #008000;">'coef'</span>: [best_model7[<span style="color: #008000;">'model'</span>].intercept_]})
<span style="color: #BA36A5;">best_model7_res</span> = pd.concat((best_model7_res, pd.DataFrame(
    {<span style="color: #008000;">'variable'</span>: best_model7[<span style="color: #008000;">'var_numeric_dummies'</span>],
     <span style="color: #008000;">'coef'</span>: best_model7[<span style="color: #008000;">'model'</span>].coef_})), axis=0, ignore_index=<span style="color: #D0372D;">True</span>)
<span style="color: #0000FF;">print</span>(best_model7_res)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'------'</span>)

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Prepare inputs for forward step or backward step</span>
<span style="color: #BA36A5;">y_var</span> = <span style="color: #008000;">'Salary'</span>
<span style="color: #BA36A5;">x_vars</span> = <span style="color: #006FE0;">list</span>(hitters.columns)
x_vars.remove(y_var)

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Forward step select</span>
<span style="color: #BA36A5;">fwd_best_models</span> = forwardStepSelect(y_var, x_vars, hitters, <span style="color: #008000;">'ssr'</span>,
                                    metric_max=<span style="color: #D0372D;">False</span>)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Best model using forward step select'</span>)
<span style="color: #0000FF;">print</span>(fwd_best_models[7][<span style="color: #008000;">'model'</span>].params)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'------'</span>)

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Backward step select</span>
<span style="color: #BA36A5;">bkwd_best_models</span> = backwardStepSelect(y_var, x_vars, hitters, <span style="color: #008000;">'ssr'</span>,
                                      metric_max=<span style="color: #D0372D;">False</span>)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Best model using backward step select'</span>)
<span style="color: #0000FF;">print</span>(bkwd_best_models[7][<span style="color: #008000;">'model'</span>].params)
</pre>
</div>

<pre class="example">
Best models for subset size 7
------
Best model from exhaustive enumeration
     variable       coef
0   intercept  14.457626
1        Hits   1.283351
2       Walks   3.227426
3      CAtBat  -0.375235
4       CHits   1.495707
5      CHmRun   1.442054
6     PutOuts   0.236681
7  Division_E  64.993322
8  Division_W -64.993322
------
Best model using forward step select
Intercept        109.787306
Division[T.W]   -127.122393
CRBI               0.853762
Hits               7.449877
PutOuts            0.253340
AtBat             -1.958885
Walks              4.913140
CWalks            -0.305307
dtype: float64
------
Best model using backward step select
Intercept        105.648749
Division[T.W]   -116.169217
AtBat             -1.976284
Hits               6.757491
Walks              6.055869
CRuns              1.129309
CWalks            -0.716335
PutOuts            0.302885
dtype: float64
</pre>
</div>
</div>

<div id="outline-container-orga5f0ec4" class="outline-4">
<h4 id="orga5f0ec4"><span class="section-number-4">6.5.3</span> Choosing Among Models Using the Validation Set Approach and Cross-Validation</h4>
<div class="outline-text-4" id="text-6-5-3">
<p>
We now split the <code>Hitters</code> data set into two groups: training and test.  We use
training group to estimate regression coefficients. Then we use these coefficients
to estimate RSS on test group.  By repeating this procedure on subsets of all
sizes, we find the optimal subset size (which results in the smallest RSS).  For
this subset size, we find the best model using the <i>entire</i> data set. Note that
the last step has already been done in the previous section.
</p>

<div class="org-src-container">
<pre class="src src-python"><span style="color: #0000FF;">from</span> statsmodels <span style="color: #0000FF;">import</span> datasets
<span style="color: #0000FF;">import</span> numpy <span style="color: #0000FF;">as</span> np
<span style="color: #0000FF;">import</span> pandas <span style="color: #0000FF;">as</span> pd
<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">from sklearn import LinearRegression</span>
<span style="color: #0000FF;">import</span> sys
sys.path.append(<span style="color: #008000;">'code/chap6/'</span>)
<span style="color: #0000FF;">from</span> subsetSelectionSklearn <span style="color: #0000FF;">import</span> bestSubsetTest, getVarLookup, bestSubset

<span style="color: #BA36A5;">hitters</span> = datasets.get_rdataset(<span style="color: #008000;">'Hitters'</span>, <span style="color: #008000;">'ISLR'</span>).data
<span style="color: #BA36A5;">hitters</span> = hitters.dropna()

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Create indexes to split data between training and test groups</span>
np.random.seed(911)
<span style="color: #BA36A5;">train_ind</span> = np.random.choice([<span style="color: #D0372D;">True</span>, <span style="color: #D0372D;">False</span>], hitters.shape[0])
<span style="color: #BA36A5;">test_ind</span> = (train_ind == <span style="color: #D0372D;">False</span>)

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Prepare inputs for LinearRegression</span>
<span style="color: #BA36A5;">y_var</span> = <span style="color: #008000;">'Salary'</span>
<span style="color: #BA36A5;">var_categoric</span> = [<span style="color: #008000;">'League'</span>, <span style="color: #008000;">'Division'</span>, <span style="color: #008000;">'NewLeague'</span>]

<span style="color: #BA36A5;">var_numeric</span> = <span style="color: #006FE0;">list</span>(hitters.columns)
var_numeric.remove(y_var)
<span style="color: #0000FF;">for</span> name <span style="color: #0000FF;">in</span> var_categoric:
    var_numeric.remove(name)
<span style="color: #BA36A5;">X_numeric</span> = hitters[var_numeric]

<span style="color: #BA36A5;">X_categoric</span> = hitters[var_categoric]
<span style="color: #BA36A5;">X_categoric_dummies</span> = pd.get_dummies(X_categoric)
<span style="color: #BA36A5;">var_categoric_dummies</span> = <span style="color: #006FE0;">list</span>(X_categoric_dummies.columns)

<span style="color: #BA36A5;">X</span> = pd.concat((X_numeric, X_categoric_dummies), axis=1)
<span style="color: #BA36A5;">y</span> = hitters[y_var]

<span style="color: #BA36A5;">x_var_names</span> = <span style="color: #006FE0;">list</span>(hitters.columns)
x_var_names.remove(y_var)

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Select best model for given subset size</span>
<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Estimate coefficients using training data</span>
<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Best model minimizes test RSS</span>
<span style="color: #BA36A5;">best_models</span> = {}
<span style="color: #0000FF;">for</span> p <span style="color: #0000FF;">in</span> <span style="color: #006FE0;">range</span>(1, 19):
    <span style="color: #BA36A5;">best_models</span>[p] = bestSubsetTest(y, var_numeric, var_categoric,
                                    var_categoric_dummies, X, train_ind,
                                    test_ind, subset_size=p)

<span style="color: #BA36A5;">RSS</span> = [best_models[k][<span style="color: #008000;">'metric'</span>] <span style="color: #0000FF;">for</span> k <span style="color: #0000FF;">in</span> best_models.keys()]
<span style="color: #BA36A5;">best_ind</span> = np.argmin(RSS)
<span style="color: #BA36A5;">best_subset_size</span> = <span style="color: #006FE0;">list</span>(best_models.keys())[best_ind]

<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Best model from cross-validation'</span>)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'subset size: '</span> + <span style="color: #006FE0;">str</span>(best_subset_size) + <span style="color: #008000;">', RSS: '</span> +
      <span style="color: #006FE0;">str</span>(<span style="color: #006FE0;">round</span>(best_models[best_ind][<span style="color: #008000;">'metric'</span>], 0)))
<span style="color: #BA36A5;">coef_df</span> = pd.DataFrame({<span style="color: #008000;">'variable'</span>: [<span style="color: #008000;">'intercept'</span>], <span style="color: #008000;">'coefficient'</span>:
                        best_models[best_ind][<span style="color: #008000;">'model'</span>].intercept_})
<span style="color: #BA36A5;">coef_df</span> = pd.concat((coef_df, pd.DataFrame(
    {<span style="color: #008000;">'variable'</span>: best_models[best_ind][<span style="color: #008000;">'var_numeric_dummies'</span>],
     <span style="color: #008000;">'coefficient'</span>: best_models[best_ind][<span style="color: #008000;">'model'</span>].coef_})),
                    axis=0, ignore_index=<span style="color: #D0372D;">True</span>)

<span style="color: #0000FF;">print</span>(coef_df)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'------'</span>)

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Use full dataset to reestimate model for best subset size</span>
<span style="color: #BA36A5;">best_model_alldata</span> = bestSubset(y, var_numeric, var_categoric,
                                var_categoric_dummies, X,
                                best_subset_size)
<span style="color: #BA36A5;">coef_alldata</span> = pd.DataFrame({<span style="color: #008000;">'variable'</span>: [<span style="color: #008000;">'intercept'</span>],
                             <span style="color: #008000;">'coefficient'</span>:
                             best_model_alldata[<span style="color: #008000;">'model'</span>].intercept_})
<span style="color: #BA36A5;">coef_alldata</span> = pd.concat(
    (coef_alldata, pd.DataFrame(
        {<span style="color: #008000;">'variable'</span>: best_model_alldata[<span style="color: #008000;">'var_numeric_dummies'</span>],
         <span style="color: #008000;">'coefficient'</span>: best_model_alldata[<span style="color: #008000;">'model'</span>].coef_})), axis=0,
    ignore_index=<span style="color: #D0372D;">True</span>)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Best subset size from cross validation'</span>)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Best model coefficients reestimated using all data'</span>)
<span style="color: #0000FF;">print</span>(coef_alldata)
</pre>
</div>

<pre class="example">
Best model from cross-validation
subset size: 10, RSS: 15473294.0
       variable  coefficient
0     intercept    80.226817
1         AtBat    -1.961614
2          Hits     7.665752
3         Walks     4.392214
4        CHmRun     0.780084
5         CRuns     0.655072
6        CWalks    -0.334012
7        Errors     0.739688
8    Division_E    60.649902
9    Division_W   -60.649902
10  NewLeague_A   -18.555935
11  NewLeague_N    18.555935
------
Best subset size from cross validation
Best model coefficients reestimated using all data
      variable  coefficient
0    intercept   106.345413
1        AtBat    -2.168650
2         Hits     6.918017
3        Walks     5.773225
4       CAtBat    -0.130080
5        CRuns     1.408249
6         CRBI     0.774312
7       CWalks    -0.830826
8      PutOuts     0.297373
9      Assists     0.283168
10  Division_E    56.190029
11  Division_W   -56.190029
</pre>

<p>
We see that the best ten-variable model on the full data set has a different set
of variables than the best ten-variable model on the training set.  Moreover,
the best ten-variable model on the training set is different from the result in
the book.  In fact, it can change with the choice of seed used to partition the
data set between training and test groups. 
</p>

<p>
We now use 10-fold cross-validation to find best model (which minimizes test
error) for each subset size.  To speed up calculations, we use <code>multiprocessing</code>
package. 
</p>

<div class="org-src-container">
<pre class="src src-python"><span style="color: #0000FF;">from</span> statsmodels <span style="color: #0000FF;">import</span> datasets
<span style="color: #0000FF;">import</span> numpy <span style="color: #0000FF;">as</span> np
<span style="color: #0000FF;">import</span> pandas <span style="color: #0000FF;">as</span> pd
<span style="color: #0000FF;">import</span> sys
sys.path.append(<span style="color: #008000;">'code/chap6/'</span>)
<span style="color: #0000FF;">from</span> subsetSelectionSklearn <span style="color: #0000FF;">import</span> bestSubsetCrossVal, bestSubset, getVarLookup
<span style="color: #0000FF;">from</span> multiprocessing <span style="color: #0000FF;">import</span> Pool

<span style="color: #BA36A5;">hitters</span> = datasets.get_rdataset(<span style="color: #008000;">'Hitters'</span>, <span style="color: #008000;">'ISLR'</span>).data
hitters.dropna(inplace=<span style="color: #D0372D;">True</span>)

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Prepare inputs for LinearRegression</span>
<span style="color: #BA36A5;">y_var</span> = <span style="color: #008000;">'Salary'</span>
<span style="color: #BA36A5;">var_categoric</span> = [<span style="color: #008000;">'League'</span>, <span style="color: #008000;">'Division'</span>, <span style="color: #008000;">'NewLeague'</span>]

<span style="color: #BA36A5;">var_numeric</span> = <span style="color: #006FE0;">list</span>(hitters.columns)
var_numeric.remove(y_var)
<span style="color: #0000FF;">for</span> name <span style="color: #0000FF;">in</span> var_categoric:
    var_numeric.remove(name)
<span style="color: #BA36A5;">X_numeric</span> = hitters[var_numeric]

<span style="color: #BA36A5;">X_categoric</span> = hitters[var_categoric]
<span style="color: #BA36A5;">X_categoric_dummies</span> = pd.get_dummies(X_categoric)
<span style="color: #BA36A5;">var_categoric_dummies</span> = <span style="color: #006FE0;">list</span>(X_categoric_dummies.columns)

<span style="color: #BA36A5;">X</span> = pd.concat((X_numeric, X_categoric_dummies), axis=1)
<span style="color: #BA36A5;">y</span> = hitters[y_var]

<span style="color: #BA36A5;">x_var_names</span> = <span style="color: #006FE0;">list</span>(hitters.columns)
x_var_names.remove(y_var)


<span style="color: #0000FF;">def</span> <span style="color: #006699;">bestSubsetMP</span>(s):
    <span style="color: #0000FF;">return</span> bestSubsetCrossVal(y, var_numeric, var_categoric,
                              var_categoric_dummies, X, subset_size=s)

<span style="color: #BA36A5;">size_list</span> = np.arange(1, 19)
<span style="color: #0000FF;">with</span> Pool() <span style="color: #0000FF;">as</span> p:
    <span style="color: #BA36A5;">best_model_list</span> = p.<span style="color: #006FE0;">map</span>(bestSubsetMP, size_list)

<span style="color: #BA36A5;">mse</span> = [<span style="color: #006FE0;">round</span>(model[<span style="color: #008000;">'metric'</span>]) <span style="color: #0000FF;">for</span> model <span style="color: #0000FF;">in</span> best_model_list]
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Cross validation MSE'</span>)
<span style="color: #0000FF;">print</span>(mse)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'------'</span>)

<span style="color: #BA36A5;">best_ind</span> = np.argmin(mse)
<span style="color: #BA36A5;">best_size</span> = size_list[best_ind]
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Best subset size: '</span> + <span style="color: #006FE0;">str</span>(best_size))
<span style="color: #BA36A5;">best_model_cv</span> = bestSubset(y, var_numeric, var_categoric,
                           var_categoric_dummies, X,
                           subset_size=best_size)

<span style="color: #BA36A5;">coef_df</span> = pd.DataFrame({<span style="color: #008000;">'variable'</span>: [<span style="color: #008000;">'intercept'</span>],
                        <span style="color: #008000;">'coefficient'</span>: best_model_cv[<span style="color: #008000;">'model'</span>].intercept_})
<span style="color: #BA36A5;">coef_df</span> = pd.concat(
    (coef_df, pd.DataFrame({<span style="color: #008000;">'variable'</span>: best_model_cv[<span style="color: #008000;">'var_numeric_dummies'</span>],
                            <span style="color: #008000;">'coefficient'</span>: best_model_cv[<span style="color: #008000;">'model'</span>].coef_})),
    axis=0, ignore_index=<span style="color: #D0372D;">True</span>)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Best model coefficients'</span>)
<span style="color: #0000FF;">print</span>(coef_df)
</pre>
</div>

<pre class="example">
Cross validation MSE
[139557.0, 119654.0, 115384.0, 111173.0, 108374.0, 104886.0, 104685.0, 102869.0,
 101997.0, 101760.0, 101805.0, 102509.0, 103381.0, 104389.0, 105478.0, 107019.0,
 108827.0, 110783.0]
------
Best subset size: 10
Best model coefficients
      variable  coefficient
0    intercept   106.345413
1        AtBat    -2.168650
2         Hits     6.918017
3        Walks     5.773225
4       CAtBat    -0.130080
5        CRuns     1.408249
6         CRBI     0.774312
7       CWalks    -0.830826
8      PutOuts     0.297373
9      Assists     0.283168
10  Division_E    56.190029
11  Division_W   -56.190029
</pre>

<p>
In the reported results, cross-validation selects a 10-variable model.
Depending upon the choice of seed, a 9-, 10- or 11-variable model may be
selected.  
</p>
</div>
</div>
</div>

<div id="outline-container-orga1868ca" class="outline-3">
<h3 id="orga1868ca"><span class="section-number-3">6.6</span> Lab 2: Ridge Regression and the Lasso</h3>
<div class="outline-text-3" id="text-6-6">
<p>
From <code>sklearn</code> library, we will use <code>Ridge</code> and <code>Lasso</code> functions to perform
ridge regression and the lasso.  
</p>
</div>

<div id="outline-container-orgef6a96f" class="outline-4">
<h4 id="orgef6a96f"><span class="section-number-4">6.6.1</span> Ridge Regression</h4>
<div class="outline-text-4" id="text-6-6-1">
<p>
In the <code>Ridge()</code> function of <code>sklearn</code> library, <code>alpha</code> input is similar to
&lambda; in the book.  The <code>Ridge()</code> function minimizes \(\|y - Xw\|_2^2 + \alpha
\|w\|^2_2\).  On the other hand, for ridge regression, <code>glmnet</code> used in
the book minimizes \(\frac{1}{N} \|y - X\beta\|_2^2 +
\frac{\lambda}{2}\|\beta\|_2^2\). Therefore, to obtain comparable results similar
to the book, we need to use in \(\alpha_{Ridge} = \lambda_{book} N / 2\).  
</p>

<p>
We see that the \(\ell_2\) norms of ridge coefficients are different from
those reported in the book. But, consistent with the book, as penalty decreases,
\(\ell_2\) norm of coefficients increases. 
</p>

<p>
Fitting a ridge regression model with &lambda; = 4 leads to a much lower test MSE
than than fitting a model with just an intercept.  However, unlike the book,
test MSE from ordinary least squares regression is lower than test MSE from
ridge regression.  These results change with the choice of seed. 
</p>

<div class="org-src-container">
<pre class="src src-python"><span style="color: #0000FF;">from</span> sklearn.linear_model <span style="color: #0000FF;">import</span> Ridge, RidgeCV
<span style="color: #0000FF;">from</span> sklearn.metrics <span style="color: #0000FF;">import</span> make_scorer, mean_squared_error
<span style="color: #0000FF;">import</span> numpy <span style="color: #0000FF;">as</span> np
<span style="color: #0000FF;">import</span> pandas <span style="color: #0000FF;">as</span> pd

<span style="color: #BA36A5;">hitters</span> = pd.read_csv(<span style="color: #008000;">'data/Hitters.csv'</span>, index_col=0)
hitters.dropna(inplace=<span style="color: #D0372D;">True</span>)

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Prepare data for input to sklearn</span>
<span style="color: #BA36A5;">y_var</span> = <span style="color: #008000;">'Salary'</span>
<span style="color: #BA36A5;">var_categoric</span> = [<span style="color: #008000;">'League'</span>, <span style="color: #008000;">'Division'</span>, <span style="color: #008000;">'NewLeague'</span>]
<span style="color: #BA36A5;">var_numeric</span> = <span style="color: #006FE0;">list</span>(hitters.columns)
var_numeric.remove(y_var)
<span style="color: #0000FF;">for</span> name <span style="color: #0000FF;">in</span> var_categoric:
    var_numeric.remove(name)

<span style="color: #BA36A5;">y</span> = hitters[y_var]
<span style="color: #BA36A5;">X_numeric</span> = hitters[var_numeric]
<span style="color: #BA36A5;">X_numeric_std</span> = X_numeric / X_numeric.std()
<span style="color: #BA36A5;">X_categoric</span> = hitters[var_categoric]
<span style="color: #BA36A5;">X_cat_dummies</span> = pd.get_dummies(X_categoric)
<span style="color: #BA36A5;">X</span> = pd.concat((X_numeric_std, X_cat_dummies), axis=1)

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">lambda = 11498</span>
<span style="color: #BA36A5;">ridge11k</span> = Ridge(alpha=11498 * X.shape[0] / 2)
ridge11k.fit(X, y)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'el2 norm when lambda is 11498'</span>)
<span style="color: #0000FF;">print</span>(np.sqrt(np.<span style="color: #006FE0;">sum</span>(ridge11k.coef_ ** 2)))

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">lambda = 705</span>
<span style="color: #BA36A5;">ridge700</span> = Ridge(alpha=705 * X.shape[0] / 2)
ridge700.fit(X, y)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'el2 norm when lambda is 705'</span>)
<span style="color: #0000FF;">print</span>(np.sqrt(np.<span style="color: #006FE0;">sum</span>(ridge700.coef_ ** 2)))
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'------'</span>)

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Split data into training and test groups</span>
np.random.seed(911)
<span style="color: #BA36A5;">shuffle_ind</span> = np.arange(X.shape[0])
np.random.shuffle(shuffle_ind)
<span style="color: #BA36A5;">train_ind</span> = shuffle_ind[:<span style="color: #006FE0;">int</span>(X.shape[0] / 2.0)]
<span style="color: #BA36A5;">test_ind</span> = shuffle_ind[<span style="color: #006FE0;">int</span>(X.shape[0] / 2.0):]
<span style="color: #BA36A5;">X_train</span> = X.iloc[train_ind]
<span style="color: #BA36A5;">y_train</span> = y[train_ind]
<span style="color: #BA36A5;">X_test</span> = X.iloc[test_ind]
<span style="color: #BA36A5;">y_test</span> = y[test_ind]

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">lambda = 4</span>
<span style="color: #BA36A5;">ridge4</span> = Ridge(alpha=4 * X.shape[0] / 2)
ridge4.fit(X_train, y_train)
<span style="color: #BA36A5;">y_predict</span> = ridge4.predict(X_test)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Test MSE with lambda = 4'</span>)
<span style="color: #0000FF;">print</span>(<span style="color: #006FE0;">round</span>(np.mean((y_predict - y_test) ** 2)))

<span style="color: #BA36A5;">y_predict_train</span> = ridge4.predict(X_train)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Test MSE when only an intercept is fit'</span>)
<span style="color: #0000FF;">print</span>(<span style="color: #006FE0;">round</span>(np.mean((y_test - np.mean(y_predict_train)) ** 2)))

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Very large lambda</span>
<span style="color: #BA36A5;">ridge_inf</span> = Ridge(alpha=1e10)
ridge_inf.fit(X_train, y_train)
<span style="color: #BA36A5;">y_predict</span> = ridge_inf.predict(X_test)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Test MSE with very large lambda'</span>)
<span style="color: #0000FF;">print</span>(<span style="color: #006FE0;">round</span>(np.mean((y_predict - y_test) ** 2)))

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">lambda = 0, equivalanet to OLS</span>
<span style="color: #BA36A5;">ridge_zero</span> = Ridge(alpha=0)
ridge_zero.fit(X_train, y_train)
<span style="color: #BA36A5;">y_predict</span> = ridge_zero.predict(X_test)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Test MSE with zero lambda'</span>)
<span style="color: #0000FF;">print</span>(<span style="color: #006FE0;">round</span>(np.mean((y_predict - y_test) ** 2)))
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'------'</span>)

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Select best alpha using cross-validation</span>
<span style="color: #BA36A5;">alpha_vals</span> = np.logspace(-2, 5)
<span style="color: #BA36A5;">negative_mse</span> = make_scorer(mean_squared_error, greater_is_better=<span style="color: #D0372D;">False</span>)
<span style="color: #BA36A5;">ridge_cv</span> = RidgeCV(alphas=alpha_vals, scoring=negative_mse)
ridge_cv.fit(X_train, y_train)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Best lambda: '</span> + <span style="color: #006FE0;">str</span>(<span style="color: #006FE0;">round</span>(ridge_cv.alpha_ * 2 / X_train.shape[0], 3)))

<span style="color: #BA36A5;">ridge_best</span> = Ridge(alpha=ridge_cv.alpha_)
ridge_best.fit(X_train, y_train)
<span style="color: #BA36A5;">y_predict</span> = ridge_best.predict(X_test)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Test MSE with best lambda'</span>)
<span style="color: #0000FF;">print</span>(<span style="color: #006FE0;">round</span>(np.mean((y_predict - y_test) ** 2)))
</pre>
</div>

<pre class="example">
el2 norm when lambda is 11498
0.1361967493538057
el2 norm when lambda is 705
2.1801333373772005
------
Test MSE with lambda = 4
123619.0
Test MSE when only an intercept is fit
202378.0
Test MSE with very large lambda
202378.0
Test MSE with zero lambda
103859.0
------
Best lambda: 2.121
Test MSE with best lambda
108007.0
</pre>
</div>
</div>

<div id="outline-container-org70f708b" class="outline-4">
<h4 id="org70f708b"><span class="section-number-4">6.6.2</span> The Lasso</h4>
<div class="outline-text-4" id="text-6-6-2">
<p>
We now ask whether the lasso can yield either a more accurate or a more
interpretable model than ridge regression.  We find that test MSE is much lower
than the test MSE from a model with no coefficients.  Test MSE from best fit
lasso model is comparable to test MSE from best fit ridge model.  Note that, for
the chosen seeds, test MSE is slightly <i>lower</i> than training MSE.  For other
seeds, we find that test MSE is higher than training MSE. 
</p>

<p>
Using the <code>alpha</code> of the best lasso model, we fit a lasso on the entire data
set.  Some of the variables have zero coefficients.  Although the test MSE is
very similar to test MSE reported in the book, the coefficients are quite different.
</p>

<div class="org-src-container">
<pre class="src src-python"><span style="color: #0000FF;">from</span> sklearn.linear_model <span style="color: #0000FF;">import</span> Lasso, LassoCV
<span style="color: #0000FF;">import</span> numpy <span style="color: #0000FF;">as</span> np
<span style="color: #0000FF;">import</span> pandas <span style="color: #0000FF;">as</span> pd

<span style="color: #BA36A5;">hitters</span> = pd.read_csv(<span style="color: #008000;">'data/Hitters.csv'</span>, index_col=0)
hitters.dropna(inplace=<span style="color: #D0372D;">True</span>)

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Prepare data for input to sklearn</span>
<span style="color: #BA36A5;">y_var</span> = <span style="color: #008000;">'Salary'</span>
<span style="color: #BA36A5;">var_categoric</span> = [<span style="color: #008000;">'League'</span>, <span style="color: #008000;">'Division'</span>, <span style="color: #008000;">'NewLeague'</span>]
<span style="color: #BA36A5;">var_numeric</span> = <span style="color: #006FE0;">list</span>(hitters.columns)
var_numeric.remove(y_var)
<span style="color: #0000FF;">for</span> name <span style="color: #0000FF;">in</span> var_categoric:
    var_numeric.remove(name)

<span style="color: #BA36A5;">y</span> = hitters[y_var]
<span style="color: #BA36A5;">X_numeric</span> = hitters[var_numeric]
<span style="color: #BA36A5;">X_numeric_std</span> = (X_numeric - X_numeric.mean()) / X_numeric.std()
<span style="color: #BA36A5;">X_categoric</span> = hitters[var_categoric]
<span style="color: #BA36A5;">X_cat_dummies</span> = pd.get_dummies(X_categoric)
<span style="color: #BA36A5;">X</span> = pd.concat((X_numeric_std, X_cat_dummies), axis=1)

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Split data into training and test groups</span>
np.random.seed(911)
<span style="color: #BA36A5;">shuffle_ind</span> = np.arange(X.shape[0])
np.random.shuffle(shuffle_ind)
<span style="color: #BA36A5;">train_ind</span> = shuffle_ind[:<span style="color: #006FE0;">int</span>(X.shape[0] / 2.0)]
<span style="color: #BA36A5;">test_ind</span> = shuffle_ind[<span style="color: #006FE0;">int</span>(X.shape[0] / 2.0):]
<span style="color: #BA36A5;">X_train</span> = X.iloc[train_ind]
<span style="color: #BA36A5;">y_train</span> = y[train_ind]
<span style="color: #BA36A5;">X_test</span> = X.iloc[test_ind]
<span style="color: #BA36A5;">y_test</span> = y[test_ind]

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Find best alpha, calculate MSE</span>
<span style="color: #BA36A5;">alpha_vals</span> = np.logspace(-3, 3)
<span style="color: #BA36A5;">lasso_cv</span> = LassoCV(alphas=alpha_vals, max_iter=10000, cv=10, random_state=211)
lasso_cv.fit(X_train, y_train)

<span style="color: #BA36A5;">y_predict</span> = lasso_cv.predict(X_train)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Train MSE with best lambda'</span>)
<span style="color: #0000FF;">print</span>(<span style="color: #006FE0;">round</span>(np.mean((y_predict - y_train) ** 2)))

<span style="color: #BA36A5;">y_predict</span> = lasso_cv.predict(X_test)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Test MSE with best lambda'</span>)
<span style="color: #0000FF;">print</span>(<span style="color: #006FE0;">round</span>(np.mean((y_predict - y_test) ** 2)))
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'------'</span>)

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Use best alpha to fit lasso on full data set</span>
<span style="color: #BA36A5;">lasso</span> = Lasso(alpha=lasso_cv.alpha_)
lasso.fit(X, y)
<span style="color: #BA36A5;">best_coef</span> = pd.Series(lasso.coef_, index=X.columns)
<span style="color: #BA36A5;">best_coef</span> = best_coef[np.<span style="color: #006FE0;">abs</span>(best_coef) &gt; 1e-5]
<span style="color: #BA36A5;">best_coef</span> = pd.concat((pd.Series(lasso.intercept_, index=[<span style="color: #008000;">'Intercept'</span>]),
                       best_coef))
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Coefficients of best lasso fit'</span>)
<span style="color: #0000FF;">print</span>(best_coef)
</pre>
</div>

<pre class="example">
Train MSE with best lambda
116129.0
Test MSE with best lambda
102605.0
------
Coefficients of best lasso fit
Intercept     506.569249
Hits           84.392705
Walks          48.147509
CRuns          71.418900
CRBI          128.901955
PutOuts        60.083757
Division_E     59.851120
dtype: float64
</pre>
</div>
</div>
</div>

<div id="outline-container-org3ee433d" class="outline-3">
<h3 id="org3ee433d"><span class="section-number-3">6.7</span> Lab 3: PCR and PLS Regression</h3>
<div class="outline-text-3" id="text-6-7">
</div>
<div id="outline-container-org96f12f7" class="outline-4">
<h4 id="org96f12f7"><span class="section-number-4">6.7.1</span> Principal Components Regression</h4>
<div class="outline-text-4" id="text-6-7-1">
<p>
We first run PCA on <code>Hitters</code> data.  Then we use principal components as
explanatory variables in linear regression.  
</p>

<div class="org-src-container">
<pre class="src src-python"><span style="color: #0000FF;">from</span> sklearn.decomposition <span style="color: #0000FF;">import</span> PCA
<span style="color: #0000FF;">from</span> sklearn.linear_model <span style="color: #0000FF;">import</span> LinearRegression
<span style="color: #0000FF;">import</span> pandas <span style="color: #0000FF;">as</span> pd
<span style="color: #0000FF;">import</span> numpy <span style="color: #0000FF;">as</span> np
<span style="color: #0000FF;">from</span> sklearn.model_selection <span style="color: #0000FF;">import</span> cross_val_score, KFold
<span style="color: #0000FF;">from</span> sklearn.metrics <span style="color: #0000FF;">import</span> mean_squared_error

<span style="color: #BA36A5;">hitters</span> = pd.read_csv(<span style="color: #008000;">'data/Hitters.csv'</span>, index_col=0)
hitters.dropna(inplace=<span style="color: #D0372D;">True</span>)

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Prepare data for use in sklearn</span>
<span style="color: #BA36A5;">y_var</span> = <span style="color: #008000;">'Salary'</span>
<span style="color: #BA36A5;">var_categoric</span> = [<span style="color: #008000;">'League'</span>, <span style="color: #008000;">'Division'</span>, <span style="color: #008000;">'NewLeague'</span>]
<span style="color: #BA36A5;">var_numeric</span> = <span style="color: #006FE0;">list</span>(hitters.columns)
var_numeric.remove(y_var)
<span style="color: #0000FF;">for</span> name <span style="color: #0000FF;">in</span> var_categoric:
    var_numeric.remove(name)

<span style="color: #BA36A5;">X_numeric</span> = hitters[var_numeric]
<span style="color: #BA36A5;">X_categoric</span> = hitters[var_categoric]
<span style="color: #BA36A5;">X_cat_dummies</span> = pd.get_dummies(X_categoric)
<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">An alternative method of including dummy variables</span>
<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Use this method to tie out with results in the book</span>
X_cat_dummies.drop(columns=[<span style="color: #008000;">'League_A'</span>, <span style="color: #008000;">'Division_E'</span>, <span style="color: #008000;">'NewLeague_A'</span>],
                   inplace=<span style="color: #D0372D;">True</span>)
<span style="color: #BA36A5;">X</span> = pd.concat((X_numeric, X_cat_dummies), axis=1)
<span style="color: #BA36A5;">X</span> = (X - X.mean()) / X.std()
<span style="color: #BA36A5;">y</span> = hitters[y_var]

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Run PCA on explanatory variables</span>
<span style="color: #BA36A5;">pca</span> = PCA()
pca.fit(X)

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Run OLS regression using 1, 2, ..., all principal components</span>
<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Calculate percent variance of y explained (r-squared) using cross validation</span>
<span style="color: #BA36A5;">X_transform</span> = pca.transform(X)
<span style="color: #BA36A5;">k_fold10</span> = KFold(n_splits=10, shuffle=<span style="color: #D0372D;">True</span>, random_state=911)
<span style="color: #BA36A5;">lm_model</span> = LinearRegression()
<span style="color: #BA36A5;">r_sq</span> = []                       <span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">on training data</span>
<span style="color: #BA36A5;">r_sq_cv</span> = []                    <span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">in cross validation</span>
<span style="color: #BA36A5;">mse</span> = []                        <span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">on training data</span>
<span style="color: #BA36A5;">mse_cv</span> = []                     <span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">in cross validation</span>
<span style="color: #0000FF;">for</span> i_components <span style="color: #0000FF;">in</span> np.arange(1, X_transform.shape[1] + 1):
    <span style="color: #BA36A5;">scores_mse</span>  = cross_val_score(lm_model, X_transform[:, :i_components],
                                  y, scoring=<span style="color: #008000;">'neg_mean_squared_error'</span>,
                                  cv=k_fold10)
    mse_cv.append(np.mean(scores_mse))

    <span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">For LinearRegression, score is r-squared</span>
    <span style="color: #BA36A5;">scores_rsq</span> = cross_val_score(lm_model, X_transform[:, :i_components],
                                 y, cv=k_fold10)
    r_sq_cv.append(np.mean(scores_rsq))
    <span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Fit on all data</span>
    lm_model.fit(X_transform[:, :i_components], y)
    r_sq.append(lm_model.score(X_transform[:, :i_components], y))
    mse.append(mean_squared_error(y, lm_model.predict(
        X_transform[:, :i_components])))

<span style="color: #BA36A5;">mse_cv</span> = [-1 * mse <span style="color: #0000FF;">for</span> mse <span style="color: #0000FF;">in</span> mse_cv]

<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Training data variance explained by principal components'</span>)
<span style="color: #BA36A5;">explain_df</span> = pd.DataFrame(
    {<span style="color: #008000;">'num_components'</span>: np.arange(1, X_transform.shape[1] + 1),
     <span style="color: #008000;">'X'</span>: np.<span style="color: #006FE0;">round</span>(np.cumsum(pca.explained_variance_ratio_), 4),
     <span style="color: #008000;">'y'</span>: r_sq})
explain_df.set_index(<span style="color: #008000;">'num_components'</span>, inplace=<span style="color: #D0372D;">True</span>)
<span style="color: #0000FF;">print</span>(explain_df.head(10))
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'------'</span>)

<span style="color: #BA36A5;">n_best_train</span> = np.argmin(mse) + 1
<span style="color: #BA36A5;">n_best_cv</span> = np.argmin(mse_cv) + 1
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'On training data, lowest mse occurs when %0.0f components are incuded'</span> %
      n_best_train)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'In cross validation, lowest mse occurs when %0.0f components are included'</span>
      % n_best_cv)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'------'</span>)

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Split data into training and test</span>
np.random.seed(211)
<span style="color: #BA36A5;">train_ind</span> = np.random.choice([<span style="color: #D0372D;">True</span>, <span style="color: #D0372D;">False</span>], X_transform.shape[0])
<span style="color: #BA36A5;">test_ind</span> = (train_ind == <span style="color: #D0372D;">False</span>)
<span style="color: #BA36A5;">X_train</span> = X.loc[train_ind]
<span style="color: #BA36A5;">X_train</span> = (X_train - X_train.mean()) / X_train.std()
<span style="color: #BA36A5;">y_train</span> = y[train_ind]

<span style="color: #BA36A5;">X_test</span> = X.loc[test_ind]
<span style="color: #BA36A5;">X_test</span> = (X_test - X_test.mean()) / X_test.std()
<span style="color: #BA36A5;">y_test</span> = y[test_ind]

<span style="color: #BA36A5;">pca</span> = PCA()
pca.fit(X_train)
<span style="color: #BA36A5;">X_train_transform</span> = pca.transform(X_train)[:, :n_best_cv]
<span style="color: #BA36A5;">lm_model</span> = LinearRegression()
lm_model.fit(X_train_transform, y_train)

<span style="color: #BA36A5;">X_test_transform</span> = pca.transform(X_test)[:, :n_best_cv]
<span style="color: #BA36A5;">rss</span> = np.mean((y_test - lm_model.predict(X_test_transform)) ** 2)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Using number of components that results in lowest cv MSE'</span>)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Test RSS: %0.0f'</span> % rss)
</pre>
</div>

<pre class="example">
Training data variance explained by principal components
                     X         y
num_components                  
1               0.3831  0.406269
2               0.6016  0.415822
3               0.7084  0.421733
4               0.7903  0.432236
5               0.8429  0.449044
6               0.8863  0.464800
7               0.9226  0.466864
8               0.9496  0.467498
9               0.9628  0.468580
10              0.9726  0.477632
------
On training data, lowest mse occurs when 19 components are incuded
In cross validation, lowest mse occurs when 6 components are included
------
Using number of components that results in lowest cv MSE
Test RSS: 110989
</pre>
</div>
</div>

<div id="outline-container-orga100fd3" class="outline-4">
<h4 id="orga100fd3"><span class="section-number-4">6.7.2</span> Partial Least Squares</h4>
<div class="outline-text-4" id="text-6-7-2">
<p>
To perform partial least squares regression, we use <code>PLSRegression</code> function in
<code>sklearn</code> library.  While the overall results are the same (for best model using
training data, test MSE is comparable to best model using PCA, ridge, or lasso),
actual results will vary based on the choice of seed.   
</p>

<div class="org-src-container">
<pre class="src src-python"><span style="color: #0000FF;">import</span> pandas <span style="color: #0000FF;">as</span> pd
<span style="color: #0000FF;">import</span> numpy <span style="color: #0000FF;">as</span> np
<span style="color: #0000FF;">from</span> sklearn.cross_decomposition <span style="color: #0000FF;">import</span> PLSRegression
<span style="color: #0000FF;">from</span> sklearn.model_selection <span style="color: #0000FF;">import</span> cross_val_score

<span style="color: #BA36A5;">hitters</span> = pd.read_csv(<span style="color: #008000;">'data/Hitters.csv'</span>, index_col=0)
hitters.dropna(inplace=<span style="color: #D0372D;">True</span>)

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Prepare data for sklearn</span>
<span style="color: #BA36A5;">y_var</span> = <span style="color: #008000;">'Salary'</span>
<span style="color: #BA36A5;">var_categoric</span> = [<span style="color: #008000;">'League'</span>, <span style="color: #008000;">'Division'</span>, <span style="color: #008000;">'NewLeague'</span>]
<span style="color: #BA36A5;">var_numeric</span> = <span style="color: #006FE0;">list</span>(hitters.columns)
var_numeric.remove(y_var)
<span style="color: #0000FF;">for</span> name <span style="color: #0000FF;">in</span> var_categoric:
    var_numeric.remove(name)

<span style="color: #BA36A5;">X_numeric</span> = hitters[var_numeric]
<span style="color: #BA36A5;">X_categoric</span> = hitters[var_categoric]
<span style="color: #BA36A5;">X_cat_dummies</span> = pd.get_dummies(X_categoric)
<span style="color: #BA36A5;">X</span> = pd.concat((X_numeric, X_cat_dummies), axis=1)
<span style="color: #BA36A5;">y</span> = hitters[y_var]

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Split data between training and test groups</span>
np.random.seed(911)
<span style="color: #BA36A5;">train_ind</span> = np.random.choice([<span style="color: #D0372D;">True</span>, <span style="color: #D0372D;">False</span>], X.shape[0])
<span style="color: #BA36A5;">test_ind</span> = np.vectorize(<span style="color: #0000FF;">lambda</span> x: <span style="color: #0000FF;">not</span> x)(train_ind)
<span style="color: #BA36A5;">X_train</span> = X.loc[train_ind]
<span style="color: #BA36A5;">X_test</span> = X.loc[test_ind]
<span style="color: #BA36A5;">X_train</span> = (X_train - X_train.mean()) / X_train.std()
<span style="color: #BA36A5;">X_test</span> = (X_test - X_test.mean()) / X_test.std()
<span style="color: #BA36A5;">y_train</span> = y[train_ind]
<span style="color: #BA36A5;">y_test</span> = y[test_ind]

<span style="color: #BA36A5;">mse_cv</span> = []
<span style="color: #0000FF;">for</span> i_components <span style="color: #0000FF;">in</span> np.arange(1, 20):
    <span style="color: #BA36A5;">pls</span> = PLSRegression(n_components=i_components)
    <span style="color: #BA36A5;">scores</span> = cross_val_score(pls, X_train, y_train, cv=10,
                             scoring=<span style="color: #008000;">'neg_mean_squared_error'</span>)
    mse_cv.append(np.mean(scores))

<span style="color: #BA36A5;">mse_cv</span> = [-1 * mse <span style="color: #0000FF;">for</span> mse <span style="color: #0000FF;">in</span> mse_cv]
<span style="color: #BA36A5;">best_comp_count</span> = np.argmin(mse_cv) + 1
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Lowest MSE obtained when number of components is %0.0f'</span>
      % best_comp_count)

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Fit PLS on test data using best number of components</span>
<span style="color: #BA36A5;">pls</span> = PLSRegression(n_components=best_comp_count)
pls.fit(X_test, y_test)
<span style="color: #BA36A5;">mse_test</span> = np.mean((pls.predict(X_test).ravel() - y_test) ** 2)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Using best number of components from training PLS'</span>)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Test MSE: %0.0f'</span> % mse_test)
</pre>
</div>

<pre class="example">
Lowest MSE obtained when number of components is 3
Using best number of components from training PLS
Test MSE: 109315

</pre>
</div>
</div>
</div>
</div>

<div id="outline-container-org2bdd141" class="outline-2">
<h2 id="org2bdd141"><span class="section-number-2">7</span> Moving Beyond Linearity</h2>
<div class="outline-text-2" id="text-7">
</div>
<div id="outline-container-org8076280" class="outline-3">
<h3 id="org8076280"><span class="section-number-3">7.1</span> Polynomial Regression</h3>
<div class="outline-text-3" id="text-7-1">
<p>
The left-hand panel of figure <a href="#orga4cf1aa">72</a> is a plot of <code>wage</code> against
<code>age</code> for the <code>Wage</code> data set, which contains demographic information for males
who reside in the central Atlantic region of the United States.  The scatter
plot shows individual data points.  The firm line is regression fit of fourth
order polynomial.  The dotted lines show 95% confidence interval.  
</p>

<p>
The right-hand panel shows fitted probabilities of <code>wage &gt; 250</code> from a logistic
regression, also on fourth order polynomial of <code>age</code>.  The grey marks on the top
and bottom of the panel indicate the ages of the high earners and low earners.    
</p>


<div id="orga4cf1aa" class="figure">
<p><img src="figures/fig7_1.png" alt="fig7_1.png" />
</p>
<p><span class="figure-number">Figure 72: </span>The <code>Wage</code> data.  Left: The sold curve is a degree-4 polynomial of <code>wage</code> (in thousands of dollars) as a function of <code>age</code>, fit by least squares.  The dotted curves indicate an estimated 95% confidence interval.  Right: We model the binary event <code>wage &gt; 250</code> using logistic regression, again with a degree-4 polynomial.  The fitted posterior probability of <code>wage</code> exceeding $250,000 is shown in blue.</p>
</div>
</div>
</div>

<div id="outline-container-org324bcc2" class="outline-3">
<h3 id="org324bcc2"><span class="section-number-3">7.2</span> Step Functions</h3>
<div class="outline-text-3" id="text-7-2">
<p>
The left-hand panel of figure <a href="#org9a3e6db">73</a> shows a fit of step functions
to the <code>Wage</code> data from figure <a href="#orga4cf1aa">72</a>.  We also fit the logistic
regression model to predict the probability that an individual is a high earner
based on <code>age</code>.  The right-hand panel displays the fitted posterior probabilities.
</p>


<div id="org9a3e6db" class="figure">
<p><img src="figures/fig7_2.png" alt="fig7_2.png" />
</p>
<p><span class="figure-number">Figure 73: </span>The <code>Wage</code> data.  Left: The solid curve displays the fitted value from a least squares regression of <code>wage</code> (in thousands of dollars) using step functions of <code>age</code>.  The dotted curves show an estimated 95% confidence interval.  Right: We model the binary event <code>wage &gt; 250</code> using logistic regression again using step functions of <code>age</code>.  The fitted posterior probability of <code>wage</code> exceeding $250,000 is shown.</p>
</div>
</div>
</div>

<div id="outline-container-org7acaba5" class="outline-3">
<h3 id="org7acaba5"><span class="section-number-3">7.3</span> Basis Functions</h3>
</div>

<div id="outline-container-org8c2eca0" class="outline-3">
<h3 id="org8c2eca0"><span class="section-number-3">7.4</span> Regression Splines</h3>
</div>

<div id="outline-container-org3bb9aaf" class="outline-3">
<h3 id="org3bb9aaf"><span class="section-number-3">7.5</span> Lab: Non-linear Modeling</h3>
<div class="outline-text-3" id="text-7-5">
</div>
<div id="outline-container-orgac959b8" class="outline-4">
<h4 id="orgac959b8"><span class="section-number-4">7.5.1</span> Polynomial Regression and Step Functions</h4>
<div class="outline-text-4" id="text-7-5-1">
<p>
We define a simple function to replicate the output of <code>poly</code> function in <code>R</code>.
We then fit <code>wage</code> as a fourth order orthogonal polynomial of <code>age</code>. Then we fit
<code>wage</code> as a fourth order raw polynomnial of <code>age</code>.  Predicted values from both
fits are very close to one another.
</p>

<p>
ANOVA analysis on nested models of upto five degree polynomials shows that first
three degrees are highly significant.  Fourth degree has a p-value just above
0.05.  Fifth degree is not significant.  Either a cubic or a quartic polynomial
provide a reasonable fit.
</p>

<p>
With the fourth order polynomial as the chosen model, it is straightforward to
plot fitted values and confidence intervals. 
</p>

<div class="org-src-container">
<pre class="src src-python"><span style="color: #0000FF;">import</span> pandas <span style="color: #0000FF;">as</span> pd
<span style="color: #0000FF;">import</span> numpy <span style="color: #0000FF;">as</span> np
<span style="color: #0000FF;">import</span> statsmodels.formula.api <span style="color: #0000FF;">as</span> smf
<span style="color: #0000FF;">from</span> statsmodels <span style="color: #0000FF;">import</span> datasets
<span style="color: #0000FF;">from</span> statsmodels.stats.api <span style="color: #0000FF;">import</span> anova_lm
<span style="color: #0000FF;">import</span> statsmodels.api <span style="color: #0000FF;">as</span> sm
<span style="color: #0000FF;">import</span> matplotlib.pyplot <span style="color: #0000FF;">as</span> plt

<span style="color: #BA36A5;">wage</span> = datasets.get_rdataset(<span style="color: #008000;">'Wage'</span>, <span style="color: #008000;">'ISLR'</span>).data

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Replicate poly() function in R</span>
<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Based on an answer by K. A. Buhr on stackoverflow</span>
<span style="color: #0000FF;">def</span> <span style="color: #006699;">poly</span>(x, p):
    <span style="color: #BA36A5;">x</span> = np.array(x)
    <span style="color: #BA36A5;">X_mat</span> = np.transpose(np.vstack([x ** k <span style="color: #0000FF;">for</span> k <span style="color: #0000FF;">in</span> <span style="color: #006FE0;">range</span>(p + 1)]))
    <span style="color: #0000FF;">return</span> np.linalg.qr(X_mat)[0][:, 1:]


<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Fit wage as a function of orthogonal polynomial of age</span>
<span style="color: #BA36A5;">X_wage</span> = poly(wage[<span style="color: #008000;">'age'</span>], 4)
<span style="color: #BA36A5;">X_wage</span> = sm.add_constant(X_wage)

<span style="color: #BA36A5;">poly_model</span> = sm.OLS(wage[<span style="color: #008000;">'wage'</span>], X_wage)
<span style="color: #BA36A5;">poly_fit</span> = poly_model.fit()

<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Coefficients of orthogonal polynomials upto 4 degrees'</span>)
<span style="color: #0000FF;">print</span>(poly_fit.summary2().tables[1].iloc[:, :4])
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'------'</span>)

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Fit wage as a function of raw polynomials of age</span>
<span style="color: #BA36A5;">model</span> = smf.ols(<span style="color: #008000;">'wage ~ age + I(age ** 2) + I(age ** 3) + I(age ** 4)'</span>,
                data=wage)
<span style="color: #BA36A5;">fit</span> = model.fit()
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Coefficients of raw polynomials upto 4 degrees'</span>)
<span style="color: #0000FF;">print</span>(fit.summary2().tables[1].iloc[:, :4])
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'------'</span>)

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Verify that both models produce identical fitted values</span>
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Orthogonal polynomials and raw polynomials fitted values nearly equal:'</span>)
<span style="color: #0000FF;">print</span>(np.<span style="color: #006FE0;">all</span>(np.<span style="color: #006FE0;">abs</span>(fit.fittedvalues - poly_fit.fittedvalues) &lt; 1e-7))
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'------'</span>)

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Fit models of degrees 1 to 5, then compare using ANOVA</span>
<span style="color: #BA36A5;">fit1</span> = smf.ols(<span style="color: #008000;">'wage ~ age'</span>, data=wage).fit()
<span style="color: #BA36A5;">fit2</span> = smf.ols(<span style="color: #008000;">'wage ~ age + I(age ** 2)'</span>, data=wage).fit()
<span style="color: #BA36A5;">fit3</span> = smf.ols(<span style="color: #008000;">'wage ~ age + I(age ** 2) + I(age ** 3)'</span>, data=wage).fit()
<span style="color: #BA36A5;">fit4</span> = smf.ols(<span style="color: #008000;">'wage ~ age + I(age ** 2) + I(age ** 3) + I(age ** 4)'</span>,
               data=wage).fit()
<span style="color: #BA36A5;">fit5</span> = smf.ols(<span style="color: #008000;">'wage ~ age + I(age ** 2) + I(age ** 3) + I(age ** 4) + I(age ** 5)'</span>, data=wage).fit()

<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'ANOVA on nested models upto 5 degrees'</span>)
<span style="color: #0000FF;">print</span>(anova_lm(fit1, fit2, fit3, fit4, fit5))
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'------'</span>)

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">For plotting, create age array, get prediction and confidence intervals</span>
<span style="color: #BA36A5;">res_df</span> = pd.DataFrame({<span style="color: #008000;">'age'</span>: np.linspace(wage[<span style="color: #008000;">'age'</span>].<span style="color: #006FE0;">min</span>(),
                                          wage[<span style="color: #008000;">'age'</span>].<span style="color: #006FE0;">max</span>())})
<span style="color: #BA36A5;">res_df</span>[<span style="color: #008000;">'wage_predict'</span>] = fit.get_prediction(exog=res_df).predicted_mean
<span style="color: #BA36A5;">res_df</span>[<span style="color: #008000;">'wage_low'</span>] = fit.get_prediction(exog=res_df).conf_int()[:, 0]
<span style="color: #BA36A5;">res_df</span>[<span style="color: #008000;">'wage_high'</span>] = fit.get_prediction(exog=res_df).conf_int()[:, 1]

<span style="color: #BA36A5;">fig</span> = plt.figure()
<span style="color: #BA36A5;">ax</span> = fig.add_subplot()
wage.plot(x=<span style="color: #008000;">'age'</span>, y=<span style="color: #008000;">'wage'</span>, kind=<span style="color: #008000;">'scatter'</span>, alpha=0.5, ax=ax)
res_df.plot(x=<span style="color: #008000;">'age'</span>, y=<span style="color: #008000;">'wage_predict'</span>, c=<span style="color: #008000;">'r'</span>, ax=ax)
res_df.plot(x=<span style="color: #008000;">'age'</span>, y=<span style="color: #008000;">'wage_low'</span>, c=<span style="color: #008000;">'r'</span>, linestyle=<span style="color: #008000;">'--'</span>, ax=ax)
res_df.plot(x=<span style="color: #008000;">'age'</span>, y=<span style="color: #008000;">'wage_high'</span>, c=<span style="color: #008000;">'r'</span>, linestyle=<span style="color: #008000;">'-.'</span>, ax=ax)
ax.set_xlabel(<span style="color: #008000;">'Age'</span>)
ax.set_ylabel(<span style="color: #008000;">'Wage'</span>)
ax.set_title(<span style="color: #008000;">'Degree-4 Polynomial'</span>)
</pre>
</div>

<pre class="example">
Coefficients of orthogonal polynomials upto 4 degrees
            Coef.   Std.Err.           t         P&gt;|t|
const  111.703608   0.728741  153.283015  0.000000e+00
x1     447.067853  39.914785   11.200558  1.484604e-28
x2    -478.315806  39.914785  -11.983424  2.355831e-32
x3    -125.521686  39.914785   -3.144742  1.678622e-03
x4      77.911181  39.914785    1.951938  5.103865e-02
------
Coefficients of raw polynomials upto 4 degrees
                  Coef.   Std.Err.         t     P&gt;|t|
Intercept   -184.154180  60.040377 -3.067172  0.002180
age           21.245521   5.886748  3.609042  0.000312
I(age ** 2)   -0.563859   0.206108 -2.735743  0.006261
I(age ** 3)    0.006811   0.003066  2.221409  0.026398
I(age ** 4)   -0.000032   0.000016 -1.951938  0.051039
------
Orthogonal polynomials and raw polynomials fitted values nearly equal:
True
------
ANOVA on nested models upto 5 degrees
   df_resid           ssr  df_diff        ss_diff           F        Pr(&gt;F)
0    2998.0  5.022216e+06      0.0            NaN         NaN           NaN
1    2997.0  4.793430e+06      1.0  228786.010128  143.593107  2.363850e-32
2    2996.0  4.777674e+06      1.0   15755.693664    9.888756  1.679202e-03
3    2995.0  4.771604e+06      1.0    6070.152124    3.809813  5.104620e-02
4    2994.0  4.770322e+06      1.0    1282.563017    0.804976  3.696820e-01
------
</pre>

<p>
Next we consider the task of predicting whether an individual earns more than
$250,000 per year.  The probability of wage greater than 250K is directly
obtained from the <code>predict</code> method on <code>statsmodels</code> fit object.  However, for
logit models, <code>statsmodels</code> does not provide confidence intervals.  We use the
formula for confidence intervals.  
</p>

<div class="org-src-container">
<pre class="src src-python"><span style="color: #0000FF;">from</span> statsmodels <span style="color: #0000FF;">import</span> datasets
<span style="color: #0000FF;">import</span> statsmodels.formula.api <span style="color: #0000FF;">as</span> smf
<span style="color: #0000FF;">from</span> sklearn.preprocessing <span style="color: #0000FF;">import</span> PolynomialFeatures
<span style="color: #0000FF;">import</span> numpy <span style="color: #0000FF;">as</span> np
<span style="color: #0000FF;">import</span> pandas <span style="color: #0000FF;">as</span> pd
<span style="color: #0000FF;">import</span> matplotlib.pyplot <span style="color: #0000FF;">as</span> plt

<span style="color: #BA36A5;">wage</span> = datasets.get_rdataset(<span style="color: #008000;">'Wage'</span>, <span style="color: #008000;">'ISLR'</span>).data

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Create dummy variable for wage &gt; 250K</span>
<span style="color: #BA36A5;">wage</span>[<span style="color: #008000;">'wage_gt250K'</span>] = wage[<span style="color: #008000;">'wage'</span>].<span style="color: #006FE0;">apply</span>(<span style="color: #0000FF;">lambda</span> x: 1 <span style="color: #0000FF;">if</span> x &gt; 250 <span style="color: #0000FF;">else</span> 0)

<span style="color: #BA36A5;">model</span> = smf.logit(<span style="color: #008000;">'wage_gt250K ~ age + I(age ** 2) + I(age ** 3) + I(age ** 4)'</span>,
                  data=wage)
<span style="color: #BA36A5;">fit</span> = model.fit()

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Predicted probability for plotting</span>
<span style="color: #BA36A5;">res_df</span> = pd.DataFrame({<span style="color: #008000;">'age'</span>: np.linspace(wage[<span style="color: #008000;">'age'</span>].<span style="color: #006FE0;">min</span>(),
                                           wage[<span style="color: #008000;">'age'</span>].<span style="color: #006FE0;">max</span>())})
<span style="color: #BA36A5;">res_df</span>[<span style="color: #008000;">'prob_wage_gt250'</span>] = fit.predict(exog=res_df)

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Create X matrix for estimating confidence intervals</span>
<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Based on a suggestion from David Dale on stackoverflow</span>
<span style="color: #BA36A5;">poly</span> = PolynomialFeatures(degree=4)
<span style="color: #BA36A5;">X_mat</span> = poly.fit_transform(res_df[<span style="color: #008000;">'age'</span>][:, np.newaxis])

<span style="color: #BA36A5;">cov_beta</span> = fit.cov_params()
<span style="color: #BA36A5;">predict_var</span> = np.diag(np.dot(X_mat, np.dot(cov_beta, X_mat.T)))
<span style="color: #BA36A5;">predict_error</span> = np.sqrt(predict_var)
<span style="color: #BA36A5;">Xb</span> = np.dot(X_mat, fit.params)

<span style="color: #BA36A5;">predict_upper</span> = Xb + 1.96 * predict_error
<span style="color: #BA36A5;">predict_lower</span> = Xb - 1.96 * predict_error
<span style="color: #BA36A5;">res_df</span>[<span style="color: #008000;">'prob_upper'</span>] = np.exp(predict_upper) / (1 + np.exp(predict_upper))
<span style="color: #BA36A5;">res_df</span>[<span style="color: #008000;">'prob_lower'</span>] = np.exp(predict_lower) / (1 + np.exp(predict_lower))

<span style="color: #BA36A5;">fig</span> = plt.figure()
<span style="color: #BA36A5;">ax</span> = fig.add_subplot(111)
ax.scatter(wage[<span style="color: #008000;">'age'</span>], wage[<span style="color: #008000;">'wage_gt250K'</span>]/2, marker=<span style="color: #008000;">'|'</span>, color=<span style="color: #008000;">'grey'</span>,
           alpha=0.5)
res_df.plot(x=<span style="color: #008000;">'age'</span>, y=<span style="color: #008000;">'prob_wage_gt250'</span>, c=<span style="color: #008000;">'b'</span>, ax=ax)
res_df.plot(x=<span style="color: #008000;">'age'</span>, y=<span style="color: #008000;">'prob_upper'</span>, c=<span style="color: #008000;">'r'</span>, linestyle=<span style="color: #008000;">'--'</span>, ax=ax)
res_df.plot(x=<span style="color: #008000;">'age'</span>, y=<span style="color: #008000;">'prob_lower'</span>, c=<span style="color: #008000;">'r'</span>, linestyle=<span style="color: #008000;">'-.'</span>, ax=ax)
ax.set_xlabel(<span style="color: #008000;">'Age'</span>)
ax.set_ylabel(<span style="color: #008000;">'Prob(Wage &gt; 250 | Age)'</span>)
</pre>
</div>

<pre class="example">
Optimization terminated successfully.
         Current function value: 0.116870
         Iterations 12

</pre>

<p>
We now fit a step function of <code>age</code>. 
</p>

<div class="org-src-container">
<pre class="src src-python"><span style="color: #0000FF;">from</span> statsmodels <span style="color: #0000FF;">import</span> datasets
<span style="color: #0000FF;">import</span> statsmodels.formula.api <span style="color: #0000FF;">as</span> smf
<span style="color: #0000FF;">import</span> pandas <span style="color: #0000FF;">as</span> pd

<span style="color: #BA36A5;">wage</span> = datasets.get_rdataset(<span style="color: #008000;">'Wage'</span>, <span style="color: #008000;">'ISLR'</span>).data
<span style="color: #BA36A5;">wage</span>[<span style="color: #008000;">'age_grp'</span>] = pd.cut(wage[<span style="color: #008000;">'age'</span>], bins=[17, 33.5, 49, 64.5, 81],
                         labels=[<span style="color: #008000;">'17to33'</span>, <span style="color: #008000;">'33to49'</span>, <span style="color: #008000;">'49to64'</span>, <span style="color: #008000;">'64to80'</span>])

<span style="color: #BA36A5;">step_model</span> = smf.ols(<span style="color: #008000;">'wage ~ age_grp'</span>, data=wage)
<span style="color: #BA36A5;">step_fit</span> = step_model.fit()

<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Coefficients of age groups'</span>)
<span style="color: #0000FF;">print</span>(step_fit.summary2().tables[1].iloc[:, :4])
</pre>
</div>

<pre class="example">
Coefficients of age groups
                       Coef.  Std.Err.          t         P&gt;|t|
Intercept          94.158392  1.476069  63.789970  0.000000e+00
age_grp[T.33to49]  24.053491  1.829431  13.148074  1.982315e-38
age_grp[T.49to64]  23.664559  2.067958  11.443444  1.040750e-29
age_grp[T.64to80]   7.640592  4.987424   1.531972  1.256350e-01

</pre>
</div>
</div>
</div>
</div>

<div id="outline-container-orgd2c3957" class="outline-2">
<h2 id="orgd2c3957"><span class="section-number-2">8</span> Tree-Based Models</h2>
<div class="outline-text-2" id="text-8">
</div>
<div id="outline-container-orgf4b805e" class="outline-3">
<h3 id="orgf4b805e"><span class="section-number-3">8.1</span> The Basics of Decision Trees</h3>
<div class="outline-text-3" id="text-8-1">
</div>
<div id="outline-container-org0112364" class="outline-4">
<h4 id="org0112364"><span class="section-number-4">8.1.1</span> Regression Trees</h4>
<div class="outline-text-4" id="text-8-1-1">
<p>
Figure <a href="#org9d9d605">74</a> shows a regression tree fit to <code>Hitters</code> data.  We
predict log of <code>Salary</code> based on <code>Years</code> (column 0 of <code>X</code> matrix) and <code>Hits</code>
(column 1).  The figure consists of a series of splitting rules.  The top split
assigns observations having <code>Years &lt; 4.5</code> to the left branch.  Players with
<code>Years &gt; 4.5</code> are assigned to the right branch, and then the group is further
subdivided by <code>Hits</code>.  
</p>


<div id="org9d9d605" class="figure">
<p><img src="figures/fig8_1.png" alt="fig8_1.png" />
</p>
<p><span class="figure-number">Figure 74: </span>For the <code>Hitters</code> data, a regression tree for producing the log salary of a baseball player, based on the number of years that he has played in the major leagues and the number of hits that he made in the previous year.</p>
</div>


<p>
Overall, the tree segments the players into four regions of predictor space:
\(R_{1}=\{X | Years < 4.5, Hits < 15.5\}\), \(R_{2}=\{X | Years < 4.5, Hits > 15.5\}\),
\(R_{3}=\{X | Years > 4.5, Hits < 117.5\}\), and \(R_{4}=\{X | Years > 4.5, Hits >
117.5\}\).  Figure <a href="#org3e606f1">75</a> illustrates the regions as a function of
<code>Years</code> and <code>Hits</code>. 
</p>


<div id="org3e606f1" class="figure">
<p><img src="figures/fig8_2.png" alt="fig8_2.png" />
</p>
<p><span class="figure-number">Figure 75: </span>The four-region partition for the <code>Hitters</code> data set from the regression tree in Figure <a href="#org9d9d605">74</a>.</p>
</div>

<p>
Figure <a href="#orgd7c411e">76</a> shows a five-region example of prediction via
stratification of the feature space. 
</p>


<div id="orgd7c411e" class="figure">
<p><img src="figures/fig8_3.png" alt="fig8_3.png" />
</p>
<p><span class="figure-number">Figure 76: </span>Top Left: A partition of two-dimensional space that could not result from recursive binary splitting. Top Right: The output of recursive binary splitting on a two-dimensional example.  Bottom Left: A tree corresponding to the partition on the top right panel.  Bottom Right: A perspective plot of the prediction surface corresponding to that tree.</p>
</div>
</div>
</div>
<div id="outline-container-orge46b2de" class="outline-4">
<h4 id="orge46b2de"><span class="section-number-4">8.1.2</span> Classification Trees</h4>
</div>

<div id="outline-container-org0bf0e9f" class="outline-4">
<h4 id="org0bf0e9f"><span class="section-number-4">8.1.3</span> Trees versus Linear Models</h4>
<div class="outline-text-4" id="text-8-1-3">
<p>
Figure <a href="#org27facc5">77</a> is an example of when a linear model or a tree
performs better.  In the top row, the relationship between the features and
responses is linear.  A linear regression works well.  In the bottom row, the
relationship between the features and the responses is non-linear.  A decision
tree outperforms linear regression.  
</p>


<div id="org27facc5" class="figure">
<p><img src="figures/fig8_7.png" alt="fig8_7.png" />
</p>
<p><span class="figure-number">Figure 77: </span>Top Row: A two-dimensional classification example in which the true decision boundary is linear, and is indicated by the shaded regions.  A classical approach that assumes a linear boundary (left) will outperform a decision tree that performs splits parallel to the axes (right).  Bottom Row: Here the true decision boundary is non-linear.  A linear model is unable to capture the true decision boundary (left), whereas a decision tree is successful (right).</p>
</div>
</div>
</div>
</div>
<div id="outline-container-org60ad82b" class="outline-3">
<h3 id="org60ad82b"><span class="section-number-3">8.2</span> Bagging, Random Forests, Boosting</h3>
<div class="outline-text-3" id="text-8-2">
<p>
Figure <a href="#org85be9c7">78</a> is a graphical representation of <i>variable
importances</i> in the <code>Heart</code> data.  We see the mean decrease in Gini index for each variable
relative to the largest.  The variables with the largest mean decrease in Gini
index are <code>Ca</code>, <code>Thal</code>, <code>HR</code>, and <code>Oldpeak</code>.  When a categorical variable has
more than two values, it appears in the figure more than once.  For example,
<code>Thal</code> has values <code>normal</code>, <code>reversable</code>, and <code>fixed</code>.  In the figure, we see
<code>Thal_normal</code> and <code>Thal_reversable</code>. 
</p>


<div id="org85be9c7" class="figure">
<p><img src="figures/fig8_9.png" alt="fig8_9.png" />
</p>
<p><span class="figure-number">Figure 78: </span>A variable importance plot for the <code>Heart</code> data. Variable importance is computed using the mean decrease in Gini index and expressed relative to the maximum.</p>
</div>
</div>
</div>

<div id="outline-container-org91a2f29" class="outline-3">
<h3 id="org91a2f29"><span class="section-number-3">8.3</span> Lab: Decision Trees</h3>
<div class="outline-text-3" id="text-8-3">
</div>
<div id="outline-container-org8aae0c0" class="outline-4">
<h4 id="org8aae0c0"><span class="section-number-4">8.3.1</span> Fitting Classification Trees</h4>
<div class="outline-text-4" id="text-8-3-1">
<p>
In <code>scikit-learn</code> library, <code>tree</code> module implements classification and
regression trees.  
</p>

<p>
We first use classification trees to analyze the <code>Carseats</code> data set.  In these
data, <code>Sales</code> is a continuous variable, and so we begin by recoding it as a binary
variable.  
</p>

<div class="org-src-container">
<pre class="src src-python"><span style="color: #0000FF;">from</span> statsmodels <span style="color: #0000FF;">import</span> datasets
<span style="color: #0000FF;">from</span> sklearn.tree <span style="color: #0000FF;">import</span> DecisionTreeClassifier, export_text, plot_tree
<span style="color: #0000FF;">from</span> sklearn.model_selection <span style="color: #0000FF;">import</span> train_test_split, GridSearchCV

<span style="color: #0000FF;">import</span> pandas <span style="color: #0000FF;">as</span> pd

<span style="color: #BA36A5;">Carseats</span> = datasets.get_rdataset(<span style="color: #008000;">'Carseats'</span>, <span style="color: #008000;">'ISLR'</span>).data

<span style="color: #BA36A5;">Carseats</span>[<span style="color: #008000;">'High Sales'</span>] = Carseats[<span style="color: #008000;">'Sales'</span>].<span style="color: #006FE0;">apply</span>(
    <span style="color: #0000FF;">lambda</span> x: <span style="color: #008000;">'Yes'</span> <span style="color: #0000FF;">if</span> x &gt; 8 <span style="color: #0000FF;">else</span> <span style="color: #008000;">'No'</span>)

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Create dummy variables for qualitative data</span>
<span style="color: #BA36A5;">X_numeric</span> = Carseats[[<span style="color: #008000;">'CompPrice'</span>, <span style="color: #008000;">'Income'</span>, <span style="color: #008000;">'Advertising'</span>, <span style="color: #008000;">'Population'</span>,
                      <span style="color: #008000;">'Price'</span>, <span style="color: #008000;">'Age'</span>, <span style="color: #008000;">'Education'</span>]]
<span style="color: #BA36A5;">X_cat</span> = Carseats[[<span style="color: #008000;">'ShelveLoc'</span>, <span style="color: #008000;">'Urban'</span>, <span style="color: #008000;">'US'</span>]]
<span style="color: #BA36A5;">X_dummies</span> = pd.get_dummies(X_cat, drop_first=<span style="color: #D0372D;">True</span>)
<span style="color: #BA36A5;">X</span> = pd.concat([X_numeric, X_dummies], axis=1)
<span style="color: #BA36A5;">y</span> = Carseats[<span style="color: #008000;">'High Sales'</span>]

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Fit model to all data</span>
<span style="color: #BA36A5;">tree</span> = DecisionTreeClassifier(ccp_alpha=0.02)
tree.fit(X, y)

<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Decision tree in text'</span>)
<span style="color: #0000FF;">print</span>(export_text(tree, feature_names=X.columns.to_list()))

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Split data between train and test sets</span>
<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Fit model to training data, then print prediction accuracy on test data</span>
<span style="color: #BA36A5;">X_train</span>, <span style="color: #BA36A5;">X_test</span>, <span style="color: #BA36A5;">y_train</span>, <span style="color: #BA36A5;">y_test</span> = train_test_split(X, y, test_size=0.5,
                                                    random_state=0)

<span style="color: #BA36A5;">model</span> = DecisionTreeClassifier()
model.fit(X_train, y_train)

<span style="color: #BA36A5;">res_crosstab</span> = pd.crosstab(y_test, model.predict(X_test))
<span style="color: #BA36A5;">res_crosstab.columns.name</span> = <span style="color: #008000;">'High Sales Predict'</span>
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Prediction accuracy on test data'</span>)
<span style="color: #0000FF;">print</span>(res_crosstab)
<span style="color: #0000FF;">print</span>(<span style="color: #006FE0;">str</span>(<span style="color: #006FE0;">round</span>((96 + 48) / X_test.shape[0], 2)), <span style="color: #008000;">'of predictions are correct'</span>)

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Use grid search to find optimal level of tree complexity</span>
<span style="color: #BA36A5;">param_grid</span> = {<span style="color: #008000;">'ccp_alpha'</span>: np.linspace(0, 0.02, 11)}
<span style="color: #BA36A5;">grid</span> = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5,
                    return_train_score=<span style="color: #D0372D;">True</span>)
grid.fit(X, y)
</pre>
</div>

<pre class="example">
Decision tree in text
|--- ShelveLoc_Good &lt;= 0.50
|   |--- Price &lt;= 92.50
|   |   |--- class: Yes
|   |--- Price &gt;  92.50
|   |   |--- Advertising &lt;= 13.50
|   |   |   |--- class: No
|   |   |--- Advertising &gt;  13.50
|   |   |   |--- class: Yes
|--- ShelveLoc_Good &gt;  0.50
|   |--- class: Yes

Prediction accuracy on test data
High Sales Predict  No  Yes
High Sales                 
No                  97   21
Yes                 30   52
0.72  of predictions are correct
</pre>
</div>
</div>
<div id="outline-container-org434f007" class="outline-4">
<h4 id="org434f007"><span class="section-number-4">8.3.2</span> Fitting Regression Trees</h4>
<div class="outline-text-4" id="text-8-3-2">
<p>
Here we fit a regression tree to the <code>Boston</code> data set.  First we create a
training set, and fit the tree to the training data. 
</p>
<div class="org-src-container">
<pre class="src src-python"><span style="color: #0000FF;">from</span> statsmodels <span style="color: #0000FF;">import</span> datasets
<span style="color: #0000FF;">from</span> sklearn.model_selection <span style="color: #0000FF;">import</span> train_test_split, GridSearchCV
<span style="color: #0000FF;">from</span> sklearn.tree <span style="color: #0000FF;">import</span> DecisionTreeRegressor, export_text
<span style="color: #0000FF;">import</span> pandas <span style="color: #0000FF;">as</span> pd
<span style="color: #0000FF;">import</span> numpy <span style="color: #0000FF;">as</span> np

<span style="color: #BA36A5;">Boston</span> = datasets.get_rdataset(<span style="color: #008000;">'Boston'</span>, package=<span style="color: #008000;">'MASS'</span>).data

<span style="color: #BA36A5;">X</span> = Boston.drop(columns=<span style="color: #008000;">'medv'</span>)
<span style="color: #BA36A5;">y</span> = Boston[<span style="color: #008000;">'medv'</span>]

<span style="color: #BA36A5;">X_train</span>, <span style="color: #BA36A5;">X_test</span>, <span style="color: #BA36A5;">y_train</span>, <span style="color: #BA36A5;">y_test</span> = train_test_split(X, y, test_size=0.5,
                                                    random_state=0)

<span style="color: #BA36A5;">tree</span> = DecisionTreeRegressor(max_leaf_nodes=40, random_state=0)
tree.fit(X_train, y_train)

<span style="color: #BA36A5;">resid_mean_dev</span> = np.mean(np.<span style="color: #006FE0;">abs</span>(y_test - tree.predict(X_test)))
<span style="color: #BA36A5;">resid_dist</span> = pd.Series(np.quantile(y_test - tree.predict(X_test),
                                   [0, 0.25, 0.5, 0.75, 1]),
                       index=[<span style="color: #008000;">'Min'</span>, <span style="color: #008000;">'1st Qu'</span>, <span style="color: #008000;">'Median'</span>, <span style="color: #008000;">'3rd Qu'</span>, <span style="color: #008000;">'Max'</span>])

<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Residual mean deviance:'</span>, <span style="color: #006FE0;">str</span>(<span style="color: #006FE0;">round</span>(resid_mean_dev, 2)))
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Distribution of residuals:'</span>)
<span style="color: #0000FF;">print</span>(resid_dist)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Mean squared error:'</span>, <span style="color: #006FE0;">str</span>(<span style="color: #006FE0;">round</span>(np.std(y_test - tree.predict(X_test)),
                                       2)))
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'------'</span>)

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Use grid search to find best tree size</span>
<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Depending upon starting point, tree size can be very different</span>
<span style="color: #BA36A5;">params_grid</span> = {<span style="color: #008000;">'max_leaf_nodes'</span>: np.arange(2, 21)}
<span style="color: #BA36A5;">grid</span> = GridSearchCV(DecisionTreeRegressor(random_state=0),
                    param_grid=params_grid, cv=5)

grid.fit(X_train, y_train)

<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Best fitted tree'</span>)
<span style="color: #0000FF;">print</span>(export_text(grid.best_estimator_, feature_names=X.columns.to_list()))

<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Mean squared error:'</span>,
      <span style="color: #006FE0;">str</span>(np.<span style="color: #006FE0;">round</span>(np.std(y_test - grid.best_estimator_.predict(X_test)), 2)))
</pre>
</div>

<pre class="example">
Residual mean deviance: 3.07
Distribution of residuals:
Min      -16.675000
1st Qu    -2.118519
Median    -0.400000
3rd Qu     1.466667
Max       35.500000
dtype: float64
Mean squared error: 5.06
------
Best fitted tree
|--- lstat &lt;= 7.81
|   |--- rm &lt;= 7.43
|   |   |--- dis &lt;= 1.48
|   |   |   |--- value: [50.00]
|   |   |--- dis &gt;  1.48
|   |   |   |--- rm &lt;= 6.56
|   |   |   |   |--- value: [23.75]
|   |   |   |--- rm &gt;  6.56
|   |   |   |   |--- value: [30.67]
|   |--- rm &gt;  7.43
|   |   |--- ptratio &lt;= 15.40
|   |   |   |--- value: [48.64]
|   |   |--- ptratio &gt;  15.40
|   |   |   |--- value: [41.26]
|--- lstat &gt;  7.81
|   |--- lstat &lt;= 15.00
|   |   |--- rm &lt;= 6.53
|   |   |   |--- value: [20.78]
|   |   |--- rm &gt;  6.53
|   |   |   |--- value: [26.02]
|   |--- lstat &gt;  15.00
|   |   |--- dis &lt;= 1.92
|   |   |   |--- value: [11.67]
|   |   |--- dis &gt;  1.92
|   |   |   |--- value: [16.83]

Mean squared error: 4.89
</pre>
</div>
</div>
<div id="outline-container-org04ccd68" class="outline-4">
<h4 id="org04ccd68"><span class="section-number-4">8.3.3</span> Bagging and Random Forests</h4>
<div class="outline-text-4" id="text-8-3-3">
<div class="org-src-container">
<pre class="src src-python"><span style="color: #0000FF;">from</span> statsmodels <span style="color: #0000FF;">import</span> datasets
<span style="color: #0000FF;">from</span> sklearn.ensemble <span style="color: #0000FF;">import</span> RandomForestRegressor, BaggingRegressor
<span style="color: #0000FF;">from</span> sklearn.model_selection <span style="color: #0000FF;">import</span> train_test_split
<span style="color: #0000FF;">from</span> sklearn.tree <span style="color: #0000FF;">import</span> DecisionTreeRegressor
<span style="color: #0000FF;">import</span> pandas <span style="color: #0000FF;">as</span> pd
<span style="color: #0000FF;">import</span> numpy <span style="color: #0000FF;">as</span> np
<span style="color: #0000FF;">import</span> matplotlib.pyplot <span style="color: #0000FF;">as</span> plt
plt.style.use(<span style="color: #008000;">'seaborn-whitegrid'</span>)

<span style="color: #BA36A5;">Boston</span> = datasets.get_rdataset(<span style="color: #008000;">'Boston'</span>, package=<span style="color: #008000;">'MASS'</span>).data

<span style="color: #BA36A5;">X</span> = Boston.drop(columns=<span style="color: #008000;">'medv'</span>).copy()
<span style="color: #BA36A5;">y</span> = Boston[<span style="color: #008000;">'medv'</span>].copy()

<span style="color: #BA36A5;">X_train</span>, <span style="color: #BA36A5;">X_test</span>, <span style="color: #BA36A5;">y_train</span>, <span style="color: #BA36A5;">y_test</span> = train_test_split(X, y, random_state=0)

<span style="color: #BA36A5;">bag</span> = BaggingRegressor(random_state=0)
bag.fit(X_train, y_train)

<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'BaggingRegressor model'</span>)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Model performance on training data'</span>)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Mean of squared residuals: %0.2f'</span> %
      np.mean((y_train - bag.predict(X_train)) ** 2))
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Percent variance explained: %0.2f'</span> % bag.score(X_train, y_train))

<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'------'</span>)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Model performance on test data'</span>)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Percent variance explained: %0.2f'</span> %
      np.mean((y_test - bag.predict(X_test)) ** 2))

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">plt.scatter(y_test, bag.predict(X_test), alpha=0.7)</span>
<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], linestyle='--',</span>
<span style="color: #8D8D84;">#        </span><span style="color: #8D8D84; font-style: italic;">alpha=0.7)</span>
<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">plt.gca().set(xlabel='True price', ylabel='Predicted price')</span>

<span style="color: #BA36A5;">forest</span> = RandomForestRegressor(random_state=0)
forest.fit(X_train, y_train)

<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'------'</span>)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'RandomForestRegressor model'</span>)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Mean of squared residuals: %0.2f'</span> %
      np.mean((y_test - forest.predict(X_test)) ** 2))

<span style="color: #BA36A5;">feature_imp</span> = pd.Series(forest.feature_importances_, index=X.columns.to_list())
feature_imp.sort_values(ascending=<span style="color: #D0372D;">False</span>, inplace=<span style="color: #D0372D;">True</span>)

<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'------'</span>)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Feature importance of variables'</span>)
<span style="color: #0000FF;">print</span>((feature_imp * 100).<span style="color: #006FE0;">round</span>(2))
</pre>
</div>

<pre class="example">
BaggingRegressor model
Model performance on training data
Mean of squared residuals: 2.49
Percent variance explained: 0.97
------
Model performance on test data
Percent variance explained: 22.79
------
RandomForestRegressor model
Mean of squared residuals: 16.73
------
Feature importance of variables
rm         41.65
lstat      40.75
dis         4.33
crim        4.15
ptratio     2.12
tax         1.80
nox         1.58
age         1.24
black       1.10
indus       0.69
rad         0.41
zn          0.11
chas        0.07
dtype: float64
</pre>
</div>
</div>
<div id="outline-container-org06402c2" class="outline-4">
<h4 id="org06402c2"><span class="section-number-4">8.3.4</span> Boosting</h4>
<div class="outline-text-4" id="text-8-3-4">
<div class="org-src-container">
<pre class="src src-python"><span style="color: #0000FF;">from</span> statsmodels <span style="color: #0000FF;">import</span> datasets
<span style="color: #0000FF;">import</span> numpy <span style="color: #0000FF;">as</span> np
<span style="color: #0000FF;">from</span> sklearn.ensemble <span style="color: #0000FF;">import</span> AdaBoostRegressor
<span style="color: #0000FF;">from</span> sklearn.model_selection <span style="color: #0000FF;">import</span> train_test_split
<span style="color: #0000FF;">from</span> sklearn.tree <span style="color: #0000FF;">import</span> DecisionTreeRegressor
<span style="color: #0000FF;">import</span> pandas <span style="color: #0000FF;">as</span> pd

<span style="color: #BA36A5;">Boston</span> = datasets.get_rdataset(<span style="color: #008000;">'Boston'</span>, package=<span style="color: #008000;">'MASS'</span>).data

<span style="color: #BA36A5;">X</span> = Boston.drop(columns=<span style="color: #008000;">'medv'</span>).copy()
<span style="color: #BA36A5;">y</span> = Boston[<span style="color: #008000;">'medv'</span>].copy()

<span style="color: #BA36A5;">X_train</span>, <span style="color: #BA36A5;">X_test</span>, <span style="color: #BA36A5;">y_train</span>, <span style="color: #BA36A5;">y_test</span> = train_test_split(X, y, random_state=0)

<span style="color: #BA36A5;">tree</span> = DecisionTreeRegressor(max_depth=4)
<span style="color: #BA36A5;">boost</span> = AdaBoostRegressor(tree, n_estimators=5000, random_state=0)

boost.fit(X_train, y_train)

<span style="color: #BA36A5;">feature_imp</span> = pd.Series(boost.feature_importances_, index=X_train.columns)
feature_imp.sort_values(ascending=<span style="color: #D0372D;">False</span>, inplace=<span style="color: #D0372D;">True</span>)

<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'AdaBoostRegressor model'</span>)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Feature importances'</span>)
<span style="color: #0000FF;">print</span>((feature_imp * 100).<span style="color: #006FE0;">round</span>(2))
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'------'</span>)

<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Model performance on training data'</span>)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Mean of squared residuals: %0.2f'</span> %
      np.mean((y_train - boost.predict(X_train)) ** 2))
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Percent of variance explained: %0.2f'</span> % boost.score(X_train, y_train))
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'------'</span>)

<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Model performance on test data'</span>)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Mean of squared residuals: %0.2f'</span> %
      np.mean((y_test - boost.predict(X_test)) ** 2))
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Percent of variance explained: %0.2f'</span> % boost.score(X_test, y_test))
</pre>
</div>

<pre class="example">
AdaBoostRegressor model
Feature importances
lstat      41.10
rm         31.69
dis         7.42
tax         4.69
ptratio     4.44
nox         2.32
crim        2.19
black       1.81
age         1.63
indus       1.60
rad         0.57
zn          0.27
chas        0.26
dtype: float64
------
Model performance on training data
Mean of squared residuals: 3.37
Percent of variance explained: 0.96
------
Model performance on test data
Mean of squared residuals: 21.91
Percent of variance explained: 0.73
</pre>
</div>
</div>
</div>
</div>
<div id="outline-container-orgd98a3f9" class="outline-2">
<h2 id="orgd98a3f9"><span class="section-number-2">9</span> Support Vector Machines</h2>
<div class="outline-text-2" id="text-9">
</div>
<div id="outline-container-org6220cf1" class="outline-3">
<h3 id="org6220cf1"><span class="section-number-3">9.1</span> Maximal Margin Classifier</h3>
<div class="outline-text-3" id="text-9-1">
</div>
<div id="outline-container-org5e3604c" class="outline-4">
<h4 id="org5e3604c"><span class="section-number-4">9.1.1</span> What is a Hyperplane?</h4>
<div class="outline-text-4" id="text-9-1-1">
<p>
Figure <a href="#orgc07b0cb">79</a> shows a hyperplane in two-dimensional space.
</p>


<div id="orgc07b0cb" class="figure">
<p><img src="figures/fig9_1.png" alt="fig9_1.png" />
</p>
<p><span class="figure-number">Figure 79: </span>The hyperplane \(1 + 2X_1 + 3X_2 = 0\) is shown.  The blue region is the set of points for which \(1 + 2X_1 + 3X_2 > 0\), and the red region is the set of points for which \(1 + 2X_1 + 3X_2 < 0\).</p>
</div>
</div>
</div>

<div id="outline-container-org05ccb32" class="outline-4">
<h4 id="org05ccb32"><span class="section-number-4">9.1.2</span> Classification Using a Separating Hyperplane</h4>
<div class="outline-text-4" id="text-9-1-2">
<p>
Figure <a href="#orgb8f1a6c">80</a> shows an example of a separating hyperplane classifier.
</p>


<div id="orgb8f1a6c" class="figure">
<p><img src="figures/fig9_2.png" alt="fig9_2.png" />
</p>
<p><span class="figure-number">Figure 80: </span>Left: There are two classes of observations, shown in red and blue, each of which has measurements on two variables.  Three separating hyperplanes, out of many possible, are shown in gray.  Right: A separating hyperplane is shown in black.  The red and blue grid indicates the decision rule made by a classifier based on this separating hyperplane: a test observation that falls in the red portion of the grid will be assigned to the red class.  A test observation that falls into the blue portion of the grid will be assigned to the blue class.</p>
</div>
</div>
</div>

<div id="outline-container-orgf3614f9" class="outline-4">
<h4 id="orgf3614f9"><span class="section-number-4">9.1.3</span> The Maximal Margin Classifier</h4>
<div class="outline-text-4" id="text-9-1-3">
<p>
Examining figure <a href="#orgedf65a7">81</a>, we see that three training observations are
equidistant from the maximal margin hyperplane and lie along the dashed lines
indicating the width of the margin.  The three observations are known as
<i>support vectors</i>, since they are vectors in <i>p</i>-dimensional space (in figure
<a href="#orgedf65a7">81</a>, \(p=2\)) and they ``support'' the maximal margin hyperplane.  If
these points were moved slightly, then the maximal margin hyperplane would move as
well. 
</p>


<div id="orgedf65a7" class="figure">
<p><img src="figures/fig9_3.png" alt="fig9_3.png" />
</p>
<p><span class="figure-number">Figure 81: </span>There are two classes of observations, shown in red and in blue.  The maximal margin hyperplane is shown as a solid line.  The margin is the distance from the solid line to either of the dashed lines.  The two red points the blue point that lie on the dashed lines are support vectors.  The red and blue grid indicates the decision rule made by a classifier based on this separating hyperplane.</p>
</div>
</div>
</div>
<div id="outline-container-org5d157d6" class="outline-4">
<h4 id="org5d157d6"><span class="section-number-4">9.1.4</span> Construction of the Maximal Margin Classifier</h4>
</div>

<div id="outline-container-org6025292" class="outline-4">
<h4 id="org6025292"><span class="section-number-4">9.1.5</span> The Non-separable Case</h4>
<div class="outline-text-4" id="text-9-1-5">
<p>
Figure <a href="#orgaf8b9f6">82</a> shows an example where we cannot <i>exactly</i> separate the two
classes.  There is no maximal margin classifier.  
</p>


<div id="orgaf8b9f6" class="figure">
<p><img src="figures/fig9_4.png" alt="fig9_4.png" />
</p>
<p><span class="figure-number">Figure 82: </span>There are two classes of observations, shown in red and blue.  In this case, the two classes are not separable by a hyperplane.  Therefore, the maximal margin classifier cannot be used.</p>
</div>
</div>
</div>
</div>

<div id="outline-container-org1d8a47d" class="outline-3">
<h3 id="org1d8a47d"><span class="section-number-3">9.2</span> Support Vector Classifiers</h3>
<div class="outline-text-3" id="text-9-2">
</div>
<div id="outline-container-org9046c36" class="outline-4">
<h4 id="org9046c36"><span class="section-number-4">9.2.1</span> Overview of the Support Vector Classifier</h4>
<div class="outline-text-4" id="text-9-2-1">
<p>
In figures <a href="#orgb4ba7e3">83</a>, addition of a single observation in the right hand panel
leads to a dramatic change in the maximal margin hyperplane.  
</p>


<div id="orgb4ba7e3" class="figure">
<p><img src="figures/fig9_5.png" alt="fig9_5.png" />
</p>
<p><span class="figure-number">Figure 83: </span>Left: Two classes of observations are shown in blue and in red, along with the maximal margin hyperplane.  Right: An additional blue observation (marked with `x') has been added, leading to a dramatic shift in the maximal margin hyperplane shown as a solid line.  The dashed line indicates the maximal margin hyperplane that was obtained in the absence of this additional point.</p>
</div>

<p>
Figure <a href="#org99aa3d0">84</a> is an example of a <i>support vector classifier</i>.  Most of the
observations are on the correct side of the margin.  However, a small subset of
observations are on the wrong side of the margin.  
</p>


<div id="org99aa3d0" class="figure">
<p><img src="figures/fig9_6.png" alt="fig9_6.png" />
</p>
<p><span class="figure-number">Figure 84: </span>Left: A support vector classifier was fit to a small data set.  The hyperplane is shown as a solid line and the margins are shown as dashed lines.  Blue observations: Observations 1, 4, 7 are on the correct side of the margin.  Observation 6 is on the margin.  Observation 8 is on the wrong side of the margin.  Observation 11 is on the wrong side of the hyperplane.  Red observations: Observations 2, 10 are on the right side of the margin.  Observationsn 0, 9 are on the margin.  Observations 3, 5 are on the wrong side of the margin.  No observation is on the wrong side of the hyperplane.  Righ: Same as right panel, with three additional points 12, 13, and 14.  Observation 13 is on the wrong side of hyperplane.  Now observation 8 is also on the wrong side of the hyperplane.</p>
</div>
</div>
</div>
<div id="outline-container-orge4e3096" class="outline-4">
<h4 id="orge4e3096"><span class="section-number-4">9.2.2</span> Details of the Support Vector Classifier</h4>
<div class="outline-text-4" id="text-9-2-2">
<p>
Figure <a href="#org435824d">85</a> illustrates the width of margin for different values of
regularization parameter \(C\).  
</p>


<div id="org435824d" class="figure">
<p><img src="figures/fig9_7.png" alt="fig9_7.png" />
</p>
<p><span class="figure-number">Figure 85: </span>A support vector classifier was fit using four different values of the regularization parameter \(C\) (which is different from \(C\), the non-negative tuning parameter, described in the book).  The strenght of the regularization is inversely proportional to \(C\).  When \(C\) is small, there is a high tolerance for observations being on the wrong side of the margin.  Therefore, the margin is wide.  As \(C\) increases, the tolerance for observations being on the wrong side of the margin decreases.  Therefore, the margin narrows.</p>
</div>
</div>
</div>
</div>
<div id="outline-container-org1933591" class="outline-3">
<h3 id="org1933591"><span class="section-number-3">9.3</span> Support Vector Machines</h3>
<div class="outline-text-3" id="text-9-3">
</div>
<div id="outline-container-org513f10d" class="outline-4">
<h4 id="org513f10d"><span class="section-number-4">9.3.1</span> Classification with Non-linear Decision Boundaries</h4>
<div class="outline-text-4" id="text-9-3-1">
<p>
In figure <a href="#orgd5fb73a">86</a> consider the data shown in the left panel.  It is clear
that a support vector classifier or any linear classifier will perform poorly
here.  Indeed, the support vector classifier shown in the right panel is
useless here. 
</p>

<div id="orgd5fb73a" class="figure">
<p><img src="figures/fig9_8.png" alt="fig9_8.png" />
</p>
<p><span class="figure-number">Figure 86: </span>Left: The observations fall into two classes, with a non-linear boundary between them.  Right: The support vector classifier seeks a linear boundary, and consequently performs poorly.</p>
</div>
</div>
</div>
<div id="outline-container-org4a732e4" class="outline-4">
<h4 id="org4a732e4"><span class="section-number-4">9.3.2</span> The Support Vector Machine</h4>
<div class="outline-text-4" id="text-9-3-2">
<p>
In figure <a href="#org88d0932">87</a>, the left-hand panel shows an example of SVM with a
polynomial kernel applied the non-linear data from figure <a href="#orgd5fb73a">86</a>.  The fit
is an improvement over the linear support vector classifier.  The right-hannd
panel show an example of an SVM with a radial kernel on this non-linear data.
SVM does the best job of separating the two classes.
</p>


<div id="org88d0932" class="figure">
<p><img src="figures/fig9_9.png" alt="fig9_9.png" />
</p>
<p><span class="figure-number">Figure 87: </span>Left: An SVM with a polynomial kernel of degree 3 is applied to the non-linear data from figure <a href="#orgd5fb73a">86</a>, resulting in a more appropriate decision rule.  Right: An SVM with a radial kernel is applied.  In this example, radial kernel is the most capable of capturing the decision boundary.</p>
</div>
</div>
</div>
<div id="outline-container-org58a8f30" class="outline-4">
<h4 id="org58a8f30"><span class="section-number-4">9.3.3</span> An Application to the Heart Disease Data</h4>
<div class="outline-text-4" id="text-9-3-3">
<p>
In figure <a href="#org31d189d">88</a>, the left-hand panel displays ROC curves for training set
predictions for both LDA and the support vector classifier.  The right-hand
panel displays ROC curves for SVMs using a radial kernel, with two values of
\(\gamma\). As \(\gamma\) increases and the fit becomes more non-linear, the ROC improves.
</p>

<div id="org31d189d" class="figure">
<p><img src="figures/fig9_10.png" alt="fig9_10.png" />
</p>
<p><span class="figure-number">Figure 88: </span>ROC curves for <code>Heart</code> data training set.  Left: The support vector classifier and LDA are compared.  Right: The support vector classifier is compared to an SVM using radial basis kernel with \(\gamma = 0.005\) and \(\gamma = 0.001\).</p>
</div>

<p>
Figure <a href="#org592574b">89</a> compares ROC curves on test observations.  Using default
settings for all parameters other than \(\gamma\), SVM model performance is worse
than support vector classifier model performance.  In the code, there is a
commented section which shows how model parameters \(C\) and \(\gamma\) can be tuned
to improve model performance. 
</p>


<div id="org592574b" class="figure">
<p><img src="figures/fig9_11.png" alt="fig9_11.png" />
</p>
<p><span class="figure-number">Figure 89: </span>ROC curves for test set of the <code>Heart</code> data.  Left: The support vector classifier and LDA are compared.  Right: The support vector classifier is compared to an SVM using radial basis kernel with \(\gamma = 0.005\) and \(\gamma = 0.001\).</p>
</div>
</div>
</div>
</div>
<div id="outline-container-orga42ffa9" class="outline-3">
<h3 id="orga42ffa9"><span class="section-number-3">9.4</span> SVMs with More than Two Classes</h3>
<div class="outline-text-3" id="text-9-4">
</div>
<div id="outline-container-org639a905" class="outline-4">
<h4 id="org639a905"><span class="section-number-4">9.4.1</span> One-Versus-One Classification</h4>
</div>
<div id="outline-container-org8e5bf9e" class="outline-4">
<h4 id="org8e5bf9e"><span class="section-number-4">9.4.2</span> One-Versus-All Classification</h4>
</div>
</div>
<div id="outline-container-org7b0e323" class="outline-3">
<h3 id="org7b0e323"><span class="section-number-3">9.5</span> Relationship with Logistic Regression</h3>
</div>
<div id="outline-container-org65da4b4" class="outline-3">
<h3 id="org65da4b4"><span class="section-number-3">9.6</span> Lab: Support Vector Machines</h3>
<div class="outline-text-3" id="text-9-6">
<p>
We use <code>svm</code> module of <code>sklearn</code> library to demonstrate the support vector
classifier and the SVM.
</p>
</div>
<div id="outline-container-org914fad5" class="outline-4">
<h4 id="org914fad5"><span class="section-number-4">9.6.1</span> Support Vector Classifier</h4>
<div class="outline-text-4" id="text-9-6-1">
<p>
To fit a support vector classifier, we use <code>SVC</code> function with the argument
<code>kernel</code> set to <code>'linear'</code>.
</p>
<div class="org-src-container">
<pre class="src src-python"><span style="color: #0000FF;">import</span> numpy <span style="color: #0000FF;">as</span> np
<span style="color: #0000FF;">import</span> pandas <span style="color: #0000FF;">as</span> pd
<span style="color: #0000FF;">import</span> matplotlib.pyplot <span style="color: #0000FF;">as</span> plt
<span style="color: #0000FF;">import</span> rpy2.robjects <span style="color: #0000FF;">as</span> robjects
<span style="color: #0000FF;">import</span> sys
<span style="color: #0000FF;">from</span> sklearn.svm <span style="color: #0000FF;">import</span> SVC
<span style="color: #0000FF;">from</span> sklearn.model_selection <span style="color: #0000FF;">import</span> GridSearchCV
<span style="color: #0000FF;">from</span> sklearn.metrics <span style="color: #0000FF;">import</span> confusion_matrix
plt.style.use(<span style="color: #008000;">'seaborn-whitegrid'</span>)
sys.path.append(<span style="color: #008000;">'code/chap9'</span>)

<span style="color: #0000FF;">from</span> svm_funcs <span style="color: #0000FF;">import</span> svm_model_plot
<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">To replicate results, get data from R</span>
<span style="color: #BA36A5;">X</span> = robjects.r(<span style="color: #008000;">'''set.seed(1); x &lt;- matrix(rnorm(20 * 2))'''</span>)
<span style="color: #BA36A5;">X</span> = np.array(X).reshape((20, 2), order=<span style="color: #008000;">'F'</span>)
<span style="color: #BA36A5;">y</span> = np.concatenate([np.ones(10) * -1, np.ones(10)])
<span style="color: #BA36A5;">y</span> = y.astype(<span style="color: #006FE0;">int</span>)
<span style="color: #BA36A5;">X</span>[y == 1, :] += 1

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Plot shows classes are not linearly separable</span>
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.get_cmap(<span style="color: #008000;">'RdBu'</span>, 2), alpha=0.7)

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Fit support vector classifier</span>
<span style="color: #BA36A5;">svc</span> = SVC(C=10, kernel=<span style="color: #008000;">'linear'</span>)
svc.fit(X, y)

svm_model_plot(svc, X, y)

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Support vector indices</span>
<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Remember python count begins at 0; R count begins at 1</span>
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Support vector indices:'</span>, svc.support_)

<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Parameters of support vector classifier:'</span>)
<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">print(svc.get_params())</span>
<span style="color: #0000FF;">for</span> param_name <span style="color: #0000FF;">in</span> [<span style="color: #008000;">'C'</span>, <span style="color: #008000;">'kernel'</span>]:
    <span style="color: #0000FF;">print</span>(<span style="color: #008000;">'  '</span>, param_name, <span style="color: #008000;">':'</span>, svc.get_params()[param_name])
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Number of support vectors:'</span>, svc.n_support_)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Number of classes:'</span>, svc.n_features_in_)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Classes:'</span>, svc.classes_)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'------'</span>)

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Use smaller value of cost to obtain larger number of support vectors</span>
<span style="color: #BA36A5;">svc</span> = SVC(C=0.1, kernel=<span style="color: #008000;">'linear'</span>)
svc.fit(X, y)
svm_model_plot(svc, X, y)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Suport vector indices:'</span>, svc.support_)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'------'</span>)

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Use grid search to find the value of cost parameter C for best fit</span>
<span style="color: #BA36A5;">param_grid</span> = {<span style="color: #008000;">'C'</span>: [0.001, 0.01, 0.1, 1, 5, 10, 100]}
<span style="color: #BA36A5;">svc</span> = SVC(kernel=<span style="color: #008000;">'linear'</span>)
<span style="color: #BA36A5;">grid</span> = GridSearchCV(svc, param_grid=param_grid, cv=10)
grid.fit(X, y)

<span style="color: #BA36A5;">grid_res</span> = pd.DataFrame(grid.cv_results_[<span style="color: #008000;">'params'</span>])
<span style="color: #BA36A5;">grid_res</span>[<span style="color: #008000;">'score'</span>] = grid.cv_results_[<span style="color: #008000;">'mean_test_score'</span>]

<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Best parameter:'</span>, grid.best_params_)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Performance results:'</span>)
<span style="color: #0000FF;">print</span>(grid_res)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'------'</span>)

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Generate test data set</span>
<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">First run code in book so that test set is identical to that used in the book</span>
<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">This does not produce results identical to those shown in the book</span>
<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">test_data = robjects.r('''</span>
<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">get_x_test &lt;- function(){</span>
<span style="color: #8D8D84;">#   </span><span style="color: #8D8D84; font-style: italic;">x &lt;- matrix(rnorm(20 * 2), ncol = 2)</span>
<span style="color: #8D8D84;">#   </span><span style="color: #8D8D84; font-style: italic;">y &lt;- c(rep(-1, 10), rep(1, 10))</span>
<span style="color: #8D8D84;">#   </span><span style="color: #8D8D84; font-style: italic;">x[y == 1, ] &lt;- x[y == 1, ] + 1</span>
<span style="color: #8D8D84;">#   </span><span style="color: #8D8D84; font-style: italic;">dat &lt;- data.frame(x = x, y = as.factor(y))</span>
<span style="color: #8D8D84;">#   </span><span style="color: #8D8D84; font-style: italic;">library(e1071)</span>
<span style="color: #8D8D84;">#   </span><span style="color: #8D8D84; font-style: italic;">svmfit &lt;- svm(y ~ ., data = dat, kernel = 'linear', cost = 10, scale = FALSE)</span>
<span style="color: #8D8D84;">#     </span><span style="color: #8D8D84; font-style: italic;">set.seed(1)</span>
<span style="color: #8D8D84;">#   </span><span style="color: #8D8D84; font-style: italic;">tune.out &lt;- tune(svm, y ~ ., data = dat, kernel = 'linear',</span>
<span style="color: #8D8D84;">#     </span><span style="color: #8D8D84; font-style: italic;">ranges = list(cost = c(0.001, 0.01, 0.1, 1, 5, 10, 100)))</span>
<span style="color: #8D8D84;">#   </span><span style="color: #8D8D84; font-style: italic;">x_test &lt;- matrix(rnorm(20 * 2))</span>
<span style="color: #8D8D84;">#   </span><span style="color: #8D8D84; font-style: italic;">y_test &lt;- sample(c(-1, 1), 20, rep = TRUE)</span>
<span style="color: #8D8D84;">#   </span><span style="color: #8D8D84; font-style: italic;">c(x_test, y_test)</span>
<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">}</span>
<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">get_x_test()</span>
<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">''')</span>
<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">test_data = np.array(test_data)</span>
<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">X_test = test_data[:20 * 2]</span>
<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">X_test = np.array(X_test).reshape((20, 2), order='F')</span>
<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">y_test = test_data[20 * 2:]</span>
<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">y_test = np.array(y_test).astype(int)</span>
<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">X_test[y_test == 1, :] += 1</span>

np.random.seed(11)
<span style="color: #BA36A5;">X_test</span> = np.random.normal(size=[20, 2])
<span style="color: #BA36A5;">y_test</span> = np.random.choice([-1, 1], size=20, replace=<span style="color: #D0372D;">True</span>)
<span style="color: #BA36A5;">X_test</span>[y_test == 1, :] += 1

<span style="color: #BA36A5;">model</span> = grid.best_estimator_
<span style="color: #BA36A5;">y_predict</span> = model.predict(X_test)

<span style="color: #BA36A5;">res_df</span> = pd.DataFrame(confusion_matrix(y_test, y_predict), columns=[-1, 1],
                      index=[-1, 1])
<span style="color: #BA36A5;">res_df.columns.name</span> = <span style="color: #008000;">'Predict'</span>
<span style="color: #BA36A5;">res_df.index.name</span> = <span style="color: #008000;">'True'</span>

<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Prediction on test data with best model (C=0.1):'</span>)
<span style="color: #0000FF;">print</span>(res_df)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'------'</span>)

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Fit support vector classifier with C = 0.01, then predict on test data</span>
<span style="color: #BA36A5;">svc</span> = SVC(C=0.01, kernel=<span style="color: #008000;">'linear'</span>)
svc.fit(X, y)
<span style="color: #BA36A5;">y_predict</span> = svc.predict(X_test)
<span style="color: #BA36A5;">res_df_new</span> = pd.DataFrame(confusion_matrix(y_test, y_predict), columns=[-1, 1],
                          index=[-1, 1])
<span style="color: #BA36A5;">res_df_new.columns.name</span> = <span style="color: #008000;">'Predict'</span>
<span style="color: #BA36A5;">res_df_new.index.name</span> = <span style="color: #008000;">'True'</span>

<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Prediction on test data with model using C = 0.01:'</span>)
<span style="color: #0000FF;">print</span>(res_df_new)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'------'</span>)

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Separate the two classes so that they are linearly separable</span>
<span style="color: #BA36A5;">X</span>[y == 1, :] += 0.5
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.get_cmap(<span style="color: #008000;">'RdBu'</span>, 2), alpha=0.7)

<span style="color: #BA36A5;">X_test</span>[y_test == 1, :] += 0.5

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Fit a support vector classifier with a high value of cost</span>
<span style="color: #BA36A5;">svc</span> = SVC(C=1e5, kernel=<span style="color: #008000;">'linear'</span>)
svc.fit(X, y)
svm_model_plot(svc, X, y)

<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Cost parameter:'</span>, svc.get_params()[<span style="color: #008000;">'C'</span>])
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Number of support vectors:'</span>, svc.n_support_)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Fraction of training observations correctly predicted:'</span>,
      svc.score(X, y))
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Fraction of testing observations correctly predicted:'</span>,
      svc.score(X_test, y_test))
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'------'</span>)

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Refit model with smaller value of cost parameter</span>
<span style="color: #BA36A5;">svc</span> = SVC(C=1, kernel=<span style="color: #008000;">'linear'</span>)
svc.fit(X, y)
svm_model_plot(svc, X, y)

<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Cost parameter:'</span>, svc.get_params()[<span style="color: #008000;">'C'</span>])
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Number of support vectors:'</span>, svc.n_support_)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Fraction of training observations correctly predicted:'</span>,
      svc.score(X, y))
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Fraction of testing observations correctly predicted:'</span>,
      svc.score(X_test, y_test))
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'------'</span>)
</pre>
</div>

<pre class="example">
Support vector indices: [ 0  1  4  6 13 15 16]
Parameters of support vector classifier:
   C : 10
   kernel : linear
Number of support vectors: [4 3]
Number of classes: 2
Classes: [-1  1]
------
Suport vector indices: [ 0  1  2  3  4  6  8  9 11 12 13 14 15 16 17 19]
------
Best parameter: {'C': 0.1}
Performance results:
         C  score
0    0.001   0.75
1    0.010   0.75
2    0.100   0.95
3    1.000   0.90
4    5.000   0.85
5   10.000   0.85
6  100.000   0.85
------
Prediction on test data with best model (C=0.1):
Predict  -1   1
True           
-1       11   2
 1        2   5
------
Prediction on test data with model using C = 0.01:
Predict  -1   1
True           
-1       12   1
 1        4   3
------
Cost parameter: 100000.0
Number of support vectors: [1 2]
Fraction of training observations correctly predicted: 1.0
Fraction of testing observations correctly predicted: 0.9
------
Cost parameter: 1
Number of support vectors: [3 4]
Fraction of training observations correctly predicted: 0.95
Fraction of testing observations correctly predicted: 0.9
------
</pre>
</div>
</div>
<div id="outline-container-org5c3cdef" class="outline-4">
<h4 id="org5c3cdef"><span class="section-number-4">9.6.2</span> Support Vector Machine</h4>
<div class="outline-text-4" id="text-9-6-2">
<p>
In order to fit an SVM using a non-linear kernel, we once again use the <code>SVC</code>
function.  We now set the parameter <code>kernel</code> to <code>'rbf'</code>.  
</p>
<div class="org-src-container">
<pre class="src src-python"><span style="color: #0000FF;">import</span> numpy <span style="color: #0000FF;">as</span> np
<span style="color: #0000FF;">import</span> rpy2.robjects <span style="color: #0000FF;">as</span> robjects
<span style="color: #0000FF;">import</span> sys
<span style="color: #0000FF;">import</span> matplotlib.pyplot <span style="color: #0000FF;">as</span> plt
<span style="color: #0000FF;">import</span> pandas <span style="color: #0000FF;">as</span> pd
<span style="color: #0000FF;">from</span> sklearn.model_selection <span style="color: #0000FF;">import</span> train_test_split, GridSearchCV
<span style="color: #0000FF;">from</span> sklearn.svm <span style="color: #0000FF;">import</span> SVC

sys.path.append(<span style="color: #008000;">'code/chap9'</span>)

<span style="color: #0000FF;">from</span> svm_funcs <span style="color: #0000FF;">import</span> svm_model_plot

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">To replicate results, get data from R</span>
<span style="color: #BA36A5;">X</span> = robjects.r(<span style="color: #008000;">'''set.seed(1); x &lt;- matrix(rnorm(200 * 2), ncol = 2)'''</span>)
<span style="color: #BA36A5;">X</span> = np.array(X).reshape(200, 2)
<span style="color: #BA36A5;">X</span>[:100, :] += 2
<span style="color: #BA36A5;">X</span>[100:150, :] -= 2
<span style="color: #BA36A5;">y</span> = np.concatenate([np.ones(150), np.ones(50) * 2])

plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.get_cmap(<span style="color: #008000;">'RdBu'</span>, 2), alpha=0.7)

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Split data between train and test sets</span>
<span style="color: #BA36A5;">X_train</span>, <span style="color: #BA36A5;">X_test</span>, <span style="color: #BA36A5;">y_train</span>, <span style="color: #BA36A5;">y_test</span> = train_test_split(X, y, test_size=100,
                                                    random_state=0)

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Fit training data set with a radial kernel</span>
<span style="color: #BA36A5;">svm</span> = SVC(kernel=<span style="color: #008000;">'rbf'</span>)
svm.fit(X_train, y_train)

svm_model_plot(svm, X_train, y_train)

<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'SVM model fit with cost parameter:'</span>, svm.get_params()[<span style="color: #008000;">'C'</span>])
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Number of support vectors:'</span>, svm.n_support_)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Fraction of accurate predictions on training data:'</span>,
      svm.score(X_train, y_train))
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Fraction of accurate predictions on test data:'</span>,
      svm.score(X_test, y_test))
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'------'</span>)

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Fit training data set with a high value for cost parameter</span>
<span style="color: #BA36A5;">svm</span> = SVC(C=1e5, kernel=<span style="color: #008000;">'rbf'</span>)
svm.fit(X_train, y_train)

svm_model_plot(svm, X_train, y_train)

<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'SVM model fit with cost parameter:'</span>, svm.get_params()[<span style="color: #008000;">'C'</span>])
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Number of support vectors:'</span>, svm.n_support_)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Fraction of accurate predictions on training data:'</span>,
      svm.score(X_train, y_train))
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Fraction of accurate predictions on test data:'</span>,
      svm.score(X_test, y_test))
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'------'</span>)

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Find best values of C and gamma using grid search</span>
<span style="color: #BA36A5;">param_grid</span> = {<span style="color: #008000;">'C'</span>: [0.1, 1, 5], <span style="color: #008000;">'gamma'</span>: [0.1, 0.5, 1]}
<span style="color: #BA36A5;">svm</span> = SVC(kernel=<span style="color: #008000;">'rbf'</span>)
<span style="color: #BA36A5;">grid</span> = GridSearchCV(svm, param_grid=param_grid)
grid.fit(X_train, y_train)

<span style="color: #BA36A5;">res_df</span> = pd.DataFrame(grid.cv_results_[<span style="color: #008000;">'params'</span>])
<span style="color: #BA36A5;">res_df</span>[<span style="color: #008000;">'test_score'</span>] = grid.cv_results_[<span style="color: #008000;">'mean_test_score'</span>]

<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Grid search results:'</span>)
<span style="color: #0000FF;">print</span>(res_df)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Fraction of accurate predictions on test data:'</span>,
      grid.best_estimator_.score(X_test, y_test))
</pre>
</div>

<pre class="example">
SVM model fit with cost parameter: 1.0
Number of support vectors: [15 14]
Fraction of accurate predictions on training data: 0.92
Fraction of accurate predictions on test data: 0.85
------
SVM model fit with cost parameter: 100000.0
Number of support vectors: [11  8]
Fraction of accurate predictions on training data: 0.99
Fraction of accurate predictions on test data: 0.8
------
Grid search results:
     C  gamma  test_score
0  0.1    0.1        0.75
1  0.1    0.5        0.74
2  0.1    1.0        0.75
3  1.0    0.1        0.92
4  1.0    0.5        0.93
5  1.0    1.0        0.92
6  5.0    0.1        0.92
7  5.0    0.5        0.92
8  5.0    1.0        0.88
Fraction of accurate predictions on test data: 0.86
</pre>
</div>
</div>
<div id="outline-container-org86df7af" class="outline-4">
<h4 id="org86df7af"><span class="section-number-4">9.6.3</span> ROC Curves</h4>
<div class="outline-text-4" id="text-9-6-3">
<p>
In <code>sklearn</code> library, <code>metrics</code> module provides functions <code>roc_curve</code> and
<code>plot_roc_curve</code>.
</p>
<div class="org-src-container">
<pre class="src src-python"><span style="color: #0000FF;">import</span> numpy <span style="color: #0000FF;">as</span> np
<span style="color: #0000FF;">import</span> matplotlib.pyplot <span style="color: #0000FF;">as</span> plt
<span style="color: #0000FF;">import</span> rpy2.robjects <span style="color: #0000FF;">as</span> robjects
<span style="color: #0000FF;">from</span> sklearn.svm <span style="color: #0000FF;">import</span> SVC
<span style="color: #0000FF;">from</span> sklearn.metrics <span style="color: #0000FF;">import</span> plot_roc_curve
<span style="color: #0000FF;">from</span> sklearn.model_selection <span style="color: #0000FF;">import</span> train_test_split

<span style="color: #BA36A5;">X</span> = robjects.r(<span style="color: #008000;">'''set.seed(1); x &lt;- matrix(rnorm(200 * 2), ncol = 2)'''</span>)
<span style="color: #BA36A5;">X</span> = np.array(X, order=<span style="color: #008000;">'F'</span>)
<span style="color: #BA36A5;">X</span>[:100, :] += 2
<span style="color: #BA36A5;">X</span>[100:150, :] -= 2
<span style="color: #BA36A5;">y</span> = np.concatenate([np.ones(150), np.ones(50) * 2])

<span style="color: #BA36A5;">X_train</span>, <span style="color: #BA36A5;">X_test</span>, <span style="color: #BA36A5;">y_train</span>, <span style="color: #BA36A5;">y_test</span> = train_test_split(X, y)

<span style="color: #BA36A5;">svm</span> = SVC(probability=<span style="color: #D0372D;">True</span>, random_state=0)
svm.fit(X_train, y_train)

<span style="color: #BA36A5;">fig</span>, <span style="color: #BA36A5;">ax</span> = plt.subplots()
plot_roc_curve(svm, X_train, y_train, label=<span style="color: #008000;">'Training data'</span>, ax=ax)
plot_roc_curve(svm, X_test, y_test, label=<span style="color: #008000;">'Test data'</span>, linestyle=<span style="color: #008000;">'--'</span>, ax=ax)
</pre>
</div>
</div>
</div>
<div id="outline-container-org1bf1cd1" class="outline-4">
<h4 id="org1bf1cd1"><span class="section-number-4">9.6.4</span> SVM with Multiple Classes</h4>
<div class="outline-text-4" id="text-9-6-4">
<div class="org-src-container">
<pre class="src src-python"><span style="color: #0000FF;">import</span> numpy <span style="color: #0000FF;">as</span> np
<span style="color: #0000FF;">import</span> matplotlib.pyplot <span style="color: #0000FF;">as</span> plt
<span style="color: #0000FF;">from</span> sklearn.svm <span style="color: #0000FF;">import</span> SVC
<span style="color: #0000FF;">from</span> sklearn.model_selection <span style="color: #0000FF;">import</span> train_test_split

np.random.seed(0)
<span style="color: #BA36A5;">X</span> = np.random.normal(size=[200, 2])
<span style="color: #BA36A5;">X</span>[:100, :] += 2
<span style="color: #BA36A5;">X</span>[100:150, :] -= 2
<span style="color: #BA36A5;">y</span> = np.concatenate([np.ones(150), np.ones(50) * 2])

<span style="color: #BA36A5;">X</span> = np.vstack([X, np.random.normal(size=[50, 2])])
<span style="color: #BA36A5;">y</span> = np.concatenate([y, np.zeros(50)])
<span style="color: #BA36A5;">y</span> = y.astype(<span style="color: #006FE0;">int</span>)
<span style="color: #BA36A5;">X</span>[y == 0, 1] += 2

plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.get_cmap(<span style="color: #008000;">'viridis'</span>, 3),
            alpha=0.7)

<span style="color: #BA36A5;">X_train</span>, <span style="color: #BA36A5;">X_test</span>, <span style="color: #BA36A5;">y_train</span>, <span style="color: #BA36A5;">y_test</span> = train_test_split(X, y, random_state=0)

<span style="color: #BA36A5;">svm</span> = SVC()
svm.fit(X_train, y_train)

<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'In training data, fraction of observations correctly classified:'</span>,
      <span style="color: #006FE0;">round</span>(svm.score(X_train, y_train), 2))
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'In test data, fraction of observations correctly classified:'</span>,
      <span style="color: #006FE0;">round</span>(svm.score(X_test, y_test), 2))
</pre>
</div>

<pre class="example">
In training data, fraction of observations correctly classified: 0.81
In test data, fraction of observations correctly classified: 0.84

</pre>
</div>
</div>
<div id="outline-container-orgb372ca7" class="outline-4">
<h4 id="orgb372ca7"><span class="section-number-4">9.6.5</span> Application to Gene Expression Data</h4>
<div class="outline-text-4" id="text-9-6-5">
<div class="org-src-container">
<pre class="src src-python"><span style="color: #0000FF;">import</span> numpy <span style="color: #0000FF;">as</span> np
<span style="color: #0000FF;">import</span> pandas <span style="color: #0000FF;">as</span> pd
<span style="color: #0000FF;">from</span> rpy2 <span style="color: #0000FF;">import</span> robjects
<span style="color: #0000FF;">from</span> sklearn.svm <span style="color: #0000FF;">import</span> SVC
<span style="color: #0000FF;">from</span> sklearn.metrics <span style="color: #0000FF;">import</span> confusion_matrix

<span style="color: #BA36A5;">khan</span> = {}
<span style="color: #BA36A5;">khan</span>[<span style="color: #008000;">'xtrain'</span>] = robjects.r(<span style="color: #008000;">'library(ISLR); Khan$xtrain'</span>)
<span style="color: #BA36A5;">khan</span>[<span style="color: #008000;">'xtest'</span>] = robjects.r(<span style="color: #008000;">'Khan$xtest'</span>)
<span style="color: #BA36A5;">khan</span>[<span style="color: #008000;">'ytrain'</span>] = robjects.r(<span style="color: #008000;">'Khan$ytrain'</span>)
<span style="color: #BA36A5;">khan</span>[<span style="color: #008000;">'ytest'</span>] = robjects.r(<span style="color: #008000;">'Khan$ytest'</span>)

<span style="color: #BA36A5;">X_train</span> = np.array(khan[<span style="color: #008000;">'xtrain'</span>])
<span style="color: #BA36A5;">X_test</span> = np.array(khan[<span style="color: #008000;">'xtest'</span>])
<span style="color: #BA36A5;">y_train</span> = np.array(khan[<span style="color: #008000;">'ytrain'</span>]).astype(<span style="color: #006FE0;">int</span>)
<span style="color: #BA36A5;">y_test</span> = np.array(khan[<span style="color: #008000;">'ytest'</span>]).astype(<span style="color: #006FE0;">int</span>)

<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Shape of training data:'</span>, X_train.shape)
<span style="color: #0000FF;">print</span>(y_train.shape)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Shape of test data:'</span>, X_test.shape)
<span style="color: #0000FF;">print</span>(y_test.shape)

<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Count of tissue sample types in training data:'</span>)
<span style="color: #0000FF;">print</span>(pd.value_counts(y_train).sort_index())
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'In test data:'</span>)
<span style="color: #0000FF;">print</span>(pd.value_counts(y_test).sort_index())
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'------'</span>)

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Fit linear kernel model</span>
<span style="color: #BA36A5;">svc</span> = SVC(kernel=<span style="color: #008000;">'linear'</span>)
svc.fit(X_train, y_train)

<span style="color: #BA36A5;">y_predict_train</span> = svc.predict(X_train)
<span style="color: #BA36A5;">train_confusion</span> = pd.DataFrame(confusion_matrix(y_train, y_predict_train))
<span style="color: #BA36A5;">y_predict_test</span> = svc.predict(X_test)
<span style="color: #BA36A5;">test_confusion</span> = pd.DataFrame(confusion_matrix(y_test, y_predict_test))
<span style="color: #0000FF;">for</span> df <span style="color: #0000FF;">in</span> [train_confusion, test_confusion]:
    <span style="color: #BA36A5;">df.columns</span> = [1, 2, 3, 4]
    <span style="color: #BA36A5;">df.index</span> = [1, 2, 3, 4]
    <span style="color: #BA36A5;">df.columns.name</span> = <span style="color: #008000;">'Predicted'</span>
    <span style="color: #BA36A5;">df.index.name</span> = <span style="color: #008000;">'True'</span>

<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Number of support vectors:'</span>, svc.n_support_)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Fraction of correct predictions on training data:'</span>,
      svc.score(X_train, y_train))
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Confusion matrix on training data:'</span>)
<span style="color: #0000FF;">print</span>(train_confusion)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'------'</span>)

<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Fraction of correct predictions on test data:'</span>,
      svc.score(X_test, y_test))
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Confusion matrix on test data:'</span>)
<span style="color: #0000FF;">print</span>(test_confusion)
</pre>
</div>

<pre class="example">
Shape of training data: (63, 2308)
(63,)
Shape of test data: (20, 2308)
(20,)
Count of tissue sample types in training data:
1     8
2    23
3    12
4    20
dtype: int64
In test data:
1    3
2    6
3    6
4    5
dtype: int64
------
Number of support vectors: [ 7 18  9 20]
Fraction of correct predictions on training data: 1.0
Confusion matrix on training data:
Predicted  1   2   3   4
True                    
1          8   0   0   0
2          0  23   0   0
3          0   0  12   0
4          0   0   0  20
------
Fraction of correct predictions on test data: 0.9
Confusion matrix on test data:
Predicted  1  2  3  4
True                 
1          3  0  0  0
2          0  6  0  0
3          0  2  4  0
4          0  0  0  5
</pre>
</div>
</div>
</div>
</div>
<div id="outline-container-org79ac5fb" class="outline-2">
<h2 id="org79ac5fb"><span class="section-number-2">10</span> Unsupervised Learning</h2>
<div class="outline-text-2" id="text-10">
</div>
<div id="outline-container-orgbec34e7" class="outline-3">
<h3 id="orgbec34e7"><span class="section-number-3">10.1</span> The Challenge of Unsupervised Learning</h3>
</div>
<div id="outline-container-org3b3f9b3" class="outline-3">
<h3 id="org3b3f9b3"><span class="section-number-3">10.2</span> Principal Component Analysis</h3>
<div class="outline-text-3" id="text-10-2">
</div>
<div id="outline-container-orgc831dc8" class="outline-4">
<h4 id="orgc831dc8"><span class="section-number-4">10.2.1</span> What are Principal Components?</h4>
<div class="outline-text-4" id="text-10-2-1">
<p>
We first normalize <code>USArrests</code> data so that every variable has mean 0 and
standard deviation 1. Then we perform PCA on the normalized data set.  Table
<a href="#orgc85d704">20</a> shows loadings of the first two principal components.
Figure <a href="#orgee520e3">90</a> plots the first two principal components of these
data.  The figure represents both the principal component scores and the loading
vectors in a single <i>biplot</i> display.
</p>

<table id="orgc85d704" border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">
<caption class="t-bottom"><span class="table-number">Table 20:</span> The principal component loading factors, &phi;<sub>1</sub> and &phi;<sub>2</sub>, for the <code>USArrests</code> data.  These are also displayed in figure <a href="#orgee520e3">90</a>.</caption>

<colgroup>
<col  class="org-left" />

<col  class="org-right" />

<col  class="org-right" />
</colgroup>
<thead>
<tr>
<th scope="col" class="org-left">&#xa0;</th>
<th scope="col" class="org-right">PC1</th>
<th scope="col" class="org-right">PC2</th>
</tr>
</thead>
<tbody>
<tr>
<td class="org-left">Murder</td>
<td class="org-right">0.5358995</td>
<td class="org-right">0.4181809</td>
</tr>

<tr>
<td class="org-left">Assault</td>
<td class="org-right">0.5831836</td>
<td class="org-right">0.1879856</td>
</tr>

<tr>
<td class="org-left">UrbanPop</td>
<td class="org-right">0.2781909</td>
<td class="org-right">-0.8728062</td>
</tr>

<tr>
<td class="org-left">Rape</td>
<td class="org-right">0.5434321</td>
<td class="org-right">-0.1673186</td>
</tr>
</tbody>
</table>


<div id="orgee520e3" class="figure">
<p><img src="figures/fig10_1.png" alt="fig10_1.png" />
</p>
<p><span class="figure-number">Figure 90: </span>The first two principal components for the <code>USArrests</code> data.  The blue state names represent the scores for the first two principal components.  The red arrows indicate the first two principal components loading vectors. This figure is known as a biplot, because it displays both the principal component scores and the principal component loadings.</p>
</div>
</div>
</div>
<div id="outline-container-org1f1971c" class="outline-4">
<h4 id="org1f1971c"><span class="section-number-4">10.2.2</span> Another Interpretation of Principal Components</h4>
</div>
<div id="outline-container-org9d74a40" class="outline-4">
<h4 id="org9d74a40"><span class="section-number-4">10.2.3</span> More on PCA</h4>
<div class="outline-text-4" id="text-10-2-3">
<p>
In figure <a href="#org5eb70d1">91</a>, the left-hand panel is the same as figure
<a href="#orgee520e3">90</a>, where each variable was first scaled to have mean zero and standard
deviation one, then principal component analysis was performed. The right-hand
panel displays the first two principal components on the raw data.  Since
variables were not scaled to have standard deviation one, the first principal
component places almost all of its weight on <code>Assault</code>, while the second
principal component places almost all of its weight on <code>UrbanPop</code>. 
</p>

<div id="org5eb70d1" class="figure">
<p><img src="figures/fig10_3.png" alt="fig10_3.png" />
</p>
<p><span class="figure-number">Figure 91: </span>Two principal component biplots for <code>USArrests</code> data.  Left: Same as figure <a href="#orgee520e3">90</a>, with variables scaled to have unit standard deviations.  Right: Principal components using unscaled data.  <code>Assault</code> has by far the largest loading on the first principal component because it has the largest variance among the four variables.  In general, it is recommended to scale variables to have standard deviation one.</p>
</div>

<p>
In Figure <a href="#org954bd21">92</a>, the left-hand panel shows the proportion of
variance explained (PVE) by each principal component of <code>USArrests</code> data.  The
right-hand panel shows the cumulative PVE.
</p>


<div id="org954bd21" class="figure">
<p><img src="figures/fig10_4.png" alt="fig10_4.png" />
</p>
<p><span class="figure-number">Figure 92: </span>Left: A scree plot depicting the proportion of variance explained by each of the four principal components in the <code>USArrests</code> data.  Right: The cumulative proportion of variance explained by the four principal components.</p>
</div>
</div>
</div>
</div>
<div id="outline-container-org202cb85" class="outline-3">
<h3 id="org202cb85"><span class="section-number-3">10.3</span> Clustering Methods</h3>
<div class="outline-text-3" id="text-10-3">
</div>
<div id="outline-container-org4f92fc9" class="outline-4">
<h4 id="org4f92fc9"><span class="section-number-4">10.3.1</span> K-Means Clustering</h4>
<div class="outline-text-4" id="text-10-3-1">
<p>
Figure <a href="#org2b8d7ee">93</a> shows the results obtained from performing <i>K</i>-means
clustering on a simulated example consisting of 60 observations in two
dimensions.  Three different values of \(K\) are used.
</p>

<div id="org2b8d7ee" class="figure">
<p><img src="figures/fig10_5.png" alt="fig10_5.png" />
</p>
<p><span class="figure-number">Figure 93: </span>A simulated data set with 60 observations in two-dimensional space.  Panels show the results of applying <i>K</i>-means clustering with different values of \(K\), the number of clusters. The color of each observation indicates the cluster to which it was assigned using the <i>K</i>-means clustering algorithm.  Since there is no ordering of clusters, the cluster coloring is arbitrary.  The cluster labels are the outputs of the clustering procedure.</p>
</div>

<p>
Figure <a href="#orgc075aa8">94</a> shows the progression of the <i>K</i>-Means Clustering
Algorithm on the toy example from figure <a href="#org2b8d7ee">93</a>. 
</p>


<div id="orgc075aa8" class="figure">
<p><img src="figures/fig10_6.png" alt="fig10_6.png" />
</p>
<p><span class="figure-number">Figure 94: </span>The progress of <i>K</i>-means algorithm on the example of figure <a href="#org2b8d7ee">93</a> with \(K = 3\).  Top left: The observations are shown.  Top center: In Step 1 of the algorithm, each observation is randomly assigned to a cluster.  Top right: In Step 2(a), the cluster centroids are computed.  These are shown as \(\times\).  Since the clusters were chosen at random, initially centroids are almost overlapping.  Bottom left: In Step 2(b), each observation is assigned to the nearest centroid.  Bottom center: Step 2(a) is once again performed, leading to new cluster centroids.  Bottom right: The results obtained after five iterations.</p>
</div>

<p>
Figure <a href="#orgdecbcdf">95</a> shows the local optima obtained by running <i>K</i>-means
clustering six times using six different initial cluster assignments, using toy data
from figure <a href="#org2b8d7ee">93</a>. 
</p>


<div id="orgdecbcdf" class="figure">
<p><img src="figures/fig10_7.png" alt="fig10_7.png" />
</p>
<p><span class="figure-number">Figure 95: </span><i>K</i>-means clustering performed six times on the data from figure <a href="#org2b8d7ee">93</a> with \(K = 3\), each time with a different random assignment of the observations in Step 1 of the <i>K</i>-means algorithm.  Above each plot is the value of the objective (sum of squares of distances from centroids).  Three different local optima were obtained, one of which resulted in a smaller value of the objective and provides better separation between the clusters.  Those labeled in red all achieved the same best solution.</p>
</div>
</div>
</div>
<div id="outline-container-orgfa686cb" class="outline-4">
<h4 id="orgfa686cb"><span class="section-number-4">10.3.2</span> Hierarchical Clustering</h4>
<div class="outline-text-4" id="text-10-3-2">
<p>
We begin with the simulated data set shown in figure <a href="#org63bf1ed">96</a>,
consisting of 45 observations in two-dimensional space.  The data were generated
from a three-class model; the true class labels for each observation are shown
in distinct colors. However, suppose the data were observed without class
labels, and that we wanted to perform hierarchical clustering of the data.
Hierarchical clustering yields the results shown in the left-hand panel of
figure <a href="#orgdebc675">97</a>.  In the center-panel, cutting the dendrogram at a
height of 15 results in two clusters.  In the right-hand panel, cutting the
dendrogram at a height of 10 results in three clusters.  
</p>

<div id="org63bf1ed" class="figure">
<p><img src="figures/fig10_8.png" alt="fig10_8.png" />
</p>
<p><span class="figure-number">Figure 96: </span>Forty five observations generated in two-dimensional space.  In reality there are three distinct classes, shown in separate colors.  However, we will treat these class labels as unknown.  To discover the classes, we will seek to cluster the observations.</p>
</div>


<div id="orgdebc675" class="figure">
<p><img src="figures/fig10_9.png" alt="fig10_9.png" />
</p>
<p><span class="figure-number">Figure 97: </span>Left: Dendrogram obtained from hierarchically clustering the data from figure <a href="#org63bf1ed">96</a> with complete linkage and Euclidean distance.  Center: Dendrogram from left-hand panel, cut at a height of 15 (indicated by dashed line).  This cut results in two distinct clusters, shown in different colors.  Right: Dendrogram from left-hand panel, now cut at a height of 10.  This cut results in three distinct clusters, shown in different colors.  The colors were not used in clustering, but are simply used for display purposes.</p>
</div>

<p>
Consider the lef-hand panel of figure <a href="#orgc9b958d">98</a>, which shows a simple
dendrogram obtained from hierarchically clustering nine observations.  We can
see that observations 2 and 4 are quite similar to each other, since they fuse
at the lowest point on the dendrogram.  Observations 1 and 7 are also quite
similar to each other.  However, while observations 9 and 8 are next to each
other in the dendrogram, it is incorrect to conclude that these observations are
similar to each other.  In fact, based on information provided in the
dendrogram, observation 9 is no more similar to observation 8 than it is to
observations 2 and 4. 
</p>


<div id="orgc9b958d" class="figure">
<p><img src="figures/fig10_10.png" alt="fig10_10.png" />
</p>
<p><span class="figure-number">Figure 98: </span>An illustration of how to properly interpret a dendrogram with nine observations in two-dimensional space.  Left: A dendrogram generated using Euclidean distance and complete linkage.  Observations 2 and 4 are quite similar to each other, as are observations 1 and 7.  However, observation 9 is no more similar to observation 8 than it is to observations 2 and 4, even though observations 9 and 8 are closer together in terms of horizontal distance. This is because observations 8, 2, and 4 all fuse with observation 9 at the same height, approximately 1.3.  Right: The raw data used to generate the dendrogram can be used to verify that indeed, observation 9 is no more similar to observation 8 than it is to observations 2 and 4.</p>
</div>

<p>
Figure <a href="#orge4cd696">99</a> displays the first few steps in hierarchical
clustering algorithm, for the data from figure <a href="#orgdebc675">97</a>
</p>


<div id="orge4cd696" class="figure">
<p><img src="figures/fig10_11plot.png" alt="fig10_11plot.png" />
</p>
<p><span class="figure-number">Figure 99: </span>An illustration of the first few steps of the hierarchical clustering algorithm, using data from figure <a href="#orgc9b958d">98</a>, with complete linkage and Euclidean distance.  Top Left: Initially there are nine clusters, {1}, {2}, &#x2026;, {9}.  Top Right: The two clusters that are closest together, {2} and {4}, are fused into a single cluster.  Bottom Left: The next two clusters that are closest together, {1} and {7}, are fused into a single cluster. Bottom Right: The two clusters that are closest together using complete linkage, {8} and {2, 4}, are fused into a single cluster.</p>
</div>

<p>
Figure <a href="#orgb3fd911">100</a> illustrates dendrograms resulting from the same data
set when three different linkages are applied.  Average and complete linkages
tend to yield more balanced clusters.
</p>


<div id="orgb3fd911" class="figure">
<p><img src="figures/fig10_12plot.png" alt="fig10_12plot.png" />
</p>
<p><span class="figure-number">Figure 100: </span>Average, complete, and single linkage applied to data set from figure <a href="#org63bf1ed">96</a>.  Average and complete linkages tend to yield more balanced results.</p>
</div>

<p>
Figure <a href="#org70574e4">101</a> illustrates the difference between Euclidean and
correlation-based distance. 
</p>


<div id="org70574e4" class="figure">
<p><img src="figures/fig10_13.png" alt="fig10_13.png" />
</p>
<p><span class="figure-number">Figure 101: </span>Three observations with measurements on 20 variables.  Observations 0 and 1 have similar values for each variable.  Therefore, there is a small distance between them.  But since they are weakly correlated, they have a large correlation-based distance.  On the other hand, observations 0 and 2 have quite different values for each variable, and so there is a large Euclidean distance between them.  But since they are highly correlated, there is a small correlation-based distance between them.</p>
</div>
</div>
</div>
</div>
<div id="outline-container-org9cfd91f" class="outline-3">
<h3 id="org9cfd91f"><span class="section-number-3">10.4</span> Lab 1: Principal Components Analysis</h3>
<div class="outline-text-3" id="text-10-4">
<div class="org-src-container">
<pre class="src src-python"><span style="color: #0000FF;">from</span> statsmodels.datasets <span style="color: #0000FF;">import</span> get_rdataset
<span style="color: #0000FF;">import</span> pandas <span style="color: #0000FF;">as</span> pd
<span style="color: #0000FF;">from</span> sklearn.decomposition <span style="color: #0000FF;">import</span> PCA

<span style="color: #BA36A5;">arrests</span> = get_rdataset(<span style="color: #008000;">'USArrests'</span>).data

<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Variables in USArrests data:'</span>)
<span style="color: #0000FF;">print</span>(arrests.columns)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Means of variables:'</span>)
<span style="color: #0000FF;">print</span>(arrests.mean().<span style="color: #006FE0;">round</span>(2))
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Variances of variables:'</span>)
<span style="color: #0000FF;">print</span>(arrests.var().<span style="color: #006FE0;">round</span>(2))
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'------'</span>)

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Normalize data by subtracting mean and dividing by standard deviation</span>
<span style="color: #BA36A5;">arrests_normalized</span> = (arrests - arrests.mean()) / arrests.std()

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Calculate principal components</span>
<span style="color: #BA36A5;">pca</span> = PCA()
pca.fit(arrests_normalized)

<span style="color: #BA36A5;">arrests_pc</span> = pd.DataFrame(pca.components_.T, index=arrests_normalized.columns)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Principal components of USArrests:'</span>)
<span style="color: #0000FF;">print</span>(arrests_pc)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'------'</span>)

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">To plot Figure 10.1, see program code/chap10/arrest_pca.py</span>
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Variance explained by principal components:'</span>)
<span style="color: #0000FF;">print</span>(pca.explained_variance_.<span style="color: #006FE0;">round</span>(3))
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Proportion of variance explained:'</span>)
<span style="color: #0000FF;">print</span>(pca.explained_variance_ratio_.<span style="color: #006FE0;">round</span>(4))
</pre>
</div>

<pre class="example">
Variables in USArrests data:
Index(['Murder', 'Assault', 'UrbanPop', 'Rape'], dtype='object')
Means of variables:
Murder        7.79
Assault     170.76
UrbanPop     65.54
Rape         21.23
dtype: float64
Variances of variables:
Murder        18.97
Assault     6945.17
UrbanPop     209.52
Rape          87.73
dtype: float64
------
Principal components of USArrests:
                 0         1         2         3
Murder    0.535899  0.418181 -0.341233  0.649228
Assault   0.583184  0.187986 -0.268148 -0.743407
UrbanPop  0.278191 -0.872806 -0.378016  0.133878
Rape      0.543432 -0.167319  0.817778  0.089024
------
Variance explained by principal components:
[2.48  0.99  0.357 0.173]
Proportion of variance explained:
[0.6201 0.2474 0.0891 0.0434]
</pre>
</div>
</div>
<div id="outline-container-org2e7b44a" class="outline-3">
<h3 id="org2e7b44a"><span class="section-number-3">10.5</span> Lab 2: Clustering</h3>
<div class="outline-text-3" id="text-10-5">
</div>
<div id="outline-container-org5f24036" class="outline-4">
<h4 id="org5f24036"><span class="section-number-4">10.5.1</span> K-Means Clustering</h4>
<div class="outline-text-4" id="text-10-5-1">
<div class="org-src-container">
<pre class="src src-python"><span style="color: #0000FF;">from</span> rpy2 <span style="color: #0000FF;">import</span> robjects
<span style="color: #0000FF;">import</span> numpy <span style="color: #0000FF;">as</span> np
<span style="color: #0000FF;">from</span> sklearn.cluster <span style="color: #0000FF;">import</span> KMeans

<span style="color: #BA36A5;">X</span> = robjects.r(<span style="color: #008000;">'set.seed(2); x &lt;- rnorm(50 * 2)'</span>)
<span style="color: #BA36A5;">X</span> = np.array(X).reshape((50, 2), order=<span style="color: #008000;">'F'</span>)

<span style="color: #BA36A5;">X</span>[:25, 0] += 3
<span style="color: #BA36A5;">X</span>[:25, 1] -= 4

<span style="color: #BA36A5;">km</span> = KMeans(n_clusters=2, random_state=0)
km.fit(X)

<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Predicted cluster assignments:'</span>)
<span style="color: #0000FF;">print</span>(km.predict(X))
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'------'</span>)
<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">To plot clusters, see code/chap10/km_cluster.py</span>

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">K-means clusters with 3 clusters</span>
<span style="color: #BA36A5;">km3</span> = KMeans(n_clusters=3, random_state=2)
km3.fit(X)

<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Cluster centers:'</span>)
<span style="color: #0000FF;">print</span>(km3.cluster_centers_)

<span style="color: #BA36A5;">y_predict</span> = km3.predict(X)
<span style="color: #BA36A5;">within_clust_dist_sq</span> = [((X[y_predict == i] - km3.cluster_centers_[i]) ** 2).<span style="color: #006FE0;">sum</span>()
                        <span style="color: #0000FF;">for</span> i <span style="color: #0000FF;">in</span> <span style="color: #006FE0;">range</span>(3)]
<span style="color: #BA36A5;">total_dist_sq</span> = ((X - X.mean(axis = 0)) ** 2).<span style="color: #006FE0;">sum</span>()
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'between_SS / total_SS ='</span>,
      <span style="color: #006FE0;">round</span>(1 - <span style="color: #006FE0;">sum</span>(within_clust_dist_sq) / total_dist_sq, 3) * 100, <span style="color: #008000;">'%'</span>)

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">To select the number of times k-means algorithm will be run, assign a value to n_init</span>
</pre>
</div>

<pre class="example">
Predicted cluster assignments:
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0]
------
Cluster centers:
[[ 3.77895672 -4.56200798]
 [-0.38203973 -0.08740753]
 [ 2.30015453 -2.69622023]]
between_SS / total_SS = 79.3 %

</pre>
</div>
</div>
<div id="outline-container-org9cb02e0" class="outline-4">
<h4 id="org9cb02e0"><span class="section-number-4">10.5.2</span> Hierarchical Clustering</h4>
<div class="outline-text-4" id="text-10-5-2">
<div class="org-src-container">
<pre class="src src-python"><span style="color: #0000FF;">from</span> rpy2 <span style="color: #0000FF;">import</span> robjects
<span style="color: #0000FF;">import</span> numpy <span style="color: #0000FF;">as</span> np
<span style="color: #0000FF;">from</span> scipy.cluster.hierarchy <span style="color: #0000FF;">import</span> linkage, dendrogram, fcluster
<span style="color: #0000FF;">import</span> matplotlib.pyplot <span style="color: #0000FF;">as</span> plt

<span style="color: #BA36A5;">X</span> = robjects.r(<span style="color: #008000;">'set.seed(2); x &lt;- rnorm(50 * 2)'</span>)
<span style="color: #BA36A5;">X</span> = np.array(X).reshape((50, 2), order=<span style="color: #008000;">'F'</span>)

<span style="color: #BA36A5;">X</span>[:25, 0] += 3
<span style="color: #BA36A5;">X</span>[:25, 1] -= 4

<span style="color: #BA36A5;">hc_complete</span> = linkage(X, method=<span style="color: #008000;">'complete'</span>)
<span style="color: #BA36A5;">hc_average</span> = linkage(X, method=<span style="color: #008000;">'average'</span>)
<span style="color: #BA36A5;">hc_single</span> = linkage(X, method=<span style="color: #008000;">'single'</span>)
<span style="color: #BA36A5;">link_names</span> = [<span style="color: #008000;">'Complete'</span>, <span style="color: #008000;">'Average'</span>, <span style="color: #008000;">'Single'</span>]

<span style="color: #BA36A5;">fig</span> = plt.figure(figsize=(8, 3))
<span style="color: #0000FF;">for</span> i, hc <span style="color: #0000FF;">in</span> <span style="color: #006FE0;">enumerate</span>([hc_complete, hc_average, hc_single]):
    <span style="color: #BA36A5;">axi</span> = fig.add_subplot(1, 3, i + 1)
    <span style="color: #BA36A5;">dn</span> = dendrogram(hc, ax=axi)
    axi.<span style="color: #006FE0;">set</span>(title=link_names[i] + <span style="color: #008000;">' Linkage'</span>)

<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Cluster labels using complete linkage:'</span>)
<span style="color: #0000FF;">print</span>(fcluster(hc_complete, t=2, criterion=<span style="color: #008000;">'maxclust'</span>))
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Cluster labels using average linkage:'</span>)
<span style="color: #0000FF;">print</span>(fcluster(hc_average, t=2, criterion=<span style="color: #008000;">'maxclust'</span>))
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Cluster labels using single linkage:'</span>)
<span style="color: #0000FF;">print</span>(fcluster(hc_single, t=2, criterion=<span style="color: #008000;">'maxclust'</span>))
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'------'</span>)

<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'When four clusters are selected in single linkage:'</span>)
<span style="color: #0000FF;">print</span>(fcluster(hc_single, t=4, criterion=<span style="color: #008000;">'maxclust'</span>))

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Scale data before applying hierarchical clustering</span>
<span style="color: #BA36A5;">X_scale</span> = (X - X.mean(axis=0)) / X.std(axis=0)
<span style="color: #BA36A5;">hc</span> = linkage(X_scale, method=<span style="color: #008000;">'complete'</span>)
<span style="color: #BA36A5;">dn</span> = dendrogram(hc)

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Use correlation-based distance</span>
<span style="color: #BA36A5;">X_new</span> = np.random.normal(size=(30, 3))
<span style="color: #BA36A5;">hc_corr</span> = linkage(X_new, method=<span style="color: #008000;">'complete'</span>, metric=<span style="color: #008000;">'correlation'</span>)
<span style="color: #BA36A5;">dn</span> = dendrogram(hc_corr)
</pre>
</div>

<pre class="example">
Cluster labels using complete linkage:
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2]
Cluster labels using average linkage:
[2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 2 1 1 1 1
 1 1 1 1 1 1 2 1 2 1 1 1 1]
Cluster labels using single linkage:
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1]
------
When four clusters are selected in single linkage:
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 3 2 2 2 2 2 2 2 2]
</pre>
</div>
</div>
</div>
<div id="outline-container-orgd639e92" class="outline-3">
<h3 id="orgd639e92"><span class="section-number-3">10.6</span> Lab 3: NCI60 Data Example</h3>
<div class="outline-text-3" id="text-10-6">
<div class="org-src-container">
<pre class="src src-python"><span style="color: #0000FF;">import</span> numpy <span style="color: #0000FF;">as</span> np
<span style="color: #0000FF;">import</span> pandas <span style="color: #0000FF;">as</span> pd
<span style="color: #0000FF;">import</span> matplotlib.pyplot <span style="color: #0000FF;">as</span> plt
<span style="color: #0000FF;">from</span> statsmodels.datasets <span style="color: #0000FF;">import</span> get_rdataset

<span style="color: #BA36A5;">nci</span> = get_rdataset(<span style="color: #008000;">'NCI60'</span>, package=<span style="color: #008000;">'ISLR'</span>)
<span style="color: #BA36A5;">nci_data</span> = nci.data

<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Data dimensions:'</span>, nci_data.shape)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Counts of cancer types:'</span>)
<span style="color: #0000FF;">print</span>(nci_data[<span style="color: #008000;">'labs'</span>].value_counts())
</pre>
</div>

<pre class="example">
Data dimensions: (64, 6831)
Counts of cancer types:
RENAL          9
NSCLC          9
MELANOMA       8
COLON          7
BREAST         7
LEUKEMIA       6
OVARIAN        6
CNS            5
PROSTATE       2
K562B-repro    1
K562A-repro    1
MCF7A-repro    1
UNKNOWN        1
MCF7D-repro    1
Name: labs, dtype: int64
</pre>
</div>
<div id="outline-container-org2247c90" class="outline-4">
<h4 id="org2247c90"><span class="section-number-4">10.6.1</span> PCA on NCI60 Data</h4>
<div class="outline-text-4" id="text-10-6-1">
<div class="org-src-container">
<pre class="src src-python" id="orgc545fb2"><span style="color: #0000FF;">import</span> numpy <span style="color: #0000FF;">as</span> np
<span style="color: #0000FF;">import</span> pandas <span style="color: #0000FF;">as</span> pd
<span style="color: #0000FF;">import</span> matplotlib.pyplot <span style="color: #0000FF;">as</span> plt
<span style="color: #0000FF;">from</span> statsmodels.datasets <span style="color: #0000FF;">import</span> get_rdataset

<span style="color: #BA36A5;">nci</span> = get_rdataset(<span style="color: #008000;">'NCI60'</span>, package=<span style="color: #008000;">'ISLR'</span>)
<span style="color: #BA36A5;">nci_data</span> = nci.data

<span style="color: #0000FF;">from</span> sklearn.decomposition <span style="color: #0000FF;">import</span> PCA

<span style="color: #BA36A5;">X</span> = nci_data.iloc[:, :-1]       <span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">do not include labs</span>

<span style="color: #BA36A5;">pca</span> = PCA(n_components=3)
pca.fit(X)

<span style="color: #BA36A5;">labs</span> = nci_data[<span style="color: #008000;">'labs'</span>]
<span style="color: #BA36A5;">lab_names</span> = nci_data[<span style="color: #008000;">'labs'</span>].unique()
lab_names.sort()
<span style="color: #BA36A5;">lab_colors</span> = labs.<span style="color: #006FE0;">apply</span>(<span style="color: #0000FF;">lambda</span> x: np.where(lab_names == x)[0][0])

<span style="color: #BA36A5;">Z</span> = pca.transform(X)

<span style="color: #BA36A5;">fig</span> = plt.figure(figsize=(8, 4))
<span style="color: #BA36A5;">ax1</span> = fig.add_subplot(1, 2, 1)
ax1.scatter(Z[:, 0], Z[:, 1], c=lab_colors,
            cmap=plt.cm.get_cmap(<span style="color: #008000;">'rainbow'</span>, <span style="color: #006FE0;">len</span>(lab_names)), alpha=0.7)
ax1.<span style="color: #006FE0;">set</span>(xlabel=r<span style="color: #008000;">'$Z_1$'</span>, ylabel=r<span style="color: #008000;">'$Z_2$'</span>)

<span style="color: #BA36A5;">ax2</span> = fig.add_subplot(1, 2, 2)
ax2.scatter(Z[:, 0], Z[:, 2], c=lab_colors,
            cmap=plt.cm.get_cmap(<span style="color: #008000;">'rainbow'</span>, <span style="color: #006FE0;">len</span>(lab_names)), alpha=0.7)
ax2.<span style="color: #006FE0;">set</span>(xlabel=r<span style="color: #008000;">'$Z_1$'</span>, ylabel=r<span style="color: #008000;">'$Z_3$'</span>)

fig.tight_layout()
fig.savefig(fname)
<span style="color: #0000FF;">return</span> fname
</pre>
</div>


<div id="org5275a93" class="figure">
<p><img src="figures/fig10_15.png" alt="fig10_15.png" />
</p>
<p><span class="figure-number">Figure 102: </span>Projections of the <code>NCI60</code> cancer cell lines onto the first three principal components.  On the whole, in this low-dimensional space, observations belonging to a single canceer type tend to be near each other.  It would not have been possible to visualize the data without using a dimension reduction method such as PCA.  Based on the full data set, there are \({6,830 \choose 2}\) possible scatterplots, none of which would have been particularly informative.</p>
</div>

<div class="org-src-container">
<pre class="src src-python"><span style="color: #0000FF;">import</span> numpy <span style="color: #0000FF;">as</span> np
<span style="color: #0000FF;">import</span> pandas <span style="color: #0000FF;">as</span> pd
<span style="color: #0000FF;">import</span> matplotlib.pyplot <span style="color: #0000FF;">as</span> plt
<span style="color: #0000FF;">from</span> statsmodels.datasets <span style="color: #0000FF;">import</span> get_rdataset

<span style="color: #BA36A5;">nci</span> = get_rdataset(<span style="color: #008000;">'NCI60'</span>, package=<span style="color: #008000;">'ISLR'</span>)
<span style="color: #BA36A5;">nci_data</span> = nci.data

<span style="color: #0000FF;">from</span> sklearn.decomposition <span style="color: #0000FF;">import</span> PCA
<span style="color: #0000FF;">import</span> pandas <span style="color: #0000FF;">as</span> pd

<span style="color: #BA36A5;">X</span> = nci_data.iloc[:, :-1]       <span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">do not include labs</span>
<span style="color: #BA36A5;">X_scaled</span> = (X - X.mean()) / X.std()

<span style="color: #BA36A5;">pca</span> = PCA(n_components=5)
pca.fit(X_scaled)

<span style="color: #BA36A5;">pca_res</span> = np.vstack([np.sqrt(pca.explained_variance_),
                     pca.explained_variance_ratio_,
                     np.cumsum(pca.explained_variance_ratio_)])

<span style="color: #BA36A5;">pca_res_df</span> = pd.DataFrame(
    pca_res, columns=[<span style="color: #008000;">'PC1'</span>, <span style="color: #008000;">'PC2'</span>, <span style="color: #008000;">'PC3'</span>, <span style="color: #008000;">'PC4'</span>, <span style="color: #008000;">'PC5'</span>],
    index=[<span style="color: #008000;">'Standard deviation'</span>, <span style="color: #008000;">'Proportion of variance'</span>,
           <span style="color: #008000;">'Cumulative proportion'</span>])

<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Proportion of variance explained (PVE) of first five principal components:'</span>)
<span style="color: #0000FF;">print</span>(pca_res_df.<span style="color: #006FE0;">round</span>(4))
</pre>
</div>

<pre class="example">
Proportion of variance explained (PVE) of first five principal components:
                            PC1      PC2      PC3      PC4      PC5
Standard deviation      27.8535  21.4814  19.8205  17.0326  15.9718
Proportion of variance   0.1136   0.0676   0.0575   0.0425   0.0373
Cumulative proportion    0.1136   0.1812   0.2387   0.2811   0.3185

</pre>

<div class="org-src-container">
<pre class="src src-python" id="orgbf746c8"><span style="color: #0000FF;">import</span> numpy <span style="color: #0000FF;">as</span> np
<span style="color: #0000FF;">import</span> pandas <span style="color: #0000FF;">as</span> pd
<span style="color: #0000FF;">import</span> matplotlib.pyplot <span style="color: #0000FF;">as</span> plt
<span style="color: #0000FF;">from</span> statsmodels.datasets <span style="color: #0000FF;">import</span> get_rdataset

<span style="color: #BA36A5;">nci</span> = get_rdataset(<span style="color: #008000;">'NCI60'</span>, package=<span style="color: #008000;">'ISLR'</span>)
<span style="color: #BA36A5;">nci_data</span> = nci.data

<span style="color: #0000FF;">from</span> sklearn.decomposition <span style="color: #0000FF;">import</span> PCA

<span style="color: #BA36A5;">X</span> = nci_data.iloc[:, :-1]       <span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">do not include labs</span>
<span style="color: #BA36A5;">X_normalized</span> = (X - X.mean()) / X.std()

<span style="color: #BA36A5;">pca</span> = PCA()
pca.fit(X_normalized)

<span style="color: #BA36A5;">fig</span> = plt.figure(figsize=(8, 4))
<span style="color: #BA36A5;">ax1</span> = fig.add_subplot(1, 2, 1)
ax1.plot(pca.explained_variance_ratio_)
ax1.<span style="color: #006FE0;">set</span>(xlabel=<span style="color: #008000;">'Principal Component'</span>, ylabel=<span style="color: #008000;">'PVE'</span>)

<span style="color: #BA36A5;">ax2</span> = fig.add_subplot(1, 2, 2)
ax2.plot(np.cumsum(pca.explained_variance_ratio_))
ax2.<span style="color: #006FE0;">set</span>(xlabel=<span style="color: #008000;">'Principal Component'</span>, ylabel=<span style="color: #008000;">'Cumulative PVE'</span>)

fig.tight_layout()
fig.savefig(fname)
<span style="color: #0000FF;">return</span> fname
</pre>
</div>


<div id="orgbe40842" class="figure">
<p><img src="figures/fig10_16.png" alt="fig10_16.png" />
</p>
<p><span class="figure-number">Figure 103: </span>The PVE of principal components of the <code>NCI60</code> cancer cell line microarray data set.  Left: PVE of each principal component is shown.  Right: Cumulative PVE of the principal components is shown.  Together, all principal components explain all of the variance.</p>
</div>
</div>
</div>
<div id="outline-container-orgaadd3da" class="outline-4">
<h4 id="orgaadd3da"><span class="section-number-4">10.6.2</span> Clustering the Observations of the NCI60 Data</h4>
<div class="outline-text-4" id="text-10-6-2">
<div class="org-src-container">
<pre class="src src-python" id="orgf18c096"><span style="color: #0000FF;">import</span> numpy <span style="color: #0000FF;">as</span> np
<span style="color: #0000FF;">import</span> pandas <span style="color: #0000FF;">as</span> pd
<span style="color: #0000FF;">import</span> matplotlib.pyplot <span style="color: #0000FF;">as</span> plt
<span style="color: #0000FF;">from</span> statsmodels.datasets <span style="color: #0000FF;">import</span> get_rdataset

<span style="color: #BA36A5;">nci</span> = get_rdataset(<span style="color: #008000;">'NCI60'</span>, package=<span style="color: #008000;">'ISLR'</span>)
<span style="color: #BA36A5;">nci_data</span> = nci.data

<span style="color: #0000FF;">from</span> scipy.cluster.hierarchy <span style="color: #0000FF;">import</span> linkage, dendrogram

<span style="color: #BA36A5;">X</span> = nci_data.iloc[:, :-1]       <span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">do not include labs</span>
<span style="color: #BA36A5;">X_scaled</span> = (X - X.mean()) / X.std()

<span style="color: #BA36A5;">linkages</span> = [<span style="color: #008000;">'complete'</span>, <span style="color: #008000;">'average'</span>, <span style="color: #008000;">'single'</span>]

<span style="color: #BA36A5;">fig</span> = plt.figure(figsize=(6, 8))
<span style="color: #0000FF;">for</span> i, link <span style="color: #0000FF;">in</span> <span style="color: #006FE0;">enumerate</span>(linkages):
    <span style="color: #BA36A5;">axi</span> = fig.add_subplot(3, 1, i + 1)
    <span style="color: #BA36A5;">Z</span> = linkage(X_scaled, method=link)
    <span style="color: #BA36A5;">dn</span> = dendrogram(Z, labels=nci_data[<span style="color: #008000;">'labs'</span>], leaf_rotation=90, ax=axi)
    axi.<span style="color: #006FE0;">set</span>(title=link.capitalize() + <span style="color: #008000;">' Linkage'</span>)

fig.tight_layout()
fig.savefig(fname)
<span style="color: #0000FF;">return</span> fname
</pre>
</div>


<div id="orgc27d040" class="figure">
<p><img src="figures/fig10_17.png" alt="fig10_17.png" />
</p>
<p><span class="figure-number">Figure 104: </span>The <code>NCI60</code> cancer cell line microarray data, clustered with average, complete, and single linkage, all using Euclidean distance as the dissimilarity measure.  Complete and average linkages tend to yield evenly sized clusters.  Single linkage tends to yield extended clusters to which single leaves are fused one by one.</p>
</div>

<div class="org-src-container">
<pre class="src src-python"><span style="color: #0000FF;">import</span> numpy <span style="color: #0000FF;">as</span> np
<span style="color: #0000FF;">import</span> pandas <span style="color: #0000FF;">as</span> pd
<span style="color: #0000FF;">import</span> matplotlib.pyplot <span style="color: #0000FF;">as</span> plt
<span style="color: #0000FF;">from</span> statsmodels.datasets <span style="color: #0000FF;">import</span> get_rdataset

<span style="color: #BA36A5;">nci</span> = get_rdataset(<span style="color: #008000;">'NCI60'</span>, package=<span style="color: #008000;">'ISLR'</span>)
<span style="color: #BA36A5;">nci_data</span> = nci.data

<span style="color: #0000FF;">from</span> scipy.cluster.hierarchy <span style="color: #0000FF;">import</span> linkage, dendrogram, fcluster
<span style="color: #0000FF;">from</span> sklearn.cluster <span style="color: #0000FF;">import</span> KMeans
<span style="color: #0000FF;">from</span> sklearn.decomposition <span style="color: #0000FF;">import</span> PCA

<span style="color: #BA36A5;">X</span> = nci_data.iloc[:, :-1]
<span style="color: #BA36A5;">X_scaled</span> = (X - X.mean()) / X.std()

<span style="color: #BA36A5;">Z</span> = linkage(X_scaled, method=<span style="color: #008000;">'complete'</span>)
<span style="color: #BA36A5;">Z_labels</span> = fcluster(Z, t=4, criterion=<span style="color: #008000;">'maxclust'</span>)

<span style="color: #BA36A5;">res_tab</span> = pd.crosstab(nci_data[<span style="color: #008000;">'labs'</span>], Z_labels)
<span style="color: #BA36A5;">res_tab.columns.name</span> = <span style="color: #008000;">'predict_label'</span>
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Cross table when dendrogram is used to predict four labels:'</span>)
<span style="color: #0000FF;">print</span>(res_tab)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'------'</span>)

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Plot the cut on the dendrogram that produces these four clusters</span>
<span style="color: #BA36A5;">fig</span>, <span style="color: #BA36A5;">ax</span> = plt.subplots()
<span style="color: #BA36A5;">dn</span> = dendrogram(Z, labels=nci_data[<span style="color: #008000;">'labs'</span>], ax=ax)
ax.axhline(y=139, color=<span style="color: #008000;">'red'</span>, linestyle=<span style="color: #008000;">'--'</span>, alpha=0.7)
fig.tight_layout()

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Fit data to K-means clustering, compare results with hierarchical clustering</span>
<span style="color: #BA36A5;">kmm</span> = KMeans(n_clusters=4, random_state=0)
kmm.fit(X_scaled)

<span style="color: #BA36A5;">kmm_vs_hier</span> = pd.crosstab(kmm.predict(X_scaled), Z_labels)
<span style="color: #BA36A5;">kmm_vs_hier.index.name</span> = <span style="color: #008000;">'k-means clusters'</span>
<span style="color: #BA36A5;">kmm_vs_hier.columns.name</span> = <span style="color: #008000;">'hierarchical clusters'</span>
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Cross table of k-means clusters and hierarchical clusters:'</span>)
<span style="color: #0000FF;">print</span>(kmm_vs_hier)
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'------'</span>)

<span style="color: #8D8D84;"># </span><span style="color: #8D8D84; font-style: italic;">Hierarchical clustering on first five principal components</span>
<span style="color: #BA36A5;">pca</span> = PCA(n_components=5)

<span style="color: #BA36A5;">Z_pca</span> = linkage(pca.fit_transform(X_scaled), method=<span style="color: #008000;">'complete'</span>)
<span style="color: #BA36A5;">dn_pca</span> = dendrogram(Z_pca, labels=nci_data[<span style="color: #008000;">'labs'</span>])
plt.tight_layout()

<span style="color: #BA36A5;">res_pca_hier_tab</span> = pd.crosstab(nci_data[<span style="color: #008000;">'labs'</span>],
                               fcluster(Z_pca, t=4, criterion=<span style="color: #008000;">'maxclust'</span>))
<span style="color: #BA36A5;">res_pca_hier_tab.columns.name</span> = <span style="color: #008000;">'predict_label'</span>
<span style="color: #0000FF;">print</span>(<span style="color: #008000;">'Cross table when hierarchical clustering is performed on first five principal components:'</span>)
<span style="color: #0000FF;">print</span>(res_pca_hier_tab)
</pre>
</div>

<pre class="example">
Cross table when dendrogram is used to predict four labels:
predict_label  1  2  3  4
labs                     
BREAST         3  2  2  0
CNS            2  3  0  0
COLON          0  2  5  0
K562A-repro    0  0  0  1
K562B-repro    0  0  0  1
LEUKEMIA       0  0  0  6
MCF7A-repro    0  0  1  0
MCF7D-repro    0  0  1  0
MELANOMA       0  8  0  0
NSCLC          1  8  0  0
OVARIAN        0  6  0  0
PROSTATE       0  2  0  0
RENAL          1  8  0  0
UNKNOWN        0  1  0  0
------
Cross table of k-means clusters and hierarchical clusters:
hierarchical clusters  1   2  3  4
k-means clusters                  
0                      0   2  9  0
1                      0   0  0  8
2                      0   9  0  0
3                      7  29  0  0
------
Cross table when hierarchical clustering is performed on first five principal components:
predict_label  1  2  3  4
labs                     
BREAST         0  5  2  0
CNS            2  3  0  0
COLON          7  0  0  0
K562A-repro    0  0  0  1
K562B-repro    0  0  0  1
LEUKEMIA       2  0  0  4
MCF7A-repro    0  0  1  0
MCF7D-repro    0  0  1  0
MELANOMA       1  7  0  0
NSCLC          8  1  0  0
OVARIAN        5  1  0  0
PROSTATE       2  0  0  0
RENAL          7  2  0  0
UNKNOWN        0  1  0  0
</pre>
</div>
</div>
</div>
</div>
<div id="footnotes">
<h2 class="footnotes">Footnotes: </h2>
<div id="text-footnotes">

<div class="footdef"><sup><a id="fn.1" class="footnum" href="#fnr.1">1</a></sup> <div class="footpara"><p class="footpara">
The middle panel is from a different data set.
</p></div></div>


</div>
</div></div>
<div id="postamble" class="status">
<p class="author">Author: Naresh Gurbuxani</p>
<p class="date">Created: 2020-08-19 Wed 08:09</p>
<p class="validation"><a href="http://validator.w3.org/check?uri=referer">Validate</a></p>
</div>
</body>
</html>