##[fit] Classification ##[fit] Metrics and Risk
#[fit] Ai 1
- Multiple Regression and Dimensionality
- Classification
- Logistic Regression
- Feature Selection
- Metrics
- Predictive distributions (back to regression)
- Bayes Risk (for regression)
- Bayes Risk for Classification
- Asymmetric case
- ROC curves and cost curves
#DISCRIMINATIVE CLASSIFIER
- Remember the curse of dimensionality
- it can show up as a large number of "power features", such as $$x^{356}$$, or as many "actual features": 224x224 pixels in ImageNet, for example
- it can be tamed with regularization
- or with explicit feature selection
- another way: would it not be better to have a classifier that captures the generative process?
- We do generative models later in Ai2
- Can use Regularization for feature selection
- Lasso, aka L1 regularization, i.e. $$\sum_i | w_i | < C$$, sets some coefficients to exactly 0 and can thus be used for feature selection
- We can also use sklearn's SelectKBest
- We can also do forward-backward feature selection.
Which set (training, validation, or test) should such selection be done on?
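As a minimal sketch of Lasso-based feature selection (synthetic data; the `alpha` value is made up, and scikit-learn is assumed to be available):

```python
# Sketch: L1 regularization (Lasso) driving some coefficients to exactly 0,
# which selects features. The data and alpha below are made up.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only features 0 and 3 actually matter; the other 8 are pure noise.
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(lasso.coef_ != 0)
print(selected)  # the informative features 0 and 3 survive
```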
##[fit] 1. EVALUATING CLASSIFIERS
- accuracy is a number from 0 to 1. It’s a general measure of how often the prediction is correct.
- visually looking at the confusion matrix is another important way to evaluate
- Precision (also called positive predictive value, or PPV) tells us the percentage of our samples that were properly labeled “positive,” relative to all the samples we labeled as “positive.” Numerically, it’s the value of TP relative to TP+FP. In other words, precision tells us how many of the “positive” predictions were really positive.
- Recall (also called sensitivity, hit rate, or true positive rate) tells us the percentage of the positive samples that we correctly labeled. Numerically, it's the value of TP relative to TP+FN.
- F1 score is the harmonic mean of precision and recall. Generally speaking, the f1 score will be low when either precision or recall is low, and will approach 1 when both measures also approach 1.
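The four metrics above can be computed directly from confusion-matrix counts; the counts here are made up:

```python
# Sketch: accuracy, precision, recall, and F1 from made-up confusion counts.
TP, FP, FN, TN = 80, 10, 20, 90

accuracy  = (TP + TN) / (TP + FP + FN + TN)   # how often we are right overall
precision = TP / (TP + FP)                    # of predicted positives, how many are real
recall    = TP / (TP + FN)                    # of real positives, how many we caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(accuracy, precision, recall, f1)
```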
Let's understand the various metrics on this example.
It's important in prediction to make baselines and see if you can beat them.
The simplest baselines are all positives and all negatives.
For a balanced dataset, these models only have 50% accuracy.
It should be easy to beat these with the simplest logistic regression, which can then serve as a baseline
For example, consider a fraud situation in which there is only 1% fraud.
Then a classifier predicting that there is no fraud has 99% accuracy.
Then the question arises: is accuracy the best metric?
We will come back to this towards the end.
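The 1% fraud example in numbers (counts are made up): the all-negative baseline is highly accurate but catches nothing.

```python
# Sketch: why accuracy misleads on imbalanced data. A classifier that always
# predicts "no fraud" has high accuracy but zero recall.
n_transactions = 10_000
n_fraud = 100                        # 1% fraud

accuracy = (n_transactions - n_fraud) / n_transactions  # 0.99
recall = 0 / n_fraud                 # the baseline catches no fraud at all
print(accuracy, recall)  # 0.99 0.0
```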
##[fit] 2. Back to Regression: ##[fit] Prediction and the mean
- In machine learning we do not care too much about the functional form of our prediction $$\hat{y} = \hat{f}(x)$$, as long as we predict "well"
- Remember however our origin story for the data: the measured $$y$$ is assumed to have been a draw from a gaussian distribution at each $$x$$: this means that our prediction at an as yet unmeasured $$x$$ should also be a draw from such a gaussian
- Still, we use the mean value of the gaussian as the "prediction", but note that we can have many "predicted" data sets, all consistent with the original data we have
- This means that there is an additional "smear" to that of the regression line...
- the band on the previous graph is the sampling distribution of the regression line, or a representation of the sampling distribution of $$\mathbf{w}$$
- $$p(y \vert \mathbf{x}, \mu_{MLE}, \sigma^2_{MLE})$$ is a probability distribution
- thought of as $$p(y^{*} \vert \mathbf{x}^{*}, \{ \mathbf{x}_i, y_i \}, \mu_{MLE}, \sigma^2_{MLE})$$, it is a predictive distribution for as yet unseen data $$y^{*}$$ at $$\mathbf{x}^{*}$$, or the sampling distribution for data, or the data-generating distribution, at the new covariates $$\mathbf{x}^{*}$$. This is a wider band.
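A small simulation (synthetic data; the true relation and noise scale are made up) shows why the predictive band is wider than the band of the regression line:

```python
# Sketch: sampling distribution of the regression line vs. the predictive
# distribution for new data. Synthetic data: y = 2x + gaussian noise.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 50)
sigma = 1.0

line_preds, data_preds = [], []
for _ in range(2000):
    y = 2 * x + rng.normal(scale=sigma, size=x.size)  # a fresh training set
    w = np.polyfit(x, y, 1)                           # MLE straight-line fit
    mu = np.polyval(w, 0.5)                           # mean prediction at x* = 0.5
    line_preds.append(mu)                             # spread of the line itself
    data_preds.append(mu + rng.normal(scale=sigma))   # a draw of y* at x* = 0.5

# The predictive "smear" adds the noise variance on top of the line's spread.
print(np.std(line_preds), np.std(data_preds))
```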
When we estimate a model using maximum likelihood converted to a risk (how? via the negative log-likelihood, NLL), we are calling this risk an estimation risk.
Scoring is a different enterprise, where we want to compare different models using their score, or decision risk.
The latter leads to the idea of the Bayes Model, the best you can do.
It's the minimum risk ANY model can achieve.
We want to get as close to it as possible.
We could take the infimum amongst all possible functions. OVERFITTING!
Instead restrict to a particular Hypothesis Set:
##[fit] 3. Bayes Risk (Regression)
where $$r(x) = E_{Y|X}[y]$$ is the conditional mean, and
$$R_{out}(h) = E_{X}[(h-r)^2] + R^{*}; \quad R^{*} = E_{X} E_{Y|X}[(r-y)^2]$$
For noise with 0 mean and finite variance, then,
Note that:
- We are never given a population, rather we get a training set.
- Now, varying training sets make $$R_{out}(h)$$ a stochastic quantity, varying from one training set to another, since a different model is fit on each set!
- Goal of Learning: Build a function whose risk is closest to Bayes Risk, on our training set
$$\bar{g} = E_{\cal D}[g_{\cal D}] = \frac{1}{M}\sum_{\cal D} g_{\cal D}$$. Then,
$$E_{\cal D}[R_{out}(g_{\cal D})] = E_{X} E_{\cal D}[(g_{\cal D} - \bar{g})^2] + E_{X}[(\bar{g} - f)^2] + \sigma^2$$, where
This is the bias variance decomposition for regression.
- first term is variance, squared error of the various fit g's from the average g, the hairiness.
- second term is bias, how far the average g is from the original f this data came from.
- third term is the stochastic noise, minimum error that this model will always have.
- We don't know $$p(x,y)$$, otherwise why are we bothering?
- We want to fit a hypothesis $$h = g_{\cal D}$$, where $$\cal{D}$$ is our training sample.
- So use the empirical distribution on our sample as our best estimate of $$p(x, y)$$: $$\hat{p}(x, y) = \frac{1}{N} \sum_{i \in {\cal D}} \delta(x - x_i) \delta(y - y_i)$$
- Then $$R_{out}(h) = \frac{1}{N} \sum_{i \in {\cal D}} (h(x_i) - y_i)^2$$, and minimize to get $$g_{\cal D}(x)$$
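A sketch of this empirical-risk minimization on a made-up sample; minimizing the mean squared error over a linear hypothesis set is exactly least squares:

```python
# Sketch: empirical risk R_out(h) = (1/N) sum (h(x_i) - y_i)^2 on a made-up
# training sample, minimized over the linear hypothesis set h(x) = w0 + w1*x.
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 100)
y = 0.5 + 1.5 * x + rng.normal(scale=0.2, size=100)

def empirical_risk(w0, w1):
    return np.mean((w0 + w1 * x - y) ** 2)

w1_hat, w0_hat = np.polyfit(x, y, 1)   # least squares = minimum empirical risk
# No other (w0, w1), not even the true one, does better on this sample:
print(empirical_risk(w0_hat, w1_hat) <= empirical_risk(0.5, 1.5))  # True
```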
##[fit] 4. Bayes Risk (Classification)
That is, we calculate the predictive averaged risk, over all choices $$y$$, of making choice $$g$$ for a given data point: $$R_g(x) = \sum_y \ell(y, g) \, p(y \vert x)$$.
Overall risk, given all the data points in our set: $$R(g) = \sum_x p(x) \, R_g(x)$$.
We can then evaluate this risk for the "decision" $$g=1$$ and for the "decision" $$g=0$$.
#CLASSIFICATION RISK
$$R_{g_{\cal D}}(x) = P(y=1 | x) \ell(y=1, g) + P(y=0 | x) \ell(y=0, g)$$
- The usual loss is the 1-0 loss $$\ell = \mathbb{1}_{g \ne y}$$ (averaged over all points: $$\frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\left(y_{i} \ne \hat{y}_{i}\right)$$)
- Thus, $$R_{g=1}(x) = P(y=0 |x)$$ and $$R_{g=0}(x) = P(y=1 |x)$$ at a given $$x$$
CHOOSE CLASS WITH LOWEST RISK
Choose the class with the lower risk:
choose 1 if $$R_{g=1}(x) < R_{g=0}(x)$$, i.e., if $$P(y=1 \vert x) > P(y=0 \vert x)$$
Telecom customer Churn data set from @YhatHQ[^1]
Now, we'd choose '1' when $$P(y=1 \vert x)$$ exceeds a threshold $$t$$.
So, to choose '1', the Bayes-risk threshold can be obtained by setting $$\ell_{FP} \, P(y=0 \vert x) = \ell_{FN} \, P(y=1 \vert x)$$, which gives $$t = \frac{\ell_{FP}}{\ell_{FP} + \ell_{FN}}$$.
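A minimal sketch of the asymmetric decision rule, with made-up costs (a false negative ten times worse than a false positive):

```python
# Sketch: Bayes decision with asymmetric losses. R(g=1) = c_FP * P(y=0|x) and
# R(g=0) = c_FN * P(y=1|x); choose 1 when P(y=1|x) > c_FP / (c_FP + c_FN).
c_FP, c_FN = 1.0, 10.0       # made-up costs: missing a positive is 10x worse

t = c_FP / (c_FP + c_FN)     # decision threshold, about 0.091

def decide(p1):
    # pick the decision with the lower risk at this x
    return 1 if c_FN * p1 > c_FP * (1 - p1) else 0

print(t, decide(0.05), decide(0.20))  # below threshold -> 0, above -> 1
```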
Suppose you hope to build a widget recognizer that has 5% error. This is the best even humans can do. Right now, your training set has an error rate of 15%, and your dev set has an error rate of 16%.
Will adding training data help?
What should you focus your energies on?
(This is a problem from Andrew Ng's Machine Learning Yearning)
We break the 16% error into two components:
- First, the algorithm's error rate on a very large training set. In this example, it is 15%. We think of this informally as the algorithm's bias; more precisely, it is the unavoidable bias (the Bayes rate) plus the avoidable bias.
- Second, how much worse the algorithm does on the dev (or test) set than the training set. In this example, it does 1% worse on the dev set than the training set. We think of this informally as the algorithm’s variance
- Third, even the perfect classifier has some error (let's say here that humans have a 5% error rate). This, then, is the Bayes Rate. We'll assume that the machines won't do better. While this is not always true, it will in any case be a possible limitation of our training data
ERROR = Unavoidable Bias(Bayes Error) + Bias + Variance
It introduces us to the idea of the best possible classifier/regressor. This is an even more important idea.
The Bayes Risk is the "unavoidable bias": the optimal error rate. When we said earlier that we hoped to get to 5% error, that's what we meant by the Bayes rate.
From Machine Learning Yearning by Andrew Ng:
Suppose your algorithm achieves 10% error on a task, but a person achieves 2% error. Then we know that the optimal error rate is 2% or lower and the avoidable bias is at least 8%. Thus, you should try bias-reducing techniques.
- get human labelers to label your training set, to estimate the optimal error rate
- but for some problems, such as recommendations and ads, humans find them hard as well. You might need to microtarget
##[fit] 5. Model Selection #[fit]COMPARING CLASSIFIERS
Can we compare classifiers on accuracy?
But is accuracy really the right measure?
For extremely asymmetric (in size) classes, a stupid baseline will give you more accuracy...
Maybe you want to try and beat that accuracy.
But more importantly you want to minimize the false negatives or the false positives.
#ASYMMETRIC CLASSES
- A has large FP.
- B has large FN.
- On asymmetric data sets, A will do very badly from an accuracy perspective.
- On the other hand it has no FN, so if the cost of a FN (as in cancer) is much higher, then A might be the classifier you want!
- Upsampling, downsampling and unequal class weights are used in practice: having too few samples of one class can throw the training off
#[fit]ROC SPACE
#[fit]ROC Curve
#ROC CURVE
- Rank the test set by prob/score from highest to lowest
- At the beginning, no +ives
- Now move the threshold just enough that one example becomes positive. This might, for example, be at $$p=0.99$$
- Keep moving the threshold
- calculate the confusion matrix at each threshold
- plot the TPR against the FPR
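The recipe above, as a sketch on a tiny made-up scored test set:

```python
# Sketch: ROC points by sweeping the threshold down a ranked test set.
import numpy as np

y     = np.array([1, 1, 0, 1, 0, 0, 1, 0, 0, 0])   # true labels (made up)
score = np.array([0.95, 0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.1])

y = y[np.argsort(-score)]     # rank by score, highest to lowest
P, N = y.sum(), (1 - y).sum()

tpr = np.cumsum(y) / P        # TPR as each example in turn becomes positive
fpr = np.cumsum(1 - y) / N    # FPR likewise
print(list(zip(fpr.round(2), tpr.round(2))))  # the points to plot
```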
#ROC curves
#ASYMMETRIC CLASSES
We look for iso-performance lines in ROC space with slope $$\frac{p(0)\,c_{FP}}{p(1)\,c_{FN}}$$.
A large FN cost makes the slope small.
For churn and cancer you don't want FNs: an uncaught churner or cancer patient (positive = churn/cancer).
Average Cost = $$\frac{1}{N}\left(\mathrm{cost}_{TP} \times TP + \mathrm{cost}_{FP} \times FP + \mathrm{cost}_{FN} \times FN + \mathrm{cost}_{TN} \times TN\right)$$
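The average-cost formula can be computed directly; the counts and per-cell costs below are made up:

```python
# Sketch: average cost from a confusion matrix and made-up per-cell costs.
TP, FP, FN, TN = 50, 30, 10, 910
N = TP + FP + FN + TN

cost_TP, cost_FP, cost_FN, cost_TN = 1.0, 2.0, 20.0, 0.0  # FN is expensive
avg_cost = (cost_TP * TP + cost_FP * FP + cost_FN * FN + cost_TN * TN) / N
print(avg_cost)  # (50 + 60 + 200 + 0) / 1000 = 0.31
```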
#EXPECTED VALUE FORMALISM
Can be used for risk or profit/utility (negative risk)
Fraction of test set predicted to be positive
#Profit curve
- Rank the test set by prob/score from highest to lowest
- Calculate the expected profit/utility ($$U$$) for each confusion matrix
- Calculate the fraction of the test set predicted as positive ($$x$$)
- plot $$U$$ against $$x$$
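The profit-curve recipe above as a sketch; the scored test set and per-cell utilities are made up:

```python
# Sketch: a profit curve from a ranked, scored test set.
import numpy as np

y     = np.array([1, 0, 1, 0, 0, 1, 0, 0])          # true labels (made up)
score = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2])
u_TP, u_FP = 10.0, -2.0     # profit of targeting a churner vs. a non-churner

y = y[np.argsort(-score)]   # rank highest to lowest
n = y.size

xs, Us = [], []
for k in range(1, n + 1):   # predict the top k as positive
    tp = y[:k].sum()
    Us.append(u_TP * tp + u_FP * (k - tp))   # expected profit U
    xs.append(k / n)                         # fraction predicted positive x
print(list(zip(xs, Us)))    # the profit curve: U against x
```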
#Finite budget[^2]
- 100,000 customers, $40,000 budget, $5 per customer
- we can target 8000 customers
- thus target top 8%
- classifier 1 does better there, even though classifier 2 makes max profit
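The budget arithmetic in the bullets above:

```python
# Sketch: how far a fixed campaign budget reaches down the ranked list.
customers, budget, cost_per_customer = 100_000, 40_000, 5

n_target = budget // cost_per_customer   # 8000 customers we can afford
frac = n_target / customers              # i.e. the top 8% of the ranked list
print(n_target, frac)  # 8000 0.08
```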
Footnotes:

[^1]: http://blog.yhathq.com/posts/predicting-customer-churn-with-sklearn.html

[^2]: this+next fig: Data Science for Business, Provost and Fawcett