ng notes
θ
λ
∑ - upper
σ - lower
Π - upper
π - lower
Δ - upper
δ - lower
μ Mu (pronounced mew)
ε - lower (epsilon, upper just looks like E)
α - alpha, lowercase
≤
≥
○
✖
≈
ℝ
∈
superscripts
¹
²
³
⁴
ⁱ
ʲ
ⁿ
subscripts
₀
₁
₂
₃
₄
₅
ⱼ
ₙ
- looking at the graphs and then manually trying to read off the numbers is not a good way to decrease the cost function
- linear regression has no local optima
FEATURE SCALING
if a feature's range is much bigger than about |3|
or
much smaller than about |1/3|
consider feature scaling
value/(max-min) --- results in a new range of roughly 1
MEAN NORMALIZATION
(size-average)/(max-min) --- results in a new average input value of 0
*you can also use the standard deviation instead of (max-min)
This stuff will get you not exactly but ROUGHLY into a normalized range
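a minimal octave sketch of mean normalization + feature scaling (variable names are mine, not from the course code):
  % X is m x n, one training example per row (no bias column yet)
  mu = mean(X);                    % 1 x n vector of feature means
  featRange = max(X) - min(X);     % or use std(X) for the standard-deviation variant
  X_norm = (X - mu) ./ featRange;  % each feature now roughly centered at 0 with range about 1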
If gradient descent is working the value of J(theta) should decrease at EVERY iteration
Knowing when gradient descent has converged: See if it decreases by less than 10^(-3) in one iteration. But 10^(-3) isn't always the right number and finding that number is often difficult. Ng likes to look at plots to see if he's satisfied that it's converged.
If error goes up or goes up and down try a smaller learning rate. Slow convergence could even be caused by too high a learning rate. Just plot a bunch and use which works. Try 1, .1, .01, etc. He actually uses .001, .003, .01, .03, .1, .3, 1, 3, etc.
POLYNOMIAL REGRESSION FITTING
you may want a ^2 or a ^3 to fit the data.
if you want a cube what you can do is something like this:
θ1(size) + θ2(size)^2 + θ3(size)^3
e.g. just using the same feature but getting it to fit a more curvey graph.
*if you choose your feature like this feature scaling becomes very important. think 10, 10^2, 10^3... and the difference becomes really large
*if your hypothesis starts curving back down try sqrt(x) instead of x^2 and it may level out.
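rough octave sketch of building the cubed-size features and then scaling them (houseSize and the example values are just made up for illustration):
  houseSize = [2104; 1416; 1534; 852];          % a single 'size' feature
  X = [houseSize, houseSize.^2, houseSize.^3];  % same feature, higher powers
  mu = mean(X);  sigma = std(X);
  X = (X - mu) ./ sigma;                        % scaling matters a lot here: 10, 10^2, 10^3 blow apart fast
  X = [ones(size(X,1), 1), X];                  % add the bias column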
NORMAL EQUATION
kinda like this: theta = pinv(X'*X)*X'*y
- one shot one kill
- slow if you have a lot (like 10,000 or more) features
- the cost comes from inverting a big matrix being expensive
actually kind of cool --- https://www.coursera.org/learn/machine-learning/supplement/bjjZW/normal-equation
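the whole thing in octave, assuming X already has the bias column of ones (the numbers below are just example values):
  X = [1 2104; 1 1416; 1 1534; 1 852];   % bias column + house size
  y = [460; 232; 315; 178];              % prices
  theta = pinv(X' * X) * X' * y;         % one shot, no learning rate, no iterations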
GRADIENT DESCENT
- have to choose learning rate
- need many iterations
- runs okay with a lot of features
< 10000 features use normal equation.
> 10000 features use gradient descent
WEEK 3 - CLASSIFICATION
*logistic regression is actually a classification algorithm. it's called regression for historical reasons.
sigmoid function (also called logistic function) - range is (0, 1) (it gets arbitrarily close to 0 and 1 but never reaches them)
 1|             _________
  |           /
.5|          |
  |         /
 0|________/
  ----------|------------
            0
output between 0-1
if .7, we can say 70% chance it is true
p(y=1|x;θ) "probability that y=1 given x, parameterized by θ"
WHOA WHOA WHOA.
range of sigmoid is (0,1). say it comes out to .7. that .7 IS p(y=1|x;θ), so p(y=0|x;θ) = 1 - .7 = .3 and the two do add up to 1. the (0,1) range just bounds the single output h(x), it's not where the probabilities have to live.
"Parameterized by theta"
means x is fixed data. but multiplied by theta, which is a parameter you can change, you get different hypotheses for the same x. parameterized by theta.
btw in math:
y ∈ {0,1} ("y is an element of the set {0,1}")
{} denotes a set.
(0,1)U[6,infinity) --> interval notation
COST = -ln(hθ(x))      if y = 1
       -ln(1 - hθ(x))  if y = 0
A NUANCE
*if h(x) = y exactly (h(x)=1 when y=1, or h(x)=0 when y=0) then cost = 0
*if h(x) = .5 then cost = ln(2) > 0 either way
*as h(x) approaches the WRONG label the cost shoots up toward infinity --- that's the asymptote in the graph. cost = 0 only happens when the hypothesis is exactly right, so a correct-ish prediction still has a small nonzero cost.
LOGISTIC REGRESSION 1 VS ALL
if you have triangles, circles, and squares make a classifier for each and then whichever is the most confident is the winner.
OVERFITTING
underfitting - also called high bias
REGULARIZATION
- θ₁x, θ₂x², θ₃x³, θ₄x⁴ --- some of these are causing overfitting. we want them but we want their contribution to be small. So we make sure those values for theta are very small so as to drag that term's contribution down.
J(θ) = (1/m) ∑[-yⁱln(hθ(xⁱ)) - (1-yⁱ)ln(1-hθ(xⁱ))]   <-- unregularized logistic cost
J(θ) = (1/2m) [∑(hθ(xⁱ)-yⁱ)² + λ∑θⱼ²]                <-- regularized linear regression cost
left term want to fit as close as possible
right term wants to minimize each term's impact (by keeping the parameters small). "it determines how much the costs of our theta parameters are inflated".
ITS A WAR!
*but, there is a way to automatically choose the regularization parameter as well.
linear regression regularization and logistic regression regularization both add the same regularization term, (λ/2m)∑θⱼ². the overall cost functions look different because logistic regression runs the hypothesis through the sigmoid and uses the log cost instead of the squared error.
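a sketch of the regularized logistic regression cost in octave (variable names assumed, not copied from the assignment files):
  % X is m x (n+1) with the bias column, y is m x 1 of 0/1, theta is (n+1) x 1
  m = length(y);
  h = 1 ./ (1 + exp(-X * theta));                              % sigmoid hypothesis
  J = (1/m) * sum(-y .* log(h) - (1 - y) .* log(1 - h)) ...
      + (lambda/(2*m)) * sum(theta(2:end) .^ 2);               % don't regularize theta(1), the bias term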
NEURAL NETWORKS
-don't forget (θ^T)X is second nature. It means for a given record, multiply and sum all θ's and x's. e.g. do the polynomial for that theta and x's.
x₀ a₀² <-- bias units
x₁ a₁²
x₂ a₂² ○ --> hθ(x)
x₃ a₃²
col1 = layer 1 = input layer
col2 = layer 2 = hidden layer
col3 = layer 3 = output layer
θ is a matrix of weights now!!!
#########################################################
https://www.coursera.org/learn/machine-learning/supplement/Bln5m/model-representation-i
"Each layer gets its own matrix of weights, θ(j)" - from
Example: layer 1 has 2 input nodes and layer 2 has 4 activation nodes. The dimension of Θ⁽¹⁾ is going to be 4×3: sⱼ=2 and sⱼ₊₁=4, so sⱼ₊₁ × (sⱼ + 1) = 4×3.
##########################################################
If a layer has Sⱼ number of units. e.g. above there are 3 in the input and hidden...
The dimensions of theta for layer j will be: (Sⱼ₊₁) x (Sⱼ + 1)
(how many does the next guy have) X (how many do I Have + 1)
sigmoid is the "activation function"
################################################################
BACKPROPAGATION
- allows you to compute "errors" for each parameter in the nn. but i think "errors" just means the slope of each param at the current point in an n-dimensional graph.
- there is a vectorized implementation for computing Δ!!!!!!!!!!!!!
--- look at end of video 2. the section in red.
l,i,j insanity
l = curLayer
i = curRowInTraining set
j - which node in the cur layer
Numerical Estimation of gradients
- he always uses this because you might have a bug in backprop even though the error still gets smaller
- sort of like calculating the limit of (f(x+h)-f(x))/h as h goes to 0, except you compute the two-sided version (f(x+ε)-f(x-ε))/(2ε), usually with ε=10^-4. If you go too small you get numeric errors (arithmetic on tiny numbers stops working well).
-just does that for every theta. rolls out theta into a vector. then does a for loop computing (J(θ+ε)-J(θ-ε))/(2ε) for every element in theta.
-once you know your gradient checking code is working turn it off so it doesnt slow down your code in production.
- the other way, e.g. regular backprop (which atm i dont fully understand), is much more efficient than this. this, e.g. gradient checking, is just for scaffolding/debugging. I suppose technically you could use it in prod but I don't think it's technically/mathematically as exact AND it is slower... I think that is the take-away.
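a minimal gradient-checking sketch (costFunc is a placeholder name for whatever cost function you're checking):
  epsilon = 1e-4;                        % the 10^-4 from above
  numgrad = zeros(size(theta));
  for i = 1:numel(theta)
    thetaPlus = theta;   thetaPlus(i)  = thetaPlus(i)  + epsilon;
    thetaMinus = theta;  thetaMinus(i) = thetaMinus(i) - epsilon;
    numgrad(i) = (costFunc(thetaPlus) - costFunc(thetaMinus)) / (2 * epsilon);  % two-sided estimate
  end
  % compare numgrad against the backprop gradient, then turn this off for real training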
Random Initialization (Symmetry breaking)
- You can't initialize theta1 to all zeros. If you do, each weight coming out of a node (say x1) will get updated to the same value. e.g. all nodes it points to will get the same input from it. But if you initialize to random values they are free to change. ...unless, by some coincidence, two weights ever become the same value. In which case wouldn't they get locked together too?
-"all of your hidden units are computing the exact same feature"
-rand(1) gives a random value between 0 and 1
-rand(1)*5 gives a random value between 0 and 5
-rand(3, 1)*5 gives a random 3x1 vector of values between 0 and 5
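symmetry breaking in octave, roughly (the layer size names are mine; I think the exercise uses an init epsilon around 0.12):
  INIT_EPSILON = 0.12;
  Theta1 = rand(hiddenSize, inputSize + 1)  * 2 * INIT_EPSILON - INIT_EPSILON;  % values in [-eps, +eps]
  Theta2 = rand(outputSize, hiddenSize + 1) * 2 * INIT_EPSILON - INIT_EPSILON;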
MORE BACKPROP
ng does compute deltas for bias nodes when doing backprop. but it sounds like that's just because it makes the computation easier. he said he usually just throws them away. ...remember the bias nodes always output 1 and there's no way to change that.
IF HYPOTHESIS DOESNT PERFORM WELL
- more training data
- more features
-less features
- polynomial features (yx^3, y^2x^2, y^3x) kind of thing
- change lambda
SPLIT TEST DATA
standard way is typically 70% to train 30% to validate. ...random, obviously
MODEL SELECTION
what order model to use. e.g. just order1 or how much to increase the order by doing this sort of thing (yx^3, y^2x^2, y^3x)
but that may not work because it is likely to be overly optimistic because it's fit to the test set. e.g. you can't trust the degree (model) that comes out of the test data.
solution:
60% training set
20% cross validation
20% test
train each model on train. evaluate each model on cross validation. pick the model that gives the lowest cross validation error!
bias vs variance - interesting - if d (the order) is too large for the number of examples that we have. implies you can use a larger d if you have more training examples
note - if a learning algorithm is suffering from high variance getting more training data is likely to help. --- i guess because even though its jagged if you have enough examples it'll smooth out. ...idk
note - nn size - big networks are prone to overfitting but are more computationally expensive although in practice that's probably not a problem. small networks are prone to underfitting. its often a good idea to go with the bigger network and rely/work-on regularization.
- again - using a single hidden layer is often a reasonable default
- but - try a bunch of networks (different number of layers and nodes) and see which one works best on cross validation training set.
WEEK 6 Videos - 2nd pass
model selection is choosing the order. model selection is also about choosing lambda.
models are likely to perform better on training set than on real data. this is also why you need to evaluate a model using the cross validation set.
to select a model try different order polynomials. train them on the training data and then see which one works best on the cross validation set.
linear model
quadratic model
cubed model???
...
10th order polynomial
using the same data for the training set and validation set is not considered good practice but you will see some people do it. maybe sometimes okay if you have massive amounts of data
the graph in video 4 is an overall view of the graphs in videos 5 and 6. the graph in video 4 shows training data vs cv for each order of model. the graphs in videos 5 and 6 show training vs cv for just one model. sure would be cool if the trainer program output these graphs!!!
video 5 about choosing lambda. you dont use the lambda part of the cost function when measuring how well lambda did. ...duh? he's just saying use the squared error cost function to get the cost.
video 5 about choosing lambda. try values that roughly double each time (0.01, 0.02, 0.04, ... up to around 10.24). try a range of values. pick the one that has the lowest squared error on the cross validation set.
*when you think about the size of lambda
- think about regularization being used to smooth the hypothesis
- no regularization means overfitting. so SUPER regularization means the hypothesis gets stretched out into pretty much just a line
LEARNING CURVES
to plot learning curves plot JTrain and JCV
use only 10 or 20 or 30 or 40 examples in your plot
you still train on all of them. you just only plot the learning curve on your little subset. so your JTrain plot might only have 10 examples even though m>>10.
when you have very few training examples (1, 2, 3, ...< order i guess) its very easy to fit exactly. so JTrain will be low, maybe even 0. but as the number of training examples grows the error from JTrain will actually get bigger just because you can't fit the graph perfectly anymore. ---------- this is all intuition. just writing it down because.
HIGH VARIANCE
"as the training set size increase it becomes harder to fit the data exactly". which is why more examples can help solve high variance.
"When plotting learning curves they might not be as nice and smooth and pretty as in the video"
on the menu of options to try to fix a hypothesis that isn't performing well. changing lambdas is pretty easy so you might want to try that first.
small nns are prone to underfitting
large nns are prone to overfitting
*for large nns use lambda to address overfitting
often in terms of nns its the larger the better
number of hidden layers. usually the more the better. but you might want to try it with a range of hidden layers and see which one does best.
-------------------------------------------------
MACHINE LEARNING SYSTEM DESIGN wk6 - 8
SPAM CLASSIFIER
CHOOSING X
option1 - list of 100 keywords. "deal", "buy", "viagra", "sean"
- some of those are indicative of spam, others, like "sean" are indicative of a real email.
spam email = "get your viagra today!"
0 discount
0 cash
x=0 medicine
1 viagra
0 "your actual address" (wouldnt be in this pretend email cause its spam)
1's and 0's dont put 3 if the word is in there three times.
! how its actually done is to look at your training set and pick the most frequently occurring words (on the order of 10,000-50,000 words) rather than manually picking 100.
other notes features
- look at email routing (spammers will try to hide where the email came from)
- consider should "discount" and "discounts" be treated as the same word? "deal" and "dealer"? should punctuation be considered a feature???
- consider detecting misspellings so they dont fool your spam catcher. e.g. w4tches != watches
DO NOT RANDOMLY FIXATE ON ONE THING! There are lots of ideas here and some people will easily spend 6 months working on one of these problems without a good reason.
----------------------------------------------------
RECOMMENDED APPROACH
S.) Simple algorithm you can implement quickly - literally impossible to spend too little time on this
L.) Learning Curves
A.) Analyze errors. Manually look at what is being misclassified in your cross validation set. *note - again, your cross validation set. not your test set.
From here you can get a better idea of feasibility and what algorithm to actually use.
ANALYZE ERRORS (spam example)
looking at misclassified emails.
1. manually look at them and categorize them. e.g. (pharmacy, replica, "steal password" scams)
you might find that many "steal password scam" emails are being miscategorized. so...
2. you might want to see if you can come up with better features to help categorize them correctly. for example, maybe you see that unusual punctuation is common in the misclassified emails. so now you know to focus on that.
SIMPLE ALGORITHM - using a quick and dirty helps you identify which training examples are the challenging ones
---------------------------------------
STEMMING
Universe/University = univers
Discount/Discounting/Discounted/Discounts = discoun
run it with and without stemming. see if cross validation error is higher or lower
* you can also try to distinguish between upper and lowercase
---------------------------------------
SKEWED CLASSES
say only .5% of your examples are actually y=1 (skewed classes)
the trivial function y=0 then has an error of only .5%, which can beat a classifier you actually trained
if we make a change to our algorithm and it goes from 99.2% accuracy to 99.5% accuracy did our change actually help or are we just predicting y=0 more often and getting lucky?
                        ACTUAL
                      1           0
PREDICTED   1     true pos    false pos
            0     false neg   true neg
precision=of all patients where we predicted y=1, what fraction actually has cancer
recall=of all patients that actually have cancer, what fraction did we correctly detect as having cancer
DEF2 (by me)
precision= when we did predict 1, how often were we right
recall=when the value was 1, how often did we predict it
accuracy = how many you got right (1 or 0) out of all the examples
so if youre learning algorithm is predicting
y=0
then it will have a recall of
recall=0
because there wont be any true positives. ..because y=0
*usually you use the convention that y=1 (not y=0) for the "rare" class that you want to detect. ...rare being like, the ones that you wanted to detect.
----------------------------------------------------
TRADING RECALL AND PRECISION
*suppose we want to predict y=1 only if we're very confident. (e.g. we dont want to tell someone they have cancer if they really dont)
- then predict 1 only if h(x) >= .7 (as opposed to .5)
- of course, now you will have lower recall (e.g. not predict as many people who DO have cancer)
* suppose we want to avoid missing too many patients with cancer
- then set h(x) lower, say .3
- of course, now you will have lower precision (but higher recall)
* you can plot a precision/recall curve if you want. see video 11. 6:50
btw, for example, a predictor that always says y=1 has perfect recall (and terrible precision).
----------------------------------------------------
IF YOU HAVE TO THINK ABOUT WHICH VERSION OF YOUR ALGORITHM IS BETTER THAT SLOWS THINGS DOWN.
precision and recall. p=.5, r=.4. p=.7, r=.1, etc, which is better????
How do we evaluate the algorithm based on ONE number?
you cant use the average of P and R because, for example, an always-predict-y=1 algorithm has perfect recall and would get a decent score even though all it does is say y=1.
F score: 2PR/(P+R)
*there are other options but people just tend to use this. F doesn't have any meaning. thats just what it is.
if P or R = 0 you get a score of 0
a perfect score is P=1,R=1 == 1
so your range is 0-1
so a pretty reasonable way to automatically trade precision and recall would be to try a lot of values for the threshold on h(x) and see which gives you the highest f score (on the cross validation set). see the sketch after the examples below.
e.g.
h(x) >= .5
h(x) <= .3
h(x) > .9
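a sketch of sweeping the threshold and scoring each one with F on the cross validation set (pval = predicted probabilities, yval = labels; both names assumed):
  bestF1 = 0;  bestThreshold = 0;
  for threshold = 0.1:0.05:0.9
    pred = (pval >= threshold);              % predict 1 only when confident enough
    tp = sum((pred == 1) & (yval == 1));
    fp = sum((pred == 1) & (yval == 0));
    fn = sum((pred == 0) & (yval == 1));
    prec = tp / (tp + fp);                   % guard against 0/0 in real code
    rec  = tp / (tp + fn);
    F1   = 2 * prec * rec / (prec + rec);
    if F1 > bestF1
      bestF1 = F1;  bestThreshold = threshold;
    end
  end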
------------------------------------------------------
*perceptron = logistic regression?
------------------------------------------------------
study done by Michele Banko and Eric Brill
found that with perceptron, winnow, memory-based, and naive bayes algorithms, once you got enough data (millions of examples) they all pretty much did as well as each other. so there is a saying
"it's not who has the best algorithm that wins. it's who has the most data"
------------------------------------------------------
large data rationale
- can a human do it?
e.g. i ate (two|too|to) eggs for breakfast. yep, a human can figure that out. how much will a 2400 sqft house cost? nope, not even an expert can tell you based on that information alone.
- assuming you have enough features to accurately predict y
- that will address bias
- JTrain will be small
but what about variance? well, if you have a ton of data, variance irons itself out.
so if you have a good set of features (low bias) you can feed it a lot of data, which will reduce variance, which will make it better and better.
--------------------------------------------------------
*another reminder
your choice of algorithm is not as important as knowing the right features to use and having enough data.
--------------------------------------------------------
SUPPORT VECTOR MACHINE
tl;dr two vectors: θ and the ith training example xⁱ. project xⁱ onto θ to get p. then θ'xⁱ = p*||θ||.
θ is perpendicular (90 degrees) to our decision boundary. the other vector is our ith training example.
then we alter θ so that p*||θ|| >= 1 when y=1 and p*||θ|| <= -1 when y=0. keeping ||θ|| small at the same time forces the projections p (the margins) to be large.
--------------------------------------------------------
KERNELS
kernels and svms are a good fit. you could use kernels to find features for logistic regression but they are better suited for svm's because they are more computationally efficient with svms.
e.g. using kernels to generate features for logistic regression instead of just increasing the order of your model doesn't sound like something that is really done.
ON THAT NOTE - do not write software to minimize an svm. there are off the shelf solutions that do it and they take advantage of magic. ...numerical optimization tricks.
ALSO - do not write software to minimize an SVM yourself. two suggestions are liblinear and libsvm.
*gaussian and linear kernels are the two most popular kernels, by far.
google "mercer's theorem". [when svms were first being created there was a move to keep the general. not all kernels you write will be valid. which is why you should just stick with linear or gaussian unless you know what youre doing]
* some software packages include a gaussian and linear kernel and others you have to write it yourself (e.g. you pass in a function pointer)
* you do have to perform feature scaling before using an svm.
e.g. feature1 is 1000sqft and feature2 is 2 bedrooms.
other kernels (dont use these)
-polynomial kernel: (x'l)², (x'l)³, (x'l+1)³, ..., (x'l+constant)^degree
-string kernel (used for strings)
-histogram intersection kernel
-chi-square kernel
~1 to (10,000 or 50,000) examples with a large number of features (maybe up to 50,000) is where svm's really shine.
- neural networks work well too but might be slower to train compared to an svm.
*INTERESTING NOTE - the svm optimization problem is convex. e.g. its a lot easier to minimize
SVM's are good tools.
with SVM's, neural networks, and logistic regression you're very well positioned to write good machine learning software (in 2010???)
AGAIN - the choice in algorithm is important but often not as important as how skilled you are:
- identifying features
- coming up with new features
- solving errors in your algorithms
- learning curves? model selection? high bias vs variance. what is the math actually doing? get more data or work on model? e.g. wk6.
SVM BIAS AND VARIANCE - think of gaussian kernel. large sigma means wider kernel means higher bias and lower variance. small sigma means more narrow kernel means lower bias and higher variance.
*note about "high variance". high variance means it varies more abruptly. it sounds like it might mean it varies more between the examples but in your head you might be imagining it nearly perfectly fitting a bunch of points and think that's not variance, it hits the mark every time. but what it means is it varies (e.g. jagged craziness) more between each example.
-------------------------------------------------------------
SPAMASSASSIN PUBLIC CORPUS (ex6, in the pdf)
https://www.google.com/search?sclient=psy-ab&client=ubuntu&hs=3vU&channel=fs&btnG=Search&q=spamasasin+public+corpus
-------------------------------------------------------------
NOTES ON EX6 SPAM CLASSIFIER
STEP-1:
ex6_spam.m had something setup to handle email headers
STEP0:
preprocess the emails. tolowercase. normalize (money, m0ney, etc), strip html tags, replace urls with the word httpurl, replace email address with emailaddr
STEP0.5:
gonna have to look in the code for an exact answer on this one. take the most frequently used words in our training set. that's the only criterion. not how often they are used or their standard deviation or anything. just "the most used". the idea is that if you include every word in your training set you will overfit.
STEP1:
select the most frequently used words in spamassassin. they used only words that occurred more than 100 times in spamassassin (which led to a list of 1899 words). how they picked the number 100 i have no idea.
STEP2:
create a new vector where each member is the index of a word/member in spamassasin that was found in your training data.
STEP3 (GENERATE THE FEATURES):
you have your vector from step2. now, for each email, say if that email contained a word in your step2 vector. ...it will be the same size as your step2 vector, obviously.
will look like
0
0
0
0
0
0
0
0
0
1
0
0
0
0
1
0
0
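as an octave sketch of step 3 (word_indices is an assumed name for the list of vocabulary indices found in one email):
  n = 1899;                 % vocabulary size from step 1
  x = zeros(n, 1);
  x(word_indices) = 1;      % 1 if that vocab word appears anywhere in the email, 0 otherwise (not a count)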
STEP4:
now you have features for each email and you know if each email is spam or not. so now you can train your model.
-------------------------------------------------------------
UNSUPERVISED LEARNING
K MEANS (clustering) - VERY COOL
1 randomly initialize cluster centroid
2 group #examples/cluster centroids based on how close they are to each cluster centroid
3 move cluster centroid to the average point of all the examples associated with it
3.5 if cluster centroid has no points (either delete it or randomly reinitialize it)
4 repeat
-mathematical definition makes it look way more confusing than it actually is. when you minimize youre just minimizing the distance between the x's and the cluster centroids.
* small note - c^i is the index of the cluster to which x^i is assigned
- J(c⁽¹⁾, c⁽²⁾, ..., c⁽ᵐ⁾, μ₁, μ₂, ..., μ_K) = (1/m) ∑ ||xⁱ - μ_cⁱ||²
- All minimization does is minimize with respect to c and then minimize with respect to μ and then it keeps on iterating.
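a rough octave version of the loop above (step numbers match; a function-free sketch, not the assignment's actual files):
  K = 3;  iters = 10;
  m = size(X, 1);
  idx = zeros(m, 1);
  centroids = X(randperm(m, K), :);               % step 1: K distinct random examples as starting centroids
  for it = 1:iters
    for i = 1:m                                   % step 2: assign each example to its closest centroid
      dists = sum((centroids - X(i, :)) .^ 2, 2);
      [minDist, idx(i)] = min(dists);
    end
    for k = 1:K                                   % step 3: move each centroid to the mean of its points
      centroids(k, :) = mean(X(idx == k, :), 1);  % step 3.5: handle the empty-cluster case in real code
    end
  end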
HOW TO INITIALIZE THE K-MEANS CLASSIFIER CLUSTER CENTROIDS
-pick K random training examples. but dont pick the same example twice. if you did, two centroids would be in the same spot and either your program would blow up or you'd effectively have K-1 clusters
- set μk=theRandomXWeAreUsingAsKk
- train (you might get lucky and you might not)
- do this a bunch of times (a normal # of times would be ~50-1000). pick the run that gave you the lowest cost J. use the cluster centroids from that run (their final location, not where they started... obviously(?))
*using this approach of random initialization iteratively and then using the one with the lowest error is more likely to be helpful if you only have a few K. if you have a lot of K it wont help as much.
*2 or 3 or 4 clusters is the "regime" where random initialization is likely to help you find better clusters.
* stupid note from me - but isn't K=m the ultimate sweet spot? yes. for that training set, you moron. it will be overfit for everything else.
HOW DO YOU CHOOSE THE NUMBERS OF CLUSTERS?
by hand (look at output of k-means, look at visualization, or whatever else you have to go on)
*there isnt a good way to do it automatically.
- there isnt always a right answer! sometimes you could say well there are 4 clusters here while looking at the same dataset you could say well there are 2 clusters here.
ELBOW METHOD (for choosing number of clusters)
\
J \
\____
1234567
"after 3 clusters we start seeing very little reduction in J. so use 3."
- in reality it doesn't work like this. the curve is much more gradual.
***sometimes it works. its worth a shot but dont have a high expectation of it working.
OPTION2
think about it
- for example. im selling t-shirts and want to find market segments. well, what sizes of t-shirts might i be able to produce? S,M,L. Maybe also XS,S,M,L,XL. But probably not just S and L or one where K=10.
-----------------------------------------------------
K-MEANS FOR IMAGE COMPRESSION!? (part of programming assignment)
-----------------------------------------------------
very cool but did not answer my question about distances between colors. it just used my code, lol. the code in findClosestCentroids.m which just takes the linear distance in 3-dimensional space, which sweeps out a sphere on which any point on the surface, regardless of how similar its color looks to the eye, is the same distance.
very cool and useful. doesnt answer any questions about how to tell if two colors are similar.
------------------------------------------------------------
DIMENSIONALITY REDUCTION
- sometimes you get redundant features. e.g. the same thing twice. e.g. the length of something in inches and the length of the same thing in cm as two separate features. or you might get very similar (or highly correlated) features (for example, pilot skill and pilot enjoyment could potentially be reduced to just one feature).
- advantages (of say going from 2 features down to 1, or, looking at the whole thing at once, going from like 20 down to 5) are less memory required and, more importantly, your algorithm will run faster.
- first part of video6 kind of explains it. last part of video6 visualizes it pretty well.
---------------------------------------------------------------
DIMENSIONALITY REDUCTION - DATA VISUALIZATION
- when you use the dimensionality reduction algorithm (PCA?) it doesn't attach a meaning to the outputs. So it outputs z1 and z2, and it's up to you to figure out what those mean.
----------------------------------------------------------------
PRINCIPAL COMPONENTS ANALYSIS
- by far the most popular dimensionality reduction algorithm
- before applying pca use mean normalization and feature scaling (maybe feature scaling, if you need it).
STEP 1 (preprocessing)
- normalize
- feature scaling if needed
- take the average of each feature. for each value, x := x - avg. now that feature has exactly zero mean.
the mean normalization part:
2,3,4,7 => 2+3+4+7=16. 16/4=4. 2-4=-2,3-4=-1,4-4=0,7-4=3.(-2)+(-1)+0+3=0,0/4=0. Now you have exactly zero mean.
the feature scaling part (scale features to have comparable range of values). do this on all your features:
(x-xMean)/(maxX-minX)
or what is more common:
(x-xMean)/(standardDeviation)
STEP 2 - its actually pretty easy (proof is hard though)
these x's are column vectors, so each (xⁱ)(xⁱ)' product below is (nx1)(1xn) = nxn
Sigma = (1/m) ∑ from i=1 to m of (xⁱ)(xⁱ)'
Sigma here is the variable Sigma, not the summation symbol. in this case Sigma is the "covariance matrix". ..Sigma is an nxn matrix.
*vectorized implementation is Sigma = (1/m)*X'*X (X has one training example per row) ---- you can check the dimensions if you want.
[U,S,V]=svd(Sigma)
*"eigenvectors" - pronounced eye-gin-vectors (gyin, not gin the alcoholic drink).
*Singular Value Decomposition (there are many libraries that can compute this)
*octave also has an eig function that does the same thing as svd but svd is "a little more numerically stable". if you want to learn more, this has something to do with linear algebra.
svd output:
U=the only one we need. nxn
you want to reduce to 72 dimensions? then take the first 72 columns of U.
Ureduce is the first k columns of U. e.g. 72.
Ureduce is nxk, obviously. (that's n by k)
z=Ureduce'*x
*(kxn)(nx1)= z is a kx1 = z is a k dimensional vector
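putting it together in octave (X is m x n and already mean-normalized/scaled per step 1; k is how many dimensions you want to keep):
  m = size(X, 1);
  Sigma = (1 / m) * (X' * X);        % n x n covariance matrix
  [U, S, V] = svd(Sigma);
  Ureduce = U(:, 1:k);               % first k columns of U, so Ureduce is n x k
  Z = X * Ureduce;                   % whole-dataset version of z = Ureduce' * x; each row of Z is k-dimensional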
----------------------------------------------------------------
PCA RECONSTRUCTION
xApprox=Ureduce*z
isn't exact but kind of close. for example 2d to 1d is like taking our 1d line and giving the slope and y-intercept of the original projection. but all the points lie on the projection.
----------------------------------------------------------------
PCA CHOOSING K (K is the number of principal components. aka the number of features)
avg squared projection error: on avg how far off was xApprox
total variation in the data: the avg value of ||xⁱ||². ||x|| is the length (norm) of the vector x, e.g. the vector [2; 3; 4] has length sqrt(2² + 3² + 4²). okay, so a vector does have a length.
How do you choose K?
good rule of thumb is to choose the smallest K while
(avg squared projection error between x and its projection) / (total variation in the data)
is <= .01. e.g. <= 1%.
or so that "99% of variance is retained"
*dont tell people you went from 1000 dimension down to 20. say n% of the variance is retained.
*you can also use .05, or .1 or even .15 if you are crazy
HOW DO YOU DO IT?
while(variance retained < .99) k++, rerunning PCA for every k? nope, too slow
instead use the S matrix that svd already returned: variance retained for k components is the sum of the first k diagonal entries of S divided by the sum of all of them. see the sketch below and the pictures "PCA5 CHOOSING K"
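something like (S is the diagonal matrix svd(Sigma) returned above):
  sDiag = diag(S);                                    % the singular values
  for k = 1:length(sDiag)
    retained = sum(sDiag(1:k)) / sum(sDiag);          % fraction of variance retained with k components
    if retained >= 0.99                               % smallest k that keeps 99% of the variance
      break;
    end
  end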
* you can often get away with reducing a lot of dimensions. the features are often highly correlated. so you can often get away with a lot of compression while still retaining high variance.
------------------------------------------------------
PRETTY OBVIOUS
- when you run PCA do it on the training set and then apply it to the cross validation set. dont use the cross validation set to help figure out pca.
APPLICATIONS OF PCA
- reduce size and speed up learning algorithm
- visualization. we can only see in 2d and 3d. reduce dims to visualize higher dimensional data.
BAD USE OF PCA
- to prevent overfitting. Why? Well, it might actually work okay. But you should use regularization instead. ..it might actually be okay if you are retaining 99% of variance or something. ...but PCA throws away the values of y. Use regularization!
A GOOD QUESTION
What if we just do the whole thing without using PCA?
- try your original data first. then try pca. don't architect a plan that uses pca before you have actually tested your algorithm.
-------------------------------------------------------------
ANOMALY DETECTION
1. choose features that you think might be indicators of anomalies. (say for credit card fraud - the country they log in from, their browser's user agent string, etc).
2. fit mu and sigma (there are also pictures of these in pictures/anomaly detection)
for j=1:n
  μⱼ = (1/m) ∑ from i=1 to m of xⱼⁱ
    ---> the center of feature j's distribution curve. e.g. the mean.
    ---> this can be vectorized for j=1:n all at once
    ---> e.g. mu = mean(X)' if each training example is a row of X
  σ²ⱼ = (1/m) ∑ from i=1 to m of (xⱼⁱ - μⱼ)²
    ---> the width (and the height, actually) of the distribution curve.
    ---> this can be vectorized too.
end
* reminder, the area under the distribution curve is always 1.
3. p(x) = Π (1/(sqrt(2π)σⱼ)) * e^(-(xⱼ-μⱼ)²/(2σⱼ²))
gives you the probability. it's just the value of the distribution curve at x.
if p(x) is small then x has a low probability and can be flagged as anomalous.
---> see pictures/anomaly detection/gaussiondistribution.jpg for a prettier version
---> see pictures/anomaly detection/gaussiondistribution.jpg for a prettier version
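an octave sketch of fitting μ and σ² and then scoring one example (X is m x n, one example per row; x is one new example as an n x 1 column):
  mu = mean(X)';                        % n x 1: the center of each feature's bell curve
  sigma2 = var(X, 1)';                  % n x 1: variance normalized by m (that's what the second argument does)
  p = prod((1 ./ sqrt(2 * pi * sigma2)) .* exp(-((x - mu) .^ 2) ./ (2 * sigma2)));
  % flag the example as anomalous if p < epsilon, with epsilon tuned on the cross validation set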
-------------------------------------------------------------
DENSITY ESTIMATION
(probability of x1)*(probability of x2)*...(probability of xn)
p(x;mu1,sigma^2_1)*p(x;mu2,sigma^2_2)*...p(x;mun,sigma^2_n)
each mu and sigma^2 are different
p(x) = Π from j=1 to n of p(xⱼ; μⱼ, σ²ⱼ)
------------------------------------------------------------
EVALUATING THE PERFORMANCE OF AN ANOMALY DETECTION ALGORITHM
- its nice to have a number you can just look at
- the standard way is to get some labeled data (y=0, y=1)
1. training set. get some normal (e.g. non-anomalous) data. it's okay if there are a few anomalies in your training set, though.
2. define a test set and cross-validation set. these should have a few anomalous data.
EXAMPLE:
we have this data:
10000 good (nonanomalous)
20 flawed (anomalous)
then do this:
training set - 6000 good
cv - 2000 good | 10 flawed
test - 2000 good | 10 flawed
Algorithm evaluation (this seems so obvious i shouldnt write it down)
1. fit model p(x) on training set {x¹x²x³...x^m}
- remember p(x) is the product of the gaussians (where each feature gets its own gaussian curve)
2. y={1 if p(x) < ε
{0 if p(x) ≥ ε
TL;DR - check cv and tune ε. can features be weighted?
He just says to use a real value to estimate the performance and then tweak ε based on what, for example, F1 score it gets in the cross validation set. Then, just as a reminder, if you look above you will see how to calculate μ and σ for each feature to get that features distribution curve. So I guess the data is pretty much static and all we can really tune is ε. ...this is my own thinking here --- couldnt you give each feature its own weight? so that features that are more likely to be indicative of an anomaly are represented better?
*reminder - youre probably going to have a very skewed data set and so just predicting y=0 will have a very high accuracy. instead use (true pos, false pos, false neg, true neg), (precision/recall), (F₁-score)
-------------------------------------------------------------------
ANOMALY DETECTION - TRANSFORMING FEATURES
look at the feature's distribution curve as a histogram:
GOOD NOT GREAT BUT WILL PROBABLY STILL WORK
| |
||| |||
||||| |||||
IF ITS NOT GREAT YOU CAN TRY TRANSFORMING THE DATA:
if | then replace x with log(x) and you may get a
||| much more gaussian looking (bell curve-ish)
||||| distribution
- you can also try log(x+1), log(x+constant), log(x^(1/3))
- this is done on a per feature basis. e.g. x₁, x₂, x₃, etc.
- just try a bunch of stuff.
- see pictures/anomaloy distribution/transform.png to see transform in action
once you get it distributed so that it looks more or less gaussian then you can feed it to your learning algorithm p(xⱼ;μⱼ,σ²ⱼ).
------------------------------------------------------------------
ERROR ANALYSIS FOR ANOMALY DETECTION
1. say you start with only feature x₁. you find an anomalous example but it's in a normal part of the distribution curve.
distribution curve for x₁:
|
|||
|||||
|||||||
|||||||||
^
|
---- anomalous example is here. wtf?
2. look at that anomalous example and see if you can come up with another feature that explains why it went bad (e.g. x₁ didn't explain why this aircraft engine went bad. maybe x₂ will).
3. the hope is that adding x₂ will improve your classification of anomalous examples.
○ good
✖ bad
*now how anomalous it is, is its distance from the mean of the two features. e.g. you could plot each "standard deviation" as a circle.
| ✖
|
|
| ○
x₂| ○ ○
| ○ ○
| ○ ○
| ○
----------------------
x₁
----------------------------------------------------------------
ADVICE FOR CHOOSING FEATURES
- choose features that might take on unusually large or unusually small values in the event of an anomaly.
Detecting anomalies in a data center. Let's think about this:
x₁ = memory use of a computer
x₂ = number of disk accesses
x₃ = cpu load
x₄ = network traffic
What would be a good feature for anomaly detection? Well if a server has a high cpu load and low network traffic that could very well mean it's caught in some kind of infinite loop. so a good feature would be
x₅ = cpu load/network traffic
*you could even think this through some more and
maybe it makes more sense to do:
x₅ = (cpu load)²/(network traffic)
if x₅ is anomalous then we may have detected a problem with that server.
**another example is heat and vibration in an engine. they are unrelated and either can be anomalous independently. so you could do x3=vibration/heat.
----------------------------------------------------------------
WHEN TO USE ORIGINAL MODEL VS MULTIVARIATE GAUSSIAN
- usually the original model (2d bell curve, normal model, [singlevariate-gaussian] distribution) is used
PROS OF MULTIVARIATE
- you don't have to do this x₅ = cpu load/network traffic
- multivariate captures correlations automatically
CONS OF MULTIVARIATE
- m must be greater than n otherwise it won't work out mathematically (the covariance matrix Σ ends up non-invertible)
PROS OF ORIGINAL
- it's faster
----------------------------------------------------------------
*note on multivariate vs normal gaussian distributions
mathematically the original per-feature model is a special case of the multivariate gaussian: the one where the covariance matrix is diagonal (axis-aligned contours).
see wk9 end of video 8 for more. but i think the point is just that
you shouldn't worry about losing anything when you go to/from normal and multivariate.
----------------------------------------------------------------
RECOMMENDER SYSTEMS
- often the most important machine learning applications. or at least often the ones businesses would most like to have improved.
              | Alice(θ¹)  Bob(θ²)  Carol(θ³)  Dave(θ⁴) | x₁(romance)  x₂(action)
--------------|------------------------------------------|------------------------
romanceMovie1 |     5         5         0          0     |    .9            0
romanceMovie2 |     5         ?         ?          0     |    1             .01
romanceMovie3 |     ?         4         0          ?     |    .99           0
actionMovie1  |     0         0         5          4     |    .1            1
actionMovie2  |     0         0         5          ?     |    0             .9
The goal is, given the user's ratings figure out the question marks.
nᵤ=number of users
nₘ=number of movies
r(i,j)=1 if user j has rated movie i (0 otherwise)
y(i,j)=rating by user j on movie i (if defined)
θʲ=parameter vector for user j
xⁱ=feature vector for movie i
for user j, movie i, predicted rating is θʲ'*xⁱ
-----------------------------------------------------------------
HOW TO DO IT
Make it a linear regression problem.
min over θʲ of: (1/2) ∑ over {i: r(i,j)=1} of ((θʲ)'xⁱ - y(i,j))² + (λ/2) ∑ from k=1 to n of (θʲₖ)²
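a quick octave sketch of the prediction part, using the notation above (X stacks the movie feature vectors as rows, Theta the user parameter vectors as rows; both names assumed):
  % X is (num_movies x n), Theta is (num_users x n)
  predictions = X * Theta';          % entry (i,j) is (θʲ)'xⁱ, user j's predicted rating of movie i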