-
Notifications
You must be signed in to change notification settings - Fork 0
/
CodalabSubmission.Rmd
473 lines (384 loc) · 15.6 KB
/
CodalabSubmission.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
---
title: 'Fierce Iconic Legends: Predicting Biodegradability'
output:
pdf_document:
toc: yes
html_notebook:
theme: united
toc: yes
html_document:
df_print: paged
header-includes: \usepackage{color}
toc: yes
---
# Report Background
This report was prepared for **CHEMS-R-US** by *Fierce Iconic Legends.*
This notebook is by *Jose Figueroa*.
The consultants in our company are:
+ Jose Figueroa
+ Matthew Garber
+ Jim Jiang
+ Zak Stanke
+ Qianjun Chen
+ Miranda Kaiser
This report embeds all of the R code necessary to produce the results described in this report.
```{r, include=FALSE, set.seed(20)}
knitr::opts_chunk$set(cache = T)
# These will install required packages if they are not already installed
if (!require("ggplot2")) {
install.packages("ggplot2", dependencies = TRUE)
library(ggplot2)
}
if(!require("randomForest")) {
install.packages("randomForest", dependencies = TRUE)
library(randomForest)
}
if (!require("gplots")) {
install.packages("gplots", dependencies = TRUE)
library(gplots)
}
if (!require("knitr")) {
install.packages("knitr", dependencies = TRUE)
library(knitr)
}
if (!require("xtable")) {
install.packages("xtable", dependencies = TRUE)
library(xtable)
}
if (!require("pander")) {
install.packages("pander", dependencies = TRUE)
library(pander)
}
if (!require("devtools")) {
install.packages("devtools",dependencies = TRUE )
library("devtools")
}
if (!require("usethis")) {
install.packages("usethis" )
library("usethis")
}
if (!require("ggbiplot")) {
install_git("git://github.com/vqv/ggbiplot.git",dependencies = TRUE)
library("ggbiplot")
}
if (!require(reshape2)){
install.packages("reshape2", dependencies = TRUE)
library(reshape2)
}
if (!require(gridExtra)){
install.packages("gridExtra", dependencies = TRUE)
library(gridExtra)
}
if (!require(MASS)){
install.packages("MASS", dependencies = TRUE)
library(MASS)
}
if (!require(pROC)){
install.packages("pROC")
library(pROC)
}
if (!require(caret)){
install.packages("caret")
library(caret)
}
if (!require("glmnet")) {
install.packages("glmnet", dependencies = TRUE)
library(glmnet)
}
if(!require("Boruta")) {
install.packages("Boruta", dependencies = TRUE)
library(Boruta)
}
knitr::opts_chunk$set(echo = TRUE)
```
# Introduction
As a representitive of FIL consulting, I strive to provide thorough, understandable analytics to our clients. Having looked over the data provided by **CHEMS-R-US**, I have come to the conclusion that much of the features being analyzed were not helpful in providing a strong classification method for biodegradeable materials. Through isolation forest and random forest methods, I have sharply reduced required features from 168 to 38, and ultimately produced an 84 % accuracy on testing data. According to CodaLab I have correctly identified 81% of the false features.
# Data Description:
```{r}
set.seed(20)
data.df <- read.csv("~/MATP-4400/FinalProject/chems_train.data.csv", header=FALSE)
labels <- read.csv("~/MATP-4400/data/chems_train.solution.csv", header=F)[ ,1]
print(sprintf("Number of attributes: %s", ncol(data.df)))
print(sprintf("Number of biodegradeable samples: %s", length(labels[labels=="1"])))
print(sprintf("Number of non-biodegradeable samples: %s", length(labels[labels=="-1"])))
#Head while gives a visual representation floods the report with unneeded information
#Simply understand that the 168 features range from continuous, discrete and binary data.
#head(data.df)
```
## Data Overview
* There is a mix of binary, categorical and real data, so scaling is needed to make sure binary values are not overshadowed by numerical data with large ranges.
* There is a total of 1018 samples in the training set, of which which 329 are biodegradable, and 689 are not.
* There is a total of 437 samples in the external test set, without response variables.
* The data is represented by 168 features, of which some are *fake*.
## PCA
```{r}
set.seed(20)
pca_all <- prcomp(data.df, scale=FALSE, center=TRUE)
pca_all_scale <- prcomp(data.df, scale=TRUE, center=TRUE)
t<-max(abs(pca_all$x[,1:2]))
p <- ggbiplot(pca_all,
choices=c(1,2),
alpha=1,
var.axes = FALSE,
groups=as.factor(labels))
p<- p + ggtitle('Unscaled plot of PC1 & PC2') +
coord_cartesian(xlim=c(-t,t)/1000, ylim=c(-t,t)/1000)
p
t<-max(abs(pca_all_scale$x[,1:2]))
p <- ggbiplot(pca_all_scale,
choices=c(1,2),
alpha=1,
var.axes = FALSE,
groups=factor(labels, labels = c("No", "Yes")))
p<- p + ggtitle('Scaled BiPlot map for Biodegradability') +
coord_cartesian(xlim=c(-t,t), ylim=c(-t,t))
p
```
As you can see, pre-scaling you have 92.3% of the variance explained in PC1, while scaling sharply reduces said value to 5.8%. This shows that not scaling gives far too much emphasis on continuous data with large values.
*From now on, references to train and test data refer to the scaled training and test data derived from the internal training set.*
```{r}
set.seed(20)
n<- nrow(data.df)
#80-20 split
train.size = ceiling(0.8*n)
train <- sample(n,train.size)
train_data <- data.df[train,]
test_data <- data.df[-train,]
rownames(train_data) <- c(1:nrow(train_data))
rownames(test_data) <- c(1:nrow(test_data))
trainclass <- labels[train]
testclass <- labels[-train]
trainmatrix <- as.matrix(train_data)
testmatrix <- as.matrix(test_data)
sc_tr <- scale(trainmatrix, center=TRUE, scale=TRUE)
means <- attr(sc_tr, 'scaled:center') # get the mean of the columns
stdevs <- attr(sc_tr, 'scaled:scale') # get the std of the columns
sc_tst <- scale(testmatrix, center=means, scale=stdevs)#scale tst using the means and std of tr
```
# Baseline Results:
## Predefined functions for quick analysis of models
```{r}
my_prediction_lda <- function(lda, matrix, classes) {
pred <- predict(lda, matrix)
confusion.matrix<-table(classes,pred$class)
print(sprintf("Precision: %.1f %%", precision<-100* confusion.matrix[2,2]/(confusion.matrix[2,2]+confusion.matrix[2,1])))
print(sprintf("Recall: %.1f %%", recall <- 100* confusion.matrix[2,2]/(confusion.matrix[2,2]+confusion.matrix[1,2])))
print(sprintf("F1-score: %.1f %%", f1 <- 2 *(precision * recall)/(precision + recall)))
ranking <- as.numeric(pred$x)
myrocLR <- pROC::roc(classes,ranking, levels=c('-1','1'))
# compute AUC=Area under ROC curve
myaucLR <- pROC::auc(myrocLR)
print(myaucLR)
print(fisher.test(confusion.matrix))
print(confusion.matrix)
}
my_prediction_glm <- function(glm, dataframe, classes) {
mypredictlr<-predict(glm,dataframe,type="response")
pred <- as.integer(mypredictlr>0.5)
pred[pred==0] <- -1
confusion.matrix <- table(classes, pred)
print(sprintf("Precision: %.1f %%", precision<-100* confusion.matrix[2,2]/(confusion.matrix[2,2]+confusion.matrix[2,1])))
print(sprintf("Recall: %.1f %%", recall <- 100* confusion.matrix[2,2]/(confusion.matrix[2,2]+confusion.matrix[1,2])))
print(sprintf("F1-score: %.1f %%", f1 <- 2 *(precision * recall)/(precision + recall)))
aoc.classes <- classes
aoc.classes[aoc.classes==-1] <- 0
myrocLR <- pROC::roc(aoc.classes ,mypredictlr, levels=c('0','1'))
# compute AUC=Area under ROC curve
myaucLR <- pROC::auc(myrocLR)
print(myaucLR)
print(fisher.test(confusion.matrix))
print(confusion.matrix)
}
my_prediction_rf <- function(rf, matrix, classes) {
rf.pred <- predict(rf, matrix, type="response")
cm <- table(rf.pred, classes)
print(sprintf("Precision: %.1f %%", precision<-100* cm[2,2]/(cm[2,2]+cm[2,1])))
print(sprintf("Recall: %.1f %%", recall <- 100* cm[2,2]/(cm[2,2]+cm[1,2])))
print(sprintf("F1-score: %.1f %%", f1 <- 2 *(precision * recall)/(precision + recall)))
ranking <- as.numeric(as.character(rf.pred))
myrocLR <- pROC::roc(classes,ranking, levels=c('-1','1'))
# compute AUC=Area under ROC curve
myaucLR <- pROC::auc(myrocLR)
print(myaucLR)
print(fisher.test(cm))
print(cm)
}
```
## Linear Discrimination Analysis
```{r}
set.seed(20)
lda.fit <- lda(sc_tr, trainclass, prior=c(1,1)/2)
```
Training set F1-scores:
```{r}
set.seed(20)
my_prediction_lda(lda.fit, sc_tr, trainclass)
```
Test set F1-scores:
```{r}
set.seed(20)
my_prediction_lda(lda.fit, sc_tst, testclass)
```
## Logistic Regression
```{r}
set.seed(20)
sc_train.df<-as.data.frame(sc_tr)
my.data<-cbind(trainclass,sc_train.df)
my.data["trainclass"][my.data["trainclass"]==-1] <- 0
glmfit <- glm(trainclass~.,data=my.data,family="binomial")
```
Train Data F1-scores:
```{r}
set.seed(20)
my_prediction_glm(glmfit, as.data.frame(sc_tr), trainclass)
```
Test data F1-scores:
```{r}
set.seed(20)
my_prediction_glm(glmfit, as.data.frame(sc_tst), testclass)
```
## LDA Results
While there is a solid 86.3% F1-score on training data, the accuracy drops about 15% for the test set. This may suggest fitting to features that dont actually correlate with biodegradability, likely the random junk variables.
## LR Results
Likewise for linear regression, we have a strong train accuracy this time 93% and a more severe drop in test data to 65%.
# Feature Selection Results:
*Here I try both linear and non-linear approaches to feature selection.*
## LDA weights
Using the magnitude of the scalar weights assigned to each feature in LDA, I used various arbitrary cutoff points based around it's quadrants to filter through unneeded features. Unfortunately no threshhold seemed to result in significant changes in accuracy, which may be due to non-linear relationships between variables.
```{r}
set.seed(20)
feature.weights <- abs(lda.fit$scaling)
summary(feature.weights)
col.names <- rownames(feature.weights)
str.weights <- col.names[feature.weights>0.25]
trim.sc_trn <- sc_tr[, str.weights]
```
```{r}
set.seed(20)
lda.fit_tr <- lda(trim.sc_trn, trainclass, prior=c(1,1)/2)
my_prediction_lda(lda.fit_tr, trim.sc_trn, trainclass)
```
```{r}
set.seed(20)
my_prediction_lda(lda.fit_tr, sc_tst[,str.weights], testclass)
```
```{r}
set.seed(20)
my.data<-cbind(trainclass,as.data.frame(trim.sc_trn))
my.data["trainclass"][my.data["trainclass"]==-1] <- 0
# Fit logistic regression model using glm
# Formula uses all of the features
glmfit <- glm(trainclass~.,data=my.data,family="binomial")
```
```{r}
set.seed(20)
my_prediction_glm(glmfit, as.data.frame(trim.sc_trn), trainclass)
```
```{r}
set.seed(20)
my_prediction_glm(glmfit, as.data.frame(sc_tst[,str.weights]), testclass)
```
## Random Forest
Here I use Boruta and randomForest libraries to analyze features and ultimately produce a stronger classifier. Boruta selects relevant features given a threshhold p-value through iterating each attribute in comparison to shadow attributes, created by shuffling original ones. Here I used the maxRuns default of 100. Afterwards, a standard randomforest algorithm is applied as a classifier, using only confirmed attributes
```{r}
set.seed(20)
boruta <- Boruta(x=trainmatrix, y=trainclass, pValue = 0.01, doTrace = 0)
```
```{r}
set.seed(20)
v <- names(boruta$finalDecision[boruta$finalDecision=="Confirmed"])
trim.rf <- randomForest(x=trainmatrix[,v], y=as.factor(trainclass),
xtest=testmatrix[,v], ytest=as.factor(testclass),
ntree = 500, classwt=c(1,1)/2,
replace=TRUE, keep.forest=TRUE)
```
```{r}
set.seed(20)
my_prediction_rf(trim.rf, trainmatrix[,v], trainclass)
```
```{r}
set.seed(20)
my_prediction_rf(trim.rf, testmatrix[,v], testclass)
```
Due to the nature of decision trees, a 100% accuracy on the training set is anticipated. However, the 90% F1-score on the test set is a massive improvement compared to other feature reduction methods as well as outperforms the baseline results.
My challenge ID is **figuej3** with an AUC score of 0.84 for prediction and 0.81 for feature selection.
The contest can be found here:
<!-- https://competitions.codalab.org/competitions/21678?secret_key=c29154f0-213e-43ea-8e01-4eef00f9d123 -->
https://competitions.codalab.org/competitions/22460
- `classification.csv`: Test target values (437 lines x 1 column)
- `selection.csv`: Solution indicating which variables are real and which are fake (168 lines x 1 column)
https://competitions.codalab.org/competitions/22460
```{r}
set.seed(20)
# Writing output to submit to contest
# Make a prediction of all class -1. You will need to replace this with your prediction
# NOTE: nrow needs to be the length of the test set!!!
blindtest <- read.csv("~/MATP-4400/FinalProject/chems_test.data.csv", header=FALSE)
blindtest <- blindtest[,v]
blind.pred <- predict(trim.rf, blindtest)
ypred<-matrix(-1,nrow=437,437,ncol=1)
ypred[blind.pred=="1", 1] <- 1
#default is to predict all features
featurepred<-matrix(0,nrow=168,168,ncol=1)
#for (i in 1:length(rf.str)) {
# index <- as.numeric(strsplit(rf.str[i], "V")[[1]][2])
# featurepred[index,1] <- 1
#}
featurepred[boruta$finalDecision=="Confirmed",1] <- 1
# Here is a sample file for submission to the website
write.table(ypred,file = "classification.csv", row.names=F, col.names=F)
write.table(featurepred,file = "selection.csv", row.names=F, col.names=F)
system("zip my_test.csv.zip classification.csv selection.csv")
```
# Comparison of Methods
```{r}
data <- matrix(c(0.68, 0.02, -1, -1, 0.23, 0.83, 0.84, 0.81, 0.71, 0.29, 0.04, 0.39),
ncol = 2, byrow = TRUE)
colnames(data) <- c("Classification Score", "Selection Score")
rownames(data) <- c("Zak S", "Jim", "Matthew G", "Jose F", "Qianjun C", "Miranda K")
table <- as.table(data)
kable(table, type="html",digits = 2)
```
```{r}
set.seed(20)
rf.pred <- predict(trim.rf, testmatrix[,v], type="response")
ranking<-as.numeric(as.character(rf.pred)) # Compute ROC
myroc <- pROC::roc(testclass,ranking, levels=c('-1','1'))
myauc <- pROC::auc(myroc)
mattgranking<-read.csv("~/idm_work/mattg.csv")[,1]
jimranking<-read.csv("~/idm_work/lasso_jim.csv")[,1]
chenranking<-read.csv("~/idm_work/chenq8.csv")[,1]
mirandaranking<-read.csv("~/idm_work/mirandak.csv")[,1]
zakranking<-read.csv("~/idm_work/ZakS_LR_Akaike.csv")[,1]
mattgroc <- pROC::roc(testclass,mattgranking)
jimroc <- pROC::roc(testclass,jimranking)
chenroc <- pROC::roc(testclass,chenranking)
mirandaroc <- pROC::roc(testclass,mirandaranking[1:203])
zakroc <- pROC::roc(testclass,zakranking[1:203])
ggroc(list(JoseRF = myroc,
MattLR = mattgroc,
JimLasso = jimroc,
Chen_LDA = chenroc,
Miranda_LDA = mirandaroc,
Zak_LR_Akaike = zakroc)) +
labs(color = "Method") +
ggtitle("Comparison of test accurcy on Chems-R-Us data")
```
```{r}
write.csv(ranking,"jose.csv",row.names=FALSE)
```
In our case, random forest worked best for both feature selection and prediction. This suggests the data has alot of non-linear relationships when predicting biodegradability. Its also good at predicting a large amount of irrelevant data (*in this case fake variables*). Linear models were not stringent enough and would only a small amount of the features, where RF removed 130 of the 138, our team member Matthew achieving an 83% AUC with this amount.
# Additional Analysis
Linear models are often confused by features that have strong correlations, it may be helpful to omit some if they add nothing of value.
```{r}
library(RColorBrewer)
corr <- cor(scale(data.df[,v]))
heatmap.2(corr,
Rowv = F,
Colv = F,
trace="none",
dendrogram="none",
main="Strongly correlated features")
```
# Conclusion
Overall, we as a group concluded a maximum fake variable analysis of 83% and a maximum biodegradability prediction of 84%. Linear models showed lackluster results, which shows that nonlinear relationships between features more accurately correlate with biodegradability. While some further analysis is needed in the 130 features we selected as non-relevant, most can be safely discarded.