forked from lmwalkowicz/KeplerML
feature_key.txt
#Author: Lucianne Walkowicz and Daniel Giles, from code written by Revant Nayar
#Code snippets are currently shown, with the intent of turning these into written descriptions of each feature.
#Code snippets are not intended to work out of the box and are shown for illustration purposes only, though they should be usable with minor modification.
numpdc - this is the PDC_SAP_FLUX array cleaned of NaNs. Being replaced by 'f' to simplify this document.
corrpdc - this is the trend-corrected PDC_SAP_FLUX; the long-term linear trend of a given lightcurve is removed.
5/31/18 (Daniel): This is utilized for some features, but not for others. As of this note, most feature derivations use a normalized flux, which divides the lightcurve by its median flux value.
The index 'i' refers to an individual lightcurve, 'j' refers to a single observation within a lightcurve.
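A minimal sketch of how the arrays described above might be prepared for a single lightcurve. The helper name prepare_lightcurve, the toy data, and the choice to add the median back after subtracting the linear fit are assumptions made for illustration; they are not taken from the KeplerML code.

```python
import numpy as np

def prepare_lightcurve(time, flux):
    """Hypothetical helper: clean, normalize and detrend one lightcurve."""
    time = np.asarray(time, dtype=float)
    flux = np.asarray(flux, dtype=float)
    good = np.isfinite(time) & np.isfinite(flux)   # drop NaN observations
    t, f = time[good], flux[good]
    nf = f / np.median(f)                          # median-normalized flux
    slope, intercept = np.polyfit(t, nf, 1)        # long-term linear trend
    # Assumption: subtract the fitted line but keep the median flux level.
    corr = nf - (slope * t + intercept) + np.median(nf)
    return t, nf, corr, slope, intercept

# Toy lightcurve: a pure linear brightening with one missing observation.
t_raw = np.arange(10.0)
f_raw = 100.0 + 2.0 * t_raw
f_raw[3] = np.nan
t, nf, corr, slope, intercept = prepare_lightcurve(t_raw, f_raw)
```

For this toy input the detrended flux is flat, since the only variation is the linear trend that was removed.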
Features:
1 longtermtrend - Linear fit coefficient (slope); does the object get brighter or dimmer on average over time?
2 meanmedrat - ratio between the mean and the median, mean/median
3 skews - The skew of the distribution of fluxes, positive value indicates flux is skewed to values greater than the median value. A normal distribution has a skew of zero.
4 varss - variance of the flux
5 coeffvar - coefficient of variability, the ratio of the standard deviation to the mean of the flux
6 stds - standard deviation of the flux
7 numout1s - number of flux values more than 1 sigma from the mean
8 numnegoutliers - number of flux values more than 4 sigma below the mean
9 numposoutliers - number of flux values more than 4 sigma above the mean
10 numoutliers - total number of flux values more than 4 sigma away from the mean
11 kurt - kurtosis, a measure of the sharpness of the peak of the distribution of fluxes. 1/N*sum((nf-nf_mean)^4)/stds^4. Note that scipy.stats.kurtosis, which the snippet for this feature uses, returns the excess kurtosis by default (this quantity minus 3).
12 mad - Median Absolute Deviation, the median of the absolute differences from the median flux
13 maxslope - the greatest positive slope between two points, dnf/dt. Uses the 99th percentile rather than the actual largest.
14 minslope - the most negative slope between two points. Uses the 1st percentile rather than the actual smallest.
15 meanpslope - the mean of positive slopes
16 meannslope - the mean of negative slopes (stored with the sign flipped, i.e. as a positive number, in the snippet for this feature)
17 g_asymm - ratio of the mean of positive slopes to the mean of negative slopes (defaults to 10 if the mean of negative slopes is 0, i.e. there are no negative slopes)
18 rough_g_asymm - ratio of the number of positive slopes to the number of negative slopes (10 if no negative slopes)
19 diff_asymm - difference between the mean of the positive slopes and the absolute mean of the negative slopes
20 skewslope - skew of the distribution of slopes
21 meanabsslope - mean of the absolute slopes
22 varabsslope - variance of the absolute value of the slopes
23 varslope - variance of the slopes
24 absmeansecder - absolute mean of the second derivative
25 num_pspikes - Number of positive spikes as defined by a positive slope 3 sigma greater than the mean positive slope.
26 num_nspikes - Number of negative spikes as defined by a negative slope 3 sigma smaller than the mean negative slope.
27 num_psdspikes - number of positive second derivative spikes (>+4 sigma)
28 num_nsdspikes - number of negative second derivative spikes (<-4 sigma)
29 stdratio - ratio of the standard deviation of the positive slopes to the standard deviation of the negative slopes (10 if negative standard deviation is zero)
30 pstrend - pair slope trend, ratio of positive slopes with a subsequent positive slope to the total number of slopes (N-1)
31 num_zcross - Number of 'zero' crossings, accounts for longterm trend, so really the number of longterm trendline crossings
32 num_pm - number of 'plus-minus' slope switches (where slope switches from positive to negative)
33 len_nmax - number of naive maxima, where a maximum is the largest value within 10 points on either side
34 len_nmin - number of naive minima, where a minimum is the smallest value within 10 points on either side
35 mautocorrcoef - Auto-correlation coefficient of one maximum to the next - np.corrcoef(naivemax[:-1],naivemax[1:])[0][1]
36 ptpslopes - peak-to-peak slopes, the mean absolute slope between successive naive maxima
37 periodicity - Coefficient of variability for time-differences, ratio of the standard deviation to the mean of time-difference between maxima
38 periodicityr - coefficient of variability for time-differences of maxima using residuals
sum(abs(t_maxima_diff-np.mean(t_maxima_diff)))/np.mean(t_maxima_diff)
39 naiveperiod - mean of the time-differences between naive maxima
40 maxvars - coefficient of variation of the maxima, ratio of the standard deviation to the mean of the naive maxima flux values
41 maxvarsr - coefficient of variation of maxima flux values using residuals instead of standard deviation
42 oeratio - ratio of the mean of the odd-numbered naive minima to the mean of the even-numbered naive minima (flux values)
43 amp_2 - 2 times the amplitude based on 1st and 99th percentile
44 normamp - normalized amplitude (amp_2/mean)
45 mbp - median buffer percentile, fraction of points within 20% of the amplitude (0.1*amp_2) of the median
46 mid20 - ratio of flux percentiles (60th to 40th) over (95th to 5th)
47 mid35 - "" (67th to 32nd) ""
48 mid50 - "" (75th to 25th) ""
49 mid65 - "" (82nd to 17th) ""
50 mid80 - "" (90th to 10th) ""
51 percentamp - Largest difference between the max or min flux and the median (as a percentage of the median)
52 magratio - ratio of the maximum flux value minus the median flux to the amplitude (99th minus 1st percentile of the flux)
53 autocorrcoef - auto-correlation coefficient of the flux from one point to the next
54 sautocorrcoef - auto-correlation coefficient of the slopes from one to the next
55 flatmean - mean 'flatness' around naive maxima. 'Flatness' defined as average absolute value of 6 slopes on either side of each maximum.
56 tflatmean - mean 'flatness' around naive minima
57 roundmean - mean 'roundness' around naive maxima. 'Roundness' defined as average of second derivatives on either side of each maximum
58 troundmean - mean 'roundness' around naive minima
59 roundrat - ratio of roundness of maxima to roundness of minima
60 flatrat - ratio of flatness of maxima to flatness of minima
longtermtrend,
longtermtrend=np.polyfit(t, f, 1)[0]
meanmedrat,
"""ratio between mean and median flux"""
meanmedrat=np.mean(f)/np.median(f)
skews,
"""skew of the data, <0 indicates greater spread less than the median"""
skews=scipy.stats.skew(f)
varss,
"""Variance of the flux"""
varss=np.var(f)
coeffvar,
"""Coeff of variability"""
coeffvar=np.std(f)/np.mean(f)
stds,
"""Standard deviation"""
stds=np.std(f)
numoutliers, numnegoutliers, numposoutliers,
"""Fluxes beyond 4 sigma"""
mean=np.mean(f)
std=np.std(f)
outliers=[f[j] for j in range(len(f)) if (f[j]>mean+4*std) or (f[j]<mean-4*std)]
numoutliers=len(outliers)
negoutliers=[f[j] for j in range(len(f)) if (f[j]<mean-4*std)]
numnegoutliers=len(negoutliers)
posoutliers=[f[j] for j in range(len(f)) if (f[j]>mean+4*std)]
numposoutliers=len(posoutliers)
numout1s,
"""Number of observations with flux outside of 1-sigma from the mean"""
out1std=[f[j] for j in range(len(f)) if (f[j]>mean+std) or (f[j]<mean-std)]
numout1s=len(out1std)
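The sigma-based counts above can also be written with boolean masks instead of list comprehensions. This is a sketch for a single lightcurve; the function name sigma_counts and the toy flux array are hypothetical.

```python
import numpy as np

def sigma_counts(f, n_sigma=4.0):
    """Count flux values beyond 1 sigma and beyond n_sigma of the mean."""
    f = np.asarray(f, dtype=float)
    mean, std = np.mean(f), np.std(f)
    return {
        "numout1s": int(np.sum(np.abs(f - mean) > std)),
        "numnegoutliers": int(np.sum(f < mean - n_sigma * std)),
        "numposoutliers": int(np.sum(f > mean + n_sigma * std)),
        "numoutliers": int(np.sum(np.abs(f - mean) > n_sigma * std)),
    }

# Toy flux: constant apart from a single bright point.
flux = np.ones(100)
flux[0] = 50.0
counts = sigma_counts(flux)
```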
5/31/2018 Bookmark - TODO: simplify all following code definitions
kurt, """kurtosis"""
kurt=np.zeros(len(numpdc))
for i in range(len(numpdc)):
    kurt[i]=scipy.stats.kurtosis(numpdc[i])
mad, from Richards et al.
"""Median Absolute Deviation (MAD)"""
mad=np.zeros(len(numpdc))
for i in range(len(numpdc)):
    mad[i]=np.median([abs(numpdc[i][j]-medians[i]) for j in range(len(numpdc[i]))])
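For a single lightcurve, both quantities can be sketched directly in NumPy. The kurtosis here follows the formula given in the feature list; subtracting 3 gives the excess kurtosis, which is what scipy.stats.kurtosis returns by default. The function name mad_and_kurtosis is hypothetical.

```python
import numpy as np

def mad_and_kurtosis(f):
    """Median absolute deviation and kurtosis of one flux array."""
    f = np.asarray(f, dtype=float)
    mad = np.median(np.abs(f - np.median(f)))
    # 1/N * sum((f - mean)^4) / std^4, as written in the feature list
    kurt = np.mean((f - np.mean(f)) ** 4) / np.std(f) ** 4
    return mad, kurt, kurt - 3.0   # last value: excess (Fisher) kurtosis

mad, kurt, excess_kurt = mad_and_kurtosis([1.0, 2.0, 3.0, 4.0, 5.0])
```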
maxslope, minslope,
slopes=[0]*(len(numpdc))
for i in range(len(numpdc)):
    slopes[i]=[(numpdc[i][j+1]-numpdc[i][j])/(numtime[i][j+1]-numtime[i][j]) for j in range (len(numpdc[i])-1)]
"""mean slope- long term trend """
meanslope=np.zeros(len(numpdc))
for i in range(len(numpdc)):
    meanslope[i]=np.mean(slopes[i])
"""max and min slopes"""
maxslope=np.zeros(len(numpdc))
minslope=np.zeros(len(numpdc))
for i in range(len(numpdc)):
    maxslope[i]=np.percentile(slopes[i],99)
    minslope[i]=np.percentile(slopes[i],1)
meanpslope,
meanpslope[i]=np.mean(pslope[i])
meannslope,
meannslope[i]=-np.mean(nslope[i])
g_asymm,
g_asymm[i]=meanpslope[i]/meannslope[i]
rough_g_asymm,
rough_g_asymm[i]=len(pslope[i])/len(nslope[i])
diff_asymm,
diff_asymm[i]=meanpslope[i]-abs(meannslope[i])
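The snippets above use pslope and nslope without defining them, and do not show the fallback value of 10 mentioned in the feature list. The following is a sketch of one way these could fit together for a single lightcurve. Treating zero slopes as neither positive nor negative, and returning 0 for an empty mean, are assumptions; the function name slope_asymmetry is hypothetical.

```python
import numpy as np

def slope_asymmetry(t, f, default=10.0):
    """g_asymm, rough_g_asymm and diff_asymm for one lightcurve."""
    t = np.asarray(t, dtype=float)
    f = np.asarray(f, dtype=float)
    slopes = np.diff(f) / np.diff(t)
    pslope = slopes[slopes > 0]          # positive slopes
    nslope = slopes[slopes < 0]          # negative slopes
    meanpslope = np.mean(pslope) if pslope.size else 0.0
    # Sign flipped, as in the snippet above, so this is a positive number.
    meannslope = -np.mean(nslope) if nslope.size else 0.0
    g_asymm = meanpslope / meannslope if meannslope != 0 else default
    rough_g_asymm = pslope.size / nslope.size if nslope.size else default
    diff_asymm = meanpslope - abs(meannslope)
    return g_asymm, rough_g_asymm, diff_asymm

up_down = slope_asymmetry([0, 1, 2, 3], [0, 2, 1, 3])   # slopes: 2, -1, 2
rising = slope_asymmetry([0, 1, 2, 3], [0, 1, 2, 3])    # no negative slopes
```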
skewslope,
"""skew slope- hope of asymmetry"""
skewslope=np.zeros(len(numpdc))
for i in range(len(numpdc)):
    skewslope[i]=scipy.stats.skew(corrslopes[i])
varabsslope,
varslope,
meanabsslope,
"""Abs slopes"""
absslopes=[0]*len(numpdc)
for i in range(len(numpdc)):
    absslopes[i]= [abs(corrslopes[i][j]) for j in range(len(corrslopes[i]))]
"""varabsslope and meanabsslope"""
varabsslope=np.zeros(len(numpdc))
meanabsslope=np.zeros(len(numpdc))
varabsslope=[np.var(absslopes[i]) for i in range(len(numpdc))]
meanabsslope=[np.mean(absslopes[i]) for i in range(len(numpdc))]
absmeansecder,
abssecder=[0]*(len(numpdc))
for i in range(len(numpdc)):
    abssecder[i]=[abs((slopes[i][j]-slopes[i][j-1])/((numtime[i][j+1]-numtime[i][j])/2+(numtime[i][j]-numtime[i][j-1])/2)) for j in range (1, len(slopes[i])-1)]
absmeansecder=np.zeros(len(numpdc))
for i in range(len(numpdc)):
    absmeansecder[i]=np.mean(abssecder[i])
"""var slope"""
varslope=np.zeros(len(numpdc))
varslope=[np.var(slopes[i]) for i in range(len(slopes))]
num_pspikes,
num_nspikes,
num_sdspikes (num_psdspikes in the feature list),
num_sdspikes2 (num_nsdspikes in the feature list),
stdratio,
"""corrsecders"""
corrsecder=[0]*len(numpdc)
for i in range(len(numpdc)):
    corrsecder[i]=[(corrslopes[i][j]-corrslopes[i][j-1])/((numtime[i][j+1]-numtime[i][j])/2+(numtime[i][j]-numtime[i][j-1])/2) for j in range (1, len(corrpdc[i])-1)]
"""as regards periodicity in general, there can exist many levels"""
"""Num_spikes- you can also isolate transits from cataclysmics using periodicity of spikes
take ratios of roundnesses or multiply them, """
stdratio=np.zeros(len(numpdc))
for i in range(len(numpdc)):
    pslopestds[i]=np.std(pslope[i])
    nslopestds[i]=np.std(nslope[i])
    sdstds[i]=np.std(corrsecder[i])
    meanstds[i]=np.mean(corrsecder[i])
    stdratio[i]=pslopestds[i]/nslopestds[i]
"""
for i in range(len(numpdc)):
    pspikes[i]=[corrslopes[i][j] for j in range(len(corrslopes[i])) if corrslopes[i][j]>=3*slopestds[i]]
    nspikes[i]=[corrslopes[i][j] for j in range(len(corrslopes[i])) if corrslopes[i][j]<=3*slopestds[i]]
    sdspikes[i]=[corrsecder[i][j] for j in range(len(corrsecder[i])) if corrsecder[i][j]>=4*sdstds[i]]
"""
for i in range(len(numpdc)):
    pspikes[i]=[corrslopes[i][j] for j in range(len(corrslopes[i])) if corrslopes[i][j]>=meanpslope[i]+3*pslopestds[i]]
    nspikes[i]=[corrslopes[i][j] for j in range(len(corrslopes[i])) if corrslopes[i][j]<=meannslope[i]-3*nslopestds[i]]
    sdspikes[i]=[corrsecder[i][j] for j in range(len(corrsecder[i])) if corrsecder[i][j]>=4*sdstds[i]]
    sdspikes2[i]=[corrsecder[i][j] for j in range(len(corrsecder[i])) if corrsecder[i][j]<=-4*sdstds[i]]
"""change around the 4 and add the min condition along with sdspike
to look for transits"""
num_pspikes=np.zeros(len(numpdc))
num_nspikes=np.zeros(len(numpdc))
num_sdspikes=np.zeros(len(numpdc))
num_sdspikes2=np.zeros(len(numpdc))
for i in range(len(numpdc)):
    num_pspikes[i]=len(pspikes[i])
    num_nspikes[i]=len(nspikes[i])
    num_sdspikes[i]=len(sdspikes[i])
    num_sdspikes2[i]=len(sdspikes2[i])
pstrend, """pair slope trend"""
pstrend=np.zeros(len(numpdc))
for i in range(len(numpdc)):
    pstrend[i]=len([slopes[i][j] for j in range(len(slopes[i])-1) if (slopes[i][j]>0) & (slopes[i][j+1]>0)])/len(slopes[i])
num_zcross,
"""Zero crossings- accounted for ltt, plot with gasymm"""
zcrossind=[]
for i in range(len(numpdc)):
    ltt=longtermtrend[i]
    yoff=y_offset[i]
    zcrossind.append([j for j in range(len(numpdc[i])-1) if (ltt*numtime[i][j+1]+ yoff-numpdc[i][j+1])*(ltt*numtime[i][j]+yoff-numpdc[i][j])<0])
num_zcross=np.zeros(len(numpdc))
for i in range(len(numpdc)):
    num_zcross[i]=len(zcrossind[i])
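The crossing count above depends on y_offset, which is not defined in this key. A sketch for a single lightcurve, assuming y_offset is the intercept of the same first-order np.polyfit fit that gives longtermtrend (the function name count_trend_crossings is hypothetical):

```python
import numpy as np

def count_trend_crossings(t, f):
    """Number of times the flux crosses its fitted long-term trend line."""
    t = np.asarray(t, dtype=float)
    f = np.asarray(f, dtype=float)
    ltt, yoff = np.polyfit(t, f, 1)      # slope and intercept (assumed y_offset)
    resid = (ltt * t + yoff) - f         # trend line minus flux
    # A crossing occurs where consecutive residuals have opposite signs.
    return int(np.sum(resid[:-1] * resid[1:] < 0))

# Toy flux that alternates above and below a flat trend.
n_cross = count_trend_crossings(np.arange(5.0), [1.0, -1.0, 1.0, -1.0, 1.0])
```

Points lying exactly on the trend line give a zero product and are not counted, matching the strict inequality in the snippet above.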
num_pm,
"""pm"""
plusminus=[0]*len(numpdc)
for i in range(len(numpdc)):
    plusminus[i]=[j for j in range(1,len(slopes[i])) if (slopes[i][j]<0)&(slopes[i][j-1]>0)]
num_pm=np.zeros(len(numpdc))
num_pm=[len(plusminus[i]) for i in range(len(numpdc))]
len_nmax
"""naive maxima and corresponding time values you can do it with 5 or 10 or something else, 1 or two largest"""
naivemaxes=[0]*len(numpdc)
nmax_times=[0]*len(numpdc)
maxinds=[0]*len(numpdc)
maxerr=[0]*len(numpdc)
for i in range(len(numpdc)):
    naivemaxes[i]=[corrpdc[i][j] for j in range (len(numpdc[i])) if corrpdc[i][j] in heapq.nlargest(1, corrpdc[i][max(j-10,0):min(j+10, len(numpdc[i])-1): 1])]
    nmax_times[i]=[numtime[i][j] for j in range (len(numpdc[i])) if corrpdc[i][j] in heapq.nlargest(1, corrpdc[i][max(j-10,0):min(j+10, len(numpdc[i])-1): 1])]
    maxinds[i]=[j for j in range (len(numpdc[i])) if corrpdc[i][j] in heapq.nlargest(1, corrpdc[i][max(j-10,0):min(j+10, len(numpdc[i])-1): 1])]
    maxerr[i]=[err[i][j] for j in maxinds[i]]
"""numbers of naive maxima"""
len_nmax=np.zeros(len(numpdc))
for i in range(len(numpdc)):
    len_nmax[i]=len(naivemaxes[i])
len_nmin - """numbers of naive minima"""
len_nmin=np.zeros(len(numpdc))
for i in range(len(numpdc)):
    len_nmin[i]=len(naivemins[i])
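naivemins is used above but its construction is not shown. A sketch of a windowed search that returns both naive maxima and naive minima for one lightcurve follows. The function name naive_extrema is hypothetical, and the window here includes the point at j+half_width and the last point of the array, which differs slightly from the slice used in the heapq snippet above.

```python
import numpy as np

def naive_extrema(f, half_width=10):
    """Indices of points that are the largest / smallest within
    half_width points on either side."""
    f = np.asarray(f, dtype=float)
    max_idx, min_idx = [], []
    for j in range(len(f)):
        lo = max(j - half_width, 0)
        hi = min(j + half_width + 1, len(f))   # assumption: inclusive window
        window = f[lo:hi]
        if f[j] == window.max():
            max_idx.append(j)
        if f[j] == window.min():
            min_idx.append(j)
    return np.array(max_idx, dtype=int), np.array(min_idx, dtype=int)

# Toy flux with a small window so the result is easy to check by hand.
maxinds, mininds = naive_extrema([0, 1, 0, 2, 0, 1, 0], half_width=1)
```

As in the original approach, the first and last points can qualify as extrema because their windows are truncated at the array edges.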
mautocorrcoef - """Auto-correlation function of one maximum to next-good clustering"""
autopdcmax=[0]*len(numpdc)
for i in range(len(numpdc)):
    autopdcmax[i]=[naivemaxes[i][j+1] for j in range(len(naivemaxes[i])-1)]
mautocovs=np.zeros(len(numpdc))
mautocorrcoef=np.zeros(len(numpdc))
for i in range(len(numpdc)):
    mautocorrcoef[i]=np.corrcoef(naivemaxes[i][:-1:], autopdcmax[i])[0][1]
    mautocovs[i]=np.cov(naivemaxes[i][:-1:],autopdcmax[i])[0][1]
ptpslopes - """peak to peak slopes"""
ptpslopes=np.zeros(len(numpdc))
ppslopes=[0]*len(numpdc)
for i in range(len(numpdc)):
    ppslopes[i]=[abs((naivemaxes[i][j+1]-naivemaxes[i][j])/(nmax_times[i][j+1]-nmax_times[i][j])) for j in range(len(naivemaxes[i])-1)]
for i in range(len(numpdc)):
    ptpslopes[i]=np.mean(ppslopes[i])
periodicity, periodicityr, naiveperiod
"""Variation coefficient of time difference between successive maxima- periodicity?"""
maxdiff=[0]*(len(numpdc))
for i in range(len(numpdc)):
    maxdiff[i]=[nmax_times[i][j+1]-nmax_times[i][j] for j in range(len(naivemaxes[i])-1)]
periodicity=np.zeros(len(numpdc))
periodicityr=np.zeros(len(numpdc))
for i in range(len(numpdc)):
    periodicity[i]=np.std(maxdiff[i])/np.mean(maxdiff[i])
    periodicityr[i]=np.sum(abs(maxdiff[i]-np.mean(maxdiff[i])))/np.mean(maxdiff[i])
naiveperiod=np.zeros(len(numpdc))
for i in range(len(numpdc)):
    naiveperiod[i]=np.mean(maxdiff[i])
maxvars & maxvarsr - """variation coefficient of the maxima"""
maxvars=np.zeros(len(numpdc))
maxvarsr=np.zeros(len(numpdc))
for i in range(len(numpdc)):
    maxvars[i]=np.std(naivemaxes[i])/np.mean(naivemaxes[i])
    maxvarsr[i]=np.sum(abs(naivemaxes[i]-np.mean(naivemaxes[i])))/np.mean(naivemaxes[i])
oeratio,
omin=[0]*len(numpdc)
emin=[0]*len(numpdc)
meanomin=np.zeros(len(numpdc))
meanemin=np.zeros(len(numpdc))
for i in range(len(numpdc)):
    emin[i]=[naivemins[i][j] for j in range(len(naivemins[i])) if j%2==0]
    omin[i]=[naivemins[i][j] for j in range(len(naivemins[i])) if j%2!=0]
"""local secder dip"""
for i in range(len(numpdc)):
    meanemin[i]=np.mean(emin[i])
    meanomin[i]=np.mean(omin[i])
"""plt.scatter(meanomin, meanemin)"""
oeratio=np.zeros(len(numpdc))
for i in range(len(numpdc)):
    oeratio[i]=meanomin[i]/meanemin[i]
amp_2 & normamp:
amp_2=np.zeros(len(numpdc))
amp =np.zeros(len(numpdc))
for i in range(len(numpdc)):
    amp[i]=np.percentile(numpdc[i],99)-np.percentile(numpdc[i],1)
    amp_2[i]=np.percentile(corrpdc[i],99)-np.percentile(corrpdc[i],1)
normnaiveamp=np.zeros(len(numpdc))
normamp=np.zeros(len(numpdc))
for i in range(len(numpdc)):
    normnaiveamp[i]=naive_amp_2[i]/np.mean(numpdc[i])
    normamp[i]=amp_2[i]/np.mean(numpdc[i])
mbp - Median Buffer Percentile
for i in range(len(numpdc)):
    mbp[i]=len([numpdc[i][j] for j in range(len(numpdc[i])) if (numpdc[i][j]<(medians[i]+0.1*amp_2[i])) & (numpdc[i][j]>(medians[i]-0.1*amp_2[i]))])/len(numpdc[i])
mid20,
f4060[i]=np.percentile(numpdc[i], 60)-np.percentile(numpdc[i], 40)
f595[i]=np.percentile(numpdc[i],95)-np.percentile(numpdc[i],5)
mid20[i]=f4060[i]/f595[i]
mid35,
f595[i]=np.percentile(numpdc[i],95)-np.percentile(numpdc[i],5)
f3267[i]=np.percentile(numpdc[i], 67)-np.percentile(numpdc[i], 32)
mid35[i]=f3267[i]/f595[i]
mid50,
f2575[i]=np.percentile(numpdc[i], 75)-np.percentile(numpdc[i], 25)
f595[i]=np.percentile(numpdc[i],95)-np.percentile(numpdc[i],5)
mid50[i]=f2575[i]/f595[i]
mid65,
f1782[i]=np.percentile(numpdc[i], 82)-np.percentile(numpdc[i], 17)
f595[i]=np.percentile(numpdc[i],95)-np.percentile(numpdc[i],5)
mid65[i]=f1782[i]/f595[i]
mid80,
f1090[i]=np.percentile(numpdc[i],90)-np.percentile(numpdc[i],10)
f595[i]=np.percentile(numpdc[i],95)-np.percentile(numpdc[i],5)
mid80[i]=f1090[i]/f595[i]
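The five mid-percentile ratios share the same 5th-to-95th percentile denominator, so for a single lightcurve they could be computed in one pass. This is a sketch only; the function name mid_ratios and the toy flux are hypothetical.

```python
import numpy as np

def mid_ratios(f):
    """mid20 ... mid80 for one flux array, each divided by the 5th-95th range."""
    f = np.asarray(f, dtype=float)
    f595 = np.percentile(f, 95) - np.percentile(f, 5)
    bounds = {
        "mid20": (40, 60),
        "mid35": (32, 67),
        "mid50": (25, 75),
        "mid65": (17, 82),
        "mid80": (10, 90),
    }
    return {name: (np.percentile(f, hi) - np.percentile(f, lo)) / f595
            for name, (lo, hi) in bounds.items()}

# Uniformly spaced toy flux, so the p-th percentile is simply p/100.
ratios = mid_ratios(np.linspace(0.0, 1.0, 101))
```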
percentamp,
for i in range(len(numpdc)):
    percentamp[i]=max([(corrpdc[i][j]-medians[i])/medians[i] for j in range(len(corrpdc[i]))])
magratio,
magratio=[(max(numpdc[i])-medians[i])/amp[i] for i in range(len(numpdc))]
sautocorrcoef,
sautopdc=[0]*len(slopes)
for i in range(len(slopes)):
    sautopdc[i]=[slopes[i][j+1] for j in range(len(slopes[i])-1)]
sautocovs=np.zeros(len(slopes))
sautocorrcoef=np.zeros(len(slopes))
for i in range(len(slopes)):
    sautocorrcoef[i]=np.corrcoef(slopes[i][:-1:], sautopdc[i])[0][1]
    sautocovs[i]=np.cov(slopes[i][:-1:],sautopdc[i])[0][1]
autocorrcoef,
autopdc=[0]*len(numpdc)
for i in range(len(numpdc)):
    autopdc[i]=[numpdc[i][j+1] for j in range(len(numpdc[i])-1)]
autocovs=np.zeros(len(numpdc))
autocorrcoef=np.zeros(len(numpdc))
for i in range(len(numpdc)):
    autocorrcoef[i]=np.corrcoef(numpdc[i][:-1:], autopdc[i])[0][1]
    autocovs[i]=np.cov(numpdc[i][:-1:],autopdc[i])[0][1]
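Both auto-correlation features correlate a series with itself shifted by one sample, so the shifted copy can be taken with slicing rather than built in a separate list. A sketch for a single series (the function name lag1_autocorr is hypothetical):

```python
import numpy as np

def lag1_autocorr(x):
    """Correlation coefficient between x[j] and x[j+1]."""
    x = np.asarray(x, dtype=float)
    return np.corrcoef(x[:-1], x[1:])[0, 1]

# A steadily rising series is perfectly correlated with its shifted copy;
# a strictly alternating series is perfectly anti-correlated.
rising_corr = lag1_autocorr(np.arange(10.0))
alternating_corr = lag1_autocorr([1.0, -1.0] * 5)
```

Passing the flux gives autocorrcoef, and passing the slopes gives sautocorrcoef.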
flatmean,
for i in range(len(numpdc)):
    flatmean[i]=np.nansum(flatness[i])/len(flatness[i])
tflatmean,
for i in range(len(numpdc)):
    tflatmean[i]=np.nansum(tflatness[i])/len(tflatness[i])
roundmean,
for i in range(len(numpdc)):
    roundmean[i]=np.nansum(roundness[i])/len(roundness[i])
troundmean,
for i in range(len(numpdc)):
    troundmean[i]=np.nansum(troundness[i])/len(troundness[i])
roundrat,
for i in range(len(numpdc)):
    roundrat[i]=roundmean[i]/troundmean[i]
flatrat
for i in range(len(numpdc)):
    flatrat[i]=flatmean[i]/tflatmean[i]
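The flatness and roundness arrays used above are not constructed in this key. Going only by the definitions in the feature list, one possible reading for a single lightcurve is sketched below. The function name flatness_roundness, the exact window of second derivatives, and the toy parabola are assumptions rather than the original implementation.

```python
import numpy as np

def flatness_roundness(t, f, extrema_idx, n=6):
    """Per-extremum flatness (mean |slope|) and roundness (mean second
    derivative) using n slopes on either side of each extremum."""
    t = np.asarray(t, dtype=float)
    f = np.asarray(f, dtype=float)
    slopes = np.diff(f) / np.diff(t)                      # slopes[j]: points j -> j+1
    secder = np.diff(slopes) / ((t[2:] - t[:-2]) / 2.0)   # secder[k]: centred on point k+1
    flatness, roundness = [], []
    for j in extrema_idx:
        lo = max(j - n, 0)
        flatness.append(np.mean(np.abs(slopes[lo:j + n])))
        # Assumption: second derivatives centred on points j-n+1 .. j+n-1.
        roundness.append(np.mean(secder[lo:j + n - 1]))
    return np.array(flatness), np.array(roundness)

# Toy example: a downward parabola with its maximum at index 10.
t = np.arange(21.0)
f = -(t - 10.0) ** 2
flatness, roundness = flatness_roundness(t, f, [10])
flatmean = np.nansum(flatness) / len(flatness)
roundmean = np.nansum(roundness) / len(roundness)
```

Calling the same function with the indices of the naive minima would give the tflatness and troundness counterparts.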