feature_key.txt

#Author: Lucianne Walkowicz and Daniel Giles, from code written by Revant Nayar
#Code snippets are shown currently, with the intent of turning these into written descriptions of each feature. 
#Code snippets are not intended to work out of the box and are shown for illustration purposes only, though should be useable with minor modificaiton.

numpdc - this is the PDC_SAP_FLUX array cleaned of NaNs. Being replaced by 'f' to simplify this document.

corrpdc - this is the trend corrected PDC_SAP_FLUX, it removes the longterm linear trend of a given lightcurve.
    5/31/18 (Daniel):   This is utilized for some features, but not for others. As of this note, a normalized flux is used which                               divides the lightcurve by the median flux value for most feature derivations.

The index 'i' refers to an indidual lightcurve, 'j' refers to a single observation within a lightcurve.


Features:

1  longtermtrend - Linear fit coeffecient, does the object get brighter or dimmer on average over time?
2  meanmedrat - ratio between the mean and the median, mean/median
3  skews - The skew of the distribution of fluxes, positive value indicates flux is skewed to values greater than the median value. A normal distribution has a skew of zero.
4  varss - variance of the flux
5  coeffvar - coeffecient of variability, the ratio of the standard deviation to the mean of the flux
6  stds - standard deviation of the flux
7  numout1s - number of flux values outside of 1-sigma deviation
8  numnegoutliers - number of fluxes values 4-sigma less than the mean
9  numposoutliers - number of flux values 4-sigma greater than the mean
10 numoutliers - total number of flux values 4-sigma away from the mean
11 kurt - kurtosis, a measure of the sharpness of the peak of the distribution of fluxes. 1/N*sum((nf-nf_mean)^4)/stds^4
12 mad - Median Absolute Difference, the median difference from the median flux
13 maxslope - the greatest positive slope between two points, dnf/dt. Uses the 99th percentile rather than the actual largest.
14 minslope - the most negative slope btw two points.
15 meanpslope - the mean of postive slopes
16 meannslope - the mean of negative slopes
17 g_asymm - ratio of mean postive slopes to the mean of negative slopes (defaults to 10 if the mean of negative slopes is 0, i.e. there are no negative slopes)
18 rough_g_asymm - ratio of the number of positive slopes to the number of negative slopes (10 if no negative slopes)
19 diff_asymm - diffence between the mean of the positive slopes and the absolute mean of the negative slopes
20 skewslope - skew of the distribution of slopes
21 meanabsslope -  mean of the absolute slopes
22 varabsslope -  variance of the absolute value of the slopes
23 varslope - variance of the slopes
24 absmeansecder - absolute mean of the second derivative
25 num_pspikes - Number of positive spikes as defined by a positive slope 3 sigma greater than the mean positive slope.
26 num_nspikes - Number of negative spikes as defined by a negative slope 3 sigma smaller than the mean negative slope.
27 num_psdspikes - number of positive second derivative spikes (>+4 sigma)
28 num_nsdspikes - number of negative second derivative spikes (<-4 sigma)
29 stdratio - ratio of the standard deviation of the positive slopes to the standard deviation of the negative slopes (10 if negative standard deviation is zero)
30 pstrend - pair slope trend, ratio of positive slopes with a subsequent positive slope to the total number of slopes (N-1)
31 num_zcross - Number of 'zero' crossings, accounts for longterm trend, so really the number of longterm trendline crossings
32 num_pm - number of 'plus-minus' slope switches (where slope switches from positive to negative)
33 len_nmax - number of naive maxima where a maxima is the largest within 10 points on either side
34 len_nmin - number of naive minima where a minima is the smallest within 10 points on either side
35 mautocorrcoef - Auto-correlation function of one maxima to the next - np.corrcoef(naivemax[:-1],naivemax[1:])[0][1]
36 ptpslopes - peak-to-peak slopes 
37 periodicity - Coefficient of variability for time-differences, ratio of the standard deviation to the mean of time-difference between maxima
38 periodicityr - coefficient of variability for time-differences of maxima using residuals
    sum(t_maxima_diff-np.mean(t_maxima_diff))/np.mean(t_maxima_diff)
39 naiveperiod - mean of the time-differences between naive maxima
40 maxvars - coefficient of variation of the maxima, ratio of the standard deviation to the mean of the naive maxima flux values 
41 maxvarsr - coefficient of variation of maxima flux values using residuals instead of standard deviation
42 oeratio - ratio of odd to even numbered means for naive minima flux values
43 amp_2 - 2 times the amplitude based on 1st and 99th percentile
44 normamp - normalized amplitude (amp_2/mean)
45 mbp - median buffer percentile, fraction of points within 20% of the amplitude to the median
46 mid20 - ratio of flux percentiles (60th to 40th) over (95th to 5th)
47 mid35 - "" (67th to 32nd) ""
48 mid50 - "" (75th to 25th) ""
49 mid65 - "" (82nd to 17th) ""
50 mid80 - "" (90th to 10th) ""
51 percentamp - Largest difference between the max or min flux and the median (as a percentage of the median)
52 magratio - ratio of the maximum flux value to amp_2
53 autocorrcoef - auto-correletion coefficient of the flux from one to the next
54 sautocorrcoef - auto-correlation coefficient of the slopes from one to the next
55 flatmean - mean 'flatness' around naive maxima. 'Flatness' defined as average absolute value of 6 slopes on either side of maxima.
56 tflatmean - mean 'flatness' around naive minima
57 roundmean - mean 'roundness' around naive maxima. 'Roundness' defined as average of second derivatives on either side of maxima
58 troundmean - mean 'roundness' around naive minima
59 roundrat - ratio of flatness of maxima to flatness of minima
60 flatrat - ratio of roundness of maxima to roundness of minima


longtermtrend,
longtermtrend=np.polyfit(t, f, 1)[0]

meanmedrat, 
"""ratio between mean and median flux"""
meanmedrat=np.mean(f)/np.median(f)

skews, 
"""skew of the data, <0 indicates greater spread less than the median"""
skews=scipy.stats.skew(f)

varss, 
"""Variance of the flux"""
varss=np.var(f)


coeffvar, 
"""Coeff of variability"""
coeffvar=np.std(f)/np.mean(f)

stds, 
"""Standard deviation"""
stds=np.std(f)

numoutliers, numnegoutliers, numposoutliers, 
"""Fluxes beyond 4 sigma"""

outliers=[f[j] for j in range(len(f)) if (f[j]>mean+4*std) or (f[j]<mean-4*std)]  
numoutliers=len(outliers)
negoutliers=[f[j] for j in range (f)) if (f[j]<mean-4*std)]  
numnegoutliers=len(negoutliers)
posoutliers=[f[j] for j in range (f)) if (f[j]>mean+4*std)]  
numposoutliers=len(posoutliers)

numout1s,
"""Number of observations with flux outside of 1-sigma from the mean"""
out1std=[f[j] for j in range (len(f)) if (f[j]>mean+std) or (f[j]<mean-std)] 
numout1s=len(out1std)

5/31/2018 Bookmark - TODO: simplify all following code definitions
kurt, """kurtosis"""

kurt=np.zeros(len(numpdc))

for i in range(len(numpdc)):
	kurt[i]=scipy.stats.kurtosis(numpdc[i])


mad, from Richards et al.
"""Median Absolute Deviation (MAD)"""

mad=np.zeros(len(numpdc))
for i in range(len(numpdc)):
	mad[i]=np.median([abs(numpdc[i][j]-medians[i]) for j in range(len(numpdc[i]))])

maxslope, minslope, 
slopes=[0]*(len(numpdc))
for i in range(len(numpdc)):    
	slopes[i]=[(numpdc[i][j+1]-numpdc[i][j])/(numtime[i][j+1]-numtime[i][j]) for j in range (len(numpdc[i])-1)]


"""mean slope- long term trend """
meanslope=np.zeros(len(numpdc))

for i in range(len(numpdc)):
	meanslope[i]=np.mean(slopes[i])


"""max and min slopes"""
maxslope=np.zeros(len(numpdc))
minslope=np.zeros(len(numpdc))
for i in range(len(numpdc)):
	maxslope[i]=np.percentile(slopes[i],99)
	minslope[i]=np.percentile(slopes[i],1)

meanpslope, 
	meanpslope[i]=np.mean(pslope[i])

meannslope, 
	meannslope[i]=-np.mean(nslope[i])

g_asymm, 
	g_asymm[i]=meanpslope[i]/meannslope[i]

rough_g_asymm, 
	rough_g_asymm[i]=len(pslope[i])/len(nslope[i])


diff_asymm, 
	diff_asymm[i]=meanpslope[i]-abs(meannslope[i])

skewslope, 
"""skew slope- hope of asymmetry"""
skewslope=np.zeros(len(numpdc))  

for i in range(len(numpdc)):
	skewslope[i]=scipy.stats.skew(corrslopes[i])


varabsslope, 
varslope, 
meanabsslope, 
"""Abs slopes"""

absslopes=[0]*len(numpdc)

for i in range(len(numpdc)):
	absslopes[i]= [abs(corrslopes[i][j]) for j in range(len(corrslopes[i]))]

"""varabsslope"""

varabsslope=np.zeros(len(numpdc))
meanabsslope=np.zeros(len(numpdc))

meanabsslope=[np.var(absslopes[i]) for i in range(len(numpdc))]
varabsslope=[np.mean(absslopes[i]) for i in range(len(numpdc))]


absmeansecder, 
abssecder=[0]*(len(numpdc))

for i in range(len(numpdc)):
 	abssecder[i]=[abs((slopes[i][j]-slopes[i][j-1])/((numtime[i][j+1]-numtime[i][j])/2+(numtime[i][j]-numtime[i][j-1])/2)) for j in range (1, len(slopes[i])-1)]

absmeansecder=np.zeros(len(numpdc))
for i in range(len(numpdc)):
	absmeansecder[i]=np.mean(abssecder[i])

"""var slope"""

varslope=np.zeros(len(numpdc))

varslope=[np.var(slopes[i]) for i in range(len(slopes))]


num_pspikes, 
num_nspikes, 
num_sdspikes, 
num_sdspikes2,
stdratio, 
"""corrsecders"""
corrsecder=[0]*len(numpdc)

for i in range(len(numpdc)):
	corrsecder[i]=[(corrslopes[i][j]-corrslopes[i][j-1])/((numtime[i][j+1]-numtime[i][j])/2+(numtime[i][j]-numtime[i][j-1])/2) for j in range (1, len(corrpdc[i])-1)]

"""as regards periodicity in general,there can exist many levels"""
"""Num_spikes- you casn also isolate transits from cataclysmics using periodicity of spikes
take ratios of roundnessess or multiply them, """

stdratio=np.zeros(len(numpdc))
for i in range(len(numpdc)):
	pslopestds[i]=np.std(pslope[i])
	nslopestds[i]=np.std(nslope[i])
	sdstds[i]=np.std(corrsecder[i])
	meanstds[i]=np.mean(corrsecder[i])
	stdratio[i]=pslopestds[i]/nslopestds[i]
"""
for i in range(len(numpdc)):
	pspikes[i]=[corrslopes[i][j] for j in range(len(corrslopes[i])) if corrslopes[i][j]>=3*slopestds[i]] 
	nspikes[i]=[corrslopes[i][j] for j in range(len(corrslopes[i])) if corrslopes[i][j]<=3*slopestds[i]]
	sdspikes[i]=[corrsecder[i][j] for j in range(len(corrsecder[i])) if corrsecder[i][j]>=4*sdstds[i]] 
"""
for i in range(len(numpdc)):
	pspikes[i]=[corrslopes[i][j] for j in range(len(corrslopes[i])) if corrslopes[i][j]>=meanpslope[i]+3*pslopestds[i]] 
	nspikes[i]=[corrslopes[i][j] for j in range(len(corrslopes[i])) if corrslopes[i][j]<=meannslope[i]-3*nslopestds[i]]
	sdspikes[i]=[corrsecder[i][j] for j in range(len(corrsecder[i])) if corrsecder[i][j]>=4*sdstds[i]] 
	sdspikes2[i]=[corrsecder[i][j] for j in range(len(corrsecder[i])) if corrsecder[i][j]<=-4*sdstds[i]]

"""change around the 4 and add the min condition along with sdspike
to look for transits"""

num_pspikes=np.zeros(len(numpdc))
num_nspikes=np.zeros(len(numpdc))
num_sdspikes=np.zeros(len(numpdc))
num_sdspikes2=np.zeros(len(numpdc))

for i in range(len(numpdc)):
	num_pspikes[i]=len(pspikes[i]) 
	num_nspikes[i]=len(nspikes[i])
	num_sdspikes[i]=len(sdspikes[i])
	num_sdspikes2[i]=len(sdspikes2[i])

pstrend, """pair slope trend"""
pstrend=np.zeros(len(numpdc))
for i in range(len(numpdc)):
	pstrend[i]=len([slopes[i][j] for j in range(len(slopes[i])-1) if (slopes[i][j]>0) & (slopes[i][j+1]>0)])/len(slopes[i])


num_zcross,
"""Zero crossings- accounted for ltt, plot with gasymm"""

zcrossind=[]
for i in range(len(numpdc)):
	ltt=longtermtrend[i]
	yoff=y_offset[i]
	zcrossind.append([j for j in range(len(numpdc[i])-1) if (ltt*numtime[i][j+1]+ yoff-numpdc[i][j+1])*(ltt*numtime[i][j]+yoff-numpdc[i][j])<0])


num_zcross=np.zeros(len(numpdc))

for i in range(len(numpdc)):
	num_zcross[i]=len(zcrossind[i])
 
num_pm, 
"""pm"""
plusminus=[0]*len(numpdc)


for i in range(len(numpdc)):
	plusminus[i]=[j for j in range(1,len(slopes[i])) if (slopes[i][j]<0)&(slopes[i][j-1]>0)]

num_pm=np.zeros(len(numpdc))
num_pm=[len(plusminus[i]) for i in range(len(numpdc))]


len_nmax
"""naive maxima and corresponding time values you can do it with 5 or 10 or something else, 1 or two largest"""

naivemaxes=[0]*len(numpdc)
nmax_times=[0]*len(numpdc)
maxinds=[0]*len(numpdc)
maxerr=[0]*len(numpdc)
for i in range(len(numpdc)):
	naivemaxes[i]=[corrpdc[i][j] for j in range (len(numpdc[i])) if corrpdc[i][j] in heapq.nlargest(1, corrpdc[i][max(j-10,0):min(j+10, len(numpdc[i])-1): 1])]
	nmax_times[i]=[numtime[i][j] for j in range (len(numpdc[i])) if corrpdc[i][j] in heapq.nlargest(1, corrpdc[i][max(j-10,0):min(j+10, len(numpdc[i])-1): 1])]
	maxinds[i]=[j for j in range (len(numpdc[i])) if corrpdc[i][j] in heapq.nlargest(1, corrpdc[i][max(j-10,0):min(j+10, len(numpdc[i])-1): 1])]
	maxerr[i]=[err[i][j] for j in maxinds[i]]

"""numbers of naive maxima"""

len_nmax=np.zeros(len(numpdc))

for i in range(len(numpdc)):
	len_nmax[i]=len(naivemaxes[i])

len_nmin - """numbers of naive minima"""

len_nmin=np.zeros(len(numpdc))

for i in range(len(numpdc)):
	len_nmin[i]=len(naivemins[i])

mautocorrcoef - """Auto-correlation function of one maximum to next-good clustering"""

autopdcmax=[0]*len(numpdc)
for i in range(len(numpdc)):
	autopdcmax[i]=[naivemaxes[i][j+1] for j in range(len(naivemaxes[i])-1)]

mautocovs=np.zeros(len(numpdc))
mautocorrcoef=np.zeros(len(numpdc))

for i in range(len(numpdc)):
	mautocorrcoef[i]=np.corrcoef(naivemaxes[i][:-1:], autopdcmax[i])[0][1]
	mautocovs[i]=np.cov(naivemaxes[i][:-1:],autopdcmax[i])[0][1]
 
ptpslopes - """peak to peak slopes"""
ptpslopes=np.zeros(len(numpdc))
ppslopes=[0]*len(numpdc)
for i in range(len(numpdc)):
	ppslopes[i]=[abs((naivemaxes[i][j+1]-naivemaxes[i][j])/(nmax_times[i][j+1]-nmax_times[i][j])) for j in range(len(naivemaxes[i])-1)]

for i in range(len(numpdc)):
	ptpslopes[i]=np.mean(ppslopes[i])

periodicity, periodicityr, naiveperiod
"""Variation coefficient of time difference between successive maxima- periodicity?"""

maxdiff=[0]*(len(numpdc))
for i in range(len(numpdc)):
	maxdiff[i]=[nmax_times[i][j+1]-nmax_times[i][j] for j in range(len(naivemaxes[i])-1)]

periodicity=np.zeros(len(numpdc))
periodicityr=np.zeros(len(numpdc))
for i in range(len(numpdc)):
	periodicity[i]=np.std(maxdiff[i])/np.mean(maxdiff[i])
	periodicityr[i]=np.sum(abs(maxdiff[i]-np.mean(maxdiff[i])))/np.mean(maxdiff[i])
naiveperiod=np.zeros(len(numpdc))

for i in range(len(numpdc)):
	naiveperiod[i]=np.mean(maxdiff[i])

 
maxvars & maxvarsr - """variation coefficient of the maxima"""
maxvars=np.zeros(len(numpdc))
maxvarsr=np.zeros(len(numpdc))

for i in range(len(numpdc)):
	maxvars[i]=np.std(naivemaxes[i])/np.mean(naivemaxes[i])
	maxvarsr[i]=np.sum(abs(naivemaxes[i]-np.mean(naivemaxes[i])))/np.mean(naivemaxes[i])


oeratio, 
omin=[0]*len(numpdc)
emin=[0]*len(numpdc)
meanomin=np.zeros(len(numpdc))
meanemin=np.zeros(len(numpdc))
for i in range(len(numpdc)):
	emin[i]=[naivemins[i][j] for j in range(len(naivemins[i])) if j%2==0]
	omin[i]=[naivemins[i][j] for j in range(len(naivemins[i])) if j%2!=0]
"""local secder dip"""
for i in range(len(numpdc)):
	meanemin[i]=np.mean(emin[i])
	meanomin[i]=np.mean(omin[i])
"""plt.scatter(meanomin, meanemin)"""
oeratio=np.zeros(len(numpdc))
for i in range(len(numpdc)):
	oeratio[i]=meanomin[i]/meanemin[i]


amp_2 & normamp:

amp_2=np.zeros(len(numpdc))
amp =np.zeros(len(numpdc))

for i in range(len(numpdc)):
	amp[i]=np.percentile(numpdc[i],99)-np.percentile(numpdc[i],1)
	amp_2[i]=np.percentile(corrpdc[i],99)-np.percentile(corrpdc[i],1)

normnaiveamp=np.zeros(len(numpdc))
normamp=np.zeros(len(numpdc))

for i in range(len(numpdc)):
	normnaiveamp[i]=naive_amp_2[i]/np.mean(numpdc[i])
	normamp[i]=amp_2[i]/np.mean(numpdc[i])


mbp - Median Buffer Percentile

for i in range(len(numpdc)):
	mbp[i]=len([numpdc[i][j] for j in range(len(numpdc[i])) if (numpdc[i][j]<(medians[i]+0.1*amp_2[i])) & (numpdc[i][j]>(medians[i]-0.1*amp_2[i]))])/len(numpdc[i])


mid20, 
	f4060[i]=np.percentile(numpdc[i], 60)-np.percentile(numpdc[i], 40)
	f595[i]=np.percentile(numpdc[i],95)-np.percentile(numpdc[i],5)
	mid20[i]=f4060[i]/f595[i]

mid35, 
	f595[i]=np.percentile(numpdc[i],95)-np.percentile(numpdc[i],5)
	f3267[i]=np.percentile(numpdc[i], 67)-np.percentile(numpdc[i], 32)
	mid35[i]=f3267[i]/f595[i]

mid50, 
	f2575[i]=np.percentile(numpdc[i], 75)-np.percentile(numpdc[i], 25)
	f595[i]=np.percentile(numpdc[i],95)-np.percentile(numpdc[i],5)
	mid50[i]=f2575[i]/f595[i]

mid65, 
	f1782[i]=np.percentile(numpdc[i], 82)-np.percentile(numpdc[i], 17)
	f595[i]=np.percentile(numpdc[i],95)-np.percentile(numpdc[i],5)
	mid65[i]=f1782[i]/f595[i]

mid80, 
	f1090[i]=np.percentile(numpdc[i],90)-np.percentile(numpdc[i],10)	
	f595[i]=np.percentile(numpdc[i],95)-np.percentile(numpdc[i],5)
	mid80[i]=f1090[i]/f595[i]


percentamp, 
for i in range(len(numpdc)):
	percentamp[i]=max([(corrpdc[i][j]-medians[i])/medians[i] for j in range(len(corrpdc[i]))])

magratio, 
magratio=[(max(numpdc[i])-medians[i])/amp[i] for i in range(len(numpdc))]

sautocorrcoef, 
sautopdc=[0]*len(slopes)
for i in range(len(slopes)):
	sautopdc[i]=[slopes[i][j+1] for j in range(len(slopes[i])-1)]

sautocovs=np.zeros(len(slopes))
for i in range(len(slopes)):
	sautocorrcoef[i]=np.corrcoef(slopes[i][:-1:], sautopdc[i])[0][1]
	sautocovs[i]=np.cov(slopes[i][:-1:],sautopdc[i])[0][1]

autocorrcoef, 
autopdc=[0]*len(numpdc)
for i in range(len(numpdc)):
	autopdc[i]=[numpdc[i][j+1] for j in range(len(numpdc[i])-1)]

autocovs=np.zeros(len(numpdc))
autocorrcoef=np.zeros(len(numpdc))

for i in range(len(numpdc)):
	autocorrcoef[i]=np.corrcoef(numpdc[i][:-1:], autopdc[i])[0][1]
	autocovs[i]=np.cov(numpdc[i][:-1:],autopdc[i])[0][1]

flatmean, 
for i in range(len(numpdc)):
	flatmean[i]=np.nansum(flatness[i])/len(flatness[i])

tflatmean, 
for i in range(len(numpdc)):
	tflatmean[i]=np.nansum(tflatness[i])/len(tflatness[i])

roundmean, 
for i in range(len(numpdc)):
	roundmean[i]=np.nansum(roundness[i])/len(roundness[i])


troundmean, 
for i in range(len(numpdc)):
	troundmean[i]=np.nansum(troundness[i])/len(troundness[i])

roundrat, 
for i in range(len(numpdc)):
	roundrat[i]=roundmean[i]/troundmean[i]

flatrat
for i in range(len(numpdc)):
	flatrat[i]=flatmean[i]/tflatmean[i]