Smooth histograms — which methods are best? #3

Open · phinate opened this issue May 11, 2020 · 5 comments

@phinate (Member) commented May 11, 2020

Following up on the discussion during the kick-off meeting, an interesting question came up: which methods are best suited to smoothly approximating a set of binned counts?

In the case where one has a neural network-based observable, one could apply the softmax function exp(nn / T) / sum(exp(nn / T)) (with T a temperature hyperparameter) to the nn output, as one would for a classification problem. This essentially puts every event in all bins, but weights its contribution to each bin by the corresponding nn output (normalized to sum to 1). The histogram is then the elementwise sum of the softmaxed outputs over all events.
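For concreteness, a minimal numpy sketch of this softmax binning (purely illustrative; softmax_hist, nn_out, and T are assumed names, with nn_out holding one network score per bin per event):

import numpy as np

# nn_out: shape (n_events, n_bins), one network score per bin and event
def softmax_hist(nn_out, T=1.0):
    z = np.exp(nn_out / T)                      # temperature-scaled exponentials
    weights = z / z.sum(axis=1, keepdims=True)  # each event's bin weights sum to 1
    return weights.sum(axis=0)                  # soft 'counts' per bin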

An alternative to this is to take an approach using a kernel density estimate (kde), which is defined by instantiating a distribution (kernel) centered around each data point (e.g. a standard normal distribution) and averaging their contributions. The smoothness of the resulting estimate is controlled with a 'bandwidth' hyperparameter, which scales the widths of the kernels. Once a bandwidth is chosen, one can get 'counts' by integrating the pdf over a set of intervals (bins), or equivalently by taking differences of the cumulative distribution function (cdf) at the bin edges.
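As a minimal illustration of the kde itself (a sketch with assumed names, not the code used here):

import numpy as np
from scipy.stats import norm

# kde pdf: the average of a Gaussian kernel centred on each data point,
# with the kernel width set by the bandwidth
def kde_pdf(x, data, bandwidth):
    return norm.pdf(x[:, None], loc=data, scale=bandwidth).mean(axis=1)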

Revisiting the case with the nn-based observable, one could then have the network output a single value (a regression net) and define a kde over the set of outputs from a batch of events. The question is then: is there any clear advantage in expressivity or principle from using a kde, a softmax, or other methods?

Another interesting question is where these approaches break down, e.g. using a kde in this way seems to be very sensitive to the bandwidth.

It's also worth noting that a kde may make more sense for general use as a candidate for smooth histogramming, but the case of the nn observable may be more nuanced.

@philippwindischhofer

[...] which is defined by instantiating a distribution (kernel) centered around each data point [...]

This is a very pragmatic question and shows my ignorance about these things: Have you already tried to investigate the scaling behavior of your current KDE implementation (resource-wise)? Do you (need to) keep all the kernels of all data points in memory at the same time for this to work? (Put differently: "what happens if you fill a KDE-style histogram with 10M events?")

@phinate (Member, Author) commented May 14, 2020

@philippwindischhofer In practice, our code looks something like:

import numpy as np
from scipy.stats import norm

events = np.array([x_1, x_2, ..., x_n])
bins = np.linspace(a, b, num_steps)

edge_hi = bins[1:]   # high edge for each bin
edge_lo = bins[:-1]  # low edge for each bin

# per-event cumulative density from 'a' up to each lo and hi bin edge
# (events is reshaped to a column so it broadcasts against the bin edges)
cdf_up = norm.cdf(edge_hi, loc=events[:, None], scale=bandwidth)
cdf_dn = norm.cdf(edge_lo, loc=events[:, None], scale=bandwidth)

# subtract and sum over events to get counts in [lo, hi] for all bins
counts = (cdf_up - cdf_dn).sum(axis=0)

Since this is essentially just applying norm.cdf twice in a vectorised way, I imagine there wouldn't be any issues with scaling (but I haven't tried!)
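For very large event counts (e.g. the 10M-event case above), the intermediate arrays of shape (n_events, n_bins) could be avoided by accumulating over chunks of events; a minimal sketch under that assumption (the function name and chunk size are illustrative, not part of the actual code):

import numpy as np
from scipy.stats import norm

# accumulate KDE-style counts chunk by chunk, so only a
# (chunk, n_bins) intermediate is ever held in memory
def kde_counts_chunked(events, bins, bandwidth, chunk=100_000):
    counts = np.zeros(len(bins) - 1)
    for start in range(0, len(events), chunk):
        batch = events[start:start + chunk]
        cdf = norm.cdf(bins, loc=batch[:, None], scale=bandwidth)
        counts += np.diff(cdf, axis=1).sum(axis=0)
    return counts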

@HDembinski

I think the use of KDEs is fine, but using them with a normal Poisson-based maximum-likelihood function, as implemented in pyhf, is mathematically wrong. Maximum-likelihood fits only produce unbiased estimates of the value and its uncertainty when the data in each bin is Poisson-distributed; this is not the case when you convert data into KDEs. If you add a single new event, it will not only increase the value in a single bin, but also in neighboring bins. These additional correlations are not included in the fitted model.

Whether this is a problem in practice, I cannot say. The problem I describe may not affect the optimization significantly, but this point needs to be studied.

@alexander-held (Member) commented Jul 22, 2020

The bandwidth could be reduced throughout analysis optimization to slowly approach the discrete bin case. For the final inference step after the optimization, the KDE can be replaced by a traditional histogram.
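A minimal sketch of one such schedule (purely illustrative; the exponential decay form and all parameter names are assumptions):

# decay the KDE bandwidth exponentially over the optimization steps,
# approaching the sharp, discrete-bin limit at the end
def bandwidth_schedule(step, n_steps, bw_init=0.5, bw_final=1e-3):
    return bw_init * (bw_final / bw_init) ** (step / n_steps)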

@HDembinski

btw

events = np.array([x_1, x_2, ..., x_n])
bins = np.linspace(a, b, num_steps)

edge_hi = bins[1:]   # high edge for each bin
edge_lo = bins[:-1]  # low edge for each bin

# per-event cumulative density from 'a' up to each lo and hi bin edge
cdf_up = norm.cdf(edge_hi, loc=events[:, None], scale=bandwidth)
cdf_dn = norm.cdf(edge_lo, loc=events[:, None], scale=bandwidth)

# subtract and sum over events to get counts in [lo, hi] for all bins
counts = (cdf_up - cdf_dn).sum(axis=0)

can be more efficiently written as

events = np.array([x_1, x_2, ..., x_n])
bins = np.linspace(a, b, num_steps)

# per-event cumulative density from 'a' up to every bin edge
cdf = norm.cdf(bins, loc=events[:, None], scale=bandwidth)

# difference along the bin axis and sum over events to get counts in [lo, hi] for all bins
counts = np.diff(cdf, axis=1).sum(axis=0)
