Skip to content

Commit

Permalink
Merge pull request #5 from ihmeuw-msca/feature/edge-case-and-ecdf
Browse files Browse the repository at this point in the history
Feature/edge case and ecdf
  • Loading branch information
mbi6245 authored Sep 25, 2024
2 parents a3e95ce + e4194f8 commit f8ece98
Show file tree
Hide file tree
Showing 6 changed files with 119 additions and 61 deletions.
40 changes: 6 additions & 34 deletions docs/getting_started/quickstart.rst
Original file line number Diff line number Diff line change
Expand Up @@ -10,50 +10,22 @@ Example
import scipy.stats
from ensemble.ensemble_model import EnsembleModel, EnsembleFitter
# creates an ensemble distribution composed of the normal and gumbel distributions both sharing
# the same mean and variance; the normal distribution can be thought of as contributing a
# quarter of the "height" of the density curve to the ensemble's density, and the gumbel as
# contributing the remaining 3 quarters
# the same mean and variance
normal_gumbel = EnsembleModel(distributions=["normal", "gumbel"],
weights=[0.25, 0.75],
mean=4,
variance=1)
# fits an EnsembleModel object to standard normal draws. Here, the user has specified a
# distribution (the gumbel) that is not reflected in the truth. The model typically correctly
# identifies this and will give weights close to 1 for the normal, and 0 for the gumbel
# distribution (the gumbel) that is not reflected in the truth. Try on your own to see how the
# model reflects this!
std_norm_draws = scipy.stats.norm.rvs(0, 1, size=100)
model = EnsembleFitter(["normal", "gumbel"], "L2").fit(std_norm_draws)
fitted_weights = model.weights
fitted_distribution = model.ensemble_distribution
Using both the draws from the standard normal and the fitted :code:`EnsembleModel` object from
above, we can also plot the results with the help of the :code:`matplotlib` package. There are many
things that you may want to plot, but 2 useful plots that will be demonstrated below are a density
histogram overlaid with the ensemble PDF, as well as a comparison of the eCDF and the ensemble's
CDF.
# default plotting function for a demo visualization
normal_gumbel.plot()
Plotting
--------

.. code-block:: python
import numpy as np
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1, 2)
support = np.linspace(np.min(std_norm_draws), np.max(std_norm_draws), 1000)
# plot histogram vs fitted PDF
ax[0].hist(std_norm_draws, density=True, bins=30)
ax[0].plot(support, fitted_distribution.pdf(support))
ax[0].set_xlabel("DATA VALUES (UNITS)")
ax[0].set_ylabel("density")
ax[0].set_title("DATA histogram w/ensemble PDF Overlay")
# plot eCDF vs fitted CDF
stats.ecdf(std_norm_draws).cdf.plot(ax[1])
ax[1].plot(support, fitted_distribution.cdf(support))
ax[1].set_xlabel("DATA VALUES (UNITS)")
ax[1].set_ylabel("density")
ax[1].set_title("Empirical vs Ensemble CDF Comparison")
**Please see** :ref:`Plotting` **for a practical guide on plotting with ensemble.**
6 changes: 3 additions & 3 deletions docs/user_guide/ensemble_fitting.rst
Original file line number Diff line number Diff line change
Expand Up @@ -25,9 +25,9 @@ Finally, the function of interest for this use case is the :code:`fit()` functio
Example: Fitting an Ensemble
----------------------------

Suppose we have microdata for systolic blood pressure (SBP) from a certain population of young men
in Seattle. Since SBP must be positive, let's use all the distributions (except the exponential)
with a positive support to fit this data.
Suppose we have microdata for systolic blood pressure (SBP) from a certain population of young
people in Seattle. Since SBP must be positive, let's use all the distributions (except the
exponential) with a positive support to fit this data.

.. code-block:: python
Expand Down
1 change: 1 addition & 0 deletions docs/user_guide/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,7 @@ User guide

ensemble_fitting
ensemble_model
plotting
concepts

.. note::
Expand Down
43 changes: 43 additions & 0 deletions docs/user_guide/plotting.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,43 @@
Plotting
========

Let's reuse the example from :ref:`Example: Fitting an Ensemble` with the SBP. There are many
things that you may want to plot, but 2 useful plots that will be demonstrated below are:

* a density histogram of the SBP overlaid with the ensemble PDF
* a comparison between the eCDF of the SBP data and the ensemble's CDF

.. code-block:: python
import numpy as np
import matplotlib.pyplot as plt
# generate SBP data and fit model as before
data = stats.norm(loc=120, scale=7).rvs(size=100)
model = EnsembleFitter(
distributions=["gamma", "invgamma", "fisk", "lognormal"],
objective="L2"
)
res = model.fit(data)
# set up matplotlib plotting
fig, ax = plt.subplots(1, 2)
support = np.linspace(np.min(data), np.max(data), 1000)
# plot histogram w/fitted PDF
ax[0].hist(data, density=True, bins=30)
ax[0].plot(support, fitted_distribution.pdf(support))
ax[0].set_xlabel("SBP (mm/Hg)")
ax[0].set_ylabel("density")
ax[0].set_title("SBP histogram w/ensemble PDF Overlay")
# plot eCDF vs fitted CDF
stats.ecdf(std_norm_draws).cdf.plot(ax[1])
ax[1].plot(support, fitted_distribution.cdf(support))
ax[1].set_xlabel("SBP (mm/Hg)")
ax[1].set_ylabel("density")
ax[1].set_title("Empirical vs Ensemble CDF Comparison")
**What is** :code:`support` **?:** You can think of :code:`support` as the x values (in the space of the
data)for which we will calculate corresponding y values of density for, whether that be the PDF or
CDF.
49 changes: 28 additions & 21 deletions plots.ipynb

Large diffs are not rendered by default.

41 changes: 38 additions & 3 deletions src/ensemble/model.py
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
from typing import List, Tuple, Union

import cvxpy as cp
import matplotlib.pyplot as plt
import numpy as np
import numpy.typing as npt
import scipy.optimize as opt
Expand Down Expand Up @@ -219,6 +220,28 @@ def stats_temp(
else:
return tuple(res_list)

def plot(self):
"""THIS IS A DEMONSTRATION FUNCTION. SEE DOCUMENTATION FOR MORE PRACTICAL PLOTS
plots the PDF and CDF of an ensemble distribution
"""
fig, ax = plt.subplots(1, 2, figsize=(12, 6))
scaling = 3 * np.sqrt(self.variance)
lb = np.max([self.support[0], self.mean - scaling])
ub = np.min([self.support[1], self.mean + scaling])
support = np.linspace(lb, ub, 100)
pdf = self.pdf(support)
cdf = self.cdf(support)
ax[0].plot(support, pdf)
ax[0].set_xlabel("DATA VALUES (UNITS)")
ax[0].set_ylabel("density")
ax[0].set_title("ensemble PDF")

ax[1].plot(support, cdf)
ax[1].set_xlabel("DATA VALUES (UNITS)")
ax[1].set_ylabel("density")
ax[1].set_title("ensemble CDF")


class EnsembleResult:
"""Result from ensemble distribution fitting
Expand Down Expand Up @@ -333,11 +356,23 @@ def fit(self, data: npt.ArrayLike) -> EnsembleResult:
raise ValueError(
"data exceeds bounds of the support of your ensemble"
)

if len(data) <= 1:
raise ValueError(
"you may only run this function with 2 or more data points"
)

# sample stats, ecdf
sample_mean = np.mean(data)
sample_variance = np.var(data, ddof=1)
ecdf = stats.ecdf(data).cdf.probabilities
equantiles = stats.ecdf(data).cdf.quantiles
ecdf = stats.ecdf(data).cdf

# reintroduce duplicates into scipy's ecdf for fitting only
sorted_indices = np.argsort(data)
equantiles = data[sorted_indices]
eprobabilities = np.interp(
equantiles, ecdf.quantiles, ecdf.probabilities
)

# fill matrix with cdf values over support of data
num_distributions = len(self.distributions)
Expand All @@ -352,7 +387,7 @@ def fit(self, data: npt.ArrayLike) -> EnsembleResult:

# CVXPY implementation
w = cp.Variable(num_distributions)
objective = cp.Minimize(self._objective_func(ecdf - cdfs @ w))
objective = cp.Minimize(self._objective_func(eprobabilities - cdfs @ w))
constraints = [0 <= w, cp.sum(w) == 1]
prob = cp.Problem(objective, constraints)
prob.solve()
Expand Down

0 comments on commit f8ece98

Please sign in to comment.