
Commit

Clean up
cthoyt committed Jan 18, 2024
1 parent 91ce3e3 commit a80e470
Showing 5 changed files with 50 additions and 68 deletions.
File renamed without changes
Binary file removed docs/source/img/regression_example.pdf
Binary file added docs/source/img/regression_example.png
10 changes: 5 additions & 5 deletions src/eliater/examples/ecoli.py
@@ -129,11 +129,11 @@
" observed biomolecular networks. Bioinformatics, 39(Supplement_1), i494-i503.",
graph=graph,
description="This is the transcriptional E. coli regulatory network obtained from the EcoCyc database. "
"The experimental data were 260 RNA-seq normalized expression profiles of E. coli K-12"
" MG1655 and BW25113 across 154 unique experimental conditions, extracted from the PRECISE"
" database (Sastry et al., 2019, 'The Escherichia coli transcriptome mostly"
" consists of independently regulated modules').",
example_queries=[Query.from_str(treatments="fur", outcomes="dpiA")],
)

ecoli_transcription_example.__doc__ = ecoli_transcription_example.description
108 changes: 45 additions & 63 deletions src/eliater/regression.py
@@ -13,29 +13,26 @@
where $X$ can take discrete or continuous values. In the case of a binary exposure, where $X$ only takes the value 1 (meaning
that the treatment has been received) or 0 (meaning that treatment has not been received), the ATE is defined as
$\mathbb{E}[Y \mid do(X=1)] - \mathbb{E}[Y \mid do(X=0)]$. In this module, we support both continuous and discrete
values of $X$. In general the ATE varies depending on levels of $x$, but in linear models, the ATE reduces to a single
number. Hence, it does not depend on the value of $x$. This is shown below.
To build intuition for how to use linear regression on the treatment variable, we can create a
Gaussian linear structural causal model (SCM). In a Gaussian linear SCM, each variable is defined as a
linear combination of its parents plus Gaussian noise. For example, consider this graph:
.. code-block:: python
from y0.dsl import X, Y, Z
from y0.graph import NxMixedGraph
graph = NxMixedGraph.from_edges(
directed=[
(X, Y),
(Z, Y),
(Z, X),
],
undirected=[],
)
graph.draw()
.. figure:: img/backdoor.png
:scale: 70%
The goal is to find the causal effect of $X$ on $Y$. For this graph, a Gaussian linear SCM can be defined as below:
@@ -65,53 +62,45 @@
non-causal relationship due
to the confounder $Z$, indicated by the path X ← Z → Y. Such confounding paths beginning with an arrow directed
towards $X$ are termed *back-door paths*.
Nevertheless, it's noteworthy that the regression coefficient of $Y$ on $X$ when adjusted for $Z$
(denoted by $\gamma_{yx.z}$) simplifies to:
$\gamma_{yx.z} = \lambda_{xy}$
This means that adjusting for $Z$ blocks the back-door path, and allows us to directly estimate the effect of $X$
on $Y$, which leads to an unbiased estimate of the ATE.
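This can be checked numerically. Below is a minimal, self-contained simulation (plain NumPy with arbitrarily chosen
coefficient values; it is not part of eliater) of a Gaussian linear SCM for the graph above, comparing the unadjusted
regression coefficient of $Y$ on $X$ with the $Z$-adjusted one:

.. code-block:: python

import numpy as np

rng = np.random.default_rng(0)
n = 100_000
lambda_zx, lambda_zy, lambda_xy = 0.8, 1.5, 2.0  # arbitrary ground-truth coefficients

Z = rng.normal(size=n)
X = lambda_zx * Z + rng.normal(size=n)
Y = lambda_xy * X + lambda_zy * Z + rng.normal(size=n)

# Regressing Y on X alone picks up the back-door path X <- Z -> Y, so the
# coefficient is biased (about 2.73 here instead of the true 2.0).
gamma_yx = np.polyfit(X, Y, 1)[0]

# Adding Z to the regression blocks the back-door path; the coefficient of X
# recovers lambda_xy, i.e., the ATE (about 2.0).
coef, *_ = np.linalg.lstsq(np.column_stack([X, Z, np.ones(n)]), Y, rcond=None)
gamma_yx_z = coef[0]

print(round(gamma_yx, 2), round(gamma_yx_z, 2))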
A set of variables that blocks all back-door paths is called an adjustment set.
The module identifies an adjustment set with the following priority:

1. An optimal adjustment set - the adjustment set that leads to an estimate of the ATE with the least
asymptotic variance, if it exists
2. An optimal minimal adjustment set - identify all possible adjustment sets and choose one with the minimum
cardinality (i.e., number of elements) that results in the least asymptotic variance in the estimation of the ATE
(choosing the one with the least variance still needs to be implemented)
3. A randomly chosen adjustment set among the existing minimal adjustment sets

Once the adjustment set is selected, this module uses it to perform a regression with $X$ and the variables in the
adjustment set as the inputs and $Y$ as the output to find an unbiased
estimate of $P(Y \mid do(X=x))$, $\mathbb{E}[Y \mid do(X=x)]$, and the ATE.
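To make the notion of an adjustment set concrete, here is a brute-force sketch that enumerates minimal back-door
adjustment sets using only networkx. This illustrates the back-door criterion; it is not eliater's actual selection
procedure, and it ignores the asymptotic-variance criteria above:

.. code-block:: python

from itertools import combinations

import networkx as nx


def minimal_backdoor_sets(graph: nx.DiGraph, treatment: str, outcome: str):
    """Enumerate the smallest sets of observed nodes satisfying the back-door criterion."""
    # Valid adjustment variables may not be the treatment, the outcome, or a descendant of the treatment.
    candidates = set(graph) - {treatment, outcome} - nx.descendants(graph, treatment)
    # A set satisfies the back-door criterion if it d-separates treatment and outcome
    # in the graph with the treatment's outgoing edges removed.
    no_outgoing = graph.copy()
    no_outgoing.remove_edges_from(list(graph.out_edges(treatment)))
    valid = [
        set(subset)
        for size in range(len(candidates) + 1)
        for subset in combinations(sorted(candidates), size)
        # nx.d_separated is called nx.is_d_separator in newer networkx releases
        if nx.d_separated(no_outgoing, {treatment}, {outcome}, set(subset))
    ]
    smallest = min((len(s) for s in valid), default=0)
    return [s for s in valid if len(s) == smallest]


# The X <- Z -> Y, X -> Y graph from above: {Z} is the (only) minimal adjustment set.
backdoor_graph = nx.DiGraph([("Z", "X"), ("Z", "Y"), ("X", "Y")])
print(minimal_backdoor_sets(backdoor_graph, "X", "Y"))  # [{'Z'}]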
Linear regression is known for its simplicity, speed, and high interpretability.
However, linear regression is most appropriate when the variables exhibit linear relationships. In addition, it only
uses the variables from the back-door adjustment set and does not utilize useful variables such as mediators.
Example
-------
We'll work with the example :data:`eliater.frontdoor_backdoor.example_2` in which $X$ is the treatment and $Y$ is
the outcome. We will explore estimating causal effects in the forms of $P(Y \mid do(X=x))$,
$\mathbb{E}[Y \mid do(X=x)]$, and ATE.
.. code-block:: python
from y0.dsl import Z1, Z2, Z3, Variable, X, Y
M1 = Variable("M1")
M2 = Variable("M2")
from y0.graph import NxMixedGraph
graph = NxMixedGraph.from_edges(
directed=[
(Z1, X),
(X, M1),
(M1, M2),
(M2, Y),
(Z1, Z2),
(Z2, Z3),
(Z3, Y),
],
)
graph.draw()

# the same example ships with eliater and can be drawn directly
from eliater.examples import example_2

example_2.draw()
.. figure:: img/regression_example.png
:scale: 70%
Typically, estimation of ATE is more popular and desirable than estimation of $\mathbb{E}[Y \mid do(X=x)]$, because ATE
@@ -142,21 +131,21 @@
)
The output of the query type in the form of "expected value" ($\mathbb{E}[Y \mid do(X=1)]$) is 69.78.
This means that if one intervenes on $X$ and fixes its value to 1, the expected value of $Y$ will be around 70.
However, if one estimates the expected value of $Y$ without any intervention on $X$, it will amount to 73.61.
.. code-block:: python
from eliater.frontdoor_backdoor import example_2
import numpy as np
data = example_2.generate_data(100, seed=100)
np.mean(data['Y'])
This means that intervening on $X$ and fixing its value to 0 causes a decrease in the expected value of $Y$. Now
we will explore the output of the query in the form of ATE:
$\mathbb{E}[Y \mid do(X=x+1)] - \mathbb{E}[Y \mid do(X=x)]$. Note that the value of $x$ does not affect the
output, because the ATE amounts to the estimate of the coefficient of $X$ ($\hat{\lambda}_{xy}$) in a regression
where $Y$ is regressed on $X$ and the optimal adjustment set $Z_3$ as follows:
$Y = \lambda_{xy} X + \lambda_{z_3Y} Z_3 + U_Y; U_Y \sim \mathcal{N}(0, \sigma^2_Y)$
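As a sketch of that regression (this mirrors what the text describes, not the module's internal code, and it assumes
the data frame produced by ``example_2.generate_data`` has columns named ``X``, ``Z3``, and ``Y``):

.. code-block:: python

import statsmodels.api as sm

from eliater.frontdoor_backdoor import example_2

data = example_2.generate_data(100, seed=100)

# Regress Y on X and the optimal adjustment set {Z3}; the fitted coefficient
# of X is the estimate of lambda_xy, i.e., the ATE.
design = sm.add_constant(data[["X", "Z3"]])
fit = sm.OLS(data["Y"], design).fit()
print(fit.params["X"])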
@@ -216,11 +205,11 @@
graph = example_2.graph
data = example_2.generate_data(100, seed=100)
summary_statistics(
graph=graph,
data=data,
treatments={X},
outcome=Y,
interventions={X: 0},
)
The output is as follows:
@@ -229,19 +218,12 @@
first_quartile=64.48297054774694, second_quartile=69.74351236828335, third_quartile=74.41515094449318,
max=91.8646554122394)'''
.. todo:: Questions to answer in documentation:
1. How does estimation with linear regression work? Rework the text above from JZ. Remember the point
is to explain to someone who doesn't really care about the math but wants to decide if they should
use it for their situation
2. What's the difference between estimation with this module and what's available in Ananke?
3. What are the limitations of estimation with this methodology?
4. What does it look like to actually use this code? Give a self-contained example of doing estimation
with this module and include an explanation on how users should interpret the results.
PLEASE DO NOT DELETE THIS LIST. Leave it at the bottom of the module level docstring so we
can more easily check if all of the points have been addressed.
Unanswered Questions for Later
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1. What's the difference between estimation with this module and what's available in Ananke?
2. What are the limitations of estimation with this methodology?
3. Where do all of these random numbers throughout the examples of using the code come from? Can we have those
propagated/calculated automatically? I don't trust any numbers that were typed by hand.
"""

import statistics
