
Commit

Clean up
cthoyt committed Jan 18, 2024
1 parent 91ce3e3 commit a80e470
Showing 5 changed files with 50 additions and 68 deletions.
File renamed without changes
Binary file removed docs/source/img/regression_example.pdf
Binary file added docs/source/img/regression_example.png
10 changes: 5 additions & 5 deletions src/eliater/examples/ecoli.py
@@ -129,11 +129,11 @@
" observed biomolecular networks. Bioinformatics, 39(Supplement_1), i494-i503.",
graph=graph,
description="This is the transcriptional E. coli regulatory network obtained from the EcoCyc database. "
"The experimental data were 260 RNA-seq normalized expression profiles of E. coli K-12"
" MG1655 and BW25113 across 154 unique experimental conditions, extracted from the PRECISE"
" database (Sastry et al., 2019, 'The Escherichia coli transcriptome mostly"
" consists of independently regulated modules').",
example_queries=[Query.from_str(treatments="fur", outcomes="dpiA")],
)

ecoli_transcription_example.__doc__ = ecoli_transcription_example.description
108 changes: 45 additions & 63 deletions src/eliater/regression.py
@@ -13,29 +13,26 @@
where $X$ can take discrete or continuous values. In the case of a binary exposure, where $X$ only takes the value 1 (meaning
that the treatment has been received) or 0 (meaning that treatment has not been received), the ATE is defined as
$\mathbb{E}[Y \mid do(X=1)] - \mathbb{E}[Y \mid do(X=0)]$. In this module, we support both continuous and discrete
values of $X$. In general the ATE varies depending on levels of $x$, but in linear models, the ATE reduces to a single
number. Hence, it does not depend on the value of $x$. This is shown below.
To build intuition for how to use linear regression on the treatment variable, we can create a
Gaussian linear structural causal model (SCM). In a Gaussian linear SCM, each variable is defined as a
linear combination of its parents plus Gaussian noise. For example, consider this graph:
.. code-block:: python
from y0.dsl import X, Y, Z
from y0.graph import NxMixedGraph
graph = NxMixedGraph.from_edges(
directed=[
(X, Y),
(Z, Y),
(Z, X),
],
undirected=[],
)
graph.draw()
.. figure:: img/backdoor.png
:scale: 70%
The goal is to find the causal effect of $X$ on $Y$. For this graph, a Gaussian linear SCM can be defined as below:
@@ -65,53 +62,45 @@
non-causal relationship due
to the confounder $Z$, indicated by the path X ← Z → Y. Such confounding paths beginning with an arrow directed
towards $X$ are termed *back-door paths*.
Nevertheless, it's noteworthy that the regression coefficient of $Y$ on $X$ when adjusted for $Z$
(denoted by $\gamma_{yx.z}$) simplifies to:
$\gamma_{yx.z} = \lambda_{xy}$
This means that adjusting for $Z$ blocks the back-door path, and allows us to directly estimate the effect of $X$
on $Y$, which leads to an unbiased estimate of the ATE.
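This can be checked numerically. Below is a minimal, self-contained simulation (plain NumPy with arbitrarily chosen
coefficient values; it is not part of eliater) of a Gaussian linear SCM for the graph above, comparing the unadjusted
regression coefficient of $Y$ on $X$ with the $Z$-adjusted one:

.. code-block:: python

import numpy as np

rng = np.random.default_rng(0)
n = 100_000
lambda_zx, lambda_zy, lambda_xy = 0.8, 1.5, 2.0  # arbitrary ground-truth coefficients

Z = rng.normal(size=n)
X = lambda_zx * Z + rng.normal(size=n)
Y = lambda_xy * X + lambda_zy * Z + rng.normal(size=n)

# Regressing Y on X alone picks up the back-door path X <- Z -> Y, so the
# coefficient is biased (about 2.73 here instead of the true 2.0).
gamma_yx = np.polyfit(X, Y, 1)[0]

# Adding Z to the regression blocks the back-door path; the coefficient of X
# recovers lambda_xy, i.e., the ATE (about 2.0).
coef, *_ = np.linalg.lstsq(np.column_stack([X, Z, np.ones(n)]), Y, rcond=None)
gamma_yx_z = coef[0]

print(round(gamma_yx, 2), round(gamma_yx_z, 2))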
A set of variables that blocks all back-door paths is called an adjustment set.
The module identifies an adjustment set with the following priority:

1. An optimal adjustment set - the adjustment set that leads to an estimate of the ATE with the least
asymptotic variance, if it exists
2. An optimal minimal adjustment set - identify all possible adjustment sets and choose one with the minimum
cardinality (i.e., number of elements) that results in the least asymptotic variance in the estimation of the ATE
(choosing the one with the least variance still needs to be implemented)
3. A randomly chosen adjustment set among the existing minimal adjustment sets

Once the adjustment set is selected, this module uses it to perform a regression with $X$ and the variables in the
adjustment set as the inputs and $Y$ as the output to find an unbiased
estimate of $P(Y \mid do(X=x))$, $\mathbb{E}[Y \mid do(X=x)]$, and the ATE.
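To make the notion of an adjustment set concrete, here is a brute-force sketch that enumerates minimal back-door
adjustment sets using only networkx. This illustrates the back-door criterion; it is not eliater's actual selection
procedure, and it ignores the asymptotic-variance criteria above:

.. code-block:: python

from itertools import combinations

import networkx as nx


def minimal_backdoor_sets(graph: nx.DiGraph, treatment: str, outcome: str):
    """Enumerate the smallest sets of observed nodes satisfying the back-door criterion."""
    # Valid adjustment variables may not be the treatment, the outcome, or a descendant of the treatment.
    candidates = set(graph) - {treatment, outcome} - nx.descendants(graph, treatment)
    # A set satisfies the back-door criterion if it d-separates treatment and outcome
    # in the graph with the treatment's outgoing edges removed.
    no_outgoing = graph.copy()
    no_outgoing.remove_edges_from(list(graph.out_edges(treatment)))
    valid = [
        set(subset)
        for size in range(len(candidates) + 1)
        for subset in combinations(sorted(candidates), size)
        # nx.d_separated is called nx.is_d_separator in newer networkx releases
        if nx.d_separated(no_outgoing, {treatment}, {outcome}, set(subset))
    ]
    smallest = min((len(s) for s in valid), default=0)
    return [s for s in valid if len(s) == smallest]


# The X <- Z -> Y, X -> Y graph from above: {Z} is the (only) minimal adjustment set.
backdoor_graph = nx.DiGraph([("Z", "X"), ("Z", "Y"), ("X", "Y")])
print(minimal_backdoor_sets(backdoor_graph, "X", "Y"))  # [{'Z'}]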
Linear regression is known for its simplicity, speed, and high interpretability.
However, linear regression is most appropriate when the variables exhibit linear relationships. In addition, it only
uses the variables from the back-door adjustment set and does not utilize useful variables such as mediators.
Example
-------
We'll work with the example :data:`eliater.frontdoor_backdoor.example_2` in which $X$ is the treatment and $Y$ is
the outcome. We will explore estimating causal effects in the forms of $P(Y \mid do(X=x))$,
$\mathbb{E}[Y \mid do(X=x)]$, and ATE.
.. code-block:: python
from y0.dsl import Z1, Z2, Z3, Variable, X, Y
M1 = Variable("M1")
M2 = Variable("M2")
from y0.graph import NxMixedGraph
graph = NxMixedGraph.from_edges(
directed=[
(Z1, X),
(X, M1),
(M1, M2),
(M2, Y),
(Z1, Z2),
(Z2, Z3),
(Z3, Y),
],
)
graph.draw()

# the same example ships with eliater and can be drawn directly
from eliater.examples import example_2

example_2.draw()
.. figure:: img/regression_example.png
:scale: 70%
Typically, estimation of ATE is more popular and desirable than estimation of $\mathbb{E}[Y \mid do(X=x)]$, because ATE
@@ -142,21 +131,21 @@
)
The output of the query type in the form of "expected value" ($\mathbb{E}[Y \mid do(X=1)]$) is 69.78.
This means that if one intervenes on $X$ and fixes its value to 1, the expected value of $Y$ will be around 70.
However, if one estimates the expected value of $Y$ without any intervention on $X$, it will amount to 73.61.
.. code-block:: python
from eliater.frontdoor_backdoor import example_2
import numpy as np
data = example_2.generate_data(100, seed=100)
np.mean(data['Y'])
This means that intervening on $X$ and fixing its value to 0 causes a decrease in the expected value of $Y$. Now
we will explore the output of the query in the form of ATE:
$\mathbb{E}[Y \mid do(X=x+1)] - \mathbb{E}[Y \mid do(X=x)]$. Note that the value of $x$ does not affect the
output, because the ATE amounts to the estimate of the coefficient of $X$ ($\hat{\lambda}_{xy}$) in a regression
where $Y$ is regressed on $X$ and the optimal adjustment set $Z_3$ as follows:
$Y = \lambda_{xy} X + \lambda_{z_3Y} Z_3 + U_Y; U_Y \sim \mathcal{N}(0, \sigma^2_Y)$
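As a sketch of that regression (this mirrors what the text describes, not the module's internal code, and it assumes
the data frame produced by ``example_2.generate_data`` has columns named ``X``, ``Z3``, and ``Y``):

.. code-block:: python

import statsmodels.api as sm

from eliater.frontdoor_backdoor import example_2

data = example_2.generate_data(100, seed=100)

# Regress Y on X and the optimal adjustment set {Z3}; the fitted coefficient
# of X is the estimate of lambda_xy, i.e., the ATE.
design = sm.add_constant(data[["X", "Z3"]])
fit = sm.OLS(data["Y"], design).fit()
print(fit.params["X"])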
@@ -216,11 +205,11 @@
graph = example_2.graph
data = example_2.generate_data(100, seed=100)
summary_statistics(
graph=graph,
data=data,
treatments={X},
outcome=Y,
interventions={X: 0},
)
The output is as follows:
@@ -229,19 +218,12 @@
first_quartile=64.48297054774694, second_quartile=69.74351236828335, third_quartile=74.41515094449318,
max=91.8646554122394)'''
.. todo:: Questions to answer in documentation:
1. How does estimation with linear regression work? Rework the text above from JZ. Remember the point
is to explain to someone who doesn't really care about the math but wants to decide if they should
use it for their situation
2. What's the difference between estimation with this module and what's available in Ananke?
3. What are the limitations of estimation with this methodology?
4. What does it look like to actually use this code? Give a self-contained example of doing estimation
with this module and include an explanation on how users should interpret the results.
PLEASE DO NOT DELETE THIS LIST. Leave it at the bottom of the module level docstring so we
can more easily check if all of the points have been addressed.
Unanswered Questions for Later
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1. What's the difference between estimation with this module and what's available in Ananke?
2. What are the limitations of estimation with this methodology?
3. Where do all of these random numbers throughout the examples of using the code come from? Can we have those
propagated/calculated automatically? I don't trust any numbers that were typed by hand.
"""

import statistics
