I wanted to get a better understanding of causal models. I first went through Dawid's excellent survey, but I was still somewhat uncertain about what can and cannot be done in causal inference. My opinion was that everything can be done with decision diagrams, but Pearl was kind enough to point out that this was not the case. So, I decided to start reading his book in my spare time. I intend to be as critical as possible. The book didn't help to clarify my thinking very much (in contrast to decision diagrams).
In this post, I'll mainly go through the first chapter, and chapter 7, which is the one that gives a formal definition of causal models. Causal models are simply graphical models with two types of random variables: endogenous and exogenous. There are two questions to answer:
1. How do we model interventions? Here, Pearl offers 'do' actions, which simply disconnect endogenous variables from input by the rest of the graph, and replace them with a fixed choice. This makes sense, but the set of actions is limited to these, and so the formalism looks like a specialisation of decision diagrams.
2. How do we deal with observed variables when dealing with counterfactuals and interventions? Should we assume that "everything else remains the same"? It's hard to motivate this for interventions, but it seems to make sense for counterfactuals. But in the end, it should all depend on whether observations are helping us infer model parameters or simply 'what is currently happening'. The book offers no clear guidance on this topic.
All in all, I found the formalisms in this book clear but restrictive, and the examples and discussion unhelpful and confusing. For example, it is often unclear, and left to intuition, whether some variables should be considered observed or not, and "structural equation modelling" does nothing to clarify the issue. But it is possible I misunderstood something.
Without further ado, here are my detailed notes.
* Chapter 1
This chapter introduces the book's main themes. After going through the basics of Bayesian networks, the concept of Causal Bayesian Networks is introduced. The general principle is that directed links should imply causal effects. There is also the principle of interventions. If the BN is causal, then performing an action that sets a variable to some value also involves removing the link between that variable and its parents. This preserves the probability properties that we'd expect. The chapter also discusses the stability of causal relations, talking about how causal relationships are /ontological/ while probabilistic ones are /epistemic/.
Functional causal models are basically described as equations of the form
\[
x_i = f_i (P_i, u_i)
\]
where $P_i$ are the parents of $i$ and $u_i$ are disturbances.
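As a toy illustration (my own, not the book's), such a model can be sampled by drawing the disturbances and then evaluating each $f_i$ in turn:

```python
import random

# Minimal sketch of a two-variable functional causal model (my own example):
# x1 has no endogenous parents, x2 has parent x1, and each variable gets an
# independent Gaussian disturbance u_i.

def sample():
    u1 = random.gauss(0, 1)   # disturbance for x1
    u2 = random.gauss(0, 1)   # disturbance for x2
    x1 = u1                   # x1 = f_1(u1): no endogenous parents
    x2 = 2 * x1 + u2          # x2 = f_2(x1, u2): x1 is the parent of x2
    return x1, x2
```

Given the disturbances $u_i$, everything is deterministic; all the randomness lives in the $u_i$.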
There are three types of queries that Pearl wishes to discuss:
- Predictions: Probability of $x_i$ if we observe $x_j$.
- Interventions: Probability of $x_i$ if we /set/ $x_j$ to some value.
- Counterfactuals: Probability of $x_i$ if $x_j$ had taken some other value.
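To make the difference between the first two queries concrete, here is a small simulation (my own toy example with made-up probabilities, not from the book) of a confounded model in which $x$ has no causal effect on $y$, so observing $x$ and setting $x$ give different answers:

```python
import random

def world(do_x=None):
    u = random.random() < 0.5                       # hidden common cause
    noisy = u if random.random() < 0.9 else not u   # x is a noisy copy of u
    x = noisy if do_x is None else do_x             # do(x) severs the u -> x link
    y = u                                           # y depends only on u, not on x
    return x, y

def p_y_given_x1(n=200_000):
    # prediction: estimate P(y = 1 | x = 1) from observational samples
    hits = trials = 0
    for _ in range(n):
        x, y = world()
        if x:
            trials += 1
            hits += y
    return hits / trials

def p_y_do_x1(n=200_000):
    # intervention: estimate P(y = 1 | do(x = 1))
    return sum(world(do_x=True)[1] for _ in range(n)) / n
```

Observing $x = 1$ makes $y = 1$ likely (about 0.9) because both share the cause $u$, while setting $x = 1$ leaves $P(y = 1)$ at 0.5: the intervention tells us nothing about $u$.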
In my view, counterfactuals are no different from predictions conceptually. Since $x_j$ is a random variable, what we have essentially is
\[
\Pr(x_i = a \mid x_j = b)
= \Pr(\{\omega : x_i(\omega) = a \wedge x_j(\omega) = b\})
/ \Pr(\{\omega : x_j(\omega) = b\}).
\]
So now we are counting the worlds in which $x_i, x_j$ take a given joint value. For the counterfactual, we need to be able to say something about how things might have happened otherwise. That would necessarily mean changing the value of $x_j$ for that specific $\omega$.
For more on counterfactuals, Pearl gives an example where $x, y \in \{0,1\}$ and where $\Pr(y \mid x) = 1/2$ for all $x, y$. This example is basically that of patients either dying or being cured by treatments. Statistically, it is impossible to separate the case where the treatment has no effect from the case where the patients belong to two populations, one dying due to the treatment and the other being cured by it.
The solution to this paradox is to *give the treatment* to a sub-sample of the population that survived. However, this effectively creates an additional random variable. Instead of having one treatment-effect pair, we have two for each patient $\omega$: $(x_1, y_1)$ and $(x_2, y_2)$. In this context, the $x_i$ are decision variables. Hence, decision diagrams seem perfectly sufficient to model this.
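A quick simulation (my own sketch; $x = 1$ meaning treated, $y = 1$ meaning cured) shows that the two hypotheses are observationally identical while their counterfactuals differ:

```python
import random

def model_null(u, x):
    # hypothesis 1: the treatment has no effect; outcome is the latent type u
    return u

def model_split(u, x):
    # hypothesis 2: two populations; type u = 1 is cured iff treated,
    # type u = 0 is cured iff untreated
    return x if u else 1 - x

def joint_table(model, n=100_000):
    counts = {}
    for _ in range(n):
        u = random.randrange(2)   # latent patient type, uniform
        x = random.randrange(2)   # treatment assigned at random
        y = model(u, x)
        counts[(x, y)] = counts.get((x, y), 0) + 1
    return {k: v / n for k, v in counts.items()}
```

Both joint tables come out near $1/4$ in every cell, so the hypotheses are statistically indistinguishable; yet under `model_split`, for a fixed patient $u$, $Y_{x=1}(u) = 1 - Y_{x=0}(u)$.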
Finally, this section introduces an alternative functional equation notation for the conditional probability table. This is more economical as a representation, but I see no other advantage in it (e.g. it does not allow for some other causal inferences). I would personally prefer to consider $u_1, u_2$ as samples from the underlying probability space rather than as additive 'disturbances' and then fall back to familiar decision theoretic territory.
* Chapter 7
This is the first formal chapter, and the only one that actually needs
reading. A causal model is a triple
\[
M = \langle U, V, F\rangle
\]
where
- $U$ are background/exogenous variables
- $V$ are endogenous variables (determined by variables in the model, i.e. $U \cup V$)
- $F$ is a set of functions so that
\[
v_i = f_i (u_i, \mathcal{P}_i),
\]
where $\mathcal{P}_i$ are the parents of $v_i$ in $V$.
Thus, every causal model $M$ can be represented as a directed graph.
A submodel is an important concept. If $X \subset V$ and $x$ is a
realisation of $X$, then the submodel is $M_x = \langle U, V, F_x \rangle$ with $F_x = \{f_i : V_i \notin X\} \cup \{X = x\}$. This corresponds to deleting the functions of the variables in $X$ and replacing them with constant functions.
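A minimal sketch of this construction (my own code; the variable names and functions are made up):

```python
# Submodel M_x: replace the function of each intervened variable with a
# constant, leaving the rest of F intact.

def submodel(F, intervention):
    """F maps a variable name to a function of the current value assignment."""
    Fx = dict(F)
    for var, val in intervention.items():
        Fx[var] = lambda vals, v=val: v   # constant function X = x
    return Fx

def evaluate(F, order, exogenous):
    vals = dict(exogenous)                # start from the exogenous values u
    for var in order:                     # visit V in topological order
        vals[var] = F[var](vals)
    return vals

# Hypothetical two-variable model: p = u2, q = 2p + u1.
F = {"p": lambda v: v["u2"], "q": lambda v: 2 * v["p"] + v["u1"]}
before = evaluate(F, ["p", "q"], {"u1": 1, "u2": 3})                     # p=3, q=7
after = evaluate(submodel(F, {"p": 5}), ["p", "q"], {"u1": 1, "u2": 3})  # p=5, q=11
```

Only the equation for the intervened variable changes; the disturbances and the other functions are untouched.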
Pearl then considers a very specific set of actions, such that each action $x$ on $X$ consists of assigning the value $x$ to $X$. However, more generally we can speak of actions as policies $\pi$. Each policy has a certain effect on the variables, such as setting some of them to a specific value.
In the decision theoretic setting, we can instead consider a set of actions $A$. We can then simply define the random variables $v_i$ as the function
\[
v_i = v_i(f_i, a).
\]
Pearl's account amounts to a particular special case, where the set of actions $A$ includes the null action (which has no effect) and a set of interventions on specific variables (which correspond to submodels).
*Potential responses* of actions are simply the sets of possible values of the other variables when we fix some action $a$.
*Counterfactual* The counterfactual that some variable $Y = y$ is simply the equality $Y_x(u) = y$, so we need to calculate the probability of $Y = y$ under the submodel $M_x$. Note that nothing done here differs from what is done in standard decision diagrams... [See Theorem 7.1.7]
*Probabilistic Causal Model* A probabilistic causal model is a tuple
\[
\langle M, P \rangle
\]
where $P$ is simply a probability measure on the exogenous variables $U$.
Then counterfactual statements have the following probabilities
\[
P(Y = y \mid a)
=
P(\{u : Y(u) = y \} \mid a) .
\]
So, this is the same as the standard decision theoretic calculus.
I once again see no reason to introduce new notation, which only seems to limit us to very specific action models.
**** Probabilistic example in linear models
Here, we have a model with the dependencies
\begin{align}
q &= b_1 p + d_1 i + u_1
\\
p &= b_2 q + d_2 w + u_2
\end{align}
In this model, the set of endogenous variables is
\[
V = \{q, p\}
\]
and the exogenous ones are
\[
u = \{u_1, u_2, i, w\},
\]
with $i, w$ observed and $u_1, u_2$ latent and independent of $i, w$. We can assume that $u$ is jointly Gaussian. The book then asks three questions about this model.
1. What is the expected value of the demand $q$ if we set the price to $p_0$? I take this to mean $E[q | a(p := p_0)]$, but the book actually also conditions on $i$.
The modified model says that $p$ depends only on our non-null action, which is $p := p_0$.
Then
\[
E[q | a(p := p_0)] = b_1 p_0 + d_1 E[i] + E[u_1],
\qquad
E[q | a(p := p_0), i] = b_1 p_0 + d_1 i + E[u_1]
\]
But we also know that
\[
E[u_1] = E[q] - b_1 E[p] - d_1 E[i],
\]
so we can replace that directly in there to obtain
\begin{align}
E[q | a(p := p_0)] &= E[q] + b_1(p_0 - E[p])
\\
E[q | a(p := p_0), i] &= E[q] + b_1(p_0 - E[p]) + d_1(i - E[i])
\end{align}
This seems right, though the question said nothing about conditioning on $i$.
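The first formula can be checked numerically. Below is a Monte Carlo sketch (my own, with arbitrary coefficient and noise choices) comparing $E[q \mid a(p := p_0)]$ estimated from the submodel against $E[q] + b_1(p_0 - E[p])$ estimated from observational samples:

```python
import random

# Numerical sanity check of E[q | a(p := p_0)] = E[q] + b_1 (p_0 - E[p]),
# with made-up coefficients and unit-variance Gaussian exogenous variables.
random.seed(0)
b1, d1, b2, d2 = 0.5, 1.0, -0.3, 2.0

def observe():
    i, w = random.gauss(1, 1), random.gauss(2, 1)     # observed exogenous
    u1, u2 = random.gauss(0, 1), random.gauss(0, 1)   # latent disturbances
    # solve the simultaneous system q = b1 p + d1 i + u1, p = b2 q + d2 w + u2
    q = (b1 * (d2 * w + u2) + d1 * i + u1) / (1 - b1 * b2)
    p = b2 * q + d2 * w + u2
    return q, p

def intervene(p0):
    # submodel: p's equation is replaced by the constant p0
    i, u1 = random.gauss(1, 1), random.gauss(0, 1)
    return b1 * p0 + d1 * i + u1

n, p0 = 200_000, 3.0
qs, ps = zip(*(observe() for _ in range(n)))
Eq, Ep = sum(qs) / n, sum(ps) / n
lhs = sum(intervene(p0) for _ in range(n)) / n   # E[q | a(p := p0)]
rhs = Eq + b1 * (p0 - Ep)                        # E[q] + b1 (p0 - E[p])
# lhs and rhs agree up to Monte Carlo error
```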
2. This is a simple observation question.
3. Given that the actual price is $p_0$, what would the expected demand be if we set it to $p_1$? I interpret this to mean
\[
E[q | a(p:=p_1), p_0, i, w],
\]
i.e. since this is a counterfactual, we assume that /everything else, including the exogenous variables, stays the same/. I am not sure why this needs to be the case. I understand why we need to condition on the observed endogenous variable $p$, but conditioning on the exogenous variables lacks any motivation.