01 - From causal inference to out-of-distribution generalization simulation

Apr 3, 2023

Here we summarize the work of Meinshausen (2018). We recommend reading the original paper for a more complete overview of the ideas sketched here.

Introduction

The most common learning paradigm in statistical learning is arguably Empirical Risk Minimization (ERM), which can be written as:

$$ \arg\min_{\theta} \mathbb{E}_{(X, Y) \sim P}[\ell(Y, f_{\theta}(X))] $$

where $X \in \mathbb{R}^d$ is a set of covariates, $Y \in \mathbb{R}$ is a target variable, and $P$ is the training distribution of $(X, Y)$. We aim to find the parameters $\theta$, which parametrize the function $f_\theta: \mathbb{R}^d \rightarrow \mathbb{R}$, by minimizing the expected value of the loss $\ell: \mathbb{R} \times \mathbb{R} \to \mathbb{R}$.
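As a minimal sketch of ERM with the squared loss and a linear model, here is a small NumPy simulation (the ground-truth coefficients `b` and the noise scale are hypothetical, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Training data: X ~ N(0, I), Y = X @ b + noise (hypothetical linear ground truth).
n, d = 1000, 3
b = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(n, d))
Y = X @ b + 0.1 * rng.normal(size=n)

# ERM with squared loss and f_theta(x) = x @ theta: minimizing the empirical
# risk over theta has the closed-form least-squares solution.
theta_erm, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(np.round(theta_erm, 2))  # close to b = [1.0, -2.0, 0.5]
```

Because train and test samples come from the same distribution here, the ERM solution recovers `b` up to noise; the rest of the post is about what happens when that assumption breaks.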

While this is very useful when the goal is to predict outcomes for new samples $X$ drawn from the same distribution as the training data, it can fail dramatically when the test distribution shifts — a common scenario in real-world applications. In this context, a more robust learning framework is the minimization of the worst-case risk, where the distribution $Q$ of $(X, Y)$ is chosen from a set of possible distributions $\mathbb{Q}$. This can be formally written as:

$$ \arg\min_{\theta} \sup_{Q \in \mathbb{Q}} \mathbb{E}_{(X, Y) \sim Q}[\ell(Y, f_{\theta}(X))] $$

The class $\mathbb{Q}$, often referred to as the uncertainty set, can be constrained using distributional metrics such as $f$-divergence (see Namkoong (2016)) or the Wasserstein distance (see Esfahani (2015)).

In this work, we consider a set of distributions $\mathbb{Q}$ arising from interventions on covariates. We will see that allowing different sets of interventions leads to different types of robustness and generalization guarantees. The next section introduces key concepts from causality to help define intervention-based distributions.

The Causal Inference Paradigm

Modularity Assumption

Let’s begin with a simple example. Suppose we are given two variables, temperature $T$ and altitude $A$, along with their joint distribution $p(T, A)$. Statistically, this distribution can be factored in two ways:

$$ p(A, T) = p(A \mid T)p(T) $$

or

$$ p(A, T) = p(T \mid A)p(A) $$

The second factorization is considered causal because $p(T \mid A)$ reflects a physical mechanism linking altitude to temperature, up to some noise. This mechanism is said to be modular, meaning that if we intervene by changing altitude, the mechanism remains stable.

In contrast, the first factorization is non-causal, as $p(A \mid T)$ does not represent a stable mechanism. There’s no reliable way to model how altitude would respond to changes in temperature.

If a joint distribution is entailed by a directed acyclic graph (DAG), the graph is called causal if it respects modularity. That is, manipulating one variable does not alter the conditional distributions of the others. This becomes more intuitive under a Structural Causal Model (SCM), where each variable is defined by a function of its parents. Formally, an SCM $\mathcal{C}$ consists of structural assignments:

$$ X_i := f_i(PA_i, N_i), \quad i = 1, \dots, n $$

where $PA_i$ are the parents of $X_i$, $N_i$ is a noise term, and $f_i$ is a stable (modular) function. The modularity assumption implies that interventions change only the relevant functions, leaving others untouched.
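The structural assignments above can be simulated directly. Below is a toy SCM over three variables with DAG $X_1 \to X_2 \to X_3$ (the mechanisms $f_i$ are hypothetical, chosen only to illustrate the form $X_i := f_i(PA_i, N_i)$):

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_scm(n):
    """Sample n draws from a toy SCM with assignments X_i := f_i(PA_i, N_i).

    DAG: X1 -> X2 -> X3. Each variable is a fixed (modular) function of its
    parents plus an independent noise term.
    """
    N1, N2, N3 = rng.normal(size=(3, n))
    X1 = N1                 # X1 := f_1(N1), no parents
    X2 = 2.0 * X1 + N2      # X2 := f_2(X1, N2)
    X3 = X2 ** 2 + N3       # X3 := f_3(X2, N3), a nonlinear mechanism
    return X1, X2, X3

X1, X2, X3 = sample_scm(10_000)
```

Modularity means an intervention on, say, $X_2$ would replace only the line defining `X2`, leaving the assignments for `X1` and `X3` untouched.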

Interventions

So far, we’ve used the idea of manipulation informally. In causal inference, we formalize it using the concept of intervention. Given a set of variables $X$ governed by an SCM, a (strong) intervention sets a variable to a fixed value.

Returning to the altitude and temperature example, assume their relationship is captured by the SCM:

$$ T = f(A, N) $$

where $f$ is a deterministic, modular function and $N$ is some noise. A strong intervention sets the random variable $A$ to a specific value $a$, yielding the interventional distribution:

$$ T^{do(A=a)} = f(a, N) $$

Note that this distribution remains random due to the noise $N$.
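We can make this concrete in simulation. In the sketch below, the mechanism $f$ is a hypothetical linear lapse-rate model (the numbers are illustrative, not meteorological fact); the key point is that the same function is used observationally and under $do(A = a)$:

```python
import numpy as np

rng = np.random.default_rng(2)

def temperature(A):
    """Structural assignment T := f(A, N). The mechanism is unchanged under
    intervention; only its input A changes (modularity)."""
    N = rng.normal(0.0, 1.0, size=len(A))   # noise term
    return 15.0 - 6.5e-3 * A + N            # hypothetical lapse-rate model

n = 100_000
# Observational distribution: A drawn from its own (here uniform) marginal.
A_obs = rng.uniform(0.0, 3000.0, size=n)
T_obs = temperature(A_obs)

# Interventional distribution: do(A = 2000) fixes A to a constant,
# but T^{do(A=a)} = f(a, N) remains random through the noise N.
T_do = temperature(np.full(n, 2000.0))

print(round(T_do.mean(), 1), round(T_do.std(), 1))  # ≈ 2.0 and ≈ 1.0
```

Note that the interventional standard deviation of `T_do` equals the noise scale: fixing $A$ removes the variability coming through altitude but not the variability of $N$.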

More generally, for an SCM $\mathcal{C}$, we denote the distribution under a hard intervention as $P_X^{\mathcal{C}, do(X_i := x_i)}$, or simply $P_X^{do(X_i := x_i)}$.

Philosophical note: This idea aligns with Woodward's manipulationist theory of causation, which understands causation in terms of a system's response to manipulation: an intervention breaks a variable's usual causal influences and sets it to a new value. For an introduction to this perspective, see Woodward (2001).

Distribution Generalization via Causal Inference

Consider the case where $P_{X, Y}$ is entailed by the following linear SCM:

$$ X = X\mathbf{B} + N \in \mathbb{R}^d $$

$$ Y = X\mathbf{b} + N_y \in \mathbb{R} $$

We denote the interventional distribution obtained by setting some component of $X$ to a specific value as $P_{X, Y}^{do(X_i := x_i)}$.

Define $\mathbb{Q}$ as the set of all such interventional distributions:

$$ \mathbb{Q} := \{P_{X, Y}^{do(X := x)} \mid x \in \mathcal{X} \subseteq \mathbb{R}^d\} $$

It can be shown that for the squared loss $\ell(Y, f_\theta(X)) := (Y - f_\theta(X))^2$, and when $\mathcal{X} = \mathbb{R}^d$ (all interventions are allowed), we have the equivalence:

$$ \theta^{\text{causal}} = \arg\min_{\theta} \sup_{Q \in \mathbb{Q}} \mathbb{E}_{(X, Y) \sim Q}[\ell(Y, f_\theta(X))] $$

where $f_{\theta^{\text{causal}}}$ is the structural equation from the SCM. The intuition is straightforward:

$$ \sup_{Q \in \mathbb{Q}} \mathbb{E}_{(X, Y) \sim Q}[(Y - f_\theta(X))^2] = \begin{cases} \infty & \text{if } \theta \ne \theta^{\text{causal}} \\ \operatorname{Var}(N_y) & \text{if } \theta = \theta^{\text{causal}} \end{cases} $$

If $\theta \ne \theta^{\text{causal}}$, there exists some intervention that causes the loss to diverge. In contrast, if $\theta = \theta^{\text{causal}}$, the loss becomes independent of $X$ and equals the variance of the noise.
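This dichotomy is easy to reproduce in simulation. Below is a hypothetical linear SCM with chain $X_1 \to Y \to X_2$, so $X_2$ is a descendant of $Y$: ERM gladly exploits $X_2$ for prediction, but interventions on $X_2$ make its risk blow up, while the causal coefficients keep a risk of $\operatorname{Var}(N_y) = 1$ (all parameter values here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy linear SCM (hypothetical): X1 -> Y -> X2.
n = 50_000
b1 = 1.0
X1 = rng.normal(size=n)
Y = b1 * X1 + rng.normal(size=n)   # Y := b1 * X1 + N_y, Var(N_y) = 1
X2 = Y + rng.normal(size=n)        # X2 is a descendant of Y
X = np.column_stack([X1, X2])

theta_causal = np.array([b1, 0.0])                 # only X1 is a causal parent of Y
theta_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)  # ERM also loads on X2

def intervention_risk(theta, x2, m=50_000):
    """Estimate the squared-error risk under do(X2 := x2).

    By modularity, Y's own mechanism is unchanged by the intervention."""
    X1i = rng.normal(size=m)
    Yi = b1 * X1i + rng.normal(size=m)
    Xi = np.column_stack([X1i, np.full(m, x2)])
    return np.mean((Yi - Xi @ theta) ** 2)

for x2 in [0.0, 10.0, 100.0]:
    print(x2, round(intervention_risk(theta_causal, x2), 2),
          round(intervention_risk(theta_ols, x2), 2))
# The causal risk stays near Var(N_y) = 1 for every intervention,
# while the ERM risk grows without bound as |x2| increases.
```

On the observational data ERM has lower risk than the causal parameters, which is exactly why it is tempting; the worst-case criterion is what flips the ranking.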

Discussion

We presented a framework showing that, in the linear case and for the squared loss, worst-case risk minimization for out-of-distribution generalization is equivalent to recovering the causal parameters of an SCM, provided the uncertainty set $\mathbb{Q}$ includes all possible interventions on $X$.

In the next chapter, we explore how this approach might be overly conservative. Restricting $\mathbb{Q}$ to a specific subset of interventions can sometimes yield better generalization. This direction builds on the work of Rothenhäusler et al. (2021) and the idea of anchor regression.

👉 Anchor regression as a diluted form of causality