Structure-Agnostic Causal Estimation

We have another new technical blog post, courtesy of Jikai Jin and Vasilis Syrgkanis, about the optimality of double machine learning for causal inference.


An introduction to causal inference

Causal inference deals with the fundamental question of “what if”, trying to estimate/predict the counterfactual outcome that one does not directly observe. For instance, one may want to understand the effect of a new medicine on a population of patients. For each patient, we never simultaneously observe the outcome under the new medicine (treatment) and the outcome under the baseline treatment (control). This makes causal inference a challenging task, and the ground-truth causal parameter of interest is identifiable only under additional assumptions on the data generating process.

The most central quantity of interest in the causal inference literature is the Average Treatment Effect (ATE). To mathematically define the ATE, we will use the language of potential outcomes. We posit that nature generates two potential outcomes Y_i(0), Y_i(1), where Y_i(d) can be thought of as the outcome we would have observed from unit i, had we treated them with treatment d\in \{0,1\}. Then the ATE is defined as the average difference of these two potential outcomes in the population:

\theta = \mathbb{E}_{P_0}[Y(1)-Y(0)].

Unless otherwise specified, we will always use a subscript 0 to denote the ground-truth quantity. The main problem is that for each unit we do not observe both potential outcomes. Rather, we observe only the potential outcome for the assigned treatment, Y_i = Y_i(D_i).

The first key question in causal inference is the identification question: can we write the ATE, which depends on the distribution of unobserved quantities, as a function of the distribution of observed random variables? Many techniques have been developed in causal inference that solve the identification question under various assumptions on the data generating process and the kinds of variables that are observed. Interested readers can search for terms such as identification by conditioning, instrumental variables, proximal causal inference, difference-in-differences, regression discontinuity and synthetic controls, and refer to related textbooks [AP09, CHK+24].

For the purpose of this blog we will focus on identification by conditioning, which has been well-studied in the literature and is very frequently used in the practice of causal inference. This identification approach assumes that, once we condition on a large enough set of observed characteristics X (typically referred to as “control variables” or “confounders”), the treatment is assigned as if in a randomized trial; a condition known as the conditional ignorability assumption:

\{Y(0),Y(1)\} \perp D \mid X.

Under this assumption, the ATE is identifiable via the well-known g-formula:

\theta = \mathbb{E}[g(1,X)-g(0,X)],

where the function g(d,x) = \mathbb{E}[Y\mid X=x, D=d] is a regression function and is thus uniquely determined by the distribution of the observed data. Intuitively, this formula says: train a predictive model that predicts the outcome from the treatment and the control variables, and then take the average difference of the predictions of this model as you flip the treatment variable on and off. This quantity is also closely related to the partial dependence plot used frequently in interpretable machine learning: it is the difference between the value of the partial dependence function of the outcome with respect to the treatment at treatment value one and its value at treatment value zero.
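To make this concrete, here is a minimal sketch of the plug-in g-formula estimate in Python; the use of scikit-learn's GradientBoostingRegressor and the array names X, D, Y are illustrative choices for this post, not part of the original formula.

    # Minimal sketch of the plug-in g-formula estimate of the ATE (illustrative).
    # Assumes numpy arrays: X of shape (n, k) with covariates, D of shape (n,) with
    # binary treatments, and Y of shape (n,) with outcomes.
    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    def g_formula_ate(X, D, Y):
        # Fit a model for g(d, x) = E[Y | X = x, D = d] by treating d as an extra feature.
        model = GradientBoostingRegressor().fit(np.column_stack([D, X]), Y)
        # Predict the outcome for every unit with the treatment flipped on and off.
        g1 = model.predict(np.column_stack([np.ones(len(Y)), X]))
        g0 = model.predict(np.column_stack([np.zeros(len(Y)), X]))
        # Average difference of predictions: the empirical analogue of E[g(1,X) - g(0,X)].
        return np.mean(g1 - g0)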

The causal machine learning paradigm

The second key question in causal inference is the estimation question: given n samples of the observed variables, how should we estimate the ATE? In other words, we need to translate the identification strategy into an estimation strategy. For instance, in the context of identification by conditioning, note that even though our goal is to estimate \theta, to achieve that we also need to estimate the complicated non-parametric regression function g. Such auxiliary functions, whose estimation is required in order to estimate the target parameter of interest, are referred to as nuisance functions. The requirement to estimate complicated nuisance functions in a flexible manner arises in most identification strategies in causal inference and this is exactly where machine learning techniques can be of great help, giving rise to the Causal Machine Learning paradigm.

At a high level, causal machine learning is an emerging research area that incorporates machine learning (ML) techniques into statistical problems that emerge in causal inference. In the past decade, ML has gained tremendous success on numerous tasks, such as image classification, language processing, and video games. These problems more or less possess certain intrinsic structures that one can exploit. In image classification problems, for example, semantically meaningful objects can typically be found locally as a combination of pixels, and this suggests that using convolutional neural networks, rather than standard feed-forward neural networks, might lead to better results. The idea of causal machine learning is to leverage the ability of ML techniques to adapt to intrinsic notions of dimension, when learning the complex nuisance quantities that arise in causal identification strategies.

Double/debiased machine learning: an overview

What makes causal ML different from ML? To answer this question, it is instructive to revisit an extremely popular algorithm in causal ML: double/debiased machine learning (DML) [CCD+17] (variants of the ideas we will present below have also appeared in the targeted learning literature [LR11], but for simplicity of exposition we adopt the DML paradigm in this blogpost).

Suppose that we are given i.i.d. data \{(X_i,D_i,Y_i)\}_{i=1}^n where X_i is a high-dimensional covariate vector, D_i is a binary treatment variable and Y_i is an outcome of interest. Without loss of generality, we can describe the data generating process of these variables via the following nonparametric regression equations:

Y = g(D,X) + \epsilon, \quad \mathbb{E}[\epsilon\mid D,X]=0

D=p(X)+\eta, \quad \mathbb{E}[\eta \mid X]=0,

where g(d,x) is known as the outcome regression and p(x) is known as the propensity score. Let P_0 be the distribution of (X,D,Y). Then the ATE problem asks us to estimate the quantity \theta_0 = \mathbb{E}[g(1,X) - g(0,X)].
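As a concrete illustration, a toy simulation consistent with these regression equations might look as follows; the particular functional forms of g and p below are arbitrary choices made up for this example.

    # Toy data generating process matching the regression equations above (illustrative).
    import numpy as np

    def simulate(n, k=5, seed=0):
        rng = np.random.default_rng(seed)
        X = rng.uniform(0.0, 1.0, size=(n, k))          # covariates
        p = 1.0 / (1.0 + np.exp(-(X[:, 0] - 0.5)))      # propensity score p(X)
        D = rng.binomial(1, p)                          # treatment, E[D | X] = p(X)
        g = lambda d, X: 0.5 * d + X[:, 0] * X[:, 1]    # outcome regression g(d, X)
        Y = g(D, X) + 0.1 * rng.standard_normal(n)      # outcome, E[eps | D, X] = 0
        return X, D, Y                                  # true ATE is 0.5 by construction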

The ATE is just one example of a broad class of causal parameter estimation problems, for which the ground-truth parameter \theta_0 satisfies some moment equation

\mathbb{E}[m(Z,\theta_0,h_0(X))]=0,

where m(\cdot) is some moment function, Z is the observed data, X is a subvector of Z, and h_0 is the ground-truth nuisance function. In the case of the ATE, we can for example choose h=(g,p) and m\left(Z, \theta, h(X)\right)=g(1, X)-g(0, X)-\theta, a moment that depends on Z only through X. Given this expression, a naive approach for estimating \theta_0 is to first use ML to fit an estimate \hat{h} of the ground-truth nuisance functions h_0, and then solve the empirical moment equation

\frac{1}{n}\sum_{i=1}^n m(Z_i,\theta,\hat{h}(X_i)) = 0.

However, the resulting estimate \hat{\theta} would be biased if the nuisance estimates are biased. The latter happens quite often in practice, since ML methods typically rely on regularization to prevent overfitting. As a result, it would be desirable for our estimate to be more robust to nuisance estimation errors.

The key observation is that this would be the case if a Neyman orthogonality condition holds, namely 

\mathbb{E}\left[\partial_h m\left(Z, \theta_0, h_0(X)\right)\right]=0,

where \partial_h denotes the functional derivative with respect to h.

Intuitively, this condition says that the moment is insensitive, to first order, to misspecification of the nuisance functions. A simple Taylor expansion then implies that the error of the estimate \hat{\theta}_{DML} obtained from the empirical moment equation depends on the nuisance errors only through second-order terms.
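To sketch the argument (suppressing regularity conditions), expand the population moment around the true nuisance h_0:

\mathbb{E}\left[m\left(Z, \theta_0, \hat{h}(X)\right)\right] \approx \underbrace{\mathbb{E}\left[m\left(Z, \theta_0, h_0(X)\right)\right]}_{=0 \text{ (moment equation)}} + \underbrace{\mathbb{E}\left[\partial_h m\left(Z, \theta_0, h_0(X)\right)\left(\hat{h}(X)-h_0(X)\right)\right]}_{=0 \text{ (Neyman orthogonality)}} + O\left(\big\|\hat{h}-h_0\big\|^2\right).

The zeroth-order term vanishes by the moment equation and the first-order term by Neyman orthogonality, so the bias introduced by plugging in \hat{h} is of second order in the nuisance error.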

In the case of ATE, the moment function

m(Z, \theta, h(X))=(g(1, X)-g(0, X))+\frac{D(Y-g(1, X))}{p(X)}-\frac{(1-D)(Y-g(0, X))}{1-p(X)}-\theta

satisfies such requirements, where Z=(X,D,Y) and h=(g,p). Given a dataset \{(X_i,D_i,Y_i)\}_{i=1}^n, we can split it into two datasets \mathcal{D}_1=\{(X_i,D_i,Y_i)\}_{i=1}^{n/2} and \mathcal{D}_2=\{(X_i,D_i,Y_i)\}_{i=n/2+1}^{n} each with n/2 samples. Then DML consists of the following two stages:

  1. Use our favorite ML method (e.g., Lasso, random forests, neural networks) on the first half of the data \mathcal{D}_1 to estimate g(\cdot) and p(\cdot), obtaining nuisance estimates \hat{g} and \hat{p}.
  2. Using the second half \mathcal{D}_2, solve the empirical moment equation for \theta

 \sum_{i=n/2+1}^n m(Z_i,\theta,\hat{h}(X_i))=0

 where \hat{h}=(\hat{g},\hat{p}) is our first-stage nuisance estimate.

Note that the main reason DML improves over the naive approach is that its moment function is chosen to satisfy the Neyman orthogonality property. By contrast, one can easily verify that the moment function m\left(Z, \theta, h(X)\right)=g(1, X)-g(0, X)-\theta is not Neyman orthogonal.
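The two-stage procedure above can be written compactly. The following Python sketch, which uses scikit-learn learners and a single two-fold split (both illustrative choices, not the authors' implementation), implements the orthogonal moment for the ATE with sample splitting:

    # Minimal sketch of DML for the ATE with sample splitting (illustrative).
    # Assumes numpy arrays X (n, k), D (n,), Y (n,).
    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

    def dml_ate(X, D, Y, clip=0.01, seed=0):
        n = len(Y)
        idx = np.random.default_rng(seed).permutation(n)
        fold1, fold2 = idx[: n // 2], idx[n // 2:]

        # Stage 1: fit the nuisance functions g and p on the first half of the data.
        g_hat = GradientBoostingRegressor().fit(np.column_stack([D[fold1], X[fold1]]), Y[fold1])
        p_hat = GradientBoostingClassifier().fit(X[fold1], D[fold1])

        # Stage 2: evaluate the Neyman-orthogonal (doubly robust) moment on the second half.
        X2, D2, Y2 = X[fold2], D[fold2], Y[fold2]
        g1 = g_hat.predict(np.column_stack([np.ones(len(Y2)), X2]))
        g0 = g_hat.predict(np.column_stack([np.zeros(len(Y2)), X2]))
        gD = np.where(D2 == 1, g1, g0)
        p = np.clip(p_hat.predict_proba(X2)[:, 1], clip, 1 - clip)  # guard against extreme propensities

        # Solving the empirical orthogonal moment equation for theta yields a sample average.
        return np.mean(g1 - g0 + (D2 - p) / (p * (1 - p)) * (Y2 - gD))

On the toy simulation from earlier, dml_ate(*simulate(10000)) should return a value close to the true ATE of 0.5. In practice one typically also swaps the roles of the two folds and averages the two estimates (cross-fitting), which recovers full-sample efficiency.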

It is well known that, for the ATE, the DML approach also possesses the double robustness property, which dates back to the seminal work of [RRZ94]. In fact, the resulting estimator is the well-known doubly robust estimator of [RRZ94], with the extra element of sample splitting when estimating the nuisance functions. Specifically, suppose that our first-stage nuisance estimates have mean-squared errors \epsilon_g and \epsilon_p respectively. Then, under mild regularity assumptions, the DML estimate \hat{\theta}_{DML} satisfies

\left|\hat{\theta}_{DML}-\theta_0\right| \leq C\left(\epsilon_g \epsilon_p+n^{-1 / 2}\right)

with high probability. Intuitively, because the estimation error of \hat{\theta}_{DML} stems from the misspecification of the nuisance functions in the moment equation, by Taylor’s formula it contains the term \epsilon_g^{\alpha}\epsilon_p^{\beta} if and only if \mathbb{E}\left[\partial_g^{\alpha}\partial_p^{\beta} m\left(Z, \theta_0, h_0(X)\right)\right]\neq 0. By calculating the functional derivatives, it is then easy to check that \epsilon_g\epsilon_p is the dominating term. In particular, Neyman orthogonality implies that all first-order error terms vanish.
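To see one of these cancellations explicitly, perturb g(1,\cdot) in a direction \delta and differentiate the orthogonal moment above:

\mathbb{E}\left[\partial_{g(1,\cdot)} m\left(Z, \theta_0, h_0(X)\right)\,\delta(X)\right]=\mathbb{E}\left[\left(1-\frac{D}{p_0(X)}\right) \delta(X)\right]=\mathbb{E}\left[\left(1-\frac{\mathbb{E}[D \mid X]}{p_0(X)}\right) \delta(X)\right]=0,

since \mathbb{E}[D\mid X]=p_0(X). The derivatives with respect to g(0,\cdot) and p vanish by analogous calculations, while the mixed second derivative in (g,p) does not, which is exactly the source of the \epsilon_g\epsilon_p term.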

Importantly, this guarantee is structure-agnostic: this rate does not rely on any structural assumptions on the nuisance functions. What we need to assume is merely access to black-box ML estimates with some mean-squared error bounds. This is the reason why DML is widely adopted in practice: while there exist alternative estimators that can achieve improved error rates under structural assumptions on the non-parametric components, these assumptions can easily be violated, making these estimators cumbersome to deploy.

The problems that causal ML studies are not new. In the non-parametric estimation literature, there have been extensive results that focus on non-parametric efficiency and optimal rates for estimating causal quantities, under structural assumptions on the model such as smoothness of the non-parametric parts of the data generating process [RLM17,KBRW22]. However, the causal ML approach takes a more structure-agnostic view on the estimation of these nuisance quantities, and essentially solely assumes access to a good black-box oracle that provides us with relatively accurate estimates. This naturally gives rise to the structure-agnostic minimax optimality framework.

The structure-agnostic framework

We have seen that the key characteristic that differentiates the causal ML approach to estimation (e.g., the DML approach) from the traditional approaches is its structure-agnostic nature. In this section, we discuss the structure-agnostic framework that allows us to compare the performance of structure-agnostic estimators. This framework was originally proposed in [BKW23].

To keep things simple, we restrict ourselves to the same setting as the previous section. Now suppose we have nuisance estimates \hat{h}=(\hat{g},\hat{p}) with mean-squared errors \epsilon_g and \epsilon_p. The structure-agnostic minimax optimality framework asks the following question: if we make no further restriction on the data generating process, other than that we have access to nuisance estimates whose mean-squared errors are upper bounded by \epsilon_g and \epsilon_p respectively, then what is the best estimation rate achievable by any estimation method?

To formalize this, we define the uncertainty set as the set of all distributions that are consistent with the given nuisance estimates:

\mathcal{F}_{\epsilon_g, \epsilon_p} = \Big\{ (P_X, p, g) \;\Big|\; \|g(d, X) - \hat{g}(d, X)\|_{P_X, 2}^2 \leq \epsilon_g \text{ for } d \in \{0,1\}, \;\; \left\| p(X) - \hat{p}(X) \right\|_{P_X, 2}^2 \leq \epsilon_p, \;\; 0 \leq p(x),\, g(d, x) \leq 1 \;\; \forall x \in \mathcal{X},\, d \in \{0,1\} \Big\},

where P_X is the marginal distribution of X. Here we restrict ourselves to the case where D and Y are binary; this additional constraint only strengthens the minimax lower bounds presented in this blog. In this case, each tuple (P_X,p,g) uniquely determines a distribution over the observed data. For any set \mathcal{F}, we define the minimax 1-\gamma quantile risk for estimating the ATE by

\mathfrak{M}_{n, \gamma}^{A T E}(\mathcal{F})=\inf _{\hat{\theta}:(\mathcal{X} \times\mathcal{D} \times \mathcal{Y})^n \mapsto \mathbb{R}} \sup _{s=\left(P_X^*, p^*,g^*\right) \in \mathcal{F}} Q_{P_s,1-\gamma}\left(\left|\hat{\theta}-\theta_s^{\mathrm{ATE}}\right|\right),

where P_s and \theta_s^{\mathrm{ATE}} are the data distribution and the ATE induced by s respectively, and Q_{P,1-\gamma}(\cdot) is the quantile function under data distribution P. Clearly, our previous discussion of DML implies that the worst-case risk is at most of order \epsilon_g\epsilon_p+n^{-1/2}. This framework precisely captures the main idea behind the causal ML estimators that we described in the previous section.

Main results

In this section, we introduce our main results on structure-agnostic lower bounds [JS24]. Prior to our work, the only known structure-agnostic lower bounds were established in [BKW23]. In their paper, it is shown that DML is optimal for estimating a set of functionals of interest, which relate to the ATE but do not include the ATE functional.

Our first result establishes the optimality of DML for estimating the ATE, i.e., that the doubly robust estimator with sample splitting achieves the statistically optimal rate. As discussed in the previous section, the DML estimator for the ATE is given by

\hat{\theta}^{\mathrm{ATE}}=\frac{1}{n} \sum_{i=1}^n\left[\hat{g}\left(1,X_i\right)-\hat{g}\left(0,X_i\right)+\frac{D_i-\hat{p}\left(X_i\right)}{\hat{p}\left(X_i\right)\left(1-\hat{p}\left(X_i\right)\right)}\left(Y_i-\hat{g}\left(D_i,X_i\right)\right)\right]

and has the structure-agnostic rate of \epsilon_g\epsilon_p+n^{-1/2}. We now establish a matching lower bound.

Theorem 1. Let \mathrm{supp}(X)=[0,1]^K and let \tilde{\mathcal{F}}_{\epsilon_g, \epsilon_p} contain all distributions in \mathcal{F}_{\epsilon_g, \epsilon_p} whose marginal distribution of X is uniform. For any constant 1/2<\gamma<1, if our nuisance estimates (\hat{g},\hat{p}) take values in [c,1-c], where c\in(0,1/2) is a constant, then

\mathfrak{M}_{n, \gamma}^{A T E}\left(\tilde{\mathcal{F}}_{\epsilon_g,\epsilon_p}\right)=\Omega\left(\epsilon_g\epsilon_p+n^{-1 / 2}\right) .

Interestingly, knowing the marginal distribution of X would not change the statistical limit.

We also consider another important causal parameter, the average treatment effect on the treated (ATT), defined by \theta^{ATT}=\mathbb{E}\left[Y(1)-Y(0) \mid D=1\right]. Under conditional ignorability, it can be written as

\theta^{ATT}=\mathbb{E}\left[Y-g_0(0, X) \mid D=1\right].

The DML estimate of \theta^{ATT} is 

\hat{\theta}_{DML}=\left(\sum_{i=1}^n D_i\right)^{-1}\sum_{i=1}^n\left[D_i\left(Y_i-\hat{g}\left(0,X_i\right)\right)-\frac{\hat{p}\left(X_i\right)}{1-\hat{p}\left(X_i\right)}\left(1-D_i\right)\left(Y_i-\hat{g}\left(0, X_i\right)\right)\right]

and can be shown to achieve the same \epsilon_g\epsilon_p+n^{-1/2} rate as for the ATE. We also show that this rate is unimprovable:

Theorem 2. In the same setting as Theorem 1, we have

\mathfrak{M}_{n, \gamma}^{A T T}\left(\tilde{\mathcal{F}}_{\epsilon_g,\epsilon_p}\right)=\Omega\left(\epsilon_g\epsilon_p+n^{-1 / 2}\right).

Finally, we can also extend Theorem 1 to the weighted ATE (WATE) defined as

\theta^{WATE}=\mathbb{E}_{P_0}[w(X)(Y(1)-Y(0))],

which arises in policy evaluation [AW21]. Here w(x) is a uniformly bounded weight function but is not required to be non-negative. The following theorem addresses the minimax structure-agnostic rate for estimating WATE:

Theorem 3. In the same setting as Theorem 1, we have

\mathfrak{M}_{n, \gamma}^{W A T E}\left(\tilde{\mathcal{F}}_{\epsilon_g,\epsilon_p}\right)=\Omega\left(\|w\|_{L^2\left(P_X\right)} \epsilon_g \epsilon_p+\|w\|_{L^{\infty}\left(P_X\right)} n^{-1 / 2}\right),

where P_X is the uniform distribution over \mathrm{supp}(X)=[0,1]^K. Moreover, this rate is achieved by the DML estimator

\hat{\theta}^{\mathrm{WATE}}=\frac{1}{n} \sum_{i=1}^n w\left(X_i\right)\left[\hat{g}\left(1, X_i\right)-\hat{g}\left(0,X_i\right)+\frac{D_i-\hat{p}\left(X_i\right)}{\hat{p}\left(X_i\right)\left(1-\hat{p}\left(X_i\right)\right)}\left(Y_i-\hat{g}\left(D_i, X_i\right)\right)\right] .

Conclusion and discussion

In this blogpost, we introduced the setting and main results of our recent paper [JS24], which establishes the optimality of the celebrated DML algorithm, and in particular of the doubly robust estimator with sample splitting, in a structure-agnostic framework for two important causal parameters: the ATE and the ATT, as well as the weighted version of the former. For practitioners, the main takeaway is that if no particular structural insights are available, then it might be better to use DML rather than more refined estimators that leverage potentially brittle assumptions on the non-parametric components of the data generating process.

[AW21] Susan Athey and Stefan Wager. Policy learning with observational data. Econometrica, 89(1):133–161, 2021.

[BKW23] Sivaraman Balakrishnan, Edward H. Kennedy, and Larry Wasserman. The fundamental limits of structure-agnostic functional estimation. arXiv preprint arXiv:2305.04116, 2023.

[CCD+17] Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, and Whitney Newey. Double/debiased/Neyman machine learning of treatment effects. American Economic Review, 107(5):261–265, 2017.

[JS24] Jikai Jin and Vasilis Syrgkanis. Structure-agnostic optimality of doubly robust learning for treatment effect estimation. arXiv preprint arXiv:2402.14264, 2024.

[KBRW22] Edward H. Kennedy, Sivaraman Balakrishnan, James M. Robins, and Larry Wasserman. Minimax rates for heterogeneous causal effect estimation. The Annals of Statistics, 52(2):793–816, 2024.

[RLM17] James M. Robins, Lingling Li, and Rajarshi Mukherjee. Minimax estimation of a functional on a structured high-dimensional model. The Annals of Statistics, 45(5):1951–1987, 2017.

[RRZ94] James M. Robins, Andrea Rotnitzky, and Lue Ping Zhao. Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89(427):846–866, 1994.

[LR11] Mark J. van der Laan and Sherri Rose. Targeted Learning: Causal Inference for Observational and Experimental Data. Springer, New York, 2011.

[AP09] Joshua D. Angrist and Jörn-Steffen Pischke. Mostly Harmless Econometrics: An Empiricist’s Companion. Princeton University Press, 2009.

[CHK+24] Victor Chernozhukov, Christian Hansen, Nathan Kallus, Martin Spindler, and Vasilis Syrgkanis. Applied Causal Inference Powered by ML and AI. 2024.
