Most pandemic research has focused on questions related to prediction such as: how many COVID cases do we expect to see next week in Allegheny County? Causal inference, on the other hand, is concerned with “what if” questions. For example: what would happen if everyone wore masks? How many cases would there be if there was a lockdown? How many deaths would occur if 40 percent of the population is vaccinated? These questions involve hypothetical interventions on a population.

Estimating causal effects is tricky. We all know that “correlation isn’t causation.” For example, mask usage and cases could both increase over time. But that doesn’t mean that masks cause more cases to occur. Activation of “Buckle your seatbelt” signs on airplanes is correlated with turbulence. But activating the sign does not cause turbulence.

So how do we estimate causal effects? There is a collection of methods for this task. The important thing is that we can’t just use standard prediction methods. We need to use specialized methods that were designed for causal inference. In this post we will discuss our work on the causal effect of social mobility on deaths from COVID. The social mobility variable we will use is the proportion of people staying home. We can think of this of anti-mobility, and expect that higher values of this variable will lead to fewer deaths from COVID.

Let’s start with a brief introduction to causal inference.

Causal inference
is a huge, complex topic.
We give a very brief
exposition of some key ideas here.
To learn more,
we recommend the book
by Hernan and Robins and
the paper
“Statistics and Causal Inference”
by Paul Holland
(*Journal of the American Statistical Association* 1986, pp. 945-960).
And one can find many tutorials
on the web.

Suppose we have three variables \(X\), \(A\) and \(Y\), where \(Y\) is the outcome of interest (such as deaths from COVID), \(A\) is a treatment or exposure (such as mobility level), and \(X\) are confounding variables, which are variables that affect both \(A\) and \(Y\) (such as age). The joint density is

\[ p(x,a,y) = p(x)p(a|x)p(y|x,a). \]

We want to know: what would be the mean of \(Y\) if we could set \(A\) to be equal to some value \(a\)? For example, what would the number of COVID cases be if 92 percent of people wore masks? We let \(Y^a\) denote this hypothetical outcome, which is called a counterfactual or potential outcome. And we denote the mean of \(Y^a\) by \(\mathbb{E}[Y^a]\).

If we have measured all the confounding variables then the answer (ignoring a few technical details) is given by Robins’ \(g\)-formula:

\[ \mathbb{E}[Y^a] = \int \mu(x,a) p(x) dx = \mathbb{E}[\mu(X,a)] \]

where
\(\mu(x,a) = \mathbb{E}[Y|X=x,A=a]\)
is the mean of \(Y\) given \(X\) and \(A\).
This is not equal to
the mean of \(Y\) given \(A=a\), which is
\(\mathbb{E}[Y|A=a] = \int \mu(x,a) p(x|a) dx\).
This is the difference between causation
and prediction (correlation).
You can think of
the \(g\)-formula as the mean of \(Y\)
if we replace \(p(a|x)\)
in the joint density
with a point mass at \(a\).
Using the \(g\)-formula is one example of *adjusting for
confounding*.

Now suppose we observe data \((X_1,A_1,Y_1),\dots,(X_n,A_n,Y_n)\). How do we estimate \(\mathbb{E}[Y^a]\)? The simplest approach is to get an estimate \(\hat\mu(x,a)\) of \(\mu(x,a)\) (this is just regression) and replace the integral in the \(g\)-formula with a sample average to get the estimate

\[ \hat{\mathbb{E}(Y^a)} = \frac{1}{n}\sum_i \hat\mu(X_i,a). \]

This is called the plug-in estimator. (Note that for prediction we would not use this formula. We would just use \(\hat \mu(X, A)\).) There are often better estimators, but we won’t get into that here. The important thing is: there is a formula for the causal effect and we can estimate it.

The first plot below shows an example where we would predict
higher values of \(Y\) when \(A\) is large. For pure prediction, this is
the correct conclusion. The second plot shows that once we
account for \(X = \text{age}\) (corresponding to different colors) there is
a negative relationship between \(Y\) and \(A\): for a given person with a fixed
age, increasing \(A\) would *decrease* \(Y\). In this case, age is a
confounder and the \(g\)-formula would correctly recover the negative
relationship. For causal inference, this is the correct conclusion.

Things get trickier when there are time varying variables. Consider weekly mobility and death data \((A_1,Y_1),\dots, (A_T,Y_T)\) in one state. For simplicity, we’ll assume that there are no \(X\) variables. But we’ll see that at time \(t\), the variables \(Y_1, \dots, Y_{t-1}\) are confounding variables for the causal effect of mobility on \(Y_t\). Define \(\overline{A}_t = (A_1,\dots, A_t)\) and \(\overline{Y}_t = (Y_1,\dots, Y_t)\) for \(t\geq 1\). The density of \((\overline{y}_t,\overline{a}_t)\) can be written as

\[ p(\overline{y}_t,\overline{a}_t) = \prod_{s=1}^t p(y_s|\overline{y}_{s-1},\overline{a}_{s}) p(a_s|\overline{a}_{s-1},\overline{y}_{s-1}). \]

Now consider the causal question:

What would \(Y_t\) be if we set \(\overline{A}_t\) equal to some value \(\overline{a}_t =(a_1,\dots, a_t)\)?

Let \(Y^{\overline{a}_t}_t\) denote this counterfactual quantity. For example, if \(A_t\) is mobility, measuring, say, the percentage of people who stay home, and \(\overline{a}_t = (12,12,12,6,6,6)\) then \(Y^{\overline{a}_t}_t\) is the number of deaths we would observe if we set mobility to 12 for three weeks then set it to 6 for three weeks.

How do we find the mean \(\mathbb{E}[Y^{\overline{a}_t}_t]\)? As a general rule, we adjust for pre-treatment variables but not for post-treatment variables. But in the time varying case, this rule makes no sense. Each \(Y_t\) is both a pre-treatment variable (it occurs before \(A_{t+1}\)) and a post-treatment variable (it occurs after \(A_{t-1}\)). The correct way to get the mean of \(\mathbb{E}[Y^{\overline{a}_t}]\) is to use the more general form of the \(g\)-formula:

\[ \psi(\overline{a}_t)\equiv \mathbb{E}[Y^{\overline{a}_t}_t] = \int\cdots\int \mathbb{E} \left[Y_t | \overline{A}_t = \overline{a}_t, \overline{Y}_{t-1}=\overline{y}_{t-1}\right] \prod_{s=1}^{t-1} p(y_s|\overline{y}_{s-1},\overline{a}_{s}) d y_s. \]

You can ignore that equation if you like; suffice it to say that there is indeed a formula for the quantity we are interested in.

We note, by the way, that some authors denote \(\mathbb{E}[Y^{\overline{a}_t}_t]\) by \(\mathbb{E}[Y_t| {\rm do}(\overline{a}_t)]\). They mean the same thing.

Having a formula for \(\psi(\overline{a}_t)\) is great, but how do we estimate it? Again we could plug-in estimates of all the quantities in the \(g\)-formula. But there are better things we can do, as we now explain.

The approach we use is
to make use of
something called
a *marginal structural model* (MSM).

Usually when we construct a statistical model, we are trying to come up with a way to express the distribution of the data, in this case, \((A_1,Y_1),\dots, (A_T,Y_T)\). In effect, we want a formula that explains how we would generate a dataset like the one we have. But to do so in a tractable way usually involves making a bunch of assumptions about the model. Instead, with an MSM, we specify a plausible model for the form of the function \(\mathbb{E}[Y^{\overline{a}_t}]\) but we otherwise leave the model for the data unspecified. For example, we might assume that \(\mathbb{E}[Y^{\overline{a}_t}]\approx \beta_0 + \beta \sum_s a_s\). In a sense, we are spending our modeling assumptions wisely: we want the effect of the treatment to be smooth and fairly simple. But we otherwise don’t want to impose assumptions.

The model we use is

\[ \mathbb{E}[L_t^{\overline{a}_t}]= s(t) + \beta M_t \]

where \(L_t = \log (1+Y_t)\) is log-deaths, \(s(t)\) is an arbitrary, smooth, increasing function and

\[ M_t = \sum_{s=1}^{t-\delta}(a_s-a_1) \]

is cumulative mobility (compared to baseline mobility). Based on published studies, we take \(\delta = 4\) weeks. (We switch from deaths to log-deaths mainly for certain numerical reasons related to the software we are using.) Here, \(s(t)\) represents the evolution of the pandemic if there were no interventions and no reduction in susceptible individuals. The term \(\beta M_t\) captures the effect of mobility. With no change in mobility, deaths increase exponentially as \(e^{s(t)}\).

The intuition for using cumulative mobility comes from ideas from epidemic modeling. Roughly, if \(I_t\) denotes new infections in week \(t\), we might assume that \(I_t \approx e^{\nu(t) + \beta A_t} I_{t-1}\) for some function \(\nu(t)\). This implies that \(I_{t-1} \approx e^{\nu(t-1) + \beta A_{t-1}} I_{t-2}\) and so on. If we continue back to \(t=1\) we find that \(I_t \approx e^{s(t) + \beta \sum_s A_s} I_1\) where \(s(t)=\sum_{s=1}^t \nu(s)\). If we further assume that some fraction of people infected a time \(t\) will die in \(\delta\) weeks, then we can derive the above model. (More complex models are discussed in the paper.)

To re-cap: we are leaving the distribution \(P\) of the data \((A_1,Y_1),\dots, (A_T,Y_T)\) unspecified, except that we require: if you apply the \(g\)-formula to \(P\) you must get \(s(t) + \beta M_t\). In other words, the statistical model is the set of all densities \(p(\overline{\ell}_t,\overline{a}_t)\) that satisfy the condition

\[ \int\cdots\int \mathbb{E}[L_t | \overline{A}_t = \overline{a}_t, \overline{L}_{t-1}=\overline{\ell}_{t-1}] \prod_{s=1}^{t-1} p(\ell_s|\overline{\ell}_{s-1},\overline{a}_{s}) d \ell_s = s(t) + \beta M_t. \]

This is an example of a semiparametric model: a model that has some sort of constraint but is otherwise nonparametric.

Estimating the MSM is a bit involved. A key step is estimating the following function

\[ W_t = \frac{p(a_t|\overline{a}_{t-1})}{p(a_t| \overline{y}_{t-1},\overline{a}_{t-1},\overline{x}_{t-1})} \]

where \(\overline{x}_{t-1}\) represents potential confounding variables. The model is fit with these weights and the weights are crucial because that is how we adjust for confounding in the MSM. We won’t go into detail here. Instead, let’s look at the data and the estimates.

The data we use are weekly deaths and weekly averaged mobility starting February 15 2019. The mobility signal we’ll consider here is the proportion of people of who stay home, which is kindly provided to Delphi by SafeGraph. (We publish this data in our COVIDcast Epidata API, so it is available to anyone interested in studying mobility during the pandemic.) We use data at the state level. Note that this is actually a sort of anti-mobility measure: the higher it is, the less mobility there is. Here are plots of weekly deaths and mobility starting February 15, 2020, for a few states:

```
library(covidcast)
library(directlabels)
library(dplyr)
library(lubridate)
library(ggplot2)
# Top 5 most populated states according to
# https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States_by_population
states_of_interest <- c("ca", "tx", "fl", "ny", "pa")
states_deaths <- covidcast_signal(
"jhu-csse", "deaths_incidence_num",
start_day = "2020-02-15", end_day = "2020-12-16",
geo_type = "state", geo_values = states_of_interest
)
states_mobility <- covidcast_signal(
"safegraph", "completely_home_prop",
start_day = "2020-02-15", end_day = "2020-12-16",
geo_type = "state", geo_values = states_of_interest
)
dat <- full_join(x = select(states_deaths, time_value, geo_value, value),
y = select(states_mobility, time_value, geo_value, value),
by = c("time_value" = "time_value", "geo_value" = "geo_value"))
dat <- dat %>%
rename(date = time_value,
state = geo_value,
stay_home = value.y,
deaths = value.x) %>%
group_by(state) %>%
# epi week is from saturday to sunday
mutate(weekday = wday(date, label = TRUE, week_start = 6),
weeknum = wday(date, week_start = 6),
week = cumsum(weeknum == 1))
dat_week <- dat %>%
group_by(state, week) %>%
mutate(weekly_deaths = sum(deaths),
weekly_deaths = 1 + weekly_deaths, # avoid 0 in the log scale
weekly_stay_home = mean(stay_home)) %>%
select(week, state, weekly_stay_home, weekly_deaths, weekly_deaths) %>%
unique()
ggplot(dat_week, aes(x = week, y = weekly_deaths, color = state)) +
geom_line() +
scale_y_log10() +
geom_dl(aes(label = toupper(state)), method = "last.bumpup") +
labs(x = "Week", y = "# deaths (log scale)",
title = "Reported deaths over time",
subtitle = "From JHU CSSE COVID-19 data",
caption = "Data from Delphi COVIDcast, delphi.cmu.edu") +
theme_bw() +
guides(color = FALSE)
```

```
ggplot(dat_week, aes(x = week, y = weekly_stay_home, color = state)) +
geom_line() +
geom_dl(aes(label = toupper(state)), method = "last.bumpup") +
labs(x = "Week", y = "Proportion staying at home",
title = "(Anti)mobility over time",
subtitle = "From SafeGraph",
caption = "Data from Delphi COVIDcast, delphi.cmu.edu") +
theme_bw() +
guides(color = FALSE)
```

After fitting the MSM, we can estimate various quantities. Let’s consider the effect of changing the observed stay-at-home series \((A_1,\dots, A_T)\) to the hypothetical value \(\overline{a}_T = (A_1+\gamma,\dots, A_T+\gamma)\). In other words, what would have happened if we had increased stay-at-home by \(\gamma\)? Below is the estimated difference in deaths (with 90 percent pointwise confidence bands) for a few states: Arizona, California, Florida and Georgia. The plot shows the mean of

\[ Y_t^\gamma - Y_t \]

where \(Y_t\) is the number of deaths and \(Y_t^\gamma\) is the (counterfactual) number of deaths we would have observed of mobility had been increases by \(\gamma\) at all times, where we take \(\gamma = 10\) (a ten percent increase in stay-at-home). The fact that the values are negative indicates that increasing stay-at-home decreases deaths, as we would expect. (We remind the reader that these results are preliminary.)

One important question in every causal analysis is: what if there is unobserved confounding? In the current problem, a confounder is a variable that affects both mobility and death from COVID. We try to include as many plausible observed confounding variables as possible. So far we use all past values of death and mobility as confounding variables. But of course, it is always possible that there are unobserved confounders. Currently, we are conducting a sensitivity analysis to assess how robust the estimates are to unobserved confounders.

In this post we have only addressed mobility. As soon as the mobility analysis is complete we will turn our attention to estimating the causal effect of masks on COVID cases.

**Acknowledgements: **
We’d like to thank the Delphi engineering team for making this data available.
And we’d like to thank Roni Rosenfeld and Ryan Tibshirani for posing the
challenge of making counterfactual predictions. We thank Rob Tibshirani for
suggesting several improvements on this post, and Alex Reinhart for getting
our post into the appropriate format.

Matteo Bonvini
is a graduate student in the Department of Statistics & Data Science at CMU.

Edward Kennedy
is an Assistant Professor in the Department of Statistics & Data Science at CMU.

Valérie Ventura
is a Professor in the Department of Statistics & Data Science at CMU and a member of Delphi.

Larry Wasserman
is a Professor in the Department of Statistics & Data Science at CMU as well as in the Machine Learning Department and is a member of Delphi.

© 2021 Delphi group authors. Text and figures released under
CC BY 4.0
; code under the MIT license.