# Samples of Thoughts

## about data, statistics and everything in between

These are code snippets and notes for the fifth chapter, The Many Variables & The Spurious Waffles, section 2, of the book Statistical Rethinking (version 2) by Richard McElreath.

## Hidden Influence in Milk

In the previous section about spurious associations we used multiple regression to eliminate variables that seemed to have an influence when comparing bivariate relationships but whose association vanishes when introducing more variables to the regression. Now the opposite can also happen: there might be no bivariate association between variables because two variables mask each other. We will explore this using the milk data set to compare the caloric content of milk in primates with their brain size. We will see later that the variable of the body mass also plays a role.

library(rethinking)
data(milk)
d <- milk
str(d)
'data.frame':   29 obs. of  8 variables:
$clade : Factor w/ 4 levels "Ape","New World Monkey",..: 4 4 4 4 4 2 2 2 2 2 ...$ species       : Factor w/ 29 levels "A palliata","Alouatta seniculus",..: 11 8 9 10 16 2 1 6 28 27 ...
$kcal.per.g : num 0.49 0.51 0.46 0.48 0.6 0.47 0.56 0.89 0.91 0.92 ...$ perc.fat      : num  16.6 19.3 14.1 14.9 27.3 ...
$perc.protein : num 15.4 16.9 16.9 13.2 19.5 ...$ perc.lactose  : num  68 63.8 69 71.9 53.2 ...
$mass : num 1.95 2.09 2.51 1.62 2.19 5.25 5.37 2.51 0.71 0.68 ...$ neocortex.perc: num  55.2 NA NA NA NA ...

Let’s first standardize the necessary variables:

d$K <- standardize( d$kcal.per.g )
d$N <- standardize( d$neocortex.perc )
d$M <- standardize( log(d$mass ) )

We first try a bivariate regression: \begin{align*} K_i &\sim \text{Normal}(\mu_i, \sigma) \\ \mu_i &= \alpha + \beta_N N_i \end{align*}

We first run this as a quap() model with vague priors. However, there are some missing values in the data and we need to take care of them first (otherwise the model will throw an error)

dcc <- d[ complete.cases(d$K, d$N, d$M), ] So now the model with the new data frame without the missing values: m5.5_draft <- quap( alist( K ~ dnorm( mu, sigma ), mu <- a + bN*N, a ~ dnorm(0, 1), bN ~ dnorm(0, 1), sigma ~ dexp( 1 ) ), data=dcc ) Let’s do a quick check first how reasonable these priors are. While these models are still quite simple so these probably won’t hurt but once we do more complex models, good priors become more important. prior <- extract.prior( m5.5_draft ) xseq <- c(-2, 2) mu <- link( m5.5_draft, post=prior, data=list(N=xseq)) plot(NULL, xlim=xseq, ylim=xseq) for (i in 1:50) lines( xseq, mu[i,], col=col.alpha("black", 0.3)) The prior on the left is rather crazy: for the average value of $K$, we can end up with very extreme values of $N$. It would be better to tighten $\alpha$ so that it’s closer to the average value of the outcome variable (i.e. 0). Also, the slope can be extremely steep suggesting very unrealistic associations between the two variables, better to tigthen $\beta$ as well. Better priors (more realistic priors) would be the one in the right plot, coming from this model: m5.5 <- quap( alist( K ~ dnorm( mu, sigma), mu <- a + bN*N, a ~ dnorm(0, 0.2), bN ~ dnorm(0, 0.5), sigma ~ dexp( 1 ) ), data=dcc ) precis(m5.5) mean sd 5.5% 94.5% a 0.040 0.154 -0.207 0.287 bN 0.133 0.224 -0.224 0.491 sigma 1.000 0.165 0.737 1.263 The estimate for $\beta$ is not very strong and its uncertainty interval includes 0. Let’s look at this in a bit more visual. The parameter coefficient estimates: plot( precis( m5.5)) And the model together with the posterior mean line and interval: xseq <- seq( from=min(dcc$N) - 0.15, to=max(dcc\$N) + 0.15, length.out = 30)
mu_mean <- apply(mu, 2, mean)
mu_PI <- apply(mu, 2, PI)
plot( K ~ N, data=dcc,
xlab="neocortex percent (std)", ylab="kilocal per g (std)")
lines( xseq, mu_mean, lwd=2 )
shade( mu_PI, xseq) The relationship is very weak and the uncertainty interval also includes lines with no slope or even slight negative slope.

Let’s consider a model with the (log) body mass instead:

m5.6 <- quap(
alist(
K ~ dnorm( mu, sigma) ,
mu <- a + bM*M,
a ~ dnorm(0, 0.2),
bM ~ dnorm(0, 0.5),
sigma ~ dexp(1)
), data=dcc
)
precis(m5.6)
mean sd 5.5% 94.5%
a 0.047 0.151 -0.195 0.288
bM -0.283 0.193 -0.591 0.026
sigma 0.949 0.157 0.698 1.200

And again the model together with the posterior mean line and interval. For easier comparison, I also add the posterior plot from the neocortex model. The association seems stronger than for the neocortex percent but it is still highly uncertain and includes many possible weaker (as well as stronger) relationships.

Now let’s see what happens when we use both variables in a regression:

\begin{align*} K_i &\sim \text{Normal}(\mu_i, \sigma) \\ \mu_i &= \alpha + \beta_N N_i + \beta_K K_i \\ \alpha &\sim \text{Normal}(0, 0.2) \\ \beta_N &\sim \text{Normal}(0, 0.5) \\ \beta_M &\sim \text{Normal}(0, 0.5) \\ \sigma &\sim \text{Exponential}(1) \end{align*} And the code for the model:

m5.7 <- quap(
alist(
K ~ dnorm( mu, sigma),
mu <- a + bN*N + bM*M,
a ~ dnorm(0, 0.2),
bN ~ dnorm(0, 0.5),
bM ~ dnorm(0, 0.5),
sigma ~ dexp( 1 )
), data=dcc
)
precis(m5.7)
mean sd 5.5% 94.5%
a 0.068 0.134 -0.146 0.282
bN 0.675 0.248 0.278 1.072
bM -0.703 0.221 -1.056 -0.350
sigma 0.738 0.132 0.526 0.950

For both variables, the observed effect increased:

plot( coeftab( m5.5, m5.6, m5.7 ), pars=c("bM", "bN") ) In the upper part are the coefficients for the body mass and in the lower part the coefficients for the neocortex.

## The Causal Reasoning behind it

So what does this mean?

What happened is that both variables are correlated with the outcome but one is positively correlated (neocortex percent) and one is negatively correlated (body mass). On top of that, they’re both positively correlated with each other. In this specific situation, bigger species such as apes (high body mass) have milks will less energy. But species with more neocortex (“smarter ones”) tend to have richer milk. These correlations make it hard to see what’s really happening.

pairs( ~ K + M + N, dcc) While the association between $K$ and $M$ and $K$ and $N$ are not as clear in the plot, the positive correlation between $M$ and $N$ is quite strong.

Let’s get some DAGs going. There are at least three DAGs consistent with these data: 1. The body mass influences the neocortex and both influence the milk energy content.
2. The neocortex influences the body mass and both influence the milk energy content.
3. An unobserved variable $U$ influences both the body mass and the neocortex which both influence the milk energy content.

How do we know which one is the correct one? The (not so satisfying) answer is we can’t know, at least not from the data alone. All three graphs imply the same set of conditional independencies (there are none). To pick one, we need to make use of our scientific knowledge.

Let’s make some more counterfactual plots. ### Markov Equivalence

A set of DAGs is known as a Markov Equivalence set if they all have the same implied conditional independencies. We can use {{daggity}} to compute the Markov Equivalence set of a specific DAG.

dag5.7 <- dagitty( "dag{
M -> K <- N
M -> N }")
coordinates(dag5.7) <- list(x=c(M=0,K=1,N=2), y=c(M=0.5, K=1,N=0.5))
MElist <- equivalentDAGs(dag5.7)

The object MElist then contains all DAGs with the same implied conditional independencies: Most of these DAGs are probably not reasonable from a scientific standpoint. That’s where domain knowledge is needed to distinguish the reasonable DAG from the silly one.

Sometimes it helps to better understand the DAGs and their implied associations by using some simulations.

Let’s simulate data for the first DAG: n <- 100
M <- rnorm( n )
N <- rnorm( n, M )
K <- rnorm( n, N - M)
d_sim <- data.frame(K=K, N=N, M=M)
pairs(d_sim) The second DAG: n <- 100
N <- rnorm( n )
M <- rnorm( n, N )
K <- rnorm( n, N - M)
d_sim2 <- data.frame(K=K, N=N, M=N)
pairs(d_sim2) And the third DAG: n <- 100
U <- rnorm ( n )
N <- rnorm( n, U )
M <- rnorm( n, U )
K <- rnorm( n, N - M)
d_sim2 <- data.frame(K=K, N=N, M=N)
pairs(d_sim2) Full code.