These are code snippets and notes for the fifth chapter, *The Many Variables & The Spurious Waffles*, section 2, of the book Statistical Rethinking (version 2) by Richard McElreath.

## The Causal Reasoning behind it

So what does this mean?

What happened is that both variables are correlated with the outcome but one is positively correlated (neocortex percent) and one is negatively correlated (body mass). On top of that, they’re both positively correlated with each other. In this specific situation, bigger species such as apes (high body mass) have milks will less energy. But species with more neocortex (“smarter ones”) tend to have richer milk. These correlations make it hard to see what’s really happening.

`pairs( ~ K + M + N, dcc)`

While the association between \(K\) and \(M\) and \(K\) and \(N\) are not as clear in the plot, the positive correlation between \(M\) and \(N\) is quite strong.

Let’s get some DAGs going. There are at least three DAGs consistent with these data:

- The body mass influences the neocortex and both influence the milk energy content.
- The neocortex influences the body mass and both influence the milk energy content.
- An unobserved variable \(U\) influences both the body mass and the neocortex which both influence the milk energy content.

How do we know which one is the correct one? The (not so satisfying) answer is we can’t know, at least not from the data alone. All three graphs imply the same set of conditional independencies (there are none). To pick one, we need to make use of our scientific knowledge.

Let’s make some more counterfactual plots.

### Markov Equivalence

A set of DAGs is known as a *Markov Equivalence* set if they all have the same implied conditional independencies.
We can use `{{daggity}}`

to compute the *Markov Equivalence* set of a specific DAG.

```
dag5.7 <- dagitty( "dag{
M -> K <- N
M -> N }")
coordinates(dag5.7) <- list(x=c(M=0,K=1,N=2), y=c(M=0.5, K=1,N=0.5))
MElist <- equivalentDAGs(dag5.7)
```

The object `MElist`

then contains all DAGs with the same implied conditional independencies:

Most of these DAGs are probably not reasonable from a scientific standpoint. That’s where domain knowledge is needed to distinguish the reasonable DAG from the silly one.

## Simulating a Masking Ball

Sometimes it helps to better understand the DAGs and their implied associations by using some simulations.

Let’s simulate data for the first DAG:

```
n <- 100
M <- rnorm( n )
N <- rnorm( n, M )
K <- rnorm( n, N - M)
d_sim <- data.frame(K=K, N=N, M=M)
pairs(d_sim)
```

The second DAG:

```
n <- 100
N <- rnorm( n )
M <- rnorm( n, N )
K <- rnorm( n, N - M)
d_sim2 <- data.frame(K=K, N=N, M=N)
pairs(d_sim2)
```

And the third DAG:

```
n <- 100
U <- rnorm ( n )
N <- rnorm( n, U )
M <- rnorm( n, U )
K <- rnorm( n, N - M)
d_sim2 <- data.frame(K=K, N=N, M=N)
pairs(d_sim2)
```