Samples of Thoughts

about data, statistics and everything in between

First Linear Predictions

These are code snippets and notes for the fourth chapter, Geocentric Models, , sections 4, of the book Statistical Rethinking (version 2) by Richard McElreath.

In this section, we work with our first prediction model where we use the weight to predict the height of a person. We again use the !Kung data and restrict to adults above 18.

d <- Howell1
d2 <- d[ d$age >= 18, ]
plot(height ~ weight, data=d2)

It looks like there is a nice, clear linear relationship between the weight and height of a person.

We use the following model to capture this relationship: \[\begin{align*} h_i &\sim \text{Normal}(\mu_i, \sigma)\\ \mu_i &= \alpha + \beta(x_i - \bar{x}) \\ \alpha &\sim \text{Normal}(178, 20) \\ \beta &\sim \text{Normal}(0, 10) \\ \sigma &\sim \text{Uniform}(0, 50) \end{align*}\]

Prior Predictive Checks

Before we start modelling this with R, let’s have a look at the priors and their implications for the model. For this, we’ll sample from the prior for \(\alpha\) and \(\beta\) and plot the resulting lines.

N <- 100
a <- rnorm( N, 178, 20 )
b <- rnorm( N, 0, 10 )

Let’s plot the lines resulting from these values:

This looks like a rather unreasonable model for height. There are many lines with a negative slope, meaning the heavier you are, the smaller you are. But even the lines with a positive slope have extremely steep slopes that seem rather unrealistic. Many lines go below the dashed line of zero height and above the line at 272cm representing the height of the tallest person.

Since we already know that the relationship between weight and height is positive, we can model the slope \(\beta\) with a strictly positive distribution such as the log-normal:

b <- rlnorm( 1e4, 0 , 1 )
dens(b, xlim=c(0,5), adj=0.1)

Let’s plot the prior lines again, this time using the log-normal prior for \(\beta\):

The log-normal prior for \(\beta\) seems a much more reasonable choice for this model. Our new model then looks like this: \[\begin{align*} h_i &\sim \text{Normal}(\mu_i, \sigma)\\ \mu_i &= \alpha + \beta(x_i - \bar{x}) \\ \alpha &\sim \text{Normal}(178, 20) \\ \beta &\sim \text{Log-Normal}(0, 10) \\ \sigma &\sim \text{Uniform}(0, 50) \end{align*}\]

Running the Model in R

Now let’s translate the model to R, using quap() from the {{rethinking}} package:

m4.3 <- quap(
          height ~ dnorm( mu, sigma),
          mu <- a + b * ( weight - xbar ) ,
          a ~ dnorm( 156, 100),
          b ~ dlnorm( 0, 1),
          sigma ~ dunif(0, 50)
        ) ,
        data = d2

The results:

precis( m4.3 )
mean sd 5.5% 94.5%
a 154.60 0.27 154.17 155.03
b 0.90 0.04 0.84 0.97
sigma 5.07 0.19 4.77 5.38

How do we interpret these values? So \(\alpha\) is the (average) predicted height when $ x - {x} = 0$ that is, when \(x\) is the mean height. So \(\alpha\) is close to what we had for \(\mu\) in our model without the linear part. \(\beta\) represents the increase in height if a person weighs one kg more.

We can check the covariances among the parameters:

vcov( m4.3)
a b sigma
a 0.07 0 0.00
b 0.00 0 0.00
sigma 0.00 0 0.04

Or as a pairs() plot:

There is very little covariation between the parameters. This actually has to do with the fact that we centralized our predictor variable height.

Visualize our Model

Let’s visualize the model to get a better understanding of its results. We start with plotting the raw data and the posterior mean of \(\alpha\) and \(\beta\) as a kind of “mean line” (or more officially, the MAP line):

This is a nice line but it does not visualize the uncertainty in the two parameters. One approach to visualize this uncertainty is to plot a few random lines sampled from the posterior, similarly to how we did for the prior predictive plots.

post <- extract.samples( m4.3 )
a b sigma
155 0.94 5.22
154 0.89 4.75
154 0.92 5.34
154 0.92 5.16

To be able to better see the difference in uncertainty, we refit the model again on a subset of the data. This way, we can observe how the uncertainty decreases when we add more data:

How do we compute compatibility intervals? For a single weight value, say 50 kilograms, proceed as follows:

post <- extract.samples(m4.3)
mu_at_50 <- post$a + post$b * ( 50 - xbar )
dens( mu_at_50, col=rangi2, lwd=2, xlab="mu | weight=50")

PI(mu_at_50, prob=0.89)
 5% 94% 
159 160 

Using the link() function, we can compute the distribution of \(\mu\) for each value of observed weight.

mu <- link(m4.3)
 num [1:1000, 1:352] 157 158 158 158 157 ...

Instead of computing the \(\mu\)-values for the observed weight values, we can provide new data, e.g. all weight values in the range of reasonable weight values:

weight.seq <- seq( from=25, to=70, by=1)

mu <- link(m4.3, data = data.frame( weight=weight.seq ) )
 num [1:1000, 1:46] 138 136 137 137 135 ...

We can plot all 1000 x 146 values for \(\mu\):

plot(height ~ weight, d2, type="n")

for (i in 1:100)
  points(weight.seq, mu[i,], pch=16, col=col.alpha(rangi2, 0.1))

Or we can summarize the values for each weight and compute the mean and a compatibility interval:

mu.mean <- apply(mu, 2, mean)
mu.PI <- apply(mu, 2, PI, prob=0.89)
# plot raw data
plot(height ~ weight, data=d2, col=col.alpha(rangi2, 0.5))

# plot the MAP line
lines(weight.seq, mu.mean)

# plot a shaded region for 89% PI
shade( mu.PI, weight.seq)

All the Uncertainty

So far, we’ve only worked with \(\mu\) and its uncertainty. If we want to generate new observations (i.e. predictions) we also have to account for the uncertainty in the standard deviation \(\sigma\). To simulate new heights, we pluck the values for \(\mu\) together with the values for \(\sigma\) into a normal distribution. Or simply use the function sim():

sim.height <- sim( m4.3, data=list(weight=weight.seq))
 num [1:1000, 1:46] 130 130 133 146 144 ...

For each weight, we can then get a compatibility interval of predicted heights:

height.PI <- apply(sim.height, 2, PI, prob=0.89)
5% 128 129 130 132 132
94% 144 146 147 148 149

And visualize both the uncertainty in \(\mu\) and the uncertainty in the prediction:

# plot everything together
plot( height ~ weight, d2, col=col.alpha(rangi2, 0.5))

# draw a MAP line
lines(weight.seq, mu.mean)

# draw PI region for line
shade(mu.PI, weight.seq)

# draw PI region for simulated heights
shade(height.PI, weight.seq )

Our model suggests that 89% of observed data points should be within these boundaries. We can also visualize different boundaries:

We could also simulate the heights manually. Remember, to manually compute \(\mu\) (instead of using link()), we use the following:

post <- extract.samples(m4.3)
weight.seq <- 25:70
mu <- sapply(weight.seq, function(weight) post$a + post$b * ( weight - xbar ) )

To generate height observations, we also compute \(\mu\) but pluck it straight into the normal distribution rnorm() to sample from it:

sim.height <- sapply( weight.seq, function(weight)
    mean=post$a + post$b * ( weight - xbar ),

Full code.


Read more posts by this author.