Samples of Thoughts

about data, statistics and everything in between

Why everything so normal

These are code snippets and notes for the fourth chapter, Geocentric Models, sections 1 to 3, of the book Statistical Rethinking (version 2) by Richard McElreath.

Why normal distributions are normal

The chapter discusses linear models and starts with a recap on the normal distributions. Why is it such a commonly used distribution and how does it arise?

  • Normal by addition Normalcy arises when we sum up random variables:
pos <- replicate( 1000, sum( runif(16, -1, 1)))
dens(pos, norm.comp = T)

  • Normal by multiplication In some cases, normalcy can also arise from multiplication. If the multiplication encapsulates some kind of growth with relatively small growth percentages then the effect is approximately the same as addition and thus also leads to a bell curve:
growth <- replicate(1000, prod(1 + runif(12, 0, 0.1)))
dens(growth, norm.comp = T)

This fails, if the growth factor is too large:

big <- replicate(1000, prod(1 + runif(12, 0, 0.5)))
dens(big, norm.comp = T)

The smaller the factor, the better the approximation:

small <- replicate(1000, prod(1 + runif(12, 0, 0.0001)))
dens(small, norm.comp = T)

  • Normal by log-multiplication Since the log of a product is the same as the sum of the log of each factor, log-multiplication also leads to a normal distribution:
log.big <- replicate(1000, log( prod( 1 + runif(12, 0, 0.05))))
dens(log.big, norm.comp = T)

There are broadly two reasons why we’re using Gaussian distributions in modelling: (1) ontological: we have reason to believe the process we’re modelling is actually following a Gaussian distribution, for the reasons laid out above. E.g. measurement errors follow a Gaussian since at the heart, the measurement errors arise through a process of added fluctuations. (2) epistemological: we don’t really know much about our process, except that it has a mean and a variance, so we use a normal distribution because it makes the least assumptions. (If we know more, we should probably use a different distribution!)

A Gaussian Model of Height

For the example, we’ll be using the !Kung data to estimate height.

A short look at the data:

d <- Howell1
str(d )
'data.frame':   544 obs. of  4 variables:
 $ height: num  152 140 137 157 145 ...
 $ weight: num  47.8 36.5 31.9 53 41.3 ...
 $ age   : num  63 63 65 41 51 35 32 27 19 54 ...
 $ male  : int  1 0 0 1 0 1 0 1 0 1 ...
mean sd 5.5% 94.5% histogram
height 138.26 27.6 81.11 165.7 ▁▁▁▁▁▁▁▂▁▇▇▅▁
weight 35.61 14.7 9.36 54.5 ▁▂▃▂▂▂▂▅▇▇▃▂▁
age 29.34 20.8 1.00 66.1 ▇▅▅▃▅▂▂▁▁
male 0.47 0.5 0.00 1.0 ▇▁▁▁▁▁▁▁▁▇

Since height strongly correlates with age before adulthood, we’ll only use adults for the following analysis:

d2 <- d[ d$age >= 18 ,]

The distribution of height then looks approximately Gaussian.

Our model looks as follows: \[\begin{align*} h_i &\sim \text{Normal}(\mu, \sigma)\\ \mu &\sim \text{Normal}(178, 20) \\ \sigma &\sim \text{Uniform}(0, 50) \end{align*}\]

To better understand the assumptions we’re making with our priors, let’s have a short look at them:

curve( dnorm( x, 178, 20 ), from=100, to=250) 

curve( dunif( x, 0, 50), from=-10, to=60) 

To better understand what these priors mean, we can do a prior predictive simulation. That is, we sample from our priors and pluck this into the likelihood to find out what the model thinks would be reasonable observations, just based on the priors, before having seen the data.

sample_mu <- rnorm(1e4, mean=178, sd=20)
sample_sigma <- runif(1e4, 0, 50)
prior_height <- rnorm(1e4, mean=sample_mu, sd=sample_sigma)

The model expects people to have a height between 100 and 250cm. Wide range, but seems reasonable. Compare this with the prior predictive we would get from flat priors (changing the standard deviation for the \(\mu\) prior to 100):

sample_mu <- rnorm(1e4, mean=178, sd=100)
prior_height <- rnorm(1e4, mean=sample_mu, sd=sample_sigma)

The flat prior thinks people could have negative height (to the left of the dashed line) or be extremely tall (the right line indicates the height of one of the tallest persons ever recorded). Not very sensible.

For educational purpose, let’s do a grid approximation of our model.

Start with the grid:

mu.list <- seq(from=150, to=160, length.out = 200)
sigma.list <- seq(from=7, to=9, length.out = 200)
post <- expand.grid(mu=mu.list, sigma=sigma.list)

For each value in the grid, compute the log-likelihood for the whole data:

post$LL <- sapply( 1:nrow(post), function(i) sum( dnorm( 

Since we’re working with the log-likelihood, we can sum everything up instead of multiplying. This is mostly to avoid rounding errors. So the same way, we can add the prior densities (again on log):

post$prod <- post$LL + dnorm( post$mu, 178, 20, TRUE) +   
  dunif( post$sigma, 0, 50, TRUE)

Finally, we return from log-scale to normal again but first scale it to avoid too small values and rounding errors.

post$prob <- exp( post$prod - max(post$prod))

Because of the scaling, the resulting values are actually not proper probabilities but relative probabilities.

We can visualize the distribution using a contour map:

contour_xyz( post$mu, post$sigma, post$prob)

Or a heatmap:

image_xyz( post$mu, post$sigma, post$prob)

The highest probability density for \(\mu\) is somewhere around 155 with highest probability density for \(\sigma\) around 8.

Instead of working with the grid posterior, we can use the grid to obtain a sample from our posterior:

sample.rows <- sample( 1:nrow(post), size=1e4, replace=TRUE,
                       prob=post$prob) <- post$mu[ sample.rows ]
sample.sigma <- post$sigma[ sample.rows]

And can plot this instead:

plot(, sample.sigma, cex=0.5, pch=16, col=col.alpha(rangi2, 0.1))

In the plot, we can still see the artifacts of the grid we used to approximate the posterior. One could either pick a finer grid or add a tiny bit of jitter to the samples to get rid of these artifacts.

We can also look at the marginal posterior:

dens( )

As sample size increases, posterior densities approach the normal. Indeed, the posterior for \(\mu\) looks very normal already. The posterior for \(\sigma\) on the other hand is very slightly right-skewed:

dens( sample.simga)

We can compute posterior compatibility intervals:

PI( )
 5% 94% 
154 155 
PI( sample.sigma)
  5%  94% 
7.32 8.26 
Short note on why the standard deviation is less normal:

Posterior of \(\sigma\) tend to have a long-right hand tail. The reasons? Complex! But an intuition is that, because the standard deviation has to be positive, there is less uncertainty how small it is (it is bounded by 0 after all) and more uncertainty about how big it is. If we repeat the example from above on a small subset of the data, this tail in the uncertainty of \(\sigma\) becomes more pronounced:

There’s a longer tail at the top of this point-cloud. We can also see it in the marginal density of \(\sigma\):

Quadratic Approximation and Prior Predictive Checks

Since grid approximation becomes unfeasible with more complex models, we now start using quadratic approximation. As we’ve seen above, the posterior approaches a normal distribution with enough data, so we can use that to approximate the posterior. The quadratic approximation, quap(), first climbs the posterior distribution to find the mode, i.e. the MAP (Maximum A Posteriori) and then estimates the quadratic curvature.

To use quap(), we place our model into a formula list (alist() in R):

flist <- alist(
  height ~ dnorm( mu, sigma) ,
  mu ~ dnorm( 178, 20),
  sigma ~ dunif(0, 50)

m4.1 <- quap( flist, data=d2 )
precis( m4.1 )
mean sd 5.5% 94.5%
mu 154.61 0.41 153.95 155.3
sigma 7.73 0.29 7.27 8.2

Comparing this with the compatibility intervals from the grid approximation model before, we that they’re nearly identical:

PI( )
 5% 94% 
154 155 
PI( sample.sigma )
  5%  94% 
7.32 8.26 

Let’s see what happens if we use a very narrow prior on \(\mu\) instead:

m4.2 <- quap( 
            height ~ dnorm(mu, sigma),
            mu ~ dnorm(178, 0.1),
            sigma ~ dunif(0, 50)
mean sd 5.5% 94.5%
mu 177.9 0.10 178 178
sigma 24.5 0.93 23 26

Now the estimate for \(\mu\) has hardly moved off the prior. Note however, how the posterior for \(\sigma\) changed by quite a lot, even though we didn’t change its prior. Since we told the model that \(\mu\) is basically 178, very convincingly with this prior, it changes the estimate for \(\sigma\) in return to make the data fit with the model.

To sample from the quadratic approximation, note that the quadratic approximation of a posterior distribution is just a multi-dimensional Gaussian distribution. To sufficiently describe a multidimensional Gaussian distribution, we only need a list of means and a matrix of variances and covariances.

vcov( m4.1 )
mu sigma
mu 0.17 0.000
sigma 0.00 0.085

We can decompose the matrix of variances and covariances into a vector of variances and a correlation matrix:

diag( vcov(m4.1 ))  
cov2cor( vcov( m4.1 ))  
    mu  sigma 
0.1697 0.0849 
mu sigma
mu 1.000 0.002
sigma 0.002 1.000

We could use the variances and correlation matrix to sample from a multi-dimensional Gaussian distribution. Or simply use the provided short-cut:

post <- extract.samples( m4.1, n=1e4)
mu sigma
154 8.13
155 7.62
155 7.09
155 7.71
154 7.96
154 7.27
mean sd 5.5% 94.5% histogram
mu 154.61 0.41 153.94 155.28 ▁▁▁▅▇▂▁▁
sigma 7.73 0.29 7.26 8.19 ▁▁▁▂▅▇▇▃▁▁▁▁

Equivalently, we could also get the posterior samples manually as follows:

post <- mvrnorm(n = 1e4, mu=coef(m4.1), Sigma=vcov(m4.1))

Full code.


Read more posts by this author.