Ordered Categorical Outcomes
library(rethinking)
data(Trolley)
d <- Trolley
The data contain the answers of 331 individuals to different trolley-problem stories, rating how morally permissible the action in each story is. The answer is an integer from 1 to 7, so the outcome is both categorical and ordered.
simplehist( d$response, xlim=c(1,7), xlab="response")
Describing an ordered distribution with intercepts
We want to redescribe this histogram on the log-cumulative-odds scale. We first compute the cumulative probabilities:
pr_k <- table( d$response ) / nrow(d)
cum_pr_k <- cumsum( pr_k )
plot( 1:7, cum_pr_k, type="b", xlab="response",
ylab="cumulative proportions", ylim=c(0, 1))
Next, to get the log-cumulative odds, we use the following link function to get an intercept parameter for each response: \[\begin{align*} \log \frac{Pr(y_i \leq k)}{1 - Pr(y_i \leq k)} = \alpha_k \end{align*}\] where \(\alpha_k\) is the intercept parameter unique to each possible outcome value \(k\).
logit <- function(x) log(x/(1-x))
( lco <- logit( cum_pr_k ) )
1 2 3 4 5 6 7
-1.9160912 -1.2666056 -0.7186340 0.2477857 0.8898637 1.7693809 Inf
The cumulative probability for the last response will always be 1, so its log-cumulative odds is always infinity and thus we only need 6 parameters.
From the log-cumulative odds, we can then get back to our cumulative proportions by using the inverse of the link, and then obtain the likelihood for each (ordered) category by subtraction. \[p_k = Pr(y_i = k) = Pr(y_i \leq k) - Pr(y_i \leq k-1)\] These likelihoods are the blue line segments in the following plot:
plot( 1:7, cum_pr_k, type="b", xlab="response",
ylab="cumulative proportions", ylim=c(0, 1), xlim=c(1, 7.4))
for (i in 1:7) {
lines(c(i, i), c(0, cum_pr_k[i]) , lwd=2.5, col="grey")
lines(c(i,i) + 0.1, c(cum_pr_k[i] - pr_k[i], cum_pr_k[i]),
lwd=2.5, col="steelblue")
text( i + 0.3, cum_pr_k[i] - pr_k[i]/2, labels=i,
col="steelblue")
}
Putting these together, we get this model: \[\begin{align*} R_i &\sim \text{Ordered-logit}(\phi_i, \kappa) &\text{[probability of data]}\\ \phi_i &= 0 &\text{[linear model]} \\ \kappa_k &\sim \text{Normal}(0, 1.5) &\text{[common prior for each intercept]} \end{align*}\]
More verbosely, this model is the same as \[\begin{align*} R_i &\sim \text{Categorical}(\boldsymbol p) &\text{[probability of data]}\\ p_1 &= q_1 &\text{[probabilities for each value }k] \\ p_k &= q_k - q_{k-1} \quad \text{ for } K > k > 1 & \\ p_K &= 1 - q_{K-1} & \\ \text{logit}(q_k) &= \kappa_k - \phi_i &\text{[cumulative logit link]} \\ \phi_i &= \text{terms of linear model} &\text{[linear model]} \\ \kappa_k &\sim \text{Normal}(0, 1.5) &\text{[common prior for each intercept]} \end{align*}\]
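To make the link concrete, the mapping above can be sketched in a few lines of base R. Here `plogis` is the inverse logit, `ordlogit_p` is a hypothetical helper name (not part of rethinking), and the cutpoint values are the log-cumulative odds printed earlier:

```r
# Minimal sketch of the ordered-logit likelihood described above:
# map cutpoints kappa and linear model phi to category probabilities.
ordlogit_p <- function(phi, kappa) {
  q <- plogis(kappa - phi)   # cumulative probabilities q_k for k = 1..K-1
  diff(c(0, q, 1))           # p_k = q_k - q_(k-1); q_K = 1 by definition
}

p <- ordlogit_p(0, c(-1.916, -1.267, -0.719, 0.248, 0.890, 1.769))
round(p, 3)   # seven probabilities, close to pr_k computed earlier
sum(p)        # 1
```

With \(\phi = 0\) this reproduces the raw histogram; a nonzero \(\phi\) shifts probability mass across the categories, which is exactly what the linear model will do below.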
In ulam:
m12.5 <- ulam(
alist(
R ~ dordlogit( 0, cutpoints ),
cutpoints ~ dnorm( 0, 1.5 )
),
data=list( R = d$response ), chains=4, cores=3, refresh=0
)
or in quap, where we need to specify start values. The exact values aren't important, but their ordering is:
m12.5q <- quap(
alist(
response ~ dordlogit( 0, c(a1, a2, a3, a4, a5, a6)),
c(a1,a2,a3,a4,a5,a6) ~ dnorm(0, 1.5)
), data=d,
start=list(a1=-2, a2=-1, a3=0, a4=1, a5=2, a6=2.5)
)
The posterior distribution:
precis( m12.5 , depth=2)
mean sd 5.5% 94.5% n_eff Rhat4
cutpoints[1] -1.9160333 0.03061587 -1.9652723 -1.8673509 1398.757 1.0028701
cutpoints[2] -1.2657203 0.02439893 -1.3040607 -1.2260394 1903.602 1.0028004
cutpoints[3] -0.7177779 0.02191545 -0.7529061 -0.6819212 2224.825 1.0008826
cutpoints[4] 0.2481206 0.02053067 0.2159976 0.2818062 2922.036 0.9988227
cutpoints[5] 0.8902924 0.02196657 0.8538981 0.9243229 2343.677 0.9994340
cutpoints[6] 1.7702364 0.02839789 1.7244130 1.8152051 2277.799 0.9990671
Note that these values are on the log-cumulative-odds scale. We transform them to the cumulative probabilities by using the inverse logit:
inv_logit( coef(m12.5))
cutpoints[1] cutpoints[2] cutpoints[3] cutpoints[4] cutpoints[5] cutpoints[6]
0.1283046 0.2199907 0.3278825 0.5617139 0.7089505 0.8544871
Compare this with the cumulative probabilities we computed earlier:
cum_pr_k
1 2 3 4 5 6 7
0.1282981 0.2198389 0.3276939 0.5616314 0.7088620 0.8543807 1.0000000
Okay great, we now have a Bayesian representation of the histogram from before. That's a lot of calculation for rather little gain, but we can now include predictor variables in our model.
Adding predictor variables
To include predictor variables, we add a linear model \(\phi_i=\beta x_i\). Then each cumulative logit becomes:
\[\begin{align*}
\text{log}\frac{Pr(y_i \leq k)}{1 - Pr(y_i \leq k)} &= \alpha_k - \phi_i \\
\phi_i &= \beta x_i
\end{align*}\]
Why is the linear model \(\phi\) subtracted from each intercept? Because if we decrease the log-cumulative-odds of every outcome value \(k\) below the maximum, this necessarily shifts probability mass upwards towards higher outcome values.
For example, suppose we take the posterior means from m12.5 and subtract 0.5 from each. First, the probabilities implied by the original means:
( pk <- dordlogit( 1:7, 0, coef(m12.5) ) )
[1] 0.12830456 0.09168619 0.10789175 0.23383136 0.14723666 0.14553655 0.14551294
These probabilities imply an average outcome value of:
sum( pk * (1:7) )
[1] 4.198671
And now subtracting 0.5 from each:
( pk <- dordlogit( 1:7, 0, coef(m12.5) - 0.5 ) )
[1] 0.08195822 0.06411714 0.08225238 0.20903322 0.15899213 0.18443011 0.21921679
Now values on the left have decreased while values on the right have increased. The expected value is now:
sum( pk * (1:7) )
[1] 4.729141
That’s why we subtract \(\phi\), the linear model, from each intercept. This way, a positive \(\beta\) value indicates that an increase in the predictor variable \(x\) results in an increase in the average response.
Going back to the trolley data, we will use action, intention, and contact as predictor variables.
Since the influence of intention may depend upon the simultaneous presence of action or contact, we include an interaction. This gives us the following log-cumulative-odds:
\[\begin{align*}
\text{log}\frac{Pr(y_i \leq k)}{1 - Pr(y_i \leq k)} &= \alpha_k - \phi_i \\
\phi_i &= \beta_A A_i + \beta_C C_i + \text{B}_{I,i}I_i \\
\text {B}_{I,i} &= \beta_I + \beta_{IA}A_i + \beta_{IC}C_i
\end{align*}\]
For the interaction, we use an accessory linear model, \(\text{B}_I\). This helps to make the notation clearer.
dat <- list(
R = d$response,
A = d$action,
I = d$intention,
C = d$contact
)
m12.6 <- ulam(
alist(
R ~ dordlogit( phi, cutpoints ),
phi <- bA*A + bC*C + BI*I,
BI <- bI + bIA*A + bIC*C,
c(bA, bI, bC, bIA, bIC) ~ dnorm(0, 0.5),
cutpoints ~ dnorm( 0, 1.5)
), data=dat, chains=4, cores=4, refresh=0 )
precis(m12.6)
mean sd 5.5% 94.5% n_eff Rhat4
bIC -1.2375365 0.09506630 -1.3867936 -1.0847873 1188.208 1.000377
bIA -0.4355137 0.07818393 -0.5647729 -0.3071681 1148.478 1.001882
bC -0.3412000 0.06714190 -0.4497271 -0.2353824 1103.074 1.001444
bI -0.2902317 0.05712674 -0.3842900 -0.1991619 1008.461 1.003361
bA -0.4716408 0.05228266 -0.5536497 -0.3877824 1037.580 1.001588
plot( precis(m12.6), xlim=c(-1.4, 0))
The combination of contact and intention is the worst. Interestingly, neither contact nor intention by itself has a large impact on the ratings.
Visualizing log-cumulative-odds models is not quite straightforward, because each prediction is actually a vector of probabilities, one probability for each possible response value.
library(dplyr)   # for the observed-proportion overlay below
post <- extract.samples(m12.6)
par(mfrow=c(1,3))
modi <- c("00", "10", "01")
for (modus in modi) {
  plot( NULL, type="n", xlab="intention", ylab="probability",
    xlim=c(0,1), ylim=c(0,1), xaxp=c(0,1,1), yaxp=c(0,1,2), bty="l" )
  kA <- as.numeric( substring(modus, 1, 1) )
  kC <- as.numeric( substring(modus, 2, 2) )
  kI <- 0:1
  pdat <- data.frame(A=kA, C=kC, I=kI)
  phi <- link( m12.6, data=pdat )$phi
  # draw the six cumulative probabilities for 50 posterior samples
  for (s in 1:50) {
    pk <- pordlogit( 1:6, phi[s, ], post$cutpoints[s, ] )
    for (i in 1:6) {
      lines( kI, pk[, i], col=col.alpha("black", 0.1), lwd=0.4 )
    }
  }
  mtext( paste("action=", kA, ", contact=", kC) )
  # overlay the observed cumulative proportions
  temp <- d %>%
    filter(action == kA & contact == kC) %>%
    select(response, intention) %>%
    group_by(response, intention) %>%
    summarize(n = n()) %>%
    group_by(intention) %>%
    arrange(response) %>%
    mutate(prop = n / sum(n), cumprop = cumsum(prop)) %>%
    filter(response != 7)
  points( temp$intention, temp$cumprop, col="steelblue", pch=19, cex=1.7 )
}
This plot shows how the distribution of predicted responses varies by intention. We can see that when contact = 1, intention has the highest impact: people deem this story combination the least morally permissible (most people have given low-valued responses).
We can also visualize the histogram of outcomes:
par(mfrow=c(1,3))
for (modus in modi) {
kA <- as.numeric( substring(modus, 1,1) )
kC <- as.numeric( substring(modus, 2,2) )
kI <- 0:1
pdat <- data.frame(A=kA, C=kC, I=kI)
s <- sim( m12.6, data=pdat)
simplehist( s, xlab="response", col=c("black", "steelblue"), bty="l")
mtext(paste("action=",kA, ", contact=", kC ))
}
The blue segments are the frequencies when intention is 1. In this visualization it is easy to see that some answers are more salient than others: the middle response, 4, is much more frequent. This is one reason why it is better to treat this kind of response as an ordered categorical outcome instead of as an ordinary metric variable.
Ordered Categorical Predictors
Just as we can have ordered categorical outcomes, we can also have ordered predictor variables. In this data set, for example, there is a variable for completed education:
levels(d$edu)
[1] "Bachelor's Degree" "Elementary School" "Graduate Degree"
[4] "High School Graduate" "Master's Degree" "Middle School"
[7] "Some College" "Some High School"
The right order is as follows:
edu_levels <- c(2, 6, 8, 4, 7, 1, 5, 3)
d$edu_new <- edu_levels[ d$edu]
levels(d$edu)[edu_levels]
[1] "Elementary School" "Middle School" "Some High School"
[4] "High School Graduate" "Some College" "Bachelor's Degree"
[7] "Master's Degree" "Graduate Degree"
The idea is that for such a categorical predictor, each step up in value comes with its own incremental effect on the outcome. So completing middle school can have a different impact than completing a bachelor's degree. We will absorb the first level into the intercept, which means that with 8 education levels we need 7 parameters. We get a linear model as follows: \[\phi_i = \beta_E \sum_{j=0}^{E_i -1} \delta_j + \text{other stuff}\] where the parameter \(\delta_j\) is the effect of completing the \(j\)th level of education and \(E_i\) is the completed education level of individual \(i\). The parameters \(\delta_j\) are fractions, with \(\delta_0 = 0\) and \(\sum_{j=1}^{7}\delta_j = 1\). This means that \(\beta_E\) is the maximum effect of education. This parameterization also helps with setting priors. If, for example, the prior expectation is that all of the levels have the same incremental effect, then we want all the \(\delta_j\)’s to have the same prior. We can set a separate prior for \(\beta_E\).
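The cumulative nature of these increments can be sketched with a hypothetical \(\delta\) vector; the numbers below are made up for illustration, not estimates:

```r
# Hypothetical illustration of the incremental-effect parameterization.
bE    <- -0.3                                             # maximum effect of education
delta <- c(0, 0.10, 0.10, 0.20, 0.30, 0.10, 0.10, 0.10)   # delta_0 = 0; the rest sum to 1
phi_edu <- bE * cumsum(delta)   # education term of phi for E = 1, ..., 8
phi_edu[1]   # 0: the first level is absorbed into the intercept
phi_edu[8]   # equals bE: completing all levels gives the full effect
```

Each step up in education adds its own fraction of the total effect, and the increments need not be equal.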
So now our full model: \[\begin{align*} R_i &\sim \text{Ordered-logit}(\phi_i, \kappa) \\ \phi_i &= \beta_E \sum_{j=0}^{E_i -1} \delta_j + \beta_A A_i + \beta_I I_i + \beta_C C_i \\ \kappa_k &\sim \text{Normal}(0, 1.5) \\ \beta_A, \beta_I, \beta_C, \beta_E &\sim \text{Normal}(0, 1) \\ \delta &\sim \text{Dirichlet}(\alpha) \end{align*}\]
The prior for \(\delta\) is a Dirichlet distribution which is the multivariate extension of the beta distribution. The Dirichlet distribution is parametrized by a vector \(\alpha\) of pseudo-counts for each possibility. We’ll use a rather weak prior with each value inside \(\alpha\) being 2. Let’s simulate from this prior:
library(gtools)
set.seed(1805)
delta <- rdirichlet( 10, alpha = rep(2,7))
str(delta)
num [1:10, 1:7] 0.1053 0.2504 0.1917 0.1241 0.0877 ...
h <- 3
plot(NULL, xlim=c(1,7), ylim=c(0,0.4), xlab="index", ylab="probability")
for (i in 1:nrow(delta) ) {
lines( 1:7, delta[i,], type="b",
pch=ifelse(i==h, 16, 1),
lwd=ifelse(i==h, 4, 1.5),
col=ifelse(i==h, "black", col.alpha("black", 0.7)))
}
The highlighted vector isn’t special but simply shows how much variation can exist in a single vector.
Now, let’s code the model:
dat <- list(
R = d$response,
action = d$action,
intention = d$intention,
contact = d$contact,
E = as.integer(d$edu_new ), # edu_new as an index
alpha = rep(2, 7) # delta prior
)
m12.7 <- ulam(
alist(
R ~ ordered_logistic( phi, kappa ),
phi <- bE*sum( delta_j[1:E] ) + bA*action + bI*intention + bC*contact,
kappa ~ normal( 0, 1.5 ),
c(bA, bI, bC, bE) ~ normal( 0, 1 ),
vector[8]: delta_j <<- append_row( 0, delta ),
simplex[7]: delta ~ dirichlet(alpha)
), data=dat, chains=3, cores=3, refresh=0
)
precis(m12.7, depth=2, omit="kappa")
mean sd 5.5% 94.5% n_eff Rhat4
bE -0.24199358 0.05303379 -0.33168202 -0.1593494 1242.595 0.9998003
bC -0.95829974 0.05031225 -1.04082898 -0.8820681 1258.110 0.9990665
bI -0.71707055 0.03663222 -0.77611303 -0.6589980 1229.436 1.0019991
bA -0.70428785 0.04010258 -0.76913912 -0.6414254 1132.548 1.0010330
delta[1] 0.11704672 0.07138965 0.02613915 0.2457580 1997.753 1.0000268
delta[2] 0.14080012 0.08474988 0.02718562 0.2924116 2771.578 0.9987934
delta[3] 0.21579313 0.11290178 0.06044549 0.4151034 1776.309 0.9989093
delta[4] 0.28181314 0.12989130 0.08846192 0.4983804 1910.673 0.9983736
delta[5] 0.06175723 0.04203348 0.01254910 0.1373547 1870.382 0.9989488
delta[6] 0.07228345 0.05112297 0.01427695 0.1712458 1925.217 0.9990211
delta[7] 0.11050621 0.06875393 0.02493212 0.2382028 1767.699 0.9993556
The overall association of education is negative: more educated individuals disapproved of everything more.
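To see the implied education effect at each level, we can cumulate the posterior delta samples; this is a sketch assuming m12.7 has been fit as above:

```r
# Implied education term bE * sum(delta_0 .. delta_(E-1)) at each level E.
post <- extract.samples(m12.7)
delta_full <- cbind(0, post$delta)                  # prepend delta_0 = 0
phi_edu <- post$bE * t(apply(delta_full, 1, cumsum))
# posterior mean effect for E = 1..8 (0 at the first level, bE at the last)
round(apply(phi_edu, 2, mean), 2)
```

This is the same transformation as in the model's linear predictor, applied per posterior sample.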
delta_labels <- c("Elem", "MidSch", "SHS", "HSG", "SCol", "Bach", "Mast", "Grad")
pairs( m12.7, pars="delta", labels=delta_labels )
All but one level of education (Some College, SCol) produce some modest increment on average.
Let’s compare this posterior with one we would get from a more conventional model where education is entered as an ordinary continuous variable.
dat$edu_norm <- normalize( d$edu_new )
m12.8 <- ulam(
alist(
R ~ ordered_logistic( mu, cutpoints ),
mu <- bE*edu_norm + bA*action + bI*intention + bC*contact,
c(bA, bI, bC, bE) ~ normal( 0, 1) ,
cutpoints ~ normal( 0, 1.5)
), data=dat, chains=3, cores=3, refresh=0
)
precis( m12.8 )
mean sd 5.5% 94.5% n_eff Rhat4
bE -0.2113820 0.05773207 -0.3010982 -0.1171681 1174.7162 0.9998395
bC -0.9576836 0.05236242 -1.0437901 -0.8739644 902.8339 1.0019555
bI -0.7183698 0.03675487 -0.7757871 -0.6599124 1263.8602 0.9984837
bA -0.7057842 0.04229579 -0.7725477 -0.6403573 918.1924 1.0022448
This model finds a slightly weaker association between education and the rating response. This is possibly because the effect isn’t actually linear; different levels have different incremental associations.