Hierarchical Modeling

The Setup

Let's look at an example of doing inference with hierarchical modeling. The example we'll use is the rat tumor data set from Gelman's Bayesian Data Analysis text, and again, this explanation follows that text very closely. You can follow along by downloading the data here.
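To set the notation for the code sketches below (a minimal sketch; the file name and column names are placeholders, so adjust them to match however you saved the downloaded data), the whole data set reduces to two integer arrays, the tumor counts and the group sizes:

```python
import numpy as np
import pandas as pd

# Placeholder file/column names -- adjust to match the downloaded data.
rats = pd.read_csv("rat_tumors.csv")
y = rats["y"].to_numpy()  # number of rats that developed a tumor in each experiment
n = rats["N"].to_numpy()  # number of rats in each experiment
J = len(y)                # number of experiments
```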

The idea here is that we have 70 independent experiments, each estimating the probability that a rat will develop a tumor. Basically we have a vector of tumor counts $y_j$, and we're going to assume that the underlying probabilities that gave rise to each $y_j$ (that is to say, $y_j \sim \mathrm{Bin}(n_j,\theta_j)$) are somehow related. Hierarchical modeling allows us to estimate individual $\theta_j$s for each experiment by first estimating the distribution from which these $\theta_j$s are drawn. Put another way, we can think of each $\theta_j$ as a draw from a prior distribution defined by some parameter vector $\phi$. This means that:

$$p(\theta|\phi) = \prod_{j=1}^Jp(\theta_j|\phi)$$

As $\phi$ is not known, it has its own prior distribution, $p(\phi)$. This means that the joint prior for $(\phi,\theta)$ is

$$p(\phi,\theta)=p(\phi)p(\theta|\phi)$$

which means that the joint posterior is

$$p(\phi,\theta|y) \propto p(\phi,\theta)p(y|\phi,\theta) = p(\phi,\theta) p(y|\theta) $$

We know that $p(y|\phi,\theta)=p(y|\theta)$ because $y$ depends only on $\theta$.

The real magic of the hierarchical model comes when we can get our hands on the marginal posterior $p(\phi|y)$, because once we have that, we can draw $\phi$ from it and then estimate the individual $\theta_j$s. The most obvious way to get it is simply to integrate $\theta$ out of the joint posterior:
$$p(\phi|y)=\int p(\theta,\phi|y)d\theta$$
That integral isn't always convenient, though. If we can write down the conditional posterior $p(\theta|\phi,y)$ in closed form, we can instead use the identity:

$$p(\phi|y)=\frac{p(\theta,\phi|y)}{p(\theta|\phi,y)}$$

In this case, the prior on each $\theta_j$ is $\mathrm{Beta}(\alpha,\beta)$, so the components of $\phi$ are $\alpha$ and $\beta$, and we can rewrite our joint posterior for the specific case of our data:

$$p(\theta,\alpha,\beta|y) \propto p(\alpha,\beta)\,p(\theta|\alpha,\beta)\,p(y|\theta,\alpha,\beta)$$
$$= p(\alpha,\beta)\prod_{j=1}^{J}\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\theta_j^{\alpha-1}(1-\theta_j)^{\beta-1}\prod_{j=1}^J\theta_j^{y_j}(1-\theta_j)^{n_j-y_j}$$
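As a sanity check on the algebra, here's a minimal sketch of the log of this joint density, using a flat $p(\alpha,\beta)\propto 1$ purely as a placeholder (the actual choice of hyperprior is deferred to the end of the post):

```python
import numpy as np
from scipy.special import gammaln

def log_joint(theta, alpha, beta, y, n):
    """Log of p(theta, alpha, beta | y) up to an additive constant.

    theta, y, n are length-J arrays; alpha and beta are scalars.
    A flat p(alpha, beta) is used as a placeholder hyperprior (an assumption).
    """
    log_hyperprior = 0.0
    log_prior = np.sum(
        gammaln(alpha + beta) - gammaln(alpha) - gammaln(beta)
        + (alpha - 1) * np.log(theta)
        + (beta - 1) * np.log(1 - theta)
    )
    log_lik = np.sum(y * np.log(theta) + (n - y) * np.log(1 - theta))
    return log_hyperprior + log_prior + log_lik
```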

There are two ways we can compute the marginal for $\phi$, and I'll show both. First, let's try
$$p(\alpha,\beta|y)=\frac{p(\theta,\alpha,\beta|y)}{p(\theta|\alpha,\beta,y)}$$

$p(\theta|\alpha,\beta,y)$ is simply a product of independent Beta densities, with $\theta_j \sim \mathrm{Beta}(\alpha+y_j,\,\beta+n_j-y_j)$:

$$p(\theta|\alpha,\beta,y) = \prod_{j=1}^J\frac{\Gamma(\alpha+\beta+n_j)}{\Gamma(\alpha+y_j)\Gamma(\beta+n_j-y_j)}\theta_j^{\alpha+y_j-1}(1-\theta_j)^{\beta+n_j-y_j-1} $$
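So, given a draw of $(\alpha,\beta)$, sampling all of the $\theta_j$s from this conditional is a one-liner (a sketch using numpy's vectorized Beta sampler, with the y and n arrays from earlier):

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_theta(alpha, beta, y, n):
    """Draw theta_j ~ Beta(alpha + y_j, beta + n_j - y_j) for every experiment at once."""
    return rng.beta(alpha + y, beta + n - y)
```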

If we divide $p(\theta,\alpha,\beta|y)$ by $p(\theta|\alpha,\beta,y)$, the $\theta$s cancel and we're left with our marginal. The other option is to integrate out $\theta$ (the same integration that produces the beta-binomial likelihood for each $y_j$), and we end up with the same answer:

$$p(\alpha,\beta|y) \propto \prod_{j=1}^J \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\frac{\Gamma(\alpha+y_j)\Gamma(\beta+n_j-y_j)}{\Gamma(\alpha+\beta+n_j)}$$
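Numerically this is straightforward to evaluate with log-gamma functions. Here is a minimal sketch of the log marginal posterior, again with a flat placeholder for $p(\alpha,\beta)$ (an assumption, not the hyperprior I'll eventually discuss):

```python
import numpy as np
from scipy.special import gammaln

def log_marginal_posterior(alpha, beta, y, n):
    """Log of p(alpha, beta | y) up to an additive constant (flat placeholder hyperprior)."""
    return np.sum(
        gammaln(alpha + beta) - gammaln(alpha) - gammaln(beta)
        + gammaln(alpha + y) + gammaln(beta + n - y)
        - gammaln(alpha + beta + n)
    )
```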

The final step is to come up with a hyperprior $p(\alpha,\beta)$, draw samples of $(\alpha,\beta)$ from the marginal posterior above, and then draw each $\theta_j$ from its conditional Beta posterior. I'll cover how to come up with that hyperprior in a later post.
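As a rough preview of that sampling step (a sketch only: it reuses the y, n arrays and helper functions from the sketches above, keeps the flat placeholder hyperprior, and the grid ranges are arbitrary assumptions), one simple approach is to evaluate the marginal posterior on a grid, sample $(\alpha,\beta)$ pairs from it, and then draw the $\theta_j$s from their conditional Beta posteriors:

```python
import numpy as np

# Arbitrary placeholder grid over (alpha, beta); real ranges should be checked against the data.
alphas = np.linspace(0.5, 10, 200)
betas = np.linspace(1, 40, 200)

log_post = np.array([[log_marginal_posterior(a, b, y, n) for b in betas] for a in alphas])
post = np.exp(log_post - log_post.max())
post /= post.sum()

rng = np.random.default_rng(0)
idx = rng.choice(post.size, size=1000, p=post.ravel())  # sample grid cells by posterior weight
a_samp = alphas[idx // len(betas)]
b_samp = betas[idx % len(betas)]
theta_samp = np.array([draw_theta(a, b, y, n) for a, b in zip(a_samp, b_samp)])
```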