Inferring population admixture

In the process of preparing for my preliminary exam, I thought it might be useful to summarize some of the things I've learned with a blog post. For my first post, I'll be writing about methods for inferring population admixture.

What is admixture?

One of the largest subdisciplines of genetics is population genetics. Population genetics, as its name implies, is the study of how allele frequencies change within and between populations. Allele frequencies can change within a population in several ways, including genetic drift, natural selection, and population admixture, that is, the "mixing" or interbreeding of otherwise distinct populations. There is some inherent difficulty in defining what exactly is meant by "distinct populations" when discussing admixed individuals, as all living things share a common ancestor. Also inherent in this definition is the notion that there are "unmixed" populations, which is of course a nonsensical idea in sexually reproducing populations, as sexual reproduction is the act of mixing the genetic material of two (hopefully somewhat) unrelated individuals. For the purpose of this post, we'll operate under the assumption that the populations from which admixed individuals descend have been isolated long enough that there are discernible allele frequency differences at an appreciable number of loci. It is important to emphasize that whether these allele frequency differences are due to selection or drift (or previous admixture) is largely irrelevant.

Model-based methods

The first method I'll discuss for inferring admixture is the model-based approach used by (among many others) Pritchard, Novembre, and Tang. In this approach, a statistical model of the data is fit by either maximum likelihood or Bayesian methods. I'll be discussing the maximum likelihood method used by Novembre.

The model

Let's imagine that we have genotyped $I$ individuals at $J$ SNP loci (we could also use structural variants or the like, but for this example let's stick to SNPs), and that each individual descends from at least one, and possibly all, of $K$ ancestral populations. We'll let $g_{ij}$ represent the dosage (number of copies: 0, 1, or 2) of the reference allele for individual $i$ at locus $j$. Let's represent the fraction of individual $i$'s genome that comes from population $k$ as $q_{ik}$, so that $\sum_k q_{ik} = 1$. Finally, let's let $f_{kj}$ represent the frequency of the reference allele at locus $j$ in population $k$. As a reminder, the likelihood of a given genotype for one individual descended from a single population $k$ with allele frequency $f_{kj}$ is going to be
$$ \binom{2}{g_{ij}} (f_{kj})^{g_{ij}} (1-f_{kj})^{2-g_{ij}} $$
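
To make the formula concrete, here is a minimal Python sketch of the single-population case. The function name `genotype_likelihood` and the example frequency of 0.3 are made up for illustration, and the binomial form assumes the two allele copies are drawn independently from the population.

```python
from math import comb

def genotype_likelihood(g, f):
    """Binomial probability of a diploid genotype.

    g: dosage of the reference allele (0, 1, or 2)
    f: reference allele frequency in the source population
    """
    return comb(2, g) * f**g * (1 - f) ** (2 - g)

# Hypothetical example: reference allele frequency of 0.3 in the population.
probs = [genotype_likelihood(g, 0.3) for g in (0, 1, 2)]
print(probs)
```

The three probabilities cover every possible genotype at the locus, so they sum to one.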

For one individual ($i$) descended from several ($K$) populations, the likelihood of a given genotype, conditioned on the allele's frequency in each population and on the individual's mixture proportions, is:

$$ \propto \left( \sum_{k=1}^{K} q_{ik}f_{kj} \right)^{g_{ij}} \left( \sum_{k=1}^{K} q_{ik}(1-f_{kj}) \right)^{2-g_{ij}} $$
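
One way to read this expression: an admixed individual behaves like a member of a single population whose allele frequency is the $q$-weighted average of the ancestral frequencies. A small Python sketch of that idea follows; the function name and the numbers are hypothetical, and unlike the proportional form above it keeps the binomial coefficient so the result is a proper probability.

```python
from math import comb

def admixed_genotype_probability(g, q_i, f_j):
    """Genotype probability for one admixed individual at one locus.

    g:   dosage of the reference allele (0, 1, or 2)
    q_i: ancestry fractions of the individual, one per population (sum to 1)
    f_j: reference allele frequency at this locus in each population
    """
    # Mixed frequency: sum_k q_ik * f_kj. Because the q's sum to 1,
    # sum_k q_ik * (1 - f_kj) is simply 1 - p.
    p = sum(q * f for q, f in zip(q_i, f_j))
    return comb(2, g) * p**g * (1 - p) ** (2 - g)

# Hypothetical individual: 25% ancestry from population 1, 75% from population 2.
q_i = [0.25, 0.75]
f_j = [0.9, 0.1]
admixed_probs = [admixed_genotype_probability(g, q_i, f_j) for g in (0, 1, 2)]
print(admixed_probs)
```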

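It's straightforward from there to generalize this to a log-likelihood function for $I$ individuals and $J$ loci, treating individuals and loci as independent and dropping the constant binomial terms: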

$$ \sum_i \sum_j \left(g_{ij} \log(\sum_k q_{ik} f_{kj}) + (2-g_{ij}) \log( \sum_k q_{ik}(1-f_{kj})) \right) $$
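
As a rough illustration, the log-likelihood can be written as a short Python function over nested lists. The names `G`, `Q`, and `F` and the toy data are assumptions made for this sketch (`G[i][j]` is $g_{ij}$, `Q[i][k]` is $q_{ik}$, `F[k][j]` is $f_{kj}$), and the frequencies are assumed to lie strictly between 0 and 1 so the logarithms are defined.

```python
from math import log

def log_likelihood(G, Q, F):
    """Log-likelihood of the admixture model, up to an additive constant."""
    n_pops = len(F)
    total = 0.0
    for i, genotypes in enumerate(G):
        for j, g in enumerate(genotypes):
            # Expected reference allele frequency for individual i at locus j.
            p = sum(Q[i][k] * F[k][j] for k in range(n_pops))
            total += g * log(p) + (2 - g) * log(1 - p)
    return total

# Toy data: 3 individuals, 2 loci, 2 populations (all values made up).
G = [[2, 0], [0, 2], [1, 1]]
F = [[0.9, 0.1], [0.1, 0.9]]
Q_matching = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
Q_swapped = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]

ll_matching = log_likelihood(G, Q_matching, F)
ll_swapped = log_likelihood(G, Q_swapped, F)
print(ll_matching, ll_swapped)
```

Ancestry assignments that agree with the genotypes give a higher log-likelihood than assignments that contradict them, which is the property the estimation procedure exploits.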

So now that we've set up a likelihood function, how do we go about estimating the parameters? In particular, we are interested in the $q$ values, as they tell us the admixture proportions of each individual. One approach we can take is basically an iterative guess and check. What if, instead of thinking of $q$ as the mixture proportion, we thought of it as the probability that an individual's genotype at a given locus is inherited from a particular population? For example, if 25% of my genome is from population $k$, then the probability that my genotype at a given locus $j$ is inherited from population $k$ is $q_{ik} = 0.25$. What we can do then is take loci where the values of $f_{kj}$ are very different between populations (if we have population values of $f_{kj}$; otherwise we can simply use loci with ...) and use them to assign individuals to these populations. We can then re-estimate the allele frequencies for the ancestral populations given this assignment, re-estimate the ancestral assignment, and repeat.
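
One common way to formalize this guess-and-check loop is an expectation-maximization (EM) algorithm, sketched below in plain Python. This is an illustration under stated assumptions, not the optimizer used by any particular program: the function names, the random starting values, the clamping constant `EPS`, and the toy genotypes are all made up for the example.

```python
import random
from math import log

EPS = 1e-6  # assumed clamp that keeps frequencies away from exactly 0 or 1

def log_likelihood(G, Q, F):
    """Log-likelihood of the admixture model, up to an additive constant."""
    total = 0.0
    for i, row in enumerate(G):
        for j, g in enumerate(row):
            p = sum(Q[i][k] * F[k][j] for k in range(len(F)))
            total += g * log(p) + (2 - g) * log(1 - p)
    return total

def em_step(G, Q, F):
    """One EM update of ancestry fractions Q and allele frequencies F."""
    I, J, K = len(G), len(G[0]), len(F)
    q_num = [[0.0] * K for _ in range(I)]
    f_num = [[0.0] * J for _ in range(K)]
    f_den = [[0.0] * J for _ in range(K)]
    for i in range(I):
        for j in range(J):
            g = G[i][j]
            p_ref = sum(Q[i][k] * F[k][j] for k in range(K))
            p_alt = sum(Q[i][k] * (1 - F[k][j]) for k in range(K))
            for k in range(K):
                # Expected number of reference / alternate allele copies
                # that individual i inherited from population k at locus j.
                ref = g * Q[i][k] * F[k][j] / p_ref
                alt = (2 - g) * Q[i][k] * (1 - F[k][j]) / p_alt
                q_num[i][k] += ref + alt
                f_num[k][j] += ref
                f_den[k][j] += ref + alt
    # Each individual carries 2 * J allele copies in total.
    new_Q = [[q_num[i][k] / (2 * J) for k in range(K)] for i in range(I)]
    new_F = [[F[k][j] if f_den[k][j] == 0.0
              else min(max(f_num[k][j] / f_den[k][j], EPS), 1 - EPS)
              for j in range(J)] for k in range(K)]
    return new_Q, new_F

# Toy genotypes (made up): two groups of individuals plus one mixed individual.
G = [[2, 2, 0, 0], [2, 1, 0, 0], [0, 0, 2, 2], [0, 0, 2, 1], [1, 1, 1, 1]]
K = 2
random.seed(0)
Q = []
for _ in G:
    weights = [random.uniform(0.1, 1.0) for _ in range(K)]
    Q.append([w / sum(weights) for w in weights])
F = [[random.uniform(0.1, 0.9) for _ in G[0]] for _ in range(K)]

ll_start = log_likelihood(G, Q, F)
for _ in range(50):
    Q, F = em_step(G, Q, F)
ll_end = log_likelihood(G, Q, F)
print(ll_start, ll_end)
```

Each pass plays the role of one round of "assign, then re-estimate": the inner loop computes how many allele copies each population is expected to have contributed, and the updates turn those expected counts into new values of $q$ and $f$. The population labels are arbitrary, so which column of `Q` corresponds to which ancestral population depends on the starting values.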