PopGenetics Blog

Running qpAdm in R: Testing and Interpreting Ancestry Models

This post covers using qpAdm in R to test ancestry models and estimate admixture proportions. qpAdm builds on f4-statistics and provides a framework for evaluating whether proposed source populations can explain a target population’s genetic makeup. For R and admixtools setup instructions on Debian/Ubuntu, see my previous post: Running f4-Statistics with Admixtools in R. Windows users can find R installation instructions on the R website. What is qpAdm? qpAdm is a method for testing ancestry models and estimating admixture proportions. It determines whether a target population can be modeled as a mixture of specified source populations (“left populations”), and if the model fits, calculates the contribution from each source. The method builds on f4-statistics (covered in my previous post) to evaluate these ancestry models. ...

Running f4-Statistics with Admixtools in R

In this post I’ll cover how to run f4-statistics using the admixtools package for R. While I do not typically use R for general-purpose programming, I prefer this implementation over the original one because working in a REPL environment is more practical than editing parameter files, especially when you’re testing different population combinations. The interactive workflow makes programmatic model testing straightforward. Beyond these workflow improvements, the R version is also significantly faster, not because of the language itself, but because of a better implementation. You can work with the full AADR dataset without creating subsets. ...

MyHeritage's $30 WGS: Technical Analysis

Recently, MyHeritage quietly announced that all new DNA kits will be processed using low-pass whole-genome sequencing (WGS) instead of traditional genotyping arrays, for just $30 per kit. The coverage depth will be roughly 2x (as announced in their blog post), compared to the 30x used in clinical sequencing. It’s nowhere near diagnostic quality, but it is whole-genome data nevertheless. The surprising part isn’t the technology, it’s the price. Getting an entire genome, even at shallow depth, for less than what MyHeritage charges to “unlock” an uploaded SNP file (~$35) seems almost too cheap to be real. Below are a few of the issues I see: technical, economic, and data-security related. ...
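To put 2x versus 30x in perspective, here is a back-of-the-envelope sketch. It assumes reads land uniformly at random along the genome, so per-site depth is Poisson-distributed with mean equal to the coverage. Real libraries are less even than that, so treat the numbers as optimistic. The function name coverage_table is made up for this illustration.

```shell
#!/bin/sh
# Assumption: per-site read depth ~ Poisson(mean = coverage).
# coverage_table is a hypothetical helper name, not part of any tool.
coverage_table() {
  awk 'BEGIN {
    n = split("2 30", cov, " ")
    for (i = 1; i <= n; i++) {
      c  = cov[i]
      p0 = exp(-c)          # probability a site gets no reads at all
      p1 = p0 * (1 + c)     # probability a site gets zero or one read
      printf "%dx: P(no reads) = %.3g, P(at most one read) = %.3g\n", c, p0, p1
    }
  }'
}

coverage_table
```

Under this simplified model, about 13.5% of sites receive no reads at 2x and about 40.6% receive at most one, while at 30x both probabilities are negligible. That is why low-pass data generally relies on imputation rather than direct genotype calls.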

How to Subset Genetic Samples by Population Labels with awk (Create PLINK --keep file)

In an earlier post, PLINK PCA Tutorial: Running PCA in PLINK (Commands + Output), I showed the manual way to build a subset from the .ind/.fam. That works, but if you want to keep thousands of samples it gets tedious fast. Below is a one-liner using awk that generates a PLINK --keep file automatically from a list of populations. Prepare a list of populations to keep: Create a text file (e.g. pops) in the same directory as your reference .ind and .fam. Put one population label per line: ...
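A minimal sketch of the general idea follows; it is not necessarily the exact one-liner from the post. It assumes an EIGENSTRAT-style .ind (sample ID, sex, population label) and a PLINK .fam (FID, IID, ...) whose IID column matches the .ind sample IDs. The file names ref.ind, ref.fam and keep.txt, and the toy sample and population labels, are placeholders. Run it in an empty scratch directory.

```shell
#!/bin/sh
# Toy inputs; all names and labels below are made up for illustration.
printf '%s\n' 'PopA' 'PopB' > pops

# EIGENSTRAT .ind: sample ID, sex, population label
cat > ref.ind <<'EOF'
I0001 M PopA
I0002 F PopB
I0003 M PopC
EOF

# PLINK .fam: FID IID father mother sex phenotype
cat > ref.fam <<'EOF'
1 I0001 0 0 1 1
2 I0002 0 0 2 1
3 I0003 0 0 1 1
EOF

awk '
  FILENAME == ARGV[1] { if (NF) keep[$1]; next }          # labels listed in pops
  FILENAME == ARGV[2] { if ($3 in keep) want[$1]; next }  # sample IDs with those labels
  $2 in want { print $1, $2 }                             # FID and IID, as --keep expects
' pops ref.ind ref.fam > keep.txt

cat keep.txt
```

The resulting keep.txt has one FID/IID pair per line and could then be passed to something like `plink --bfile ref --keep keep.txt --make-bed --out subset`, where ref and subset are again placeholder prefixes.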

Running ADMIXTURE in Supervised Mode

This post is a short follow-up to the previous one on Estimating Ancestry Components Using ADMIXTURE. Here, we’ll explore supervised ADMIXTURE, a mode that allows you to explicitly define ancestral populations and infer the ancestry proportions of unassigned individuals based on those references. What Is a Supervised Run? In supervised mode, ADMIXTURE skips the component discovery step and instead uses user-defined groupings to represent ancestral components. The benefit: if you already have solid candidates for reference populations, you can use them to quickly infer ancestry proportions for target or admixed individuals. ...
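As a rough illustration of the user-defined groupings: supervised ADMIXTURE reads them from a .pop file that shares the input's prefix, with one line per individual in .fam order, a population label for each reference sample, and "-" for individuals whose ancestry is to be inferred. The sketch below builds such a file from a hypothetical two-column labels.txt (IID, population). The file names and toy samples are assumptions for the example, not taken from the post.

```shell
#!/bin/sh
# Toy inputs; toy.fam and labels.txt are made-up names for illustration.
cat > toy.fam <<'EOF'
f1 ref1 0 0 1 -9
f2 ref2 0 0 2 -9
f3 target1 0 0 1 -9
EOF

# Reference samples and the ancestral population each one represents
cat > labels.txt <<'EOF'
ref1 PopA
ref2 PopB
EOF

# First file: remember IID -> population. Second file: emit one line per
# individual, using "-" for anyone without an assigned population.
awk 'NR == FNR { pop[$1] = $2; next }
     { print (($2 in pop) ? pop[$2] : "-") }' labels.txt toy.fam > toy.pop

cat toy.pop

# With matching toy.bed/toy.bim present, the run would look like:
#   admixture --supervised toy.bed 2
```

Here K is 2 because two reference populations are defined. The .pop file has to stay in the same order as the .fam, which is why the second pass walks the .fam line by line.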