Practical tutorials in computational population genetics

From PCA and ADMIXTURE to imputation and f-statistics.

MyHeritage’s $30 Whole-Genome Sequencing (WGS): Technical and Economic Analysis

Recently, MyHeritage quietly announced that all new DNA kits will be processed using low-pass whole-genome sequencing (WGS) instead of traditional genotyping arrays, for just $30 per kit. The coverage depth will be roughly 2x (as announced in their blog post), compared to the 30x used in clinical sequencing. It’s nowhere near diagnostic quality, but it is whole-genome data nevertheless. The surprising part isn’t the technology; it’s the price. Getting an entire genome, even at shallow depth, for less than what MyHeritage charges to “unlock” an uploaded SNP file (~$35) seems almost too cheap to be real. Below are a few of the issues I see: technical, economic, and data-security related. ...
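To put the 2x figure in perspective, here is a rough back-of-the-envelope sketch. It assumes reads land uniformly at random, so the depth at any site is Poisson-distributed with a mean equal to the coverage. That is an idealized model (real libraries are less uniform), and the function names are illustrative only.

```python
import math

def poisson_pmf(k, mean_depth):
    """Probability that a site is covered by exactly k reads."""
    return math.exp(-mean_depth) * mean_depth ** k / math.factorial(k)

def fraction_with_depth_below(threshold, mean_depth):
    """Expected fraction of sites covered by fewer than `threshold` reads."""
    return sum(poisson_pmf(k, mean_depth) for k in range(threshold))

# Compare the low-pass depth from the post (2x) with clinical depth (30x).
for depth in (2, 30):
    no_reads = fraction_with_depth_below(1, depth)
    under_two = fraction_with_depth_below(2, depth)
    print(f"{depth}x: {no_reads:.2%} of sites with no reads, "
          f"{under_two:.2%} with fewer than two reads")
```

Under this assumption, about 13.5% of sites get no reads at all at 2x, and roughly 41% get fewer than two, which is why low-pass data generally has to be imputed rather than called directly.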

November 11, 2025

How to Subset Genetic Samples by Population Labels with awk (Create PLINK --keep file)

In an earlier post, PLINK PCA Tutorial: Running PCA in PLINK (Commands + Output), I showed the manual way to build a subset from the .ind/.fam files. That works, but if you want to keep thousands of samples it gets tedious fast. Below is a one-liner using awk that generates a PLINK --keep file automatically from a list of populations. Prepare a list of populations to keep: create a text file (e.g. pops) in the same directory as your reference .ind and .fam. Put one population label per line: ...
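For readers who prefer Python to awk, the same idea can be sketched as follows. This is not the awk one-liner from the post. It assumes the usual layouts: an EIGENSTRAT .ind file with sample ID, sex, and population label in columns 1 to 3, and a PLINK .fam file with FID and IID in columns 1 and 2. It also assumes the .ind sample IDs match the .fam IIDs exactly. All function names and the toy data are hypothetical.

```python
def read_population_labels(lines):
    """Collect non-empty population labels, one per line."""
    return {line.strip() for line in lines if line.strip()}

def select_sample_ids(ind_lines, wanted_pops):
    """Return sample IDs whose .ind population label (column 3) is wanted."""
    ids = set()
    for line in ind_lines:
        fields = line.split()
        if len(fields) >= 3 and fields[2] in wanted_pops:
            ids.add(fields[0])
    return ids

def build_keep_lines(fam_lines, sample_ids):
    """Emit 'FID<TAB>IID' for every .fam row whose IID was selected."""
    keep = []
    for line in fam_lines:
        fields = line.split()
        if len(fields) >= 2 and fields[1] in sample_ids:
            keep.append(f"{fields[0]}\t{fields[1]}")
    return keep

# Toy in-memory data standing in for the pops, .ind and .fam files.
pops = ["Armenian", "Georgian", ""]
ind = ["S1 M Armenian", "S2 F Turkish", "S3 M Georgian"]
fam = ["1 S1 0 0 1 1", "2 S2 0 0 2 1", "3 S3 0 0 1 1"]

wanted = read_population_labels(pops)
keep_lines = build_keep_lines(fam, select_sample_ids(ind, wanted))
print("\n".join(keep_lines))
```

With real data, the three lists would be replaced by open file handles and the result written to a file, which could then be passed to PLINK's --keep option together with --bfile, --make-bed and --out.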

November 10, 2025

Running ADMIXTURE in Supervised Mode

This post is a short follow-up to the previous one on Estimating Ancestry Components Using ADMIXTURE. Here, we’ll explore supervised ADMIXTURE, a mode that allows you to explicitly define ancestral populations and infer the ancestry proportions of unassigned individuals based on those references. What Is a Supervised Run? In supervised mode, ADMIXTURE skips the component discovery step and instead uses user-defined groupings to represent ancestral components. The benefit: if you already have solid candidates for reference populations, you can use them to quickly infer ancestry proportions for target or admixed individuals. ...
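As a rough illustration of the "user-defined groupings", the sketch below builds the lines of a .pop file. To my understanding of the ADMIXTURE manual, supervised mode reads a .pop file with one line per individual in .fam order, holding a population label for reference samples and "-" for samples whose ancestry should be estimated. Treat that as an assumption and check it against the manual for your version. The function name, labels, and toy data are hypothetical.

```python
def build_pop_lines(fam_lines, label_by_iid, reference_pops):
    """Return one .pop line per .fam row: a reference label or '-'."""
    pop_lines = []
    for line in fam_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed rows
        label = label_by_iid.get(fields[1])  # look up by IID (column 2)
        # Only samples from the chosen reference populations keep a label;
        # everyone else is marked as unknown so ancestry is estimated.
        pop_lines.append(label if label in reference_pops else "-")
    return pop_lines

# Toy data: two reference populations and one admixed target sample.
fam = ["1 S1 0 0 1 1", "2 S2 0 0 2 1", "3 S3 0 0 1 1"]
labels = {"S1": "RefA", "S2": "Target", "S3": "RefB"}
pop_lines = build_pop_lines(fam, labels, reference_pops={"RefA", "RefB"})
print("\n".join(pop_lines))
```

The important property is that the output stays aligned row for row with the .fam file, since the labels are matched by position rather than by sample ID.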

August 5, 2025

How to Run ADMIXTURE (Unsupervised): Full Tutorial & Python Plotting Script

In this post, I’ll demonstrate how to estimate ancestry proportions using one of the most widely used tools in population genetics: ADMIXTURE. ADMIXTURE is a model-based clustering algorithm that infers individual ancestries from multilocus SNP genotype datasets. Preparing the Dataset: download the appropriate ADMIXTURE binary and either place it in your dataset directory or make it globally accessible. For this run, I included a subset of West Asian populations along with a few adjacent populations (around 150 samples in total). Linkage Disequilibrium (LD) pruning was applied beforehand. If you’re unsure how to prune your dataset, refer to the previous post. ...
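A minimal sketch of the step that comes before plotting: loading the ancestry proportions. It assumes the ADMIXTURE .Q output has one row per individual (in .fam order) with K whitespace-separated proportions. This is not the plotting script from the post, and the function names and toy values are made up for illustration.

```python
def parse_q_lines(q_lines):
    """Parse .Q rows into lists of floats, skipping blank lines."""
    return [[float(x) for x in line.split()] for line in q_lines if line.strip()]

def dominant_component(row):
    """Index of the ancestry component with the largest proportion."""
    return max(range(len(row)), key=row.__getitem__)

def order_for_plot(rows):
    """Group individuals by dominant component, strongest members first."""
    return sorted(
        range(len(rows)),
        key=lambda i: (dominant_component(rows[i]), -max(rows[i])),
    )

# Toy K=2 output for three individuals.
q_lines = ["0.90 0.10", "0.20 0.80", "0.55 0.45"]
rows = parse_q_lines(q_lines)
plot_order = order_for_plot(rows)
print(plot_order)
```

Sorting individuals this way is one common choice for making stacked bar plots easier to read; grouping by population label is an equally reasonable alternative.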

August 2, 2025

SmartPCA Tutorial: How to Run PCA on Genetic Data (EIGENSOFT)

This post is a continuation of the previous one, where I demonstrated how to perform PCA with PLINK. While PLINK’s PCA is great for quick, exploratory analysis, smartpca (part of the EIGENSOFT toolset) is more commonly used in published genetic studies. Smartpca needs to be compiled on Linux or macOS. I covered how to install and prepare the toolchain on Linux in this earlier post: From EIGENSTRAT to PACKEDPED. As before, I’ll use a small subset. The focus here is on the technical process, not on interpreting the results. One key difference in this post is that I’ll perform Linkage Disequilibrium (LD) pruning, which helps reduce SNP redundancy and improves the detection of population structure in PCA. ...
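To make "SNP redundancy" concrete, here is a simplified sketch of LD pruning. It computes r² as the squared Pearson correlation between genotype dosages (0/1/2) and greedily drops any SNP that is too correlated with one already kept. PLINK's own pruning works over sliding windows, so this is an illustration of the idea rather than a reimplementation. The 0.5 threshold, the SNP names, and the genotypes are arbitrary toy values.

```python
def r_squared(a, b):
    """Squared Pearson correlation between two genotype dosage vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # monomorphic SNP: correlation undefined, treated as 0 here
    return cov * cov / (var_a * var_b)

def greedy_prune(snps, threshold):
    """Keep a SNP only if its r² with every kept SNP is at or below threshold."""
    kept = []
    for name, genotypes in snps:
        if all(r_squared(genotypes, other) <= threshold for _, other in kept):
            kept.append((name, genotypes))
    return [name for name, _ in kept]

# Toy data: snp2 duplicates snp1, snp3 is only weakly correlated with snp1.
snps = [
    ("snp1", [0, 1, 2, 0, 1, 2]),
    ("snp2", [0, 1, 2, 0, 1, 2]),
    ("snp3", [2, 0, 1, 1, 2, 0]),
]
kept_names = greedy_prune(snps, threshold=0.5)
print(kept_names)
```

Here snp2 is removed because it carries exactly the same information as snp1, which is the kind of redundancy that can otherwise let a single linked region dominate a principal component.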

July 30, 2025