Computational population genetics, clearly explained

In-depth writing on genetic data, ancient DNA, and the methods used to study ancestry and human history.

Latest Posts

Convert Raw DNA Files to EIGENSTRAT for ADMIXTOOLS and Merge with AADR

Commercial raw DNA exports are not provided in the file formats normally used by ADMIXTOOLS, ADMIXTOOLS 2, AADR-based workflows, or PLINK. Files from 23andMe, AncestryDNA, FamilyTreeDNA, MyHeritage, and Living DNA are usually plain-text vendor exports, while downstream workflows often require PLINK PACKEDPED or EIGENSTRAT/PACKEDANCESTRYMAP files. EIGENSTRAT is often used loosely to refer to the .geno/.snp/.ind triplet. Strictly speaking, EIGENSTRAT is the plain-text version of that triplet; PACKEDANCESTRYMAP is the packed binary form of the same three files. ADMIXTOOLS and ADMIXTOOLS 2 work with either, but PACKEDANCESTRYMAP takes far less disk space and loads much faster, which is why it’s the practical default used here. ...
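The plain-text versus packed distinction can be checked from the first bytes of a .geno file. The following is a minimal Python sketch, not part of the workflow described in the post. It assumes that packed files begin with an ASCII header starting with GENO (or TGENO for the transposed variant) and that plain-text EIGENSTRAT files contain only the characters 0, 1, 2 and 9. The function name geno_format is hypothetical.

```python
def geno_format(path):
    """Guess whether a .geno file is plain-text EIGENSTRAT or a packed binary.

    Assumption: packed files start with an ASCII header beginning with
    "GENO" (or "TGENO" for the transposed layout); plain-text files hold
    one row per SNP made up only of the characters 0, 1, 2 and 9.
    """
    with open(path, "rb") as handle:
        head = handle.read(64)

    if head.startswith(b"TGENO"):
        return "TGENO"
    if head.startswith(b"GENO"):
        return "PACKEDANCESTRYMAP"

    # Look at the first (possibly truncated) line of a text file.
    first_line = head.split(b"\n", 1)[0].rstrip(b"\r")
    if first_line and all(byte in b"0129" for byte in first_line):
        return "EIGENSTRAT"
    return "unknown"
```

Anything reported as unknown is worth inspecting by hand before passing it to downstream tools.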

May 15, 2026

The Genetic Origins of the Proto-Anatolians

The origins of the Proto-Anatolians are often treated as one of the more obscure problems, but the genetic data may not be that ambiguous. Anatolian is regarded as the earliest-splitting branch of “Indo-European”, and its divergence is deep enough that some linguists distinguish a pre–Proto-Indo-European stage, sometimes called “Indo-Anatolian”, from the Proto-Indo-European reconstructed from the non-Anatolian branches. Under either framing, the relevant question is the same: whether the earlier Eneolithic steppe-related ancestry behind Yamnaya, particularly the Caucasus–Lower Volga (CLV) component, also moved south of the Caucasus into Anatolia. For this purpose, I use Progress-2 specifically as a proxy for the north Caucasus-facing part of this Eneolithic steppe-related ancestry, since it sits directly at the northern end of the Caucasus and is therefore a reasonable stand-in for groups that may have passed through the region. ...

May 10, 2026

Downloading and Converting AADR v66

In April 2026, new AADR versions were released on Harvard Dataverse. Among the more important additions are the new compatibility datasets, introduced to reduce platform-specific bias when co-analyzing ancient DNA generated with different experimental setups. This matters when combining data produced on different platforms, such as Agilent (AG) or Twist (TW) capture reagents and shotgun sequencing (SG), because these can introduce systematic differences that may affect downstream population genetic analyses. The compatibility panels were added to minimize that problem and make mixed-platform datasets more directly comparable. ...

April 17, 2026

dt: A Modern awk Alternative for Common Data Workflows

I recently published dt, a modern data transformation tool designed to make the awk workflows commonly used on this blog more intuitive, expressive, and fast. The tool is written in Rust because it compiles to a single binary that runs anywhere, and it uses Polars for the actual data processing, giving you columnar operations that handle large files efficiently. The syntax uses explicit functions (filter(), select(), mutate()) chained together with pipes, making common transformations easier to read and modify. There’s also an interactive REPL that shows you the result after each operation, letting you build complex pipelines step-by-step, catch mistakes early, and revert them with .undo. ...

February 11, 2026

Fast, Transparent f4-Based Admixture Screening in R

In this post, I will build a transparent admixture-screening workflow from scratch in R using f4-statistics and constrained regression. The main advantage is automation: instead of hand-writing every candidate model, the script tests many 2-way, 3-way, and 4-way source combinations in one pass and ranks them by fit. The result is not a replacement for qpAdm, but a fast screening layer that can help you identify promising models before you validate them more formally. ADMIXTOOLS 2 already includes batch tools such as qpadm_multi() and qpadm_rotate(), so the point here is not that qpAdm cannot be automated. The point is that this custom workflow is compact, transparent, easy to modify, and useful for exploratory model search. ...
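The model-search step, where every 2-way, 3-way, and 4-way source combination is generated instead of written by hand, can be sketched on its own. The post works in R; the snippet below is a Python illustration of the enumeration only and does not reproduce the f4 computation or the constrained regression. The function name candidate_models and the source labels are hypothetical placeholders.

```python
from itertools import combinations


def candidate_models(sources, min_way=2, max_way=4):
    """Yield every unordered source combination of size min_way..max_way."""
    for size in range(min_way, max_way + 1):
        yield from combinations(sources, size)


# Hypothetical source labels, used only to show the shape of the output.
sources = ["SourceA", "SourceB", "SourceC", "SourceD", "SourceE"]
models = list(candidate_models(sources))
print(len(models), "candidate models")
print(models[0], models[-1])
```

With five candidate sources this gives 10 + 10 + 5 = 25 models, each of which would then be fitted and ranked by the screening step.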

February 3, 2026

How to Merge EIGENSTRAT Datasets Using mergeit

mergeit is part of the EIGENSOFT package and can be used to merge exactly two EIGENSTRAT/PACKEDANCESTRYMAP datasets without converting to PACKEDPED format first. In this post, I’ll show how to merge the sample we created in Pseudohaploid Genotyping for Ancient DNA: BAM to EIGENSTRAT with the AADR dataset. Setting up EIGENSOFT: the package can be installed via conda with conda install -c bioconda eigensoft. If you haven’t installed conda yet, see the Miniconda setup in the pseudohaploid genotyping post. ...

January 12, 2026

Pseudohaploid Genotyping for Ancient DNA: BAM to EIGENSTRAT

This is a follow-up to my previous post Processing Ancient DNA: From FASTQ to Aligned BAM, where I aligned an ancient DNA sample against the hs37d5 reference genome, producing a filtered BAM compatible with the AADR dataset. In this post, I’ll cover pseudohaploid genotype calling using pileupCaller and converting the output to EIGENSTRAT format for use with ADMIXTOOLS. Since we just created this BAM ourselves in the previous post, we already know it’s aligned to hs37d5. However, if you’re starting with a BAM file from another source, you’ll need to verify the reference genome first. I’ll start by showing how to check BAM headers to identify the reference genome. ...
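The header check can be illustrated with a small Python sketch. It is not the procedure from the post: it assumes the header text has already been obtained (for example with samtools view -H) and guesses the reference from the @SQ lines, using the hs37d5 decoy contig name and the known chromosome 1 lengths of GRCh37 (249,250,621 bp) and GRCh38 (248,956,422 bp). The function name guess_reference is hypothetical.

```python
def guess_reference(header_text):
    """Guess the human reference build from SAM/BAM header text.

    Assumption: hs37d5 is recognised by its decoy contig named "hs37d5";
    otherwise the length of chromosome 1 separates GRCh37 from GRCh38.
    """
    contigs = {}
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Each @SQ field is a TAG:VALUE pair separated by tabs.
        fields = dict(
            item.split(":", 1) for item in line.split("\t")[1:] if ":" in item
        )
        if "SN" in fields and "LN" in fields:
            contigs[fields["SN"]] = int(fields["LN"])

    if "hs37d5" in contigs:
        return "hs37d5"
    chr1_length = contigs.get("1") or contigs.get("chr1")
    if chr1_length == 249250621:
        return "GRCh37/hg19 without the hs37d5 decoy"
    if chr1_length == 248956422:
        return "GRCh38/hg38"
    return "unknown"
```

A result of unknown means only that this heuristic did not match; the @SQ and @PG lines should then be read manually.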

January 4, 2026

Processing Ancient DNA: From FASTQ to Aligned BAM

This is the first post in a series where I’ll process an ancient DNA sample from raw FASTQ sequences to EIGENSTRAT format for use with ADMIXTOOLS. The walkthrough includes the actual aDNA-specific BWA parameter settings, which are rarely documented elsewhere. System Requirements: All commands are written for Debian/Ubuntu-based Linux systems. CPU: At least 8 cores recommended. BWA-MEM scales efficiently up to 12–16 threads; beyond this, you will likely encounter diminishing returns due to memory bandwidth saturation. RAM: 16 GB for moderate-coverage samples. 20–32 GB can be beneficial for deeply sequenced datasets, mainly to keep sorting and downstream processing in memory. Storage & Network: A fast internet connection is recommended. While this tutorial uses a small sample of approximately 3.5 GB of compressed FASTQs, ancient DNA samples vary widely in sequencing depth and can easily reach or exceed 10 GB of compressed FASTQs per individual. If you’re limited by hardware or bandwidth, consider using a cloud computing instance (Google Cloud, AWS, or DigitalOcean). You can SSH into the instance and run the pipeline there. I use this approach myself because my internet connection is slow. ...

January 2, 2026

Running qpAdm with ADMIXTOOLS2 in R: Testing and Interpreting Ancestry Models

This post covers using qpAdm in R to test ancestry models and estimate admixture proportions. qpAdm builds on f4-statistics and provides a framework for evaluating whether proposed source populations can explain a target population’s genetic makeup. For R and admixtools setup instructions on Debian/Ubuntu, see my previous post: Running f4-Statistics with Admixtools in R. Windows users can find R installation instructions on the R website. What is qpAdm? qpAdm is a method for testing ancestry models and estimating admixture proportions. It determines whether a target population can be modeled as a mixture of specified source populations (“left populations”), and if the model fits, calculates the contribution from each source. The method builds on f4-statistics (covered in my previous post) to evaluate these ancestry models. ...

December 16, 2025

Running f4-Statistics with ADMIXTOOLS2 in R

This post covers how to run f4-statistics using the admixtools package for R. Compared with the original ADMIXTOOLS workflow, the R implementation is more convenient for testing multiple population combinations because it can be used interactively, without repeatedly editing parameter files. It is also much faster in practice, which makes it possible to work directly with the full AADR dataset without creating subsets. What are f4-statistics? F4-statistics measure asymmetry in allele sharing among four populations. For four populations A, B, C, and D, the statistic is written as: ...
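As a rough illustration of what such a statistic computes, here is a minimal Python sketch of the commonly used definition: the average over SNPs of the product of two allele-frequency differences, one between the first pair of populations and one between the second pair. This is not the admixtools implementation (which also handles missing data and block-jackknife standard errors), the notation may differ from the full post, and the function name f4 and the input layout are assumptions.

```python
def f4(freq_a, freq_b, freq_c, freq_d):
    """Average of (a - b) * (c - d) across SNPs.

    Each argument is a list of allele frequencies, one value per SNP,
    for one of the four populations. All lists must have equal length.
    """
    n_snps = len(freq_a)
    if n_snps == 0 or not (len(freq_b) == len(freq_c) == len(freq_d) == n_snps):
        raise ValueError("need four non-empty frequency lists of equal length")
    total = sum(
        (a - b) * (c - d)
        for a, b, c, d in zip(freq_a, freq_b, freq_c, freq_d)
    )
    return total / n_snps
```

A value near zero is what is expected when the first pair and the second pair do not share drift; a consistent excess in one direction indicates asymmetric allele sharing.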

November 28, 2025

MyHeritage's $30 WGS: Technical Analysis

Recently, MyHeritage quietly announced that all new DNA kits will be processed using low-pass whole-genome sequencing (WGS) instead of traditional genotyping arrays, for just $30 per kit. The coverage depth will be roughly 2x (as announced in their blog post), compared to the 30x used in clinical sequencing. It’s nowhere near diagnostic quality, but it is whole-genome data nevertheless. The surprising part isn’t the technology, it’s the price. Getting an entire genome, even at shallow depth, for less than what MyHeritage charges to “unlock” an uploaded SNP file (~$35) seems almost too cheap to be real. Below are a few of the issues I see: technical, economic, and data-security related. ...

November 11, 2025

How to Subset Genetic Samples by Population Labels with awk (Create PLINK --keep file)

In an earlier post, PLINK PCA Tutorial: Running PCA in PLINK (Commands + Output), I showed the manual way to build a subset from the .ind/.fam. That works, but if you want to keep thousands of samples it gets tedious fast. Below is a one-liner using awk that generates a PLINK --keep file automatically from a list of populations. Prepare a list of populations to keep: Create a text file (e.g. pops) in the same directory as your reference .ind and .fam. Put one population label per line: ...
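The same idea can be sketched in Python for readers who prefer it over awk. This is an illustration, not the one-liner from the post. It assumes the .ind and .fam files list the same samples in the same order, that the population label is the third column of the .ind file, and that the family and individual IDs are the first two columns of the .fam file, which is what PLINK --keep expects. The function name build_keep_lines is hypothetical.

```python
def build_keep_lines(pop_labels, ind_lines, fam_lines):
    """Return 'FID IID' lines for samples whose population is in pop_labels.

    Assumption: ind_lines and fam_lines describe the same samples in the
    same order, so row i of one file matches row i of the other.
    """
    if len(ind_lines) != len(fam_lines):
        raise ValueError(".ind and .fam must have the same number of rows")

    wanted = {label.strip() for label in pop_labels if label.strip()}
    keep = []
    for ind_line, fam_line in zip(ind_lines, fam_lines):
        ind_fields = ind_line.split()
        fam_fields = fam_line.split()
        if len(ind_fields) < 3 or len(fam_fields) < 2:
            continue  # skip malformed rows
        if ind_fields[2] in wanted:
            keep.append(f"{fam_fields[0]} {fam_fields[1]}")
    return keep
```

In practice the three inputs would come from reading the pops file and the reference .ind and .fam files line by line, and the returned lines would be written to the file passed to --keep.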

November 10, 2025

Running ADMIXTURE in Supervised Mode

This post is a short follow-up to the previous one on Estimating Ancestry Components Using ADMIXTURE. Here, we’ll explore supervised ADMIXTURE, a mode that allows you to explicitly define ancestral populations and infer the ancestry proportions of unassigned individuals based on those references. What Is a Supervised Run? In supervised mode, ADMIXTURE skips the component discovery step and instead uses user-defined groupings to represent ancestral components. The benefit: if you already have solid candidates for reference populations, you can use them to quickly infer ancestry proportions for target or admixed individuals. ...
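The user-defined groupings are supplied to ADMIXTURE through a .pop file that has one line per individual, in the same order as the .fam file: a population label for reference samples and "-" for individuals whose ancestry should be estimated. The Python sketch below builds those lines; the function name pop_file_lines, the mapping argument, and the sample and population names in the example are hypothetical.

```python
def pop_file_lines(fam_lines, reference_labels):
    """Build .pop lines aligned with the rows of a PLINK .fam file.

    reference_labels maps an individual ID (second .fam column) to the
    population it should represent. Individuals not in the mapping get
    "-", which marks them as having unknown ancestry.
    """
    lines = []
    for fam_line in fam_lines:
        fields = fam_line.split()
        if len(fields) < 2:
            raise ValueError(f"malformed .fam row: {fam_line!r}")
        lines.append(reference_labels.get(fields[1], "-"))
    return lines


# Hypothetical example: two reference samples and one target sample.
fam_rows = ["1 RefA_1 0 0 1 1", "2 RefB_1 0 0 2 1", "3 Target_1 0 0 0 1"]
labels = {"RefA_1": "PopA", "RefB_1": "PopB"}
print("\n".join(pop_file_lines(fam_rows, labels)))
```

The resulting lines are saved next to the .bed file under the same prefix with a .pop extension, and ADMIXTURE is then run with the --supervised flag.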

August 5, 2025

How to Run ADMIXTURE (Unsupervised): Full Tutorial & Python Plotting Script

In this post, I’ll demonstrate how to estimate ancestry proportions using one of the most widely used tools in population genetics: ADMIXTURE. ADMIXTURE is a model-based clustering algorithm that infers individual ancestries from multilocus SNP genotype datasets. Preparing the Dataset Download the appropriate ADMIXTURE binary and either place it in your dataset directory or make it globally accessible. For this run, I included a subset of West Asian populations along with a few adjacent populations (around 150 samples in total). Linkage Disequilibrium (LD) pruning was applied beforehand. If you’re unsure how to prune your dataset, refer to the previous post. ...

August 2, 2025

SmartPCA Tutorial: How to Run PCA on Genetic Data (EIGENSTRAT)

This post is a continuation of the previous one, where I demonstrated how to perform PCA with PLINK. While PLINK’s PCA is great for quick, exploratory analysis, smartpca (part of the EIGENSOFT toolset) is more commonly used in published genetic studies. The smartpca program needs to be compiled on Linux or macOS, or alternatively installed via conda. I covered both methods in this earlier post: From EIGENSTRAT to PACKEDPED. As before, I’ll use a small subset. The focus here is on the technical process. One key difference in this post is that I’ll perform Linkage Disequilibrium (LD) pruning, which helps reduce SNP redundancy and improves the detection of population structure in PCA. ...

July 30, 2025

PLINK PCA Tutorial: Running PCA in PLINK (Commands + Output)

In this post, I’ll demonstrate how to perform a PCA on a PLINK dataset. Before we begin, we need to prepare a subset of samples we’re interested in analyzing. To do this, we’ll extract sample information from the .fam file. But first, we need to identify the samples of interest, for example those from a specific population such as Sardinians. The easiest way is to open the corresponding .ind file and look at the population column, which is the third column in each row. Open the file in a text editor and search for the population name, in this case Sardinian. ...

July 29, 2025

Converting EIGENSTRAT/PACKEDANCESTRYMAP to PACKEDPED

The files downloaded in the previous blog post are distributed as an EIGENSTRAT-style .geno/.snp/.ind dataset. This naming can be confusing: the .snp and .ind files are the usual EIGENSTRAT metadata files, but the .geno file may either be plain-text EIGENSTRAT or binary PACKEDANCESTRYMAP. PACKEDPED format allows for easier downstream processing using the PLINK toolset. With PLINK, it becomes straightforward to extract sample subsets, filter SNPs, and perform a wide range of analyses. ...

July 29, 2025

How to Download the AADR Dataset (Linux & WSL)

Note: This post uses an older AADR release and parts of it may now be outdated. For the latest AADR v66 download, including TGENO conversion and ADMIXTOOLS2 compatibility notes, see Downloading and Converting AADR v66. A Linux environment is unavoidable when it comes to bioinformatics data processing and preparation. You can use your favorite distribution. For Windows users, the Windows Subsystem for Linux (WSL) provides a good alternative to dual booting or setting up a full virtual machine. ...

July 29, 2025