Practical tutorials in computational population genetics

From PCA and ADMIXTURE to imputation and f-statistics. A technical blog on population genetics, ancient DNA, bioinformatics tools, and pipelines.

Latest Posts

dt: A Modern awk Alternative for Common Data Workflows

I recently published dt, a modern data transformation tool designed to make the awk workflows commonly used on this blog more intuitive, expressive, and fast. dt is written in Rust because it compiles to a single self-contained binary, and it uses Polars for the actual data processing, giving you columnar operations that handle large files efficiently. The syntax uses explicit functions (filter(), select(), mutate()) that you chain together with pipes, so transformations read like a recipe instead of a regex puzzle. There’s also an interactive REPL that shows you the result after each operation, letting you build complex pipelines step by step, catch mistakes early, and revert them with .undo. ...

Deterministic F4-Statistic Regression for Admixture Modeling

In this post, I’ll cover how to build a transparent, deterministic admixture modeler from scratch using R. The main advantage of this approach is automated combinatorial testing: instead of manually specifying and running each potential ancestry model one by one in qpAdm, this script systematically tests hundreds or even thousands of source combinations in seconds and ranks them by fit quality. While qpAdm is the “industry standard” and provides sophisticated covariance weighting through block jackknife procedures, it requires manual intervention for each model specification and can often feel like a “black box.” This script offers a transparent, complementary approach that uses the same mathematical foundation (f4-statistic regression with quadratic programming constraints) to explore candidate models rapidly. If you have 10 potential sources and want to test all 2-way, 3-way, and 4-way combinations, that’s 45 + 120 + 210 = 375 separate qpAdm runs versus a single automated run with this script, and the gap grows quickly with more sources. ...
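The model count quoted above is simply a sum of binomial coefficients, and it can be checked with a few lines of Python. This is a minimal sketch of the arithmetic only; the function name count_models is made up for illustration and is not part of the R script described in the post.

```python
from math import comb


def count_models(n_sources, sizes=(2, 3, 4)):
    """Number of distinct source combinations of the given sizes.

    Hypothetical helper for illustration: each k-way model is an unordered
    choice of k sources out of n_sources, so the total is a sum of C(n, k).
    """
    return sum(comb(n_sources, k) for k in sizes)


if __name__ == "__main__":
    # 10 sources: C(10,2) + C(10,3) + C(10,4) = 45 + 120 + 210
    print(count_models(10))
    # The count grows quickly as more candidate sources are added.
    for n in (10, 15, 20):
        print(n, count_models(n))
```

With 10 sources this reproduces the 375 models mentioned in the excerpt.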

How to Merge EIGENSTRAT Datasets Using mergeit

mergeit is part of the EIGENSOFT package and can be used to merge exactly two EIGENSTRAT datasets without converting to PACKEDPED format first. In this post, I’ll show how to merge the sample we created in Pseudohaploid Genotyping for Ancient DNA: BAM to EIGENSTRAT with the AADR dataset. Setting up EIGENSOFT: you can install it via conda with conda install -c bioconda eigensoft. If you haven’t installed conda yet, see the Miniconda setup in the pseudohaploid genotyping post. ...

Pseudohaploid Genotyping for Ancient DNA: BAM to EIGENSTRAT

This is a follow-up to my previous post Processing Ancient DNA: From FASTQ to Aligned BAM, where I aligned an ancient DNA sample against the hs37d5 reference genome, producing a filtered BAM compatible with the AADR dataset. In this post, I’ll cover pseudohaploid genotype calling using pileupCaller and converting the output to EIGENSTRAT format for use with ADMIXTOOLS. Since we created this BAM ourselves in the previous post, we already know it’s aligned to hs37d5. However, if you’re starting from an existing BAM file that you did not align yourself, you’ll need to verify the reference genome first. I’ll start by showing how to check BAM headers to identify the reference genome. ...
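As a rough illustration of the header check mentioned above, the sketch below parses the @SQ lines of a BAM header and tests two properties I would expect from an hs37d5 alignment: chromosome 1 is named "1" (no "chr" prefix) with the GRCh37 length of 249250621, and a decoy contig named "hs37d5" is present. It assumes the header text has already been dumped with samtools view -H; the function names and the toy header are hypothetical, and this is not necessarily how the post itself performs the check.

```python
def parse_sq_lines(header_text):
    """Return {contig_name: length} from the @SQ lines of a SAM/BAM header."""
    contigs = {}
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Header fields are tab-separated TAG:VALUE pairs, e.g. SN:1 and LN:249250621.
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
        if "SN" in fields and "LN" in fields:
            contigs[fields["SN"]] = int(fields["LN"])
    return contigs


def looks_like_hs37d5(contigs):
    """Heuristic only: GRCh37-style chromosome 1 plus the hs37d5 decoy contig."""
    return contigs.get("1") == 249250621 and "hs37d5" in contigs


if __name__ == "__main__":
    # Toy header for illustration; a real one would come from `samtools view -H`.
    # The decoy length below is a placeholder, only the contig name is checked.
    toy_header = "\n".join([
        "@HD\tVN:1.6\tSO:coordinate",
        "@SQ\tSN:1\tLN:249250621",
        "@SQ\tSN:hs37d5\tLN:1000",
    ])
    print(looks_like_hs37d5(parse_sq_lines(toy_header)))
```

A header whose contigs are named chr1, chr2, and so on, or whose chromosome 1 length differs, would fail this check and point to a different reference build.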

Processing Ancient DNA: From FASTQ to Aligned BAM

This is the first post in a series where I’ll process an ancient DNA sample from raw FASTQ sequences to EIGENSTRAT format for use with ADMIXTOOLS. The series is a complete walkthrough, with the actual aDNA-specific BWA parameter settings that are rarely documented elsewhere. System Requirements: All commands are written for Debian/Ubuntu-based Linux systems. CPU: At least 8 cores recommended. BWA-MEM scales efficiently up to 12–16 threads; beyond this, you will likely encounter diminishing returns due to memory bandwidth saturation. RAM: 16 GB for moderate-coverage samples. 20–32 GB can be beneficial for deeply sequenced datasets, mainly to keep sorting and downstream processing in memory. Storage & Network: A fast internet connection is recommended. While this tutorial uses a small sample of approximately 3.5 GB of compressed FASTQs, ancient DNA samples vary widely in sequencing depth and can easily reach or exceed 10 GB of compressed FASTQs per individual. If you’re limited by hardware or bandwidth, consider using a cloud computing instance (Google Cloud, AWS, or DigitalOcean). You can SSH into the instance and run the pipeline there. I use this approach myself because my internet connection is slow. ...