EIGENSTRAT

How to Subset Genetic Samples by Population Labels with awk (Create PLINK --keep file)

In an earlier post, PLINK PCA Tutorial: Running PCA in PLINK (Commands + Output), I showed the manual way to build a subset from the .ind/.fam. That works, but if you want to keep thousands of samples it gets tedious fast. Below is a one-liner using awk that generates a PLINK --keep file automatically from a list of populations.

Prepare a list of populations to keep:

Create a text file (e.g. pops) in the same directory as your reference .ind and .fam. Put one population label per line: ...
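The excerpt cuts off before the one-liner itself, so what follows is only a minimal sketch of the idea, not the post's actual command. It assumes the .ind has three whitespace-separated columns (sample ID, sex, population label) and that the second column of the .fam holds the same sample IDs. The function name make_keep and all file names are hypothetical.

```shell
# Sketch: print "FID IID" lines (the PLINK --keep format) for every sample
# whose population label appears in the pops file.
# Assumptions: .ind columns are "sample_id sex population";
# .fam column 2 (IID) matches the .ind sample IDs.
make_keep() {
    pops_file=$1   # one population label per line
    ind_file=$2    # EIGENSTRAT .ind
    fam_file=$3    # PLINK .fam
    awk '
        FILENAME == ARGV[1] { if (NF) pops[$1]; next }         # collect wanted labels
        FILENAME == ARGV[2] { if ($3 in pops) ids[$1]; next }  # collect sample IDs in those populations
        ($2 in ids)         { print $1, $2 }                   # emit FID IID from the .fam
    ' "$pops_file" "$ind_file" "$fam_file"
}
```

Under those assumptions, make_keep pops ref.ind ref.fam > keep.txt writes the list, and plink --bfile ref --keep keep.txt --make-bed --out subset would then extract the samples (ref, keep.txt and subset are placeholder names).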

Converting EIGENSTRAT to PACKEDPED

The files downloaded in the previous blog post are in EIGENSTRAT format. In this post, we’ll look at how to convert them to PACKEDPED format. PACKEDPED format allows for easier downstream processing using the PLINK toolset. With PLINK, it becomes straightforward to extract sample subsets, filter SNPs, and perform a wide range of analyses.

Downloading PLINK

I use PLINK 1.9. While there is a newer version (2.0), I prefer 1.9 because it includes several features that were deprecated or removed in the newer release. ...
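The excerpt does not say which tool performs the conversion. One common route, assumed here, is EIGENSOFT's convertf driven by a parameter file. The sketch below only prints such a parameter file; the function name write_convertf_par and both prefixes are hypothetical placeholders.

```shell
# Sketch: print a convertf parameter file for EIGENSTRAT -> PACKEDPED.
# Assumption: EIGENSOFT's convertf does the conversion (not confirmed by the post).
write_convertf_par() {
    in_prefix=$1    # placeholder input prefix: expects .geno, .snp, .ind
    out_prefix=$2   # placeholder output prefix: yields .bed, .bim, .fam
    cat <<EOF
genotypename:    ${in_prefix}.geno
snpname:         ${in_prefix}.snp
indivname:       ${in_prefix}.ind
outputformat:    PACKEDPED
genotypeoutname: ${out_prefix}.bed
snpoutname:      ${out_prefix}.bim
indivoutname:    ${out_prefix}.fam
EOF
}
```

With convertf installed, write_convertf_par input output > par.convert followed by convertf -p par.convert would run the conversion under these assumptions.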

How to Download the AADR Dataset (Linux & WSL)

A Linux environment is unavoidable for bioinformatics data processing and preparation. You can use your favorite distribution. For Windows users, the Windows Subsystem for Linux (WSL) provides a good alternative to dual booting or setting up a full virtual machine.

Installing WSL with Debian

Open PowerShell as Administrator and run:

    wsl --install -d Debian

Once installed, update the system:

    sudo apt update && sudo apt upgrade -y

Downloading A Genetic Dataset

Before doing PCA, ADMIXTURE, qpAdm, etc., you need actual genotype data. A good and comprehensive resource is the Allen Ancient DNA Resource (AADR). ...