How to Subset Genetic Samples by Population Labels with awk (Create PLINK --keep file)
In an earlier post, PLINK PCA Tutorial: Running PCA in PLINK (Commands + Output), I showed the manual way to build a subset from the .ind/.fam. That works, but if you want to keep thousands of samples it gets tedious fast. Below is a one-liner using awk that generates a PLINK --keep file automatically from a list of populations. Prepare a list of populations to keep: Create a text file (e.g. pops) in the same directory as your reference .ind and .fam. Put one population label per line: ...