How to Subset Genetic Samples by Population Labels with awk (Create PLINK --keep file)

In an earlier post, PLINK PCA Tutorial: Running PCA in PLINK (Commands + Output), I showed the manual way to build a subset from the .ind/.fam files. That works, but it gets tedious fast once you want to keep thousands of samples. Below is an awk one-liner that generates a PLINK --keep file automatically from a list of population labels.

Prepare a list of populations to keep:

Create a text file (e.g. pops) in the same directory as your reference .ind and .fam. Put one population label per line:

Norwegian.HO
Sardinian.DG
Han.HO

You can include as many as you need.

Assumption: your .ind is a standard EIGENSTRAT file (col1 = IID, col2 = sex, col3 = population), and your PLINK .fam describes the same individuals with the same IIDs (col1 = FID, col2 = IID). Labels are matched exactly, including case and suffixes such as .HO or .DG. If you are working with the AADR, use the original .ind as downloaded.
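Because the labels in pops have to match column 3 of the .ind exactly, it can help to list the labels that actually occur before writing the file. The following is a minimal sketch on a toy .ind; the demo_counts directory and all sample IDs are made up for illustration, and on real data you would point awk at your own reference.ind instead.

```shell
# Toy stand-in for reference.ind (hypothetical IDs): IID, sex, population
mkdir -p demo_counts
cat > demo_counts/reference.ind <<'EOF'
NOR001 M Norwegian.HO
NOR002 F Norwegian.HO
SAR001 M Sardinian.DG
HAN001 F Han.HO
YRI001 M Yoruba.HO
EOF

# Count samples per population label (column 3), then sort by label
awk '{n[$3]++} END {for (p in n) print p, n[p]}' demo_counts/reference.ind \
  | sort > demo_counts/pop_counts.txt

cat demo_counts/pop_counts.txt
```

On the toy file this lists four labels, with Norwegian.HO counted twice. Copying labels from this listing into pops avoids spelling mismatches.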

Generate a PLINK --keep file with awk:

awk -v OFS='\t' '
  FNR==1 {f++}
  f==1 {want[$1]=1; next}                  # pops: population labels to keep
  f==2 {if ($3 in want) keep[$1]=1; next}  # reference.ind (IID in col1, pop in col3)
  f==3 {if ($2 in keep) print $1,$2}       # reference.fam (FID col1, IID col2)
' pops reference.ind reference.fam > pops.keep

What it does:

reads your population list (pops)
scans the .ind and marks all matching IIDs
emits FID\tIID pairs for those IIDs by looking them up in .fam
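To see these three steps in action, here is a small sketch that runs the same awk program on toy inputs. The demo_keep directory, the FIDs and the sample IDs are hypothetical; only the column layouts follow the assumptions stated above.

```shell
# Toy inputs; all IDs below are hypothetical
mkdir -p demo_keep
printf '%s\n' Norwegian.HO Han.HO > demo_keep/pops
cat > demo_keep/reference.ind <<'EOF'
NOR001 M Norwegian.HO
NOR002 F Norwegian.HO
SAR001 M Sardinian.DG
HAN001 F Han.HO
EOF
cat > demo_keep/reference.fam <<'EOF'
1 NOR001 0 0 1 1
2 NOR002 0 0 2 1
3 SAR001 0 0 1 1
4 HAN001 0 0 2 1
EOF

# Same awk program as above, pointed at the toy files
awk -v OFS='\t' '
  FNR==1 {f++}
  f==1 {want[$1]=1; next}
  f==2 {if ($3 in want) keep[$1]=1; next}
  f==3 {if ($2 in keep) print $1,$2}
' demo_keep/pops demo_keep/reference.ind demo_keep/reference.fam > demo_keep/pops.keep

cat demo_keep/pops.keep
```

The resulting file holds three tab-separated FID/IID pairs (NOR001, NOR002 and HAN001) in .fam order; SAR001 is dropped because Sardinian.DG is not in the toy pops file.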

pops.keep is now ready for PLINK.
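Before handing the file to PLINK, it may be worth checking that every label in pops matched at least one sample, since a misspelled label (or Windows line endings in pops) silently yields no matches. A minimal sketch of such a check, again on toy files with hypothetical IDs in a made-up demo_check directory:

```shell
# Toy inputs (hypothetical IDs); "Norwegain.HO" is a deliberate typo
mkdir -p demo_check
printf '%s\n' Norwegain.HO Han.HO > demo_check/pops
printf '%s\n' 'NOR001 M Norwegian.HO' 'HAN001 F Han.HO' > demo_check/reference.ind

# Print every label in pops that never occurs in column 3 of the .ind
awk '
  NR==FNR {seen[$3]=1; next}       # first file (.ind): record labels present
  NF && !($1 in seen) {print $1}   # second file (pops): labels with no match
' demo_check/reference.ind demo_check/pops > demo_check/unmatched.txt

cat demo_check/unmatched.txt
```

Here the check reports Norwegain.HO. On real data, an empty result means every label in pops was found in the .ind.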

Run PLINK to keep the subset:

# Example
plink --bfile reference --keep pops.keep --make-bed --out subset

You now have a binary PLINK dataset (subset.bed, subset.bim, subset.fam) containing only the samples whose population label is listed in pops.
