In an earlier post, "PLINK PCA Tutorial: Running PCA in PLINK (Commands + Output)", I showed how to build a subset by hand from the .ind/.fam files. That works, but it gets tedious fast once you want to keep thousands of samples. Below is a short awk command that generates a PLINK --keep file automatically from a list of populations.


  1. Prepare a list of populations to keep:

Create a text file (e.g. pops) in the same directory as your reference .ind and .fam. Put one population label per line:

Norwegian.HO
Sardinian.DG
Han.HO

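You can include as many as you need. Labels are case-sensitive and must match the .ind exactly, and the file needs Unix line endings, because a trailing carriage return would stop a label from matching.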
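To see which labels are available before writing the list, something like the following prints every label in the .ind with its sample count. It is a minimal sketch that assumes the population label sits in col3; pop_counts is an illustrative helper name, not part of PLINK or EIGENSOFT.

```shell
# Count samples per population label (col3 of an EIGENSTRAT .ind).
# Output: label <TAB> count, sorted by label.
pop_counts() {
  awk 'NF {n[$3]++} END {for (p in n) print p "\t" n[p]}' "$1" | sort
}
```

Run it as pop_counts reference.ind and copy the labels you want into pops.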

Assumption: your .ind is a standard EIGENSTRAT file (col1 = IID, col2 = sex, col3 = population), and your PLINK .fam describes the same individuals under the same IIDs (col1 = FID, col2 = IID). This is also the usual outcome when converting with convertf.
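If you are not sure the two files really line up, a quick check along these lines prints any IID that is in the .ind but missing from the .fam. It is a sketch under the column layout stated above; ind_not_in_fam is a made-up helper name.

```shell
# Usage: ind_not_in_fam reference.ind reference.fam
# Prints each IID found in the .ind (col1) that has no matching IID in the .fam (col2).
# No output means every .ind sample was found in the .fam.
ind_not_in_fam() {
  awk '
    FILENAME == ARGV[1] {fam[$2] = 1; next}   # first file read: the .fam
    NF && !($1 in fam)  {print $1}            # second file read: the .ind
  ' "$2" "$1"
}
```

If this prints anything, the awk command in the next step will silently skip those samples, so it is worth fixing the mismatch first.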

  2. Generate a PLINK --keep file with awk:
awk -v OFS='\t' '
  FNR==1 {f++}                             # f counts input files: 1 = pops, 2 = .ind, 3 = .fam
  f==1 {want[$1]=1; next}                  # populations to keep
  f==2 {if ($3 in want) keep[$1]=1; next}  # reference.ind (IID in col1, pop in col3)
  f==3 {if ($2 in keep) print $1,$2}       # reference.fam (FID col1, IID col2)
' pops reference.ind reference.fam > pops.keep

What it does:

  • reads your population list (pops)
  • scans the .ind and marks all matching IIDs
  • emits FID\tIID pairs for those IIDs by looking them up in .fam
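To make the three passes concrete, here is a toy run of the same awk program on three tiny files. The sample IDs, family IDs and file names are made up for illustration only.

```shell
# Toy run: build three small input files in a temporary directory.
tmp=$(mktemp -d)
printf 'Han.HO\nSardinian.DG\n' > "$tmp/pops"
printf 'S1 M Norwegian.HO\nS2 F Han.HO\nS3 M Sardinian.DG\nS4 F French.HO\n' > "$tmp/toy.ind"
printf 'F1 S1 0 0 1 1\nF2 S2 0 0 2 1\nF3 S3 0 0 1 1\nF4 S4 0 0 2 1\n' > "$tmp/toy.fam"

# Same program as above, pointed at the toy files.
awk -v OFS='\t' '
  FNR==1 {f++}
  f==1 {want[$1]=1; next}
  f==2 {if ($3 in want) keep[$1]=1; next}
  f==3 {if ($2 in keep) print $1,$2}
' "$tmp/pops" "$tmp/toy.ind" "$tmp/toy.fam" > "$tmp/toy.keep"

cat "$tmp/toy.keep"
# F2	S2
# F3	S3
```

Only S2 (Han.HO) and S3 (Sardinian.DG) carry a wanted label, so only their FID/IID pairs end up in the keep file.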

pops.keep is now ready for PLINK.
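Before handing the file to PLINK, it can be worth confirming that every label in pops actually matched something. The sketch below reports, for each label, how many .ind samples carry it; pops_matched is a hypothetical helper name, and it assumes the same column layout as the command above.

```shell
# Usage: pops_matched pops reference.ind
# Prints label <TAB> number of .ind samples with that label, in the order of the pops file.
# A count of 0 usually points to a typo or a label that is not in the reference.
pops_matched() {
  awk -v OFS='\t' '
    FILENAME == ARGV[1] {                     # first file: the population list
      if (NF && !($1 in want)) { want[$1] = 0; order[++k] = $1 }
      next
    }
    ($3 in want) { want[$3]++ }               # second file: the .ind
    END { for (i = 1; i <= k; i++) print order[i], want[order[i]] }
  ' "$1" "$2"
}
```

The counts should add up to the number of lines in pops.keep, provided every matched IID is also present in the .fam.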

  3. Run PLINK to keep the subset:
# Example
plink --bfile reference --keep pops.keep --make-bed --out subset

You now have a binary PLINK dataset (subset.bed, subset.bim, subset.fam) containing only the samples whose population label is listed in pops.
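As a last sanity check, you could compare the FID/IID pairs in the new .fam against the keep file. This is only a sketch; fam_matches_keep is an illustrative helper name, and it assumes both files hold FID in col1 and IID in col2.

```shell
# Usage: fam_matches_keep subset.fam pops.keep
# Exit status 0 if both files contain exactly the same FID/IID pairs (order ignored).
fam_matches_keep() {
  fam_ids=$(awk -v OFS='\t' '{print $1, $2}' "$1" | sort)
  keep_ids=$(awk -v OFS='\t' '{print $1, $2}' "$2" | sort)
  [ "$fam_ids" = "$keep_ids" ]
}
```

For example: fam_matches_keep subset.fam pops.keep && echo "subset matches pops.keep"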