In an earlier post, PLINK PCA Tutorial: Running PCA in PLINK (Commands + Output), I showed the manual way to build a subset from the .ind/.fam. That works, but if you want to keep thousands of samples it gets tedious fast. Below is a one-liner using awk that generates a PLINK --keep file automatically from a list of populations.
- Prepare a list of populations to keep:
Create a text file (e.g. pops) in the same directory as your reference .ind and .fam. Put one population label per line:
Norwegian.HO
Sardinian.DG
Han.HO
You can include as many as you need.
Assumption: your .ind is standard EIGENSTRAT (col1 = IID and col3 = population), and your PLINK .fam corresponds to the same individuals (col1 = FID, col2 = IID). This is also the usual outcome when converting with convertf.
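If you are not sure the assumption holds for your files, a quick check is to count IIDs that appear in only one of the two. The snippet below is a minimal sketch: check_ids is a hypothetical helper name, and the toy files with made-up sample IDs only stand in for reference.ind and reference.fam.

```shell
# Hypothetical helper: count IIDs found in only one of the two files.
# Assumes IID is col1 of the .ind and col2 of the .fam.
check_ids() {
  awk '
    FNR==NR {ind[$1]=1; next}                 # first file: .ind
    {fam[$2]=1; if (!($2 in ind)) n++}        # second file: .fam
    END {for (i in ind) if (!(i in fam)) n++; print n+0}
  ' "$1" "$2"
}

# Toy files with made-up sample IDs, standing in for reference.ind/.fam.
tmp=$(mktemp -d)
printf '%s\n' 'S1 M Norwegian.HO' 'S2 F Sardinian.DG' 'S3 M Han.HO' > "$tmp/reference.ind"
printf '%s\n' '1 S1 0 0 1 1' '2 S2 0 0 2 1' '3 S3 0 0 1 1' > "$tmp/reference.fam"

mismatches=$(check_ids "$tmp/reference.ind" "$tmp/reference.fam")
echo "IIDs not shared between .ind and .fam: $mismatches"
```

On your own data, call it as check_ids reference.ind reference.fam. Anything other than 0 means the IIDs in the two files do not line up, and the keep file would silently miss those samples.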
- Generate a PLINK --keep file with awk:
awk -v OFS='\t' '
FNR==1 {f++}
f==1 {want[$1]=1; next} # populations to keep
f==2 {if ($3 in want) keep[$1]=1; next} # reference.ind (IID in col1, pop in col3)
f==3 {if ($2 in keep) print $1,$2} # reference.fam (FID col1, IID col2)
' pops reference.ind reference.fam > pops.keep
What it does:
- reads your population list (pops)
- scans the .ind and marks all matching IIDs
- emits FID\tIID pairs for those IIDs by looking them up in .fam
pops.keep is now ready for PLINK.
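One thing to watch: a misspelled label in pops does not raise an error, it simply matches nothing. A small sketch for catching that is to print how many .ind rows each requested label matched. pop_counts is a hypothetical helper name, and the toy data (including the deliberately misspelled Sardinan.DG) is made up for illustration.

```shell
# Hypothetical helper: for each label in the population list, print how many
# .ind rows carry that label in col3. A count of 0 usually means a typo.
pop_counts() {
  awk '
    FNR==NR {want[$1]=1; next}      # first file: population list
    ($3 in want) {n[$3]++}          # second file: .ind, population in col3
    END {for (p in want) print p, n[p]+0}
  ' "$1" "$2" | sort
}

# Toy inputs with made-up sample IDs; "Sardinan.DG" is a deliberate typo.
tmp=$(mktemp -d)
printf '%s\n' 'Norwegian.HO' 'Sardinan.DG' 'Han.HO' > "$tmp/pops"
printf '%s\n' 'S1 M Norwegian.HO' 'S2 F Sardinian.DG' 'S3 M Han.HO' > "$tmp/reference.ind"

pop_counts "$tmp/pops" "$tmp/reference.ind"
```

With your real files this would be pop_counts pops reference.ind. Any label reported with 0 is worth checking against the spelling used in the .ind before running PLINK.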
- Run PLINK to keep the subset:
# Example
plink --bfile reference --keep pops.keep --make-bed --out subset
You now have a binary PLINK dataset (subset.bed, subset.bim, subset.fam) containing all samples whose population label is listed in pops.
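To confirm that PLINK kept exactly the samples you asked for, you can compare the FID/IID pairs in the keep file with the first two columns of the new .fam. This is only a sketch: same_samples is a hypothetical helper name, and the toy files below stand in for pops.keep and subset.fam so the snippet runs without PLINK installed.

```shell
# Hypothetical helper: compare FID/IID pairs in a keep file with a .fam.
# Assumes both files list FID in col1 and IID in col2, with no duplicate pairs.
same_samples() {
  awk '
    FNR==NR {keep[$1 FS $2]=1; nk++; next}          # first file: keep list
    {nf++; if (!(($1 FS $2) in keep)) extra++}      # second file: .fam
    END {
      if (nk == nf && (extra+0) == 0) print "OK: " nf " samples"
      else print "MISMATCH: keep=" (nk+0) " fam=" (nf+0) " unexpected=" (extra+0)
    }
  ' "$1" "$2"
}

# Toy files with made-up IDs, standing in for pops.keep and subset.fam.
tmp=$(mktemp -d)
printf '1\tS1\n3\tS3\n' > "$tmp/pops.keep"
printf '%s\n' '1 S1 0 0 1 1' '3 S3 0 0 1 1' > "$tmp/subset.fam"

same_samples "$tmp/pops.keep" "$tmp/subset.fam"
```

On real output the call would be same_samples pops.keep subset.fam. A MISMATCH line means the two files disagree on either the number of samples or their IDs.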