SmartPCA Tutorial: How to Run PCA on Genetic Data (EIGENSTRAT)

This post is a continuation of the previous one, where I demonstrated how to perform PCA with PLINK. While PLINK’s PCA is great for quick, exploratory analysis, smartpca (part of the EIGENSOFT toolset) is more commonly used in published genetic studies.

Smartpca needs to be compiled on Linux or macOS. I covered how to install and prepare the toolchain on Linux in this earlier post: From EIGENSTRAT to PACKEDPED.

As before, I’ll use a small subset. The focus here is on the technical process, not on interpreting the results. One key difference in this post is that I’ll perform Linkage Disequilibrium (LD) pruning, which helps reduce SNP redundancy and improves the detection of population structure in PCA.

LD Pruning

# Window size: 50 SNPs
# Step size: 5 SNPs
# LD threshold: r² > 0.2 will be pruned
plink --bfile input --indep-pairwise 50 5 0.2

# Extract the pruned SNPs into a new dataset
plink --bfile input --extract plink.prune.in --make-bed --out final

Note: If you’re working with low-coverage ancient samples, applying LD pruning directly to a mixed dataset (modern + ancient) may introduce bias, since SNP retention will favor samples with higher coverage. A better approach:

Prune only the modern dataset.
Use the same pruned SNP list (plink.prune.in) to extract SNPs from the ancient dataset.
Merge both datasets.

Here’s how you can do that:

# Step 1: Prune modern samples
plink --bfile modern --indep-pairwise 50 5 0.2 

# Step 2: Create pruned datasets for both groups
plink --bfile modern --extract plink.prune.in --make-bed --out pruned_modern
plink --bfile ancient --extract plink.prune.in --make-bed --out ancient_extracted

# Step 3: Merge datasets
plink --bfile pruned_modern --bmerge ancient_extracted --make-bed --out final

Converting the Dataset to EIGENSTRAT Format

Smartpca requires data in EIGENSTRAT format. Assuming your merged/pruned data is in PACKEDPED (.bed/.bim/.fam) format, you can use convertf (part of EIGENSOFT) to convert it.

Create a parameter file named convert (or a name of your choice, adjust the parameters accordingly) with the following content:

genotypename:    final.bed 
snpname:         final.bim 
indivname:       final.fam 
outputformat:    EIGENSTRAT 
genotypeoutname: pca.geno 
snpoutname:      pca.snp 
indivoutname:    pca.ind

Now run:

convertf -p convert

This will generate the three required input files for smartpca: pca.geno, pca.snp, and pca.ind.

Creating the Smartpca Parameter File

Like most Reich Lab tools, smartpca uses a parameter file to specify input and output options. Here’s a minimal example:

genotypename:     pca.geno
snpname:          pca.snp
indivname:        pca.ind
evecoutname:      pca.evec
evaloutname:      pca.eval

altnormstyle:     NO 
numoutevec:       10
numoutliter:      0
lsqproject:       YES
numthreads:       5

lsqproject: YES enables projection mode, meaning projected samples don’t influence the PCA axes. In this example, all samples are high-coverage modern individuals, so projection isn’t technically needed, but I demonstrate it here for completeness. Save the parameter file with a name like pca_param (no extension).

To project samples, you’ll need to modify the .ind file of your EIGENSTRAT dataset. Change the third column from Case to P:

      ...
      2032:TLA018.HO M       Case
      2033:TLA019.HO M       P
      2034:TLA020.HO M       Case
      ...

Running Smartpca

Now we run smartpca with:

smartpca -p pca_param

This generates two files: pca.eigenval (lists how much variance each eigenvector explains) and pca.evec (contains sample IDs and principal component values).

Preparing for Plotting

To make the output compatible with the attached script from the below, convert the .evec file to CSV:

awk 'NR>1 { $NF=""; sub(/[ \t]+$/, ""); gsub(/[ \t]+/, ","); print }' pca.evec > smartpca.csv

At this point, you have two options:

Use the Python plotting script from my previous PLINK PCA tutorial
Use Vahaduo for a browser-based, interactive visualization (explained below)

Option 1: Plotting with Python

If you’ve already set up Python and created the plotting script from the PLINK tutorial, you can use it directly. You’ll need to:

Create a labels file from your .ind file:

awk 'NR==FNR {map[$1]=$3; next} {print map[$2]}' pca.ind final.fam > labels

Run the plotting script with smartpca.csv and labels:

This is the resulting plot for an European subset:

SmartPCA plot showing genetic clustering of Sardinian, Greek, Northern and Eastern European populations on PC1 and PC2

Option 2: Plotting with Vahaduo (No Coding Required)

Vahaduo is a web-based tool that can generate interactive 2D and 3D PCA plots without any programming. However, it requires population labels to be embedded directly in the CSV file.

Preparing the file for Vahaduo:

Your current smartpca.csv file looks like this:

...
11788:HGDP00553.HO,0.0136,0.0104,0.0353,-0.0510,...
11789:HGDP00554.HO,0.0136,0.0107,0.0334,-0.0494,...
11790:HGDP00555.HO,0.0137,0.0104,0.0346,-0.0511,...
...

Vahaduo groups samples by the text before the : character. We need to replace the numerical FIDs with population labels. Here’s an awk command that does this automatically (adjust filenames if needed):

# When I refer to an .ind file, I always mean the original version from the dataset as it was downloaded (aside from the prefix used here)

awk -F',' -v OFS=',' '
  NR == FNR {
    line = $0
    gsub(/^[[:space:]]+/, "", line)
    if (line == "") next

    split(line, a, /[[:space:]]+/)
    if (a[1] == "" || a[3] == "") next

    label = a[3]
    sub(/\.(AG|DG|HO|SG)$/, "", label)
    pop[a[1]] = label
    next
  }

  {
    if (split($1, parts, ":") >= 2 && (parts[2] in pop))
      $1 = pop[parts[2]] ":" parts[2]
    print
  }
' data.ind smartpca.csv > smartpca_with_labels.csv

This transforms the file to:

...
Papuan:HGDP00553.HO,0.0136,0.0104,0.0353,-0.0510,...
Papuan:HGDP00554.HO,0.0136,0.0107,0.0334,-0.0494,...
Papuan:HGDP00555.HO,0.0137,0.0104,0.0346,-0.0511,...
...

Using Vahaduo:

Go to https://vahaduo.github.io/custompca/
Navigate to the PCA tool
Paste the contents of smartpca_with_labels.csv into the “PCA Data” section
Click “Plot PCA” in the “PCA plot” section to generate the visualization

Advantages of Vahaduo:

No Python set-up required
Interactive plots (zoom, pan, rotate in 3D)
Hover over points to see sample details
Quick exploratory analysis
Saving plots

Vahaduo can also run regression to estimate a target sample as a mix of chosen source groups, and the CSV produced here works with that feature. The main consideration is that PCA compresses many SNPs into a relatively small set of principal components, so the model operates on a simplified and linear view of the underlying data.

LD Pruning#

Converting the Dataset to EIGENSTRAT Format#

Creating the Smartpca Parameter File#

Running Smartpca#

Preparing for Plotting#

Option 1: Plotting with Python#

Option 2: Plotting with Vahaduo (No Coding Required)#