Pairwise f2 Statistics and FST in AdmixPy

This post covers how to run pairwise f2-statistics and FST in AdmixPy. They are simple to interpret at the pairwise level, but they are also useful computationally because once f2 blocks have been computed and cached, many downstream analyses can reuse them without repeatedly reading and converting the original genotype data.

What does f2 measure?

The f2-statistic measures allele-frequency differentiation between two populations. For two populations $A$ and $B$ , it is written as:

f_2(A, B) = E[(p_A - p_B)^2]

Here, $p_A$ and $p_B$ are the allele frequencies of populations $A$ and $B$ at a SNP, and the expectation is taken across SNPs.

In words: f2 is the average squared difference in allele frequency between two populations. If two populations have very similar allele frequencies, the value will be low. If they have accumulated more drift or are more differentiated from each other, the value will be higher.

This makes f2 useful as a compact distance-like summary of population differentiation. A higher value means more allele-frequency differentiation, but it does not by itself tell you when populations split, whether gene flow occurred, or what the exact demographic history was.

Bias-corrected f2

A basic calculation of f2 would simply average the squared allele-frequency differences:

(p_A - p_B)^2

But allele frequencies are estimated from finite samples. If a population has only one or a few observed alleles at a SNP, the observed allele frequency can differ from the true population frequency just because of sampling noise. Without correction, this sampling noise inflates f2.

AdmixPy uses an allele-count-aware finite-sample correction:

\hat{f}_2(A, B) = E\left[(p_A - p_B)^2 - \frac{p_A(1 - p_A)} {\operatorname{max}(1, c_A - 1)} - \frac{p_B(1 - p_B)} {\operatorname{max}(1, c_B - 1)}\right]

Here, $c_A$ and $c_B$ are the observed allele counts for populations $A$ and $B$ , not the number of individuals.

A diploid individual contributes two alleles at a SNP, while a pseudo-haploid individual contributes only one. The correction therefore has to depend on the number of observed alleles, not just on how many sample labels are assigned to a population.

Diploid and pseudo-haploid samples

Because the finite-sample correction depends on observed allele counts, ploidy matters most when one of the populations is represented by only one sample, or by a very small number of samples.

A diploid individual contributes two alleles per SNP:

c = 2

Its observed allele frequency can be:

p \in \{0, 0.5, 1\}

A pseudo-haploid individual contributes only one allele per SNP:

c = 1

Its observed allele frequency can only be:

p \in \{0, 1\}

For larger population groups, the effect is reduced because the allele-frequency estimate is averaged across more samples. For example, a pseudo-haploid population represented by many individuals still has many observed alleles in total, even though each individual contributes only one allele per SNP. In that case, pseudo-haploid calling is still accounted for, but the estimate is not dominated by the uncertainty of a single allele.

f2 and FST are population-level summaries. They can be applied to single samples or very small groups, but those results should be interpreted more cautiously because the allele-frequency estimates are noisier. This also can be tested with a leave-one-out comparison: remove a single modern sample from its population group, treat that sample as its own population, and compare it against the remaining population average. This gives a useful sense of how much the estimate changes when a population-level comparison is reduced to a single individual.

Running pairwise f2 in AdmixPy

For AdmixPy setup instructions, refer to the previous post: Introducing AdmixPy: f-statistics, qpAdm, and qpWave in Python. After activating the virtual environment, start a Python REPL by entering python in the terminal.

Then import AdmixPy:

import admixpy as ap

Since we imported admixpy as ap, pairwise f2-statistics can now be run with ap.f2().

For this example, I will compare a Late Neolithic population from Shandong against a set of modern East Asian populations:

# Adjust to your dataset prefix or path
prefix = "ho_v66"

pop2 = ["Han", "Japanese", "Korean", "Mongol", "Tibetan"]

Now run:

ap.f2(prefix, pop1="China_Shandong_Dinggong_LN", pop2=pop2)

The result:


                         pop1      pop2         est           se
0  China_Shandong_Dinggong_LN       Han  0.00243779  0.000126418
1  China_Shandong_Dinggong_LN  Japanese   0.0043627  0.000137583
2  China_Shandong_Dinggong_LN    Korean  0.00483497   0.00019115
3  China_Shandong_Dinggong_LN    Mongol  0.00484313  0.000143311
4  China_Shandong_Dinggong_LN   Tibetan  0.00401899  0.000133827

The est column is the block-weighted f2 estimate. The se column is the block-jackknife standard error.

The interpretation is simple: lower f2 means the two populations are more similar in allele-frequency space, while higher f2 means they are more differentiated. In this example, the Han average has the lowest f2 value with China_Shandong_Dinggong_LN, suggesting that among the listed comparison populations, Han are closest to this Shandong Late Neolithic group in allele-frequency space.

This should not be interpreted as an ancestry model. The result only says that, among this set of populations, Han show the lowest pairwise allele-frequency differentiation from China_Shandong_Dinggong_LN.

Ancient DNA samples are often represented by pseudo-haploid genotype calls. For single-sample “populations”, this can strongly affect the f2 estimate because the finite-sample correction depends on observed allele counts. In this case, however, China_Shandong_Dinggong_LN contains many samples in the dataset, so the effect of pseudo-haploid calling is partly averaged across individuals rather than being dominated by one sample.

Blocks and standard errors

Like other ADMIXTOOLS-style statistics, AdmixPy estimates uncertainty using SNP blocks. SNPs are grouped by chromosome and genetic position. AdmixPy computes f2 separately for each block, then combines the blocks into the final estimate.

The standard error is calculated using a leave-one-block-out jackknife. In each jackknife replicate, one block is left out and the statistic is recomputed from the remaining blocks. The variation across these leave-one-block-out estimates gives the standard error.

If standard errors were computed as if every SNP were independent, they would usually be too small. Block jackknifing gives a more realistic estimate of uncertainty by accounting for correlation among nearby SNPs.

Using get_f2() and the f2 cache

The main helper behind the higher-level APIs is get_f2(). It can accept genotype data, an in-memory F2Blocks object, or an on-disk f2 cache.

For example, you can compute f2 blocks from a TGENO prefix:

blocks = ap.get_f2(prefix, pops=["Chimp", "Sardinian", "Orcadian", "Norwegian", "Mongol"])

This blocks object can now be passed into ap.f4() instead of the dataset prefix. Since the block-level f2 values have already been computed, the f4-statistic will run almost instantly.

If you want to reuse these blocks later, you can write them to disk with write_f2():

ap.write_f2(blocks, "f2_cache")

This creates a folder called f2_cache. Inside it, AdmixPy stores the block-level pairwise estimates and block lengths. Then you can load the folder again with get_f2():

blocks = ap.get_f2("f2_cache")

The loaded blocks object can be passed to functions such as ap.f2(), ap.f3(), or ap.f4(), just like the original in-memory object.

Cached f2 blocks are tied to the SNP overlap of the populations used when the cache was created. Ancient DNA samples often have lower coverage and more missing data than modern samples, so adding them later may not give the same SNP set as a cache built with them from the start.

Running FST in AdmixPy

AdmixPy also provides pairwise Hudson-style FST through ap.fst():

ap.fst(prefix, pop1="China_Shandong_Dinggong_LN", pop2=pop2)

Result:

                         pop1      pop2        est           se
0  China_Shandong_Dinggong_LN       Han  0.0095689  0.000498521
1  China_Shandong_Dinggong_LN  Japanese  0.0171573  0.000536695
2  China_Shandong_Dinggong_LN    Korean  0.0188877  0.000743521
3  China_Shandong_Dinggong_LN    Mongol  0.0184716  0.000542773
4  China_Shandong_Dinggong_LN   Tibetan  0.0155903  0.000517366

FST is closely related to f2, but it is normalized by the amount of genetic variation within the compared populations. F2 measures the raw squared allele-frequency difference, while FST asks how large that difference is relative to the total variation.

F2 and FST will often give a similar ordering of comparison populations. If one population has the lowest f2 to a target, it will often also have one of the lowest FST values. The difference is mainly scale.

Summary

At the pairwise level, f2 and FST summarize how differentiated two populations are in allele-frequency space. Lower estimates generally mean more similar allele frequencies; higher estimates mean more differentiation.

AdmixPy uses a finite-sample correction based on observed allele counts, which is relevant for single-sample populations and pseudo-haploid ancient DNA. It also computes statistics by genomic blocks and uses a block jackknife for standard errors.

A practical advantage is caching. By storing block-level f2 values in an F2Blocks object or an on-disk f2 cache, AdmixPy makes repeated f2, f3, f4, qpWave, and qpAdm analyses much faster. Instead of repeatedly reading and converting genotype data, you can compute the shared foundation once and reuse it across many tests.

For exploratory population-genetic analysis, this makes f2 both a useful statistic and a useful workflow primitive.

What does f2 measure?#

Bias-corrected f2#

Diploid and pseudo-haploid samples#

Running pairwise f2 in AdmixPy#

Blocks and standard errors#

Using get_f2() and the f2 cache#

Running FST in AdmixPy#

Summary#