Introducing AdmixPy: f-statistics, qpAdm, and qpWave in Python

I recently published AdmixPy on GitHub, a fast implementation of f-statistics, qpAdm, and qpWave in Python that runs on Linux, macOS, and Windows. It works directly on the new AADR TGENO distribution format and is faster than ADMIXTOOLS 2 on most workloads. Supported input formats: EIGENSTRAT (.geno/.snp/.ind), packed AncestryMap (.geno/.snp/.ind), TGENO (.tgeno/.snp/.ind), and SNP-major PLINK binary (.bed/.bim/.fam).

AdmixPy is implemented in Python and depends only on NumPy, SciPy, and pandas. Installation is handled through pip, and it behaves the same on every platform. ADMIXTOOLS 2, on the other hand, needs a working R setup and a compiler, since remotes::install_github builds the package and several of its dependencies from source. That compilation is slow, and it can break in many ways: an incompatible R version, a dependency that won’t compile, or a failed download from GitHub.

Setup

AdmixPy requires Python 3.10 or newer and runs on Linux, macOS, and Windows. You’ll also need git to clone the repository.

Clone the repository and enter it:

git clone https://github.com/system0x7/admixpy.git
cd admixpy

Create and activate a virtual environment:

python3 -m venv venv
source venv/bin/activate

On Windows, activate with venv\Scripts\activate instead.

Install the package:

python -m pip install --upgrade pip
python -m pip install -e .

Verify the install:

python -c "import admixpy; print(admixpy.__file__)"

You should see a path ending in admixpy/__init__.py (admixpy\__init__.py on Windows). If you get an ImportError instead, check that the virtual environment is activated and that the install step finished without errors.
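If the import fails, a small script like the one below can help narrow down whether the interpreter or a dependency is the problem. It is only a sketch: check_environment is a hypothetical helper, not part of AdmixPy, and it only tests that the packages named in this post can be located, not that their versions are compatible.

```python
import importlib.util
import sys

REQUIRED_PYTHON = (3, 10)  # minimum version stated in this post
DEPENDENCIES = ("numpy", "scipy", "pandas")  # dependencies named in this post


def check_environment():
    """Report whether the interpreter is new enough and each dependency can be found."""
    report = {"python_ok": sys.version_info >= REQUIRED_PYTHON}
    for name in DEPENDENCIES:
        # find_spec returns None when a top-level package cannot be located
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for key, ok in check_environment().items():
        print(f"{key}: {'ok' if ok else 'MISSING'}")
```

Run it with the same interpreter you used for pip (inside the activated venv); a MISSING line points at the package to reinstall.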

Usage

The main functions are:

admixpy.f2(data, pop1, pop2)
admixpy.fst(data, pop1, pop2) 
admixpy.f3(data, pop1, pop2, pop3)
admixpy.f4(data, pop1, pop2, pop3, pop4)
admixpy.qpwave(data, left, right)
admixpy.qpadm(data, target, left, right)

data can be a genotype dataset prefix or precomputed f2 data. For PLINK input, population labels are read from the FID column of the .fam file.
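To see which population labels a PLINK dataset would expose, you can tally the first column of the .fam file yourself. The sketch below is not AdmixPy code: fam_population_counts is a hypothetical helper, and it assumes the standard whitespace-delimited .fam layout (FID, IID, father, mother, sex, phenotype).

```python
from collections import Counter


def fam_population_counts(lines):
    """Count samples per population label, taking the label from column 1 (FID)."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        counts[fields[0]] += 1
    return counts


# Made-up .fam rows for illustration; for a real dataset, pass an open file
# handle instead, e.g. `with open("myprefix.fam") as fh:` (hypothetical path).
example = [
    "Sardinian S1 0 0 1 -9",
    "Sardinian S2 0 0 2 -9",
    "French F1 0 0 1 -9",
]
print(fam_population_counts(example))
```

If the labels you plan to pass as pop1, pop2, and so on do not appear in this tally, the FID column is probably holding family IDs rather than population names.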

Start a Python REPL (after activating the venv) in the directory containing your AADR files and run an f4 statistic:

>>> import admixpy as a
>>> prefix = "v66_compatibility"
>>> a.f4(prefix, "Chimp", "Turkey_N", "Sardinian", "French")

Result:

Loading f2 data for 4 population pairs
Reading TGENO data: 1243531 SNPs, 23259 samples, 131 selected samples, 4 populations
Detecting pseudohaploid samples from first 1000 SNPs
Reading TGENO sample 131/131      
Filtering SNPs
Computing f2 block 708/708: SNP rows 569568-569807      
Computing f4 for 1 population combinations
    pop1      pop2       pop3    pop4       est        se     z    p
0  Chimp  Turkey_N  Sardinian  French -0.001644  0.000109 -15.1  0.0

The significantly negative estimate ($Z = -15.1$) with Chimp as the outgroup indicates that Anatolian Neolithic farmers (Turkey_N) share more drift with Sardinians than with French, reflecting the stronger Neolithic farmer affinity in Sardinia. Follow-up posts will work through f-statistics, qpAdm, and qpWave models on AADR data using AdmixPy.
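For readers new to f4, here is a toy sketch of what the sign means and why f2 data for four population pairs is enough to compute it. It uses the textbook definitions on made-up allele frequencies and is not AdmixPy's implementation: real estimators correct for sample size and get standard errors from a block jackknife, both of which are skipped here. The p-value line assumes a two-sided normal test, which is my assumption about how the p column is defined.

```python
import math


def mean(values):
    return sum(values) / len(values)


def f2(x, y):
    """Naive f2: mean squared allele-frequency difference (no bias correction)."""
    return mean([(xi - yi) ** 2 for xi, yi in zip(x, y)])


def f4(a, b, c, d):
    """Naive f4(A, B; C, D): mean of (a - b) * (c - d) over SNPs."""
    return mean([(ai - bi) * (ci - di) for ai, bi, ci, di in zip(a, b, c, d)])


def f4_from_f2(a, b, c, d):
    """Same quantity via f4 = (f2(A,D) + f2(B,C) - f2(A,C) - f2(B,D)) / 2."""
    return 0.5 * (f2(a, d) + f2(b, c) - f2(a, c) - f2(b, d))


# Made-up allele frequencies at four SNPs, for illustration only.
# B and C drift in the same direction away from A and D.
A = [0.5, 0.5, 0.5, 0.5]
B = [0.7, 0.3, 0.6, 0.4]
C = [0.6, 0.4, 0.55, 0.45]
D = [0.5, 0.5, 0.5, 0.5]

direct = f4(A, B, C, D)
via_f2 = f4_from_f2(A, B, C, D)
print(direct, via_f2)  # both negative: B shares drift with C

# Z-score and (assumed two-sided normal) p-value from the rounded table values.
est, se = -0.001644, 0.000109
z = est / se
p = math.erfc(abs(z) / math.sqrt(2))
print(round(z, 1), p)
```

In the toy data B and C move together, so (a - b) and (c - d) have opposite signs at every SNP and f4 comes out negative, the same pattern as Turkey_N and Sardinian in the result above. Dividing the rounded estimate by the rounded standard error recovers the reported Z to one decimal place, and the corresponding p-value is small enough to display as 0.0.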
