In this post, I’ll cover how to build a transparent, deterministic admixture modeler from scratch in R. Tools like qpAdm are the industry standard, but they can often feel like a “black box”; what follows is a transparent way to estimate ancestry proportions. This script is not a replacement for qpAdm but a complement: it uses the same mathematical foundation (f4-statistic regression) to rapidly test and rank thousands of potential source combinations.
Theoretical Foundation
The methodology is based on a fundamental principle: if a target population is truly an admixture of Source populations, then its genetic drift relative to any Reference populations should be mathematically predictable as a linear combination of the sources’ drift patterns.
The f4-statistic measures allele frequency correlations between populations. Specifically, it quantifies how much genetic drift is shared between different population pairs. The script calculates f4 in the form f4(Outgroup, Test; Ref₁, Ref₂), which measures the correlation between the test population (either the target or a source candidate) and Ref₁ relative to their relationships with both the Outgroup and Ref₂.
The Regression
The core mathematical operation is setting up a system of equations where: Target_vec ≈ Source_mat × w, where each column of Source_mat is one source’s f4-vector (the matrix product already sums the weighted columns, so no separate Σ is needed).
In words: The Target’s f4-vector (its genetic relationships to all reference populations) should approximately equal the weighted sum of each Source’s f4-vector, where the weights (w) represent ancestry proportions.
The Constraints (Quadratic Programming): Standard linear regression can return weights that make no biological sense (negative proportions, or proportions that don’t sum to 100%), so this script uses Quadratic Programming to enforce two constraints:
- Non-negativity: You cannot inherit negative DNA, so every weight must satisfy w ≥ 0
- Sum to 1: All ancestry proportions must sum to exactly 100% (Σw = 1)
The solver finds the optimal weights that minimize the squared Euclidean distance between the Target’s observed f4-vector and the model’s predicted vector.
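As a toy illustration of how this works (synthetic numbers, not real f4 values), here is a minimal sketch of the constrained solve using quadprog’s solve.QP. The “target” is constructed as a known 60/40 blend of two source vectors, so the solver should recover those proportions:

```r
library(quadprog)

# Toy example: columns of src are two "source" f4-vectors (3 reference axes)
src <- cbind(c(1, 0, 2), c(0, 1, 1))
tgt <- src %*% c(0.6, 0.4)              # target built as a known 60/40 mixture

n    <- ncol(src)
Dmat <- crossprod(src) + diag(1e-10, n) # quadratic term, nudged positive definite
dvec <- crossprod(src, tgt)
Amat <- cbind(rep(1, n), diag(n))       # first column: sum-to-1; rest: w >= 0
bvec <- c(1, rep(0, n))

fit <- solve.QP(Dmat, dvec, Amat, bvec, meq = 1)
round(fit$solution, 3)                  # recovers 0.6 and 0.4
```

This is exactly the setup the full script wraps in a function below: minimizing the squared distance between the target vector and the weighted source combination, subject to the two biological constraints.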
Why Euclidean Distance Is Valid Here (vs. PCA)
This approach is fundamentally different from PCA-based distance minimization. In PCA models, Euclidean distance is measured in a dimension-reduced space where spatial proximity is conflated with ancestry.
In contrast, f4-statistics directly measure shared genetic drift patterns, the actual phylogenetic signal of common ancestry. When we minimize Euclidean distance between f4-vectors, we’re not working in a compressed linear projection; we’re measuring how well the model reproduces the Target’s true drift relationships across multiple independent reference axes. Each dimension of the f4-vector represents an independent test of shared drift with a reference population. Minimizing this distance tests whether the proposed sources can mathematically explain the Target’s observed drift patterns, which is exactly what admixture does genetically.
The Role of References and Outgroups
The Outgroup: acts as the root, polarizing the allele frequencies so we know which direction “drift” is happening.
The References: act as the coordinate system or scaffolding. The model measures the Target’s genetic drift relative to each Reference population independently, creating a multi-dimensional vector of drift relationships. More References provide more independent axes along which to test the proposed admixture model.
Running the Script in Practice
To run the script you’ll need three libraries: admixtools (the R implementation of ADMIXTOOLS, often called “ADMIXTOOLS 2”), tidyverse, and quadprog.
Installation:
If you haven’t installed these packages yet, run the following commands in R:
# Install admixtools2 from GitHub
install.packages("remotes")
remotes::install_github("uqrmaie1/admixtools")
# Install tidyverse and quadprog from CRAN
install.packages("tidyverse")
install.packages("quadprog")
Loading the Libraries:
Once installed, load them at the start of your R session:
library(admixtools)
library(tidyverse)
library(quadprog)
Now set up the data prefix, target sample, outgroup, and potential sources (contributors to the ancestry of the target). I recommend keeping the variable names exactly as shown: if you follow this setup, you can copy and paste the rest of the script into R without any variable-name adjustments. Here I’ll model the Yamnaya EBA population:
# Specify your target population
target_name <- "Russia_Samara_EBA_Yamnaya.AG"
# Specify the outgroup (deeply diverged population)
outgroup <- "Mbuti.DG"
# Set your EIGENSTRAT data prefix
# This should point to your .geno, .snp, and .ind files
# For example, if your files are named data.geno, data.snp, data.ind:
data <- "data"
Next, declare a vector of potential sources you want tested for the target. Here I use a small source pool:
potential_sources <- c(
  "Turkey_Southeast_Cayonu_PPN.SG",
  "Russia_Samara_HG.AG",
  "Sweden_Mesolithic_HG.AG",
  "Turkey_Marmara_Barcin_N.AG",
  "Georgia_Kotias_Mesolithic.SG",
  "Turkey_Central_Boncuklu_PPN.AG",
  "Iran_HajjiFiruz_N.AG",
  "Turkmenistan_C_TepeAnau.AG"
)
Selecting References: The Critical Step
Now comes the most important part: selecting the right reference populations. Omitting one crucial “anchor” can result in a false positive model, though this vulnerability is no different in qpAdm. The key principle: f4-statistics should be informative and the references should be differentially related to the sources (including the target). Adding five different references that deliver essentially the same genetic information has no value and in fact only introduces noise.
For a deeper understanding of what makes f4-statistics informative, refer to my post: Running f4-Statistics with Admixtools in R.
To see concrete examples of how missing critical references (“anchors”) affect f4-based models, see my qpAdm tutorial: Running qpAdm in R: Testing and Interpreting Ancestry Models.
For modeling the EBA Yamnaya samples from Samara, I selected the following references:
references <- c(
  "Switzerland_Bichon_Epipaleolithic.SG",
  "Georgia_Satsurblia_LateUP.SG",
  "Georgia_Dzudzuana_UP.AG",
  "Russia_MA1_UP.SG",
  "Iran_GanjDareh_N.AG",
  "Turkey_Central_Pinarbasi_Epipaleolithic.AG"
)
These references were chosen to capture different axes of genetic variation relevant to the West Eurasian gene pool:
- Western Hunter-Gatherer (Bichon)
- Early Caucasus Hunter-Gatherer (Satsurblia)
- Upper Paleolithic Caucasus (Dzudzuana)
- Ancient North Eurasian (MA1)
- Early Iranian Neolithic (Ganj Dareh)
- Anatolian Epipaleolithic (Pınarbaşı)
Together, they provide independent tests of the target’s drift relationships across distinct ancestral drifts.
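Before running the analysis, it can help to confirm that every population label you declared actually exists in your dataset. This is a hedged sketch, assuming the EIGENSTRAT prefix "data" from above; in the .ind file, the third column holds the group labels that admixtools matches on:

```r
# Optional sanity check: verify all declared populations exist in the .ind file
ind <- read.table(paste0(data, ".ind"), header = FALSE,
                  col.names = c("sample", "sex", "group"))
needed  <- c(target_name, outgroup, potential_sources, references)
missing <- setdiff(needed, unique(ind$group))
if (length(missing) > 0) print(missing) else cat("All populations found.\n")
```

A typo in a single population name would otherwise only surface as an error deep inside the f4 calculation.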
Running the Analysis
Now that all variables are declared, if you’ve used the exact variable names shown above, you can copy the complete script below and paste it directly into R:
# --- 2. INTERNAL SOLVER FUNCTIONS ---
solve_weights <- function(target_vec, source_mat) {
  n <- ncol(source_mat)
  # Dmat: Quadratic part (covariance). Add tiny noise to ensure positive definite.
  Dmat <- crossprod(source_mat) + diag(1e-10, n)
  dvec <- crossprod(source_mat, target_vec)
  # Constraints: Sum to 1 (meq=1) and each weight >= 0
  Amat <- cbind(rep(1, n), diag(n))
  bvec <- c(1, rep(0, n))
  fit <- try(solve.QP(Dmat, dvec, Amat, bvec, meq = 1), silent = TRUE)
  if (inherits(fit, "try-error")) return(NULL)
  return(fit$solution)
}
# --- 3. DATA PREPARATION ---
cat("Calculating f4 basis...\n")
all_pops <- c(target_name, potential_sources)
f4_results <- f4(data,
                 pop1 = outgroup,
                 pop2 = all_pops,
                 pop3 = references[1],
                 pop4 = references[-1])
# Pivot and maintain both estimate (est) and standard error (se)
f4_matrix_wide <- f4_results %>%
  dplyr::select(pop2, pop4, est, se) %>%
  tidyr::pivot_wider(names_from = pop2, values_from = c(est, se))
# Extract Target vector and its Standard Errors
y <- as.matrix(f4_matrix_wide[[paste0("est_", target_name)]])
target_se <- as.numeric(f4_matrix_wide[[paste0("se_", target_name)]])
# Extract Source Matrix
x_full <- f4_matrix_wide %>%
  dplyr::select(starts_with("est_")) %>%
  dplyr::select(-all_of(paste0("est_", target_name)))
colnames(x_full) <- gsub("est_", "", colnames(x_full))
x_full <- as.matrix(x_full)
# --- 4. COMBINATORIAL LOOP ---
results_list <- list()
# Test combinations of size 2, 3, and 4
for (m in 2:4) {
  cat("Testing all", m, "source combinations...\n")
  combos <- combn(colnames(x_full), m)
  for (i in 1:ncol(combos)) {
    curr_names <- combos[, i]
    x_sub <- x_full[, curr_names, drop = FALSE]
    w <- solve_weights(y, x_sub)
    if (!is.null(w)) {
      # Math Alignment
      w_mat <- matrix(w, ncol = 1)
      y_pred <- x_sub %*% w_mat
      residuals <- y - y_pred
      # Statistics
      chi_sq <- sum((residuals / target_se)^2)
      df_val <- nrow(x_sub) - length(w)
      p_val <- if (df_val > 0) pchisq(chi_sq, df_val, lower.tail = FALSE) else NA
      rss <- sum(residuals^2)
      results_list[[length(results_list) + 1]] <- data.frame(
        n_sources = length(w),
        model = paste(curr_names, collapse = " + "),
        rss = rss,
        p_value = p_val,
        weights = paste(round(w * 100, 1), collapse = " / ")
      )
    }
  }
}
# --- 5. OUTPUT ---
final_table <- bind_rows(results_list) %>%
  mutate(status = ifelse(p_value > 0.05, "PASS", "FAIL")) %>%
  arrange(desc(p_value))
cat("\n--- TOP MODELS (RANKED BY P-VALUE) ---\n")
print(head(final_table, 20))
You’ll see the top 20 models ranked by p-value. Because the table is wide, R wraps the printout: the model column appears first, followed by the fit-statistics columns. For this example, the best fit was:
model
1 Russia_Samara_HG.AG + Georgia_Kotias_Mesolithic.SG + Iran_HajjiFiruz_N.AG
Now check the same row in the fit-statistics columns:
rss p_value weights status
1 2.753318e-08 0.94373890 52 / 22.9 / 25.1 PASS
Model 1 (3-way mixture): This shows the estimated ancestry proportions in the order of sources listed:
- 52.0% Russia_Samara_HG.AG (Eastern Hunter-Gatherer ancestry)
- 22.9% Georgia_Kotias_Mesolithic.SG (Caucasus Hunter-Gatherer-related ancestry)
- 25.1% Iran_HajjiFiruz_N.AG (Iranian Neolithic farmer-related ancestry)
Understanding the Statistics:
- rss (Residual Sum of Squares): 2.75 × 10⁻⁸, extremely low, indicating excellent fit
- p-value: 0.94, well above the 0.05 threshold, meaning the residuals are not statistically distinguishable from zero
- status: PASS, the model successfully explains the target’s f4-statistics
With a p-value this high (0.94), the proposed sources fully explain the observed f4-statistics, suggesting a robust model that captures the target’s ancestry composition well.
Comparison with qpAdm
The p-values from this script may differ slightly from qpAdm results for the same model. This occurs because the two methods, while based on the same f4-statistic framework, use different computational implementations and handle the covariance structure differently: this script weights residuals only by their standard errors, ignoring correlations between f4-statistics. In addition, the script uses quadratic programming with explicit non-negativity constraints, whereas qpAdm uses weighted least squares, which does not constrain the weights and can return small negative values.
Ideally, you should validate promising models from this script against qpAdm to ensure consistency. However, it’s worth noting that occasionally the same source combination will yield slightly different ancestry proportions between the two methods, or qpAdm may produce small negative weights where this script does not (or vice versa). This doesn’t necessarily mean one model is “wrong”; rather, it reflects differences in how the two algorithms handle the constraint optimization and weight the reference populations in the regression.
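Such a cross-check is straightforward, since the same admixtools package also provides qpadm(). This is a sketch, not a full qpAdm tutorial: it assumes the variables defined earlier, and the “right” set here is simply the outgroup plus the references, mirroring the f4 setup above (your preferred right set may differ).

```r
# Cross-check the best-fitting model from the script with qpAdm
best_sources <- c("Russia_Samara_HG.AG",
                  "Georgia_Kotias_Mesolithic.SG",
                  "Iran_HajjiFiruz_N.AG")
qp_fit <- qpadm(data,
                left   = best_sources,
                right  = c(outgroup, references),
                target = target_name)
qp_fit$weights    # ancestry proportions to compare against the script's output
qp_fit$rankdrop   # model-fit p-values
```

If the qpAdm proportions land close to the script’s estimates and the model passes, that is good evidence the fit is not an artifact of either implementation.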
If you encounter models with vastly different results between methods, this is a signal to examine your reference set more carefully and consider whether additional or alternative references might better anchor the model.