In this post, I’ll cover how to build a transparent, deterministic admixture modeler from scratch using R. The main advantage of this approach is automated combinatorial testing: instead of manually specifying and running each potential ancestry model one-by-one in qpAdm, this script systematically tests hundreds or even thousands of source combinations in seconds and ranks them by fit quality.

While qpAdm is the “industry standard” and provides sophisticated covariance weighting through block jackknife procedures, it requires manual intervention for each model specification and can often feel like a “black box.” This script offers a transparent, complementary approach that uses the exact same mathematical foundation (f4-statistic regression with quadratic programming constraints) to rapidly explore the model space. If you have 10 potential sources (or even more) and want to test all 2-way, 3-way, and 4-way combinations, that’s 375 separate qpAdm runs versus a single automated run with this script.
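That 375 figure is just a sum of binomial coefficients; a quick sanity check in R, assuming a pool of 10 candidate sources:

```r
# All 2-, 3-, and 4-way models from a pool of 10 candidate sources
n_pool   <- 10
n_models <- sum(choose(n_pool, 2:4))  # 45 + 120 + 210
n_models
# -> 375
```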


Theoretical Foundation

The methodology is based on a fundamental principle: if a target population is truly an admixture of source populations, then its genetic drift relative to any reference populations should be mathematically predictable as a linear combination of the sources’ drift patterns.

The f4-statistic measures allele frequency correlations between populations. Specifically, it quantifies how much genetic drift is shared between different population pairs. The script calculates f4 in the form: f4(Outgroup, Test; Ref₁, Ref₂). This measures the correlation between the test population (either target or a source candidate) and Ref₁ relative to their relationships with both the Outgroup and Ref₂.
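Concretely, f4 is the average across SNPs of the product of allele-frequency differences within each pair. A toy illustration with made-up frequencies (admixtools computes this from genotype data with block-jackknife resampling; this sketch only shows the underlying quantity):

```r
# Made-up allele frequencies at 5 SNPs for four populations
p_out  <- c(0.10, 0.50, 0.30, 0.80, 0.20)  # Outgroup
p_test <- c(0.40, 0.60, 0.50, 0.70, 0.35)  # Test (target or source candidate)
p_ref1 <- c(0.35, 0.65, 0.45, 0.75, 0.30)  # Ref1
p_ref2 <- c(0.15, 0.45, 0.25, 0.85, 0.20)  # Ref2

# f4(Outgroup, Test; Ref1, Ref2): mean product of frequency differences
f4_toy <- mean((p_out - p_test) * (p_ref1 - p_ref2))
f4_toy
# -> -0.029 (negative: excess shared drift between Test and Ref1)
```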

The Regression

The core mathematical operation is setting up a system of equations where: Target_vec ≈ Source_mat × w (equivalently, Target_vec ≈ Σᵢ wᵢ × Sourceᵢ_vec)

In words: The target’s f4-vector (its genetic relationships to all reference populations) should approximately equal the weighted sum of each source’s f4-vector, where the weights (w) represent ancestry proportions.

The Constraints (Quadratic Programming): Unconstrained linear regression can return negative weights, or weights that do not sum to one, neither of which makes sense as an ancestry proportion. This script therefore uses Quadratic Programming to enforce two constraints:

  1. Non-negativity: You cannot inherit negative proportions, so all weights must be ≥ 0 (wᵢ ≥ 0)
  2. Sum to 1: All ancestry proportions must sum to exactly 100% (Σw = 1)

The solver finds the optimal weights that minimize the squared Euclidean distance between the target’s observed f4-vector and the model’s predicted vector.
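A minimal sketch of this constrained fit with quadprog, using hypothetical two-source f4-vectors where the target is built as a known 70/30 mix (the full script below wraps the same solve.QP call in a helper function):

```r
library(quadprog)

# Hypothetical f4-vectors over four references; target is a known 70/30 mix
source_mat <- cbind(srcA = c( 0.010, -0.004, 0.006, 0.002),
                    srcB = c(-0.002,  0.008, 0.001, 0.009))
target_vec <- 0.7 * source_mat[, "srcA"] + 0.3 * source_mat[, "srcB"]

n    <- ncol(source_mat)
Dmat <- crossprod(source_mat) + diag(1e-10, n)  # quadratic term, ridged
dvec <- crossprod(source_mat, target_vec)       # linear term
Amat <- cbind(rep(1, n), diag(n))               # columns: sum-to-1, then w >= 0
bvec <- c(1, rep(0, n))

fit <- solve.QP(Dmat, dvec, Amat, bvec, meq = 1)  # meq = 1: first constraint is an equality
round(fit$solution, 3)
# -> 0.700 0.300: the known mixing proportions are recovered
```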

Why Euclidean Distance Is Valid Here (Unlike in PCA)

This approach is fundamentally different from PCA-based distance minimization. In PCA models, Euclidean distance is measured in a dimension-reduced space where spatial proximity is conflated with ancestry.

In contrast, f4-statistics directly measure shared genetic drift patterns, the actual phylogenetic signal of common ancestry. When we minimize Euclidean distance between f4-vectors, we’re not working in a compressed linear projection; we’re measuring how well the model reproduces the target’s true drift relationships across multiple independent reference axes. Each dimension of the f4-vector represents an independent test of shared drift with a reference population. Minimizing this distance tests whether the proposed sources can mathematically explain the target’s observed drift patterns, which is exactly what admixture does genetically.

The Role of References and Outgroups

The Outgroup acts as the root, polarizing the allele frequencies so we know in which direction “drift” is occurring.

The References act as the coordinate system, or scaffolding. The model measures the target’s genetic drift relative to each reference population independently, creating a multi-dimensional vector of drift relationships. More references provide more independent axes along which to test the proposed admixture model.


Running the Script in Practice

To run the script you’ll need three libraries: the R implementation of admixtools (“ADMIXTOOLS 2”), tidyverse, and quadprog.

Installation:

If you haven’t installed these packages yet, run the following commands in R:

# Install admixtools2 from GitHub
install.packages("remotes")
remotes::install_github("uqrmaie1/admixtools")

# Install tidyverse and quadprog from CRAN
install.packages("tidyverse")
install.packages("quadprog")

Loading the Libraries:

Once installed, load them at the start of your R session:

library(admixtools)
library(tidyverse)
library(quadprog)

Now set up the data prefix, target sample, outgroup, and potential sources (contributors to the ancestry of the target). I recommend leaving the variable names exactly as shown: that way, after you follow the setup, you can copy and paste the rest of the script into R without any variable name adjustments. Here I’ll model the Yamnaya EBA population:

# Specify your target population
target_name <- "Russia_Samara_EBA_Yamnaya.AG"

# Specify the outgroup (deeply diverged population)
outgroup <- "Mbuti.DG"

# Set your EIGENSTRAT data prefix
# This should point to your .geno, .snp, and .ind files
# For example, if your files are named data.geno, data.snp, data.ind:
data <- "data"

Next, declare a vector of potential sources you want tested for the target. Here I use a small source pool:

potential_sources <- c(
  "Turkey_Southeast_Cayonu_PPN.SG", 
  "Russia_Samara_HG.AG", 
  "Sweden_Mesolithic_HG.AG", 
  "Turkey_Marmara_Barcin_N.AG", 
  "Georgia_Kotias_Mesolithic.SG", 
  "Turkey_Central_Boncuklu_PPN.AG", 
  "Iran_HajjiFiruz_N.AG", 
  "Turkmenistan_C_TepeAnau.AG"
)

Selecting References: The Critical Step

Now comes the most important part: selecting the right reference populations. Omitting one crucial “anchor” can produce a false-positive model, though qpAdm is equally vulnerable to this. The key principle: the f4-statistics should be informative, which means the references must be differentially related to the sources and the target. Adding five different references that deliver essentially the same genetic drift information has no value and in fact only introduces noise.

For a deeper understanding of what makes f4-statistics informative, refer to my post: Running f4-Statistics with Admixtools in R.

To see concrete examples of how missing critical references (“anchors”) affect f4-based models, see my qpAdm tutorial: Running qpAdm in R: Testing and Interpreting Ancestry Models.

For modeling the EBA Yamnaya samples from Samara, I selected the following references:

references <- c(
  "Switzerland_Bichon_Epipaleolithic.SG",
  "Georgia_Satsurblia_LateUP.SG",
  "Georgia_Dzudzuana_UP.AG",
  "Russia_MA1_UP.SG",
  "Iran_GanjDareh_N.AG",
  "Turkey_Central_Pinarbasi_Epipaleolithic.AG"
)

These references were chosen to capture different axes of genetic variation relevant to the West Eurasian gene pool:

  • Western Hunter-Gatherer (Bichon)
  • Early Caucasus Hunter-Gatherer (Satsurblia)
  • Upper Paleolithic Caucasus (Dzudzuana)
  • Ancient North Eurasian (MA1)
  • Early Iranian Neolithic (Ganj Dareh)
  • Anatolian Epipaleolithic (Pinarbasi)

Together, they provide independent tests of the target’s drift relationships across distinct ancestral lineages.
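One practical pitfall: a single mistyped population label will make the f4 call fail or silently drop a population. A quick sanity check against the .ind file can catch this; the sketch below uses a mock three-column .ind table (sample ID, sex, population label), where in practice you would load yours with read.table:

```r
# Mock .ind contents; in practice:
# ind <- read.table("data.ind", col.names = c("id", "sex", "population"))
ind <- data.frame(
  id         = c("I0001", "I0002", "I0003"),
  sex        = c("M", "F", "M"),
  population = c("Mbuti.DG", "Russia_Samara_HG.AG", "Russia_Samara_EBA_Yamnaya.AG")
)

# Every declared population should appear in the .ind file
declared <- c("Mbuti.DG", "Russia_Samara_HG.AG", "Sweden_Mesolithic_HG.AG")
missing  <- setdiff(declared, ind$population)
missing
# -> "Sweden_Mesolithic_HG.AG" (absent from this mock table)
```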

Running the Analysis

Now that all variables are declared, if you’ve used the exact variable names shown above, you can copy the complete script below and paste it directly into R:

# --- 2. INTERNAL SOLVER FUNCTIONS ---
solve_weights <- function(target_vec, source_mat) {
  n <- ncol(source_mat)
  # Dmat: quadratic term (Gram matrix of the sources). Add a tiny ridge to keep it positive definite.
  Dmat <- crossprod(source_mat) + diag(1e-10, n)
  dvec <- crossprod(source_mat, target_vec)
  # Constraints: Sum to 1 (meq=1) and each weight >= 0
  Amat <- cbind(rep(1, n), diag(n))
  bvec <- c(1, rep(0, n))
  
  fit <- try(solve.QP(Dmat, dvec, Amat, bvec, meq = 1), silent = TRUE)
  if(inherits(fit, "try-error")) return(NULL)
  return(fit$solution)
}

# --- 3. DATA PREPARATION ---
cat("Calculating f4 basis...\n")
all_pops <- c(target_name, potential_sources)

f4_results <- f4(data, 
                 pop1 = outgroup, 
                 pop2 = all_pops, 
                 pop3 = references[1], 
                 pop4 = references[-1])

# Pivot and maintain both estimate (est) and standard error (se)
f4_matrix_wide <- f4_results %>%
  dplyr::select(pop2, pop4, est, se) %>%
  tidyr::pivot_wider(names_from = pop2, values_from = c(est, se))

# Extract Target vector and its Standard Errors
y <- as.matrix(f4_matrix_wide[[paste0("est_", target_name)]])
target_se <- as.numeric(f4_matrix_wide[[paste0("se_", target_name)]])

# Extract Source Matrix
x_full <- f4_matrix_wide %>% 
  dplyr::select(starts_with("est_")) %>% 
  dplyr::select(-all_of(paste0("est_", target_name)))
colnames(x_full) <- gsub("est_", "", colnames(x_full))
x_full <- as.matrix(x_full)

# --- 4. COMBINATORIAL LOOP ---
results_list <- list()
# Test combinations of size 2, 3, and 4
for (m in 2:4) {
  cat("Testing all", m, "source combinations...\n")
  combos <- combn(colnames(x_full), m)
  
  for(i in 1:ncol(combos)) {
    curr_names <- combos[, i]
    x_sub <- x_full[, curr_names, drop = FALSE]
    
    w <- solve_weights(y, x_sub)
    
    if(!is.null(w)) {
      # Math Alignment
      w_mat <- matrix(w, ncol = 1)
      y_pred <- x_sub %*% w_mat
      residuals <- y - y_pred
      
      # Statistics
      chi_sq <- sum((residuals / target_se)^2)
      df_val <- nrow(x_sub) - length(w)
      p_val  <- if(df_val > 0) pchisq(chi_sq, df_val, lower.tail = FALSE) else NA
      rss    <- sum(residuals^2)
      
      results_list[[length(results_list) + 1]] <- data.frame(
        n_sources = length(w),
        model     = paste(curr_names, collapse = " + "),
        rss       = rss,
        p_value   = p_val,
        weights   = paste(round(w * 100, 1), collapse = " / ")
      )
    }
  }
}

# --- 5. OUTPUT ---
final_table <- bind_rows(results_list) %>%
  mutate(status = ifelse(p_value > 0.05, "PASS", "FAIL")) %>%
  arrange(desc(p_value))

cat("\n--- TOP MODELS (RANKED BY P-VALUE) ---\n")
print(head(final_table, 20))

The console prints the 20 best-fitting models ranked by p-value. Because the table is wide, R wraps it: the model column prints first, followed by the fit statistics. For this example, the best fit was:

                                                                                                model
1                                             Russia_Samara_HG.AG + Georgia_Kotias_Mesolithic.SG + Iran_HajjiFiruz_N.AG

Now check for the same row in the fit table:

            rss    p_value                   weights status
1  2.753318e-08 0.94373890          52 / 22.9 / 25.1   PASS

Model 1 (3-way mixture): This shows the estimated ancestry proportions in the order of sources listed:

  • 52.0% Russia_Samara_HG.AG (Eastern Hunter-Gatherer ancestry)
  • 22.9% Georgia_Kotias_Mesolithic.SG (Caucasus Hunter-Gatherer-related ancestry)
  • 25.1% Iran_HajjiFiruz_N.AG (Northwest Iranian Neolithic farmer-related ancestry)

Understanding the Statistics:

  • rss (Residual Sum of Squares): 2.75 × 10⁻⁸, extremely low, indicating excellent fit
  • p-value: 0.94, well above the 0.05 threshold, meaning the residuals are not statistically distinguishable from zero
  • status: PASS, the model successfully explains the target’s f4-statistics

In short, with a p-value this high, the proposed sources fully account for the target’s observed f4-statistics, making this a robust model of the Samara Yamnaya’s ancestry composition.
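If you want to keep the ranked results for later comparison, you can filter to the passing models and export the table. A small sketch, shown with a mock final_table so it runs standalone (in practice, use the final_table produced by the script):

```r
# Mock of the script's final_table (illustrative values only)
final_table <- data.frame(
  n_sources = c(3, 2),
  model     = c("SourceA + SourceB + SourceC", "SourceA + SourceB"),
  rss       = c(2.75e-08, 4.10e-06),
  p_value   = c(0.944, 0.012),
  weights   = c("52 / 22.9 / 25.1", "61.3 / 38.7"),
  status    = c("PASS", "FAIL")
)

# Keep only models that clear the p > 0.05 threshold, then export
passing <- subset(final_table, status == "PASS")
write.csv(passing, "passing_models.csv", row.names = FALSE)
```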

Comparison with qpAdm

Running the same model in qpAdm produces:

# A tibble: 3 × 5
  target                       left                         weight     se     z
  <chr>                        <chr>                         <dbl>  <dbl> <dbl>
1 Russia_Samara_EBA_Yamnaya.AG Russia_Samara_HG.AG           0.540 0.0229 23.6 
2 Russia_Samara_EBA_Yamnaya.AG Georgia_Kotias_Mesolithic.SG  0.210 0.0366  5.75
3 Russia_Samara_EBA_Yamnaya.AG Iran_HajjiFiruz_N.AG          0.250 0.0434  5.77

$rankdrop
# A tibble: 3 × 7
  f4rank   dof   chisq         p dofdiff chisqdiff   p_nested
   <int> <int>   <dbl>     <dbl>   <int>     <dbl>      <dbl>
1      2     4    4.32 3.65e-  1       6      351.  1.18e- 72

The ancestry proportions are highly consistent between methods (54% / 21% / 25% in qpAdm versus 52% / 22.9% / 25.1% in the script), though the p-values differ (0.365 vs 0.94). Both values indicate the model adequately explains the target’s f4-statistics. This difference reflects how the two methods weight the covariance structure: qpAdm uses a block jackknife-derived weighting matrix that accounts for correlations between f4-statistics, while this script uses standard quadratic programming that treats the residuals as independent, resulting in more optimistic p-values. Despite this difference, both methods converge on essentially the same ancestry proportions and both accept the model as plausible.