Fast, Transparent f4-Based Admixture Screening in R

In this post, I will build a transparent admixture-screening workflow from scratch in R using f4-statistics and constrained regression. The main advantage is automation: instead of hand-writing every candidate model, the script tests many 2-way, 3-way, and 4-way source combinations in one pass and ranks them by fit. The result is not a replacement for qpAdm, but a fast screening layer that can help you identify promising models before you validate them more formally. ADMIXTOOLS 2 already includes batch tools such as qpadm_multi() and qpadm_rotate(), so the point here is not that qpAdm cannot be automated. The point is that this custom workflow is compact, transparent, easy to modify, and useful for exploratory model search.

qpAdm remains the standard formal framework for testing admixture models with left and right populations. It estimates admixture weights, computes model fit, and can optionally constrain weights to be non-negative. It also uses block jackknife or bootstrap resampling to estimate uncertainty. The script below should therefore be thought of as a complementary f4-space screening tool, not as a full substitute for qpAdm or qpWave.

The basic idea

The logic is simple. If a target population can be approximated as a mixture of several source-related ancestries, then the target’s pattern of f4-statistics relative to a fixed outgroup and a fixed reference set should also be approximable as a weighted combination of the source f4-vectors. This relies on the same broad f-statistic framework that underlies ADMIXTOOLS methods more generally.

In this post I compute f4-statistics of the form:

f4(Outgroup, Test; Ref1, RefX)

for the target and for each candidate source. Each population is therefore represented by a vector of allele-sharing relationships across the reference panel. If the target lies near the convex combination of some sources in this f4 space, that source set is a plausible ancestry model worth checking more carefully.

What the script actually optimizes

The core fit is a constrained weighted least-squares problem:

\min_w \sum_i \left(\frac{y_i - (Xw)_i}{se_i}\right)^2

subject to:

w_j \ge 0 \quad \text{for all } j

and

\sum_j w_j = 1

Here, y is the target f4-vector, X is the matrix of source f4-vectors, and se contains the marginal standard errors of the target f4 entries. So this is not ordinary Euclidean distance in raw coordinates. It is a diagonally weighted residual objective, solved under biologically meaningful constraints using quadratic programming.

That makes the procedure much more interpretable than PCA-distance fitting. In PCA, distances are measured in a reduced coordinate system whose geometry is optimized for variance, not directly for ancestry modeling. Here the fit is performed directly on f4-statistics, which are designed to capture allele-sharing asymmetries and clade deviations.

At the same time, this script still uses a simplified error model. It weights by the marginal standard errors of each f4 entry, but it does not use the full covariance matrix among f4-statistics the way qpAdm does. That means the weights can still be useful as screening estimates, but the p-values from this script should be treated as approximate fit scores rather than as formal qpAdm-equivalent tests. qpAdm, by contrast, is built around covariance-aware model testing and resampling-based uncertainty estimation.

The role of the outgroup and references

The outgroup anchors the statistic, while the reference populations define the coordinate system in which the target and sources are compared. This is why the reference choice matters. A model is never being tested in an absolute vacuum. It is being tested relative to the chosen right-side information. qpWave and qpAdm are built on the same principle. In qpWave terms, the left and right populations define an f4 matrix whose rank tells us about the minimum number of ancestry streams separating them. qpAdm extends this logic to test whether a target can be explained by the chosen left populations relative to the chosen right populations.

This has two practical consequences.

First, more references are not automatically better. A reference that is symmetrically related to all candidate sources contributes little discriminatory information. Second, changing the right set can change whether a model looks acceptable, because the model is conditional on that set. The same caveat applies to qpAdm, not just to this custom script.

What this workflow is good for

This workflow is especially useful when you have a target, a stable right set, and a medium-sized panel of candidate sources, and you want to answer questions like:

Which 2-way, 3-way, or 4-way models are even geometrically plausible?
Which sources recur across many good-fitting models?
Which sources look interchangeable?
Which candidates are obviously poor proxies?

Because the script builds one f4 basis and reuses it across all combinations, it can screen large model spaces quickly. That makes it a good first-pass tool before running competition tests, qpWave checks, or full qpAdm validation on the strongest candidates. ADMIXTOOLS 2 itself also supports efficient batch workflows, but the custom regression screen remains useful because the fitting logic is completely explicit and easy to customize.

What it is not

This script is not a replacement for qpWave rank testing. It does not test the rank structure of the full left-right f4 matrix the way qpWave does. It also does not use the full covariance-aware qpAdm machinery. If a model looks good here, the right next step is still to check it with qpAdm, and in many cases also with qpWave or source-competition tests.

Installation

You will need admixtools, tidyverse, and quadprog. Current documentation for the R package shows GitHub installation via remotes::install_github("uqrmaie1/admixtools").

install.packages("remotes")
install.packages("tidyverse")
install.packages("quadprog")

remotes::install_github("uqrmaie1/admixtools")

library(admixtools)
library(tidyverse)
library(quadprog)

Example setup

Below I will model the Samara Yamnaya Early Bronze Age group.

# Specify your target population
target_name <- "Russia_Samara_EBA_Yamnaya.AG"

# Specify the outgroup
outgroup <- "Mbuti"

# EIGENSTRAT prefix
data <- "data"

Candidate sources:

potential_sources <- c(
  "Turkey_N",
  "Turkey_Cayonu_PPN",
  "Iran_HajjiFiruz_N",
  "Georgia_KotiasKlde_Mesolithic",
  "Russia_Samara_EN_Mesolithic",
  "Tajikistan_Mesolithic",
  "Hungary_EN_Starcevo-1",
  "Azerbaijan_MenteshTepe_N_ShomutepeShulaveri",
  "Russia_AfontovaGora_UP",
  "Russia_YanaRiver_UP",
  "Russia_Vologda_Mesolithic"
)

Right populations (references):

references <- c(
  "Turkey_Epipaleolithic",
  "Russia_Malta_UP",
  "Switzerland_Epipaleolithic",
  "Georgia_Satsurblia_LateUP",
  "Iran_BeltCave_Mesolithic",
  "Iraq_PPNA"
  )

These references here are intended to capture different axes of West Eurasian variation, including western hunter-gatherer-related ancestry, Caucasus-related ancestry, Ancient North Eurasian-related ancestry, early Iranian farmer-related ancestry, and Anatolian-related ancestry. It is important that the reference panel should contain populations that help distinguish among the candidate sources. qpWave and qpAdm both rely on informative right populations for the same reason. In retrospect, I would have included a broader set of reference populations, which may have given the script more discriminatory information to work with.

The full script

# --- 1. INPUT CHECKS ---
stopifnot(length(references) >= 2)

potential_sources <- unique(setdiff(potential_sources, target_name))

if (length(potential_sources) < 2) {
  stop("You need at least two candidate sources.")
}

# --- 2. INTERNAL HELPER FUNCTIONS ---
solve_weights <- function(target_vec, source_mat, se_vec, tol = 1e-10) {
  row_ok <- is.finite(target_vec) &
            is.finite(se_vec) & se_vec > 0 &
            apply(source_mat, 1, function(z) all(is.finite(z)))

  y <- as.numeric(target_vec[row_ok])
  X <- source_mat[row_ok, , drop = FALSE]
  se <- as.numeric(se_vec[row_ok])

  if (length(y) == 0 || nrow(X) == 0) return(NULL)

  # Weighted least squares objective:
  # minimize sum(((y - X %*% w) / se)^2)
  y_w <- y / se
  X_w <- sweep(X, 1, se, "/")

  # Skip rank-deficient combinations
  if (qr(X_w)$rank < ncol(X_w)) return(NULL)

  n <- ncol(X_w)
  Dmat <- crossprod(X_w) + diag(tol, n)
  dvec <- crossprod(X_w, y_w)

  # Constraints: sum(weights) = 1 and weights >= 0
  Amat <- cbind(rep(1, n), diag(n))
  bvec <- c(1, rep(0, n))

  fit <- try(quadprog::solve.QP(Dmat, dvec, Amat, bvec, meq = 1), silent = TRUE)
  if (inherits(fit, "try-error")) return(NULL)

  w <- fit$solution
  w[abs(w) < 1e-8] <- 0

  if (any(w < -1e-6)) return(NULL)

  w <- pmax(w, 0)
  w <- w / sum(w)

  list(weights = w, keep = row_ok)
}

evaluate_model <- function(y, X, se, w) {
  y_pred <- as.numeric(X %*% w)
  residuals <- y - y_pred
  z_resid <- residuals / se

  chi_sq <- sum(z_resid^2)
  df_val <- length(y) - (length(w) - 1)
  p_val <- if (df_val > 0) pchisq(chi_sq, df_val, lower.tail = FALSE) else NA_real_
  rss <- sum(residuals^2)
  max_abs_z <- max(abs(z_resid))

  list(
    chi_sq = chi_sq,
    df = df_val,
    p_value = p_val,
    rss = rss,
    max_abs_z = max_abs_z
  )
}

# --- 3. DATA PREPARATION ---
cat("Calculating f4 basis...\n")

all_pops <- c(target_name, potential_sources)

f4_results <- f4(
  data,
  pop1 = outgroup,
  pop2 = all_pops,
  pop3 = references[1],
  pop4 = references[-1],
  allsnps = FALSE
)

# allsnps = FALSE keeps SNP selection consistent across these f4 entries.
# This can reduce SNP counts, but it avoids mixing different SNP subsets
# across the basis vectors.

f4_matrix_wide <- f4_results %>%
  dplyr::select(pop2, pop4, est, se) %>%
  tidyr::pivot_wider(names_from = pop2, values_from = c(est, se))

y <- as.numeric(f4_matrix_wide[[paste0("est_", target_name)]])
target_se <- as.numeric(f4_matrix_wide[[paste0("se_", target_name)]])

x_full <- f4_matrix_wide %>%
  dplyr::select(starts_with("est_")) %>%
  dplyr::select(-all_of(paste0("est_", target_name)))

colnames(x_full) <- gsub("^est_", "", colnames(x_full))
x_full <- as.matrix(x_full)

if (ncol(x_full) < 2) {
  stop("Not enough candidate sources after preprocessing.")
}

# --- 4. COMBINATORIAL SEARCH ---
results_list <- list()
source_sizes <- 2:min(4, ncol(x_full))

for (m in source_sizes) {
  cat("Testing all", m, "source combinations...\n")
  combos <- combn(colnames(x_full), m)

  for (i in seq_len(ncol(combos))) {
    curr_names <- combos[, i]
    x_sub <- x_full[, curr_names, drop = FALSE]

    fit <- solve_weights(y, x_sub, target_se)

    if (!is.null(fit)) {
      w <- fit$weights
      keep <- fit$keep

      y_use <- y[keep]
      se_use <- target_se[keep]
      x_use <- x_sub[keep, , drop = FALSE]

      stats <- evaluate_model(y_use, x_use, se_use, w)

      results_list[[length(results_list) + 1]] <- tibble::tibble(
        n_sources = length(w),
        model = paste(curr_names, collapse = " + "),
        chi_sq = stats$chi_sq,
        df = stats$df,
        rss = stats$rss,
        max_abs_z = stats$max_abs_z,
        p_value = stats$p_value,
        weights = paste(
          paste0(curr_names, "=", round(w * 100, 1), "%"),
          collapse = "\n           "
        )
      )
    }
  }
}

if (length(results_list) == 0) {
  stop("No valid models were found.")
}

# --- 5. OUTPUT ---
final_table <- dplyr::bind_rows(results_list) %>%
  dplyr::mutate(
    status = dplyr::case_when(
      is.na(p_value) ~ "NA",
      p_value > 0.05 ~ "PASS",
      TRUE ~ "FAIL"
    )
  ) %>%
  dplyr::arrange(dplyr::desc(p_value), chi_sq)

top20 <- final_table %>%
  dplyr::slice_head(n = 20)

cat("\n--- TOP 20 MODELS (RANKED BY APPROXIMATE P-VALUE) ---\n\n")

for (i in seq_len(nrow(top20))) {
  cat(sprintf("[%d]\n", i))
  cat("model:     ", top20$model[i], "\n", sep = "")
  cat("weights:\n", top20$weights[i], "\n", sep = "")
  cat(sprintf("p_value:   %.6g\n", top20$p_value[i]))
  cat(sprintf("chi_sq:    %.6g\n", top20$chi_sq[i]))
  cat(sprintf("df:        %s\n", top20$df[i]))
  cat(sprintf("max_abs_z: %.6g\n", top20$max_abs_z[i]))
  cat("status:    ", top20$status[i], "\n\n", sep = "")
}

A note on allsnps = FALSE ADMIXTOOLS 2 documents that, when computing f4-statistics directly from genotype data, different f4 entries may otherwise be based on different SNP subsets if some data are missing. Setting allsnps = FALSE forces a common SNP set across the populations involved in the calculation, which is attractive here because the script is treating the f4 values as coordinates in one shared basis. The trade-off is that SNP counts can drop, sometimes a lot, when many populations are included.

Interpreting the output

The script returns a ranked table of source combinations and their fitted weights. The most useful columns are:

model, the source combination
weights, the fitted ancestry proportions under the simplex constraint
p_value, an approximate fit score under the script’s diagonal weighting scheme
max_abs_z, the largest standardized residual, which is often more intuitive than RSS alone
status, a rough pass/fail flag

The key phrase is approximate fit score. Because the script does not use the full covariance structure among f4-statistics, its p-values are usually more optimistic than formal qpAdm p-values. They are still useful for ranking and screening, but they should not be treated as the final arbiters of model validity. qpAdm uses covariance-aware fitting and resampling-based uncertainty, and that is one reason its p-values can differ even when the best source combination and the proportions look similar.

Example interpretation

In my Yamnaya example, the top model was:

[1]
model:     Turkey_Cayonu_PPN + Georgia_KotiasKlde_Mesolithic + Russia_Vologda_Mesolithic
weights:
Turkey_Cayonu_PPN=19%
           Georgia_KotiasKlde_Mesolithic=25.4%
           Russia_Vologda_Mesolithic=55.7%
p_value:   0.980515
chi_sq:    0.181524
df:        3
max_abs_z: 0.34791
status:    PASS

with fitted weights of roughly:

55.7% Russia_Vologda_Mesolithic
25.4% Georgia_KotiasKlde_Mesolithic
19% Turkey_Cayonu_PPN

That is exactly the kind of result this script is meant to surface: a small number of source combinations that look geometrically plausible in f4 space and are worth testing more formally.

When I ran the same model in qpAdm, I obtained:

  target                    left                          weight     se     z
  <chr>                     <chr>                          <dbl>  <dbl> <dbl>
1 Russia_Samara_EBA_Yamnaya Turkey_Cayonu_PPN              0.180 0.0733  2.46
2 Russia_Samara_EBA_Yamnaya Georgia_KotiasKlde_Mesolithic  0.287 0.0651  4.40
3 Russia_Samara_EBA_Yamnaya Russia_Vologda_Mesolithic      0.533 0.0361 14.8 

$rankdrop
# A tibble: 3 × 7
  f4rank   dof chisq        p dofdiff chisqdiff  p_nested
   <int> <int> <dbl>    <dbl>   <int>     <dbl>     <dbl>
1      2     4  11.2 2.45e- 2       6      61.2  2.55e-11
2      1    10  72.4 1.51e-11       8     359.   1.09e-72
3      0    18 431.  2.53e-80      NA      NA   NA

The fitted proportions are fairly close between the two approaches, which suggests that the screening script is capturing a real ancestry direction in f4 space. However, this qpAdm model does not formally pass: the rankdrop p-value for the 3-way model is 0.0245, below the conventional 0.05 threshold. So this source combination should be treated as a useful screening hit rather than as a validated admixture model.

One possible reason is that qpAdm can become overly conservative when the right set is larger than necessary or contains redundant reference populations. Harney et al. specifically caution that too many right populations can bias qpAdm p-values downward and lead to rejection of otherwise plausible models. In practice, this means the model may be worth re-evaluating with a thinner, more informative right set, chosen on methodological grounds rather than tuned ad hoc to force a pass.

The basic idea#

What the script actually optimizes#

The role of the outgroup and references#

What this workflow is good for#

What it is not#

Installation#

Example setup#

The full script#

Interpreting the output#

Example interpretation#