This is a follow-up to my previous post Processing Ancient DNA: From FASTQ to Aligned BAM, where I aligned an ancient DNA sample against the hs37d5 reference genome, producing a filtered BAM compatible with the AADR dataset.
In this post, I’ll cover pseudohaploid genotype calling using pileupCaller and converting the output to EIGENSTRAT format for use with ADMIXTOOLS. Since we just created this BAM ourselves in the previous post, we already know it’s aligned to hs37d5. However, if you’re starting with a BAM file, you’ll need to verify the reference genome first. I’ll start by showing how to check BAM headers to identify the reference genome.
Identifying the Reference Genome from BAM Headers
Before processing any BAM file, you should verify which reference genome it was aligned against. This is critical because AADR compatibility requires hs37d5 specifically. BAMs aligned to other GRCh37-based references like hg19 are also compatible (since they share the same coordinate system, differing only in chromosome naming conventions), but hg38/GRCh38 BAMs would require realignment from FASTQs.
To identify the reference genome, check the BAM header:
samtools view -H ERR14088885.bam | head -25
The output shows:
@HD VN:1.6 SO:coordinate
@SQ SN:1 LN:249250621
@SQ SN:2 LN:243199373
@SQ SN:3 LN:198022430
...
@SQ SN:X LN:155270560
@SQ SN:Y LN:59373566
The @SQ lines show reference sequences with their lengths. The chromosome naming (SN:1 instead of SN:chr1) and the specific length of chromosome 1 (249250621 bp) confirm this BAM is aligned to hs37d5 or GRCh37. If we saw SN:chr1 with the same length, that would indicate hg19. Different lengths (like 248956422 for chr1) would indicate hg38.
Reference Genome Quick-Check
If you are unsure which reference was used, compare your @SQ lengths to this table:
| Reference Name | Chr1 Length (bp) | Chromosome Naming |
|---|---|---|
| hs37d5 / GRCh37 | 249,250,621 | 1, 2, X, MT |
| hg19 | 249,250,621 | chr1, chr2, chrX, chrM |
| hg38 / GRCh38 | 248,956,422 | chr1, chr2, chrX, chrM |
Note: AADR compatibility requires the hs37d5 / GRCh37 coordinate system. If your BAM uses hg19, it’s compatible (same coordinates, just rename chromosomes by removing the “chr” prefix if needed). BAMs aligned to hg38/GRCh38 require realignment from FASTQs.
With the reference genome verified, we can proceed to genotype calling.
Pseudohaploid Genotype Calling with pileupCaller
First, I’ll create a BED file containing the AADR SNP positions (not to be confused with EIGENSTRAT’s .bed format, this is a genomic intervals file). It tells samtools mpileup to only process the positions we care about, rather than scanning the entire genome. This requires the .snp file from the AADR EIGENSTRAT dataset. If you haven’t downloaded it yet, see: How to Download the AADR Dataset (Linux & WSL).
Creating a BED File from AADR SNP Positions
Run the command:
# Assuming the .snp file is named v62.0_HO_public.snp
awk '{print $2, $4-1, $4}' v62.0_HO_public.snp > v62_positions.bed
This extracts the chromosome (column 2) and position (column 4), converting from 1-based to 0-based coordinates by subtracting 1 from the start position. The resulting BED file contains three columns: chromosome, start (0-based), and end (0-based, exclusive).
Installing pileupCaller
pileupCaller is part of the sequenceTools package and can be installed via conda or compiled from source. For this tutorial, I’ll use the conda installation as it’s the most straightforward method.
If you don’t have conda installed, install Miniconda first:
# Download Miniconda installer
# For other architectures (ARM, macOS, etc.), see: https://docs.conda.io/en/latest/miniconda.html
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
# Run installer
bash Miniconda3-latest-Linux-x86_64.sh
# Follow prompts, then restart your terminal or run:
source ~/.bashrc
Then install pileupCaller:
conda install -c bioconda sequencetools
# Test installation
pileupCaller --help
Pseudohaploid Genotype Calling
Now we run samtools mpileup and pipe the output directly to pileupCaller to generate EIGENSTRAT format files:
samtools mpileup -R -B -q 30 -Q 30 \
-l v62_positions.bed \
-f hs37d5/hs37d5.fa.gz \
ERR14088885_filtered.bam | \
pileupCaller --sampleNames ERR14088885 \
--samplePopName Bashur_Hoyuk_EBA \
--randomHaploid \
-f v62.0_HO_public.snp \
-e eigenstrat_output
Parameters to adjust for your sample:
hs37d5/hs37d5.fa.gz: Path to your reference genome (see previous post for downloading and indexing; if you have the reference but haven’t indexed it, runsamtools faidx hs37d5/hs37d5.fa.gz)ERR14088885_filtered.bam: Your input BAM file (here: the filtered BAM from the previous post)--sampleNames ERR14088885: Sample identifier (match your BAM filename or use the ID from the study)v62.0_HO_public.snp: The AADR EIGENSTRAT.snpfile (same file used to create the BED file above)--samplePopName Bashur_Hoyuk_EBA: Population/individual label from the study (this will be the third column of the.ind)
What this does:
samtools mpileupgenerates pileup format (a text representation showing all reads covering each position) at AADR positions only (specified by the BED file)pileupCaller --randomHaploidrandomly samples a single high-quality allele at each position, avoiding bias from aDNA damage- Output: Three EIGENSTRAT files (
eigenstrat_output.geno,.snp,.ind)
Note on quality thresholds: The quality filters here (q30/Q30) are stricter compared to the BAM filtering step (q20). This two-stage approach retains more data during alignment while applying stringent filters during genotype calling to ensure high-confidence calls.
After a successful run, you should see output similar to this:
SampleName TotalSites NonMissingCalls avgRawReads avgDamageCleanedReads avgSampledFrom
ERR14088885 584131 19549 7.457744923655824e-2 7.457744923655824e-2 7.439598309283363e-2
Interpreting the results:
- TotalSites: Total AADR SNP positions attempted (584,131 for HO dataset)
- NonMissingCalls: Positions with sufficient coverage for genotype calls (19,549)
- avgRawReads: Mean coverage of raw pileup input across all sites (~0.074x or 7.4%)
- avgDamageCleanedReads: Mean coverage after single-stranded damage removal (~0.074x)
- avgSampledFrom: Mean coverage after removing tri-allelic reads (~0.074x)
This sample has ~3.35% of AADR positions covered. While this may seem low, it’s sufficient for ADMIXTOOLS analyses (f-statistics, qpAdm) and rough PCA projection with smartpca.
In the next post, I’ll show how to merge this sample with the AADR dataset using mergeit.
Now that you have your EIGENSTRAT files (.geno, .snp, .ind), you might want to delete intermediate files. The small BED file (v62_positions.bed) can be safely deleted as it’s easily recreated from the AADR .snp file. However, keep your BAM and BAI files if you’re interested in uniparental markers analyis. Even though we’ve called autosomal SNPs for ADMIXTOOLS, the BAM is still required for Y-DNA haplogroup assignment, mitochondrial analysis, and damage assessment with mapDamage. I possibly cover Y-DNA inference using ybyra in a future post, so keep your filtered BAMs if you plan to follow along.