This is a follow-up to my previous post Processing Ancient DNA: From FASTQ to Aligned BAM, where I aligned an ancient DNA sample against the hs37d5 reference genome, producing a filtered BAM compatible with the AADR dataset.
In this post, I’ll cover pseudohaploid genotype calling using pileupCaller and converting the output to EIGENSTRAT format for use with ADMIXTOOLS. Since we just created this BAM ourselves in the previous post, we already know it’s aligned to hs37d5. However, if you’re starting with a BAM file, you’ll need to verify the reference genome first. I’ll start by showing how to check BAM headers to identify the reference genome.
Identifying the Reference Genome from BAM Headers
Before processing any BAM file, you should verify which reference genome it was aligned against. This is critical because AADR compatibility requires hs37d5 specifically. BAMs aligned to other GRCh37-based references like hg19 are also compatible (since they share the same coordinate system, differing only in chromosome naming conventions), but hg38/GRCh38 BAMs would require realignment from FASTQs.
To identify the reference genome, check the BAM header:
samtools view -H ERR14088885.bam | head -25
The output shows:
@HD VN:1.6 SO:coordinate
@SQ SN:1 LN:249250621
@SQ SN:2 LN:243199373
@SQ SN:3 LN:198022430
...
@SQ SN:X LN:155270560
@SQ SN:Y LN:59373566
The @SQ lines show reference sequences with their lengths. The chromosome naming (SN:1 instead of SN:chr1) and the specific length of chromosome 1 (249250621 bp) confirm this BAM is aligned to hs37d5 or GRCh37. If we saw SN:chr1 with the same length, that would indicate hg19. Different lengths (like 248956422 for chr1) would indicate hg38.
Reference Genome Quick-Check
If you are unsure which reference was used, compare your @SQ lengths to this table:
| Reference Name | Chr1 Length (bp) | Chromosome Naming |
|---|---|---|
| hs37d5 / GRCh37 | 249,250,621 | 1, 2, X, MT |
| hg19 | 249,250,621 | chr1, chr2, chrX, chrM |
| hg38 / GRCh38 | 248,956,422 | chr1, chr2, chrX, chrM |
Note: AADR compatibility requires the hs37d5 / GRCh37 coordinate system. If your BAM uses hg19, it’s compatible (same coordinates, just rename chromosomes by removing the “chr” prefix if needed). BAMs aligned to hg38/GRCh38 require realignment from FASTQs.
With the reference genome verified, we can proceed to genotype calling.
Creating a BED File from AADR SNP Positions
First, I’ll create a BED file containing the AADR SNP positions (not to be confused with EIGENSTRAT’s .bed format, this is a genomic intervals file). It tells samtools mpileup to only process the positions we care about, rather than scanning the entire genome. This requires the .snp file from the AADR EIGENSTRAT dataset. If you haven’t downloaded it yet, see: How to Download the AADR Dataset (Linux & WSL).
Run the command:
# Assuming the .snp file is named v62.0_HO_public.snp
awk '{print $2, $4-1, $4}' v62.0_HO_public.snp > v62_positions.bed
This extracts the chromosome (column 2) and position (column 4), converting from 1-based to 0-based coordinates by subtracting 1 from the start position. The resulting BED file contains three columns: chromosome, start (0-based), and end (0-based, exclusive).
Installing pileupCaller
pileupCaller is part of the sequenceTools package and can be installed via conda or compiled from source. For this tutorial, I’ll use the conda installation as it’s the most straightforward method.
If you don’t have conda installed, install Miniconda first:
# Download Miniconda installer
# For other architectures (ARM, macOS, etc.), see: https://docs.conda.io/en/latest/miniconda.html
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
# Run installer
bash Miniconda3-latest-Linux-x86_64.sh
# Follow prompts, then restart your terminal or run:
source ~/.bashrc
Then install pileupCaller and bamutil (used in the next step):
conda install -c bioconda sequencetools bamutil
# Test installation
pileupCaller --help
Trimming Damage-Prone Read Ends
Ancient DNA damage (C→T deamination) concentrates in the first and last ~2bp of fragments. We need to remove these positions before genotyping to avoid damage-induced errors.
Trim 2bp from both ends of each read:
bam trimBam ERR14088885_filtered.bam ERR14088885_trimmed.bam 2
# Index the trimmed BAM
samtools index ERR14088885_trimmed.bam
The output confirms successful trimming:
Number of records read = 2689408
Number of records written = 2689408
All reads are retained but with terminal bases masked, removing most damage without discarding data.
Pseudohaploid Genotype Calling with pileupCaller
Now we run samtools mpileup and pipe the output directly to pileupCaller to generate EIGENSTRAT format files:
# -q 10: Minimum mapping quality
# -Q 20: Minimum base quality
samtools mpileup -R -B -q 10 -Q 20 \
-l v62_positions.bed \
-f hs37d5/hs37d5.fa.gz \
ERR14088885_trimmed.bam | \
pileupCaller --sampleNames ERR14088885 \
--samplePopName Bashur_Hoyuk_EBA \
--randomHaploid \
-f v62.0_HO_public.snp \
-e eigenstrat_output
Reference and Input Files
hs37d5/hs37d5.fa.gz: Reference genome (must be indexed withsamtools faidx)ERR14088885_trimmed.bam: Your filtered BAM filev62_positions.bed: BED file of AADR SNP positionsv62.0_HO_public.snp: AADR EIGENSTRAT SNP file
Sample Identifiers:
--sampleNames ERR14088885: Sample ID (typically matching BAM filename)--samplePopName Bashur_Hoyuk_EBA: Population/individual label (becomes third column in.indfile)
What it does
samtools mpileupgenerates pileup format at AADR positions only, showing all reads covering each sitepileupCaller --randomHaploidrandomly samples one high-quality allele per position, producing pseudohaploid genotypes that avoid aDNA damage bias
Quality Filtering Parameters
Mapping Quality (-q 10):
Ancient DNA fragments are often short, resulting in lower mapping quality scores. A threshold of 10 retains authentic ancient reads that stricter filters (e.g., 30) would discard.
Base Quality (-Q 20):
Requires minimum Phred score of 20 (99% base call accuracy).
Random Haploid Sampling (--randomHaploid):
Randomly samples one allele per site rather than calling heterozygotes, which is problematic in damaged/low-coverage aDNA where allelic dropout and damage can create false heterozygote calls.
Note: For high-coverage modern data, use stricter thresholds (
-q 30 -Q 30) and consider diploid calling instead of--randomHaploid.
Output
PileupCaller outputs three EIGENSTRAT files: eigenstrat_output.geno, .snp, .ind
After a successful run, you should see output in your terminal similar to this:
ERR14088885 584131 20265 7.743468502784479e-2 7.743468502784479e-2 7.716077386750575e-2
Interpreting the results:
- TotalSites: Total AADR SNP positions attempted (584,131 for HO dataset)
- NonMissingCalls: Positions with sufficient coverage for genotype calls (20,265)
- avgRawReads: Mean coverage of raw pileup input across all sites (~0.077x or 7.7%)
- avgDamageCleanedReads: Mean coverage after single-stranded damage removal (~0.077x)
- avgSampledFrom: Mean coverage after removing tri-allelic reads (~0.077x)
This sample has ~3.47% of AADR positions covered. While this may seem low, it’s sufficient for ADMIXTOOLS analyses (f-statistics, qpAdm) and rough PCA projection with smartpca.
In the next post, I’ll show how to merge this sample with the AADR dataset using mergeit.
Now that you have your EIGENSTRAT files (.geno, .snp, .ind), you might want to delete intermediate files:
# Clean up intermediate files
rm v62_positions.bed ERR14088885_filtered.bam ERR14088885_filtered.bam.bai
Keep your trimmed BAM and BAI files if you’re interested in uniparental marker analysis (Y-DNA haplogroup assignment, mitochondrial analysis) or damage assessment with mapDamage. I’ll possibly cover Y-DNA inference using ybyra in a future post, so keep your trimmed BAMs if you plan to follow along.