Processing Ancient DNA: From FASTQ to Aligned BAM

This is the first post in a series where I’ll process an ancient DNA sample from raw FASTQ sequences to EIGENSTRAT format for use with ADMIXTOOLS. This is a complete walkthrough of processing ancient DNA from raw sequences to EIGENSTRAT format, with actual aDNA-specific BWA parameter settings that are rarely documented elsewhere.

System Requirements

All commands are written for Debian/Ubuntu-based Linux systems.

CPU: At least 8 cores recommended. BWA-MEM scales efficiently up to 12–16 threads; beyond this, you will likely encounter diminishing returns due to memory bandwidth saturation.
RAM Recommendation:
- 16 GB for moderate-coverage samples.
- 20–32 GB can be beneficial for deeply sequenced datasets, mainly to keep sorting and downstream processing in memory.
Storage & Network: A fast internet connection is recommended. While this tutorial uses a small sample of approximately 3.5 GB of compressed FASTQs, ancient DNA samples vary widely in sequencing depth and can easily reach or exceed 10 GB of compressed FASTQs per individual.

If you’re limited by hardware or bandwidth, consider using a cloud computing instance (Google Cloud Computing, AWS, or DigitalOcean). You can SSH into the instance and run the pipeline there. I use this approach since my internet speed is not the best.

If you’re running this on Windows Subsystem for Linux, work exclusively within the Linux filesystem (~/ or /home/username/), not in Windows directories (/mnt/c/). Accessing Windows filesystems from WSL incurs massive I/O overhead that can slow alignment extremely. All downloads, reference genomes, and BAM files should live in your Linux home directory.

What this Post covers

In this post I’ll cover installing BWA, Samtools, downloading and indexing the human reference genome (hs37d5), aligning paired-end ancient DNA reads with aDNA-optimized settings, and quality filtering to prepare BAM files for downstream analysis.

For this series I’ll process a sample from the study “Inequality at the dawn of the Bronze Age: the case of Başur Höyük, a ‘royal’ cemetery at the margins of the Mesopotamian world”. Public ancient DNA samples are usually uploaded to the European Nucleotide Archive (ENA), and in many cases also to the NCBI Sequence Read Archive (SRA). The project accession number is PRJEB83032 and the sample I’ll process has the run accession ERR14088885.

Since the study has published BAM files, I’ll first check what reference genome they used. We need the alignment to be against hs37d5 specifically for compatibility with the AADR dataset.

Note: they’re unaligned BAMs, so we’ll need to align the FASTQs from scratch (hence the title). In most cases, published BAMs are already aligned to a reference genome, which means you could skip this entire tutorial and go directly to genotype calling if that would be the case for the sample you want to process. But even with aligned BAMs, you’d still need to check the header to see which reference genome was used. For AADR compatibility we specifically need hs37d5. For this sample we’re not that lucky with pre-aligned files.

If the published BAMs had been aligned to any hg19/hg37 derivative (like GRCh37, hg19, or hs37d5), you typically wouldn’t need to realign from scratch. The main differences between these builds are chromosome naming conventions (e.g. ‘chr1’ or ‘1’) and some additional regions/decoys that aren’t used in analysis tools anyway. Chromosome names can be easily remapped during downstream processing. However, BAMs aligned to hg38 (GRCh38) are incompatible with AADR and would require realignment from FASTQs to hs37d5, as the coordinate systems differ substantially between these genome builds. Since these BAMs are completely unaligned, we need to start from the FASTQs.

Checking if the published BAMs are aligned

First, install samtools:

sudo apt update
sudo apt install samtools

Download the BAM file and index it:

wget ftp://ftp.sra.ebi.ac.uk/vol1/err/ERR140/085/ERR14088885/ERR14088885.bam

samtools index ERR14088885.bam

Check the header:

samtools view -H ERR14088885.bam | head -20

The output shows no @SQ (reference sequence) lines at all. This is an unaligned BAM (uBAM). Since we need aligned BAMs for downstream analysis anyway, I’ll start directly from the FASTQ files and perform a clean alignment from scratch. We can delete this BAM now since we won’t need it.

rm ERR14088885.bam ERR14088885.bam.bai

Downloading and Preparing the Reference Genome

First create a directory for the reference genome and download the hs37d5 fasta with wget:

# Install wget
sudo apt install wget

# Download the reference genome (compressed)
# The -P flag creates the hs37d5 directory automatically
wget -c -P hs37d5 https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz

Now install BWA and index the reference fasta. This is a one-time task:

# Install BWA
# Using system package manager (requires sudo):
sudo apt-get update
sudo apt-get install bwa

# OR using conda/mamba:
conda install -c bioconda bwa
# or
mamba install -c bioconda bwa

# Index the reference genome
bwa index hs37d5/hs37d5.fa.gz


# Index for samtools (needed for variant calling and other tools later on)
samtools faidx hs37d5/hs37d5.fa.gz

# Verify indexing completed, you should see these files:
# hs37d5.fa.gz.amb, hs37d5.fa.gz.ann, hs37d5.fa.gz.bwt, hs37d5.fa.gz.pac, hs37d5.fa.gz.sa
# hs37d5.fa.gz.fai (samtools index)
ls -lh hs37d5/hs37d5.fa.gz*

The indexing step is computationally intensive. On a system with 16GB RAM and 8 CPU cores, the indexing takes approximately 45 minutes (~2700 seconds total). In total, the directory will grow to approximately 5.9GB. This includes the original 851MB compressed reference and the ~5.1GB of generated index files (the .bwt file alone accounts for 3GB).

Downloading and Aligning the FASTQs

Next, we download the paired-end FASTQ files. FASTQ files store raw sequencing reads with quality scores. Alignment maps these reads to a reference genome. These samples consist of forward (_1) and reverse (_2) reads, so we need to download both files:

# Download paired-end reads
wget -c \
  ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR140/085/ERR14088885/ERR14088885_1.fastq.gz \
  ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR140/085/ERR14088885/ERR14088885_2.fastq.gz

Now we align the reads to the reference genome and convert to a sorted BAM file:

bwa mem -t 8 \
  -k 16 \
  -L 16,16 \
  hs37d5/hs37d5.fa.gz \
  ERR14088885_1.fastq.gz \
  ERR14088885_2.fastq.gz \
  | samtools view -Sb - \
  | samtools sort -@ 8 -o ERR14088885.bam -

Parameters explained:

-t 8: Uses 8 threads for parallel processing (adjust based on your available CPU cores)
-k 16: Minimum seed length of 16bp (shorter than default to capture short aDNA fragments, default: 19)
-L 16,16: Clipping penalty of 16 (prevents over-aggressive clipping of damaged aDNA ends)

Note on sensitivity: For very low-quality or highly damaged samples, you can add -r 2.5 to force more aggressive re-seeding when matches are noisy due to damage. This significantly increases runtime but can improve read recovery.

Regarding single-end reads: Some aDNA samples are sequenced as single-end rather than paired-end. If you encounter a sample with only one FASTQ file, simply provide that single file to bwa mem instead of two files. All other parameters can remain the same.

This alignment step is the most computationally intensive part of the pipeline. For this sample (~375000 read pairs), alignment took approximately 3.8 hours on 8 cores. Processing time will vary based on available computational resources.

After successful alignment, you’ll have a sorted BAM file containing reads mapped to the hs37d5 reference genome. Finally, we create an index to enable fast random access to the BAM file:

# Index BAM for efficient access
samtools index ERR14088885.bam

This creates ERR14088885.bam.bai, which is required by most downstream analysis tools.

Depth Statistics and BAM Filtering

After alignment and indexing, it’s useful to assess some basic statistics. We can calculate mean depth across different chromosomes to evaluate sample quality and determine gender:

# For specific chromosomes:
# Mean depth on autosomes (chr1-22)
samtools depth -r 1 ERR14088885.bam | awk '{sum+=$3; count++} END {print "Chr1 mean depth:", sum/count}'

# Mean depth on Y chromosome (indicates gender)
samtools depth -r Y ERR14088885.bam | awk '{sum+=$3; count++} END {print "ChrY mean depth:", sum/count}'

Interpreting results:

Autosomal depth: Indicates sequencing depth (average reads per position)
Y chromosome depth: High depth suggests a male individual; near-zero depth indicates female
For ancient DNA, even low depth (0.1-1x) can be valuable for certain analyses

For this particular sample, chromosome 1 has ~14x depth and the Y chromosome has ~4.86x depth, confirming a male individual. This Y chromosome depth should be sufficient for haplogroup assignment using specialized tools (like ybyra).

Filtering the BAM

Finally, we filter out low-quality alignments to improve data quality for downstream analysis:

# Filter alignments with permissive settings for aDNA
samtools view -b -q 20 -F 4 ERR14088885.bam > ERR14088885_filtered.bam

# Index the filtered BAM
samtools index ERR14088885_filtered.bam

Filter parameters explained:

-q 20: Keeps only reads with mapping quality ≥20 (reduces mismapped reads while being permissive enough for damaged aDNA)
-F 4: Excludes unmapped reads (flag 4)
-b: Output in BAM format

These relatively permissive settings are appropriate for ancient DNA, which often has lower mapping quality due to DNA damage and short fragment lengths. Stricter filtering (e.g. -q 30) would discard too much valuable aDNA data.

The filtered BAM is now ready for downstream analysis. At this stage, a modern DNA pipeline would typically move toward diploid genotype calling, especially since this sample shows a robust ~14x coverage on Chromosome 1. However, ancient DNA requires a more cautious approach.

According to the supplementary materials of the study, these libraries were non-UDG treated. This means the DNA fragments retain the characteristic post-mortem damage, specifically C → T transitions, that occur over millennia. In a diploid calling model, these damage “snps” would be incorrectly identified as true heterozygosity, potentially skewing our results in ADMIXTOOLS.

In the next post, I’ll cover how to handle this by using pileupCaller for random haploid calling. By randomly selecting a single high-quality allele at each SNP position, we avoid the biases introduced by DNA damage while maintaining compatibility with the AADR dataset. This will be the final step to generate the genotypes we need for EIGENSTRAT format.

Finally, we can delete the unfiltered BAM and FASTQs:

# Remove the unfiltered BAM
rm ERR14088885.bam ERR14088885.bam.bai

# Remove the FASTQ files (no longer needed after alignment)
rm ERR14088885_1.fastq.gz ERR14088885_2.fastq.gz

System Requirements#

What this Post covers#

Checking if the published BAMs are aligned#

Downloading and Preparing the Reference Genome#

Downloading and Aligning the FASTQs#

Depth Statistics and BAM Filtering#

Filtering the BAM#