Reference-Based vs De Novo Assembly: Choosing the Right Approach
Genome assembly can be tackled in two main ways: aligning reads to a reference genome, or building an assembly from scratch (de novo). Both strategies are powerful, but they shine in different situations. Here’s an expanded look at when to use each, which tools have performed best in my experience, and recommendations depending on whether you’re starting with short reads or long reads.
Reference-Based Assembly: Using a Genome Map
This approach aligns your sequencing reads to a known reference genome. It’s like reassembling a puzzle with the picture on the box.
Common Tools
- Aligners: BWA, BWA-MEM2, Minimap2, Bowtie2, HISAT2
- Variant Callers: DeepVariant, GATK HaplotypeCaller, FreeBayes, bcftools
Pros
- Fast and computationally efficient
- Ideal for organisms with high-quality reference genomes (e.g., human, mouse, yeast, some bacteria)
- Enables variant discovery with known genomic coordinates
Cons
- Bias toward the reference: novel insertions or rearrangements can be missed
- Doesn’t work well without a “close” reference genome
- Struggles in highly repetitive or rearranged regions
Why I’ll be using BWA-MEM2 for Illumina reads
For mapping short reads, I prefer BWA-MEM2: it produces alignments identical to those of the classic, widely used BWA-MEM but is typically ~1.3–3× faster thanks to architecture-aware optimizations. That speedup was reported in the original BWA-MEM2 acceleration work, and I will publish a benchmarking comparison of various mappers in a later post to provide my own evidence.
But what about Minimap2?
Minimap2 is extremely fast, and on short reads its mapping performance is similar to BWA's. Even so, I reserve Minimap2 for long reads. For short-read variant calling, the BWA family of mappers is still the most widely tested and integrates smoothly with downstream filters, so I default to BWA-MEM2 for Illumina data.
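To make the mapper choice above concrete, here is a minimal Python sketch that builds the command line for each read type. It only constructs the argument list and does not run anything. The function name, the read-type labels, and the file names are my own placeholders; the flags (`bwa-mem2 mem -t`, `minimap2 -a -x <preset> -t`) are the standard ones from each tool's documentation, but check them against the versions you have installed.

```python
import shlex

# Minimap2 presets for long-read platforms (hypothetical labels on the left,
# minimap2 preset names on the right).
MINIMAP2_PRESETS = {
    "ont": "map-ont",
    "pacbio-clr": "map-pb",
    "pacbio-hifi": "map-hifi",
}


def build_mapping_command(read_type, reference, reads, threads=8):
    """Return the argument list for mapping `reads` to `reference`.

    read_type: "illumina", "ont", "pacbio-clr", or "pacbio-hifi".
    reads: list of FASTQ paths (two files for paired-end Illumina data).
    """
    if read_type == "illumina":
        # Short reads: BWA-MEM2. The reference must already be indexed
        # with `bwa-mem2 index <reference>`.
        return ["bwa-mem2", "mem", "-t", str(threads), reference, *reads]
    if read_type in MINIMAP2_PRESETS:
        # Long reads: Minimap2 with SAM output (-a) and a platform preset (-x).
        preset = MINIMAP2_PRESETS[read_type]
        return ["minimap2", "-a", "-x", preset, "-t", str(threads),
                reference, *reads]
    raise ValueError(f"unknown read type: {read_type!r}")


if __name__ == "__main__":
    # Placeholder file names, for illustration only.
    cmd = build_mapping_command(
        "illumina", "ref.fa", ["sample_R1.fastq.gz", "sample_R2.fastq.gz"]
    )
    print(shlex.join(cmd))
```

Both mappers write SAM to standard output, so in practice the command is usually piped into `samtools sort` to produce a sorted BAM file for variant calling.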
De Novo Assembly: Starting from Scratch
Instead of aligning to an existing genome, de novo assembly builds longer contiguous sequences (contigs) based solely on overlaps found among your reads. This is essential when no suitable reference exists. I compare it to building a puzzle with no idea what the final picture is supposed to be.
Common Tools
- Short-read assemblers: SPAdes
- Long-read assemblers: Flye, Canu
- Hybrid assemblers: Unicycler
Pros
- Essential for new or poorly characterized organisms
- Captures novel sequences, insertions, mobile elements
- Excellent for assembling plasmids, viral genomes, and microbiomes
Cons
- Computationally demanding (CPU, RAM, Storage)
- Quality and completeness depend on read quality and coverage
- Can result in fragmented assemblies with short reads alone
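Since assembly quality depends on coverage, it helps to estimate it before you start. A rough sketch using the standard relationship coverage = (number of reads × mean read length) / genome size is below. The function names and the example numbers are my own illustrations, not recommended thresholds.

```python
import math


def estimated_coverage(n_reads, mean_read_length, genome_size):
    """Expected average depth: total sequenced bases divided by genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return n_reads * mean_read_length / genome_size


def reads_needed(target_coverage, mean_read_length, genome_size):
    """Smallest number of reads that reaches `target_coverage` on average."""
    if mean_read_length <= 0:
        raise ValueError("mean_read_length must be positive")
    return math.ceil(target_coverage * genome_size / mean_read_length)


if __name__ == "__main__":
    # Illustrative numbers: 2 million 150 bp reads on a 5 Mb bacterial genome.
    print(estimated_coverage(2_000_000, 150, 5_000_000))  # 60.0
```

This is an average only. Real coverage varies along the genome, so regions with extreme GC content or repeats can still end up under-covered.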
Short-read (Illumina) de novo assemblies
For de novo assembly of Illumina reads, SPAdes is often the standard choice (and my preferred assembler). Recent evaluations still find SPAdes-based pipelines reliable and repeatable for surveillance use.
Long-read (ONT or PacBio) de novo assemblies
Flye is my personal preference, since it often gives the best balance of contiguity and correctness for bacterial genomes; Canu is a strong second option. One caution: long-read assemblers can miss small plasmids, so consider complementary strategies if plasmids matter.
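Contiguity is commonly summarized with N50: the contig length at which contigs of that length or longer contain at least half of the assembled bases. A minimal sketch of the calculation follows; the function name and the example lengths are made up for illustration.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("no contigs given")
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in lengths:
        running += length
        if running >= half_total:
            return length


if __name__ == "__main__":
    # Total is 290, half is 145; 80 + 70 = 150 crosses it, so N50 is 70.
    print(n50([80, 70, 50, 40, 30, 20]))  # 70
```

A high N50 says nothing about correctness on its own, because a misjoined assembly can also score well.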
Hybrid (short reads + long reads) de novo assemblies
Hybrid assemblies are generally more accurate at the base level than long-read-only assemblies, because the short reads are used to correct the sequencing errors that are common in long reads.
For completing small genomes, Unicycler seems to remain the community favorite since it combines long-read structure with short-read polishing to deliver very accurate, circularized assemblies.
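The assembler choices above (SPAdes for short reads, Flye for long reads, Unicycler for hybrid) can be sketched as a small command builder. As before, this only constructs argument lists. The function name, defaults, and file names are my own placeholders, and `--nano-raw` is just one of Flye's input modes (others include `--nano-hq`, `--pacbio-raw`, and `--pacbio-hifi`), so pick the one that matches your data.

```python
def build_assembly_command(short_reads=None, long_reads=None,
                           out_dir="assembly", threads=8,
                           flye_mode="--nano-raw"):
    """Return the argument list for a de novo assembly.

    short_reads: (R1, R2) tuple of paired-end FASTQ paths, or None.
    long_reads: path to a long-read FASTQ, or None.
    """
    if short_reads and long_reads:
        # Hybrid: Unicycler takes paired short reads plus long reads.
        r1, r2 = short_reads
        return ["unicycler", "-1", r1, "-2", r2, "-l", long_reads,
                "-o", out_dir, "-t", str(threads)]
    if short_reads:
        # Short reads only: SPAdes.
        r1, r2 = short_reads
        return ["spades.py", "-1", r1, "-2", r2,
                "-o", out_dir, "-t", str(threads)]
    if long_reads:
        # Long reads only: Flye, with an input mode matching the platform.
        return ["flye", flye_mode, long_reads,
                "--out-dir", out_dir, "--threads", str(threads)]
    raise ValueError("provide short_reads, long_reads, or both")


if __name__ == "__main__":
    # Placeholder file names, for illustration only.
    print(build_assembly_command(short_reads=("s_R1.fq.gz", "s_R2.fq.gz"),
                                 long_reads="ont.fq.gz"))
```

The returned list can be passed straight to `subprocess.run(cmd, check=True)` once the tools are installed.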
Quick Comparison Table
Feature | Reference-Based | De Novo Assembly |
---|---|---|
Reference needed | Yes | No |
Novel genome detection | Limited | Ideal |
Computational needs | Low/Moderate | High |
Tool maturity | Mature | Mature (short reads), improving (long reads) |
Best for | Known genomes, variant calling | Novel organisms, structure identification |
When to Use What?
Choose reference-based if:
- You’re working on a model organism
- You want to call SNPs or small indels, or quantify gene expression levels
- You have short reads and a reliable reference
Choose de novo if:
- You’re sequencing something novel or poorly assembled
- You’re targeting structural variants or rearrangements
- You need full genome reconstruction (bacteria, viruses)
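The checklist above can be condensed into a small helper. This is a deliberately simplified sketch of the rules in this post, not a substitute for judgment: the goal labels and the function name are my own, and I assume that lacking a close reference always pushes you to de novo.

```python
# Goals this post associates with each strategy (labels are hypothetical).
REFERENCE_GOALS = {"snps", "small_indels", "expression"}
DE_NOVO_GOALS = {"structural_variants", "full_genome"}


def choose_approach(goal, has_close_reference):
    """Return "reference-based" or "de novo" for a given goal."""
    if goal not in REFERENCE_GOALS | DE_NOVO_GOALS:
        raise ValueError(f"unknown goal: {goal!r}")
    if not has_close_reference:
        # Without a reliable, closely related reference there is
        # nothing trustworthy to align against.
        return "de novo"
    return "reference-based" if goal in REFERENCE_GOALS else "de novo"


if __name__ == "__main__":
    print(choose_approach("snps", has_close_reference=True))
    print(choose_approach("structural_variants", has_close_reference=True))
```

Real projects often mix the two, for example assembling de novo and then mapping reads back to the assembly, so treat the output as a starting point.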
What’s Coming Next
In the next three posts, we’ll walk through our first set of hands-on tutorials to finish off the series:
- Mapping reads to a reference and calling variants
- Performing de novo assembly using all three methods
- Comparing and evaluating assembly quality and completeness