Reference-Based vs De Novo Assembly: Choosing the Right Approach

Genome assembly can be tackled in two main ways: aligning reads to a reference genome, or building an assembly from scratch (de novo). Both strategies are powerful, but they shine in different situations. Here’s an expanded look at when to use each, which tools perform best in my experience, and recommendations based on whether you’re starting with short reads or long reads.

Reference-Based Assembly: Using a Genome Map

This approach aligns your sequencing reads to a known reference genome. It’s like reassembling a puzzle with the picture on the box.

Common Tools

Aligners: BWA, BWA-MEM2, Minimap2, Bowtie2, HISAT2
Variant Callers: DeepVariant, GATK HaplotypeCaller, FreeBayes, bcftools

Pros

Fast and computationally efficient
Ideal for organisms with high-quality reference genomes (e.g., human, mouse, yeast, some bacteria)
Enables variant discovery with known genomic coordinates

Cons

Bias toward the reference: novel insertions or rearrangements can be missed
Doesn’t work well without a “close” reference genome
Struggles in highly repetitive or rearranged regions
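
To make the reference-bias point concrete, here is a toy sketch in Python. It is not how BWA or Minimap2 work internally (real aligners tolerate mismatches and gaps and can partially align a read); it only places reads by exact match, and the sequences and the inserted `TTTTTT` segment are made up for illustration.

```python
def map_reads(reference, reads):
    """Place each read at its first exact match in the reference (toy model)."""
    mapped, unmapped = {}, []
    for read in reads:
        pos = reference.find(read)
        if pos >= 0:
            mapped[read] = pos
        else:
            unmapped.append(read)
    return mapped, unmapped

# Hypothetical 20-base reference and a sample carrying a novel 6-base insertion.
reference = "ACGTACGGTCAAGCTTAGGC"
sample = reference[:10] + "TTTTTT" + reference[10:]

# Three 8-base reads tiled across the sample genome.
reads = [sample[0:8], sample[8:16], sample[16:24]]
mapped, unmapped = map_reads(reference, reads)

print("mapped:", mapped)      # reads that match the reference
print("unmapped:", unmapped)  # the read spanning the insertion has nowhere to go
```

The read that covers the insertion is simply left out, which is the situation in which a reference-only analysis can miss novel sequence.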

Why I’ll be using BWA-MEM2 for Illumina reads

For mapping short reads, BWA-MEM2 is my preference: it produces alignments identical to those of the classic, widely used BWA-MEM algorithm but is typically ~1.3–3× faster thanks to architecture-aware optimizations. That speedup was reported in the original BWA-MEM2 acceleration work, and I will publish a benchmarking comparison of various mappers in a later post to provide my own evidence.

But what about Minimap2?

Minimap2 is extremely fast and performs comparably to BWA on short reads. However, I prefer to reserve Minimap2 for long reads. For short-read variant calling, the BWA family of mappers is still the most widely tested and integrates smoothly with downstream filters, so I default to BWA-MEM2 for Illumina data.
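
As a rough sketch of that split (BWA-MEM2 for Illumina, Minimap2 for long reads), the Python helper below only builds the shell commands as strings and does not run anything. The function name, file names, and thread count are placeholders I made up; the tool flags used are the standard ones (`bwa-mem2 index`, `bwa-mem2 mem -t`, `minimap2 -ax <preset>`, `samtools sort -@ -o`, `samtools index`), but check them against the versions you have installed.

```python
import shlex

# minimap2 presets for long reads; Illumina data goes through BWA-MEM2.
MINIMAP2_PRESETS = {"ont": "map-ont", "pacbio-hifi": "map-hifi"}

def mapping_commands(reference, reads, platform="illumina", threads=4,
                     out_bam="aligned.sorted.bam"):
    """Return shell command strings: (index) -> align | sort -> index BAM."""
    commands = []
    if platform == "illumina":
        # BWA-MEM2 needs a prebuilt index of the reference.
        commands.append(shlex.join(["bwa-mem2", "index", reference]))
        align = ["bwa-mem2", "mem", "-t", str(threads), reference, *reads]
    elif platform in MINIMAP2_PRESETS:
        # minimap2 can index the reference on the fly.
        align = ["minimap2", "-ax", MINIMAP2_PRESETS[platform],
                 "-t", str(threads), reference, *reads]
    else:
        raise ValueError(f"unsupported platform: {platform}")
    sort = ["samtools", "sort", "-@", str(threads), "-o", out_bam, "-"]
    commands.append(shlex.join(align) + " | " + shlex.join(sort))
    commands.append(shlex.join(["samtools", "index", out_bam]))
    return commands

# Hypothetical file names, for illustration only.
for cmd in mapping_commands("ref.fa", ["r1.fq", "r2.fq"]):
    print(cmd)
```

Printing the commands first makes it easy to review them before pasting them into a terminal or a workflow script.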

De Novo Assembly: Starting from Scratch

Instead of aligning to an existing genome, de novo assemblies generate longer contiguous sequences (contigs) based solely on overlaps found in your reads. This is essential when no suitable reference exists. I compare it to building a puzzle where you have no idea what the final picture is supposed to be.
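
The idea of building contigs purely from read overlaps can be sketched with a toy greedy merger in Python. This is a deliberately naive illustration under my own assumptions (error-free reads, a single strand, no repeats); production assemblers such as SPAdes and Flye use graph-based methods instead, and the reads here are invented.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(reads)
    while True:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best[0]:
                        best = (size, i, j)
        size, i, j = best
        if size == 0:
            # No remaining overlaps: whatever is left are the final contigs.
            return contigs
        merged = contigs[i] + contigs[j][size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]

# Three made-up reads sampled from the 15-base sequence ATGGCGTGCAATCCG.
reads = ["CAATCCG", "ATGGCGTG", "CGTGCAAT"]
print(greedy_assemble(reads))  # ['ATGGCGTGCAATCCG']
```

With clean, well-overlapping reads the pieces collapse into one contig; with repeats or gaps in coverage the same procedure stalls and leaves several contigs.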

Common Tools

Short Reads: SPAdes, Velvet, SOAPdenovo2
Long Reads: Flye, Canu, Shasta
Hybrid: Unicycler, MaSuRCA

Pros

Essential for new or poorly characterized organisms
Captures novel sequences, insertions, mobile elements
Excellent for assembling plasmids, viral genomes, and microbiomes

Cons

Computationally demanding (CPU, RAM, storage)
Quality and completeness depend on read quality and coverage
Can result in fragmented assemblies with short reads alone
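
Since coverage drives assembly quality, it helps to estimate it before starting. A common back-of-the-envelope formula is average depth = number of reads × read length / genome size. The Python sketch below applies it; the read count, read length, genome size, and 50x target are hypothetical numbers chosen only for the example.

```python
import math

def expected_coverage(num_reads, read_length, genome_size):
    """Average depth C = N * L / G."""
    return num_reads * read_length / genome_size

def reads_needed(target_coverage, read_length, genome_size):
    """Smallest read count that reaches the target average depth."""
    return math.ceil(target_coverage * genome_size / read_length)

# Hypothetical run: 1 million 150 bp reads on a 5 Mb bacterial genome.
depth = expected_coverage(1_000_000, 150, 5_000_000)
needed = reads_needed(50, 150, 5_000_000)
print(f"{depth:.1f}x coverage; {needed:,} reads for 50x")
# prints: 30.0x coverage; 1,666,667 reads for 50x
```

For paired-end data, count each mate as its own read. The result is only an average, so individual regions can still end up with much lower depth.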

Short-read (Illumina) de novo assemblies

For de novo assembly with Illumina data, SPAdes is often the standard choice (and my preferred assembler). Recent evaluations still find SPAdes-based pipelines reliable and repeatable for surveillance use.

Long-read (ONT or PacBio) de novo assemblies

Flye is my personal preference, since it often gives the best balance of contiguity and correctness for bacterial genomes; Canu is a strong second option. One caution: long-read assemblers can miss small plasmids, so consider complementary strategies if plasmids matter.

Hybrid (short reads + long reads) de novo assemblies

Hybrid assemblies are generally more accurate than long-read-only assemblies because they use the short reads to correct the sequencing errors that are common in long reads.

For completing small genomes, Unicycler seems to remain the community favorite since it combines long-read structure with short-read polishing to deliver very accurate, circularized assemblies.
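
To tie the three de novo routes together, here is a small Python sketch that builds (but does not run) a basic command for SPAdes, Flye, or Unicycler. The function name, file names, output directories, and defaults are placeholders of mine. The flags are the basic documented ones (`spades.py -1/-2/-o/-t`, `flye --nano-raw` or `--pacbio-hifi` with `--out-dir` and `--threads`, `unicycler -1/-2/-l/-o/-t`), and real runs usually need more options than shown here.

```python
import shlex

def assembly_command(strategy, out_dir, short_reads=None, long_reads=None,
                     threads=8, long_read_type="nano-raw"):
    """Build a basic de novo assembly command for the chosen strategy."""
    if strategy == "short":
        r1, r2 = short_reads  # paired-end Illumina files
        cmd = ["spades.py", "-1", r1, "-2", r2, "-o", out_dir, "-t", str(threads)]
    elif strategy == "long":
        # long_read_type maps to a Flye input flag, e.g. nano-raw or pacbio-hifi.
        cmd = ["flye", f"--{long_read_type}", long_reads,
               "--out-dir", out_dir, "--threads", str(threads)]
    elif strategy == "hybrid":
        r1, r2 = short_reads
        cmd = ["unicycler", "-1", r1, "-2", r2, "-l", long_reads,
               "-o", out_dir, "-t", str(threads)]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return shlex.join(cmd)

# Hypothetical file names, for illustration only.
illumina = ("r1.fq.gz", "r2.fq.gz")
print(assembly_command("short", "spades_out", short_reads=illumina))
print(assembly_command("long", "flye_out", long_reads="ont.fq.gz"))
print(assembly_command("hybrid", "unicycler_out", short_reads=illumina,
                       long_reads="ont.fq.gz"))
```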

Quick Comparison Table

Feature	Reference-Based	De Novo Assembly
Reference needed	Yes	No
Novel genome detection	Limited	Ideal
Computational needs	Low/Moderate	High
Tool maturity	Mature	Mature (short reads) and Improving (long reads)
Best for	Known genomes, variant calling	Novel organisms, structure identification

When to Use What?

Choose reference-based if:

You’re working on a model organism
You want to call SNPs or small indels, or quantify gene expression levels
You have short reads and a reliable reference

Choose de novo if:

You’re sequencing something novel or poorly assembled
You’re targeting structural variants or rearrangements
You need full genome reconstruction (bacteria, viruses)
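
The two checklists above can be condensed into a rough rule of thumb. The Python sketch below is only my simplification of them: the goal labels are invented for the example, and real projects often combine both approaches.

```python
# Goal labels are made up for this sketch.
REFERENCE_GOALS = {"snps", "small_indels", "expression"}
DE_NOVO_GOALS = {"structural_variants", "full_reconstruction"}

def choose_approach(has_reliable_reference, goal):
    """Suggest an approach from the checklists; a heuristic, not a rule."""
    if goal not in REFERENCE_GOALS | DE_NOVO_GOALS:
        raise ValueError(f"unknown goal: {goal}")
    if not has_reliable_reference or goal in DE_NOVO_GOALS:
        return "de novo"
    return "reference-based"

print(choose_approach(True, "snps"))                 # reference-based
print(choose_approach(False, "snps"))                # de novo
print(choose_approach(True, "structural_variants"))  # de novo
```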

What’s Coming Next

In the next three posts, we’ll walk through our first set of hands-on tutorials to finish off the series:

Mapping reads to a reference and calling variants
Performing de novo assembly using all three methods (short-read, long-read, and hybrid)
Comparing and evaluating assembly quality and completeness

Mario F. Bisconti
