Dynamic mappers of NGS reads
Karel Břinda (LIGM Université Paris-Est) Valentina Boeva (Institut Curie) Gregory Kucherov (LIGM Université Paris-Est)
Read mapping is a bottleneck in NGS data processing (e.g., for variant calling).
A lot of effort is constantly invested into the development of mappers.
None of them supports dynamic updates of the reference during the mapping
Only a few papers on this topic exist:
Genome for DNA Read Alignment. Seminar work, Harvard University, 2013.
referencing for improving the interpretation of DNA sequence data. Technical report, Technion, Israel, 2013.
mapping short reads to a dynamically changing genomic sequence. Journal of Discrete Algorithms 10, 2012.
[Figure: static mapping. READS (1, 2, ..., n) -> MAPPER (static mapper, reference index) -> OUTPUT (SAM/BAM file). One iteration: read mapping.]
[Figure: iterative referencing. READS (1, 2, ..., n) -> MAPPER (static mapper, reference index, statistics) -> OUTPUT (SAM/BAM file). Each iteration: read mapping, pileup, consensus, update of the reference.]
[Figure: dynamic mapping. READS (1, 2, ..., n) -> MAPPER (dynamic mapper, reference index, statistics) -> OUTPUT (SAM/BAM file). Each iteration: read mapping, update of the reference.]
Memory requirements Speed Quality of alignment Iterative referencing
+
Dynamic mapping
+
Static mapping
+ ++
Two basic types of mappers:
Data structures must be dynamic
Dynamic FM-index – already studied:
Wheeler transform. Theoretical Computer Science 410(43), 2009.
Algorithms 8(2), 2010.
To make updates, it is necessary to keep simplified pileups (nucleotide counts in an alignment column). It is difficult to deal with insertions. The coordinates of already mapped reads can change during the mapping.
initial placeholders (‘*’ character), final small post-processing corrections of the SAM file.
Example (memory needed for statistics for a single nucleotide):
‘A’ counter: 3 bits
‘C’ counter: 3 bits
‘G’ counter: 3 bits
‘T’ counter: 3 bits
DEL counter: 3 bits
Sum: 15 bits
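A minimal sketch of how five 3-bit counters could share one 15-bit integer. The field order, the function names, and the overflow policy (halving all counters when one would exceed 7) are assumptions made for illustration; the slides only give the bit widths.

```python
# Hypothetical layout: five 3-bit counters packed into one 15-bit integer.
FIELDS = ("A", "C", "G", "T", "DEL")  # field order is an assumption
BITS = 3
MAX_COUNT = (1 << BITS) - 1           # largest value of a 3-bit counter (7)


def get_count(packed, symbol):
    """Read the counter of `symbol` from the packed integer."""
    shift = FIELDS.index(symbol) * BITS
    return (packed >> shift) & MAX_COUNT


def halve_all(packed):
    """Halve every counter (assumed policy to keep ratios when one is full)."""
    result = 0
    for i in range(len(FIELDS)):
        shift = i * BITS
        result |= (((packed >> shift) & MAX_COUNT) >> 1) << shift
    return result


def increment(packed, symbol):
    """Return the packed integer with the counter of `symbol` increased by 1."""
    if get_count(packed, symbol) == MAX_COUNT:
        packed = halve_all(packed)
    return packed + (1 << (FIELDS.index(symbol) * BITS))


def consensus(packed):
    """Symbol with the highest count; ties go to the earlier field."""
    return max(FIELDS, key=lambda s: get_count(packed, s))
```

With this layout the statistics for one reference position never exceed 15 bits, at the price of losing absolute counts once a counter saturates.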
Example (padded reference, an insertion at pos. 14):
Position:  1     3     5     7     9     11    13    15    17    19
Reference: C  *  *  A  *  *  G  *  *  C  *  *  G  C  *  C  *  *  A  *  …
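The padded-reference example can be reproduced with a short sketch. It assumes two placeholders after every nucleotide (as the example suggests) and uses hypothetical helper names; an insertion fills a free placeholder, so the coordinates of all other positions stay unchanged.

```python
PLACEHOLDER = "*"


def pad_reference(sequence, slots=2):
    """Follow every nucleotide with `slots` placeholders (slots=2 is an assumption)."""
    padded = []
    for base in sequence:
        padded.append(base)
        padded.extend(PLACEHOLDER * slots)
    return padded


def insert_base(padded, position, base):
    """Store an inserted base in the free placeholder at a 1-based position."""
    if padded[position - 1] != PLACEHOLDER:
        raise ValueError("no free placeholder at this position")
    padded[position - 1] = base


def unpadded(padded):
    """Sequence without placeholders, e.g. for the final SAM post-processing."""
    return "".join(c for c in padded if c != PLACEHOLDER)


# Reference before the insertion, taken from the example: C A G C G C A
ref = pad_reference("CAGCGCA")
insert_base(ref, 14, "C")
```

After the insertion the first 20 padded positions read `C**A**G**C**GC*C**A*`, matching the example, and the nucleotides at positions 16 and 19 keep their coordinates.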
When the reference sequence changes too much, some of the already mapped reads should be remapped or unmapped.
Possible solution:
and take only the last reported alignments for each read
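Taking only the last reported alignment for each read could look like the following sketch. It assumes single-end reads (one record per read name) and SAM text lines as input; the function name is hypothetical.

```python
def last_alignments(sam_lines):
    """Keep header lines and, for every read name, only its last SAM record."""
    header = []
    last = {}
    for line in sam_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("@"):        # SAM header lines start with '@'
            header.append(line)
            continue
        qname = line.split("\t", 1)[0]  # first SAM column is the read name
        last.pop(qname, None)           # re-insert so output follows last-report order
        last[qname] = line
    return header + list(last.values())
```

A read that was remapped after an update of the reference then appears once in the output, with the coordinates from its most recent mapping.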
Reference: ...AAAAATATATATATCGATCTGC... CC _
Reads:
1: ATCTATATATCG
2: CCGATCTGC
3: CCCGATCTG
4: ATCCCGATC
[Figure: dynamic mapper. READS (1, 2, ..., n) -> MAPPER (dynamic mapper, reference index, statistics) -> OUTPUT (SAM/BAM file). Each iteration: read mapping, update of the reference.]
[Figure: static mapper. READS (1; then 1, 2; ...; then 1, 2, ..., n) -> MAPPER (static mapper, reference index, statistics) -> OUTPUT (SAM/BAM file). Each iteration: read mapping, pileup, consensus, update of the reference.]
1 𝑒 iterations)
[Figure: static mapper. READS (batches of d reads) -> MAPPER (static mapper, reference index, statistics) -> OUTPUT (SAM/BAM file). Each iteration: read mapping, pileup, consensus, update of the reference.]
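The update loop in these figures can be illustrated with a toy sketch. Everything here is an assumption for illustration: the brute-force `best_position` stands in for a real mapper, only substitutions are handled, reads are not remapped, and `min_support` is a hypothetical guard against updating from a single read. With `batch_size=1` the reference is updated after every read; with `batch_size=d` it is updated after every d reads.

```python
from collections import Counter


def best_position(reference, read):
    """Stand-in for a real mapper: position with the fewest mismatches."""
    def mismatches(pos):
        window = reference[pos:pos + len(read)]
        return sum(a != b for a, b in zip(window, read))
    return min(range(len(reference) - len(read) + 1), key=mismatches)


def map_with_updates(reference, reads, batch_size, min_support=2):
    """Map reads one by one; update the reference after every `batch_size` reads."""
    reference = list(reference)
    pileup = [Counter() for _ in reference]   # simplified pileup: counts per column
    positions = []
    for i, read in enumerate(reads, start=1):
        pos = best_position(reference, read)
        positions.append(pos)
        for offset, base in enumerate(read):
            pileup[pos + offset][base] += 1
        if i % batch_size == 0:
            for column, counts in enumerate(pileup):
                if counts:
                    base, support = counts.most_common(1)[0]
                    if support >= min_support:
                        reference[column] = base   # consensus replaces the base
    return "".join(reference), positions
```

For example, three reads `AAAACTCC`, `AACTCCGG`, `CTCCGGGG` mapped against `AAAACCCCGGGGTTTT` turn the reference into `AAAACTCCGGGGTTTT` for both `batch_size=1` and `batch_size=3`; the difference is only when the later reads start to see the corrected base.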
Goals:
Implementation:
incorporated)
Typical approach:
1. Taking several mappers as black boxes.
2. Simulating reads.
3. Mapping by the selected mappers.
4. Applying the same threshold on mapping qualities for all reads.
5. Comparing.
…it is not very useful.
Source: Heng Li: Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM, arXiv:1303.3997
Threshold 20 (on mapping qualities)
It is important to consider all thresholds on mapping qualities!
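Considering all thresholds on mapping qualities can be sketched as a simple sweep. The input format is an assumption: one `(mapq, is_correct)` pair per simulated read, with `mapq` set to `None` for unmapped reads; the function name and the maximum quality of 60 are also hypothetical.

```python
def sweep_thresholds(reads, max_mapq=60):
    """For every threshold, report the share of reads kept and their error rate.

    Returns rows (threshold, kept reads in % of all reads,
    fraction of wrongly mapped reads among the kept reads).
    """
    total = len(reads)
    rows = []
    for threshold in range(max_mapq + 1):
        kept = [ok for mapq, ok in reads
                if mapq is not None and mapq >= threshold]
        if not kept:
            continue                     # nothing passes this threshold
        wrong = sum(1 for ok in kept if not ok)
        rows.append((threshold, 100.0 * len(kept) / total, wrong / len(kept)))
    return rows


# Hypothetical evaluation data: (mapping quality, mapped to the true position?)
example = [(60, True), (60, True), (30, True), (20, False),
           (10, False), (0, True), (None, False)]
curve = sweep_thresholds(example)
```

Plotting the second column against the third for each mapper gives one curve per mapper instead of a single point at one fixed threshold.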
A new evaluation software for comparing alignments (C++, Python)
It creates interactive HTML reports for a set of BAM files
Support of:
Availability
at karel.brinda@univ-mlv.fr
Example of a comparison
[Plot axes: fraction of wrongly mapped reads in mapped reads; part of all reads in %]
Mappers: BWA-ALN, BWA-MEM
Reference genomes: a bacterium (Borrelia crocidurae), human chromosome 21
Mutation rates: 0.01 – 0.05 for BWA-ALN, 0.15 for BWA-MEM
Sequencing error rate: 0.01
Read length: 100
Read simulator: DWGSim
Evaluator: LAVEnder
[Result plots, one pair of slides per configuration: "Mapping of all reads without any updates", then "Dynamic mapping" and "Iterative referencing". Common settings: rate of seq. errors 0.01, read length 100, average coverage 10.]
BWA-ALN, Borrelia crocidurae, rate of mutations 0.01
BWA-ALN, Human chromosome 21, rate of mutations 0.01
BWA-ALN, Borrelia crocidurae, rate of mutations 0.03
BWA-ALN, Human chromosome 21, rate of mutations 0.03
BWA-ALN, Borrelia crocidurae, rate of mutations 0.05
BWA-ALN, Human chromosome 21, rate of mutations 0.05
BWA-MEM, Borrelia crocidurae, rate of mutations 0.15
BWA-MEM, Human chromosome 21, rate of mutations 0.15
BWA-ALN)
employed (e.g., 15%+1%, BWA-MEM)
spot regions)
improvement (especially for, e.g., variant calling)
Gregory Kucherov Valentina Boeva