Dynamic mappers of NGS reads, Karel Břinda (LIGM Université Paris-Est)



slide-1
SLIDE 1

Dynamic mappers of NGS reads

Karel Břinda (LIGM Université Paris-Est) Valentina Boeva (Institut Curie) Gregory Kucherov (LIGM Université Paris-Est)

slide-2
SLIDE 2

Introduction

Read mapping is a bottleneck in NGS data processing (e.g., for variant calling).

A lot of effort is constantly invested in the development of new mappers.

None of them supports dynamic updates of the reference during the mapping.

slide-3
SLIDE 3

Idea: update reference during the mapping

Only a few papers on this topic exist:

  • J. Pritt. Efficiently Improving the Reference Genome for DNA Read Alignment. Seminar work, Harvard University, 2013.
  • A. Ghanayim and D. Geiger. Iterative referencing for improving the interpretation of DNA sequence data. Technical report, Technion, Israel, 2013.
  • C. S. Iliopoulos et al. An algorithm for mapping short reads to a dynamically changing genomic sequence. Journal of Discrete Algorithms 10, 2012.

slide-4
SLIDE 4

Mapping – from static to dynamic

1. Static mapping
  • Classical mappers, no updates
2. Iterative referencing
  • A standard mapper is used; mapping is followed by variant calling, repeated over many iterations
3. Dynamic mapping
  • The mapper dynamically updates its index according to the reads already mapped
slide-5
SLIDE 5

1) Static mapping (standard mappers)

[Diagram: READS (1, 2, …, n) → MAPPER (static mapper, reference index; 1 iteration of read mapping) → OUTPUT (SAM/BAM file)]

slide-6
SLIDE 6

2) Iterative referencing (Ghanayim&Geiger, 2013)

[Diagram: READS (1, 2, …, n, passed repeatedly) → MAPPER (static mapper, reference index, statistics; each iteration: read mapping, pileup and consensus, update of the reference) → OUTPUT (SAM/BAM file)]

slide-7
SLIDE 7

3) Dynamic mapping (no existing mapper until now)

[Diagram: READS (1, 2, …, n) → MAPPER (dynamic mapper, reference index, statistics; each iteration: read mapping, update of the reference) → OUTPUT (SAM/BAM file)]

slide-8
SLIDE 8

Estimating the usefulness

Approach | Memory requirements | Speed | Quality of alignment
Iterative referencing | + | o | ++
Dynamic mapping | o | + | +
Static mapping | + | ++ |

slide-9
SLIDE 9

Dynamic mappers

slide-10
SLIDE 10

Difficulties – dynamic data structures

Two basic types of mappers:

  • FM-index based (e.g., BWA-ALN, BWA-SW, BWA-MEM, GEM, etc.)
  • Hash-table based (e.g., SHRiMP 2, SToRM, etc.)

Data structures must be dynamic

  • Difficult to make dynamic versions
  • More memory needed
  • Worse cache-optimization (=> significant decrease of speed)

Dynamic FM-index – already studied:

  • M. Salson, T. Lecroq, M. Léonard, and L. Mouchard. A four-stage algorithm for updating a Burrows–Wheeler transform. Theoretical Computer Science 410(43), 2009.
  • M. Salson, T. Lecroq, M. Léonard, and L. Mouchard. Dynamic extended suffix arrays. Journal of Discrete Algorithms 8(2), 2010.

  • Implementation: http://dfmi.sourceforge.net/
slide-11
SLIDE 11

Difficulties – statistics and reference

To make updates, it is necessary to keep simplified pileups (nucleotide counts in an alignment column).

It is difficult to deal with insertions. The coordinates of already mapped reads can change during the mapping.

  • Possible solution: a padded reference with many initial placeholders (‘*’ character), followed by small final post-processing corrections of the SAM file.

Example (memory needed for statistics for a single nucleotide):

‘A’ counter | ‘C’ counter | ‘G’ counter | ‘T’ counter | DEL counter | Sum
3 bits | 3 bits | 3 bits | 3 bits | 3 bits | 15 bits
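The 15-bit budget per nucleotide can be illustrated with a small Python sketch. The bit layout, the function names, and the overflow rule (halving all counters when one would exceed 7) are assumptions made for illustration only; the slides state only the counter sizes.

```python
# Hypothetical layout: five 3-bit counters (A, C, G, T, DEL) packed
# into a single 15-bit integer, A in the lowest bits.
SYMBOLS = ("A", "C", "G", "T", "DEL")
BITS = 3
MAX_COUNT = (1 << BITS) - 1  # largest value a 3-bit counter can hold (7)


def get_count(packed: int, symbol: str) -> int:
    """Read one counter out of the packed 15-bit value."""
    shift = SYMBOLS.index(symbol) * BITS
    return (packed >> shift) & MAX_COUNT


def increment(packed: int, symbol: str) -> int:
    """Return the packed value with the counter of `symbol` increased by one.

    Assumption: if the counter is already full, all five counters are halved
    first, which roughly preserves their ratios and avoids overflow.
    """
    if get_count(packed, symbol) == MAX_COUNT:
        packed = sum(
            (get_count(packed, s) >> 1) << (i * BITS)
            for i, s in enumerate(SYMBOLS)
        )
    return packed + (1 << (SYMBOLS.index(symbol) * BITS))


def consensus(packed: int) -> str:
    """Symbol with the highest count (the first one in SYMBOLS wins ties)."""
    return max(SYMBOLS, key=lambda s: get_count(packed, s))
```

Starting from `0` and calling `increment` once per aligned base keeps the statistics for a whole reference at 15 bits per position, at the price of approximate counts once a counter saturates.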

Example (padded reference, an insertion at pos. 14):

Position:    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
Reference:   C  *  *  A  *  *  G  *  *  C  *  *  G  C  *  C  *  *  A  *  …
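A minimal Python sketch of the padded-reference idea follows. It assumes two placeholders after every nucleotide, as in the example on this slide; the function names and the coordinate-conversion rule are hypothetical and only illustrate the kind of post-processing the SAM file would need.

```python
PAD = "*"


def pad_reference(reference: str, slots: int = 2) -> str:
    """Follow every nucleotide with `slots` placeholder characters."""
    return "".join(base + PAD * slots for base in reference)


def insert_base(padded: str, position: int, base: str) -> str:
    """Fill the placeholder at the 1-based `position` with an inserted base.

    Coordinates of all other characters stay unchanged, so reads mapped
    earlier keep valid positions in the padded coordinate system.
    """
    index = position - 1
    if padded[index] != PAD:
        raise ValueError("no free placeholder at this position")
    return padded[:index] + base + padded[index + 1:]


def unpadded_position(padded: str, position: int) -> int:
    """Number of real bases among the first `position` padded characters.

    For a position holding a real base this is its 1-based coordinate in
    the sequence obtained by removing all placeholders.
    """
    return sum(1 for ch in padded[:position] if ch != PAD)
```

For the slide's example, `insert_base(pad_reference("CAGCGCA"), 14, "C")` yields a sequence that starts with `C**A**G**C**GC*C**A*`, and the inserted base ends up at unpadded coordinate 6.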

slide-12
SLIDE 12

Difficulties – remapping, unmapping

When the reference sequence changes too much, some of the already mapped reads should be remapped or unmapped.

Possible solutions:

  • Ignore it
  • Iterate over the set of reads several times and take only the last reported alignment for each read

Reference: ...AAAAATATATATATCGATCTGC... (figure annotation: CC _)

Reads:
1: ATCTATATATCG
2: CCGATCTGC
3: CCCGATCTG
4: ATCCCGATC
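The second solution (several passes over the reads, keeping only the last reported alignment of each read) can be sketched as follows. The record format, a pair of read name and position with `None` meaning unmapped, is an assumption for this illustration and not the SAM format itself.

```python
from typing import Dict, Iterable, Optional, Tuple

# (read name, 1-based mapped position or None if reported as unmapped),
# listed in the order in which the mapper reported them over all passes.
Alignment = Tuple[str, Optional[int]]


def last_alignments(reported: Iterable[Alignment]) -> Dict[str, Optional[int]]:
    """Keep only the most recently reported alignment of every read."""
    final: Dict[str, Optional[int]] = {}
    for name, position in reported:
        # A later report overrides an earlier one, so a read can be
        # remapped (new position) or unmapped (None) by a later pass.
        final[name] = position
    return final
```

With this rule, a read mapped in the first pass and reported as unmapped in the second pass ends up unmapped in the final output.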

slide-13
SLIDE 13

Simulating dynamic mapping

slide-14
SLIDE 14

Dynamic mapping

[Diagram: READS (1, 2, …, n) → MAPPER (dynamic mapper, reference index, statistics; each iteration: read mapping, update of the reference) → OUTPUT (SAM/BAM file)]

slide-15
SLIDE 15

Simulation (ideal approach)

[Diagram: READS (growing prefixes: 1; 1, 2; …; 1, 2, …, n) → MAPPER (static mapper, reference index, statistics; each iteration: read mapping, pileup and consensus, update of the reference) → OUTPUT (SAM/BAM file)]

slide-16
SLIDE 16

Simulation (feasible approach: 1 𝑒 iterations)

[Diagram: READS (consecutive batches of d reads) → MAPPER (static mapper, reference index, statistics; each iteration: read mapping, pileup and consensus, update of the reference) → OUTPUT (SAM/BAM file)]
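The batched simulation can be sketched in a few lines of Python. Everything here is a toy stand-in and an assumption for illustration: `map_read` replaces a real static mapper with a brute-force ungapped Hamming-distance search, `min_support` is a hypothetical consensus threshold, and only SNP updates are modelled, so coordinates never shift.

```python
from collections import Counter
from typing import Dict, List, Tuple


def map_read(reference: str, read: str) -> Tuple[int, int]:
    """Toy static mapper: best ungapped placement by Hamming distance.

    Returns (0-based position, number of mismatches); the leftmost
    position wins ties. Assumes the read is not longer than the reference.
    """
    best_pos, best_dist = 0, len(read) + 1
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        dist = sum(a != b for a, b in zip(window, read))
        if dist < best_dist:
            best_pos, best_dist = pos, dist
    return best_pos, best_dist


def simulate_dynamic_mapping(
    reference: str, reads: List[str], batch_size: int, min_support: int = 2
) -> Tuple[str, List[Tuple[str, int]]]:
    """Map reads in batches of `batch_size`, updating the reference after each batch."""
    pileup: Dict[int, Counter] = {}
    alignments: List[Tuple[str, int]] = []
    for start in range(0, len(reads), batch_size):
        # 1) Read mapping against the current version of the reference.
        for read in reads[start:start + batch_size]:
            pos, _ = map_read(reference, read)
            alignments.append((read, pos))
            for offset, base in enumerate(read):
                pileup.setdefault(pos + offset, Counter())[base] += 1
        # 2) Pileup and consensus, 3) update of the reference (SNPs only).
        updated = list(reference)
        for pos, counts in pileup.items():
            base, support = counts.most_common(1)[0]
            if support >= min_support:
                updated[pos] = base
        reference = "".join(updated)
    return reference, alignments
```

With n reads this loop runs ceil(n / batch_size) iterations. Setting `batch_size` to 1 corresponds to updating after every read, while setting it to n degenerates to a single static mapping pass followed by one update.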

slide-17
SLIDE 17

Our pipeline

Goals:

  • Simulating dynamic mapper using existing static mappers
  • Estimating usefulness of dynamic mapping
  • Making general statements about its benefit

Implementation:

  • Set of several scripts (BASH, Python) and programs (C++)
  • It uses standard bioinformatics software (SAMtools suite, etc.) and mappers (any mapper can be incorporated)
  • Updates are made by our own simple variant caller (simulating the real capabilities of a mapper)
  • Currently, only SNP updates (no indels) and single-end reads are supported
slide-18
SLIDE 18

Comparing mappers and alignments

slide-19
SLIDE 19

Comparison of mappers

Typical approach:

1. Taking several mappers as black boxes.
2. Simulating reads.
3. Mapping by the selected mappers.
4. Applying the same threshold on mapping qualities for all reads.
5. Comparing.

…it is not very useful.

slide-20
SLIDE 20

Comparison of mappers/alignments


Source: Heng Li: Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM, arXiv:1303.3997

Threshold 20 (on mapping qualities)

slide-21
SLIDE 21

Comparison of mappers/alignments

The typical approach is not very useful: it is important to consider all thresholds on mapping qualities!

Source: Heng Li: Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM, arXiv:1303.3997
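Considering all thresholds amounts to sweeping the mapping-quality cutoff and recording, for each value, how many reads are kept and how many of them are wrong. The sketch below is an illustration only: the input format (one pair of mapping quality and correctness flag per mapped read) and the function name are assumptions, and it is not the LAVEnder implementation.

```python
from typing import Iterable, List, Tuple


def threshold_curve(
    mapped_reads: Iterable[Tuple[int, bool]], total_reads: int
) -> List[Tuple[int, float, float]]:
    """Evaluate every mapping-quality threshold at once.

    `mapped_reads` holds (mapping quality, mapped correctly?) per mapped read.
    Returns one tuple per distinct threshold q, from highest to lowest:
    (q, percentage of all reads with quality >= q,
     fraction of wrongly mapped reads among those).
    """
    ordered = sorted(mapped_reads, key=lambda r: r[0], reverse=True)
    curve: List[Tuple[int, float, float]] = []
    kept = wrong = 0
    for i, (mapq, correct) in enumerate(ordered):
        kept += 1
        wrong += not correct
        # Emit a point only after the last read sharing this quality value.
        if i + 1 == len(ordered) or ordered[i + 1][0] != mapq:
            curve.append((mapq, 100.0 * kept / total_reads, wrong / kept))
    return curve
```

Plotting the second value against the third for two mappers gives curves that can be compared over the whole range of thresholds instead of at a single cutoff such as 20.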

slide-22
SLIDE 22

LAVEnder

A new evaluation software for comparing alignments (C++, Python).

It creates interactive HTML reports for a set of BAM files.

Support of:

  • DWGsim read simulator (will be extended)
  • Single-end reads

Availability

  • Currently a private repository on GitHub
  • In case of interest, don’t hesitate to contact me at karel.brinda@univ-mlv.fr

slide-23
SLIDE 23

Example of a comparison

  • Human chromosome 21
  • Sequencing error rate: 0.04
  • Mutation rate: 0.10
  • Single-end reads
  • Simulated by DWGsim
  • Aligned by BWA-MEM

[Plot axes: fraction of wrongly mapped reads in mapped reads; part of all reads in %]

slide-24
SLIDE 24

EXPERIMENTS

slide-25
SLIDE 25

Setup

  • Mappers: BWA-ALN, BWA-MEM
  • Reference genomes: a bacterium (Borrelia crocidurae), human chromosome 21
  • Mutation rates: 0.01 – 0.05 for BWA-ALN, 0.15 for BWA-MEM
  • Sequencing error rate: 0.01
  • Read length: 100
  • Read simulator: DWGsim
  • Evaluator: LAVEnder

slide-26
SLIDE 26

BWA-ALN, Borrelia crocidurae
Rate of mutations: 0.01, rate of seq. errors: 0.01, read length: 100, average coverage: 10
[Plot: mapping of all reads without any updates (Borrelia, BWA-ALN, 0.01 mut. rate)]

slide-27
SLIDE 27

BWA-ALN, Borrelia crocidurae
Rate of mutations: 0.01, rate of seq. errors: 0.01, read length: 100, average coverage: 10
[Plots: dynamic mapping; iterative referencing]

slide-28
SLIDE 28

BWA-ALN, human chromosome 21
Rate of mutations: 0.01, rate of seq. errors: 0.01, read length: 100, average coverage: 10
[Plot: mapping of all reads without any updates (human chr. 21, BWA-ALN, 0.01 mut. rate)]

slide-29
SLIDE 29

BWA-ALN, human chromosome 21
Rate of mutations: 0.01, rate of seq. errors: 0.01, read length: 100, average coverage: 10
[Plots: dynamic mapping; iterative referencing]

slide-30
SLIDE 30

BWA-ALN, Borrelia crocidurae
Rate of mutations: 0.03, rate of seq. errors: 0.01, read length: 100, average coverage: 10
[Plot: mapping of all reads without any updates (Borrelia, BWA-ALN, 0.03 mut. rate)]

slide-31
SLIDE 31

BWA-ALN, Borrelia crocidurae
Rate of mutations: 0.03, rate of seq. errors: 0.01, read length: 100, average coverage: 10
[Plots: dynamic mapping; iterative referencing]

slide-32
SLIDE 32

BWA-ALN, human chromosome 21
Rate of mutations: 0.03, rate of seq. errors: 0.01, read length: 100, average coverage: 10
[Plot: mapping of all reads without any updates (human chr. 21, BWA-ALN, 0.03 mut. rate)]

slide-33
SLIDE 33

BWA-ALN, human chromosome 21
Rate of mutations: 0.03, rate of seq. errors: 0.01, read length: 100, average coverage: 10
[Plots: dynamic mapping; iterative referencing]

slide-34
SLIDE 34

BWA-ALN, Borrelia crocidurae
Rate of mutations: 0.05, rate of seq. errors: 0.01, read length: 100, average coverage: 10
[Plot: mapping of all reads without any updates (Borrelia, BWA-ALN, 0.05 mut. rate)]

slide-35
SLIDE 35

BWA-ALN, Borrelia crocidurae
Rate of mutations: 0.05, rate of seq. errors: 0.01, read length: 100, average coverage: 10
[Plots: dynamic mapping; iterative referencing]

slide-36
SLIDE 36

BWA-ALN, human chromosome 21
Rate of mutations: 0.05, rate of seq. errors: 0.01, read length: 100, average coverage: 10
[Plot: mapping of all reads without any updates (human chr. 21, BWA-ALN, 0.05 mut. rate)]

slide-37
SLIDE 37

BWA-ALN, human chromosome 21
Rate of mutations: 0.05, rate of seq. errors: 0.01, read length: 100, average coverage: 10
[Plots: dynamic mapping; iterative referencing]

slide-38
SLIDE 38

BWA-MEM, Borrelia crocidurae
Rate of mutations: 0.15, rate of seq. errors: 0.01, read length: 100, average coverage: 10
[Plot: mapping of all reads without any updates (Borrelia, BWA-MEM, 0.15 mut. rate)]

slide-39
SLIDE 39

BWA-MEM, Borrelia crocidurae
Rate of mutations: 0.15, rate of seq. errors: 0.01, read length: 100, average coverage: 10
[Plots: dynamic mapping; iterative referencing]

slide-40
SLIDE 40

BWA-MEM, human chromosome 21
Rate of mutations: 0.15, rate of seq. errors: 0.01, read length: 100, average coverage: 10
[Plot: mapping of all reads without any updates (human chr. 21, BWA-MEM, 0.15 mut. rate)]

slide-41
SLIDE 41

BWA-MEM, human chromosome 21
Rate of mutations: 0.15, rate of seq. errors: 0.01, read length: 100, average coverage: 10
[Plots: dynamic mapping; iterative referencing]

slide-42
SLIDE 42

Conclusion

  • We have shown:
    • For cases with a small number of mutations between genomes, static mapping suffices (e.g., 1%+1%, BWA-ALN)
    • For cases with a high number of mutations, mapping is much improved when dynamic mapping is employed (e.g., 15%+1%, BWA-MEM)
  • Real situations: regions with low mutation rates as well as highly mutated regions (e.g., hot spot regions)
    • If we are also interested in these regions, dynamic mapping/iterative referencing would provide a great improvement (especially for, e.g., variant calling)
  • Side products of our work:
    • LAVEnder – a new evaluator of alignments
slide-43
SLIDE 43

Thank you for your attention!

Gregory Kucherov Valentina Boeva