Dynamic mappers of NGS reads - Karel Břinda (LIGM, Université Paris-Est)

  1. Dynamic mappers of NGS reads. Karel Břinda (LIGM, Université Paris-Est), Valentina Boeva (Institut Curie), Gregory Kucherov (LIGM, Université Paris-Est)

  2. Introduction ◦ Read mapping is a bottleneck in NGS data processing (e.g., for variant calling). ◦ A lot of effort is constantly invested in the development of new mappers. ◦ None of them supports dynamic updates of the reference during the mapping.

  3. Idea: update the reference during the mapping. Only a few papers on this topic exist: ◦ J. Pritt. Efficiently Improving the Reference Genome for DNA Read Alignment. Seminar work, Harvard University, 2013. ◦ A. Ghanayim and D. Geiger. Iterative referencing for improving the interpretation of DNA sequence data. Technical report, Technion, Israel, 2013. ◦ C. S. Iliopoulos et al. An algorithm for mapping short reads to a dynamically changing genomic sequence. Journal of Discrete Algorithms 10, 2012.

  4. Mapping – from static to dynamic 1. Static mapping ◦ Classical mappers, no updates. 2. Iterative referencing ◦ Uses standard mappers; each mapping run is followed by variant calling, repeated over many iterations. 3. Dynamic mapping ◦ The mapper dynamically updates its index according to the already mapped reads.

  5. 1) Static mapping (standard mappers) [Diagram: reads 1, 2, ..., n pass once (1 iter.) through a static mapper, which maps them against the reference (index) and writes a SAM/BAM file.]

  6. 2) Iterative referencing (Ghanayim & Geiger, 2013) [Diagram: in each iteration, all reads 1, 2, ..., n are mapped by a static mapper against the reference (index), producing a SAM/BAM file; statistics (pileup, consensus) are then used to update the reference before the next iteration.]

  7. 3) Dynamic mapping (no existing mapper until now) [Diagram: reads 1, 2, ..., n are processed one after another (1 iter. each) by a dynamic mapper; read mapping against the reference (index) writes the SAM/BAM file and feeds statistics that update the reference during the mapping.]

  8. Estimating the usefulness (memory requirements / speed / quality of alignment): ◦ Static mapping: + / ++ / - ◦ Iterative referencing: + / -- / ++ ◦ Dynamic mapping: -- / + / +

  9. Dynamic mappers

  10. Difficulties – dynamic data structures Two basic types of mappers: ◦ FM-index based (e.g., BWA-ALN, BWA-SW, BWA-MEM, GEM) ◦ Hash-table based (e.g., SHRiMP 2, SToRM) Data structures must be dynamic: ◦ Difficult to make dynamic versions ◦ More memory needed ◦ Worse cache optimization (=> significant decrease in speed) Dynamic FM-index, already studied: ◦ M. Salson, T. Lecroq, M. Léonard, and L. Mouchard. A four-stage algorithm for updating a Burrows–Wheeler transform. Theoretical Computer Science 410 (43), 2009. ◦ M. Salson, T. Lecroq, M. Léonard, and L. Mouchard. Dynamic extended suffix arrays. Journal of Discrete Algorithms 8 (2), 2010. ◦ Implementation: http://dfmi.sourceforge.net/

  11. Difficulties – statistics and reference ◦ To make updates, it is necessary to keep simplified pileups (nucleotide counts in an alignment column). ◦ It is difficult to deal with insertions. ◦ The coordinates of already mapped reads can change during the mapping. ◦ Possible solution: padded reference with many initial placeholders (‘*’ character), followed by small final post-processing corrections of the SAM file. Example (memory needed for statistics for a single nucleotide): ‘A’ counter 3 bits, ‘C’ counter 3 bits, ‘G’ counter 3 bits, ‘T’ counter 3 bits, DEL counter 3 bits; sum 15 bits. Example (padded reference, an insertion at pos. 14), positions 1 to 20: C * * A * * G * * C * * G C * C * * A * …
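The two examples on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the bit layout, the rule of halving all counters when one saturates, and every function name below are hypothetical and do not come from the slides, which give only the counter sizes (5 × 3 bits = 15 bits) and the padded-reference idea.

```python
# Sketch only: packing order, overflow policy and names are assumptions.

SYMBOLS = "ACGTD"               # 'D' stands for the DEL counter
BITS = 3                        # 3 bits per counter, 5 counters -> 15 bits
MAX_COUNT = (1 << BITS) - 1     # largest value a 3-bit counter can hold (7)


def get_counts(column):
    """Unpack the five 3-bit counters stored in one 15-bit integer."""
    return [(column >> (BITS * i)) & MAX_COUNT for i in range(len(SYMBOLS))]


def pack_counts(counts):
    """Pack five counters (each 0..7) back into one 15-bit integer."""
    column = 0
    for i, count in enumerate(counts):
        column |= count << (BITS * i)
    return column


def add_observation(column, symbol):
    """Record one aligned symbol in the column and return the new column."""
    counts = get_counts(column)
    i = SYMBOLS.index(symbol)
    if counts[i] == MAX_COUNT:
        # Assumed overflow policy: halve every counter before incrementing.
        counts = [c // 2 for c in counts]
    counts[i] += 1
    return pack_counts(counts)


def consensus(column):
    """Return the symbol with the highest counter (first one on ties)."""
    counts = get_counts(column)
    return SYMBOLS[counts.index(max(counts))]


def padded_to_unpadded(padded_ref, padded_pos):
    """Convert a 1-based padded position to a 1-based position in the
    sequence obtained by removing all '*' placeholders."""
    if padded_ref[padded_pos - 1] == "*":
        raise ValueError("position is an unused placeholder")
    return sum(1 for ch in padded_ref[:padded_pos] if ch != "*")
```

With the padded reference from the slide, `padded_to_unpadded("C**A**G**C**GC*C**A*", 14)` gives the coordinate of the inserted `C` in the final sequence, which is the kind of correction the SAM post-processing step would apply.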

  12. Difficulties – remapping, unmapping When the reference sequence changes too much, some of the already mapped reads should be remapped or unmapped. Example: Reference: ... AAAAATATATAT AT CGATCTGC ... CC _ Reads: 1: ATCTATATATCG 2: C CGATCTGC 3: CC CGATCTG 4: AT CC CGATC Possible solutions: ◦ Ignore it ◦ Iterate over the set of reads several times and keep only the last reported alignment for each read

  13. Simulating dynamic mapping

  14. Dynamic mapping [Diagram: reads 1, 2, ..., n are processed one after another (1 iter. each) by a dynamic mapper; read mapping against the reference (index) writes the SAM/BAM file and feeds statistics that update the reference during the mapping.]

  15. Simulation (ideal approach) [Diagram: a static mapper is run once per iteration on a read set that grows by one read each time (read 1; reads 1, 2; ...; reads 1, 2, ..., n); after each iteration, statistics (pileup, consensus) from the SAM/BAM file are used to update the reference (index).]

  16. 1 Simulation (feasible approach: 𝑒 iterations) [Diagram: as in the ideal approach, but the read set grows by d reads per iteration (d reads; d + d reads; d + d + d reads; ...); a static mapper maps them against the reference (index), and statistics (pileup, consensus) from the SAM/BAM file are used to update the reference.]
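The batch idea can be illustrated with a toy Python sketch. It is deliberately simplified and rests on several assumptions that are not in the slides: the "static mapper" is a brute-force ungapped search with a mismatch limit (a stand-in, not BWA), each batch of `d` reads is mapped only once against the current reference, and the update rule is a strict-majority SNP call with a minimum support. All function names and parameters are hypothetical.

```python
from collections import Counter


def best_hit(reference, read, max_mismatches):
    """Toy static mapper: return (0-based position, mismatches) of the
    ungapped placement with the fewest mismatches, or None if every
    placement exceeds max_mismatches."""
    best = None
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(1 for a, b in zip(window, read) if a != b)
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (pos, mismatches)
    return best


def simulate_dynamic_mapping(reference, reads, d, max_mismatches=1, min_support=2):
    """Map reads in batches of d; after each batch, apply SNP-only updates
    to the reference from the accumulated pileup (no indels)."""
    reference = list(reference)
    pileup = [Counter() for _ in reference]   # per-column base counts
    alignments = {}                           # read index -> 0-based position
    for start in range(0, len(reads), d):
        current = "".join(reference)
        for offset, read in enumerate(reads[start:start + d]):
            hit = best_hit(current, read, max_mismatches)
            if hit is None:
                continue                      # read stays unmapped
            pos = hit[0]
            alignments[start + offset] = pos
            for i, base in enumerate(read):
                pileup[pos + i][base] += 1
        # Assumed update rule: strict majority with at least min_support reads.
        for i, column in enumerate(pileup):
            if not column:
                continue
            base, count = column.most_common(1)[0]
            if count >= min_support and 2 * count > sum(column.values()):
                reference[i] = base
    return "".join(reference), alignments


# Hypothetical toy data: the donor differs from the reference by two SNPs.
REFERENCE = "ACGTTGCAAGCTTAGGCATCCGTA"
DONOR = "ACGTTGCACGCTTTGGCATCCGTA"
READS = ["GTTGCACG", "TTGCACGC", "TTGGCATC", "TGGCATCC", "ACGCTTTG"]

dynamic_ref, dynamic_alignments = simulate_dynamic_mapping(REFERENCE, READS, d=2)
static_ref, static_alignments = simulate_dynamic_mapping(REFERENCE, READS, d=len(READS))
```

In this toy data the last read spans both SNPs, so it has two mismatches against the original reference and stays unmapped when all reads go through in a single batch (`d=len(READS)`). With `d=2`, the first two batches correct one SNP each, and the last read then maps.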

  17. Our pipeline Goals: ◦ Simulating a dynamic mapper using existing static mappers ◦ Estimating the usefulness of dynamic mapping ◦ Making general statements about its benefit Implementation: ◦ A set of several scripts (Bash, Python) and programs (C++) ◦ It uses standard bioinformatics software (SAMtools suite, etc.) and mappers (any mapper can be incorporated) ◦ Updates are made by our own simple variant caller (simulating the real capabilities of a mapper) ◦ Currently only SNP updates (no indels) and single-end reads are supported

  18. Comparing mappers and alignments

  19. Comparison of mappers Typical approach: 1. Taking several mappers as black boxes. 2. Simulating reads. 3. Mapping by the selected mappers. 4. Applying the same threshold on mapping qualities for all reads. 5. Comparing. …it is not very useful.

  20. Comparison of mappers/alignments Typical approach: 1. Taking several mappers as black boxes. 2. Simulating reads. 3. Mapping by the selected mappers. 4. Applying the same threshold on mapping qualities for all reads. 5. Comparing. …it is not very useful. [Figure: threshold 20 (on mapping qualities). Source: Heng Li: Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM, arXiv:1303.3997]

  21. Comparison of mappers/alignments Typical approach: 1. Taking several mappers as black boxes. 2. Simulating reads. 3. Mapping by the selected mappers. 4. Applying the same threshold on mapping qualities for all reads. 5. Comparing. …it is not very useful. It is important to consider all thresholds on mapping qualities! [Figure source: Heng Li: Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM, arXiv:1303.3997]

  22. LAVEnder ◦ A new evaluation software for comparing alignments (C++, Python) ◦ It creates interactive HTML reports for a set of BAM files ◦ Support of: the DWGsim read simulator (will be extended), single-end reads ◦ Availability: currently a private repository on GitHub; in case of interest, don’t hesitate to contact me at karel.brinda@univ-mlv.fr

  23. Example of a comparison • Human chromosome 21 • Sequencing error rate: 0.04 • Mutation rate: 0.10 • Single-end reads • Simulated by DWGsim • Aligned by BWA-MEM [Plot: y-axis: fraction of wrongly mapped reads in mapped reads; x-axis: part of all reads in %]
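A curve with these two axes can be computed by sweeping over every mapping-quality threshold instead of fixing a single one. The Python sketch below is an illustration only and not LAVEnder's implementation. It assumes the evaluator already knows, for each mapped read, its mapping quality and whether the reported position is correct (e.g., from the read simulator); the function name and input format are hypothetical.

```python
def sweep_mapq_thresholds(records, total_reads):
    """records: list of (mapq, is_correct) pairs, one per mapped read.
    total_reads: number of all reads, mapped or not.

    Returns one tuple per distinct threshold, from strictest to loosest:
    (threshold, part of all reads kept in %, fraction wrongly mapped
    among the kept reads)."""
    curve = []
    thresholds = sorted({mapq for mapq, _ in records}, reverse=True)
    for threshold in thresholds:
        kept = [is_correct for mapq, is_correct in records if mapq >= threshold]
        wrong = sum(1 for is_correct in kept if not is_correct)
        curve.append((threshold, 100.0 * len(kept) / total_reads, wrong / len(kept)))
    return curve
```

Plotting the second value against the third for each mapper gives one curve per mapper, so two mappers can be compared at equal sensitivity rather than at an arbitrary common threshold.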

  24. EXPERIMENTS

  25. Setup ◦ Mappers: BWA-ALN, BWA-MEM ◦ Reference genomes: a bacterium (Borrelia crocidurae), human chromosome 21 ◦ Mutation rates: 0.01–0.05 for BWA-ALN, 0.15 for BWA-MEM ◦ Sequencing error rate: 0.01 ◦ Read length: 100 ◦ Read simulator: DWGsim ◦ Evaluator: LAVEnder

  26. BWA-ALN, Borrelia crocidurae. Rate of mutations: 0.01, rate of seq. errors: 0.01, read length: 100, average coverage: 10. [Plot: mapping of all reads without any updates]

  27. BWA-ALN, Borrelia crocidurae. Rate of mutations: 0.01, rate of seq. errors: 0.01, read length: 100, average coverage: 10. [Plots: iterative referencing; dynamic mapping]

  28. BWA-ALN, human chromosome 21. Rate of mutations: 0.01, rate of seq. errors: 0.01, read length: 100, average coverage: 10. [Plot: mapping of all reads without any updates]

  29. BWA-ALN, human chromosome 21. Rate of mutations: 0.01, rate of seq. errors: 0.01, read length: 100, average coverage: 10. [Plots: iterative referencing; dynamic mapping]

  30. BWA-ALN, Borrelia crocidurae. Rate of mutations: 0.03, rate of seq. errors: 0.01, read length: 100, average coverage: 10. [Plot: mapping of all reads without any updates]

  31. BWA-ALN, Borrelia crocidurae. Rate of mutations: 0.03, rate of seq. errors: 0.01, read length: 100, average coverage: 10. [Plots: iterative referencing; dynamic mapping]

  32. BWA-ALN, human chromosome 21. Rate of mutations: 0.03, rate of seq. errors: 0.01, read length: 100, average coverage: 10. [Plot: mapping of all reads without any updates]

  33. BWA-ALN, human chromosome 21. Rate of mutations: 0.03, rate of seq. errors: 0.01, read length: 100, average coverage: 10. [Plots: iterative referencing; dynamic mapping]

  34. BWA-ALN, Borrelia crocidurae. Rate of mutations: 0.05, rate of seq. errors: 0.01, read length: 100, average coverage: 10. [Plot: mapping of all reads without any updates]

  35. BWA-ALN, Borrelia crocidurae. Rate of mutations: 0.05, rate of seq. errors: 0.01, read length: 100, average coverage: 10. [Plots: iterative referencing; dynamic mapping]

  36. BWA-ALN, human chromosome 21. Rate of mutations: 0.05, rate of seq. errors: 0.01, read length: 100, average coverage: 10. [Plot: mapping of all reads without any updates]

  37. BWA-ALN, human chromosome 21. Rate of mutations: 0.05, rate of seq. errors: 0.01, read length: 100, average coverage: 10. [Plots: iterative referencing; dynamic mapping]
