
Designing small universal k-mer hitting sets for improved analysis - PowerPoint PPT Presentation

National Taiwan University. Designing small universal k-mer hitting sets for improved analysis of high throughput sequencing. Orenstein Y, Pellow D, Marçais G, Shamir R, Kingsford C. PLOS Computational Biology. 2017 October; 13(10): e1005777.


  1. National Taiwan University. Designing small universal k-mer hitting sets for improved analysis of high throughput sequencing. Orenstein Y, Pellow D, Marçais G, Shamir R, Kingsford C. PLOS Computational Biology. 2017 October; 13(10): e1005777.
     Hung-Yu Chen, R06945024; Vincent Hwang, B05902122

  2. Outline
     · Background
     · Methods and results
     · Conclusion

  3. Background
     · Sequencing datasets keep getting larger.
     · New computational ideas are essential to manage and analyze the data.

  4. Minimizer
     · Michael Roberts, Wayne Hayes, Brian R. Hunt, Stephen M. Mount, James A. Yorke; Reducing storage requirements for biological sequence comparison, Bioinformatics, Volume 20, Issue 18, 12 December 2004, Pages 3363−3369
     · Given a sequence of length L, the minimizer is the lexicographically smallest k-mer in it.
     · Given a sequence S of any length, the minimizer set is the set of minimizers of every L-long subsequence in S.
       ⟹ Every L-long subsequence in S is represented in the set.
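The two definitions on this slide can be sketched in a few lines of Python. This is only an illustration: the function names `minimizer` and `minimizer_set` are made up here, and "subsequence" is read as a contiguous substring (a window), which is how minimizers are normally used.

```python
def minimizer(window, k):
    """Lexicographically smallest k-mer inside one window."""
    return min(window[i:i + k] for i in range(len(window) - k + 1))


def minimizer_set(s, k, L):
    """Minimizers of every L-long window of s."""
    return {minimizer(s[i:i + L], k) for i in range(len(s) - L + 1)}


print(minimizer("ACGTAC", 3))                 # ACG
print(sorted(minimizer_set("GATTACA", 3, 5)))  # ['ACA', 'ATT']
```

In the second call the three windows GATTA, ATTAC and TTACA are represented by just two k-mers, which is the storage saving the slide refers to.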

  5. Applications of minimizers
     · Hashing for read overlapping
     · Sparse suffix arrays
     · Bloom filters to speed up sequence search

  6. Hashing for read overlapping

  7. Sparse suffix arrays

  8. Bloom filters to speed up sequence search

  9. Universal hitting set (UHS)
     · For integers k, L, a set U_{k,L} is called a UHS of k-mers if every possible sequence of length L must contain at least one k-mer in U_{k,L}.
     · For example, the set of all k-mers is a trivial UHS.
     · Problem 1. Given k and L, find a smallest UHS of k-mers.

  10. Hits
      · A k-mer w hits string S, denoted w ⊆ S, if w is a substring of S.
      · A k-mer set X hits string S if there exists w ∈ X such that w ⊆ S.
      · The UHS in Problem 1 is a set of k-mers U_{k,L} which hits every possible sequence of length L.
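For very small k and L, the definitions of "hits" and of a UHS can be checked by brute force. The sketch below is an illustration rather than anything from the paper: the names `hits`, `set_hits` and `is_uhs` are hypothetical, and enumerating all |Σ|^L sequences is only feasible for toy parameters.

```python
from itertools import product


def hits(w, s):
    """A k-mer w hits s if w is a substring of s."""
    return w in s


def set_hits(X, s):
    """A k-mer set X hits s if at least one of its k-mers does."""
    return any(hits(w, s) for w in X)


def is_uhs(X, L, alphabet="ACGT"):
    """Brute force: X must hit every possible sequence of length L."""
    return all(set_hits(X, "".join(p)) for p in product(alphabet, repeat=L))


print(is_uhs({"00", "01", "11"}, 3, alphabet="01"))  # True
print(is_uhs({"00", "11"}, 3, alphabet="01"))        # False: 010 is missed
```

The binary alphabet is used only to keep the example readable.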

  11. Advantages of UHS over minimizers
      · The set of minimizers may be as large as the complete set of k-mers. The method in this paper can often generate UHSs smaller by a factor of nearly k.
      · UHS is universal.
        ⟹ For any k and L, a UHS needs to be computed only once and then serves every dataset.
        ⟹ The data structures created for different datasets will contain a comparable set of k-mers.

  12. Using de Bruijn graphs to find UHSs
      · Problem 2. Given a complete de Bruijn graph D_k of order k and an integer L, find a smallest set of vertices U_{k,L} such that any path in D_k of length l = L − k passes through at least one vertex of U_{k,L}.

  13. Complete de Bruijn graph
      · A complete de Bruijn graph of order k over alphabet Σ:
        V: |Σ|^k vertices, each labelled with a unique k-mer.
        E: If there is an edge (u, v) with a (k+1)-mer label l, then the label of vertex u is the k-prefix of l and the label of vertex v is the k-suffix of l. A complete de Bruijn graph contains all possible |Σ|^{k+1} edges of this type.
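A small sketch of this graph, together with the condition from Problem 2, might look as follows. It assumes that an edge (u, v) runs from the k-prefix to the k-suffix of its label (the usual orientation), and it reads "path" as a walk that may revisit vertices, since an L-long sequence can repeat a k-mer. The names `de_bruijn` and `walk_avoiding` are made up for this illustration.

```python
from itertools import product


def de_bruijn(k, alphabet="ACGT"):
    """Vertices (k-mers) and labelled edges of the complete graph of order k."""
    vertices = ["".join(p) for p in product(alphabet, repeat=k)]
    # Edge label l = u + c is a (k+1)-mer running from its k-prefix u
    # to its k-suffix u[1:] + c.
    edges = {u + c: (u, u[1:] + c) for u in vertices for c in alphabet}
    return vertices, edges


def walk_avoiding(U, k, l, alphabet="ACGT"):
    """True if some walk with l edges stays clear of every vertex in U."""
    vertices, _ = de_bruijn(k, alphabet)
    alive = set(vertices) - set(U)
    ends = set(alive)  # end vertices of surviving walks with 0 edges
    for _ in range(l):
        # Extend each surviving walk by one edge, staying inside `alive`.
        ends = {u[1:] + c for u in ends for c in alphabet} & alive
    return bool(ends)


vertices, edges = de_bruijn(3, alphabet="01")
print(len(vertices), len(edges))  # 8 16
```

A set U solves the hitting requirement of Problem 2 exactly when `walk_avoiding(U, k, L - k)` is false. This takes about l · |Σ|^{k+1} steps instead of enumerating all |Σ|^L sequences.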

  14. How to find the UHS?
      · NP-hard in general (see the supporting information in the paper).
      · Heuristic approaches (DOCKS, DOCKSany, DOCKSanyX).

  15. How to find UHS?
      1. Generate a complete de Bruijn graph of order k; set l = L − k.
      2. Find the decycling vertex set (V-set), X.
      3. Remove X from the graph, resulting in G′.
      4. Remove vertices from G′ and add them to X to hit the remaining L-long sequences, using one of:
         (i) DOCKS
         (ii) DOCKSany
         (iii) DOCKSanyX
      5. X is the universal hitting set we are searching for.
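To make the shape of step 4 concrete, here is a naive greedy stand-in. It is not DOCKS, DOCKSany or DOCKSanyX, and it skips the decycling step entirely by starting from an empty X. It repeatedly removes the vertex that lies on the most remaining walks with l edges, until none is left. The function name `greedy_uhs` and the tie-breaking rule are assumptions made for this sketch. The output is a valid UHS but is not guaranteed to be a smallest one.

```python
from itertools import product


def greedy_uhs(k, L, alphabet="ACGT"):
    """Greedy sketch: remove k-mers until no L-long sequence survives."""
    l = L - k  # number of edges in a walk that spells an L-long sequence
    alive = {"".join(p) for p in product(alphabet, repeat=k)}
    X = set()
    while alive:
        # ending[i][v] / starting[i][v]: number of i-edge walks inside
        # `alive` that end at / start from vertex v.
        ending = [dict.fromkeys(alive, 1)]
        starting = [dict.fromkeys(alive, 1)]
        for _ in range(l):
            pe, ps = ending[-1], starting[-1]
            ending.append({v: sum(pe.get(c + v[:-1], 0) for c in alphabet)
                           for v in alive})
            starting.append({v: sum(ps.get(v[1:] + c, 0) for c in alphabet)
                             for v in alive})
        # Count the l-edge walks with v at position i, summed over i.
        through = {v: sum(ending[i][v] * starting[l - i][v]
                          for i in range(l + 1)) for v in alive}
        best = max(sorted(alive), key=through.get)  # ties: smallest k-mer
        if through[best] == 0:
            break  # no l-edge walk is left, so X hits every L-long sequence
        alive.remove(best)
        X.add(best)
    return X


print(sorted(greedy_uhs(2, 3, alphabet="01")))  # ['00', '01', '11']
```

Each round costs on the order of l · |Σ|^{k+1} operations, so this is only practical for small k. The decycling step in the slide exists to remove all cycles up front, before any such greedy phase.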

  16. Decycling de Bruijn graph
      · Vertex labeling
      · Factor
      · Pure cycling register (PCR_k)
      · V-set

  17. Decycling de Bruijn graph
      [Figure: de Bruijn graph of order 3 over {0, 1}, with vertices 000, 001, 010, 011, 100, 101, 110, 111.]

  18. Vertex labeling
      For a vertex v = (s_0, s_1, ..., s_{k−1}), calculate the center of mass.
      According to the position x of the center of mass in the coordinate system, label the vertex I if x = 0, L if x < 0, R if x > 0.

  19. Vertex labeling example
      v = 010111: the x value of the center of mass is > 0 ⟹ R.

  20. Factor
      · A factor is a set of cycles such that all vertices in the graph are in exactly one of the cycles.
      · Each cycle has a unique feedback function f(s_0, s_1, ..., s_{k−1}) = s_k.
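As an illustration of a factor, the sketch below follows the successor map (s_0, ..., s_{k−1}) → (s_1, ..., s_{k−1}, f(s_0, ..., s_{k−1})) from every vertex and collects the resulting cycles. The feedback f = s_0 used in the example is the standard pure cycling register, which simply rotates the k-mer; this slide does not spell that function out, so treat it as an assumption here. The name `factor_cycles` is hypothetical.

```python
from itertools import product


def factor_cycles(k, feedback, alphabet="01"):
    """Split all k-mers into the cycles of s -> s[1:] + feedback(s)."""
    seen, cycles = set(), []
    for p in product(alphabet, repeat=k):
        start = "".join(p)
        if start in seen:
            continue
        cycle, v = [], start
        while v not in seen:
            seen.add(v)
            cycle.append(v)
            v = v[1:] + feedback(v)
        # If the map is not a permutation of the k-mers, the walk can end
        # somewhere other than its start, and there is no factor.
        assert v == start, "feedback does not define a factor"
        cycles.append(cycle)
    return cycles


def pcr(s):
    """Pure cycling register: the new symbol is the one shifted out."""
    return s[0]


for cycle in factor_cycles(3, pcr):
    print(cycle)
# ['000']
# ['001', '010', '100']
# ['011', '110', '101']
# ['111']
```

Every one of the 8 vertices appears in exactly one of the 4 cycles, which is what makes this set of cycles a factor.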
