Efficient and Accurate Clustering for Large-Scale Genetic Mapping


  1. Efficient and Accurate Clustering for Large-Scale Genetic Mapping. V. Strnadová (Neeley), Aydın Buluç, Jarrod Chapman, Joseph Gonzalez, John Gilbert, Stefanie Jegelka, Daniel Rokhsar, Leonid Oliker. Affiliations: Lawrence Berkeley National Labs, UC Santa Barbara, UC Berkeley, Joint Genome Institute

  2. Motivation • High-throughput sequencing methods have produced a flood of inexpensive genetic information • Genetic maps are important to breeding studies but genetic mapping software is prohibitively slow on large data sets

  3. The Genetic Mapping Problem. Data: one row per marker (m_1 to m_15), one column per i_1 to i_6; "-" marks missing data.

              i_1  i_2  i_3  i_4  i_5  i_6
     m_1       A    B    -    -    A    -
     m_2       A    B    A    A    B    A
     m_3       A    A    -    -    -    B
     m_4       A    -    B    -    B    B
     m_5       B    -    B    A    -    A
     m_6       A    A    B    A    -    -
     m_7       -    -    -    A    B    B
     m_8       A    B    A    B    -    A
     m_9       A    B    -    B    -    -
     m_10      B    B    B    -    A    A
     m_11      A    A    A    A    B    B
     m_12      B    -    A    B    A    -
     m_13      B    B    -    A    A    -
     m_14      -    -    -    B    A    A
     m_15      B    -    -    A    A    B

  4. The Genetic Mapping Problem. [Figure: the data table from slide 3, followed by a step labeled "cluster" that partitions markers m_1 to m_15 into Linkage group 1 and Linkage group 2.]

  5. The Genetic Mapping Problem. [Figure: the data table and "cluster" diagram from slide 4, plus a second diagram arranging the markers within Linkage group 1 and Linkage group 2.]

  6. The Genetic Mapping Problem. [Figure: same as slide 5, with the marker labels in the "cluster" diagram shown in bold.]

  7. The Need for Large-Scale Clustering in Genetic Mapping • Hundreds of thousands of genetic markers are available, but current software can only handle up to ~10,000 markers • A major bottleneck is the linkage-group-finding phase • Popular mapping tools all handle this phase the same way, with an $O(M^2)$ clustering algorithm for $M$ markers

  8. The Need for Large-Scale Clustering in Genetic Mapping • Hundreds of thousands of genetic markers are available, but current software can only handle up to ~10,000 markers • A major bottleneck is the linkage-group-finding phase • Popular mapping tools all handle this phase the same way, with an $O(M^2)$ clustering algorithm for $M$ markers • Our solution: a fast, scalable clustering algorithm tailored to genetic marker data
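
     As a rough illustration of why the quadratic cost is limiting, a pairwise method has to examine every pair of the $M$ markers. The value $M = 10^5$ below is an assumed example at the low end of "hundreds of thousands", not a figure taken from the slides.

     \[
     \binom{M}{2} = \frac{M(M-1)}{2}: \qquad
     M = 10^4 \Rightarrow 49{,}995{,}000 \approx 5 \times 10^{7}, \qquad
     M = 10^5 \Rightarrow 4{,}999{,}950{,}000 \approx 5 \times 10^{9}
     \]

     A tenfold increase in markers therefore means roughly a hundredfold increase in pairwise comparisons.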

  9. Standard Approach to Genetic Marker Clustering. [Figure: the data table from slide 3 and the "cluster" diagram partitioning markers m_1 to m_15 into Linkage group 1 and Linkage group 2.]

  10. Standard Approach to Genetic Marker Clustering. [Figure: the data table from slide 3 beside a graph over the markers, annotated with step (1) and with Linkage group 1 and Linkage group 2.]
     (1) Compute the similarity between all $O(M^2)$ pairs of markers, producing a complete graph with $M$ vertices
         • Similarity function is the "LOD score", a logarithm of odds that two markers are genetically linked
     (2) Cut all edges below a LOD threshold
     (3) The resulting connected components = linkage groups

  12. LOD Score. Compares the likelihood of obtaining the test data if the two markers are indeed linked to the likelihood of observing the same data purely by chance:

     $$\mathrm{LOD}(m_i, m_j) = \log_{10} \frac{P(\text{linkage}_{ij})}{P(\text{no linkage}_{ij})}$$

     Example markers: m_i = A B - - A -, m_j = A B A A B A

     Formally,

     $$\mathrm{LOD} = \log_{10} \frac{(1 - \theta_{ij})^{NR_{ij}} \, \theta_{ij}^{R_{ij}}}{0.5^{\,R_{ij} + NR_{ij}}}$$

     Where: $R_{ij}$ = number of recombinant offspring; $NR_{ij}$ = number of nonrecombinant offspring; $\theta_{ij}$ = recombination fraction, i.e. $\theta_{ij} = \frac{R_{ij}}{R_{ij} + NR_{ij}}$

  13. LOD Score. Compares the likelihood of obtaining the test data if the two markers are indeed linked to the likelihood of observing the same data purely by chance:

     $$\mathrm{LOD}(m_i, m_j) = \log_{10} \frac{P(\text{linkage}_{ij})}{P(\text{no linkage}_{ij})}$$

     Example markers: m_i = A B - - A -, m_j = A B A A B A

     Formally,

     $$\mathrm{LOD} = \log_{10} \frac{(1 - \theta_{ij})^{NR_{ij}} \, \theta_{ij}^{R_{ij}}}{0.5^{\,R_{ij} + NR_{ij}}}$$

     Where: $R_{ij}$ = number of recombinant offspring; $NR_{ij}$ = number of nonrecombinant offspring; $\theta_{ij}$ = recombination fraction, i.e. $\theta_{ij} = \frac{R_{ij}}{R_{ij} + NR_{ij}}$

  14. LOD Score. Compares the likelihood of obtaining the test data if the two markers are indeed linked to the likelihood of observing the same data purely by chance. For the example markers m_i = A B - - A - and m_j = A B A A B A:

     $$\mathrm{LOD}(m_i, m_j) = \log_{10} \frac{\left(1 - \frac{1}{3}\right)^{2} \left(\frac{1}{3}\right)^{1}}{0.5^{3}} = 0.074$$

     Formally,

     $$\mathrm{LOD} = \log_{10} \frac{(1 - \theta_{ij})^{NR_{ij}} \, \theta_{ij}^{R_{ij}}}{0.5^{\,R_{ij} + NR_{ij}}}$$

     Where: $R_{ij}$ = number of recombinant offspring; $NR_{ij}$ = number of nonrecombinant offspring; $\theta_{ij}$ = recombination fraction, i.e. $\theta_{ij} = \frac{R_{ij}}{R_{ij} + NR_{ij}}$
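
     The LOD formula above can be checked with a minimal Python sketch. It assumes, consistent with the worked example, that positions where both markers are observed and agree count as nonrecombinant and positions where they differ count as recombinant, and that positions with missing data ("-") are skipped. The function name `lod_score` and the handling of pairs with no shared observations are illustrative choices, not taken from the slides.

```python
import math

MISSING = "-"

def lod_score(marker_a, marker_b):
    """LOD score for two markers given as equal-length genotype strings."""
    recomb = nonrecomb = 0
    for a, b in zip(marker_a, marker_b):
        if a == MISSING or b == MISSING:
            continue  # skip positions where either genotype is missing
        if a == b:
            nonrecomb += 1  # matching genotypes: counted as nonrecombinant
        else:
            recomb += 1     # differing genotypes: counted as recombinant
    n = recomb + nonrecomb
    if n == 0:
        return 0.0  # assumption: no shared observations means no evidence
    theta = recomb / n  # recombination fraction R / (R + NR)
    # log10 of the numerator; terms with a zero exponent are skipped
    # so that log10(0) is never evaluated
    log_linked = 0.0
    if nonrecomb:
        log_linked += nonrecomb * math.log10(1 - theta)
    if recomb:
        log_linked += recomb * math.log10(theta)
    # subtract log10 of the denominator 0.5^(R + NR)
    return log_linked - n * math.log10(0.5)

m_i = "AB--A-"
m_j = "ABAABA"
print(round(lod_score(m_i, m_j), 3))  # 0.074
```

     Only three positions are observed in both markers (two matches, one mismatch), which gives $\theta = 1/3$ and reproduces the value 0.074 from the slide. Working in log space avoids underflow when many positions are compared.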

  15. Standard Approach to Genetic Marker Clustering. [Figure: the data table from slide 3 beside a graph over the markers, annotated with step (2) and with Linkage group 1 and Linkage group 2.]
     (1) Compute the similarity between all $O(M^2)$ pairs of markers, producing a complete graph with $M$ vertices
         • Similarity function is the "LOD score", a logarithm of odds that two markers are genetically linked
     (2) Cut all edges below a LOD threshold
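
     The standard approach described in these slides (score all pairs, cut edges below a LOD threshold, take connected components) can be sketched as follows. This is a minimal illustration, not the code of any particular mapping tool: the names `lod` and `linkage_groups`, the union-find bookkeeping, the toy genotype strings, and the thresholds are all assumptions made for the example.

```python
import math
from itertools import combinations

def lod(a, b, missing="-"):
    """LOD score of two genotype strings, skipping missing positions."""
    pairs = [(x, y) for x, y in zip(a, b) if x != missing and y != missing]
    n = len(pairs)
    if n == 0:
        return 0.0  # assumption: no shared observations means no evidence
    r = sum(x != y for x, y in pairs)  # recombinant count
    nr = n - r                         # nonrecombinant count
    theta = r / n
    log_linked = (nr * math.log10(1 - theta) if nr else 0.0) \
               + (r * math.log10(theta) if r else 0.0)
    return log_linked - n * math.log10(0.5)

def linkage_groups(markers, threshold):
    """Connected components of the graph keeping edges with LOD >= threshold."""
    names = list(markers)
    parent = {m: m for m in names}  # union-find forest over marker names

    def find(m):
        while parent[m] != m:
            parent[m] = parent[parent[m]]  # path halving
            m = parent[m]
        return m

    # Step (1): visit all M(M-1)/2 pairs; step (2): keep only strong edges
    for a, b in combinations(names, 2):
        if lod(markers[a], markers[b]) >= threshold:
            parent[find(a)] = find(b)

    # Step (3): each union-find root identifies one linkage group
    groups = {}
    for m in names:
        groups.setdefault(find(m), []).append(m)
    return sorted(groups.values())

# Hypothetical toy data: four markers scored in ten individuals
markers = {
    "a": "AABBAABBAB",
    "b": "AABBAABBAA",
    "c": "ABABBABAAB",
    "d": "ABABBABABB",
}
print(linkage_groups(markers, threshold=1.0))  # [['a', 'b'], ['c', 'd']]
```

     The double loop over `combinations` is exactly the $O(M^2)$ step that the slides identify as the bottleneck; the thresholding and component-finding steps add comparatively little work.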
