Efficient and Accurate Clustering for Large-Scale Genetic Mapping

V. Strnadová (Neeley), Aydın Buluç, Jarrod Chapman, Joseph Gonzalez, John Gilbert, Stefanie Jegelka, Daniel Rokhsar, Leonid Oliker
Lawrence Berkeley National Labs, UC Santa Barbara, UC Berkeley, Joint Genome Institute

Motivation
• High-throughput sequencing methods have produced a flood of inexpensive genetic information
• Genetic maps are important to breeding studies, but genetic mapping software is prohibitively slow on large data sets

The Genetic Mapping Problem
Data (rows: markers m1 to m15; columns: i1 to i6; "-" = missing data)

       i1  i2  i3  i4  i5  i6
m1     A   B   -   -   A   -
m2     A   B   A   A   B   A
m3     A   A   -   -   -   B
m4     A   -   B   -   B   B
m5     B   -   B   A   -   A
m6     A   A   B   A   -   -
m7     -   -   -   A   B   B
m8     A   B   A   B   -   A
m9     A   B   -   B   -   -
m10    B   B   B   -   A   A
m11    A   A   A   A   B   B
m12    B   -   A   B   A   -
m13    B   B   -   A   A   -
m14    -   -   -   B   A   A
m15    B   -   -   A   A   B

The Genetic Mapping Problem
[Figure: the markers m1 to m15 from the data table are clustered into Linkage group 1 and Linkage group 2.]

The Need for Large-Scale Clustering in Genetic Mapping
• Hundreds of thousands of genetic markers are available, but current software can only handle up to ~10,000 markers
• A major bottleneck is the linkage-group-finding phase
• Popular mapping tools all handle this phase the same way, with an $O(M^2)$ clustering algorithm for $M$ markers
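To make the quadratic cost concrete, the short sketch below counts the distinct marker pairs an all-pairs method has to score. The helper name `pair_count` and the choice of 100,000 as a stand-in for "hundreds of thousands" are illustrative assumptions, not figures from the slides.

```python
def pair_count(m: int) -> int:
    """Number of distinct unordered pairs among m markers: m * (m - 1) / 2."""
    return m * (m - 1) // 2


# 10,000 is the limit quoted on the slide; 100,000 is an assumed larger data set.
for m in (10_000, 100_000):
    print(f"{m:>7,} markers -> {pair_count(m):,} pairs")
```

A tenfold increase in markers means roughly a hundredfold increase in pairwise comparisons.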

Our solution: a fast, scalable clustering algorithm tailored to genetic marker data

Standard Approach to Genetic Marker Clustering
[Figure: the data table of markers m1 to m15 is clustered into Linkage group 1 and Linkage group 2.]
(1) Compute the similarity between all $O(M^2)$ pairs of markers, producing a complete graph with $M$ vertices
• Similarity function is the "LOD score", a logarithm of odds that two markers are genetically linked
(2) Cut all edges below a LOD threshold
(3) The resulting connected components = linkage groups
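The three steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code of any particular mapping tool: the names `lod` and `linkage_groups`, the toy markers, and the threshold of 1.0 are hypothetical, and a recombinant is assumed to be any position where both markers are observed and their calls differ. The score itself follows the LOD formula given on the LOD Score slide.

```python
from itertools import combinations
from math import log10

MISSING = "-"


def lod(a: str, b: str) -> float:
    """LOD score of two marker strings (assumed rule: a mismatch is a recombinant)."""
    observed = [(x, y) for x, y in zip(a, b) if MISSING not in (x, y)]
    if not observed:
        return 0.0  # no jointly observed positions: no evidence either way
    r = sum(x != y for x, y in observed)   # recombinant count R
    nr = len(observed) - r                 # nonrecombinant count NR
    theta = r / (r + nr)                   # recombination fraction
    return log10((1 - theta) ** nr * theta ** r / 0.5 ** (r + nr))


def linkage_groups(markers: dict[str, str], threshold: float) -> list[list[str]]:
    """Steps (1)-(3): score all pairs, keep edges >= threshold, return components."""
    parent = {name: name for name in markers}

    def find(n: str) -> str:
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    # (1) all O(M^2) pairs; (2) edges below the threshold are simply never added.
    for u, v in combinations(markers, 2):
        if lod(markers[u], markers[v]) >= threshold:
            parent[find(u)] = find(v)

    # (3) connected components of the thresholded graph.
    groups: dict[str, list[str]] = {}
    for name in markers:
        groups.setdefault(find(name), []).append(name)
    return sorted(groups.values())


# Toy data (not from the slides): two pairs of near-identical markers.
toy = {"m1": "AAAAAAAA", "m2": "AAAAAAAB", "m3": "ABABABAB", "m4": "ABABABAA"}
print(linkage_groups(toy, threshold=1.0))
```

The union-find structure avoids storing the complete graph, but the loop still visits every pair, so the running time stays quadratic in the number of markers.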

LOD Score
Compares the likelihood of obtaining the test data if the two markers are indeed linked to the likelihood of observing the same data purely by chance:

$$\mathrm{LOD}(m_i, m_j) = \log_{10} \frac{P(\text{linkage}_{ij})}{P(\text{no linkage}_{ij})}$$

Formally,

$$\mathrm{LOD} = \log_{10} \frac{(1 - \theta_{ij})^{NR_{ij}} \, \theta_{ij}^{R_{ij}}}{0.5^{\,R_{ij} + NR_{ij}}}$$

Where:
$R_{ij}$ = number of recombinant offspring
$NR_{ij}$ = number of nonrecombinant offspring
$\theta_{ij}$ = recombination fraction, i.e. $\frac{R_{ij}}{R_{ij} + NR_{ij}}$

Example, for $m_i$ = A B - - A - and $m_j$ = A B A A B A:

$$\mathrm{LOD}(m_i, m_j) = \log_{10} \frac{(1 - \frac{1}{3})^{2} \, (\frac{1}{3})^{1}}{0.5^{3}} = 0.074$$
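The arithmetic behind the example value can be spelled out step by step. The counting rule used here is an assumption, since the slide does not state it: a position counts as nonrecombinant when both markers are observed and agree, and as recombinant when both are observed and differ. For $m_i$ = A B - - A - and $m_j$ = A B A A B A, positions 1, 2 and 5 are observed in both markers; the calls agree at positions 1 and 2 and differ at position 5.

```latex
\begin{align*}
R_{ij} &= 1, \qquad NR_{ij} = 2, \qquad
\theta_{ij} = \frac{R_{ij}}{R_{ij} + NR_{ij}} = \frac{1}{3} \\[4pt]
\mathrm{LOD}(m_i, m_j)
  &= \log_{10} \frac{\left(1 - \tfrac{1}{3}\right)^{2} \left(\tfrac{1}{3}\right)^{1}}{0.5^{3}}
   = \log_{10} \frac{4/27}{1/8}
   = \log_{10} \frac{32}{27}
   \approx 0.074
\end{align*}
```

With only three jointly observed positions, the odds ratio is barely above 1, so this pair provides almost no evidence of linkage.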
