Efficient and Accurate Clustering for Large-Scale Genetic Mapping



slide-1
SLIDE 1

Efficient and Accurate Clustering for Large-Scale Genetic Mapping

V. Strnadová (Neeley), Aydın Buluç, Jarrod Chapman, Joseph Gonzalez, John Gilbert, Stefanie Jegelka, Daniel Rokhsar, Leonid Oliker

Lawrence Berkeley National Labs, UC Santa Barbara, UC Berkeley, Joint Genome Institute

slide-2
SLIDE 2

Motivation

  • High-throughput sequencing methods have produced a flood of inexpensive genetic information
  • Genetic maps are important to breeding studies, but genetic mapping software is prohibitively slow on large data sets

slide-3
SLIDE 3

The Genetic Mapping Problem

[Figure: genotype data matrix with markers m1 to m15 as rows and individuals i1 to i6 as columns; each entry is A, B, or "-" (missing data).]

slide-4
SLIDE 4

The Genetic Mapping Problem

[Figure: the genotype data matrix (markers m1 to m15 by individuals i1 to i6, with "-" for missing data) is clustered into two linkage groups: linkage group 1 = {m1, m2, m8, m9, m15} and linkage group 2 = {m3, m4, m5, m6, m7, m10, m11, m12, m13, m14}.]

slide-5
SLIDE 5

The Genetic Mapping Problem

[Figure: the genotype data matrix (markers m1 to m15 by individuals i1 to i6, with "-" for missing data) is clustered into linkage group 1 = {m1, m2, m8, m9, m15} and linkage group 2 = {m3, m4, m5, m6, m7, m10, m11, m12, m13, m14}. Each linkage group is then shown as a marker sequence: linkage group 1 as m8, m15, m2, m1, m9 and linkage group 2 as m11, m6, m13, m3, m12, m10, m4, m7, m5, m14.]

slide-6
SLIDE 6

The Genetic Mapping Problem

[Figure: the genotype data matrix (markers m1 to m15 by individuals i1 to i6, with "-" for missing data), with the clustering step highlighted: linkage group 1 = {m1, m2, m8, m9, m15} and linkage group 2 = {m3, m4, m5, m6, m7, m10, m11, m12, m13, m14}. Each linkage group is then shown as a marker sequence: linkage group 1 as m8, m15, m2, m1, m9 and linkage group 2 as m11, m6, m13, m3, m12, m10, m4, m7, m5, m14.]

slide-7
SLIDE 7

The Need for Large-Scale Clustering in Genetic Mapping

  • Hundreds of thousands of genetic markers are available, but current software can only handle up to ~10,000 markers
  • A major bottleneck is the linkage-group-finding phase
  • Popular mapping tools all handle this phase the same way, with an $O(M^2)$ clustering algorithm for $M$ markers

slide-8
SLIDE 8
The Need for Large-Scale Clustering in Genetic Mapping

  • Hundreds of thousands of genetic markers are available, but current software can only handle up to ~10,000 markers
  • A major bottleneck is the linkage-group-finding phase
  • Popular mapping tools all handle this phase the same way, with an $O(M^2)$ clustering algorithm for $M$ markers

Our solution: A fast, scalable clustering algorithm tailored to genetic marker data

slide-9
SLIDE 9

Standard Approach to Genetic Marker Clustering

[Figure: the genotype data matrix (markers m1 to m15 by individuals i1 to i6, with "-" for missing data) is clustered into linkage group 1 = {m1, m2, m8, m9, m15} and linkage group 2 = {m3, m4, m5, m6, m7, m10, m11, m12, m13, m14}.]

slide-10
SLIDE 10

Standard Approach to Genetic Marker Clustering

[Figure, step (1): the genotype data matrix (markers m1 to m15 by individuals i1 to i6) next to the markers of linkage group 1 and linkage group 2, drawn as graph vertices.]

(1) Compute the similarity between all $O(M^2)$ pairs of markers, producing a complete graph with $M$ vertices

  • The similarity function is the “LOD score”, a logarithm of odds that two markers are genetically linked

(2) Cut all edges below a LOD threshold

(3) The resulting connected components = linkage groups
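The three steps above can be sketched as follows. This is a minimal illustration, not the code of any mapping tool: it assumes genotypes are strings over A, B and "-" (missing), counts a mismatch as a recombinant, and applies the LOD formula from the slides literally. The names `lod` and `linkage_groups` and the toy markers are hypothetical.

```python
import math
from itertools import combinations

def lod(a, b):
    """LOD score of two genotype strings; '-' marks a missing entry."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0                       # assumption: no shared data, no evidence
    r = sum(x != y for x, y in pairs)    # recombinant offspring (mismatches)
    nr = len(pairs) - r                  # nonrecombinant offspring (matches)
    theta = r / (r + nr)                 # recombination fraction
    score = (r + nr) * math.log10(2)     # equals -log10(0.5^(R+NR))
    if nr:
        score += nr * math.log10(1 - theta)
    if r:
        score += r * math.log10(theta)
    return score

def linkage_groups(markers, threshold):
    """Steps (1)-(3): score all pairs, keep edges >= threshold, return components."""
    parent = {name: name for name in markers}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in combinations(markers, 2):   # all O(M^2) pairs
        if lod(markers[a], markers[b]) >= threshold:
            parent[find(a)] = find(b)       # surviving edge: merge components

    groups = {}
    for name in markers:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(g) for g in groups.values())

toy = {
    "m1": "AABBABAB", "m2": "AABBABAB", "m3": "AABBABAA",
    "m4": "ABABBBAA", "m5": "ABABBBAA",
}
print(linkage_groups(toy, threshold=1.0))
```

The pair loop is what makes this approach quadratic in the number of markers, which is the bottleneck the talk targets.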
slide-12
SLIDE 12

LOD Score

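Compares the likelihood of obtaining the test data if the two markers are indeed linked to the likelihood of observing the same data purely by chance:

$$\mathrm{LOD}(m_i, m_j) = \log_{10} \frac{P(\mathrm{linkage}_{ij})}{P(\mathrm{no\ linkage}_{ij})}$$

Formally,

$$\mathrm{LOD} = \log_{10} \frac{(1 - \theta_{ij})^{NR_{ij}} \, \theta_{ij}^{R_{ij}}}{0.5^{R_{ij} + NR_{ij}}}$$

Where:

  • $R_{ij}$ = number of recombinant offspring
  • $NR_{ij}$ = number of nonrecombinant offspring
  • $\theta_{ij}$ = recombination fraction, i.e. $\frac{R_{ij}}{R_{ij} + NR_{ij}}$

[Figure: genotype rows of an example marker pair $m_i$ and $m_j$.]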

slide-13
SLIDE 13

LOD Score

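Compares the likelihood of obtaining the test data if the two markers are indeed linked to the likelihood of observing the same data purely by chance:

$$\mathrm{LOD}(m_i, m_j) = \log_{10} \frac{P(\mathrm{linkage}_{ij})}{P(\mathrm{no\ linkage}_{ij})}$$

Formally,

$$\mathrm{LOD} = \log_{10} \frac{(1 - \theta_{ij})^{NR_{ij}} \, \theta_{ij}^{R_{ij}}}{0.5^{R_{ij} + NR_{ij}}}$$

Where:

  • $R_{ij}$ = number of recombinant offspring
  • $NR_{ij}$ = number of nonrecombinant offspring
  • $\theta_{ij}$ = recombination fraction, i.e. $\frac{R_{ij}}{R_{ij} + NR_{ij}}$

[Figure: genotype rows of an example marker pair $m_i$ and $m_j$.]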

slide-14
SLIDE 14

LOD Score

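Compares the likelihood of obtaining the test data if the two markers are indeed linked to the likelihood of observing the same data purely by chance:

$$\mathrm{LOD}(m_i, m_j) = \log_{10} \frac{\left(1 - \frac{1}{3}\right)^{2} \left(\frac{1}{3}\right)^{1}}{0.5^{3}} = 0.074$$

Formally,

$$\mathrm{LOD} = \log_{10} \frac{(1 - \theta_{ij})^{NR_{ij}} \, \theta_{ij}^{R_{ij}}}{0.5^{R_{ij} + NR_{ij}}}$$

Where:

  • $R_{ij}$ = number of recombinant offspring
  • $NR_{ij}$ = number of nonrecombinant offspring
  • $\theta_{ij}$ = recombination fraction, i.e. $\frac{R_{ij}}{R_{ij} + NR_{ij}}$

[Figure: genotype rows of an example marker pair $m_i$ and $m_j$.]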
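The LOD formula can be checked numerically. The sketch below evaluates it directly from the recombinant and nonrecombinant counts and reproduces the slide's example with R = 1 and NR = 2; the function name `lod_from_counts` is hypothetical.

```python
import math

def lod_from_counts(r, nr):
    """LOD = log10[(1 - theta)^NR * theta^R / 0.5^(R + NR)], theta = R / (R + NR)."""
    n = r + nr
    if n == 0:
        raise ValueError("need at least one informative offspring")
    theta = r / n
    log_linked = 0.0
    if nr:                                  # skip zero exponents to avoid log10(0)
        log_linked += nr * math.log10(1 - theta)
    if r:
        log_linked += r * math.log10(theta)
    return log_linked - n * math.log10(0.5)

print(round(lod_from_counts(r=1, nr=2), 3))  # 0.074
```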

slide-15
SLIDE 15

Standard Approach to Genetic Marker Clustering

(1) Compute the similarity between all $O(M^2)$ pairs of markers, producing a complete graph with $M$ vertices

  • The similarity function is the “LOD score”, a logarithm of odds that two markers are genetically linked

(2) Cut all edges below a LOD threshold

[Figure, step (2): the genotype data matrix (markers m1 to m15 by individuals i1 to i6) next to the marker graph after cutting low-LOD edges, with the markers falling into two linkage groups, {m1, m2, m8, m9, m15} and {m3, m4, m5, m6, m7, m10, m11, m12, m13, m14}.]

slide-16
SLIDE 16

Standard Approach to Genetic Marker Clustering

(1) Compute the similarity between all $O(M^2)$ pairs of markers, producing a complete graph with $M$ vertices

  • The similarity function is the “LOD score”, a logarithm of odds that two markers are genetically linked

(2) Cut all edges below a LOD threshold

(3) The resulting connected components = linkage groups

[Figure, step (3): the genotype data matrix (markers m1 to m15 by individuals i1 to i6) next to the two connected components of the marker graph, {m1, m2, m8, m9, m15} and {m3, m4, m5, m6, m7, m10, m11, m12, m13, m14}, which are the linkage groups.]

slide-17
SLIDE 17

Our Approach: The BubbleCluster Algorithm

Primary assumption: Clusters have a “linear structure”

slide-18
SLIDE 18

Our Approach: The BubbleCluster Algorithm

Primary assumption: Clusters have a “linear structure”

Key idea: Maintain a set of representative or “sketch” points which reveal the cluster structure

[Figure: representative points along a cluster, with the LOD threshold marked.]

slide-19
SLIDE 19

The BubbleCluster Algorithm: Overview

Input: set of markers, LOD threshold $\tau$, non-missing threshold $\eta$, low-quality threshold $c$, cluster size threshold $\sigma$

Output: set of clusters C and set of representative points R

Phase I: Build the initial set of clusters and the set of representative points using high-quality markers (those with at least $\eta$ non-missing entries)

Phase II: Add low-quality markers (fewer than $\eta$ non-missing entries) to the initial set of clusters

Phase III: Attempt to merge small clusters with large clusters
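A minimal sketch of how the non-missing threshold could split the input between Phase I and Phase II. It assumes genotypes are strings with "-" for a missing entry; the function name `split_by_quality` and the sample markers are hypothetical, and the roles of the low-quality threshold and the cluster size threshold are not modelled here.

```python
def split_by_quality(markers, eta):
    """Partition markers by their number of non-missing entries.

    High-quality: at least eta non-missing entries (used in Phase I).
    Low-quality: fewer than eta non-missing entries (added in Phase II).
    """
    high, low = {}, {}
    for name, genotype in markers.items():
        non_missing = len(genotype) - genotype.count("-")
        (high if non_missing >= eta else low)[name] = genotype
    return high, low

sample = {"m1": "AB-A-A", "m2": "ABAABA", "m14": "-B--AA"}
high, low = split_by_quality(sample, eta=4)
print(sorted(high), sorted(low))  # ['m1', 'm2'] ['m14']
```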

slide-20
SLIDE 20

The BubbleCluster Algorithm

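Iteration i: find $r_{MAX} := r_j$ for which $\mathrm{LOD}(m, r_j)$ is maximal; set $C_{MAX} := C_K \in C$ containing $r_{MAX}$

[Figure: a new marker $m$ is compared, via $\mathrm{LOD}(m, r_j)$, with each representative point $r_j$ of clusters $C_1$ and $C_2$.]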

slide-21
SLIDE 21

The BubbleCluster Algorithm

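If ($\mathrm{LOD}(m, r_{MAX}) < \mathrm{LOD\_threshold}$)

[Figure: marker $m$, its best representative point $r_{MAX}$ with score $\mathrm{LOD}(m, r_{MAX})$, and clusters $C_1$ and $C_2$.]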

slide-22
SLIDE 22

The BubbleCluster Algorithm

If ($\mathrm{LOD}(m, r_{MAX}) < \mathrm{LOD\_threshold}$)

    $C = C \cup \{m\}$

[Figure: marker $m$ starts a new cluster $C_3$ alongside $C_1$ and $C_2$.]

slide-23
SLIDE 23

The BubbleCluster Algorithm

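Else If (IS_INTERIOR($r_{MAX}$))

[Figure: marker $m$ and an interior representative point $r_{MAX}$, with score $\mathrm{LOD}(m, r_{MAX})$, and clusters $C_1$ and $C_2$.]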

slide-24
SLIDE 24

The BubbleCluster Algorithm

Else If (IS_INTERIOR($r_{MAX}$))

    $C_{MAX} = C_{MAX} \cup \{m\}$

[Figure: marker $m$ joins $C_1 = C_{MAX}$, the cluster of $r_{MAX}$; $C_2$ is unchanged.]

slide-25
SLIDE 25

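The BubbleCluster Algorithm

Else If (IS_EXTERIOR($m$, $r_{MAX}$))

[Figure: marker $m$ lies outside the boundary representative point $r_{MAX}$ of $C_1 = C_{MAX}$; $C_2$ is a separate cluster.]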

slide-26
SLIDE 26

The BubbleCluster Algorithm

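Else If (IS_EXTERIOR($m$, $r_{MAX}$))

    Add $m$ to the representative points of $C_{MAX}$
    Add $m$ to $C_{MAX}$

[Figure: marker $m$ becomes a representative point of $C_1 = C_{MAX}$, beyond $r_{MAX}$; $C_2$ is unchanged.]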

slide-27
SLIDE 27

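The BubbleCluster Algorithm

Else // $m$ is interior to the outer point $r_{MAX}$

[Figure: marker $m$ lies inside $C_1 = C_{MAX}$, on the inner side of the outer representative point $r_{MAX}$; $C_2$ is a separate cluster.]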

slide-28
SLIDE 28

The BubbleCluster Algorithm

Else // $m$ is interior to the outer point $r_{MAX}$

    Add $m$ to $C_{MAX}$

[Figure: marker $m$ joins $C_1 = C_{MAX}$ without becoming a representative point; $C_2$ is unchanged.]

slide-29
SLIDE 29

The BubbleCluster Algorithm

If $m$ has a LOD score above the threshold to two clusters,

[Figure: marker $m$ linked to both cluster $C_1$ and cluster $C_2$.]

slide-30
SLIDE 30

The BubbleCluster Algorithm

If $m$ has a LOD score above the threshold to two clusters,

Then merge the clusters and add $m$ to the merged cluster: $C_{NEW} = C_1 \cup C_2$

[Figure: marker $m$ inside the merged cluster $C_{NEW}$.]
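The control flow of Phase I (start a new cluster, join a cluster, or merge two clusters) can be sketched as below. This is a deliberately simplified illustration and not the authors' implementation: it omits the IS_INTERIOR and IS_EXTERIOR tests, so a cluster's representative points are only the markers that seeded it, and it assumes genotype strings with "-" for missing data. The names `lod` and `phase_one` are hypothetical.

```python
import math

def lod(a, b):
    """LOD score of two genotype strings; '-' marks a missing entry."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0                        # assumption: no shared data, no evidence
    r = sum(x != y for x, y in pairs)     # recombinants
    nr = len(pairs) - r                   # nonrecombinants
    theta = r / (r + nr)
    score = (r + nr) * math.log10(2)
    if nr:
        score += nr * math.log10(1 - theta)
    if r:
        score += r * math.log10(theta)
    return score

def phase_one(markers, tau):
    """Simplified Phase I: compare each marker only with representative points."""
    clusters = []                         # each: {"members": [...], "reps": [...]}
    for name, genotype in markers.items():
        # Clusters with a representative whose LOD to m reaches the threshold.
        hits = [c for c in clusters
                if any(lod(genotype, markers[rep]) >= tau for rep in c["reps"])]
        if not hits:                      # LOD(m, r_MAX) < threshold: new cluster
            clusters.append({"members": [name], "reps": [name]})
            continue
        target, extra = hits[0], hits[1:]
        for other in extra:               # m links several clusters: merge them
            target["members"].extend(other["members"])
            target["reps"].extend(other["reps"])
        clusters = [c for c in clusters if all(c is not o for o in extra)]
        target["members"].append(name)    # m joins the (possibly merged) cluster
    return clusters

toy = {
    "m1": "AABBABAB", "m2": "AABBABAB", "m3": "AABBABAA",
    "m4": "ABABBBAA", "m5": "ABABBBAA",
}
print([c["members"] for c in phase_one(toy, tau=1.0)])
```

Because each marker is compared only with representative points rather than with every other marker, the work per marker depends on the size of the representative set, not on the total number of markers.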

slide-31
SLIDE 31

End of Phase I

Stop when all markers with at least $\eta$ non-missing entries have been processed

Running time: $O(|H| \log_2 |H| + |H||R|)$

where: $|H|$ = size of the high-quality marker set, $|R|$ = size of the representative point set

Phase II: add low-quality markers

Phase III: merge small clusters

slide-32
SLIDE 32

BubbleCluster Parameters

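  • LOD threshold
  • Non-missing threshold: determines how many markers are high-quality

Recall the LOD score:

$$\mathrm{LOD} = \log_{10} \frac{(1 - \theta)^{NR} \, \theta^{R}}{0.5^{R + NR}}$$

Highest achievable LOD:

$$\lim_{R \to 0} \log_{10} \frac{(1 - \theta)^{NR} \, \theta^{R}}{0.5^{R + NR}} = NR \log_{10} 2$$

[Figure: genotype rows of an example marker pair $m_i$ and $m_j$, and two LOD thresholds $\tau_1$ and $\tau_2$.]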

slide-33
SLIDE 33

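BubbleCluster Parameters

  • LOD threshold
  • Non-missing threshold: determines how many markers are high-quality

Recall the LOD score:

$$\mathrm{LOD} = \log_{10} \frac{(1 - \theta_{ij})^{NR_{ij}} \, \theta_{ij}^{R_{ij}}}{0.5^{R_{ij} + NR_{ij}}}$$

[Figure: two LOD thresholds $\tau_1$ and $\tau_2$.]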

slide-34
SLIDE 34

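BubbleCluster Parameters

  • LOD threshold
  • Non-missing threshold: determines how many markers are high-quality

Recall the LOD score:

$$\mathrm{LOD} = \log_{10} \frac{(1 - \theta_{ij})^{NR_{ij}} \, \theta_{ij}^{R_{ij}}}{0.5^{R_{ij} + NR_{ij}}}$$

Example LOD:

$$\log_{10} \frac{\left(1 - \frac{1}{3}\right)^{2} \left(\frac{1}{3}\right)^{1}}{0.5^{3}} = 0.074$$

[Figure: genotype rows of an example marker pair $m_i$ and $m_j$, and two LOD thresholds $\tau_1$ and $\tau_2$.]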

slide-35
SLIDE 35

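BubbleCluster Parameters

  • LOD threshold
  • Non-missing threshold: determines how many markers are high-quality

Recall the LOD score:

$$\mathrm{LOD} = \log_{10} \frac{(1 - \theta_{ij})^{NR_{ij}} \, \theta_{ij}^{R_{ij}}}{0.5^{R_{ij} + NR_{ij}}}$$

Highest achievable LOD:

$$\lim_{R_{ij} \to 0} \log_{10} \frac{(1 - \theta_{ij})^{NR_{ij}} \, \theta_{ij}^{R_{ij}}}{0.5^{R_{ij} + NR_{ij}}} = NR_{ij} \log_{10} 2$$

[Figure: genotype rows of an example marker pair $m_i$ and $m_j$, and two LOD thresholds $\tau_1$ and $\tau_2$.]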
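The limit above ties the two parameters together: a pair of markers that share n non-missing entries can score at most n · log10(2). A short sketch of that arithmetic follows; the function names are hypothetical.

```python
import math

def max_lod(shared_nonmissing):
    """Highest achievable LOD for a pair with this many shared non-missing entries."""
    return shared_nonmissing * math.log10(2)

def min_shared_entries(tau):
    """Fewest shared non-missing entries for which a LOD of tau is reachable."""
    return math.ceil(tau / math.log10(2))

print(round(max_lod(6), 3))      # 1.806
print(min_shared_entries(10))    # 34
```

Under this bound, with only six individuals, as in the toy data matrix, no pair can exceed a LOD of about 1.8, so LOD thresholds in the range 5 to 30 only make sense with far more individuals.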

slide-36
SLIDE 36

Evaluation Metric: $F$-score

Given a golden standard clustering, the $F$-score measures the quality of another clustering by comparing it to the golden standard

Range: 0 to 1

The $F$-score is a harmonic mean of precision and recall

  • An $F$-score of 1 indicates perfect precision and perfect recall for every golden standard cluster

slide-37
SLIDE 37

Results: BubbleCluster on Real Data Sets

| Dataset | Size | Time | F-Score |
|---|---|---|---|
| Barley | 64K | 15 sec | 0.9993 |
| Switchgrass | 113K | 8.9 min | 0.9745 |
| Switchgrass | 548K | 1.9 hrs | 0.9894 |
| Wheat | 1.58M | 1.22 hrs | N/A * |

* Results under review at Genome Biology

slide-38
SLIDE 38

Comparison of Clustering Algorithms for Simulated Data

| Clustering Method | F-score (12.5K markers) | Time (12.5K markers) | F-score (25K markers) | Time (25K markers) |
|---|---|---|---|---|
| JoinMap | 0.99964 | 14 min | 0.99982 | 46 min |
| MSTMap | 0.99964 | 4.5 min | 0.99982 | 20 min |
| PIC | 0.47024 | 11 sec (+ 4 min) | 0.60782 | 44 sec (+ 16.5 min) |
| BubbleCluster | 0.99964 | 6 sec | 0.99982 | 15 sec |

Simulated data created with Nicholas Tinker’s Spaghetti Software.

slide-39
SLIDE 39

Scaling Results for BubbleCluster on Simulated Data

[Figure: runtime (s, axis from 1 to 1024) and error (1 - F-score, axis from 1E-5 to 1E+0) versus dataset size (12.5, 25, 50, 100, 200, and 400 thousand markers). Four series: error and runtime at 65% missing data, and error and runtime at 35% missing data. Runtime data labels for the two runtime series: 4.4, 9.8, 21.0, 47.2, 96.6, 237.8 and 6.7, 14.0, 31.7, 86.4, 198.7, 475.1.]

slide-40
SLIDE 40

Effect of the LOD threshold

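Fixed missing entry threshold, increasing LOD threshold

| LOD threshold | 5 | 10 | 15 | 20 | 25 | 30 |
|---|---|---|---|---|---|---|
| F-Score | 0.6225 | 0.9999 | 0.9999 | 0.9999 | 0.9999 | 0.9999 |
| Time (s) | 48.6 | 67.0 | 70.9 | 82.0 | 106 | 170 |

200K markers, 300 individuals, 35% missing rate

[Figure: a smaller LOD threshold $\tau_1$ vs. a larger LOD threshold $\tau_2$.]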

slide-41
SLIDE 41

Effect of the missing data threshold

Fixed LOD threshold, increasing non-missing entry threshold

| Non-missing threshold | 132 | 166 | 172 | 179 | 186 | 192 |
|---|---|---|---|---|---|---|
| F-Score | 0.9999 | 0.9999 | 0.9992 | 0.9930 | 0.9610 | 0.8948 |
| Time (s) | 82.0 | 84.6 | 82.7 | 83.0 | 81.7 | 82.0 |

200K markers, 300 individuals, 35% missing rate

slide-42
SLIDE 42

Conclusion

By exploiting the structure underlying genetic marker clusters, we were able to design a fast clustering algorithm tailored to genetic marker data

[Figure: marker $m$ between clusters $C_1$ and $C_2$.]

slide-43
SLIDE 43

Conclusion

By exploiting the structure underlying genetic marker clusters, we were able to design a fast clustering algorithm tailored to genetic marker data

[Figure: marker $m$ between clusters $C_1$ and $C_2$.]

While remaining highly accurate, we outperform popular existing tools in both runtime and scalability

| Clustering Method | F-score (12.5K markers) | Time (12.5K markers) | F-score (25K markers) | Time (25K markers) |
|---|---|---|---|---|
| JoinMap | 0.99964 | 14 min | 0.99982 | 46 min |
| MSTMap | 0.99964 | 4.5 min | 0.99982 | 20 min |
| PIC | 0.47024 | 11 sec (+ 4 min) | 0.60782 | 44 sec (+ 16.5 min) |
| BubbleCluster | 0.99964 | 6 sec | 0.99982 | 15 sec |

slide-44
SLIDE 44

Future Work

  • Use representative points as starting point for ordering phase
slide-45
SLIDE 45
slide-46
SLIDE 46

Future Work

  • Use representative points as starting point for ordering phase
  • Provide a more thorough theoretical analysis of achievable clustering as well as order accuracy given assumptions about error and missing data rates

  • Develop efficient and accurate, large-scale genetic mapping software
slide-47
SLIDE 47

Thank You

Code for BubbleCluster soon available at: www.ucsb.edu/~veronika

slide-48
SLIDE 48

Backup Slides

slide-49
SLIDE 49

Choosing LOD and non-missing threshold

Goal: minimize $P(\mathrm{mistake})$ and maximize $F$-score

Let $p = P(\mathrm{LOD}(m_i, m_j) > \mathrm{LOD\_threshold} \mid m_i \in C_i,\ m_j \in C_j,\ i \neq j)$

Let $\tau$ = LOD threshold

  • By definition of the LOD score, $p = \frac{1}{10^{\tau}}$

Let $n_{comp}$ = number of LOD comparisons we make

Then, if we want to ensure that $(1 - p)^{n_{comp}} > 1 - \epsilon$, we need:

$$\tau > \log_{10} \left( \frac{1}{1 - (1 - \epsilon)^{1 / n_{comp}}} \right)$$

At the same time, we want to include a marker in the high-quality set only if we expect that it will achieve a LOD of $\tau$ or greater with another marker, requiring:

$$n_{nm} > \frac{\tau}{(1 - \mu) \log_{10} 2}$$

where $\mu$ is the missing rate and $n_{nm}$ is the number of non-missing entries in the marker
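Both bounds are easy to evaluate numerically. The sketch below is an illustration under the slide's definitions only; the function names and the sample inputs (one million comparisons, a 1% tolerated mistake probability) are assumptions chosen for the example. `expm1` and `log1p` keep the computation stable when the number of comparisons is large.

```python
import math

def min_lod_threshold(n_comp, eps):
    """Lower bound on tau: log10(1 / (1 - (1 - eps)^(1 / n_comp)))."""
    # 1 - (1 - eps)^(1/n_comp), computed without catastrophic cancellation.
    p_max = -math.expm1(math.log1p(-eps) / n_comp)
    return math.log10(1.0 / p_max)

def min_nonmissing(tau, mu):
    """Lower bound on n_nm: tau / ((1 - mu) * log10(2))."""
    return tau / ((1.0 - mu) * math.log10(2))

print(round(min_lod_threshold(n_comp=10**6, eps=0.01), 2))  # 8.0
print(round(min_nonmissing(tau=20, mu=0.35), 1))            # 102.2
```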

slide-50
SLIDE 50

Evaluation Metric for Cluster Quality

F-score (range: 0 to 1)

  • Given a “golden standard clustering”, the F-score measures the quality of a clustering as follows:
  • The F-score between a golden standard cluster $g$ and a test cluster $c$ is a harmonic mean of precision $P$ and recall $R$:

$$\mathrm{Fscore}(g, c) = \frac{2PR}{P + R}$$

  • The overall F-score between the golden standard clustering G and a test clustering C is a weighted average of the F-scores for each golden standard cluster $g$:

$$\mathrm{overall\_Fscore}(G, C) = \frac{1}{m} \sum_{g \in G} |g| \cdot \max_{c \in C} \mathrm{Fscore}(g, c)$$

where $m$ here is the total number of markers, so that the weights $|g| / m$ sum to 1
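A minimal sketch of the overall F-score, with clusters represented as Python sets. The slide does not spell out precision and recall, so this assumes the usual set-overlap definitions (precision = |g ∩ c| / |c|, recall = |g ∩ c| / |g|); the function names are hypothetical.

```python
def f_score(g, c):
    """Harmonic mean of precision and recall between gold cluster g and test cluster c."""
    overlap = len(g & c)
    if overlap == 0:
        return 0.0
    precision = overlap / len(c)    # assumed definition: share of c that lies in g
    recall = overlap / len(g)       # assumed definition: share of g recovered by c
    return 2 * precision * recall / (precision + recall)

def overall_f_score(gold, test):
    """Size-weighted average, over gold clusters, of the best F-score in the test clustering."""
    total = sum(len(g) for g in gold)
    return sum(len(g) * max(f_score(g, c) for c in test) for g in gold) / total

gold = [{1, 2, 3}, {4, 5}]
print(overall_f_score(gold, [{1, 2, 3}, {4, 5}]))            # 1.0
print(round(overall_f_score(gold, [{1, 2}, {3, 4, 5}]), 3))  # 0.8
```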

slide-51
SLIDE 51

Local Linearity Assumption

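Although the LOD score does not obey the triangle inequality, we assume that it does at close ranges and with enough data

[Figure: marker $m$ and representative points $r_1$ and $r_2$, with pairwise scores $\mathrm{LOD}(m, r_1)$, $\mathrm{LOD}(m, r_2)$, and $\mathrm{LOD}(r_2, r_1)$.]

$\mathrm{LOD}(m, r_2) < \mathrm{LOD}(m, r_1)$ && $\mathrm{LOD}(m, r_2) < \mathrm{LOD}(r_1, r_2)$ && $\mathrm{LOD}(m, r_1) > \mathrm{LOD}(r_1, r_2)$

$m$ is a new boundary point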
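The boundary-point condition is a conjunction of three LOD comparisons, which can be written as a small predicate. The sketch takes the three pairwise LOD scores as plain numbers; the function name and the sample values are hypothetical, and it reads a higher LOD as "closer".

```python
def is_new_boundary_point(lod_m_r1, lod_m_r2, lod_r1_r2):
    """True when m satisfies all three conditions relative to r1 and r2."""
    return (lod_m_r2 < lod_m_r1         # m is more weakly linked to r2 than to r1
            and lod_m_r2 < lod_r1_r2    # m is more weakly linked to r2 than r1 is
            and lod_m_r1 > lod_r1_r2)   # m is more strongly linked to r1 than r2 is

print(is_new_boundary_point(lod_m_r1=12.0, lod_m_r2=5.0, lod_r1_r2=8.0))   # True
print(is_new_boundary_point(lod_m_r1=8.0, lod_m_r2=12.0, lod_r1_r2=9.0))   # False
```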

slide-52
SLIDE 52

The Effect of the LOD threshold

The standard approach to clustering can be viewed as single linkage clustering

Time complexity: $O(M^2)$

[Figure: single-linkage hierarchy over markers m1 to m15, with merge levels at LOD 10, 9, 8, 7, and 6.]

slide-53
SLIDE 53

The Effect of the LOD threshold

A high LOD threshold ensures that the only edges that remain in the completely connected graph are between markers that are extremely likely to be genetically linked

[Figure: single-linkage hierarchy over markers m1 to m15, with merge levels at LOD 10, 9, 8, 7, and 6.]

slide-54
SLIDE 54

The Effect of the LOD threshold

Example: LOD threshold = 10

[Figure: single-linkage hierarchy over markers m1 to m15, with merge levels at LOD 10, 9, 8, 7, and 6, cut at the example threshold; the resulting linkage groups are shown next to it.]

slide-55
SLIDE 55

The Effect of the LOD threshold

Example: LOD threshold = 8

[Figure: single-linkage hierarchy over markers m1 to m15, with merge levels at LOD 10, 9, 8, 7, and 6, cut at the example threshold; the resulting linkage groups are shown next to it.]

slide-56
SLIDE 56

The Effect of the LOD threshold

Example: LOD threshold = 7

[Figure: single-linkage hierarchy over markers m1 to m15, with merge levels at LOD 10, 9, 8, 7, and 6, cut at the example threshold; the resulting linkage groups are shown next to it.]