Locality sensitive hashing for the edit distance

slide-1
SLIDE 1

Locality sensitive hashing for the edit distance

Guillaume Marçais, Dan DeBlasio, Prashant Pandey, Carl Kingsford

Carnegie Mellon University

slide-5
SLIDE 5

Overlap computation

  • Compute overlaps between reads (MHAP)
  • Instance of “Nearest Neighbor Problem” for edit distance
  • Use multiple hash tables
  • Need meaningful hash collisions

[Figure: reads, an “Overlap?” query between them, and hash tables]
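
The "multiple hash tables" idea can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `candidate_pairs`, the toy read identifiers, and the convention that table i is keyed by sketch entry i are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(sketches):
    """sketches: dict read_id -> list of m hashable sketch entries (hypothetical layout).

    Table i is keyed by entry i of every sketch. Two reads become a candidate
    overlap if they fall in the same bucket of at least one table.
    """
    if not sketches:
        return set()
    m = len(next(iter(sketches.values())))
    tables = [defaultdict(list) for _ in range(m)]
    for read_id, sketch in sketches.items():
        for i, entry in enumerate(sketch):
            tables[i][entry].append(read_id)
    pairs = set()
    for table in tables:
        for bucket in table.values():
            for a, b in combinations(sorted(bucket), 2):
                pairs.add((a, b))
    return pairs

# Toy sketches with m = 3 entries each (made-up values).
toy = {
    "r1": ["GA", "AG", "TT"],
    "r2": ["GA", "CC", "TT"],
    "r3": ["CG", "AT", "AA"],
}
print(candidate_pairs(toy))  # {('r1', 'r2')}
```

Only colliding reads are compared further, which is why the collisions need to be meaningful: a collision should signal a small edit distance.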

slide-6
SLIDE 6

Locality Sensitive Hashing

Pick h at random from H.

Locality sensitive hash family: a family H of hash functions where similar elements are more likely to have the same value than distant elements.

slide-8
SLIDE 8

Locality Sensitive Hashing

The family H is sensitive for distance D if there exist d1 < d2 and p1 > p2 such that for all x, y ∈ U:

  D(x, y) ≤ d1 ⇒ Pr_{h∈H}[h(x) = h(y)] ≥ p1
  D(x, y) ≥ d2 ⇒ Pr_{h∈H}[h(x) = h(y)] ≤ p2

  • Low distance ⇔ High collisions
  • High distance ⇔ Low collisions
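
The definition can be probed empirically by drawing many h from a family and counting collisions. The sketch below uses minHash over small integer sets as a stand-in family, with Jaccard distance as D; the helper names, the toy sets, and the number of trials are assumptions chosen for illustration.

```python
import random

def draw_minhash(universe, rng):
    """Draw one h from the family: a random ranking of the universe;
    h(S) is the lowest-ranked element of the set S."""
    order = list(universe)
    rng.shuffle(order)
    rank = {e: i for i, e in enumerate(order)}
    return lambda s: min(s, key=rank.__getitem__)

def collision_rate(x, y, universe, trials=2000, seed=0):
    """Estimate Pr_{h in H}[h(x) = h(y)] by sampling h repeatedly."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        h = draw_minhash(universe, rng)
        hits += h(x) == h(y)
    return hits / trials

universe = range(20)
near = (set(range(0, 10)), set(range(1, 11)))  # Jaccard distance 2/11
far = (set(range(0, 10)), set(range(8, 18)))   # Jaccard distance 16/18
p_near = collision_rate(*near, universe)
p_far = collision_rate(*far, universe)
print(p_near, p_far)  # the near pair should collide far more often
```

For this family the collision probability is 1 minus the Jaccard distance, so any d1 < d2 gives p1 = 1 − d1 > p2 = 1 − d2.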

slide-9
SLIDE 9

LSH for the edit distance

How to design an LSH for edit distance?

  • LSH for Jaccard distance (minHash) used as proxy
  • Jaccard distance is significantly different from edit distance

slide-13
SLIDE 13

Jaccard distance

Jaccard distance between sets A, B:

  J(A, B) = 1 − |A ∩ B| / |A ∪ B|

Jaccard between sequences x, y: Jaccard distance of their k-mer sets

  J(x, y) = J(K(x), K(y))

  • Low D(x, y) ⇒ Low J(x, y)
  • High D(x, y) ⇏ High J(x, y)
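
A minimal sketch of J(x, y) = J(K(x), K(y)), assuming K(x) is the set of all length-k substrings of x. The function names and the two toy sequences are illustrative, not taken from the slides.

```python
def kmers(seq, k):
    """K(x): the set of all k-mers (length-k substrings) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard_distance(a, b):
    """J(A, B) = 1 - |A ∩ B| / |A ∪ B|; defined as 0 for two empty sets."""
    union = a | b
    if not union:
        return 0.0
    return 1 - len(a & b) / len(union)

def sequence_jaccard(x, y, k):
    """J(x, y) = J(K(x), K(y))."""
    return jaccard_distance(kmers(x, k), kmers(y, k))

x = "AGTTGAGCGG"
y = "AGTTGACCGG"  # one substitution relative to x
d_xy = sequence_jaccard(x, y, 2)
d_xx = sequence_jaccard(x, x, 2)
print(d_xy)  # small but non-zero: a single edit changes only a few 2-mers
print(d_xx)  # 0.0
```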

slide-16
SLIDE 16

Jaccard ignores k-mer repetition

x = AAAAAAAAAAAAAAA CCCCC   (n−k A's followed by k C's)
  → {AAAAA, AAAAC, AAACC, AACCC, ACCCC, CCCCC}
y = AAAAA CCCCCCCCCCCCCCC   (k A's followed by n−k C's)
  → {AAAAA, AAAAC, AAACC, AACCC, ACCCC, CCCCC}

Jaccard distance J(x, y) = 0
Edit distance D(x, y) ≥ 1 − 2k/n

Identical k-mer content and high edit distance
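
The example can be checked directly for n = 20, k = 5. This sketch assumes D is the Levenshtein distance divided by the sequence length n (the normalization is inferred from the bound on the slide), and uses a textbook dynamic program; the function names are illustrative.

```python
def kmers(seq, k):
    """Set of all k-mers of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def edit_distance(x, y):
    """Levenshtein distance (unit-cost insert, delete, substitute)."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, 1):
        cur = [i]
        for j, cy in enumerate(y, 1):
            cur.append(min(prev[j] + 1,                # delete cx
                           cur[j - 1] + 1,             # insert cy
                           prev[j - 1] + (cx != cy)))  # match or substitute
        prev = cur
    return prev[-1]

n, k = 20, 5
x = "A" * (n - k) + "C" * k
y = "A" * k + "C" * (n - k)
same_kmers = kmers(x, k) == kmers(y, k)
d = edit_distance(x, y)
print(same_kmers, d, d / n)  # True 10 0.5
```

Here the normalized edit distance meets the bound 1 − 2k/n = 0.5 exactly, while the k-mer sets are identical.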

slide-17
SLIDE 17

Weighted Jaccard handles repetitions

x = AAAAAAAAAAAAAAA CCCCC   (n−k A's followed by k C's)
  → {(AAAAA,1), (AAAAA,2), ..., (AAAAA,11),
     (AAAAC,1), (AAACC,1), (AACCC,1), (ACCCC,1), (CCCCC,1)}
y = AAAAA CCCCCCCCCCCCCCC   (k A's followed by n−k C's)
  → {(AAAAA,1), (AAAAC,1), (AAACC,1), (AACCC,1), (ACCCC,1),
     (CCCCC,1), (CCCCC,2), ..., (CCCCC,11)}

Weighted Jaccard Jw(x, y) = 1 − (k+2)/n
Edit distance D(x, y) ≥ 1 − 2k/n

Weighted Jaccard = Jaccard for multi-sets
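
A small sketch of "weighted Jaccard = Jaccard for multi-sets" on the same x and y (n = 20, k = 5). It assumes the usual min/max definition of multiset intersection and union, and evaluates the distance on this instance directly rather than through the closed form on the slide; the function names are illustrative.

```python
from collections import Counter
from fractions import Fraction

def kmer_counts(seq, k):
    """Multiset of k-mers: k-mer -> number of occurrences."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def weighted_jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| with multiset (min / max) intersection and union."""
    inter = sum((a & b).values())  # Counter & takes the minimum count
    union = sum((a | b).values())  # Counter | takes the maximum count
    return Fraction(0) if union == 0 else 1 - Fraction(inter, union)

def occurrence_pairs(seq, k):
    """Same multiset written as a set of (k-mer, occurrence number) pairs."""
    seen = Counter()
    pairs = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        seen[kmer] += 1
        pairs.add((kmer, seen[kmer]))
    return pairs

n, k = 20, 5
x = "A" * (n - k) + "C" * k
y = "A" * k + "C" * (n - k)
jw = weighted_jaccard_distance(kmer_counts(x, k), kmer_counts(y, k))
px, py = occurrence_pairs(x, k), occurrence_pairs(y, k)
pair_distance = 1 - Fraction(len(px & py), len(px | py))
print(jw, pair_distance)  # 10/13 10/13
```

Both views agree, and the distance is no longer 0: counting repetitions separates two sequences that plain Jaccard cannot tell apart.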

slide-20
SLIDE 20

Jaccard and weighted Jaccard ignore relative order

x = CCCCACCAACACAAAACCC
  → AAAA, AAAC, AACA, AACC, ACAA, ACAC, ACCA, ACCC,
    CAAA, CAAC, CACA, CACC, CCAA, CCAC, CCCA, CCCC
y = AAAACACAACCCCACCAAA
  → AAAA, AAAC, AACA, AACC, ACAA, ACAC, ACCA, ACCC,
    CAAA, CAAC, CACA, CACC, CCAA, CCAC, CCCA, CCCC

  • x, y: de Bruijn sequences, contain all 16 possible 4-mers once

J(x, y) = Jw(x, y) = 0
D(x, y) = 0.63
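
The claim that both sequences contain every 4-mer over {A, C} exactly once can be verified mechanically. A short check, with `kmer_counts` as an illustrative helper name:

```python
from collections import Counter
from itertools import product

def kmer_counts(seq, k):
    """Multiset of k-mers: k-mer -> number of occurrences."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

x = "CCCCACCAACACAAAACCC"
y = "AAAACACAACCCCACCAAA"

# Every 4-mer over the alphabet {A, C}, each with count 1.
all_4mers = Counter("".join(p) for p in product("AC", repeat=4))

x_ok = kmer_counts(x, 4) == all_4mers
y_ok = kmer_counts(y, 4) == all_4mers
print(x_ok, y_ok)  # True True
```

Equal k-mer multisets force both the Jaccard and the weighted Jaccard distance to 0, even though the two sequences arrange those k-mers in a different order.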

slide-21
SLIDE 21

Jaccard is different from edit distance

Unlike edit distance, Jaccard is insensitive to:

  1. k-mer repetitions
  2. relative positions of k-mers

slide-22
SLIDE 22

OMH: Order Min Hash

  • minHash is an LSH for Jaccard
  • OMH is a refinement of minHash
  • OMH is sensitive to
      • repeated k-mers
      • relative order of k-mers

slide-30
SLIDE 30

minHash & OMH sketches

x = AGTTGAGCGGAAGGTG, k = 2, m = 6, ℓ = 2

minHash: the distinct 2-mers of x, shown under each of the m = 6 permutations

  AG GT TT TG GA GC CG GG AA

OMH: the 2-mers of x paired with their position in x (listed here in sequence order), shown under each of the m = 6 permutations

  (AG,0) (GT,1) (TT,2) (TG,3) (GA,4) (AG,5) (GC,6) (CG,7) (GG,8) (GA,9) (AA,10) (AG,11) (GG,12) (GT,13) (TG,14)

[Figure: six permuted orderings, numbered 1 to 6, for each scheme; selected 2-mers shown on the slide: GA AG GG GC AG TG]
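
One way to compute a sketch with the parameters k, m and ℓ is outlined below. It is a hedged sketch under several assumptions, not the authors' implementation: each of the m random permutations is simulated by a seeded BLAKE2b hash, repeated k-mers are told apart by their occurrence number (as on the weighted Jaccard slide), the ℓ lowest-ranked k-mers are kept and then listed in the order they appear in x, and the name `omh_sketch` is hypothetical.

```python
import hashlib
from collections import Counter

def omh_sketch(seq, k, ell, m, seed=0):
    """Return a list of m tuples, each holding ell k-mers of seq."""
    seen = Counter()
    items = []  # (position, k-mer, occurrence number)
    for pos in range(len(seq) - k + 1):
        kmer = seq[pos:pos + k]
        seen[kmer] += 1
        items.append((pos, kmer, seen[kmer]))
    if len(items) < ell:
        raise ValueError("sequence has fewer than ell k-mers")

    def rank(item, i):
        # Stand-in for the i-th random permutation (assumption): order the
        # (k-mer, occurrence) pairs by a seeded hash value.
        _, kmer, occ = item
        data = f"{seed}:{i}:{kmer}:{occ}".encode()
        return hashlib.blake2b(data, digest_size=8).digest()

    sketch = []
    for i in range(m):
        smallest = sorted(items, key=lambda it: rank(it, i))[:ell]
        smallest.sort()  # back to sequence order, since position comes first
        sketch.append(tuple(kmer for _, kmer, _ in smallest))
    return sketch

x = "AGTTGAGCGGAAGGTG"
sk = omh_sketch(x, k=2, ell=2, m=6)
for entry in sk:
    print(entry)
```

Two sequences can then be compared by the fraction of the m entries on which their sketches agree. Because each entry records the relative order of its ℓ k-mers, sequences with the same k-mer content in a different order tend to disagree.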

slide-31
SLIDE 31

OMH is an LSH for edit distance

Theorem: OMH is an LSH for edit distance.
There exists (d1, d2, p1, p2) such that OMH is sensitive for the edit distance.

slide-32
SLIDE 32

Conclusion

  • OMH:
      • improvement on minHash
      • easy to compute
      • locality sensitive for edit distance
  • LSH for other alignment scores?
  • Smallest “gap” achievable?

slide-35
SLIDE 35

GBMF4554, CCF-1256087, CCF-1319998, R01HG007104, R01GM122935

Thank you

Natalie Sauerwald, Cong Ma, Hongyu Zheng, Laura Tung, Yutong Qiu, Yihang Shen, Minh Hoang, Mohsen Ferdosi, Shawn Baker, Yinjie Gao