SLIDE 1
Locality sensitive hashing for the edit distance
Guillaume Marçais, Dan DeBlasio, Prashant Pandey, Carl Kingsford
Carnegie Mellon University
SLIDE 5 Overlap computation
- Compute overlaps between reads (MHAP)
- Instance of the “Nearest Neighbor Problem” for edit distance
- Use multiple hash tables
- Need meaningful hash collisions
[Figure: reads, overlap query, hash tables]
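The multiple-hash-table idea can be pictured as follows: each table buckets the reads under a different hash function, and two reads that share a bucket in at least one table become a candidate overlap pair. This is a minimal sketch under stated assumptions; `candidate_pairs` and the toy hash functions (first and last three characters of a read) are hypothetical and not taken from the talk.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(items, hash_functions):
    """Return index pairs (i, j), i < j, that collide in at least one table."""
    pairs = set()
    for h in hash_functions:
        table = defaultdict(list)            # one hash table per hash function
        for idx, item in enumerate(items):
            table[h(item)].append(idx)
        for bucket in table.values():        # every pair in a bucket is a candidate
            pairs.update(combinations(bucket, 2))
    return pairs

# Toy data and toy hash functions, for illustration only.
reads = ["ACGTAC", "ACGTTC", "TTGACA", "GGGACA"]
toy_hashes = [lambda r: r[:3], lambda r: r[-3:]]
print(sorted(candidate_pairs(reads, toy_hashes)))  # [(0, 1), (2, 3)]
```

With these toy hashes a collision only means a shared prefix or suffix. The point of the talk is to choose hash functions whose collisions are meaningful for edit distance.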
SLIDE 6
Locality Sensitive Hashing
Pick h at random from H.
Locality sensitive hash family: a family H of hash functions where similar elements are more likely to have the same hash value than distant elements.
SLIDE 8 Locality Sensitive Hashing
The family H is sensitive for distance D if there exist $d_1 < d_2$ and $p_1 > p_2$ such that for all $x, y \in U$:
$$D(x, y) \le d_1 \implies \Pr_{h \in H}[h(x) = h(y)] \ge p_1 \quad \text{(high collisions)}$$
$$D(x, y) \ge d_2 \implies \Pr_{h \in H}[h(x) = h(y)] \le p_2 \quad \text{(low collisions)}$$
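To make the definition concrete, here is a small sketch using bit sampling for the Hamming distance, a standard textbook LSH family that is not part of this talk (which targets edit distance). The family is H = {h_i : h_i(s) = s[i]}, so the collision probability can be computed exactly as the fraction of agreeing positions, 1 − D(x, y)/n. The function names and example strings are assumptions made for illustration.

```python
def hamming(x, y):
    """Number of positions where two equal-length strings differ."""
    return sum(a != b for a, b in zip(x, y))

def collision_probability(x, y):
    """Exact Pr[h(x) == h(y)] when h(s) = s[i] for a uniformly random i."""
    n = len(x)
    return sum(x[i] == y[i] for i in range(n)) / n

x = "ACGTACGTAC"
near = "ACGTACGTAA"   # Hamming distance 1 from x
far = "TGCATGCATG"    # Hamming distance 10 from x

for other in (near, far):
    print(hamming(x, other), collision_probability(x, other))
```

For strings of length 10 this family is sensitive with, for example, d1 = 1, d2 = 5, p1 = 0.9, p2 = 0.5.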
SLIDE 9 LSH for the edit distance
How to design an LSH for edit distance?
- LSH for Jaccard distance (minHash) used as proxy
- Jaccard distance is significantly different from edit distance
SLIDE 13 Jaccard distance
Jaccard distance between sets A, B: $J(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|}$
Jaccard between sequences x, y: Jaccard distance of their k-mer sets, $J(x, y) = J(K(x), K(y))$
[Figure: sequences sharing many k-mers ⇒ low J(x, y); sequences sharing few k-mers ⇒ high J(x, y)]
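The two definitions above translate directly into code. This is a minimal sketch; the function names `kmer_set`, `jaccard_distance`, and `sequence_jaccard` and the example strings are assumptions, and returning 0 for two empty sets is a convention chosen here.

```python
def kmer_set(s, k):
    """K(s): the set of all length-k substrings of s."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def jaccard_distance(a, b):
    """J(A, B) = 1 - |A ∩ B| / |A ∪ B| (0.0 for two empty sets, by convention)."""
    union = a | b
    if not union:
        return 0.0
    return 1 - len(a & b) / len(union)

def sequence_jaccard(x, y, k):
    """J(x, y) = J(K(x), K(y))."""
    return jaccard_distance(kmer_set(x, k), kmer_set(y, k))

print(sequence_jaccard("ACGTACGT", "ACGTACGA", 3))  # about 0.2: 4 of 5 k-mers shared
print(sequence_jaccard("ACGTACGT", "TTTTTTTT", 3))  # 1.0: no k-mer shared
```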
SLIDE 16 Jaccard ignores k-mer repetition
x = AAAAAAAAAAAAAAACCCCC (n − k A's followed by k C's) → {AAAAA, AAAAC, AAACC, AACCC, ACCCC, CCCCC}
y = AAAAACCCCCCCCCCCCCCC (k A's followed by n − k C's) → {AAAAA, AAAAC, AAACC, AACCC, ACCCC, CCCCC}
Jaccard distance: $J(x, y) = 0$
Edit distance: $D(x, y) \ge 1 - \frac{2k}{n}$
Identical k-mer content and high edit distance
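The counterexample can be checked numerically. The sketch below assumes n = 20 and k = 5 (values consistent with the 5-mers shown on the slide), uses a standard dynamic-programming Levenshtein distance, and normalizes it by n. All function names are illustrative.

```python
def kmer_set(s, k):
    """Set of all length-k substrings of s."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def edit_distance(a, b):
    """Levenshtein distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # match or substitute
        prev = cur
    return prev[-1]

n, k = 20, 5                      # assumed values for illustration
x = "A" * (n - k) + "C" * k
y = "A" * k + "C" * (n - k)

print(kmer_set(x, k) == kmer_set(y, k))   # True, so the Jaccard distance is 0
print(edit_distance(x, y) / n)            # 0.5, which equals 1 - 2k/n here
```

The edit distance is n − 2k because x and y differ in their number of A's by n − 2k, and each edit changes that count by at most one.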
SLIDE 17 Weighted Jaccard handles repetitions
x = AAAAAAAAAAAAAAACCCCC (n − k A's followed by k C's) →
(AAAAA,1),(AAAAA,2),...,(AAAAA,11),
(AAAAC,1),(AAACC,1),(AACCC,1),(ACCCC,1),(CCCCC,1)
y = AAAAACCCCCCCCCCCCCCC (k A's followed by n − k C's) →
(AAAAA,1),(AAAAC,1),(AAACC,1),(AACCC,1),(ACCCC,1),
(CCCCC,1),(CCCCC,2),...,(CCCCC,11)
Weighted Jaccard: $J_w(x, y) = 1 - \frac{k+2}{n}$
Edit distance: $D(x, y) \ge 1 - \frac{2k}{n}$
Weighted Jaccard = Jaccard for multi-sets
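Treating the k-mers as a multi-set can be sketched with `collections.Counter`, whose `&` and `|` operators take element-wise minimum and maximum counts. The function names and the choice n = 20, k = 5 are assumptions for illustration. For these values the exact multi-set computation gives 1 − 6/26 ≈ 0.77, which is at least the slide's 1 − (k+2)/n = 0.65, so the weighted Jaccard distance is large here, like the edit distance.

```python
from collections import Counter

def kmer_counts(s, k):
    """Multi-set of k-mers: each k-mer mapped to its number of occurrences."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))

def weighted_jaccard_distance(a, b):
    """1 - (sum of min counts) / (sum of max counts); 0.0 if both are empty."""
    union = sum((a | b).values())          # Counter | takes max counts
    if union == 0:
        return 0.0
    intersection = sum((a & b).values())   # Counter & takes min counts
    return 1 - intersection / union

n, k = 20, 5                               # assumed values for illustration
x = "A" * (n - k) + "C" * k
y = "A" * k + "C" * (n - k)

cx, cy = kmer_counts(x, k), kmer_counts(y, k)
print(cx["AAAAA"], cy["AAAAA"])            # 11 1
print(weighted_jaccard_distance(cx, cy))
```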
SLIDE 18
Jaccard and weighted Jaccard ignore relative order
x = CCCCACCAACACAAAACCC → AAAA,AAAC,AACA,AACC,ACAA,ACAC,ACCA,ACCC,
CAAA,CAAC,CACA,CACC,CCAA,CCAC,CCCA,CCCC
y = AAAACACAACCCCACCAAA → AAAA,AAAC,AACA,AACC,ACAA,ACAC,ACCA,ACCC,
CAAA,CAAC,CACA,CACC,CCAA,CCAC,CCCA,CCCC
- x, y: de Bruijn sequences, contain all 16 possible 4-mers once
$J(x, y) = J_w(x, y) = 0$, $D(x, y) = 0.63$
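A short check of the de Bruijn example: both sequences yield exactly the same multi-set of 4-mers, so any measure computed from k-mer content alone cannot tell them apart, even though the k-mers appear in a different order. The helper name `kmer_list` is an assumption for illustration.

```python
from collections import Counter

def kmer_list(s, k):
    """k-mers of s in order of appearance."""
    return [s[i:i + k] for i in range(len(s) - k + 1)]

x = "CCCCACCAACACAAAACCC"
y = "AAAACACAACCCCACCAAA"
k = 4

cx, cy = Counter(kmer_list(x, k)), Counter(kmer_list(y, k))
print(len(cx), max(cx.values()))                 # 16 1: every 4-mer over {A, C}, once
print(cx == cy)                                  # True: J and J_w are both 0
print(kmer_list(x, k)[0], kmer_list(y, k)[0])    # CCCC AAAA: the order differs
```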
SLIDE 21 Jaccard is different from edit distance
Unlike edit distance, Jaccard is insensitive to:
1. k-mer repetitions
2. relative positions of k-mers
SLIDE 22 OMH: Order Min Hash
- minHash is an LSH for Jaccard
- OMH is a refinement of minHash
- OMH is sensitive to
  - repeated k-mers
  - relative order of k-mers
SLIDE 30
minHash & OMH sketches
x = AGTTGAGCGGAAGGTG, k = 2, m = 6, ℓ = 2
[Figure: m = 6 random orderings, numbered 1 to 6, applied to the distinct 2-mers of x (minHash) and to the 2-mers of x paired with their positions (OMH)]
Distinct 2-mers of x: AG GT TT TG GA GC CG GG AA
(2-mer, position) pairs of x: (AG,0) (GT,1) (TT,2) (TG,3) (GA,4) (AG,5) (GC,6) (CG,7) (GG,8) (GA,9) (AA,10) (AG,11) (GG,12) (GT,13) (TG,14)
Sketch entries shown on the slide: GA AG GG GC AG TG
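The two sketches can be contrasted in code. This is a minimal sketch under stated assumptions, not the authors' implementation: the m random orderings are simulated with a seeded SHA-256 hash (`seeded_hash`, hypothetical), repeated k-mers are made distinct by tagging them with an occurrence number, and each OMH entry is taken to be the ℓ smallest tagged k-mers listed in their order of appearance in the sequence.

```python
import hashlib
from collections import defaultdict

def seeded_hash(seed, item):
    """Deterministic stand-in for the seed-th random ordering (an assumption)."""
    data = f"{seed}:{item}".encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

def kmers(s, k):
    """k-mers of s in order of appearance."""
    return [s[i:i + k] for i in range(len(s) - k + 1)]

def minhash_sketch(s, k, m):
    """For each of m orderings, keep the smallest distinct k-mer."""
    distinct = set(kmers(s, k))
    return [min(distinct, key=lambda km: seeded_hash(seed, km))
            for seed in range(m)]

def omh_sketch(s, k, m, l):
    """For each of m orderings, keep the l smallest (k-mer, occurrence)
    pairs, then list their k-mers in order of position in s."""
    seen = defaultdict(int)
    tagged = []                                  # (k-mer, occurrence, position)
    for pos, km in enumerate(kmers(s, k)):
        tagged.append((km, seen[km], pos))
        seen[km] += 1
    sketch = []
    for seed in range(m):
        chosen = sorted(tagged,
                        key=lambda t: seeded_hash(seed, f"{t[0]}#{t[1]}"))[:l]
        chosen.sort(key=lambda t: t[2])          # restore sequence order
        sketch.append(tuple(km for km, _, _ in chosen))
    return sketch

x = "AGTTGAGCGGAAGGTG"
print(minhash_sketch(x, k=2, m=6))
print(omh_sketch(x, k=2, m=6, l=2))
```

Because the hash is keyed on the occurrence number, repeated k-mers get separate entries, and because the selected k-mers are reported in sequence order, two sequences with identical k-mer content can still receive different OMH sketches. The concrete sketch values depend on the hash used, so they will not match the values on the slide.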
SLIDE 31
OMH is an LSH for edit distance
Theorem: There exist $(d_1, d_2, p_1, p_2)$ such that OMH is sensitive for the edit distance.
SLIDE 32 Conclusion
- OMH:
  - improvement on minHash
  - easy to compute
  - locality sensitive for edit distance
- LSH for other alignment scores?
- Smallest “gap” achievable?
SLIDE 35
Thank you
Grants: GBMF4554, CCF-1256087, CCF-1319998, R01HG007104, R01GM122935
Natalie Sauerwald, Cong Ma, Hongyu Zheng, Laura Tung, Yutong Qiu, Yihang Shen, Minh Hoang, Mohsen Ferdosi, Shawn Baker, Yinjie Gao