Piazza Recitation Session: Review of Linear Algebra


  1. Piazza Recitation session: Review of linear algebra
     - Location: Thursday, April 11, from 3:30-5:20 pm in SIG 134 (here)
     Deadlines next Thu, 11:59 PM:
     - HW0, HW1
     How to find teammates for the project?
     - Piazza Team Search
     - Make sure you have a good dataset accessible
     4/10/19 Tim Althoff, UW CS547: Machine Learning for Big Data, http://www.cs.washington.edu/cse547

  2. CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu

  3. Task: Given a large number (N in the millions or billions) of documents, find "near duplicates"
     - Problem: too many documents to compare all pairs
     - Solution: hash documents so that similar documents hash into the same bucket
       - Documents in the same bucket are then candidate pairs whose similarity is then evaluated

  4. The pipeline: Document → Shingling → Min-Hashing → Locality-Sensitive Hashing → Candidate pairs
     - Shingling produces the set of strings of length k that appear in the document
     - Min-Hashing produces signatures: short integer vectors that represent the sets, and reflect their similarity
     - Locality-Sensitive Hashing produces candidate pairs: those pairs of signatures that we need to test for similarity

  5. A k-shingle (or k-gram) is a sequence of k tokens that appears in the document
     - Example: k=2; D1 = abcab. Set of 2-shingles: C1 = S(D1) = {ab, bc, ca}
     - Represent a doc by the set of hash values of its k-shingles
     - A natural similarity measure is then the Jaccard similarity: sim(D1, D2) = |C1 ∩ C2| / |C1 ∪ C2|
       - Similarity of two documents is the Jaccard similarity of their shingle sets
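The shingling step and the Jaccard similarity above can be sketched in a few lines of Python (character-level shingles, matching the abcab example; function names are illustrative):

```python
def shingles(doc: str, k: int = 2) -> set:
    """Return the set of k-shingles (k-grams) of a document string."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}

def jaccard(c1: set, c2: set) -> float:
    """Jaccard similarity |C1 ∩ C2| / |C1 ∪ C2|."""
    return len(c1 & c2) / len(c1 | c2)

c1 = shingles("abcab")   # {'ab', 'bc', 'ca'}
c2 = shingles("abcd")    # {'ab', 'bc', 'cd'}
print(jaccard(c1, c2))   # 2 shared shingles out of 4 total -> 0.5
```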

  6. Min-Hashing: convert large sets into short signatures, while preserving similarity:
     Pr[h(C1) = h(C2)] = sim(D1, D2)
     [Figure: a random permutation p of the rows of the input matrix (shingles x documents) yields one row of the signature matrix M. In the example, the column/column similarities (pairs 1-3: 0.75, 2-4: 0.75, 1-2: 0, 3-4: 0) and the signature/signature similarities (0.67, 1.00, 0, 0) approximately match.]
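A minimal sketch of Min-Hashing with explicit random row permutations (small and illustrative; real implementations use random hash functions instead of materializing permutations):

```python
import random

def minhash_signatures(sets, n_rows, n_perm, seed=0):
    """One signature entry per permutation: the permuted index of the
    first row that the set contains."""
    rng = random.Random(seed)
    sigs = [[0] * n_perm for _ in sets]
    for p in range(n_perm):
        perm = list(range(n_rows))
        rng.shuffle(perm)
        for j, s in enumerate(sets):
            sigs[j][p] = min(perm[r] for r in s)
    return sigs

# Two sets over rows 0..4 with Jaccard similarity 3/4
sigs = minhash_signatures([{0, 1, 3}, {0, 1, 3, 4}], n_rows=5, n_perm=1000)
agree = sum(a == b for a, b in zip(*sigs)) / 1000
print(agree)  # close to 0.75, the Jaccard similarity of the two sets
```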

  7. Hash columns of the signature matrix M: similar columns likely hash to the same bucket
     - Divide matrix M into b bands of r rows each (so the signature length is b·r)
     - Candidate column pairs are those that hash to the same bucket for ≥ 1 band
     [Figure: each band is hashed into its own set of buckets; the probability of sharing ≥ 1 bucket jumps at a similarity threshold s.]
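The banding step can be sketched as follows (signature columns as Python tuples; bucket "hashing" uses a plain dict keyed by the band's values, an illustrative simplification):

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(signatures, b, r):
    """signatures: list of length-(b*r) signature columns, one per document.
    Returns pairs of doc ids that agree on all r rows of at least one band."""
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for doc_id, sig in enumerate(signatures):
            key = tuple(sig[band * r:(band + 1) * r])
            buckets[key].append(doc_id)
        for docs in buckets.values():
            candidates.update(combinations(docs, 2))
    return candidates

sigs = [(1, 2, 3, 4), (1, 2, 9, 9), (5, 6, 7, 8)]
print(candidate_pairs(sigs, b=2, r=2))  # {(0, 1)}: docs 0 and 1 share band 0
```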

  8. Generalizing the pipeline: Points → Hash func. → Signatures (short integer signatures that reflect point similarity) → Locality-Sensitive Hashing → Candidate pairs (those pairs of signatures that we need to test for similarity)
     - Design a locality-sensitive hash function (for a given distance metric)
     - Apply the "Bands" technique

  9. The S-curve is where the "magic" happens
     - One hash-code gives a straight line: Pr[hp(C1) = hp(C2)] = sim(D1, D2)
     - What we want is a step function: probability of sharing ≥ 1 bucket is 1 if similarity t > s (the threshold), and there is no chance if t < s
     - How to get a step function? By choosing r and b!

  10. Remember: b bands, r rows/band. Let sim(C1, C2) = s. What's the prob. that at least 1 band is equal?
      - Pick some band (r rows)
        - Prob. that elements in a single row of columns C1 and C2 are equal = s
        - Prob. that all rows in a band are equal = s^r
        - Prob. that some row in a band is not equal = 1 - s^r
      - Prob. that no band is equal = (1 - s^r)^b
      - Prob. that at least 1 band is equal = 1 - (1 - s^r)^b
      P(C1, C2 is a candidate pair) = 1 - (1 - s^r)^b
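The derivation above is two lines of code; for example, with r=5 and b=10 the curve stays near 0 at low similarity and climbs close to 1 at high similarity:

```python
def p_candidate(s: float, r: int, b: int) -> float:
    """Probability that two columns with similarity s become a candidate
    pair: at least one of the b bands has all r rows equal."""
    return 1 - (1 - s ** r) ** b

for s in (0.2, 0.4, 0.6, 0.8):
    print(f"s={s}: {p_candidate(s, r=5, b=10):.3f}")
```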

  11. Picking r and b to get the best S-curve
      - 50 hash-functions (r=5, b=10)
      [Figure: prob. of sharing a bucket, plotted against similarity s from 0 to 1, forming an S-curve.]

  12. [Figure: four families of S-curves of Prob(candidate pair) vs. similarity t: r = 1, b = 1..10; r = 5, b = 1..50; r = 10, b = 1..50; r = 1..10, b = 1.]
      Given a fixed threshold s, we want to choose r and b such that P(candidate pair) has a "step" right around s.
      prob = 1 - (1 - t^r)^b
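A common rule of thumb (an approximation, not from the slide itself): the similarity at which the S-curve rises steeply is roughly t ≈ (1/b)^(1/r), so r and b can be picked to place the step near the desired threshold:

```python
def approx_threshold(r: int, b: int) -> float:
    """Approximate similarity where 1 - (1 - t^r)^b rises steeply."""
    return (1 / b) ** (1 / r)

print(approx_threshold(r=5, b=10))  # roughly 0.63
```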

  13. Recap: Min-Hashing produces signatures (short vectors that represent the sets, and reflect their similarity); Locality-Sensitive Hashing produces candidate pairs (those pairs of signatures that we need to test for similarity).

  14. We have used LSH to find similar documents
      - More generally, we found similar columns in large sparse matrices with high Jaccard similarity
      Can we use LSH for other distance measures?
      - e.g., Euclidean distance, cosine distance
      - Let's generalize what we've learned!

  15. d() is a distance measure if it is a function from pairs of points x, y to real numbers such that:
      - d(x, y) ≥ 0
      - d(x, y) = 0 iff x = y
      - d(x, y) = d(y, x)
      - d(x, y) ≤ d(x, z) + d(z, y) (triangle inequality)
      Examples:
      - Jaccard distance for sets = 1 - Jaccard similarity
      - Cosine distance for vectors = angle between the vectors
      - Euclidean distances:
        - L2 norm: d(x, y) = square root of the sum of the squares of the differences between x and y in each dimension
          - The most common notion of "distance"
        - L1 norm: sum of absolute values of the differences in each dimension
          - Manhattan distance = distance if you travel along coordinates only
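The distance measures listed above, sketched in plain Python (cosine distance here returns the angle in radians, matching the slide's definition):

```python
import math

def jaccard_dist(a: set, b: set) -> float:
    """1 - Jaccard similarity."""
    return 1 - len(a & b) / len(a | b)

def cosine_dist(x, y) -> float:
    """Angle between two vectors, in radians."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm = math.sqrt(sum(xi * xi for xi in x)) * math.sqrt(sum(yi * yi for yi in y))
    return math.acos(dot / norm)

def l2_dist(x, y) -> float:
    """Euclidean (L2) distance."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def l1_dist(x, y) -> float:
    """Manhattan (L1) distance."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

print(l2_dist((0, 0), (3, 4)))      # 5.0
print(l1_dist((0, 0), (3, 4)))      # 7
print(cosine_dist((1, 0), (0, 1)))  # pi/2, about 1.571
```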

  16. For Min-Hashing signatures, we got a Min-Hash function for each permutation of rows
      - A "hash function" is any function that allows us to say whether two elements are "equal"
        - Shorthand: h(x) = h(y) means "h says x and y are equal"
      - A family of hash functions is any set of hash functions from which we can pick one at random efficiently
        - Example: the set of Min-Hash functions generated from permutations of rows

  17. Suppose we have a space S of points with a distance measure d(x, y) (critical assumption).
      A family H of hash functions is said to be (d1, d2, p1, p2)-sensitive if for any x and y in S:
      1. If d(x, y) < d1, then the probability over all h ∈ H that h(x) = h(y) is at least p1
      2. If d(x, y) > d2, then the probability over all h ∈ H that h(x) = h(y) is at most p2
      With an LS family we can do LSH!

  18. [Figure: Pr[h(x) = h(y)] plotted against distance d(x, y). Small distance (below threshold d1) means high probability, at least p1, of hashing to the same value; large distance (above d2) means low probability, at most p2. Notice it's a distance, not a similarity, hence the S-curve is flipped!]

  19. Let:
      - S = space of all sets,
      - d = Jaccard distance,
      - H = family of Min-Hash functions for all permutations of rows
      Then for any hash function h ∈ H: Pr[h(x) = h(y)] = 1 - d(x, y)
      - This simply restates the theorem about Min-Hashing in terms of distances rather than similarities

  20. Claim: the Min-Hash family H is a (1/3, 2/3, 2/3, 1/3)-sensitive family for S and d.
      - If distance < 1/3 (so similarity > 2/3), then the probability that Min-Hash values agree is > 2/3
      - For Jaccard similarity, Min-Hashing gives a (d1, d2, (1-d1), (1-d2))-sensitive family for any d1 < d2
