

SLIDE 1

Piazza Recitation session: Review of linear algebra
- Location: Thursday, April 11, from 3:30-5:20 pm in SIG 134 (here)

Deadlines next Thu, 11:59 PM:
- HW0, HW1

How to find teammates for the project?
- Piazza Team Search
- Make sure you have a good dataset accessible

SLIDE 2

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

http://cs246.stanford.edu

SLIDE 3

- Task: Given a large number (N in the millions or billions) of documents, find "near duplicates"
- Problem: Too many documents to compare all pairs
- Solution: Hash documents so that similar documents hash into the same bucket
  - Documents in the same bucket are then candidate pairs, whose similarity is then evaluated

SLIDE 4

[Pipeline diagram]
Document → Shingling → the set of strings of length k that appear in the document → Min-Hashing → Signatures: short integer vectors that represent the sets and reflect their similarity → Locality-Sensitive Hashing → Candidate pairs: those pairs of signatures that we need to test for similarity

SLIDE 5

- A k-shingle (or k-gram) is a sequence of k tokens that appears in the document
  - Example: k = 2; D1 = abcab. Set of 2-shingles: C1 = S(D1) = {ab, bc, ca}
- Represent a doc by the set of hash values of its k-shingles
- A natural similarity measure is then the Jaccard similarity: sim(D1, D2) = |C1 ∩ C2| / |C1 ∪ C2|
  - The similarity of two documents is the Jaccard similarity of their shingle sets
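A minimal sketch of shingling and Jaccard similarity (character shingles for simplicity; the function names are my own, not from the course):

```python
def shingles(doc: str, k: int = 2) -> set:
    """Return the set of k-shingles (character k-grams) of doc."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}

def jaccard(c1: set, c2: set) -> float:
    """Jaccard similarity |C1 ∩ C2| / |C1 ∪ C2|."""
    return len(c1 & c2) / len(c1 | c2)

# The slide's example: k = 2, D1 = "abcab"
print(shingles("abcab"))                             # {'ab', 'bc', 'ca'}
print(jaccard(shingles("abcab"), shingles("abcd")))  # 2/4 = 0.5
```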

SLIDE 6

- Min-Hashing: Convert large sets into short signatures, while preserving similarity: Pr[h(C1) = h(C2)] = sim(D1, D2)

[Figure: input matrix (shingles x documents), permutations π, and the resulting signature matrix M]
Similarities of columns and signatures (approximately) match:

Pair     1-3   2-4   1-2  3-4
Col/Col  0.75  0.75  0    0
Sig/Sig  0.67  1.00  0    0
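A minimal Min-Hashing sketch using explicit random permutations of the rows (names and structure are my own; production implementations typically use random hash functions instead of materialized permutations):

```python
import random

def minhash_signatures(sets, n_rows, n_hashes=100, seed=0):
    """Signature matrix (n_hashes x n_docs), one Min-Hash value per
    (random permutation, document). sets: one set of row indices
    (shingle ids in 0..n_rows-1) per document."""
    rng = random.Random(seed)
    sig = []
    for _ in range(n_hashes):
        perm = list(range(n_rows))
        rng.shuffle(perm)                        # a random permutation of rows
        # Min-Hash of a set = smallest permuted index among its members
        sig.append([min(perm[r] for r in s) for s in sets])
    return sig

def signature_similarity(sig, i, j):
    """Fraction of agreeing rows estimates the Jaccard similarity."""
    return sum(row[i] == row[j] for row in sig) / len(sig)
```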

SLIDE 7

- Hash columns of the signature matrix M: similar columns likely hash to the same bucket
  - Divide matrix M into b bands of r rows each (so the number of rows of M is b·r)
  - Candidate column pairs are those that hash to the same bucket for ≥ 1 band

[Figure: matrix M divided into b bands of r rows, each band hashed to buckets. Plot: probability of sharing ≥ 1 bucket vs. similarity, with threshold s]
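A minimal sketch of the bands technique over a signature matrix like the one above (bucketing is simplified: each band's column slice is used directly as a dictionary key instead of being hashed into a fixed number of buckets):

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidates(sig, b, r):
    """sig: signature matrix with b*r rows and one column per document.
    Returns the set of candidate column pairs."""
    assert len(sig) == b * r
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for col in range(len(sig[0])):
            # The band's r values for this column act as the bucket key
            key = tuple(sig[band * r + i][col] for i in range(r))
            buckets[key].append(col)
        for cols in buckets.values():            # columns sharing a bucket
            candidates.update(combinations(sorted(cols), 2))
    return candidates
```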

SLIDE 8

[Pipeline diagram]
Points → Hash func. → Signatures: short integer signatures that reflect point similarity → Locality-Sensitive Hashing → Candidate pairs: those pairs of signatures that we need to test for similarity

Step 1: Design a locality-sensitive hash function (for a given distance metric)
Step 2: Apply the "bands" technique

SLIDE 9

- The S-curve is where the "magic" happens

[Plot: probability of sharing ≥ 1 bucket vs. similarity t of two sets]
- Remember: probability of equal hash values = similarity. This is what one hash code gives you: Pr[h_π(C1) = h_π(C2)] = sim(D1, D2), a straight line
- What we want is a step function at threshold s: no chance if t < s, probability 1 if t > s
- How to get a step function? By choosing r and b!

SLIDE 10

- Remember: b bands, r rows/band
- Let sim(C1, C2) = s. What's the probability that at least 1 band is equal?
- Pick some band (r rows):
  - Prob. that the elements in a single row of columns C1 and C2 are equal = s
  - Prob. that all r rows in a band are equal = s^r
  - Prob. that some row in a band is not equal = 1 - s^r
- Prob. that no band is fully equal = (1 - s^r)^b
- Prob. that at least 1 band is equal = 1 - (1 - s^r)^b

P(C1, C2 is a candidate pair) = 1 - (1 - s^r)^b
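A one-line encoding of this S-curve, handy for checking the r and b choices on the next slides (a small sketch of my own; the printed values follow directly from the formula):

```python
def p_candidate(s: float, r: int, b: int) -> float:
    """Probability that two columns with similarity s become candidates."""
    return 1 - (1 - s**r) ** b

print(p_candidate(0.5, 1, 1))   # 0.5: one hash function is just the similarity
print(p_candidate(0.5, 5, 10))  # ~0.27: 50 hash functions, step near s = 0.5
```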

SLIDE 11

- Picking r and b to get the best S-curve
  - 50 hash functions (r = 5, b = 10)

[Plot: probability of sharing a bucket vs. similarity s, for r = 5, b = 10]
SLIDE 12

[Plots: Prob(candidate pair) vs. similarity t for prob = 1 - (1 - t^r)^b; four panels: r = 1..10 with b = 1; r = 1 with b = 1..10; r = 5 with b = 1..50; r = 10 with b = 1..50]

Given a fixed threshold s, we want to choose r and b such that P(candidate pair) has a "step" right around s.

SLIDE 13

[Pipeline diagram]
Min-Hashing → Signatures: short vectors that represent the sets and reflect their similarity → Locality-Sensitive Hashing → Candidate pairs: those pairs of signatures that we need to test for similarity

SLIDE 14

- We have used LSH to find similar documents
  - More generally, we found similar columns in large sparse matrices with high Jaccard similarity
- Can we use LSH for other distance measures?
  - e.g., Euclidean distance, cosine distance
  - Let's generalize what we've learned!

SLIDE 15

- d() is a distance measure if it is a function from pairs of points x, y to real numbers such that:
  - d(x, y) ≥ 0
  - d(x, y) = 0 iff x = y
  - d(x, y) = d(y, x)
  - d(x, y) ≤ d(x, z) + d(z, y) (triangle inequality)
- Jaccard distance for sets = 1 - Jaccard similarity
- Cosine distance for vectors = angle between the vectors
- Euclidean distances:
  - L2 norm: d(x, y) = square root of the sum of the squares of the differences between x and y in each dimension; the most common notion of "distance"
  - L1 norm: sum of the absolute values of the differences in each dimension; Manhattan distance = distance if you travel along coordinates only
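A small sketch of these distance measures in code (function names are my own):

```python
import math

def l2(x, y):
    """Euclidean (L2) distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def l1(x, y):
    """Manhattan (L1) distance."""
    return sum(abs(a - b) for a, b in zip(x, y))

def jaccard_distance(c1: set, c2: set) -> float:
    """1 - Jaccard similarity."""
    return 1 - len(c1 & c2) / len(c1 | c2)

def cosine_distance(x, y) -> float:
    """Angle between x and y, scaled by pi into [0, 1]."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return math.acos(dot / norm) / math.pi
```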

SLIDE 16

- For Min-Hashing signatures, we got a Min-Hash function for each permutation of rows
- A "hash function" is any function that allows us to say whether two elements are "equal"
  - Shorthand: h(x) = h(y) means "h says x and y are equal"
- A family of hash functions is any set of hash functions from which we can pick one at random efficiently
  - Example: the set of Min-Hash functions generated from permutations of rows

SLIDE 17

- Suppose we have a space S of points with a distance measure d(x, y)
- A family H of hash functions is said to be (d1, d2, p1, p2)-sensitive if for any x and y in S:
  1. If d(x, y) < d1, then the probability over all h ∈ H that h(x) = h(y) is at least p1
  2. If d(x, y) > d2, then the probability over all h ∈ H that h(x) = h(y) is at most p2

With an LS family we can do LSH!
(Critical assumption: these probability guarantees must hold over the whole family.)

SLIDE 18

[Plot: Pr[h(x) = h(y)] (probability of hashing to the same value) vs. distance d(x, y), with p1 marked at d1, p2 marked at d2, and a distance threshold t]
- Small distance: high probability. Large distance: low probability.
- Notice it's a distance, not a similarity, hence the S-curve is flipped!

SLIDE 19

- Let:
  - S = space of all sets,
  - d = Jaccard distance,
  - H = family of Min-Hash functions for all permutations of rows
- Then for any hash function h ∈ H: Pr[h(x) = h(y)] = 1 - d(x, y)
  - This simply restates the theorem about Min-Hashing in terms of distances rather than similarities

SLIDE 20

- Claim: the Min-Hash family H is a (1/3, 2/3, 2/3, 1/3)-sensitive family for S and d
  - If distance < 1/3 (so similarity > 2/3), then the probability that the Min-Hash values agree is > 2/3
- More generally, for Jaccard distance, Min-Hashing gives a (d1, d2, (1-d1), (1-d2))-sensitive family for any d1 < d2

SLIDE 21

- Can we reproduce the "S-curve" effect we saw before for any LS family?
- The "bands" technique we learned for signature matrices carries over to this more general setting
- We can do LSH with any (d1, d2, p1, p2)-sensitive family!
- Two constructions:
  - AND construction, like "rows in a band"
  - OR construction, like "many bands"

[Plot: probability of sharing a bucket vs. similarity t]

SLIDE 22

SLIDE 23

- Given family H, construct family H' consisting of r functions from H
- For h = [h1, ..., hr] in H', we say h(x) = h(y) if and only if hi(x) = hi(y) for all i, 1 ≤ i ≤ r
  - Note this corresponds to creating a band of size r
- Theorem: If H is (d1, d2, p1, p2)-sensitive, then H' is (d1, d2, (p1)^r, (p2)^r)-sensitive
  - Proof: use the fact that the hi's are independent
- Lowers the probability for large distances (good), but also lowers the probability for small distances (bad)

SLIDE 24

- Independence of hash functions (HFs) really means that the probability of two HFs saying "yes" is the product of each saying "yes"
  - But two particular hash functions could be highly correlated
  - For example, in Min-Hash, if their permutations agree in the first one million entries
  - However, the probabilities in the definition of an LS family are over all possible members of H and H' (i.e., the average case, not the worst case)

SLIDE 25

- Given family H, construct family H' consisting of b functions from H
- For h = [h1, ..., hb] in H', we say h(x) = h(y) if and only if hi(x) = hi(y) for at least one i
- Theorem: If H is (d1, d2, p1, p2)-sensitive, then H' is (d1, d2, 1-(1-p1)^b, 1-(1-p2)^b)-sensitive
  - Proof: use the fact that the hi's are independent
- Raises the probability for small distances (good), but also raises the probability for large distances (bad)
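A minimal sketch of how these constructions transform a family's probabilities; applying the AND transform and then the OR transform reproduces the r = 4, b = 4 cascade worked out on the later slides (function names are my own):

```python
def and_construct(p1, p2, r):
    """(d1, d2, p1, p2) -> (d1, d2, p1^r, p2^r)."""
    return p1**r, p2**r

def or_construct(p1, p2, b):
    """(d1, d2, p1, p2) -> (d1, d2, 1-(1-p1)^b, 1-(1-p2)^b)."""
    return 1 - (1 - p1)**b, 1 - (1 - p2)**b

# r = 4 AND followed by b = 4 OR on a (.2, .8, .8, .2)-sensitive family:
p1, p2 = and_construct(0.8, 0.2, 4)
p1, p2 = or_construct(p1, p2, 4)
print(round(p1, 4), round(p2, 4))   # 0.8785 0.0064, as on Slide 29
```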

SLIDE 26

- AND makes all probabilities shrink, but by choosing r correctly, we can make the lower probability approach 0 while the higher one does not
- OR makes all probabilities grow, but by choosing b correctly, we can make the higher probability approach 1 while the lower one does not

[Plots: probability of sharing a bucket vs. similarity of a pair of items. Left: AND, r = 1..10, b = 1. Right: OR, r = 1, b = 1..10]

SLIDE 27

- By choosing b and r correctly, we can make the lower probability approach 0 while the higher approaches 1
- As for the signature matrix, we can use the AND construction followed by the OR construction
  - Or vice versa
  - Or any alternating sequence of ANDs and ORs

SLIDE 28

- r-way AND followed by b-way OR construction
  - Exactly what we did with Min-Hashing
  - AND: if a band matches in all r values, hash to the same bucket
  - OR: columns that have ≥ 1 common bucket → candidates
- Take points x and y such that Pr[h(x) = h(y)] = s
  - H will make (x, y) a candidate pair with probability s
- The construction makes (x, y) a candidate pair with probability 1-(1-s^r)^b: the S-curve!
  - Example: take H and construct H' by the AND construction with r = 4. Then, from H', construct H'' by the OR construction with b = 4

SLIDE 29

r = 4, b = 4 transforms a (.2, .8, .8, .2)-sensitive family into a (.2, .8, .8785, .0064)-sensitive family.

s    p = 1-(1-s^4)^4
.2   .0064
.3   .0320
.4   .0985
.5   .2275
.6   .4260
.7   .6666
.8   .8785
.9   .9860

[Plot: Prob(candidate pair) vs. similarity s]

SLIDE 30

SLIDE 31

- Picking r and b to get desired performance
  - 50 hash functions (r = 5, b = 10)

[Plot: Prob(candidate pair) vs. similarity s, with threshold s marked]
- Blue area X: false negative rate. These are pairs with sim > s, but an X fraction of them won't share a band and so will never become candidates. This means we will never consider these pairs for (slow/exact) similarity calculation!
- Green area Y: false positive rate. These are pairs with sim < s that we will consider as candidates. This is not too bad: we will consider them for (slow/exact) similarity computation and then discard them.

SLIDE 32

- Picking r and b to get desired performance
  - 50 hash functions (r · b = 50)

[Plot: Prob(candidate pair) vs. similarity s for r = 2, b = 25; r = 5, b = 10; r = 10, b = 5, with threshold s marked]

SLIDE 33

- Apply a b-way OR construction followed by an r-way AND construction
- Transforms similarity s (probability p) into (1-(1-s)^b)^r
  - The same S-curve, mirrored horizontally and vertically
- Example: take H and construct H' by the OR construction with b = 4. Then, from H', construct H'' by the AND construction with r = 4

SLIDE 34

The example transforms a (.2, .8, .8, .2)-sensitive family into a (.2, .8, .9936, .1215)-sensitive family.

s    p = (1-(1-s)^4)^4
.1   .0140
.2   .1215
.3   .3334
.4   .5740
.5   .7725
.6   .9015
.7   .9680
.8   .9936

[Plot: Prob(candidate pair) vs. similarity s]

SLIDE 35

- Example: apply the (4,4) OR-AND construction followed by the (4,4) AND-OR construction
- Transforms a (.2, .8, .8, .2)-sensitive family into a (.2, .8, .9999996, .0008715)-sensitive family
  - Note this family uses 256 (= 4·4·4·4) of the original hash functions

SLIDE 36

- Fixed point: for each AND-OR S-curve 1-(1-s^r)^b, there is a threshold t for which 1-(1-t^r)^b = t
- Above t, high probabilities are increased; below t, low probabilities are decreased
- You improve the sensitivity as long as the low probability is less than t and the high probability is greater than t
  - Iterate as you like
- A similar observation holds for the OR-AND type of S-curve: (1-(1-s)^b)^r
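A small sketch that locates this fixed point t by bisection, using the fact that f(s) = 1-(1-s^r)^b crosses the diagonal from below to above exactly once on (0, 1); the numerical approach is my own illustration, not from the slides:

```python
def s_curve(s, r, b):
    return 1 - (1 - s**r) ** b

def fixed_point(r, b, lo=1e-6, hi=1 - 1e-6, iters=60):
    """Bisection for the t in (0, 1) with s_curve(t) = t.
    Below t the curve pulls probabilities down; above t it pushes them up."""
    g = lambda s: s_curve(s, r, b) - s
    for _ in range(iters):
        mid = (lo + hi) / 2
        if g(mid) < 0:
            lo = mid        # curve still below the diagonal
        else:
            hi = mid
    return (lo + hi) / 2

print(fixed_point(4, 4))    # threshold t for the r = 4, b = 4 cascade
```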

SLIDE 37

[Plot: Prob(candidate pair) vs. s, showing the threshold t where the curve crosses the diagonal: below t the probability is lowered, above t it is raised]

SLIDE 38

- Pick any two distances d1 < d2
- Start with a (d1, d2, (1-d1), (1-d2))-sensitive family
- Apply constructions to amplify it into a (d1, d2, p1, p2)-sensitive family, where p1 is almost 1 and p2 is almost 0
- The closer to 0 and 1 we get, the more hash functions must be used!

SLIDE 39

SLIDE 40

- LSH methods for other distance metrics:
  - Cosine distance: random hyperplanes
  - Euclidean distance: project on lines

[Pipeline diagram]
Points → Hash func. (depends on the distance function used) → Signatures: short integer signatures that reflect their similarity → Locality-Sensitive Hashing → Candidate pairs: those pairs of signatures that we need to test for similarity

Step 1: Design a (d1, d2, p1, p2)-sensitive family of hash functions (for that particular distance metric)
Step 2: Amplify the family using AND and OR constructions

SLIDE 41

[Summary diagram: two parallel pipelines]
Documents → MinHash (0/1 set matrix → integer signatures) → "Bands" technique → Candidate pairs
Data points → Random Hyperplanes (±1 sketches) → "Bands" technique → Candidate pairs

SLIDE 42

- Cosine distance = angle between the vectors from the origin to the points in question: d(A, B) = θ = arccos(A·B / (‖A‖·‖B‖))
  - Has range [0, π] (equivalently [0, 180°])
  - Can divide θ by π to get a distance in the range [0, 1]
- Cosine similarity = 1 - d(A, B)
  - But often defined as cosine sim: cos(θ) = A·B / (‖A‖·‖B‖)
    - Has range -1..1 for general vectors
    - Range 0..1 for non-negative vectors (angles up to 90°)

[Figure: vectors A and B, with the projection A·B / ‖B‖ marked]

SLIDE 43

- For cosine distance, there is a technique called Random Hyperplanes
  - Technique similar to Min-Hashing
- The Random Hyperplanes method is a (d1, d2, (1-d1/π), (1-d2/π))-sensitive family for any d1 and d2
- Reminder: (d1, d2, p1, p2)-sensitive means:
  1. If d(x, y) < d1, then the prob. that h(x) = h(y) is at least p1
  2. If d(x, y) > d2, then the prob. that h(x) = h(y) is at most p2

SLIDE 44

- Each vector v determines a hash function h_v with two buckets:
  h_v(x) = +1 if v·x ≥ 0; h_v(x) = -1 if v·x < 0
- LS family H = set of all functions derived from any vector
- Claim: for points x and y, Pr[h(x) = h(y)] = 1 - d(x, y)/π

SLIDE 45

[Figure: vectors x and y at angle θ, shown in the plane of x and y. Hyperplane normal to v': here h(x) ≠ h(y). Hyperplane normal to v: here h(x) = h(y)]
Note: what is important is that the hyperplane is outside the angle, not that the vector is inside.

SLIDE 46

So: Prob[red case, hyperplane inside the angle] = θ/π
So: P[h(x) = h(y)] = 1 - θ/π = 1 - d(x, y)/π

SLIDE 47

- Pick some number of random vectors, and hash your data for each vector
- The result is a signature (sketch) of +1's and -1's for each data point
- Can be used for LSH like we used the Min-Hash signatures for Jaccard distance
- Amplify using AND/OR constructions
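A minimal sketch of random-hyperplane signatures; per Slide 48, random vectors with ±1 components suffice, which is what this uses (names are my own):

```python
import math
import random

def hyperplane_sketch(points, n_hashes=64, seed=0):
    """One ±1 sign per (random vector, point)."""
    rng = random.Random(seed)
    dim = len(points[0])
    # Components of ±1 suffice for the random vectors (see Slide 48)
    vs = [[rng.choice((-1, 1)) for _ in range(dim)] for _ in range(n_hashes)]
    return [[1 if sum(a * b for a, b in zip(v, p)) >= 0 else -1 for v in vs]
            for p in points]

def estimated_angle(sk_x, sk_y):
    """Fraction of disagreeing signs estimates d(x, y) / pi."""
    return math.pi * sum(a != b for a, b in zip(sk_x, sk_y)) / len(sk_x)
```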

SLIDE 48

- Expensive to pick a random vector in M dimensions for large M
  - Would have to generate M random numbers
- A more efficient approach:
  - It suffices to consider only vectors v consisting of +1 and -1 components
  - Why? Assuming the data is random, vectors of ±1 components cover the space evenly (and do not bias in any way)

SLIDE 49

- Idea: hash functions correspond to lines
- Partition each line into buckets of size a
- Hash each point to the bucket containing its projection onto the line
  - An element of the "signature" is a bucket id for that given projection line
- Nearby points are always close; distant points are rarely in the same bucket
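A minimal sketch of projecting points onto random lines and bucketing with width a (the Gaussian random directions and the names are my own choices):

```python
import math
import random

def line_projection_signature(points, a=1.0, n_lines=32, seed=0):
    """Each signature element is the bucket id of the point's projection
    onto one random line (a random unit direction through the origin)."""
    rng = random.Random(seed)
    dim = len(points[0])
    lines = []
    for _ in range(n_lines):
        v = [rng.gauss(0, 1) for _ in range(dim)]   # random direction
        norm = math.sqrt(sum(c * c for c in v))
        lines.append([c / norm for c in v])
    # Bucket id = floor(projection / a)
    return [[math.floor(sum(c * x for c, x in zip(line, p)) / a)
             for line in lines] for p in points]
```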

SLIDE 50

- "Lucky" case:
  - Points that are close hash to the same bucket
  - Distant points end up in different buckets
- Two "unlucky" cases:
  - Top: unlucky quantization
  - Bottom: unlucky projection

[Figure: a line partitioned into buckets of size a, with points projected onto it]

SLIDE 51

[Figure: points projected onto a bucketed line]

SLIDE 52

[Figure: a randomly chosen line with bucket width a, and two points at distance d]
If d << a, then the chance the points are in the same bucket is at least 1 - d/a.

SLIDE 53

[Figure: a randomly chosen line with bucket width a, and two points at distance d whose projection spans d·cos θ]
If d >> a, θ must be close to 90° for there to be any chance the points go to the same bucket.

SLIDE 54

- If points are at distance d < a/2, the prob. they are in the same bucket is ≥ 1 - d/a ≥ 1/2
- If points are at distance d > 2a apart, then they can be in the same bucket only if d·cos θ ≤ a
  - cos θ ≤ 1/2
  - 60° < θ < 90°, i.e., at most 1/3 probability
- Yields a (a/2, 2a, 1/2, 1/3)-sensitive family of hash functions for any a
- Amplify using AND-OR cascades

SLIDE 55

[Summary diagram: two parallel pipelines]
Documents → MinHash (0/1 set matrix → integer signatures) → "Bands" technique → Candidate pairs
Data points → Random Hyperplanes (±1 sketches) → "Bands" technique → Candidate pairs

Step 1: Design a (d1, d2, p1, p2)-sensitive family of hash functions (for that particular distance metric)
Step 2: Amplify the family using AND and OR constructions

SLIDE 56

- The property Pr[h(C1) = h(C2)] = sim(C1, C2) of the hash function h is the essential part of LSH, without which we can't do anything
- LS hash functions transform data into signatures so that the bands technique (AND and OR constructions) can then be applied