Recitation sessions: Review of proof techniques and probability

Recitation sessions:
- Review of proof techniques and probability
  - Friday January 17, 3:00-4:10 PM in Skilling Auditorium
- Review of linear algebra
  - Friday January 17, 4:20-5:20 PM in Skilling Auditorium

Deadlines tonight, 11:59 PM:
- Colab 0 (Spark Tutorial), Colab 1

1/16/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu

Note to other teachers and users of these slides: We would be delighted if you found our material useful for giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: http://www.mmds.org

CS246: Mining Massive Datasets
Jure Leskovec, Stanford University
http://cs246.stanford.edu

- Task: Given a large number (N in the millions or billions) of documents, find “near duplicates”
- Problem:
  - Too many documents to compare all pairs
- Solution: Hash documents so that similar documents hash into the same bucket
  - Documents in the same bucket are then candidate pairs whose similarity is then evaluated

[Figure: the three-step pipeline]
Document
-> Shingling: the set of strings of length k that appear in the document
-> Min-Hashing: signatures, short integer vectors that represent the sets and reflect their similarity
-> Locality-sensitive Hashing: candidate pairs, those pairs of signatures that we need to test for similarity

- A k-shingle (or k-gram) is a sequence of k tokens that appears in the document
  - Example: k = 2; $D_1$ = abcab. Set of 2-shingles: $C_1 = S(D_1) = \{ab, bc, ca\}$
- Represent a doc by a set of hash values of its k-shingles
- A natural similarity measure is then the Jaccard similarity: $sim(D_1, D_2) = |C_1 \cap C_2| / |C_1 \cup C_2|$
  - Similarity of two documents is the Jaccard similarity of their shingles
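The shingling example above can be reproduced with a short sketch. It assumes tokens are single characters (as in the abcab example) and skips the hashing of shingles; the function names and the second document "abcd" are made up for illustration.

```python
def shingles(doc: str, k: int) -> set[str]:
    """Return the set of character k-shingles that appear in doc."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}


def jaccard_similarity(a: set, b: set) -> float:
    """|a ∩ b| / |a ∪ b|; two empty sets are treated as identical (a convention)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


c1 = shingles("abcab", 2)  # the slide's example: {"ab", "bc", "ca"}
c2 = shingles("abcd", 2)   # hypothetical second document: {"ab", "bc", "cd"}
print(sorted(c1), sorted(c2), jaccard_similarity(c1, c2))
```

Here the two sets share 2 of 4 distinct shingles, so the similarity is 0.5.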

- Min-Hashing: Convert large sets into short signatures, while preserving similarity: $\Pr[h(C_1) = h(C_2)] = sim(D_1, D_2)$

[Figure: permutations π, the input matrix (shingles x documents), and the resulting signature matrix M]

Similarities of columns and signatures (approx.) match!

Column pair   1-3    2-4    1-2    3-4
Col/Col       0.75   0.75   0      0
Sig/Sig       0.67   1.00   0      0
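A minimal sketch of Min-Hashing with explicit random permutations, to show how signature agreement tracks Jaccard similarity. The row count, number of permutations, seed, and the two example sets are arbitrary choices for illustration, and real systems usually replace explicit permutations with hash functions.

```python
import random


def minhash_signature(rows: set[int], perms: list[list[int]]) -> list[int]:
    """For each permutation, record the smallest permuted position among the set's rows.

    rows must be non-empty.
    """
    return [min(perm[row] for row in rows) for perm in perms]


def signature_similarity(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of signature positions on which the two signatures agree."""
    agree = sum(a == b for a, b in zip(sig_a, sig_b))
    return agree / len(sig_a)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n_rows = 100
perms = []
for _ in range(200):
    perm = list(range(n_rows))
    rng.shuffle(perm)
    perms.append(perm)

c1 = set(range(0, 60))
c2 = set(range(20, 80))  # |c1 ∩ c2| = 40, |c1 ∪ c2| = 80, Jaccard similarity 0.5

estimate = signature_similarity(minhash_signature(c1, perms),
                                minhash_signature(c2, perms))
print(f"true similarity 0.5, signature estimate {estimate:.2f}")
```

With 200 permutations the estimate should land close to 0.5, but it is a random estimate, not an exact value.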

- Hash columns of the signature matrix M: similar columns likely hash to the same bucket
  - Divide matrix M into b bands of r rows (so M has b·r rows)
  - Candidate column pairs are those that hash to the same bucket for ≥ 1 band

[Figure: left, matrix M split into b bands of r rows, each band hashed to buckets; right, prob. of sharing ≥ 1 bucket vs. similarity, with threshold t]
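The banding step can be sketched as follows. This is an illustrative version, not the course's reference code: it uses the tuple of a band's r values directly as the bucket key (so two columns share a bucket only if the band is identical), and the document names and signatures are invented.

```python
from collections import defaultdict
from itertools import combinations


def lsh_candidate_pairs(signatures: dict[str, list[int]], b: int, r: int) -> set[tuple[str, str]]:
    """Return pairs of columns whose signatures are identical in at least one band."""
    for name, sig in signatures.items():
        if len(sig) != b * r:
            raise ValueError(f"signature of {name!r} must have b*r = {b * r} rows")

    candidates: set[tuple[str, str]] = set()
    for band in range(b):
        buckets: dict[tuple[int, ...], list[str]] = defaultdict(list)
        for name, sig in signatures.items():
            key = tuple(sig[band * r:(band + 1) * r])  # this band's r rows
            buckets[key].append(name)
        for names in buckets.values():
            # every pair of columns sharing a bucket in this band is a candidate
            candidates.update(combinations(sorted(names), 2))
    return candidates


# Hypothetical signatures with b = 2 bands of r = 2 rows each.
sigs = {"A": [1, 2, 3, 4], "B": [1, 2, 9, 9], "C": [7, 7, 7, 7]}
print(lsh_candidate_pairs(sigs, b=2, r=2))
```

Columns A and B agree on the first band, so they form the only candidate pair; C shares no band with either.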

[Figure: the general LSH pipeline]
Points
-> Hash func.: design a locality sensitive hash function (for a given distance metric)
   Signatures: short integer signatures that reflect point similarity
-> Locality-sensitive Hashing: apply the “Bands” technique
   Candidate pairs: those pairs of signatures that we need to test for similarity

- The S-curve is where the “magic” happens

[Figure, left: probability of equal hash-values vs. similarity s of two sets, a straight line (probability = similarity). This is what 1 hash-code gives you.]
Remember: $\Pr[h_\pi(C_1) = h_\pi(C_2)] = sim(D_1, D_2)$

[Figure, right: probability of sharing ≥ 1 bucket vs. similarity s of two sets, a step at threshold t: no chance if s < t, probability = 1 if s > t. This is what we want!]

How to get a step-function? By choosing r and b!

- Remember: b bands, r rows/band
- Let $sim(C_1, C_2) = s$. What’s the prob. that at least 1 band is equal?
- Pick some band (r rows)
  - Prob. that elements in a single row of columns $C_1$ and $C_2$ are equal = $s$
  - Prob. that all rows in a band are equal = $s^r$
  - Prob. that some row in a band is not equal = $1 - s^r$
- Prob. that all bands are not equal = $(1 - s^r)^b$
- Prob. that at least 1 band is equal = $1 - (1 - s^r)^b$

$P(C_1, C_2 \text{ is a candidate pair}) = 1 - (1 - s^r)^b$
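The formula above is easy to evaluate directly. The sketch below tabulates it for a few similarities; the function name and the choice r = 5, b = 10 are illustrative.

```python
def candidate_probability(s: float, r: int, b: int) -> float:
    """Probability that two columns with similarity s agree on at least one of b bands of r rows."""
    all_rows_equal = s ** r                       # one band matches completely
    no_band_matches = (1.0 - all_rows_equal) ** b  # every band has a mismatching row
    return 1.0 - no_band_matches


for s in (0.2, 0.4, 0.6, 0.8):
    p = candidate_probability(s, r=5, b=10)
    print(f"s = {s:.1f}  P(candidate pair) = {p:.4f}")
```

Low-similarity pairs are rarely candidates while high-similarity pairs almost always are, which is the S-shape the slides describe.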

- Picking r and b to get the best S-curve
  - 50 hash-functions (r = 5, b = 10)

[Figure: prob. of sharing a bucket (0 to 1) vs. similarity s (0 to 1)]

Given a fixed threshold t, we want to choose r and b such that P(candidate pair) has a “step” right around t.

[Figure: four panels of Prob(candidate pair) vs. similarity: r = 1..10, b = 1; r = 1, b = 1..10; r = 5, b = 1..50; r = 10, b = 1..50]

prob = $1 - (1 - s^r)^b$
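One possible way to pick r and b for a threshold t, given a fixed budget of hash functions, is sketched below. It makes an assumption not stated on the slides: the "step" is located at the similarity where the candidate probability crosses 1/2, which follows from solving $1 - (1 - s^r)^b = 1/2$ for s. The function names are hypothetical.

```python
def half_point(r: int, b: int) -> float:
    """Similarity s at which 1 - (1 - s**r)**b equals exactly 0.5."""
    return (1.0 - 0.5 ** (1.0 / b)) ** (1.0 / r)


def choose_r_b(n_hashes: int, t: float) -> tuple[int, int]:
    """Among all (r, b) with r * b == n_hashes, pick the pair whose half-point is closest to t."""
    best = None
    for r in range(1, n_hashes + 1):
        if n_hashes % r:
            continue  # r must divide the number of hash functions
        b = n_hashes // r
        gap = abs(half_point(r, b) - t)
        if best is None or gap < best[0]:
            best = (gap, r, b)
    return best[1], best[2]


for t in (0.6, 0.8):
    r, b = choose_r_b(50, t)
    print(f"t = {t}: r = {r}, b = {b}, half-point = {half_point(r, b):.3f}")
```

Other criteria are possible, for example weighting false negatives more heavily than false positives; this sketch only matches the location of the step.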

Visualization of the effect of threshold, band size, and # of rows in LSH by Trenton Chang (Thank you!!)
https://www.desmos.com/calculator/lzzvfjiujn

[Figure: the pipeline so far]
-> Min-Hashing: signatures, short vectors that represent the sets and reflect their similarity
-> Locality-sensitive Hashing: candidate pairs, those pairs of signatures that we need to test for similarity

- We have used LSH to find similar documents
  - More generally, we found similar columns in large sparse matrices with high Jaccard similarity
- Can we use LSH for other distance measures?
  - e.g., Euclidean distances, Cosine distance
  - Let’s generalize what we’ve learned!

- $d(\cdot)$ is a distance measure if it is a function from pairs of points x, y to real numbers such that:
  - $d(x, y) \ge 0$
  - $d(x, y) = 0$ iff $x = y$
  - $d(x, y) = d(y, x)$
  - $d(x, y) \le d(x, z) + d(z, y)$ (triangle inequality)
- Jaccard distance for sets = 1 - Jaccard similarity
- Cosine distance for vectors = angle between the vectors
- Euclidean distances:
  - $L_2$ norm: d(x, y) = square root of the sum of the squares of the differences between x and y in each dimension
    - The most common notion of “distance”
  - $L_1$ norm: sum of absolute values of the differences in each dimension
    - Manhattan distance = distance if you travel along axes only
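The distance measures listed above can be written out in a few lines. This is a plain-Python sketch with hypothetical function names; the cosine distance is returned as an angle in radians and assumes non-zero vectors.

```python
import math
from typing import Sequence


def jaccard_distance(a: set, b: set) -> float:
    """1 - Jaccard similarity; two empty sets are at distance 0 (a convention)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def cosine_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Angle between the vectors, in radians (both vectors must be non-zero)."""
    dot = sum(p * q for p, q in zip(x, y))
    norms = math.sqrt(sum(p * p for p in x)) * math.sqrt(sum(q * q for q in y))
    cosine = max(-1.0, min(1.0, dot / norms))  # clamp against rounding error
    return math.acos(cosine)


def l2_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Square root of the sum of squared per-dimension differences."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(x, y)))


def l1_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Sum of absolute per-dimension differences (Manhattan distance)."""
    return sum(abs(p - q) for p, q in zip(x, y))


print(jaccard_distance({1, 2, 3}, {2, 3, 4}))  # shares 2 of 4 elements
print(cosine_distance((1, 0), (0, 1)))         # perpendicular vectors
print(l2_distance((0, 0), (3, 4)), l1_distance((0, 0), (3, 4)))
```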

- For Min-Hashing signatures, we got a Min-Hash function for each permutation of rows
- A “hash function” is any function that allows us to say whether two elements are “equal”
  - Shorthand: h(x) = h(y) means “h says x and y are equal”
- A family of hash functions is any set of hash functions from which we can efficiently pick one at random
  - Example: The set of Min-Hash functions generated from permutations of rows

- Suppose we have a space S of points with a distance measure d(x, y)
  [Callout: Critical assumption]
- A family H of hash functions is said to be $(d_1, d_2, p_1, p_2)$-sensitive if for any x and y in S:
  1. If $d(x, y) < d_1$, then the probability over all $h \in H$ that $h(x) = h(y)$ is at least $p_1$
  2. If $d(x, y) > d_2$, then the probability over all $h \in H$ that $h(x) = h(y)$ is at most $p_2$

With a LS Family we can do LSH!

[Figure: $\Pr[h(x) = h(y)]$ vs. distance d(x, y), with distance threshold t. Small distance (below $d_1$): high probability, at least $p_1$. Large distance (above $d_2$): low probability of hashing to the same value, at most $p_2$.]

Notice it’s distance, not similarity, hence the S-curve is flipped!

- Let:
  - S = space of all sets,
  - d = Jaccard distance,
  - H is family of Min-Hash functions for all permutations of rows
- Then for any hash function $h \in H$: $\Pr[h(x) = h(y)] = 1 - d(x, y)$
  - Simply restates theorem about Min-Hashing in terms of distances rather than similarities
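For a tiny universe the statement above can be checked exactly by enumerating every permutation of the rows. The sketch below is only practical for a handful of rows (n! permutations); the sets x and y and the row count are made up for illustration.

```python
from fractions import Fraction
from itertools import permutations


def minhash_collision_probability(x: set[int], y: set[int], n_rows: int) -> Fraction:
    """Exact Pr[h(x) = h(y)] over all Min-Hash functions, one per permutation of n_rows rows."""
    agree = 0
    total = 0
    for perm in permutations(range(n_rows)):
        total += 1
        # Min-Hash value of a set: the smallest permuted position among its rows.
        if min(perm[i] for i in x) == min(perm[i] for i in y):
            agree += 1
    return Fraction(agree, total)


x = {0, 1, 2}
y = {1, 2, 3}
n_rows = 5  # 5! = 120 permutations

jaccard_dist = 1 - Fraction(len(x & y), len(x | y))
prob = minhash_collision_probability(x, y, n_rows)
print(prob, 1 - jaccard_dist)
```

Both printed values should coincide: the collision probability equals 1 minus the Jaccard distance.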
