SLIDE 1

Slide credits: Mining of Massive Datasets, by Jure Leskovec, Anand Rajaraman, and Jeff Ullman

Stanford University

http://www.mmds.org

Note to other teachers and users of these slides: We would be delighted if you found this our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: http://www.mmds.org

SLIDE 2

 Many problems can be expressed as finding "similar" objects:

  • Find near(est) neighbors

 Example applications:

  • Pages with similar words: duplicate detection, clustering by topic
  • Customers who purchased similar products: kNN classification, collaborative filtering
  • Images with similar features: image recommendation
  • Record linkage (deduplication)

  • J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

SLIDE 3

10 nearest neighbors from a collection of 2 million images

[Hays and Efros, SIGGRAPH 2007]

SLIDE 4

 Given: (high-dimensional) data points x1, x2, …

  • For example: an image is a vector of pixel colors

    [1 2 1 0 2 1 0 1 0]

 And some distance function d(x1, x2)

  • which quantifies the "distance" between x1 and x2

 Goal: Find all pairs of data points (xi, xj) that are within some distance threshold: d(xi, xj) ≤ s

SLIDE 5

 Given: (high-dimensional) data points x1, x2, …

  • For example: an image is a vector of pixel colors

    [1 2 1 0 2 1 0 1 0]

 And some distance function d(x1, x2)

  • which quantifies the "distance" between x1 and x2

 Goal: Find all pairs of data points (xi, xj) that are within some distance threshold: d(xi, xj) ≤ s

 A naïve solution would take O(N²), where N is the number of data points

SLIDE 6

 Hash objects to buckets such that objects that are similar hash to the same bucket

 Only compare candidates within each bucket

 Benefit: instead of O(N²) comparisons, we need O(N) to find similar documents

 Hash functions depend on the similarity function

SLIDE 7

 Goal: Given a large number (N in the millions or billions) of documents, find "near-duplicate" pairs

 Applications:

  • Mirror websites, or approximate mirrors
  • Similar news articles at many news sites

 Problem:

  • Too many documents to compare all pairs

SLIDE 8

 Shingling: Convert documents to sets

 Simple approaches:

  • Document = set of words appearing in the document
  • Document = set of "important" words

SLIDE 9

 Need to account for the ordering of words!

  • Document = set of shingles

 A k-shingle (or k-gram) is a sequence of k consecutive tokens that appears in the document

  • Tokens can be characters or words

 Example:

  • k = 2
  • D1 = abcab
  • Set of 2-shingles: S(D1) = {ab, bc, ca}
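The character-level shingling above can be sketched in a few lines; the function name `shingles` is my own, and the example reproduces the slide's D1 with k = 2:

```python
def shingles(doc: str, k: int) -> set:
    """Set of k-shingles: all length-k substrings (character tokens) of doc."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}

# The slide's example: D1 = abcab, k = 2
s_d1 = shingles("abcab", 2)
```

Note that the result is a set, so the repeated shingle "ab" appears only once.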

SLIDE 10

 A natural similarity measure is the Jaccard similarity:

 sim(D1, D2) = |C1 ∩ C2| / |C1 ∪ C2|

SLIDE 11

 Encode sets as 0/1 (bit, boolean) vectors

  • One dimension per element of the universal set

 Interpret set intersection as bitwise AND, and set union as bitwise OR

 Example: C1 = 10111; C2 = 10011

  • Size of intersection = 3; size of union = 4
  • Jaccard similarity = 3/4
  • Distance: d(C1, C2) = 1 − (Jaccard similarity) = 1/4
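A minimal sketch of Jaccard similarity on bit-vector columns, using the slide's AND/OR interpretation (the function name is my own):

```python
def jaccard_bits(c1: str, c2: str) -> float:
    """Jaccard similarity of two equal-length 0/1 vectors given as bit strings.
    Intersection corresponds to bitwise AND, union to bitwise OR."""
    inter = sum(a == "1" and b == "1" for a, b in zip(c1, c2))
    union = sum(a == "1" or b == "1" for a, b in zip(c1, c2))
    return inter / union

# The slide's example: C1 = 10111, C2 = 10011
sim = jaccard_bits("10111", "10011")
dist = 1 - sim
```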

SLIDE 12

 Rows = elements (shingles)

 Columns = sets (documents)

  • 1 in row e and column s if and only if e is a member of s
  • Column similarity is the Jaccard similarity of the corresponding sets
  • A typical matrix is sparse!

 Example: sim(C1, C2) = ?

[Figure: boolean input matrix (shingles × documents)]

SLIDE 13

 Rows = elements (shingles)

 Columns = sets (documents)

  • 1 in row e and column s if and only if e is a member of s
  • Column similarity is the Jaccard similarity of the corresponding sets
  • A typical matrix is sparse!

 Example: sim(C1, C2) = ?

  • Size of intersection = 3; size of union = 6; Jaccard similarity = 3/6
  • d(C1, C2) = 1 − (Jaccard similarity) = 3/6

[Figure: boolean input matrix (shingles × documents)]

SLIDE 14

 Suppose we need to find near-duplicate documents among N = 1 million documents

 Naïvely, we would have to compute pairwise Jaccard similarities for every pair of docs

  • N(N − 1)/2 ≈ 5·10¹¹ comparisons
  • At 10⁵ secs/day and 10⁶ comparisons/sec, it would take 5 days

 For N = 10 million, it takes more than a year…
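The slide's arithmetic can be checked directly:

```python
N = 1_000_000
pairs = N * (N - 1) // 2     # all-pairs comparisons, approximately 5 * 10**11
secs = pairs / 10**6         # at 10**6 comparisons per second
days = secs / 10**5          # using the slide's rounding of ~10**5 seconds per day
```

Scaling N by 10 scales the pair count by roughly 100, which is where "more than a year" for 10 million documents comes from.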

SLIDE 15

 Key Idea: “hash” each column C to a small

signature h(C):

  • (1) h(C) is small enough that the signature fits in RAM
  • (2) sim(C1, C2) is the same as the “similarity” of

signatures h(C1) and h(C2)

 Locality sensitive hashing:

  • If sim(C1,C2) is high, then with high prob. h(C1) = h(C2)
  • If sim(C1,C2) is low, then with high prob. h(C1) ≠ h(C2)

 Expect that “most” pairs of near duplicate docs

hash into the same bucket!


SLIDE 16

[Figure: boolean input matrix (shingles × documents)]

SLIDE 17

[Figure: the input matrix (shingles × documents) with a random permutation π = (3 4 7 2 6 1 5) of its rows, and the resulting signature matrix M. For one column, the 2nd element of the permutation is the first to map to a 1; for another, the 4th element of the permutation is the first to map to a 1.]

SLIDE 18

 Imagine the rows of the boolean matrix permuted under a random permutation π

 Define a "hash" function hπ(C) = the index of the first (in the permuted order π) row in which column C has value 1:

 hπ(C) = min π(C)

 Use several (e.g., 100) independent hash functions (that is, permutations) to create a signature for each column
SLIDE 19

 Permuting the rows even once is prohibitive

 Row hashing!

  • Pick K hash functions h_i
  • Ordering under h_i gives a random row permutation!

 How to pick a random hash function h(x)? Universal hashing:

 h_{a,b}(x) = ((a·x + b) mod p) mod N

 where a, b are random integers and p is a prime larger than N
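A sketch of drawing members of this family with Python's `random` module; the fixed Mersenne prime is my choice, and note that reducing mod N after the prime reduction only approximates a true permutation, since collisions are possible:

```python
import random

def make_hash(n_rows: int, p: int = 2_147_483_647, rng=random):
    """One member h_{a,b}(x) = ((a*x + b) mod p) mod n_rows of a universal
    family; p is a prime assumed larger than n_rows, a and b are random."""
    a = rng.randrange(1, p)
    b = rng.randrange(0, p)
    return lambda x: ((a * x + b) % p) % n_rows

random.seed(1)
h1, h2 = make_hash(7), make_hash(7)   # two pseudo-permutations of 7 rows
```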

SLIDE 20

 One-pass implementation

  • For each column C, initialize all hash values: sig(C)[i] = ∞ for each hash function i
  • For each row r:
  • If there is a 1 in column C, update the hash value of column C if the row's position in the current permutation is smaller than the current value:
  • If h_i(r) < sig(C)[i], then sig(C)[i] ← h_i(r)
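The one-pass scheme above can be sketched as follows; the matrix layout and the two toy hash functions, (x + 1) mod 5 and (3x + 1) mod 5, are illustrative choices of mine, not from the slides:

```python
def minhash(matrix, hash_funcs):
    """One-pass MinHash.  matrix[r][c] is the 0/1 entry for row (shingle) r
    and column (document) c.  sig[i][c] starts at infinity and is lowered
    whenever row r has a 1 in column c and h_i(r) is smaller."""
    n_cols = len(matrix[0])
    sig = [[float("inf")] * n_cols for _ in hash_funcs]
    for r, row in enumerate(matrix):
        hs = [h(r) for h in hash_funcs]   # row r's position in each permutation
        for c, bit in enumerate(row):
            if bit:
                for i, hv in enumerate(hs):
                    if hv < sig[i][c]:
                        sig[i][c] = hv
    return sig

# Toy 5x4 input matrix (rows = shingles, columns = documents)
M = [[1, 0, 0, 1],
     [0, 0, 1, 0],
     [0, 1, 0, 1],
     [1, 0, 1, 1],
     [0, 0, 1, 0]]
sig = minhash(M, [lambda x: (x + 1) % 5, lambda x: (3 * x + 1) % 5])
```

The algorithm reads each row once, so the cost is proportional to the number of 1s in the matrix times the number of hash functions.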

SLIDE 21
SLIDE 22

[Figure: the input matrix (shingles × documents) with three row permutations, (3 4 7 2 6 1 5), (4 5 1 6 7 3 2), and (5 7 6 3 1 2 4), and the resulting 3 × 4 signature matrix M.]

SLIDE 23

 One bit matching (given a permutation π):

 Pr[hπ(C1) = hπ(C2)] = ?

[Figure: signature matrix M]

SLIDE 24

 One bit matching (given a permutation π):

 Pr[hπ(C1) = hπ(C2)] = sim(C1, C2)

[Figure: signature matrix M]

SLIDE 25

 Given columns C1 and C2, rows may be classified as:

        C1  C2
    A    1   1
    B    1   0
    C    0   1
    D    0   0

  • a = # rows of type A, etc.

 Note: sim(C1, C2) = a/(a + b + c)

 Then: Pr[h(C1) = h(C2)] = sim(C1, C2)

  • Look down columns C1 and C2 (in permuted order) until we see a 1
  • If it's a type-A row, then h(C1) = h(C2); if a type-B or type-C row, then not
SLIDE 26

 One given bit matching:

 Pr[h(C1) = h(C2)] = sim(C1, C2)

 The expected similarity of the signatures (defined as the fraction of matching values):

 sim(h(C1), h(C2)) = ?

[Figure: signature matrix M]

SLIDE 27

 One given bit matching:

 Pr[h(C1) = h(C2)] = sim(C1, C2)

 The expected similarity of the signatures (defined as the fraction of matching values):

 sim(h(C1), h(C2)) = sim(C1, C2)

[Figure: signature matrix M]

SLIDE 28

 Similarities:

            1-3    2-4    1-2    3-4
  Col/Col   0.75   0.75   0      0
  Sig/Sig   0.67   1.00   0      0

[Figure: the input matrix (shingles × documents), three row permutations, and the resulting signature matrix M.]

SLIDE 29

 Key Idea: “hash” each column C to a small

signature h(C):

  • (1) h(C) is small enough that the signature fits in RAM
  • (2) sim(C1, C2) is the same as the “similarity” of

signatures h(C1) and h(C2)

 Locality sensitive hashing:

  • If sim(C1,C2) is high, then with high prob. h(C1) = h(C2)
  • If sim(C1,C2) is low, then with high prob. h(C1) ≠ h(C2)

 Expect that “most” pairs of near duplicate docs

hash into the same bucket!


SLIDE 30

[Figure: the ideal case. Probability of sharing a bucket vs. similarity t = sim(C1, C2) of two sets: no chance if t < s, probability 1 if t > s, where s is the similarity threshold.]
SLIDE 31

[Figure: probability of matching 1 bit vs. similarity t = sim(C1, C2) of two sets.]
SLIDE 32

 One given bit matching:
 Pr[h(C1) = h(C2)] = sim(C1, C2)

 Similarity of signatures:
 sim(h(C1), h(C2)) = sim(C1, C2)

 All K bits matching (AND):
 Pr[all h(C1) = h(C2)] = ?

[Figure: signature matrix M]

SLIDE 33

 One given bit matching:
 Pr[h(C1) = h(C2)] = sim(C1, C2)

 Similarity of signatures:
 sim(h(C1), h(C2)) = sim(C1, C2)

 All K bits matching (AND):
 Pr[all h(C1) = h(C2)] = sim(C1, C2)^K

[Figure: signature matrix M]

SLIDE 34

 One given bit matching:
 Pr[h(C1) = h(C2)] = sim(C1, C2)

 Similarity of signatures:
 sim(h(C1), h(C2)) = sim(C1, C2)

 All K bits matching (AND):
 Pr[all h(C1) = h(C2)] = sim(C1, C2)^K

 Any bit matching (OR):
 Pr[any h(C1) = h(C2)] = ?

[Figure: signature matrix M]

SLIDE 35

 One given bit matching:
 Pr[h(C1) = h(C2)] = sim(C1, C2)

 Similarity of signatures:
 sim(h(C1), h(C2)) = sim(C1, C2)

 All K bits matching (AND):
 Pr[all h(C1) = h(C2)] = sim(C1, C2)^K

 Any bit matching (OR):
 Pr[any h(C1) = h(C2)] = 1 − (1 − sim(C1, C2))^K

[Figure: signature matrix M]
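The AND and OR formulas above can be checked numerically; a minimal sketch with function names of my own:

```python
def p_and(sim: float, K: int) -> float:
    """Pr[all K independent MinHash values match] = sim**K."""
    return sim ** K

def p_or(sim: float, K: int) -> float:
    """Pr[at least one of K independent values matches] = 1 - (1 - sim)**K."""
    return 1 - (1 - sim) ** K

and_08 = p_and(0.8, 5)   # AND drives probabilities down
or_08 = p_or(0.8, 5)     # OR drives probabilities up
```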

SLIDE 36

[Figure: signature matrix M divided into b bands of r rows each; one band of one column forms a mini-signature.]

SLIDE 37

 Divide matrix M into b bands of r rows each

 Candidate column pairs are those that hash to the same values in at least one band

 Tune b and r to catch most similar pairs, but few non-similar pairs
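A minimal sketch of the banding step, assuming the signature matrix is a list of rows (one per hash function) with one column per document; the names are my own:

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(sig, b, r):
    """LSH banding: sig has b*r rows of hash values.  Columns with identical
    values in all r rows of some band land in the same bucket for that band
    and become a candidate pair."""
    n_cols = len(sig[0])
    cands = set()
    for band in range(b):
        buckets = defaultdict(list)
        for c in range(n_cols):
            key = tuple(sig[band * r + i][c] for i in range(r))
            buckets[key].append(c)
        for cols in buckets.values():
            cands.update(combinations(cols, 2))
    return cands

# Toy signature matrix: 4 rows (b = 2 bands of r = 2 rows), 3 documents.
sig = [[1, 1, 2],
       [2, 2, 3],
       [5, 7, 7],
       [6, 8, 8]]
pairs = candidate_pairs(sig, b=2, r=2)
```

Here columns 0 and 1 agree in the first band, and columns 1 and 2 agree in the second, so both pairs become candidates even though no pair agrees everywhere.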
SLIDE 38

 Columns C1 and C2 have similarity t

 Prob. that one given band (r rows) matches = ?

 Prob. that any band matches = ?
SLIDE 39

 Columns C1 and C2 have similarity t

 Prob. that one given band (r rows) matches = t^r

 Prob. that any band matches = 1 − (1 − t^r)^b
SLIDE 40

 Probability of sharing a bucket: 1 − (1 − t^r)^b

 The threshold (the similarity at which the S-curve rises steepest) is approximately s ≈ (1/b)^(1/r)

[Figure: probability of sharing a bucket vs. similarity t = sim(C1, C2) of two sets.]
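The S-curve and the threshold approximation can be evaluated directly; b = 20 and r = 5 are illustrative parameter choices of mine:

```python
def p_share_bucket(t: float, b: int, r: int) -> float:
    """S-curve: probability that two columns of similarity t become
    candidates under banding, 1 - (1 - t**r)**b."""
    return 1 - (1 - t ** r) ** b

high = p_share_bucket(0.8, b=20, r=5)   # similar pair: almost surely caught
low = p_share_bucket(0.2, b=20, r=5)    # dissimilar pair: rarely a candidate
threshold = (1 / 20) ** (1 / 5)         # s ~ (1/b)**(1/r), about 0.55 here
```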

SLIDE 41

 Tradeoff between false positives and false negatives

 Example: 50 hash functions (r = 5, b = 10)

[Figure: probability of sharing a bucket vs. similarity. Blue area: false negative rate; green area: false positive rate.]
SLIDE 42

SLIDE 43

 Given a (d1, d2, p1, p2)-sensitive family F

 AND-construction:

  • AND of r members of F
  • (d1, d2, p1^r, p2^r)-sensitive
  • Mirrors the effect of r rows in a single band

 OR-construction:

  • OR of b members of F
  • (d1, d2, 1 − (1 − p1)^b, 1 − (1 − p2)^b)-sensitive
  • Mirrors the effect of combining multiple bands

 Combine the two: select r and b to push p1 toward 1 and p2 toward 0
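The two constructions can be sketched as transformations on the sensitivity tuple; the base family (0.2, 0.6, 0.8, 0.4) is a hypothetical example of mine, not from the slides:

```python
def and_construction(fam, r):
    """AND of r members: (d1, d2, p1, p2) -> (d1, d2, p1**r, p2**r)."""
    d1, d2, p1, p2 = fam
    return (d1, d2, p1 ** r, p2 ** r)

def or_construction(fam, b):
    """OR of b members: (d1, d2, p1, p2) -> (d1, d2, 1-(1-p1)**b, 1-(1-p2)**b)."""
    d1, d2, p1, p2 = fam
    return (d1, d2, 1 - (1 - p1) ** b, 1 - (1 - p2) ** b)

base = (0.2, 0.6, 0.8, 0.4)   # hypothetical (d1,d2,p1,p2)-sensitive family
amplified = or_construction(and_construction(base, 4), 4)
```

AND followed by OR widens the gap: here p1 rises from 0.8 toward 1 while p2 falls from 0.4 toward 0, at the cost of using 16 hash functions instead of one.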

SLIDE 44

 Hash function: a d-dimensional bit vector is hashed to its i-th bit value (one function per dimension)

 The family is (d1, d2, 1 − d1/d, 1 − d2/d)-sensitive for Hamming distance

 Amplify with AND/OR constructions

SLIDE 45

 Distance function: d(x, y) = θ, the angle between x and y

 A randomly chosen vector v_f defines a hash function f

 Given two vectors x and y, f(x) = f(y) iff v_f·x and v_f·y have the same sign

 The family is (d1, d2, 1 − d1/180, 1 − d2/180)-sensitive
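A sketch of this random-hyperplane family, using the relation above (Pr[f(x) = f(y)] = 1 − θ/180) to estimate an angle; the vector pair and sample count are my own choices, and the estimate is only approximate:

```python
import random

def hyperplane_hash(dim, rng):
    """One hash function: the sign of the dot product with a random vector v_f."""
    v = [rng.gauss(0, 1) for _ in range(dim)]
    return lambda x: sum(vi * xi for vi, xi in zip(v, x)) >= 0

rng = random.Random(42)
hashes = [hyperplane_hash(2, rng) for _ in range(4000)]
x, y = [1.0, 0.0], [1.0, 1.0]                 # vectors 45 degrees apart
agree = sum(h(x) == h(y) for h in hashes) / len(hashes)
est_angle = 180 * (1 - agree)                 # invert Pr[f(x) = f(y)] = 1 - theta/180
```

With enough random hyperplanes, the fraction of agreeing hashes concentrates near 1 − 45/180 = 0.75, so the estimated angle lands near 45 degrees.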

SLIDE 46

 A randomly chosen line is divided into segments (buckets) of length a

 A point is hashed to the bucket in which its projection onto the line lies

 The family is (d1, d2, p1, p2)-sensitive for suitable d1, d2, p1, p2