
slide-1
SLIDE 1

Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman

Stanford University

http://www.mmds.org

slide-2
SLIDE 2

High dim. data

Locality sensitive hashing, Clustering, Dimensionality reduction

Graph data

PageRank, SimRank, Network Analysis, Spam Detection

Infinite data

Filtering data streams, Web advertising, Queries on streams

Machine learning

SVM, Decision Trees, Perceptron, kNN

Apps

Recommender systems, Association Rules, Duplicate document detection

  • J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

slide-3
SLIDE 3


[Hays and Efros, SIGGRAPH 2007]

slide-4
SLIDE 4


[Hays and Efros, SIGGRAPH 2007]

slide-5
SLIDE 5

10 nearest neighbors from a collection of 20,000 images


[Hays and Efros, SIGGRAPH 2007]

slide-6
SLIDE 6

10 nearest neighbors from a collection of 2 million images


[Hays and Efros, SIGGRAPH 2007]

slide-7
SLIDE 7

Many problems can be expressed as

finding “similar” sets:

Find near-neighbors in high-dimensional space

Examples:

Pages with similar words

For duplicate detection, classification by topic

Customers who purchased similar products

Products with similar customer sets

Images with similar features

Users who visited similar websites


slide-8
SLIDE 8

Given: High dimensional data points x1, x2, …

For example: Image is a long vector of pixel colors

  • And some distance function d(x1, x2)

Which quantifies the “distance” between x1 and x2

Goal: Find all pairs of data points (xi, xj) that are

within some distance threshold d(xi, xj) ≤ s

Note: Naïve solution would take O(N^2)

  • where N is the number of data points

MAGIC: This can be done in O(N)!! How?


slide-9
SLIDE 9

Last time: Finding frequent pairs



slide-10
SLIDE 10

Last time: Finding frequent pairs Further improvement: PCY

Pass 1:

Count exact frequency of each item: Take pairs of items {i,j}, hash them into B buckets and count the number of pairs that hashed to each bucket:


slide-11
SLIDE 11

Last time: Finding frequent pairs Further improvement: PCY

Pass 1:

Count exact frequency of each item: Take pairs of items {i,j}, hash them into B buckets and count the number of pairs that hashed to each bucket:

Pass 2:

For a pair {i,j} to be a candidate for a frequent pair, its singletons {i}, {j} have to be frequent and the pair has to hash to a frequent bucket!


slide-12
SLIDE 12

Last time: Finding frequent pairs Further improvement: PCY

Pass 1:

Count exact frequency of each item: Take pairs of items {i,j}, hash them into B buckets and count the number of pairs that hashed to each bucket:

Pass 2:

For a pair {i,j} to be a candidate for a frequent pair, its singletons have to be frequent and the pair has to hash to a frequent bucket!


  • Previous lecture: A-Priori

Main idea: Candidates

Instead of keeping a count of each pair, only keep a count

  • of candidate pairs!

Today’s lecture: Find pairs of similar docs

Main idea: Candidates

  • - Pass 1: Take documents and hash them to buckets such that

documents that are similar hash to the same bucket

  • - Pass 2: Only compare documents that are candidates

(i.e., they hashed to the same bucket) Benefits: Instead of O(N^2) comparisons, we need O(N) comparisons to find similar documents

slide-13
SLIDE 13
slide-14
SLIDE 14

Goal: Find near-neighbors in high-dim. space

We formally define “near neighbors” as points that are a “small distance” apart

For each application, we first need to define

what “distance” means

Today: Jaccard distance/similarity

The Jaccard similarity of two sets is the size of their intersection divided by the size of their union: sim(C1, C2) = |C1 ∩ C2| / |C1 ∪ C2|

Jaccard distance: d(C1, C2) = 1 − |C1 ∩ C2| / |C1 ∪ C2|
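As a minimal sketch in Python (function names are mine, not from the slides), both quantities fall out of set operations:

```python
# Jaccard similarity and distance over Python sets (illustrative sketch).

def jaccard_sim(c1: set, c2: set) -> float:
    """|C1 ∩ C2| / |C1 ∪ C2|."""
    if not c1 and not c2:
        return 1.0  # convention: two empty sets are identical
    return len(c1 & c2) / len(c1 | c2)

def jaccard_dist(c1: set, c2: set) -> float:
    """1 − Jaccard similarity."""
    return 1.0 - jaccard_sim(c1, c2)

c1, c2 = {1, 2, 3, 4}, {2, 3, 4, 5}
print(jaccard_sim(c1, c2))   # 3/5 = 0.6
```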


slide-15
SLIDE 15

Goal: Given a large number N (in the millions or

billions) of documents, find “near duplicate” pairs

Applications:

Mirror websites, or approximate mirrors

Don’t want to show both in search results

Similar news articles at many news sites

Cluster articles by “same story”

Problems:

Many small pieces of one document can appear

out of order in another

Too many documents to compare all pairs. Documents are so large or so many that they cannot fit in main memory


slide-16
SLIDE 16
  • 1. Shingling: Convert documents to sets
  • 2. Min-Hashing: Convert large sets to short

signatures, while preserving similarity

  • 3. Locality-Sensitive Hashing: Focus on pairs of signatures likely to be from similar documents

  • Candidate pairs!

slide-17
SLIDE 17

[Pipeline diagram: Document → Shingling → Min-Hashing → Locality-Sensitive Hashing → Candidate pairs]
slide-18
SLIDE 18

Step 1: Shingling: Convert documents to sets

Document → The set of strings of length k that appear in the document

slide-19
SLIDE 19

Step 1: Shingling: Convert documents to sets

Simple approaches:

Document = set of words appearing in document
Document = set of “important” words
Don’t work well for this application. Why?

Need to account for ordering of words! A different way: Shingles!


slide-20
SLIDE 20

A k-shingle (or k-gram) for a document is a

sequence of k tokens that appears in the doc

Tokens can be characters, words or something else, depending on the application Assume tokens = characters for examples

Example: k=2; document D1 = abcab

Set of 2-shingles: S(D1) = {ab, bc, ca}

Option: Shingles as a bag (multiset), count ab twice: S’(D1) = {ab, bc, ca, ab}
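A sketch of shingle extraction under this definition, with characters as tokens (the helper name is mine):

```python
# Character k-shingles of a document: all length-k substrings, as a set.

def shingles(doc: str, k: int = 2) -> set:
    """Set of k-shingles (k-grams) of doc, tokens = characters."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}

print(shingles("abcab", k=2))  # {'ab', 'bc', 'ca'} -- repeats collapse in the set
```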


slide-21
SLIDE 21

To compress long shingles, we can hash them

to (say) 4 bytes

Represent a document by the set of hash

values of its k-shingles

Idea: Two documents could (rarely) appear to have shingles in common, when in fact only the hash- values were shared

Example: k=2; document D1 = abcab

Set of 2-shingles: S(D1) = {ab, bc, ca} Hash the shingles: h(D1) = {1, 5, 7}


slide-22
SLIDE 22

Document D1 is a set of its k-shingles C1=S(D1) Equivalently, each document is a

0/1 vector in the space of k-shingles

Each unique shingle is a dimension Vectors are very sparse

A natural similarity measure is the

Jaccard similarity: sim(D1, D2) = |C1 ∩ C2| / |C1 ∪ C2|


slide-23
SLIDE 23

Documents that have lots of shingles in

common have similar text, even if the text appears in different order

Caveat: You must pick k large enough, or most

documents will have most shingles

k = 5 is OK for short documents k = 10 is better for long documents


slide-24
SLIDE 24

Suppose we need to find near-duplicate

documents among N = 1 million documents

Naïvely, we would have to compute pairwise

Jaccard similarities for every pair of docs

N(N−1)/2 ≈ 5×10^11 comparisons At 10^5 secs/day and 10^6 comparisons/sec, it would take 5 days

For N = 10 million, it takes more than a year…
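The arithmetic above can be checked directly (a sketch; the 10^5-seconds-per-day figure is the slide's own round approximation):

```python
# Back-of-the-envelope check of the naive all-pairs cost for N = 1 million docs.
N = 1_000_000
pairs = N * (N - 1) // 2      # ~5e11 pairwise Jaccard comparisons
secs = pairs / 1e6            # at 10^6 comparisons per second
days = secs / 1e5             # using the slide's ~10^5 seconds per day
print(f"{pairs:.2e} pairs, ~{days:.1f} days")
```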


slide-25
SLIDE 25

Step 2: Minhashing: Convert large sets to short signatures, while preserving similarity

Document → The set of strings of length k that appear in the document → Signatures: short integer vectors that represent the sets, and reflect their similarity

slide-26
SLIDE 26

Many similarity problems can be

formalized as finding subsets that have significant intersection

Encode sets using 0/1 (bit, boolean) vectors

One dimension per element in the universal set

Interpret set intersection as bitwise AND, and

set union as bitwise OR

Example: C1 = 10111; C2 = 10011

Size of intersection = 3; size of union = 4, Jaccard similarity (not distance) = 3/4 Distance: d(C1,C2) = 1 – (Jaccard similarity) = 1/4
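The slide's example, using Python integers as bit vectors (an illustration of the AND/OR interpretation, not code from the slides):

```python
# C1 = 10111, C2 = 10011 as bit vectors: AND = intersection, OR = union.
c1, c2 = 0b10111, 0b10011
inter = bin(c1 & c2).count("1")   # size of intersection = 3
union = bin(c1 | c2).count("1")   # size of union = 4
sim = inter / union               # Jaccard similarity = 3/4
dist = 1 - sim                    # Jaccard distance = 1/4
print(sim, dist)
```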


slide-27
SLIDE 27

Rows = elements (shingles) Columns = sets (documents)

1 in row e and column s if and only if e is a member of s Column similarity is the Jaccard similarity of the corresponding sets (rows with value 1) Typical matrix is sparse!

Each document is a column:

Example: sim(C1, C2) = ?

Size of intersection = 3; size of union = 6, Jaccard similarity (not distance) = 3/6 d(C1, C2) = 1 − (Jaccard similarity) = 3/6


slide-28
SLIDE 28

So far:

Documents → Sets of shingles Represent sets as boolean vectors in a matrix

Next goal: Find similar columns while

computing small signatures

Similarity of columns == similarity of signatures


slide-29
SLIDE 29

Next Goal: Find similar columns, Small signatures Naïve approach:

1) Signatures of columns: small summaries of columns 2) Examine pairs of signatures to find similar columns

Essential: Similarities of signatures and columns are related

3) Optional: Check that columns with similar signatures are really similar

Warnings:

Comparing all pairs may take too much time: Job for LSH

These methods can produce false negatives, and even false positives (if the optional check is not made)


slide-30
SLIDE 30

Key idea: “hash” each column C to a small

signature h(C), such that:

(1) h(C) is small enough that the signature fits in RAM (2) sim(C1, C2) is the same as the “similarity” of signatures h(C1) and h(C2)

Goal: Find a hash function h(·) such that:

If sim(C1,C2) is high, then with high prob. h(C1) = h(C2) If sim(C1,C2) is low, then with high prob. h(C1) ≠ h(C2)

Hash docs into buckets. Expect that “most” pairs

of near duplicate docs hash into the same bucket!

slide-31
SLIDE 31

Goal: Find a hash function h(·) such that:

if sim(C1,C2) is high, then with high prob. h(C1) = h(C2) if sim(C1,C2) is low, then with high prob. h(C1) ≠ h(C2)

Clearly, the hash function depends on

the similarity metric:

Not all similarity metrics have a suitable hash function

There is a suitable hash function for

the Jaccard similarity: It is called Min-Hashing


slide-32
SLIDE 32


Imagine the rows of the boolean matrix

permuted under random permutation π

  • Define a “hash” function hπ(C) = the index of

the first (in the permuted order π) row in

which column C has value 1: hπ(C) = min π(C)

  • Use several (e.g., 100) independent hash

functions (that is, permutations) to create a signature of a column
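A sketch of this definition with explicit random permutations (the function name and toy columns are mine; columns are represented as sets of row indices):

```python
import random

# Min-Hash signatures via explicit row permutations, as defined above.

def minhash_signature(columns, n_rows, n_hashes=100, seed=0):
    rng = random.Random(seed)
    sig = []
    for _ in range(n_hashes):
        perm = list(range(n_rows))
        rng.shuffle(perm)  # perm[r] = position of row r after permuting
        # h_pi(C) = index of the first row, in permuted order, where C has a 1
        sig.append([min(perm[r] for r in col) for col in columns])
    return sig

cols = [{0, 1, 2}, {1, 2, 3}]   # two toy documents as sets of row indices
sig = minhash_signature(cols, n_rows=4, n_hashes=200)
agree = sum(row[0] == row[1] for row in sig) / len(sig)
print(agree)  # should be close to sim = |{1,2}| / |{0,1,2,3}| = 0.5
```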

slide-33
SLIDE 33

[Figure: input matrix (Shingles x Documents), random row permutations (e.g., 3 4 7 2 6 1 5), and the resulting signature matrix M]

slide-34
SLIDE 34

Choose a random permutation π

Claim: Pr[hπ(C1) = hπ(C2)] = sim(C1, C2)

Why?

Let X be a doc (set of shingles), y ∈ X is a shingle

Then: Pr[π(y) = min(π(X))] = 1/|X|

It is equally likely that any y ∈ X is mapped to the min element

Let y be s.t. π(y) = min(π(C1 ∪ C2)) Then either: π(y) = min(π(C1)) if y ∈ C1, or π(y) = min(π(C2)) if y ∈ C2 So the prob. that both are true is the prob. y ∈ C1 ∩ C2

Pr[min(π(C1)) = min(π(C2))] = |C1 ∩ C2| / |C1 ∪ C2| = sim(C1, C2)

slide-35
SLIDE 35

Given cols C1 and C2, rows may be classified as:

       C1  C2
   A    1   1
   B    1   0
   C    0   1
   D    0   0

a = # rows of type A, etc.

Note: sim(C1, C2) = a / (a + b + c)

Then: Pr[h(C1) = h(C2)] = sim(C1, C2)

Look down the cols C1 and C2 until we see a 1 If it’s a type-A row, then h(C1) = h(C2) If a type-B or type-C row, then not
slide-36
SLIDE 36


We know: Pr[hπ(C1) = hπ(C2)] = sim(C1, C2)

Now generalize to multiple hash functions The similarity of two signatures is the

fraction of the hash functions in which they agree

Note: Because of the Min-Hash property, the

similarity of columns is the same as the expected similarity of their signatures

slide-37
SLIDE 37

Similarities of column pairs vs. signature pairs:

             1-3    2-4    1-2    3-4
  Col/Col    0.75   0.75   0      0
  Sig/Sig    0.67   1.00   0      0

[Figure: permuted input matrix (Shingles x Documents) and signature matrix M from the previous slide]

slide-38
SLIDE 38

Pick K=100 random permutations of the rows Think of sig(C) as a column vector sig(C)[i] = according to the i-th permutation, the

index of the first row that has a 1 in column C

  • Note: The sketch (signature) of document C is

small: ~100 bytes!

We achieved our goal! We “compressed”

long bit vectors into short signatures


slide-39
SLIDE 39

Permuting rows even once is prohibitive Row hashing!

Pick K = 100 hash functions ki Ordering under ki gives a random row permutation!

One-pass implementation

For each column C and hash-func. ki keep a “slot” for the min-hash value Initialize all sig(C)[i] = ∞

  • Scan rows looking for 1s

Suppose row j has 1 in column C Then for each ki :

If ki(j) < sig(C)[i], then sig(C)[i] ← ki(j)
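A sketch of this one-pass scheme; the slide leaves the hash functions unspecified, so the linear form ki(x) = ((a·x + b) mod p) mod n_rows below is my own assumption:

```python
import random

# One-pass Min-Hashing by row hashing: scan rows, keep running minima.

def minhash_onepass(columns, n_rows, K=100, seed=0):
    rng = random.Random(seed)
    p = 2_147_483_647                     # a large prime > n_rows (assumed form)
    coeffs = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(K)]

    def h(i, x):
        a, b = coeffs[i]
        return ((a * x + b) % p) % n_rows

    sig = [[float("inf")] * len(columns) for _ in range(K)]
    for j in range(n_rows):               # scan rows looking for 1s
        hashed = [h(i, j) for i in range(K)]
        for c, col in enumerate(columns):
            if j in col:                  # row j has a 1 in column c
                for i in range(K):
                    if hashed[i] < sig[i][c]:
                        sig[i][c] = hashed[i]
    return sig

docs = [set(range(0, 60)), set(range(30, 90))]   # Jaccard sim = 30/90 = 1/3
sig = minhash_onepass(docs, n_rows=1000, K=200)
agree = sum(row[0] == row[1] for row in sig) / len(sig)
print(agree)  # should approximate 1/3
```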
slide-40
SLIDE 40

Step 3: Locality-Sensitive Hashing: Focus on pairs of signatures likely to be from similar documents

Document → The set of strings of length k that appear in the document → Signatures: short integer vectors that represent the sets, and reflect their similarity → Locality-Sensitive Hashing → Candidate pairs: those pairs of signatures that we need to test for similarity

slide-41
SLIDE 41

Goal: Find documents with Jaccard similarity at

least s (for some similarity threshold, e.g., s=0.8)

LSH – General idea: Use a function f(x,y) that

tells whether x and y are a candidate pair: a pair

of elements whose similarity must be evaluated

For Min-Hash matrices:

Hash columns of signature matrix M to many buckets Each pair of documents that hashes into the same bucket is a candidate pair

slide-42
SLIDE 42

Pick a similarity threshold s (0 < s < 1) Columns x and y of M are a candidate pair if

their signatures agree on at least fraction s of their rows: M (i, x) = M (i, y) for at least frac. s values of i

We expect documents x and y to have the same (Jaccard) similarity as their signatures


slide-43
SLIDE 43

Big idea: Hash columns of

signature matrix M several times

Arrange that (only) similar columns are

likely to hash to the same bucket, with high probability

Candidate pairs are those that hash to

the same bucket


slide-44
SLIDE 44
[Figure: signature matrix M divided into b bands of r rows each; one column = one signature]

slide-45
SLIDE 45

Divide matrix M into b bands of r rows For each band, hash its portion of each

column to a hash table with k buckets

Make k as large as possible

Candidate column pairs are those that hash

to the same bucket for ≥ 1 band

Tune b and r to catch most similar pairs,

but few non-similar pairs
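A sketch of the banding step (helper names are mine). Buckets are keyed by the band's exact contents, matching the simplifying assumption that "same bucket" means identical in that band:

```python
from collections import defaultdict
from itertools import combinations

# LSH banding: split signatures into b bands of r rows; any two columns
# identical in >= 1 band become a candidate pair.

def lsh_candidates(sig, b, r):
    """sig: signature matrix as a list of b*r rows, one entry per column."""
    n_cols = len(sig[0])
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for c in range(n_cols):
            key = tuple(sig[band * r + i][c] for i in range(r))
            buckets[key].append(c)        # the band's portion of column c
        for cols in buckets.values():
            candidates.update(combinations(cols, 2))
    return candidates

# Toy 4-row signature matrix (b=2 bands of r=2 rows), 3 columns:
# columns 0 and 1 agree on band 0, so they become a candidate pair.
sig = [[1, 1, 9],
       [2, 2, 8],
       [3, 4, 7],
       [5, 6, 0]]
print(lsh_candidates(sig, b=2, r=2))  # {(0, 1)}
```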

slide-46
SLIDE 46

[Figure: each of the b bands of r rows of matrix M is hashed separately into its own buckets]

slide-47
SLIDE 47

There are enough buckets that columns are

unlikely to hash to the same bucket unless they are identical in a particular band

Hereafter, we assume that “same bucket”

means “identical in that band”

Assumption needed only to simplify analysis,

not for correctness of algorithm

slide-48
SLIDE 48

Assume the following case:

Suppose 100,000 columns of M (100k docs) Signatures of 100 integers (rows) Therefore, signatures take 40MB Choose b = 20 bands of r = 5 integers/band Goal: Find pairs of documents that

are at least s = 0.8 similar


slide-49
SLIDE 49

Find pairs of documents with ≥ s = 0.8 similarity; set b = 20, r = 5

Assume: sim(C1, C2) = 0.8

Since sim(C1, C2) ≥ s, we want C1, C2 to be a candidate pair: We want them to hash to at least 1 common bucket (at least one band is identical)

Probability C1, C2 identical in one particular

band: (0.8)^5 = 0.328

Probability C1, C2 are not identical in any of the 20

bands: (1 − 0.328)^20 = 0.00035

i.e., about 1/3000th of the 80%-similar column pairs are false negatives (we miss them) We would find 99.965% of pairs of truly similar documents

slide-50
SLIDE 50

Find pairs of documents with ≥ s = 0.8 similarity; set b = 20, r = 5

Assume: sim(C1, C2) = 0.3

Since sim(C1, C2) < s, we want C1, C2 to hash to NO common buckets (all bands should be different)

Probability C1, C2 identical in one particular

band: (0.3)^5 = 0.00243

Probability C1, C2 identical in at least 1 of 20

bands: 1 − (1 − 0.00243)^20 = 0.0474

In other words, approximately 4.74% of pairs of docs with similarity 0.3 end up becoming candidate pairs

They are false positives since we will have to examine them (they are candidate pairs) but then it will turn out their similarity is below threshold s
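The numbers from both examples can be reproduced in a few lines (a sketch; variable names are mine and printed values are approximate):

```python
# Checking the slides' false-negative / false-positive arithmetic
# for b = 20 bands of r = 5 rows.
r, b = 5, 20

# sim(C1, C2) = 0.8: probability we MISS the pair (false negative)
p_band_hi = 0.8 ** r                 # ~0.328: one particular band identical
p_fn = (1 - p_band_hi) ** b          # ~0.00035: no band identical

# sim(C1, C2) = 0.3: probability the pair still becomes a candidate (false positive)
p_band_lo = 0.3 ** r                 # 0.00243
p_fp = 1 - (1 - p_band_lo) ** b      # ~0.047

print(p_band_hi, p_fn, p_band_lo, p_fp)
```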


slide-51
SLIDE 51

Pick:

The number of Min-Hashes (rows of M) The number of bands b, and The number of rows r per band

to balance false positives/negatives

Example: If we had only 15 bands of 5

rows, the number of false positives would go down, but the number of false negatives would go up

slide-52
SLIDE 52
slide-53
SLIDE 53
slide-54
SLIDE 54

Columns C1 and C2 have similarity t Pick any band (r rows)

  • Prob. that all rows in band equal = t^r
  • Prob. that some row in band unequal = 1 − t^r
  • Prob. that no band identical = (1 − t^r)^b
  • Prob. that at least 1 band identical = 1 − (1 − t^r)^b
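The final formula as code, evaluated for the running example r = 5, b = 20 (the function name is mine):

```python
# The banding S-curve: probability that a pair of columns with
# similarity t becomes a candidate, for b bands of r rows.

def p_candidate(t: float, r: int, b: int) -> float:
    return 1 - (1 - t ** r) ** b

# With r=5, b=20 the curve rises sharply between t = 0.4 and t = 0.6:
for t in (0.2, 0.4, 0.6, 0.8):
    print(f"t={t}: {p_candidate(t, r=5, b=20):.3f}")
```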

slide-55
SLIDE 55

slide-56
SLIDE 56

Similarity threshold s

  • Prob. that at least 1 band is identical (r = 5, b = 20):

   s     1 − (1 − s^r)^b
  0.2       0.006
  0.3       0.047
  0.4       0.186
  0.5       0.470
  0.6       0.802
  0.7       0.975
  0.8       0.9996

slide-57
SLIDE 57

Picking r and b to get the best S-curve

50 hash-functions (r=5, b=10)

  • J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

[Plot: S-curve of probability of sharing a bucket vs. similarity]

  • Blue area: False Negative rate

Green area: False Positive rate

x-axis: Similarity; y-axis: Prob. sharing a bucket
slide-58
SLIDE 58

Tune M, b, r to get almost all pairs with

similar signatures, but eliminate most pairs that do not have similar signatures

Check in main memory that candidate pairs

really do have similar signatures

Optional: In another pass through data,

check that the remaining candidate pairs really represent similar documents

slide-59
SLIDE 59

Shingling: Convert documents to sets

We used hashing to assign each shingle an ID

Min-Hashing: Convert large sets to short

signatures, while preserving similarity

We used similarity preserving hashing to generate signatures with property Pr[hπ(C1) = hπ(C2)] = sim(C1, C2)

Locality-Sensitive Hashing: Focus on pairs of

signatures likely to be from similar documents

We used hashing to find candidate pairs of similarity ≥ s
