Data Mining Learning from Large Data Sets Lecture 3 - PowerPoint PPT Presentation

Data Mining: Learning from Large Data Sets
Lecture 3: Locality Sensitive Hashing
263-5200-00L
Andreas Krause

Announcement
• No class next week

Review: Fast near neighbor search in high dimensions

Locality sensitive hashing
• Idea: Create hash function that maps “similar” items to same bucket
[Figure: hashtable with buckets 0, 1, 2, 3]
• Key problem: Is it possible to construct such hash functions?
  • Depends on the distance function
  • Possible for Jaccard distance!
  • Some other distance functions work as well

Recall: Shingle Matrix
Sim(A, B) = |A ∩ B| / |A ∪ B|
Rows are shingles, columns are documents:

documents:  1  2  3  4
            1  0  1  0
            1  0  0  1
            0  1  0  1
            0  1  0  1
            0  1  0  1
            1  0  1  0
            1  0  1  0
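As a minimal sketch of the similarity on this slide, the columns of the shingle matrix can be stored as sets of row indices that hold a 1. The function name `jaccard_similarity`, the variable names, and the convention for two empty sets are assumptions made for this illustration.

```python
def jaccard_similarity(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(a & b) / len(a | b)

# Columns of the 7 x 4 shingle matrix, as sets of (1-based) rows containing a 1.
doc1 = {1, 2, 6, 7}
doc2 = {3, 4, 5}
doc3 = {1, 6, 7}
doc4 = {2, 3, 4, 5}

print(jaccard_similarity(doc1, doc3))  # 0.75
print(jaccard_similarity(doc1, doc2))  # 0.0
```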

Min-hashing
• Simple hash function, constructed in the following way:
  • Use random permutation π to reorder the rows of the matrix
  • Must use same permutation for all columns C!
  • h(C) = minimum row number in which permuted column contains a 1
  h(C) = h_π(C) = min_{i : C(i) = 1} π(i)
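The definition of h_π can be sketched in a few lines. This is an illustration only: the helper names `random_permutation` and `minhash`, the list encoding of π (entry i is the new position of row i), and the example column values are assumptions.

```python
import random

def random_permutation(n_rows, seed=None):
    """Return pi as a list; pi[i] is the new position (1..n_rows) of row i."""
    rng = random.Random(seed)
    pi = list(range(1, n_rows + 1))
    rng.shuffle(pi)
    return pi

def minhash(column, pi):
    """h_pi(C): smallest pi(i) over all rows i where the 0/1 column has a 1.

    Raises ValueError for an all-zero column, where the minimum is undefined.
    """
    return min(pi[i] for i, bit in enumerate(column) if bit == 1)

pi = [4, 2, 1, 3, 6, 7, 5]      # one fixed permutation, shared by all columns
c1 = [1, 1, 0, 0, 0, 1, 1]
c2 = [0, 0, 1, 1, 1, 0, 0]
print(minhash(c1, pi), minhash(c2, pi))  # 2 1
```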

Min-hashing property
• Want that similar documents (columns) have same value of hash function (with high probability)
• Turns out it holds that
  Pr[h(C_1) = h(C_2)] = Sim(C_1, C_2)
• Need to control false positives and misses.
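For a small matrix the property can be checked exactly by enumerating every permutation instead of sampling. The sketch below does this for two example columns over 7 rows; the column contents and names are assumptions chosen for illustration.

```python
from itertools import permutations

def minhash(rows_with_one, pi):
    """Min-hash of a column given as the set of (0-based) rows holding a 1."""
    return min(pi[i] for i in rows_with_one)

c1 = {0, 1, 5, 6}
c3 = {0, 5, 6}
n_rows = 7

agree = total = 0
for pi in permutations(range(n_rows)):   # all 7! = 5040 permutations
    total += 1
    agree += minhash(c1, pi) == minhash(c3, pi)

collision_probability = agree / total
jaccard = len(c1 & c3) / len(c1 | c3)
print(collision_probability, jaccard)  # 0.75 0.75
```

The two numbers agree because the row of the union that comes first under π is equally likely to be any row of the union, and the hashes match exactly when that row lies in the intersection.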

Min-hash signatures
Input matrix with three row permutations (π_1, π_2, π_3 label the permutation columns in slide order):

π_1  π_2  π_3  |  documents: 1  2  3  4
 1    4    3   |             1  0  1  0
 3    2    4   |             1  0  0  1
 7    1    7   |             0  1  0  1
 6    3    6   |             0  1  0  1
 2    6    1   |             0  1  0  1
 5    7    2   |             1  0  1  0
 4    5    5   |             1  0  1  0

Signature matrix M:

documents:   1  2  3  4
             2  1  2  1   (from π_3)
             2  1  4  1   (from π_2)
             1  2  1  2   (from π_1)

Similarities:

          1-3    2-4    1-2   3-4
Col/Col   0.75   0.75   0     0
Sig/Sig   0.67   1.00   0     0
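The signature matrix on this slide can be recomputed from the input matrix and the three permutations. In this sketch the permutations are listed in the order that reproduces the slide's signature rows; the function names are assumptions.

```python
M = [
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [0, 1, 0, 1],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [1, 0, 1, 0],
]
# perms[k][i] is the position that permutation k assigns to row i.
perms = [
    [3, 4, 7, 6, 1, 2, 5],
    [4, 2, 1, 3, 6, 7, 5],
    [1, 3, 7, 6, 2, 5, 4],
]

def signature_matrix(matrix, perms_list):
    """One signature row per permutation: the min-hash of every column."""
    n_cols = len(matrix[0])
    return [
        [min(pi[i] for i, row in enumerate(matrix) if row[c] == 1)
         for c in range(n_cols)]
        for pi in perms_list
    ]

def signature_similarity(sig, c1, c2):
    """Fraction of signature rows in which columns c1 and c2 agree."""
    return sum(row[c1] == row[c2] for row in sig) / len(sig)

sig = signature_matrix(M, perms)
print(sig)                                        # [[2, 1, 2, 1], [2, 1, 4, 1], [1, 2, 1, 2]]
print(round(signature_similarity(sig, 0, 2), 2))  # 0.67  (documents 1-3)
print(signature_similarity(sig, 1, 3))            # 1.0   (documents 2-4)
```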

Hashing bands of M
[Figure: signature matrix M divided into b bands of r rows each; each band is hashed into buckets]
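A rough sketch of the banding step: split the signature matrix into b bands of r rows, and treat two columns as a candidate pair if they fall into the same bucket in at least one band. As a simplifying assumption, the tuple of r values is used directly as the bucket key rather than being hashed into a fixed number of buckets; the function name is hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(sig, b, r):
    """Return column pairs that share a bucket in at least one band.

    sig is a list of signature rows (one value per column); len(sig) must be b * r.
    """
    if len(sig) != b * r:
        raise ValueError("signature matrix must have exactly b * r rows")
    n_cols = len(sig[0])
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for c in range(n_cols):
            # Bucket key: the r signature values of column c inside this band.
            key = tuple(sig[band * r + k][c] for k in range(r))
            buckets[key].append(c)
        for cols in buckets.values():
            candidates.update(combinations(cols, 2))
    return candidates

sig = [[2, 1, 2, 1], [2, 1, 4, 1], [1, 2, 1, 2]]
print(sorted(lsh_candidate_pairs(sig, b=3, r=1)))  # [(0, 2), (1, 3)]
print(sorted(lsh_candidate_pairs(sig, b=1, r=3)))  # [(1, 3)]
```

With three bands of one row, both similar pairs become candidates; with a single band of three rows, columns 0 and 2 are missed because they differ in one signature row.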

One hash function
[Figure: P(hash hit) as a function of similarity for r=1, b=1]

100 hash functions
[Figure: P(hash hit) as a function of similarity for r=10, b=10]

100 hash functions
[Figure: curves over similarity for (r=1, b=100), (r=2, b=50), (r=5, b=20), (r=10, b=10), (r=20, b=5)]

1000 hash functions
[Figure: curves over similarity for (r=1, b=1000), (r=2, b=500), (r=5, b=200), (r=10, b=100), (r=20, b=50), (r=50, b=20)]

10000 hash functions
[Figure: curves over similarity for (r=1, b=10000), (r=2, b=5000), (r=5, b=2000), (r=10, b=1000), (r=20, b=500), (r=50, b=200)]
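The formula behind these curves is not written on the slides. Under the usual assumption that each of the r·b min-hash values agrees independently with probability s (the similarity), a pair is a hit in one band with probability s^r and in at least one of b bands with probability 1 - (1 - s^r)^b. The function name below is an assumption.

```python
def hit_probability(s, r, b):
    """P(hash hit) for similarity s with b bands of r rows: 1 - (1 - s^r)^b."""
    return 1.0 - (1.0 - s ** r) ** b

# Compare a single hash function with 100 hash functions arranged as 10 x 10.
for s in (0.2, 0.5, 0.8):
    single = hit_probability(s, r=1, b=1)
    banded = hit_probability(s, r=10, b=10)
    print(f"s={s}: r=1,b=1 -> {single:.4f}   r=10,b=10 -> {banded:.4f}")
```

For r=1, b=1 the curve is the diagonal; larger r and b push low similarities toward 0 and high similarities toward 1, which gives the S-shape.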

LSH more generally
• So far we have considered
  • Min-hashing for computing compact document signatures representing Jaccard similarity
  • Locality Sensitive Hashing (LSH) for decreasing false negatives and false positives
  • Lets us do duplicate detection without requiring pairwise comparisons!
• Can we generalize what we learned?
  • Other data types (e.g., real vectors → images)
  • Other distance functions (Euclidean? Cosine?)

Key insight behind LSH
• LSH lets us boost the gap between similar pairs (Sim(C_1, C_2) > s) and non-similar pairs (Sim(C_1, C_2) < s' for s' < s)
[Figure: P(hash hit) as a function of similarity, two panels: r=1, b=1 and r=10, b=10]

LSH more generally
• Consider a metric space (S, d), and a family F of hash functions h: S → B
• F is called (d_1, d_2, p_1, p_2)-sensitive if
  ∀ x, y ∈ S: d(x, y) ≤ d_1 ⇒ Pr[h(x) = h(y)] ≥ p_1
  ∀ x, y ∈ S: d(x, y) ≥ d_2 ⇒ Pr[h(x) = h(y)] ≤ p_2

Example
[Figure: P(hit) as a function of distance d]

Example: Jaccard distance
Recall, we want:
  ∀ x, y ∈ S: d(x, y) ≤ d_1 ⇒ Pr[h(x) = h(y)] ≥ p_1
  ∀ x, y ∈ S: d(x, y) ≥ d_2 ⇒ Pr[h(x) = h(y)] ≤ p_2
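The slide stops at the two conditions. A short worked step, assuming the Jaccard distance is defined as d(x, y) = 1 - Sim(x, y) and using the min-hashing property Pr[h(x) = h(y)] = Sim(x, y) from earlier in the lecture:

```latex
\begin{align*}
d(x, y) &= 1 - \mathrm{Sim}(x, y), \qquad
\Pr[h(x) = h(y)] = \mathrm{Sim}(x, y) = 1 - d(x, y) \\
d(x, y) \le d_1 &\;\Rightarrow\; \Pr[h(x) = h(y)] \ge 1 - d_1 \\
d(x, y) \ge d_2 &\;\Rightarrow\; \Pr[h(x) = h(y)] \le 1 - d_2
\end{align*}
```

Under these assumptions the min-hash family is (d_1, d_2, 1 - d_1, 1 - d_2)-sensitive for any d_1 < d_2.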

Boosting an LS hash family
• Can we reduce false positives and false negatives (create “S-curve effect”) for arbitrary LS hash functions?
  • Can apply same partitioning technique!
  • AND/OR construction

r-way AND of hash function
• Goal: Decrease false positives
• Convert hash family F to new family F'
• Each member of F' consists of a “vector” of r hash functions from F
• For h = [h_1, …, h_r] in F', h(x) = h(y) ⇔ h_i(x) = h_i(y) for all i.
• Theorem: Suppose F is (d_1, d_2, p_1, p_2)-sensitive.
  Then F' is ( ___ , ___ , ___ , ___ )-sensitive

b-way OR of hash function
• Goal: Decrease false negatives
• Convert hash family F to new family F'
• Each member of F' consists of a “vector” of b hash functions from F
• For h = [h_1, …, h_b] in F', h(x) = h(y) ⇔ h_i(x) = h_i(y) for some i.
• Theorem: Suppose F is (d_1, d_2, p_1, p_2)-sensitive.
  Then F' is ( ___ , ___ , ___ , ___ )-sensitive

Composing AND and OR
• Suppose we start with a (d_1, d_2, p_1, p_2)-sensitive F
• First apply r-way AND, then b-way OR
  • This results in (d_1, d_2, ___ , ___ )-sensitive F'
• Can also reverse order of AND and OR
  • This results in (d_1, d_2, ___ , ___ )-sensitive F'
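The slides leave the resulting parameters blank. Assuming the member functions are drawn independently from F, the standard bounds are: an r-way AND maps a collision probability p to p^r, a b-way OR maps p to 1 - (1 - p)^b, and d_1, d_2 stay unchanged. The sketch below applies these maps to p_1 and p_2; all function names are assumptions.

```python
def and_construction(p, r):
    """r-way AND: all r independent hash functions must agree."""
    return p ** r

def or_construction(p, b):
    """b-way OR: at least one of b independent hash functions must agree."""
    return 1.0 - (1.0 - p) ** b

def and_then_or(p, r, b):
    """r-way AND followed by b-way OR: 1 - (1 - p^r)^b."""
    return or_construction(and_construction(p, r), b)

def or_then_and(p, b, r):
    """b-way OR followed by r-way AND: (1 - (1 - p)^b)^r."""
    return and_construction(or_construction(p, b), r)

p1, p2 = 0.8, 0.2
print(and_then_or(p1, r=4, b=4), and_then_or(p2, r=4, b=4))
print(or_then_and(p1, b=4, r=4), or_then_and(p2, b=4, r=4))
```

In both orders the new p_1 moves toward 1 and the new p_2 toward 0 for these inputs; which order is preferable depends on whether false positives or false negatives matter more.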

Example
[Figure: curves for the OR-AND and AND-OR constructions, both axes from 0 to 1]

Cascading constructions
• Can also combine all previous constructions
• For example, first apply (4,4) OR-AND construction followed by a (4,4) AND-OR construction.
• Transforms a (.2, .8, .8, .2)-sensitive family into a (.2, .8, .9999996, .0008715)-sensitive family!
• How many hash functions are used?
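The numbers on this slide can be reproduced with the standard AND and OR maps (p^r and 1 - (1 - p)^b), assuming independently drawn hash functions. The helper names are assumptions; the count at the end is one way to answer the slide's closing question under the same assumption.

```python
def and_construction(p, r):
    """r-way AND: all r independent hash functions must agree."""
    return p ** r

def or_construction(p, b):
    """b-way OR: at least one of b independent hash functions must agree."""
    return 1.0 - (1.0 - p) ** b

def cascade(p):
    """(4,4) OR-AND construction followed by a (4,4) AND-OR construction."""
    p = and_construction(or_construction(p, 4), 4)     # 4-way OR, then 4-way AND
    return or_construction(and_construction(p, 4), 4)  # 4-way AND, then 4-way OR

print(f"{cascade(0.8):.7f}")  # 0.9999996
print(f"{cascade(0.2):.7f}")  # 0.0008715

# Each of the four stages multiplies the number of base hash functions by 4.
n_hash_functions = 4 * 4 * 4 * 4
print(n_hash_functions)  # 256
```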

Other examples of LS families
• So far: Jaccard distance has an LS hash family
• Several other distance functions do too
  • Cosine distance
  • Euclidean distance

LSH for Cosine Distance
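This slide carries only its title in the extracted text. One commonly used family for cosine distance is random hyperplane hashing, sketched here as an assumption rather than as the lecture's own construction: draw a random Gaussian vector w and hash x to the sign of w·x. For two vectors at angle θ, the hashes agree with probability 1 - θ/π. All names below are hypothetical.

```python
import math
import random

def random_hyperplane_hash(dim, rng):
    """Return h(x) = 1 if w.x >= 0 else 0 for a random Gaussian direction w."""
    w = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    def h(x):
        return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0
    return h

def collision_rate(x, y, n_hashes=20000, seed=0):
    """Monte Carlo estimate of Pr[h(x) = h(y)] over random hyperplanes."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_hashes):
        h = random_hyperplane_hash(len(x), rng)
        hits += h(x) == h(y)
    return hits / n_hashes

x, y = [1.0, 0.0], [1.0, 1.0]          # angle of 45 degrees (pi / 4)
expected = 1.0 - (math.pi / 4) / math.pi  # 1 - theta / pi = 0.75
print(collision_rate(x, y), expected)
```

The estimate should land close to 0.75 for this pair, and the AND/OR constructions from the previous slides apply to this family in the same way as to min-hashing.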
