Data Mining Learning from Large Data Sets Lecture 2 - PowerPoint PPT Presentation

Data Mining: Learning from Large Data Sets
Lecture 2: Nearest neighbor search
263-5200-00L
Andreas Krause

Announcement
• Homework 1 out by tomorrow

Topics
• Approximate retrieval
  • Given a query, find the “most similar” item in a large data set
  • Applications: Google Goggles, Shazam, …
• Supervised learning (classification, regression)
  • Learn a concept (function mapping queries to labels)
  • Applications: spam filtering, predicting price changes, …
• Unsupervised learning (clustering, dimension reduction)
  • Identify clusters, “common patterns”; anomaly detection
  • Applications: recommender systems, fraud detection, …
• Interactive data mining
  • Learning through experimentation / from limited feedback
  • Applications: online advertising, opt. UI, learning rankings, …

Today: Fast nearest neighbor search in high dimensions

Multimedia retrieval
Figure: examples from shazam.com and Google.com

Image completion [Hays and Efros, SIGGRAPH 2007]

Nearest-neighbor search

Properties of distance functions (metrics)
A function $d : S \times S \to \mathbb{R}$ is called a distance function (metric) if it is
• Nonnegative: $\forall s, t \in S : d(s, t) \geq 0$
• Discerning: $d(s, t) = 0 \Rightarrow s = t$
• Symmetric: $\forall s, t : d(s, t) = d(t, s)$
• Triangle inequality: $\forall s, t, r : d(s, t) + d(t, r) \geq d(s, r)$

Representing objects as vectors
Figure: example objects and vectors: [.3 .01 .1 2.3 0 0 1.1 …]; “The quick brown fox jumps over the lazy dog …”; [0 1 0 0 0 1 1 0 1 0 0 0]
• Often, represent objects as vectors
  • Bag of words for documents
  • Feature vectors for images (SIFT, GIST, PHOG, etc.)
  • …
• Allows using the same distances / same algorithms for different object types

Examples: Distance of vectors in $\mathbb{R}^D$
• Euclidean distance
• Manhattan distance
• $\ell_p$ distances: $d_p(x, x') = \left( \sum_{i=1}^{D} |x_i - x'_i|^p \right)^{1/p}$
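A minimal sketch of the $\ell_p$ distance on plain Python lists, to make the formula concrete. The function name `lp_distance` is chosen here for illustration; p = 1 gives the Manhattan distance and p = 2 the Euclidean distance.

```python
def lp_distance(x, y, p):
    """l_p distance between two equal-length vectors, for p >= 1."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    # Sum |x_i - y_i|^p over all coordinates, then take the p-th root.
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


x = [0.0, 0.0]
y = [3.0, 4.0]
print(lp_distance(x, y, 1))  # Manhattan distance (p = 1)
print(lp_distance(x, y, 2))  # Euclidean distance (p = 2)
```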

Cosine distance
• Cosine distance: $d(x, x') = \arccos \frac{x^T x'}{\|x\|_2 \, \|x'\|_2}$
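The cosine distance can be sketched in a few lines. The name `cosine_distance` and the clamping step are choices made for this illustration; clamping only guards against floating-point rounding pushing the cosine slightly outside [-1, 1].

```python
import math


def cosine_distance(x, y):
    """Angle (in radians) between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine distance is undefined for the zero vector")
    # Clamp to [-1, 1] so rounding errors cannot break acos.
    cosine = max(-1.0, min(1.0, dot / (norm_x * norm_y)))
    return math.acos(cosine)


print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
print(cosine_distance([1.0, 2.0], [2.0, 4.0]))  # same direction
```

Note that the distance depends only on direction: scaling either vector by a positive constant leaves it unchanged.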

Edit distance
Edit distance: How many inserts and deletes are necessary to transform one string into another?
Example:
• d(“The quick brown fox”, “The quikc brwn fox”)
• d(“GATTACA”, “ATACAT”)
• Allows various extensions (mutations; reversal; …)
• Can compute in polynomial time, but expensive for large texts
⇒ We will focus on vector representation
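As an illustration of the polynomial-time computation, here is a sketch of the standard dynamic program for the insert/delete-only edit distance defined on this slide. The function name `indel_distance` is hypothetical; the running time is O(|s| · |t|), which is why it becomes expensive for large texts.

```python
def indel_distance(s, t):
    """Minimum number of single-character inserts and deletes turning s into t."""
    # prev[j] holds the distance between the current prefix of s and t[:j].
    prev = list(range(len(t) + 1))  # empty prefix of s: insert j characters
    for i in range(1, len(s) + 1):
        curr = [i] + [0] * len(t)  # empty prefix of t: delete i characters
        for j in range(1, len(t) + 1):
            if s[i - 1] == t[j - 1]:
                curr[j] = prev[j - 1]  # characters match, no edit needed
            else:
                # Either delete s[i-1] or insert t[j-1].
                curr[j] = 1 + min(prev[j], curr[j - 1])
        prev = curr
    return prev[len(t)]


print(indel_distance("The quick brown fox", "The quikc brwn fox"))  # 3
print(indel_distance("GATTACA", "ATACAT"))  # 3
```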

Many real-world problems are high-dimensional
• Text on the web
  • Billions of documents, millions of terms
  • In bag-of-words representation, each term is a dimension
• Scene completion, image classification, …
  • Large number of image features
• Scientific data
  • Large number of measurements
• Product recommendations and advertising
  • Millions of customers, millions of products
  • Traces of behavior (websites visited, searches, …)

Curse of dimensionality
• Suppose we would like to find neighbors of maximum distance at most 0.1 in $[0,1]^D$
• Suppose we have N data points sampled uniformly at random from $[0,1]^D$
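A back-of-the-envelope sketch of this setup, under two assumptions that are not stated on the slide: “maximum distance” is read as the $\ell_\infty$ (maximum-norm) distance, and the query lies far enough from the boundary that its neighborhood is a full cube of side 0.2. A uniform point then falls into the neighborhood with probability $0.2^D$, so the expected number of neighbors is $N \cdot 0.2^D$.

```python
def expected_neighbors(n_points, dim, radius=0.1):
    """Expected number of uniform points within l_inf distance `radius`
    of an interior query in [0,1]^dim (boundary effects ignored)."""
    side = 2 * radius  # the neighborhood is a cube with this side length
    return n_points * side ** dim


for dim in (1, 2, 5, 10, 20):
    print(dim, expected_neighbors(10**6, dim))
```

Under these assumptions, even a million points leave the neighborhood empty in expectation once D reaches about 10.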

Curse of dimensionality
• Theorem [Beyer et al. ’99]: Fix ε > 0 and N. Under fairly weak assumptions on the distribution of the data,
$\lim_{D \to \infty} P\left[ d_{\max}(N, D) \leq (1 + \varepsilon) \, d_{\min}(N, D) \right] = 1$
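The effect in the theorem can be observed in a small simulation. This sketch assumes uniform data in $[0,1]^D$, the Euclidean distance, and a uniformly random query; the function name `distance_contrast` and the seed are illustrative choices. It reports the ratio $d_{\max} / d_{\min}$, which should approach 1 as D grows.

```python
import math
import random


def distance_contrast(n_points, dim, seed=0):
    """Ratio d_max / d_min of Euclidean distances from a random query
    to n_points uniform samples in [0,1]^dim."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.sqrt(sum((q - p) ** 2 for q, p in zip(query, point))))
    return max(distances) / min(distances)


for dim in (2, 10, 100, 1000):
    print(dim, distance_contrast(200, dim))
```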

The Blessing of Large Data [Hays and Efros, SIGGRAPH 2007]

10 nearest neighbors from a collection of 20,000 images [Hays and Efros, SIGGRAPH 2007]

10 nearest neighbors from a collection of 2 million images [Hays and Efros, SIGGRAPH 2007]

Application: Find similar documents
• Find “near-duplicates” among a large collection of documents
  • Find clusters in a document collection (blog articles)
  • Spam detection
  • Detect plagiarism
  • …
• What does “near-duplicates” mean?

Near-duplicates
• Naïve approach:
  • Represent documents as “bag of words”
  • Apply nearest-neighbor search on resulting vectors
• Doesn’t work too well in this setting.

Shingling
• To keep track of word order, extract k-shingles (aka k-grams)
• Document represented as “bag of k-shingles”
• Example: a b c a b
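A short sketch of shingle extraction on the example sequence. The choice k = 2 and the decision to return a set (rather than a bag with counts) are assumptions made here; the function name `shingles` is illustrative.

```python
def shingles(tokens, k):
    """Set of all k-shingles (contiguous runs of k tokens) in a sequence."""
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


tokens = "a b c a b".split()
print(shingles(tokens, 2))  # contains ('a', 'b'), ('b', 'c'), ('c', 'a')
```

The shingle ('a', 'b') occurs twice in the sequence but appears once in the set.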

Shingling implementation
• How large should one choose k?
  • Long enough such that they don’t occur “by chance”
  • Short enough so that one expects “similar” documents to share some k-shingles
• Storing shingles
  • Want to save space by compressing
  • Often, simply hashing works well (e.g., hash 10-shingle to 4 bytes)
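One way to hash shingles to 4 bytes is sketched below. The slide does not name a hash function; CRC32 from Python's `zlib` module is used here only as a convenient 32-bit stand-in, and shingling at the character level with UTF-8 encoding is likewise an assumption of this example.

```python
import zlib


def hashed_shingles(text, k=10):
    """Set of 32-bit hashes of all character-level k-shingles of text."""
    hashes = set()
    for i in range(len(text) - k + 1):
        shingle = text[i:i + k]
        # zlib.crc32 returns an unsigned 32-bit integer, i.e. 4 bytes.
        hashes.add(zlib.crc32(shingle.encode("utf-8")))
    return hashes


doc = "The quick brown fox jumps over the lazy dog"
hashed = hashed_shingles(doc)
print(len(hashed), max(hashed) < 2**32)
```

Each 10-character shingle is stored as one 4-byte integer, at the cost of occasional hash collisions.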

Comparing shingled documents
• Documents are now represented as sets of shingles
• Want to compare two sets
• E.g.: A = {1,3,7}; B = {2,3,4,7}

Jaccard distance
• Jaccard similarity: $\mathrm{Sim}(A, B) = \frac{|A \cap B|}{|A \cup B|}$
• Jaccard distance: $d(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|}$
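A small sketch of both quantities on the sets A = {1,3,7} and B = {2,3,4,7} from the previous slide. Treating two empty sets as identical (similarity 1) is a convention chosen here, not something the slides specify.

```python
def jaccard_similarity(a, b):
    """|A intersect B| / |A union B| for two sets."""
    union = a | b
    if not union:
        return 1.0  # convention assumed here: two empty sets are identical
    return len(a & b) / len(union)


def jaccard_distance(a, b):
    return 1.0 - jaccard_similarity(a, b)


A = {1, 3, 7}
B = {2, 3, 4, 7}
print(jaccard_similarity(A, B))  # 2 shared elements out of 5 in total
print(jaccard_distance(A, B))
```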

Example

Near-duplicate detection
• Want to find documents that have similar sets of k-shingles
• Naïve approach:
  • For i = 1:N
    • For j = 1:N
      • Compute d(i,j)
      • If d(i,j) < ε then declare near-duplicate
• Infeasible even for moderately large N
• Can we do better?
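The naïve approach can be sketched as follows, assuming documents are given as sets of shingles and d is the Jaccard distance. The function names are illustrative. Unlike the loop on the slide, this version visits each unordered pair once (i < j), which halves the work but is still quadratic in N.

```python
def jaccard_distance(a, b):
    union = a | b
    if not union:
        return 0.0  # convention assumed here: two empty sets are identical
    return 1.0 - len(a & b) / len(union)


def naive_near_duplicates(shingle_sets, eps):
    """All index pairs (i, j), i < j, whose Jaccard distance is below eps."""
    pairs = []
    n = len(shingle_sets)
    for i in range(n):
        for j in range(i + 1, n):  # distance is symmetric, so skip j <= i
            if jaccard_distance(shingle_sets[i], shingle_sets[j]) < eps:
                pairs.append((i, j))
    return pairs


docs = [{1, 2, 3}, {1, 2, 3, 4}, {7, 8, 9}]
print(naive_near_duplicates(docs, eps=0.3))  # [(0, 1)]
```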

Warm-up
• Given a large collection of documents, determine whether there exist exact duplicates
  • Compute hash code / checksum (e.g., MD5) for all documents
  • Check whether the same checksum appears twice
  • Both can be easily parallelized
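A minimal single-machine sketch of the checksum idea using MD5 from Python's `hashlib`. The function name `find_exact_duplicates` and the UTF-8 encoding of documents are assumptions of this example; the parallelization mentioned on the slide is not shown.

```python
import hashlib
from collections import defaultdict


def find_exact_duplicates(documents):
    """Groups of document indices that share the same MD5 checksum."""
    by_checksum = defaultdict(list)
    for index, text in enumerate(documents):
        checksum = hashlib.md5(text.encode("utf-8")).hexdigest()
        by_checksum[checksum].append(index)
    # Keep only checksums that appear at least twice.
    return sorted(group for group in by_checksum.values() if len(group) > 1)


print(find_exact_duplicates(["spam", "ham", "spam"]))  # [[0, 2]]
```

Equal checksums indicate duplicates only up to hash collisions; comparing the matched documents directly removes that uncertainty.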

Locality sensitive hashing
• Idea: Create a hash function that maps “similar” items to the same bucket
Figure: hash table with buckets 0, 1, 2, 3
• Key problem: Is it possible to construct such hash functions?
  • Depends on the distance function
  • Possible for Jaccard distance!
  • Some other distance functions work as well

Shingle Matrix
Rows are shingles, columns are documents:
1 0 1 0
1 0 0 1
0 1 0 1
0 1 0 1
0 1 0 1
1 0 1 0
1 0 1 0

Min-hashing
• Simple hash function, constructed in the following way:
  • Use a random permutation π to reorder the rows of the matrix
    • Must use the same permutation for all columns C!
  • h(C) = minimum row number in which the permuted column contains a 1
$h(C) = h_\pi(C) = \min_{i : C(i) = 1} \pi(i)$
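A sketch of min-hashing on a small 0/1 shingle matrix (rows are shingles, columns are documents). The matrix values, the function names, and the 0-based row numbering are illustrative choices; `perm[i]` plays the role of π(i), the position to which row i is moved.

```python
import random


def minhash(column, perm):
    """min over rows i with column[i] == 1 of perm[i]."""
    positions = [perm[i] for i, bit in enumerate(column) if bit == 1]
    if not positions:
        raise ValueError("column contains no 1, min-hash is undefined")
    return min(positions)


def minhash_all_columns(matrix, perm):
    """Apply the SAME permutation to every column of a row-major matrix."""
    n_cols = len(matrix[0])
    columns = [[row[c] for row in matrix] for c in range(n_cols)]
    return [minhash(col, perm) for col in columns]


# Example matrix: 7 shingles (rows) by 4 documents (columns).
matrix = [
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [0, 1, 0, 1],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [1, 0, 1, 0],
]

perm = list(range(len(matrix)))
random.Random(0).shuffle(perm)  # one random permutation shared by all columns
print(perm, minhash_all_columns(matrix, perm))
```

Materializing a full permutation is fine for a toy matrix; for very large numbers of shingles it would need a cheaper substitute, which this sketch does not attempt.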
