Similarity Search Stony Brook University CSE545, Fall 2016


Finding Similar Items
● Applications
  ○ Document Similarity:
    ■ Mirrored web-pages
    ■ Plagiarism; Similar News
  ○ Recommendations:
    ■ Online purchases
    ■ Movie ratings
  ○ Entity Resolution
  ○ Fingerprint Matching

Finding Similar Items: What we will cover
● Set Similarity
  ○ Shingling
  ○ Minhashing
  ○ Locality-sensitive hashing
● Embeddings
● Distance Metrics
● High-Degree of Similarity

Document Similarity
Challenge: How to represent the document in a way that can be efficiently encoded and compared?

Shingles
Goal: Convert documents to sets
k-shingles (aka “character n-grams”): sequences of k characters
E.g. k = 2, doc = “abcdabd”
     shingles(doc, 2) = {ab, bc, cd, da, bd}
● Similar documents will have many common shingles
● Changing words or order has minimal effect.
● In practice use 5 < k < 10: large enough that any given shingle appearing in a document is highly unlikely (e.g. < 0.1% chance)
● Can hash large shingles to smaller values (e.g. 9-shingles into 4 bytes)
● Can also use words (aka n-grams).
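
The shingling step can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the function names `shingles` and `word_shingles` are chosen here for the example, and whitespace splitting is an assumed way to tokenize words.

```python
def shingles(doc: str, k: int) -> set:
    """Return the set of all k-character substrings (k-shingles) of doc."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A document shorter than k yields an empty set.
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}


def word_shingles(doc: str, n: int) -> set:
    """Word-level variant: the set of all runs of n consecutive words.

    Splitting on whitespace is an assumption made for this sketch.
    """
    words = doc.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


# The slide's example: the repeated "ab" collapses into one set element.
print(sorted(shingles("abcdabd", 2)))  # ['ab', 'bc', 'bd', 'cd', 'da']
```

Because the result is a set, the second occurrence of "ab" in the example adds nothing, which is why six character positions yield only five shingles.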

Shingles
Problem: Even with hashing, sets of shingles are large (e.g. 4 bytes per shingle => about 4x the size of the document).
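
The size problem can be seen with a rough sketch. The slides do not name a hash function; CRC-32 from Python's standard `zlib` module is used here only because it conveniently produces a 4-byte value. The names `hashed_shingles` and `estimated_bytes` and the sample sentence are assumptions for illustration.

```python
import zlib


def hashed_shingles(doc: str, k: int) -> set:
    """Map each k-shingle of doc to an unsigned 32-bit (4-byte) integer."""
    return {
        zlib.crc32(doc[i:i + k].encode("utf-8"))
        for i in range(len(doc) - k + 1)
    }


def estimated_bytes(doc: str, k: int) -> int:
    """Bytes needed to store the hashed shingle set at 4 bytes per shingle."""
    return 4 * len(hashed_shingles(doc, k))


doc = "the quick brown fox jumps over the lazy dog"
# Compare the raw document size with the size of its hashed 9-shingle set.
print(len(doc.encode("utf-8")), estimated_bytes(doc, 9))
```

A document of n characters has up to n - k + 1 distinct shingles, so at 4 bytes each the set approaches 4 times the document's size for long documents with few repeated shingles.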

Minhashing
Goal: Convert sets to shorter ids, signatures

Minhashing - Background
Jaccard Similarity: sim(S1, S2) = |S1 ∩ S2| / |S1 ∪ S2|
Characteristic Matrix: …. often very sparse! (lots of zeros)
(Leskovec et al., 2014; http://www.mmds.org/)

Minhashing - Background
Characteristic Matrix:

        S1  S2
  ab     1   1   **
  bc     0   1   *
  de     1   0   *
  ah     1   1   **
  ha     0   0
  ed     1   1   **
  ca     0   1   *

  (** = row is in both sets, * = row is in exactly one set)

Jaccard Similarity: sim(S1, S2) = 3 / 6 (# both have / # at least one has)

How many different rows are possible?
  1, 1 -- type a
  1, 0 -- type b
  0, 1 -- type c
  0, 0 -- type d
sim(S1, S2) = a / (a + b + c)
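
The two ways of computing Jaccard similarity on this slide (set intersection over union, and row-type counts) can be checked against each other with a short sketch. The function names are chosen for this example, and returning 0.0 for two empty sets is a convention assumed here, not something the slides specify.

```python
def jaccard(s1: set, s2: set) -> float:
    """Jaccard similarity: |intersection| / |union|."""
    union = s1 | s2
    if not union:
        return 0.0  # assumed convention for two empty sets
    return len(s1 & s2) / len(union)


def row_type_counts(rows, s1, s2):
    """Count rows of type a (1,1), b (1,0), c (0,1), and d (0,0)."""
    counts = {"a": 0, "b": 0, "c": 0, "d": 0}
    for row in rows:
        in1, in2 = row in s1, row in s2
        if in1 and in2:
            counts["a"] += 1
        elif in1:
            counts["b"] += 1
        elif in2:
            counts["c"] += 1
        else:
            counts["d"] += 1
    return counts


# Columns S1 and S2 of the characteristic matrix above.
rows = ["ab", "bc", "de", "ah", "ha", "ed", "ca"]
S1 = {"ab", "de", "ah", "ed"}
S2 = {"ab", "bc", "ah", "ed", "ca"}

counts = row_type_counts(rows, S1, S2)
print(jaccard(S1, S2))  # 0.5
print(counts["a"] / (counts["a"] + counts["b"] + counts["c"]))  # 0.5
```

Type d rows (absent from both sets) never enter the formula, which is why a sparse matrix with many all-zero rows does not affect the similarity.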


Minhashing
Minhash function: h
● Based on a permutation of the rows in the characteristic matrix, h maps each set to the first row (in permuted order) where the set has a 1.

Characteristic Matrix (with the permuted position of each row):

  perm  row   S1  S2  S3  S4
    3   ab     1   0   1   0
    4   bc     1   0   0   1
    7   de     0   1   0   1
    6   ah     0   1   0   1
    1   ha     0   1   0   1
    2   ed     1   0   1   0
    5   ca     1   0   1   0

Permuted order: 1 ha, 2 ed, 3 ab, 4 bc, 5 ca, 6 ah, 7 de

h(S1) = ed  #permuted row 2
h(S2) = ha  #permuted row 1
h(S3) = ed  #permuted row 2
h(S4) = ha  #permuted row 1
(Leskovec et al., 2014; http://www.mmds.org/)
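
A single minhash under one permutation can be sketched as a scan down the permuted order. The data below is copied from the slide; the function name `minhash` and the choice to return `None` for an empty set are assumptions made for this example.

```python
# Columns of the characteristic matrix: each set lists the rows where it has a 1.
SETS = {
    "S1": {"ab", "bc", "ed", "ca"},
    "S2": {"de", "ah", "ha"},
    "S3": {"ab", "ed", "ca"},
    "S4": {"bc", "de", "ah", "ha"},
}

# Rows listed in permuted order (position 1 first).
PERMUTED_ORDER = ["ha", "ed", "ab", "bc", "ca", "ah", "de"]


def minhash(members: set, permuted_order: list):
    """Return (row, position) of the first permuted row contained in members."""
    for position, row in enumerate(permuted_order, start=1):
        if row in members:
            return row, position
    return None  # assumed behaviour for an empty set: no row has a 1


for name, members in SETS.items():
    print(name, minhash(members, PERMUTED_ORDER))
# S1 ('ed', 2)
# S2 ('ha', 1)
# S3 ('ed', 2)
# S4 ('ha', 1)
```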

Minhashing
Minhash function: h
● Based on a permutation of the rows in the characteristic matrix, h maps sets to rows.

Characteristic Matrix (with row positions under two permutations):

  perm 1  perm 2  row   S1  S2  S3  S4
     3       4    ab     1   0   1   0
     4       2    bc     1   0   0   1
     7       1    de     0   1   0   1
     6       3    ah     0   1   0   1
     1       6    ha     0   1   0   1
     2       7    ed     1   0   1   0
     5       5    ca     1   0   1   0

Signature matrix: M
● Record the first row where each set has a 1 in the given permutation

        S1  S2  S3  S4
  h1     2   1   2   1
  h2     2   1   4   1
(Leskovec et al., 2014; http://www.mmds.org/)
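
The signature matrix can be reproduced with a short sketch: for each permutation, take the smallest permuted position among the rows a set contains. The names `signature_matrix` and `signature_agreement` are chosen for this example. The agreement function reflects the standard use of minhash signatures described in the cited Leskovec et al. text (the fraction of agreeing rows estimates Jaccard similarity); that step is not shown on these slides, and with only two permutations the estimate is very coarse.

```python
# Columns of the characteristic matrix: rows where each set has a 1.
COLUMNS = {
    "S1": {"ab", "bc", "ed", "ca"},
    "S2": {"de", "ah", "ha"},
    "S3": {"ab", "ed", "ca"},
    "S4": {"bc", "de", "ah", "ha"},
}

# Position of each row under the two permutations from the slide.
PERMUTATIONS = [
    {"ab": 3, "bc": 4, "de": 7, "ah": 6, "ha": 1, "ed": 2, "ca": 5},
    {"ab": 4, "bc": 2, "de": 1, "ah": 3, "ha": 6, "ed": 7, "ca": 5},
]


def signature_matrix(columns: dict, permutations: list) -> list:
    """One row per permutation, one entry per set: the minimum permuted position."""
    return [
        [min(perm[row] for row in members) for members in columns.values()]
        for perm in permutations
    ]


def signature_agreement(signature: list, i: int, j: int) -> float:
    """Fraction of signature rows in which columns i and j hold the same value."""
    return sum(1 for row in signature if row[i] == row[j]) / len(signature)


M = signature_matrix(COLUMNS, PERMUTATIONS)
print(M)  # [[2, 1, 2, 1], [2, 1, 4, 1]]
print(signature_agreement(M, 0, 2))  # S1 vs S3: 0.5
```

Each set is now represented by two small integers instead of its full column, which is the compression the minhashing goal asks for.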

