
CS535 Big Data 4/22/2020 Week 13-A Sangmi Lee Pallickara http://www.cs.colostate.edu/~cs535 Spring 2020 Colorado State University, page 1

CS535 BIG DATA

PART B. GEAR SESSIONS

SESSION 5: ALGORITHMIC TECHNIQUES FOR BIG DATA

Sangmi Lee Pallickara Computer Science, Colorado State University http://www.cs.colostate.edu/~cs535

CS435 Introduction to Big Data - Spring 2016

FAQs

  • Your disk quota is 20GB (per student)
  • If you need more space, please let me know ASAP


Topics of Today's Class

  • Part 1: Counting Triangles (from the last lecture)
  • Part 2: Locality Sensitive Hashing


GEAR Session 4. Large Scale Recommendation Systems and Social Media

Lecture 4. Social Network Analysis

Counting Triangles


Why Count Triangles?: Probability for Random Graphs

  • If we start with n nodes and add m edges to a graph at random, there will be an expected number of triangles in the graph
  • There are $\binom{n}{3}$ sets of three nodes
  • Approximately $n^3/6$ sets of three nodes that might be a triangle
  • The probability of an edge between any two given nodes being added
  • $m / \binom{n}{2}$
  • approximately $2m/n^2$
  • The probability that any set of three nodes has edges between each pair
  • if those edges are independently chosen to be present or absent
  • Approximately $(2m/n^2)^3 = 8m^3/n^6$
  • Thus, the expected number of triangles in a graph of n nodes and m randomly selected edges
  • Approximately $(8m^3/n^6)(n^3/6) = \frac{4}{3}(m/n)^3$
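As a quick numerical check of this estimate: multiplying $n^3/6$ by $8m^3/n^6$ gives $(8/6)(m/n)^3 = (4/3)(m/n)^3$. The sketch below compares that approximation with the exact product $\binom{n}{3}\,(m/\binom{n}{2})^3$. The function names and the graph size are illustrative assumptions, not values from the lecture.

```python
from math import comb

def expected_triangles_exact(n: int, m: int) -> float:
    """C(n, 3) candidate triples, each complete with probability (m / C(n, 2))^3."""
    p_edge = m / comb(n, 2)
    return comb(n, 3) * p_edge ** 3

def expected_triangles_approx(n: int, m: int) -> float:
    """Large-n approximation: (4/3) * (m/n)^3."""
    return (4 / 3) * (m / n) ** 3

# Hypothetical graph size, chosen only for illustration.
n, m = 1_000_000, 5_000_000
print(expected_triangles_exact(n, m))
print(expected_triangles_approx(n, m))
```

For a graph this sparse the two values agree closely, and both are tiny compared with the triangle counts usually seen in social graphs of similar size.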


Why Count Triangles?: How about Social Network Graph?

  • If a graph is a social network graph,
  • n nodes (n users)
  • m edges (with m pairs of friends)
  • Do we expect the number of triangles to be,
  • a. The same as the value for a random graph
  • b. Much greater than the value for a random graph
  • c. Much smaller than the value for a random graph
  • Answer: b. Why? If A and B are friends, and A is also a friend of C, there should be a much greater chance than average that B and C are also friends
  • Counting the number of triangles helps us to measure the extent to which a graph looks like a social network


What Else with the Counting Triangles?

  • Counting the number of triangles helps us to measure the extent to which a graph looks like a social network

  • It also shows some characteristics of social networks
  • E.g. the age of a community is related to the density of triangles


An Algorithm for Finding Triangles

  • Suppose we have a graph of n nodes and m (≥ n) edges. For convenience, assume the nodes are integers 1, 2, . . . , n
  • Heavy hitter
  • A node whose degree is at least $\sqrt{m}$
  • Heavy hitter triangle
  • A triangle all three of whose nodes are heavy hitters
  • Note that the number of heavy hitter nodes is no more than $2\sqrt{m}$
  • Otherwise, the sum of the degrees of the heavy hitter nodes would be more than 2m
  • Each edge contributes to the degree of only two nodes


  • 1. Preparing the Data Structures
  • Step 1. Compute the degree of each node
  • Examine each edge and add 1 to the count of each of its two nodes
  • The total time required is O(m)
  • Step 2. Create an index on edges, with the pair of nodes at its ends as the key
  • Given two nodes, the index tells us whether the edge between them exists
  • A hash table suffices
  • It can be built in O(m) time
  • Expected time to answer a query about the existence of an edge is a constant
  • Step 3. Create another index of edges, this one with key equal to a single node
  • Given a node v, we can retrieve the nodes adjacent to v in time proportional to the number of those nodes
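The three preparation steps can be sketched with ordinary Python containers. This is a minimal illustration under assumptions: the names `build_indexes` and `has_edge` are hypothetical, edges are given as pairs of integers, and each edge is stored as a `(smaller, larger)` pair so that (u, v) and (v, u) share one key.

```python
from collections import defaultdict

def build_indexes(edges):
    """Build the degree table, the edge index, and the per-node index in one pass."""
    degree = defaultdict(int)       # Step 1: node -> degree
    edge_index = set()              # Step 2: set of (smaller, larger) node pairs
    adjacency = defaultdict(list)   # Step 3: node -> list of adjacent nodes
    for u, v in edges:
        key = (min(u, v), max(u, v))
        if u == v or key in edge_index:
            continue                # ignore self-loops and repeated edges
        edge_index.add(key)
        degree[u] += 1
        degree[v] += 1
        adjacency[u].append(v)
        adjacency[v].append(u)
    return degree, edge_index, adjacency

def has_edge(edge_index, u, v):
    """Expected constant-time edge existence query."""
    return (min(u, v), max(u, v)) in edge_index

degree, edge_index, adjacency = build_indexes([(1, 2), (2, 3), (1, 3), (3, 4)])
print(degree[3], has_edge(edge_index, 3, 1), sorted(adjacency[3]))
```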


  • 2. Sorting Nodes
  • Sorting nodes
  • First criterion: by degree
  • Second criterion: if v and u have the same degree, recall that both v and u are integers, so order them numerically
  • Therefore, we say v ≺ u if and only if either
  • 1) The degree of v is less than the degree of u, or
  • 2) The degrees of u and v are the same, and v < u


  • 3. Counting Triangles [1/2]
  • Heavy-Hitter Triangles
  • There are only $O(\sqrt{m})$ heavy-hitter nodes
  • There are $O(m^{3/2})$ possible heavy-hitter triangles, and using the index on edges we can check if all three edges exist in O(1) time. Therefore, $O(m^{3/2})$ time is needed to find all the heavy-hitter triangles
  • Other Triangles
  • Consider each edge (v1,v2)
  • If both v1 and v2 are heavy hitters, ignore this edge
  • Suppose that v1 is not a heavy hitter and moreover v1 ≺ v2
  • Let u1, u2, . . . , uk be the nodes adjacent to v1
  • Note that $k < \sqrt{m}$
  • We can find these nodes, using the index on nodes, in O(k) time, which is surely $O(\sqrt{m})$ time


  • 3. Counting Triangles [2/2]
  • Other Triangles (continued)
  • For each ui we can use the first index to check whether edge (ui,v2) exists in O(1) time
  • We can also determine the degree of ui in O(1) time, because we have counted all the nodes' degrees
  • We count the triangle {v1,v2,ui} if and only if the edge (ui,v2) exists, and v1 ≺ ui
  • A triangle is counted only once
  • v1 is the node of the triangle that precedes both other nodes of the triangle according to the ≺ ordering
  • Time to process all the nodes adjacent to v1 is $O(\sqrt{m})$
  • Since there are m edges, the total time spent counting other triangles is $O(m^{3/2})$
  • The time to find heavy hitter triangles is $O(m^{3/2})$ and so is the time to find the other triangles
  • Thus, the total time of the algorithm is $O(m^{3/2})$
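The whole procedure can be sketched as below. It is an illustration under assumptions rather than the lecture's implementation: `count_triangles` and its helpers are hypothetical names, and the graph is assumed simple with integer node ids. One deliberate deviation: read literally, the test v1 ≺ ui lets both edges that leave v1 report the same triangle, so the sketch also requires v2 ≺ ui, which makes each triangle count exactly once (from the edge joining its two lowest-ordered nodes).

```python
import math
from collections import defaultdict
from itertools import combinations

def count_triangles(edges):
    """Count triangles in a simple undirected graph given as integer pairs."""
    edge_index = {(min(u, v), max(u, v)) for u, v in edges if u != v}
    adjacency = defaultdict(list)
    for u, v in edge_index:
        adjacency[u].append(v)
        adjacency[v].append(u)
    degree = {v: len(nbrs) for v, nbrs in adjacency.items()}
    m = len(edge_index)
    heavy = {v for v, d in degree.items() if d >= math.sqrt(m)}

    def has_edge(u, v):
        return (min(u, v), max(u, v)) in edge_index

    def precedes(v, u):
        # v ≺ u: lower degree first, ties broken by node id
        return (degree[v], v) < (degree[u], u)

    count = 0
    # Heavy-hitter triangles: test every triple of heavy hitters.
    for a, b, c in combinations(sorted(heavy), 3):
        if has_edge(a, b) and has_edge(b, c) and has_edge(a, c):
            count += 1
    # Other triangles: scan the neighbors of the lower-ordered endpoint.
    for v1, v2 in edge_index:
        if v1 in heavy and v2 in heavy:
            continue
        if precedes(v2, v1):
            v1, v2 = v2, v1  # now v1 ≺ v2, so v1 cannot be a heavy hitter
        for u in adjacency[v1]:
            # Assumption of this sketch: require v2 ≺ u to avoid double counting.
            if u != v2 and precedes(v2, u) and has_edge(u, v2):
                count += 1
    return count

k4 = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
print(count_triangles(k4))
```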


GEAR Session 5. Algorithmic Techniques for Big Data

Lecture 1. Locality Sensitive Hashing

Introduction


Traditional hash functions

  • Cryptographic hash function (e.g. SHA-1)
  • Should be difficult to reverse
  • Designed to map data to an integer that can be used to look up a particular bucket within a hash table (e.g. hashtables)

  • Key properties for the non-cryptographic hash functions
  • Efficiently computable
  • Should uniformly distribute the keys
  • Two inputs will result in hash outputs that are either different or the same based on key properties of the inputs.


Locality-sensitive hash functions

  • Hash value collisions are more likely for two input values that are "close" together than for inputs that are far apart
  • Many different definitions regarding the “closeness”
  • Neighboring
  • Similarity


Finding the most similar documents

  • Measuring similarity between pairs of documents
  • Extremely expensive
  • Example
  • 1M documents, signatures of length 250 (4 bytes each)
  • 1M x 1,000 bytes = 1GB
  • Number of comparisons = $\binom{10^6}{2}$ (about half a trillion pairs)
  • 1 µs per calculation of similarity
  • Almost 6 days to complete computing
  • Do we need to calculate the similarity for all of the pairs?
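The arithmetic behind this example can be reproduced directly. The six-day figure corresponds to roughly one microsecond per comparison; at one millisecond per comparison the same workload would take about a thousand times longer. The variable and function names below are illustrative assumptions.

```python
from math import comb

n_docs = 1_000_000
signature_bytes = 250 * 4                        # 250 integers of 4 bytes each
total_gb = n_docs * signature_bytes / 1e9        # storage for all signatures
pairs = comb(n_docs, 2)                          # number of document pairs

def days_needed(seconds_per_comparison: float) -> float:
    """Time to compare every pair once, expressed in days."""
    return pairs * seconds_per_comparison / 86_400

print(total_gb, pairs)
print(days_needed(1e-6))   # one microsecond per comparison
print(days_needed(1e-3))   # one millisecond per comparison
```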


Distance measure

  • A distance measure over a space is a function d(x,y) that takes two points in the space as arguments and produces a real number that satisfies the following axioms:

1. d(x,y) ≥ 0 (no negative distance)
2. d(x,y) = 0 if and only if x = y
3. d(x,y) = d(y,x) (distance is symmetric)
4. d(x,y) ≤ d(x,z)+d(z,y) (the triangle inequality)


Distance measures

  • Euclidean distance
  • $d([x_1, x_2, \ldots, x_n], [y_1, y_2, \ldots, y_n]) = \sqrt{\sum_{i=1}^{n} |x_i - y_i|^2}$
  • Jaccard distance
  • d(x,y) = 1-SIM(x,y)
  • Cosine distance
  • Angle between the vectors
  • Hamming distance
  • The number of components in which they differ
  • E.g. what is the Hamming distance between 10111 and 11110?
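The four measures can be sketched in a few lines each. The function names are illustrative assumptions; the cosine distance is returned as an angle in degrees and assumes non-zero vectors.

```python
import math

def euclidean(x, y):
    """Square root of the summed squared component differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def jaccard_distance(s, t):
    """1 minus |intersection| / |union|; two empty sets are treated as identical."""
    s, t = set(s), set(t)
    if not s and not t:
        return 0.0
    return 1 - len(s & t) / len(s | t)

def cosine_distance_degrees(x, y):
    """Angle between two non-zero vectors, in degrees."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.hypot(*x) * math.hypot(*y)
    cosine = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.degrees(math.acos(cosine))

def hamming(x, y):
    """Number of positions in which two equal-length sequences differ."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(x, y))

print(hamming("10111", "11110"))
```

The two bit strings differ in the second and fifth positions, so their Hamming distance is 2.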


GEAR Session 5. Algorithmic Techniques for Big Data

Lecture 1. Locality Sensitive Hashing

Introduction: Minhashing


LSH: Locality Sensitive Hashing

  • Locality Sensitive Hashing (LSH) is a set of techniques that dramatically speed up search-for-neighbors or near-duplicate detection on data

  • E.g. Duplicate detection
  • E.g. Lookups of nearby points from a geospatial dataset


Locality-sensitive hashing (LSH)

  • Reduce false positives
  • Dissimilar pairs in the same bucket
  • Reduce false negatives
  • Similar pairs in different buckets


Concept of Locality Sensitive Hashing


  • Define $h_1: \mathbb{R}^2 \to \mathbb{Z}$ for a point $x = (x_1, x_2) \in \mathbb{R}^2$ by $h_1(x) := \lfloor x_1 \rfloor$; $h_1(x)$ is the largest integer a for which $a \le x_1$
  • For example, $h_1((3.2, -1.2)) = 3$
  • Define $h_2: \mathbb{R}^2 \to \mathbb{Z}$ for a point $x = (x_1, x_2) \in \mathbb{R}^2$ by $h_2(x) := \lfloor x_2 \rfloor$; $h_2(x)$ is the largest integer a for which $a \le x_2$
  • What if we use both of them? $x \sim y \iff h_1(x) = h_1(y) \text{ and } h_2(x) = h_2(y)$
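A minimal sketch of these two floor-based hash functions, and of what combining them means: two points collide only when they fall in the same unit grid cell. The helper names `grid_cell` and `same_bucket` are assumptions introduced for illustration.

```python
import math

def h1(x):
    """Floor of the first coordinate."""
    return math.floor(x[0])

def h2(x):
    """Floor of the second coordinate."""
    return math.floor(x[1])

def grid_cell(x):
    """Combine both hashes: the unit grid cell that contains the point."""
    return (h1(x), h2(x))

def same_bucket(x, y):
    """x ~ y exactly when h1 and h2 both agree."""
    return grid_cell(x) == grid_cell(y)

print(h1((3.2, -1.2)), h2((3.2, -1.2)))
print(same_bucket((3.2, -1.2), (3.9, -1.9)))
print(same_bucket((3.9, -1.2), (4.1, -1.2)))
```

The last pair shows the limitation of a single grid: points only 0.2 apart still land in different cells when they straddle a cell boundary, which is one source of false negatives.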

Similarity-preserving summaries of sets

  • Signatures
  • Replacing large sets of n-grams by much smaller representations
  • We should be able to compare the signatures of two sets and estimate the Jaccard similarity of the underlying sets from the signatures alone


Matrix Representations of Sets

Universal set: {a,b,c,d,e}
S1 = {a,d}, S2 = {c}, S3 = {b,d,e}, S4 = {a,c,d}

Element  S1  S2  S3  S4
a        1   0   0   1
b        0   0   1   0
c        0   1   0   1
d        1   0   1   1
e        0   0   1   0


Characteristic Matrix

  • Sets: columns of the matrix
  • Elements of the universal set from which elements of the sets are drawn: rows of the matrix
  • Note that the characteristic matrix is unlikely to be the way the data is stored

Minhashing

  • Signature generating algorithm
  • Minhash of the characteristic matrix
  • Select a permutation of the rows (See the element column)
  • Minhash(π) of a set is the row (element) that holds the first non-zero entry of the set's column in the permuted order π
  • π = (b,e,a,d,c)

Element  S1  S2  S3  S4
b        0   0   1   0
e        0   0   1   0
a        1   0   0   1
d        1   0   1   1
c        0   1   0   1


Minhashing

  • The minhash value of any column is the number of the first row in which the column has a 1

  • For the Set S1
  • Row “a” is the first row (after b and e) that has a 1
  • The minhash function, h(S1) = a
  • Similarly, h(S2)=c, h(S3)=b, h(S4)=a


Minhashing and Jaccard Similarity

  • There is a connection between minhashing and Jaccard Similarity
  • Jaccard Similarity
  • The probability that the minhash function for a random permutation of rows produces the same value for two sets equals the Jaccard similarity of those sets


Minhash and Jaccard Similarity

  • Theorem:
  • P(minhash(S) = minhash(T)) = JaccardSIM(S,T)

Proof:

  • X = number of rows with 1 for both S and T (e.g. x = 1)
  • Y = number of rows where either S or T has 1, but not both (e.g. y = 2)
  • Z = number of rows with 0 for both (e.g. z = 2)
  • P(minhash(S) = minhash(T)) is the probability that a row of type X comes before all of the rows of type Y in a random permuted order, which is X/(X+Y) = JaccardSIM(S,T)
  • Example with rows (b, e, a, d, c): rows b and e are 0 in both columns (type Z), row d is 1 in both columns (type X), and rows a and c are 1 in exactly one column (type Y)
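For a universe this small the theorem can be checked exhaustively: enumerate all 5! = 120 row permutations and count how often two sets receive the same minhash value. The sketch below does this for the sets S1, S3 and S4 from the earlier characteristic matrix; the helper names are illustrative assumptions.

```python
from fractions import Fraction
from itertools import permutations

def minhash(permutation, members):
    """First element of the permuted order that belongs to the (non-empty) set."""
    return next(e for e in permutation if e in members)

def jaccard(s, t):
    return Fraction(len(s & t), len(s | t))

def agreement_probability(universe, s, t):
    """Exact fraction of permutations under which s and t share a minhash value."""
    perms = list(permutations(universe))
    agree = sum(minhash(p, s) == minhash(p, t) for p in perms)
    return Fraction(agree, len(perms))

universe = "abcde"
S1, S3, S4 = {"a", "d"}, {"b", "d", "e"}, {"a", "c", "d"}
print(agreement_probability(universe, S1, S4), jaccard(S1, S4))
print(agreement_probability(universe, S3, S4), jaccard(S3, S4))
```

Each printed pair should match, since the fraction of agreeing permutations is exactly X/(X+Y).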


Minhash Signatures

  • Pick (at random) some number n of permutations of the rows of the characteristic matrix M
  • E.g. 100 permutations or several hundred permutations
  • The minhash functions are determined by these permutations: $h_1, h_2, h_3, \ldots, h_n$
  • Minhash signature for S
  • Vector $[h_1(S), h_2(S), h_3(S), \ldots, h_n(S)]$


Computing Minhash Signatures

  • Permuting the order of elements?
  • It is NOT feasible to permute a large characteristic matrix explicitly
  • N elements have N! possible permutations!
  • Can we simulate the effect of a random permutation?


Using a random hash function [1/2]

  • To simulate permutations effectively
  • Use a Random hash function
  • Maps row numbers to as many buckets as there are rows
  • Row numbers 0, 1, …, k-1 map to bucket numbers 0 through k-1
  • Maps some pairs of integers to the same bucket
  • Leaves other buckets unfilled
  • However, not too many collisions


Using a random hash function [2/2]

  • Pick n randomly chosen hash functions $h_1, h_2, \ldots, h_n$ on the rows
  • In the signature matrix, let SIG(i,c) be the element of the signature matrix for the ith hash function and column c
  • Initially, set SIG(i,c) to ∞ for all i and c
  • For each row r
  • Compute $h_1(r), h_2(r), \ldots, h_n(r)$
  • For each column c
  • If c has 0 in row r, do nothing
  • If c has 1 in row r, then for each i = 1,2,…,n set SIG(i,c) to the smaller of the current value of SIG(i,c) and $h_i(r)$
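The row-by-row update can be sketched as follows. It is an illustration under assumptions: `minhash_signatures` is a hypothetical name, and the matrix is passed in a sparse form where `rows[r]` lists the columns that have a 1 in row r. The data and the two hash functions, (x + 1) mod 5 and (3x + 1) mod 5, are those of the worked example on the slides.

```python
import math

def minhash_signatures(rows, hash_functions):
    """rows[r] is the set of column names that have a 1 in row r."""
    columns = sorted(set().union(*rows))
    # SIG(i, c) starts at infinity for every hash function i and column c.
    sig = {c: [math.inf] * len(hash_functions) for c in columns}
    for r, columns_with_one in enumerate(rows):
        hashed = [h(r) for h in hash_functions]   # h_1(r), ..., h_n(r)
        for c in columns_with_one:                # columns with 0 are skipped
            for i, value in enumerate(hashed):
                sig[c][i] = min(sig[c][i], value)
    return sig

rows = [{"S1", "S4"}, {"S3"}, {"S2", "S4"}, {"S1", "S3", "S4"}, {"S3"}]
hashes = [lambda x: (x + 1) % 5, lambda x: (3 * x + 1) % 5]
sig = minhash_signatures(rows, hashes)
for column in sorted(sig):
    print(column, sig[column])
```

Each row is visited once, so the cost is proportional to the number of 1 entries times the number of hash functions, and no permutation is ever materialized.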


Computing Minhash Signatures: Example

  • Form a signature matrix
  • The ith column of M is replaced by the min hash signature of the ith column
  • Start with a compressed form for a sparse matrix

[Table: sparse characteristic matrix with element rows b, e, …, d, m and columns S1, S2, S3, S4; more than $10^9$ elements?]


Computing Minhash Signatures: Example

[Table: the same matrix with its rows relabeled Hash 1, Hash 2, …, Hash M-1, Hash M and columns S1, S2, S3, S4]


Computing Minhash Signatures: Example

Row (element)  S1  S2  S3  S4  (x + 1) mod 5  (3x + 1) mod 5
0              1   0   0   1   1              1
1              0   0   1   0   2              4
2              0   1   0   1   3              2
3              1   0   1   1   4              0
4              0   0   1   0   0              3

The last two columns are (Row # + 1) mod 5 and (3 x Row # + 1) mod 5


Computing Minhash Signatures: Example

Initial signature matrix:

     S1  S2  S3  S4
h1   ∞   ∞   ∞   ∞
h2   ∞   ∞   ∞   ∞

  • h1(x) = (x + 1) mod 5
  • h2(x) = (3x + 1) mod 5
  • h1(0) and h2(0) are both 1
  • The row numbered 0 has 1's in S1 and S4

Signature matrix after row 0:

     S1  S2  S3  S4
h1   1   ∞   ∞   1
h2   1   ∞   ∞   1


Computing Minhash Signatures: Example

  • Row number 1
  • Only S3 has 1
  • Hash value
  • h1(1)=2 and h2(1)=4

Signature matrix before row 1:

     S1  S2  S3  S4
h1   1   ∞   ∞   1
h2   1   ∞   ∞   1

Signature matrix after row 1:

     S1  S2  S3  S4
h1   1   ∞   2   1
h2   1   ∞   4   1


Computing Minhash Signatures: Example

  • Row number 2
  • S2 and S4 have 1
  • Hash value
  • h1(2)=3 and h2(2)=2

Signature matrix before row 2:

     S1  S2  S3  S4
h1   1   ∞   2   1
h2   1   ∞   4   1

Signature matrix after row 2:

     S1  S2  S3  S4
h1   1   3   2   1
h2   1   2   4   1


Computing Minhash Signatures: Example

  • Row number 3
  • S1, S3 and S4 have 1
  • Hash value
  • h1(3)=4 and h2(3)=0

Signature matrix before row 3:

     S1  S2  S3  S4
h1   1   3   2   1
h2   1   2   4   1

Signature matrix after row 3:

     S1  S2  S3  S4
h1   1   3   2   1
h2   0   2   0   0


Computing Minhash Signatures: Example

  • Row number 4
  • Only S3 has 1
  • Hash value
  • h1(4)=0 and h2(4)=3

Signature matrix before row 4:

     S1  S2  S3  S4
h1   1   3   2   1
h2   0   2   0   0

Signature matrix after row 4:

     S1  S2  S3  S4
h1   1   3   0   1
h2   0   2   0   0


Impact of computing the Minhash signature

  • The N rows of the characteristic matrix are now represented by a minhash signature matrix with M rows
  • N >> M
  • Can we say that the Jaccard similarity of S1 and S4 is 1?
  • A fraction of the original rows will not be represented in the current matrix
  • The number of rows in the signature matrix is too small in this example
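To see how rough a two-row signature is, the fraction of agreeing signature rows can be compared with the true Jaccard similarity. In this sketch the signature columns are the ones obtained by applying h1(x) = (x + 1) mod 5 and h2(x) = (3x + 1) mod 5 to the example matrix, and each set is written as the row numbers in which it has a 1. The helper names are illustrative assumptions.

```python
def estimated_similarity(sig_a, sig_b):
    """Fraction of signature rows in which two columns agree."""
    agree = sum(a == b for a, b in zip(sig_a, sig_b))
    return agree / len(sig_a)

def jaccard(s, t):
    return len(s & t) / len(s | t)

signatures = {"S1": [1, 0], "S2": [3, 2], "S3": [0, 0], "S4": [1, 0]}
row_sets = {"S1": {0, 3}, "S2": {2}, "S3": {1, 3, 4}, "S4": {0, 2, 3}}

for a, b in [("S1", "S4"), ("S1", "S3"), ("S1", "S2")]:
    print(a, b, estimated_similarity(signatures[a], signatures[b]),
          jaccard(row_sets[a], row_sets[b]))
```

S1 and S4 agree in both signature rows, so the estimate is 1, while their true Jaccard similarity is 2/3. More hash functions would tighten the estimate.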


GEAR Session 5. Algorithmic Techniques for Big Data

Lecture 1. Locality Sensitive Hashing

Locality Sensitive Hashing for Minhash Signatures


Applying Minhashing for Large Document Corpus

  • Minhash compresses large documents into small signatures and preserves the expected similarity of any pair of documents
  • Challenge
  • The number of pairs of documents may be too large, even if there are not too many documents. The pair-wise similarity comparison will still be expensive
  • Can we compare only the most similar pairs or all pairs that are above some lower bound in similarity?


Generating a LSH for Our Example

  • Divide the signature matrix into b bands consisting of r rows each
  • For each band, there is a hash function that takes vectors of r integers (the portion of one column within that band) and hashes them to some large number of buckets
  • We can use the same hash function for all the bands, but we use a separate bucket array for each band
  • Columns with the same vector in different bands will not hash to the same bucket


[Figure: Minhash Signature A and Minhash Signature B, each a minhash signature for a document D, split into bands. If A and B are similar, the number of identically matching bands (blue boxes) will be higher]

Dividing a signature matrix into 4 bands and 3 rows per band

The first band (BAND 1) of the MinHash signatures; BAND 2, BAND 3 and BAND 4 follow in the same layout:

        …   D10  D11  D12  D13  D14  …
row 1   …   1    1    0    1    2    …
row 2   …   3    2    3    2    2    …
row 3   …   1    1    0    1    1    …

D11 (1,2,1) and D13 (1,2,1) will go to the same bucket
D10 (1,3,1) and D12 (0,3,0) will NOT go to the same bucket unless they show the same values in another band
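The banding step can be sketched as below. This is an illustration under assumptions: `lsh_candidates` is a hypothetical name, and a dictionary keyed by the band's tuple of values stands in for hashing each band to a bucket, so only exact matches collide. The usage applies it to the first band shown in the figure, treated as a 3-row signature with b = 1.

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidates(signatures, b, r):
    """Return pairs of documents whose signatures match in at least one band."""
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)   # a separate bucket array for each band
        for doc, sig in signatures.items():
            key = tuple(sig[band * r:(band + 1) * r])
            buckets[key].append(doc)
        for docs in buckets.values():
            for pair in combinations(sorted(docs), 2):
                candidates.add(pair)
    return candidates

first_band = {
    "D10": [1, 3, 1],
    "D11": [1, 2, 1],
    "D12": [0, 3, 0],
    "D13": [1, 2, 1],
    "D14": [2, 2, 1],
}
print(lsh_candidates(first_band, b=1, r=3))
```

Only pairs that share a bucket in some band are passed on to the full similarity computation, which is what avoids comparing all pairs.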


Analysis of the Banding Technique [1/2]

  • Suppose that we use b bands of r rows each
  • Suppose that a particular pair of documents have Jaccard Similarity s
  • P(the minhash signatures for these documents agree in any one particular row of the signature matrix) = s


Analysis of the Banding Technique [2/2]

  • The probability that documents (signatures) become a candidate pair (signatures should match in at least ONE band):
  • s = the probability that the minhash signatures for these documents agree in any one particular row of the signature matrix = Jaccard similarity value for a pair of documents
  • P(the signatures agree in all rows of one particular band) = $s^r$
  • P(the signatures disagree in at least one row of a particular band) = $1 - s^r$
  • P(the signatures do not agree in all rows of any of the bands) = $(1 - s^r)^b$
  • P(the signatures agree in all the rows of at least one band) = $1 - (1 - s^r)^b$ = probability that these documents (or rather their signatures) become a candidate pair
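The closed form can be checked against a small simulation. The sketch assumes, as the derivation does, that each signature row agrees independently with probability s. The function names, trial count and seed are illustrative choices.

```python
import random

def candidate_probability(s, r, b):
    """1 - (1 - s^r)^b: at least one of b bands agrees in all r rows."""
    return 1 - (1 - s ** r) ** b

def simulate(s, r, b, trials=20_000, seed=0):
    """Monte Carlo estimate under independent row agreement with probability s."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # A band matches when all r of its rows agree.
        if any(all(rng.random() < s for _ in range(r)) for _ in range(b)):
            hits += 1
    return hits / trials

print(candidate_probability(0.5, 5, 20))
print(simulate(0.5, 5, 20))
```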


Finding the Right Threshold

  • The threshold is the value of similarity s at which two documents are considered similar
  • An approximation of the threshold is $(1/b)^{1/r}$
  • If b = 16 and r = 4, then the threshold is approximately at s = 1/2
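The approximation is a one-line computation. The sketch below evaluates $(1/b)^{1/r}$ for three ways of splitting a 64-row signature into bands; the function name `threshold` is an assumption introduced for illustration.

```python
def threshold(b: int, r: int) -> float:
    """Approximate similarity at which the candidate probability rises steeply."""
    return (1 / b) ** (1 / r)

for b, r in [(16, 4), (8, 8), (4, 16)]:
    print(f"b={b:2d} r={r:2d} threshold={threshold(b, r):.3f}")
```

Fewer, longer bands push the threshold up: a pair must then be more similar before it is likely to match in some band.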


Example

  • For example
  • Case 1: If there are 16 bands and each band contains 4 rows
  • s=0.5
  • Case 2: If there are 8 bands and each band contains 8 rows
  • s=0.77
  • Case 3: If there are 4 bands and each band contains 16 rows
  • s=0.91
  • Which case will provide the highest chance to be candidates for a pair of similar documents?
  • a. Case 1
  • b. Case 2
  • c. Case 3


Example

  • For example
  • Case 1: If there are 16 bands and each band contains 4 rows
  • s=0.5
  • Case 2: If there are 8 bands and each band contains 8 rows
  • s=0.77
  • Case 3: If there are 4 bands and each band contains 16 rows
  • s=0.91
  • Which case will have highest number of candidates?
  • a. Case 1
  • b. Case 2
  • c. Case 3


Example

  • Suppose that there are 20 bands (b=20), and each band includes 5 rows (r=5).
  • P(the signatures agree in all the rows of at least one band) = $1-(1-s^r)^b$

s     $1-(1-s^r)^b$
0.2   0.006
0.3   0.047
0.4   0.186
0.5   0.470
0.6   0.802
0.7   0.975
0.8   0.9996
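The table can be regenerated from the formula with b = 20 and r = 5. A minimal sketch; the function name is an assumption introduced for illustration.

```python
def candidate_probability(s: float, r: int, b: int) -> float:
    """Probability that two signatures agree in all rows of at least one band."""
    return 1 - (1 - s ** r) ** b

for s in (0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8):
    print(f"{s:.1f}  {candidate_probability(s, r=5, b=20):.4f}")
```

The values trace an S-curve: pairs below about 0.4 similarity rarely become candidates, while pairs above about 0.7 almost always do.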
