

SLIDE 1

Web Characteristics

CE-324: Modern Information Retrieval

Sharif University of Technology

  • M. Soleymani

Spring 2020

Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford).
Some slides have been adapted from: Profs. Leskovec, Rajaraman, and Ullman (Mining of Massive Datasets course, Stanford).
SLIDE 2

Web document collection

• No design/coordination
• Distributed content creation, linking, democratization of publishing
• Content includes truth, lies, obsolete information, contradictions …
• Unstructured (text, HTML, …), semi-structured (XML, annotated photos), structured (databases), …
• Scale much larger than previous text collections … but corporate records are catching up
• Growth has slowed from the initial “volume doubling every few months,” but the web is still expanding
• Content can be dynamically generated

  • Sec. 19.2

SLIDE 3

Web search basics

[Diagram: the user issues a query to the search interface; a web spider crawls the Web, the indexer builds the indexes (plus separate ad indexes), and search serves results back to the user.]

[Screenshot: Google results page for the query “miele” — “Results 1 - 10 of about 7,310,000 for miele. (0.12 seconds)” — with algorithmic results (miele.com, miele.co.uk, miele.de, miele.at) alongside sponsored links for appliance and vacuum retailers.]

  • Sec. 19.4.1

SLIDE 4

Web graph

• HTML pages together with the hyperlinks between them
• Can be modeled as a directed graph
• Anchor text: the text surrounding the origin of a hyperlink on page A

SLIDE 5

Users’ empirical evaluation of results

• Quality of pages varies widely
• Relevance is not enough
• Other desirable qualities (non-IR!):
  • Content: trustworthy, diverse, non-duplicated, well maintained
  • Web readability: displays correctly & fast
  • No annoyances: pop-ups, etc.
• Precision vs. recall:
  • On the web, recall seldom matters
  • What matters: precision at 1? Precision above the fold?
  • Comprehensiveness: must be able to deal with obscure queries
    • Recall matters when the number of matches is very small
• User perceptions may be unscientific, but are significant over a large aggregate

SLIDE 6

User Needs

• Need [Brod02, RL04]
  • Informational – want to learn about something (~40% / 65%), e.g., “low hemoglobin”
  • Navigational – want to go to that page (~25% / 15%), e.g., “united airlines”
  • Transactional – want to do something web-mediated (~35% / 20%)
    • Access a service: “seattle weather”
    • Downloads: “mars surface images”
    • Shop: “canon s410”
  • Gray areas
    • Find a good hub: “car rental brasil”
    • Exploratory search: “see what’s there”

  • Sec. 19.4.1

SLIDE 7

SPAM

(SEARCH ENGINE OPTIMIZATION)

SLIDE 8

The trouble with paid search ads …

• It costs money. What’s the alternative?
• Search Engine Optimization:
  • “Tuning” your web page to rank highly in the algorithmic search results for selected keywords
  • An alternative to paying for placement
  • Thus, intrinsically a marketing function
• Performed by companies, webmasters & consultants (“search engine optimizers”) for their clients
• Some perfectly legitimate, some very shady

  • Sec. 19.2.2

SLIDE 9

Search Engine Optimizer (SEO)

• Motives
  • Commercial, political, religious, lobbies
  • Promotion funded by advertising budget
• Operators
  • Contractors (search engine optimizers) for lobbies, companies
  • Webmasters
  • Hosting services
• Forums
  • E.g., WebmasterWorld (www.webmasterworld.com)
    • Search-engine-specific tricks
    • Discussions about academic papers :)

  • Sec. 19.2.2

SLIDE 10

Simplest forms

• First-generation engines relied heavily on tf-idf
  • The top-ranked pages for the query “maui resort” were the ones containing the most occurrences of “maui” and “resort”
• SEOs responded with dense repetitions of the chosen terms
  • e.g., “maui resort maui resort maui resort”
  • Often, the repetitions would be in the same color as the background of the web page: the repeated terms got indexed by crawlers but were not visible to humans in browsers

Pure word density cannot be trusted as an IR signal

  • Sec. 19.2.2

SLIDE 11

Variants of keyword stuffing

• Misleading meta-tags, excessive repetition
• Hidden text via colors, style-sheet tricks, etc.

Example meta-tags = “… London hotels, hotel, holiday inn, hilton, discount, booking, reservation, …”

  • Sec. 19.2.2

SLIDE 12

Cloaking

• Serve fake content to the search engine spider
• DNS cloaking: switch the IP address; impersonate

[Diagram: the server asks “Is this a search engine spider?” — if yes, it serves the SPAM page; if no, it serves the real document.]

  • Sec. 19.2.2

SLIDE 13

More spam techniques

• Doorway pages
  • Pages optimized for a single keyword that redirect to the real target page
• Link spamming
  • Mutual admiration societies, hidden links, awards
  • Domain flooding: numerous domains that point or redirect to a target page
• Robots
  • Fake query stream – rank-checking programs
  • “Curve-fit” the ranking programs of search engines
  • Millions of submissions via Add-URL

  • Sec. 19.2.2

SLIDE 14

The war against spam

• Quality signals: prefer authoritative pages based on:
  • Votes from authors (linkage signals)
  • Votes from users (usage signals)
• Policing of URL submissions
  • Anti-robot test
• Limits on meta-keywords
• Robust link analysis
  • Ignore statistically implausible linkage (or text)
  • Use link analysis to detect spammers (guilt by association)
• Spam recognition by machine learning
  • Training set based on known spam
• Family-friendly filters
  • Linguistic analysis, general classification techniques, etc.
  • For images: flesh-tone detectors, source-text analysis, etc.
• Editorial intervention
  • Blacklists
  • Top queries audited
  • Complaints addressed
  • Suspect pattern detection

SLIDE 15

More on spam

• Web search engines have policies on the SEO practices they tolerate/block
  • http://help.yahoo.com/help/us/ysearch/index.html
  • http://www.google.com/intl/en/webmasters/
• Adversarial IR: the unending (technical) battle between SEOs and web search engines
• Research: http://airweb.cse.lehigh.edu/

SLIDE 16

DUPLICATE DETECTION

SLIDE 17

Duplicate documents

• The web is full of duplicated content
• Strict duplicate detection = exact match
  • Not that common
• But many, many cases of near duplicates
  • E.g., the last-modified date is the only difference between two copies of a page
  • Sec. 19.6

SLIDE 18

Duplicate/near-duplicate detection

• Duplication: exact match can be detected with fingerprints
• Near-duplication: approximate match
• Overview:
  • Compute syntactic similarity with an edit-distance measure
  • Use a similarity threshold to detect near-duplicates
    • E.g., similarity > 80% ⇒ docs are “near duplicates”
    • Not transitive, though sometimes used transitively

  • Sec. 19.6

SLIDE 19

Computing Similarity

• Features:
  • Segments of a doc (natural or artificial breakpoints)
  • Shingles (word n-grams)
• Similarity measure between two docs (= two sets of shingles):
  • Jaccard coefficient: |A ∩ B| / |A ∪ B|

  • Sec. 19.6

SLIDE 20

Example

• Doc A: “a rose is red a rose is white”
• Doc B: “a rose is white a rose is red”

Doc A: 5 shingles — “a rose is red”, “rose is red a”, “is red a rose”, “red a rose is”, “a rose is white”
Doc B: 5 shingles — “a rose is white”, “rose is white a”, “is white a rose”, “white a rose is”, “a rose is red”

Jaccard = 2/8 = 0.25
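A minimal sketch (my own illustration, not from the slides) of word-level shingling and the exact Jaccard coefficient, reproducing the rose example above:

```python
def shingles(text: str, k: int = 4) -> set:
    """Return the set of k-word shingles (word k-grams) of a document."""
    words = text.split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a: set, b: set) -> float:
    """Exact Jaccard coefficient |A ∩ B| / |A ∪ B| of two shingle sets."""
    return len(a & b) / len(a | b)

sa = shingles("a rose is red a rose is white")
sb = shingles("a rose is white a rose is red")
print(len(sa), len(sb))   # 5 5  -- five distinct shingles per doc
print(jaccard(sa, sb))    # 0.25 -- 2 shared shingles out of 8 total
```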

SLIDE 21

Shingles + Set Intersection

[Diagram: Doc A → shingle set A → sketch A; Doc B → shingle set B → sketch B; the two sketches are compared to estimate the Jaccard coefficient.]

• Computing the exact set intersection of shingles between all pairs of docs is expensive/intractable
• Approximate it using a cleverly chosen subset of the shingles from each doc (called a sketch)
• Estimate |S(A) ∩ S(B)| / |S(A) ∪ S(B)| based on the short sketches of docs A and B

  • Sec. 19.6

SLIDE 22

From sets to Boolean matrices

• Rows = elements of the universal set
  • Example: the set of all k-shingles
• Columns = sets
• View sets as columns of a matrix D, with one row for each element in the universe of shingles
• D_jk = 1 indicates the presence of shingle j in set k
• The typical matrix is sparse
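A hypothetical toy sketch of this representation (the documents and shingle IDs are invented for illustration), storing the sparse matrix as a row → columns-with-a-1 map:

```python
# Toy shingle sets for two documents (invented example).
docs = {"A": {"s1", "s2", "s3"}, "B": {"s2", "s3", "s4"}}

universe = sorted(set().union(*docs.values()))  # rows: all shingles
cols = sorted(docs)                             # columns: the sets/docs

# Sparse representation: for each row, record only the columns holding a 1.
rows_with_one = {s: {c for c in cols if s in docs[c]} for s in universe}

for s in universe:
    print(s, [1 if c in rows_with_one[s] else 0 for c in cols])
# s1 [1, 0]
# s2 [1, 1]
# s3 [1, 1]
# s4 [0, 1]
```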

SLIDE 23

Example: Column similarity

SLIDE 24

Four types of rows (for a pair of columns)

• For columns D_j, D_k, there are four types of rows:

  D_j  D_k
   1    1
   1    0
   0    1
   0    0

• n_11 = # of rows where both columns are 1 (# of items that exist in both sets D_j and D_k)
• n_10 = # of rows where D_j contains 1 but D_k contains 0 (# of items that exist in D_j but not in D_k)
• and so on

Jaccard(D_j, D_k) = n_11 / (n_10 + n_01 + n_11)
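A small sketch in the slide’s notation (my own illustration): count the row types for two boolean columns and compute the Jaccard coefficient; note that the n_00 rows play no role.

```python
def jaccard_from_columns(dj, dk):
    """Jaccard(D_j, D_k) = n11 / (n10 + n01 + n11) from two 0/1 columns."""
    n11 = sum(1 for x, y in zip(dj, dk) if (x, y) == (1, 1))
    n10 = sum(1 for x, y in zip(dj, dk) if (x, y) == (1, 0))
    n01 = sum(1 for x, y in zip(dj, dk) if (x, y) == (0, 1))
    # (0, 0) rows carry no information about the Jaccard coefficient.
    return n11 / (n10 + n01 + n11)

# Columns of the toy matrix sketched earlier: D_A = (1,1,1,0), D_B = (0,1,1,1).
print(jaccard_from_columns([1, 1, 1, 0], [0, 1, 1, 1]))  # 2 / (1 + 1 + 2) = 0.5
```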

SLIDE 25

Min-Hashing

• Imagine the rows of the boolean matrix permuted under a random permutation ρ
• Define a “hash” function h_ρ(C) = the index of the first row (in the permuted order ρ) in which column C has value 1
• Use several (e.g., 100) independent hash functions (that is, permutations) to create a signature for each column

SLIDE 26

Minhashing: example

Randomly permute the rows through a permutation ρ. For each permutation ρ, h_ρ(D_j) denotes the index of the first row with a 1 in column D_j (after the permutation ρ).

SLIDE 27

Minhashing

• Imagine the rows permuted randomly and define the minhash function:
  • h_ρ(D_j) = index of the first row with a 1 in column D_j (after the permutation ρ on the rows)
• Use several (e.g., 100) independent permutations to create a signature for each column
• The signatures can be displayed in another matrix:
  • The signature matrix, whose columns represent the sets and whose rows hold the minhash values, in order, for each column
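A runnable sketch of this construction (an assumed implementation, not the lecture’s code), using explicit random permutations of the row indices:

```python
import random

def minhash_signature(column, perms):
    """h_rho(C) per permutation: the smallest permuted index of a row
    in which this 0/1 column holds a 1."""
    return [min(perm[r] for r, bit in enumerate(column) if bit == 1)
            for perm in perms]

random.seed(0)
n_rows, n_perms = 4, 100
perms = [random.sample(range(n_rows), n_rows) for _ in range(n_perms)]

# The two toy columns from before: true Jaccard = 0.5.
sig_a = minhash_signature([1, 1, 1, 0], perms)
sig_b = minhash_signature([0, 1, 1, 1], perms)
```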

SLIDE 28

A property of minhashing

• The probability (over all permutations of the rows) that h_ρ(D_j) = h_ρ(D_k) is the same as Jaccard(D_j, D_k)
• Both equal n_11 / (n_01 + n_10 + n_11)!

Proof sketch:
• Look down the permuted columns D_j and D_k until we see a 1
• If both columns have a 1 in this row, then h_ρ(D_j) = h_ρ(D_k); however, if only one of them contains a 1, then h_ρ(D_j) ≠ h_ρ(D_k)

SLIDE 29

Proof

• We intend to show P(h_ρ(D_j) = h_ρ(D_k)) = Jaccard(D_j, D_k)
• Look down columns D_j, D_k until the first row in which at least one of D_j or D_k is non-zero ⇒ the shingle corresponding to this row is in D_j ∪ D_k
• If both D_j and D_k are non-zero in this row (so h_ρ(D_j) = h_ρ(D_k)), the corresponding shingle is in D_j ∩ D_k
• Thus, each permutation in effect selects a random sample from D_j ∪ D_k and checks whether it also lies in D_j ∩ D_k
• Therefore, the expectation of [h_ρ(D_j) = h_ρ(D_k)] over different permutations ρ equals Jaccard(D_j, D_k)

  • Sec. 19.6

SLIDE 30

Similarity for signatures

• The similarity of two signatures is the fraction of the minhash functions on which they agree
  • Thinking of signatures as columns of integers, the similarity of two signatures is the fraction of rows in which they agree
• Thus, the expected similarity of two signatures equals the Jaccard similarity of the columns (sets) that the signatures represent
• And the longer the signatures, the smaller the expected error
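Continuing the permutation sketch above (still my own illustration), the fraction of agreeing signature rows estimates the Jaccard coefficient of the underlying columns:

```python
def signature_similarity(sig_a, sig_b):
    """Fraction of minhash values on which two signatures agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

# With the 100 permutations above, this should come out near the true
# Jaccard coefficient of the two columns, 0.5; a longer signature
# (more permutations) shrinks the expected error.
print(signature_similarity(sig_a, sig_b))
```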

SLIDE 31

Min Hashing: Example

SLIDE 32

Sketch of a document

• Create a “sketch vector” (of size ~200) for each doc D:
  • Let f map all shingles to 0 … 2^m − 1 (e.g., f = fingerprinting); i.e., f maps each shingle to an m-bit integer
  • For i = 1 to the size of the sketch vector:
    • Let π_i be a random permutation
    • Sketch_D[i] = the minimum shingle value for doc D after applying permutation π_i to these numbers
• Docs that share ≥ a threshold (e.g., 90%) of the corresponding sketch-vector elements are near duplicates
• If the size of the sketch vector is N:

  Jaccard(A, B) ≈ (1/N) · Σ_{i=1}^{N} 1[Sketch_A[i] = Sketch_B[i]]

  • Sec. 19.6

SLIDE 33

Computing Sketch[i] for Doc1

[Diagram: Document 1’s shingles plotted as 64-bit values f(shingles) on the number line 0 … 2^64, before and after a permutation.]

• Start with the 64-bit f(shingles)
• Permute them on the number line
• Pick the min value

  • Sec. 19.6

SLIDE 34

Test if Doc1.Sketch[i] = Doc2.Sketch[i]

[Diagram: Documents 1 and 2 plotted on the number line 0 … 2^64 under the same permutation; their minimum values A and B are compared — are these equal?]

• Test for 200 random permutations: π_1, π_2, …, π_200

  • Sec. 19.6

SLIDE 35

However…

[Diagram: Documents 1 and 2 on the number line under the same permutation; the minimum values A and B coincide exactly when the minimizing shingle is shared.]

The shingle with the MIN value across both Doc1 and Doc2 is common to both (i.e., it lies in the intersection).

  • Sec. 19.6

SLIDE 36

Min-Hash sketches: summary

• Use f to map shingles to m-bit integers
• Pick P random row permutations
• MinHash sketch: Sketch_C[j] = the first row with a 1 in column C under the j-th permutation
• Similarity of signatures:
  • Let sim[sketch_C, sketch_C′] = the fraction of identical elements in the vectors sketch_C and sketch_C′
    • = the fraction of permutations where the MinHash values agree
  • Observe: sim[sketch_C, sketch_C′] ≈ Jaccard(C, C′)

  • Sec. 19.6

SLIDE 37

Implementation of Min-Hashing

• Suppose one billion rows
• It is hard to pick a random permutation of 1 … 1 billion
  • Representing a random permutation requires 1 billion entries
  • Accessing the rows in permuted order leads to thrashing

SLIDE 38

Implementation of Min-Hashing

• A good approximation to permuting the rows:
  • Pick, say, 100 hash functions
  • h_i(·) gives the order of the rows for the i-th “permutation”
  • For each column (doc D) and each hash function h_i, Sketch_D[i] is the smallest value of h_i over the shingles of doc D
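A sketch of this approximation (an assumed implementation): simple universal hash functions h_i(x) = (a_i·x + b_i) mod p stand in for permutations, and each document keeps, per h_i, the smallest hash of its shingle fingerprints:

```python
import random

P = (1 << 61) - 1   # a large prime, bigger than any shingle fingerprint
random.seed(0)
hash_params = [(random.randrange(1, P), random.randrange(P)) for _ in range(100)]

def sketch(shingle_ids):
    """Sketch_D[i] = min over shingles s of h_i(s) = (a_i*s + b_i) mod P."""
    return [min((a * s + b) % P for s in shingle_ids) for a, b in hash_params]

def estimated_jaccard(s1, s2):
    return sum(x == y for x, y in zip(s1, s2)) / len(s1)

doc_a = {101, 102, 103}   # toy shingle fingerprints (invented)
doc_b = {102, 103, 104}   # true Jaccard = 2/4 = 0.5
print(estimated_jaccard(sketch(doc_a), sketch(doc_b)))  # ≈ 0.5
```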

SLIDE 39

All signature pairs

• We have an extremely efficient method for estimating the Jaccard coefficient for a single pair of docs
• But we still have to estimate N² coefficients, where N is the number of web pages (still slow)
• One solution: locality-sensitive hashing (LSH)
• Another solution: sorting (Henzinger 2006)

  • Sec. 19.6

SLIDE 40

Locality Sensitive Hashing

[Pipeline: Document → (shingling) the set of strings of length k that appear in the document → (min-hashing) signatures: short integer vectors that represent the sets and reflect their similarity → (locality-sensitive hashing) candidate pairs: the pairs of signatures that we need to test for similarity.]

SLIDE 41

LSH: First Cut

• Goal: find documents with Jaccard similarity at least s (for some similarity threshold, e.g., s = 0.8)
• LSH – general idea: use a function f(x, y) that tells whether x and y are a candidate pair, i.e., a pair of elements whose similarity must be evaluated
• For min-hash matrices:
  • Hash the columns of the signature matrix M to many buckets
  • Each pair of documents that hashes into the same bucket is a candidate pair

SLIDE 42

Candidates from Min-Hash

• Pick a similarity threshold s (0 < s < 1)
• Columns x and y of M are a candidate pair if their signatures agree on at least a fraction s of their rows: M(i, x) = M(i, y) for at least a fraction s of the values of i
• We expect documents x and y to have the same (Jaccard) similarity as their signatures

SLIDE 43

LSH for Min-Hash

• Big idea: hash the columns of the signature matrix M several times
• Arrange that (only) similar columns are likely to hash to the same bucket, with high probability
• Candidate pairs are those that hash to the same bucket

SLIDE 44

Partition M into b Bands

[Diagram: the signature matrix M divided into b bands of r rows per band; each column is one signature.]

SLIDE 45

Partition M into Bands

• Divide matrix M into b bands of r rows each
• For each band, hash its portion of each column to a hash table with k buckets
  • Make k as large as possible
• Candidate column pairs are those that hash to the same bucket for ≥ 1 band
• Tune b and r to catch most similar pairs but few non-similar pairs (a sketch of the scheme follows below)
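A sketch of the banding scheme (an assumed implementation): split each signature into b bands of r rows, bucket each band, and report as candidates the pairs that collide in at least one band. Bucketing by the exact band value matches the “enough buckets” assumption on the next slides.

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidates(signatures, b, r):
    """signatures: doc-id -> minhash signature of length b*r.
    Returns the set of candidate pairs (docs colliding in >= 1 band)."""
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for doc, sig in signatures.items():
            chunk = tuple(sig[band * r:(band + 1) * r])  # this band's r rows
            buckets[chunk].append(doc)                    # bucket by band value
        for docs in buckets.values():
            candidates.update(combinations(sorted(docs), 2))
    return candidates

# E.g., with 100-row signatures: candidates = lsh_candidates(sigs, b=20, r=5)
```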

SLIDE 46

Hashing Bands

[Diagram: matrix M split into b bands of r rows; each band’s column-portions are hashed into buckets. Columns 2 and 6 land in the same bucket, so they are probably identical in that band (a candidate pair); columns 6 and 7 land in different buckets, so they differ.]

SLIDE 47

Simplifying Assumption

• There are enough buckets that columns are unlikely to hash to the same bucket unless they are identical in a particular band
• Hereafter, we assume that “same bucket” means “identical in that band”
• The assumption is needed only to simplify the analysis, not for the correctness of the algorithm

SLIDE 48

Example of Bands

Assume the following case:
• Suppose 100,000 columns of M (100k docs)
• Signatures of 100 integers (rows)
  • Therefore, the signatures take 40 MB
• Choose b = 20 bands of r = 5 integers per band
• Goal: find pairs of documents that are at least s = 0.8 similar

SLIDE 49

C1, C2 are 80% Similar

• Find pairs of ≥ s = 0.8 similarity; set b = 20, r = 5
• Assume: sim(C1, C2) = 0.8
  • Since sim(C1, C2) ≥ s, we want C1, C2 to be a candidate pair: we want them to hash to at least 1 common bucket (i.e., at least one band is identical)
• Probability C1, C2 are identical in one particular band: (0.8)^5 = 0.328
• Probability C1, C2 are not identical in any of the 20 bands: (1 − 0.328)^20 = 0.00035
  • i.e., about 1/3000 of the 80%-similar column pairs are false negatives (we miss them)
  • We would find 99.965% of the pairs of truly similar documents

SLIDE 50

C1, C2 are 30% Similar

• Find pairs of ≥ s = 0.8 similarity; set b = 20, r = 5
• Assume: sim(C1, C2) = 0.3
  • Since sim(C1, C2) < s, we want C1, C2 to hash to NO common buckets (all bands should be different)
• Probability C1, C2 are identical in one particular band: (0.3)^5 = 0.00243
• Probability C1, C2 are identical in at least 1 of the 20 bands: 1 − (1 − 0.00243)^20 = 0.0474
• In other words, approximately 4.74% of the pairs of docs with similarity 0.3 end up becoming candidate pairs
  • They are false positives: we will have to examine them (they are candidate pairs), but it will then turn out that their similarity is below the threshold s
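The arithmetic of the last two slides, reproduced as a quick check:

```python
b, r = 20, 5

# A truly similar pair, sim = 0.8:
p_band = 0.8 ** r                # identical in one band: 0.328
p_miss = (1 - p_band) ** b       # no band identical: ~0.00035
print(p_band, p_miss)            # ~1/3000 false-negative rate

# A dissimilar pair, sim = 0.3:
p_band = 0.3 ** r                # identical in one band: 0.00243
p_hit = 1 - (1 - p_band) ** b    # >= 1 band identical: ~0.0474
print(p_band, p_hit)             # ~4.74% false-positive rate
```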

SLIDE 51

LSH Involves a Tradeoff

• Pick:
  • The number of min-hashes (rows of M),
  • The number of bands b, and
  • The number of rows r per band
  to balance false positives/negatives
• Example: if we had only 15 bands of 5 rows, the number of false positives would go down, but the number of false negatives would go up

SLIDE 52

Analysis of LSH – What We Want

[Plot: probability of sharing a bucket vs. the similarity t = sim(C1, C2) of two sets. The ideal behavior is a step function at the similarity threshold s: no chance if t < s, probability 1 if t > s.]

SLIDE 53

What 1 Band of 1 Row Gives You

[Plot: probability of sharing a bucket vs. the similarity t = sim(C1, C2) of two sets — a straight diagonal line. Remember: the probability of equal hash values equals the similarity.]

SLIDE 54

b bands, r rows/band

• Columns C1 and C2 have similarity s
• Pick any band (r rows):
  • Prob. that all rows in the band are equal: s^r
  • Prob. that some row in the band is unequal: 1 − s^r
• Prob. that no band is identical: (1 − s^r)^b
• Prob. that at least 1 band is identical: 1 − (1 − s^r)^b

SLIDE 55

What b Bands of r Rows Gives You

[Plot: probability of sharing a bucket vs. the similarity t = sim(C1, C2) of two sets. The S-curve 1 − (1 − t^r)^b is built up as: all rows of a band equal (t^r) → some row of a band unequal (1 − t^r) → no bands identical ((1 − t^r)^b) → at least one band identical (1 − (1 − t^r)^b). The threshold of the S-curve sits at roughly t ≈ (1/b)^(1/r).]

SLIDE 56

Example: b = 20; r = 5

• Similarity threshold s
• Prob. that at least 1 band is identical:

  s    1 − (1 − s^r)^b
  .2   .006
  .3   .047
  .4   .186
  .5   .470
  .6   .802
  .7   .975
  .8   .9996
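The table recomputed (a quick sketch; the printed values round to the slide’s):

```python
b, r = 20, 5
for s in (0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8):
    # Probability that at least one of the b bands is identical.
    print(f"{s:.1f}  {1 - (1 - s ** r) ** b:.4f}")
# 0.2 0.0064   0.3 0.0475   0.4 0.1861   0.5 0.4700
# 0.6 0.8019   0.7 0.9748   0.8 0.9996
```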

SLIDE 57

Picking r and b: The S-curve

• Picking r and b to get the best S-curve
  • 50 hash functions (r = 5, b = 10)

[Plot: S-curve of Prob(sharing a bucket) vs. similarity for 50 hash functions (r = 5, b = 10); the blue area marks the false-negative rate and the green area the false-positive rate.]
SLIDE 58

LSH Summary

• Tune M, b, and r to get almost all pairs with similar signatures while eliminating most pairs that do not have similar signatures
• Check in main memory that the candidate pairs really do have similar signatures
• Optional: in another pass through the data, check that the remaining candidate pairs really represent similar documents

SLIDE 59

Summary: 3 Steps

• Shingling: convert documents to sets
  • We used hashing to assign each shingle an ID
• Min-Hashing: convert large sets to short signatures while preserving similarity
  • We used similarity-preserving hashing to generate signatures with the property Pr[h_π(C1) = h_π(C2)] = sim(C1, C2)
  • We used hashing to get around generating random permutations
• Locality-Sensitive Hashing: focus on pairs of signatures likely to be from similar documents
  • We used hashing to find candidate pairs of similarity ≥ s

SLIDE 60

More resources

• IIR Chapter 19
• MMDS (Mining of Massive Datasets) Chapter 3, http://www.mmds.org