SLIDE 1

Random Sampling from a Search Engine's Index

Ziv Bar-Yossef Maxim Gurevich

Department of Electrical Engineering Technion

SLIDE 2

Search Engine Samplers

[Diagram: a Sampler issues queries to the search engine's public interface and receives the top k results. The search engine's index holds the set D of indexed documents, gathered from the Web. The sampler's output is a random document x ∈ D.]

SLIDE 3

Motivation

Useful tool for search engine evaluation:

- Freshness: fraction of up-to-date pages in the index
- Topical bias: identification of overrepresented/underrepresented topics
- Spam: fraction of spam pages in the index
- Security: fraction of pages in the index infected by viruses/worms/trojans
- Relative size: number of documents indexed compared with other search engines

SLIDE 4

Size Wars

- August 2005: "We index 20 billion documents."
- September 2005: "We index 8 billion documents, but our index is 3 times larger than our competition's."

So, who's right?
SLIDE 5

Related Work

- Random sampling from a search engine's index [BharatBroder98, CheneyPerry05, GulliSignorini05]
- Anecdotal queries [SearchEngineWatch, Google, BradlowSchmittlein00]
- Queries from user query logs [LawrenceGiles98, DobraFeinberg04]
- Random sampling from the whole web [Henzinger et al. 00, Bar-Yossef et al. 00, Rusmevichientong et al. 01]

SLIDE 6

Our Contributions

- A pool-based sampler (focus of this talk): guaranteed to produce near-uniform samples
- A random walk sampler: after sufficiently many steps, guaranteed to produce near-uniform samples; does not need an explicit lexicon/pool at all!

SLIDE 7

Search Engines as Hypergraphs

- results(q) = { documents returned on query q }
- queries(x) = { queries that return x as a result }
- P = query pool = a set of queries

Query pool hypergraph:
- Vertices: the indexed documents
- Hyperedges: { results(q) | q ∈ P }

[Figure: example hypergraph over the documents www.cnn.com, www.foxnews.com, news.google.com, news.bbc.co.uk, www.google.com, maps.google.com, www.bbc.co.uk, www.mapquest.com, maps.yahoo.com, and en.wikipedia.org/wiki/BBC, with one hyperedge per query "news", "bbc", "google", "maps".]

SLIDE 8

Query Cardinalities and Document Degrees

- Query cardinality: card(q) = |results(q)|
- Document degree: deg(x) = |queries(x)|

Examples (on the hypergraph of the previous slide):
- card("news") = 4, card("bbc") = 3
- deg(www.cnn.com) = 1, deg(news.bbc.co.uk) = 2
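The two definitions can be computed directly from a results(q) mapping. A minimal Python sketch follows; the hyperedge memberships are read off the example figure and are assumptions wherever the figure is ambiguous:

```python
# Example hypergraph as a results(q) mapping; edge memberships are read off
# the slide's figure and are assumptions where the figure is ambiguous.
results = {
    "news": ["www.cnn.com", "www.foxnews.com", "news.google.com", "news.bbc.co.uk"],
    "bbc": ["news.bbc.co.uk", "www.bbc.co.uk", "en.wikipedia.org/wiki/BBC"],
    "google": ["news.google.com", "www.google.com", "maps.google.com"],
    "maps": ["maps.google.com", "www.mapquest.com", "maps.yahoo.com"],
}

def card(q):
    # card(q) = |results(q)|
    return len(results[q])

def queries(x):
    # queries(x) = the queries that return x as a result
    return [q for q, docs in results.items() if x in docs]

def deg(x):
    # deg(x) = |queries(x)|
    return len(queries(x))

print(card("news"), card("bbc"))                  # 4 3
print(deg("www.cnn.com"), deg("news.bbc.co.uk"))  # 1 2
```

The printed values match the slide's examples.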

SLIDE 9

The Pool-Based Sampler: Preprocessing Step

Build a query pool P of queries q1, q2, … from a large corpus C.

Example: P = all 3-word phrases that occur in C. If "to be or not to be" occurs in C, then P contains: "to be or", "be or not", "or not to", "not to be".

Choose P so that it "covers" most documents in D.
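The phrase extraction can be sketched in a few lines of Python; `phrase_pool` is a hypothetical helper written for illustration, not code from the paper:

```python
def phrase_pool(corpus_docs, n=3):
    # Build the query pool P: all n-word phrases occurring in the corpus C.
    pool = set()
    for doc in corpus_docs:
        words = doc.split()
        for i in range(len(words) - n + 1):
            pool.add(" ".join(words[i:i + n]))
    return pool

C = ["to be or not to be"]
P = phrase_pool(C)
print(sorted(P))  # ['be or not', 'not to be', 'or not to', 'to be or']
```

In practice C would be a large web corpus and P would be stored on disk; the slide's point is only that P must cover most of D.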

SLIDE 10

Monte Carlo Simulation

We don't know how to generate uniform samples from D directly. How can we use biased samples to generate uniform samples? Samples with weights that represent their bias can be used to simulate uniform samples.

Monte Carlo simulation methods:
- Rejection Sampling
- Importance Sampling
- Metropolis-Hastings
- Maximum-Degree

SLIDE 11

Document Degree Distribution

We are able to generate biased samples from the "document degree distribution":

p(x) = deg(x) / Σx’ deg(x’)

Advantage: we can compute the weights representing the bias of p:

wp(x) = 1/deg(x)
SLIDE 12

Rejection Sampling [von Neumann]

    accept := false
    while (not accept)
        generate a sample x from p
        toss a coin whose heads probability is wp(x)
        if the coin comes up heads, accept := true
    return x
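The pseudocode runs as-is once the trial distribution p and the weight function are supplied. A runnable sketch, with a toy stand-in for p (not the paper's setup):

```python
import random

# Runnable version of the rejection-sampling pseudocode. sample_from_p draws
# from the trial distribution p; w(x) in [0, 1] is the acceptance weight.
def rejection_sample(sample_from_p, w):
    accept = False
    while not accept:
        x = sample_from_p()
        # toss a coin whose heads probability is w(x)
        accept = random.random() < w(x)
    return x

# Toy check: p draws "a" three times as often as "b"; weighting by 1/deg(x)
# corrects the bias back towards uniform.
deg = {"a": 3, "b": 1}
biased = ["a", "a", "a", "b"]
x = rejection_sample(lambda: random.choice(biased), lambda y: 1 / deg[y])
```

Over many calls, "a" and "b" each come back about half the time, even though p proposes "a" 75% of the time.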

SLIDE 13

Pool-Based Sampler

[Diagram: the degree distribution sampler sends queries q1, q2, … to the search engine and receives results(q1), results(q2), …; it outputs documents sampled from the degree distribution with corresponding weights (x1, 1/deg(x1)), (x2, 1/deg(x2)), …; a rejection sampling step turns these into a uniform sample x.]

Degree distribution: p(x) = deg(x) / Σx’ deg(x’)

SLIDE 14

Sampling documents by degree

- Select a random q ∈ P
- Select a random x ∈ results(q)

Documents with high degree are more likely to be sampled, but if we sample q uniformly, we "oversample" documents that belong to narrow queries. We need to sample q proportionally to its cardinality.

SLIDE 15

Sampling queries by cardinality

- Sampling queries from the pool uniformly: easy
- Sampling queries from the pool by cardinality: hard, since it requires knowing the cardinalities of all queries in the search engine

Use Monte Carlo methods to simulate biased sampling via uniform sampling:
- Sample queries uniformly from P
- Compute a "cardinality weight" for each sample: w(q) = card(q)/k
- Obtain queries sampled by their cardinality
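The simulation above can be sketched as a rejection sampler over queries; `card_table` is a hypothetical cardinality lookup standing in for issuing each query to the engine, and k is the engine's result limit:

```python
import random

# Sketch of query sampling by cardinality via rejection sampling. card is a
# callable returning card(q); here it is stubbed by a dict lookup instead of
# a real search-engine call.
def sample_query_by_cardinality(P, card, k):
    while True:
        q = random.choice(P)               # sample a query uniformly from P
        if random.random() < card(q) / k:  # accept with "cardinality weight"
            return q                       # q now follows the cardinality dist.

card_table = {"q1": 2, "q2": 6}            # hypothetical cardinalities, k = 10
q = sample_query_by_cardinality(list(card_table), card_table.get, k=10)
```

With these toy numbers, "q2" is returned three times as often as "q1" (6 vs. 2), i.e., proportionally to cardinality.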

SLIDE 16

Dealing with Overflowing Queries

Problem: some queries may overflow (card(q) > k), causing a bias towards highly ranked documents.

Solutions:
- Select a pool P in which overflowing queries are rare (e.g., phrase queries)
- Skip overflowing queries
- Adapt rejection sampling to deal with approximate weights

Theorem: samples of the PB sampler are at most β-away from uniform (β = overflow probability of P).
SLIDE 17

Bias towards Long Documents

[Chart: percent of documents from the sample (0%–60%) in each decile of documents ordered by size, comparing the Pool-Based, Random Walk, and Bharat-Broder samplers.]

SLIDE 18

Relative Sizes of Google, MSN and Yahoo!

- Google = 1
- Yahoo! = 1.28
- MSN Search = 0.73

SLIDE 19

Conclusions

- Two new search engine samplers: a pool-based sampler and a random walk sampler
- The samplers are guaranteed to produce near-uniform samples, under plausible assumptions
- The samplers show little or no bias in experiments

SLIDE 20

Thank You

SLIDE 21

Top-Level Domains in Google, MSN and Yahoo!

[Chart: percent of documents from the sample (0%–60%) per top-level domain name, for Google, MSN, and Yahoo!]

SLIDE 22

Query Cardinality Distribution

- results(q) = { documents returned on query q }
- card(q) = |results(q)|
- Cardinality distribution: π(q) = card(q) / Σq’∈P card(q’)

Unrealistic assumptions:
- We can sample queries from the cardinality distribution; in practice, we don't know card(q) a priori for all q ∈ P
- ∀q ∈ P, 1 ≤ card(q) ≤ k; in practice, some queries underflow (card(q) = 0) or overflow (card(q) > k)

SLIDE 23

Degree Distribution Sampler

[Diagram: the cardinality distribution sampler produces a query q sampled from the cardinality distribution; the search engine returns results(q); sampling x uniformly from results(q) yields a document sampled from the degree distribution.]

SLIDE 24

Cardinality Distribution Sampler

[Diagram: the uniform query sampler produces uniform samples q1, q2, … from P; the search engine returns card(q1), card(q2), …; rejection sampling with weights (q1, card(q1)/k), (q2, card(q2)/k), … yields a query sampled from the cardinality distribution.]

SLIDE 25

Complete Pool-Based Sampler

[Diagram: the uniform query sampler produces a uniform query sample; rejection sampling with weights (q, card(q)/k), … yields a query sampled from the cardinality distribution; the degree distribution sampler sends it to the search engine, receives (q, results(q)), …, and outputs documents sampled from the degree distribution with weights (x, 1/deg(x)), …; a final rejection sampling step yields a uniform document sample x.]
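The whole pipeline fits in a few lines once the search engine is stubbed. A minimal end-to-end sketch, assuming a dict stands in for results(q) and the toy pool, degrees, and k are illustrative, not the paper's experimental setup:

```python
import random

# End-to-end sketch of the complete pool-based sampler:
#   1. sample a query uniformly from P, accept with card(q)/k
#      (rejection step: simulates the cardinality distribution)
#   2. pick a uniform result of q -> document sampled by degree
#   3. accept with weight 1/deg(x) -> near-uniform document sample
def pool_based_sample(P, results, deg, k):
    while True:
        q = random.choice(P)                       # uniform query sample from P
        if random.random() >= len(results[q]) / k:
            continue                               # reject q: weight card(q)/k
        x = random.choice(results[q])              # x follows the degree dist.
        if random.random() < 1.0 / deg[x]:         # accept x: weight 1/deg(x)
            return x

# Toy index: two queries, four documents; "b" appears in both result sets.
results = {"q1": ["a", "b"], "q2": ["b", "c", "d"]}
deg = {}
for docs in results.values():
    for x in docs:
        deg[x] = deg.get(x, 0) + 1                 # deg(x) = |queries(x)|
x = pool_based_sample(list(results), results, deg, k=10)
```

Even though "b" has degree 2 and is proposed twice as often in step 2, the final rejection step brings all four documents back to roughly equal frequency.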

SLIDE 26

A random walk sampler

- Define a graph G over the indexed documents: (x,y) ∈ E iff queries(x) ∩ queries(y) ≠ ∅
- Run a random walk on G; its limit distribution = the degree distribution
- Use MCMC methods (Metropolis-Hastings, Maximum-Degree) to make the limit distribution uniform
- Does not need a preprocessing step
- Less efficient than the pool-based sampler
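A Metropolis-Hastings version of the walk can be sketched as follows. The queries(x) and results(q) maps are stubbed by dicts; in the real sampler they come from the document's own phrases and the search engine. Moving from x to a uniform result y of a uniform query of x proposes neighbours with degree-biased probability; accepting with min(1, deg(x)/deg(y)) cancels that bias, so the limit distribution is uniform:

```python
import random

# Random walk sampler with a Metropolis-Hastings correction. The proposal
# probability from x to y is sum over shared queries q of 1/(deg(x)*card(q)),
# so the ratio of reverse to forward proposals is deg(x)/deg(y); accepting
# with min(1, deg(x)/deg(y)) makes the uniform distribution stationary.
def mh_random_walk(x0, queries, results, deg, steps):
    x = x0
    for _ in range(steps):
        q = random.choice(queries[x])      # a random query that returns x
        y = random.choice(results[q])      # a random neighbour via that query
        if random.random() < min(1.0, deg[x] / deg[y]):
            x = y                          # accept the move; otherwise stay
    return x

# Toy graph: three documents; "b" shares a query with both "a" and "c".
results = {"q1": ["a", "b"], "q2": ["b", "c"]}
queries = {"a": ["q1"], "b": ["q1", "q2"], "c": ["q2"]}
deg = {x: len(qs) for x, qs in queries.items()}
x = mh_random_walk("a", queries, results, deg, steps=50)
```

After enough steps the returned document is (approximately) uniform over {a, b, c}, even though the uncorrected walk would visit "b" twice as often.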