Effective Change Detection Using Sampling, Junghoo "John" Cho



SLIDE 1

Effective Change Detection Using Sampling

Junghoo “John” Cho, Alexandros Ntoulas
UCLA

SLIDE 2

Junghoo "John" Cho (UCLA Computer Science) 2

Problem

[Figure: a local database polls a remote database for updates; queries are answered from the local database]

Application

Web search engines/crawlers
Web archive
Data warehouse
. . .

SLIDE 3

Existing Approach

Round robin
  Download pages in a round robin manner

Change-frequency based [CLW98, CGM00, EMT01]
  Estimate the change frequency
  Adjust download frequency
  Proven to be optimal

SLIDE 4

Our Approach

Sampling-based

  Sample k pages from each source
  Download more pages from the source with more changed samples
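The sampling step can be sketched as follows. This is only an illustration under assumptions: the function name `count_changed_samples` and the `has_changed` callback (which would, in practice, download a page and compare it with the local copy) are hypothetical and do not come from the talk.

```python
import random


def count_changed_samples(pages, has_changed, k, rng=None):
    """Sample up to k pages from one source and count how many changed.

    pages       -- list of page identifiers for a single source
    has_changed -- hypothetical callback: True if the page differs
                   from the local copy (this costs one download)
    Returns (number of changed samples, number of samples taken).
    """
    if rng is None:
        rng = random.Random()
    sample = rng.sample(pages, min(k, len(pages)))
    changed = sum(1 for page in sample if has_changed(page))
    return changed, len(sample)
```

Sources whose samples show more changes would then receive a larger share of the remaining downloads.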

SLIDE 5

Comparison

Frequency based
  Proven to be optimal
  Change history required
  Difficult to estimate change frequency

Sampling based
  Can be worse than frequency based policy
  No history/frequency-estimation required

Experimental comparison later

SLIDE 6

Questions

Are we assuming correlation?

How to use sampling results?
  Proportional vs Greedy

How many samples?
  Dynamic sample size adjustment?

What if we have very limited resources?

SLIDE 7

Is Correlation Necessary?

Correlation is not necessary; only random sampling is

More discussion later

[Figure: random sampling from two sites; 4/5 of the samples changed in one site and 1/5 in the other]

SLIDE 8

Questions

Are we assuming correlation?

How to use sampling results?
  Proportional vs Greedy

How many samples?
  Dynamic sample size adjustment?

What if we have very limited resources?

SLIDE 9

Download Model (1)

Fixed download cycle
  Say, once a month

Fixed download resources in each cycle
  Say, 100,000 page downloads every month

Goal
  Download as many changes as we can

$$
\text{ChangeRatio} = \frac{\text{no. of changed and downloaded pages}}{\text{no. of downloaded pages}}
$$
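The ChangeRatio metric can be computed directly from two sets of page identifiers. A minimal sketch; the function name and the choice to return 0.0 when nothing was downloaded are assumptions made here for illustration.

```python
def change_ratio(downloaded, changed):
    """Fraction of the downloaded pages that had actually changed.

    downloaded -- iterable of page ids fetched in this cycle
    changed    -- iterable of page ids that changed at the source
    """
    downloaded = set(downloaded)
    if not downloaded:
        # Assumption: define the ratio as 0 when nothing was downloaded.
        return 0.0
    return len(downloaded & set(changed)) / len(downloaded)


# Four downloads, two of which had changed ("x" changed but was not fetched).
print(change_ratio(["a", "b", "c", "d"], ["a", "b", "x"]))  # prints 0.5
```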

SLIDE 10

Download Model (2)

Two-stage sampling policy
  Sampling stage
  Download stage

Sampling requires page download

SLIDE 11

How to Use Sampling Result?

Sites A and B, each with 20 pages
20 total downloads, 5 samples from each site
10 page downloads remaining

[Figure: 4/5 of the samples from site A changed; 1/5 of the samples from site B changed]

SLIDE 12

Proportional Policy

Download pages proportionally to the detected changes
  8 pages from A, 2 pages from B

[Figure: 4/5 of the samples from site A changed; 1/5 of the samples from site B changed]
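The proportional split for this two-site example can be sketched as below. The function name, the largest-remainder rounding, and the handling of the no-changes case are assumptions for illustration; the slide only states the resulting 8/2 split. The sketch also ignores the cap imposed by the number of unsampled pages per site.

```python
def proportional_allocation(changed_samples, budget):
    """Split `budget` downloads across sites in proportion to the
    number of changed samples seen at each site.

    Rounding uses the largest-remainder method (an assumption; the
    slide does not say how fractional shares are handled).
    """
    total = sum(changed_samples.values())
    if total == 0:
        # Assumption: with no detected changes, allocate nothing.
        return {site: 0 for site in changed_samples}
    alloc, remainders = {}, {}
    for site, count in changed_samples.items():
        alloc[site], remainders[site] = divmod(budget * count, total)
    leftover = budget - sum(alloc.values())
    for site in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        alloc[site] += 1
    return alloc


# A: 4 of 5 samples changed, B: 1 of 5; 10 downloads remain.
print(proportional_allocation({"A": 4, "B": 1}, 10))  # prints {'A': 8, 'B': 2}
```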

SLIDE 13

Greedy Policy

Download pages from the sites with most changes
  10 pages from A

[Figure: 4/5 of the samples from site A changed; 1/5 of the samples from site B changed]
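A greedy allocation for the same example might look like the sketch below. The names are hypothetical, and the rule of moving on to the next-best site once a site's unsampled pages are exhausted is an assumption; the slide only shows the case where site A absorbs all 10 downloads.

```python
def greedy_allocation(sample_fraction, remaining_pages, budget):
    """Spend the download budget on the sites with the highest
    fraction of changed samples, best site first.

    sample_fraction -- site -> fraction of its samples that changed
    remaining_pages -- site -> pages not yet downloaded at that site
    """
    alloc = {site: 0 for site in sample_fraction}
    ranked = sorted(sample_fraction, key=sample_fraction.get, reverse=True)
    for site in ranked:
        if budget == 0:
            break
        take = min(budget, remaining_pages[site])
        alloc[site] = take
        budget -= take
    return alloc


# Each site has 20 pages, 5 already sampled, so 15 remain per site.
print(greedy_allocation({"A": 0.8, "B": 0.2}, {"A": 15, "B": 15}, 10))
# prints {'A': 10, 'B': 0}
```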

SLIDE 14

Optimality of Greedy

Theorem

  Greedy is optimal if we make download decisions purely based on sampling results
  The optimality is probabilistic: it holds for the expected values
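As a worked check on the two-site example from the earlier slides (A: 4 of 5 samples changed, B: 1 of 5, 10 downloads remaining), assume each site's sample fraction is taken as the probability that one of its unsampled pages has changed. This assumption is made here for illustration only. The expected number of changed pages downloaded is then:

$$
\begin{aligned}
E[\text{Proportional}] &= 8 \cdot \tfrac{4}{5} + 2 \cdot \tfrac{1}{5} = 6.8 \\
E[\text{Greedy}] &= 10 \cdot \tfrac{4}{5} = 8
\end{aligned}
$$

Under that assumption, greedy yields more expected changes than proportional, in line with the theorem.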

SLIDE 15

Questions

Are we assuming correlation?

How to use sampling results?
  Proportional vs Greedy

How many samples?
  Dynamic sample size adjustment?

What if we have very limited resources?

SLIDE 16

How Many Samples?

Too few samples

Inaccurate change estimates

Too many samples

“Waste” of resources for sampling

How to determine optimal sample size?

SLIDE 17

Optimal Sample Size

Factors to consider

  Total number of pages that we maintain
  Number of pages that we can download in the current cycle
  Number of pages in each Web site
  Change distribution
    Scenario 1 -- A: 90/100, B: 10/100
    Scenario 2 -- A: 60/100, B: 40/100

SLIDE 18

Change Fraction Distribution

[Figure: plot of f(ρ), with ρ on the horizontal axis, fraction of sites on the vertical axis, and the threshold ρt marked]

ρi: fraction of changed pages in site i
f(ρ): distribution of ρ values

SLIDE 19

Optimal Sample Size

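N: no. of pages in a site
r: no. of pages to download / no. of pages we maintain

Analysis is complex

  • $\sqrt{Nr}$ is a good rule of thumb

[Equation: the exact optimal sample size expression is garbled in this copy; the surviving fragments involve N, r, f(ρt), and ρt]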
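A quick numeric check, assuming the rule of thumb on this slide is the square root of N·r. That reading is an inference: it matches the value of about 20 quoted on the later results slide when the experiment's figures are plugged in (about 1,400 pages per site, 100,000 downloads per cycle over 353,000 maintained pages). The function name is hypothetical.

```python
import math


def rule_of_thumb_sample_size(pages_per_site, download_ratio):
    """sqrt(N * r), with N pages per site and r the ratio of pages
    downloaded per cycle to pages maintained (assumed reading of the
    slide's rule of thumb)."""
    return math.sqrt(pages_per_site * download_ratio)


r = 100_000 / 353_000  # downloads per cycle / pages maintained
print(round(rule_of_thumb_sample_size(1400, r)))  # prints 20
```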

SLIDE 20

Dynamic Sample Size?

Do we need the same sample size for every site?
  A: ρ = 0, B: ρ = 0.45, C: ρ = 0.55, D: ρ = 1

SLIDE 21

Adaptive Sampling

If the estimated ρ is high/low enough, make an early decision

What does “high enough” mean?
  Confidence interval above threshold

[Figure: ρ axis with the threshold ρt marked and confidence intervals drawn around three ρi estimates]
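The early-decision idea can be sketched as follows. The talk does not say which confidence interval is used; this sketch assumes a Wilson score interval at roughly 95% confidence (z = 1.96), and the function names and return labels are hypothetical.

```python
import math


def wilson_interval(changed, n, z=1.96):
    """Wilson score interval for a binomial proportion."""
    if n == 0:
        return 0.0, 1.0
    p = changed / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def adaptive_decision(sample_results, threshold, z=1.96):
    """Inspect samples one at a time and stop as soon as the interval
    for the site's change fraction lies entirely on one side of the
    threshold. Returns (decision, number of samples used)."""
    changed = 0
    n = 0
    for n, page_changed in enumerate(sample_results, start=1):
        changed += bool(page_changed)
        low, high = wilson_interval(changed, n, z)
        if low > threshold:
            return "download", n
        if high < threshold:
            return "skip", n
    return "undecided", n
```

A site whose samples all changed (or all stayed the same) is decided after a handful of samples, while a site whose change fraction sits near the threshold keeps consuming samples.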

SLIDE 22

In the Paper

More details on
  Optimal sample size
  Adaptive policy

The cases where resources are too limited for sampling

SLIDE 23

Experiments

353,000 pages from 252 sites
  Mostly popular sites
    Yahoo, CNN, Microsoft, …
  ~ 1400 pages from each site
  Followed the links in the breadth-first manner

Monthly change history for 6 months
  5 download cycles

In experiments, 100,000 page downloads in each download cycle

SLIDE 24

Comparison of Policies

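[Figure: bar chart of ChangeRatio (axis 0.1 to 0.8) for the policies RR (round robin), FRQ (frequency based), PRP (proportional), GRD (greedy), and ADP (adaptive)]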

SLIDE 25

Optimal Sample Size

[Figure: ChangeRatio (axis 0.4 to 0.8) plotted against sample size (axis 50 to 250)]

Optimal sample size ~ 10 through 60
$\sqrt{Nr}$ ~ 20

SLIDE 26

Comparison of Long-Term Performance

Problem: We have only 5-download-cycle data

Solution: Extrapolate the history

[Figure: the observed change history is repeated to fill in the unknown later cycles]
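One way to read "repeat" is to tile the observed cycles end to end until the desired horizon is reached. The sketch below assumes exactly that; the talk does not spell out the extrapolation procedure, and the function name is hypothetical.

```python
from itertools import cycle, islice


def extrapolate_history(observed_cycles, total_cycles):
    """Extend a short per-cycle change history by repeating it.

    observed_cycles -- per-cycle records (e.g. sets of changed pages)
    total_cycles    -- number of cycles wanted in the extended history
    """
    return list(islice(cycle(observed_cycles), total_cycles))


# Five observed cycles stretched to twelve.
print(extrapolate_history([1, 2, 3, 4, 5], 12))
# prints [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2]
```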

SLIDE 27

Frequency vs. Sampling

[Figure: ChangeRatio (axis 0.5 to 0.9) over download cycles (axis 100 to 400) for the Frequency and Greedy policies]

SLIDE 28

Related Work

Frequency-based policy
  Coffman et al., Journal of Scheduling 1998
  Cho et al., SIGMOD 2000
  Edwards et al., WWW 2001

Source cooperation
  Olston et al., SIGMOD 2002

SLIDE 29

Conclusion

Sampling-based policy
  Great short-term performance
  No change history required

Frequency-based policy
  Potentially good long-term performance if the change frequency does not change

Greedy is easy to implement and shows high performance

SLIDE 30

Future Work

Combination of sampling and frequency based policies
  Switch to the frequency-based policy after a while

Good partitioning for sampling?
  Site based?
  Directory based?
  Content based?
  Link-structure based?