Effective Change Detection Using Sampling
Junghoo "John" Cho, Alexandros Ntoulas
UCLA
Junghoo "John" Cho (UCLA Computer Science) 2
Application
Web search engines/crawlers
Web archive
Data warehouse
. . .
Problem
[Figure: a local database polls a remote database; queries go to the local database, updates arrive at the remote database]
Existing Approach
Round robin
Download pages in a round robin manner
Change-frequency based [CLW98, CGM00, EMT01]
Estimate the change frequency
Adjust download frequency
Proven to be optimal
Our Approach
Sampling-based
Sample k pages from each source
Download more pages from the source with more changed samples
Comparison
Frequency based
Proven to be optimal
Change history required
Difficult to estimate change frequency
Sampling based
Can be worse than frequency based policy
No history/frequency-estimation required
Experimental comparison later
Questions
Are we assuming correlation?
How to use sampling results?
Proportional vs Greedy
How many samples?
Dynamic sample size adjustment?
What if we have very limited resources?
Is Correlation Necessary?
Correlation among pages is not necessary; only random sampling is required
More discussion later
[Figure: random samples from two sites, 4/5 changed at one and 1/5 at the other]
Download Model (1)
Fixed download cycle
Say, once a month
Fixed download resources in each cycle
Say, 100,000 page downloads every month
Goal
Download as many changes as we can
ChangeRatio = (number of changed and downloaded pages) / (number of downloaded pages)
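As a minimal sketch of the ChangeRatio metric above, the following Python function treats the downloaded pages and the changed pages as collections of page identifiers. The function name and the set-based representation are assumptions made for illustration; they do not come from the slides.

```python
def change_ratio(downloaded, changed):
    """Fraction of downloaded pages that had actually changed.

    downloaded: iterable of page ids fetched in this cycle (hypothetical ids)
    changed:    iterable of page ids that changed since the last cycle
    """
    downloaded = set(downloaded)
    if not downloaded:
        return 0.0  # assumption: define the ratio as 0 when nothing was downloaded
    return len(downloaded & set(changed)) / len(downloaded)


# Four pages downloaded, two of them had changed ("z" changed but was not downloaded).
print(change_ratio(["a", "b", "c", "d"], ["b", "d", "z"]))  # 0.5
```

A changed page that was never downloaded does not count, so the metric rewards spending the fixed download budget on pages that turn out to have changed.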
Download Model (2)
Two-stage sampling policy
Sampling stage
Download stage
Sampling requires page download
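The sampling stage can be sketched roughly as follows. This is an illustration under stated assumptions: the `sites` dictionary layout, the `has_changed` callback (standing in for "download the page and compare it with the local copy"), and the particular set of changed pages are all hypothetical. The point it shows is that samples are paid for out of the same download budget as the download stage.

```python
import random


def sampling_stage(sites, k, has_changed, rng=None):
    """Download k random pages per site and count how many of them changed.

    sites:       dict mapping site name -> list of page ids (hypothetical layout)
    has_changed: callable(page_id) -> bool, a stand-in for download-and-compare
    """
    rng = rng or random.Random()
    results = {}
    for site, pages in sites.items():
        sample = rng.sample(pages, min(k, len(pages)))
        changed = sum(1 for page in sample if has_changed(page))
        results[site] = {"sample": sample, "changed": changed}
    return results


# Two sites with 20 pages each; which pages changed is made up for the example.
sites = {"A": [f"A{i}" for i in range(20)], "B": [f"B{i}" for i in range(20)]}
changed_pages = {f"A{i}" for i in range(16)} | {f"B{i}" for i in range(4)}

total_budget = 20
out = sampling_stage(sites, 5, changed_pages.__contains__, random.Random(0))
budget_left = total_budget - sum(len(r["sample"]) for r in out.values())
print(budget_left)  # 10 downloads remain for the download stage
```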
How to Use Sampling Result?
Sites A and B, each with 20 pages
20 total downloads, 5 samples from each site
10 page downloads remaining
Sampling result: 4 of 5 samples changed at A, 1 of 5 at B
Proportional Policy
Download pages proportionally to the detected changes
8 pages from A, 2 pages from B
(A: 4/5 samples changed, B: 1/5 samples changed)
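A small sketch of the proportional policy, reproducing the 8/2 split for sites A and B. The function name is hypothetical, and two details are assumptions the slides do not specify: fractional shares are rounded with a largest-remainder rule, and the budget is split evenly when no sample changed. Site sizes are ignored for simplicity.

```python
def proportional_allocation(changed_samples, budget):
    """Split `budget` downloads across sites in proportion to changed samples.

    changed_samples: dict mapping site -> number of changed pages in its sample
    """
    if not changed_samples:
        return {}
    total = sum(changed_samples.values())
    if total == 0:
        # Assumption: with no detected changes, fall back to an even split.
        raw = {s: budget / len(changed_samples) for s in changed_samples}
    else:
        raw = {s: budget * c / total for s, c in changed_samples.items()}
    alloc = {s: int(share) for s, share in raw.items()}  # round down first
    leftover = budget - sum(alloc.values())
    # Assumption: hand leftover downloads to the largest fractional remainders.
    by_remainder = sorted(raw, key=lambda s: raw[s] - alloc[s], reverse=True)
    for s in by_remainder[:leftover]:
        alloc[s] += 1
    return alloc


print(proportional_allocation({"A": 4, "B": 1}, 10))  # {'A': 8, 'B': 2}
```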
Greedy Policy
Download pages from the sites with most changes
10 pages from A
(A: 4/5 samples changed, B: 1/5 samples changed)
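A sketch of the greedy policy for the same example. The function name and the tuple layout are hypothetical. It assumes that already-sampled pages are not downloaded again, so a site can absorb at most (site size minus sample size) further downloads, and that ties are broken by input order.

```python
def greedy_allocation(sample_stats, budget):
    """Spend `budget` on the sites with the highest sampled change fraction.

    sample_stats: dict mapping site -> (changed_samples, sample_size, site_size),
                  with sample_size > 0 (hypothetical layout)
    """
    # Rank sites by the fraction of their samples that had changed.
    order = sorted(
        sample_stats,
        key=lambda s: sample_stats[s][0] / sample_stats[s][1],
        reverse=True,
    )
    alloc = {s: 0 for s in sample_stats}
    for site in order:
        if budget == 0:
            break
        _changed, sample_size, site_size = sample_stats[site]
        take = min(budget, site_size - sample_size)  # unsampled pages left
        alloc[site] = take
        budget -= take
    return alloc


stats = {"A": (4, 5, 20), "B": (1, 5, 20)}
print(greedy_allocation(stats, 10))  # {'A': 10, 'B': 0}
```

With a larger remaining budget the policy fills site A first and only then moves on to B; for example, a budget of 20 yields 15 pages from A and 5 from B.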
Optimality of Greedy
Theorem
Greedy is optimal if we make download decisions purely based on sampling results
Optimality is probabilistic: it holds for the expected values
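To make the expected-value claim concrete for the two-site example, the sketch below compares the expected number of changed pages fetched in the download stage under both allocations. It assumes that the sampled change fraction of a site (4/5 for A, 1/5 for B) is the probability that any remaining page of that site has changed; the helper name is hypothetical. This illustrates the theorem on one example and is not a proof.

```python
from fractions import Fraction


def expected_changed(allocation, est_fraction):
    """Expected number of changed pages among the allocated downloads."""
    return sum(n * est_fraction[site] for site, n in allocation.items())


# Sampled change fractions, used as estimates for the remaining pages.
est = {"A": Fraction(4, 5), "B": Fraction(1, 5)}

proportional = expected_changed({"A": 8, "B": 2}, est)
greedy = expected_changed({"A": 10, "B": 0}, est)

print(float(proportional))  # 6.8
print(float(greedy))        # 8.0
```

Every download moved from A to B trades an expected 4/5 of a changed page for 1/5, which is why putting all ten remaining downloads on A gives the higher expectation.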
How Many Samples?
Too few samples
Inaccurate change estimates
Too many samples
“Waste” of resources for sampling
How to determine optimal sample size?
Optimal Sample Size
Factors to consider
Total number of pages that we maintain
Number of pages that we can download in the current cycle
Number of pages in each Web site
Change distribution
Scenario 1 -- A: 90/100, B: 10/100
Scenario 2 -- A: 60/100, B: 40/100
Change Fraction Distribution
[Figure: distribution f(ρ); x-axis: ρ, y-axis: fraction of sites; threshold ρt marked on the x-axis]
ρi: fraction of changed pages in site i
f(ρ): distribution of ρ values
Optimal Sample Size
N: number of pages in a site
r: (number of pages to download) / (number of pages we maintain)
Analysis is complex
√(Nr) is a good rule of thumb
[Equation: optimal sample size expressed in terms of N, r, f(ρt) and (ρr − ρ)]
Dynamic Sample Size?
Do we need the same sample size for every site?
A: ρ = 0, B: ρ = 0.45, C: ρ = 0.55, D: ρ = 1
Adaptive Sampling
If the estimated ρ is high/low enough, make an early decision
What does “high enough” mean?
Confidence interval above threshold
[Figure: ρ axis with threshold ρt and confidence intervals around the estimates ρi of three sites]
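The early-decision idea can be sketched as follows. The slides do not say which confidence interval is used, so the Wilson score interval at roughly 95% (z = 1.96) is an assumption here, as are the function names, the `max_samples` cap and the three outcome labels. Sampling stops as soon as the whole interval for ρ lies above or below the threshold ρt.

```python
import math


def wilson_interval(changed, n, z=1.96):
    """Wilson score interval for a proportion (an assumed choice of interval)."""
    if n == 0:
        return 0.0, 1.0
    p = changed / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def adaptive_decision(observations, threshold, max_samples):
    """Sample one page at a time; stop once the interval clears the threshold.

    observations: iterable of booleans, True if the sampled page had changed
    Returns (decision, samples_used).
    """
    changed = 0
    n = 0
    for n, page_changed in enumerate(observations, start=1):
        changed += bool(page_changed)
        low, high = wilson_interval(changed, n)
        if low > threshold:
            return "download", n  # confidently above the threshold
        if high < threshold:
            return "skip", n      # confidently below the threshold
        if n >= max_samples:
            break
    return "undecided", n


print(adaptive_decision([True] * 20, threshold=0.5, max_samples=20))
print(adaptive_decision([True, False] * 5, threshold=0.5, max_samples=10))
```

A site whose pages all changed is decided after a handful of samples, while a site whose change fraction sits near the threshold uses the full sample allowance, which matches the motivation of not spending the same sample size on every site.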
In the Paper
More details on
Optimal sample size
Adaptive policy
Cases where resources are too limited for sampling
Experiments
353,000 pages from 252 sites
Mostly popular sites
Yahoo, CNN, Microsoft, …
~ 1400 pages from each site
Followed links in breadth-first order
Monthly change history for 6 months
5 download cycles
In experiments, 100,000 page downloads in each download cycle
Comparison of Policies
[Figure: bar chart of ChangeRatio (y-axis, 0.1 to 0.8) for the policies RR (round robin), FRQ (frequency based), PRP (proportional), GRD (greedy), ADP (adaptive)]
Optimal Sample Size
[Figure: ChangeRatio (y-axis, 0.4 to 0.8) vs. sample size (x-axis, 50 to 250)]
Optimal sample size ~ 10 through 60
√(Nr) ~ 20
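As a quick arithmetic check, the sketch below plugs the experimental setup (about 1400 pages per site, 100,000 downloads per cycle, 353,000 pages maintained) into a square-root rule of thumb. Reading the rule as the square root of N times r is an assumption based on the symbols in the slides; the function and parameter names are hypothetical.

```python
import math


def rule_of_thumb_sample_size(pages_per_site, download_budget, pages_maintained):
    """sqrt(N * r), with r = download budget / number of pages maintained."""
    r = download_budget / pages_maintained
    return math.sqrt(pages_per_site * r)


size = rule_of_thumb_sample_size(1400, 100_000, 353_000)
print(round(size))  # 20
```

The result of about 20 falls inside the experimentally good range of roughly 10 through 60 samples per site.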
Comparison of Long-Term Performance
Problem: We have data for only 5 download cycles
Solution: Extrapolate the history by repeating the observed cycles
Frequency vs. Sampling
[Figure: ChangeRatio (y-axis, 0.5 to 0.9) vs. download cycle (x-axis, 100 to 400) for the Frequency and Greedy policies]
Related Work
Frequency-based policy
Coffman et al., Journal of Scheduling 1998
Cho et al., SIGMOD 2000
Edwards et al., WWW 2001
Source cooperation
Olston et al., SIGMOD 2002
Conclusion
Sampling-based policy
Great short-term performance
No change history required
Frequency-based policy
Potentially good long-term performance if the change frequency does not change
Greedy is easy to implement and shows high performance
Future Work
Combination of sampling and frequency based policies
Switch to the frequency-based policy after a while
Good partitioning for sampling?
Site based? Directory based? Content based? Link-structure based?