
Effective Change Detection Using Sampling - Junghoo John Cho



  1. Effective Change Detection Using Sampling
     Junghoo “John” Cho, Alexandros Ntoulas
     UCLA

  2. Problem
     [Figure labels: Polling, Update, Query, Local database, Remote database]
     - Application
       - Web search engines/crawlers
       - Web archive
       - Data warehouse
       - ...
     Junghoo "John" Cho (UCLA Computer Science)

  3. Existing Approach
     - Round robin
       - Download pages in a round robin manner
     - Change-frequency based [CLW98, CGM00, EMT01]
       - Estimate the change frequency
       - Adjust download frequency
       - Proven to be optimal

  4. Our Approach
     - Sampling-based
       - Sample k pages from each source
       - Download more pages from the source with more changed samples
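
A minimal Python sketch of the sampling step, under stated assumptions: `sites` maps a site name to a list of page identifiers, and `has_changed` is a hypothetical predicate that downloads a page and compares it with the local copy. Neither name comes from the slides.

```python
import random


def estimate_change_fractions(sites, k, has_changed, rng=None):
    """Sample up to k pages per site; return the fraction of samples that changed."""
    rng = rng or random.Random()
    estimates = {}
    for site, pages in sites.items():
        # Uniform random sample without replacement (pages must be a sequence).
        sample = rng.sample(pages, min(k, len(pages)))
        changed = sum(1 for page in sample if has_changed(page))
        estimates[site] = changed / len(sample) if sample else 0.0
    return estimates
```

Sites with a higher estimated fraction would then receive more of the remaining download budget.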

  5. Comparison
     - Frequency based
       - Proven to be optimal
       - Change history required
       - Difficult to estimate change frequency
     - Sampling based
       - Can be worse than frequency based policy
       - No history/frequency-estimation required
     - Experimental comparison later

  6. Questions
     - Are we assuming correlation?
     - How to use sampling results?
       - Proportional vs Greedy
     - How many samples?
       - Dynamic sample size adjustment?
     - What if we have very limited resources?

  7. Is Correlation Necessary?
     - Random sampling
     [Figure: sampled change fractions 4/5 and 1/5]
     - Correlation not necessary. Only random sampling
     - More discussion later

  8. Questions
     - Are we assuming correlation?
     - How to use sampling results?
       - Proportional vs Greedy
     - How many samples?
       - Dynamic sample size adjustment?
     - What if we have very limited resources?

  9. Download Model (1)
     - Fixed download cycle
       - Say, once a month
     - Fixed download resources in each cycle
       - Say, 100,000 page downloads every month
     - Goal
       - Download as many changes as we can
       - $\text{ChangeRatio} = \dfrac{\text{no. of changed and downloaded pages}}{\text{no. of downloaded pages}}$
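
The ChangeRatio metric can be sketched in a few lines of Python. The function name and the use of sets of page identifiers are illustrative assumptions, not part of the slides.

```python
def change_ratio(downloaded, changed):
    """Fraction of downloaded pages that had actually changed."""
    downloaded = set(downloaded)
    if not downloaded:
        return 0.0  # assumption: define the ratio as 0 when nothing was downloaded
    return len(downloaded & set(changed)) / len(downloaded)
```

For example, downloading pages `a, b, c, d` when `a` and `c` changed gives a ratio of 0.5; changed pages that were never downloaded do not count.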

  10. Download Model (2)
      - Two-stage sampling policy
        - Sampling stage
        - Download stage
      - Sampling requires page download

  11. How to Use Sampling Result?
      - Sites A and B, each with 20 pages
      - 20 total downloads, 5 samples from each site
      - 10 page downloads remaining
      [Figure: sites A and B with sampled change fractions 4/5 and 1/5]
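
Because sampling itself costs page downloads, the budget left for the download stage is the total minus the samples. A small sketch of that arithmetic, with hypothetical function and parameter names:

```python
def download_stage_budget(total_budget, num_sites, samples_per_site):
    """Downloads left for the second stage after sampling every site."""
    sampling_cost = num_sites * samples_per_site
    if sampling_cost > total_budget:
        raise ValueError("sampling alone would exceed the download budget")
    return total_budget - sampling_cost
```

With the numbers on this slide (20 total downloads, 2 sites, 5 samples each), 10 downloads remain.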

  12. Proportional Policy
      - Download pages proportionally to the detected changes
      - 8 pages from A, 2 pages from B
      [Figure: sites A and B with sampled change fractions 4/5 and 1/5]
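
A sketch of the proportional split, assuming A is the site with 4 of 5 changed samples and B the one with 1 of 5 (this matches the 8/2 allocation above). The largest-remainder rounding is an illustrative choice for non-integer shares, not something the slides specify.

```python
def proportional_allocation(changed_samples, budget):
    """Split `budget` downloads across sites in proportion to changed samples."""
    total = sum(changed_samples.values())
    if total == 0:
        # Assumption: with no detected changes, allocate nothing.
        return {site: 0 for site in changed_samples}
    exact = {s: budget * c / total for s, c in changed_samples.items()}
    alloc = {s: int(x) for s, x in exact.items()}  # round each share down
    leftover = budget - sum(alloc.values())
    # Hand out the rounding leftovers to the largest fractional remainders.
    by_remainder = sorted(exact, key=lambda s: exact[s] - alloc[s], reverse=True)
    for s in by_remainder[:leftover]:
        alloc[s] += 1
    return alloc
```

Calling `proportional_allocation({"A": 4, "B": 1}, 10)` reproduces the 8 and 2 pages on this slide.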

  13. Greedy Policy
      - Download pages from the sites with most changes
      - 10 pages from A
      [Figure: sites A and B with sampled change fractions 4/5 and 1/5]
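
A sketch of the greedy allocation, assuming each site's estimated change fraction and its count of not-yet-downloaded pages are known. Spilling over to the next-best site once a site is exhausted is an assumption; the slide only shows the case where A can absorb the whole budget.

```python
def greedy_allocation(estimates, remaining_pages, budget):
    """Spend the budget on the sites with the highest estimated change fraction."""
    alloc = {site: 0 for site in estimates}
    # Visit sites from most-changed to least-changed estimate.
    for site in sorted(estimates, key=estimates.get, reverse=True):
        if budget <= 0:
            break
        take = min(budget, remaining_pages[site])
        alloc[site] = take
        budget -= take
    return alloc
```

With estimates of 0.8 for A and 0.2 for B, 15 unsampled pages in each site, and a budget of 10, all 10 downloads go to A.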

  14. Optimality of Greedy
      - Theorem
        - Greedy is optimal if we make download decisions purely based on sampling results
        - Probabilistic optimality for their expected values

  15. Questions
      - Are we assuming correlation?
      - How to use sampling results?
        - Proportional vs Greedy
      - How many samples?
        - Dynamic sample size adjustment?
      - What if we have very limited resources?

  16. How Many Samples?
      - Too few samples
        - Inaccurate change estimates
      - Too many samples
        - “Waste” of resources for sampling
      - How to determine optimal sample size?

  17. Optimal Sample Size
      - Factors to consider
        - Total number of pages that we maintain
        - Number of pages that we can download in the current cycle
        - Number of pages in each Web site
        - Change distribution
          - Scenario 1 -- A: 90/100, B: 10/100
          - Scenario 2 -- A: 60/100, B: 40/100

  18. Change Fraction Distribution
      [Figure: curve $f(\rho)$ over $\rho$; y-axis: fraction of sites; $\rho_t$ marked on the $\rho$ axis]
      - $\rho_i$: fraction of changed pages in site $i$
      - $f(\rho)$: distribution of $\rho$ values

  19. Optimal Sample Size
      [Formula garbled in extraction; surviving terms: $N r\, f(\rho_t)$ and $6(\rho_r - \rho)$]
      - $N$: no. of pages in a site
      - $r$: no. of pages to download / no. of pages we maintain
      - Analysis is complex
      - $\sqrt{Nr}$ is a good rule of thumb

  20. Dynamic Sample Size?
      - Do we need the same sample size for every site?
      - A: $\rho = 0$, B: $\rho = 0.45$, C: $\rho = 0.55$, D: $\rho = 1$

  21. Adaptive Sampling
      - If the estimated $\rho$ is high/low enough, make an early decision
      - What does “high enough” mean?
        - Confidence interval above threshold
      [Figure: confidence intervals around three $\rho_i$ estimates on the $\rho$ axis, relative to $\rho_t$]
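
A sketch of the early-decision idea. The slides do not say which confidence interval is used; this example assumes a Wilson score interval at roughly 95% (z = 1.96), and the names `adaptive_decision`, `threshold`, and `max_samples` are hypothetical.

```python
import math


def wilson_interval(changed, n, z=1.96):
    """Wilson score interval for a proportion of `changed` out of `n` samples."""
    p = changed / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def adaptive_decision(observations, threshold, max_samples, z=1.96):
    """Sample one page at a time; stop once the interval clears the threshold."""
    changed = 0
    n = 0
    for n, is_changed in enumerate(observations, start=1):
        changed += bool(is_changed)
        low, high = wilson_interval(changed, n, z)
        if low > threshold:
            return "download", n  # confidently above the threshold
        if high < threshold:
            return "skip", n  # confidently below the threshold
        if n >= max_samples:
            break
    return "undecided", n
```

Sites whose changes are all-or-nothing are decided after a handful of samples, while a site near the threshold uses its full sample allowance.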

  22. In the Paper
      - More details on
        - Optimal sample size
        - Adaptive policy
        - The cases where resource is too limited for sampling

  23. Experiments
      - 353,000 pages from 252 sites
        - Mostly popular sites
          - Yahoo, CNN, Microsoft, …
        - ~1400 pages from each site
          - Followed the links in a breadth-first manner
      - Monthly change history for 6 months
        - 5 download cycles
      - In experiments, 100,000 page downloads in each download cycle

  24. Comparison of Policies
      [Figure: bar chart of ChangeRatio (y-axis, 0 to 0.8) for the policies RR (round robin), FRQ (frequency-based), PRP (proportional), GRD (greedy), and ADP (adaptive)]

  25. Optimal Sample Size
      [Figure: ChangeRatio (y-axis, 0.4 to 0.8) versus sample size (x-axis, 0 to 250)]
      - Optimal sample size ~ 10 through 60
      - $\sqrt{Nr}$ ~ 20
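
The value of about 20 can be checked against the experimental setup, assuming the rule of thumb is the square root of N times r (the radical sign appears to have been lost in the transcript) and using the figures from the Experiments slide: about 1400 pages per site, and 100,000 downloads per cycle out of 353,000 maintained pages.

```python
import math


def rule_of_thumb_sample_size(pages_per_site, downloads_per_cycle, pages_maintained):
    """sqrt(N * r), where r is the fraction of maintained pages downloadable per cycle."""
    r = downloads_per_cycle / pages_maintained
    return math.sqrt(pages_per_site * r)


# Figures quoted on the Experiments slide; the result is a little under 20.
size = rule_of_thumb_sample_size(1400, 100_000, 353_000)
print(round(size, 1))
```

The result falls inside the empirically good range of 10 through 60 samples reported on this slide.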

  26. Comparison of Long-Term Performance
      - Problem: We have only 5 download cycles of data
      - Solution: Extrapolate the history
      [Figure label: Repeat]
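
One way to read "Repeat" is that the observed cycles are replayed in order to simulate a longer run. The sketch below assumes that reading; the slides do not spell out the exact extrapolation procedure, and `extrapolate_history` is a hypothetical name.

```python
from itertools import cycle, islice


def extrapolate_history(observed_cycles, total_cycles):
    """Repeat the observed per-cycle change records until total_cycles are covered."""
    return list(islice(cycle(observed_cycles), total_cycles))
```

For instance, five observed cycles could be stretched to several hundred simulated cycles for a long-term comparison.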

  27. Frequency vs. Sampling
      [Figure: ChangeRatio (y-axis, 0.5 to 0.9) versus download cycle (x-axis, 0 to 400) for the Frequency and Greedy policies]

  28. Related Work
      - Frequency-based policy
        - Coffman et al., Journal of Scheduling 1998
        - Cho et al., SIGMOD 2000
        - Edwards et al., WWW 2001
      - Source cooperation
        - Olston et al., SIGMOD 2002

  29. Conclusion
      - Sampling-based policy
        - Great short-term performance
        - No change history required
      - Frequency-based policy
        - Potentially good long-term performance if the change frequency does not change
      - Greedy is easy to implement and shows high performance

  30. Future Work
      - Combination of sampling and frequency based policies
        - Switch to the frequency-based policy after a while
      - Good partitioning for sampling?
        - Site based? Directory based?
        - Content based?
        - Link-structure based?
