Effective Change Detection Using Sampling
Junghoo "John" Cho, Alexandros Ntoulas
UCLA

Problem
[Diagram labels: Polling, Update, Query, Local database, Remote database]
• Application
  • Web search engines/crawlers
  • Web archive
  • Data warehouse
  • ...
Junghoo "John" Cho (UCLA Computer Science)

Existing Approach
• Round robin
  • Download pages in a round-robin manner
• Change-frequency based [CLW98, CGM00, EMT01]
  • Estimate the change frequency
  • Adjust download frequency
  • Proven to be optimal

Our Approach
• Sampling-based
  • Sample k pages from each source
  • Download more pages from the source with more changed samples

Comparison
• Frequency based
  • Proven to be optimal
  • Change history required
  • Difficult to estimate change frequency
• Sampling based
  • Can be worse than the frequency-based policy
  • No history/frequency estimation required
• Experimental comparison later

Questions
• Are we assuming correlation?
• How to use sampling results?
  • Proportional vs. Greedy
• How many samples?
• Dynamic sample size adjustment?
• What if we have very limited resources?

Is Correlation Necessary?
• Random sampling
[Figure: two sites whose samples show changed fractions of 4/5 and 1/5]
• Correlation not necessary, only random sampling
• More discussion later

Download Model (1)
• Fixed download cycle
  • Say, once a month
• Fixed download resources in each cycle
  • Say, 100,000 page downloads every month
• Goal
  • Download as many changes as we can
  • ChangeRatio = (No. of changed & downloaded pages) / (No. of downloaded pages)
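The ChangeRatio metric above can be sketched in a few lines. This is only an illustration: the function name `change_ratio` and the use of page identifiers in sets are assumptions made for the example, not part of the slides.

```python
def change_ratio(downloaded, changed):
    """Fraction of downloaded pages that had actually changed.

    downloaded: iterable of page ids fetched in this cycle
    changed:    iterable of page ids that changed since the last cycle
    """
    downloaded = set(downloaded)
    if not downloaded:
        # Nothing was downloaded, so no change could be captured.
        return 0.0
    # Numerator: pages that were both changed and downloaded.
    return len(downloaded & set(changed)) / len(downloaded)


# Four downloads, two of which had changed ("x" changed but was not fetched).
ratio = change_ratio(["a", "b", "c", "d"], ["a", "c", "x"])
```

A policy that spends its fixed budget on pages that turn out to be unchanged lowers this ratio, which is why the later slides compare policies by ChangeRatio.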

Download Model (2)
• Two-stage sampling policy
  • Sampling stage
  • Download stage
• Sampling requires page download

How to Use Sampling Result?
• Sites A and B, each with 20 pages
• 20 total downloads, 5 samples from each site
• 10 page downloads remaining
[Figure: sites A and B; 4/5 of A's samples and 1/5 of B's samples changed]

Proportional Policy
• Download pages proportionally to the detected changes
  • 8 pages from A, 2 pages from B
[Figure: sites A and B; 4/5 of A's samples and 1/5 of B's samples changed]
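A minimal sketch of the proportional split, assuming the remaining budget is divided by the number of changed samples per site. The function name, the largest-remainder rounding, and the even split when no sample changed are assumptions for illustration; the slides do not specify them, and per-site page limits are ignored here.

```python
def proportional_allocation(changed_samples, budget):
    """Split `budget` downloads across sites in proportion to changed samples.

    changed_samples: dict mapping site name -> number of changed sample pages
    budget:          number of page downloads left after the sampling stage
    """
    weights = dict(changed_samples)
    if sum(weights.values()) == 0:
        # Assumption: with no detected changes, fall back to an even split.
        weights = {site: 1 for site in weights}
    total = sum(weights.values())

    # Exact (possibly fractional) share for each site.
    exact = {site: budget * w / total for site, w in weights.items()}
    alloc = {site: int(share) for site, share in exact.items()}

    # Largest-remainder rounding so that the shares add up to the budget.
    leftover = budget - sum(alloc.values())
    by_remainder = sorted(exact, key=lambda s: exact[s] - alloc[s], reverse=True)
    for site in by_remainder[:leftover]:
        alloc[site] += 1
    return alloc


# Slide example: 4 of 5 samples changed in A, 1 of 5 in B, 10 downloads left.
shares = proportional_allocation({"A": 4, "B": 1}, 10)
```

With the slide's numbers this yields 8 downloads for A and 2 for B.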

Greedy Policy
• Download pages from the sites with most changes
  • 10 pages from A
[Figure: sites A and B; 4/5 of A's samples and 1/5 of B's samples changed]
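The greedy policy can be sketched as follows: rank sites by the fraction of changed samples and spend the remaining budget on the top-ranked site until its pages run out, then move to the next. The function and parameter names are hypothetical, and the explicit `remaining_pages` cap is an assumption added so the example handles sites smaller than the budget.

```python
def greedy_allocation(sample_fraction, remaining_pages, budget):
    """Spend the download budget on the sites with the most detected changes.

    sample_fraction: dict site -> fraction of sampled pages found changed
    remaining_pages: dict site -> pages not yet downloaded in this cycle
    budget:          number of page downloads left after the sampling stage
    """
    alloc = {site: 0 for site in sample_fraction}
    # Visit sites from the highest to the lowest sampled change fraction.
    for site in sorted(sample_fraction, key=sample_fraction.get, reverse=True):
        if budget == 0:
            break
        take = min(remaining_pages[site], budget)
        alloc[site] = take
        budget -= take
    return alloc


# Slide example: 20 pages per site, 5 already sampled from each, 10 downloads left.
plan = greedy_allocation({"A": 0.8, "B": 0.2}, {"A": 15, "B": 15}, 10)
```

For the slide's example all 10 remaining downloads go to A.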

Optimality of Greedy
• Theorem
  • Greedy is optimal if we make download decisions purely based on sampling results
  • Probabilistic optimality for their expected values

How Many Samples?
• Too few samples
  • Inaccurate change estimates
• Too many samples
  • "Waste" of resources for sampling
• How to determine optimal sample size?

Optimal Sample Size
• Factors to consider
  • Total number of pages that we maintain
  • Number of pages that we can download in the current cycle
  • Number of pages in each Web site
  • Change distribution
    • Scenario 1 -- A: 90/100, B: 10/100
    • Scenario 2 -- A: 60/100, B: 40/100

Change Fraction Distribution
[Figure: fraction of sites f(ρ) plotted against ρ, with ρ_t marked on the ρ axis]
• ρ_i: fraction of changed pages in site i
• f(ρ): distribution of ρ values

Optimal Sample Size
[Formula involving N r, f(ρ_t), and 6(ρ_r − ρ)]
• N: no. of pages in a site
• r: no. of pages to download / no. of pages we maintain
• Analysis is complex
• √(Nr) is a good rule of thumb

Dynamic Sample Size?
• Do we need the same sample size for every site?
• A: ρ = 0, B: ρ = 0.45, C: ρ = 0.55, D: ρ = 1

Adaptive Sampling
• If the estimated ρ is high/low enough, make an early decision
• What does "high enough" mean?
  • Confidence interval above threshold
[Figure: confidence intervals around the estimates ρ_i on the ρ axis, relative to ρ_t]
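One way to picture the early-decision rule is to compare a confidence interval for a site's change fraction against the threshold ρ_t. The sketch below assumes a Wilson score interval at roughly 95% confidence (z = 1.96); the slides do not say which interval the authors use, and the function names and the "download" / "skip" / "continue" labels are hypothetical.

```python
import math


def wilson_interval(changed, sampled, z=1.96):
    """Wilson score interval for the fraction of changed pages (sampled > 0)."""
    p = changed / sampled
    denom = 1 + z * z / sampled
    center = (p + z * z / (2 * sampled)) / denom
    half = z * math.sqrt(p * (1 - p) / sampled + z * z / (4 * sampled ** 2)) / denom
    return center - half, center + half


def early_decision(changed, sampled, threshold, z=1.96):
    """Decide whether sampling of a site can stop early.

    Returns "download" if the whole interval lies above the threshold,
    "skip" if it lies below, and "continue" if more samples are needed.
    """
    if sampled == 0:
        return "continue"
    low, high = wilson_interval(changed, sampled, z)
    if low > threshold:
        return "download"
    if high < threshold:
        return "skip"
    return "continue"


# With a threshold of 0.5: a site with every sample changed, one with none
# changed, and one near the threshold that needs more samples.
decisions = [early_decision(c, 10, 0.5) for c in (10, 0, 5)]
```

Sites far from the threshold are settled after a few samples, while sites with estimates near ρ_t (such as 0.45 or 0.55 against a threshold of 0.5) keep receiving samples.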

In the Paper
• More details on
  • Optimal sample size
  • Adaptive policy
  • The cases where resource is too limited for sampling

Experiments
• 353,000 pages from 252 sites
  • Mostly popular sites
    • Yahoo, CNN, Microsoft, …
  • ~1400 pages from each site
    • Followed the links in a breadth-first manner
• Monthly change history for 6 months
  • 5 download cycles
• In experiments, 100,000 page downloads in each download cycle

Comparison of Policies
[Figure: bar chart of ChangeRatio (axis 0 to 0.8) for the policies RR (round robin), FRQ (frequency based), PRP (proportional), GRD (greedy), and ADP (adaptive)]

Optimal Sample Size
[Figure: ChangeRatio (axis 0.4 to 0.8) plotted against sample size (axis 0 to 250)]
• Optimal sample size ~ 10 through 60
• √(Nr) ~ 20
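As a rough check of the rule of thumb against the experimental setup, the sketch below reads the rule as the square root of N·r. That reading is an assumption, chosen because it agrees with the value of about 20 reported on this slide; the function and parameter names are hypothetical.

```python
import math


def rule_of_thumb_sample_size(pages_per_site, download_budget, pages_maintained):
    """Approximate per-site sample size as sqrt(N * r).

    N = pages_per_site
    r = download_budget / pages_maintained
    (Assumed reading of the rule of thumb, not a formula quoted from the paper.)
    """
    r = download_budget / pages_maintained
    return math.sqrt(pages_per_site * r)


# Experimental setup: ~1400 pages per site, 100,000 downloads, 353,000 pages.
k = rule_of_thumb_sample_size(1400, 100_000, 353_000)
```

With these numbers r is about 0.28 and the result rounds to 20, which falls inside the 10 to 60 range that the experiment found to work well.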

Comparison of Long-Term Performance
• Problem: We have data for only 5 download cycles
• Solution: Extrapolate the history
[Figure label: Repeat]

Frequency vs. Sampling
[Figure: ChangeRatio (axis 0.5 to 0.9) plotted against download cycle (axis 0 to 400) for the Frequency and Greedy policies]

Related Work
• Frequency-based policy
  • Coffman et al., Journal of Scheduling 1998
  • Cho et al., SIGMOD 2000
  • Edwards et al., WWW 2001
• Source cooperation
  • Olston et al., SIGMOD 2002

Conclusion
• Sampling-based policy
  • Great short-term performance
  • No change history required
• Frequency-based policy
  • Potentially good long-term performance if the change frequency does not change
• Greedy is easy to implement and shows high performance

Future Work
• Combination of sampling and frequency-based policies
  • Switch to the frequency-based policy after a while
• Good partitioning for sampling?
  • Site based? Directory based?
  • Content based?
  • Link-structure based?
