Data Stream Processing Part II: Stream Windows, Lossy Counting, Sticky Sampling


SLIDE 1

Data Stream Processing

Part II

Stream Windows Lossy Counting Sticky Sampling

SLIDE 2

Data Streams (recap)

  • continuous, unbounded sequence of items
  • unpredictable arrival times
  • too large to store locally
  • one pass
  • real time processing required

SLIDE 3

Reservoir Sampling (recap)

[Figure: items of the stream enter the reservoir with probability r/N]

  • create a representative sample of the incoming data items
  • uniformly sample the N items seen so far into a reservoir of size r
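A minimal sketch of the recap above, assuming the classic single-pass variant in which the n-th item replaces a random reservoir slot with probability r/n. The function name `reservoir_sample` and the optional `rng` argument are illustrative choices, not taken from the slides.

```python
import random

def reservoir_sample(stream, r, rng=None):
    """Return a uniform random sample of at most r items from an iterable."""
    rng = rng or random.Random()
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= r:
            reservoir.append(item)          # fill the reservoir first
        else:
            slot = rng.randrange(n)         # uniform in [0, n)
            if slot < r:                    # happens with probability r/n
                reservoir[slot] = item      # overwrite a uniformly chosen entry
    return reservoir
```

Only the reservoir itself is kept in memory, so the stream can be arbitrarily long.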

SLIDE 4

Today

Counting algorithms based on stream windows:

  • Lossy Counting
  • Sticky Sampling

SLIDE 5

Stream Windows

Mechanism for extracting a finite relation from an infinite stream.

SLIDE 9

Window Example

[Figure: the stream a d j u w s w y u j g d e d l drawn three times along a past-to-future axis, illustrating a sliding window]

Sliding Window

SLIDE 11

Window Types

Assumes the existence of some attribute that defines the order of the stream elements (e.g. time).
w is the window length (size), expressed in units of the ordering attribute (e.g. seconds).

[Figure: window boundaries t1 ... t4 and t1' ... t4' for sliding windows, t1 ... t3 for tumbling windows]

Sliding Window: $t_i' - t_i = w$
Tumbling Window: $t_{i+1} - t_i = w$
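The two window types can be sketched as follows. This is an illustration under assumptions: the items are `(timestamp, value)` pairs, the input is a finite list rather than a live stream, and the `slide` step of the sliding window is a parameter the slides do not name.

```python
def tumbling_windows(items, w):
    """Group (timestamp, value) pairs into back-to-back windows of length w."""
    windows = {}
    for ts, value in items:
        start = (ts // w) * w               # window covering [start, start + w)
        windows.setdefault(start, []).append(value)
    return sorted(windows.items())

def sliding_windows(items, w, slide):
    """Windows [t, t + w) whose start t advances by `slide` (slide < w overlaps)."""
    items = sorted(items)
    if not items:
        return []
    first, last = items[0][0], items[-1][0]
    result = []
    t = first
    while t <= last:
        result.append((t, [v for ts, v in items if t <= ts < t + w]))
        t += slide
    return result
```

With `slide` equal to `w` the sliding variant degenerates into tumbling windows, which is one way to see tumbling windows as a special case.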

SLIDE 13

Count based Windows

The ordering attribute can cause problems for duplicates (e.g. same time stamps).
Use count based windows instead.

[Figure: count based windows with boundaries t1 ... t3 and t1' ... t3']

Count based windows are potentially unpredictable with respect to fluctuations in input rates.
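A count based sliding window can be sketched with a bounded double-ended queue. The generator name `count_windows` is hypothetical; the sketch assumes a window is reported once w items have arrived and then after every further item.

```python
from collections import deque

def count_windows(stream, w):
    """Yield the last w items each time a new item arrives (once w items exist)."""
    window = deque(maxlen=w)                # the oldest item falls out automatically
    for item in stream:
        window.append(item)
        if len(window) == w:
            yield list(window)
```

The window always holds exactly w items, but the time span it covers depends on the input rate, which is the unpredictability noted above.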

SLIDE 15

Punctuation based Windows

Split windows based on punctuations in the data.

[Figure: punctuation based windows, each closed by a \n marker in the stream]

Potentially problematic if windows grow too large or too small.
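A small sketch of punctuation based windowing, assuming the punctuation is a single marker item (here the newline from the figure). The function name and the handling of trailing items are illustrative assumptions.

```python
def punctuation_windows(stream, punctuation="\n"):
    """Yield the items between consecutive punctuation marks as one window."""
    window = []
    for item in stream:
        if item == punctuation:
            yield window                    # the punctuation closes the window
            window = []
        else:
            window.append(item)
    if window:
        yield window                        # trailing items without a closing mark
```

Two adjacent punctuation marks produce an empty window, and a long run without punctuation produces an arbitrarily large one, which illustrates the size problem mentioned above.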

SLIDE 16

Window Standing Query Example

What is the average of the integers in the window?

  • Stream of integers
  • Window of size w = 4
  • Count based sliding window
  • For the first w inputs: sum and count
  • Afterwards: change the average by adding (i − j)/w to the previous window average
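The update rule follows directly from the definition of the average. As a short derivation, assume the old window holds $x_1, \dots, x_w$ and the new window holds $x_2, \dots, x_{w+1}$, so that $j = x_1$ is the oldest value and $i = x_{w+1}$ is the newest one (the symbols $x_k$ and $A$ are introduced here for illustration only):

```latex
\begin{aligned}
A_{\text{old}} &= \frac{1}{w}\sum_{k=1}^{w} x_k, \qquad j = x_1,\; i = x_{w+1} \\
A_{\text{new}} &= \frac{1}{w}\sum_{k=2}^{w+1} x_k
  = \frac{1}{w}\Bigl(\sum_{k=1}^{w} x_k - x_1 + x_{w+1}\Bigr)
  = A_{\text{old}} + \frac{i - j}{w}
\end{aligned}
```

Each update therefore needs only the newest value, the oldest value and the previous average.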

SLIDE 22

Window Standing Query Example

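[Figure: the stream 1 3 5 4 8 9 3 1 4 2 7 5 6 8 7 shown three times, with a window of size 4 moved one element further each time]

First window: $\frac{1+3+5+4}{4} = 3.25$

Update rule: $3.25 + \frac{i-j}{w}$, with i the newest value and j the oldest value

Second window: $\frac{1+3+5+4}{4} + \frac{8-1}{4} = 5$

Third window: $5 + \frac{9-3}{4} = 6.5$

Data structure?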

SLIDE 24

Window Average

#!/usr/bin/env python3
import sys
from collections import deque

WINDOW = 4
elems = deque()  # FIFO queue holding the values of the current window

# Initial average: sum the first WINDOW inputs.
elem_sum = 0
for _ in range(WINDOW):
    val = int(sys.stdin.readline().strip())
    elems.append(val)
    elem_sum += val
avg = elem_sum / WINDOW
print(avg)

# Afterwards: adjust the average by (newest - oldest) / WINDOW.
for line in sys.stdin:
    val = int(line.strip())
    avg += (val - elems.popleft()) / WINDOW
    elems.append(val)
    print(avg)

A queue allows the calculation in a single pass over the elements.

SLIDE 25

Window based Algorithm: Lossy Counting

SLIDE 29

Problem Description

Maintain a count for each distinct element seen so far.

Examples:

  • Google web crawler counting URL encounters.
  • Detecting spam pages through content analysis.
  • User login rankings to web services.

Straightforward solution: Hashtable
Too large for memory, too slow on disk
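For reference, the straightforward hashtable solution can be sketched in a few lines. The function name `exact_counts` is hypothetical. The sketch is exact, but it keeps one counter per distinct element, which is what becomes too large for memory on big streams.

```python
from collections import Counter

def exact_counts(stream):
    """Count every distinct element exactly; one counter per distinct element."""
    counts = Counter()
    for item in stream:
        counts[item] += 1
    return counts
```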

SLIDE 30

Algorithm Parameters

Environment Parameters

Elements seen so far: N

User-specified Parameters

  • support threshold s ∈ (0, 1)
  • error parameter ε ∈ (0, 1)

SLIDE 31

Algorithm Guarantees

1 All items whose true frequency exceeds sN are output. There are no false negatives.
2 No item whose true frequency is less than (s − ε)N is output.
3 Estimated frequencies are less than the true frequencies by at most εN.

SLIDE 36

Example

With s = 10%, ε = 1%, N = 1000

1 All elements exceeding frequency sN = 100 will be output.
2 No elements with frequencies below (s − ε)N = 90 are output. Elements with frequencies between 90 and 100 are potential false positives: they might or might not be output.
3 All estimated frequencies diverge from their true frequencies by at most εN = 10 instances.

Rule of thumb: ε = 0.1s
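The three numbers in this example can be recomputed with a few lines. The helper name `thresholds` is hypothetical, and exact rational arithmetic is used only to avoid floating point rounding in the illustration.

```python
from fractions import Fraction

def thresholds(s, eps, n):
    """Return (s*N, (s - eps)*N, eps*N) using exact rational arithmetic."""
    s, eps = Fraction(s), Fraction(eps)
    return s * n, (s - eps) * n, eps * n

must_output, never_below, max_error = thresholds("0.10", "0.01", 1000)
print(must_output, never_below, max_error)   # 100 90 10
```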

SLIDE 38

Expected Errors

1 high frequency false positives
2 small errors in frequency estimations

Acceptable for large values of N

SLIDE 39

Lossy Counting in Action

Incoming Stream of Colours

SLIDE 40

Divide into Windows/Buckets

[Figure: the stream divided into consecutive windows of width w]

Window size $w = \frac{1}{\varepsilon} = \frac{1}{0.01} = 100$

SLIDE 41

First Window Comes In

[Figure: empty counts, the first window, and the resulting frequency counts]

Go through the elements. If a counter exists, increase it by one; if not, create one and initialise it to one.

SLIDE 42

Adjust Counts at Window Boundaries

[Figure: frequency counts before and after the adjustment]

Reduce all counts by one. If the counter of a specific element reaches zero, drop it.

SLIDE 43

Next Window Comes In

[Figure: the next window is added to the frequency counts, which are then adjusted]

Count the elements and adjust the counts afterwards.

SLIDE 46

Lossy Counting Summary

  • Split the stream into windows.
  • For each window: count the elements; if no counter exists, create one.
  • At window boundaries: reduce all frequencies by one. If a frequency goes to zero, drop the counter.
  • Process the next window ...

Data structure to save counters: Hashtable<Color,Integer>
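The steps above can be sketched as follows. This is a minimal illustration of the simplified variant described on these slides (decrement every counter at each window boundary), with hypothetical function names and a plain dictionary standing in for `Hashtable<Color,Integer>`.

```python
import math
from collections import Counter

def lossy_count(stream, eps):
    """Simplified Lossy Counting: decrement every counter at each window boundary."""
    w = math.ceil(1 / eps)                  # window size w = 1/eps
    counts = Counter()
    n = 0
    for item in stream:
        n += 1
        counts[item] += 1                   # create the counter or increase it
        if n % w == 0:                      # window boundary reached
            for key in list(counts):
                counts[key] -= 1
                if counts[key] == 0:
                    del counts[key]         # drop counters that reach zero
    return counts, n

def frequent_items(counts, n, s, eps):
    """Report counters with frequency f >= (s - eps) * N."""
    return {k: f for k, f in counts.items() if f >= (s - eps) * n}
```

For the stream `"aaabaacdaaee"` with `eps = 0.25` the windows have size 4, and the element `a` (true frequency 7) ends with an estimate of 4 after three boundaries, an undercount of 3 = εN.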

SLIDE 47

Output

With s = 10%, ε = 1%, N = 200

[Figure: frequency counts with the output threshold; the counters 24, 22, 19 and 27 are output, including a false positive]

To reduce false positives to an acceptable amount, only output counters with frequency f ≥ (s − ε)N = 18.

SLIDE 48

Accuracy Improvement

The reduction step described so far reduces all counters by one. An improved version maintains exact frequencies and remembers for each counter the id of the window in which it was created. At window boundaries, counters are only removed when their frequency falls below a certain level in relation to their window id.

(Color,Integer,WindowID)

See the paper for details:

  • G. S. Manku, R. Motwani. Approximate Frequency Counts over Data Streams. VLDB, 2002.
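A sketch of the improved version, following the entry layout (element, f, Δ) of the cited paper as far as I understand it: Δ is set to the current window id minus one when an entry is created, and an entry is dropped at a window boundary when f + Δ does not exceed the current window id. The function name and the list-based entry representation are illustrative assumptions.

```python
import math

def lossy_count_with_delta(stream, eps):
    """Lossy Counting keeping [f, delta] per element instead of decrementing."""
    w = math.ceil(1 / eps)                  # window (bucket) width
    entries = {}                            # element -> [f, delta]
    n = 0
    for item in stream:
        n += 1
        window_id = (n - 1) // w + 1        # id of the current window, starting at 1
        if item in entries:
            entries[item][0] += 1           # exact count since the entry was created
        else:
            entries[item] = [1, window_id - 1]  # delta bounds the count missed so far
        if n % w == 0:                      # window boundary: prune weak entries
            weak = [k for k, (f, d) in entries.items() if f + d <= window_id]
            for key in weak:
                del entries[key]
    return entries, n
```

On the stream `"aaabaacdaaee"` with `eps = 0.25` this keeps `a` with its exact frequency 7, whereas decrementing all counters would report 4.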

SLIDE 49

Window based Algorithm: Sticky Sampling

SLIDE 50

Problem Description

Counting algorithm using a sampling approach.

Probabilistic sampling decides if a counter for a distinct element is created. If a counter exists for a certain element, every future instance of this element will be counted.

SLIDE 51

Algorithm Parameters

Environment Parameters

Elements seen so far: N

User-specified Parameters

  • support threshold s ∈ (0, 1)
  • error parameter ε ∈ (0, 1)
  • probability of failure δ ∈ (0, 1)

The algorithm is probabilistic and fails if any of the three guarantees is not satisfied.

SLIDE 52

Algorithm Guarantees

1 All items whose true frequency exceeds sN are output. There are no false negatives.
2 No item whose true frequency is less than (s − ε)N is output.
3 Estimated frequencies are less than the true frequencies by at most εN.

Guarantees, and thereby expected errors, are the same as for Lossy Counting, except for the small probability that the algorithm fails to provide correct answers.

SLIDE 53

Sticky Sampling in Action

Incoming Stream of Colours

SLIDE 54

Divide into Windows/Buckets

[Figure: window 1 with w = t, window 2 with w = 2t, window 3 with w = 4t]

Dynamic window size with $t = \frac{1}{\varepsilon} \log \frac{1}{s\delta}$

With s = 10%, ε = 1%, δ = 0.1%: t ≈ 921
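The value of t can be checked numerically, assuming the formula t = (1/ε) · ln(1/(sδ)) with the natural logarithm, which reproduces the t ≈ 921 stated on the slide. The function name `sticky_window_unit` is hypothetical.

```python
import math

def sticky_window_unit(s, eps, delta):
    """Return t = (1/eps) * ln(1 / (s * delta))."""
    return math.log(1 / (s * delta)) / eps

t = sticky_window_unit(s=0.10, eps=0.01, delta=0.001)
print(round(t))                             # 921
```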

SLIDE 55

A Window Comes in

[Figure: window 1 with w = t, r = 1; window 2 with w = 2t, r = 2; window 3 with w = 4t, r = 4]

Go through the elements. If a counter exists, increase it. If not, create a counter with probability 1/r and initialise it to one.

The sampling rate r grows in proportion to the window size.

SLIDE 56

Adjust Counts at Window Boundaries

[Figure: frequency counts before and after the adjustment]

Go through the elements of each counter. Toss a coin: if unsuccessful, remove the element and toss again; otherwise move on to the next counter. If a counter becomes zero, drop it.

Ensures uniform sampling

SLIDE 57

Sticky Sampling Summary

  • Split the stream into windows, doubling the window size with each new window.
  • For each window: go through the elements. If a counter exists, increase it. If not, create one with probability 1/r, with r growing at the same rate as the window size.
  • At window boundaries: reduce all frequencies by tossing an unbiased coin for each counted element. Remove the element if the coin toss is unsuccessful, otherwise move on to the next counter. If a frequency goes to zero, drop the counter.
  • Process the next window ...
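The summary above can be sketched as follows. The sketch follows the window sizes t, 2t, 4t and rates r = 1, 2, 4 shown on these slides, and assumes t = (1/ε) · ln(1/(sδ)) rounded up. The function name and the injectable `rng` are illustrative choices, not part of the original algorithm description.

```python
import math
import random

def sticky_sample(stream, s, eps, delta, rng=None):
    """Sticky Sampling with window sizes t, 2t, 4t, ... and rates r = 1, 2, 4, ..."""
    rng = rng or random.Random()
    t = math.ceil(math.log(1 / (s * delta)) / eps)
    counts = {}
    r = 1                                   # current sampling rate
    window_end = t                          # stream position where the window ends
    n = 0
    for item in stream:
        n += 1
        if item in counts:
            counts[item] += 1               # existing counters count every instance
        elif rng.random() < 1 / r:
            counts[item] = 1                # new counter with probability 1/r
        if n == window_end:                 # window boundary
            r *= 2
            window_end += r * t             # the next window is twice as long
            for key in list(counts):
                # Toss a fair coin until it succeeds; each failure removes one instance.
                while counts[key] > 0 and rng.random() < 0.5:
                    counts[key] -= 1
                if counts[key] == 0:
                    del counts[key]
    return counts, n
```

Because counters are only ever created late or decremented, an estimate never exceeds the true frequency. The output step is the same filter as for Lossy Counting: report counters with f ≥ (s − ε)N.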

SLIDE 58

Output

Same principle as Lossy Counting.

To reduce false positives to an acceptable amount, only output counters with frequency f ≥ (s − ε)N.

SLIDE 59

Lossy Counting vs. Sticky Sampling

Feature     Lossy Counting     Sticky Sampling
Results     deterministic      probabilistic
Memory      grows with N       static (independent of N)
Theory      performs worse     performs better
Practice    performs better    performs worse

(performance in terms of memory and accuracy)
