Efficient Processing of Massive Data Streams for Mining and Monitoring




slide-1
SLIDE 1

Mirek Riedewald

Department of Computer Science

Cornell University

Efficient Processing of Massive Data Streams for Mining and Monitoring

slide-2
SLIDE 2

Acknowledgements

Al Demers, Abhinandan Das, Alin Dobra, Sasha Evfimievski, Johannes Gehrke

KD-D initiative (Art Becker et al.)

slide-3
SLIDE 3

Introduction

Data streams versus databases

Infinite stream, continuous queries

Limited resources

Network monitoring

High arrival rates, approximation [CGJSS02]

Stock trading

Complex computation [ZS02]

Retail, E-business, Intelligence, Medical

Surveillance

Identify relevant information on-the-fly, archive for data mining

Exact results, error guarantees

slide-4
SLIDE 4

Information Spheres

Local Information Sphere

Within each organization

Continuous processing of distributed data streams

Online evaluation of thousands of triggers

Storage/archival of important data

Global Information Sphere

Between organizations

Share data in a privacy-preserving way

slide-5
SLIDE 5

Local Information Sphere

Distributed data stream event processing and online data mining

Technical challenges

Blocking operators, unbounded state

Graceful degradation under increasing load

Integration with archive

Processing of physically distributed streams

slide-6
SLIDE 6

Event Matching, Correlation

Join of data streams

Price  Mpix  Brand
200    3.0   Canon

Price  Mpix
<250   >2.0

slide-7
SLIDE 7

Event Matching, Correlation

Join of data streams

Price  Mpix  Brand
200    3.0   Canon
100    3.0   Fuji

Price  Mpix
<250   >2.0
<400   >4.0

slide-8
SLIDE 8

Event Matching, Correlation

Join of data streams

Equi-join, text similarity, geographical proximity, …

Problem: unbounded state, computation

Price  Mpix  Brand
220    3.0   Fuji
180    3.0   Canon
340    4.0   Kodak

Price  Mpix
<400   >4.0
<250   >2.0
<200   =3.0
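
The matching shown on these slides can be sketched as a join between a stream of records and a stream of conditions. This is only an illustration: the row grouping of the two tables is reconstructed from the flattened slide text, and the names `OPS`, `matches`, and `match_all` are hypothetical, not taken from the talk.

```python
import operator

# Comparison operators used in the conditions table (assumed encoding).
OPS = {"<": operator.lt, ">": operator.gt, "=": operator.eq}

records = [
    {"price": 220, "mpix": 3.0, "brand": "Fuji"},
    {"price": 180, "mpix": 3.0, "brand": "Canon"},
    {"price": 340, "mpix": 4.0, "brand": "Kodak"},
]

conditions = [
    {"price": ("<", 400), "mpix": (">", 4.0)},
    {"price": ("<", 250), "mpix": (">", 2.0)},
    {"price": ("<", 200), "mpix": ("=", 3.0)},
]


def matches(record, condition):
    """True if the record satisfies every attribute predicate of the condition."""
    return all(OPS[op](record[attr], bound)
               for attr, (op, bound) in condition.items())


def match_all(records, conditions):
    """Return (brand, condition index) for every matching pair."""
    return [(r["brand"], i)
            for i, c in enumerate(conditions)
            for r in records
            if matches(r, c)]


result = match_all(records, conditions)
# result == [("Fuji", 1), ("Canon", 1), ("Canon", 2)]
```

Every record has to be compared with every condition seen so far, which is where the unbounded state and computation come from when both inputs are infinite streams.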

slide-9
SLIDE 9

Window Joins

Restrict join to window of most recent records (tuples)

Landmark window

Sliding window based on time or number of records

Problem definition

Window based on time: size w

Synchronous record arrival

Equi-join

slide-10
SLIDE 10

Abstract Model

Data streams R(A,…), S(A,…)

Compute equi-join on A

Match all r and s of streams R, S such that r.A = s.A

Sliding window of size w

R = 1, 1, 1 and S = 2, 3, 1 (in arrival order)

Output: (r0,s2), (r1,s2), (r2,s2)

slide-11
SLIDE 11

Abstract Model (cont.)

Data streams R(A,…), S(A,…)

Compute equi-join on A

Match all r and s of streams R, S such that r.A = s.A

Sliding window of size w

R = 1, 1, 1, 3 and S = 2, 3, 1, 1 (in arrival order)

Earlier output: (r0,s2), (r1,s2), (r2,s2)

New output: (r3,s1), (r1,s3), (r2,s3)

slide-12
SLIDE 12

Abstract Model (cont.)

Data streams R(A,…), S(A,…)

Compute equi-join on A

Match all r and s of streams R, S such that r.A = s.A

Sliding window of size w

R = 1, 1, 1, 3, 2 and S = 2, 3, 1, 1, 4 (in arrival order)

Earlier output: (r0,s2), (r1,s2), (r2,s2), (r3,s1), (r1,s3), (r2,s3)

No new output
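
The abstract model can be sketched in a few lines of Python. This is a minimal illustration, not the implementation from the talk. It assumes synchronous arrival (one record per stream per time step), w = 3 as on the later flow-model slides, and a window that covers the current time step plus the w - 1 preceding ones. The function name `window_join` is hypothetical.

```python
from collections import deque


def window_join(R, S, w):
    """Sliding-window equi-join with unlimited memory.

    Returns one list per time step holding the new output pairs (i, j),
    meaning record r_i joins record s_j.
    """
    r_win, s_win = deque(), deque()  # (index, value) of records still in the window
    output = []
    for t, (r, s) in enumerate(zip(R, S)):
        # Expire records that have fallen out of the window.
        for win in (r_win, s_win):
            while win and win[0][0] <= t - w:
                win.popleft()
        new = [(t, j) for j, v in s_win if v == r]   # new r against older s
        new += [(i, t) for i, v in r_win if v == s]  # new s against older r
        if r == s:                                   # the two simultaneous arrivals
            new.append((t, t))
        r_win.append((t, r))
        s_win.append((t, s))
        output.append(sorted(new))
    return output


steps = window_join([1, 1, 1, 3, 2], [2, 3, 1, 1, 4], w=3)
# steps[2] == [(0, 2), (1, 2), (2, 2)]
# steps[3] == [(1, 3), (2, 3), (3, 1)]
# steps[4] == []   (no new output)
```

Under these assumptions the per-step output matches the pairs listed on the three Abstract Model slides. The two deques hold up to w records each, which is the state that load shedding has to cut down when memory is limited.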

slide-13
SLIDE 13

Limited Resources

Focus on limited memory M<2w

State of the art: random load shedding [KNV03]

Random sample of streams

Desired approach: semantic load shedding

Goal: graceful degradation

Approximation

Set-valued result: Error measure?

slide-14
SLIDE 14

Set-Approximation Error

What is a good error measure?

Information Retrieval, Statistics, Data Mining

Matching coefficient: $|A \cap B|$

Dice coefficient: $2|A \cap B| / (|A| + |B|)$

Jaccard coefficient: $|A \cap B| / |A \cup B|$

Cosine coefficient: $|A \cap B| / \sqrt{|A| \cdot |B|}$

Overlap coefficient: $|A \cap B| / \min\{|A|, |B|\}$

Earth Mover’s Distance (EMD) [RTG98]

Match And Compare (MAC) [IP99]

Join: subset of output result

EMD, Overlap coefficient trivially 0 or 1

Others (except MAC) reduce to MAX-subset error measure
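
A small sketch of the set coefficients makes the reduction to MAX-subset concrete. It assumes the standard textbook definitions (in particular the usual cosine form with a square root of the product of the set sizes), and the function name `coefficients` is hypothetical. The approximate join result A is taken to be a subset of the exact result B.

```python
import math


def coefficients(A, B):
    """Set-similarity coefficients for two non-empty sets A and B."""
    A, B = set(A), set(B)
    inter = len(A & B)
    return {
        "matching": inter,
        "dice": 2 * inter / (len(A) + len(B)),
        "jaccard": inter / len(A | B),
        "cosine": inter / math.sqrt(len(A) * len(B)),
        "overlap": inter / min(len(A), len(B)),
    }


# Exact join output of the running example and two approximate subsets of it.
exact = {(0, 2), (1, 2), (2, 2), (3, 1), (1, 3), (2, 3)}
small = {(0, 2), (1, 2), (2, 2)}
large = small | {(3, 1), (1, 3)}

c_small = coefficients(small, exact)
c_large = coefficients(large, exact)
```

Because A is a subset of B, the intersection is A itself. The overlap coefficient is therefore always 1, and the remaining coefficients grow with |A|, so maximizing any of them amounts to maximizing the size of the approximate result.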

slide-15
SLIDE 15

Optimization Problem

Select records to be kept in memory such that the result size is maximized subject to memory constraints

Lightweight online technique

Adaptivity in presence of memory fluctuations

slide-16
SLIDE 16

Optimal Offline Algorithm

What is the best result that can be achieved?

Optimal sampling strategy for MAX-subset

Benchmark for evaluating any online algorithm

Same optimization problem, but knows future

Finite subsets of input streams

Formulate as linear flow problem

slide-17
SLIDE 17

Generation of Flow Model

R=1,1,1,3 S=2,3,1,1

M=2, w=3

Fixed memory allocation

Capacity: 0..1, linear cost

[Figure: flow network with "Keep in memory" and "Replace" arcs]

slide-18
SLIDE 18

Correspondence to Windows

R=1,1,1,3 S=2,3,1,1

slide-19
SLIDE 19

Correspondence to Windows

R=1,1,1,3 S=2,3,1,1

slide-20
SLIDE 20

Correspondence to Windows

R=1,1,1,3 S=2,3,1,1

slide-21
SLIDE 21

Correspondence to Windows

R=1,1,1,3 S=2,3,1,1

slide-22
SLIDE 22

Complexity

Integer solution exists

Optimal solution found in O(n^2 m log n)

N: input size of a single stream

#nodes: n < 2wN + N + 2

#arcs: m < 2n + M + 1

Reasonable costs for benchmarking

  • Approx. 1GB memory (w=800, M=800)
  • Approx. 1h computation time
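
To get a feel for the size of the flow model, the two bounds can be evaluated directly. The helper name `flow_model_bounds` is hypothetical, and the example uses the toy instance from the flow-model slides (N = 4, w = 3, M = 2), not the benchmark setting.

```python
def flow_model_bounds(N, w, M):
    """Upper bounds on the number of nodes and arcs of the flow model.

    Uses n < 2wN + N + 2 and m < 2n + M + 1 from the slide, with the
    node bound substituted for n in the arc bound.
    """
    n_bound = 2 * w * N + N + 2
    m_bound = 2 * n_bound + M + 1
    return n_bound, m_bound


n_bound, m_bound = flow_model_bounds(N=4, w=3, M=2)
# n_bound == 30, m_bound == 63
```

The node bound grows with the product of w and N, so a large window over a long input quickly yields a large graph, which is in line with the memory and time figures quoted for benchmarking.
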
slide-23
SLIDE 23

Optimal Flow

R=1,1,1,3 S=2,3,1,1

M=2, w=3

Fixed memory allocation

Capacity: 0..1, linear cost

[Figure: optimal flow through the network, with "Keep in memory" and "Replace" arcs]

slide-24
SLIDE 24

Easy to Extend

R=1,1,1,3 S=2,3,1,1

M=2, w=3

Variable memory allocation

Capacity: 0..1, linear cost

[Figure: flow network with "Keep in memory" and "Replace" arcs]
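
For inputs as small as this example, the offline optimum can also be found by exhaustive search, which gives a cross-check of the difference between fixed and variable memory allocation. The sketch below is not the flow formulation from the talk. It assumes that the two records arriving at the same time step can always be joined with each other, that a new record joins only with stored records of the other stream that are still inside the window, that fixed allocation means M/2 slots per stream, and that variable allocation means M slots shared by both streams. The name `offline_optimum` is hypothetical.

```python
from functools import lru_cache
from itertools import combinations


def offline_optimum(R, S, w, M, fixed=True):
    """Maximum join output size achievable with memory M, knowing the future."""
    n = min(len(R), len(S))

    def subsets(items, limit):
        # All subsets of `items` with at most `limit` elements.
        for k in range(min(limit, len(items)) + 1):
            yield from combinations(items, k)

    @lru_cache(maxsize=None)
    def best(t, kept_r, kept_s):
        if t == n:
            return 0
        # Stored records that are still inside the window at time t.
        live_r = tuple(i for i in kept_r if i > t - w)
        live_s = tuple(j for j in kept_s if j > t - w)
        gain = int(R[t] == S[t])
        gain += sum(S[j] == R[t] for j in live_s)
        gain += sum(R[i] == S[t] for i in live_r)
        # Try every way of keeping records for the next step.
        future = 0
        for keep_r in subsets(live_r + (t,), M // 2 if fixed else M):
            limit_s = M // 2 if fixed else M - len(keep_r)
            for keep_s in subsets(live_s + (t,), limit_s):
                future = max(future, best(t + 1, keep_r, keep_s))
        return gain + future

    return best(0, (), ())


R, S = [1, 1, 1, 3], [2, 3, 1, 1]
fixed_opt = offline_optimum(R, S, w=3, M=2, fixed=True)
variable_opt = offline_optimum(R, S, w=3, M=2, fixed=False)
```

Under these assumptions the search finds 4 output pairs for fixed allocation and 5 for variable allocation, against 6 for the exact join. The search is exponential in M and w, so it is only useful as a sanity check on toy inputs.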

slide-25
SLIDE 25

Online Heuristics

Maximize expected output

PROB: sort tuples by join partner arrival probability

LIFE: sort tuples by product of partner arrival probability and remaining lifetime

Maintain stream statistics

Histograms (DGIM02, TGIK02), wavelets (GKMS01), quantiles (GKMS02, GK01)
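
The two priorities can be sketched as follows. This is an illustration under assumptions, not the talk's implementation: the partner arrival probability is estimated as the relative frequency of the value in a sample of the other stream (standing in for the histogram, wavelet, or quantile statistics), remaining lifetime is taken as arrival + w - now, and the record with the lowest priority is the one to drop. All names (`Stored`, `prob_priority`, `life_priority`, `victim`) are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Stored:
    arrival: int  # time step at which the record arrived
    value: int    # join attribute value


def partner_probability(value, partner_sample):
    """Estimated probability that the next partner record carries `value`."""
    return Counter(partner_sample)[value] / len(partner_sample)


def prob_priority(rec, partner_sample, now, w):
    """PROB: priority is the partner arrival probability alone."""
    return partner_probability(rec.value, partner_sample)


def life_priority(rec, partner_sample, now, w):
    """LIFE: partner arrival probability times remaining lifetime."""
    remaining = rec.arrival + w - now
    return partner_probability(rec.value, partner_sample) * remaining


def victim(memory, priority, partner_sample, now, w):
    """Record to drop when memory is full: the one with the lowest priority."""
    return min(memory, key=lambda rec: priority(rec, partner_sample, now, w))


memory = [Stored(arrival=0, value=1), Stored(arrival=2, value=3)]
sample = [1, 1, 3, 2]  # recent values seen on the partner stream

prob_victim = victim(memory, prob_priority, sample, now=2, w=3)
life_victim = victim(memory, life_priority, sample, now=2, w=3)
```

In this example PROB drops the record with value 3 because it has the rarer join value, while LIFE drops the record with value 1 because it is about to expire even though its value is more frequent.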

slide-26
SLIDE 26

Approximation Quality

slide-27
SLIDE 27

Effect of Skew

slide-28
SLIDE 28

Summary

Information sphere architecture

Optimal algorithm and fast, efficient heuristic for sliding window joins

Open problems

Other set error measures, resource models

Other joins: compress records

Complex queries

Distributed processing

Integration with other techniques into local information sphere

slide-29
SLIDE 29

Related Work

Aurora (Brown, MIT), STREAM (Stanford), Telegraph (Berkeley), NiagaraCQ (Wisconsin, OGI)

Memory requirements [ABBMW02, TM02]

Aggregation

Alon, Bar-Yossef, Datar, Dobra, Garofalakis, Gehrke, Gibbons, Gilbert, Indyk, Korn, Kotidis, Koudas, Matias, Motwani, Muthukrishnan, Rastogi, Srivastava, Strauss, Szegedy

slide-30
SLIDE 30

Other Results

[DGR03]

Integration with archive

Load smoothing, not shedding

Novel “error” measure: archive access cost

Static join for sensor networks

Maximize result size subject to constraints on energy consumption

Polynomial-time dynamic programming solution

Fast 2-approximation algorithms

NP-hardness proof for join of 3 or more streams

slide-31
SLIDE 31

Other Results (cont.)

[DGGR02]

Computation of aggregates over streams for multiple joins

Small pseudo-random sketch synopses (randomized linear projections)

Explicit, tunable error guarantees

Sketch partitioning to boost accuracy (intelligently partition join attribute space)
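
The idea of a sketch as a randomized linear projection can be illustrated with a generic textbook-style estimator for the size of a single equi-join. This is not the method of [DGGR02]: it handles one binary join only, averages independent copies instead of using sketch partitioning, and derives its random +1/-1 signs from seeded `random.Random` instances. The class name `Sketch` and the function `estimate_join_size` are hypothetical.

```python
import random


class Sketch:
    """One counter per copy: the sum of sign(value) over all stream records."""

    def __init__(self, copies, seed=0):
        self.copies = copies
        self.seed = seed
        self.counters = [0] * copies

    def _sign(self, k, value):
        # Pseudo-random +1/-1, fixed per (seed, copy, value), so that two
        # sketches built with the same seed use the same projection.
        rng = random.Random(f"{self.seed}:{k}:{value}")
        return 1 if rng.random() < 0.5 else -1

    def add(self, value, weight=1):
        """Process one stream record (weight=-1 would model a deletion)."""
        for k in range(self.copies):
            self.counters[k] += weight * self._sign(k, value)


def estimate_join_size(a, b):
    """Average of the per-copy counter products; unbiased for the join size."""
    return sum(x * y for x, y in zip(a.counters, b.counters)) / a.copies


sketch_r, sketch_s = Sketch(copies=2000), Sketch(copies=2000)
for v in [1, 1, 1, 3]:
    sketch_r.add(v)
for v in [2, 3, 1, 1]:
    sketch_s.add(v)

estimate = estimate_join_size(sketch_r, sketch_s)  # the exact join size is 7
```

Each sketch needs only `copies` counters regardless of stream length, and the estimate tightens as more copies are averaged, which is the sense in which the error is tunable.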

slide-32
SLIDE 32

Thanks!

Questions?