Efficient Processing of Massive Data Streams for Mining and Monitoring
Mirek Riedewald
Department of Computer Science
Cornell University
Acknowledgements
Al Demers, Abhinandan Das, Alin Dobra, Sasha Evfimievski, Johannes Gehrke
KD-D initiative (Art Becker et al.)
Introduction
Data streams versus databases
Infinite stream, continuous queries
Limited resources
Network monitoring
High arrival rates, approximation [CGJSS02]
Stock trading
Complex computation [ZS02]
Retail, E-business, Intelligence, Medical
Surveillance
Identify relevant information on-the-fly, archive for data mining
Exact results, error guarantees
Information Spheres
Local Information Sphere
Within each organization
Continuous processing of distributed data streams
Online evaluation of thousands of triggers
Storage/archival of important data
Global Information Sphere
Between organizations
Share data in a privacy-preserving way
Local Information Sphere
Distributed data stream event processing and online data mining
Technical challenges
Blocking operators, unbounded state
Graceful degradation under increasing load
Integration with archive
Processing of physically distributed streams
Event Matching, Correlation
Join of data streams
Records (Price, Mpix, Brand): (200, 3.0, Canon)
Conditions (Price, Mpix): (<250, >2.0)
Event Matching, Correlation
Join of data streams
Records (Price, Mpix, Brand): (200, 3.0, Canon), (100, 3.0, Fuji)
Conditions (Price, Mpix): (<250, >2.0), (<400, >4.0)
Event Matching, Correlation
Join of data streams
Equi-join, text similarity, geographical proximity, …
Problem: unbounded state, computation
Records (Price, Mpix, Brand): (220, 3.0, Fuji), (180, 3.0, Canon), (340, 4.0, Kodak)
Conditions (Price, Mpix): (<400, >4.0), (<250, >2.0), (<200, =3.0)
Window Joins
Restrict join to window of most recent records (tuples)
Landmark window
Sliding window based on time or number of records
Problem definition
Window based on time: size w
Synchronous record arrival
Equi-join
Abstract Model
Data streams R(A,…), S(A,…)
Compute equi-join on A
Match all r and s of streams R, S such that r.A=s.A
Sliding window of size w
R=1,1,1 S=2,3,1
Output: (r0,s2), (r1,s2), (r2,s2)
Abstract Model (cont.)
R=1,1,1,3 S=2,3,1,1
Output so far: (r0,s2), (r1,s2), (r2,s2)
New output: (r3,s1), (r1,s3), (r2,s3)
Abstract Model (cont.)
R=1,1,1,3,2 S=2,3,1,1,4
Output so far: (r0,s2), (r1,s2), (r2,s2), (r3,s1), (r1,s3), (r2,s3)
No new output
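The abstract model can be traced with a short sketch. The code below is a minimal illustration, not the presenter's implementation: it assumes one record per stream arrives at each time step, that a record arriving at step i stays in the window through step i + w - 1, and that w = 3 (the value used in the later flow example). The function name `window_join` is hypothetical.

```python
from collections import deque

def window_join(r_stream, s_stream, w):
    """Yield (t, new_pairs) for a synchronous sliding-window equi-join.

    Pairs are (i, j), meaning record r_i joined with record s_j.
    """
    r_win, s_win = deque(), deque()   # entries are (arrival_step, value)
    for t, (r, s) in enumerate(zip(r_stream, s_stream)):
        # Expire records that arrived at step t - w or earlier.
        for win in (r_win, s_win):
            while win and win[0][0] <= t - w:
                win.popleft()
        r_win.append((t, r))
        s_win.append((t, s))
        # The new r joins everything in S's window, including the new s.
        new_pairs = [(t, j) for j, v in s_win if v == r]
        # The new s joins older R records; the pair (t, t) is already covered.
        new_pairs += [(i, t) for i, v in r_win if v == s and i != t]
        yield t, sorted(new_pairs)

# Streams from the slides, in arrival order.
for step, pairs in window_join([1, 1, 1, 3, 2], [2, 3, 1, 1, 4], w=3):
    print(step, pairs)
```

Under these assumptions the new pairs at steps 2 and 3 match the outputs listed on the slides, and step 4 produces none because s0 has already left the window.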
Limited Resources
Focus on limited memory M < 2w
State of the art: random load shedding [KNV03]
Random sample of streams
Desired approach: semantic load shedding
Goal: graceful degradation
Approximation
Set-valued result: Error measure?
Set-Approximation Error
What is a good error measure?
Information Retrieval, Statistics, Data Mining
Matching coefficient: $|A \cap B|$
Dice coefficient: $2|A \cap B| / (|A| + |B|)$
Jaccard coefficient: $|A \cap B| / |A \cup B|$
Cosine coefficient: $|A \cap B| / \sqrt{|A| \, |B|}$
Overlap coefficient: $|A \cap B| / \min\{|A|, |B|\}$
Earth Mover’s Distance (EMD) [RTG98]
Match And Compare (MAC) [IP99]
Join: subset of output result
EMD, Overlap coefficient trivially 0 or 1
Others (except MAC) reduce to MAX-subset error measure
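The reduction to the MAX-subset measure can be checked numerically. The sketch below uses the standard set-based definitions of the five coefficients and evaluates them for approximate results that are subsets of a hypothetical exact result of ten items; the function name `coefficients` and the example sets are illustrative assumptions.

```python
import math

def coefficients(a, b):
    """Standard set-similarity coefficients for two non-empty sets."""
    inter = len(a & b)
    return {
        "matching": inter,
        "dice": 2 * inter / (len(a) + len(b)),
        "jaccard": inter / len(a | b),
        "cosine": inter / math.sqrt(len(a) * len(b)),
        "overlap": inter / min(len(a), len(b)),
    }

exact = set(range(10))          # hypothetical exact join result
for size in (2, 5, 8):
    approx = set(range(size))   # approximate result: a subset of the exact one
    print(size, coefficients(approx, exact))
```

When the approximation is a subset of the exact result, the overlap coefficient is always 1, while the other four grow with the subset size, so maximizing any of them amounts to maximizing the number of results produced.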
Optimization Problem
Select records to be kept in memory such that the result size is maximized subject to memory constraints
Lightweight online technique
Adaptivity in presence of memory fluctuations
Optimal Offline Algorithm
What is the best result that can be achieved?
Optimal sampling strategy for MAX-subset
Baseline for evaluation of any online algorithm
Same optimization problem, but knows future
Finite subsets of input streams
Formulate as linear flow problem
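For very small inputs the offline optimum can also be found by exhaustive search, which is useful as a sanity check. The sketch below is brute force, not the flow formulation from the slides, and it rests on assumptions that the slides do not spell out: memory is split evenly so each stream buffers at most k = M/2 older records, the two records arriving in the same step can always be joined with each other and with the buffered records, and a record arriving at step i can join through step i + w - 1. The function name `best_offline` is hypothetical.

```python
from itertools import combinations

def best_offline(r_stream, s_stream, w, k):
    """Maximum number of join results if each stream buffers at most k older records."""
    n = len(r_stream)

    def keep_options(buf, t):
        # Candidates: buffered records plus the one that arrived at step t,
        # restricted to those still inside the window at step t + 1.
        alive = tuple(i for i in buf + (t,) if i > t + 1 - w)
        # Keeping more records never hurts, so only maximal subsets are tried.
        return combinations(alive, min(k, len(alive)))

    def search(t, r_buf, s_buf):
        if t == n:
            return 0
        r, s = r_stream[t], s_stream[t]
        gained = int(r == s)                              # new r with new s
        gained += sum(s_stream[j] == r for j in s_buf)    # new r with buffered S
        gained += sum(r_stream[i] == s for i in r_buf)    # new s with buffered R
        best_rest = max(
            search(t + 1, next_r, next_s)
            for next_r in keep_options(r_buf, t)
            for next_s in keep_options(s_buf, t)
        )
        return gained + best_rest

    return search(0, (), ())

# Example streams from the slides with M=2 (k=1 per stream) and w=3.
print(best_offline([1, 1, 1, 3], [2, 3, 1, 1], w=3, k=1))
```

The search is exponential in the input length, so it only serves to cross-check a flow-based solver on toy inputs.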
Generation of Flow Model
R=1,1,1,3 S=2,3,1,1
M=2, w=3
Fixed memory allocation
Capacity: 0..1, linear cost
[Figure: flow network for this example; arcs are marked "Keep in memory" or "Replace" and carry costs such as -1 and -3]
Correspondence to Windows
R=1,1,1,3 S=2,3,1,1
[Figure: four-step build showing how arcs of the flow network (cost -1) correspond to the sliding windows over R and S]
Complexity
Integer solution exists
Optimal solution found in $O(n^2 m \log n)$
N: input size of single stream
#nodes: n < 2wN + N + 2
#arcs: m < 2n + M + 1
Reasonable costs for benchmarking
- Approx. 1GB memory (w=800, M=800)
- Approx. 1h computation time
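The size bounds are easy to evaluate for a given configuration. The helper below simply plugs numbers into the two inequalities and the running-time expression from this slide; the function names are hypothetical, the logarithm base is an arbitrary choice (it only changes a constant factor), and the stream length N in the example is an assumed value, since the slides do not state the one used for benchmarking.

```python
import math

def flow_size_bounds(w, N, M):
    """Upper bounds on the number of nodes n and arcs m of the flow network."""
    n = 2 * w * N + N + 2   # n < 2wN + N + 2
    m = 2 * n + M + 1       # m < 2n + M + 1
    return n, m

def time_bound(n, m):
    """Value of n^2 * m * log n, ignoring the constant hidden by the O-notation."""
    return n ** 2 * m * math.log2(n)

# Toy example from the flow slides: w=3, N=4, M=2.
print(flow_size_bounds(3, 4, 2))

# Benchmark-sized window and memory with an assumed stream length N=1000.
n, m = flow_size_bounds(800, 1000, 800)
print(n, m, f"{time_bound(n, m):.3e}")
```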
Optimal Flow
R=1,1,1,3 S=2,3,1,1
M=2, w=3
Fixed memory allocation
Capacity: 0..1, linear cost
[Figure: optimal flow in the network for this example; arcs are marked "Keep in memory" or "Replace" and carry costs such as -1 and -3]
Easy to Extend
R=1,1,1,3 S=2,3,1,1
M=2, w=3
Variable memory allocation
Capacity: 0..1, linear cost
[Figure: flow network extended to variable memory allocation; arcs are marked "Keep in memory" or "Replace" and carry costs such as -1 and -3]
Online Heuristics
Maximize expected output
PROB: sort tuples by join partner arrival probability
LIFE: sort tuples by product of partner arrival probability and remaining lifetime
Maintain stream statistics
Histograms (DGIM02, TGIK02), wavelets (GKMS01), quantiles (GKMS02, GK01)
Approximation Quality
Effect of Skew
Summary
Information sphere architecture
Optimal algorithm and fast, efficient heuristic for sliding window joins
Open problems
Other set error measures, resource models
Other joins: compress records
Complex queries
Distributed processing
Integration with other techniques into local information sphere
Related Work
Aurora (Brown, MIT), STREAM (Stanford), Telegraph (Berkeley), NiagaraCQ (Wisconsin, OGI)
Memory requirements [ABBMW02, TM02]
Aggregation
Alon, Bar-Yossef, Datar, Dobra, Garofalakis, Gehrke, Gibbons, Gilbert, Indyk, Korn, Kotidis, Koudas, Matias, Motwani, Muthukrishnan, Rastogi, Srivastava, Strauss, Szegedy
Other Results
[DGR03]
Integration with archive
Load smoothing, not shedding
Novel “error” measure: archive access cost
Static join for sensor networks
Maximize result size subject to constraints on energy consumption
Polynomial dynamic programming solution
Fast 2-approximation algorithms
NP-hardness proof for join of 3 or more streams
Other Results (cont.)
[DGGR02]
Computation of aggregates over streams for multiple joins
Small pseudo-random sketch synopses (randomized linear projections)
Explicit, tunable error guarantees
Sketch partitioning to boost accuracy
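The idea of a randomized linear projection can be illustrated with a simple join-size estimator. The sketch below is a generic sign-based construction and is not claimed to be the method of [DGGR02]: it stores fully random ±1 signs in lookup tables for clarity (practical sketches use compact limited-independence hash functions instead) and averages the independent copies rather than applying a more robust combination. All names are hypothetical.

```python
import random

def make_signs(num_copies, seed=0):
    """Return sign(copy, value) giving a fixed random +1/-1 per (copy, value)."""
    rng = random.Random(seed)
    tables = [dict() for _ in range(num_copies)]

    def sign(copy, value):
        table = tables[copy]
        if value not in table:
            table[value] = rng.choice((-1, 1))
        return table[value]

    return sign

def sketch(stream, sign, num_copies):
    """One counter per copy: the sum of signs of all values seen in the stream."""
    counters = [0] * num_copies
    for value in stream:
        for c in range(num_copies):
            counters[c] += sign(c, value)
    return counters

def estimate_join_size(sk_r, sk_s):
    """Average of counter products; an unbiased estimate of the equi-join size."""
    return sum(a * b for a, b in zip(sk_r, sk_s)) / len(sk_r)

copies = 4000
sign = make_signs(copies)
R = [1] * 50 + [2] * 30 + [3] * 20
S = [1] * 40 + [2] * 10 + [4] * 50
exact = sum(R.count(v) * S.count(v) for v in set(R))   # 50*40 + 30*10
print(exact, estimate_join_size(sketch(R, sign, copies), sketch(S, sign, copies)))
```

Because each counter is a linear function of the stream's frequency vector, the sketch of two concatenated streams is the element-wise sum of their sketches, and both streams must share the same sign function for the estimate to be meaningful.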