
Efficient Processing of Massive Data Streams for Mining and Monitoring

Efficient Processing of Massive Data Streams for Mining and Monitoring. Mirek Riedewald, Department of Computer Science, Cornell University. Acknowledgements: Al Demers, Abhinandan Das, Alin Dobra, Sasha Evfimievski, Johannes Gehrke.


  1. Efficient Processing of Massive Data Streams for Mining and Monitoring
     Mirek Riedewald
     Department of Computer Science
     Cornell University

  2. Acknowledgements
     - Al Demers
     - Abhinandan Das
     - Alin Dobra
     - Sasha Evfimievski
     - Johannes Gehrke
     - KD-D initiative (Art Becker et al.)

  3. Introduction
     - Data streams versus databases
     - Infinite stream, continuous queries
     - Limited resources
     - Network monitoring
     - High arrival rates, approximation [CGJSS02]
     - Stock trading
     - Complex computation [ZS02]
     - Retail, E-business, Intelligence, Medical Surveillance
     - Identify relevant information on-the-fly, archive for data mining
     - Exact results, error guarantees

  4. Information Spheres
     - Local Information Sphere
     - Within each organization
     - Continuous processing of distributed data streams
     - Online evaluation of thousands of triggers
     - Storage/archival of important data
     - Global Information Sphere
     - Between organizations
     - Share data in privacy preserving way

  5. Local Information Sphere
     Distributed data stream event processing and online data mining
     - Technical challenges
     - Blocking operators, unbounded state
     - Graceful degradation under increasing load
     - Integration with archive
     - Processing of physically distributed streams

  6. Event Matching, Correlation
     - Join of data streams

       Brand  Mpix  Price  |  Mpix   Price
       Canon  3.0   200    |  > 2.0  < 250

  7. Event Matching, Correlation
     - Join of data streams

       Brand  Mpix  Price  |  Mpix   Price
       Canon  3.0   200    |  > 2.0  < 250
       Fuji   3.0   100    |  > 4.0  < 400

  8. Event Matching, Correlation
     - Join of data streams

       Brand  Mpix  Price  |  Mpix   Price
       Canon  3.0   180    |  > 2.0  < 250
       Fuji   3.0   220    |  > 4.0  < 400
       Kodak  4.0   340    |  = 3.0  < 200

     - Equi-join, text similarity, geographical proximity,…
     - Problem: unbounded state, computation

  9. Window Joins
     - Restrict join to window of most recent records (tuples)
     - Landmark window
     - Sliding window based on time or number of records
     - Problem definition
     - Window based on time: size w
     - Synchronous record arrival
     - Equi-join

  10. Abstract Model
      - Data streams R(A,…), S(A,…)
      - Compute equi-join on A
      - Match all r and s of streams R, S such that r.A = s.A
      - Sliding window of size w

        R: 1 1 1
        S: 2 3 1
        Output: (r0,s2), (r1,s2), (r2,s2)

  11. Abstract Model (cont.)
      - Data streams R(A,…), S(A,…)
      - Compute equi-join on A
      - Match all r and s of streams R, S such that r.A = s.A
      - Sliding window of size w

        R: 1 1 1 3
        S: 2 3 1 1
        Output: (r0,s2), (r1,s2), (r2,s2)
                (r3,s1), (r1,s3), (r2,s3)

  12. Abstract Model (cont.)
      - Data streams R(A,…), S(A,…)
      - Compute equi-join on A
      - Match all r and s of streams R, S such that r.A = s.A
      - Sliding window of size w

        R: 1 1 1 3 2
        S: 2 3 1 1 4
        Output: (r0,s2), (r1,s2), (r2,s2)
                (r3,s1), (r1,s3), (r2,s3)
                No new output
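
The abstract model on slides 10 to 12 can be traced with a short Python sketch. This is an illustration only, not the presenter's implementation: the function name `window_join` and the expiry rule (a record arriving at time t' is dropped once t - t' >= w) are assumptions, chosen because they reproduce the outputs listed on the slides for w = 3 with synchronous arrival.

```python
from collections import deque

def window_join(r_values, s_values, w):
    """Symmetric sliding-window equi-join over two synchronous streams.

    Returns one list per time step holding the new (r_index, s_index)
    pairs produced at that step. Assumed expiry rule: a record that
    arrived at time t0 leaves the window once t - t0 >= w.
    """
    r_win, s_win = deque(), deque()  # each entry is (arrival_time, value)
    steps = []
    for t, (r, s) in enumerate(zip(r_values, s_values)):
        # Drop records that have fallen out of the window.
        while r_win and t - r_win[0][0] >= w:
            r_win.popleft()
        while s_win and t - s_win[0][0] >= w:
            s_win.popleft()
        new = []
        # The new R record probes the S window, and vice versa.
        new += [(t, j) for j, v in s_win if v == r]
        new += [(i, t) for i, v in r_win if v == s]
        # The two records arriving together may also match each other.
        if r == s:
            new.append((t, t))
        r_win.append((t, r))
        s_win.append((t, s))
        steps.append(new)
    return steps

for t, pairs in enumerate(window_join([1, 1, 1, 3, 2], [2, 3, 1, 1, 4], w=3)):
    print(t, pairs)
```

With unlimited memory each window holds at most w records per stream; the slides that follow consider the case where less memory than that is available.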

  13. Limited Resources
      - Focus on limited memory M < 2w
      - State of the art: random load shedding [KNV03]
      - Random sample of streams
      - Desired approach: semantic load shedding
      - Goal: graceful degradation
      - Approximation
      - Set-valued result: Error measure?

  14. Set-Approximation Error
      - What is a good error measure?
      - Information Retrieval, Statistics, Data Mining
      - Matching coefficient: $|A \cap B|$
      - Dice coefficient: $2\,|A \cap B| / (|A| + |B|)$
      - Jaccard coefficient: $|A \cap B| / |A \cup B|$
      - Cosine coefficient: $|A \cap B| / \sqrt{|A|\,|B|}$
      - Overlap coefficient: $|A \cap B| / \min\{|A|, |B|\}$
      - Earth Mover's Distance (EMD) [RTG98]
      - Match And Compare (MAC) [IP99]
      - Join: subset of output result
      - EMD, Overlap coefficient trivially 0 or 1
      - Others (except MAC) reduce to MAX-subset error measure
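
A small Python sketch of the set coefficients on slide 14 may help show why they collapse when the approximate join result is a subset of the exact one. The function names are hypothetical, the cosine coefficient uses the standard set form with a square root in the denominator (an assumption about what the slide showed), and both sets are assumed non-empty.

```python
import math

def matching(a, b):
    return len(a & b)

def dice(a, b):
    return 2 * len(a & b) / (len(a) + len(b))

def jaccard(a, b):
    return len(a & b) / len(a | b)

def cosine(a, b):
    return len(a & b) / math.sqrt(len(a) * len(b))

def overlap(a, b):
    return len(a & b) / min(len(a), len(b))

# Exact join result from the abstract-model example, and an approximate
# result that kept only three of the six pairs (chosen for illustration).
exact = {(0, 2), (1, 2), (2, 2), (3, 1), (1, 3), (2, 3)}
approx = {(0, 2), (1, 2), (3, 1)}

for f in (matching, dice, jaccard, cosine, overlap):
    print(f.__name__, f(exact, approx))
```

Because `approx` is a subset of `exact`, the intersection is `approx` itself: the overlap coefficient is always 1, and the matching, Dice, Jaccard, and cosine values all grow with the size of `approx`, which is the MAX-subset behaviour the slide refers to.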

  15. Optimization Problem
      Select records to be kept in memory such that the result size is maximized subject to memory constraints
      - Lightweight online technique
      - Adaptivity in presence of memory fluctuations

  16. Optimal Offline Algorithm
      - What is the best possible that can be achieved?
      - Optimal sampling strategy for MAX-subset
      - Bottom-line for evaluation of any online algorithm
      - Same optimization problem, but knows future
      - Finite subsets of input streams
      - Formulate as linear flow problem

  17. Generation of Flow Model
      [Figure: flow network for M=2, w=3, R=1,1,1,3, S=2,3,1,1 with fixed memory allocation. Arcs are labeled "Keep in memory" and "Replace"; capacity 0..1, linear cost; arc costs shown include -1, -3, and 3.]

  18. Correspondence to Windows
      [Figure: R=1,1,1,3; S=2,3,1,1]

  19. Correspondence to Windows
      [Figure: R=1,1,1,3; S=2,3,1,1]

  20. Correspondence to Windows
      [Figure: R=1,1,1,3; S=2,3,1,1; three arcs with cost -1]

  21. Correspondence to Windows
      [Figure: R=1,1,1,3; S=2,3,1,1; six arcs with cost -1]

  22. Complexity
      - Integer solution exists
      - Optimal solution found in $O(n^2 m \log n)$
      - N: input size of single stream
      - #nodes: n < 2wN + N + 2
      - #arcs: m < 2n + M + 1
      - Reasonable costs for benchmarking
      - Approx. 1 GB memory (w=800, M=800)
      - Approx. 1 h computation time

  23. Optimal Flow
      [Figure: flow network for M=2, w=3, R=1,1,1,3, S=2,3,1,1 with fixed memory allocation. Arcs are labeled "Keep in memory" and "Replace"; capacity 0..1, linear cost; arc costs shown include -1, -3, and 3.]

  24. Easy to Extend
      [Figure: flow network for M=2, w=3, R=1,1,1,3, S=2,3,1,1 with variable memory allocation. Arcs are labeled "Keep in memory" and "Replace"; capacity 0..1, linear cost; arc costs shown include -1, -3, and 3.]

  25. Online Heuristics
      - Maximize expected output
      - PROB: sort tuples by join partner arrival probability
      - LIFE: sort tuples by product of partner arrival probability and remaining lifetime
      - Maintain stream statistics
      - Histograms (DGIM02, TGIK02), wavelets (GKMS01), quantiles (GKMS02, GK01)
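
The two ranking rules on slide 25 can be sketched as eviction policies. Everything below is an assumption made for illustration: the function names, the use of a plain frequency table of the partner stream as the probability estimate, and the definition of remaining lifetime as w minus the record's age.

```python
def partner_probability(value, partner_freq):
    """Estimated probability that the next partner-stream record carries `value`.

    `partner_freq` is a hypothetical value -> count table for the other stream.
    """
    total = sum(partner_freq.values())
    return partner_freq.get(value, 0) / total if total else 0.0

def prob_score(record, now, w, partner_freq):
    arrival, value = record
    return partner_probability(value, partner_freq)

def life_score(record, now, w, partner_freq):
    arrival, value = record
    remaining = max(0, w - (now - arrival))  # time steps left in the window
    return partner_probability(value, partner_freq) * remaining

def choose_victim(records, now, w, partner_freq, score):
    """Return the stored (arrival_time, value) record with the lowest score."""
    return min(records, key=lambda rec: score(rec, now, w, partner_freq))

# Value 1 is frequent in the partner stream, value 3 is rare.
partner_freq = {1: 8, 3: 2}
stored = [(0, 1), (4, 3)]  # an old record with value 1, a fresh one with value 3

print(choose_victim(stored, now=4, w=5, partner_freq=partner_freq, score=prob_score))
print(choose_victim(stored, now=4, w=5, partner_freq=partner_freq, score=life_score))
```

In this example PROB evicts the fresh record with the rare value, while LIFE evicts the old record that is about to expire, since its short remaining lifetime outweighs its higher match probability.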

  26. Approximation Quality

  27. Effect of Skew

  28. Summary
      - Information sphere architecture
      - Optimal algorithm and fast efficient heuristic for sliding window joins
      - Open problems
      - Other set error measures, resource models
      - Other joins: compress records
      - Complex queries
      - Distributed processing
      - Integration with other techniques into local information sphere

  29. Related Work
      - Aurora (Brown, MIT), STREAM (Stanford), Telegraph (Berkeley), NiagaraCQ (Wisconsin, OGI)
      - Memory requirements [ABBMW02, TM02]
      - Aggregation
      - Alon, Bar-Yossef, Datar, Dobra, Garofalakis, Gehrke, Gibbons, Gilbert, Indyk, Korn, Kotidis, Koudas, Matias, Motwani, Muthukrishnan, Rastogi, Srivastava, Strauss, Szegedy

  30. Other Results [DGR03]
      - Integration with archive
      - Load smoothing, not shedding
      - Novel "error" measure: archive access cost
      - Static join for sensor networks
      - Maximize result size subject to constraints on energy consumption
      - Polynomial dynamic programming solution
      - Fast 2-approximation algorithms
      - NP-hardness proof for join of 3 or more streams

  31. Other Results (cont.) [DGGR02]
      - Computation of aggregates over streams for multiple joins
      - Small pseudo-random sketch synopses (randomized linear projections)
      - Explicit, tunable error guarantees
      - Sketch partitioning to boost accuracy (intelligently partition join attribute space)
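
The idea of a sketch as a randomized linear projection (slide 31) can be illustrated with a toy join-size estimator. This is a simplified sketch under stated assumptions, not the method of [DGGR02]: the names are hypothetical, the ±1 variables come from a seeded `random.Random` rather than a four-wise independent hash family, and the estimate is a plain mean over copies rather than a median of means.

```python
import random

def xi(copy, value):
    """Pseudo-random +1/-1 variable for `value` in sketch copy `copy`.

    Seeding by string makes the result repeatable; this stands in for the
    limited-independence hash families used by real sketch constructions.
    """
    return random.Random(f"{copy}:{value}").choice((-1, 1))

def sketch(stream, num_copies):
    """One counter per copy: the sum of xi(copy, v) over all stream values v."""
    counters = [0] * num_copies
    for v in stream:
        for k in range(num_copies):
            counters[k] += xi(k, v)
    return counters

def estimate_join_size(sketch_r, sketch_s):
    """Average of the per-copy counter products."""
    products = [a * b for a, b in zip(sketch_r, sketch_s)]
    return sum(products) / len(products)

# The sketch is linear: sketching a concatenated stream equals adding sketches.
left, right = [1, 2, 2], [3, 1]
combined = sketch(left + right, 4)
summed = [a + b for a, b in zip(sketch(left, 4), sketch(right, 4))]
print(combined == summed)

# With a single distinct value the estimate is exact: 3 * 4 matching pairs.
print(estimate_join_size(sketch([5] * 3, 8), sketch([5] * 4, 8)))
```

Each stream is summarized by `num_copies` integers regardless of its length, and more copies reduce the variance of the estimate at the cost of more memory and update work.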

  32. Thanks! Questions?
