Efficient Processing of Massive Data Streams for Mining and - PowerPoint PPT Presentation

Efficient Processing of Massive Data Streams for Mining and Monitoring Mirek Riedewald Department of Computer Science Cornell University

Acknowledgements
- Al Demers
- Abhinandan Das
- Alin Dobra
- Sasha Evfimievski
- Johannes Gehrke
- KD-D initiative (Art Becker et al.)

Introduction
- Data streams versus databases
  - Infinite stream, continuous queries
  - Limited resources
- Network monitoring
  - High arrival rates, approximation [CGJSS02]
- Stock trading
  - Complex computation [ZS02]
- Retail, E-business, Intelligence, Medical Surveillance
  - Identify relevant information on-the-fly, archive for data mining
  - Exact results, error guarantees

Information Spheres
- Local Information Sphere
  - Within each organization
  - Continuous processing of distributed data streams
  - Online evaluation of thousands of triggers
  - Storage/archival of important data
- Global Information Sphere
  - Between organizations
  - Share data in a privacy-preserving way

Local Information Sphere
Distributed data stream event processing and online data mining
- Technical challenges
  - Blocking operators, unbounded state
  - Graceful degradation under increasing load
  - Integration with archive
  - Processing of physically distributed streams

Event Matching, Correlation
- Join of data streams

Brand  Mpix  Price  |  Mpix   Price
Canon  3.0   200    |  > 2.0  < 250

Event Matching, Correlation
- Join of data streams

Brand  Mpix  Price  |  Mpix   Price
Canon  3.0   200    |  > 2.0  < 250
Fuji   3.0   100    |  > 4.0  < 400

Event Matching, Correlation
- Join of data streams

Brand  Mpix  Price  |  Mpix   Price
Canon  3.0   180    |  > 2.0  < 250
Fuji   3.0   220    |  > 4.0  < 400
Kodak  4.0   340    |  = 3.0  < 200

- Equi-join, text similarity, geographical proximity, …
- Problem: unbounded state, computation

Window Joins
- Restrict join to window of most recent records (tuples)
  - Landmark window
  - Sliding window based on time or number of records
- Problem definition
  - Window based on time: size w
  - Synchronous record arrival
  - Equi-join

Abstract Model
- Data streams R(A,…), S(A,…)
- Compute equi-join on A
  - Match all r and s of streams R, S such that r.A=s.A
- Sliding window of size w

R: 1 1 1
S: 2 3 1
Output: (r0,s2), (r1,s2), (r2,s2)

Abstract Model (cont.)
- Data streams R(A,…), S(A,…)
- Compute equi-join on A
  - Match all r and s of streams R, S such that r.A=s.A
- Sliding window of size w

R: 1 1 1 3
S: 2 3 1 1
Output: (r0,s2), (r1,s2), (r2,s2); then (r3,s1), (r1,s3), (r2,s3)

Abstract Model (cont.)
- Data streams R(A,…), S(A,…)
- Compute equi-join on A
  - Match all r and s of streams R, S such that r.A=s.A
- Sliding window of size w

R: 1 1 1 3 2
S: 2 3 1 1 4
Output: (r0,s2), (r1,s2), (r2,s2); then (r3,s1), (r1,s3), (r2,s3); then no new output
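The abstract model can be replayed with a small simulation. The following is a minimal sketch, not the authors' implementation: it assumes a window size of w=3 (the value used with the same streams in the flow-model example), 0-based record indices as in the slide's output pairs, and that a record arriving at time t can still join records with timestamps greater than t-w. The function name `window_join` is hypothetical.

```python
from collections import deque

def window_join(r_values, s_values, w):
    """Synchronous sliding-window equi-join.

    Returns one list per time step holding the new result pairs
    produced when r_t and s_t arrive together.
    """
    r_win, s_win = deque(), deque()  # (timestamp, value) of records still in the window
    results = []
    for t, (r, s) in enumerate(zip(r_values, s_values)):
        # Expire records that are w or more time units old.
        for win in (r_win, s_win):
            while win and win[0][0] <= t - w:
                win.popleft()
        new_pairs = []
        # New r joins older s records; new s joins older r records.
        new_pairs += [(f"r{t}", f"s{j}") for j, v in s_win if v == r]
        new_pairs += [(f"r{i}", f"s{t}") for i, v in r_win if v == s]
        if r == s:  # the two simultaneous arrivals may also match each other
            new_pairs.append((f"r{t}", f"s{t}"))
        r_win.append((t, r))
        s_win.append((t, s))
        results.append(new_pairs)
    return results

out = window_join([1, 1, 1, 3, 2], [2, 3, 1, 1, 4], w=3)
for t, pairs in enumerate(out):
    print(t, pairs or "no new output")
```

Under these assumptions the simulation yields the three output groups listed on the slide: three pairs when s2 arrives, three more one step later, and nothing at the final step because r0 and s0 have left the window. Each window holds at most w records, so exact evaluation needs memory for 2w records in total.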

Limited Resources
- Focus on limited memory M<2w
- State of the art: random load shedding [KNV03]
  - Random sample of streams
- Desired approach: semantic load shedding
- Goal: graceful degradation
  - Approximation
  - Set-valued result: Error measure?

Set-Approximation Error
- What is a good error measure?
  - Information Retrieval, Statistics, Data Mining
  - Matching coefficient: $|A \cap B|$
  - Dice coefficient: $2|A \cap B| / (|A| + |B|)$
  - Jaccard coefficient: $|A \cap B| / |A \cup B|$
  - Cosine coefficient: $|A \cap B| / \sqrt{|A| \cdot |B|}$
  - Overlap coefficient: $|A \cap B| / \min\{|A|, |B|\}$
  - Earth Mover’s Distance (EMD) [RTG98]
  - Match And Compare (MAC) [IP99]
- Join: subset of output result
  - EMD, Overlap coefficient trivially 0 or 1
  - Others (except MAC) reduce to MAX-subset error measure
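To see why most of these coefficients collapse to "maximize the size of the produced subset", the sketch below evaluates them for an approximate join result A that is a subset of the exact result B. It assumes the standard set-based definitions (with the cosine coefficient normalized by the square root of |A|·|B|) and non-empty sets; the function name `set_measures` and the toy data are illustrative only.

```python
import math

def set_measures(a, b):
    """Set-similarity coefficients for two non-empty sets."""
    a, b = set(a), set(b)
    inter = len(a & b)
    return {
        "matching": inter,
        "dice": 2 * inter / (len(a) + len(b)),
        "jaccard": inter / len(a | b),
        "cosine": inter / math.sqrt(len(a) * len(b)),
        "overlap": inter / min(len(a), len(b)),
    }

exact = set(range(10))  # stand-in for the exact join result B
# Approximate results of growing size, each a subset of the exact result.
by_size = [set_measures(set(range(k)), exact) for k in range(1, 11)]
for k, m in enumerate(by_size, start=1):
    print(k, {name: round(value, 3) for name, value in m.items()})
```

For every non-empty subset the overlap coefficient is 1, so it cannot distinguish good from bad approximations, while the matching, Dice, Jaccard, and cosine coefficients all grow with the number of result tuples produced.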

Optimization Problem
Select records to be kept in memory such that the result size is maximized subject to memory constraints
- Lightweight online technique
- Adaptivity in presence of memory fluctuations

Optimal Offline Algorithm
- What is the best result that can be achieved?
  - Optimal sampling strategy for MAX-subset
  - Baseline for evaluating any online algorithm
- Same optimization problem, but the algorithm knows the future
  - Finite subsets of input streams
  - Formulate as linear flow problem

Generation of Flow Model
[Figure: flow network for M=2, w=3 over R=1,1,1,3 and S=2,3,1,1, with fixed memory allocation. Arcs have capacity 0..1 and linear cost; arc labels shown include -1, -3, and 3. Legend: keep in memory, replace.]

Correspondence to Windows
[Figure: step-by-step build relating arcs of cost -1 in the flow model to the sliding windows over R=1,1,1,3 and S=2,3,1,1.]

Complexity
- Integer solution exists
- Optimal solution found in $O(n^2 m \log n)$
  - N: input size of a single stream
  - #nodes: n < 2wN + N + 2
  - #arcs: m < 2n + M + 1
- Reasonable costs for benchmarking
  - Approx. 1GB memory (w=800, M=800)
  - Approx. 1h computation time

Optimal Flow
[Figure: optimal flow in the network for M=2, w=3 over R=1,1,1,3 and S=2,3,1,1, with fixed memory allocation. Arcs have capacity 0..1 and linear cost; arc labels shown include -1, -3, and 3. Legend: keep in memory, replace.]

Easy to Extend
[Figure: the same flow network for M=2, w=3 over R=1,1,1,3 and S=2,3,1,1, now with variable memory allocation. Arcs have capacity 0..1 and linear cost; arc labels shown include -1, -3, and 3. Legend: keep in memory, replace.]

Online Heuristics
- Maximize expected output
  - PROB: sort tuples by join partner arrival probability
  - LIFE: sort tuples by product of partner arrival probability and remaining lifetime
- Maintain stream statistics
  - Histograms (DGIM02, TGIK02), wavelets (GKMS01), quantiles (GKMS02, GK01)
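The two heuristics differ only in the score used to rank buffered tuples. The sketch below is one possible reading of the slide, not the authors' code. It assumes each buffered tuple is an `(arrival_time, value)` pair, that `partner_prob` is an estimate (for example from a histogram) of how likely a tuple with that value is to arrive on the other stream, and that remaining lifetime is measured in time units until the tuple leaves a window of size w. The name `shed` and its parameters are hypothetical.

```python
def shed(tuples, capacity, now, w, partner_prob, policy="PROB"):
    """Keep the `capacity` highest-scoring tuples of one stream's buffer.

    tuples:       list of (arrival_time, value) pairs
    partner_prob: dict mapping value -> estimated arrival probability
                  of a matching tuple on the other stream
    policy:       "PROB" ranks by probability alone,
                  "LIFE" ranks by probability * remaining lifetime
    """
    def score(tup):
        arrival, value = tup
        p = partner_prob.get(value, 0.0)
        if policy == "PROB":
            return p
        remaining = max(0, arrival + w - now)  # time units left in the window
        return p * remaining

    live = [t for t in tuples if t[0] + w > now]  # drop tuples that already expired
    return sorted(live, key=score, reverse=True)[:capacity]

buffer = [(0, "a"), (4, "b")]
stats = {"a": 0.5, "b": 0.2}
kept_prob = shed(buffer, capacity=1, now=4, w=6, partner_prob=stats, policy="PROB")
kept_life = shed(buffer, capacity=1, now=4, w=6, partner_prob=stats, policy="LIFE")
print(kept_prob, kept_life)
```

In this toy case PROB keeps the old tuple with the more frequent value, while LIFE prefers the newer tuple because its longer remaining lifetime outweighs its lower match probability.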

Approximation Quality

Effect of Skew

Summary
- Information sphere architecture
- Optimal algorithm and fast, efficient heuristic for sliding window joins
- Open problems
  - Other set error measures, resource models
  - Other joins: compress records
  - Complex queries
  - Distributed processing
- Integration with other techniques into local information sphere

Related Work
- Aurora (Brown, MIT), STREAM (Stanford), Telegraph (Berkeley), NiagaraCQ (Wisconsin, OGI)
- Memory requirements [ABBMW02, TM02]
- Aggregation
  - Alon, Bar-Yossef, Datar, Dobra, Garofalakis, Gehrke, Gibbons, Gilbert, Indyk, Korn, Kotidis, Koudas, Matias, Motwani, Muthukrishnan, Rastogi, Srivastava, Strauss, Szegedy

Other Results [DGR03]
- Integration with archive
  - Load smoothing, not shedding
  - Novel “error” measure: archive access cost
- Static join for sensor networks
  - Maximize result size subject to constraints on energy consumption
  - Polynomial dynamic programming solution
  - Fast 2-approximation algorithms
  - NP-hardness proof for join of 3 or more streams

Other Results (cont.) [DGGR02]
- Computation of aggregates over streams for multiple joins
- Small pseudo-random sketch synopses (randomized linear projections)
- Explicit, tunable error guarantees
- Sketch partitioning to boost accuracy (intelligently partition join attribute space)
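As a rough illustration of a "randomized linear projection", the sketch below estimates the size of a two-stream equi-join in the style of AMS sketches. It is a simplified stand-in, not the [DGGR02] method: it handles a single join, averages independent copies instead of using a median of means, omits sketch partitioning, and stores the random signs in a dictionary where a real synopsis would use a compact hash family. All function names are hypothetical.

```python
import random

def make_sign(seed):
    """Return a function assigning each join value a fixed random sign in {-1, +1}."""
    rng = random.Random(seed)
    cache = {}  # illustrative only; a small-space sketch would hash instead
    def sign(value):
        if value not in cache:
            cache[value] = rng.choice((-1, 1))
        return cache[value]
    return sign

def sketch(stream, sign):
    """One sketch counter: the stream's frequency vector projected onto the sign vector."""
    return sum(sign(v) for v in stream)

def estimate_join_size(r, s, copies=64, seed=0):
    """Average of X_R * X_S over independent copies; its expectation is the join size."""
    total = 0
    for k in range(copies):
        sign = make_sign(seed + k)
        total += sketch(r, sign) * sketch(s, sign)
    return total / copies

exact_case = estimate_join_size([1] * 5, [1] * 3)
noisy_case = estimate_join_size([1, 1, 2, 3, 3, 3], [1, 3, 3, 4, 5])
print(exact_case, noisy_case)
```

Each counter can be updated in constant time per arriving record, and because the sketch is linear, counters built on different stream fragments can simply be added. With a single distinct value the estimate is exact (5 · 3 = 15); with several values it fluctuates around the true join size, and more copies reduce the variance.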

Thanks! Questions?
