Introduction to data stream querying and mining
Georges HEBRAIL
Workshop Franco-Brasileiro sobre Mineração de Dados Recife, May 5-7, 2009
Introduction to data stream querying and mining Page 2 G.HEBRAIL – May 5th, 2009
Now at Google
What is a data stream?
Applications of data stream management
Models for data streams
Data stream management systems
Data stream mining
Synopses structures
Conclusion
Timestamp          …       …       U1 (V)   I1 (A)
…                  …       …       …        …
16/12/2006-17:26   5,374   0,498   233,29   23
16/12/2006-17:27   5,388   0,502   233,74   23
16/12/2006-17:28   3,666   0,528   235,68   15,8
16/12/2006-17:29   3,52    0,522   235,02   15
…                  …       …       …        …
Golab & Özsu (2003): “A data stream is a real-time, continuous, sequence of items. It is impossible to control the order in which items arrive, nor is it feasible to locally store a stream in its entirety.”
Structured records ≠ audio or video data
Massive volumes of data, records arrive at a high rate
Timestamp   Source     Destination   Duration   Bytes   Protocol
…           …          …             …          …       …
12342       10.1.0.2   16.2.3.7      12         20K     http
12343       18.6.7.1   12.4.0.3      16         24K     http
12344       12.4.3.8   14.8.7.4      26         58K     http
12345       19.7.1.2   16.5.5.8      18         80K     ftp
…           …          …             …          …       …
Applications of data stream processing
Requirements
Real-time processing
One-pass processing
Bounded storage (no complete storage of streams)
Possibly consider several streams
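As a minimal sketch of what one-pass processing with bounded storage can look like, the following keeps a running mean and variance of a stream while storing only three numbers. The class name RunningStats and the use of Welford's update are illustrative assumptions, not something taken from the slides.

```python
class RunningStats:
    """One-pass mean/variance (Welford's update); hypothetical helper."""

    def __init__(self):
        self.n = 0        # number of records seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations from the mean

    def update(self, x):
        # Each record is read once and then discarded.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Population variance of the records seen so far.
        return self.m2 / self.n if self.n else 0.0


stats = RunningStats()
for voltage in (233.29, 233.74, 235.68, 235.02):
    stats.update(voltage)
```

The memory used does not grow with the length of the stream, which is the point of the bounded-storage requirement.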
generating unstorable large amounts of data
analysis)
consumption)
Source     Destination   Duration   Bytes   Protocol
…          …             …          …       …
10.1.0.2   16.2.3.7      12         20K     http
18.6.7.1   12.4.0.3      16         24K     http
12.4.3.8   14.8.7.4      26         58K     http
19.7.1.2   16.5.5.8      18         80K     ftp
…          …             …          …       …
Stock monitoring
Source: Gehrke 07 and Cayuga application scenarios (Cornell University)
the first MSFT price afterwards is below $27.
from one transaction to the next.
monotonically for 30 min.
the price chart of any stock
price of a stock and its 10 day moving average is greater than some threshold value
Benchmark to compare Data Stream Management Systems
Source: Linear Road: A Stream Data Management Benchmark, VLDB 2004
Linear City
access ramp), cut into segments
transmission
accident every 20 minutes
Source: Linear Road: A Stream Data Management Benchmark, VLDB 2004
segment
segment change
direction for the last 5 minutes
lane for 4 position reports
Models for data streams
Timestamp   Source     Destination   Duration   Bytes   Protocol
…           …          …             …          …       …
12342       10.1.0.2   16.2.3.7      12         20K     http
12343       18.6.7.1   12.4.0.3      16         24K     http
12344       12.4.3.8   14.8.7.4      26         58K     http
12345       19.7.1.2   16.5.5.8      18         80K     ftp
…           …          …             …          …       …

Timestamp          …       …       U1 (V)   I1 (A)
…                  …       …       …        …
16/12/2006-17:26   5,374   0,498   233,29   23
16/12/2006-17:27   5,388   0,502   233,74   23
16/12/2006-17:28   3,666   0,528   235,68   15,8
16/12/2006-17:29   3,52    0,522   235,02   15
…                  …       …       …        …
Applying queries/mining tasks to the whole stream (from beginning to current time)
Applying queries/mining to a portion of the stream
[Figure: time axis t from the beginning of the stream to the current date, with a window on the stream]
Definition of windows of interest on streams
Window specification
Refreshing rate
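A window specification and a refreshing rate can be illustrated with a small sketch. It assumes a time-based sliding window covering the interval [r - window, r) at each refresh instant r, records arriving in timestamp order, and an average as the query; the function name and these choices are assumptions made for the example.

```python
from collections import deque


def sliding_window_averages(records, window, refresh):
    """Yield (refresh_time, average) over a time-based sliding window.

    records: iterable of (timestamp, value) pairs in timestamp order.
    window:  length of the window, in the same unit as the timestamps.
    refresh: time between two successive results (refreshing rate).
    """
    buf = deque()        # records currently inside the window
    total = 0.0          # running sum of the values in buf
    next_refresh = None  # next instant at which a result is due
    for ts, value in records:
        if next_refresh is None:
            next_refresh = ts + refresh
        # Produce every result that is due before this record is added.
        while ts >= next_refresh:
            # Expire records that left the window [r - window, r).
            while buf and buf[0][0] < next_refresh - window:
                total -= buf.popleft()[1]
            if buf:
                yield next_refresh, total / len(buf)
            next_refresh += refresh
        buf.append((ts, value))
        total += value


results = list(sliding_window_averages(
    [(0, 10), (1, 20), (2, 30), (3, 40), (4, 50)], window=2, refresh=1))
```

Only the records inside the current window are kept, so storage depends on the window length and arrival rate rather than on the length of the stream.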
[Figure: a window on the time axis t ending at the current time tc; after the refreshment time it ends at t'c, and results are produced at each evaluation]
Data stream management systems
Definition of a DSMS (Data Stream Management System)
DSMS data model
Queries in a DSMS
Approximate answers to queries
Main existing DSMS
DBMS (Data Base Management System) versus DSMS (Data Stream Management System)
Data model
  DBMS: Permanent updatable relations
  DSMS: Streams and permanent updatable relations
Storage
  DBMS: Data is stored on disk
  DSMS: Permanent relations are stored on disk; streams are processed on the fly
Performance
  DBMS: Large volumes of data
  DSMS: Optimization of computer resources to deal with several streams and several queries; ability to face variations in arrival rates without crash
Query
  DBMS: SQL language (creating structures; inserting/updating/deleting data; retrieving data with one-time queries)
  DSMS: SQL-like query language (standard SQL on permanent relations; extended SQL on streams with windowing; continuous queries)
Data feeding
  DBMS: SQL language in a programming language; import/export utilities
  DSMS: Tools for capturing input streams and producing
DSMS data model
Permanent relations (table)
CUSTOMER TABLE
ID_CUSTOMER   …        FIRST NAME   ADDRESS            CITY
1             Dupont   Jacques      25, Rue de Paris   Bagneux
2             Duval    Pierre       12, Bd Jaurès      Orsay
3             …        Vincent      …                  Paris
4             Firin    Laure        34, Rue Irun       Vélizy
Streams
[Table: electricity measurement records with columns TIMESTAMP, …, U1 (V), I1 (A) and ID_CUSTOMER (values 2 and 3), for example 16/12/2006-17:26, 5,374, 0,498, 233,29, 23]
DSMS output
Queries in a DSMS
streams)
Source: Talk from Jennifer Widom http://infolab.stanford.edu/stream/index.html#talks
ISTREAM: stream of inserted tuples
DSTREAM: stream of deleted tuples
RSTREAM: stream of all tuples at every instant
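The three relation-to-stream operators can be sketched on two successive states of a relation. This is an illustration only: relations are modelled as Python sets of tuples (so duplicates are ignored), and the function names simply mirror the operator names.

```python
def istream(previous, current):
    """Tuples inserted between two successive instants."""
    return current - previous


def dstream(previous, current):
    """Tuples deleted between two successive instants."""
    return previous - current


def rstream(previous, current):
    """All tuples of the relation at the current instant."""
    return set(current)


# Hypothetical states of a (car_id, seg) relation at instants t-1 and t.
before = {("car1", 3), ("car2", 5)}
after = {("car1", 4), ("car2", 5)}

inserted = istream(before, after)   # car1 entered segment 4
deleted = dstream(before, after)    # car1 left segment 3
everything = rstream(before, after)
```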
CarLocStr (car_id, speed, expr_way, lane, dir, x_pos)

CarSegStr (car_id, speed, expr_way, dir, seg)
    SELECT car_id, speed, expr_way, dir, x_pos/5280
    FROM CarLocStr;

Toll notification to each vehicle
    RSTREAM (
        SELECT E.car_id, E.seg, T.toll
        FROM CarSegEntryStr [Now] AS E, SegToll AS T
        WHERE E.expr_way = T.expr_way AND E.dir = T.dir AND E.seg = T.seg);

CurCarSeg (car_id, expr_way, dir, seg)
    SELECT car_id, expr_way, dir, seg
    FROM CarSegStr [Partition By car_id Rows 1];

CarSegEntryStr (car_id, expr_way, dir, seg) (insertion stream)
    ISTREAM (
        SELECT * FROM CurCarSeg);

SegAvgSpeed (expr_way, dir, seg, speed)
    SELECT expr_way, dir, seg, AVG(speed)
    FROM CarSegStr [Range 5 Minutes]
    GROUP BY expr_way, dir, seg;

SegVolume (expr_way, dir, seg, volume)
    SELECT expr_way, dir, seg, COUNT(*)
    FROM CurCarSeg
    GROUP BY expr_way, dir, seg;

SegToll (expr_way, dir, seg, toll)
    SELECT S.expr_way, S.dir, S.seg, 2 * (V.volume - 150) * (V.volume - 150)
    FROM SegAvgSpeed AS S, SegVolume AS V
    WHERE S.expr_way = V.expr_way AND S.dir = V.dir AND S.seg = V.seg
      AND S.speed < 40.00;
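To make the SegToll expression concrete, here is a small sketch of the same arithmetic outside the query language. The function name segment_toll is hypothetical; returning None for segments that the WHERE clause filters out is a choice made for the example.

```python
def segment_toll(avg_speed, volume):
    """Toll for one segment, following the SegToll query.

    A toll is defined only when the average speed is below 40;
    it then equals 2 * (volume - 150) squared.
    """
    if avg_speed < 40.0:
        return 2 * (volume - 150) ** 2
    return None  # segment not present in SegToll


toll_congested = segment_toll(avg_speed=35.0, volume=160)
toll_fluid = segment_toll(avg_speed=50.0, volume=160)
```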
Approximate answers to queries
+ scheduler
– Sharing of execution plans, queuing files, buffers, temporary storage – Index of queries
Approximate answers to queries
When ?
requirements
Solution: approximate answers to queries
Main existing DSMS
General-purpose research DSMS’s
Specialized research or proprietary DSMS’s
Commercial DSMS’s
Data stream mining
Definition
Decision tree
PCA
Clustream
[Figure: time axis t from the beginning of the stream to the current date]
Application to the whole stream
Application to any portion
Application to a sliding window
Decision tree
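With \hat{G} denoting the value of G estimated on n examples:
\[ \hat{G}(X_j) \xrightarrow[n \to +\infty]{} G(X_j) \]
\[ \text{if } \hat{G}(X_j) - \hat{G}(X_{j'}) \ge \varepsilon \text{ with } \varepsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}} \text{ then } P\big(G(X_j) > G(X_{j'})\big) = 1 - \delta \]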
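The bound can be evaluated numerically. The sketch below computes ε from R, δ and n, and applies the split test to two estimated gains; the function names and the sample values are assumptions chosen for illustration.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n))."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def can_split(gain_best, gain_second, value_range, delta, n):
    """True when the observed gap between the two best attributes
    is at least epsilon, so the best one can be chosen after n examples."""
    return gain_best - gain_second >= hoeffding_bound(value_range, delta, n)


# Hypothetical values: R = 1, delta = 1e-7, 1000 examples seen at the leaf.
eps = hoeffding_bound(1.0, 1e-7, 1000)
decided = can_split(0.30, 0.15, 1.0, 1e-7, 1000)
undecided = can_split(0.30, 0.25, 1.0, 1e-7, 1000)
```

Since ε shrinks as 1/sqrt(n), a leaf that cannot decide yet only needs to wait for more examples.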
Algorithm
in the tree leaves
PCA
\[ \sum_{i=1..n} x_{ij} \qquad \sum_{i=1..n} x_{ij}\, x_{ij'} \]
Sliding window of 24h, refreshment every 1h
[Figure: the 24h window on the time axis t is split into 1h blocks and shifts from t to t + 1h]
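One way to combine the sums of x_ij and of x_ij x_ij' with a 24h window refreshed every hour is to keep one set of sums per 1h block and add up the blocks still in the window. The sketch below does this for the covariance matrix only (the eigen-decomposition step of PCA is left out); the class name, the block layout and the use of the population covariance are assumptions.

```python
from collections import deque


class WindowedCovariance:
    """Covariance over the last max_blocks closed blocks of records."""

    def __init__(self, p, max_blocks=24):
        self.p = p
        # The oldest block is dropped automatically when a new one is appended.
        self.blocks = deque(maxlen=max_blocks)
        self.current = self._empty_block()

    def _empty_block(self):
        p = self.p
        return {"n": 0, "s": [0.0] * p, "ss": [[0.0] * p for _ in range(p)]}

    def add(self, x):
        """Add one record x (sequence of p values) to the current block."""
        b = self.current
        b["n"] += 1
        for j in range(self.p):
            b["s"][j] += x[j]
            for k in range(self.p):
                b["ss"][j][k] += x[j] * x[k]

    def close_block(self):
        """Call at each refreshment (for instance every hour)."""
        self.blocks.append(self.current)
        self.current = self._empty_block()

    def covariance(self):
        n = sum(b["n"] for b in self.blocks)
        if n == 0:
            raise ValueError("no record in the window")
        p = self.p
        s = [sum(b["s"][j] for b in self.blocks) for j in range(p)]
        ss = [[sum(b["ss"][j][k] for b in self.blocks) for k in range(p)]
              for j in range(p)]
        return [[ss[j][k] / n - (s[j] / n) * (s[k] / n) for k in range(p)]
                for j in range(p)]
```

Because the sums are additive, sliding the window costs one block drop and one block append, independent of the number of records per hour.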
Clustream
Clustream (Aggarwal et al. 03)
Summarizing with evolving micro-clusters
Supports concept drift
Representation of micro-clusters
(n, CF1(T), CF2(T), CF1(X1), CF2(X1), …, CF1(Xp), CF2(Xp))
\[ CF1(X_j) = \sum_{i=1..n} x_{ij} \qquad CF2(X_j) = \sum_{i=1..n} x_{ij}^2 \]
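A micro-cluster in this representation can be sketched as a small record of counts and sums. The code below assumes that CF1(T) and CF2(T) are the sum and the sum of squares of the timestamps, defined like the sums on the variables; the class and method names are hypothetical.

```python
import math


class MicroCluster:
    """(n, CF1(T), CF2(T), CF1(X1), CF2(X1), ..., CF1(Xp), CF2(Xp))."""

    def __init__(self, p):
        self.n = 0
        self.cf1_t = 0.0           # sum of timestamps (assumed definition)
        self.cf2_t = 0.0           # sum of squared timestamps (assumed)
        self.cf1_x = [0.0] * p     # per-variable sums
        self.cf2_x = [0.0] * p     # per-variable sums of squares

    def insert(self, t, x):
        """Absorb one record x observed at time t."""
        self.n += 1
        self.cf1_t += t
        self.cf2_t += t * t
        for j, v in enumerate(x):
            self.cf1_x[j] += v
            self.cf2_x[j] += v * v

    def merge(self, other):
        """Merging two micro-clusters only adds their components."""
        self.n += other.n
        self.cf1_t += other.cf1_t
        self.cf2_t += other.cf2_t
        for j in range(len(self.cf1_x)):
            self.cf1_x[j] += other.cf1_x[j]
            self.cf2_x[j] += other.cf2_x[j]

    def centroid(self):
        return [s / self.n for s in self.cf1_x]

    def radius(self):
        """Root of the summed per-variable variances around the centroid."""
        var = sum(q / self.n - (s / self.n) ** 2
                  for s, q in zip(self.cf1_x, self.cf2_x))
        return math.sqrt(max(var, 0.0))
```

The centroid and radius are derived from the stored sums alone, so the original records never need to be kept.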
Maintenance of micro-clusters
Mechanism to keep track of micro-clusters history
Selection of relevant data for the period
Hierarchical clustering of micro-clusters
Synopses structures
Motivation
Approximate result based on summarized information
Several approaches
– Select element n with probability M/n
– If element n is selected, pick up randomly an element in the reservoir and replace it by element n
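The two steps above can be sketched as follows. The sketch assumes a reservoir of size M that is first filled with the first M elements of the stream (a step not shown in the lines above); the function name and the optional rng parameter are illustrative.

```python
import random


def reservoir_sample(stream, m, rng=None):
    """Return a uniform random sample of m elements from a stream.

    The stream is read once; only m elements are kept in memory.
    """
    rng = rng or random.Random()
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= m:
            # Assumption: the first m elements fill the reservoir.
            reservoir.append(item)
        elif rng.random() < m / n:
            # Element n is selected with probability m/n and replaces
            # an element chosen uniformly at random in the reservoir.
            reservoir[rng.randrange(m)] = item
    return reservoir


sample = reservoir_sample(range(1000), 10)
```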
Sketch
Many sketch structures: usually dedicated to a specialized task
Examples of sketch structures
\[ a_i(t) = a_i(t-1) + c_t \quad \text{if } i_t = i \]
\[ a_i(t) = a_i(t-1) \quad \text{if } i_t \ne i \]
[Figure: array of d rows and w columns of counters; the arriving item it is hashed by h1(it), h2(it), …, hd(it) to one counter per row, and ct is added to each of these counters]
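The array of d rows and w columns with one hash function per row matches the layout of the Count-Min sketch of Cormode and Muthukrishnan. The sketch below is a minimal illustration under that assumption: the hash functions are simple random linear functions built on Python's hash, and the estimate is the minimum of the d counters, which never underestimates when all updates ct are non-negative.

```python
import random


class CountMinSketch:
    """d rows of w counters; row j uses its own hash function h_j."""

    PRIME = (1 << 61) - 1  # modulus for the illustrative hash functions

    def __init__(self, d, w, seed=0):
        rng = random.Random(seed)
        self.d, self.w = d, w
        self.params = [(rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
                       for _ in range(d)]
        self.table = [[0] * w for _ in range(d)]

    def _column(self, row, item):
        a, b = self.params[row]
        return ((a * hash(item) + b) % self.PRIME) % self.w

    def update(self, item, c=1):
        """Item arrives with weight c: add c to one counter per row."""
        for row in range(self.d):
            self.table[row][self._column(row, item)] += c

    def estimate(self, item):
        """Estimate of the accumulated weight of item."""
        return min(self.table[row][self._column(row, item)]
                   for row in range(self.d))


cms = CountMinSketch(d=4, w=64)
cms.update("18.6.7.1", 5)
cms.update("10.1.0.2", 3)
approx = cms.estimate("18.6.7.1")
```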
− n i i d i i
1 1 1
Conclusion
Very active area of research
Many practical applications in various domains
DSMS are more mature than data stream mining
DSMS
Data stream mining
French ANR MIDAS project (2008-2010) http://midas.enst.fr
patterns, automata, OLAP data cubes
web services
vehicles
Orange Labs
Querying and Mining Data Streams: You Only Get One Look. A tutorial. M.Garofalakis, J.Gehrke, R.Rastogi, Tutorial SIGMOD'02, June 2002.
Issues in Data Stream Management. L.Golab, M.T.Özsu, Canada. SIGMOD Record, Vol. 32, No. 2, June 2003.
Models and Issues in Data Stream Systems. B.Babcock, S.Babu, M.Datar, R.Motwani, J.Widom, PODS’2002, 2002.
Data streams: algorithms and applications. S.Muthukrishnan, In Foundations and Trends in Theoretical Computer Science, Volume 1, Issue 2, August 2005.
Data streams: models and algorithms. C.C.Aggarwal. Springer, 2007.
Linear Road: A Stream Data Management Benchmark. A.Arasu, M.Cherniack, E.Galvez, D.Maier, A.S.Maskey, E.Ryvkina, M.Stonebraker, R.Tibbetts, Proceedings of the 30th VLDB Conference, Toronto, Canada,
Data Stream Management Systems - Applications, Concepts, and Systems. V.Goebel, T.Plagemann, Tutorial MIPS’2004, 2004.
STREAM: The Stanford Data Stream Management System. A.Arasu, B.Babcock, S.Babu, J.Cieslewicz, M.Datar, K.Ito, R.Motwani, U.Srivastava, J.Widom. Department of Computer Science, Stanford University. March 2004. Available at: http://www-db.stanford.edu/stream
TelegraphCQ: Continuous Dataflow Processing for an Uncertain World. S.Chandrasekaran, O.Cooper, A.Deshpande, M.J.Franklin, J.M.Hellerstein, W.Hong (Intel Berkeley Laboratory), S.Krishnamurthy, S.Madden, V.Raman (IBM Almaden Research Center), F.Reiss, M.Shah (UC Berkeley). CIDR
Aurora: A New Model and Architecture for Data Stream Management. D. Abadi, D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee, M. Stonebraker, N. Tatbul,
Load Shedding for Aggregation Queries over Data Streams. B.Babcock, M.Datar, R.Motwani, 2004. Available at: http://www-db.stanford.edu/stream
Aleri software, http://www.aleri.com
Coral8 software, http://www.coral8.com
Streambase software, http://www.streambase.com
Monitoring, Business Intelligence applications
  Standard data processing technology: Data warehouses (unscalable)
  Data stream processing technology: Querying and mining ‘on the fly’ (scalable)
Applications with basic streaming data
  Standard data processing technology: Specific development without database technology
  Data stream processing technology: Generic tools for processing data
Source: Aurora: a new model and architecture for data stream management, VLDB Journal 2003
One generic architecture proposed by Golab and Özsu (2003):
Source: Golab & Özsu 2003
Babcock, Datar and Motwani (STREAM Project)
Parameters of the problem
For each operator Oi: selectivity si, processing time of a tuple ti
For each terminal operator (SUM): result average µi and standard deviation σi
For each stream: arrival rate of tuples ri
For each operator Oi: pi is the number of tuples to send to it per unit of time
Problem definition
Determine the pi's by minimizing the maximum error on terminal operators under the constraint of the system's maximum load
Goal
Sketch structure
into L bits
[Figure: the value 18.6.7.1 is hashed to a bit vector H(18.6.7.1), which updates the sketch SK into a new SK; a value R is then read from SK]
For n elements already seen, we expect:
Goal
Sketch structure
h: hash function from [0, …, N-1] to [1, …, B]
s: hash function from [0, …, N-1] to {+1, -1}
Array of B counters: C1, …, CB (with B << N)
Sketch maintenance when e arrives: Ch(e) += s(e)
Use of sketch
Estimation of frequency of object e: ne ≈ Ch(e) . s(e)
In practice, t hash functions h and t hash functions s are used.
Theoretical results on error depending on N, t and B.
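The maintenance and estimation rules above can be sketched with t pairs of hash functions. How the t per-pair estimates are combined is not recoverable from the text; this sketch assumes the median, as in the Count sketch of Charikar et al., and uses simple random linear hash functions built on Python's hash. The class and parameter names are hypothetical.

```python
import random
from statistics import median


class CountSketch:
    """t arrays of b counters, each with its own hash pair (h, s)."""

    PRIME = (1 << 61) - 1  # modulus for the illustrative hash functions

    def __init__(self, t, b, seed=0):
        rng = random.Random(seed)
        self.t, self.b = t, b
        self.h_params = [(rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
                         for _ in range(t)]
        self.s_params = [(rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
                         for _ in range(t)]
        self.counters = [[0] * b for _ in range(t)]

    def _h(self, row, e):
        a, c = self.h_params[row]
        return ((a * hash(e) + c) % self.PRIME) % self.b

    def _s(self, row, e):
        a, c = self.s_params[row]
        return 1 if ((a * hash(e) + c) % self.PRIME) % 2 == 0 else -1

    def add(self, e):
        """Sketch maintenance when e arrives: C[h(e)] += s(e), in every row."""
        for row in range(self.t):
            self.counters[row][self._h(row, e)] += self._s(row, e)

    def estimate(self, e):
        """Frequency estimate: median over rows of C[h(e)] * s(e) (assumed)."""
        return median(self.counters[row][self._h(row, e)] * self._s(row, e)
                      for row in range(self.t))


cs = CountSketch(t=3, b=32)
for _ in range(7):
    cs.add("18.6.7.1")
freq = cs.estimate("18.6.7.1")
```

The random signs make collisions cancel on average, which is why a signed counter times s(e) gives an unbiased estimate per row.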