Multi-Dimensional Regression Analysis of Time-Series Data Streams
Yixin Chen, Guozhu Dong, Jiawei Han, Benjamin W. Wah, Jianyong Wang
University of Illinois at Urbana-Champaign
Wright State University
December 3, 2002
Outline
Characteristics of stream data
Why on-line analytical processing and
Linearly compressed representation of
A stream cube architecture
Stream cube computation
Discussion
Conclusions
Huge volumes of data, possibly infinite
Fast changing, requiring fast response
A data stream model is better suited to our data processing needs
Single linear scan algorithm: can only have one look
Random access is expensive
Store only the summary of the data seen thus far
Most stream data reside at a fairly low level or multi-
Telecommunication calling records
Business: credit card transaction flows
Network monitoring and traffic engineering
Financial market: stock exchange
Engineering & industrial processes: power supply &
Sensor, monitoring & surveillance: video streams
Security monitoring
Web logs and Web page click streams
Massive data sets (even saved but random access is too
Stream data model
Data Stream Management System (DSMS)
Stream query model
Continuous queries
Sliding windows
Stream data mining
Clustering & summarization (Guha, Motwani, et al.)
Correlation of data streams (Gehrke, et al.)
Classification of stream data (Domingos, et al.)
Mining frequent sets in streams (Motwani, et al., VLDB’02)
Most stream data are at a fairly low level or multi-
Analysis requirements
Multi-dimensional trends and unusual patterns
Capturing important changes at multiple dimensions/levels
Fast, real-time detection and response
Comparing with data cube: similarities and differences
Stream (data) cube or stream OLAP
Is it feasible? How to implement it efficiently?
Analysis of Web click streams
Raw data at low levels: seconds, web page addresses, user IP addresses, …
Analysts want: changes, trends, unusual patterns, at reasonable levels of detail
E.g., “Average clicking traffic in North America on sports in the last 15 minutes is 40% higher than that in the last 24 hours.”
Analysis of power consumption streams
Raw data: power consumption flow for every household, every minute
Patterns one may find: average hourly power consumption for manufacturing companies in Chicago over the last 2 hours today is 30% higher than on the same day a week ago
Raw data cannot be stored
Simple aggregates are not powerful enough
History shape and patterns at different levels are desirable: multi-dimensional regression analysis
A scalable multi-dimensional stream data warehouse that can aggregate regression models of stream data efficiently without accessing the raw data
Compress the stream data to support memory- and time-efficient multi-dimensional regression analysis
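Regression terms for the $i$-th observation $x_i$:
\[ u_i = u(x_i) = (1, u_{i1}, \ldots, u_{i,k-1})^T \]
Model, linear in the parameters $\eta = (\eta_0, \eta_1, \ldots, \eta_{k-1})^T$:
\[ E(y_i) = \eta_0 + \eta_1 u_{i1} + \cdots + \eta_{k-1} u_{i,k-1} = \eta^T u_i \]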
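Model matrix for $n$ observations:
\[ U = \begin{pmatrix} 1 & u_{11} & u_{12} & \cdots & u_{1,k-1} \\ 1 & u_{21} & u_{22} & \cdots & u_{2,k-1} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & u_{n1} & u_{n2} & \cdots & u_{n,k-1} \end{pmatrix} \]
The OLS estimate $\hat{\eta}$ of $\eta$ minimizes the residual sum of squares $RSS(\eta) = (y - U\eta)^T (y - U\eta)$:
\[ \frac{\partial RSS(\eta)}{\partial \eta} = 0 \;\Rightarrow\; \hat{\eta} = (U^T U)^{-1} U^T y \]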
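A minimal sketch of the OLS estimate $\hat{\eta} = (U^T U)^{-1} U^T y$ computed through the normal equations, in plain Python with exact rational arithmetic. The helper names `solve` and `ols` and the sample data are illustrative assumptions, not part of the original slides.

```python
from fractions import Fraction

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination (a must be square and invertible)."""
    n = len(a)
    # Augmented matrix with exact rational entries.
    m = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        # Find a row with a nonzero pivot and swap it into place.
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        scale = m[col][col]
        m[col] = [v / scale for v in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [row[n] for row in m]

def ols(U, y):
    """Return eta_hat = (U^T U)^{-1} U^T y for an n-by-k model matrix U."""
    k = len(U[0])
    utu = [[sum(row[i] * row[j] for row in U) for j in range(k)] for i in range(k)]
    uty = [sum(row[i] * yi for row, yi in zip(U, y)) for i in range(k)]
    return solve(utu, uty)

# Hypothetical data: a linear trend over four time ticks, rows of U are (1, t).
U = [[1, t] for t in range(4)]
y = [2 + 3 * t for t in range(4)]
eta_hat = ols(U, y)
```

Because the sample series lies exactly on a line, `eta_hat` recovers the intercept and slope used to generate it.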
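\[ \theta_{ij} = \sum_{h=1}^{n} u_{hi} u_{hj}, \qquad i, j = 0, \ldots, k-1, \quad u_{h0} = 1 \]
The compressed representation is built from the $\hat{\eta}_i$ and the $\theta_{ij}$.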
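LCR consists of $\hat{\eta}$ and $\Theta$, where
\[ \hat{\eta} = (\hat{\eta}_0, \hat{\eta}_1, \ldots, \hat{\eta}_{k-1})^T \]
and where
\[ \Theta = \begin{pmatrix} \theta_{00} & \theta_{01} & \theta_{02} & \cdots & \theta_{0,k-1} \\ \theta_{10} & \theta_{11} & \theta_{12} & \cdots & \theta_{1,k-1} \\ \vdots & \vdots & \vdots & & \vdots \\ \theta_{k-1,0} & \theta_{k-1,1} & \theta_{k-1,2} & \cdots & \theta_{k-1,k-1} \end{pmatrix} \]
$\hat{\eta}$ provides OLS regression parameters essential for regression analysis
$\Theta$ is an auxiliary matrix that facilitates aggregations of LCR in standard and regression dimensions in a data cube environment
LCR only stores the upper triangle of $\Theta$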
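As a rough storage count, assuming $\Theta = U^T U$ with entries $\theta_{ij} = \sum_h u_{hi} u_{hj}$ (so $\Theta$ is symmetric), keeping only the upper triangle loses no information:

```latex
\theta_{ij} = \sum_{h=1}^{n} u_{hi} u_{hj} = \theta_{ji}
\quad\Longrightarrow\quad
|\mathrm{LCR}|
  = \underbrace{k}_{\hat{\eta}}
  + \underbrace{\tfrac{k(k+1)}{2}}_{\text{upper triangle of } \Theta}
  = \frac{k^{2} + 3k}{2}
```

The count depends only on the number of regression terms $k$, not on the number $n$ of raw observations in the cell.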
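$LCR_1 = (\hat{\eta}_1, \Theta_1)$, $LCR_2 = (\hat{\eta}_2, \Theta_2)$, …, $LCR_m = (\hat{\eta}_m, \Theta_m)$ for m base cells
$LCR_a = (\hat{\eta}_a, \Theta_a)$ for an aggregated cell
Aggregation in standard dimensions (all base cells share the same $U$):
\[ \hat{\eta}_a = \sum_{i=1}^{m} \hat{\eta}_i, \qquad \Theta_a = \Theta_i \quad (i = 1, \ldots, m) \]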
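The standard-dimension rule (sum the parameter vectors, keep the shared $\Theta$) can be checked numerically. The sketch below assumes the simplest case, a linear model $y = \eta_0 + \eta_1 t$ with $k = 2$, and uses hypothetical helper names (`lcr`, `aggregate_standard`) and made-up data.

```python
from fractions import Fraction

def lcr(ts, ys):
    """Return (eta_hat, theta) for the linear model y = eta0 + eta1 * t."""
    n, st, stt = len(ts), sum(ts), sum(t * t for t in ts)
    sy, sty = sum(ys), sum(t * y for t, y in zip(ts, ys))
    det = Fraction(n * stt - st * st)  # determinant of theta = U^T U
    eta = ((stt * sy - st * sty) / det, (n * sty - st * sy) / det)
    theta = ((n, st), (st, stt))
    return eta, theta

def aggregate_standard(lcrs):
    """eta_a = sum_i eta_i; theta_a = theta_i, which all base cells share."""
    theta = lcrs[0][1]
    if any(th != theta for _, th in lcrs):
        raise ValueError("base cells must share the same time ticks")
    eta = tuple(sum(parts) for parts in zip(*(e for e, _ in lcrs)))
    return eta, theta

# Two base cells observed at the same time ticks (hypothetical data).
ticks = [0, 1, 2, 3]
cell_1 = [1, 3, 2, 5]
cell_2 = [4, 4, 6, 7]

merged = aggregate_standard([lcr(ticks, cell_1), lcr(ticks, cell_2)])
# Reference: fit the summed series directly from the raw data.
direct = lcr(ticks, [a + b for a, b in zip(cell_1, cell_2)])
```

`merged` and `direct` agree exactly, which is the sense in which the aggregation needs no access to the raw data.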
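$LCR_1 = (\hat{\eta}_1, \Theta_1)$, $LCR_2 = (\hat{\eta}_2, \Theta_2)$, …, $LCR_m = (\hat{\eta}_m, \Theta_m)$ for m base cells
$LCR_a = (\hat{\eta}_a, \Theta_a)$ for an aggregated cell
Aggregation in regression dimensions:
\[ \hat{\eta}_a = \left( \sum_{i=1}^{m} \Theta_i \right)^{-1} \sum_{i=1}^{m} \Theta_i \hat{\eta}_i, \qquad \Theta_a = \sum_{i=1}^{m} \Theta_i \]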
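The regression-dimension rule can be checked the same way: merging the LCRs of adjacent time intervals should match a fit over the concatenated interval. This sketch again assumes a linear model with $k = 2$; `lcr` and `aggregate_regression` are hypothetical names and the data is made up.

```python
from fractions import Fraction

def lcr(ts, ys):
    """Return (eta_hat, theta) for the linear model y = eta0 + eta1 * t."""
    n, st, stt = len(ts), sum(ts), sum(t * t for t in ts)
    sy, sty = sum(ys), sum(t * y for t, y in zip(ts, ys))
    det = Fraction(n * stt - st * st)
    eta = ((stt * sy - st * sty) / det, (n * sty - st * sy) / det)
    return eta, ((n, st), (st, stt))

def aggregate_regression(lcrs):
    """theta_a = sum_i theta_i; eta_a = theta_a^{-1} * sum_i theta_i eta_i."""
    theta = [[sum(th[r][c] for _, th in lcrs) for c in range(2)] for r in range(2)]
    # theta_i * eta_i equals U_i^T y_i, so this sum rebuilds U_a^T y_a.
    rhs = [sum(th[r][0] * eta[0] + th[r][1] * eta[1] for eta, th in lcrs)
           for r in range(2)]
    det = theta[0][0] * theta[1][1] - theta[0][1] * theta[1][0]
    eta = ((theta[1][1] * rhs[0] - theta[0][1] * rhs[1]) / det,
           (theta[0][0] * rhs[1] - theta[1][0] * rhs[0]) / det)
    return eta, tuple(tuple(row) for row in theta)

# Two adjacent time intervals of one series (hypothetical data).
ts_1, ys_1 = [0, 1, 2], [1, 3, 2]
ts_2, ys_2 = [3, 4, 5, 6], [5, 4, 6, 9]

merged = aggregate_regression([lcr(ts_1, ys_1), lcr(ts_2, ys_2)])
# Reference: fit the concatenated interval directly from the raw data.
direct = lcr(ts_1 + ts_2, ys_1 + ys_2)
```

Unlike the standard-dimension case, the parameter vectors cannot simply be added here; each one is weighted by its own $\Theta_i$ before the combined system is solved.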
Including quadratic, polynomial, and nonlinear models
A tilt time frame
Different time granularities
second, minute, quarter, hour, day, week, …
Critical layers
Minimum interest layer (m-layer)
Observation layer (o-layer)
User: watches at o-layer and occasionally needs to drill down to m-layer
Partial materialization of stream cubes
Full materialization: too costly in space and time
No materialization: slow response at query time
Partial materialization: what do we mean by “partial”?
Figure: tilt time frame (only the label “12 months” survives)
Limited memory space: impossible to store the history in full scale
Most applications emphasize recent data (sliding window)
Putting different weights on remote data
Useful even for uniform weight
For mining changes and evolutions
Finding those with dramatic changes
E.g., exceptional stocks: not following the trends
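One way to picture a tilt time frame is as a cascade of bounded buffers: recent data stays fine-grained, and full buffers are merged into a single coarser unit. The sketch below is an illustration under assumptions: the class name, the level capacities (4 quarter-hours per hour, 24 hours per day, 31 days per month, 12 months retained), and the use of a plain sum as the merged measure are all hypothetical; a stream cube would merge regression summaries instead of sums.

```python
from collections import deque

class TiltTimeFrame:
    """Cascade of time levels, finest first (a sketch, not the authors' design)."""

    def __init__(self, capacities):
        # capacities[i]: how many level-i units form one level-(i+1) unit.
        # The last entry only bounds how much coarse history is retained.
        self.capacities = capacities
        self.levels = [deque() for _ in capacities]

    def add(self, value, level=0):
        slots = self.levels[level]
        slots.append(value)
        if level == len(self.levels) - 1:
            if len(slots) > self.capacities[level]:
                slots.popleft()      # forget the oldest coarse unit
        elif len(slots) == self.capacities[level]:
            merged = sum(slots)      # stand-in aggregate for the coarser unit
            slots.clear()
            self.add(merged, level + 1)

    def size(self):
        return sum(len(slots) for slots in self.levels)

# Assumed layout: quarter-hours, hours, days, months.
frame = TiltTimeFrame([4, 24, 31, 12])
for _ in range(100):                 # 100 quarter-hours = 1 day + 1 hour
    frame.add(1)
snapshot = [list(slots) for slots in frame.levels]
```

After 100 quarter-hour readings the frame holds two units (one hour and one day) rather than 100 raw values, while the total of the stored measures is unchanged.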
o-layer (observation): (*, theme, quarter)
m-layer (minimal interest): (user-group, URL-group, minute)
(primitive) stream data layer: (individual-user, URL, second)
Materialization problem
Only materialize cuboids of the critical layers?
Popular path approach vs. exception cell approach
Computation problem
How to compute and store stream cubes
How to discover unusual cells and patterns
Materialization takes precious resources and time
Only incremental materialization (with sliding window)
Only materialize “cuboids” of the critical layers?
Some intermediate cells should also be materialized
Popular path approach vs. exception cell approach
Materialize intermediate cells along the popular paths
Exception cells: how to set up exception thresholds?
Notice exceptions do not have monotonic behavior
How to compute and store stream cubes efficiently?
How to discover unusual cells between the critical layers?
(A1, *, C1) (A1, *, C2) (A1, *, C2) (A1, *, C2) (A1, B1, C2) (A1, B2, C1) (A2, *, C2) (A2, B1, C1) (A1, B2, C2) (A2, B1, C2) (A2, B2, C1) (A2, B2, C2)
Cube structure from m-layer to o-layer
Three approaches
All cuboids approach
Materializing all cells (too much in both space and time)
Exceptional cells approach
Materializing only exceptional cells (saves space but not computation time, and the definition of exception is not flexible)
Popular path approach
Computing and materializing cells only along a popular path
Using H-tree structure to store computed cells (which form the stream cube, a selectively materialized cube)
Figure: H-tree. Node labels: root; entertainment, sports, politics; uic, uiuc; jeff, mary, Jim. Leaf nodes link to quantitative information (Q.I.): regression, sum, count. Annotations: observation layer, minimal interest layer.
H-tree and H-cubing
Developed for computing data cubes and iceberg cubes
Compressed database
Fast cubing
Space preserving in cube computation
Using H-tree for stream cubing
Space preserving
Intermediate aggregates can be computed incrementally and saved in tree nodes
Facilitates computing other cells and multi-dimensional analysis
H-tree with computed cells can be viewed as a stream cube
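The idea of saving incrementally computed aggregates in tree nodes can be sketched with a small prefix tree. This is an illustration only, not the H-tree as the authors define it: the class and function names are hypothetical, the attribute order (theme, user group, user) is an assumed popular path, and a running sum and count stand in for the regression information a node would hold.

```python
class HTreeNode:
    """Tree node holding the aggregate of every cell inserted beneath it."""

    def __init__(self):
        self.children = {}
        self.total = 0   # stand-in for the node's quantitative information
        self.count = 0

def insert(root, path, value):
    """Add one low-level cell, updating aggregates along its whole path."""
    node = root
    node.total += value
    node.count += 1
    for label in path:
        node = node.children.setdefault(label, HTreeNode())
        node.total += value
        node.count += 1

def lookup(root, prefix):
    """Return the node for a cell on the path, e.g. ('sports', 'uiuc')."""
    node = root
    for label in prefix:
        node = node.children[label]
    return node

# Hypothetical cells: (theme, user group, user) with a numeric measure.
root = HTreeNode()
insert(root, ("sports", "uiuc", "jeff"), 3)
insert(root, ("sports", "uiuc", "mary"), 5)
insert(root, ("sports", "uic", "jeff"), 2)
insert(root, ("politics", "uiuc", "jim"), 7)

sports = lookup(root, ("sports",))
sports_uiuc = lookup(root, ("sports", "uiuc"))
```

Every prefix of the path is answered from a stored node, while a cell off the path (for example, all `uiuc` users across themes) would have to be computed on request from the stored nodes.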
Popular path
Computing layers along the popular path
Other planes/cells will be computed when requested
Using H-cube structure to store computed cells
Tradeoff for time/space between cube
Exception cells approach
How to set up appropriate thresholds for all the
Figure: runtime (in seconds) and memory usage (in MB) vs. size (in K tuples) for popular-path, exception-cells, and all-cubing
Figure: runtime (in seconds) and memory usage (in MB) vs. number of levels for popular-path, exception-cells, and all-cubing
Figure: runtime (in seconds) and memory usage (in MB) vs. exception (in %) for popular-path and m/o-cubing
Figure: runtime (in seconds) and memory usage (in MB) vs. size (in K tuples) for popular-path and m/o-cubing
Figure: runtime (in seconds) and memory usage (in MB) vs. number of levels for popular-path and m/o-cubing
Important but missing link: multi-level and multi-
A multi-dimensional stream data analysis framework
Tilt time frame (weighted vs. uniform weights on time)
Critical layers
Popular path approach (partial materialization of stream cubes)
Mining stream data at a high level, at multiple levels, or in multiple dimensions
Discovery of changes and evolutions in data streams
Stream data analysis
Besides query and mining, stream cube and OLAP are powerful tools for finding general and unusual patterns
A multi-dimensional stream cube framework
Tilt time frame
Critical layers
Popular path approach
An important issue for further study
Mining stream data at a high level, at multiple levels, or in multiple dimensions