Multi-Dimensional Regression Analysis of Time-Series Data Streams

Yixin Chen, Guozhu Dong, Jiawei Han, Benjamin W. Wah, Jianyong Wang
University of Illinois at Urbana-Champaign
Wright State University
December 3, 2002

Outline
• Characteristics of stream data
• Why on-line analytical processing and mining of stream data?
• Linearly compressed representation of stream data
• A stream cube architecture
• Stream cube computation
• Discussion
• Conclusions

Characteristics of Stream Data
• Huge volumes of data, possibly infinite
• Fast changing; requires fast response
• The data stream model suits many of today's data processing needs
• Single linear scan algorithms: each element can be looked at only once
  • Random access is expensive
• Store only a summary of the data seen thus far
• Most stream data are at a fairly low level or multi-dimensional in nature and need multi-level (ML) / multi-dimensional (MD) processing
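
The single-scan constraint above can be illustrated with a constant-size running summary. This is only a minimal sketch: the class name `RunningSummary` and the choice of count, sum, minimum, and maximum as the summary are assumptions made for illustration, not something the slides prescribe.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class RunningSummary:
    """Constant-size summary of a stream (hypothetical choice of statistics)."""
    count: int = 0
    total: float = 0.0
    minimum: float = float("inf")
    maximum: float = float("-inf")

    def update(self, value: float) -> None:
        # Each element is seen exactly once and is not stored afterwards.
        self.count += 1
        self.total += value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    @property
    def mean(self) -> float:
        return self.total / self.count if self.count else 0.0


def summarize(stream: Iterable[float]) -> RunningSummary:
    summary = RunningSummary()
    for value in stream:  # single linear scan, no random access
        summary.update(value)
    return summary


demo_summary = summarize(iter([3.0, 1.0, 2.0]))
```

Memory use stays fixed no matter how many elements arrive, which is the property the later slides require of a compressed cell representation.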

Stream Data Applications
• Telecommunication calling records
• Business: credit card transaction flows
• Network monitoring and traffic engineering
• Financial market: stock exchange
• Engineering & industrial processes: power supply & manufacturing
• Sensor, monitoring & surveillance: video streams
• Security monitoring
• Web logs and Web page click streams
• Massive data sets (even when saved, random access is too expensive)

Projects on DSMS (Data Stream Management System)
• STREAM (Stanford): a general-purpose DSMS
• Cougar (Cornell): sensors
• Aurora (Brown/MIT): sensor monitoring, dataflow
• Hancock (AT&T): telecom streams
• Niagara (OGI/Wisconsin): Internet XML databases
• OpenCQ (Georgia Tech): triggers, incremental view maintenance
• Tapestry (Xerox): pub/sub content-based filtering
• Telegraph (Berkeley): adaptive engine for sensors
• Tradebot (www.tradebot.com): stock tickers & streams
• Tribeca (Bellcore): network monitoring

Previous Work: Towards OLAP and Mining Data Streams
• Stream data model
  • Data Stream Management System (DSMS)
• Stream query model
  • Continuous queries
  • Sliding windows
• Stream data mining
  • Clustering & summarization (Guha, Motwani, et al.)
  • Correlation of data streams (Gehrke, et al.)
  • Classification of stream data (Domingos, et al.)
  • Mining frequent sets in streams (Motwani, et al., VLDB’02)

Why Stream Cube and Stream OLAP?
• Most stream data are at a fairly low level or multi-dimensional in nature: they need ML/MD processing
• Analysis requirements
  • Multi-dimensional trends and unusual patterns
  • Capturing important changes at multiple dimensions/levels
  • Fast, real-time detection and response
  • Comparison with the data cube: similarities and differences
• Stream (data) cube or stream OLAP
  • Is it feasible? How can it be implemented efficiently?

Multi-Dimensional Stream Analysis: Examples
• Analysis of Web click streams
  • Raw data at low levels: seconds, web page addresses, user IP addresses, …
  • Analysts want: changes, trends, and unusual patterns at reasonable levels of detail
  • E.g., “Average clicking traffic in North America on sports in the last 15 minutes is 40% higher than that in the last 24 hours.”
• Analysis of power consumption streams
  • Raw data: power consumption flow for every household, every minute
  • Patterns one may find: average hourly power consumption of manufacturing companies in Chicago over the last 2 hours today is 30% higher than on the same day a week ago

Motivations for Stream Data Compression
• Challenges of OLAPing stream data
  • Raw data cannot be stored
  • Simple aggregates are not powerful enough
  • Historical shapes and patterns at different levels are desirable: multi-dimensional regression analysis
• Proposal
  • A scalable multi-dimensional stream data warehouse that can aggregate regression models of stream data efficiently without accessing the raw data
• Stream data compression
  • Compress the stream data to support memory- and time-efficient multi-dimensional regression analysis

Basics of General Linear Regression
n tuples in one cell: $(x_i, y_i)$, $i = 1..n$, where $y_i$ is the measure attribute to be analyzed
For sample $i$, a vector of $k$ user-defined predictors $u_i$:
\[
u_i = \begin{pmatrix} u_0 \\ u_1(x_i) \\ \vdots \\ u_{k-1}(x_i) \end{pmatrix}
    = \begin{pmatrix} 1 \\ u_{i1} \\ \vdots \\ u_{i,k-1} \end{pmatrix}
\]
The linear regression model:
\[
E(y_i \mid u_i) = \eta^{\top} u_i = \eta_0 + \eta_1 u_{i1} + \cdots + \eta_{k-1} u_{i,k-1}
\]
where $\eta$ is a $k \times 1$ vector of regression parameters
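
The model is linear in the parameters η, while the predictors themselves may be arbitrary user-defined functions of x. The sketch below illustrates this under assumptions: the helper names `predictor_vector` and `expected_y`, the quadratic predictors, and the parameter values are all made up for illustration.

```python
from typing import Callable, List, Sequence

Predictor = Callable[[float], float]


def predictor_vector(x: float, predictors: Sequence[Predictor]) -> List[float]:
    """Return u_i = (1, u_1(x_i), ..., u_{k-1}(x_i)) for one sample."""
    return [1.0] + [u(x) for u in predictors]


def expected_y(eta: Sequence[float], u: Sequence[float]) -> float:
    """Evaluate E(y_i | u_i) = eta^T u_i."""
    if len(eta) != len(u):
        raise ValueError("eta and u must both have length k")
    return sum(e * v for e, v in zip(eta, u))


# Hypothetical quadratic model, k = 3: u_1(x) = x, u_2(x) = x^2.
quadratic = [lambda x: x, lambda x: x * x]
u = predictor_vector(2.0, quadratic)
y_hat = expected_y([0.5, 1.0, 0.25], u)
```

Here `u` is the vector (1, 2, 4) and the prediction is 0.5 + 1.0·2 + 0.25·4.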

Theory of General Linear Regression
Collect the $u_i$ into the $n \times k$ model matrix $U$:
\[
U = \begin{pmatrix}
1 & u_{11} & u_{12} & \cdots & u_{1,k-1} \\
1 & u_{21} & u_{22} & \cdots & u_{2,k-1} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & u_{n1} & u_{n2} & \cdots & u_{n,k-1}
\end{pmatrix}
\]
The ordinary least squares (OLS) estimate $\hat{\eta}$ of $\eta$ is the argument that minimizes the residual sum of squares function
\[
RSS(\eta) = (y - U\eta)^{\top} (y - U\eta)
\]
Main theorem to determine the OLS regression parameters:
\[
\frac{\partial}{\partial \eta} RSS(\eta) = 0 \;\Rightarrow\; \hat{\eta} = (U^{\top} U)^{-1} U^{\top} y
\]
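
The closed form η̂ = (UᵀU)⁻¹Uᵀy can be checked numerically by solving the normal equations directly. The following is a minimal sketch using exact rational arithmetic; the function names and the four data points are assumptions chosen for illustration, and Gauss-Jordan elimination stands in for whatever solver an actual system would use.

```python
from fractions import Fraction
from typing import List, Sequence, Tuple

Matrix = List[List[Fraction]]


def normal_equations(U: Sequence[Sequence[int]],
                     y: Sequence[int]) -> Tuple[Matrix, List[Fraction]]:
    """Form U^T U (k x k) and U^T y (length k)."""
    k = len(U[0])
    utu = [[Fraction(sum(row[i] * row[j] for row in U)) for j in range(k)]
           for i in range(k)]
    uty = [Fraction(sum(row[i] * yi for row, yi in zip(U, y)))
           for i in range(k)]
    return utu, uty


def solve(a: Matrix, b: List[Fraction]) -> List[Fraction]:
    """Solve a x = b by Gauss-Jordan elimination (a must be nonsingular)."""
    n = len(a)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Pick a row with a nonzero entry in this column as the pivot.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n] for row in aug]


def ols(U: Sequence[Sequence[int]], y: Sequence[int]) -> List[Fraction]:
    """OLS estimate: the solution of (U^T U) eta = U^T y."""
    utu, uty = normal_equations(U, y)
    return solve(utu, uty)


# Made-up data lying exactly on y = 1 + 2t for t = 0, 1, 2, 3.
U = [[1, t] for t in range(4)]
y = [1, 3, 5, 7]
eta_hat = ols(U, y)
```

Because the sample points lie exactly on a line, the recovered parameters are the intercept 1 and slope 2.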

Linearly Compressed Representation (LCR)
Stream data compression for multi-dimensional regression analysis
Define, for $i, j = 0, \ldots, k-1$:
\[
\theta_{ij} = \sum_{h=1}^{n} u_{hi} u_{hj}
\]
The linearly compressed representation (LCR) of one cell:
\[
\{\hat{\eta}_i \mid i = 0, \ldots, k-1\} \;\cup\; \{\theta_{ij} \mid i, j = 0, \ldots, k-1,\; i \le j\}
\]
Size of LCR of one cell:
\[
k + \frac{k(k+1)}{2} = \frac{k^2 + 3k}{2},
\]
quadratic in $k$, independent of the number of tuples $n$ in one cell
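
As a rough illustration of the definition above, the sketch below builds the LCR of one cell for the simplest case, k = 2 (intercept plus slope on time), and counts the stored values. The function names and the sample series are assumptions; the 2×2 closed form is used only to keep the example short.

```python
from fractions import Fraction
from typing import Dict, List, Sequence, Tuple

Theta = Dict[Tuple[int, int], int]


def theta_upper(U: Sequence[Sequence[int]]) -> Theta:
    """theta_ij = sum over tuples h of u_hi * u_hj, kept only for i <= j."""
    k = len(U[0])
    return {(i, j): sum(row[i] * row[j] for row in U)
            for i in range(k) for j in range(i, k)}


def lcr_size(k: int) -> int:
    """k regression parameters plus k(k+1)/2 upper-triangle entries."""
    return k + k * (k + 1) // 2


def build_lcr(ts: Sequence[int],
              zs: Sequence[int]) -> Tuple[List[Fraction], Theta]:
    """LCR of one cell for the k = 2 model z ~ eta_0 + eta_1 * t."""
    U = [[1, t] for t in ts]
    theta = theta_upper(U)
    n, st, stt = theta[(0, 0)], theta[(0, 1)], theta[(1, 1)]
    sz = sum(zs)
    stz = sum(t * z for t, z in zip(ts, zs))
    det = Fraction(n * stt - st * st)  # nonzero if ts has >= 2 distinct values
    eta_hat = [(stt * sz - st * stz) / det, (n * stz - st * sz) / det]
    return eta_hat, theta


eta_hat, theta = build_lcr([0, 1, 2, 3], [1, 3, 5, 7])
stored_values = len(eta_hat) + len(theta)
```

For k = 2 the cell keeps five numbers, and that count would be the same for four tuples or four million.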

Matrix Form of LCR
LCR consists of $\hat{\eta}$ and $T$, where
\[
\hat{\eta} = (\hat{\eta}_0, \hat{\eta}_1, \ldots, \hat{\eta}_{k-1})^{\top}
\]
and
\[
T = \begin{pmatrix}
\theta_{00} & \theta_{01} & \theta_{02} & \cdots & \theta_{0,k-1} \\
\theta_{10} & \theta_{11} & \theta_{12} & \cdots & \theta_{1,k-1} \\
\vdots & \vdots & \vdots & & \vdots \\
\theta_{k-1,0} & \theta_{k-1,1} & \theta_{k-1,2} & \cdots & \theta_{k-1,k-1}
\end{pmatrix}
\]
$\hat{\eta}$ provides the OLS regression parameters essential for regression analysis
$T$ is an auxiliary matrix that facilitates aggregation of LCRs in standard and regression dimensions in a data cube environment
$T^{\top} = T \;\Rightarrow\;$ LCR only stores the upper triangle of $T$

Aggregation in Standard Dimensions
Given the LCRs of m cells that differ in one standard dimension, what is the LCR of the cell aggregated in that dimension?
For the m base cells:
\[
LCR_1 = (\hat{\eta}_1, T_1),\; LCR_2 = (\hat{\eta}_2, T_2),\; \ldots,\; LCR_m = (\hat{\eta}_m, T_m)
\]
For the aggregated cell:
\[
LCR_a = (\hat{\eta}_a, T_a)
\]
The lossless aggregation formula:
\[
\hat{\eta}_a = \sum_{i=1}^{m} \hat{\eta}_i, \qquad T_a = T_1
\]
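
The formula can be checked on a small case. The sketch below assumes that the base cells are observed at the same time ticks and that the aggregated measure is the sum of the base measures (for example, two companies rolled up into one group); those assumptions, the function names, and the numbers are illustrative and not taken from the slides.

```python
from fractions import Fraction
from typing import Sequence, Tuple

Eta = Tuple[Fraction, Fraction]
TMatrix = Tuple[Tuple[int, int], Tuple[int, int]]
Cell = Tuple[Eta, TMatrix]


def fit_cell(ts: Sequence[int], zs: Sequence[int]) -> Cell:
    """Return (eta_hat, T) for the k = 2 model z ~ eta_0 + eta_1 * t."""
    n, st, stt = len(ts), sum(ts), sum(t * t for t in ts)
    sz, stz = sum(zs), sum(t * z for t, z in zip(ts, zs))
    det = Fraction(n * stt - st * st)
    eta_hat = ((stt * sz - st * stz) / det, (n * stz - st * sz) / det)
    return eta_hat, ((n, st), (st, stt))


def aggregate_standard(cells: Sequence[Cell]) -> Cell:
    """eta_a = sum of eta_i; T_a = T_1 (all cells must share the same T)."""
    first_T = cells[0][1]
    if any(T != first_T for _, T in cells):
        raise ValueError("cells must share the same predictors")
    eta_a = tuple(sum(parts) for parts in zip(*(eta for eta, _ in cells)))
    return eta_a, first_T


ts = [0, 1, 2, 3]
company_a = [10, 12, 11, 15]          # made-up series
company_b = [20, 19, 23, 22]          # made-up series
from_lcrs = aggregate_standard([fit_cell(ts, company_a),
                                fit_cell(ts, company_b)])
# Reference: refit on the raw summed series, which the LCR route avoids.
from_raw = fit_cell(ts, [a + b for a, b in zip(company_a, company_b)])
```

Both routes give the same parameters, which is what "lossless" means here: the aggregated model is obtained without touching the raw tuples.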

Stock Price Example: Aggregation in Standard Dimensions
Simple linear regression on time series data
[Figure: cells of two companies, and the cell after aggregation]

Aggregation in Regression Dimensions
Given the LCRs of m cells that differ in one regression dimension, what is the LCR of the cell aggregated in that dimension?
For the m base cells:
\[
LCR_1 = (\hat{\eta}_1, T_1),\; LCR_2 = (\hat{\eta}_2, T_2),\; \ldots,\; LCR_m = (\hat{\eta}_m, T_m)
\]
For the aggregated cell:
\[
LCR_a = (\hat{\eta}_a, T_a)
\]
The lossless aggregation formula:
\[
\hat{\eta}_a = \left( \sum_{i=1}^{m} T_i \right)^{-1} \sum_{i=1}^{m} T_i \hat{\eta}_i, \qquad
T_a = \sum_{i=1}^{m} T_i
\]
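
A small check of this formula, under the assumption that the regression dimension is time and the base cells cover adjacent, non-overlapping intervals. The reason it works is that each product of T_i and η̂_i recovers Uᵢᵀyᵢ for that cell, so summing them rebuilds the normal equations of the merged cell. Function names and data below are made up for illustration.

```python
from fractions import Fraction
from typing import Sequence, Tuple

Eta = Tuple[Fraction, Fraction]
TMatrix = Tuple[Tuple[int, int], Tuple[int, int]]
Cell = Tuple[Eta, TMatrix]


def fit_cell(ts: Sequence[int], zs: Sequence[int]) -> Cell:
    """Return (eta_hat, T) for the k = 2 model z ~ eta_0 + eta_1 * t."""
    n, st, stt = len(ts), sum(ts), sum(t * t for t in ts)
    sz, stz = sum(zs), sum(t * z for t, z in zip(ts, zs))
    det = Fraction(n * stt - st * st)
    eta_hat = ((stt * sz - st * stz) / det, (n * stz - st * sz) / det)
    return eta_hat, ((n, st), (st, stt))


def aggregate_regression(cells: Sequence[Cell]) -> Cell:
    """eta_a = (sum T_i)^-1 * sum(T_i eta_i); T_a = sum T_i (2 x 2 case)."""
    T_a = tuple(tuple(sum(T[r][c] for _, T in cells) for c in range(2))
                for r in range(2))
    # Each T_i eta_i equals U_i^T y_i, so no raw data is needed here.
    rhs = [sum(T[r][0] * eta[0] + T[r][1] * eta[1] for eta, T in cells)
           for r in range(2)]
    (a, b), (c, d) = T_a
    det = Fraction(a * d - b * c)
    eta_a = ((d * rhs[0] - b * rhs[1]) / det, (a * rhs[1] - c * rhs[0]) / det)
    return eta_a, T_a


ts1, zs1 = [0, 1, 2, 3], [10, 12, 11, 15]   # first interval (made-up)
ts2, zs2 = [4, 5, 6, 7], [14, 18, 17, 21]   # adjacent interval (made-up)
from_lcrs = aggregate_regression([fit_cell(ts1, zs1), fit_cell(ts2, zs2)])
# Reference: refit on the concatenated raw data.
from_raw = fit_cell(ts1 + ts2, zs1 + zs2)
```

With exact rational arithmetic the two results are identical, not merely close.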

Stock Price Example: Aggregation in Time Dimension
[Figure: cells of two adjacent time intervals, and the cell after aggregation]

Feasibility of Stream Regression Analysis
• Efficient and scalable storage (independent of the number of tuples in data cells)
• Lossless aggregation without accessing the raw data
• Fast aggregation: computationally efficient
• Regression models of data cells at all levels
• General results: they cover a large and the most popular class of regression models, including quadratic, polynomial, and nonlinear models

A Stream Cube Architecture
• A tilt time frame
  • Different time granularities: second, minute, quarter, hour, day, week, …
• Critical layers
  • Minimum interest layer (m-layer)
  • Observation layer (o-layer)
  • User: watches at the o-layer and occasionally needs to drill down to the m-layer
• Partial materialization of stream cubes
  • Full materialization: too space- and time-consuming
  • No materialization: slow response at query time
  • Partial materialization: what do we mean by “partial”?

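A Tilt Time-Frame Model
[Figure: two tilt time-frame timelines, each ending at "Now". The first covers up to 7 days, with segments labeled 7 days, 24 hrs, 4 qtrs, 15 minutes, and 4* 25sec. The second covers up to a year, with segments labeled 12 months, 31 days, 24 hours, and 4 qtrs.]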
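
One way to picture a tilt time frame is as a stack of bounded buffers, finest granularity first, where a completed group of fine slots is merged into one slot of the next coarser level. The sketch below is an assumption-laden illustration: the class name `TiltTimeFrame`, the promotion policy, and the use of `sum` as the merge function are all hypothetical. In a stream cube, each slot would presumably hold a cell's LCR and the merge would be the time-dimension aggregation.

```python
from collections import deque
from typing import Any, Callable, Deque, List, Tuple


class TiltTimeFrame:
    """Bounded buffers per granularity, finest level first (sketch)."""

    def __init__(self, levels: List[Tuple[str, int]],
                 merge: Callable[[list], Any]) -> None:
        # levels: (name, number of slots kept, which also form one coarser unit)
        self.names = [name for name, _ in levels]
        self.capacity = [cap for _, cap in levels]
        self.slots: List[Deque[Any]] = [deque(maxlen=cap)
                                        for cap in self.capacity]
        self.filled = [0] * len(levels)
        self.merge = merge

    def add(self, summary: Any, level: int = 0) -> None:
        self.slots[level].append(summary)  # oldest slot drops out when full
        self.filled[level] += 1
        if self.filled[level] == self.capacity[level]:
            # A full coarser unit has elapsed: merge and promote it.
            self.filled[level] = 0
            if level + 1 < len(self.slots):
                self.add(self.merge(list(self.slots[level])), level + 1)


# Levels taken from the "up to a year" frame: 4 qtrs, 24 hours, 31 days, 12 months.
year_levels = [("quarter", 4), ("hour", 24), ("day", 31), ("month", 12)]
year_frame_slots = sum(cap for _, cap in year_levels)

# Tiny demo with two levels and a count-like summary merged by summation.
frame = TiltTimeFrame([("quarter", 4), ("hour", 24)], merge=sum)
for _ in range(9):
    frame.add(1)
```

Recent history is kept at fine granularity and older history at coarse granularity, so a year is covered by 4 + 24 + 31 + 12 slots instead of one slot per quarter hour.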
