Mining Approximate Top-K Subspace Anomalies in Multi-Dimensional Time-Series Data
Xiaolei Li, Jiawei Han
University of Illinois at Urbana-Champaign
VLDB 2007

Time Series Data
• Many applications produce time series data
[Figure: Intel stock]

Apple, Intel, NASDAQ Computers Stock Values
• Compare time series to gather differences
‣ Apple stock has a very different “trend”
‣ Intel stock had different magnitude

Problem Statement
• Find anomalies in a data cube of multi-dimensional time series data

Table of Contents
1. Time Series Examples
2. Problem Statement
3. Related Work
4. Observed/Expected Time Series and Anomaly Measure
5. Subspace Iterative Search
   i. Generating candidate subspaces
   ii. Discovering top-k anomaly cells
6. Experiments
7. Conclusion

Multi-Dimensional Attributes
• Time series are not flat data; they carry multi-dimensional attributes
• Stock example
‣ Apple and Intel are part of the NASDAQ Computers Index
‣ Apple is hardware/software; Intel is hardware
‣ Both are related to the NASDAQ-100 Technology Stock Index
• Sales example
‣ Multi-dimensional information is collected for every sale (e.g., buyer age, product type, store location, purchase time)
‣ Compare sales by any combination of categories or sub-categories: “sales of sporting apparel to males with 3+ children have been declining compared to overall male sporting apparel sales”
‣ In this comparison, the first group is a subset of the second

Problem Statement
• Find anomalies in the data cube of multi-dimensional time series data
• Input data: relation R with a set of time series S associated with each tuple
‣ Attributes of R form a data cube C_R
‣ Each s_i is a time series
‣ Each u_i is a scalar indicating the count of the tuple

Gender  Education   Income   Product  Profit  Count
Female  Highschool  35k-45k  Food     s_1     u_1
Female  Highschool  45k-60k  Apparel  s_2     u_2
Female  College     35k-45k  Apparel  s_3     u_3
Female  College     35k-45k  Book     s_4     u_4
Female  College     45k-60k  Apparel  s_5     u_5
Female  Graduate    45k-60k  Apparel  s_6     u_6
Male    Highschool  35k-45k  Apparel  s_7     u_7
Male    College     35k-45k  Food     s_8     u_8
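The input relation can be pictured in code as follows. This is only a minimal sketch: the row values are made up, the `"*"` wildcard (meaning "all") anticipates the cell notation of the next slide, and the names `R`, `matches`, and `aggregate` are hypothetical. It also assumes that a cell's measure is the element-wise sum of the profit series it covers, which the slides do not state explicitly.

```python
from typing import List, Optional, Tuple

# One tuple of R: (attribute values, time series s_i, count u_i).
Row = Tuple[Tuple[str, ...], List[float], int]

STAR = "*"  # wildcard meaning "all values of this attribute"

# Toy rows over (Gender, Education, Income, Product); series values are invented.
R: List[Row] = [
    (("Female", "Highschool", "35k-45k", "Food"), [1.0, 2.0, 3.0], 10),
    (("Female", "College", "35k-45k", "Apparel"), [2.0, 2.0, 2.0], 5),
    (("Female", "College", "35k-45k", "Book"), [0.5, 1.0, 1.5], 4),
    (("Male", "College", "35k-45k", "Food"), [3.0, 1.0, 0.0], 7),
]


def matches(cell: Tuple[str, ...], attrs: Tuple[str, ...]) -> bool:
    """True if every non-wildcard value in `cell` equals the row's value."""
    return all(c == STAR or c == a for c, a in zip(cell, attrs))


def aggregate(cell: Tuple[str, ...], rows: List[Row]) -> Optional[Tuple[List[float], int]]:
    """Sum the series and counts of all rows covered by `cell` (assumed aggregate).

    Returns None when no row matches the cell.
    """
    series: Optional[List[float]] = None
    count = 0
    for attrs, s, u in rows:
        if not matches(cell, attrs):
            continue
        series = list(s) if series is None else [x + y for x, y in zip(series, s)]
        count += u
    return None if series is None else (series, count)
```

For example, `aggregate(("Female", "*", "35k-45k", "*"), R)` combines the three Female rows in the 35k-45k income bracket into one series and one count.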

Data Cube Preliminaries
• Given a relation R, a data cube (denoted as C_R) is the set of aggregates from all possible group-by’s on R
• In an n-dimensional data cube, each cell has the form c = (a_1, a_2, ..., a_n : m), where each a_i is the value of the i-th attribute and m is the cube measure (e.g., profit)
• A cell is k-dimensional if there are exactly k (≤ n) values amongst the a_i which are not ∗ (i.e., all)
‣ 2-dimensional cell: (Female, ∗, ∗, Book: x)
‣ 3-dimensional cell: (∗, College, 35k-45k, Apparel: y)
‣ Base cell: none of the a_i is ∗
• Parent, descendant, sibling relationships
[Figure: lattice of group-by’s over dimensions A, B, C (ABC; AB, AC, BC; A, B, C; All), with one cuboid labeled “child” and one labeled “parent”]
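The cell definitions above can be checked with a small sketch. It represents a cell as a tuple of attribute values with `"*"` standing for ∗ and leaves the measure out. The function names are hypothetical, and the code assumes the usual data cube convention that a parent is one step more general than its child; the slide lists the relationships without defining them.

```python
STAR = "*"  # stands for the "all" value written as ∗ on the slide


def dimensionality(cell):
    """Number of attribute values in the cell that are not the wildcard."""
    return sum(1 for a in cell if a != STAR)


def is_base(cell):
    """A base cell has no wildcard at all."""
    return STAR not in cell


def is_descendant(child, ancestor):
    """True if `child` keeps every fixed value of `ancestor` and fixes at least one more."""
    if len(child) != len(ancestor):
        return False
    agrees = all(a == STAR or a == c for c, a in zip(child, ancestor))
    return agrees and dimensionality(child) > dimensionality(ancestor)


def is_parent(parent, child):
    """Assumed convention: a parent generalizes its child by exactly one attribute."""
    return is_descendant(child, parent) and dimensionality(child) == dimensionality(parent) + 1
```

With these helpers, `dimensionality(("Female", "*", "*", "Book"))` counts two fixed values, matching the 2-dimensional example on the slide.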

Query Model
• Given R, a probe cell p ∈ C_R, and an anomaly function g, find the anomaly cells among the descendants of p in C_R as measured by g
‣ Each abnormal cell must satisfy a minimum support (count) threshold
‣ Anomaly does not have to hold for the entire time series
‣ Only the top k anomalies as ranked by g are needed
[Figure: data cube C_R showing the probe cell p and the base cells]
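Read literally, the query model amounts to filtering by support, scoring with g, and keeping the k highest scores. The sketch below does exactly that by brute force over an explicit candidate list. It is not the approximate subspace search the talk goes on to present, and the scoring function `make_max_gap` is a hypothetical stand-in for g, not the paper's anomaly measure.

```python
import heapq


def top_k_anomalies(candidates, g, k, min_support):
    """Rank descendant cells of a probe cell by anomaly score.

    candidates: iterable of (cell, series, count) tuples.
    g: callable taking a series and returning a score (higher = more anomalous).
    Cells whose count is below min_support are discarded before ranking.
    """
    scored = [
        (g(series), cell)
        for cell, series, count in candidates
        if count >= min_support
    ]
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


def make_max_gap(probe_series):
    """Hypothetical g: largest pointwise gap to the probe series.

    Taking the maximum lets a deviation in part of the series count,
    in the spirit of "anomaly does not have to hold for the entire time series".
    """
    def g(series):
        return max(abs(x - y) for x, y in zip(series, probe_series))
    return g
```

A call such as `top_k_anomalies(cells, make_max_gap(probe), k=2, min_support=5)` returns up to two `(score, cell)` pairs in descending score order.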

Related Work
• Exploratory Data Analysis
‣ [Sarawagi SIGMOD’00] explores OLAP anomalies but necessitates full cube materialization
‣ [Palpanas SSDBM’01] approximately finds interesting cells in a data cube but still requires exponential calculations
‣ [Imielinski DMKD’02] requires an anti-monotonic measure and does not focus on time series
• Time Series Data Cube [Chen VLDB’02]
‣ Only suitable for low-dimensional data
‣ Requires user guidance
• General outlier detection, subspace clustering, and time series similarity search do not address OLAP-style data

Measuring Anomaly: Intuition
1. For every cell, compute the expected time series (with respect to the probe cell)
