Mining Approximate Top-K Subspace Anomalies in Multi-Dimensional Time-Series Data


  1. Mining Approximate Top-K Subspace Anomalies in Multi-Dimensional Time-Series Data
     Xiaolei Li, Jiawei Han
     University of Illinois at Urbana-Champaign
     VLDB 2007

  2. Time Series Data
     • Many applications produce time series data
     [Figure: Intel stock]

  5. Apple, Intel, NASDAQ Computers Stock Values

  7. Apple, Intel, NASDAQ Computers Stock Values
     Compare time series to gather differences:
     ‣ Apple stock has a very different “trend”
     ‣ Intel stock had different magnitude
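
The two kinds of difference named on this slide, magnitude and trend, can be told apart with very simple statistics. The sketch below is only an illustration and not the anomaly measure of the talk: the function names (`magnitude_ratio`, `slope`, `trend_gap`) and all numbers are hypothetical, and the series are made up rather than real stock prices.

```python
from statistics import mean


def magnitude_ratio(series, reference):
    """Ratio of average levels; a value far from 1.0 means a different magnitude."""
    return mean(series) / mean(reference)


def slope(series):
    """Least-squares slope of the series against time steps 0..n-1."""
    n = len(series)
    t_mean = (n - 1) / 2
    y_mean = mean(series)
    num = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series))
    den = sum((t - t_mean) ** 2 for t in range(n))
    return num / den


def trend_gap(series, reference):
    """Difference in slope after rescaling each series by its own mean,
    so that a pure difference in level does not count as a trend difference."""
    def normalize(s):
        m = mean(s)
        return [y / m for y in s]
    return slope(normalize(series)) - slope(normalize(reference))


# Made-up numbers for illustration only (not real stock prices).
index = [100, 101, 102, 103, 104]
scaled = [50, 50.5, 51, 51.5, 52]     # same shape as the index, half the level
rising = [100, 110, 120, 130, 140]    # similar level, much steeper trend

print(magnitude_ratio(scaled, index), trend_gap(scaled, index))
print(magnitude_ratio(rising, index), trend_gap(rising, index))
```

Rescaling by the mean before comparing slopes is a design choice of this sketch: it keeps the "different magnitude" case from also registering as a "different trend".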

  10. Problem Statement
      Find anomalies in a data cube of multi-dimensional time series data

  11. Table of Contents
      1. Time Series Examples
      2. Problem Statement
      3. Related Work
      4. Observed/Expected Time Series and Anomaly Measure
      5. Subspace Iterative Search
         i. Generating candidate subspaces
         ii. Discovering top-k anomaly cells
      6. Experiments
      7. Conclusion

  12. Multi-Dimensional Attributes
      • Time series are not flat data; they contain multi-dimensional attributes
      • Stock example
        ‣ Apple and Intel are a part of the NASDAQ Computers Index
        ‣ Apple is hardware/software; Intel is hardware
        ‣ Related to NASDAQ-100 Technology Stock Index
      • Sales example
        ‣ Multi-dimensional information collected for every sale (e.g., buyer age, product type, store location, purchase time)
        ‣ Compare sales by any combination of categories or sub-categories: “sales of sporting apparel to males with 3+ children have been declining compared to overall male sporting apparel sales”

  14. Problem Statement
      • Find anomalies in the data cube of multi-dimensional time series data
      • Input data: relation R with a set of time series S associated with each tuple
        ‣ Attributes of R form a data cube C_R
        ‣ Each s_i is a time series
        ‣ Each u_i is a scalar indicating the count of the tuple

      Gender  Education   Income   Product  Profit  Count
      Female  Highschool  35k-45k  Food     s_1     u_1
      Female  Highschool  45k-60k  Apparel  s_2     u_2
      Female  College     35k-45k  Apparel  s_3     u_3
      Female  College     35k-45k  Book     s_4     u_4
      Female  College     45k-60k  Apparel  s_5     u_5
      Female  Graduate    45k-60k  Apparel  s_6     u_6
      Male    Highschool  35k-45k  Apparel  s_7     u_7
      Male    College     35k-45k  Food     s_8     u_8
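
One way to hold the input relation in memory is sketched below. The attribute values come from the table on this slide; everything else is an assumption made for illustration: the names `Row`, `DIMENSIONS` and `check_relation`, the placeholder series and counts (the slide gives only the symbols s_i and u_i), and the requirement that all series have the same length.

```python
from dataclasses import dataclass
from typing import List, Tuple

DIMENSIONS = ("Gender", "Education", "Income", "Product")


@dataclass
class Row:
    attrs: Tuple[str, ...]   # one value per dimension in DIMENSIONS
    series: List[float]      # s_i: the Profit time series of the tuple
    count: int               # u_i: the count of the tuple


# Attribute combinations exactly as listed in the slide's table.
ATTRS = [
    ("Female", "Highschool", "35k-45k", "Food"),
    ("Female", "Highschool", "45k-60k", "Apparel"),
    ("Female", "College", "35k-45k", "Apparel"),
    ("Female", "College", "35k-45k", "Book"),
    ("Female", "College", "45k-60k", "Apparel"),
    ("Female", "Graduate", "45k-60k", "Apparel"),
    ("Male", "Highschool", "35k-45k", "Apparel"),
    ("Male", "College", "35k-45k", "Food"),
]

# Placeholder series and counts: the slide does not give actual values.
R = [Row(attrs, series=[0.0, 0.0, 0.0], count=1) for attrs in ATTRS]


def check_relation(rows):
    """Check that every row has one value per dimension and that all
    series share one length (the equal-length rule is an assumption)."""
    lengths = {len(row.series) for row in rows}
    return len(lengths) == 1 and all(len(row.attrs) == len(DIMENSIONS) for row in rows)


print(len(R), check_relation(R))
```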

  18. Data Cube Preliminaries
      • Given a relation R, a data cube (denoted as C_R) is the set of aggregates from all possible group-by’s on R
      • In an n-dimensional data cube, each cell has the form c = (a_1, a_2, ..., a_n : m) where each a_i is the value of the i-th attribute and m is the cube measure (e.g., profit)
      • A cell is k-dimensional if there are exactly k (≤ n) values amongst a_i which are not ∗ (i.e., all)
        ‣ 2-dimensional cell: (Female, ∗, ∗, Book: x)
        ‣ 3-dimensional cell: (∗, College, 35k-45k, Apparel: y)
        ‣ Base cell: none of a_i is ∗
      • Parent, descendant, sibling relationships
      [Figure: lattice of group-by’s: ABC; AB, AC, BC; A, B, C; All]
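
The definitions on this slide translate into a few lines of code. In the sketch below a cell is a tuple of attribute values with `"*"` standing for ∗, and the measure m is left out. The helper names are hypothetical, and `is_descendant` follows the usual data cube convention (a descendant keeps every fixed value of its ancestor and fixes at least one more), which the slide mentions but does not spell out.

```python
from itertools import combinations

ALL = "*"


def dimensionality(cell):
    """Number of attribute values in the cell that are not '*'."""
    return sum(1 for a in cell if a != ALL)


def is_base_cell(cell):
    """A base cell has no '*' at all."""
    return ALL not in cell


def is_descendant(cell, ancestor):
    """True if cell agrees with every non-'*' value of ancestor and is
    strictly more specific (assumed convention, see the lead-in)."""
    agrees = all(a == ALL or a == c for c, a in zip(cell, ancestor))
    return agrees and cell != ancestor


def group_bys(n):
    """Yield every subset of the n dimension indices: the 2^n group-by's."""
    for k in range(n + 1):
        yield from combinations(range(n), k)


def project(base_cell, dims):
    """Generalize a base cell to the group-by that keeps only the dimensions in dims."""
    return tuple(v if i in dims else ALL for i, v in enumerate(base_cell))


print(dimensionality(("Female", ALL, ALL, "Book")))              # 2-dimensional cell
print(dimensionality((ALL, "College", "35k-45k", "Apparel")))    # 3-dimensional cell
print(list(group_bys(3)))                                        # the ABC lattice
```

With three dimensions A, B, C, `group_bys(3)` enumerates the eight nodes of the lattice in the figure, from All (the empty tuple) up to ABC.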

  22. Query Model
      • Given R, a probe cell p ∈ C_R, and an anomaly function g, find the anomaly cells among descendants of p in C_R as measured by g
        ‣ Each abnormal cell must satisfy a minimum support (count) threshold
        ‣ Anomaly does not have to hold for entire time series
        ‣ Only the top k anomalies as ranked by g are needed
      [Figure: data cube C_R showing the probe cell p and the base cells]
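
To make the query model concrete, here is a brute-force baseline that enumerates every non-empty descendant of the probe cell, filters by minimum support, and keeps the top k by score. It is emphatically not the subspace iterative search listed in the table of contents; exhaustive enumeration is the cost that method is meant to avoid. Several things are assumptions of this sketch: aggregation by summing series and counts, the placeholder score `toy_g` (largest gap between a cell's series and the probe's series scaled by the cell's share of the count), all function names, and the made-up rows.

```python
import heapq
from itertools import product

ALL = "*"


def matches(attrs, cell):
    """True if a base tuple is covered by the cell."""
    return all(c == ALL or c == a for a, c in zip(attrs, cell))


def aggregate(rows, cell):
    """Sum the series and counts of all base rows covered by cell (assumed aggregation)."""
    series, count = None, 0
    for attrs, s, u in rows:
        if matches(attrs, cell):
            series = list(s) if series is None else [x + y for x, y in zip(series, s)]
            count += u
    return series, count


def descendants(rows, probe):
    """All non-empty cells strictly more specific than the probe cell."""
    free = [i for i, c in enumerate(probe) if c == ALL]
    cells = set()
    for attrs, _, _ in rows:
        if not matches(attrs, probe):
            continue
        for mask in product([False, True], repeat=len(free)):
            if any(mask):
                cell = list(probe)
                for i, keep in zip(free, mask):
                    if keep:
                        cell[i] = attrs[i]
                cells.add(tuple(cell))
    return cells


def toy_g(cell_series, cell_count, probe_series, probe_count):
    """Placeholder anomaly score, NOT the measure from the talk."""
    share = cell_count / probe_count
    return max(abs(c - share * p) for c, p in zip(cell_series, probe_series))


def top_k_anomalies(rows, probe, k, min_support, g=toy_g):
    """Return the k highest-scoring (score, cell) pairs among supported descendants."""
    probe_series, probe_count = aggregate(rows, probe)
    scored = []
    for cell in descendants(rows, probe):
        series, count = aggregate(rows, cell)
        if count >= min_support:
            scored.append((g(series, count, probe_series, probe_count), cell))
    return heapq.nlargest(k, scored)


# Made-up rows: (attributes, series, count).
rows = [
    (("Female", "College", "Book"), [1, 1, 1], 10),
    (("Female", "College", "Food"), [1, 1, 1], 10),
    (("Female", "Graduate", "Book"), [1, 1, 9], 10),
    (("Female", "Highschool", "Food"), [1, 1, 3], 10),
    (("Male", "College", "Book"), [5, 5, 5], 10),
]
probe = ("Female", ALL, ALL)
print(top_k_anomalies(rows, probe, k=1, min_support=20))
```

Taking the maximum over time points in `toy_g` is one simple way to honor the slide's remark that an anomaly need not hold for the entire time series: a single deviating stretch is enough to raise the score.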

  24. Related Work
      • Exploratory Data Analysis
        ‣ [Sarawagi SIGMOD’00] explores OLAP anomaly but necessitates full cube materialization
        ‣ [Palpanas SSDBM’01] approximately finds interesting cells in data cube but still requires exponential calculations
        ‣ [Imielinski DMKD’02] requires anti-monotonic measure and does not focus on time series
      • Time Series Data Cube [Chen VLDB’02]
        ‣ Only suitable for low-dimensional data
        ‣ Requires user guidance
      • General outlier detection, subspace clustering, and time series similarity search do not address OLAP-style data

  26. Measuring Anomaly: Intuition
      1. For every cell, compute the expected time series (with respect to the probe cell)
