
Updates
• Week 15 on 11/30
  - Project 2 progress presentations and discussions
    - Each team: 10 min (including presentation and Q&A)
    - Focus on ideas and Q&A
  - Quiz 2 on recommender systems
• Quiz 1 will be graded over the weekend
• One review will be graded over the weekend

Welcome to DS504/CS586: Big Data Analytics
Anomaly/Outlier Detection
Prof. Yanhua Li
Time: 6:00pm – 8:50pm Thu
Location: KH116
Fall 2017

What is an outlier?
An outlier is an observation that deviates so much from other observations that it arouses suspicion that it was generated by a different mechanism.
Related terms: outliers, exceptions, peculiarities, surprises, etc.

Introduction (not an easy task)
• We are drowning in the deluge of data being collected worldwide, while starving for knowledge at the same time*
• Anomalous events occur relatively infrequently
• However, when they do occur, their consequences can be quite dramatic, and quite often negative
“Mining needle in a haystack. So much hay and so little time.”
* J. Naisbitt, Megatrends: Ten New Directions Transforming Our Lives. New York: Warner Books, 1982.

Importance of Anomaly Detection
Ozone Depletion History
• In 1985 three researchers (Farman, Gardiner and Shanklin) were puzzled by data gathered by the British Antarctic Survey showing that ozone levels for Antarctica had dropped 10% below normal levels
• Why did the Nimbus 7 satellite, which had instruments aboard for recording ozone levels, not record similarly low ozone concentrations?
• The ozone concentrations recorded by the satellite were so low that they were being treated as outliers by a computer program and discarded!
Sources:
http://exploringdata.cqu.edu.au/ozone.html
http://www.epa.gov/ozone/science/hole/size.html

Real World Anomalies
• Credit Card Fraud
  - An abnormally high purchase made on a credit card
• Cyber Intrusions
  - A web server involved in FTP traffic
Classical problem: Noise in measurements needs to be removed because of its detrimental effect on statistical inference, for example to make clustering more robust.
Data mining: A deviating or surprising measurement is interesting. In this case we want to detect outliers.

Anomaly/Outlier Detection
• Variants of Anomaly/Outlier Detection Problems
  - Given a database D, containing mostly normal (but unlabeled) data points, and a test point x, compute the anomaly score of x with respect to D
  - Given a database D, find all the data points x ∈ D with anomaly scores greater than some threshold t
  - Given a database D, find all the data points x ∈ D having the top-n largest anomaly scores f(x)
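The three problem variants above differ only in how the score is used. A minimal sketch, assuming a hypothetical score f(x) equal to the mean Euclidean distance from x to the points of D (the slides do not fix a particular score; `mean_distance_score`, `above_threshold` and `top_n` are illustrative names):

```python
import math
from typing import Callable, List, Sequence

Point = Sequence[float]
ScoreFn = Callable[[Point, List[Point]], float]

def mean_distance_score(x: Point, D: List[Point]) -> float:
    """Hypothetical anomaly score f(x): mean Euclidean distance from x to D."""
    others = [p for p in D if p is not x]  # skip x itself when x is a member of D
    return sum(math.dist(x, p) for p in others) / len(others)

def above_threshold(D: List[Point], t: float,
                    score: ScoreFn = mean_distance_score) -> List[Point]:
    """Variant 2: all points of D whose score exceeds the threshold t."""
    return [x for x in D if score(x, D) > t]

def top_n(D: List[Point], n: int,
          score: ScoreFn = mean_distance_score) -> List[Point]:
    """Variant 3: the n points of D with the largest scores."""
    return sorted(D, key=lambda x: score(x, D), reverse=True)[:n]

# Toy database: four points in a tight square and one far away.
D = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
test_point_score = mean_distance_score((0.5, 0.5), D)  # Variant 1: score a test point
flagged = above_threshold(D, t=5.0)
worst = top_n(D, n=1)
```

Any other score (for example the k-nearest-neighbour distance) can be passed in through the `score` parameter without changing the two selection functions.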

Anomaly Detection
• Challenges
  - How many outliers are there in the data?
  - The method is unsupervised
    - Validation can be quite challenging (just like for clustering)
  - Finding a needle in a haystack
• Working assumption:
  - There are considerably more “normal” observations than “abnormal” observations (outliers/anomalies) in the data

Anomaly Detection Schemes
• General Steps
  - Build a profile of the “normal” behavior
    - The profile can be patterns or summary statistics for the overall population
  - Use the “normal” profile to detect anomalies
    - Anomalies are observations whose characteristics differ significantly from the normal profile
• Types of anomaly detection schemes
  - Graphical & Statistical-based
  - Distance-based
  - Model-based (Machine Learning/classification)

Graphical Approaches
• Boxplot (1-D), Scatter plot (2-D), Spin plot (3-D)
• Limitations
  - Time consuming
  - Subjective

Statistical Approaches
• Assume a parametric model describing the distribution of the data (e.g., normal distribution)
• Apply a statistical test that depends on
  - Data distribution
  - Parameters of the distribution (e.g., mean, variance)
  - Number of expected outliers (confidence limit)
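One simple instance of the statistical approach is a z-score test under an assumed normal distribution. The sketch below is illustrative only: the cutoff `z_max` and the function name `zscore_outliers` are assumptions, and the mean and standard deviation are estimated from the same data that is being tested.

```python
import statistics
from typing import List

def zscore_outliers(values: List[float], z_max: float) -> List[int]:
    """Return indices of values lying more than z_max sample standard
    deviations from the sample mean (normality is an assumption)."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation
    return [i for i, v in enumerate(values) if abs(v - mean) / std > z_max]

# Ten readings near 10.0 and one reading of 25.0.
data = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.3, 9.7, 10.1, 9.9, 25.0]
suspects = zscore_outliers(data, z_max=2.5)
```

Because the outlier itself inflates the estimated mean and standard deviation, the largest attainable z-score in a small sample is bounded, so the cutoff has to be chosen with the sample size in mind.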

Distance-based Approaches
• Data is represented as a vector of features
• Three major approaches
  - Nearest-neighbor based
  - Density based
  - Clustering based

K-Nearest Neighbour Graph
Definition: Every vector in the data set forms one node, and every node has pointers to its k nearest neighbours.
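The definition above can be sketched with a brute-force construction. This assumes Euclidean distance and breaks distance ties by index; `knn_graph` is an illustrative name, and the quadratic scan is only suitable for small data sets.

```python
import math
from typing import Dict, List, Sequence

Point = Sequence[float]

def knn_graph(points: List[Point], k: int) -> Dict[int, List[int]]:
    """Map each node index to the indices of its k nearest neighbours."""
    graph: Dict[int, List[int]] = {}
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        # Sort by distance to p; ties are broken by the smaller index.
        others.sort(key=lambda j: (math.dist(p, points[j]), j))
        graph[i] = others[:k]
    return graph

points = [(0, 0), (1, 0), (5, 0)]
graph = knn_graph(points, k=1)
```

Note that the graph is directed: node 2 points to node 1, but node 1 points to node 0.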

KDIST: Density-based outlier detection [S. Ramaswamy et al., SIGMOD, Texas, 2000]
• Define the k-nearest-neighbour distance (KDIST) as the distance to the k-th nearest vector.
• Vectors are sorted by their KDIST distance. The last n vectors in the list are defined as outliers.
• The intuitive idea is that when KDIST is large, the vector is in a sparse area and is likely to be an outlier.
Problem: the user has to know in advance how many outliers there are in the data set.
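A minimal brute-force sketch of the KDIST ranking, assuming Euclidean distance; the function names are illustrative, and this is not the optimized algorithm from the cited paper.

```python
import math
from typing import List, Sequence

Point = Sequence[float]

def kdist(points: List[Point], i: int, k: int) -> float:
    """Distance from points[i] to its k-th nearest other vector."""
    dists = sorted(math.dist(points[i], q)
                   for j, q in enumerate(points) if j != i)
    return dists[k - 1]

def kdist_outliers(points: List[Point], k: int, n: int) -> List[int]:
    """Sort indices by ascending KDIST and return the last n (requires n >= 1)."""
    order = sorted(range(len(points)), key=lambda i: kdist(points, i, k))
    return order[-n:]

points = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8)]
outliers = kdist_outliers(points, k=2, n=1)
```

The parameter `n` makes the stated problem concrete: the caller must decide beforehand how many vectors to report.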

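Density-based: Inverse distance
• For each point, the density is the inverse of the average distance to its k nearest neighbors N(x, k):
$$\mathrm{density}(x, k) = \left( \frac{\sum_{y \in N(x,k)} \mathrm{distance}(x, y)}{|N(x, k)|} \right)^{-1}$$
In the NN and ID (inverse distance) approaches, p1 is considered an outlier, while p2 is not.
[Figure: scatter plot marking points p1 and p2]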
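The inverse-distance density can be computed directly from its definition. A small sketch, assuming Euclidean distance and no duplicate points (a zero average distance would divide by zero); `k_nearest` and `density` are illustrative names.

```python
import math
from typing import List, Sequence

Point = Sequence[float]

def k_nearest(points: List[Point], i: int, k: int) -> List[int]:
    """Indices of the k nearest neighbours N(x, k) of points[i]."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: (math.dist(points[i], points[j]), j))
    return others[:k]

def density(points: List[Point], i: int, k: int) -> float:
    """Inverse of the average distance from points[i] to its k nearest neighbours."""
    nbrs = k_nearest(points, i, k)
    avg = sum(math.dist(points[i], points[j]) for j in nbrs) / len(nbrs)
    return 1.0 / avg

points = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8)]
densities = [density(points, i, k=2) for i in range(len(points))]
sparsest = min(range(len(points)), key=lambda i: densities[i])
```

Low density marks a candidate outlier; here the isolated point at (8, 8) has the lowest value.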

Density-based: LOF
• For each point, compute the density of its local neighborhood
• Compute the local outlier factor (LOF) of a sample p as the average of the ratios of the density of p's nearest neighbors to the density of p
• Outliers are points with the largest LOF value (equivalently, the lowest relative density below)
$$\mathrm{reldensity}(x, k) = \frac{\mathrm{density}(x, k)}{\sum_{y \in N(x,k)} \mathrm{density}(y, k) \,/\, |N(x, k)|}$$
In the NN approach, p2 is not considered an outlier, while the LOF approach finds both p1 and p2 as outliers.
[Figure: scatter plot marking points p1 and p2]
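A sketch of the relative density on a toy data set, assuming Euclidean distance and the inverse-average-distance density as the local density. This follows the simplified formula on the slide rather than the reachability-distance formulation of the original LOF paper; all function names are illustrative.

```python
import math
from typing import List, Sequence

Point = Sequence[float]

def k_nearest(points: List[Point], i: int, k: int) -> List[int]:
    """Indices of the k nearest neighbours N(x, k) of points[i]."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: (math.dist(points[i], points[j]), j))
    return others[:k]

def density(points: List[Point], i: int, k: int) -> float:
    """Inverse of the average distance to the k nearest neighbours."""
    nbrs = k_nearest(points, i, k)
    return len(nbrs) / sum(math.dist(points[i], points[j]) for j in nbrs)

def reldensity(points: List[Point], i: int, k: int) -> float:
    """Density of points[i] divided by the mean density of its neighbours."""
    nbrs = k_nearest(points, i, k)
    mean_nbr_density = sum(density(points, j, k) for j in nbrs) / len(nbrs)
    return density(points, i, k) / mean_nbr_density

points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
rel = [reldensity(points, i, k=2) for i in range(len(points))]
most_anomalous = min(range(len(points)), key=lambda i: rel[i])
```

Points inside the square have relative density 1.0, since they are as dense as their neighbours, while the isolated point falls well below 1.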

Clustering-Based
• Basic idea:
  - Cluster the data into groups of different density
  - Choose points in small clusters as candidate outliers
  - Compute the distance between candidate points and non-candidate clusters
    - If candidate points are far from all other non-candidate points, they are outliers
  - E.g., DBSCAN
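The steps above can be sketched with a deliberately simple clustering. This is not DBSCAN: as an assumption for illustration, clusters are connected components of the graph that links points closer than `eps`, and `min_size` and `far` are hypothetical thresholds for "small cluster" and "far from all non-candidate points".

```python
import math
from collections import Counter
from typing import List, Sequence

Point = Sequence[float]

def connected_clusters(points: List[Point], eps: float) -> List[int]:
    """Label points so that any two points within eps share a cluster."""
    labels = [-1] * len(points)
    cluster = 0
    for start in range(len(points)):
        if labels[start] != -1:
            continue
        labels[start] = cluster
        stack = [start]
        while stack:  # depth-first expansion of the current cluster
            i = stack.pop()
            for j in range(len(points)):
                if labels[j] == -1 and math.dist(points[i], points[j]) <= eps:
                    labels[j] = cluster
                    stack.append(j)
        cluster += 1
    return labels

def cluster_outliers(points: List[Point], eps: float,
                     min_size: int, far: float) -> List[int]:
    """Candidates are members of clusters smaller than min_size; a candidate
    is reported when it is farther than `far` from every non-candidate point."""
    labels = connected_clusters(points, eps)
    sizes = Counter(labels)
    candidates = [i for i, lab in enumerate(labels) if sizes[lab] < min_size]
    normal = [i for i, lab in enumerate(labels) if sizes[lab] >= min_size]
    return [i for i in candidates
            if all(math.dist(points[i], points[j]) > far for j in normal)]

points = [(0, 0), (0, 1), (1, 0), (1, 1), (9, 9), (1, 2.5)]
outliers = cluster_outliers(points, eps=1.2, min_size=3, far=3.0)
```

In this toy run both (9, 9) and (1, 2.5) form singleton clusters, but only (9, 9) is far from the main cluster, so only it is reported.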

Summary
• Types of anomaly detection schemes
  - Graphical & Statistical-based
  - Distance-based
    - Nearest-neighbor based
    - Density based
    - Clustering based
  - Model-based (Machine Learning/classification)

Crowd Sensing of Traffic Anomalies based on Human Mobility and Social Media
Bei Pan (Penny), University of Southern California
Yu Zheng, Microsoft Research
David Wilkie, University of North Carolina
Cyrus Shahabi, University of Southern California

Background
• The prevalence of location services
  - Mobile phones, GPS
  - Check-in services
• “Crowd sensing” city rhythms
  - Urban planning
  - Activity understanding
• Our interests:
  - Dynamics of urban traffic
  - Detect and analyze traffic anomalies
ACM SIGSPATIAL 2013

Insights
When a traffic anomaly occurs:
1) The percentage of traffic traveling on different routes may change
2) People may discuss the anomaly on social media
[Figure: routing behavior over routes rt1 to rt4 in normal times vs. during the traffic anomaly]

Goal - Detection
[Figure: routing behavior during regular times vs. during an anomalous event; the anomalous graph marks increases and decreases of routing behavior]

Goal - Analysis
• Understand the traffic anomalies
  - Describe the anomaly using social media
  - Impact analysis on travel time delay
[Figure: detected anomalous graph]

Applications
• Individual users
• Transportation authorities

System Overview

Preliminaries
• Trajectory (tr)
  - A sequence of GPS points
    - E.g., { <loc_1, t_1>, <loc_2, t_2>, <loc_3, t_3> }
  - After map-matching & interpolation [1][2]
    - E.g., { <r_1, t'_1>, <r_2, t'_2>, <r_3, t'_3>, <r_4, t'_4> }
• Route (rt): a sequence of connected road segments
  - E.g., <r_1, r_2, r_3, r_4>
[1] J. Yuan, Y. Zheng, C. Zhang, X. Xie, and G.-Z. Sun. An interactive-voting based map matching algorithm. In MDM ’10.
[2] L.-Y. Wei, Y. Zheng, and W.-C. Peng. Constructing popular routes from uncertain trajectories. In KDD ’12.
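The two structures above can be represented with plain tuples and lists. In this sketch the type names, the `route_of` helper, and the choice to collapse consecutive repeats of the same segment are all assumptions for illustration; the map-matching and interpolation steps of [1][2] are not implemented.

```python
from typing import List, Tuple

MatchedPoint = Tuple[str, float]  # (road segment id, timestamp in seconds)

def route_of(trajectory: List[MatchedPoint]) -> List[str]:
    """Drop timestamps and collapse consecutive repeats of a segment,
    leaving the sequence of road segments the trajectory traverses."""
    route: List[str] = []
    for segment, _timestamp in trajectory:
        if not route or route[-1] != segment:
            route.append(segment)
    return route

# A map-matched trajectory with two samples on segment r2.
tr = [("r1", 0.0), ("r2", 12.0), ("r2", 20.0), ("r3", 31.0), ("r4", 45.0)]
rt = route_of(tr)
```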

Anomaly Detection in Routing Behavior Analysis
• Routing Behavior:
  - RP_OD = <f_1, p_1, f_2, p_2, ..., f_n, p_n>
  - f: traffic flow / p: percentage
  - e.g., RP_OD = <160, 0.8, 20, 0.1, 20, 0.1>
• Anomaly Detection Problem Definition:
  - Given a complete road network and a trajectory set in [t_0, t_1], find graphs such that
    - For each O, there is at least one D for which the RP_OD at time t_1 is anomalous compared with the regular RP_OD during [t_0, t_1)
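The routing pattern in the example can be rebuilt from per-route flows: 160 + 20 + 20 = 200 vehicles, so the percentages are 0.8, 0.1 and 0.1. The sketch below also includes a hypothetical comparison, `percentage_shift`, which is half the sum of absolute percentage differences; it is an illustrative stand-in and not the anomaly criterion used in the paper.

```python
import math
from typing import List

def routing_pattern(flows: List[float]) -> List[float]:
    """Build <f_1, p_1, ..., f_n, p_n> from the flows on each route."""
    total = sum(flows)
    rp: List[float] = []
    for f in flows:
        rp.extend([f, f / total])
    return rp

def percentage_shift(regular: List[float], current: List[float]) -> float:
    """Hypothetical measure: half the L1 distance between percentage vectors.
    Both patterns must list the same routes in the same order."""
    p, q = regular[1::2], current[1::2]
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

regular = routing_pattern([160, 20, 20])   # the example from the slide
current = routing_pattern([40, 80, 80])    # made-up flows during an anomaly
shift = percentage_shift(regular, current)
```

A shift of 0 means the split across routes is unchanged, and values near 1 mean almost all traffic moved to different routes.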

Anomaly Detection in Term Mining
[Figure labels: (T_H), (T_C)]

Impact Analysis & Visualization
• Impact: Travel Time Delay
  - Individual travel time calculation:
    - E.g., travel time at segment a is 96 sec.
  - Mean travel time during time interval T:
  - Delayed travel time for road segment r:
• Visualization:
  - Green: < 2x regular travel time
  - Yellow: [2x, 3x] regular travel time
  - Red: > 3x regular travel time
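The color rule can be written as a small function of the ratio between the current and the regular travel time. The function name and the sample numbers are assumptions; the delay formulas themselves are not reproduced here.

```python
def delay_color(current_sec: float, regular_sec: float) -> str:
    """Map a segment's travel time to the visualization color:
    green below 2x regular, yellow within [2x, 3x], red above 3x."""
    ratio = current_sec / regular_sec
    if ratio < 2.0:
        return "green"
    if ratio <= 3.0:
        return "yellow"
    return "red"

# With a hypothetical regular travel time of 60 sec, 96 sec is 1.6x regular.
color = delay_color(96.0, 60.0)
```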

Evaluation
• Traffic data set: (~20% of traffic flow on Beijing road network)
• Social media data:
  - Crawled from a Chinese micro-blogging service called “Weibo”.
• Anomaly detection baseline approach
  - PCA, proposed in [1]: anomaly detection based on traffic volume
[1] S. Chawla, Y. Zheng, and J. Hu. Inferring the root cause in road traffic anomalies. In ICDM ’12.
