

slide-1
SLIDE 1


Updates:

v Week 15 on 11/30

§ Project 2 progress Presentation and Discussions

  • Each team: 10 min (including presentation and Q&A)

  • Focus on ideas and Q&A

§ Quiz 2 on Recommender systems

v Quiz 1 will be graded over the weekend

v One review will be graded over the weekend

slide-2
SLIDE 2

Welcome to DS504/CS586: Big Data Analytics

Anomaly/Outlier Detection

  • Prof. Yanhua Li

  • Time: 6:00pm – 8:50pm Thu

  • Location: KH116

  • Fall 2017

slide-3
SLIDE 3

What is an outlier?

An outlier is an observation that deviates so much from other observations that it arouses suspicion that it was generated by a different mechanism.

Outliers

  • Outliers, exceptions, peculiarities, surprises, etc.
slide-4
SLIDE 4

Introduction (not an easy task)

w We are drowning in the deluge of data that are being collected world-wide, while starving for knowledge at the same time*

w Anomalous events occur relatively infrequently

w However, when they do occur, their consequences can be quite dramatic, and quite often in a negative sense

“Mining needle in a haystack. So much hay and so little time”

* - J. Naisbitt, Megatrends: Ten New Directions Transforming Our Lives. New York: Warner Books, 1982.

slide-5
SLIDE 5

Importance of Anomaly Detection

Ozone Depletion History

v In 1985 three researchers (Farman, Gardiner and Shanklin) were puzzled by data gathered by the British Antarctic Survey showing that ozone levels for Antarctica had dropped 10% below normal levels

v Why did the Nimbus 7 satellite, which had instruments aboard for recording ozone levels, not record similarly low ozone concentrations?

v The ozone concentrations recorded by the satellite were so low they were being treated as outliers by a computer program and discarded!

Sources:
http://exploringdata.cqu.edu.au/ozone.html
http://www.epa.gov/ozone/science/hole/size.html

slide-6
SLIDE 6

Real World Anomalies

v Credit Card Fraud

§ An abnormally high purchase made on a credit card

v Cyber Intrusions

§ A web server involved in ftp traffic

Classical problem: Noise in measurements needs to be removed because of its detrimental effect on statistical inference. For example, removing it makes clustering more robust.

Data mining: A deviating or surprising measurement is interesting. In this case we want to detect outliers.
slide-7
SLIDE 7

Anomaly/Outlier Detection

v Variants of Anomaly/Outlier Detection Problems

§ Given a database D, containing mostly normal (but unlabeled) data points, and a test point x, compute the anomaly score of x with respect to D

§ Given a database D, find all the data points x ∈ D with anomaly scores greater than some threshold t

§ Given a database D, find all the data points x ∈ D having the top-n largest anomaly scores f(x)
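
The three problem variants can be sketched in Python once some scoring function is available. The score used here (absolute distance from the mean of D, for 1-D data) is only a hypothetical stand-in, and the names `anomaly_score`, `above_threshold`, and `top_n` are illustrative rather than taken from the lecture.

```python
from statistics import mean


def anomaly_score(x, database):
    """Hypothetical score f(x): absolute distance of x from the mean of D."""
    return abs(x - mean(database))


def above_threshold(database, t):
    """Variant 2: all points in D whose anomaly score exceeds threshold t."""
    return [x for x in database if anomaly_score(x, database) > t]


def top_n(database, n):
    """Variant 3: the n points in D with the largest anomaly scores."""
    ranked = sorted(database, key=lambda x: anomaly_score(x, database), reverse=True)
    return ranked[:n]


D = [1.0, 2.0, 3.0, 2.0, 50.0]
print(anomaly_score(2.5, D))   # Variant 1: score of a test point x against D
print(above_threshold(D, 20))  # points scoring above t = 20
print(top_n(D, 2))             # the two highest-scoring points
```

The threshold variant needs a meaningful scale for the score, while the top-n variant only needs a ranking, which is why the two are listed as separate problems.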

slide-8
SLIDE 8

Anomaly Detection

v Challenges

§ How many outliers are there in the data?

§ Method is unsupervised

  • Validation can be quite challenging (just like for clustering)

§ Finding needle in a haystack

v Working assumption:

§ There are considerably more “normal” observations than “abnormal” observations (outliers/anomalies) in the data

slide-9
SLIDE 9

Anomaly Detection Schemes

v General Steps

§ Build a profile of the “normal” behavior

  • Profile can be patterns or summary statistics for the overall population

§ Use the “normal” profile to detect anomalies

  • Anomalies are observations whose characteristics differ significantly from the normal profile

v Types of anomaly detection schemes

§ Graphical & Statistical-based

§ Distance-based

§ Model-based (Machine Learning/classification)

slide-10
SLIDE 10

Graphical Approaches

v Boxplot (1-D), Scatter plot (2-D), Spin plot (3-D)

v Limitations

§ Time consuming

§ Subjective

slide-11
SLIDE 11

Statistical Approaches

v Assume a parametric model describing the distribution of the data (e.g., normal distribution)

v Apply a statistical test that depends on

§ Data distribution

§ Parameter of distribution (e.g., mean, variance)

§ Number of expected outliers (confidence limit)
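
One simple instance of such a test is a z-score cutoff under a normality assumption. The sketch below is illustrative only: the function name `zscore_outliers` and the cutoff of 2.5 are assumptions, not something the slides prescribe.

```python
from statistics import mean, stdev


def zscore_outliers(data, threshold=2.5):
    """Return indices of values whose |z-score| exceeds the threshold.

    Assumes roughly normal data with non-zero spread; the threshold
    is a hypothetical choice.
    """
    mu = mean(data)
    sigma = stdev(data)  # sample standard deviation
    return [i for i, x in enumerate(data) if abs(x - mu) / sigma > threshold]


readings = [10.0, 10.2, 9.9, 10.1, 9.8, 10.0, 10.3, 9.7, 10.1, 25.0]
print(zscore_outliers(readings))  # [9]
```

Note that the outlier itself inflates the mean and standard deviation, so with small samples a large cutoff can mask it. This is one reason the test depends on the expected number of outliers.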

slide-12
SLIDE 12

Distance-based Approaches

v Data is represented as a vector of features

v Three major approaches

§ Nearest-neighbor based

§ Density based

§ Clustering based

slide-13
SLIDE 13

K-Nearest Neighbour Graph

Definition: Every vector in the data set forms one node and every node has pointers to its k nearest neighbours.
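
A brute-force construction of this graph can be sketched as follows. The function name `knn_graph` and the use of Euclidean distance are assumptions for illustration; practical systems would typically use a spatial index instead of comparing all pairs.

```python
import math


def knn_graph(points, k):
    """Map each point index to the indices of its k nearest neighbours.

    Brute force, O(n^2) distance computations; ties are broken by index.
    """
    graph = {}
    for i, p in enumerate(points):
        others = [(math.dist(p, q), j) for j, q in enumerate(points) if j != i]
        others.sort()
        graph[i] = [j for _, j in others[:k]]
    return graph


sample = [(0, 0), (1, 0), (0, 2), (10, 10)]
print(knn_graph(sample, 2))
```

The pointers are directed: the far-away point (10, 10) points at two cluster members, but no cluster member points back at it.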

slide-14
SLIDE 14

KDIST: Density-based outlier detection

[S. Ramaswamy et al., SIGMOD, Texas, 2000.]

Define the k Nearest Neighbour distance (KDIST) as the distance to the kth nearest vector. Vectors are sorted by their KDIST distance. The last n vectors in the list are defined as outliers. The intuitive idea is that when KDIST is large, the vector is in a sparse area and is likely to be an outlier. Problem: the user has to know in advance how many outliers there are in the data set.
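
The KDIST ranking described on this slide can be sketched directly. The helper names `kdist` and `kdist_outliers` and the Euclidean metric are assumptions for illustration, and the all-pairs computation is kept simple rather than efficient.

```python
import math


def kdist(points, i, k):
    """Distance from points[i] to its k-th nearest neighbour."""
    dists = sorted(math.dist(points[i], q) for j, q in enumerate(points) if j != i)
    return dists[k - 1]


def kdist_outliers(points, k, n):
    """Indices of the n points with the largest KDIST, largest first."""
    order = sorted(range(len(points)), key=lambda i: kdist(points, i, k), reverse=True)
    return order[:n]


data = [(0, 0), (1, 0), (0, 1), (1, 1), (8, 8)]
print(kdist_outliers(data, k=2, n=1))  # [4]
```

The parameter n is exactly the quantity the slide flags as a problem: it must be chosen before the outliers are known.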
slide-15
SLIDE 15

Density-based: Inverse distance

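v For each point, compute the density as the inverse of the average distance to its k nearest neighbors, where N(x, k) is the set of the k nearest neighbors of x:

$$\mathrm{density}(x, k) = \left( \frac{\sum_{y \in N(x,k)} \mathrm{distance}(x, y)}{|N(x, k)|} \right)^{-1}$$

[Figure: scatter plot with two marked points, p1 and p2]

In the nearest-neighbor (NN) and inverse-distance (ID) approaches, p1 is considered an outlier, while p2 is not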
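
The inverse-distance density can be computed in a few lines. This is a minimal sketch assuming Euclidean distance and no duplicate points (a zero average distance would divide by zero); the function names are illustrative.

```python
import math


def knn_distances(points, i, k):
    """Sorted distances from points[i] to its k nearest neighbours."""
    dists = sorted(math.dist(points[i], q) for j, q in enumerate(points) if j != i)
    return dists[:k]


def density(points, i, k):
    """Inverse of the average distance to the k nearest neighbours.

    Assumes the average distance is non-zero (no duplicate points).
    """
    nearest = knn_distances(points, i, k)
    return 1.0 / (sum(nearest) / len(nearest))


data = [(0, 0), (1, 0), (0, 1), (1, 1), (8, 8)]
print([round(density(data, i, 2), 3) for i in range(len(data))])
```

Points inside the cluster get a density of 1.0, while the isolated point gets a density below 0.1, so ranking by lowest density singles it out.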

slide-16
SLIDE 16

Density-based: LOF

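v For each point, compute the density of its local neighborhood

v Compute the local outlier factor (LOF) of a sample p as the average of the ratios of the density of sample p and the density of its nearest neighbors

v Outliers are points with the largest LOF value

The relative density of a point x compares its own density with the average density of its k nearest neighbors:

$$\mathrm{Reldensity}(x, k) = \frac{\mathrm{density}(x, k)}{\sum_{y \in N(x,k)} \mathrm{density}(y, k) \,/\, |N(x, k)|}$$

A point that is much sparser than its neighbors has a relative density well below 1, which corresponds to a large LOF value.

[Figure: scatter plot with two marked points, p1 and p2]

In the NN approach, p2 is not considered an outlier, while the LOF approach finds both p1 and p2 as outliers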
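
The relative density formula can be sketched as below, using the inverse average distance as the density. This is a simplified illustration rather than the full LOF algorithm (which uses reachability distances); the function names and the Euclidean metric are assumptions.

```python
import math


def neighbours(points, i, k):
    """Indices of the k nearest neighbours of points[i] (ties broken by index)."""
    ranked = sorted((math.dist(points[i], q), j) for j, q in enumerate(points) if j != i)
    return [j for _, j in ranked[:k]]


def density(points, i, k):
    """Inverse of the average distance from points[i] to its k nearest neighbours."""
    nbrs = neighbours(points, i, k)
    return len(nbrs) / sum(math.dist(points[i], points[j]) for j in nbrs)


def relative_density(points, i, k):
    """density(x, k) divided by the average density of x's k nearest neighbours."""
    nbrs = neighbours(points, i, k)
    avg_nbr_density = sum(density(points, j, k) for j in nbrs) / len(nbrs)
    return density(points, i, k) / avg_nbr_density


data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.5, 0.5)]
print([round(relative_density(data, i, 2), 3) for i in range(len(data))])
```

The point (0.5, 0.5) sits close to a tight cluster, so a global distance cutoff might miss it, but its relative density is far below 1 because its neighbors are much denser than it is.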

slide-17
SLIDE 17

Clustering-Based

v Basic idea:

§ Cluster the data into groups of different density

§ Choose points in small clusters as candidate outliers

§ Compute the distance between candidate points and non-candidate clusters.

  • If candidate points are far from all other non-candidate points, they are outliers

§ E.g., DBSCAN
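
The candidate-and-distance steps can be sketched as follows, assuming cluster labels have already been produced by some clustering step such as DBSCAN. The function name `cluster_outliers` and the parameters `small_size` and `min_gap` are hypothetical tuning knobs, not values given in the slides.

```python
import math
from collections import Counter


def cluster_outliers(points, labels, small_size, min_gap):
    """Flag points in small clusters that are far from every non-candidate point.

    labels[i] is the cluster id of points[i], assumed to come from a
    prior clustering step. small_size and min_gap are hypothetical
    tuning parameters.
    """
    sizes = Counter(labels)
    candidates = [i for i, c in enumerate(labels) if sizes[c] <= small_size]
    regular = [i for i, c in enumerate(labels) if sizes[c] > small_size]
    if not regular:
        return []  # nothing to compare against
    outliers = []
    for i in candidates:
        gap = min(math.dist(points[i], points[j]) for j in regular)
        if gap > min_gap:
            outliers.append(i)
    return outliers


pts = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (1.5, 1.5), (9, 9)]
ids = [0, 0, 0, 0, 0, 1, 2]
print(cluster_outliers(pts, ids, small_size=2, min_gap=3.0))  # [6]
```

Here (1.5, 1.5) is a candidate because its cluster is small, but it is close to the large cluster and is not reported; only (9, 9) is far from all non-candidate points.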

slide-18
SLIDE 18

Summary

v Types of anomaly detection schemes

§ Graphical & Statistical-based

§ Distance-based

  • Nearest-neighbor based
  • Density based
  • Clustering based

§ Model-based (Machine Learning/classification)

slide-19
SLIDE 19

Crowd Sensing of Traffic Anomalies based on Human Mobility and Social Media

Bei Pan (Penny), University of Southern California
Yu Zheng, Microsoft Research
David Wilkie, University of North Carolina
Cyrus Shahabi, University of Southern California

slide-20
SLIDE 20

Background

v The prevalence of location services

  • Mobile phones, GPS
  • Check-in services
  • “Crowd sensing” city rhythms
  • Urban planning
  • Activity understanding
  • Our interests:
  • Dynamics of urban traffic
  • Detect and analyze traffic anomalies

ACM SIGSPATIAL 2013

slide-21
SLIDE 21

Insights

When a traffic anomaly occurs:

1) The percentage of travel on different routes may change

2) People may discuss the anomaly on social media

[Figure: routing behavior in normal times vs. routing behavior during the traffic anomaly; route labels rt1, rt2, rt3 and rt1, rt2, rt3, rt4]

slide-22
SLIDE 22

Goal - Detection

[Figure: routing behavior during regular times vs. during an anomalous event; the anomalous graph marks increases and decreases of routing behavior]

slide-23
SLIDE 23

Goal - Analysis

v Understand the traffic anomalies

§ Describe the anomaly using social media

§ Impact analysis on travel time delay

[Figure: detected anomalous graph]

slide-24
SLIDE 24

Applications

  • Individual users

  • Transportation authorities

slide-25
SLIDE 25

System Overview


slide-26
SLIDE 26

System Overview


slide-27
SLIDE 27

Preliminaries

v Trajectory (tr)

§ A sequence of GPS points

  • E.g., {<loc1, t1>, <loc2, t2>, <loc3, t3>}

§ After map-matching & interpolation [1][2]

  • E.g., {<r1, t’1>, <r2, t’2>, <r3, t’3>, <r4, t’4>}

v Route (rt): a sequence of connected road segments

§ E.g., <r1, r2, r3, r4>
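
The relationship between a map-matched trajectory and its route can be illustrated with a small sketch. It assumes the trajectory is already a list of (road segment, timestamp) pairs; the function name `trajectory_to_route` is hypothetical, and the map-matching step itself (references [1][2]) is not reproduced here.

```python
def trajectory_to_route(trajectory):
    """Reduce a map-matched trajectory [(segment, time), ...] to its route.

    Timestamps are dropped and consecutive repeats of a segment are
    collapsed. Connectivity of the segments is assumed, not checked.
    """
    route = []
    for segment, _time in trajectory:
        if not route or route[-1] != segment:
            route.append(segment)
    return route


tr = [("r1", 0), ("r1", 5), ("r2", 12), ("r3", 20), ("r4", 31)]
print(trajectory_to_route(tr))  # ['r1', 'r2', 'r3', 'r4']
```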

[1] J. Yuan, Y. Zheng, C. Zhang, X. Xie, and G.-Z. Sun. An interactive-voting based map matching algorithm. In MDM ’10.
[2] L.-Y. Wei, Y. Zheng, and W.-C. Peng. Constructing popular routes from uncertain trajectories. In KDD ’12.

slide-28
SLIDE 28

Anomaly Detection in Routing Behavior Analysis

v Routing Behavior:

§ RPOD = <f1, p1, f2, p2, ..., fn, pn>

§ f: traffic flow / p: percentage

§ e.g., RPOD = <160, 0.8, 20, 0.1, 20, 0.1>

v Anomaly Detection Problem Definition:

§ Given a complete road network and a trajectory set in [t0, t1], find graphs such that

  • for each O there is at least one D for which the RPOD at time t1 is anomalous compared with the regular RPOD over [t0, t1)
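
The routing behavior vector in the example can be reproduced from raw route observations. This is a minimal sketch: the function name `routing_behavior` and the input format (one route id per observed trajectory for a single O-D pair) are assumptions, and the anomaly test that compares two such vectors is not shown because the slide does not give it.

```python
from collections import Counter


def routing_behavior(routes):
    """Build [(flow, percentage), ...] for one O-D pair.

    routes holds one route id per observed trajectory; results are
    ordered by route id.
    """
    counts = Counter(routes)
    total = sum(counts.values())
    return [(counts[rt], counts[rt] / total) for rt in sorted(counts)]


observed = ["rt1"] * 160 + ["rt2"] * 20 + ["rt3"] * 20
print(routing_behavior(observed))  # [(160, 0.8), (20, 0.1), (20, 0.1)]
```

Flattening the pairs gives the slide's vector <160, 0.8, 20, 0.1, 20, 0.1>.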


slide-29
SLIDE 29

System Overview


slide-30
SLIDE 30

Anomaly Detection in Term Mining

[Figure: term mining, with panels labeled (TC) and (TH)]

slide-31
SLIDE 31

Impact Analysis & Visualization

v Impact: Travel Time Delay

§ Individual travel time calculation:

  • E.g., travel time at segment a is 96 sec.

§ Mean travel time during time interval T:

§ Delayed travel time for road segment r:

v Visualization:

§ Green: < 2x regular travel time

§ Yellow: [2x, 3x] regular travel time

§ Red: > 3x regular travel time
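
The color bands can be expressed as a small function of the ratio between observed and regular travel time. The function name `delay_color` and the 60-second regular time in the usage lines are assumptions for illustration; only the band boundaries come from the slide.

```python
def delay_color(travel_time, regular_time):
    """Map a travel time to the slide's color bands by its ratio to regular time."""
    ratio = travel_time / regular_time
    if ratio < 2:
        return "green"
    if ratio <= 3:
        return "yellow"
    return "red"


print(delay_color(96, 60))   # green
print(delay_color(150, 60))  # yellow
print(delay_color(200, 60))  # red
```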


slide-32
SLIDE 32

Evaluation

v Traffic data set: (~20% of traffic flow on Beijing road network)

v Social Media Data:

§ Crawled from the Chinese micro-blogging service “Weibo”.

v Anomaly detection baseline approach

§ PCA, proposed in [1]: anomaly detection based on traffic volume

[1] S. Chawla, Y. Zheng, and J. Hu. Inferring the root cause in road traffic anomalies. In ICDM ’12.

slide-33
SLIDE 33

Effectiveness Evaluation

v Recall: (percentage of actual events that can be detected)

§ Sampling time period: 4pm to 6pm on 5/12/2011

§ Events reported from Beijing transportation authorities are not necessarily the entire set of ground truth

[Figure: reported events; events detected by the baseline (Recall: 46.7%); events detected by our approach (Recall: 86.7%)]
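
Recall as defined on this slide is the share of reported events that a method also detects. The sketch below uses hypothetical event ids rather than the study's data, and the function name `recall` is illustrative.

```python
def recall(reported, detected):
    """Fraction of reported events that also appear among detected events."""
    reported = set(reported)
    return len(reported & set(detected)) / len(reported)


reported_events = {"e1", "e2", "e3", "e4", "e5"}  # hypothetical event ids
detected_events = {"e1", "e2", "e4", "x9"}        # x9 was detected but never reported
print(recall(reported_events, detected_events))   # 0.6
```

Detections such as x9 do not lower recall. Since the reported events are not the full ground truth, such a detection may be a real but unreported anomaly rather than a false alarm.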

slide-34
SLIDE 34

Case Study - 1

v Traffic accidents (reported by transportation agency)

[Figure: mined terms and term weights]

slide-35
SLIDE 35

Case Study - 2

v Wedding Expo (not reported by transportation agency)

[Figure: mined terms]

slide-36
SLIDE 36

Conclusion

v Anomaly detection using crowd sensing

§ More precise, more meaningful than traffic volume based algorithm.

v Anomaly analysis using social media

§ Significant reduction of searching space

v Enable new thoughts in urban computing

§ Detect and describe traffic anomalies that are not reported

§ Understand human behavior during traffic anomalies

slide-37
SLIDE 37

Q & A
