MSD-Kmeans: A Novel Algorithm for Efficient Detection of Global and Local Outliers



slide-1
SLIDE 1

MSD-Kmeans:

A Novel Algorithm for Efficient Detection of Global and Local Outliers

Yuanyuan Wei, Julian Jang-Jaccard, Ruili Wang

School of Natural & Computational Science, Massey University, New Zealand

Fariza Sabrina

School of Engineering and Technology, Central Queensland University, Australia

slide-2
SLIDE 2

Outline

  • 1. Introduction – Outlier
  • 2. An Issue with Extreme Value
  • 3. Our Approach – MSD-Kmeans
  • 4. Evaluation and Discussion
  • 5. Summary
slide-3
SLIDE 3

Introduction

  • What is an outlier?

(Sources: [1], [2], [3])

slide-4
SLIDE 4

Types of Outlier

Global vs Local outliers ([4], [5], [6])

slide-5
SLIDE 5

Existing State-of-the-Art Approaches

Ø Statistics-based approaches

  • MSD (Mean and Standard Deviation)
  • Z-score
  • MIQR (Median and Interquartile Range)

Ø Machine-learning-based approaches

  • LOF (Local Outlier Factor)
  • K-means
slide-6
SLIDE 6

K-means Clustering

Ø K-means is one of the most popular clustering algorithms to detect outliers.

K-means clustering [7]

slide-7
SLIDE 7

Issue with Extreme Values of K-means

Ø The algorithm can be misled if the clusters differ greatly in size or density.

Ø It is sensitive to noise and outliers, and K-means works best for globular clusters.

  • Extreme values increase the intra-cluster distance
  • K-means then fails to detect local outliers

For example [8]:

slide-8
SLIDE 8

Our approach: MSD-Kmeans

Ø The goal of MSD-Kmeans: to eliminate as many global outliers (extreme values) as possible, to minimize their interference with efficient clustering by K-means.

Ø To make clustering efficient by removing extreme values

v Two-phase approach

  • 1st phase: remove extreme values so that the data is better normalized
  • 2nd phase: cluster the normalized data to produce better clusters with more realistic outliers

slide-9
SLIDE 9

MSD-Kmeans:

v Phase 1: MSD algorithm for detecting global outliers

MSD (Mean and Standard Deviation)

  • The MSD algorithm computes the mean and standard deviation of the supplied dataset; values that deviate from the mean by more than one standard deviation are considered global outliers [9].

slide-10
SLIDE 10

MSD Algorithm for Detecting Global Outliers

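Let $x \in \{x_1, x_2, x_3, \ldots, x_n\}$, then

$$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n - 1}}, \qquad \mu = \frac{1}{n} \sum_{i=1}^{n} x_i,$$

where $n$ is the number of data points in the dataset, $\mu$ is the mean value of the dataset, and $\sigma$ is the standard deviation.

$$x_{\text{global\_outliers}} = \begin{cases} x_{\text{upper}}, & x > \mu + \sigma \\ x_{\text{lower}}, & x < \mu - \sigma \end{cases}$$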
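The Phase 1 rule can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the helper name `msd_split` and the fare values are made up, and the sample standard deviation (n - 1 denominator) is assumed.

```python
from statistics import mean, stdev


def msd_split(values):
    """Split values into (kept, global_outliers) using the mean +/- 1 std rule."""
    mu = mean(values)
    sigma = stdev(values)  # assumption: sample standard deviation (n - 1)
    lower, upper = mu - sigma, mu + sigma
    kept = [v for v in values if lower <= v <= upper]
    outliers = [v for v in values if v < lower or v > upper]
    return kept, outliers


# Made-up fare amounts, for illustration only.
fares = [52.0, 49.5, 55.0, 51.0, 50.5, 53.0, 250.0, 48.0]
kept, global_outliers = msd_split(fares)
print(global_outliers)  # only the extreme fare 250.0 is flagged
```

Note how the single extreme value inflates both the mean and the standard deviation, which is why a one-standard-deviation band still keeps all the ordinary fares in this example.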

slide-11
SLIDE 11

MSD-Kmeans:

v Phase 2: K-means algorithm for detecting local outliers

[Flowchart]

  • Dataset: choose the initial K clusters (centroids)
  • Compute the distance between objects and centroids
  • Group objects based on minimum distance
  • No object moves group? If no, update the centroids and repeat; if yes, the clustering has converged
  • Calculate the threshold based on intra-cluster distance
  • If intra-cluster distance < threshold: normal data
  • If intra-cluster distance > threshold: outlier
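The clustering loop in this flowchart is the standard K-means (Lloyd's) iteration. The sketch below is a simplified illustration on one-dimensional data; the helper `kmeans_1d`, the initial centroids, and the values are assumptions chosen for readability, not the presenters' implementation.

```python
def kmeans_1d(values, centroids, max_iter=100):
    """Plain K-means on 1-D data; returns (labels, centroids)."""
    centroids = list(centroids)
    labels = [None] * len(values)
    for _ in range(max_iter):
        # Group each object with the centroid at minimum distance.
        new_labels = [
            min(range(len(centroids)), key=lambda k: abs(v - centroids[k]))
            for v in values
        ]
        if new_labels == labels:  # no object moved group: converged
            break
        labels = new_labels
        # Update each centroid to the mean of its current members.
        for k in range(len(centroids)):
            members = [v for v, lab in zip(values, labels) if lab == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = sum(members) / len(members)
    return labels, centroids


# Made-up values and initial centroids, for illustration only.
values = [48.0, 49.5, 50.5, 51.0, 70.0, 72.5, 74.0]
labels, centroids = kmeans_1d(values, centroids=[48.0, 74.0])
print(labels, centroids)
```

The thresholding step at the end of the flowchart is applied afterwards, to the distances between each point and its final centroid.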

slide-12
SLIDE 12

MSD-Kmeans Outliers:

Ø This method was implemented in Python using the sklearn.cluster.KMeans class to group the dataset into K clusters (K = 2).

Ø The intra-cluster distance from the centroid to each data point in each cluster is calculated using the Euclidean distance:

$$d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}$$

where $p$ is the centroid and $q$ is a data point in its cluster.
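As a rough sketch of this step, the snippet below clusters a small made-up array with `sklearn.cluster.KMeans` (K = 2) and computes each point's Euclidean distance to its own centroid. The data, `n_init`, and `random_state` are assumptions added for repeatability; the slides do not state which settings were used.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up one-feature data (e.g. fare amounts), for illustration only.
X = np.array([[48.0], [49.5], [50.5], [51.0], [70.0], [72.5], [74.0]])

# K = 2 as on the slide; n_init and random_state are assumptions.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Euclidean distance from each point to the centroid of its own cluster.
own_centroids = km.cluster_centers_[km.labels_]
intra_dist = np.linalg.norm(X - own_centroids, axis=1)

for label, dist in zip(km.labels_, intra_dist):
    print(label, round(float(dist), 2))
```

Indexing `cluster_centers_` with `labels_` pairs every row of `X` with its assigned centroid, so the distances come out in the same order as the input points.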

slide-13
SLIDE 13

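MSD-Kmeans Outliers:

Ø The threshold of top-n outliers in each cluster

Ø The outlier threshold is calculated as the sum of the mean and 1.5 times the standard deviation of the intra-cluster distances.

$$x_{\text{local\_outlier}} = x_i > \mu_{\text{intra\_distance}} + 1.5 \cdot \sigma_{\text{intra\_distance}}$$

Here the comparison applies to the intra-cluster distance of $x_i$, and $\mu_{\text{intra\_distance}}$ and $\sigma_{\text{intra\_distance}}$ are the mean and standard deviation of the intra-cluster distances.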
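The threshold rule (mean plus 1.5 times the standard deviation of the intra-cluster distances) can be illustrated as follows. The helper `local_outliers` and the distances are hypothetical, and the sample standard deviation is assumed because the slides do not say which variant was used.

```python
from statistics import mean, stdev


def local_outliers(distances, k=1.5):
    """Return (indices above the threshold, threshold) for one cluster."""
    # Assumption: sample standard deviation; k = 1.5 as stated on the slide.
    threshold = mean(distances) + k * stdev(distances)
    flagged = [i for i, d in enumerate(distances) if d > threshold]
    return flagged, threshold


# Made-up intra-cluster distances for a single cluster.
distances = [1.0, 1.2, 0.8, 1.1, 0.9, 1.0, 6.0]
flagged, threshold = local_outliers(distances)
print(flagged, round(threshold, 2))  # only index 6 (distance 6.0) is flagged
```

In a full run this function would be applied once per cluster, so each cluster gets its own threshold.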

slide-14
SLIDE 14

MSD-Kmeans pseudocode

slide-15
SLIDE 15

Experiment & Results

Ø Utilizing the NYC (New York City) Taxi Dataset

  • A dataset of about 1.71 GB, collected from registered taxis in NYC in January 2016.

slide-16
SLIDE 16

Ø Experiment Setup

  • Goal: to identify greedy, dishonest drivers who may attempt to charge high fares by taking detours.
  • Parameter: fare amount (from the Lower Manhattan suburb of SoHo to John F. Kennedy International (JFK) Airport).

(Source: Google Maps)

(a) Lower Manhattan (b) JFK Airport

slide-17
SLIDE 17

MSD-Kmeans & K-means

  • Fig. 1 K-means outlier detection with extreme values (5.11% outliers in total)
  • Fig. 2 MSD-Kmeans outlier detection without extreme values (11.14% outliers in total)

slide-18
SLIDE 18

Evaluation Approaches

Ø Our Approach

v We took a number of popular outlier detection algorithms from both the statistical and machine learning areas, and compared the results.

v Comparing existing state-of-the-art methods

  • Statistical methods – MSD, Z-score and MIQR
  • Machine learning – K-means, LOF

v Comparing measurements of the following properties:

  • TPR (True Positive Rate)
  • FPR (False Positive Rate)
  • Precision = $\frac{TP}{TP + FP}$
  • Accuracy = $\frac{TP + TN}{TP + TN + FP + FN}$
  • F-measure = $\frac{2TP}{2TP + FP + FN}$

v Comparing the execution time
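The evaluation measures on this slide can be computed directly from confusion-matrix counts. The sketch below uses made-up counts; the TPR and FPR formulas are the standard definitions (TP / (TP + FN) and FP / (FP + TN)), which the slide names but does not spell out.

```python
def metrics(tp, fp, tn, fn):
    """Evaluation measures from confusion-matrix counts."""
    return {
        "TPR": tp / (tp + fn),  # standard definition, also called recall
        "FPR": fp / (fp + tn),  # standard definition
        "Precision": tp / (tp + fp),
        "Accuracy": (tp + tn) / (tp + tn + fp + fn),
        "F-measure": 2 * tp / (2 * tp + fp + fn),
    }


# Made-up counts, for illustration only.
scores = metrics(tp=90, fp=10, tn=80, fn=20)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```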

slide-19
SLIDE 19

Performance Comparison of Outlier Detection Algorithms using NYC Taxi Dataset

| Outlier Detection Algorithm | TPR (%) | FPR (%) | Precision (%) | Accuracy (%) | Recall (%) | F-measure (%) | Execution Time (ms) |
|---|---|---|---|---|---|---|---|
| MSD | 99.9 | 24.2 | 96.6 | 96.9 | 99.9 | 98.2 | 21 |
| Z-score | 100 | 48.9 | 94.3 | 94.6 | 100 | 97.1 | 157 |
| MIQR* | 97.8 | 12.6 | 98.1 | 96.4 | 97.8 | 98.0 | 54 |
| K-means [10] | 99.7 | 55.6 | 93.5 | 93.7 | 99.7 | 96.9 | 1,132 |
| LOF* [11] | 98.2 | 79.3 | 26.1 | 38.0 | 98.2 | 43.1 | 31,483 |
| MSD-Kmeans | 98.5 | 11.6 | 98.6 | 97.4 | 98.5 | 98.6 | 824 |

* MIQR (Median and Interquartile Range Method)

* LOF (Local Outlier Factor)

slide-20
SLIDE 20

Comparison

[Bar chart: Accuracy (%) of the different algorithms. MSD 96.9, Z-score 94.6, MIQR* 96.4, K-means 93.7, LOF* 38, MSD-Kmeans 97.4]

[Bar chart: Execution Time (ms) of the different algorithms. MSD 21, Z-score 157, MIQR* 54, K-means 1,132, LOF* 31,483, MSD-Kmeans 824]

* MIQR (Median and Interquartile Range Method)

* LOF (Local Outlier Factor)

The MSD-Kmeans paper with the NYC taxi dataset has been submitted to the ICONIP conference (A ranking).

slide-21
SLIDE 21

Summary

Ø We propose a new outlier detection algorithm, MSD-Kmeans, which can deal with extreme values better.

Ø We evaluated it on IoT datasets, for example the NYC taxi data.

Ø We compared the performance indicators of MSD-Kmeans with those of other outlier detection algorithms, such as MSD, Z-score, MIQR, K-means and LOF, and showed that the proposed MSD-Kmeans algorithm achieved the highest accuracy.

slide-22
SLIDE 22

Future Work

Ø Currently testing with different datasets

  • Indoor air quality data generated by a real IoT application, SKOMOBO (SKOol MOnitoring BOx)
  • The initial evaluation is done and the results are being written up as a journal manuscript
  • Other datasets will also be used, such as those from the

§ UCI Machine Learning Repository

Ø Multivariate analysis of variance

  • Expand to study the interdependence between several variables and their impact.

slide-23
SLIDE 23

References:

[1]. Ved, M. (2018). Outlier Detection and Anomaly Detection with Machine Learning. Retrieved from https://medium.com/@mehulved1503/outlier-detection-and-anomaly-detection-with-machine-learning-caa96b34b7f6

[2]. Floydhub. (2019). Introduction to Anomaly Detection in Python [Online forum comment]. Retrieved from https://blog.floydhub.com/introduction-to-anomaly-detection-in-python/

[3]. Related Guided Lesson. (2019). The Ugly Duckling. Retrieved from https://www.education.com/game/the-ugly-duckling/

[4]. Rushworth, A. (2019). The local outlier factor (LOF). Retrieved from https://campus.datacamp.com/courses/anomaly-detection-in-r/distance-and-density-based-anomaly-detection?ex=9

[5]. Packt. (2017). Machine Learning Review. Retrieved from https://hub.packtpub.com/machine-learning-review/

[6]. Jose, C. (2019). Anomaly Detection Techniques in Python. Retrieved from https://medium.com/learningdatascience/anomaly-detection-techniques-in-python-50f650c75aaf

[7]. Learn by Marketing. (2019). K-Means Clustering – What it is and How it works. Retrieved from http://www.learnbymarketing.com/methods/k-means-clustering/

[8]. Science & Technology Review. (2012). Finding and Fixing a Supercomputer’s Faults. Retrieved from https://str.llnl.gov/June12/desupinski.html

[9]. Prasad, Y. S., & Krishna, G. R. (2013). Statistical Anomaly Detection Technique for Real Time Datasets. International Journal of Computer Trends and Technology (IJCTT), 6(2), 89-94.

[10]. Duan, L., Xu, L., Liu, Y., & Lee, J. (2009). Cluster-based outlier detection. Annals of Operations Research, 168(1), 151-168.

[11]. Breunig, M. M., Kriegel, H. P., Ng, R. T., & Sander, J. (2000, May). LOF: identifying density-based local outliers. In ACM SIGMOD Record (Vol. 29, No. 2, pp. 93-104). ACM.

slide-24
SLIDE 24

THANK YOU