MSD-Kmeans: A Novel Algorithm for Efficient Detection of Global and - PowerPoint PPT Presentation

MSD-Kmeans: A Novel Algorithm for Efficient Detection of Global and Local Outliers
Yuanyuan Wei, Julian Jang-Jaccard, Ruili Wang (School of Natural & Computational Science, Massey University, New Zealand)
Fariza Sabrina (School of Engineering and Technology, Central Queensland University, Australia)

Outline
1. Introduction – Outlier
2. An Issue with Extreme Value
3. Our Approach – MSD-Kmeans
4. Evaluation and Discussion
5. Summary

Introduction • What is an outlier? (source from [1], [2], [3])

Types of Outlier Global vs Local outliers ([4], [5], [6])

Existing State-of-the-arts
Ø Statistical based approaches
• MSD (Mean and Standard Deviation)
• Z-score
• MIQR (Median and Interquartile Range)
Ø Machine learning based approaches
• LOF (Local Outlier Factor)
• K-means

K-means Clustering
Ø K-means is one of the most popular clustering algorithms used to detect outliers.
[Figure: K-means clustering [7]]

Issue with Extreme Values of K-means
Ø The algorithm can be misled if there are clusters of highly different size or density.
Ø It is sensitive to noise and outlier data, and it works best for globular clusters. For example [8]:
• Increase intra-cluster distance
• Fail to detect local outliers

Our approach: MSD-Kmeans
Ø The goal of MSD-Kmeans: eliminate as many global outliers (extreme values) as possible to minimize their interference with efficient clustering by K-means.
Ø Make clustering more efficient by removing extreme values
v Two-phase approach
• 1st phase: remove extreme values so that the data is better normalized
• 2nd phase: cluster the normalized data to produce better clusters with more realistic outliers.

MSD-Kmeans:
v Phase 1: MSD algorithm for detecting global outliers
• MSD (Mean and Standard Deviation)
• The MSD algorithm computes the mean and standard deviation of the supplied dataset and compares each value against them; values that deviate from the mean by more than one standard deviation are considered global outliers [9].

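MSD Algorithm for Detecting Global Outliers
Let $x \in \{x_1, x_2, x_3, \ldots, x_N\}$, then
$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad \sigma = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(x_i - \mu)^2}$$
where $N$ is the number of data points in the dataset, $\mu$ is the mean value of the dataset, and $\sigma$ is the standard deviation.
$$x_{\text{global\_outliers}} = \begin{cases} x_{\text{upper}}, & x > \mu + \sigma \\ x_{\text{lower}}, & x < \mu - \sigma \end{cases}$$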
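The MSD rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the function name `msd_global_outliers` and the sample fares are hypothetical, and the sample standard deviation (N − 1 denominator) is assumed.

```python
from statistics import mean, stdev


def msd_global_outliers(values):
    """Split values into (inliers, outliers) with the mean +/- one std rule."""
    mu = mean(values)
    sigma = stdev(values)  # assumption: sample standard deviation (N - 1)
    lower, upper = mu - sigma, mu + sigma
    inliers = [x for x in values if lower <= x <= upper]
    outliers = [x for x in values if x < lower or x > upper]
    return inliers, outliers


# Hypothetical fares: six typical values and one extreme value.
fares = [50.0, 52.0, 51.5, 49.0, 53.0, 50.5, 250.0]
inliers, outliers = msd_global_outliers(fares)
print(outliers)
```

Because the extreme value inflates both the mean and the standard deviation, only that value falls outside the band here; on data without extreme values a one-sigma band would flag far more points.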

MSD-Kmeans:
v Phase 2: K-means algorithm for detecting local outliers
[Flowchart: Dataset → initial K clusters → calculating the centroids → distance between objects and centroid → grouping based on minimum distance → converged (no object moves group)? No: recalculate the centroids and repeat. Yes: outliers based on intra-cluster distance. If intra-cluster distance > threshold: outliers. If intra-cluster distance < threshold: normal data.]

MSD-Kmeans Outliers:
Ø This method was implemented in Python using the sklearn.cluster.KMeans class to group the dataset into K clusters (K = 2).
Ø The intra-cluster distance from the centroid to each data point in its cluster is calculated using the Euclidean distance:
$$d(p, q) = \sqrt{\sum_{i=1}^{n}(q_i - p_i)^2}$$
where $p$ is the centroid and $q$ is a data point in its cluster.
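As a small illustration of the distance step, the sketch below computes the Euclidean distance from a centroid to each point of its cluster using only the Python standard library (`math.dist`, Python 3.8+). The helper name `intra_cluster_distances`, the centroid, and the points are hypothetical; the slides use scikit-learn for the clustering itself.

```python
import math


def intra_cluster_distances(centroid, points):
    """Euclidean distance from the centroid p to every point q in its cluster."""
    return [math.dist(centroid, q) for q in points]


# Hypothetical 2-D cluster with its centroid at the origin.
centroid = (0.0, 0.0)
points = [(3.0, 4.0), (0.0, 1.0), (6.0, 8.0)]
distances = intra_cluster_distances(centroid, points)
print(distances)
```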

MSD-Kmeans Outliers:
Ø The threshold for the top-n outliers in each cluster
Ø The outlier threshold is calculated as the sum of the mean and 1.5 times the standard deviation of the intra-cluster distances.
$$x_{\text{local\_outlier}} = x_i > \mu_{\text{intra\_distance}} + 1.5 \cdot \sigma_{\text{intra\_distance}}$$
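The threshold rule can be illustrated as follows. This is a sketch under assumptions: the function name `local_outlier_flags` and the distances are hypothetical, and the sample standard deviation is assumed because the slides do not say which variant was used.

```python
from statistics import mean, stdev


def local_outlier_flags(distances, factor=1.5):
    """Flag distances above mean + factor * std of the intra-cluster distances."""
    threshold = mean(distances) + factor * stdev(distances)  # assumption: sample std
    return threshold, [d > threshold for d in distances]


# Hypothetical intra-cluster distances for one cluster.
distances = [1.0, 1.2, 0.9, 1.1, 1.0, 0.8, 1.3, 1.0, 6.0]
threshold, flags = local_outlier_flags(distances)
print(round(threshold, 2), flags)
```

Only the point at distance 6.0 exceeds the threshold in this example; the other eight stay below it.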

MSD-Kmeans pseudocode
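The two phases described on the earlier slides can be put together roughly as below. This is a dependency-free sketch on one-dimensional data, not the authors' pseudocode or implementation: the slides use sklearn.cluster.KMeans, whereas this example substitutes a plain Lloyd-style loop with a deterministic start, and all function names and data values are hypothetical.

```python
from statistics import mean, stdev


def msd_filter(values):
    """Phase 1: drop values more than one standard deviation from the mean."""
    mu, sigma = mean(values), stdev(values)
    kept = [x for x in values if abs(x - mu) <= sigma]
    removed = [x for x in values if abs(x - mu) > sigma]
    return kept, removed


def kmeans_1d(values, k=2, max_iter=100):
    """Plain Lloyd iteration on scalars (a stand-in for sklearn's KMeans)."""
    ordered = sorted(values)
    # Assumption: start the centroids at evenly spaced order statistics.
    centroids = [ordered[(2 * j + 1) * len(ordered) // (2 * k)] for j in range(k)]
    labels = []
    for _ in range(max_iter):
        labels = [min(range(k), key=lambda j: abs(x - centroids[j])) for x in values]
        updated = []
        for j in range(k):
            members = [x for x, lab in zip(values, labels) if lab == j]
            updated.append(mean(members) if members else centroids[j])
        if updated == centroids:  # converged: no centroid moved
            break
        centroids = updated
    return labels, centroids


def msd_kmeans(values, k=2, factor=1.5):
    """Return (global_outliers, local_outliers) for 1-D data."""
    kept, global_outliers = msd_filter(values)
    labels, centroids = kmeans_1d(kept, k)
    local_outliers = []
    for j in range(k):
        members = [x for x, lab in zip(kept, labels) if lab == j]
        if len(members) < 2:
            continue  # standard deviation is undefined for a single point
        dists = [abs(x - centroids[j]) for x in members]
        threshold = mean(dists) + factor * stdev(dists)
        local_outliers += [x for x, d in zip(members, dists) if d > threshold]
    return global_outliers, local_outliers


# Hypothetical data: two groups, one local outlier (14.0), one extreme value (200.0).
values = [10.0] * 7 + [14.0] + [29.0, 30.0, 31.0, 29.0, 30.0, 31.0, 30.0, 30.0] + [200.0]
global_outliers, local_outliers = msd_kmeans(values)
print(global_outliers, local_outliers)
```

Removing the extreme value first keeps it from pulling a centroid away from the bulk of the data, which is the motivation the slides give for the first phase.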

Experiment & Results
Ø Utilizing the NYC (New York City) Taxi Dataset
• A dataset of about 1.71 GB collected from registered taxis in NYC in January 2016.

Ø Experiment Setup
• Goal: to find greedy, dishonest drivers who may attempt to charge high fares by detouring.
• Parameter: Fare amount (from the Lower Manhattan suburb of SoHo to John F. Kennedy International (JFK) Airport).
[Maps: (a) Lower Manhattan (b) JFK Airport (source: Google Maps)]

MSD-Kmeans & K-means Fig. 2 MSD-Kmeans outlier detection without extreme value Fig. 1 K-means outlier detection with extreme value (11.14% outliers in total) (5.11% outliers in total)

Evaluation Approaches
Ø Our Approach
v We took a number of popular outlier detection algorithms from both the statistical and machine learning areas and compared the results.
v Comparing existing state-of-the-art methods
• Statistical methods – MSD, Z-score and MIQR
• Machine learning – K-means, LOF
v Comparing the following measures:
• TPR (True Positive Rate)
• FPR (False Positive Rate)
• Precision = $\frac{TP}{TP + FP}$
• Accuracy = $\frac{TP + TN}{TP + TN + FP + FN}$
• F-measure = $\frac{2TP}{2TP + FP + FN}$
v Comparing the execution time
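The measures above can be computed from the four confusion-matrix counts as in this sketch. The function name `outlier_metrics` and the counts are hypothetical; TPR and FPR are given by their standard definitions, TP / (TP + FN) and FP / (FP + TN), which the slide names but does not spell out.

```python
def outlier_metrics(tp, fp, tn, fn):
    """Evaluation measures computed from confusion-matrix counts."""
    return {
        "tpr": tp / (tp + fn),  # true positive rate (recall)
        "fpr": fp / (fp + tn),  # false positive rate
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "f_measure": 2 * tp / (2 * tp + fp + fn),
    }


# Hypothetical counts, not taken from the experiment.
metrics = outlier_metrics(tp=90, fp=10, tn=80, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```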

Performance Comparison of Outlier Detection Algorithms using NYC Taxi Dataset

| Outlier Detection Algorithm | TPR (%) | FPR (%) | Precision (%) | Accuracy (%) | Recall (%) | F-measure (%) | Execution Time (ms) |
|---|---|---|---|---|---|---|---|
| MSD | 99.9 | 24.2 | 96.6 | 96.9 | 99.9 | 98.2 | 21 |
| Z-score | 100 | 48.9 | 94.3 | 94.6 | 100 | 97.1 | 157 |
| MIQR* | 97.8 | 12.6 | 98.1 | 96.4 | 97.8 | 98.0 | 54 |
| K-means [10] | 99.7 | 55.6 | 93.5 | 93.7 | 99.7 | 96.9 | 1,132 |
| LOF* [11] | 98.2 | 79.3 | 26.1 | 38.0 | 98.2 | 43.1 | 31,483 |
| MSD-Kmeans | 98.5 | 11.6 | 98.6 | 97.4 | 98.5 | 98.6 | 824 |

* MIQR (Median and Interquartile Range Method)
* LOF (Local Outlier Factor)

Comparison
[Bar chart: Execution Time (ms) by algorithm. MSD 21, Z-score 157, MIQR* 54, K-means 1,132, LOF* 31,483, MSD-Kmeans 824]
[Bar chart: Accuracy (%) by algorithm. MSD 96.9, Z-score 94.6, MIQR* 96.4, K-means 93.7, LOF* 38, MSD-Kmeans 97.4]
* MIQR (Median and Interquartile Range Method)
* LOF (Local Outlier Factor)
The MSD-Kmeans paper with the NYC taxi dataset has been submitted to the ICONIP conference (A ranking).

Summary
Ø We propose a new outlier detection algorithm, MSD-Kmeans, which deals with extreme values better.
Ø We have evaluated it with IoT datasets, for example the NYC taxi data.
Ø We compared the performance indicators of MSD-Kmeans with those of other outlier detection algorithms (MSD, Z-score, MIQR, K-means and LOF); in this evaluation the proposed MSD-Kmeans algorithm achieved the highest accuracy.

Future Work
Ø Currently testing with different datasets
• Indoor air quality data generated by a real IoT application, SKOMOBO (SKOol MOnitoring BOx)
• Initial evaluation is done and the results are being written up as a journal manuscript
• Other datasets will also be used, such as those from the UCI Machine Learning Repository
Ø Multivariate analysis of variance
• Expand the study to the interdependence between several variables and its impact.

References:
[1]. Ved, M. (2018). Outlier Detection and Anomaly Detection with Machine Learning. Retrieved from https://medium.com/@mehulved1503/outlier-detection-and-anomaly-detection-with-machine-learning-caa96b34b7f6
[2]. Floydhub. (2019). Introduction to Anomaly Detection in Python [online forum comment]. Retrieved from https://blog.floydhub.com/introduction-to-anomaly-detection-in-python/
[3]. Related Guided Lesson. (2019). The Ugly Duckling. Retrieved from https://www.education.com/game/the-ugly-duckling/
[4]. Rushworth, A. (2019). The local outlier factor (LOF). Retrieved from https://campus.datacamp.com/courses/anomaly-detection-in-r/distance-and-density-based-anomaly-detection?ex=9
[5]. Packt. (2017). Machine Learning Review. Retrieved from https://hub.packtpub.com/machine-learning-review/
[6]. Jose, C. (2019). Anomaly Detection Techniques in Python. Retrieved from https://medium.com/learningdatascience/anomaly-detection-techniques-in-python-50f650c75aaf
[7]. Learn by Marketing. (2019). K-Means Clustering – What it is and How it works. Retrieved from http://www.learnbymarketing.com/methods/k-means-clustering/
[8]. Science & Technology Review. (2012). Finding and Fixing a Supercomputer’s Faults. Retrieved from https://str.llnl.gov/June12/desupinski.html
[9]. Prasad, Y. S., & Krishna, G. R. (2013). Statistical Anomaly Detection Technique for Real Time Datasets. International Journal of Computer Trends and Technology (IJCTT), 6(2), 89-94.
[10]. Duan, L., Xu, L., Liu, Y., & Lee, J. (2009). Cluster-based outlier detection. Annals of Operations Research, 168(1), 151-168.
[11]. Breunig, M. M., Kriegel, H. P., Ng, R. T., & Sander, J. (2000, May). LOF: identifying density-based local outliers. In ACM SIGMOD Record (Vol. 29, No. 2, pp. 93-104). ACM.

THANK YOU
