
MSD-Kmeans: A Novel Algorithm for Efficient Detection of Global and Local Outliers



  1. MSD-Kmeans: A Novel Algorithm for Efficient Detection of Global and Local Outliers
     - Yuanyuan Wei, Julian Jang-Jaccard, Ruili Wang: School of Natural & Computational Science, Massey University, New Zealand
     - Fariza Sabrina: School of Engineering and Technology, Central Queensland University, Australia

  2. Outline
     1. Introduction: Outlier
     2. An Issue with Extreme Value
     3. Our Approach: MSD-Kmeans
     4. Evaluation and Discussion
     5. Summary

  3. Introduction: What is an outlier? (sources: [1], [2], [3])

  4. Types of Outlier: Global vs. local outliers ([4], [5], [6])

  5. Existing State-of-the-Art Approaches
     - Statistical approaches: MSD (Mean and Standard Deviation), Z-score, MIQR (Median and Interquartile Range)
     - Machine learning approaches: LOF (Local Outlier Factor), K-means

  6. K-means Clustering
     - K-means is one of the most popular clustering algorithms used to detect outliers.
     - Figure: K-means clustering [7]

  7. Issue with Extreme Values in K-means
     - The algorithm can be misled if clusters differ greatly in size or density.
     - It is sensitive to noise and outliers, and works best for globular clusters. For example [8], extreme values:
       - increase the intra-cluster distance
       - cause the algorithm to miss local outliers

  8. Our approach: MSD-Kmeans
     - Goal: eliminate as many global outliers (extreme values) as possible to minimize their interference with K-means clustering.
     - Two-phase approach for efficient clustering:
       - Phase 1: remove extreme values so that the data is better normalized.
       - Phase 2: cluster the normalized data to produce better clusters with more realistic outliers.

  9. MSD-Kmeans, Phase 1: MSD algorithm for detecting global outliers
     - MSD (Mean and Standard Deviation): the algorithm computes the mean and standard deviation of the supplied dataset and compares each value against them; values deviating from the mean by more than one standard deviation are considered global outliers [9].

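  10. MSD Algorithm for Detecting Global Outliers
      Let $X \in \{X_1, X_2, X_3, \ldots, X_N\}$, then

      $$\mu = \frac{1}{N} \sum_{i=1}^{N} X_i, \qquad \sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N - 1}}$$

      where $N$ is the number of data points, $\mu$ is the mean value of the dataset, and $\sigma$ is the standard deviation.

      $$X_{\text{global\_outliers}} = \begin{cases} X_{\text{upper}}, & X > \mu + \sigma \\ X_{\text{lower}}, & X < \mu - \sigma \end{cases}$$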
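
The MSD rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `msd_split` and the sample fares are hypothetical, and the sample standard deviation (divisor N - 1) is assumed, as in the formula above.

```python
import statistics


def msd_split(values):
    """Split values into (kept, global_outliers) using the mean +/- one std rule."""
    mu = statistics.mean(values)
    # Assumption: sample standard deviation (divisor N - 1).
    sigma = statistics.stdev(values)
    lower, upper = mu - sigma, mu + sigma

    kept = [x for x in values if lower <= x <= upper]
    global_outliers = [x for x in values if x < lower or x > upper]
    return kept, global_outliers


# Hypothetical fare amounts with one extreme value.
fares = [50, 52, 51, 49, 53, 50, 51, 200]
kept, global_outliers = msd_split(fares)
print(global_outliers)
```

Only the values in `kept` would be passed on to the clustering phase.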

  11. MSD-Kmeans, Phase 2: K-means algorithm for detecting local outliers (flowchart)
      1. Start from the dataset and choose the initial K clusters.
      2. Calculate the centroids.
      3. Compute the distance between each object and the centroids.
      4. Group the objects based on minimum distance.
      5. Converged (no object moves group)? If no, return to step 2; if yes, continue.
      6. Detect outliers based on intra-cluster distance: if the intra-cluster distance is greater than the threshold, the point is an outlier; if it is less than the threshold, the point is normal data.

  12. MSD-Kmeans Outliers:
      - The method was implemented in Python using the sklearn.cluster.KMeans class to group the dataset into K clusters (K = 2).
      - The intra-cluster distance from the centroid to each data point in its cluster is calculated using the Euclidean distance:

        $$d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}$$

        where $p$ is the centroid and $q$ is a data point in its cluster.

  13. MSD-Kmeans Outliers:
      - Threshold for the top-n outliers in each cluster.
      - The outlier threshold is the mean of the intra-cluster distances plus 1.5 times their standard deviation:

        $$X_{\text{local\_outlier}} = X_i > \mu_{\text{intra\_distance}} + 1.5 \cdot \sigma_{\text{intra\_distance}}$$
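
The thresholding step can be sketched as follows, using only the standard library. This is an illustration under stated assumptions rather than the authors' code: the function name `local_outlier_flags` is hypothetical, cluster labels and centroids are taken as inputs, the threshold is computed separately per cluster, and the population standard deviation is assumed because the slides do not say which variant was used.

```python
import math
import statistics
from collections import defaultdict


def local_outlier_flags(points, labels, centroids, k_sigma=1.5):
    """Return one boolean per point: True if it is a local outlier in its cluster."""
    # Euclidean distance from each point to the centroid of its own cluster.
    distances = [math.dist(p, centroids[c]) for p, c in zip(points, labels)]

    # Collect the intra-cluster distances of each cluster.
    by_cluster = defaultdict(list)
    for d, c in zip(distances, labels):
        by_cluster[c].append(d)

    # Threshold per cluster: mean + k_sigma * std of the intra-cluster distances.
    # Assumption: population standard deviation (pstdev).
    thresholds = {
        c: statistics.mean(ds) + k_sigma * statistics.pstdev(ds)
        for c, ds in by_cluster.items()
    }
    return [d > thresholds[c] for d, c in zip(distances, labels)]


# Hypothetical 1-D data: one cluster near 10 (with a stray 13.0), one near 50.
points = [(9.8,), (9.9,), (10.0,), (10.0,), (10.1,), (13.0,), (49.9,), (50.0,), (50.1,)]
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1]
centroids = [(10.4666666667,), (50.0,)]
print(local_outlier_flags(points, labels, centroids))
```

With scikit-learn, `labels` and `centroids` would correspond to the `labels_` and `cluster_centers_` attributes of a fitted `KMeans` model.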

  14. MSD-Kmeans pseudocode

  15. Experiment & Results
      - Dataset: NYC (New York City) taxi dataset, about 1.71 GB of data collected from registered taxis in NYC in January 2016.

  16. Experiment Setup
      - Goal: identify dishonest drivers who may attempt to charge high fares by taking detours.
      - Parameter: fare amount (from the Lower Manhattan neighborhood of SoHo to John F. Kennedy International (JFK) Airport).
      - Figure: (a) Lower Manhattan; (b) JFK Airport (source: Google Maps)

  17. MSD-Kmeans & K-means
      - Fig. 1: K-means outlier detection with extreme values
      - Fig. 2: MSD-Kmeans outlier detection without extreme values
      - Outlier totals shown on the slide: 11.14% and 5.11%

  18. Evaluation Approaches
      - We took a number of popular outlier detection algorithms from both the statistical and machine learning areas and compared the results.
      - Existing state-of-the-art methods compared:
        - Statistical methods: MSD, Z-score and MIQR
        - Machine learning: K-means, LOF
      - Properties measured:
        - TPR (True Positive Rate)
        - FPR (False Positive Rate)
        - Precision = $\frac{TP}{TP + FP}$
        - Accuracy = $\frac{TP + TN}{TP + TN + FP + FN}$
        - F-measure = $\frac{2TP}{2TP + FP + FN}$
      - Execution time was also compared.
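
For reference, the measures listed above can be computed from confusion-matrix counts as in this short sketch. The function name and the counts are hypothetical; the TPR and FPR formulas are the standard definitions, which the slide names but does not spell out.

```python
def detection_metrics(tp, fp, tn, fn):
    """Compute the evaluation measures from confusion-matrix counts."""
    return {
        # Standard definitions (not spelled out on the slide).
        "tpr": tp / (tp + fn),
        "fpr": fp / (fp + tn),
        # Formulas as given on the slide.
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "f_measure": 2 * tp / (2 * tp + fp + fn),
    }


# Hypothetical counts, for illustration only.
metrics = detection_metrics(tp=90, fp=10, tn=80, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

TPR is the same quantity as recall, which is why the TPR and Recall columns in the comparison table carry identical values.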

  19. Performance Comparison of Outlier Detection Algorithms using NYC Taxi Dataset

      | Outlier Detection Algorithm | TPR (%) | FPR (%) | Precision (%) | Accuracy (%) | Recall (%) | F-measure (%) | Execution Time (ms) |
      |---|---|---|---|---|---|---|---|
      | MSD | 99.9 | 24.2 | 96.6 | 96.9 | 99.9 | 98.2 | 21 |
      | Z-score | 100 | 48.9 | 94.3 | 94.6 | 100 | 97.1 | 157 |
      | MIQR* | 97.8 | 12.6 | 98.1 | 96.4 | 97.8 | 98.0 | 54 |
      | K-means [10] | 99.7 | 55.6 | 93.5 | 93.7 | 99.7 | 96.9 | 1,132 |
      | LOF* [11] | 98.2 | 79.3 | 26.1 | 38.0 | 98.2 | 43.1 | 31,483 |
      | MSD-Kmeans | 98.5 | 11.6 | 98.6 | 97.4 | 98.5 | 98.6 | 824 |

      * MIQR (Median and Interquartile Range Method)
      * LOF (Local Outlier Factor)

  20. Comparison (two bar charts across the different algorithms: execution time in ms and accuracy in %)
      - Execution time (ms): MSD 21, Z-score 157, MIQR* 54, K-means 1,132, LOF* 31,483, MSD-Kmeans 824
      - Accuracy (%): MSD 96.9, Z-score 94.6, MIQR* 96.4, K-means 93.7, LOF* 38, MSD-Kmeans 97.4

      * MIQR (Median and Interquartile Range Method)
      * LOF (Local Outlier Factor)

      The MSD-Kmeans paper using the NYC taxi dataset has been submitted to the ICONIP conference (A ranking).

  21. Summary
      - We propose a new outlier detection algorithm, MSD-Kmeans, which deals with extreme values better.
      - We have evaluated it with IoT datasets, for example the NYC taxi data.
      - We compared the performance indicators of MSD-Kmeans with those of other outlier detection algorithms (MSD, Z-score, MIQR, K-means and LOF) and showed that the proposed MSD-Kmeans algorithm achieved the highest accuracy.

  22. Future Work
      - Currently testing with different datasets:
        - Indoor air quality data generated by a real IoT application, SKOMOBO (SKOol MOnitoring BOx).
        - Initial evaluation is done and the results are being written up as a journal manuscript.
        - Other datasets will also be used, such as those from the UCI Machine Learning Repository.
      - Multivariate analysis of variance:
        - Expand the study to the interdependence between several variables and its impact.

  23. References:
      [1] Ved, M. (2018). Outlier Detection and Anomaly Detection with Machine Learning. Retrieved from https://medium.com/@mehulved1503/outlier-detection-and-anomaly-detection-with-machine-learning-caa96b34b7f6
      [2] Floydhub. (2019). Introduction to Anomaly Detection in Python [online forum comment]. Retrieved from https://blog.floydhub.com/introduction-to-anomaly-detection-in-python/
      [3] Related Guided Lesson. (2019). The Ugly Duckling. Retrieved from https://www.education.com/game/the-ugly-duckling/
      [4] Rushworth, A. (2019). The local outlier factor (LOF). Retrieved from https://campus.datacamp.com/courses/anomaly-detection-in-r/distance-and-density-based-anomaly-detection?ex=9
      [5] Packt. (2017). Machine Learning Review. Retrieved from https://hub.packtpub.com/machine-learning-review/
      [6] Jose, C. (2019). Anomaly Detection Techniques in Python. Retrieved from https://medium.com/learningdatascience/anomaly-detection-techniques-in-python-50f650c75aaf
      [7] Learn by Marketing. (2019). K-Means Clustering: What it is and How it works. Retrieved from http://www.learnbymarketing.com/methods/k-means-clustering/
      [8] Science & Technology Review. (2012). Finding and Fixing a Supercomputer's Faults. Retrieved from https://str.llnl.gov/June12/desupinski.html
      [9] Prasad, Y. S., & Krishna, G. R. (2013). Statistical Anomaly Detection Technique for Real Time Datasets. International Journal of Computer Trends and Technology (IJCTT), 6(2), 89-94.
      [10] Duan, L., Xu, L., Liu, Y., & Lee, J. (2009). Cluster-based outlier detection. Annals of Operations Research, 168(1), 151-168.
      [11] Breunig, M. M., Kriegel, H. P., Ng, R. T., & Sander, J. (2000, May). LOF: identifying density-based local outliers. In ACM SIGMOD Record (Vol. 29, No. 2, pp. 93-104). ACM.

  24. Thank you
