Clustering Analysis Basics

  1. Clustering Analysis Basics Ke Chen Reading: [Ch. 7, EA], [25.1, KPM] COMP24111 Machine Learning

  2. Outline
     • Introduction
     • Data Types and Representations
     • Distance Measures
     • Major Clustering Methodologies
     • Summary

  3. Introduction
     • Cluster: a collection (group) of data objects (points) that are
       – similar (or related) to one another within the same group
       – dissimilar (or unrelated) to the objects in other groups
     • Cluster analysis
       – finding similarities between data according to the characteristics underlying the data, and grouping similar data objects into clusters
     • Clustering analysis is unsupervised learning
       – no predefined classes for a training data set
       – two general tasks: identifying the “natural” number of clusters and properly grouping objects into “sensible” clusters
     • Typical applications
       – as a stand-alone tool to gain insight into the data distribution
       – as a preprocessing step for other algorithms in intelligent systems

  4. Introduction
     • Illustrative Example 1: how many clusters? (figure)

  5. Introduction
     • Illustrative Example 2: are they in the same cluster?
       – Clustering criterion: how animals bear their progeny. Two clusters:
         1. Blue shark, sheep, cat, dog
         2. Lizard, sparrow, viper, seagull, gold fish, frog, red mullet
       – Clustering criterion: existence of lungs. Two clusters:
         1. Gold fish, red mullet, blue shark
         2. Sheep, sparrow, dog, cat, seagull, lizard, frog, viper

  6. Introduction
     • Real Applications: Google News (figure)

  7. Introduction
     • Real Applications: Genetics Analysis (figure)

  8. Introduction
     • Real Applications: Emerging Applications (figure)

  9. Introduction
     • A technique demanded by many real-world tasks
       – Bank/Internet security: fraud/spam pattern discovery
       – Biology: taxonomy of living things such as kingdom, phylum, class, order, family, genus and species
       – City planning: identifying groups of houses according to their house type, value, and geographical location
       – Climate change: understanding the Earth's climate by finding atmospheric and ocean patterns
       – Finance: stock clustering analysis to uncover correlations underlying shares
       – Image compression/segmentation: grouping coherent pixels
       – Information retrieval/organisation: Google search, topic-based news
       – Land use: identifying areas of similar land use in an earth observation database
       – Marketing: helping marketers discover distinct groups in their customer bases, and then using this knowledge to develop targeted marketing programs
       – Social network mining: automatic discovery of special interest groups

  10. Quiz

  11. Data Types and Representations
      • Discrete vs. Continuous
        – Discrete feature
          • Has only a finite set of values, e.g., zip codes, rank, or the set of words in a collection of documents
          • Sometimes represented as an integer variable
        – Continuous feature
          • Has real numbers as feature values, e.g., temperature, height, or weight
          • In practice, real values can only be measured and represented using a finite number of digits
          • Continuous features are typically represented as floating-point variables

  12. Data Types and Representations
      • Data representations
        – Data matrix (object-by-feature structure): n data points (objects) with p dimensions (features). Two modes: rows and columns represent different entities.

          $$
          \begin{bmatrix}
          x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
          \vdots & \ddots & \vdots & \ddots & \vdots \\
          x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
          \vdots & \ddots & \vdots & \ddots & \vdots \\
          x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
          \end{bmatrix}
          $$

        – Distance/dissimilarity matrix (object-by-object structure): n data points, but registers only the distance. A symmetric/triangular matrix. Single mode: rows and columns represent the same entity (distance).

          $$
          \begin{bmatrix}
          0 \\
          d(2,1) & 0 \\
          d(3,1) & d(3,2) & 0 \\
          \vdots & \vdots & \vdots & \ddots \\
          d(n,1) & d(n,2) & \cdots & \cdots & 0
          \end{bmatrix}
          $$

  13. Data Types and Representations
      • Examples (figure: the four points plotted with x from 0 to 6 and y from 0 to 3)

        Data Matrix
        point    x    y
        p1       0    2
        p2       2    0
        p3       3    1
        p4       5    1

        Distance Matrix (i.e., Dissimilarity Matrix) for Euclidean Distance
                 p1       p2       p3       p4
        p1       0        2.828    3.162    5.099
        p2       2.828    0        1.414    3.162
        p3       3.162    1.414    0        2
        p4       5.099    3.162    2        0
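The step from the data matrix to the distance matrix in this example can be sketched in a few lines of Python. This is only an illustration, not part of the original slides: the names `points` and `distance_matrix` are assumptions, and `math.dist` requires Python 3.8 or later.

```python
import math

# Data matrix from the example: one row per point, columns are (x, y).
points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}


def distance_matrix(data):
    """Return the full symmetric matrix of pairwise Euclidean distances."""
    rows = list(data.values())
    return [[math.dist(a, b) for b in rows] for a in rows]


D = distance_matrix(points)

# Print one row per point, rounded to three decimals as on the slide.
for name, row in zip(points, D):
    print(name, [round(d, 3) for d in row])
```

The full square matrix is stored here for simplicity; since the matrix is symmetric with a zero diagonal, keeping only the lower triangle would carry the same information.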

  14. Distance Measures
      • Minkowski Distance (http://en.wikipedia.org/wiki/Minkowski_distance)
        For $x = (x_1, x_2, \ldots, x_n)$ and $y = (y_1, y_2, \ldots, y_n)$:

        $$ d(x, y) = \left( |x_1 - y_1|^p + |x_2 - y_2|^p + \cdots + |x_n - y_n|^p \right)^{1/p}, \quad p > 0 $$

        – p = 1: Manhattan (city block) distance

          $$ d(x, y) = |x_1 - y_1| + |x_2 - y_2| + \cdots + |x_n - y_n| $$

        – p = 2: Euclidean distance

          $$ d(x, y) = \sqrt{|x_1 - y_1|^2 + |x_2 - y_2|^2 + \cdots + |x_n - y_n|^2} $$

        – Do not confuse p with n: all these distances are defined over all n features (dimensions).
        – A generic measure: use an appropriate p for different applications.
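A minimal sketch of the Minkowski distance as defined above, written in plain Python. The function name `minkowski` and the two sample points are illustrative assumptions; the points are p1 and p2 from the worked example in these slides.

```python
def minkowski(x, y, p):
    """Minkowski distance of order p between two equal-length vectors."""
    if p <= 0:
        raise ValueError("p must be greater than 0")
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of features")
    # Sum |x_i - y_i|^p over all n features, then take the p-th root.
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


p1, p2 = (0, 2), (2, 0)
print(minkowski(p1, p2, 1))  # p = 1: Manhattan (city block) distance
print(minkowski(p1, p2, 2))  # p = 2: Euclidean distance
```

Note that `p` only selects the order of the measure; the sum always runs over all n features of the two vectors.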

  15. Distance Measures
      • Example: Manhattan and Euclidean distances (figure: the four points plotted with x from 0 to 6 and y from 0 to 3)

        Data Matrix
        point    x    y
        p1       0    2
        p2       2    0
        p3       3    1
        p4       5    1

        Distance Matrix for Manhattan Distance
        L1       p1    p2    p3    p4
        p1       0     4     4     6
        p2       4     0     2     4
        p3       4     2     0     2
        p4       6     4     2     0

        Distance Matrix for Euclidean Distance
        L2       p1       p2       p3       p4
        p1       0        2.828    3.162    5.099
        p2       2.828    0        1.414    3.162
        p3       3.162    1.414    0        2
        p4       5.099    3.162    2        0

  16. Distance Measures
      • Cosine Measure (Similarity vs. Distance)
        For $x = (x_1, x_2, \ldots, x_n)$ and $y = (y_1, y_2, \ldots, y_n)$:

        $$ \cos(x, y) = \frac{x_1 y_1 + \cdots + x_n y_n}{\sqrt{x_1^2 + \cdots + x_n^2} \, \sqrt{y_1^2 + \cdots + y_n^2}} $$

        $$ d(x, y) = 1 - \cos(x, y) $$

        – Property: $0 \le d(x, y) \le 2$
        – Nonmetric vector objects: keywords in documents, gene features in micro-arrays, …
        – Applications: information retrieval, biological taxonomy, ...

  17. Distance Measures
      • Example: Cosine measure

        $$ x_1 = (3, 2, 0, 5, 2, 0, 0), \quad x_2 = (1, 0, 0, 0, 1, 0, 2) $$

        $$ 3 \times 1 + 2 \times 0 + 0 \times 0 + 5 \times 0 + 2 \times 1 + 0 \times 0 + 0 \times 2 = 5 $$

        $$ \sqrt{3^2 + 2^2 + 0^2 + 5^2 + 2^2 + 0^2 + 0^2} = \sqrt{42} \approx 6.48 $$

        $$ \sqrt{1^2 + 0^2 + 0^2 + 0^2 + 1^2 + 0^2 + 2^2} = \sqrt{6} \approx 2.45 $$

        $$ \cos(x_1, x_2) = \frac{5}{6.48 \times 2.45} \approx 0.315 $$

        $$ d(x_1, x_2) = 1 - \cos(x_1, x_2) = 1 - 0.315 = 0.685 $$
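The arithmetic in this example can be checked with a short Python sketch. The function names `cosine_similarity` and `cosine_distance` are illustrative assumptions, and rejecting zero vectors is a choice made here because the slides do not say how to treat them. At full precision the similarity comes out at about 0.315 and the distance at about 0.685.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        # Assumption: the measure is left undefined for zero vectors.
        raise ValueError("cosine measure is undefined for a zero vector")
    return dot / (norm_x * norm_y)


def cosine_distance(x, y):
    """Distance derived from the cosine measure: d = 1 - cos."""
    return 1.0 - cosine_similarity(x, y)


x1 = (3, 2, 0, 5, 2, 0, 0)
x2 = (1, 0, 0, 0, 1, 0, 2)
print(round(cosine_similarity(x1, x2), 3))
print(round(cosine_distance(x1, x2), 3))
```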

  18. Distance Measures
      • Distance for Binary Features
        – For binary features, the values can be converted into 1 or 0.
        – Contingency table for binary feature vectors x and y:

                     y = 1    y = 0
          x = 1      a        b
          x = 0      c        d

          a: number of features that equal 1 for both x and y
          b: number of features that equal 1 for x but 0 for y
          c: number of features that equal 0 for x but 1 for y
          d: number of features that equal 0 for both x and y

  19. Distance Measures
      • Distance for Binary Features
        – Distance for symmetric binary features: both states are equally valuable and carry the same weight, i.e., there is no preference on which outcome should be coded as 1 or 0 (e.g., gender).

          $$ d(x, y) = \frac{b + c}{a + b + c + d} $$

        – Distance for asymmetric binary features: the outcomes of the states are not equally important, e.g., the positive and negative outcomes of a disease test; the rarest one is set to 1 and the other to 0.

          $$ d(x, y) = \frac{b + c}{a + b + c} $$
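The two formulas differ only in whether the joint absences (the count d) enter the denominator, which a small Python sketch makes concrete. The function names and the two sample vectors are hypothetical, and returning 0.0 when a + b + c = 0 is an assumption made here, since the slides do not cover that case.

```python
def contingency(x, y):
    """Return the counts (a, b, c, d) for two binary (0/1) feature vectors."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)
    return a, b, c, d


def symmetric_binary_distance(x, y):
    """(b + c) / (a + b + c + d): both states carry the same weight."""
    a, b, c, d = contingency(x, y)
    return (b + c) / (a + b + c + d)


def asymmetric_binary_distance(x, y):
    """(b + c) / (a + b + c): joint absences (d) are ignored."""
    a, b, c, _ = contingency(x, y)
    if a + b + c == 0:
        return 0.0  # assumption: two all-zero vectors are treated as identical
    return (b + c) / (a + b + c)


# Hypothetical vectors chosen so that a = b = c = d = 1.
x = (1, 1, 0, 0)
y = (1, 0, 1, 0)
print(contingency(x, y))
print(symmetric_binary_distance(x, y))
print(asymmetric_binary_distance(x, y))
```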

  20. Distance Measures
      • Example: Distance for binary features

        Name    Gender    Fever    Cough    Test-1    Test-2    Test-3    Test-4
        Jack    M         Y        N        P         N         N         N
        Mary    F         Y        N        P         N         P         N
        Jim     M         Y        P        N         N         N         N

        (“Y”: yes, “P”: positive, “N”: negative)

        – gender is a symmetric feature (less important)
        – the remaining features are asymmetric binary
        – set the values “Y” and “P” to 1, and the value “N” to 0

        Name    Fever    Cough    Test-1    Test-2    Test-3    Test-4
        Jack    1        0        1         0         0         0
        Mary    1        0        1         0         1         0
        Jim     1        1        0         0         0         0

        $$ d(\text{Jack}, \text{Mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33 $$

        $$ d(\text{Jack}, \text{Jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67 $$

        $$ d(\text{Jim}, \text{Mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75 $$
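The patient example can be reproduced with the following Python sketch, which encodes “Y” and “P” as 1 and “N” as 0, leaves gender out, and applies the asymmetric binary distance. The names `ENCODE`, `patients`, `encode`, and `asymmetric_distance` are illustrative assumptions rather than anything defined in the slides.

```python
# "Y" (yes) and "P" (positive) map to 1; "N" (negative) maps to 0.
ENCODE = {"Y": 1, "P": 1, "N": 0}

# Asymmetric features only, in the order
# Fever, Cough, Test-1, Test-2, Test-3, Test-4 (gender is left out).
patients = {
    "Jack": ["Y", "N", "P", "N", "N", "N"],
    "Mary": ["Y", "N", "P", "N", "P", "N"],
    "Jim":  ["Y", "P", "N", "N", "N", "N"],
}


def encode(values):
    """Convert a list of Y/P/N labels into a 0/1 vector."""
    return [ENCODE[v] for v in values]


def asymmetric_distance(x, y):
    """(b + c) / (a + b + c) for two 0/1 vectors sharing at least one 1."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    return (b + c) / (a + b + c)


vectors = {name: encode(values) for name, values in patients.items()}
for first, second in [("Jack", "Mary"), ("Jack", "Jim"), ("Jim", "Mary")]:
    dist = asymmetric_distance(vectors[first], vectors[second])
    print(f"d({first}, {second}) = {dist:.2f}")
```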
