Machine Learning: Algorithms and Applications
Floriano Zini, Free University of Bozen-Bolzano
Lecture 10 (14 May 2012): Unsupervised Learning (cont.)


  1. Machine Learning: Algorithms and Applications
     Floriano Zini, Free University of Bozen-Bolzano, Faculty of Computer Science
     Academic Year 2011-2012
     Lecture 10: 14 May 2012, Unsupervised Learning (cont.)
     Slides courtesy of Bing Liu: www.cs.uic.edu/~liub/WebMiningBook.html

  2. Road map
     - Basic concepts
     - K-means algorithm
     - Representation of clusters
     - Hierarchical clustering
     - Distance functions
     - Data standardization
     - Handling mixed attributes
     - Which clustering algorithm to use?
     - Cluster evaluation
     - Summary

     Hierarchical Clustering
     - Produces a nested sequence of clusters, a tree, also called a dendrogram
     - Singleton clusters are at the bottom of the tree
     - One root cluster covers all the data points
     - Sibling clusters partition the data points of their common parent

  3. Types of hierarchical clustering
     - Agglomerative (bottom-up) clustering: builds the dendrogram (tree) from the bottom level; it
       - merges the most similar (or nearest) pair of clusters
       - stops when all the data points are merged into a single cluster (i.e., the root cluster)
     - Divisive (top-down) clustering: starts with all data points in one cluster, the root; it
       - splits the root into a set of child clusters
       - recursively divides each child cluster further
       - stops when only singleton clusters of individual data points remain

     Agglomerative clustering
     It is more popular than divisive methods.
     - At the beginning, each data point forms a cluster (also called a node)
     - Merge the nodes/clusters that have the least distance
     - Keep merging
     - Eventually all nodes belong to one cluster

  4. Agglomerative clustering algorithm (shown as a figure in the slides)
     An example: working of the algorithm (figure)
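The algorithm itself appears only as a figure in the slides. Below is a minimal illustrative sketch of agglomerative clustering with single-link merging, written from the description above; the function and variable names (agglomerative, single_link, num_clusters) are my own, not from the lecture.

```python
import numpy as np

def agglomerative(points, num_clusters=1):
    """Merge the closest pair of clusters until num_clusters remain."""
    # Start with each data point as its own singleton cluster (node).
    clusters = [[i] for i in range(len(points))]

    def single_link(c1, c2):
        # Single-link distance: closest pair of points, one from each cluster.
        return min(np.linalg.norm(points[i] - points[j]) for i in c1 for j in c2)

    while len(clusters) > num_clusters:
        # Find the pair of clusters with the least distance and merge them.
        a, b = min(((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
                   key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Example usage with four 2-D points forming two obvious groups
data = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
print(agglomerative(data, num_clusters=2))  # [[0, 1], [2, 3]]
```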

  5. Measuring the distance of two clusters
     - There are a few ways to measure the distance between two clusters
       - k-means uses only the distances between centroids
     - Different variations of the algorithm:
       - Single link
       - Complete link
       - Average link
       - Centroids
       - ...

     Single link method
     - The distance between two clusters is the distance between the two closest data points in the two clusters
       - one data point from each cluster
     - It can find arbitrarily shaped clusters, but
       - it may cause the undesirable "chain effect" due to noisy points
     - (Figure: the two natural clusters, in red, are not found because of the noisy points, in black)

  6. Complete link method
     - The distance between two clusters is the distance between the two furthest data points in the two clusters
     - It is sensitive to outliers (in black) because they are far away
     - It usually produces better clusters than the single-link method

     Average link and centroid methods
     Average link method
     - A compromise between
       - the sensitivity of complete-link clustering to outliers, and
       - the tendency of single-link clustering to form long chains that do not correspond to the intuitive notion of clusters as compact, spherical objects
     - The distance between two clusters is the average of all pair-wise distances between the data points in the two clusters
     Centroid method
     - The distance between two clusters is the distance between their centroids
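For experimentation, these linkage variants are also available in SciPy's scipy.cluster.hierarchy module; this is just one common way to try them out, not a tool referenced in the lecture, and the synthetic data below is made up for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two small blobs of 2-D points.
data = np.vstack([rng.normal(0, 0.3, (10, 2)), rng.normal(3, 0.3, (10, 2))])

# Build the dendrogram with different cluster-distance definitions.
for method in ("single", "complete", "average", "centroid"):
    Z = linkage(data, method=method)                  # (n-1) x 4 merge history
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
    print(method, labels)
```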

  7. The complexity
     - All hierarchical algorithms are at least O(n^2)
       - n is the number of data points
     - Single link can be done in O(n^2)
     - Complete and average link can be done in O(n^2 log n)
     - Due to the complexity, hierarchical algorithms are hard to use for large data sets
       - Perform hierarchical clustering on a sample of the data points and then assign the others by distance or by supervised learning (see lecture 9); a small sketch of this idea follows this slide
       - Use scale-up methods (e.g., BIRCH) that find many small clusters using an efficient algorithm
         - use these clusters as the starting nodes for the hierarchical clustering

     Road map
     - Basic concepts
     - K-means algorithm
     - Representation of clusters
     - Hierarchical clustering
     - Distance functions
     - Data standardization
     - Handling mixed attributes
     - Which clustering algorithm to use?
     - Cluster evaluation
     - Summary
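A rough sketch of the sampling idea: cluster a small sample hierarchically, then assign every point to the nearest sample cluster by centroid distance. The helper name cluster_large_dataset and all parameter values are illustrative assumptions, not from the slides.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_large_dataset(points, sample_size=200, k=3, seed=0):
    """Hierarchically cluster a sample, then assign all points to the nearest sample-cluster centroid."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(points), size=min(sample_size, len(points)), replace=False)
    sample = points[idx]

    # The O(m^2) hierarchical clustering is affordable on the small sample.
    Z = linkage(sample, method="average")
    sample_labels = fcluster(Z, t=k, criterion="maxclust")

    # Assign every point (sampled or not) to the centroid of the nearest sample cluster.
    centroids = np.array([sample[sample_labels == c].mean(axis=0) for c in range(1, k + 1)])
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

# Example usage with synthetic data
data = np.vstack([np.random.normal(0, 1, (500, 2)), np.random.normal(6, 1, (500, 2))])
labels = cluster_large_dataset(data, sample_size=100, k=2)
print(np.bincount(labels))
```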

  8. Distance functions
     - Key to clustering
       - "similarity" and "dissimilarity" are other commonly used terms
     - There are numerous distance functions for
       - different types of data
         - numeric data
         - nominal data
         - ...
       - different specific applications

     Distance functions for numeric attributes
     - We denote the distance between two data points (vectors) xi and xj by dist(xi, xj)
     - The most commonly used functions are
       - Euclidean distance, and
       - Manhattan (city block) distance
     - They are special cases of the Minkowski distance:
       dist(xi, xj) = (|xi1 - xj1|^h + |xi2 - xj2|^h + ... + |xir - xjr|^h)^(1/h)
       where h is a positive integer and r is the number of attributes
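A direct transcription of the Minkowski formula into code, assuming the data points are NumPy arrays:

```python
import numpy as np

def minkowski(xi, xj, h):
    """Minkowski distance: (sum_l |xi_l - xj_l|^h)^(1/h)."""
    return np.sum(np.abs(xi - xj) ** h) ** (1.0 / h)

xi = np.array([1.0, 2.0, 3.0])
xj = np.array([4.0, 0.0, 3.0])
print(minkowski(xi, xj, 2))  # Euclidean: sqrt(9 + 4 + 0) = 3.605...
print(minkowski(xi, xj, 1))  # Manhattan: 3 + 2 + 0 = 5.0
```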

  9. Euclidean distance and Manhattan distance
     - If h = 2, it is the Euclidean distance:
       dist(xi, xj) = sqrt((xi1 - xj1)^2 + (xi2 - xj2)^2 + ... + (xir - xjr)^2)
     - If h = 1, it is the Manhattan distance:
       dist(xi, xj) = |xi1 - xj1| + |xi2 - xj2| + ... + |xir - xjr|
     - Weighted Euclidean distance:
       dist(xi, xj) = sqrt(w1(xi1 - xj1)^2 + w2(xi2 - xj2)^2 + ... + wr(xir - xjr)^2)

     Squared distance and Chebychev distance
     - Squared Euclidean distance: places progressively greater weight on data points that are further apart
       dist(xi, xj) = (xi1 - xj1)^2 + (xi2 - xj2)^2 + ... + (xir - xjr)^2
     - Chebychev distance: used when one wants to define two data points as "different" if they differ on any one of the attributes
       dist(xi, xj) = max(|xi1 - xj1|, |xi2 - xj2|, ..., |xir - xjr|)
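A short sketch of the weighted Euclidean and Chebychev variants, following the formulas above; the weight values are arbitrary illustrative choices:

```python
import numpy as np

def weighted_euclidean(xi, xj, w):
    """sqrt(sum_l w_l * (xi_l - xj_l)^2)"""
    return np.sqrt(np.sum(w * (xi - xj) ** 2))

def chebychev(xi, xj):
    """max_l |xi_l - xj_l|: two points are 'different' if any single attribute differs."""
    return np.max(np.abs(xi - xj))

xi, xj = np.array([0.1, 20.0]), np.array([0.9, 720.0])
print(weighted_euclidean(xi, xj, w=np.array([1.0, 0.001])))  # down-weight the large-range attribute
print(chebychev(xi, xj))                                     # 700.0
```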

  10. Distance functions for binary and nominal attributes
      - Binary attribute: has two values or states but no ordering relationship
        - e.g., Gender: female and male
        - the 2 values are conventionally represented by 1 and 0
      - We use a confusion matrix to introduce the distance functions/measures
      - Let the i-th and j-th data points be xi and xj (vectors)

      Confusion matrix (counts over the r binary attributes):
      - a: number of attributes with value 1 in both xi and xj
      - b: number of attributes with value 1 in xi and 0 in xj
      - c: number of attributes with value 0 in xi and 1 in xj
      - d: number of attributes with value 0 in both xi and xj

  11. Symmetric binary attributes
      - A binary attribute is symmetric if both of its states (0 and 1) have equal importance, e.g., female and male for the attribute Gender
      - Distance function: Simple Matching Distance, the proportion of mismatches of their values
        dist(xi, xj) = (b + c) / (a + b + c + d)    (1)
      - There are variations that add weights
        - to mismatches:  dist(xi, xj) = 2(b + c) / (a + d + 2(b + c))
        - to matches:     dist(xi, xj) = (b + c) / (2(a + d) + b + c)

      Symmetric binary attributes: example
      - x1 and x2 are two data points
      - Each of the 7 attributes is symmetric binary; the counts are a = 2, b = 2, c = 1, d = 2
      - The simple matching distance is
        dist(x1, x2) = (b + c) / (a + b + c + d) = (2 + 1) / (2 + 2 + 1 + 2) = 3/7 = 0.429
      - If there is a weight on mismatches
        dist(x1, x2) = 2(b + c) / (a + 2(b + c) + d) = 2(2 + 1) / (2 + 2(2 + 1) + 2) = 6/10 = 0.6
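A small sketch that derives the confusion-matrix counts a, b, c, d from two binary vectors and applies equation (1). The example vectors are hypothetical, chosen only so that the counts match the example above (a = 2, b = 2, c = 1, d = 2):

```python
import numpy as np

def simple_matching_distance(x1, x2):
    """dist = (b + c) / (a + b + c + d), the proportion of mismatching binary attributes."""
    a = np.sum((x1 == 1) & (x2 == 1))
    b = np.sum((x1 == 1) & (x2 == 0))
    c = np.sum((x1 == 0) & (x2 == 1))
    d = np.sum((x1 == 0) & (x2 == 0))
    return (b + c) / (a + b + c + d)

# Hypothetical 7-attribute vectors giving a = 2, b = 2, c = 1, d = 2
x1 = np.array([1, 1, 1, 1, 0, 0, 0])
x2 = np.array([1, 1, 0, 0, 1, 0, 0])
print(simple_matching_distance(x1, x2))  # 3/7 = 0.4285...
```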

  12. Asymmetric binary attributes
      - Asymmetric: one of the states is more important or valuable than the other
        - by convention, state 1 represents the more important state, which is typically the rare or infrequent state
        - the Jaccard distance is a popular measure
          dist(xi, xj) = (b + c) / (a + b + c)    (2)
        - there are variations that add weights
          - to mismatches:                      dist(xi, xj) = 2(b + c) / (a + 2(b + c))
          - to matches of the important state:  dist(xi, xj) = (b + c) / (2a + b + c)

      Asymmetric binary attributes: example
      - x1 and x2 are two data points
      - Each of the 7 attributes is asymmetric binary (a = 2, b = 2, c = 1)
      - The Jaccard distance is
        dist(x1, x2) = (b + c) / (a + b + c) = (2 + 1) / (2 + 2 + 1) = 3/5 = 0.6
      - If there is a weight on matches of the important state
        dist(x1, x2) = (b + c) / (2a + b + c) = (2 + 1) / (2*2 + 2 + 1) = 3/7 = 0.429
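The same idea for the Jaccard distance of equation (2); the vectors are the same hypothetical ones as before:

```python
import numpy as np

def jaccard_distance(x1, x2):
    """dist = (b + c) / (a + b + c); matches on the unimportant state 0 (the d count) are ignored."""
    a = np.sum((x1 == 1) & (x2 == 1))
    b = np.sum((x1 == 1) & (x2 == 0))
    c = np.sum((x1 == 0) & (x2 == 1))
    return (b + c) / (a + b + c)

# Same hypothetical vectors as before (a = 2, b = 2, c = 1)
x1 = np.array([1, 1, 1, 1, 0, 0, 0])
x2 = np.array([1, 1, 0, 0, 1, 0, 0])
print(jaccard_distance(x1, x2))  # 3/5 = 0.6
```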

  13. Nominal attributes
      - Nominal attributes: attributes with more than two states or values
        - the commonly used distance measure is also based on the simple matching method
        - given two data points xi and xj, let the number of attributes be r, and let the number of attributes whose values match in xi and xj be q; then (see the sketch after this slide)
          dist(xi, xj) = (r - q) / r    (3)

      Road map
      - Basic concepts
      - K-means algorithm
      - Representation of clusters
      - Hierarchical clustering
      - Distance functions
      - Data standardization
      - Handling mixed attributes
      - Which clustering algorithm to use?
      - Cluster evaluation
      - Summary
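Equation (3) in code; the nominal attribute values are made up for illustration:

```python
import numpy as np

def nominal_distance(xi, xj):
    """dist = (r - q) / r, where q is the number of attributes whose values match."""
    r = len(xi)
    q = np.sum(xi == xj)
    return (r - q) / r

# Hypothetical nominal attributes (e.g., colour, size, shape)
xi = np.array(["red", "large", "round"])
xj = np.array(["red", "small", "round"])
print(nominal_distance(xi, xj))  # 1/3 = 0.333...
```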

  14. Data standardization
      - In the Euclidean space, standardization of attributes is recommended so that all attributes have equal impact on the computation of distances
      - Consider the following pair of data points
        - xi: (0.1, 20) and xj: (0.9, 720)
          dist(xi, xj) = sqrt((0.9 - 0.1)^2 + (720 - 20)^2) = 700.000457
      - The distance is almost completely dominated by (720 - 20) = 700
      - Standardizing the attributes forces them to have a common value range

      Interval-scaled attributes
      - Their values are real numbers following a linear scale
        - e.g., the difference in Age between 10 and 20 is the same as that between 40 and 50
        - the key idea is that intervals keep the same importance throughout the scale
      - Two main approaches to standardize interval-scaled attributes: range and z-score
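A minimal sketch of the two standardization approaches named above (range and z-score), using their standard definitions; note that some formulations of the z-score use the mean absolute deviation rather than the standard deviation:

```python
import numpy as np

def range_standardize(X):
    """Rescale each attribute (column) to [0, 1]: (x - min) / (max - min)."""
    mins, maxs = X.min(axis=0), X.max(axis=0)
    return (X - mins) / (maxs - mins)

def zscore_standardize(X):
    """Center each attribute and divide by its spread: (x - mean) / std."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

# The two points from the slide: without standardization the second attribute dominates.
X = np.array([[0.1, 20.0],
              [0.9, 720.0]])
Xr = range_standardize(X)
print(np.linalg.norm(Xr[0] - Xr[1]))  # sqrt(1^2 + 1^2) = 1.414..., both attributes now matter equally
```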
