  1. 2018 EE448, Big Data Mining, Lecture 7 Unsupervised Learning Weinan Zhang Shanghai Jiao Tong University http://wnzhang.net http://wnzhang.net/teaching/ee448/index.html

  2. ML Problem Setting • First build and learn $p(x)$ and then infer the conditional dependence $p(x_t \mid x_i)$ • Unsupervised learning • Each dimension of x is treated equally • Directly learn the conditional dependence $p(x_t \mid x_i)$ • Supervised learning • $x_t$ is the label to predict

  3. Definition of Unsupervised Learning • Given the training dataset $D = \{x_i\}_{i=1,2,\dots,N}$, let the machine learn the patterns underlying the data • Latent variables $z \to x$ • Probabilistic density function (p.d.f.) estimation $p(x)$ • Good data representation (used for discrimination) $\phi(x)$

  4. Uses of Unsupervised Learning • Data structure discovery, data science • Data compression • Outlier detection • Input to supervised/reinforcement algorithms (causes may be more simply related to outputs or rewards) • A theory of biological learning and perception Slide credit: Maneesh Sahani

  5. Content • Fundamentals of Unsupervised Learning • K-means clustering • Principal component analysis • Probabilistic Unsupervised Learning • Mixtures of Gaussians • EM Methods

  6. K-Means Clustering

  7. K-Means Clustering

  8. K-Means Clustering • Provide the number of desired clusters k • Randomly choose k instances as seeds, one per cluster, i.e. the centroid for each cluster • Iterate • Assign each instance to the cluster with the closest centroid • Re-estimate the centroid of each cluster • Stop when the clustering converges • Or after a fixed number of iterations Slide credit: Ray Mooney
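
A minimal NumPy sketch of the procedure above; the function name, default arguments, and the toy data are illustrative choices, not from the slides:

    import numpy as np

    def kmeans(X, k, n_iters=100, rng=None):
        """Plain k-means with Euclidean distance; X is an (n, d) array."""
        rng = np.random.default_rng(rng)
        # Randomly choose k instances as the initial centroids (seeds)
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iters):
            # Assign each instance to the cluster with the closest centroid
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Re-estimate the centroid of each cluster (keep old one if a cluster is empty)
            new_centroids = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                for j in range(k)
            ])
            # Stop when the clustering has converged
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        return centroids, labels

    # Toy usage on two well-separated blobs
    X = np.vstack([np.random.randn(50, 2) + 5, np.random.randn(50, 2) - 5])
    centroids, labels = kmeans(X, k=2)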

  9. K-Means Clustering: Centroid • Assume instances are real-valued vectors $x \in \mathbb{R}^d$ • Clusters are based on centroids, i.e. the center of gravity or mean of the points in a cluster $C_k$: $\mu_k = \frac{1}{|C_k|} \sum_{x \in C_k} x$ Slide credit: Ray Mooney

  10. K-Means Clustering: Distance • Distance to a centroid $L(x, \mu_k)$ • Euclidean distance (L2 norm): $L_2(x, \mu_k) = \|x - \mu_k\| = \sqrt{\sum_{m=1}^{d} (x_m - \mu_{k,m})^2}$ • Manhattan distance (L1 norm): $L_1(x, \mu_k) = |x - \mu_k| = \sum_{m=1}^{d} |x_m - \mu_{k,m}|$ • Cosine distance: $L_{\cos}(x, \mu_k) = 1 - \frac{x^\top \mu_k}{|x| \cdot |\mu_k|}$ Slide credit: Ray Mooney
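
The three distances translate directly into NumPy; a small illustrative sketch with my own function names:

    import numpy as np

    def l2_distance(x, mu):
        # Euclidean (L2) distance
        return np.sqrt(np.sum((x - mu) ** 2))

    def l1_distance(x, mu):
        # Manhattan (L1) distance
        return np.sum(np.abs(x - mu))

    def cosine_distance(x, mu):
        # 1 minus the cosine similarity between x and mu
        return 1.0 - x @ mu / (np.linalg.norm(x) * np.linalg.norm(mu))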

  11. K-Means Example (K=2) • Pick seeds • Assign points to clusters • Compute centroids • Reassign clusters • Compute centroids • Reassign clusters • Converged! (animated figure on the slide) Slide credit: Ray Mooney

  12. K-Means Time Complexity • Assume computing distance between two instances is O ( d ) where d is the dimensionality of the vectors • Reassigning clusters: O ( knd ) distance computations • Computing centroids: Each instance vector gets added once to some centroid: O ( nd ) • Assume these two steps are each done once for I iterations: O ( Iknd ) Slide credit: Ray Mooney
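
As a rough illustrative calculation (the numbers are made up for the example): with n = 1,000,000 points, d = 100 dimensions, k = 10 clusters and I = 20 iterations, the reassignment step alone costs about I * k * n * d = 2 * 10^10 elementary operations, which is why the distance computations dominate the running time.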

  13. K-Means Clustering Objective • The objective of K-means is to minimize the total sum of the squared distances of every point to its corresponding cluster centroid: $\min_{\{\mu_k\}_{k=1}^{K}} \sum_{k=1}^{K} \sum_{x \in C_k} L(x - \mu_k)$, where $\mu_k = \frac{1}{|C_k|} \sum_{x \in C_k} x$ • Finding the global optimum is NP-hard. • The K-means algorithm is guaranteed to converge to a local optimum.
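
A small sketch for evaluating this objective for a given assignment, assuming the squared Euclidean distance for L and my own function name:

    import numpy as np

    def kmeans_objective(X, labels, centroids):
        """Total squared Euclidean distance of each point to its cluster centroid."""
        return sum(
            np.sum((X[labels == k] - mu) ** 2)
            for k, mu in enumerate(centroids)
        )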

  14. Seed Choice • Results can vary based on random seed selection. • Some seeds can result in poor convergence rate, or convergence to sub-optimal clusterings. • Select good seeds using a heuristic or the results of another method.
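
One common seeding heuristic, not named on the slide, is k-means++: spread the initial seeds out by sampling each new seed with probability proportional to its squared distance from the seeds chosen so far. A sketch under that assumption:

    import numpy as np

    def kmeans_pp_seeds(X, k, rng=None):
        """k-means++ style seeding: favor points far from the seeds chosen so far."""
        rng = np.random.default_rng(rng)
        seeds = [X[rng.integers(len(X))]]           # first seed uniformly at random
        for _ in range(k - 1):
            # squared distance of every point to its nearest seed so far
            d2 = np.min(
                np.linalg.norm(X[:, None, :] - np.array(seeds)[None, :, :], axis=2) ** 2,
                axis=1,
            )
            probs = d2 / d2.sum()                   # sample proportional to squared distance
            seeds.append(X[rng.choice(len(X), p=probs)])
        return np.array(seeds)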

  15. Clustering Applications • Text mining • Cluster documents for related search • Cluster words for query suggestion • Recommender systems and advertising • Cluster users for item/ad recommendation • Cluster items for related item suggestion • Image search • Cluster images for similar image search and duplication detection • Speech recognition or separation • Cluster phonetic features

  16. Principal Component Analysis (PCA) • An example of 2-dimensional data • $x_1$: the piloting skill of a pilot • $x_2$: how much he/she enjoys flying • Main components • $u_1$: intrinsic piloting "karma" of a person • $u_2$: some noise Example credit: Andrew Ng

  17. Principal Component Analysis (PCA) • PCA tries to identify the subspace in which the data approximately lies • PCA uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. • The number of principal components is less than or equal to the smaller of the number of original variables or the number of observations. • $\mathbb{R}^d \to \mathbb{R}^k$ with $k \ll d$

  18. PCA Data Preprocessing • Given the dataset $D = \{x^{(i)}\}_{i=1}^{m}$ • Typically we first pre-process the data to normalize its mean and variance 1. Move the center of the dataset to 0: $\mu = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}$, then $x^{(i)} \leftarrow x^{(i)} - \mu$ 2. Unify the variance of each variable: $\sigma_j^2 = \frac{1}{m} \sum_{i=1}^{m} (x_j^{(i)})^2$, then $x_j^{(i)} \leftarrow x_j^{(i)} / \sigma_j$

  19. PCA Data Preprocessing • Zero out the mean of the data • Rescale each coordinate to have unit variance, which ensures that different attributes are all treated on the same “scale”.
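
A short sketch of these two preprocessing steps (zero mean, then unit variance per coordinate); the function name and the guard for constant columns are my own additions:

    import numpy as np

    def standardize(X):
        """Center each column at 0 and rescale it to unit variance; X is (m, d)."""
        X = X - X.mean(axis=0)          # step 1: zero out the mean
        sigma = X.std(axis=0)
        sigma[sigma == 0] = 1.0         # avoid dividing by zero for constant columns
        return X / sigma                # step 2: unit variance per coordinate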

  20. PCA Solution • PCA finds the directions of largest variance in the data • These correspond to the eigenvectors of the matrix $X^\top X$ with the largest eigenvalues

  21. PCA Solution: Data Projection • The projection of each point $x^{(i)}$ onto a direction $u$ (with $\|u\| = 1$) is $x^{(i)\top} u$ • The variance of the projection is $\frac{1}{m} \sum_{i=1}^{m} (x^{(i)\top} u)^2 = \frac{1}{m} \sum_{i=1}^{m} u^\top x^{(i)} x^{(i)\top} u = u^\top \Big( \frac{1}{m} \sum_{i=1}^{m} x^{(i)} x^{(i)\top} \Big) u \equiv u^\top \Sigma u$
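
A quick numerical check of this identity on random data (purely illustrative values):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 5))
    X = X - X.mean(axis=0)                      # centered data
    Sigma = X.T @ X / len(X)                    # (1/m) sum_i x_i x_i^T
    u = rng.normal(size=5)
    u = u / np.linalg.norm(u)                   # unit-norm direction
    var_proj = np.mean((X @ u) ** 2)            # empirical variance of the projections
    assert np.isclose(var_proj, u @ Sigma @ u)  # matches u^T Sigma u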

  22. PCA Solution: Largest Eigenvalues • $\max_{u} \; u^\top \Sigma u$ s.t. $\|u\| = 1$, with $\Sigma = \frac{1}{m} \sum_{i=1}^{m} x^{(i)} x^{(i)\top}$ • Finding the k principal components of the data amounts to finding the k principal eigenvectors of $\Sigma$ • i.e. the top-k eigenvectors with the largest eigenvalues • Projected vector for $x^{(i)}$: $y^{(i)} = \big[ u_1^\top x^{(i)}, \; u_2^\top x^{(i)}, \; \dots, \; u_k^\top x^{(i)} \big]^\top \in \mathbb{R}^k$
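
A compact PCA sketch following these steps; the function name is mine, and the data is assumed to be preprocessed (centered) as described above:

    import numpy as np

    def pca_project(X, k):
        """Project centered data X (m, d) onto its top-k principal components."""
        Sigma = X.T @ X / len(X)                 # covariance matrix
        w, U = np.linalg.eigh(Sigma)             # eigenvalues ascending, columns are eigenvectors
        top = U[:, np.argsort(w)[::-1][:k]]      # top-k eigenvectors (largest eigenvalues)
        return X @ top                           # rows are y^(i) = [u_1^T x^(i), ..., u_k^T x^(i)]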

  23. Eigendecomposition Revisit • For a positive semi-definite square matrix $\Sigma_{d \times d}$ • suppose $u$ is an eigenvector (with $\|u\| = 1$) with scalar eigenvalue $w$: $\Sigma u = w u$ • There are d eigenvector-eigenvalue pairs $(u_i, w_i)$ • These d eigenvectors are orthogonal, so they form an orthonormal basis: $\sum_{i=1}^{d} u_i u_i^\top = I$ • Thus any vector $v$ can be written as $v = \big( \sum_{i=1}^{d} u_i u_i^\top \big) v = \sum_{i=1}^{d} (u_i^\top v) u_i = \sum_{i=1}^{d} v_{(i)} u_i$ • With $U = [u_1, u_2, \dots, u_d]$ and $W = \mathrm{diag}(w_1, w_2, \dots, w_d)$, $\Sigma$ can be written as $\Sigma = \sum_{i=1}^{d} w_i u_i u_i^\top = U W U^\top$
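
A small numerical check of the decomposition $\Sigma = U W U^\top$ with NumPy (the matrix is an arbitrary illustrative example):

    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.normal(size=(100, 4))
    Sigma = A.T @ A / len(A)                           # a positive semi-definite matrix
    w, U = np.linalg.eigh(Sigma)                       # eigenvalues w, orthonormal eigenvectors U
    assert np.allclose(U @ np.diag(w) @ U.T, Sigma)    # Sigma = U W U^T
    assert np.allclose(U @ U.T, np.eye(4))             # the columns form an orthonormal basis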

  24. Eigendecomposition Revisit • Given the data $X = [x_1^\top; x_2^\top; \dots; x_n^\top]$ and its covariance matrix $\Sigma = X^\top X$ (here we may drop the factor 1/m for simplicity) • The variance in direction $u_i$ is $\|X u_i\|^2 = u_i^\top X^\top X u_i = u_i^\top \Sigma u_i = u_i^\top w_i u_i = w_i$ • The variance in any direction $v$ is $\|X v\|^2 = \big\| X \sum_{i=1}^{d} v_{(i)} u_i \big\|^2 = \sum_{ij} v_{(i)} u_i^\top \Sigma u_j v_{(j)} = \sum_{i=1}^{d} v_{(i)}^2 w_i$, where $v_{(i)}$ is the projection length of $v$ on $u_i$ • If $v^\top v = 1$, then $\arg\max_{\|v\|=1} \|X v\|^2 = u_{(\max)}$: the direction of greatest variance is the eigenvector with the largest eigenvalue
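
A quick numerical illustration of the last two claims (the anisotropic toy data is my own choice):

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.normal(size=(200, 3)) @ np.diag([3.0, 1.0, 0.2])   # anisotropic data
    X = X - X.mean(axis=0)
    Sigma = X.T @ X                                            # dropping the 1/m factor
    w, U = np.linalg.eigh(Sigma)
    v = rng.normal(size=3)
    v = v / np.linalg.norm(v)                                  # a random unit direction
    # variance along v equals sum_i v_(i)^2 w_i, where v_(i) = u_i^T v ...
    assert np.isclose((X @ v) @ (X @ v), np.sum((U.T @ v) ** 2 * w))
    # ... and therefore never exceeds the largest eigenvalue
    assert (X @ v) @ (X @ v) <= w.max() + 1e-9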

  25. PCA Discussion • PCA can also be derived by picking the basis that minimizes the approximation error arising from projecting the data onto the k-dimensional subspace spanned by that basis.
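
This view can be checked numerically: for centered data, the mean squared reconstruction error after projecting onto the top-k eigenvectors equals the sum of the discarded eigenvalues of the covariance matrix, a standard fact illustrated in this sketch (the data is made up):

    import numpy as np

    rng = np.random.default_rng(3)
    X = rng.normal(size=(300, 6)) @ rng.normal(size=(6, 6))    # correlated data
    X = X - X.mean(axis=0)
    Sigma = X.T @ X / len(X)
    w, U = np.linalg.eigh(Sigma)                               # ascending eigenvalues
    k = 2
    Uk = U[:, -k:]                                             # top-k eigenvectors
    X_rec = X @ Uk @ Uk.T                                      # project onto the subspace and reconstruct
    mse = np.mean(np.sum((X - X_rec) ** 2, axis=1))
    assert np.isclose(mse, w[:-k].sum())                       # equals the sum of discarded eigenvalues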

  26. PCA Visualization http://setosa.io/ev/principal-component-analysis/

  27. PCA Visualization http://setosa.io/ev/principal-component-analysis/
