2018 EE448, Big Data Mining, Lecture 7 Unsupervised Learning Weinan Zhang Shanghai Jiao Tong University http://wnzhang.net http://wnzhang.net/teaching/ee448/index.html
ML Problem Setting • Unsupervised learning • First build and learn p(x), then infer the conditional dependence p(x_t | x_i) • Each dimension of x is treated equally • Supervised learning • Directly learn the conditional dependence p(x_t | x_i) • x_t is the label to predict
Definition of Unsupervised Learning • Given the training dataset D = {x_i}, i = 1, 2, ..., N, let the machine learn the patterns underlying the data • Latent variables z → x • Probability density function (p.d.f.) estimation p(x) • Good data representation φ(x) (used for discrimination)
Uses of Unsupervised Learning • Data structure discovery, data science • Data compression • Outlier detection • Input to supervised/reinforcement algorithms (causes may be more simply related to outputs or rewards) • A theory of biological learning and perception Slide credit: Maneesh Sahani
Content • Fundamentals of Unsupervised Learning • K-means clustering • Principal component analysis • Probabilistic Unsupervised Learning • Mixtures of Gaussians • EM Methods
K-Means Clustering
K-Means Clustering • Provide the number of desired clusters k • Randomly choose k instances as seeds, one per cluster, i.e. the initial centroid for each cluster • Iterate • Assign each instance to the cluster with the closest centroid • Re-estimate the centroid of each cluster • Stop when the clustering converges • Or after a fixed number of iterations Slide credit: Ray Mooney
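The iteration above can be sketched in NumPy as follows. This is a minimal illustration, not the lecture's reference code; the function name, the `init` parameter (added so the seeding can be made deterministic), and the empty-cluster fallback are my own choices.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0, init=None):
    """Plain k-means on X of shape (n, d); returns centroids (k, d), labels (n,)."""
    rng = np.random.default_rng(seed)
    if init is None:
        # Randomly choose k distinct instances as seeds, one per cluster.
        centroids = X[rng.choice(len(X), size=k, replace=False)]
    else:
        centroids = np.asarray(init, dtype=float)
    for _ in range(n_iters):
        # Assign each instance to the cluster with the closest centroid (L2).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Re-estimate each centroid as the mean of its assigned points;
        # keep the old centroid if a cluster happens to be empty.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):  # converged
            break
        centroids = new_centroids
    return centroids, labels
```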
K-Means Clustering: Centroid • Assume instances are real-valued vectors x ∈ R^d • Each cluster C_k is represented by its centroid: the center of gravity (mean) of its points, μ_k = (1/|C_k|) Σ_{x ∈ C_k} x Slide credit: Ray Mooney
K-Means Clustering: Distance • Distance from a point to a centroid, L(x, μ_k) • Euclidean distance (L2 norm): L_2(x, μ_k) = ‖x − μ_k‖ = sqrt( Σ_{m=1}^d (x_m − μ_{k,m})² ) • Manhattan distance (L1 norm): L_1(x, μ_k) = |x − μ_k| = Σ_{m=1}^d |x_m − μ_{k,m}| • Cosine distance: L_cos(x, μ_k) = 1 − (x^T μ_k) / (‖x‖ · ‖μ_k‖) Slide credit: Ray Mooney
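The three distances can be written in a few lines of NumPy; these helper names are illustrative, and the cosine distance assumes neither vector is zero.

```python
import numpy as np

def l2_dist(x, mu):
    """Euclidean (L2) distance between a point and a centroid."""
    return np.sqrt(np.sum((x - mu) ** 2))

def l1_dist(x, mu):
    """Manhattan (L1) distance: sum of absolute coordinate differences."""
    return np.sum(np.abs(x - mu))

def cos_dist(x, mu):
    """Cosine distance: 1 minus the cosine of the angle between x and mu."""
    return 1.0 - (x @ mu) / (np.linalg.norm(x) * np.linalg.norm(mu))
```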
K-Means Example (K=2) • Pick seeds • Assign points to clusters • Compute centroids • Reassign clusters • Compute centroids • Reassign clusters • Converged! (figure omitted) Slide credit: Ray Mooney
K-Means Time Complexity • Assume computing distance between two instances is O ( d ) where d is the dimensionality of the vectors • Reassigning clusters: O ( knd ) distance computations • Computing centroids: Each instance vector gets added once to some centroid: O ( nd ) • Assume these two steps are each done once for I iterations: O ( Iknd ) Slide credit: Ray Mooney
K-Means Clustering Objective • The objective of K-means is to minimize the total sum of the squared distance of every point to its corresponding cluster centroid: min_{{μ_k}_{k=1}^K} Σ_{k=1}^K Σ_{x ∈ C_k} L(x, μ_k), with μ_k = (1/|C_k|) Σ_{x ∈ C_k} x • Finding the global optimum is NP-hard. • The K-means algorithm is guaranteed to converge to a local optimum.
Seed Choice • Results can vary based on random seed selection. • Some seeds can result in poor convergence rate, or convergence to sub-optimal clusterings. • Select good seeds using a heuristic or the results of another method.
Clustering Applications • Text mining • Cluster documents for related search • Cluster words for query suggestion • Recommender systems and advertising • Cluster users for item/ad recommendation • Cluster items for related item suggestion • Image search • Cluster images for similar image search and duplicate detection • Speech recognition or separation • Cluster phonetic features
Principal Component Analysis (PCA) • An example of 2-dimensional data • x_1: the piloting skill of a pilot • x_2: how much he/she enjoys flying • Main components • u_1: intrinsic piloting “karma” of a person • u_2: some noise Example credit: Andrew Ng
Principal Component Analysis (PCA) • PCA tries to identify the subspace in which the data approximately lies • PCA uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. • The number of principal components is less than or equal to the smaller of the number of original variables and the number of observations. • The transformation maps R^d → R^k with k ≪ d
PCA Data Preprocessing • Given the dataset D = {x^(i)}_{i=1}^m • Typically we first pre-process the data to normalize its mean and variance 1. Move the center of the data set to 0: μ = (1/m) Σ_{i=1}^m x^(i), x^(i) ← x^(i) − μ 2. Unify the variance of each variable: σ_j² = (1/m) Σ_{i=1}^m (x_j^(i))², x_j^(i) ← x_j^(i) / σ_j
PCA Data Preprocessing • Zero out the mean of the data • Rescale each coordinate to have unit variance, which ensures that different attributes are all treated on the same “scale”.
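The two preprocessing steps can be sketched as below; the function name is my own, and the sketch assumes no coordinate is constant (a constant column would give σ_j = 0 and divide by zero).

```python
import numpy as np

def normalize(X):
    """Zero-mean, unit-variance preprocessing for data X of shape (m, d)."""
    X = X - X.mean(axis=0)                   # step 1: subtract the mean
    sigma = np.sqrt((X ** 2).mean(axis=0))   # per-coordinate std after centering
    return X / sigma                         # step 2: unit variance per coordinate
```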
PCA Solution • PCA finds the directions of largest variance in the data • which correspond to the eigenvectors of the matrix X^T X with the largest eigenvalues
PCA Solution: Data Projection • The projection of a point x^(i) onto a unit direction u (‖u‖ = 1) is x^(i)T u • The variance of the projections: (1/m) Σ_{i=1}^m (x^(i)T u)² = (1/m) Σ_{i=1}^m u^T x^(i) x^(i)T u = u^T ( (1/m) Σ_{i=1}^m x^(i) x^(i)T ) u ≡ u^T Σ u
PCA Solution: Largest Eigenvalues max_u u^T Σ u s.t. ‖u‖ = 1, where Σ = (1/m) Σ_{i=1}^m x^(i) x^(i)T • Finding k principal components of the data means finding the k principal eigenvectors of Σ • i.e. the top-k eigenvectors with the largest eigenvalues • Projected vector for x^(i): y^(i) = [u_1^T x^(i); u_2^T x^(i); … ; u_k^T x^(i)] ∈ R^k
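Putting the pieces together, PCA via eigendecomposition of the covariance matrix can be sketched as follows (the function name is mine; `np.linalg.eigh` is used because Σ is symmetric, and it returns eigenvalues in ascending order):

```python
import numpy as np

def pca(X, k):
    """Project data X of shape (m, d) onto its top-k principal components."""
    X = X - X.mean(axis=0)               # center the data first
    Sigma = X.T @ X / len(X)             # covariance: (1/m) sum x x^T
    w, U = np.linalg.eigh(Sigma)         # eigh: ascending eigenvalues
    top = U[:, np.argsort(w)[::-1][:k]]  # top-k eigenvectors as columns
    return X @ top                       # y^(i) = [u_1^T x^(i), ..., u_k^T x^(i)]
```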
Eigendecomposition Revisited • For a positive semi-definite square matrix Σ_{d×d} • let u be an eigenvector (‖u‖ = 1) • with scalar eigenvalue w: Σu = wu • There are d eigenvector–eigenvalue pairs (u_i, w_i) • These d eigenvectors are orthogonal, so they form an orthonormal basis: Σ_{i=1}^d u_i u_i^T = I • Thus any vector v can be written as v = (Σ_{i=1}^d u_i u_i^T) v = Σ_{i=1}^d (u_i^T v) u_i = Σ_{i=1}^d v_(i) u_i • With U = [u_1, u_2, …, u_d] and W = diag(w_1, w_2, …, w_d), Σ_{d×d} can be written as Σ = Σ_{i=1}^d w_i u_i u_i^T = U W U^T
Eigendecomposition Revisited • Given the data X = [x_1^T; x_2^T; … ; x_n^T] and its covariance matrix Σ = X^T X (here we may drop the 1/m factor for simplicity) • The variance in the direction of eigenvector u_i is ‖X u_i‖² = u_i^T X^T X u_i = u_i^T Σ u_i = u_i^T (w_i u_i) = w_i • The variance in any direction v is ‖X v‖² = ‖X Σ_i v_(i) u_i‖² = Σ_{i,j} v_(i) v_(j) u_i^T Σ u_j = Σ_{i=1}^d v_(i)² w_i, where v_(i) is the projection length of v on u_i • If v^T v = 1, then argmax_{‖v‖=1} ‖X v‖² = u_(max): the direction of greatest variance is the eigenvector with the largest eigenvalue
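These identities are easy to check numerically: the variance of the data projected onto each eigenvector of X^T X equals its eigenvalue, and no unit direction beats the top eigenvector. A small sanity check on random data (assuming, as on the slide, the unnormalized covariance X^T X):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X = X - X.mean(axis=0)        # center the data
Sigma = X.T @ X               # unnormalized covariance, as on the slide
w, U = np.linalg.eigh(Sigma)  # ascending eigenvalues, orthonormal eigenvectors

# ||X u_i||^2 = w_i for every eigenvector u_i.
for i in range(3):
    assert np.isclose(np.linalg.norm(X @ U[:, i]) ** 2, w[i])

# A random unit direction never exceeds the largest eigenvalue.
v = rng.normal(size=3)
v /= np.linalg.norm(v)
assert np.linalg.norm(X @ v) ** 2 <= w[-1] + 1e-9
```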
PCA Discussion • PCA can also be derived by picking the basis that minimizes the approximation error arising from projecting the data onto the k -dimensional subspace spanned by them.
PCA Visualization http://setosa.io/ev/principal-component-analysis/