 
              2019 CS420, Machine Learning, Lecture 9 Unsupervised Learning Weinan Zhang Shanghai Jiao Tong University http://wnzhang.net http://wnzhang.net/teaching/cs420/index.html
What is Data Science • Physics • Goal: discover the underlying principle of the world • Solution: build the model of the world from observations • $F = G \frac{m_1 m_2}{r^2}$ • Data Science • Goal: discover the underlying principle of the data • Solution: build the model of the data from observations • $p(x) = \frac{e^{f(x)}}{\sum_{x'} e^{f(x')}}$
Data Science • Mathematically • Find the joint data distribution $p(x)$ • Then the conditional distribution $p(x_2 | x_1)$ • Gaussian distribution • Multivariate: $p(x) = \frac{1}{\sqrt{|2\pi\Sigma|}} e^{-\frac{1}{2}(x-\mu)^\top \Sigma^{-1} (x-\mu)}$ • Univariate: $p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
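The two Gaussian densities on this slide can be evaluated numerically. The following is a minimal sketch, assuming NumPy is available; the function names `gaussian_pdf_1d` and `gaussian_pdf_nd` are illustrative and not part of the lecture. It also checks that the multivariate formula reduces to the univariate one when the dimension is 1.

```python
import numpy as np

def gaussian_pdf_1d(x, mu, sigma2):
    """Univariate Gaussian density with mean mu and variance sigma2."""
    return np.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)

def gaussian_pdf_nd(x, mu, cov):
    """Multivariate Gaussian density with mean vector mu and covariance cov."""
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    cov = np.asarray(cov, dtype=float)
    diff = x - mu
    # Solve cov @ z = diff instead of forming the explicit inverse.
    quad = diff @ np.linalg.solve(cov, diff)
    norm = np.sqrt(np.linalg.det(2.0 * np.pi * cov))
    return np.exp(-0.5 * quad) / norm

# With a 1x1 covariance the two formulas should agree.
p1 = gaussian_pdf_1d(0.5, mu=0.0, sigma2=2.0)
p2 = gaussian_pdf_nd([0.5], mu=[0.0], cov=[[2.0]])
```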
A Simple Example in User Behavior Modelling
Interest | Gender | Age | BBC Sports | PubMed | Bloomberg Business | Spotify
Finance  | Male   | 29  | Yes        | No     | Yes                | No
Sports   | Male   | 21  | Yes        | No     | No                 | Yes
Medicine | Female | 32  | No         | Yes    | No                 | No
Music    | Female | 25  | No         | No     | No                 | Yes
Medicine | Male   | 40  | Yes        | Yes    | Yes                | No
• Joint data distribution: p(Interest=Finance, Gender=Male, Age=29, Browsing=BBC Sports, Bloomberg Business)
• Conditional data distribution: p(Interest=Finance | Browsing=BBC Sports, Bloomberg Business) and p(Gender=Male | Browsing=BBC Sports, Bloomberg Business)
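As a rough illustration of the conditional distributions above, they can be estimated from the five example rows by counting. This is only a sketch under two assumptions that the slide does not state: the conditioning event is read as "visited both BBC Sports and Bloomberg Business" (other visits allowed), and probabilities are plain empirical frequencies. The helper `conditional` is a hypothetical name.

```python
rows = [
    # (interest, gender, age, bbc_sports, pubmed, bloomberg_business, spotify)
    ("Finance", "Male", 29, True, False, True, False),
    ("Sports", "Male", 21, True, False, False, True),
    ("Medicine", "Female", 32, False, True, False, False),
    ("Music", "Female", 25, False, False, False, True),
    ("Medicine", "Male", 40, True, True, True, False),
]

def conditional(rows, event, given):
    """Empirical p(event | given) as a ratio of row counts."""
    matching = [r for r in rows if given(r)]
    if not matching:
        raise ValueError("conditioning event never occurs in the data")
    return sum(1 for r in matching if event(r)) / len(matching)

def visited_both(r):
    # Assumed reading of "Browsing=BBC Sports, Bloomberg Business".
    return r[3] and r[5]

p_finance = conditional(rows, lambda r: r[0] == "Finance", visited_both)
p_male = conditional(rows, lambda r: r[1] == "Male", visited_both)
print(p_finance, p_male)  # 0.5 1.0
```

With only five rows these estimates are very coarse; a learned model of p(x) is meant to generalize beyond such raw counts.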
Problem Setting • First build and learn $p(x)$ and then infer the conditional dependence $p(x_t | x_i)$ • Unsupervised learning • Each dimension of $x$ is treated equally • Directly learn the conditional dependence $p(x_t | x_i)$ • Supervised learning • $x_t$ is the label to predict
Definition of Unsupervised Learning • Given the training dataset $D = \{x_i\}_{i=1,2,\ldots,N}$, let the machine learn the underlying patterns of the data • Latent variables $z \to x$ • Probability density function (p.d.f.) estimation $p(x)$ • Good data representation (used for discrimination) $\phi(x)$
Uses of Unsupervised Learning • Data structure discovery, data science • Data compression • Outlier detection • Input to supervised/reinforcement algorithms (causes may be more simply related to outputs or rewards) • A theory of biological learning and perception Slide credit: Maneesh Sahani
Content • Fundamentals of Unsupervised Learning • K-means clustering • Principal component analysis • Probabilistic Unsupervised Learning • Mixture Gaussians • EM Methods • Deep Unsupervised Learning • Auto-encoders • Generative adversarial nets
K-Means Clustering
K-Means Clustering • Provide the number of desired clusters k • Randomly choose k instances as seeds, one per cluster, i.e. the initial centroid of each cluster • Iterate • Assign each instance to the cluster with the closest centroid • Re-estimate the centroid of each cluster • Stop when clustering converges • Or after a fixed number of iterations Slide credit: Ray Mooney
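The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the lecture's reference implementation: it assumes Euclidean distance, treats "converged" as "assignments no longer change", and keeps the old centroid when a cluster becomes empty. The function name `kmeans` and its parameters are illustrative.

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Plain K-means with Euclidean distance; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Pick k distinct instances as the initial centroids (seeds).
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = None
    for _ in range(max_iters):
        # Assignment step: index of the closest centroid for every instance.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments unchanged: converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its assigned points.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids, labels

# Two well-separated blobs as toy data.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)), rng.normal(5.0, 0.5, (50, 2))])
centroids, labels = kmeans(X, k=2)
```

On this toy data the two blobs should end up in different clusters, although which blob receives label 0 depends on the random seeds.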
K-Means Clustering: Centroid • Assume instances are real-valued vectors $x \in \mathbb{R}^d$ • Clusters are based on centroids, i.e. the center of gravity or mean of the points in a cluster $C_k$: $\mu^k = \frac{1}{|C_k|} \sum_{x \in C_k} x$ Slide credit: Ray Mooney
K-Means Clustering: Distance • Distance to a centroid $L(x, \mu^k)$ • Euclidean distance (L2 norm): $L_2(x, \mu^k) = \|x - \mu^k\| = \sqrt{\sum_{m=1}^d (x_m - \mu^k_m)^2}$ • Manhattan distance (L1 norm): $L_1(x, \mu^k) = |x - \mu^k| = \sum_{m=1}^d |x_m - \mu^k_m|$ • Cosine distance: $L_{\cos}(x, \mu^k) = 1 - \frac{x^\top \mu^k}{\|x\| \cdot \|\mu^k\|}$ Slide credit: Ray Mooney
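For concreteness, the three distances can be written as small NumPy functions. This is a sketch with illustrative function names; the cosine distance is undefined when either vector is all zeros, which the code does not guard against.

```python
import numpy as np

def l2_distance(x, mu):
    """Euclidean (L2) distance between x and the centroid mu."""
    return np.sqrt(np.sum((x - mu) ** 2))

def l1_distance(x, mu):
    """Manhattan (L1) distance between x and the centroid mu."""
    return np.sum(np.abs(x - mu))

def cosine_distance(x, mu):
    """One minus the cosine of the angle between x and mu."""
    return 1.0 - (x @ mu) / (np.linalg.norm(x) * np.linalg.norm(mu))

x = np.array([3.0, 4.0])
mu = np.array([0.0, 8.0])
d2 = l2_distance(x, mu)        # 5.0, since x - mu = (3, -4)
d1 = l1_distance(x, mu)        # 7.0
dcos = cosine_distance(x, mu)  # about 0.2, since cos = 32 / (5 * 8)
```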
K-Means Example (K=2) • Figure: pick seeds, reassign clusters, compute centroids, reassign clusters, compute centroids, reassign clusters, converged. Slide credit: Ray Mooney
K-Means Time Complexity • Assume computing the distance between two instances is $O(d)$, where $d$ is the dimensionality of the vectors • Reassigning clusters for $n$ instances: $O(kn)$ distance computations, i.e. $O(knd)$ time • Computing centroids: each instance vector gets added once to some centroid: $O(nd)$ • Assume these two steps are each done once per iteration for $I$ iterations: $O(Iknd)$ Slide credit: Ray Mooney
K-Means Clustering Objective • The objective of K-means is to minimize the total sum of the squared distances of every point to its corresponding cluster centroid: $\min_{\{\mu^k\}_{k=1}^K} \sum_{k=1}^K \sum_{x \in C_k} L(x - \mu^k)$, where $\mu^k = \frac{1}{|C_k|} \sum_{x \in C_k} x$ • Finding the global optimum is NP-hard. • The K-means algorithm is guaranteed to converge to a local optimum.
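The convergence claim rests on the fact that neither the assignment step nor the centroid update can increase the objective. The sketch below, which assumes $L$ is the squared Euclidean distance and uses illustrative names (`objective`, `lloyd_step`), records the objective after each iteration on random data so that this can be observed.

```python
import numpy as np

def objective(X, centroids, labels):
    """Total squared Euclidean distance of each point to its assigned centroid."""
    return float(np.sum((X - centroids[labels]) ** 2))

def lloyd_step(X, centroids):
    """One assignment step followed by one centroid update."""
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    new_centroids = centroids.copy()
    for j in range(len(centroids)):
        members = X[labels == j]
        if len(members) > 0:  # an empty cluster keeps its old centroid
            new_centroids[j] = members.mean(axis=0)
    return new_centroids, labels

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
centroids = X[:3].copy()  # arbitrary seeds: the first three points
history = []
for _ in range(10):
    centroids, labels = lloyd_step(X, centroids)
    history.append(objective(X, centroids, labels))
```

The values in `history` should form a non-increasing sequence, up to floating-point rounding. This says nothing about reaching the global optimum, only that the procedure settles at a local one.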
Seed Choice • Results can vary based on random seed selection. • Some seeds can result in poor convergence rate, or convergence to sub-optimal clusterings. • Select good seeds using a heuristic or the results of another method.
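The slide does not name a specific seeding heuristic. One widely used option is k-means++ style seeding, shown here purely as an example of such a heuristic and not as the lecture's recommendation: the first seed is drawn uniformly, and each later seed is drawn with probability proportional to its squared distance from the nearest seed chosen so far. The function name `kmeans_pp_seeds` is illustrative, and the sketch assumes the data are not all identical points.

```python
import numpy as np

def kmeans_pp_seeds(X, k, seed=0):
    """k-means++ style seeding: spread the initial centroids apart."""
    rng = np.random.default_rng(seed)
    n = len(X)
    seeds = [X[rng.integers(n)]]  # first seed: uniform over the instances
    for _ in range(k - 1):
        chosen = np.array(seeds)
        # Squared distance from every point to its nearest chosen seed.
        d2 = ((X[:, None, :] - chosen[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # Sample the next seed with probability proportional to that distance.
        probs = d2 / d2.sum()
        seeds.append(X[rng.choice(n, p=probs)])
    return np.array(seeds)

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))
S = kmeans_pp_seeds(X, k=3)
```

Already chosen points have zero sampling probability, so the seeds are distinct instances whenever the data contain no duplicate rows.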
Clustering Applications • Text mining • Cluster documents for related search • Cluster words for query suggestion • Recommender systems and advertising • Cluster users for item/ad recommendation • Cluster items for related item suggestion • Image search • Cluster images for similar image search and duplicate detection • Speech recognition or separation • Cluster phonetic features
Principal Component Analysis (PCA) • An example of 2-dimensional data • $x_1$: the piloting skill of a pilot • $x_2$: how much he/she enjoys flying • Main components • $u_1$: intrinsic piloting “karma” of a person • $u_2$: some noise Example credit: Andrew Ng
Principal Component Analysis (PCA) • PCA tries to identify the subspace in which the data approximately lies • PCA uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. • The number of principal components is less than or equal to the smaller of the number of original variables or the number of observations. • $\mathbb{R}^d \to \mathbb{R}^k$, $k \ll d$
PCA Data Preprocessing • Given the dataset $D = \{x^{(i)}\}_{i=1}^m$ • Typically we first pre-process the data to normalize its mean and variance 1. Move the center of the data set to 0: $\mu = \frac{1}{m} \sum_{i=1}^m x^{(i)}$, $x^{(i)} \leftarrow x^{(i)} - \mu$ 2. Unify the variance of each variable: $\sigma_j^2 = \frac{1}{m} \sum_{i=1}^m (x_j^{(i)})^2$, $x_j^{(i)} \leftarrow x_j^{(i)} / \sigma_j$
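The two preprocessing steps translate directly into NumPy. This is a minimal sketch with an illustrative function name; it uses the $1/m$ normalization from step 2 and does not handle a constant column, for which $\sigma_j = 0$.

```python
import numpy as np

def standardize(X):
    """Center each column at 0 and rescale it to unit variance."""
    mu = X.mean(axis=0)                      # step 1: per-variable mean
    Xc = X - mu
    sigma = np.sqrt((Xc ** 2).mean(axis=0))  # step 2: 1/m normalization
    return Xc / sigma, mu, sigma

# Two variables on very different scales.
X = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 200.0]])
Z, mu, sigma = standardize(X)
```

After this call every column of `Z` has mean 0 and variance 1, so both variables contribute on the same scale.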
PCA Data Preprocessing • Zero out the mean of the data • Rescale each coordinate to have unit variance, which ensures that different attributes are all treated on the same “scale”.
PCA Solution • PCA finds the directions along which the data has the largest variance • These correspond to the eigenvectors of the matrix $X^\top X$ with the largest eigenvalues
PCA Solution: Data Projection • The projection of each point $x^{(i)}$ onto a direction $u$ ($\|u\| = 1$) is $x^{(i)\top} u$ • The variance of the projection: $\frac{1}{m} \sum_{i=1}^m (x^{(i)\top} u)^2 = \frac{1}{m} \sum_{i=1}^m u^\top x^{(i)} x^{(i)\top} u = u^\top \Big( \frac{1}{m} \sum_{i=1}^m x^{(i)} x^{(i)\top} \Big) u \equiv u^\top \Sigma u$
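The identity between the variance of the projections and the quadratic form $u^\top \Sigma u$ can be checked numerically. The sketch below uses synthetic zero-mean data and an arbitrary unit direction; all names and the mixing matrix are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.0, 0.0], [0.5, 1.0, 0.0], [0.0, 0.0, 0.1]])
X = rng.normal(size=(500, 3)) @ mixing  # correlated synthetic data
X = X - X.mean(axis=0)                  # zero mean, as after preprocessing

u = np.array([1.0, 2.0, -1.0])
u = u / np.linalg.norm(u)               # unit-length direction

proj = X @ u                            # x^(i)T u for every i
var_direct = np.mean(proj ** 2)         # (1/m) sum_i (x^(i)T u)^2
Sigma = X.T @ X / len(X)                # (1/m) sum_i x^(i) x^(i)T
var_quadratic = u @ Sigma @ u           # u^T Sigma u
```

Both quantities should agree up to floating-point rounding, since they are two ways of writing the same sum.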
PCA Solution: Largest Eigenvalues • $\max_u u^\top \Sigma u$ s.t. $\|u\| = 1$, where $\Sigma = \frac{1}{m} \sum_{i=1}^m x^{(i)} x^{(i)\top}$ • Finding the $k$ principal components of the data amounts to finding the $k$ principal eigenvectors of $\Sigma$ • i.e. the top-$k$ eigenvectors with the largest eigenvalues • Projected vector for $x^{(i)}$: $y^{(i)} = \begin{bmatrix} u_1^\top x^{(i)} \\ u_2^\top x^{(i)} \\ \vdots \\ u_k^\top x^{(i)} \end{bmatrix} \in \mathbb{R}^k$
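Putting the pieces together, PCA via an eigendecomposition of $\Sigma$ might look like the sketch below. It assumes the rows of `X` are already zero-mean and uses `numpy.linalg.eigh`, which is intended for symmetric matrices and returns eigenvalues in ascending order. The function name `pca` and the toy data are illustrative.

```python
import numpy as np

def pca(X, k):
    """Return (U, Y): top-k principal directions as columns of U, and Y = X U."""
    Sigma = X.T @ X / len(X)                  # covariance of zero-mean data
    eigvals, eigvecs = np.linalg.eigh(Sigma)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]     # indices of the k largest
    U = eigvecs[:, order]
    return U, X @ U                           # row i of Y is y^(i)

# Toy data lying close to the diagonal direction (1, 1).
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.outer(t, [1.0, 1.0]) + 0.05 * rng.normal(size=(200, 2))
X = X - X.mean(axis=0)
U, Y = pca(X, k=1)
```

Here the first principal direction should be close to $(1, 1)/\sqrt{2}$ up to sign, because an eigenvector is only determined up to a factor of $\pm 1$.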