SLIDE 1

Unsupervised Learning

2018 EE448, Big Data Mining, Lecture 7

Weinan Zhang, Shanghai Jiao Tong University

http://wnzhang.net

http://wnzhang.net/teaching/ee448/index.html
SLIDE 2

ML Problem Setting

  • Unsupervised learning
    • First build and learn p(x), then infer the conditional dependence p(xt|xi)
    • Each dimension of x is treated equally
  • Supervised learning
    • Directly learn the conditional dependence p(xt|xi)
    • xt is the label to predict
SLIDE 3

Definition of Unsupervised Learning

  • Given the training dataset $D = \{x_i\}_{i=1,2,\ldots,N}$, let the machine learn the underlying patterns of the data
  • Probabilistic density function (p.d.f.) estimation: $p(x)$
  • Latent variables: $z \to x$
  • Good data representation (used for discrimination): $\phi(x)$
SLIDE 4

Uses of Unsupervised Learning

  • Data structure discovery, data science
  • Data compression
  • Outlier detection
  • Input to supervised/reinforcement algorithms (causes may be more simply related to outputs or rewards)
  • A theory of biological learning and perception

Slide credit: Maneesh Sahani
SLIDE 5

Content

  • Fundamentals of Unsupervised Learning
  • K-means clustering
  • Principal component analysis
  • Probabilistic Unsupervised Learning
  • Mixture Gaussians
  • EM Methods
SLIDE 6

K-Means Clustering

SLIDE 7

K-Means Clustering

SLIDE 8

K-Means Clustering

  • Provide the number of desired clusters, k
  • Randomly choose k instances as seeds, one per cluster, i.e. the centroid for each cluster
  • Iterate
    • Assign each instance to the cluster with the closest centroid
    • Re-estimate the centroid of each cluster
  • Stop when the clustering converges
    • Or after a fixed number of iterations

Slide credit: Ray Mooney
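As a concrete illustration of the loop above, here is a minimal NumPy sketch of K-means; the function name `kmeans` and the toy data are illustrative choices of this write-up, not from the slides:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain K-means with Euclidean distance. X is an (n, d) array."""
    rng = np.random.default_rng(seed)
    # Randomly choose k instances as the initial centroids (seeds)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign each instance to the cluster with the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Re-estimate the centroid of each cluster (keep old one if a cluster is empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):  # clustering has converged
            break
        centroids = new_centroids
    return centroids, labels

# toy usage with two well-separated blobs
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centroids, labels = kmeans(X, k=2)
```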

SLIDE 9

K-Means Clustering: Centroid

  • Assume instances are real-valued vectors $x \in \mathbb{R}^d$
  • Clusters are based on centroids: the center of gravity, or mean, of the points in a cluster $C_k$:

$$\mu_k = \frac{1}{|C_k|} \sum_{x \in C_k} x$$

Slide credit: Ray Mooney
SLIDE 10

K-Means Clustering: Distance

  • Distance to a centroid: $L(x, \mu_k)$
  • Euclidean distance (L2 norm):

$$L_2(x, \mu_k) = \|x - \mu_k\| = \sqrt{\sum_{m=1}^{d} \big(x_m - \mu_{k,m}\big)^2}$$

  • Manhattan distance (L1 norm):

$$L_1(x, \mu_k) = |x - \mu_k| = \sum_{m=1}^{d} \big|x_m - \mu_{k,m}\big|$$

  • Cosine distance:

$$L_{\cos}(x, \mu_k) = 1 - \frac{x^\top \mu_k}{|x| \cdot |\mu_k|}$$

Slide credit: Ray Mooney
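A small NumPy sketch of these three distance measures; the function names are illustrative, not from the slides:

```python
import numpy as np

def l2_distance(x, mu):
    return np.sqrt(np.sum((x - mu) ** 2))   # Euclidean (L2) distance

def l1_distance(x, mu):
    return np.sum(np.abs(x - mu))           # Manhattan (L1) distance

def cosine_distance(x, mu):
    return 1.0 - x @ mu / (np.linalg.norm(x) * np.linalg.norm(mu))
```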

SLIDE 11

K-Means Example (K=2)

Figure: pick seeds, assign points to clusters, compute centroids, reassign clusters, compute centroids, reassign clusters, converged.

Slide credit: Ray Mooney
SLIDE 12

K-Means Time Complexity

  • Assume computing the distance between two instances is O(d), where d is the dimensionality of the vectors
  • Reassigning clusters: O(knd) distance computations
  • Computing centroids: each instance vector gets added once to some centroid: O(nd)
  • Assume these two steps are each done once for I iterations: O(Iknd)

Slide credit: Ray Mooney
SLIDE 13

K-Means Clustering Objective

  • The objective of K-means is to minimize the total sum of the squared distances of every point to its corresponding cluster centroid:

$$\min_{\{\mu_k\}_{k=1}^{K}} \sum_{k=1}^{K} \sum_{x \in C_k} \|x - \mu_k\|^2, \qquad \mu_k = \frac{1}{|C_k|} \sum_{x \in C_k} x$$

  • Finding the global optimum is NP-hard.
  • The K-means algorithm is guaranteed to converge to a local optimum.
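For reference, a tiny NumPy sketch of this objective, assuming `centroids` and `labels` come from a clustering run such as the `kmeans` sketch earlier (names are illustrative):

```python
import numpy as np

def kmeans_objective(X, centroids, labels):
    # total squared distance of every point to its assigned centroid
    return np.sum((X - centroids[labels]) ** 2)
```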

SLIDE 14

Seed Choice

  • Results can vary based on random seed selection.
  • Some seeds can result in a poor convergence rate, or in convergence to sub-optimal clusterings.
  • Select good seeds using a heuristic or the results of another method.
SLIDE 15

Clustering Applications

  • Text mining
    • Cluster documents for related search
    • Cluster words for query suggestion
  • Recommender systems and advertising
    • Cluster users for item/ad recommendation
    • Cluster items for related item suggestion
  • Image search
    • Cluster images for similar image search and duplication detection
  • Speech recognition or separation
    • Cluster phonetic features
SLIDE 16

Principal Component Analysis (PCA)

  • An example of 2-dimensional data
    • x1: the piloting skill of a pilot
    • x2: how much he/she enjoys flying
  • Main components
    • u1: the intrinsic piloting "karma" of a person
    • u2: some noise

Example credit: Andrew Ng
SLIDE 17

Principal Component Analysis (PCA)

  • PCA tries to identify the subspace in which the data approximately lies
  • PCA uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components
  • The number of principal components is less than or equal to the smaller of the number of original variables or the number of observations:

$$\mathbb{R}^d \to \mathbb{R}^k, \qquad k \ll d$$
SLIDE 18

PCA Data Preprocessing

  • Typically we first pre-process the data to normalize its mean and variance
  • Given the dataset $D = \{x^{(i)}\}_{i=1}^{m}$
  • 1. Move the center of the data set to 0:

$$\mu = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}, \qquad x^{(i)} \leftarrow x^{(i)} - \mu$$

  • 2. Unify the variance of each variable:

$$\sigma_j^2 = \frac{1}{m} \sum_{i=1}^{m} \big(x_j^{(i)}\big)^2, \qquad x_j^{(i)} \leftarrow x_j^{(i)} / \sigma_j$$
SLIDE 19

PCA Data Preprocessing

  • Zero out the mean of the data
  • Rescale each coordinate to have unit variance, which ensures that different attributes are all treated on the same "scale"
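A minimal NumPy sketch of these two preprocessing steps; the helper name `preprocess` is an illustrative choice, not from the slides:

```python
import numpy as np

def preprocess(X):
    """Center each coordinate, then rescale it to unit variance. X is (m, d)."""
    X = X - X.mean(axis=0)                   # step 1: move the center of the data to 0
    sigma = np.sqrt((X ** 2).mean(axis=0))   # per-coordinate standard deviation after centering
    return X / sigma                         # step 2: unit variance for each coordinate
```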

SLIDE 20

PCA Solution

  • PCA finds the directions of largest variance in the data
  • These correspond to the eigenvectors of the matrix $X^\top X$ with the largest eigenvalues
SLIDE 21

PCA Solution: Data Projection

  • The projection of each point $x^{(i)}$ onto a direction $u$ (with $\|u\| = 1$) is $x^{(i)\top} u$
  • The variance of the projection is

$$\frac{1}{m}\sum_{i=1}^{m} \big(x^{(i)\top} u\big)^2 = \frac{1}{m}\sum_{i=1}^{m} u^\top x^{(i)} x^{(i)\top} u = u^\top \Big(\frac{1}{m}\sum_{i=1}^{m} x^{(i)} x^{(i)\top}\Big) u \equiv u^\top \Sigma u$$
SLIDE 22


PCA Solution: Largest Eigenvalues

  • Finding the k principal components of the data amounts to finding the k principal eigenvectors of Σ, i.e. the top-k eigenvectors with the largest eigenvalues:

$$\max_{u} \; u^\top \Sigma u \quad \text{s.t. } \|u\| = 1, \qquad \Sigma = \frac{1}{m}\sum_{i=1}^{m} x^{(i)} x^{(i)\top}$$

  • Projected vector for $x^{(i)}$:

$$y^{(i)} = \begin{bmatrix} u_1^\top x^{(i)} \\ u_2^\top x^{(i)} \\ \vdots \\ u_k^\top x^{(i)} \end{bmatrix} \in \mathbb{R}^k$$
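A minimal NumPy sketch of this procedure, assuming the data has already been centered and rescaled as above; the function name `pca` and the use of `np.linalg.eigh` for the eigendecomposition are choices of this write-up:

```python
import numpy as np

def pca(X, k):
    """Project preprocessed data X (m, d) onto its top-k principal components."""
    m = X.shape[0]
    Sigma = X.T @ X / m                      # (1/m) * sum_i x_i x_i^T
    w, U = np.linalg.eigh(Sigma)             # eigenvalues ascending, orthonormal eigenvectors
    top = U[:, np.argsort(w)[::-1][:k]]      # top-k eigenvectors with the largest eigenvalues
    return X @ top                           # rows are y_i = [u_1^T x_i, ..., u_k^T x_i]
```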

SLIDE 23

Eigendecomposition Revisit

  • For a positive semi-definite square matrix $\Sigma_{d \times d}$, suppose $u$ is an eigenvector with scalar eigenvalue $w$:

$$\Sigma u = w u, \qquad \|u\| = 1$$

  • There are d eigenvector-eigenvalue pairs $(u_i, w_i)$
  • These d eigenvectors are orthogonal, thus they form an orthonormal basis:

$$\sum_{i=1}^{d} u_i u_i^\top = I$$

  • Thus any vector v can be written as

$$v = \Big(\sum_{i=1}^{d} u_i u_i^\top\Big) v = \sum_{i=1}^{d} \big(u_i^\top v\big) u_i = \sum_{i=1}^{d} v_{(i)} u_i$$

  • $\Sigma_{d \times d}$ can be written as

$$\Sigma = \sum_{i=1}^{d} w_i u_i u_i^\top = U W U^\top, \qquad U = [u_1, u_2, \ldots, u_d], \qquad W = \mathrm{diag}(w_1, w_2, \ldots, w_d)$$
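A quick NumPy check of these identities on a random positive semi-definite matrix (illustrative only; the matrix is generated just for the demonstration):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
Sigma = A.T @ A                                      # positive semi-definite by construction
w, U = np.linalg.eigh(Sigma)                         # Sigma u_i = w_i u_i
assert np.allclose(U @ np.diag(w) @ U.T, Sigma)      # Sigma = U W U^T
assert np.allclose(U @ U.T, np.eye(5))               # orthonormal basis: sum_i u_i u_i^T = I
```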

SLIDE 24

Eigendecomposition Revisit

  • Given the data

$$X = \begin{bmatrix} x_1^\top \\ x_2^\top \\ \vdots \\ x_n^\top \end{bmatrix}$$

and its covariance matrix $\Sigma = X^\top X$ (here we may drop the factor $1/m$ for simplicity)

  • The variance in any direction v is

$$\|Xv\|^2 = \Big\|X \Big(\sum_{i=1}^{d} v_{(i)} u_i\Big)\Big\|^2 = \sum_{i,j} v_{(i)} \, u_i^\top \Sigma u_j \, v_{(j)} = \sum_{i=1}^{d} v_{(i)}^2 w_i$$

where $v_{(i)}$ is the projection length of v on $u_i$

  • The variance in the direction $u_i$ is

$$\|X u_i\|^2 = u_i^\top X^\top X u_i = u_i^\top \Sigma u_i = u_i^\top w_i u_i = w_i$$

  • If $v^\top v = 1$, then

$$\arg\max_{\|v\|=1} \|Xv\|^2 = u_{(\max)}$$

  • The direction of greatest variance is the eigenvector with the largest eigenvalue
SLIDE 25

PCA Discussion

  • PCA can also be derived by picking the basis that minimizes the approximation error arising from projecting the data onto the k-dimensional subspace spanned by it.
SLIDE 26

PCA Visualization

http://setosa.io/ev/principal-component-analysis/

SLIDE 27

PCA Visualization

http://setosa.io/ev/principal-component-analysis/

SLIDE 28

Content

  • Fundamentals of Unsupervised Learning
  • K-means clustering
  • Principal component analysis
  • Probabilistic Unsupervised Learning
  • Mixture Gaussians
  • EM Methods
SLIDE 29

Mixture Gaussian

SLIDE 30

Mixture Gaussian

SLIDE 31

Graphical Model for Mixture Gaussian

  • Given a training set $\{x^{(1)}, x^{(2)}, \ldots, x^{(m)}\}$
  • Model the data by specifying a joint distribution

$$p(x^{(i)}, z^{(i)}) = p(x^{(i)} \mid z^{(i)}) \, p(z^{(i)})$$

$$z^{(i)} \sim \mathrm{Multinomial}(\phi), \qquad p(z^{(i)} = j) = \phi_j, \qquad x^{(i)} \mid z^{(i)} = j \sim \mathcal{N}(\mu_j, \Sigma_j)$$

  • z is the latent variable: the Gaussian cluster ID, indicating which Gaussian each x comes from
  • x are the observed data points; φ are the parameters of the latent variable distribution
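A small NumPy sketch of sampling from this generative model; the particular values of φ, μ, and Σ below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
phi = np.array([0.3, 0.7])                           # p(z = j) = phi_j
mus = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]   # mu_j (illustrative values)
Sigmas = [np.eye(2), 0.5 * np.eye(2)]                # Sigma_j (illustrative values)

m = 500
z = rng.choice(len(phi), size=m, p=phi)              # z^(i) ~ Multinomial(phi)
X = np.array([rng.multivariate_normal(mus[j], Sigmas[j]) for j in z])  # x^(i) | z^(i)=j ~ N(mu_j, Sigma_j)
```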

SLIDE 32

Data Likelihood

  • We want to maximize the data log-likelihood

$$\ell(\phi, \mu, \Sigma) = \sum_{i=1}^{m} \log p(x^{(i)}; \phi, \mu, \Sigma) = \sum_{i=1}^{m} \log \sum_{z^{(i)}=1}^{k} p(x^{(i)} \mid z^{(i)}; \mu, \Sigma) \, p(z^{(i)}; \phi) = \sum_{i=1}^{m} \log \sum_{j=1}^{k} \mathcal{N}(x^{(i)} \mid \mu_j, \Sigma_j) \, \phi_j$$

  • There is no closed-form solution obtained by simply setting

$$\frac{\partial \ell(\phi, \mu, \Sigma)}{\partial \phi} = 0, \qquad \frac{\partial \ell(\phi, \mu, \Sigma)}{\partial \mu} = 0, \qquad \frac{\partial \ell(\phi, \mu, \Sigma)}{\partial \Sigma} = 0$$
SLIDE 33

Data Likelihood Maximization

  • For each data point $x^{(i)}$, the latent variable $z^{(i)}$ indicates which Gaussian it comes from
  • If we knew $z^{(i)}$, the data likelihood would be

$$\ell(\phi, \mu, \Sigma) = \sum_{i=1}^{m} \log p(x^{(i)}; \phi, \mu, \Sigma) = \sum_{i=1}^{m} \log \big[ p(x^{(i)} \mid z^{(i)}; \mu, \Sigma) \, p(z^{(i)}; \phi) \big] = \sum_{i=1}^{m} \Big[ \log \mathcal{N}(x^{(i)} \mid \mu_{z^{(i)}}, \Sigma_{z^{(i)}}) + \log p(z^{(i)}; \phi) \Big]$$
SLIDE 34

Data Likelihood Maximization

  • Given $z^{(i)}$, maximize the data likelihood

$$\max_{\phi, \mu, \Sigma} \; \ell(\phi, \mu, \Sigma) = \max_{\phi, \mu, \Sigma} \sum_{i=1}^{m} \Big[ \log \mathcal{N}(x^{(i)} \mid \mu_{z^{(i)}}, \Sigma_{z^{(i)}}) + \log p(z^{(i)}; \phi) \Big]$$

  • It is easy to get the solution:

$$\phi_j = \frac{1}{m} \sum_{i=1}^{m} 1\{z^{(i)} = j\}, \qquad \mu_j = \frac{\sum_{i=1}^{m} 1\{z^{(i)} = j\} \, x^{(i)}}{\sum_{i=1}^{m} 1\{z^{(i)} = j\}}, \qquad \Sigma_j = \frac{\sum_{i=1}^{m} 1\{z^{(i)} = j\} \, (x^{(i)} - \mu_j)(x^{(i)} - \mu_j)^\top}{\sum_{i=1}^{m} 1\{z^{(i)} = j\}}$$
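A minimal NumPy sketch of these closed-form estimates when the labels z are observed; the helper name `mle_given_z` is an illustrative choice, and it assumes every cluster label appears at least once:

```python
import numpy as np

def mle_given_z(X, z, k):
    """Closed-form MLE of (phi, mu, Sigma) when the cluster labels z are observed."""
    phi = np.array([(z == j).mean() for j in range(k)])            # fraction of points with z = j
    mu = np.array([X[z == j].mean(axis=0) for j in range(k)])      # per-cluster mean
    Sigma = []
    for j in range(k):
        diff = X[z == j] - mu[j]
        Sigma.append(diff.T @ diff / (z == j).sum())               # per-cluster covariance
    return phi, mu, np.array(Sigma)
```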

SLIDE 35

Latent Variable Inference

  • Given the parameters μ, Σ, φ, it is not hard to infer the posterior of the latent variable $z^{(i)}$ for each instance:

$$p(z^{(i)} = j \mid x^{(i)}; \phi, \mu, \Sigma) = \frac{p(z^{(i)} = j, x^{(i)}; \phi, \mu, \Sigma)}{p(x^{(i)}; \phi, \mu, \Sigma)} = \frac{p(x^{(i)} \mid z^{(i)} = j; \mu, \Sigma) \, p(z^{(i)} = j; \phi)}{\sum_{l=1}^{k} p(x^{(i)} \mid z^{(i)} = l; \mu, \Sigma) \, p(z^{(i)} = l; \phi)}$$

where
  • the prior of $z^{(i)}$ is $p(z^{(i)} = j; \phi)$
  • the likelihood is $p(x^{(i)} \mid z^{(i)} = j; \mu, \Sigma)$
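A minimal sketch of this posterior computation using SciPy's Gaussian density; the helper name `responsibilities` is an illustrative choice, not from the slides:

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, phi, mu, Sigma):
    """Posterior p(z = j | x) for every point in X; each row sums to 1."""
    k = len(phi)
    # unnormalized posterior: likelihood N(x | mu_j, Sigma_j) times prior phi_j
    w = np.column_stack([phi[j] * multivariate_normal.pdf(X, mu[j], Sigma[j])
                         for j in range(k)])
    return w / w.sum(axis=1, keepdims=True)   # normalize over j
```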

SLIDE 36

Expectation Maximization Methods

  • E-step: infer the posterior distribution of the latent variables given the model parameters
  • M-step: tune parameters to maximize the data likelihood given the latent variable distribution
  • EM methods
    • Iteratively execute the E-step and M-step until convergence
SLIDE 37

EM Methods for Mixture Gaussians

  • Mixture Gaussian example

Repeat until convergence: {

  (E-step) For each i, j, set

$$w_j^{(i)} = p(z^{(i)} = j \mid x^{(i)}; \phi, \mu, \Sigma)$$

  (M-step) Update the parameters

$$\phi_j = \frac{1}{m} \sum_{i=1}^{m} w_j^{(i)}, \qquad \mu_j = \frac{\sum_{i=1}^{m} w_j^{(i)} x^{(i)}}{\sum_{i=1}^{m} w_j^{(i)}}, \qquad \Sigma_j = \frac{\sum_{i=1}^{m} w_j^{(i)} (x^{(i)} - \mu_j)(x^{(i)} - \mu_j)^\top}{\sum_{i=1}^{m} w_j^{(i)}}$$

}
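Putting the E-step and M-step together, a compact NumPy/SciPy sketch of EM for a Gaussian mixture; the initialization strategy and the small regularization term added to Σ are choices of this write-up, not from the slides:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iters=100, seed=0):
    """EM for a mixture of Gaussians (sketch). X is an (m, d) data matrix."""
    m, d = X.shape
    rng = np.random.default_rng(seed)
    phi = np.full(k, 1.0 / k)                                 # uniform mixing weights to start
    mu = X[rng.choice(m, size=k, replace=False)]              # initialize means from random points
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])
    for _ in range(n_iters):
        # E-step: w[i, j] = p(z_i = j | x_i; phi, mu, Sigma)
        w = np.column_stack([phi[j] * multivariate_normal.pdf(X, mu[j], Sigma[j])
                             for j in range(k)])
        w /= w.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters with the soft counts
        Nj = w.sum(axis=0)
        phi = Nj / m
        mu = (w.T @ X) / Nj[:, None]
        for j in range(k):
            diff = X - mu[j]
            Sigma[j] = (w[:, j, None] * diff).T @ diff / Nj[j] + 1e-6 * np.eye(d)
    return phi, mu, Sigma
```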

SLIDE 38

General EM Methods

  • Claims:
    • 1. After each E-M step, the data likelihood will not decrease.
    • 2. The EM algorithm finds a (local) maximum of a latent variable model's likelihood.
  • Now let's discuss the general EM method and verify that it improves the data likelihood and that it converges.
SLIDE 39

Jensen’s Inequality

  • Theorem. Let f be a convex function, and let X be a random variable. Then:

$$E[f(X)] \geq f(E[X])$$

  • Moreover, if f is strictly convex, then $E[f(X)] = f(E[X])$ holds true if and only if $X = E[X]$ with probability 1 (i.e., if X is a constant).
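A quick numerical illustration of the inequality with the convex function f(x) = x² (the sampled random variable is made up for the demonstration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal(100_000)        # samples of a random variable X
f = np.square                           # f(x) = x^2 is convex
print(f(X).mean(), f(X.mean()))         # E[f(X)] >= f(E[X]) holds empirically (~1.0 vs ~0.0)
```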

SLIDE 40

Jensen’s Inequality

$$E[f(X)] \geq f(E[X])$$

Figure credit: Andrew Ng
SLIDE 41

Jensen’s Inequality

Figure credit: Maneesh Sahani

SLIDE 42

General EM Methods: Problem

  • Given the training dataset $D = \{x_i\}_{i=1,2,\ldots,N}$, let the machine learn the underlying patterns of the data
  • Assume latent variables $z \to x$
  • We wish to fit the parameters θ of a model p(x, z) to the data, where the log-likelihood is

$$\ell(\theta) = \sum_{i=1}^{N} \log p(x; \theta) = \sum_{i=1}^{N} \log \sum_{z} p(x, z; \theta)$$
SLIDE 43

General EM Methods: Problems

  • EM methods solve problems where explicitly finding the maximum likelihood estimate (MLE)

$$\theta^* = \arg\max_{\theta} \sum_{i=1}^{N} \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta)$$

is hard,

  • but where, given $z^{(i)}$ observed, the MLE is easy:

$$\theta^* = \arg\max_{\theta} \sum_{i=1}^{N} \log p(x^{(i)} \mid z^{(i)}; \theta)$$

  • EM methods give an efficient solution for the MLE by iteratively doing
    • E-step: construct a (good) lower bound of the log-likelihood
    • M-step: optimize that lower bound
SLIDE 44

General EM Methods: Lower Bound

  • For each instance i, let $q_i$ be some distribution over $z^{(i)}$:

$$\sum_{z} q_i(z) = 1, \qquad q_i(z) \geq 0$$

  • Thus the data log-likelihood

$$\ell(\theta) = \sum_{i=1}^{N} \log p(x^{(i)}; \theta) = \sum_{i=1}^{N} \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta) = \sum_{i=1}^{N} \log \sum_{z^{(i)}} q_i(z^{(i)}) \frac{p(x^{(i)}, z^{(i)}; \theta)}{q_i(z^{(i)})} \geq \sum_{i=1}^{N} \sum_{z^{(i)}} q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{q_i(z^{(i)})}$$

  • The last step follows from Jensen's inequality: log(x) is a concave function, so $\log E[X] \geq E[\log X]$
  • The right-hand side is a lower bound of ℓ(θ)
SLIDE 45

General EM Methods: Lower Bound

  • Then what $q_i(z)$ should we choose?

$$\ell(\theta) = \sum_{i=1}^{N} \log p(x^{(i)}; \theta) \geq \sum_{i=1}^{N} \sum_{z^{(i)}} q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{q_i(z^{(i)})}$$
SLIDE 46

Jensen’s Inequality

  • Theorem. Let f be a convex function, and let X be a random variable. Then:

$$E[f(X)] \geq f(E[X])$$

  • Moreover, if f is strictly convex, then $E[f(X)] = f(E[X])$ holds true if and only if $X = E[X]$ with probability 1 (i.e., if X is a constant).

REVIEW
SLIDE 47

General EM Methods: Lower Bound

  • Then what $q_i(z)$ should we choose?

$$\ell(\theta) = \sum_{i=1}^{N} \log p(x^{(i)}; \theta) \geq \sum_{i=1}^{N} \sum_{z^{(i)}} q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{q_i(z^{(i)})}$$

  • In order to make the above inequality tight (i.e. hold with equality), it is sufficient that

$$p(x^{(i)}, z^{(i)}; \theta) = q_i(z^{(i)}) \cdot c$$

for some constant c that does not depend on $z^{(i)}$

  • We can then derive

$$\log p(x^{(i)}; \theta) = \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta) = \log \sum_{z^{(i)}} q_i(z^{(i)}) \, c = \sum_{z^{(i)}} q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{q_i(z^{(i)})}$$

  • As such, $q_i(z)$ is the posterior distribution:

$$q_i(z^{(i)}) = \frac{p(x^{(i)}, z^{(i)}; \theta)}{\sum_{z} p(x^{(i)}, z; \theta)} = \frac{p(x^{(i)}, z^{(i)}; \theta)}{p(x^{(i)}; \theta)} = p(z^{(i)} \mid x^{(i)}; \theta)$$
SLIDE 48

General EM Methods

Repeat until convergence: {

  (E-step) For each i, set

$$q_i(z^{(i)}) = p(z^{(i)} \mid x^{(i)}; \theta)$$

  (M-step) Update the parameters

$$\theta = \arg\max_{\theta} \sum_{i=1}^{N} \sum_{z^{(i)}} q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{q_i(z^{(i)})}$$

}
SLIDE 49

Convergence of EM

  • Denote θ(t) and θ(t+1) as the parameters of two successive iterations of EM. We prove that

$$\ell(\theta^{(t)}) \leq \ell(\theta^{(t+1)})$$

which shows EM always monotonically improves the log-likelihood, and thus ensures EM will at least converge to a local optimum.
SLIDE 50

Proof of EM Convergence

  • Starting from θ(t), we choose the posterior of the latent variable

$$q_i^{(t)}(z^{(i)}) = p(z^{(i)} \mid x^{(i)}; \theta^{(t)})$$

  • This choice ensures Jensen's inequality holds with equality:

$$\ell(\theta^{(t)}) = \sum_{i=1}^{N} \log \sum_{z^{(i)}} q_i^{(t)}(z^{(i)}) \frac{p(x^{(i)}, z^{(i)}; \theta^{(t)})}{q_i^{(t)}(z^{(i)})} = \sum_{i=1}^{N} \sum_{z^{(i)}} q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t)})}{q_i^{(t)}(z^{(i)})}$$

  • The parameters θ(t+1) are then obtained by maximizing the right-hand side of the above equation
  • Thus

$$\ell(\theta^{(t+1)}) \geq \sum_{i=1}^{N} \sum_{z^{(i)}} q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t+1)})}{q_i^{(t)}(z^{(i)})} \quad \text{[lower bound]}$$

$$\geq \sum_{i=1}^{N} \sum_{z^{(i)}} q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t)})}{q_i^{(t)}(z^{(i)})} \quad \text{[parameter optimization]}$$

$$= \ell(\theta^{(t)})$$
SLIDE 51

Remark of EM Convergence

  • If we define

$$J(q, \theta) = \sum_{i=1}^{N} \sum_{z^{(i)}} q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{q_i(z^{(i)})}$$

then we know $\ell(\theta) \geq J(q, \theta)$

  • EM can also be viewed as coordinate ascent on J
    • The E-step maximizes it w.r.t. q
    • The M-step maximizes it w.r.t. θ
SLIDE 52

Coordinate Ascent in EM

$$q_i^{(t)}(z^{(i)}) = p(z^{(i)} \mid x^{(i)}; \theta^{(t)})$$

$$\theta^{(t+1)} = \arg\max_{\theta} \sum_{i=1}^{N} \sum_{z^{(i)}} q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{q_i^{(t)}(z^{(i)})}$$

Figure credit: Maneesh Sahani