SLIDE 1

Unsupervised Learning

Weinan Zhang Shanghai Jiao Tong University http://wnzhang.net 2019 CS420, Machine Learning, Lecture 9

http://wnzhang.net/teaching/cs420/index.html

SLIDE 2

What is Data Science

  • Physics
    • Goal: discover the underlying principles of the world
    • Solution: build a model of the world from observations
  • Data Science
    • Goal: discover the underlying principles of the data
    • Solution: build a model of the data from observations

$F = \frac{G m_1 m_2}{r^2}$

$p(x) = \frac{e^{f(x)}}{\sum_{x'} e^{f(x')}}$

SLIDE 3

Data Science

  • Mathematically
    • Find the joint data distribution $p(x)$
    • Then the conditional distribution $p(x_2 | x_1)$
  • Gaussian distribution
    • Univariate: $p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
    • Multivariate: $p(x) = \frac{1}{\sqrt{|2\pi\Sigma|}} \, e^{-\frac{1}{2}(x-\mu)^\top \Sigma^{-1} (x-\mu)}$

SLIDE 4

A Simple Example in User Behavior Modelling

  Interest   Gender   Age   BBC Sports   PubMed   Bloomberg Business   Spotify
  Finance    Male     29    Yes          No       Yes                  No
  Sports     Male     21    Yes          No       No                   Yes
  Medicine   Female   32    No           Yes      No                   No
  Music      Female   25    No           No       No                   Yes
  Medicine   Male     40    Yes          Yes      Yes                  No

  • Joint data distribution

p(Interest=Finance, Gender=Male, Age=29, Browsing=BBC Sports,Bloomberg Business)

  • Conditional data distribution

p(Interest=Finance | Browsing=BBC Sports,Bloomberg Business) p(Gender=Male | Browsing=BBC Sports,Bloomberg Business)
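Purely as an illustration of these two queries, the sketch below estimates them from the five-row toy table by simple counting; the record layout and the p_conditional helper are my own assumptions, not part of the slides.

```python
# Hypothetical sketch: empirical conditional probabilities from the toy table above.
records = [
    {"interest": "Finance",  "gender": "Male",   "age": 29, "browsing": {"BBC Sports", "Bloomberg Business"}},
    {"interest": "Sports",   "gender": "Male",   "age": 21, "browsing": {"BBC Sports", "Spotify"}},
    {"interest": "Medicine", "gender": "Female", "age": 32, "browsing": {"PubMed"}},
    {"interest": "Music",    "gender": "Female", "age": 25, "browsing": {"Spotify"}},
    {"interest": "Medicine", "gender": "Male",   "age": 40, "browsing": {"BBC Sports", "PubMed", "Bloomberg Business"}},
]

def p_conditional(target_key, target_value, browsed):
    """Empirical p(target = value | user browsed all sites in `browsed`)."""
    matching = [r for r in records if browsed <= r["browsing"]]
    if not matching:
        return 0.0
    hits = [r for r in matching if r[target_key] == target_value]
    return len(hits) / len(matching)

evidence = {"BBC Sports", "Bloomberg Business"}
print(p_conditional("interest", "Finance", evidence))  # 0.5 on this 5-row table
print(p_conditional("gender", "Male", evidence))       # 1.0 on this 5-row table
```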

SLIDE 5

Problem Setting

  • First build and learn p(x), then infer the conditional dependence p(x_t | x_i)
    • Unsupervised learning
    • Each dimension of x is equally treated
  • Directly learn the conditional dependence p(x_t | x_i)
    • Supervised learning
    • x_t is the label to predict

SLIDE 6

Definition of Unsupervised Learning

  • Given the training dataset $D = \{x_i\}_{i=1,2,\ldots,N}$, let the machine learn the underlying patterns of the data
  • Probability density function (p.d.f.) estimation: $p(x)$
  • Latent variables: $z \to x$
  • Good data representation (used for discrimination): $\phi(x)$

SLIDE 7

Uses of Unsupervised Learning

  • Data structure discovery, data science
  • Data compression
  • Outlier detection
  • Input to supervised/reinforcement algorithms (causes may be more simply related to outputs or rewards)

  • A theory of biological learning and perception

Slide credit: Maneesh Sahani

SLIDE 8

Content

  • Fundamentals of Unsupervised Learning
  • K-means clustering
  • Principal component analysis
  • Probabilistic Unsupervised Learning
  • Mixture Gaussians
  • EM Methods
  • Deep Unsupervised Learning
  • Auto-encoders
  • Generative adversarial nets
SLIDE 9

Content

  • Fundamentals of Unsupervised Learning
  • K-means clustering
  • Principal component analysis
  • Probabilistic Unsupervised Learning
  • Mixture Gaussians
  • EM Methods
  • Deep Unsupervised Learning
  • Auto-encoders
  • Generative adversarial nets
SLIDE 10

K-Means Clustering

SLIDE 11

K-Means Clustering

SLIDE 12

K-Means Clustering

  • Provide the number of desired clusters k
  • Randomly choose k instances as seeds, one per cluster, i.e. the centroid for each cluster
  • Iterate
    • Assign each instance to the cluster with the closest centroid
    • Re-estimate the centroid of each cluster
  • Stop when the clustering converges, or after a fixed number of iterations

Slide credit: Ray Mooney
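The procedure above translates almost directly into a few lines of NumPy. The sketch below is illustrative only (the function name, random initialization and convergence test are my choices, not from the slides), and it assumes Euclidean distance.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain K-means: X is an (n, d) data matrix, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Randomly choose k instances as the initial centroids (seeds).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign each instance to the cluster with the closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)  # (n, k)
        assign = dists.argmin(axis=1)
        # Re-estimate the centroid of each cluster as the mean of its points.
        new_centroids = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):  # stop when clustering converges
            break
        centroids = new_centroids
    return centroids, assign

# Toy usage: two well-separated blobs.
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centroids, assign = kmeans(X, k=2)
```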

SLIDE 13

K-Means Clustering: Centroid

  • Assume instances are real-valued vectors $x \in \mathbb{R}^d$
  • Clusters are based on centroids, i.e. the center of gravity or mean of the points in a cluster $C_k$:

$\mu_k = \frac{1}{|C_k|} \sum_{x \in C_k} x$

Slide credit: Ray Mooney

SLIDE 14

K-Means Clustering: Distance

  • Distance to a centroid: $L(x, \mu_k)$
  • Euclidean distance (L2 norm):

$L_2(x, \mu_k) = \|x - \mu_k\| = \sqrt{\sum_{m=1}^{d} (x_m - \mu_{k,m})^2}$

  • Manhattan distance (L1 norm):

$L_1(x, \mu_k) = |x - \mu_k| = \sum_{m=1}^{d} |x_m - \mu_{k,m}|$

  • Cosine distance:

$L_{\cos}(x, \mu_k) = 1 - \frac{x^\top \mu_k}{|x| \cdot |\mu_k|}$

Slide credit: Ray Mooney
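As a small hedged sketch (my own code, not from the slides), the three distance measures can be written as NumPy helpers:

```python
import numpy as np

def l2_distance(x, mu):
    # Euclidean (L2) distance between a point and a centroid.
    return np.sqrt(np.sum((x - mu) ** 2))

def l1_distance(x, mu):
    # L1 (Manhattan) distance: sum of absolute coordinate differences.
    return np.sum(np.abs(x - mu))

def cosine_distance(x, mu):
    # Cosine distance: 1 minus the cosine similarity of the two vectors.
    return 1.0 - x @ mu / (np.linalg.norm(x) * np.linalg.norm(mu))
```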

SLIDE 15

K-Means Example (K=2)

[Figure: K-means with K=2: pick seeds, assign points to clusters, compute centroids, re-assign clusters, compute centroids, re-assign clusters, converged]

Slide credit: Ray Mooney

SLIDE 16

K-Means Time Complexity

  • Assume computing the distance between two instances is O(d), where d is the dimensionality of the vectors
  • Reassigning clusters: O(knd) distance computations
  • Computing centroids: each instance vector gets added once to some centroid: O(nd)
  • Assume these two steps are each done once for I iterations: O(Iknd)

Slide credit: Ray Mooney

SLIDE 17

K-Means Clustering Objective

  • The objective of K-means is to minimize the total sum of the squared distance of every point to its corresponding cluster centroid:

$\min_{\{\mu_k\}_{k=1}^{K}} \sum_{k=1}^{K} \sum_{x \in C_k} L(x - \mu_k), \qquad \mu_k = \frac{1}{|C_k|} \sum_{x \in C_k} x$

  • Finding the global optimum is NP-hard.
  • The K-means algorithm is guaranteed to converge to a local optimum.

SLIDE 18

Seed Choice

  • Results can vary based on random seed selection.
  • Some seeds can result in a poor convergence rate, or in convergence to sub-optimal clusterings.
  • Select good seeds using a heuristic or the results of another method.

SLIDE 19

Clustering Applications

  • Text mining
    • Cluster documents for related search
    • Cluster words for query suggestion
  • Recommender systems and advertising
    • Cluster users for item/ad recommendation
    • Cluster items for related item suggestion
  • Image search
    • Cluster images for similar image search and duplication detection
  • Speech recognition or separation
    • Cluster phonetic features
SLIDE 20

Principal Component Analysis (PCA)

  • An example of 2-dimensional data
    • x1: the piloting skill of the pilot
    • x2: how much he/she enjoys flying
  • Main components
    • u1: the intrinsic piloting “karma” of a person
    • u2: some noise

Example credit: Andrew Ng

SLIDE 21

Principal Component Analysis (PCA)

  • PCA tries to identify the subspace in which the data approximately lies
  • PCA uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.
  • The number of principal components is less than or equal to the smaller of the number of original variables or the number of observations.

$\mathbb{R}^d \to \mathbb{R}^k, \quad k \ll d$

SLIDE 22

PCA Data Preprocessing

  • Typically we first pre-process the data to normalize its mean and variance
  • Given the dataset $D = \{x^{(i)}\}_{i=1}^{m}$
  • 1. Move the center of the data set to 0:

$\mu = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}, \qquad x^{(i)} \leftarrow x^{(i)} - \mu$

  • 2. Unify the variance of each variable:

$\sigma_j^2 = \frac{1}{m} \sum_{i=1}^{m} (x_j^{(i)})^2, \qquad x_j^{(i)} \leftarrow x_j^{(i)} / \sigma_j$
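A minimal sketch of these two preprocessing steps in NumPy (the function and variable names are mine):

```python
import numpy as np

def pca_preprocess(X):
    """Center each feature at 0 and rescale it to unit variance; X has shape (m, d)."""
    mu = X.mean(axis=0)                              # step 1: mean of each coordinate
    X_centered = X - mu                              # move the center of the data set to 0
    sigma = np.sqrt((X_centered ** 2).mean(axis=0))  # per-coordinate standard deviation
    sigma[sigma == 0] = 1.0                          # guard against constant features
    return X_centered / sigma                        # step 2: unify the variance of each variable
```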

SLIDE 23

PCA Data Preprocessing

  • Zero out the mean of the data
  • Rescale each coordinate to have unit variance, which ensures that different attributes are all treated on the same “scale”.

SLIDE 24

PCA Solution

  • PCA finds the directions along which the data has the largest variance
  • These correspond to the eigenvectors of the matrix $X^\top X$ with the largest eigenvalues

SLIDE 25

PCA Solution: Data Projection

  • The projection of each point $x^{(i)}$ onto a direction $u$ (with $\|u\| = 1$) is $x^{(i)\top} u$
  • The variance of the projection:

$\frac{1}{m} \sum_{i=1}^{m} (x^{(i)\top} u)^2 = \frac{1}{m} \sum_{i=1}^{m} u^\top x^{(i)} x^{(i)\top} u = u^\top \Big( \frac{1}{m} \sum_{i=1}^{m} x^{(i)} x^{(i)\top} \Big) u \equiv u^\top \Sigma u$

SLIDE 26


PCA Solution: Largest Eigenvalues

  • Finding the k principal components of the data means finding the k principal eigenvectors of $\Sigma$, i.e. the top-k eigenvectors with the largest eigenvalues:

$\max_{u} \; u^\top \Sigma u \quad \text{s.t.} \; \|u\| = 1, \qquad \Sigma = \frac{1}{m} \sum_{i=1}^{m} x^{(i)} x^{(i)\top}$

  • Projected vector for $x^{(i)}$:

$y^{(i)} = \begin{bmatrix} u_1^\top x^{(i)} \\ u_2^\top x^{(i)} \\ \vdots \\ u_k^\top x^{(i)} \end{bmatrix} \in \mathbb{R}^k$
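Putting the last two slides together, a hedged NumPy sketch of PCA via eigendecomposition of the covariance matrix (it assumes the data has already been preprocessed as above; the names are my own):

```python
import numpy as np

def pca(X, k):
    """Project preprocessed data X of shape (m, d) onto its top-k principal components."""
    m = X.shape[0]
    Sigma = X.T @ X / m                 # empirical covariance (data already centered)
    w, U = np.linalg.eigh(Sigma)        # eigenvalues in ascending order, eigenvectors as columns
    top = np.argsort(w)[::-1][:k]       # indices of the k largest eigenvalues
    U_k = U[:, top]                     # (d, k) matrix of principal eigenvectors
    Y = X @ U_k                         # y^(i) = [u_1^T x^(i), ..., u_k^T x^(i)]
    return Y, U_k, w[top]
```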

SLIDE 27

Eigendecomposition Revisit

  • For a positive semi-definite square matrix $\Sigma_{d \times d}$, suppose $u$ is an eigenvector with scalar eigenvalue $w$:

$\Sigma u = w u, \qquad \|u\| = 1$

  • There are d eigenvector-eigenvalue pairs $(u_i, w_i)$
  • These d eigenvectors are orthogonal, thus they form an orthonormal basis:

$\sum_{i=1}^{d} u_i u_i^\top = I$

  • Thus any vector $v$ can be written as

$v = \Big( \sum_{i=1}^{d} u_i u_i^\top \Big) v = \sum_{i=1}^{d} (u_i^\top v) u_i = \sum_{i=1}^{d} v_{(i)} u_i$

  • $\Sigma_{d \times d}$ can be written as

$\Sigma = \sum_{i=1}^{d} w_i u_i u_i^\top = U W U^\top, \qquad U = [u_1, u_2, \ldots, u_d], \quad W = \mathrm{diag}(w_1, w_2, \ldots, w_d)$

SLIDE 28

Eigendecomposition Revisit

  • Given the data

$X = \begin{bmatrix} x_1^\top \\ x_2^\top \\ \vdots \\ x_n^\top \end{bmatrix}$

and its covariance matrix $\Sigma = X^\top X$ (here we may drop the factor 1/m for simplicity)

  • The variance in any direction $v$ is

$\|X v\|^2 = \Big\| X \Big( \sum_{i=1}^{d} v_{(i)} u_i \Big) \Big\|^2 = \sum_{i,j} v_{(i)} v_{(j)} u_i^\top \Sigma u_j = \sum_{i=1}^{d} v_{(i)}^2 w_i$

where $v_{(i)}$ is the projection length of $v$ on $u_i$

  • The variance in direction $u_i$ is

$\|X u_i\|^2 = u_i^\top X^\top X u_i = u_i^\top \Sigma u_i = u_i^\top w_i u_i = w_i$

  • If $v^\top v = 1$, then

$\arg\max_{\|v\|=1} \|X v\|^2 = u_{(\max)}$

i.e. the direction of greatest variance is the eigenvector with the largest eigenvalue.

SLIDE 29

PCA Discussion

  • PCA can also be derived by picking the basis that minimizes the approximation error arising from projecting the data onto the k-dimensional subspace spanned by that basis.

SLIDE 30

PCA Visualization

http://setosa.io/ev/principal-component-analysis/

SLIDE 31

PCA Visualization

http://setosa.io/ev/principal-component-analysis/

SLIDE 32

Content

  • Fundamentals of Unsupervised Learning
  • K-means clustering
  • Principal component analysis
  • Probabilistic Unsupervised Learning
  • Mixture Gaussians
  • EM Methods
  • Deep Unsupervised Learning
  • Auto-encoders
  • Generative adversarial nets
SLIDE 33

Mixture Gaussian

SLIDE 34

Mixture Gaussian

SLIDE 35

Graphical Model for Mixture Gaussian

  • Given a training set $\{x^{(1)}, x^{(2)}, \ldots, x^{(m)}\}$
  • Model the data by specifying a joint distribution

$p(x^{(i)}, z^{(i)}) = p(x^{(i)} | z^{(i)}) \, p(z^{(i)}), \qquad z^{(i)} \sim \mathrm{Multinomial}(\phi), \quad p(z^{(i)} = j) = \phi_j, \qquad x^{(i)} | z^{(i)} = j \sim \mathcal{N}(\mu_j, \Sigma_j)$

  • z is the latent variable: the Gaussian cluster ID, indicating which Gaussian each x comes from
  • x are the observed data points
  • φ holds the parameters of the latent variable distribution
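The generative story (draw a cluster ID from the multinomial, then a point from that Gaussian) can be simulated directly; a small illustrative sketch with made-up parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up mixture parameters, purely for illustration (k = 2 Gaussians in 2-D).
phi = np.array([0.3, 0.7])                       # p(z = j) = phi_j
mus = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]
Sigmas = [np.eye(2), 0.5 * np.eye(2)]

def sample_mixture(n):
    zs = rng.choice(len(phi), size=n, p=phi)     # z^(i) ~ Multinomial(phi)
    xs = np.array([rng.multivariate_normal(mus[z], Sigmas[z]) for z in zs])
    return xs, zs                                # x^(i) | z^(i) = j ~ N(mu_j, Sigma_j)

X, Z = sample_mixture(500)
```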

SLIDE 36

Data Likelihood

  • We want to maximize the data log-likelihood

$l(\phi, \mu, \Sigma) = \sum_{i=1}^{m} \log p(x^{(i)}; \phi, \mu, \Sigma) = \sum_{i=1}^{m} \log \sum_{z^{(i)}=1}^{k} p(x^{(i)} | z^{(i)}; \mu, \Sigma) \, p(z^{(i)}; \phi) = \sum_{i=1}^{m} \log \sum_{j=1}^{k} \mathcal{N}(x^{(i)} | \mu_j, \Sigma_j) \, \phi_j$

  • There is no closed-form solution obtained by simply setting

$\frac{\partial l(\phi, \mu, \Sigma)}{\partial \phi} = 0, \qquad \frac{\partial l(\phi, \mu, \Sigma)}{\partial \mu} = 0, \qquad \frac{\partial l(\phi, \mu, \Sigma)}{\partial \Sigma} = 0$

SLIDE 37

Data Likelihood Maximization

  • For each data point x(i), the latent variable z(i) indicates which Gaussian it comes from
  • If we knew z(i), the data likelihood would be

$l(\phi, \mu, \Sigma) = \sum_{i=1}^{m} \log p(x^{(i)}; \phi, \mu, \Sigma) = \sum_{i=1}^{m} \log \big[ p(x^{(i)} | z^{(i)}; \mu, \Sigma) \, p(z^{(i)}; \phi) \big] = \sum_{i=1}^{m} \Big[ \log \mathcal{N}(x^{(i)} | \mu_{z^{(i)}}, \Sigma_{z^{(i)}}) + \log p(z^{(i)}; \phi) \Big]$

SLIDE 38

Data Likelihood Maximization

  • Given z(i), maximize the data likelihood

$\max_{\phi, \mu, \Sigma} l(\phi, \mu, \Sigma) = \max_{\phi, \mu, \Sigma} \sum_{i=1}^{m} \Big[ \log \mathcal{N}(x^{(i)} | \mu_{z^{(i)}}, \Sigma_{z^{(i)}}) + \log p(z^{(i)}; \phi) \Big]$

  • It is easy to get the solution

$\phi_j = \frac{1}{m} \sum_{i=1}^{m} \mathbf{1}\{z^{(i)} = j\}, \qquad \mu_j = \frac{\sum_{i=1}^{m} \mathbf{1}\{z^{(i)} = j\} \, x^{(i)}}{\sum_{i=1}^{m} \mathbf{1}\{z^{(i)} = j\}}, \qquad \Sigma_j = \frac{\sum_{i=1}^{m} \mathbf{1}\{z^{(i)} = j\} \, (x^{(i)} - \mu_j)(x^{(i)} - \mu_j)^\top}{\sum_{i=1}^{m} \mathbf{1}\{z^{(i)} = j\}}$

SLIDE 39

Latent Variable Inference

  • Given the parameters μ, Σ, ϕ, it is not hard to infer the posterior of the latent variable z(i) for each instance:

$p(z^{(i)} = j \,|\, x^{(i)}; \phi, \mu, \Sigma) = \frac{p(z^{(i)} = j, x^{(i)}; \phi, \mu, \Sigma)}{p(x^{(i)}; \phi, \mu, \Sigma)} = \frac{p(x^{(i)} | z^{(i)} = j; \mu, \Sigma) \, p(z^{(i)} = j; \phi)}{\sum_{l=1}^{k} p(x^{(i)} | z^{(i)} = l; \mu, \Sigma) \, p(z^{(i)} = l; \phi)}$

where

  • the prior of z(i) is $p(z^{(i)} = j; \phi)$
  • the likelihood is $p(x^{(i)} | z^{(i)} = j; \mu, \Sigma)$

  • Then update the parameters μ, Σ, ϕ based on our guess of the z(i)’s

SLIDE 40

Expectation Maximization Methods

  • E-step: infer the posterior distribution of the latent variables given the model parameters
  • M-step: tune the parameters to maximize the data likelihood given the latent variable distribution
  • EM methods: iteratively execute the E-step and M-step until convergence

SLIDE 41

EM Methods for Mixture Gaussians

  • Mixture Gaussian example

Repeat until convergence: {

  (E-step) For each i, j, set

  $w_j^{(i)} = p(z^{(i)} = j \,|\, x^{(i)}; \phi, \mu, \Sigma)$

  (M-step) Update the parameters

  $\phi_j = \frac{1}{m} \sum_{i=1}^{m} w_j^{(i)}, \qquad \mu_j = \frac{\sum_{i=1}^{m} w_j^{(i)} x^{(i)}}{\sum_{i=1}^{m} w_j^{(i)}}, \qquad \Sigma_j = \frac{\sum_{i=1}^{m} w_j^{(i)} (x^{(i)} - \mu_j)(x^{(i)} - \mu_j)^\top}{\sum_{i=1}^{m} w_j^{(i)}}$

}
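The E-step and M-step above map almost line-for-line onto NumPy. The sketch below is my own (it uses scipy's Gaussian density and adds a small covariance regularizer, details the slides do not specify):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, k, n_iters=50, seed=0):
    """EM for a mixture of k Gaussians on data X of shape (m, d)."""
    m, d = X.shape
    rng = np.random.default_rng(seed)
    phi = np.full(k, 1.0 / k)
    mu = X[rng.choice(m, size=k, replace=False)]              # initialize means at random points
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])
    for _ in range(n_iters):
        # E-step: w[i, j] = p(z^(i) = j | x^(i); phi, mu, Sigma)
        w = np.array([phi[j] * multivariate_normal.pdf(X, mu[j], Sigma[j])
                      for j in range(k)]).T                   # shape (m, k)
        w /= w.sum(axis=1, keepdims=True)
        # M-step: re-estimate phi, mu, Sigma with the soft assignments w
        Nk = w.sum(axis=0)                                    # effective cluster sizes
        phi = Nk / m
        mu = (w.T @ X) / Nk[:, None]
        for j in range(k):
            diff = X - mu[j]
            Sigma[j] = (w[:, j, None] * diff).T @ diff / Nk[j] + 1e-6 * np.eye(d)
    return phi, mu, Sigma, w
```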

SLIDE 42

General EM Methods

  • Claims:
    • 1. After each E-M step, the data likelihood will not decrease
    • 2. The EM algorithm finds a (local) maximum of a latent variable model likelihood
  • Now let’s discuss the general EM method and verify its effectiveness in improving the data likelihood, as well as its convergence

SLIDE 43

Jensen’s Inequality

  • Theorem. Let f be a convex function, and let X be a random variable. Then:

$E[f(X)] \ge f(E[X])$

  • Moreover, if f is strictly convex, then $E[f(X)] = f(E[X])$ holds true if and only if $X = E[X]$ with probability 1 (i.e., if X is a constant).

SLIDE 44

Jensen’s Inequality

$E[f(X)] \ge f(E[X])$

Figure credit: Andrew Ng

SLIDE 45

Jensen’s Inequality

Figure credit: Maneesh Sahani

SLIDE 46

General EM Methods: Problem

  • Given the training dataset $D = \{x_i\}_{i=1,2,\ldots,N}$, let the machine learn the underlying patterns of the data
  • Assume latent variables $z \to x$
  • We wish to fit the parameters of a model p(x, z) to the data, where the log-likelihood is

$l(\theta) = \sum_{i=1}^{N} \log p(x^{(i)}; \theta) = \sum_{i=1}^{N} \log \sum_{z} p(x^{(i)}, z; \theta)$

SLIDE 47

General EM Methods: Problems

  • EM methods solve problems where explicitly finding the maximum likelihood estimate (MLE) is hard:

$\theta^* = \arg\max_{\theta} \sum_{i=1}^{N} \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta)$

  • But given z(i) observed, the MLE is easy:

$\theta^* = \arg\max_{\theta} \sum_{i=1}^{N} \log p(x^{(i)} | z^{(i)}; \theta)$

  • EM methods give an efficient solution for MLE by iteratively doing
    • E-step: construct a (good) lower bound of the log-likelihood
    • M-step: optimize that lower bound

SLIDE 48

General EM Methods: Lower Bound

  • For each instance i, let qi be some distribution over z(i):

$\sum_{z} q_i(z) = 1, \qquad q_i(z) \ge 0$

  • Thus the data log-likelihood

$l(\theta) = \sum_{i=1}^{N} \log p(x^{(i)}; \theta) = \sum_{i=1}^{N} \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta) = \sum_{i=1}^{N} \log \sum_{z^{(i)}} q_i(z^{(i)}) \frac{p(x^{(i)}, z^{(i)}; \theta)}{q_i(z^{(i)})} \ge \sum_{i=1}^{N} \sum_{z^{(i)}} q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{q_i(z^{(i)})}$

  • The last step is Jensen’s inequality: log(x) is a concave function, so $\log E[X] \ge E[\log X]$
  • The right-hand side is a lower bound of l(θ)

SLIDE 49

General EM Methods: Lower Bound

  • Then what qi(z) should we choose?

$l(\theta) = \sum_{i=1}^{N} \log p(x^{(i)}; \theta) \ge \sum_{i=1}^{N} \sum_{z^{(i)}} q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{q_i(z^{(i)})}$

SLIDE 50

Jensen’s Inequality

  • Theorem. Let f be a convex function, and let X be a random variable. Then:

$E[f(X)] \ge f(E[X])$

  • Moreover, if f is strictly convex, then $E[f(X)] = f(E[X])$ holds true if and only if $X = E[X]$ with probability 1 (i.e., if X is a constant).

REVIEW

SLIDE 51

General EM Methods: Lower Bound

  • Then what qi(z) should we choose?

$l(\theta) = \sum_{i=1}^{N} \log p(x^{(i)}; \theta) \ge \sum_{i=1}^{N} \sum_{z^{(i)}} q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{q_i(z^{(i)})}$

  • In order to make the above inequality tight (to hold with equality), it is sufficient that

$p(x^{(i)}, z^{(i)}; \theta) = q_i(z^{(i)}) \cdot c$

  • We can derive

$\log p(x^{(i)}; \theta) = \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta) = \log \sum_{z^{(i)}} q_i(z^{(i)}) \, c = \sum_{z^{(i)}} q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{q_i(z^{(i)})}$

  • As such, qi(z) is exactly the posterior distribution:

$q_i(z^{(i)}) = \frac{p(x^{(i)}, z^{(i)}; \theta)}{\sum_{z} p(x^{(i)}, z; \theta)} = \frac{p(x^{(i)}, z^{(i)}; \theta)}{p(x^{(i)}; \theta)} = p(z^{(i)} | x^{(i)}; \theta)$

SLIDE 52

General EM Methods

Repeat until convergence: {

  (E-step) For each i, set

  $q_i(z^{(i)}) = p(z^{(i)} | x^{(i)}; \theta)$

  (M-step) Update the parameters

  $\theta = \arg\max_{\theta} \sum_{i=1}^{N} \sum_{z^{(i)}} q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{q_i(z^{(i)})}$

}
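Schematically, the general recipe is just an alternation between two functions; a minimal sketch in which e_step and m_step are placeholders for any latent-variable model (for example, the mixture-of-Gaussians updates above):

```python
def run_em(theta, e_step, m_step, n_iters=100):
    """Generic EM driver: e_step returns q(z | x; theta), m_step maximizes the lower bound."""
    for _ in range(n_iters):
        q = e_step(theta)   # E-step: posterior of the latent variables under the current theta
        theta = m_step(q)   # M-step: argmax over theta of the q-weighted expected log p(x, z; theta)
    return theta
```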

SLIDE 53

Convergence of EM

  • Denote θ(t) and θ(t+1) as the parameters of two successive iterations of EM. We prove that

$l(\theta^{(t)}) \le l(\theta^{(t+1)})$

which shows that EM always monotonically improves the log-likelihood, and thus ensures that EM will at least converge to a local optimum.

SLIDE 54

Proof of EM Convergence

  • Starting from θ(t), we choose the posterior of the latent variable

$q_i^{(t)}(z^{(i)}) = p(z^{(i)} | x^{(i)}; \theta^{(t)})$

  • This choice ensures that Jensen’s inequality holds with equality:

$l(\theta^{(t)}) = \sum_{i=1}^{N} \log \sum_{z^{(i)}} q_i^{(t)}(z^{(i)}) \frac{p(x^{(i)}, z^{(i)}; \theta^{(t)})}{q_i^{(t)}(z^{(i)})} = \sum_{i=1}^{N} \sum_{z^{(i)}} q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t)})}{q_i^{(t)}(z^{(i)})}$

  • The parameters θ(t+1) are then obtained by maximizing the right-hand side of the above equation
  • Thus

$l(\theta^{(t+1)}) \ge \sum_{i=1}^{N} \sum_{z^{(i)}} q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t+1)})}{q_i^{(t)}(z^{(i)})} \ge \sum_{i=1}^{N} \sum_{z^{(i)}} q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t)})}{q_i^{(t)}(z^{(i)})} = l(\theta^{(t)})$

where the first inequality is the lower bound (Jensen) and the second follows from the parameter optimization in the M-step.

SLIDE 55

Remark of EM Convergence

  • If we define

$J(q, \theta) = \sum_{i=1}^{N} \sum_{z^{(i)}} q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{q_i(z^{(i)})}$

then we know $l(\theta) \ge J(q, \theta)$

  • EM can also be viewed as coordinate ascent on J
    • The E-step maximizes it w.r.t. q
    • The M-step maximizes it w.r.t. θ

SLIDE 56

Coordinate Ascent in EM

[Figure: coordinate ascent alternately maximizing J(q, θ) over q and over θ. Figure credit: Maneesh Sahani]

$q_i^{(t)}(z^{(i)}) = p(z^{(i)} | x^{(i)}; \theta^{(t)}), \qquad \theta^{(t+1)} = \arg\max_{\theta} \sum_{i=1}^{N} \sum_{z^{(i)}} q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{q_i^{(t)}(z^{(i)})}$

SLIDE 57

Content

  • Fundamentals of Unsupervised Learning
  • K-means clustering
  • Principal component analysis
  • Probabilistic Unsupervised Learning
  • Mixture Gaussians
  • EM Methods
  • Deep Unsupervised Learning
  • Auto-encoders
  • Generative adversarial nets
SLIDE 58

Neural Nets for Unsupervised Learning

  • Basic idea: use neural networks to recover the data
  • Restricted Boltzmann Machine

SLIDE 59

Restricted Boltzmann Machine

  • An RBM is a generative stochastic artificial neural network that can learn a probability distribution over its set of inputs
  • Undirected graphical model over visible units v and hidden units h
  • Restricted: visible (hidden) units are not connected to each other
  • Energy function and joint distribution:

$E(v, h) = -\sum_{i} b_i v_i - \sum_{j} b_j h_j - \sum_{i,j} v_i w_{i,j} h_j, \qquad p(v, h) = \frac{1}{Z} e^{-E(v, h)}$
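The energy and the unnormalized joint probability can be evaluated directly; a small NumPy sketch for binary units (the names are mine; the partition function Z is left out because it sums over all configurations):

```python
import numpy as np

def rbm_energy(v, h, b_v, b_h, W):
    """E(v, h) = -sum_i b_i v_i - sum_j b_j h_j - sum_ij v_i W_ij h_j."""
    return -(b_v @ v) - (b_h @ h) - v @ W @ h

def unnormalized_p(v, h, b_v, b_h, W):
    # p(v, h) = exp(-E(v, h)) / Z; Z is omitted here, so this is only proportional to p.
    return np.exp(-rbm_energy(v, h, b_v, b_h, W))
```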

SLIDE 60

Deep Belief Networks

Hinton, Geoffrey E., and Ruslan R. Salakhutdinov. "Reducing the dimensionality of data with neural networks." science 313.5786 (2006): 504-507.

SLIDE 61

Performance of Latent Factor Analysis

[Figure: 2-D codes from latent semantic analysis based on PCA vs. a 2000-500-250-125-2 autoencoder trained as a DBN]

Hinton, Geoffrey E., and Ruslan R. Salakhutdinov. "Reducing the dimensionality of data with neural networks." science 313.5786 (2006): 504-507.

SLIDE 62

Auto-encoder

  • An auto-encoder is an artificial neural net used for unsupervised learning of efficient codings
  • It learns a representation (encoding) for a set of data, typically for the purpose of dimensionality reduction

$z = \sigma(W_1 x + b_1), \qquad \tilde{x} = \sigma(W_2 z + b_2)$

  • z is regarded as the low-dimensional latent factor representation of x

SLIDE 63

Learning Auto-encoder

  • Objective: squared difference between $\tilde{x}$ and $x$

$J(W_1, b_1, W_2, b_2) = \sum_{i=1}^{m} (\tilde{x}^{(i)} - x^{(i)})^2 = \sum_{i=1}^{m} \big( W_2 z^{(i)} + b_2 - x^{(i)} \big)^2 = \sum_{i=1}^{m} \big( W_2 \sigma(W_1 x^{(i)} + b_1) + b_2 - x^{(i)} \big)^2$

  • Trained by gradient descent: $\theta \leftarrow \theta - \eta \frac{\partial J}{\partial \theta}$
  • An auto-encoder is an unsupervised learning model trained in a supervised fashion
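A minimal training sketch of this one-hidden-layer auto-encoder, written with PyTorch (the slides do not specify a framework, so that choice is an assumption of mine), using the squared reconstruction error as the objective:

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, d_in, d_latent):
        super().__init__()
        self.enc = nn.Linear(d_in, d_latent)   # W1, b1
        self.dec = nn.Linear(d_latent, d_in)   # W2, b2

    def forward(self, x):
        z = torch.sigmoid(self.enc(x))         # z = sigma(W1 x + b1)
        return self.dec(z), z                  # x_tilde = W2 z + b2 (linear output, as in J)

model = AutoEncoder(d_in=784, d_latent=32)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.rand(64, 784)                        # a toy batch standing in for real data

for _ in range(100):
    x_tilde, z = model(x)
    loss = ((x_tilde - x) ** 2).sum()          # J = sum_i (x_tilde - x)^2
    opt.zero_grad()
    loss.backward()                            # theta <- theta - eta * dJ/dtheta
    opt.step()
```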

SLIDE 64

Denoising Auto-encoder

  • The clean input x is partially destroyed, yielding a corrupted input $\tilde{x} \sim q_D(\tilde{x} | x)$, e.g. by Gaussian noise
  • The corrupted input $\tilde{x}$ is mapped to a hidden representation $z = f_\theta(\tilde{x})$
  • From z, reconstruct the data $\hat{x} = g_{\theta'}(z)$, with reconstruction loss $L(x, \hat{x})$
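The corruption step is typically a single line; a sketch for the Gaussian-noise case mentioned on the slide (the noise level is an arbitrary choice of mine):

```python
import torch

def corrupt(x, noise_std=0.1):
    # q_D(x_tilde | x): add isotropic Gaussian noise to the clean input.
    return x + noise_std * torch.randn_like(x)

# Denoising training then reconstructs the *clean* x from corrupt(x):
# x_hat, _ = model(corrupt(x)); loss = ((x_hat - x) ** 2).sum()
```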

SLIDE 65

Stacked Auto-encoder

  • Layer-by-layer training
    • 1. Train the first layer to use z1 to reconstruct x
    • 2. Train the second layer to use z2 to reconstruct z1
    • 3. Train the third layer to use z3 to reconstruct z2

SLIDE 66

Some Denoising AE Examples

[Figure: original, corrupted, and reconstructed image examples]

SLIDE 67

Generative Adversarial Networks (GANs)

[Goodfellow, I., et al. 2014. Generative adversarial nets. In NIPS 2014.]

SLIDE 68

Problem Definition

  • Given a dataset $D = \{x\}$, build a model $q_\theta(x)$ of the data distribution that fits the true one $p(x)$
  • Traditional objective: maximum likelihood estimation (MLE)

$\max_{\theta} \frac{1}{|D|} \sum_{x \in D} \log q_\theta(x) \simeq \max_{\theta} \mathbb{E}_{x \sim p(x)}[\log q_\theta(x)]$

  • This checks whether true data lies in regions of high probability density under the learned model

SLIDE 69

Inconsistency of Evaluation and Use

  • Training/evaluation:

$\max_{\theta} \mathbb{E}_{x \sim p(x)}[\log q_\theta(x)], \quad \text{approximated by} \quad \max_{\theta} \frac{1}{|D|} \sum_{x \in D} \log q_\theta(x)$

    • Check whether true data has a high probability density under the learned model

  • Use:

$\max_{\theta} \mathbb{E}_{x \sim q_\theta(x)}[\log p(x)]$

    • Check whether model-generated data is considered as true as possible, given a generator q with a certain generalization ability
    • More straightforward, but it is hard or impossible to directly calculate p(x)

SLIDE 70

Generative Adversarial Nets (GANs)

  • What we really want:

$\max_{\theta} \mathbb{E}_{x \sim q_\theta(x)}[\log p(x)]$

  • But we cannot directly calculate p(x)
  • Idea: what if we build a discriminator to judge whether a data instance is true or fake (artificially generated)?
  • Leverage the strong power of deep learning based discriminative models

SLIDE 71

Generative Adversarial Nets (GANs)

  • The discriminator tries to correctly distinguish the true data from the fake model-generated data
  • The generator tries to generate high-quality data to fool the discriminator
  • G & D can be implemented via neural networks
  • Ideally, when D cannot distinguish the true and generated data, G nicely fits the true underlying data distribution

[Figure: the generator G produces data, the discriminator D compares it against real-world data]

SLIDE 72

Generator Network

  • Must be differentiable
  • No invertibility requirement
  • Trainable for any size of z
  • Can make x conditionally Gaussian given z, but need not do so
    • e.g. Variational Auto-Encoder
  • Popular implementation: multi-layer perceptron

$x = G(z; \theta^{(G)})$

SLIDE 73

Discriminator Network

  • Can be implemented by any neural network with a probabilistic prediction
  • For example
    • Multi-layer perceptron with logistic output
    • AlexNet, etc.

$P(\mathrm{real} \,|\, x) = D(x; \theta^{(D)})$

SLIDE 74

GAN: A Minimax Game

[Figure: real-world data and the generator G both feed the discriminator D]

$J^{(D)} = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$

  • Discriminator: $\max_{D} J^{(D)}$
  • Generator: $\min_{G} \max_{D} J^{(D)}$
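A hedged sketch of the minimax game as alternating gradient steps in PyTorch; the tiny MLPs, the 1-D toy data and the hyper-parameters are illustrative assumptions of mine, not from the slides:

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))                 # generator x = G(z)
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())   # D(x) = P(real | x)
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)

def real_batch(n=64):
    return torch.randn(n, 1) * 0.5 + 3.0        # stand-in "true" data distribution

for _ in range(1000):
    # Discriminator step: maximize J(D) = E[log D(x)] + E[log(1 - D(G(z)))]
    x = real_batch()
    z = torch.randn(64, 8)
    d_loss = -(torch.log(D(x) + 1e-8) + torch.log(1 - D(G(z).detach()) + 1e-8)).mean()
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: minimize E[log(1 - D(G(z)))] (the min_G part of the same objective)
    z = torch.randn(64, 8)
    g_loss = torch.log(1 - D(G(z)) + 1e-8).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```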

SLIDE 75

Illustration of GANs

[Figure: the data distribution, the generator distribution, and the discriminator output during training]

$J^{(D)} = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$

  • Discriminator: $\max_{D} J^{(D)}$
  • Generator: $\min_{G} \max_{D} J^{(D)}$

SLIDE 76

Ideal Final Equilibrium

  • The generator generates the perfect data distribution
  • The discriminator cannot distinguish the true and generated data

SLIDE 77

Training GANs

Training discriminator

SLIDE 78

Training GANs

Training generator

SLIDE 79

Optimal Strategy for Discriminator

  • The optimal D(x) for any $p_{\text{data}}(x)$ and $p_G(x)$ is always

$D(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_G(x)}$

SLIDE 80

Reformulate the Minimax Game

$\begin{aligned} \max_D J^{(D)} &= \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \\ &= \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{x \sim p_G(x)}[\log(1 - D(x))] \\ &= \mathbb{E}_{x \sim p_{\text{data}}(x)}\Big[\log \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_G(x)}\Big] + \mathbb{E}_{x \sim p_G(x)}\Big[\log \frac{p_G(x)}{p_{\text{data}}(x) + p_G(x)}\Big] \\ &= -\log 4 + \mathrm{KL}\Big(p_{\text{data}} \,\Big\|\, \frac{p_{\text{data}} + p_G}{2}\Big) + \mathrm{KL}\Big(p_G \,\Big\|\, \frac{p_{\text{data}} + p_G}{2}\Big) \end{aligned}$

  • G: $\min_{G} \max_{D} J^{(D)}$;  D: $\max_{D} J^{(D)}$
  • $\frac{p_{\text{data}} + p_G}{2}$ is something between $p_{\text{data}}$ and $p_G$

[Huszár, Ferenc. "How (not) to Train your Generative Model: Scheduled Sampling, Likelihood, Adversary?" arXiv (2015).]

SLIDE 81
GANs for Continuous Data

  • In order to take the gradient with respect to the generator parameters, x has to be continuous
    • 1. Generation
    • 2. Discrimination
    • 3. Gradient on the generated data
    • 4. Further gradient on the generator

$\min_{G} \max_{D} J(G, D), \qquad \max_{D} J(G, D)$

SLIDE 82

Case Study of GANs

  • The rightmost images in each row are the closest training images to the neighboring generated ones, which means GAN does not simply memorize training instances

SLIDE 83

High Resolution and Quality Images

  • Progressive Growing of GANs

Two imaginary celebrities that were dreamed up by a random number generator.

Tero Karras et al. Progressive Growing of GANs for Improved Quality, Stability, and Variation. ICLR 2018.

SLIDE 84

Single Image Super-Resolution

Ledig, Christian, et al. "Photo-realistic single image super-resolution using a generative adversarial network." CVPR 2017.

deep residual generative adversarial network optimized for a loss more sensitive to human perception [4× upscaling]

SLIDE 85

Image to Image Translation

Isola, Phillip, et al. "Image-to-image translation with conditional adversarial networks." CVPR 2017.

SLIDE 86

Grayscale Image Colorization

Yun Cao, Weinan Zhang etc. Unsupervised Diverse Colorization via Generative Adversarial Networks. ECML-PKDD 2017.

[Figure: ground-truth images alongside generated colorizations of their grayscale versions]

SLIDE 87

High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs

Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. "High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs", arXiv preprint arXiv:1711.11585.