SLIDE 1 Unsupervised Learning
Learning without class labels (or correct outputs)
– Density Estimation
Learn P(X) given training data for X
– Clustering
Partition data into clusters
– Dimensionality Reduction
Discover low-dimensional representation of data
– Blind Source Separation
Unmixing multiple signals
SLIDE 2 Density Estimation
Given: S = {x_1, x_2, …, x_N}
Find: P(x)
Search problem:
argmax_h P(S|h) = argmax_h ∑_i log P(x_i | h)
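As a minimal sketch of this search problem (assuming numpy; the hypothesis class, a single Gaussian, is my choice for illustration), the log-likelihood objective above is maximized in closed form by the sample mean and standard deviation, and any perturbed hypothesis scores worse on the training data:

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.normal(loc=2.0, scale=1.5, size=1000)  # training data for X

# For a Gaussian hypothesis h = (mu, sigma), the objective
# sum_i log P(x_i | h) has a closed-form maximizer:
mu_hat = S.mean()        # MLE of the mean
sigma_hat = S.std()      # MLE of the standard deviation (ddof=0)

def log_likelihood(S, mu, sigma):
    """sum_i log P(x_i | h) for a Gaussian hypothesis h = (mu, sigma)."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (S - mu)**2 / (2 * sigma**2))

# The MLE beats perturbed hypotheses on the training data.
assert log_likelihood(S, mu_hat, sigma_hat) > log_likelihood(S, mu_hat + 0.3, sigma_hat)
assert log_likelihood(S, mu_hat, sigma_hat) > log_likelihood(S, mu_hat, sigma_hat + 0.3)
```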
SLIDE 3 Unsupervised Fitting of the Naïve Bayes Model
y is discrete with K values
P(x) = ∑_k P(y=k) ∏_j P(x_j | y=k)
finite mixture model: we can think of each y=k as a separate “cluster” of data points
[Figure: naive Bayes network with class y and features x_1, x_2, x_3, …, x_n]
SLIDE 4 The Expectation-Maximization Algorithm (1): Hard EM
Learning would be easy if we knew y_i for each x_i
Idea: guess them and then iteratively revise our guesses to maximize P(S|h)
[Figure: each example x_1, x_2, …, x_i, …, x_N paired with its guessed label y_1, y_2, …, y_i, …, y_N]
SLIDE 5 Hard EM (2)
1. Guess initial y values to get “complete data”
2. M Step: Compute probabilities for hypotheses (model) from complete data [Maximum likelihood estimate of the model parameters]
3. E Step: Classify each example using the current model to get a new y value [Most likely class ŷ of each example]
4. Repeat steps 2–3 until convergence
[Figure: each example x_i paired with its current guessed label y_i]
SLIDE 6 Special Case: k-Means Clustering
1. Assign an initial y_i to each data point x_i at random
2. M Step. For each class k = 1, …, K compute the mean:
µ_k = 1/N_k ∑_i x_i · I[y_i = k]
3. E Step. For each example x_i, assign it to the class k with the nearest mean:
y_i = argmin_k ||x_i – µ_k||
4. Repeat steps 2 and 3 to convergence
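The four steps above can be sketched directly in numpy (a minimal illustration, not a production implementation; it assumes no cluster ever becomes empty, which holds for the well-separated toy data used here):

```python
import numpy as np

def k_means(X, K, n_iters=100, seed=0):
    """Hard-EM k-means: alternate mean computation (M step) and
    nearest-mean assignment (E step) until the labels stop changing."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, K, size=len(X))           # step 1: random labels
    for _ in range(n_iters):
        # M step: mean of the points currently assigned to each class
        mu = np.array([X[y == k].mean(axis=0) for k in range(K)])
        # E step: reassign each point to the class with the nearest mean
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        y_new = dists.argmin(axis=1)
        if np.array_equal(y_new, y):              # step 4: converged
            break
        y = y_new
    return y, mu

# Two well-separated blobs should be recovered exactly.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(5, 0.1, (50, 2))])
y, mu = k_means(X, K=2)
assert len(set(y[:50])) == 1 and len(set(y[50:])) == 1 and y[0] != y[-1]
```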
SLIDE 7 Gaussian Interpretation of K-means
Each feature x_j in class k is gaussian distributed with mean µ_kj and constant variance σ^2
P(x_j | y=k) = 1/√(2πσ^2) · exp[ −(1/2) (x_j − µ_kj)^2 / σ^2 ]
log P(x_j | y=k) = −(1/2) (x_j − µ_kj)^2 / σ^2 + C
argmax_y P(x | y) = argmax_y log P(x | y) = argmin_y ||x − µ_y||^2 = argmin_y ||x − µ_y||
This could easily be extended to have a general covariance matrix Σ or class-specific Σ_k
SLIDE 8 The EM algorithm
The true EM algorithm augments the incomplete data with a probability distribution over the possible y values
1. Start with initial naive Bayes hypothesis
2. E step: For each example, compute P(y_i) and add it to the table
3. M step: Compute updated estimates of the parameters
4. Repeat steps 2–3 to convergence.
[Figure: each example x_i paired with a distribution P(y_i) rather than a single guessed label]
SLIDE 9 Details of the M step
Each example x_i is treated as if y_i = k with probability P(y_i = k | x_i)
P(y = k) := (1/N) ∑_{i=1}^N P(y_i = k | x_i)
P(x_j = v | y = k) := [ ∑_i P(y_i = k | x_i) · I(x_ij = v) ] / [ ∑_{i=1}^N P(y_i = k | x_i) ]
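The two fractional-count updates above can be written almost verbatim in numpy. A minimal sketch (the posterior table and feature matrix are made-up toy values, and the helper name `p_xj_given_y` is mine):

```python
import numpy as np

# posterior[i, k] = P(y_i = k | x_i) from the E step (toy values)
posterior = np.array([[0.9, 0.1],
                      [0.2, 0.8],
                      [0.7, 0.3]])
# X[i, j] = value of discrete feature j for example i
X = np.array([[1, 0],
              [0, 0],
              [1, 1]])

N, K = posterior.shape

# P(y = k) := (1/N) sum_i P(y_i = k | x_i)
p_y = posterior.sum(axis=0) / N

# P(x_j = v | y = k): posterior-weighted fractional counts
def p_xj_given_y(j, v, k):
    num = np.sum(posterior[:, k] * (X[:, j] == v))
    return num / posterior[:, k].sum()

assert np.isclose(p_y.sum(), 1.0)
# conditional distribution over values of feature 0 sums to 1
assert np.isclose(p_xj_given_y(0, 0, 0) + p_xj_given_y(0, 1, 0), 1.0)
```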
SLIDE 10 Example: Mixture of 2 Gaussians
Initial distributions, means at −0.5, +0.5
SLIDE 11 Example: Mixture of 2 Gaussians
Iteration 1
SLIDE 12 Example: Mixture of 2 Gaussians
Iteration 2
SLIDE 13 Example: Mixture of 2 Gaussians
Iteration 3
SLIDE 14 Example: Mixture of 2 Gaussians
Iteration 10
SLIDE 15 Example: Mixture of 2 Gaussians
Iteration 20
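A run like the one pictured on these slides can be reproduced with a short EM loop for a 1-D two-Gaussian mixture, starting from means −0.5 and +0.5 as above (a sketch assuming numpy; the synthetic data, with true means ±2, is mine):

```python
import numpy as np

rng = np.random.default_rng(0)
# Data from an equal mixture of N(-2, 1) and N(+2, 1)
X = np.concatenate([rng.normal(-2, 1, 500), rng.normal(2, 1, 500)])

mu = np.array([-0.5, 0.5])          # initial means, as on the slides
sigma = np.array([1.0, 1.0])
pi = np.array([0.5, 0.5])

def gaussian(x, m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

for _ in range(20):
    # E step: posterior responsibility of each component for each point
    r = pi[None, :] * gaussian(X[:, None], mu[None, :], sigma[None, :])
    r /= r.sum(axis=1, keepdims=True)
    # M step: posterior-weighted MLE updates of the mixture parameters
    Nk = r.sum(axis=0)
    pi = Nk / len(X)
    mu = (r * X[:, None]).sum(axis=0) / Nk
    sigma = np.sqrt((r * (X[:, None] - mu[None, :]) ** 2).sum(axis=0) / Nk)

# After 20 iterations the estimated means approach the true means.
assert abs(sorted(mu)[0] + 2) < 0.3 and abs(sorted(mu)[1] - 2) < 0.3
```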
SLIDE 16 Evaluation: Test set likelihood
Overfitting is also a problem in unsupervised learning
SLIDE 17 Potential Problems
If σ_k is allowed to vary, it may go to zero, which leads to infinite likelihood
Fix by placing an overfitting penalty on 1/σ
SLIDE 18 Choosing K
Internal holdout likelihood
SLIDE 19 Unsupervised Learning for Sequences
Suppose each training example X_i is a sequence of objects:
X_i = (x_i1, x_i2, …, x_i,Ti)
Fit HMM by unsupervised learning
1. Initialize model parameters
2. E step: apply forward-backward algorithm to estimate P(y_it | X_i) at each point t
3. M step: estimate model parameters
4. Repeat steps 2–3 to convergence
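The E step in step 2 can be sketched for a small discrete HMM (numpy assumed; the sticky two-state model and symbol sequence are illustrative choices of mine, and per-step scaling is used for numerical stability):

```python
import numpy as np

def forward_backward(obs, pi, A, B):
    """E step for an HMM: posterior P(y_t = k | X) for each t,
    computed by the forward-backward algorithm with per-step scaling."""
    T, K = len(obs), len(pi)
    alpha = np.zeros((T, K))
    beta = np.ones((T, K))
    # forward pass
    alpha[0] = pi * B[:, obs[0]]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        alpha[t] /= alpha[t].sum()
    # backward pass
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
        beta[t] /= beta[t].sum()
    # posteriors: gamma_t(k) proportional to alpha_t(k) * beta_t(k)
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)

pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.1, 0.9]])   # sticky transitions
B = np.array([[0.9, 0.1], [0.1, 0.9]])   # state k mostly emits symbol k
gamma = forward_backward([0, 0, 0, 1, 1, 1], pi, A, B)
assert np.allclose(gamma.sum(axis=1), 1.0)
assert gamma[0, 0] > 0.5 and gamma[-1, 1] > 0.5
```

The M step would then re-estimate pi, A, and B from these posteriors, exactly as in the naive Bayes case.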
SLIDE 20 Agglomerative Clustering
Initialize each data point to be its own cluster
Repeat:
– Merge the two clusters that are most similar
– Build dendrogram with height = distance between the most similar clusters
Apply various intuitive methods to choose the number of clusters
– Equivalent to choosing where to “slice” the dendrogram
Source: Charity Morgan http://www.people.fas.harvard.edu/~rizem/teach/stat325/CharityCluster.ppt
SLIDE 21 Agglomerative Clustering
Each cluster is defined only by the points it contains (not by a parameterized model)
Very fast (using a priority queue)
No objective measure of correctness
Distance measures
– distance between nearest pair of points
– distance between cluster centers
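A naive O(N^3) sketch of the merge loop, using the first distance measure above (nearest pair of points, i.e. single linkage); numpy assumed, and stopping at a target cluster count stands in for slicing the dendrogram:

```python
import numpy as np

def agglomerate(X, target_k):
    """Bottom-up clustering: start with singleton clusters and
    repeatedly merge the closest pair (single-link distance)
    until target_k clusters remain."""
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > target_k:
        best = (None, None, np.inf)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single-link distance: nearest pair of points
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        clusters[a] += clusters[b]   # merge b into a
        del clusters[b]
    return clusters

X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1]])
clusters = agglomerate(X, target_k=2)
sizes = sorted(len(c) for c in clusters)
assert sizes == [2, 3]
```

A real implementation would keep pairwise distances in a priority queue, as the slide notes, rather than rescanning all pairs each round.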
SLIDE 22 Probabilistic Agglomerative Clustering = Bottom-up Model Merging
Each data point is an initial cluster, but with penalized σ_k
Repeat:
– Merge the two clusters that would most increase the penalized log likelihood
– Until no merger would further improve likelihood
Note that without the penalty on σ_k, the algorithm would never merge anything
SLIDE 23 Dimensionality Reduction
Often, raw data have very high dimension
– Example: images of human faces
Dimensionality Reduction:
– Construct a lower-dimensional space that preserves information important for the task
– Examples:
preserve distances
preserve separation between classes
etc.
SLIDE 24 Principal Component Analysis
Given:
– Data: n-dimensional vectors {x_1, x_2, …, x_N}
– Desired dimensionality m
Find an m x n orthogonal matrix A to minimize
∑_i ||A^-1 A x_i – x_i||^2
Explanation:
– A x_i maps x_i into an m-dimensional vector x′_i
– A^-1 A x_i maps x′_i back to n-dimensional space
– We minimize the “squared reconstruction error” between the reconstructed vectors and the original vectors
SLIDE 25 Conceptual Algorithm
Find a line such that when the data is projected onto that line, it has the maximum variance:
SLIDE 26 Conceptual Algorithm
Find a new line, orthogonal to the first, that has maximum projected variance:
SLIDE 27 Repeat Until m Lines Have Been Found
The projected position of a point on these lines gives the coordinates in the m-dimensional reduced space
SLIDE 28 A Better Numerical Method
Compute the covariance matrix
Σ = ∑_i (x_i – µ) · (x_i – µ)^T
Compute the singular value decomposition
Σ = U D V^T
where
– the columns of U are the eigenvectors of Σ
– D is a diagonal matrix whose elements are the square roots of the eigenvalues of Σ, in descending order
– V^T are the projected data points
Replace all but the m largest elements of D by zeros
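The numerical method above can be sketched in a few lines of numpy (a minimal illustration; the synthetic data, 3-D points lying near a 2-D plane, is my choice, and I use the top-m eigenvectors of the covariance matrix as the rows of the projection matrix A from slide 24):

```python
import numpy as np

rng = np.random.default_rng(0)
# 200 points in 3-D that actually live near a 2-D plane
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 3)) \
    + 0.01 * rng.normal(size=(200, 3))

mu = X.mean(axis=0)
Sigma = (X - mu).T @ (X - mu)        # covariance matrix (unnormalized)
U, d, Vt = np.linalg.svd(Sigma)      # columns of U: eigenvectors of Sigma

m = 2
A = U[:, :m].T                       # m x n projection matrix
X_proj = (X - mu) @ A.T              # m-dimensional coordinates x'_i
X_rec = X_proj @ A + mu              # reconstruction back in n dimensions

# Squared reconstruction error is tiny because the data is nearly 2-D.
err = np.mean(np.sum((X_rec - X) ** 2, axis=1))
assert err < 1e-2
```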
SLIDE 29 Example: Eigenfaces
Database of 128 carefully-aligned faces
Here are the first 15 eigenvectors:
SLIDE 30 Face Classification in Eigenspace is Easier
Nearest Mean classifier
ŷ = argmin_k || Ax – Aµ_k ||
Accuracy
– variation in lighting: 96%
– variation in orientation: 85%
– variation in size: 64%
SLIDE 31 PCA is a useful preprocessing step
Helps all LTU algorithms by making the features more independent
Helps decision tree algorithms
Helps nearest neighbor algorithms by discovering the distance metric
Fails when data consists of multiple separate clusters
– mixtures of PCAs can be learned too
SLIDE 32 Non-Linear Dimensionality Reduction: ISOMAP
Replace Euclidean distance by geodesic distance
– Construct a graph where each point is connected to its k nearest neighbors by an edge AND any pair of points are connected if they are less than ε apart
– Construct an N x N matrix D in which D[i,j] is the shortest path in the graph connecting x_i to x_j
– Apply SVD to D and keep the m most important dimensions
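The three steps above can be sketched with numpy (a minimal illustration, my own simplifications: only the kNN rule builds the graph, Floyd-Warshall computes shortest paths, and classical MDS on the geodesic distances plays the role of the final SVD step):

```python
import numpy as np

def isomap(X, n_neighbors, m):
    """Minimal ISOMAP sketch: kNN graph -> shortest-path (geodesic)
    distances -> classical MDS embedding into m dimensions."""
    N = len(X)
    D_eucl = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    # kNN graph: edges to each point's k nearest neighbors
    G = np.full((N, N), np.inf)
    np.fill_diagonal(G, 0)
    for i in range(N):
        for j in np.argsort(D_eucl[i])[1:n_neighbors + 1]:
            G[i, j] = G[j, i] = D_eucl[i, j]
    # Floyd-Warshall: geodesic (shortest-path) distances on the graph
    for k in range(N):
        G = np.minimum(G, G[:, k:k + 1] + G[k:k + 1, :])
    # Classical MDS: double-center squared distances, take top-m eigenvectors
    J = np.eye(N) - np.ones((N, N)) / N
    B = -0.5 * J @ (G ** 2) @ J
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:m]
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0))

# Points along a curved arc: the 1-D embedding should recover
# the position along the arc (up to sign).
theta = np.linspace(0, np.pi, 30)
X = np.column_stack([np.cos(theta), np.sin(theta)])
emb = isomap(X, n_neighbors=2, m=1).ravel()
assert abs(np.corrcoef(emb, theta)[0, 1]) > 0.99
```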
SLIDE 33 Two more ISOMAP examples
SLIDE 34 Linear Interpolation Between Points in ISOMAP Space
Algorithm generates new poses and new “2”s
SLIDE 35 Blind Source Separation
Suppose we have two sound sources that have been linearly mixed and recorded by two microphones. Given the two microphone signals, we want to recover the two sound sources
[Figure: sources y1, y2 mixed with weights α, 1−α and β, 1−β into microphone signals x1, x2; a “Magic Box” outputs ŷ1, ŷ2]
SLIDE 36 Minimizing Mutual Information
If the input sources are independent, then they should have zero mutual information.
Idea: Minimize the mutual information between the outputs while maximizing the information (entropy) of each output separately:
max_W H(ŷ_1) + H(ŷ_2) – I(ŷ_1; ŷ_2)
where [ŷ_1, ŷ_2] = F_W(x_1, x_2) and F_W is a sigmoid neural network
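A toy separation in the same spirit can be sketched with numpy. Caveats: the signals and mixing matrix below are made up, and instead of the sigmoid-network objective above, this sketch whitens the mixtures and searches rotation angles for the outputs that are maximally non-Gaussian (by excess kurtosis), a simple stand-in for the information-theoretic objective:

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 2000)
s1 = np.sign(np.sin(7 * 2 * np.pi * t))     # source 1: square wave
s2 = np.sin(13 * 2 * np.pi * t)             # source 2: sine wave
S = np.vstack([s1, s2])

M = np.array([[0.6, 0.4], [0.3, 0.7]])      # unknown mixing matrix
X = M @ S                                   # the two "microphone" signals

# Whiten the mixtures (zero mean, identity covariance).
Xc = X - X.mean(axis=1, keepdims=True)
cov = Xc @ Xc.T / Xc.shape[1]
eigval, eigvec = np.linalg.eigh(cov)
Z = np.diag(eigval ** -0.5) @ eigvec.T @ Xc

def non_gaussianity(Y):
    """Sum of |excess kurtosis| of each row; large when outputs
    look like independent non-Gaussian sources."""
    k = (Y ** 4).mean(axis=1) / (Y ** 2).mean(axis=1) ** 2 - 3
    return np.abs(k).sum()

def rot(a):
    return np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])

# After whitening, unmixing is a rotation: grid-search the angle.
best = max(np.linspace(0, np.pi / 2, 200),
           key=lambda a: non_gaussianity(rot(a) @ Z))
Y = rot(best) @ Z

# Each recovered signal correlates strongly with one true source
# (up to permutation and sign, which blind separation cannot fix).
corr = np.abs(np.corrcoef(np.vstack([Y, S]))[:2, 2:])
assert corr.max(axis=1).min() > 0.95
```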
SLIDE 37 Independent Component Analysis (ICA)
[Audio examples: Microphone 1, Microphone 2, Reconstructed source 1, Reconstructed source 2]
source: http://www.cnl.salk.edu/~tewon/Blind/blind_audio.html
SLIDE 38 Unsupervised Learning Summary
Density Estimation: Learn P(X) given training data for X
– Mixture models and EM
Clustering: Partition data into clusters
– Bottom-up agglomerative clustering
Dimensionality Reduction: Discover low-dimensional representation of data
– Principal Component Analysis
– ISOMAP
Blind Source Separation: Unmixing multiple signals
– Many algorithms
SLIDE 39 Objective Functions
Density Estimation:
– Log likelihood on training data
Clustering:
– ????
Dimensionality Reduction:
– Minimum reconstruction error
– Maximum likelihood (gaussian interpretation of PCA)
Blind Source Separation:
– Information Maximization
– Maximum Likelihood (assuming models of the sources)