Continuous Latent Variables - Oliver Schulte - CMPT 419/726 - Bishop PRML



SLIDE 1

Principal Component Analysis

Continuous Latent Variables

Oliver Schulte - CMPT 419/726 Bishop PRML Ch. 12

SLIDE 2

Outline

Principal Component Analysis


SLIDE 4

PCA: Motivation and Intuition

  • The basic ideas are over 100 years old (statistics). Still useful!
  • Think about linear regression. If the basis functions are not given, can we learn them from data?
  • Goal: find a small set of hidden basis functions that explains the data as well as possible.
  • Intuition: suppose your data is generated by a few hidden causes or factors. Then you could compactly describe each data point by how much each cause contributes to generating it.
  • Principal Component Analysis (PCA) assumes that the contribution of each factor to each data point is linear.

SLIDE 5

Informal Example: Student Performance

  • Each student's performance is summarized in 4 assignments, 1 midterm, 1 project = 6 numbers.
  • Suppose that on each item, student n's performance can be explained in terms of two factors:
  • her intelligence I_n,
  • her diligence D_n.
  • Combine these into a vector z_n.
  • The importance of each factor varies with the item, so we have 6 numbers for each factor. Put them in a 6x2 matrix W.
  • Then the performance numbers of student n can be predicted by the model x_n = W z_n + ε, where ε is (Gaussian) noise.
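The linear latent-variable model x_n = W z_n + ε can be sketched with synthetic data; the sizes and noise level below are illustrative assumptions (a sketch, assuming numpy), not values from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

N = 100       # students
D, M = 6, 2   # 6 observed scores, 2 latent factors (intelligence, diligence)

Z = rng.normal(size=(N, M))                   # latent vector z_n for each student
W = rng.normal(size=(D, M))                   # loadings: item i weights factor j by W[i, j]
X = Z @ W.T + 0.1 * rng.normal(size=(N, D))   # x_n = W z_n + epsilon

print(X.shape)  # prints (100, 6)
```

PCA will try to recover the two-dimensional structure of Z from the six-dimensional X.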

SLIDE 6

Informal Example: Blind Source Separation

  • Two people are talking in a room, sometimes at the same time. http://www.youtube.com/watch?v=Qr74sM7oqQc&feature=related
  • Two microphones are set up at different parts of the room. Each mike catches each person from a different position. Let x_i be the combined signal at microphone i.
  • The contribution of person 1 to mike i depends on the position of mike i and can be summarized as a number w_i1. Similarly for person 2 (w_i2). Combine these into a 2x2 matrix W.
  • Let z_i be the (amplitude of the) voice signal of person i. Then the combined signal at mike 1 is x_1 = w_11 · z_1 + w_12 · z_2.
  • Similarly for mike 2. Overall, we have x = Wz.

SLIDE 7

Example: Digit Rotation

  • Take a single digit (3) and make a 100x100 pixel image.
  • Create multiple copies by translating and rotating it.
  • This dataset could be represented as vectors in R^(100x100) = R^10000.
  • But the dataset only has 3 degrees of freedom... why are 10,000 needed?
  • Shouldn't a manifold or subspace of intrinsic dimension 3 suffice?
  • Teapot demo: http://www.youtube.com/watch?v=BfTMmoDFXyE

SLIDE 8

Auto-Associative Neural Nets

[Figure: network with inputs x_1 ... x_D, hidden units z_1 ... z_M, and outputs x_1 ... x_D]

  • An auto-associative neural net has just as many input units as output units, say D.
  • The error is the squared difference between input unit x_i and output unit o_i, i.e. the network is supposed to recreate its input.

SLIDE 9

Dimensionality Reduction: Neural Net View

[Figure: network with inputs x_1 ... x_D, hidden units z_1 ... z_M, and outputs x_1 ... x_D]

  • Suppose we have 1 hidden layer with just one node.
  • The network then has to map each input to a single number that allows it to recreate the entire input as well as possible.
  • More generally, we could have M << D hidden nodes.
  • The network then has to map each input to a lower-dimensional vector that allows it to recreate the entire input as well as possible.
  • You can in fact use this set-up to train an ANN to perform dimensionality reduction.
  • But because of the linearity assumption, we can get a fast closed-form solution.

SLIDE 10

Component Analysis: Pros and Cons

Pros

  • Reduces the dimensionality of the data: easier to learn.
  • Removes noise, keeping the important regularities.
  • Can be used to standardize data (whitening).

Cons

  • PCA is restricted to linear hidden models. (We relax this later.)
  • Black box: the transformed data vectors become hard to interpret.
SLIDE 11

Pre-processing Example

[Figure: scatter plots of the data before and after preprocessing]

After preprocessing the original data (left), we obtain a data set with mean 0 and unit covariance.

SLIDE 12

Dimensionality Reduction

[Figure: 2-D data points x_n with principal direction u_1]

  • We will study one simple method for finding a lower-dimensional manifold: principal component analysis (PCA).
  • PCA finds a lower-dimensional linear space to represent the data.
  • How do we define the right linear space?
  • As the subspace that maximizes the variance of the projected data.
  • Or as the subspace that minimizes the projection cost.
  • It turns out these are the same!
SLIDE 13

Maximum Variance

[Figure: 2-D data points x_n with principal direction u_1]

  • Consider a dataset {x_n ∈ R^D}.
  • Try to project it into a space with dimensionality M < D.
  • For M = 1, the space is given by a vector u_1 ∈ R^D with u_1^T u_1 = 1.
  • Optimization problem: find the u_1 that maximizes the variance of the projected data.

SLIDE 14

Projected variance

  • The projection of a data point x_n ∈ R^D onto u_1 is u_1^T x_n.
  • The mean of the projected data is

      (1/N) Σ_{n=1}^N u_1^T x_n = u_1^T [ (1/N) Σ_{n=1}^N x_n ] = u_1^T x̄

  • The variance of the projected data is

      (1/N) Σ_{n=1}^N (u_1^T x_n − u_1^T x̄)^2
        = u_1^T [ (1/N) Σ_{n=1}^N (x_n − x̄)(x_n − x̄)^T ] u_1
        = u_1^T S u_1

    where S is the sample covariance matrix.
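As a sanity check, the identity "variance of the projections = u_1^T S u_1" can be verified numerically; the random data and the chosen direction below are purely illustrative (a sketch, assuming numpy):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))           # N = 500 data points in R^3

u1 = np.array([1.0, 2.0, 2.0])
u1 /= np.linalg.norm(u1)                # unit direction, so u1^T u1 = 1

proj = X @ u1                           # u1^T x_n for every n
var_direct = np.mean((proj - proj.mean()) ** 2)

xbar = X.mean(axis=0)
S = (X - xbar).T @ (X - xbar) / len(X)  # sample covariance (1/N convention)
var_quadform = u1 @ S @ u1

# The direct computation and the quadratic form agree.
assert np.allclose(var_direct, var_quadform)
```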

SLIDE 15

Optimization

  • How do we maximize the projected variance u_1^T S u_1 subject to the constraint u_1^T u_1 = 1?
  • Lagrange multipliers: maximize

      u_1^T S u_1 + λ_1 (1 − u_1^T u_1)

  • Taking derivatives, a stationary point satisfies

      S u_1 = λ_1 u_1

    i.e. u_1 is an eigenvector of S.

SLIDE 16

Optimization – Which Eigenvector?

  • There are up to D eigenvectors; which is the right one?
  • Maximize variance! The variance is

      u_1^T S u_1 = u_1^T λ_1 u_1   (since u_1 is an eigenvector)
                  = λ_1            (since ||u_1|| = 1)

  • Choose the eigenvector u_1 corresponding to the largest eigenvalue λ_1.
  • This is the first direction (M = 1).
  • If M > 1, it is simple to show that the eigenvectors corresponding to the M largest eigenvalues are the ones to choose to maximize the projected variance.
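This recipe (covariance, eigendecomposition, keep the top-M eigenvectors) can be sketched in a few lines; the function name and the test data are my own illustrative choices, assuming numpy:

```python
import numpy as np

def pca_directions(X, M):
    """Return the M eigenvectors of the sample covariance with the
    largest eigenvalues, as columns of a D x M matrix."""
    xbar = X.mean(axis=0)
    S = (X - xbar).T @ (X - xbar) / len(X)
    eigvals, eigvecs = np.linalg.eigh(S)   # eigh: ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:M]  # indices of the M largest
    return eigvecs[:, order], eigvals[order]

rng = np.random.default_rng(2)
# Data stretched along the first axis, so the top direction is predictable.
X = rng.normal(size=(1000, 3)) * np.array([5.0, 1.0, 0.2])
U, lam = pca_directions(X, 2)

# The leading eigenvector aligns with the high-variance axis (sign arbitrary).
assert abs(U[0, 0]) > 0.9
assert lam[0] >= lam[1]
```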

SLIDE 17

Reconstruction Error

  • Can also phrase the problem as finding a set of orthonormal basis vectors {u_i} for the projection.
  • Find a set of M < D vectors that minimize the reconstruction error

      J = (1/N) Σ_{n=1}^N ||x_n − x̃_n||^2

    where x̃_n is the projected version of x_n.
  • The optimal x̃_n turns out to be the same as before: the mean plus the leading eigenvectors of the covariance matrix S,

      x̃_n = x̄ + Σ_{i=1}^M β_ni u_i

SLIDE 18

PCA Example – MNIST Digits

[Figure: mean image and leading eigenvectors of the digit "3"s; eigenvalue spectrum and reconstruction error as a function of the number of components]

  • PCA of digits "3" from MNIST.
  • The first ≈ 100 dimensions capture most of the variance / give low reconstruction error.

SLIDE 19

Reconstruction – MNIST Digits

[Figure: an original MNIST digit and its PCA reconstructions for increasing M]

  • The PCA approximation to a data vector x_n is

      x̃_n = x̄ + Σ_{i=1}^M β_ni u_i

  • As M is increased, this reconstruction becomes more accurate.
  • D = 784, but with M = 250 the reconstruction is already quite good.
  • Dimensionality reduction.
SLIDE 20

PCA Example – Eigenfaces

[Scanned excerpt from Kirby and Sirovich, IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 12, no. 1, January 1990: eigenpictures of faces and reconstructions of a face outside the training ensemble using 10, 20, 30, 40, 50 eigenpictures. The first ten eigenpictures contain 82 percent of the variance; by N = 50 this rises to 95 percent, and the estimated dimensionality of the set of faces is about 21.]

Kirby and Sirovich, PAMI 1990.
http://en.wikipedia.org/wiki/Eigenface

SLIDE 21

Probabilistic PCA

[Graphical model: latent z_n and observed x_n inside a plate over N, with parameters µ, σ², W]

  • Probabilistic model of PCA: for each data point x_n, there is a latent variable vector z_n.
  • Linear Gaussian model: x = Wz + µ + ε.
  • Can train using EM.
  • Handles missing data.
  • Can take mixtures of PCA models.
  • Closely related to factor analysis. http://cscs.umich.edu/~crshalizi/weblog/
  • ... (see text).
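For reference, probabilistic PCA also has a closed-form maximum-likelihood solution (Tipping and Bishop; see PRML Sec. 12.2), so EM is not the only route. A sketch assuming numpy, with synthetic and purely illustrative data:

```python
import numpy as np

rng = np.random.default_rng(4)
D, M = 5, 2
X = rng.normal(size=(1000, D)) * np.array([4.0, 2.0, 0.5, 0.5, 0.5])

xbar = X.mean(axis=0)
S = (X - xbar).T @ (X - xbar) / len(X)
lam, U = np.linalg.eigh(S)
lam, U = lam[::-1], U[:, ::-1]   # sort eigenvalues in decreasing order

# ML estimates: sigma^2 is the average discarded variance, and
# W = U_M (L_M - sigma^2 I)^{1/2} (up to an arbitrary rotation R).
sigma2 = lam[M:].mean()
W = U[:, :M] @ np.diag(np.sqrt(lam[:M] - sigma2))

# Model covariance C = W W^T + sigma^2 I keeps the top-M eigenvalues
# of S and flattens the remaining ones to sigma^2.
C = W @ W.T + sigma2 * np.eye(D)
lamC = np.linalg.eigh(C)[0][::-1]
assert np.allclose(lamC[:M], lam[:M])
assert np.allclose(lamC[M:], sigma2)
```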
SLIDE 22

Nonlinear PCA: Kernel Methods

  • Can use the kernel trick: replace dot products with kernel evaluations.
  • In the figure, the first 2 eigenvectors separate the 3 clusters.
  • The next 3 split the clusters in halves.
  • The last 3 split the clusters in orthogonal halves.
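A minimal kernel PCA sketch with an RBF kernel, assuming numpy; the bandwidth gamma, the centering formula, and the three-cluster layout are illustrative choices, not taken from the slide:

```python
import numpy as np

def kernel_pca(X, M, gamma=1.0):
    """Kernel PCA: eigendecompose the centred Gram matrix instead of
    the covariance. Uses an RBF kernel with assumed bandwidth gamma."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-gamma * sq)                     # Gram matrix of kernel values
    N = len(X)
    one = np.full((N, N), 1.0 / N)
    Kc = K - one @ K - K @ one + one @ K @ one  # centre in feature space
    eigvals, eigvecs = np.linalg.eigh(Kc)
    order = np.argsort(eigvals)[::-1][:M]       # top-M components
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0))

rng = np.random.default_rng(5)
# Three well-separated clusters in 2-D, 30 points each.
X = np.concatenate([rng.normal(c, 0.1, size=(30, 2))
                    for c in ([0, 0], [5, 0], [0, 5])])
Z = kernel_pca(X, 2, gamma=0.5)
assert Z.shape == (90, 2)
```

With well-separated clusters, the leading kernel principal components typically separate the clusters, as the slide describes for its figure.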
SLIDE 23

Nonlinear PCA: Neural Nets

[Figure: network with inputs x_1 ... x_D, non-linear mappings F_1 and F_2, and outputs x_1 ... x_D]

With more than one hidden layer, neural nets perform non-linear dimensionality reduction.

SLIDE 24

Conclusion

  • Readings: Ch. 12.1
  • We discussed one method for finding a lower-dimensional manifold: principal component analysis (PCA).
  • PCA is a basic technique.
  • It finds a linear manifold (hyperplane).
  • In general, the manifold will be non-linear.
  • Simple example: a translating digit 1.
  • There are also "dimensions" corresponding to the style of the digit.
  • Other important techniques for non-linear dimensionality reduction: independent component analysis (ICA), Isomap, locally linear embeddings.