SLIDE 1

Fast Discriminative Component Analysis for Comparing Examples

Jaakko Peltonen¹, Jacob Goldberger², and Samuel Kaski¹

¹Helsinki Institute for Information Technology & Adaptive Informatics Research Centre,

Laboratory of Computer and Information Science, Helsinki University of Technology

²School of Engineering, Bar-Ilan University

NIPS 2006 LCE workshop

SLIDE 2

Outline

  • 1. Background
  • 2. Our method
  • 3. Optimization
  • 4. Properties
  • 5. Experiments
  • 6. Conclusions
SLIDE 3
  • 1. Background

Task: discriminative component analysis (searching for data components that discriminate some auxiliary data of interest, e.g. classes)

SLIDE 4
  • 1. Background

Task: discriminative component analysis (searching for data components that discriminate some auxiliary data of interest, e.g. classes)

Another application possibility: "supervised unsupervised learning", i.e. using auxiliary data such as labels to guide an otherwise unsupervised analysis

SLIDE 5
  • 1. Background

Linear Discriminant Analysis: well-known classical method.

SLIDE 6
  • 1. Background

Linear Discriminant Analysis: well-known classical method. Optimal subspace under restrictive assumptions: Gaussian classes with equal cov. matrix, take enough components.

Not optimal otherwise!

SLIDE 7
  • 1. Background

Linear Discriminant Analysis: well-known classical method. Optimal subspace under restrictive assumptions: Gaussian classes with equal cov. matrix, take enough components.

Not optimal otherwise!


SLIDE 8
  • 1. Background

Linear Discriminant Analysis: well-known classical method. Optimal subspace under restrictive assumptions: Gaussian classes with equal cov. matrix, take enough components. Extensions: HDA, reduced-rank MDA.

LDA and many extensions can be seen as models that maximize the joint likelihood of (x, c)

Not optimal otherwise!


SLIDE 9
  • 1. Background

Recent discriminative methods:

SLIDE 10
  • 1. Background

Recent discriminative methods:

Information-theoretic methods

(Torkkola: Rényi entropy based; Leiva-Murillo & Artés-Rodríguez)

SLIDE 11
  • 1. Background

Recent discriminative methods:

Information-theoretic methods

(Torkkola: Rényi entropy based; Leiva-Murillo & Artés-Rodríguez)

Likelihood ratio-based (Zhu & Hastie)

SLIDE 12
  • 1. Background

Recent discriminative methods:

Information-theoretic methods

(Torkkola: Rényi entropy based; Leiva-Murillo & Artés-Rodríguez)

Likelihood ratio-based (Zhu & Hastie)

Kernel-based (Fukumizu et al.)

SLIDE 13
  • 1. Background

Recent discriminative methods:

Information-theoretic methods

(Torkkola: Rényi entropy based; Leiva-Murillo & Artés-Rodríguez)

Likelihood ratio-based (Zhu & Hastie)

Kernel-based (Fukumizu et al.)

Other approaches (e.g. Globerson & Roweis, Hammer & Villmann, ...)

SLIDE 14
  • 1. Background

Recent discriminative methods:

Information-theoretic methods

(Torkkola: Rényi entropy based; Leiva-Murillo & Artés-Rodríguez)

Likelihood ratio-based (Zhu & Hastie)

Kernel-based (Fukumizu et al.)

Other approaches (e.g. Globerson & Roweis, Hammer & Villmann, ...)

Two recent, very similar methods:

  • Informative Discriminant Analysis (IDA)
  • Neighborhood Components Analysis (NCA)

SLIDE 15
  • 1. Background

Recent discriminative methods:

Information-theoretic methods

(Torkkola: Rényi entropy based; Leiva-Murillo & Artés-Rodríguez)

Likelihood ratio-based (Zhu & Hastie)

Kernel-based (Fukumizu et al.)

Other approaches (e.g. Globerson & Roweis, Hammer & Villmann, ...)

Two recent, very similar methods:

  • Informative Discriminant Analysis (IDA)
  • Neighborhood Components Analysis (NCA)

Nonparametric: no distributional assumptions, but O(N²) complexity per iteration.

SLIDE 16
  • 2. Our Method

Basic idea: instead of optimizing the metric for a nonparametric predictor, optimize it for a parametric predictor

SLIDE 17
  • 2. Our Method

Basic idea: instead of optimizing the metric for a nonparametric predictor, optimize it for a parametric predictor.

Parametric predictors are much simpler than nonparametric ones: much less computation, and can increase robustness.

SLIDE 18
  • 2. Our Method

Basic idea: instead of optimizing the metric for a nonparametric predictor, optimize it for a parametric predictor.

Parametric predictors are much simpler than nonparametric ones: much less computation, and can increase robustness.

Of course, then you have to optimize the predictor parameters too...

SLIDE 19
  • 2. Our Method

Parametric predictor: mixture of labeled Gaussians

$$p(\mathbf{A}\mathbf{x}, c;\, \theta) = \alpha_c \sum_k \beta_{c,k}\, N(\mathbf{A}\mathbf{x};\, \mu_{c,k}, \Sigma_{c,k})$$

SLIDE 20
  • 2. Our Method

Parametric predictor: mixture of labeled Gaussians.

Objective function: conditional likelihood of classes.

$$p(\mathbf{A}\mathbf{x}, c;\, \theta) = \alpha_c \sum_k \beta_{c,k}\, N(\mathbf{A}\mathbf{x};\, \mu_{c,k}, \Sigma_{c,k})$$

$$L = \sum_i \log p(c_i \mid \mathbf{A}\mathbf{x}_i;\, \theta) = \sum_i \log \frac{p(\mathbf{A}\mathbf{x}_i, c_i;\, \theta)}{\sum_c p(\mathbf{A}\mathbf{x}_i, c;\, \theta)}$$

SLIDE 21
  • 2. Our Method

Parametric predictor: mixture of labeled Gaussians.

Objective function: conditional likelihood of classes.

We call this "discriminative component analysis by Gaussian mixtures", or DCA-GM.

$$p(\mathbf{A}\mathbf{x}, c;\, \theta) = \alpha_c \sum_k \beta_{c,k}\, N(\mathbf{A}\mathbf{x};\, \mu_{c,k}, \Sigma_{c,k})$$

$$L = \sum_i \log p(c_i \mid \mathbf{A}\mathbf{x}_i;\, \theta) = \sum_i \log \frac{p(\mathbf{A}\mathbf{x}_i, c_i;\, \theta)}{\sum_c p(\mathbf{A}\mathbf{x}_i, c;\, \theta)}$$
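To make the model and objective concrete, below is a minimal NumPy sketch of this objective. It is illustrative only, not the authors' code: spherical unit-covariance components (instead of general Σ_{c,k}) and all variable names (`A`, `alpha`, `beta`, `mu`) are assumptions made for brevity.

```python
import numpy as np
from scipy.special import logsumexp

def log_gauss(z, mu):
    # log N(z; mu, I): spherical unit-covariance Gaussian (a simplifying assumption)
    d = z.shape[-1]
    diff = z - mu
    return -0.5 * np.sum(diff * diff, axis=-1) - 0.5 * d * np.log(2.0 * np.pi)

def conditional_log_likelihood(A, X, y, alpha, beta, mu):
    """Sum_i log p(c_i | A x_i; theta) for a mixture of labeled Gaussians.

    A     : (d, D) projection matrix
    X     : (N, D) data
    y     : (N,)   integer class labels 0..C-1
    alpha : (C,)   class priors
    beta  : (C, K) within-class mixture weights
    mu    : (C, K, d) component means in the projected space
    """
    Z = X @ A.T                                               # projected data, (N, d)
    # log p(Ax_i, c, k) = log alpha_c + log beta_{c,k} + log N(Ax_i; mu_{c,k}, I)
    log_joint_ck = (np.log(alpha)[None, :, None]
                    + np.log(beta)[None, :, :]
                    + log_gauss(Z[:, None, None, :], mu[None, :, :, :]))  # (N, C, K)
    log_joint_c = logsumexp(log_joint_ck, axis=2)             # log p(Ax_i, c), (N, C)
    # log p(c_i | Ax_i) = log p(Ax_i, c_i) - log sum_c p(Ax_i, c)
    log_cond = log_joint_c[np.arange(len(y)), y] - logsumexp(log_joint_c, axis=1)
    return log_cond.sum()
```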

SLIDE 22

DCA-GM

SLIDE 23
  • 3. Optimization

Use gradient descent for the matrix A

$$\frac{\partial L}{\partial \mathbf{A}} = \sum_i \sum_{c,k} \Big[\, p(c, k \mid \mathbf{A}\mathbf{x}_i;\, \theta) - \delta_{c,c_i}\, p(k \mid \mathbf{A}\mathbf{x}_i, c;\, \theta) \,\Big]\, \Sigma_{c,k}^{-1} (\mathbf{A}\mathbf{x}_i - \mu_{c,k})\, \mathbf{x}_i^{T}$$

SLIDE 24
  • 3. Optimization

Use gradient descent for the matrix A

$$\frac{\partial L}{\partial \mathbf{A}} = \sum_i \sum_{c,k} \Big[\, p(c, k \mid \mathbf{A}\mathbf{x}_i;\, \theta) - \delta_{c,c_i}\, p(k \mid \mathbf{A}\mathbf{x}_i, c;\, \theta) \,\Big]\, \Sigma_{c,k}^{-1} (\mathbf{A}\mathbf{x}_i - \mu_{c,k})\, \mathbf{x}_i^{T}$$

where

$$p(k \mid \mathbf{A}\mathbf{x}, c;\, \theta) = \frac{\beta_{c,k}\, N(\mathbf{A}\mathbf{x};\, \mu_{c,k}, \Sigma_{c,k})}{\sum_l \beta_{c,l}\, N(\mathbf{A}\mathbf{x};\, \mu_{c,l}, \Sigma_{c,l})}, \qquad p(c, k \mid \mathbf{A}\mathbf{x};\, \theta) = \frac{\alpha_c\, \beta_{c,k}\, N(\mathbf{A}\mathbf{x};\, \mu_{c,k}, \Sigma_{c,k})}{\sum_{c',k'} \alpha_{c'}\, \beta_{c',k'}\, N(\mathbf{A}\mathbf{x};\, \mu_{c',k'}, \Sigma_{c',k'})}$$
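A corresponding sketch of this gradient, reusing the shapes and the unit-covariance simplification of the previous snippet (so the Σ_{c,k}⁻¹ factor drops out); again illustrative rather than the authors' code:

```python
import numpy as np
from scipy.special import logsumexp

def grad_A(A, X, y, alpha, beta, mu):
    """Gradient of the conditional log-likelihood w.r.t. the projection A.

    dL/dA = sum_{i,c,k} [ p(c,k | Ax_i) - delta_{c,c_i} p(k | Ax_i, c) ]
            * (A x_i - mu_{c,k}) x_i^T          (unit covariances assumed)
    """
    N = X.shape[0]
    Z = X @ A.T                                               # (N, d)
    diff = Z[:, None, None, :] - mu[None, :, :, :]            # Ax_i - mu_{c,k}, (N, C, K, d)
    log_joint_ck = (np.log(alpha)[None, :, None]
                    + np.log(beta)[None, :, :]
                    - 0.5 * np.sum(diff ** 2, axis=-1))       # up to an additive constant
    # p(c, k | Ax_i): normalize over all (c, k)
    post_ck = np.exp(log_joint_ck - logsumexp(log_joint_ck, axis=(1, 2), keepdims=True))
    # p(k | Ax_i, c_i): normalize over k within the observed class only
    log_own = log_joint_ck[np.arange(N), y]                   # (N, K)
    post_k_own = np.exp(log_own - logsumexp(log_own, axis=1, keepdims=True))
    # weights w_{i,c,k} = p(c, k | Ax_i) - delta_{c, c_i} p(k | Ax_i, c_i)
    w = post_ck.copy()
    w[np.arange(N), y] -= post_k_own
    # dL/dA = sum_{i,c,k} w_{i,c,k} (Ax_i - mu_{c,k}) x_i^T, shape (d, D)
    return np.einsum('nck,nckd,nj->dj', w, diff, X)
```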

SLIDE 25
  • 3. Optimization

We could optimize the mixture model parameters by conjugate gradient too. But here we use a hybrid approach: we optimize the mixture by EM before each conjugate gradient iteration. Then only the projection matrix A needs to be optimized by conjugate gradient.
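A schematic of the hybrid loop, as a rough sketch rather than the authors' procedure: plain gradient ascent with a fixed step stands in for conjugate gradients, `grad_A` and `logsumexp` come from the sketches above, and the number of components per class, the step size, and the initialization are arbitrary illustrative choices.

```python
import numpy as np
from scipy.special import logsumexp

def em_update(A, X, y, alpha, beta, mu):
    """One EM pass for the labeled Gaussian mixture in the projected space
    (unit covariances assumed; a simplification, not the authors' exact updates)."""
    Z = X @ A.T
    C, K, _ = mu.shape
    new_beta, new_mu = np.zeros_like(beta), np.zeros_like(mu)
    for c in range(C):
        Zc = Z[y == c]                                        # projected points of class c
        # E-step: responsibilities of the K components of class c
        logp = np.log(beta[c])[None, :] - 0.5 * np.sum(
            (Zc[:, None, :] - mu[c][None, :, :]) ** 2, axis=-1)
        r = np.exp(logp - logsumexp(logp, axis=1, keepdims=True))
        # M-step: mixing weights and component means
        new_beta[c] = r.sum(axis=0) / len(Zc)
        new_mu[c] = (r.T @ Zc) / r.sum(axis=0)[:, None]
    alpha = np.bincount(y, minlength=C) / len(y)              # class priors from label counts
    return alpha, new_beta, new_mu

def fit_dca_gm(X, y, d, n_components=3, n_iter=20, step=1e-3, seed=0):
    """Alternate an EM pass on the mixture with a gradient step on A."""
    rng = np.random.default_rng(seed)
    D = X.shape[1]
    C = int(y.max()) + 1
    A = 0.1 * rng.standard_normal((d, D))
    alpha = np.full(C, 1.0 / C)
    beta = np.full((C, n_components), 1.0 / n_components)
    mu = rng.standard_normal((C, n_components, d))
    for _ in range(n_iter):
        alpha, beta, mu = em_update(A, X, y, alpha, beta, mu)  # mixture fixed afterwards
        A = A + step * grad_A(A, X, y, alpha, beta, mu)        # ascent on the conditional likelihood
    return A, (alpha, beta, mu)
```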
SLIDES 26-47

[Figure sequence: the hybrid optimization on a toy example, showing the initialization, then the state after the EM step and after the CG step of iterations 1 through 10, and finally iteration 19 after CG.]

SLIDE 48
  • 3. Optimization

In the hybrid optimization, the mixture parameters do not change during the optimization of the A matrix. We can make the centers change with A by reparameterizing

$$\mu'_{c,k} = \mathbf{A}\,\mu_{c,k},$$

where the prototypes μ_{c,k} now live in the original data space. This causes only small changes to the gradient and the EM step.
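For intuition (this step is not spelled out on the slides, and unit covariance is assumed here for brevity), the reparameterized Gaussian term and its A-gradient become

$$\log N(\mathbf{A}\mathbf{x}_i;\, \mathbf{A}\mu_{c,k},\, \mathbf{I}) = -\tfrac{1}{2}\,\lVert \mathbf{A}(\mathbf{x}_i - \mu_{c,k}) \rVert^{2} + \mathrm{const}, \qquad \frac{\partial}{\partial \mathbf{A}} = -\,\mathbf{A}\,(\mathbf{x}_i - \mu_{c,k})(\mathbf{x}_i - \mu_{c,k})^{T},$$

so each $(\mathbf{A}\mathbf{x}_i - \mu'_{c,k})\,\mathbf{x}_i^{T}$ factor in the earlier gradient is simply replaced by $\mathbf{A}\,(\mathbf{x}_i - \mu_{c,k})(\mathbf{x}_i - \mu_{c,k})^{T}$.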

SLIDE 49
  • 4. Properties

Gradient computation and EM step are both O(N)

SLIDE 50
  • 4. Properties

Gradient computation and EM step are both O(N).

Finds a subspace. Metric within the subspace unidentifiable (mixture parameters can compensate for metric changes within the subspace).

SLIDE 51
  • 4. Properties

Gradient computation and EM step are both O(N).

Finds a subspace. Metric within the subspace unidentifiable (mixture parameters can compensate for metric changes within the subspace).

Metric within the subspace can be found by various methods.

SLIDE 52
  • 5. Experiments

Four benchmark data sets from the UCI Machine Learning Repository (Wine, Balance, Ionosphere, Iris)

SLIDE 53
  • 5. Experiments

Four benchmark data sets from the UCI Machine Learning Repository (Wine, Balance, Ionosphere, Iris)

30 divisions of data into training and test sets

SLIDE 54
  • 5. Experiments

Four benchmark data sets from the UCI Machine Learning Repository (Wine, Balance, Ionosphere, Iris)

30 divisions of data into training and test sets. Performance measured by test-set accuracy of 1-NN classification.

SLIDE 55
  • 5. Experiments

Four benchmark data sets from the UCI Machine Learning Repository (Wine, Balance, Ionosphere, Iris)

30 divisions of data into training and test sets. Performance measured by test-set accuracy of 1-NN classification.

4 comparison methods:

  • LDA
  • LDA+RCA
  • NCA
  • DCA-GM, 3 Gaussians per class
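A sketch of this kind of evaluation protocol (illustrative only: the 70/30 split size is an assumption, `fit_dca_gm` is the sketch from the optimization section and expects integer class labels, and scikit-learn supplies the 1-NN classifier):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def mean_1nn_accuracy(X, y, d=2, n_splits=30, seed=0):
    """Mean test-set 1-NN accuracy over random train/test splits,
    measured in the subspace learned from the training data only."""
    accs = []
    for s in range(n_splits):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.3, random_state=seed + s, stratify=y)
        A, _ = fit_dca_gm(X_tr, y_tr, d)              # learn the projection on the training set
        knn = KNeighborsClassifier(n_neighbors=1)
        knn.fit(X_tr @ A.T, y_tr)                     # 1-NN in the projected space
        accs.append(knn.score(X_te @ A.T, y_te))
    return float(np.mean(accs))
```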
SLIDE 56
  • 5. Experiments

DCA-GM is comparable to NCA. For these small data sets, both methods run fast.

SLIDE 57
  • 6. Conclusions

  • Method for discriminative component analysis
  • Optimizes a subspace for a Gaussian mixture model
  • O(N) computation
  • Works as well as NCA

SLIDE 58
  • 6. Conclusions

Web links:

www.cis.hut.fi/projects/mi/
www.eng.biu.ac.il/~goldbej/