

SLIDE 1

Partial-Hessian Strategies for Fast Learning of Nonlinear Embeddings

Max Vladymyrov and Miguel Á. Carreira-Perpiñán

Electrical Engineering and Computer Science, University of California, Merced
https://eecs.ucmerced.edu

August 30, 2012

slide-2
SLIDE 2

Dimensionality reduction

Given a high-dimensional dataset Y = (y1, . . . , yN) ⊂ R^D, find a low-dimensional representation X = (x1, . . . , xN) ⊂ R^d, where d ≪ D.

[Figure: a 3D dataset Y (axes Y1, Y2, Y3) and its 2D embedding X (axes X1, X2).]

Can be used for:

◮ Data compression.
◮ Visualization.
◮ Detection of latent manifold structure.
◮ Fast search.
◮ . . .

SLIDE 3

Graph-based dimensionality reduction techniques

◮ Input: a (sparse) affinity matrix W defined on a set of high-dimensional points Y.
◮ Objective function: minimization over the latent points X.
◮ Examples:
  • Spectral methods: Laplacian Eigenmaps (LE), LLE;
    ✓ closed-form solution;
    ✗ results can be bad.
  • Nonlinear methods: SNE, t-SNE, elastic embedding (EE);
    ✓ better results;
    ✗ slow to train, limited to small data sets.

SLIDE 4

COIL-20 Dataset

Rotations of 10 objects every 5°; the input is 128 × 128 greyscale images.

[Figure: sample input images Y, and the 2D embeddings found by Elastic Embedding and by Laplacian Eigenmaps.]

SLIDE 5

General Embedding Formulation (Carreira-Perpiñán, 2010)

For a matrix of high-dimensional points Y ∈ R^{D×N} and low-dimensional points X ∈ R^{d×N}:

E(X, λ) = E+(X) + λE−(X),  λ ≥ 0

E+(X) is the attractive term:
◮ often quadratic;
◮ minimal with coincident points.

E−(X) is the repulsive term:
◮ often very nonlinear;
◮ minimal with the points infinitely far apart.

Optimal embeddings balance both forces.

SLIDE 6

General Embedding Formulation: Special Cases

Each method pairs an attractive term E+(X) with a repulsive term E−(X):

SNE (Hinton & Roweis, '03):
  E+(X) = Σ_{n,m=1}^N p_nm ‖x_n − x_m‖²
  E−(X) = Σ_{n=1}^N log Σ_{m=1}^N exp(−‖x_n − x_m‖²)

t-SNE (van der Maaten & Hinton, '08):
  E+(X) = Σ_{n,m=1}^N p_nm log(1 + ‖x_n − x_m‖²)
  E−(X) = log Σ_{n,m=1}^N (1 + ‖x_n − x_m‖²)^{−1}

EE (Carreira-Perpiñán, '10):
  E+(X) = Σ_{n,m=1}^N w+_nm ‖x_n − x_m‖²
  E−(X) = Σ_{n,m=1}^N w−_nm exp(−‖x_n − x_m‖²)

LE & LLE (Belkin & Niyogi, '03; Roweis & Saul, '00):
  E+(X) = Σ_{n,m=1}^N w+_nm ‖x_n − x_m‖², s.t. constraints (no repulsive term)

Here w+_nm and w−_nm are elements of affinity matrices.
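To make the table concrete, here is a minimal NumPy sketch of the EE objective above; the function and variable names (ee_objective, Wp, Wm) are ours, not from the authors' code.

```python
import numpy as np

def ee_objective(X, Wp, Wm, lam):
    """E(X, lam) = E+(X) + lam * E-(X) for the elastic embedding.

    X  : d x N array of latent points.
    Wp : N x N attractive affinities (w+_nm).
    Wm : N x N repulsive affinities (w-_nm).
    """
    # All pairwise squared distances ||x_n - x_m||^2.
    sq = np.sum(X * X, axis=0)
    D2 = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)
    np.maximum(D2, 0.0, out=D2)          # clip tiny negative round-off
    E_plus = np.sum(Wp * D2)             # attractive: pulls neighbors together
    E_minus = np.sum(Wm * np.exp(-D2))   # repulsive: pushes all points apart
    return E_plus + lam * E_minus
```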

SLIDE 7

Optimization Strategy

For every iteration k:
1. Choose a positive definite Bk.
2. Solve a linear system Bk pk = −gk for a search direction pk, where gk is the gradient.
3. Use a line search (e.g. backtracking) to find a step size α for the next iterate Xk+1 = Xk + αpk.

Convergence is guaranteed (under mild assumptions)!
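As a sketch of the loop above in Python (all names are ours; f, grad, and solve_B stand for the objective, its gradient, and a routine that solves Bk p = −g):

```python
import numpy as np

def backtracking(f, X, p, g, alpha=1.0, rho=0.5, c=1e-4):
    """Shrink alpha until the Armijo sufficient-decrease condition holds."""
    fX, slope = f(X), c * np.sum(g * p)   # slope < 0 for a descent direction
    while f(X + alpha * p) > fX + alpha * slope:
        alpha *= rho
    return alpha

def partial_hessian_minimize(f, grad, solve_B, X, iters=100):
    for _ in range(iters):
        g = grad(X)
        p = solve_B(-g)                        # step 2: direction from Bk p = -g
        X = X + backtracking(f, X, p, g) * p   # step 3: line search, then update
    return X
```

Because Bk is positive definite, p is always a descent direction, which is what makes the convergence guarantee go through.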

SLIDE 8

How to choose good Bk?

Solve the linear system Bk pk = −gk. The more Hessian information Bk contains, the faster the convergence rate:

Bk = I (gradient descent)  ⟶  Bk = ∇²E (Newton's method)

[Figure: contour plots comparing search paths for Bk = I, our Bk, and Bk = ∇²E.]

We want Bk to:
◮ contain as much Hessian information as possible;
◮ be positive definite (pd);
◮ make the linear system fast to solve, and scale up to larger N.

SLIDE 9

The Spectral Direction

The Hessian of the general embedding formulation is

∇²E = 4(L+ − λL−) ⊗ Id + 8Lxx − 16λ vec(XLq) vec(XLq)ᵀ

where L+, L−, Lxx, Lq are graph Laplacians. B = 4L+ ⊗ Id is a convenient Hessian approximation:

◮ block-diagonal, with d blocks of the N × N graph Laplacian 4L+;
◮ always psd ⇒ global convergence under mild assumptions;
◮ constant for the Gaussian kernel; for other kernels we can fix it at some X;
◮ equal to the Hessian of the spectral methods: ∇²E+(X);
◮ "bends" the gradient of the nonlinear E using the curvature of the spectral E+.
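A brief sketch of the key ingredient, assuming SciPy and a helper name of our choosing. Note that B never needs to be formed as a Kronecker product, because it acts independently on each of the d coordinates of X:

```python
import numpy as np
import scipy.sparse as sp

def graph_laplacian(W):
    """L = D - W for a (sparse) symmetric affinity matrix W."""
    W = sp.csr_matrix(W)
    deg = np.asarray(W.sum(axis=1)).ravel()   # degree of each node
    return sp.diags(deg) - W

# Solving B p = -g with B = 4 L+ (x) I_d decouples across the d
# coordinates of X: it reduces to d solves (or one multi-RHS solve)
# with the single N x N matrix 4 L+.
```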

SLIDE 10

The Spectral Direction (computation)

Solve Bpk = −gk efficiently at every iteration k (naively O(N³d)):

◮ Cache the Cholesky factor of L+ in the first iteration.
◮ (Further) sparsify the weights of L+ with a κ-NN graph. Runtime is faster and convergence is still guaranteed.

Cost per iteration:
  Objective function    O(N²d)
  Gradient              O(N²d)
  Spectral direction    O(Nκd)

This strategy adds almost no overhead compared to the objective-function and gradient computations.
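A sketch of the cached-factor solver. The slide caches a Cholesky factor; since SciPy has no built-in sparse Cholesky, this stand-in caches a sparse LU factorization (splu) instead, and adds a tiny ridge μI because a graph Laplacian is only positive semidefinite:

```python
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def make_spectral_direction_solver(L_plus, mu=1e-10):
    """Factor 4L+ (+ mu*I for strict definiteness) once, then reuse it."""
    A = (4.0 * L_plus + mu * sp.eye(L_plus.shape[0])).tocsc()
    factor = spla.splu(A)                 # cached in the first iteration
    def solve(G):
        # G is the d x N gradient; solve (4L+) P^T = -G^T for the
        # d x N spectral direction P: one cheap multi-RHS solve per iteration.
        return -factor.solve(G.T).T
    return solve
```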

SLIDE 11

Experimental Evaluation: Methods Compared

Now:
◮ Gradient descent (GD) (Hinton & Roweis, '03): B = I.
◮ Fixed-point iterations (FP) (Carreira-Perpiñán, '10): B = 4D+ ⊗ Id.
◮ Spectral direction (SD): B = 4L+ ⊗ Id.
◮ L-BFGS.

More experiments and methods at the poster:
◮ Hessian diagonal update;
◮ nonlinear Conjugate Gradient;
◮ some other interesting partial-Hessian updates.

SLIDE 12

COIL-20. Convergence analysis, s-SNE

COIL-20 dataset of rotated objects (N = 720, D = 16 384, d = 2). We ran the algorithms 50 times for 30 seconds each, initialized randomly.

[Figure: objective function value vs. number of iterations for Gradient Descent, fixed-point iterations, spectral direction, and L-BFGS.]

[Animation]

SLIDE 13
MNIST. t-SNE

◮ N = 20 000 images of handwritten digits (each a 28 × 28 greyscale image, D = 784).
◮ One hour of optimization on a modern computer with one CPU.

[Figure: objective function value vs. runtime (minutes) for fixed-point iterations, spectral direction, and L-BFGS.]

[Animation]

SLIDE 14

Conclusions

◮ We presented a common framework for many well-known dimensionality reduction techniques.
◮ We presented the spectral direction: a new, simple, generic, and scalable optimization strategy that runs one to two orders of magnitude faster than traditional methods.
◮ Matlab code: http://eecs.ucmerced.edu/.

Ongoing work:
◮ The evaluation of E and ∇E remains the bottleneck (O(N²d)). We can use Fast Multipole Methods to speed up the runtime.
◮ Avoid the line search; use constant, near-optimal step sizes.

SLIDE 15
MNIST. Embedding after 20 min of EE optimization

[Figure: MNIST embeddings after 20 minutes of EE optimization, fixed-point iteration (left) vs. spectral direction (right).]

[Animation]

SLIDE 16

COIL-20. Convergence to the same minimum, s-SNE

We initialized X0 close enough to X∞ that all methods have the same initial and final points.

[Figure: objective function value (s-SNE) vs. number of iterations and vs. runtime (seconds) for GD, FP, DiagH, SD, SD–, L-BFGS, and CG.]

SLIDE 17

COIL-20: Homotopy optimization for EE

Start with a small λ, where E is convex, and follow the path of minima to the desired λ by minimizing over X as λ increases. We used 50 log-spaced values from 10⁻⁴ to 10².
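A minimal sketch of this homotopy loop, assuming a hypothetical minimize_ee routine (any EE optimizer, e.g. the spectral-direction iteration above) and an initial embedding X0:

```python
import numpy as np

lambdas = np.logspace(-4, 2, 50)   # 50 log-spaced values from 1e-4 to 1e2
X = X0                             # start where E is (nearly) convex
for lam in lambdas:
    # Warm-start: minimize over X at this lambda from the previous minimum,
    # following the path of minima as lambda grows. minimize_ee is hypothetical.
    X = minimize_ee(X, lam)
```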

[Figure: number of iterations vs. λ and runtime (seconds) vs. λ for GD, FP, DiagH, SD, SD–, L-BFGS, and CG.]

[Animation]