Partial-Hessian Strategies for Fast Learning of Nonlinear Embeddings
Max Vladymyrov and Miguel Á. Carreira-Perpiñán
Electrical Engineering and Computer Science University of California, Merced https://eecs.ucmerced.edu
August 30, 2012
Given a high-dimensional dataset Y = (y1, . . . , yN) ⊂ R^D, find a low-dimensional representation X = (x1, . . . , xN) ⊂ R^d, where d ≪ D.
[Figure: an example dataset Y in 3D (axes Y1, Y2, Y3) and its 2D embedding X (axes X1, X2).]
Can be used for:
◮ Data compression.
◮ Visualization.
◮ Detecting latent manifold structure.
◮ Fast search.
◮ . . .
◮ Input: (sparse) affinity matrix W defined on a set of high-dimensional points Y.
◮ Objective function: minimization over the latent points X.
◮ Examples:
  Spectral methods (e.g. Laplacian Eigenmaps, LLE):
    ✓ closed-form solution;
    ✗ results can be bad.
  Nonlinear methods (e.g. SNE, t-SNE, EE):
    ✓ better results;
    ✗ slow to train, limited to small data sets.
Rotations of 10 objects every 5°; the input is 128 × 128 greyscale images.
[Figure: example input images Y and their 2D embeddings by the Elastic Embedding and by Laplacian Eigenmaps.]
For Y ∈ R^{D×N} the matrix of high-dimensional points and X ∈ R^{d×N} the matrix of low-dimensional points, the generalized embedding objective (Carreira-Perpiñán, 2010) is

E(X; λ) = E+(X) + λE−(X),   λ ≥ 0.

E+(X) is the attractive term:
◮ often quadratic;
◮ minimal with coincident points.
E−(X) is the repulsive term:
◮ often very nonlinear;
◮ minimal with points separated infinitely.
Optimal embeddings balance both forces; a small sketch of this structure follows.
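As a rough illustration (not the authors' Matlab code), here is a minimal Python/NumPy sketch of this structure; the names sqdist and embedding_objective and the callable arguments are assumptions made for the example:

```python
import numpy as np

def sqdist(X):
    """Pairwise squared Euclidean distances ||x_n - x_m||^2 for the rows of X (N x d)."""
    sq = np.sum(X**2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(D2, 0.0)
    return np.maximum(D2, 0.0)  # clip tiny negatives caused by round-off

def embedding_objective(X, E_plus, E_minus, lam):
    """Generalized embedding objective E(X; lambda) = E+(X) + lambda * E-(X),
    where E_plus is the attractive term and E_minus the repulsive term."""
    return E_plus(X) + lam * E_minus(X)
```

Each method in the table that follows plugs its own attractive and repulsive terms into this form.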
E+(X) and E−(X) for each method:

SNE (Hinton & Roweis, '03):
  E+(X) = Σ_{n,m} p_nm ‖x_n − x_m‖²
  E−(X) = Σ_n log Σ_m e^(−‖x_n − x_m‖²)

t-SNE (van der Maaten & Hinton, '08):
  E+(X) = Σ_{n,m} p_nm log(1 + ‖x_n − x_m‖²)
  E−(X) = log Σ_{n,m} (1 + ‖x_n − x_m‖²)⁻¹

EE (Carreira-Perpiñán, '10):
  E+(X) = Σ_{n,m} w+_nm ‖x_n − x_m‖²
  E−(X) = Σ_{n,m} w−_nm e^(−‖x_n − x_m‖²)

LE & LLE (Belkin & Niyogi, '03; Roweis & Saul, '00):
  E+(X) = Σ_{n,m} w+_nm ‖x_n − x_m‖²  s.t. constraints (no repulsive term)

w+_nm and w−_nm are elements of the affinity matrices.
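For concreteness, a hedged sketch of the Elastic Embedding row of this table; the names ee_terms, Wp and Wm are illustrative, and both affinity matrices are assumed to have zero diagonals:

```python
import numpy as np

def ee_terms(X, Wp, Wm):
    """Elastic Embedding terms: E+ = sum_nm w+_nm ||x_n - x_m||^2 and
    E- = sum_nm w-_nm exp(-||x_n - x_m||^2).  X is N x d; Wp, Wm are the
    N x N attractive/repulsive affinity matrices (zero diagonals assumed)."""
    sq = np.sum(X**2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # ||x_n - x_m||^2
    np.fill_diagonal(D2, 0.0)
    E_plus = float(np.sum(Wp * D2))
    E_minus = float(np.sum(Wm * np.exp(-D2)))
    return E_plus, E_minus
```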
At each iteration k, solve the linear system B_k p_k = −g_k for the search direction. B_k = I gives gradient descent; B_k = ∇²E gives Newton's method. The more Hessian information B_k contains, the faster the convergence rate.
[Figure: optimization iterates on an example objective for B_k = I vs. B_k = ∇²E.]
We want B_k to:
◮ contain as much Hessian information as possible;
◮ be positive definite (pd);
◮ make the linear system fast to solve and scale up to larger N.
A sketch of one generic iteration appears below.
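A minimal sketch of one such iteration, assuming a dense, positive definite B and an Armijo backtracking line search; the function and parameter names are assumptions, not the paper's:

```python
import numpy as np

def descent_step(X, E, grad_E, B, alpha0=1.0, rho=0.5, c=1e-4):
    """One generic iteration: solve B p = -g, then backtrack on the step size.
    B = I recovers gradient descent; B = the full Hessian recovers Newton's method.
    Assumes B is positive definite, so p is a descent direction."""
    g = grad_E(X)                                        # gradient, same shape as X
    p = np.linalg.solve(B, -g.ravel()).reshape(X.shape)  # search direction
    alpha, E0, slope = alpha0, E(X), np.dot(g.ravel(), p.ravel())
    while E(X + alpha * p) > E0 + c * alpha * slope:     # Armijo condition
        alpha *= rho
    return X + alpha * p
```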
The Hessian of the generalized embedding formulation is

∇²E = 4(L+ − λL−) ⊗ I_d + 8L_xx − 16λ vec(X L_q) vec(X L_q)ᵀ

where L+, L−, L_xx, L_q are graph Laplacians. B = 4L+ ⊗ I_d is a convenient Hessian approximation:
◮ block-diagonal, with d blocks of the N × N graph Laplacian 4L+;
◮ always psd ⇒ global convergence under mild assumptions;
◮ constant for the Gaussian kernel; for other kernels we can fix it at some X;
◮ equal to the Hessian of the spectral methods: ∇²E+(X);
◮ "bends" the gradient of the nonlinear E using the curvature of the spectral E+.
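A dense, illustration-only sketch of how this B could be assembled; Wp is an assumed name for the attractive affinity matrix, and the Kronecker ordering below matches the block-diagonal description above (it depends on how X is vectorized):

```python
import numpy as np

def spectral_direction_B(Wp, d):
    """B = 4 L+ (Kronecker) I_d: block-diagonal with d copies of the N x N
    Laplacian 4 L+, where L+ = D+ - W+ is the graph Laplacian of the
    attractive affinities (psd for symmetric, nonnegative Wp)."""
    L_plus = np.diag(Wp.sum(axis=1)) - Wp
    return np.kron(np.eye(d), 4.0 * L_plus)   # d diagonal blocks, each 4 L+
```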
Solve B p_k = −g_k efficiently at every iteration k (naively O(N³d)):
◮ Cache the Cholesky factor of L+ in the first iteration (see the sketch below).
◮ (Further) sparsify the weights of L+ with a κ-NN graph. Runtime is faster and convergence is still guaranteed.

Cost per iteration:
  Objective function   O(N²d)
  Gradient             O(N²d)
  Spectral direction   O(Nκd)

This strategy adds almost no overhead compared to the cost of evaluating the objective function and gradient.
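A minimal dense sketch of the cached solve; the real implementation would use a sparse Cholesky factorization of the κ-NN-sparsified L+, and the small ridge that turns the psd Laplacian into a pd matrix is my own assumption:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def make_spectral_solver(L_plus, ridge=1e-8):
    """Factorize 4 L+ once and reuse the factor at every iteration.
    L_plus: N x N attractive graph Laplacian (psd); the ridge makes it pd."""
    N = L_plus.shape[0]
    factor = cho_factor(4.0 * L_plus + ridge * np.eye(N))  # computed once, then cached
    def solve(G):
        """Spectral direction for a gradient reshaped as N x d: all d columns
        are solved against the same cached factor."""
        return cho_solve(factor, -G)
    return solve
```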
Methods compared now:
◮ Gradient descent (GD), B = I (Hinton & Roweis, '03);
◮ Fixed-point iterations (FP), B = 4D+ ⊗ I_d (Carreira-Perpiñán, '10);
◮ Spectral direction (SD), B = 4L+ ⊗ I_d;
◮ L-BFGS.
A sketch of the FP block, for contrast with SD, appears below.
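For contrast with the spectral direction, a tiny sketch of the FP preconditioner block, which keeps only the diagonal degree part D+ of the Laplacian (again, Wp is an assumed variable name):

```python
import numpy as np

def fixed_point_block(Wp):
    """FP uses B = 4 D+ (Kronecker) I_d.  D+ is diagonal (the row sums of W+),
    so solving B p = -g reduces to an elementwise division of cost O(Nd)."""
    return 4.0 * np.diag(np.asarray(Wp).sum(axis=1))   # one N x N diagonal block
```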
More experiments and methods at the poster:
◮ Hessian diagonal update;
◮ nonlinear Conjugate Gradient;
◮ other interesting partial-Hessian updates.
COIL-20 dataset of rotated objects (N = 720, D = 16 384, d = 2). We ran the algorithms 50 times for 30 seconds each, initialized randomly.
[Figure: objective function value vs. number of iterations for Gradient Descent, Fixed-point iterations, Spectral direction, and L-BFGS.]
◮ N = 20 000 images of handwritten digits (each a 28 × 28 pixel grayscale image, D = 784).
◮ One hour of optimization on a modern computer with one CPU.
[Figure: objective function value vs. runtime (minutes) for Fixed-point iterations, Spectral direction, and L-BFGS.]
◮ We presented a common framework for many well-known dimensionality reduction techniques.
◮ We presented the spectral direction: a new, simple, generic and scalable optimization strategy that runs one to two orders of magnitude faster than traditional methods.
◮ Matlab code: http://eecs.ucmerced.edu/.
Ongoing work:
◮ The evaluation of E and ∇E remains the bottleneck (O(N²d)). We can use Fast Multipole Methods to speed up the runtime.
◮ Avoid line search; use constant, near-optimal step sizes.
Fixed-point iteration vs. Spectral direction
[Figure: optimization iterates of the fixed-point iteration and of the spectral direction.]
We initialized X0 close enough to X∞ so that all methods have the same initial and final points.
[Figure: objective function (s-SNE) vs. number of iterations and vs. runtime (seconds) for GD, FP, DiagH, SD, SD–, L-BFGS and CG.]
Start with a small λ, for which E is convex, and follow the path of minima to the desired λ by minimizing over X as λ increases. We used 50 log-spaced values of λ from 10⁻⁴ to 10²; a sketch of this homotopy loop follows.
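A hedged sketch of this homotopy loop; minimize_embedding is a placeholder for any of the optimizers compared above:

```python
import numpy as np

def homotopy_path(X0, minimize_embedding, lambdas=None):
    """Follow the path of minima: minimize over X at each lambda, warm-starting
    from the minimizer found at the previous (smaller) lambda."""
    if lambdas is None:
        lambdas = np.logspace(-4, 2, 50)   # 50 log-spaced values from 1e-4 to 1e2
    X, path = X0, []
    for lam in lambdas:
        X = minimize_embedding(X, lam)     # warm start from the previous minimum
        path.append((lam, X.copy()))
    return path
```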
[Figure: number of iterations and runtime (seconds) as functions of λ for GD, FP, DiagH, SD, SD–, L-BFGS and CG.]