Partial-Hessian Strategies for Fast Learning of Nonlinear Embeddings
Max Vladymyrov and Miguel Á. Carreira-Perpiñán
Electrical Engineering and Computer Science University of California, Merced https://eecs.ucmerced.edu
August 30, 2012
Given a high-dimensional dataset Y = (y1, . . . , yN) ⊂ R^D, find a low-dimensional representation X = (x1, . . . , xN) ⊂ R^d, where d ≪ D.
[Figure: an example dataset Y in 3D (axes Y1, Y2, Y3) and its 2D embedding X (axes X1, X2).]
Can be used for:
◮ Data compression.
◮ Visualization.
◮ Detecting latent manifold structure.
◮ Fast search.
◮ . . .
◮ Input: (sparse) affinity matrix W defined on a set of high-dimensional points Y.
◮ Objective function: minimization over the latent points X.
◮ Examples:
  Spectral methods (e.g. Laplacian Eigenmaps, LLE):
    ✓ closed-form solution;
    ✗ results can be bad.
  Nonlinear methods (e.g. SNE, t-SNE, EE):
    ✓ better results;
    ✗ slow to train, limited to small data sets.
Rotations of 10 objects every 5°; the input is 128 × 128 greyscale images.
[Figure: example input images Y and their 2D embeddings by the Elastic Embedding and by Laplacian Eigenmaps.]
For Y ∈ R^{D×N} the matrix of high-dimensional points and X ∈ R^{d×N} the matrix of low-dimensional points, the generalized embedding objective (Carreira-Perpiñán, 2010) is

E(X; λ) = E+(X) + λE−(X),   λ ≥ 0.

E+(X) is the attractive term:
◮ often quadratic;
◮ minimal with coincident points.
E−(X) is the repulsive term:
◮ often very nonlinear;
◮ minimal with points separated infinitely.
Optimal embeddings balance both forces; a small sketch of this structure follows.
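As a rough illustration (not the authors' Matlab code), here is a minimal Python/NumPy sketch of this structure; the names sqdist and embedding_objective and the callable arguments are assumptions made for the example:

```python
import numpy as np

def sqdist(X):
    """Pairwise squared Euclidean distances ||x_n - x_m||^2 for the rows of X (N x d)."""
    sq = np.sum(X**2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(D2, 0.0)
    return np.maximum(D2, 0.0)  # clip tiny negatives caused by round-off

def embedding_objective(X, E_plus, E_minus, lam):
    """Generalized embedding objective E(X; lambda) = E+(X) + lambda * E-(X),
    where E_plus is the attractive term and E_minus the repulsive term."""
    return E_plus(X) + lam * E_minus(X)
```

Each method in the table that follows plugs its own attractive and repulsive terms into this form.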
E+(X) and E−(X) for each method:

SNE (Hinton & Roweis, '03):
  E+(X) = Σ_{n,m} p_nm ‖x_n − x_m‖²
  E−(X) = Σ_n log Σ_m e^(−‖x_n − x_m‖²)

t-SNE (van der Maaten & Hinton, '08):
  E+(X) = Σ_{n,m} p_nm log(1 + ‖x_n − x_m‖²)
  E−(X) = log Σ_{n,m} (1 + ‖x_n − x_m‖²)⁻¹

EE (Carreira-Perpiñán, '10):
  E+(X) = Σ_{n,m} w+_nm ‖x_n − x_m‖²
  E−(X) = Σ_{n,m} w−_nm e^(−‖x_n − x_m‖²)

LE & LLE (Belkin & Niyogi, '03; Roweis & Saul, '00):
  E+(X) = Σ_{n,m} w+_nm ‖x_n − x_m‖²  s.t. constraints (no repulsive term)

w+_nm and w−_nm are elements of the affinity matrices.
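For concreteness, a hedged sketch of the Elastic Embedding row of this table; the names ee_terms, Wp and Wm are illustrative, and both affinity matrices are assumed to have zero diagonals:

```python
import numpy as np

def ee_terms(X, Wp, Wm):
    """Elastic Embedding terms: E+ = sum_nm w+_nm ||x_n - x_m||^2 and
    E- = sum_nm w-_nm exp(-||x_n - x_m||^2).  X is N x d; Wp, Wm are the
    N x N attractive/repulsive affinity matrices (zero diagonals assumed)."""
    sq = np.sum(X**2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # ||x_n - x_m||^2
    np.fill_diagonal(D2, 0.0)
    E_plus = float(np.sum(Wp * D2))
    E_minus = float(np.sum(Wm * np.exp(-D2)))
    return E_plus, E_minus
```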
At each iteration k, solve the linear system B_k p_k = −g_k for the search direction. B_k = I gives gradient descent; B_k = ∇²E gives Newton's method. The more Hessian information B_k contains, the faster the convergence rate.
[Figure: optimization iterates on an example objective for B_k = I vs. B_k = ∇²E.]
We want B_k to:
◮ contain as much Hessian information as possible;
◮ be positive definite (pd);
◮ make the linear system fast to solve and scale up to larger N.
A sketch of one generic iteration appears below.
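A minimal sketch of one such iteration, assuming a dense, positive definite B and an Armijo backtracking line search; the function and parameter names are assumptions, not the paper's:

```python
import numpy as np

def descent_step(X, E, grad_E, B, alpha0=1.0, rho=0.5, c=1e-4):
    """One generic iteration: solve B p = -g, then backtrack on the step size.
    B = I recovers gradient descent; B = the full Hessian recovers Newton's method.
    Assumes B is positive definite, so p is a descent direction."""
    g = grad_E(X)                                        # gradient, same shape as X
    p = np.linalg.solve(B, -g.ravel()).reshape(X.shape)  # search direction
    alpha, E0, slope = alpha0, E(X), np.dot(g.ravel(), p.ravel())
    while E(X + alpha * p) > E0 + c * alpha * slope:     # Armijo condition
        alpha *= rho
    return X + alpha * p
```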
The Hessian of the generalized embedding formulation is

∇²E = 4(L+ − λL−) ⊗ I_d + 8L_xx − 16λ vec(X L_q) vec(X L_q)ᵀ

where L+, L−, L_xx, L_q are graph Laplacians. B = 4L+ ⊗ I_d is a convenient Hessian approximation:
◮ block-diagonal, with d blocks of the N × N graph Laplacian 4L+;
◮ always psd ⇒ global convergence under mild assumptions;
◮ constant for the Gaussian kernel; for other kernels we can fix it at some X;
◮ equal to the Hessian of the spectral methods: ∇²E+(X);
◮ "bends" the gradient of the nonlinear E using the curvature of the spectral E+.
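A dense, illustration-only sketch of how this B could be assembled; Wp is an assumed name for the attractive affinity matrix, and the Kronecker ordering below matches the block-diagonal description above (it depends on how X is vectorized):

```python
import numpy as np

def spectral_direction_B(Wp, d):
    """B = 4 L+ (Kronecker) I_d: block-diagonal with d copies of the N x N
    Laplacian 4 L+, where L+ = D+ - W+ is the graph Laplacian of the
    attractive affinities (psd for symmetric, nonnegative Wp)."""
    L_plus = np.diag(Wp.sum(axis=1)) - Wp
    return np.kron(np.eye(d), 4.0 * L_plus)   # d diagonal blocks, each 4 L+
```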
Solve B p_k = −g_k efficiently at every iteration k (naively O(N³d)):
◮ Cache the Cholesky factor of L+ in the first iteration (see the sketch below).
◮ (Further) sparsify the weights of L+ with a κ-NN graph. Runtime is faster and convergence is still guaranteed.

Cost per iteration:
  Objective function   O(N²d)
  Gradient             O(N²d)
  Spectral direction   O(Nκd)

This strategy adds almost no overhead compared to the cost of evaluating the objective function and gradient.
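A minimal dense sketch of the cached solve; the real implementation would use a sparse Cholesky factorization of the κ-NN-sparsified L+, and the small ridge that turns the psd Laplacian into a pd matrix is my own assumption:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def make_spectral_solver(L_plus, ridge=1e-8):
    """Factorize 4 L+ once and reuse the factor at every iteration.
    L_plus: N x N attractive graph Laplacian (psd); the ridge makes it pd."""
    N = L_plus.shape[0]
    factor = cho_factor(4.0 * L_plus + ridge * np.eye(N))  # computed once, then cached
    def solve(G):
        """Spectral direction for a gradient reshaped as N x d: all d columns
        are solved against the same cached factor."""
        return cho_solve(factor, -G)
    return solve
```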
Methods compared now:
◮ Gradient descent (GD), B = I (Hinton & Roweis, '03);
◮ Fixed-point iterations (FP), B = 4D+ ⊗ I_d (Carreira-Perpiñán, '10);
◮ Spectral direction (SD), B = 4L+ ⊗ I_d;
◮ L-BFGS.
A sketch of the FP block, for contrast with SD, appears below.
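For contrast with the spectral direction, a tiny sketch of the FP preconditioner block, which keeps only the diagonal degree part D+ of the Laplacian (again, Wp is an assumed variable name):

```python
import numpy as np

def fixed_point_block(Wp):
    """FP uses B = 4 D+ (Kronecker) I_d.  D+ is diagonal (the row sums of W+),
    so solving B p = -g reduces to an elementwise division of cost O(Nd)."""
    return 4.0 * np.diag(np.asarray(Wp).sum(axis=1))   # one N x N diagonal block
```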
More experiments and methods at the poster:
◮ Hessian diagonal update;
◮ nonlinear Conjugate Gradient;
◮ other interesting partial-Hessian updates.
COIL-20 dataset of rotated objects (N = 720, D = 16 384, d = 2). We ran the algorithms 50 times for 30 seconds each, initialized randomly.
[Figure: objective function value vs. number of iterations for Gradient Descent, Fixed-point iterations, Spectral direction, and L-BFGS.]
◮ N = 20 000 images of handwritten digits (each a 28 × 28 pixel grayscale image, D = 784).
◮ One hour of optimization on a modern computer with one CPU.
[Figure: objective function value vs. runtime (minutes) for Fixed-point iterations, Spectral direction, and L-BFGS.]
◮ We presented a common framework for many well-known dimensionality reduction techniques.
◮ We presented the spectral direction: a new, simple, generic and scalable optimization strategy that runs one to two orders of magnitude faster than traditional methods.
◮ Matlab code: http://eecs.ucmerced.edu/.
Ongoing work:
◮ The evaluation of E and ∇E remains the bottleneck (O(N²d)). We can use Fast Multipole Methods to speed up the runtime.
◮ Avoid line search; use constant, near-optimal step sizes.
Fixed-point iteration vs. Spectral direction
[Figure: optimization iterates of the fixed-point iteration and of the spectral direction.]
We initialized X0 close enough to X∞ so that all methods have the same initial and final points.
[Figure: objective function (s-SNE) vs. number of iterations and vs. runtime (seconds) for GD, FP, DiagH, SD, SD–, L-BFGS and CG.]
Start with a small λ, for which E is convex, and follow the path of minima to the desired λ by minimizing over X as λ increases. We used 50 log-spaced values of λ from 10⁻⁴ to 10²; a sketch of this homotopy loop follows.
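A hedged sketch of this homotopy loop; minimize_embedding is a placeholder for any of the optimizers compared above:

```python
import numpy as np

def homotopy_path(X0, minimize_embedding, lambdas=None):
    """Follow the path of minima: minimize over X at each lambda, warm-starting
    from the minimizer found at the previous (smaller) lambda."""
    if lambdas is None:
        lambdas = np.logspace(-4, 2, 50)   # 50 log-spaced values from 1e-4 to 1e2
    X, path = X0, []
    for lam in lambdas:
        X = minimize_embedding(X, lam)     # warm start from the previous minimum
        path.append((lam, X.copy()))
    return path
```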
[Figure: number of iterations and runtime (seconds) as functions of λ for GD, FP, DiagH, SD, SD–, L-BFGS and CG.]