

SLIDE 1

Cheap Orthogonal Constraints in Neural Networks: A Simple Parametrization of the Orthogonal and Unitary Group

Mario Lezcano-Casado, Mathematical Institute
David Martínez-Rubio, Department of Computer Science
June 12, 2019

SLIDE 2

Cheap Orthogonal Constraints in Neural Networks

Optimization with orthogonal constraints

We study the optimization of neural networks with orthogonal constraints: B ∈ R^{n×n}, B⊺B = I.

Motivation:

◮ Orthogonal matrices have eigenvalues of norm 1.

◮ This makes them convenient for the exploding and vanishing gradient problems in RNNs.

◮ They constitute an implicit regularization method.

◮ They are the basic building block of matrix factorizations such as the SVD or QR.

◮ They allow for the implementation of factorized linear layers.

Visit our poster (#27 on Wednesday).
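The two properties that motivate the approach can be checked directly. A minimal sketch (not from the authors' code): a random orthogonal matrix, built here via a QR decomposition, has eigenvalues on the unit circle and preserves vector norms, so applying it repeatedly can neither explode nor shrink a signal, which is exactly what helps with RNN gradients.

```python
import numpy as np

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((6, 6)))  # Q is orthogonal: Q.T @ Q = I

# Eigenvalues of an orthogonal matrix have norm 1.
eigvals = np.linalg.eigvals(Q)
print(np.allclose(np.abs(eigvals), 1.0))  # True

# Orthogonal maps preserve norms, so fifty applications of Q
# leave the norm of x unchanged -- no exploding or vanishing.
x = rng.standard_normal(6)
y = np.linalg.matrix_power(Q, 50) @ x
print(np.isclose(np.linalg.norm(y), np.linalg.norm(x)))  # True
```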

SLIDE 8

Cheap Orthogonal Constraints in Neural Networks

Optimization with orthogonal constraints

The constrained problem

    min_{B ∈ SO(n)} f(B)

is equivalent to solving the unconstrained problem

    min_{A ∈ Skew(n)} f(exp(A)).

◮ The matrix exponential maps skew-symmetric matrices to orthogonal matrices.

◮ Computing the exponential lets us optimize over the unconstrained space of skew-symmetric matrices.

◮ No orthogonality needs to be enforced.

◮ It has negligible overhead in your neural network.

◮ General-purpose optimizers can be used (SGD, Adam, Adagrad, ...).

◮ No new extremal points are created in the main parametrization region.

SLIDE 16

Cheap Orthogonal Constraints in Neural Networks

[Figure: cross entropy over 500–4000 iterations in the copying problem for L = 2000, comparing the baseline, EURNN, LSTM, scoRNN and expRNN.]

Cross entropy in the copying problem for L = 2000. The copying problem uses synthetic data of the form:

            Random numbers   Wait for L steps   Recall
    Input:  14221            ----- ... -----    :----
    Output: -----            ----- ... -----    14221
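A hypothetical generator for the copying task sketched above (the function name and the exact symbol encoding are ours): K random digits, L blank "wait" steps, then a recall marker, with a target that reproduces the digits only after the marker.

```python
import numpy as np

def copying_batch(L: int = 2000, K: int = 5, n_digits: int = 8, seed: int = 0):
    rng = np.random.default_rng(seed)
    blank, marker = n_digits, n_digits + 1  # two extra symbols beyond the digits
    digits = rng.integers(0, n_digits, size=K)
    # Input: digits, L blanks, recall marker, K-1 more blanks.
    x = np.concatenate([digits, np.full(L, blank), [marker], np.full(K - 1, blank)])
    # Target: blank until the recall phase, then the K digits.
    y = np.concatenate([np.full(K + L, blank), digits])
    return x, y  # both have length L + 2K

x, y = copying_batch(L=10, K=5)
print(x.shape == y.shape)        # True: input and target are aligned
print((y[-5:] == x[:5]).all())   # True: the digits are recalled at the end
```

The model must carry the K digits across L timesteps, which is why unitary/orthogonal recurrences shine here for large L.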

SLIDE 17

Cheap Orthogonal Constraints in Neural Networks

RNNs trained on a speech prediction task on the TIMIT dataset. The table reports validation and test MSE for the model with the best validation MSE (lower is better).

MODEL    N     # PARAM   VALID.   TEST
EXPRNN   224   ≈ 83K     5.34     5.30
EXPRNN   322   ≈ 135K    4.42     4.38
EXPRNN   425   ≈ 200K    5.52     5.48
SCORNN   224   ≈ 83K     9.26     8.50
SCORNN   322   ≈ 135K    8.48     7.82
SCORNN   425   ≈ 200K    7.97     7.36
LSTM     84    ≈ 83K     15.42    14.30
LSTM     120   ≈ 135K    13.93    12.95
LSTM     158   ≈ 200K    13.66    12.62
EURNN    158   ≈ 83K     15.57    18.51
EURNN    256   ≈ 135K    15.90    15.31
EURNN    378   ≈ 200K    16.00    15.15
RGD      128   ≈ 83K     15.07    14.58
RGD      192   ≈ 135K    15.10    14.50
RGD      256   ≈ 200K    14.96    14.69