SLIDE 1

Implicit Optimization Bias

as a key to

Understanding Deep Learning

Nati Srebro (TTIC)

Based on joint work with Behnam Neyshabur (TTIC→IAS), Ryota Tomioka (TTIC→MSR), Srinadh Bhojanapalli, Suriya Gunasekar, Blake Woodworth, Pedro Savarese (TTIC), Russ Salakhutdinov (CMU), Ashia Wilson, Becca Roelofs, Mitchel Stern, Ben Recht (Berkeley), Daniel Soudry, Elad Hoffer, Mor Shpigel (Technion), Jason Lee (USC)

SLIDE 2

Increasing the Network Size

[Neyshabur Tomioka S ICLR’15]

SLIDE 3

Increasing the Network Size

[Neyshabur Tomioka S ICLR’15]

SLIDE 4

Increasing the Network Size

[Neyshabur Tomioka S ICLR’15]

SLIDE 5

Increasing the Network Size

[Neyshabur Tomioka S ICLR’15]

[Plot: Test Error vs. Complexity (Path Norm?)]

SLIDE 6
  • What is the relevant “complexity measure” (e.g. norm)?
  • How is this minimized (or controlled) by the optimization algorithm?
  • How does it change if we change the optimization algorithm?
SLIDE 7

[Plots: Cross-Entropy Training Loss, 0/1 Training Error, and 0/1 Test Error vs. Epoch for Path-SGD vs. SGD on MNIST, CIFAR-10, SVHN, and CIFAR-100 (with dropout)]

[Neyshabur Salakhutdinov S NIPS’15]

SLIDE 8

SGD vs ADAM

[Plots: Test Error (Perplexity) and Training Error (Perplexity) on Penn Treebank using a 3-layer LSTM]

[Wilson Roelofs Stern S Recht, “The Marginal Value of Adaptive Gradient Methods in Machine Learning”, NIPS’17]

SLIDE 9

The Deep Recurrent Residual Boosting Machine

Joe Flow, DeepFace Labs

Section 1: Introduction

We suggest a new amazing architecture and loss function that is great for learning. All you have to do to learn is fit the model on your training data.

Section 2: Learning

Contribution: our model

The model class h_w is amazing. Our learning method is:

arg min_w (1/n) ∑_{i=1}^{n} loss(h_w(x_i); y_i)    (*)

Section 3: Optimization

This is how we solve the optimization problem (*): […]

Section 4: Experiments

It works!

SLIDE 10

Different optimization algorithm ➔ Different bias in optimum reached ➔ Different inductive bias ➔ Different learning properties

Goal: understand optimization algorithms not just as reaching some (global) optimum, but as reaching a specific optimum

SLIDE 11

Today

Precisely understand implicit bias in:

  • Matrix Factorization
  • Linear Classification (Logistic Regression)
  • Linear Convolutional Networks
SLIDE 12

Matrix Reconstruction

min_{X ∈ ℝ^{n×n}} F(X) = ‖𝒜(X) − y‖₂²

where 𝒜(X)_i = ⟨A_i, X⟩, with A_1, …, A_m ∈ ℝ^{n×n} and y ∈ ℝ^m

  • Matrix completion (A_i is an indicator matrix)
  • Matrix reconstruction from linear measurements
  • Multi-task learning (A_i = e_{task of example i} · x_{example i}^⊤)

We are interested in the regime m ≪ n²

  • Many global optima for which 𝒜(X) = y
  • Easy to have 𝒜(X) = y without reconstruction/generalization
  • E.g. for matrix completion, set all unobserved entries to 0 (sketched below)
  • Gradient descent on X will generally yield a trivial, non-generalizing solution

[Illustration: matrix completion — the observed entries y and indicator measurement matrices A_1, A_2, A_3, …]
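To make the setup concrete, here is a minimal NumPy sketch (the 5×5 size, the seed, and all variable names are illustrative choices of mine, not from the talk) of the measurement operator 𝒜 for matrix completion and of the trivial zero-fill solution that fits 𝒜(X) = y exactly without reconstructing the unobserved entries:

```python
import numpy as np

n, m = 5, 15                                  # illustrative sizes: n x n matrix, m observed entries
rng = np.random.default_rng(0)

X_true = rng.standard_normal((n, 1)) @ rng.standard_normal((1, n))   # rank-1 ground truth

# Indicator measurement matrices A_i: matrix completion observes m random entries
A = np.zeros((m, n, n))
for i, f in enumerate(rng.choice(n * n, size=m, replace=False)):
    A[i, f // n, f % n] = 1.0

def calA(X):
    """The linear measurement operator: calA(X)_i = <A_i, X>."""
    return np.array([np.sum(A_i * X) for A_i in A])

y = calA(X_true)                              # the observations

# Trivial zero-fill "solution": keeps the observed entries, sets the rest to 0
X_zero_fill = np.sum(A * y[:, None, None], axis=0)
print(np.allclose(calA(X_zero_fill), y))      # True: zero training error
print(np.linalg.norm(X_zero_fill - X_true))   # large: the unobserved entries are not reconstructed
```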

SLIDE 13

Factorized Matrix Reconstruction

min_{U,V ∈ ℝ^{n×n}} f(U, V) = F(U V^⊤) = ‖𝒜(U V^⊤) − y‖₂²

  • Since U, V are full dimensional, there is no constraint on X = U V^⊤; equivalent to min_X F(X)
  • Underdetermined, all the same global minima, trivial to minimize without generalizing

[Illustration: U × V^⊤ = X, with 𝒜(X) ≈ observed entries y]

What happens when we optimize by gradient descent on U, V?
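Continuing the illustrative sketch above (same A, y, n, X_true), here is a minimal implementation of plain gradient descent on the factors of f(U, V) = ‖𝒜(U V^⊤) − y‖₂², with the small initialization and small stepsize that the next slides focus on; the function name and hyperparameters are my own choices:

```python
import numpy as np

def grad_descent_factored(A, y, n, steps=20000, lr=0.005, init_scale=1e-3, seed=1):
    """Plain gradient descent on f(U, V) = ||calA(U V^T) - y||_2^2 (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    U = init_scale * rng.standard_normal((n, n))    # small initialization
    V = init_scale * rng.standard_normal((n, n))
    for _ in range(steps):
        X = U @ V.T
        r = np.einsum('ijk,jk->i', A, X) - y        # residuals calA(X) - y
        G = np.einsum('i,ijk->jk', r, A)            # dF/dX (up to a factor 2 absorbed into lr)
        U, V = U - lr * (G @ V), V - lr * (G.T @ U)
    return U, V

U, V = grad_descent_factored(A, y, n)               # A, y, n, X_true from the previous sketch
print(np.linalg.norm(calA(U @ V.T) - y))            # ~0: yet another global minimum of the training loss
print(np.linalg.norm(U @ V.T - X_true))             # often far smaller than the zero-fill error above
```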

SLIDE 14

Gradient descent on f(U, V) gets to “good” global minima

SLIDE 15

Gradient descent on f(U, V) gets to “good” global minima

Gradient descent on f(U, V) generalizes better with smaller step size

SLIDE 16

Question: Which global minimum does gradient descent reach? Why does it generalize well?

SLIDE 17

Gradient descent on f(U, V) converges to a minimum nuclear norm solution

SLIDE 18
Conjecture: With stepsize → 0 (i.e. gradient flow) and initialization → 0, gradient descent on U converges to the minimum nuclear norm solution: U U^⊤ → arg min_{X ⪰ 0} ‖X‖_* s.t. 𝒜(X) = y

  • Rigorous proof when the A_i commute
  • General A_i: empirical validation + hand waving
  • Yuanzhi Li, Hongyang Zhang and Tengyu Ma: proved when y = 𝒜(X*), X* low rank, 𝒜 satisfies RIP

[Gunasekar Woodworth Bhojanapalli Neyshabur S 2017]

SLIDE 19

Implicit Bias in Least Squares

min ‖A x − b‖₂²

  • Gradient Descent (+Momentum) on x ➔ min_{Ax=b} ‖x‖₂ (see the numerical sketch below)
  • Gradient Descent on the factorization X = U V^⊤ ➔ probably min_{𝒜(X)=b} ‖X‖_* with stepsize ↘ 0 and init ↘ 0, but only in the limit; depends on stepsize and init; proved only in special cases
  • AdaGrad on x ➔ in some special cases min_{Ax=b} ‖x‖_∞, but not always, and it depends on stepsize, adaptation parameter, momentum
  • Steepest Descent w.r.t. ‖x‖ ➔ ??? Not min_{Ax=b} ‖x‖, even as stepsize ↘ 0! And it depends on stepsize, init, momentum
  • Coordinate Descent (steepest descent w.r.t. ‖x‖₁) ➔ related to, but not quite, the Lasso (with stepsize ↘ 0 and a particular tie-breaking rule, ≈ LARS)
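As a quick numerical check of the first bullet, here is a minimal NumPy sketch (sizes, seed, stepsize, and the choice to initialize at zero are mine) showing gradient descent on an underdetermined least-squares problem landing on the minimum-ℓ₂-norm interpolating solution, i.e. the pseudoinverse solution:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 5, 20                            # underdetermined: more unknowns than equations
A = rng.standard_normal((m, d))
b = rng.standard_normal(m)

x = np.zeros(d)                         # initialize at 0
lr = 0.01
for _ in range(20000):                  # gradient descent on ||Ax - b||^2
    x -= lr * 2 * A.T @ (A @ x - b)

x_min_norm = np.linalg.pinv(A) @ b      # arg min ||x||_2  s.t.  Ax = b
print(np.linalg.norm(x - x_min_norm))   # ~0: GD picked out the minimum-norm solution
```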

SLIDE 20

Training a Single Unit on Separable Data

arg min_{w ∈ ℝ^n} ℒ(w) = ∑_{i=1}^{m} ℓ(y_i ⟨w, x_i⟩),   ℓ(z) = log(1 + e^{−z})

  • Data {(x_i, y_i)}_{i=1}^{m} linearly separable (∃w ∀i: y_i ⟨w, x_i⟩ > 0)
  • Where does gradient descent w^{(t+1)} = w^{(t)} − η ∇ℒ(w^{(t)}) converge?
  • inf_w ℒ(w) = 0, but the minimum is unattainable
  • GD diverges to infinity: ‖w^{(t)}‖ → ∞, ℒ(w^{(t)}) → 0
  • In what direction? What does w^{(t)} / ‖w^{(t)}‖ converge to?
  • Theorem: w^{(t)} / ‖w^{(t)}‖₂ → ŵ / ‖ŵ‖₂, where ŵ = arg min ‖w‖₂ s.t. ∀i: y_i ⟨w, x_i⟩ ≥ 1 (the hard-margin SVM solution; checked numerically below)
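A hedged numerical check of the theorem (data, seed, stepsize, and iteration count are illustrative choices of mine, and I use scikit-learn's LinearSVC with a very large C as a stand-in for the hard-margin SVM):

```python
import numpy as np
from scipy.special import expit            # expit(z) = 1 / (1 + exp(-z)), numerically stable
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.standard_normal((80, 2))
keep = np.abs(X[:, 0] + 0.5 * X[:, 1]) > 0.3          # keep a visible margin around the separator
X = X[keep]
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1.0, -1.0)  # linearly separable labels

# Gradient descent on the logistic loss  sum_i log(1 + exp(-y_i <w, x_i>))
w = np.zeros(2)
lr = 0.05
for _ in range(200_000):
    margins = y * (X @ w)
    w += lr * X.T @ (y * expit(-margins))             # minus the gradient of the loss

# Hard-margin SVM direction, approximated by a large-C linear SVM without intercept
w_svm = LinearSVC(C=1e6, fit_intercept=False, max_iter=100_000).fit(X, y).coef_.ravel()

cos = w @ w_svm / (np.linalg.norm(w) * np.linalg.norm(w_svm))
print(cos)   # typically very close to 1; the directional convergence is only logarithmic in t
```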

SLIDE 21

Other Objectives and Opt Methods

  • Single linear unit, logistic loss ➔ hard-margin SVM solution (regardless of init, stepsize)
  • Multi-class problems with softmax loss ➔ multiclass SVM solution (regardless of init, stepsize)
  • Steepest Descent w.r.t. ‖w‖ ➔ arg min ‖w‖ s.t. ∀i: y_i ⟨w, x_i⟩ ≥ 1 (regardless of init, stepsize)
  • Coordinate Descent ➔ arg min ‖w‖₁ s.t. ∀i: y_i ⟨w, x_i⟩ ≥ 1 (regardless of init, stepsize)
  • Matrix factorization problems ℒ(U, V) = ∑_i ℓ(⟨A_i, U V^⊤⟩), including 1-bit matrix completion ➔ arg min ‖X‖_* s.t. ⟨A_i, X⟩ ≥ 1 (regardless of init)

SLIDE 22

Linear Neural Networks

  • Graph G(V, E), with h_v = ∑_{u→v} w_{u→v} h_u
  • Input units h_in = x_i ∈ ℝ^n, a single output h_out(x_i), binary labels y_i ∈ ±1
  • Training: min_w ∑_{i=1}^{m} ℓ(y_i h_out(x_i))
  • Implements a linear predictor: h_out(x_i) = ⟨𝒫(w), x_i⟩
  • Training: min_w ℒ(𝒫(w)) = ∑_{i=1}^{m} ℓ(y_i ⟨𝒫(w), x_i⟩)
  • Just a different parametrization of linear classification: min_{β ∈ Im 𝒫} ℒ(β)
  • GD on w: a different optimization procedure for the same argmin problem
  • Limit of GD: β^∞ = lim_{t→∞} 𝒫(w^{(t)}) / ‖𝒫(w^{(t)})‖

Im 𝒫 = ℝ^n in all our examples

SLIDE 23

Fully Connected Linear NNs

L fully connected layers with D_l ≥ 1 units in layer l:

h_0 = h_in,   h_l = W_l h_{l−1},   h_out = h_L,   h_l ∈ ℝ^{D_l}

Parameters: w = (W_l ∈ ℝ^{D_l × D_{l−1}}, l = 1..L)

Theorem: β^∞ ∝ arg min ‖β‖₂ s.t. ∀i: y_i ⟨β, x_i⟩ ≥ 1

for ℓ(z) = exp(−z), almost all linearly separable data sets and initializations w(0), and any bounded stepsizes s.t. ℒ(w^{(t)}) → 0 and Δw^{(t)} = w^{(t)} − w^{(t−1)} converges in direction
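To make the re-parametrization concrete, here is a minimal NumPy sketch (layer sizes and function names are my own, illustrative choices): for a fully connected linear network, 𝒫(w) is just the product W_L ⋯ W_1, so the network computes the linear predictor ⟨𝒫(w), x⟩:

```python
import numpy as np

rng = np.random.default_rng(0)
layer_sizes = [10, 7, 5, 1]                       # input dim, two hidden widths, scalar output
Ws = [rng.standard_normal((layer_sizes[l + 1], layer_sizes[l]))
      for l in range(len(layer_sizes) - 1)]       # parameters w = (W_1, ..., W_L)

def forward(Ws, x):
    """h_0 = x, h_l = W_l h_{l-1}, output h_L (linear network, no nonlinearities)."""
    h = x
    for W in Ws:
        h = W @ h
    return h

def P(Ws):
    """The linear predictor beta = P(w) the network implements: the product W_L ... W_1."""
    beta = np.eye(Ws[0].shape[1])
    for W in Ws:
        beta = W @ beta
    return beta.ravel()

x = rng.standard_normal(layer_sizes[0])
print(np.allclose(forward(Ws, x), P(Ws) @ x))     # True: h_out(x) = <P(w), x>
```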

SLIDE 24

Linear Conv Nets

L − 1 hidden layers h_l ∈ ℝ^D, each a full-width cyclic “convolution”:

h_l[d] = ∑_{k=0}^{D−1} w_l[k] · h_{l−1}[(d + k) mod D],   h_out = ⟨w_L, h_{L−1}⟩

Parameters: w = (w_l ∈ ℝ^D, l = 1..L)

Theorem: With a single conv layer (L = 2), β^∞ ∝ arg min ‖ℱβ‖₁ s.t. ∀i: y_i ⟨β, x_i⟩ ≥ 1

Theorem: β^∞ ∝ a critical point of min ‖ℱβ‖_{2/L} s.t. ∀i: y_i ⟨β, x_i⟩ ≥ 1

for ℓ(z) = exp(−z), almost all linearly separable data sets and initializations w(0), and any bounded stepsizes s.t. ℒ → 0 and Δw^{(t)} converges in direction.   ℱ = discrete Fourier transform
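A small companion sketch (dimensions, seed, and the unitary DFT normalization are my assumptions) of the full-width cyclic convolution layer, of recovering the linear predictor β it implements, and of the Fourier-domain norms appearing in the theorems:

```python
import numpy as np

rng = np.random.default_rng(0)
D, L = 8, 3                                        # signal length D; L-1 conv layers + linear output
ws = [rng.standard_normal(D) for _ in range(L)]    # parameters w = (w_1, ..., w_L), each in R^D

def conv_layer(w, h):
    """Full-width cyclic "convolution": out[d] = sum_k w[k] * h[(d + k) mod D]."""
    return np.array([np.dot(w, np.roll(h, -d)) for d in range(len(h))])

def net_output(ws, x):
    h = x
    for w in ws[:-1]:                              # the L-1 hidden conv layers
        h = conv_layer(w, h)
    return ws[-1] @ h                              # output layer: <w_L, h_{L-1}>

# The network is linear in x, so beta can be read off by probing with basis vectors
beta = np.array([net_output(ws, e) for e in np.eye(D)])
x = rng.standard_normal(D)
print(np.allclose(net_output(ws, x), beta @ x))    # True: the network computes <beta, x>

F_beta = np.fft.fft(beta) / np.sqrt(D)             # DFT of beta (unitary normalization, my choice)
print(np.linalg.norm(F_beta, 1))                   # ||F beta||_1, the single-conv-layer (L = 2) penalty
print(np.sum(np.abs(F_beta) ** (2 / L)) ** (L / 2))  # the ||F beta||_{2/L} quasi-norm for depth L
```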

SLIDE 25

min ‖β‖₂ s.t. ∀i: y_i ⟨β, x_i⟩ ≥ 1

min ‖ℱβ‖_{2/L} s.t. ∀i: y_i ⟨β, x_i⟩ ≥ 1

min ‖β‖_{2/L} s.t. ∀i: y_i ⟨β, x_i⟩ ≥ 1

[Figure: the resulting predictors, shown for L = 2 and L = 5]

SLIDE 26
SLIDE 27

Goal: understand optimization algorithms not just as reaching some (global) optimum, but as reaching a specific optimum

Different optimization algorithm ➔ Different bias in optimum reached ➔ Different inductive bias ➔ Different learning properties