

SLIDE 1

Optimization and Backpropagation

1 I2DL: Prof. Niessner, Prof. Leal-Taixé

SLIDE 2

Lecture 3 Recap

2 I2DL: Prof. Niessner, Prof. Leal-Taixé

SLIDE 3

Neural Network

  • Linear score function f = Wx

On CIFAR-10 / On ImageNet

Credit: Li/Karpathy/Johnson

3 I2DL: Prof. Niessner, Prof. Leal-Taixé

SLIDE 4

Neural Network

  • Linear score function f = Wx
  • A neural network is a nesting of 'functions' (a quick code sketch follows after this slide)

– 2 layers: f = W_2 max(0, W_1 x)
– 3 layers: f = W_3 max(0, W_2 max(0, W_1 x))
– 4 layers: f = W_4 tanh(W_3 max(0, W_2 max(0, W_1 x)))
– 5 layers: f = W_5 σ(W_4 tanh(W_3 max(0, W_2 max(0, W_1 x))))
– … up to hundreds of layers

4 I2DL: Prof. Niessner, Prof. Leal-Taixé
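As a quick illustration of the 2-layer case above, here is a minimal NumPy sketch of the score function f = W_2 max(0, W_1 x). The layer sizes and random values are made up for illustration and are not from the slides.

```python
import numpy as np

def two_layer_forward(W1, W2, x):
    """Score function f = W2 * max(0, W1 * x): linear, ReLU, linear."""
    h = np.maximum(0, W1 @ x)   # hidden activations
    return W2 @ h

# Tiny example with made-up shapes: 4 inputs, 3 hidden units, 2 outputs.
rng = np.random.default_rng(0)
W1 = rng.standard_normal((3, 4))
W2 = rng.standard_normal((2, 3))
x = rng.standard_normal(4)
print(two_layer_forward(W1, W2, x))
```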

SLIDE 5

Neural Network

[Figure: network diagram with input layer, hidden layer, output layer]

Credit: Li/Karpathy/Johnson

5 I2DL: Prof. Niessner, Prof. Leal-Taixé

SLIDE 6

Neural Network

[Figure: deeper network with input layer, hidden layers 1-3, output layer; depth = number of layers, width = neurons per layer]

6 I2DL: Prof. Niessner, Prof. Leal-Taixé

SLIDE 7

Activation Functions

Sigmoid: σ(x) = 1 / (1 + e^(−x))
tanh: tanh(x)
ReLU: max(0, x)
Leaky ReLU: max(0.1x, x)
Parametric ReLU: max(αx, x)
Maxout: max(w_1ᵀx + b_1, w_2ᵀx + b_2)
ELU: f(x) = x if x > 0, α(eˣ − 1) if x ≤ 0

7 I2DL: Prof. Niessner, Prof. Leal-Taixé
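For reference, most of these activations are one-liners in NumPy. A minimal sketch; the α parameters for PReLU and ELU below are arbitrary example values, not values from the slides.

```python
import numpy as np

def sigmoid(x):       return 1.0 / (1.0 + np.exp(-x))
def relu(x):          return np.maximum(0.0, x)
def leaky_relu(x):    return np.maximum(0.1 * x, x)
def prelu(x, a=0.25): return np.maximum(a * x, x)              # a is learned in practice
def elu(x, a=1.0):    return np.where(x > 0, x, a * (np.exp(x) - 1.0))

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x), sigmoid(x), np.tanh(x))
```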

SLIDE 8

Loss Functions

  • Measure the goodness of the predictions (or equivalently, the network's performance)

  • Regression loss
    – L1 loss: L(y, ŷ; θ) = (1/n) Σ_i |y_i − ŷ_i|
    – MSE loss: L(y, ŷ; θ) = (1/n) Σ_i ‖y_i − ŷ_i‖²₂

  • Classification loss (for multi-class classification)
    – Cross-entropy loss: E(y, ŷ; θ) = − Σ_i Σ_k ( y_{i,k} · log ŷ_{i,k} )

8 I2DL: Prof. Niessner, Prof. Leal-Taixé
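A compact NumPy sketch of these losses. The cross-entropy version assumes one-hot targets and already-normalized class probabilities; that is an assumption of this sketch, not something stated on the slide.

```python
import numpy as np

def l1_loss(y, y_hat):
    return np.mean(np.abs(y - y_hat))        # (1/n) * sum |y_i - ŷ_i|

def mse_loss(y, y_hat):
    return np.mean((y - y_hat) ** 2)         # (1/n) * sum (y_i - ŷ_i)^2

def cross_entropy(y_onehot, y_hat_probs, eps=1e-12):
    # y_onehot: (n, K) one-hot targets, y_hat_probs: (n, K) predicted probabilities
    return -np.sum(y_onehot * np.log(y_hat_probs + eps))

y, y_hat = np.array([1.0, 0.0]), np.array([0.8, 0.3])
print(l1_loss(y, y_hat), mse_loss(y, y_hat))
print(cross_entropy(np.array([[0.0, 1.0]]), np.array([[0.2, 0.8]])))
```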

SLIDE 9

Computational Graphs

  • A neural network is a computational graph

– It has compute nodes
– It has edges that connect nodes
– It is directional
– It is organized in 'layers'

9 I2DL: Prof. Niessner, Prof. Leal-Taixé

SLIDE 10

Backprop

10 I2DL: Prof. Niessner, Prof. Leal-Taixé

SLIDE 11

The Importance of Gradients

  • Our optimization schemes are based on computing gradients
  • One can compute gradients analytically, but what if our function is too complex?
  • Break down the gradient computation

Backpropagation

∇_θ L(θ)

Rumelhart 1986

11 I2DL: Prof. Niessner, Prof. Leal-Taixé

SLIDE 12

Backprop: Forward Pass

  • f(x, y, z) = (x + y) · z

Initialization: x = 1, y = −3, z = 4
sum node:  d = x + y = −2
mult node: f = d · z = −8

12 I2DL: Prof. Niessner, Prof. Leal-Taixé

SLIDE 13

Backprop: Backward Pass

with x = 1, y = −3, z = 4
sum: d = x + y = −2,  mult: f = d · z = −8

f(x, y, z) = (x + y) · z
d = x + y   →  ∂d/∂x = 1, ∂d/∂y = 1
f = d · z   →  ∂f/∂d = z, ∂f/∂z = d

What is ∂f/∂x, ∂f/∂y, ∂f/∂z?

13 I2DL: Prof. Niessner, Prof. Leal-Taixé

SLIDE 14

Backprop: Backward Pass

Same setup. Start at the output: ∂f/∂f = 1.

14 I2DL: Prof. Niessner, Prof. Leal-Taixé

SLIDE 15

Backprop: Backward Pass

Same setup. ∂f/∂z = d = −2.

15 I2DL: Prof. Niessner, Prof. Leal-Taixé

SLIDE 16

Backprop: Backward Pass

Same setup. ∂f/∂d = z = 4.

16 I2DL: Prof. Niessner, Prof. Leal-Taixé

SLIDE 17

Backprop: Backward Pass

Same setup. Chain rule: ∂f/∂y = ∂f/∂d · ∂d/∂y  →  ∂f/∂y = 4 · 1 = 4.

17 I2DL: Prof. Niessner, Prof. Leal-Taixé

SLIDE 18

Backprop: Backward Pass

Same setup. Chain rule: ∂f/∂x = ∂f/∂d · ∂d/∂x  →  ∂f/∂x = 4 · 1 = 4.

18 I2DL: Prof. Niessner, Prof. Leal-Taixé

SLIDE 19

Compute Graphs -> Neural Networks

  • x_k : input variables
  • w_{l,m,n} : network weights (note the 3 indices)
    – l : which layer
    – m : which neuron in the layer
    – n : which weight of the neuron
  • ŷ_i : computed output (i = 1 … n_out output dimensions)
  • y_i : ground-truth targets
  • L : loss function

19 I2DL: Prof. Niessner, Prof. Leal-Taixé

SLIDE 20

Compute Graphs -> Neural Networks

[Figure: compute graph. Inputs x_0, x_1 are multiplied by the weights w_0, w_1 (unknowns!) and summed to give the output ŷ_0; the target y_0 (e.g., class label / regression target) is subtracted and the result squared: an L2 loss function giving the loss/cost]

20 I2DL: Prof. Niessner, Prof. Leal-Taixé

SLIDE 21

Compute Graphs -> Neural Networks

[Figure: same compute graph. Inputs x_0, x_1, weights w_0, w_1 (unknowns!), sum, then a ReLU activation max(0, x), then the L2 loss function against the target y_0]

We want to compute gradients w.r.t. all weights W

(btw. I'm not arguing this is the right choice here)

21 I2DL: Prof. Niessner, Prof. Leal-Taixé

SLIDE 22

Compute Graphs -> Neural Networks

[Figure: compute graph with three outputs ŷ_0, ŷ_1, ŷ_2. Inputs x_0, x_1; weights w_{0,0}, w_{0,1}, w_{1,0}, w_{1,1}, w_{2,0}, w_{2,1}; for each output a sum, a subtraction of the target y_0, y_1, y_2, and a square, giving one loss/cost per output]

We want to compute gradients w.r.t. all weights W

22 I2DL: Prof. Niessner, Prof. Leal-Taixé

SLIDE 23

Compute Graphs -> Neural Networks

[Figure: input layer x_0 … x_k, output layer ŷ_0, ŷ_1, …, with targets y_0, y_1, …]

ŷ_i = A(b_i + Σ_k x_k · w_{i,k})        (A: activation function, b_i: bias)

Goal: we want to compute gradients of the loss function L w.r.t. all weights W (AND all biases b)

L: sum over the loss per sample, e.g. L2 loss  →  simply sum up the squares:
L = Σ_i L_i,   L_i = (ŷ_i − y_i)²

∂L_i/∂w_{i,k} = ∂L_i/∂ŷ_i · ∂ŷ_i/∂w_{i,k}   →  use the chain rule to compute the partials

23 I2DL: Prof. Niessner, Prof. Leal-Taixé

SLIDE 24

NNs as Computational Graphs

  • We can express any kind of function as a computational graph, e.g.
    f(w, x) = 1 / (1 + e^(−(w_0·x_0 + w_1·x_1 + b)))

[Figure: graph with nodes w_0·x_0 and w_1·x_1, a sum, addition of b, then ·(−1), exp(·), +1, and finally 1/(·)]

Sigmoid function: σ(x) = 1 / (1 + e^(−x))

24 I2DL: Prof. Niessner, Prof. Leal-Taixé

SLIDE 25

NNs as Computational Graphs

  • f(w, x) = 1 / (1 + e^(−(w_0·x_0 + w_1·x_1 + b)))

Forward pass with w_0 = 2, x_0 = −1, w_1 = −3, x_1 = −2, b = −3:
w_0·x_0 = −2,  w_1·x_1 = 6,  sum = 4,  + b = 1,  ·(−1) = −1,  exp = 0.37,  +1 = 1.37,  1/(·) = 0.73

25 I2DL: Prof. Niessner, Prof. Leal-Taixé

SLIDE 26

NNs as Computational Graphs

Local derivative rules used in the backward pass:
f(x) = 1/x      ⇒  ∂f/∂x = −1/x²
f_a(x) = a + x  ⇒  ∂f/∂x = 1
f(x) = eˣ       ⇒  ∂f/∂x = eˣ
f_a(x) = a·x    ⇒  ∂f/∂x = a

The backward pass starts with gradient 1 at the output.
1/(·) node:  1 · (−1/1.37²) = −0.53

26 I2DL: Prof. Niessner, Prof. Leal-Taixé

SLIDE 27

NNs as Computational Graphs

+1 node:  −0.53 · 1 = −0.53

27 I2DL: Prof. Niessner, Prof. Leal-Taixé

SLIDE 28

NNs as Computational Graphs

exp(·) node:  −0.53 · e^(−1) = −0.2

28 I2DL: Prof. Niessner, Prof. Leal-Taixé

SLIDE 29

NNs as Computational Graphs

·(−1) node:  −0.2 · (−1) = 0.2

29 I2DL: Prof. Niessner, Prof. Leal-Taixé

SLIDE 30

NNs as Computational Graphs

The add nodes pass the gradient 0.2 through unchanged; the multiply nodes swap their inputs:
∂f/∂w_0 = 0.2 · x_0 = −0.2,   ∂f/∂x_0 = 0.2 · w_0 = 0.4
∂f/∂w_1 = 0.2 · x_1 = −0.4,   ∂f/∂x_1 = 0.2 · w_1 = −0.6
∂f/∂b = 0.2

30 I2DL: Prof. Niessner, Prof. Leal-Taixé

SLIDE 31

Gradient Descent

31 I2DL: Prof. Niessner, Prof. Leal-Taixé

SLIDE 32

Gradient Descent

[Figure: loss surface with an initialization point and the optimum]

x* = arg min f(x)

32 I2DL: Prof. Niessner, Prof. Leal-Taixé

SLIDE 33

Gradient Descent

  • From derivative to gradient:  df(x)/dx  →  ∇_x f(x)
  • Gradient steps in the direction of the negative gradient

∇_x f(x) : direction of greatest increase of the function

x' = x − α ∇_x f(x)        (α: learning rate)

33 I2DL: Prof. Niessner, Prof. Leal-Taixé
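A minimal sketch of the update rule x' = x − α ∇f(x) on a toy 1-D function; both the function f(x) = x² and the learning rate below are illustrative choices, not values from the slides:

```python
def grad_f(x):
    return 2.0 * x          # derivative of f(x) = x**2

x, alpha = 5.0, 0.1         # initialization and learning rate
for _ in range(50):
    x = x - alpha * grad_f(x)   # x' = x - alpha * grad f(x)
print(x)                    # close to the optimum x* = 0
```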

SLIDE 34

Gradient Descent for Neural Networks

[Figure: input layer, hidden layers 1-3, output layer; n neurons per layer, m layers]

34 I2DL: Prof. Niessner, Prof. Leal-Taixé

SLIDE 35

Gradient Descent for Neural Networks

For a given training pair {x, y}, we want to update all weights, i.e., we need to compute the derivatives w.r.t. all weights:

∇_W f_{x,y}(W) = [ ∂f/∂w_{0,0,0}  …  ∂f/∂w_{l,m,n} ]

Gradient step:
W' = W − α ∇_W f_{x,y}(W)

[Figure: input layer, hidden layers 1-3, output layer; m layers, n neurons]

35 I2DL: Prof. Niessner, Prof. Leal-Taixé

SLIDE 36

NNs can Become Quite Complex…

  • These graphs can be huge!

[Szegedy et al., CVPR'15] Going Deeper with Convolutions

36 I2DL: Prof. Niessner, Prof. Leal-Taixé

SLIDE 37

The Flow of the Gradients

  • Many, many, many, many of these nodes form a neural network
  • Each one has its own work to do

[Figure: neurons, forward and backward pass]

37 I2DL: Prof. Niessner, Prof. Leal-Taixé

SLIDE 38

The Flow of the Gradients

[Figure: a single compute node f with inputs x, y and output z = f(x, y); activations flow forward, gradients flow backward]

∂L/∂x = ∂L/∂z · ∂z/∂x        (upstream gradient ∂L/∂z, local gradients ∂z/∂x, ∂z/∂y)

38 I2DL: Prof. Niessner, Prof. Leal-Taixé

SLIDE 39

Gradient Descent for Neural Networks

[Figure: network with inputs x_0, x_1, x_2, hidden units h_0 … h_3, outputs ŷ_0, ŷ_1, targets y_0, y_1]

h_j = A(b_{0,j} + Σ_k x_k · w_{0,j,k})
ŷ_i = A(b_{1,i} + Σ_j h_j · w_{1,i,j})

Loss function: L_i = (ŷ_i − y_i)²
Just a simple activation: A(x) = max(0, x)

39 I2DL: Prof. Niessner, Prof. Leal-Taixé

SLIDE 40

Gradient Descent for Neural Networks

[Figure: same network with inputs x_0, x_1, x_2, hidden units h_0 … h_3, outputs ŷ_0, ŷ_1, targets y_0, y_1]

ŷ_i = A(b_{1,i} + Σ_j h_j · w_{1,i,j}),   h_j = A(b_{0,j} + Σ_k x_k · w_{0,j,k}),   L_i = (ŷ_i − y_i)²

Backpropagation: just go through it layer by layer

∂L_i/∂w_{1,i,j} = ∂L_i/∂ŷ_i · ∂ŷ_i/∂w_{1,i,j}
∂L_i/∂w_{0,j,k} = ∂L_i/∂ŷ_i · ∂ŷ_i/∂h_j · ∂h_j/∂w_{0,j,k}

∂L_i/∂ŷ_i = 2(ŷ_i − y_i)
∂ŷ_i/∂w_{1,i,j} = h_j if the pre-activation > 0, else 0

…

40 I2DL: Prof. Niessner, Prof. Leal-Taixé
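A small NumPy sketch of this layer-by-layer backward pass for the 3-4-2 network above. The input values are made up; the gradient formulas follow the slide, with a boolean ReLU mask playing the role of the "if the pre-activation > 0" case:

```python
import numpy as np

rng = np.random.default_rng(0)
x  = rng.standard_normal(3)                            # inputs x_k
W0 = rng.standard_normal((4, 3)); b0 = np.zeros(4)     # w_{0,j,k}, b_{0,j}
W1 = rng.standard_normal((2, 4)); b1 = np.zeros(2)     # w_{1,i,j}, b_{1,i}
y  = rng.standard_normal(2)                            # targets y_i

# Forward: h_j = A(b_{0,j} + sum_k x_k w_{0,j,k}),  ŷ_i = A(b_{1,i} + sum_j h_j w_{1,i,j})
s0 = W0 @ x + b0;  h     = np.maximum(0, s0)
s1 = W1 @ h + b1;  y_hat = np.maximum(0, s1)
L  = np.sum((y_hat - y) ** 2)

# Backward, layer by layer
dL_dyhat = 2 * (y_hat - y)            # ∂L_i/∂ŷ_i
dL_ds1   = dL_dyhat * (s1 > 0)        # ReLU: gradient passes only where pre-activation > 0
dL_dW1   = np.outer(dL_ds1, h)        # ∂L/∂w_{1,i,j}
dL_dh    = W1.T @ dL_ds1
dL_ds0   = dL_dh * (s0 > 0)
dL_dW0   = np.outer(dL_ds0, x)        # ∂L/∂w_{0,j,k}
print(L, dL_dW1.shape, dL_dW0.shape)
```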

SLIDE 41

Gradient Descent for Neural Networks

[Figure: same network with inputs x_0, x_1, x_2, hidden units h_0 … h_3, outputs ŷ_0, ŷ_1, targets y_0, y_1]

ŷ_i = A(b_{1,i} + Σ_j h_j · w_{1,i,j}),   h_j = A(b_{0,j} + Σ_k x_k · w_{0,j,k}),   L_i = (ŷ_i − y_i)²

How many unknown weights?
  • Output layer: 2 · 4 + 2
  • Hidden layer: 4 · 3 + 4

#neurons · #input channels + #biases. Note that some activations also have weights.

41 I2DL: Prof. Niessner, Prof. Leal-Taixé

SLIDE 42

Derivatives of Cross Entropy Loss

[Figure: network with inputs x_0, x_1, x_2, hidden units h_0 … h_3, outputs ŷ_0, ŷ_1, targets y_0, y_1]

Binary cross-entropy loss:
L = − Σ_{i=1}^{n_out} ( y_i · log ŷ_i + (1 − y_i) · log(1 − ŷ_i) )

ŷ_i = 1 / (1 + e^(−s_i)),   s_i = Σ_j h_j · w_{j,i}      (s_i: output scores)

Gradients of the weights of the last layer:
∂L_i/∂w_{j,i} = ∂L_i/∂ŷ_i · ∂ŷ_i/∂s_i · ∂s_i/∂w_{j,i}

∂L_i/∂ŷ_i = −y_i/ŷ_i + (1 − y_i)/(1 − ŷ_i) = (ŷ_i − y_i) / (ŷ_i·(1 − ŷ_i))
∂ŷ_i/∂s_i = ŷ_i·(1 − ŷ_i),   ∂s_i/∂w_{j,i} = h_j

⟹  ∂L_i/∂s_i = ŷ_i − y_i   and   ∂L_i/∂w_{j,i} = (ŷ_i − y_i) · h_j

42 I2DL: Prof. Niessner, Prof. Leal-Taixé
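The compact result ∂L_i/∂w_{j,i} = (ŷ_i − y_i) · h_j is easy to check numerically; a minimal sketch with made-up values, comparing the analytic gradient against central finite differences:

```python
import numpy as np

def loss(W, h, y):
    s = h @ W                                  # scores s_i = sum_j h_j * w_{j,i}
    y_hat = 1.0 / (1.0 + np.exp(-s))           # sigmoid
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

rng = np.random.default_rng(0)
h = rng.standard_normal(4)                     # hidden activations h_j
W = rng.standard_normal((4, 2))                # last-layer weights w_{j,i}
y = np.array([1.0, 0.0])                       # binary targets

y_hat = 1.0 / (1.0 + np.exp(-(h @ W)))
analytic = np.outer(h, y_hat - y)              # (ŷ_i − y_i) · h_j

numeric = np.zeros_like(W); eps = 1e-6
for idx in np.ndindex(*W.shape):
    Wp, Wm = W.copy(), W.copy()
    Wp[idx] += eps; Wm[idx] -= eps
    numeric[idx] = (loss(Wp, h, y) - loss(Wm, h, y)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))   # True
```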

SLIDE 43

Derivatives of Cross Entropy Loss

Gradients of the weights of the first layer:

∂L/∂h_j = Σ_{i=1}^{n_out} ∂L/∂ŷ_i · ∂ŷ_i/∂s_i · ∂s_i/∂h_j
        = Σ_{i=1}^{n_out} ∂L/∂ŷ_i · ŷ_i(1 − ŷ_i) · w_{j,i}
        = Σ_{i=1}^{n_out} (ŷ_i − y_i) · w_{j,i}

∂L/∂s_j⁽¹⁾ = Σ_{i=1}^{n_out} ∂L/∂s_i · ∂s_i/∂h_j · ∂h_j/∂s_j⁽¹⁾
           = Σ_{i=1}^{n_out} (ŷ_i − y_i) · w_{j,i} · h_j(1 − h_j)

∂L/∂w_{k,j}⁽¹⁾ = Σ_{i=1}^{n_out} ∂L/∂s_j⁽¹⁾ · ∂s_j⁽¹⁾/∂w_{k,j}⁽¹⁾
              = Σ_{i=1}^{n_out} (ŷ_i − y_i) · w_{j,i} · h_j(1 − h_j) · x_k

43 I2DL: Prof. Niessner, Prof. Leal-Taixé

SLIDE 44

Back to Compute Graphs & NNs

  • Inputs x and targets y
  • Two-layer NN for regression with ReLU activation
  • Function we want to optimize:

Σ_{i=1}^n ( w_2 · max(0, w_1 · x_i) − y_i )²

[Figure: compute graph. x · w_1 = z, then the ReLU output σ = max(0, z), then σ · w_2 = ŷ, compared with the target y in the loss L]

44 I2DL: Prof. Niessner, Prof. Leal-Taixé

SLIDE 45

Gradient Descent for Neural Networks

Initialize x = 1, y = 0, w_1 = 1/3, w_2 = 2

L(y, ŷ; θ) = (1/n) Σ_i ‖ŷ_i − y_i‖²₂

In our case n = d = 1:  L = (ŷ − y)²  ⇒  ∂L/∂ŷ = 2(ŷ − y)
ŷ = w_2 · σ  ⇒  ∂ŷ/∂w_2 = σ

Forward pass: z = x · w_1 = 1/3,  σ = max(0, z) = 1/3,  ŷ = w_2 · σ = 2/3,  L = (2/3 − 0)² = 4/9

Backpropagation: ∂L/∂w_2 = ∂L/∂ŷ · ∂ŷ/∂w_2

45 I2DL: Prof. Niessner, Prof. Leal-Taixé

SLIDE 46

Gradient Descent for Neural Networks

Same setup. ∂L/∂ŷ = 2 · 2/3 = 4/3

46 I2DL: Prof. Niessner, Prof. Leal-Taixé

SLIDE 47

Gradient Descent for Neural Networks

Same setup. ∂L/∂w_2 = ∂L/∂ŷ · ∂ŷ/∂w_2 = 4/3 · 1/3 = 4/9

47 I2DL: Prof. Niessner, Prof. Leal-Taixé

SLIDE 48

Gradient Descent for Neural Networks

Same setup. Now for w_1:

Backpropagation: ∂L/∂w_1 = ∂L/∂ŷ · ∂ŷ/∂σ · ∂σ/∂z · ∂z/∂w_1

ŷ = w_2 · σ     ⇒  ∂ŷ/∂σ = w_2
σ = max(0, z)   ⇒  ∂σ/∂z = 1 if z > 0, else 0
z = x · w_1     ⇒  ∂z/∂w_1 = x

48 I2DL: Prof. Niessner, Prof. Leal-Taixé

SLIDE 49

Gradient Descent for Neural Networks

Same setup. ∂L/∂ŷ = 2 · 2/3 = 4/3

49 I2DL: Prof. Niessner, Prof. Leal-Taixé

SLIDE 50

Gradient Descent for Neural Networks

Same setup. ∂L/∂σ = 4/3 · w_2 = 4/3 · 2 = 8/3

50 I2DL: Prof. Niessner, Prof. Leal-Taixé

SLIDE 51

Gradient Descent for Neural Networks

Same setup. ∂L/∂z = 8/3 · 1 = 8/3    (the ReLU passes the gradient, since z = 1/3 > 0)

51 I2DL: Prof. Niessner, Prof. Leal-Taixé

SLIDE 52

Gradient Descent for Neural Networks

Same setup. ∂L/∂w_1 = 8/3 · x = 8/3 · 1 = 8/3

52 I2DL: Prof. Niessner, Prof. Leal-Taixé

SLIDE 53

Gradient Descent for Neural Networks

  • Function we want to optimize:  f(x, w) = Σ_{i=1}^n ( w_2 · max(0, w_1 · x_i) − y_i )²
  • Computed gradients w.r.t. the weights w_1 and w_2
  • Now: update the weights

w' = w − α · ∇_w f  =  [w_1, w_2] − α · [∂f/∂w_1, ∂f/∂w_2]  =  [1/3, 2] − α · [8/3, 4/9]

But: how do we choose a good learning rate α?

53 I2DL: Prof. Niessner, Prof. Leal-Taixé
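The worked example of slides 45-53 in a few lines of Python; the learning rate below is an arbitrary placeholder, since how to choose it is exactly the open question on the slide:

```python
# Two-layer ReLU regression example from the slides: x = 1, y = 0, w1 = 1/3, w2 = 2
x, y = 1.0, 0.0
w1, w2 = 1.0 / 3.0, 2.0

# Forward pass
z     = x * w1               # 1/3
sigma = max(0.0, z)          # 1/3 (ReLU)
y_hat = w2 * sigma           # 2/3
L     = (y_hat - y) ** 2     # 4/9

# Backward pass
dL_dyhat = 2.0 * (y_hat - y)                    # 4/3
dL_dw2   = dL_dyhat * sigma                     # 4/9
dL_dsig  = dL_dyhat * w2                        # 8/3
dL_dz    = dL_dsig * (1.0 if z > 0 else 0.0)    # 8/3
dL_dw1   = dL_dz * x                            # 8/3

# Gradient step (alpha chosen arbitrarily here)
alpha = 0.1
w1, w2 = w1 - alpha * dL_dw1, w2 - alpha * dL_dw2
print(L, dL_dw1, dL_dw2, w1, w2)
```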

SLIDE 54

Gradient Descent

  • How to pick a good learning rate?
  • How to compute the gradient for a single training pair?
  • How to compute the gradient for a large training set?
  • How to speed things up? More to see in the next lectures…

54 I2DL: Prof. Niessner, Prof. Leal-Taixé

SLIDE 55

Regularization

55 I2DL: Prof. Niessner, Prof. Leal-Taixé

SLIDE 56

Recap: Basic Recipe for ML

  • Split your data

[Figure: data split into train 60%, validation 20%, test 20%; the validation set is used to find your hyperparameters]

Other splits are also possible (e.g., 80%/10%/10%)

56 I2DL: Prof. Niessner, Prof. Leal-Taixé

SLIDE 57

Over- and Underfitting

[Figure: three fits, labeled underfitted, appropriate, overfitted]

Source: Deep Learning by Adam Gibson, Josh Patterson, O'Reilly Media Inc., 2017

57 I2DL: Prof. Niessner, Prof. Leal-Taixé

SLIDE 58

Training a Neural Network

  • Training / validation curve

[Figure: training vs. validation error curves, annotated where the training error is too high and where the generalization gap is too big]

How can we prevent our model from overfitting? Regularization

Credits: Deep Learning, Goodfellow et al.

58 I2DL: Prof. Niessner, Prof. Leal-Taixé

SLIDE 59

Regularization

  • Loss function: L(y, ŷ; θ) = Σ_{i=1}^n (ŷ_i − y_i)² + λ·R(θ)
    (add a regularization term to the loss function)

  • Regularization techniques
    – L2 regularization
    – L1 regularization
    – Max norm regularization
    – Dropout
    – Early stopping
    – ...

More details later

59 I2DL: Prof. Niessner, Prof. Leal-Taixé

SLIDE 60

Regularization: Example

  • Input: 3 features x = [1, 2, 1]
  • Two linear classifiers that give the same result:
    – θ_1 = [0, 0.75, 0]          (ignores 2 features)
    – θ_2 = [0.25, 0.5, 0.25]     (takes information from all features)

60 I2DL: Prof. Niessner, Prof. Leal-Taixé

SLIDE 61

Regularization: Example

  • Loss: L(y, ŷ; θ) = Σ_i (x_i·θ_i − y_i)² + λ·R(θ)
  • L2 regularization: R(θ) = Σ_j θ_j²

x = [1, 2, 1],  θ_1 = [0, 0.75, 0],  θ_2 = [0.25, 0.5, 0.25]

θ_1:  0 + 0.75² + 0 = 0.5625
θ_2:  0.25² + 0.5² + 0.25² = 0.375    ← minimization prefers θ_2

61 I2DL: Prof. Niessner, Prof. Leal-Taixé

SLIDE 62

Regularization: Example

  • Loss: L(y, ŷ; θ) = Σ_i (x_i·θ_i − y_i)² + λ·R(θ)
  • L1 regularization: R(θ) = Σ_j |θ_j|

x = [1, 2, 1],  θ_1 = [0, 0.75, 0],  θ_2 = [0.25, 0.5, 0.25]

θ_1:  0 + 0.75 + 0 = 0.75    ← minimization prefers θ_1
θ_2:  0.25 + 0.5 + 0.25 = 1

62 I2DL: Prof. Niessner, Prof. Leal-Taixé
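A tiny numeric check of the example on slides 60-62; the weight vectors and input are taken from the slides:

```python
import numpy as np

theta1 = np.array([0.0, 0.75, 0.0])
theta2 = np.array([0.25, 0.5, 0.25])
x = np.array([1.0, 2.0, 1.0])

print(x @ theta1, x @ theta2)                          # 1.5 1.5   -> same prediction
print(np.sum(theta1**2), np.sum(theta2**2))            # 0.5625 0.375 -> L2 prefers theta2
print(np.sum(np.abs(theta1)), np.sum(np.abs(theta2)))  # 0.75 1.0     -> L1 prefers theta1
```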

SLIDE 63

Regularization: Example

  • Input: 3 features x = [1, 2, 1]
  • Two linear classifiers that give the same result:
    – θ_1 = [0, 0.75, 0]          (ignores 2 features)
    – θ_2 = [0.25, 0.5, 0.25]     (takes information from all features)

63 I2DL: Prof. Niessner, Prof. Leal-Taixé

SLIDE 64

Regularization: Example

  • Input: 3 features x = [1, 2, 1]
  • Two linear classifiers that give the same result:
    – θ_1 = [0, 0.75, 0]          (L1 regularization enforces sparsity)
    – θ_2 = [0.25, 0.5, 0.25]     (takes information from all features)

64 I2DL: Prof. Niessner, Prof. Leal-Taixé

SLIDE 65

Regularization: Example

  • Input: 3 features x = [1, 2, 1]
  • Two linear classifiers that give the same result:
    – θ_1 = [0, 0.75, 0]          (L1 regularization enforces sparsity)
    – θ_2 = [0.25, 0.5, 0.25]     (L2 regularization enforces that the weights have similar values)

65 I2DL: Prof. Niessner, Prof. Leal-Taixé

SLIDE 66

Regularization: Effect

  • A dog classifier takes different inputs: furry, has two eyes, has a tail, has paws, has two ears

L1 regularization will focus all the attention on a few key features

66 I2DL: Prof. Niessner, Prof. Leal-Taixé

SLIDE 67

Regularization: Effect

  • A dog classifier takes different inputs: furry, has two eyes, has a tail, has paws, has two ears

L2 regularization will take all information into account to make decisions

67 I2DL: Prof. Niessner, Prof. Leal-Taixé

SLIDE 68

Regularization for Neural Networks

Combining nodes: network output + L2 loss + regularization

[Figure: compute graph. x·w_1, ReLU max(0,·), ·w_2, L2 loss against y, plus a regularization node λ·R(w_1, w_2) added to the loss L]

Σ_{i=1}^n ( w_2 · max(0, w_1 · x_i) − y_i )² + λ·R(w_1, w_2)

68 I2DL: Prof. Niessner, Prof. Leal-Taixé

SLIDE 69

Regularization for Neural Networks

Combining nodes: network output + L2 loss + regularization

Σ_{i=1}^n ( w_2 · max(0, w_1 · x_i) − y_i )² + λ·‖[w_1, w_2]‖²₂

69 I2DL: Prof. Niessner, Prof. Leal-Taixé

SLIDE 70

Regularization for Neural Networks

Combining nodes: network output + L2 loss + regularization

Σ_{i=1}^n ( w_2 · max(0, w_1 · x_i) − y_i )² + λ·(w_1² + w_2²)

70 I2DL: Prof. Niessner, Prof. Leal-Taixé

SLIDE 71

Regularization

[Figure: decision boundaries for increasing regularization strength, λ = 0, λ = 0.00001, λ = 0.001, λ = 1, λ = 10]

What is the goal of regularization? What happens to the training error?

Credits: University of Washington

71 I2DL: Prof. Niessner, Prof. Leal-Taixé

SLIDE 72

Regularization

  • Any strategy that aims to lower the validation error, possibly at the cost of increasing the training error

72 I2DL: Prof. Niessner, Prof. Leal-Taixé

SLIDE 73

Next Lecture

  • This week:
    – Check exercises
    – Check office hours ☺

  • Next lecture:
    – Optimization of Neural Networks
    – In particular, introduction to SGD (our main method!)

73 I2DL: Prof. Niessner, Prof. Leal-Taixé

SLIDE 74

See you next week ☺

74 I2DL: Prof. Niessner, Prof. Leal-Taixé

SLIDE 75

Further Reading

  • Backpropagation
    – Chapter 6.5 (6.5.1 - 6.5.3) in http://www.deeplearningbook.org/contents/mlp.html
    – Chapter 5.3 in Bishop, Pattern Recognition and Machine Learning
    – http://cs231n.github.io/optimization-2/

  • Regularization
    – Chapter 7.1 (esp. 7.1.1 & 7.1.2) in http://www.deeplearningbook.org/contents/regularization.html
    – Chapter 5.5 in Bishop, Pattern Recognition and Machine Learning

75 I2DL: Prof. Niessner, Prof. Leal-Taixé