

SLIDE 1

Neural Networks

Part 2

Yingyu Liang yliang@cs.wisc.edu Computer Sciences Department University of Wisconsin, Madison

[Based on slides from Jerry Zhu, Mohit Gupta]

SLIDE 2

Limited power of a single neuron

  • Perceptron: 𝑏 = 𝑕(σ𝑒 π‘₯𝑒𝑦𝑒)
  • Activation function 𝑕: linear, step, sigmoid

[Figure: a perceptron with bias input $1$ (weight $w_0$) and inputs $x_1, \dots, x_D$ (weights $w_1, \dots, w_D$); output $a = g\left(\sum_d w_d x_d\right)$]

SLIDE 3

Limited power of a single neuron

  • Perceptron: 𝑏 = 𝑕(σ𝑒 π‘₯𝑒𝑦𝑒)
  • Activation function 𝑕: linear, step, sigmoid
  • Decision boundary linear even for nonlinear 𝑕
  • XOR problem

[Figure: a perceptron with bias input $1$ (weight $w_0$) and inputs $x_1, \dots, x_D$ (weights $w_1, \dots, w_D$); output $a = g\left(\sum_d w_d x_d\right)$]

SLIDE 4

Limited power of a single neuron

  • XOR problem
  • Wait! If one can represent AND, OR, NOT, one can

represent any logic circuit (including XOR), by connecting them

Question: how to?
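As one possible construction (a sketch of my own, not from the slides): step-activation perceptrons with hand-picked weights can realize AND, OR, and NOT, and wiring them together gives XOR:

```python
import numpy as np

def step(z):
    # step activation: 1 if z >= 0, else 0
    return 1.0 if z >= 0 else 0.0

def perceptron(x, w, b):
    # a single unit: a = g(sum_d w_d x_d + b) with a step activation g
    return step(np.dot(w, x) + b)

# Hand-picked weights (my choices, not from the slides) realizing the gates:
AND = lambda u, v: perceptron([u, v], [1.0, 1.0], -1.5)
OR  = lambda u, v: perceptron([u, v], [1.0, 1.0], -0.5)
NOT = lambda u:    perceptron([u],    [-1.0],     0.5)

def XOR(u, v):
    # XOR(u, v) = AND( OR(u, v), NOT(AND(u, v)) )
    return AND(OR(u, v), NOT(AND(u, v)))

for u, v in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((u, v), XOR(u, v))   # prints 0.0, 1.0, 1.0, 0.0
```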

SLIDE 5

Multi-layer neural networks

  • Standard way to connect Perceptrons
  • Example: 1 hidden layer, 1 output layer

[Figure: network with inputs $x_1, x_2$; Layer 1 (input), Layer 2 (hidden), Layer 3 (output)]

SLIDE 6

Multi-layer neural networks

  • Standard way to connect Perceptrons
  • Example: 1 hidden layer, 1 output layer

π‘₯11

(2)

𝑏1

2 = 𝑕

෍

𝑒

𝑦𝑒π‘₯1𝑒

(2)

𝑦2 𝑦1

π‘₯12

(2)

SLIDE 7

Multi-layer neural networks

  • Standard way to connect Perceptrons
  • Example: 1 hidden layer, 1 output layer

π‘₯11

(2)

π‘₯21

(2)

π‘₯12

(2)

π‘₯22

(2)

𝑏1

2 = 𝑕

෍

𝑒

𝑦𝑒π‘₯1𝑒

(2)

𝑏2

2 = 𝑕

෍

𝑒

𝑦𝑒π‘₯2𝑒

(2)

𝑦2 𝑦1

SLIDE 8

Multi-layer neural networks

  • Standard way to connect Perceptrons
  • Example: 1 hidden layer, 1 output layer

π‘₯11

(2)

π‘₯21

(2)

π‘₯31

(2)

π‘₯12

(2)

π‘₯22

(2)

π‘₯32

(2)

𝑏1

2 = 𝑕

෍

𝑒

𝑦𝑒π‘₯1𝑒

(2)

𝑏2

2 = 𝑕

෍

𝑒

𝑦𝑒π‘₯2𝑒

(2)

𝑏3

2 = 𝑕

෍

𝑒

𝑦𝑒π‘₯3𝑒

(2)

𝑦2 𝑦1

SLIDE 9

Multi-layer neural networks

  • Standard way to connect Perceptrons
  • Example: 1 hidden layer, 1 output layer

π‘₯11

(2)

π‘₯21

(2)

π‘₯31

(2)

π‘₯12

(2)

π‘₯22

(2)

π‘₯32

(2)

𝑏1

2 = 𝑕

෍

𝑒

𝑦𝑒π‘₯1𝑒

(2)

𝑏2

2 = 𝑕

෍

𝑒

𝑦𝑒π‘₯2𝑒

(2)

𝑏3

2 = 𝑕

෍

𝑒

𝑦𝑒π‘₯3𝑒

(2)

π‘₯1

(3)

π‘₯2

(3)

π‘₯3

(3)

𝑏 = 𝑕 ෍

𝑗

𝑏𝑗

2 π‘₯𝑗 (3)

𝑦2 𝑦1
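A minimal forward-pass sketch of this 1-hidden-layer network (the sigmoid choice, array shapes, and values are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W2, w3):
    # Layer 2 (hidden): a_j^(2) = g(sum_d w_jd^(2) x_d)
    a2 = sigmoid(W2 @ x)
    # Layer 3 (output): a = g(sum_j w_j^(3) a_j^(2))
    return sigmoid(w3 @ a2)

rng = np.random.default_rng(0)
x  = np.array([0.5, -1.0])      # inputs x_1, x_2
W2 = rng.normal(size=(3, 2))    # hidden weights w_jd^(2)
w3 = rng.normal(size=3)         # output weights w_j^(3)
print(forward(x, W2, w3))
```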

SLIDE 10

Neural net for $K$-way classification

  • Use 𝐿 output units
  • Training: encode a label 𝑧 by an indicator vector

β–ͺ class1=(1,0,0,…,0), class2=(0,1,0,…,0) etc.

  • Test: choose the class corresponding to the largest
  • utput unit

π‘₯11

(3)

π‘₯12

(3)

π‘₯13

(3)

𝑏1 = 𝑕 ෍

𝑗

𝑏𝑗

2 π‘₯1𝑗 (3)

𝑦2 𝑦1

…

SLIDE 11

Neural net for $K$-way classification

  • Use 𝐿 output units
  • Training: encode a label 𝑧 by an indicator vector

β–ͺ class1=(1,0,0,…,0), class2=(0,1,0,…,0) etc.

  • Test: choose the class corresponding to the largest
  • utput unit

π‘₯11

(3)

π‘₯12

(3)

π‘₯13

(3)

𝑏1 = 𝑕 ෍

𝑗

𝑏𝑗

2 π‘₯1𝑗 (3)

𝑦2 𝑦1

𝑏𝐿 = 𝑕 ෍

𝑗

𝑏𝑗

2 π‘₯𝐿𝑗 (3)

π‘₯𝐿1

(3)

π‘₯𝐿2

(3)

π‘₯𝐿3

(3)

…
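A small sketch of the indicator-vector encoding and the test-time rule (function names are mine, for illustration):

```python
import numpy as np

def one_hot(label, K):
    # encode class `label` (0-based) as an indicator vector of length K
    y = np.zeros(K)
    y[label] = 1.0
    return y

def predict(a):
    # a: the K output-unit activations; pick the class of the largest one
    return int(np.argmax(a))

print(one_hot(1, 4))                              # [0. 1. 0. 0.]
print(predict(np.array([0.1, 0.7, 0.15, 0.05])))  # 1
```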

SLIDE 12

The (unlimited) power of neural networks

  • In theory

β–ͺ we don’t need too many layers: β–ͺ 1-hidden-layer net with enough hidden units can represent any continuous function of the inputs with arbitrary accuracy β–ͺ 2-hidden-layer net can even represent discontinuous functions

SLIDE 13

Learning in neural networks

  • Again we will minimize the error (𝐿 outputs):
  • 𝑦: one training point in the training set 𝐸
  • 𝑏𝑑: the 𝑑-th output for the training point 𝑦
  • 𝑧𝑑: the 𝑑-th element of the label indicator vector for 𝑦

$$E = \frac{1}{2} \sum_{x \in D} E_x, \qquad E_x = \| y - a \|^2 = \sum_{k=1}^{K} (a_k - y_k)^2$$

[Figure: network with inputs $x_1, x_2$; outputs $a_1, \dots, a_K$ compared to the label indicator vector $y$]
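In code, the per-point cost $E_x$ and the total cost $E$ might look like this (a sketch; the names are mine):

```python
import numpy as np

def point_cost(a, y):
    # E_x = ||y - a||^2 = sum_k (a_k - y_k)^2
    return np.sum((a - y) ** 2)

def total_cost(outputs, targets):
    # E = (1/2) * sum over the training set of E_x
    return 0.5 * sum(point_cost(a, y) for a, y in zip(outputs, targets))

a = np.array([0.8, 0.1, 0.1])   # network outputs for one point
y = np.array([1.0, 0.0, 0.0])   # its indicator-vector label
print(point_cost(a, y))         # 0.06
```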

SLIDE 14

Learning in neural networks

  • Again we will minimize the error (𝐿 outputs):
  • 𝑦: one training point in the training set 𝐸
  • 𝑏𝑑: the 𝑑-th output for the training point 𝑦
  • 𝑧𝑑: the 𝑑-th element of the label indicator vector for 𝑦
  • Our variables are all the weights π‘₯ on all the edges

β–ͺ Apparent difficulty: we don’t know the β€˜correct’

  • utput of hidden units

$$E = \frac{1}{2} \sum_{x \in D} E_x, \qquad E_x = \| y - a \|^2 = \sum_{k=1}^{K} (a_k - y_k)^2$$

SLIDE 15

Learning in neural networks

  • Again we will minimize the error (𝐿 outputs):
  • 𝑦: one training point in the training set 𝐸
  • 𝑏𝑑: the 𝑑-th output for the training point 𝑦
  • 𝑧𝑑: the 𝑑-th element of the label indicator vector for 𝑦
  • Our variables are all the weights π‘₯ on all the edges

β–ͺ Apparent difficulty: we don’t know the β€˜correct’

  • utput of hidden units

β–ͺ It turns out to be OK: we can still do gradient

  • descent. The trick you need is the chain rule

β–ͺ The algorithm is known as back-propagation 𝐹 = 1 2 ෍

π‘¦βˆˆπΈ

𝐹𝑦 , 𝐹𝑦 = 𝑧 βˆ’ 𝑏 2 = ෍

𝑑=1 𝐿

𝑏𝑑 βˆ’ 𝑧𝑑 2

SLIDE 16

Gradient (on one data point)

𝑦2 𝑦1 π‘₯11

(4)

Layer (4) Layer (3) Layer (2) Layer (1)

𝐹𝑦

want to compute

πœ–πΉπ‘¦ πœ–π‘₯11

4

SLIDE 17

Gradient (on one data point)

[Figure: same network; outputs $a_1, a_2$, cost $E_x = \| y - a \|^2$, weight $w_{11}^{(4)}$ feeding output unit $a_1$]

SLIDE 18

Gradient (on one data point)

[Figure: same network, now showing the net input $z_1^{(4)} = w_{11}^{(4)} a_1^{(3)} + w_{12}^{(4)} a_2^{(3)}$ and the output $a_1 = g\!\left(z_1^{(4)}\right)$]

SLIDE 19

Gradient (on one data point)

[Figure: same network; Layer 3 activations $a_1^{(3)}, a_2^{(3)}$ feed $z_1^{(4)} = w_{11}^{(4)} a_1^{(3)} + w_{12}^{(4)} a_2^{(3)}$ through weights $w_{11}^{(4)}, w_{12}^{(4)}$]

SLIDE 20

Gradient (on one data point)

[Figure: same network, with $E_x = \| y - a \|^2$, $z_1^{(4)} = w_{11}^{(4)} a_1^{(3)} + w_{12}^{(4)} a_2^{(3)}$, $a_1 = g\!\left(z_1^{(4)}\right)$]

By Chain Rule:
$$\frac{\partial E_x}{\partial w_{11}^{(4)}} = \frac{\partial E_x}{\partial a_1} \cdot \frac{\partial a_1}{\partial z_1^{(4)}} \cdot \frac{\partial z_1^{(4)}}{\partial w_{11}^{(4)}}$$
$$\frac{\partial E_x}{\partial a_1} = 2\,(a_1 - y_1), \qquad \frac{\partial a_1}{\partial z_1^{(4)}} = g'\!\left(z_1^{(4)}\right)$$

SLIDE 21

Gradient (on one data point)

[Figure: same network as above]

$$\frac{\partial E_x}{\partial w_{11}^{(4)}} = 2\,(a_1 - y_1)\, g'\!\left(z_1^{(4)}\right) \frac{\partial z_1^{(4)}}{\partial w_{11}^{(4)}}$$

By Chain Rule:
$$\frac{\partial E_x}{\partial a_1} = 2\,(a_1 - y_1), \qquad \frac{\partial a_1}{\partial z_1^{(4)}} = g'\!\left(z_1^{(4)}\right)$$

SLIDE 22

Gradient (on one data point)

[Figure: same network as above]

Since $\partial z_1^{(4)} / \partial w_{11}^{(4)} = a_1^{(3)}$:
$$\frac{\partial E_x}{\partial w_{11}^{(4)}} = 2\,(a_1 - y_1)\, g'\!\left(z_1^{(4)}\right) a_1^{(3)}$$

By Chain Rule:
$$\frac{\partial E_x}{\partial a_1} = 2\,(a_1 - y_1), \qquad \frac{\partial a_1}{\partial z_1^{(4)}} = g'\!\left(z_1^{(4)}\right)$$

SLIDE 23

Gradient (on one data point)

[Figure: same network as above]

For a sigmoid activation, $g'(z) = g(z)\,(1 - g(z))$, so
$$\frac{\partial E_x}{\partial w_{11}^{(4)}} = 2\,(a_1 - y_1)\, g\!\left(z_1^{(4)}\right)\left(1 - g\!\left(z_1^{(4)}\right)\right) a_1^{(3)}$$

By Chain Rule:
$$\frac{\partial E_x}{\partial a_1} = 2\,(a_1 - y_1), \qquad \frac{\partial a_1}{\partial z_1^{(4)}} = g'\!\left(z_1^{(4)}\right)$$

SLIDE 24

Gradient (on one data point)

[Figure: same network as above]

Since $a_1 = g\!\left(z_1^{(4)}\right)$:
$$\frac{\partial E_x}{\partial w_{11}^{(4)}} = 2\,(a_1 - y_1)\, a_1 (1 - a_1)\, a_1^{(3)}$$

By Chain Rule:
$$\frac{\partial E_x}{\partial a_1} = 2\,(a_1 - y_1), \qquad \frac{\partial a_1}{\partial z_1^{(4)}} = g'\!\left(z_1^{(4)}\right)$$

SLIDE 25

Gradient (on one data point)

[Figure: same network as above]

$$\frac{\partial E_x}{\partial w_{11}^{(4)}} = 2\,(a_1 - y_1)\, a_1 (1 - a_1)\, a_1^{(3)}$$

By Chain Rule:
$$\frac{\partial E_x}{\partial a_1} = 2\,(a_1 - y_1), \qquad \frac{\partial a_1}{\partial z_1^{(4)}} = g'\!\left(z_1^{(4)}\right)$$

Every factor can be computed from the network activations.
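A quick numerical sanity check of this expression on a made-up two-weight output unit (all values are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

a3 = np.array([0.3, 0.8])       # Layer 3 activations a_1^(3), a_2^(3)
w4 = np.array([0.5, -0.2])      # output weights w_11^(4), w_12^(4)
y1 = 1.0                        # target for output unit 1

def cost(w):
    a1 = sigmoid(w @ a3)        # a_1 = g(z_1^(4))
    return (a1 - y1) ** 2

# analytic gradient: 2 (a_1 - y_1) a_1 (1 - a_1) a_1^(3)
a1 = sigmoid(w4 @ a3)
analytic = 2 * (a1 - y1) * a1 * (1 - a1) * a3[0]

# central finite difference w.r.t. w_11^(4)
eps = 1e-6
wp, wm = w4.copy(), w4.copy()
wp[0] += eps
wm[0] -= eps
numeric = (cost(wp) - cost(wm)) / (2 * eps)

print(analytic, numeric)        # the two should agree closely
```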

SLIDE 26

Backpropagation

[Figure: same network; $E_x = \| y - a \|^2$, $z_1^{(4)} = w_{11}^{(4)} a_1^{(3)} + w_{12}^{(4)} a_2^{(3)}$]

$$\frac{\partial E_x}{\partial w_{11}^{(4)}} = 2\,(a_1 - y_1)\, a_1 (1 - a_1)\, a_1^{(3)}$$

By Chain Rule:
$$\frac{\partial E_x}{\partial z_1^{(4)}} = 2\,(a_1 - y_1)\, g'\!\left(z_1^{(4)}\right)$$

SLIDE 27

Backpropagation

[Figure: same network as above]

$$\frac{\partial E_x}{\partial w_{11}^{(4)}} = 2\,(a_1 - y_1)\, a_1 (1 - a_1)\, a_1^{(3)}$$

By Chain Rule:
$$\delta_1^{(4)} = \frac{\partial E_x}{\partial z_1^{(4)}} = 2\,(a_1 - y_1)\, g'\!\left(z_1^{(4)}\right)$$

SLIDE 28

Backpropagation

[Figure: same network, annotated with $\delta_1^{(4)}$ at the output unit and $a_1^{(3)}$ at Layer 3]

$$\frac{\partial E_x}{\partial w_{11}^{(4)}} = \delta_1^{(4)}\, a_1^{(3)}$$

By Chain Rule:
$$\delta_1^{(4)} = \frac{\partial E_x}{\partial z_1^{(4)}} = 2\,(a_1 - y_1)\, g'\!\left(z_1^{(4)}\right)$$

SLIDE 29

Backpropagation

[Figure: same network; weights $w_{11}^{(4)}, w_{12}^{(4)}$ feed the first output unit from $a_1^{(3)}, a_2^{(3)}$]

$$\frac{\partial E_x}{\partial w_{11}^{(4)}} = \delta_1^{(4)}\, a_1^{(3)}, \qquad \frac{\partial E_x}{\partial w_{12}^{(4)}} = \delta_1^{(4)}\, a_2^{(3)}$$

By Chain Rule:
$$\delta_1^{(4)} = \frac{\partial E_x}{\partial z_1^{(4)}} = 2\,(a_1 - y_1)\, g'\!\left(z_1^{(4)}\right)$$

SLIDE 30

Backpropagation

[Figure: same network; weights $w_{21}^{(4)}, w_{22}^{(4)}$ feed the second output unit from $a_1^{(3)}, a_2^{(3)}$]

$$\frac{\partial E_x}{\partial w_{21}^{(4)}} = \delta_2^{(4)}\, a_1^{(3)}, \qquad \frac{\partial E_x}{\partial w_{22}^{(4)}} = \delta_2^{(4)}\, a_2^{(3)}$$

By Chain Rule:
$$\delta_2^{(4)} = \frac{\partial E_x}{\partial z_2^{(4)}} = 2\,(a_2 - y_2)\, g'\!\left(z_2^{(4)}\right)$$

SLIDE 31

πœ€2

(4)

πœ€1

(3)

πœ€2

(3)

πœ€1

(2)

πœ€2

(2)

πœ€1

(4)

Backpropagation

𝑦2 𝑦1 = 𝑧 βˆ’ 𝑏 2 𝑏1 𝑏2

Layer (4) Layer (3) Layer (2) Layer (1)

𝐹𝑦

Thus, for any weight in the network: πœ–πΉπ‘¦ πœ–π‘₯

π‘˜π‘™ (π‘š) = πœ€ π‘˜ (π‘š)𝑏𝑙 (π‘šβˆ’1)

πœ€

π‘˜ (π‘š)

: πœ€ of π‘˜π‘’β„Ž neuron in Layer π‘š 𝑏𝑙

(π‘šβˆ’1) : Activation of π‘™π‘’β„Ž neuron in Layer π‘š βˆ’ 1

π‘₯

π‘˜π‘™ (π‘š)

: Weight from π‘™π‘’β„Ž neuron in Layer π‘š βˆ’ 1 to π‘˜π‘’β„Ž neuron in Layer π‘š
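In vector form this gradient is an outer product of a layer's $\delta$ vector with the previous layer's activations; a one-function sketch (shapes are my assumption):

```python
import numpy as np

def weight_grad(delta, a_prev):
    # dE_x/dw_kl^(m) = delta_k^(m) * a_l^(m-1): an outer product
    return np.outer(delta, a_prev)

print(weight_grad(np.array([0.1, -0.2]),
                  np.array([0.3, 0.7, 0.5])).shape)   # (2, 3)
```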

SLIDE 32

πœ€2

(4)

πœ€1

(3)

πœ€2

(3)

πœ€1

(2)

πœ€2

(2)

πœ€1

(4)

Exercise

𝑦2 𝑦1 = 𝑧 βˆ’ 𝑏 2 𝑏1 𝑏2

Layer (4) Layer (3) Layer (2) Layer (1)

𝐹𝑦

Show that for any bias in the network: πœ–πΉπ‘¦ πœ–π‘

π‘˜ (π‘š) = πœ€ π‘˜ (π‘š)

πœ€

π‘˜ (π‘š)

: πœ€ of π‘˜π‘’β„Ž neuron in Layer π‘š 𝑐

π‘˜ (π‘š)

: bias for the π‘˜π‘’β„Ž neuron in Layer π‘š, i.e., π‘¨π‘˜

(π‘š) = σ𝑙 π‘₯ π‘˜π‘™ (π‘š)𝑏𝑙 (π‘šβˆ’1) + 𝑐 π‘˜ (π‘š)

SLIDE 33

πœ€2

(4)

πœ€1

(3)

πœ€2

(3)

πœ€1

(2)

πœ€2

(2)

πœ€1

(4)

Backpropagation of πœ€

𝑦2 𝑦1 = 𝑧 βˆ’ 𝑏 2 𝑏1 𝑏2

Layer (4) Layer (3) Layer (2) Layer (1)

𝐹𝑦

Thus, for any neuron in the network: πœ€

π‘˜ (π‘š) = ෍ 𝑙

πœ€π‘™

π‘š+1 π‘₯π‘™π‘˜ π‘š+1

𝑕′ 𝑨

π‘˜ π‘š

πœ€

π‘˜ (π‘š)

: πœ€ of π‘˜π‘’β„Ž Neuron in Layer π‘š πœ€π‘™

(π‘š+1)

: πœ€ of π‘™π‘’β„Ž Neuron in Layer π‘š + 1 𝑕′ 𝑨

π‘˜ π‘š

: derivative of π‘˜π‘’β„Ž Neuron in Layer π‘š w.r.t. its linear combination input π‘₯π‘™π‘˜

(π‘š+1)

: Weight from π‘˜π‘’β„Ž Neuron in Layer π‘š to π‘™π‘’β„Ž Neuron in Layer π‘š + 1
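One backward step of this recursion for a sigmoid layer might look like this sketch (matrix shapes are my assumption):

```python
import numpy as np

def backprop_delta(delta_next, W_next, a):
    # delta_next: deltas of Layer m+1, shape (L,)
    # W_next:     weights w_lk^(m+1), shape (L, K)
    # a:          activations a_k^(m) of Layer m, shape (K,)
    # for a sigmoid g, g'(z_k^(m)) = a_k^(m) (1 - a_k^(m))
    return (W_next.T @ delta_next) * a * (1 - a)

print(backprop_delta(np.array([0.2, -0.1]),
                     np.full((2, 3), 0.5),
                     np.array([0.4, 0.6, 0.9])))
```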

SLIDE 34

Gradient descent with Backpropagation

  • 1. Initialize Network with Random Weights and Biases
  • 2. For each Training Image:

a. Compute Activations for the Entire Network b. Compute πœ€ for Neurons in the Output Layer using Network Activation and Desired Activation πœ€

π‘˜ (𝑀) = 2 π‘§π‘˜ βˆ’ π‘π‘˜ π‘π‘˜(1 βˆ’ π‘π‘˜)

c. Compute πœ€ for all Neurons in the previous Layers πœ€

π‘˜ (π‘š) = ෍ 𝑙

πœ€π‘™

π‘š+1 π‘₯π‘™π‘˜ π‘š+1 π‘π‘˜ π‘š (1 βˆ’ π‘π‘˜ π‘š )

d. Compute Gradient of Cost w.r.t each Weight and Bias for the Training Image using πœ€ πœ–πΉπ‘¦ πœ–π‘₯

π‘˜π‘™ (π‘š) = πœ€ π‘˜ (π‘š)𝑏𝑙 (π‘šβˆ’1)

πœ–πΉπ‘¦ πœ–π‘

π‘˜ (π‘š) = πœ€ π‘˜ (π‘š)

SLIDE 35

Gradient descent with Backpropagation

  • 3. Average the Gradient w.r.t. each Weight and Bias over the

Entire Training Set πœ–πΉ πœ–π‘₯

π‘˜π‘™ (π‘š) = 1

π‘œ ෍ πœ–πΉπ’š πœ–π‘₯

π‘˜π‘™ (π‘š)

πœ–πΉ πœ–π‘

π‘˜ (π‘š) = 1

π‘œ ෍ πœ–πΉπ’š πœ–π‘

π‘˜ (π‘š)

  • 4. Update the Weights and Biases using Gradient Descent

π‘₯

π‘˜π‘™ (π‘š)βƒͺ π‘₯ π‘˜π‘™ (π‘š) βˆ’ πœƒ πœ–πΉ

πœ–π‘₯

π‘˜π‘™ π‘š

𝑐

π‘˜ π‘š βƒͺ 𝑐 π‘˜ π‘š βˆ’ πœƒ πœ–πΉ

πœ–π‘

π‘˜ (π‘š)

  • 5. Repeat Steps 2-4 till Cost reduces below an acceptable level