  1. Neural Networks Part 2. Yingyu Liang, yliang@cs.wisc.edu, Computer Sciences Department, University of Wisconsin, Madison. [Based on slides from Jerry Zhu, Mohit Gupta]

  2. Limited power of one single neuron
     • Perceptron: $a = h\left(\sum_d w_d x_d\right)$
     • Activation function $h$: linear, step, sigmoid
     [Figure: a single neuron with inputs $1, x_1, \dots, x_D$, weights $w_0, w_1, \dots, w_D$, and output $a = h\left(\sum_d w_d x_d\right)$]

  3. Limited power of one single neuron
     • Perceptron: $a = h\left(\sum_d w_d x_d\right)$
     • Activation function $h$: linear, step, sigmoid
     • Decision boundary is linear even for nonlinear $h$
     • XOR problem
     [Figure: the same single-neuron diagram as on slide 2]

  4. Limited power of one single neuron
     • XOR problem
     • Wait! If one can represent AND, OR, NOT, one can represent any logic circuit (including XOR) by connecting them
     • Question: how to? (One possible wiring is sketched below.)
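
One possible answer to the slide's question, written as code: a minimal sketch (not part of the original slides) that wires three step-activation perceptrons into XOR via the identity XOR(x1, x2) = AND(OR(x1, x2), NAND(x1, x2)). The weight and threshold values are one hand-picked choice among many.

```python
import numpy as np

def perceptron(w, b):
    """Build a step-activation perceptron: a = step(w . x + b)."""
    return lambda x: float(np.dot(w, x) + b > 0)

# Hand-picked weights for the basic gates (one possible choice).
AND  = perceptron(np.array([1.0, 1.0]), -1.5)    # fires only if both inputs are 1
OR   = perceptron(np.array([1.0, 1.0]), -0.5)    # fires if at least one input is 1
NAND = perceptron(np.array([-1.0, -1.0]), 1.5)   # NOT(AND(x1, x2))

def XOR(x):
    # A two-layer circuit of perceptrons: XOR(x1, x2) = AND(OR(x1, x2), NAND(x1, x2))
    return AND(np.array([OR(x), NAND(x)]))

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, XOR(np.array(x, dtype=float)))      # -> 0, 1, 1, 0
```

The structure is the point, not the particular numbers: stacking a second layer of perceptrons removes the linear-decision-boundary limitation noted on slide 3.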

  5. Multi-layer neural networks
     • Standard way to connect Perceptrons
     • Example: 1 hidden layer, 1 output layer
     [Figure: Layer 1 (input) with inputs $x_1, x_2$, Layer 2 (hidden), Layer 3 (output)]

  6. Multi-layer neural networks
     • Standard way to connect Perceptrons
     • Example: 1 hidden layer, 1 output layer
     [Figure: the first hidden unit computes $a_1^{(2)} = h\left(\sum_d x_d w_{1d}^{(2)}\right)$ from inputs $x_1, x_2$ via weights $w_{11}^{(2)}, w_{12}^{(2)}$]

  7. Multi-layer neural networks
     • Standard way to connect Perceptrons
     • Example: 1 hidden layer, 1 output layer
     [Figure: hidden units 1 and 2 compute $a_1^{(2)} = h\left(\sum_d x_d w_{1d}^{(2)}\right)$ and $a_2^{(2)} = h\left(\sum_d x_d w_{2d}^{(2)}\right)$ from inputs $x_1, x_2$ via weights $w_{11}^{(2)}, w_{12}^{(2)}, w_{21}^{(2)}, w_{22}^{(2)}$]

  8. Multi-layer neural networks
     • Standard way to connect Perceptrons
     • Example: 1 hidden layer, 1 output layer
     [Figure: three hidden units compute $a_i^{(2)} = h\left(\sum_d x_d w_{id}^{(2)}\right)$, $i = 1, 2, 3$, from inputs $x_1, x_2$]

  9. Multi-layer neural networks
     • Standard way to connect Perceptrons
     • Example: 1 hidden layer, 1 output layer
     [Figure: the hidden units compute $a_i^{(2)} = h\left(\sum_d x_d w_{id}^{(2)}\right)$, $i = 1, 2, 3$; the output unit combines them via weights $w_1^{(3)}, w_2^{(3)}, w_3^{(3)}$ as $a = h\left(\sum_i a_i^{(2)} w_i^{(3)}\right)$]
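
Slides 5 through 9 spell out a forward pass: each hidden unit applies $h$ to a weighted sum of the inputs, and the output unit applies $h$ to a weighted sum of the hidden activations. A minimal NumPy sketch of that computation follows; the sigmoid choice of $h$, the 2-input / 3-hidden-unit sizes, the random weights, and the omission of bias terms are all assumptions made for brevity.

```python
import numpy as np

def h(z):
    # Sigmoid activation (one of the options listed on slide 2)
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W2 = rng.normal(size=(3, 2))   # hidden-layer weights w_{id}^{(2)}: 3 hidden units, 2 inputs
w3 = rng.normal(size=3)        # output-layer weights w_i^{(3)}

def forward(x):
    a2 = h(W2 @ x)             # a_i^{(2)} = h(sum_d x_d w_{id}^{(2)})
    a = h(w3 @ a2)             # a = h(sum_i a_i^{(2)} w_i^{(3)})
    return a

print(forward(np.array([0.5, -1.0])))
```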

  10. Neural net for $K$-way classification
     • Use $K$ output units
     • Training: encode a label $y$ by an indicator vector
       ▪ class1 = (1,0,0,…,0), class2 = (0,1,0,…,0), etc.
     • Test: choose the class corresponding to the largest output unit
     [Figure: the first output unit computes $a_1^{(3)} = h\left(\sum_i a_i^{(2)} w_{1i}^{(3)}\right)$ from the hidden activations via weights $w_{11}^{(3)}, w_{12}^{(3)}, w_{13}^{(3)}$; inputs $x_1, x_2$]

  11. Neural net for $K$-way classification
     • Use $K$ output units
     • Training: encode a label $y$ by an indicator vector
       ▪ class1 = (1,0,0,…,0), class2 = (0,1,0,…,0), etc.
     • Test: choose the class corresponding to the largest output unit
     [Figure: output units $1, \dots, K$ compute $a_k^{(3)} = h\left(\sum_i a_i^{(2)} w_{ki}^{(3)}\right)$ from the hidden activations via weights $w_{11}^{(3)}, \dots, w_{K3}^{(3)}$; inputs $x_1, x_2$]
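
A short sketch of the $K$-way recipe above: labels are one-hot indicator vectors during training, and at test time the predicted class is the index of the largest output unit. The layer sizes and random weights below are placeholders for illustration, not values from the slides.

```python
import numpy as np

def h(z):
    return 1.0 / (1.0 + np.exp(-z))

K, n_hidden, n_inputs = 4, 3, 2             # assumed sizes
rng = np.random.default_rng(0)
W2 = rng.normal(size=(n_hidden, n_inputs))  # hidden weights w_{id}^{(2)}
W3 = rng.normal(size=(K, n_hidden))         # output weights w_{ki}^{(3)}, one row per class

def one_hot(label, K):
    """Training encoding: class c -> indicator vector with a 1 in position c."""
    y = np.zeros(K)
    y[label] = 1.0
    return y

def forward(x):
    a2 = h(W2 @ x)
    return h(W3 @ a2)                       # K outputs a_1^{(3)}, ..., a_K^{(3)}

def predict(x):
    # Test rule from the slide: the class whose output unit is largest
    return int(np.argmax(forward(x)))

print(one_hot(1, K))                        # class 2 -> [0. 1. 0. 0.]
print(predict(np.array([0.5, -1.0])))
```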

  12. The (unlimited) power of neural network
     • In theory
       ▪ we don’t need too many layers:
       ▪ a 1-hidden-layer net with enough hidden units can represent any continuous function of the inputs with arbitrary accuracy
       ▪ a 2-hidden-layer net can even represent discontinuous functions

  13. Learning in neural network
     • Again we will minimize the error ($K$ outputs):
       $E = \frac{1}{2} \sum_{x \in D} E_x, \qquad E_x = \|y - a\|^2 = \sum_{c=1}^{K} (a_c - y_c)^2$
     • $x$: one training point in the training set $D$
     • $a_c$: the $c$-th output for the training point $x$
     • $y_c$: the $c$-th element of the label indicator vector for $x$
     [Figure: the network outputs $a_1, \dots, a_K$ for inputs $x_1, x_2$, compared with the indicator vector $y = (1, 0, \dots, 0)$]
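
For concreteness, the per-example term $E_x$ is just the squared distance between the output vector and the label's indicator vector. A tiny sketch with made-up numbers:

```python
import numpy as np

def per_example_error(a, y):
    """E_x = ||y - a||^2 = sum_c (a_c - y_c)^2 for one training point."""
    return float(np.sum((a - y) ** 2))

# Toy values for a 3-class problem: network outputs a and a one-hot label y.
a = np.array([0.8, 0.3, 0.1])   # a_1, ..., a_K
y = np.array([1.0, 0.0, 0.0])   # indicator vector for class 1
print(per_example_error(a, y))  # (0.8-1)^2 + 0.3^2 + 0.1^2 = 0.14
```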

  14. Learning in neural network
     • Again we will minimize the error ($K$ outputs):
       $E = \frac{1}{2} \sum_{x \in D} E_x, \qquad E_x = \|y - a\|^2 = \sum_{c=1}^{K} (a_c - y_c)^2$
     • $x$: one training point in the training set $D$
     • $a_c$: the $c$-th output for the training point $x$
     • $y_c$: the $c$-th element of the label indicator vector for $x$
     • Our variables are all the weights $w$ on all the edges
       ▪ Apparent difficulty: we don’t know the ‘correct’ output of hidden units

  15. Learning in neural network
     • Again we will minimize the error ($K$ outputs):
       $E = \frac{1}{2} \sum_{x \in D} E_x, \qquad E_x = \|y - a\|^2 = \sum_{c=1}^{K} (a_c - y_c)^2$
     • $x$: one training point in the training set $D$
     • $a_c$: the $c$-th output for the training point $x$
     • $y_c$: the $c$-th element of the label indicator vector for $x$
     • Our variables are all the weights $w$ on all the edges
       ▪ Apparent difficulty: we don’t know the ‘correct’ output of hidden units
       ▪ It turns out to be OK: we can still do gradient descent. The trick you need is the chain rule
       ▪ The algorithm is known as back-propagation
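
The procedure this sets up is plain gradient descent on all the weights; back-propagation (derived on the following slides) is the chain-rule bookkeeping that produces the required partial derivatives. The sketch below shows such a descent loop on the small network from the earlier sketch, but substitutes a finite-difference gradient for back-propagation so that it runs on its own. The learning rate, the single toy training point, and the layer sizes are all assumptions for illustration.

```python
import numpy as np

def h(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(W2, w3, x):
    # 2-input / 3-hidden-unit / 1-output network, as in the earlier sketch
    return h(w3 @ h(W2 @ x))

def error(W2, w3, x, y):
    # Per-example squared error E_x = (a - y)^2 for a single output unit
    return (forward(W2, w3, x) - y) ** 2

def numeric_grad(f, W, eps=1e-6):
    """Finite-difference gradient of f() with respect to the array W.
    A stand-in for back-propagation, which gets the same derivatives
    analytically via the chain rule."""
    g = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        W[idx] += eps; hi = f()
        W[idx] -= 2 * eps; lo = f()
        W[idx] += eps
        g[idx] = (hi - lo) / (2 * eps)
    return g

rng = np.random.default_rng(0)
W2, w3 = rng.normal(size=(3, 2)), rng.normal(size=3)
x, y = np.array([1.0, 0.0]), 1.0   # toy training point (assumed)
lr = 0.5                           # learning rate (assumed)

for step in range(200):            # gradient descent on all the weights
    W2 -= lr * numeric_grad(lambda: error(W2, w3, x, y), W2)
    w3 -= lr * numeric_grad(lambda: error(W2, w3, x, y), w3)
print(error(W2, w3, x, y))         # the per-example error shrinks toward 0
```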

  16. Gradient (on one data point)
     [Figure: a 4-layer network, Layer (1) through Layer (4), with inputs $x_1, x_2$, an output-layer weight $w_{11}^{(4)}$, and the per-example error $E_x$]
     • We want to compute $\dfrac{\partial E_x}{\partial w_{11}^{(4)}}$

  17. Gradient (on one data point)
     [Figure: the same 4-layer network; the outputs $a_1, a_2$ feed the error $\|y - a\|^2 = E_x$, giving the computation graph $a_1 \to \|y - a\|^2 \to E_x$]

  18. Gradient (on one data point)
     • $z_1^{(4)} = w_{11}^{(4)} a_1^{(3)} + w_{12}^{(4)} a_2^{(3)}$
     [Figure: the computation graph for output unit 1: $z_1^{(4)} \to h\!\left(z_1^{(4)}\right) = a_1 \to \|y - a\|^2 \to E_x$]

  19. Gradient (on one data point)
     • $z_1^{(4)} = w_{11}^{(4)} a_1^{(3)} + w_{12}^{(4)} a_2^{(3)}$
     [Figure: the graph now also shows the inputs to $z_1^{(4)}$: the hidden activations $a_1^{(3)}, a_2^{(3)}$ scaled by the weights $w_{11}^{(4)}, w_{12}^{(4)}$, then $h\!\left(z_1^{(4)}\right) = a_1 \to \|y - a\|^2 \to E_x$]

  20. Gradient (on one data point)
     • $z_1^{(4)} = w_{11}^{(4)} a_1^{(3)} + w_{12}^{(4)} a_2^{(3)}$, $\quad a_1 = h\!\left(z_1^{(4)}\right)$, $\quad E_x = \|y - a\|^2$
     • By the chain rule: $\dfrac{\partial E_x}{\partial w_{11}^{(4)}} = \dfrac{\partial E_x}{\partial a_1} \cdot \dfrac{\partial a_1}{\partial z_1^{(4)}} \cdot \dfrac{\partial z_1^{(4)}}{\partial w_{11}^{(4)}}$
     • $\dfrac{\partial E_x}{\partial a_1} = 2(a_1 - y_1)$, $\quad \dfrac{\partial a_1}{\partial z_1^{(4)}} = h'\!\left(z_1^{(4)}\right)$
     [Figure: same computation graph as slide 19]

  21. Gradient (on one data point)
     • $\dfrac{\partial E_x}{\partial a_1} = 2(a_1 - y_1)$, $\quad \dfrac{\partial a_1}{\partial z_1^{(4)}} = h'\!\left(z_1^{(4)}\right)$
     • By the chain rule: $\dfrac{\partial E_x}{\partial w_{11}^{(4)}} = 2(a_1 - y_1)\, h'\!\left(z_1^{(4)}\right) \dfrac{\partial z_1^{(4)}}{\partial w_{11}^{(4)}}$
     [Figure: same computation graph as slide 19]

  22. Gradient (on one data point)
     • Since $z_1^{(4)} = w_{11}^{(4)} a_1^{(3)} + w_{12}^{(4)} a_2^{(3)}$, we have $\dfrac{\partial z_1^{(4)}}{\partial w_{11}^{(4)}} = a_1^{(3)}$
     • By the chain rule: $\dfrac{\partial E_x}{\partial w_{11}^{(4)}} = 2(a_1 - y_1)\, h'\!\left(z_1^{(4)}\right) a_1^{(3)}$
     [Figure: same computation graph as slide 19]

  23. Gradient (on one data point)
     • For the sigmoid activation, $h'(z) = h(z)\left(1 - h(z)\right)$
     • By the chain rule: $\dfrac{\partial E_x}{\partial w_{11}^{(4)}} = 2(a_1 - y_1)\, h\!\left(z_1^{(4)}\right)\left(1 - h\!\left(z_1^{(4)}\right)\right) a_1^{(3)}$
     [Figure: same computation graph as slide 19]

  24. Gradient (on one data point)
     • By the chain rule (using $a_1 = h\!\left(z_1^{(4)}\right)$): $\dfrac{\partial E_x}{\partial w_{11}^{(4)}} = 2(a_1 - y_1)\, a_1 (1 - a_1)\, a_1^{(3)}$
     [Figure: same computation graph as slide 19]

  25. Gradient (on one data point)
     • By the chain rule: $\dfrac{\partial E_x}{\partial w_{11}^{(4)}} = 2(a_1 - y_1)\, a_1 (1 - a_1)\, a_1^{(3)}$
     • Every factor can be computed from the network activations
     [Figure: same computation graph as slide 19]
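
As a sanity check on this final expression, the sketch below plugs made-up numbers into $\frac{\partial E_x}{\partial w_{11}^{(4)}} = 2(a_1 - y_1)\, a_1 (1 - a_1)\, a_1^{(3)}$, which uses only forward-pass activations, and compares it against a finite-difference estimate of the same derivative. The specific activation values and weights are assumptions for illustration.

```python
import numpy as np

def h(z):
    return 1.0 / (1.0 + np.exp(-z))   # sigmoid, so h'(z) = h(z) * (1 - h(z))

# Made-up values for the last layer of the example network.
a3 = np.array([0.7, 0.2])             # hidden activations a_1^{(3)}, a_2^{(3)}
W4 = np.array([[0.5, -0.3]])          # output weights w_{11}^{(4)}, w_{12}^{(4)}
y1 = 1.0                              # target for output unit 1

def E_x(W):
    a1 = h(W[0] @ a3)                 # a_1 = h(z_1^{(4)})
    return (a1 - y1) ** 2             # per-example error for this output unit

# Gradient from the slide: 2 (a_1 - y_1) a_1 (1 - a_1) a_1^{(3)}
a1 = h(W4[0] @ a3)
grad_formula = 2 * (a1 - y1) * a1 * (1 - a1) * a3[0]

# Finite-difference estimate of dE_x / dw_{11}^{(4)} for comparison
eps = 1e-6
Wp, Wm = W4.copy(), W4.copy()
Wp[0, 0] += eps
Wm[0, 0] -= eps
grad_numeric = (E_x(Wp) - E_x(Wm)) / (2 * eps)

print(grad_formula, grad_numeric)     # the two values agree
```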
