
Neural Networks Part 2 - Yingyu Liang (yliang@cs.wisc.edu), Computer Sciences Department, University of Wisconsin, Madison - PowerPoint PPT Presentation



  1. Neural Networks Part 2. Yingyu Liang, yliang@cs.wisc.edu. Computer Sciences Department, University of Wisconsin, Madison. [Based on slides from Jerry Zhu, Mohit Gupta]

  2. Limited power of one single neuron • Perceptron: a = g(Σ_d w_d x_d) • Activation function g: linear, step, sigmoid • [Diagram: inputs 1, x_1, …, x_D with weights w_0, w_1, …, w_D feeding a single unit whose output is a = g(Σ_d w_d x_d)]

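To make the single-neuron model concrete, here is a minimal NumPy sketch of a = g(Σ_d w_d x_d); the specific weights and inputs are made-up illustrative values, and the constant input 1 plays the role of x_0 with bias weight w_0.

    import numpy as np

    def sigmoid(z):
        # one choice of activation g from the slide: the sigmoid
        return 1.0 / (1.0 + np.exp(-z))

    def perceptron(x, w, g=sigmoid):
        # a = g(sum_d w_d x_d); x[0] is the constant input 1, w[0] its weight w_0
        return g(np.dot(w, x))

    x = np.array([1.0, 0.5, -1.2])   # constant 1 plus inputs x_1, x_2 (illustrative)
    w = np.array([0.1, 0.7, 0.3])    # weights w_0, w_1, w_2 (illustrative)
    print(perceptron(x, w))
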
  3. Limited power of one single neuron • Perceptron: a = g(Σ_d w_d x_d) • Activation function g: linear, step, sigmoid • [Diagram: same single-unit perceptron as on the previous slide] • Decision boundary is linear even for nonlinear g • XOR problem

  4. Limited power of one single neuron • XOR problem • Wait! If one can represent AND, OR, NOT, one can represent any logic circuit (including XOR) by connecting them • Question: how to?

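One possible answer to "how to?", sketched in code under the assumption of step-activation perceptrons: build OR, AND, and NOT units and wire them so that XOR(x_1, x_2) = (x_1 OR x_2) AND NOT(x_1 AND x_2). The thresholds and weights below are one working choice, not taken from the lecture.

    import numpy as np

    def step(z):
        # threshold activation: fire (1) when the weighted sum is non-negative
        return 1.0 if z >= 0 else 0.0

    def unit(x, w, b):
        # a single perceptron: step(w . x + b)
        return step(np.dot(w, x) + b)

    def xor(x1, x2):
        x = np.array([x1, x2], dtype=float)
        or_out  = unit(x, np.array([1.0, 1.0]), -0.5)                # OR gate
        and_out = unit(x, np.array([1.0, 1.0]), -1.5)                # AND gate
        not_and = unit(np.array([and_out]), np.array([-1.0]), 0.5)   # NOT gate
        # final unit: AND of (OR, NOT AND) gives XOR
        return unit(np.array([or_out, not_and]), np.array([1.0, 1.0]), -1.5)

    for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print(a, b, "->", xor(a, b))
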
  5. Multi-layer neural networks • Standard way to connect Perceptrons • Example: 1 hidden layer, 1 output layer • [Diagram: inputs x_1, x_2 form Layer 1 (input), which feeds Layer 2 (hidden), which feeds Layer 3 (output)]

  6. Multi-layer neural networks • Standard way to connect Perceptrons • Example: 1 hidden layer, 1 output layer • First hidden unit: a_1^(2) = g(Σ_d w_1d^(2) x_d), with weights w_11^(2), w_12^(2) on inputs x_1, x_2

  7. Multi-layer neural networks • Standard way to connect Perceptrons • Example: 1 hidden layer, 1 output layer • Hidden units: a_1^(2) = g(Σ_d w_1d^(2) x_d) and a_2^(2) = g(Σ_d w_2d^(2) x_d), with weights w_11^(2), w_21^(2), w_12^(2), w_22^(2) on inputs x_1, x_2

  8. Multi-layer neural networks • Standard way to connect Perceptrons • Example: 1 hidden layer, 1 output layer • Hidden units: a_j^(2) = g(Σ_d w_jd^(2) x_d) for j = 1, 2, 3, with weights w_jd^(2) on inputs x_1, x_2

  9. Multi-layer neural networks • Standard way to connect Perceptrons • Example: 1 hidden layer, 1 output layer • Hidden units: a_j^(2) = g(Σ_d w_jd^(2) x_d) for j = 1, 2, 3 • Output unit: a = g(Σ_j w_j^(3) a_j^(2)), with weights w_1^(3), w_2^(3), w_3^(3) on the hidden activations

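A short sketch of this forward pass for the 2-input, 3-hidden-unit, 1-output example; the weight values are arbitrary placeholders, and bias terms are omitted to match the slide's formulas.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # hidden layer: a_j^(2) = g(sum_d w_jd^(2) x_d), one row of W2 per hidden unit
    W2 = np.array([[ 0.2, -0.4],
                   [ 0.7,  0.1],
                   [-0.3,  0.5]])
    # output layer: a = g(sum_j w_j^(3) a_j^(2))
    w3 = np.array([0.6, -0.2, 0.9])

    x  = np.array([1.0, 0.0])        # inputs x_1, x_2 (illustrative)
    a2 = sigmoid(W2 @ x)             # hidden activations a_1^(2), a_2^(2), a_3^(2)
    a  = sigmoid(np.dot(w3, a2))     # network output a
    print(a2, a)
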
  10. Neural net for K-way classification • Use K output units • Training: encode a label y by an indicator vector ▪ class 1 = (1,0,0,…,0), class 2 = (0,1,0,…,0), etc. • Test: choose the class corresponding to the largest output unit • First output unit: a_1^(3) = g(Σ_j w_1j^(3) a_j^(2)), with weights w_11^(3), w_12^(3), w_13^(3) on the hidden activations

  11. Neural net for K-way classification • Use K output units • Training: encode a label y by an indicator vector ▪ class 1 = (1,0,0,…,0), class 2 = (0,1,0,…,0), etc. • Test: choose the class corresponding to the largest output unit • Output units: a_1^(3) = g(Σ_j w_1j^(3) a_j^(2)) through a_K^(3) = g(Σ_j w_Kj^(3) a_j^(2)), with weights w_kj^(3) from hidden unit j to output unit k

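A hedged sketch of the K-way setup: one-hot (indicator) encoding of the label for training, and picking the largest of the K output units at test time. The layer sizes and random weights are illustrative assumptions.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def one_hot(label, K):
        # indicator vector: class 1 -> (1,0,...,0), class 2 -> (0,1,0,...,0), ...
        y = np.zeros(K)
        y[label - 1] = 1.0
        return y

    rng = np.random.default_rng(0)
    K, n_hidden, n_in = 4, 3, 2
    W2 = rng.normal(size=(n_hidden, n_in))   # input -> hidden weights
    W3 = rng.normal(size=(K, n_hidden))      # one row of output weights per class

    x = np.array([0.5, -1.0])                # one test input (illustrative)
    a = sigmoid(W3 @ sigmoid(W2 @ x))        # K output units a_1, ..., a_K
    predicted = int(np.argmax(a)) + 1        # test rule: class of the largest output
    print(one_hot(2, K), a, predicted)
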
  12. The (unlimited) power of neural network • In theory ▪ we don't need too many layers: ▪ a 1-hidden-layer net with enough hidden units can represent any continuous function of the inputs with arbitrary accuracy ▪ a 2-hidden-layer net can even represent discontinuous functions

  13. Learning in neural network • Again we will minimize the error (K outputs): E = Σ_{x∈D} E_x, where E_x = ||y − a||² = Σ_{c=1}^{K} (a_c − y_c)² • x: one training point in the training set D • a_c: the c-th output for the training point x • y_c: the c-th element of the label indicator vector for x • [Diagram: inputs x_1, x_2, outputs a_1, …, a_K, compared to the label indicator vector y = (1, 0, …, 0)]

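For one training point, the per-example error from the slide can be computed directly; the output and label vectors below are made-up numbers.

    import numpy as np

    a = np.array([0.8, 0.1, 0.3])   # network outputs a_1, ..., a_K for one point x
    y = np.array([1.0, 0.0, 0.0])   # label indicator vector for x (class 1)

    E_x = np.sum((a - y) ** 2)      # E_x = ||y - a||^2 = sum_c (a_c - y_c)^2
    print(E_x)                      # the total error E sums E_x over the training set D
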
  14. Learning in neural network • Again we will minimize the error (K outputs): E = Σ_{x∈D} E_x, where E_x = ||y − a||² = Σ_{c=1}^{K} (a_c − y_c)² • x: one training point in the training set D • a_c: the c-th output for the training point x • y_c: the c-th element of the label indicator vector for x • Our variables are all the weights w on all the edges ▪ Apparent difficulty: we don't know the 'correct' output of hidden units

  15. Learning in neural network • Again we will minimize the error (K outputs): E = Σ_{x∈D} E_x, where E_x = ||y − a||² = Σ_{c=1}^{K} (a_c − y_c)² • x: one training point in the training set D • a_c: the c-th output for the training point x • y_c: the c-th element of the label indicator vector for x • Our variables are all the weights w on all the edges ▪ Apparent difficulty: we don't know the 'correct' output of hidden units ▪ It turns out to be OK: we can still do gradient descent. The trick you need is the chain rule ▪ The algorithm is known as back-propagation

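A compact sketch of what back-propagation plus gradient descent looks like for the 1-hidden-layer, K-output network and the squared error above. Everything here (layer sizes, learning rate, toy data, number of epochs) is an illustrative assumption rather than the lecture's setup; the delta terms are the chain-rule factors derived on the following slides.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    n_in, n_hidden, K = 2, 3, 2
    W2 = rng.normal(scale=0.5, size=(n_hidden, n_in))   # input -> hidden weights
    W3 = rng.normal(scale=0.5, size=(K, n_hidden))      # hidden -> output weights
    lr = 0.5                                            # learning rate (illustrative)

    # toy training set: two points with one-hot labels
    X = np.array([[0.0, 1.0], [1.0, 0.0]])
    Y = np.array([[1.0, 0.0], [0.0, 1.0]])

    for epoch in range(2000):
        for x, y in zip(X, Y):
            # forward pass
            a2 = sigmoid(W2 @ x)                      # hidden activations
            a3 = sigmoid(W3 @ a2)                     # outputs a_1, ..., a_K
            # backward pass (chain rule) for E_x = sum_c (a_c - y_c)^2
            delta3 = 2 * (a3 - y) * a3 * (1 - a3)     # dE_x/dz at the output layer
            delta2 = (W3.T @ delta3) * a2 * (1 - a2)  # dE_x/dz at the hidden layer
            # gradient descent step on all the weights
            W3 -= lr * np.outer(delta3, a2)
            W2 -= lr * np.outer(delta2, x)

    print(sigmoid(W3 @ sigmoid(W2 @ X[0])))   # should be close to the label (1, 0)
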
  16. Gradient (on one data point) • Network with Layer (1), Layer (2), Layer (3), Layer (4); inputs x_1, x_2; weight w_11^(4) into the first output unit; error E_x at the output • Want to compute ∂E_x/∂w_11^(4)

  17. Gradient (on one data point) • Same four-layer network with inputs x_1, x_2 and weight w_11^(4) • Outputs a_1, a_2; error E_x = ||y − a||² • [Computation graph: a_1 → ||y − a||² = E_x]

  18. Gradient (on one data point) • z_1^(4) = w_11^(4) a_1^(3) + w_12^(4) a_2^(3) • a_1 = g(z_1^(4)); E_x = ||y − a||² • [Computation graph: z_1^(4) → g(z_1^(4)) = a_1 → ||y − a||² = E_x]

  19. Gradient (on one data point) • z_1^(4) = w_11^(4) a_1^(3) + w_12^(4) a_2^(3) • a_1 = g(z_1^(4)); E_x = ||y − a||² • [Computation graph: a_1^(3) and a_2^(3), weighted by w_11^(4) and w_12^(4), sum to z_1^(4) → g(z_1^(4)) = a_1 → ||y − a||² = E_x]

  20. Gradient (on one data point) • z_1^(4) = w_11^(4) a_1^(3) + w_12^(4) a_2^(3); a_1 = g(z_1^(4)); E_x = ||y − a||² • ∂a_1/∂z_1^(4) = g'(z_1^(4)); ∂E_x/∂a_1 = 2(a_1 − y_1) • By Chain Rule: ∂E_x/∂w_11^(4) = (∂E_x/∂a_1)(∂a_1/∂z_1^(4))(∂z_1^(4)/∂w_11^(4))

  21. Gradient (on one data point) • z_1^(4) = w_11^(4) a_1^(3) + w_12^(4) a_2^(3); a_1 = g(z_1^(4)); E_x = ||y − a||² • ∂a_1/∂z_1^(4) = g'(z_1^(4)); ∂E_x/∂a_1 = 2(a_1 − y_1) • By Chain Rule: ∂E_x/∂w_11^(4) = 2(a_1 − y_1) g'(z_1^(4)) ∂z_1^(4)/∂w_11^(4)

  22. Gradient (on one data point) • z_1^(4) = w_11^(4) a_1^(3) + w_12^(4) a_2^(3); a_1 = g(z_1^(4)); E_x = ||y − a||² • ∂a_1/∂z_1^(4) = g'(z_1^(4)); ∂E_x/∂a_1 = 2(a_1 − y_1) • By Chain Rule: ∂E_x/∂w_11^(4) = 2(a_1 − y_1) g'(z_1^(4)) a_1^(3)

  23. Gradient (on one data point) • z_1^(4) = w_11^(4) a_1^(3) + w_12^(4) a_2^(3); a_1 = g(z_1^(4)); E_x = ||y − a||² • ∂a_1/∂z_1^(4) = g'(z_1^(4)); ∂E_x/∂a_1 = 2(a_1 − y_1) • For the sigmoid activation, g'(z) = g(z)(1 − g(z)), so by Chain Rule: ∂E_x/∂w_11^(4) = 2(a_1 − y_1) g(z_1^(4)) (1 − g(z_1^(4))) a_1^(3)

  24. Gradient (on one data point) • z_1^(4) = w_11^(4) a_1^(3) + w_12^(4) a_2^(3); a_1 = g(z_1^(4)); E_x = ||y − a||² • ∂a_1/∂z_1^(4) = g'(z_1^(4)); ∂E_x/∂a_1 = 2(a_1 − y_1) • By Chain Rule: ∂E_x/∂w_11^(4) = 2(a_1 − y_1) a_1 (1 − a_1) a_1^(3)

  25. Gradient (on one data point) • z_1^(4) = w_11^(4) a_1^(3) + w_12^(4) a_2^(3); a_1 = g(z_1^(4)); E_x = ||y − a||² • ∂a_1/∂z_1^(4) = g'(z_1^(4)); ∂E_x/∂a_1 = 2(a_1 − y_1) • By Chain Rule: ∂E_x/∂w_11^(4) = 2(a_1 − y_1) a_1 (1 − a_1) a_1^(3) • Every factor can be computed from the network activations

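As a sanity check on the final expression, a short sketch that evaluates ∂E_x/∂w_11^(4) = 2(a_1 − y_1) a_1 (1 − a_1) a_1^(3) from forward-pass quantities and compares it with a finite-difference estimate; the activations a_1^(3), a_2^(3), the weights, and the label entry are illustrative values.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    a3 = np.array([0.7, 0.2])    # incoming activations a_1^(3), a_2^(3) (illustrative)
    w4 = np.array([0.5, -0.3])   # weights w_11^(4), w_12^(4) (illustrative)
    y1 = 1.0                     # first entry of the label indicator vector

    def loss_term(w):
        # only output 1 depends on w_11^(4), so its term of E_x is enough here
        a1 = sigmoid(np.dot(w, a3))          # a_1 = g(z_1^(4)), z_1^(4) = w . a^(3)
        return (a1 - y1) ** 2

    a1 = sigmoid(np.dot(w4, a3))
    grad_formula = 2 * (a1 - y1) * a1 * (1 - a1) * a3[0]   # the slide's closed form

    eps = 1e-6                                   # central finite-difference estimate
    w_plus, w_minus = w4.copy(), w4.copy()
    w_plus[0] += eps
    w_minus[0] -= eps
    grad_numeric = (loss_term(w_plus) - loss_term(w_minus)) / (2 * eps)

    print(grad_formula, grad_numeric)            # the two values should agree closely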