Neural networks (Chapter 20)



  1. Neural networks (Chapter 20)

  2. Outline
     ♦ Brains
     ♦ Neural networks
     ♦ Perceptrons
     ♦ Multilayer networks
     ♦ Applications of neural networks

  3. Brains
     10^11 neurons of > 20 types, 10^14 synapses, 1ms–10ms cycle time
     Signals are noisy “spike trains” of electrical potential
     [Figure: anatomy of a neuron: cell body (soma), nucleus, dendrites, axon, axonal arborization, and synapses connecting to axons from other cells]

  4. McCulloch–Pitts “unit”
     Output is a “squashed” linear function of the inputs:
         a_i ← g(in_i) = g(Σ_j W_{j,i} a_j)
     [Figure: a single unit. Input links carry activations a_j through weights W_{j,i} into the input function Σ, giving in_i; the activation function g produces the output a_i on the output links. A fixed input a_0 = −1 with bias weight W_{0,i} supplies the threshold.]

  5. Activation functions
     [Figure: two plots of g(in_i), ranging from 0 to +1: (a) a hard step, (b) a smooth S-curve]
     (a) is a step function or threshold function
     (b) is a sigmoid function 1/(1 + e^{−x})
     Changing the bias weight W_{0,i} moves the threshold location
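
As a concrete illustration of slides 4–5 (not part of the original deck), here is a minimal Python sketch of a McCulloch–Pitts unit with both activation functions; the function names are my own.

```python
import math

def step(x):
    """Threshold activation (a): fires iff the weighted input exceeds 0."""
    return 1 if x > 0 else 0

def sigmoid(x):
    """Sigmoid activation (b): 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def unit_output(weights, inputs, g=sigmoid):
    """McCulloch-Pitts unit: a_i = g(sum_j W_ji * a_j).
    By the slides' convention, inputs[0] is the fixed bias input a_0 = -1,
    so weights[0] is the bias weight W_0i."""
    in_i = sum(w_j * a_j for w_j, a_j in zip(weights, inputs))
    return g(in_i)
```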

  6. Implementing logical functions
     McCulloch and Pitts: every Boolean function can be implemented (with a large enough network)
     AND? OR? NOT? MAJORITY?

  7. Implementing logical functions
     McCulloch and Pitts: every Boolean function can be implemented (with a large enough network)
         AND:  W_0 = 1.5,  W_1 = 1,  W_2 = 1
         OR:   W_0 = 0.5,  W_1 = 1,  W_2 = 1
         NOT:  W_0 = −0.5, W_1 = −1
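
A quick check (my own, not from the slides) that the weights above implement the three gates, using a step unit with the fixed bias input a_0 = −1:

```python
def threshold_unit(weights, inputs):
    """Step unit: fires iff W_0 * (-1) + sum_j W_j * x_j > 0."""
    in_i = sum(w * a for w, a in zip(weights, [-1] + inputs))
    return 1 if in_i > 0 else 0

AND = [1.5, 1, 1]    # [W_0, W_1, W_2] from the slide
OR  = [0.5, 1, 1]
NOT = [-0.5, -1]

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "AND:", threshold_unit(AND, [x1, x2]),
              "OR:", threshold_unit(OR, [x1, x2]))
    print("NOT", x1, "=", threshold_unit(NOT, [x1]))
```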

  8. Network structures
     Feed-forward networks:
     – single-layer perceptrons
     – multi-layer networks
     Feed-forward networks implement functions, have no internal state
     Recurrent networks:
     – Hopfield networks have symmetric weights (W_{i,j} = W_{j,i}), g(x) = sign(x), a_i = ±1; holographic associative memory
     – Boltzmann machines use stochastic activation functions, ≈ MCMC in BNs
     – recurrent neural nets have directed cycles with delays ⇒ have internal state (like flip-flops), can oscillate etc.

  9. Feed-forward example
     [Figure: network with input units 1 and 2, hidden units 3 and 4, output unit 5; weights W_{1,3}, W_{1,4}, W_{2,3}, W_{2,4}, W_{3,5}, W_{4,5}]
     Feed-forward network = a parameterized family of nonlinear functions:
         a_5 = g(W_{3,5} · a_3 + W_{4,5} · a_4)
             = g(W_{3,5} · g(W_{1,3} · a_1 + W_{2,3} · a_2) + W_{4,5} · g(W_{1,4} · a_1 + W_{2,4} · a_2))
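
Translating the slide's expression directly into Python (a sketch; the dictionary layout for the weights is my own choice):

```python
import math

def g(x):
    return 1.0 / (1.0 + math.exp(-x))

def feed_forward(a1, a2, W):
    """The 2-input, 2-hidden-unit, 1-output network above.
    W maps (source, destination) pairs to weight values."""
    a3 = g(W[1, 3] * a1 + W[2, 3] * a2)
    a4 = g(W[1, 4] * a1 + W[2, 4] * a2)
    return g(W[3, 5] * a3 + W[4, 5] * a4)
```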

  10. Perceptrons
      [Figure: a single-layer perceptron network, input units connected directly to output units by weights W_{j,i}, and a surface plot of the output of a two-input perceptron over (x_1, x_2)]

  11. Expressiveness of perceptrons
      Consider a perceptron with g = step function (Rosenblatt, 1957, 1960)
      Can represent AND, OR, NOT, majority, etc.
      Represents a linear separator in input space:
          Σ_j W_j x_j > 0   or   W · x > 0
      [Figure: (a) I_1 and I_2 and (b) I_1 or I_2 are linearly separable; (c) I_1 xor I_2 is not]
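
To spell out why panel (c) has no linear separator (an argument added here, not on the slide): suppose a step unit computed XOR with in = W_1 x_1 + W_2 x_2 − W_0. Then

    (0,0) ↦ 0 requires −W_0 ≤ 0, i.e. W_0 ≥ 0
    (0,1) ↦ 1 requires W_2 > W_0
    (1,0) ↦ 1 requires W_1 > W_0
    (1,1) ↦ 0 requires W_1 + W_2 ≤ W_0

But W_1 > W_0 ≥ 0 and W_2 > W_0 ≥ 0 give W_1 + W_2 > W_0, a contradiction.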

  12. Perceptron learning
      Learn by adjusting weights to reduce error on training set
      The squared error for an example with input x and true output y is
          E = ½ Err² ≡ ½ (y − h_W(x))²

  13. Perceptron learning
      Learn by adjusting weights to reduce error on training set
      The squared error for an example with input x and true output y is
          E = ½ Err² ≡ ½ (y − h_W(x))²
      Perform optimization search by gradient descent:
          ∂E/∂W_j = ?

  14. Perceptron learning
      Learn by adjusting weights to reduce error on training set
      The squared error for an example with input x and true output y is
          E = ½ Err² ≡ ½ (y − h_W(x))²
      Perform optimization search by gradient descent:
          ∂E/∂W_j = Err × ∂Err/∂W_j = Err × ∂/∂W_j (y − g(Σ_{j=0}^n W_j x_j))

  15. Perceptron learning
      Learn by adjusting weights to reduce error on training set
      The squared error for an example with input x and true output y is
          E = ½ Err² ≡ ½ (y − h_W(x))²
      Perform optimization search by gradient descent:
          ∂E/∂W_j = Err × ∂Err/∂W_j = Err × ∂/∂W_j (y − g(Σ_{j=0}^n W_j x_j))
                  = −Err × g′(in) × x_j

  16. Perceptron learning
      Learn by adjusting weights to reduce error on training set
      The squared error for an example with input x and true output y is
          E = ½ Err² ≡ ½ (y − h_W(x))²
      Perform optimization search by gradient descent:
          ∂E/∂W_j = Err × ∂Err/∂W_j = Err × ∂/∂W_j (y − g(Σ_{j=0}^n W_j x_j))
                  = −Err × g′(in) × x_j
      Simple weight update rule:
          W_j ← W_j + α × Err × g′(in) × x_j
      E.g., +ve error ⇒ increase network output ⇒ increase weights on +ve inputs, decrease on −ve inputs

  17. Perceptron learning
      W = random initial values
      for iter = 1 to T
          for i = 1 to N (all examples)
              x = input for example i
              y = output for example i
              W_old = W
              Err = y − g(W_old · x)
              for j = 1 to M (all weights)
                  W_j = W_j + α · Err · g′(W_old · x) · x_j
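
The pseudocode above translates almost line for line into Python. This sketch (mine, with placeholder hyperparameters) uses the sigmoid and the derivative identity derived on the next two slides:

```python
import math
import random

def g(x):
    return 1.0 / (1.0 + math.exp(-x))

def g_prime(x):
    s = g(x)
    return s * (1.0 - s)   # g'(x) = g(x)(1 - g(x)); derived on slide 19

def train_perceptron(examples, n_weights, alpha=0.1, T=100):
    """examples: list of (x, y) pairs, each x a list of n_weights numbers
    (include the fixed bias component x_0 = -1 in x yourself)."""
    W = [random.uniform(-0.5, 0.5) for _ in range(n_weights)]
    for _ in range(T):                                  # for iter = 1 to T
        for x, y in examples:                           # for i = 1 to N
            in_old = sum(w * xj for w, xj in zip(W, x)) # W_old . x
            err = y - g(in_old)
            W = [w + alpha * err * g_prime(in_old) * xj # for j = 1 to M
                 for w, xj in zip(W, x)]
    return W
```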

  18. Perceptron learning contd.
      Derivative of sigmoid g(x) can be written in simple form:
          g(x) = 1/(1 + e^{−x})
          g′(x) = ?

  19. Perceptron learning contd.
      Derivative of sigmoid g(x) can be written in simple form:
          g(x) = 1/(1 + e^{−x})
          g′(x) = e^{−x}/(1 + e^{−x})² = e^{−x} g(x)²
      Also, g(x)(1 + e^{−x}) = 1 ⇒ g(x) + e^{−x} g(x) = 1 ⇒ e^{−x} = (1 − g(x))/g(x)
      So g′(x) = ((1 − g(x))/g(x)) g(x)² = (1 − g(x)) g(x)
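
A quick numeric sanity check of the identity g′(x) = (1 − g(x)) g(x) (my own addition, not in the deck):

```python
import math

def g(x):
    return 1.0 / (1.0 + math.exp(-x))

h = 1e-6
for x in (-2.0, 0.0, 3.0):
    numeric = (g(x + h) - g(x)) / h   # finite-difference estimate of g'(x)
    closed  = (1.0 - g(x)) * g(x)     # closed form from the slide
    print(f"x={x:+.1f}  numeric={numeric:.6f}  closed-form={closed:.6f}")
```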

  20. Perceptron learning contd.
      Perceptron learning rule converges to a consistent function for any linearly separable data set
      [Figure: two learning curves, proportion correct on test set vs. training set size (0–100): the perceptron beats the decision tree on MAJORITY on 11 inputs; the decision tree beats the perceptron on the RESTAURANT data]

  21. Multilayer networks
      Layers are usually fully connected; numbers of hidden units typically chosen by hand
      [Figure: output units a_i, weights W_{j,i}, hidden units a_j, weights W_{k,j}, input units a_k]

  22. Expressiveness of MLPs
      All continuous functions w/ 1 hidden layer, all functions w/ 2 hidden layers
      [Figure: two surface plots of the network output h_W(x_1, x_2)]

  23. Training an MLP
      In general have n output nodes, E ≡ ½ Σ_i Err_i², where Err_i = (y_i − a_i) and i runs over all nodes in the output layer.
      Need to calculate ∂E/∂W_{ij} for any W_{ij}.

  24. Training an MLP contd.
      Can approximate derivatives by:
          f′(x) ≈ (f(x + h) − f(x))/h
          ∂E(W)/∂W_{ij} ≈ (E(W + (0, …, h, …, 0)) − E(W))/h
      What would this entail for a network with n weights?

  25. Training an MLP contd.
      Can approximate derivatives by:
          f′(x) ≈ (f(x + h) − f(x))/h
          ∂E(W)/∂W_{ij} ≈ (E(W + (0, …, h, …, 0)) − E(W))/h
      What would this entail for a network with n weights?
      - one iteration would take O(n²) time, since each of the n weights needs its own evaluation of E, and each evaluation is itself O(n)
      Complicated networks have tens of thousands of weights; O(n²) time is intractable.
      Back-propagation is a recursive method of calculating all of these derivatives in O(n) time.
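
The O(n²) cost is visible in a direct implementation of the finite-difference idea (a sketch, assuming E is a function taking the full weight vector as a list):

```python
def finite_difference_gradient(E, W, h=1e-6):
    """Approximate dE/dW_k for every weight: n evaluations of E, each of
    which is itself O(n) for a network with n weights, so O(n^2) in total.
    This is exactly the cost that back-propagation avoids."""
    base = E(W)
    grad = []
    for k in range(len(W)):
        W_bumped = list(W)
        W_bumped[k] += h
        grad.append((E(W_bumped) - base) / h)
    return grad
```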

  26. Back-propagation learning
      In general have n output nodes, E ≡ ½ Σ_i Err_i², where Err_i = (y_i − a_i) and i runs over all nodes in the output layer.
      Output layer: same as for single-layer perceptron,
          W_{j,i} ← W_{j,i} + α × a_j × ∆_i   where ∆_i = Err_i × g′(in_i)
      Hidden layers: back-propagate the error from the output layer:
          ∆_j = g′(in_j) Σ_i W_{j,i} ∆_i
      Update rule for weights in hidden layers:
          W_{k,j} ← W_{k,j} + α × a_k × ∆_j
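
Putting the four formulas together for a network with one hidden layer (a sketch under my own naming conventions, with weights stored as nested lists):

```python
import math

def g(x):
    return 1.0 / (1.0 + math.exp(-x))

def backprop_step(x, y, W_kj, W_ji, alpha=0.1):
    """One back-propagation update for a single-hidden-layer network.
    W_kj[k][j] is the weight from input k to hidden unit j;
    W_ji[j][i] is the weight from hidden unit j to output unit i.
    x and y are the input and target vectors for one example."""
    n_hidden, n_out = len(W_kj[0]), len(W_ji[0])
    # Forward pass.
    in_j = [sum(W_kj[k][j] * x[k] for k in range(len(x))) for j in range(n_hidden)]
    a_j = [g(v) for v in in_j]
    in_i = [sum(W_ji[j][i] * a_j[j] for j in range(n_hidden)) for i in range(n_out)]
    a_i = [g(v) for v in in_i]
    # Output layer: Delta_i = Err_i * g'(in_i), using g'(x) = g(x)(1 - g(x)).
    d_i = [(y[i] - a_i[i]) * a_i[i] * (1 - a_i[i]) for i in range(n_out)]
    # Hidden layer: Delta_j = g'(in_j) * sum_i W_ji * Delta_i.
    d_j = [a_j[j] * (1 - a_j[j]) * sum(W_ji[j][i] * d_i[i] for i in range(n_out))
           for j in range(n_hidden)]
    # Weight updates: W_ji += alpha * a_j * Delta_i;  W_kj += alpha * a_k * Delta_j.
    for j in range(n_hidden):
        for i in range(n_out):
            W_ji[j][i] += alpha * a_j[j] * d_i[i]
    for k in range(len(x)):
        for j in range(n_hidden):
            W_kj[k][j] += alpha * x[k] * d_j[j]
    return W_kj, W_ji
```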

  27. Back-propagation derivation
      For a node i in the output layer:
          ∂E/∂W_{j,i} = −(y_i − a_i) ∂a_i/∂W_{j,i}

  28. Back-propagation derivation
      For a node i in the output layer:
          ∂E/∂W_{j,i} = −(y_i − a_i) ∂a_i/∂W_{j,i} = −(y_i − a_i) ∂g(in_i)/∂W_{j,i}

  29. Back-propagation derivation
      For a node i in the output layer:
          ∂E/∂W_{j,i} = −(y_i − a_i) ∂a_i/∂W_{j,i} = −(y_i − a_i) ∂g(in_i)/∂W_{j,i}
                      = −(y_i − a_i) g′(in_i) ∂in_i/∂W_{j,i}

  30. Back-propagation derivation
      For a node i in the output layer:
          ∂E/∂W_{j,i} = −(y_i − a_i) ∂a_i/∂W_{j,i} = −(y_i − a_i) ∂g(in_i)/∂W_{j,i}
                      = −(y_i − a_i) g′(in_i) ∂in_i/∂W_{j,i}
                      = −(y_i − a_i) g′(in_i) ∂/∂W_{j,i} (Σ_k W_{k,i} a_k)

  31. Back-propagation derivation
      For a node i in the output layer:
          ∂E/∂W_{j,i} = −(y_i − a_i) ∂a_i/∂W_{j,i} = −(y_i − a_i) ∂g(in_i)/∂W_{j,i}
                      = −(y_i − a_i) g′(in_i) ∂in_i/∂W_{j,i}
                      = −(y_i − a_i) g′(in_i) ∂/∂W_{j,i} (Σ_k W_{k,i} a_k)
                      = −(y_i − a_i) g′(in_i) a_j = −a_j ∆_i
      where ∆_i = (y_i − a_i) g′(in_i)

  32. Back-propagation derivation: hidden layer
      For a node j in a hidden layer:
          ∂E/∂W_{k,j} = ?

  33. “Reminder”: chain rule for partial derivatives
      For f(x, y), with f differentiable wrt x and y, and x and y differentiable wrt u and v:
          ∂f/∂u = (∂f/∂x)(∂x/∂u) + (∂f/∂y)(∂y/∂u)
      and
          ∂f/∂v = (∂f/∂x)(∂x/∂v) + (∂f/∂y)(∂y/∂v)
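
Applying this chain rule answers the question on slide 32 (a sketch of the derivation, added here; it follows the same pattern as the output-layer case). The weight W_{k,j} affects E only through a_j = g(in_j), and a_j in turn feeds every output node i, so

    ∂E/∂W_{k,j} = Σ_i (−(y_i − a_i)) g′(in_i) ∂in_i/∂W_{k,j}
                = Σ_i (−∆_i) W_{j,i} ∂a_j/∂W_{k,j}
                = −(Σ_i W_{j,i} ∆_i) g′(in_j) a_k
                = −a_k ∆_j

which recovers the hidden-layer rule ∆_j = g′(in_j) Σ_i W_{j,i} ∆_i stated on slide 26.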
