Neural networks
Slides adapted from Stuart Russell


  1. Neural networks. Slides adapted from Stuart Russell.

  2. Brains: ~10^11 neurons of > 20 types, ~10^14 synapses, 1ms–10ms cycle time. Signals are noisy “spike trains” of electrical potential. [Figure: a neuron, showing the cell body (soma), nucleus, dendrites, axon, axonal arborization, and synapses to other cells.]

  3. McCulloch–Pitts “unit”: output is a “squashed” linear function of the inputs: a_i ← g(in_i) = g(Σ_j W_{j,i} a_j), where a_0 = −1 is a fixed bias input with bias weight W_{0,i}. [Figure: unit diagram with input links a_j, weights W_{j,i}, weighted sum in_i, activation function g, and output a_i.] A gross oversimplification of real neurons, but its purpose is to develop understanding of what networks of simple units can do.
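As a minimal sketch, the unit above can be written in a few lines of Python, assuming a sigmoid g and the slide's convention of a fixed bias input a_0 = −1:

```python
import math

def unit_output(weights, inputs):
    """McCulloch-Pitts unit: a "squashed" linear function of the inputs.
    weights[0] is the bias weight W_{0,i}, paired with a fixed input a_0 = -1."""
    in_i = weights[0] * -1 + sum(w * a for w, a in zip(weights[1:], inputs))
    return 1.0 / (1.0 + math.exp(-in_i))  # sigmoid activation g

# With zero bias weight and zero inputs, in_i = 0, so g(0) = 0.5
print(unit_output([0.0, 1.0, 1.0], [0.0, 0.0]))  # 0.5
```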

  4. Activation functions: (a) a step function or threshold function; (b) the sigmoid function 1/(1 + e^−x). Changing the bias weight W_{0,i} moves the threshold location. [Figure: plots of g(in_i) for (a) and (b).]
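The two activation functions from the plot are straightforward to write down (the threshold of 0 is the slide's convention, with the bias folded into the weights):

```python
import math

def step(in_i):
    """(a) Step / threshold function: fires iff the weighted input is non-negative."""
    return 1.0 if in_i >= 0.0 else 0.0

def sigmoid(in_i):
    """(b) Sigmoid function 1 / (1 + e^-x): a smooth, differentiable squashing."""
    return 1.0 / (1.0 + math.exp(-in_i))
```

The sigmoid's differentiability is what makes gradient-based learning (back-propagation, later in these slides) possible.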

  5. Network structures. Feed-forward networks (single-layer perceptrons, multi-layer perceptrons) implement functions and have no internal state. Recurrent networks have directed cycles with delays ⇒ they have internal state (like flip-flops), can oscillate, etc.

  6. Feed-forward example. [Figure: a 2-2-1 network with inputs 1, 2, hidden units 3, 4, output 5, and weights W_{1,3}, W_{1,4}, W_{2,3}, W_{2,4}, W_{3,5}, W_{4,5}.] A feed-forward network is a parameterized family of nonlinear functions: a_5 = g(W_{3,5}·a_3 + W_{4,5}·a_4) = g(W_{3,5}·g(W_{1,3}·a_1 + W_{2,3}·a_2) + W_{4,5}·g(W_{1,4}·a_1 + W_{2,4}·a_2)). Adjusting weights changes the function: do learning this way!
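The nested expression for a_5 can be computed directly; a sketch assuming a sigmoid g and a plain dict of weights keyed "jk" for W_{j,k}:

```python
import math

def g(x):
    """Sigmoid activation, as in the earlier slides."""
    return 1.0 / (1.0 + math.exp(-x))

def a5(a1, a2, W):
    """Output of the 2-2-1 feed-forward network as a nested nonlinear function."""
    a3 = g(W["13"] * a1 + W["23"] * a2)   # hidden unit 3
    a4 = g(W["14"] * a1 + W["24"] * a2)   # hidden unit 4
    return g(W["35"] * a3 + W["45"] * a4)  # output unit 5
```

With all weights zero, every unit sees input 0, so the output is g(0) = 0.5; any change to the weights changes the function the network computes.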

  7. Single-layer perceptrons. [Figure: input units connected directly to output units by weights W_{j,i}; surface plot of perceptron output over (x_1, x_2).] Adjusting weights moves the location, orientation, and steepness of the cliff.

  8. Expressiveness of perceptrons. Consider a perceptron with g = step function (Rosenblatt, 1957, 1960). It represents a linear separator in input space: Σ_j W_j x_j > 0, or W·x > 0. It can represent AND, OR, NOT, majority, etc.: AND (W_0 = 1.5, W_1 = 1, W_2 = 1), OR (W_0 = 0.5, W_1 = 1, W_2 = 1), NOT (W_0 = −0.5, W_1 = −1). But not XOR, which is not linearly separable. [Figure: (a) x_1 and x_2 and (b) x_1 or x_2 are linearly separable in the (x_1, x_2) plane; (c) x_1 xor x_2 is not.]
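The logic gates above follow directly from the slide's weights, reading W_0 as the firing threshold (equivalently, the bias weight on the fixed input a_0 = −1):

```python
def perceptron(W0, weights, inputs):
    """Step perceptron: fires iff sum_j W_j x_j exceeds the threshold W_0."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) > W0 else 0

AND = lambda x1, x2: perceptron(1.5, [1, 1], [x1, x2])
OR  = lambda x1, x2: perceptron(0.5, [1, 1], [x1, x2])
NOT = lambda x1:     perceptron(-0.5, [-1], [x1])

# XOR has no such weights: no single line separates {(0,1),(1,0)} from {(0,0),(1,1)}
```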

  9. Multilayer perceptrons. Layers are usually fully connected; numbers of hidden units are typically chosen by hand. [Figure: input units a_k, weights W_{k,j}, hidden units a_j, weights W_{j,i}, output units a_i.]

  10. Expressiveness of MLPs. All continuous functions with 2 layers; all functions with 3 layers. Combine two opposite-facing threshold functions to make a ridge; combine two perpendicular ridges to make a bump; add bumps of various sizes and locations to fit any surface. The proof requires exponentially many hidden units. [Figure: surface plots of h_W(x_1, x_2) showing a ridge and a bump.]

  11. Back-propagation learning. At each epoch, sum the gradient updates for all examples and apply them. [Figure: training curve, total error on the training set vs. number of epochs, for 100 restaurant examples; it finds an exact fit.] Typical problems: slow convergence, local minima.
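The per-epoch rule above ("sum gradient updates for all examples, then apply") is just batch gradient descent; a sketch for a scalar parameter, with a hypothetical per-example gradient function:

```python
def batch_epoch(theta, data, grad_example, eta):
    """One epoch of batch learning: sum the gradient over all training
    examples, then apply a single update with learning rate eta."""
    total_grad = sum(grad_example(x, y, theta) for x, y in data)
    return theta - eta * total_grad

# Toy per-example loss (theta - y)^2 with gradient 2*(theta - y) (illustrative only)
grad_example = lambda x, y, theta: 2.0 * (theta - y)
```

Repeating `batch_epoch` over many epochs traces out a training curve like the one in the figure.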

  12. Handwritten digit recognition. 3-nearest-neighbor = 2.4% error; 400–300–10 unit MLP = 1.6% error; LeNet (1998): 768–192–30–10 unit MLP = 0.9% error; SVMs: ≈ 0.6% error; current best: 0.24% error (committee of convolutional nets).

  13. Example: ALVINN, a neural network that learns to predict the steering direction [Pomerleau, 1995].

  14. Backpropagation. Slides adapted from Kyunghyun Cho.

  15. Learning as an Optimization. Ultimately, learning is (mostly) θ̂ = arg min_θ (1/N) Σ_{n=1}^N c((x_n, y_n) | θ) + λ Ω(θ, D), where c((x, y) | θ) is a per-sample cost function.
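The objective can be written out directly; a sketch where the cost function, the regularizer Ω, and λ are all passed in (the concrete `cost` and `ridge` below are illustrative choices, not part of the slide):

```python
def objective(theta, data, cost, regularizer, lam):
    """Regularized empirical risk:
    (1/N) * sum_n c((x_n, y_n) | theta) + lambda * Omega(theta, D)."""
    N = len(data)
    return sum(cost(x, y, theta) for x, y in data) / N + lam * regularizer(theta, data)

# Hypothetical instantiation: squared error for a 1-parameter linear model,
# with an L2 (ridge) regularizer.
cost = lambda x, y, theta: (y - theta * x) ** 2
ridge = lambda theta, data: theta ** 2
```

Learning then means searching for the θ that minimizes `objective`, which is what the gradient methods on the next slides do.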

  16. Gradient Descent. Gradient-descent algorithm: θ_t = θ_{t−1} − η ∇L(θ_{t−1}), where, in our case, L(θ) = (1/N) Σ_{n=1}^N l((x_n, y_n) | θ). Let us assume that Ω(θ, D) = 0.
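A minimal sketch of the update rule for a scalar θ, shown on an assumed toy loss L(θ) = (θ − 3)², whose gradient is 2(θ − 3):

```python
def gradient_descent(grad, theta0, eta=0.1, steps=100):
    """Iterate theta_t = theta_{t-1} - eta * grad L(theta_{t-1})."""
    theta = theta0
    for _ in range(steps):
        theta = theta - eta * grad(theta)
    return theta

# L(theta) = (theta - 3)^2 is minimized at theta = 3
theta_star = gradient_descent(lambda t: 2.0 * (t - 3.0), theta0=0.0)
```

Each step contracts the distance to the minimizer by a factor (1 − 2η) here, so 100 steps land very close to 3.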

  17. Stochastic Gradient Descent. Often it is too costly to compute L(θ) due to a large training set. Stochastic gradient descent algorithm: θ_t = θ_{t−1} − η_t ∇l((x′, y′) | θ_{t−1}), where (x′, y′) is a randomly chosen sample from D, and Σ_{t=1}^∞ η_t = ∞ and Σ_{t=1}^∞ η_t² < ∞. Let us assume that Ω(θ, D) = 0.
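A sketch of the update loop, using η_t = 1/t (which satisfies both step-size conditions: the sum of η_t diverges while the sum of η_t² converges); the per-sample loss in the test is an assumed toy, l((x, y) | θ) = (θ − y)²:

```python
import random

def sgd(grad_sample, data, theta0, steps=100, seed=0):
    """theta_t = theta_{t-1} - eta_t * grad l((x', y') | theta_{t-1}),
    with (x', y') drawn at random from the training set and eta_t = 1/t."""
    rng = random.Random(seed)
    theta = theta0
    for t in range(1, steps + 1):
        x, y = rng.choice(data)          # one randomly chosen sample from D
        theta = theta - (1.0 / t) * grad_sample(x, y, theta)
    return theta
```

Each step uses one example's gradient instead of the full sum over N, which is the whole point when N is large.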

  18. Almost there... How do we compute the gradient efficiently for neural networks?

  19. Backpropagation Algorithm – (1) Forward Pass. Forward computation: L(f(h_1(x_1, x_2, θ_{h_1}), h_2(x_1, x_2, θ_{h_2}), θ_f), y). Multilayer perceptron with a single hidden layer: L(x, y, θ) = (1/2) ‖y − U^⊤ φ(W^⊤ x)‖². [Figure: computation graph of the forward pass.]
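The single-hidden-layer loss above can be computed in a few lines; a sketch assuming φ = tanh (the slide does not fix φ) and plain nested lists for the matrices W (inputs × hidden) and U (hidden × outputs):

```python
import math

def forward(x, y, W, U):
    """Forward pass of L(x, y, theta) = 1/2 * || y - U^T phi(W^T x) ||^2,
    with phi = tanh applied elementwise (an assumed choice of nonlinearity)."""
    # hidden activations: phi(W^T x)
    h = [math.tanh(sum(W[i][j] * x[i] for i in range(len(x))))
         for j in range(len(W[0]))]
    # output: U^T h
    out = [sum(U[j][k] * h[j] for j in range(len(h))) for k in range(len(U[0]))]
    # squared-error loss
    return 0.5 * sum((yk - ok) ** 2 for yk, ok in zip(y, out))
```

With all-zero weights the network outputs zero, so the loss is simply half the squared norm of y.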

  20. Backpropagation Algorithm – (2) Chain Rule. Chain rule of derivatives: ∂L/∂x_1 = (∂L/∂f)(∂f/∂x_1) = (∂L/∂f)(∂f/∂h_1 · ∂h_1/∂x_1 + ∂f/∂h_2 · ∂h_2/∂x_1). [Figure: computation graph annotated with the chain-rule paths through h_1 and h_2.]

  21. Backpropagation Algorithm – (3) Shared Derivatives. Local derivatives are shared: ∂L/∂x_1 = (∂L/∂f)(∂f/∂h_1 · ∂h_1/∂x_1 + ∂f/∂h_2 · ∂h_2/∂x_1) and ∂L/∂x_2 = (∂L/∂f)(∂f/∂h_1 · ∂h_1/∂x_2 + ∂f/∂h_2 · ∂h_2/∂x_2) reuse the same ∂L/∂f, ∂f/∂h_1, and ∂f/∂h_2. [Figure: computation graph with the shared local derivatives highlighted.]
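The sharing is easiest to see on a concrete graph. The functions below are illustrative choices (the slide leaves h_1, h_2, f abstract): h_1 = x_1·x_2, h_2 = x_1 + x_2, f = h_1·h_2, L = f².

```python
def grads(x1, x2):
    """Chain rule with shared local derivatives for a toy graph (assumed forms):
    h1 = x1*x2, h2 = x1 + x2, f = h1*h2, L = f**2."""
    h1, h2 = x1 * x2, x1 + x2
    f = h1 * h2
    dL_df = 2.0 * f                 # computed once ...
    df_dh1, df_dh2 = h2, h1         # ... and shared by both input gradients
    dL_dx1 = dL_df * (df_dh1 * x2 + df_dh2 * 1.0)   # dh1/dx1 = x2, dh2/dx1 = 1
    dL_dx2 = dL_df * (df_dh1 * x1 + df_dh2 * 1.0)   # dh1/dx2 = x1, dh2/dx2 = 1
    return dL_dx1, dL_dx2
```

Both ∂L/∂x_1 and ∂L/∂x_2 reuse `dL_df`, `df_dh1`, and `df_dh2` rather than recomputing them, which is exactly the saving back-propagation exploits.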

  22. Backpropagation Algorithm – (4) Local Computation. Each node computes: forward, h(a_1, a_2, ..., a_q); backward, ∂h/∂a_1, ∂h/∂a_2, ..., ∂h/∂a_q. [Figure: a single node in the computation graph with its incoming values and outgoing derivatives.]
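A node's local forward/backward contract can be sketched as a small class; multiplication is used as an assumed example of h:

```python
class MulNode:
    """A graph node with h(a1, a2) = a1 * a2.
    Forward caches its inputs; backward returns the incoming gradient
    scaled by the local derivatives dh/da1 = a2 and dh/da2 = a1."""
    def forward(self, a1, a2):
        self.a1, self.a2 = a1, a2
        return a1 * a2
    def backward(self, grad_out):
        return grad_out * self.a2, grad_out * self.a1
```

A whole network is then just such nodes wired together: run every `forward` in topological order, then every `backward` in reverse.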

  23. Backpropagation Algorithm – Requirements: each node computes a differentiable function [1], and the nodes form a directed acyclic graph [2]. [1] Well...? [2] Well...?

  24. Backpropagation Algorithm – Automatic Differentiation. A generalized approach to computing partial derivatives. As long as your neural network fits the requirements, you do not need to derive the derivatives yourself! Theano, Torch, ...

