Neural networks
Slides adapted from Stuart Russell
Brains
10^11 neurons of > 20 types, 10^14 synapses, 1ms–10ms cycle time. Signals are noisy “spike trains” of electrical potential.
[Figure: a biological neuron, showing the cell body (soma) with nucleus, dendrites, axon with axonal arborization, and synapses connecting to axons from other cells]
McCulloch–Pitts “unit”
Output is a “squashed” linear function of the inputs: ai ← g(ini) = g(Σj Wj,i aj)
[Figure: a unit. Input links aj with weights Wj,i, plus a fixed bias input a0 with bias weight W0,i, feed the input function ini = Σj Wj,i aj; the activation function g then produces the output ai = g(ini)]
A gross oversimplification of real neurons, but its purpose is to develop understanding of what networks of simple units can do
Activation functions
[Figure: plots of two activation functions g(ini)]
(a) is a step function or threshold function; (b) is a sigmoid function 1/(1 + e^−x). Changing the bias weight W0,i moves the threshold location.
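For concreteness, here is a minimal sketch (not part of the original slides) of both activation functions and of a unit that applies one of them to a weighted sum; the weights and inputs are arbitrary illustrative values:

```python
import numpy as np

def step(x):
    """Threshold activation: 1 if the input is non-negative, else 0."""
    return np.where(x >= 0, 1.0, 0.0)

def sigmoid(x):
    """Smooth "squashing" activation: 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + np.exp(-x))

def unit_output(weights, inputs, g=sigmoid):
    """McCulloch-Pitts-style unit: a_i = g(sum_j W_j,i * a_j)."""
    in_i = np.dot(weights, inputs)
    return g(in_i)

# Illustrative weights; the first input plays the role of a fixed bias input.
weights = np.array([-0.5, 0.6, 0.8])
inputs = np.array([1.0, 1.0, 0.5])
print(unit_output(weights, inputs, g=step))     # 1.0
print(unit_output(weights, inputs, g=sigmoid))  # ~0.62
```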
Network structures
Feed-forward networks:
– single-layer perceptrons
– multi-layer perceptrons
Feed-forward networks implement functions, have no internal state
Recurrent networks:
– recurrent neural nets have directed cycles with delays
⇒ have internal state (like flip-flops), can oscillate etc.
Feed-forward example
[Figure: feed-forward network with input units 1 and 2, hidden units 3 and 4, output unit 5, and weights W1,3, W1,4, W2,3, W2,4, W3,5, W4,5]
Feed-forward network = a parameterized family of nonlinear functions:
a5 = g(W3,5 · a3 + W4,5 · a4)
   = g(W3,5 · g(W1,3 · a1 + W2,3 · a2) + W4,5 · g(W1,4 · a1 + W2,4 · a2))
Adjusting weights changes the function: do learning this way!
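A minimal sketch of this particular 2-input, 2-hidden-unit, 1-output network; the weight values are arbitrary placeholders rather than learned values:

```python
import numpy as np

def g(x):
    """Sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-x))

def network(a1, a2, W):
    """Compute a5 for the network in the figure.
    W is a dict of weights keyed by (source unit, destination unit)."""
    a3 = g(W[(1, 3)] * a1 + W[(2, 3)] * a2)
    a4 = g(W[(1, 4)] * a1 + W[(2, 4)] * a2)
    a5 = g(W[(3, 5)] * a3 + W[(4, 5)] * a4)
    return a5

# Arbitrary example weights; learning would adjust these values.
W = {(1, 3): 0.5, (2, 3): -0.3, (1, 4): 0.8, (2, 4): 0.1,
     (3, 5): 1.2, (4, 5): -0.7}
print(network(1.0, 0.0, W))
```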
Single-layer perceptrons
[Figure: single-layer perceptron (input units connected directly to output units by weights Wj,i) and a surface plot of the perceptron output as a function of x1 and x2]
Adjusting weights moves the location, orientation, and steepness of cliff
Expressiveness of perceptrons
Consider a perceptron with g = step function (Rosenblatt, 1957, 1960).
Represents a linear separator in input space:
Σj Wj xj > 0, i.e. W · x > 0
Can represent AND, OR, NOT, majority, etc.:
AND: W0 = 1.5, W1 = 1, W2 = 1
OR: W0 = 0.5, W1 = 1, W2 = 1
NOT: W0 = –0.5, W1 = –1
But not XOR:
[Figure: linear separability in input space: (a) x1 and x2 and (b) x1 or x2 are linearly separable, but (c) x1 xor x2 is not]
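A small sketch checking the AND/OR/NOT weights above, assuming the convention that W0 acts as a threshold (equivalently, a bias weight attached to a fixed input), so the unit fires when W1·x1 + W2·x2 > W0:

```python
import itertools

def perceptron(weights, threshold, inputs):
    """Step-function perceptron: fires iff the weighted sum exceeds the threshold W0."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return int(total > threshold)

# Weights from the slide
AND = lambda x1, x2: perceptron([1, 1], 1.5, [x1, x2])
OR  = lambda x1, x2: perceptron([1, 1], 0.5, [x1, x2])
NOT = lambda x1:     perceptron([-1], -0.5, [x1])

for x1, x2 in itertools.product([0, 1], repeat=2):
    print(x1, x2, "AND:", AND(x1, x2), "OR:", OR(x1, x2))
print("NOT 0:", NOT(0), "NOT 1:", NOT(1))
# No single threshold unit can reproduce XOR's truth table.
```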
Multilayer perceptrons
Layers are usually fully connected; numbers of hidden units typically chosen by hand
[Figure: multilayer network with input units ak, hidden units aj (weights Wk,j), and output units ai (weights Wj,i)]
Expressiveness of MLPs
All continuous functions w/ 2 layers, all functions w/ 3 layers
[Figure: two surface plots of hW(x1, x2), a ridge and a bump]
Combine two opposite-facing threshold functions to make a ridge
Combine two perpendicular ridges to make a bump
Add bumps of various sizes and locations to fit any surface
Proof requires exponentially many hidden units
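A short sketch (with arbitrary offsets and steepness) of the construction: two opposite-facing sigmoids form a ridge, and two perpendicular ridges combine into a bump:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Grid of input points
x1, x2 = np.meshgrid(np.linspace(-4, 4, 100), np.linspace(-4, 4, 100))

# Two opposite-facing sigmoids along x1 make a ridge
ridge_x1 = sigmoid(5 * (x1 + 1)) - sigmoid(5 * (x1 - 1))

# A perpendicular ridge along x2, combined with the first, makes a bump
ridge_x2 = sigmoid(5 * (x2 + 1)) - sigmoid(5 * (x2 - 1))
bump = sigmoid(5 * (ridge_x1 + ridge_x2 - 1.5))

print(bump[50, 50], bump[0, 0])  # high near the centre, low far away
```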
Back-propagation learning
At each epoch, sum gradient updates for all examples and apply
Training curve for 100 restaurant examples: finds exact fit
[Figure: training curve, total error on the training set vs. number of epochs]
Typical problems: slow convergence, local minima
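A minimal sketch of the per-epoch batch update described above; the gradient function, learning rate, and epoch count are placeholders, since the slides do not give an implementation:

```python
import numpy as np

def batch_gradient_descent(grad, theta, examples, eta=0.1, epochs=400):
    """At each epoch, sum the gradient contributions of all examples,
    then apply a single update to the weights theta."""
    for _ in range(epochs):
        total_grad = np.zeros_like(theta)
        for x, y in examples:
            total_grad += grad(theta, x, y)   # per-example gradient
        theta = theta - eta * total_grad
    return theta
```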
Handwritten digit recognition
3-nearest-neighbor = 2.4% error
400–300–10 unit MLP = 1.6% error
LeNet (1998): 768–192–30–10 unit MLP = 0.9% error
SVMs: ≈ 0.6% error
Current best: 0.24% error (committee of convolutional nets)
[Pomerleau, 1995]
Slides adapted from Kyunghyun Cho
Ultimately, learning is (mostly)

θ = arg min_θ (1/N) Σ_{n=1}^N c((x_n, y_n) | θ) + λ Ω(θ, D),

where c((x, y) | θ) is a per-sample cost function and Ω(θ, D) is a regularization term.
Gradient descent algorithm:

θ_t = θ_{t−1} − η ∇L(θ_{t−1}),

where, in our case, L(θ) = (1/N) Σ_{n=1}^N l((x_n, y_n) | θ).
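As a tiny worked example of the update rule (the quadratic objective here is a stand-in, not from the slides), gradient descent on L(θ) = θ² with η = 0.1 converges to the minimizer θ = 0:

```python
theta = 5.0          # initial parameter
eta = 0.1            # learning rate
for t in range(50):
    grad = 2 * theta                 # dL/dtheta for L(theta) = theta^2
    theta = theta - eta * grad       # theta_t = theta_{t-1} - eta * grad L(theta_{t-1})
print(theta)  # close to the minimizer theta = 0
```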
Let us assume that Ω (θ, D) = 0.
Often, it is too costly to compute L(θ) due to a large training set.

Stochastic gradient descent algorithm:

θ_t = θ_{t−1} − η_t ∇l((x′, y′) | θ_{t−1}),

where (x′, y′) is a randomly chosen sample from D, and the learning rates satisfy

Σ_{t=1}^∞ η_t = ∞ and Σ_{t=1}^∞ η_t² < ∞

(the step sizes decay, but not too quickly).
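A minimal SGD sketch under these assumptions; the η_t = η_0/t schedule is one illustrative choice satisfying the two conditions, and grad_l is a placeholder for the per-sample gradient:

```python
import random

def sgd(grad_l, theta, data, eta0=0.5, steps=10000):
    """theta_t = theta_{t-1} - eta_t * grad l((x', y') | theta_{t-1}),
    with (x', y') drawn at random and eta_t = eta0 / t."""
    for t in range(1, steps + 1):
        x, y = random.choice(data)       # randomly chosen sample from D
        eta_t = eta0 / t                 # sum of eta_t diverges, sum of eta_t^2 converges
        theta = theta - eta_t * grad_l(theta, x, y)
    return theta
```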
How do we compute the gradient efficiently for neural networks?
Multilayer perceptron with a single hidden layer: the output is the composition f(h1(x1, x2, θ_h1), h2(x1, x2, θ_h2), θ_f), compared against the target y with squared error:

L(x, y, θ) = ½ (f(h1(x1, x2, θ_h1), h2(x1, x2, θ_h2), θ_f) − y)²

By the chain rule,

∂L/∂x1 = (∂L/∂f)(∂f/∂x1) = (∂L/∂f) [ (∂f/∂h1)(∂h1/∂x1) + (∂f/∂h2)(∂h2/∂x1) ]
∂L/∂x2 = (∂L/∂f) [ (∂f/∂h1)(∂h1/∂x2) + (∂f/∂h2)(∂h2/∂x2) ]
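A small numerical sketch of these chain-rule expressions, assuming sigmoid hidden units, a linear output, and squared error; the parameter values are arbitrary:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative parameters for h1, h2 (hidden units) and f (linear output)
th1 = np.array([0.4, -0.6])   # weights of h1 on (x1, x2)
th2 = np.array([0.3, 0.9])    # weights of h2 on (x1, x2)
thf = np.array([1.5, -2.0])   # weights of f on (h1, h2)

x1, x2, y = 1.0, 0.5, 1.0

# Forward pass
h1 = sigmoid(th1[0] * x1 + th1[1] * x2)
h2 = sigmoid(th2[0] * x1 + th2[1] * x2)
f = thf[0] * h1 + thf[1] * h2
L = 0.5 * (f - y) ** 2

# Backward pass (chain rule)
dL_df = f - y
df_dh1, df_dh2 = thf[0], thf[1]
dh1_dx1 = h1 * (1 - h1) * th1[0]
dh2_dx1 = h2 * (1 - h2) * th2[0]
dL_dx1 = dL_df * (df_dh1 * dh1_dx1 + df_dh2 * dh2_dx1)
print(L, dL_dx1)
```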
– Forward: h(a1, a2, . . . , aq)
– Backward: ∂h/∂a1, ∂h/∂a2, . . . , ∂h/∂aq
Requirements:
– each node computes a differentiable function
– the network is a directed acyclic graph
As long as your neural network fits the requirements, you do not need to derive the derivatives yourself! Automatic differentiation tools (Theano, Torch, . . .) do it for you.
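For example, PyTorch (a later member of the Torch family) computes such gradients automatically; this is a small illustrative sketch, not taken from the slides:

```python
import torch

# Inputs with gradient tracking enabled
x = torch.tensor([1.0, 0.5], requires_grad=True)

# A small differentiable computation graph (a DAG)
h1 = torch.sigmoid(0.4 * x[0] - 0.6 * x[1])
h2 = torch.sigmoid(0.3 * x[0] + 0.9 * x[1])
f = 1.5 * h1 - 2.0 * h2
loss = 0.5 * (f - 1.0) ** 2

loss.backward()   # backward pass: derivatives via the chain rule
print(x.grad)     # dL/dx1 and dL/dx2, no manual derivation needed
```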
Suppose you have a dictionary of words. The ith word in the dictionary is represented by an embedding:
w_i ∈ R^d
i.e. a d-dimensional vector, which is learnt! d is typically in the range 50 to 1000.
Similar words should have similar embeddings (share latent features).
Embeddings can be applied to symbols as well as words (e.g. Freebase nodes and edges).
Discuss later: we can also have embeddings of phrases, sentences, documents, or even other modalities such as images.
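A minimal sketch of an embedding table as a lookup matrix; the vocabulary, dimensionality, and random initialization are toy choices:

```python
import numpy as np

vocab = ["cat", "dog", "paris", "france"]   # toy dictionary
d = 8                                        # embedding dimension (typically 50-1000)

# One learnt d-dimensional vector per word, stored as rows of a matrix
rng = np.random.default_rng(0)
E = rng.normal(scale=0.1, size=(len(vocab), d))

def embed(word):
    """Look up the embedding w_i of the i-th dictionary word."""
    return E[vocab.index(word)]

# Similar words should end up with similar vectors after training,
# e.g. as measured by cosine similarity.
a, b = embed("cat"), embed("dog")
print(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```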
[Figure: example of an embedding of 115 countries (Bordes et al., ’11)]
Convolutional Neural Networks for Sentence Classification
Collobert-Weston style CNN with pre-trained embeddings from word2vec
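A rough sketch of this kind of model (convolution over word embeddings, max-over-time pooling, then a linear classifier); the layer sizes are placeholders and loading actual word2vec vectors is left out, so this is not a reference implementation:

```python
import torch
import torch.nn as nn

class SentenceCNN(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, n_filters=100,
                 kernel_size=3, n_classes=2, pretrained=None):
        super().__init__()
        # Embedding table; pretrained would be a (vocab_size, emb_dim)
        # matrix of word2vec vectors if available.
        self.emb = nn.Embedding(vocab_size, emb_dim)
        if pretrained is not None:
            self.emb.weight.data.copy_(pretrained)
        # Convolution over the time (word position) dimension
        self.conv = nn.Conv1d(emb_dim, n_filters, kernel_size)
        self.out = nn.Linear(n_filters, n_classes)

    def forward(self, word_ids):                # word_ids: (batch, seq_len)
        x = self.emb(word_ids)                  # (batch, seq_len, emb_dim)
        x = x.transpose(1, 2)                   # (batch, emb_dim, seq_len)
        x = torch.relu(self.conv(x))            # (batch, n_filters, seq_len')
        x = x.max(dim=2).values                 # max-over-time pooling
        return self.out(x)                      # class scores

# Toy usage: batch of 2 sentences, each 7 word ids from a 5000-word vocab
model = SentenceCNN(vocab_size=5000)
scores = model(torch.randint(0, 5000, (2, 7)))
print(scores.shape)   # torch.Size([2, 2])
```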