Introduction to Neural Networks David Stutz - PowerPoint PPT Presentation

Introduction to Neural Networks
David Stutz
david.stutz@rwth-aachen.de
Seminar Selected Topics in Human Language Technology and Pattern Recognition, WS 2013/2014 – February 10, 2014
Lehrstuhl für Informatik 6, Computer Science Department, RWTH Aachen University, Germany
Stutz – Neural Networks 1 / 35

Outline
1. Literature
2. Motivation
3. Artificial Neural Networks
   (a) The Perceptron
   (b) Multilayer Perceptrons
   (c) Expressive Power
4. Network Training
   (a) Parameter Optimization
   (b) Error Backpropagation
5. Regularization
6. Pattern Classification
7. Conclusion

1. Literature
[Bishop 06] Pattern Recognition and Machine Learning. 2006.
◮ Chapter 5 gives a short introduction to neural networks in pattern recognition.
[Bishop 95] Neural Networks for Pattern Recognition. 1995.
[Haykin 05] Neural Networks: A Comprehensive Foundation. 2005.
[Duda & Hart + 01] Pattern Classification. 2001.
◮ Chapter 6 covers mainly the same aspects as Bishop.
[Rumelhart & Hinton + 86] Learning Representations by Back-Propagating Errors. 1986.
◮ Error backpropagation algorithm.
[Rosenblatt 58] The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain. 1958.

2. Motivation
Theoretically, a state-of-the-art computer is a lot faster than the human brain when comparing the number of operations per second. Nevertheless, we consider the human brain somewhat smarter than a computer. Why?
◮ Learning: the human brain learns from experience and prior knowledge to perform new tasks.
How to specify "learning" with respect to computers?
◮ Let $g$ be an unknown target function.
◮ Let $T := \{(x_n, t_n \approx g(x_n)) : 1 \le n \le N\}$ be a set of (noisy) training data.
◮ Task: learn a good approximation of $g$.
Artificial neural networks, or simply neural networks, try to solve this problem by modeling the structure of the human brain.
See
◮ [Haykin 05] for details on how artificial neural networks model the human brain.

3. Artificial Neural Networks – Processing Units
Core component of a neural network: the processing unit, corresponding to a neuron of the human brain.
A processing unit maps multiple input values onto one output value $y$:
[Figure: a processing unit with inputs $x_1, \dots, x_D$ and bias $w_0$ producing the output $y := f(z)$. A unit is labeled according to its output.]
◮ $x_1, \dots, x_D$ are inputs, e.g. from other processing units within the network.
◮ $w_0$ is an external input called bias.
◮ The propagation rule maps all input values onto the actual input $z$.
◮ The activation function is applied to obtain $y = f(z)$.
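
The two steps of a processing unit can be sketched in a few lines of Python. This is only an illustration: the function name `unit_output`, the plain-sum propagation rule, and the tanh activation are assumptions chosen for the example, not choices made on the slide.

```python
import math

def unit_output(inputs, bias, propagate, activate):
    """Compute y = f(z) for one processing unit."""
    z = propagate(inputs, bias)   # propagation rule: all inputs -> actual input z
    return activate(z)            # activation function: z -> output y

# Assumed example choices: sum of inputs plus bias, tanh as activation function.
y = unit_output([0.5, -0.25], 0.25, lambda xs, w0: sum(xs) + w0, math.tanh)
print(y)
```

Passing the propagation rule and the activation function as arguments keeps the sketch as general as the slide, which fixes neither of them yet.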

3. Artificial Neural Networks – Network Graphs
A neural network is a set of interconnected processing units. We visualize a neural network by means of a network graph:
◮ Nodes represent the processing units.
◮ Processing units are interconnected by directed edges.
[Figure: network graph with units $x_1, x_2$ connected by directed edges to units $y_1, y_2$. The output of $x_1$ is propagated to $y_1$. A unit is labeled according to its output.]

3. The Perceptron
Introduced by Rosenblatt in [Rosenblatt 58].
The (single-layer) perceptron consists of $D$ input units and $C$ output units.
◮ Propagation rule: weighted sum over inputs $x_i$ with weights $w_{ij}$.
◮ Input unit $i$: single input value $z = x_i$ and identity activation function.
◮ Output unit $j$ calculates the output
$y_j(x, w) = f(z_j) = f\left(\sum_{k=1}^{D} w_{jk} x_k + w_{j0}\right) \overset{x_0 := 1}{=} f\left(\sum_{k=0}^{D} w_{jk} x_k\right)$. (1)
The argument $\sum_{k=1}^{D} w_{jk} x_k + w_{j0}$ is the propagation rule with additional bias $w_{j0}$.
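
Equation (1) translates almost directly into code. The following is a minimal sketch in plain Python; the function name `perceptron_forward`, the storage layout (one weight row per output unit, bias first), and the example weights are assumptions made for illustration.

```python
def perceptron_forward(x, W, f):
    """Evaluate y_j = f(sum_{k=0}^{D} w_jk * x_k) for every output unit j.

    x: list of D inputs.
    W: C rows of D + 1 weights each, with the bias w_j0 stored first.
    f: activation function applied to each actual input z_j.
    """
    x_ext = [1.0] + list(x)   # prepend x_0 := 1 so the bias acts as a weight
    return [f(sum(w * xk for w, xk in zip(row, x_ext))) for row in W]

# Hypothetical weights for D = 2 inputs and C = 2 output units.
W = [[0.5, 1.0, -1.0],
     [-1.0, 2.0, 0.0]]
y = perceptron_forward([1.0, 2.0], W, lambda z: z)   # identity activation
print(y)   # [-0.5, 1.0]
```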

3. The Perceptron – Network Graph
[Figure: network graph of the perceptron. The input layer holds the units $x_0, x_1, \dots, x_D$, with the additional unit $x_0 := 1$ to include the bias as a weight; the output layer holds the units $y_1, \dots, y_C$ with outputs $y_1(x, w), \dots, y_C(x, w)$. Units are arranged in layers.]

3. The Perceptron – Activation Functions
Propagation rule used: weighted sum over all inputs.
How to choose the activation function $f(z)$?
◮ The Heaviside function $h(z)$ models the electrical impulse of neurons in the human brain:
$h(z) = \begin{cases} 1 & \text{if } z \ge 0 \\ 0 & \text{if } z < 0 \end{cases}$ (2)

3. The Perceptron – Activation Functions
In general we prefer monotonic, differentiable activation functions.
◮ Logistic sigmoid $\sigma(z)$ as a differentiable version of the Heaviside function:
$\sigma(z) = \frac{1}{1 + \exp(-z)}$
[Figure: plot of $\sigma(z)$ for $z$ from $-2$ to $2$, with values between 0 and 1.]
◮ Or its extension for multiple output units, the softmax activation function:
$\sigma(z, i) = \frac{\exp(z_i)}{\sum_{k=1}^{C} \exp(z_k)}$. (3)
See
◮ [Bishop 95] or [Duda & Hart + 01] for more on activation functions and their properties.
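
The three activation functions from equations (2) and (3) can be written down as follows. This is a sketch using only the standard library; subtracting the maximum inside `softmax` is an implementation detail added here for numerical stability and is not part of the slides (it leaves the result of equation (3) unchanged).

```python
import math

def heaviside(z):
    """Equation (2): 1 if z >= 0, else 0."""
    return 1.0 if z >= 0 else 0.0

def sigmoid(z):
    """Logistic sigmoid: a differentiable version of the Heaviside function."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    """Equation (3), evaluated for all output units i at once."""
    m = max(zs)                            # shift by the maximum for stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

print(heaviside(0.0), sigmoid(0.0), softmax([1.0, 2.0, 3.0]))
```

The softmax outputs are positive and sum to one, which is what allows them to be read as class probabilities.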

3. Multilayer Perceptrons
Idea: Add $L > 0$ additional hidden layers between the input and output layers.
◮ $m^{(l)}$ hidden units in layer $(l)$, with $m^{(0)} := D$ and $m^{(L+1)} := C$.
◮ Hidden unit $i$ in layer $l$ calculates the output
$y_i^{(l)} = f\left(\sum_{k=0}^{m^{(l-1)}} w_{ik} y_k^{(l-1)}\right)$, (4)
where the superscript $(l)$ denotes the layer and the subscript $i$ the unit.
A multilayer perceptron models a function
$y(\cdot, w) : \mathbb{R}^D \to \mathbb{R}^C, \quad x \mapsto y(x, w) = \begin{pmatrix} y_1(x, w) \\ \vdots \\ y_C(x, w) \end{pmatrix} = \begin{pmatrix} y_1^{(L+1)} \\ \vdots \\ y_C^{(L+1)} \end{pmatrix}$ (5)
where $y_i^{(L+1)}$ is the output of the $i$-th output unit.
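
Applying equation (4) layer by layer yields the function in equation (5). The sketch below assumes that every layer uses the same activation function, that each layer's weights are stored as one row per unit with the bias weight first, and that $y_0^{(l-1)} := 1$ plays the role of the bias unit; the name `mlp_forward` and the example weights are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mlp_forward(x, weights, f=sigmoid):
    """Propagate x through all layers and return y^(L+1).

    weights: list of L + 1 weight matrices; each matrix has one row per unit,
    and each row holds the bias weight followed by one weight per input.
    """
    y = list(x)                              # y^(0) := x
    for W in weights:
        y_ext = [1.0] + y                    # y_0^(l-1) := 1 carries the bias
        y = [f(sum(w * v for w, v in zip(row, y_ext))) for row in W]
    return y

# Hypothetical network: D = 2 inputs, one hidden layer with 2 units, C = 1 output.
W1 = [[0.0, 1.0, -1.0],
      [0.5, 0.5, 0.5]]
W2 = [[0.0, 1.0, 1.0]]
print(mlp_forward([1.0, 1.0], [W1, W2]))
```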

3. Two-Layer Perceptron – Network Graph
[Figure: network graph of a two-layer perceptron. Input layer: $x_0, x_1, \dots, x_D$; hidden layer: $y_0^{(1)}, y_1^{(1)}, \dots, y_{m^{(1)}}^{(1)}$; output layer: $y_1^{(2)}, \dots, y_C^{(2)}$ with outputs $y_1(x, w), \dots, y_C(x, w)$.]

3. Expressive Power – Boolean AND
Which target functions can be modeled using a single-layer perceptron?
◮ A single-layer perceptron represents a hyperplane in multidimensional space.
[Figure: the points (0, 0), (1, 0), (0, 1), (1, 1) in the $x_1$-$x_2$ plane. Modeling boolean AND with target function $g(x_1, x_2) \in \{0, 1\}$.]
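
As a concrete instance, a single output unit with the Heaviside activation can model boolean AND. The weights below ($w_0 = -1.5$, $w_1 = w_2 = 1$) are hand-picked for this sketch and are not taken from the slides; any line that separates (1, 1) from the other three points would do.

```python
def heaviside(z):
    return 1 if z >= 0 else 0

def and_perceptron(x1, x2):
    # Assumed weights: bias w_0 = -1.5, input weights w_1 = w_2 = 1.
    # The decision boundary is the line x1 + x2 = 1.5.
    return heaviside(-1.5 + 1.0 * x1 + 1.0 * x2)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, and_perceptron(x1, x2))
```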

3. Expressive Power – XOR Problem
Problem: How to model boolean exclusive OR (XOR) using a line in two-dimensional space?
◮ Boolean XOR cannot be modeled using a single-layer perceptron.
[Figure: the points (0, 0), (1, 0), (0, 1), (1, 1) in the $x_1$-$x_2$ plane. Boolean exclusive OR target function.]
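
The claim can be checked with a short worked argument, assuming a single output unit with the Heaviside activation from equation (2) and weights $w_0, w_1, w_2$ (symbols chosen here for illustration). Reproducing XOR on the four points would require:

```latex
\begin{align*}
g(0,0) = 0 &\;\Rightarrow\; w_0 < 0, \\
g(1,0) = 1 &\;\Rightarrow\; w_0 + w_1 \ge 0, \\
g(0,1) = 1 &\;\Rightarrow\; w_0 + w_2 \ge 0, \\
g(1,1) = 0 &\;\Rightarrow\; w_0 + w_1 + w_2 < 0.
\end{align*}
```

Adding the second and third conditions gives $2 w_0 + w_1 + w_2 \ge 0$, hence $w_0 + w_1 + w_2 \ge -w_0 > 0$ by the first condition. This contradicts the fourth condition, so no such weights exist.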

3. Expressive Power – Conclusion
Do additional hidden layers help?
◮ Yes. A multilayer perceptron with $L > 0$ additional hidden layers is a universal approximator.
See
◮ [Hornik & Stinchcombe + 89] for details on multilayer perceptrons as universal approximators.
◮ [Duda & Hart + 01] for a detailed discussion of the XOR Problem.
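
For the XOR problem specifically, one hidden layer with two units is enough. The sketch below uses Heaviside units and hand-picked weights (both assumptions for illustration, not from the slides): one hidden unit computes OR, the other computes AND, and the output unit fires when OR is on but AND is off.

```python
def heaviside(z):
    return 1 if z >= 0 else 0

def xor_network(x1, x2):
    # Hidden layer (assumed weights): one OR-like and one AND-like unit.
    h_or = heaviside(-0.5 + x1 + x2)
    h_and = heaviside(-1.5 + x1 + x2)
    # Output unit: active only if OR fires and AND does not.
    return heaviside(-0.5 + h_or - h_and)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_network(x1, x2))
```

This only demonstrates one target function; the universal approximation result cited above is a far more general statement.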

4. Network Training
Training a neural network means adjusting the weights to get a good approximation of the target function.
How does a neural network learn?
◮ Supervised learning: the training set $T$ provides both input values and the corresponding target values:
$T := \{(x_n, t_n) : 1 \le n \le N\}$, (6)
where $x_n$ is the input value (pattern) and $t_n$ the target value.
◮ The approximation performance of the neural network can be evaluated using a distance measure between approximation and target function.

4. Network Training – Error Measures
Sum-of-squared error function:
$E(w) = \sum_{n=1}^{N} E_n(w) = \frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{C} (y_k(x_n, w) - t_{nk})^2$, (7)
where $w$ is the weight vector, $y_k$ is the $k$-th component of the modeled function $y$, and $t_{nk}$ is the $k$-th entry of $t_n$.
Cross-entropy error function:
$E(w) = \sum_{n=1}^{N} E_n(w) = -\sum_{n=1}^{N} \sum_{k=1}^{C} t_{nk} \log y_k(x_n, w)$. (8)
See
◮ [Bishop 95] for a more detailed discussion of error measures for network training.
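
Both error functions can be evaluated directly from the network outputs and the targets. In this sketch `Y[n][k]` stands for $y_k(x_n, w)$ and `T[n][k]` for $t_{nk}$; the function names and the sample numbers are made up for illustration. Skipping terms with $t_{nk} = 0$ in the cross-entropy is an added convention ($0 \cdot \log 0 := 0$) that avoids evaluating `log(0)`.

```python
import math

def sum_of_squares_error(Y, T):
    """Equation (7): half the squared distance, summed over patterns and outputs."""
    return 0.5 * sum((y - t) ** 2
                     for y_n, t_n in zip(Y, T)
                     for y, t in zip(y_n, t_n))

def cross_entropy_error(Y, T):
    """Equation (8); terms with a zero target are skipped (assumed convention)."""
    return -sum(t * math.log(y)
                for y_n, t_n in zip(Y, T)
                for y, t in zip(y_n, t_n)
                if t != 0)

# Made-up outputs and one-hot targets for N = 2 patterns and C = 2 classes.
Y = [[0.5, 0.5], [0.25, 0.75]]
T = [[1, 0], [0, 1]]
print(sum_of_squares_error(Y, T))   # 0.3125
print(cross_entropy_error(Y, T))
```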

4. Network Training – Training Approaches
Idea: Adjust the weights such that the error is minimized.
Stochastic training: Randomly choose an input value $x_n$ and update the weights based on the error $E_n(w)$.
Mini-batch training: Process a subset $M \subseteq \{1, \dots, N\}$ of all input values and update the weights based on the error $\sum_{n \in M} E_n(w)$.
Batch training: Process all input values $x_n$, $1 \le n \le N$, and update the weights based on the overall error $E(w) = \sum_{n=1}^{N} E_n(w)$.
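
The three approaches differ only in which patterns contribute to a single weight update. A small sketch of that selection step follows; the function name `choose_indices`, the mode strings, and the uniformly random choice of the mini-batch are assumptions for this example, and indices are 0-based here whereas the slides count from 1.

```python
import random

def choose_indices(N, mode, batch_size=2, rng=random):
    """Return the indices n whose errors E_n(w) drive the next weight update."""
    if mode == "stochastic":
        return [rng.randrange(N)]                  # one randomly chosen pattern
    if mode == "mini-batch":
        return rng.sample(range(N), batch_size)    # a subset M of the patterns
    if mode == "batch":
        return list(range(N))                      # all N patterns
    raise ValueError(f"unknown mode: {mode}")

print(choose_indices(10, "stochastic"))
print(choose_indices(10, "mini-batch", batch_size=3))
print(choose_indices(4, "batch"))
```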

4. Parameter Optimization
How to minimize the error $E(w)$?
Problem: $E(w)$ can be nonlinear and may have multiple local minima.
Iterative optimization algorithms:
◮ Let $w[0]$ be a starting vector for the weights.
◮ $w[t]$ is the weight vector in the $t$-th iteration of the optimization algorithm.
◮ In iteration $[t + 1]$ choose a weight update $\Delta w[t]$ and set
$w[t + 1] = w[t] + \Delta w[t]$. (9)
◮ Different optimization algorithms choose different weight updates.

4. Parameter Optimization – Gradient Descent
Idea: In each iteration take a step in the direction of the negative gradient.
◮ The direction of the steepest descent.
[Figure: successive iterates $w[0], w[1], w[2], w[3], w[4]$ of gradient descent.]
◮ Weight update $\Delta w[t]$ is given by
$\Delta w[t] = -\gamma \frac{\partial E}{\partial w[t]}$, (10)
where $\gamma$ is the learning rate (step size).
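
Combining the update rule (9) with the gradient step (10) gives a short loop. The sketch below minimizes a toy quadratic error rather than a network error; the function name `gradient_descent`, the toy error, the learning rate, and the fixed iteration count are all assumptions made for illustration.

```python
def gradient_descent(grad, w0, learning_rate=0.1, iterations=100):
    """Iterate w[t+1] = w[t] + delta_w[t] with delta_w[t] = -gamma * dE/dw[t]."""
    w = list(w0)                                      # w[0]: starting vector
    for _ in range(iterations):
        g = grad(w)                                   # gradient of E at w[t]
        delta = [-learning_rate * gi for gi in g]     # equation (10)
        w = [wi + di for wi, di in zip(w, delta)]     # equation (9)
    return w

# Toy error E(w) = (w_1 - 3)^2 + (w_2 + 1)^2 with its minimum at (3, -1).
def grad_E(w):
    return [2 * (w[0] - 3), 2 * (w[1] + 1)]

w_min = gradient_descent(grad_E, [0.0, 0.0])
print(w_min)
```

On this convex toy error the iterates approach the unique minimum. For a real network error with several local minima, the result depends on the starting vector and the learning rate.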
