Chapter 7. Neural Networks (Wei Pan, Division of Biostatistics, University of Minnesota)


  1. Chapter 7. Neural Networks. Wei Pan, Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455. Email: weip@biostat.umn.edu. PubH 7475/8475. © Wei Pan

  2. Introduction
  ◮ ESL Chapter 11; we only focus on feed-forward NNs. Related to projection pursuit regression: f(x) = Σ_{m=1}^M g_m(w_m' x), where each w_m is a vector of weights and each g_m is a smooth nonparametric function, to be estimated. (Really?)
  ◮ Here: + CNNs; later, recurrent NNs (for sequence data). No autoencoders (unsupervised) ...? Goodfellow, Bengio, Courville (2016). Deep Learning. http://www.deeplearningbook.org/
  ◮ Two high waves: in the 1960s and in the late 1980s-90s.
  ◮ McCulloch & Pitts model (1943): n_j(t) = I(Σ_{i→j} w_ij n_i(t−1) > θ_j); w_ij can be > 0 (excitatory) or < 0 (inhibitory).
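The McCulloch & Pitts unit above is easy to sketch in code. This is an illustrative Python snippet (the course's own examples are in R; the function name and gate weights here are ours, chosen to show how excitatory and inhibitory weights yield logic gates):

```python
def mcculloch_pitts(weights, threshold, inputs):
    """McCulloch-Pitts unit: n_j(t) = I(sum_i w_ij * n_i(t-1) > theta_j).
    Fires (returns 1) iff the weighted sum of input activations
    exceeds the threshold theta_j."""
    s = sum(w * n for w, n in zip(weights, inputs))
    return 1 if s > threshold else 0

# An AND gate: two excitatory inputs (w = 1) must both be on to exceed theta = 1.5.
and_out = [mcculloch_pitts([1, 1], 1.5, [a, b]) for a in (0, 1) for b in (0, 1)]

# A NOT gate: one inhibitory input (w = -1) with theta = -0.5.
not_out = [mcculloch_pitts([-1], -0.5, [a]) for a in (0, 1)]
```

With fixed weights and thresholds the unit computes Boolean functions; learning rules (Hebb, back-propagation) come later.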

  3. ◮ A biological neuron vs. an artificial neuron (perceptron). Google: images "biological neural network tutorial".
  ◮ Minsky & Papert's (1969) XOR problem: XOR(X_1, X_2) = 1 if X_1 ≠ X_2, = 0 otherwise, with X_1, X_2 ∈ {0, 1}. Perceptron: f = I(α_0 + α'X > 0).
  ◮ Feldman's (1985) "one hundred step program": at most 100 steps within a human reaction time, because a human can recognize another person in ~100 ms while the processing time of a neuron is ~1 ms ⇒ the human brain works in a massively parallel and distributed way.
  ◮ Cognitive science: human vision is performed in a series of layers in the brain.
  ◮ Humans can learn.
  ◮ Hebb (1949) model: w_ij ← w_ij + η y_i y_j, reinforcing learning by simultaneous activations.
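A single perceptron f = I(α_0 + α'X > 0) draws one line through the unit square and so cannot separate {(0,1),(1,0)} from {(0,0),(1,1)}; adding one hidden layer fixes this. A minimal Python sketch with hand-picked (hypothetical) weights:

```python
def step(v):
    # Perceptron threshold unit: I(v > 0).
    return 1 if v > 0 else 0

def xor_net(x1, x2):
    # Hidden layer of two perceptrons:
    h1 = step(x1 + x2 - 0.5)   # OR(x1, x2)
    h2 = step(x1 + x2 - 1.5)   # AND(x1, x2)
    # Output perceptron: fires when OR is on but AND is off,
    # i.e. exactly one input is 1 -- which is XOR.
    return step(h1 - h2 - 0.5)

outs = [xor_net(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```

The hidden units carve the plane into pieces that the output unit can then combine linearly, which is exactly what a single perceptron cannot do.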

  4. Feed-forward NNs
  ◮ Fig 11.2.
  ◮ Input: X.
  ◮ A (hidden) layer: for m = 1, ..., M, Z_m = σ(α_{0m} + α_m' X), Z = (Z_1, ..., Z_M)'. Activation function: σ(v) = 1/(1 + exp(−v)), the sigmoid (or logit^{−1}). Q: what is each Z_m? Hyperbolic tangent: tanh(v) = 2σ(2v) − 1.
  ◮ ... (may have multiple hidden layers) ...
  ◮ Output: f_1(X), ..., f_K(X). T_k = β_{0k} + β_k' Z, T = (T_1, ..., T_K)', f_k(X) = g_k(T). Regression: g_k(T) = T_k; classification: g_k(T) = exp(T_k) / Σ_{j=1}^K exp(T_j), the softmax (or multi-logit^{−1}) function.
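The forward pass above, for the classification case with a softmax output, can be sketched in a few lines of Python/NumPy (the function names, shapes, and random weights are ours for illustration, not from the course code):

```python
import numpy as np

def sigmoid(v):
    # sigma(v) = 1 / (1 + exp(-v))
    return 1.0 / (1.0 + np.exp(-v))

def forward(x, alpha0, alpha, beta0, beta):
    """One forward pass. Shapes: x (p,), alpha0 (M,), alpha (M, p),
    beta0 (K,), beta (K, M)."""
    z = sigmoid(alpha0 + alpha @ x)    # hidden layer: Z_m = sigma(alpha_0m + alpha_m' x)
    t = beta0 + beta @ z               # linear scores: T_k = beta_0k + beta_k' z
    t = t - t.max()                    # subtract max for numerical stability
    f = np.exp(t) / np.exp(t).sum()    # softmax: f_k(x) = exp(T_k) / sum_j exp(T_j)
    return z, f

rng = np.random.default_rng(0)
p, M, K = 4, 3, 2
x = rng.normal(size=p)
z, f = forward(x, rng.normal(size=M), rng.normal(size=(M, p)),
               rng.normal(size=K), rng.normal(size=(K, M)))
```

The softmax outputs are positive and sum to one, so they can be read as class probabilities; for regression one would simply return t.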

  5. Elements of Statistical Learning (2nd Ed.), Hastie, Tibshirani & Friedman (2009), Chap 11, Figure 11.2: Schematic of a single hidden layer, feed-forward neural network (inputs X_1, ..., X_p; hidden units Z_1, ..., Z_M; outputs Y_1, ..., Y_K).

  6. Elements of Statistical Learning (2nd Ed.), Hastie, Tibshirani & Friedman (2009), Chap 11, Figure 11.3: Plot of the sigmoid function σ(v) = 1/(1 + exp(−v)) (red curve), commonly used in the hidden layer of a neural network. Included are σ(sv) for s = 1/2 (blue curve) and s = 10 (purple curve). The scale parameter s controls the activation rate; large s amounts to a hard activation at v = 0. Note that σ(s(v − v_0)) shifts the activation threshold from 0 to v_0.

  7. ◮ How to fit the model?
  ◮ Given training data (Y_i, X_i), i = 1, ..., n.
  ◮ For regression, minimize R(θ) = Σ_{k=1}^K Σ_{i=1}^n (Y_ik − f_k(X_i))^2.
  ◮ For classification, minimize R(θ) = −Σ_{k=1}^K Σ_{i=1}^n Y_ik log f_k(X_i), and classify by G(x) = arg max_k f_k(x).
  ◮ Can use other loss functions.
  ◮ How to minimize R(θ)? Gradient descent, called back-propagation (§11.4). Very popular and appealing! (Recall the Hebb model.)
  ◮ Other algorithms: Newton's, conjugate gradient, ...
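The two empirical risks above, written out directly (a Python/NumPy sketch; here Y is a one-hot matrix of class indicators and F holds the fitted values f_k(X_i), both with observations in rows):

```python
import numpy as np

def sse_risk(Y, F):
    """Regression risk: R(theta) = sum_k sum_i (Y_ik - f_k(X_i))^2."""
    return float(((Y - F) ** 2).sum())

def cross_entropy_risk(Y, F, eps=1e-12):
    """Classification risk: R(theta) = -sum_k sum_i Y_ik * log f_k(X_i).
    eps guards against log(0)."""
    return float(-(Y * np.log(F + eps)).sum())

Y = np.array([[1., 0.],   # two observations, K = 2 classes, one-hot coded
              [0., 1.]])
F = np.array([[0.8, 0.2], # fitted class probabilities f_k(X_i)
              [0.3, 0.7]])
G = F.argmax(axis=1)      # classification rule G(x) = arg max_k f_k(x)
```

With one-hot Y, the cross-entropy risk reduces to minus the log of the fitted probability of each observation's true class.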

  8. Back-propagation algorithm
  ◮ Given: training data (Y_i, X_i), i = 1, ..., n.
  ◮ Goal: estimate the α's and β's. Consider R(θ) = Σ_i Σ_k (Y_ik − f_k(X_i))^2 := Σ_i R_i.
  ◮ NN: input X_i, output (f_1(X_i), ..., f_K(X_i))'. Z_mi = σ(α_{0m} + α_m' X_i), Z_i = (Z_1i, ..., Z_Mi)', T_ki = β_{0k} + β_k' Z_i, T_i = (T_1i, ..., T_Ki)', f_k(X_i) = g_k(T_i) = T_ki.
  ◮ Chain rule: ∂R_i/∂β_km = (∂R_i/∂f_k)(∂g_k/∂T_ki)(∂T_ki/∂β_km), so
  ∂R_i/∂β_km = −2(Y_ik − f_k(X_i)) g_k'(β_k' Z_i) Z_mi := δ_ki Z_mi.

  9. Back-propagation algorithm (cont'ed)
  ◮ Similarly,
  ∂R_i/∂α_ml = −Σ_k 2(Y_ik − f_k(X_i)) g_k'(β_k' Z_i) β_km σ'(α_m' X_i) X_il := s_mi X_il,
  where δ_ki and s_mi are "errors" from the current model.
  ◮ Update at step r + 1:
  β_km^(r+1) = β_km^(r) − γ_r Σ_i ∂R_i/∂β_km |_{β^(r), α^(r)},
  α_ml^(r+1) = α_ml^(r) − γ_r Σ_i ∂R_i/∂α_ml |_{β^(r), α^(r)}.
  γ_r: learning rate; a tuning parameter; can be fixed or selected/decayed. If too large, the updates may oscillate or diverge; if too small, convergence is slow.
  ◮ Training epoch: one cycle of updating through the training data.
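The two gradient formulas above can be coded and verified against a finite-difference approximation. A Python/NumPy sketch for a single observation with squared-error loss and identity output g_k(T) = T_k (so g_k' = 1); biases are omitted and all names are ours:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def risk_and_grads(x, y, alpha, beta):
    """One observation. alpha: (M, p), beta: (K, M)."""
    z = sigmoid(alpha @ x)               # Z_i
    f = beta @ z                         # f_k(X_i) = T_ki (identity g_k)
    r = y - f
    R = float((r ** 2).sum())            # R_i = sum_k (Y_ik - f_k)^2
    delta = -2.0 * r                     # delta_ki = -2(Y_ik - f_k) (g_k' = 1)
    grad_beta = np.outer(delta, z)       # dR_i/dbeta_km = delta_ki * Z_mi
    s = (beta.T @ delta) * z * (1 - z)   # s_mi; sigma'(v) = sigma(v)(1 - sigma(v))
    grad_alpha = np.outer(s, x)          # dR_i/dalpha_ml = s_mi * X_il
    return R, grad_alpha, grad_beta

rng = np.random.default_rng(1)
p, M, K = 3, 4, 2
x, y = rng.normal(size=p), rng.normal(size=K)
alpha, beta = rng.normal(size=(M, p)), rng.normal(size=(K, M))
R, gA, gB = risk_and_grads(x, y, alpha, beta)

# Finite-difference check of one alpha entry.
eps = 1e-6
a2 = alpha.copy()
a2[0, 0] += eps
num = (risk_and_grads(x, y, a2, beta)[0] - R) / eps
```

The key point of back-propagation is that δ_ki is computed at the output and then "propagated back" through β to form s_mi, so both gradients come from one forward and one backward sweep.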

  10. Some issues
  ◮ Starting values: many local minima and saddle points exist. Multiple tries; model averaging, ... Data preprocessing: centering at 0 and scaling.
  ◮ Stochastic gradient descent (SGD): use a minibatch (i.e., a random subset) of the training data for a few iterations; minibatch size: 32, 64, 128, ..., a tuning parameter.
  ◮ +: simple and intuitive; −: slow.
  ◮ Modifications: SGD + momentum.
  SGD: x_{t+1} = x_t − γ ∇f(x_t).
  SGD+M: v_{t+1} = ρ v_t + ∇f(x_t), x_{t+1} = x_t − γ v_{t+1}.
  ... (AdaGrad, RMSProp) ... Adam, the default (now!).
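The two update rules above, side by side on a toy one-dimensional quadratic (a Python sketch; the step size γ, momentum ρ, and objective are illustrative choices of ours):

```python
def grad(x):
    # Gradient of f(x) = x^2 / 2, which is minimized at x = 0.
    return x

def sgd(x0, gamma, steps):
    # Plain gradient descent: x_{t+1} = x_t - gamma * grad f(x_t).
    x = x0
    for _ in range(steps):
        x = x - gamma * grad(x)
    return x

def sgd_momentum(x0, gamma, rho, steps):
    # SGD + momentum: v_{t+1} = rho * v_t + grad f(x_t),
    #                 x_{t+1} = x_t - gamma * v_{t+1}.
    x, v = x0, 0.0
    for _ in range(steps):
        v = rho * v + grad(x)
        x = x - gamma * v
    return x

plain = sgd(5.0, gamma=0.1, steps=100)
mom = sgd_momentum(5.0, gamma=0.1, rho=0.9, steps=100)
```

On a true objective both iterations head to the minimizer; in SGD the gradient would be replaced by a minibatch estimate, and momentum's running average v then also smooths out the minibatch noise.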

  11. Some issues (cont'ed)
  ◮ Over-fitting? By the universal approximation theorem, adding more units or layers lets the network fit the training data arbitrarily well, so:
  1) Early stopping!
  2) Regularization: add a penalty term, e.g. ridge; use R(θ) + λJ(θ) with J(θ) = Σ_km β_km^2 + Σ_ml α_ml^2; called weight decay; Fig 11.4. Performance: Figs 11.6-8.
  3) Regularization: dropout: randomly drop a subset/proportion of nodes/units or connections during training; an ensemble; more robust.
  ◮ A main technical issue with a deep NN: vanishing or exploding gradients. Why? Remedies: use ReLU, f(x) = max(0, x); batch normalization; ...
  ◮ Transfer learning: reusing trained networks. Why? http://jmlr.org/proceedings/papers/v32/donahue14.pdf
  ◮ Example code: ex7.1.r
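Three of the ingredients above in code form (a Python/NumPy sketch, separate from the course's R example ex7.1.r; the dropout variant shown is the common "inverted" form, and the rate and λ values are illustrative):

```python
import numpy as np

def relu(v):
    """ReLU f(v) = max(0, v): its gradient is 1 wherever v > 0,
    so deep products of gradients do not shrink the way sigma' < 1/4 does."""
    return np.maximum(0.0, v)

def dropout(z, rate, rng, training=True):
    """Inverted dropout: during training, zero each unit with probability
    `rate` and rescale survivors by 1/(1 - rate) so the expected
    activation is unchanged; at test time, pass z through untouched."""
    if not training:
        return z
    mask = rng.random(z.shape) >= rate
    return z * mask / (1.0 - rate)

def weight_decay_penalty(alpha, beta, lam):
    """Ridge penalty lambda * J(theta), J = sum beta_km^2 + sum alpha_ml^2."""
    return lam * (float((alpha ** 2).sum()) + float((beta ** 2).sum()))

v = np.array([-2.0, 0.0, 3.0])
out = relu(v)
eval_pass = dropout(v, 0.5, np.random.default_rng(0), training=False)
```

In training, weight decay is simply added to R(θ) before differentiating, and dropout is applied to the hidden activations Z at each forward pass, so each minibatch effectively trains a different thinned network.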
