Chapter 7. Neural Networks (Wei Pan, Division of Biostatistics, University of Minnesota)


  1. Chapter 7. Neural Networks. Wei Pan, Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455. Email: weip@biostat.umn.edu. PubH 7475/8475. © Wei Pan

  2. Introduction
  ◮ ESL Chapter 11; we only focus on feed-forward NNs. Related to projection pursuit regression: f(x) = Σ_{m=1}^M g_m(w_m' x), where each w_m is a vector of weights and each g_m is a smooth nonparametric function, to be estimated. (Really?)
  ◮ Here: + CNNs; later, recurrent NNs (for sequence data). No autoencoders (unsupervised) ...? Goodfellow, Bengio, Courville (2016). Deep Learning. http://www.deeplearningbook.org/
  ◮ Two high waves: in the 1960s and in the late 1980s-90s.
  ◮ McCulloch & Pitts model (1943): n_j(t) = I(Σ_{i→j} w_ij n_i(t−1) > θ_j); w_ij can be > 0 (excitatory) or < 0 (inhibitory).
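The McCulloch & Pitts unit above is easy to sketch in code. This is an illustrative Python snippet (the course's own examples are in R; the function name and gate weights here are ours, chosen to show how excitatory and inhibitory weights yield logic gates):

```python
def mcculloch_pitts(weights, threshold, inputs):
    """McCulloch-Pitts unit: n_j(t) = I(sum_i w_ij * n_i(t-1) > theta_j).
    Fires (returns 1) iff the weighted sum of input activations
    exceeds the threshold theta_j."""
    s = sum(w * n for w, n in zip(weights, inputs))
    return 1 if s > threshold else 0

# An AND gate: two excitatory inputs (w = 1) must both be on to exceed theta = 1.5.
and_out = [mcculloch_pitts([1, 1], 1.5, [a, b]) for a in (0, 1) for b in (0, 1)]

# A NOT gate: one inhibitory input (w = -1) with theta = -0.5.
not_out = [mcculloch_pitts([-1], -0.5, [a]) for a in (0, 1)]
```

With fixed weights and thresholds the unit computes Boolean functions; learning rules (Hebb, back-propagation) come later.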

  3. ◮ A biological neuron vs. an artificial neuron (perceptron). Google: images "biological neural network tutorial".
  ◮ Minsky & Papert's (1969) XOR problem: XOR(X_1, X_2) = 1 if X_1 ≠ X_2, = 0 otherwise, with X_1, X_2 ∈ {0, 1}. Perceptron: f = I(α_0 + α'X > 0).
  ◮ Feldman's (1985) "one hundred step program": at most 100 steps within a human reaction time, because a human can recognize another person in ~100 ms while the processing time of a neuron is ~1 ms ⇒ the human brain works in a massively parallel and distributed way.
  ◮ Cognitive science: human vision is performed in a series of layers in the brain.
  ◮ Humans can learn.
  ◮ Hebb (1949) model: w_ij ← w_ij + η y_i y_j, reinforcing learning by simultaneous activations.
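A single perceptron f = I(α_0 + α'X > 0) draws one line through the unit square and so cannot separate {(0,1),(1,0)} from {(0,0),(1,1)}; adding one hidden layer fixes this. A minimal Python sketch with hand-picked (hypothetical) weights:

```python
def step(v):
    # Perceptron threshold unit: I(v > 0).
    return 1 if v > 0 else 0

def xor_net(x1, x2):
    # Hidden layer of two perceptrons:
    h1 = step(x1 + x2 - 0.5)   # OR(x1, x2)
    h2 = step(x1 + x2 - 1.5)   # AND(x1, x2)
    # Output perceptron: fires when OR is on but AND is off,
    # i.e. exactly one input is 1 -- which is XOR.
    return step(h1 - h2 - 0.5)

outs = [xor_net(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```

The hidden units carve the plane into pieces that the output unit can then combine linearly, which is exactly what a single perceptron cannot do.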

  4. Feed-forward NNs
  ◮ Fig 11.2.
  ◮ Input: X.
  ◮ A (hidden) layer: for m = 1, ..., M, Z_m = σ(α_{0m} + α_m' X), Z = (Z_1, ..., Z_M)'. Activation function: σ(v) = 1/(1 + exp(−v)), the sigmoid (or logit^{−1}). Q: what is each Z_m? Hyperbolic tangent: tanh(v) = 2σ(2v) − 1.
  ◮ ... (may have multiple hidden layers) ...
  ◮ Output: f_1(X), ..., f_K(X). T_k = β_{0k} + β_k' Z, T = (T_1, ..., T_K)', f_k(X) = g_k(T). Regression: g_k(T) = T_k; classification: g_k(T) = exp(T_k) / Σ_{j=1}^K exp(T_j), the softmax (or multi-logit^{−1}) function.
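The forward pass above, for the classification case with a softmax output, can be sketched in a few lines of Python/NumPy (the function names, shapes, and random weights are ours for illustration, not from the course code):

```python
import numpy as np

def sigmoid(v):
    # sigma(v) = 1 / (1 + exp(-v))
    return 1.0 / (1.0 + np.exp(-v))

def forward(x, alpha0, alpha, beta0, beta):
    """One forward pass. Shapes: x (p,), alpha0 (M,), alpha (M, p),
    beta0 (K,), beta (K, M)."""
    z = sigmoid(alpha0 + alpha @ x)    # hidden layer: Z_m = sigma(alpha_0m + alpha_m' x)
    t = beta0 + beta @ z               # linear scores: T_k = beta_0k + beta_k' z
    t = t - t.max()                    # subtract max for numerical stability
    f = np.exp(t) / np.exp(t).sum()    # softmax: f_k(x) = exp(T_k) / sum_j exp(T_j)
    return z, f

rng = np.random.default_rng(0)
p, M, K = 4, 3, 2
x = rng.normal(size=p)
z, f = forward(x, rng.normal(size=M), rng.normal(size=(M, p)),
               rng.normal(size=K), rng.normal(size=(K, M)))
```

The softmax outputs are positive and sum to one, so they can be read as class probabilities; for regression one would simply return t.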

  5. Elements of Statistical Learning (2nd Ed.), Hastie, Tibshirani & Friedman (2009), Chap 11, Figure 11.2: Schematic of a single hidden layer, feed-forward neural network (inputs X_1, ..., X_p; hidden units Z_1, ..., Z_M; outputs Y_1, ..., Y_K).

  6. Elements of Statistical Learning (2nd Ed.), Hastie, Tibshirani & Friedman (2009), Chap 11, Figure 11.3: Plot of the sigmoid function σ(v) = 1/(1 + exp(−v)) (red curve), commonly used in the hidden layer of a neural network. Included are σ(sv) for s = 1/2 (blue curve) and s = 10 (purple curve). The scale parameter s controls the activation rate; large s amounts to a hard activation at v = 0. Note that σ(s(v − v_0)) shifts the activation threshold from 0 to v_0.

  7. ◮ How to fit the model?
  ◮ Given training data (Y_i, X_i), i = 1, ..., n.
  ◮ For regression, minimize R(θ) = Σ_{k=1}^K Σ_{i=1}^n (Y_ik − f_k(X_i))^2.
  ◮ For classification, minimize R(θ) = −Σ_{k=1}^K Σ_{i=1}^n Y_ik log f_k(X_i), and classify by G(x) = arg max_k f_k(x).
  ◮ Can use other loss functions.
  ◮ How to minimize R(θ)? Gradient descent, called back-propagation (§11.4). Very popular and appealing! (Recall the Hebb model.)
  ◮ Other algorithms: Newton's, conjugate gradient, ...
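The two empirical risks above, written out directly (a Python/NumPy sketch; here Y is a one-hot matrix of class indicators and F holds the fitted values f_k(X_i), both with observations in rows):

```python
import numpy as np

def sse_risk(Y, F):
    """Regression risk: R(theta) = sum_k sum_i (Y_ik - f_k(X_i))^2."""
    return float(((Y - F) ** 2).sum())

def cross_entropy_risk(Y, F, eps=1e-12):
    """Classification risk: R(theta) = -sum_k sum_i Y_ik * log f_k(X_i).
    eps guards against log(0)."""
    return float(-(Y * np.log(F + eps)).sum())

Y = np.array([[1., 0.],   # two observations, K = 2 classes, one-hot coded
              [0., 1.]])
F = np.array([[0.8, 0.2], # fitted class probabilities f_k(X_i)
              [0.3, 0.7]])
G = F.argmax(axis=1)      # classification rule G(x) = arg max_k f_k(x)
```

With one-hot Y, the cross-entropy risk reduces to minus the log of the fitted probability of each observation's true class.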

  8. Back-propagation algorithm
  ◮ Given: training data (Y_i, X_i), i = 1, ..., n.
  ◮ Goal: estimate the α's and β's. Consider R(θ) = Σ_i Σ_k (Y_ik − f_k(X_i))^2 := Σ_i R_i.
  ◮ NN: input X_i, output (f_1(X_i), ..., f_K(X_i))'. Z_mi = σ(α_{0m} + α_m' X_i), Z_i = (Z_1i, ..., Z_Mi)', T_ki = β_{0k} + β_k' Z_i, T_i = (T_1i, ..., T_Ki)', f_k(X_i) = g_k(T_i) = T_ki.
  ◮ Chain rule: ∂R_i/∂β_km = (∂R_i/∂f_k)(∂g_k/∂T_ki)(∂T_ki/∂β_km), so
  ∂R_i/∂β_km = −2(Y_ik − f_k(X_i)) g_k'(β_k' Z_i) Z_mi := δ_ki Z_mi.

  9. Back-propagation algorithm (cont'ed)
  ◮ Similarly,
  ∂R_i/∂α_ml = −Σ_k 2(Y_ik − f_k(X_i)) g_k'(β_k' Z_i) β_km σ'(α_m' X_i) X_il := s_mi X_il,
  where δ_ki and s_mi are "errors" from the current model.
  ◮ Update at step r + 1:
  β_km^(r+1) = β_km^(r) − γ_r Σ_i ∂R_i/∂β_km |_{β^(r), α^(r)},
  α_ml^(r+1) = α_ml^(r) − γ_r Σ_i ∂R_i/∂α_ml |_{β^(r), α^(r)}.
  γ_r: learning rate; a tuning parameter; can be fixed or selected/decayed. If too large, the updates may oscillate or diverge; if too small, convergence is slow.
  ◮ Training epoch: one cycle of updating through the training data.
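The two gradient formulas above can be coded and verified against a finite-difference approximation. A Python/NumPy sketch for a single observation with squared-error loss and identity output g_k(T) = T_k (so g_k' = 1); biases are omitted and all names are ours:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def risk_and_grads(x, y, alpha, beta):
    """One observation. alpha: (M, p), beta: (K, M)."""
    z = sigmoid(alpha @ x)               # Z_i
    f = beta @ z                         # f_k(X_i) = T_ki (identity g_k)
    r = y - f
    R = float((r ** 2).sum())            # R_i = sum_k (Y_ik - f_k)^2
    delta = -2.0 * r                     # delta_ki = -2(Y_ik - f_k) (g_k' = 1)
    grad_beta = np.outer(delta, z)       # dR_i/dbeta_km = delta_ki * Z_mi
    s = (beta.T @ delta) * z * (1 - z)   # s_mi; sigma'(v) = sigma(v)(1 - sigma(v))
    grad_alpha = np.outer(s, x)          # dR_i/dalpha_ml = s_mi * X_il
    return R, grad_alpha, grad_beta

rng = np.random.default_rng(1)
p, M, K = 3, 4, 2
x, y = rng.normal(size=p), rng.normal(size=K)
alpha, beta = rng.normal(size=(M, p)), rng.normal(size=(K, M))
R, gA, gB = risk_and_grads(x, y, alpha, beta)

# Finite-difference check of one alpha entry.
eps = 1e-6
a2 = alpha.copy()
a2[0, 0] += eps
num = (risk_and_grads(x, y, a2, beta)[0] - R) / eps
```

The key point of back-propagation is that δ_ki is computed at the output and then "propagated back" through β to form s_mi, so both gradients come from one forward and one backward sweep.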

  10. Some issues
  ◮ Starting values: many local minima and saddle points exist. Multiple tries; model averaging, ... Data preprocessing: centering at 0 and scaling.
  ◮ Stochastic gradient descent (SGD): use a minibatch (i.e., a random subset) of the training data for a few iterations; minibatch size: 32, 64, 128, ..., a tuning parameter.
  ◮ +: simple and intuitive; −: slow.
  ◮ Modifications: SGD + momentum.
  SGD: x_{t+1} = x_t − γ ∇f(x_t).
  SGD+M: v_{t+1} = ρ v_t + ∇f(x_t), x_{t+1} = x_t − γ v_{t+1}.
  ... (AdaGrad, RMSProp) ... Adam, the default (now!).
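The two update rules above, side by side on a toy one-dimensional quadratic (a Python sketch; the step size γ, momentum ρ, and objective are illustrative choices of ours):

```python
def grad(x):
    # Gradient of f(x) = x^2 / 2, which is minimized at x = 0.
    return x

def sgd(x0, gamma, steps):
    # Plain gradient descent: x_{t+1} = x_t - gamma * grad f(x_t).
    x = x0
    for _ in range(steps):
        x = x - gamma * grad(x)
    return x

def sgd_momentum(x0, gamma, rho, steps):
    # SGD + momentum: v_{t+1} = rho * v_t + grad f(x_t),
    #                 x_{t+1} = x_t - gamma * v_{t+1}.
    x, v = x0, 0.0
    for _ in range(steps):
        v = rho * v + grad(x)
        x = x - gamma * v
    return x

plain = sgd(5.0, gamma=0.1, steps=100)
mom = sgd_momentum(5.0, gamma=0.1, rho=0.9, steps=100)
```

On a true objective both iterations head to the minimizer; in SGD the gradient would be replaced by a minibatch estimate, and momentum's running average v then also smooths out the minibatch noise.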

  11. Some issues (cont'ed)
  ◮ Over-fitting? By the universal approximation theorem, adding more units or layers lets the network fit the training data arbitrarily well, so:
  1) Early stopping!
  2) Regularization: add a penalty term, e.g. ridge; use R(θ) + λJ(θ) with J(θ) = Σ_km β_km^2 + Σ_ml α_ml^2; called weight decay; Fig 11.4. Performance: Figs 11.6-8.
  3) Regularization: dropout: randomly drop a subset/proportion of nodes/units or connections during training; an ensemble; more robust.
  ◮ A main technical issue with a deep NN: vanishing or exploding gradients. Why? Remedies: use ReLU, f(x) = max(0, x); batch normalization; ...
  ◮ Transfer learning: reusing trained networks. Why? http://jmlr.org/proceedings/papers/v32/donahue14.pdf
  ◮ Example code: ex7.1.r
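Three of the ingredients above in code form (a Python/NumPy sketch, separate from the course's R example ex7.1.r; the dropout variant shown is the common "inverted" form, and the rate and λ values are illustrative):

```python
import numpy as np

def relu(v):
    """ReLU f(v) = max(0, v): its gradient is 1 wherever v > 0,
    so deep products of gradients do not shrink the way sigma' < 1/4 does."""
    return np.maximum(0.0, v)

def dropout(z, rate, rng, training=True):
    """Inverted dropout: during training, zero each unit with probability
    `rate` and rescale survivors by 1/(1 - rate) so the expected
    activation is unchanged; at test time, pass z through untouched."""
    if not training:
        return z
    mask = rng.random(z.shape) >= rate
    return z * mask / (1.0 - rate)

def weight_decay_penalty(alpha, beta, lam):
    """Ridge penalty lambda * J(theta), J = sum beta_km^2 + sum alpha_ml^2."""
    return lam * (float((alpha ** 2).sum()) + float((beta ** 2).sum()))

v = np.array([-2.0, 0.0, 3.0])
out = relu(v)
eval_pass = dropout(v, 0.5, np.random.default_rng(0), training=False)
```

In training, weight decay is simply added to R(θ) before differentiating, and dropout is applied to the hidden activations Z at each forward pass, so each minibatch effectively trains a different thinned network.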
