
Neural Networks - Greg Mori, CMPT 419/726 (Bishop PRML Ch. 5)

  1. Neural Networks
     Greg Mori - CMPT 419/726
     Bishop PRML Ch. 5

  2. Neural Networks
     • Neural networks arise from attempts to model human/animal brains
     • Many models, many claims of biological plausibility
     • We will focus on multi-layer perceptrons
       • Mathematical properties rather than plausibility

  3. Applications of Neural Networks
     • Many success stories for neural networks, old and new
       • Credit card fraud detection
       • Hand-written digit recognition
       • Face detection
       • Autonomous driving (CMU ALVINN)
       • Object recognition
       • Speech recognition

  4. Outline
     • Feed-forward Networks
     • Network Training
     • Error Backpropagation
     • Deep Learning

  5. Feed-forward Networks
     • We have looked at generalized linear models of the form:
         y(x, w) = f( Σ_{j=1}^{M} w_j φ_j(x) )
       for fixed non-linear basis functions φ(·)
     • We now extend this model by allowing adaptive basis functions, and learning their parameters
     • In feed-forward networks (a.k.a. multi-layer perceptrons) we let each basis function be another non-linear function of a linear combination of the inputs:
         φ_j(x) = f( Σ ... )
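To make the contrast concrete, here is a minimal NumPy sketch of a generalized linear model with *fixed* basis functions; the names `gaussian_basis` and `glm_output` and the choice of Gaussian bumps are illustrative assumptions, not from the slides. The network introduced below replaces these fixed φ_j with basis functions whose parameters are learned.

```python
import numpy as np

def gaussian_basis(x, centers, width=1.0):
    """Fixed non-linear basis functions phi_j(x): Gaussian bumps at preset centers."""
    # x: (D,) input vector; centers: (M, D) basis-function centers, chosen beforehand
    return np.exp(-np.sum((x - centers) ** 2, axis=1) / (2 * width ** 2))

def glm_output(x, w, centers, f=lambda a: a):
    """y(x, w) = f( sum_j w_j * phi_j(x) ); only w is adapted during training."""
    phi = gaussian_basis(x, centers)   # (M,) fixed basis activations
    return f(np.dot(w, phi))
```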

  6. Feed-forward Networks
     • Starting with input x = (x_1, ..., x_D), construct linear combinations:
         a_j = Σ_{i=1}^{D} w^(1)_ji x_i + w^(1)_j0
       These a_j are known as activations
     • Pass each through an activation function h(·) to get output z_j = h(a_j)
     • Model of an individual neuron (figure from Russell and Norvig, AIMA2e)
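A minimal sketch of a single unit as defined above, assuming a logistic-sigmoid choice of h (one of the options listed on the next slide):

```python
import numpy as np

def unit_forward(x, w, w0):
    """One unit: activation a_j = sum_i w_ji * x_i + w_j0, output z_j = h(a_j)."""
    a = np.dot(w, x) + w0          # linear combination of the inputs (the "activation")
    z = 1.0 / (1.0 + np.exp(-a))   # activation function h: logistic sigmoid here
    return z
```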

  7. Activation Functions
     • Can use a variety of activation functions
       • Sigmoidal (S-shaped)
         • Logistic sigmoid 1/(1 + exp(−a)) (useful for binary classification)
         • Hyperbolic tangent tanh
       • Radial basis function z_j = Σ_i (x_i − w_ji)^2
       • Softmax
         • Useful for multi-class classification
       • Identity
         • Useful for regression
       • Threshold
       • Max, ReLU, Leaky ReLU, ...
     • Needs to be differentiable* for gradient-based learning (later)
     • Can use different activation functions in each unit
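As a quick reference, here are direct NumPy translations of a few of the listed choices (a sketch; the radial-basis form follows the slide's z_j = Σ_i (x_i − w_ji)², and tanh is simply np.tanh):

```python
import numpy as np

def logistic(a):            # sigmoid; useful for binary classification outputs
    return 1.0 / (1.0 + np.exp(-a))

def relu(a):                # max(0, a); not differentiable at 0, handled by convention
    return np.maximum(0.0, a)

def softmax(a):             # multi-class outputs; subtract max for numerical stability
    e = np.exp(a - np.max(a))
    return e / np.sum(e)

def rbf_unit(x, w_j):       # radial basis unit: z_j = sum_i (x_i - w_ji)^2
    return np.sum((x - w_j) ** 2)
```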

  8. Feed-forward Networks
     [Figure: two-layer network diagram with inputs x_0, ..., x_D, hidden units z_0, ..., z_M, outputs y_1, ..., y_K, connected by weights w^(1) and w^(2)]
     • Connect together a number of these units into a feed-forward network (DAG)
     • Above shows a network with one layer of hidden units
     • Implements the function:
         y_k(x, w) = h( Σ_{j=1}^{M} w^(2)_kj h( Σ_{i=1}^{D} w^(1)_ji x_i + w^(1)_j0 ) + w^(2)_k0 )
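A minimal sketch of the forward pass for this one-hidden-layer network, assuming tanh hidden units and identity output units (the regression case); the weight shapes mirror the w^(1), w^(2) notation above:

```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    """y_k(x, w) for a one-hidden-layer network.

    W1: (M, D) first-layer weights w^(1)_ji, b1: (M,) biases w^(1)_j0
    W2: (K, M) second-layer weights w^(2)_kj, b2: (K,) biases w^(2)_k0
    """
    a = W1 @ x + b1    # hidden activations a_j
    z = np.tanh(a)     # hidden unit outputs z_j = h(a_j)
    y = W2 @ z + b2    # identity output activation (regression case assumed)
    return y
```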

  9. Outline
     • Feed-forward Networks
     • Network Training
     • Error Backpropagation
     • Deep Learning

  10. Network Training
      • Given a specified network structure, how do we set its parameters (weights)?
      • As usual, we define a criterion to measure how well our network performs and optimize against it
      • For regression, training data are (x_n, t_n), t_n ∈ ℝ
        • Squared error naturally arises:
            E(w) = Σ_{n=1}^{N} { y(x_n, w) − t_n }^2
      • For binary classification, this is another discriminative model; maximum likelihood gives:
            p(t | w) = Π_{n=1}^{N} y_n^{t_n} (1 − y_n)^{1 − t_n}
            E(w) = − Σ_{n=1}^{N} { t_n ln y_n + (1 − t_n) ln(1 − y_n) }
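Both error functions translate directly into code. A sketch, where Y_pred are the network outputs over the training set and T the targets (sigmoid outputs are assumed for the cross-entropy case):

```python
import numpy as np

def squared_error(Y_pred, T):
    """E(w) = sum_n (y(x_n, w) - t_n)^2 for regression."""
    return np.sum((Y_pred - T) ** 2)

def cross_entropy(Y_pred, T, eps=1e-12):
    """E(w) = -sum_n [ t_n ln y_n + (1 - t_n) ln(1 - y_n) ] for binary classification."""
    Y = np.clip(Y_pred, eps, 1.0 - eps)   # clip to avoid log(0)
    return -np.sum(T * np.log(Y) + (1.0 - T) * np.log(1.0 - Y))
```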

  11. Parameter Optimization
      [Figure: error surface E(w) over weights w_1, w_2, with points w_A, w_B, w_C and gradient ∇E]
      • For either of these problems, the error function E(w) is nasty
      • Nasty = non-convex
      • Non-convex = has local minima

  12. Descent Methods
      • The typical strategy for optimization problems of this sort is a descent method:
          w^(τ+1) = w^(τ) + Δw^(τ)
      • As we've seen before, these come in many flavours
        • Gradient descent ∇E(w^(τ))
        • Stochastic gradient descent ∇E_n(w^(τ))
        • Newton-Raphson (second order) ∇²E
      • All of these can be used here; stochastic gradient descent is particularly effective
        • Redundancy in training data, escaping local minima
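A sketch of the generic update w^(τ+1) = w^(τ) + Δw^(τ) specialized to stochastic gradient descent with Δw = −η ∇E_n(w). The routine grad_En is an assumed callback returning the single-example gradient (computed, for instance, by backpropagation, covered next):

```python
import numpy as np

def sgd(w, data, grad_En, lr=0.01, epochs=10, rng=np.random.default_rng(0)):
    """Stochastic gradient descent: one weight update per training example."""
    for _ in range(epochs):
        for n in rng.permutation(len(data)):     # visit examples in random order
            x_n, t_n = data[n]
            w = w - lr * grad_En(w, x_n, t_n)    # Delta w = -lr * grad E_n(w)
    return w
```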

  13. Computing Gradients
      • The function y(x_n, w) implemented by a network is complicated
        • It isn't obvious how to compute error function derivatives with respect to weights
      • Numerical method for calculating error derivatives: use finite differences:
          ∂E_n/∂w_ji ≈ [ E_n(w_ji + ε) − E_n(w_ji − ε) ] / (2ε)
      • How much computation would this take with W weights in the network?
        • O(W) per derivative, O(W²) total per gradient descent step
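A sketch of this central-difference estimate, assuming an E_n(w) routine that evaluates the per-example error for a flattened weight vector w. Each of the W partial derivatives needs two O(W) evaluations of E_n, which is where the O(W²) cost per gradient comes from:

```python
import numpy as np

def finite_difference_gradient(E_n, w, eps=1e-6):
    """dE_n/dw_ji ~= [E_n(w + eps) - E_n(w - eps)] / (2 eps), one weight at a time."""
    grad = np.zeros_like(w)
    for i in range(w.size):                  # W derivatives ...
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        grad[i] = (E_n(w_plus) - E_n(w_minus)) / (2 * eps)   # ... each costing O(W)
    return grad
```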
