Multi-Layer vs. Single-Layer Networks

  1. Multi-Layer vs. Single-Layer Networks

Single-layer networks
• are based on a linear combination of the input variables, transformed by a linear or non-linear activation function
• are limited in the range of functions they can represent

Multi-layer networks
• consist of multiple layers and are capable of approximating any continuous functional mapping
• are not as straightforward to train as single-layer networks

  2. Multi-Layer Network

[Figure: a two-layer network with inputs $x_0 = 1, x_1, \dots, x_d$ (bias $x_0$), a first layer of weights $w_{ji}$ feeding hidden units $z_0 = 1, z_1, \dots, z_M$ (bias $z_0$), and a second layer of weights $v_{kj}$ feeding outputs $y_1, \dots, y_K$.]

The connection in the first layer from input unit $i$ to hidden unit $j$ is denoted $w_{ji}$. The connection from hidden unit $j$ to output unit $k$ is denoted $v_{kj}$.

  3. Multi-Layer Network (cont.)

Hidden unit $j$ receives input
$$a_j = \sum_{i=1}^{d} w_{ji} x_i + w_{j0} = \sum_{i=0}^{d} w_{ji} x_i$$
and produces output
$$z_j = g(a_j) = g\!\left(\sum_{i=0}^{d} w_{ji} x_i\right).$$
Output unit $k$ thus receives
$$a_k = \sum_{j=1}^{M} v_{kj} z_j + v_{k0} = \sum_{j=0}^{M} v_{kj} z_j$$

  4. Multi-Layer Network (cont.)

and produces the final output
$$y_k = g(a_k) = g\!\left(\sum_{j=0}^{M} v_{kj} z_j\right) = g\!\left(\sum_{j=0}^{M} v_{kj}\, g\!\left(\sum_{i=0}^{d} w_{ji} x_i\right)\right)$$
Note that the activation function $g(\cdot)$ in the first layer can be different from the one used in the second layer (or in other layers).
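
To make the forward pass concrete, here is a minimal NumPy sketch of the equations above. It assumes a logistic activation $g$ in both layers; the names forward, W (shape $M \times (d+1)$) and V (shape $K \times (M+1)$) are illustrative, not taken from the slides.

```python
import numpy as np

def g(a):
    """Logistic activation (one possible choice for g)."""
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W, V):
    """Forward pass: x is a (d,) input, W holds the first-layer weights w_ji
    (M x (d+1)), V the second-layer weights v_kj (K x (M+1))."""
    x_b = np.concatenate(([1.0], x))          # prepend the bias unit x_0 = 1
    a_hidden = W @ x_b                        # a_j = sum_i w_ji x_i
    z = np.concatenate(([1.0], g(a_hidden)))  # hidden activations, with bias z_0 = 1
    a_out = V @ z                             # a_k = sum_j v_kj z_j
    y = g(a_out)                              # y_k = g(a_k)
    return y, z
```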

  5. Multi-Layer Networks Example

[Figure: a small network with inputs $x_0 = 1, x_1, x_2$, hidden units $z_0 = 1, z_1, z_2, z_3$ connected by first-layer weights $w_{10}, w_{12}, w_{20}, w_{22}, w_{30}, w_{32}$, and outputs $y_1, y_2$ connected by second-layer weights $v_{10}, v_{13}, v_{20}, v_{23}$.]

Note: sometimes the layers of units are counted (three layers here) rather than the layers of adaptive weights. In this course, an $L$-layer network refers to a network with $L$ layers of adaptive weights.

  6. LMS Learning Rule for Multi-Layer Networks
• We have seen that the LMS learning rule is based on the gradient descent algorithm.
• The LMS learning rule worked because the error is proportional to the squared difference between the actual output $y$ and the target output $t$, and can be evaluated for each output unit.
• In a multi-layer network we can use the LMS learning rule on the hidden-to-output weights, because the target outputs are known.

Problem: we cannot apply the same rule to the input-to-hidden weights, because the target values of the hidden units are unknown. To put it the other way around: how do we update the weights in the first layer?

  7. Backpropagation (Hidden-to-Output Layer)

Recall that we want to minimize the error on the training patterns between the actual output $y_k$ and the target output $t_k$:
$$E = \frac{1}{2} \sum_{k=1}^{K} (y_k - t_k)^2.$$
The backpropagation learning rule is based on gradient descent:
$$\Delta \mathbf{w} = -\eta \frac{\partial E}{\partial \mathbf{w}}, \qquad \text{in component form } \Delta w_{st} = -\eta \frac{\partial E}{\partial w_{st}}.$$
Apply the chain rule for differentiation:
$$\frac{\partial E}{\partial v_{kj}} = \frac{\partial E}{\partial a_k} \frac{\partial a_k}{\partial v_{kj}}$$

  8. Backprop. (Hidden-to-Output Layer) (cont.)

The gradient descent rule gives:
$$\Delta v_{kj} = -\eta \frac{\partial E}{\partial v_{kj}} = -\eta\, (y_k - t_k)\, g'(a_k)\, z_j = -\eta\, \delta_k z_j$$
where $\delta_k = (y_k - t_k)\, g'(a_k)$.
Observe that this result is identical to that obtained for LMS.
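
As a sketch of this update in code, continuing the hypothetical forward-pass example above (logistic $g$, so $g'(a_k) = y_k(1 - y_k)$); the function names are illustrative.

```python
import numpy as np

def output_deltas(y, t):
    """delta_k = (y_k - t_k) g'(a_k), with g'(a_k) = y_k (1 - y_k) for the logistic g."""
    return (y - t) * y * (1.0 - y)

def update_output_weights(V, z, delta_k, eta):
    """Gradient-descent step on the hidden-to-output weights: v_kj <- v_kj - eta * delta_k * z_j."""
    return V - eta * np.outer(delta_k, z)
```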

  9. Backpropagation (Input-to-Hidden Layer)

For the input-to-hidden connections we must differentiate with respect to the $w_{ji}$'s, which are deeply embedded in
$$E = \frac{1}{2} \sum_{k=1}^{K} \left( g\!\left(\sum_{j=0}^{M} v_{kj}\, g\!\left(\sum_{i=0}^{d} w_{ji} x_i\right)\right) - t_k \right)^{2}$$
Apply the chain rule:
$$\Delta w_{ji} = -\eta \frac{\partial E}{\partial w_{ji}} = -\eta \frac{\partial E}{\partial z_j} \frac{\partial z_j}{\partial a_j} \frac{\partial a_j}{\partial w_{ji}} = -\eta \sum_{k=1}^{K} \underbrace{(y_k - t_k)\, g'(a_k)}_{\delta_k}\, v_{kj}\, g'(a_j)\, x_i = -\eta \sum_{k=1}^{K} \delta_k v_{kj}\, g'(a_j)\, x_i$$

  10. Backprop. (Input-to-Hidden Layer) (cont.)

$$\Delta w_{ji} = -\eta\, \delta_j x_i \qquad \text{where} \qquad \delta_j = g'(a_j) \sum_{k=1}^{K} v_{kj}\, \delta_k$$
Observe that we need to propagate the errors (the $\delta$'s) backwards in order to update the weights $v$ and $w$:
$$\Delta v_{kj} = -\eta\, \delta_k z_j, \qquad \delta_k = (y_k - t_k)\, g'(a_k)$$
$$\Delta w_{ji} = -\eta\, \delta_j x_i, \qquad \delta_j = g'(a_j) \sum_{k=1}^{K} v_{kj}\, \delta_k$$
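
A minimal sketch of the backpropagated hidden deltas and the input-to-hidden update, again assuming logistic hidden units (so $g'(a_j) = z_j(1 - z_j)$) and the array layout of the earlier sketches; V[:, 1:] skips the bias column $v_{k0}$, since the bias unit $z_0$ has no incoming weights $w_{ji}$.

```python
import numpy as np

def hidden_deltas(z, V, delta_k):
    """delta_j = g'(a_j) * sum_k v_kj delta_k for the non-bias hidden units z_1..z_M."""
    z_hidden = z[1:]                                   # drop the bias unit z_0
    return z_hidden * (1.0 - z_hidden) * (V[:, 1:].T @ delta_k)

def update_hidden_weights(W, x_with_bias, delta_j, eta):
    """Gradient-descent step on the input-to-hidden weights: w_ji <- w_ji - eta * delta_j * x_i."""
    return W - eta * np.outer(delta_j, x_with_bias)
```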

  11. Error Backpropagation
• Apply an input $\mathbf{x}$ and forward propagate through the network using $a_j = \sum_{i=0}^{d} w_{ji} x_i$ and $z_j = g(a_j)$ to find the activations of all the hidden and output units.
• Compute the deltas $\delta_k$ for all the output units using $\delta_k = (y_k - t_k)\, g'(a_k)$.
• Backpropagate the $\delta$'s using $\delta_j = g'(a_j) \sum_{k=1}^{K} v_{kj} \delta_k$ to obtain $\delta_j$ for each hidden unit in the network.

Time and space complexity: $d$ input units, $M$ hidden units and $K$ output units give $M(d+1)$ weights in the first layer and $K(M+1)$ weights in the second layer. Space and time complexity is therefore $O(M(K+d))$. If $e$ training epochs are performed, the time complexity is $O(eM(K+d))$.
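
The three steps can be chained into a single routine; this sketch reuses the hypothetical helpers from the earlier examples (forward, output_deltas, hidden_deltas) and is not code from the course.

```python
import numpy as np

def backprop_deltas(x, t, W, V):
    """One pattern: forward propagate, compute output deltas, backpropagate them."""
    x_b = np.concatenate(([1.0], x))         # input with bias x_0 = 1
    y, z = forward(x, W, V)                  # step 1: forward propagation
    delta_k = output_deltas(y, t)            # step 2: deltas at the output units
    delta_j = hidden_deltas(z, V, delta_k)   # step 3: backpropagate to the hidden units
    return delta_k, delta_j, z, x_b
```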

  12. Backprop. (Output-to-Hidden Layer) Vis.

[Figure: the example network from slide 5, with the output delta $\delta_1$ at $y_1$ highlighted; the weight from hidden unit $z_3$ to output $y_1$ is updated as $v_{13}^{\text{new}} = v_{13} - \eta\, \delta_1 z_3$.]

  13. Backprop. (Hidden-to-Input Layer) Vis.

[Figure: the same network, with the output deltas $\delta_1, \delta_2$ propagated back to hidden unit $z_1$; the weight from input $x_2$ to $z_1$ is updated as $w_{12}^{\text{new}} = w_{12} - \eta\, [\,g'(a_1)(v_{11}\delta_1 + v_{21}\delta_2)\,]\, x_2$, where the bracketed factor is the hidden-unit delta $\delta_j$ for $j = 1$.]

  14. Property of Activation Functions
• In the backpropagation algorithm the derivative of $g(a)$ is required to evaluate the $\delta$'s.
• The activation functions
$$g_1(a) = \frac{1}{1 + \exp(-\beta a)} \qquad \text{and} \qquad g_2(a) = \tanh(\beta a)$$
obey the property
$$g_1'(a) = \beta\, g_1(a)\,(1 - g_1(a)), \qquad g_2'(a) = \beta\,\bigl(1 - g_2(a)^2\bigr)$$
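
A short sketch of these two activations and their derivatives, with a finite-difference check of the stated identities; the code is purely illustrative and not from the slides.

```python
import numpy as np

def g1(a, beta=1.0):
    return 1.0 / (1.0 + np.exp(-beta * a))        # logistic

def g1_prime(a, beta=1.0):
    return beta * g1(a, beta) * (1.0 - g1(a, beta))

def g2(a, beta=1.0):
    return np.tanh(beta * a)

def g2_prime(a, beta=1.0):
    return beta * (1.0 - g2(a, beta) ** 2)

# numerical check of the derivative formulas at an arbitrary point
a, h = 0.3, 1e-6
print(np.isclose(g1_prime(a), (g1(a + h) - g1(a - h)) / (2 * h)))  # True
print(np.isclose(g2_prime(a), (g2(a + h) - g2(a - h)) / (2 * h)))  # True
```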

  15. Online Backpropagation Algorithm

input: (x_1, t_1), ..., (x_N, t_N) ∈ R^d × {C_1, C_2, ..., C_K}; η ∈ R^+; max.epoch ∈ N; ε ∈ R^+
output: w, v
begin
  randomly initialize w, v
  epoch ← 0
  repeat
    for n ← 1 to N do
      x ← select pattern x_n
      v_kj ← v_kj − η δ_k z_j
      w_ji ← w_ji − η δ_j x_i
    epoch ← epoch + 1
  until (epoch = max.epoch) or (‖∇E‖ < ε)
  return w, v
end
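
Below is a minimal NumPy sketch of this online scheme, reusing the hypothetical backprop_deltas helper defined earlier; it assumes 1-of-K target vectors and approximates the ‖∇E‖ < ε test by the size of the accumulated weight updates, so treat it as an illustration rather than the course's reference implementation.

```python
import numpy as np

def train_online(X, T, M, eta=0.1, max_epoch=1000, eps=1e-4):
    """X: (N, d) inputs, T: (N, K) 1-of-K targets, M: number of hidden units."""
    N, d = X.shape
    K = T.shape[1]
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.1, size=(M, d + 1))   # first-layer weights w_ji
    V = rng.normal(scale=0.1, size=(K, M + 1))   # second-layer weights v_kj
    for epoch in range(max_epoch):
        sq = 0.0
        for x, t in zip(X, T):
            delta_k, delta_j, z, x_b = backprop_deltas(x, t, W, V)
            dV = -eta * np.outer(delta_k, z)     # v_kj <- v_kj - eta delta_k z_j
            dW = -eta * np.outer(delta_j, x_b)   # w_ji <- w_ji - eta delta_j x_i
            V += dV
            W += dW
            sq += np.sum(dV ** 2) + np.sum(dW ** 2)
        if np.sqrt(sq) / eta < eps:              # rough stand-in for ||grad E|| < eps
            break
    return W, V
```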

  16. Batch Backpropagation Algorithm

input: (x_1, t_1), ..., (x_N, t_N) ∈ R^d × {C_1, C_2, ..., C_K}; η ∈ R^+; max.epoch ∈ N; ε ∈ R^+
output: w, v
begin
  randomly initialize w, v
  epoch ← 0, ∆w_ji ← 0, ∆v_kj ← 0
  repeat
    for n ← 1 to N do
      x ← select pattern x_n
      ∆v_kj ← ∆v_kj − η δ_k z_j
      ∆w_ji ← ∆w_ji − η δ_j x_i
    v_kj ← v_kj + ∆v_kj
    w_ji ← w_ji + ∆w_ji
    epoch ← epoch + 1
  until (epoch = max.epoch) or (‖∇E‖ < ε)
  return w, v
end
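
And a matching sketch of the batch variant: the per-pattern updates are accumulated over the whole training set and applied once per epoch. Resetting the accumulators at the start of each epoch is an assumption the pseudocode above leaves implicit; the backprop_deltas helper and the stopping proxy are the same hypothetical ones as before.

```python
import numpy as np

def train_batch(X, T, M, eta=0.1, max_epoch=1000, eps=1e-4):
    """Batch backpropagation: one weight update per epoch."""
    N, d = X.shape
    K = T.shape[1]
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.1, size=(M, d + 1))
    V = rng.normal(scale=0.1, size=(K, M + 1))
    for epoch in range(max_epoch):
        dW, dV = np.zeros_like(W), np.zeros_like(V)   # accumulators (reset each epoch)
        for x, t in zip(X, T):
            delta_k, delta_j, z, x_b = backprop_deltas(x, t, W, V)
            dV -= eta * np.outer(delta_k, z)          # Dv_kj <- Dv_kj - eta delta_k z_j
            dW -= eta * np.outer(delta_j, x_b)        # Dw_ji <- Dw_ji - eta delta_j x_i
        V += dV                                       # v_kj <- v_kj + Dv_kj
        W += dW                                       # w_ji <- w_ji + Dw_ji
        if np.sqrt(np.sum(dW ** 2) + np.sum(dV ** 2)) / eta < eps:
            break
    return W, V
```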

  17. Multi-Layer Networks & Heaviside Step Func.

[Figure: possible decision boundaries that can be generated by networks having various numbers of layers and using the Heaviside activation function.]

  18. Multi-Layer NN for XOR Separability Problem

[Figure: a two-layer network of threshold units computing $x_1$ XOR $x_2$, with inputs $x_0 = 1, x_1, x_2$, hidden units $z_0 = 1, z_1, z_2$, and output $y$; the figure shows bias weights 0.5 and −1.5 into the hidden units, input weights of 1, and output weights −1 (bias), 0.7 and −0.4.]

Truth table:
  x_1   x_2   x_1 XOR x_2
  -1    -1        -1
  -1    +1        +1
  +1    -1        +1
  +1    +1        -1

Activation function:
$$g(a) = \begin{cases} -1 & \text{if } a < 0 \\ +1 & \text{if } a \geq 0 \end{cases}$$
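
As a quick check, the sketch below evaluates this network on all four inputs; the assignment of the slide's weight values to particular connections (bias 0.5 into $z_1$, bias −1.5 into $z_2$, output weights −1, 0.7, −0.4) is my reading of the figure following the standard XOR construction, so treat it as an assumption.

```python
import numpy as np

def g_step(a):
    """The threshold activation from the slide: -1 if a < 0, +1 if a >= 0."""
    return np.where(a >= 0, 1, -1)

W = np.array([[ 0.5, 1.0, 1.0],   # hidden unit z_1: bias 0.5, input weights 1, 1 (assumed assignment)
              [-1.5, 1.0, 1.0]])  # hidden unit z_2: bias -1.5, input weights 1, 1 (assumed assignment)
v = np.array([-1.0, 0.7, -0.4])   # output unit y: bias -1, weights 0.7 and -0.4

for x1, x2 in [(-1, -1), (-1, 1), (1, -1), (1, 1)]:
    z = g_step(W @ np.array([1.0, x1, x2]))        # hidden activations
    y = g_step(v @ np.concatenate(([1.0], z)))     # network output
    print(x1, x2, "->", int(y))                    # reproduces the XOR truth table
```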

  19. Multi-Layer NN for XOR Sep. Problem (cont.)

[Figure: surface plots over the $(x_1, x_2)$ plane, both axes ranging from −1 to 1, illustrating how the network of the previous slide separates the XOR classes.]

  20. Expressive Power of Multi-Layer Networks

With a two-layer network and a sufficient number of hidden units, any type of function can be represented, given proper nonlinearities and weights. The famous mathematician Andrey Kolmogorov proved that any continuous function $y(\mathbf{x})$ defined on the unit hypercube $[0,1]^n$, $n \geq 2$, can be represented in the form
$$y(\mathbf{x}) = \sum_{j=1}^{2n+1} \Xi_j\!\left(\sum_{i=1}^{d} \Psi_{ij}(x_i)\right)$$
for properly chosen functions $\Xi_j$ and $\Psi_{ij}$.

  21. Bayes Decision Region vs. Neural Network

[Figure: scatter plot of two classes in the $(x, y)$ plane, with $x$ from 0 to 10 and $y$ from 0.0 to 2.5. Points from the blue and the red class are generated by a mixture of Gaussians. The black curve shows the optimal separation in the Bayes sense; the gray curves show the neural-network separations obtained in two independent backpropagation learning runs.]

  22. Neural Network (Density) Decision Region

[Figure: the decision region learned by the neural network.]
