 
Multi-Layer vs. Single-Layer Networks

Single-layer networks
• are based on a linear combination of the input variables, which is transformed by a linear or non-linear activation function
• are limited in the range of functions they can represent

Multi-layer networks
• consist of multiple layers and are capable of approximating any continuous functional mapping
• are not as straightforward to train as single-layer networks

– p. 82
Multi-Layer Network

[Figure: two-layer network. Inputs $x_0 = 1$ (bias), $x_1, \ldots, x_d$; first layer with weights $w_{ji}$; hidden units $z_0 = 1$ (bias), $z_1, \ldots, z_M$; second layer with weights $v_{kj}$; outputs $y_1, \ldots, y_K$.]

The connection in the first layer from input unit $i$ to hidden unit $j$ is denoted $w_{ji}$. The connection from hidden unit $j$ to output unit $k$ is denoted $v_{kj}$.

– p. 83
Multi-Layer Network (cont.)

Hidden unit $j$ receives input

$$a_j = \sum_{i=1}^{d} w_{ji} x_i + w_{j0} = \sum_{i=0}^{d} w_{ji} x_i$$

and produces output

$$z_j = g(a_j) = g\left( \sum_{i=0}^{d} w_{ji} x_i \right).$$

Output unit $k$ thus receives

$$a_k = \sum_{j=1}^{M} v_{kj} z_j + v_{k0} = \sum_{j=0}^{M} v_{kj} z_j$$

– p. 84
Multi-Layer Network (cont.)

and produces the final output

$$y_k = g(a_k) = g\left( \sum_{j=0}^{M} v_{kj} z_j \right) = g\left( \sum_{j=0}^{M} v_{kj}\, g\left( \sum_{i=0}^{d} w_{ji} x_i \right) \right)$$

Note that the activation function $g(\cdot)$ in the first layer can be different from those in the second layer (or other layers).

– p. 85
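The forward computation above can be sketched in a few lines of plain Python. This is a minimal illustration, not code from the lecture: the function name `forward`, the list-of-lists weight layout (row `j-1` of `w` holds $w_{j0}, \ldots, w_{jd}$, row `k-1` of `v` holds $v_{k0}, \ldots, v_{kM}$) and the default choice of `tanh` for both layers are assumptions made for the example.

```python
import math

def forward(x, w, v, g_hidden=math.tanh, g_out=math.tanh):
    """Forward pass of a two-layer network (hypothetical helper).

    x: list of d inputs, without the bias input x_0 = 1
    w: M rows of d+1 first-layer weights, w[j][0] is the bias weight
    v: K rows of M+1 second-layer weights, v[k][0] is the bias weight
    """
    x_ext = [1.0] + list(x)                       # prepend x_0 = 1
    a_hidden = [sum(w_ji * x_i for w_ji, x_i in zip(w_j, x_ext)) for w_j in w]
    z_ext = [1.0] + [g_hidden(a_j) for a_j in a_hidden]   # prepend z_0 = 1
    a_out = [sum(v_kj * z_j for v_kj, z_j in zip(v_k, z_ext)) for v_k in v]
    return [g_out(a_k) for a_k in a_out]

# Small hand-checkable case with identity activations (d = 2, M = 2, K = 1):
# a_1 = 0.5 + 1*1 - 1*2 = -0.5, a_2 = 0 + 2*1 + 1*2 = 4,
# y_1 = 1 + 1*(-0.5) - 1*4 = -3.5
identity = lambda a: a
w_example = [[0.5, 1.0, -1.0], [0.0, 2.0, 1.0]]
v_example = [[1.0, 1.0, -1.0]]
y_example = forward([1.0, 2.0], w_example, v_example, identity, identity)
print(y_example)
```

Passing the two activation functions separately mirrors the remark that the first-layer activation may differ from the second-layer one.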
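Multi-Layer Networks Example

[Figure: example network with inputs $x_0 = 1$ (bias), $x_1$, $x_2$, hidden units $z_0 = 1$ (bias), $z_1$, $z_2$, $z_3$ and outputs $y_1$, $y_2$. Labeled first-layer weights: $w_{10}$, $w_{20}$, $w_{30}$, $w_{12}$, $w_{22}$, $w_{32}$. Labeled second-layer weights: $v_{10}$, $v_{20}$, $v_{13}$, $v_{23}$.]

Note: sometimes the layers of units are counted (here three layers), rather than the layers of adaptive weights. In this course, an $L$-layer network refers to a network with $L$ layers of adaptive weights.

– p. 86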
LMS Learning Rule for Multi-Layer Networks

• We have seen that the LMS learning rule is based on the gradient descent algorithm.
• The LMS learning rule worked because the error is proportional to the squared difference between the actual output $y$ and the target output $t$, and can be evaluated for each output unit.
• In a multi-layer network we can use the LMS learning rule on the hidden-to-output layer weights because the target outputs are known.

Problem: target outputs for the hidden units are unknown, so the LMS rule cannot be applied directly to the input-to-hidden weights. Put the other way around: how should the weights in the first layer be updated?

– p. 87
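Backpropagation (Hidden-to-Output Layer)

Recall that we want to minimize the error on training patterns between the actual output $y_k$ and the target output $t_k$:

$$E = \frac{1}{2} \sum_{k=1}^{K} (y_k - t_k)^2 .$$

The backpropagation learning rule is based on gradient descent:

$$\Delta w = -\eta \frac{\partial E}{\partial w}, \qquad \text{component form} \quad \Delta w_{st} = -\eta \frac{\partial E}{\partial w_{st}}$$

Apply the chain rule for differentiation:

$$\frac{\partial E}{\partial v_{kj}} = \frac{\partial E}{\partial a_k} \frac{\partial a_k}{\partial v_{kj}}$$

– p. 88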
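Backprop. (Hidden-to-Output Layer) (cont.)

Since $\partial E / \partial a_k = (y_k - t_k)\, g'(a_k)$ and $\partial a_k / \partial v_{kj} = z_j$, the gradient descent rule gives:

$$\Delta v_{kj} = -\eta \frac{\partial E}{\partial v_{kj}} = -\eta\, (y_k - t_k)\, g'(a_k)\, z_j = -\eta\, \delta_k z_j$$

where $\delta_k = (y_k - t_k)\, g'(a_k)$.

Observe that this result is identical to that obtained for LMS.

– p. 89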
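Backpropagation (Input-to-Hidden Layer)

For the input-to-hidden connections we must differentiate with respect to the $w_{ji}$'s, which are deeply embedded in

$$E = \frac{1}{2} \sum_{k=1}^{K} \left( g\left( \sum_{j=0}^{M} v_{kj}\, g\left( \sum_{i=0}^{d} w_{ji} x_i \right) \right) - t_k \right)^2$$

Apply the chain rule, using $\partial E / \partial z_j = \sum_{k=1}^{K} (y_k - t_k)\, g'(a_k)\, v_{kj}$, $\partial z_j / \partial a_j = g'(a_j)$ and $\partial a_j / \partial w_{ji} = x_i$:

$$\Delta w_{ji} = -\eta \frac{\partial E}{\partial w_{ji}} = -\eta \frac{\partial E}{\partial z_j} \frac{\partial z_j}{\partial a_j} \frac{\partial a_j}{\partial w_{ji}}$$

$$= -\eta \sum_{k=1}^{K} \underbrace{(y_k - t_k)\, g'(a_k)}_{\delta_k}\, v_{kj}\, g'(a_j)\, x_i$$

$$= -\eta \sum_{k=1}^{K} \delta_k\, v_{kj}\, g'(a_j)\, x_i$$

– p. 90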
Backprop. (Input-to-Hidden Layer) (cont.)

$$\Delta w_{ji} = -\eta\, \delta_j x_i \qquad \text{where} \qquad \delta_j = g'(a_j) \sum_{k=1}^{K} v_{kj} \delta_k$$

Observe that we need to propagate the errors ($\delta$'s) backwards to update the weights $v$ and $w$:

$$\Delta v_{kj} = -\eta\, \delta_k z_j, \qquad \delta_k = (y_k - t_k)\, g'(a_k)$$

$$\Delta w_{ji} = -\eta\, \delta_j x_i, \qquad \delta_j = g'(a_j) \sum_{k=1}^{K} v_{kj} \delta_k$$

– p. 91
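One way to convince oneself that the two delta rules give the true gradient of $E$ is to compare them with central finite differences. The sketch below does this for a single pattern. It assumes `tanh` activations in both layers, and all function names, weight values, inputs and targets are made up for illustration.

```python
import math

def g(a):
    return math.tanh(a)              # assumed activation for both layers

def g_prime(a):
    return 1.0 - math.tanh(a) ** 2

def forward(x, w, v):
    x_ext = [1.0] + list(x)                                   # x_0 = 1
    a_hid = [sum(wji * xi for wji, xi in zip(wj, x_ext)) for wj in w]
    z_ext = [1.0] + [g(a) for a in a_hid]                     # z_0 = 1
    a_out = [sum(vkj * zj for vkj, zj in zip(vk, z_ext)) for vk in v]
    y = [g(a) for a in a_out]
    return x_ext, a_hid, z_ext, a_out, y

def error(x, t, w, v):
    y = forward(x, w, v)[4]
    return 0.5 * sum((yk - tk) ** 2 for yk, tk in zip(y, t))

def backprop(x, t, w, v):
    """Return (dE/dw, dE/dv) for one pattern using the delta rules."""
    x_ext, a_hid, z_ext, a_out, y = forward(x, w, v)
    # delta_k = (y_k - t_k) g'(a_k)
    delta_out = [(yk - tk) * g_prime(ak) for yk, tk, ak in zip(y, t, a_out)]
    # delta_j = g'(a_j) sum_k v_kj delta_k; column 0 of v is the bias, so skip it
    delta_hid = [g_prime(a_hid[j]) *
                 sum(v[k][j + 1] * delta_out[k] for k in range(len(v)))
                 for j in range(len(w))]
    grad_v = [[dk * zj for zj in z_ext] for dk in delta_out]  # dE/dv_kj = delta_k z_j
    grad_w = [[dj * xi for xi in x_ext] for dj in delta_hid]  # dE/dw_ji = delta_j x_i
    return grad_w, grad_v

def numeric_grad(x, t, w, v, layer, r, c, h=1e-6):
    """Central finite difference of E with respect to one weight."""
    def shifted(sign):
        w2 = [row[:] for row in w]
        v2 = [row[:] for row in v]
        (w2 if layer == "w" else v2)[r][c] += sign * h
        return error(x, t, w2, v2)
    return (shifted(+1) - shifted(-1)) / (2.0 * h)

# Illustrative values only (d = 2, M = 2, K = 2).
w0 = [[0.1, -0.2, 0.3], [0.4, 0.5, -0.6]]
v0 = [[0.2, -0.3, 0.1], [-0.1, 0.4, 0.2]]
x0, t0 = [0.5, -1.0], [1.0, -1.0]
gw, gv = backprop(x0, t0, w0, v0)
max_diff = max(
    [abs(gw[j][i] - numeric_grad(x0, t0, w0, v0, "w", j, i))
     for j in range(2) for i in range(3)] +
    [abs(gv[k][j] - numeric_grad(x0, t0, w0, v0, "v", k, j))
     for k in range(2) for j in range(3)])
print("largest deviation from finite differences:", max_diff)
```

The weight updates then follow as $\Delta v_{kj} = -\eta \cdot$ `grad_v[k][j]` and $\Delta w_{ji} = -\eta \cdot$ `grad_w[j][i]`.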
Error Backpropagation

• Apply input $x$ and forward propagate through the network using $a_j = \sum_{i=0}^{d} w_{ji} x_i$ and $z_j = g(a_j)$ to find the activations of all the hidden and output units.
• Compute the deltas $\delta_k$ for all the output units using $\delta_k = (y_k - t_k)\, g'(a_k)$.
• Backpropagate the $\delta$'s using $\delta_j = g'(a_j) \sum_{k=1}^{K} v_{kj} \delta_k$ to obtain $\delta_j$ for each hidden unit in the network.

Time and space complexity: $d$ input units, $M$ hidden units and $K$ output units result in $M(d+1)$ weights in the first layer and $K(M+1)$ weights in the second layer. Space complexity and time complexity per pattern are $O(M(K+d))$. If $e$ training epochs are performed, the time complexity is $O(e\,M(K+d))$ per training pattern, i.e. $O(e\,N\,M(K+d))$ for $N$ training patterns.

– p. 92
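As a quick arithmetic check of the weight counts, the helper below (a hypothetical function, not part of the lecture) evaluates $M(d+1)$ and $K(M+1)$. The values $d = 2$, $M = 3$, $K = 2$ correspond to the example network shown earlier with inputs $x_1, x_2$, hidden units $z_1, z_2, z_3$ and outputs $y_1, y_2$.

```python
def count_weights(d, M, K):
    """Number of adaptive weights per layer, bias weights included."""
    first_layer = M * (d + 1)     # each hidden unit sees d inputs plus the bias
    second_layer = K * (M + 1)    # each output unit sees M hidden units plus the bias
    return first_layer, second_layer

first, second = count_weights(d=2, M=3, K=2)
print(first, second, first + second)   # 9 8 17
```

The total $M(d+1) + K(M+1)$ grows like $M(K+d)$, which is where the stated complexity comes from.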
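Backprop. (Output-to-Hidden Layer) Vis.

[Figure: the example network (inputs $x_0 = 1$, $x_1$, $x_2$; hidden units $z_0 = 1$, $z_1$, $z_2$, $z_3$; outputs $y_1$, $y_2$) with the delta $\delta_1$ marked at output $y_1$. Example update: $v_{13}^{\text{new}} = v_{13} - \eta\, \delta_1 z_3$.]

– p. 93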
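Backprop. (Hidden-to-Input Layer) Vis.

[Figure: the example network (inputs $x_0 = 1$, $x_1$, $x_2$; hidden units $z_0 = 1$, $z_1$, $z_2$, $z_3$; outputs $y_1$, $y_2$) with the deltas $\delta_1$, $\delta_2$ marked at the outputs. Example update:

$$w_{12}^{\text{new}} = w_{12} - \eta\, \underbrace{\left[ g'(a_1)\,(v_{11} \delta_1 + v_{21} \delta_2) \right]}_{\delta_j}\, x_2$$

where the bracketed term is $\delta_j$ for $j = 1$.]

– p. 94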
Property of Activation Functions

• In the backpropagation algorithm the derivative of $g(a)$ is required to evaluate the $\delta$'s.
• The activation functions

$$g_1(a) = \frac{1}{1 + \exp(-\beta a)} \qquad \text{and} \qquad g_2(a) = \tanh(\beta a)$$

obey the property

$$g_1'(a) = \beta\, g_1(a)\,(1 - g_1(a))$$

$$g_2'(a) = \beta\, (1 - (g_2(a))^2)$$

– p. 95
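The practical value of these identities is that the derivative can be computed from the already known unit output. The following sketch checks both identities numerically against a central finite difference. The function names and the test points are illustrative choices, not part of the lecture.

```python
import math

def logistic(a, beta=1.0):
    return 1.0 / (1.0 + math.exp(-beta * a))

def logistic_prime(a, beta=1.0):
    s = logistic(a, beta)
    return beta * s * (1.0 - s)              # g1'(a) = beta g1(a) (1 - g1(a))

def tanh_act(a, beta=1.0):
    return math.tanh(beta * a)

def tanh_prime(a, beta=1.0):
    return beta * (1.0 - tanh_act(a, beta) ** 2)   # g2'(a) = beta (1 - g2(a)^2)

def central_diff(f, a, h=1e-6):
    """Numerical derivative used only as a reference value."""
    return (f(a + h) - f(a - h)) / (2.0 * h)

beta = 2.0
for a in (-2.0, -0.5, 0.0, 1.3):
    err1 = abs(logistic_prime(a, beta) - central_diff(lambda u: logistic(u, beta), a))
    err2 = abs(tanh_prime(a, beta) - central_diff(lambda u: tanh_act(u, beta), a))
    print(a, err1, err2)
```

Both deviations should be at the level of finite-difference round-off.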
Online Backpropagation Algorithm

input : $(x_1, t_1), \ldots, (x_N, t_N) \in \mathbb{R}^d \times \{C_1, C_2, \ldots, C_K\}$, $\eta \in \mathbb{R}_+$, max.epoch $\in \mathbb{N}$, $\epsilon \in \mathbb{R}_+$
output : $w$, $v$
begin
    Randomly initialize $w$, $v$
    epoch ← 0
    repeat
        for n ← 1 to N do
            x ← select pattern $x_n$
            $v_{kj} \leftarrow v_{kj} - \eta\, \delta_k z_j$
            $w_{ji} \leftarrow w_{ji} - \eta\, \delta_j x_i$
        epoch ← epoch + 1
    until (epoch = max.epoch) or ($\|\nabla E\| < \epsilon$)
    return $w$, $v$
end

– p. 96
Batch Backpropagation Algorithm

input : $(x_1, t_1), \ldots, (x_N, t_N) \in \mathbb{R}^d \times \{C_1, C_2, \ldots, C_K\}$, $\eta \in \mathbb{R}_+$, max.epoch $\in \mathbb{N}$, $\epsilon \in \mathbb{R}_+$
output : $w$, $v$
begin
    Randomly initialize $w$, $v$
    epoch ← 0, $\Delta w_{ji} \leftarrow 0$, $\Delta v_{kj} \leftarrow 0$
    repeat
        for n ← 1 to N do
            x ← select pattern $x_n$
            $\Delta v_{kj} \leftarrow \Delta v_{kj} - \eta\, \delta_k z_j$, $\Delta w_{ji} \leftarrow \Delta w_{ji} - \eta\, \delta_j x_i$
        $v_{kj} \leftarrow v_{kj} + \Delta v_{kj}$
        $w_{ji} \leftarrow w_{ji} + \Delta w_{ji}$
        epoch ← epoch + 1
    until (epoch = max.epoch) or ($\|\nabla E\| < \epsilon$)
    return $w$, $v$
end

– p. 97
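The online and batch procedures differ only in when the weights are changed, which a single training function with a `batch` switch makes visible. The sketch below is an illustration under several assumptions that are not in the slides: `tanh` activations in both layers, real-valued target vectors instead of class labels, uniform initialization in $[-0.5, 0.5]$, accumulators reset at the start of every epoch, and $\|\nabla E\|$ approximated by the norm of the gradients summed during the epoch. All names are hypothetical.

```python
import math
import random

def forward(x, w, v):
    x_ext = [1.0] + list(x)
    z_ext = [1.0] + [math.tanh(sum(a * b for a, b in zip(wj, x_ext))) for wj in w]
    y = [math.tanh(sum(a * b for a, b in zip(vk, z_ext))) for vk in v]
    return x_ext, z_ext, y

def gradients(x, t, w, v):
    """Per-pattern gradients via the delta rules; tanh'(a) = 1 - tanh(a)^2."""
    x_ext, z_ext, y = forward(x, w, v)
    d_out = [(yk - tk) * (1.0 - yk * yk) for yk, tk in zip(y, t)]
    d_hid = [(1.0 - z_ext[j + 1] ** 2) *
             sum(v[k][j + 1] * d_out[k] for k in range(len(v)))
             for j in range(len(w))]
    return ([[dj * xi for xi in x_ext] for dj in d_hid],
            [[dk * zj for zj in z_ext] for dk in d_out])

def total_error(data, w, v):
    return sum(0.5 * sum((yk - tk) ** 2 for yk, tk in zip(forward(x, w, v)[2], t))
               for x, t in data)

def step(weights, grad, eta):
    """In-place gradient descent step: weight <- weight - eta * gradient."""
    for row_w, row_g in zip(weights, grad):
        for i, g_val in enumerate(row_g):
            row_w[i] -= eta * g_val

def train(data, n_hidden, eta=0.1, max_epoch=1000, eps=1e-4, batch=False, seed=0):
    rng = random.Random(seed)
    d, K = len(data[0][0]), len(data[0][1])
    w = [[rng.uniform(-0.5, 0.5) for _ in range(d + 1)] for _ in range(n_hidden)]
    v = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)] for _ in range(K)]
    for _ in range(max_epoch):
        acc_w = [[0.0] * (d + 1) for _ in w]          # gradients summed over the epoch
        acc_v = [[0.0] * (n_hidden + 1) for _ in v]
        for x, t in data:
            gw, gv = gradients(x, t, w, v)            # both from the current weights
            for acc, grad in ((acc_w, gw), (acc_v, gv)):
                for row_a, row_g in zip(acc, grad):
                    for i, g_val in enumerate(row_g):
                        row_a[i] += g_val
            if not batch:                             # online: update after each pattern
                step(v, gv, eta)
                step(w, gw, eta)
        if batch:                                     # batch: one update per epoch
            step(v, acc_v, eta)
            step(w, acc_w, eta)
        norm = math.sqrt(sum(g_val * g_val
                             for acc in (acc_w, acc_v) for row in acc for g_val in row))
        if norm < eps:                                # approximate stopping test
            break
    return w, v

# XOR patterns with +/-1 coding, used here only as toy data.
xor_data = [([-1.0, -1.0], [-1.0]), ([-1.0, 1.0], [1.0]),
            ([1.0, -1.0], [1.0]), ([1.0, 1.0], [-1.0])]
w_on, v_on = train(xor_data, n_hidden=2, batch=False)
w_ba, v_ba = train(xor_data, n_hidden=2, batch=True)
print("online error:", total_error(xor_data, w_on, v_on))
print("batch error: ", total_error(xor_data, w_ba, v_ba))
```

Whether either run reaches a small error depends on the initialization, the learning rate and the number of epochs, so the printed values should be read as a diagnostic rather than a guaranteed result.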
Multi-Layer Networks & Heaviside Step Func.

[Figure: possible decision boundaries that can be generated by networks having various numbers of layers and using the Heaviside activation function.]

– p. 98
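Multi-Layer NN for XOR Separability Problem

[Figure: two-layer network with inputs $x_0$ (bias), $x_1$, $x_2$, hidden units $z_0$ (bias), $z_1$, $z_2$ and output $y$. Weight labels in the figure: $-1$, $0.7$, $-0.4$, $-1.5$, $0.5$, $1$, $1$, $1$, $1$.]

x1    x2    x1 XOR x2
−1    −1    −1
−1    +1    +1
+1    −1    +1
+1    +1    −1

$$g(a) = \begin{cases} -1 & \text{if } a < 0 \\ +1 & \text{if } a \ge 0 \end{cases}$$

– p. 99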
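The weight values on this slide can be checked against the XOR truth table with a few lines of code. The figure layout did not survive extraction, so the assignment of values to connections below is an assumption: $z_1 = g(0.5 + x_1 + x_2)$, $z_2 = g(-1.5 + x_1 + x_2)$ and $y = g(-1 + 0.7 z_1 - 0.4 z_2)$. The function names are hypothetical.

```python
def sign(a):
    """Threshold activation from the slide: -1 if a < 0, +1 otherwise."""
    return 1 if a >= 0 else -1

def xor_net(x1, x2):
    # Assumed weight assignment (see the note above).
    z1 = sign(0.5 + 1.0 * x1 + 1.0 * x2)     # -1 only when both inputs are -1
    z2 = sign(-1.5 + 1.0 * x1 + 1.0 * x2)    # +1 only when both inputs are +1
    return sign(-1.0 + 0.7 * z1 - 0.4 * z2)

for x1, x2 in [(-1, -1), (-1, 1), (1, -1), (1, 1)]:
    print(x1, x2, xor_net(x1, x2))
```

Under this assignment the first hidden unit behaves like OR, the second like AND, and the output unit fires when OR is true but AND is not.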
Multi-Layer NN for XOR Sep. Problem (cont.)

[Figure: three plots over the ($x_1$, $x_2$) plane, with both axes running from $-1$ to $1$; regions are labeled $-1$ and $1$.]

– p. 100
Expressive Power of Multi-Layer Networks

With a two-layer network and a sufficient number of hidden units, any continuous function can be represented, given proper nonlinearities and weights.

The famous mathematician Andrey Kolmogorov proved that any continuous function $y(x)$ defined on the unit hypercube $[0, 1]^n$, $n \ge 2$, can be represented in the form

$$y(x) = \sum_{j=1}^{2n+1} \Xi_j \left( \sum_{i=1}^{d} \Psi_{ij}(x_i) \right)$$

for properly chosen $\Xi_j$ and $\Psi_{ij}$ (here $d = n$ is the input dimension).

– p. 101
Bayes Decision Region vs. Neural Network

[Figure: scatter plot of two classes with decision curves; horizontal axis $x$ from 0 to 10, vertical axis $y$ from 0.0 to 2.5.]

Points from the blue and red classes are generated by a mixture of Gaussians. The black curve shows the optimal separation in the Bayes sense. The gray curves show the neural network separation from two independent backpropagation learning runs.

– p. 102
Neural Network (Density) Decision Region – p. 103