Computational Graphs and Backpropagation



  1. Computational Graphs, and Backpropagation Michael Collins, Columbia University

  2. A Key Problem: Calculating Derivatives

   p(y | x; θ, v) = exp( v(y) · φ(x; θ) + γ_y ) / Σ_{y′ ∈ Y} exp( v(y′) · φ(x; θ) + γ_{y′} )   (1)

where φ(x; θ) = g(Wx + b) and
◮ m is an integer specifying the number of hidden units
◮ W ∈ R^{m×d} and b ∈ R^m are the parameters in θ
◮ g : R^m → R^m is the transfer function

Key question: given a training example (x_i, y_i), define

   L(θ, v) = −log p(y_i | x_i; θ, v)

How do we calculate derivatives such as dL(θ, v)/dW_{k,j}?
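
As a concrete reference, here is a minimal NumPy sketch of equation (1) and the loss L(θ, v). It assumes g = tanh and stores the vectors v(y) as the rows of a K × m matrix v; both choices are mine, not the slides'.

```python
# A minimal sketch of the model in equation (1), using the slides' notation.
import numpy as np

def phi(x, W, b, g=np.tanh):
    # phi(x; theta) = g(Wx + b), with g a transfer function (tanh assumed here)
    return g(W @ x + b)

def log_prob(y, x, W, b, v, gamma):
    # log p(y | x; theta, v) from equation (1), computed stably
    h = phi(x, W, b)                        # hidden representation, R^m
    scores = v @ h + gamma                  # one score v(y').h + gamma_y' per label
    return scores[y] - np.logaddexp.reduce(scores)

def loss(x_i, y_i, W, b, v, gamma):
    # L(theta, v) = -log p(y_i | x_i; theta, v)
    return -log_prob(y_i, x_i, W, b, v, gamma)
```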

  3. A Simple Version of Stochastic Gradient Descent (Continued)

Algorithm:
◮ For t = 1 . . . T
  ◮ Select an integer i uniformly at random from {1 . . . n}
  ◮ Define L(θ, v) = −log p(y_i | x_i; θ, v)
  ◮ For each parameter θ_j, θ_j = θ_j − η_t × dL(θ, v)/dθ_j
  ◮ For each label y, for each parameter v_k(y), v_k(y) = v_k(y) − η_t × dL(θ, v)/dv_k(y)
  ◮ For each label y, γ_y = γ_y − η_t × dL(θ, v)/dγ_y

Output: parameters θ and v
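
A sketch of this loop in NumPy follows. The helper `gradients` is hypothetical: it stands for whatever returns dL/dW, dL/db, dL/dv, dL/dγ for example i, which is exactly what backpropagation (later in the lecture) computes.

```python
import numpy as np

def sgd(examples, params, gradients, T, eta):
    """examples: list of (x_i, y_i); params: dict with keys W, b, v, gamma;
    gradients: hypothetical helper returning a matching dict of derivatives;
    eta: function mapping t to the step size eta_t."""
    n = len(examples)
    for t in range(1, T + 1):
        i = np.random.randint(n)              # i uniform over {1 ... n} (0-based here)
        x_i, y_i = examples[i]
        grads = gradients(x_i, y_i, params)
        for name in params:                   # e.g. W = W - eta_t * dL/dW
            params[name] -= eta(t) * grads[name]
    return params
```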

  4. Overview

◮ Introduction
◮ The chain rule
◮ Derivatives in a single-layer neural network
◮ Computational graphs
◮ Backpropagation in computational graphs
◮ Justification for backpropagation

  5. Partial Derivatives

Assume we have scalar variables z_1, z_2, . . . , z_n and y, and a function f, and we define

   y = f(z_1, z_2, . . . , z_n)

Then the partial derivative of f with respect to z_i is written as

   ∂f(z_1, z_2, . . . , z_n) / ∂z_i

We will also write the partial derivative as

   ∂y/∂z_i |^f_{z_1 . . . z_n}

which can be read as “the partial derivative of y with respect to z_i, under function f, at values z_1 . . . z_n”

  6. Partial Derivatives (continued)

We will also write the partial derivative as

   ∂y/∂z_i |^f_{z_1 . . . z_n}

which can be read as “the partial derivative of y with respect to z_i, under function f, at values z_1 . . . z_n”

The notation including f is non-standard, but helps to alleviate a lot of potential confusion... We will sometimes drop f and/or z_1 . . . z_n when this is clear from context
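
As a concrete illustration of “at values z_1 . . . z_n”, the small sketch below (my own example, not from the slides) evaluates a partial derivative at fixed values by central finite differences.

```python
# Partial derivative of y = f(z1, z2) = z1^2 * z2 with respect to z1,
# at the values z1 = 3, z2 = 2; analytically df/dz1 = 2 * z1 * z2 = 12.
def f(z1, z2):
    return z1 ** 2 * z2

eps = 1e-6
z1, z2 = 3.0, 2.0
fd = (f(z1 + eps, z2) - f(z1 - eps, z2)) / (2 * eps)  # central difference
print(fd)   # approximately 12.0
```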

  7. The Chain Rule

Assume we have equations

   y = f(z),   z = g(x),   h(x) = f(g(x))

Then

   dh(x)/dx = df(g(x))/dz × dg(x)/dx

Or equivalently,

   ∂y/∂x |^h_x = ∂y/∂z |^f_{g(x)} × ∂z/∂x |^g_x

  8. The Chain Rule

Assume we have equations

   y = f(z),   z = g(x),   h(x) = f(g(x))

then

   dh(x)/dx = df(g(x))/dz × dg(x)/dx

For example, assume f(z) = z^2 and g(x) = x^3. Assume in addition that x = 2. Then:

   g(x) = x^3 = 8,   dg(x)/dx = 3x^2 = 12,   f(z) = z^2 = 64,   df(z)/dz = 2z = 16

from which it follows that

   dh(x)/dx = 12 × 16 = 192
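
The worked example is easy to confirm numerically; this small sketch checks dh(x)/dx at x = 2 by central differences.

```python
# h(x) = f(g(x)) with f(z) = z^2, g(x) = x^3; the chain rule gives
# dh/dx = (df/dz at z = 8) * (dg/dx at x = 2) = 16 * 12 = 192.
def g(x): return x ** 3
def f(z): return z ** 2

x, eps = 2.0, 1e-6
fd = (f(g(x + eps)) - f(g(x - eps))) / (2 * eps)  # central difference
print(round(fd, 3))   # 192.0
```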

  9. The Chain Rule (continued)

Assume we have equations

   y = f(z),   z_1 = g_1(x), z_2 = g_2(x), . . . , z_n = g_n(x)

for some functions f, g_1 . . . g_n, where z is a vector z ∈ R^n, and x is a vector x ∈ R^m. Define the function

   h(x) = f(g_1(x), g_2(x), . . . , g_n(x))

Then we have

   ∂h(x)/∂x_j = Σ_i ∂f(z)/∂z_i × ∂g_i(x)/∂x_j

where z is the vector g_1(x), g_2(x), . . . , g_n(x).
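
Here is a small numeric instance of this rule (my own example) with n = 2 components and x ∈ R², comparing the chain-rule sum against finite differences.

```python
# f(z) = z1 * z2, g1(x) = x1^2 + x2, g2(x) = x1 * x2, h(x) = f(g(x)).
import numpy as np

def f(z):  return z[0] * z[1]
def g(x):  return np.array([x[0] ** 2 + x[1], x[0] * x[1]])

x = np.array([1.5, -0.5])
z = g(x)
df_dz = np.array([z[1], z[0]])               # partials of f at z = g(x)
dg_dx = np.array([[2 * x[0], 1.0],           # rows: g_i, columns: x_j
                  [x[1],     x[0]]])
print(df_dz @ dg_dx)                         # sum_i df/dz_i * dg_i/dx_j

eps = 1e-6                                   # finite-difference comparison
for j in range(2):
    e = np.zeros(2); e[j] = eps
    print((f(g(x + e)) - f(g(x - e))) / (2 * eps))
```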

  10. The Jacobian Matrix

Assume we have a function f : R^n → R^m that takes some vector x ∈ R^n and then returns a vector y ∈ R^m:

   y = f(x)

The Jacobian J ∈ R^{m×n} is defined as the matrix with entries

   J_{i,j} = ∂f_i(x)/∂x_j

Hence the Jacobian contains all partial derivatives of the function.

  11. The Jacobian Matrix

Assume we have a function f : R^n → R^m that takes some vector x ∈ R^n and then returns a vector y ∈ R^m:

   y = f(x)

The Jacobian J ∈ R^{m×n} is defined as the matrix with entries

   J_{i,j} = ∂f_i(x)/∂x_j

Hence the Jacobian contains all partial derivatives of the function. We will also use

   ∂y/∂x |^f_x

for vectors y and x to refer to the Jacobian matrix with respect to a function f mapping x to y, evaluated at x
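
Since several later slides are naturally checked numerically, here is a small finite-difference Jacobian helper (my own utility, not part of the slides) that the later sketches refer to.

```python
# Generic finite-difference Jacobian: J[i, j] = df_i / dx_j for f: R^n -> R^m.
import numpy as np

def jacobian_fd(f, x, eps=1e-6):
    y = f(x)
    J = np.zeros((y.size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x); e[j] = eps
        J[:, j] = (f(x + e) - f(x - e)) / (2 * eps)   # central differences
    return J
```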

  12. An Example of the Jacobian: The LOG-SOFTMAX Function

We define LS : R^K → R^K to be the function such that for k = 1 . . . K,

   LS_k(l) = log( exp{l_k} / Σ_{k′} exp{l_{k′}} ) = l_k − log Σ_{k′} exp{l_{k′}}

The Jacobian then has entries

   [∂LS(l)/∂l]_{k,k′} = ∂LS_k(l)/∂l_{k′} = [[k = k′]] − exp{l_{k′}} / Σ_{k′′} exp{l_{k′′}}

where [[k = k′]] = 1 if k = k′, 0 otherwise.
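
This formula is easy to verify numerically: row k of the Jacobian is e_k − s, where s is the softmax of l. A small check, using the `jacobian_fd` sketch from slide 11:

```python
import numpy as np

def LS(l):
    return l - np.logaddexp.reduce(l)          # l_k - log sum_k' exp{l_k'}

l = np.array([1.0, -2.0, 0.5])
s = np.exp(LS(l))                              # softmax of l
J = np.eye(len(l)) - np.ones((len(l), 1)) * s  # [[k = k']] - s_k'
print(J)
# jacobian_fd(LS, l) agrees with J up to finite-difference error
```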

  13. The Chain Rule (continued)

Assume we have equations

   y = f(z_1, z_2, . . . , z_n)
   z_i = g_i(x_1, x_2, . . . , x_m) for i = 1 . . . n

where y is a vector, the z_i for all i are vectors, and the x_j for all j are vectors. Define h(x_1 . . . x_m) to be the composition of f and g_1 . . . g_n, so y = h(x_1 . . . x_m). Then

   ∂y/∂x_j |^h = Σ_{i=1}^{n} ∂y/∂z_i |^f × ∂z_i/∂x_j |^{g_i}

where ∂y/∂x_j is a d(y) × d(x_j) matrix, each ∂y/∂z_i is d(y) × d(z_i), each ∂z_i/∂x_j is d(z_i) × d(x_j), and d(v) is the dimensionality of vector v.

  14. Overview

◮ Introduction
◮ The chain rule
◮ Derivatives in a single-layer neural network
◮ Computational graphs
◮ Backpropagation in computational graphs
◮ Justification for backpropagation

  15. Derivatives in a Feedforward Network

Definitions: The set of possible labels is Y. We define K = |Y|. g : R^m → R^m is a transfer function. We define LS = LOG-SOFTMAX.

Inputs: x_i ∈ R^d, y_i ∈ Y, W ∈ R^{m×d}, b ∈ R^m, V ∈ R^{K×m}, γ ∈ R^K.

Equations:
   z ∈ R^m = Wx_i + b
   h ∈ R^m = g(z)
   l ∈ R^K = Vh + γ
   q ∈ R^K = LS(l)
   o ∈ R = −q_{y_i}
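
These five equations translate directly into a forward pass; the sketch below assumes g = tanh (the slides leave g generic).

```python
import numpy as np

def forward(x_i, y_i, W, b, V, gamma, g=np.tanh):
    z = W @ x_i + b                        # z in R^m
    h = g(z)                               # h in R^m
    l = V @ h + gamma                      # l in R^K
    q = l - np.logaddexp.reduce(l)         # q = LS(l) in R^K
    o = -q[y_i]                            # o in R: the loss -log p(y_i | x_i)
    return z, h, l, q, o
```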

  16. Jacobians Involving Matrices

Equations: as on slide 15.

If W ∈ R^{m×d} and z ∈ R^m, the Jacobian ∂z/∂W is a matrix of dimension m × m′, where m′ = m × d is the number of entries in W. So we treat W as a vector with m × d elements.
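
To make the flattening concrete: for z = Wx_i + b we have ∂z_i/∂W_{k,j} = [[i = k]] x_j, so under NumPy's row-major W.reshape(-1) ordering the Jacobian is the Kronecker product kron(I_m, xᵀ). A sketch (my construction, not from the slides):

```python
import numpy as np

m, d = 3, 4
x = np.random.randn(d)
J = np.kron(np.eye(m), x)    # shape (m, m*d); entry [i, k*d + j] = [[i = k]] * x[j]
print(J.shape)               # (3, 12)
```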

  17. Local Functions

Equations: as on slide 15.

Leaf variables: W, x_i, b, V, γ, y_i
Intermediate variables: z, h, l, q
Output variable: o

Each intermediate variable has a “local” function:

   f^z(W, x_i, b) = Wx_i + b,   f^h(z) = g(z),   f^l(h) = Vh + γ,   . . .

  18. Global Functions

Equations: as on slide 15.

Leaf variables: W, x_i, b, V, γ, y_i
Intermediate variables: z, h, l, q
Output variable: o

Global functions: for the output variable o, we define f̄^o to be the function that maps the leaf values W, x_i, b, V, γ, y_i to the output value

   o = f̄^o(W, x_i, b, V, γ, y_i)

We use similar definitions for f̄^z(W, x_i, b, V, γ, y_i), f̄^h(W, x_i, b, V, γ, y_i), etc.

  19. Applying the Chain Rule

Equations: as on slide 15.

Derivative:

   ∂o/∂W |^{f̄^o} = ?

  20. Applying the Chain Rule

Equations: as on slide 15.

Derivative:

   ∂o/∂W |^{f̄^o} = ∂o/∂q |^{f^o} × ∂q/∂W |^{f̄^q}

  21. Applying the Chain Rule

Equations: as on slide 15.

Derivative:

   ∂o/∂W |^{f̄^o} = ∂o/∂q |^{f^o} × ∂q/∂l |^{f^q} × ∂l/∂W |^{f̄^l}

  22. Applying the Chain Rule

Equations: as on slide 15.

Derivative:

   ∂o/∂W |^{f̄^o} = ∂o/∂q |^{f^o} × ∂q/∂l |^{f^q} × ∂l/∂h |^{f^l} × ∂h/∂W |^{f̄^h}

  23. Applying the Chain Rule

Equations: as on slide 15.

Derivative:

   ∂o/∂W |^{f̄^o} = ∂o/∂q |^{f^o} × ∂q/∂l |^{f^q} × ∂l/∂h |^{f^l} × ∂h/∂z |^{f^h} × ∂z/∂W |^{f^z}
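
Multiplying the five local Jacobians gives a direct (if inefficient) way to compute ∂o/∂W. The sketch below assumes g = tanh and uses the shapes from slides 15 and 16, with the row-major ∂z/∂W from the sketch after slide 16.

```python
import numpy as np

def grad_W(x_i, y_i, W, b, V, gamma):
    m, d = W.shape
    K = V.shape[0]
    z = W @ x_i + b                                  # forward pass (slide 15)
    h = np.tanh(z)
    l = V @ h + gamma
    s = np.exp(l - np.logaddexp.reduce(l))           # softmax of l

    do_dq = -np.eye(K)[y_i]                          # 1 x K: o = -q_{y_i}
    dq_dl = np.eye(K) - np.ones((K, 1)) * s          # K x K: LS Jacobian (slide 12)
    dl_dh = V                                        # K x m
    dh_dz = np.diag(1.0 - h ** 2)                    # m x m: tanh'(z)
    dz_dW = np.kron(np.eye(m), x_i)                  # m x (m*d)

    flat = do_dq @ dq_dl @ dl_dh @ dh_dz @ dz_dW     # 1 x (m*d)
    return flat.reshape(m, d)                        # back to the shape of W
```

In practice backpropagation evaluates this product from the left, as a sequence of vector-Jacobian products, so the m × (m·d) matrix ∂z/∂W is never formed explicitly; the sketch above forms it only to mirror the slide's equation. The result can be checked against a finite-difference gradient of the forward pass.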
