Computational Graphs, and Backpropagation
Michael Collins, Columbia University
A Key Problem: Calculating Derivatives
  p(y|x; θ, v) = exp( v(y) · φ(x; θ) + γ_y ) / Σ_{y′∈Y} exp( v(y′) · φ(x; θ) + γ_{y′} )      (1)

where φ(x; θ) = g(Wx + b) and
◮ m is an integer specifying the number of hidden units
◮ W ∈ R^{m×d} and b ∈ R^m are the parameters in θ
◮ g : R^m → R^m is the transfer function
◮ Key question: given a training example (x_i, y_i), define
  L(θ, v) = − log p(y_i|x_i; θ, v)
  How do we calculate derivatives such as dL(θ, v)/dW_{k,j}?
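As a concrete reference point, here is a minimal numpy sketch of computing L(θ, v) = −log p(y_i|x_i; θ, v) under equation (1), assuming the transfer function g is tanh and that v(y) is row y of a matrix V; the dimensions and array names are illustrative only.

```python
import numpy as np

def loss(W, b, V, gamma, x_i, y_i):
    """Sketch of L(theta, v) = -log p(y_i | x_i; theta, v) from equation (1),
    assuming g = tanh and that v(y) is row y of the matrix V."""
    phi = np.tanh(W @ x_i + b)                            # phi(x; theta) = g(Wx + b)
    scores = V @ phi + gamma                              # v(y) . phi(x; theta) + gamma_y
    log_probs = scores - np.log(np.sum(np.exp(scores)))   # log of equation (1)
    return -log_probs[y_i]

# Toy dimensions: d = 3 inputs, m = 4 hidden units, K = 2 labels.
rng = np.random.default_rng(0)
W, b = rng.normal(size=(4, 3)), rng.normal(size=4)
V, gamma = rng.normal(size=(2, 4)), rng.normal(size=2)
print(loss(W, b, V, gamma, x_i=rng.normal(size=3), y_i=1))
```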
A Simple Version of Stochastic Gradient Descent (Continued)
Algorithm:
◮ For t = 1 . . . T
  ◮ Select an integer i uniformly at random from {1 . . . n}
  ◮ Define L(θ, v) = − log p(y_i|x_i; θ, v)
  ◮ For each parameter θ_j, θ_j = θ_j − η_t × dL(θ, v)/dθ_j
  ◮ For each label y, for each parameter v_k(y), v_k(y) = v_k(y) − η_t × dL(θ, v)/dv_k(y)
  ◮ For each label y, γ_y = γ_y − η_t × dL(θ, v)/dγ_y
Output: parameters θ and v
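A sketch of this SGD loop in Python; `grad_fn` is a hypothetical function returning the gradients of L(θ, v) for the sampled example (e.g. computed by backpropagation, as derived later in these slides, or by finite differences), and a constant step size η stands in for η_t.

```python
import numpy as np

def sgd(params, grad_fn, data, T, eta):
    """params: dict of arrays (e.g. W, b, V, gamma).  grad_fn(params, x_i, y_i)
    is assumed to return a dict of gradients of L(theta, v) with the same keys."""
    rng = np.random.default_rng(0)
    for t in range(T):
        x_i, y_i = data[rng.integers(len(data))]    # select i uniformly at random
        grads = grad_fn(params, x_i, y_i)
        for name in params:                         # theta_j = theta_j - eta * dL/dtheta_j
            params[name] -= eta * grads[name]
    return params
```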
Overview
◮ Introduction
◮ The chain rule
◮ Derivatives in a single-layer neural network
◮ Computational graphs
◮ Backpropagation in computational graphs
◮ Justification for backpropagation
Partial Derivatives
Assume we have scalar variables z_1, z_2 . . . z_n, and y, and a function f, and we define
  y = f(z_1, z_2, . . . z_n)
Then the partial derivative of f with respect to z_i is written as
  ∂f(z_1, z_2, . . . z_n)/∂z_i
We will also write the partial derivative as
  ∂y/∂z_i |_{f, z_1...z_n}
which can be read as "the partial derivative of y with respect to z_i, under function f, at values z_1 . . . z_n"
Partial Derivatives (continued)
We will also write the partial derivative as
  ∂y/∂z_i |_{f, z_1...z_n}
which can be read as "the partial derivative of y with respect to z_i, under function f, at values z_1 . . . z_n"

The notation including f is non-standard, but helps to alleviate a lot of potential confusion... We will sometimes drop f and/or z_1 . . . z_n when this is clear from context
The Chain Rule
Assume we have equations
  y = f(z),   z = g(x),   h(x) = f(g(x))
Then
  dh(x)/dx = df(g(x))/dz × dg(x)/dx
Or equivalently,
  ∂y/∂x |_{h, x} = ∂y/∂z |_{f, g(x)} × ∂z/∂x |_{g, x}
The Chain Rule
Assume we have equations
  y = f(z),   z = g(x),   h(x) = f(g(x))
then
  dh(x)/dx = df(g(x))/dz × dg(x)/dx
For example, assume f(z) = z^2 and g(x) = x^3. Assume in addition that x = 2. Then:
  z = x^3 = 8,   dg(x)/dx = 3x^2 = 12,   f(z) = z^2 = 64,   df(z)/dz = 2z = 16
from which it follows that
  dh(x)/dx = 12 × 16 = 192
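The worked value can be sanity-checked numerically with a central finite difference; a small sketch:

```python
def g(x): return x ** 3
def f(z): return z ** 2
def h(x): return f(g(x))

x = 2.0
analytic = (3 * x ** 2) * (2 * g(x))              # dg/dx * df/dz = 12 * 16 = 192
eps = 1e-6
numeric = (h(x + eps) - h(x - eps)) / (2 * eps)   # central finite difference
print(analytic, round(numeric, 3))                # both approximately 192
```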
The Chain Rule (continued)
Assume we have equations
  y = f(z),   z_1 = g_1(x), z_2 = g_2(x), . . . , z_n = g_n(x)
for some functions f, g_1 . . . g_n, where z is a vector z ∈ R^n and x is a vector x ∈ R^m. Define the function
  h(x) = f(g_1(x), g_2(x), . . . g_n(x))
Then we have
  ∂h(x)/∂x_j = Σ_i ∂f(z)/∂z_i × ∂g_i(x)/∂x_j
where z is the vector g_1(x), g_2(x), . . . g_n(x).
The Jacobian Matrix
Assume we have a function f : R^n → R^m that takes some vector x ∈ R^n and then returns a vector y ∈ R^m:
  y = f(x)
The Jacobian J ∈ R^{m×n} is defined as the matrix with entries
  J_{i,j} = ∂f_i(x)/∂x_j
Hence the Jacobian contains all partial derivatives of the function.
The Jacobian Matrix
Assume we have a function f : R^n → R^m that takes some vector x ∈ R^n and then returns a vector y ∈ R^m:
  y = f(x)
The Jacobian J ∈ R^{m×n} is defined as the matrix with entries
  J_{i,j} = ∂f_i(x)/∂x_j
Hence the Jacobian contains all partial derivatives of the function.

We will also use
  ∂y/∂x |_{f, x}
for vectors y and x to refer to the Jacobian matrix with respect to a function f mapping x to y, evaluated at x
An Example of the Jacobian: The LOG-SOFTMAX Function
We define LS : R^K → R^K to be the function such that for k = 1 . . . K,

  LS_k(l) = log ( exp{l_k} / Σ_{k′} exp{l_{k′}} ) = l_k − log Σ_{k′} exp{l_{k′}}

The Jacobian then has entries

  [ ∂LS(l)/∂l ]_{k,k′} = ∂LS_k(l)/∂l_{k′} = [[k = k′]] − exp{l_{k′}} / Σ_{k′′} exp{l_{k′′}}

where [[k = k′]] = 1 if k = k′, 0 otherwise.
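A small numpy sketch of LS and its Jacobian, with an informal finite-difference check of one column (the variable names are illustrative):

```python
import numpy as np

def log_softmax(l):
    return l - np.log(np.sum(np.exp(l)))

def log_softmax_jacobian(l):
    """Entries [[k = k']] - exp(l_k') / sum_k'' exp(l_k'')."""
    softmax = np.exp(l) / np.sum(np.exp(l))
    return np.eye(len(l)) - softmax[np.newaxis, :]    # row k, column k'

l = np.array([0.5, -1.0, 2.0])
eps = 1e-6
e0 = np.zeros(3); e0[0] = eps
col0 = (log_softmax(l + e0) - log_softmax(l - e0)) / (2 * eps)       # column k' = 0
print(np.allclose(log_softmax_jacobian(l)[:, 0], col0, atol=1e-5))   # True
```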
The Chain Rule (continued)
Assume we have equations
  y = f(z_1, z_2, . . . z_n)
  z_i = g_i(x_1, x_2, . . . x_m)   for i = 1 . . . n
where y is a vector, z_i for all i are vectors, and x_j for all j are vectors. Define h(x_1 . . . x_m) to be the composition of f and g, so y = h(x_1 . . . x_m). Then

  ∂y/∂x_j |_h = Σ_{i=1}^{n} ∂y/∂z_i |_f × ∂z_i/∂x_j |_{g_i}

where ∂y/∂x_j |_h has dimension d(y) × d(x_j), ∂y/∂z_i |_f has dimension d(y) × d(z_i), ∂z_i/∂x_j |_{g_i} has dimension d(z_i) × d(x_j), and d(v) is the dimensionality of vector v.
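A numerical sanity check of this identity on a made-up composition (two intermediate vectors z_1, z_2; every matrix below is arbitrary and chosen only for illustration, and the check covers all columns j of the Jacobian at once):

```python
import numpy as np

A1, A2 = np.arange(6.0).reshape(3, 2), np.ones((2, 2))
B1, B2 = np.arange(6.0).reshape(2, 3) / 10, np.array([[1.0, -1.0], [0.5, 2.0]])

g1 = lambda x: np.tanh(A1 @ x)           # z1 in R^3,  dz1/dx = diag(1 - z1**2) @ A1
g2 = lambda x: A2 @ x                    # z2 in R^2,  dz2/dx = A2
f  = lambda z1, z2: B1 @ z1 + B2 @ z2    # y in R^2,   dy/dz1 = B1, dy/dz2 = B2
h  = lambda x: f(g1(x), g2(x))

x = np.array([0.3, -0.7])
z1 = g1(x)
chain = B1 @ (np.diag(1 - z1 ** 2) @ A1) + B2 @ A2      # sum_i dy/dz_i x dz_i/dx

eps = 1e-6
numeric = np.column_stack([(h(x + eps * e) - h(x - eps * e)) / (2 * eps)
                           for e in np.eye(2)])          # numerical Jacobian dy/dx
print(np.allclose(chain, numeric, atol=1e-5))            # True
```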
Overview
◮ Introduction
◮ The chain rule
◮ Derivatives in a single-layer neural network
◮ Computational graphs
◮ Backpropagation in computational graphs
◮ Justification for backpropagation
Derivatives in a Feedforward Network
Definitions: The set of possible labels is Y. We define K = |Y|. g : R^m → R^m is a transfer function. We define LS = LOG-SOFTMAX.

Inputs: x_i ∈ R^d, y_i ∈ Y, W ∈ R^{m×d}, b ∈ R^m, V ∈ R^{K×m}, γ ∈ R^K

Equations:
  z ∈ R^m = W x_i + b
  h ∈ R^m = g(z)
  l ∈ R^K = V h + γ
  q ∈ R^K = LS(l)
  o ∈ R = −q_{y_i}
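A sketch of these equations as a forward pass in numpy, assuming g = tanh; the intermediate variables are returned because the Jacobians below are evaluated at them.

```python
import numpy as np

def forward(W, b, V, gamma, x_i, y_i, g=np.tanh):
    z = W @ x_i + b                       # z in R^m
    h = g(z)                              # h in R^m
    l = V @ h + gamma                     # l in R^K
    q = l - np.log(np.sum(np.exp(l)))     # q = LS(l) in R^K
    o = -q[y_i]                           # o in R
    return z, h, l, q, o
```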
Jacobian Involving Matrices
Equations:
  z ∈ R^m = W x_i + b
  h ∈ R^m = g(z)
  l ∈ R^K = V h + γ
  q ∈ R^K = LS(l)
  o ∈ R = −q_{y_i}

If W ∈ R^{m×d} and z ∈ R^m, the Jacobian ∂z/∂W is a matrix of dimension m × m′, where m′ = (m × d) is the number of entries in W. So we treat W as a vector with (m × d) elements.
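To make the flattening concrete: since z_k = Σ_j W_{k,j} x_j + b_k, the entry of ∂z/∂W for row k and (flattened) column (k′, j) is [[k = k′]] x_j. A small sketch, assuming W is flattened row by row:

```python
import numpy as np

# Jacobian dz/dW for z = W x + b, with W flattened row-major into a length m*d vector.
# Since z_k = sum_j W_{k,j} x_j + b_k, dz_k/dW_{k',j} = [[k = k']] x_j,
# i.e. the Kronecker product of the identity I_m with x^T.
m, d = 3, 4
rng = np.random.default_rng(0)
W, b, x = rng.normal(size=(m, d)), rng.normal(size=m), rng.normal(size=d)

jac = np.kron(np.eye(m), x)                       # shape (m, m*d)

# Finite-difference check of the column for entry W_{1,2}:
eps = 1e-6
dW = np.zeros((m, d)); dW[1, 2] = eps
numeric = ((W + dW) @ x - (W - dW) @ x) / (2 * eps)
print(np.allclose(jac[:, 1 * d + 2], numeric))    # True
```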
Local Functions
Equations:
  z ∈ R^m = W x_i + b
  h ∈ R^m = g(z)
  l ∈ R^K = V h + γ
  q ∈ R^K = LS(l)
  o ∈ R = −q_{y_i}

Leaf variables: W, x_i, b, V, γ, y_i
Intermediate variables: z, h, l, q
Output variable: o

Each intermediate variable has a "local" function:
  f^z(W, x_i, b) = W x_i + b,   f^h(z) = g(z),   f^l(V, h, γ) = V h + γ,   . . .
Global Functions
Equations:
  z ∈ R^m = W x_i + b
  h ∈ R^m = g(z)
  l ∈ R^K = V h + γ
  q ∈ R^K = LS(l)
  o ∈ R = −q_{y_i}

Leaf variables: W, x_i, b, V, γ, y_i
Intermediate variables: z, h, l, q
Output variable: o

Global functions: for the output variable o, we define f̄^o to be the function that maps the leaf values W, x_i, b, V, γ, y_i to the output value:
  o = f̄^o(W, x_i, b, V, γ, y_i)
We use similar definitions for f̄^z(W, x_i, b, V, γ, y_i), f̄^h(W, x_i, b, V, γ, y_i), etc.
Applying the Chain Rule

Equations:
  z ∈ R^m = W x_i + b
  h ∈ R^m = g(z)
  l ∈ R^K = V h + γ
  q ∈ R^K = LS(l)
  o ∈ R = −q_{y_i}

Derivative, expanded one step at a time:

  ∂o/∂W |_{f̄^o}
    = ∂o/∂q |_{f^o} × ∂q/∂W |_{f̄^q}
    = ∂o/∂q |_{f^o} × ∂q/∂l |_{f^q} × ∂l/∂W |_{f̄^l}
    = ∂o/∂q |_{f^o} × ∂q/∂l |_{f^q} × ∂l/∂h |_{f^l} × ∂h/∂W |_{f̄^h}
    = ∂o/∂q |_{f^o} × ∂q/∂l |_{f^q} × ∂l/∂h |_{f^l} × ∂h/∂z |_{f^h} × ∂z/∂W |_{f̄^z}
    = ∂o/∂q |_{f^o} × ∂q/∂l |_{f^q} × ∂l/∂h |_{f^l} × ∂h/∂z |_{f^h} × ∂z/∂W |_{f^z}
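Putting the pieces together, a sketch of ∂o/∂W as the explicit product of these five Jacobians, assuming g = tanh (so ∂h/∂z = diag(1 − h²)) and flattening W row by row as above; reshaping the resulting 1 × (m·d) row vector gives the m × d gradient.

```python
import numpy as np

def grad_W(W, b, V, gamma, x_i, y_i):
    """Sketch of do/dW as the product of the five Jacobians above,
    assuming g = tanh so that dh/dz = diag(1 - h**2)."""
    m, d = W.shape
    z = W @ x_i + b
    h = np.tanh(z)
    l = V @ h + gamma
    softmax = np.exp(l) / np.sum(np.exp(l))

    do_dq = -np.eye(len(l))[y_i][np.newaxis, :]       # 1 x K,  o = -q_{y_i}
    dq_dl = np.eye(len(l)) - softmax[np.newaxis, :]   # K x K,  log-softmax Jacobian
    dl_dh = V                                         # K x m
    dh_dz = np.diag(1.0 - h ** 2)                     # m x m
    dz_dW = np.kron(np.eye(m), x_i)                   # m x (m*d), W flattened row-major

    return (do_dq @ dq_dl @ dl_dh @ dh_dz @ dz_dW).reshape(m, d)
```

In practice the large Jacobian ∂z/∂W is never formed explicitly: since ∂o/∂W_{k,j} = (∂o/∂z_k) x_j, the same m × d result is the outer product of the row vector ∂o/∂z with x_i.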
Another Derivative
Equations:
  z ∈ R^m = W x_i + b
  h ∈ R^m = g(z)
  l ∈ R^K = V h + γ
  q ∈ R^K = LS(l)
  o ∈ R = −q_{y_i}

Derivative:

  ∂o/∂V |_{f̄^o} = ∂o/∂q |_{f^o} × ∂q/∂l |_{f^q} × ∂l/∂V |_{f^l}
A Computational Graph
Equations:
  z ∈ R^m = W x_i + b
  h ∈ R^m = g(z)
  l ∈ R^K = V h + γ
  q ∈ R^K = LS(l)
  o ∈ R = −q_{y_i}

Derivatives:

  ∂o/∂V |_{f̄^o} = ∂o/∂q |_{f^o} × ∂q/∂l |_{f^q} × ∂l/∂V |_{f^l}

  ∂o/∂W |_{f̄^o} = ∂o/∂q |_{f^o} × ∂q/∂l |_{f^q} × ∂l/∂h |_{f^l} × ∂h/∂z |_{f^h} × ∂z/∂W |_{f^z}
Overview
◮ Introduction
◮ The chain rule
◮ Derivatives in a single-layer neural network
◮ Computational graphs
◮ Backpropagation in computational graphs
◮ Justification for backpropagation
Computational Graphs: a Formal Definition
A computational graph consists of:
◮ An integer n specifying the number of vertices in the graph.
◮ An integer l < n specifying the number of leaves in the graph. Vertices 1 . . . l are leaves in the graph. Vertex n is a special "output" vertex.
◮ A set of directed edges E. Each member of E is an ordered pair (j, i) where j ∈ {1 . . . n}, i ∈ {(l + 1) . . . n}, and i > j. For any i we define π(i) to be the set of parents of i in the graph:
  π(i) = {j : (j, i) ∈ E}
Computational Graphs (continued)
◮ A variable u^i ∈ R^{d_i} is associated with each vertex in the graph. Here d_i for i = 1 . . . n specifies the dimensionality of u^i. We assume d_n = 1, hence the output variable is a scalar.
◮ A function f^i is associated with each non-leaf vertex in the graph (i ∈ {(l + 1) . . . n}). The function maps a vector A^i defined as
  A^i = ⟨u^j | j ∈ π(i)⟩
  to a vector f^i(A^i) ∈ R^{d_i}
An Example
◮ Define n = 4, l = 2
◮ Define d_i = 1 for all i (all variables are scalars)
◮ Define E = {(1, 3), (2, 3), (2, 4), (3, 4)}
◮ Define
  f^3(u^1, u^2) = u^1 + u^2
  f^4(u^2, u^3) = u^2 × u^3
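A sketch of this example as plain Python data; the representation (dicts keyed by vertex number) is just one convenient choice.

```python
# Vertices are 1..4; vertices 1 and 2 are leaves; vertex 4 is the output.
n, l = 4, 2
edges = [(1, 3), (2, 3), (2, 4), (3, 4)]
parents = {i: [j for (j, k) in edges if k == i] for i in range(l + 1, n + 1)}
f = {
    3: lambda u1, u2: u1 + u2,    # f^3(u^1, u^2) = u^1 + u^2
    4: lambda u2, u3: u2 * u3,    # f^4(u^2, u^3) = u^2 * u^3
}
```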
Two Questions
◮ Note that the computational graph defines a function, which we call f̄^n, from the values of the leaf variables to the output variable:
  u^n = f̄^n(u^1 . . . u^l)
◮ Given a computational graph, and values for the leaf variables u^1 . . . u^l:
  1. How do we compute the output u^n?
  2. How do we compute the partial derivatives ∂u^n/∂u^i |_{f̄^n} for all i ∈ {1 . . . l}?
Forward Computation
Input: Values for leaf variables u^1 . . . u^l
Algorithm:
◮ For i = (l + 1) . . . n
  u^i = f^i(A^i)   where   A^i = ⟨u^j | j ∈ π(i)⟩
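A sketch of the forward computation on the running example, with made-up leaf values u^1 = 3, u^2 = 2:

```python
def forward_graph(n, l, parents, f, leaf_values):
    """Forward computation: leaf_values maps each leaf vertex i (1..l) to u^i."""
    u = dict(leaf_values)
    for i in range(l + 1, n + 1):
        A_i = [u[j] for j in parents[i]]      # A^i = <u^j | j in parents(i)>
        u[i] = f[i](*A_i)
    return u

# The running example (n = 4, l = 2), with leaf values u^1 = 3, u^2 = 2:
parents = {3: [1, 2], 4: [2, 3]}
f = {3: lambda u1, u2: u1 + u2, 4: lambda u2, u3: u2 * u3}
print(forward_graph(4, 2, parents, f, {1: 3.0, 2: 2.0}))   # u^3 = 5, u^4 = 10
```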
An Example
◮ Define n = 4, l = 2
◮ Define d_i = 1 for all i (all variables are scalars)
◮ Define E = {(1, 3), (2, 3), (2, 4), (3, 4)}
◮ Define
  f^3(u^1, u^2) = u^1 + u^2
  f^4(u^2, u^3) = u^2 × u^3
Defining and Calculating Derivatives
◮ For any k ∈ {(l + 1) . . . n}, there is a function f̄^k such that
  u^k = f̄^k(u^1, u^2, . . . u^l)
◮ We want to calculate
  ∂u^n/∂u^j |_{f̄^n, u^1...u^l}   for j = 1 . . . l
Computational Graphs (continued)
◮ A function J^{j→i} is associated with each edge (j, i) ∈ E. The function maps a vector A^i defined as A^i = ⟨u^j | j ∈ π(i)⟩ to a matrix J^{j→i}(A^i) ∈ R^{d_i×d_j}:
  J^{j→i}(A^i) = ∂f^i(A^i)/∂u^j = ∂u^i/∂u^j |_{f^i, A^i}
Forward Pass
Input: Values for leaf variables u^1 . . . u^l
Algorithm:
◮ For i = (l + 1) . . . n
  A^i = ⟨u^j | j ∈ π(i)⟩
  u^i = f^i(A^i)
Backward Pass
◮ p^n = 1
◮ For j = (n − 1) . . . 1:
  p^j = Σ_{i:(j,i)∈E} p^i J^{j→i}(A^i)
◮ Output: p^i for i = 1 . . . l satisfying
  p^i = ∂u^n/∂u^i |_{f̄^n, u^1...u^l}
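A sketch of the backward pass for the scalar case (all d_i = 1), run on the same example with leaf values u^1 = 3, u^2 = 2; each edge Jacobian is just a number here, so multiplication order does not matter.

```python
def backward_graph(n, edges, jac, A):
    """Backward pass, scalar case: jac[(j, i)](A^i) is assumed to return the
    Jacobian J^{j->i}(A^i) as a plain number."""
    p = {n: 1.0}
    for j in range(n - 1, 0, -1):
        p[j] = sum(p[i] * jac[(j, i)](A[i]) for (a, i) in edges if a == j)
    return p

# Running example: u^3 = u^1 + u^2, u^4 = u^2 * u^3, with u^1 = 3, u^2 = 2.
u = {1: 3.0, 2: 2.0}
u[3] = u[1] + u[2]                        # forward pass
u[4] = u[2] * u[3]
A = {3: (u[1], u[2]), 4: (u[2], u[3])}
jac = {
    (1, 3): lambda A3: 1.0,               # d(u^1 + u^2)/du^1
    (2, 3): lambda A3: 1.0,               # d(u^1 + u^2)/du^2
    (2, 4): lambda A4: A4[1],             # d(u^2 * u^3)/du^2 = u^3
    (3, 4): lambda A4: A4[0],             # d(u^2 * u^3)/du^3 = u^2
}
edges = [(1, 3), (2, 3), (2, 4), (3, 4)]
p = backward_graph(4, edges, jac, A)
print(p[1], p[2])                         # du^4/du^1 = 2.0,  du^4/du^2 = 7.0
```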
An Example
p^n = 1
For j = (n − 1) . . . 1:
  p^j = Σ_{i:(j,i)∈E} p^i J^{j→i}(A^i)
Overview
◮ Introduction
◮ The chain rule
◮ Derivatives in a single-layer neural network
◮ Computational graphs
◮ Backpropagation in computational graphs
◮ Justification for backpropagation
Products of Jacobians over Paths in the Graph
◮ A directed path between vertices j and k is a sequence of edges (i_1, i_2), (i_2, i_3), . . . (i_{n−1}, i_n) with n ≥ 2 such that each edge is in E, and i_1 = j, and i_n = k.
◮ For any j, k, we write P(j, k) to denote the set of all directed paths between j and k
◮ For convenience we define D^{a→b} = J^{a→b}(A^b) for all edges (a, b).
◮ Theorem: for any j ∈ {1 . . . l}, k ∈ {(l + 1) . . . n},
  ∂u^k/∂u^j |_{f̄^k} = Σ_{p∈P(j,k)} Π_{(a,b)∈p} D^{a→b}
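A small check of the theorem on the running example: with u^1 = 3, u^2 = 2, the edge Jacobians evaluate to D^{1→3} = 1, D^{2→3} = 1, D^{2→4} = u^3 = 5, D^{3→4} = u^2 = 2, and summing products over paths reproduces the backward-pass values.

```python
def all_paths(j, k, edges):
    """All directed paths from vertex j to vertex k, each as a list of edges."""
    if j == k:
        return [[]]
    return [[(j, i)] + rest
            for (a, i) in edges if a == j
            for rest in all_paths(i, k, edges)]

edges = [(1, 3), (2, 3), (2, 4), (3, 4)]
D = {(1, 3): 1.0, (2, 3): 1.0, (2, 4): 5.0, (3, 4): 2.0}   # D^{a->b} at u^1 = 3, u^2 = 2

def path_sum(j, k):
    total = 0.0
    for path in all_paths(j, k, edges):
        prod = 1.0
        for edge in path:
            prod *= D[edge]
        total += prod
    return total

print(path_sum(1, 4), path_sum(2, 4))   # 2.0 and 7.0, matching the backward pass
```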
An Example
∂u^k/∂u^j |_{f̄^k} = Σ_{p∈P(j,k)} Π_{(a,b)∈p} D^{a→b}
Proof Sketch
◮ For any j, j′, k, we write P(j, j′, k) to denote the set of all directed paths between j and k such that the last edge in the sequence is (j′, k).
◮ Proof sketch: by induction over the graph. By the chain rule we have

  ∂u^k/∂u^j |_{f̄^k}
    = Σ_{j′:(j′,k)∈E} D^{j′→k} × ∂u^{j′}/∂u^j |_{f̄^{j′}}
    = Σ_{j′:(j′,k)∈E} D^{j′→k} × Σ_{p∈P(j,j′)} Π_{(a,b)∈p} D^{a→b}
    = Σ_{j′:(j′,k)∈E} Σ_{p∈P(j,j′,k)} Π_{(a,b)∈p} D^{a→b}
    = Σ_{p∈P(j,k)} Π_{(a,b)∈p} D^{a→b}
Backward Pass
◮ p^n = 1
◮ For j = (n − 1) . . . 1:
  p^j = Σ_{i:(j,i)∈E} p^i D^{j→i}
◮ Output: p^i for i = 1 . . . l satisfying
  p^i = ∂u^n/∂u^i |_{f̄^n, u^1,u^2,...u^l}
Correctness of the Backward Pass
◮ Theorem: For all p^i we have
  p^i = Σ_{p∈P(i,n)} Π_{(a,b)∈p} D^{a→b}
  It follows that for any i ∈ {1 . . . l},
  p^i = ∂u^n/∂u^i |_{f̄^n}
Proof
◮ Theorem: For all p^i we have
  p^i = Σ_{p∈P(i,n)} Π_{(a,b)∈p} D^{a→b}
◮ Proof sketch: by induction on i = n, i = (n − 1), i = (n − 2), . . . i = 1. For i = n we have p^n = 1 so the proposition is true. For j = (n − 1) . . . 1 we have

  p^j = Σ_{i:(j,i)∈E} p^i D^{j→i}
      = Σ_{i:(j,i)∈E} ( Σ_{p∈P(i,n)} Π_{(a,b)∈p} D^{a→b} ) D^{j→i}
      = Σ_{p∈P(j,n)} Π_{(a,b)∈p} D^{a→b}