CMP784 DEEP LEARNING Lecture #03: Multi-layer Perceptrons (Aykut Erdem, Hacettepe University, Spring 2018)



slide-1
SLIDE 1

Lecture #03 – Multi-layer Perceptrons

Aykut Erdem // Hacettepe University // Spring 2018

CMP784

DEEP LEARNING

Image: Jose-Luis Olivares

slide-2
SLIDE 2

Breaking news!

  • Practical 1 is out!

—Learning neural word embeddings
—Due Friday, Mar. 16, 23:59:59

  • Paper presentations and quizzes will start next week!

− Discuss your slides with me 3-4 days prior to your presentation.
− Submit your final slides by the night before the class.
− We don’t have any code walker or demonstrator.

slide-3
SLIDE 3

Previously on CMP784

  • Learning problem
  • Parametric vs. non-parametric models
  • Nearest-neighbor classifier
  • Linear classification
  • Linear regression
  • Capacity
  • Hyperparameter
  • Underfitting
  • Overfitting
  • Bias-variance tradeoff
  • Model selection
  • Cross-validation
slide-4
SLIDE 4

Lecture overview

  • the perceptron
  • the multi-layer perceptron
  • stochastic gradient descent
  • backpropagation
  • shallow yet very powerful: word2vec
  • Disclaimer: Much of the material and slides for this lecture were borrowed from

—Hugo Larochelle’s Neural networks slides
—Nick Locascio’s MIT 6.S191 slides
—Efstratios Gavves and Max Welling’s UvA deep learning class
—Leonid Sigal’s CPSC532L class
—Richard Socher’s CS224d class
—Dan Jurafsky’s CS124 class
slide-5
SLIDE 5

A Brief History of Neural Networks

Image: VUNI Inc.

slide-6
SLIDE 6

The Perceptron

slide-7
SLIDE 7

The Perceptron

[Diagram: inputs x0…xn with weights w0…wn and bias b feed a weighted sum ∑ followed by a non-linearity]

slide-8
SLIDE 8

Perceptron Forward Pass

  • Neuron pre-activation (or input activation): $a(x) = b + \sum_i w_i x_i = b + \mathbf{w}^\top \mathbf{x}$
  • Neuron output activation: $h(x) = g(a(x)) = g\big(b + \sum_i w_i x_i\big)$

where w are the weights (parameters), b is the bias term, and g(·) is called the activation function.

slide-9
SLIDE 9

Output Activation of the Neuron

  • $h(x) = g(a(x)) = g\big(b + \sum_i w_i x_i\big)$
  • The bias only changes the position of the curve; the range of the output is determined by g(·).

Image credit: Pascal Vincent

slide-10
SLIDE 10

Linear Activation Function

  • $h(x) = g(a(x)) = g\big(b + \sum_i w_i x_i\big)$
  • $g(a) = a$
  • No nonlinear transformation
  • No input squashing
slide-11
SLIDE 11

Sigmoid Activation Function

  • $h(x) = g(a(x)) = g\big(b + \sum_i w_i x_i\big)$
  • $g(a) = \mathrm{sigm}(a) = \frac{1}{1 + \exp(-a)}$
  • Squashes the neuron’s output between 0 and 1
  • Always positive
  • Bounded
  • Strictly increasing

slide-12
SLIDE 12

Perceptron Forward Pass

Inputs: (2, 3, −1, 5); weights: (0.1, 0.5, 2.5, 0.2); bias: 3.0

  • $h(x) = g(a(x)) = g\big(b + \sum_i w_i x_i\big)$

slide-13
SLIDE 13

Perceptron Forward Pass

Inputs: (2, 3, −1, 5); weights: (0.1, 0.5, 2.5, 0.2); bias: 3.0

$h(x) = g\big((2 \cdot 0.1) + (3 \cdot 0.5) + (-1 \cdot 2.5) + (5 \cdot 0.2) + (1 \cdot 3.0)\big) = g(3.2)$

slide-14
SLIDE 14

Perceptron Forward Pass

Inputs: (2, 3, −1, 5); weights: (0.1, 0.5, 2.5, 0.2); bias: 3.0

$h(x) = g(3.2) = \sigma(3.2) = \frac{1}{1 + e^{-3.2}} \approx 0.96$
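As a sanity check, the forward pass above can be sketched in Python (an illustrative sketch, not part of the original slides):

```python
import math

def perceptron(x, w, b):
    """Weighted sum plus bias, squashed by a sigmoid non-linearity."""
    a = b + sum(wi * xi for wi, xi in zip(w, x))  # pre-activation a(x) = b + sum_i w_i x_i
    return 1.0 / (1.0 + math.exp(-a))             # h(x) = sigm(a(x))

# The slide's numbers: inputs (2, 3, -1, 5), weights (0.1, 0.5, 2.5, 0.2), bias 3.0
h = perceptron([2, 3, -1, 5], [0.1, 0.5, 2.5, 0.2], 3.0)  # ≈ 0.96
```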

slide-15
SLIDE 15

Hyperbolic Tangent (tanh) Activation Function

  • $h(x) = g(a(x)) = g\big(b + \sum_i w_i x_i\big)$
  • $g(a) = \tanh(a) = \frac{\exp(a) - \exp(-a)}{\exp(a) + \exp(-a)} = \frac{\exp(2a) - 1}{\exp(2a) + 1}$
  • Squashes the neuron’s output between −1 and 1
  • Can be positive or negative
  • Bounded
  • Strictly increasing

slide-16
SLIDE 16

Rectified Linear (ReLU) Activation Function

  • $h(x) = g(a(x)) = g\big(b + \sum_i w_i x_i\big)$
  • $g(a) = \mathrm{reclin}(a) = \max(0, a)$
  • Bounded below by 0 (always non-negative)
  • Not upper bounded
  • Monotonically increasing (constant for a < 0)
  • Tends to produce units with sparse activities
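The three activation functions above (sigmoid, tanh, ReLU) can be written directly (illustrative sketch, not the lecture's code):

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))  # output in (0, 1), always positive

def tanh(a):
    return math.tanh(a)                # output in (-1, 1), positive or negative

def relu(a):
    return max(0.0, a)                 # non-negative, not upper bounded
```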
slide-17
SLIDE 17

Decision Boundary of a Neuron

  • Could do binary classification:

—with sigmoid, one can interpret the neuron as estimating p(y = 1 | x)
—also known as a logistic regression classifier
—if the activation is greater than 0.5, predict 1
—otherwise predict 0

Same idea can be applied to a tanh activation

Image credit: Pascal Vincent

Decision boundary is linear

slide-18
SLIDE 18

Capacity of Single Neuron

  • Can solve linearly separable problems
[Diagram: OR(x1, x2) and AND(x1, x2) truth tables; both are linearly separable in the (x1, x2) plane]

slide-19
SLIDE 19

Capacity of Single Neuron

  • Can not solve non-linearly separable problems
  • Need to transform the input into a better representation
  • Remember basis functions!
[Diagram: XOR(x1, x2) is not linearly separable in the (x1, x2) plane; it can be obtained by combining two AND-type units]
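The transformation idea above can be verified with two hand-set perceptron layers (a sketch with hand-picked weights, not from the slides; a step non-linearity stands in for the activation):

```python
def step(a):
    """Threshold non-linearity standing in for the activation."""
    return 1 if a > 0 else 0

def neuron(x, w, b):
    return step(b + sum(wi * xi for wi, xi in zip(w, x)))

def xor(x1, x2):
    # The hidden layer re-represents the input so the output unit
    # faces a linearly separable problem.
    h1 = neuron([x1, x2], [1, -1], -0.5)   # fires only for (1, 0)
    h2 = neuron([x1, x2], [-1, 1], -0.5)   # fires only for (0, 1)
    return neuron([h1, h2], [1, 1], -0.5)  # OR of the two hidden units
```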

slide-20
SLIDE 20

Perceptron Diagram Simplified

[Diagram: the full perceptron diagram, with inputs, weights, bias, sum, and non-linearity]

slide-21
SLIDE 21

Perceptron Diagram Simplified

[Diagram: simplified perceptron, inputs x0…xn connected directly to the output]
slide-22
SLIDE 22

Multi-Output Perceptron

  • Remember multi-way classification

—We need multiple outputs (1 output per class)
—We need to estimate the conditional probability p(y = c | x)
—Discriminative learning

  • Softmax activation function at the output

—Strictly positive
—Sums to one

  • Predict class with the highest estimated class conditional probability.
[Diagram: inputs x0…xn connected to multiple output units]

  • $o(a) = \mathrm{softmax}(a) = \left[\frac{\exp(a_1)}{\sum_c \exp(a_c)} \;\cdots\; \frac{\exp(a_C)}{\sum_c \exp(a_c)}\right]^\top$
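The softmax above can be sketched as follows (the max-subtraction is an added numerical-stability detail, not from the slide; it leaves the result unchanged):

```python
import math

def softmax(a):
    """o(a)_c = exp(a_c) / sum_c' exp(a_c'): strictly positive, sums to one."""
    m = max(a)                             # subtract max for numerical stability
    exps = [math.exp(ai - m) for ai in a]
    s = sum(exps)
    return [e / s for e in exps]
```

Prediction then picks the class with the highest estimated conditional probability, i.e. the argmax of the returned list.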

slide-23
SLIDE 23

Multi-Layer Perceptron

slide-24
SLIDE 24

Single Hidden Layer Neural Network

  • Hidden layer pre-activation: $a(x) = b^{(1)} + W^{(1)}x$, i.e. $a(x)_i = b^{(1)}_i + \sum_j W^{(1)}_{i,j} x_j$
  • Hidden layer activation: $h(x) = g(a(x))$
  • Output layer activation: $f(x) = o\big(b^{(2)} + \mathbf{w}^{(2)\top} h^{(1)}(x)\big)$

[Diagram: inputs x0…xn feed hidden units h0…hn, which feed the output layer]
slide-25
SLIDE 25

Multi-Layer Perceptron (MLP)

  • Consider a network with L hidden

layers.

—hidden layer pre-activation for k > 0: $a^{(k)}(x) = b^{(k)} + W^{(k)} h^{(k-1)}(x)$, with $h^{(0)}(x) = x$
—hidden layer activation for k from 1 to L: $h^{(k)}(x) = g(a^{(k)}(x))$
—output layer activation (k = L + 1): $h^{(L+1)}(x) = o(a^{(L+1)}(x)) = f(x)$

[Diagram: inputs, L hidden layers, output layer]
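The layer-wise recursion above can be sketched as a plain-Python forward pass (illustrative; the weights and activations in the usage example are hand-picked, not from the slides):

```python
def mlp_forward(x, weights, biases, g, o):
    """h(0) = x; a(k) = b(k) + W(k) h(k-1); h(k) = g(a(k)); the last layer uses o."""
    h = x
    for k, (W, b) in enumerate(zip(weights, biases)):
        # matrix-vector product plus bias, row by row
        a = [bi + sum(wij * hj for wij, hj in zip(row, h)) for row, bi in zip(W, b)]
        act = o if k == len(weights) - 1 else g
        h = [act(ai) for ai in a]
    return h

# One hidden layer with a ReLU activation and a linear output:
out = mlp_forward([2, -3],
                  weights=[[[1, 0], [0, 1]], [[1, 1]]],
                  biases=[[0, 0], [0]],
                  g=lambda a: max(0.0, a),  # hidden activation g
                  o=lambda a: a)            # output activation o
```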
slide-26
SLIDE 26

Deep Neural Network

[Diagram: inputs feed several stacked hidden layers before the output layer]

slide-27
SLIDE 27

Capacity of Neural Networks

  • Consider a single layer neural network
[Diagram: a single hidden layer network over inputs x1, x2 with hidden units y1, y2 and output z; each hidden unit contributes a linear boundary. Labels: input i, hidden j, output k, bias]

Image credit: Pascal Vincent

slide-28
SLIDE 28

Capacity of Neural Networks

  • Consider a single layer neural network
Image credit: Pascal Vincent

slide-29
SLIDE 29

Capacity of Neural Networks

  • Consider a single layer neural network
Image credit: Pascal Vincent

slide-30
SLIDE 30

Universal Approximation

  • Universal Approximation Theorem (Hornik, 1991):

—“a single hidden layer neural network with a linear output unit can approximate any continuous function arbitrarily well, given enough hidden units’’

  • This applies for sigmoid, tanh and many other activation functions.
  • However, this does not mean that there is a learning algorithm that can find the necessary parameter values.
slide-31
SLIDE 31

Applying Neural Networks

slide-32
SLIDE 32

Example Problem: Will my flight be delayed?

[Diagram: network with inputs x0, x1 and hidden units h0, h1, h2; Temperature: −20 F and Wind Speed: 45 mph give x = [−20, 45]; Predicted: 0.05]

slide-33
SLIDE 33

Example Problem: Will my flight be delayed?

[Diagram: same network; input [−20, 45], Predicted: 0.05]

slide-34
SLIDE 34

Example Problem: Will my flight be delayed?

[Diagram: same network; input [−20, 45], Predicted: 0.05, Actual: 1]

slide-35
SLIDE 35

Quantifying Loss

[Diagram: input [−20, 45]; Predicted: 0.05, Actual: 1]

The loss $\ell(f(x^{(i)}; \theta), y^{(i)})$ measures the cost of the prediction against the actual label.

slide-36
SLIDE 36

Total Loss

[Diagram: inputs [[−20, 45], [80, 0], [4, 15], [45, 60], …]; Predicted: [0.05, 0.02, 0.96, 0.35]; Actual: [1 1 1]]

$J(\theta) = \frac{1}{N}\sum_i \ell\big(f(x^{(i)}; \theta), y^{(i)}\big)$


slide-38
SLIDE 38

Binary Cross Entropy Loss

[Diagram: inputs and predictions as on the previous slide]

$J_{\text{cross-entropy}}(\theta) = -\frac{1}{N}\sum_i \Big( y^{(i)} \log f(x^{(i)}; \theta) + \big(1 - y^{(i)}\big) \log\big(1 - f(x^{(i)}; \theta)\big) \Big)$

  • For binary classification problems with a sigmoid output (the multi-class version uses a softmax output layer).
  • Maximizes the log-probability of the correct class given an input.
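The binary cross-entropy above can be sketched as follows (illustrative; the example values in the test are not the slide's):

```python
import math

def binary_cross_entropy(preds, labels):
    """J = -(1/N) sum_i [ y_i log f_i + (1 - y_i) log(1 - f_i) ]."""
    n = len(preds)
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for p, y in zip(preds, labels)) / n
```

Note that a confident correct prediction (p near the true label) gives a small loss, and a confident wrong one a large loss, which is exactly the "maximize log-probability" reading.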
slide-39
SLIDE 39

Mean Squared Error (MSE) Loss

[Diagram: inputs and predictions as before]

$J_{\text{MSE}}(\theta) = \frac{1}{N}\sum_i \big( f(x^{(i)}; \theta) - y^{(i)} \big)^2$

slide-40
SLIDE 40

Training Neural Networks

slide-41
SLIDE 41

Training Neural Networks: Objective

$\arg\min_\theta \frac{1}{T}\sum_t \ell\big(f(x^{(t)}; \theta), y^{(t)}\big) + \lambda\,\Omega(\theta)$

(first term: loss function; second term: regularizer)

  • Learning is cast as optimization.

—For classification problems, we would like to minimize classification error.
—The loss function can sometimes be viewed as a surrogate for what we want to optimize (e.g., an upper bound).

slide-42
SLIDE 42

Loss is a function of the model’s parameters


slide-43
SLIDE 43

How to minimize loss?


Start at random point

slide-44
SLIDE 44

How to minimize loss?

Compute the gradient $\frac{\partial J(\theta)}{\partial \theta}$ at the current point.

slide-45
SLIDE 45

How to minimize loss?

Move in the direction opposite of the gradient to a new point.
slide-46
SLIDE 46

How to minimize loss?

Move in the direction opposite of the gradient to a new point.
slide-47
SLIDE 47

How to minimize loss?


Repeat!

slide-48
SLIDE 48

This is called Stochastic Gradient Descent (SGD)


slide-49
SLIDE 49

Stochastic Gradient Descent (SGD)

  • Initialize θ randomly
  • For N epochs
     ○ For each training example (x, y):
        ■ Compute the loss gradient: $\nabla_\theta \mathcal{L}$
        ■ Update θ with the update rule: $\theta^{(t+1)} = \theta^{(t)} - \eta_t \nabla_\theta \mathcal{L}$

slide-50
SLIDE 50

Why is it Stochastic Gradient Descent?

  • Initialize θ randomly
  • For N epochs
     ○ For each training example (x, y):
        ■ Compute the loss gradient: $\nabla_\theta \mathcal{L}$ (only an estimate of the true gradient!)
        ■ Update θ with the update rule: $\theta^{(t+1)} = \theta^{(t)} - \eta_t \nabla_\theta \mathcal{L}$

slide-51
SLIDE 51

  • Why is it Stochastic Gradient Descent? Minibatches reduce gradient variance.
  • Initialize θ randomly
  • For N epochs
     ○ For each training batch {(x0, y0), …, (xB, yB)}:
        ■ Compute the loss gradient over the batch: $\nabla_\theta \mathcal{L}$ (a more accurate estimate!)
        ■ Update θ with the update rule: $\theta^{(t+1)} = \theta^{(t)} - \eta_t \nabla_\theta \mathcal{L}$

Advantages:

  • More accurate estimation of the gradient
⎯ Smoother convergence
⎯ Allows for larger learning rates
  • Minibatches lead to fast training!
⎯ Can parallelize computation and achieve significant speed increases on GPUs
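The minibatch loop above can be sketched on a toy one-parameter problem (everything in the usage example is an illustrative assumption: the quadratic loss, the data, and the hyperparameters):

```python
import random

def sgd(grad_fn, theta, data, lr=0.1, epochs=100, batch_size=2):
    """Minibatch SGD: average per-example gradients over a batch, then step."""
    for _ in range(epochs):
        random.shuffle(data)                      # the "stochastic" part
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]
            grads = [grad_fn(theta, ex) for ex in batch]
            avg = [sum(g) / len(batch) for g in zip(*grads)]
            theta = [t - lr * gi for t, gi in zip(theta, avg)]
    return theta

# Toy example: minimize the mean of (theta - x)^2 over the data;
# the optimum is the data mean, here 2.0.
grad = lambda th, x: [2 * (th[0] - x)]
theta = sgd(grad, [0.0], [1.0, 3.0], lr=0.1, epochs=100, batch_size=2)
```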

slide-52
SLIDE 52

Stochastic Gradient Descent (SGD)

  • Algorithm that performs updates after each example:

—initialize $\theta \equiv \{W^{(1)}, b^{(1)}, \dots, W^{(L+1)}, b^{(L+1)}\}$
—for N iterations
—for each training example $(x^{(t)}, y^{(t)})$:
    $\Delta = -\nabla_\theta\, \ell\big(f(x^{(t)}; \theta), y^{(t)}\big) - \lambda \nabla_\theta\, \Omega(\theta)$
    $\theta \leftarrow \theta + \alpha\, \Delta$

  • Training epoch = iteration over all examples
  • To apply this algorithm to neural network training, we need:

—the loss function $\ell\big(f(x^{(t)}; \theta), y^{(t)}\big)$
—a procedure to compute the parameter gradients $\nabla_\theta\, \ell\big(f(x^{(t)}; \theta), y^{(t)}\big)$
—the regularizer $\Omega(\theta)$ (and its gradient $\nabla_\theta\, \Omega(\theta)$)
slide-54
SLIDE 54

What is a neural network again?

  • A family of parametric, non-linear and hierarchical representation learning functions

$a_L(x; \theta_{1,\dots,L}) = h_L(h_{L-1}(\dots h_1(x, \theta_1), \theta_{L-1}), \theta_L)$

⎯ x: input, θl: parameters for layer l, al = hl(x, θl): (non-)linear function

  • Given a training corpus {X, Y}, find the optimal parameters

$\theta^* \leftarrow \arg\min_\theta \sum_{(x,y)\in(X,Y)} \ell\big(y, a_L(x; \theta_{1,\dots,L})\big)$

slide-55
SLIDE 55

Neural network models

  • A neural network model is a series of hierarchically connected functions
  • The hierarchy can be very, very complex
[Diagram: Input → h1 → h2 → h3 → h4 → h5 → Loss; forward connections (feedforward architecture)]

slide-56
SLIDE 56

Neural network models

  • A neural network model is a series of hierarchically connected functions
  • The hierarchy can be very, very complex
[Diagram: modules with interweaved skip connections (Directed Acyclic Graph architecture – DAGNN)]

slide-57
SLIDE 57

Neural network models

  • A neural network model is a series of hierarchically connected functions
  • The hierarchy can be very, very complex
[Diagram: modules with loopy connections (recurrent architecture, special care needed)]

slide-58
SLIDE 58

Neural network models

  • A neural network model is a series of hierarchically connected functions
  • The hierarchy can be very, very complex
Functions → Modules

[Diagram: the three architectures above, redrawn with each function h_l(x_i; θ) as a module]

slide-59
SLIDE 59

What is a module

  • A module is a building block for our network
  • Each module is an object/function a = h(x; θ) that

⎯ contains trainable parameters θ
⎯ receives as an argument an input x
⎯ and returns an output a based on the activation function h(…)

  • The activation function should be (at least) first-order differentiable (almost) everywhere
  • For easier/more efficient backpropagation, store the module input:

⎯ easy to get the module output fast
⎯ easy to compute derivatives
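A minimal module along these lines, caching what it computed in the forward pass so the backward pass is cheap (an illustrative sketch, not the lecture's code):

```python
import math

class SigmoidModule:
    """A parameter-free module that stores its forward output for backward."""

    def forward(self, x):
        self.out = [1.0 / (1.0 + math.exp(-xi)) for xi in x]  # cached for backward
        return self.out

    def backward(self, grad_out):
        # d sigm(a)/da = sigm(a) * (1 - sigm(a)), reusing the stored output
        return [g * o * (1.0 - o) for g, o in zip(grad_out, self.out)]
```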

slide-60
SLIDE 60

Anything goes or do special constraints exist?

  • A neural network is a composition of modules (building blocks)
  • Any architecture works
  • If the architecture is a feedforward cascade, no special care
  • If acyclic, there is a right order for computing the forward computations
  • If there are loops, these form recurrent connections (revisited later)
slide-61
SLIDE 61

What is a module

  • Simply compute the activation of each module in the network
  • We need to know the precise function behind each module $h_l(\dots)$
  • Recursive operations: one module’s output is another’s input
  • Steps:

—Visit modules one by one starting from the data input
—Some modules might have several inputs from multiple modules
—Compute module activations in the right order
—Make sure all the inputs are computed at the right time

[Diagram: module chain], where $a_l = h_l(x_l; \theta)$ and $x_{l+1} = a_l$ (equivalently $x_l = a_{l-1}$)
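For a plain feedforward cascade, the visiting order above is just the list order (an illustrative sketch; general DAGs would need a topological sort instead):

```python
def forward_pass(modules, x):
    """Visit modules in order: x_{l+1} = a_l = h_l(x_l)."""
    a = x
    for h in modules:
        a = h(a)  # one module's output is the next one's input
    return a
```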
slide-62
SLIDE 62

What is a module

  • Simply compute the gradients of each module for our data
  • We need to know the gradient formulation of each module $h_l(x_l; \theta_l)$ w.r.t. its inputs $x_l$ and parameters $\theta_l$
  • We need the forward computations first

—Their result is the sum of losses for our input data

  • Then take the reverse network (reverse connections) and traverse it backwards
  • Instead of using the activation functions, we use their gradients
  • The whole process can be described very neatly and concisely with the backpropagation algorithm

[Diagram: reversed module chain propagating dLoss(Input)]

slide-63
SLIDE 63

Again, what is a neural network again?

  • A neural network is a family of parametric, non-linear and hierarchical representation learning functions

$a_L(x; \theta_{1,\dots,L}) = h_L(h_{L-1}(\dots h_1(x, \theta_1), \theta_{L-1}), \theta_L)$

⎯ x: input, θl: parameters for layer l, al = hl(x, θl): (non-)linear function

  • Given a training corpus {X, Y}, find the optimal parameters

$\theta^* \leftarrow \arg\min_\theta \sum_{(x,y)\in(X,Y)} \ell\big(y, a_L(x; \theta_{1,\dots,L})\big)$

  • To use any gradient-descent-based optimization $\big(\theta^{(t+1)} = \theta^{(t)} - \eta_t \frac{\partial \mathcal{L}}{\partial \theta^{(t)}}\big)$, we need the gradients $\frac{\partial \mathcal{L}}{\partial \theta_l}$, $l = 1, \dots, L$
  • How to compute the gradients for such a complicated function enclosing other functions, like $a_L(\dots)$?
