Neural Networks: Design

Shan-Hung Wu

shwu@cs.nthu.edu.tw

Department of Computer Science, National Tsing Hua University, Taiwan

Machine Learning

Outline

1. The Basics
   - Example: Learning the XOR
2. Training
   - Back Propagation
3. Neuron Design
   - Cost Function & Output Neurons
   - Hidden Neurons
4. Architecture Design
   - Architecture Tuning

Model: a Composite Function I

- A feedforward neural network, or multilayer perceptron (MLP), defines a function composition
  $$\hat{y} = f^{(L)}\big(\cdots f^{(2)}(f^{(1)}(x; \theta^{(1)}); \theta^{(2)}) \cdots; \theta^{(L)}\big)$$
  that approximates the target function $f^*$.
- The parameters $\theta^{(1)}, \dots, \theta^{(L)}$ are learned from the training set $\mathbb{X}$.
- "Feedforward" because information flows from input to output.

Model: a Composite Function II

- At each layer $k$, the function $f^{(k)}(\cdot\,; W^{(k)}, b^{(k)})$ is nonlinear and outputs the value $a^{(k)} \in \mathbb{R}^{D^{(k)}}$, where
  $$a^{(k)} = \mathrm{act}^{(k)}\big(W^{(k)\top} a^{(k-1)} + b^{(k)}\big)$$
- $\mathrm{act}^{(k)}(\cdot): \mathbb{R} \to \mathbb{R}$ is an activation function applied elementwise.
- Shorthand: $a^{(k)} = \mathrm{act}^{(k)}(W^{(k)\top} a^{(k-1)})$, where $a^{(k-1)} \in \mathbb{R}^{D^{(k-1)}+1}$ is augmented with $a^{(k-1)}_0 = 1$, and $W^{(k)} \in \mathbb{R}^{(D^{(k-1)}+1) \times D^{(k)}}$ absorbs the bias.
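To make this concrete, here is a minimal NumPy sketch of one layer (not from the slides; function and variable names are mine):

```python
import numpy as np

def dense_layer(a_prev, W, b, act=lambda z: np.maximum(0.0, z)):
    """One feedforward layer: a^(k) = act(W^(k)T a^(k-1) + b^(k)).

    a_prev: (D_prev,) activations of layer k-1
    W:      (D_prev, D_k) weights;  b: (D_k,) biases
    act:    elementwise activation (ReLU by default)
    """
    z = W.T @ a_prev + b          # pre-activation z^(k)
    return act(z)

def dense_layer_absorbed(a_prev, W_aug):
    """Shorthand form: the bias sits in the first row of W_aug."""
    a_aug = np.concatenate(([1.0], a_prev))  # prepend a_0^(k-1) = 1
    return np.maximum(0.0, W_aug.T @ a_aug)
```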

Neurons I

- Each $f^{(k)}_j = \mathrm{act}^{(k)}\big(W^{(k)\top}_{:,j}\, a^{(k-1)}\big) = \mathrm{act}^{(k)}(z^{(k)}_j)$ is a unit (or neuron):
  - E.g., the perceptron.
  - Loosely guided by neuroscience.

Neurons II

- Modern NN design is mainly guided by mathematical and engineering disciplines. Consider a binary classifier where $y \in \{0,1\}$:
  - Hidden units: $a^{(k)} = \max(0, z^{(k)})$
  - Output unit: $a^{(L)} = \hat\rho = \sigma(z^{(L)})$, assuming $\Pr(y = 1\,|\,x) \sim \mathrm{Bernoulli}(\rho)$
  - Prediction: $\hat{y} = \mathbb{1}(\hat\rho > 0.5) = \mathbb{1}(z^{(L)} > 0)$
  - The output layer is thus a logistic regressor with input $a^{(L-1)}$.

Representation Learning

- The outputs $a^{(1)}, a^{(2)}, \dots, a^{(L-1)}$ of the hidden layers $f^{(1)}, f^{(2)}, \dots, f^{(L-1)}$ are distributed representations of $x$:
  - Nonlinear in the input space, since the $f^{(k)}$'s are nonlinear.
  - Usually more abstract at deeper layers.
- $f^{(L)}$ is the actual prediction function:
  - As in nonlinear SVMs and polynomial regression, a simple linear function suffices: $z^{(L)} = W^{(L)\top} a^{(L-1)}$.
  - $\mathrm{act}^{(L)}(\cdot)$ just "normalizes" $z^{(L)}$ to give $\hat\rho \in (0,1)$.
Learning the XOR I

- Why do ReLUs learn nonlinear (and better) representations?
- Let's learn XOR ($f^*$) in a binary classification task:
  - $x \in \mathbb{R}^2$ and $y \in \{0,1\}$
  - XOR is nonlinear, so it cannot be learned by linear models.
- Consider an NN with one hidden layer:
  - $a^{(1)} = \max(0, W^{(1)\top} x)$
  - $a^{(2)} = \hat\rho = \sigma(w^{(2)\top} a^{(1)})$
  - Prediction: $\hat{y} = \mathbb{1}(\hat\rho > 0.5)$
- It learns XOR by "merging" data points first.
Learning the XOR II

With the bias absorbed into the first column of $X$ and the first row of $W^{(1)}$, one set of weights that implements XOR is:

$$X = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \\ 1 & 1 & 1 \end{bmatrix} \in \mathbb{R}^{N \times (1+D)}, \qquad W^{(1)} = \begin{bmatrix} 0 & -1 \\ 1 & 1 \\ 1 & 1 \end{bmatrix}, \qquad w^{(2)} = \begin{bmatrix} -1 \\ 2 \\ -4 \end{bmatrix}$$

$$\hat{y} = \begin{bmatrix} 0 \\ 1 \\ 1 \\ 0 \end{bmatrix} = \mathbb{1}\Big(\sigma\big([\,\mathbf{1} \;\; \max(0, X W^{(1)})\,]\, w^{(2)}\big) > 0.5\Big)$$
Latent Representation $A^{(1)}$

$$X W^{(1)} = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \\ 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} 0 & -1 \\ 1 & 1 \\ 1 & 1 \end{bmatrix} = \begin{bmatrix} 0 & -1 \\ 1 & 0 \\ 1 & 0 \\ 2 & 1 \end{bmatrix}, \qquad A^{(1)} = [\,\mathbf{1} \;\; \max(0, X W^{(1)})\,] = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 2 & 1 \end{bmatrix}$$

Note that the inputs $(0,1)$ and $(1,0)$ both map to the same row $(1,1,0)$: this is the "merging" of data points mentioned earlier.
Output Distribution $a^{(2)}$

$$a^{(2)} = \sigma(A^{(1)} w^{(2)}) = \sigma\left(\begin{bmatrix} 1 & 0 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 2 & 1 \end{bmatrix} \begin{bmatrix} -1 \\ 2 \\ -4 \end{bmatrix}\right) = \sigma\left(\begin{bmatrix} -1 \\ 1 \\ 1 \\ -1 \end{bmatrix}\right), \qquad \hat{y} = \mathbb{1}(a^{(2)} > 0.5) = \begin{bmatrix} 0 \\ 1 \\ 1 \\ 0 \end{bmatrix}$$

- But how do we train $W^{(1)}$ and $w^{(2)}$ from examples?
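Before turning to training, here is a quick NumPy check of this forward pass, using the weights shown above (a sketch; variable names are mine):

```python
import numpy as np

# XOR inputs with a leading bias column of 1s; the labels are y = XOR(x1, x2)
X = np.array([[1, 0, 0],
              [1, 0, 1],
              [1, 1, 0],
              [1, 1, 1]], dtype=float)
W1 = np.array([[0, -1],
               [1,  1],
               [1,  1]], dtype=float)
w2 = np.array([-1, 2, -4], dtype=float)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

A1 = np.hstack([np.ones((4, 1)), np.maximum(0.0, X @ W1)])  # latent representation
a2 = sigmoid(A1 @ w2)                                       # predicted Bernoulli parameter
print((a2 > 0.5).astype(int))                               # -> [0 1 1 0], i.e., XOR
```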


Training an NN

- Given examples $\mathbb{X} = \{(x^{(i)}, y^{(i)})\}_{i=1}^N$, how do we learn the parameters $\Theta = \{W^{(1)}, \dots, W^{(L)}\}$?
- Most NNs are trained by maximum likelihood by default (assuming i.i.d. examples):
  $$\arg\max_\Theta \log P(\mathbb{X}\,|\,\Theta) = \arg\min_\Theta -\log P(\mathbb{X}\,|\,\Theta) = \arg\min_\Theta -\sum_i \log P(x^{(i)}, y^{(i)}\,|\,\Theta)$$
  $$= \arg\min_\Theta -\sum_i \big[\log P(y^{(i)}\,|\,x^{(i)}, \Theta) + \log P(x^{(i)}\,|\,\Theta)\big] = \arg\min_\Theta \sum_i -\log P(y^{(i)}\,|\,x^{(i)}, \Theta) = \arg\min_\Theta \sum_i C^{(i)}(\Theta)$$
  (the term $\log P(x^{(i)}\,|\,\Theta)$ drops out because the marginal distribution of $x$ does not depend on $\Theta$)
- The minimizer $\hat\Theta$ is a consistent (asymptotically unbiased) estimator of the "true" $\Theta^*$, so this is good for large $N$.
Example: Binary Classification

- Assume $\Pr(y = 1\,|\,x) \sim \mathrm{Bernoulli}(\rho)$, where $x \in \mathbb{R}^D$ and $y \in \{0,1\}$; $a^{(L)} = \hat\rho = \sigma(z^{(L)})$ is the predicted distribution.
- The cost function $C^{(i)}(\Theta)$ can be written as:
  $$C^{(i)}(\Theta) = -\log P(y^{(i)}\,|\,x^{(i)}; \Theta) = -\log\big[(a^{(L)})^{y^{(i)}} (1 - a^{(L)})^{1 - y^{(i)}}\big] = -\log\big[\sigma(z^{(L)})^{y^{(i)}} (1 - \sigma(z^{(L)}))^{1 - y^{(i)}}\big]$$
  $$= -\log \sigma\big((2y^{(i)} - 1)\, z^{(L)}\big) = \zeta\big((1 - 2y^{(i)})\, z^{(L)}\big)$$
- $\zeta(u) = \log(1 + e^u)$ is the softplus function.
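A small numerical check of the softplus identity above (my sketch, not from the slides):

```python
import numpy as np

softplus = lambda u: np.log1p(np.exp(u))
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for y in (0, 1):
    for z in (-2.0, 0.5, 3.0):
        rho = sigmoid(z)                                   # predicted Pr(y = 1 | x)
        nll = -np.log(rho if y == 1 else 1.0 - rho)        # -log Bernoulli likelihood
        assert np.isclose(nll, softplus((1 - 2 * y) * z))  # = zeta((1 - 2y) z)
```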

Optimization Algorithm

- Most NNs use SGD to solve $\arg\min_\Theta \sum_i C^{(i)}(\Theta)$:
  - Fast convergence in time [1]
  - Supports (GPU-based) parallelism
  - Supports online learning
  - Easy to implement
- (Mini-batched) Stochastic Gradient Descent (SGD), sketched in code below:
  - Initialize $\Theta^{(0)}$ randomly
  - Repeat until convergence:
    - Randomly partition the training set $\mathbb{X}$ into minibatches of size $M$
    - For each minibatch: $\Theta^{(t+1)} \leftarrow \Theta^{(t)} - \eta\, \nabla_\Theta \sum_{i=1}^M C^{(i)}(\Theta^{(t)})$
- How do we compute $\nabla_\Theta \sum_i C^{(i)}(\Theta^{(t)})$ efficiently? There can be a huge number of $W^{(k)}_{i,j}$'s in $\Theta$.
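A minimal skeleton of the minibatched SGD loop above (a sketch; `grad_loss` is a hypothetical stand-in for the gradient routine that backprop, introduced next, provides):

```python
import numpy as np

def sgd(params, grad_loss, X, Y, lr=0.1, batch_size=32, epochs=10):
    """params: list of weight arrays (Theta).
    grad_loss(params, Xb, Yb) -> list of gradients of sum_i C^(i), one per array."""
    N = len(X)
    for _ in range(epochs):                    # stand-in for "repeat until convergence"
        order = np.random.permutation(N)       # random partition into minibatches
        for start in range(0, N, batch_size):
            idx = order[start:start + batch_size]
            grads = grad_loss(params, X[idx], Y[idx])
            for p, g in zip(params, grads):    # Theta <- Theta - eta * gradient
                p -= lr * g
    return params
```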

Back Propagation

- SGD update: $\Theta^{(t+1)} \leftarrow \Theta^{(t)} - \eta\, \nabla_\Theta \sum_{n=1}^M C^{(n)}(\Theta^{(t)})$, and $\nabla_\Theta \sum_n C^{(n)}(\Theta^{(t)}) = \sum_n \nabla_\Theta C^{(n)}(\Theta^{(t)})$.
- Let $c^{(n)} = C^{(n)}(\Theta^{(t)})$; our goal is to evaluate $\partial c^{(n)} / \partial W^{(k)}_{i,j}$ for all $i$, $j$, $k$, and $n$.
- Back propagation (or simply backprop) is an efficient way to evaluate multiple partial derivatives at once, assuming the partial derivatives share some common evaluation steps.
- By the chain rule, we have
  $$\frac{\partial c^{(n)}}{\partial W^{(k)}_{i,j}} = \frac{\partial c^{(n)}}{\partial z^{(k)}_j} \cdot \frac{\partial z^{(k)}_j}{\partial W^{(k)}_{i,j}}$$
Forward Pass

- The second term is $\partial z^{(k)}_j / \partial W^{(k)}_{i,j}$:
  - When $k = 1$: $z^{(1)}_j = \sum_i W^{(1)}_{i,j} x^{(n)}_i$, so $\partial z^{(1)}_j / \partial W^{(1)}_{i,j} = x^{(n)}_i$.
  - When $k > 1$: $z^{(k)}_j = \sum_i W^{(k)}_{i,j} a^{(k-1)}_i$, so $\partial z^{(k)}_j / \partial W^{(k)}_{i,j} = a^{(k-1)}_i$.
- We can therefore get the second terms of all the $\partial c^{(n)} / \partial W^{(k)}_{i,j}$'s starting from the shallowest layer.
Backward Pass I

- Conversely, we can get the first terms of all the $\partial c^{(n)} / \partial W^{(k)}_{i,j}$'s starting from the deepest layer.
- Define the error signal $\delta^{(k)}_j$ as the first term $\partial c^{(n)} / \partial z^{(k)}_j$.
- When $k = L$, the evaluation varies from task to task, depending on the definitions of $\mathrm{act}^{(L)}$ and $C^{(n)}$. E.g., in binary classification:
  $$\delta^{(L)} = \frac{\partial c^{(n)}}{\partial z^{(L)}} = \frac{\partial\, \zeta\big((1 - 2y^{(n)})\, z^{(L)}\big)}{\partial z^{(L)}} = (1 - 2y^{(n)})\, \sigma\big((1 - 2y^{(n)})\, z^{(L)}\big)$$
Backward Pass II

- When $k < L$, we have
  $$\delta^{(k)}_j = \frac{\partial c^{(n)}}{\partial z^{(k)}_j} = \frac{\partial c^{(n)}}{\partial a^{(k)}_j} \cdot \frac{\partial a^{(k)}_j}{\partial z^{(k)}_j} = \frac{\partial c^{(n)}}{\partial a^{(k)}_j} \cdot \mathrm{act}'(z^{(k)}_j) = \Big(\sum_s \frac{\partial c^{(n)}}{\partial z^{(k+1)}_s} \cdot \frac{\partial z^{(k+1)}_s}{\partial a^{(k)}_j}\Big)\, \mathrm{act}'(z^{(k)}_j)$$
  $$= \Big(\sum_s \delta^{(k+1)}_s \cdot \frac{\partial \sum_i W^{(k+1)}_{i,s} a^{(k)}_i}{\partial a^{(k)}_j}\Big)\, \mathrm{act}'(z^{(k)}_j) = \Big(\sum_s \delta^{(k+1)}_s \cdot W^{(k+1)}_{j,s}\Big)\, \mathrm{act}'(z^{(k)}_j)$$
- Theorem (Chain Rule): let $g: \mathbb{R} \to \mathbb{R}^d$ and $f: \mathbb{R}^d \to \mathbb{R}$; then
  $$(f \circ g)'(x) = \nabla f(g(x))^\top \begin{bmatrix} g_1'(x) \\ \vdots \\ g_d'(x) \end{bmatrix}$$
Backward Pass III

$$\delta^{(k)}_j = \Big(\sum_s \delta^{(k+1)}_s \cdot W^{(k+1)}_{j,s}\Big)\, \mathrm{act}'(z^{(k)}_j)$$

- We can evaluate all the $\delta^{(k)}_j$'s starting from the deepest layer.
- The information propagates along a new kind of feedforward network: the same topology traversed in the reverse direction.
Backprop Algorithm (Minibatch Size M = 1)

- Input: $(x^{(n)}, y^{(n)})$ and $\Theta^{(t)}$
- Forward pass:
  - $a^{(0)} \leftarrow [1 \;\; x^{(n)\top}]^\top$
  - for $k \leftarrow 1$ to $L$: $z^{(k)} \leftarrow W^{(k)\top} a^{(k-1)}$; $a^{(k)} \leftarrow \mathrm{act}(z^{(k)})$
- Backward pass:
  - Compute the error signal $\delta^{(L)}$ (e.g., $(1 - 2y^{(n)})\, \sigma((1 - 2y^{(n)})\, z^{(L)})$ in binary classification)
  - for $k \leftarrow L-1$ down to $1$: $\delta^{(k)} \leftarrow \mathrm{act}'(z^{(k)}) \odot (W^{(k+1)} \delta^{(k+1)})$
- Return $\partial c^{(n)} / \partial W^{(k)} = a^{(k-1)} \otimes \delta^{(k)}$ for all $k$
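A direct NumPy transcription of this algorithm for ReLU hidden units and a sigmoid output in binary classification (a sketch under those assumptions; names are mine, and ties at z = 0 simply get derivative 0 here):

```python
import numpy as np

def backprop_single(x, y, Ws):
    """Gradients of c = softplus((1 - 2y) z_L) w.r.t. every W^(k).

    x: (D,) input; y: 0 or 1; Ws: weight matrices W^(1)..W^(L) with the
    bias absorbed, each of shape (prev_width + 1, width).
    """
    L = len(Ws)
    a = [np.concatenate(([1.0], x))]           # a^(0) <- [1 x]
    zs = []
    for k, W in enumerate(Ws):                 # forward pass
        z = W.T @ a[-1]
        zs.append(z)
        h = np.maximum(0.0, z) if k < L - 1 else z   # ReLU hidden, raw z at output
        a.append(np.concatenate(([1.0], h)))         # (a[L] itself is never used)
    s = 1.0 - 2.0 * y
    delta = s / (1.0 + np.exp(-s * zs[-1]))    # delta^(L) = (1-2y) sigma((1-2y) z^(L))
    grads = [None] * L
    for k in reversed(range(L)):               # backward pass
        grads[k] = np.outer(a[k], delta)       # dc/dW^(k) = a^(k-1) (outer) delta^(k)
        if k > 0:                              # drop the bias row when going back
            delta = (zs[k - 1] > 0).astype(float) * (Ws[k][1:] @ delta)
    return grads
```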

Backprop Algorithm (Minibatch Size M > 1)

- Input: $\{(x^{(n)}, y^{(n)})\}_{n=1}^M$ and $\Theta^{(t)}$
- Forward pass:
  - $A^{(0)} \leftarrow [a^{(0,1)} \cdots a^{(0,M)}]^\top$
  - for $k \leftarrow 1$ to $L$: $Z^{(k)} \leftarrow A^{(k-1)} W^{(k)}$; $A^{(k)} \leftarrow \mathrm{act}(Z^{(k)})$
- Backward pass:
  - Compute the error signals $\Delta^{(L)} = [\delta^{(L,1)} \cdots \delta^{(L,M)}]^\top$
  - for $k \leftarrow L-1$ down to $1$: $\Delta^{(k)} \leftarrow \mathrm{act}'(Z^{(k)}) \odot (\Delta^{(k+1)} W^{(k+1)\top})$
- Return $\partial c / \partial W^{(k)} = \sum_{n=1}^M a^{(k-1,n)} \otimes \delta^{(k,n)}$ for all $k$
- Speed up with GPUs? The matrix-matrix products benefit from:
  - Large width $D^{(k)}$ at each layer
  - Large batch size

Neuron Design

- The design of modern neurons is largely influenced by how an NN is trained.
- Maximum likelihood principle: $\arg\max_\Theta \log P(\mathbb{X}\,|\,\Theta) = \arg\min_\Theta \sum_i -\log P(y^{(i)}\,|\,x^{(i)}, \Theta)$
  - A universal cost function
  - Different output units for different $P(y\,|\,x)$
- Gradient-based optimization: during SGD, the gradient
  $$\frac{\partial c^{(n)}}{\partial W^{(k)}_{i,j}} = \frac{\partial c^{(n)}}{\partial z^{(k)}_j} \cdot \frac{\partial z^{(k)}_j}{\partial W^{(k)}_{i,j}} = \delta^{(k)}_j\, \frac{\partial z^{(k)}_j}{\partial W^{(k)}_{i,j}}$$
  should be sufficiently large before we get a satisfactory NN.
Negative Log Likelihood and Cross Entropy

- The cost function of most NNs: $\arg\max_\Theta \log P(\mathbb{X}\,|\,\Theta) = \arg\min_\Theta \sum_i -\log P(y^{(i)}\,|\,x^{(i)}, \Theta)$
- For NNs that output an entire distribution $\hat{P}(y\,|\,x)$, the problem can be equivalently described as minimizing the cross entropy (or KL divergence) from $\hat{P}$ to the empirical distribution of the data:
  $$\arg\min_{\hat{P}}\; \mathbb{E}_{(x,y) \sim \mathrm{Empirical}(\mathbb{X})}\big[-\log \hat{P}(y\,|\,x)\big]$$
- This provides a consistent way to define output units.
Sigmoid Units for Bernoulli Output Distributions

- In binary classification, we assume $P(y = 1\,|\,x) \sim \mathrm{Bernoulli}(\rho)$, with $y \in \{0,1\}$ and $\rho \in (0,1)$.
- Sigmoid output unit: $a^{(L)} = \hat\rho = \sigma(z^{(L)}) = \dfrac{\exp(z^{(L)})}{\exp(z^{(L)}) + 1}$
- Error signal:
  $$\delta^{(L)} = \frac{\partial c^{(n)}}{\partial z^{(L)}} = -\frac{\partial \log \hat{P}(y^{(n)}\,|\,x^{(n)}; \Theta)}{\partial z^{(L)}} = (1 - 2y^{(n)})\, \sigma\big((1 - 2y^{(n)})\, z^{(L)}\big)$$
- This is close to 0 only when $y^{(n)} = 1$ and $z^{(L)}$ is very positive, or $y^{(n)} = 0$ and $z^{(L)}$ is very negative.
- In other words, the loss $c^{(n)}$ saturates (becomes flat) only when $\hat\rho$ is "correct".
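A small check that this error signal is just the familiar residual $\hat\rho - y$ (my derivation check, not a claim from the slides):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

z = np.random.default_rng(0).normal(size=1000)
for y in (0, 1):
    delta = (1 - 2 * y) * sigmoid((1 - 2 * y) * z)  # formula from the slide
    assert np.allclose(delta, sigmoid(z) - y)       # equals rho_hat - y
```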

Softmax Units for Categorical Output Distributions I

- In multiclass classification, we can assume $P(y\,|\,x) \sim \mathrm{Categorical}(\rho)$, where $y, \rho \in \mathbb{R}^K$ and $\mathbf{1}^\top \rho = 1$.
- Softmax units: $a^{(L)}_j = \hat\rho_j = \mathrm{softmax}(z^{(L)})_j = \dfrac{\exp(z^{(L)}_j)}{\sum_{i=1}^K \exp(z^{(L)}_i)}$
- Actually, to define a Categorical distribution we only need $\rho_1, \dots, \rho_{K-1}$ ($\rho_K = 1 - \sum_{i=1}^{K-1} \rho_i$ can be discarded).
- We can alternatively define $K - 1$ output units (discarding $a^{(L)}_K = \hat\rho_K$):
  $$a^{(L)}_j = \hat\rho_j = \frac{\exp(z^{(L)}_j)}{\sum_{i=1}^{K-1} \exp(z^{(L)}_i) + 1},$$
  a direct generalization of $\sigma$ in binary classification.
- In practice, the two versions make little difference.
Softmax Units for Categorical Output Distributions II

- Now we have
  $$\delta^{(L)}_j = \frac{\partial c^{(n)}}{\partial z^{(L)}_j} = -\frac{\partial \log \hat{P}(y^{(n)}\,|\,x^{(n)}; \Theta)}{\partial z^{(L)}_j} = -\frac{\partial \log \prod_i \hat\rho_i^{\,\mathbb{1}(y^{(n)} = i)}}{\partial z^{(L)}_j}$$
- If $y^{(n)} = j$, then
  $$\delta^{(L)}_j = -\frac{\partial \log \hat\rho_j}{\partial z^{(L)}_j} = -\frac{1}{\hat\rho_j}\big(\hat\rho_j - \hat\rho_j^2\big) = \hat\rho_j - 1$$
  - $\delta^{(L)}_j$ is close to 0 only when $\hat\rho_j$ is "correct" ($\approx 1$); in this case $z^{(L)}_j$ dominates among all the $z^{(L)}_i$'s.
- If $y^{(n)} = i \neq j$, then
  $$\delta^{(L)}_j = -\frac{\partial \log \hat\rho_i}{\partial z^{(L)}_j} = -\frac{1}{\hat\rho_i}\big(-\hat\rho_i \hat\rho_j\big) = \hat\rho_j$$
  - Again, close to 0 only when $\hat\rho_j$ is "correct" ($\approx 0$).
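Both cases collapse to $\delta^{(L)} = \hat\rho - \mathrm{onehot}(y^{(n)})$; a quick numerical check (my sketch):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())            # shift for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
z = rng.normal(size=5)                 # K = 5 logits
y = 2                                  # true class index

def nll(zv):                           # c = -log rho_hat_y
    return -np.log(softmax(zv)[y])

eps = 1e-6
num = np.array([(nll(z + eps * e) - nll(z - eps * e)) / (2 * eps)
                for e in np.eye(5)])   # numerical gradient w.r.t. z
assert np.allclose(num, softmax(z) - np.eye(5)[y], atol=1e-5)
```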

Linear Units for Gaussian Means

- An NN can also output just one conditional statistic of $y$ given $x$.
- For example, we can assume $P(y\,|\,x) \sim \mathcal{N}(\mu, \Sigma)$ for regression. How do we design the output neurons if we want to predict the mean $\hat\mu$?
- Linear units: $a^{(L)} = \hat\mu = z^{(L)}$
- We have $\delta^{(L)} = \partial c^{(n)} / \partial z^{(L)} = -\partial \log \mathcal{N}(y^{(n)}; \hat\mu, \Sigma)\, /\, \partial z^{(L)}$. Letting $\Sigma = I$, maximizing the log-likelihood is equivalent to minimizing the SSE/MSE: $\delta^{(L)} = \partial \|y^{(n)} - z^{(L)}\|^2 / \partial z^{(L)}$ (see linear regression).
- Linear units do not saturate, so they pose little difficulty for gradient-based optimization.
Design Considerations

- Most units differ from each other only in their activation functions: $a^{(k)} = \mathrm{act}(z^{(k)}) = \mathrm{act}(W^{(k)\top} a^{(k-1)})$
- Why use the ReLU, $\mathrm{act}(z^{(k)}) = \max(0, z^{(k)})$, as the default hidden unit?
- Why not, for example, use the sigmoid for hidden units?
Vanishing Gradient Problem

- In the backward pass of backprop: $\delta^{(k)}_j = \big(\sum_s \delta^{(k+1)}_s \cdot W^{(k+1)}_{j,s}\big)\, \mathrm{act}'(z^{(k)}_j)$
- If $\mathrm{act}'(\cdot) = \sigma'(\cdot) < 1$, then $\delta^{(k)}_j$ becomes smaller and smaller during the backward pass:
  - The surface of the cost function becomes very flat at shallow layers.
  - This slows down the learning speed of the entire network, since the weights at deeper layers depend on those in shallow ones.
  - It also causes numeric problems, e.g., underflow.
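A toy illustration of the shrinkage (a sketch with made-up weights; since $\sigma'(z) \le 1/4$, the error signal decays roughly geometrically with depth):

```python
import numpy as np

rng = np.random.default_rng(0)
depth, width = 20, 64
sig = lambda z: 1.0 / (1.0 + np.exp(-z))

delta = rng.normal(size=width)                   # pretend top-layer error signal
for k in range(depth):
    W = 0.1 * rng.normal(size=(width, width))
    z = rng.normal(size=width)
    delta = sig(z) * (1 - sig(z)) * (W @ delta)  # delta^(k) = sigma'(z) * (W delta^(k+1))
    if (k + 1) % 5 == 0:
        print(k + 1, np.linalg.norm(delta))      # the norm keeps shrinking
```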

ReLU I

$$\mathrm{act}'(z^{(k)}) = \begin{cases} 1 & \text{if } z^{(k)} > 0 \\ 0 & \text{otherwise} \end{cases}$$

- No vanishing gradients in $\delta^{(k)}_j = \big(\sum_s \delta^{(k+1)}_s \cdot W^{(k+1)}_{j,s}\big)\, \mathrm{act}'(z^{(k)}_j)$.
- What if $z^{(k)} = 0$? In practice, we usually assign 1 or 0 randomly; floating-point numbers are not precise anyway.
ReLU II

- Why piecewise linear? To avoid vanishing gradients, we could instead modify $\sigma(\cdot)$ to make it steeper in the middle, so that $\sigma'(\cdot) > 1$ there.
- But the second derivative $\mathrm{ReLU}''(\cdot)$ is 0 everywhere (except at the kink), which eliminates second-order effects and makes gradient-based optimization more useful (than, e.g., Newton methods).
- Problem: for neurons with $\delta^{(k)}_j = 0$, their weights $W^{(k)}_{:,j}$ will not be updated:
  $$\frac{\partial c^{(n)}}{\partial W^{(k)}_{i,j}} = \delta^{(k)}_j\, \frac{\partial z^{(k)}_j}{\partial W^{(k)}_{i,j}}$$
- Improvement?
Leaky/Parametric ReLU

- $\mathrm{act}(z^{(k)}) = \max(\alpha \cdot z^{(k)},\, z^{(k)})$ for some $\alpha \in \mathbb{R}$
- Leaky ReLU: $\alpha$ is set in advance (fixed during training), usually to a small value, or domain-specific.
- Example: absolute value rectification ($\alpha = -1$, giving $\mathrm{act}(z) = |z|$):
  - Used for object recognition from images.
  - Seeks features that are invariant under a polarity reversal of the input illumination.
- Parametric ReLU (PReLU): $\alpha$ is learned automatically by gradient descent.
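The whole family in a few lines of NumPy (a sketch; the random tie-breaking at exactly 0 mentioned earlier is not implemented here):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):   # alpha fixed in advance (alpha <= 1)
    return np.maximum(alpha * z, z)

def abs_rectifier(z):            # the alpha = -1 case: max(-z, z) = |z|
    return np.maximum(-z, z)

def prelu(z, alpha):             # same formula; alpha is a learned parameter
    return np.maximum(alpha * z, z)
```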

Maxout Units I

- Maxout units generalize the ReLU variants further: $\mathrm{act}(z^{(k)})_j = \max_s z^{(k)}_{j,s}$
- $a^{(k-1)}$ is linearly mapped to multiple groups of $z^{(k)}_{j,:}$'s.
- A maxout unit learns a piecewise linear, convex activation function automatically; this covers both leaky ReLU and PReLU.
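A minimal maxout layer in NumPy (a sketch; the column-grouping convention is mine, one of several used in practice):

```python
import numpy as np

def maxout_layer(a_prev, W, num_pieces):
    """a_prev: (D_prev,); W: (D_prev, D_k * num_pieces).

    Each output unit j takes the max over its own group of
    num_pieces linear pre-activations z_{j,1..S}.
    """
    z = W.T @ a_prev                 # all pieces, shape (D_k * num_pieces,)
    z = z.reshape(-1, num_pieces)    # (D_k, S): one row of pieces per unit
    return z.max(axis=1)             # act(z)_j = max_s z_{j,s}

# Example: 3 inputs -> 4 maxout units, each the max of 2 linear pieces
rng = np.random.default_rng(0)
print(maxout_layer(rng.normal(size=3), rng.normal(size=(3, 8)), num_pieces=2))
```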

Maxout Units II

- How do we train an NN with maxout units? Given a training example $(x^{(n)}, y^{(n)})$, update the weights that correspond to the winning $z^{(k)}_{j,s}$'s for this example.
- Different examples may update different parts of the network.
- This offers some "redundancy" that helps resist the catastrophic forgetting phenomenon [2], where an NN may forget how to perform tasks it was trained on in the past.
Maxout Units III

- Cons? Each maxout unit is now parametrized by multiple weight vectors instead of just one.
- It therefore typically requires more training data; otherwise, regularization is needed.
Architecture Design

- Thin-and-deep or fat-and-shallow?
- Theorem (Universal Approximation Theorem [3, 4]): a feedforward network with at least one hidden layer can approximate any continuous function (on a closed and bounded subset of $\mathbb{R}^D$), or any function mapping from one finite-dimensional discrete space to another.
- In short, a feedforward network with a single hidden layer is sufficient to represent any function. So why go deep?
Exponential Gain in Number of Hidden Units

- Functions representable with a deep rectifier NN can require an exponential number of hidden units in a shallow NN [5].
- Deep NNs are therefore easier to learn given a fixed amount of data.
- Example: an NN with absolute-value rectification units. Each hidden unit specifies where to fold the input space in order to create mirror responses (on both sides of the absolute value). By composing these folding operations, we obtain an exponentially large number of piecewise linear regions that can capture all kinds of regular (e.g., repeating) patterns.
Encoding Prior Knowledge

- Choosing a deep model also encodes a very general belief that the function we want to learn should involve a composition of several simpler functions. If this belief is valid, deep NNs give better generalizability.
- When is it valid?
  - Representation learning point of view: the learning problem consists of discovering a set of underlying factors, which can in turn be described using other, simpler underlying factors.
  - Computer program point of view: the function to learn is a computer program consisting of multiple steps, where each step makes use of the previous step's output; intermediate outputs can be counters or pointers for internal processing.
Width & Depth

[Figure-only slide on tuning network width and depth.]

References

[1] Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010, pages 177-186. Springer, 2010.
[2] Ian J. Goodfellow, Mehdi Mirza, Da Xiao, Aaron Courville, and Yoshua Bengio. An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211, 2013.
[3] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359-366, 1989.
[4] Moshe Leshno, Vladimir Ya Lin, Allan Pinkus, and Shimon Schocken. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6(6):861-867, 1993.
[5] Guido F. Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. In Advances in Neural Information Processing Systems, pages 2924-2932, 2014.