Applied Machine Learning
Multilayer Perceptron
Siamak Ravanbakhsh
COMP 551 (Winter 2020)
several methods can be classified as learning these bases adaptively:
- decision trees
- generalized additive models
- boosting
- neural networks

here we consider the adaptive bases in a general form (in contrast to decision trees), use gradient descent to find good parameters (in contrast to boosting), and create more complex adaptive bases by combining simpler bases, which leads to deep neural networks
non-adaptive case

Gaussian bases (radial bases): ϕ_d(x) = e^{−s²(x−μ_d)²}

model: f(x; w) = ∑_d w_d ϕ_d(x)

cost: J(w) = ½ ∑_n ( f(x^(n); w) − y^(n) )²

the model is linear in its parameters; the cost is convex in w (unique minimum) and even has a closed-form solution

the centers are fixed:

import numpy as np
import matplotlib.pyplot as plt

#x: N (given training inputs)
#y: N (given targets)
plt.plot(x, y, 'b.')
phi = lambda x, mu: np.exp(-(x - mu)**2)
mu = np.linspace(0, 4, 10)                   # 10 Gaussian basis centers in [0, 4]
Phi = phi(x[:, None], mu[None, :])           # N x 10 design matrix
w = np.linalg.lstsq(Phi, y, rcond=None)[0]   # closed-form least-squares solution
yh = np.dot(Phi, w)
plt.plot(x, yh, 'g-')

adaptive case

we can make the bases adaptive by learning these centers

model: f(x; w, μ) = ∑_d w_d ϕ_d(x; μ_d)

how to minimize the cost? it is no longer convex in all model parameters, so we use gradient descent to find a local minimum; note that the basis centers are now adaptively changing
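As an illustration of the adaptive case (my sketch, not the slides' code), the snippet below learns both the weights w and the centers μ of the Gaussian bases by gradient descent on the averaged cost; the synthetic data, learning rate, and number of steps are arbitrary choices.

```python
import numpy as np

# toy 1D regression data (assumed for illustration)
rng = np.random.default_rng(0)
x = np.linspace(0, 4, 100)
y = np.sin(2 * x) + 0.1 * rng.standard_normal(x.shape)

D, lr = 10, 0.1
mu = np.linspace(0, 4, D)      # centers are now parameters
w = np.zeros(D)

for step in range(5000):
    Phi = np.exp(-(x[:, None] - mu[None, :]) ** 2)         # N x D basis matrix
    r = Phi @ w - y                                         # residuals f(x) - y
    grad_w = Phi.T @ r / len(x)                             # gradient of the averaged cost w.r.t. w
    grad_mu = (r[:, None] * w[None, :] * Phi
               * 2 * (x[:, None] - mu[None, :])).sum(0) / len(x)   # gradient w.r.t. the centers
    w -= lr * grad_w
    mu -= lr * grad_mu                                      # the centers adapt to the data
```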
non-adaptive case

sigmoid bases: ϕ_d(x) = 1 / (1 + e^{−(x−μ_d)/s_d})

using adaptive sigmoid bases gives us a neural network

in the non-adaptive case, μ_d is fixed to D locations and s_d = 1:

#x: N (given training inputs)
#y: N (given targets)
plt.plot(x, y, 'b.')
phi = lambda x, mu, s=1.: 1/(1 + np.exp(-(x - mu)/s))
mu = np.linspace(0, 3, 10)                   # D = 10 fixed locations
Phi = phi(x[:, None], mu[None, :])           # N x 10
w = np.linalg.lstsq(Phi, y, rcond=None)[0]
yh = np.dot(Phi, w)
plt.plot(x, yh, 'g-')

model: f(x; w) = ∑_d w_d ϕ_d(x)

[network diagram: the bases ϕ_1(x), ϕ_2(x), …, ϕ_D(x) are combined with weights w_1, w_2, …, w_D to produce ŷ; plots show fits with D = 3, 5, and 10 fixed bases]
rewrite the sigmoid basis: ϕ_d(x) = σ( (x − μ_d)/s_d ) = σ(v_d x + b_d)

each basis is a logistic regression model, ϕ_d(x) = σ(v_d^⊤ x + b_d), assuming the input has more than one dimension

model: f(x; w, v, b) = ∑_d w_d σ(v_d x + b_d)

this is a neural network with two layers

[network diagram: the input x feeds each basis ϕ_d through parameters (v_d, b_d), and the bases ϕ_1, …, ϕ_D are combined with weights w_1, …, w_D to produce ŷ; plots compare D = 3 adaptive bases with D = 3 fixed bases]
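To connect the two views, here is a small numpy check (an illustration, not the slides' code): each basis is a logistic-regression unit, and summing the weighted bases is exactly a one-hidden-layer network; the sizes and random parameters are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
D_in, D_bases = 3, 5                      # input dimension and number of bases (arbitrary)
x = rng.standard_normal(D_in)
V = rng.standard_normal((D_bases, D_in))  # each row is v_d
b = rng.standard_normal(D_bases)          # b_d
w = rng.standard_normal(D_bases)          # w_d

sigma = lambda a: 1 / (1 + np.exp(-a))

# sum-of-bases form: f(x) = sum_d w_d * sigma(v_d^T x + b_d)
f_bases = sum(w[d] * sigma(V[d] @ x + b[d]) for d in range(D_bases))

# equivalent two-layer network form: f(x) = w^T sigma(Vx + b)
f_network = w @ sigma(V @ x + b)

print(np.isclose(f_bases, f_network))   # True
```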
suppose we have D inputs, K outputs, and M hidden units

ŷ_k = g( ∑_m W_{k,m} h( ∑_d V_{m,d} x_d ) )

h and g are nonlinearities (activation functions): we have different choices

more compressed form: ŷ = g(W h(V x)); the non-linearities are applied elementwise, and for simplicity we may drop the bias terms

[network diagram: inputs x_1, …, x_D plus a constant 1, hidden units z_1, …, z_M plus a constant 1, and outputs ŷ_1, …, ŷ_C, with weight matrix V from input to hidden and W from hidden to output]
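A minimal numpy sketch of the compressed forward pass ŷ = g(W h(V x)) for a batch of inputs (my illustration; the sizes, the ReLU hidden activation, and the softmax output are assumed choices, and biases are dropped as in the slide):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, M, K = 32, 5, 10, 3                      # batch size, inputs, hidden units, outputs (arbitrary)
X = rng.standard_normal((N, D))
V = 0.1 * rng.standard_normal((M, D))          # input-to-hidden weights
W = 0.1 * rng.standard_normal((K, M))          # hidden-to-output weights

h = lambda a: np.maximum(0, a)                 # elementwise hidden nonlinearity (ReLU)

def g(a):                                      # output nonlinearity (softmax over K classes)
    e = np.exp(a - a.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

Z = h(X @ V.T)        # N x M hidden activations, z = h(Vx)
Yhat = g(Z @ W.T)     # N x K outputs; each row sums to 1
```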
using neural networks: the choice of activation function in the final layer depends on the task

model: ŷ = g(W h(V x))

regression: identity output + L2 loss (Gaussian likelihood); we may have one or more output variables

ŷ = g(Wz) = Wz

L(y, ŷ) = ½ ||y − ŷ||² = −log N(y; ŷ, βI) + constant
more generally, the neural network can output the parameters of a distribution: we may explicitly produce a distribution at the output, e.g., the mean and variance of a Gaussian, or a mixture of Gaussians; the loss is then the negative log-likelihood of the data under our model

L(y, ŷ) = −log p(y; f(x))

binary classification: logistic sigmoid + cross-entropy loss (Bernoulli likelihood), scalar output (C = 1)

ŷ = g(Wz) = (1 + e^{−Wz})^{−1}

L(y, ŷ) = −( y log ŷ + (1 − y) log(1 − ŷ) ) = −log Bernoulli(y; ŷ)

multiclass classification: softmax + multi-class cross-entropy loss (categorical likelihood); C is the number of classes

ŷ = g(Wz) = softmax(Wz)

L(y, ŷ) = −∑_k y_k log ŷ_k = −log Categorical(y; ŷ)
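A small sketch of the three output/loss pairings above (illustrative; the toy targets and logits are made up):

```python
import numpy as np

# regression: identity output + L2 loss (negative Gaussian log-likelihood up to a constant)
l2_loss = lambda y, yhat: 0.5 * np.sum((y - yhat) ** 2)

# binary classification: sigmoid output + cross-entropy (negative Bernoulli log-likelihood)
bce_loss = lambda y, yhat: -(y * np.log(yhat) + (1 - y) * np.log(1 - yhat))

# multiclass classification: softmax output + cross-entropy (negative categorical log-likelihood)
def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

ce_loss = lambda y_onehot, yhat: -np.sum(y_onehot * np.log(yhat))

yhat = softmax(np.array([1.0, -0.5, 0.3]))            # toy logits Wz for C = 3 classes
print(ce_loss(np.array([0.0, 0.0, 1.0]), yhat))       # loss when the true class is the third one
```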
for middle layer(s) there is more freedom in the choice of activation function

identity (no activation function): h(x) = x

the composition of two linear functions is linear: W'x = W V x, where W is K × M, V is M × D, and W' = W V is K × D

so nothing is gained (in representation power) by stacking linear layers; exception: if M < min(D, K) then the hidden layer is compressing the data (W' is low-rank); this idea is used in dimensionality reduction (later!)
for middle layer(s) there is more freedom in the choice of activation function

logistic function: h(x) = σ(x) = 1 / (1 + e^{−x})

this is the same function used in logistic regression and used to be the activation of choice in neural networks

away from zero it changes slowly, so the derivative is small (this leads to vanishing gradients)

its derivative is easy to remember: ∂/∂x σ(x) = σ(x)(1 − σ(x))

hyperbolic tangent: h(x) = tanh(x) = 2σ(2x) − 1 = (e^x − e^{−x}) / (e^x + e^{−x})

similar to the sigmoid, but symmetric; near zero it behaves like a linear function (rather than an affine function, as the logistic does)

∂/∂x tanh(x) = 1 − tanh(x)²

it has a similar vanishing-gradient problem
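A quick numerical check of the two derivative identities (my sketch; the test points and step size are arbitrary):

```python
import numpy as np

sigma = lambda x: 1 / (1 + np.exp(-x))
x = np.linspace(-3, 3, 7)
eps = 1e-6

fd_sigma = (sigma(x + eps) - sigma(x - eps)) / (2 * eps)     # finite-difference derivative
print(np.allclose(fd_sigma, sigma(x) * (1 - sigma(x))))      # True

fd_tanh = (np.tanh(x + eps) - np.tanh(x - eps)) / (2 * eps)
print(np.allclose(fd_tanh, 1 - np.tanh(x) ** 2))             # True
```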
for middle layer(s) there is more freedom in the choice of activation function

Rectified Linear Unit (ReLU): h(x) = max(0, x)

replacing the logistic with ReLU significantly improves the training of deep networks; the derivative is zero when a unit is "inactive", so initialization should ensure active units at the beginning of optimization

leaky ReLU: h(x) = max(0, x) + γ min(0, x) fixes the zero-gradient problem

parametric ReLU: make γ a learnable parameter

Softplus (differentiable everywhere): h(x) = log(1 + e^x); it doesn't perform as well in practice
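Minimal numpy definitions of these activations (a sketch; the γ value is an arbitrary example, and the softplus uses a numerically stable form):

```python
import numpy as np

relu = lambda x: np.maximum(0, x)
leaky_relu = lambda x, gamma=0.01: np.maximum(0, x) + gamma * np.minimum(0, x)
softplus = lambda x: np.maximum(x, 0) + np.log1p(np.exp(-np.abs(x)))   # stable log(1 + e^x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x), leaky_relu(x), softplus(x), sep="\n")
```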
architecture is the overall structure of the network

a feedforward network (aka multilayer perceptron) can have many layers; the number of layers is called the depth of the network, and the number of units in a layer is its width

[network diagram: inputs x_1, …, x_D, one or more hidden layers z_1^{ℓ}, …, z_M^{ℓ}, and outputs ŷ_1, …, ŷ_C, with the depth and width marked]
each layer can be fully connected (dense) or sparse

[diagrams: the same layer x_1, …, x_D → z_1, …, z_M drawn fully connected and sparsely connected]

layers may have skip connections, which help with gradient flow

units may have different activations

parameters may be shared across units (e.g., in conv-nets)

more generally, a directed acyclic graph (DAG) expresses the feed-forward architecture
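A minimal sketch of a hidden layer with a skip connection (my illustration; the width, ReLU activation, and random weights are assumptions): the layer's input is added back to its output, so gradients have a path around the nonlinearity.

```python
import numpy as np

rng = np.random.default_rng(0)
M = 8
W = 0.1 * rng.standard_normal((M, M))

def layer_with_skip(z):
    return np.maximum(0, W @ z) + z      # skip connection: h(Wz) + z

z = rng.standard_normal(M)
print(layer_with_skip(z))
```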
universal approximation theorem: an MLP with a single hidden layer can approximate any continuous function with arbitrary accuracy

for 1D input we can see this even with fixed bases; with M = 100 bases in this example the fit is good (the blue line is hard to see)

however, the number of bases (M) should grow exponentially with D (the curse of dimensionality)
universal approximation theorem: an MLP with a single hidden layer can approximate any continuous function with arbitrary accuracy

caveats: we may need a very wide network (large M), and the theorem is only about training error; we care about test error

deep networks (with ReLU activation) of bounded width have also been shown to be universal

empirically, increasing depth is often more effective than increasing width (the number of parameters per layer); assuming a compositional functional form (through depth) is a useful inductive bias
the number of regions (in which the network is linear) grows exponentially with depth

simplified demonstration: take the absolute value as the activation, so layer ℓ computes h(W^{ℓ} x) = |W^{ℓ} x|; each layer folds its input space, and the next layer W^{ℓ+1} acts identically on the folded copies
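A toy 1D version of this demonstration (my sketch, not the slide's construction): composing ℓ copies of the piecewise-linear "fold" x ↦ |2x − 1| on [0, 1] yields 2^ℓ linear pieces, which we can count from slope changes on a fine grid.

```python
import numpy as np

fold = lambda x: np.abs(2 * x - 1)            # one absolute-value "folding" layer on [0, 1]

def count_linear_pieces(depth, n=2**12 + 1):
    x = np.linspace(0, 1, n)
    y = x
    for _ in range(depth):
        y = fold(y)                            # apply the layer repeatedly
    slopes = np.round(np.diff(y) / np.diff(x), 6)
    return 1 + np.count_nonzero(np.diff(slopes))

for depth in range(1, 6):
    print(depth, count_linear_pieces(depth))   # 2, 4, 8, 16, 32 linear pieces
```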
the universality of neural networks also means they can overfit; strategies for variance reduction:
- L1 and L2 regularization (weight decay)
- data augmentation
- noise robustness
- early stopping
- bagging and dropout
- sparse representations (e.g., an L1 penalty on hidden unit activations)
- semi-supervised and multi-task learning
- adversarial training
- parameter tying
a larger dataset results in better generalization

example: with N = 20, N = 40, and N = 80 training points, the training error is close to zero in all three cases; however, the larger training dataset leads to better generalization
a larger dataset results in better generalization

idea: increase the size of the dataset by adding reasonable transformations τ(x) that change the label in predictable ways, e.g., f(τ(x)) = f(x)

image: https://github.com/aleju/imgaug/blob/master/README.md

special approaches to data augmentation:
- adding noise to the input
- adding noise to hidden units
- noise at a higher level of abstraction: learn a generative model p̂(x, y) of the data and use samples x^(n'), y^(n') ∼ p̂ for training

sometimes we can achieve the same goal by designing models that are invariant to a given set of transformations
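A minimal sketch of the idea (illustrative; it assumes image-like inputs for which a horizontal flip leaves the label unchanged, i.e., f(τ(x)) = f(x)):

```python
import numpy as np

def augment_flip(X, y):
    X_flipped = X[:, :, ::-1]                                        # mirror each image left-right
    return np.concatenate([X, X_flipped]), np.concatenate([y, y])   # labels are unchanged

X = np.random.rand(4, 28, 28)    # toy batch of four "images"
y = np.array([0, 1, 1, 0])
X_aug, y_aug = augment_flip(X, y)
print(X_aug.shape, y_aug.shape)  # (8, 28, 28) (8,)
```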
make the model robust to noise in:
- the input (data augmentation)
- the hidden units (e.g., in dropout)
- the weights: the loss is then not sensitive to small changes in the weights (flat minima); flat minima generalize better, and the good performance of SGD with small minibatches is attributed to flat minima; in this case SGD regularizes the model through its gradient noise (image credit: Keskar et al. '17)
- the labels: label smoothing is a heuristic that replaces hard labels with "soft" labels, e.g., [0, 0, 1, 0] → [ε/3, ε/3, 1 − ε, ε/3]
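A minimal sketch of the label-smoothing heuristic (illustrative; the value of ε is arbitrary):

```python
import numpy as np

def smooth_labels(y_onehot, eps=0.1):
    C = y_onehot.shape[-1]
    # keep 1 - eps on the true class and spread eps evenly over the other classes
    return y_onehot * (1 - eps) + (1 - y_onehot) * eps / (C - 1)

print(smooth_labels(np.array([0.0, 0.0, 1.0, 0.0])))   # [eps/3, eps/3, 1 - eps, eps/3]
```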
the test loss versus training time is "often" U-shaped; use a validation set for early stopping, which also saves computation

early stopping bounds the region of the parameter space reachable in T time steps (assuming bounded gradients and starting with a small w), so it has an effect similar to L2 regularization; we get the regularization path (various λ) along the way; we saw a similar phenomenon in boosting
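A minimal early-stopping loop (a sketch under assumptions: train_step and val_loss are placeholder functions, and the patience value is arbitrary):

```python
import copy

def fit_with_early_stopping(model, train_step, val_loss, max_steps=10_000, patience=20):
    best_loss, best_model, since_best = float("inf"), None, 0
    for t in range(max_steps):
        train_step(model)                    # one optimization step on the training set
        loss = val_loss(model)               # monitor loss on a held-out validation set
        if loss < best_loss:
            best_loss, best_model, since_best = loss, copy.deepcopy(model), 0
        else:
            since_best += 1
            if since_best >= patience:       # stop once validation loss stops improving
                break
    return best_model
```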
there are several sources of variance in neural networks, such as:
- initialization
- the randomness of SGD
- the learning rate and other hyper-parameters
- the choice of architecture (number of layers, hidden units, etc.)

we can use bagging, or even averaging without the bootstrap, to reduce this variance; the issue is that it is computationally expensive
dropout idea: randomly remove a subset of units during training; as opposed to bagging, a single model is trained, but it can be viewed as exponentially many subnetworks that share parameters; dropout is one of the most effective regularization schemes for MLPs
during training: for each instance (n), randomly drop out each unit with probability p (e.g., p = .5)

at test time: ideally we want to average over the predictions of all possible sub-networks, but this is computationally infeasible; instead:
1) Monte Carlo dropout: average the predictions of several feed-forward passes that use dropout
2) weight scaling: rescale the weights to compensate for dropout (e.g., for 50% dropout, halve the weights at test time, or equivalently scale the retained activations by a factor of 2 during training); in general this is not equivalent to the average prediction of the ensemble
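A minimal sketch of dropout applied to one layer's activations (illustrative; this is the "inverted" variant that rescales during training, so no weight scaling is needed at test time):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(z, p=0.5, train=True):
    if not train:
        return z                               # test time: use all units
    mask = rng.random(z.shape) >= p            # keep each unit with probability 1 - p
    return z * mask / (1 - p)                  # rescale so the expected activation is unchanged

z = rng.standard_normal(10)
print(dropout(z, p=0.5, train=True))
print(dropout(z, train=False))
```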
summary:
- deep feed-forward networks learn adaptive bases, with more complex bases at higher layers
- increasing depth is often preferable to increasing width
- there are various choices of activation function and architecture
- they have universal approximation power
- their expressive power often necessitates using regularization schemes