Lecture 19: Anatomy of NN
CS109A Introduction to Data Science
SLIDE 1

CS109A Introduction to Data Science

Pavlos Protopapas, Kevin Rader and Chris Tanner

Lecture 19: Anatomy of NN

SLIDE 2

Outline

Anatomy of a NN

Design choices:

  • Activation function
  • Loss function
  • Output units
  • Architecture
SLIDE 4

Anatomy of artificial neural network (ANN)

(Diagram: input X, weight W, a single neuron node, and output Y.)

SLIDE 5

Anatomy of artificial neural network (ANN)

(Diagram: input X feeds a neuron node through weight W; the node applies an affine transformation followed by an activation, $Z = g(h)$, to produce the output Y.)

We will talk later about the choice of activation function. So far we have only talked about the sigmoid as an activation function, but there are other choices.

SLIDE 6

Anatomy of artificial neural network (ANN)

(Diagram: inputs $X_1, X_2$ feed a hidden layer of two nodes with weights $W_1, W_2$; the output node produces $\hat{Y}$. Input layer, hidden layer, output layer.)

$a_1 = W_1^T X = W_{11} X_1 + W_{12} X_2 + W_{10}$,  $h_1 = g(a_1)$
$a_2 = W_2^T X = W_{21} X_1 + W_{22} X_2 + W_{20}$,  $h_2 = g(a_2)$
$\hat{Y} = f(h_1, h_2)$  (output function)
$\mathcal{L}(\hat{Y}, Y)$  (loss function)

We will talk later about the choice of the output layer and the loss function. So far we have considered the sigmoid as the output and the log-Bernoulli (binary cross-entropy) loss.
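To make the forward pass concrete, here is a minimal NumPy sketch of this 2-input, 2-hidden-node network. All weight, bias, and input values are made up for illustration.

```python
# Forward pass of the Slide 6 network: two inputs, two hidden nodes with
# sigmoid activations, and a sigmoid output node. Values are illustrative.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

X = np.array([0.5, -1.2])           # inputs X1, X2
W = np.array([[0.1, 0.4],           # W11, W12
              [-0.3, 0.2]])         # W21, W22
b = np.array([0.05, -0.05])         # biases W10, W20

a = W @ X + b                       # affine transformation: a_i = W_i^T X + b_i
h = sigmoid(a)                      # activations h_i = g(a_i)

w_out, b_out = np.array([0.7, -0.6]), 0.1
y_hat = sigmoid(w_out @ h + b_out)  # output function
print(y_hat)
```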
SLIDE 7

Anatomy of artificial neural network (ANN)

(Diagram: inputs $X_1, X_2$ with weights $W_{11}, W_{12}, W_{21}, W_{22}$; input layer, hidden layer 1, hidden layer 2, output layer producing $\hat{Y}$.)

SLIDE 8

Anatomy of artificial neural network (ANN)

(Diagram: input layer, hidden layers 1 through n, output layer.)

We will talk later about the choice of the number of layers.

SLIDE 9

Anatomy of artificial neural network (ANN)

(Diagram: input layer; hidden layers 1 through n, each with 3 nodes; output layer.)

SLIDE 10

Anatomy of artificial neural network (ANN)

(Diagram: input layer; hidden layers 1 through n, each with m nodes; output layer.)

We will talk later about the choice of the number of nodes.

SLIDE 11

Anatomy of artificial neural network (ANN)

(Diagram: input layer with d inputs; hidden layers 1 through n, each with m nodes; output layer.)

The number of inputs, d, is specified by the data.

SLIDE 12

Anatomy of artificial neural network (ANN)

(Diagram: a fully-connected network with an input layer, hidden layers 1 and 2, and an output layer.)

SLIDE 14

Why layers? Representation

Representation matters!

SLIDE 15

Learning Multiple Components

SLIDE 16

Depth = Repeated Compositions

SLIDE 17

Neural Networks

Hand-written digit recognition: MNIST data

SLIDE 18

Depth = Repeated Compositions

SLIDE 19

Beyond Linear Models

Linear models:

  • Can be fit efficiently (via convex optimization)
  • Limited model capacity

Alternative: $f(x) = w^T \phi(x)$, where $\phi$ is a non-linear transform.
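As a concrete sketch of $f(x) = w^T \phi(x)$ with a *fixed* non-linear transform: below, $\phi$ maps $x$ to Gaussian RBF features. The centers, bandwidth, and toy data are illustrative assumptions; note that fitting $w$ is still an efficient convex (least-squares) problem.

```python
# f(x) = w^T phi(x) with a fixed, hand-chosen non-linear transform.
import numpy as np

centers = np.linspace(-3, 3, 7)     # RBF centers (assumed)
gamma = 1.0                         # bandwidth (assumed)

def phi(x):
    # One Gaussian RBF feature per center.
    return np.exp(-gamma * (x[:, None] - centers[None, :]) ** 2)

x = np.linspace(-3, 3, 50)
y = np.sin(x)                       # toy targets
w, *_ = np.linalg.lstsq(phi(x), y, rcond=None)  # convex fit of w
f = phi(x) @ w                      # f(x) = w^T phi(x)
```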

SLIDE 20

Traditional ML

Manually engineer $\phi$:

  • Domain specific, enormous human effort

Generic transform:

  • Maps to a higher-dimensional space
  • Kernel methods: e.g. RBF kernels
  • Overfitting: does not generalize well to the test set
  • Cannot encode enough prior information
SLIDE 21

Deep Learning

  • Directly learn $\phi$:

$f(x; \theta) = w^T \phi(x; \theta)$

  • $\phi(x; \theta)$ is an automatically learned representation of x
  • For deep networks, $\phi$ is the function learned by the hidden layers of the network
  • $\theta$ are the learned weights
  • Non-convex optimization
  • Can encode prior beliefs, generalizes well

(A Keras sketch of this idea follows below.)
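A minimal Keras sketch of "directly learn $\phi$": the hidden layers play the role of $\phi(x; \theta)$ and the final layer plays $w^T \phi$. The layer sizes and the two-feature input are illustrative assumptions, not from the slides.

```python
# Hidden layers learn phi(x; theta); the last layer is w^T phi.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(2,)),  # phi, layer 1
    tf.keras.layers.Dense(32, activation="relu"),                    # phi, layer 2
    tf.keras.layers.Dense(1, activation="sigmoid"),                  # w^T phi -> output
])
model.compile(optimizer="adam", loss="binary_crossentropy")  # non-convex optimization
```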
SLIDE 22

Outline

Anatomy of a NN

Design choices:

  • Activation function
  • Loss function
  • Output units
  • Architecture
SLIDE 24

Activation function

β„Ž = 𝑔(𝑋/π‘Œ + 𝑐) The activation function should:

  • Provide non-linearity
  • Ensure gradients remain large through hidden unit

Common choices are

  • Sigmoid
  • Relu, leaky ReLU, Generalized ReLU, MaxOut
  • softplus
  • tanh
  • swish
SLIDE 27

Sigmoid (aka Logistic)

Derivative is zero for much of the domain. This leads to β€œvanishing gradients” in backpropagation.

$z = \dfrac{1}{1 + e^{-x}}$

SLIDE 28

Hyperbolic Tangent (Tanh)

Same problem of β€œvanishing gradients” as sigmoid.

$z = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$

SLIDE 29

Rectified Linear Unit (ReLU)

Two major advantages:

  1. No vanishing gradient when x > 0
  2. Provides sparsity (regularization), since z = 0 when x < 0

$z = \max(0, x)$

SLIDE 30

Leaky ReLU

  • Tries to fix the "dying ReLU" problem: the derivative is non-zero everywhere.
  • Some people report success with this form of activation function, but the results are not always consistent.

$z = \max(0, x) + \beta \min(0, x)$, where $\beta$ takes a small value

SLIDE 31

Generalized ReLU

Generalization: for $\beta_i > 0$:  $h(x_i, \beta_i) = \max(0, x_i) + \beta_i \min(0, x_i)$

SLIDE 32

softplus

The logistic sigmoid function is a smooth approximation of the derivative of the rectifier

$z = \log(1 + e^{x})$

SLIDE 33

Maxout

Max of k linear functions: directly learn the activation function.

$h(x) = \max_{i \in \{1, \dots, k\}} \left( \beta_i x + \gamma_i \right)$

SLIDE 34

Swish: A Self-Gated Activation Function

$h(x) = x \, \sigma(x)$

Currently, the most successful and widely-used activation function is the ReLU. Swish tends to work better than ReLU on deeper models across a number of challenging datasets.
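A minimal NumPy sketch of the activations from Slides 27-34, written directly from the formulas above; the leaky-ReLU slope $\beta = 0.01$ is an assumed value.

```python
# NumPy versions of the common activation functions.
import numpy as np

def sigmoid(x):  return 1.0 / (1.0 + np.exp(-x))      # 1 / (1 + e^-x)
def tanh(x):     return np.tanh(x)                    # (e^x - e^-x) / (e^x + e^-x)
def relu(x):     return np.maximum(0.0, x)            # max(0, x)
def leaky_relu(x, beta=0.01):                         # beta is an assumed small slope
    return np.maximum(0.0, x) + beta * np.minimum(0.0, x)
def softplus(x): return np.log1p(np.exp(x))           # log(1 + e^x)
def swish(x):    return x * sigmoid(x)                # x * sigma(x)
```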

SLIDE 35

Outline

Anatomy of a NN

Design choices:

  • Activation function
  • Loss function
  • Output units
  • Architecture
SLIDE 36

Loss Function

Likelihood for a given point: $p(y_i \mid x_i; W)$

Assuming independence, the likelihood for all measurements is:

$L(W; X, Y) = p(Y \mid X; W) = \prod_i p(y_i \mid x_i; W)$

Maximize the likelihood, or equivalently maximize the log-likelihood:

$\log L(W; X, Y) = \sum_i \log p(y_i \mid x_i; W)$

Turn this into a loss function:

$\mathcal{L}(W; X, Y) = -\log L(W; X, Y)$

SLIDE 37

Loss Function

We do not need to design separate loss functions if we follow this simple procedure. Examples:

  • If the distribution is Normal, the likelihood is

$p(y_i \mid x_i; W) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(y_i - \hat{y}_i)^2}{2\sigma^2}}$

$\mathcal{L}(W; X, Y) = \sum_i (y_i - \hat{y}_i)^2$  (up to constants), i.e. MSE.

  • If the distribution is Bernoulli, the likelihood is

$p(y_i \mid x_i; W) = p_i^{y_i} (1 - p_i)^{1 - y_i}$

$\mathcal{L}(W; X, Y) = -\sum_i \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]$, i.e. Cross-Entropy.
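This recipe is easy to check numerically. A small NumPy sketch with made-up targets and predictions: the Bernoulli negative log-likelihood is binary cross-entropy, and the Gaussian negative log-likelihood reduces to squared error up to constants.

```python
# Negative log-likelihoods turned into losses; all values are illustrative.
import numpy as np

y     = np.array([1.0, 0.0, 1.0])    # observed binary targets
p_hat = np.array([0.9, 0.2, 0.7])    # predicted Bernoulli probabilities
bern_nll = -np.sum(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))  # = BCE

z     = np.array([2.0, -1.0, 0.5])   # observed continuous targets
z_hat = np.array([1.8, -0.7, 0.9])   # predictions
gauss_nll = np.sum((z - z_hat) ** 2) # Gaussian NLL up to constants = squared error
print(bern_nll, gauss_nll)
```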

SLIDE 38

Design Choices

  • Activation function
  • Loss function
  • Output units
  • Architecture
  • Optimizer


SLIDE 42

Output Units

Output Type | Output Distribution | Output layer | Loss Function
Binary      | Bernoulli           | ?            | Binary Cross Entropy

SLIDE 43

Output unit for binary classification

$X \Longrightarrow \phi(X) \Longrightarrow P(y = 1) = \dfrac{1}{1 + e^{-w^T \phi(X)}}$

OUTPUT UNIT: the sigmoid $\sigma(w^T \phi(X))$ gives $\hat{y} = P(y = 1)$.
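A minimal NumPy sketch of this output unit; `phi_x` and `w` are illustrative placeholders for the features produced by the rest of the network and the output-unit weights.

```python
# Binary output unit: a linear score squashed by a sigmoid gives P(y = 1).
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

phi_x = np.array([0.3, -1.1, 2.0])   # phi(X): output of the hidden layers (assumed)
w     = np.array([0.5, 0.4, -0.2])   # output-unit weights (assumed)
p_y1  = sigmoid(w @ phi_x)           # P(y = 1) = 1 / (1 + e^{-w^T phi(X)})
print(p_y1)
```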

SLIDE 44

Output Units

Output Type | Output Distribution | Output layer | Cost Function
Binary      | Bernoulli           | Sigmoid      | Binary Cross Entropy


SLIDE 48

Output Units

Output Type | Output Distribution | Output layer | Cost Function
Binary      | Bernoulli           | Sigmoid      | Binary Cross Entropy
Discrete    | Multinoulli         | ?            | Cross Entropy

SLIDE 49

Output unit for multi-class classification

$X \Longrightarrow \hat{y} = [P_1, P_2, P_3]$

OUTPUT UNIT

SLIDE 50

SoftMax

$\hat{y}_k = \dfrac{e^{w_k^T \phi(X)}}{\sum_{k'} e^{w_{k'}^T \phi(X)}}$

(Diagram: the rest of the network produces class scores $\phi_k(X)$ for A, B, C; the SoftMax OUTPUT UNIT turns the A, B, C scores into the probabilities of A, B, and C.)
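A minimal NumPy sketch of SoftMax. Subtracting the maximum score before exponentiating is a standard numerical-stability trick that is not on the slide; the score values are made up.

```python
# SoftMax: class scores become probabilities that sum to 1.
import numpy as np

def softmax(scores):
    shifted = scores - np.max(scores)   # avoids overflow in exp (stability trick)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

scores = np.array([2.0, 1.0, 0.1])      # A, B, C scores (illustrative)
probs = softmax(scores)                 # probabilities of A, B, C
print(probs, probs.sum())
```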


SLIDE 53

Output Units

Output Type | Output Distribution | Output layer | Cost Function
Binary      | Bernoulli           | Sigmoid      | Binary Cross Entropy
Discrete    | Multinoulli         | Softmax      | Cross Entropy


SLIDE 57

Output Units

Output Type | Output Distribution | Output layer | Cost Function
Binary      | Bernoulli           | Sigmoid      | Binary Cross Entropy
Discrete    | Multinoulli         | Softmax      | Cross Entropy
Continuous  | Gaussian            | ?            | MSE

SLIDE 58

Output unit for regression

$X \Longrightarrow \phi(X) \Longrightarrow \hat{y} = W^T \phi(X)$

OUTPUT UNIT: a linear layer $W^T \phi(X)$ produces $\hat{y}$.

SLIDE 59

Output Units

Output Type | Output Distribution | Output layer | Cost Function
Binary      | Bernoulli           | Sigmoid      | Binary Cross Entropy
Discrete    | Multinoulli         | Softmax      | Cross Entropy
Continuous  | Gaussian            | Linear       | MSE
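A sketch of how the table's rows map onto Keras output layers and losses; the hidden size and the 10-feature input are assumptions for illustration.

```python
# One output layer + loss pairing per output type from the table.
import tensorflow as tf

def body():
    # Fresh hidden layers each call (shared architecture, assumed sizes).
    return [tf.keras.layers.Dense(16, activation="relu", input_shape=(10,))]

binary = tf.keras.Sequential(body() + [tf.keras.layers.Dense(1, activation="sigmoid")])
binary.compile(loss="binary_crossentropy", optimizer="adam")

discrete = tf.keras.Sequential(body() + [tf.keras.layers.Dense(3, activation="softmax")])
discrete.compile(loss="categorical_crossentropy", optimizer="adam")

continuous = tf.keras.Sequential(body() + [tf.keras.layers.Dense(1, activation="linear")])
continuous.compile(loss="mse", optimizer="adam")
```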

SLIDE 61

Output Units

Output Type | Output Distribution | Output layer | Cost Function
Binary      | Bernoulli           | Sigmoid      | Binary Cross Entropy
Discrete    | Multinoulli         | Softmax      | Cross Entropy
Continuous  | Gaussian            | Linear       | MSE
Continuous  | Arbitrary           | GANs: see Lectures 18-19 in CS109B

SLIDE 62

Loss Function

Example: sigmoid output + squared loss gives flat surfaces (regions of near-zero gradient):

$L_{MSE} = (y - \hat{y})^2 = \left(y - \sigma(x)\right)^2$
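A quick numerical illustration (made-up values) of why these surfaces are flat: when the sigmoid saturates, the squared-loss gradient carries a factor $\hat{y}(1-\hat{y}) \approx 0$, while the cross-entropy gradient (next slide) stays of order 1.

```python
# Gradient of squared loss vs. cross-entropy at a saturated sigmoid unit.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

y, x = 1.0, -5.0                 # true label 1, but a badly saturated score
y_hat = sigmoid(x)

# d/dx (y - sigmoid(x))^2 = -2 (y - y_hat) * y_hat * (1 - y_hat) -> nearly 0
grad_mse = -2 * (y - y_hat) * y_hat * (1 - y_hat)

# d/dx of cross-entropy with a sigmoid output = y_hat - y -> about -1 here
grad_ce = y_hat - y
print(grad_mse, grad_ce)
```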

SLIDE 63

Cost Function

Example: sigmoid output + cross-entropy loss:

$L_{CE}(y, \hat{y}) = -\left[ y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \right]$

SLIDE 64

Design Choices

  • Activation function
  • Loss function
  • Output units
  • Architecture
  • Optimizer

SLIDE 65

NN in action

(Slides 66-71 step through this "NN in action" figure sequence; images not preserved.)

SLIDE 72

Universal Approximation Theorem

Think of a neural network as function approximation: $Y = f(x) + \epsilon$, and the network learns an approximation, NN $\Longrightarrow \hat{f}(x)$.

One hidden layer is enough to represent an approximation of any function to an arbitrary degree of accuracy. So why go deeper?

  • A shallow net may need (exponentially) more width
  • A shallow net may overfit more

(Diagram: a wide, shallow network vs. a narrow, deep network, labeled by width and depth. A parameter-count sketch follows below.)
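A small Keras sketch of the width-vs-depth trade-off contrasted in the diagram; all layer sizes are arbitrary choices, and the point is only to compare parameter counts of a wide-shallow and a narrow-deep model.

```python
# Parameter counts: one very wide hidden layer vs. several narrow ones.
import tensorflow as tf

wide = tf.keras.Sequential([
    tf.keras.layers.Dense(1024, activation="relu", input_shape=(2,)),
    tf.keras.layers.Dense(1),
])
deep = tf.keras.Sequential(
    [tf.keras.layers.Dense(32, activation="relu", input_shape=(2,))] +
    [tf.keras.layers.Dense(32, activation="relu") for _ in range(3)] +
    [tf.keras.layers.Dense(1)]
)
print(wide.count_params(), deep.count_params())  # wide needs far more parameters
```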

SLIDE 73

Better Generalization with Depth

(Goodfellow 2017)

SLIDE 74

Shallow Nets Overfit More

(Goodfellow 2017)

The 3-layer nets perform worse on the test set, even with a similar number of total parameters. The 11-layer net generalizes better on the test set when controlling for the number of parameters. Depth helps, and it is not just because of more parameters.

Don't worry about the word "convolutional"; it's just a special type of neural network, often used for images.


SLIDE 76

Lab time with Pavlos

1. Install Keras or TensorFlow 2.
2. Build the same thing we did for the exercise from Lecture 18, but now with Keras.
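As a starting point, here is a minimal Keras sketch only; the exact Lecture 18 exercise is not reproduced in this transcript, so the toy 1-D regression data below is an assumption.

```python
# Minimal Keras model: a one-hidden-layer regression network on toy data.
import numpy as np
import tensorflow as tf

x = np.linspace(-1, 1, 200).reshape(-1, 1)
y = np.sin(3 * x)                            # assumed toy target

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(1,)),  # hidden layer
    tf.keras.layers.Dense(1, activation="linear"),                   # linear output, MSE
])
model.compile(optimizer="adam", loss="mse")
model.fit(x, y, epochs=50, verbose=0)
```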