SLIDE 1

Lecture 18: Anatomy of NN

CS109A Introduction to Data Science

Pavlos Protopapas and Kevin Rader

SLIDE 2

Outline

Anatomy of a NN

Design choices:

  • Activation function
  • Loss function
  • Output units
  • Architecture
SLIDE 3

Outline

Anatomy of a NN

Design choices:

  • Activation function
  • Loss function
  • Output units
  • Architecture
SLIDE 4

Anatomy of an artificial neural network (ANN)

[Diagram: input X → neuron (node) → output Y.]
SLIDE 5

Anatomy of an artificial neural network (ANN)

[Diagram: input X → neuron (node) → output Y. Inside the neuron, an affine transformation produces h, followed by an activation $Z = g(h)$.]

We will talk later about the choice of activation function. So far we have only talked about the sigmoid as an activation function, but there are other choices.

SLIDE 6

Anatomy of an artificial neural network (ANN)

[Diagram: input layer ($X_1$, $X_2$) → hidden layer ($h_1$, $h_2$) → output layer ($\hat{Z}$), with weights $W_{11}, W_{12}, W_{21}, W_{22}$.]

$A_1 = W_1^\top X = W_{11} X_1 + W_{12} X_2 + b_1$,  $h_1 = g(A_1)$

$A_2 = W_2^\top X = W_{21} X_1 + W_{22} X_2 + b_2$,  $h_2 = g(A_2)$

$\hat{Z} = f(h_1, h_2)$ (output function),  $J = \mathcal{L}(\hat{Z}, Z)$ (loss function)

We will talk later about the choice of the output layer and the loss function. So far we have considered the sigmoid as the output and the log-Bernoulli (binary cross-entropy) loss.
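
To make this anatomy concrete, here is a minimal NumPy sketch (mine, not from the slides) of the forward pass for this two-input, two-hidden-node network with a sigmoid activation, a sigmoid output, and the log-Bernoulli (binary cross-entropy) loss; all weight values are made up for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Made-up example values: one observation with two features.
X = np.array([0.5, -1.2])          # input layer (X1, X2)
W = np.array([[0.3, -0.8],         # W[i, j]: weight from input j to hidden node i
              [1.1,  0.4]])
b = np.array([0.1, -0.2])          # biases b1, b2

# Hidden layer: affine transformation followed by the activation g (sigmoid here).
A = W @ X + b                      # A1, A2
h = sigmoid(A)                     # h1 = g(A1), h2 = g(A2)

# Output layer: another affine + sigmoid gives Z_hat, an estimated probability.
w_out, b_out = np.array([0.7, -0.5]), 0.05
Z_hat = sigmoid(w_out @ h + b_out)

# Loss: log-Bernoulli (binary cross-entropy) against the true label Z.
Z = 1.0
J = -(Z * np.log(Z_hat) + (1 - Z) * np.log(1 - Z_hat))
print(Z_hat, J)
```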
SLIDE 7

Anatomy of an artificial neural network (ANN)

[Diagram: input layer ($X_1$, $X_2$) → hidden layer 1 → hidden layer 2 → output layer ($\hat{Z}$), with weight matrices $W_1$ into hidden layer 1 and $W_2$ into hidden layer 2.]

SLIDE 8

Anatomy of an artificial neural network (ANN)

[Diagram: input layer ($X_1$, $X_2$) → hidden layer 1 → … → hidden layer $n$ → output layer ($\hat{Z}$), with weights $W_1$ into the first hidden layer and $W_n$ into the last.]

We will talk later about the choice of the number of layers.

SLIDE 9

Anatomy of an artificial neural network (ANN)

[Diagram: input layer ($X_1$, $X_2$) → hidden layer 1 with 3 nodes → … → hidden layer $n$ with 3 nodes → output layer ($\hat{Z}$).]

SLIDE 10

Anatomy of an artificial neural network (ANN)

[Diagram: input layer ($X_1$, $X_2$) → hidden layer 1 with $m$ nodes → … → hidden layer $n$ with $m$ nodes → output layer ($\hat{Z}$).]

We will talk later about the choice of the number of nodes.

SLIDE 11

Anatomy of an artificial neural network (ANN)

[Diagram: input layer with $d$ inputs → hidden layer 1 with $m$ nodes → … → hidden layer $n$ with $m$ nodes → output layer ($\hat{Z}$).]

The number of inputs $d$ is specified by the data.
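
As a rough illustration of these architecture choices (my own sketch, assuming TensorFlow/Keras is available), the input dimension d comes from the data, while the number of hidden layers and the nodes per layer are design choices; the specific values below are arbitrary.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Toy data: 500 observations with d = 4 features; d fixes the input size.
X = np.random.rand(500, 4)
y = (X.sum(axis=1) > 2).astype(int)

d = X.shape[1]   # number of inputs: specified by the data
m = 8            # nodes per hidden layer: a design choice
n_hidden = 3     # number of hidden layers: a design choice

model = keras.Sequential([keras.Input(shape=(d,))])
for _ in range(n_hidden):
    model.add(layers.Dense(m, activation="relu"))   # hidden layers
model.add(layers.Dense(1, activation="sigmoid"))    # output layer for a binary target

model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
# model.fit(X, y, epochs=10, batch_size=32)
```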

SLIDE 12

Why layers? Representation

Representation matters!

SLIDE 13

Learning Multiple Components

SLIDE 14

Depth = Repeated Compositions

SLIDE 15

Neural Networks

Hand-written digit recognition: MNIST data
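
For reference (not part of the slides), MNIST ships with Keras, so the data can be loaded and reshaped into input vectors in a few lines; this assumes TensorFlow/Keras is installed.

```python
from tensorflow.keras.datasets import mnist

# 60,000 training and 10,000 test images of handwritten digits (28x28 grayscale).
(x_train, y_train), (x_test, y_test) = mnist.load_data()
print(x_train.shape, y_train.shape)   # (60000, 28, 28) (60000,)

# Flatten each image into a 784-dimensional input vector and rescale to [0, 1].
x_train = x_train.reshape(-1, 28 * 28) / 255.0
x_test = x_test.reshape(-1, 28 * 28) / 255.0
```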

SLIDE 16

Depth = Repeated Compositions

SLIDE 17

Outline

Anatomy of a NN

Design choices:

  • Activation function
  • Loss function
  • Output units
  • Architecture
SLIDE 18

Outline

Anatomy of a NN

Design choices:

  • Activation function
  • Loss function
  • Output units
  • Architecture
SLIDE 19

Activation function

$h = g(W^\top X + c)$

The activation function should:

  • Ensure non-linearity
  • Ensure gradients remain large through the hidden units

Common choices are:

  • Sigmoid
  • ReLU, leaky ReLU, generalized ReLU, Maxout
  • Softplus
  • Tanh
  • Swish
SLIDE 20

Activation function

$h = g(W^\top X + c)$

The activation function should:

  • Ensure non-linearity
  • Ensure gradients remain large through the hidden units

Common choices are:

  • Sigmoid
  • ReLU, leaky ReLU, generalized ReLU, Maxout
  • Softplus
  • Tanh
  • Swish
SLIDE 21

Beyond Linear Models

Linear models:

  • Can be fit efficiently (via convex optimization)
  • Limited model capacity

Alternative: $f(x) = w^\top \phi(x)$, where $\phi$ is a non-linear transform.

SLIDE 22

Traditional ML

Manually engineer $\phi$:

  • Domain-specific, enormous human effort

Generic transform (sketched below):

  • Maps to a higher-dimensional space
  • Kernel methods, e.g. RBF kernels
  • Overfitting: does not generalize well to the test set
  • Cannot encode enough prior information
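
A rough sketch (mine, not from the slides) of such a fixed, generic transform: map 1-D inputs through hand-chosen RBF basis functions and fit a linear model on the transformed features. The centers and bandwidth are arbitrary illustrative choices.

```python
import numpy as np

# Toy 1-D regression data.
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 100)
y = np.sin(x) + 0.1 * rng.standard_normal(100)

# Fixed, manually chosen RBF feature map phi(x): one Gaussian bump per center.
centers = np.linspace(-3, 3, 10)
gamma = 2.0
Phi = np.exp(-gamma * (x[:, None] - centers[None, :]) ** 2)

# Linear model on the transformed features, f(x) = w^T phi(x), fit by least squares.
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
y_hat = Phi @ w
print("training MSE:", np.mean((y - y_hat) ** 2))
```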
SLIDE 23

Deep Learning

  • Directly learn $\phi$:

$f(x; \theta) = w^\top \phi(x; \theta)$

  • where $\theta$ are the parameters of the transform
  • $\phi$ defines the hidden layers (see the sketch below)
  • Non-convex optimization
  • Can encode prior beliefs, generalizes well
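
A minimal sketch (mine, assuming TensorFlow/Keras) of the idea that the hidden layers act as a learned transform φ: train a small network, then read off the last hidden layer's activations as the learned features. The layer sizes and the layer name "phi" are illustrative choices.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Toy binary-classification data.
X = np.random.rand(500, 4)
y = (X[:, 0] * X[:, 1] > 0.25).astype(int)

# The hidden layers implement a learned transform phi(x; theta);
# the final sigmoid unit is the linear model w^T phi(x; theta).
model = keras.Sequential([
    keras.Input(shape=(4,)),
    layers.Dense(16, activation="relu"),
    layers.Dense(8, activation="relu", name="phi"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(X, y, epochs=5, verbose=0)

# Read off phi(x; theta): the activations of the last hidden layer.
phi = keras.Model(model.input, model.get_layer("phi").output)
features = phi.predict(X, verbose=0)
print(features.shape)   # (500, 8): the learned feature representation
```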
SLIDE 24

Activation function

$h = g(W^\top X + c)$

The activation function should:

  • Ensure non-linearity
  • Ensure gradients remain large through the hidden units

Common choices are:

  • Sigmoid
  • ReLU, leaky ReLU, generalized ReLU, Maxout
  • Softplus
  • Tanh
  • Swish
SLIDE 25

SLIDE 26

ReLU and Softplus

SLIDE 27

Generalized ReLU

Generalization: for $\beta_i > 0$,  $h(x_i, \beta_i) = \max(0, x_i) + \beta_i \min(0, x_i)$

SLIDE 28

Maxout

Max of $k$ linear functions; directly learn the activation function.

$h(x) = \max_{i \in \{1,\dots,k\}} (\beta_i x + \gamma_i)$

SLIDE 29

Swish: A Self-Gated Activation Function

$h(x) = x \cdot \sigma(x)$

Currently, the most successful and widely used activation function is the ReLU. Swish tends to work better than ReLU on deeper models across a number of challenging datasets.
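
For reference, a small NumPy sketch (mine, not from the slides) of the activation functions listed earlier; the leaky/generalized-ReLU slope and the Maxout coefficients are illustrative values.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def generalized_relu(x, beta=0.1):
    # beta = 0 gives ReLU; a small fixed beta gives leaky ReLU;
    # a learned per-unit beta gives the generalized (parametric) ReLU.
    return np.maximum(0.0, x) + beta * np.minimum(0.0, x)

def softplus(x):
    return np.log1p(np.exp(x))

def swish(x):
    return x * sigmoid(x)

def maxout(x, betas=(0.5, 1.0, 2.0), gammas=(0.0, 0.1, -0.1)):
    # Max over k linear functions of the input (illustrative coefficients).
    pieces = np.stack([b * x + g for b, g in zip(betas, gammas)])
    return pieces.max(axis=0)

x = np.linspace(-3, 3, 7)
for f in (sigmoid, relu, generalized_relu, softplus, np.tanh, swish, maxout):
    print(f.__name__, np.round(f(x), 3))
```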

SLIDE 30

Outline

Anatomy of a NN

Design choices:

  • Activation function
  • Loss function
  • Output units
  • Architecture
SLIDE 31

Loss Function

Cross-entropy between training data and model distribution (i.e. negative log-likelihood)

$J(W) = -\mathbb{E}_{x, y \sim \hat{p}_{\text{data}}} \log p_{\text{model}}(y \mid x)$

Taking the negative log-likelihood means we do not need to design a separate loss function for each model. The gradient of the cost function must be large enough to guide learning.

SLIDE 32

Loss Function

Example: sigmoid output + squared loss (flat loss surfaces):

$L_{\text{MSE}} = (z - \hat{z})^2 = (z - \sigma(h))^2$

SLIDE 33

Cost Function

Example: sigmoid output + cross-entropy loss:

$L_{\text{CE}}(z, \hat{z}) = -\left\{ z \log \hat{z} + (1 - z)\log(1 - \hat{z}) \right\}$
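
A short NumPy sketch (mine, not from the slides) of why cross-entropy pairs better with a sigmoid output than squared loss: when the prediction is confidently wrong, the squared-loss gradient with respect to the pre-activation h nearly vanishes (the "flat surfaces" above), while the cross-entropy gradient stays large.

```python
import numpy as np

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

z = 1.0                                        # true label
h = np.array([-6.0, -2.0, 0.0, 2.0, 6.0])      # pre-activations: very wrong ... very right
z_hat = sigmoid(h)

# Gradients of each loss with respect to the pre-activation h.
grad_mse = 2 * (z_hat - z) * z_hat * (1 - z_hat)   # d/dh (z - sigma(h))^2: vanishes when very wrong
grad_ce = z_hat - z                                # d/dh cross-entropy: stays large when very wrong

for hi, gm, gc in zip(h, grad_mse, grad_ce):
    print(f"h={hi:+.1f}  dMSE/dh={gm:+.4f}  dCE/dh={gc:+.4f}")
```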

SLIDE 34

Design Choices

  • Activation function
  • Loss function
  • Output units
  • Architecture
  • Optimizer

SLIDE 35

Output Units

Output Type | Output Distribution | Output Layer | Cost Function
Binary      |                     |              |

SLIDE 36

Link function

$X \implies \phi(X) = W^\top X \implies \hat{Z} = P(y = 0) = \dfrac{1}{1 + e^{\phi(X)}}$

[Diagram: input X → $\phi(X)$ → output unit $\sigma(\phi)$ → $\hat{Z} = P(y = 0)$.]

SLIDE 37

Output Units

Output Type | Output Distribution | Output Layer | Cost Function
Binary      | Bernoulli           | Sigmoid      | Binary Cross-Entropy

SLIDE 38

Output Units

Output Type | Output Distribution | Output Layer | Cost Function
Binary      | Bernoulli           | Sigmoid      | Binary Cross-Entropy
Discrete    |                     |              |

SLIDE 39

Link function: multi-class problem

[Diagram: input X → $\phi(X)$ → output unit (SoftMax) → $\hat{Z}$.]

$\hat{Z}_k = \dfrac{e^{\phi_k(X)}}{\sum_{j=1}^{K} e^{\phi_j(X)}}$
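
A minimal NumPy sketch (mine, not from the slides) of the softmax output unit; subtracting the maximum score first is a standard numerical-stability trick and does not change the probabilities.

```python
import numpy as np

def softmax(phi):
    # Subtract the max for numerical stability; the probabilities are unchanged.
    e = np.exp(phi - np.max(phi))
    return e / e.sum()

phi = np.array([2.0, 1.0, -1.0])   # illustrative scores phi_k(X) for K = 3 classes
z_hat = softmax(phi)
print(z_hat, z_hat.sum())          # probabilities over the K classes, summing to 1
```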

SLIDE 40

Output Units

Output Type | Output Distribution | Output Layer | Cost Function
Binary      | Bernoulli           | Sigmoid      | Binary Cross-Entropy
Discrete    | Multinoulli         | Softmax      | Cross-Entropy

SLIDE 41

Output Units

Output Type | Output Distribution | Output Layer | Cost Function
Binary      | Bernoulli           | Sigmoid      | Binary Cross-Entropy
Discrete    | Multinoulli         | Softmax      | Cross-Entropy
Continuous  | Gaussian            | Linear       | MSE
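
A minimal Keras sketch (mine, not from the slides; assumes TensorFlow/Keras) of how the table's rows translate into an output layer paired with a matching loss; the hidden-layer width and the helper name make_model are illustrative.

```python
from tensorflow import keras
from tensorflow.keras import layers

def make_model(d, output_type, n_classes=None):
    """Pair the output layer and loss according to the table above."""
    model = keras.Sequential([keras.Input(shape=(d,)),
                              layers.Dense(16, activation="relu")])
    if output_type == "binary":          # Bernoulli output
        model.add(layers.Dense(1, activation="sigmoid"))
        loss = "binary_crossentropy"
    elif output_type == "discrete":      # Multinoulli output over n_classes
        model.add(layers.Dense(n_classes, activation="softmax"))
        loss = "categorical_crossentropy"  # expects one-hot labels
    else:                                # continuous (Gaussian) output
        model.add(layers.Dense(1, activation="linear"))
        loss = "mse"
    model.compile(optimizer="adam", loss=loss)
    return model

clf = make_model(d=10, output_type="binary")
multi = make_model(d=10, output_type="discrete", n_classes=5)
reg = make_model(d=10, output_type="continuous")
```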

SLIDE 42

Output Units

Output Type | Output Distribution | Output Layer | Cost Function
Binary      | Bernoulli           | Sigmoid      | Binary Cross-Entropy
Discrete    | Multinoulli         | Softmax      | Cross-Entropy
Continuous  | Gaussian            | Linear       | MSE
Continuous  | Arbitrary           | –            | GANs
SLIDE 43

Design Choices

  • Activation function
  • Loss function
  • Output units
  • Architecture
  • Optimizer

SLIDE 44

NN in action

SLIDE 45

NN in action

SLIDE 46

NN in action

SLIDE 47

NN in action

…

SLIDE 48

NN in action

SLIDE 49

NN in action

SLIDE 50

NN in action

SLIDE 51

Universal Approximation Theorem

Think of a neural network as function approximation: the data follow $Z = f(x) + \epsilon$, and the network provides an estimate $\hat{f}(x)$, so $Z = \hat{f}(x) + \epsilon$.

One hidden layer is enough to represent an approximation of any function to an arbitrary degree of accuracy. So why go deeper?

  • A shallow net may need (exponentially) more width (see the parameter-count sketch below)
  • A shallow net may overfit more

[Diagram: trading off network width against depth.]
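
To make the width-versus-depth trade-off concrete, a small sketch (mine, not from the slides; assumes TensorFlow/Keras) comparing parameter counts for a wide, shallow network and a deeper, narrower one on the same input size; the widths are arbitrary.

```python
from tensorflow import keras
from tensorflow.keras import layers

def mlp(widths, d=10):
    """Fully connected net with one hidden layer per entry in `widths`."""
    model = keras.Sequential([keras.Input(shape=(d,))])
    for w in widths:
        model.add(layers.Dense(w, activation="relu"))
    model.add(layers.Dense(1, activation="sigmoid"))
    return model

shallow_wide = mlp([512])          # one very wide hidden layer
deep_narrow = mlp([32, 32, 32])    # three narrow hidden layers

print("shallow/wide parameters:", shallow_wide.count_params())
print("deep/narrow parameters:", deep_narrow.count_params())
```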

SLIDE 52

Better Generalization with Depth

(Goodfellow 2017)

SLIDE 53

Large, Shallow Nets Overfit More

(Goodfellow 2017)