Lecture 18: Anatomy of NN
CS109A Introduction to Data Science
Pavlos Protopapas and Kevin Rader
CS109A, PROTOPAPAS, RADER
Outline
Anatomy of a NN
Design choices:
- Activation function
- Loss function
- Output units
- Architecture
CS109A, PROTOPAPAS, RADER
Anatomy of an artificial neural network (ANN)
[Diagram: input X, a single neuron (node), output Y]
CS109A, PROTOPAPAS, RADER
Anatomy of an artificial neural network (ANN)
[Diagram: input X, a single neuron (node), output Y]
Affine transformation: $h = W^T X + b$
Activation: $Y = f(h)$
We will talk later about the choice of activation function; so far we have only used the sigmoid, but there are other choices (a sketch follows this slide).
CS109A, PROTOPAPAS, RADER
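A minimal sketch (not from the slides) of this single-neuron computation, assuming NumPy and a sigmoid activation; the input, weights, and bias are made-up values:

```python
import numpy as np

def sigmoid(z):
    # Sigmoid activation: squashes the affine output into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([0.5, -1.2])   # made-up input vector
W = np.array([0.8, 0.3])    # made-up weights of the single neuron
b = 0.1                     # made-up bias

h = W @ X + b               # affine transformation: h = W^T X + b
Y = sigmoid(h)              # activation: Y = f(h)
print(Y)
```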
Anatomy of an artificial neural network (ANN)
[Diagram: inputs $X_1, X_2$; input layer, hidden layer with nodes $h_1, h_2$, output layer]
$z_1 = W_1^T X = W_{11} X_1 + W_{12} X_2 + W_{10}, \quad h_1 = f(z_1)$
$z_2 = W_2^T X = W_{21} X_1 + W_{22} X_2 + W_{20}, \quad h_2 = f(z_2)$
Output function: $\hat{p} = g(h_1, h_2)$
Loss function: $\mathcal{L} = \ell(\hat{p}, y)$
We will talk later about the choice of the output layer and the loss function; so far we have used a sigmoid output and the Bernoulli log-likelihood (a sketch follows this slide).
CS109A, PROTOPAPAS, RADER
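A small NumPy sketch (an illustration, not the course code) of the forward pass above: two inputs, two hidden nodes, a sigmoid output, and the Bernoulli log-likelihood loss; all numbers are made up:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([0.5, -1.2])                  # inputs X1, X2 (made up)
W = np.array([[0.8, 0.3],                  # row 1: W11, W12
              [-0.5, 0.7]])                # row 2: W21, W22
b = np.array([0.1, -0.2])                  # biases W10, W20
w_out = np.array([1.0, -1.0])              # output-layer weights (made up)
b_out = 0.0

z = W @ X + b                              # z1, z2: affine transformations
h = sigmoid(z)                             # h1 = f(z1), h2 = f(z2)
p_hat = sigmoid(w_out @ h + b_out)         # output function g(h1, h2)

y = 1.0                                    # a made-up label
loss = -(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))   # Bernoulli log-likelihood loss
print(p_hat, loss)
```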
Anatomy of an artificial neural network (ANN)
[Diagram: inputs $X_1, X_2$; weights $W_{11}, W_{12}$ into hidden layer 1 and $W_{21}, W_{22}$ into hidden layer 2; input layer, hidden layer 1, hidden layer 2, output layer]
CS109A, PROTOPAPAS, RADER
Anatomy of an artificial neural network (ANN)
[Diagram: inputs $X_1, X_2$; weights $W_{11}, W_{12}, \dots, W_{n1}, W_{n2}$; input layer, hidden layers 1 through n, output layer]
We will talk later about the choice of the number of layers.
CS109A, PROTOPAPAS, RADER
Anatomy of an artificial neural network (ANN)
[Diagram: inputs $X_1, X_2$; input layer; hidden layer 1 with 3 nodes; $\dots$; hidden layer n with 3 nodes; output layer; weights $W_{11}, \dots, W_{13}, \dots, W_{n1}, \dots, W_{n3}$]
CS109A, PROTOPAPAS, RADER
Anatomy of an artificial neural network (ANN)
[Diagram: inputs $X_1, X_2$; hidden layer 1 with m nodes; $\dots$; hidden layer n with m nodes; output layer; weights $W_{11}, \dots, W_{1m}, \dots, W_{n1}, \dots, W_{nm}$]
We will talk later about the choice of the number of nodes.
CS109A, PROTOPAPAS, RADER
Anatomy of an artificial neural network (ANN)
[Diagram: inputs $X_1, \dots, X_d$; hidden layers 1 through n with m nodes each; output layer]
The number of inputs, d, is specified by the data (a Keras-style sketch of these choices follows this slide).
CS109A, PROTOPAPAS, RADER
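A hypothetical Keras sketch of these architectural choices, assuming TensorFlow is available; d, m, and n are made-up values for the number of inputs, nodes per layer, and hidden layers:

```python
import tensorflow as tf
from tensorflow.keras import layers

d, m, n = 10, 32, 3   # made up: input dimension (set by the data), nodes per layer, hidden layers

model = tf.keras.Sequential()
model.add(layers.Dense(m, activation="relu", input_shape=(d,)))  # hidden layer 1
for _ in range(n - 1):                                           # number of layers: a design choice
    model.add(layers.Dense(m, activation="relu"))                # nodes and activation: design choices
model.add(layers.Dense(1, activation="sigmoid"))                 # output unit: here, a binary output
model.summary()
```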
Why layers? Representation
Representation matters!
CS109A, PROTOPAPAS, RADER
Learning Multiple Components
CS109A, PROTOPAPAS, RADER
Depth = Repeated Compositions
CS109A, PROTOPAPAS, RADER
Neural Networks
Hand-written digit recognition: MNIST data
CS109A, PROTOPAPAS, RADER
Depth = Repeated Compositions
CS109A, PROTOPAPAS, RADER
Outline
Anatomy of a NN
Design choices:
- Activation function
- Loss function
- Output units
- Architecture
CS109A, PROTOPAPAS, RADER
Activation function
$h = f(W^T X + b)$
The activation function should:
- Ensure non-linearity
- Ensure gradients remain large through the hidden units
Common choices are (rough NumPy versions follow this slide):
- Sigmoid
- ReLU, leaky ReLU, generalized ReLU, MaxOut
- Softplus
- Tanh
- Swish
CS109A, PROTOPAPAS, RADER
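Rough NumPy versions of the activations listed above (a sketch for intuition only; tanh is simply np.tanh):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    # alpha is a small slope for negative inputs
    return np.where(z > 0, z, alpha * z)

def softplus(z):
    # smooth approximation of ReLU
    return np.log1p(np.exp(z))

def swish(z):
    # self-gated activation: x * sigmoid(x)
    return z * sigmoid(z)

z = np.linspace(-3.0, 3.0, 7)
print(relu(z), swish(z))
```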
Beyond Linear Models
Linear models:
- Can be fit efficiently (via convex optimization)
- Limited model capacity
Alternative: $f(x) = w^T \phi(x)$, where $\phi$ is a non-linear transform
CS109A, PROTOPAPAS, RADER
Traditional ML
Manually engineer $\phi$
- Domain specific, enormous human effort
Generic transform
- Maps to a higher-dimensional space
- Kernel methods: e.g. RBF kernels
- Overfitting: does not generalize well to the test set
- Cannot encode enough prior information
CS109A, PROTOPAPAS, RADER
Deep Learning
- Directly learn $\phi$: $f(x; \theta, w) = w^T \phi(x; \theta)$
- where $\theta$ are the parameters of the transform
- $\phi$ defines the hidden layers
- Non-convex optimization
- Can encode prior beliefs, generalizes well
CS109A, PROTOPAPAS, RADER
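A toy NumPy contrast between the two views (made-up centers, weights, and parameters): a fixed, hand-chosen feature map phi versus a transform phi(x; theta) whose parameters would be learned:

```python
import numpy as np

def phi_fixed(x, centers=np.array([-1.0, 0.0, 1.0]), gamma=2.0):
    # Traditional ML: hand-engineered RBF features, nothing is learned about phi
    return np.exp(-gamma * (x - centers) ** 2)

def phi_learned(x, theta):
    # Deep learning: phi(x; theta) is itself parameterized (here, one tanh hidden layer)
    W, b = theta
    return np.tanh(W * x + b)

x = 0.7
w = np.array([0.5, -0.2, 0.3])                                    # top-layer weights (made up)
print(w @ phi_fixed(x))                                           # f(x) = w^T phi(x)

theta = (np.array([1.0, -2.0, 0.5]), np.array([0.0, 0.1, -0.1]))
print(w @ phi_learned(x, theta))                                  # f(x; theta, w) = w^T phi(x; theta)
```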
Activation function
$h = f(W^T X + b)$
The activation function should:
- Ensure non-linearity
- Ensure gradients remain large through the hidden units
Common choices are:
- Sigmoid
- ReLU, leaky ReLU, generalized ReLU, MaxOut
- Softplus
- Tanh
- Swish
CS109A, PROTOPAPAS, RADER
ReLU and Softplus
[Plot: ReLU and softplus activation functions]
CS109A, PROTOPAPAS, RADER
Generalized ReLU
Generalization: for $\alpha_i > 0$, $\; f(z_i, \alpha) = \max\{0, z_i\} + \alpha_i \min\{0, z_i\}$
CS109A, PROTOPAPAS, RADER
Maxout
Max of k linear functions; directly learn the activation function.
$f(x) = \max_{i \in \{1,\dots,k\}} \, (\alpha_i x + \gamma_i)$
CS109A, PROTOPAPAS, RADER
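A tiny NumPy sketch of a maxout unit on a scalar input, with made-up parameters for k = 3 linear pieces:

```python
import numpy as np

def maxout(x, alphas, gammas):
    # Max over k learned linear functions of the input
    return np.max(alphas * x + gammas)

alphas = np.array([-1.0, 0.0, 1.0])   # made-up slopes
gammas = np.array([0.0, 0.5, 0.0])    # made-up intercepts
print(maxout(2.0, alphas, gammas))    # the steepest positive piece wins for large x
```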
Swish: A Self-Gated Activation Function
$f(x) = x \, \sigma(x)$
Currently, the most successful and widely used activation function is the ReLU. Swish tends to work better than ReLU on deeper models across a number of challenging datasets.
CS109A, PROTOPAPAS, RADER
Outline
Anatomy of a NN
Design choices:
- Activation function
- Loss function
- Output units
- Architecture
CS109A, PROTOPAPAS, RADER
Loss Function
Cross-entropy between training data and model distribution (i.e. negative log-likelihood)
$J(\theta) = -\,\mathbb{E}_{x,y \sim \hat{p}_{\text{data}}} \log p_{\text{model}}(y \mid x)$
We do not need to design separate loss functions for each model. The gradient of the cost function must be large enough to guide learning (a sketch follows this slide).
CS109A, PROTOPAPAS, RADER
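A minimal NumPy sketch of this loss for a binary classifier, where the model distribution is Bernoulli(p_hat); labels and predictions are made up:

```python
import numpy as np

def nll(y, p_hat, eps=1e-12):
    # Negative log-likelihood (binary cross-entropy), averaged over the training data
    p_hat = np.clip(p_hat, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

y = np.array([1, 0, 1, 1])
p_hat = np.array([0.9, 0.2, 0.7, 0.4])
print(nll(y, p_hat))
```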
Loss Function
Example: sigmoid output + squared loss gives flat surfaces (the gradient vanishes when the sigmoid saturates):
$L_{SE} = (y - \hat{y})^2 = (y - \sigma(x))^2$
CS109A, PROTOPAPAS, RADER
Cost Function
Example: sigmoid output + cross-entropy loss:
$L_{CE}(y, \hat{y}) = -\,\{\, y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \,\}$
CS109A, PROTOPAPAS, RADER
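A short numerical illustration (not from the slides) of why cross-entropy is preferred over squared loss with a sigmoid output: when the unit saturates on a wrong prediction, the squared-loss gradient nearly vanishes while the cross-entropy gradient stays large:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y, z = 1.0, -8.0                           # confidently wrong prediction (made-up numbers)
p = sigmoid(z)

# d/dz of (y - sigmoid(z))^2  -> shrinks by a factor sigma * (1 - sigma)
grad_squared = 2 * (p - y) * p * (1 - p)
# d/dz of the cross-entropy   -> simply sigma(z) - y
grad_cross_entropy = p - y

print(grad_squared, grad_cross_entropy)    # tiny (~ -7e-4) vs. close to -1
```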
Design Choices
- Activation function
- Loss function
- Output units
- Architecture
- Optimizer
CS109A, PROTOPAPAS, RADER
Output Units
Output Type | Output Distribution | Output Layer | Cost Function
Binary |  |  |
CS109A, PROTOPAPAS, RADER
Link function
$X \;\Rightarrow\; f(X) = W^T X \;\Rightarrow\; \hat{p} = P(y = 0) = \dfrac{1}{1 + e^{f(X)}}$
[Diagram: input X, output unit $\sigma(f)$, output $\hat{p} = P(y = 0)$]
CS109A, PROTOPAPAS, RADER
Output Units
Output Type | Output Distribution | Output Layer | Cost Function
Binary | Bernoulli | Sigmoid | Binary Cross-Entropy
CS109A, PROTOPAPAS, RADER
Output Units
Output Type | Output Distribution | Output Layer | Cost Function
Binary | Bernoulli | Sigmoid | Binary Cross-Entropy
Discrete |  |  |
CS109A, PROTOPAPAS, RADER
Link function: multi-class problem
[Diagram: input X, output unit SoftMax, output $\hat{p}$]
SoftMax: $\hat{p}_k = \dfrac{e^{f_k(X)}}{\sum_{j=1}^{K} e^{f_j(X)}}$
CS109A, PROTOPAPAS, RADER
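A numerically stable NumPy sketch of the SoftMax output unit (the scores f_k(X) are made up):

```python
import numpy as np

def softmax(f):
    # Subtracting the max is a standard stability trick; it does not change the result
    e = np.exp(f - np.max(f))
    return e / e.sum()

f = np.array([2.0, 1.0, -0.5])    # hypothetical class scores f_k(X)
p_hat = softmax(f)
print(p_hat, p_hat.sum())         # class probabilities, summing to 1
```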
Output Units
Output Type | Output Distribution | Output Layer | Cost Function
Binary | Bernoulli | Sigmoid | Binary Cross-Entropy
Discrete | Multinoulli | Softmax | Cross-Entropy
CS109A, PROTOPAPAS, RADER
Output Units
Output Type | Output Distribution | Output Layer | Cost Function
Binary | Bernoulli | Sigmoid | Binary Cross-Entropy
Discrete | Multinoulli | Softmax | Cross-Entropy
Continuous | Gaussian | Linear | MSE
CS109A, PROTOPAPAS, RADER
Output Units
Output Type | Output Distribution | Output Layer | Cost Function
Binary | Bernoulli | Sigmoid | Binary Cross-Entropy
Discrete | Multinoulli | Softmax | Cross-Entropy
Continuous | Gaussian | Linear | MSE
Continuous | Arbitrary |  | GANs
CS109A, PROTOPAPAS, RADER
Design Choices
- Activation function
- Loss function
- Output units
- Architecture
- Optimizer
CS109A, PROTOPAPAS, RADER
NN in action
[Sequence of figures illustrating a neural network in action]
CS109A, PROTOPAPAS, RADER
Universal Approximation Theorem
Think of a neural network as function approximation: $y = f(x) + \epsilon$; the network provides an approximation $\hat{f}_W(x)$, so $y \approx \hat{f}_W(x) + \epsilon$.
One hidden layer is enough to represent an approximation of any function to an arbitrary degree of accuracy. So why deeper?
- Shallow net may need (exponentially) more width
- Shallow net may overfit more
[Diagram: a wide, shallow network (width) vs. a deep, narrow network (depth)]
CS109A, PROTOPAPAS, RADER
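A small helper (an illustration, not from the lecture) for comparing parameter counts of a wide, shallow net against a deep, narrow one; the layer sizes are made up:

```python
def n_params(layer_sizes):
    # layer_sizes = [inputs, hidden..., outputs]; each dense layer has (fan_in + 1) * fan_out parameters
    return sum((a + 1) * b for a, b in zip(layer_sizes[:-1], layer_sizes[1:]))

print(n_params([10, 2048, 1]))         # one very wide hidden layer
print(n_params([10, 64, 64, 64, 1]))   # several narrow hidden layers
```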
Better Generalization with Depth
(Goodfellow 2017)
CS109A, PROTOPAPAS, RADER
Large, Shallow Nets Overfit More
(Goodfellow 2017)