Lecture 19: Anatomy of NN
CS109A Introduction to Data Science
Pavlos Protopapas, Kevin Rader and Chris Tanner
Outline
Anatomy of a NN
Design choices:
- Activation function
- Loss function
- Output units
- Architecture
Anatomy of artificial neural network (ANN)

[Figure: a single neuron: input $X$, weights $W$, node, output $Y$]
Anatomy of artificial neural network (ANN)

[Figure: input $X$ → affine transformation → activation $f$ → output $Y$]

The node computes an affine transformation of the input followed by an activation: $Y = f(a)$. We will talk later about the choice of activation function. So far we have only talked about the sigmoid as an activation function, but there are other choices.
Anatomy of artificial neural network (ANN)
[Figure: input layer ($x_1$, $x_2$) → hidden layer ($h_1$, $h_2$) → output layer ($\hat{y}$)]

$a_1 = \mathbf{w}_1^T \mathbf{x} = w_{11} x_1 + w_{12} x_2 + b_1, \quad h_1 = f(a_1)$
$a_2 = \mathbf{w}_2^T \mathbf{x} = w_{21} x_1 + w_{22} x_2 + b_2, \quad h_2 = f(a_2)$

Output function: $\hat{y} = g(h_1, h_2)$
Loss function: $L = \mathcal{L}(\hat{y}, y)$

We will talk later about the choice of the output layer and the loss function. So far we have considered the sigmoid as the output and the Bernoulli log-likelihood as the loss.
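A minimal NumPy sketch of this forward pass (all weight and input values are hypothetical, and the sigmoid is assumed for both $f$ and the output unit):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Hypothetical values for illustration only
x = np.array([0.5, -1.2])          # inputs x1, x2
W = np.array([[0.3, -0.8],         # w11, w12
              [1.1,  0.4]])        # w21, w22
b = np.array([0.1, -0.2])          # biases b1, b2

a = W @ x + b                      # affine transformation: a_i = w_i . x + b_i
h = sigmoid(a)                     # activations: h_i = f(a_i)

# Output unit: another affine map plus sigmoid, as in the slides so far
w_out, b_out = np.array([0.7, -0.5]), 0.05
y_hat = sigmoid(w_out @ h + b_out)
print(y_hat)
```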
Anatomy of artificial neural network (ANN)
[Figure: input layer ($x_1$, $x_2$) → hidden layer 1 → hidden layer 2 → output layer ($\hat{y}$)]
Anatomy of artificial neural network (ANN)
[Figure: input layer ($x_1$, $x_2$) → hidden layer 1 → … → hidden layer n → output layer ($\hat{y}$)]

We will talk later about the choice of the number of layers.
Anatomy of artificial neural network (ANN)

[Figure: input layer ($x_1$, $x_2$) → hidden layer 1, 3 nodes ($h_{11}$, $h_{12}$, $h_{13}$) → … → hidden layer n, 3 nodes ($h_{n1}$, $h_{n2}$, $h_{n3}$) → output layer ($\hat{y}$)]
Anatomy of artificial neural network (ANN)

[Figure: input layer ($x_1$, $x_2$) → hidden layer 1, m nodes ($h_{11}$, …, $h_{1m}$) → … → hidden layer n, m nodes ($h_{n1}$, …, $h_{nm}$) → output layer ($\hat{y}$)]

We will talk later about the choice of the number of nodes.
Anatomy of artificial neural network (ANN)

[Figure: the same network; the input layer has $d$ nodes]

The number of inputs $d$ is specified by the data.
Anatomy of artificial neural network (ANN)

[Figure: a full network with input layer, hidden layers 1 and 2, and output layer]
Why layers?

Representation matters!
Learning Multiple Components
Depth = Repeated Compositions
Neural Networks
Hand-written digit recognition: MNIST data
Beyond Linear Models
Linear models:
- Can be fit efficiently (via convex optimization)
- Limited model capacity

Alternative: $f(x) = \mathbf{w}^T \phi(x)$, where $\phi$ is a non-linear transform
Traditional ML
Manually engineer $\phi$:
- Domain specific, enormous human effort

Generic transform:
- Maps to a higher-dimensional space
- Kernel methods: e.g. RBF kernels
- Overfitting: does not generalize well to the test set
- Cannot encode enough prior information
Deep Learning
- Directly learn $\phi$:  $f(x; \theta, \mathbf{w}) = \mathbf{w}^T \phi(x; \theta)$
- $\phi(x; \theta)$ is an automatically-learned representation of $x$
- For deep networks, $\phi$ is the function learned by the hidden layers of the network
- $\mathbf{w}$ are the learned weights
- Non-convex optimization
- Can encode prior beliefs, generalizes well
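A sketch of this idea in Keras (my own illustration; the dataset and layer sizes are made up): the hidden layers implement $\phi(x; \theta)$ and the final linear layer implements $\mathbf{w}^T \phi$, with $\theta$ and $\mathbf{w}$ learned jointly by gradient descent.

```python
import numpy as np
from tensorflow import keras

# Hidden layers learn the representation phi(x; theta);
# the final Dense(1) layer is the linear map w^T phi(x).
model = keras.Sequential([
    keras.Input(shape=(2,)),
    keras.layers.Dense(16, activation="relu"),   # part of phi
    keras.layers.Dense(16, activation="relu"),   # part of phi
    keras.layers.Dense(1),                       # w^T phi(x)
])
model.compile(optimizer="adam", loss="mse")

# Toy data for illustration only
X = np.random.randn(256, 2)
y = np.sin(X[:, 0]) + X[:, 1] ** 2
model.fit(X, y, epochs=5, verbose=0)
```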
Outline
Anatomy of a NN
Design choices:
- Activation function
- Loss function
- Output units
- Architecture
Activation function
$h = f(\mathbf{w}^T \mathbf{x} + b)$

The activation function should:
- Provide non-linearity
- Ensure gradients remain large through the hidden units

Common choices are:
- sigmoid
- tanh
- ReLU, leaky ReLU, generalized ReLU, Maxout
- softplus
- swish
Sigmoid (aka Logistic)

$z = \dfrac{1}{1 + e^{-x}}$

The derivative is close to zero over much of the domain, which leads to "vanishing gradients" in backpropagation.
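To see the vanishing-gradient issue numerically, here is a small sketch (my own illustration, not from the slides): the sigmoid's derivative $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ peaks at 0.25 and decays rapidly away from zero.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # sigma'(x) = sigma(x) * (1 - sigma(x))

for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x = {x:5.1f}   sigma'(x) = {sigmoid_grad(x):.6f}")
# x =   0.0   sigma'(x) = 0.250000
# x =  10.0   sigma'(x) = 0.000045  -> gradients vanish when units saturate
```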
Hyperbolic Tangent (tanh)

$z = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$

Same problem of "vanishing gradients" as the sigmoid.
Rectified Linear Unit (ReLU)

$z = \max(0, x)$

Two major advantages:
1. No vanishing gradient when $x > 0$
2. Provides sparsity (regularization), since $z = 0$ when $x < 0$
Leaky ReLU

$z = \max(0, x) + \alpha \min(0, x)$, where $\alpha$ takes a small value

- Tries to fix the "dying ReLU" problem: the derivative is non-zero everywhere.
- Some people report success with this form of activation function, but the results are not always consistent.
Generalized ReLU

Generalization: for $\alpha_i > 0$,  $f(x_i, \alpha) = \max(0, x_i) + \alpha_i \min(0, x_i)$
Softplus

$z = \log(1 + e^{x})$

The softplus is a smooth approximation of the rectifier; its derivative is the logistic sigmoid, a smooth approximation of the rectifier's derivative.
Maxout

Max of $k$ linear functions; directly learns the activation function:
$f(x) = \max_{i \in \{1, \dots, k\}} (\alpha_i x + \beta_i)$
Swish: A Self-Gated Activation Function

$f(x) = x \, \sigma(x)$

Currently, the most successful and widely-used activation function is the ReLU, but Swish tends to work better than ReLU on deeper models across a number of challenging datasets.
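A compact NumPy sketch of the activations covered above (my own implementations; the leaky-ReLU slope is an illustrative value):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):      # alpha: small illustrative slope
    return np.maximum(0.0, x) + alpha * np.minimum(0.0, x)

def softplus(x):
    return np.log1p(np.exp(x))      # log(1 + e^x)

def swish(x):
    return x * sigmoid(x)           # x * sigma(x)

x = np.linspace(-3, 3, 7)
for f in (sigmoid, tanh, relu, leaky_relu, softplus, swish):
    print(f.__name__, np.round(f(x), 3))
```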
Outline
Anatomy of a NN
Design choices:
- Activation function
- Loss function
- Output units
- Architecture
Loss Function
Likelihood for a given point: $p(y_i \mid \mathbf{w}; x_i)$

Assuming independence, the likelihood over all measurements is:
$L(\mathbf{w}; X, Y) = p(Y \mid X; \mathbf{w}) = \prod_i p(y_i \mid \mathbf{w}; x_i)$

Maximize the likelihood, or equivalently maximize the log-likelihood:
$\log L(\mathbf{w}; X, Y) = \sum_i \log p(y_i \mid \mathbf{w}; x_i)$

Turn this into a loss function:
$\mathcal{L}(\mathbf{w}; X, Y) = -\log L(\mathbf{w}; X, Y)$
Loss Function
We do not need to design separate loss functions if we follow this simple procedure. Examples:

- If the distribution is Normal, the likelihood is
  $p(y_i \mid \mathbf{w}; x_i) = \dfrac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y_i - \hat{y}_i)^2}{2\sigma^2}}$
  which gives (up to constants) $\mathcal{L}(\mathbf{w}; X, Y) = \sum_i (y_i - \hat{y}_i)^2$

- If the distribution is Bernoulli, the likelihood is
  $p(y_i \mid \mathbf{w}; x_i) = p_i^{y_i} (1 - p_i)^{1 - y_i}$
  which gives $\mathcal{L}(\mathbf{w}; X, Y) = -\sum_i \big[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \big]$, aka Cross-Entropy
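A small sketch illustrating the recipe for the Bernoulli case (label and probability values are made up): the negative log-likelihood of Bernoulli-distributed labels is exactly the binary cross-entropy.

```python
import numpy as np

y = np.array([1, 0, 1, 1, 0])             # observed labels (illustrative)
p = np.array([0.9, 0.2, 0.7, 0.6, 0.1])   # predicted P(y=1) for each point

# Negative log-likelihood under the Bernoulli distribution ...
nll = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# ... is exactly the (summed) binary cross-entropy loss
print(nll)
```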
Design Choices
- Activation function
- Loss function
- Output units
- Architecture
- Optimizer
Output Units

Output Type | Output Distribution | Output Layer | Loss Function
Binary      | Bernoulli           | ?            | Binary Cross-Entropy
Output unit for binary classification

[Figure: network body $\mathbf{x} \Rightarrow \mathbf{h}$, followed by a sigmoid output unit]

$\hat{y} = P(y = 1) = \dfrac{1}{1 + e^{-\mathbf{w}^T \mathbf{h}}}$
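A minimal Keras sketch of this pairing (sigmoid output unit plus binary cross-entropy; the input width and layer sizes are illustrative):

```python
from tensorflow import keras

# Binary classifier head: Dense(1, sigmoid) paired with binary cross-entropy
model = keras.Sequential([
    keras.Input(shape=(10,)),                      # 10 input features (illustrative)
    keras.layers.Dense(8, activation="relu"),      # hidden layer
    keras.layers.Dense(1, activation="sigmoid"),   # output unit: P(y = 1)
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```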
Output Units

Output Type | Output Distribution | Output Layer | Loss Function
Binary      | Bernoulli           | Sigmoid      | Binary Cross-Entropy
Discrete    | Multinoulli         | ?            | Cross-Entropy
Output unit for multi-class classification

[Figure: network body $\mathbf{x} \Rightarrow \mathbf{h}$, followed by an output unit producing $\hat{y} = [P_1, P_2, P_3]$]
SoftMax

$\hat{y}_k = \dfrac{e^{z_k}}{\sum_{k'} e^{z_{k'}}}$

[Figure: the rest of the network produces scores for A, B, C; the SoftMax output unit converts them into Probability of A, Probability of B, Probability of C]
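A NumPy sketch of the softmax (subtracting the maximum score before exponentiating is a standard numerical-stability trick, assumed here; the scores are made up):

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)                 # stability: avoids overflow in exp
    e = np.exp(z)
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])    # A, B, C scores (illustrative)
probs = softmax(scores)
print(probs, probs.sum())             # probabilities of A, B, C; sums to 1
```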
Output Units

Output Type | Output Distribution | Output Layer | Loss Function
Binary      | Bernoulli           | Sigmoid      | Binary Cross-Entropy
Discrete    | Multinoulli         | Softmax      | Cross-Entropy
Continuous  | Gaussian            | ?            | MSE
Output unit for regression

[Figure: network body $\mathbf{x} \Rightarrow \mathbf{h}$, followed by a linear output unit]

$\hat{y} = \mathbf{W} \mathbf{h}(\mathbf{x})$
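A minimal Keras sketch of a regression head (linear output unit paired with MSE; the input width and layer sizes are illustrative):

```python
from tensorflow import keras

# Linear output unit: Dense(1) with no activation, paired with MSE
model = keras.Sequential([
    keras.Input(shape=(10,)),                    # 10 input features (illustrative)
    keras.layers.Dense(8, activation="relu"),    # hidden layer
    keras.layers.Dense(1),                       # y_hat = w^T h, no activation
])
model.compile(optimizer="adam", loss="mse")
```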
Output Units

Output Type | Output Distribution | Output Layer | Loss Function
Binary      | Bernoulli           | Sigmoid      | Binary Cross-Entropy
Discrete    | Multinoulli         | Softmax      | Cross-Entropy
Continuous  | Gaussian            | Linear       | MSE
Continuous  | Arbitrary           | GANs (Lectures 18-19 in CS109B)
Loss Function

Example: sigmoid output + squared loss:
$\mathcal{L}_{MSE} = (y - \hat{y})^2 = \big(y - \sigma(x)\big)^2$

The loss surface has flat regions where the sigmoid saturates.
Loss Function

Example: sigmoid output + cross-entropy loss:
$\mathcal{L}_{CE}(y, \hat{y}) = -\big[\, y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \,\big]$
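A sketch (my own illustration) comparing the two gradients with respect to the pre-activation $z$, where $\hat{y} = \sigma(z)$: with squared loss the gradient carries a factor $\sigma'(z)$ and vanishes when the unit saturates, while with cross-entropy it reduces to $\hat{y} - y$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y = 1.0                        # true label
for z in [0.0, -5.0, -10.0]:   # pre-activation; very negative = confidently wrong
    p = sigmoid(z)
    grad_mse = -2 * (y - p) * p * (1 - p)   # d/dz of (y - sigmoid(z))^2
    grad_ce = p - y                          # d/dz of cross-entropy
    print(f"z={z:6.1f}  dMSE/dz={grad_mse:.6f}  dCE/dz={grad_ce:.6f}")
# At z = -10 the MSE gradient is ~0 (flat surface), while the CE gradient is ~ -1.
```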
Design Choices
- Activation function
- Loss function
- Output units
- Architecture
- Optimizer
NN in action

[Figure sequence: several animation slides; no text content preserved]
Universal Approximation Theorem
Think of a neural network as function approximation: $y = f(x) + \epsilon$, approximated by $\hat{f}$ so that $y \approx \hat{f}(x)$; the NN provides $\hat{f}(x)$.

One hidden layer is enough to represent an approximation of any function to an arbitrary degree of accuracy. So why deeper?
- A shallow net may need (exponentially) more width
- A shallow net may overfit more

[Figure: a wide, shallow network (width) vs. a deep, narrow network (depth)]
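A sketch of the one-hidden-layer claim (illustrative, not from the slides): fitting $\sin(x)$ with a single wide hidden layer in Keras.

```python
import numpy as np
from tensorflow import keras

# One wide hidden layer approximating sin(x) on [-3, 3]
x = np.linspace(-3, 3, 500).reshape(-1, 1)
y = np.sin(x)

model = keras.Sequential([
    keras.Input(shape=(1,)),
    keras.layers.Dense(128, activation="tanh"),  # single hidden layer, large width
    keras.layers.Dense(1),                       # linear output for regression
])
model.compile(optimizer="adam", loss="mse")
model.fit(x, y, epochs=200, verbose=0)
print(model.evaluate(x, y, verbose=0))           # MSE should be small
```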
Better Generalization with Depth
(Goodfellow 2017)
Shallow Nets Overfit More
(Goodfellow 2017) The 3-layer nets perform worse on the test set, even with a similar number of total parameters. The 11-layer net generalizes better on the test set when controlling for the number of parameters. Depth helps, and it's not just because of more parameters. Don't worry about the word "convolutional": it's just a special type of neural network, often used for images.
Lab time with Pavlos
1. Install Keras or TensorFlow 2
2. Build the same thing we did for the exercise in Lecture 18