Neural Networks and Autodifferentiation
CMSC 678 UMBC
Recap from last time…
Maximum Entropy (Log-linear) Models
p(y | x) ∝ exp(θᵀ f(x, y))
"model the posterior probabilities of the K classes via linear functions in θ, while at the same time ensuring that they sum to one and remain in [0, 1]" ~ Ch 4.4
"[The log-linear estimate] is the least biased estimate possible… is maximally noncommittal with regard to missing information." Jaynes, 1957
Normalization for Classification
weight1 * f1(fatally shot, X) weight2 * f2(seriously wounded, X) weight3 * f3(Shining Path, X)
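A minimal sketch of the normalization step this slide leads up to, assuming three made-up feature values and weights per candidate label (all numbers hypothetical): each label's weighted feature score is exponentiated and divided by the sum, giving a posterior over labels.

import numpy as np

# hypothetical weights and feature values f_k(..., X) for 3 candidate labels
weights = np.array([0.5, 1.2, -0.3])
features = np.array([
    [1.0, 0.0, 1.0],   # feature values if label = 0
    [0.0, 1.0, 0.0],   # feature values if label = 1
    [1.0, 1.0, 0.0],   # feature values if label = 2
])
scores = features @ weights                  # sum_k weight_k * f_k(..., X) per label
p = np.exp(scores) / np.exp(scores).sum()    # normalize so the scores form a distribution
print(p, p.sum())                            # p sums to 1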
Connections to Other Techniques
Log-Linear Models
(Multinomial) logistic regression
Softmax regression
Maximum Entropy models (MaxEnt)
Generalized Linear Models
Discriminative Naïve Bayes
Very shallow (sigmoidal) neural nets
y = Σ_j θ_j x_j + b
the response can be a general (transformed) version of another response
log [ p(y = k) / p(y = K) ] = Σ_j θ_j f_j(x, k) + b
logistic regression
Log-Likelihood Gradient
Each component k is the difference between:
the total value of feature f_k in the training data
and
the total value the current model p_θ expects for feature f_k: Σ_i E_{y′ ~ p_θ}[ f_k(x_i, y′) ]
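A small sketch of that "observed minus expected" difference for a single training example, assuming a matrix feats[y, k] holding f_k(x_i, y) for each candidate label (names and numbers hypothetical):

import numpy as np

theta = np.array([0.2, -0.1, 0.4])
feats = np.array([               # feats[y, k] = f_k(x_i, y) for 3 labels, 3 features
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0],
])
y_obs = 2                        # observed label for this example

p = np.exp(feats @ theta)
p /= p.sum()                     # p_theta(y | x_i)

observed = feats[y_obs]          # value of each feature f_k in the (one-example) data
expected = p @ feats             # E_{y' ~ p_theta}[ f_k(x_i, y') ]
grad = observed - expected       # components of the log-likelihood gradient
print(grad)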
Outline
Neural networks: non-linear classifiers Learning weights: backpropagation of error Autodifferentiation (in reverse mode)
Sigmoid
σ(z) = 1 / (1 + exp(−s·z))    (curves shown for s = 10, s = 1, s = 0.5)
dσ(z)/dz = s · σ(z) · (1 − σ(z))
calc practice: verify for yourself
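One way to do that calc practice numerically rather than by hand, checking dσ(z)/dz = s·σ(z)·(1 − σ(z)) against a central difference (a quick sketch, not from the slides):

import numpy as np

def sigma(z, s=1.0):
    return 1.0 / (1.0 + np.exp(-s * z))

z, s, eps = 0.7, 10.0, 1e-6
analytic = s * sigma(z, s) * (1.0 - sigma(z, s))
numeric = (sigma(z + eps, s) - sigma(z - eps, s)) / (2 * eps)
print(np.isclose(analytic, numeric))   # True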
Remember Multi-class Linear Regression/Perceptron?
x → y
y = wᵀx + b
if y > 0: class 1 else: class 2
Linear Regression/Perceptron: A Per-Class View
x → y: y = wᵀx + b
per class: y_1 = w_1ᵀx + b, y_2 = w_2ᵀx + b
if y > 0: class 1 else: class 2
i = argmax {y1, y2} class i
binary version is special case
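A tiny sketch of that per-class view with made-up weight vectors w_1 and w_2 (all numbers hypothetical): each class gets its own score, and the predicted class is the argmax; with two classes this reduces to the sign test above.

import numpy as np

x = np.array([1.0, -2.0, 0.5])
W = np.array([[0.3, 0.1, -0.4],    # w_1
              [-0.2, 0.5, 0.1]])   # w_2
b = 0.1
y = W @ x + b                      # y_k = w_k^T x + b
print(np.argmax(y))                # class i = argmax {y1, y2}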
Logistic Regression/Classification
x → y
binary: y = σ(wᵀx + b)
multi-class: y = softmax(wᵀx + b), i.e. y_1 ∝ exp(w_1ᵀx + b), y_2 ∝ exp(w_2ᵀx + b)
i = argmax {y1, y2}   class i
Q: Why didn't our maxent formulation from last class have multiple weight vectors?
A: Implicitly it did. Our formulation was p(y | x) ∝ exp(θᵀ f(x, y)): the features f(x, y) already depend on the candidate label y, so a single θ plays the role of the per-class weight vectors.
Stacking Logistic Regression
x → h → y
h_i = σ(w_iᵀ x + b_0)
Goal: you still want to predict y
Idea: can making an initial round of predictions h help?

Stacking Logistic Regression
h_i = σ(w_iᵀ x + b_0)
y_j = softmax(β_jᵀ h + b_1)
Predict y from your first round of predictions h
Idea: data/signal compression

Stacking Logistic Regression
h_i = F(w_iᵀ x + b_0)
F: (non-linear) activation function
Do we need (binary) probabilities here? Classification: probably; Regression: not really

Stacking Logistic Regression
y_j = G(β_jᵀ h + b_1)
G: (non-linear) activation function (Classification: softmax; Regression: identity)

Multilayer Perceptron, a.k.a. Feed-Forward Neural Network
h_i = F(w_iᵀ x + b_0)
y_j = G(β_jᵀ h + b_1)

Feed-Forward Neural Network
h_i = F(w_iᵀ x + b_0)    y_j = G(β_jᵀ h + b_1)
β: # output × # hidden    w: # hidden × # input
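A minimal numpy forward pass matching those shapes (the layer sizes and the sigmoid/softmax choices here are illustrative assumptions, not fixed by the slide):

import numpy as np

n_in, n_hid, n_out = 5, 4, 3
rng = np.random.default_rng(0)
w = rng.normal(size=(n_hid, n_in))         # w: # hidden x # input
b0 = np.zeros(n_hid)
beta = rng.normal(size=(n_out, n_hid))     # beta: # output x # hidden
b1 = np.zeros(n_out)

x = rng.normal(size=n_in)
h = 1.0 / (1.0 + np.exp(-(w @ x + b0)))    # h_i = F(w_i^T x + b_0), F = sigmoid
scores = beta @ h + b1
y = np.exp(scores) / np.exp(scores).sum()  # y = G(beta h + b_1), G = softmax
print(h.shape, y.shape, y.sum())           # (4,) (3,) 1.0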
Why Non-Linear?
x → h → y
y_j = G(β_jᵀ h + b_1)
y_j = G( Σ_i β_ji F(w_iᵀ x + b_0) + b_1 )
if F and G were both linear (identity), this whole composition would collapse to a single linear function of x; the non-linear activations are what give the extra layer its added power
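A quick numerical illustration of that collapse with made-up matrices (a sketch; biases omitted for brevity): with identity activations, two stacked layers equal one linear map.

import numpy as np

rng = np.random.default_rng(1)
w = rng.normal(size=(4, 5))
beta = rng.normal(size=(3, 4))
x = rng.normal(size=5)

two_linear_layers = beta @ (w @ x)     # F and G both identity
one_linear_layer = (beta @ w) @ x      # a single equivalent weight matrix
print(np.allclose(two_linear_layers, one_linear_layer))   # True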
Feed-Forward
x → h → y
information/computation flow
no self-loops (recurrence/reuse of weights)
Why βNeural?β
argue from neuroscience perspective neurons (in the brain) receive input and βfireβ when sufficiently excited/activated
Image courtesy Hamed Pirsiavash
Universal Function Approximator
Theorem [Kurt Hornik et al., 1989]: Let F be a continuous function on a bounded subset of D-dimensional space. Then there exists a two-layer network G with a finite number of hidden units that approximates F arbitrarily well.
"a two-layer network can approximate any function"
Going from one to two layers dramatically improves the representation power of the network
Slide courtesy Hamed Pirsiavash
How Deep Can They Be?
So many choices: Architecture # of hidden layers # of units per hidden layer
Slide courtesy Hamed Pirsiavash
Computational Issues:
Vanishing gradients: gradients shrink as one moves away from the output layer
Convergence is slow
Opportunities:
Training deep networks is an active area of research
Layer-wise initialization (perhaps using unsupervised data)
Engineering: GPUs to train on massive labelled datasets
Some Results: Digit Classification
logistic regression
ESL, Ch 11
simple feed forward (similar to MNIST in A2, but not exactly the same)
Tensorflow Playground
http://playground.tensorflow.org
Experiment with neural networks on small (toy) data in your browser
Feel free to use this to gain an intuition
Outline
Neural networks: non-linear classifiers Learning weights: backpropagation of error Autodifferentiation (in reverse mode)
Empirical Risk Minimization
ℓ_xent(y*, y) = −Σ_k y*[k] log p(y = k)
cross-entropy loss

ℓ_L2(y*, y) = (y* − y)²
mean squared error / L2 loss

ℓ_sq-expt(y*, y) = ‖y* − E[y]‖₂²
squared expectation loss

ℓ_hinge(y*, y) = max(0, 1 + max_{k ≠ y*} y[k] − y[y*])
hinge loss
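Three of these losses written out for a one-hot y* and a model output y (a short sketch with made-up numbers, following the definitions above):

import numpy as np

y_star = np.array([0.0, 1.0, 0.0])      # one-hot true label (class 1)
y = np.array([0.2, 0.5, 0.3])           # model output over 3 classes

xent = -np.sum(y_star * np.log(y))                                  # cross-entropy loss
l2 = np.sum((y_star - y) ** 2)                                      # squared error / L2 loss
true_k = np.argmax(y_star)
hinge = max(0.0, 1.0 + np.max(np.delete(y, true_k)) - y[true_k])    # hinge loss
print(xent, l2, hinge)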
Gradient Descent: Backpropagate the Error
Set t = 0
Pick a starting value θ_t
Until converged: for example(s) i:
epoch: a single run over all training data
(mini-)batch: a run over a subset
Gradients for Feed Forward Neural Network
y_k = F(β_kᵀ h), with h_i = F(w_iᵀ x + b_0)
h: a vector
ℓ = −Σ_k y*[k] log y_k

∂ℓ/∂β_kj = (−1 / y_{y*}) · ∂y_{y*} / ∂β_kj
         = −[ F′(β_{y*}ᵀ h) / F(β_{y*}ᵀ h) ] · ∂(β_{y*}ᵀ h) / ∂β_kj
         = −[ F′(β_{y*}ᵀ h) / F(β_{y*}ᵀ h) ] · ∂(Σ_j β_{y*,j} h_j) / ∂β_kj
         = −(1 − F(β_{y*}ᵀ h)) · h_j        (using F′ = F · (1 − F), e.g. a sigmoid F)

∂ℓ/∂w_ij = −(1 − F(β_{y*}ᵀ h)) · β_{y*,i} · F′(w_iᵀ x + b_0) · x_j

Debugging can be hard to do!
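Because debugging these by hand is error-prone, a finite-difference "gradient check" is the usual safeguard. Below is a sketch that implements the model and gradients exactly as derived above (F = sigmoid in both layers, loss = −log y_{y*}); every size and value is made up.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
D, H, K = 5, 4, 3                      # input, hidden, output sizes (arbitrary)
x = rng.normal(size=D)
w = rng.normal(size=(H, D))            # first layer: # hidden x # input
b0 = rng.normal(size=H)
beta = rng.normal(size=(K, H))         # output layer: # output x # hidden
ystar = 1                              # index of the true class

def loss(w, beta):
    h = sigmoid(w @ x + b0)            # h_i = F(w_i^T x + b_0)
    y = sigmoid(beta @ h)              # y_k = F(beta_k^T h)
    return -np.log(y[ystar])           # cross-entropy with one-hot y*

# analytic gradients, as derived above
h = sigmoid(w @ x + b0)
y = sigmoid(beta @ h)
d_beta_row = -(1.0 - y[ystar]) * h                                   # dl / d beta_{y*, j}
d_w = -(1.0 - y[ystar]) * np.outer(beta[ystar] * h * (1.0 - h), x)   # dl / d w_{ij}

# finite-difference check of one entry of each
eps = 1e-6
bp, bm = beta.copy(), beta.copy()
bp[ystar, 2] += eps; bm[ystar, 2] -= eps
wp, wm = w.copy(), w.copy()
wp[1, 3] += eps; wm[1, 3] -= eps
print(np.isclose(d_beta_row[2], (loss(w, bp) - loss(w, bm)) / (2 * eps)))    # True
print(np.isclose(d_w[1, 3], (loss(wp, beta) - loss(wm, beta)) / (2 * eps)))  # True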
Dropout: Regularization in Neural Networks
x → h → y
randomly ignore "neurons" (h_i) during training
a different random subset of the h_i is dropped for each training instance (instance 1, instance 2, instance 3, …)
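A sketch of how that per-instance masking is typically applied to the hidden layer during training (the "inverted dropout" rescaling is a common convention, not something stated on the slide):

import numpy as np

rng = np.random.default_rng(0)
h = rng.uniform(size=6)                    # hidden activations h_i for one instance
p_keep = 0.5
mask = rng.random(size=h.shape) < p_keep   # a fresh random mask per training instance
h_dropped = (h * mask) / p_keep            # zero out dropped units; rescale the rest
print(mask, h_dropped)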
Outline
Neural networks: non-linear classifiers Learning weights: backpropagation of error Autodifferentiation (in reverse mode)
Finding Gradients
f(x_1, x_2) = x_1² + (x_1 − x_2)^a − log(x_1² + x_2²)
what are the partial derivatives?

∂f(x_1, x_2)/∂x_1 = 2x_1 + a(x_1 − x_2)^(a−1) − 2x_1 / (x_1² + x_2²)
∂f(x_1, x_2)/∂x_2 = −a(x_1 − x_2)^(a−1) − 2x_2 / (x_1² + x_2²)
chain rule (multiple times)
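Those two expressions can be sanity-checked numerically; with a = 1 at (x_1, x_2) = (2, 1) they give (4.2, −1.4), the same values the autodiff example below arrives at. A quick sketch:

import numpy as np

a = 1.0
def f(x1, x2):
    return x1**2 + (x1 - x2)**a - np.log(x1**2 + x2**2)

df_dx1 = lambda x1, x2: 2*x1 + a*(x1 - x2)**(a - 1) - 2*x1 / (x1**2 + x2**2)
df_dx2 = lambda x1, x2: -a*(x1 - x2)**(a - 1) - 2*x2 / (x1**2 + x2**2)

eps = 1e-6
print(df_dx1(2.0, 1.0), (f(2.0 + eps, 1.0) - f(2.0 - eps, 1.0)) / (2 * eps))  # both ~ 4.2
print(df_dx2(2.0, 1.0), (f(2.0, 1.0 + eps) - f(2.0, 1.0 - eps)) / (2 * eps))  # both ~ -1.4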
Autodifferentiation
f(x_1, x_2) = x_1² + (x_1 − x_2)^a − log(x_1² + x_2²)

"straight line" program:
z_1 = x_1²
z_2 = x_2²
z_3 = (x_1 − x_2)
z_4 = z_3^a
z_5 = z_1 + z_2
z_6 = log z_5
z_7 = z_1 + z_4 − z_6
y = z_7

autodiff: a way of finding gradients
mechanistic/procedural
two (standard) modes: forward and reverse
ML often uses reverse mode

computation graph: nodes x_1, x_2, z_1 … z_7, y, with an edge from each variable to every expression that uses it

goals: ∂y/∂x_1 and ∂y/∂x_2
adjoint: ðu = ∂y/∂u, with ðy = 1

work backwards through the program, one adjoint at a time:
ðz_7 = ∂y/∂z_7 = 1
ðz_6 = ∂y/∂z_6 = (∂y/∂z_7)(∂z_7/∂z_6) = ðz_7 · (−1)
ðz_4 = ∂y/∂z_4 = ðz_7 · 1
ðz_5 = ∂y/∂z_5 = (∂y/∂z_7)(∂z_7/∂z_6)(∂z_6/∂z_5) = ðz_6 · (1/z_5)
ðz_1 = ∂y/∂z_1 = (∂y/∂z_7)(∂z_7/∂z_1) + (∂y/∂z_7)(∂z_7/∂z_6)(∂z_6/∂z_5)(∂z_5/∂z_1), accumulated as ðz_1 += ðz_7 · 1 and ðz_1 += ðz_5 · 1 (z_1 feeds both z_7 and z_5)
ðz_2 = ∂y/∂z_2 = ðz_5 · 1
ðz_3 = ∂y/∂z_3 = ðz_4 · a · z_3^(a−1)
ðx_1 += ðz_1 · 2x_1;  ðx_1 += ðz_3 · 1
ðx_2 += ðz_2 · 2x_2;  ðx_2 += ðz_3 · (−1)

autodifferentiation in reverse mode

Autodifferentiation in Reverse Mode
at x_1 = 2, x_2 = 1, a = 1:  f(x_1 = 2, x_2 = 1) ≈ 3.390562
∇_x f = (4.2, −1.4) by exact gradients
∇_x f = (4.2, −1.4) by autodiff
Code Proof of Autodiff
>> import numpy

>> def autodiff(x1, x2, a=1.0):
       # forward pass
       z1 = x1**2
       z2 = x2**2
       z3 = (x1 - x2)
       z4 = z3**a
       z5 = z1 + z2
       z6 = numpy.log(z5)
       z7 = z1 + z4 - z6
       y = z7
       # backward pass
       dy = 1
       dz7 = dy
       dz6 = dz7 * -1.0
       dz5 = dz6 * 1.0 / z5
       dz4 = dz7 * 1.0
       dz3 = dz4 * a * z3**(a - 1)
       dz2 = dz5 * 1.0
       dz1 = dz7 * 1.0 + dz5 * 1.0
       dx1 = dz1 * 2 * x1 + dz3 * 1.0
       dx2 = dz2 * 2 * x2 + dz3 * -1.0
       return dx1, dx2

>> autodiff(2, 1)
(4.2, -1.4)

>> def f(x1, x2):
       return x1**2 + (x1 - x2)**1 - numpy.log(x1**2 + x2**2)
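As an extra check beyond the slide, the same numbers fall out of a central difference on the f defined above (a quick sketch):

>> eps = 1e-6
>> (f(2 + eps, 1) - f(2 - eps, 1)) / (2 * eps)   # approximately 4.2
>> (f(2, 1 + eps) - f(2, 1 - eps)) / (2 * eps)   # approximately -1.4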
Outline
Neural networks: non-linear classifiers Learning weights: backpropagation of error Autodifferentiation (in reverse mode)
Gradient Descent: Backpropagate the Error
Set t = 0
Pick a starting value θ_t
Until converged: for example(s) i:
1. Compute loss l on x_i
2. Get gradient g_t = l′(x_i)
3. Get scaling factor ρ_t
4. Set θ_{t+1} = θ_t − ρ_t · g_t
5. Set t += 1
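A minimal sketch of that loop for a toy model (logistic regression on made-up data; the constant scaling factor ρ, the fixed epoch count, and the batch size of 1 are illustrative choices, not the course's prescription):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                       # made-up training examples x_i
true_theta = np.array([1.0, -2.0, 0.5])
Y = (X @ true_theta > 0).astype(float)              # made-up labels

theta = np.zeros(3)                                 # pick a starting value theta_t
rho = 0.1                                           # scaling factor (learning rate)
for epoch in range(50):                             # epoch: a run over all training data
    for i in rng.permutation(len(X)):               # here, "mini-batches" of size 1
        p = 1.0 / (1.0 + np.exp(-X[i] @ theta))     # 1. loss is -log p(y_i | x_i)
        g = (p - Y[i]) * X[i]                       # 2. gradient of that loss
        theta = theta - rho * g                     # 3-4. theta_{t+1} = theta_t - rho * g_t
print(theta)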