
SLIDE 1

Neural Networks and Autodifferentiation

CMSC 678 UMBC

SLIDE 2

Recap from last time…

SLIDE 3

Maximum Entropy (Log-linear) Models

π‘ž 𝑦 𝑧) ∝ exp(πœ„π‘ˆπ‘” 𝑦, 𝑧 )

β€œmodel the posterior probabilities of the K classes via linear functions in ΞΈ, while at the same time ensuring that they sum to one and remain in [0, 1]” ~ Ch 4.4 β€œ[The log-linear estimate] is the least biased estimate possible

  • n the given information; i.e., it

is maximally noncommittal with regard to missing information.” Jaynes, 1957

SLIDE 4

Normalization for Classification

Z = Ξ£_{label y} exp( weight1 * f1(fatally shot, x) + weight2 * f2(seriously wounded, x) + weight3 * f3(Shining Path, x) + … )
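To make Z concrete, here is a minimal sketch in Python; the labels and their scores are invented, standing in for Ξ£_i weight_i * f_i(label, x):

import numpy as np

# Invented per-label scores: score(y, x) = Ξ£_i weight_i * f_i(y, x)
scores = {"fatally shot": 2.1, "seriously wounded": 0.3, "Shining Path": -1.0}

Z = sum(np.exp(s) for s in scores.values())        # Z = Ξ£_y exp(score(y, x))
posterior = {y: np.exp(s) / Z for y, s in scores.items()}

assert abs(sum(posterior.values()) - 1.0) < 1e-12  # probabilities sum to one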

SLIDE 5

Connections to Other Techniques

Log-Linear Models:
(Multinomial) logistic regression
Softmax regression
Maximum Entropy models (MaxEnt)
Generalized Linear Models
Discriminative NaΓ―ve Bayes
Very shallow (sigmoidal) neural nets

y = Ξ£_l ΞΈ_l x_l + b
the response can be a general (transformed) version of another response

log p(y = j) βˆ’ log p(y = K) = Ξ£_l ΞΈ_l f(x_l, j) + b
logistic regression

SLIDE 6

Log-Likelihood Gradient

Each component k of the gradient is the difference between:

the total value of feature f_k in the training data,  Ξ£_j f_k(x_j, y_j)

and

the total value the current model p_ΞΈ thinks feature f_k should take,  Ξ£_j 𝔼_{y' ∼ p_ΞΈ}[f_k(x_j, y')]

SLIDE 7

Outline

Neural networks: non-linear classifiers
Learning weights: backpropagation of error
Autodifferentiation (in reverse mode)

SLIDE 8

Sigmoid

Οƒ(x) = 1 / (1 + exp(βˆ’s x))

[plot: sigmoid curves for s = 0.5, 1, 10]

SLIDE 9

Sigmoid

[plot: sigmoid curves for s = 0.5, 1, 10]

βˆ‚Οƒ(x)/βˆ‚x = s Β· Οƒ(x) Β· (1 βˆ’ Οƒ(x))

Οƒ(x) = 1 / (1 + exp(βˆ’s x))

calc practice: verify for yourself
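A quick numerical verification of that identity, using a centered finite difference (the test point and scale are arbitrary):

import numpy as np

def sigmoid(x, s=1.0):
    return 1.0 / (1.0 + np.exp(-s * x))

x, s, eps = 0.7, 2.0, 1e-6
numeric = (sigmoid(x + eps, s) - sigmoid(x - eps, s)) / (2 * eps)
analytic = s * sigmoid(x, s) * (1 - sigmoid(x, s))
print(numeric, analytic)   # the two should agree to ~1e-9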

SLIDE 10

Remember Multi-class Linear Regression/Perceptron?

y = w^T x + b

output:
if y > 0: class 1
else: class 2

SLIDE 11

Linear Regression/Perceptron: A Per-Class View

y = w^T x + b    (single-output view)

y1 = w1^T x + b
y2 = w2^T x + b    (per-class view)

output (single score): if y > 0: class 1, else: class 2

output (per-class scores): i = argmax {y1, y2}; class i

binary version is special case
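A quick illustration of the per-class view, one weight vector per class (weights and input are invented):

import numpy as np

W = np.array([[0.5, -1.0, 2.0],    # w1 (class 1)
              [1.5,  0.3, -0.7]])  # w2 (class 2)
b = np.array([0.1, -0.2])

x = np.array([1.0, 2.0, 0.5])
scores = W @ x + b                 # y_i = w_i^T x + b_i
print("class", np.argmax(scores) + 1)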

SLIDE 12

Logistic Regression/Classification

y = Οƒ(w^T x + b)
y = softmax(W x + b)

y1 ∝ exp(w1^T x + b)
y2 ∝ exp(w2^T x + b)

output: i = argmax {y1, y2}; class i

SLIDE 13

Logistic Regression/Classification

𝑦 π±πŸ‘ 𝑧 𝑧1 ∝ exp(𝐱𝟐

π‘ˆπ‘¦ + 𝑐)

𝐱𝟐 𝑧2 𝑧2 ∝ exp( π±πŸ‘

π‘ˆπ‘¦ + 𝑐)

𝑧1

  • utput:

i = argmax {y1, y2} class i Q: Why didn’t our maxent formulation from last class have multiple weight vectors?

SLIDE 14

Logistic Regression/Classification

𝑦 π±πŸ‘ 𝑧 𝑧1 ∝ exp(𝐱𝟐

π‘ˆπ‘¦ + 𝑐)

𝐱𝟐 𝑧2 𝑧2 ∝ exp( π±πŸ‘

π‘ˆπ‘¦ + 𝑐)

𝑧1

  • utput:

i = argmax {y1, y2} class i Q: Why didn’t our maxent formulation from last class have multiple weight vectors? A: Implicitly it did. Our formulation was 𝑧 ∝ exp(π‘₯π‘ˆπ‘” 𝑦, 𝑧 )

SLIDE 15

Stacking Logistic Regression

𝐱𝟐 𝑦 β„Žπ‘— = 𝜏(𝐱𝐣

π‘ˆπ‘¦ + 𝑐0)

β„Ž 𝑧 Goal: you still want to predict y Idea: Can making an initial round

  • f separate (independent) binary

predictions h help? π±πŸ‘ π±πŸ’ π±πŸ“

SLIDE 16

Stacking Logistic Regression

h_j = Οƒ(w_j^T x + b0)
y_k = softmax(Ξ³_k^T h + b1)

Predict y from your first round of predictions h
Idea: data/signal compression

SLIDE 17

Stacking Logistic Regression

𝑦 β„Žπ‘— = 𝜏(𝐱𝐣

π‘ˆπ‘¦ + 𝑐0)

β„Ž 𝑧 π‘§π‘˜ = softmax(𝛄𝐀

π‘ˆβ„Ž + 𝑐1)

𝑧1 𝑧2 Do we need (binary) probabilities here? 𝜸 𝐱𝟐 π±πŸ‘ π±πŸ’ π±πŸ“

SLIDE 18

Stacking Logistic Regression

𝑦 β„Žπ‘— = 𝐺(𝐱𝐣

π‘ˆπ‘¦ + 𝑐0)

β„Ž 𝑧 π‘§π‘˜ = softmax(𝛄𝐀

π‘ˆβ„Ž + 𝑐1)

𝑧1 𝑧2 F: (non-linear) activation function Do we need probabilities here? 𝜸 𝐱𝟐 π±πŸ‘ π±πŸ’ π±πŸ“

SLIDE 19

Stacking Logistic Regression

𝑦 β„Žπ‘— = 𝐺(𝐱𝐣

π‘ˆπ‘¦ + 𝑐0)

β„Ž 𝑧 π‘§π‘˜ = softmax(𝛄𝐀

π‘ˆβ„Ž + 𝑐1)

𝑧1 𝑧2 F: (non-linear) activation function Do we need probabilities here? Classification: probably Regression: not really 𝜸 𝐱𝟐 π±πŸ‘ π±πŸ’ π±πŸ“

SLIDE 20

Stacking Logistic Regression

𝑦 β„Žπ‘— = 𝐺(𝐱𝐣

π‘ˆπ‘¦ + 𝑐0)

β„Ž 𝑧 π‘§π‘˜ = G(𝛄𝐀

π‘ˆβ„Ž + 𝑐1)

𝑧1 𝑧2 F: (non-linear) activation function Classification: softmax Regression: identity G: (non-linear) activation function 𝜸 𝐱𝟐 π±πŸ‘ π±πŸ’ π±πŸ“

SLIDE 21

Multilayer Perceptron, a.k.a. Feed-Forward Neural Network

𝑦 β„Žπ‘— = 𝐺(𝐱𝐣

π‘ˆπ‘¦ + 𝑐0)

β„Ž 𝑧 𝑧1 𝑧2 F: (non-linear) activation function Classification: softmax Regression: identity G: (non-linear) activation function π‘§π‘˜ = G(𝛄𝐀

π‘ˆβ„Ž + 𝑐1)

𝜸 𝐱𝟐 π±πŸ‘ π±πŸ’ π±πŸ“

SLIDE 22

Feed-Forward Neural Network

𝑦 β„Žπ‘— = 𝐺(𝐱𝐣

π‘ˆπ‘¦ + 𝑐0)

β„Ž 𝑧 𝑧1 𝑧2 π‘§π‘˜ = G(𝛄𝐀

π‘ˆβ„Ž + 𝑐1)

𝜸 𝐱𝟐 π±πŸ‘ π±πŸ’ π±πŸ“

𝜸: # output X # hidden 𝐱: # hidden X # input
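A minimal numpy sketch of this forward pass; the sizes and random weights are invented:

import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 5, 3, 2
W, b0 = rng.normal(size=(n_hid, n_in)), np.zeros(n_hid)
Gamma, b1 = rng.normal(size=(n_out, n_hid)), np.zeros(n_out)

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

x = rng.normal(size=n_in)
h = sigmoid(W @ x + b0)        # h_j = F(w_j^T x + b0), F = sigmoid
y = softmax(Gamma @ h + b1)    # y_k = G(Ξ³_k^T h + b1), G = softmax
print(y, y.sum())              # a distribution over the output classes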

SLIDE 23

Why Non-Linear?

y_k = G(Ξ³_k^T h + b1)
    = G(Ξ³_k^T [F(w_j^T x + b0)]_j + b1)

If F and G were linear, this composition would collapse into a single linear map of x, so stacking layers would add nothing.
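That collapse can be checked directly: with identity activations the two-layer network is exactly one linear layer (a sketch with invented weights):

import numpy as np

rng = np.random.default_rng(1)
W, b0 = rng.normal(size=(3, 5)), rng.normal(size=3)
Gamma, b1 = rng.normal(size=(2, 3)), rng.normal(size=2)

x = rng.normal(size=5)
two_layer = Gamma @ (W @ x + b0) + b1            # F = G = identity
one_layer = (Gamma @ W) @ x + (Gamma @ b0 + b1)  # the collapsed linear map
print(np.allclose(two_layer, one_layer))         # True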

SLIDE 24

Feed-Forward

information/computation flow: no self-loops (no recurrence/reuse of weights)

SLIDE 25

Why β€œNeural?”

argue from a neuroscience perspective: neurons (in the brain) receive input and β€œfire” when sufficiently excited/activated

Image courtesy Hamed Pirsiavash

SLIDE 26

Universal Function Approximator

Theorem [Kurt Hornik et al., 1989]: Let F be a continuous function on a bounded subset of D-dimensional space. Then there exists a two-layer network G with a finite number of hidden units that approximates F arbitrarily well: for all x in the domain of F, |F(x) βˆ’ G(x)| < Ξ΅.

β€œa two-layer network can approximate any function”
Going from one to two layers dramatically improves the representation power of the network

Slide courtesy Hamed Pirsiavash

SLIDE 27

How Deep Can They Be?

So many choices:
Architecture
# of hidden layers
# of units per hidden layer

Slide courtesy Hamed Pirsiavash

Computational Issues:
Vanishing gradients: gradients shrink as one moves away from the output layer
Convergence is slow

Opportunities:
Training deep networks is an active area of research
Layer-wise initialization (perhaps using unsupervised data)
Engineering: GPUs to train on massive labelled datasets

SLIDE 28

Some Results: Digit Classification

[Figure: digit-classification results from ESL, Ch 11: logistic regression vs. a simple feed-forward network (similar to MNIST in A2, but not exactly the same)]

SLIDE 29

Tensorflow Playground

http://playground.tensorflow.org
Experiment with small (toy) data neural networks in your browser
Feel free to use this to gain an intuition

SLIDE 30

Outline

Neural networks: non-linear classifiers
Learning weights: backpropagation of error
Autodifferentiation (in reverse mode)

SLIDE 31

Empirical Risk Minimization

Cross entropy loss:
β„“_xent(y*, y) = βˆ’ Ξ£_l y*[l] log p(y = l)

Mean squared error / L2 loss:
β„“_L2(y*, y) = (y* βˆ’ y)^2

Squared expectation loss:
β„“_sq-expt(y*, y) = β€–y* βˆ’ p(y)β€–β‚‚Β²

Hinge loss:
β„“_hinge(y*, y) = max(0, 1 + max_{k β‰  y*} y[k] βˆ’ y[y*])
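A sketch of the four losses on a single invented example, treating p as the model's softmax distribution over classes:

import numpy as np

y_star = np.array([0.0, 1.0, 0.0])              # one-hot target
scores = np.array([0.2, 1.5, -0.3])             # raw model outputs
p = np.exp(scores - scores.max()); p /= p.sum() # predicted distribution

xent = -np.sum(y_star * np.log(p))                    # cross entropy
l2 = np.sum((y_star - scores) ** 2)                   # squared error on scores
sq_expt = np.sum((y_star - p) ** 2)                   # squared expectation
true_k = int(np.argmax(y_star))
others = np.delete(scores, true_k)
hinge = max(0.0, 1.0 + others.max() - scores[true_k]) # multiclass hinge
print(xent, l2, sq_expt, hinge)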

SLIDE 32

Gradient Descent: Backpropagate the Error

Set t = 0
Pick a starting value ΞΈ_t
Until converged, for example(s) i:
1. Compute loss l on x_i
2. Get gradient g_t = l’(x_i)
3. Get scaling factor ρ_t
4. Set ΞΈ_{t+1} = ΞΈ_t βˆ’ ρ_t * g_t
5. Set t += 1

epoch: a single run over all the training data
(mini-)batch: a run over a subset of the data
(a sketch of this loop in code follows below)

SLIDE 33

Gradients for Feed Forward Neural Network

y_l = Οƒ(Ξ³_l^T h),  where h_k = Οƒ(w_k^T x + b0)    (h: a vector)

β„’ = βˆ’ Ξ£_l y*[l] log y_l

goals: βˆ‚β„’/βˆ‚w_km and βˆ‚β„’/βˆ‚Ξ³_lk

βˆ‚β„’/βˆ‚Ξ³_lk = (βˆ’1 / y_{y*}) Β· βˆ‚y_{y*}/βˆ‚Ξ³_lk

SLIDE 34

Gradients for Feed Forward Neural Network

y_l = Οƒ(Ξ³_l^T h),  where h_k = Οƒ(w_k^T x + b0)    (h: a vector)

β„’ = βˆ’ Ξ£_l y*[l] log y_l

βˆ‚β„’/βˆ‚Ξ³_lk = (βˆ’1 / y_{y*}) Β· βˆ‚y_{y*}/βˆ‚Ξ³_lk
         = βˆ’[σ′(γ_{y*}^T h) / σ(γ_{y*}^T h)] · ∂(γ_l^T h)/βˆ‚Ξ³_lk

SLIDE 35

Gradients for Feed Forward Neural Network

y_l = Οƒ(Ξ³_l^T h),  where h_k = Οƒ(w_k^T x + b0)    (h: a vector)

β„’ = βˆ’ Ξ£_l y*[l] log y_l

βˆ‚β„’/βˆ‚Ξ³_lk = (βˆ’1 / y_{y*}) Β· βˆ‚y_{y*}/βˆ‚Ξ³_lk
         = βˆ’[σ′(γ_{y*}^T h) / σ(γ_{y*}^T h)] · ∂(γ_l^T h)/βˆ‚Ξ³_lk
         = βˆ’[σ′(γ_{y*}^T h) / σ(γ_{y*}^T h)] · ∂(Σ_k γ_{y*,k} h_k)/βˆ‚Ξ³_lk

SLIDE 36

Gradients for Feed Forward Neural Network

y_l = Οƒ(Ξ³_l^T h),  where h_k = Οƒ(w_k^T x + b0)    (h: a vector)

β„’ = βˆ’ Ξ£_l y*[l] log y_l

βˆ‚β„’/βˆ‚Ξ³_lk = (βˆ’1 / y_{y*}) Β· βˆ‚y_{y*}/βˆ‚Ξ³_lk
         = βˆ’[σ′(γ_{y*}^T h) / σ(γ_{y*}^T h)] · ∂(γ_l^T h)/βˆ‚Ξ³_lk
         = βˆ’[σ′(γ_{y*}^T h) / σ(γ_{y*}^T h)] · ∂(Σ_k γ_{y*,k} h_k)/βˆ‚Ξ³_lk
         = βˆ’(1 βˆ’ Οƒ(Ξ³_{y*}^T h)) Β· h_k    (nonzero only when l = y*)

βˆ‚β„’/βˆ‚w_km = βˆ’(1 βˆ’ Οƒ(Ξ³_{y*}^T h)) Β· Ξ³_{y*,k} Β· σ′(w_k^T x + b0) Β· x_m

SLIDE 37

Gradients for Feed Forward Neural Network

y_l = Οƒ(Ξ³_l^T h),  where h_k = Οƒ(w_k^T x + b0)    (h: a vector)

β„’ = βˆ’ Ξ£_l y*[l] log y_l

βˆ‚β„’/βˆ‚Ξ³_lk = βˆ’(1 βˆ’ Οƒ(Ξ³_{y*}^T h)) Β· h_k
βˆ‚β„’/βˆ‚w_km = βˆ’(1 βˆ’ Οƒ(Ξ³_{y*}^T h)) Β· Ξ³_{y*,k} Β· σ′(w_k^T x + b0) Β· x_m

Debugging can be hard to do!

SLIDE 38

Gradients for Feed Forward Neural Network

y_l = Οƒ(Ξ³_l^T h),  where h_k = Οƒ(w_k^T x + b0)    (h: a vector)

β„’ = βˆ’ Ξ£_l y*[l] log y_l

βˆ‚β„’/βˆ‚Ξ³_lk = βˆ’(1 βˆ’ Οƒ(Ξ³_{y*}^T h)) Β· h_k
βˆ‚β„’/βˆ‚w_km = βˆ’(1 βˆ’ Οƒ(Ξ³_{y*}^T h)) Β· Ξ³_{y*,k} Β· σ′(w_k^T x + b0) Β· x_m

Debugging can be hard to do!
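One standard remedy for hard-to-debug gradients is a finite-difference check: perturb one parameter and compare the observed loss change with the claimed derivative. A sketch for βˆ‚β„’/βˆ‚Ξ³_lk under the setup above (sizes invented):

import numpy as np

def loss(gamma, h, y_true):
    """Cross entropy with per-class sigmoid outputs, as in the slides."""
    y = 1.0 / (1.0 + np.exp(-(gamma @ h)))
    return -np.log(y[y_true])

rng = np.random.default_rng(0)
gamma, h, y_true = rng.normal(size=(3, 4)), rng.normal(size=4), 1

eps, l, k = 1e-6, 1, 2   # check the entry gamma[l, k] with l = y*
bump = np.zeros_like(gamma); bump[l, k] = eps
numeric = (loss(gamma + bump, h, y_true) - loss(gamma - bump, h, y_true)) / (2 * eps)
sig = 1.0 / (1.0 + np.exp(-(gamma[y_true] @ h)))
analytic = -(1 - sig) * h[k]   # βˆ‚β„’/βˆ‚Ξ³_lk from the derivation above
print(numeric, analytic)       # should agree closely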

SLIDE 39

Dropout: Regularization in Neural Networks


randomly ignore β€œneurons” (hi) during training

Instance 1
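A sketch of how the per-instance masking might look in code; the rescaling by keep_prob is the common β€œinverted dropout” convention, not something the slides specify:

import numpy as np

rng = np.random.default_rng(0)
keep_prob = 0.5

def dropout(h):
    mask = rng.random(h.shape) < keep_prob
    return (h * mask) / keep_prob   # rescale so expectations match test time

h = np.array([0.3, 1.2, -0.7, 0.5])
print(dropout(h))   # instance 1: one random subset of units zeroed
print(dropout(h))   # instance 2: a different subset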

SLIDE 40

Dropout: Regularization in Neural Networks


randomly ignore β€œneurons” (hi) during training

Instance 2

SLIDE 41

Dropout: Regularization in Neural Networks


randomly ignore β€œneurons” (hi) during training

Instance 3

SLIDE 42

Dropout: Regularization in Neural Networks


randomly ignore β€œneurons” (hi) during training

Instance 1

SLIDE 43

Outline

Neural networks: non-linear classifiers
Learning weights: backpropagation of error
Autodifferentiation (in reverse mode)

SLIDE 44

Finding Gradients

f(x1, x2) = x1^2 + (x1 βˆ’ x2)^a βˆ’ log(x1^2 + x2^2)

what are the partial derivatives?

SLIDE 45

Finding Gradients

f(x1, x2) = x1^2 + (x1 βˆ’ x2)^a βˆ’ log(x1^2 + x2^2)

βˆ‚f(x1, x2)/βˆ‚x1 = 2 x1 + a (x1 βˆ’ x2)^(aβˆ’1) βˆ’ 2 x1 / (x1^2 + x2^2)

SLIDE 46

Finding Gradients

f(x1, x2) = x1^2 + (x1 βˆ’ x2)^a βˆ’ log(x1^2 + x2^2)

βˆ‚f(x1, x2)/βˆ‚x1 = 2 x1 + a (x1 βˆ’ x2)^(aβˆ’1) βˆ’ 2 x1 / (x1^2 + x2^2)

βˆ‚f(x1, x2)/βˆ‚x2 = βˆ’a (x1 βˆ’ x2)^(aβˆ’1) βˆ’ 2 x2 / (x1^2 + x2^2)

chain rule (multiple times)
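Those partials can be sanity-checked numerically; a sketch with centered finite differences at the point used later in the deck:

import numpy as np

def f(x1, x2, a=1.0):
    return x1**2 + (x1 - x2)**a - np.log(x1**2 + x2**2)

def exact_grad(x1, x2, a=1.0):
    dx1 = 2*x1 + a*(x1 - x2)**(a - 1) - 2*x1 / (x1**2 + x2**2)
    dx2 = -a*(x1 - x2)**(a - 1) - 2*x2 / (x1**2 + x2**2)
    return dx1, dx2

x1, x2, eps = 2.0, 1.0, 1e-6
num1 = (f(x1 + eps, x2) - f(x1 - eps, x2)) / (2 * eps)
num2 = (f(x1, x2 + eps) - f(x1, x2 - eps)) / (2 * eps)
print(exact_grad(x1, x2))   # (4.2, -1.4)
print(num1, num2)           # matches to ~1e-9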

SLIDE 47

Autodifferentiation

f(x1, x2) = x1^2 + (x1 βˆ’ x2)^a βˆ’ log(x1^2 + x2^2)

z1 = x1^2    z2 = x2^2    z3 = x1 βˆ’ x2    z4 = z3^a
z5 = z1 + z2    z6 = log z5    z7 = z1 + z4 βˆ’ z6    y = z7

SLIDE 48

Autodifferentiation

f(x1, x2) = x1^2 + (x1 βˆ’ x2)^a βˆ’ log(x1^2 + x2^2)

β€œstraight line” program:
z1 = x1^2    z2 = x2^2    z3 = x1 βˆ’ x2    z4 = z3^a
z5 = z1 + z2    z6 = log z5    z7 = z1 + z4 βˆ’ z6    y = z7

autodiff: a way of finding gradients, mechanistic/procedural
two (standard) modes: forward and reverse
ML often uses reverse mode
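The straight-line program, written out as a runnable forward pass:

import numpy as np

def forward(x1, x2, a=1.0):
    z1 = x1**2
    z2 = x2**2
    z3 = x1 - x2
    z4 = z3**a
    z5 = z1 + z2
    z6 = np.log(z5)
    z7 = z1 + z4 - z6
    return z7            # y = z7

print(forward(2.0, 1.0))   # β‰ˆ 3.390562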

SLIDE 49

Autodifferentiation

f(x1, x2) = x1^2 + (x1 βˆ’ x2)^a βˆ’ log(x1^2 + x2^2)

β€œstraight line” program:
z1 = x1^2    z2 = x2^2    z3 = x1 βˆ’ x2    z4 = z3^a
z5 = z1 + z2    z6 = log z5    z7 = z1 + z4 βˆ’ z6    y = z7

computation graph: x1, x2 → z1 … z7 → y

SLIDE 50

Autodifferentiation

f(x1, x2) = x1^2 + (x1 βˆ’ x2)^a βˆ’ log(x1^2 + x2^2)

β€œstraight line” program:
z1 = x1^2    z2 = x2^2    z3 = x1 βˆ’ x2    z4 = z3^a
z5 = z1 + z2    z6 = log z5    z7 = z1 + z4 βˆ’ z6    y = z7

computation graph: x1, x2 → z1 … z7 → y

goals: βˆ‚y/βˆ‚x1, βˆ‚y/βˆ‚x2

SLIDE 51

Autodifferentiation

f(x1, x2) = x1^2 + (x1 βˆ’ x2)^a βˆ’ log(x1^2 + x2^2)

β€œstraight line” program:
z1 = x1^2    z2 = x2^2    z3 = x1 βˆ’ x2    z4 = z3^a
z5 = z1 + z2    z6 = log z5    z7 = z1 + z4 βˆ’ z6    y = z7

computation graph: x1, x2 → z1 … z7 → y

adjoint: Γ°u = βˆ‚y/βˆ‚u    goals: βˆ‚y/βˆ‚x1, βˆ‚y/βˆ‚x2

SLIDE 52

Autodifferentiation

f(x1, x2) = x1^2 + (x1 βˆ’ x2)^a βˆ’ log(x1^2 + x2^2)

β€œstraight line” program:
z1 = x1^2    z2 = x2^2    z3 = x1 βˆ’ x2    z4 = z3^a
z5 = z1 + z2    z6 = log z5    z7 = z1 + z4 βˆ’ z6    y = z7

computation graph: x1, x2 → z1 … z7 → y

adjoint: Γ°u = βˆ‚y/βˆ‚u    goals: βˆ‚y/βˆ‚x1, βˆ‚y/βˆ‚x2    Γ°y = 1

SLIDE 53

Autodifferentiation

f(x1, x2) = x1^2 + (x1 βˆ’ x2)^a βˆ’ log(x1^2 + x2^2)

β€œstraight line” program:
z1 = x1^2    z2 = x2^2    z3 = x1 βˆ’ x2    z4 = z3^a
z5 = z1 + z2    z6 = log z5    z7 = z1 + z4 βˆ’ z6    y = z7

adjoint: Γ°u = βˆ‚y/βˆ‚u    goals: βˆ‚y/βˆ‚x1, βˆ‚y/βˆ‚x2    Γ°y = 1

Γ°z7 = βˆ‚y/βˆ‚z7 = 1

SLIDE 54

Autodifferentiation

f(x1, x2) = x1^2 + (x1 βˆ’ x2)^a βˆ’ log(x1^2 + x2^2)

β€œstraight line” program:
z1 = x1^2    z2 = x2^2    z3 = x1 βˆ’ x2    z4 = z3^a
z5 = z1 + z2    z6 = log z5    z7 = z1 + z4 βˆ’ z6    y = z7

adjoint: Γ°u = βˆ‚y/βˆ‚u    goals: βˆ‚y/βˆ‚x1, βˆ‚y/βˆ‚x2    Γ°y = 1

Γ°z7 = βˆ‚y/βˆ‚z7 = 1
Γ°z6 = βˆ‚y/βˆ‚z6 = (βˆ‚y/βˆ‚z7)(βˆ‚z7/βˆ‚z6) = Γ°z7 * (βˆ’1)

SLIDE 55

Autodifferentiation

f(x1, x2) = x1^2 + (x1 βˆ’ x2)^a βˆ’ log(x1^2 + x2^2)

β€œstraight line” program:
z1 = x1^2    z2 = x2^2    z3 = x1 βˆ’ x2    z4 = z3^a
z5 = z1 + z2    z6 = log z5    z7 = z1 + z4 βˆ’ z6    y = z7

adjoint: Γ°u = βˆ‚y/βˆ‚u    goals: βˆ‚y/βˆ‚x1, βˆ‚y/βˆ‚x2    Γ°y = 1

Γ°z7 = βˆ‚y/βˆ‚z7 = 1
Γ°z6 = Γ°z7 * (βˆ’1)
Γ°z4 = βˆ‚y/βˆ‚z4 = (βˆ‚y/βˆ‚z7)(βˆ‚z7/βˆ‚z4) = Γ°z7 * 1

SLIDE 56

Autodifferentiation

f(x1, x2) = x1^2 + (x1 βˆ’ x2)^a βˆ’ log(x1^2 + x2^2)

β€œstraight line” program:
z1 = x1^2    z2 = x2^2    z3 = x1 βˆ’ x2    z4 = z3^a
z5 = z1 + z2    z6 = log z5    z7 = z1 + z4 βˆ’ z6    y = z7

adjoint: Γ°u = βˆ‚y/βˆ‚u    goals: βˆ‚y/βˆ‚x1, βˆ‚y/βˆ‚x2    Γ°y = 1

Γ°z7 = βˆ‚y/βˆ‚z7 = 1
Γ°z6 = Γ°z7 * (βˆ’1)
Γ°z4 = Γ°z7 * 1
Γ°z5 = βˆ‚y/βˆ‚z5 = (βˆ‚y/βˆ‚z7)(βˆ‚z7/βˆ‚z5) = (βˆ‚y/βˆ‚z7)(βˆ‚z7/βˆ‚z6)(βˆ‚z6/βˆ‚z5)

SLIDE 57

Autodifferentiation

f(x1, x2) = x1^2 + (x1 βˆ’ x2)^a βˆ’ log(x1^2 + x2^2)

β€œstraight line” program:
z1 = x1^2    z2 = x2^2    z3 = x1 βˆ’ x2    z4 = z3^a
z5 = z1 + z2    z6 = log z5    z7 = z1 + z4 βˆ’ z6    y = z7

adjoint: Γ°u = βˆ‚y/βˆ‚u    goals: βˆ‚y/βˆ‚x1, βˆ‚y/βˆ‚x2    Γ°y = 1

Γ°z7 = βˆ‚y/βˆ‚z7 = 1
Γ°z6 = Γ°z7 * (βˆ’1)
Γ°z4 = Γ°z7 * 1
Γ°z5 = βˆ‚y/βˆ‚z5 = (βˆ‚y/βˆ‚z7)(βˆ‚z7/βˆ‚z6)(βˆ‚z6/βˆ‚z5) = Γ°z6 * (1 / z5)

SLIDE 58

Autodifferentiation

f(x1, x2) = x1^2 + (x1 βˆ’ x2)^a βˆ’ log(x1^2 + x2^2)

β€œstraight line” program:
z1 = x1^2    z2 = x2^2    z3 = x1 βˆ’ x2    z4 = z3^a
z5 = z1 + z2    z6 = log z5    z7 = z1 + z4 βˆ’ z6    y = z7

adjoint: Γ°u = βˆ‚y/βˆ‚u    goals: βˆ‚y/βˆ‚x1, βˆ‚y/βˆ‚x2    Γ°y = 1

Γ°z7 = βˆ‚y/βˆ‚z7 = 1
Γ°z6 = Γ°z7 * (βˆ’1)
Γ°z4 = Γ°z7 * 1
Γ°z5 = Γ°z6 * (1 / z5)
Γ°z1 = βˆ‚y/βˆ‚z1 = (βˆ‚y/βˆ‚z7)(βˆ‚z7/βˆ‚z1) + (βˆ‚y/βˆ‚z7)(βˆ‚z7/βˆ‚z6)(βˆ‚z6/βˆ‚z5)(βˆ‚z5/βˆ‚z1)

SLIDE 59

Autodifferentiation

f(x1, x2) = x1^2 + (x1 βˆ’ x2)^a βˆ’ log(x1^2 + x2^2)

β€œstraight line” program:
z1 = x1^2    z2 = x2^2    z3 = x1 βˆ’ x2    z4 = z3^a
z5 = z1 + z2    z6 = log z5    z7 = z1 + z4 βˆ’ z6    y = z7

adjoint: Γ°u = βˆ‚y/βˆ‚u    goals: βˆ‚y/βˆ‚x1, βˆ‚y/βˆ‚x2    Γ°y = 1

Γ°z7 = βˆ‚y/βˆ‚z7 = 1
Γ°z6 = Γ°z7 * (βˆ’1)
Γ°z4 = Γ°z7 * 1
Γ°z5 = Γ°z6 * (1 / z5)
Γ°z1 = βˆ‚y/βˆ‚z1 = Γ°z7 * 1 + Γ°z5 * 1

SLIDE 60

Autodifferentiation

f(x1, x2) = x1^2 + (x1 βˆ’ x2)^a βˆ’ log(x1^2 + x2^2)

β€œstraight line” program:
z1 = x1^2    z2 = x2^2    z3 = x1 βˆ’ x2    z4 = z3^a
z5 = z1 + z2    z6 = log z5    z7 = z1 + z4 βˆ’ z6    y = z7

adjoint: Γ°u = βˆ‚y/βˆ‚u    goals: βˆ‚y/βˆ‚x1, βˆ‚y/βˆ‚x2    Γ°y = 1

Γ°z7 = βˆ‚y/βˆ‚z7 = 1
Γ°z6 = Γ°z7 * (βˆ’1)
Γ°z4 = Γ°z7 * 1
Γ°z1 += Γ°z7 * 1
Γ°z5 = Γ°z6 * (1 / z5)
Γ°z1 += Γ°z5 * 1

SLIDE 61

Autodifferentiation

f(x1, x2) = x1^2 + (x1 βˆ’ x2)^a βˆ’ log(x1^2 + x2^2)

β€œstraight line” program:
z1 = x1^2    z2 = x2^2    z3 = x1 βˆ’ x2    z4 = z3^a
z5 = z1 + z2    z6 = log z5    z7 = z1 + z4 βˆ’ z6    y = z7

adjoint: Γ°u = βˆ‚y/βˆ‚u    goals: βˆ‚y/βˆ‚x1, βˆ‚y/βˆ‚x2    Γ°y = 1

Γ°z7 = βˆ‚y/βˆ‚z7 = 1
Γ°z6 = Γ°z7 * (βˆ’1)
Γ°z4 = Γ°z7 * 1
Γ°z1 += Γ°z7 * 1
Γ°z5 = Γ°z6 * (1 / z5)
Γ°z1 += Γ°z5 * 1
Γ°z2 = βˆ‚y/βˆ‚z2 = Γ°z5 * 1

SLIDE 62

Autodifferentiation

f(x1, x2) = x1^2 + (x1 βˆ’ x2)^a βˆ’ log(x1^2 + x2^2)

β€œstraight line” program:
z1 = x1^2    z2 = x2^2    z3 = x1 βˆ’ x2    z4 = z3^a
z5 = z1 + z2    z6 = log z5    z7 = z1 + z4 βˆ’ z6    y = z7

adjoint: Γ°u = βˆ‚y/βˆ‚u    goals: βˆ‚y/βˆ‚x1, βˆ‚y/βˆ‚x2    Γ°y = 1

Γ°z7 = βˆ‚y/βˆ‚z7 = 1
Γ°z6 = Γ°z7 * (βˆ’1)
Γ°z4 = Γ°z7 * 1
Γ°z1 += Γ°z7 * 1
Γ°z5 = Γ°z6 * (1 / z5)
Γ°z1 += Γ°z5 * 1
Γ°z2 = Γ°z5 * 1
Γ°z3 = βˆ‚y/βˆ‚z3 = Γ°z4 * a * z3^(aβˆ’1)

SLIDE 63

Autodifferentiation

f(x1, x2) = x1^2 + (x1 βˆ’ x2)^a βˆ’ log(x1^2 + x2^2)

β€œstraight line” program:
z1 = x1^2    z2 = x2^2    z3 = x1 βˆ’ x2    z4 = z3^a
z5 = z1 + z2    z6 = log z5    z7 = z1 + z4 βˆ’ z6    y = z7

adjoint: Γ°u = βˆ‚y/βˆ‚u    goals: βˆ‚y/βˆ‚x1, βˆ‚y/βˆ‚x2    Γ°y = 1

Γ°z7 = βˆ‚y/βˆ‚z7 = 1
Γ°z6 = Γ°z7 * (βˆ’1)
Γ°z4 = Γ°z7 * 1
Γ°z1 += Γ°z7 * 1
Γ°z5 = Γ°z6 * (1 / z5)
Γ°z1 += Γ°z5 * 1
Γ°z2 = Γ°z5 * 1
Γ°z3 = Γ°z4 * a * z3^(aβˆ’1)

Γ°x1 += Γ°z1 * 2 x1
Γ°x1 += Γ°z3 * 1
Γ°x2 += Γ°z2 * 2 x2
Γ°x2 += Γ°z3 * (βˆ’1)

SLIDE 64

Autodifferentiation

f(x1, x2) = x1^2 + (x1 βˆ’ x2)^a βˆ’ log(x1^2 + x2^2)

β€œstraight line” program:
z1 = x1^2    z2 = x2^2    z3 = x1 βˆ’ x2    z4 = z3^a
z5 = z1 + z2    z6 = log z5    z7 = z1 + z4 βˆ’ z6    y = z7

adjoint: Γ°u = βˆ‚y/βˆ‚u    goals: βˆ‚y/βˆ‚x1, βˆ‚y/βˆ‚x2    Γ°y = 1

Γ°z7 = βˆ‚y/βˆ‚z7 = 1
Γ°z6 = Γ°z7 * (βˆ’1)
Γ°z4 = Γ°z7 * 1
Γ°z1 += Γ°z7 * 1
Γ°z5 = Γ°z6 * (1 / z5)
Γ°z1 += Γ°z5 * 1
Γ°z2 = Γ°z5 * 1
Γ°z3 = Γ°z4 * a * z3^(aβˆ’1)

Γ°x1 += Γ°z1 * 2 x1
Γ°x1 += Γ°z3 * 1
Γ°x2 += Γ°z2 * 2 x2
Γ°x2 += Γ°z3 * (βˆ’1)

SLIDE 65

Autodifferentiation

f(x1, x2) = x1^2 + (x1 βˆ’ x2)^a βˆ’ log(x1^2 + x2^2)

β€œstraight line” program:
z1 = x1^2    z2 = x2^2    z3 = x1 βˆ’ x2    z4 = z3^a
z5 = z1 + z2    z6 = log z5    z7 = z1 + z4 βˆ’ z6    y = z7

adjoint: Γ°u = βˆ‚y/βˆ‚u    goals: βˆ‚y/βˆ‚x1, βˆ‚y/βˆ‚x2    Γ°y = 1

Γ°z7 = βˆ‚y/βˆ‚z7 = 1
Γ°z6 = Γ°z7 * (βˆ’1)
Γ°z4 = Γ°z7 * 1
Γ°z1 += Γ°z7 * 1
Γ°z5 = Γ°z6 * (1 / z5)
Γ°z1 += Γ°z5 * 1
Γ°z2 = Γ°z5 * 1
Γ°z3 = Γ°z4 * a * z3^(aβˆ’1)

Γ°x1 += Γ°z1 * 2 x1
Γ°x1 += Γ°z3 * 1
Γ°x2 += Γ°z2 * 2 x2
Γ°x2 += Γ°z3 * (βˆ’1)

autodifferentiation in reverse mode

SLIDE 66

Autodifferentiation in Reverse Mode

f(x1, x2) = x1^2 + (x1 βˆ’ x2)^a βˆ’ log(x1^2 + x2^2)

β€œstraight line” program:
z1 = x1^2    z2 = x2^2    z3 = x1 βˆ’ x2    z4 = z3^a
z5 = z1 + z2    z6 = log z5    z7 = z1 + z4 βˆ’ z6    y = z7

adjoint: Γ°u = βˆ‚y/βˆ‚u    goals: βˆ‚y/βˆ‚x1, βˆ‚y/βˆ‚x2    Γ°y = 1

Γ°z7 = βˆ‚y/βˆ‚z7 = 1
Γ°z6 = Γ°z7 * (βˆ’1)
Γ°z4 = Γ°z7 * 1
Γ°z1 += Γ°z7 * 1
Γ°z5 = Γ°z6 * (1 / z5)
Γ°z1 += Γ°z5 * 1
Γ°z2 = Γ°z5 * 1
Γ°z3 = Γ°z4 * a * z3^(aβˆ’1)

Γ°x1 += Γ°z1 * 2 x1
Γ°x1 += Γ°z3 * 1
Γ°x2 += Γ°z2 * 2 x2
Γ°x2 += Γ°z3 * (βˆ’1)

x1 = 2, x2 = 1, a = 1
f(x1 = 2, x2 = 1) β‰ˆ 3.390562

βˆ‡f(x1, x2) = (4.2, βˆ’1.4) by exact gradients
βˆ‡f(x1, x2) = (4.2, βˆ’1.4) by autodiff

SLIDE 67

Code Proof of Autodiff

import numpy

def autodiff(x1, x2, a=1.0):
    z1 = x1**2
    z2 = x2**2
    z3 = (x1 - x2)
    z4 = z3**a
    z5 = z1 + z2
    z6 = numpy.log(z5)
    z7 = z1 + z4 - z6
    y = z7
    dy = 1
    dz7 = dy
    dz6 = dz7 * -1.0
    dz5 = dz6 * 1.0 / z5
    dz4 = dz7 * 1.0
    dz3 = dz4 * a * z3**(a - 1)
    dz2 = dz5 * 1.0
    dz1 = dz7 * 1.0 + dz5 * 1.0
    dx1 = dz1 * 2 * x1 + dz3 * 1.0
    dx2 = dz2 * 2 * x2 + dz3 * -1.0
    return dx1, dx2

>> autodiff(2, 1)
(4.2, -1.4)

>> def f(x1, x2): return x1**2 + (x1 - x2)**1 - numpy.log(x1**2 + x2**2)

SLIDE 68

Code Proof of Autodiff

import numpy

def autodiff(x1, x2, a=1.0):
    # forward pass
    z1 = x1**2
    z2 = x2**2
    z3 = (x1 - x2)
    z4 = z3**a
    z5 = z1 + z2
    z6 = numpy.log(z5)
    z7 = z1 + z4 - z6
    y = z7
    # backward pass
    dy = 1
    dz7 = dy
    dz6 = dz7 * -1.0
    dz5 = dz6 * 1.0 / z5
    dz4 = dz7 * 1.0
    dz3 = dz4 * a * z3**(a - 1)
    dz2 = dz5 * 1.0
    dz1 = dz7 * 1.0 + dz5 * 1.0
    dx1 = dz1 * 2 * x1 + dz3 * 1.0
    dx2 = dz2 * 2 * x2 + dz3 * -1.0
    return dx1, dx2

>> autodiff(2, 1)
(4.2, -1.4)

>> def f(x1, x2): return x1**2 + (x1 - x2)**1 - numpy.log(x1**2 + x2**2)
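As a further check (not part of the slides), a reverse-mode autodiff library should reproduce the same numbers; a sketch with JAX, assuming it is installed:

import jax
import jax.numpy as jnp

def f(x1, x2, a=1.0):
    return x1**2 + (x1 - x2)**a - jnp.log(x1**2 + x2**2)

grad_f = jax.grad(f, argnums=(0, 1))   # reverse-mode gradients w.r.t. x1, x2
print(grad_f(2.0, 1.0))                # (4.2, -1.4), matching autodiff(2, 1)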

SLIDE 69

Outline

Neural networks: non-linear classifiers
Learning weights: backpropagation of error
Autodifferentiation (in reverse mode)

Gradient Descent: Backpropagate the Error

Set t = 0
Pick a starting value ΞΈ_t
Until converged, for example(s) i:
1. Compute loss l on x_i
2. Get gradient g_t = l’(x_i)
3. Get scaling factor ρ_t
4. Set ΞΈ_{t+1} = ΞΈ_t βˆ’ ρ_t * g_t
5. Set t += 1