SLIDE 1

CSE 490 U: Deep Learning Spring 2016

Yejin Choi

Some slides from Carlos Guestrin, Andrew Rosenberg, Luke Zettlemoyer

SLIDE 2

SLIDE 3

SLIDE 4

Human Neurons

  • Switching time: ~0.001 second
  • Number of neurons: 10^10
  • Connections per neuron: 10^4-5
  • Scene recognition time: 0.1 seconds
  • Number of cycles per scene recognition? 100 → much parallel computation!

SLIDE 5

Perceptron as a Neural Network

This is one neuron:

– Input edges x1 … xn, along with a bias
– The sum is represented graphically
– The sum is passed through an activation function g

SLIDE 6

Sigmoid Neuron

Just change g!

  • Why would we want to do this?
  • Notice the new output range [0,1]. What was it before?
  • Look familiar? (See the sketch below.)
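To make the two units concrete, here is a minimal numpy sketch (all variable names and values are illustrative, not from the slides): one neuron computing g(w0 + Σᵢ wᵢxᵢ), with either a step activation (the perceptron) or a sigmoid.

```python
# A minimal sketch of one neuron: output g(w0 + sum_i w_i * x_i).
# Swapping the step function for a sigmoid changes the output range
# from {0, 1} to (0, 1) without changing the wiring.
import numpy as np

def neuron(x, w, w0, g):
    """Weighted sum plus bias, passed through the activation g."""
    return g(w0 + np.dot(w, x))

step = lambda a: float(a > 0)                  # perceptron activation
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))   # sigmoid activation

x = np.array([1.0, 0.0])   # inputs x1, x2
w = np.array([1.0, 1.0])   # weights
print(neuron(x, w, -0.5, step))     # 1.0: the unit fires
print(neuron(x, w, -0.5, sigmoid))  # ~0.62: a soft version of the same decision
```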
SLIDE 7

Optimizing a neuron

We train to minimize sum-squared error:

∂l/∂wi = −Σ_j [y^j − g(w0 + Σ_i wi x_i^j)] · ∂/∂wi g(w0 + Σ_i wi x_i^j)

The solution just depends on g′: the derivative of the activation function! By the chain rule,

∂/∂x f(g(x)) = f′(g(x)) g′(x)

so

∂/∂wi g(w0 + Σ_i wi x_i^j) = x_i^j g′(w0 + Σ_i wi x_i^j)
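As a hedged numpy sketch of this update (the data, learning rate, and names are illustrative assumptions), a single sigmoid neuron can be trained by following the negative of the gradient above:

```python
# Train one sigmoid neuron to minimize sum-squared error by gradient descent.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))    # 100 examples x^j with 3 inputs each
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)   # toy labels y^j

w, w0, lr = np.zeros(3), 0.0, 0.5
for _ in range(200):
    p = sigmoid(w0 + X @ w)      # g(w0 + sum_i w_i x_i^j) for every j
    err = y - p                  # the bracketed term [y^j - g(...)]
    gp = p * (1 - p)             # g'(...) for a sigmoid (see the next slide)
    w += lr * (err * gp) @ X / len(X)    # minus the (averaged) gradient
    w0 += lr * np.mean(err * gp)
print(np.mean((p > 0.5) == y))   # training accuracy
```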

SLIDE 8

Sigmoid units: have to differentiate g

g′(x) = g(x)(1 − g(x))

SLIDE 9

Perceptron, linear classification, Boolean functions: xi ∈ {0,1}

  • Can it learn x1 ∨ x2? Yes: −0.5 + x1 + x2
  • Can it learn x1 ∧ x2? Yes: −1.5 + x1 + x2
  • Can it learn any conjunction or disjunction? Yes: −0.5 + x1 + … + xn for disjunction, (−n + 0.5) + x1 + … + xn for conjunction (checked in the sketch below)
  • Can it learn majority? Yes: (−0.5·n) + x1 + … + xn
  • What are we missing? The dreaded XOR!, etc.
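These constructions are easy to verify by brute force. A small sketch (illustrative, with all input weights equal to 1 and the bias carried in w0) that checks OR, AND, and majority over all binary inputs:

```python
# Enumerate all binary inputs and check each threshold-unit construction.
import itertools

def fires(w0, x):
    return w0 + sum(x) > 0   # unit weights, bias w0

for x1, x2 in itertools.product([0, 1], repeat=2):
    assert fires(-0.5, (x1, x2)) == bool(x1 or x2)    # OR:  -0.5 + x1 + x2
    assert fires(-1.5, (x1, x2)) == bool(x1 and x2)   # AND: -1.5 + x1 + x2

n = 5
for x in itertools.product([0, 1], repeat=n):
    assert fires(-0.5 * n, x) == (sum(x) > n / 2)     # majority
print("OR, AND, and majority all check out")
```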
SLIDE 10

Going beyond linear classification

Solving the XOR problem: y = x1 XOR x2 = (x1 ∧ ¬x2) ∨ (x2 ∧ ¬x1)

v1 = (x1 ∧ ¬x2) = −1.5 + 2x1 − x2
v2 = (x2 ∧ ¬x1) = −1.5 + 2x2 − x1
y = v1 ∨ v2 = −0.5 + v1 + v2

[Network diagram: inputs x1, x2, and a bias unit feed hidden units v1 and v2, which feed the output unit y, with the weights above]
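As a sketch in code (step activations assumed, as in the perceptron slides), the two hidden units detect the conjuncts and the output unit ORs them:

```python
# The two-layer XOR construction from the slide, verified on all inputs.
import itertools

step = lambda a: 1 if a > 0 else 0

def xor_net(x1, x2):
    v1 = step(-1.5 + 2 * x1 - x2)   # v1 = x1 AND NOT x2
    v2 = step(-1.5 + 2 * x2 - x1)   # v2 = x2 AND NOT x1
    return step(-0.5 + v1 + v2)     # y = v1 OR v2

for x1, x2 in itertools.product([0, 1], repeat=2):
    assert xor_net(x1, x2) == (x1 ^ x2)
print("XOR solved with one hidden layer")
```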

SLIDE 11

Hidden layer

  • Single unit:
  • 1-hidden layer:
  • No longer a convex function!
SLIDE 12

Example data for NN with hidden layer

SLIDE 13

Learned weights for hidden layer

SLIDE 14

Why "representation learning"?

  • MaxEnt (multinomial logistic regression): y = softmax(w · f(x, y)); you design the feature vector
  • NNs: y = softmax(w · σ(Ux)), or with n hidden layers, y = softmax(w · σ(U^(n)(… σ(U^(2) σ(U^(1) x))))); feature representations are "learned" through the hidden layers
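A minimal numpy sketch of the one-hidden-layer form (shapes, initialization, and names are illustrative assumptions): the network computes its own representation σ(Ux) where MaxEnt would use a hand-designed f(x, y).

```python
# y = softmax(w . sigma(Ux)): a learned representation feeding a softmax.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # subtract max for numerical stability
    return e / e.sum()

sigma = lambda a: 1.0 / (1.0 + np.exp(-a))   # elementwise nonlinearity

rng = np.random.default_rng(0)
x = rng.normal(size=4)         # raw input, no hand-designed features
U = rng.normal(size=(8, 4))    # first-layer weights (learned in practice)
w = rng.normal(size=(3, 8))    # output weights for 3 classes

y = softmax(w @ sigma(U @ x))
print(y, y.sum())              # a proper distribution over the 3 classes
```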

SLIDE 15

Very deep models in computer vision

SLIDE 16

RECURRENT NEURAL NETWORKS

SLIDE 17

Recurrent Neural Networks (RNNs)

[Diagram: an RNN unrolled over time; inputs feed hidden states h1 … h4, which produce outputs y1 … y4]

  • Each RNN unit computes a new hidden state using the previous state and a new input
  • Each RNN unit (optionally) makes an output using the current hidden state
  • Hidden states are continuous vectors

– Can represent very rich information
– Possibly the entire history from the beginning

  • Parameters are shared (tied) across all RNN units (unlike feedforward NNs)

ht = f(xt, ht−1), ht ∈ R^D
yt = softmax(V ht)

SLIDE 18

Recurrent Neural Networks (RNNs)

  • Generic RNNs: ht = f(xt, ht−1), yt = softmax(V ht)
  • Vanilla RNN: ht = tanh(Uxt + Wht−1 + b), yt = softmax(V ht)
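A hedged numpy sketch of the vanilla RNN above (dimensions and initialization scale are illustrative). Note that the same U, W, b, and V are reused at every time step, exactly as slide 17 says:

```python
# Forward pass of a vanilla RNN: h_t = tanh(U x_t + W h_{t-1} + b),
# y_t = softmax(V h_t), with one shared parameter set across all steps.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

D, E, K = 16, 8, 10   # hidden size, input size, output classes (assumed)
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(D, E))
W = rng.normal(scale=0.1, size=(D, D))
b = np.zeros(D)
V = rng.normal(scale=0.1, size=(K, D))

h = np.zeros(D)                        # h_0
for x_t in rng.normal(size=(5, E)):    # a length-5 toy input sequence
    h = np.tanh(U @ x_t + W @ h + b)   # new state from old state + input
    y_t = softmax(V @ h)               # optional output at this step
print(y_t.shape)                       # (10,): a distribution per step
```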

SLIDE 19

Many uses of RNNs

  • 1. Classification (seq to one)
  • Input: a sequence
  • Output: one label (classification)
  • Example: sentiment classification

ht = f(xt, ht−1)
y = softmax(V hn)
SLIDE 20

Many uses of RNNs

  • 2. One to seq
  • Input: one item
  • Output: a sequence
  • Example: image captioning ("Cat sitting on top of …")

ht = f(xt, ht−1)
yt = softmax(V ht)

SLIDE 21

Many uses of RNNs

  • 3. Sequence tagging
  • Input: a sequence
  • Output: a sequence (of the same length)
  • Example: POS tagging, Named Entity Recognition
  • How about language models?

– Yes! RNNs can be used as LMs!
– RNNs make the Markov assumption: T/F?

ht = f(xt, ht−1)
yt = softmax(V ht)

SLIDE 22

Many uses of RNNs

  • 4. Language models
  • Input: a sequence of words
  • Output: the next word, or a sequence of next words
  • During training, x_t is the actual word in the training sentence.
  • During testing, x_t is the word predicted from the previous time step (see the sketch below).
  • Do RNN LMs make the Markov assumption, i.e., that the next word depends only on the previous N words?

ht = f(xt, ht−1)
yt = softmax(V ht)
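The training/testing distinction can be made concrete with a sketch (names, sizes, and the one-hot encoding are assumptions): the same cell is driven by the corpus word during a training-style pass, and by its own prediction during generation.

```python
# An RNN LM cell driven two ways: by observed words, or by its own output.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

Vocab, D = 10, 16
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(D, Vocab))
W = rng.normal(scale=0.1, size=(D, D))
b = np.zeros(D)
V = rng.normal(scale=0.1, size=(Vocab, D))
onehot = np.eye(Vocab)

def step(word_id, h):
    h = np.tanh(U @ onehot[word_id] + W @ h + b)
    return softmax(V @ h), h           # distribution over the next word

# Training-style pass: x_t is the actual word from the training sentence.
h = np.zeros(D)
for w in [3, 1, 4, 1, 5]:
    y, h = step(w, h)

# Test-style pass: x_t is the word predicted at the previous time step.
h, w = np.zeros(D), 3
for _ in range(5):
    y, h = step(w, h)
    w = int(np.argmax(y))              # feed the prediction back in
print(w)
```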

SLIDE 23

Many uses of RNNs

  • 5. seq2seq (aka "encoder-decoder")
  • Input: a sequence
  • Output: a sequence (of a different length)
  • Examples?

ht = f(xt, ht−1)
yt = softmax(V ht)

SLIDE 24

Many uses of RNNs

  • 5. seq2seq (aka "encoder-decoder")
  • Conversation and dialogue
  • Machine translation

Figure from http://www.wildml.com/category/conversational-agents/
SLIDE 25

Many uses of RNNs

  • 5. seq2seq (aka "encoder-decoder")
  • Parsing! Example: input "John has a dog", output its parse
  • "Grammar as Foreign Language" (Vinyals et al., 2015)
SLIDE 26

Recurrent Neural Networks (RNNs)

  • Generic RNNs: ht = f(xt, ht−1), yt = softmax(V ht)
  • Vanilla RNN: ht = tanh(Uxt + Wht−1 + b), yt = softmax(V ht)

SLIDE 27

Recurrent Neural Networks (RNNs)

  • Generic RNNs: ht = f(xt, ht−1)
  • Vanilla RNNs: ht = tanh(Uxt + Wht−1 + b)
  • LSTMs (Long Short-Term Memory Networks):

ot = σ(U^(o)xt + W^(o)ht−1 + b^(o))
ft = σ(U^(f)xt + W^(f)ht−1 + b^(f))
it = σ(U^(i)xt + W^(i)ht−1 + b^(i))
c̃t = tanh(U^(c)xt + W^(c)ht−1 + b^(c))
ct = ft ⊙ ct−1 + it ⊙ c̃t
ht = ot ⊙ tanh(ct)

where ct is the cell state, ht is the hidden state, and ⊙ is the elementwise product. There are many known variations to this set of equations!
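One step of these equations in numpy, as a hedged sketch (sizes and initialization are illustrative; * stands in for the elementwise product ⊙):

```python
# One LSTM step: gates o, f, i; candidate cell c~; new cell c; hidden h.
import numpy as np

sigma = lambda a: 1.0 / (1.0 + np.exp(-a))

D, E = 8, 4   # hidden/cell size and input size (assumed)
rng = np.random.default_rng(0)
def mats():   # one (U, W, b) triple per gate or candidate
    return (rng.normal(scale=0.1, size=(D, E)),
            rng.normal(scale=0.1, size=(D, D)),
            np.zeros(D))
(Uo, Wo, bo), (Uf, Wf, bf), (Ui, Wi, bi), (Uc, Wc, bc) = (mats() for _ in range(4))

def lstm_step(x, h_prev, c_prev):
    o = sigma(Uo @ x + Wo @ h_prev + bo)          # output gate
    f = sigma(Uf @ x + Wf @ h_prev + bf)          # forget gate
    i = sigma(Ui @ x + Wi @ h_prev + bi)          # input gate
    c_tilde = np.tanh(Uc @ x + Wc @ h_prev + bc)  # new cell content (temp)
    c = f * c_prev + i * c_tilde                  # mix old cell with new
    h = o * np.tanh(c)                            # expose part of the cell
    return h, c

h, c = np.zeros(D), np.zeros(D)
for x in rng.normal(size=(3, E)):   # a length-3 toy sequence
    h, c = lstm_step(x, h, c)
print(h.shape, c.shape)
```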

SLIDE 28

LSTMs (Long Short-Term Memory Networks)

[Diagram: one LSTM unit mapping (ct−1, ht−1) and xt to (ct, ht)]

Figure by Christopher Olah (colah.github.io)

SLIDE 29

LSTMs (Long Short-Term Memory Networks)

Forget gate (forget the past or not): ft = σ(U^(f)xt + W^(f)ht−1 + b^(f))    (sigmoid: [0,1])

Figure by Christopher Olah (colah.github.io)

SLIDE 30

LSTMs (Long Short-Term Memory Networks)

Forget gate (forget the past or not): ft = σ(U^(f)xt + W^(f)ht−1 + b^(f))    (sigmoid: [0,1])
Input gate (use the input or not): it = σ(U^(i)xt + W^(i)ht−1 + b^(i))
New cell content (temp): c̃t = tanh(U^(c)xt + W^(c)ht−1 + b^(c))    (tanh: [−1,1])

Figure by Christopher Olah (colah.github.io)

SLIDE 31

LSTMs (Long Short-Term Memory Networks)

Forget gate (forget the past or not): ft = σ(U^(f)xt + W^(f)ht−1 + b^(f))    (sigmoid: [0,1])
Input gate (use the input or not): it = σ(U^(i)xt + W^(i)ht−1 + b^(i))
New cell content (temp): c̃t = tanh(U^(c)xt + W^(c)ht−1 + b^(c))    (tanh: [−1,1])
New cell content (mix the old cell with the new temp cell): ct = ft ⊙ ct−1 + it ⊙ c̃t

Figure by Christopher Olah (colah.github.io)

SLIDE 32

LSTMs (Long Short-Term Memory Networks)

Forget gate (forget the past or not): ft = σ(U^(f)xt + W^(f)ht−1 + b^(f))
Input gate (use the input or not): it = σ(U^(i)xt + W^(i)ht−1 + b^(i))
New cell content (temp): c̃t = tanh(U^(c)xt + W^(c)ht−1 + b^(c))
New cell content (mix the old cell with the new temp cell): ct = ft ⊙ ct−1 + it ⊙ c̃t
Output gate (output from the new cell or not): ot = σ(U^(o)xt + W^(o)ht−1 + b^(o))
Hidden state: ht = ot ⊙ tanh(ct)

Figure by Christopher Olah (colah.github.io)

SLIDE 33

LSTMs (Long Short-Term Memory Networks)

Forget gate (forget the past or not): ft = σ(U^(f)xt + W^(f)ht−1 + b^(f))
Input gate (use the input or not): it = σ(U^(i)xt + W^(i)ht−1 + b^(i))
Output gate (output from the new cell or not): ot = σ(U^(o)xt + W^(o)ht−1 + b^(o))
New cell content (temp): c̃t = tanh(U^(c)xt + W^(c)ht−1 + b^(c))
New cell content (mix the old cell with the new temp cell): ct = ft ⊙ ct−1 + it ⊙ c̃t
Hidden state: ht = ot ⊙ tanh(ct)

[Diagram: one LSTM unit mapping (ct−1, ht−1) and xt to (ct, ht)]

SLIDE 34

Vanishing gradient problem for RNNs

  • The shading of the nodes in the unfolded network indicates their sensitivity to the inputs at time one (the darker the shade, the greater the sensitivity).
  • The sensitivity decays over time as new inputs overwrite the activations of the hidden layer, and the network 'forgets' the first inputs.

Example from Graves 2012

SLIDE 35

Preservation of gradient information by LSTM

  • For simplicity, all gates are either entirely open ('O') or closed ('—').
  • The memory cell 'remembers' the first input as long as the forget gate is open and the input gate is closed.
  • The sensitivity of the output layer can be switched on and off by the output gate without affecting the cell.

[Figure: an unrolled LSTM annotated with forget, input, and output gate states; example from Graves 2012]

SLIDE 36

Recurrent Neural Networks (RNNs)

  • Generic RNNs: ht = f(xt, ht−1)
  • Vanilla RNNs: ht = tanh(Uxt + Wht−1 + b)
  • GRUs (Gated Recurrent Units):

zt = σ(U^(z)xt + W^(z)ht−1 + b^(z))
rt = σ(U^(r)xt + W^(r)ht−1 + b^(r))
h̃t = tanh(U^(h)xt + W^(h)(rt ⊙ ht−1) + b^(h))
ht = (1 − zt) ⊙ ht−1 + zt ⊙ h̃t

Fewer parameters than LSTMs, and easier to train for comparable performance!
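A companion sketch of one GRU step under the same assumptions as the LSTM sketch above; it needs three (U, W, b) triples where the LSTM needs four, which is the parameter saving just mentioned.

```python
# One GRU step: update gate z, reset gate r, candidate h~, interpolation.
import numpy as np

sigma = lambda a: 1.0 / (1.0 + np.exp(-a))

D, E = 8, 4
rng = np.random.default_rng(0)
def mats():
    return (rng.normal(scale=0.1, size=(D, E)),
            rng.normal(scale=0.1, size=(D, D)),
            np.zeros(D))
(Uz, Wz, bz), (Ur, Wr, br), (Uh, Wh, bh) = (mats() for _ in range(3))

def gru_step(x, h_prev):
    z = sigma(Uz @ x + Wz @ h_prev + bz)                 # update gate
    r = sigma(Ur @ x + Wr @ h_prev + br)                 # reset gate
    h_tilde = np.tanh(Uh @ x + Wh @ (r * h_prev) + bh)   # candidate state
    return (1 - z) * h_prev + z * h_tilde                # old/new mix

h = np.zeros(D)
for x in rng.normal(size=(3, E)):
    h = gru_step(x, h)
print(h.shape)
```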

SLIDE 37

Recursive Neural Networks

  • Sometimes, inference over a tree structure makes more sense than a sequential structure
  • An example of compositionality in ideological bias detection (red → conservative, blue → liberal, gray → neutral), in which modifier phrases and punctuation cause polarity switches at higher levels of the parse tree

Example from Iyyer et al., 2014

SLIDE 38

Recursive Neural Networks

  • NNs connected as a tree
  • Tree structure is fixed a priori
  • Parameters are shared, similarly to RNNs

Example from Iyyer et al., 2014

SLIDE 39

LEARNING: BACKPROPAGATION

SLIDE 40

Error Backpropagation

  • Model parameters: θ = {w^(1)_ij, w^(2)_jk, w^(3)_kl}; for brevity: θ = {wij, wjk, wkl}

[Diagram: a three-layer feedforward network with inputs x0, x1, x2, …, xP, weights wij, wjk, wkl, and output f(x, θ)]

Next 10 slides on backpropagation are adapted from Andrew Rosenberg

SLIDE 41

Error Backpropagation

  • Model parameters: θ = {wij, wjk, wkl}
  • Let a and z be the input and output of each node

[Diagram: the same network with node inputs aj, ak, al and outputs zi, zj, zk, zl marked]

SLIDE 42

Error Backpropagation

aj = Σ_i wij zi
zj = g(aj)

[Diagram: node j receives zi over weight wij, computes aj, and outputs zj over weight wjk]

SLIDE 43

  • Let a and z be the input and output of each node:

aj = Σ_i wij zi,   zj = g(aj)
ak = Σ_j wjk zj,   zk = g(ak)
al = Σ_k wkl zk,   zl = g(al)
SLIDE 44
SLIDE 45

Training: minimize loss

Empirical Risk Function:

R(θ) = (1/N) Σ_n L(yn, f(xn))
     = (1/N) Σ_n ½ (yn − f(xn))²
     = (1/N) Σ_n ½ (yn − g(Σ_k wkl g(Σ_j wjk g(Σ_i wij xn,i))))²

SLIDE 46

SLIDE 47

Error Backpropagation

Optimize last-layer weights wkl using the calculus chain rule:

Ln = ½ (yn − f(xn))²
∂R/∂wkl = (1/N) Σ_n (∂Ln/∂al,n)(∂al,n/∂wkl)
SLIDE 48

Error Backpropagation

Optimize last-layer weights wkl using the calculus chain rule:

∂R/∂wkl = (1/N) Σ_n (∂Ln/∂al,n)(∂al,n/∂wkl)
        = (1/N) Σ_n (∂[½ (yn − g(al,n))²]/∂al,n)(∂al,n/∂wkl)

SLIDE 49

Error Backpropagation

Optimize last-layer weights wkl:

∂R/∂wkl = (1/N) Σ_n (∂[½ (yn − g(al,n))²]/∂al,n)(∂(zk,n wkl)/∂wkl)

SLIDE 50

Error Backpropagation

Optimize last-layer weights wkl:

∂R/∂wkl = (1/N) Σ_n (∂[½ (yn − g(al,n))²]/∂al,n)(∂(zk,n wkl)/∂wkl)
        = (1/N) Σ_n [−(yn − zl,n) g′(al,n)] zk,n

SLIDE 51

Error Backpropagation

Optimize last-layer weights wkl:

∂R/∂wkl = (1/N) Σ_n [−(yn − zl,n) g′(al,n)] zk,n = (1/N) Σ_n δl,n zk,n

where δl,n = −(yn − zl,n) g′(al,n).

SLIDE 52

Error Backpropagation

Repeat for all previous layers:

∂R/∂wkl = (1/N) Σ_n (∂Ln/∂al,n)(∂al,n/∂wkl) = (1/N) Σ_n δl,n zk,n

∂R/∂wjk = (1/N) Σ_n (∂Ln/∂ak,n)(∂ak,n/∂wjk) = (1/N) Σ_n [Σ_l δl,n wkl g′(ak,n)] zj,n = (1/N) Σ_n δk,n zj,n

∂R/∂wij = (1/N) Σ_n (∂Ln/∂aj,n)(∂aj,n/∂wij) = (1/N) Σ_n [Σ_k δk,n wjk g′(aj,n)] zi,n = (1/N) Σ_n δj,n zi,n

SLIDE 53

Backprop Recursion

aj = Σ_i wij zi,   zj = g(aj)

∂R/∂wjk = (1/N) Σ_n [Σ_l δl,n wkl g′(ak,n)] zj,n = (1/N) Σ_n δk,n zj,n
∂R/∂wij = (1/N) Σ_n [Σ_k δk,n wjk g′(aj,n)] zi,n = (1/N) Σ_n δj,n zi,n

[Diagram: forward values zi, zj, zk flow forward through weights wij, wjk, while deltas δk, δj, δi flow backward]

SLIDE 54

Learning: Gradient Descent

wij^(t+1) = wij^t − η ∂R/∂wij
wjk^(t+1) = wjk^t − η ∂R/∂wjk
wkl^(t+1) = wkl^t − η ∂R/∂wkl
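Putting slides 40-54 together, here is a compact, hedged numpy sketch of the whole recipe (data, sizes, and learning rate are illustrative; biases are omitted for brevity): a forward sweep, the δ recursion, and the gradient-descent updates.

```python
# Backprop for a 2-hidden-layer sigmoid network trained on a toy task.
import numpy as np

g = lambda a: 1.0 / (1.0 + np.exp(-a))   # activation
gp = lambda z: z * (1 - z)               # g'(a), written via z = g(a)

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))                              # inputs x_n
y = (np.sin(X.sum(axis=1)) > 0).astype(float)[:, None]    # toy targets y_n
W1, W2, W3 = (rng.normal(scale=0.5, size=s) for s in [(3, 8), (8, 8), (8, 1)])

eta, N = 1.0, len(X)
for _ in range(500):
    # forward sweep: a = weighted sum of z's, z = g(a)
    z1 = g(X @ W1); z2 = g(z1 @ W2); z3 = g(z2 @ W3)
    # delta recursion, from the output layer back to the input
    d3 = -(y - z3) * gp(z3)         # delta_l = -(y_n - z_l) g'(a_l)
    d2 = (d3 @ W3.T) * gp(z2)       # delta_k = [sum_l delta_l w_kl] g'(a_k)
    d1 = (d2 @ W2.T) * gp(z1)
    # dR/dw = (1/N) sum_n delta * z, then w <- w - eta * dR/dw
    W3 -= eta * z2.T @ d3 / N
    W2 -= eta * z1.T @ d2 / N
    W1 -= eta * X.T @ d1 / N
print(np.mean((z3 > 0.5) == y))     # training accuracy
```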

SLIDE 55

Backpropagation

  • Starts with a forward sweep to compute all the intermediate function values
  • Through backprop, computes the partial derivatives recursively
  • A form of dynamic programming

– Instead of considering exponentially many paths between a weight w_ij and the final loss (risk), store and reuse intermediate results.

  • A type of automatic differentiation (there are other variants, e.g., recursive differentiation only through forward propagation).

[Diagram: forward values zi and backward deltas δj combine into ∂R/∂wij]

SLIDE 56

Backpropagation

Frameworks and their primary interface languages:

  • TensorFlow (https://www.tensorflow.org/): Python
  • Torch (http://torch.ch/): Lua
  • Theano (http://deeplearning.net/software/theano/): Python
  • CNTK (https://github.com/Microsoft/CNTK): C++
  • cnn (https://github.com/clab/cnn): C++
  • Caffe (http://caffe.berkeleyvision.org/): C++

SLIDE 57

Cross Entropy Loss (aka log loss, logistic loss)

  • Cross entropy (p: true distribution, q: predicted distribution):

H(p, q) = Ep[−log q] = −Σ_y p(y) log q(y) = H(p) + DKL(p||q)

  • Related quantities:

– Entropy: H(p) = −Σ_y p(y) log p(y)
– KL divergence (the distance between two distributions p and q): DKL(p||q) = Σ_y p(y) log (p(y)/q(y))

  • Use cross entropy for models that should have a more probabilistic flavor (e.g., language models)
  • Use mean squared error, MSE = ½ (y − f(x))², for models that focus on correct/incorrect predictions
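A small numeric check of the identity H(p, q) = H(p) + DKL(p||q), with an illustrative pair of distributions:

```python
# Cross entropy decomposes into entropy plus KL divergence.
import numpy as np

p = np.array([0.7, 0.2, 0.1])   # true distribution
q = np.array([0.5, 0.3, 0.2])   # predicted distribution

H_pq = -(p * np.log(q)).sum()   # cross entropy H(p, q)
H_p = -(p * np.log(p)).sum()    # entropy H(p)
KL = (p * np.log(p / q)).sum()  # KL divergence D_KL(p || q)

print(H_pq, H_p + KL)           # the two values agree
```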

SLIDE 58

RNN Learning: Backprop Through Time (BPTT)

  • Similar to backprop for non-recurrent NNs
  • But unlike feedforward (non-recurrent) NNs, each unit in the computation graph repeats the exact same parameters…
  • Backprop gradients of the parameters of each unit as if they were different parameters
  • When updating the parameters using the gradients, use the average gradients across the entire chain of units (see the sketch below)
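A hedged numpy sketch of this bookkeeping for a bias-free vanilla RNN ht = tanh(Uxt + Wht−1), with toy sizes and a loss only at the final step: the backward sweep computes one gradient of the shared W per unit, and the update uses their average, as described above.

```python
# BPTT: per-unit gradients for the shared W, then one averaged update.
import numpy as np

rng = np.random.default_rng(0)
T, D, E = 5, 6, 3
U = rng.normal(scale=0.3, size=(D, E))
W = rng.normal(scale=0.3, size=(D, D))
xs = rng.normal(size=(T, E))
target = rng.normal(size=D)

# forward sweep, storing every hidden state h_t
hs = [np.zeros(D)]
for t in range(T):
    hs.append(np.tanh(U @ xs[t] + W @ hs[-1]))
loss = 0.5 * np.sum((hs[-1] - target) ** 2)

# backward sweep: the loss gradient flows back through the chain of units
dh = hs[-1] - target
dW_per_unit = []
for t in reversed(range(T)):
    da = dh * (1 - hs[t + 1] ** 2)           # back through tanh
    dW_per_unit.append(np.outer(da, hs[t]))  # "as if W_t were separate"
    dh = W.T @ da                            # pass to the previous unit

W -= 0.1 * np.mean(dW_per_unit, axis=0)      # update with the average
print(loss)
```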

SLIDE 59

Convergence of backprop

  • Without non-linearity or hidden layers, learning is convex optimization

– Gradient descent reaches global minima

  • Multilayer neural nets (with nonlinearity) are not convex

– Gradient descent gets stuck in local minima
– Selecting the number of hidden units and layers = fuzzy process
– NNs have made a HUGE comeback in the last few years

  • Neural nets are back with a new name

– Deep belief networks
– Huge error reduction when trained with lots of data on GPUs

SLIDE 60

Overfitting in NNs

  • Are NNs likely to overfit?

– Yes, they can represent arbitrary functions!!!

  • Avoiding overfitting?

– More training data
– Fewer hidden nodes / better topology
– Random perturbation to the graph topology ("Dropout")
– Regularization
– Early stopping