

slide-1
SLIDE 1

CSE 481: NLP Capstone Spring 2017

Yejin Choi University of Washington

slide-2
SLIDE 2

Office Hour News

  • Hannah:

– Wed 2 - 3pm @ CSE 220

  • Maarten:

– Wed 2 - 3pm @ CSE 220

  • Yejin:

– Tue 2pm - 3:30pm
– Wed 5pm - 5:30pm @ CSE 578

  • All:

– Thu 12pm – 1:25pm @ ??? for some weeks

  • Google doc sign up required

2

slide-3
SLIDE 3

3

slide-4
SLIDE 4

GPU NEWS!

4

slide-5
SLIDE 5

GPU NEWS!

  • 1. Back in stock!

– A desktop with 2 GPUs can be set up for $4,000

  • 2. Microsoft Azure kindly agreed to donate free GPU cycles for the class!!!!!

  • 3. You can sign up for Azure today and get $200 in free credits

5

slide-6
SLIDE 6

RECURRENT NEURAL NETWORKS

slide-7
SLIDE 7

[Diagram: unrolled RNN with hidden states h1–h4 and outputs y1–y4]

Recurrent Neural Networks (RNNs)

  • Each RNN unit computes a new hidden state using the previous

state and a new input

  • Each RNN unit (optionally) makes an output using the current hidden

state

  • Hidden states are continuous vectors

– Can represent very rich information – Possibly the entire history from the beginning

  • Parameters are shared (tied) across all RNN units (unlike feedforward

NNs)

ht = f(xt, ht−1),   ht ∈ R^D
yt = softmax(V ht)
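As a concrete illustration, here is a minimal numpy sketch of this recurrence, instantiated with the vanilla tanh cell introduced on the next slide; the sizes, random weights, and input sequence are made up for the example.

```python
import numpy as np

# Toy sizes, invented for illustration: input, hidden, and output dimensions
D_in, D_h, D_out = 8, 16, 10

rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(D_h, D_in))   # input-to-hidden weights
W = rng.normal(scale=0.1, size=(D_h, D_h))    # hidden-to-hidden weights (shared across time)
b = np.zeros(D_h)
V = rng.normal(scale=0.1, size=(D_out, D_h))  # hidden-to-output weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_forward(xs):
    """Run the RNN over a sequence: h_t = tanh(U x_t + W h_{t-1} + b), y_t = softmax(V h_t)."""
    h = np.zeros(D_h)                 # h_0
    ys = []
    for x_t in xs:
        h = np.tanh(U @ x_t + W @ h + b)
        ys.append(softmax(V @ h))
    return ys

outputs = rnn_forward(rng.normal(size=(4, D_in)))   # a random 4-step input sequence
print(len(outputs), outputs[0].shape)               # 4 distributions over D_out classes
```

Note that the same U, W, b, V are reused at every time step, which is exactly the parameter tying mentioned above.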

slide-8
SLIDE 8

Recurrent Neural Networks (RNNs)

  • Generic RNNs:
  • Vanilla RNN:

ht = f(xt, ht−1)
yt = softmax(V ht)

ht = tanh(Uxt + Wht−1 + b)

[Diagram: unrolled RNN with hidden states h1–h4 and outputs y1–y4]

slide-9
SLIDE 9

Recurrent Neural Networks (RNNs)

  • Generic RNNs:
  • Vanilla RNNs:
  • LSTMs (Long Short-term Memory Networks):

ht = f(xt, ht−1)

[Diagram: unrolled RNN with hidden states h1–h4 and outputs y1–y4]

ot = σ(U^(o) xt + W^(o) ht−1 + b^(o))

ft = σ(U^(f) xt + W^(f) ht−1 + b^(f))

it = σ(U^(i) xt + W^(i) ht−1 + b^(i))

c̃t = tanh(U^(c) xt + W^(c) ht−1 + b^(c))

ct = ft ⊙ ct−1 + it ⊙ c̃t

ht = ot ⊙ tanh(ct)

There are many known variations to this set of equations!

(Vanilla RNN, for comparison: ht = tanh(Uxt + Wht−1 + b))

[Diagram: LSTM chain; c: cell state, h: hidden state]
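A minimal numpy sketch of one LSTM step following these equations; the dimensions, initialization, and toy input sequence are invented for illustration (the elementwise product ⊙ is written `*` below).

```python
import numpy as np

rng = np.random.default_rng(0)
D_in, D_h = 8, 16                    # hypothetical sizes for the example

def init(shape):
    return rng.normal(scale=0.1, size=shape)

# One (U, W, b) triple per gate / candidate, matching the equations above
params = {name: (init((D_h, D_in)), init((D_h, D_h)), np.zeros(D_h))
          for name in ("f", "i", "o", "c")}

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev):
    def gate(name, act):
        U, W, b = params[name]
        return act(U @ x_t + W @ h_prev + b)
    f_t = gate("f", sigmoid)         # forget gate: forget the past or not
    i_t = gate("i", sigmoid)         # input gate: use the input or not
    o_t = gate("o", sigmoid)         # output gate
    c_tilde = gate("c", np.tanh)     # new cell content (temp)
    c_t = f_t * c_prev + i_t * c_tilde
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

h, c = np.zeros(D_h), np.zeros(D_h)
for x_t in rng.normal(size=(5, D_in)):   # toy 5-step input sequence
    h, c = lstm_step(x_t, h, c)
print(h.shape, c.shape)
```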

slide-10
SLIDE 10

Many uses of RNNs

  • Input: a sequence
  • Output: one label (classification)
  • Example: sentiment classification

ht = f(xt, ht−1)

[Diagram: unrolled RNN with hidden states h1–h4 and outputs y1–y4]

y = softmax(V hn)

  • 1. Classification (seq to one)
slide-11
SLIDE 11
  • 2. one to seq
  • Input: one item
  • Output: a sequence
  • Example: Image captioning

ht = f(xt, ht−1)
yt = softmax(V ht)

[Diagram: an image is encoded and then decoded word by word by an RNN, e.g. "Cat sitting on top of …"]

Many uses of RNNs

slide-12
SLIDE 12
  • 3. sequence tagging
  • Input: a sequence
  • Output: a sequence (of the same length)
  • Example: POS tagging, Named Entity Recognition
  • How about Language Models?

– Yes! RNNs can be used as LMs! – RNNs make the Markov assumption: T/F?

ht = f(xt, ht−1)
yt = softmax(V ht)

[Diagram: unrolled RNN with hidden states h1–h4 and an output tag y at every position]

Many uses of RNNs

slide-13
SLIDE 13
  • 4. Language models
  • Input: a sequence of words
  • Output: one next word
  • Output: or a sequence of next words
  • During training, x_t is the actual word in the training sentence.
  • During testing, x_t is the word predicted from the previous time

step.

  • Do RNN LMs make the Markov assumption?

– i.e., the next word depends only on the previous N words

[Diagram: unrolled RNN language model with hidden states h1–h4 and outputs y1–y4]

Many uses of RNNs

ht = f(xt, ht−1) yt = softmax(V ht)
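To make the training-time vs. test-time difference concrete, here is a small self-contained sketch of an RNN LM; the toy vocabulary, random untrained weights, and greedy decoding are all assumptions of the example.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["<s>", "the", "dog", "barks", "</s>"]      # toy vocabulary for the example
V_size, D_h = len(vocab), 12
E = rng.normal(scale=0.1, size=(V_size, D_h))       # word embeddings (the inputs x_t)
U = rng.normal(scale=0.1, size=(D_h, D_h))
W = rng.normal(scale=0.1, size=(D_h, D_h))
b = np.zeros(D_h)
V_out = rng.normal(scale=0.1, size=(V_size, D_h))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def step(word_id, h):
    h = np.tanh(U @ E[word_id] + W @ h + b)
    return h, softmax(V_out @ h)

# Training: x_t is the actual word in the training sentence (teacher forcing)
sentence = [0, 1, 2, 3, 4]                          # "<s> the dog barks </s>"
h, nll = np.zeros(D_h), 0.0
for x_t, target in zip(sentence[:-1], sentence[1:]):
    h, y = step(x_t, h)
    nll += -np.log(y[target])                       # cross-entropy at this position

# Testing / generation: x_t is the word predicted at the previous time step
h, word_id, generated = np.zeros(D_h), 0, []
for _ in range(5):
    h, y = step(word_id, h)
    word_id = int(np.argmax(y))                     # greedy choice; sampling also works
    generated.append(vocab[word_id])
print(nll, generated)
```

Because ht summarizes the entire prefix, nothing here truncates the history to the previous N words, which is the point of the Markov-assumption question above.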

slide-14
SLIDE 14
  • 5. seq2seq (aka “encoder-decoder”)
  • Input: a sequence
  • Output: a sequence (of different length)
  • Examples?

ht = f(xt, ht−1)
yt = softmax(V ht)

[Diagram: encoder RNN (h1–h4) followed by decoder RNN producing y1–y3]

Many uses of RNNs

slide-15
SLIDE 15

Many uses of RNNs

  • 5. seq2seq (aka "encoder-decoder")

Figure from http://www.wildml.com/category/conversational-agents/

  • Conversation and Dialogue
  • Machine Translation
slide-16
SLIDE 16

Many uses of RNNs

  • 5. seq2seq (aka "encoder-decoder")

[Diagram: seq2seq encoder over "John has a dog", decoder emitting a linearized parse tree]

Parsing!

  • “Grammar as Foreign Language” (Vinyals et al., 2015)
slide-17
SLIDE 17

Recurrent Neural Networks (RNNs)

  • Generic RNNs:
  • Vanilla RNN:

ht = f(xt, ht−1)
yt = softmax(V ht)

ht = tanh(Uxt + Wht−1 + b)

[Diagram: unrolled RNN with hidden states h1–h4 and outputs y1–y4]

slide-18
SLIDE 18

Recurrent Neural Networks (RNNs)

  • Generic RNNs:
  • Vanilla RNNs:
  • LSTMs (Long Short-term Memory Networks):

ht = f(xt, ht−1)

[Diagram: unrolled RNN with hidden states h1–h4 and outputs y1–y4]

ot = σ(U^(o) xt + W^(o) ht−1 + b^(o))

ft = σ(U^(f) xt + W^(f) ht−1 + b^(f))

it = σ(U^(i) xt + W^(i) ht−1 + b^(i))

c̃t = tanh(U^(c) xt + W^(c) ht−1 + b^(c))

ct = ft ⊙ ct−1 + it ⊙ c̃t

ht = ot ⊙ tanh(ct)

There are many known variations to this set of equations!

(Vanilla RNN, for comparison: ht = tanh(Uxt + Wht−1 + b))

[Diagram: LSTM chain; c: cell state, h: hidden state]

slide-19
SLIDE 19

LSTMS (LONG SHORT-TERM MEMORY NETWORKS)

[Diagram: LSTM cell with inputs c_{u−1}, h_{u−1} and outputs c_u, h_u]

Figure by Christopher Olah (colah.github.io)

slide-20
SLIDE 20

LSTMS (LONG SHORT-TERM MEMORY NETWORKS)

sigmoid: [0,1]
ft = σ(U^(f) xt + W^(f) ht−1 + b^(f))   (Forget gate: forget the past or not)

Figure by Christopher Olah (colah.github.io)

slide-21
SLIDE 21

LSTMS (LONG SHORT-TERM MEMORY NETWORKS)

sigmoid: [0,1]; tanh: [-1,1]
ft = σ(U^(f) xt + W^(f) ht−1 + b^(f))   (Forget gate: forget the past or not)
it = σ(U^(i) xt + W^(i) ht−1 + b^(i))   (Input gate: use the input or not)
c̃t = tanh(U^(c) xt + W^(c) ht−1 + b^(c))   (New cell content, temp)

Figure by Christopher Olah (colah.github.io)

slide-22
SLIDE 22

LSTMS (LONG SHORT-TERM MEMORY NETWORKS)

sigmoid: [0,1]; tanh: [-1,1]

ft = σ(U^(f) xt + W^(f) ht−1 + b^(f))   (Forget gate: forget the past or not)
it = σ(U^(i) xt + W^(i) ht−1 + b^(i))   (Input gate: use the input or not)
c̃t = tanh(U^(c) xt + W^(c) ht−1 + b^(c))   (New cell content, temp)
ct = ft ⊙ ct−1 + it ⊙ c̃t   (New cell content: mix old cell with the new temp cell)

Figure by Christopher Olah (colah.github.io)

slide-23
SLIDE 23

LSTMS (LONG SHORT-TERM MEMORY NETWORKS)

ft = σ(U^(f) xt + W^(f) ht−1 + b^(f))   (Forget gate: forget the past or not)
it = σ(U^(i) xt + W^(i) ht−1 + b^(i))   (Input gate: use the input or not)
c̃t = tanh(U^(c) xt + W^(c) ht−1 + b^(c))   (New cell content, temp)
ct = ft ⊙ ct−1 + it ⊙ c̃t   (New cell content: mix old cell with the new temp cell)
ot = σ(U^(o) xt + W^(o) ht−1 + b^(o))   (Output gate: output from the new cell or not)
ht = ot ⊙ tanh(ct)   (Hidden state)

Figure by Christopher Olah (colah.github.io)

slide-24
SLIDE 24

LSTMS (LONG SHORT-TERM MEMORY NETWORKS)

it = σ(U^(i) xt + W^(i) ht−1 + b^(i))   (Input gate: use the input or not)
ft = σ(U^(f) xt + W^(f) ht−1 + b^(f))   (Forget gate: forget the past or not)
ot = σ(U^(o) xt + W^(o) ht−1 + b^(o))   (Output gate: output from the new cell or not)
c̃t = tanh(U^(c) xt + W^(c) ht−1 + b^(c))   (New cell content, temp)
ct = ft ⊙ ct−1 + it ⊙ c̃t   (New cell content: mix old cell with the new temp cell)
ht = ot ⊙ tanh(ct)   (Hidden state)

[Diagram: LSTM cell with inputs c_{u−1}, h_{u−1} and outputs c_u, h_u]

slide-25
SLIDE 25

Vanishing gradient problem for RNNs

  • The shading of the nodes in the unfolded network indicates their

sensitivity to the inputs at time one (the darker the shade, the greater the sensitivity).

  • The sensitivity decays over time as new inputs overwrite the activations of the hidden layer, and the network 'forgets' the first inputs.

Example from Graves 2012

slide-26
SLIDE 26

Preservation of gradient information by LSTM

  • For simplicity, all gates are either entirely open (‘O’) or closed (‘—’).
  • The memory cell 'remembers' the first input as long as the forget gate is open and the input gate is closed.
  • The sensitivity of the output layer can be switched on and off by the output

gate without affecting the cell.

Forget gate Input gate Output gate Example from Graves 2012

slide-27
SLIDE 27

Recurrent Neural Networks (RNNs)

  • Generic RNNs:
  • Vanilla RNNs:
  • GRUs (Gated Recurrent Units):

ht = f(xt, ht−1)

[Diagram: unrolled RNN with hidden states h1–h4 and outputs y1–y4]

zt = σ(U^(z) xt + W^(z) ht−1 + b^(z))   (z: update gate)
rt = σ(U^(r) xt + W^(r) ht−1 + b^(r))   (r: reset gate)
h̃t = tanh(U^(h) xt + W^(h)(rt ⊙ ht−1) + b^(h))
ht = (1 − zt) ⊙ ht−1 + zt ⊙ h̃t

Fewer parameters than LSTMs. Easier to train for comparable performance!

(Vanilla RNN, for comparison: ht = tanh(Uxt + Wht−1 + b))
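A minimal numpy sketch of one GRU step matching the update-gate/reset-gate equations above; sizes and initialization are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
D_in, D_h = 8, 16                    # hypothetical sizes

def init(shape):
    return rng.normal(scale=0.1, size=shape)

params = {name: (init((D_h, D_in)), init((D_h, D_h)), np.zeros(D_h))
          for name in ("z", "r", "h")}

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev):
    Uz, Wz, bz = params["z"]
    Ur, Wr, br = params["r"]
    Uh, Wh, bh = params["h"]
    z_t = sigmoid(Uz @ x_t + Wz @ h_prev + bz)            # update gate
    r_t = sigmoid(Ur @ x_t + Wr @ h_prev + br)            # reset gate
    h_tilde = np.tanh(Uh @ x_t + Wh @ (r_t * h_prev) + bh)
    return (1.0 - z_t) * h_prev + z_t * h_tilde           # interpolate old and new state

h = np.zeros(D_h)
for x_t in rng.normal(size=(5, D_in)):                    # toy 5-step sequence
    h = gru_step(x_t, h)
print(h.shape)
```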

slide-28
SLIDE 28

Gates

  • Gates contextually control information

flow

  • Open/close with sigmoid
  • In LSTMs and GRUs, they are used to

(contextually) maintain longer term history

28

slide-29
SLIDE 29

Bi-directional RNNs

29

  • Can incorporate context from both directions
  • Generally improves over uni-directional RNNs
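A minimal sketch of the idea, assuming a simple tanh RNN cell (a detail not specified on the slide): one pass runs left-to-right and another right-to-left with separate parameters, and the two hidden states are concatenated at each position.

```python
import numpy as np

rng = np.random.default_rng(0)
D_in, D_h = 8, 16                                    # toy sizes for the example
U_f, W_f = rng.normal(scale=0.1, size=(D_h, D_in)), rng.normal(scale=0.1, size=(D_h, D_h))
U_b, W_b = rng.normal(scale=0.1, size=(D_h, D_in)), rng.normal(scale=0.1, size=(D_h, D_h))

def run(xs, U, W):
    """Plain tanh RNN over xs, returning the hidden state at every position."""
    h, states = np.zeros(D_h), []
    for x_t in xs:
        h = np.tanh(U @ x_t + W @ h)
        states.append(h)
    return states

xs = rng.normal(size=(5, D_in))
forward  = run(xs, U_f, W_f)                         # left-to-right pass
backward = run(xs[::-1], U_b, W_b)[::-1]             # right-to-left pass, re-aligned
h_bi = [np.concatenate([f, b]) for f, b in zip(forward, backward)]
print(h_bi[0].shape)                                 # each position now sees both contexts
```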
slide-30
SLIDE 30

Google NMT (Oct 2016)

slide-31
SLIDE 31

Recursive Neural Networks

  • Sometimes, inference over a tree structure makes more sense

than sequential structure

  • An example of compositionality in ideological bias detection

(red → conservative, blue → liberal, gray → neutral) in which modifier phrases and punctuation cause polarity switches at higher levels of the parse tree

Example from Iyyer et al., 2014

slide-32
SLIDE 32

Recursive Neural Networks

  • NNs connected as a tree
  • Tree structure is fixed a priori
  • Parameters are shared, similarly as RNNs

Example from Iyyer et al., 2014

slide-33
SLIDE 33

Tree LSTMs

33

  • Are tree LSTMs more

expressive than sequence LSTMs?

  • I.e., recursive vs recurrent
  • When Are Tree Structures

Necessary for Deep Learning of Representations? Jiwei Li, Minh-Thang Luong, Dan Jurafsky and Eduard Hovy. EMNLP, 2015.

slide-34
SLIDE 34

Neural Probabilistic Language Model (Bengio 2003)

34

slide-35
SLIDE 35

Neural Probabilistic Language Model (Bengio 2003)

35

NNDMLP1(x) = [tanh(x W1 + b1), x] W2 + b2

W1 ∈ R^(din×dhid), b1 ∈ R^(1×dhid): first affine transformation
W2 ∈ R^((dhid+din)×dout), b2 ∈ R^(1×dout): second affine transformation

  • Each word prediction is a

separate feed forward neural network

  • Feedforward NNLM is a

Markovian language model

  • Dashed lines show optional direct connections
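A minimal numpy sketch of this feedforward block; the optional direct connection is realized by concatenating x back in before the second affine layer, the sizes are invented, and x stands for the concatenated embeddings of the context words.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, d_out = 6, 8, 5                 # invented sizes for illustration
W1, b1 = rng.normal(scale=0.1, size=(d_in, d_hid)), np.zeros(d_hid)
W2, b2 = rng.normal(scale=0.1, size=(d_hid + d_in, d_out)), np.zeros(d_out)

def nn_mlp1(x):
    """[tanh(x W1 + b1), x] W2 + b2 : one hidden layer plus a direct connection from x."""
    hidden = np.tanh(x @ W1 + b1)
    return np.concatenate([hidden, x]) @ W2 + b2

x = rng.normal(size=d_in)                    # e.g. concatenated context-word embeddings
print(nn_mlp1(x).shape)                      # scores over d_out next-word candidates
```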

slide-36
SLIDE 36

LEARNING: BACKPROPAGATION

slide-37
SLIDE 37

Error Backpropagation

  • Model parameters: θ = {w^(1)_ij, w^(2)_jk, w^(3)_kl}

for brevity: θ = {wij, wjk, wkl}

[Diagram: feedforward network with inputs x0, x1, x2, …, xP, weight layers w^(1)_ij, w^(2)_jk, w^(3)_kl, and output f(x, θ)]

Next 10 slides on back propagation are adapted from Andrew Rosenberg

slide-38
SLIDE 38

Learning: Gradient Descent

[Diagram: feedforward network with inputs x0…xP, intermediate quantities a and z at each layer, weights wij, wjk, wkl, and output f(x, θ)]

w^{t+1}_ij = w^t_ij − η ∂R/∂wij
w^{t+1}_jk = w^t_jk − η ∂R/∂wjk
w^{t+1}_kl = w^t_kl − η ∂R/∂wkl

slide-39
SLIDE 39

Backpropagation

  • Starts with a forward sweep to compute all the intermediate function

values

  • Through backprop, computes the partial derivatives recursively
  • A form of dynamic programming

– Instead of considering exponentially many paths between a weight w_ij and the final loss (risk), store and reuse intermediate results.

  • A type of automatic differentiation. (There are other variants, e.g., recursive differentiation only through forward propagation.)

[Diagram: the forward pass computes values z_i; the backward pass computes deltas δ_j and the gradients ∂R/∂wij]

slide-40
SLIDE 40

Backpropagation

  • TensorFlow (https://www.tensorflow.org/) – primary interface language: Python
  • Torch (http://torch.ch/) – primary interface language: Lua
  • Theano (http://deeplearning.net/software/theano/) – primary interface language: Python
  • CNTK (https://github.com/Microsoft/CNTK) – primary interface language: C++
  • cnn (https://github.com/clab/cnn) – primary interface language: C++
  • Caffe (http://caffe.berkeleyvision.org/) – primary interface language: C++

slide-41
SLIDE 41

Cross Entropy Loss (aka log loss, logistic loss)

  • Cross Entropy
  • Related quantities

– Entropy – KL divergence (the distance between two distributions p and q)

  • Use Cross Entropy for models that should have more probabilistic

flavor (e.g., language models)

  • Use Mean Squared Error loss for models that focus on correct/

incorrect predictions

H(p, q) = Ep[−log q] = H(p) + DKL(p‖q)

H(p, q) = −Σy p(y) log q(y)

H(p) = −Σy p(y) log p(y)

DKL(p‖q) = Σy p(y) log(p(y)/q(y))

MSE = ½ (y − f(x))²

(q: predicted probability, p: true probability)
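A quick numeric check of these identities; p and q below are made-up toy distributions.

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])      # "true" distribution (a one-hot vector in practice)
q = np.array([0.5, 0.3, 0.2])      # predicted distribution

entropy   = -np.sum(p * np.log(p))             # H(p)
cross_ent = -np.sum(p * np.log(q))             # H(p, q)
kl        =  np.sum(p * np.log(p / q))         # D_KL(p || q)
print(np.isclose(cross_ent, entropy + kl))     # H(p, q) = H(p) + D_KL(p || q) -> True

y, f_x = 1.0, 0.8
mse = 0.5 * (y - f_x) ** 2                     # squared-error loss for one prediction
print(cross_ent, kl, mse)
```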

slide-42
SLIDE 42

RNN Learning: Backprop Through Time (BPTT)

  • Similar to backprop with non-recurrent NNs
  • But unlike feedforward (non-recurrent) NNs, each unit in

the computation graph repeats the exact same parameters…

  • Backprop gradients of the parameters of each unit as if

they are different parameters

  • When updating the parameters using the gradients, use

the average gradients throughout the entire chain of units.

[Diagram: unrolled RNN with the same parameters repeated at every time step]

slide-43
SLIDE 43

LEARNING: TRAINING DEEP NETWORKS

slide-44
SLIDE 44

Vanishing / exploding Gradients

  • Deep networks are hard to train
  • Gradients go through multiple layers
  • The multiplicative effect tends to lead to

exploding or vanishing gradients

  • Practical solutions w.r.t.

– network architecture – numerical operations

44

slide-45
SLIDE 45

Vanishing / exploding Gradients

  • Practical solutions w.r.t. network

architecture

– Add skip connections to reduce distance

  • Residual networks, highway networks, …

– Add gates (and memory cells) to allow longer term memory

  • LSTMs, GRUs, memory networks, …

45

slide-46
SLIDE 46

Gradients of deep networks

NNlayer(x) = ReLU(xW1 + b1)

[Diagram: deep stack of layers x → h1 → h2 → … → hn−1 → hn]

  • Can have similar issues with vanishing gradients.

∂L/∂h_{n−1,j_{n−1}} = Σ_{j_n} 1(h_{n,j_n} > 0) W_{j_{n−1},j_n} ∂L/∂h_{n,j_n}

46

Diagram borrowed from Alex Rush

slide-47
SLIDE 47

47

Effects of Skip Connections on Gradients

  • Thought Experiment: Additive Skip-Connections

NNsl1(x) = ½ ReLU(x W1 + b1) + ½ x

[Diagram: deep stack of layers x → h1 → h2 → h3 → … → hn−1 → hn]

∂L/∂h_{n−1,j_{n−1}} = ½ ( Σ_{j_n} 1(h_{n,j_n} > 0) W_{j_{n−1},j_n} ∂L/∂h_{n,j_n} ) + ½ ∂L/∂h_{n,j_{n−1}}

Diagram borrowed from Alex Rush

slide-48
SLIDE 48

48

Effects of Skip Connections on Gradients

  • Thought Experiment: Dynamic Skip-Connections

NNsl2(x) = (1 − t) ReLU(x W1 + b1) + t x

t = σ(x Wt + bt)

W1 ∈ R^(dhid×dhid), Wt ∈ R^(dhid×1)

[Diagram: deep stack of layers x → h1 → h2 → h3 → … → hn−1 → hn]

Diagram borrowed from Alex Rush

slide-49
SLIDE 49

Highway Network (Srivastava et al., 2015)

  • A plain feedforward neural network:

– H is a typical affine transformation followed by a non-linear activation

  • Highway network:

– T is a “transform gate” – C is a “carry gate” – Often C = 1 – T for simplicity

49

y = H(x, WH).

y = H(x, WH)· T(x, WT) + x · C(x, WC).
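A minimal numpy sketch of one highway layer under the common C = 1 − T simplification; the layer width, the tanh choice for H, and the negative transform-gate bias (so the layer carries its input by default) are assumptions of the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                                           # hypothetical layer width
W_H, b_H = rng.normal(scale=0.1, size=(d, d)), np.zeros(d)
W_T, b_T = rng.normal(scale=0.1, size=(d, d)), np.full(d, -1.0)   # carry by default

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x):
    H = np.tanh(x @ W_H + b_H)                   # transformed input
    T = sigmoid(x @ W_T + b_T)                   # transform gate
    return H * T + x * (1.0 - T)                 # carry gate C = 1 - T

x = rng.normal(size=d)
print(highway_layer(x).shape)
```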

slide-50
SLIDE 50

Residual Networks

  • ResNet (He et al. 2015): first very deep (152 layers)

network successfully trained for object recognition

50

  • Plain net: any two stacked layers (weight layer → relu → weight layer → relu)
  • Residual net: the same two stacked layers plus an identity connection that adds the block input back to its output

[Diagram: plain block vs. residual block with identity shortcut]

slide-51
SLIDE 51

Residual Networks

51

  • Plain net: any two stacked layers (weight layer → relu → weight layer → relu)
  • Residual net: the same two stacked layers plus an identity connection that adds the block input back to its output

[Diagram: plain block vs. residual block with identity shortcut]

  • F(x) is a residual mapping with respect to identity
  • Direct input connection +x leads to a nice property w.r.t. back

propagation --- more direct influence from the final loss to any deep layer

  • In contrast, LSTMs & Highway networks allow for long distance

input connection only through “gates”.
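A minimal numpy sketch of one residual block of the form F(x) + x; the two-layer ReLU form of F and the toy width are illustration choices, not details from the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                                           # hypothetical feature width
W1, b1 = rng.normal(scale=0.1, size=(d, d)), np.zeros(d)
W2, b2 = rng.normal(scale=0.1, size=(d, d)), np.zeros(d)

def relu(z):
    return np.maximum(0.0, z)

def residual_block(x):
    """Two stacked weight layers compute a residual F(x); the identity path adds x back."""
    F = relu(x @ W1 + b1) @ W2 + b2
    return relu(F + x)

x = rng.normal(size=d)
print(residual_block(x).shape)
```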

slide-52
SLIDE 52

Residual Networks

Revolution of Depth

[Figure: layer-by-layer architecture diagrams of AlexNet, 8 layers (ILSVRC 2012); VGG, 19 layers (ILSVRC 2014); and GoogleNet, 22 layers (ILSVRC 2014)]

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.

52

slide-53
SLIDE 53

Residual Networks

Revolution of Depth

[Figure: layer-by-layer architecture diagrams of AlexNet, 8 layers (ILSVRC 2012); VGG, 19 layers (ILSVRC 2014); and ResNet, 152 layers (ILSVRC 2015)]

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.

53

slide-54
SLIDE 54

Residual Networks

Revolution of Depth

ImageNet classification top-5 error (%):
– ILSVRC'10: 28.2 (shallow)
– ILSVRC'11: 25.8 (shallow)
– ILSVRC'12 AlexNet: 16.4 (8 layers)
– ILSVRC'13: 11.7 (8 layers)
– ILSVRC'14 VGG: 7.3 (19 layers)
– ILSVRC'14 GoogleNet: 6.7 (22 layers)
– ILSVRC'15 ResNet: 3.57 (152 layers)

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.

54

slide-55
SLIDE 55

Highway Network (Srivastava et al., 2015)

  • A plain feedforward neural network:

– H is a typical affine transformation followed by a non-linear activation

  • Highway network:

– T is a “transform gate” – C is a “carry gate” – Often C = 1 – T for simplicity

55

y = H(x, WH).

y = H(x, WH)· T(x, WT) + x · C(x, WC).

slide-56
SLIDE 56

@Schmidhubered

56

slide-57
SLIDE 57

Vanishing / exploding Gradients

  • Practical solutions w.r.t. numerical operations

– Gradient Clipping: bound gradients by a max value
– Gradient Normalization: renormalize gradients when they are above a fixed norm
– Careful initialization, smaller learning rates
– Avoid saturating nonlinearities (like tanh, sigmoid)

  • ReLU or hard-tanh instead

57
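A minimal sketch of the first two tricks, gradient clipping and gradient (norm) renormalization; the thresholds are illustrative values, not recommendations from the slide.

```python
import numpy as np

def clip_and_normalize(grads, max_value=5.0, max_norm=5.0):
    """Clip each gradient entry to [-max_value, max_value], then rescale all
    gradients jointly if their global norm exceeds max_norm."""
    clipped = [np.clip(g, -max_value, max_value) for g in grads]
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
    if total_norm > max_norm:
        clipped = [g * (max_norm / total_norm) for g in clipped]
    return clipped

grads = [np.array([12.0, -0.3]), np.array([[0.5, -7.0], [2.0, 0.1]])]
print(clip_and_normalize(grads))
```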

slide-58
SLIDE 58

Sigmoid

  • Often used for gates
  • Pro: neuron-like,

differentiable

  • Con: gradients saturate to

zero almost everywhere except x near zero => vanishing gradients

  • Batch normalization helps

58

σ(x) = 1 / (1 + e^(−x))
σ′(x) = σ(x)(1 − σ(x))

slide-59
SLIDE 59

Tanh

  • Often used for

hidden states & cells in RNNs, LSTMs

  • Pro: differentiable, often converges faster than sigmoid

  • Con: gradients easily

saturate to zero => vanishing gradients

59

tanh(x) = 2σ(2x) − 1
tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))
tanh′(x) = 1 − tanh²(x)

slide-60
SLIDE 60

Hard Tanh

hardtanh(t) = −1 if t < −1;   t if −1 ≤ t ≤ 1;   1 if t > 1

60

  • Pro: computationally

cheaper

  • Con: saturates to zero easily, not differentiable at 1, −1

slide-61
SLIDE 61

ReLU

  • Pro: doesn’t saturate for

x > 0, computationally cheaper, induces sparse NNs

  • Con: non-differentiable

at 0

  • Used widely in deep

NN, but not as much in RNNs

  • We informally use

subgradients:

61

ReLU(x) = max(0, x)

d ReLU(x)/dx = 1 if x > 0;   0 if x < 0;   1 or 0 (subgradient) at x = 0
slide-62
SLIDE 62

Vanishing / exploding Gradients

  • Practical solutions w.r.t. numerical operations

– Gradient Clipping: bound gradients by a max value
– Gradient Normalization: renormalize gradients when they are above a fixed norm
– Careful initialization, smaller learning rates
– Avoid saturating nonlinearities (like tanh, sigmoid)

  • ReLU or hard-tanh instead

– Batch Normalization: add intermediate input normalization layers

62

slide-63
SLIDE 63

Batch Normalization

63

slide-64
SLIDE 64

Regularization

  • Regularization by objective term

– Modify loss with L1 or L2 norms

  • Less depth, smaller hidden states, early stopping
  • Dropout

– Randomly delete parts of the network during training
– Each node (and its corresponding incoming and outgoing edges) dropped with a probability p
– p is higher for internal nodes, lower for input nodes
– The full network is used for testing
– Faster training, better results
– Vs. Bagging

64

L(θ) = Σ_{i=1..n} max{0, 1 − (ŷc − ŷc′)} + λ‖θ‖²
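A minimal sketch of (inverted) dropout; the keep probability and the 1/keep_prob rescaling are standard conventions assumed here so that the full network can be used unchanged at test time, as the slide notes.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, keep_prob=0.5, train=True):
    """Randomly zero units during training; rescale so test time needs no change."""
    if not train:
        return h                                   # full network used for testing
    mask = rng.random(h.shape) < keep_prob         # keep each node with prob. keep_prob
    return h * mask / keep_prob

h = rng.normal(size=8)
print(dropout(h, keep_prob=0.5, train=True))
print(dropout(h, train=False))                     # identity at test time
```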

slide-65
SLIDE 65

Convergence of backprop

  • Without non-linearity or hidden layers, learning is

convex optimization

– Gradient descent reaches the global minimum

  • Multilayer neural nets (with nonlinearity) are not convex

– Gradient descent gets stuck in local minima – Selecting number of hidden units and layers = fuzzy process – NNs have made a HUGE comeback in the last few years

  • Neural nets are back with a new name

– Deep belief networks – Huge error reduction when trained with lots of data on GPUs

slide-66
SLIDE 66

RECAP

slide-67
SLIDE 67

Vanishing / exploding Gradients

  • Deep networks are hard to train
  • Gradients go through multiple layers
  • The multiplicative effect tends to lead to

exploding or vanishing gradients

  • Practical solutions w.r.t.

– network architecture – numerical operations

67

slide-68
SLIDE 68

Vanishing / exploding Gradients

  • Practical solutions w.r.t. network

architecture

– Add skip connections to reduce distance

  • Residual networks, highway networks, …

– Add gates (and memory cells) to allow longer term memory

  • LSTMs, GRUs, memory networks, …

68

slide-69
SLIDE 69

seq2seq (aka “encoder-decoder”)

ht = f(xt, ht−1) yt = softmax(V ht)

slide-70
SLIDE 70

Google NMT (Oct 2016)

slide-71
SLIDE 71

ATTENTION!

slide-72
SLIDE 72

Seq-to-Seq with Attention

Diagram from http://distill.pub/2016/augmented-rnns/ 72

slide-73
SLIDE 73

Seq-to-Seq with Attention

Diagram from http://distill.pub/2016/augmented-rnns/ 73

slide-74
SLIDE 74

Trial: Hard Attention

74

  • At each step generating the target word
  • Compute the best alignment to the source word
  • And incorporate the source word to generate the target

word

  • Contextual hard alignment. How?
  • Problem?

s^t_i: target (decoder) state,   s^s_j: source (encoder) state

z_j = tanh([s^t_i, s^s_j] W + b)
j = argmax_j z_j
w^t_{i+1} = argmax_w O(w, s^t_{i+1}, s^s_j)

slide-75
SLIDE 75

Encoder – Decoder Architecture

Sequence-to-Sequence

[Diagram: encoder states s^s_1…s^s_3 over the source "the red dog"; decoder states s^t_1…s^t_3 produce ŷ1, ŷ2, ŷ3 ("the red dog"), with decoder inputs starting from <s>]

75

Diagram borrowed from Alex Rush

slide-76
SLIDE 76

Attention: Soft Alignments

76

  • At each step generating the target word
  • Compute the attention to the source sequence
  • And incorporate the attention to generate the target

word

  • Contextual attention as soft alignment. How?

– Step-1: compute the attention weights – Step-2: compute the attention vector as interpolation

s^t_i: target (decoder) state,   s^s_j: source (encoder) states,   c: attention (context) vector

z_j = tanh([s^t_i, s^s_j] W + b)
α = softmax(z)
c = Σ_j α_j s^s_j
w^t_{i+1} = argmax_w O(w, s^t_{i+1}, c)
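A minimal numpy sketch of the two steps, using the feedforward score zj = tanh([s^t_i, s^s_j] W + b); the state size and the random states and weights are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                            # hypothetical state size
source_states = rng.normal(size=(6, d))          # s^s_j for a 6-word source sequence
target_state = rng.normal(size=d)                # s^t_i, the current decoder state
W, b = rng.normal(scale=0.1, size=(2 * d, 1)), np.zeros(1)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Step 1: attention weights from the scores z_j = tanh([s^t_i, s^s_j] W + b)
z = np.array([np.tanh(np.concatenate([target_state, s_j]) @ W + b)[0]
              for s_j in source_states])
alpha = softmax(z)

# Step 2: attention (context) vector as the interpolation c = sum_j alpha_j s^s_j
c = alpha @ source_states
print(alpha.round(3), c.shape)
```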

slide-77
SLIDE 77

Attention function parameterization

  • Feedforward NNs
  • Dot product
  • Cosine similarity
  • Bi-linear models

77

Dot product:        z_j = s^t_i · s^s_j
Cosine similarity:  z_j = (s^t_i · s^s_j) / (‖s^t_i‖ ‖s^s_j‖)
Bi-linear models:   z_j = (s^t_i)ᵀ W s^s_j
Feedforward NNs:    z_j = tanh([s^t_i; s^s_j] W + b)
                    z_j = tanh([s^t_i; s^s_j; s^t_i ⊙ s^s_j] W + b)
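The parameterizations differ only in how the score zj is computed; a small sketch with made-up vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
s_t, s_s = rng.normal(size=d), rng.normal(size=d)     # decoder / encoder states
W_bi = rng.normal(scale=0.1, size=(d, d))

z_dot      = s_t @ s_s                                          # dot product
z_cosine   = z_dot / (np.linalg.norm(s_t) * np.linalg.norm(s_s))
z_bilinear = s_t @ W_bi @ s_s                                   # bi-linear model
print(z_dot, z_cosine, z_bilinear)
```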

slide-78
SLIDE 78

78

slide-79
SLIDE 79

Learned Attention!

79

Diagram borrowed from Alex Rush

slide-80
SLIDE 80

80

  • M. Malinowski

Qualitative results

27

Figure 2. Attention over time. As the model generates each word, its attention changes to reflect the relevant parts of the image. "soft" (top row) vs "hard" (bottom row) attention. (Note that both models generated the same captions in this example.)
Figure 3. Examples of attending to the correct object (white indicates the attended regions, underlines indicate the corresponding word)

slide-81
SLIDE 81

POINTER NETWORKS

slide-82
SLIDE 82

Convex hull, Delaunay Triangulation, Traveling Salesman

82

Can we model these problems using seq-to-seq?

slide-83
SLIDE 83

Pointer Networks! (Vinyals et al. 2015)

83

  • NNs with attention: content-based attention to input
  • Pointer networks: location-based attention to input
slide-84
SLIDE 84

Pointer Networks

84

(a) Sequence-to-Sequence (b) Ptr-Net

slide-85
SLIDE 85

Pointer Networks

Attention Mechanism vs Pointer Networks

Softmax normalizes the vector e_ij to be an output distribution over the dictionary of inputs.

[Diagram: attention mechanism vs. Ptr-Net]

Diagram borrowed from Keon Kim

85

slide-86
SLIDE 86

CopyNet (Gu et al. 2016)

  • Conversation

– I: Hello Jack, my name is Chandralekha
– R: Nice to meet you, Chandralekha
– I: This new guy doesn't perform exactly as expected.
– R: what do you mean by "doesn't perform exactly as expected?"

  • Translation

86

slide-87
SLIDE 87

CopyNet (Gu et al. 2016)

[Figure: CopyNet example. Source: "hello , my name is Tony Jebara ." with encoder states h1–h8; decoder states s1–s4 generate "hi , Tony Jebara <eos>". Panels: (a) attention-based encoder-decoder (RNNSearch) with attentive read; (b) generate-mode & copy-mode over the source vocabulary, where Prob("Jebara") = Prob("Jebara", g) + Prob("Jebara", c); (c) state update using the embedding for "Tony" and a selective read for "Tony"]

87

slide-88
SLIDE 88

CopyNet (Gu et al. 2016)

88

  • Key idea: interpolation between generation model &

copy model

p(yt | st, yt−1, ct, M) = p(yt, g | st, yt−1, ct, M) + p(yt, c | st, yt−1, ct, M)   (4)

Generate-Mode: the same scoring function as in the generic RNN encoder-decoder (Bahdanau et al., 2014) is used, i.e.

ψg(yt = vi) = viᵀ Wo st,   vi ∈ V ∪ {UNK}   (7)

where Wo ∈ R^((N+1)×ds) and vi is the one-hot indicator vector for vi.

Copy-Mode: the score for "copying" the word xj is calculated as

ψc(yt = xj) = σ(hjᵀ Wc) st,   xj ∈ X   (8)

p(yt, g | ·) = (1/Z) e^{ψg(yt)} if yt ∈ V;   0 if yt ∈ X \ V;   (1/Z) e^{ψg(UNK)} if yt ∉ V ∪ X   (5)

p(yt, c | ·) = (1/Z) Σ_{j: xj = yt} e^{ψc(xj)} if yt ∈ X;   0 otherwise   (6)
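A toy sketch of the interpolation idea only: generate-mode and copy-mode scores share one normalizer Z, so a word such as "Tony" can receive probability mass from copying even when it is outside the output vocabulary. The scores below are random placeholders; in the actual model ψg and ψc are computed from the decoder state as in Eqs. (7) and (8).

```python
import numpy as np

rng = np.random.default_rng(0)
vocab  = ["hi", ",", "nice", "to", "meet", "you", "UNK"]   # toy output vocabulary
source = ["hello", ",", "my", "name", "is", "Tony"]        # toy source sentence

psi_g = rng.normal(size=len(vocab))     # generate-mode scores over the vocabulary
psi_c = rng.normal(size=len(source))    # copy-mode scores over source positions

# One shared normalizer, so generate- and copy-mode probabilities sum to 1 jointly
Z = np.exp(psi_g).sum() + np.exp(psi_c).sum()
p_gen, p_copy = np.exp(psi_g) / Z, np.exp(psi_c) / Z

# Probability of emitting "Tony": it is out of vocabulary, so all of its mass
# comes from copying it at the source positions where it occurs
p_tony = p_copy[[j for j, w in enumerate(source) if w == "Tony"]].sum()
print(p_tony, p_gen.sum() + p_copy.sum())   # the second number is 1.0
```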

slide-89
SLIDE 89

BiDAF

89

slide-90
SLIDE 90

NEURAL CHECKLIST

slide-91
SLIDE 91

Neural Checklist Models

(Kiddon et al., 2016)

  • What can we do with gating & attention?

91

slide-92
SLIDE 92

Encoder--Decoder Architecture

[Diagram: encoder-decoder conditioned on the title "garlic tomato salsa", generating "Chop the tomatoes . Add …"]

Doesn't address changing ingredients. Want to update ingredient information as ingredients are used.

slide-93
SLIDE 93

Encode title - decode recipe

Cut each sandwich in halves. Sandwiches with sandwiches. Sandwiches, sandwiches, Sandwiches, sandwiches, sandwiches sandwiches, sandwiches, sandwiches, sandwiches, sandwiches, sandwiches, or sandwiches or triangles, a griddle, each sandwich. Top each with a slice of cheese, tomato, and cheese. Top with remaining cheese mixture. Top with remaining cheese. Broil until tops are bubbly and cheese is melted, about 5 minutes.

sausage sandwiches

slide-94
SLIDE 94

Recipe generation vs machine translation

[Diagram: the recipe is decoded token by token from <S>, conditioned on two input sources: the recipe title and ingredients 1–4]

Two input sources

  • Only ~6-10% words align

between input and output.

  • The rest must be generated

from context (and implicit knowledge about cooking)

  • Contextual switch between

two different input sources

slide-95
SLIDE 95

[Diagram: encoder-decoder with attention, conditioned on the title "garlic tomato salsa", generating "Chop the tomatoes . Add …"]

Doesn't address changing ingredients. Want to update ingredient information as ingredients are used.

Encoder--Decoder with Attention

slide-96
SLIDE 96

Neural checklist model

slide-97
SLIDE 97

Let’s make salsa!

Garlic tomato salsa

  • tomatoes
  • onions
  • garlic
  • salt

slide-98
SLIDE 98

Neural checklist model

[Diagram: the language model (LM) reads <S> and generates "Chop"; a hidden state classifier chooses among non-ingredient / new ingredient / used ingredient; the checklist tracks which ingredients are still available and a new hidden state is produced. Title: "garlic tomato salsa"]

slide-99
SLIDE 99

Neural checklist model

[Diagram: generating "Chop the tomatoes ."; at "tomatoes" the classifier prefers new ingredient over non-ingredient, and the checklist attention puts 0.85 on tomatoes (0.10, 0.04, 0.01 on the other ingredients)]

slide-100
SLIDE 100

Neural checklist model

[Diagram: generating "Dice the onions ."; the checklist attention puts 0.94 on onions (0.00, 0.03, 0.01 on the others), and already-used ingredients are checked off]

slide-101
SLIDE 101

Neural checklist model

[Diagram: generating an "Add … to …" step; "tomatoes" is classified as a used ingredient, with attention 0.94 on tomatoes (0.04, 0.01, 0.01 on the others); all used ingredients are checked off]

slide-102
SLIDE 102

Checklist is probabilistic

[Diagram: the checklist entries are probabilities (e.g. 0.85, 1.00, 0.02, 0.04) rather than hard checkmarks; combined with the attention weights (0.90, 0.08, 0.01, 0.01) they form the new-ingredient probability distribution]

slide-103
SLIDE 103

Hidden state classifier is soft

[Diagram: the hidden state classifier is also soft, e.g. a distribution (0.00, 0.50, 0.50) over non-ingredient / new ingredient / used ingredient, which weights the corresponding output distributions]

slide-104
SLIDE 104

Interpolation

[Diagram: the final probability distribution over the vocabulary interpolates the language model's distribution with an attention model over used ingredients and an attention model over available ingredients, weighted by the soft hidden state classifier]

slide-105
SLIDE 105

Choose ingredient via attention

Generates a probability distribution over a set of embeddings that corresponds to how close a target embedding is to each.

Attention models for other NLP tasks:
– MT (Balasubramanian et al. 13, Bahdanau et al. 14)
– Sentence summarization (Rush et al. 15)
– Machine reading (Cheng et al. 16)
– Image captioning (Xu et al. 15)

[Diagram: attention over available ingredient embeddings, computed from the content vector of the language model with a temperature term]

slide-106
SLIDE 106

Attention-generated embeddings

Can generate an embedding from the attention probabilities

ingredient embeddings

slide-107
SLIDE 107

Discussion Points

  • Strength and challenges of deep learning?

… what do NNs think about this?

107

slide-108
SLIDE 108

Hafez: Neural Sonnet Writer

(Ghazvininejad et al. 2016)

108

slide-109
SLIDE 109

Neural Sonnets

Deep Convolution Network

Outrageous channels on the wrong connections, An empty space without an open layer, A closet full of black and blue extensions, Connections by the closure operator.

Theory

Another way to reach the wrong conclusion! A vision from a total transformation, Created by the great magnetic fusion, Lots of people need an explanation.

109

slide-110
SLIDE 110

Discussion Points

  • Strength and challenges of deep learning?
  • Representation learning

– Less efforts on feature engineering (at the cost of more hyperparameter tuning!) – In computer vision: NN learned representation is significantly better than human engineered features – In NLP: often NN induced representation is concatenated with additional human engineered features.

  • Data

– Most success from massive amount of clean (expensive) data – Recent surge of data creation type papers (especially AI challenge type tasks) – Which significantly limits the domains & applications – Need stronger models for unsupervised & distantly supervised approaches

110

slide-111
SLIDE 111

Discussion Points

  • Strength and challenges of deep learning?
  • Architecture

– allows for flexible, expressive, and creative modeling

  • Easier entry to the field

– Recent breakthrough from engineering advancements than theoretic advancements – Several NN platforms, code sharing culture

111

slide-112
SLIDE 112

Neural Recipe Example #1

Cook eggplant in boiling water , covered , for 10 min . Drain and cut in half lengthwise . scoop out insides leaving 1/2 '' shell . Mash insides with cottage cheese ,

onion , bay leaf , basil , oregano and tomato sauce .

Preheat oven to 350 ^ stuff eggplant halves , place in casserole dish and bake covered for 15 min . Add a little water to bottom of pan to keep eggplant moist . top with provolone cheese . Bake 5 more min uncovered 1 serving = In a small bowl , combine the cheese , eggplant , basil , oregano , tomato sauce and onion . Mix well . Shape mixture into 6 patties , each about 3/4-inch thick. Place on baking sheet . Bake at 350 degrees for 30 minutes or until lightly browned . Southern living magazine , sometime in 1980 . Typed for you by nancy coleman .

eggplant cheese cottage lowfat chopped onion bay ground leaf basil

oregano

tomato sauce provolone

title: oven eggplant

slide-113
SLIDE 113

CONVOLUTIONAL NEURAL NETWORKS

Next several slides borrowed from Alex Rush

slide-114
SLIDE 114

Models with Sliding Windows

  • Classification/prediction with sliding windows

– E.g., neural language model

  • Feature representations with sliding window

– E.g., sequence tagging with CRFs or structured perceptron

114

[w1 w2 w3 w4 w5] w6 w7 w8
w1 [w2 w3 w4 w5 w6] w7 w8
w1 w2 [w3 w4 w5 w6 w7] w8
. . .

slide-115
SLIDE 115

Sliding Windows w/ Convolution

Let our input be the embeddings of the full sentence, X ∈ R^(n×d0):
X = [v(w1), v(w2), v(w3), …, v(wn)]

Define a window model as NNwindow : R^(1×(dwin·d0)) → R^(1×dhid),
NNwindow(xwin) = xwin W1 + b1

The convolution is defined as NNconv : R^(n×d0) → R^((n−dwin+1)×dhid),
NNconv(X) = tanh[ NNwindow(X_{1:dwin}); NNwindow(X_{2:dwin+1}); …; NNwindow(X_{n−dwin+1:n}) ]

slide-116
SLIDE 116

Pooling Operations

116

Pooling "over-time" operations f : R^(n×m) → R^(1×m):

  • 1. fmax(X)_{1,j} = max_i X_{i,j}
  • 2. fmin(X)_{1,j} = min_i X_{i,j}
  • 3. fmean(X)_{1,j} = Σ_i X_{i,j} / n

[Diagram: pooling collapses the n×m convolution output into a single 1×m vector]

slide-117
SLIDE 117

Convolution + Pooling

ŷ = softmax(fmax(NNconv(X)) W2 + b2)

W2 ∈ R^(dhid×dout), b2 ∈ R^(1×dout)
The final linear layer W2 uses the learned window features.
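Putting the last three slides together, a minimal numpy sketch of convolution over word windows, max pooling over time, and the final softmax layer; all sizes are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d0, d_win, d_hid, d_out = 9, 5, 3, 4, 2          # toy sizes (d_win = window width)
X = rng.normal(size=(n, d0))                        # embeddings of a 9-word sentence
W1, b1 = rng.normal(scale=0.1, size=(d_win * d0, d_hid)), np.zeros(d_hid)
W2, b2 = rng.normal(scale=0.1, size=(d_hid, d_out)), np.zeros(d_out)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def nn_conv(X):
    """Apply the window model to every d_win-wide slice of the sentence."""
    windows = [X[i:i + d_win].reshape(-1) for i in range(n - d_win + 1)]
    return np.tanh(np.stack(windows) @ W1 + b1)     # shape (n - d_win + 1, d_hid)

conv = nn_conv(X)
pooled = conv.max(axis=0)                           # max pooling "over time"
y_hat = softmax(pooled @ W2 + b2)                   # final linear layer + softmax
print(conv.shape, pooled.shape, y_hat)
```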

117

slide-118
SLIDE 118

Multiple Convolutions

ŷ = softmax([f(NN^1conv(X)), f(NN^2conv(X)), …, f(NN^Fconv(X))] W2 + b2)

  • Concat several convolutions together.
  • Each NN^1, NN^2, etc. uses a different dwin.
  • Allows for different window-sizes (similar to multiple n-grams).
118

slide-119
SLIDE 119

Convolution Diagram (Kim 2014)

  • n = 9, dhid = 4, dout = 2
  • red: dwin = 2; blue: dwin = 3 (ignore back channel)

119

slide-120
SLIDE 120

Text Classification (Kim 2014)

120

slide-121
SLIDE 121

AlexNet (Krizhevsky et al., 2012)

121