CSE 481: NLP Capstone Spring 2017
Yejin Choi University of Washington
Office Hour News
– Hannah: Wed 2-3pm @ CSE 220
– Maarten: Wed 2-3pm @ CSE 220
– Yejin: Tue 2pm-3:30pm, Wed 5pm-5:30pm @ CSE 578
– All: Thu 12pm-1:25pm @ ??? for some weeks
GPU NEWS!
– A desktop with 2 GPUs can be set up for about $4,000
– Free GPU cycles for the class!
– $200 in credits
RECURRENT NEURAL NETWORKS
[Diagram: an RNN unrolled over time: inputs feed hidden states h1…h4, each emitting an output y1…y4]
Recurrent Neural Networks (RNNs)
– Each hidden state is computed from the previous state and a new input:
ht = f(xt, ht−1), ht ∈ R^D
yt = softmax(V ht)
– The state can represent very rich information
– Possibly the entire history from the beginning
Recurrent Neural Networks (RNNs)
ht = f(xt, ht−1)
yt = softmax(V ht)
e.g., the vanilla RNN: ht = tanh(U xt + W ht−1 + b)
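A minimal numpy sketch of this recurrence (a hedged illustration: all names, seeds, and dimensions below are made-up, not from the slides):

```python
import numpy as np

def rnn_step(x_t, h_prev, U, W, b, V):
    """One step of the vanilla RNN: ht = tanh(U xt + W ht-1 + b), yt = softmax(V ht)."""
    h_t = np.tanh(U @ x_t + W @ h_prev + b)
    scores = V @ h_t
    y_t = np.exp(scores - scores.max())   # numerically stable softmax
    return h_t, y_t / y_t.sum()

# Illustrative sizes: input dim 10, hidden dim 20, output vocab 50.
np.random.seed(0)
U, W, b = np.random.randn(20, 10), np.random.randn(20, 20), np.zeros(20)
V = np.random.randn(50, 20)
h = np.zeros(20)                          # h0
for x_t in np.random.randn(4, 10):        # a length-4 input sequence
    h, y = rnn_step(x_t, h, U, W, b, V)   # h carries the whole history forward
```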
Recurrent Neural Networks (RNNs)
ht = f(xt, ht−1)
ft = σ(U(f) xt + W(f) ht−1 + b(f))
it = σ(U(i) xt + W(i) ht−1 + b(i))
ot = σ(U(o) xt + W(o) ht−1 + b(o))
c̃t = tanh(U(c) xt + W(c) ht−1 + b(c))
ct = ft ⊙ ct−1 + it ⊙ c̃t
ht = ot ⊙ tanh(ct)
There are many known variations to this set of equations!
(compare the vanilla RNN: ht = tanh(U xt + W ht−1 + b))
[Diagram: LSTM chain; ct : cell state, ht : hidden state]
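A numpy sketch transcribing the LSTM equations above, with ⊙ as elementwise *; the parameter dictionary P and its key names are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, P):
    """One LSTM step; P maps names like 'Uf' to the per-gate parameters U(f), W(f), b(f)."""
    f = sigmoid(P['Uf'] @ x_t + P['Wf'] @ h_prev + P['bf'])        # forget gate: forget the past or not
    i = sigmoid(P['Ui'] @ x_t + P['Wi'] @ h_prev + P['bi'])        # input gate: use the input or not
    o = sigmoid(P['Uo'] @ x_t + P['Wo'] @ h_prev + P['bo'])        # output gate: output from the cell or not
    c_tilde = np.tanh(P['Uc'] @ x_t + P['Wc'] @ h_prev + P['bc'])  # new cell content (temp)
    c_t = f * c_prev + i * c_tilde                                 # cell state: gated mix of old and new
    h_t = o * np.tanh(c_t)                                         # hidden state
    return h_t, c_t
```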
Many uses of RNNs
– ht = f(xt, ht−1)
– One output for the whole sequence: y = softmax(V hn)
– One output per step: yt = softmax(V ht)
[Diagram: e.g., generating a caption: “Cat sitting on top of ….”]
Many uses of RNNs
– Yes! RNNs can be used as LMs!
– RNNs make the Markov assumption: T/F?
ht = f(xt, ht−1) yt = softmax(V ht)
Many uses of RNNs
– False: the hidden state can carry information from every previous step.
– (A Markov assumption would mean the next word depends only on the previous N words.)
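A sketch of what this buys an LM: scoring a sentence with the hypothetical rnn_step from the earlier sketch (vocab as a word list and embed as an embedding lookup are likewise assumed):

```python
import numpy as np

def sentence_logprob(words, embed, params, h0, vocab):
    """log p(w1..wT): each prediction conditions on the ENTIRE prefix via the hidden state."""
    h, y = rnn_step(embed['<s>'], h0, *params)   # params = (U, W, b, V); predict word 1 from <s>
    logp = 0.0
    for w in words:
        logp += np.log(y[vocab.index(w)])        # p(w | full history), no N-word cutoff
        h, y = rnn_step(embed[w], h, *params)    # consume w, predict the next word
    return logp
```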
Many uses of RNNs
ht = f(xt, ht−1), yt = softmax(V ht)
[Diagram: encoder-decoder: encoder states h1…h4 feed decoder states h5…h7, which emit y1…y3]
Many uses of RNNs
Figure from http://www.wildml.com/category/conversational-agents/
Many uses of RNNs
[Diagram: seq2seq over “John has a dog”, decoding a linearized parse]
Parsing!
LSTMS (LONG SHORT-TERM MEMORY NETWORKS)
[Diagram: an LSTM cell: (ct−1, ht−1) → (ct, ht)]
Figure by Christopher Olah (colah.github.io)
LSTMS (LONG SHORT-TERM MEMORY NETWORKS)
Forget gate (sigmoid: [0,1]), forget the past or not:
ft = σ(U(f) xt + W(f) ht−1 + b(f))
Figure by Christopher Olah (colah.github.io)
LSTMS (LONG SHORT-TERM MEMORY NETWORKS)
Forget gate (sigmoid: [0,1]), forget the past or not:
ft = σ(U(f) xt + W(f) ht−1 + b(f))
Input gate (sigmoid: [0,1]), use the input or not:
it = σ(U(i) xt + W(i) ht−1 + b(i))
New cell content, temp (tanh: [-1,1]):
c̃t = tanh(U(c) xt + W(c) ht−1 + b(c))
Figure by Christopher Olah (colah.github.io)
LSTMS (LONG SHORT-TERM MEMORY NETWORKS)
Forget gate, forget the past or not: ft = σ(U(f) xt + W(f) ht−1 + b(f))
Input gate, use the input or not: it = σ(U(i) xt + W(i) ht−1 + b(i))
New cell content (temp): c̃t = tanh(U(c) xt + W(c) ht−1 + b(c))
New cell content: ct = ft ⊙ ct−1 + it ⊙ c̃t
Figure by Christopher Olah (colah.github.io)
LSTMS (LONG SHORT-TERM MEMORY NETWORKS)
Forget gate, forget the past or not: ft = σ(U(f) xt + W(f) ht−1 + b(f))
Input gate, use the input or not: it = σ(U(i) xt + W(i) ht−1 + b(i))
Output gate, output from the new cell or not: ot = σ(U(o) xt + W(o) ht−1 + b(o))
New cell content (temp): c̃t = tanh(U(c) xt + W(c) ht−1 + b(c))
New cell content: ct = ft ⊙ ct−1 + it ⊙ c̃t
Hidden state: ht = ot ⊙ tanh(ct)
Figure by Christopher Olah (colah.github.io)
LSTMS (LONG SHORT-TERM MEMORY NETWORKS)
Input gate, use the input or not: it = σ(U(i) xt + W(i) ht−1 + b(i))
Forget gate, forget the past or not: ft = σ(U(f) xt + W(f) ht−1 + b(f))
Output gate, output from the new cell or not: ot = σ(U(o) xt + W(o) ht−1 + b(o))
New cell content (temp): c̃t = tanh(U(c) xt + W(c) ht−1 + b(c))
New cell content: ct = ft ⊙ ct−1 + it ⊙ c̃t
Hidden state: ht = ot ⊙ tanh(ct)
[Diagram: an LSTM cell: (ct−1, ht−1) → (ct, ht)]
Illustration of the vanishing gradient problem for RNNs: the shading indicates each node's sensitivity to the inputs at time one (the darker the shade, the greater the sensitivity); the sensitivity decays as new inputs overwrite the hidden state.
Example from Graves 2012
Preservation of gradient information by LSTM
– The gates can preserve gradient information: e.g., the sensitivity of the output layer can be switched on and off by the output gate without affecting the cell.
[Diagram legend: forget gate, input gate, output gate] Example from Graves 2012
Recurrent Neural Networks (RNNs)
ht = f(xt, ht−1)
zt = σ(U(z) xt + W(z) ht−1 + b(z))
rt = σ(U(r) xt + W(r) ht−1 + b(r))
h̃t = tanh(U(h) xt + W(h)(rt ⊙ ht−1) + b(h))
ht = (1 − zt) ⊙ ht−1 + zt ⊙ h̃t
– z: update gate, r: reset gate
– This is the Gated Recurrent Unit (GRU): fewer parameters than LSTMs, and easier to train for comparable performance!
(compare the vanilla RNN: ht = tanh(U xt + W ht−1 + b))
– Gates control the information flow, letting the network (contextually) maintain longer-term history.
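A numpy sketch of one GRU step matching the equations above (reusing sigmoid from the LSTM sketch; the parameter dictionary P is an illustrative assumption):

```python
def gru_step(x_t, h_prev, P):
    """One GRU step: no separate cell state, so fewer parameters than an LSTM."""
    z = sigmoid(P['Uz'] @ x_t + P['Wz'] @ h_prev + P['bz'])              # update gate
    r = sigmoid(P['Ur'] @ x_t + P['Wr'] @ h_prev + P['br'])              # reset gate
    h_tilde = np.tanh(P['Uh'] @ x_t + P['Wh'] @ (r * h_prev) + P['bh'])  # candidate state
    return (1.0 - z) * h_prev + z * h_tilde                              # interpolate old and new
```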
Bi-directional RNNs
Google NMT (Oct 2016)
Recursive Neural Networks
– Compose meaning over tree structure (e.g., a parse tree) rather than sequential structure
[Figure: an example (red → conservative, blue → liberal, gray → neutral) in which modifier phrases and punctuation cause polarity switches at higher levels of the parse tree]
Example from Iyyer et al., 2014
Recursive Neural Networks
Example from Iyyer et al., 2014
Tree LSTMs
– Are tree LSTMs more expressive than sequence LSTMs?
When Are Tree Structures Necessary for Deep Learning of Representations? Jiwei Li, Minh-Thang Luong, Dan Jurafsky and Eduard Hovy. EMNLP, 2015.
Neural Probabilistic Language Model (Bengio 2003)
NN_DMLP1(x) = [tanh(xW1 + b1), x] W2 + b2
– W1 ∈ R^(din×dhid), b1 ∈ R^(1×dhid): first affine transformation
– W2 ∈ R^((dhid+din)×dout), b2 ∈ R^(1×dout): second affine transformation
– Each position is predicted by a separate feed-forward neural network: a Markovian language model
– Note the input x concatenated into the second layer: skip connections
LEARNING: BACKPROPAGATION
Error Backpropagation
for brevity: θ = {wij, wjk, wkl}
[Diagram: a feed-forward network f(x, θ) with inputs x0, x1, x2, …, xP]
Next 10 slides on back propagation are adapted from Andrew Rosenberg
θ = {w(1)ij, w(2)jk, w(3)kl}: the weights of layers 1, 2, 3
Learning: Gradient Descent
[Diagram: forward pass through activations (aj, zj), (ak, zk), (al, zl) with weights wij, wjk, wkl]
w(t+1)ij = w(t)ij − η ∂R/∂wij
w(t+1)jk = w(t)jk − η ∂R/∂wjk
w(t+1)kl = w(t)kl − η ∂R/∂wkl
Backpropagation
– Reuse intermediate values: instead of considering exponentially many paths between a weight wij and the final loss (risk), store and reuse intermediate results.
– Automatic differentiation requires specifying only forward propagation.
[Diagram: backprop quantities: activations zi, errors δj, gradients ∂R/∂wij]
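A tiny worked instance of this reuse, for a two-layer network with softmax + cross-entropy (a sketch under assumed shapes, not the slides' exact notation):

```python
import numpy as np

def forward_backward(x, y_true, W1, W2):
    """Backprop on a 2-layer net: intermediate values (z1, deltas) are computed once
    and reused, instead of enumerating every path from a weight to the loss."""
    z1 = np.tanh(x @ W1)                          # forward: hidden layer (kept for reuse)
    scores = z1 @ W2
    p = np.exp(scores - scores.max()); p /= p.sum()
    loss = -np.log(p[y_true])
    d_scores = p.copy(); d_scores[y_true] -= 1.0  # dL/dscores for softmax + cross-entropy
    dW2 = np.outer(z1, d_scores)                  # reuses the stored z1
    d_z1 = (W2 @ d_scores) * (1.0 - z1 ** 2)      # chain rule back through tanh
    dW1 = np.outer(x, d_z1)
    return loss, dW1, dW2

# Gradient descent update, as in the slides: w <- w - eta * dR/dw
# W1 -= eta * dW1; W2 -= eta * dW2
```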
[Table fragment: deep learning toolkits compared by differentiation mode (forward gradient vs. backpropagation) and primary interface language]
Cross Entropy Loss (aka log loss, logistic loss)
– Entropy
– KL divergence (the difference between two distributions p and q)
– Used for models with a probabilistic flavor (e.g., language models)
– Heavily penalizes confident incorrect predictions
H(p, q) = Ep[−log q] = H(p) + DKL(p||q)
H(p, q) = − Σy p(y) log q(y)   (p: true prob, q: predicted prob)
H(p) = − Σy p(y) log p(y)
DKL(p||q) = Σy p(y) log (p(y) / q(y))
(compare MSE = ½ (y − f(x))²)
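A small numeric sketch of these definitions (the two distributions are made-up toy values):

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_y p(y) log q(y); p = true distribution, q = predicted."""
    return -np.sum(p * np.log(q + eps))

p = np.array([0.0, 1.0, 0.0])   # true label as a one-hot distribution
q = np.array([0.1, 0.7, 0.2])   # predicted probabilities
print(cross_entropy(p, q))      # = -log 0.7: confident wrong predictions cost much more
```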
RNN Learning: Backprop Through Time (BPTT)
– When unrolled, the computation graph repeats the exact same parameters…
– Backprop as if they were different parameters, then update with the average gradients throughout the entire chain of units.
[Diagram: unrolled RNN: the same parameters produce h1…h4 and y1…y4]
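Schematically, in numpy (grad_at_step is a hypothetical stand-in for the real per-step gradient from the unrolled graph; T, eta, and the shapes are toy values):

```python
import numpy as np

# BPTT sketch: treat each unrolled copy of W as if it were a separate parameter,
# backprop through the unrolled graph, then combine the per-step gradients
# for the single shared W.
T, eta = 4, 0.1
W = np.ones((3, 3))

def grad_at_step(t):
    return 0.01 * t * np.ones_like(W)   # placeholder per-step gradient

dW = np.zeros_like(W)
for t in reversed(range(T)):
    dW += grad_at_step(t)               # accumulate across time steps
W -= eta * dW / T                       # update with the average gradient over the chain
```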
LEARNING: TRAINING DEEP NETWORKS
Vanishing / Exploding Gradients
– Deep networks suffer from exploding or vanishing gradients
– Two causes: the network architecture, and the numerical operations used
Vanishing / Exploding Gradients
– Solutions via the network architecture:
– Add skip connections to reduce distance
– Add gates (and memory cells) to allow longer-term memory
Gradients of deep networks
NN_layer(x) = ReLU(xW1 + b1)
[Diagram: a deep stack hn, hn−1, …, h2, h1, x]
– Can have similar issues with vanishing gradients:
∂L/∂hn−1,jn−1 = Σjn 1(hn,jn > 0) Wjn−1,jn · ∂L/∂hn,jn
Diagram borrowed from Alex Rush
Effects of Skip Connections on Gradients
NN_sl1(x) = ½ ReLU(xW1 + b1) + ½ x
[Diagram: a deep stack hn, hn−1, …, h1, x with skip connections]
∂L/∂hn−1,jn−1 = ½ (Σjn 1(hn,jn > 0) Wjn−1,jn · ∂L/∂hn,jn) + ½ ∂L/∂hn,jn−1
– The skip term passes gradient straight through, independent of the weights.
Diagram borrowed from Alex Rush
Effects of Skip Connections on Gradients
NN_sl2(x) = (1 − t) ⊙ ReLU(xW1 + b1) + t ⊙ x
where t = σ(xWt + bt), W1 ∈ R^(dhid×dhid), Wt ∈ R^(dhid×1)
[Diagram: a deep stack with gated skip connections]
Diagram borrowed from Alex Rush
Highway Network (Srivastava et al., 2015)
– H is a typical affine transformation followed by a non-linear activation
– T is a “transform gate”, C is a “carry gate”; often C = 1 − T for simplicity
Plain layer: y = H(x, WH)
Highway layer: y = H(x, WH) · T(x, WT) + x · C(x, WC)
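A sketch of one highway layer with C = 1 − T (shapes assumed square; tanh standing in for the non-linear activation in H):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, WH, bH, WT, bT):
    """y = H(x, WH) * T(x, WT) + x * (1 - T(x, WT)), with carry gate C = 1 - T."""
    H = np.tanh(x @ WH + bH)       # typical affine transformation + nonlinearity
    T = sigmoid(x @ WT + bT)       # transform gate
    return H * T + x * (1.0 - T)   # carry the input through where T is small
```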
Residual Networks
– The deepest network yet successfully trained for object recognition
[Diagram: a residual block: any two stacked weight layers (with ReLU), whose output b(0) is added to the identity input to give a(0)]
Residual Networks
– Easier gradient propagation: more direct influence from the final loss on any deep layer
– Compare highway networks and LSTMs, which allow the input connection only through “gates”.
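A sketch of one residual block in the same spirit (assumed shapes; note the identity path is added back with no gate):

```python
import numpy as np

def residual_block(x, W1, b1, W2, b2):
    """Two stacked weight layers with ReLU, plus an ungated identity connection."""
    h = np.maximum(0.0, x @ W1 + b1)   # weight layer + relu
    F = h @ W2 + b2                    # weight layer
    return np.maximum(0.0, F + x)      # add the identity input, then relu
```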
Residual Networks
Revolution of Depth
[Architecture diagrams: AlexNet, 8 layers (ILSVRC 2012); VGG, 19 layers (ILSVRC 2014); GoogleNet, 22 layers (ILSVRC 2014)]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Residual Networks
Revolution of Depth
[Architecture diagrams: ResNet, 152 layers (ILSVRC 2015), vs. VGG, 19 layers, and AlexNet, 8 layers]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Residual Networks
Revolution of Depth
ImageNet classification top-5 error (%):
– ILSVRC'10 (shallow): 28.2
– ILSVRC'11 (shallow): 25.8
– ILSVRC'12 AlexNet (8 layers): 16.4
– ILSVRC'13 (8 layers): 11.7
– ILSVRC'14 VGG (19 layers): 7.3
– ILSVRC'14 GoogleNet (22 layers): 6.7
– ILSVRC'15 ResNet (152 layers): 3.57
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Highway Network (Srivastava et al., 2015)
@Schmidhubered
Vanishing / Exploding Gradients
– Gradient Clipping: bound gradients by a max value
– Gradient Normalization: renormalize gradients when they are above a fixed norm
– Careful initialization, smaller learning rates
– Avoid saturating nonlinearities (like tanh, sigmoid)
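The first two fixes, sketched in numpy (the thresholds are illustrative):

```python
import numpy as np

def clip_and_normalize(grads, max_value=5.0, max_norm=5.0):
    """Gradient clipping (bound each entry) + gradient normalization (bound the global norm)."""
    grads = [np.clip(g, -max_value, max_value) for g in grads]
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        grads = [g * (max_norm / total_norm) for g in grads]
    return grads
```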
Sigmoid
– Smooth and differentiable
– Derivative is near zero almost everywhere except for x near zero => vanishing gradients
σ(x) = 1 / (1 + e^(−x)),  σ′(x) = σ(x)(1 − σ(x))
Tanh
– Used for hidden states & cells in RNNs and LSTMs
– Trains faster than sigmoid
– Gradients still saturate to zero => vanishing gradients
tanh(x) = 2σ(2x) − 1 = (e^x − e^(−x)) / (e^x + e^(−x))
tanh′(x) = 1 − tanh²(x)
Hard Tanh
hardtanh(t) = −1 if t < −1;  t if −1 ≤ t ≤ 1;  1 if t > 1
– Computationally cheaper than tanh
– Saturates to zero easily; not differentiable at 1 and −1
ReLU
– Linear for x > 0: computationally cheaper, induces sparse NNs
– Not differentiable at 0
– Widely used in deep NNs, though not as much in RNNs
– Use subgradients:
ReLU(x) = max(0, x)
d ReLU(x)/dx = 1 if x > 0;  0 if x < 0;  1 or 0 at x = 0
Vanishing / Exploding Gradients
– Gradient Clipping: bound gradients by a max value
– Gradient Normalization: renormalize gradients when they are above a fixed norm
– Careful initialization, smaller learning rates
– Avoid saturating nonlinearities (like tanh, sigmoid)
– Batch Normalization: add intermediate input normalization layers
Batch Normalization
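A training-time sketch of such a normalization layer (gamma and beta are learned rescaling parameters; at test time, running averages of the batch statistics would be used instead):

```python
import numpy as np

def batch_norm(X, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then rescale and shift."""
    mu = X.mean(axis=0)                    # per-feature mean over the batch
    var = X.var(axis=0)                    # per-feature variance over the batch
    X_hat = (X - mu) / np.sqrt(var + eps)  # normalized inputs
    return gamma * X_hat + beta            # learned scale and shift
```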
Regularization
– Modify loss with L1 or L2 norms
Dropout
– Randomly delete parts of the network during training
– Each node (and its corresponding incoming and outgoing edges) is dropped with a probability p
– p is higher for internal nodes, lower for input nodes
– The full network is used for testing
– Faster training, better results
– Vs. bagging
L(θ) = Σ(i=1..n) max{0, 1 − (ŷc − ŷc′)} + λ ||θ||²
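A sketch of (inverted) dropout matching the bullets above; the 1/(1−p) rescaling is what lets the full network be used unchanged at test time:

```python
import numpy as np

def dropout(h, p, train=True):
    """Randomly zero units with probability p; scale survivors by 1/(1-p)."""
    if not train:
        return h                                        # full network at test time
    mask = (np.random.rand(*h.shape) >= p) / (1.0 - p)  # inverted-dropout scaling
    return h * mask
```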
Convergence of backprop
– If the objective were convex:
– Gradient descent reaches the global minimum
– But NN training is not convex:
– Gradient descent gets stuck in local minima
– Selecting the number of hidden units and layers = fuzzy process
– Still, NNs have made a HUGE comeback in the last few years:
– Deep belief networks
– Huge error reduction when trained with lots of data on GPUs
RECAP
Vanishing / Exploding Gradients
– Deep networks suffer from exploding or vanishing gradients
– Two causes: the network architecture, and the numerical operations used
Vanishing / Exploding Gradients
– Solutions via the network architecture:
– Add skip connections to reduce distance
– Add gates (and memory cells) to allow longer-term memory
seq2seq (aka “encoder-decoder”)
ht = f(xt, ht−1) yt = softmax(V ht)
Google NMT (Oct 2016)
ATTENTION!
Seq-to-Seq with Attention
Diagram from http://distill.pub/2016/augmented-rnns/
Seq-to-Seq with Attention
Diagram from http://distill.pub/2016/augmented-rnns/
Trial: Hard Attention
– Predict the next target word from the decoder state and a single attended (hard) source state:
z_j = tanh([s^t_i, s^s_j] W + b)
j = argmax_j z_j
w^t_{i+1} = argmax_w O(w, s^t_{i+1}, s^s_j)
Encoder – Decoder Architecture
Sequence-to-Sequence
[Diagram: sequence-to-sequence: encoder states s^s_1…s^s_3 read x1…x3 = “the red dog”; decoder states s^t_1…s^t_3, starting from <s>, emit ŷ1…ŷ3 = “the red dog”]
Diagram borrowed from Alex Rush
Attention: Soft Alignments
– Predict the next target word using a context vector c interpolated over all source states:
– Step 1: compute the attention weights
z_j = tanh([s^t_i, s^s_j] W + b)
α = softmax(z)
– Step 2: compute the attention (context) vector as an interpolation
c = Σ_j α_j s^s_j
w^t_{i+1} = argmax_w O(w, s^t_{i+1}, c)
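Both steps in numpy, using the concat scoring z_j = tanh([s^t_i, s^s_j] W + b) from the slides (all shapes below are illustrative assumptions):

```python
import numpy as np

def soft_attention(s_t, S_src, W, b):
    """Step 1: weights alpha = softmax(z); Step 2: context c = sum_j alpha_j * s_j."""
    z = np.array([np.tanh(np.concatenate([s_t, s_j]) @ W + b) for s_j in S_src])
    alpha = np.exp(z - z.max())
    alpha /= alpha.sum()                  # attention weights over source states
    c = alpha @ S_src                     # interpolation of the source states
    return c, alpha

# Toy usage: 5 source states and a decoder state, each of dimension 8.
np.random.seed(0)
S_src = np.random.randn(5, 8)
s_t = np.random.randn(8)
W, b = np.random.randn(16), 0.0           # maps the concatenated pair to a scalar score
c, alpha = soft_attention(s_t, S_src, W, b)
```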
Attention function parameterization
z_j = s^t_i · s^s_j   (dot product)
z_j = (s^t_i · s^s_j) / (||s^t_i|| ||s^s_j||)   (cosine)
z_j = s^t_i^T W s^s_j   (bilinear)
z_j = tanh([s^t_i; s^s_j] W + b)
z_j = tanh([s^t_i; s^s_j; s^t_i ⊙ s^s_j] W + b)
Learned Attention!
Diagram borrowed from Alex Rush
Qualitative results
[Figure 2: Attention over time. As the model generates each word, its attention changes to reflect the relevant parts of the image. “soft” (top row) vs. “hard” (bottom row) attention. (Note that both models generated the same captions in this example.)]
[Figure 3: Examples of attending to the correct object (white indicates the attended regions, underlines indicate the corresponding word)]
POINTER NETWORKS
Convex hull, Delaunay triangulation, Traveling Salesman
Can we model these problems using seq-to-seq?
Pointer Networks! (Vinyals et al. 2015)
Pointer Networks
[Figure: (a) Sequence-to-Sequence vs. (b) Ptr-Net]
Pointer Networks
Attention Mechanism vs Pointer Networks
– Softmax normalizes the vector e_ij to be an output distribution over the dictionary of inputs
[Diagram: attention mechanism vs. Ptr-Net, borrowed from Keon Kim]
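A sketch of one Ptr-Net output step, with the additive scoring of Vinyals et al. (parameter shapes assumed):

```python
import numpy as np

def pointer_distribution(s_t, H_enc, W1, W2, v):
    """The attention distribution itself is the output: a distribution over INPUT
    positions, rather than over a fixed output vocabulary."""
    scores = np.array([v @ np.tanh(W1 @ h_j + W2 @ s_t) for h_j in H_enc])
    p = np.exp(scores - scores.max())
    return p / p.sum()                   # p[j] = probability of pointing at input j
```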
CopyNet (Gu et al. 2016)
– I: Hello Jack, my name is Chandralekha
– R: Nice to meet you, Chandralekha
– I: This new guy doesn’t perform exactly as expected.
– R: what do you mean by “doesn’t perform exactly as expected?”
CopyNet (Gu et al. 2016)
[Diagram: CopyNet. Source (vocabulary M): “hello , my name is Tony Jebara .” with encoder states h1…h8; output: “hi , Tony Jebara <eos>” with decoder states s1…s4. (a) Attention-based encoder-decoder (RNNSearch) with attentive read; (b) generate-mode & copy-mode over the source vocabulary, so Prob(“Jebara”) = Prob(“Jebara”, g) + Prob(“Jebara”, c); (c) state update for s4 using the embedding for “Tony” and a selective read for “Tony”]
CopyNet (Gu et al. 2016)
The copy model mixes two modes:
p(yt | st, yt−1, ct, M) = p(yt, g | st, yt−1, ct, M) + p(yt, c | st, yt−1, ct, M)   (4)
Generate-Mode: the same scoring function as in the generic RNN encoder-decoder (Bahdanau et al., 2014):
ψg(yt = vi) = vi^T Wo st,  vi ∈ V ∪ {UNK}   (7)
where Wo ∈ R^((N+1)×ds) and vi is the one-hot indicator vector for vi.
Copy-Mode: the score for “copying” the word xj:
ψc(yt = xj) = σ(hj^T Wc) st,  xj ∈ X   (8)
p(yt, g | ·) = (1/Z) e^ψg(yt) if yt ∈ V;  0 if yt ∈ X \ V̄;  (1/Z) e^ψg(UNK) if yt ∉ V ∪ X   (5)
p(yt, c | ·) = (1/Z) Σ(j: xj = yt) e^ψc(xj) if yt ∈ X;  0 otherwise   (6)
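A simplified sketch of equations (4)-(6): both modes share one normalizer Z, and a word's probability sums its generate and copy scores (the UNK case and the exact scoring functions are omitted; psi_g and psi_c are assumed precomputed score vectors):

```python
import numpy as np

def copynet_mix(psi_g, psi_c, vocab, src_words):
    """p(y) = p(y, g) + p(y, c), with both modes normalized by the SAME Z."""
    Z = np.exp(psi_g).sum() + np.exp(psi_c).sum()                # shared normalizer
    p = {w: np.exp(psi_g[i]) / Z for i, w in enumerate(vocab)}   # generate-mode
    for j, w in enumerate(src_words):                            # copy-mode: sum over copies
        p[w] = p.get(w, 0.0) + np.exp(psi_c[j]) / Z
    return p
```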
BiDAF
NEURAL CHECKLIST
Neural Checklist Models
(Kiddon et al., 2016)
Encoder-Decoder Architecture
[Diagram: an encoder-decoder decodes recipe text (“Chop the tomatoes . Add …”) from the encoded title “garlic tomato salsa”]
– Doesn’t address changing ingredients
– Want to update ingredient information as ingredients are used
Encode title - decode recipe
Cut each sandwich in halves. Sandwiches with sandwiches. Sandwiches, sandwiches, Sandwiches, sandwiches, sandwiches sandwiches, sandwiches, sandwiches, sandwiches, sandwiches, sandwiches, or sandwiches or triangles, a griddle, each sandwich. Top each with a slice of cheese, tomato, and cheese. Top with remaining cheese mixture. Top with remaining cheese. Broil until tops are bubbly and cheese is melted, about 5 minutes.
title: sausage sandwiches
Recipe generation vs. machine translation
[Diagram: decode the recipe token by token from two inputs: the recipe title and ingredients 1–4]
– Two different input sources (the title and the ingredient list)
– No direct alignment between input and output
– Much of the output must come from context (and implicit knowledge about cooking)
Encoder-Decoder with Attention
Neural checklist model
Let’s make salsa!
Garlic tomato salsa
[Diagram: ingredient checklist, e.g., garlic salt]
Neural checklist model
[Diagram: an RNN LM generates the recipe token by token (“Chop” from <S>); a hidden state classifier chooses among non-ingredient / new ingredient / used ingredient; a checklist tracks which ingredients are still available and yields a new hidden state (title: garlic tomato salsa)]
Neural checklist model
[Diagram: generating “Chop the tomatoes .”; the hidden state classifier outputs a distribution (e.g., 0.85, 0.10, 0.04, 0.01) over non-ingredient vs. new ingredient, etc.]
Neural checklist model
[Diagram: generating “Dice the …”; classifier distribution 0.00, 0.94, 0.03, 0.01]
Neural checklist model
[Diagram: generating “Add tomatoes to …”; classifier distribution 0.94, 0.04, 0.01, 0.01 for “used ingredient”]
Checklist is probabilistic
[Diagram: generating “Add tomatoes to …” (classifier distribution 0.90, 0.08, 0.01, 0.01 for “used ingredient”); the checklist entries are probabilities (e.g., 0.85, 1.00, 0.02, 0.04), updated from the new-ingredient probability distribution]
Hidden state classifier is soft
[Diagram: the hidden state classifier is soft: its probabilities (e.g., 0.90, 0.08, 0.01, 0.01) weight the checklist update and the candidate output distributions rather than making a hard choice]
Interpolation
[Diagram: the candidate output distributions are interpolated, weighted by the classifier probabilities, into one probability distribution over the vocabulary]
Choose ingredient via attention
– Generates a probability distribution over a set of embeddings that corresponds to how close a target embedding is to each of them
Attention models for other NLP tasks
– MT (Balasubramanian et al. 13, Bahdanau et al. 14)
– Sentence summarization (Rush et al. 15)
– Machine reading (Cheng et al. 16)
– Image captioning (Xu et al. 15)
[Equation legend: available ingredient embeddings; content vector from the language model; temperature term]
Attention-generated embeddings
– Can generate an embedding from the attention probabilities over the ingredient embeddings
Discussion Points
… what do NNs think about this?
Hafez: Neural Sonnet Writer
(Ghazvininejad et al. 2016)
Neural Sonnets
Deep Convolution Network
Outrageous channels on the wrong connections, An empty space without an open layer, A closet full of black and blue extensions, Connections by the closure operator.
Theory
Another way to reach the wrong conclusion! A vision from a total transformation, Created by the great magnetic fusion, Lots of people need an explanation.
Discussion Points
– Less effort on feature engineering (at the cost of more hyperparameter tuning!)
– In computer vision: NN-learned representations are significantly better than human-engineered features
– In NLP: often the NN-induced representation is concatenated with additional human-engineered features
– Most success comes from massive amounts of clean (expensive) data
– Recent surge of data-creation papers (especially AI-challenge-type tasks)
– This significantly limits the domains & applications
– Need stronger models for unsupervised & distantly supervised approaches
Discussion Points
– allows for flexible, expressive, and creative modeling
– Recent breakthroughs come more from engineering advancements than from theoretical advancements
– Several NN platforms, and a code-sharing culture
Neural Recipe Example #1
Cook eggplant in boiling water , covered , for 10 min . Drain and cut in half lengthwise . scoop out insides leaving 1/2 '' shell . Mash insides with cottage cheese ,
Preheat oven to 350 ^ stuff eggplant halves , place in casserole dish and bake covered for 15 min . Add a little water to bottom of pan to keep eggplant moist . top with provolone cheese . Bake 5 more min uncovered 1 serving = In a small bowl , combine the cheese , eggplant , basil , oregano , tomato sauce and onion . Mix well . Shape mixture into 6 patties , each about 3/4-inch thick. Place on baking sheet . Bake at 350 degrees for 30 minutes or until lightly browned . Southern living magazine , sometime in 1980 . Typed for you by nancy coleman .
eggplant cheese cottage lowfat chopped onion bay ground leaf basil
tomato sauce provolone
title: oven eggplant
CONVOLUTIONAL NEURAL NETWORKS
Next several slides borrowed from Alex Rush
Models with Sliding Windows
– E.g., neural language model
– E.g., sequence tagging with CRFs or structured perceptron
[w1 w2 w3 w4 w5] w6 w7 w8
w1 [w2 w3 w4 w5 w6] w7 w8
w1 w2 [w3 w4 w5 w6 w7] w8
. . .
Sliding Windows w/ Convolution
Let our input be the embeddings of the full sentence, X ∈ R^(n×d0):
X = [v(w1), v(w2), v(w3), …, v(wn)]
Define a window model as NN_window : R^(1×(dwin·d0)) → R^(1×dhid):
NN_window(xwin) = xwin W1 + b1
The convolution is defined as NN_conv : R^(n×d0) → R^((n−dwin+1)×dhid):
NN_conv(X) = tanh [NN_window(X_(1:dwin)); NN_window(X_(2:dwin+1)); …; NN_window(X_((n−dwin+1):n))]
Pooling Operations
– Pooling “over-time” operations f : R^(n×m) → R^(1×m)
[Illustration: f collapses the n window vectors into a single m-dimensional vector, e.g., an elementwise max or sum over time]
Convolution + Pooling
ŷ = softmax(f_max(NN_conv(X)) W2 + b2)
– W2 ∈ R^(dhid×dout), b2 ∈ R^(1×dout)
– The final linear layer W2 uses the learned window features
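Putting the convolution and pooling slides together as a numpy sketch (window size and all shapes are illustrative assumptions):

```python
import numpy as np

def text_cnn_predict(X, W1, b1, W2, b2, d_win=3):
    """NN_conv over every window, max-pool over time, then a softmax classifier."""
    n = X.shape[0]
    windows = np.stack([X[i:i + d_win].reshape(-1)   # flatten each d_win x d0 window
                        for i in range(n - d_win + 1)])
    H = np.tanh(windows @ W1 + b1)                   # one d_hid vector per window
    pooled = H.max(axis=0)                           # f_max: pooling over time
    scores = pooled @ W2 + b2
    p = np.exp(scores - scores.max())
    return p / p.sum()
```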
Multiple Convolutions
ŷ = softmax([f(NN1_conv(X)), f(NN2_conv(X)), …, f(NNf_conv(X))] W2 + b2)
– Concatenate several convolutions together
– Each NN1_conv, NN2_conv, etc. uses a different dwin
– Allows for different window sizes (similar to multiple n-grams)
Convolution Diagram (Kim 2014)
– n = 9, dhid = 4, dout = 2
– red: dwin = 2; blue: dwin = 3 (ignore the back channel)
Text Classification (Kim 2014)
AlexNet (Krizhevsky et al., 2012)