  1. CSE 490 U: Deep Learning Spring 2016 Yejin Choi Some slides from Carlos Guestrin, Andrew Rosenberg, Luke Zettlemoyer

  2. Human Neurons
• Switching time: ~0.001 second
• Number of neurons: ~10^10
• Connections per neuron: ~10^4-5
• Scene recognition time: ~0.1 seconds
• Number of cycles per scene recognition? ~100 → much parallel computation!

  3. Perceptron as a Neural Network
This is one neuron:
– Input edges x_1 ... x_n, along with a bias
– The sum is represented graphically
– The sum is passed through an activation function g

  4. Sigmoid Neuron
Just change g!
• Why would we want to do this?
• Notice the new output range [0,1]. What was it before?
• Look familiar?

  5. Optimizing a neuron
Chain rule: ∂/∂x f(g(x)) = f'(g(x)) · g'(x)
We train to minimize the sum-squared error:
∂ℓ/∂w_i = − Σ_j [ y^j − g(w_0 + Σ_i w_i x_i^j) ] · ∂/∂w_i g(w_0 + Σ_i w_i x_i^j)
∂/∂w_i g(w_0 + Σ_i w_i x_i^j) = x_i^j · g'(w_0 + Σ_i w_i x_i^j)
The solution just depends on g': the derivative of the activation function!
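
A minimal sketch of this update rule in code, assuming a single sigmoid neuron trained on toy data with plain gradient descent (all names, data, and the learning rate are illustrative):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Toy data: 4 examples, 2 binary features each (illustrative OR-like targets)
    X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
    y = np.array([0., 1., 1., 1.])
    w, w0, lr = np.zeros(2), 0.0, 0.5

    for epoch in range(1000):
        out = sigmoid(w0 + X @ w)      # g(w_0 + sum_i w_i x_i^j) for every example j
        err = y - out                  # y^j - g(.)
        # dL/dw_i = -sum_j err_j * g'(.) * x_i^j, with g'(z) = g(z)(1 - g(z))
        delta = err * out * (1 - out)
        w  += lr * (delta @ X)         # gradient descent step on the weights
        w0 += lr * delta.sum()         # and on the bias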

  6. Sigmoid units: have to differentiate g
g'(x) = g(x)(1 − g(x))
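
A quick numerical check of this identity, as a sketch (the test points and finite-difference step are arbitrary):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    x = np.linspace(-5, 5, 11)
    eps = 1e-6
    numeric  = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)  # finite-difference derivative
    analytic = sigmoid(x) * (1 - sigmoid(x))                      # g(x)(1 - g(x))
    print(np.max(np.abs(numeric - analytic)))                     # tiny: the two agree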

  7. Perceptron, linear classification, Boolean functions: x_i ∈ {0,1}
• Can we learn x_1 ∨ x_2? Yes: −0.5 + x_1 + x_2
• Can we learn x_1 ∧ x_2? Yes: −1.5 + x_1 + x_2
• Can we learn any disjunction? −0.5 + x_1 + … + x_n
• Any conjunction? (−n + 0.5) + x_1 + … + x_n
• Can we learn majority? (−0.5·n) + x_1 + … + x_n
• What are we missing? The dreaded XOR!, etc.
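
A small check of these threshold units, as a sketch (a perceptron here fires when its weighted sum is strictly positive):

    import numpy as np
    from itertools import product

    def fires(bias, weights, x):
        # perceptron output: 1 if bias + sum_i w_i * x_i > 0, else 0
        return int(bias + np.dot(weights, x) > 0)

    n = 3
    for x in product([0, 1], repeat=n):
        x = np.array(x)
        disjunction = fires(-0.5, np.ones(n), x)       # x_1 OR ... OR x_n
        conjunction = fires(-n + 0.5, np.ones(n), x)   # x_1 AND ... AND x_n
        majority    = fires(-0.5 * n, np.ones(n), x)   # more than half of the x_i are 1
        print(x, disjunction, conjunction, majority)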

  8. Going beyond linear classification: solving the XOR problem
y = x_1 XOR x_2 = (x_1 ∧ ¬x_2) ∨ (x_2 ∧ ¬x_1)
v_1 = (x_1 ∧ ¬x_2): threshold unit −1.5 + 2x_1 − x_2
v_2 = (x_2 ∧ ¬x_1): threshold unit −1.5 + 2x_2 − x_1
y = v_1 ∨ v_2: threshold unit −0.5 + v_1 + v_2
(The slide's diagram wires x_1 and x_2 into hidden units v_1 and v_2, and v_1, v_2 into y, with these weights.)
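
A minimal sketch of this two-layer network, assuming a hard threshold (step) activation as on the slide:

    def step(z):
        # hard threshold activation: 1 if z > 0, else 0
        return 1 if z > 0 else 0

    def xor_net(x1, x2):
        v1 = step(-1.5 + 2 * x1 - 1 * x2)   # v1 = x1 AND NOT x2
        v2 = step(-1.5 + 2 * x2 - 1 * x1)   # v2 = x2 AND NOT x1
        return step(-0.5 + v1 + v2)         # y  = v1 OR v2

    for x1 in (0, 1):
        for x2 in (0, 1):
            print(x1, x2, xor_net(x1, x2))  # prints the XOR truth table: 0, 1, 1, 0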

  9. Hidden layer
• Single unit: out(x) = g(w_0 + Σ_i w_i x_i)
• 1-hidden layer: out(x) = g(w_0 + Σ_k w_k · g(w_0^k + Σ_i w_i^k x_i))
• No longer a convex function!

  10. Example data for NN with hidden layer

  11. Learned weights for hidden layer

  12. Why “representation learning”?
• MaxEnt (multinomial logistic regression): y = softmax(w · f(x, y)). You design the feature vector f.
• NNs: y = softmax(w · σ(Ux)), or, with more layers, y = softmax(w · σ(U^(n) ⋯ σ(U^(2) σ(U^(1) x)) ⋯ )). Feature representations are “learned” through the hidden layers.
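
A minimal sketch of this stacked forward pass, assuming two hidden layers with sigmoid nonlinearities and random, untrained weights (all sizes are illustrative):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def softmax(z):
        e = np.exp(z - z.max())             # subtract max for numerical stability
        return e / e.sum()

    rng = np.random.default_rng(0)
    x  = rng.normal(size=5)                 # raw input features
    U1 = rng.normal(size=(8, 5))            # first hidden layer
    U2 = rng.normal(size=(8, 8))            # second hidden layer
    w  = rng.normal(size=(3, 8))            # output layer, 3 classes

    h1 = sigmoid(U1 @ x)                    # learned representation, layer 1
    h2 = sigmoid(U2 @ h1)                   # learned representation, layer 2
    y  = softmax(w @ h2)                    # y = softmax(w · σ(U^(2) σ(U^(1) x)))
    print(y, y.sum())                       # class probabilities that sum to 1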

  13. Very deep models in computer vision

  14. RECURRENT NEURAL NETWORKS

  15. Recurrent Neural Networks (RNNs)
• Each RNN unit computes a new hidden state from the previous state and a new input: h_t = f(x_t, h_{t−1})
• Each RNN unit (optionally) makes an output from the current hidden state: y_t = softmax(V h_t)
• Hidden states are continuous vectors, h_t ∈ R^D
– They can represent very rich information, possibly the entire history from the beginning
• Parameters are shared (tied) across all RNN units (unlike feedforward NNs)

  16. Recurrent Neural Networks (RNNs)
• Generic RNNs: h_t = f(x_t, h_{t−1}), y_t = softmax(V h_t)
• Vanilla RNN: h_t = tanh(U x_t + W h_{t−1} + b), y_t = softmax(V h_t)
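
A minimal sketch of the vanilla RNN recurrence, unrolled over a short sequence of random vectors (all dimensions and weights are illustrative):

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    D_in, D_h, D_out = 4, 6, 3
    rng = np.random.default_rng(0)
    U = rng.normal(size=(D_h, D_in))        # input-to-hidden weights
    W = rng.normal(size=(D_h, D_h))         # hidden-to-hidden weights, shared across time
    b = np.zeros(D_h)
    V = rng.normal(size=(D_out, D_h))       # hidden-to-output weights

    xs = rng.normal(size=(5, D_in))         # a sequence of 5 input vectors
    h = np.zeros(D_h)                       # h_0
    for x_t in xs:
        h = np.tanh(U @ x_t + W @ h + b)    # h_t = tanh(U x_t + W h_{t-1} + b)
        y_t = softmax(V @ h)                # y_t = softmax(V h_t)
        print(y_t)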

  17. Many uses of RNNs. 1. Classification (seq to one)
• Input: a sequence
• Output: one label (classification)
• Example: sentiment classification
h_t = f(x_t, h_{t−1}), y = softmax(V h_n)
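
A sketch of the seq-to-one pattern: the same recurrence, but only the final hidden state h_n feeds the classifier (sizes, weights, and the 2-class output are illustrative):

    import numpy as np

    D_in, D_h, n_classes = 4, 6, 2          # e.g. 2 = {negative, positive} sentiment
    rng = np.random.default_rng(5)
    U = rng.normal(size=(D_h, D_in))
    W = rng.normal(size=(D_h, D_h))
    V = rng.normal(size=(n_classes, D_h))

    xs = rng.normal(size=(7, D_in))         # e.g. embeddings of a 7-word review
    h = np.zeros(D_h)
    for x_t in xs:
        h = np.tanh(U @ x_t + W @ h)        # h_t = f(x_t, h_{t-1})
    scores = V @ h                          # classify from the LAST hidden state only
    y = np.exp(scores - scores.max()); y /= y.sum()   # y = softmax(V h_n)
    print(y)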

  18. Many uses of RNNs. 2. One to seq
• Input: one item
• Output: a sequence
• Example: image captioning (“Cat sitting on top of ….”)
h_t = f(x_t, h_{t−1}), y_t = softmax(V h_t)

  19. Many uses of RNNs. 3. Sequence tagging
• Input: a sequence
• Output: a sequence (of the same length)
• Example: POS tagging, Named Entity Recognition
• How about language models? Yes! RNNs can be used as LMs!
• Do RNNs make the Markov assumption: T/F?
h_t = f(x_t, h_{t−1}), y_t = softmax(V h_t)

  20. Many uses of RNNs. 4. Language models
• Input: a sequence of words
• Output: the next word, or a sequence of next words
• During training, x_t is the actual word in the training sentence.
• During testing, x_t is the word predicted at the previous time step.
• Do RNN LMs make the Markov assumption, i.e., does the next word depend only on the previous N words?
h_t = f(x_t, h_{t−1}), y_t = softmax(V h_t)
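
A sketch of the generation loop, assuming a hypothetical embedding matrix E and a tiny vocabulary; at test time the sampled word is fed back in as the next x_t, so the hidden state can depend on the entire history (no fixed-N Markov window):

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    vocab_size, D_emb, D_h = 10, 4, 6            # illustrative sizes
    rng = np.random.default_rng(1)
    E = rng.normal(size=(vocab_size, D_emb))     # word embeddings (hypothetical)
    U = rng.normal(size=(D_h, D_emb))
    W = rng.normal(size=(D_h, D_h))
    b = np.zeros(D_h)
    V = rng.normal(size=(vocab_size, D_h))

    word, h, generated = 0, np.zeros(D_h), []    # word 0 acts as an illustrative start token
    for t in range(8):
        h = np.tanh(U @ E[word] + W @ h + b)     # h_t summarizes ALL previous words
        p = softmax(V @ h)                       # distribution over the next word
        word = int(rng.choice(vocab_size, p=p))  # feed the prediction back in (test time)
        generated.append(word)
    print(generated)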

  21. Many uses of RNNs. 5. Seq2seq (aka “encoder-decoder”)
• Input: a sequence
• Output: a sequence (of a different length)
• Examples?
h_t = f(x_t, h_{t−1}), y_t = softmax(V h_t)
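
A compact sketch of the encoder-decoder pattern: one RNN folds the input sequence into its final hidden state, and a second RNN initialized from that state generates an output sequence of a different length (all names and dimensions are illustrative):

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    D_in, D_h, out_vocab = 4, 6, 10
    rng = np.random.default_rng(2)
    U_e = rng.normal(size=(D_h, D_in)); W_e = rng.normal(size=(D_h, D_h))   # encoder
    E_d = rng.normal(size=(out_vocab, D_in))                                # decoder embeddings
    U_d = rng.normal(size=(D_h, D_in)); W_d = rng.normal(size=(D_h, D_h))   # decoder
    V_d = rng.normal(size=(out_vocab, D_h))

    # Encode: fold the whole input sequence into one vector h
    h = np.zeros(D_h)
    for x_t in rng.normal(size=(5, D_in)):      # a length-5 input sequence
        h = np.tanh(U_e @ x_t + W_e @ h)

    # Decode: generate a (possibly different-length) output sequence from h
    word = 0                                    # illustrative start symbol
    for t in range(7):
        h = np.tanh(U_d @ E_d[word] + W_d @ h)
        word = int(np.argmax(softmax(V_d @ h))) # greedy choice of the next output word
        print(word, end=" ")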

  22. Many uses of RNNs. 5. Seq2seq (aka “encoder-decoder”)
• Conversation and dialogue
• Machine translation
Figure from http://www.wildml.com/category/conversational-agents/

  23. Many uses of RNNs. 5. Seq2seq (aka “encoder-decoder”)
• Parsing! “Grammar as a Foreign Language” (Vinyals et al., 2015)
• Example input: “John has a dog”

  24. Recurrent Neural Networks (RNNs)
• Generic RNNs: h_t = f(x_t, h_{t−1}), y_t = softmax(V h_t)
• Vanilla RNN: h_t = tanh(U x_t + W h_{t−1} + b), y_t = softmax(V h_t)

  25. Recurrent Neural Networks (RNNs)
• Generic RNNs: h_t = f(x_t, h_{t−1})
• Vanilla RNNs: h_t = tanh(U x_t + W h_{t−1} + b)
• LSTMs (Long Short-Term Memory networks):
i_t = σ(U^(i) x_t + W^(i) h_{t−1} + b^(i))
f_t = σ(U^(f) x_t + W^(f) h_{t−1} + b^(f))
o_t = σ(U^(o) x_t + W^(o) h_{t−1} + b^(o))
c̃_t = tanh(U^(c) x_t + W^(c) h_{t−1} + b^(c))
c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t
h_t = o_t ⊙ tanh(c_t)
There are many known variations of this set of equations!
c_t: cell state; h_t: hidden state
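
A minimal sketch of one LSTM step following these equations, where ⊙ is elementwise multiplication and all weights are random placeholders:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    D_in, D_h = 4, 6
    rng = np.random.default_rng(3)
    def gate_params():
        # one (U, W, b) triple per gate and for the candidate cell
        return rng.normal(size=(D_h, D_in)), rng.normal(size=(D_h, D_h)), np.zeros(D_h)
    (Ui, Wi, bi), (Uf, Wf, bf), (Uo, Wo, bo), (Uc, Wc, bc) = (gate_params() for _ in range(4))

    def lstm_step(x_t, h_prev, c_prev):
        i = sigmoid(Ui @ x_t + Wi @ h_prev + bi)        # input gate
        f = sigmoid(Uf @ x_t + Wf @ h_prev + bf)        # forget gate
        o = sigmoid(Uo @ x_t + Wo @ h_prev + bo)        # output gate
        c_tilde = np.tanh(Uc @ x_t + Wc @ h_prev + bc)  # candidate (temp) cell content
        c = f * c_prev + i * c_tilde                    # mix old cell with the new candidate
        h = o * np.tanh(c)                              # expose part of the cell as hidden state
        return h, c

    h, c = np.zeros(D_h), np.zeros(D_h)
    for x_t in rng.normal(size=(5, D_in)):
        h, c = lstm_step(x_t, h, c)
    print(h)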

  26. LSTMS (LONG SHORT-TERM MEMORY NETWORKS)
(Figure: the LSTM cell, with cell states c_{t−1} → c_t and hidden states h_{t−1} → h_t.)
Figure by Christopher Olah (colah.github.io)

  27. LSTMS (LONG SHORT-TERM MEMORY NETWORKS)
Forget gate: forget the past or not (sigmoid: [0,1])
f_t = σ(U^(f) x_t + W^(f) h_{t−1} + b^(f))
Figure by Christopher Olah (colah.github.io)

  28. LSTMS (LONG SHORT-TERM MEMORY NETWORKS)
Forget gate: forget the past or not (sigmoid: [0,1])
f_t = σ(U^(f) x_t + W^(f) h_{t−1} + b^(f))
Input gate: use the input or not (sigmoid: [0,1])
i_t = σ(U^(i) x_t + W^(i) h_{t−1} + b^(i))
New cell content (temp) (tanh: [−1,1]):
c̃_t = tanh(U^(c) x_t + W^(c) h_{t−1} + b^(c))
Figure by Christopher Olah (colah.github.io)

  29. LSTMS (LONG SHORT-TERM MEMORY NETWORKS)
Forget gate: forget the past or not (sigmoid: [0,1])
f_t = σ(U^(f) x_t + W^(f) h_{t−1} + b^(f))
Input gate: use the input or not (sigmoid: [0,1])
i_t = σ(U^(i) x_t + W^(i) h_{t−1} + b^(i))
New cell content (temp) (tanh: [−1,1]):
c̃_t = tanh(U^(c) x_t + W^(c) h_{t−1} + b^(c))
New cell content: mix the old cell with the new temp cell
c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t
Figure by Christopher Olah (colah.github.io)

  30. LSTMS (LONG SHORT-TERM MEMORY NETWORKS)
Forget gate: forget the past or not
f_t = σ(U^(f) x_t + W^(f) h_{t−1} + b^(f))
Input gate: use the input or not
i_t = σ(U^(i) x_t + W^(i) h_{t−1} + b^(i))
New cell content (temp):
c̃_t = tanh(U^(c) x_t + W^(c) h_{t−1} + b^(c))
New cell content: mix the old cell with the new temp cell
c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t
Output gate: output from the new cell or not
o_t = σ(U^(o) x_t + W^(o) h_{t−1} + b^(o))
Hidden state:
h_t = o_t ⊙ tanh(c_t)
Figure by Christopher Olah (colah.github.io)

  31. LSTMS (LONG SHORT-TERM MEMORY NETWORKS)
Forget gate: forget the past or not
f_t = σ(U^(f) x_t + W^(f) h_{t−1} + b^(f))
Input gate: use the input or not
i_t = σ(U^(i) x_t + W^(i) h_{t−1} + b^(i))
Output gate: output from the new cell or not
o_t = σ(U^(o) x_t + W^(o) h_{t−1} + b^(o))
New cell content (temp):
c̃_t = tanh(U^(c) x_t + W^(c) h_{t−1} + b^(c))
New cell content: mix the old cell with the new temp cell
c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t
Hidden state:
h_t = o_t ⊙ tanh(c_t)

  32. Vanishing gradient problem for RNNs
• The shading of the nodes in the unfolded network indicates their sensitivity to the inputs at time one (the darker the shade, the greater the sensitivity).
• The sensitivity decays over time as new inputs overwrite the activations of the hidden layer, and the network ‘forgets’ the first inputs.
Example from Graves 2012
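
A tiny numerical illustration of why this happens in a vanilla RNN with small recurrent weights (matrix scale and sequence length are illustrative): the gradient of h_T with respect to h_1 is a product of per-step Jacobians diag(1 − h_t²) W, and its norm shrinks rapidly as T grows:

    import numpy as np

    rng = np.random.default_rng(4)
    D_h = 6
    W = rng.normal(scale=0.3, size=(D_h, D_h))    # small recurrent weights (illustrative)

    h = rng.normal(size=D_h)
    grad = np.eye(D_h)                            # d h_t / d h_1, starts as the identity
    for t in range(1, 31):
        h = np.tanh(W @ h)                        # vanilla RNN step (inputs omitted for simplicity)
        jac = np.diag(1 - h**2) @ W               # Jacobian d h_t / d h_{t-1}
        grad = jac @ grad                         # chain rule across time steps
        if t % 5 == 0:
            print(t, np.linalg.norm(grad))        # the norm decays toward 0: vanishing gradient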

  33. Preservation of gradient information by LSTM
• For simplicity, all gates are either entirely open (‘O’) or closed (‘—’).
• The memory cell ‘remembers’ the first input as long as the forget gate is open and the input gate is closed.
• The sensitivity of the output layer can be switched on and off by the output gate without affecting the cell.
(Figure: output, forget, and input gate states over time in the unfolded network.)
Example from Graves 2012
