SLIDE 1

CSE 490 U: Deep Learning Spring 2016

Yejin Choi

Some slides from Carlos Guestrin, Andrew Rosenberg, Luke Zettlemoyer

SLIDE 2

SLIDE 3

SLIDE 4

Human Neurons

  • Switching time: ~0.001 second
  • Number of neurons: 10^10
  • Connections per neuron: 10^4-5
  • Scene recognition time: 0.1 seconds
  • Number of cycles per scene recognition? 100 → much parallel computation!

SLIDE 5

Perceptron as a Neural Network

This is one neuron:

– Input edges x1 … xn, along with a bias
– The sum is represented graphically
– The sum is passed through an activation function g

SLIDE 6

Sigmoid Neuron

Just change g!

  • Why would we want to do this?
  • Notice the new output range [0,1]. What was it before?
  • Look familiar? (See the sketch below.)
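To make the two units concrete, here is a minimal numpy sketch (all variable names and values are illustrative, not from the slides): one neuron computing g(w0 + Σᵢ wᵢxᵢ), with either a step activation (the perceptron) or a sigmoid.

```python
# A minimal sketch of one neuron: output g(w0 + sum_i w_i * x_i).
# Swapping the step function for a sigmoid changes the output range
# from {0, 1} to (0, 1) without changing the wiring.
import numpy as np

def neuron(x, w, w0, g):
    """Weighted sum plus bias, passed through the activation g."""
    return g(w0 + np.dot(w, x))

step = lambda a: float(a > 0)                  # perceptron activation
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))   # sigmoid activation

x = np.array([1.0, 0.0])   # inputs x1, x2
w = np.array([1.0, 1.0])   # weights
print(neuron(x, w, -0.5, step))     # 1.0: the unit fires
print(neuron(x, w, -0.5, sigmoid))  # ~0.62: a soft version of the same decision
```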
SLIDE 7

Optimizing a neuron

We train to minimize sum-squared error:

∂l/∂wi = −Σ_j [y^j − g(w0 + Σ_i wi x_i^j)] · ∂/∂wi g(w0 + Σ_i wi x_i^j)

The solution just depends on g′: the derivative of the activation function! By the chain rule,

∂/∂x f(g(x)) = f′(g(x)) g′(x)

so

∂/∂wi g(w0 + Σ_i wi x_i^j) = x_i^j g′(w0 + Σ_i wi x_i^j)
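As a hedged numpy sketch of this update (the data, learning rate, and names are illustrative assumptions), a single sigmoid neuron can be trained by following the negative of the gradient above:

```python
# Train one sigmoid neuron to minimize sum-squared error by gradient descent.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))    # 100 examples x^j with 3 inputs each
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)   # toy labels y^j

w, w0, lr = np.zeros(3), 0.0, 0.5
for _ in range(200):
    p = sigmoid(w0 + X @ w)      # g(w0 + sum_i w_i x_i^j) for every j
    err = y - p                  # the bracketed term [y^j - g(...)]
    gp = p * (1 - p)             # g'(...) for a sigmoid (see the next slide)
    w += lr * (err * gp) @ X / len(X)    # minus the (averaged) gradient
    w0 += lr * np.mean(err * gp)
print(np.mean((p > 0.5) == y))   # training accuracy
```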

SLIDE 8

Sigmoid units: have to differentiate g

g′(x) = g(x)(1 − g(x))

SLIDE 9

Perceptron, linear classification, Boolean functions: xi ∈ {0,1}

  • Can it learn x1 ∨ x2? Yes: −0.5 + x1 + x2
  • Can it learn x1 ∧ x2? Yes: −1.5 + x1 + x2
  • Can it learn any conjunction or disjunction? Yes: −0.5 + x1 + … + xn for disjunction, (−n + 0.5) + x1 + … + xn for conjunction (checked in the sketch below)
  • Can it learn majority? Yes: (−0.5·n) + x1 + … + xn
  • What are we missing? The dreaded XOR!, etc.
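These constructions are easy to verify by brute force. A small sketch (illustrative, with all input weights equal to 1 and the bias carried in w0) that checks OR, AND, and majority over all binary inputs:

```python
# Enumerate all binary inputs and check each threshold-unit construction.
import itertools

def fires(w0, x):
    return w0 + sum(x) > 0   # unit weights, bias w0

for x1, x2 in itertools.product([0, 1], repeat=2):
    assert fires(-0.5, (x1, x2)) == bool(x1 or x2)    # OR:  -0.5 + x1 + x2
    assert fires(-1.5, (x1, x2)) == bool(x1 and x2)   # AND: -1.5 + x1 + x2

n = 5
for x in itertools.product([0, 1], repeat=n):
    assert fires(-0.5 * n, x) == (sum(x) > n / 2)     # majority
print("OR, AND, and majority all check out")
```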
SLIDE 10

Going beyond linear classification

Solving the XOR problem: y = x1 XOR x2 = (x1 ∧ ¬x2) ∨ (x2 ∧ ¬x1)

v1 = (x1 ∧ ¬x2) = −1.5 + 2x1 − x2
v2 = (x2 ∧ ¬x1) = −1.5 + 2x2 − x1
y = v1 ∨ v2 = −0.5 + v1 + v2

[Network diagram: inputs x1, x2, and a bias unit feed hidden units v1 and v2, which feed the output unit y, with the weights above]
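As a sketch in code (step activations assumed, as in the perceptron slides), the two hidden units detect the conjuncts and the output unit ORs them:

```python
# The two-layer XOR construction from the slide, verified on all inputs.
import itertools

step = lambda a: 1 if a > 0 else 0

def xor_net(x1, x2):
    v1 = step(-1.5 + 2 * x1 - x2)   # v1 = x1 AND NOT x2
    v2 = step(-1.5 + 2 * x2 - x1)   # v2 = x2 AND NOT x1
    return step(-0.5 + v1 + v2)     # y = v1 OR v2

for x1, x2 in itertools.product([0, 1], repeat=2):
    assert xor_net(x1, x2) == (x1 ^ x2)
print("XOR solved with one hidden layer")
```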

SLIDE 11

Hidden layer

  • Single unit:
  • 1-hidden layer:
  • No longer a convex function!
SLIDE 12

Example data for NN with hidden layer

SLIDE 13

Learned weights for hidden layer

SLIDE 14

Why "representation learning"?

  • MaxEnt (multinomial logistic regression): y = softmax(w · f(x, y)); you design the feature vector
  • NNs: y = softmax(w · σ(Ux)), or with n hidden layers, y = softmax(w · σ(U^(n)(… σ(U^(2) σ(U^(1) x))))); feature representations are "learned" through the hidden layers
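A minimal numpy sketch of the one-hidden-layer form (shapes, initialization, and names are illustrative assumptions): the network computes its own representation σ(Ux) where MaxEnt would use a hand-designed f(x, y).

```python
# y = softmax(w . sigma(Ux)): a learned representation feeding a softmax.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # subtract max for numerical stability
    return e / e.sum()

sigma = lambda a: 1.0 / (1.0 + np.exp(-a))   # elementwise nonlinearity

rng = np.random.default_rng(0)
x = rng.normal(size=4)         # raw input, no hand-designed features
U = rng.normal(size=(8, 4))    # first-layer weights (learned in practice)
w = rng.normal(size=(3, 8))    # output weights for 3 classes

y = softmax(w @ sigma(U @ x))
print(y, y.sum())              # a proper distribution over the 3 classes
```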

SLIDE 15

Very deep models in computer vision

SLIDE 16

RECURRENT NEURAL NETWORKS

SLIDE 17

Recurrent Neural Networks (RNNs)

[Diagram: an RNN unrolled over time; inputs feed hidden states h1 … h4, which produce outputs y1 … y4]

  • Each RNN unit computes a new hidden state using the previous state and a new input
  • Each RNN unit (optionally) makes an output using the current hidden state
  • Hidden states are continuous vectors

– Can represent very rich information
– Possibly the entire history from the beginning

  • Parameters are shared (tied) across all RNN units (unlike feedforward NNs)

ht = f(xt, ht−1), ht ∈ R^D
yt = softmax(V ht)

SLIDE 18

Recurrent Neural Networks (RNNs)

  • Generic RNNs: ht = f(xt, ht−1), yt = softmax(V ht)
  • Vanilla RNN: ht = tanh(Uxt + Wht−1 + b), yt = softmax(V ht)
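A hedged numpy sketch of the vanilla RNN above (dimensions and initialization scale are illustrative). Note that the same U, W, b, and V are reused at every time step, exactly as slide 17 says:

```python
# Forward pass of a vanilla RNN: h_t = tanh(U x_t + W h_{t-1} + b),
# y_t = softmax(V h_t), with one shared parameter set across all steps.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

D, E, K = 16, 8, 10   # hidden size, input size, output classes (assumed)
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(D, E))
W = rng.normal(scale=0.1, size=(D, D))
b = np.zeros(D)
V = rng.normal(scale=0.1, size=(K, D))

h = np.zeros(D)                        # h_0
for x_t in rng.normal(size=(5, E)):    # a length-5 toy input sequence
    h = np.tanh(U @ x_t + W @ h + b)   # new state from old state + input
    y_t = softmax(V @ h)               # optional output at this step
print(y_t.shape)                       # (10,): a distribution per step
```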

SLIDE 19

Many uses of RNNs

  • 1. Classification (seq to one)
  • Input: a sequence
  • Output: one label (classification)
  • Example: sentiment classification

ht = f(xt, ht−1)
y = softmax(V hn)
SLIDE 20

Many uses of RNNs

  • 2. One to seq
  • Input: one item
  • Output: a sequence
  • Example: image captioning ("Cat sitting on top of …")

ht = f(xt, ht−1)
yt = softmax(V ht)

SLIDE 21

Many uses of RNNs

  • 3. Sequence tagging
  • Input: a sequence
  • Output: a sequence (of the same length)
  • Example: POS tagging, Named Entity Recognition
  • How about language models?

– Yes! RNNs can be used as LMs!
– RNNs make the Markov assumption: T/F?

ht = f(xt, ht−1)
yt = softmax(V ht)

SLIDE 22

Many uses of RNNs

  • 4. Language models
  • Input: a sequence of words
  • Output: the next word, or a sequence of next words
  • During training, x_t is the actual word in the training sentence.
  • During testing, x_t is the word predicted from the previous time step (see the sketch below).
  • Do RNN LMs make the Markov assumption, i.e., that the next word depends only on the previous N words?

ht = f(xt, ht−1)
yt = softmax(V ht)
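The training/testing distinction can be made concrete with a sketch (names, sizes, and the one-hot encoding are assumptions): the same cell is driven by the corpus word during a training-style pass, and by its own prediction during generation.

```python
# An RNN LM cell driven two ways: by observed words, or by its own output.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

Vocab, D = 10, 16
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(D, Vocab))
W = rng.normal(scale=0.1, size=(D, D))
b = np.zeros(D)
V = rng.normal(scale=0.1, size=(Vocab, D))
onehot = np.eye(Vocab)

def step(word_id, h):
    h = np.tanh(U @ onehot[word_id] + W @ h + b)
    return softmax(V @ h), h           # distribution over the next word

# Training-style pass: x_t is the actual word from the training sentence.
h = np.zeros(D)
for w in [3, 1, 4, 1, 5]:
    y, h = step(w, h)

# Test-style pass: x_t is the word predicted at the previous time step.
h, w = np.zeros(D), 3
for _ in range(5):
    y, h = step(w, h)
    w = int(np.argmax(y))              # feed the prediction back in
print(w)
```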

SLIDE 23

Many uses of RNNs

  • 5. seq2seq (aka "encoder-decoder")
  • Input: a sequence
  • Output: a sequence (of a different length)
  • Examples?

ht = f(xt, ht−1)
yt = softmax(V ht)

SLIDE 24

Many uses of RNNs

  • 5. seq2seq (aka "encoder-decoder")
  • Conversation and dialogue
  • Machine translation

Figure from http://www.wildml.com/category/conversational-agents/
SLIDE 25

Many uses of RNNs

  • 5. seq2seq (aka "encoder-decoder")
  • Parsing! Example: input "John has a dog", output its parse
  • "Grammar as Foreign Language" (Vinyals et al., 2015)
SLIDE 26

Recurrent Neural Networks (RNNs)

  • Generic RNNs: ht = f(xt, ht−1), yt = softmax(V ht)
  • Vanilla RNN: ht = tanh(Uxt + Wht−1 + b), yt = softmax(V ht)

SLIDE 27

Recurrent Neural Networks (RNNs)

  • Generic RNNs: ht = f(xt, ht−1)
  • Vanilla RNNs: ht = tanh(Uxt + Wht−1 + b)
  • LSTMs (Long Short-Term Memory Networks):

ot = σ(U^(o)xt + W^(o)ht−1 + b^(o))
ft = σ(U^(f)xt + W^(f)ht−1 + b^(f))
it = σ(U^(i)xt + W^(i)ht−1 + b^(i))
c̃t = tanh(U^(c)xt + W^(c)ht−1 + b^(c))
ct = ft ⊙ ct−1 + it ⊙ c̃t
ht = ot ⊙ tanh(ct)

where ct is the cell state, ht is the hidden state, and ⊙ is the elementwise product. There are many known variations to this set of equations!
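One step of these equations in numpy, as a hedged sketch (sizes and initialization are illustrative; * stands in for the elementwise product ⊙):

```python
# One LSTM step: gates o, f, i; candidate cell c~; new cell c; hidden h.
import numpy as np

sigma = lambda a: 1.0 / (1.0 + np.exp(-a))

D, E = 8, 4   # hidden/cell size and input size (assumed)
rng = np.random.default_rng(0)
def mats():   # one (U, W, b) triple per gate or candidate
    return (rng.normal(scale=0.1, size=(D, E)),
            rng.normal(scale=0.1, size=(D, D)),
            np.zeros(D))
(Uo, Wo, bo), (Uf, Wf, bf), (Ui, Wi, bi), (Uc, Wc, bc) = (mats() for _ in range(4))

def lstm_step(x, h_prev, c_prev):
    o = sigma(Uo @ x + Wo @ h_prev + bo)          # output gate
    f = sigma(Uf @ x + Wf @ h_prev + bf)          # forget gate
    i = sigma(Ui @ x + Wi @ h_prev + bi)          # input gate
    c_tilde = np.tanh(Uc @ x + Wc @ h_prev + bc)  # new cell content (temp)
    c = f * c_prev + i * c_tilde                  # mix old cell with new
    h = o * np.tanh(c)                            # expose part of the cell
    return h, c

h, c = np.zeros(D), np.zeros(D)
for x in rng.normal(size=(3, E)):   # a length-3 toy sequence
    h, c = lstm_step(x, h, c)
print(h.shape, c.shape)
```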

SLIDE 28

LSTMs (Long Short-Term Memory Networks)

[Diagram: one LSTM unit mapping (ct−1, ht−1) and xt to (ct, ht)]

Figure by Christopher Olah (colah.github.io)

SLIDE 29

LSTMs (Long Short-Term Memory Networks)

Forget gate (forget the past or not): ft = σ(U^(f)xt + W^(f)ht−1 + b^(f))    (sigmoid: [0,1])

Figure by Christopher Olah (colah.github.io)

SLIDE 30

LSTMs (Long Short-Term Memory Networks)

Forget gate (forget the past or not): ft = σ(U^(f)xt + W^(f)ht−1 + b^(f))    (sigmoid: [0,1])
Input gate (use the input or not): it = σ(U^(i)xt + W^(i)ht−1 + b^(i))
New cell content (temp): c̃t = tanh(U^(c)xt + W^(c)ht−1 + b^(c))    (tanh: [−1,1])

Figure by Christopher Olah (colah.github.io)

SLIDE 31

LSTMs (Long Short-Term Memory Networks)

Forget gate (forget the past or not): ft = σ(U^(f)xt + W^(f)ht−1 + b^(f))    (sigmoid: [0,1])
Input gate (use the input or not): it = σ(U^(i)xt + W^(i)ht−1 + b^(i))
New cell content (temp): c̃t = tanh(U^(c)xt + W^(c)ht−1 + b^(c))    (tanh: [−1,1])
New cell content (mix the old cell with the new temp cell): ct = ft ⊙ ct−1 + it ⊙ c̃t

Figure by Christopher Olah (colah.github.io)

SLIDE 32

LSTMs (Long Short-Term Memory Networks)

Forget gate (forget the past or not): ft = σ(U^(f)xt + W^(f)ht−1 + b^(f))
Input gate (use the input or not): it = σ(U^(i)xt + W^(i)ht−1 + b^(i))
New cell content (temp): c̃t = tanh(U^(c)xt + W^(c)ht−1 + b^(c))
New cell content (mix the old cell with the new temp cell): ct = ft ⊙ ct−1 + it ⊙ c̃t
Output gate (output from the new cell or not): ot = σ(U^(o)xt + W^(o)ht−1 + b^(o))
Hidden state: ht = ot ⊙ tanh(ct)

Figure by Christopher Olah (colah.github.io)

SLIDE 33

LSTMs (Long Short-Term Memory Networks)

Forget gate (forget the past or not): ft = σ(U^(f)xt + W^(f)ht−1 + b^(f))
Input gate (use the input or not): it = σ(U^(i)xt + W^(i)ht−1 + b^(i))
Output gate (output from the new cell or not): ot = σ(U^(o)xt + W^(o)ht−1 + b^(o))
New cell content (temp): c̃t = tanh(U^(c)xt + W^(c)ht−1 + b^(c))
New cell content (mix the old cell with the new temp cell): ct = ft ⊙ ct−1 + it ⊙ c̃t
Hidden state: ht = ot ⊙ tanh(ct)

[Diagram: one LSTM unit mapping (ct−1, ht−1) and xt to (ct, ht)]

SLIDE 34

Vanishing gradient problem for RNNs

  • The shading of the nodes in the unfolded network indicates their sensitivity to the inputs at time one (the darker the shade, the greater the sensitivity).
  • The sensitivity decays over time as new inputs overwrite the activations of the hidden layer, and the network 'forgets' the first inputs.

Example from Graves 2012

SLIDE 35

Preservation of gradient information by LSTM

  • For simplicity, all gates are either entirely open ('O') or closed ('—').
  • The memory cell 'remembers' the first input as long as the forget gate is open and the input gate is closed.
  • The sensitivity of the output layer can be switched on and off by the output gate without affecting the cell.

[Figure: an unrolled LSTM annotated with forget, input, and output gate states; example from Graves 2012]

SLIDE 36

Recurrent Neural Networks (RNNs)

  • Generic RNNs: ht = f(xt, ht−1)
  • Vanilla RNNs: ht = tanh(Uxt + Wht−1 + b)
  • GRUs (Gated Recurrent Units):

zt = σ(U^(z)xt + W^(z)ht−1 + b^(z))
rt = σ(U^(r)xt + W^(r)ht−1 + b^(r))
h̃t = tanh(U^(h)xt + W^(h)(rt ⊙ ht−1) + b^(h))
ht = (1 − zt) ⊙ ht−1 + zt ⊙ h̃t

Fewer parameters than LSTMs, and easier to train for comparable performance!
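A companion sketch of one GRU step under the same assumptions as the LSTM sketch above; it needs three (U, W, b) triples where the LSTM needs four, which is the parameter saving just mentioned.

```python
# One GRU step: update gate z, reset gate r, candidate h~, interpolation.
import numpy as np

sigma = lambda a: 1.0 / (1.0 + np.exp(-a))

D, E = 8, 4
rng = np.random.default_rng(0)
def mats():
    return (rng.normal(scale=0.1, size=(D, E)),
            rng.normal(scale=0.1, size=(D, D)),
            np.zeros(D))
(Uz, Wz, bz), (Ur, Wr, br), (Uh, Wh, bh) = (mats() for _ in range(3))

def gru_step(x, h_prev):
    z = sigma(Uz @ x + Wz @ h_prev + bz)                 # update gate
    r = sigma(Ur @ x + Wr @ h_prev + br)                 # reset gate
    h_tilde = np.tanh(Uh @ x + Wh @ (r * h_prev) + bh)   # candidate state
    return (1 - z) * h_prev + z * h_tilde                # old/new mix

h = np.zeros(D)
for x in rng.normal(size=(3, E)):
    h = gru_step(x, h)
print(h.shape)
```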

SLIDE 37

Recursive Neural Networks

  • Sometimes, inference over a tree structure makes more sense than a sequential structure
  • An example of compositionality in ideological bias detection (red → conservative, blue → liberal, gray → neutral), in which modifier phrases and punctuation cause polarity switches at higher levels of the parse tree

Example from Iyyer et al., 2014

SLIDE 38

Recursive Neural Networks

  • NNs connected as a tree
  • Tree structure is fixed a priori
  • Parameters are shared, similarly to RNNs

Example from Iyyer et al., 2014

SLIDE 39

LEARNING: BACKPROPAGATION

SLIDE 40

Error Backpropagation

  • Model parameters: θ = {w^(1)_ij, w^(2)_jk, w^(3)_kl}; for brevity: θ = {wij, wjk, wkl}

[Diagram: a three-layer feedforward network with inputs x0, x1, x2, …, xP, weights wij, wjk, wkl, and output f(x, θ)]

Next 10 slides on backpropagation are adapted from Andrew Rosenberg

SLIDE 41

Error Backpropagation

  • Model parameters: θ = {wij, wjk, wkl}
  • Let a and z be the input and output of each node

[Diagram: the same network with node inputs aj, ak, al and outputs zi, zj, zk, zl marked]

SLIDE 42

Error Backpropagation

aj = Σ_i wij zi
zj = g(aj)

[Diagram: node j receives zi over weight wij, computes aj, and outputs zj over weight wjk]

SLIDE 43

  • Let a and z be the input and output of each node:

aj = Σ_i wij zi,   zj = g(aj)
ak = Σ_j wjk zj,   zk = g(ak)
al = Σ_k wkl zk,   zl = g(al)
SLIDE 44
SLIDE 45

Training: minimize loss

Empirical Risk Function:

R(θ) = (1/N) Σ_n L(yn, f(xn))
     = (1/N) Σ_n ½ (yn − f(xn))²
     = (1/N) Σ_n ½ (yn − g(Σ_k wkl g(Σ_j wjk g(Σ_i wij xn,i))))²

SLIDE 46

SLIDE 47

Error Backpropagation

Optimize last-layer weights wkl using the calculus chain rule:

Ln = ½ (yn − f(xn))²
∂R/∂wkl = (1/N) Σ_n (∂Ln/∂al,n)(∂al,n/∂wkl)
SLIDE 48

Error Backpropagation

Optimize last-layer weights wkl using the calculus chain rule:

∂R/∂wkl = (1/N) Σ_n (∂Ln/∂al,n)(∂al,n/∂wkl)
        = (1/N) Σ_n (∂[½ (yn − g(al,n))²]/∂al,n)(∂al,n/∂wkl)

SLIDE 49

Error Backpropagation

Optimize last-layer weights wkl:

∂R/∂wkl = (1/N) Σ_n (∂[½ (yn − g(al,n))²]/∂al,n)(∂(zk,n wkl)/∂wkl)

SLIDE 50

Error Backpropagation

Optimize last-layer weights wkl:

∂R/∂wkl = (1/N) Σ_n (∂[½ (yn − g(al,n))²]/∂al,n)(∂(zk,n wkl)/∂wkl)
        = (1/N) Σ_n [−(yn − zl,n) g′(al,n)] zk,n

SLIDE 51

Error Backpropagation

Optimize last-layer weights wkl:

∂R/∂wkl = (1/N) Σ_n [−(yn − zl,n) g′(al,n)] zk,n = (1/N) Σ_n δl,n zk,n

where δl,n = −(yn − zl,n) g′(al,n).

SLIDE 52

Error Backpropagation

Repeat for all previous layers:

∂R/∂wkl = (1/N) Σ_n (∂Ln/∂al,n)(∂al,n/∂wkl) = (1/N) Σ_n δl,n zk,n

∂R/∂wjk = (1/N) Σ_n (∂Ln/∂ak,n)(∂ak,n/∂wjk) = (1/N) Σ_n [Σ_l δl,n wkl g′(ak,n)] zj,n = (1/N) Σ_n δk,n zj,n

∂R/∂wij = (1/N) Σ_n (∂Ln/∂aj,n)(∂aj,n/∂wij) = (1/N) Σ_n [Σ_k δk,n wjk g′(aj,n)] zi,n = (1/N) Σ_n δj,n zi,n

SLIDE 53

Backprop Recursion

aj = Σ_i wij zi,   zj = g(aj)

∂R/∂wjk = (1/N) Σ_n [Σ_l δl,n wkl g′(ak,n)] zj,n = (1/N) Σ_n δk,n zj,n
∂R/∂wij = (1/N) Σ_n [Σ_k δk,n wjk g′(aj,n)] zi,n = (1/N) Σ_n δj,n zi,n

[Diagram: forward values zi, zj, zk flow forward through weights wij, wjk, while deltas δk, δj, δi flow backward]

SLIDE 54

Learning: Gradient Descent

wij^(t+1) = wij^t − η ∂R/∂wij
wjk^(t+1) = wjk^t − η ∂R/∂wjk
wkl^(t+1) = wkl^t − η ∂R/∂wkl
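Putting slides 40-54 together, here is a compact, hedged numpy sketch of the whole recipe (data, sizes, and learning rate are illustrative; biases are omitted for brevity): a forward sweep, the δ recursion, and the gradient-descent updates.

```python
# Backprop for a 2-hidden-layer sigmoid network trained on a toy task.
import numpy as np

g = lambda a: 1.0 / (1.0 + np.exp(-a))   # activation
gp = lambda z: z * (1 - z)               # g'(a), written via z = g(a)

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))                              # inputs x_n
y = (np.sin(X.sum(axis=1)) > 0).astype(float)[:, None]    # toy targets y_n
W1, W2, W3 = (rng.normal(scale=0.5, size=s) for s in [(3, 8), (8, 8), (8, 1)])

eta, N = 1.0, len(X)
for _ in range(500):
    # forward sweep: a = weighted sum of z's, z = g(a)
    z1 = g(X @ W1); z2 = g(z1 @ W2); z3 = g(z2 @ W3)
    # delta recursion, from the output layer back to the input
    d3 = -(y - z3) * gp(z3)         # delta_l = -(y_n - z_l) g'(a_l)
    d2 = (d3 @ W3.T) * gp(z2)       # delta_k = [sum_l delta_l w_kl] g'(a_k)
    d1 = (d2 @ W2.T) * gp(z1)
    # dR/dw = (1/N) sum_n delta * z, then w <- w - eta * dR/dw
    W3 -= eta * z2.T @ d3 / N
    W2 -= eta * z1.T @ d2 / N
    W1 -= eta * X.T @ d1 / N
print(np.mean((z3 > 0.5) == y))     # training accuracy
```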

SLIDE 55

Backpropagation

  • Starts with a forward sweep to compute all the intermediate function values
  • Through backprop, computes the partial derivatives recursively
  • A form of dynamic programming

– Instead of considering exponentially many paths between a weight w_ij and the final loss (risk), store and reuse intermediate results.

  • A type of automatic differentiation (there are other variants, e.g., recursive differentiation only through forward propagation).

[Diagram: forward values zi and backward deltas δj combine into ∂R/∂wij]

SLIDE 56

Backpropagation

Frameworks and their primary interface languages:

  • TensorFlow (https://www.tensorflow.org/): Python
  • Torch (http://torch.ch/): Lua
  • Theano (http://deeplearning.net/software/theano/): Python
  • CNTK (https://github.com/Microsoft/CNTK): C++
  • cnn (https://github.com/clab/cnn): C++
  • Caffe (http://caffe.berkeleyvision.org/): C++

SLIDE 57

Cross Entropy Loss (aka log loss, logistic loss)

  • Cross entropy (p: true distribution, q: predicted distribution):

H(p, q) = Ep[−log q] = −Σ_y p(y) log q(y) = H(p) + DKL(p||q)

  • Related quantities:

– Entropy: H(p) = −Σ_y p(y) log p(y)
– KL divergence (the distance between two distributions p and q): DKL(p||q) = Σ_y p(y) log (p(y)/q(y))

  • Use cross entropy for models that should have a more probabilistic flavor (e.g., language models)
  • Use mean squared error, MSE = ½ (y − f(x))², for models that focus on correct/incorrect predictions
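A small numeric check of the identity H(p, q) = H(p) + DKL(p||q), with an illustrative pair of distributions:

```python
# Cross entropy decomposes into entropy plus KL divergence.
import numpy as np

p = np.array([0.7, 0.2, 0.1])   # true distribution
q = np.array([0.5, 0.3, 0.2])   # predicted distribution

H_pq = -(p * np.log(q)).sum()   # cross entropy H(p, q)
H_p = -(p * np.log(p)).sum()    # entropy H(p)
KL = (p * np.log(p / q)).sum()  # KL divergence D_KL(p || q)

print(H_pq, H_p + KL)           # the two values agree
```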

SLIDE 58

RNN Learning: Backprop Through Time (BPTT)

  • Similar to backprop for non-recurrent NNs
  • But unlike feedforward (non-recurrent) NNs, each unit in the computation graph repeats the exact same parameters…
  • Backprop gradients of the parameters of each unit as if they were different parameters
  • When updating the parameters using the gradients, use the average gradients across the entire chain of units (see the sketch below)
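A hedged numpy sketch of this bookkeeping for a bias-free vanilla RNN ht = tanh(Uxt + Wht−1), with toy sizes and a loss only at the final step: the backward sweep computes one gradient of the shared W per unit, and the update uses their average, as described above.

```python
# BPTT: per-unit gradients for the shared W, then one averaged update.
import numpy as np

rng = np.random.default_rng(0)
T, D, E = 5, 6, 3
U = rng.normal(scale=0.3, size=(D, E))
W = rng.normal(scale=0.3, size=(D, D))
xs = rng.normal(size=(T, E))
target = rng.normal(size=D)

# forward sweep, storing every hidden state h_t
hs = [np.zeros(D)]
for t in range(T):
    hs.append(np.tanh(U @ xs[t] + W @ hs[-1]))
loss = 0.5 * np.sum((hs[-1] - target) ** 2)

# backward sweep: the loss gradient flows back through the chain of units
dh = hs[-1] - target
dW_per_unit = []
for t in reversed(range(T)):
    da = dh * (1 - hs[t + 1] ** 2)           # back through tanh
    dW_per_unit.append(np.outer(da, hs[t]))  # "as if W_t were separate"
    dh = W.T @ da                            # pass to the previous unit

W -= 0.1 * np.mean(dW_per_unit, axis=0)      # update with the average
print(loss)
```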

SLIDE 59

Convergence of backprop

  • Without non-linearity or hidden layers, learning is convex optimization

– Gradient descent reaches global minima

  • Multilayer neural nets (with nonlinearity) are not convex

– Gradient descent gets stuck in local minima
– Selecting the number of hidden units and layers = fuzzy process
– NNs have made a HUGE comeback in the last few years

  • Neural nets are back with a new name

– Deep belief networks
– Huge error reduction when trained with lots of data on GPUs

SLIDE 60

Overfitting in NNs

  • Are NNs likely to overfit?

– Yes, they can represent arbitrary functions!!!

  • Avoiding overfitting?

– More training data
– Fewer hidden nodes / better topology
– Random perturbation to the graph topology ("Dropout")
– Regularization
– Early stopping