8
Outline
Morning program Preliminaries Text matching I Text matching II Afternoon program Learning to rank Modeling user behavior Generating responses Wrap up
9
Outline
Morning program Preliminaries
Feedforward neural network Back propagation Distributed representations Recurrent neural networks Sequence-to-sequence models Convolutional neural networks
Text matching I Text matching II Afternoon program Learning to rank Modeling user behavior Generating responses Wrap up
10
Preliminaries
Multi-layer perceptron a.k.a. feedforward neural network
[Figure: a feedforward network with input x = (x1, x2, x3, x4), weighted connections, a hidden layer, and an output layer producing predictions ŷ1, ŷ2, ŷ3 that are compared with targets y1, y2, y3 by a cost function, e.g. ½ (y − ŷ)². Node j at level i applies an activation function φ to its weighted input, e.g. the sigmoid σ(o) = 1 / (1 + e−o).]
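A minimal sketch in Python/NumPy (with made-up numbers) of what one node computes: a weighted sum of the previous level's activations passed through the sigmoid activation, scored against a target with the squared-error cost from the slide.

```python
import numpy as np

def sigmoid(o):
    """Activation function phi from the slide: sigma(o) = 1 / (1 + e^-o)."""
    return 1.0 / (1.0 + np.exp(-o))

# Hypothetical numbers for illustration: one node j at level i with
# four incoming activations x_{i-1,k} and weights w_{i,k}.
x_prev = np.array([0.5, 0.1, 0.8, 0.3])   # activations from level i-1
w      = np.array([0.2, -0.4, 0.7, 0.1])  # weights into node j

o_ij = np.dot(w, x_prev)   # weighted sum o_{i,j}
x_ij = sigmoid(o_ij)       # activation x_{i,j} = phi(o_{i,j})

# Squared-error cost against a target y, as on the slide: 1/2 (y - y_hat)^2
y_target = 1.0
cost = 0.5 * (y_target - x_ij) ** 2
```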
11
Outline
Morning program Preliminaries
Feedforward neural network Back propagation Distributed representations Recurrent neural networks Sequence-to-sequence models Convolutional neural networks
Text matching I Text matching II Afternoon program Learning to rank Modeling user behavior Generating responses Wrap up
12
Preliminaries
Back propagation
[Figure: the same feedforward network; inputs x1…x4, predictions ŷ1, ŷ2, ŷ3, targets y1, y2, y3, a weight wi,j, and the cost function.]
until convergence:
- do a forward pass
- compute the cost/error
- adjust weights ← how??
Adjust every weight wi,j by: Δwi,j = −α ∂cost/∂wi,j
α is the learning rate.
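A toy, runnable version of this loop for a single sigmoid weight and squared-error cost (made-up numbers; the gradient plugged in below is exactly the chain-rule expression derived on the following slides).

```python
import numpy as np

def sigmoid(o):
    return 1.0 / (1.0 + np.exp(-o))

# Toy example (made-up numbers): one weight, one input, squared-error cost.
x, y_target = 1.5, 0.8
w, alpha = 0.0, 0.5   # initial weight and learning rate

for step in range(100):                        # "until convergence" (fixed budget here)
    y_hat = sigmoid(w * x)                     # forward pass
    cost = 0.5 * (y_target - y_hat) ** 2       # compute the cost/error
    # dcost/dw via the chain rule: -(y - y_hat) * y_hat * (1 - y_hat) * x
    dcost_dw = -(y_target - y_hat) * y_hat * (1.0 - y_hat) * x
    w -= alpha * dcost_dw                      # delta_w = -alpha * dcost/dw
```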
13
Preliminaries
Back propagation
cost(ŷ, y) = ½ (y − ŷ)²,   ŷj = xi,j = φ(oi,j)
Δwi,j = −α ∂cost/∂wi,j = −α (∂cost/∂xi,j) (∂xi,j/∂wi,j)   ← chain rule
14
Preliminaries
Back propagation
cost(ŷ, y) = ½ (y − ŷ)²,   ŷj = xi,j = φ(oi,j), e.g. σ(oi,j),   xi,j = σ(o) = 1 / (1 + e−o)
oi,j = Σk=1..K wi,k · xi−1,k
Δwi,j = −α ∂cost/∂wi,j = −α (∂cost/∂xi,j) (∂xi,j/∂wi,j) = −α (∂cost/∂xi,j) (∂xi,j/∂oi,j) (∂oi,j/∂wi,j)   ← chain rule
15
Preliminaries
Back propagation
cost(ŷ, y) = ½ (y − ŷ)²,   ŷj = xi,j = φ(oi,j), e.g. σ(oi,j),   xi,j = σ(o) = 1 / (1 + e−o)
oi,j = Σk=1..K wi,k · xi−1,k
Δwi,j = −α ∂cost/∂wi,j = −α (∂cost/∂xi,j) (∂xi,j/∂wi,j) = −α (∂cost/∂xi,j) (∂xi,j/∂oi,j) (∂oi,j/∂wi,j)
16
Preliminaries
Back propagation
cost(ŷ, y) = ½ (y − ŷ)²,   ŷj = xi,j = φ(oi,j), e.g. σ(oi,j),   xi,j = σ(o) = 1 / (1 + e−o)
oi,j = Σk=1..K wi,k · xi−1,k
Δwi,j = −α ∂cost/∂wi,j = −α (∂cost/∂xi,j) (∂xi,j/∂wi,j) = −α (∂cost/∂xi,j) (∂xi,j/∂oi,j) (∂oi,j/∂wi,j) = −α (∂cost/∂xi,j) (∂xi,j/∂oi,j) xi−1,j
17
Preliminaries
Back propagation
cost(ŷ, y) = ½ (y − ŷ)²,   ŷj = xi,j = φ(oi,j), e.g. σ(oi,j),   xi,j = σ(o) = 1 / (1 + e−o),   σ′(o) = σ(o)(1 − σ(o))
oi,j = Σk=1..K wi,k · xi−1,k
Δwi,j = −α ∂cost/∂wi,j = −α (∂cost/∂xi,j) (∂xi,j/∂wi,j) = −α (∂cost/∂xi,j) (∂xi,j/∂oi,j) (∂oi,j/∂wi,j) = −α (∂cost/∂xi,j) xi,j(1 − xi,j) xi−1,j
18
Preliminaries
Back propagation
cost(ŷ, y) = ½ (y − ŷ)²,   ŷj = xi,j = φ(oi,j), e.g. σ(oi,j),   xi,j = σ(o) = 1 / (1 + e−o),   σ′(o) = σ(o)(1 − σ(o))
oi,j = Σk=1..K wi,k · xi−1,k
Δwi,j = −α ∂cost/∂wi,j = −α (∂cost/∂xi,j) (∂xi,j/∂wi,j) = −α (∂cost/∂xi,j) (∂xi,j/∂oi,j) (∂oi,j/∂wi,j) = −α (yj − xi,j) xi,j(1 − xi,j) xi−1,j
19
Preliminaries
Back propagation
cost(ŷ, y) = ½ (y − ŷ)²,   ŷj = xi,j = φ(oi,j), e.g. σ(oi,j),   xi,j = σ(o) = 1 / (1 + e−o),   σ′(o) = σ(o)(1 − σ(o))
oi,j = Σk=1..K wi,k · xi−1,k
[Plot: σ(o) and σ′(o) for o between −6 and 6.]
Δwi,j = −α ∂cost/∂wi,j = −α (∂cost/∂xi,j) (∂xi,j/∂wi,j) = −α (∂cost/∂xi,j) (∂xi,j/∂oi,j) (∂oi,j/∂wi,j) = −α (yj − xi,j) xi,j(1 − xi,j) xi−1,j   — learning rate, cost, activation, input
20
Preliminaries
Back propagation
[Figure: the same network, now with errors δ1, δ2, δ3 at the output layer.]
Δwi,j = −α ∂cost/∂wi,j = −α (∂cost/∂xi,j) (∂xi,j/∂oi,j) (∂oi,j/∂wi,j)   — learning rate, cost, activation, input
      = −α δ xi−1,j
δoutput = (yj − xi,j) xi,j(1 − xi,j)   ← previous slide
δhidden = (Σn∈nodes δn wn,j) xi,j(1 − xi,j)
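A sketch of these update rules for a toy 4–4–3 network with made-up numbers. Note that sign conventions differ between texts; with δ defined as (y − ŷ)·σ′ as above, adding α · δ · (incoming activation) to each weight is the update that decreases the squared error.

```python
import numpy as np

def sigmoid(o):
    return 1.0 / (1.0 + np.exp(-o))

# Toy 2-layer network matching the slide's shapes (random/made-up numbers):
# 4 inputs -> 4 hidden units -> 3 outputs, sigmoid everywhere.
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(4, 4))
W2 = rng.normal(scale=0.1, size=(4, 3))
x  = np.array([0.1, 0.4, 0.2, 0.7])
y  = np.array([0.0, 1.0, 0.0])
alpha = 0.5

# Forward pass
h     = sigmoid(x @ W1)        # hidden activations
y_hat = sigmoid(h @ W2)        # output activations

# Deltas as on the slide:
#   delta_output = (y - y_hat) * y_hat * (1 - y_hat)
#   delta_hidden = (sum_n delta_n * w_{n,j}) * h * (1 - h)
delta_out    = (y - y_hat) * y_hat * (1.0 - y_hat)
delta_hidden = (W2 @ delta_out) * h * (1.0 - h)

# Weight updates: learning rate * delta * incoming activation
W2 += alpha * np.outer(h, delta_out)
W1 += alpha * np.outer(x, delta_hidden)
```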
21
Preliminaries
Network representation
[Figure: the same network, shown next to its matrix form.]
x1 = [x1[1], x1[2], x1[3], x1[4]]
x1 · W1 = o1:  [1 × 4] [4 × 4] = [1 × 4];   activation: x2 = σ(o1) = [1 × 4]
x2 · W2 = o2:  [1 × 4] [4 × 3] = [1 × 3];   activation: x3 = σ(o2) = [1 × 3] = ŷ
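The same computation written as a couple of matrix products, with random weights standing in for trained ones:

```python
import numpy as np

def sigmoid(o):
    return 1.0 / (1.0 + np.exp(-o))

# Shapes as on the slide (weights are made-up random numbers).
rng = np.random.default_rng(1)
x1 = rng.random((1, 4))          # input row vector      [1 x 4]
W1 = rng.random((4, 4))          # first weight matrix   [4 x 4]
W2 = rng.random((4, 3))          # second weight matrix  [4 x 3]

o1 = x1 @ W1                     # [1 x 4] [4 x 4] = [1 x 4]
x2 = sigmoid(o1)                 # activation            [1 x 4]
o2 = x2 @ W2                     # [1 x 4] [4 x 3] = [1 x 3]
x3 = sigmoid(o2)                 # prediction y_hat      [1 x 3]
```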
22
Outline
Morning program Preliminaries
Feedforward neural network Back propagation Distributed representations Recurrent neural networks Sequence-to-sequence models Convolutional neural networks
Text matching I Text matching II Afternoon program Learning to rank Modeling user behavior Generating responses Wrap up
23
Preliminaries
Distributed representations
◮ Represent units, e.g., words, as vectors
◮ Goal: words that are similar, e.g., in terms of meaning, should get similar embeddings
Cosine similarity to determine how similar two vectors are:
cosine(v, w) = (v⊤ · w) / (‖v‖2 ‖w‖2) = Σi=1..|v| vi wi / ( √(Σi=1..|v| vi²) √(Σi=1..|w| wi²) )
newspaper = <0.08, 0.31, 0.41> magazine = <0.09, 0.35, 0.36> biking = <0.59, 0.25, 0.01>
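Using the three example vectors above:

```python
import numpy as np

def cosine(v, w):
    """Cosine similarity: (v . w) / (||v||_2 ||w||_2)."""
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

# The three example vectors from the slide.
newspaper = np.array([0.08, 0.31, 0.41])
magazine  = np.array([0.09, 0.35, 0.36])
biking    = np.array([0.59, 0.25, 0.01])

print(cosine(newspaper, magazine))  # high: similar meanings
print(cosine(newspaper, biking))    # lower: dissimilar meanings
```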
24
Preliminaries
Distributed representations
How do we get these vectors?
◮ You shall know a word by the company it keeps [Firth, 1957] ◮ The vector of a word should be similar to the vectors of the words surrounding it
→all  →you  →need  →is  →love   (a vector for each word in “all you need is love”)
25
Preliminaries
Embedding methods
[Figure: embedding architecture. A one-hot input vector of vocabulary size (a word from “all you need is love”) is multiplied by a vocabulary size × embedding size weight matrix, giving a hidden layer of embedding size; a second, embedding size × vocabulary size weight matrix maps this to scores over the vocabulary, which are turned into a probability distribution and compared with the target distribution (a one-hot vector for a context word).]
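A toy version of this architecture (tiny vocabulary, random weights); the softmax used to “turn this into a probability distribution” is spelled out on the next slide:

```python
import numpy as np

# Toy vocabulary and sizes, chosen only for illustration.
vocab = ["all", "you", "need", "is", "love"]
vocab_size, embedding_size = len(vocab), 3

rng = np.random.default_rng(2)
W_in  = rng.normal(size=(vocab_size, embedding_size))   # vocab x embedding
W_out = rng.normal(size=(embedding_size, vocab_size))   # embedding x vocab

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

# One-hot input for the word "need"; the model scores possible context words.
x = np.zeros(vocab_size)
x[vocab.index("need")] = 1.0

hidden = x @ W_in            # multiplying a one-hot vector by W_in simply
                             # selects one row: the word's embedding
logits = hidden @ W_out      # scores over the whole vocabulary
probs  = softmax(logits)     # probability distribution over context words
```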
26
Preliminaries
Probability distributions
[Figure: the network’s output logits are normalized into a predicted distribution ŷ and compared with the target distribution y by the cost function.]
softmax = normalize the logits:  softmax(logits)[i] = e^logits[i] / Σj=1..|logits| e^logits[j]
cost = cross-entropy loss = −Σx p(x) log p̂(x) = −Σi pground truth(word = vocabulary[i]) log ppredictions(word = vocabulary[i]) = −Σi yi log ŷi
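The same two steps in NumPy, with made-up logits over a 5-word vocabulary:

```python
import numpy as np

def softmax(logits):
    """Normalize logits into a probability distribution."""
    e = np.exp(logits - logits.max())   # shift for numerical stability
    return e / e.sum()

def cross_entropy(y_true, y_pred):
    """cost = - sum_i y_i * log(y_hat_i)."""
    return -np.sum(y_true * np.log(y_pred + 1e-12))

# Made-up logits over a 5-word vocabulary; the ground truth is word index 3.
logits = np.array([1.2, -0.3, 0.5, 2.0, 0.1])
y_true = np.array([0.0, 0.0, 0.0, 1.0, 0.0])

y_hat = softmax(logits)
loss  = cross_entropy(y_true, y_hat)
```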
27
Outline
Morning program Preliminaries
Feedforward neural network Back propagation Distributed representations Recurrent neural networks Sequence-to-sequence models Convolutional neural networks
Text matching I Text matching II Afternoon program Learning to rank Modeling user behavior Generating responses Wrap up
28
Preliminaries
Recurrent neural networks
◮ Lots of information is sequential and requires a memory for successful processing
◮ Sequences as input, sequences as output
◮ Recurrent neural networks (RNNs) are called recurrent because they perform the same task for every element of a sequence, with the output dependent on previous computations
◮ RNNs have a memory that captures information about what has been computed so far
◮ RNNs can make use of information in arbitrarily long sequences – in practice they are limited to looking back only a few steps
Image credits: http://karpathy.github.io/assets/rnn/diags.jpeg
29
Preliminaries
Recurrent neural networks
◮ RNN being unrolled (or unfolded) into a full network
◮ Unrolling: write out the network for the complete sequence
◮ Formulas governing the computation:
  ◮ xt: input at time step t
  ◮ st: hidden state at time step t – the memory of the network, calculated based on the previous hidden state and the input at the current step: st = f(U xt + W st−1); f is usually a nonlinearity, e.g., tanh or ReLU; s−1 is typically initialized to all zeroes
  ◮ ot: output at step t. E.g., if we want to predict the next word in a sentence, a vector of probabilities across the vocabulary: ot = softmax(V st)
Image credits: Nature
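One step of this computation in NumPy, with toy sizes and random weights:

```python
import numpy as np

# One step of the vanilla RNN above: s_t = tanh(U x_t + W s_{t-1}),
# o_t = softmax(V s_t). Sizes and weights are made up for illustration.
vocab_size, hidden_size = 5, 4
rng = np.random.default_rng(3)
U = rng.normal(scale=0.1, size=(hidden_size, vocab_size))
W = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
V = rng.normal(scale=0.1, size=(vocab_size, hidden_size))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(x_t, s_prev):
    s_t = np.tanh(U @ x_t + W @ s_prev)   # new hidden state (the "memory")
    o_t = softmax(V @ s_t)                # output distribution over the vocab
    return s_t, o_t

# Process a sequence of one-hot vectors; s_{-1} initialized to zeroes.
s = np.zeros(hidden_size)
for x_t in np.eye(vocab_size)[[0, 1, 2]]:   # three toy one-hot inputs
    s, o = rnn_step(x_t, s)
```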
30
Preliminaries
Language modeling using RNNs
◮ A language model allows us to predict the probability of observing a sentence (in a given dataset) as: P(w1, . . . , wm) = ∏i=1..m P(wi | w1, . . . , wi−1)
◮ In an RNN, set ot = xt+1: we want the output at step t to be the actual next word
◮ The input x is a sequence of words; each xt is a single word, represented as a one-hot vector of size vocabulary size
◮ Initialize the parameters U, V, W to small random values around 0
◮ Cross-entropy loss as loss function
◮ For N training examples (words in the text) and C classes (the size of our vocabulary), the loss with respect to predictions o and true labels y is: L(y, o) = −(1/N) Σn∈N yn log on
◮ Training an RNN is similar to training a traditional NN: the backpropagation algorithm, but with a small twist
◮ Parameters are shared by all time steps, so the gradient at each output depends on the calculations of previous time steps: Backpropagation Through Time
31
Preliminaries
Vanishing and exploding gradients
◮ For training RNNs, calculate gradients for U, V, W – ok for V, but for W and U . . .
◮ Gradient for W, e.g. at step 3: ∂L3/∂W = (∂L3/∂o3)(∂o3/∂s3)(∂s3/∂W) = Σk=0..3 (∂L3/∂o3)(∂o3/∂s3)(∂s3/∂sk)(∂sk/∂W)
◮ More generally: ∂L/∂st = ∂L/∂sm · ∂sm/∂sm−1 · ∂sm−1/∂sm−2 · · · ∂st+1/∂st; when each factor is < 1, the product becomes ≪ 1
◮ Gradient contributions from far away steps become zero: the state at those steps doesn’t contribute to what you are learning
Image credits: http://www.wildml.com/2015/10/recurrent-neural-networks-tutorial-part-3-backpropagation-through-time-and-vanishing-gradients/
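A two-line illustration of why those products vanish: if every factor ∂sk+1/∂sk is a bit below 1, the product shrinks exponentially with the number of steps (made-up factor of 0.8).

```python
# Contribution from a step that lies `steps_back` positions in the past,
# assuming each factor in the chain-rule product is about 0.8.
factor = 0.8
for steps_back in [1, 5, 10, 50]:
    print(steps_back, factor ** steps_back)   # 0.8, ~0.33, ~0.11, ~1.4e-05
```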
32
Preliminaries
Long Short Term Memory [Hochreiter and Schmidhuber, 1997]
LSTMs are designed to combat vanishing gradients through a gating mechanism
◮ How an LSTM calculates hidden state st:
  i = σ(xt U^i + st−1 W^i)
  f = σ(xt U^f + st−1 W^f)
  o = σ(xt U^o + st−1 W^o)
  g = tanh(xt U^g + st−1 W^g)
  ct = ct−1 ◦ f + g ◦ i
  st = tanh(ct) ◦ o
  (◦ is element-wise multiplication)
◮ A plain RNN computes its hidden state as st = tanh(U xt + W st−1) – an LSTM unit does the exact same thing
◮ i, f, o: input, forget and output gates
◮ Gates optionally let information through: composed of a sigmoid neural net layer and a pointwise multiplication operation
◮ g is a candidate hidden state, computed based on the current input and the previous hidden state
◮ ct is the internal memory of the LSTM unit: it combines the previous memory ct−1 multiplied by the forget gate, and the newly computed hidden state g multiplied by the input gate
33
Preliminaries
Long Short Term Memory [Hochreiter and Schmidhuber, 1997]
◮ Compute the output hidden state st by multiplying the memory with the output gate
◮ Plain RNNs are a special case of LSTMs:
  ◮ Fix the input gate to all 1’s
  ◮ Fix the forget gate to all 0’s (always forget the previous memory)
  ◮ Fix the output gate to all 1’s (expose the whole memory)
◮ The additional tanh squashes the output
◮ The gating mechanism allows LSTMs to model long-term dependencies
◮ Learn parameters for the gates, to learn how the memory should behave
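A sketch of one LSTM step following the equations above (toy sizes, random weights; not a drop-in implementation of any particular library):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy sizes and random weight matrices U^g, W^g for each gate g in {i, f, o, g}.
input_size, hidden_size = 5, 4
rng = np.random.default_rng(4)
U = {g: rng.normal(scale=0.1, size=(input_size, hidden_size)) for g in "ifog"}
W = {g: rng.normal(scale=0.1, size=(hidden_size, hidden_size)) for g in "ifog"}

def lstm_step(x_t, s_prev, c_prev):
    i = sigmoid(x_t @ U["i"] + s_prev @ W["i"])   # input gate
    f = sigmoid(x_t @ U["f"] + s_prev @ W["f"])   # forget gate
    o = sigmoid(x_t @ U["o"] + s_prev @ W["o"])   # output gate
    g = np.tanh(x_t @ U["g"] + s_prev @ W["g"])   # candidate hidden state
    c_t = c_prev * f + g * i                      # internal memory
    s_t = np.tanh(c_t) * o                        # exposed hidden state
    return s_t, c_t

x_t = rng.normal(size=input_size)
s, c = np.zeros(hidden_size), np.zeros(hidden_size)
s, c = lstm_step(x_t, s, c)
```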
34
Preliminaries
Gated Recurrent Units
◮ A GRU layer is quite similar to an LSTM layer, as are the equations:
  z = σ(xt U^z + st−1 W^z)
  r = σ(xt U^r + st−1 W^r)
  h = tanh(xt U^h + (st−1 ◦ r) W^h)
  st = (1 − z) ◦ h + z ◦ st−1
◮ A GRU has two gates: a reset gate r and an update gate z
◮ The reset gate determines how to combine the new input with the previous memory; the update gate defines how much of the previous memory to keep around
◮ Set the reset gate to all 1’s and the update gate to all 0’s to get the plain RNN model
◮ On many tasks, LSTMs and GRUs perform similarly
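The corresponding sketch of one GRU step (same toy setup as the LSTM sketch above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy sizes and random weight matrices for the gates z, r and candidate h.
input_size, hidden_size = 5, 4
rng = np.random.default_rng(5)
U = {g: rng.normal(scale=0.1, size=(input_size, hidden_size)) for g in "zrh"}
W = {g: rng.normal(scale=0.1, size=(hidden_size, hidden_size)) for g in "zrh"}

def gru_step(x_t, s_prev):
    z = sigmoid(x_t @ U["z"] + s_prev @ W["z"])          # update gate
    r = sigmoid(x_t @ U["r"] + s_prev @ W["r"])          # reset gate
    h = np.tanh(x_t @ U["h"] + (s_prev * r) @ W["h"])    # candidate state
    return (1.0 - z) * h + z * s_prev                    # new state s_t

x_t = rng.normal(size=input_size)
s = gru_step(x_t, np.zeros(hidden_size))
```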
35
Preliminaries
Bidirectional RNNs
◮ Bidirectional RNNs are based on the idea that the output at time t may depend on previous and future elements in the sequence
◮ Example: predict a missing word in a sequence
◮ Bidirectional RNNs are two RNNs stacked on top of each other
◮ The output is computed based on the hidden state of both RNNs
Image credits: http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/
36
Preliminaries
Attention
◮ Attention mechanisms come from visual analysis: suppose you want to locate a specific object in a large image; like people, focus on specific areas
◮ First applied to text and NLP in 2015
◮ The basic mechanism behind every attention mechanism:
  1. Read operator: read a “patch” from the input
  2. Glimpse sensor: extract information from the “patch”
  3. Locator: predict the next location of the read operator
  4. RNN: combine the previous and current responses from the glimpse sensor
Assume we are at time step t. From t − 1 we get the next location to which we should pay attention (produced by the locator). We move the sensor there and extract information, which the RNN combines with previous outputs. After several iterations, we produce the final response, e.g., a classification or label.
37
Preliminaries
Attention
Hard attention
◮ Read operator: fixed size, but there may be several of them
◮ The glimpse sensor can be any NN
◮ The locator predicts the x and y location of the sensor (images) or how many words ahead/back (text)
◮ Not differentiable, so Reinforcement Learning is used
Soft attention
◮ Read operator: only the aspect ratio is fixed; it can zoom into the image and blur if needed
◮ The glimpse sensor can be any NN
◮ The locator predicts more than the x and y parameters (e.g., sigma for blur, etc.)
◮ Differentiable
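The slides describe attention in terms of a glimpse sensor and locator over images. As a small, differentiable example in the soft-attention spirit, here is the common dot-product attention over a set of encoder states; this specific formulation is an assumption for illustration, not the glimpse/locator mechanism above.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Soft attention over encoder states: score each state against a query, turn
# the scores into weights, and return the weighted average. Toy random numbers.
rng = np.random.default_rng(6)
encoder_states = rng.normal(size=(7, 4))   # 7 positions, 4-dim states
query          = rng.normal(size=4)        # e.g., the current decoder state

scores  = encoder_states @ query           # one relevance score per position
weights = softmax(scores)                  # where to "look" (sums to 1)
context = weights @ encoder_states         # soft read over the input
```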
38
Outline
Morning program Preliminaries
Feedforward neural network Back propagation Distributed representations Recurrent neural networks Sequence-to-sequence models Convolutional neural networks
Text matching I Text matching II Afternoon program Learning to rank Modeling user behavior Generating responses Wrap up
39
Preliminaries
Sequence-to-sequence models
Increasingly important: not just retrieval but also generation
◮ Snippets, summaries, small screen versions of search results, spoken results, chatbots, conversational interfaces, . . . , but also query suggestion, query correction, . . .
A basic sequence-to-sequence (seq2seq) model consists of two RNNs: an encoder that processes the input and a decoder that generates the output. Each box represents a cell of the RNN (often a GRU cell or LSTM cell). The encoder and decoder can share weights or, as is more common, use a different set of parameters.
Image credits: https://www.tensorflow.org/tutorials/seq2seq
40
Preliminaries
Sequence-to-sequence models
◮ seq2seq models build on top of language models
◮ Encoder step: a model converts the input sequence into a fixed representation
◮ Decoder step: a language model is trained on both the output sequence (e.g., a translated sentence) and the fixed representation from the encoder
◮ Since the decoder model sees an encoded representation of the input sequence as well as the output sequence, it can make more intelligent predictions about future words based on the current word
Image credits: [Sutskever et al., 2014; Cho et al., 2014; Bahdanau et al., 2014]
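A minimal sketch of the encoder-decoder idea with plain RNN cells instead of the GRU/LSTM cells mentioned above (toy sizes, random weights, greedy decoding; all names and numbers are made up for illustration):

```python
import numpy as np

hidden, vocab = 4, 5
rng = np.random.default_rng(8)
U_enc, W_enc = rng.normal(size=(hidden, vocab)), rng.normal(size=(hidden, hidden))
U_dec, W_dec = rng.normal(size=(hidden, vocab)), rng.normal(size=(hidden, hidden))
V_dec = rng.normal(size=(vocab, hidden))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Encoder: read the input sequence into a single fixed representation.
s = np.zeros(hidden)
for x_t in np.eye(vocab)[[0, 2, 1]]:          # toy input word ids 0, 2, 1
    s = np.tanh(U_enc @ x_t + W_enc @ s)
encoding = s

# Decoder: a language model conditioned on the encoding, emitting one word
# distribution per step and feeding its (greedy) prediction back in.
y_prev, outputs = np.eye(vocab)[3], []        # toy start-of-sequence symbol
state = encoding
for _ in range(3):
    state = np.tanh(U_dec @ y_prev + W_dec @ state)
    probs = softmax(V_dec @ state)
    y_prev = np.eye(vocab)[probs.argmax()]    # feed the prediction back in
    outputs.append(probs)
```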
41
Preliminaries
Sequence-to-sequence models
Used for a “traditional information retrieval task”
Image credits: Sordoni et al. [2015a]
42
Outline
Morning program Preliminaries
Feedforward neural network Back propagation Distributed representations Recurrent neural networks Sequence-to-sequence models Convolutional neural networks
Text matching I Text matching II Afternoon program Learning to rank Modeling user behavior Generating responses Wrap up
43
Preliminaries
Convolutional neural networks
Major breakthroughs in image classification – at the core of many computer vision systems. Some initial applications of CNNs to problems in text and information retrieval.
What is a convolution? Intuition: a sliding window function applied to a matrix.
Example: convolution with a 3 × 3 filter. Multiply the values element-wise with the original matrix, then sum. Slide over the whole matrix.
Image credits: http://deeplearning.stanford.edu/wiki/index.php/Feature_extraction_using_convolution
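A direct (naive) implementation of this sliding-window operation; as in most deep learning libraries, the kernel is not flipped, so strictly speaking this is cross-correlation:

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image, multiply element-wise, and sum."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image  = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 "image"
kernel = np.ones((3, 3)) / 9.0                      # 3x3 averaging (blur) filter
print(convolve2d(image, kernel))                    # 3x3 output map
```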
44
Preliminaries
Visual examples of CNNs
Averaging each pixel with its neighboring values blurs the image. Taking the difference between a pixel and its neighbors detects edges.
Image credits: https://docs.gimp.org/en/plug-in-convmatrix.html
45
Preliminaries
Convolutional neural networks
◮ Use convolutions over the input layer to compute the output
◮ This yields local connections: each region of the input is connected to a neuron in the output
◮ Each layer applies different filters and combines the results
◮ Pooling (subsampling) layers
◮ During training, a CNN learns the values of its filters
◮ For image classification, a CNN may learn to detect edges from raw pixels in the first layer
◮ Then use the edges to detect simple shapes in the second layer
◮ Then use the shapes to detect higher-level features, such as facial shapes, in higher layers
◮ The last layer is then a classifier that uses these high-level features
46
Preliminaries
CNNs in text
Basic intuition
◮ Instead of image pixels, the input to most NLP tasks are sentences or documents represented as a matrix. Each row of the matrix corresponds to one token, typically a word, but it could be a character. That is, each row is a vector that represents a word.
◮ Typically, these vectors are word embeddings (low-dimensional representations) like word2vec or GloVe, but they could also be one-hot vectors that index the word into a vocabulary.
◮ For a 10 word sentence using a 100-dimensional embedding we would have a 10 × 100 matrix as our input.
◮ That’s our “image”
◮ Typically we use filters that slide over full rows of the matrix (words): the “width” of our filters is usually the same as the width of the input matrix. The height, or region size, may vary, but sliding windows over 2–5 words at a time is typical.
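A sketch of such a text convolution with one filter and max-over-time pooling (toy sizes and random numbers; real models use many filters of several region sizes):

```python
import numpy as np

# The "image" is a sentence matrix of word embeddings; each filter spans the
# full embedding width and a few rows (words) in height.
rng = np.random.default_rng(7)
sentence = rng.normal(size=(10, 100))       # 10 words x 100-dim embeddings
region_size = 3                             # the filter covers 3 words at a time
filt = rng.normal(size=(region_size, 100))  # width = embedding width

features = np.array([
    np.sum(sentence[i:i + region_size] * filt)   # element-wise multiply + sum
    for i in range(sentence.shape[0] - region_size + 1)
])                                          # one value per window -> length 8
pooled = features.max()                     # max-over-time pooling (one scalar)
```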
47
Preliminaries
CNNs in text
Example architecture (Zhang and Wallace, 2015; Sentence classification)
48