  1. Outline
      Morning program: Preliminaries, Text matching I, Text matching II
      Afternoon program: Learning to rank, Modeling user behavior, Generating responses
      Wrap up

  2. Outline
      Morning program: Preliminaries (Feedforward neural network, Back propagation, Distributed representations, Recurrent neural networks, Sequence-to-sequence models, Convolutional neural networks), Text matching I, Text matching II
      Afternoon program: Learning to rank, Modeling user behavior, Generating responses
      Wrap up

  3. Preliminaries: Multi-layer perceptron, a.k.a. feedforward neural network
      Figure: input x = (x_1, x_2, x_3, x_4) feeds a hidden layer, which feeds an output layer producing the prediction ŷ = (ŷ_1, ŷ_2, ŷ_3); the prediction is compared to the target y = (y_1, y_2, y_3) by a cost function, e.g., 1/2 (y − ŷ)^2.
      Node j at level i receives the activations x_{i−1,1}, ..., x_{i−1,4} of the previous layer through weights (e.g., w_{1,4}) and computes its activation x_{i,j} = φ(o_{i,j}), where φ is the activation function, e.g., the sigmoid 1 / (1 + e^{−o}).
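
      A minimal NumPy sketch of what a single node j at level i computes; the weights, inputs, and target below are made-up values for illustration:

        import numpy as np

        def sigmoid(o):
            return 1.0 / (1.0 + np.exp(-o))

        x_prev = np.array([0.5, -1.2, 0.3, 0.8])   # activations x_{i-1,1..4} feeding node j
        w = np.array([0.1, -0.4, 0.25, 0.7])       # weights w_{i,1..4} into node j at level i
        o = np.dot(w, x_prev)                      # pre-activation o_{i,j}
        x_ij = sigmoid(o)                          # activation x_{i,j} = phi(o_{i,j}), here the sigmoid
        y = 1.0                                    # target for an output node
        cost = 0.5 * (y - x_ij) ** 2               # cost = 1/2 (y - y_hat)^2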

  4. Outline
      Morning program: Preliminaries (Feedforward neural network, Back propagation, Distributed representations, Recurrent neural networks, Sequence-to-sequence models, Convolutional neural networks), Text matching I, Text matching II
      Afternoon program: Learning to rank, Modeling user behavior, Generating responses
      Wrap up

  5. Preliminaries: Back propagation
      Until convergence: do a forward pass, compute the cost/error, adjust the weights ← how?
      Adjust every weight w_{i,j} by Δw_{i,j} = −α ∂cost/∂w_{i,j}, where α is the learning rate.
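
      A minimal sketch of this loop for a single sigmoid unit with one weight, using a numerical gradient; the data point and learning rate are made up, and the analytic gradient is derived on the next slides:

        import numpy as np

        def sigmoid(o):
            return 1.0 / (1.0 + np.exp(-o))

        x, y = 2.0, 1.0        # one training example: input and target
        w = 0.1                # initial weight
        alpha = 0.5            # learning rate

        for step in range(100):                                  # "until convergence"
            y_hat = sigmoid(w * x)                               # forward pass
            cost = 0.5 * (y - y_hat) ** 2                        # compute the cost/error
            eps = 1e-6                                           # numerical estimate of d cost / d w
            grad = (0.5 * (y - sigmoid((w + eps) * x)) ** 2 - cost) / eps
            w = w - alpha * grad                                 # adjust the weight: delta_w = -alpha * gradient
        print(w, cost)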

  6. Preliminaries: Back propagation
      Δw_{i,j} = −α ∂cost/∂w_{i,j}
               = −α (∂cost/∂x_{i,j}) (∂x_{i,j}/∂w_{i,j})   ← chain rule
      with cost(ŷ, y) = 1/2 (y − ŷ)^2 and ŷ_j = x_{i,j} = φ(o_{i,j}).

  7. Preliminaries: Back propagation
      Δw_{i,j} = −α ∂cost/∂w_{i,j}
               = −α (∂cost/∂x_{i,j}) (∂x_{i,j}/∂w_{i,j})
               = −α (∂cost/∂x_{i,j}) (∂x_{i,j}/∂o_{i,j}) (∂o_{i,j}/∂w_{i,j})   ← chain rule
      with cost(ŷ, y) = 1/2 (y − ŷ)^2, ŷ_j = x_{i,j} = φ(o_{i,j}), e.g. σ(o_{i,j}),
      x_{i,j} = σ(o) = 1 / (1 + e^{−o}), and o_{i,j} = Σ_{k=1}^{K} w_{i,k} · x_{i−1,k}.

  8. Preliminaries: Back propagation
      Δw_{i,j} = −α ∂cost/∂w_{i,j}
               = −α (∂cost/∂x_{i,j}) (∂x_{i,j}/∂w_{i,j})
               = −α (∂cost/∂x_{i,j}) (∂x_{i,j}/∂o_{i,j}) (∂o_{i,j}/∂w_{i,j})
               = −α (∂cost/∂x_{i,j}) (∂x_{i,j}/∂o_{i,j}) x_{i−1,j}
      since o_{i,j} = Σ_{k=1}^{K} w_{i,k} · x_{i−1,k} gives ∂o_{i,j}/∂w_{i,j} = x_{i−1,j};
      as before, cost(ŷ, y) = 1/2 (y − ŷ)^2 and ŷ_j = x_{i,j} = φ(o_{i,j}), e.g. σ(o_{i,j}) = 1 / (1 + e^{−o}).

  9. Preliminaries: Back propagation
      Δw_{i,j} = −α ∂cost/∂w_{i,j}
               = −α (∂cost/∂x_{i,j}) (∂x_{i,j}/∂o_{i,j}) (∂o_{i,j}/∂w_{i,j})
               = −α (∂cost/∂x_{i,j}) x_{i,j} (1 − x_{i,j}) x_{i−1,j}
      using σ′(o) = σ(o) (1 − σ(o)), so for x_{i,j} = σ(o_{i,j}) = 1 / (1 + e^{−o_{i,j}}) we get ∂x_{i,j}/∂o_{i,j} = x_{i,j} (1 − x_{i,j});
      o_{i,j} = Σ_{k=1}^{K} w_{i,k} · x_{i−1,k}, cost(ŷ, y) = 1/2 (y − ŷ)^2.

  10. Preliminaries: Back propagation
      For an output node, cost(ŷ, y) = 1/2 (y − ŷ)^2 with ŷ_j = x_{i,j} gives ∂cost/∂x_{i,j} = x_{i,j} − y_j, so
      Δw_{i,j} = −α ∂cost/∂w_{i,j}
               = −α (∂cost/∂x_{i,j}) (∂x_{i,j}/∂o_{i,j}) (∂o_{i,j}/∂w_{i,j})
               = −α (x_{i,j} − y_j) x_{i,j} (1 − x_{i,j}) x_{i−1,j}
      with x_{i,j} = σ(o) = 1 / (1 + e^{−o}), σ′(o) = σ(o) (1 − σ(o)), and o_{i,j} = Σ_{k=1}^{K} w_{i,k} · x_{i−1,k}.

  11. Preliminaries: Back propagation
      Δw_{i,j} = −α ∂cost/∂w_{i,j}
               = −α (x_{i,j} − y_j) x_{i,j} (1 − x_{i,j}) x_{i−1,j}
               = l.rate · cost · activation · input
      (learning rate × error term × derivative of the activation × input to the weight).
      Figure: plot of σ(o) and σ′(o) for o from −6 to 6.
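
      A quick numerical check of the identity σ′(o) = σ(o)(1 − σ(o)) used above, over the same range as the plot:

        import numpy as np

        def sigmoid(o):
            return 1.0 / (1.0 + np.exp(-o))

        o = np.linspace(-6, 6, 13)
        analytic = sigmoid(o) * (1.0 - sigmoid(o))                # sigma'(o) = sigma(o)(1 - sigma(o))
        eps = 1e-6
        numeric = (sigmoid(o + eps) - sigmoid(o - eps)) / (2 * eps)  # finite-difference derivative
        print(np.max(np.abs(analytic - numeric)))                 # tiny: the identity holds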

  12. Preliminaries: Back propagation
      Collect the error term of each node into δ (δ_1, δ_2, δ_3 at the output layer in the figure):
      Δw_{i,j} = −α ∂cost/∂w_{i,j}
               = −α (∂cost/∂x_{i,j}) (∂x_{i,j}/∂o_{i,j}) (∂o_{i,j}/∂w_{i,j})
               = −α δ x_{i−1,j}      (l.rate · cost · activation · input)
      δ_output = (x_{i,j} − y_j) x_{i,j} (1 − x_{i,j})   ← previous slide
      δ_hidden = (Σ_{n ∈ nodes} δ_n w_{n,j}) x_{i,j} (1 − x_{i,j})
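
      A compact NumPy sketch of these update rules for the 4-4-3 network used in these slides, written in the Δw = −α δ x_{i−1,j} form; the inputs, targets, and weights are random placeholders for illustration:

        import numpy as np

        def sigmoid(o):
            return 1.0 / (1.0 + np.exp(-o))

        rng = np.random.default_rng(0)
        x = rng.normal(size=4)              # input x_1..x_4
        y = np.array([1.0, 0.0, 0.0])       # target y_1..y_3
        W1 = rng.normal(size=(4, 4))        # input -> hidden weights
        W2 = rng.normal(size=(4, 3))        # hidden -> output weights
        alpha = 0.1

        for step in range(1000):
            # forward pass
            h = sigmoid(x @ W1)                              # hidden activations
            y_hat = sigmoid(h @ W2)                          # output activations / predictions
            # backward pass: per-node error terms
            delta_out = (y_hat - y) * y_hat * (1 - y_hat)    # delta_output
            delta_hid = (W2 @ delta_out) * h * (1 - h)       # delta_hidden
            # weight updates: delta_w = -alpha * delta * x_{i-1,j}
            W2 -= alpha * np.outer(h, delta_out)
            W1 -= alpha * np.outer(x, delta_hid)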

  13. Preliminaries: Network representation
      The forward pass as matrix operations:
      input x_1 = [x_1[1] x_1[2] x_1[3] x_1[4]], a [1 × 4] row vector
      hidden layer: x_1 * W_1 = o_1 with shapes [1 × 4] [4 × 4] = [1 × 4]; activation x_2 = σ(o_1), a [1 × 4] vector
      output layer: x_2 * W_2 = o_2 with shapes [1 × 4] [4 × 3] = [1 × 3]; activation x_3 = σ(o_2), a [1 × 3] vector
      x_3 is the prediction ŷ = (ŷ_1, ŷ_2, ŷ_3), compared to the target y = (y_1, y_2, y_3).
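
      The same shape bookkeeping in NumPy; the weight values are random placeholders:

        import numpy as np

        def sigmoid(o):
            return 1.0 / (1.0 + np.exp(-o))

        rng = np.random.default_rng(0)
        x1 = rng.normal(size=(1, 4))        # input row vector, [1 x 4]
        W1 = rng.normal(size=(4, 4))        # hidden-layer weights, [4 x 4]
        W2 = rng.normal(size=(4, 3))        # output-layer weights, [4 x 3]

        x2 = sigmoid(x1 @ W1)               # [1 x 4] [4 x 4] = [1 x 4]
        x3 = sigmoid(x2 @ W2)               # [1 x 4] [4 x 3] = [1 x 3]  -> prediction y_hat
        print(x2.shape, x3.shape)           # (1, 4) (1, 3)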

  14. Outline
      Morning program: Preliminaries (Feedforward neural network, Back propagation, Distributed representations, Recurrent neural networks, Sequence-to-sequence models, Convolutional neural networks), Text matching I, Text matching II
      Afternoon program: Learning to rank, Modeling user behavior, Generating responses
      Wrap up

  15. Preliminaries: Distributed representations
      ◮ Represent units, e.g., words, as vectors
      ◮ Goal: words that are similar, e.g., in terms of meaning, should get similar embeddings
        newspaper = <0.08, 0.31, 0.41>   magazine = <0.09, 0.35, 0.36>   biking = <0.59, 0.25, 0.01>
      ◮ Cosine similarity determines how similar two vectors are:
        cosine(v, w) = (v⊤ · w) / (||v||_2 ||w||_2) = Σ_{i=1}^{|v|} v_i w_i / ( sqrt(Σ_{i=1}^{|v|} v_i^2) · sqrt(Σ_{i=1}^{|w|} w_i^2) )
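
      A small NumPy check of the cosine formula on the three example vectors above:

        import numpy as np

        def cosine(v, w):
            # cosine(v, w) = v . w / (||v||_2 ||w||_2)
            return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

        newspaper = np.array([0.08, 0.31, 0.41])
        magazine  = np.array([0.09, 0.35, 0.36])
        biking    = np.array([0.59, 0.25, 0.01])

        print(cosine(newspaper, magazine))   # ~0.99: similar meaning, similar embeddings
        print(cosine(newspaper, biking))     # ~0.39: dissimilar meaning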

  16. Preliminaries: Distributed representations
      How do we get these vectors?
      ◮ "You shall know a word by the company it keeps" [Firth, 1957]
      ◮ The vector of a word should be similar to the vectors of the words surrounding it: →you, →need, →is, →all, →love
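
      A minimal sketch of how this idea yields training data: pair each word with the words in a small window around it. The sentence "all you need is love" and the window size are assumptions for illustration:

        sentence = "all you need is love".split()
        window = 2  # context words on each side

        pairs = []  # (target word, context word) training pairs
        for i, target in enumerate(sentence):
            for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
                if j != i:
                    pairs.append((target, sentence[j]))

        print(pairs[:4])  # [('all', 'you'), ('all', 'need'), ('you', 'all'), ('you', 'need')]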

  17. Preliminaries: Embedding methods
      Figure: words are fed in as one-hot vectors over the vocabulary (here: all, amtrak, answer, is, love, need, what, you, zorro, ...); a vocabulary size × embedding size weight matrix maps the inputs to a hidden layer of embedding size; an embedding size × vocabulary size weight matrix maps the hidden layer to a vocabulary-sized output layer; turn this into a probability distribution and compare it to the one-hot target distribution.
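
      A minimal sketch of this architecture as a CBOW-style model; the vocabulary follows the figure, while the embedding size, context, and target are made-up placeholders, and training is omitted:

        import numpy as np

        vocab = ["all", "amtrak", "answer", "is", "love", "need", "what", "you", "zorro"]
        V, E = len(vocab), 3                        # vocabulary size, embedding size
        rng = np.random.default_rng(0)
        W_in = rng.normal(scale=0.1, size=(V, E))   # vocabulary x embedding weight matrix
        W_out = rng.normal(scale=0.1, size=(E, V))  # embedding x vocabulary weight matrix

        def one_hot(word):
            v = np.zeros(V)
            v[vocab.index(word)] = 1.0
            return v

        context = ["all", "you", "is", "love"]      # context words (inputs)
        target = one_hot("need")                    # one-hot target distribution

        hidden = np.mean([one_hot(w) @ W_in for w in context], axis=0)  # embedding-size hidden layer
        logits = hidden @ W_out                                         # vocabulary-size output layer
        probs = np.exp(logits) / np.sum(np.exp(logits))                 # probability distribution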

  18. Preliminaries: Probability distributions
      softmax = normalize the logits into a probability distribution ŷ:
        softmax(logits)[i] = e^{logits[i]} / Σ_{j=1}^{|logits|} e^{logits[j]}
      cost = cross-entropy loss between the predicted distribution ŷ and the target distribution y:
        cost = −Σ_x p(x) log p̂(x)
             = −Σ_i p_ground truth(word = vocabulary[i]) log p_predictions(word = vocabulary[i])
             = −Σ_i y_i log ŷ_i
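
      The same two formulas in NumPy; the logits and one-hot target are made-up examples, and the max-shift in softmax is a standard numerical-stability trick not shown on the slide:

        import numpy as np

        def softmax(logits):
            e = np.exp(logits - np.max(logits))   # shift for numerical stability; result is unchanged
            return e / np.sum(e)

        def cross_entropy(y, y_hat):
            return -np.sum(y * np.log(y_hat))     # -sum_i y_i log y_hat_i

        logits = np.array([2.0, 0.5, -1.0, 0.1])  # raw network outputs
        y_hat = softmax(logits)                   # predicted probability distribution
        y = np.array([1.0, 0.0, 0.0, 0.0])        # one-hot ground-truth distribution
        print(y_hat, cross_entropy(y, y_hat))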

  19. Outline
      Morning program: Preliminaries (Feedforward neural network, Back propagation, Distributed representations, Recurrent neural networks, Sequence-to-sequence models, Convolutional neural networks), Text matching I, Text matching II
      Afternoon program: Learning to rank, Modeling user behavior, Generating responses
      Wrap up

  20. Preliminaries: Recurrent neural networks
      ◮ Lots of information is sequential and requires a memory for successful processing
      ◮ Sequences as input, sequences as output
      ◮ Recurrent neural networks (RNNs) are called recurrent because they perform the same task for every element of the sequence, with the output dependent on previous computations
      ◮ RNNs have a memory that captures information about what has been computed so far
      ◮ RNNs can make use of information in arbitrarily long sequences – in practice they are limited to looking back only a few steps
      Image credits: http://karpathy.github.io/assets/rnn/diags.jpeg

  21. Preliminaries: Recurrent neural networks
      ◮ The RNN is unrolled (or unfolded) into a full network; unrolling means writing out the network for the complete sequence
      ◮ Formulas governing the computation:
        x_t is the input at time step t
        s_t is the hidden state at time step t – the memory of the network, calculated from the previous hidden state and the input at the current step: s_t = f(U x_t + W s_{t−1}); f is usually a nonlinearity, e.g., tanh or ReLU; s_{−1} is typically initialized to all zeroes
        o_t is the output at step t; e.g., if we want to predict the next word in a sentence, a vector of probabilities across the vocabulary: o_t = softmax(V s_t)
      Image credits: Nature
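
      A minimal NumPy sketch of this recurrence; the dimensions, weights, and input sequence are random placeholders:

        import numpy as np

        def softmax(v):
            e = np.exp(v - np.max(v))
            return e / np.sum(e)

        rng = np.random.default_rng(0)
        input_dim, hidden_dim, vocab_size = 5, 8, 10
        U = rng.normal(scale=0.1, size=(hidden_dim, input_dim))   # input -> hidden
        W = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # hidden -> hidden (recurrence)
        V = rng.normal(scale=0.1, size=(vocab_size, hidden_dim))  # hidden -> output

        xs = rng.normal(size=(4, input_dim))      # a sequence of 4 input vectors x_t
        s = np.zeros(hidden_dim)                  # s_{-1}: initial hidden state, all zeroes
        for x_t in xs:
            s = np.tanh(U @ x_t + W @ s)          # s_t = f(U x_t + W s_{t-1}), with f = tanh
            o_t = softmax(V @ s)                  # o_t = softmax(V s_t): distribution over the vocabulary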
