Maxent Models (III), & Neural Language Models - CMSC 473/673


  1. Maxent Models (III), & Neural Language Models. CMSC 473/673, UMBC, September 25th, 2017. Some slides adapted from 3SLP.

  2. Recap from last time…

  3. Maximum Entropy Models: a more general language model, argmax_Y p(X | Y) · p(Y), versus classifying in one go, argmax_Y p(Y | X).

  4. Maximum Entropy Models: feature weights (a.k.a. natural parameters, distribution parameters) and feature function(s) (a.k.a. sufficient statistics, "strength" function(s)).

  5.–6. What if you can't find the roots? Follow the derivative. [figure: gradient ascent on F(θ), showing values y_t = F(θ_t), gradients g_0, g_1, g_2, and iterates θ_0, θ_1, θ_2, θ_3 approaching θ*] Pick a starting value θ_0, set t = 0, and repeat until converged: 1. get value y_t = F(θ_t); 2. get derivative g_t = F'(θ_t); 3. get scaling factor ρ_t; 4. set θ_{t+1} = θ_t + ρ_t · g_t; 5. set t += 1.
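
To make the update loop on slides 5–6 concrete, here is a minimal Python sketch; F and its derivative are illustrative stand-ins (a simple concave function), and the fixed step size rho is an assumption rather than anything from the slides:

    # Gradient ascent: follow the derivative of F until convergence.
    def F(theta):
        return -(theta - 3.0) ** 2        # illustrative concave function, maximized at theta = 3

    def F_prime(theta):
        return -2.0 * (theta - 3.0)       # its derivative

    theta = 0.0                           # pick a starting value theta_0
    rho = 0.1                             # scaling factor (fixed step size, an assumption)
    for t in range(1000):                 # "until converged", capped for safety
        y = F(theta)                      # 1. get value y_t = F(theta_t)
        g = F_prime(theta)                # 2. get derivative g_t = F'(theta_t)
        theta = theta + rho * g           # 4. set theta_{t+1} = theta_t + rho * g_t
        if abs(g) < 1e-8:                 # stop once the derivative is (near) zero
            break
    print(theta)                          # ~3.0, the maximizer theta*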

  7. Connections to Other Techniques: log-linear models are also known as (multinomial) logistic regression / softmax regression (as statistical regression), Maximum Entropy models (MaxEnt) (based in information theory), a form of Generalized Linear Models, viewed as Discriminative Naïve Bayes, and as very shallow (sigmoidal) neural nets (to be cool today :)).

  8. https://www.csee.umbc.edu/courses/undergraduate/473/f17/loglin-tutorial/ https://goo.gl/B23Rxo

  9. Objective = Full Likelihood?

  10. Objective = Full Likelihood? Differentiating this product could be a pain, and these values can have very small magnitude → underflow.

  11. Logarithms: (0, 1] → (−∞, 0]. Products → sums: log(ab) = log(a) + log(b), log(a/b) = log(a) − log(b). Inverse of exp: log(exp(x)) = x.
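
A quick numeric check of these identities (the values of a and b are arbitrary illustrations):

    import math
    a, b = 0.2, 0.5
    print(math.isclose(math.log(a * b), math.log(a) + math.log(b)))   # True: products -> sums
    print(math.isclose(math.log(a / b), math.log(a) - math.log(b)))   # True
    print(math.log(math.exp(3.0)))                                    # 3.0: log inverts exp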

  12. Log-Likelihood: wide range of (negative) numbers, and sums are more stable. Products → sums: log(ab) = log(a) + log(b), log(a/b) = log(a) − log(b). Differentiating this becomes nicer (even though Z depends on θ).

  13.–14. Log-Likelihood: wide range of (negative) numbers, and sums are more stable. Inverse of exp: log(exp(x)) = x, applied to p(y | x) ∝ exp(θ · f(x, y)). Differentiating this becomes nicer (even though Z depends on θ).

  15. Log-Likelihood: wide range of (negative) numbers, and sums are more stable. Differentiating this becomes nicer (even though Z depends on θ).
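
As a sketch of slides 10–15 (with an assumed toy label set, feature function, and data, none of which come from the slides): the conditional log-likelihood of p(y | x) ∝ exp(θ · f(x, y)) turns the product over examples into a sum, and the normalizer Z is handled in log space.

    import math

    LABELS = ["pos", "neg"]                            # hypothetical label set

    def f(x, y):
        # hypothetical feature function returning a small feature vector
        return [1.0 if y == "pos" else 0.0,
                x.count("good") * (1.0 if y == "pos" else -1.0)]

    def dot(theta, feats):
        return sum(t * v for t, v in zip(theta, feats))

    def log_prob(theta, x, y):
        # log p(y | x) = theta . f(x, y) - log Z(x), with Z computed via log-sum-exp
        scores = [dot(theta, f(x, y2)) for y2 in LABELS]
        m = max(scores)
        log_Z = m + math.log(sum(math.exp(s - m) for s in scores))
        return dot(theta, f(x, y)) - log_Z

    data = [("good movie", "pos"), ("bad movie", "neg")]   # toy training pairs
    theta = [0.5, 1.0]
    print(sum(log_prob(theta, x, y) for x, y in data))     # a sum of logs, not a product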

  16. Expectations: number of pieces of candy, values 1–6, each with probability 1/6: 1/6·1 + 1/6·2 + 1/6·3 + 1/6·4 + 1/6·5 + 1/6·6 = 3.5.

  17.–19. Expectations: number of pieces of candy, values 1–6, with probabilities 1/2, 1/10, 1/10, 1/10, 1/10, 1/10: 1/2·1 + 1/10·2 + 1/10·3 + 1/10·4 + 1/10·5 + 1/10·6 = 2.5.
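
The two expectations on slides 16–19, computed directly as Σ_x p(x)·x:

    values = [1, 2, 3, 4, 5, 6]                          # number of pieces of candy
    uniform = [1/6] * 6                                  # slide 16
    skewed = [1/2, 1/10, 1/10, 1/10, 1/10, 1/10]         # slides 17-19

    def expectation(probs, vals):
        return sum(p * v for p, v in zip(probs, vals))

    print(expectation(uniform, values))                  # 3.5
    print(expectation(skewed, values))                   # 2.5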

  20. Log-Likelihood Gradient Each component k is the difference between:

  21. Log-Likelihood Gradient. Each component k is the difference between: the total value of feature f_k in the training data

  22. Log-Likelihood Gradient. Each component k is the difference between: the total value of feature f_k in the training data, and the total value the current model p_θ thinks it computes for feature f_k.

  23. Log-Likelihood Gradient ("moment matching"). Each component k is the difference between: the total value of feature f_k in the training data, and the total value the current model p_θ thinks it computes for feature f_k.

  24. https://www.csee.umbc.edu/courses/undergraduate/473/f17/loglin-tutorial/ https://goo.gl/B23Rxo Lesson 6

  25. Log-Likelihood Gradient Derivation

  26.–28. Log-Likelihood Gradient Derivation (annotation: depends on θ).

  29. Log-Likelihood Gradient Derivation

  30. Log-Likelihood Gradient Derivation: use the (calculus) chain rule, ∂/∂θ log g(h(θ)) = (∂ log g / ∂h) · (∂h(θ) / ∂θ).

  31. Log-Likelihood Gradient Derivation: use the (calculus) chain rule, ∂/∂θ log g(h(θ)) = (∂ log g / ∂h) · (∂h(θ) / ∂θ); one factor ends up a scalar, p(y' | x_i), the other a vector of functions.

  32. Log-Likelihood Gradient Derivation

  33. Log-Likelihood Derivative Derivation: ∂F/∂θ_k = Σ_i f_k(x_i, y_i) − Σ_i Σ_{y'} p(y' | x_i) f_k(x_i, y').
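
A sketch of this gradient as "moment matching" (slides 20–23 and 33), reusing the assumed toy LABELS, f, dot, theta, and data from the log-likelihood sketch above: each component is the observed feature total minus the total the current model expects.

    import math

    def probs(theta, x):
        # p(y | x) proportional to exp(theta . f(x, y)), normalized over LABELS
        scores = [dot(theta, f(x, y)) for y in LABELS]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        Z = sum(exps)
        return {y: e / Z for y, e in zip(LABELS, exps)}

    def gradient(theta, data):
        grad = [0.0] * len(theta)
        for x, y in data:
            observed = f(x, y)                     # feature values in the training data
            p = probs(theta, x)
            for k in range(len(theta)):
                expected = sum(p[y2] * f(x, y2)[k] for y2 in LABELS)
                grad[k] += observed[k] - expected  # observed minus expected
        return grad

    print(gradient(theta, data))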

  34. Do we want these to fully match? What does it mean if they do? What if we have missing values in our data?

  35. Preventing Extreme Values: in Naïve Bayes, extreme values are 0 probabilities.

  36. Preventing Extreme Values: in Naïve Bayes, extreme values are 0 probabilities; in log-linear models, extreme values are large θ values.

  37. Preventing Extreme Values: in Naïve Bayes, extreme values are 0 probabilities; in log-linear models, extreme values are large θ values. The fix: regularization.

  38. (Squared) L2 Regularization
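
A small sketch of how squared L2 regularization changes the objective and its gradient (lam is an assumed hyperparameter; gradient, theta, and data come from the sketches above):

    lam = 0.1                                              # regularization strength (assumed)

    def l2_penalty(theta):
        return 0.5 * lam * sum(t * t for t in theta)       # (lam/2) * ||theta||^2

    def regularized_gradient(theta, data):
        grad = gradient(theta, data)
        return [g - lam * t for g, t in zip(grad, theta)]  # pulls each weight toward 0

    print(regularized_gradient(theta, data))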

  39. https://www.csee.umbc.edu/courses/undergraduate/473/f17/loglin-tutorial/ https://goo.gl/B23Rxo Lesson 8

  40. (More on) Connections to Other Machine Learning Techniques

  41. Classification: Discriminative Naïve Bayes. [graphical model: a label/class node and its observed features, labeled "Naïve Bayes"]

  42. Classification: Discriminative Naïve Bayes. [graphical model: a label/class node and its observed features, labeled both "Naïve Bayes" and "Maxent / Logistic Regression"]

  43. Multinomial Logistic Regression

  44. Multinomial Logistic Regression (in one dimension)

  45. Multinomial Logistic Regression

  46.–47. Understanding Conditioning: Is this a good language model?

  48. Understanding Conditioning Is this a good language model? (no)

  49. Understanding Conditioning Is this a good posterior classifier? (no)

  50. https://www.csee.umbc.edu/courses/undergraduate/473/f17/loglin-tutorial/ https://goo.gl/B23Rxo Lesson 11

  51. Connections to Other Techniques: log-linear models are also known as (multinomial) logistic regression / softmax regression (as statistical regression), Maximum Entropy models (MaxEnt) (based in information theory), a form of Generalized Linear Models, viewed as Discriminative Naïve Bayes, and as very shallow (sigmoidal) neural nets (to be cool today :)).

  52.–53. Revisiting the Snap Function: softmax.
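
A minimal sketch of the softmax function referenced here, with the usual max-shift so large scores do not overflow (the example scores are arbitrary):

    import math

    def softmax(scores):
        m = max(scores)                                  # subtract the max for stability
        exps = [math.exp(s - m) for s in scores]
        Z = sum(exps)
        return [e / Z for e in exps]

    print(softmax([2.0, 1.0, 0.1]))                      # approx [0.66, 0.24, 0.10]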

  54. N-gram Language Models: given some context w_{i-3} w_{i-2} w_{i-1}, predict the next word w_i.

  55.–56. N-gram Language Models: given some context w_{i-3} w_{i-2} w_{i-1}, compute beliefs about what is likely, p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) ∝ count(w_{i-3}, w_{i-2}, w_{i-1}, w_i), and predict the next word w_i.
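
A count-based sketch of this estimate on an assumed toy corpus (no smoothing, so unseen contexts get probability 0):

    from collections import Counter

    corpus = "the dog barks at the dog barks at the cat".split()   # toy corpus

    four_grams = Counter(tuple(corpus[j:j + 4]) for j in range(len(corpus) - 3))
    contexts = Counter(tuple(corpus[j:j + 3]) for j in range(len(corpus) - 3))

    def p_next(w3, w2, w1, w):
        # p(w | w3, w2, w1) = count(w3, w2, w1, w) / count(w3, w2, w1)
        c = contexts[(w3, w2, w1)]
        return four_grams[(w3, w2, w1, w)] / c if c else 0.0

    print(p_next("dog", "barks", "at", "the"))           # 1.0 in this toy corpus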

  57. Maxent Language Models: given some context w_{i-3} w_{i-2} w_{i-1}, compute beliefs about what is likely, p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) ∝ softmax(θ · f(w_{i-3}, w_{i-2}, w_{i-1}, w_i)), and predict the next word w_i.

  58. Neural Language Models: given some context w_{i-3} w_{i-2} w_{i-1}, can we learn the feature function(s)? Compute beliefs about what is likely, p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) ∝ softmax(θ · f(w_{i-3}, w_{i-2}, w_{i-1}, w_i)), and predict the next word w_i.

  59. Neural Language Models: given some context w_{i-3} w_{i-2} w_{i-1}, can we learn the feature function(s) for just the context, and word-specific weights (by type)? Compute beliefs about what is likely, p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) ∝ softmax(θ_{w_i} · f(w_{i-3}, w_{i-2}, w_{i-1})), and predict the next word w_i.

  60. Neural Language Models: given some context w_{i-3} w_{i-2} w_{i-1}, create/use "distributed representations" e_{i-3}, e_{i-2}, e_{i-1} (an embedding e_w per word), compute beliefs about what is likely, p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) ∝ softmax(θ_{w_i} · f(w_{i-3}, w_{i-2}, w_{i-1})), and predict the next word w_i.

  61. Neural Language Models: given some context w_{i-3} w_{i-2} w_{i-1}, create/use "distributed representations" e_{i-3}, e_{i-2}, e_{i-1} (an embedding e_w per word), combine these representations via a matrix-vector product (C) to form f, compute beliefs about what is likely, p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) ∝ softmax(θ_{w_i} · f(w_{i-3}, w_{i-2}, w_{i-1})), and predict the next word w_i.
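
A minimal numpy sketch of the forward pass slides 58–61 describe, with assumed toy sizes and random parameters: look up an embedding per context word, combine them with a matrix C (plus a tanh nonlinearity, an assumption beyond the slides) to get learned features, then score every word type with its own weight vector and softmax.

    import numpy as np

    rng = np.random.default_rng(0)
    V, d, h = 10, 8, 16                    # assumed sizes: vocabulary, embedding dim, feature dim
    E = rng.normal(size=(V, d))            # one embedding e_w per word type
    C = rng.normal(size=(h, 3 * d))        # combines the three context embeddings
    Theta = rng.normal(size=(V, h))        # word-specific weights theta_{w_i}

    def next_word_distribution(w3, w2, w1):
        e = np.concatenate([E[w3], E[w2], E[w1]])   # distributed representations of the context
        feats = np.tanh(C @ e)                      # learned feature function of the context
        scores = Theta @ feats                      # theta_w . feats for every word type w
        scores -= scores.max()                      # stable softmax
        p = np.exp(scores)
        return p / p.sum()

    p = next_word_distribution(3, 7, 2)             # hypothetical context word ids
    print(p.shape, p.sum())                         # (10,) 1.0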
