

  1. Tengyu Ma. Joint work with Sanjeev Arora, Yuanzhi Li, Yingyu Liang, and Andrej Risteski. Princeton University.

  2. Embeddings: map an object x in a complicated space 𝒳 to a vector v_x ∈ ℝ^d, a Euclidean space with meaningful inner products.
     - Kernel methods: linearly separable
     - Neural nets: multi-class linear classifier

  3. Embedding: Vocabulary = { 60k most frequent words } → ℝ^300.
     Goal: the embedding captures semantic information (via linear-algebraic operations):
     - inner products characterize similarity: similar words have large inner products
     - differences characterize relationships: analogous pairs have similar differences
     - more?
     (picture: Chris Olah's blog)

  4. Meaning of a word is determined by the words it co-occurs with. (Distributional hypothesis of meaning, [Harris'54], [Firth'57])
     - Pr[x, y] ≜ probability that words x and y co-occur in a window of size 5
       [figure: co-occurrence matrix Pr[·,·], rows indexed by word x, columns by word y; row x gives v_x]
     - ⟨v_x, v_y⟩ is a good measure of the similarity of (x, y) [Lund-Burgess'96], e.g. with:
       - v_x = row x of the entry-wise square root of the co-occurrence matrix [Rohde et al'05]
       - v_x = row x of the PMI matrix, PMI(x, y) = log( Pr[x, y] / (Pr[x] Pr[y]) ) [Church-Hanks'90]
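A minimal numpy sketch of these constructions (my own toy example, not code from the talk): co-occurrence counts from a tiny corpus with a window of size 5, the entry-wise square root, and the PMI matrix. The corpus and the smoothing constant `eps` are placeholders; real embeddings use corpora with billions of tokens.

```python
import numpy as np

# Toy corpus; the talk's statistics come from a large corpus over a 60k-word vocabulary.
corpus = "the king spoke to the queen and the queen listened to the king".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, window = len(vocab), 5

# Co-occurrence counts: pairs of words appearing within a window of size 5.
counts = np.zeros((V, V))
for i, x in enumerate(corpus):
    for j in range(i + 1, min(i + window, len(corpus))):
        counts[idx[x], idx[corpus[j]]] += 1
        counts[idx[corpus[j]], idx[x]] += 1

joint = counts / counts.sum()                    # estimate of Pr[x, y]
marg = joint.sum(axis=1)                         # estimate of Pr[x]
eps = 1e-12                                      # keeps log() finite for unseen pairs

sqrt_rows = np.sqrt(counts)                      # rows: the [Rohde et al'05] choice of v_x
pmi = np.log((joint + eps) / (np.outer(marg, marg) + eps))   # PMI(x, y), [Church-Hanks'90]

print(pmi[idx["king"], idx["queen"]])            # similar words should get a large value
```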

  5. Algorithm [Levy-Goldberg] (a dimension-reduction version of [Church-Hanks'90]):
     - Compute PMI(x, y) = log( Pr[x, y] / (Pr[x] Pr[y]) )
     - Take the rank-300 SVD (best rank-300 approximation) of the PMI matrix
     - ⇔ fit PMI(x, y) ≈ ⟨v_x, v_y⟩ (with squared loss), where v_x ∈ ℝ^300
     - "Linear structure" in the found v_x's:  v_woman − v_man ≈ v_queen − v_king ≈ v_aunt − v_uncle ≈ ⋯
     [figure: king/queen, man/woman, uncle/aunt parallelograms]
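A sketch of the dimension-reduction step, using a random symmetric matrix as a stand-in for the PMI matrix (rank 300 in the talk, rank 10 here). Splitting the singular values symmetrically between the two factors is one common convention for extracting word vectors; since PMI is not exactly PSD, the Gram fit ⟨v_x, v_y⟩ is a heuristic on top of the SVD.

```python
import numpy as np

def levy_goldberg_fit(pmi: np.ndarray, rank: int):
    """Rank-k SVD of the PMI matrix plus one common convention for word vectors."""
    U, S, Vt = np.linalg.svd(pmi, full_matrices=False)
    best_rank_k = U[:, :rank] * S[:rank] @ Vt[:rank]       # best rank-k approximation
    vectors = U[:, :rank] * np.sqrt(S[:rank])              # row x plays the role of v_x
    return best_rank_k, vectors

rng = np.random.default_rng(0)
M = rng.standard_normal((60, 60))
M = (M + M.T) / 2                                          # symmetric, not necessarily PSD, like PMI
approx, vecs = levy_goldberg_fit(M, rank=10)
print(np.linalg.norm(M - approx) / np.linalg.norm(M))      # error of the best rank-10 approximation
print(np.linalg.norm(M - vecs @ vecs.T) / np.linalg.norm(M))   # error of the Gram fit <v_x, v_y>
```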

  6. - Questions:  woman : man  ::  queen : ? ,  aunt : ?
     - Answers:
         king  = argmin_w || v_queen − v_w − (v_woman − v_man) ||
         uncle = argmin_w || v_aunt − v_w − (v_woman − v_man) ||
     [figure: king/queen, man/woman, uncle/aunt parallelograms]
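A sketch of this analogy rule, assuming a list `vocab` and an array `vectors` whose rows are the v_w's (for instance the output of the SVD sketch above); excluding the three query words from the argmin is standard practice rather than something stated on the slide.

```python
import numpy as np

def solve_analogy(vectors, vocab, a, b, c):
    """Return the word w minimizing || v_c - v_w - (v_a - v_b) ||, i.e. a : b :: c : w."""
    idx = {word: i for i, word in enumerate(vocab)}
    target = vectors[idx[c]] - (vectors[idx[a]] - vectors[idx[b]])
    dists = np.linalg.norm(target - vectors, axis=1)
    for w in (a, b, c):                      # conventionally the query words are excluded
        dists[idx[w]] = np.inf
    return vocab[int(np.argmin(dists))]

# Expected behaviour on good embeddings (per the slide):
#   solve_analogy(vectors, vocab, "woman", "man", "queen")  ->  "king"
#   solve_analogy(vectors, vocab, "woman", "man", "aunt")   ->  "uncle"
```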

  7. - Recurrent-neural-network-based language model [Mikolov et al'12]
     - word2vec [Mikolov et al'13]:  Pr[ x_{t+5} ∣ x_{t+1}, …, x_{t+4} ] ∝ exp⟨ v_{x_{t+5}}, (1/5)(v_{x_{t+1}} + ⋯ + v_{x_{t+4}}) ⟩
     - GloVe [Pennington et al'14]:  log Pr[x, y] ≈ ⟨v_x, v_y⟩ + s_x + s_y + C
     - [Levy-Goldberg'14] (previous slide):  PMI(x, y) = log( Pr[x, y] / (Pr[x] Pr[y]) ) ≈ ⟨v_x, v_y⟩ + C
     The logarithm (or exponential) seems to exclude linear algebra!
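A sketch of the word2vec-style conditional above, with random stand-in vectors (toy sizes, my own variable names): the next word is scored by a softmax of its vector against the scaled sum of the context vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, d = 5000, 100
V = rng.standard_normal((n_words, d))          # one vector v_x per vocabulary word (random stand-ins)

def next_word_probs(context_ids):
    """Pr[x_{t+5} | x_{t+1..t+4}] proportional to exp(<v_x, (1/5) * sum of context vectors>)."""
    ctx = V[context_ids].sum(axis=0) / 5.0     # the (1/5) factor from the slide
    logits = V @ ctx
    p = np.exp(logits - logits.max())          # stable softmax over the vocabulary
    return p / p.sum()

probs = next_word_probs([10, 20, 30, 40])      # four context words x_{t+1}, ..., x_{t+4}
print(probs.argmax(), probs.max())
```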

  8. Why co-occurrence statistics + log → linear structure? [Levy-Goldberg'13, Pennington et al'14, rephrased]
     - For most words χ:   Pr[χ ∣ king] / Pr[χ ∣ queen]  ≈  Pr[χ ∣ man] / Pr[χ ∣ woman]
       - for χ unrelated to gender: LHS, RHS ≈ 1
       - for χ = dress: LHS, RHS ≪ 1;  for χ = John: LHS, RHS ≫ 1
     - This suggests
         log( Pr[χ ∣ king] / Pr[χ ∣ queen] ) − log( Pr[χ ∣ man] / Pr[χ ∣ woman] ) ≈ 0
       ⇔ PMI(χ, king) − PMI(χ, queen) − ( PMI(χ, man) − PMI(χ, woman) ) ≈ 0
     - So the rows of the PMI matrix have "linear structure"
     - Empirically one can find v_w's s.t. PMI(χ, w) ≈ ⟨v_χ, v_w⟩
     - Suggestion: the v_w's also have linear structure
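The suggested cancellation can be probed directly on a PMI matrix; a small sketch, assuming `pmi` and the word index `idx` come from a corpus large enough to contain these words (the toy corpus above does not), so the calls are left commented out.

```python
import numpy as np

def gender_residual(pmi, idx, chi):
    """PMI(chi, king) - PMI(chi, queen) - (PMI(chi, man) - PMI(chi, woman)); near 0 for most chi."""
    row = pmi[idx[chi]]
    return (row[idx["king"]] - row[idx["queen"]]) - (row[idx["man"]] - row[idx["woman"]])

# Equivalent row-difference view of the "linear structure" of PMI rows:
#   diff_royal  = pmi[idx["king"]] - pmi[idx["queen"]]
#   diff_gender = pmi[idx["man"]]  - pmi[idx["woman"]]
#   print(np.linalg.norm(diff_royal - diff_gender))
# Averaging |gender_residual(pmi, idx, chi)| over many words chi measures how well it holds.
```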

  9. M1: Why do low-dimensional vectors capture the essence of huge co-occurrence statistics? That is, why is a low-dimensional fit of the PMI matrix even possible?
         PMI(x, y) ≈ ⟨v_x, v_y⟩   (∗)
     - NB: the PMI matrix is not necessarily PSD.
     M2: Why do low-dimensional vectors solve analogies when (∗) is only roughly true? (the empirical fit has 17% error)
     - NB: solving an analogy requires inner products of 6 pairs of word vectors, and "king" has to survive against all other words, so noise is potentially an issue!
         king = argmin_w || v_queen − v_w − (v_woman − v_man) ||²
     - Fact: low-dimensional word vectors have more accurate linear structure than the rows of PMI (and therefore better analogy-task performance).

  10. M1: Why do low-dimensional vectors capture the essence of huge co-occurrence statistics? That is, why is a low-dimensional fit of the PMI matrix even possible?   PMI(x, y) ≈ ⟨v_x, v_y⟩   (∗)
      A1: Under a generative model (named RAND-WALK), (∗) provably holds.
      M2: Why do low-dimensional vectors solve analogies when (∗) is only roughly true?
      A2: (∗) + isotropy of the word vectors ⇒ low-dimensional fitting reduces the noise.
          (Quite intuitive, though it doesn't follow from Occam's bound for PAC learning.)

  11. [figure: hidden Markov chain of discourse vectors c_t → c_{t+1} → c_{t+2} → c_{t+3} → c_{t+4}, each emitting an observed word w_t, …, w_{t+4}]
      - Hidden Markov model:
        - a discourse vector c_t ∈ ℝ^d governs the discourse/theme/context at time t
        - the words w_t are observable; the word embeddings v_w ∈ ℝ^d are the parameters to learn
        - log-linear observation model:  Pr[ w_t ∣ c_t ] ∝ exp⟨ v_{w_t}, c_t ⟩
      - Closely related to [Mnih-Hinton'07]
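A toy sampler for this generative model (my own sketch with made-up sizes, not code from the talk): word vectors drawn from a scaled Gaussian, a slow random walk for the discourse vector on the unit sphere, and the log-linear emission Pr[w_t ∣ c_t] ∝ exp⟨v_{w_t}, c_t⟩.

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, d, T, step = 5000, 100, 1000, 0.05     # toy sizes; the talk works with a 60k vocabulary

# Word vectors v_w ~ s * N(0, I_d) with a bounded random scaling s (see the assumptions below).
s = rng.uniform(0.5, 1.5, size=n_words)
V = s[:, None] * rng.standard_normal((n_words, d))

def emit(c):
    """Sample one word from Pr[w | c] proportional to exp(<v_w, c>)."""
    logits = V @ c
    p = np.exp(logits - logits.max())
    return rng.choice(n_words, p=p / p.sum())

# Slow random walk of the discourse vector on the unit sphere.
c = rng.standard_normal(d)
c /= np.linalg.norm(c)
words = []
for _ in range(T):
    words.append(emit(c))
    c = c + step * rng.standard_normal(d) / np.sqrt(d)   # small step: c barely moves per word
    c /= np.linalg.norm(c)

print(words[:10])
```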

  12. [figure: the same hidden Markov chain]
      - Ideally, the coordinates of c_t and v_w ∈ ℝ^d would carry semantic information
        - e.g. (0.5, -0.3, …) could mean "0.5 gender, -0.3 age, …"
      - But the whole system is rotation-invariant:  ⟨c_t, v_w⟩ = ⟨R c_t, R v_w⟩ for any rotation R
      - There should exist a rotation that makes the coordinates meaningful (back to this later)
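A one-line check of the rotation invariance (toy dimensions, my own sketch): applying the same orthogonal matrix R to every discourse vector and every word vector preserves all inner products, hence the model's distribution over text.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 50
c = rng.standard_normal(d)                          # a discourse vector
v = rng.standard_normal(d)                          # a word vector
R, _ = np.linalg.qr(rng.standard_normal((d, d)))    # a random orthogonal matrix
print(np.allclose(v @ c, (R @ v) @ (R @ c)))        # inner products are unchanged -> True
```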

  13. [figure: the same hidden Markov chain]
      - Assumptions:
        - the { v_w } are i.i.d. draws from s · 𝒩(0, I_d), where s is a bounded scalar random variable
        - c_t does a slow random walk (it doesn't change much within a window of 5)
        - log-linear observation model:  Pr[ w_t ∣ c_t ] ∝ exp⟨ v_{w_t}, c_t ⟩
      - Main Theorem:
          log Pr[w, w'] = ||v_w + v_{w'}||² / (2d) − 2 log Z ± ε   (1)
          log Pr[w]     = ||v_w||² / (2d) − log Z ± ε              (2)
          PMI(w, w')    = ⟨v_w, v_{w'}⟩ / d ± ε                    (3)
        Fact: (2) implies that the words have a power-law distribution.
      - The norm of v_w determines the word's frequency; its spatial orientation determines its "meaning"
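A Monte Carlo sanity check of statement (3), not a proof: under a window-of-2 version of the model with c = c' (toy sizes, my own code), estimate Pr[w] and Pr[w, w'] by averaging over random unit discourse vectors and compare the resulting PMI with ⟨v_w, v_{w'}⟩/d. The pair is chosen to have a large predicted PMI so the signal stands out from the sampling noise.

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, d, n_samples = 5000, 100, 2000          # toy sizes; the theorem is asymptotic in n and d

V = rng.standard_normal((n_words, d))            # word vectors (take the scaling s = 1)

C = rng.standard_normal((n_samples, d))          # random discourse vectors on the unit sphere
C /= np.linalg.norm(C, axis=1, keepdims=True)

logits = C @ V.T                                 # <v_w, c> for every sampled c and every word
P = np.exp(logits - logits.max(axis=1, keepdims=True))
P /= P.sum(axis=1, keepdims=True)                # Pr[w | c] = exp(<v_w, c>) / Z_c

# Pick the pair (among the first 200 words) with the largest predicted PMI.
G = V[:200] @ V[:200].T / d
np.fill_diagonal(G, -np.inf)
w, w2 = np.unravel_index(np.argmax(G), G.shape)

p_w = P.mean(axis=0)                             # Pr[w]     = E_c Pr[w | c]
p_joint = (P[:, w] * P[:, w2]).mean()            # Pr[w, w'] = E_c Pr[w | c] Pr[w' | c]  (c = c')

pmi_empirical = np.log(p_joint) - np.log(p_w[w]) - np.log(p_w[w2])
pmi_predicted = V[w] @ V[w2] / d                 # statement (3)
print(pmi_empirical, pmi_predicted)              # roughly comparable; improves with larger sizes
```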

  14. - word2vec [Mikolov et al'13]:  Pr[ w_{t+5} ∣ w_{t+1}, …, w_{t+4} ] ∝ exp⟨ v_{w_{t+5}}, (1/5)(v_{w_{t+1}} + ⋯ + v_{w_{t+4}}) ⟩
      - GloVe [Pennington et al'14]:  log Pr[w, w'] ≈ ⟨v_w, v_{w'}⟩ + s_w + s_{w'} + C
        vs. Eq. (1):  log Pr[w, w'] = ||v_w + v_{w'}||² / (2d) − 2 log Z ± ε
      - [Levy-Goldberg'14]:  PMI(w, w') ≈ ⟨v_w, v_{w'}⟩ + C
        vs. Eq. (3):  PMI(w, w') = ⟨v_w, v_{w'}⟩ / d ± ε

  15. - word2vec [Mikolov et al'13]:  Pr[ w_{t+5} ∣ w_{t+1}, …, w_{t+4} ] ∝ exp⟨ v_{w_{t+5}}, (1/5)(v_{w_{t+1}} + ⋯ + v_{w_{t+4}}) ⟩
        ↑ the scaled sum of context vectors is a max-likelihood estimate of the discourse c_{t+5}
      - Under our model:
        - the random walk is slow:  c_{t+1} ≈ c_{t+2} ≈ ⋯ ≈ c_{t+5} ≈ c
        - best estimate of the current discourse c_{t+5}:
            argmax_{c, ||c|| ≤ 1} Pr[ c ∣ w_{t+1}, …, w_{t+4} ] = α ( v_{w_{t+1}} + ⋯ + v_{w_{t+4}} )
        - probability of the next word given this best guess c:
            Pr[ w_{t+5} ∣ c_{t+5} = α ( v_{w_{t+1}} + ⋯ + v_{w_{t+4}} ) ] ∝ exp⟨ v_{w_{t+5}}, α ( v_{w_{t+1}} + ⋯ + v_{w_{t+4}} ) ⟩
      [figure: the last few discourse vectors of the random walk and the words they emit]
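A numeric sketch of the middle claim (my own check with toy sizes): with the partition function roughly constant, the posterior log Pr[c ∣ w_{t+1}, …, w_{t+4}] is, up to a constant, the sum of ⟨v_{w_i}, c⟩ minus 4·log Z_c, and the normalized sum of the context vectors scores far better than random unit directions. I assume a uniform prior over the unit ball, which the slide's constraint ||c|| ≤ 1 suggests.

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, d = 5000, 100
V = rng.standard_normal((n_words, d))                   # word vectors (toy stand-ins)
context = rng.choice(n_words, size=4, replace=False)    # the observed words w_{t+1}, ..., w_{t+4}

def log_posterior(c):
    """log Pr[c | context] up to an additive constant, under a uniform prior on the unit ball."""
    log_Zc = np.log(np.exp(V @ c).sum())                # log partition function at this c
    return (V[context] @ c).sum() - len(context) * log_Zc

candidate = V[context].sum(axis=0)
candidate /= np.linalg.norm(candidate)                  # direction of alpha * (v_{w_{t+1}} + ... + v_{w_{t+4}})

random_units = rng.standard_normal((200, d))
random_units /= np.linalg.norm(random_units, axis=1, keepdims=True)
best_random = max(log_posterior(c) for c in random_units)
print(log_posterior(candidate), best_random)            # the scaled sum of context vectors wins
```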

  16. This talk: window of size 2.
      [figure: discourse vectors c, c' emitting words w, w';  Pr[w ∣ c] ∝ exp⟨v_w, c⟩,  Pr[w' ∣ c'] ∝ exp⟨v_{w'}, c'⟩]
      - Pr[w ∣ c] = (1/Z_c) · exp⟨v_w, c⟩, where Z_c = Σ_w exp⟨v_w, c⟩ is the partition function
      - Pr[w, w'] = ∫ Pr[w ∣ c] Pr[w' ∣ c'] p(c, c') dc dc' = ∫ (1/(Z_c Z_{c'})) exp⟨v_w, c⟩ exp⟨v_{w'}, c'⟩ p(c, c') dc dc'
      - For a spherical Gaussian discourse vector c (covariance I_d/d, so that ⟨v, c⟩ has variance ||v||²/d):  𝔼[ exp⟨v, c⟩ ] = exp( ||v||² / (2d) )
      - Assume c = c' with probability 1 (and Z_c ≈ Z; see Lemma 1 on the next slide). Then
          ∫ exp⟨v_w + v_{w'}, c⟩ p(c) dc = exp( ||v_w + v_{w'}||² / (2d) ),
        which gives Eq. (1):  log Pr[w, w'] = ||v_w + v_{w'}||² / (2d) − 2 log Z ± ε
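A Monte Carlo check of the moment-generating-function step (my own sketch; I read "spherical Gaussian" as c ~ 𝒩(0, I_d/d), the scaling under which the stated identity 𝔼[exp⟨v, c⟩] = exp(||v||²/(2d)) is exact).

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_samples = 100, 200_000
v = rng.standard_normal(d)                              # plays the role of v_w + v_{w'}

C = rng.standard_normal((n_samples, d)) / np.sqrt(d)    # c ~ N(0, I_d / d): "spherical Gaussian"
empirical = np.exp(C @ v).mean()                        # E[ exp<v, c> ]
predicted = np.exp(v @ v / (2 * d))                     # exp( ||v||^2 / (2d) )
print(empirical, predicted)                             # the two agree up to sampling error
```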

  17. This talk: window of size 2.
      - Pr[w ∣ c] = (1/Z_c) · exp⟨v_w, c⟩;  Z_c = Σ_w exp⟨v_w, c⟩ is the partition function
      Lemma 1: for almost all c and almost all choices of the v_w's,  Z_c = (1 + o(1)) · Z
      - Proof (sketch):
        - for most c, Z_c concentrates around its mean
        - the mean of Z_c is determined by ||c||, which in turn concentrates
        - caveat: exp⟨v, c⟩ for v ∼ 𝒩(0, I_d) is neither subgaussian nor subexponential (its α-Orlicz norm is unbounded for every α > 0)
      Eq. (1):  log Pr[w, w'] = ||v_w + v_{w'}||² / (2d) − 2 log Z ± ε

  18. Lemma 1: for almost all c and almost all choices of the v_w's,  Z_c = (1 + o(1)) · Z
      - Proof sketch:
        - Fix c; show that, with high probability over the choice of the v_w's,
            Z_c = Σ_w exp⟨v_w, c⟩ = (1 + o(1)) · 𝔼[Z_c]
        - each ⟨v_w, c⟩ is a scalar Gaussian random variable; ||c|| governs its mean and variance
        - ||c|| in turn is concentrated
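A quick numeric illustration of Lemma 1 (toy sizes, my own code, not a proof): draw the v_w's once, evaluate Z_c at many random unit-norm discourse vectors c, and look at the relative spread.

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, d, n_contexts = 50_000, 100, 200

V = rng.standard_normal((n_words, d))             # v_w ~ N(0, I_d), drawn once and then fixed
C = rng.standard_normal((n_contexts, d))
C /= np.linalg.norm(C, axis=1, keepdims=True)     # discourse vectors c on the unit sphere

Z = np.exp(C @ V.T).sum(axis=1)                   # Z_c = sum_w exp(<v_w, c>) for each c
print(Z.std() / Z.mean())                         # small relative spread: Z_c = (1 + o(1)) Z
```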

