SLIDE 1

Tengyu Ma

Joint work with Sanjeev Arora, Yuanzhi Li, Yingyu Liang, and Andrej Risteski
Princeton University

SLIDE 2

๐‘ฆ โˆˆ ๐’ด

๐‘ค& โˆˆ โ„(

complicated space Euclidean space with meaningful inner products ร˜ Kernel methods

)*+,-.*/01, 23/03+4

Linearly separable ร˜ Neural nets

0.*3+1, +15.*2 +106

Multi-class linear classifier

slide-3
SLIDE 3

Vocabulary = { 60k most frequent words },  embeddings in ℝ^300

Goal: the embedding captures semantic information (via linear algebraic operations)

Ø inner products characterize similarity
Ø similar words have large inner products
Ø differences characterize relationships
Ø analogous pairs have similar differences
Ø more?

picture: Chris Olah's blog

SLIDE 4

Meaning of a word is determined by the words it co-occurs with. (Distributional hypothesis of meaning, [Harris'54], [Firth'57])

[Figure: co-occurrence matrix Pr[⋅,⋅]; row y gives the vector w_y, columns indexed by word z]

Ø Pr[y, z] ≜ probability that y and z co-occur in a window of size 5

Ø ⟨w_y, w_z⟩ is a good measure of the similarity of (y, z) [Lund-Burgess'96]

Ø w_y = row of the entry-wise square root of the co-occurrence matrix [Rohde et al.'05]

Ø w_y = row of the PMI matrix, PMI(y, z) = log( Pr[y, z] / (Pr[y] Pr[z]) ) [Church-Hanks'90]

SLIDE 5

ร˜ โ€œLinear structureโ€ in the found ๐‘ค&โ€™s :

๐‘คPQRST โˆ’ ๐‘คRST โ‰ˆ ๐‘คWXYYT โˆ’ ๐‘คZ[T\ โ‰ˆ ๐‘คXT]^Y โˆ’ ๐‘คSXT_ โ‰ˆ โ‹ฏ aunt king uncle man woman queen Algorithm [Levy-Goldberg]: (dimension-reduction version of [Church-Hanksโ€™90]) ร˜ Compute PMI ๐‘ฆ, ๐‘ง = log

L. [&,C]

  • L. & L.[C]

ร˜ Take rank-300 SVD (best rank-300 approximation) of PMI ร˜ โ‡” Fit PMI ๐‘ฆ,๐‘ง โ‰ˆ โŒฉ๐‘ค&, ๐‘คCโŒช (with squared loss), where ๐‘ค& โˆˆ โ„788
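Not from the slides: a minimal NumPy sketch of this pipeline, assuming a precomputed co-occurrence count matrix `counts` (vocabulary × vocabulary). The symmetric SVD-based split is one simple way to fit PMI(y, z) ≈ ⟨w_y, w_z⟩; it is only a heuristic here since, as noted on a later slide, the PMI matrix need not be PSD.

```python
import numpy as np

def pmi_svd_embeddings(counts, dim=300, eps=1e-12):
    """Rank-`dim` fit of the PMI matrix (cf. [Levy-Goldberg]).

    counts : (V, V) array of co-occurrence counts in a size-5 window.
    Returns a (V, dim) array whose rows play the role of the w_y's.
    """
    joint = counts / counts.sum()                 # Pr[y, z]
    marg = joint.sum(axis=1, keepdims=True)       # Pr[y]
    pmi = np.log(joint + eps) - np.log(marg + eps) - np.log(marg.T + eps)

    # Best rank-`dim` approximation of PMI; split the singular values
    # symmetrically so that PMI[y, z] is approximated by <w_y, w_z>.
    U, S, _ = np.linalg.svd(pmi, full_matrices=False)
    return U[:, :dim] * np.sqrt(S[:dim])
```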

SLIDE 6

Ø Questions:   woman : man  ::  queen : ? ,   aunt : ?

Ø Answers:

    king  = argmin_x ‖ w_queen − w_x − (w_woman − w_man) ‖
    uncle = argmin_x ‖ w_aunt − w_x − (w_woman − w_man) ‖

SLIDE 7

Ø Recurrent neural network based model [Mikolov et al.'12]

Ø word2vec [Mikolov et al.'13]:

    Pr[ y_{t+5} | y_{t+1}, …, y_{t+4} ] ∝ exp⟨ w_{y_{t+5}}, (1/5)(w_{y_{t+1}} + ⋯ + w_{y_{t+4}}) ⟩

Ø GloVe [Pennington et al.'14]:

    log Pr[y, z] ≈ ⟨w_y, w_z⟩ + s_y + s_z + C

Ø [Levy-Goldberg'14] (previous slide):

    PMI(y, z) = log( Pr[y, z] / (Pr[y] Pr[z]) ) ≈ ⟨w_y, w_z⟩ + C

Logarithm (or exponential) seems to exclude linear algebra!

SLIDE 8

Why does co-occurrence statistics + log → linear structure?

[Levy-Goldberg'13, Pennington et al.'14, rephrased]

Ø For most words χ:

    Pr[χ | king] / Pr[χ | queen]  ≈  Pr[χ | man] / Pr[χ | woman]

    § For χ unrelated to gender: LHS, RHS ≈ 1
    § For χ = dress: LHS, RHS ≪ 1;  for χ = John: LHS, RHS ≫ 1

Ø It suggests that, for most words χ,

    log( Pr[χ | king] / Pr[χ | queen] ) − log( Pr[χ | man] / Pr[χ | woman] ) ≈ 0,

  i.e.  PMI(χ, king) − PMI(χ, queen) − ( PMI(χ, man) − PMI(χ, woman) ) ≈ 0.

Ø Rows of the PMI matrix have "linear structure"
Ø Empirically one can find w_x's such that PMI(χ, x) ≈ ⟨w_χ, w_x⟩
Ø Suggestion: the w_x's also have linear structure (see the sketch below)
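A short reasoning sketch (my paraphrase, not on the slide) of why the linear structure transfers from PMI rows to the low-dimensional vectors, assuming PMI(χ, x) ≈ ⟨w_χ, w_x⟩ holds for every pair:

```latex
0 \;\approx\; \big(\mathrm{PMI}(\chi,\text{king}) - \mathrm{PMI}(\chi,\text{queen})\big)
            - \big(\mathrm{PMI}(\chi,\text{man}) - \mathrm{PMI}(\chi,\text{woman})\big)
  \;\approx\; \big\langle\, w_\chi,\; (w_{\text{king}} - w_{\text{queen}}) - (w_{\text{man}} - w_{\text{woman}})\,\big\rangle
  \qquad \text{for most words } \chi .
```

So the vector (w_king − w_queen) − (w_man − w_woman) has near-zero inner product with almost every word vector; if the w_χ point in sufficiently spread-out (isotropic) directions, this forces that vector to be small, i.e. w_king − w_queen ≈ w_man − w_woman. This is the same isotropy property that A2 invokes on a later slide.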
SLIDE 9

M1: Why do low-dim vectors capture the essence of huge co-occurrence statistics? That is, why is a low-dim fit of the PMI matrix even possible?

    PMI(y, z) ≈ ⟨w_y, w_z⟩    (*)        (empirical fit has ≈ 17% error)

M2: Why do low-dim vectors solve analogies when (*) is only roughly true?

Ø NB: solving the analogy task requires inner products of 6 pairs of word vectors, and that "king" survives against all other words – noise is potentially an issue!

    king = argmin_x ‖ w_queen − w_x − (w_woman − w_man) ‖²

Ø Fact: low-dim word vectors have more accurate linear structure than the rows of PMI (therefore better analogy-task performance).

Ø NB: the PMI matrix is not necessarily PSD.

SLIDE 10

M1: Why do low-dim vectors capture the essence of huge co-occurrence statistics? That is, why is a low-dim fit of the PMI matrix even possible?

    PMI(y, z) ≈ ⟨w_y, w_z⟩    (*)

A1: Under a generative model (named RAND-WALK), (*) provably holds.

M2: Why do low-dim vectors solve analogies when (*) is only roughly true?

A2: (*) + isotropy of word vectors ⇒ low-dim fitting reduces noise.

(Quite intuitive, though it doesn't follow from Occam's bound for PAC learning.)

SLIDE 11

Ø Hidden Markov Model:
    § discourse vector d_t ∈ ℝ^d governs the discourse/theme/context at time t
    § words x_t (observable); embeddings w_{x_t} ∈ ℝ^d (parameters to learn)
    § log-linear observation model:

        Pr[x_t | d_t] ∝ exp⟨w_{x_t}, d_t⟩

Ø Closely related to [Mnih-Hinton'07]

[Diagram: Markov chain d_t → d_{t+1} → ⋯ → d_{t+4}; each word x_t is emitted from d_t]

SLIDE 12

ร˜ Ideally, ๐‘‘_,๐‘คP โˆˆ โ„( should contain semantic information in its coordinates

ยง E.g. (0.5, -0.3, โ€ฆ) could mean โ€œ0.5 gender, -0.3 age,..โ€

ร˜ But, the whole system is rotational invariant: ๐‘‘_,๐‘คP = โŒฉ๐‘†๐‘‘_,๐‘†๐‘คPโŒช ร˜ There should exist a rotation so that the coordinates are meaningful (back to

this later) ๐‘‘_ ๐‘‘_pr ๐‘‘_pโ€ข ๐‘‘_p7 ๐‘ฅ_ ๐‘ฅ_pr ๐‘ฅ_pโ€ข ๐‘ฅ_p7 ๐‘ฅ_pโ€“ ๐‘‘_pโ€“

slide-13
SLIDE 13

Ø Assumptions:
    § {w_x} consists of vectors drawn from s · 𝒩(0, I_d), where s is a bounded scalar random variable
    § d_t does a slow random walk (doesn't change much within a window of 5)
    § log-linear observation model: Pr[x_t | d_t] ∝ exp⟨w_{x_t}, d_t⟩

Ø Main Theorem:

    (1) log Pr[x, x'] = ‖w_x + w_{x'}‖² / (2d) − 2 log Z ± ε
    (2) log Pr[x]     = ‖w_x‖² / (2d) − log Z ± ε
    (3) PMI(x, x')    = ⟨w_x, w_{x'}⟩ / d ± ε

    (Here Z is the partition function of the model, defined on a later slide, and ε is a small error term.)

Ø Norm determines frequency; spatial orientation determines "meaning"

Fact: (2) implies that word frequencies follow a power-law distribution.

SLIDE 14

Ø word2vec [Mikolov et al.'13]:

    Pr[ x_{t+5} | x_{t+1}, …, x_{t+4} ] ∝ exp⟨ w_{x_{t+5}}, (1/5)(w_{x_{t+1}} + ⋯ + w_{x_{t+4}}) ⟩

Ø GloVe [Pennington et al.'14]:

    log Pr[x, x'] ≈ ⟨w_x, w_{x'}⟩ + s_x + s_{x'} + C
    ↔ Eq. (1):  log Pr[x, x'] = ‖w_x + w_{x'}‖² / (2d) − 2 log Z ± ε

Ø [Levy-Goldberg'14]:

    PMI(x, x') ≈ ⟨w_x, w_{x'}⟩ + C
    ↔ Eq. (3):  PMI(x, x') = ⟨w_x, w_{x'}⟩ / d ± ε
SLIDE 15

Ø word2vec [Mikolov et al.'13]:

    Pr[ x_{t+5} | x_{t+1}, …, x_{t+4} ] ∝ exp⟨ w_{x_{t+5}}, (1/5)(w_{x_{t+1}} + ⋯ + w_{x_{t+4}}) ⟩

Ø Under our model,
    § the random walk is slow: d_{t+1} ≈ d_{t+2} ≈ ⋯ ≈ d_{t+5} ≈ d
    § best estimate (max-likelihood) for the current discourse d_{t+5} (derivation sketched below):

        argmax_{d, ‖d‖≤1} Pr[ d | x_{t+1}, …, x_{t+4} ] = β ( w_{x_{t+1}} + ⋯ + w_{x_{t+4}} )

    § probability of the next word given the best guess d:

        Pr[ x_{t+5} | d_{t+5} = β(w_{x_{t+1}} + ⋯ + w_{x_{t+4}}) ] ∝ exp⟨ w_{x_{t+5}}, β(w_{x_{t+1}} + ⋯ + w_{x_{t+4}}) ⟩

SLIDE 16

Pr[๐‘ฅ,๐‘ฅโ€บ] = ยฅ Pr ๐‘ฅ ๐‘‘] Pr ๐‘ฅโ€บ ๐‘‘โ€ฒ] ๐‘ž ๐‘‘,๐‘‘โ€บ ๐‘’๐‘‘๐‘’๐‘‘โ€ฒ = ยฅ 1 ๐‘Ž]๐‘Ž]โ€บ โ‹… exp ๐‘คP,๐‘‘ expโŒฉ๐‘คPยข,๐‘‘โ€บโŒช ๐‘ž ๐‘‘, ๐‘‘โ€บ ๐‘’๐‘‘๐‘’๐‘‘โ€ฒ ร˜ Assume ๐‘‘ = ๐‘‘โ€ฒ with probability 1, = ยฅexpโŒฉ๐‘คP + ๐‘คPยข, ๐‘‘โŒช๐‘ž ๐‘‘ ๐‘’๐‘‘ = exp ๐‘คP + ๐‘คPยข

  • /๐‘’

??

This talk: window of size 2 Pr[๐‘ฅ โˆฃ ๐‘‘] โˆ expโŒฉ๐‘คP, ๐‘‘โŒช ๐‘‘ ๐‘‘โ€ฒ ๐‘ฅ ๐‘ฅโ€ฒ Pr[๐‘ฅโ€ฒ โˆฃ ๐‘‘โ€ฒ] โˆ expโŒฉ๐‘คPยข,๐‘‘โ€ฒโŒช

ร˜ Pr[๐‘ฅ โˆฃ ๐‘‘] =

r ยงยจ โ‹… expโŒฉ๐‘คP, ๐‘‘โŒช

ร˜ ๐‘Ž] = โˆ‘ exp

โŒฉ๐‘คP,๐‘‘โŒช

P

partition function

  • Eq. (1) log Pr ๐‘ฅ, ๐‘ฅโ€บ =

๐‘คP + ๐‘คPยข

  • /๐‘’ โˆ’ 2 log ๐‘Ž ยฑ ๐œ—

spherical Gaussian vector ๐‘‘ ร˜ ๐”ฝ exp ๐‘ค,๐‘‘ = exp ๐‘ค

  • /๐‘’
SLIDE 17

This talk: window of size 2.   Pr[x | d] ∝ exp⟨w_x, d⟩,   Pr[x' | d'] ∝ exp⟨w_{x'}, d'⟩

Ø Pr[x | d] = (1/Z_d) · exp⟨w_x, d⟩,  where  Z_d = Σ_x exp⟨w_x, d⟩  is the partition function

Lemma 1: for almost all d and almost all {w_x},   Z_d = (1 ± o(1)) · Z

Ø Proof (sketch):
    § for most d, Z_d concentrates around its mean
    § the mean of Z_d is determined by ‖d‖, which in turn concentrates
    § caveat: exp⟨w, d⟩ for w ∼ 𝒩(0, I_d) is neither sub-Gaussian nor sub-exponential
      (its α-Orlicz norm is unbounded for every α > 0)

Ø Recall Eq. (1):   log Pr[x, x'] = ‖w_x + w_{x'}‖² / (2d) − 2 log Z ± ε

SLIDE 18

Ø Proof sketch: fixing d, show that with high probability over the choice of the w_x's

    Z_d = Σ_x exp⟨w_x, d⟩ = (1 + o(1)) · 𝔼[Z_d]

Ø A_x = ⟨w_x, d⟩ is a scalar Gaussian random variable
Ø ‖d‖ governs the mean and variance of A_x
Ø ‖d‖ in turn is concentrated

Lemma 1: for almost all d and almost all {w_x},   Z_d = (1 ± o(1)) · Z

SLIDE 19

ร˜ Question: ๐‘จr,โ€ฆ ,๐‘จT โˆผ ๐’ช(0,1) ๐‘Ž = โ€ขexp(๐‘จ[)

T [ยฃr

ร˜ How is ๐‘Ž concentrated ? ร˜ ๐”ฝ ๐‘Ž] = ฮ˜(๐‘œ), and ๐•Ž๐‘๐‘  ๐‘Ž] = O ๐‘œ ร˜ The tail of ๐‘“๐‘ฆ๐‘ž(๐‘จ[) is bad! ร˜ Pr exp๐‘จ[ > ๐‘ข โ‰ˆ ๐‘ขยฒ 2ยณ4 _ ร˜ Claim: Pr[๐‘Ž > ๐”ฝ๐‘Ž + ๐ท ๐‘œ โ‹… log ๐‘œ] โ‰ค exp(โˆ’ logโ€ข ๐‘œ) ร˜ Trick: truncate ๐‘จ[ at log ๐‘œ and deal with the tail by union bound

ร˜ (sub)-Gaussian tail Pr ๐‘Œ > ๐‘ข โ‰ค exp(โˆ’๐‘ขโ€ข/2) ร˜ (sub)-exponential tail Pr ๐‘Œ > ๐‘ข โ‰ค exp(โˆ’๐‘ข/2)
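An illustrative simulation (not from the talk) of the claim that Z = Σ_i exp(A_i) concentrates even though each term is heavy-tailed; with n on the order of the 60k-word vocabulary, the relative fluctuation is well under 1%:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 60_000, 100

# Z = sum_i exp(A_i) with A_i ~ N(0, 1), repeated `trials` times.
A = rng.standard_normal((trials, n))
Z = np.exp(A).sum(axis=1)

print("E[Z] (theory):", n * np.exp(0.5))        # = n * e^{1/2}
print("mean of Z    :", Z.mean())
print("relative std :", Z.std() / Z.mean())     # ~ sqrt((e^2 - e)/n) / e^{1/2}, about 0.5%
```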

SLIDE 20

Ø Proof sketch (continued): fixing d, we have with high probability over the choice of the w_x's

    Z_d = Σ_x exp⟨w_x, d⟩ = (1 + o(1)) · 𝔼[Z_d]

Ø A_x = ⟨w_x, d⟩ is a scalar Gaussian random variable
Ø ‖d‖ governs the mean and variance of A_x
Ø ‖d‖ in turn is concentrated

Lemma 1: for almost all d and almost all {w_x},   Z_d = (1 ± o(1)) · Z

SLIDE 21

Pr[๐‘ฅ,๐‘ฅโ€บ] = ยฅ 1 ๐‘Ž]๐‘Ž]โ€บ โ‹… exp ๐‘คP + ๐‘คPยข, ๐‘‘ ๐‘ž ๐‘‘ ๐‘’๐‘‘ = 1 ยฑ ๐‘ 1 1 ๐‘Žโ€ข ยฅ exp ๐‘คP + ๐‘คPยข, ๐‘‘ ๐‘ž ๐‘‘ ๐‘’๐‘‘ = 1 ยฑ ๐‘ 1 1 ๐‘Žโ€ข exp(||๐‘คP + ๐‘คPยข||โ€ข/๐‘’) This talk: window of size 2 Pr[๐‘ฅ โˆฃ ๐‘‘] โˆ expโŒฉ๐‘คP, ๐‘‘โŒช ๐‘‘ ๐‘‘โ€ฒ ๐‘ฅ ๐‘ฅโ€ฒ Pr[๐‘ฅโ€ฒ โˆฃ ๐‘‘โ€ฒ] โˆ expโŒฉ๐‘คPยข,๐‘‘โ€ฒโŒช

  • Eq. (1) log Pr ๐‘ฅ, ๐‘ฅโ€บ =

๐‘คP + ๐‘คPยข

  • /๐‘’ โˆ’ 2 log ๐‘Ž ยฑ ๐œ—

ร˜ Pr[๐‘ฅ โˆฃ ๐‘‘] =

r ยงยจ โ‹… expโŒฉ๐‘คP, ๐‘‘โŒช

ร˜ ๐‘Ž] = โˆ‘ exp

โŒฉ๐‘คP,๐‘‘โŒช

P

partition function Lemma 1: for almost all c, almost all ๐‘คP , ๐‘Ž] = 1 + ๐‘ 1 ๐‘Ž

slide-22
SLIDE 22

Ø Our theory predicts Eq. (1):   log Pr[x, x'] = ‖w_x + w_{x'}‖² / (2d) − 2 log Z ± ε

Ø (Approximate) maximum-likelihood objective (SN):

    min_{ {w_x}, C }   Σ_{x, x'}  P̂r[x, x'] · ( log P̂r[x, x'] − ‖w_x + w_{x'}‖² − C )²

  where P̂r[x, x'] is the empirical co-occurrence probability.

Simplest word embedding method yet (fewest "knobs" to turn); comparable performance on the analogy test.
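A minimal NumPy sketch (mine, not the released code) of evaluating the SN objective for a given set of vectors; in practice one would minimize it over {w_x} and C with gradient methods, summing only over pairs that actually co-occur.

```python
import numpy as np

def sn_objective(W, C, p_hat, eps=1e-12):
    """sum_{x,x'} Pr^[x,x'] * (log Pr^[x,x'] - ||w_x + w_x'||^2 - C)^2.

    W     : (V, dim) word vectors.
    C     : scalar offset.
    p_hat : (V, V) empirical co-occurrence probabilities Pr^[x, x'].
    """
    sq = (W ** 2).sum(axis=1)
    pair_norms = sq[:, None] + sq[None, :] + 2.0 * (W @ W.T)   # ||w_x + w_x'||^2 for all pairs
    resid = np.log(p_hat + eps) - pair_norms - C
    return float((p_hat * resid ** 2).sum())
```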

SLIDE 23

Ø Our theory predicts Eq. (2):   log Pr[x] = ‖w_x‖² / (2d) − log Z ± ε

SLIDE 24

Ø Our theory predicts   Z_d = (1 ± o(1)) · Z

SLIDE 25

Ø Under the generative model RAND-WALK:

    For most words χ:   Pr[χ | a] / Pr[χ | b] ≈ Pr[χ | c] / Pr[χ | d]      (semantic def. of analogy)
                    ⟺
                        w_a − w_b ≈ w_c − w_d                               (algebraic def. of analogy)

Ø Beyond only solving the analogy task?
Ø Extracting more information from analogies / embeddings?

SLIDE 26

Some recent work: extracting different meanings from word embeddings

(same team: Arora, Li, Liang, Ma, Risteski)

SLIDE 27

ร˜ โ€œTieโ€ can mean article of clothing, or physical act ร˜Tie represents unrelated words tie1, tie2, etc.

Quick experiment: Take two random/unrelatedwords w1, w2 where w1 is ~100 times more frequent than w2 . Declare these to be a single word and compute its embedding in our model. Result: close to something like 0.8๐‘คP~ + 0.2๐‘คPร‚

slide-28
SLIDE 28

Ø Mathematical explanation:
Ø Merge x_1, x_2 into a single word x. Let s = Pr[x_1] / Pr[x_2] > 1.
Ø Then w_x ≈ β·w_{x_1} + γ·w_{x_2}, where
    § β = 1 − c_1 · log(1 + 1/s) ≈ 1
    § γ = 1 − c_2 · log s
Ø γ > 0.1 even if s = 100!
Ø The rare meaning is not swamped, thanks to the log!

SLIDE 29

Ø "Tie" can mean an article of clothing, or a physical act
Ø Tie represents unrelated pseudo-words tie1, tie2, etc., which correspond to different representative "discourses":

    w_tie = 0.8·b_1 + 0.2·b_2 + noise
            (b_1: discourse for tie1,  b_2: discourse for tie2)

Ø Sparse coding for extracting different meanings:
    § Find 2000 "discourses" b_1, b_2, … ∈ ℝ^d such that each word vector is expressed as a weighted sum of at most 5 of them, plus a "noise vector":

        w_x = y_{x,1}·b_1 + y_{x,2}·b_2 + ⋯ + noise,      y_x has only 5 non-zeros

Ø Training objective (a runnable stand-in is sketched below):

    min_{ B = [b_1, …, b_{2000}],  {y_x}: ‖y_x‖_0 ≤ 5 }   Σ_x ‖ w_x − B·y_x ‖²

Ø local search algo. [EAB'05]; provable algos. [SWW'12, AGM'14, AGMM'15, …]
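A runnable stand-in for this objective (not the algorithms cited on the slide): scikit-learn's mini-batch dictionary learning, which uses an ℓ₁ relaxation during fitting and then enforces the 5-nonzero constraint with OMP when computing the codes. Parameter values mirror the slide (2000 atoms, 5 nonzeros, 300-dim word vectors); the function name is illustrative.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

def discourse_atoms(W, n_atoms=2000, nnz=5, seed=0):
    """W: (V, 300) word-vector matrix.  Returns (B, Y) with W approx. Y @ B,
    where the rows of B are the discourse atoms b_j and each row of Y has <= nnz nonzeros."""
    dico = MiniBatchDictionaryLearning(
        n_components=n_atoms,
        alpha=1.0,                      # l1 penalty used during fitting (relaxation of ||y||_0 <= 5)
        transform_algorithm="omp",
        transform_n_nonzero_coefs=nnz,  # 5-nonzero codes at transform time
        random_state=seed,
    )
    Y = dico.fit(W).transform(W)        # (V, n_atoms) sparse coefficients
    B = dico.components_                # (n_atoms, 300) atoms of discourse
    return B, Y
```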

SLIDE 30

Sparse coding, and its use to capture different senses of words.

Representative subset of the 2000 discourses (each represented by its nearest words):

    Atom 1978: drowning, suicides, overdose, murder, poisoning, commits, stabbing, strangulation, gunshot
    Atom 825:  instagram, twitter, facebook, tumblr, vimeo, linkedin, reddit, myspace, tweets
    Atom 231:  stakes, thoroughbred, guineas, preakness, filly, fillies, epsom, racecourse, sired   (closest words to b_231)
    Atom 616:  membrane, mitochondria, cytosol, cytoplasm, membranes, organelles, endoplasmic, proteins, vesicles
    Atom 1638: slapping, pulling, plucking, squeezing, twisting, bowing, slamming, tossing, grabbing
    Atom 149:  orchestra, philharmonic, philharmonia, conductor, symphony, orchestras, toscanini, concertgebouw, solti
    Atom 330:  conferences, meetings, seminars, workshops, exhibitions, organizes, concerts, lectures, presentations

SLIDE 31

5 atoms that express w_tie

SLIDE 32

Ø The atoms of discourse found are fairly fine-grained
Ø Maybe b_biochemistry = β·b_biology + γ·b_chemistry?
Ø Another layer of sparse coding on top of the atoms:

    min_{ C, Y sparse }   ‖ B − C·Y ‖²

SLIDE 33
SLIDE 34

Ø Part I: a new generative model that captures semantics
Ø Provable guarantees:
    § the log of the co-occurrence matrix has low-rank structure
    § semantic analogy ⇔ linear algebraic structure of word vectors
Ø Simplistic assumptions, but a good fit to reality
Ø Part II: an automatic way of detecting word meanings
    § hierarchical basis in the embedding space
Ø Other applications of our model / method?

SLIDE 35
SLIDE 36

ร˜ Each ordinate of ๐‘คP means something:

๐‘ครŽรร† = [โ€ฆโ€ฆ ,0, โ€ฆโ€ฆโ€ฆ ,1, โ€ฆโ€ฆโ€ฆ ,1, โ€ฆโ€ฆโ€ฆ ,0, โ€ฆโ€ฆ ] ๐‘ค(Q^^Sร„ = [โ€ฆโ€ฆ ,1, โ€ฆโ€ฆโ€ฆ ,0, โ€ฆโ€ฆโ€ฆ ,1, โ€ฆโ€ฆโ€ฆ ,0, โ€ฆโ€ฆ ] ๐‘ครร‹[TS = [โ€ฆโ€ฆ ,0, โ€ฆโ€ฆโ€ฆ ,1, โ€ฆโ€ฆโ€ฆ ,0, โ€ฆโ€ฆโ€ฆ ,1, โ€ฆโ€ฆ ] ๐‘คร‘ร’รŒ = [โ€ฆโ€ฆ ,1, โ€ฆโ€ฆโ€ฆ ,0, โ€ฆโ€ฆโ€ฆ ,0, โ€ฆโ€ฆโ€ฆ ,1, โ€ฆโ€ฆ ] currency

โ†“

country

โ†“

American

โ†“

Chinese

โ†“

๐‘ครŽรร† โˆ’ ๐‘ค(Q^^Sร„ = [โ€ฆโ€ฆ ,โˆ’1, โ€ฆโ€ฆโ€ฆ ,1, โ€ฆโ€ฆโ€ฆ ,0 โ€ฆโ€ฆ โ€ฆ,0,โ€ฆโ€ฆ ] ๐‘ครร‹[TS โˆ’ ๐‘คร‘ร’รŒ = [โ€ฆ โ€ฆ,โˆ’1,โ€ฆโ€ฆ โ€ฆ, 1,โ€ฆโ€ฆ โ€ฆ, 0,โ€ฆโ€ฆ โ€ฆ, 0,โ€ฆ โ€ฆ]

ร˜ On other coordinates, the values are either very small or the supports are non-

  • verlapping

ร˜ Problem: rotational invariance โ€“ rotation of word vectors doesnโ€™t

change the model.

slide-37
SLIDE 37

๐‘ครŽรร† = [โ€ฆโ€ฆ ,0, โ€ฆโ€ฆโ€ฆ ,1, โ€ฆโ€ฆโ€ฆ ,1, โ€ฆโ€ฆโ€ฆ ,0, โ€ฆโ€ฆ ] ๐‘ค(Q^^Sร„ = [โ€ฆโ€ฆ ,1, โ€ฆโ€ฆโ€ฆ ,0, โ€ฆโ€ฆโ€ฆ ,1, โ€ฆโ€ฆโ€ฆ ,0, โ€ฆโ€ฆ ] ๐‘ครร‹[TS = [โ€ฆโ€ฆ ,0, โ€ฆโ€ฆโ€ฆ ,1, โ€ฆโ€ฆโ€ฆ ,0, โ€ฆโ€ฆโ€ฆ ,1, โ€ฆโ€ฆ ] ๐‘คร‘ร’รŒ = [โ€ฆโ€ฆ ,1, โ€ฆโ€ฆโ€ฆ ,0, โ€ฆโ€ฆโ€ฆ ,0, โ€ฆโ€ฆโ€ฆ ,1, โ€ฆโ€ฆ ] currency

โ†“

country

โ†“

American

โ†“

Chinese

โ†“

โ‹… ๐‘†

โ†‘

sparse coefficients

โ†‘

basis vectors

ร˜With sparsity, the model is identifiable; allows overcomplete basis; is tractable

under mild assumptions. [SWWโ€™12] [AGMโ€™13][AAJNTโ€™13][AGMMโ€™14]

slide-38
SLIDE 38

    min_{ X sparse, S }   ‖ W − X·S ‖_F

Ø W contains the word vectors as rows (obtained from any embedding method)
Ø The sparsity of each row of X is chosen to be 5
Ø S contains the 2000 basis vectors (as rows), each of which is 300-dimensional

SLIDE 39

Assuming M1 has been answered:   PMI(x, x') = ⟨w_x, w_{x'}⟩ + noise   (*),   with large noise.

M2: Why do low-dim vectors solve analogies when (*) is only roughly true?

A2: (*) + isotropy of word vectors ⇒ low-dim fitting reduces noise

(Quite intuitive, though it doesn't follow from Occam's bound for PAC learning.)

SLIDE 40

Ø Our theory assumes that d_t does a slow random walk
Ø red dot: the estimated hidden variable d_t at time t
Ø sentence at top: the window of size 10 at time t

SLIDE 41

Assuming M1 has been answered:   PMI(x, x') = ⟨w_x, w_{x'}⟩ + noise   (*),   with large noise.

M2: Why do low-dim vectors solve analogies when (*) is only roughly true?

A2: (*) + isotropy of word vectors ⇒ low-dim fitting reduces noise

(Quite intuitive, though it doesn't follow from Occam's bound for PAC learning.)