Tengyu Ma. Joint works with Sanjeev Arora, Yuanzhi Li, Yingyu Liang, and Andrej Risteski. Princeton University.
complicated space → Euclidean space with meaningful inner products
→ Kernel methods
→ Neural nets: learned representations become linearly separable (fed to a multi-class linear classifier)
→ Word embeddings: word y ↦ w_y ∈ ℝ^d
Vocabulary = { 60k most frequent words } → ℝ^300. Goal: the embedding captures semantic information (via linear algebraic operations).
→ inner products characterize similarity
→ similar words have large inner products
→ differences characterize relationships
→ analogous pairs have similar differences
→ more?
picture: Chris Olah's blog
Meaning of a word is determined by the words it co-occurs with. (Distributional hypothesis of meaning, [Harris'54], [Firth'57])
[Figure: co-occurrence matrix Pr[·, ·], rows indexed by word y, columns by word z; word y ↦ w_y, word z ↦ w_z]
→ Pr[y, z]: probability of co-occurrence of y and z in a window of size 5
→ ⟨w_y, w_z⟩: a good measure of similarity of (y, z) [Lund-Burgess'96]
→ w_y = row of the entry-wise square root of the co-occurrence matrix [Rohde et al'05]
→ w_y = row of the PMI matrix [Church-Hanks'90], where PMI(y, z) = log( Pr[y, z] / (Pr[y] · Pr[z]) )
→ "Linear structure" in the found w_y's:
   w_woman − w_man ≈ w_queen − w_king ≈ w_aunt − w_uncle ≈ ⋯
Algorithm [Levy-Goldberg] (dimension-reduction version of [Church-Hanks'90]):
→ Compute PMI(y, z) = log( Pr[y, z] / (Pr[y] · Pr[z]) )
→ Take the rank-300 SVD (best rank-300 approximation) of the PMI matrix
→ ≈ fit PMI(y, z) ≈ ⟨w_y, w_z⟩ (with squared loss), where w_y ∈ ℝ^300
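A minimal sketch of the pipeline above, assuming a symmetric word-word co-occurrence count matrix `counts` held as a dense NumPy array (the names and the dense representation are illustrative rather than from the slides):

```python
import numpy as np

def pmi_embeddings(counts, dim=300, eps=1e-12):
    """Fit PMI(y, z) ~ <w_y, w_z> via a truncated SVD of the PMI matrix."""
    total = counts.sum()
    p_yz = counts / total                      # joint probabilities Pr[y, z]
    p_y = p_yz.sum(axis=1, keepdims=True)      # marginals Pr[y]
    pmi = np.log((p_yz + eps) / (p_y * p_y.T + eps))   # PMI(y, z); eps guards log(0)

    # Best rank-`dim` approximation of PMI; split singular values between the two factors.
    U, S, Vt = np.linalg.svd(pmi, full_matrices=False)
    W = U[:, :dim] * np.sqrt(S[:dim])          # "word" vectors (rows)
    C = Vt[:dim].T * np.sqrt(S[:dim])          # "context" vectors (rows)
    return W, C                                # W @ C.T is the best rank-`dim` fit of PMI
```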
→ Questions:   woman : man  ::  queen : ? ,  aunt : ?
→ Answers:
   king = argmin_x || w_queen − w_x − (w_woman − w_man) ||
   uncle = argmin_x || w_aunt − w_x − (w_woman − w_man) ||
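For concreteness, here is a hedged sketch of this argmin computation; the word-vector matrix `W` and the word-to-row index `vocab` are assumed inputs (e.g. from the SVD fit above):

```python
import numpy as np

def solve_analogy(W, vocab, a, b, c):
    """Return argmin_x || w_c - w_x - (w_a - w_b) ||, excluding the query words.

    Example: a="woman", b="man", c="queen" should ideally return "king".
    """
    target = W[vocab[c]] - (W[vocab[a]] - W[vocab[b]])
    dists = np.linalg.norm(W - target, axis=1)
    for w in (a, b, c):                 # common practice: never return a query word itself
        dists[vocab[w]] = np.inf
    inv = {i: w for w, i in vocab.items()}
    return inv[int(np.argmin(dists))]
```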
→ Recurrent neural network based model [Mikolov et al'12]
→ word2vec [Mikolov et al'13]:
   Pr[ y_{t+5} | y_{t+1}, …, y_{t+4} ] ∝ exp⟨ w_{y_{t+5}}, (1/5)·(w_{y_{t+1}} + ⋯ + w_{y_{t+4}}) ⟩
→ GloVe [Pennington et al'14]:
   log Pr[y, z] ≈ ⟨w_y, w_z⟩ + s_y + s_z + C
→ [Levy-Goldberg'14] (previous slide):
   PMI(y, z) = log( Pr[y, z] / (Pr[y] · Pr[z]) ) ≈ ⟨w_y, w_z⟩ + C
Logarithm (or exponential) seems to exclude linear algebra!
Why co-occurrence statistics + log → linear structure?
[Levy-Goldberg'13, Pennington et al'14, rephrased]
→ For most words χ:
   Pr[χ | king] / Pr[χ | queen]  ≈  Pr[χ | man] / Pr[χ | woman]
   ▪ for χ unrelated to gender: LHS, RHS ≈ 1
   ▪ for χ = dress: LHS, RHS ≪ 1;  for χ = John: LHS, RHS ≫ 1
→ It suggests
   Σ_χ ( log( Pr[χ | king] / Pr[χ | queen] ) − log( Pr[χ | man] / Pr[χ | woman] ) )²  ≈ 0,
   i.e.  Σ_χ ( PMI(χ, king) − PMI(χ, queen) − PMI(χ, man) + PMI(χ, woman) )²  ≈ 0
   (a small numeric check is sketched below)
→ Rows of the PMI matrix have "linear structure"
→ Empirically one can find w_x's s.t. PMI(χ, x) ≈ ⟨w_χ, w_x⟩
→ Suggestion: the w_x's also have linear structure
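As a small illustrative check of this claim (not from the slides): given a PMI matrix `pmi` whose rows/columns are indexed by `vocab`, the averaged squared residual below should be close to zero.

```python
import numpy as np

def analogy_residual(pmi, vocab, a="king", b="queen", c="man", d="woman"):
    """Mean of ( PMI(chi, a) - PMI(chi, b) - PMI(chi, c) + PMI(chi, d) )^2 over all chi."""
    r = pmi[:, vocab[a]] - pmi[:, vocab[b]] - pmi[:, vocab[c]] + pmi[:, vocab[d]]
    return float(np.mean(r ** 2))       # close to 0 if the rows of PMI have the claimed linear structure
```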
M1: Why do low-dim vectors capture the essence of huge co-occurrence statistics? That is, why is a low-dim fit of the PMI matrix even possible?   PMI(y, z) ≈ ⟨w_y, w_z⟩   (*)
M2: Why do low-dim vectors solve analogies when (*) is only roughly true?
→ NB: solving the analogy task requires inner products of 6 pairs of word vectors, and that "king" survives against all other words; noise is potentially an issue!
   king = argmin_x || w_queen − w_x − (w_woman − w_man) ||²
→ Fact: low-dim word vectors have more accurate linear structure than the rows of PMI (therefore better analogy-task performance).
   (the empirical fit PMI(y, z) ≈ ⟨w_y, w_z⟩ has 17% error)
→ NB: the PMI matrix is not necessarily PSD.
M1: Why do low-dim vectors capture the essence of huge co-occurrence statistics? That is, why is a low-dim fit of the PMI matrix even possible?   PMI(y, z) ≈ ⟨w_y, w_z⟩   (*)
A1: Under a generative model (named RAND-WALK), (*) provably holds.
M2: Why do low-dim vectors solve analogies when (*) is only roughly true?
A2: (*) + isotropy of word vectors ⟹ low-dim fitting reduces noise.
(Quite intuitive, though it doesn't follow from Occam's bound for PAC-learning.)
→ Hidden Markov Model:
   ▪ discourse vector c_t ∈ ℝ^d governs the discourse/theme/context at time t
   ▪ words x_t (observable); embeddings w_x ∈ ℝ^d (parameters to learn)
   ▪ log-linear observation model:  Pr[x_t | c_t] ∝ exp⟨w_{x_t}, c_t⟩
→ Closely related to [Mnih-Hinton'07]
[Diagram: discourse vectors c_t, c_{t+1}, …, c_{t+4} doing a random walk, each emitting a word x_t, x_{t+1}, …, x_{t+4}]
→ Ideally, c_t, w_x ∈ ℝ^d should contain semantic information in their coordinates
   ▪ e.g. (0.5, −0.3, …) could mean "0.5 gender, −0.3 age, …"
→ But the whole system is rotation-invariant: ⟨c_t, w_x⟩ = ⟨R·c_t, R·w_x⟩
→ There should exist a rotation so that the coordinates are meaningful (back to this later)
→ Assumptions:
   ▪ {w_x} consists of vectors drawn from s · 𝒩(0, I_d); s is a bounded scalar r.v.
   ▪ c_t does a slow random walk (doesn't change much within a window of 5)
   ▪ log-linear observation model:  Pr[x_t | c_t] ∝ exp⟨w_{x_t}, c_t⟩
→ Main Theorem:
   (1)  log Pr[x, x'] = ||w_x + w_{x'}||² / (2d) − 2 log Z ± ε
   (2)  log Pr[x] = ||w_x||² / (2d) − log Z ± ε
   (3)  PMI(x, x') = ⟨w_x, w_{x'}⟩ / d ± ε
→ Norm determines frequency; spatial orientation determines "meaning"
Fact: (2) implies that word frequencies follow a power-law distribution.
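A toy simulation of the generative model as stated above can make the assumptions concrete; all sizes, the step size of the walk, and the projection back to the unit sphere are illustrative choices, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 50                                   # vocabulary size, dimension (toy values)
s = rng.uniform(0.5, 1.5, size=(n, 1))            # bounded scalar r.v. controlling the norms
W = s * rng.normal(size=(n, d))                   # word vectors w_x ~ s * N(0, I_d)

c = rng.normal(size=d)
c /= np.linalg.norm(c)                            # discourse vector, kept near the unit sphere

corpus = []
for _ in range(5000):
    c = c + 0.05 * rng.normal(size=d) / np.sqrt(d)    # slow random walk ...
    c /= np.linalg.norm(c)                            # ... projected back to the sphere
    logits = W @ c                                    # <w_x, c_t> for every word x
    p = np.exp(logits - logits.max())
    p /= p.sum()                                      # Pr[x_t | c_t] proportional to exp<w_x, c_t>
    corpus.append(rng.choice(n, p=p))

# With enough samples, empirical log Pr[x] should track ||w_x||^2 (claim (2)):
# the most frequent words are the ones with the largest norms.
```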
→ word2vec [Mikolov et al'13]:
   Pr[ x_{t+5} | x_{t+1}, …, x_{t+4} ] ∝ exp⟨ w_{x_{t+5}}, (1/5)·(w_{x_{t+1}} + ⋯ + w_{x_{t+4}}) ⟩
→ GloVe [Pennington et al'14]:
   log Pr[x, x'] ≈ ⟨w_x, w_{x'}⟩ + s_x + s_{x'} + C
   ⇐ Eq. (1):  log Pr[x, x'] = ||w_x + w_{x'}||² / (2d) − 2 log Z ± ε
→ [Levy-Goldberg'14]:
   PMI(x, x') ≈ ⟨w_x, w_{x'}⟩ + C
   ⇐ Eq. (3):  PMI(x, x') = ⟨w_x, w_{x'}⟩ / d ± ε
→ word2vec [Mikolov et al'13]:
   Pr[ x_{t+5} | x_{t+1}, …, x_{t+4} ] ∝ exp⟨ w_{x_{t+5}}, (1/5)·(w_{x_{t+1}} + ⋯ + w_{x_{t+4}}) ⟩
→ Under our model,
   ▪ the random walk is slow:  c_{t+1} ≈ c_{t+2} ≈ ⋯ ≈ c_{t+5} ≈ c
   ▪ best estimate for the current discourse c_{t+5}:
      argmax_{c, ||c|| ≤ 1} Pr[ c | x_{t+1}, …, x_{t+4} ] = β·(w_{x_{t+1}} + ⋯ + w_{x_{t+4}})
   ▪ probability distribution of the next word given the best guess c:
      Pr[ x_{t+5} | c_{t+5} = β·(w_{x_{t+1}} + ⋯ + w_{x_{t+4}}) ] ∝ exp⟨ w_{x_{t+5}}, β·(w_{x_{t+1}} + ⋯ + w_{x_{t+4}}) ⟩
   ⟸ maximum-likelihood estimate of c_{t+5}
[Diagram: discourse vectors c_{t+1}, …, c_{t+5} and the words x_{t+1}, …, x_{t+5} they emit]
This talk: window of size 2.   Pr[x | c] ∝ exp⟨w_x, c⟩,   Pr[x' | c'] ∝ exp⟨w_{x'}, c'⟩
Pr[x, x'] = ∫ Pr[x | c] · Pr[x' | c'] · p(c, c') dc dc'
          = ∫ (1 / (Z_c · Z_{c'})) · exp⟨w_x, c⟩ · exp⟨w_{x'}, c'⟩ · p(c, c') dc dc'
→ Assume c = c' with probability 1; if the partition functions could be ignored (??):
          = ∫ exp⟨w_x + w_{x'}, c⟩ · p(c) dc = exp( ||w_x + w_{x'}||² / (2d) )
→ Pr[x | c] = (1 / Z_c) · exp⟨w_x, c⟩
→ Z_c = Σ_x exp⟨w_x, c⟩   ⟵ partition function
⇐ Eq. (1):  log Pr[x, x'] = ||w_x + w_{x'}||² / (2d) − 2 log Z ± ε
→ for a spherical Gaussian vector c:  E[ exp⟨w, c⟩ ] = exp( ||w||² / (2d) )
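A quick Monte Carlo check of this Gaussian fact; the normalization c ~ N(0, I_d / d), chosen so that E||c||² = 1, is my assumption for the sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 100
w = rng.normal(size=d)                                    # an arbitrary fixed vector
c = rng.normal(scale=1.0 / np.sqrt(d), size=(50000, d))   # "spherical Gaussian" discourse vectors
lhs = np.exp(c @ w).mean()                                # Monte Carlo estimate of E[exp<w, c>]
rhs = np.exp(w @ w / (2 * d))                             # exp(||w||^2 / (2d))
print(lhs, rhs)                                           # the two agree up to sampling error
```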
This talk: window of size 2.   Pr[x | c] ∝ exp⟨w_x, c⟩,   Pr[x' | c'] ∝ exp⟨w_{x'}, c'⟩
→ Pr[x | c] = (1 / Z_c) · exp⟨w_x, c⟩
→ Z_c = Σ_x exp⟨w_x, c⟩   ⟵ partition function
Lemma 1: for almost all c and almost all w_x,   Z_c = (1 ± o(1)) · Z
→ Proof (sketch):
   ▪ for most c, Z_c concentrates around its mean
   ▪ the mean of Z_c is determined by ||c||, which in turn concentrates
   ▪ caveat: exp⟨w, c⟩ for w ∼ 𝒩(0, I_d) is neither subgaussian nor sub-exponential (its ψ_p-Orlicz norm is unbounded for every p > 0)
⇐ Eq. (1):  log Pr[x, x'] = ||w_x + w_{x'}||² / (2d) − 2 log Z ± ε
→ Proof sketch: fixing c, show that with high probability over the choice of the w_x's,
   Z_c = Σ_x exp⟨w_x, c⟩ = (1 ± o(1)) · E[Z_c]
→ g_x = ⟨w_x, c⟩ is a scalar Gaussian random variable
→ ||c|| governs the mean and variance of exp(g_x)
→ ||c|| in turn is concentrated
Lemma 1: for almost all c and almost all w_x,   Z_c = (1 ± o(1)) · Z
→ Question:  g_1, …, g_n ∼ 𝒩(0, 1) i.i.d.,   Z = Σ_{i=1}^n exp(g_i)
→ How is Z concentrated?
→ E[Z] = Θ(n), and Var[Z] = O(n)
→ But the tail of exp(g_i) is bad!   Pr[ exp(g_i) > t ] ≈ exp( −(log t)² / 2 )
→ Claim:  Pr[ Z > E[Z] + C·√n·log n ] ≤ exp( −Ω(log² n) )
→ Trick: truncate g_i at log n and handle the tail by a union bound
→ (sub)-Gaussian tail: Pr[X > t] ≤ exp(−t²/2);   (sub)-exponential tail: Pr[X > t] ≤ exp(−t/2)
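A small simulation of exactly this question (sizes arbitrary): despite the heavy tail of exp(g_i), the sum Z stays within a tiny relative distance of E[Z] = n·e^{1/2}.

```python
import numpy as np

rng = np.random.default_rng(2)
n, trials = 50000, 100
g = rng.normal(size=(trials, n))          # g_i ~ N(0, 1), i.i.d.
Z = np.exp(g).sum(axis=1)                 # Z = sum_i exp(g_i), one value per trial
EZ = n * np.exp(0.5)                      # E[exp(g_i)] = e^{1/2}
print((Z / EZ).min(), (Z / EZ).max())     # both close to 1: fluctuations ~ n^{-1/2} up to polylog factors
```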
→ Proof sketch: fixing c, we have with high probability over the choice of the w_x's,
   Z_c = Σ_x exp⟨w_x, c⟩ = (1 ± o(1)) · E[Z_c]
→ g_x = ⟨w_x, c⟩ is a scalar Gaussian random variable
→ ||c|| governs the mean and variance of exp(g_x)
→ ||c|| in turn is concentrated
Lemma 1: for almost all c and almost all w_x,   Z_c = (1 ± o(1)) · Z
This talk: window of size 2.   Pr[x | c] ∝ exp⟨w_x, c⟩,   Pr[x' | c'] ∝ exp⟨w_{x'}, c'⟩
Pr[x, x'] = ∫ (1 / (Z_c · Z_{c'})) · exp⟨w_x + w_{x'}, c⟩ · p(c) dc
          = (1 ± o(1)) · (1/Z²) · ∫ exp⟨w_x + w_{x'}, c⟩ · p(c) dc      (by Lemma 1)
          = (1 ± o(1)) · (1/Z²) · exp( ||w_x + w_{x'}||² / (2d) )
⇒ Eq. (1):  log Pr[x, x'] = ||w_x + w_{x'}||² / (2d) − 2 log Z ± ε
→ Pr[x | c] = (1 / Z_c) · exp⟨w_x, c⟩
→ Z_c = Σ_x exp⟨w_x, c⟩   ⟵ partition function
Lemma 1: for almost all c and almost all w_x,   Z_c = (1 ± o(1)) · Z
→ Our theory predicts Eq. (1):  log Pr[x, x'] = ||w_x + w_{x'}||² / (2d) − 2 log Z ± ε
→ (Approximate) maximum-likelihood objective (SN, "squared norm"):
   min_{ {w_x}, C }  Σ_{x, x'}  P̂[x, x'] · ( log P̂[x, x'] − ||w_x + w_{x'}||² − C )²
   where P̂[x, x'] is the empirical co-occurrence probability (a sketch follows below)
Simplest word embedding method yet (fewest "knobs" to turn); comparable performance on analogy tests.
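A minimal gradient-descent sketch of the SN objective above, written in plain NumPy; the dense matrix `P` of empirical co-occurrence probabilities and all hyperparameters are illustrative, and this is not the authors' implementation:

```python
import numpy as np

def sn_objective(W, C, P, logP):
    """Loss and gradients of  sum_{x,x'} P[x,x'] * (logP[x,x'] - ||w_x + w_x'||^2 - C)^2."""
    sq = (W * W).sum(axis=1)
    norms = sq[:, None] + sq[None, :] + 2 * (W @ W.T)    # ||w_x + w_x'||^2 for all pairs
    R = logP - norms - C                                 # residuals
    loss = (P * R * R).sum()
    A = P * R                                            # per-pair weights entering the gradient
    rs, cs = A.sum(axis=1), A.sum(axis=0)
    grad_W = -4 * ((rs + cs)[:, None] * W + (A + A.T) @ W)
    grad_C = -2 * A.sum()
    return loss, grad_W, grad_C

def fit_sn(P, dim=300, steps=500, lr=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    n = P.shape[0]
    # zero-count pairs carry zero weight, so their log never matters; just keep it finite
    logP = np.where(P > 0, np.log(np.maximum(P, 1e-300)), 0.0)
    W, C = 0.1 * rng.normal(size=(n, dim)), 0.0
    for _ in range(steps):                               # plain gradient descent; lr may need tuning
        _, gW, gC = sn_objective(W, C, P, logP)
        W -= lr * gW
        C -= lr * gC
    return W, C
```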
→ Our theory predicts Eq. (2):  log Pr[x] = ||w_x||² / (2d) − log Z ± ε
→ Our theory predicts  Z_c = (1 ± o(1)) · Z
→ Under the generative model RAND-WALK:
   for most words χ:   Pr[χ | a] / Pr[χ | b] ≈ Pr[χ | c] / Pr[χ | d]
   ⟺   w_a − w_b ≈ w_c − w_d
   i.e. the semantic definition of analogy ⟺ the algebraic definition of analogy
→ Beyond only solving the analogy task?
→ Extracting more information from analogies/embeddings?
Extracting different meanings from word embeddings
(same team: Arora, Li, Liang, M., Risteski)
Some recent work:
→ "Tie" can mean an article of clothing, or a physical act
→ Think of w_tie as standing for unrelated words tie1, tie2, etc.
Quick experiment: take two random/unrelated words w1, w2, where w1 is ~100 times more frequent than w2. Declare these to be a single word and compute its embedding in our model. Result: close to something like 0.8·w_{w1} + 0.2·w_{w2}.
→ Mathematical explanation
→ Merge x_1, x_2 into a single token x. Let r = Pr[x_1] / Pr[x_2] > 1
→ Then w_x ≈ α·w_{x_1} + β·w_{x_2}, where
   ▪ α = 1 − c_1·log(1 + 1/r) ≈ 1
   ▪ β = 1 − c_2·log r
→ β > 0.1 even if r = 100!
→ The rare meaning is not swamped, thanks to the log!
The two senses correspond to different representative "discourses":
   w_tie ≈ 0.8·a_1 + 0.2·a_2 + noise
   where a_1 is the discourse (atom) for tie_1 and a_2 is the discourse (atom) for tie_2
→ "Tie" can mean an article of clothing, or a physical act; w_tie stands for unrelated words tie1, tie2, etc.
→ Sparse coding for extracting different meanings:
   ▪ find m = 2000 "discourses" (atoms) a_1, a_2, … ∈ ℝ^d such that each word vector is expressed as a weighted sum of at most 5 of them, plus a noise vector:
      w_x = y_{x,1}·a_1 + y_{x,2}·a_2 + ⋯ + noise,   where y_x has only 5 non-zeros
→ Training objective:
   min_{ A = [a_1, …, a_m], sparse y_x }  Σ_x || w_x − A·y_x ||²
→ local search algorithm [EAB'05]; provable algorithms [SWW'12, AGM'14, AGMM'15, …]
This is standard sparse coding (dictionary learning), used here to capture the different senses of words.
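One convenient way to run this optimization is scikit-learn's mini-batch dictionary learning, an alternating local-search method in the spirit of the algorithms cited above (an illustrative substitute, not the authors' solver). Here `W` is assumed to be the (vocabulary × 300) matrix of word vectors:

```python
from sklearn.decomposition import MiniBatchDictionaryLearning

dl = MiniBatchDictionaryLearning(
    n_components=2000,               # number of "discourse" atoms a_1, ..., a_m
    transform_algorithm="omp",       # orthogonal matching pursuit for the sparse codes
    transform_n_nonzero_coefs=5,     # each word vector uses at most 5 atoms
    random_state=0,
)
Y = dl.fit_transform(W)              # sparse coefficients y_x (one row per word)
A = dl.components_                   # atoms of discourse (one row per atom, 300-dim)
# Reconstruction: W is approximately Y @ A, the remainder playing the role of the noise term.
```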
Atom 1978: drowning, suicides, overdose, murder, poisoning, commits, stabbing, strangulation, gunshot
Atom 825: instagram, twitter, facebook, tumblr, vimeo, linkedin, reddit, myspace, tweets
Atom 231: stakes, thoroughbred, guineas, preakness, filly, fillies, epsom, racecourse, sired
Atom 616: membrane, mitochondria, cytosol, cytoplasm, membranes, organelles, endoplasmic, proteins, vesicles
Atom 1638: slapping, pulling, plucking, squeezing, twisting, bowing, slamming, tossing, grabbing
Atom 149: orchestra, philharmonic, philharmonia, conductor, symphony, orchestras, toscanini, concertgebouw, solti
Atom 330: conferences, meetings, seminars, workshops, exhibitions, organizes, concerts, lectures, presentations
Representative subset of 2000 discourses (represented using their nearest words)
↑ closest words to a single atom, e.g. a_231
[Figure: the 5 atoms that express w_tie]
→ The atoms of discourse found are fairly fine-grained
→ Maybe a_biochemistry ≈ α·a_biology + β·a_chemistry?
→ Another layer: sparse-code the atoms themselves:
   min_{ B, sparse z_i }  Σ_i || a_i − B·z_i ||²
→ Part I: a new generative model that captures semantics
→ Provable guarantees:
   ▪ the log of the co-occurrence matrix has low-rank structure
   ▪ semantic analogy ⟺ linear algebraic structure of word vectors
→ Simplistic assumptions, but a good fit to reality
→ Part II: an automatic way of detecting word meanings
   ▪ hierarchical basis in the embedding space
→ Other applications of our model/method?
→ Each coordinate of w_x would ideally mean something:
   w_usa    = [……, 0, ………, 1, ………, 1, ………, 0, ……]
   w_dollar = [……, 1, ………, 0, ………, 1, ………, 0, ……]
   w_china  = [……, 0, ………, 1, ………, 0, ………, 1, ……]
   w_yuan   = [……, 1, ………, 0, ………, 0, ………, 1, ……]
   (the four highlighted coordinates: currency, country, American, Chinese)
→ w_usa − w_dollar = [……, −1, ………, 1, ………, 0, ………, 0, ……]
   w_china − w_yuan = [……, −1, ………, 1, ………, 0, ………, 0, ……]
→ On the other coordinates, the values are either very small or the supports are non-overlapping
→ Problem: rotational invariance; a rotation of the word vectors doesn't change the model.
   w_usa    = [……, 0, ………, 1, ………, 1, ………, 0, ……]
   w_dollar = [……, 1, ………, 0, ………, 1, ………, 0, ……]
   w_china  = [……, 0, ………, 1, ………, 0, ………, 1, ……]
   w_yuan   = [……, 1, ………, 0, ………, 0, ………, 1, ……]
   (the four highlighted coordinates: currency, country, American, Chinese)
[Figure: each word vector written as sparse coefficients times basis vectors]
→ With sparsity, the model is identifiable, allows an overcomplete basis, and is tractable under mild assumptions [SWW'12][AGM'13][AAJNT'13][AGMM'14]
   min_{ B, sparse X }  || V − X·B ||²
→ V contains the word vectors as rows (obtained from any embedding method)
→ the sparsity of each row of X is chosen to be 5
→ B contains 2000 basis vectors (as rows), each of which is 300-dimensional
Assuming M1 was answered:   PMI(x, x') = ⟨w_x, w_{x'}⟩ + ε   (*)   with large ε
M2: Why do low-dim vectors solve analogies when (*) is only roughly true?
A2: (*) + isotropy of word vectors ⟹ low-dim fitting reduces noise.
(Quite intuitive, though it doesn't follow from Occam's bound for PAC-learning.)
→ Our theory assumes that c_t does a slow random walk
→ red dot: the estimated hidden variable c_t at time t
→ sentence at top: the window of size 10 at time t