SLIDE 1

Inside Out:

Two Jointly Predictive Models for Word Representations and Phrase Representations

Fei Sun, Jiafeng Guo, Yanyan Lan, Jun Xu, and Xueqi Cheng

  • ofey.sunfei@gmail.com

CAS Key Lab of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences

Key Laboratory of Network Data Science & Technology, CAS (网络数据科学与技术重点实验室)

February 14, 2016

SLIDE 2

Word Representation

  • POS Tagging [Collobert et al., 2011]
  • Word-Sense Disambiguation [Collobert et al., 2011]
  • Parsing [Socher et al., 2011]
  • Language Modeling [Bengio et al., 2003]
  • Machine Translation [Kalchbrenner and Blunsom, 2013]
  • Sentiment Analysis [Maas et al., 2011]

SLIDE 3

Word Representation Models

[Figure collage: the neural probabilistic language model of Bengio et al. (2003) with its shared look-up table C and softmax output P(w_t = i | context); the window-based architecture of Collobert et al. (2011) with lookup table, linear, and HardTanh layers; a scanned excerpt on the singular value decomposition of the term-by-document matrix used in latent semantic analysis; and the CBOW and Skip-gram architectures.]
SLIDE 4

The Distributional Hypothesis

[Harris, 1954, Firth, 1957]

“You shall know a word by the company it keeps.” —J.R. Firth

SLIDE 5

The Distributional Hypothesis

We found a cute little wampimuk sleeping in a tree.

?

SLIDE 6

The Distributional Hypothesis

We found a cute little wampimuk sleeping in a tree.

?

SLIDE 7

What if there were no context for a word?

SLIDE 8

What if there were no context for a word? Which word is closer to buy, buys or sells?

SLIDE 9

Morphology

How words are built from morphemes

SLIDE 10

Morphology

How words are built from morphemes

  • breakable: break, able
  • buys: buy, s

SLIDE 11

Limitation of Morphology

Related words need not share any morpheme: dog ? husky

SLIDE 12

Motivation

Example: “. . . glass is breakable, take care . . .”

  • EXTERNAL CONTEXTS of breakable: glass, is, take, care
  • INTERNAL MORPHEMES of breakable: break, able

SLIDE 13

Model

SLIDE 14

BEING

[Figure: the word w_i is predicted from the context projection \vec{P}^{c}_{i} built from c_{i-2}, c_{i-1}, c_{i+1}, c_{i+2}.]

L = \sum_{i=1}^{N} \log p(w_i \mid \vec{P}^{c}_{i})

SLIDE 15

BEING

[Figure: the word w_i is predicted both from the context projection \vec{P}^{c}_{i} of c_{i-2}, c_{i-1}, c_{i+1}, c_{i+2} and from the morpheme projection \vec{P}^{m}_{i} of m^{(1)}_{i}, m^{(2)}_{i}, . . .]

L = \sum_{i=1}^{N} \left( \log p(w_i \mid \vec{P}^{c}_{i}) + \log p(w_i \mid \vec{P}^{m}_{i}) \right)

SLIDE 16

BEING

[Figure: the word w_i is predicted both from the context projection \vec{P}^{c}_{i} and from the morpheme projection \vec{P}^{m}_{i}.]

L = \sum_{i=1}^{N} \left( \log p(w_i \mid \vec{P}^{c}_{i}) + \log p(w_i \mid \vec{P}^{m}_{i}) \right)

With negative sampling:

L = \sum_{i=1}^{N} \Big( \log \sigma(\vec{w}_i \cdot \vec{P}^{c}_{i}) + k \cdot \mathbb{E}_{\tilde{w} \sim P_W} \log \sigma(-\vec{\tilde{w}} \cdot \vec{P}^{c}_{i}) + \log \sigma(\vec{w}_i \cdot \vec{P}^{m}_{i}) + k \cdot \mathbb{E}_{\tilde{w} \sim P_W} \log \sigma(-\vec{\tilde{w}} \cdot \vec{P}^{m}_{i}) \Big),
\quad \sigma(x) = \frac{1}{1 + \exp(-x)}
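To make the objective concrete, here is a minimal NumPy sketch of the BEING loss at one position under negative sampling. The names (being_loss, C_in, M_in, w_out) are illustrative rather than taken from the paper's code, and composing the projections as sums is an assumption (an average over the inputs would also fit the figure).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def being_loss(w_out, C_in, M_in, target, ctx_ids, morph_ids, neg_ids):
    """Negative-sampling loss of BEING at one position i (a sketch).

    w_out     : (V, d) output word vectors, rows play the role of w_i
    C_in      : (V, d) input context vectors, composed into P^c_i
    M_in      : (Vm, d) input morpheme vectors, composed into P^m_i
    target    : index of the word w_i to be predicted
    ctx_ids   : indices of c_{i-2}, c_{i-1}, c_{i+1}, c_{i+2}
    morph_ids : indices of the morphemes of w_i
    neg_ids   : k negative word indices drawn from the noise distribution P_W
    """
    Pc = C_in[ctx_ids].sum(axis=0)    # context projection (sum is an assumption)
    Pm = M_in[morph_ids].sum(axis=0)  # morpheme projection
    loss = 0.0
    for P in (Pc, Pm):
        loss += np.log(sigmoid(w_out[target] @ P))             # positive term
        loss += np.log(sigmoid(-(w_out[neg_ids] @ P))).sum()   # k negative terms
    return -loss  # negative log-likelihood to minimize
```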

SLIDE 17

SEING

[Figure: each context word c_{i-2}, c_{i-1}, c_{i+1}, c_{i+2} is predicted from the word w_i.]

L = \sum_{i=1}^{N} \sum_{\substack{j=i-l \\ j \neq i}}^{i+l} \log p(c_j \mid w_i)

SLIDE 18

SEING

[Figure: the contexts c_{i-2}, c_{i-1}, c_{i+1}, c_{i+2} and the morphemes m^{(1)}_{i}, m^{(2)}_{i}, . . . are all predicted from the word w_i.]

L = \sum_{i=1}^{N} \left( \sum_{\substack{j=i-l \\ j \neq i}}^{i+l} \log p(c_j \mid w_i) + \sum_{z=1}^{s(w_i)} \log p(m^{(z)}_{i} \mid w_i) \right)

SLIDE 19

SEING

[Figure: the contexts and the morphemes of w_i are all predicted from \vec{w}_i.]

L = \sum_{i=1}^{N} \left( \sum_{\substack{j=i-l \\ j \neq i}}^{i+l} \log p(c_j \mid w_i) + \sum_{z=1}^{s(w_i)} \log p(m^{(z)}_{i} \mid w_i) \right)

With negative sampling:

L = \sum_{i=1}^{N} \Big( \sum_{\substack{j=i-l \\ j \neq i}}^{i+l} \big( \log \sigma(\vec{c}_j \cdot \vec{w}_i) + k \cdot \mathbb{E}_{\tilde{c} \sim P_C} \log \sigma(-\vec{\tilde{c}} \cdot \vec{w}_i) \big) + \sum_{z=1}^{s(w_i)} \big( \log \sigma(\vec{m}^{(z)}_{i} \cdot \vec{w}_i) + k \cdot \mathbb{E}_{\tilde{m} \sim P_M} \log \sigma(-\vec{\tilde{m}} \cdot \vec{w}_i) \big) \Big),
\quad \sigma(x) = \frac{1}{1 + \exp(-x)}
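And the SEING counterpart, where the word vector predicts each context and each morpheme. Again a sketch with illustrative names; drawing k negatives per positive term from the noise distributions P_C and P_M is an assumption about bookkeeping, not a statement about the paper's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def seing_loss(w_in, C_out, M_out, word, ctx_ids, morph_ids, neg_ctx, neg_morph):
    """Negative-sampling loss of SEING at one position i (a sketch).

    w_in      : (V, d) input word vectors (w_i)
    C_out     : (V, d) output context vectors (c_j)
    M_out     : (Vm, d) output morpheme vectors (m^(z)_i)
    word      : index of w_i
    ctx_ids   : indices of c_j for j = i-l..i+l, j != i
    morph_ids : indices of the s(w_i) morphemes of w_i
    neg_ctx   : list of k negative context indices per positive context
    neg_morph : list of k negative morpheme indices per positive morpheme
    """
    w = w_in[word]
    loss = 0.0
    for c, negs in zip(ctx_ids, neg_ctx):       # predict each context from w_i
        loss += np.log(sigmoid(C_out[c] @ w))
        loss += np.log(sigmoid(-(C_out[negs] @ w))).sum()
    for m, negs in zip(morph_ids, neg_morph):   # predict each morpheme from w_i
        loss += np.log(sigmoid(M_out[m] @ w))
        loss += np.log(sigmoid(-(M_out[negs] @ w))).sum()
    return -loss
```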

SLIDE 20

Related Work I

  • csmRNN: context-sensitive morphological RNN [Luong et al., 2013]
  • CLBL++: neural language model with morphological information [Botha and Blunsom, 2014]
  • Simple and straightforward models can acquire better word representations

SLIDE 21

Related Work II

[Figure: architecture of the morpheme-powered CBOW model: 1-of-W word and morpheme representations (bag of words, bag of morphemes) are mapped by an embedding matrix into the E-dimensional embedding space and back to the W-dimensional vocabulary space through a projection matrix.]

Morpheme-powered CBOW models [Qiu et al., 2014]

  • Do not capture the interaction between words and their morphemes

SLIDE 22

Experiments

SLIDE 23

Experimental Settings

Corpus:

  model                            corpus          size
  csmRNN [Luong et al., 2013]      Wikipedia 2010  1B
  MorphemeCBOW [Qiu et al., 2014]  Wikipedia 2010  1B
  GloVe, CBOW, SG, BEING, SEING    Wikipedia 2010  1B

Parameter settings:

  window  negative  iteration  learning rate                          noise distribution
  10      10        20         0.025 (SG, SEING); 0.05 (CBOW, BEING)  ∝ #(w)^0.75

Morpheme segmentation: Morfessor
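As a small aside, the ∝ #(w)^0.75 noise distribution in the settings table can be built directly from corpus counts. A sketch (function name and toy corpus are illustrative):

```python
import numpy as np
from collections import Counter

def noise_distribution(tokens, power=0.75):
    """Negative-sampling noise distribution proportional to count(w)^0.75."""
    counts = Counter(tokens)
    words = sorted(counts)
    probs = np.array([counts[w] for w in words], dtype=float) ** power
    return words, probs / probs.sum()

# Toy usage: draw 10 negative samples from a tiny corpus.
words, probs = noise_distribution("the cat sat on the mat near the dog".split())
negatives = np.random.default_rng(0).choice(words, size=10, p=probs)
```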

SLIDE 24

Word Analogy

  • Test set
  • Google [Mikolov et al., 2013a]
  • Semantic: “Beijing is to China as Paris is to ___”
  • Syntactic: “big is to bigger as deep is to ___”
  • Solution (see the sketch after this list):

    \arg\max_{x \in W,\; x \neq a,\, x \neq b,\, x \neq c} (\vec{b} + \vec{c} - \vec{a}) \cdot \vec{x}

  • Metric: percentage of questions answered correctly
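A minimal sketch of the vector-offset solution above, assuming a dict of unit-normalized vectors so the dot product behaves like cosine similarity (names are illustrative):

```python
import numpy as np

def solve_analogy(vectors, a, b, c):
    """Return argmax_x (b + c - a) . x over the vocabulary, excluding a, b, c.

    vectors: dict mapping word -> unit-normalized np.ndarray (an assumption).
    E.g. solve_analogy(vecs, "big", "bigger", "deep") should ideally give "deeper".
    """
    query = vectors[b] + vectors[c] - vectors[a]
    best_word, best_score = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):        # exclude the three query words
            continue
        score = float(query @ vec)
        if score > best_score:
            best_word, best_score = word, score
    return best_word
```

Accuracy on the benchmark is then simply the fraction of questions for which the returned word matches the gold answer.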
SLIDE 25

Word Analogy

[Figure: semantic, syntactic, and total analogy precision at 50 and 300 dimensions for HSMN+csmRNN, C&W+csmRNN, GloVe, CBOW, BEING, SG, and SEING.]

  • CBOW and SG are very strong baselines.
  • BEING and SEING outperform CBOW and SG respectively.
  • Bigger improvements on syntactic analogy.
SLIDE 26

Syntactic Analogy

  Syntactic subtask      CBOW   BEING  SG     SEING
  adjective-to-adverb    31.85  26.51  38.10  37.20
  opposite               34.73  45.07  30.79  39.16
  comparative            88.14  91.82  79.58  83.93
  superlative            61.14  71.30  48.31  61.94
  present-participle     67.23  67.42  62.59  66.67
  nationality-adjective  90.18  91.56  90.24  90.68
  past-tense             66.86  69.17  61.28  64.94
  plural                 81.91  86.86  82.21  84.53
  plural-verbs           65.86  85.86  67.47  81.26

  • adjective-to-adverb: words wrongly segmented by Morfessor, e.g. “luckily” is segmented into “lucki” + “ly”
SLIDE 27

Word Similarity

  • Test sets
  • Rare Word (RW) [Luong et al., 2013]
  • WordSim-353 (WS-353) [Finkelstein et al., 2002]
  • SimLex-999 (SL-999) [Hill et al., 2014]
  • Detail: word pairs with a similarity score assigned by humans, e.g. (tiger, cat, 7.35)
  • Evaluation metric: Spearman rank correlation (see the sketch below)
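A sketch of the evaluation: rank-correlate human scores with model cosine similarities using SciPy's spearmanr. Skipping out-of-vocabulary pairs is an assumption here; benchmarks differ in how they handle missing words.

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate_similarity(vectors, pairs):
    """Spearman rho between human similarity scores and model cosine similarities.

    pairs: iterable of (word1, word2, human_score), e.g. ("tiger", "cat", 7.35).
    """
    human, model = [], []
    for w1, w2, score in pairs:
        if w1 in vectors and w2 in vectors:   # skip out-of-vocabulary pairs
            v1, v2 = vectors[w1], vectors[w2]
            cos = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
            human.append(score)
            model.append(cos)
    rho, _ = spearmanr(human, model)
    return rho
```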
SLIDE 28

Word Similarity

[Figure: Spearman ρ × 100 on RW, WordSim-353, and SimLex-999 at 50 and 300 dimensions for HSMN+csmRNN, C&W+csmRNN, CLBL++, MorphemeCBOW, GloVe, CBOW, BEING, SG, and SEING.]

  • BEING and SEING outperform CBOW and SG respectively.

SLIDE 29

Phrase Representations

SLIDE 30

Phrase Representation

[Figure: the BEING and SEING architectures applied to a phrase p_i, with its constituent words w^{(1)}_{i}, w^{(2)}_{i} playing the role of morphemes and c_{i-2}, c_{i-1}, c_{i+1}, c_{i+2} as external contexts.]

  • Distributed Morphology
  • The constituent words of a phrase are treated as its morphemes (see the sketch below)
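A tiny illustration of that idea: for one phrase occurrence, the external contexts and the internal units handed to the models would look roughly like this. The function name and the fixed-window treatment of contexts are hypothetical details, not the paper's preprocessing.

```python
def phrase_units(tokens, start, end, window=2):
    """Split one phrase occurrence into external contexts and internal "morphemes".

    The phrase tokens[start:end] plays the role of the word; its constituent
    words play the role of its morphemes, mirroring BEING/SEING.
    """
    phrase = "_".join(tokens[start:end])
    contexts = tokens[max(0, start - window):start] + tokens[end:end + window]
    internal = tokens[start:end]
    return phrase, contexts, internal

# "Boston Bruins" treated as a single phrase token:
phrase_units("the Boston Bruins won the game".split(), 1, 3)
# -> ("Boston_Bruins", ["the", "won", "the"], ["Boston", "Bruins"])
```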
SLIDE 31

Phrase Analogy

  • Test set
  • Google [Mikolov et al., 2013b]
  • “Boston is to Boston Bruins as Los Angeles is to ___”
  • Solution:

    \arg\max_{x \in W,\; x \neq a,\, x \neq b,\, x \neq c} (\vec{b} + \vec{c} - \vec{a}) \cdot \vec{x}

  • Metric: percentage of questions answered correctly
SLIDE 32

Phrase Analogy

Precision (%) by dimensionality:

  model   50     100    200    300    400
  GloVe   7.77   18.46  32.20  37.95  39.08
  CBOW    13.81  28.24  48.35  56.04  58.75
  SG      23.04  39.56  52.23  61.10  62.86
  BEING   23.74  44.25  63.08  67.69  69.41
  SEING   43.74  63.96  75.16  76.23  77.25

SLIDE 33

Discussion

Word Analogy

[Figure: word analogy precision vs. dimensionality (50, 300) for GloVe, CBOW, BEING, SG, and SEING; repeated from Slide 25.]

Phrase Analogy

[Figure: phrase analogy precision vs. dimensionality (50–400) for GloVe, CBOW, SG, BEING, and SEING; repeated from Slide 32.]

               Word Analogy   Phrase Analogy
  Performance  BEING, CBOW    SEING, SG
  Improvement  BEING          SEING

SLIDE 34

Summary

  • Two novel models that jointly model external contexts and internal morphemes.
  • State-of-the-art results.
  • Two novel models for phrase representations.
SLIDE 35

Thanks! Q & A

More Information: http://ofey.me/projects/InsideOut/

SLIDE 36

Reference I

Botha, J. A. and Blunsom, P. (2014). Compositional morphology for word representations and language modelling. In Proceedings of ICML, pages 1899–1907.

Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., and Ruppin, E. (2002). Placing search in context: The concept revisited. ACM Trans. Inf. Syst., 20(1):116–131.

Firth, J. R. (1957). A synopsis of linguistic theory 1930–55. Studies in Linguistic Analysis (special volume of the Philological Society), 1952–59:1–32.

Harris, Z. (1954). Distributional structure. Word, 10(23):146–162.

Hill, F., Reichart, R., and Korhonen, A. (2014). SimLex-999: Evaluating semantic models with (genuine) similarity estimation. CoRR, abs/1408.3456.

Luong, M.-T., Socher, R., and Manning, C. D. (2013). Better word representations with recursive neural networks for morphology. In Proceedings of CoNLL, pages 104–113.

Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013a). Efficient estimation of word representations in vector space. In Proceedings of Workshop of ICLR.

SLIDE 37

Reference II

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013b). Distributed representations of words and phrases and their compositionality. In Proceedings of NIPS, pages 3111–3119. Curran Associates, Inc.

Qiu, S., Cui, Q., Bian, J., Gao, B., and Liu, T. (2014). Co-learning of word representations and morpheme representations. In Proceedings of COLING, pages 141–150.