Inside Out: Two Jointly Predictive Models for Word Representations - PowerPoint PPT Presentation

Inside Out: Two Jointly Predictive Models for Word Representations and Phrase Representations Fei Sun , Jiafeng Guo, Yanyan Lan, Jun Xu, and Xueqi Cheng ofey.sunfei@gmail.com CAS Key Lab of Network Data Science and Technology Institute of Computing Technology, Chinese Academy of Sciences K S e y A L C a b , g o o r a o l t 网络数据科学与技术 n o h r c y 重点实验室 e o T f N & e e t w n c o r c e i k D a t a S February 14, 2016

Word Representation Sentiment Machine Analysis Translation [Kalchbrenner and [Maas et al., Blunsom, 2013] 2011] Language POS Word Taging Modeling Representation [Collobert et al., [Bengio et al., 2003] 2011] Word-Sense Parsing Disambiguation [Socher et al., [Collobert et al., 2011] 2011]

x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxx Input Window word of interest Text cat sat on the mat i -th output = P ( w t = i | context ) w 1 w 1 w 1 Feature 1 . . . 1 2 N . . . softmax w K . . . . . . Feature K w K w K . . . 1 2 N Figure 2 presents a schematic of the singular value de- the SVD geometrically. The rows of the reduced matrices most computation here composition for a t x d matrix of terms by documents. Lookup Table of singular vectors are taken as coordinates of points repre- In general, for X = T,S, D,’ the matrices To, D,, and SO senting the documents and terms in a k dimensional space. LT W 1 must all be of full rank. The beauty of an SVD, however, With appropriate resealing of the axes, by quantities related is that it allows a simple strategy for optimal approximate . to the associated diagonal values of S, dot products between . d . tanh fit using smaller matrices. If the singular values in S, are points in the space can be used to compare the correspond- . . . . . . ordered by size, the first k largest may be kept and the re- LT W K ing objects. The next section details these comparisons. maining smaller ones set to zero. The product of the result- Computing Fundamental Comparison Quantities concat ing matrices is a matrix X which is only approximately from the SVD Model. There are basically three sorts of Linear equal to X, and is of rank k. It can be shown that the new comparisons of interest: those comparing two terms (“How similar are terms i and j?“), those comparing two docu- matrix X is the matrix of rank k, which is closest in the M 1 × · ments (“How similar are documents i and j?“), and those C ( w t − n + 1 ) C ( w t − 2 ) C ( w t − 1 ) least squares sense to X. Since zeros were introduced into S,, the representation can be simplified by deleting the comparing a term and a document (“How associated are n 1 . . . . . . . . . . . . hu zero rows and columns of S, to obtain a new diagonal ma- term i and document j?“). In standard information retrieval trix S, and then deleting the corresponding HardTanh columns of approaches, these amount respectively, to comparing two Table Matrix C T,, and D, to obtain T and D respectively. The result is a rows, comparing two columns, or examining individual look−up shared parameters reduced model: cells of the original matrix of term by document data, X. in C across words Here we make similar comparisons, but use the matrix X, X=i= Linear TSD’ since it is presumed to represent the importtnt and reliable w t − n + 1 w t − 2 index for index for index for w t − 1 which is the rank-k model with the best possible least- patterns underlying the data in X. Since X = TSD ‘, the M 2 × · Word squares-fit to X. It is this reduced model, presented in relevant quantities can be computed just using the smaller Figure 3, that we use to approximate our data. n 2 hu = #tags matrices, T, D, and S. The amount of dimension reduction, i.e., the choice of Comparing Two Terms. The dot product between two k, is critical to our work. Ideally, we want a value of k that row vectors of X reflects the extent to which two terms is large enough to fit all the real structure in the data, but have a similar pattemAtf occurrence across the set of docu- Representation small enough so that we do not also fit the sampling error ments. The matrix XX’ is the square symmetric matrix or unimportant details. The proper way to make such containing all these term-to-term dot products. Since S is choices is an open issue in the factor analytic literature. diagonal and D is orthonormal. It is easy to verify that: In practice, we currently use an operational criterion- a Models value of k which yields good retrieval performance. Geometric Interpretation of the SVD Model. For Note that this means that the i,j cell of XX’ can be ob- purposes of intuition and discussion it is useful to interpret tained by taking the dot product between the i and j rows documents terms X TO INPUT PROJECTION OUTPUT INPUT PROJECTION OUTPUT mxm mxd L w(t-2) w(t-2) txd txm X TO SO DO w(t-1) w(t-1) SUM w(t) w(t) Singular value decomposition of the term x document matrix, X. Where: To has orthogonal, unit-length columns (To’ To = I) Do has orthogonal, unit-length columns (D,’ Do = I) w(t+1) w(t+1) So is the diagonal matrix of singular values t is the number of rows of X d is the number of columns of X w(t+2) w(t+2) m is the rank of X (< min(t,d)) FIG. 2. Schematic of the Singular Value Decomposition (SVD) of a rectangular term by document matrix. The original term by document matrix is decomposed into three matrices each with linearly independent components. CBOW Skip-gram 398 JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE-September 1990

The Distributional Hypothesis [Harris, 1954, Firth, 1957] “ You shall know a word by the company it keeps .” —J.R. Firth

The Distributional Hypothesis We found a cute little ? wampimuk sleeping in a tree.

What if there were no context for a word?

What if there were no context for a word? Which word is closer to buy, buys or sells?

Morphology How words are built from morphemes

Morphology How words are built from morphemes breakable: break, able buys: buy, s

Limitation of Morphology Words do not share morpheme dog ? husky

Motivation Example: “ . . . glass is breakable, take care . . . ” EXTERNAL CONTEXTS INTERNAL MORPHEMES glass is break breakable take able care

BEING . . . N c i − 2 ∑ log p ( w i |P c L = i ) Context # ‰ i = 1 P c c i − 1 i c i + 1 c i + 2 w i . . .

BEING . . . N c i − 2 ( ) ∑ log p ( w i |P c i ) + log p ( w i |P m L = i ) Context # ‰ i = 1 P c c i − 1 i c i + 1 c i + 2 w i . . . # ‰ P m m ( 1 ) i i m ( 2 ) Morphology i . . .

BEING . . . N c i − 2 ( ) ∑ log p ( w i |P c i ) + log p ( w i |P m L = i ) Context # ‰ i = 1 P c c i − 1 i Negative Sampling N ( w i · # ‰ W log σ ( − # w · # ‰ ‰ ∑ c i + 1 log σ ( #‰ P c P c L = i ) + k · E ˜ ˜ i ) w ∼ P ˜ i = 1 w i · # ‰ w · # ‰ ) c i + 2 W log σ ( − # ‰ + log σ ( #‰ P m P m ˜ w i i ) + k · E ˜ i ) w ∼ P ˜ . . . # ‰ P m 1 m ( 1 ) σ ( x ) = i 1 + exp ( − x ) i m ( 2 ) Morphology i . . .

SEING . . . ( i + l c i − 2 N ) ∑ ∑ L = log p ( c j | w i ) i = 1 j = i − l c i − 1 j ̸ = i Context c i + 1 w i c i + 2 . . .

SEING . . . ( i + l s ( w i ) c i − 2 N ) ∑ ∑ ∑ log p ( m ( z ) L = log p ( c j | w i ) + | w i ) i i = 1 j = i − l z = 1 c i − 1 j ̸ = i Context c i + 1 w i c i + 2 . . . m ( 1 ) i Morphology m ( 2 ) i . . .

SEING . . . ( i + l s ( w i ) c i − 2 N ) ∑ ∑ ∑ log p ( m ( z ) L = log p ( c j | w i ) + | w i ) i i = 1 j = i − l z = 1 c i − 1 j ̸ = i Negative Sampling Context c i + 1 ( i + l N ( C log σ ( − # ‰ ) ∑ ∑ log σ ( # c j · #‰ ‰ ˜ c · #‰ L = w i ) + k · E ˜ w i ) c ∼ P ˜ w i c i + 2 i = 1 j = i − l j ̸ = i . . . s ( w i ) # ‰ ) ( m ( z ) − # ‰ ) ∑ · #‰ m · #‰ ˜ m ( 1 ) + log σ ( w i )+ k · E ˜ M log σ ( w i ) m ∼ P ˜ i i z = 1 1 Morphology m ( 2 ) σ ( x ) = 1 + exp ( − x ) i . . .

Related Work I � �� CLBL++ Context-sensitive morphological RNN [Botha and Blunsom, 2014] [Luong et al., 2013] • Neural language model with morphological information • Simple and straightforward models can acquire better word representations

Inside Out: Two Jointly Predictive Models for Word Representations - PowerPoint PPT Presentation

Inside Out: Two Jointly Predictive Models for Word Representations and Phrase Representations Fei Sun , Jiafeng Guo, Yanyan Lan, Jun Xu, and Xueqi Cheng ofey.sunfei@gmail.com CAS Key Lab of Network Data Science and Technology Institute of

IMC Presentation Recommendation Adopt Inside Out and Back Again by Thanhha Lai Adopt One

0.07 0.06 0.05 0.04 Unspecialized inside Specialized inside (rot, trans) Specialized inside

Commission: Out of touch, out of date, out of pocket April 2017 Commission: Out of touch, out of

Long-term Research Issues in SSD NVRAMOS 2011 Research Issues:

Inside Vaucanson The Vaucanson group LRDE / EPITA - LIAFA / Paris 7 - LTCI / ENST June 27, 2005

Think Thinking ab about out t the he box: ins inside and and out out Andrew Mayes

Ar Art t Ins nside/Out ide/Out Art Inside/Out is an initiative for acquiring, exhibiting, &

[7] Gaussian Elimination Starting to peek inside the black box So far sol ve( A, b) is a black

God Reaches Out Week 1: God Reaches Out To Meet Us Where We Are Week 2: God Reaches Out In

JDBC JDBC Perf erfor ormance mance fr from the Inside om the Inside Ju July 2017 1

REMIT Inside Information Platform Key Principles: Market Participants are responsible

THE QUEST FOR STUDENT SUCCESS AT COMMUNITY COLLEGES An Inside Higher Ed Webinar March 25 at 2

Linux in a Light Bulb Linux How far are we on tinifjcation? inside Pieter Smith Philips

Inside The RT Patch Talk: Steven Rostedt (Red Hat) Benchmarks : Darren V Hart (IBM) Inside

Hacking Russia: Inside an Hacking Russia: Inside an unprecedented prosecution of organized

Payload Already Inside: Payload Already Inside: Data re-use for ROP Exploits Data re-use for ROP

Setswana Tokenisation and Computational Verb Morphology: Facing the Challenge of a Disjunctive

The Life & Times of Isaac GENESIS 26:1-33 Opening Thoughts This is the only chapter devoted

PARTS OF A PHILOSOPHY PAPER OR PRESENTATION [Includes excerpts from About Philosophy, 5 th ed.,

Baumgartner, POLI 203 Spring 2016 Reading: Peffley and Hurwitz March 21, 2016 If you are

BBN-ANG-243 Advanced Phonology: Phonological Analysis 1. Introduction Kiss Zoltn / Starcevic

Statistical Dependency Parsing in Korean: From Corpus Generation To Automatic Parsing Workshop on

Identifying Negation in the DGS Corpus Graz, 2019-05-03 Marc Schulder, Thomas Hanke

Assembly Assembly Computational Challenge: assemble individual short fragments (reads) into a

Inside Out: Two Jointly Predictive Models for Word Representations - PowerPoint PPT Presentation

Inside Out: Two Jointly Predictive Models for Word Representations and Phrase Representations Fei Sun , Jiafeng Guo, Yanyan Lan, Jun Xu, and Xueqi Cheng ofey.sunfei@gmail.com CAS Key Lab of Network Data Science and Technology Institute of

IMC Presentation Recommendation Adopt Inside Out and Back Again by Thanhha Lai Adopt One

0.07 0.06 0.05 0.04 Unspecialized inside Specialized inside (rot, trans) Specialized inside

Commission: Out of touch, out of date, out of pocket April 2017 Commission: Out of touch, out of

Long-term Research Issues in SSD NVRAMOS 2011 Research Issues:

Inside Vaucanson The Vaucanson group LRDE / EPITA - LIAFA / Paris 7 - LTCI / ENST June 27, 2005

Think Thinking ab about out t the he box: ins inside and and out out Andrew Mayes

Ar Art t Ins nside/Out ide/Out Art Inside/Out is an initiative for acquiring, exhibiting, &amp;

[7] Gaussian Elimination Starting to peek inside the black box So far sol ve( A, b) is a black

God Reaches Out Week 1: God Reaches Out To Meet Us Where We Are Week 2: God Reaches Out In

JDBC JDBC Perf erfor ormance mance fr from the Inside om the Inside Ju July 2017 1

REMIT Inside Information Platform Key Principles: Market Participants are responsible

THE QUEST FOR STUDENT SUCCESS AT COMMUNITY COLLEGES An Inside Higher Ed Webinar March 25 at 2

Linux in a Light Bulb Linux How far are we on tinifjcation? inside Pieter Smith Philips

Inside The RT Patch Talk: Steven Rostedt (Red Hat) Benchmarks : Darren V Hart (IBM) Inside

Hacking Russia: Inside an Hacking Russia: Inside an unprecedented prosecution of organized

Payload Already Inside: Payload Already Inside: Data re-use for ROP Exploits Data re-use for ROP

Setswana Tokenisation and Computational Verb Morphology: Facing the Challenge of a Disjunctive

The Life &amp; Times of Isaac GENESIS 26:1-33 Opening Thoughts This is the only chapter devoted

PARTS OF A PHILOSOPHY PAPER OR PRESENTATION [Includes excerpts from About Philosophy, 5 th ed.,

Baumgartner, POLI 203 Spring 2016 Reading: Peffley and Hurwitz March 21, 2016 If you are

BBN-ANG-243 Advanced Phonology: Phonological Analysis 1. Introduction Kiss Zoltn / Starcevic

Statistical Dependency Parsing in Korean: From Corpus Generation To Automatic Parsing Workshop on

Identifying Negation in the DGS Corpus Graz, 2019-05-03 Marc Schulder, Thomas Hanke

Assembly Assembly Computational Challenge: assemble individual short fragments (reads) into a

Ar Art t Ins nside/Out ide/Out Art Inside/Out is an initiative for acquiring, exhibiting, &

The Life & Times of Isaac GENESIS 26:1-33 Opening Thoughts This is the only chapter devoted