More Distributional Semantics: New Models & Applications (CMSC 723 / LING 723 / INST 725) - PowerPoint PPT Presentation



SLIDE 1

More Distributional Semantics: New Models & Applications

CMSC 723 / LING 723 / INST 725 MARINE CARPUAT

marine@cs.umd.edu

SLIDE 2

Last week…

  • Q: what is understanding meaning?
  • A: meaning is knowing when words are similar or not

  • Topics
    – Word similarity
    – Thesaurus-based methods
    – Distributional word representations
    – Dimensionality reduction

SLIDE 3

Today

  • New models for learning word representations
    – From “count”-based models (e.g., LSA)
    – to “prediction”-based models (e.g., word2vec)
    – … and back
  • Beyond semantic similarity
    – Learning semantic relations between words
SLIDE 4

DISTRIBUTIONAL MODELS OF WORD MEANING

SLIDE 5

Distributional Approaches: Intuition

  • “You shall know a word by the company it keeps!” (Firth, 1957)
  • “Differences of meaning correlates with differences of distribution” (Harris, 1970)
SLIDE 6

Context Features

  • Word co-occurrence within a window:
  • Grammatical relations:
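As a concrete illustration of the first feature type, here is a minimal sketch (toy corpus, function name, and window size are all illustrative) of collecting window-based co-occurrence counts:

```python
from collections import Counter, defaultdict

def window_cooccurrences(sentences, window=2):
    """Count how often each context word appears within +/- `window`
    positions of each target word."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        for i, target in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[target][tokens[j]] += 1
    return counts

corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["the", "dog", "sat", "on", "the", "rug"]]
vectors = window_cooccurrences(corpus)
print(vectors["sat"])   # e.g. Counter({'the': 4, 'on': 2, 'cat': 1, 'dog': 1})
```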
SLIDE 7

Association Metric

  • Commonly-used metric: Pointwise Mutual

Information

  • Can be used as a feature value or by itself

$$\text{assoc}_{\text{PMI}}(w, f) = \log_2 \frac{P(w, f)}{P(w)\,P(f)}$$
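A small sketch of computing PMI from such co-occurrence counts (it assumes the dict-of-Counters built in the window sketch above; the helper name is illustrative):

```python
import math

def pmi(cooc, w, f):
    """Pointwise mutual information of word w and context feature f,
    estimated from a dict-of-Counters of co-occurrence counts."""
    total = sum(sum(ctx.values()) for ctx in cooc.values())
    p_wf = cooc[w][f] / total
    p_w = sum(cooc[w].values()) / total
    p_f = sum(ctx[f] for ctx in cooc.values()) / total
    if p_wf == 0:
        return float("-inf")   # unseen pair; PPMI variants clip this to 0
    return math.log2(p_wf / (p_w * p_f))
```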

SLIDE 8

Computing Similarity

  • Semantic similarity boils down to computing some measure on context vectors
  • Cosine distance: borrowed from information retrieval

$$\text{sim}_{\text{cosine}}(\vec{v}, \vec{w}) = \frac{\vec{v} \cdot \vec{w}}{|\vec{v}|\,|\vec{w}|} = \frac{\sum_{i=1}^{N} v_i w_i}{\sqrt{\sum_{i=1}^{N} v_i^2}\,\sqrt{\sum_{i=1}^{N} w_i^2}}$$
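In code, cosine similarity is just a length-normalized dot product; a minimal sketch assuming dense NumPy context vectors:

```python
import numpy as np

def cosine_sim(v, w):
    """Cosine of the angle between two context vectors."""
    return float(np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w)))
```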

SLIDE 9

Dimensionality Reduction with Latent Semantic Analysis
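A minimal sketch of the LSA step, using a plain truncated SVD in NumPy (the function name and k are illustrative; in practice the input is usually a weighted, e.g. tf-idf or PPMI, term-context matrix):

```python
import numpy as np

def lsa(term_context_matrix, k=100):
    """Reduce a |V| x |C| matrix to k latent dimensions via truncated SVD."""
    U, S, Vt = np.linalg.svd(term_context_matrix, full_matrices=False)
    return U[:, :k] * S[:k]   # rows are k-dimensional word vectors
```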

SLIDE 10

NEW DIRECTIONS: PREDICT VS. COUNT MODELS

SLIDE 11

Word vectors as a byproduct of language modeling

A neural probabilistic Language Model. Bengio et al. JMLR 2003
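The idea, very roughly: train an n-gram language model with an embedding layer and keep the learned embedding table as the word vectors. A much-simplified PyTorch-style sketch (layer names and sizes are illustrative; the original model also has direct input-to-output connections, omitted here):

```python
import torch
import torch.nn as nn

class NeuralLM(nn.Module):
    """Very rough Bengio-style n-gram neural LM; the learned `embed`
    table is the byproduct kept as word vectors."""
    def __init__(self, vocab_size, dim=100, context=3, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.hidden = nn.Linear(context * dim, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, context_ids):                  # shape: (batch, context)
        x = self.embed(context_ids).flatten(1)       # concatenate context embeddings
        return self.out(torch.tanh(self.hidden(x)))  # scores over the next word
```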

SLIDE 12
SLIDE 13

Using neural word representations in NLP

  • word representations from neural LMs
    – aka distributed word representations
    – aka word embeddings
  • How would you use these word vectors?
  • Turian et al. [2010]
    – word representations as features consistently improve performance of
      • Named-Entity Recognition
      • Text chunking tasks
SLIDE 14

Word2vec [Mikolov et al. 2013] introduces simpler models

https://code.google.com/p/word2vec

SLIDE 15

Word2vec claims

  • Useful representations for NLP applications
  • Can discover relations between words using vector arithmetic: king – male + female = queen (example below)
  • Paper + tool received lots of attention, even outside the NLP research community
  • Try it out at the “word2vec playground”: http://deeplearner.fz-qqq.net/
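To get a feel for the vector-arithmetic claim, a hedged example using gensim's KeyedVectors (the vector file path is illustrative, and the top neighbour is not guaranteed to be queen for every embedding set):

```python
from gensim.models import KeyedVectors

# Pre-trained word2vec vectors (file path is illustrative)
kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin",
                                        binary=True)

# king - male + female ~= ?
print(kv.most_similar(positive=["king", "female"], negative=["male"], topn=3))
```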

SLIDE 16

Demystifying the skip-gram model [Levy & Goldberg, 2014]

Learn word (and context) vector parameters so as to maximize the probability of the training set D of (word, context) pairs. Expensive!!

http://www.cs.bgu.ac.il/~yoavg/publications/negative-sampling.pdf
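Written out (following the notation of the linked note, with v_w and v_c the word and context vectors), the objective and its softmax parameterization are:

$$\arg\max_{\theta} \prod_{(w,c)\in D} p(c \mid w; \theta), \qquad p(c \mid w; \theta) = \frac{e^{v_c \cdot v_w}}{\sum_{c' \in C} e^{v_{c'} \cdot v_w}}$$

The sum over the whole context vocabulary C in the denominator is what makes this expensive.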

SLIDE 17

Toward the training objective for skip-gram

Problem: trivial solution when v_c = v_w and v_c · v_w = K for all v_c, v_w, with K large

http://www.cs.bgu.ac.il/~yoavg/publications/negative-sampling.pdf

SLIDE 18

Final training objective (negative sampling)

  • Word-context pairs observed in data D
  • Word-context pairs not observed in data D’ (artificially generated)

http://www.cs.bgu.ac.il/~yoavg/publications/negative-sampling.pdf
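In the notation of the linked note, the negative-sampling objective is:

$$\arg\max_{\theta} \sum_{(w,c)\in D} \log \sigma(v_c \cdot v_w) \;+\; \sum_{(w,c)\in D'} \log \sigma(-v_c \cdot v_w), \qquad \sigma(x) = \frac{1}{1 + e^{-x}}$$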

SLIDE 19

Skip-gram model [Mikolov et al. 2013]

Predict context words given the current word (i.e., 2(n-1) classifiers for a context window of size n)

Use negative samples at each position
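A stripped-down sketch of the resulting training loop (plain NumPy with per-pair SGD; negatives are drawn uniformly here for simplicity, whereas word2vec draws them from a smoothed unigram distribution, and all names and hyperparameters are illustrative):

```python
import numpy as np

def train_sgns(pairs, vocab_size, dim=50, k=5, lr=0.025, epochs=5, seed=0):
    """Skip-gram with negative sampling, stripped down.
    `pairs` is a list of (word_id, context_id) positives observed in the data."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(vocab_size, dim))   # word vectors
    C = rng.normal(scale=0.1, size=(vocab_size, dim))   # context vectors

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    for _ in range(epochs):
        for w, c in pairs:
            # k "negative" contexts, drawn uniformly for simplicity
            # (a negative may occasionally collide with the true context).
            negatives = rng.integers(0, vocab_size, size=k)
            for ctx, label in [(c, 1.0)] + [(int(n), 0.0) for n in negatives]:
                g = sigmoid(W[w] @ C[ctx]) - label   # d(logistic loss)/d(score)
                grad_w = g * C[ctx]
                C[ctx] -= lr * g * W[w]
                W[w] -= lr * grad_w
    return W   # keep the word vectors (some implementations also use W + C)
```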

SLIDE 20
SLIDE 21

Don’t count, predict!

[Baroni et al. 2014]

“This paper has presented the first systematic comparative evaluation of count and predict vectors. As seasoned distributional semanticists with thorough experience in developing and using count vectors, we set out to conduct this study because we were annoyed by the triumphalist overtones surrounding predict models, despite the almost complete lack of a proper comparison to count vectors.”

SLIDE 22

Don’t count, predict!

[Baroni et al. 2014]

“Our secret wish was to discover that it is all hype, and count vectors are far superior to their predictive counterparts. […] Instead, we found that the predict models are so good that, while the triumphalist overtones still sound excessive, there are very good reasons to switch to the new architecture.”

SLIDE 23

Why does word2vec produce good word representations?

Levy & Goldberg, Apr 2014: “Good question. We don’t really know. The distributional hypothesis states that words in similar contexts have similar meanings. The objective above clearly tries to increase the quantity v_w.v_c for good word-context pairs, and decrease it for bad ones. Intuitively, this means that words that share many contexts will be similar to each other […]. This is, however, very hand-wavy.”
SLIDE 24

Learning skip-gram is almost equivalent to matrix factorization [Levy & Goldberg 2014]

http://www.cs.bgu.ac.il/~yoavg/publications/nips2014pmi.pdf
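Concretely, the paper shows that at its optimum, skip-gram with k negative samples assigns word and context vectors whose dot products reconstruct a shifted PMI matrix:

$$v_w \cdot v_c = \mathrm{PMI}(w, c) - \log k$$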

SLIDE 25

New directions: Summary

  • There are alternative ways to learn distributional representations for word meaning
  • Understanding >> Magic
SLIDE 26

PREDICTING SEMANTIC RELATIONS BETWEEN WORDS

BEYOND SIMILARITY

Slides credit: Peter Turney

SLIDE 27

Recognizing Textual Entailment

  • Sample problem
    – Text: iTunes software has seen strong sales in Europe
    – Hypothesis: Strong sales for iTunes in Europe
    – Task: Does the Text entail the Hypothesis? Yes or No?

SLIDE 28

Recognizing Textual Entailment

  • Sample problem
    – Task: Does the Text entail the Hypothesis? Yes or No?
  • Has emerged as a core task for semantic analysis in NLP
    – subsumes many tasks: Paraphrase Detection, Question Answering, etc.
    – fully text based: does not require committing to a specific semantic representation [Dagan et al. 2013]

SLIDE 29

Recognizing lexical entailment

  • To recognize entailment between sentences, we must first recognize entailment between words
  • Sample problem
    – Text: George was bitten by a dog
    – Hypothesis: George was attacked by an animal

SLIDE 30

Lexical entailment & semantic relations

  • Synonymy: synonyms entail each other
    – firm entails company
  • is-a relations: hyponyms entail hypernyms
    – automaker entails company
  • part-whole relations: it depends
    – government entails minister
    – division does not entail company
  • entailment also covers other relations
    – ocean entails water
    – murder entails death

SLIDE 31
  • We know how to build word vectors that represent word meaning
  • How can we predict entailment using these vectors?

SLIDE 32

Approach 1: context inclusion hypothesis

  • Hypothesis:
    – if a word a tends to occur in a subset of the contexts in which a word b occurs (b contextually includes a)
    – then a (the narrower term) tends to entail b (the broader term)
  • Inspired by formal logic
  • In practice
    – Design an asymmetric real-valued metric to compare word vectors

[Kotlerman, Dagan, et al. 2010]

SLIDE 33

Approach 1: the BalAPinc metric (a complex, hand-crafted metric!)
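The full BalAPinc formula is not reproduced here; as a stand-in, here is a much simpler asymmetric inclusion score in the same spirit (a Weeds-precision-style measure with made-up toy weights), just to show what an asymmetric, real-valued comparison of context vectors looks like:

```python
def directional_inclusion(a_ctx, b_ctx):
    """Simplified asymmetric inclusion score (Weeds-precision style),
    NOT BalAPinc itself: what fraction of a's (weighted) context mass
    is covered by contexts that b also has?  A high score suggests a's
    contexts are largely included in b's, i.e. a may entail b."""
    shared = sum(weight for feat, weight in a_ctx.items() if feat in b_ctx)
    total = sum(a_ctx.values())
    return shared / total if total else 0.0

# Toy PMI-weighted context vectors (values are made up for illustration)
automaker = {"factory": 2.1, "sell": 1.4, "vehicle": 3.0}
company = {"factory": 1.0, "sell": 2.2, "vehicle": 0.5, "profit": 1.8}
print(directional_inclusion(automaker, company))  # high: automaker -> company
print(directional_inclusion(company, automaker))  # lower: the measure is asymmetric
```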

SLIDE 34

Approach 2: context combination hypothesis

  • Hypothesis:
    – The tendency of word a to entail word b is correlated with some learnable function of the contexts in which a occurs, and the contexts in which b occurs
    – Some combinations of contexts tend to block entailment, others tend to allow entailment
  • In practice
    – Binary prediction task
    – Supervised learning from labeled word pairs

[Baroni, Bernardini, Do and Shan, 2012]

SLIDE 35

Approach 3: similarity differences hypothesis

  • Hypothesis
    – The tendency of a to entail b is correlated with some learnable function of the differences in their similarities, sim(a, r) – sim(b, r), to a set of reference words r in R
    – Some differences tend to block entailment, and others tend to allow entailment
  • In practice
    – Binary prediction task
    – Supervised learning from labeled word pairs + reference words (sketched below)

[Turney & Mohammad 2015]
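A hedged sketch of how the features for this approach could be built and fed to an off-the-shelf classifier (the helper names, the choice of logistic regression, and the commented-out training setup are illustrative, not the exact configuration used by Turney & Mohammad):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sim_diff_features(vec, a, b, reference_words):
    """Features for the word pair (a, b): sim(a, r) - sim(b, r) for each
    reference word r.  `vec` maps a word to its embedding (np.ndarray)."""
    def cos(x, y):
        return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
    return np.array([cos(vec[a], vec[r]) - cos(vec[b], vec[r])
                     for r in reference_words])

# Supervised training on labeled word pairs (embeddings and labels not shown):
# X = np.vstack([sim_diff_features(vec, a, b, refs) for (a, b) in train_pairs])
# y = train_labels                      # 1 if a entails b, else 0
# clf = LogisticRegression().fit(X, y)
# clf.predict(sim_diff_features(vec, "automaker", "company", refs).reshape(1, -1))
```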

SLIDE 36

Approach 3: similarity differences hypothesis

SLIDE 37

Evaluation: test set 1/3 (KDSZ)

SLIDE 38

Evaluation: test set 2/3 (JMTH)

SLIDE 39

Evaluation: test set 3/3 (BBDS)

SLIDE 40

Evaluation

[Turney & Mohammad 2015]

SLIDE 41

Lessons from lexical entailment task

  • The distributional hypothesis can be refined and put to use in various ways to detect relations between words, beyond the concept of similarity
  • Combination of unsupervised similarity + supervised learning is powerful

SLIDE 42

RECAP

SLIDE 43

Today

  • A glimpse into recent research
  • New models for learning word representations
    – From “count”-based models (e.g., LSA)
    – to “prediction”-based models (e.g., word2vec)
    – … and back
  • Beyond semantic similarity
    – Learning lexical entailment

Next topics

  • multiword expressions & predicate argument structure

SLIDE 44

References

  • Don’t count, predict! [Baroni et al. 2014] http://clic.cimec.unitn.it/marco/publications/acl2014/baroni-etal-countpredict-acl2014.pdf
  • Word2vec explained [Goldberg & Levy 2014] http://www.cs.bgu.ac.il/~yoavg/publications/negative-sampling.pdf
  • Neural Word Embeddings as Implicit Matrix Factorization [Levy & Goldberg 2014] http://www.cs.bgu.ac.il/~yoavg/publications/nips2014pmi.pdf
  • Experiments with Three Approaches to Recognizing Lexical Entailment [Turney & Mohammad 2015] http://arxiv.org/abs/1401.8269