More Distributional Semantics: New Models & Applications
CMSC 723 / LING 723 / INST 725 MARINE CARPUAT
marine@cs.umd.edu
Last week
– Q: What is understanding meaning?
– A: Meaning is knowing when words are similar or not
– Word similarity
– Thesaurus-based methods
– Distributional word representations
– Dimensionality reduction
“You shall know a word by the company it keeps!” (Firth, 1957)
“Difference of meaning correlates with difference of distribution.” (Harris, 1954)
PMI association:
\[
\mathrm{assoc}_{\mathrm{PMI}}(w, f) = \log_2 \frac{P(w, f)}{P(w)\, P(f)}
\]
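As a quick illustration, PMI can be computed directly from co-occurrence counts; the counts below are hypothetical, purely for the example:

```python
import math

def pmi(count_wf, count_w, count_f, total):
    """PMI(w, f) = log2( P(w, f) / (P(w) P(f)) ), from raw counts.

    count_wf: co-occurrences of word w with feature f
    count_w, count_f: marginal counts; total: total number of events
    """
    p_wf = count_wf / total
    p_w = count_w / total
    p_f = count_f / total
    return math.log2(p_wf / (p_w * p_f))

# Hypothetical counts: w and f co-occur 20 times out of 10,000 events.
print(pmi(count_wf=20, count_w=100, count_f=50, total=10_000))  # ~5.32
```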
Cosine similarity:
\[
\mathrm{sim}_{\mathrm{cosine}}(\vec{v}, \vec{w}) = \frac{\vec{v} \cdot \vec{w}}{|\vec{v}|\,|\vec{w}|} = \frac{\sum_{i=1}^{N} v_i w_i}{\sqrt{\sum_{i=1}^{N} v_i^2}\, \sqrt{\sum_{i=1}^{N} w_i^2}}
\]
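The same formula in a few lines of numpy (toy vectors for illustration):

```python
import numpy as np

def cosine_sim(v, w):
    """Cosine of the angle between v and w: dot product over the two norms."""
    return float(v @ w) / (np.linalg.norm(v) * np.linalg.norm(w))

v = np.array([1.0, 2.0, 0.0])
w = np.array([2.0, 4.0, 1.0])
print(cosine_sim(v, w))  # ~0.976: nearly parallel vectors
```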
A Neural Probabilistic Language Model. Bengio et al., JMLR 2003
– a.k.a. distributed word representations
– a.k.a. word embeddings
– word representations used as features consistently improve performance of many NLP tasks
– Useful representations for NLP applications
– Can discover relations between words using vector arithmetic: king – male + female = queen (see the sketch below)
– The paper and tool received lots of attention, even outside the NLP research community
– Try it out at the “word2vec playground”: http://deeplearner.fz-qqq.net/
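A minimal sketch of the vector-arithmetic trick using the gensim library (the vector file name is a placeholder; any pretrained word2vec-format vectors would do):

```python
from gensim.models import KeyedVectors

# Load pretrained vectors in word2vec format (placeholder file name).
vectors = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

# king - male + female: "positive" words are added, "negative" ones subtracted.
print(vectors.most_similar(positive=["king", "female"], negative=["male"], topn=3))
# With a well-trained model, "queen" is typically among the top results.
```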
Word embeddings
– Learn word and context vector parameters so as to maximize the probability of the training set D of (word, context) pairs
– Expensive!! (the softmax normalizes over the entire vocabulary)
http://www.cs.bgu.ac.il/~yoavg/publications/negative-sampling.pdf
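Following the Goldberg & Levy note linked above, the objective and its per-pair softmax can be written as:

\[
\arg\max_{\theta} \prod_{(w,c) \in D} p(c \mid w; \theta),
\qquad
p(c \mid w; \theta) = \frac{e^{v_c \cdot v_w}}{\sum_{c' \in C} e^{v_{c'} \cdot v_w}}
\]

The denominator sums over every possible context c′, which is what makes this expensive.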
Problem: if we instead just maximize σ(v_c · v_w) over observed pairs (dropping the expensive normalization), there is a trivial solution: set V_c = V_w and v_c · v_w = K for all v_c, v_w, with a large enough K.
Solution: train to distinguish two sets of pairs
– D: word–context pairs observed in the data
– D′: word–context pairs NOT in the data (artificially generated)
http://www.cs.bgu.ac.il/~yoavg/publications/negative-sampling.pdf
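The negative-sampling objective from the same note, with σ the logistic function, pushes observed pairs in D apart from artificial pairs in D′:

\[
\arg\max_{\theta} \prod_{(w,c) \in D} \sigma(v_c \cdot v_w) \prod_{(w,c) \in D'} \sigma(-v_c \cdot v_w)
\]

The D′ term rules out the trivial all-vectors-identical solution above.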
Predict context words given the current word (i.e., 2(n-1) classifiers for a context window of size n)
Use negative samples at each position
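A minimal numpy sketch of one training step for skip-gram with negative sampling (an illustrative simplification, not the actual word2vec code; the array names and function are hypothetical):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(W, C, w_id, c_id, neg_ids, lr=0.025):
    """One SGD step: raise sigma(v_c . v_w) for the observed pair,
    lower sigma(v_n . v_w) for each sampled negative context n.

    W: (V, d) word vectors; C: (V, d) context vectors (updated in place).
    """
    v_w = W[w_id]
    # Observed pair: gradient of -log sigma(v_c . v_w).
    g = sigmoid(C[c_id] @ v_w) - 1.0
    grad_w = g * C[c_id]
    C[c_id] = C[c_id] - lr * g * v_w
    # Negative samples: gradient of -log sigma(-v_n . v_w).
    for n_id in neg_ids:
        g = sigmoid(C[n_id] @ v_w)
        grad_w = grad_w + g * C[n_id]
        C[n_id] = C[n_id] - lr * g * v_w
    W[w_id] = W[w_id] - lr * grad_w

# Usage sketch: initialize W randomly and C to zeros, then call sgns_step
# for each observed (word, context) pair with a few sampled negative ids.
```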
“This paper has presented the first systematic comparative evaluation of count and predict vectors. As seasoned distributional semanticists with thorough experience in developing and using count vectors, we set out to conduct this study because we were annoyed by the triumphalist overtones often surrounding predict models, despite the almost complete lack of a proper comparison to count vectors.”
[Baroni et al. 2014]
Why does this produce good word representations?
Levy & Goldberg, Apr 2014: “Good question. We don’t really know. The distributional hypothesis states that words in similar contexts have similar meanings. The objective above clearly tries to increase the quantity v_w.v_c for good word-context pairs, and decrease it for bad ones. Intuitively, this means that words that share many contexts will be similar to each other.”
http://www.cs.bgu.ac.il/~yoavg/publications/nips2014pmi.pdf
BEYOND SIMILARITY
Slide credits: Peter Turney
Recognizing textual entailment
– Text: iTunes software has seen strong sales in Europe
– Hypothesis: Strong sales for iTunes in Europe
– Task: does the Text entail the Hypothesis? Yes or No?
– Subsumes many tasks: paraphrase detection, question answering, etc.
– Fully text-based: does not require committing to a specific semantic representation
[Dagan et al. 2013]
– Text: George was bitten by a dog
– Hypothesis: George was attacked by an animal
Lexical entailment:
– firm entails company
– automaker entails company
– government entails minister
– division does not entail company
– murder entails death
Contextual inclusion hypothesis:
– if a word a tends to occur in a subset of the contexts in which a word b occurs (b contextually includes a)
– then a (the narrower term) tends to entail b (the broader term)
– Approach: design an asymmetric real-valued measure to compare word vectors (see the sketch below)
[Kotlerman, Dagan, et al. 2010]
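To make the asymmetry concrete, here is a simplified, hypothetical weighted-inclusion score in the spirit of this line of work (not the actual balAPinc measure from the paper):

```python
def weighted_inclusion(a_contexts, b_contexts):
    """How much of a's context weight is covered by b's contexts.

    a_contexts, b_contexts: dicts mapping context features to non-negative
    weights (e.g. positive PMI). Returns a score in [0, 1]; asymmetric by
    design: weighted_inclusion(a, b) != weighted_inclusion(b, a) in general.
    """
    total = sum(a_contexts.values())
    if total == 0:
        return 0.0
    covered = sum(w for f, w in a_contexts.items() if f in b_contexts)
    return covered / total
```

A high score for (a, b) but a low score for (b, a) suggests a's contexts are included in b's, i.e. a is the narrower term.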
Context combination hypothesis:
– The tendency of word a to entail word b is correlated with some learnable function of the contexts in which a occurs and the contexts in which b occurs
– Some combinations of contexts tend to block entailment, others tend to allow it
– Binary prediction task
– Supervised learning from labeled word pairs (see the sketch below)
[Baroni, Bernardi, Do and Shan, 2012]
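A hedged sketch of this kind of supervised setup; concatenating the two words' vectors is one plausible featurization, not necessarily the paper's exact choice:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def pair_features(vec_a, vec_b):
    # One plausible featurization: concatenate the two words' vectors.
    return np.concatenate([vec_a, vec_b])

def train_entailment_classifier(labeled_pairs):
    """labeled_pairs: iterable of (vec_a, vec_b, label), label 1 iff a entails b."""
    X = np.array([pair_features(a, b) for a, b, _ in labeled_pairs])
    y = np.array([label for _, _, label in labeled_pairs])
    return LogisticRegression(max_iter=1000).fit(X, y)
```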
Similarity differences hypothesis:
– The tendency of a to entail b is correlated with some learnable function of the differences in their similarities, sim(a, r) – sim(b, r), to a set of reference words r in R
– Some differences tend to block entailment, and others tend to allow it
– Binary prediction task
– Supervised learning from labeled word pairs + reference words (see the sketch below)
[Turney & Mohammad 2015]
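This hypothesis suggests a different featurization for the same supervised setup; a hypothetical sketch (the reference vectors for R are assumed given):

```python
import numpy as np

def sim_diff_features(vec_a, vec_b, reference_vecs):
    """One feature per reference word r: sim(a, r) - sim(b, r)."""
    def cos(u, v):
        return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.array([cos(vec_a, r) - cos(vec_b, r) for r in reference_vecs])

# The resulting features feed the same kind of binary classifier as in the
# previous sketch.
```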
References
– Don’t count, predict! [Baroni et al. 2014] http://clic.cimec.unitn.it/marco/publications/acl2014/baroni-etal-countpredict-acl2014.pdf
– word2vec Explained [Goldberg & Levy 2014] http://www.cs.bgu.ac.il/~yoavg/publications/negative-sampling.pdf
– Neural Word Embeddings as Implicit Matrix Factorization [Levy & Goldberg 2014] http://www.cs.bgu.ac.il/~yoavg/publications/nips2014pmi.pdf
– Experiments with Three Approaches to Recognizing Lexical Entailment [Turney & Mohammad 2015] http://arxiv.org/abs/1401.8269