More Distributional Semantics: New Models & Applications (CMSC 723 / LING 723 / INST 725) - PowerPoint PPT Presentation



SLIDE 1

More Distributional Semantics: New Models & Applications

CMSC 723 / LING 723 / INST 725 MARINE CARPUAT

marine@cs.umd.edu

SLIDE 2

Last week…

  • Q: what is understanding meaning?
  • A: meaning is knowing when words are similar or not

  • Topics
    – Word similarity
    – Thesaurus-based methods
    – Distributional word representations
    – Dimensionality reduction

SLIDE 3

Today

  • New models for learning word representations
    – From “count”-based models (e.g., LSA)
    – to “prediction”-based models (e.g., word2vec)
    – … and back
  • Beyond semantic similarity
    – Learning semantic relations between words
SLIDE 4

DISTRIBUTIONAL MODELS OF WORD MEANING

SLIDE 5

Distributional Approaches: Intuition

  • “You shall know a word by the company it keeps!” (Firth, 1957)
  • “Differences of meaning correlates with differences of distribution” (Harris, 1970)
SLIDE 6

Context Features

  • Word co-occurrence within a window:
  • Grammatical relations:
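As a concrete illustration of the first feature type, here is a minimal sketch (toy corpus, function name, and window size are all illustrative) of collecting window-based co-occurrence counts:

```python
from collections import Counter, defaultdict

def window_cooccurrences(sentences, window=2):
    """Count how often each context word appears within +/- `window`
    positions of each target word."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        for i, target in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[target][tokens[j]] += 1
    return counts

corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["the", "dog", "sat", "on", "the", "rug"]]
vectors = window_cooccurrences(corpus)
print(vectors["sat"])   # e.g. Counter({'the': 4, 'on': 2, 'cat': 1, 'dog': 1})
```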
SLIDE 7

Association Metric

  • Commonly-used metric: Pointwise Mutual

Information

  • Can be used as a feature value or by itself

$$\text{assoc}_{\text{PMI}}(w, f) = \log_2 \frac{P(w, f)}{P(w)\,P(f)}$$
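A small sketch of computing PMI from such co-occurrence counts (it assumes the dict-of-Counters built in the window sketch above; the helper name is illustrative):

```python
import math

def pmi(cooc, w, f):
    """Pointwise mutual information of word w and context feature f,
    estimated from a dict-of-Counters of co-occurrence counts."""
    total = sum(sum(ctx.values()) for ctx in cooc.values())
    p_wf = cooc[w][f] / total
    p_w = sum(cooc[w].values()) / total
    p_f = sum(ctx[f] for ctx in cooc.values()) / total
    if p_wf == 0:
        return float("-inf")   # unseen pair; PPMI variants clip this to 0
    return math.log2(p_wf / (p_w * p_f))
```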

SLIDE 8

Computing Similarity

  • Semantic similarity boils down to computing some measure on context vectors
  • Cosine distance: borrowed from information retrieval

$$\text{sim}_{\text{cosine}}(\vec{v}, \vec{w}) = \frac{\vec{v} \cdot \vec{w}}{|\vec{v}|\,|\vec{w}|} = \frac{\sum_{i=1}^{N} v_i w_i}{\sqrt{\sum_{i=1}^{N} v_i^2}\,\sqrt{\sum_{i=1}^{N} w_i^2}}$$
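In code, cosine similarity is just a length-normalized dot product; a minimal sketch assuming dense NumPy context vectors:

```python
import numpy as np

def cosine_sim(v, w):
    """Cosine of the angle between two context vectors."""
    return float(np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w)))
```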

SLIDE 9

Dimensionality Reduction with Latent Semantic Analysis
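A minimal sketch of the LSA step, using a plain truncated SVD in NumPy (the function name and k are illustrative; in practice the input is usually a weighted, e.g. tf-idf or PPMI, term-context matrix):

```python
import numpy as np

def lsa(term_context_matrix, k=100):
    """Reduce a |V| x |C| matrix to k latent dimensions via truncated SVD."""
    U, S, Vt = np.linalg.svd(term_context_matrix, full_matrices=False)
    return U[:, :k] * S[:k]   # rows are k-dimensional word vectors
```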

SLIDE 10

NEW DIRECTIONS: PREDICT VS. COUNT MODELS

SLIDE 11

Word vectors as a byproduct of language modeling

A neural probabilistic Language Model. Bengio et al. JMLR 2003
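The idea, very roughly: train an n-gram language model with an embedding layer and keep the learned embedding table as the word vectors. A much-simplified PyTorch-style sketch (layer names and sizes are illustrative; the original model also has direct input-to-output connections, omitted here):

```python
import torch
import torch.nn as nn

class NeuralLM(nn.Module):
    """Very rough Bengio-style n-gram neural LM; the learned `embed`
    table is the byproduct kept as word vectors."""
    def __init__(self, vocab_size, dim=100, context=3, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.hidden = nn.Linear(context * dim, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, context_ids):                  # shape: (batch, context)
        x = self.embed(context_ids).flatten(1)       # concatenate context embeddings
        return self.out(torch.tanh(self.hidden(x)))  # scores over the next word
```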

SLIDE 12
SLIDE 13

Using neural word representations in NLP

  • word representations from neural LMs
    – aka distributed word representations
    – aka word embeddings
  • How would you use these word vectors?
  • Turian et al. [2010]
    – word representations as features consistently improve performance of
      • Named-Entity Recognition
      • Text chunking tasks
SLIDE 14

Word2vec [Mikolov et al. 2013] introduces simpler models

https://code.google.com/p/word2vec

SLIDE 15

Word2vec claims

  • Useful representations for NLP applications
  • Can discover relations between words using vector arithmetic: king – male + female = queen (example below)
  • Paper + tool received lots of attention, even outside the NLP research community
  • Try it out at the “word2vec playground”: http://deeplearner.fz-qqq.net/
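To get a feel for the vector-arithmetic claim, a hedged example using gensim's KeyedVectors (the vector file path is illustrative, and the top neighbour is not guaranteed to be queen for every embedding set):

```python
from gensim.models import KeyedVectors

# Pre-trained word2vec vectors (file path is illustrative)
kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin",
                                        binary=True)

# king - male + female ~= ?
print(kv.most_similar(positive=["king", "female"], negative=["male"], topn=3))
```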

SLIDE 16

Demystifying the skip-gram model [Levy & Goldberg, 2014]

Learn word (and context) vector parameters so as to maximize the probability of the training set D of (word, context) pairs. Expensive!!

http://www.cs.bgu.ac.il/~yoavg/publications/negative-sampling.pdf
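Written out (following the notation of the linked note, with v_w and v_c the word and context vectors), the objective and its softmax parameterization are:

$$\arg\max_{\theta} \prod_{(w,c)\in D} p(c \mid w; \theta), \qquad p(c \mid w; \theta) = \frac{e^{v_c \cdot v_w}}{\sum_{c' \in C} e^{v_{c'} \cdot v_w}}$$

The sum over the whole context vocabulary C in the denominator is what makes this expensive.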

SLIDE 17

Toward the training objective for skip-gram

Problem: trivial solution when v_c = v_w and v_c · v_w = K for all v_c, v_w, with K large

http://www.cs.bgu.ac.il/~yoavg/publications/negative-sampling.pdf

SLIDE 18

Final training objective (negative sampling)

  • Word-context pairs observed in data D
  • Word-context pairs not observed in data D’ (artificially generated)

http://www.cs.bgu.ac.il/~yoavg/publications/negative-sampling.pdf
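In the notation of the linked note, the negative-sampling objective is:

$$\arg\max_{\theta} \sum_{(w,c)\in D} \log \sigma(v_c \cdot v_w) \;+\; \sum_{(w,c)\in D'} \log \sigma(-v_c \cdot v_w), \qquad \sigma(x) = \frac{1}{1 + e^{-x}}$$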

SLIDE 19

Skip-gram model [Mikolov et al. 2013]

Predict context words given the current word (i.e., 2(n-1) classifiers for a context window of size n)

Use negative samples at each position
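A stripped-down sketch of the resulting training loop (plain NumPy with per-pair SGD; negatives are drawn uniformly here for simplicity, whereas word2vec draws them from a smoothed unigram distribution, and all names and hyperparameters are illustrative):

```python
import numpy as np

def train_sgns(pairs, vocab_size, dim=50, k=5, lr=0.025, epochs=5, seed=0):
    """Skip-gram with negative sampling, stripped down.
    `pairs` is a list of (word_id, context_id) positives observed in the data."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(vocab_size, dim))   # word vectors
    C = rng.normal(scale=0.1, size=(vocab_size, dim))   # context vectors

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    for _ in range(epochs):
        for w, c in pairs:
            # k "negative" contexts, drawn uniformly for simplicity
            # (a negative may occasionally collide with the true context).
            negatives = rng.integers(0, vocab_size, size=k)
            for ctx, label in [(c, 1.0)] + [(int(n), 0.0) for n in negatives]:
                g = sigmoid(W[w] @ C[ctx]) - label   # d(logistic loss)/d(score)
                grad_w = g * C[ctx]
                C[ctx] -= lr * g * W[w]
                W[w] -= lr * grad_w
    return W   # keep the word vectors (some implementations also use W + C)
```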

SLIDE 20
SLIDE 21

Don’t count, predict!

[Baroni et al. 2014]

“This paper has presented the first systematic comparative evaluation of count and predict vectors. As seasoned distributional semanticists with thorough experience in developing and using count vectors, we set out to conduct this study because we were annoyed by the triumphalist overtones surrounding predict models, despite the almost complete lack of a proper comparison to count vectors.”

SLIDE 22

Don’t count, predict!

[Baroni et al. 2014]

“Our secret wish was to discover that it is all hype, and count vectors are far superior to their predictive counterparts. […] Instead, we found that the predict models are so good that, while the triumphalist overtones still sound excessive, there are very good reasons to switch to the new architecture.”

SLIDE 23

Why does word2vec produce good word representations?

Levy & Goldberg, Apr 2014: “Good question. We don’t really know. The distributional hypothesis states that words in similar contexts have similar meanings. The objective above clearly tries to increase the quantity v_w.v_c for good word-context pairs, and decrease it for bad ones. Intuitively, this means that words that share many contexts will be similar to each other […]. This is, however, very hand-wavy.”
SLIDE 24

Learning skip-gram is almost equivalent to matrix factorization [Levy & Goldberg 2014]

http://www.cs.bgu.ac.il/~yoavg/publications/nips2014pmi.pdf
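Concretely, the paper shows that at its optimum, skip-gram with k negative samples assigns word and context vectors whose dot products reconstruct a shifted PMI matrix:

$$v_w \cdot v_c = \mathrm{PMI}(w, c) - \log k$$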

SLIDE 25

New directions: Summary

  • There are alternative ways to learn distributional representations for word meaning
  • Understanding >> Magic
SLIDE 26

PREDICTING SEMANTIC RELATIONS BETWEEN WORDS

BEYOND SIMILARITY

Slides credit: Peter Turney

SLIDE 27

Recognizing Textual Entailment

  • Sample problem
    – Text: iTunes software has seen strong sales in Europe
    – Hypothesis: Strong sales for iTunes in Europe
    – Task: Does the Text entail the Hypothesis? Yes or No?

SLIDE 28

Recognizing Textual Entailment

  • Sample problem
    – Task: Does the Text entail the Hypothesis? Yes or No?
  • Has emerged as a core task for semantic analysis in NLP
    – subsumes many tasks: Paraphrase Detection, Question Answering, etc.
    – fully text based: does not require committing to a specific semantic representation [Dagan et al. 2013]

SLIDE 29

Recognizing lexical entailment

  • To recognize entailment between sentences, we must first recognize entailment between words
  • Sample problem
    – Text: George was bitten by a dog
    – Hypothesis: George was attacked by an animal

SLIDE 30

Lexical entailment & semantic relations

  • Synonymy: synonyms entail each other
    – firm entails company
  • is-a relations: hyponyms entail hypernyms
    – automaker entails company
  • part-whole relations: it depends
    – government entails minister
    – division does not entail company
  • entailment also covers other relations
    – ocean entails water
    – murder entails death

SLIDE 31
  • We know how to build word vectors that represent word meaning
  • How can we predict entailment using these vectors?

SLIDE 32

Approach 1: context inclusion hypothesis

  • Hypothesis:
    – if a word a tends to occur in a subset of the contexts in which a word b occurs (b contextually includes a)
    – then a (the narrower term) tends to entail b (the broader term)
  • Inspired by formal logic
  • In practice
    – Design an asymmetric real-valued metric to compare word vectors

[Kotlerman, Dagan, et al. 2010]

SLIDE 33

Approach 1: the BalAPinc metric (a complex, hand-crafted metric!)
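The full BalAPinc formula is not reproduced here; as a stand-in, here is a much simpler asymmetric inclusion score in the same spirit (a Weeds-precision-style measure with made-up toy weights), just to show what an asymmetric, real-valued comparison of context vectors looks like:

```python
def directional_inclusion(a_ctx, b_ctx):
    """Simplified asymmetric inclusion score (Weeds-precision style),
    NOT BalAPinc itself: what fraction of a's (weighted) context mass
    is covered by contexts that b also has?  A high score suggests a's
    contexts are largely included in b's, i.e. a may entail b."""
    shared = sum(weight for feat, weight in a_ctx.items() if feat in b_ctx)
    total = sum(a_ctx.values())
    return shared / total if total else 0.0

# Toy PMI-weighted context vectors (values are made up for illustration)
automaker = {"factory": 2.1, "sell": 1.4, "vehicle": 3.0}
company = {"factory": 1.0, "sell": 2.2, "vehicle": 0.5, "profit": 1.8}
print(directional_inclusion(automaker, company))  # high: automaker -> company
print(directional_inclusion(company, automaker))  # lower: the measure is asymmetric
```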

SLIDE 34

Approach 2: context combination hypothesis

  • Hypothesis:
    – The tendency of word a to entail word b is correlated with some learnable function of the contexts in which a occurs, and the contexts in which b occurs
    – Some combinations of contexts tend to block entailment, others tend to allow entailment
  • In practice
    – Binary prediction task
    – Supervised learning from labeled word pairs

[Baroni, Bernardini, Do and Shan, 2012]

SLIDE 35

Approach 3: similarity differences hypothesis

  • Hypothesis
    – The tendency of a to entail b is correlated with some learnable function of the differences in their similarities, sim(a, r) – sim(b, r), to a set of reference words r in R
    – Some differences tend to block entailment, and others tend to allow entailment
  • In practice
    – Binary prediction task
    – Supervised learning from labeled word pairs + reference words (sketched below)

[Turney & Mohammad 2015]
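A hedged sketch of how the features for this approach could be built and fed to an off-the-shelf classifier (the helper names, the choice of logistic regression, and the commented-out training setup are illustrative, not the exact configuration used by Turney & Mohammad):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sim_diff_features(vec, a, b, reference_words):
    """Features for the word pair (a, b): sim(a, r) - sim(b, r) for each
    reference word r.  `vec` maps a word to its embedding (np.ndarray)."""
    def cos(x, y):
        return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
    return np.array([cos(vec[a], vec[r]) - cos(vec[b], vec[r])
                     for r in reference_words])

# Supervised training on labeled word pairs (embeddings and labels not shown):
# X = np.vstack([sim_diff_features(vec, a, b, refs) for (a, b) in train_pairs])
# y = train_labels                      # 1 if a entails b, else 0
# clf = LogisticRegression().fit(X, y)
# clf.predict(sim_diff_features(vec, "automaker", "company", refs).reshape(1, -1))
```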

SLIDE 36

Approach 3: similarity differences hypothesis

SLIDE 37

Evaluation: test set 1/3 (KDSZ)

SLIDE 38

Evaluation: test set 2/3 (JMTH)

SLIDE 39

Evaluation: test set 3/3 (BBDS)

SLIDE 40

Evaluation

[Turney & Mohammad 2015]

SLIDE 41

Lessons from lexical entailment task

  • The distributional hypothesis can be refined and put to use in various ways to detect relations between words, beyond the concept of similarity
  • Combination of unsupervised similarity + supervised learning is powerful

SLIDE 42

RECAP

SLIDE 43

Today

  • A glimpse into recent research
  • New models for learning word representations
    – From “count”-based models (e.g., LSA)
    – to “prediction”-based models (e.g., word2vec)
    – … and back
  • Beyond semantic similarity
    – Learning lexical entailment

Next topics

  • multiword expressions & predicate argument structure

SLIDE 44

References

  • Don’t count, predict! [Baroni et al. 2014] http://clic.cimec.unitn.it/marco/publications/acl2014/baroni-etal-countpredict-acl2014.pdf
  • Word2vec explained [Goldberg & Levy 2014] http://www.cs.bgu.ac.il/~yoavg/publications/negative-sampling.pdf
  • Neural Word Embeddings as Implicit Matrix Factorization [Levy & Goldberg 2014] http://www.cs.bgu.ac.il/~yoavg/publications/nips2014pmi.pdf
  • Experiments with Three Approaches to Recognizing Lexical Entailment [Turney & Mohammad 2015] http://arxiv.org/abs/1401.8269