SLIDE 1

context2vec: Learning Generic Context Embedding with Bidirectional LSTM

Oren Melamud, Jacob Goldberger, Ido Dagan

CoNLL, 2016

SLIDE 2

What context is

They robbed the _bank_ last night.

  • Target: bank
  • Sentential context: They robbed the [ ] last night.

SLIDE 3

What context representations are used for

  • Sentence completion

IBM [ ] this company for 100 million dollars.

  • Word sense disambiguation

They robbed the _bank_ last night.

  • Named entity recognition

I can’t find _April_.

  • More: supersense tagging, coreference resolution, ...

SLIDE 8

What we want from context representations

  • Information on the target slot/word
  • Contextual information ≠ sum of context words

Similar context words, different contextual information

  • IBM [ ] this company for 100 million dollars.
  • IBM bought this company for [ ] million dollars.

Different context words, similar contextual information

  • IBM [ ] this company for 100 million dollars.
  • I [ ] this necklace for my wife’s birthday.

  • Context representation ≠ sentence representation

SLIDE 13

Our work

  • Our goal
    • Sentential context representations
    • More value than sum of words
    • Unsupervised generic learning setting
  • Our model
    • context2vec = word2vec - CBOW + biLSTM
  • We show
    • context2vec >> average of word embeddings
    • context2vec ∼ state-of-the-art (more complex models)
  • Toolkit available for your NLP application

SLIDE 17

Background

SLIDE 18

Popular recent context representations

[Figure: diagrams of popular context representations, annotated “loses word order”, “Limited scope”, and “Variable-size”]

SLIDE 21

Supervised biLSTM with pre-trained word embeddings

  • Word order captured with biLSTM
  • Task-specific training
  • Supervision is limited in size
  • Pre-trained word embeddings carry valuable information from large corpora
  • Can we bring even more information?

[Figure: supervised biLSTM NER tagging architecture (Lample et al., 2016)]

SLIDE 24

Model

SLIDE 25

Baseline architecture: word2vec with CBOW

[Figure: word2vec CBOW on “John had [ submitted ] a paper”: the context word embeddings in the window are averaged into a context embedding and scored against the target word embedding for “submitted”]

  • Objective function:

S = ∑_{(t,c) ∈ PAIRS} [ log σ(c_avg · t) + ∑_{t′ ∈ NEGS(t,c)} log σ(−c_avg · t′) ]

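The objective above maps directly onto a few lines of code. Below is a minimal NumPy sketch of a single term of the sum (the embedding matrices, sizes, and word indices are toy values, not the paper's implementation): the context-window embeddings are averaged, then scored against the observed target and against a handful of sampled negative targets.

```python
# Toy NumPy sketch of the CBOW negative-sampling objective above.
# Embedding matrices, indices, and sizes are illustrative, not from the paper.
import numpy as np

rng = np.random.default_rng(0)
V, D = 1000, 100                                   # vocabulary size, embedding dimension
target_emb = rng.normal(scale=0.1, size=(V, D))    # target word embeddings (t)
context_emb = rng.normal(scale=0.1, size=(V, D))   # context word embeddings

def log_sigmoid(x):
    return -np.logaddexp(0.0, -x)                  # numerically stable log(sigmoid(x))

def cbow_term(target_id, window_ids, negative_ids):
    """log sigma(c_avg . t)  +  sum over negatives of log sigma(-c_avg . t')"""
    c_avg = context_emb[window_ids].mean(axis=0)   # averaged context embedding
    positive = log_sigmoid(c_avg @ target_emb[target_id])
    negatives = log_sigmoid(-(target_emb[negative_ids] @ c_avg)).sum()
    return positive + negatives

# One (target, context) pair with 5 sampled negative targets:
print(cbow_term(target_id=42, window_ids=[3, 17, 256, 99],
                negative_ids=rng.integers(0, V, size=5)))
```
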
SLIDE 26

context2vec = word2vec - CBOW + biLSTM

[Figure: side-by-side comparison on “John had [ submitted ] a paper”: word2vec CBOW averages the context-window word embeddings, while context2vec runs a bidirectional LSTM over the full sentential context and feeds its output to an MLP; both are trained against the target word embedding for “submitted”]

SLIDE 27

Learning architecture: context2vec

[Figure: context2vec learning architecture on “John had [ submitted ] a paper”: a left-to-right LSTM reads the words before the target slot, a right-to-left LSTM reads the words after it, and an MLP combines their outputs into a sentential context embedding that is trained against the target word embedding]

  • Objective function:

S = ∑_{(t,c) ∈ PAIRS} [ log σ(c_c2v · t) + ∑_{t′ ∈ NEGS(t,c)} log σ(−c_c2v · t′) ]

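To make the architecture concrete, here is an illustrative PyTorch sketch of a context2vec-style encoder; the layer sizes, names, and tiny example are assumptions, not the released toolkit. A left-to-right LSTM reads the words before the target slot, a right-to-left LSTM reads the words after it, and an MLP maps the concatenated states to a sentential context embedding that is scored against target embeddings under the same negative-sampling objective.

```python
# Illustrative PyTorch sketch of a context2vec-style context encoder.
# Dimensions, names, and the toy example are assumptions, not the released toolkit.
import torch
import torch.nn as nn

class Context2VecEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, hidden_dim=600):
        super().__init__()
        self.ctx_emb = nn.Embedding(vocab_size, emb_dim)             # context word embeddings
        self.tgt_emb = nn.Embedding(vocab_size, emb_dim)             # target word embeddings
        self.l2r = nn.LSTM(emb_dim, hidden_dim, batch_first=True)    # reads the left context
        self.r2l = nn.LSTM(emb_dim, hidden_dim, batch_first=True)    # reads the right context, reversed
        self.mlp = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, emb_dim),
        )

    def context_vector(self, left_ids, right_ids):
        """Embed the sentential context around one target slot."""
        _, (h_l, _) = self.l2r(self.ctx_emb(left_ids))                      # left-to-right LSTM
        _, (h_r, _) = self.r2l(self.ctx_emb(torch.flip(right_ids, [1])))    # right-to-left LSTM
        return self.mlp(torch.cat([h_l[-1], h_r[-1]], dim=-1))              # sentential context embedding

    def scores(self, left_ids, right_ids, target_ids):
        """Dot products between the context vector and candidate target embeddings."""
        c = self.context_vector(left_ids, right_ids)                        # (batch, emb_dim)
        return (self.tgt_emb(target_ids) * c.unsqueeze(1)).sum(-1)          # (batch, n_candidates)

# "John had [ submitted ] a paper": left = "John had", right = "a paper" (toy ids)
enc = Context2VecEncoder(vocab_size=1000)
s = enc.scores(torch.tensor([[11, 12]]), torch.tensor([[13, 14]]), torch.tensor([[42, 7, 99]]))
print(s.shape)  # torch.Size([1, 3])
```
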
SLIDE 28

The context2vec embedding space

[Figure: target words (e.g. “bought”, “acquired”, “technology”) and sentential contexts (e.g. “IBM [ ] this company”, “I [ ] this necklace for my wife’s birthday”, “IBM bought this [ ] company”) embedded in a shared space; t2c denotes target-to-context similarity]

SLIDE 29

The context2vec embedding space

[Figure: the same shared embedding space; c2c denotes context-to-context similarity and t2t denotes target-to-target similarity]

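Because target words and sentential contexts live in the same space, all three similarity types reduce to cosine similarity. A small NumPy sketch with made-up vectors standing in for learned embeddings:

```python
# Toy NumPy sketch of the three similarity types in the shared space.
# The vectors stand in for learned context2vec embeddings and are made up.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(1)
t_bought   = rng.normal(size=300)   # target word embedding
t_acquired = rng.normal(size=300)   # target word embedding
c_ibm      = rng.normal(size=300)   # context vector for "IBM [ ] this company"
c_necklace = rng.normal(size=300)   # context vector for "I [ ] this necklace for my wife's birthday"

print("t2c:", cosine(t_bought, c_ibm))        # target vs. sentential context
print("c2c:", cosine(c_ibm, c_necklace))      # context vs. context
print("t2t:", cosine(t_bought, t_acquired))   # target vs. target
```
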
SLIDE 30

Evaluation & Results

SLIDE 31

Evaluation goals

  • Standalone evaluation of context2vec
  • Using simple cosine similarity measures


SLIDE 32

Tasks: Sentence completion

I have seen it on him, and could [ ] to it.

Candidates: write, migrate, climb, swear, contribute

  • Implementation: Shortest target-context cosine distance
  • Benchmark: Microsoft sentence completion challenge (Zweig and Burges, 2011)


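This task, like the lexical substitution task on the next slide, reduces to ranking candidate target words by their cosine similarity to the context vector of the blank. A toy NumPy sketch (the context vector and target embeddings below are made-up stand-ins):

```python
# Toy sketch: pick the completion whose target embedding is closest to the context vector.
# The embeddings and candidate list are illustrative, not trained values.
import numpy as np

rng = np.random.default_rng(2)
candidates = ["write", "migrate", "climb", "swear", "contribute"]
target_emb = {w: rng.normal(size=300) for w in candidates}  # stand-ins for learned target embeddings
context_vec = rng.normal(size=300)  # stand-in for c2v("I have seen it on him, and could [ ] to it.")

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

ranked = sorted(candidates, key=lambda w: cosine(context_vec, target_emb[w]), reverse=True)
print(ranked[0], ranked)  # best completion first
```
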
SLIDE 33

Tasks: Lexical substitution

Charlie is a _bright_ boy.

Substitute candidates: skilled, luminous, vivid, hopeful, smart

  • Implementation: Rank by target-context cosine distance
  • Benchmarks:
    • Lexical sample (McCarthy and Navigli, 2007)
    • All-words (Kremer et al., 2014)

SLIDE 34

Tasks: Supervised word sense disambiguation

TEST: This adds a wider perspective.

TRAIN:
  • They add (s2) a touch of humor.
  • The minister added (s4): the process remains fragile.

  • Implementation: Shortest context-context cosine distance (kNN)
  • Benchmark: Senseval-3 English lexical sample (Mihalcea et al., 2004)

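A minimal sketch of the kNN scheme above, assuming every labelled training sentence has already been encoded into a context vector; the vectors and sense labels below are toy stand-ins:

```python
# Toy sketch of supervised WSD by nearest-neighbour search over context vectors.
# Vectors and sense labels are illustrative stand-ins for real c2v encodings.
import numpy as np
from collections import Counter

rng = np.random.default_rng(3)
train_vecs = rng.normal(size=(6, 300))               # c2v vectors of labelled training contexts
train_senses = ["s2", "s2", "s2", "s4", "s4", "s1"]  # gold sense of "add" in each training sentence
test_vec = rng.normal(size=300)                      # c2v vector of "This adds a wider perspective."

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

k = 3
nearest = sorted(range(len(train_senses)),
                 key=lambda i: cosine(test_vec, train_vecs[i]), reverse=True)[:k]
predicted = Counter(train_senses[i] for i in nearest).most_common(1)[0][0]
print(predicted)  # majority sense among the k nearest training contexts
```
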
SLIDE 37

Results

Task                  c2v    Avg*
Sentence completion   65.1   49.7
LexSub (sample)       56.0   42.5
LexSub (all-words)    47.9   38.9
WSD                   72.8   61.4

* Avg baseline:

  • Based on standard Skip-gram word embeddings
  • Hyperparameters optimized: context size, word weights

SLIDE 38

Results

Task                  c2v    SOTA
Sentence completion   65.1   58.9*
LexSub (sample)       56.0   55.1
LexSub (all-words)    47.9   50.2
WSD                   72.8   74.1

  • * A better recent result is 69.2 (Tran et al., 2016)

Note: SOTA models are generally more complex than c2v

SLIDE 39

Conclusions

SLIDE 40

Summary

  • context2vec = word2vec - CBOW + biLSTM
  • context2vec >> average-of-word-embeddings
  • context2vec ∼ SOTA

Appealing alternative for generic context representation

SLIDE 44

Try context2vec yourself

  • The context2vec Python toolkit is available at:

https://github.com/orenmel/context2vec

  • Integrate it into your NLP application
    • With our pre-trained models
    • Or learn your own (choose corpus, dimensionality, etc.)
  • Potentially more effective than, or complementary to, pre-trained word embeddings

THANK YOU!

SLIDE 48

Backup slides

SLIDE 49

Qualitative example

Sentential context → closest target words:

  • This [ ] is due → item, fact-sheet, offer, pack, card
  • This [ ] is due not just to mere luck → offer, suggestion, announcement, item, prize
  • This [ ] is due not just to mere luck, but to outstanding work and dedication → award, prize, turnabout, offer, gift
  • [ ] is due not just to mere luck, but to outstanding work and dedication → it, success, this, victory, prize-money

SLIDE 50

Target-context example: bias towards rare words

Closest target words for the context “John was [ ] last year”:

  • α = 0.25: born, late, married, out, back
  • α = 0.50: born, back, married, released, elected
  • α = 0.75: born, interviewed, re-elected
  • α = 1.00: starstruck, goal-less, unwed

  • α is the negative sampling hyperparameter
  • Larger α values bias the closest target words towards rarer words


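Assuming α is the exponent of the smoothed unigram distribution used to draw negative samples, as in word2vec (the slide only calls it the negative sampling hyperparameter, so this is an assumption), the effect is easy to see: larger α draws frequent words as negatives more often, which pushes them away from context vectors and leaves rarer words as the closest targets.

```python
# Hedged sketch: assuming alpha is the exponent of the smoothed unigram
# distribution used to draw negative samples (as in word2vec); the counts
# below are made up for illustration.
import numpy as np

counts = np.array([50000, 5000, 500, 50, 5], dtype=float)  # word frequencies, common -> rare

def negative_sampling_dist(counts, alpha):
    """p(w) proportional to count(w) ** alpha."""
    p = counts ** alpha
    return p / p.sum()

for alpha in (0.25, 0.5, 0.75, 1.0):
    # Larger alpha -> frequent words drawn as negatives more often,
    # so they are pushed away from contexts and rarer words end up closer.
    print(alpha, np.round(negative_sampling_dist(counts, alpha), 3))
```
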
SLIDE 51

Context-context example

Query: Furthermore our work in Uganda and Romania [ adds ] a wider perspective.

context2vec closest:
  • Themes in art have a fascination, since they [ add ] a subject interest ...
  • Richard is joining us every month to pass on tips, and [ add ] a touch of humour too.

Average closest:
  • The foreign ministers said reforms in Poland and Hungary had made considerable progress but [ added ]: ...
  • Germany had announced the solution [ adding ] that it hoped Bonn in future ...

  • Sentences from Senseval-3 (shortened for readability).
