SLIDE 1

Moving Down the Long Tail of Word Sense Disambiguation with Gloss Informed Bi-encoders

Terra Blevins and Luke Zettlemoyer

SLIDE 4

Context: "The plant sprouted a new leaf."
Target word: plant
Candidate senses:
  • (n) (botany) a living organism...
  • (n) buildings for carrying on industrial labor
  • (v) to put or set (a seed or plant) into the ground
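
The candidate senses and their glosses come from the WordNet sense inventory. A quick way to reproduce this kind of candidate list, assuming NLTK and its WordNet data are installed (the slides do not specify any tooling), is:

```python
# Print every WordNet sense of "plant" with its part of speech and gloss.
from nltk.corpus import wordnet as wn

for synset in wn.synsets("plant"):
    print(f"({synset.pos()}) {synset.definition()}")

# The output includes the three glosses shown on the slide, e.g.:
#   (n) buildings for carrying on industrial labor
#   (n) (botany) a living organism lacking the power of locomotion
#   (v) put or set (seeds, seedlings, or plants) into the ground
```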

SLIDE 8

Data Sparsity in WSD

  • Senses have a Zipfian distribution in natural language
  • Data imbalance leads to worse performance on uncommon senses
  • We propose an approach to improve performance on rare senses with pretrained models and glosses

[Figure: EWISE results, showing a 62.3 F1 point gap between performance on common and uncommon senses]

Kilgarriff (2004), How dominant is the commonest sense of a word? Kumar et al. (2019), Zero-shot Word Sense Disambiguation using Sense Definition Embeddings.
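
As a back-of-the-envelope illustration of how a Zipfian sense distribution concentrates training data on the top sense (the sense count and exponent below are arbitrary choices for illustration, not estimates from SemCor):

```python
# With 5 senses whose frequencies follow a Zipf law with exponent 1, the most
# frequent sense alone covers about 44% of mentions and the rarest under 9%,
# so a supervised WSD model sees very few examples of the tail senses.
n_senses = 5
weights = [1.0 / rank for rank in range(1, n_senses + 1)]
total = sum(weights)
for rank, w in enumerate(weights, start=1):
    print(f"sense {rank}: {w / total:.1%} of mentions")
```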

SLIDE 11

Incorporating Glosses into WSD Models

  • Lexical overlap between the context and the gloss is a successful knowledge-based approach (Lesk, 1986); a minimal sketch follows this list
  • Neural models integrate glosses by:

○ Adding glosses as additional inputs into the WSD model (Luo et al., 2018a,b)

○ Mapping encoded gloss representations onto graph embeddings to be used as labels for a WSD model (Kumar et al., 2019)
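
A minimal sketch of the overlap heuristic behind Lesk-style approaches (a simplified re-implementation for illustration, not the original algorithm; the sense IDs are made up):

```python
# Pick the sense whose gloss shares the most tokens with the context.
def simplified_lesk(context_tokens, candidate_glosses):
    """candidate_glosses: dict mapping a sense id to its gloss string."""
    context = {tok.lower() for tok in context_tokens}
    best_sense, best_overlap = None, -1
    for sense, gloss in candidate_glosses.items():
        overlap = len(context & set(gloss.lower().split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

glosses = {
    "plant%factory": "buildings for carrying on industrial labor",
    "plant%organism": "a living organism lacking the power of locomotion",
    "plant%sow": "put or set a seed or plant into the ground",
}
# Raw overlap is brittle: on this sentence it favors the "sow" sense
# (its gloss shares "the", "a", and "plant"), not the organism sense.
print(simplified_lesk("The plant sprouted a new leaf".split(), glosses))
```

This brittleness is part of the motivation for learning gloss representations inside neural models, as in the approaches above.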

SLIDE 13

Pretrained Models for WSD

  • Simple probing classifiers on frozen pretrained representations have been found to perform better than models without pretraining
  • GlossBERT finetunes BERT on WSD with glosses by setting it up as a sentence-pair classification task (sketched below)

Hadiwinoto et al. (2019), Improved word sense disambiguation using pretrained contextualized representations. Huang et al. (2019), GlossBERT: BERT for word sense disambiguation with gloss knowledge.
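
A rough sketch of that sentence-pair framing, assuming the HuggingFace transformers tokenizer; the sense IDs and gold label are illustrative, and the strongest GlossBERT variant additionally highlights the target word, which this sketch omits:

```python
# Every (context, candidate gloss) pair becomes one binary classification
# example: label 1 for the gold sense's gloss, 0 for all other candidates.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

context = "The plant sprouted a new leaf."
candidate_glosses = {
    "plant%organism": "a living organism lacking the power of locomotion",
    "plant%factory": "buildings for carrying on industrial labor",
    "plant%sow": "put or set a seed or plant into the ground",
}
gold_sense = "plant%organism"

pairs = []
for sense, gloss in candidate_glosses.items():
    encoded = tokenizer(context, gloss, truncation=True)  # [CLS] context [SEP] gloss [SEP]
    pairs.append((encoded, 1 if sense == gold_sense else 0))
# A standard BERT sequence-pair classifier is finetuned on these pairs; at
# test time the candidate gloss with the highest positive score is predicted.
```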

SLIDE 16

Our Approach: Gloss Informed Bi-encoder

  • Two encoders independently encode the context and the gloss, aligning the target word embedding to the correct sense embedding (a minimal code sketch follows this list)
  • The encoders are initialized with BERT and trained end-to-end, without external knowledge
  • The bi-encoder is more computationally efficient than a cross-encoder
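
A minimal sketch of the bi-encoder scoring with PyTorch and HuggingFace transformers; the subword pooling, batching, and training loop are simplified and do not necessarily match the released implementation (linked on the final slide):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
context_encoder = AutoModel.from_pretrained("bert-base-uncased")  # encodes sentences
gloss_encoder = AutoModel.from_pretrained("bert-base-uncased")    # encodes sense glosses

@torch.no_grad()
def score_senses(context, target_word, glosses):
    """Dot-product score between the target word's contextual embedding
    and the [CLS] embedding of each candidate gloss."""
    ctx = tokenizer(context, return_tensors="pt")
    ctx_out = context_encoder(**ctx).last_hidden_state[0]          # (seq_len, hidden)
    # Locate the target word's first subword in the context (simplified pooling).
    target_id = tokenizer(target_word, add_special_tokens=False)["input_ids"][0]
    position = (ctx["input_ids"][0] == target_id).nonzero()[0, 0]
    target_vec = ctx_out[position]                                  # (hidden,)

    gls = tokenizer(glosses, return_tensors="pt", padding=True)
    gloss_vecs = gloss_encoder(**gls).last_hidden_state[:, 0]       # (n_senses, hidden)
    return gloss_vecs @ target_vec                                  # one score per candidate

scores = score_senses(
    "The plant sprouted a new leaf.",
    "plant",
    ["a living organism lacking the power of locomotion",
     "buildings for carrying on industrial labor",
     "put or set a seed or plant into the ground"],
)
print(scores.softmax(dim=0))  # distribution over candidate senses (untrained here)
```

During training the scores over a word's candidate senses are normalized with a softmax and both encoders are finetuned end-to-end against the gold sense; because any gloss can be encoded at test time, the same scoring applies to senses never observed during training.
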
SLIDE 21

Baselines and Prior Work

Model                    Glosses?  Pretraining?  Source
HCAN                     ✓                       Luo et al., 2018a
EWISE                    ✓                       Kumar et al., 2019
BERT Probe                         ✓             Ours
GLU                                ✓             Hadiwinoto et al., 2019
LMMS                     ✓         ✓             Loureiro and Jorge, 2019
SVC                                ✓             Vial et al., 2019
GlossBERT                ✓         ✓             Huang et al., 2019
Bi-encoder Model (BEM)   ✓         ✓             Ours

SLIDE 28

Overall WSD Performance

[Bar chart: overall F1 scores. MFS baseline: 65.5; prior systems and baselines: 71.1, 71.8, 73.7, 74.1, 75.4, 75.6, 77.0; BEM: 79.0]

SLIDE 32

Performance by Sense Frequency

[Bar chart: F1 on the most frequent senses (MFS): 94.9, 93.5, 94.1; F1 on less frequent senses (LFS): 37.0, 31.2, 52.6]

BEM gains come almost entirely from LFS

SLIDE 33

Zero-shot Evaluation

  • The BEM can represent new, unseen senses with the gloss encoder and can encode unseen words with the context encoder (see the sketch after this list)
  • The probe baseline relies on a WordNet back-off, predicting the most common sense of unseen words as indicated in WordNet
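
For instance, reusing the score_senses sketch from the approach section, an unseen word can be scored against its WordNet glosses on the fly (the word "bass" here is an arbitrary illustration, and without finetuning the prediction is not meaningful):

```python
from nltk.corpus import wordnet as wn

context = "He caught a huge bass near the dock."
target = "bass"
senses = wn.synsets(target)  # candidate senses, even if never seen in training
scores = score_senses(context, target, [s.definition() for s in senses])
best = senses[scores.argmax().item()]
print(best.name(), "-", best.definition())
```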

SLIDE 35

Zero-shot Evaluation

[Bar chart: F1 on zero-shot words – 84.9, 91.0, 91.2; F1 on zero-shot senses – 53.6, 68.9]

SLIDE 38

Few-shot Learning of WSD

Train the BEM (and the frozen probe baseline) on a subset of SemCor containing (up to) k examples of each sense; a sketch of building such a subset follows.

The BEM at k=5 reaches performance similar to the baseline trained on the full dataset
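
A sketch of building such a subset; the example format (context, target position, sense id) is an assumption for illustration and is not taken from the released data loaders:

```python
import random
from collections import defaultdict

def kshot_subset(examples, k, seed=0):
    """Keep at most k training examples per sense.

    examples: iterable of (context, target_position, sense_id) tuples.
    """
    rng = random.Random(seed)
    by_sense = defaultdict(list)
    for ex in examples:
        by_sense[ex[2]].append(ex)
    subset = []
    for sense_examples in by_sense.values():
        rng.shuffle(sense_examples)
        subset.extend(sense_examples[:k])  # up to k examples of this sense
    return subset
```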

SLIDE 41

Takeaways

  • The BEM improves over the BERT probe baseline and prior approaches to using (1) sense definitions and (2) pretrained models for WSD

  • Gains stem from better performance on less common and unseen senses

Questions?

https://github.com/facebookresearch/wsd-biencoders
blvns@cs.washington.edu