SLIDE 1

Retrofitting Contextualized Word Embeddings with Paraphrases

Weijia Shi1*, Muhao Chen1*, Pei Zhou2, Kai-Wei Chang1

1University of California, Los Angeles 2University of Southern California

SLIDE 2

Contextualized Word Embeddings

Representations that consider the differences in lexical semantics under different linguistic contexts. Such representations have become the backbone of many state-of-the-art NLU systems for

  • Sentence classification, textual inference, QA, EDL, NMT, SRL, …
SLIDE 3

Contextualized Word Embeddings

Aggregating context information into a word vector with a pre-trained deep neural language model. Key benefits:

  • More refined semantic representations of lexemes
  • Automatically capturing polysemy
  • Apples have been grown for thousands of years in Asia and Europe.
  • With that market capacity, Apple is worth over 1% of the world's GDP.

[Figure: embedding space with two separate points for "Apple", one per context]

SLIDE 4

The Paraphrased Context Problem

Pre-trained language models are not aware of the semantic relatedness of contexts: the same word in two paraphrased contexts can be embedded farther apart than two opposite words in unrelated contexts.

L2 distances by ELMo:

  • "How can I make bigger my arms?" vs. "How do I make my arms bigger?" (paraphrases): 6.42
  • "Some people believe earth is flat, why?" vs. "Why do people still believe in flat earth?" (paraphrases): 7.59
  • "It is a very small window." vs. "I have a large suitcase." (unrelated contexts): 5.44
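The distances in the table above are L2 distances between a word's contextualized vectors in the two sentences. A minimal sketch of that computation, using toy 3-d vectors in place of real ELMo outputs (which are 1024-dimensional and require the pre-trained model):

```python
import numpy as np

def contextual_distance(vec_a, vec_b):
    """L2 distance between two contextualized embeddings of the same word."""
    return float(np.linalg.norm(np.asarray(vec_a, dtype=float) - np.asarray(vec_b, dtype=float)))

# Illustrative stand-ins for the shared word's embeddings in two paraphrases.
emb_in_sent1 = [0.2, 1.1, -0.5]
emb_in_sent2 = [0.9, 0.3, 0.4]
print(contextual_distance(emb_in_sent1, emb_in_sent2))
```

With real ELMo vectors, a large value here for paraphrased contexts is exactly the oversensitivity the next slide quantifies.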

SLIDE 5

The Paraphrased Context Problem

Consider ELMo distances of the same words (excluding stop words) in paraphrased sentence pairs from MRPC:

[Chart: ELMo encoding of shared words in MRPC paraphrases; the distance between a shared word's two contextualized embeddings exceeds d(good, bad) in 28.30% of cases and d(big, small) in 41.50% of cases]

Contextualization can be oversensitive to paraphrasing, which can further impair sentence

SLIDE 6

Outline

  • Background
  • Paraphrase-aware retrofitting
  • Evaluation
  • Future Work
SLIDE 7

Paraphrase-aware Retrofitting (PAR)

Method

  • An orthogonal transformation M to retrofit the input space
  • Minimizing the variance of word representations on paraphrased contexts
  • Without compromising the varying representations on unrelated contexts

Orthogonal constraint:

Keeping the relative distances of the raw embeddings before contextualization
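The orthogonality requirement is what guarantees this distance-preservation property: an orthogonal matrix M leaves all pairwise L2 distances of the raw input embeddings unchanged. A quick numerical check (illustrative only, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a random orthogonal matrix M via QR decomposition.
A = rng.standard_normal((8, 8))
M, _ = np.linalg.qr(A)

# Two raw (pre-contextualization) word vectors.
x = rng.standard_normal(8)
y = rng.standard_normal(8)

# An orthogonal transform preserves the L2 distance between them.
before = np.linalg.norm(x - y)
after = np.linalg.norm(M @ x - M @ y)
print(np.isclose(before, after))  # True
```

So PAR can pull a word's paraphrased-context representations together without distorting the geometry of the input embedding space.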

SLIDE 8

Paraphrase-aware Retrofitting (PAR)

Learning objective

Loss function. Input:

  • Paraphrase 1: What is prison life like?
  • Paraphrase 2: How is life in prison?
  • Negative sample: I have life insurance.

$$\mathcal{L}(\mathbf{M}) = \sum_{(T_1, T_2) \in Q} \; \sum_{x \in T_1 \cap T_2} \big[\, d_{T_1, T_2}(\mathbf{M}\mathbf{x}) + \delta - d_{\tilde{T}_1, \tilde{T}_2}(\mathbf{M}\mathbf{x}) \,\big]_+ \; + \; \mu\, \Omega(\mathbf{M})$$

where $d_{T_1, T_2}(\mathbf{M}\mathbf{x})$ is the distance between the contextualized embeddings of the shared word $x$ in paraphrases $T_1, T_2$, $\tilde{T}_1, \tilde{T}_2$ are negative-sample contexts, $\delta$ is a margin, and $\mu\,\Omega(\mathbf{M})$ is the orthogonal-constraint term.

Intuition: the shared words in paraphrases should be embedded closer than those in non-paraphrases.

Orthogonal constraint
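A minimal NumPy sketch of this hinge-style objective. The function name, array shapes, and the Frobenius-norm form of the orthogonality penalty are assumptions for illustration, and for simplicity M is applied directly to the embeddings being compared rather than to the contextualizer's inputs:

```python
import numpy as np

def par_style_loss(M, pos_pairs, neg_pairs, delta=1.0, mu=1.0):
    """Hinge loss pushing a shared word's embeddings in paraphrases closer
    than in negative contexts, plus an orthogonality penalty on M.

    pos_pairs, neg_pairs: arrays of shape (n, 2, d), each row holding the two
    embeddings of a shared word in a paraphrase / non-paraphrase context pair.
    """
    # Distances after the transform: (x1 - x2) @ M.T equals M @ (x1 - x2).
    pos_d = np.linalg.norm((pos_pairs[:, 0] - pos_pairs[:, 1]) @ M.T, axis=1)
    neg_d = np.linalg.norm((neg_pairs[:, 0] - neg_pairs[:, 1]) @ M.T, axis=1)
    # Hinge: paraphrase distance should be at least delta below negative distance.
    hinge = np.maximum(0.0, pos_d + delta - neg_d).sum()
    # Orthogonal constraint as a Frobenius-norm penalty (one common choice).
    orth = np.linalg.norm(M.T @ M - np.eye(M.shape[0]))
    return hinge + mu * orth
```

With M = I, identical paraphrase embeddings, and a sufficiently distant negative pair, the hinge is inactive and the loss is zero; training would adjust M whenever paraphrase distances are not comfortably below negative-context distances.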

SLIDE 9

Experiment Settings

Paraphrase pair datasets

  • The positive training cases of MRPC (2,753 pairs)
  • Sampled Quora (20,000 pairs) and PAN (5,000 pairs)

Tasks

  • Sentence classification: MPQA, MR, CR, SST-2
  • Textual inference: MRPC, SICK-E
  • Sentence relatedness scoring: SICK-R, STS-15, STS-16, STS-

Benchmark

  • Adversarial SQuAD

* The first three categories of tasks follow the settings in SentEval [Conneau et al., 2018].

SLIDE 10

Text Classification/Inference/Relatedness Tasks

PAR improves the performance of ELMo by

  • 2.59-4.21% accuracy on sentence classification tasks
  • 2.60-3.30% accuracy on textual inference tasks
  • 3-5% Pearson correlation on text similarity tasks

PAR improves ELMo on sentence representation tasks.

[Chart: comparison of ELMo w/ and w/o PAR on three SentEval tasks: SST-2 (accuracy), STS-Benchmark (Pearson's ρ), SICK-E (accuracy)]

SLIDE 11

Adversarial SQuAD

PAR improves the robustness of a downstream QA model against adversarial examples.

Bi-Directional Attention Flow (BiDAF) [Seo et al., 2017] on two challenge settings:

  • AddOneSent: adds one human-approved adversarial sentence
  • AddSent: adds one adversarial sentence that is semantically similar to the question

[Chart: F1 scores (%) w/ and w/o PAR]
  • AddOneSent: ELMo-BiDAF 53.7, ELMo-PAR-BiDAF 57.9
  • AddSent: ELMo-BiDAF 41.7, ELMo-PAR-BiDAF 47.1

SLIDE 12

Word Representations

[Chart: average distances of shared words in MRPC test-set sentence pairs before (ELMo, all layers) and after (ELMo-PAR) applying PAR, for paraphrase vs. non-paraphrase pairs]

PAR minimizes the differences of a word’s representations in paraphrased contexts and preserves the differences in non-paraphrased contexts.

SLIDE 13

Future Work

Applying PAR to other contextualized embedding models. Enriching contextualized word embeddings with linguistic knowledge:

  • Context-simplicity-aware embeddings
  • Incorporating lexical definitions into the word contextualization process

SLIDE 14

Thank You
