On Evaluation of Adversarial Perturbations for Sequence-to-Sequence Models (PowerPoint presentation)

SLIDE 1

On Evaluation of Adversarial Perturbations for Sequence-to-Sequence Models

Paul Michel, Xian Li, Graham Neubig, Juan Pino

SLIDE 2

Adversarial Attacks/Perturbations

Apply a small (indistinguishable) perturbation to the input that elicits large changes in the output

SLIDE 3

Adversarial Attacks/Perturbations

Apply a small (indistinguishable) perturbation to the input that elicits large changes in the output

Figure from Goodfellow et al. (2014)

SLIDE 7

Indistinguishable Perturbations

Small perturbations are well defined in vision

○ Small l2 distance ≈ indistinguishable to the human eye

[Figure axis: l2 distance]

SLIDE 8

Indistinguishable Perturbations

Small perturbations are well defined in vision

○ Small l2 distance ≈ indistinguishable to the human eye

What about text?

[Figure axis: l2 distance]

SLIDE 9

Not all Text Perturbations are Equal

He’s very friendly

SLIDE 10

Not all Text Perturbations are Equal

He’s very friendly

○ He’s pretty friendly [Similar meaning] ✔

SLIDE 11

Not all Text Perturbations are Equal

He’s very friendly

○ He’s pretty friendly [Similar meaning] ✔
○ He’s very annoying [Different meaning] ❌

SLIDE 12

Not all Text Perturbations are Equal

He’s very friendly

○ He’s pretty friendly [Similar meaning] ✔
○ He’s very annoying [Different meaning] ❌
○ He’s She friendly [Nonsensical] ❌

SLIDE 13

Not all Text Perturbations are Equal

He’s very friendly

○ He’s pretty friendly [Similar meaning] ✔
○ He’s very annoying [Different meaning] ❌
○ He’s She friendly [Nonsensical] ❌
○ He’s very freindly [Typo] ✔

SLIDE 14

Not all Text Perturbations are Equal

He’s very friendly

○ He’s pretty friendly [Similar meaning] ✔
○ He’s very annoying [Different meaning] ❌
○ He’s She friendly [Nonsensical] ❌
○ He’s very freindly [Typo] ✔

⇒ Can’t expect the model to produce the same output!

SLIDE 15

Not all Text Perturbations are Equal

He’s very friendly

○ He’s pretty friendly [Similar meaning] ✔
○ He’s very annoying [Different meaning] ❌
○ He’s She friendly [Nonsensical] ❌
○ He’s very freindly [Typo] ✔

⇒ Can’t expect the model to produce the same output!

This paper: why and how you should evaluate adversarial perturbations

SLIDE 16

A Framework for Evaluating Adversarial Attacks

SLIDE 17

Problem Definition

Original: Ils le réinvestissent directement en engageant plus de procès.

Reference: They plow it right back into filing more troll lawsuits.

SLIDE 19

Problem Definition

Original: Ils le réinvestissent directement en engageant plus de procès.

Base output: They direct it directly by engaging more cases.

Reference: They plow it right back into filing more troll lawsuits.

SLIDE 20

Problem Definition

Evaluate

Original: Ils le réinvestissent directement en engageant plus de procès.

Base output: They direct it directly by engaging more cases.

Reference: They plow it right back into filing more troll lawsuits.

SLIDE 21

Problem Definition

Evaluate Attack

Original: Ils le réinvestissent directement en engageant plus de procès.

Adv. src: Ilss le réinvestissent dierctement en engagaent plus de procès.

Base output: They direct it directly by engaging more cases.

Reference: They plow it right back into filing more troll lawsuits.

SLIDE 22

Problem Definition

Evaluate Attack

Original: Ils le réinvestissent directement en engageant plus de procès.

Adv. src: Ilss le réinvestissent dierctement en engagaent plus de procès.

Base output: They direct it directly by engaging more cases.

Adv. output: .. de plus.

Reference: They plow it right back into filing more troll lawsuits.

SLIDE 23

Problem Definition

Evaluate Attack Evaluate too!

Original: Ils le réinvestissent directement en engageant plus de procès.

Adv. src: Ilss le réinvestissent dierctement en engagaent plus de procès.

Base output: They direct it directly by engaging more cases.

Adv. output: .. de plus.

Reference: They plow it right back into filing more troll lawsuits.

SLIDE 24

Source Side Evaluation

Evaluate meaning preservation on the source side with s_src, where s_src is a similarity metric such that

s_src(“He’s pretty friendly”, “He’s very friendly”) > s_src(“He’s very annoying”, “He’s very friendly”)

s_src(“He’s pretty friendly”, “He’s very friendly”) > s_src(“He’s She friendly”, “He’s very friendly”)

[...]

SLIDE 25

Target Side Evaluation

Given s_tgt, a similarity metric on the target side

SLIDE 26

Target Side Evaluation

Given s_tgt, a similarity metric on the target side

Evaluate relative meaning destruction on the target side
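The formula for relative meaning destruction did not survive in this transcript. A minimal sketch of one reading, assuming it is the relative drop in target similarity clipped at zero (the function and parameter names are hypothetical, chosen for illustration):

```python
def relative_decrease(sim_base: float, sim_adv: float) -> float:
    """Relative drop in target-side similarity, clipped at 0 (assumed form).

    sim_base: similarity between the reference and the output for the original input.
    sim_adv:  similarity between the reference and the output for the perturbed input.
    """
    # No destruction if the base output was already worthless
    # or the adversarial output is at least as good.
    if sim_base <= 0.0 or sim_adv >= sim_base:
        return 0.0
    return (sim_base - sim_adv) / sim_base
```

Normalizing by the base score means an attack is not credited for translations that were already poor before the perturbation.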

SLIDE 32

Successful Adversarial Attacks

Ensure that:

SLIDE 33

Successful Adversarial Attacks

Ensure that:

Source meaning destruction

SLIDE 34

Successful Adversarial Attacks

Ensure that:

Target meaning destruction > Source meaning destruction

SLIDE 35

Successful Adversarial Attacks

Ensure that:

Target meaning destruction > Source meaning destruction

Destroy the meaning on the target side more than on the source side
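The slide's formula is missing from this transcript. One way to write the condition is shown below. All symbol names are assumptions chosen for illustration: x and x̂ are the original and perturbed source, y is the reference, y_M and ŷ_M are the model outputs for x and x̂, s_src is the source similarity (in [0, 1]) and d_tgt is the relative target meaning destruction.

```latex
\underbrace{d_{\mathrm{tgt}}(y, y_M, \hat{y}_M)}_{\text{target meaning destruction}}
\;>\;
\underbrace{1 - s_{\mathrm{src}}(x, \hat{x})}_{\text{source meaning destruction}}
```

Read this way, a perturbation that destroys the source meaning entirely (s_src = 0) can never count as successful, however bad the resulting output is.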

SLIDE 36

Which similarity metric to use?

Human evaluation

○ 6 point scale, details in paper

“How would you rate the similarity between the meaning of these two sentences?”

0. The meaning is completely different or one of the sentences is meaningless
1. The topic is the same but the meaning is different
2. Some key information is different
3. The key information is the same but the details differ
4. Meaning is essentially the same but some expressions are unnatural
5. Meaning is essentially equal and the two sentences are well-formed [Language]

SLIDE 37

Which similarity metric to use?

Human evaluation

○ 6 point scale, details in paper

BLEU [Papineni et al., 2002]

○ Geometric mean of n-gram precision + length penalty

“How would you rate the similarity between the meaning of these two sentences?”

0. The meaning is completely different or one of the sentences is meaningless
1. The topic is the same but the meaning is different
2. Some key information is different
3. The key information is the same but the details differ
4. Meaning is essentially the same but some expressions are unnatural
5. Meaning is essentially equal and the two sentences are well-formed [Language]

SLIDE 38

Which similarity metric to use?

Human evaluation

○ 6 point scale, details in paper

BLEU [Papineni et al., 2002]

○ Geometric mean of n-gram precision + length penalty

METEOR [Banerjee and Lavie, 2005]

○ Word matching taking into account stemming, synonyms, paraphrases...

“How would you rate the similarity between the meaning of these two sentences?”

0. The meaning is completely different or one of the sentences is meaningless
1. The topic is the same but the meaning is different
2. Some key information is different
3. The key information is the same but the details differ
4. Meaning is essentially the same but some expressions are unnatural
5. Meaning is essentially equal and the two sentences are well-formed [Language]

SLIDE 39

Which similarity metric to use?

Human evaluation

○ 6 point scale, details in paper

BLEU [Papineni et al., 2002]

○ Geometric mean of n-gram precision + length penalty

METEOR [Banerjee and Lavie, 2005]

○ Word matching taking into account stemming, synonyms, paraphrases...

chrF [Popović, 2015]

○ Character n-gram F-score

“How would you rate the similarity between the meaning of these two sentences?”

0. The meaning is completely different or one of the sentences is meaningless
1. The topic is the same but the meaning is different
2. Some key information is different
3. The key information is the same but the details differ
4. Meaning is essentially the same but some expressions are unnatural
5. Meaning is essentially equal and the two sentences are well-formed [Language]
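To make "character n-gram F-score" concrete, here is a simplified sketch of a chrF-style score. It is not the reference implementation: dropping whitespace, averaging over n = 1..6 and using beta = 2 are assumptions made for illustration, and `simple_chrf` is a hypothetical name.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams, ignoring spaces (a simplification)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str,
                max_n: int = 6, beta: float = 2.0) -> float:
    """F-beta score over character n-gram precision and recall, in [0, 1]."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one string is too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

Because it works on characters, a score of this kind degrades gently under typos: "He's very freindly" stays much closer to "He's very friendly" than "He's very annoying" does.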

SLIDE 40

Experimental Setting

SLIDE 41

Data and Models

Data

○ IWSLT 2016 dataset
○ {Czech, German, French} → English

Models

○ LSTM based model
○ Transformer based model
○ Both word and sub-word based models

SLIDE 42

Gradient Based Adversarial Attacks on Text

Idea: Back propagate through the model to score possible substitutions

Le gros chien .

SLIDE 43

Gradient Based Adversarial Attacks on Text

Idea: Back propagate through the model to score possible substitutions

[Diagram: “Le gros chien .” → Encoder]

SLIDE 44

Gradient Based Adversarial Attacks on Text

Idea: Back propagate through the model to score possible substitutions

[Diagram: the Encoder reads “Le gros chien .”; the Decoder takes “The big dog .” as input and outputs “The big dog . <eos>”]

SLIDE 45

Gradient Based Adversarial Attacks on Text

Idea: Back propagate through the model to score possible substitutions

[Diagram: the Encoder reads “Le gros chien .”; the Decoder takes “The big dog .” as input and outputs “The big dog . <eos>”]

Adversarial loss

SLIDE 47

Gradient Based Adversarial Attacks on Text

Idea: Back propagate through the model to score possible substitutions

[Diagram: the Encoder reads “Le gros chien .”; the Decoder takes “The big dog .” as input and outputs “The big dog . <eos>”]

Adversarial loss

Candidate substitutions: ... chat ... petit ... un ...

SLIDE 48

Constrained Adversarial Attacks

SLIDE 49

Constrained Adversarial Attacks: kNN

Only replace words with one of their 10 nearest neighbors in embedding space

Example from our fr→en Transformer source embeddings

○ grand (tall SING+MASC)
  ■ grands (tall PL+MASC)
  ■ grande (tall SING+FEM)
  ■ grandes (tall PL+FEM)
  ■ gros (fat SING+MASC)
  ■ grosse (fat SING+FEM)
○ math (math)
  ■ maths (maths)
  ■ mathématique (mathematic)
  ■ mathématiques (mathematics)
  ■ objective (objective [ADJ] SING+FEM)
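A minimal sketch of the kNN constraint: a word's candidate set is limited to its k closest vocabulary entries. The slide does not say which distance is used, so cosine similarity is an assumption here. The function names and the toy 2-d vectors in the usage note are also hypothetical.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def allowed_substitutions(embeddings: dict[str, list[float]],
                          word: str, k: int = 10) -> list[str]:
    """Return the k vocabulary words closest to `word` in embedding space."""
    query = embeddings[word]
    others = [w for w in embeddings if w != word]  # never propose the word itself
    others.sort(key=lambda w: cosine(query, embeddings[w]), reverse=True)
    return others[:k]
```

With made-up vectors such as `{"grand": [1.0, 0.0], "grands": [0.9, 0.1], "gros": [0.7, 0.7], "math": [0.0, 1.0]}`, calling `allowed_substitutions(toy, "grand", k=2)` keeps "grands" and "gros" and excludes "math".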

SLIDE 50

Constrained Adversarial Attacks: CharSwap

Only swap word-internal characters to get OOVs

○ grand → grnad
○ adversarial → advresarial
○ [...]

If that’s impossible, repeat the last character

○ he → heeeeeee

⇒ Realistic typos
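The CharSwap rule can be sketched as follows. Two details are assumptions not stated on the slide: swaps are between adjacent characters (which matches both examples), and the fallback repeats the last character until the result is out-of-vocabulary. `char_swap` is a hypothetical name.

```python
import random


def char_swap(word: str, vocabulary: set[str], rng: random.Random) -> str:
    """Return a typo-like, out-of-vocabulary variant of `word`."""
    if not word:
        return word
    # Adjacent swaps that leave the first and last characters untouched.
    positions = list(range(1, len(word) - 2))
    rng.shuffle(positions)
    for i in positions:
        swapped = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        if swapped != word and swapped not in vocabulary:
            return swapped
    # Fallback: no valid internal swap exists, so repeat the last character.
    candidate = word + word[-1]
    while candidate in vocabulary:
        candidate += word[-1]
    return candidate
```

Short words such as "he" have no internal characters to swap, so they always take the fallback branch.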

SLIDE 51

Constrained Adversarial Attacks

SLIDE 52

Choosing a Similarity Metric

Human vs automatic (Pearson r):

○ Humans score original/adversarial input
○ Humans score original/adversarial output
○ Compare scores to automatic metric with Pearson correlation

SLIDE 54

Choosing a Similarity Metric

Human vs automatic (Pearson r):

○ Humans score original/adversarial input
○ Humans score original/adversarial output
○ Compare scores to automatic metric with Pearson correlation

chrF better

⇒ s_src = s_tgt := chrF
⇒ d_tgt := RDchrF (Relative Decrease in chrF)
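The comparison between human judgments and an automatic metric comes down to a Pearson correlation over paired scores. A minimal sketch follows; `pearson_r` is a hypothetical helper, and any scores passed to it in practice would come from the annotation study, not from this snippet.

```python
import math


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length (at least 2 items)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)
```

A metric whose scores rise and fall with the human ratings gets r close to 1. The slides use this to pick the automatic metric that tracks human judgment best.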

SLIDE 55

Effect of Constraints on Evaluation

[Plot annotations: “Better target destruction”, “Better source preservation”]

SLIDE 56

Effect of Constraints on Adversarial Training

SLIDE 57

Effect of Constraints on Adversarial Training

Adversarial training ≈ training with adversarial examples

Training loss: (1 − 𝛽) · loss(standard input) + 𝛽 · loss(adversarial input)

○ 𝛽 = 0: Standard training
○ 𝛽 = 1: Training only on adversarial examples

SLIDE 58

Effect of Constraints on Adversarial Training

Adversarial training ≈ training with adversarial examples

Training loss: (1 − 𝛽) · loss(standard input) + 𝛽 · loss(adversarial input)

○ 𝛽 = 0: Standard training
○ 𝛽 = 1: Training only on adversarial examples

Training with unconstrained attacks vs CharSwap attacks

Evaluate on:

○ Robustness to CharSwap attacks
○ Accuracy on non-adversarial data
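The mixing formula itself is not in this transcript. A minimal sketch, assuming the objective is a convex combination of the loss on the standard input and the loss on the adversarial input (which matches the two endpoints listed above), with hypothetical names:

```python
def mixed_loss(loss_standard: float, loss_adversarial: float, beta: float) -> float:
    """Interpolate between standard and adversarial training losses.

    beta = 0 reproduces standard training; beta = 1 trains only on
    adversarial examples (assumed linear interpolation in between).
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return (1.0 - beta) * loss_standard + beta * loss_adversarial
```

For instance, `mixed_loss(2.0, 4.0, 0.5)` weights both inputs equally and returns 3.0.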

SLIDE 59

Effect of Constraints on Adversarial Training: Adversarial Robustness

Robustness to CharSwap attacks on the validation set

lower is better

SLIDE 62

Effect of Constraints on Adversarial Training: Adversarial Robustness

Robustness to CharSwap attacks on the validation set

lower is better

Adversarial training ⇒ better robustness

SLIDE 63

Effect of Constraints on Adversarial Training: Accuracy on Non-Adversarial Input

Target chrF on the original test set

Higher is better

SLIDE 66

Effect of Constraints on Adversarial Training: Accuracy on Non-Adversarial Input

Target chrF on the original test set

Higher is better

Unconstrained attacks ⇒ hurt accuracy

SLIDE 67

Takeaway

When doing adversarial attacks

○ Evaluate meaning preservation on the source side

“How would you rate the similarity between the meaning of these two sentences?”

0. The meaning is completely different or one of the sentences is meaningless
1. The topic is the same but the meaning is different
2. Some key information is different
3. The key information is the same but the details differ
4. Meaning is essentially the same but some expressions are unnatural
5. Meaning is essentially equal and the two sentences are well-formed [Language]

SLIDE 68

Takeaway

When doing adversarial attacks

○ Evaluate meaning preservation on the source side

“How would you rate the similarity between the meaning of these two sentences?”

0. The meaning is completely different or one of the sentences is meaningless
1. The topic is the same but the meaning is different
2. Some key information is different
3. The key information is the same but the details differ
4. Meaning is essentially the same but some expressions are unnatural
5. Meaning is essentially equal and the two sentences are well-formed [Language]

When doing adversarial training

○ Consider adding constraints to your attacks

SLIDE 69

Takeaway

When doing adversarial attacks

○ Evaluate meaning preservation on the source side

“How would you rate the similarity between the meaning of these two sentences?”

0. The meaning is completely different or one of the sentences is meaningless
1. The topic is the same but the meaning is different
2. Some key information is different
3. The key information is the same but the details differ
4. Meaning is essentially the same but some expressions are unnatural
5. Meaning is essentially equal and the two sentences are well-formed [Language]

When doing adversarial training

○ Consider adding constraints to your attacks

Not only true for seq2seq!

○ Easily transposed to classification, etc.
○ Just adapt the source-side and target-side metrics accordingly

SLIDE 70

TEAPOT

Tool implementing our evaluation framework

pip install teapot-nlp
github.com/pmichel31415/teapot-nlp

SLIDE 71

Questions

SLIDE 72

Gradient Based Adversarial Attacks on Text

Idea: Word substitution ⟺ Adding word vector difference
Use the 1st order approximation to maximize the loss
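A sketch of the first-order idea: replacing a word shifts its embedding by (candidate − current), so the estimated change in loss is the dot product of the loss gradient with that difference. The attack picks the candidate with the largest estimate. The function name and the toy vectors in the usage note are hypothetical, and the gradient is taken as given rather than computed by a model.

```python
def best_substitution(grad: list[float], current: list[float],
                      candidates: dict[str, list[float]]) -> str:
    """Pick the candidate word with the largest first-order loss increase.

    grad:       gradient of the adversarial loss w.r.t. the current word embedding.
    current:    embedding of the word being replaced.
    candidates: embeddings of the allowed replacement words.
    """
    def estimated_gain(vector: list[float]) -> float:
        # First-order Taylor term: grad . (candidate - current)
        return sum(g * (c - o) for g, c, o in zip(grad, vector, current))

    return max(candidates, key=lambda word: estimated_gain(candidates[word]))
```

With made-up values `grad = [1.0, -1.0]`, `current = [0.0, 0.0]` and candidates "chat" = [0.5, 0.5], "petit" = [1.0, 0.0], "un" = [0.0, 1.0], the estimated gains are 0, 1 and −1, so "petit" is selected. Restricting `candidates` (for example to nearest neighbors) is how constraints enter the attack.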

SLIDE 73

Human Evaluation: the Gold Standard

“How would you rate the similarity between the meaning of these two sentences?”

0. The meaning is completely different or one of the sentences is meaningless
1. The topic is the same but the meaning is different
2. Some key information is different
3. The key information is the same but the details differ
4. Meaning is essentially the same but some expressions are unnatural
5. Meaning is essentially equal and the two sentences are well-formed [Language]

Check for semantic similarity and fluency

SLIDE 74

Example of a Successful Attack

(source chrF = 80.89, target RDchrF = 84.06)

Original

Ils le réinvestissent directement en engageant plus de procès.

Adv. src.

Ilss le réinvestissent dierctement en engagaent plus de procès.

Ref.

They plow it right back into filing more troll lawsuits.

Base output

They direct it directly by engaging more cases.

Adv. output

.. de plus.

SLIDE 75

Example of an Unsuccessful Attack

(source chrF = 54.46, target RDchrF = 0.00)

Original

C’était en Juillet 1969.

Adv. src.

C’étiat en Jiullet 1969.

Ref.

This is from July, 1969.

Base output

This was in July 1969.

Adv. output

This is. in 1969.
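The two examples above can be checked against the rule "destroy the meaning on the target side more than on the source side". This sketch assumes both scores are on a 0–100 scale and that source meaning destruction is 100 minus the source chrF. `attack_succeeds` is a hypothetical helper.

```python
def attack_succeeds(source_chrf: float, target_rd_chrf: float) -> bool:
    """True if target destruction exceeds source destruction (0-100 scale assumed)."""
    source_destruction = 100.0 - source_chrf
    return target_rd_chrf > source_destruction


# Scores reported on the two example slides.
print(attack_succeeds(80.89, 84.06))  # first example: 84.06 > 19.11
print(attack_succeeds(54.46, 0.00))   # second example: 0.00 is not > 45.54
```

The first call prints True and the second False, which matches the slides' labels of "successful" and "unsuccessful".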