Counterfactual Data Augmentation for Mitigating Gender Stereotypes in Languages with Rich Morphology ACL 2019 Ran Zmigrod, Sebastian J. Mielke , Hanna Wallach, Ryan Cotterell University of Cambridge // Johns Hopkins University // Microsoft
Gender bias in NLP systems

Coreference resolution systems are biased: "Even though the doctor reassured the nurse, she was worried." Both readings are possible... but systems prefer nurse! (Rudinger et al., 2018; Zhao et al., 2018)

Word embeddings carry biases, too.
This shouldn’t come as a surprise: our data is biased

[Figure: Google n-grams frequency counts, 1900–2000, of "he is a doctor" vs. "she is a doctor"]
Our focus: stereotypes in language modeling (Lu et al., 2018)

Training data counts are visible as likelihoods under a language model:

                 stereotype m           stereotype f
  pronoun m      He is a good doctor.   He is a good nurse.
  pronoun f      She is a good doctor.  She is a good nurse.

The solution: Counterfactual Data Augmentation (Lu et al., 2018)

For every sentence with she/he (e.g., "She is a nurse."), add that sentence with he/she to the training data (e.g., "He is a nurse."). Now they should yield a balanced model!
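The English intervention above can be sketched in a few lines. This is a minimal illustration, not the full swap list from Lu et al. (2018): it flips a handful of pronouns and keeps both the original and the counterfactual sentence for training.

```python
# Minimal sketch of English counterfactual data augmentation:
# swap gendered pronouns, train on original + swapped sentences.
# The swap table is illustrative; "her" is ambiguous (him/his) and
# real systems disambiguate it, which we gloss over here.

SWAPS = {
    "he": "she", "she": "he",
    "him": "her", "her": "his", "his": "her",
    "himself": "herself", "herself": "himself",
}

def swap_gender(sentence: str) -> str:
    """Return the sentence with gendered pronouns swapped (case preserved)."""
    out = []
    for token in sentence.split():
        word = token.rstrip(".,!?")
        punct = token[len(word):]
        swapped = SWAPS.get(word.lower(), word)
        if word[:1].isupper():
            swapped = swapped.capitalize()
        out.append(swapped + punct)
    return " ".join(out)

def augment(corpus):
    """Yield each sentence plus its counterfactual copy (if it differs)."""
    for sent in corpus:
        yield sent
        cf = swap_gender(sent)
        if cf != sent:
            yield cf
```

For "She is a nurse." this yields the pair {"She is a nurse.", "He is a nurse."}, which is exactly the balancing step the slide describes — and exactly the step that breaks down in morphologically rich languages, as the next slides show.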
“Agreement” or “what if: German”

                 stereotype m                   stereotype f
  pronoun m      Er ist ein guter Arzt.         Er ist ein guter Krankenpfleger.
  pronoun f      Sie ist eine gute Ärztin.      Sie ist eine gute Krankenpflegerin.
  (“He/She is a good doctor/nurse”; German marks gender on the pronoun, article, adjective, and noun)

So, uh, can we just... change all words’ grammatical gender?

Example:  Der Arzt sitzt auf einem Stuhl   (The male doctor sits on a chair)
Swap all: Die Ärztin sitzt auf einer Stuhl (The female doctor sits on a... what?)

No, what we need is...
Syntax to the rescue: use dependency parses

Only words “connected” in the dependency parse should change!
Build an MRF over morphological tags along the dependency parse!

[Figure: dependency parse of “Der gute Arzt sitzt auf einem Stuhl”, one morphological tag per word. Intervening on the noun flips the tags of “Der”, “gute”, and “Arzt” from M;SG;NOM to F;SG;NOM; “sitzt” keeps 3P;SG;PRS, and “einem”/“Stuhl” keep M;SG;DAT. Parse edges carry agreement/concord factors (neural factors, learned from data) plus manual dampening factors (not learned; they boost tags that stay what they were before the intervention).]
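The intervention in the figure — flip one word's gender, then update only the words connected to it by agreement — can be sketched with a hardcoded whitelist of dependency relations. This is a toy stand-in: the paper learns these agreement preferences as MRF factors rather than hardcoding them, and the relation names below are illustrative.

```python
# Toy sketch of the intervention: flip the gender of one word, then
# propagate the flip to dependents connected by agreement-bearing
# relations. The whitelist replaces the learned MRF factors of the paper.

AGREEING_RELATIONS = {"det", "amod"}   # determiners and adjectives agree

def propagate_flip(tags, edges, target):
    """tags: one dict per word with a 'Gender' key ('M', 'F', or '-').
    edges: (head_index, dependent_index, relation) triples.
    target: index of the word whose gender we intervene on."""
    new_tags = [dict(t) for t in tags]          # leave the input untouched
    flipped = "F" if tags[target]["Gender"] == "M" else "M"
    new_tags[target]["Gender"] = flipped
    # flip every dependent attached to the target by an agreeing relation
    for head, dep, rel in edges:
        if head == target and rel in AGREEING_RELATIONS:
            new_tags[dep]["Gender"] = flipped
    return new_tags
```

On the running example, flipping “Arzt” also flips “Der” (det) and “gute” (amod), while “einem Stuhl” — attached elsewhere in the parse — keeps its masculine tags, exactly the behaviour a blind swap-all gets wrong.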
Recap: what is a Markov Random Field (Koller and Friedman, 2009)?

Model p(x, y, z) by decomposing it into factors ψ!

Every factor gives a score to certain assignments:
  ψ₁(x = 2, y = 1) = 0.42    ψ₂(y = 1) = 1.3    ψ₃(z = 1) = −1

Add up all factors to obtain a global score:
  score(x = 2, y = 1, z = 4) = ψ₁(x = 2, y = 1) + ψ₂(y = 1) + ψ₃(z = 4)

Get p by global normalization (easy in trees):
  p(x = 2, y = 1, z = 4) ∝ exp score(x = 2, y = 1, z = 4)
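The recap above fits in a few lines of code. This sketch uses made-up factor values over a tiny domain and normalizes by brute force; in the tree-shaped MRFs of the paper the normalizer is computed efficiently by message passing rather than enumeration.

```python
# Minimal MRF: three variables, additive log-factors, brute-force
# normalization. Factor values are the made-up ones from the recap slide.

import itertools
import math

DOMAIN = [1, 2, 3, 4]

def psi_xy(x, y):   # pairwise factor on (x, y)
    return 0.42 if (x, y) == (2, 1) else 0.0

def psi_y(y):       # unary factor on y
    return 1.3 if y == 1 else 0.0

def psi_z(z):       # unary factor on z
    return -1.0 if z == 1 else 0.0

def score(x, y, z):
    """Global score = sum of all factor scores."""
    return psi_xy(x, y) + psi_y(y) + psi_z(z)

def prob(x, y, z):
    """p(x, y, z) ∝ exp(score): normalize over all assignments."""
    Z = sum(math.exp(score(a, b, c))
            for a, b, c in itertools.product(DOMAIN, repeat=3))
    return math.exp(score(x, y, z)) / Z
```

Assignments that please more factors (here x = 2, y = 1) get exponentially more probability mass, which is how agreement factors over a parse tree make consistent tag assignments likely.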
Reinflect tokens to obtain the CDA sentence

Get the new sentence by performing morphological reinflection wherever a tag changed (this is a reasonably well-working procedure, established in three shared tasks at SIGMORPHON and CoNLL):

  Der gute Arzt sitzt auf einem Stuhl
  [new tags: F;SG;NOM  F;SG;NOM  F;SG;NOM  3P;SG;PRS  –  M;SG;DAT  M;SG;DAT]
  ↓ reinflection model p(· | ·) applied to each word whose tag changed
  Die gute Ärztin sitzt auf einem Stuhl
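The reinflection step can be sketched as a lookup keyed on (lemma, new tag). Real systems use learned neural reinflection models (as in the SIGMORPHON/CoNLL shared tasks); the tiny hand-built lexicon here is a stand-in that covers only the running example.

```python
# Toy reinflection: wherever a word's tag changed under the intervention,
# look up the surface form for (lemma, new tag); otherwise keep the token.
# The lexicon is a hand-built stand-in for a learned reinflection model.

LEXICON = {
    ("der",  "F;SG;NOM"): "Die",
    ("gut",  "F;SG;NOM"): "gute",
    ("Arzt", "F;SG;NOM"): "Ärztin",
}

def reinflect(tokens, old_tags, new_tags, lemmas):
    """Replace the surface form wherever the tag changed; keep the rest."""
    out = []
    for tok, old, new, lemma in zip(tokens, old_tags, new_tags, lemmas):
        if old == new:
            out.append(tok)                    # tag unchanged: keep token
        else:
            out.append(LEXICON[(lemma, new)])  # tag changed: reinflect
    return out
```

Applied to “Der gute Arzt sitzt auf einem Stuhl” with the intervened tags, only the first three words are rewritten, producing “Die gute Ärztin sitzt auf einem Stuhl”.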
Intrinsic evaluation: how good are we at gender-swapping (Hebrew, Spanish)?

We manually annotated over 100 sentences for each language and checked performance:

                                 Tag                          Form
                                 P      R      F1     Acc    Acc
  Hebrew:  hardcoded factors     89.04  40.12  55.32  86.88  83.63
  Hebrew:  linear factors        87.07  62.35  72.66  90.50  86.75
  Hebrew:  neural factors        87.18  62.96  73.12  90.62  86.25
  Spanish: hardcoded factors     96.97  51.45  67.23  90.21  86.32
  Spanish: linear factors        92.74  73.95  82.29  93.79  89.52
  Spanish: neural factors        95.34  72.35  82.27  93.91  89.65
Extrinsic evaluation: train language models on CDA-balanced data, then evaluate:

Bias = log [ Σ_{x∈Σ*} p(Der gute Arzt · x) / Σ_{x∈Σ*} p(Die gute Ärztin · x) ]
       (total probability of continuations of the masculine phrase vs. its feminine counterpart)

[Bar chart: Gender Bias (≈ 2–6) for Esp, Fra, Heb, Ita, comparing Original, Swap, and MRF training data]

Grammaticality = log [ Σ_{x∈Σ*} p(Die gute Ärztin · x) / Σ_{x∈Σ*} p(Der gute Ärztin · x) ]
                 (averaged over k ungrammatical “bad” variants in the denominator)

[Bar chart: Grammaticality (≈ 1–3) for Esp, Fra, Heb, Ita, comparing Original, Swap, and MRF training data]
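The bias metric is just a log-ratio of the probability mass a language model puts on the masculine phrase versus its feminine counterpart. A minimal sketch, using a dict of made-up prefix marginals in place of a real language model's Σ_{x∈Σ*} p(prefix · x):

```python
# Sketch of the bias metric: log of the masculine-to-feminine
# prefix-probability ratio. A balanced model scores ≈ 0.
# PREFIX_MASS stands in for Σ_{x∈Σ*} p(prefix · x) under a trained LM;
# the numbers are made up for illustration.

import math

PREFIX_MASS = {
    "Der gute Arzt":   0.008,   # masculine phrase
    "Die gute Ärztin": 0.002,   # feminine counterpart
}

def bias(lm_mass, m_phrase, f_phrase):
    """log p(m-phrase continuations) / p(f-phrase continuations)."""
    return math.log(lm_mass[m_phrase] / lm_mass[f_phrase])

b = bias(PREFIX_MASS, "Der gute Arzt", "Die gute Ärztin")  # > 0: biased toward m
```

Training on CDA-balanced data should push this ratio toward 0 without hurting the grammaticality score, which is what the MRF bars in the charts show relative to the naive Swap baseline.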
Conclusion

- 1. As so often, things that are easy in English... become surprisingly hard in other languages.
- 2. Old-school probabilistic models often work well enough™
- 3. And, always, careful with your training data, Eugene!