slide-1
SLIDE 1

DIG Seminar: Civil Rephrases Of Toxic Texts With Self-Supervised Transformers

Léo Laugier1, John Pavlopoulos2,3, Jeffrey Sorensen4, Lucas Dixon4, Thomas Bonald1

1Télécom Paris, Institut Polytechnique de Paris 2Athens University of Economics & Business 3Stockholm University 4Google

October 15, 2020

Laugier, L. (IP Paris) Presentation October 15, 2020 1 / 42

slide-2
SLIDE 2

Contents

1. Introduction: Can we nudge healthier conversations from an unpaired corpus?
2. Method: We fine-tuned a Denoising Auto-Encoder bi-conditional Language Model
3. Evaluation: How to evaluate with automatic metrics?
4. Results on sentiment transfer and detoxification
5. Conclusion

Laugier, L. (IP Paris) Presentation October 15, 2020 2 / 42

slide-3
SLIDE 3

Contents

1. Introduction: Can we nudge healthier conversations from an unpaired corpus?
2. Method: We fine-tuned a Denoising Auto-Encoder bi-conditional Language Model
3. Evaluation: How to evaluate with automatic metrics?
4. Results on sentiment transfer and detoxification
5. Conclusion

Laugier, L. (IP Paris) Presentation October 15, 2020 3 / 42

slide-4
SLIDE 4

Introduction (1/5): Nudging healthier conversations online

Laugier, L. (IP Paris) Presentation October 15, 2020 4 / 42

slide-6
SLIDE 6

Introduction (2/5): Machine learning systems classify toxic comments online

Figure: from Pavlopoulos et al. [1]

Laugier, L. (IP Paris) Presentation October 15, 2020 5 / 42

slide-7
SLIDE 7

Introduction (3/5): Deep learning is efficient when applied to generative transfer tasks

Figure: Left: CycleGAN [2] Right: Neural Machine Translation (NMT) (from https://jalammar.github.io/)

Laugier, L. (IP Paris) Presentation October 15, 2020 6 / 42

slide-9
SLIDE 9

Introduction (4/5): Gold annotated pairs are more expensive and difficult to obtain than monolingual corpora annotated with attributes

Figure: Left: Parallel (paired) corpus for supervised NMT Right: Non-parallel (Unpaired) corpora for self-supervised NMT

Laugier, L. (IP Paris) Presentation October 15, 2020 7 / 42

slide-11
SLIDE 11

Introduction (5/5): Therefore we opted for a self-supervised setting

Figure: Left: Polarised Civil Comments dataset [3] Right: Yelp Review dataset [4] (for initial experiments and fair comparison purposes)

Laugier, L. (IP Paris) Presentation October 15, 2020 8 / 42

slide-13
SLIDE 13

Contents

1. Introduction: Can we nudge healthier conversations from an unpaired corpus?
2. Method: We fine-tuned a Denoising Auto-Encoder bi-conditional Language Model
3. Evaluation: How to evaluate with automatic metrics?
4. Results on sentiment transfer and detoxification
5. Conclusion

Laugier, L. (IP Paris) Presentation October 15, 2020 9 / 42

slide-14
SLIDE 14

Method (1/14): Formalizing the problem

Goal

Let XT and XC be the “toxic” and “civil” non-parallel corpora, and let X = XT ∪ XC. We aim to learn, in a self-supervised setting, a mapping fθ such that for all (x, a) ∈ X × {“civil”, “toxic”}, y = fθ(x, a) is a text:

1. Satisfying a,
2. Fluent in English,
3. Preserving the meaning of x “as much as possible”.

There exist two related approaches

Encoder-decoder architectures work well for supervised sequence-to-sequence (seq2seq) tasks such as NMT: T5 [5] (addresses goals 1, 2 and 3).

Language Models (LMs) are efficient for self-supervised “free” generation: GPT-2 [6] (goal 2) and CTRL [7] (goals 1 and 2).

Laugier, L. (IP Paris) Presentation October 15, 2020 10 / 42

slide-17
SLIDE 17

Method (2/14): Encoder-Decoder for supervised seq2seq

Pθ(Ŷj | Ŷ1 = ȳ1, …, Ŷj−1 = ȳj−1, X = x) =
( Pθ(Ŷj = w1 | Ŷ1 = ȳ1, …, Ŷj−1 = ȳj−1, X = x),
  Pθ(Ŷj = w2 | Ŷ1 = ȳ1, …, Ŷj−1 = ȳj−1, X = x),
  …,
  Pθ(Ŷj = w|V| | Ŷ1 = ȳ1, …, Ŷj−1 = ȳj−1, X = x) )

ȳj = yj if training (teacher forcing); ȳj = ŷj if inference (Auto-Regressive (AR) generation).

Laugier, L. (IP Paris) Presentation October 15, 2020 11 / 42
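To make the teacher-forcing / auto-regressive distinction concrete, here is a minimal Python sketch (not the authors' code); next_token_dist is a stand-in for the decoder distribution Pθ(Ŷj | ·), and the toy vocabulary and canned continuation are illustrative assumptions:

import numpy as np

VOCAB = ["<eos>", "your", "comments", "are", "great"]
CANNED = ["your", "comments", "are", "great", "<eos>"]   # what this toy "model" likes to say

def next_token_dist(prefix, x):
    """Stand-in for P_theta(Y_j | Y_1..Y_{j-1}, X=x): any function returning a
    probability vector over VOCAB. Here it simply favours the next canned token."""
    p = np.full(len(VOCAB), 0.01)
    target = CANNED[min(len(prefix), len(CANNED) - 1)]
    p[VOCAB.index(target)] = 1.0
    return p / p.sum()

def decode(x, gold=None, max_len=10):
    prefix = []
    for j in range(max_len):
        p = next_token_dist(prefix, x)           # distribution over VOCAB at step j
        y_hat = VOCAB[int(np.argmax(p))]         # greedy choice (beam search also possible)
        # Teacher forcing: feed back the gold token y_j during training,
        # the model's own prediction at inference.
        fed_back = gold[j] if gold is not None else y_hat
        prefix.append(fed_back)
        if fed_back == "<eos>":
            break
    return prefix

x = ["you", "write", "stupid", "comments"]
print(decode(x))                                                       # inference: AR generation
print(decode(x, gold=["your", "comments", "are", "great", "<eos>"]))   # training: teacher forcing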

SLIDE 20

Method (3/14): Encoding and decoding is modeled via attention mechanism (see https://jalammar.github.io/)

Figure: Cross-attention heat map for NMT, from Bahdanau et al. [8] (2015)

Second Law of Robotics

A robot must obey the orders given it by human beings except where such orders would conflict with the First Law.

Laugier, L. (IP Paris) Presentation October 15, 2020 12 / 42

slide-23
SLIDE 23

Method (4/14): Bi-transformers [9] encode the input and decode the hidden states (see https://jalammar.github.io/)

Laugier, L. (IP Paris) Presentation October 15, 2020 13 / 42

slide-24
SLIDE 24

Method (5/14): Inference time - where the Natural Language Generation happens (see https://jalammar.github.io/)

Laugier, L. (IP Paris) Presentation October 15, 2020 14 / 42

slide-26
SLIDE 26

Method (6/14): Transformers learn relevant features

Figure: As we encode the word “it”, one attention head is focusing most on “the animal”, while another is focusing on “tired”.

Laugier, L. (IP Paris) Presentation October 15, 2020 15 / 42

slide-27
SLIDE 27

Method (7/14): Transformers benefit from scaling their size (hidden size and depth) and pre-training on a massive corpus: T5 [5]

Figure: Transfer learning: Text-to-Text Transfer Transformer (T5) is pre-trained with a self-supervised objective to learn semantic representations, before being fine-tuned on downstream supervised tasks (NMT, sentiment analysis, etc.)

Pre-training dataset: “Colossal Clean Crawled Corpus” (C4), ∼34 billion tokens (∼750 GB) of clean English text scraped from the web. T5 sizes: Small, Base, Large (24 layers, 770 million parameters), 3B, 11B.

Laugier, L. (IP Paris) Presentation October 15, 2020 16 / 42
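As an aside, the text-to-text interface can be tried with the Hugging Face transformers library and the public t5-large checkpoint; this is an assumed illustration, the slides themselves show no code:

from transformers import T5TokenizerFast, T5ForConditionalGeneration

# Load the publicly released pre-trained checkpoint.
tok = T5TokenizerFast.from_pretrained("t5-large")
model = T5ForConditionalGeneration.from_pretrained("t5-large")

# Every task is cast as text-to-text: a task prefix plus the input sentence.
inputs = tok("translate English to German: The house is wonderful.", return_tensors="pt")
out = model.generate(**inputs, max_length=40)
print(tok.decode(out[0], skip_special_tokens=True))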

slide-28
SLIDE 28

Method (8/14): Encoder-Decoder transformers had rarely been trained in a self-supervised setting, but decoders had

Goal

Let XT and XC be the “toxic” and “civil” non-parallel corpora, and let X = XT ∪ XC. We aim to learn, in a self-supervised setting, a mapping fθ such that for all (x, a) ∈ X × {“civil”, “toxic”}, y = fθ(x, a) is a text:

1. Satisfying a,
2. Fluent in English,
3. Preserving the meaning of x “as much as possible”.

There exist two related approaches

Encoder-decoder architectures work well for supervised sequence-to-sequence (seq2seq) tasks such as NMT: T5 [5] (addresses goals 1, 2 and 3).

Language Models (LMs) are efficient for self-supervised “free” generation: GPT-2 [6] (goal 2) and CTRL [7] (goals 1 and 2).

Laugier, L. (IP Paris) Presentation October 15, 2020 17 / 42

slide-30
SLIDE 30

Method (9/14): Introduction to Language Models (LM)

What is a Language Model?

A statistical Language Model is a probability distribution over sequences of words.

Predicting the next word: p(wt | w<t). If w<t = [“the”, “best”, “place”, “to”, “visit”, “in”, “France”, “is”], then
p(“Paris” | w<t) = 0.6, p(“Mont” | w<t) = 0.3, p(“Saclay” | w<t) = ε, p(“have” | w<t) = 0.

Deep learning provides parametric architectures able to learn, in a self-supervised setting, to approximate LMs: p(wt | w<t; θ). They are trained with maximum likelihood on massive corpora like C4.

Generating w≥t from prompt w<t: p(w≥t | w<t; θ) = ∏_{i=t..n} p(wi | w<i; θ)

Laugier, L. (IP Paris) Presentation October 15, 2020 18 / 42
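A toy Python sketch (not from the slides) of the chain-rule factorisation above; the hand-written table stands in for a trained model's conditional distribution:

def p_next(word, prefix):
    """Hypothetical p(word | prefix): a real LM estimates this for any prefix."""
    table = {
        ("France", "is"): {"Paris": 0.6, "Mont": 0.3, "Saclay": 1e-3},
    }
    return table.get(tuple(prefix[-2:]), {}).get(word, 0.0)

def sequence_prob(continuation, prefix):
    """p(w_{>=t} | w_{<t}) = product over i of p(w_i | w_{<i})."""
    prob = 1.0
    for w in continuation:
        prob *= p_next(w, prefix)   # chain rule: multiply the conditionals
        prefix = prefix + [w]
    return prob

prefix = ["the", "best", "place", "to", "visit", "in", "France", "is"]
print(sequence_prob(["Paris"], prefix))   # 0.6
print(sequence_prob(["have"], prefix))    # 0.0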

slide-32
SLIDE 32

Method (10/14): Class-Conditional LMs (CC-LMs)

CTRL: A Conditional Transformer Language Model for Controllable Generation [7]

Generating a sentence sa = w1:n of length n in class a: p(sa; θ) = ∏_{i=1..n} p(wi | w<i, a; θ)

If the “prompt” is w<t = [“Paris”, “is”] and a ∈ {positive, negative}, then:

argmax over wt:t+4 of p(wt:t+4 | w<t, a = positive; θ) = [“such”, “a”, “beautiful”, “city”]
argmax over wt:t+4 of p(wt:t+4 | w<t, a = negative; θ) = [“a”, “very”, “boring”, “town”]

Laugier, L. (IP Paris) Presentation October 15, 2020 19 / 42
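A toy Python sketch (assumed, not CTRL's actual code) of how a class-conditional LM re-ranks continuations of the same prompt depending on the control code a; the hand-written probabilities stand in for a learned model:

def p_next(word, prefix, a):
    """Hypothetical p(word | prefix, a); a real CC-LM learns this from data."""
    if a == "positive":
        table = {"such": 0.5, "a": 0.3, "beautiful": 0.9, "city": 0.9}
    else:  # "negative"
        table = {"a": 0.5, "very": 0.4, "boring": 0.9, "town": 0.9}
    return table.get(word, 0.01)

def seq_prob(words, prompt, a):
    p, prefix = 1.0, list(prompt)
    for w in words:
        p *= p_next(w, prefix, a)   # p(s_a) = product of class-conditional next-word probabilities
        prefix.append(w)
    return p

prompt = ["Paris", "is"]
candidates = [["such", "a", "beautiful", "city"], ["a", "very", "boring", "town"]]
for a in ("positive", "negative"):
    best = max(candidates, key=lambda c: seq_prob(c, prompt, a))
    print(a, "->", " ".join(best))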

slide-33
SLIDE 33

Method (11/14): Our approach combines both ideas

Goal

Let XT and XC be the “toxic” and “civil” non-parallel corpora, and let X = XT ∪ XC. We aim to learn, in a self-supervised setting, a mapping fθ such that for all (x, a) ∈ X × {“civil”, “toxic”}, y = fθ(x, a) is a text:

1. Satisfying a,
2. Fluent in English,
3. Preserving the meaning of x “as much as possible”.

There exist two related approaches

Encoder-decoder architectures work well for supervised sequence-to-sequence (seq2seq) tasks (e.g. NMT): T5 [5] (addresses goals 1, 2 and 3).

Language Models (LMs) are efficient for self-supervised “free” generation: GPT-2 [6] (goal 2) and CTRL [7] (goals 1 and 2).

CAE-T5:

We fine-tuned a pre-trained T5 bi-transformer (fluency, goal 2) with a Conditional (attribute control, goal 1) Auto-Encoder (content preservation, goal 3) objective.

Laugier, L. (IP Paris) Presentation October 15, 2020 20 / 42

slide-35
SLIDE 35

Method (12/14): Training CAE-T5 is fine-tuning T5 with a Conditional denoising Auto-Encoder objective

Training example (alternating batches of toxic and civil examples):

x = [“this”, “is”, “a”, “great”, “article”], of attribute a = α(x) = civil.

The noise function η masks and replaces tokens randomly: η(x) = [“this”, “MASK”, “a”, “the”, “article”] (targets goals 2 and 3).

γ(a, x) prepends to x the control code corresponding to attribute a: γ(α(x), x) = [“civil:”, “this”, “is”, “a”, “great”, “article”] (targets goal 1).

LDAE = E_{x∼X}[− log p(x | η(x), α(x); θ)]

Laugier, L. (IP Paris) Presentation October 15, 2020 21 / 42
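A minimal Python sketch (assumed, not the authors' input pipeline) of the two pre-processing functions η and γ used by the DAE objective; the masking and replacement rates are illustrative:

import random

VOCAB = ["this", "is", "a", "great", "the", "article"]

def eta(tokens, p_mask=0.2, p_replace=0.1):
    """Noise function: randomly mask or replace tokens (rates are illustrative)."""
    noisy = []
    for t in tokens:
        r = random.random()
        if r < p_mask:
            noisy.append("MASK")
        elif r < p_mask + p_replace:
            noisy.append(random.choice(VOCAB))
        else:
            noisy.append(t)
    return noisy

def gamma(attribute, tokens):
    """Prepend the control code of the attribute ('civil' or 'toxic')."""
    return [attribute + ":"] + tokens

x = ["this", "is", "a", "great", "article"]
model_input = gamma("civil", eta(x))   # e.g. ['civil:', 'this', 'MASK', 'a', 'the', 'article']
target = x                             # the denoising target is the clean sentence itself
print(model_input, "->", target)

Training then minimises −log p(target | model_input; θ), i.e. LDAE, over alternating civil and toxic batches.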

slide-36
SLIDE 36

Method (13/14): Attribute transfer at prediction time with trained CAE-T5

Toxic → civil test example:

x = [“you”, “write”, “stupid”, “comments”], of attribute α(x) = toxic. Destination attribute: a = ᾱ(x) = civil.

γ(a, ŷ<0) = [“civil:”]. AR generation: ŷ0 = “your”; ŷ1 = “comments”; ŷ2 = “are”; ŷ3 = “great”.

Laugier, L. (IP Paris) Presentation October 15, 2020 22 / 42
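For illustration, attribute transfer at prediction time could look like the sketch below, written against the Hugging Face transformers API; the checkpoint path is a hypothetical placeholder for the fine-tuned CAE-T5 weights:

from transformers import T5TokenizerFast, T5ForConditionalGeneration

tok = T5TokenizerFast.from_pretrained("t5-large")
# Hypothetical path: the fine-tuned CAE-T5 weights obtained from the training above.
model = T5ForConditionalGeneration.from_pretrained("/path/to/cae-t5-checkpoint")

toxic_comment = "you write stupid comments"
# Destination attribute "civil": prepend its control code, then generate auto-regressively.
inputs = tok("civil: " + toxic_comment, return_tensors="pt")
out = model.generate(**inputs, max_length=32)
print(tok.decode(out[0], skip_special_tokens=True))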

slide-37
SLIDE 37

Method (14/14): During training, we add a Cycle-Consistency objective to enforce content preservation (goal 3)

LCC = E_{x∼X}[− log p(x | f_θ̃(x, ᾱ(x)), α(x); θ)]

Final loss function

L = λDAE LDAE + λCC LCC, a weighted sum of two negative log-likelihoods (equivalently, cross-entropies).

Optimization

θ̂ = argmin_θ L(θ), optimized with Stochastic Gradient Descent on TPUs (∼90,000 steps).

Laugier, L. (IP Paris) Presentation October 15, 2020 23 / 42
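A conceptual Python sketch (assuming PyTorch; not the authors' training script) of how the two terms combine; nll and generate are placeholders for the encoder-decoder's negative log-likelihood and its non-differentiable AR generation:

import torch

def nll(inp, tgt):          # placeholder: a real model returns -log p(tgt | inp; theta)
    return torch.tensor(float(len(tgt)))

def generate(inp):          # placeholder: a real model generates a rephrased token sequence
    return ["your", "comments", "are", "great"]

def gamma(attr, tokens):    # prepend the attribute control code, as on the previous slides
    return [attr + ":"] + tokens

def total_loss(x, attr, other_attr, eta=lambda t: t, lambda_dae=1.0, lambda_cc=1.0):
    # Denoising auto-encoder term: reconstruct x from its corrupted, attribute-tagged copy
    # (eta is the identity here; the real noise function masks / replaces tokens).
    l_dae = nll(gamma(attr, eta(x)), x)
    # Cycle-consistency term: transfer x to the opposite attribute without back-propagating
    # through generation, then require the model to map the result back to x.
    with torch.no_grad():
        pseudo = generate(gamma(other_attr, x))
    l_cc = nll(gamma(attr, pseudo), x)
    return lambda_dae * l_dae + lambda_cc * l_cc

print(total_loss(["you", "write", "stupid", "comments"], attr="toxic", other_attr="civil"))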

slide-38
SLIDE 38

Contents

1. Introduction: Can we nudge healthier conversations from an unpaired corpus?
2. Method: We fine-tuned a Denoising Auto-Encoder bi-conditional Language Model
3. Evaluation: How to evaluate with automatic metrics?
4. Results on sentiment transfer and detoxification
5. Conclusion

Laugier, L. (IP Paris) Presentation October 15, 2020 24 / 42

slide-39
SLIDE 39

Evaluation (1/2): How to evaluate with automatic metrics?

Goal

Let XT and XC be the “toxic” and “civil” non-parallel corpora, and let X = XT ∪ XC. We aim to learn, in a self-supervised setting, a mapping fθ such that for all (x, a) ∈ X × {“civil”, “toxic”}, y = fθ(x, a) is a text:

1. Satisfying a,
2. Fluent in English,
3. Preserving the meaning of x “as much as possible”.

Automatic evaluation systems

1. Accuracy (ACC): pre-trained attribute classifier (BERT [10])
2. Perplexity (PPL): pre-trained language model (GPT-2 [6])
3. Sentence similarity (self-SIM): pre-trained encoder (USE [11])

Laugier, L. (IP Paris) Presentation October 15, 2020 25 / 42
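A sketch (assumed, not the authors' evaluation code) of how the three metrics are aggregated over system outputs; the three scoring functions are placeholders for the pre-trained BERT classifier, GPT-2 language model, and Universal Sentence Encoder named above:

import math

def classify_attribute(text):                   # placeholder for the BERT attribute classifier
    return "civil"

def gpt2_neg_log_likelihood_per_token(text):    # placeholder for GPT-2 scoring
    return 3.2

def sentence_embedding(text):                   # placeholder for USE embeddings
    return [1.0, 0.0]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def evaluate(pairs, target_attribute="civil"):
    """pairs: list of (source_text, generated_text)."""
    acc = sum(classify_attribute(y) == target_attribute for _, y in pairs) / len(pairs)
    # Perplexity here is exp of the per-token negative log-likelihood, averaged over outputs.
    ppl = sum(math.exp(gpt2_neg_log_likelihood_per_token(y)) for _, y in pairs) / len(pairs)
    sim = sum(cosine(sentence_embedding(x), sentence_embedding(y)) for x, y in pairs) / len(pairs)
    return {"ACC": acc, "PPL": ppl, "self-SIM": sim}

print(evaluate([("you write stupid comments", "your comments are great")]))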

slide-41
SLIDE 41

Evaluation (2/2): Human evaluation through crowdworking

Figure: Guidelines provided to annotators on Appen

Laugier, L. (IP Paris) Presentation October 15, 2020 26 / 42

slide-42
SLIDE 42

Contents

1. Introduction: Can we nudge healthier conversations from an unpaired corpus?
2. Method: We fine-tuned a Denoising Auto-Encoder bi-conditional Language Model
3. Evaluation: How to evaluate with automatic metrics?
4. Results on sentiment transfer and detoxification
5. Conclusion

Laugier, L. (IP Paris) Presentation October 15, 2020 27 / 42

slide-43
SLIDE 43

Results (1/4): Yelp positive ↔ negative sentiment transfer, quantitative automatic evaluation

Laugier, L. (IP Paris) Presentation October 15, 2020 28 / 42

slide-44
SLIDE 44

Results (2/4): Yelp positive ↔ negative sentiment transfer, qualitative evaluation

Laugier, L. (IP Paris) Presentation October 15, 2020 29 / 42

slide-45
SLIDE 45

Results (3/4): toxic → civil, quantitative evaluations

Figure: Automatic evaluation of CAE-T5 applied to Civil Comments
Figure: Human evaluation of CAE-T5 applied to Civil Comments

Laugier, L. (IP Paris) Presentation October 15, 2020 30 / 42

slide-46
SLIDE 46

Results (4/4): toxic → civil, qualitative evaluation

Laugier, L. (IP Paris) Presentation October 15, 2020 31 / 42

slide-47
SLIDE 47

Contents

1. Introduction: Can we nudge healthier conversations from an unpaired corpus?
2. Method: We fine-tuned a Denoising Auto-Encoder bi-conditional Language Model
3. Evaluation: How to evaluate with automatic metrics?
4. Results on sentiment transfer and detoxification
5. Conclusion

Laugier, L. (IP Paris) Presentation October 15, 2020 32 / 42

slide-48
SLIDE 48

Conclusion (1/2)

CAE-T5 works well on the Yelp sentiment transfer task. Results are still preliminary for the Civil Comments dataset, probably due to the difficulty of the task in a self-supervised setting, but this is only the second time the task has been addressed.

Human and automatic evaluations are open research topics. CAE-T5 can be applied to other attribute transfer tasks, provided that one has access to two (or more) corpora annotated with attributes.

Currently under review at EACL 2021. Code (TF): https://github.com/LeoLaugier/conditional-auto-encoder-text-to-text-transfer-transformer

Laugier, L. (IP Paris) Presentation October 15, 2020 33 / 42

slide-49
SLIDE 49

Conclusion (2/2): CAE-T5 learnt to transfer toxic → civil

Laugier, L. (IP Paris) Presentation October 15, 2020 34 / 42

slide-50
SLIDE 50

References I

[1] John Pavlopoulos, Prodromos Malakasiotis, and Ion Androutsopoulos. Deeper attention to abusive user content moderation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1125–1135, Copenhagen, Denmark, September 2017. Association for Computational Linguistics.

[2] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Computer Vision (ICCV), 2017 IEEE International Conference on, 2017.

Laugier, L. (IP Paris) Presentation October 15, 2020 35 / 42

slide-51
SLIDE 51

References II

[3] Daniel Borkan, Lucas Dixon, Jeffrey Sorensen, Nithum Thain, and Lucy Vasserman. Nuanced metrics for measuring unintended bias with real data for text classification. CoRR, abs/1903.04561, 2019.

[4] Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi Jaakkola. Style transfer from non-parallel text by cross-alignment. In Advances in Neural Information Processing Systems, pages 6830–6841, 2017.

[5] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019.

Laugier, L. (IP Paris) Presentation October 15, 2020 36 / 42

slide-52
SLIDE 52

References III

[6] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019.

[7] Nitish Shirish Keskar, Bryan McCann, Lav Varshney, Caiming Xiong, and Richard Socher. CTRL: A Conditional Transformer Language Model for Controllable Generation. arXiv preprint arXiv:1909.05858, 2019.

[8] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.

Laugier, L. (IP Paris) Presentation October 15, 2020 37 / 42

slide-53
SLIDE 53

References IV

[9] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. CoRR, abs/1706.03762, 2017.

[10] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

[11] Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, et al. Universal sentence encoder. arXiv preprint arXiv:1803.11175, 2018.

Laugier, L. (IP Paris) Presentation October 15, 2020 38 / 42

slide-54
SLIDE 54

References V

[12] Yulia Tsvetkov. Towards personalized adaptive NLP: Modeling output spaces in continuous-output language generation. 2019.

[13] Yoon Kim. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751, Doha, Qatar, October 2014. Association for Computational Linguistics.

[14] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 3104–3112. Curran Associates, Inc., 2014.

Laugier, L. (IP Paris) Presentation October 15, 2020 39 / 42

slide-55
SLIDE 55

References VI

[15] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In EMNLP, 2014.

[16] Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G. Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. CoRR, abs/1901.02860, 2019.

[17] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. XLNet: Generalized autoregressive pretraining for language understanding. CoRR, abs/1906.08237, 2019.

Laugier, L. (IP Paris) Presentation October 15, 2020 40 / 42

slide-56
SLIDE 56

References VII

[18] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692, 2019.

Laugier, L. (IP Paris) Presentation October 15, 2020 41 / 42

slide-57
SLIDE 57

DIG Seminar: Civil Rephrases Of Toxic Texts With Self-Supervised Transformers

Léo Laugier1, John Pavlopoulos2,3, Jeffrey Sorensen4, Lucas Dixon4, Thomas Bonald1

1Télécom Paris, Institut Polytechnique de Paris 2Athens University of Economics & Business 3Stockholm University 4Google

October 15, 2020

Laugier, L. (IP Paris) Presentation October 15, 2020 42 / 42