
SLIDE 1

Exploring the Limits of Transfer Learning with a unified Text-to-Text Transformer

Presented by: Vipul Rathore
Elements and images borrowed from Raffel et al., 2019

SLIDE 2

Transfer Learning: Background

  • Pre-train a model on a data-rich task (Unsupervised), e.g. word2vec (Mikolov et al., 2013a,b) and GloVe
  • Fine-tune on a downstream task (Supervised)
  • Pre-training gives a model “general-purpose abilities” that can be “transferred” to downstream tasks

SLIDE 3

[Figure via towardsdatascience.com]

SLIDE 4

Multi-task learning: Classical Paradigm

[Diagram: a shared input and shared layers feed task-specific layers for Task A, Task B, and Task C, each trained with its own task-specific loss function (LA, LB, LC)]

SLIDE 5

Multi-task learning: Classical Paradigm

  • Task-specific loss function
  • Task-specific architectural layers
SLIDE 6

T5 (Text-to-Text Transfer Transformer): Idea

  • Pre-train a Transformer encoder-decoder model on a large corpus of unlabeled web-crawled text
  • Pose every NLP task as text-to-text (McCann et al., 2018; Radford et al., 2019)
  • Fine-tune separately for each downstream task (fine-tuning runs can be done in parallel)

SLIDE 7

Multi-task learning: T5 Paradigm

[Diagram: a single pre-trained model is fine-tuned for Task A, Task B, and Task C, using the same loss function and the same hyperparameters for every task]

SLIDE 8

Multi-task learning: T5 Paradigm

  • Cross-entropy / maximum-likelihood loss for all pre-training and fine-tuning tasks
  • Same hyperparameters for each task
  • “Unified” vocabulary
SLIDE 9

Unified Text-to-Text view

SLIDE 10

Pre-training Dataset: Colossal Clean Crawled Corpus

  • Goal: analyze the effect of the quality, characteristics, and size of the unlabeled data
  • Source: https://commoncrawl.org/ (~20 TB of scraped text per month, very noisy)
  • Data cleaning using heuristics (see the sketch below):
    ○ Only retain lines ending in a terminal punctuation mark (“.”, “!”, “?”, etc.)
    ○ Remove pages containing obscene words
    ○ Remove pages containing JavaScript code
    ○ Remove duplicate sentences
    ○ Retain only English webpages
  • Resulting corpus: ~750 GB
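A rough Python sketch of cleaning heuristics in this spirit (illustrative only, not the actual C4 pipeline; the real rules, blocklists, and language filter differ in detail):

    # Illustrative C4-style cleaning: keep only "sentence-like" lines, drop pages
    # containing JavaScript or blocklisted words, and de-duplicate lines.
    TERMINAL_PUNCT = (".", "!", "?", '"', "'")
    BLOCKLIST = {"badword1", "badword2"}   # stand-in for a real obscenity blocklist

    def clean_page(text, seen_lines):
        lowered = text.lower()
        if "javascript" in lowered or any(w in lowered for w in BLOCKLIST):
            return None                    # drop the whole page
        kept = []
        for line in text.splitlines():
            line = line.strip()
            if not line.endswith(TERMINAL_PUNCT):
                continue                   # keep only lines ending in terminal punctuation
            if line in seen_lines:
                continue                   # drop duplicate lines
            seen_lines.add(line)
            kept.append(line)
        return "\n".join(kept) if kept else None

    seen = set()
    page = "This is a clean sentence.\nhome | products | about\nAnother good sentence!"
    print(clean_page(page, seen))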
SLIDE 11

Fine-tuning (Downstream) tasks

  • Text classification: GLUE and SuperGLUE
  • Abstractive summarization: CNN/Daily Mail
  • QA: SQuAD
  • Translation: WMT English to German, French, and Romanian

SLIDE 12

Input & Output

  • “text-to-text” format
  • Consistent training objective: maximum likelihood
  • Task-specific (text) prefix (see the sketch below)
  • Mismatched-label issue
    ○ e.g. given a premise and a hypothesis, classify into one of 3 categories: ‘entailment’, ‘contradiction’, or ‘neutral’
    ○ In principle, the decoder could output a string that is not a valid label (e.g. ‘hamburger’)
    ○ This issue was never observed with their trained models
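A minimal Python sketch of this prefixed text-to-text format (not the authors' code; the prefix strings below are illustrative, in the style used by the paper):

    # Cast different tasks into a single input-string -> output-string format
    # by prepending a task-specific text prefix.
    def to_text_to_text(task, example):
        if task == "translate_en_de":
            src = "translate English to German: " + example["en"]
            tgt = example["de"]
        elif task == "nli":
            src = ("mnli premise: " + example["premise"]
                   + " hypothesis: " + example["hypothesis"])
            tgt = example["label"]     # "entailment" / "contradiction" / "neutral"
        elif task == "summarize":
            src = "summarize: " + example["document"]
            tgt = example["summary"]
        else:
            raise ValueError("unknown task: " + task)
        return src, tgt

    print(to_text_to_text("translate_en_de",
                          {"en": "That is good.", "de": "Das ist gut."}))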

SLIDE 13

Input & Output

  • Regression task
    ○ Predict a score between 1 and 5
    ○ Convert to 21-class classification, i.e. round the target floating-point score to the nearest multiple of 0.2 and convert it into a string (see the sketch below)
    ○ At inference, convert the predicted string back into a floating-point number
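A minimal Python sketch of this regression-as-text conversion (illustrative, not the authors' code):

    # Round the score to the nearest multiple of 0.2, emit it as a string during
    # training, and parse the generated string back into a float at inference.
    def score_to_target(score):
        return "%.1f" % (round(score * 5) / 5)   # nearest multiple of 0.2, as text

    def target_to_score(text):
        return float(text)

    print(score_to_target(3.27))    # "3.2"
    print(target_to_score("3.2"))   # 3.2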

SLIDE 14

Input & Output

  • Winograd tasks (pronoun disambiguation) (see the sketch below)
    ○ Input: the text with the ambiguous pronoun highlighted, e.g. “The city councilmen refused the demonstrators a permit because *they* feared violence.”
    ○ Output: the target noun, e.g. “The city councilmen”
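A minimal Python sketch of preparing such an example (illustrative; the paper describes this highlighted-pronoun input and noun-phrase target format):

    # Highlight the ambiguous pronoun with asterisks; the target is the referent.
    def make_winograd_example(text, pronoun, referent):
        start = text.index(pronoun)
        end = start + len(pronoun)
        source = text[:start] + "*" + text[start:end] + "*" + text[end:]
        return source, referent

    sentence = ("The city councilmen refused the demonstrators a permit "
                "because they feared violence.")
    src, tgt = make_winograd_example(sentence, "they", "The city councilmen")
    print(src)
    print(tgt)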

SLIDE 15

Empirical Survey

Methodology: “coordinate descent”

Baseline → Architecture → Objective → Dataset → Transfer Approach → Scaling

SLIDE 16

Baseline

  • Encoder-decoder architecture as in the original Transformer paper (Vaswani et al., 2017)
  • Relative positional self-attention (Shaw et al., 2018) (see the sketch below)
    ○ RelativeAttention = Softmax((QKᵀ + Srel) / √dk) V
    ○ Srel is shared across layers for a given attention head, but differs across attention heads within a layer
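A minimal NumPy sketch of attention with a relative-position bias added to the logits (illustrative, not the exact parameterization used in the paper):

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def relative_attention(Q, K, V, rel_bias):
        # Q, K, V: (seq_len, d_k); rel_bias: (seq_len, seq_len) learned bias
        # indexed by the relative offset between query and key positions.
        d_k = Q.shape[-1]
        logits = Q @ K.T / np.sqrt(d_k) + rel_bias
        return softmax(logits) @ V

    rng = np.random.default_rng(0)
    seq_len, d_k = 5, 8
    Q, K, V = (rng.normal(size=(seq_len, d_k)) for _ in range(3))
    offsets = np.arange(seq_len)[None, :] - np.arange(seq_len)[:, None]
    bias_per_offset = rng.normal(size=2 * seq_len - 1)   # one learned scalar per offset
    rel_bias = bias_per_offset[offsets + seq_len - 1]
    print(relative_attention(Q, K, V, rel_bias).shape)   # (5, 8)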

SLIDE 17

Baseline

  • Pre-training objective: denoising (randomly drop out 15% of the tokens)
  • BERT-base-sized encoder and decoder (L=12, H=768, A=12)
  • Multilingual vocabulary: SentencePiece with 32k word pieces
SLIDE 18

Baseline (Pre-training Details)

  • Max sequence length: 512 tokens
  • Batch size: 128 sequences = 128 × 512 = 2¹⁶ tokens per batch
  • Pre-training length: 2¹⁹ steps = 2¹⁹ × 2¹⁶ = 2³⁵ ≈ 34B tokens << BERT (137B) << RoBERTa (2.2T)
  • Inverse-square-root learning-rate schedule, lr = 1/√max(n, k), with k = 10⁴ warm-up steps (see the sketch below)
  • AdaFactor optimizer
  • Dropout: 0.1
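A minimal Python sketch of that schedule (lr(n) = 1/√max(n, k): constant at 1/√k during the k warm-up steps, then decaying as 1/√n):

    import math

    def inverse_sqrt_lr(step, warmup_steps=10_000):
        return 1.0 / math.sqrt(max(step, warmup_steps))

    for step in (1, 10_000, 40_000, 524_288):   # 524,288 = 2^19, the total pre-training steps
        print(step, round(inverse_sqrt_lr(step), 6))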
SLIDE 19

Baseline (Fine-tuning Details)

  • Batch size: 128 sequences
  • Max sequence length: 512 tokens
  • Fine-tuning length: 2¹⁸ steps = 2¹⁸ × 2¹⁶ = 2³⁴ tokens
  • Constant learning rate: 0.001
  • Checkpoint saved every 5,000 steps
SLIDE 20

Baseline Performance

SLIDE 21

Empirical Survey

Methodology: “coordinate descent”

Baseline → Architecture → Objective → Dataset → Transfer Approach → Scaling

SLIDE 22

Types of Self-attention

SLIDE 23

Architectural Variants

SLIDE 24
  • Encoder-decoder
    ○ Baseline
  • Language model
    ○ Used in transfer learning as a pre-trained model with a language-modeling objective (Radford et al., 2018)
  • Prefix LM
    ○ Suited for classification tasks, e.g. Input: “mnli premise: I hate pigeons. hypothesis: My feelings towards pigeons are filled with animosity. target:”, Output: “entailment”

SLIDE 25

Prefix LM

[Diagram: prefix LM for classification: the input tokens x1 … x4 are read with fully-visible attention, and the model then predicts the target y (see the mask sketch below)]

Similar to the CLS token in BERT!
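A minimal NumPy sketch of the attention-mask patterns behind these variants (fully-visible for the encoder, causal for the language model, prefix-causal for the prefix LM); entries are 1 where attention is allowed:

    import numpy as np

    def fully_visible_mask(n):
        return np.ones((n, n), dtype=int)

    def causal_mask(n):
        return np.tril(np.ones((n, n), dtype=int))

    def prefix_causal_mask(n, prefix_len):
        # Fully-visible over the prefix, causal over the rest.
        mask = causal_mask(n)
        mask[:, :prefix_len] = 1   # every position may attend to the whole prefix
        return mask

    print(prefix_causal_mask(6, prefix_len=3))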

SLIDE 26

Model Architectures: Results

  • Surprisingly, the encoder-decoder with shared parameters performs nearly as well as the baseline and better than the prefix LM (cf. ALBERT, XLNet)
  • An explicit encoder-decoder structure can be useful
  • Denoising objective > LM objective
SLIDE 27

Empirical Survey

Methodology: “coordinate descent”

Baseline → Architecture → Objective → Dataset → Transfer Approach → Scaling

SLIDE 28

Pre-training: BERT-style vs. non-BERT-style objectives

SLIDE 29

Variants of Masked LM

  • BERT-style: corrupt 15% of tokens (90% replaced with MASK, 10% with random tokens) → output the original full text
  • MASS-style: corrupt 15% of tokens (100% replaced with MASK) → output the original full text
  • Replace corrupted spans (see the sketch below): input “Thank you <X> me to your party <Y> week .” → output “<X> for inviting <Y> last <Z>”
  • Drop corrupted tokens: input “Thank you me to your party week .” → output “for inviting last”
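A minimal Python sketch of the “replace corrupted spans” objective (illustrative, not the authors' code): chosen spans are replaced by sentinel tokens in the input, and the target lists each sentinel followed by the dropped span:

    def corrupt_spans(tokens, spans):
        # spans: sorted, non-overlapping (start, end) index pairs to corrupt.
        sentinels = ["<%s>" % chr(ord("X") + i) for i in range(len(spans) + 1)]
        inputs, targets, cursor = [], [], 0
        for i, (start, end) in enumerate(spans):
            inputs += tokens[cursor:start] + [sentinels[i]]
            targets += [sentinels[i]] + tokens[start:end]
            cursor = end
        inputs += tokens[cursor:]
        targets += [sentinels[len(spans)]]   # final sentinel closes the target
        return " ".join(inputs), " ".join(targets)

    tokens = "Thank you for inviting me to your party last week .".split()
    print(corrupt_spans(tokens, [(2, 4), (8, 9)]))
    # ('Thank you <X> me to your party <Y> week .', '<X> for inviting <Y> last <Z>')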

SLIDE 30

Results

SLIDE 31

Results

  • Corruption rate:
    ○ Results are not very sensitive to it

SLIDE 32

Results

  • Token-level vs. span-level corruption
    ○ Slight improvement with an average span length of 3

SLIDE 33
Message

  • A small modification to the masked language model objective may not lead to significant improvement
  • Try something different!

SLIDE 34

Empirical Survey

Methodology: “coordinate descent”

Baseline → Architecture → Objective → Dataset → Transfer Approach → Scaling

SLIDE 35

Pre-training Datasets

  • C4: Common Crawl with heuristic filtering
  • Unfiltered C4: Common Crawl, only using langdetect to extract English text
  • RealNews-like: omit any non-news content from C4
  • WebText-like (GPT-2-like): webpages from C4 with a high Reddit score
  • Wikipedia
  • Wikipedia + Toronto Books Corpus (as used by BERT)
SLIDE 36

Pre-training Datasets

  • Pre-training on in-domain unlabeled data can improve performance on downstream tasks

SLIDE 37

Varying No. of epochs

  • Keeping the total number of training steps constant

SLIDE 38

Empirical Survey

Methodology: “coordinate descent”

Baseline → Architecture → Objective → Dataset → Transfer Approach → Scaling

SLIDE 39

Fine-tuning

  • Adapter layers (Houlsby et al., 2019) (see the sketch below):
    ○ Only the adapter layers are updated during fine-tuning
[Diagram: an adapter block with a small d-dimensional bottleneck inserted alongside the dff-dimensional feed-forward layer]
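A minimal NumPy sketch of an adapter block in the spirit of Houlsby et al. (illustrative): a down-projection to a small inner dimension d, a nonlinearity, an up-projection back to the model dimension, and a residual connection; only these small matrices would be updated during fine-tuning:

    import numpy as np

    def adapter(x, W_down, W_up):
        # x: (seq_len, d_model); W_down: (d_model, d); W_up: (d, d_model)
        h = np.maximum(x @ W_down, 0.0)   # down-project + ReLU
        return x + h @ W_up               # up-project + residual connection

    rng = np.random.default_rng(0)
    d_model, d_adapter, seq_len = 768, 64, 4
    x = rng.normal(size=(seq_len, d_model))
    out = adapter(x,
                  0.02 * rng.normal(size=(d_model, d_adapter)),
                  0.02 * rng.normal(size=(d_adapter, d_model)))
    print(out.shape)   # (4, 768)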

SLIDE 40
  • Gradual unfreezing (ULMFiT):
    ○ First unfreeze the last layer (which contains the least general knowledge), then the next lower layer, and so on
    ○ Scope for better unfreezing schedules
  • Data-hungry tasks ⇒ a higher value of the adapter dimensionality d works better
SLIDE 41

Multi-task learning

  • Mixing datasets for all fine-tuning tasks (see the sketch below), where r_m is the mixing rate for task m, s_m the number of examples in its dataset, and K an artificial dataset-size limit:
    ○ Equal mixing: r_m ∝ 1
    ○ Examples-proportional mixing: r_m ∝ min(s_m, K)
    ○ Temperature-scaled mixing (as in multilingual BERT): r_m ∝ min(s_m, K)^(1/T)
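A minimal Python sketch of these mixing strategies (illustrative; the default K below is just an example value):

    def mixing_rates(sizes, strategy="examples", K=2**19, T=2.0):
        # sizes[m]: number of examples in task m's dataset; returns normalized
        # sampling proportions r_m for the three strategies above.
        if strategy == "equal":
            raw = [1.0 for _ in sizes]
        elif strategy == "examples":
            raw = [min(s, K) for s in sizes]
        elif strategy == "temperature":
            raw = [min(s, K) ** (1.0 / T) for s in sizes]
        else:
            raise ValueError(strategy)
        total = sum(raw)
        return [r / total for r in raw]

    sizes = [400_000, 8_500, 105_000]   # toy dataset sizes for three tasks
    print(mixing_rates(sizes, "examples"))
    print(mixing_rates(sizes, "temperature", T=4.0))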

SLIDE 42

Combining multi-task learning with fine-tuning

SLIDE 43

Empirical Survey

Methodology: “coordinate descent”

Baseline → Architecture → Objective → Dataset → Transfer Approach → Scaling

SLIDE 44
  • Allowed compute budget: 4×
    ○ Increasing both the training time and the model size can be complementary
  • Scaling model size: the main idea is to increase dff substantially
    ○ TPUs are efficient for dense tensor multiplications

SLIDE 45

State-of-the-Art

Baseline → Architecture → Objective → Dataset → Transfer Approach → Scaling

SLIDE 46

Model

  • Objective: span corruption (as in SpanBERT) with mean span length 3
  • Longer training: 1M steps with batch size 2048 → ≈ 1T tokens
    ○ ≈ 8× BERT, ≈ 2× XLNet, ≈ ½× RoBERTa
  • Model sizes:
    ○ Small: 60M, Base: 220M, Large: 770M, XLarge: 3B, XXLarge: 11B parameters
  • Multi-task pre-training (as in MT-DNN):
    ○ Allows monitoring downstream task performance during pre-training
  • Fine-tuning on GLUE and SuperGLUE: batch size 8
SLIDE 47
SLIDE 48

Takeaways

  • The text-to-text framework is comparable to task-specific architectures
  • Original encoder-decoder ≈ shared-parameter encoder-decoder
  • Denoising objectives > LM objective
  • Pre-training on in-domain unlabeled data is useful for a few downstream tasks
  • Scaling could be most useful when both model size and training steps are increased
  • Pushing the limits (11B parameters) of Transformer-like architectures can help achieve SOTA

SLIDE 49

Cons

  • Not language-agnostic (Atishya, Sankalan, Pratyush, Soumya, Jigyasa)

  • Large carbon footprints (Keshav, Rajas, Saransh)
  • Saturation point of size still not known (Jigyasa)
  • Not much different from BERT (Siddhant, Rajas)
  • Better data cleaning heuristics (Pratyush, Keshav)
SLIDE 50

Possible extensions

  • Extending to graphs (KBs) [Keshav, Atishya]
    ○ Leverage OpenIE to construct graphs, with clustering of related paragraphs
    ○ Pre-training task: predict a sentence from the graph given its neighbouring ones
    ○ Leverage graph transformers (Yun et al., 2019) for fine-tuning
  • Alternatives to gradual unfreezing [Rajas, Saransh]
    ○ RL-based approach
  • Balance the scalability vs. performance trade-off in practical settings [Shubham, Lovish]

SLIDE 51

Possible extensions

  • Multi-lingual learning [Pratyush, Sankalan]

    ○ Lakew et al., 2018

SLIDE 52

Thank You !!!