Intermediate Task Transfer (CS685 Fall 2020: Advanced Natural Language Processing) - PowerPoint PPT Presentation



SLIDE 1

intermediate task transfer

CS685 Fall 2020

Advanced Natural Language Processing

Mohit Iyyer

College of Information and Computer Sciences University of Massachusetts Amherst many slides from Tu Vu

SLIDE 2

Stuff from last time

  • Too many readings!
  • The mythical HW1
  • Extra credit!
SLIDE 3

What is a task?

  • a description
  • a (sample) dataset


SLIDE 4

Tasks can help each other!

  • classification: supplementing language model (LM)-style pretraining with further training on intermediate tasks leads to improvements and reduced variance (Phang et al., 2019; arXiv)
  • sequence labeling: pretraining on a closely related task yields better performance than LM pretraining when the pretraining dataset is fixed (Liu et al., 2019; NAACL)
  • machine comprehension: pretraining on multiple related datasets leads to robust generalization and transfer (Talmor and Berant, 2019; ACL)

SLIDE 5
  • Discover the space of language tasks
    • properties of individual tasks
    • task similarities and beneficial relations among tasks
  • Practical applications
    • reduce the need for supervision among related tasks
    • multi-task learning: solve many tasks in one system
    • transfer learning: select source tasks for a given task
SLIDE 6

A real-world scenario

[Diagram: an end user submits a new task (task description + sample data) to a company's cloud service, which maintains a task bank; the service returns a structure among tasks, i.e., tasks related to the end user's task and efficient supervision policies.]

SLIDE 7

There are tons of NLP tasks!

  • ~ 100 tasks/datasets from various classes of problems

  • Single Sentence Classification: CoLA, SST-2, 20 Newsgroups, TREC-6, IMDB, Yelp-2, Yelp-full, AG, DBPedia, Sogou News, …
  • Sentence Pair Classification: MRPC, STS-B, QQP, MNLI, QNLI, RTE, WNLI, BoolQ, CB, WiC, …
  • Machine Comprehension: SQuAD, NewsQA, SearchQA, TriviaQA, HotpotQA, CQ, CWQ, ComQA, WikiHop, DROP, …
  • Sequence Labeling: CCG, POS, Chunk, NER, ST, GED, PS, EF, Parent, Conj, …
  • Unsupervised Learning: LM, autoencoding, next sentence, real/fake, discourse relations, …
  • Probing Tasks: SentLen, WC, TreeDepth, TopConst, BShift, Tense, SubjNum, ObjNum, SOMO, CoordInv, …

SLIDE 8

Taskonomy for vision tasks

  • Zamir et al. (2018); CVPR: a library of 26 tasks covering common themes in computer vision (2D, 3D, semantics, etc.)

SLIDE 9

A research question

  • What criteria can be used to predict which combinations of source/intermediate and target tasks should work well?

SLIDE 10

Create task embeddings

  • fixed-length dense vector representations of tasks
  • the vector space can tell us how closely related two tasks are (i.e., via cosine distance)
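As a concrete illustration of measuring task relatedness in the embedding space, here is a minimal cosine-similarity sketch; the vectors below are tiny toy stand-ins for real task embeddings, which are much higher-dimensional:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two task embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional "task embeddings" (illustrative values only).
emb_mnli = np.array([0.9, 0.1, 0.4, 0.2])
emb_snli = np.array([0.8, 0.2, 0.5, 0.1])
emb_pos  = np.array([0.1, 0.9, 0.0, 0.7])

print(cosine_similarity(emb_mnli, emb_snli))  # high: closely related tasks
print(cosine_similarity(emb_mnli, emb_pos))   # lower: less related tasks
```

Cosine similarity is scale-invariant, so it compares the direction of two embeddings rather than their magnitudes.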

SLIDE 11

[Figure: visualization of task embedding similarities for ten tasks: MNLI, SNLI, QQP, SST-2, RTE, MRPC, QNLI, WNLI, STS-B, CoLA]

SLIDE 12

Previous work on exploring the relations between NLP tasks

  • Talmor and Berant (2019); ACL: 10 main reading comprehension tasks
  • Bingel and Søgaard (2017); EACL: 10 main sequence labeling tasks, 90 task pairs for multi-task learning

SLIDE 13

A simple approach

[Diagram: a task description (Tok 1, Tok 2, …, Tok N) fed through the base network to produce a task embedding]

  • use the task description only (i.e., a paragraph describing the task)
  • Limitation: requires a clear description for each task in the library
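A minimal sketch of the description-only approach, with deterministic toy per-token vectors standing in for a real encoder such as BERT (a real system would feed the tokens through the base network and pool its hidden states):

```python
import numpy as np

def embed_description(description: str, dim: int = 8) -> np.ndarray:
    """Toy task embedding from a task description: mean-pool one vector
    per token. The per-token vectors here are pseudo-random stand-ins
    for contextual encoder states (illustrative only)."""
    def token_vec(tok: str) -> np.ndarray:
        # Deterministic (within a run) vector per token, seeded by its hash.
        return np.random.default_rng(abs(hash(tok)) % (2**32)).normal(size=dim)
    tokens = description.lower().split()
    return np.mean([token_vec(t) for t in tokens], axis=0)

emb = embed_description("Given a premise and a hypothesis, predict entailment.")
print(emb.shape)  # (8,)
```

The limitation from the slide shows up directly here: a vague or missing description yields an embedding that tells us little about the task.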

SLIDE 14

Gradient-based methods

[Diagram: input text (Tok 1, Tok 2, …, Tok N) fed through the base network with a task-specific classifier layer on top]

  • use a single base network
  • add a task-specific layer for a given task
  • pass the entire dataset forward through the network only once
  • during backpropagation: either use training labels or sample from the model's predictive distribution to compute gradients w.r.t. the model's parameters (weights) or outputs (activations)
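The recipe above can be sketched on a toy model. Here a linear classifier stands in for BERT, and the task embedding is the average of squared per-example gradients of the log-likelihood under the training labels (an empirical-Fisher-style embedding); all names and data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the shared base network: a linear classifier.
W = rng.normal(size=(4,))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def task_embedding(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Average squared per-example gradients of log p(y|x) w.r.t. the
    shared weights W: one forward/backward sweep over the dataset."""
    fisher = np.zeros_like(W)
    for x_i, y_i in zip(X, y):
        p = sigmoid(W @ x_i)        # forward pass
        grad = (y_i - p) * x_i      # gradient of log-likelihood w.r.t. W
        fisher += grad ** 2         # accumulate squared gradients
    return fisher / len(X)

# Two illustrative "tasks" = two different labelings of the same inputs.
X = rng.normal(size=(32, 4))
emb_a = task_embedding(X, (X[:, 0] > 0).astype(float))  # task A uses feature 0
emb_b = task_embedding(X, (X[:, 1] > 0).astype(float))  # task B uses feature 1
print(emb_a, emb_b)  # tasks with different structure get different embeddings
```

In the actual method the same idea is applied to BERT's layer-wise gradients, so the embedding records which parts of the network each task relies on.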
SLIDE 15

What is the base network?

  • a pre-trained model, e.g., BERT, XLNet, RoBERTa

SLIDE 16
How to get gradient information?

  • use training labels
    • original gradients
    • use the empirical Fisher
  • sample from the model's predictive distribution
    • original gradients
    • use the theoretical Fisher

SLIDE 17

Various gradient types

[Diagram: the BERT architecture, annotated with the weights and activations from which gradients can be taken: the embedding layer (word, position, and segment embeddings + LayerNorm); N encoder layers, each with multi-head attention (queries, keys, values; dense + LayerNorm) and a feed-forward block (dense, dense + LayerNorm); the pooler layer; and the outputs (multi-head attention outputs MH 1…MH N, layer outputs L 1…L N, pooled output P).]

SLIDE 18
  • 1. given a target task of interest, compute a task embedding from BERT's layer-wise gradients
  • 2. identify the most similar source task embedding from a precomputed library
  • 3. fine-tune BERT on the selected source task
  • 4. fine-tune the resulting model on the target task

[Diagram: a precomputed library of source task embeddings (SQuAD, SST-2, DROP, MNLI, QNLI, POS-PTB, CCG, WikiHop), with WikiHop as the example target task]
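Steps 1-2 amount to a nearest-neighbor lookup over the embedding library; a minimal sketch, where the embeddings are toy vectors rather than real gradient features:

```python
import numpy as np

def pick_source_task(target_emb: np.ndarray, library: dict) -> str:
    """Return the library task whose embedding has the highest cosine
    similarity with the target task embedding."""
    def cos(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(library, key=lambda name: cos(target_emb, library[name]))

# Illustrative precomputed library of source task embeddings (toy vectors).
library = {
    "SQuAD":   np.array([0.9, 0.1, 0.2]),
    "MNLI":    np.array([0.1, 0.9, 0.3]),
    "POS-PTB": np.array([0.2, 0.1, 0.9]),
}
target = np.array([0.8, 0.2, 0.3])  # embedding of the new target task
print(pick_source_task(target, library))  # → SQuAD
```

The selected task then serves as the intermediate fine-tuning step (step 3) before fine-tuning on the target task itself (step 4).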


SLIDE 22

CoLA SST-2 MRPC STS-B QQP MNLI QNLI RTE WNLI SNLI SciTail SQuAD-1.1 SQuAD-2.0 NewsQA HotpotQA BoolQ DROP WikiHop DuoRC-p DuoRC-s CQ ComQA CCG POS-PTB POS-EWT Parent GParent GGParent ST Chunk NER GED Conj

SLIDE 23

[Figure: target task performance in the LIMITED → LIMITED regime. For each target task (WNLI, CoLA, RTE, MRPC, MNLI, STS-B, QQP, SNLI, QNLI, SST-2, SciTail, DROP, CQ, DuoRC-p, WikiHop, ComQA, DuoRC-s, NewsQA, BoolQ, HotpotQA, SQuAD-2, SQuAD-1, GED, Conj, GGParent, GParent, NER, Parent, CCG, ST, POS-EWT, POS-PTB, Chunk), the chart shows the best source task and the source task chosen by TaskEmb against a no-transfer baseline, with tasks grouped into CR, QA, and SL classes. Performance axis: 20-80.]

SQuAD-2 is no longer the best source task for any QA targets in this regime. QA tasks are good sources for CR targets.
