Transfer learning with neural language models (CS 685, Spring 2020)

SLIDE 1

Transfer learning with neural language models

CS 685, Spring 2020

Advanced Natural Language Processing

Mohit Iyyer

College of Information and Computer Sciences, University of Massachusetts Amherst

many slides from Jacob Devlin & Matt Peters

SLIDE 2

Stuff from last time…

  • Project proposals due 9/21, please use the Overleaf template
  • Still working on making the next homework computationally feasible on Colab, look out for it next week
  • Please ask other questions (about logistics / material / etc.) in the chatbox!


SLIDE 3

Do NNs really need millions of labeled examples?

  • Can we leverage unlabeled data to cut down on the number of labeled examples we need?

SLIDE 4

What is transfer learning?

  • In our context: take a network trained on a task for which it is easy to generate labels, and adapt it to a different task for which it is harder.
  • In computer vision: train a CNN on ImageNet, transfer its representations to every other CV task.
  • In NLP: train a really big language model on billions of words, transfer to every NLP task!


SLIDE 5

step 1 (unsupervised pretraining): a ton of unlabeled text → a huge self-supervised model
step 2 (supervised fine-tuning): labeled reviews from IMDB → a sentiment-specialized model
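
To make the two-step recipe concrete, here is a minimal sketch of step 2 in Python. The Hugging Face transformers library and the bert-base-uncased checkpoint are assumptions for illustration, standing in for whatever model step 1 produced; the slides do not prescribe a library or checkpoint.

```python
# Step 2 (supervised fine-tuning) on a toy batch of labeled reviews.
# Assumption: a generic pretrained checkpoint stands in for the output of step 1.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)          # 2 classes: negative / positive

reviews = ["A wonderful, moving film.", "Two hours I will never get back."]
labels = torch.tensor([1, 0])                   # toy IMDB-style sentiment labels

batch = tokenizer(reviews, padding=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
out = model(**batch, labels=labels)             # cross-entropy loss over the labels
out.loss.backward()                             # one fine-tuning step
optimizer.step()
```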

SLIDE 6

language models for transfer learning

Deep contextualized word representations. Peters et al., NAACL 2018

SLIDE 7

Previous methods (e.g., word2vec) represent each word type with a single vector

play = [0.2, -0.1, 0.5, ...]   bank = [-0.3, 1.4, 0.7, ...]   run = [-0.5, -0.3, -0.1, ...]

NNs are then used to compose those vectors over longer sequences
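
What "one vector per word type" looks like in code: a single fixed vector, looked up regardless of context. The tiny table below reuses the vectors from this slide purely for illustration; a real word2vec model would learn them from a large corpus.

```python
import numpy as np

# toy static embedding table: one fixed vector per word *type*
embeddings = {
    "play": np.array([0.2, -0.1, 0.5]),
    "bank": np.array([-0.3, 1.4, 0.7]),
    "run":  np.array([-0.5, -0.3, -0.1]),
}

# every occurrence of "play" gets the same vector, whatever the sentence
sent1 = "the new look play area is due to be completed".split()
sent2 = "the freshman completed the three point play".split()
vec1 = embeddings["play"]            # "play" in a playground sense
vec2 = embeddings["play"]            # "play" in a basketball sense
assert np.array_equal(vec1, vec2)    # identical: context is ignored
```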

SLIDE 8

The new-look play area is due to be completed by early spring 2010 .

Single vector per word

SLIDE 9

Gerrymandered congressional districts favor representatives who play to the party base .

Single vector per word

SLIDE 10

The freshman then completed the three-point play for a 66-63 lead .

Single vector per word

SLIDE 11

Nearest neighbors

play = [0.2, -0.1, 0.5, ...]

playing, plays, game, player, games, Play, played, football, players, multiplayer
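
Neighbor lists like this come from cosine similarity in the embedding space. A minimal sketch of how such a list is computed, with random toy vectors standing in for trained word2vec embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["play", "playing", "plays", "game", "player", "football", "bank", "run"]
# toy embedding matrix standing in for trained word2vec vectors (one row per word type)
E = rng.normal(size=(len(vocab), 50))
E /= np.linalg.norm(E, axis=1, keepdims=True)   # unit-normalize rows

def nearest_neighbors(word, k=5):
    q = E[vocab.index(word)]
    sims = E @ q                                # cosine similarity to every word
    order = np.argsort(-sims)                   # most similar first
    return [vocab[i] for i in order if vocab[i] != word][:k]

print(nearest_neighbors("play"))
```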

SLIDES 12-14

Multiple senses entangled

play = [0.2, -0.1, 0.5, ...]

Nearest neighbors: playing, plays, game, player, games, Play, played, football, players, multiplayer

(the neighbors mix VERB, NOUN, and ADJ uses of "play", all collapsed into a single vector)

SLIDES 15-17

Examples on iPad (no other slide text)

SLIDES 18-25

Deep bidirectional language model

[Figure, built up over eight slides: stacked forward and backward LSTM layers read the context "… download new games or play ??" and predict the missing word from both directions.]
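
A minimal PyTorch sketch of what this figure depicts, with toy dimensions; the real ELMo biLM adds character convolutions and two stacked LSTM layers per direction, so treat this only as the skeleton of the idea.

```python
# Sketch of a one-layer bidirectional language model: a forward LSTM predicts
# the next word, a backward LSTM predicts the previous word. Vocab size, hidden
# size, and the random token ids are toy assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

V, d = 10_000, 256
emb = nn.Embedding(V, d)
fwd_lstm = nn.LSTM(d, d, batch_first=True)      # reads left-to-right
bwd_lstm = nn.LSTM(d, d, batch_first=True)      # reads right-to-left
out_proj = nn.Linear(d, V)                      # softmax layer over the vocabulary

tokens = torch.randint(0, V, (1, 6))            # stands in for "… download new games or play"
x = emb(tokens)

h_fwd, _ = fwd_lstm(x)                          # h_fwd[:, t] has seen tokens 0..t
h_bwd, _ = bwd_lstm(x.flip(1))
h_bwd = h_bwd.flip(1)                           # h_bwd[:, t] has seen tokens t..end

# forward LM: predict token t+1 from h_fwd[:, t]; backward LM: predict token t-1 from h_bwd[:, t]
loss = F.cross_entropy(out_proj(h_fwd[:, :-1]).reshape(-1, V), tokens[:, 1:].reshape(-1)) + \
       F.cross_entropy(out_proj(h_bwd[:, 1:]).reshape(-1, V), tokens[:, :-1].reshape(-1))
loss.backward()
```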

SLIDE 26

Use all layers of the language model

[Figure: biLSTM layers over "… games or play online via …"; the per-layer outputs are combined with weights 0.25, 0.60, and 0.15 to form ELMo (embeddings from language models) vectors.]

SLIDE 27

Learned task-specific combination of layers

[Figure: the same biLSTM layers over "… games or play online via …", but the mixing weights s1, s2, s3 are now learned per downstream task when forming the ELMo (embeddings from language models) vectors.]
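
In the ELMo paper this mixing is ELMo_k = gamma * sum_j softmax(s)_j * h_{k,j}, where h_{k,j} is the layer-j biLM representation of token k, s holds one learned scalar per layer, and gamma is a learned task-specific scale. A minimal sketch with toy dimensions:

```python
# Task-specific combination of frozen biLM layers (dimensions are toy assumptions).
import torch
import torch.nn as nn

L, T, d = 3, 6, 512                       # layers, tokens, hidden size
layer_reps = torch.randn(L, T, d)         # h_{k,j}: frozen biLM outputs for one sentence

s = nn.Parameter(torch.zeros(L))          # learned per-layer scalars (e.g., -> 0.25, 0.60, 0.15)
gamma = nn.Parameter(torch.ones(1))       # learned task-specific scale

weights = torch.softmax(s, dim=0)                                   # normalize layer weights
elmo = gamma * (weights[:, None, None] * layer_reps).sum(dim=0)     # (T, d): one vector per token
```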

SLIDE 28

Contextual representations

ELMo representations are contextual: they depend on the entire sentence in which a word is used.

how many different embeddings does ELMo compute for a given word?

SLIDE 29

ELMo improves NLP tasks

SLIDE 30

Large-scale recurrent neural language models learn contextual representations that capture basic elements of semantics and syntax. Adding ELMo to existing state-of-the-art models provides a significant performance improvement on all NLP tasks.

SLIDE 31

[Figure: diagram labeled "FROM" and "TO"; no other slide text.]

SLIDE 32

SLIDE 33

  • Why not?

SLIDE 34

SLIDE 35

SLIDE 36

  • What are the pros and cons of increasing k?

SLIDE 37

SLIDE 38

  • This has since been shown to be unimportant (and can be removed, e.g., in RoBERTa).

SLIDES 39-42

SLIDE 43

  • More details next week!

SLIDES 44-49