SLIDE 1

Machine Learning for NLP

Learning from small data: reading

Aurélie Herbelot 2018

Centre for Mind/Brain Sciences, University of Trento

SLIDE 2

High-risk learning

Today's reading: "High-risk learning: acquiring word vectors from tiny data", Herbelot & Baroni (2017)

SLIDE 3

Introduction

SLIDE 4

Learning Italian (for lazy people)

Il Nottetempo viaggiava nell’oscurità, mettendo in fuga parchimetri e cespugli, alberi e cabine del telefono.

The Knight Bus drove in complete darkness, scaring away parking meters and ___, trees and phone boxes.

SLIDE 5

Cespugli...

SLIDE 6

A high-risk strategy...

“Si, c’è una bruciatura sul tavolo” disse Ron indicando la macchia.
“Yes, there’s a ___ on the table”, said Ron, pointing at the stain.

wine? liqueur? inkpot?

... due pozioni contro le bruciature...
... two potions against inkpots...

SLIDE 7

A high-risk strategy...

“Si, c’è una bruciatura sul tavolo” disse Ron indicando la macchia.
“Yes, there’s a ___ on the table”, said Ron, pointing at the stain.

wine? liqueur? inkpot?

... due pozioni contro le bruciature...
... two potions against burns...

SLIDE 8

Fast mapping in your language

  • Fast mapping: the process whereby a new concept is learned from a single exposure.

  • Examples:
  • Language acquisition [not today!]
  • Dictionary definitions:

Tetraspores are red algae spores...

  • New words in naturally-occurring text:

The team needs a seeker for the next quidditch game.

SLIDE 9

The research question

  • Can we simulate fast mapping? Can we learn good word representations from tiny data?
  • Test in two conditions:
  • Definitions. Maximally informative (we hope!)
  • Natural occurrences of a nonce. Unclear whether the context is sufficient to learn a good representation.
  • Do it with distributional semantics.

SLIDE 10

Semantic spaces and Harry Potter

SLIDE 11

Vectors vs human meaning

Machine exposed to: 100M words (BNC), 2.6B words (UKWaC), 100B words (GoogleNews).

3-year-old child exposed to: 25M words (US), 20M words (Dutch), 5M words (Mayan) (Cristia et al 2017).

Humans learn much faster than machines. Owning data is not intelligence. We’ll never do fast mapping like that!

SLIDE 12

Some fast mapping tasks

SLIDE 13

The general task: learning a meaning

Putting a new point in the semantic space, in the right place!

SLIDE 14

The definitional dataset

  • Record all Wikipedia titles containing one word only (e.g. Albedo, Insulin).
  • Extract and tokenise the first sentence of the Wikipedia page corresponding to each target title:
    insulin is a peptide hormone produced by beta cells in the pancreas .
  • Replace the target with a slot:
    ___ is a peptide hormone produced by beta cells in the pancreas .
  • 1000 definitions, manually checked, split into 700/300 train/test sets. All target words have a frequency of at least 200 in UKWaC.

SLIDE 15

The definitional dataset: examples

pride: ___ is an inwardly directed emotion that carries two common meanings
waxing: ___ is a form of semi permanent hair removal which removes the hair from the root
beech: ___ fagus is a genus of deciduous trees in the family fagaceae native to temperate europe asia and north america
glasgow: ___ scots glesca scottish gaelic glaschu is the largest city in scotland and the fourth largest in the united kingdom

SLIDE 16

The definitional dataset: evaluation

Evaluation: how far is the learned vector from one that would be learned from 2.6 billion words (UKWaC)? (Reciprocal Rank)
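A minimal sketch of this ranking evaluation (not the paper's own code; names are illustrative): rank every word in the background space by similarity to the learned vector, find the gold vector's position, and report the reciprocal rank. The dataset score (MRR) is the mean of these values over the test items.

```python
# Sketch: reciprocal rank of the 'gold' UKWaC-trained vector among all
# neighbours of the vector learned from the definition (illustrative only).
import numpy as np

def reciprocal_rank(learned_vec, gold_word, background):
    """background: dict mapping every vocabulary word (259,376 here) to a vector."""
    def cos(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    ranking = sorted(background, key=lambda w: cos(learned_vec, background[w]), reverse=True)
    return 1.0 / (ranking.index(gold_word) + 1)

# MRR over the test set:
# mrr = np.mean([reciprocal_rank(vec, word, space) for word, vec in learned.items()])
```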

SLIDE 17

The chimera dataset (Lazaridou et al, 2016)

  • Simulate a nonce situation: a speaker encounters a word for the first time in naturally-occurring sentences.
  • Each data point is associated with 2-6 sentences, showing the word in context.
  • The nonce is created as a ‘chimera’, i.e. a mixture of two existing and somewhat related concepts (e.g., a buffalo crossed with an elephant).
  • The sentences associated with the nonce are utterances containing one of the components of the chimera.
  • Data annotated by humans in terms of the similarity of the nonce to other, randomly selected concepts.

SLIDE 18

The chimera dataset (Lazaridou et al, 2016)

Sentences: STIMARANS and tomatoes as well as peppers are grown in greenhouses with much higher yields. @@ Add any liquid left from the STIMARAN together with all the other ingredients except the breadcrumbs and cheese.

Probes: rhubarb, onion, pear, strawberry, limousine, cushion

Human responses: 2.86, 3, 3.29, 2.29, 1.14, 1.29

Figure 1: An example chimera (STIMARAN), made of cucumber and celery

SLIDE 19

The chimera dataset: evaluation

  • Try and simulate human answers on the similarity task.
  • Calculate the Spearman rank correlation between human and machine.
  • Average Spearman ρ over all instances.


Evaluation: can the machine reproduce human judgements?
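For illustration, one chimera instance can be scored with scipy; the human ratings below are the STIMARAN numbers from the previous slide, while the system similarities are hypothetical:

```python
# Sketch: per-instance Spearman correlation between human ratings and
# system cosine similarities over the six probes.
from scipy.stats import spearmanr

human  = [2.86, 3.00, 3.29, 2.29, 1.14, 1.29]   # ratings for the six probes
system = [0.41, 0.38, 0.45, 0.30, 0.05, 0.08]   # hypothetical cosines from the model
rho, _ = spearmanr(human, system)
print(rho)  # the reported score is the average rho over all chimera instances
```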

SLIDE 20

Learning concepts, the trendy way

SLIDE 21

Word2Vec (Mikolov et al, 2013)

  • Super-trendy: 3137 + 2835 citations.
  • Unreadable code. Muddy parameters.

(147 + 267 + 207 + 152 citations gained explaining Word2Vec.)

  • It works!
  • Excellent correlation with human similarity judgements.
  • Computes analogies of the type king - man = queen - woman (also for morphological derivations).

  • Performs as well as any student in the TOEFL test.

SLIDE 22

The intuition behind Word2Vec

  • Word2Vec (Mikolov et al 2013) is a neural-network-based predictive model. It has two possible architectures:
  • given some context words, predict the target (CBOW)
  • given a target word, predict the contexts (Skip-gram)
  • In the process of doing this prediction task, the model learns word vectors.
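A toy illustration of the two architectures (my own sketch, not the original implementation): both are trained on (target, context) pairs extracted with a sliding window; they differ only in which side is predicted from which.

```python
# Sketch: training pairs from one sentence with a symmetric window of 2.
sentence = ["the", "knight", "bus", "drove", "in", "darkness"]
window = 2

for i, target in enumerate(sentence):
    context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
    # CBOW:      predict `target` from the bag of `context` words
    # Skip-gram: predict each word in `context` from `target`
    print(target, "<-", context)
```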

SLIDE 23

Word2Vec: the model

The word vectors are given by the rows of the input weight matrix, which is randomly initialised.
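A toy numpy sketch of one skip-gram negative-sampling step (assumed dimensions and names; the real implementation is the word2vec C code): the rows of the input matrix are the word vectors being learned.

```python
# Toy skip-gram-with-negative-sampling update (illustrative only).
import numpy as np

V, D = 10000, 100                        # vocabulary size, vector dimensionality
rng = np.random.default_rng(0)
W_in  = (rng.random((V, D)) - 0.5) / D   # input matrix: one row per word = the word vectors
W_out = np.zeros((V, D))                 # output (context) matrix

def sgns_step(target, context, negatives, alpha=0.025):
    """One gradient step for a (target, context) pair plus sampled negatives."""
    grad_in = np.zeros(D)
    for c, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        score = 1.0 / (1.0 + np.exp(-np.dot(W_in[target], W_out[c])))
        g = alpha * (label - score)
        grad_in  += g * W_out[c]     # accumulate the gradient for the target row
        W_out[c] += g * W_in[target]
    W_in[target] += grad_in          # this is where the word vector moves
```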

SLIDE 24

The Word2Vec vocabulary

  • Word2Vec looks incremental: it reads through a corpus, one line after the other, and tries to predict terms in each encountered word window.
  • In fact, it requires a first pass through the corpus to build a vocabulary of all words in the corpus, together with their frequencies.
  • This table will be used in the sampling steps of the algorithm.
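A sketch of that first pass (my own simplification): count every word, then build the distribution used for drawing negative samples; Mikolov et al. (2013) raise counts to the power 0.75.

```python
# Sketch: build the vocabulary/frequency table and a negative-sampling distribution.
from collections import Counter
import numpy as np

def build_vocab(corpus_lines):
    counts = Counter(w for line in corpus_lines for w in line.split())
    words = list(counts)
    probs = np.array([counts[w] ** 0.75 for w in words])  # smoothed unigram distribution
    return words, counts, probs / probs.sum()

words, counts, neg_probs = build_vocab(["the cat sat", "the dog sat on the mat"])
negatives = np.random.choice(words, size=5, p=neg_probs)   # 5 negative samples
```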

SLIDE 25

Subsampling

  • Instead of considering all words in the sentence, transform it by randomly removing words from it:
    considering all sentence transform randomly words
  • The subsampling function makes it more likely to remove a frequent word.
  • Word2Vec uses aggressive subsampling.
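The rule from Mikolov et al. (2013), sketched below (the released C code uses a slightly different variant): with threshold t, a word with relative frequency f is dropped with probability 1 - sqrt(t / f), so frequent words are dropped far more often.

```python
# Sketch of frequency-based subsampling (paper formula, simplified).
import math, random

def keep(word, freqs, total, t=0.001):
    f = freqs[word] / total                      # relative frequency of the word
    p_drop = max(0.0, 1.0 - math.sqrt(t / f))    # higher for frequent words
    return random.random() > p_drop

# subsampled = [w for w in sentence if keep(w, freqs, total)]
```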

SLIDE 26

The learning rate

  • Word2Vec tries to maximise the probability of a correct prediction.
  • This means modifying the weights of the network ‘in the right direction’.
  • By taking too big a step, we run the risk of overshooting the maximum.
  • Word2Vec is conservative. Default α = 0.025.

SLIDE 27

The word window

  • How much context are we taking into account?
  • Smaller windows emphasise structural similarity: cat: dog, pet, kitty, ferret
  • Larger windows emphasise relatedness: cat: mouse, whisker, stroke
  • Best of both worlds with random resizing of the window.
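The resizing is simple: for each target position the effective window is drawn uniformly between 1 and the maximum (a sketch, matching the behaviour of the original implementation as I understand it):

```python
# Sketch: random window resizing around position i in a sentence.
import random

max_window = 5
b = random.randint(1, max_window)   # effective window for this target
# context = sentence[max(0, i - b):i] + sentence[i + 1:i + 1 + b]
```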

SLIDE 28

Experimental setup

  • We assume that we have a background vocabulary, that is, a semantic space with high-quality vectors, trained on a large corpus.
  • We then expose the model to the sentence(s) containing the nonce.
  • Standard Word2Vec parameters:
  • Learning rate: 0.025
  • Window size: 5
  • Negative samples: 5
  • Epochs: 5
  • Subsampling: 0.001
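A hedged sketch of this baseline configuration using gensim (parameter names assume gensim 4.x; the skip-gram choice and the toy corpus are my assumptions, not stated on the slide):

```python
# Sketch of the standard Word2Vec setup with the slide's parameter values.
from gensim.models import Word2Vec

# Tiny placeholder corpus; the real background space is trained on a large corpus (UKWaC).
background_corpus = [["the", "knight", "bus", "drove", "in", "darkness"],
                     ["two", "potions", "against", "burns"]]

model = Word2Vec(
    sentences=background_corpus,
    sg=1,             # skip-gram (assumption)
    alpha=0.025,      # learning rate
    window=5,         # word window
    negative=5,       # negative samples
    epochs=5,
    sample=0.001,     # subsampling threshold
    min_count=1,      # only so this toy example runs; irrelevant on a real corpus
)
```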

SLIDE 29

Results on definitions

        MRR        Mean rank
W2V     0.00007    111012
Sum
N2V

Evaluation: rank of the ‘true’ vector (learnt from big data) amongst the 259,376 neighbours of the learnt vector.

SLIDE 30

What does 0.00007 mean?

Figure: Binned ranks in the definitional task

SLIDE 31

Results on chimeras

        L2 ρ      L4 ρ      L6 ρ
W2V     0.1459    0.2457    0.2498
Sum
N2V

Evaluation: correlation with human similarity judgements over probes.

SLIDE 32

Verdict

  • Word2Vec can learn from big data, but not from tiny data.
  • I.e. it learns really slowly.
  • No wonder. α = 0.025.

SLIDE 33

Slow learner!

SLIDE 34

Learning concepts, the hacky way

SLIDE 35

Hack it (Lazaridou et al 2016)

  • Sum the vectors of the words in the nonce’s context.
  • Given a nonce N occurring in a sentence S = w1 ... N ... wp, sum over all the other words:

    vec(N) = Σ_{wk ∈ S, wk ≠ N} vec(wk)
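A sketch of this additive baseline, assuming a gensim-style background space (`bg.wv[word]` returning a vector); the stop-word filter is my own illustrative addition, not necessarily the paper's exact preprocessing:

```python
# Sketch: the nonce vector is the sum of the vectors of the other words in its sentence.
import numpy as np

STOP = {"the", "a", "an", "of", "and", "to", "in", "is"}   # illustrative stop list

def additive_nonce(sentence_tokens, nonce, bg):
    vecs = [bg.wv[w] for w in sentence_tokens
            if w != nonce and w in bg.wv and w not in STOP]
    return np.sum(vecs, axis=0)    # assumes at least one known context word
```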

SLIDE 36

Results on definitions

        MRR        Mean rank
W2V     0.00007    111012
Sum     0.03686    861
N2V

Evaluation: rank of the ‘true’ vector (learnt from big data) amongst the 259,376 neighbours of the learnt vector.

SLIDE 37

What does 0.03147 mean?

Figure: Binned ranks in the definitional task

SLIDE 38

What does 0.03147 mean?

blackmail: ___ is an act , often a crime , involving unjustified threats to make a gain or cause loss to another unless a demand is met

Neighbours: [’cause’, ’trespasser’, ’victimless’, ’deprives’, ’threats’, ’injunctive’, ’promisor’, ’exonerate’, ’hypokalemia’, ’abuser’]

Rank: 2182

SLIDE 39

Results on chimeras

        L2 ρ      L4 ρ      L6 ρ
W2V     0.1459    0.2457    0.2498
Sum     0.3376    0.3624    0.4080
N2V

Evaluation: correlation with human similarity judgements over probes.

SLIDE 40

Theoretical issues in hacking

  • Addition is a special nonce process, activated when a new word is encountered.
  • But for how long is a new word new? 2, 4, 6 sentences? More? When shall we come back to standard Word2Vec?
  • Standard problem in having multiple processes for modelling one phenomenon: you need a meta-theory (when to apply process X or process Y).
  • Wouldn’t it be nice to have just one algorithm for all cases?

SLIDE 41

Practical issues in hacking

  • Addition is an upper bound: it cannot be improved any further.
  • Addition is very sensitive to the nature of the context. The more coherent the context, the better the representation.

SLIDE 43

Low topic coherence is not incoherence

“Bring out your cat’s inner DJ!” http://www.suck.uk.com/products/catscratch/

SLIDE 44

Learning concepts, the risky way

SLIDE 45

Learning from small data – the risky way

Make a wild guess – move fast – take everything that’s given to you – but don’t lose yourself in unstable beliefs!

SLIDE 46

The alternative: high-risk learning

  • Keep Word2Vec (nearly) as it is.
  • Take risks. Trust the sentence and its informativeness: increase the learning rate, 40-fold.
  • Be greedy. Grab all you can: increase the word window size 3-fold; suppress window resizing and most subsampling.

SLIDE 47

Insurance policy

  • Initialise the nonce vector to the sum of its context vectors.
  • Selective training: only train the nonce. Don’t change prior beliefs.
  • The standard Word2Vec training process involves bringing the words in a sentence closer to each other.
  • With a high learning rate, this means a drastic move for all words in that sentence.
  • The words we already know well shouldn’t move.
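A sketch of selective training under the same toy setup as the earlier skip-gram snippet (my own illustration, not the Nonce2Vec source): the gradient step touches only the nonce's row, so the vectors of known words stay frozen.

```python
# Sketch: only the nonce row of the input matrix is updated; prior beliefs stay put.
import numpy as np

def nonce_step(nonce_id, context_id, negatives, W_in, W_out, alpha):
    grad_in = np.zeros(W_in.shape[1])
    for c, label in [(context_id, 1.0)] + [(n, 0.0) for n in negatives]:
        score = 1.0 / (1.0 + np.exp(-np.dot(W_in[nonce_id], W_out[c])))
        g = alpha * (label - score)
        grad_in += g * W_out[c]
        # note: W_out[c] is deliberately NOT updated here; known words must not move
    W_in[nonce_id] += grad_in      # only the nonce vector moves
```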

SLIDE 48

Beyond the nonce

  • How long should we keep the learning rate that high?
  • The increase in learning rate α drastically moves a randomly-generated vector to what the system assumes is the right area of the semantic space.
  • Once initial positioning has taken place, the system should refine its guess rather than moving wildly in the space.

SLIDE 49

On the importance of decay

  • We tune a decay on the learning rate, the window size and the subsampling rate.
  • Learning rate: every time t that we train a pair containing the target word, we set α to α0 · e^(−λt), where α0 is our initial learning rate.
  • Window size: we slowly decrease the window size to get back to ‘normal’ levels.
  • Subsampling: we slowly increase subsampling to get back to ‘normal’ levels.
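A sketch of such a schedule (the decay constant and the window schedule are illustrative; the exact functional forms used in the paper may differ):

```python
# Sketch: per-exposure decay of the nonce-specific hyperparameters.
import math

alpha_0, lam = 1.0, 1.0        # high initial learning rate and decay constant (assumed)
window_0, window_min = 15, 5   # start wide, shrink back to the standard window

def decayed_params(t):
    """t = number of times a pair containing the nonce has been trained."""
    alpha = alpha_0 * math.exp(-lam * t)        # alpha = alpha_0 * e^(-lambda * t)
    window = max(window_min, window_0 - 2 * t)  # illustrative window schedule
    return alpha, window                        # subsampling is tightened analogously

print(decayed_params(0), decayed_params(3))
```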

SLIDE 50

Experimental setup

  • Search for the best parameters on the definitional training set; use them on the definitional test set and the chimera dataset.
  • Range of parameters:
  • Learning rate: [0.5, 0.8, 1, 2, 5, 10, 20]
  • Window size: [5, 10, 15, 20]
  • Negative samples: [3, 5, 10]
  • Number of epochs: [1, 5, 10]
  • Subsampling rate: [500, 1000, 10000]

SLIDE 51

Results on definitions

        MRR        Mean rank
W2V     0.00007    111012
Sum     0.03686    861
N2V     0.04907    623

Evaluation: rank of the ‘true’ vector (learnt from big data) amongst the 259,376 neighbours of the learnt vector.

SLIDE 52

Best parameters

                    W2V      N2V
Learning rate       0.025    1
Window size         5        15
Negative samples    5        3
Number of epochs    5        1
Subsampling rate    0.001    10000

SLIDE 53

Results on chimeras

        L2 ρ      L4 ρ      L6 ρ
W2V     0.1459    0.2457    0.2498
Sum     0.3376    0.3624    0.4080
N2V     0.3320    0.3668    0.3890

Table 1: Evaluation: correlation with human similarity judgements over probes.

N2V does not improve on sum. Explanation: the system can’t tell really informative sentences from noise – it heightens its learning rate on the wrong data. The risk does not pay off.

SLIDE 54

Conclusion

SLIDE 55

Learning concepts, from any amount of data

  • Given a fairly extensive prior vocabulary, it is possible to learn new concepts from any amount of (minimally informative) data using a dynamic, incremental architecture.
  • Risks must be mitigated: know what you believe, and use your beliefs:
  • don’t revise your beliefs in the light of a new, unknown concept;
  • don’t learn a new concept on the back of uncertain beliefs.
  • On natural data, the system must know how to increase its learning rate on the right data. This requires the ability to measure the informativeness of a sentence.

SLIDE 56

Acknowledgements and things

Collaborator: Marco Baroni

GitHub: https://github.com/minimalparts/nonce2vec/

Funding: Horizon 2020, Marie Skłodowska-Curie grant No 751250.

Images: Brain by Gaetan Lee, from Flickr as Chimp Brain in a jar, CC BY 2.0. Emotions by Sourav Biswas, Flickr, CC BY-NC-ND 2.0. Robot by Michael Dain, Flickr, CC BY-NC-ND 2.0. Finnish Parliament by Tiina Tuukkanen (Tekijän arkisto), CC BY-SA 4.0. Chocolate fudge cake by Tracy Hunter, uploaded by Ekabhishek, CC BY 2.0. Sad robot by Ergoneon, Pixabay, CC0.
