NLU lecture 5: Word representations and morphology - Adam Lopez - PowerPoint PPT Presentation




SLIDE 1

NLU lecture 5: Word representations and morphology

Adam Lopez alopez@inf.ed.ac.uk

SLIDE 2
  • Essential epistemology
  • Word representations and word2vec
  • Word representations and compositional morphology

Reading: Mikolov et al. 2013, Luong et al. 2013

SLIDE 3

Essential epistemology

              Exact sciences       Empirical sciences    Engineering
Deals with    Axioms & theorems    Facts & theories      Artifacts
Truth is      Forever              Temporary             "It works"
Examples      Mathematics,         Physics, Biology,     Many, including applied
              C.S. theory,         Linguistics           C.S., e.g. NLP
              F.L. theory

SLIDE 4

Essential epistemology

              Exact sciences       Empirical sciences    Engineering
Deals with    Axioms & theorems    Facts & theories      Artifacts
Truth is      Forever              Temporary             "It works"
Examples      Mathematics,         Physics, Biology,     Many, including applied
              C.S. theory          Linguistics           C.S., e.g. MT

SLIDE 5

Essential epistemology

(Epistemology table repeated from Slide 4.)

morphological properties of words (facts)

SLIDE 6

Essential epistemology

(Epistemology table repeated from Slide 4.)

morphological properties of words (facts) Optimality theory

SLIDE 7

Essential epistemology

(Epistemology table repeated from Slide 4.)

morphological properties of words (facts) Optimality theory Optimality theory is finite-state

SLIDE 8

Essential epistemology

(Epistemology table repeated from Slide 4.)

morphological properties of words (facts) Optimality theory Optimality theory is finite-state We can represent morphological properties of words with finite-state automata

SLIDE 9

SLIDE 10

Remember the bandwagon

SLIDE 11

Word representations

SLIDE 12

Feedforward model

p(e) = \prod_{i=1}^{|e|} p(e_i \mid e_{i-n+1}, \ldots, e_{i-1})

[Diagram: p(e_i \mid e_{i-n+1}, \ldots, e_{i-1}) is computed by looking up the embeddings of the context words e_{i-1}, e_{i-2}, e_{i-3} with C, concatenating them, passing them through W with a tanh nonlinearity, and projecting with V through a softmax.]
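To make the diagram concrete, here is a minimal NumPy sketch of the forward pass of this feedforward model. The dimensions, random initialization, and the name V_out (the slide uses V for both the vocabulary and the output matrix) are illustrative assumptions, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, n_ctx, h = 10_000, 50, 3, 100              # vocab size, embedding dim, context words, hidden dim

C = rng.normal(scale=0.1, size=(V, d))           # word embeddings (lookup table)
W = rng.normal(scale=0.1, size=(h, n_ctx * d))   # hidden-layer weights
V_out = rng.normal(scale=0.1, size=(V, h))       # output projection to the vocabulary

def next_word_distribution(context_ids):
    """p(e_i | e_{i-3}, e_{i-2}, e_{i-1}) for one context of three word ids."""
    x = np.concatenate([C[j] for j in context_ids])  # concatenate the context embeddings
    hidden = np.tanh(W @ x)                          # tanh hidden layer
    scores = V_out @ hidden                          # one score per vocabulary word
    exp = np.exp(scores - scores.max())              # numerically stable softmax
    return exp / exp.sum()

p = next_word_distribution([17, 42, 7])
print(p.shape, p.sum())   # (10000,) 1.0 -- a distribution over the whole vocabulary
```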

SLIDE 13

(Feedforward LM equation and diagram as on Slide 12.)

Every word is a vector (a one-hot vector). The concatenation of these vectors is an n-gram.

Feedforward model

SLIDE 14

(Feedforward LM equation and diagram as on Slide 12.)

Word embeddings are vectors: continuous representations of each word.

Feedforward model

SLIDE 15

(Feedforward LM equation and diagram as on Slide 12.)

n-grams are vectors: continuous representations of n-grams (or, via recursion, larger structures)

Feedforward model

SLIDE 16

(Feedforward LM equation and diagram as on Slide 12.)

A discrete probability distribution over V outcomes is a vector: V non-negative reals summing to 1.

Feedforward model
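As a quick illustration of that point (the softmax output is just V non-negative reals summing to 1), with made-up scores for a toy 4-word vocabulary:

```python
import numpy as np

scores = np.array([2.0, -1.0, 0.5, 0.0])       # raw scores for a toy 4-word vocabulary
probs = np.exp(scores) / np.exp(scores).sum()  # softmax
print(probs)        # every entry is non-negative
print(probs.sum())  # 1.0
```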

SLIDE 17

(Feedforward LM equation and diagram as on Slide 12.)

No matter what we do in NLP, we’ll (almost) always have words… Can we reuse these vectors?

Feedforward model

SLIDE 18

Design a POS tagger using an RNNLM

SLIDE 19

Design a POS tagger using an RNNLM

What are some difficulties with this? What limitation do you have in learning a POS tagger that you don't have when learning an LM?

SLIDE 20

Design a POS tagger using an RNNLM

What are some difficulties with this? What limitation do you have in learning a POS tagger that you don't have when learning an LM? One big problem: LIMITED DATA

SLIDE 21

"You shall know a word by the company it keeps"

–John Rupert Firth (1957)

SLIDE 22

Learning word representations using language modeling

  • Idea: we'll learn word representations using a language model, then reuse them in our POS tagger (or any other thing we predict from words). (A sketch of this reuse follows below.)
  • Problem: the Bengio language model is slow. Imagine computing a softmax over 10,000 words!
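A minimal sketch of what that reuse might look like: the tagger's embedding table is initialized from vectors learned by the language model, so only the small tagger-specific parameters have to be learned from the limited labeled data. Everything below (names, sizes, and the context-free scoring) is an illustrative assumption, not the lecture's actual tagger.

```python
import numpy as np

rng = np.random.default_rng(1)
V, d, n_tags = 10_000, 50, 17                     # vocabulary, embedding size, tagset size (illustrative)

pretrained = rng.normal(scale=0.1, size=(V, d))   # stand-in for embeddings learned by the LM

E = pretrained.copy()                             # the tagger reuses the LM's word vectors
A = rng.normal(scale=0.1, size=(n_tags, d))       # tagger-specific weights, trained on labeled data

def tag_distribution(word_id):
    """p(tag | word), ignoring context to keep the sketch tiny."""
    scores = A @ E[word_id]
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

print(tag_distribution(42).argmax())   # index of the most likely tag for word id 42
```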

SLIDE 23

Continuous bag-of-words (CBOW)
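The slide presents CBOW as a diagram; the prediction step it depicts can be sketched as below (sizes and initialization are illustrative): average the context word vectors and score every vocabulary word against that average.

```python
import numpy as np

rng = np.random.default_rng(2)
V, d = 10_000, 100
W_in = rng.normal(scale=0.1, size=(V, d))    # input (context) embeddings
W_out = rng.normal(scale=0.1, size=(V, d))   # output (target) embeddings

def cbow_predict(context_ids):
    """p(center word | context): softmax over all words of (output vector . mean context vector)."""
    h = W_in[context_ids].mean(axis=0)       # average the context word vectors
    scores = W_out @ h
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

p = cbow_predict([3, 8, 12, 20])             # e.g. two words on either side of the center word
print(p.argmax())
```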

SLIDE 24

Skip-gram

SLIDE 25

Skip-gram

SLIDE 26

Learning skip-gram

SLIDE 27

Learning skip-gram
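One common way to train skip-gram, sketched below, is the negative-sampling objective of Mikolov et al. (2013): each true (center, context) pair is pushed together, and a few randomly sampled pairs are pushed apart. The sizes, learning rate, and number of negatives are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
V, d, lr = 10_000, 100, 0.05
W_in = rng.normal(scale=0.1, size=(V, d))    # center-word embeddings
W_out = rng.normal(scale=0.1, size=(V, d))   # context-word embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_update(center, context, k=5):
    """One SGD step: attract the true (center, context) pair, repel k randomly drawn pairs."""
    negatives = rng.integers(0, V, size=k)
    v = W_in[center].copy()
    grad_v = np.zeros(d)
    for word, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        u = W_out[word].copy()
        g = sigmoid(u @ v) - label           # derivative of the logistic loss w.r.t. the score
        grad_v += g * u
        W_out[word] -= lr * g * v            # update the context-side vector
    W_in[center] -= lr * grad_v              # update the center-word vector

sgns_update(center=17, context=42)
```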

SLIDE 28

Word representations capture some world knowledge

SLIDE 29

Continuous Word Representations

[Figure: vector offsets. Semantic: man : woman :: king : queen. Syntactic: walk : walks :: read : reads.]
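These offsets are usually demonstrated with the analogy test: the vector closest to vec(woman) - vec(man) + vec(king) should be vec(queen). A sketch of that query, assuming hypothetical E (embedding matrix), word2id, and id2word lookups that are not part of the lecture:

```python
import numpy as np

def analogy(a, b, c, E, word2id, id2word):
    """Return the word whose vector is closest (by cosine) to vec(b) - vec(a) + vec(c)."""
    target = E[word2id[b]] - E[word2id[a]] + E[word2id[c]]
    sims = (E @ target) / (np.linalg.norm(E, axis=1) * np.linalg.norm(target) + 1e-9)
    for w in (a, b, c):                       # exclude the query words themselves
        sims[word2id[w]] = -np.inf
    return id2word[int(sims.argmax())]

# e.g. analogy("man", "woman", "king", E, word2id, id2word) is expected to return "queen"
```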

SLIDE 30

Will it learn this?

SLIDE 31

(Additional) limitations of word2vec

  • Closed vocabulary assumption
  • Cannot exploit functional relationships in learning


SLIDE 32

Is this language?

What our data contains: A Lorillard spokeswoman said, "This is an old story."
What word2vec thinks our data contains: A UNK UNK said, "This is an old story."

SLIDE 33

Is it ok to ignore words?

SLIDE 34

Is it ok to ignore words?

SLIDE 35

What we know about linguistic structure

Morpheme: the smallest meaningful unit of language

"loves" = love +s
  • root/stem: love
  • affix: -s
  • morphological analysis: 3rd.SG.PRES

SLIDE 36

What if we embed morphemes rather than words?

Basic idea: compute the representation recursively from its children.
  • Vectors in green are morpheme embeddings (parameters)
  • Vectors in grey are computed as above (functions)
  • f is an activation function (e.g. tanh)
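A minimal sketch of that recursive composition, assuming a tanh activation and illustrative morphemes and dimensions (the green vectors correspond to morpheme_vecs below, the grey ones to the outputs of compose):

```python
import numpy as np

rng = np.random.default_rng(4)
d = 50
morpheme_vecs = {m: rng.normal(scale=0.1, size=d)
                 for m in ["love", "-s", "un-", "fortunate", "-ly"]}   # morpheme embeddings (parameters)
W_m = rng.normal(scale=0.1, size=(d, 2 * d))                           # composition matrix (parameter)
b_m = np.zeros(d)

def compose(left, right):
    """Parent vector = f(W_m [left; right] + b_m), with f = tanh as on the slide."""
    return np.tanh(W_m @ np.concatenate([left, right]) + b_m)

# "loves" = love +s
loves = compose(morpheme_vecs["love"], morpheme_vecs["-s"])

# "unfortunately" = ((un- + fortunate) + -ly), built recursively from children
unfortunately = compose(compose(morpheme_vecs["un-"], morpheme_vecs["fortunate"]),
                        morpheme_vecs["-ly"])
print(loves.shape, unfortunately.shape)   # (50,) (50,)
```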

SLIDE 37

Train compositional morpheme model by minimizing distance to reference vector

Target output: reference vector p_r; the constructed vector is p_c. Minimize: the distance between p_c and p_r.
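Continuing the sketch above: one natural reading of "minimize the distance" is the squared Euclidean distance between the constructed vector p_c and a reference vector p_r (e.g. a word2vec vector for the whole word). The exact form used in Luong et al. (2013) may differ; this is an assumption for illustration.

```python
import numpy as np

def morpho_loss(p_c, p_r):
    """Squared Euclidean distance between constructed (p_c) and reference (p_r) vectors."""
    return float(np.sum((p_c - p_r) ** 2))

# Training adjusts the morpheme vectors and composition parameters (W_m, b_m above)
# by gradient descent so that morpho_loss(compose(...), reference_vector) becomes small.
```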

SLIDE 38

Or, train in context using backpropagation

  • Vectors in green are morpheme embeddings (parameters)
  • Vectors in grey are computed as above (functions)
  • Vectors in blue are word or n-gram embeddings (parameters)
(Basically a feedforward LM)

SLIDE 39

Where do we get morphemes?

  • Use an unsupervised morphological analyzer (we'll talk about unsupervised learning later on).
  • How many morphemes are there?
SLIDE 40

New stems are invented every day!

fleeking, fleeked, and fleeker are all attested…

SLIDE 41

Representations learned by compositional morphology model

SLIDE 42

Summary

  • Deep learning is not magic and will not solve all of your problems, but representation learning is a very powerful idea.
  • Word representations can be transferred between models.
  • Word2vec trains word representations using an objective based on language modeling, so it can be trained on unlabeled data.
  • Sometimes called unsupervised, but the objective is supervised!
  • Vocabulary is not finite.
  • Compositional representations based on morphemes make our models closer to open vocabulary.