Lecture 6: Representing Words (Kai-Wei Chang, CS @ UCLA)




SLIDE 1

Lecture 6: Representing Words

Kai-Wei Chang CS @ UCLA kw@kwchang.net Course webpage: https://uclanlp.github.io/CS269-17/

ML in NLP

SLIDE 2

Bag-of-Words with N-grams

- N-grams: a contiguous sequence of n tokens from a given piece of text

CS 6501: Natural Language Processing

http://recognize-speech.com/language-model/n-gram-model/comparison
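As a small sketch (not part of the slides), extracting the n-grams of a tokenized text takes only a few lines of Python:

```python
def ngrams(tokens, n):
    """Return the contiguous n-token sequences in `tokens`, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "a dog is chasing a cat".split()
print(ngrams(tokens, 2))
# bigrams: ('a','dog'), ('dog','is'), ('is','chasing'), ('chasing','a'), ('a','cat')
```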

SLIDE 3

Language model

- Probability distributions over sentences (i.e., word sequences): P(W) = P(x_1 x_2 x_3 x_4 … x_n)
- Can use them to generate strings: P(x_n ∣ x_1 x_2 x_3 … x_{n-1})
- Rank possible sentences:

  - P(β€œToday is Tuesday”) > P(β€œTuesday Today is”)
  - P(β€œToday is Tuesday”) > P(β€œToday is Los Angeles”)


SLIDE 4

N-Gram Models

- Unigram model: P(x_1) P(x_2) P(x_3) … P(x_n)
- Bigram model: P(x_1) P(x_2 ∣ x_1) P(x_3 ∣ x_2) … P(x_n ∣ x_{n-1})
- Trigram model: P(x_1) P(x_2 ∣ x_1) P(x_3 ∣ x_2, x_1) … P(x_n ∣ x_{n-1}, x_{n-2})
- N-gram model: P(x_1) P(x_2 ∣ x_1) … P(x_n ∣ x_{n-1} x_{n-2} … x_{n-N+1})

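The bigram case can be sketched as below; the probability table here is a toy example invented for illustration, not an estimated model:

```python
# Scoring a sentence with a bigram model, assuming the conditional
# probabilities P(x_i | x_{i-1}) have already been estimated.
from math import prod

bigram_prob = {  # toy table, invented for illustration
    ("<s>", "Today"): 0.3, ("Today", "is"): 0.5, ("is", "Tuesday"): 0.2,
}

def sentence_prob(words, table):
    """P(x_1 .. x_n) = Prod_i P(x_i | x_{i-1}), with <s> as the start symbol.

    Unseen bigrams get probability 0 (no smoothing in this sketch)."""
    padded = ["<s>"] + words
    return prod(table.get((prev, cur), 0.0) for prev, cur in zip(padded, padded[1:]))

print(sentence_prob(["Today", "is", "Tuesday"], bigram_prob))  # β‰ˆ 0.03
print(sentence_prob(["Tuesday", "Today", "is"], bigram_prob))  # 0.0 (unseen bigram)
```

This reproduces the ranking behavior from the previous slide: the grammatical order scores higher than the scrambled one.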

SLIDE 5

Random language via n-gram

- http://www.cs.jhu.edu/~jason/465/PowerPoint/lect01,3tr-ngram-gen.pdf
- https://research.googleblog.com/2006/08/all-our-n-gram-are-belong-to-you.html


(Figure: collection of n-grams)

SLIDE 6

N-Gram Viewer


https://books.google.com/ngrams

SLIDE 7

How to represent words?

- N-gram: cannot capture word similarity
- Word clusters
  - Brown clustering
  - Part-of-speech tagging
- Continuous space representation
  - Word embedding


SLIDE 8

Brown Clustering

- Similar to a language model, but the basic unit is a β€œword cluster”
- Intuition: similar words appear in similar contexts
- Recap: bigram language models

v 𝑄 π‘₯1, π‘₯", π‘₯#, … , π‘₯, = 𝑄 π‘₯" π‘₯1 𝑄 π‘₯# π‘₯" … 𝑄 π‘₯, π‘₯,)" = Ξ 56"

7

P(w: ∣ π‘₯:)")

6501 Natural Language Processing

π‘₯1 is a dummy word representing ”begin of a sentence”

SLIDE 9

Motivation example

- β€œa dog is chasing a cat”

v 𝑄 π‘₯1, β€œπ‘β€, ”𝑒𝑝𝑕”, … , β€œπ‘‘π‘π‘’β€ = 𝑄 ”𝑏” π‘₯1 𝑄 ”𝑒𝑝𝑕” ”𝑏” … 𝑄 ”𝑑𝑏𝑒” ”𝑏”

- Assume every word belongs to a cluster


Cluster 3: {a, the}; Cluster 46: {dog, cat, fox, rabbit, bird, boy}; Cluster 64: {is, was}; Cluster 8: {chasing, following, biting, …}

SLIDE 10

Motivation example

- Assume every word belongs to a cluster
- β€œa dog is chasing a cat”


Cluster 3: {a, the}; Cluster 46: {dog, cat, fox, rabbit, bird, boy}; Cluster 64: {is, was}; Cluster 8: {chasing, following, biting, …}
Cluster sequence: C3 C46 C64 C8 C3 C46

SLIDE 11

Motivation example

- Assume every word belongs to a cluster
- β€œa dog is chasing a cat”


a dog is chasing a cat β†’ C3 C46 C64 C8 C3 C46

SLIDE 12

Motivation example

- Assume every word belongs to a cluster
- β€œthe boy is following a rabbit”


the boy is following a rabbit β†’ C3 C46 C64 C8 C3 C46

SLIDE 13

Motivation example

- Assume every word belongs to a cluster
- β€œa fox was chasing a bird”


a fox was chasing a bird β†’ C3 C46 C64 C8 C3 C46

SLIDE 14

Brown Clustering

v Let 𝐷 π‘₯ denote the cluster that π‘₯ belongs to v β€œa dog is chasing a cat”


a dog is chasing a cat β†’ C3 C46 C64 C8 C3 C46

Transition: P(C(dog) ∣ C(a)); Emission: P(cat ∣ C(cat))

SLIDE 15

Brown clustering model

- P(β€œa dog is chasing a cat”)

= P(C(β€œa”) ∣ C(x_0)) P(C(β€œdog”) ∣ C(β€œa”)) P(C(β€œis”) ∣ C(β€œdog”)) … P(β€œa” ∣ C(β€œa”)) P(β€œdog” ∣ C(β€œdog”)) …


a dog is chasing a cat β†’ C3 C46 C64 C8 C3 C46


SLIDE 16

Brown clustering model

- P(β€œa dog is chasing a cat”)

= P(C(β€œa”) ∣ C(x_0)) P(C(β€œdog”) ∣ C(β€œa”)) P(C(β€œis”) ∣ C(β€œdog”)) … P(β€œa” ∣ C(β€œa”)) P(β€œdog” ∣ C(β€œdog”)) …
- In general:

P(x_0, x_1, x_2, …, x_n) = P(C(x_1) ∣ C(x_0)) P(C(x_2) ∣ C(x_1)) … P(C(x_n) ∣ C(x_{n-1})) P(x_1 ∣ C(x_1)) P(x_2 ∣ C(x_2)) … P(x_n ∣ C(x_n)) = Ξ _{i=1}^{n} P(C(x_i) ∣ C(x_{i-1})) P(x_i ∣ C(x_i))

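As a hedged sketch of the class-based bigram probability above: the cluster IDs follow the slides' toy example, but the transition and emission probability values are invented for illustration.

```python
# Class-based bigram model: P(sentence) = Prod_i P(C(x_i)|C(x_{i-1})) * P(x_i|C(x_i)).
C = {"a": 3, "the": 3, "dog": 46, "cat": 46, "is": 64, "was": 64,
     "chasing": 8, "following": 8}

# Toy parameter tables (cluster 0 is the dummy start cluster); values invented.
trans = {(0, 3): 0.5, (3, 46): 0.9, (46, 64): 0.8, (64, 8): 0.7, (8, 3): 0.6}
emit = {"a": 0.6, "the": 0.4, "dog": 0.2, "cat": 0.2, "is": 0.7, "chasing": 0.3}

def brown_prob(words):
    """Multiply cluster-transition and word-emission probabilities along the sentence."""
    p, prev = 1.0, 0
    for w in words:
        c = C[w]
        p *= trans.get((prev, c), 0.0) * emit.get(w, 0.0)
        prev = c
    return p

print(brown_prob("a dog is chasing a cat".split()))
```

Because the model scores cluster transitions rather than word transitions, β€œa fox was chasing a bird” would share the same transition factors as β€œa dog is chasing a cat”.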

SLIDE 17

Model parameters

𝑄 π‘₯1, π‘₯", π‘₯#, … , π‘₯, = Ξ 56"

7

P 𝐷 w: 𝐷 π‘₯:)" 𝑄(π‘₯: ∣ 𝐷 π‘₯: )


Parameter set 1: P(C(x_i) ∣ C(x_{i-1}))
Parameter set 2: P(x_i ∣ C(x_i))
Parameter set 3: C(x_i)

SLIDE 18

Model parameters

𝑄 π‘₯1, π‘₯", π‘₯#, … , π‘₯, = Ξ 56"

7

P 𝐷 w: 𝐷 π‘₯:)" 𝑄(π‘₯: ∣ 𝐷 π‘₯: ) v A vocabulary set 𝑋 v A function 𝐷: 𝑋 β†’ {1, 2, 3, … 𝑙 }

v A partition of vocabulary into k classes

v Conditional probability 𝑄(𝑑′ ∣ 𝑑) for 𝑑, 𝑑N ∈ 1, … , 𝑙 v Conditional probability 𝑄(π‘₯ ∣ 𝑑) for 𝑑, 𝑑N ∈ 1, … , 𝑙 , π‘₯ ∈ 𝑑


πœ„ represents the set of conditional probability parameters C represents the clustering

SLIDE 19

Log likelihood

LL(πœ„, 𝐷 ) = log 𝑄 π‘₯1, π‘₯", π‘₯#, … , π‘₯, πœ„, 𝐷 = log Ξ 56"

7

P 𝐷 w: 𝐷 π‘₯:)" 𝑄(π‘₯: ∣ 𝐷 π‘₯: ) = βˆ‘56"

7

[log P 𝐷 w: 𝐷 π‘₯:)" + log 𝑄(π‘₯: ∣ 𝐷 π‘₯: ) ]

- Maximizing LL(ΞΈ, C) can be done by alternately updating ΞΈ and C:
  1. max_{ΞΈ ∈ Θ} LL(ΞΈ, C)
  2. max_C LL(ΞΈ, C)


SLIDE 20

max_{ΞΈ ∈ Θ} LL(ΞΈ, C)

LL(ΞΈ, C) = log P(x_0, x_1, x_2, …, x_n ∣ ΞΈ, C) = log Ξ _{i=1}^{n} P(C(x_i) ∣ C(x_{i-1})) P(x_i ∣ C(x_i)) = Ξ£_{i=1}^{n} [log P(C(x_i) ∣ C(x_{i-1})) + log P(x_i ∣ C(x_i))]

- P(c′ ∣ c) = #(c, c′) / #(c)
- P(x ∣ c) = #(x, c) / #(c)


This part is the same as training a POS tagging model

See section 9.2: http://ciml.info/dl/v0_99/ciml-v0_99-ch09.pdf
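Holding the clustering C fixed, the ΞΈ-step above reduces to relative-frequency counting, much like training an HMM POS tagger. A minimal sketch, with cluster 0 as the dummy start cluster (function and variable names are illustrative):

```python
from collections import Counter

def estimate(sentences, C):
    """Relative-frequency estimates of P(c' | c) and P(x | c), given a fixed
    word-to-cluster map C (cluster 0 is the dummy start cluster)."""
    trans, emit, cl_count, prev_count = Counter(), Counter(), Counter(), Counter()
    for sent in sentences:
        prev = 0
        for w in sent:
            c = C[w]
            trans[(prev, c)] += 1    # #(c, c') for the transition table
            prev_count[prev] += 1    # #(c) as the conditioning cluster
            emit[(w, c)] += 1        # #(x, c) for the emission table
            cl_count[c] += 1
            prev = c
    p_trans = {(c, c2): n / prev_count[c] for (c, c2), n in trans.items()}
    p_emit = {(w, c): n / cl_count[c] for (w, c), n in emit.items()}
    return p_trans, p_emit

p_trans, p_emit = estimate([["a", "dog"], ["the", "dog"]],
                           {"a": 3, "the": 3, "dog": 46})
```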

SLIDE 21

max_C LL(ΞΈ, C)

max_C Ξ£_{i=1}^{n} [log P(C(x_i) ∣ C(x_{i-1})) + log P(x_i ∣ C(x_i))] = n Ξ£_{c=1}^{k} Ξ£_{c′=1}^{k} q(c, c′) log [ q(c, c′) / (q(c) q(c′)) ] + G

where G is a constant

- Here, q(c, c′) = #(c, c′) / Ξ£_{c,c′} #(c, c′) and q(c) = #(c) / Ξ£_c #(c)
- c: cluster of w_i; c′: cluster of w_{i-1}
- q(c, c′) / (q(c) q(c′)) = q(c ∣ c′) / q(c)  (mutual information)


See classnote here: http://web.cs.ucla.edu/~kwchang/teaching /NLP16/slides/classnote.pdf
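The objective above can be computed directly from raw counts. A sketch, assuming cluster-bigram counts #(c, c′) and cluster counts #(c) are given (the function name is illustrative):

```python
from math import log

def mi_objective(pair_counts, cluster_counts):
    """Sum_{c,c'} q(c,c') * log( q(c,c') / (q(c) q(c')) ) from raw counts.

    pair_counts: dict (c, c') -> count of adjacent cluster pairs
    cluster_counts: dict c -> count of cluster occurrences
    """
    tp, tc = sum(pair_counts.values()), sum(cluster_counts.values())
    total = 0.0
    for (c, c2), n in pair_counts.items():
        q_cc = n / tp  # joint q(c, c')
        total += q_cc * log(q_cc / ((cluster_counts[c] / tc) * (cluster_counts[c2] / tc)))
    return total
```

Independent cluster sequences give an objective of 0; the more predictive neighboring clusters are of each other, the larger the value, which is what the merge steps try to preserve.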

SLIDE 22

Algorithm 1

- Start with |V| clusters: each word is in its own cluster
- The goal is to get k clusters
- We run |V| - k merge steps:

  - Pick two clusters and merge them
  - At each step, pick the merge that maximizes LL(ΞΈ, C)

- Cost? O(|V| - k) iterations Γ— O(|V|^2) candidate pairs Γ— O(|V|^2) to compute LL = O(|V|^5) (can be improved to O(|V|^3))

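Algorithm 1 can be sketched naively as below. Here `score` stands in for LL(ΞΈ, C) (any clustering-quality function works for illustration), and no attempt is made at the improved O(|V|^3) bookkeeping:

```python
from itertools import combinations

def best_merge(clusters, score):
    """Try every cluster pair, score the clustering after each merge, keep the best.

    clusters: dict id -> set of words; score: clustering -> float (e.g. LL)."""
    best = None
    for a, b in combinations(sorted(clusters), 2):
        merged = {k: v for k, v in clusters.items() if k not in (a, b)}
        merged[a] = clusters[a] | clusters[b]
        s = score(merged)
        if best is None or s > best[0]:
            best = (s, merged)
    return best[1]

def greedy_cluster(words, k, score):
    clusters = {i: {w} for i, w in enumerate(words)}  # |V| singleton clusters
    while len(clusters) > k:                          # |V| - k merge steps
        clusters = best_merge(clusters, score)
    return clusters
```

Each call to `best_merge` scores O(|V|^2) candidate pairs, which is where the O(|V|^5) total cost in the slide comes from.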

SLIDE 23

Algorithm 2

- m: a hyper-parameter; sort words by frequency
- Take the top m most frequent words and put each of them in its own cluster c_1, c_2, c_3, …, c_m
- For j = m + 1 … |V|:

  - Create a new cluster c_{m+1} for the j-th most frequent word (we now have m+1 clusters)
  - Choose two clusters from the m+1 clusters based on LL(ΞΈ, C) and merge them β‡’ back to m clusters

- Carry out (m-1) final merges β‡’ full hierarchy
- Running time: O(|V| m^2 + n), where n = # words in the corpus


SLIDE 24

Example clusters (Brown+1992)


SLIDE 25

Example hierarchy (Miller+2004)
