Lecture 8: Word Clustering
Kai-Wei Chang, CS @ University of Virginia
kw@kwchang.net
Course webpage: http://kwchang.net/teaching/NLP16
6501 Natural Language Processing
This lecture
- Brown Clustering
Brown Clustering
- Similar to a language model, but the basic unit is "word clusters"
- Intuition: again, similar words appear in similar contexts
- Recap: bigram language models
  P(x_0, x_1, x_2, ..., x_n) = P(x_1 | x_0) P(x_2 | x_1) ... P(x_n | x_{n-1})
                             = Π_{i=1}^{n} P(x_i | x_{i-1})
  x_0 is a dummy word representing the beginning of a sentence
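To make the bigram recap concrete, here is a minimal sketch with maximum-likelihood estimates; the function name, the `<s>` dummy token, and the count dictionaries are my own illustration, not from the slides:

```python
def bigram_prob(sentence, bigram_counts, unigram_counts):
    """P(x_1, ..., x_n) = prod_i P(x_i | x_{i-1}), where x_0 is a dummy
    begin-of-sentence token and P(x_i | x_{i-1}) is an MLE count ratio."""
    words = ["<s>"] + sentence          # "<s>" plays the role of x_0
    p = 1.0
    for prev, cur in zip(words, words[1:]):
        p *= bigram_counts[(prev, cur)] / unigram_counts[prev]
    return p
```

For example, with counts from a hypothetical two-sentence corpus "a dog" / "a cat", the sketch gives P("a dog") = P(a | x_0) P(dog | a) = 1.0 × 0.5 = 0.5.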
Motivation example
- "a dog is chasing a cat"
- P(x_0, "a", "dog", ..., "cat") = P("a" | x_0) P("dog" | "a") ... P("cat" | "a")
- Assume every word belongs to a cluster, e.g.:
  Cluster 3:  a, the
  Cluster 8:  chasing, following, biting, ...
  Cluster 46: dog, cat, fox, rabbit, bird, boy
  Cluster 64: is, was
Motivation example
- Assume every word belongs to a cluster
- "a dog is chasing a cat"
  a/C3  dog/C46  is/C64  chasing/C8  a/C3  cat/C46
Motivation example
- Assume every word belongs to a cluster
- "the boy is following a rabbit"
  the/C3  boy/C46  is/C64  following/C8  a/C3  rabbit/C46
Motivation example
- Assume every word belongs to a cluster
- "a fox was chasing a bird"
  a/C3  fox/C46  was/C64  chasing/C8  a/C3  bird/C46
Brown Clustering
- Let C(x) denote the cluster that x belongs to
- "a dog is chasing a cat"
  a/C3  dog/C46  is/C64  chasing/C8  a/C3  cat/C46
  with terms such as P(C(dog) | C(a)) and P(cat | C(cat))
Brown clustering model
- P("a dog is chasing a cat")
  = P(C("a") | C_0) P(C("dog") | C("a")) P(C("is") | C("dog")) ...
    × P("a" | C("a")) P("dog" | C("dog")) ...
Brown clustering model
- P("a dog is chasing a cat")
  = P(C("a") | C_0) P(C("dog") | C("a")) P(C("is") | C("dog")) ...
    × P("a" | C("a")) P("dog" | C("dog")) ...
- In general,
  P(x_0, x_1, x_2, ..., x_n)
  = P(C(x_1) | C(x_0)) P(C(x_2) | C(x_1)) ... P(C(x_n) | C(x_{n-1}))
    × P(x_1 | C(x_1)) P(x_2 | C(x_2)) ... P(x_n | C(x_n))
  = Π_{i=1}^{n} P(C(x_i) | C(x_{i-1})) P(x_i | C(x_i))
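This factorization can be transcribed directly into code. A sketch, with hypothetical names: `C` maps words to cluster ids, `q` holds cluster-transition probabilities, `e` holds word-emission probabilities, and cluster id 0 stands in for C(x_0):

```python
def brown_sentence_prob(sentence, C, q, e):
    """P(x_1..x_n) = prod_i P(C(x_i) | C(x_{i-1})) * P(x_i | C(x_i))."""
    prev_c = 0                          # cluster of the dummy word x_0
    p = 1.0
    for w in sentence:
        c = C[w]
        p *= q[(prev_c, c)]             # cluster transition P(C(x_i) | C(x_{i-1}))
        p *= e[(w, c)]                  # word emission P(x_i | C(x_i))
        prev_c = c
    return p
```

Note how the chain only conditions on the previous cluster, never the previous word: all bigram information flows through the clustering.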
Model parameters
  P(x_0, x_1, ..., x_n) = Π_{i=1}^{n} P(C(x_i) | C(x_{i-1})) P(x_i | C(x_i))
- Parameter set 1: cluster transition probabilities P(C(x_i) | C(x_{i-1}))
- Parameter set 2: word emission probabilities P(x_i | C(x_i))
- Parameter set 3: the clustering C(x_i)
  e.g., a/C3  dog/C46  is/C64  chasing/C8  a/C3  cat/C46
Model parameters
  P(x_0, x_1, ..., x_n) = Π_{i=1}^{n} P(C(x_i) | C(x_{i-1})) P(x_i | C(x_i))
- A vocabulary set V
- A function C: V → {1, 2, 3, ..., k}: a partition of the vocabulary into k classes
- Conditional probability P(c' | c) for c, c' ∈ {1, ..., k}
- Conditional probability P(x | c) for c ∈ {1, ..., k}, x ∈ V
- θ represents the set of conditional probability parameters; C represents the clustering
Log likelihood
  LL(θ, C) = log P(x_0, x_1, x_2, ..., x_n | θ, C)
           = log Π_{i=1}^{n} P(C(x_i) | C(x_{i-1})) P(x_i | C(x_i))
           = Σ_{i=1}^{n} [log P(C(x_i) | C(x_{i-1})) + log P(x_i | C(x_i))]
- Maximizing LL(θ, C) can be done by alternately updating θ and C:
  1. max_{θ ∈ Θ} LL(θ, C)
  2. max_C LL(θ, C)
max_{θ ∈ Θ} LL(θ, C)
  LL(θ, C) = log P(x_0, x_1, x_2, ..., x_n | θ, C)
           = Σ_{i=1}^{n} [log P(C(x_i) | C(x_{i-1})) + log P(x_i | C(x_i))]
- With C fixed, the maximum-likelihood estimates are relative counts:
  P(c' | c) = #(c, c') / #(c)
  P(x | c) = #(x, c) / #(c)
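These count ratios can be computed in one pass over a clustered corpus. A sketch with hypothetical names (the corpus is a list of tokenized sentences, `C` a fixed word-to-cluster map):

```python
from collections import Counter

def estimate_params(corpus, C):
    """MLE for the two parameter sets, given a fixed clustering C:
    q(c' | c) = #(c, c') / #(c)   over adjacent cluster pairs,
    e(x | c)  = #(x, c) / #(c)    over word tokens."""
    emit, cluster_count, pair_count = Counter(), Counter(), Counter()
    for sentence in corpus:
        clusters = [C[w] for w in sentence]
        for w, c in zip(sentence, clusters):
            emit[(w, c)] += 1
            cluster_count[c] += 1
        pair_count.update(zip(clusters, clusters[1:]))
    q = {(c1, c2): n / cluster_count[c1] for (c1, c2), n in pair_count.items()}
    e = {(w, c): n / cluster_count[c] for (w, c), n in emit.items()}
    return q, e
```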
max_C LL(θ, C)
  max_C Σ_{i=1}^{n} [log P(C(x_i) | C(x_{i-1})) + log P(x_i | C(x_i))]
  = n Σ_{c=1}^{k} Σ_{c'=1}^{k} p(c, c') log [ p(c, c') / (p(c) p(c')) ] + G,
  where G is a constant
- Here, p(c, c') = #(c, c') / Σ_{d,d'} #(d, d') and p(c) = #(c) / Σ_d #(d)
- Σ_{c,c'} p(c, c') log [ p(c, c') / (p(c) p(c')) ] is the mutual information between adjacent clusters
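The mutual-information term can be evaluated directly from cluster-bigram counts. A sketch with hypothetical names; note that it uses separate left-position and right-position marginals for p(c) and p(c'), which is one common reading of the formula:

```python
import math
from collections import Counter

def clustering_quality(corpus, C):
    """Mutual information between adjacent clusters:
    sum_{c,c'} p(c, c') log[ p(c, c') / (p(c) p(c')) ]."""
    pair = Counter()
    for sentence in corpus:
        cs = [C[w] for w in sentence]
        pair.update(zip(cs, cs[1:]))
    total = sum(pair.values())
    left, right = Counter(), Counter()
    for (c1, c2), n in pair.items():
        left[c1] += n
        right[c2] += n
    # (n/total) * log((n/total) / ((left/total) * (right/total)))
    # simplifies to (n/total) * log(n * total / (left * right))
    return sum((n / total) * math.log(n * total / (left[c1] * right[c2]))
               for (c1, c2), n in pair.items())
```

When every cluster bigram is the same (the clusters carry no information about their neighbors), the quantity is 0; it grows as clusters become more predictive of the next cluster.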
Algorithm 1
- Start with |V| clusters: each word is in its own cluster
- The goal is to get k clusters
- We run |V| − k merge steps:
  - Pick 2 clusters and merge them
  - Each step, pick the merge maximizing LL(θ, C)
- Cost: O(|V| − k) iterations × O(|V|^2) candidate pairs × O(|V|^2) to compute LL = O(|V|^5)
  (can be improved to O(|V|^3))
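Algorithm 1 can be sketched naively as follows, with hypothetical names; the merge score here is the mutual-information objective from the previous slides, recomputed from scratch for every candidate pair, which is exactly what makes the naive version O(|V|^5):

```python
import math
from collections import Counter
from itertools import combinations

def merge_score(corpus, C):
    # Objective for a clustering C: mutual information between adjacent clusters.
    pair = Counter()
    for s in corpus:
        cs = [C[w] for w in s]
        pair.update(zip(cs, cs[1:]))
    total = sum(pair.values())
    left, right = Counter(), Counter()
    for (c1, c2), n in pair.items():
        left[c1] += n
        right[c2] += n
    return sum((n / total) * math.log(n * total / (left[c1] * right[c2]))
               for (c1, c2), n in pair.items())

def brown_cluster_naive(corpus, k):
    # Start with one cluster per word, then run |V| - k greedy merges,
    # each time trying every cluster pair and keeping the best-scoring merge.
    vocab = sorted({w for s in corpus for w in s})
    C = {w: i for i, w in enumerate(vocab)}
    while len(set(C.values())) > k:
        best = None
        for a, b in combinations(sorted(set(C.values())), 2):
            trial = {w: (a if c == b else c) for w, c in C.items()}
            score = merge_score(corpus, trial)
            if best is None or score > best[0]:
                best = (score, trial)
        C = best[1]
    return C
```

The O(|V|^3) improvement mentioned on the slide comes from maintaining the objective incrementally across merges instead of recomputing it per candidate pair.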
Algorithm 2
- m: a hyper-parameter; sort words by frequency
- Take the top m most frequent words and put each of them in its own cluster c_1, c_2, c_3, ..., c_m
- For i = m+1, ..., |V|:
  - Create a new cluster c_{m+1} for the i-th most frequent word (we now have m+1 clusters)
  - Choose two clusters from the m+1 clusters based on LL(θ, C) and merge them → back to m clusters
- Carry out m − 1 final merges → full hierarchy
- Running time: O(|V| m^2 + n), where n = #words in the corpus
Example clusters (Brown et al., 1992)
Example hierarchy (Miller et al., 2004)
Quiz 1
- 30 min (Tue. 9/20, 12:30pm-1:00pm)
- Fill-in-the-blank, true/false, short answer
- Closed book, closed notes, closed laptop
- Sample questions:
  - Add-one smoothing vs. add-lambda smoothing
  - Given a = (1, 3, 5) and b = (2, 3, 6), what is the cosine similarity between a and b?
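For the sample cosine-similarity question, the computation is a · b / (||a|| ||b||): the dot product is 1·2 + 3·3 + 5·6 = 41, ||a|| = √35, ||b|| = √49 = 7, so the similarity is 41 / (7√35) ≈ 0.990. A quick check with a hypothetical helper:

```python
import math

def cosine_similarity(a, b):
    """cos(a, b) = (a . b) / (||a|| * ||b||)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```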