Lecture 8: Word Clustering




  1. Lecture 8: Word Clustering. Kai-Wei Chang, CS @ University of Virginia, kw@kwchang.net. Course webpage: http://kwchang.net/teaching/NLP16

  2. This lecture: Brown clustering.

  3. Brown Clustering
     - Similar to a language model, but the basic unit is "word clusters".
     - Intuition: again, similar words appear in similar contexts.
     - Recap, bigram language models:
       $P(w_0, w_1, w_2, \dots, w_n) = P(w_1 \mid w_0)\, P(w_2 \mid w_1) \cdots P(w_n \mid w_{n-1}) = \prod_{i=1}^{n} P(w_i \mid w_{i-1})$,
       where $w_0$ is a dummy word representing the beginning of a sentence.
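
To make the recap concrete, here is a minimal Python sketch of a bigram language model with MLE counts; the toy corpus and the `<s>` marker (playing the role of $w_0$) are illustrative assumptions, and there is no smoothing.

```python
from collections import Counter

def bigram_prob(sentence, corpus):
    """Probability of a sentence under a bigram LM with MLE counts.

    "<s>" plays the role of the dummy word w_0; with no smoothing,
    any unseen bigram sends the probability to 0.
    """
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        words = ["<s>"] + sent
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    prob = 1.0
    words = ["<s>"] + sentence
    for prev, cur in zip(words, words[1:]):
        prob *= bigrams[(prev, cur)] / unigrams[prev]  # P(w_i | w_{i-1})
    return prob

toy_corpus = [["a", "dog", "is", "chasing", "a", "cat"],
              ["the", "boy", "is", "following", "a", "rabbit"]]
print(bigram_prob(["a", "dog", "is", "chasing", "a", "cat"], toy_corpus))
```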

  4. Motivation example
     - "a dog is chasing a cat"
     - $P(w_0, \text{"a"}, \text{"dog"}, \dots, \text{"cat"}) = P(\text{"a"} \mid w_0)\, P(\text{"dog"} \mid \text{"a"}) \cdots P(\text{"cat"} \mid \text{"a"})$
     - Assume every word belongs to a cluster:
       Cluster 3: a, the
       Cluster 8: is, was
       Cluster 46: dog, cat, fox, rabbit, bird, boy
       Cluster 64: chasing, following, biting, ...

  5. Motivation example
     - Assume every word belongs to a cluster (clusters as above).
     - "a dog is chasing a cat" is tagged C3 C46 C8 C64 C3 C46.

  6. Motivation example
     - Assume every word belongs to a cluster.
     - "a dog is chasing a cat", with each word under its cluster tag: a/C3, dog/C46, is/C8, chasing/C64, a/C3, cat/C46.

  7. Motivation example
     - Assume every word belongs to a cluster.
     - "the boy is following a rabbit": the/C3, boy/C46, is/C8, following/C64, a/C3, rabbit/C46.

  8. Motivation example
     - Assume every word belongs to a cluster.
     - "a fox was chasing a bird": a/C3, fox/C46, was/C8, chasing/C64, a/C3, bird/C46.

  9. Brown Clustering
     - Let $C(w)$ denote the cluster that $w$ belongs to.
     - "a dog is chasing a cat", tagged C3 C46 C8 C64 C3 C46, involves cluster-to-cluster transition probabilities such as $P(C(\text{dog}) \mid C(\text{a}))$ and cluster-to-word emission probabilities such as $P(\text{cat} \mid C(\text{cat}))$.

  10. Brown clustering model
     - P("a dog is chasing a cat")
       $= P(C(\text{a}) \mid C(w_0))\, P(C(\text{dog}) \mid C(\text{a}))\, P(C(\text{is}) \mid C(\text{dog})) \cdots$
       $\times\ P(\text{a} \mid C(\text{a}))\, P(\text{dog} \mid C(\text{dog})) \cdots$

  11. Brown clustering model
     - P("a dog is chasing a cat") $= P(C(\text{a}) \mid C(w_0))\, P(C(\text{dog}) \mid C(\text{a}))\, P(C(\text{is}) \mid C(\text{dog})) \cdots P(\text{a} \mid C(\text{a}))\, P(\text{dog} \mid C(\text{dog})) \cdots$
     - In general:
       $P(w_0, w_1, w_2, \dots, w_n) = P(C(w_1) \mid C(w_0))\, P(C(w_2) \mid C(w_1)) \cdots P(C(w_n) \mid C(w_{n-1}))\, P(w_1 \mid C(w_1))\, P(w_2 \mid C(w_2)) \cdots P(w_n \mid C(w_n))$
       $= \prod_{i=1}^{n} P(C(w_i) \mid C(w_{i-1}))\, P(w_i \mid C(w_i))$
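
A minimal sketch of evaluating this factorization, using the toy cluster map from the motivation example; the transition and emission tables below are placeholder values for illustration, not estimated parameters (those come from the MLE step on slide 15).

```python
from collections import defaultdict

# Toy cluster map from the motivation example.
cluster = {"a": 3, "the": 3, "is": 8, "was": 8,
           "dog": 46, "cat": 46, "fox": 46, "rabbit": 46,
           "bird": 46, "boy": 46, "chasing": 64, "following": 64}

# Placeholder tables; in the real model these are learned parameters.
p_trans = defaultdict(lambda: 0.25)  # P(c' | c): cluster-to-cluster
p_emit = defaultdict(lambda: 0.1)    # P(w | c): word given its own cluster

def sentence_prob(words, c0=0):
    """prod_i P(C(w_i) | C(w_{i-1})) * P(w_i | C(w_i)); c0 stands in for C(w_0)."""
    prob, prev = 1.0, c0
    for w in words:
        c = cluster[w]
        prob *= p_trans[(prev, c)] * p_emit[(w, c)]
        prev = c
    return prob

print(sentence_prob("a dog is chasing a cat".split()))
```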

  12. Model parameters
     $P(w_0, w_1, w_2, \dots, w_n) = \prod_{i=1}^{n} P(C(w_i) \mid C(w_{i-1}))\, P(w_i \mid C(w_i))$
     - Parameter set 1: the cluster transition probabilities $P(C(w_i) \mid C(w_{i-1}))$
     - Parameter set 2: the word emission probabilities $P(w_i \mid C(w_i))$
     - Parameter set 3: the clustering itself, $C(w_i)$
     (Running example: "a dog is chasing a cat" tagged C3 C46 C8 C64 C3 C46, with clusters as above.)

  13. Model parameters
     $P(w_0, w_1, w_2, \dots, w_n) = \prod_{i=1}^{n} P(C(w_i) \mid C(w_{i-1}))\, P(w_i \mid C(w_i))$
     - A vocabulary set $V$.
     - A function $C: V \to \{1, 2, 3, \dots, k\}$, i.e., a partition of the vocabulary into $k$ classes.
     - Conditional probabilities $P(c' \mid c)$ for $c, c' \in \{1, \dots, k\}$.
     - Conditional probabilities $P(w \mid c)$ for $c \in \{1, \dots, k\}$ and $w$ with $C(w) = c$.
     - $\theta$ represents the set of conditional probability parameters; $C$ represents the clustering.

  14. Log likelihood
     $LL(\theta, C) = \log P(w_0, w_1, w_2, \dots, w_n \mid \theta, C)$
     $= \log \prod_{i=1}^{n} P(C(w_i) \mid C(w_{i-1}))\, P(w_i \mid C(w_i))$
     $= \sum_{i=1}^{n} [\log P(C(w_i) \mid C(w_{i-1})) + \log P(w_i \mid C(w_i))]$
     - Maximizing $LL(\theta, C)$ can be done by alternately updating $\theta$ and $C$:
       1. $\max_{\theta \in \Theta} LL(\theta, C)$
       2. $\max_{C} LL(\theta, C)$
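
As a sketch, the log likelihood itself is a single pass over the tagged corpus; `p_trans` and `p_emit` are the two parameter tables (estimated in the next slide's sketch), and class 0 is assumed to tag the begin-of-sentence position.

```python
import math

def log_likelihood(corpus, cluster, p_trans, p_emit):
    """LL(theta, C) = sum_i [log P(C(w_i)|C(w_{i-1})) + log P(w_i|C(w_i))]."""
    ll = 0.0
    for sent in corpus:
        tags = [0] + [cluster[w] for w in sent]  # 0 = begin-of-sentence class
        for w, c_prev, c in zip(sent, tags, tags[1:]):
            ll += math.log(p_trans[(c_prev, c)]) + math.log(p_emit[(w, c)])
    return ll
```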

  15. $\max_{\theta \in \Theta} LL(\theta, C)$
     $LL(\theta, C) = \log P(w_0, w_1, w_2, \dots, w_n \mid \theta, C)$
     $= \log \prod_{i=1}^{n} P(C(w_i) \mid C(w_{i-1}))\, P(w_i \mid C(w_i))$
     $= \sum_{i=1}^{n} [\log P(C(w_i) \mid C(w_{i-1})) + \log P(w_i \mid C(w_i))]$
     - $P(c' \mid c) = \dfrac{\#(c, c')}{\#(c)}$
     - $P(w \mid c) = \dfrac{\#(w, c)}{\#(c)}$
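
A sketch of this closed-form update: with the clustering fixed, both parameter sets are just count ratios. The `cluster` map and the reserved class 0 for begin-of-sentence are carried over from the earlier sketches.

```python
from collections import Counter

def estimate_params(corpus, cluster):
    """MLE: P(c'|c) = #(c,c')/#(c) and P(w|c) = #(w,c)/#(c)."""
    c_count, cc_count, wc_count = Counter(), Counter(), Counter()
    for sent in corpus:
        tags = [0] + [cluster[w] for w in sent]
        c_count.update(tags)
        cc_count.update(zip(tags, tags[1:]))  # #(c, c')
        wc_count.update(zip(sent, tags[1:]))  # #(w, c)
    p_trans = {(c, c2): n / c_count[c] for (c, c2), n in cc_count.items()}
    p_emit = {(w, c): n / c_count[c] for (w, c), n in wc_count.items()}
    return p_trans, p_emit
```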

  16. $\max_{C} LL(\theta, C)$
     $\max_{C} \sum_{i=1}^{n} [\log P(C(w_i) \mid C(w_{i-1})) + \log P(w_i \mid C(w_i))]$
     $= n \sum_{c=1}^{k} \sum_{c'=1}^{k} q(c, c') \log \dfrac{q(c, c')}{q(c)\, q(c')} + G$, where $G$ is a constant.
     - Here, $q(c, c') = \dfrac{\#(c, c')}{\sum_{c, c'} \#(c, c')}$ and $q(c) = \dfrac{\#(c)}{\sum_{c} \#(c)}$.
     - $\sum_{c} \sum_{c'} q(c, c') \log \dfrac{q(c, c')}{q(c)\, q(c')}$ is the mutual information between adjacent clusters.
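
A sketch of the quantity being maximized: the mutual information between adjacent clusters, computed from cluster-bigram counts. One assumption here: the marginals are computed separately for the left and right positions of the bigram, which gives the exact MI of the joint; the slide's single $q(c)$ is the same idea.

```python
import math
from collections import Counter

def adjacent_cluster_mi(corpus, cluster):
    """Mutual information between the clusters of adjacent words."""
    cc = Counter()
    for sent in corpus:
        tags = [0] + [cluster[w] for w in sent]  # 0 = begin-of-sentence class
        cc.update(zip(tags, tags[1:]))
    total = sum(cc.values())
    left, right = Counter(), Counter()
    for (c1, c2), n in cc.items():
        left[c1] += n
        right[c2] += n
    return sum((n / total) * math.log(n * total / (left[c1] * right[c2]))
               for (c1, c2), n in cc.items())
```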

  17. $\max_{C} LL(\theta, C)$ (continued)
     $\max_{C} \sum_{i=1}^{n} [\log P(C(w_i) \mid C(w_{i-1})) + \log P(w_i \mid C(w_i))] = n \sum_{c=1}^{k} \sum_{c'=1}^{k} q(c, c') \log \dfrac{q(c, c')}{q(c)\, q(c')} + G$

  18. Algorithm 1
     - Start with $|V|$ clusters: each word is in its own cluster.
     - The goal is to get $k$ clusters.
     - We run $|V| - k$ merge steps:
       - Pick 2 clusters and merge them.
       - Each step picks the merge maximizing $LL(\theta, C)$.
     - Cost: $O(|V| - k)$ iterations (#Iters), $O(|V|^2)$ candidate pairs per iteration (#pairs), and $O(|V|^2)$ to compute the likelihood of each merge (compute LL), i.e., $O(|V|^5)$ overall; this can be improved to $O(|V|^3)$. A sketch follows below.
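
A deliberately naive sketch of Algorithm 1 that recomputes the mutual-information objective for every candidate pair at every step, i.e., the expensive version the cost analysis above describes; the begin-of-sentence tag -1 is an implementation detail of this sketch.

```python
import math
from collections import Counter

def brown_merge_naive(corpus, k):
    """Greedy agglomerative merging: |V| - k merges, each chosen to keep
    the mutual information between adjacent clusters as high as possible."""
    vocab = sorted({w for sent in corpus for w in sent})
    cluster = {w: i for i, w in enumerate(vocab)}  # one cluster per word

    def mi(cl):
        cc = Counter()
        for sent in corpus:
            tags = [-1] + [cl[w] for w in sent]  # -1 = begin-of-sentence
            cc.update(zip(tags, tags[1:]))
        total = sum(cc.values())
        left, right = Counter(), Counter()
        for (a, b), n in cc.items():
            left[a] += n
            right[b] += n
        return sum((n / total) * math.log(n * total / (left[a] * right[b]))
                   for (a, b), n in cc.items())

    while len(set(cluster.values())) > k:
        ids = sorted(set(cluster.values()))
        pairs = [(a, b) for i, a in enumerate(ids) for b in ids[i + 1:]]
        a, b = max(pairs, key=lambda ab: mi(
            {w: (ab[0] if c == ab[1] else c) for w, c in cluster.items()}))
        cluster = {w: (a if c == b else c) for w, c in cluster.items()}
    return cluster
```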

  19. Algorithm 2
     - $m$: a hyper-parameter; sort words by frequency.
     - Take the top $m$ most frequent words and put each of them in its own cluster $c_1, c_2, c_3, \dots, c_m$.
     - For $i = m + 1, \dots, |V|$:
       - Create a new cluster $c_{m+1}$ for the $i$-th most frequent word (we now have $m + 1$ clusters).
       - Choose two clusters from the $m + 1$ clusters based on $LL(\theta, C)$ and merge them ⇒ back to $m$ clusters.
     - Carry out $(m - 1)$ final merges ⇒ a full hierarchy.
     - Running time: $O(|V| m^2 + n)$, where $n$ is the number of words in the corpus. A skeleton follows below.
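
A skeleton of Algorithm 2's main loop, under stated assumptions: `merge_score(clusters, i, j)` is a hypothetical plug-in that returns the objective (e.g., the MI sketch above) after merging clusters i and j, and the $(m - 1)$ final merges that build the hierarchy are left as a comment.

```python
from collections import Counter

def brown_window(corpus, m, merge_score):
    """Keep a working set of m clusters; admit the next most frequent word
    as cluster m+1, then merge the best pair to get back to m clusters."""
    freq = Counter(w for sent in corpus for w in sent)
    words = [w for w, _ in freq.most_common()]
    clusters = [{w} for w in words[:m]]  # top-m words, one cluster each
    merges = []                          # record of merge decisions
    for w in words[m:]:
        clusters.append({w})             # temporarily m + 1 clusters
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = max(pairs, key=lambda p: merge_score(clusters, p[0], p[1]))
        merges.append((frozenset(clusters[i]), frozenset(clusters[j])))
        clusters[i] = clusters[i] | clusters[j]
        del clusters[j]                  # back to m clusters
    # The (m - 1) final merges over `clusters` would yield the full hierarchy.
    return clusters, merges
```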

  20. Example clusters (Brown+1992) [figure]

  21. Example hierarchy (Miller+2004) [figure]

  22. Quiz 1
     - 30 min (9/20 Tue., 12:30pm-1:00pm).
     - Fill-in-the-blank, true/false, and short-answer questions.
     - Closed book, closed notes, closed laptop.
     - Sample questions:
       - Add-one smoothing vs. add-lambda smoothing.
       - Given $a = (1, 3, 5)$ and $b = (2, 3, 6)$, what is the cosine similarity between $a$ and $b$?
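
For the sample cosine-similarity question, a quick worked check in Python:

```python
import math

a, b = (1, 3, 5), (2, 3, 6)
dot = sum(x * y for x, y in zip(a, b))      # 1*2 + 3*3 + 5*6 = 41
norm_a = math.sqrt(sum(x * x for x in a))   # sqrt(35)
norm_b = math.sqrt(sum(y * y for y in b))   # sqrt(49) = 7
print(dot / (norm_a * norm_b))              # 41 / (7 * sqrt(35)) ≈ 0.990
```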

