Lecture 8: Word Clustering (Kai-Wei Chang, CS @ University of Virginia)


  1. Lecture 8: Word Clustering. Kai-Wei Chang, CS @ University of Virginia, kw@kwchang.net. Course webpage: http://kwchang.net/teaching/NLP16

  2. This lecture: Brown Clustering

  3. Brown Clustering
     - Similar to a language model, but the basic unit is "word clusters".
     - Intuition: again, similar words appear in similar contexts.
     - Recap: bigram language models (see the sketch below):
       P(x_0, x_1, x_2, ..., x_n) = P(x_1 | x_0) P(x_2 | x_1) ... P(x_n | x_{n-1})
                                  = ∏_{i=1}^{n} P(x_i | x_{i-1})
       where x_0 is a dummy word representing "begin of a sentence".
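To make the recap concrete, here is a minimal sketch of a count-based bigram model. The toy corpus and the function name are assumptions for illustration, not from the slides:

```python
# Minimal bigram LM sketch: P(x_1..x_n) = prod_i P(x_i | x_{i-1}),
# with "<s>" standing in for the dummy begin-of-sentence word x_0.
import math
from collections import Counter

def bigram_log_prob(sentence, corpus):
    unigram, bigram = Counter(), Counter()
    for sent in corpus:
        tokens = ["<s>"] + sent
        for prev, w in zip(tokens, tokens[1:]):
            unigram[prev] += 1
            bigram[(prev, w)] += 1
    lp = 0.0
    tokens = ["<s>"] + sentence
    for prev, w in zip(tokens, tokens[1:]):
        lp += math.log(bigram[(prev, w)] / unigram[prev])  # MLE, no smoothing
    return lp

corpus = [["a", "dog", "is", "chasing", "a", "cat"]]
print(bigram_log_prob(["a", "dog", "is", "chasing", "a", "cat"], corpus))
```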

  4. Motivation example
     - "a dog is chasing a cat"
     - P(x_0, "a", "dog", ..., "cat") = P("a" | x_0) P("dog" | "a") ... P("cat" | "a")
     - Assume every word belongs to a cluster:
       Cluster 46: dog, fox, bird, cat, rabbit, boy
       Cluster 8: chasing, following, biting, ...
       Cluster 3: a, the
       Cluster 64: is, was

  5. Motivation example
     - Assume every word belongs to a cluster (as above).
     - "a dog is chasing a cat" → C3 C46 C64 C8 C3 C46

  6. Motivation example
     - "a dog is chasing a cat", word by word:
       a → C3, dog → C46, is → C64, chasing → C8, a → C3, cat → C46

  7. Motivation example
     - "the boy is following a rabbit":
       the → C3, boy → C46, is → C64, following → C8, a → C3, rabbit → C46

  8. Motivation example
     - "a fox was chasing a bird":
       a → C3, fox → C46, was → C64, chasing → C8, a → C3, bird → C46

  9. Brown Clustering
     - Let C(x) denote the cluster that x belongs to.
     - "a dog is chasing a cat" → C3 C46 C64 C8 C3 C46,
       with cluster-transition factors such as P(C("dog") | C("a")) and
       word-emission factors such as P("cat" | C("cat")).

  10. Brown clustering model
      - P("a dog is chasing a cat")
        = P(C("a") | C_0) P(C("dog") | C("a")) P(C("is") | C("dog")) ...
          × P("a" | C("a")) P("dog" | C("dog")) ...
        where C_0 is the cluster of the dummy begin-of-sentence word.

  11. Brown clustering model
      - P("a dog is chasing a cat")
        = P(C("a") | C_0) P(C("dog") | C("a")) ... P("a" | C("a")) P("dog" | C("dog")) ...
      - In general:
        P(x_0, x_1, x_2, ..., x_n)
          = P(C(x_1) | C(x_0)) P(C(x_2) | C(x_1)) ... P(C(x_n) | C(x_{n-1}))
            × P(x_1 | C(x_1)) P(x_2 | C(x_2)) ... P(x_n | C(x_n))
          = ∏_{i=1}^{n} P(C(x_i) | C(x_{i-1})) P(x_i | C(x_i))
        (a code sketch follows below)
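As a concrete illustration of the factorization above, here is a minimal sketch with toy clusters and toy probability tables. All numbers and names are assumptions for illustration, not estimates from data:

```python
# Score a sentence under the class-based bigram model:
# log P(x_1..x_n) = sum_i log P(C(x_i)|C(x_{i-1})) + log P(x_i|C(x_i)).
import math

# Assumed word-to-cluster map from the motivation example; 0 is the
# cluster of the dummy begin-of-sentence word "<s>".
C = {"a": 3, "the": 3, "dog": 46, "cat": 46, "is": 64, "was": 64,
     "chasing": 8, "following": 8, "<s>": 0}

# Toy tables: trans[(c, c2)] = P(c2 | c), emit[(c, x)] = P(x | c)
trans = {(0, 3): 0.9, (3, 46): 0.8, (46, 64): 0.5, (64, 8): 0.7, (8, 3): 0.6}
emit = {(3, "a"): 0.6, (46, "dog"): 0.2, (64, "is"): 0.5,
        (8, "chasing"): 0.3, (46, "cat"): 0.2}

def log_prob(words):
    lp = 0.0
    prev = C["<s>"]
    for w in words:
        c = C[w]
        lp += math.log(trans[(prev, c)]) + math.log(emit[(c, w)])
        prev = c
    return lp

print(log_prob(["a", "dog", "is", "chasing", "a", "cat"]))
```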

  12. Model parameters
      P(x_0, x_1, x_2, ..., x_n) = ∏_{i=1}^{n} P(C(x_i) | C(x_{i-1})) P(x_i | C(x_i))
      - Parameter set 1: the cluster-transition probabilities P(C(x_i) | C(x_{i-1}))
      - Parameter set 2: the word-emission probabilities P(x_i | C(x_i))
      - Parameter set 3: the clustering C(x_i) itself
      Example: "a dog is chasing a cat" → C3 C46 C64 C8 C3 C46, with the clusters as before.

  13. Model parameters
      P(x_0, x_1, x_2, ..., x_n) = ∏_{i=1}^{n} P(C(x_i) | C(x_{i-1})) P(x_i | C(x_i))
      - A vocabulary set V
      - A function C: V → {1, 2, 3, ..., k}, i.e., a partition of the vocabulary into k classes
      - Conditional probability P(c' | c) for c, c' ∈ {1, ..., k}
      - Conditional probability P(x | c) for c ∈ {1, ..., k} and words x with C(x) = c
      θ represents the set of conditional probability parameters; C represents the clustering.

  14. Log likelihood
      LL(θ, C) = log P(x_0, x_1, x_2, ..., x_n | θ, C)
               = log ∏_{i=1}^{n} P(C(x_i) | C(x_{i-1})) P(x_i | C(x_i))
               = ∑_{i=1}^{n} [ log P(C(x_i) | C(x_{i-1})) + log P(x_i | C(x_i)) ]
      - Maximizing LL(θ, C) can be done by alternately updating θ and C:
        1. max_{θ∈Θ} LL(θ, C)
        2. max_{C} LL(θ, C)

  15. max_{θ∈Θ} LL(θ, C)
      LL(θ, C) = log P(x_0, x_1, x_2, ..., x_n | θ, C)
               = ∑_{i=1}^{n} [ log P(C(x_i) | C(x_{i-1})) + log P(x_i | C(x_i)) ]
      For a fixed clustering C, the maximizers are the relative-frequency estimates
      (a code sketch follows below):
      - P(c' | c) = #(c, c') / #(c)
      - P(x | c) = #(x, c) / #(c)
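A minimal sketch of these relative-frequency updates for a fixed clustering. The input format (`corpus` as a list of token lists, `C` as a word-to-cluster dict with cluster 0 reserved for the begin-of-sentence dummy) is an assumption:

```python
from collections import Counter

def estimate_parameters(corpus, C):
    context = Counter()   # #(c): times cluster c is the conditioning context
    pair = Counter()      # #(c, c'): adjacent cluster pairs
    emit_ct = Counter()   # #(x, c): word x emitted from cluster c
    cluster = Counter()   # #(c): times cluster c emits a word
    for sentence in corpus:
        prev = 0          # cluster of the begin-of-sentence dummy
        for w in sentence:
            c = C[w]
            pair[(prev, c)] += 1
            context[prev] += 1
            emit_ct[(w, c)] += 1
            cluster[c] += 1
            prev = c
    # P(c' | c) = #(c, c') / #(c);  P(x | c) = #(x, c) / #(c)
    trans = {(c, c2): n / context[c] for (c, c2), n in pair.items()}
    emit = {(w, c): n / cluster[c] for (w, c), n in emit_ct.items()}
    return trans, emit
```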

  16. max_{C} LL(θ, C)
      max_C ∑_{i=1}^{n} [ log P(C(x_i) | C(x_{i-1})) + log P(x_i | C(x_i)) ]
        = n ∑_{c=1}^{k} ∑_{c'=1}^{k} q(c, c') log [ q(c, c') / (q(c) q(c')) ] + G,
      where G is a constant.
      - Here q(c, c') = #(c, c') / ∑_{d,d'} #(d, d') and q(c) = #(c) / ∑_d #(d).
      - The double sum is the mutual information between adjacent clusters
        (a code sketch follows below).
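A minimal sketch of evaluating the mutual-information term from adjacent-cluster pair counts. The input format, and the use of separate left/right marginals (the standard form of mutual information), are our assumptions:

```python
import math
from collections import Counter

def mutual_information(pair_counts):
    """pair_counts: Counter over (c, c') pairs of adjacent clusters."""
    total = sum(pair_counts.values())
    left, right = Counter(), Counter()   # marginal counts q(c), q(c')
    for (c, c2), n in pair_counts.items():
        left[c] += n
        right[c2] += n
    mi = 0.0
    for (c, c2), n in pair_counts.items():
        q = n / total
        mi += q * math.log(q / ((left[c] / total) * (right[c2] / total)))
    return mi
```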

  17. max_{C} LL(θ, C)
      max_C ∑_{i=1}^{n} [ log P(C(x_i) | C(x_{i-1})) + log P(x_i | C(x_i)) ]
        = n ∑_{c=1}^{k} ∑_{c'=1}^{k} q(c, c') log [ q(c, c') / (q(c) q(c')) ] + G
      So the clustering that maximizes the likelihood is the one that maximizes the
      mutual information between adjacent clusters.

  18. Algorithm 1
      - Start with |V| clusters: each word is in its own cluster.
      - The goal is to get k clusters.
      - We run |V|−k merge steps:
        - Pick 2 clusters and merge them.
        - Each step picks the merge that maximizes LL(θ, C).
      - Cost? O(|V|−k) iterations × O(|V|^2) candidate pairs × O(|V|^2) to compute LL
        = O(|V|^5); this can be improved to O(|V|^3). A naive sketch follows below.
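A naive sketch of Algorithm 1. The `objective` function (e.g., the mutual-information score above) is an assumed input; this is the O(|V|^5)-style version, whereas the O(|V|^3) algorithm caches how the objective changes under each candidate merge instead of rescoring from scratch:

```python
def brown_cluster(words, k, objective):
    clusters = [{w} for w in set(words)]   # start: one cluster per word
    while len(clusters) > k:               # |V| - k merge steps
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # candidate clustering with clusters i and j merged
                merged = (clusters[:i] + clusters[i+1:j] + clusters[j+1:]
                          + [clusters[i] | clusters[j]])
                score = objective(merged)
                if best is None or score > best[0]:
                    best = (score, merged)
        clusters = best[1]                 # keep the best merge
    return clusters
```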

  19. Algorithm 2
      - m: a hyper-parameter; sort words by frequency.
      - Take the top m most frequent words and put each of them in its own cluster
        c_1, c_2, c_3, ..., c_m.
      - For i = m+1 ... |V|:
        - Create a new cluster c_{m+1} for the i-th most frequent word
          (we have m+1 clusters).
        - Choose two clusters from the m+1 clusters based on LL(θ, C) and merge them
          ⇒ back to m clusters.
      - Carry out m−1 final merges ⇒ full hierarchy.
      - Running time: O(|V| m^2 + n), where n = #words in the corpus.
      A sketch follows below.
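A minimal sketch of Algorithm 2. The helper `best_merge(clusters)`, assumed to return the index pair (i, j) with i < j whose merge costs the least likelihood, is our assumption, not part of the slides:

```python
from collections import Counter

def brown_cluster_windowed(corpus_tokens, m, best_merge):
    freq = Counter(corpus_tokens)
    by_freq = [w for w, _ in freq.most_common()]   # sort words by frequency
    clusters = [{w} for w in by_freq[:m]]          # top m words, one cluster each
    merges = []                                    # merge record = the hierarchy
    for w in by_freq[m:]:                          # for i = m+1 ... |V|
        clusters.append({w})                       # now m+1 clusters
        i, j = best_merge(clusters)                # best pair to merge (i < j)
        merges.append((frozenset(clusters[i]), frozenset(clusters[j])))
        clusters[i] = clusters[i] | clusters[j]
        del clusters[j]                            # back to m clusters
    while len(clusters) > 1:                       # m-1 final merges
        i, j = best_merge(clusters)
        merges.append((frozenset(clusters[i]), frozenset(clusters[j])))
        clusters[i] = clusters[i] | clusters[j]
        del clusters[j]
    return merges
```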

  20. Example clusters (Brown et al., 1992) [figure omitted]

  21. Example hierarchy (Miller et al., 2004) [figure omitted]

  22. Quiz 1
      - 30 min (9/20 Tue. 12:30pm-1:00pm)
      - Fill-in-the-blank, true/false
      - Short answer
      - Closed book, closed notes, closed laptop
      - Sample questions:
        - Add-one smoothing vs. add-lambda smoothing
        - a = (1,3,5), b = (2,3,6): what is the cosine similarity between a and b?
          (a worked computation follows below)
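For reference, a worked computation of the sample cosine-similarity question; the numeric answer is ours, not from the slides:

```latex
\cos(a, b) = \frac{a \cdot b}{\|a\|\,\|b\|}
           = \frac{1 \cdot 2 + 3 \cdot 3 + 5 \cdot 6}
                  {\sqrt{1^2 + 3^2 + 5^2}\,\sqrt{2^2 + 3^2 + 6^2}}
           = \frac{41}{\sqrt{35} \cdot 7} \approx 0.990
```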
