Lecture 6: Representing Words
Kai-Wei Chang, CS @ UCLA


1. Lecture 6: Representing Words. Kai-Wei Chang, CS @ UCLA, kw@kwchang.net. Course webpage: https://uclanlp.github.io/CS269-17/

2. Bag-of-Words with N-grams
- N-gram: a contiguous sequence of n tokens from a given piece of text
  (comparison of n-gram models: http://recognize-speech.com/language-model/n-gram-model/comparison)
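As a concrete illustration (not from the slides), a minimal Python sketch of the definition above, extracting all contiguous n-token sequences from a token list:

```python
def ngrams(tokens, n):
    """All contiguous n-token subsequences of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("a dog is chasing a cat".split(), 2))
# [('a', 'dog'), ('dog', 'is'), ('is', 'chasing'), ('chasing', 'a'), ('a', 'cat')]
```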

3. Language model
- Probability distributions over sentences (i.e., word sequences):
  P(W) = P(x_1 x_2 x_3 x_4 … x_n)
- Can use them to generate strings:
  P(x_n | x_1 x_2 x_3 … x_{n-1})
- Rank possible sentences:
  P("Today is Tuesday") > P("Tuesday Today is")
  P("Today is Tuesday") > P("Today is Los Angeles")

4. N-Gram Models
- Unigram model: P(x_1) P(x_2) P(x_3) … P(x_n)
- Bigram model: P(x_1) P(x_2 | x_1) P(x_3 | x_2) … P(x_n | x_{n-1})
- Trigram model: P(x_1) P(x_2 | x_1) P(x_3 | x_2, x_1) … P(x_n | x_{n-1}, x_{n-2})
- N-gram model: P(x_1) P(x_2 | x_1) … P(x_n | x_{n-1} x_{n-2} … x_{n-N+1})
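As an illustration of the bigram factorization above (the probability table here is made up, not estimated from data), a short Python sketch that multiplies the chain of P(x_i | x_{i-1}) factors:

```python
def bigram_score(tokens, P, start="<s>"):
    """Multiply the bigram factors P(x_i | x_{i-1}); "<s>" stands in for a start symbol."""
    prob = 1.0
    prev = start
    for w in tokens:
        prob *= P.get((w, prev), 0.0)  # one factor P(x_i | x_{i-1})
        prev = w
    return prob

# Hypothetical probabilities, for illustration only:
P = {("today", "<s>"): 0.1, ("is", "today"): 0.5, ("tuesday", "is"): 0.2}
print(bigram_score(["today", "is", "tuesday"], P))  # 0.1 * 0.5 * 0.2 = 0.01
```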

5. Random language via n-gram
- http://www.cs.jhu.edu/~jason/465/PowerPoint/lect01,3tr-ngram-gen.pdf
- Collection of n-grams:
  https://research.googleblog.com/2006/08/all-our-n-gram-are-belong-to-you.html

6. N-Gram Viewer: https://books.google.com/ngrams

7. How to represent words?
- N-gram: cannot capture word similarity
- Word clusters
  - Brown clustering
  - Part-of-speech tagging
- Continuous space representation
  - Word embedding

8. Brown Clustering
- Similar to a language model, but the basic unit is "word clusters"
- Intuition: similar words appear in similar contexts
- Recap: bigram language models
  P(x_0, x_1, x_2, …, x_n) = P(x_1 | x_0) P(x_2 | x_1) … P(x_n | x_{n-1})
  = Π_{i=1}^{n} P(x_i | x_{i-1})
  where x_0 is a dummy word representing the beginning of a sentence

9. Motivation example
- "a dog is chasing a cat"
- P(x_0, "a", "dog", …, "cat") = P("a" | x_0) P("dog" | "a") … P("cat" | "a")
- Assume every word belongs to a cluster:
  Cluster 3:  a, the
  Cluster 8:  chasing, following, biting, …
  Cluster 46: dog, cat, fox, rabbit, bird, boy
  Cluster 64: is, was

10. Motivation example
- Assume every word belongs to a cluster
- "a dog is chasing a cat" → C3 C46 C64 C8 C3 C46
  (clusters as on slide 9)

11. Motivation example
- Assume every word belongs to a cluster
- "a dog is chasing a cat"
  a    dog   is    chasing   a    cat
  C3   C46   C64   C8        C3   C46

12. Motivation example
- Assume every word belongs to a cluster
- "the boy is following a rabbit"
  the  boy   is    following  a    rabbit
  C3   C46   C64   C8         C3   C46

13. Motivation example
- Assume every word belongs to a cluster
- "a fox was chasing a bird"
  a    fox   was   chasing   a    bird
  C3   C46   C64   C8        C3   C46
  (note that all three example sentences map to the same cluster sequence)

14. Brown Clustering
- Let C(w) denote the cluster that word w belongs to
- "a dog is chasing a cat"
  a    dog   is    chasing   a    cat
  C3   C46   C64   C8        C3   C46
  with factors such as P(C("dog") | C("a")) and P("cat" | C("cat"))

15. Brown clustering model
- P("a dog is chasing a cat")
  = P(C("a") | C(x_0)) P(C("dog") | C("a")) P(C("is") | C("dog")) …
  × P("a" | C("a")) P("dog" | C("dog")) …
  a    dog   is    chasing   a    cat
  C3   C46   C64   C8        C3   C46

16. Brown clustering model
- P("a dog is chasing a cat")
  = P(C("a") | C(x_0)) P(C("dog") | C("a")) P(C("is") | C("dog")) …
  × P("a" | C("a")) P("dog" | C("dog")) …
- In general:
  P(x_0, x_1, x_2, …, x_n)
  = P(C(x_1) | C(x_0)) P(C(x_2) | C(x_1)) … P(C(x_n) | C(x_{n-1}))
  × P(x_1 | C(x_1)) P(x_2 | C(x_2)) … P(x_n | C(x_n))
  = Π_{i=1}^{n} P(C(x_i) | C(x_{i-1})) P(x_i | C(x_i))
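A sketch of this class-based factorization in Python; the names here are hypothetical: C maps each word (plus the dummy start symbol) to a cluster id, Pc[(c, c_prev)] stands for P(C(x_i) | C(x_{i-1})), and Pw[(w, c)] stands for P(x_i | C(x_i)):

```python
def brown_score(tokens, C, Pc, Pw, start="<s>"):
    """Score a sentence under the class-based bigram factorization."""
    prob = 1.0
    prev_c = C[start]                     # cluster of the dummy word x_0
    for w in tokens:
        c = C[w]
        prob *= Pc.get((c, prev_c), 0.0)  # P(C(x_i) | C(x_{i-1}))
        prob *= Pw.get((w, c), 0.0)       # P(x_i | C(x_i))
        prev_c = c
    return prob
```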

17. Model parameters
  P(x_0, x_1, x_2, …, x_n) = Π_{i=1}^{n} P(C(x_i) | C(x_{i-1})) P(x_i | C(x_i))
- Parameter set 1: the cluster transition probabilities P(C(x_i) | C(x_{i-1}))
- Parameter set 2: the word emission probabilities P(x_i | C(x_i))
- Parameter set 3: the clustering C(x_i) itself
  Example: "a dog is chasing a cat" → C3 C46 C64 C8 C3 C46, with clusters as on slide 9

18. Model parameters
  P(x_0, x_1, x_2, …, x_n) = Π_{i=1}^{n} P(C(x_i) | C(x_{i-1})) P(x_i | C(x_i))
- A vocabulary set V
- A function C: V → {1, 2, 3, …, k}: a partition of the vocabulary into k classes
- Conditional probabilities P(c' | c) for c, c' ∈ {1, …, k}
- Conditional probabilities P(x | c) for c ∈ {1, …, k} and x with C(x) = c
  θ represents the set of conditional probability parameters; C represents the clustering

19. Log likelihood
  LL(θ, C) = log P(x_0, x_1, x_2, …, x_n | θ, C)
  = log Π_{i=1}^{n} P(C(x_i) | C(x_{i-1})) P(x_i | C(x_i))
  = Σ_{i=1}^{n} [ log P(C(x_i) | C(x_{i-1})) + log P(x_i | C(x_i)) ]
- Maximizing LL(θ, C) can be done by alternately updating θ and C:
  1. max_{θ∈Θ} LL(θ, C)
  2. max_C LL(θ, C)

20. max_{θ∈Θ} LL(θ, C)
  LL(θ, C) = log P(x_0, x_1, x_2, …, x_n | θ, C)
  = Σ_{i=1}^{n} [ log P(C(x_i) | C(x_{i-1})) + log P(x_i | C(x_i)) ]
- P(c' | c) = #(c, c') / #(c)
- P(x | c) = #(x, c) / #(c)
  This part is the same as training a POS tagging model.
  See section 9.2: http://ciml.info/dl/v0_99/ciml-v0_99-ch09.pdf
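A minimal sketch of these count-based estimates, assuming corpus is a list of token lists and C maps every word (and a start symbol "<s>") to a cluster id; the returned tables use the same keying as the brown_score sketch after slide 16:

```python
from collections import Counter

def estimate(corpus, C, start="<s>"):
    """MLE estimates: P(c'|c) = #(c,c')/#(c) and P(x|c) = #(x,c)/#(c)."""
    pair, emit, ctx, occ = Counter(), Counter(), Counter(), Counter()
    for sent in corpus:
        prev = C[start]
        for w in sent:
            c = C[w]
            pair[(prev, c)] += 1   # #(c, c'): cluster c followed by c'
            ctx[prev] += 1         # #(c) as a conditioning context
            emit[(c, w)] += 1      # #(x, c)
            occ[c] += 1            # #(c) as a token occurrence
            prev = c
    Pc = {(c2, c1): n / ctx[c1] for (c1, c2), n in pair.items()}
    Pw = {(w, c): n / occ[c] for (c, w), n in emit.items()}
    return Pc, Pw
```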

21. max_C LL(θ, C)
  max_C Σ_{i=1}^{n} [ log P(C(x_i) | C(x_{i-1})) + log P(x_i | C(x_i)) ]
  = n Σ_{c=1}^{k} Σ_{c'=1}^{k} q(c, c') log [ q(c, c') / (q(c) q(c')) ] + G,
  where G is a constant
- Here q(c, c') = #(c, c') / Σ_{c,c'} #(c, c') and q(c) = #(c) / Σ_c #(c),
  with c the cluster of w_i and c' the cluster of w_{i-1}
- Σ_{c,c'} q(c, c') log [ q(c, c') / (q(c) q(c')) ] is the mutual information between adjacent clusters
  See class note: http://web.cs.ucla.edu/~kwchang/teaching/NLP16/slides/classnote.pdf
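A sketch of computing this objective from adjacent-cluster counts. One simplification relative to the slide's definitions: the two marginals are taken separately from the joint counts (cluster of the current word vs. cluster of the previous word) rather than a single q(c):

```python
import math
from collections import Counter

def cluster_mi(pair_counts):
    """Mutual information Σ q(c,c') log[ q(c,c') / (q(c) q(c')) ] from #(c,c') counts."""
    total = sum(pair_counts.values())
    left, right = Counter(), Counter()
    for (c, c2), n in pair_counts.items():
        left[c] += n
        right[c2] += n
    mi = 0.0
    for (c, c2), n in pair_counts.items():
        q = n / total  # joint q(c, c')
        mi += q * math.log(q / ((left[c] / total) * (right[c2] / total)))
    return mi
```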

22. Algorithm 1
- Start with |V| clusters: each word is in its own cluster
- The goal is to get k clusters
- We run |V| − k merge steps:
  - Pick 2 clusters and merge them
  - Each step, pick the merge maximizing LL(θ, C)
- Cost? O(|V| − k) iterations × O(|V|^2) pairs × O(|V|^2) to compute LL = O(|V|^5)
  (can be improved to O(|V|^3))
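A naive rendering of Algorithm 1, illustrative only: objective is a hypothetical function that scores a whole clustering (e.g., cluster_mi from the previous sketch applied to recomputed adjacent-cluster counts). Re-scoring every candidate pair at every step is exactly what produces the high polynomial cost noted above:

```python
def greedy_merge(vocab, k, objective):
    """Merge clusters greedily until k remain, maximizing `objective` at each step."""
    clusters = [frozenset([w]) for w in vocab]  # start: each word its own cluster
    while len(clusters) > k:
        best_score, best = float("-inf"), None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # candidate clustering with clusters i and j merged
                candidate = [c for t, c in enumerate(clusters)
                             if t not in (i, j)] + [clusters[i] | clusters[j]]
                score = objective(candidate)
                if score > best_score:
                    best_score, best = score, candidate
        clusters = best
    return clusters
```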

23. Algorithm 2
- m: a hyper-parameter; sort words by frequency
- Take the top m most frequent words, put each of them in its own cluster c_1, c_2, c_3, …, c_m
- For i = m + 1 … |V|:
  - Create a new cluster c_{m+1} for the i-th most frequent word (we now have m + 1 clusters)
  - Choose two clusters from the m + 1 clusters based on LL(θ, C) and merge ⇒ back to m clusters
- Carry out (m − 1) final merges ⇒ full hierarchy
- Running time: O(|V| m^2 + n), where n is the number of words in the corpus
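A compact sketch of Algorithm 2 under the same assumptions; best_merge is a hypothetical helper that picks and applies the highest-scoring merge among the active clusters, returning one fewer cluster:

```python
def windowed_clustering(words_by_freq, m, best_merge):
    """Keep at most m active clusters while sweeping the frequency-sorted vocabulary."""
    active = [frozenset([w]) for w in words_by_freq[:m]]  # top-m words, one cluster each
    for w in words_by_freq[m:]:
        active.append(frozenset([w]))   # m + 1 active clusters
        active = best_merge(active)     # one merge -> back to m clusters
    # (m - 1) further calls to best_merge would yield the full hierarchy
    return active
```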

24. Example clusters (Brown+1992)

25. Example hierarchy (Miller+2004)
