  1. Naïve Bayes CMSC 473/673 UMBC Some slides adapted from 3SLP

  2. Outline Terminology: bag-of-words “Naïve” assumption Training & performance NB as a language model

  3. Outline Terminology: bag-of-words “Naïve” assumption Training & performance NB as a language model

  4. Bag-of-words. Based on some tokenization, turn an input document into an array (or dictionary, or set) of its unique vocab items. Position/ordering in the document is not important. Bag-of-words representations go hand-in-hand with one-hot representations, but they can be extended to handle dense representations. Short-hand: BOW

  5. The Bag of Words Representation

  6. The Bag of Words Representation

  7. The Bag of Words Representation

  8. The Bag of Words Representation: Count-based BOW

  9. Bag-of-words as a Function. Based on some tokenization, turn an input document into an array (or dictionary, or set) of its unique vocab items. Think of getting a BOW rep. as a function f: input, a document; output, a container of size V (the vocabulary size), indexable by each vocab type v
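A minimal sketch of such a function f in Python (count-based variant; the function name, toy document, and vocab below are made-up illustrations, not from the slides):

```python
from collections import Counter

# Count-based bag-of-words: map a tokenized document to a container of
# size V, indexable by each vocab type v (position is ignored).
def bow(doc_tokens, vocab):
    counts = Counter(tok for tok in doc_tokens if tok in vocab)
    return {v: counts[v] for v in vocab}

# Toy usage
doc = "I love this fun fun film".split()
vocab = ["I", "love", "this", "fun", "film", "boring"]
print(bow(doc, vocab))
# -> {'I': 1, 'love': 1, 'this': 1, 'fun': 2, 'film': 1, 'boring': 0}
```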

  10. Some Bag-of-words Functions. Types of the function g_w:
     - Binary (0 or 1): did v appear in the document?
     - Count-based (natural number, int >= 0): how often did v occur in the document?
     - Averaged (real number, >= 0 and <= 1): how often did v occur in the document, normalized by doc length?
     - TF-IDF (term frequency, inverse document frequency; real number >= 0): how frequent a word is, tempered by how prevalent it is across the document corpus (to be covered later!)
     - …

  11. Some Bag-of-words Functions (same functions as above). Q: Is this a reasonable representation? Q: What are some tradeoffs (benefits vs. costs)?
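A sketch of the g_w variants from the table above, extending the earlier BOW function (the function and argument names here are assumptions for illustration, not from the slides):

```python
import math
from collections import Counter

def bow_variants(doc_tokens, vocab, corpus=None):
    counts = Counter(t for t in doc_tokens if t in vocab)
    binary   = {v: int(counts[v] > 0) for v in vocab}                   # did v appear?
    count    = {v: counts[v] for v in vocab}                            # how often?
    averaged = {v: counts[v] / max(len(doc_tokens), 1) for v in vocab}  # normalized by doc length
    tf_idf = None
    if corpus is not None:                                              # TF-IDF needs the whole corpus
        n_docs = len(corpus)
        df = {v: sum(v in set(d) for d in corpus) for v in vocab}       # document frequency
        tf_idf = {v: counts[v] * math.log(n_docs / df[v]) if df[v] else 0.0
                  for v in vocab}
    return binary, count, averaged, tf_idf
```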

  12. Outline Terminology: bag-of-words “Naïve” assumption Training & performance NB as a language model

  13. Bag of Words Classifier. [Figure: a document is reduced to word counts (e.g., seen 2, sweet 1, whimsical 1, recommend 1, happy 1, …), and the classifier γ(counts) = c maps those counts to a class.]

  14. Naïve Bayes (NB) Classifier: argmax_Y p(X | Y) · p(Y), where Y is the label and X is the text. Start with Bayes' Rule. Q: Are we doing discriminative training or generative training?

  15. Naïve Bayes (NB) Classifier: argmax_Y p(X | Y) · p(Y), where Y is the label and X is the text. Start with Bayes' Rule. Q: Are we doing discriminative training or generative training? A: generative training

  16. Naïve Bayes (NB) Classifier: argmax_Y ∏_t p(X_t | Y) · p(Y), labeling each word X_t. Iterate through the types X_t, adopting the naïve bag-of-words representation.

  17. Naïve Bayes (NB) Classifier: argmax_Y ∏_t p(X_t | Y) · p(Y), labeling each word X_t. Iterate through the types X_t, adopting the naïve bag-of-words representation. Assume position doesn't matter.
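Spelled out, the "naïve" step factors the likelihood from Bayes' Rule into per-token terms: p(X | Y) = p(X_1, …, X_T | Y) ≈ ∏_t p(X_t | Y). That is, the tokens are treated as conditionally independent given the label Y and their positions are ignored, which is exactly the bag-of-words view above.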

  18. Outline Terminology: bag-of-words “Naïve” assumption Training & performance NB as a language model

  19. Learning for a Naïve Bayes Classifier. Assuming V vocab types w_1, …, w_V and L classes c_1, …, c_L (and appropriate corpora).

  20. Learning for a Naïve Bayes Classifier (same setup). Q: What parameters (values/weights) must be learned?

  21. Learning for a Naïve Bayes Classifier (same setup). Q: What parameters (values/weights) must be learned? A: p(w_v | c_l) and p(c_l)

  22. Learning for a Naïve Bayes Classifier (same setup). Q: How many parameters must be learned?

  23. Learning for a Naïve Bayes Classifier (same setup). Q: How many parameters must be learned? A: VL + L

  24. Learning for a Naïve Bayes Classifier (same setup). Q: What distributions need to sum to 1?

  25. Learning for a Naïve Bayes Classifier (same setup). Q: What distributions need to sum to 1? A: Each p(· | c_l), and the prior
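As a quick worked example (the numbers are illustrative, not from the slides): with V = 10,000 vocab types and L = 2 classes, there are 10,000 × 2 = 20,000 conditional parameters p(w_v | c_l) plus 2 prior parameters p(c_l), i.e. VL + L = 20,002 in total; each of the 2 conditional distributions p(· | c_l), as well as the prior, must separately sum to 1.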

  26. Learning for a Naïve Bayes Classifier (same setup). If you're going to compute perplexity from p(· | c_l), all class-specific language models p(· | c_l) MUST share a common vocab (otherwise it's not a fair comparison!).

  27. (continuing the common-vocab point) Q: Should OOV and UNK be included?

  28. (continuing the common-vocab point) Q: Should OOV and UNK be included? Q: Should EOS be included?

  29. (continuing the common-vocab point) A binary/count-based NB classifier is also called a Multinomial Naïve Bayes classifier.

  30. Multinomial Naïve Bayes: Learning. From the training corpus, extract the Vocabulary.

  31. Multinomial Naïve Bayes: Learning. From the training corpus, extract the Vocabulary. Calculate the P(c_j) terms: for each c_j in C, let docs_j = all docs with class c_j; then P(c_j) = |docs_j| / (total # of docs).

  32. Multinomial Naïve Bayes: Learning. From the training corpus, extract the Vocabulary. Calculate the P(c_j) terms: for each c_j in C, let docs_j = all docs with class c_j; then P(c_j) = |docs_j| / (total # of docs). Calculate the P(w_k | c_j) terms: let Text_j = a single doc containing all of docs_j; for each word w_k in the Vocabulary, let n_k = # of occurrences of w_k in Text_j; then P(w_k | c_j) = n_k / (total # of tokens in Text_j), i.e. the class unigram LM.
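A minimal training-and-prediction sketch in Python following the recipe above (counts per class, class unigram LMs). The function names are illustrative, and the add-one (Laplace) smoothing via alpha mirrors the toy setup mentioned on slide 39 rather than the unsmoothed formula on this slide:

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    """docs: list of token lists; labels: parallel list of class labels."""
    vocab = {w for doc in docs for w in doc}
    by_class = defaultdict(list)
    for doc, c in zip(docs, labels):
        by_class[c].append(doc)

    log_prior, log_lik = {}, {}
    for c, c_docs in by_class.items():
        log_prior[c] = math.log(len(c_docs) / len(docs))       # P(c_j) = |docs_j| / #docs
        counts = Counter(w for doc in c_docs for w in doc)      # n_k over Text_j
        total = sum(counts.values())
        log_lik[c] = {w: math.log((counts[w] + alpha) / (total + alpha * len(vocab)))
                      for w in vocab}                           # smoothed class unigram LM
    return log_prior, log_lik, vocab

def predict(doc_tokens, log_prior, log_lik, vocab):
    # argmax_c log P(c) + sum_t log P(w_t | c); out-of-vocab tokens are skipped
    scores = {c: log_prior[c] + sum(log_lik[c][w] for w in doc_tokens if w in vocab)
              for c in log_prior}
    return max(scores, key=scores.get)
```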

  33. Brill and Banko (2001) With enough data, the classifier may not matter

  34. Outline Terminology: bag-of-words “Naïve” assumption Training & performance NB as a language model

  35. [Sec. 13.2.1] Naïve Bayes as a Language Model. Positive model: I 0.1, love 0.1, this 0.01, fun 0.05, film 0.1. Negative model: I 0.2, love 0.001, this 0.01, fun 0.005, film 0.1.

  36. [Sec. 13.2.1] Naïve Bayes as a Language Model (same models). Which class assigns the higher probability to s = "I love this fun film"?

  37. [Sec. 13.2.1] Naïve Bayes as a Language Model (same models). Which class assigns the higher probability to s? Reading off each word's probability: positive 0.1, 0.1, 0.01, 0.05, 0.1; negative 0.2, 0.001, 0.01, 0.005, 0.1.

  38. [Sec. 13.2.1] Naïve Bayes as a Language Model (same models). 5e-7 ≈ P(s | pos) > P(s | neg) ≈ 1e-9
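Spelling out the products from the tables above: P(s | pos) = 0.1 × 0.1 × 0.01 × 0.05 × 0.1 = 5 × 10^-7, while P(s | neg) = 0.2 × 0.001 × 0.01 × 0.005 × 0.1 = 1 × 10^-9, so the positive model assigns s the higher probability (class priors are ignored in this comparison).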

  39. Naïve Bayes To Try http://csee.umbc.edu/courses/undergraduate/473/f19/nb • Toy problem: classify whether a tweet will be retweeted • Toy problem: OOV and EOS are not included • Laplace smoothing is used for p(word | label)

  40. Summary: Naïve Bayes is Not So Naïve • Very fast, low storage requirements • Robust to irrelevant features • Very good in domains with many equally important features • Optimal if the independence assumptions hold • A dependable baseline for text classification (but often not the best)
