Naïve Bayes
CMSC 473/673 UMBC
Some slides adapted from 3SLP
Outline
Terminology: bag-of-words
“Naïve” assumption
Training & performance
NB as a language model
Bag-of-words
Based on some tokenization, turn an input document into an array (or dictionary, or set) of its unique vocab items. Position/ordering within the document is not important. Bag-of-words representations go hand-in-hand with one-hot representations, but they can be extended to handle dense representations. Shorthand: BOW
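As a minimal sketch of the idea (the function name, the whitespace tokenization, and the example document are illustrative, not from the slides), a count-based BOW in Python:

```python
from collections import Counter

def bag_of_words(document: str) -> Counter:
    """Count-based bag-of-words: token order is discarded,
    only per-type counts remain."""
    tokens = document.lower().split()  # a deliberately simple tokenization
    return Counter(tokens)

bow = bag_of_words("the film was sweet the film was fun")
# order is gone; only counts per vocab type survive
```

Note that looking up a word that never occurred returns a count of 0, which is one convenience of `Counter` over a plain `dict`.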
The Bag of Words Representation
Count-based BOW
Bag-of-words as a Function
Based on some tokenization, turn an input document into an array (or dictionary, or set) of its unique vocab items. Think of getting a BOW representation as a function f:
input: a document
output: a value g_v for each vocab type v
Some Bag-of-words Functions
Kind | Type of g_v | Interpretation
Binary | 0 or 1 | Did v appear in the document?
Count-based | natural number (int >= 0) | How often did v occur in the document?
Averaged | real number (>= 0, <= 1) | How often did v occur in the document, normalized by doc length?
TF-IDF (term frequency, inverse document frequency) | real number (>= 0) | How frequent is a word, tempered by how prevalent it is across the corpus (to be covered later!)
…
Q: Is this a reasonable representation?
Q: What are some tradeoffs (benefits vs. costs)?
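The first three rows of the table can be sketched directly (function names and the toy token list are my own; TF-IDF is deferred, as the slides say):

```python
from collections import Counter

def bow_binary(tokens):
    # Did each vocab type appear? (0/1)
    return {v: 1 for v in set(tokens)}

def bow_count(tokens):
    # How often did each vocab type occur?
    return dict(Counter(tokens))

def bow_averaged(tokens):
    # Counts normalized by document length (each value in [0, 1])
    n = len(tokens)
    return {v: c / n for v, c in Counter(tokens).items()}

tokens = "the film was fun the film".split()
```

One tradeoff is visible immediately: the binary version throws away frequency information, while the averaged version makes documents of different lengths comparable.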
Outline
Terminology: bag-of-words “Naïve” assumption Training & performance NB as a language model
Bag of Words Classifier
seen 2, sweet 1, whimsical 1, recommend 1, happy 1, ...

classifier
Naïve Bayes (NB) Classifier

argmax_Z q(Y | Z) * q(Z)

Start with Bayes Rule: Z is the label, Y is the text.

Q: Are we doing discriminative or generative training?
A: Generative training
Naïve Bayes (NB) Classifier

argmax_Z ∏_t q(Y_t | Z) * q(Z)

Adopt a naïve bag-of-words representation: assume position doesn't matter. Z is the label, Y_t is each word, and the product iterates through the word types.
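The argmax can be sketched as follows (a sketch, not the course's reference implementation: the function name, dictionary layout, and the floor for unseen words are my choices; sums of logs replace the product to avoid floating-point underflow):

```python
import math

def nb_predict(tokens, priors, likelihoods, floor=1e-12):
    """Pick argmax_Z [ log q(Z) + sum_t log q(Y_t | Z) ].
    priors: {label: q(Z)}; likelihoods: {label: {word: q(word | Z)}}."""
    best_label, best_score = None, float("-inf")
    for z, prior in priors.items():
        score = math.log(prior)
        for y in tokens:
            # tiny floor for words unseen under class z (a stand-in, not the slides' choice)
            score += math.log(likelihoods[z].get(y, floor))
        if score > best_score:
            best_label, best_score = z, score
    return best_label

# toy per-class unigram probabilities (from the worked example later in the deck)
priors = {"pos": 0.5, "neg": 0.5}
likelihoods = {
    "pos": {"I": 0.1, "love": 0.1, "this": 0.01, "fun": 0.05, "film": 0.1},
    "neg": {"I": 0.2, "love": 0.001, "this": 0.01, "fun": 0.005, "film": 0.1},
}
label = nb_predict("I love this fun film".split(), priors, likelihoods)
```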
Outline
Terminology: bag-of-words
“Naïve” assumption
Training & performance
NB as a language model
Learning for a Naïve Bayes Classifier

Assuming V vocab types x_1, ..., x_V and L classes c_1, ..., c_L (and appropriate corpora):

Q: What parameters (values/weights) must be learned?
A: q(x_v | c_l) and q(c_l)

Q: How many parameters must be learned?
A: VL + L

Q: What distributions need to sum to 1?
A: Each q(· | c_l), and the prior
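The parameter count above can be stated as a one-liner (the function name and example sizes are illustrative):

```python
def nb_num_parameters(V: int, L: int) -> int:
    """V*L conditional probabilities q(x_v | c_l)
    plus L prior probabilities q(c_l)."""
    return V * L + L

n = nb_num_parameters(10000, 2)  # e.g., a 10k-type vocab with binary sentiment labels
```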
Learning for a Naïve Bayes Classifier

If you're going to compute perplexity from q(· | c_l), all class-specific language models q(· | c_l) MUST share a common vocab (otherwise it's not a fair comparison!)

Q: Should OOV and UNK be included?
Q: Should EOS be included?

A binary/count-based NB classifier is also called a Multinomial Naïve Bayes classifier.
Multinomial Naïve Bayes: Learning

From training corpus, extract Vocabulary

Calculate the P(c_j) terms:
For each c_j in C do
  docs_j = all docs with class = c_j
  P(c_j) = |docs_j| / (# of docs)

Calculate the P(w_k | c_j) terms (a class unigram LM):
Text_j = single doc containing all of docs_j
For each word w_k in Vocabulary do
  n_k = # of occurrences of w_k in Text_j
  P(w_k | c_j) = n_k / |Text_j|
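The counting procedure above can be sketched as follows (a sketch under my own data layout of `(tokens, label)` pairs; maximum-likelihood estimates with no smoothing, matching the slide):

```python
from collections import Counter, defaultdict

def train_multinomial_nb(docs):
    """docs: list of (tokens, label) pairs.
    Returns MLE priors P(c_j) = |docs_j| / #docs and, per class,
    a unigram LM P(w_k | c_j) = n_k / |Text_j| (no smoothing)."""
    priors = {c: n / len(docs)
              for c, n in Counter(label for _, label in docs).items()}

    text = defaultdict(Counter)  # Text_j: counts over all docs of class j
    for tokens, label in docs:
        text[label].update(tokens)

    likelihoods = {}
    for c, counts in text.items():
        total = sum(counts.values())
        likelihoods[c] = {w: n / total for w, n in counts.items()}
    return priors, likelihoods

docs = [("fun fun film".split(), "pos"),
        ("fun film".split(), "pos"),
        ("bad film".split(), "neg")]
priors, likelihoods = train_multinomial_nb(docs)
```

Without smoothing, any word unseen in some class gets probability zero under that class, which is one reason add-one (or similar) smoothing is commonly layered on top.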
Brill and Banko (2001): with enough data, the choice of classifier may not matter
Outline
Terminology: bag-of-words
“Naïve” assumption
Training & performance
NB as a language model
Naïve Bayes as a Language Model (Sec. 13.2.1)

Each class defines a unigram language model:

Word | Positive Model | Negative Model
I | 0.1 | 0.2
love | 0.1 | 0.001
this | 0.01 | 0.01
fun | 0.05 | 0.005
film | 0.1 | 0.1

Which class assigns the higher probability to s = "I love this fun film"?

P(s | pos) = 0.1 * 0.1 * 0.01 * 0.05 * 0.1 = 5e-7
P(s | neg) = 0.2 * 0.001 * 0.01 * 0.005 * 0.1 = 1e-9

5e-7 ≈ P(s|pos) > P(s|neg) ≈ 1e-9
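The worked example can be checked in a few lines (variable names are my own; the probabilities are the ones from the positive/negative model tables above):

```python
from math import prod

pos = {"I": 0.1, "love": 0.1, "this": 0.01, "fun": 0.05, "film": 0.1}
neg = {"I": 0.2, "love": 0.001, "this": 0.01, "fun": 0.005, "film": 0.1}

s = "I love this fun film".split()
p_pos = prod(pos[w] for w in s)  # 0.1 * 0.1 * 0.01 * 0.05 * 0.1 ≈ 5e-7
p_neg = prod(neg[w] for w in s)  # 0.2 * 0.001 * 0.01 * 0.005 * 0.1 = 1e-9
```

Since word order is ignored, any permutation of s gets the same probability under either class.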
Naïve Bayes To Try
http://csee.umbc.edu/courses/undergraduate/473/f19/nb
Summary: Naïve Bayes is Not So Naïve
Very fast, low storage requirements
Robust to irrelevant features
Very good in domains with many equally important features
Optimal if the independence assumptions hold
Dependable baseline for text classification (but often not the best)
But: Naïve Bayes Isn’t Without Issue
Model the posterior in one go?
Are the features really uncorrelated?
Are plain counts always appropriate?
Are there “better” (automated, more principled) ways of handling missing/noisy data?
Outline
Terminology: bag-of-words
“Naïve” assumption
Training & performance
NB as a language model