Text Classification & Naïve Bayes
CMSC 723 / LING 723 / INST 725 MARINE CARPUAT
marine@cs.umd.edu
Some slides by Dan Jurafsky & James Martin, Jacob Eisenstein
Today
– Text classification problems and their evaluation
– Linear classifiers: features & weights, bag of words, Naïve Bayes
– Connections: machine learning, probability, linguistics
TEXT CLASSIFICATION
Is this spam?

From: "Fabian Starr" <Patrick_Freeman@pamietaniepeerelu.pl>
Subject: Hey! Sofware for the funny prices!

Get the great discounts on popular software today for PC and Macintosh
http://iiled.org/Cj4Lmx
70-90% Discounts from retail price!!!
All sofware is instantly available to download - No Need Wait!
Authorship attribution
– Anonymous essays urging New York to ratify the U.S. Constitution: Jay, Madison, Hamilton
– Disputed essays later attributed using Bayesian methods: James Madison or Alexander Hamilton?
Sentiment analysis: is this movie review positive or negative?
– "...satire, and some great plot twists"
– "...ever filmed"
– "...was the boxing scenes."
Topic classification
– Assign a MEDLINE article to a category in the MeSH Subject Category Hierarchy (e.g., "Inhibitors")
– ?
Text Classification
– Input: a document w and a fixed set of classes Y = {y1, y2, …, yJ}
– Output: a predicted class y ∈ Y
Classification Methods: Hand-coded rules
– Example rule for spam: black-list-address OR ("dollars" AND "have been selected")
– Accuracy can be high if the rules are carefully refined by an expert
– But building and maintaining the rules is expensive
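A rule like the one above can be sketched directly in code. This is a minimal illustration only; the blacklist address is a hypothetical placeholder, not part of the lecture.

```python
# Hypothetical black-list for illustration
BLACKLIST = {"known-spammer@example.com"}

def is_spam(sender: str, body: str) -> bool:
    # Rule: black-list-address OR ("dollars" AND "have been selected")
    if sender in BLACKLIST:
        return True
    text = body.lower()
    return "dollars" in text and "have been selected" in text

print(is_spam("friend@example.com", "You have been selected to win dollars!"))  # True
print(is_spam("friend@example.com", "Lunch tomorrow?"))  # False
```

Even this tiny example shows why maintenance is expensive: every new spam pattern needs another hand-written clause.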
Classification Methods: Supervised Machine Learning
– Input: a document w; a fixed set of classes Y = {y1, y2, …, yJ}; a training set of m hand-labeled documents (w1,y1), …, (wm,ym)
– Output: a learned classifier w → y
Where do the labels come from?
– By experts or non-experts (crowdsourcing)
– Found data
How do we evaluate a classifier?
– Compare classifier predictions with human annotations
– On held-out test examples
– Evaluation metrics: accuracy, precision, recall
              correct   not correct
selected        tp          fp
not selected    fn          tn

Precision: % of selected items that are correct = tp / (tp + fp)
Recall: % of correct items that are selected = tp / (tp + fn)
A combined measure that assesses the precision/recall tradeoff is F measure (weighted harmonic mean):

F = 1 / (α(1/P) + (1−α)(1/R)) = (β² + 1)PR / (β²P + R)

– Balanced F1 measure, i.e., with β = 1 (that is, α = ½): F = 2PR/(P+R)
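The metrics above can be computed directly from the contingency counts. A short sketch; the counts in the example are made up for illustration.

```python
def precision_recall_f(tp, fp, fn, beta=1.0):
    """Precision, recall, and F_beta from contingency counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    b2 = beta ** 2
    f = (b2 + 1) * p * r / (b2 * p + r)  # with beta=1: 2PR/(P+R)
    return p, r, f

# Hypothetical counts: 8 true positives, 2 false positives, 4 false negatives
p, r, f1 = precision_recall_f(tp=8, fp=2, fn=4)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.8 0.667 0.727
```

Note that with β = 1 the formula reduces to the balanced F1 measure 2PR/(P+R).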
LINEAR CLASSIFIERS
Linear classifiers
– Defined by a feature function representation and weights
– Today: Naïve Bayes
– Other linear classifiers: e.g., perceptron, support vector machines
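The feature-function-plus-weights idea can be sketched as follows: score each class by the dot product of a weight vector with bag-of-words features conjoined with the class label, and predict the argmax. The weights here are hypothetical, for illustration only.

```python
from collections import Counter

def bag_of_words(doc):
    """Bag-of-words representation: token -> count."""
    return Counter(doc.lower().split())

def score(feats, weights, y):
    """Dot product of weights with class-conjoined features f(w, y)."""
    return sum(c * weights.get((tok, y), 0.0) for tok, c in feats.items())

def predict(doc, weights, classes):
    """Argmax over classes of the linear score."""
    feats = bag_of_words(doc)
    return max(classes, key=lambda y: score(feats, weights, y))

# Hypothetical weights for illustration
weights = {("great", "pos"): 1.5, ("awful", "neg"): 2.0}
print(predict("great satire and great plot twists", weights, ["pos", "neg"]))  # pos
```

Naïve Bayes, the perceptron, and SVMs all fit this template; they differ only in how the weights are learned.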
Naïve Bayes is a generative model: it assumes a probabilistic process by which the training examples are generated.
Parameter estimation
– Relative frequency estimator
– Formally: this is the maximum likelihood estimate
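The relative frequency estimator can be sketched in a few lines: the prior P(y) is the fraction of documents with label y, and the likelihood P(token | y) is the fraction of tokens in class y equal to that token. The toy corpus is made up for illustration (and, without smoothing, unseen tokens get zero probability).

```python
from collections import Counter

def train_nb(docs, labels):
    """Relative-frequency (maximum likelihood) estimates:
    P(y) = count(y) / m,  P(token | y) = count(token, y) / count(tokens in y)."""
    m = len(docs)
    class_counts = Counter(labels)
    token_counts = {y: Counter() for y in class_counts}
    for doc, y in zip(docs, labels):
        token_counts[y].update(doc.lower().split())
    priors = {y: c / m for y, c in class_counts.items()}
    likelihoods = {y: {tok: c / sum(tc.values()) for tok, c in tc.items()}
                   for y, tc in token_counts.items()}
    return priors, likelihoods

# Toy corpus for illustration
priors, likelihoods = train_nb(["great plot", "awful film", "great twists"],
                               ["pos", "neg", "pos"])
print(priors["pos"])                # 2/3
print(likelihoods["pos"]["great"])  # 0.5
```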
Today
– Text classification problems and their evaluation
– Linear classifiers: features & weights, bag of words, Naïve Bayes
– Connections: machine learning, probability, linguistics