Text Classification & Naïve Bayes (CMSC 723 / LING 723 / INST 725) - PowerPoint PPT Presentation



SLIDE 1

Text Classification & Naïve Bayes

CMSC 723 / LING 723 / INST 725 MARINE CARPUAT

marine@cs.umd.edu

Some slides by Dan Jurafsky & James Martin, Jacob Eisenstein

SLIDE 2

Today

  • Text classification problems
    – and their evaluation
  • Linear classifiers
    – Features & Weights
    – Bag of words
    – Naïve Bayes

Machine Learning, Probability, Linguistics

SLIDE 3

TEXT CLASSIFICATION

SLIDE 4

Is this spam?

From: "Fabian Starr" <Patrick_Freeman@pamietaniepeerelu.pl> Subject: Hey! Sofware for the funny prices! Get the great discounts on popular software today for PC and Macintosh http://iiled.org/Cj4Lmx 70-90% Discounts from retail price!!! All sofware is instantly available to download - No Need Wait!

SLIDE 5

Who wrote which Federalist papers?

  • 1787-8: anonymous essays try to convince New York to ratify the U.S. Constitution: Jay, Madison, Hamilton.
  • Authorship of 12 of the letters in dispute
  • 1963: solved by Mosteller and Wallace using Bayesian methods

James Madison, Alexander Hamilton

SLIDE 6

Positive or negative movie review?

  • unbelievably disappointing
  • Full of zany characters and richly applied satire, and some great plot twists
  • this is the greatest screwball comedy ever filmed
  • It was pathetic. The worst part about it was the boxing scenes.

SLIDE 7

What is the subject of this article?

  • Antagonists and Inhibitors
  • Blood Supply
  • Chemistry
  • Drug Therapy
  • Embryology
  • Epidemiology

(Figure: assigning a MEDLINE article to categories in the MeSH Subject Category Hierarchy)

SLIDE 8

Text Classification

  • Assigning subject categories, topics, or genres
  • Spam detection
  • Authorship identification
  • Age/gender identification
  • Language Identification
  • Sentiment analysis
SLIDE 9

Text Classification: definition

  • Input:
    – a document w
    – a fixed set of classes Y = {y1, y2, …, yJ}
  • Output: a predicted class y ∈ Y
SLIDE 10

Classification Methods: Hand-coded rules

  • Rules based on combinations of words or other features
    – spam: black-list-address OR ("dollars" AND "have been selected")
  • Accuracy can be high
    – If rules are carefully refined by an expert
  • But building and maintaining these rules is expensive
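The slide's example rule translates directly into code. A minimal sketch, assuming a sender blacklist; the address in `BLACKLIST` is a hypothetical placeholder, not a real one:

```python
# Hand-coded spam rule from the slide:
#   spam if black-list-address OR ("dollars" AND "have been selected")
BLACKLIST = {"spammer@example.com"}  # hypothetical black-list entry

def is_spam(sender: str, body: str) -> bool:
    text = body.lower()
    return sender in BLACKLIST or ("dollars" in text and "have been selected" in text)
```

Even this toy version shows the maintenance problem: every new spam pattern means another hand-written clause.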

SLIDE 11

Classification Methods: Supervised Machine Learning

  • Input
    – a document w
    – a fixed set of classes Y = {y1, y2, …, yJ}
    – a training set of m hand-labeled documents (w1, y1), …, (wm, ym)
  • Output
    – a learned classifier w → y

SLIDE 12

Aside: getting examples for supervised learning

  • Human annotation
    – By experts or non-experts (crowdsourcing)
    – Found data
  • Truth vs. gold standard
  • How do we know how good a classifier is?
    – Accuracy on held-out data

SLIDE 13

Aside: evaluating classifiers

  • How do we know how good a classifier is?
    – Compare classifier predictions with human annotation
    – On held-out test examples
    – Evaluation metrics: accuracy, precision, recall

SLIDE 14

The 2-by-2 contingency table

                  correct     not correct
    selected      tp          fp
    not selected  fn          tn

SLIDE 15

Precision and recall

  • Precision: % of selected items that are correct
  • Recall: % of correct items that are selected

                  correct     not correct
    selected      tp          fp
    not selected  fn          tn
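These two definitions can be written directly from the contingency table; tp, fp, and fn are the cell counts:

```python
def precision(tp: int, fp: int) -> float:
    # % of selected items that are correct
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    # % of correct items that are selected
    return tp / (tp + fn)
```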

SLIDE 16

A combined measure: F

  • A combined measure that assesses the P/R tradeoff is the F measure (weighted harmonic mean):

    F = 1 / (α · (1/P) + (1 − α) · (1/R)) = (β² + 1)PR / (β²P + R)

  • People usually use the balanced F1 measure
    – i.e., with β = 1 (that is, α = ½): F = 2PR / (P + R)
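As a quick sketch in code (the helper names `f_measure` and `f1` are illustrative, not from the slides):

```python
def f_measure(p: float, r: float, beta: float = 1.0) -> float:
    # Weighted harmonic mean of precision and recall:
    # F = (beta^2 + 1) * P * R / (beta^2 * P + R)
    return (beta**2 + 1) * p * r / (beta**2 * p + r)

def f1(p: float, r: float) -> float:
    # Balanced case beta = 1, i.e. F = 2PR / (P + R)
    return f_measure(p, r, beta=1.0)
```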

SLIDE 17

LINEAR CLASSIFIERS

SLIDE 18

Bag of words
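A minimal bag-of-words sketch: a document becomes an unordered multiset of word counts. Lowercasing plus whitespace splitting stands in for real tokenization here:

```python
from collections import Counter

def bag_of_words(document: str) -> Counter:
    # Represent a document by its word counts, discarding word order.
    return Counter(document.lower().split())
```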

SLIDE 19

Defining features

SLIDE 20

Linear classification

SLIDE 21

Linear Models for Classification

  • Feature function representation
  • Weights
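With sparse feature dictionaries and one weight vector per class, linear classification is a dot product followed by an argmax. A sketch with illustrative names (`score`, `predict`):

```python
def score(features: dict, weights: dict) -> float:
    # Dot product of a sparse feature vector f(w, y) with the weight vector.
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def predict(features: dict, weights_by_class: dict) -> str:
    # Choose the class whose weight vector assigns the highest score.
    return max(weights_by_class, key=lambda y: score(features, weights_by_class[y]))
```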

SLIDE 22

How can we learn weights?

  • By hand
  • Probability

– Today: Naïve Bayes

  • Discriminative training

– e.g., perceptron, support vector machines

SLIDE 23

Generative Story for Multinomial Naïve Bayes

  • A hypothetical stochastic process describing how training examples are generated
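The multinomial Naïve Bayes story is: draw a class y from the prior P(y), then draw each word independently from P(w | y). A toy sketch, assuming `priors` and `word_dists` are plain probability dictionaries:

```python
import random

def generate_document(priors: dict, word_dists: dict, length: int):
    # Step 1: draw a class y ~ P(y).
    classes = list(priors)
    y = random.choices(classes, weights=[priors[c] for c in classes])[0]
    # Step 2: draw each word w_i ~ P(w | y), independently.
    vocab = list(word_dists[y])
    words = random.choices(vocab, weights=[word_dists[y][w] for w in vocab], k=length)
    return y, words
```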

SLIDE 24

Prediction with Naïve Bayes
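Prediction selects the class maximizing log P(y) + Σᵢ log P(wᵢ | y). A sketch, assuming log-probabilities are stored in dictionaries; unseen words get −∞ here, which is exactly what smoothing addresses:

```python
def nb_predict(words, log_prior: dict, log_likelihood: dict):
    # argmax over classes y of: log P(y) + sum_i log P(w_i | y)
    def log_score(y):
        return log_prior[y] + sum(
            log_likelihood[y].get(w, float("-inf"))  # unseen word: zero probability
            for w in words
        )
    return max(log_prior, key=log_score)
```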

SLIDE 25

Parameter Estimation

  • “count and normalize”
  • Parameters of a multinomial distribution

    – Relative frequency estimator
    – Formally: this is the maximum likelihood estimate

  • See CIML for derivation
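"Count and normalize" can be sketched over (words, label) training pairs; the relative frequencies are the maximum likelihood estimates of the prior and the per-class word multinomials:

```python
from collections import Counter, defaultdict

def estimate(labeled_docs):
    # labeled_docs: list of (list_of_words, label) pairs.
    class_counts = Counter(y for _, y in labeled_docs)
    word_counts = defaultdict(Counter)
    for words, y in labeled_docs:
        word_counts[y].update(words)

    # Normalize counts into relative frequencies.
    prior = {y: n / len(labeled_docs) for y, n in class_counts.items()}
    likelihood = {}
    for y, counts in word_counts.items():
        total = sum(counts.values())
        likelihood[y] = {w: n / total for w, n in counts.items()}
    return prior, likelihood
```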
SLIDE 26

Smoothing
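Add-α smoothing (Laplace smoothing when α = 1) is one common fix for the zero-probability problem: every word in the vocabulary gets a small pseudo-count. A sketch:

```python
def smoothed_likelihood(word_counts: dict, vocab, alpha: float = 1.0):
    # P(w | y) = (count(w, y) + alpha) / (total count for y + alpha * |V|)
    total = sum(word_counts.values()) + alpha * len(vocab)
    return {w: (word_counts.get(w, 0) + alpha) / total for w in vocab}
```

Words never seen with a class now get a small but nonzero probability, so a single unseen word no longer vetoes an entire class at prediction time.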

SLIDE 27

Naïve Bayes recap

SLIDE 28

Today

  • Text classification problems
    – and their evaluation
  • Linear classifiers
    – Features & Weights
    – Bag of words
    – Naïve Bayes

Machine Learning, Probability, Linguistics