SLIDE 1

SI485i : NLP

Set 5 Using Naïve Bayes

SLIDE 2

Motivation

  • We want to predict something.
  • We have some text related to this something.
  • something = target label Y
  • text = text features X

Given X, what is the most probable Y?

SLIDE 3

Motivation: Author Detection

Alas the day! take heed of him; he stabbed me in mine own house, and that most beastly: in good faith, he cares not what mischief he does. If his weapon be out: he will foin like any devil; he will spare neither man, woman, nor child.

X = the passage of text above
Y = one of { Charles Dickens, William Shakespeare, Herman Melville, Jane Austen, Homer, Leo Tolstoy }

$Y = \arg\max_{y_k} P(Y = y_k)\, P(X \mid Y = y_k)$

SLIDE 4

More Motivation

P(Y = spam | X = email)
P(Y = worthy | X = review sentence)

SLIDE 5

The Naïve Bayes Classifier

  • Recall Bayes rule:

$P(Y_i \mid X_j) = \frac{P(X_j \mid Y_i)\, P(Y_i)}{P(X_j)}$

  • Which is short for:

$P(Y = y_i \mid X = x_j) = \frac{P(X = x_j \mid Y = y_i)\, P(Y = y_i)}{P(X = x_j)}$

  • We can re-write this as:

$P(Y = y_i \mid X = x_j) = \frac{P(X = x_j \mid Y = y_i)\, P(Y = y_i)}{\sum_k P(X = x_j \mid Y = y_k)\, P(Y = y_k)}$

Remaining slides adapted from Tom Mitchell.
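To make the last form concrete, here is a tiny numeric sketch; the authors, the event, and all probability values are invented purely for illustration.

    # Toy Bayes-rule computation (all numbers are made up for this sketch).
    # Y is an author label; X is the event "the word 'whale' appears in the text".
    prior = {"melville": 0.5, "dickens": 0.5}            # P(Y = y_k)
    likelihood = {"melville": 0.01, "dickens": 0.001}    # P(X = x_j | Y = y_k)

    # Denominator: P(X = x_j) = sum_k P(X = x_j | Y = y_k) P(Y = y_k)
    p_x = sum(likelihood[y] * prior[y] for y in prior)

    # Posterior for each author: P(Y = y_k | X = x_j)
    posterior = {y: likelihood[y] * prior[y] / p_x for y in prior}
    print(posterior)   # {'melville': 0.909..., 'dickens': 0.0909...}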

SLIDE 6

Deriving Naïve Bayes

  • Idea: use the training data to directly estimate P(X | Y) and P(Y).
  • We can use these values to estimate P(Y | Xnew) using Bayes rule.
  • Recall that representing the full joint probability P(X | Y) = P(X1, X2, …, Xn | Y) is not practical.

SLIDE 7

Deriving Naïve Bayes

  • However, if we make the assumption that the attributes are independent, estimation is easy!

$P(X_1, \ldots, X_n \mid Y) = \prod_i P(X_i \mid Y)$

  • In other words, we assume all attributes are conditionally independent given Y.
  • Often this assumption is violated in practice, but more on that later…

SLIDE 8

Deriving Naïve Bayes

  • Let X = (X1, …, Xn) and label Y be discrete.
  • Then, we can estimate P(Xi | Y) and P(Y) directly from the training data by counting!

Sky     Temp   Humid    Wind     Water   Forecast   Play?
sunny   warm   normal   strong   warm    same       yes
sunny   warm   high     strong   warm    same       yes
rainy   cold   high     strong   warm    change     no
sunny   warm   high     strong   cool    change     yes

P(Sky = sunny | Play = yes) = ?
P(Humid = high | Play = yes) = ?
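As a sanity check, both estimates can be read straight off the table: among the three Play = yes rows, Sky is sunny in all three and Humid is high in two. A minimal counting sketch (the tuple encoding of the table is just for illustration):

    # The four training rows from the table, keeping only (Sky, Humid, Play).
    data = [("sunny", "normal", "yes"),
            ("sunny", "high",   "yes"),
            ("rainy", "high",   "no"),
            ("sunny", "high",   "yes")]

    yes_rows = [row for row in data if row[2] == "yes"]

    p_sunny_given_yes = sum(r[0] == "sunny" for r in yes_rows) / len(yes_rows)  # 3/3 = 1.0
    p_high_given_yes  = sum(r[1] == "high"  for r in yes_rows) / len(yes_rows)  # 2/3 ≈ 0.67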

SLIDE 9

The Naïve Bayes Classifier

  • Now we have:

$P(Y = y_j \mid X_1, \ldots, X_n) = \frac{P(Y = y_j) \prod_i P(X_i \mid Y = y_j)}{\sum_k P(Y = y_k) \prod_i P(X_i \mid Y = y_k)}$

  • To classify a new point Xnew:

$Y^{new} = \arg\max_{y_k} P(Y = y_k) \prod_i P(X_i \mid Y = y_k)$

SLIDE 10

The Naïve Bayes Algorithm

  • For each value yk:
  • Estimate P(Y = yk) from the data.
  • For each value xij of each attribute Xi:
  • Estimate P(Xi = xij | Y = yk)
  • Classify a new point via:

$Y^{new} = \arg\max_{y_k} P(Y = y_k) \prod_i P(X_i \mid Y = y_k)$

  • In practice, the independence assumption doesn’t often hold true, but Naïve Bayes performs very well despite it.
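The algorithm translates almost directly into code. Below is a minimal sketch under two practical assumptions that are not on the slide: add-one smoothing for unseen feature values and log probabilities to avoid underflow. The function and variable names are my own, not from the course labs.

    from collections import Counter, defaultdict
    import math

    def train(examples):
        """examples: list of (features, label) pairs; features is a list of attribute values."""
        label_counts = Counter(label for _, label in examples)   # counts for P(Y = yk)
        feature_counts = defaultdict(Counter)                    # counts for P(Xi = xij | Y = yk)
        for features, label in examples:
            feature_counts[label].update(features)
        return label_counts, feature_counts

    def classify(features, label_counts, feature_counts):
        total = sum(label_counts.values())
        vocab = {f for counts in feature_counts.values() for f in counts}
        best_label, best_score = None, float("-inf")
        for label, count in label_counts.items():
            score = math.log(count / total)                      # log P(Y = yk)
            denom = sum(feature_counts[label].values()) + len(vocab)
            for f in features:
                # add-one smoothing keeps unseen features from zeroing out the product
                score += math.log((feature_counts[label][f] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label                                        # argmax over yk

With unigram features, train() simply counts words per label, and classify() picks the label whose counts best explain the new point.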

SLIDE 11

An alternate view of NB as LMs

Y1 = dickens, Y2 = twain

Compare P(Y1) * P(X | Y1) against P(Y2) * P(X | Y2).

P(X | Y1) = PY1(X) and P(X | Y2) = PY2(X): each class-conditional probability is just that author's language model.

Bigrams: $P_{Y_1}(X) = \prod_j P_{Y_1}(x_j \mid x_{j-1})$ and $P_{Y_2}(X) = \prod_j P_{Y_2}(x_j \mid x_{j-1})$
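A minimal sketch of this language-model view, assuming each author has a bigram model from Lab 2 exposing a hypothetical logprob(text) method that returns log P(X | author):

    import math

    def most_likely_author(text, models, priors):
        """models: dict author -> bigram LM with .logprob(text) = log P(X | author) (hypothetical API);
           priors: dict author -> P(Y = author)."""
        # argmax over authors of log P(Y) + log P(X | Y)
        return max(models, key=lambda a: math.log(priors[a]) + models[a].logprob(text))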

SLIDE 12

Naïve Bayes Applications

  • Text classification
  • Which e-mails are spam?
  • Which e-mails are meeting notices?
  • Which author wrote a document?
  • Which webpages are about current events?
  • Which blog contains angry writing?
  • What sentence in a document talks about company X?
  • etc.


SLIDE 13

Text and Features

  • What is Xi?
  • Could be unigrams, hopefully bigrams too.
  • It can be anything that is computed from the text X.
  • Yes, I really mean anything. Creativity and intuition into language is where the real gains come from in NLP.
  • Non n-gram examples (a small extraction sketch follows below):
  • X10 = “the number of sentences that begin with conjunctions”
  • X356 = “existence of a semi-colon in the paragraph”

$P(X_1, \ldots, X_n \mid Y) = \prod_i P(X_i \mid Y)$
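Here is a small sketch of extracting those two non-n-gram features from raw text; the feature names, the conjunction list, and the naive sentence splitting are my own illustrative choices, not part of the lab.

    CONJUNCTIONS = {"and", "but", "or", "so", "yet"}   # small illustrative set

    def extract_features(text):
        feats = []
        # X356-style feature: a semicolon exists in the paragraph
        if ";" in text:
            feats.append("FEAT-SEMICOLON")
        # X10-style feature: number of sentences that begin with a conjunction
        sentences = [s.strip() for s in text.split(".") if s.strip()]   # naive sentence split
        n = sum(1 for s in sentences if s.split()[0].lower() in CONJUNCTIONS)
        feats.append("FEAT-CONJ-START=" + str(n))
        return feats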

SLIDE 14

Features

  • In machine learning, “features” are the attributes to which you assign weights (probabilities in Naïve Bayes) that help in the final classification.
  • Up until now, your features have been n-grams. You now want to consider other types of features.
  • You count features just like n-grams. How many did you see?
  • X = set of features
  • P(Y|X) = probability of a Y given a set of features
SLIDE 15

How do you count features?

  • Feature idea: “a semicolon exists in this sentence”
  • Count them:
  • Count(“FEAT-SEMICOLON”, 1)
  • Make up a unique name for the feature, then count!
  • Compute probability:
  • P(“FEAT-SEMICOLON” | author=“dickens”) = Count(“FEAT-SEMICOLON”) / (# dickens sentences)
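A minimal sketch of that counting and probability estimate, assuming you already have one feature list per sentence for a single author (the names below are illustrative, not lab code):

    from collections import Counter

    def feature_probs(sentence_feature_lists):
        """sentence_feature_lists: one list of feature names per Dickens sentence."""
        counts = Counter()
        for feats in sentence_feature_lists:
            counts.update(set(feats))     # count each feature at most once per sentence
        n_sentences = len(sentence_feature_lists)
        # P(feature | author = dickens) ~= Count(feature) / (# dickens sentences)
        return {f: c / n_sentences for f, c in counts.items()}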

SLIDE 16

Authorship Lab

  • 1. Figure out how to use your Language Models from Lab 2. They can be your initial features.
  • Can you train() a model on one author’s text?
  • 2. P(dickens | text) = P(dickens) * PBigramModel(text)
  • 3. New code for new features. Call your language models, get a probability, and then multiply new feature probabilities.
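Putting steps 2 and 3 together, a minimal sketch of scoring one candidate author in log space, reusing the hypothetical interfaces from the earlier sketches (a per-author bigram model, per-author feature probabilities, and the extract_features helper):

    import math

    def author_score(text, prior, bigram_model, feat_probs):
        """log P(author) + log P_BigramModel(text) + sum of log feature probabilities."""
        score = math.log(prior)                         # log P(dickens)
        score += bigram_model.logprob(text)             # log P_BigramModel(text), from your Lab 2 model
        for f in extract_features(text):                # new features from the earlier sketch
            score += math.log(feat_probs.get(f, 1e-6))  # tiny floor for unseen features (an assumption)
        return score

    # Pick whichever candidate author gets the highest score.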