SLIDE 1

Machine Learning: Chenhao Tan

University of Colorado Boulder

LECTURE 4

Slides adapted from Jordan Boyd-Graber and Chris Ketelsen

SLIDE 2

Logistics

  • Piazza: https://piazza.com/colorado/fall2017/csci5622/
  • Office hour
  • HW1 due
  • Final projects
  • Feedback

SLIDE 3

Recap

  • Supervised learning
  • K-nearest neighbor
  • Training/validation/test, overfitting/underfitting

SLIDE 4

Overview

  • Generative vs. Discriminative models
  • Naïve Bayes Classifier
      • Motivating Naïve Bayes Example
      • Naïve Bayes Definition
      • Estimating Probability Distributions
  • Logistic regression
      • Logistic Regression Example

SLIDE 5

Generative vs. Discriminative models

Outline

  • Generative vs. Discriminative models
  • Naïve Bayes Classifier
      • Motivating Naïve Bayes Example
      • Naïve Bayes Definition
      • Estimating Probability Distributions
  • Logistic regression
      • Logistic Regression Example

SLIDE 7

Generative vs. Discriminative models

Probabilistic Models

  • hypothesis function h : X → Y.

In this special case, we define h based on estimating a probabilistic model P(X, Y).

SLIDE 8

Generative vs. Discriminative models

Probabilistic Classification

Input: training examples $S_{\text{train}} = \{(x_i, y_i)\}_{i=1}^{N}$, with labels $y_i \in \{c_1, c_2, \ldots, c_J\}$

Goal: $h : X \rightarrow Y$

For each class $c_j$, estimate $P(y = c_j \mid x, S_{\text{train}})$, then assign to $x$ the class with the highest probability:

$$\hat{y} = h(x) = \arg\max_{c} P(y = c \mid x, S_{\text{train}})$$
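
As a minimal sketch of this decision rule (not from the slides; the function name and probability table are illustrative), assuming the per-class estimates have already been computed:

```python
# Illustrative sketch: `probs` maps each class c_j to an estimate of
# P(y = c_j | x, S_train); the classifier returns the arg max.
def predict(probs):
    return max(probs, key=probs.get)

print(predict({"c1": 0.2, "c2": 0.5, "c3": 0.3}))  # -> c2
```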

SLIDE 9

Generative vs. Discriminative models

Generative vs. Discriminative Models

Generative

Model the joint probability p(x, y), including the data x. Example: Naïve Bayes.

  • Uses Bayes rule to reverse the conditioning: p(x|y) → p(y|x)
  • “Naïve” because it ignores joint probabilities within the data distribution

Discriminative

Model only the conditional probability p(y|x), excluding the data x. Example: logistic regression.

  • Logistic: a special mathematical function it uses
  • Regression: combines a weight vector with observations to create an answer
  • A general cookbook for building conditional probability distributions

SLIDE 10

Naïve Bayes Classifier

Outline

  • Generative vs. Discriminative models
  • Naïve Bayes Classifier
      • Motivating Naïve Bayes Example
      • Naïve Bayes Definition
      • Estimating Probability Distributions
  • Logistic regression
      • Logistic Regression Example

SLIDE 12

Naïve Bayes Classifier | Motivating Naïve Bayes Example

A Classification Problem

  • Suppose that I have two coins, C1 and C2
  • Now suppose I pull a coin out of my pocket, flip it a bunch of times, record the coin and outcomes, and repeat many times:

    C1: 0 1 1 1 1
    C1: 1 1 0
    C2: 1 0 0 0 0 0 0 1
    C1: 0 1
    C1: 1 1 0 1 1 1
    C2: 0 0 1 1 0 1
    C2: 1 0 0 0

  • Now suppose I am given a new sequence, 0 0 1; which coin is it from?

SLIDE 15

Naïve Bayes Classifier | Motivating Naïve Bayes Example

A Classification Problem

This problem has particular challenges:

  • different numbers of covariates for each observation
  • number of covariates can be large

However, there is some structure:

  • Easy to get P(C1) = 4/7 and P(C2) = 3/7
  • Also easy to get P(Xi = 1 | C1) = 12/16 and P(Xi = 1 | C2) = 6/18
  • By conditional independence,

    P(X = 0 0 1 | C1) = P(X1 = 0 | C1) P(X2 = 0 | C1) P(X3 = 1 | C1)

  • Can we use these to get P(C1 | X = 0 0 1)?
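
These counts can be checked directly; a small sketch, with the sequences hard-coded from above and illustrative variable names:

```python
from collections import Counter

# The recorded (coin, outcomes) pairs from the slide.
data = [
    ("C1", [0, 1, 1, 1, 1]),
    ("C1", [1, 1, 0]),
    ("C2", [1, 0, 0, 0, 0, 0, 0, 1]),
    ("C1", [0, 1]),
    ("C1", [1, 1, 0, 1, 1, 1]),
    ("C2", [0, 0, 1, 1, 0, 1]),
    ("C2", [1, 0, 0, 0]),
]

seqs = Counter(coin for coin, _ in data)
flips = {c: sum(len(o) for coin, o in data if coin == c) for c in seqs}
heads = {c: sum(sum(o) for coin, o in data if coin == c) for c in seqs}

print(seqs["C1"], "of", len(data))     # P(C1)       = 4/7
print(heads["C1"], "of", flips["C1"])  # P(X=1 | C1) = 12/16
print(heads["C2"], "of", flips["C2"])  # P(X=1 | C2) = 6/18
```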

SLIDE 16

Naïve Bayes Classifier | Motivating Naïve Bayes Example

A Classification Problem

Summary: we have P(data | class), but we want P(class | data). Solution: Bayes’ rule!

$$P(\text{class} \mid \text{data}) = \frac{P(\text{data} \mid \text{class})\, P(\text{class})}{P(\text{data})} = \frac{P(\text{data} \mid \text{class})\, P(\text{class})}{\sum_{\text{class}'=1}^{C} P(\text{data} \mid \text{class}')\, P(\text{class}')}$$

To compute this, we need to estimate P(data | class) and P(class) for all classes.

SLIDE 17

Naïve Bayes Classifier | Motivating Naïve Bayes Example

A Classification Problem

However, there is some structure:

  • Easy to get P(C1) = 4/7 and P(C2) = 3/7
  • Also easy to get P(Xi = 1 | C1) = 12/16 and P(Xi = 1 | C2) = 6/18
  • By conditional independence,

    P(X = 0 0 1 | C1) = P(X1 = 0 | C1) P(X2 = 0 | C1) P(X3 = 1 | C1)

$$P(C_1 \mid X = \texttt{0 0 1}) = \frac{4/7 \times 4/16 \times 4/16 \times 12/16}{4/7 \times 4/16 \times 4/16 \times 12/16 + 3/7 \times 12/18 \times 12/18 \times 6/18}$$
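
Carrying out the arithmetic (not shown on the slide): the numerator is 4/7 × (4/16)² × 12/16 ≈ 0.0268, the C2 term is 3/7 × (12/18)² × 6/18 ≈ 0.0635, so P(C1 | X = 0 0 1) ≈ 0.0268 / 0.0903 ≈ 0.30. The sequence 0 0 1 is therefore more likely to have come from C2.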

SLIDE 20

Naïve Bayes Classifier | Naïve Bayes Definition

The Naïve Bayes classifier

  • The Naïve Bayes classifier is a probabilistic classifier.
  • We compute the probability of a document d being in a class c as follows:

$$P(c \mid d) \propto P(c, d) = P(c) \prod_{1 \le i \le n_d} P(w_i \mid c)$$

  • nd is the length of the document (number of tokens).
  • P(wi|c) is the conditional probability of term wi occurring in a document of class c.
  • P(wi|c) can be read as a measure of how much evidence wi contributes that c is the correct class.
  • P(c) is the prior probability of c.
  • If a document’s terms do not provide clear evidence for one class vs. another, we choose the c with the higher P(c).

SLIDE 21

Naïve Bayes Classifier | Naïve Bayes Definition

Maximum a posteriori class

  • Our goal is to find the “best” class.
  • The best class in Naïve Bayes classification is the most likely, or maximum a posteriori (MAP), class cMAP:

$$c_{\text{MAP}} = \arg\max_{c_j \in C} \hat{P}(c_j \mid d) = \arg\max_{c_j \in C} \hat{P}(c_j) \prod_{1 \le i \le n_d} \hat{P}(w_i \mid c_j)$$

  • We write $\hat{P}$ for $P$ because these values are estimates from the training set.

SLIDE 22

Naïve Bayes Classifier | Naïve Bayes Definition

Naive Bayes Classifier: More examples

This works because the coin flips are independent given the coin parameter. What about this case:

  • want to identify the type of fruit given a set of features: color, shape and size
  • color: red, green, yellow or orange (discrete)
  • shape: round, oval or long+skinny (discrete)
  • size: diameter in inches (continuous)

SLIDE 23

Naïve Bayes Classifier | Naïve Bayes Definition

Naive Bayes Classifier: More examples

Conditioned on the type of fruit, these features are not necessarily independent. Given the category “apple,” the color “green” has a higher probability when “size < 2”:

P(green | size < 2, apple) > P(green | apple)

SLIDE 24

Naïve Bayes Classifier | Naïve Bayes Definition

Naive Bayes Classifier: More examples

Using the chain rule and Bayes’ rule,

$$P(\text{apple} \mid \text{green}, \text{round}, \text{size}=2) = \frac{P(\text{green}, \text{round}, \text{size}=2 \mid \text{apple})\, P(\text{apple})}{\sum_{j} P(\text{green}, \text{round}, \text{size}=2 \mid \text{fruit}_j)\, P(\text{fruit}_j)}$$

$$\propto P(\text{green} \mid \text{round}, \text{size}=2, \text{apple})\, P(\text{round} \mid \text{size}=2, \text{apple})\, P(\text{size}=2 \mid \text{apple})\, P(\text{apple})$$

But computing these conditional probabilities is hard! There are many combinations of (color, shape, size) for each fruit.

SLIDE 25

Naïve Bayes Classifier | Naïve Bayes Definition

Naive Bayes Classifier: More examples

Idea: assume conditional independence for all features given the class:

P(green | round, size = 2, apple) = P(green | apple)
P(round | green, size = 2, apple) = P(round | apple)
P(size = 2 | green, round, apple) = P(size = 2 | apple)

SLIDE 29

Naïve Bayes Classifier | Estimating Probability Distributions

How do we estimate a probability?

  • Suppose we want to estimate P(wn = “buy” | y = SPAM).
  • Example SPAM documents (shown as a word cloud on the slide):

    buy buy nigeria opportunity viagra nigeria opportunity viagra fly money fly buy nigeria fly buy money buy fly nigeria viagra

  • The maximum likelihood (ML) estimate of the probability is:

$$\hat{P}(w_i \mid \text{SPAM}) = \frac{n_i}{\sum_k n_k} \qquad (1)$$

    where $n_i$ is the number of times $w_i$ occurs in SPAM documents.

  • Is this reasonable?
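
A quick sketch of the ML estimate on these tokens (illustrative names, assuming the word cloud above is the whole SPAM sample):

```python
from collections import Counter

# Tokens of the example SPAM documents shown on the slide.
spam_tokens = ("buy buy nigeria opportunity viagra nigeria opportunity viagra "
               "fly money fly buy nigeria fly buy money buy fly nigeria viagra").split()

counts = Counter(spam_tokens)
total = sum(counts.values())

# ML estimate: P(w | SPAM) = n_w / sum_k n_k
print(counts["buy"], "/", total)   # 5 / 20, so P("buy" | SPAM) = 0.25
```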

SLIDE 30

Naïve Bayes Classifier | Estimating Probability Distributions

The problem with maximum likelihood estimates: Zeros (cont)

  • If there were no occurrences of “bagel” in documents in class SPAM, we’d get a zero estimate:

$$\hat{P}(\text{“bagel”} \mid \text{SPAM}) = \frac{T_{\text{SPAM},\text{“bagel”}}}{\sum_{w' \in V} T_{\text{SPAM},w'}} = 0,$$

    where $T_{c,w}$ is the count of token $w$ in documents with label $c$.

  • → We will get P(SPAM | d) = 0 for any document that contains “bagel”!

SLIDE 32

Naïve Bayes Classifier | Estimating Probability Distributions

How do we estimate a probability?

  • For a multinomial distribution (i.e. a discrete distribution, like one over words):

$$\hat{P}(w_i \mid c) = \frac{n_i + \alpha_i}{\sum_k (n_k + \alpha_k)} \qquad (2)$$

  • αi is called a smoothing factor, a pseudocount, etc.
  • When αi = 1 for all i, it’s called “Laplace smoothing” and corresponds to a uniform prior over all multinomial distributions (just do this).
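
A small sketch of the smoothed estimate, reusing the toy SPAM sample from above and an assumed vocabulary (illustrative, not from the slides):

```python
from collections import Counter

spam_tokens = ("buy buy nigeria opportunity viagra nigeria opportunity viagra "
               "fly money fly buy nigeria fly buy money buy fly nigeria viagra").split()
counts = Counter(spam_tokens)
total = sum(counts.values())
vocab = set(spam_tokens) | {"bagel"}  # add an unseen word for illustration
alpha = 1.0                           # Laplace smoothing: alpha_i = 1 for all i

def smoothed_p(word):
    """P(w | SPAM) with pseudocounts: (n_w + alpha) / (sum_k n_k + alpha * |V|)."""
    return (counts[word] + alpha) / (total + alpha * len(vocab))

print(smoothed_p("buy"))    # (5 + 1) / (20 + 7) ~ 0.222
print(smoothed_p("bagel"))  # (0 + 1) / (20 + 7) ~ 0.037, unseen but not zero
```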

SLIDE 34

Naïve Bayes Classifier | Estimating Probability Distributions

How do we estimate a probability?

  • For many applications, we often have a prior notion of what our probability distributions are going to look like (for example, non-zero, sparse, uniform, etc.).
  • This estimate of a probability distribution is called the maximum a posteriori (MAP) estimate:

$$\beta_{\text{MAP}} = \arg\max_{\beta} f(x \mid \beta)\, g(\beta) \qquad (3)$$

SLIDE 35

Naïve Bayes Classifier | Estimating Probability Distributions

Naïve Bayes for document classification

To reduce the number of parameters to a manageable size, recall the Naïve Bayes conditional independence assumption:

$$P(d \mid c_j) = P(w_1, \ldots, w_{n_d} \mid c_j) = \prod_{1 \le i \le n_d} P(X_i = w_i \mid c_j)$$

We assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities P(Xi = wi | cj). Our estimates for the priors and conditional probabilities, with Laplace smoothing:

$$\hat{P}(c_j) = \frac{N_{c_j} + 1}{N + |C|} \qquad \text{and} \qquad \hat{P}(w \mid c) = \frac{T_{cw} + 1}{\left(\sum_{w' \in V} T_{cw'}\right) + |V|}$$
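
A minimal training sketch that follows these two estimates, assuming documents arrive as (tokens, label) pairs; all names are illustrative, not from the slides:

```python
from collections import Counter, defaultdict
import math

def train_nb(docs):
    """docs: list of (tokens, label) pairs. Returns log-priors, log-likelihoods, vocab."""
    n_docs = Counter(label for _, label in docs)
    token_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        token_counts[label].update(tokens)
        vocab.update(tokens)

    classes = list(n_docs)
    # Laplace-smoothed prior: (N_c + 1) / (N + |C|)
    log_prior = {c: math.log((n_docs[c] + 1) / (len(docs) + len(classes)))
                 for c in classes}
    # Laplace-smoothed likelihood: (T_cw + 1) / (sum_w' T_cw' + |V|)
    log_lik = {}
    for c in classes:
        total = sum(token_counts[c].values())
        log_lik[c] = {w: math.log((token_counts[c][w] + 1) / (total + len(vocab)))
                      for w in vocab}
    return log_prior, log_lik, vocab
```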

SLIDE 36

Naïve Bayes Classifier | Estimating Probability Distributions

Implementation Detail: Taking the log

  • Multiplying lots of small probabilities can result in floating point underflow.
  • From last time: lg is logarithm base 2; ln is logarithm base e.

$$\lg x = a \Leftrightarrow 2^a = x \qquad \ln x = a \Leftrightarrow e^a = x \qquad (4)$$

  • Since lg(xy) = lg(x) + lg(y), we can sum log probabilities instead of multiplying probabilities.
  • Since lg is a monotonic function, the class with the highest score does not change.
  • So what we usually compute in practice is:

$$c_{\text{MAP}} = \arg\max_{c_j \in C} \Big[ \hat{P}(c_j) \prod_{1 \le i \le n_d} \hat{P}(w_i \mid c_j) \Big] = \arg\max_{c_j \in C} \Big[ \ln \hat{P}(c_j) + \sum_{1 \le i \le n_d} \ln \hat{P}(w_i \mid c_j) \Big]$$
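
The matching log-space prediction, a sketch that pairs with `train_nb` above (skipping out-of-vocabulary words is one of several reasonable conventions, assumed here for simplicity):

```python
def predict_nb(tokens, log_prior, log_lik, vocab):
    """Return argmax_c [ ln P(c) + sum_i ln P(w_i|c) ]; out-of-vocab words are skipped."""
    scores = {c: lp + sum(log_lik[c][w] for w in tokens if w in vocab)
              for c, lp in log_prior.items()}
    return max(scores, key=scores.get)

# Usage with the training sketch above (toy data):
log_prior, log_lik, vocab = train_nb([
    (["buy", "viagra", "nigeria"], "SPAM"),
    (["meeting", "notes", "work"], "HAM"),
])
print(predict_nb(["buy", "nigeria"], log_prior, log_lik, vocab))  # SPAM
```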

SLIDE 39

Logistic regression

Outline

  • Generative vs. Discriminative models
  • Naïve Bayes Classifier
      • Motivating Naïve Bayes Example
      • Naïve Bayes Definition
      • Estimating Probability Distributions
  • Logistic regression
      • Logistic Regression Example

SLIDE 40

Logistic regression

What are we talking about?

  • Statistical classification: p(y|x)
  • Classification uses: ad placement, spam detection
  • Building block of other machine learning methods

SLIDE 41

Logistic regression

Logistic Regression: Definition

  • Weight vector βi
  • Observations Xi
  • “Bias” β0 (like the intercept in linear regression)

$$P(Y = 0 \mid X) = \frac{1}{1 + \exp\left[\beta_0 + \sum_i \beta_i X_i\right]} \qquad (5)$$

$$P(Y = 1 \mid X) = \frac{\exp\left[\beta_0 + \sum_i \beta_i X_i\right]}{1 + \exp\left[\beta_0 + \sum_i \beta_i X_i\right]} \qquad (6)$$

SLIDE 42

Logistic regression

Logistic Regression: Definition

  • Weight vector βi
  • Observations Xi
  • For shorthand, we’ll say that

$$P(Y = 1 \mid X) = \sigma\Big(\beta_0 + \sum_i \beta_i X_i\Big) \qquad (7)$$

$$P(Y = 0 \mid X) = 1 - \sigma\Big(\beta_0 + \sum_i \beta_i X_i\Big) \qquad (8)$$

  • where $\sigma(z) = \frac{1}{1 + \exp[-z]}$
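
A minimal sketch of these two formulas, with an illustrative sparse-feature representation (names are not from the slides):

```python
import math

def sigmoid(z):
    """The logistic function, sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def p_y1(x, beta, beta0):
    """P(Y = 1 | X) = sigma(beta0 + sum_i beta_i * x_i) for a feature dict x."""
    return sigmoid(beta0 + sum(beta.get(f, 0.0) * v for f, v in x.items()))
```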

SLIDE 44

Logistic regression

What’s this “exp” doing?

Exponential

  • exp [x] is shorthand for e^x
  • e is a special number, about 2.71828
  • e^x is the limit of the compound interest formula as compounds become infinitely small
  • It’s the function whose derivative is itself

Logistic

  • The “logistic” function is $\sigma(z) = \frac{1}{1 + e^{-z}}$
  • Looks like an “S”
  • Always between 0 and 1
  • Allows us to model probabilities
  • Different from linear regression

SLIDE 47

Logistic regression | Logistic Regression Example

Logistic Regression Example

feature     coefficient   weight
bias        β0             0.1
“viagra”    β1             2.0
“mother”    β2            −1.0
“work”      β3            −0.5
“nigeria”   β4             3.0

  • Y = 1: spam

Example 1: Empty Document?

X = {}

  • P(Y = 0) = 1 / (1 + exp[0.1]) = 0.48
  • P(Y = 1) = exp[0.1] / (1 + exp[0.1]) = 0.52
  • The bias β0 encodes the prior probability of a class

SLIDE 50

Logistic regression | Logistic Regression Example

Logistic Regression Example

feature     coefficient   weight
bias        β0             0.1
“viagra”    β1             2.0
“mother”    β2            −1.0
“work”      β3            −0.5
“nigeria”   β4             3.0

  • Y = 1: spam

Example 2

X = {Mother, Nigeria}

  • P(Y = 0) = 1 / (1 + exp[0.1 − 1.0 + 3.0]) = 0.11
  • P(Y = 1) = exp[0.1 − 1.0 + 3.0] / (1 + exp[0.1 − 1.0 + 3.0]) = 0.89
  • Include the bias, and sum the other weights

SLIDE 53

Logistic regression | Logistic Regression Example

Logistic Regression Example

feature     coefficient   weight
bias        β0             0.1
“viagra”    β1             2.0
“mother”    β2            −1.0
“work”      β3            −0.5
“nigeria”   β4             3.0

  • Y = 1: spam

Example 3

X = {Mother, Work, Viagra, Mother}

  • P(Y = 0) = 1 / (1 + exp[0.1 − 1.0 − 0.5 + 2.0 − 1.0]) = 0.60
  • P(Y = 1) = exp[0.1 − 1.0 − 0.5 + 2.0 − 1.0] / (1 + exp[0.1 − 1.0 − 0.5 + 2.0 − 1.0]) = 0.40
  • Multiply each feature’s count by its weight (“mother” appears twice, so β2 is added twice)
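
A quick check of all three examples (a sketch; names are illustrative, weights are from the table above):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Weights from the table above.
beta0 = 0.1
beta = {"viagra": 2.0, "mother": -1.0, "work": -0.5, "nigeria": 3.0}

def p_spam(words):
    # Repeated words add their weight once per occurrence.
    z = beta0 + sum(beta.get(w.lower(), 0.0) for w in words)
    return sigmoid(z)

for doc in ([], ["Mother", "Nigeria"], ["Mother", "Work", "Viagra", "Mother"]):
    print(f"{p_spam(doc):.2f}")  # 0.52, 0.89, 0.40
```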

SLIDE 55

Logistic regression | Logistic Regression Example

How is Logistic Regression Used?

  • Given a set of weights β, we know how to compute the conditional likelihood P(y|β, x)
  • Find the set of weights β that maximizes the conditional likelihood on training data (next week)
  • Intuition: a higher weight means that the feature is a stronger indicator of the positive class
  • Naïve Bayes is a special case of logistic regression that uses Bayes rule and conditional probabilities to set these weights:

$$\arg\max_{c_j \in C} \Big[ \ln \hat{P}(c_j) + \sum_{1 \le i \le n_d} \ln \hat{P}(w_i \mid c_j) \Big]$$
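
To make the connection concrete (a sketch not on the slides, for the binary case with token counts $n_w(d)$):

$$\ln \frac{P(c_1 \mid d)}{P(c_0 \mid d)} = \ln \frac{\hat{P}(c_1)}{\hat{P}(c_0)} + \sum_{w \in V} n_w(d)\, \ln \frac{\hat{P}(w \mid c_1)}{\hat{P}(w \mid c_0)}$$

This is linear in the counts $n_w(d)$, with bias $\beta_0 = \ln \hat{P}(c_1)/\hat{P}(c_0)$ and per-word weights $\beta_w = \ln \hat{P}(w \mid c_1)/\hat{P}(w \mid c_0)$, which is exactly the functional form logistic regression assumes; the two differ only in how the weights are estimated.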

SLIDE 59

Logistic regression | Logistic Regression Example

Contrasting Naïve Bayes and Logistic Regression

  • Naïve Bayes easier
  • Naïve Bayes better on smaller datasets
  • Logistic regression better on medium-sized datasets
  • On huge datasets, it doesn’t really matter (data always win)
  • Optional reading by Ng and Jordan has proofs and experiments
  • Logistic regression allows arbitrary features (biggest difference!)
  • Don’t need to memorize (or work through) the previous slide; just understand that naïve Bayes is a special case of logistic regression
