SLIDE 1

A Probabilistic View of Machine Learning (2/2)

CMSC 422 MARINE CARPUAT

marine@cs.umd.edu

Some slides based on material by Tom Mitchell

SLIDE 2

What we know so far…

  • Bayes rule
  • A probabilistic view of machine learning

– If we know the data generating distribution, we can define the Bayes optimal classifier
– Under the iid assumption

  • How to estimate a probability distribution from data?

– Maximum likelihood estimation

SLIDE 3

Today

  • How to compute Maximum Likelihood Estimates

– For Bernoulli and Categorical Distributions

  • Naïve Bayes classifier
SLIDE 4

Maximum Likelihood Estimates

Given a data set $D$ of iid coin flips, which contains $\alpha_1$ ones and $\alpha_0$ zeros:

$$P_\theta(D) = \theta^{\alpha_1} (1 - \theta)^{\alpha_0}$$

$$\theta_{MLE} = \arg\max_\theta P_\theta(D) = \frac{\alpha_1}{\alpha_1 + \alpha_0}$$
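A minimal Python sketch of this estimate (my illustration, not from the slides), assuming the data is a list of 0/1 outcomes:

```python
# MLE for a Bernoulli parameter theta from iid 0/1 coin flips
# (illustration only, not part of the original slides).
def bernoulli_mle(flips):
    """Return theta_MLE = alpha_1 / (alpha_1 + alpha_0)."""
    alpha_1 = sum(flips)            # number of ones
    alpha_0 = len(flips) - alpha_1  # number of zeros
    return alpha_1 / (alpha_1 + alpha_0)

print(bernoulli_mle([1, 0, 1, 1, 0]))  # 0.6
```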

SLIDE 5

Maximum Likelihood Estimates

K-sided die (Categorical Distribution): $\forall k,\ P(X = k) = \theta_k$

Given a data set $D$ of iid rolls, which contains $x_k$ outcomes $k$ for each $k$:

$$P_\theta(D) = \prod_{k=1}^{K} \theta_k^{x_k}$$

$$\theta_{MLE} = \arg\max_\theta P_\theta(D) = \arg\max_\theta \log P_\theta(D) = \arg\max_\theta \sum_{k=1}^{K} x_k \log(\theta_k)$$

Problem: This objective lacks constraints!

SLIDE 6

Maximum Likelihood Estimates

A constrained optimization problem:

$$\theta_{MLE} = \arg\max_\theta \sum_{k=1}^{K} x_k \log(\theta_k) \quad \text{subject to} \quad \sum_{k=1}^{K} \theta_k = 1$$

How to solve it? Use Lagrange multipliers to turn it into an unconstrained objective (on board).

K-sided die: $\forall k,\ P(X = k) = \theta_k$
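Since the derivation is left for the board, here is a brief sketch of the Lagrange-multiplier step (my reconstruction, not from the slides):

$$\mathcal{L}(\theta, \lambda) = \sum_{k=1}^{K} x_k \log(\theta_k) + \lambda \Big( 1 - \sum_{k=1}^{K} \theta_k \Big)$$

$$\frac{\partial \mathcal{L}}{\partial \theta_k} = \frac{x_k}{\theta_k} - \lambda = 0 \;\Rightarrow\; \theta_k = \frac{x_k}{\lambda}, \qquad \sum_{k=1}^{K} \theta_k = 1 \;\Rightarrow\; \lambda = \sum_{k'} x_{k'}$$

Hence $\theta_k = x_k / \sum_{k'} x_{k'}$, which is exactly the estimate on the next slide.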

SLIDE 7

Maximum Likelihood Estimates

The parameters that maximize the likelihood of the data are given by:

$$\theta_k = \frac{x_k}{\sum_{k'} x_{k'}}$$

This is the relative frequency of rolls where side $k$ comes up!

K-sided die: $\forall k,\ P(X = k) = \theta_k$
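A small Python sketch of this relative-frequency estimate (my illustration, not from the slides), where the data is a list of observed rolls:

```python
from collections import Counter

# MLE for a categorical distribution: relative frequency of each side
# (illustration only, not part of the original slides).
def categorical_mle(rolls):
    counts = Counter(rolls)   # x_k: number of rolls showing side k
    total = len(rolls)        # sum over k of x_k
    return {k: c / total for k, c in counts.items()}

print(categorical_mle([1, 2, 2, 3, 3, 3]))  # {1: 0.167, 2: 0.333, 3: 0.5} (approx.)
```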

SLIDE 8

Today

  • How to compute Maximum Likelihood Estimates

– For Bernoulli and Categorical Distributions

  • Naïve Bayes classifier
SLIDE 9

Let’s learn a classifier by learning P(Y|X)

  • Goal: learn a classifier P(Y|X)
  • Prediction:

– Given an example $x$
– Predict $\hat{y} = \arg\max_y P(Y = y \mid X = x)$

SLIDE 10

Parameters for P(X,Y) vs. P(Y|X)

Y = Wealth; X = <Gender, Hours_worked>

[Slide tables, not recoverable here, compare the parameters of the joint probability distribution P(X,Y) with those of the conditional probability distribution P(Y|X) for this example.]

SLIDE 11

Parameters for P(X,Y) and P(Y|X)

  • P(Y|X) requires estimating fewer parameters than P(X,Y)
  • But that is still too many parameters in practice!
  • So we need simplifying assumptions to make estimation more practical

SLIDE 12

Naïve Bayes Assumption

Naïve Bayes assumes

$$P(X_1, X_2, \ldots, X_d \mid Y) = \prod_{i=1}^{d} P(X_i \mid Y)$$

i.e., that $X_i$ and $X_j$ are conditionally independent given $Y$, for all $i \neq j$

SLIDE 13

Conditional Independence

  • Definition:

X is conditionally independent of Y given Z if P(X|Y,Z) = P(X|Z)

  • Recall that X is independent of Y if P(X|Y) = P(X)
SLIDE 14

Naïve Bayes classifier

$$\hat{y} = \arg\max_y P(Y = y \mid X = x)$$

$$= \arg\max_y P(Y = y)\, P(X = x \mid Y = y)$$

$$= \arg\max_y P(Y = y) \prod_{i=1}^{d} P(X_i = x_i \mid Y = y)$$

Bayes rule + Conditional independence assumption
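A minimal Python sketch of this decision rule (my illustration, not the course's code), assuming the prior $P(Y)$ and the conditionals $P(X_i \mid Y)$ have already been estimated (slide 16 shows how); log-probabilities are summed to avoid numerical underflow:

```python
import math

# Naive Bayes decision rule (illustration only). Assumes:
#   prior[y]               = P(Y = y)
#   conditional[(i, y)][v] = P(X_i = v | Y = y)
def predict(x, prior, conditional):
    best_y, best_score = None, -math.inf
    for y, p_y in prior.items():
        # Sum log-probabilities instead of multiplying raw probabilities.
        score = math.log(p_y) + sum(
            math.log(conditional[(i, y)][x_i]) for i, x_i in enumerate(x)
        )
        if score > best_score:
            best_y, best_score = y, score
    return best_y
```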

SLIDE 15

How many parameters do we need to learn?

  • To describe P(Y)?
  • To describe $P(X = \langle X_1, X_2, \ldots, X_d \rangle \mid Y)$?

– Without conditional independence assumption?
– With conditional independence assumption?

(Suppose all random variables are Boolean)
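As a worked check (my addition; the slide leaves this as an exercise): with Boolean variables, $P(Y)$ takes 1 parameter; $P(X_1, \ldots, X_d \mid Y)$ takes $2(2^d - 1)$ parameters without the conditional independence assumption (one distribution over $2^d$ feature configurations per class), but only $2d$ with it (one Bernoulli parameter per feature per class).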

SLIDE 16

Training a Naïve Bayes classifier

Let's assume discrete $X_i$ and $Y$

TrainNaïveBayes(Data)
  for each value $y_k$ of $Y$
    estimate $\pi_k = P(Y = y_k) = \dfrac{\#\text{ examples for which } Y = y_k}{\#\text{ examples}}$
  for each value $x_{ij}$ of $X_i$
    estimate $\theta_{ijk} = P(X_i = x_{ij} \mid Y = y_k) = \dfrac{\#\text{ examples for which } X_i = x_{ij} \text{ and } Y = y_k}{\#\text{ examples for which } Y = y_k}$
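A runnable Python sketch of this training procedure (my illustration, consistent with the pseudocode above; names like `train_naive_bayes` are my own, not the course's):

```python
from collections import Counter, defaultdict

# Sketch of TrainNaiveBayes for discrete features (illustration only).
# data: list of (x, y) pairs, where x is a tuple of discrete feature values.
def train_naive_bayes(data):
    n = len(data)
    class_counts = Counter(y for _, y in data)
    # pi_k = P(Y = y_k): relative frequency of each class label.
    prior = {y: c / n for y, c in class_counts.items()}
    # Count how often each feature value co-occurs with each class.
    feature_counts = defaultdict(Counter)
    for x, y in data:
        for i, x_i in enumerate(x):
            feature_counts[(i, y)][x_i] += 1
    # theta_ijk = P(X_i = x_ij | Y = y_k): per-class relative frequencies.
    conditional = {
        (i, y): {v: c / class_counts[y] for v, c in counts.items()}
        for (i, y), counts in feature_counts.items()
    }
    return prior, conditional
```

The returned `prior` and `conditional` tables have exactly the shape assumed by the prediction sketch after slide 14.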

SLIDE 17

Naïve Bayes Wrap-up

  • A simple classifier that performs well in practice

  • Subtleties

– Often the $X_i$ are not really conditionally independent
– What if the Maximum Likelihood estimate for $P(X_i \mid Y)$ is zero?

SLIDE 18

What you should know

  • The Naïve Bayes classifier

– Conditional independence assumption
– How to train it?
– How to make predictions?
– How does it relate to other classifiers we know? [HW]

  • Fundamental Machine Learning concepts

– iid assumption
– Bayes optimal classifier