A Probabilistic View of Machine Learning (2/2)
CMSC 422 MARINE CARPUAT
marine@cs.umd.edu
Some slides based on material by Tom Mitchell
What we know so far
– Bayes rule
– A probabilistic view of machine learning
– If we know the data generating distribution, we can define the Bayes optimal classifier
– Under the iid assumption
How can we estimate a distribution from data?
– Maximum likelihood estimation
– For Bernoulli and Categorical distributions
Given a data set D of iid flips, which contains α1 ones and α0 zeros:
P_θ(D) = θ^{α1} (1 − θ)^{α0}
θ_MLE = argmax_θ P_θ(D) = α1 / (α1 + α0)
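As a quick numeric check, the closed-form estimate can be computed directly from counts (a minimal sketch; the flip data below is made up):

```python
# Maximum likelihood estimate for a Bernoulli parameter theta from iid
# coin flips: theta_MLE = (# ones) / (# ones + # zeros).
flips = [1, 0, 1, 1, 0, 1, 1, 0]  # hypothetical dataset D

alpha1 = sum(flips)           # number of ones
alpha0 = len(flips) - alpha1  # number of zeros
theta_mle = alpha1 / (alpha1 + alpha0)

print(theta_mle)  # 0.625
```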
K-sided die: ∀k, P(X = k) = θ_k (Categorical distribution)
Given a data set D of iid rolls, which contains x_k outcomes for each value k:
P_θ(D) = ∏_{k=1}^{K} θ_k^{x_k}
θ_MLE = argmax_θ P_θ(D) = argmax_θ log P_θ(D) = argmax_θ ∑_{k=1}^{K} x_k log(θ_k)
Problem: This objective lacks constraints!
A constrained optimization problem:
θ_MLE = argmax_θ ∑_{k=1}^{K} x_k log(θ_k)   subject to   ∑_{k=1}^{K} θ_k = 1
How to solve it? Use Lagrange multipliers to turn it into an unconstrained objective (on board)
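A sketch of the on-board Lagrange-multiplier step (my reconstruction, not necessarily the lecture's exact board work):

```latex
\Lambda(\theta,\lambda) = \sum_{k=1}^{K} x_k \log\theta_k
                        + \lambda\Big(1 - \sum_{k=1}^{K}\theta_k\Big)
\qquad
\frac{\partial \Lambda}{\partial \theta_k} = \frac{x_k}{\theta_k} - \lambda = 0
\;\Rightarrow\; \theta_k = \frac{x_k}{\lambda},
\qquad
\sum_{k}\theta_k = 1 \;\Rightarrow\; \lambda = \sum_{k} x_k
\;\Rightarrow\; \theta_{k,\mathrm{MLE}} = \frac{x_k}{\sum_{k'} x_{k'}}
```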
K-sided die: ∀k, P(X = k) = θ_k
Solution: θ_{k,MLE} = x_k / ∑_{k'=1}^{K} x_{k'}
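The same closed form in code (the roll counts below are hypothetical):

```python
# Categorical MLE from counts: theta_k = x_k / (total number of rolls).
counts = [3, 1, 2]  # made-up outcome counts for a 3-sided die
total = sum(counts)
theta = [x / total for x in counts]

print(theta[0])  # 0.5
# The estimates are a proper distribution: they sum to 1.
```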
– For Bernoulli and Categorical distributions
– Given an example x
– Predict ŷ = argmax_y P(Y = y | X = x)
Example: Y = Wealth, X = <Gender, Hours_worked>
Joint probability distribution P(X,Y)
Conditional probability distribution P(Y|X)
Naive Bayes assumption: the features X_1, …, X_d are conditionally independent given Y:
P(X_1, …, X_d | Y) = ∏_{i=1}^{d} P(X_i | Y)
How many parameters must we estimate?
– Without the conditional independence assumption?
– With the conditional independence assumption?
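To make the parameter-count question concrete, here is a sketch assuming d binary features and a binary label (d = 30 is an arbitrary choice): without the assumption, the full table P(X_1, …, X_d | Y) needs 2^d − 1 free parameters per class; with it, only one Bernoulli parameter per feature per class.

```python
# Parameter counting: joint model vs. Naive Bayes.
# (Illustrative sketch; assumes d binary features and a binary label Y.)
d = 30  # number of binary features (hypothetical)

# Without conditional independence: 2^d - 1 free parameters per class,
# for each of the 2 classes.
without_ci = 2 * (2**d - 1)

# With the Naive Bayes assumption: one Bernoulli parameter per feature
# per class.
with_ci = 2 * d

print(without_ci)  # 2147483646
print(with_ci)     # 60
```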
Let’s assume discrete X_i and Y.
TrainNaiveBayes(Data)
  for each value y_k of Y
    estimate π_k = P(Y = y_k) = (# examples for which Y = y_k) / (# examples)
  for each value x_ij of X_i
    estimate θ_ijk = P(X_i = x_ij | Y = y_k) = (# examples for which X_i = x_ij and Y = y_k) / (# examples for which Y = y_k)
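The training pseudocode above can be sketched in code; the toy dataset and variable names below are made up for illustration:

```python
# Minimal sketch of TrainNaiveBayes for discrete features, following the
# pseudocode above. Dataset is hypothetical.
from collections import Counter, defaultdict

# Each example: (features tuple, label).
data = [
    (("F", "40"), "rich"),
    (("M", "40"), "poor"),
    (("M", "60"), "rich"),
    (("F", "20"), "poor"),
    (("M", "40"), "poor"),
]

# Estimate pi_k = P(Y = y_k) by relative frequency.
label_counts = Counter(y for _, y in data)
n = len(data)
pi = {y: c / n for y, c in label_counts.items()}

# Estimate theta_ijk = P(X_i = x_ij | Y = y_k) by relative frequency.
theta = defaultdict(float)
joint_counts = Counter()
for x, y in data:
    for i, v in enumerate(x):
        joint_counts[(i, v, y)] += 1
for (i, v, y), c in joint_counts.items():
    theta[(i, v, y)] = c / label_counts[y]

def predict(x):
    """Predict argmax_y P(Y = y) * prod_i P(X_i = x_i | Y = y)."""
    def score(y):
        p = pi[y]
        for i, v in enumerate(x):
            p *= theta[(i, v, y)]
        return p
    return max(pi, key=score)

print(predict(("M", "40")))  # poor
```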
– Often the X_i are not really conditionally independent
– What if the Maximum Likelihood estimate for P(X_i|Y) is zero?
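One standard remedy for the zero-estimate problem is add-α (Laplace) smoothing; this is a common fix, though not necessarily the one presented in lecture:

```python
# Add-alpha smoothing: add a pseudocount to every (value, class) cell
# before normalizing, so no conditional probability is exactly zero.
def smoothed_estimate(count_xv_y, count_y, num_values, alpha=1.0):
    """P(X_i = v | Y = y) with add-alpha smoothing over num_values values."""
    return (count_xv_y + alpha) / (count_y + alpha * num_values)

# Unsmoothed MLE would be 0/3 = 0; smoothing keeps it strictly positive.
print(smoothed_estimate(0, 3, 2))  # 0.2
```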
– Conditional independence assumption
– How to train it?
– How to make predictions?
– How does it relate to other classifiers we know? [HW]
– iid assumption
– Bayes optimal classifier