5. Bayesian decision theory
Foundations of Machine Learning, CentraleSupélec, Fall 2017
Chloé-Agathe Azencott, Centre for Computational Biology, Mines ParisTech
chloe-agathe.azencott@mines-paristech.fr
Practical matters
- I do not grade homework that is sent as .docx.
- (Partial) solutions to Lab 2 are at the end of the slides of Chap 4.
Learning objectives
After this lecture, you should be able to
- Apply Bayes' rule to simple inference and decision problems;
- Explain the connection between Bayes' decision rule, empirical risk minimization, maximum a posteriori and maximum likelihood;
- Apply the Naive Bayes algorithm.
Let's start by tossing coins...
Probability and inference
- Result of tossing a coin: x in {heads, tails}
  – x = f(z), where z denotes unobserved variables: a complex physical function of the composition of the coin, the force that is applied to it, the initial conditions, etc.
  – Replace f(z) (maybe deterministic, but unknown) with the random variable X in {0, 1}, drawn from a probability distribution P(X=x).
- We need to model P: here, a Bernoulli distribution.
- We do not know P, but we have a sample of tosses.
- Goal: approximate P (from which X is drawn): p0 = # heads / # tosses.
- Prediction of the next toss: heads if p0 > 0.5, tails otherwise.
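As a quick illustration, here is a minimal Python sketch of this estimation. The sample is simulated, so the variable true_p stands in for the unknown distribution and would not be visible to the learner:

```python
import numpy as np

rng = np.random.RandomState(0)

# Simulate a sample of coin tosses (1 = heads, 0 = tails) from an "unknown" P.
true_p = 0.6                              # hidden from the learner
tosses = rng.binomial(1, true_p, size=1000)

# Approximate P with the empirical frequency p0 = # heads / # tosses.
p0 = tosses.sum() / len(tosses)

# Prediction of the next toss: heads if p0 > 0.5, tails otherwise.
prediction = "heads" if p0 > 0.5 else "tails"
print(p0, prediction)
```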
Classification
- Cat vs. dog
  – Cat = 1 (positive)
  – Dog = 0 (negative)
  – x1 = human contact
  – x2 = good eater
- Prediction: [Figure: cats and dogs plotted in the (good eater, human contact) plane, with a boundary separating the two classes.]
Bayes' rule

Reverend Thomas Bayes (c. 1701–1761)
[Portrait, possibly of Bayes; the attribution is uncertain.]

Bayes' rule:
P(y | x) = P(x | y) P(y) / P(x)
Example: rare disease testing
– The test is correct 99% of the time.
– Disease prevalence: 1 out of 10,000.
What is the probability that a patient who tested positive actually has the disease? 99%? 90%? 10%? 1%?

By Bayes' rule:
P(disease | positive) = P(positive | disease) P(disease) / P(positive)
= (0.99 × 0.0001) / (0.99 × 0.0001 + (1 - 0.99) × (1 - 0.0001))
≈ 0.0098

i.e. less than 1%: most positive tests are false positives, because the disease is so rare.
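The same computation in Python, under the assumption that "correct 99% of the time" means both sensitivity and specificity equal 0.99:

```python
def posterior_disease(prevalence=1e-4, sensitivity=0.99, specificity=0.99):
    """P(disease | positive test) via Bayes' rule."""
    p_pos_given_disease = sensitivity
    p_pos_given_healthy = 1.0 - specificity
    # Evidence: total probability of observing a positive test.
    p_pos = (p_pos_given_disease * prevalence
             + p_pos_given_healthy * (1.0 - prevalence))
    return p_pos_given_disease * prevalence / p_pos

print(posterior_disease())   # ~0.0098, i.e. below 1%
```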
Bayes' rule
P(y | x) = P(x | y) P(y) / P(x)
posterior = likelihood × prior / evidence
Bayes' decision rule: choose the hypothesis y with the highest posterior P(y | x).
Maximum A Posteriori criterion
- MAP decision rule:
  – pick the hypothesis that is most probable,
  – i.e. maximize the posterior: Λ_MAP(x) = P(y=1 | x) / P(y=0 | x).
- Decision rule: if Λ_MAP(x) > 1 then choose y=1, else choose y=0.
Likelihood ratio test (LRT)
- The evidence p(x) does not affect the decision rule.
- Likelihood ratio test: test whether the likelihood ratio Λ(x) = p(x | y=1) / p(x | y=0) is larger than the ratio of priors P(y=0) / P(y=1).
- Decision rule: if Λ(x) > P(y=0) / P(y=1) then choose y=1, else choose y=0.
Example: LRT decision rule
Assuming the likelihoods below and equal priors, derive a decision rule based on the LRT.
[Figure: two class-conditional likelihoods p(x | C=1) and p(x | C=0), crossing at x = 7.]
- Likelihood ratio: Λ(x) = p(x | y=1) / p(x | y=0).
- Simplifying the equation and taking the log gives a decision function that is linear in x.
- Equal priors mean we're testing whether log Λ(x) > 0.
  Hence: if x < 7 then assign y=1, else assign y=0.
- Now assume P(y=1) = 2 P(y=0). The threshold becomes log Λ(x) > log(1/2), i.e.
  x < 7 - log(1/2) ≈ 7.69:
  the region assigned to y=1 grows, because y=1 is a priori more likely.
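A minimal sketch of an LRT between two Gaussian class-conditional densities. The slide's actual likelihoods are not reproduced in the text; the means and variance below are hypothetical values, chosen only so that the equal-prior boundary falls at x = 7 as in the example:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical class-conditional densities (NOT the ones from the slide):
# p(x | y=1) = N(mu1, sigma^2), p(x | y=0) = N(mu0, sigma^2)
mu1, mu0, sigma = 6.5, 7.5, 1.0

def lrt_decision(x, prior1=0.5, prior0=0.5):
    """Assign y=1 iff the likelihood ratio exceeds the prior ratio P(y=0)/P(y=1)."""
    log_lr = norm.logpdf(x, mu1, sigma) - norm.logpdf(x, mu0, sigma)
    return int(log_lr > np.log(prior0 / prior1))

# Equal priors: boundary at x = 7; with P(y=1) = 2 P(y=0) it shifts to ~7.69.
print(lrt_decision(6.0), lrt_decision(8.0))
print(lrt_decision(7.3, prior1=2/3, prior0=1/3))
```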
Maximum likelihood criterion
- Consider equal priors: P(y=1) = P(y=0).
- Bayes' decision rule then amounts to maximizing p(x | y=c), and is hence called the Maximum Likelihood criterion.
  – Decision rule: if Λ_ML(x) = p(x | y=1) / p(x | y=0) > 1 then choose y=1, else choose y=0.
Bayes' rule for K > 2 classes
- Bayes' rule: P(ck | x) = p(x | ck) P(ck) / p(x), with P(ck) ≥ 0 and Σk P(ck) = 1.
- Decision rule: assign x to the class with the highest posterior, i.e. choose ck with k = argmax_l P(cl | x).
Risk minimization
Losses and risks
- So far we've assumed all errors were equally costly. But misclassifying a cancer sufferer as a healthy patient is much more problematic than the other way around.
- Action αk: assigning class ck.
- Loss: quantify the cost λkl of taking action αk when the true class is cl.
- Expected risk: R(αk | x) = Σl λkl P(cl | x).
- Decision (Bayes classifier): take the action with the smallest expected risk, i.e. choose k = argmin_l R(αl | x).
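A small numerical sketch of this decision rule, with a hypothetical 2-class loss matrix in which missing a cancer case is ten times as costly as a false alarm; note how the minimum-risk action can differ from the most probable class:

```python
import numpy as np

# Rows: action alpha_k (assign class k); columns: true class l.
# Hypothetical costs: assigning "healthy" (0) when the truth is "cancer" (1) costs 10.
loss = np.array([[0.0, 10.0],
                 [1.0,  0.0]])

posterior = np.array([0.7, 0.3])   # P(c_0 | x), P(c_1 | x) for some x

# Expected risk of each action: R(alpha_k | x) = sum_l lambda_kl P(c_l | x)
risks = loss @ posterior
print(risks)                        # [3.0, 0.7]
print("decision:", risks.argmin())  # class 1, even though P(c_0 | x) > P(c_1 | x)
```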
Discriminant functions
Classification = find K discriminant functions fk such that x is assigned class ck if k = argmax_l fl(x).
- Bayes classifier: fk(x) = -R(αk | x) (or, for the 0/1 loss, simply fk(x) = P(ck | x)).
- Defines K decision regions.
[Figure: decision regions in the (price, engine power) plane for three classes: family car, luxury sedan, sports car.]
Bayes risk minimization
- Bayes risk: the overall expected risk, averaged over x.
- Bayes decision rule: use the discriminant functions that minimize the Bayes risk.
- This is also an LRT. For 2 classes, the Bayes decision rule is equivalent to:
  choose y=1 if Λ(x) = p(x | y=1) / p(x | y=0) > [(λ10 - λ00) P(y=0)] / [(λ01 - λ11) P(y=1)], else choose y=0.
0/1 loss
- All misclassifications are equally costly: λkl = 0 if k = l, and 1 otherwise.
- Minimizing the risk: R(αk | x) = Σ_{l≠k} P(cl | x) = 1 - P(ck | x), so
  – choose the most probable class (MAP);
  – this is equivalent to the Bayes decision rule.
Maximum likelihood criterion
- Consider equal priors: P(y=1) = P(y=0).
- Consider the 0/1 loss function.
- In the LRT threshold, the loss term equals 1 (0/1 loss) and the prior ratio equals 1 (equal priors).
- Bayes' decision rule is then equivalent to the Maximum Likelihood criterion.
  Decision rule: if Λ_ML(x) > 1 then choose y=1, else choose y=0.
Reject
- Add an artificial "reject" class (K+1) for refusing to take a decision, e.g. in zip code detection.
- Loss: λkl = 0 if k = l; λ if k = K+1; 1 otherwise.
- Decision: choose ck if P(ck | x) > P(cl | x) for all l ≠ k and P(ck | x) > 1 - λ; else reject.
  Only meaningful if 0 < λ < 1.
Losses for regression
- Square loss: L(f(x), y) = (f(x) - y)²
  – dominated by outliers.
- ε-insensitive loss: L(f(x), y) = (|f(x) - y| - ε)+
  – non-smooth.
- Huber loss: a mix of linear and quadratic
  – quadratic for small residuals, linear for large ones.
[Figure: the three losses plotted as a function of the residual f(x) - y.]
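For reference, the three losses written out in Python; the ε and δ parameters below are arbitrary illustrative values:

```python
import numpy as np

def square_loss(residual):
    return residual ** 2

def eps_insensitive_loss(residual, eps=0.5):
    # (|f(x) - y| - eps)_+ : zero inside the eps-tube, linear outside
    return np.maximum(np.abs(residual) - eps, 0.0)

def huber_loss(residual, delta=1.0):
    # Quadratic for small residuals, linear for large ones (robust to outliers)
    abs_r = np.abs(residual)
    return np.where(abs_r <= delta,
                    0.5 * abs_r ** 2,
                    delta * (abs_r - 0.5 * delta))

residuals = np.linspace(-3, 3, 7)
print(square_loss(residuals))
print(eps_insensitive_loss(residuals))
print(huber_loss(residuals))
```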
Empirical risk minimization (ERM)
- Loss: L(f(x), y) is small when f(x) predicts y well.
- Expected risk: R(f) = E[L(f(X), Y)], the expectation taken over the (unknown) data distribution.
- Empirical risk: RN(f) = (1/N) Σ_{i=1..N} L(f(xi), yi), computed on the training sample.
- The ERM estimator over the function class F is the solution, when it exists, of:
  f* = argmin_{f in F} RN(f).
Solving ERM
- There can sometimes be an explicit analytical solution.
- Otherwise: convex optimization (if the loss function is convex in f).
- Limits of ERM:
  – ill-posed;
  – not statistically consistent.
  This is particularly true in high dimension.
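An example of the first case: with the square loss over linear functions f(x) = w·x, the ERM problem has an explicit least-squares solution. A minimal sketch on simulated data:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(100, 3)                      # 100 points, 3 features
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.randn(100)      # noisy linear targets

# ERM with the square loss over linear functions: ordinary least squares.
w_erm, *_ = np.linalg.lstsq(X, y, rcond=None)

empirical_risk = np.mean((X @ w_erm - y) ** 2)
print(w_erm, empirical_risk)
```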
ERM is ill-posed
- Well-posed problems (Hadamard): mathematical models of physical phenomena such that
  – a solution exists;
  – the solution is unique;
  – the solution's behavior changes continuously with the initial conditions.
- It can happen that an infinite number of solutions minimize the empirical risk to zero.
ERM is not statistically consistent
- Statistical consistency: an estimator θN of θ converges in probability towards θ as N increases.
- From the law of large numbers, RN(f) converges to R(f) for any fixed f, but this isn't enough to guarantee that minimizing RN(f) gives a good estimator of the minimizer of R(f).
- Vapnik showed that this is only true if the capacity of the hypothesis space F is "not too large".
Multivariate classification: Naive Bayes
Naive Bayes
- Multivariate classification: x is multidimensional.
- Assume the variables x1, x2, …, xp are conditionally independent given the class:
  p(x1, x2, …, xp | ck) = Πj p(xj | ck).
Graphical representation
- We can use a graph to represent conditional independence:
  – an arc from c to xj means that the distribution of Xj depends on c;
  – no arc between Xj1 and Xj2 means that Xj1 and Xj2 are independent given C.
- A plate represents repeated structure: all Xj inside the same plate follow the same probability distribution.
[Figure: left, a node c with arcs to x1, x2, x3; right, the equivalent plate notation with c pointing to xj, j = 1, 2, 3.]
Naive Bayes
- Under the conditional independence assumption:
  P(ck | x1, …, xp) = (1/Z) P(ck) Πj p(xj | ck),
  where Z = p(x1, …, xp) is a scaling factor, independent of ck.
Maximum a posteriori estimation
- MAP decision rule: pick the hypothesis that is most probable.
- For Naive Bayes: assign x to class ck with k = argmax_l P(cl) Πj p(xj | cl) (the scaling factor Z can be ignored).
Naive Bayes spam filtering
- Input: an email, represented as a bag of words (x1, x2, …, xp) = (0, 1, …, 0).
- Output: spam / ham.
- Naive Bayes assumption: conditional independence of the words given the class.
[Figure: example spam and non-spam emails, with trigger words such as "rich", "CLICK", "viagra".]
- P(spam | (x1, x2, …, xp)) = (1/Z) p(spam) p(x1 | spam) p(x2 | spam) … p(xp | spam)
- P(ham | (x1, x2, …, xp)) = (1/Z) p(ham) p(x1 | ham) p(x2 | ham) … p(xp | ham)
- Decision: if P(spam | (x1, x2, …, xp)) > P(ham | (x1, x2, …, xp)) then spam, else ham.
- Inference: we need to determine p(spam), p(ham), p(xj | spam), p(xj | ham).
  – p(spam): the frequency of spam in the training data (and similarly for p(ham)).
- Bernoulli Naive Bayes:
  – Each email is the outcome of p Bernoulli trials.
  – Naive assumption: the trials are independent.
    (Word co-occurrences within a category aren't really independent; still, independence assumptions can give good results.)
  – Notation: S = # spams in the training set; Sj = # spams containing word j in the training set.
  – Direct estimate of pj = p(xj = 1 | spam): pj = Sj / S.
  – What happens if a word is never seen in the training set? Its direct estimate is pj = 0, and a single occurrence of that word in a new email forces the spam posterior to zero.
  – Laplace-smoothed estimate: pj = (Sj + 1) / (S + 2).
    For a word that's not in the training set, pj is now 0.5 instead of 0.
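A from-scratch sketch of this Bernoulli Naive Bayes spam filter with Laplace smoothing; the tiny bag-of-words matrix and the helper names are purely illustrative, not taken from the lab:

```python
import numpy as np

# Toy bag-of-words data: rows = emails, columns = words; 1 = word present.
X_train = np.array([[1, 1, 0],
                    [1, 0, 1],
                    [0, 1, 0],
                    [0, 0, 1]])
y_train = np.array([1, 1, 0, 0])          # 1 = spam, 0 = ham

def fit_bernoulli_nb(X, y):
    priors, cond = {}, {}
    for c in np.unique(y):
        Xc = X[y == c]
        priors[c] = Xc.shape[0] / X.shape[0]                 # e.g. frequency of spam
        cond[c] = (Xc.sum(axis=0) + 1) / (Xc.shape[0] + 2)   # Laplace: (S_j+1)/(S+2)
    return priors, cond

def log_posterior(x, priors, cond, c):
    p = cond[c]
    # log P(c) + sum_j log p(x_j | c), with p(x_j = 0 | c) = 1 - p_j
    return np.log(priors[c]) + np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

priors, cond = fit_bernoulli_nb(X_train, y_train)
x_new = np.array([1, 0, 0])
label = max(priors, key=lambda c: log_posterior(x_new, priors, cond, c))
print("spam" if label == 1 else "ham")
```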
Gaussian Naive Bayes
- Assume each class-conditional distribution p(xj | y = ck) is a univariate Gaussian, with mean and variance estimated from the training examples of class ck.
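In practice, scikit-learn's GaussianNB implements this per-feature, per-class Gaussian model; a minimal sketch (the iris dataset is just a convenient placeholder):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Fits one univariate Gaussian per feature and per class: p(x_j | y = c_k)
clf = GaussianNB().fit(X_train, y_train)
print(clf.score(X_test, y_test))
```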
Bayesian model selection
- Priors on models: p(model); Bayes' rule gives p(model | data) ∝ p(data | model) p(model).
- Regularization ≡ a prior that favors simpler models.
- Taking the log, MAP estimation is similar to minimizing
  E' = empirical error + λ × model complexity,
  where the first term corresponds to the training error (negative log-likelihood) and the second to the model complexity (negative log-prior).
Summary
- Bayes' decision rule ≡ likelihood ratio test:
  choose the most probable class, given the evidence (data) and prior belief.
  P(y | x) = P(x | y) P(y) / P(x), i.e. posterior = likelihood × prior / evidence.
- Equivalent to minimizing the Bayes risk, usually achieved approximately through empirical risk minimization (which is not equivalent!).
- For the 0/1 loss, equivalent to maximizing the posterior.
- For the 0/1 loss and equal priors (uniform prior), equivalent to maximizing the likelihood.
References
- A Course in Machine Learning. http://ciml.info/dl/v0_99/ciml-v0_99-all.pdf
  – Bayes classifier: Chap 2.1
  – LRT: Chap 9.4
  – Naive Bayes: Chap 9.3
- The Elements of Statistical Learning. http://web.stanford.edu/~hastie/ElemStatLearn/
  – Bayes classifier: Chap 2.4
  – Maximum Likelihood: Chap 2.6.3, Chap 8.3
- Probabilistic machine learning: https://www.repository.cam.ac.uk/bitstream/handle/1810/248538/Ghahramani%202015%20Nature
- Spam detection: http://www.paulgraham.com/spam.html
- Naive Bayes: https://nlp.stanford.edu/IR-book/pdf/13bayes.pdf
Challenge project
How Many Shares? Challenge
https://www.kaggle.com/c/how-many-shares
- Predict the number of shares on social media for articles from the same media site
  – from article length, topics, subjectivity and much more.
  – What kind of machine learning task is this?
- Evaluation on
  – insights learned;
  – prediction performance.
- Form teams of 2-5 students
  – Engineer features (see Lab 4)
  – Model selection for several approaches
  – Predict with the selected models and submit to the leaderboard
  – Choose 2 final models
- Deadline: December 23, 2017, 23:59
  – Report (2 pages + figures/tables): 25 pts
  – Leaderboard position: 5 pts
- Get started early!
- Full instructions on the course website.
Kaggle leaderboard setup
- The data is divided into:
  – training data;
  – public validation data;
  – private validation data.
- You only have the labels of the training data.
- You make predictions for the whole validation set.
- The public part is used to rank you on the public leaderboard throughout the challenge.
- The private part is used to determine your final ranking at the end.
Grading rubric
- Discussion of feature engineering: 4 pts
- Discussion of cross-validated performance: 8 pts
- Discussion of leaderboard performance (of selected models; max 5 per day): 4 pts
- Discussion of final model: 4 pts
- Clarity of report (text, tables, figures): 5 pts
- Final performance: 5 pts
Lab 3: make_Kfolds
- Each index (or instance) should appear once and only once in the test data.
- Each test fold contains n/K points; the last one might contain a few more or fewer if n is not a multiple of K.
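A possible sketch of such a helper (the actual function signature expected in the lab may differ):

```python
import numpy as np

def make_kfolds(n, K, seed=0):
    """Split indices 0..n-1 into K (train, test) folds.

    Each index appears in exactly one test fold; the last fold absorbs the
    remainder when n is not a multiple of K.
    """
    rng = np.random.RandomState(seed)
    indices = rng.permutation(n)
    fold_size = n // K
    folds = []
    for k in range(K):
        start = k * fold_size
        stop = (k + 1) * fold_size if k < K - 1 else n   # last fold takes the rest
        test_idx = indices[start:stop]
        train_idx = np.setdiff1d(indices, test_idx)
        folds.append((train_idx, test_idx))
    return folds
```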
cross_validate
- predict_proba returns an array of predicted probabilities with one column per class:
  – one column contains the probability, for each point, of belonging to class A;
  – the other, the probability of belonging to class B.
- To determine which of class A and class B is the positive one, you can use classifier.classes_, which contains [class A, class B] in the same column order.
- Note that this extends to more than 2 classes.
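A short sketch of how predict_proba and classes_ fit together, using LogisticRegression as a stand-in for whichever classifier the lab uses and a single train/test split in place of the full cross-validation loop:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
train_idx = np.arange(0, 400)
test_idx = np.arange(400, len(y))

clf = LogisticRegression(max_iter=10000).fit(X[train_idx], y[train_idx])

proba = clf.predict_proba(X[test_idx])     # shape (n_test, n_classes)
print(clf.classes_)                        # column order of proba, e.g. [0 1]

# Probability of the positive class (label 1), whichever column it is in:
pos_col = list(clf.classes_).index(1)
pos_proba = proba[:, pos_col]
print(pos_proba[:5])
```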