

SLIDE 1
  • 5. Bayesian decision theory

Foundations of Machine Learning, CentraleSupélec — Fall 2017. Chloé-Agathe Azencott

Centre for Computational Biology, Mines ParisTech

chloe-agathe.azencott@mines-paristech.fr

SLIDE 2

Practical matters...

  • I do not grade homework that is sent as .docx.
  • (Partial) solutions to Lab 2 are at the end of the slides of Chap 4.

SLIDE 3

Learning objectives

After this lecture, you should be able to:

  • Apply Bayes' rule to simple inference and decision problems;
  • Explain the connection between the Bayes decision rule, empirical risk minimization, maximum a posteriori and maximum likelihood;
  • Apply the Naive Bayes algorithm.
SLIDE 4

Let's start by tossing coins...

SLIDES 5-10

Probability and inference

  • Result of tossing a coin: x in {heads, tails}
    – x = f(z), where z denotes unobserved variables. E.g., f is a complex physical function of the composition of the coin, the force applied to it, the initial conditions, etc.
    – Replace f(z) (maybe deterministic, but unknown) with the random variable X in {0, 1} drawn from a probability distribution P(X=x).
  • We need to model P: here, a Bernoulli distribution.
  • We do not know P, but we have a sample.
  • Goal: approximate P (from which X is drawn): p0 = # heads / # tosses
  • Prediction of next toss: heads if p0 > 0.5, tails otherwise.
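A minimal sketch of this estimator in Python (not from the slides; the simulated coin and its bias are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulate 100 tosses of a biased coin (1 = heads, 0 = tails);
# the true P(X=1) = 0.6 is unknown to the learner.
tosses = rng.binomial(n=1, p=0.6, size=100)

# Estimate the Bernoulli parameter: p0 = # heads / # tosses
p0 = tosses.mean()

# Prediction of the next toss: heads if p0 > 0.5, tails otherwise
prediction = "heads" if p0 > 0.5 else "tails"
print(f"p0 = {p0:.2f}, predict {prediction}")
```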

SLIDE 11

Classification

  • Cat vs. dog
    – Cat = 1 (positive)
    – Dog = 0 (negative)
    – x1 = human contact
    – x2 = good eater
  • Prediction:

[Figure: scatter plot of "human contact" vs. "good eater", with a boundary separating the Cat and Dog regions.]

SLIDE 12

Bayes rule

SLIDE 13

Reverend Thomas Bayes (170?-1761)

[Portrait, possibly of Bayes.]

SLIDE 14

Bayes rule

P(y | x) = p(x | y) P(y) / p(x)

SLIDES 15-20

Example: rare disease testing

  – The test is correct 99% of the time.
  – Disease prevalence = 1 out of 10,000.

What is the probability that a patient who tested positive actually has the disease? 99%? 90%? 10%? 1%?

By Bayes' rule:

P(disease | +) = P(+ | disease) P(disease) / P(+)
               = (0.99 × 0.0001) / (0.99 × 0.0001 + (1 − 0.99) × (1 − 0.0001))
               ≈ 0.0098, i.e. about 1%.
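The same computation in Python (numbers from the slide):

```python
prevalence = 1e-4     # P(disease) = 1 / 10,000
accuracy = 0.99       # the test is correct 99% of the time

p_pos_given_disease = accuracy         # true positive rate
p_pos_given_healthy = 1 - accuracy     # false positive rate

# Bayes' rule: P(disease | positive test)
p_pos = (p_pos_given_disease * prevalence
         + p_pos_given_healthy * (1 - prevalence))
posterior = p_pos_given_disease * prevalence / p_pos
print(f"P(disease | +) = {posterior:.4f}")   # about 0.0098, i.e. ~1%
```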

SLIDE 21

Bayes rule

Bayes' decision rule: choose the class with the highest posterior.

P(y | x) = p(x | y) P(y) / p(x)
posterior = likelihood × prior / evidence

SLIDES 22-23

Maximum A Posteriori criterion

  • MAP decision rule:
    – pick the hypothesis that is most probable,
    – i.e. maximize the posterior: ΛMAP(x) = P(y=1 | x) / P(y=0 | x)
  • Decision rule: if ΛMAP(x) > 1, then choose y=1, else choose y=0.

SLIDES 24-25

Likelihood ratio test (LRT)

p(x) does not affect the decision rule.

  • Likelihood ratio test: test whether the likelihood ratio
    Λ(x) = p(x | y=1) / p(x | y=0)
    is larger than the ratio of priors, P(y=0) / P(y=1).
  • Decision rule: if Λ(x) > P(y=0) / P(y=1), choose y=1, else choose y=0.

SLIDE 26

Example: LRT decision rule

Assuming the likelihoods below and equal priors, derive a decision rule based on the LRT.

[Figure: two class-conditional likelihoods, p(x | y=1) and p(x | y=0), crossing at x = 7.]

SLIDES 27-29

  • Likelihood ratio: Λ(x) = p(x | y=1) / p(x | y=0)
  • Simplify the equation and take the log.
  • Equal priors mean we're testing whether log Λ(x) > 0.
    Hence: if x < 7, assign y=1, else assign y=0.
  • Now assume P(y=1) = 2 P(y=0): the test becomes log Λ(x) > log(1/2), i.e.
    x < 7 − log(1/2) ≈ 7.69, since y=1 is a priori more likely.

[Figure: the decision threshold moves from 7 (equal priors) to 7.69; the C=1 region lies to the left, C=0 to the right.]
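A sketch of this example in Python. The slide's likelihood figure is not available, so the class-conditional densities below are assumed unit-variance Gaussians, N(6.5, 1) for y=1 and N(7.5, 1) for y=0, chosen because they reproduce the slide's thresholds (log Λ(x) = 7 − x, hence x < 7 for equal priors and x < 7 − log(1/2) ≈ 7.69 when P(y=1) = 2 P(y=0)):

```python
import numpy as np
from scipy.stats import norm

# Assumed class-conditional likelihoods (the original figure is lost):
# p(x|y=1) = N(6.5, 1), p(x|y=0) = N(7.5, 1).
mu1, mu0, sigma = 6.5, 7.5, 1.0

def log_lr(x):
    """Log-likelihood ratio log p(x|y=1) - log p(x|y=0); equals 7 - x here."""
    return norm.logpdf(x, mu1, sigma) - norm.logpdf(x, mu0, sigma)

def decide(x, p1=0.5):
    """LRT: choose y=1 iff log LR(x) > log(P(y=0) / P(y=1))."""
    threshold = np.log((1 - p1) / p1)
    return int(log_lr(x) > threshold)

print(decide(7.3))           # equal priors: 7.3 > 7     -> y=0
print(decide(7.3, p1=2/3))   # P(y=1)=2P(y=0): 7.3 < 7.69 -> y=1
```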

SLIDE 30

Maximum likelihood criterion

  • Consider equal priors: P(y=1) = P(y=0).
  • The Bayes decision rule then seeks to maximize p(x | y=c), and is hence called the maximum likelihood criterion.
    – Decision rule: if ΛML(x) = p(x | y=1) / p(x | y=0) > 1, then choose y=1, else choose y=0.

SLIDES 31-33

Bayes rule for K > 2

  • Bayes rule:
    P(y = ck | x) = p(x | y = ck) P(y = ck) / Σl p(x | y = cl) P(y = cl)
  • Decision rule: choose the class with the highest posterior,
    k = argmaxl P(y = cl | x).
SLIDE 34

Risk minimization

SLIDE 35

Losses and risks

  • So far we've assumed all errors are equally costly. But misclassifying a cancer sufferer as a healthy patient is much more problematic than the other way around.
  • Action αk: assigning class ck.
  • Loss: quantifies the cost λkl of taking action αk when the true class is cl.
  • Expected risk: R(αk | x) = Σl λkl P(y = cl | x)
  • Decision (Bayes classifier): take the action with minimal expected risk, k = argminl R(αl | x) (see the sketch below).
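A minimal numeric sketch of this decision rule (the loss matrix and posterior values are made up):

```python
import numpy as np

# Made-up loss matrix: loss[k, l] = cost of taking action k (predict class k)
# when the true class is l. Missing a cancer (predicting "healthy") is costly.
loss = np.array([[0.0, 10.0],   # predict healthy: true healthy / true cancer
                 [1.0,  0.0]])  # predict cancer:  true healthy / true cancer

posterior = np.array([0.9, 0.1])     # made-up P(y = cl | x)

# Expected risk of each action: R(alpha_k | x) = sum_l loss[k, l] P(cl | x)
risk = loss @ posterior
k = risk.argmin()                    # Bayes classifier: minimize expected risk
print(risk, "-> choose class", k)    # [1.0, 0.9] -> choose class 1 (cancer)
```

Note that the asymmetric costs flip the decision: the posterior favors "healthy" (0.9), yet the risk-minimizing action is "cancer".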
SLIDES 36-37

Discriminant functions

Classification = find K discriminant functions fk such that x is assigned class ck if k = argmaxl fl(x).

  • Bayes classifier: e.g. fk(x) = −R(αk | x), or fk(x) = P(y = ck | x) under the 0/1 loss.
  • Defines K decision regions.

[Figure: the (price, engine power) plane partitioned into three decision regions: family car, luxury sedan, sports car.]

SLIDES 38-39

Bayes risk minimization

  • Bayes risk: the overall expected risk, R = ∫ R(α(x) | x) p(x) dx.
  • Bayes decision rule: use the discriminant functions that minimize the Bayes risk.
  • This is also an LRT. For 2 classes, the Bayes decision rule is equivalent to:
    choose y=1 if Λ(x) = p(x | y=1) / p(x | y=0) > ((λ10 − λ00) P(y=0)) / ((λ01 − λ11) P(y=1)), else choose y=0.

SLIDE 40

0/1 Loss

  • All misclassifications are equally costly: λkl = 0 if k = l, and 1 otherwise.
  • Minimizing the risk: R(αk | x) = Σl≠k P(y = cl | x) = 1 − P(y = ck | x), so
    – choose the most probable class (MAP);
    – this is equivalent to the Bayes decision rule.

SLIDES 41-43

Maximum likelihood criterion

  • Consider equal priors, P(y=1) = P(y=0), and the 0/1 loss function.
  • In the LRT threshold, the prior ratio P(y=0) / P(y=1) = 1 (equal priors) and the loss ratio (λ10 − λ00) / (λ01 − λ11) = 1 (0/1 loss).
  • The Bayes decision rule is then equivalent to the maximum likelihood criterion.
    Decision rule: if ΛML(x) > 1, then choose y=1, else choose y=0.

SLIDE 44

Reject

  • Add an artificial "reject" class (K+1) for refusing to take a decision. E.g. zip code recognition.
  • Loss:
    λkl = 0 if k = l,
          λ if k = K+1,
          1 otherwise.
  • Decision: choose ck if P(y = ck | x) ≥ P(y = cl | x) for all l, and P(y = ck | x) ≥ 1 − λ; else reject (see the sketch below).
  • Only meaningful if 0 < λ < 1.
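A minimal sketch of the reject rule (the posteriors and the value of λ are made up):

```python
import numpy as np

def decide_with_reject(posterior, lam=0.3):
    """Reject option: choose the most probable class if its posterior
    exceeds 1 - lambda, else reject. Only meaningful for 0 < lambda < 1."""
    k = int(np.argmax(posterior))
    return k if posterior[k] >= 1 - lam else "reject"

print(decide_with_reject(np.array([0.8, 0.2])))    # confident  -> class 0
print(decide_with_reject(np.array([0.55, 0.45])))  # ambiguous  -> 'reject'
```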

SLIDES 45-49

Losses for regression

  • Square loss: L(f(x), y) = (f(x) − y)²
    – dominated by outliers.
  • ε-insensitive loss: L(f(x), y) = (|f(x) − y| − ε)+
    – non-smooth.
  • Huber loss: a mix of linear and quadratic (quadratic for small residuals, linear for large ones).
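The three losses as Python functions (a sketch; the ε and δ parameter values are illustrative):

```python
import numpy as np

def square_loss(residual):
    return residual ** 2

def eps_insensitive_loss(residual, eps=0.5):
    # (|r| - eps)+ : zero inside the eps-tube, linear outside
    return np.maximum(np.abs(residual) - eps, 0.0)

def huber_loss(residual, delta=1.0):
    # quadratic for |r| <= delta, linear (slope delta) beyond
    r = np.abs(residual)
    return np.where(r <= delta, 0.5 * r ** 2, delta * (r - 0.5 * delta))

r = np.array([-3.0, -0.3, 0.0, 0.3, 3.0])
print(square_loss(r))           # the outlier at |r| = 3 dominates: 9.0
print(eps_insensitive_loss(r))  # flat (and non-smooth) around 0
print(huber_loss(r))            # grows only linearly for the outlier
```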

SLIDE 50

Empirical risk minimization (ERM)

  • Loss: L(f(x), y), small when f(x) predicts y well.
  • Expected risk: R(f) = E[L(f(x), y)]
  • Empirical risk: RN(f) = (1/N) Σi L(f(xi), yi)
  • The ERM estimator over the function class F is the solution, when it exists, of min f∈F RN(f) (a minimal instance follows below).
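A minimal ERM instance: with the square loss over linear functions, the minimizer of the empirical risk has an explicit analytical solution, ordinary least squares (the toy data below is made up):

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up 1D regression data: y = 2x + 1 + noise
X = rng.uniform(-1, 1, size=(50, 1))
y = 2 * X[:, 0] + 1 + 0.1 * rng.standard_normal(50)

# ERM with the square loss over linear functions f(x) = w.x + b:
# minimizing R_N(f) = (1/N) sum_i (f(x_i) - y_i)^2 is ordinary least squares.
A = np.column_stack([X, np.ones(len(X))])       # add an intercept column
w, b = np.linalg.lstsq(A, y, rcond=None)[0]

empirical_risk = np.mean((A @ np.array([w, b]) - y) ** 2)
print(f"w = {w:.2f}, b = {b:.2f}, empirical risk = {empirical_risk:.4f}")
```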

SLIDE 51

Solving ERM

  • There can sometimes be an explicit analytical solution.
  • Otherwise: convex optimization (if the loss function is convex in f).
  • Limits of ERM:
    – ill-posed;
    – not statistically consistent.
    This is particularly true in high dimension.

SLIDE 52

ERM is ill-posed

  • Well-posed problems (Hadamard): mathematical models of physical phenomena such that
    – a solution exists;
    – the solution is unique;
    – the solution's behavior changes continuously with the initial conditions.
  • It can be that an infinite number of solutions minimize the empirical risk to zero.

SLIDE 53

ERM is not statistically consistent

  • Statistical consistency: an estimator θN of θ converges in probability towards θ as N increases.
  • From the law of large numbers, RN(f) → R(f) for any fixed f, but this isn't enough to guarantee that minimizing RN(f) gives a good estimator of the minimizer of R(f).
  • Vapnik showed that this holds only if the capacity of the hypothesis space F is "not too large".

SLIDE 54

Multivariate classification: Naive Bayes

SLIDE 55

Naive Bayes

  • Multivariate classification: x is multidimensional.
  • Assume the variables x1, x2, …, xp are conditionally independent given the class:
    p(x1, x2, …, xp | y = ck) = Πj p(xj | y = ck)

SLIDE 56

Graphical representation

  • We can use a graph to represent conditional independence:
    – an arc from c to xj means the distribution of Xj depends on c;
    – no arc between Xj1 and Xj2 means that Xj1 and Xj2 are independent given C.
  • A plate represents repeated structure: all Xj inside the same plate follow the same form of probability distribution.

[Figure: left, a graph with arcs from c to x1, x2, x3; right, the equivalent plate notation with node xj and plate index j = 1, 2, 3.]

SLIDE 57

Naive Bayes

  • Multivariate classification: x is multidimensional.
  • Assume the variables x1, x2, …, xp are conditionally independent:
    p(x1, …, xp | y = ck) = Πj p(xj | y = ck)
  • Hence:
    P(y = ck | x) = (1/Z) P(y = ck) Πj p(xj | y = ck),
    where Z is a scaling factor, independent of ck.

SLIDE 58

Maximum a posteriori estimation

  • MAP decision rule: pick the hypothesis that is most probable.
  • For Naive Bayes:
    ŷ = argmaxk P(y = ck) Πj p(xj | y = ck)
SLIDE 59

Naive Bayes spam filtering

  • Input: an email, represented as a bag of words (x1, x2, …, xp) = (0, 1, …, 0).
  • Output: spam / ham.
  • Naive Bayes assumption: conditional independence of the words given the class.

[Figure: example spam and non-spam emails, with trigger words such as "rich", "CLICK", "viagra" highlighted.]

SLIDES 60-62

  • P(spam | (x1, x2, …, xp)) = (1/Z) p(spam) p(x1 | spam) p(x2 | spam) … p(xp | spam)
  • P(ham | (x1, x2, …, xp)) = (1/Z) p(ham) p(x1 | ham) p(x2 | ham) … p(xp | ham)
  • Decision: if P(spam | (x1, x2, …, xp)) > P(ham | (x1, x2, …, xp)), then spam, else ham.
  • Inference: we need to determine p(spam), p(ham), p(xj | spam), p(xj | ham).
    – p(spam): the frequency of spam in the training data.

SLIDES 63-64

  • Bernoulli Naive Bayes:
    – Each email is the outcome of p Bernoulli trials.
    – Naive assumption: the trials are independent. Word co-occurrences within a category aren't really independent; still, independence assumptions can give good results.
    – Notation: S = # spams in the training set, Sj = # spams containing word j in the training set.
    – Direct estimate of pj = p(xj = 1 | spam): pj = Sj / S. But what happens if a word is never seen?
    – Laplace-smoothed estimate: pj = (Sj + 1) / (S + 2). For a word that's not in the training set, now pj = 0.5 instead of 0.

A small implementation sketch follows below.
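A from-scratch sketch of Bernoulli Naive Bayes with Laplace smoothing (the tiny corpus is made up; this is not the lab's reference implementation):

```python
import numpy as np

def train_bernoulli_nb(X, y):
    """Bernoulli Naive Bayes with Laplace smoothing (sketch).
    X: binary bag-of-words matrix (n_emails, n_words); y: 1 = spam, 0 = ham."""
    priors = np.array([np.mean(y == 0), np.mean(y == 1)])
    # p[c, j] = (# class-c emails containing word j + 1) / (# class-c emails + 2)
    p = np.vstack([(X[y == c].sum(axis=0) + 1) / ((y == c).sum() + 2)
                   for c in (0, 1)])
    return priors, p

def predict(x, priors, p):
    # log P(c | x) = log P(c) + sum_j log p(x_j | c) + const
    log_post = (np.log(priors)
                + (x * np.log(p) + (1 - x) * np.log(1 - p)).sum(axis=1))
    return int(np.argmax(log_post))  # 1 = spam, 0 = ham

# Tiny made-up corpus over 3 words, e.g. ("rich", "click", "meeting")
X = np.array([[1, 1, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 1]])
y = np.array([1, 1, 1, 0, 0])
priors, p = train_bernoulli_nb(X, y)
print(predict(np.array([1, 1, 0]), priors, p))  # -> 1 (spam)
```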


SLIDE 66

Gaussian naive Bayes

  • Assume p(xj | y = ck) is a univariate Gaussian:
    p(xj | y = ck) = (1 / (σjk √(2π))) exp(−(xj − μjk)² / (2 σjk²))
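A scikit-learn sketch (GaussianNB fits one univariate Gaussian per feature and per class; the iris data is just a stand-in):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fits one univariate Gaussian per feature and per class
clf = GaussianNB().fit(X_train, y_train)
print(clf.score(X_test, y_test))  # accuracy on held-out data
```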

SLIDE 67

Bayesian model selection

  • Priors on models: p(model).
  • Regularization ≡ a prior that favors simpler models.
  • By Bayes' rule, p(model | data) ∝ p(data | model) p(model). Taking the log:
    log p(model | data) = log p(data | model) + log p(model) + constant,
    where −log p(data | model) ≡ the training error and −log p(model) ≡ the model complexity.
  • MAP is thus similar to minimizing E' = empirical error + λ × model complexity.

SLIDE 68

Summary

  • Bayes decision rule ≡ likelihood ratio test: choose the most probable class, given evidence (data) and prior belief:
    P(y | x) = p(x | y) P(y) / p(x), i.e. posterior = likelihood × prior / evidence.
  • Equivalent to minimizing the Bayes risk, which is usually achieved approximately through empirical risk minimization (not equivalent!).
  • For the 0/1 loss, equivalent to maximizing the posterior.
  • For the 0/1 loss and equal priors (uniform prior), equivalent to maximizing the likelihood.

SLIDE 69

References

  • A Course in Machine Learning. http://ciml.info/dl/v0_99/ciml-v0_99-all.pdf
    – Bayes classifier: Chap 2.1
    – LRT: Chap 9.4
    – Naive Bayes: Chap 9.3
  • The Elements of Statistical Learning. http://web.stanford.edu/~hastie/ElemStatLearn/
    – Bayes classifier: Chap 2.4
    – Maximum likelihood: Chap 2.6.3, Chap 8.3
  • Probabilistic machine learning. https://www.repository.cam.ac.uk/bitstream/handle/1810/248538/Ghahramani%202015%20Nature
  • Spam detection: http://www.paulgraham.com/spam.html
  • Naive Bayes: https://nlp.stanford.edu/IR-book/pdf/13bayes.pdf

SLIDE 70

Challenge project

How Many Shares? Challenge
https://www.kaggle.com/c/how-many-shares

  • Predict the number of shares on social media for articles from the same media site
    – from article length, topics, subjectivity and much more.
    – What kind of machine learning task is this?
  • Evaluation based on
    – insights learned;
    – prediction performance.

SLIDE 71

Challenge project

  • Form teams of 2-5 students
    – Engineer features (see Lab 4)
    – Perform model selection for several approaches
    – Predict with the selected models and submit to the leaderboard
    – Choose 2 final models
  • Deadline: December 23, 2017, 23:59
    – Report (2 pages + figures/tables): 25 pts
    – Leaderboard position: 5 pts
  • Get started early!
  • Full instructions are on the course website.

SLIDE 72

Kaggle leaderboard setup

  • The data is divided into:
    – training data;
    – public validation data;
    – private validation data.
  • You only have the labels of the training data.
  • You make predictions for the whole validation set.
  • The public part is used to rank you on the public leaderboard throughout the challenge.
  • The private part is used to determine your final ranking at the end.

SLIDE 73

Grading rubric

  • Discussion of feature engineering: 4 pts
  • Discussion of cross-validated performance: 8 pts
  • Discussion of leaderboard performance (of the selected models; max 5/day): 4 pts
  • Discussion of the final model: 4 pts
  • Clarity of the report (text, tables, figures): 5 pts
  • Final performance: 5 pts
SLIDE 74

Lab 3: make_Kfolds

SLIDE 75

  • Each index (or instance) should appear once and only once in any test fold.
  • Each test fold contains n/K points; the last one might contain a few more or fewer if n is not a multiple of K.

A possible implementation is sketched below.
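A sketch consistent with these constraints (the lab's exact signature and shuffling behavior may differ):

```python
import numpy as np

def make_Kfolds(n, K):
    """Split indices 0..n-1 into K (train, test) folds.
    Each index appears in exactly one test fold; each test fold has
    n // K points, except the last one, which absorbs the remainder."""
    indices = np.random.permutation(n)
    fold_size = n // K
    folds = []
    for k in range(K):
        start = k * fold_size
        stop = (k + 1) * fold_size if k < K - 1 else n
        test = indices[start:stop]
        train = np.concatenate([indices[:start], indices[stop:]])
        folds.append((train, test))
    return folds

for train, test in make_Kfolds(10, 3):
    print(sorted(test))   # test folds of sizes 3, 3, 4
```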
SLIDE 76

cross_validate

  • predict_proba returns an array with one column of predictions per class:
    – one column contains the probability, for each point, of belonging to class A;
    – the other, the probability, for each point, of belonging to class B.
  • To determine which of class A and class B is the positive one, you can use classifier.classes_, which contains [class A, class B].
  • Note that this extends to more than 2 classes (see the sketch below).
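A short illustration with a scikit-learn classifier (logistic regression here is just a stand-in for whichever classifier the lab uses):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, random_state=0)
clf = LogisticRegression().fit(X, y)

proba = clf.predict_proba(X[:3])  # shape (3, 2): one column per class
print(clf.classes_)               # e.g. [0 1]: column order of predict_proba
print(proba)                      # proba[:, 1] = P(y = classes_[1] | x)
```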
SLIDE 77

Gaussian Naive Bayes