CS 559: Machine Learning Fundamentals and Applications 3rd Set of Notes
Instructor: Philippos Mordohai Webpage: www.cs.stevens.edu/~mordohai E-mail: Philippos.Mordohai@stevens.edu Office: Lieb 215
Overview
– Making Decisions
Example: you are offered a bet on the toss of a fair coin. If you bet and win, you gain $100. If you bet and lose, you lose $200. If you don't bet, the cost to you is zero.
U(win, bet) = 100
U(lose, bet) = -200
U(win, no bet) = 0
U(lose, no bet) = 0
Expected utilities:
U(bet) = p(win)×U(win, bet) + p(lose)×U(lose, bet) = 0.5×100 − 0.5×200 = −50
U(no bet) = 0
If you want to maximize expected utility, you would therefore be advised not to bet.
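As a sanity check, here is a minimal Python sketch of this expected-utility calculation (the utilities and probabilities are exactly those of the example above):

```python
# Expected-utility calculation for the coin-toss bet example.
p_win, p_lose = 0.5, 0.5  # fair coin

# Utilities U(outcome, action) from the example
U = {
    ("win", "bet"): 100, ("lose", "bet"): -200,
    ("win", "no bet"): 0, ("lose", "no bet"): 0,
}

for action in ("bet", "no bet"):
    eu = p_win * U[("win", action)] + p_lose * U[("lose", action)]
    print(action, eu)  # bet -> -50.0, no bet -> 0.0

# Maximizing expected utility: choose "no bet", since 0 > -50.
```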
Adapted from: Duda, Hart and Stork, Pattern Classification textbook
Bayes' rule:

$$P(\omega_j \mid x) = \frac{p(x \mid \omega_j)\, P(\omega_j)}{p(x)}$$

posterior = (likelihood × prior) / evidence

where the evidence is

$$p(x) = \sum_j p(x \mid \omega_j)\, P(\omega_j)$$
From the Economist (2000)
The canonical example is to imagine that a precocious newborn observes his first sunset, and wonders whether the sun will rise again or not. He assigns equal prior probabilities to both possible outcomes, and represents this by placing one white and one black marble into a bag. The following day, when the sun rises, the child places another white marble in the bag. The probability that a marble plucked randomly from the bag will be white (i.e., the child's degree of belief in future sunrises) has thus gone from a half to two-thirds. After sunrise the next day, the child adds another white marble, and the probability (and thus the degree of belief) goes from two-thirds to three-quarters. And so on. Gradually, the initial belief that the sun is just as likely as not to rise each morning is modified to become a near-certainty that the sun will always rise.
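The marble bookkeeping is easy to reproduce in code. A minimal Python sketch (the resulting sequence 1/2, 2/3, 3/4, … is Laplace's rule of succession):

```python
# The newborn's bag: one white and one black marble
# (equal prior belief that the sun will / will not rise).
white, black = 1, 1

for day in range(1, 6):
    white += 1  # another sunrise observed: add a white marble
    belief = white / (white + black)  # P(next marble is white)
    print(f"after day {day}: belief in sunrise = {belief:.3f}")
# prints 0.667, 0.750, 0.800, 0.833, 0.857 -> approaches 1
```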
– ω = ω₁: the event that the next sample is from category 1
– P(ω₁) = prior probability of category 1
– P(ω₂) = prior probability of category 2
– P(ω₁) + P(ω₂) = 1 (either ω₁ or ω₂ must occur)
In the two-category case:

$$P(\omega_j \mid x) = \frac{p(x \mid \omega_j)\, P(\omega_j)}{p(x)}, \qquad p(x) = \sum_{j=1}^{2} p(x \mid \omega_j)\, P(\omega_j)$$
x is an observation for which:
– if P(ω₁ | x) > P(ω₂ | x), decide that the true state of nature is ω₁
– if P(ω₁ | x) < P(ω₂ | x), decide that the true state of nature is ω₂

Therefore, whenever we observe a particular x, the probability of error is:
– P(error | x) = P(ω₁ | x) if we decide ω₂
– P(error | x) = P(ω₂ | x) if we decide ω₁
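A minimal Python sketch of this minimum-error decision rule; the Gaussian class-conditional densities, priors, and test point below are illustrative assumptions, not taken from the slides:

```python
import math

def gauss_pdf(x, mu, var):
    """Univariate normal density N(mu, var)."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def posterior(x, likelihoods, priors):
    """P(w_j | x) via Bayes' rule; likelihoods is a list of p(x | w_j)."""
    joint = [lik(x) * pr for lik, pr in zip(likelihoods, priors)]
    evidence = sum(joint)  # p(x)
    return [j / evidence for j in joint]

# Illustrative class-conditional densities and priors
likelihoods = [lambda x: gauss_pdf(x, 0.0, 1.0), lambda x: gauss_pdf(x, 2.0, 1.0)]
priors = [0.6, 0.4]

post = posterior(1.2, likelihoods, priors)
decision = 1 if post[0] > post[1] else 2  # decide w1 or w2
p_error = min(post)                       # P(error | x) = the smaller posterior
print(decision, p_error)
```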
Conditional risk:

$$R(\alpha_i \mid x) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\, P(\omega_j \mid x)$$
Select the action αᵢ for which R(αᵢ | x) is minimum. The resulting overall risk R is then minimized; this minimum R is called the Bayes risk, the best performance that can be achieved.
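A minimal sketch of selecting the risk-minimizing action, assuming a given loss matrix λ and given posteriors (the numbers below are illustrative):

```python
import numpy as np

# loss[i, j] = lambda(alpha_i | w_j): loss for taking action i when true class is j
loss = np.array([[0.0, 2.0],
                 [1.0, 0.0]])            # illustrative asymmetric losses
posteriors = np.array([0.3, 0.7])        # P(w_1 | x), P(w_2 | x)

cond_risk = loss @ posteriors            # R(alpha_i | x) = sum_j loss[i,j] * P(w_j | x)
best_action = int(np.argmin(cond_risk))  # Bayes decision: minimize conditional risk
print(cond_risk, best_action)            # [1.4, 0.3] -> take alpha_2
```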
λᵢⱼ = λ(αᵢ | ωⱼ): the loss incurred for deciding ωᵢ when the true state of nature is ωⱼ.

$$R(\alpha_1 \mid x) = \lambda_{11} P(\omega_1 \mid x) + \lambda_{12} P(\omega_2 \mid x)$$
$$R(\alpha_2 \mid x) = \lambda_{21} P(\omega_1 \mid x) + \lambda_{22} P(\omega_2 \mid x)$$
Our rule is the following: if R(α₁ | x) < R(α₂ | x), take action α₁ (decide ω₁); otherwise take action α₂ (decide ω₂).
The preceding rule is equivalent to the following rule: if

$$(\lambda_{21} - \lambda_{11})\, P(\omega_1 \mid x) > (\lambda_{12} - \lambda_{22})\, P(\omega_2 \mid x)$$

then take action α₁ (decide ω₁); otherwise take action α₂ (decide ω₂).
In terms of the likelihood ratio: decide ω₁ if

$$\frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} > \frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}} \cdot \frac{P(\omega_2)}{P(\omega_1)}$$
Exercise: select the optimal decision where:
– Ω = {ω₁, ω₂}
– p(x | ω₁) ~ N(2, 0.5) (normal distribution)
– p(x | ω₂) ~ N(1.5, 0.2)
– P(ω₁) = 2/3, P(ω₂) = 1/3
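One way to check this exercise numerically; the sketch assumes the second parameter of N(·, ·) denotes the variance (if the slides intend standard deviations, change `var` accordingly):

```python
import math

def gauss_pdf(x, mu, var):
    """Univariate normal density N(mu, var), var = variance (assumed)."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

priors = (2 / 3, 1 / 3)
params = ((2.0, 0.5), (1.5, 0.2))  # (mean, variance) for w1, w2

def decide(x):
    # Decide w1 iff p(x|w1) P(w1) > p(x|w2) P(w2) (same as comparing posteriors)
    scores = [gauss_pdf(x, m, v) * p for (m, v), p in zip(params, priors)]
    return 1 if scores[0] > scores[1] else 2

for x in (1.0, 1.5, 2.0, 2.5):
    print(x, decide(x))
```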
If action αᵢ is taken and the true state of nature is ωⱼ, then the decision is correct if i = j and in error if i ≠ j.
To minimize the probability of error, use the zero-one loss function:

$$\lambda(\alpha_i, \omega_j) = \begin{cases} 0 & i = j \\ 1 & i \neq j \end{cases} \qquad i, j = 1, \ldots, c$$

Therefore, the conditional risk is the average probability of error:

$$R(\alpha_i \mid x) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\, P(\omega_j \mid x) = \sum_{j \neq i} P(\omega_j \mid x) = 1 - P(\omega_i \mid x)$$

Minimizing the risk is therefore equivalent to maximizing the posterior P(ωᵢ | x).
With the zero-one loss, the rule becomes: decide ω₁ if

$$\frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} > \frac{P(\omega_2)}{P(\omega_1)} = \theta_a$$

If misclassifying ω₂ as ω₁ is penalized more heavily (λ₁₂ > λ₂₁, with λ₁₁ = λ₂₂ = 0), the threshold rises to

$$\theta_b = \frac{\lambda_{12}\, P(\omega_2)}{\lambda_{21}\, P(\omega_1)}$$

and we decide ω₁ only if the likelihood ratio exceeds θ_b.
gᵢ(x) = P(ωᵢ | x), or equivalently (since monotonically increasing transformations preserve the decision) gᵢ(x) = ln p(x | ωᵢ) + ln P(ωᵢ) (ln: natural logarithm)
For the two-category case, a single dichotomizer suffices:

$$g(x) = g_1(x) - g_2(x) = \ln \frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} + \ln \frac{P(\omega_1)}{P(\omega_2)}$$

Decide ω₁ if g(x) > 0.
For normal densities p(x | ωᵢ) ~ N(μᵢ, Σᵢ), the discriminant functions are:

$$g_i(x) = -\frac{1}{2}(x - \mu_i)^t \Sigma_i^{-1}(x - \mu_i) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$$

Case 1: Σᵢ = σ²I (I is the identity matrix)
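A minimal numpy sketch that evaluates this discriminant for each class and picks the maximizer; all parameter values are illustrative:

```python
import numpy as np

def g(x, mu, sigma, prior):
    """Gaussian discriminant g_i(x) for p(x|w_i) ~ N(mu, sigma)."""
    d = len(mu)
    diff = x - mu
    return (-0.5 * diff @ np.linalg.inv(sigma) @ diff
            - 0.5 * d * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(sigma))
            + np.log(prior))

# Illustrative 2-class, 2-D problem
mus = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
sigmas = [np.eye(2), np.eye(2) * 2.0]
priors = [0.5, 0.5]

x = np.array([1.0, 1.5])
scores = [g(x, m, s, p) for m, s, p in zip(mus, sigmas, priors)]
print(int(np.argmax(scores)) + 1)  # decide the class with the largest g_i(x)
```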
Prove it!
For Case 1 (Σᵢ = σ²I) this reduces to a linear discriminant:

$$g_i(x) = -\frac{\|x - \mu_i\|^2}{2\sigma^2} + \ln P(\omega_i) = w_i^t x + w_{i0}, \qquad w_i = \frac{\mu_i}{\sigma^2}, \quad w_{i0} = -\frac{\mu_i^t \mu_i}{2\sigma^2} + \ln P(\omega_i)$$
The decision boundary gᵢ(x) = gⱼ(x) is a hyperplane:

$$w^t(x - x_0) = 0, \qquad w = \mu_i - \mu_j, \qquad x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\sigma^2}{\|\mu_i - \mu_j\|^2} \ln\frac{P(\omega_i)}{P(\omega_j)}\,(\mu_i - \mu_j)$$
Case 2: Σᵢ = Σ (common covariance). The boundary is again a hyperplane wᵗ(x − x₀) = 0, now with:

$$w = \Sigma^{-1}(\mu_i - \mu_j), \qquad x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\ln\left[P(\omega_i)/P(\omega_j)\right]}{(\mu_i - \mu_j)^t \Sigma^{-1} (\mu_i - \mu_j)}\,(\mu_i - \mu_j)$$
– Case 3: Σᵢ = arbitrary. The covariance matrices are different for each category. The decision surfaces are hyperquadrics: hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, and hyperhyperboloids.
The discriminant functions are quadratic:

$$g_i(x) = x^t W_i x + w_i^t x + w_{i0}$$

where:

$$W_i = -\frac{1}{2}\Sigma_i^{-1}, \qquad w_i = \Sigma_i^{-1}\mu_i, \qquad w_{i0} = -\frac{1}{2}\mu_i^t \Sigma_i^{-1}\mu_i - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$$
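A short numpy sketch computing these quadratic-discriminant coefficients; the parameters are illustrative:

```python
import numpy as np

def quadratic_coeffs(mu, sigma, prior):
    """Return (W_i, w_i, w_i0) for the case-3 quadratic discriminant."""
    inv = np.linalg.inv(sigma)
    W = -0.5 * inv
    w = inv @ mu
    w0 = (-0.5 * mu @ inv @ mu
          - 0.5 * np.log(np.linalg.det(sigma))
          + np.log(prior))
    return W, w, w0

mu = np.array([1.0, 2.0])                  # illustrative parameters
sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
W, w, w0 = quadratic_coeffs(mu, sigma, prior=0.5)

x = np.array([0.5, 1.5])
print(x @ W @ x + w @ x + w0)              # g_i(x)
```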
Discrete features: the components of x are binary or integer valued; x can take only one of m discrete values v₁, v₂, …, vₘ.

Consider the case of independent binary features in a two-category problem: x = (x₁, …, x_d)ᵗ, where each xᵢ is 0 or 1, with probabilities:
– pᵢ = P(xᵢ = 1 | ω₁)
– qᵢ = P(xᵢ = 1 | ω₂)
Assuming conditional independence of the features:

$$P(x \mid \omega_1) = \prod_{i=1}^{d} p_i^{x_i}(1 - p_i)^{1 - x_i}, \qquad P(x \mid \omega_2) = \prod_{i=1}^{d} q_i^{x_i}(1 - q_i)^{1 - x_i}$$

so the likelihood ratio is:

$$\frac{P(x \mid \omega_1)}{P(x \mid \omega_2)} = \prod_{i=1}^{d} \left(\frac{p_i}{q_i}\right)^{x_i}\left(\frac{1 - p_i}{1 - q_i}\right)^{1 - x_i}$$
Taking logarithms yields a linear discriminant; decide ω₁ if g(x) > 0:

$$g(x) = \sum_{i=1}^{d} x_i \ln\frac{p_i(1 - q_i)}{q_i(1 - p_i)} + \sum_{i=1}^{d} \ln\frac{1 - p_i}{1 - q_i} + \ln\frac{P(\omega_1)}{P(\omega_2)}$$
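A minimal sketch of this linear discriminant for independent binary features; the pᵢ, qᵢ, and priors are illustrative:

```python
import numpy as np

p = np.array([0.8, 0.3, 0.6])   # P(x_i = 1 | w1), illustrative
q = np.array([0.2, 0.5, 0.4])   # P(x_i = 1 | w2)
prior1, prior2 = 0.5, 0.5

# Weights and bias of the linear discriminant g(x)
w = np.log(p * (1 - q) / (q * (1 - p)))
w0 = np.log((1 - p) / (1 - q)).sum() + np.log(prior1 / prior2)

x = np.array([1, 0, 1])
g = w @ x + w0
print("decide w1" if g > 0 else "decide w2")
```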
Adapted from: Duda, Hart and Stork, Pattern Classification textbook
Training phase:
– Raw Data → x (Feature Extraction)
– Training Data {(x, y)} → f (Learning)

Testing phase:
– Raw Data → x (Feature Extraction)
– Test Data x → f(x) (Apply function, Evaluate error)
(C) Dhruv Batra
Frequentist interpretation:
– P(A) = lim_{N→∞} #(A is true) / N
– the limiting frequency of a repeating non-deterministic event

Bayesian interpretation:
– P(A) is your "belief" about A
– Maximum Likelihood estimation: view the parameters as fixed but unknown quantities; the best estimate is the one that maximizes the probability of obtaining the samples observed
– Bayesian estimation: view the parameters as random variables having some known distribution
– In either approach, we use the resulting P(ωᵢ | x) for our classification rule
p(x | ωⱼ) ~ N(μⱼ, Σⱼ): write p(x | ωⱼ) as p(x | ωⱼ, θⱼ), where:

$$\theta_j = (\mu_j, \Sigma_j) = \left(\mu_j^1, \mu_j^2, \ldots, \sigma_j^{11}, \sigma_j^{22}, \ldots\right)$$
Use the information provided by the training samples to estimate θ = (θ₁, θ₂, …, θ_c), where each θᵢ (i = 1, 2, …, c) is associated with one category. The ML estimate of θ is, by definition, the value θ̂ that maximizes p(D | θ): "it is the value of θ that best agrees with the actually observed training samples."
Assuming the n samples in D are drawn independently:

$$p(D \mid \theta) = \prod_{k=1}^{n} p(x_k \mid \theta)$$
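A small sketch of evaluating this likelihood (in log form, for numerical stability) on a grid of candidate means, assuming univariate Gaussian samples with known variance; all numbers are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.normal(loc=3.0, scale=1.0, size=100)   # samples, true mean = 3

def log_likelihood(D, mu, var=1.0):
    # l(theta) = ln p(D | theta) = sum_k ln p(x_k | theta)
    return np.sum(-0.5 * np.log(2 * np.pi * var) - (D - mu) ** 2 / (2 * var))

grid = np.linspace(0, 6, 601)
ll = [log_likelihood(D, mu) for mu in grid]
print(grid[int(np.argmax(ll))])   # close to the sample mean of D
```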
– Let θ = (θ₁, θ₂, …, θ_p)ᵗ and let ∇_θ be the gradient operator:

$$\nabla_\theta = \left(\frac{\partial}{\partial\theta_1}, \ldots, \frac{\partial}{\partial\theta_p}\right)^t$$

– We define l(θ) as the log-likelihood function: l(θ) = ln p(D | θ)
– New problem statement: determine the θ that maximizes the log-likelihood, θ̂ = arg max_θ l(θ)
Necessary conditions for an optimum:

$$\nabla_\theta l(\theta) = \sum_{k=1}^{n} \nabla_\theta \ln p(x_k \mid \theta) = 0$$
– p(xₖ | μ) ~ N(μ, Σ): the samples are drawn from a multivariate normal population; θ = μ, therefore:

$$\ln p(x_k \mid \mu) = -\frac{1}{2}\ln\left[(2\pi)^d |\Sigma|\right] - \frac{1}{2}(x_k - \mu)^t \Sigma^{-1}(x_k - \mu)$$

$$\nabla_\mu \ln p(x_k \mid \mu) = \Sigma^{-1}(x_k - \mu)$$

Setting the gradient of the log-likelihood to zero gives the ML estimate:

$$\hat{\mu} = \frac{1}{n}\sum_{k=1}^{n} x_k$$
Conclusion:
If p(xₖ | ωⱼ) (j = 1, 2, …, c) is assumed to be Gaussian in a d-dimensional feature space, then we can estimate the vector θ = (θ₁, θ₂, …, θ_c)ᵗ and perform optimal classification!
$$\hat{\mu} = \frac{1}{n}\sum_{k=1}^{n} x_k$$
Gaussian case, unknown μ and σ²: let θ = (θ₁, θ₂) = (μ, σ²). For a single sample xₖ:

$$\ln p(x_k \mid \theta) = -\frac{1}{2}\ln(2\pi\theta_2) - \frac{(x_k - \theta_1)^2}{2\theta_2}$$

$$\nabla_\theta l = \begin{pmatrix} \dfrac{\partial}{\partial\theta_1} \ln p(x_k \mid \theta) \\[4pt] \dfrac{\partial}{\partial\theta_2} \ln p(x_k \mid \theta) \end{pmatrix} = \begin{pmatrix} \dfrac{x_k - \theta_1}{\theta_2} \\[4pt] -\dfrac{1}{2\theta_2} + \dfrac{(x_k - \theta_1)^2}{2\theta_2^2} \end{pmatrix}$$
Summation: combining (1) and (2), one obtains the ML estimates:

$$\hat{\mu} = \frac{1}{n}\sum_{k=1}^{n} x_k, \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{k=1}^{n}(x_k - \hat{\mu})^2$$
The ML estimate of σ² is biased:

$$E\left[\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right] = \frac{n-1}{n}\sigma^2 \neq \sigma^2$$
An elementary unbiased estimator is the sample covariance matrix:

$$C = \frac{1}{n-1}\sum_{k=1}^{n}(x_k - \hat{\mu})(x_k - \hat{\mu})^t$$
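A numpy sketch contrasting the biased ML covariance estimate (divide by n) with the unbiased sample covariance (divide by n − 1), on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal(mean=[1.0, -2.0],
                            cov=[[2.0, 0.5], [0.5, 1.0]], size=500)
n = len(X)

mu_hat = X.mean(axis=0)          # ML estimate of the mean
diff = X - mu_hat
sigma_ml = diff.T @ diff / n     # ML (biased) covariance: divide by n
C = diff.T @ diff / (n - 1)      # unbiased sample covariance: divide by n - 1

print(mu_hat)
print(sigma_ml)  # equals (n-1)/n * C, slightly underestimates the covariance
print(C)
```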