Matthieu R Bloch Tuesday, January 21, 2020
BAYES AND NEAREST NEIGHBOR CLASSIFIERS
LOGISTICS

TAs and office hours:
- Monday: Mehrdad (TSRB 523a), 2:00pm-3:15pm
- Tuesday: TJ (VL C449 Cubicle D), 1:30pm-2:45pm
- Wednesday: Matthieu (TSRB 423), 12:00pm-1:15pm
- Thursday: Hossein (VL C449 Cubicle B), 10:45am-12:00pm
- Friday: Brighton (TSRB 523a), 12:00pm-1:15pm

Homework 1 posted on Canvas. Due Wednesday January 29, 2020 (11:59PM EST) (Wednesday February 5, 2020 for DL).
What is the best (smallest) risk that we can achieve? Assume that we actually know P_X and P_{Y|X}.
- Denote the a posteriori class probabilities of x ∈ X by η_k(x) ≜ P(Y = k | X = x)
- Denote the a priori class probabilities by π_k ≜ P(Y = k)

Lemma (Bayes classifier). The classifier h_B(x) ≜ argmax_{k∈[0;K−1]} η_k(x) is optimal, i.e., for any classifier h, we have R(h_B) ≤ R(h). Moreover, R(h_B) = E_X[1 − max_k η_k(X)].

Terminology:
- h_B is called the Bayes classifier
- R_B ≜ R(h_B) is called the Bayes risk
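As a quick numerical check of the lemma, here is a minimal sketch for a discrete toy problem (the joint distribution and the names `joint`, `eta`, `h_bayes` are hypothetical, chosen for illustration, not from the lecture):

```python
import numpy as np

# Hypothetical toy problem: X in {0, 1}, Y in {0, 1}.
# Rows index x, columns index y: joint[x, y] = P(X = x, Y = y).
joint = np.array([[0.4, 0.1],
                  [0.2, 0.3]])

p_x = joint.sum(axis=1)          # marginal P(X = x)
eta = joint / p_x[:, None]       # a posteriori probabilities eta_k(x) = P(Y = k | X = x)

# Bayes classifier: pick the class with the largest a posteriori probability.
h_bayes = eta.argmax(axis=1)     # h_B(x) for each x

# Bayes risk: R_B = E_X[1 - max_k eta_k(X)]
bayes_risk = (p_x * (1.0 - eta.max(axis=1))).sum()

print(h_bayes.tolist())  # [0, 1]
print(bayes_risk)        # 0.3
```

Any other deterministic rule on this toy problem (e.g., always predicting 0) can be checked to have risk at least 0.3, consistent with the lemma.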
The Bayes classifier can be rewritten as h_B(x) ≜ argmax_{k∈[0;K−1]} η_k(x) = argmax_{k∈[0;K−1]} π_k p_{X|Y}(x|k).
- For K = 2 (binary classification), this is a log-likelihood ratio test: log( p_{X|Y}(x|1) / p_{X|Y}(x|0) ) ≷ log( π_0 / π_1 )
- If all classes are equally likely (π_0 = π_1 = ⋯ = π_{K−1}), then h_B(x) ≜ argmax_{k∈[0;K−1]} p_{X|Y}(x|k)

Example (Bayes classifier). Assume X|Y=0 ∼ N(0, 1) and X|Y=1 ∼ N(1, 1). The Bayes risk for π_0 = π_1 = 1/2 is R(h_B) = Φ(−1/2), with Φ ≜ normal CDF.

In practice we do not know P_X and P_{Y|X}. Plugin methods: use the data to learn the distributions and plug the result into the Bayes classifier.
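The Gaussian example can be checked numerically. In this sketch the threshold at x = 1/2 follows from the log-likelihood ratio test with equal priors; the helper names `h_bayes` and `phi` are chosen here for illustration:

```python
import math
import numpy as np

rng = np.random.default_rng(0)

# Equally likely classes, X|Y=0 ~ N(0,1), X|Y=1 ~ N(1,1):
# the log-likelihood ratio test reduces to thresholding x at 1/2.
def h_bayes(x):
    return (x > 0.5).astype(int)

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

risk_theory = phi(-0.5)   # Bayes risk Phi(-1/2), about 0.3085

# Monte Carlo estimate of the risk of h_B for comparison.
n = 200_000
y = rng.integers(0, 2, size=n)
x = rng.normal(loc=y.astype(float), scale=1.0)
risk_mc = np.mean(h_bayes(x) != y)

print(risk_theory)        # ~0.3085
print(risk_mc)            # close to risk_theory
```

The Monte Carlo estimate should agree with Φ(−1/2) up to sampling error of order 1/√n.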
We have focused on the risk R(h) = P(h(X) ≠ Y) = E[1{h(X) ≠ Y}].

There are many situations in which this is not appropriate:
- Cost-sensitive classification: a false alarm and a missed detection may not be equivalent; weight the errors as c_0 1{h(X) ≠ 0 and Y = 0} + c_1 1{h(X) ≠ 1 and Y = 1}
- Unbalanced data set: the probability of the largest class will dominate

More to explore in the next homework!
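A cost-sensitive empirical risk can be sketched as follows (the cost values and the function name `cost_sensitive_risk` are hypothetical, chosen for illustration):

```python
import numpy as np

# Cost-sensitive empirical risk: a false alarm (predict nonzero when Y=0)
# costs c0, a miss (predict non-one when Y=1) costs c1.
def cost_sensitive_risk(y_pred, y_true, c0=1.0, c1=10.0):
    false_alarm = (y_pred != 0) & (y_true == 0)
    miss = (y_pred != 1) & (y_true == 1)
    return np.mean(c0 * false_alarm + c1 * miss)

y_true = np.array([0, 0, 1, 1])
y_pred = np.array([0, 1, 1, 0])
print(cost_sensitive_risk(y_pred, y_true))  # (1*1 + 10*1) / 4 = 2.75
```

With c_0 = c_1 = 1 this reduces to the usual empirical 0-1 risk; making c_1 large penalizes missed detections more heavily, which also helps when class 1 is rare.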
Back to our training dataset D ≜ {(x_1, y_1), ⋯, (x_N, y_N)}.

The nearest-neighbor (NN) classifier is h_NN(x) ≜ y_{NN(x)}, where NN(x) ≜ argmin_i ‖x_i − x‖.

Risk of the NN classifier conditioned on x and x_{NN(x)}:
R_NN(x, x_{NN(x)}) = Σ_k η_k(x_{NN(x)}) (1 − η_k(x)) = Σ_k η_k(x) (1 − η_k(x_{NN(x)})).

How well does the average risk R(h_NN) compare to the Bayes risk for large N?

Lemma. Let x, {x_i}_{i=1}^N be i.i.d. ∼ P_X in a separable metric space X. Let x_{NN(x)} be the nearest neighbor of x. Then x_{NN(x)} → x with probability one as N → ∞.

Theorem (Binary NN classifier). Let X be a separable metric space. Let p(x|y=0), p(x|y=1) be such that, with probability one, x is either a continuity point of p(x|y=0) and p(x|y=1), or a point of non-zero probability measure. Then, as N → ∞,
R(h_B) ≤ R(h_NN) ≤ 2 R(h_B) (1 − R(h_B)).
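The definition of h_NN translates directly into code. A minimal sketch with Euclidean distance and a hypothetical toy training set (the name `nn_classify` is chosen here):

```python
import numpy as np

# 1-NN classifier matching the definition above:
# h_NN(x) = y_{NN(x)} with NN(x) = argmin_i ||x_i - x||.
def nn_classify(x, X_train, y_train):
    dists = np.linalg.norm(X_train - x, axis=1)  # distances to all training points
    return y_train[np.argmin(dists)]             # label of the nearest one

# Hypothetical toy training set.
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
y_train = np.array([0, 1, 1])

print(nn_classify(np.array([0.2, 0.1]), X_train, y_train))  # 0
print(nn_classify(np.array([1.8, 1.9]), X_train, y_train))  # 1
```

This brute-force version costs O(N) distance computations per query; the scaling issue this causes for large N is discussed below.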
Can drive the risk of the NN classifier to the Bayes risk by increasing the size of the neighborhood: assign a label to x by taking a majority vote among the K nearest neighbors, yielding the classifier h_{K-NN}.

Theorem. lim_{N→∞} E[R(h_{K-NN})] ≤ (1 + √(2/K)) R(h_B).

Definition. Let ĥ_N be a classifier learned from a set of N data points. The classifier is consistent if E[R(ĥ_N)] → R_B as N → ∞.

Theorem (Stone's theorem). If N → ∞, K → ∞, and K/N → 0, then h_{K-NN} is consistent.

Choosing K is a problem of model selection:
- Do not choose K by minimizing the empirical risk on the training set: for 1-NN, R̂_N(h_{1-NN}) = (1/N) Σ_{i=1}^N 1{h_{1-NN}(x_i) ≠ y_i} = 0
- Need to rely on estimates from model selection techniques (more later!)
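A majority-vote K-NN and the training-risk pitfall can be sketched as follows (Euclidean distance and the toy data are assumptions for illustration; `knn_classify` is a name chosen here):

```python
import numpy as np
from collections import Counter

# K-NN by majority vote among the K nearest training points.
def knn_classify(x, X_train, y_train, k):
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]              # indices of the K closest points
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]

# Hypothetical 1-D training data.
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0, 0, 1, 1])

# Empirical training risk of 1-NN: each training point is its own nearest
# neighbor, so the empirical risk is identically zero -- minimizing training
# risk would always pick K = 1, which is why it is a useless criterion.
train_risk = np.mean([knn_classify(x, X_train, y_train, k=1) != y
                      for x, y in zip(X_train, y_train)])
print(train_risk)  # 0.0
```

This makes the point on the slide concrete: the training risk of 1-NN is zero by construction, regardless of how well the classifier generalizes, so K must be chosen by other means (e.g., validation).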
Given enough data, a K-NN classifier performs well. However:
- The number of samples N can be huge (especially in high dimension, where the required N typically grows exponentially with the dimension d)
- The choice of K matters a lot; model selection is important
- Finding the K nearest neighbors out of millions of data points is still computationally hard

We will discuss other classifiers that make more assumptions about the underlying data.