Machine Learning - MT 2016
7. Classification: Generative Models
Varun Kanade
University of Oxford
October 31, 2016

Announcements
◮ Practical 1 Submission
◮ Try to get signed off during the session itself
◮ Otherwise, do it in the next session
◮ Exception: Practical 4 (firm deadline: Friday of Week 8 at noon)
◮ Sheet 2 is due this Friday at 12pm
◮ Categorical: x_i ∈ {1, . . . , K}
◮ Real-valued: x_i ∈ R
To classify a new point x_new, apply Bayes rule:

p(y = c | x_new, θ) = p(y = c | θ) p(x_new | y = c, θ) / ∑_{c′=1}^C p(y = c′ | θ) p(x_new | y = c′, θ)
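Not from the slides: a minimal NumPy sketch of this computation, with made-up priors and class-conditional likelihood values, just to show how the normalization over classes works.

```python
import numpy as np

# Hypothetical class priors p(y = c | theta) for C = 3 classes
priors = np.array([0.5, 0.3, 0.2])

# Hypothetical class-conditional likelihoods p(x_new | y = c, theta)
# evaluated at a single new point x_new
likelihoods = np.array([0.02, 0.10, 0.05])

# Bayes rule: posterior is proportional to prior times likelihood,
# normalized by the sum over all classes c'
joint = priors * likelihoods
posterior = joint / joint.sum()

print(posterior)              # [0.2 0.6 0.2]
prediction = posterior.argmax()   # predict the most probable class
```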
◮ The class prior is p(y = c | θ) = π_c, where ∑_c π_c = 1
◮ Naïve Bayes assumption: the features are conditionally independent given the class, so p(x | y = c, θ) = ∏_{j=1}^D p(x_j | y = c, θ_jc)
◮ x_j is real-valued, e.g., annual income
◮ Example: use a Gaussian model, so θ_jc = (µ_jc, σ²_jc)
◮ Can use other distributions, e.g., age is probably not Gaussian!
◮ x_j is categorical with values in {1, . . . , K}
◮ Use the multinoulli distribution, i.e., x_j = i with probability µ_jc,i, where ∑_{i=1}^K µ_jc,i = 1
◮ In the special case when x_j ∈ {0, 1}, use a single (Bernoulli) parameter θ_jc ∈ [0, 1]
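As an illustration (not from the slides) of mixing feature models, here is a minimal sketch of the Naïve Bayes class-conditional log-density for one class, with one Gaussian feature (e.g. income) and one Bernoulli feature; all parameter values are made up.

```python
import numpy as np
from scipy.stats import norm, bernoulli

# Per-feature parameters theta_jc for a single class c (illustrative values)
mu_income, sigma_income = 30000.0, 8000.0   # Gaussian feature
theta_binary = 0.7                          # Bernoulli feature

def class_conditional_log_density(x_income, x_binary):
    # Naive Bayes: log p(x | y = c) is a sum of per-feature log densities
    return (norm.logpdf(x_income, loc=mu_income, scale=sigma_income)
            + bernoulli.logpmf(x_binary, theta_binary))

print(class_conditional_log_density(28000.0, 1))
```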
◮ Without the naïve assumption, for D binary features we have to assign a probability to each of the 2^D combinations
◮ Thus, we have O(C · 2^D) parameters!
◮ The ‘naïve’ assumption breaks the curse of dimensionality and avoids overfitting (see the sketch below)
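To make the parameter counting concrete, a small sketch; D and C here are arbitrary illustrative values, not from the lecture.

```python
# Parameter counting for binary features x in {0,1}^D and C classes
D, C = 20, 3

# Full joint model of p(x | y = c): one probability per combination of the
# D binary features, per class (minus 1 for normalization)
full_joint_params = C * (2 ** D - 1)

# Naive Bayes: one Bernoulli parameter theta_jc per feature and class
naive_bayes_params = C * D

print(full_joint_params)   # 3145725
print(naive_bayes_params)  # 60
```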
Maximum likelihood for Naïve Bayes

◮ Given data {(x_i, y_i)}_{i=1}^N drawn i.i.d. from some joint distribution
◮ Let N_c be the number of datapoints with y_i = c, so that ∑_{c=1}^C N_c = N
◮ The log-likelihood separates into a term for the class prior and one term per feature and class:

log p(D | π, θ) = ∑_{c=1}^C N_c log π_c + ∑_{c=1}^C ∑_{j=1}^D ∑_{i : y_i = c} log p(x_ij | θ_jc)

◮ Each term can be maximized separately; for the prior π we must respect the constraint ∑_{c=1}^C π_c = 1
Constrained optimization with Lagrange multipliers

◮ To maximize f(z) subject to a constraint g(z) = 0, introduce a multiplier λ and form the Lagrangian Λ(z, λ) = f(z) + λ g(z)
◮ Set ∂Λ(z,λ)/∂z = 0 and ∂Λ(z,λ)/∂λ = 0 and solve; the second equation recovers the constraint g(z) = 0
Maximizing over the class prior π, subject to ∑_{c=1}^C π_c = 1:

Λ(π, λ) = ∑_{c=1}^C N_c log π_c + λ (∑_{c=1}^C π_c − 1)

∂Λ(π,λ)/∂π_c = N_c/π_c + λ = 0   ⟹   π_c = −N_c/λ

∂Λ(π,λ)/∂λ = ∑_{c=1}^C π_c − 1 = 0   ⟹   λ = −∑_{c=1}^C N_c = −N

◮ Hence the maximum likelihood estimate is π̂_c = N_c / N
◮ The remaining parameters have a similar counting form: for binary features, θ̂_jc = N_jc / N_c, where N_jc = ∑_{i : y_i = c} x_ij is the number of class-c datapoints with feature j equal to 1 (and analogously for categorical features)
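A sketch of these counting estimates for Bernoulli Naïve Bayes; X and y below are randomly generated placeholders rather than real data.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, C = 200, 5, 3
X = rng.integers(0, 2, size=(N, D))   # binary features (placeholder data)
y = rng.integers(0, C, size=N)        # class labels (placeholder data)

# N_c: number of datapoints in each class; MLE of the class prior
N_c = np.bincount(y, minlength=C)
pi_hat = N_c / N                      # pi_c = N_c / N

# theta_jc: fraction of class-c datapoints with feature j equal to 1
theta_hat = np.zeros((C, D))
for c in range(C):
    theta_hat[c] = X[y == c].mean(axis=0)   # N_jc / N_c

assert np.isclose(pi_hat.sum(), 1.0)
```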
Generative model with Gaussian class-conditionals

◮ Class prior: p(y = c | θ) = π_c, with ∑_c π_c = 1
◮ Class-conditional density: p(x | y = c, θ) = N(x; µ_c, Σ_c), a multivariate Gaussian with mean µ_c and covariance Σ_c
The posterior over classes for a new point x_new is

p(y = c | x_new, θ) = p(y = c | θ) p(x_new | y = c, θ) / ∑_{c′=1}^C p(y = c′ | θ) p(x_new | y = c′, θ)

With Gaussian class-conditionals (writing x for x_new):

p(y = c | x, θ) = π_c |2πΣ_c|^{−1/2} exp(−½ (x − µ_c)^T Σ_c^{−1} (x − µ_c)) / ∑_{c′=1}^C π_{c′} |2πΣ_{c′}|^{−1/2} exp(−½ (x − µ_{c′})^T Σ_{c′}^{−1} (x − µ_{c′}))

∝ π_c |2πΣ_c|^{−1/2} exp(−½ (x − µ_c)^T Σ_c^{−1} (x − µ_c))
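A NumPy sketch of this posterior computation for Gaussian class-conditionals; the priors, means, and covariances below are illustrative placeholders, not values from the lecture.

```python
import numpy as np

def log_gaussian(x, mu, Sigma):
    """Log density of a multivariate Gaussian N(x; mu, Sigma)."""
    diff = x - mu
    _, logdet = np.linalg.slogdet(2 * np.pi * Sigma)   # log |2*pi*Sigma|
    quad = diff @ np.linalg.solve(Sigma, diff)         # (x-mu)^T Sigma^{-1} (x-mu)
    return -0.5 * (logdet + quad)

def gda_posterior(x, pis, mus, Sigmas):
    """p(y = c | x) under the generative Gaussian model, for all classes c."""
    log_joint = np.array([np.log(pi) + log_gaussian(x, mu, S)
                          for pi, mu, S in zip(pis, mus, Sigmas)])
    # Normalize in log space for numerical stability
    log_joint -= log_joint.max()
    p = np.exp(log_joint)
    return p / p.sum()

# Illustrative two-class example in D = 2 dimensions
pis = [0.6, 0.4]
mus = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
Sigmas = [np.eye(2), np.eye(2)]
print(gda_posterior(np.array([1.0, 1.0]), pis, mus, Sigmas))  # [0.6 0.4]
```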
Linear discriminant analysis: when all classes share the same covariance matrix Σ, the quadratic term −½ x^T Σ^{−1} x is common to every class and cancels in the posterior, leaving

log (π_c p(x | y = c, θ)) = µ_c^T Σ^{−1} x − ½ µ_c^T Σ^{−1} µ_c + log π_c + const

Define β_c = Σ^{−1} µ_c and γ_c = −½ µ_c^T Σ^{−1} µ_c + log π_c. Then

p(y = c | x, θ) = exp(β_c^T x + γ_c) / ∑_{c′=1}^C exp(β_{c′}^T x + γ_{c′})

i.e., the posterior is the softmax of the linear functions [β_1^T x + γ_1, · · · , β_C^T x + γ_C].
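A sketch of the shared-covariance case: the posterior reduces to a softmax of linear functions of x. The parameter values below are placeholders chosen for illustration.

```python
import numpy as np

def lda_posterior(x, pis, mus, Sigma):
    """Posterior p(y = c | x) when all classes share the covariance Sigma."""
    Sigma_inv = np.linalg.inv(Sigma)
    betas = [Sigma_inv @ mu for mu in mus]                   # beta_c = Sigma^{-1} mu_c
    gammas = [-0.5 * mu @ Sigma_inv @ mu + np.log(pi)        # gamma_c
              for mu, pi in zip(mus, pis)]
    scores = np.array([b @ x + g for b, g in zip(betas, gammas)])
    # Softmax over the linear discriminants
    scores -= scores.max()
    p = np.exp(scores)
    return p / p.sum()

# Illustrative example with C = 3 classes in D = 2 dimensions
pis = [0.3, 0.3, 0.4]
mus = [np.array([0.0, 0.0]), np.array([2.0, 0.0]), np.array([0.0, 2.0])]
Sigma = np.eye(2)
print(lda_posterior(np.array([1.0, 1.0]), pis, mus, Sigma))
```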
In the two-class case (classes 0 and 1):

p(y = 1 | x, θ) = exp(β_1^T x + γ_1) / (exp(β_1^T x + γ_1) + exp(β_0^T x + γ_0))
               = σ((β_1 − β_0)^T x + (γ_1 − γ_0))

where σ(t) = 1/(1 + e^{−t}) is the sigmoid function.

[Figure: the sigmoid function σ(t), plotted for t ∈ [−4, 4]]
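A small numerical check (not from the slides) that the two-class softmax equals a sigmoid of the difference of the two linear scores; the β's and γ's are arbitrary illustrative values.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

beta0, gamma0 = np.array([1.0, -2.0]), 0.5    # class 0 (illustrative)
beta1, gamma1 = np.array([0.5, 1.5]), -1.0    # class 1 (illustrative)
x = np.array([0.3, 0.7])

s0, s1 = beta0 @ x + gamma0, beta1 @ x + gamma1
softmax_p1 = np.exp(s1) / (np.exp(s0) + np.exp(s1))
sigmoid_p1 = sigmoid((beta1 - beta0) @ x + (gamma1 - gamma0))

assert np.isclose(softmax_p1, sigmoid_p1)
```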
Maximum likelihood for the Gaussian model: write the log-likelihood of the data {(x_i, y_i)}_{i=1}^N as:

log p(D | θ) = ∑_{c=1}^C N_c log π_c + ∑_{c=1}^C ∑_{i : y_i = c} log N(x_i; µ_c, Σ_c)

As before, the MLE of the class prior is π̂_c = N_c / N. For the other parameters, it is the empirical mean and covariance of the datapoints in each class:

µ̂_c = (1/N_c) ∑_{i : y_i = c} x_i,        Σ̂_c = (1/N_c) ∑_{i : y_i = c} (x_i − µ̂_c)(x_i − µ̂_c)^T
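A sketch of these estimates in NumPy; X and y are randomly generated placeholder data, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, C = 300, 2, 3
y = rng.integers(0, C, size=N)                 # placeholder labels
X = rng.normal(size=(N, D)) + y[:, None]       # placeholder features

pi_hat, mu_hat, Sigma_hat = [], [], []
for c in range(C):
    Xc = X[y == c]
    N_c = len(Xc)
    pi_hat.append(N_c / N)                     # pi_c = N_c / N
    mu_c = Xc.mean(axis=0)                     # empirical mean of class c
    mu_hat.append(mu_c)
    diff = Xc - mu_c
    Sigma_hat.append(diff.T @ diff / N_c)      # empirical covariance of class c
```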
◮ The number of parameters in the model is roughly C · D²
◮ In high dimensions this can lead to overfitting; possible remedies (sketched below):
◮ Use diagonal covariance matrices (basically Naïve Bayes)
◮ Use weight tying, a.k.a. parameter sharing (LDA vs QDA)
◮ Bayesian approaches
◮ Use a discriminative classifier (+ regularize if needed)
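Two of these remedies are easy to express in code: keep only the diagonal of each per-class covariance (essentially Gaussian Naïve Bayes), or tie the covariances by pooling across classes (LDA). A self-contained sketch with placeholder data:

```python
import numpy as np

rng = np.random.default_rng(2)
N, D, C = 300, 4, 3
y = rng.integers(0, C, size=N)
X = rng.normal(size=(N, D)) + y[:, None]       # placeholder data

# Per-class (QDA-style) covariance estimates: roughly C * D^2 parameters
Sigmas, counts = [], []
for c in range(C):
    Xc = X[y == c]
    diff = Xc - Xc.mean(axis=0)
    Sigmas.append(diff.T @ diff / len(Xc))
    counts.append(len(Xc))

# Remedy 1: diagonal covariances (basically Naive Bayes): D parameters per class
diag_Sigmas = [np.diag(np.diag(S)) for S in Sigmas]

# Remedy 2: parameter tying (LDA): one covariance shared by all classes,
# obtained by pooling the per-class estimates weighted by class counts
Sigma_shared = sum(n * S for n, S in zip(counts, Sigmas)) / N
```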