Machine Learning - MT 2017
10. Classification: Generative Models
Varun Kanade, University of Oxford
October 30, 2017

Recap: Supervised Learning - Regression
Discriminative model: linear model with Gaussian noise,
$$p(y \mid \mathbf{w}, \mathbf{x}) = \mathcal{N}(y \mid \mathbf{w}^\top \mathbf{x}, \sigma^2)$$
Features can be of different types:
◮ Categorical: $x_i \in \{1, \dots, K\}$
◮ Real-valued: $x_i \in \mathbb{R}$
Generative classification: apply Bayes' rule to classify a new point,
$$p(y = c \mid x_{\text{new}}, \theta) = \frac{p(y = c \mid \theta)\, p(x_{\text{new}} \mid y = c, \theta)}{\sum_{c'=1}^{C} p(y = c' \mid \theta)\, p(x_{\text{new}} \mid y = c', \theta)}$$
Class prior: $p(y = c \mid \theta) = \pi_c$, with $\pi_c \geq 0$ and $\sum_{c=1}^{C} \pi_c = 1$.
Naïve Bayes assumption: conditioned on the class, the $D$ features are independent,
$$p(x \mid y = c, \theta) = \prod_{j=1}^{D} p(x_j \mid y = c, \theta_{jc})$$
How to model $p(x_j \mid y = c, \theta_{jc})$ depends on the type of feature $x_j$:
◮ $x_j$ is real-valued, e.g., annual income: use a Gaussian model, so $\theta_{jc} = (\mu_{jc}, \sigma^2_{jc})$. Other distributions can be used, e.g., age is probably not Gaussian!
◮ $x_j$ is categorical with values in $\{1, \dots, K\}$: use the multinoulli distribution, i.e., $x_j = i$ with probability $\mu_{jc,i}$, where $\sum_{i=1}^{K} \mu_{jc,i} = 1$
◮ In the special case $x_j \in \{0, 1\}$, use a single (Bernoulli) parameter $\theta_{jc} \in [0, 1]$
Why make the naïve assumption? Consider modelling the full joint distribution of $D$ binary features:
◮ We have to assign a probability to each of the $2^D$ combinations
◮ Thus, we have $O(C \cdot 2^D)$ parameters!
◮ The 'naïve' assumption breaks the curse of dimensionality and avoids overfitting: only $O(C \cdot D)$ parameters are needed
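To make the saving concrete, a quick count (a sketch; the values $C = 2$ and $D = 20$ are illustrative, not from the lecture):

```python
# Parameters needed to model p(x | y = c) for D binary features.
C, D = 2, 20  # illustrative values

# Full joint: one probability per combination of feature values,
# minus 1 per class because each class's table must sum to 1.
full_joint = C * (2**D - 1)

# Naive Bayes: one Bernoulli parameter per feature per class.
naive_bayes = C * D

print(full_joint, naive_bayes)  # 2097150 vs 40
```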
Maximum likelihood estimation: we observe data $\{(x_i, y_i)\}_{i=1}^{N}$ drawn i.i.d. from some joint distribution. The log-likelihood of the naïve Bayes model decomposes as
$$\log p(\mathcal{D} \mid \theta) = \sum_{c=1}^{C} N_c \log \pi_c + \sum_{c=1}^{C} \sum_{j=1}^{D} \sum_{i : y_i = c} \log p(x_{ij} \mid \theta_{jc}),$$
where $N_c$ is the number of examples with label $c$, so that $\sum_{c=1}^{C} N_c = N$. The terms involving the priors $\pi_c$ and each $\theta_{jc}$ can therefore be maximised separately.
To estimate the priors, we maximise $\sum_{c=1}^{C} N_c \log \pi_c$ subject to the constraint $\sum_{c=1}^{C} \pi_c = 1$.
Digression: constrained optimisation via Lagrange multipliers. To maximise $f(z)$ subject to $g(z) = 0$, form the Lagrangian
$$\Lambda(z, \lambda) = f(z) + \lambda\, g(z)$$
and look for stationary points, where $\frac{\partial \Lambda(z,\lambda)}{\partial z} = 0$ and $\frac{\partial \Lambda(z,\lambda)}{\partial \lambda} = 0$; the latter condition simply recovers the constraint $g(z) = 0$.
Applying this to the class priors:
$$\Lambda(\pi, \lambda) = \sum_{c=1}^{C} N_c \log \pi_c + \lambda \Big( 1 - \sum_{c=1}^{C} \pi_c \Big)$$
Setting $\frac{\partial \Lambda(\pi,\lambda)}{\partial \pi_c} = \frac{N_c}{\pi_c} - \lambda = 0$ gives $\pi_c = N_c / \lambda$; setting $\frac{\partial \Lambda(\pi,\lambda)}{\partial \lambda} = 0$ recovers $\sum_{c=1}^{C} \pi_c = 1$. Summing $\pi_c = N_c / \lambda$ over $c$ gives $\lambda = \sum_{c=1}^{C} N_c = N$, hence
$$\hat{\pi}_c = \frac{N_c}{N}$$
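A minimal sketch of this estimator (the label array below is invented for illustration):

```python
import numpy as np

# MLE of the class priors: pi_hat_c = N_c / N, as derived above.
y = np.array([0, 0, 1, 2, 2, 2])    # toy labels; C = 3 classes

N = len(y)
N_c = np.bincount(y, minlength=3)   # per-class counts N_c
pi_hat = N_c / N                    # [1/3, 1/6, 1/2]
```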
The class-conditional parameters $\theta_{jc}$ are estimated analogously, each using only the examples of class $c$:
◮ Binary features: $\hat{\theta}_{jc} = N_{jc}/N_c$, the fraction of class-$c$ examples with $x_j = 1$
◮ Gaussian features: $\hat{\mu}_{jc} = \frac{1}{N_c} \sum_{i : y_i = c} x_{ij}$ and $\hat{\sigma}^2_{jc} = \frac{1}{N_c} \sum_{i : y_i = c} (x_{ij} - \hat{\mu}_{jc})^2$
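A small sketch of these per-feature MLEs (the data matrix is invented; column 0 is binary, column 1 real-valued):

```python
import numpy as np

# Fit naive Bayes class-conditional parameters from class-c data only.
X = np.array([[1, 2.0],
              [0, 3.0],
              [1, 10.0],
              [1, 12.0]])
y = np.array([0, 0, 1, 1])

params = {}
for c in (0, 1):
    Xc = X[y == c]
    theta_bin = Xc[:, 0].mean()             # Bernoulli MLE for x_0
    mu = Xc[:, 1].mean()                    # Gaussian mean for x_1
    var = ((Xc[:, 1] - mu) ** 2).mean()     # Gaussian variance (MLE, divides by N_c)
    params[c] = (theta_bin, mu, var)
```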
Prediction with naïve Bayes:
$$p(y = c \mid x_{\text{new}}, \theta) = \frac{p(y = c \mid \theta) \prod_{j=1}^{D} p(x_j \mid y = c, \theta_{jc})}{\sum_{c'=1}^{C} p(y = c' \mid \theta) \prod_{j=1}^{D} p(x_j \mid y = c', \theta_{jc'})}$$
If a feature is missing, say $x_1$, we simply drop it from the product:
$$p(y = c \mid x_{2:D}, \theta) = \frac{p(y = c \mid \theta) \prod_{j=2}^{D} p(x_j \mid y = c, \theta_{jc})}{\sum_{c'=1}^{C} p(y = c' \mid \theta) \prod_{j=2}^{D} p(x_j \mid y = c', \theta_{jc'})}$$
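The missing-feature rule is easy to implement: an unobserved feature simply contributes no factor to the product. A sketch with made-up Bernoulli parameters:

```python
import numpy as np

pi = np.array([0.5, 0.5])          # class priors (illustrative)
theta = np.array([[0.9, 0.2],      # theta[c, j] = p(x_j = 1 | y = c)
                  [0.1, 0.8]])     # all values invented for the example

def nb_posterior(x):
    """x: list of 0/1 observations, with None marking a missing feature."""
    scores = pi.copy()
    for c in range(len(pi)):
        for j, xj in enumerate(x):
            if xj is None:
                continue           # missing: drop this factor from the product
            scores[c] *= theta[c, j] if xj == 1 else 1 - theta[c, j]
    return scores / scores.sum()   # normalise over classes

full = nb_posterior([1, 0])        # both features observed
partial = nb_posterior([1, None])  # second feature missing
```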
Example: estimating the probability that a voter had voted in 2012 from survey data, where some survey responses (features) may be missing.
Gaussian discriminant analysis: model each class-conditional density as a multivariate Gaussian, $p(x \mid y = c, \theta) = \mathcal{N}(x \mid \mu_c, \Sigma_c)$, with class priors $\pi_c$ satisfying $\sum_c \pi_c = 1$.
The posterior over classes is then
$$p(y = c \mid x_{\text{new}}, \theta) = \frac{\pi_c\, |2\pi\Sigma_c|^{-1/2} \exp\!\big(-\tfrac{1}{2}(x - \mu_c)^\top \Sigma_c^{-1} (x - \mu_c)\big)}{\sum_{c'=1}^{C} \pi_{c'}\, |2\pi\Sigma_{c'}|^{-1/2} \exp\!\big(-\tfrac{1}{2}(x - \mu_{c'})^\top \Sigma_{c'}^{-1} (x - \mu_{c'})\big)}$$
With class-specific covariances $\Sigma_c$, the decision boundaries are quadratic in $x$ (QDA).
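This posterior can be evaluated directly; the helper below implements the $|2\pi\Sigma_c|^{-1/2}$ form of the density from the formula above (all parameter values are invented):

```python
import numpy as np

pi = np.array([0.5, 0.5])
mu = [np.zeros(2), np.array([3.0, 3.0])]
Sigma = [np.eye(2), 2.0 * np.eye(2)]       # class-specific covariances (QDA)

def gauss_density(x, m, S):
    # |2*pi*S|^{-1/2} * exp(-0.5 * (x-m)^T S^{-1} (x-m))
    d = x - m
    return np.exp(-0.5 * d @ np.linalg.solve(S, d)) / np.sqrt(np.linalg.det(2 * np.pi * S))

def qda_posterior(x):
    unnorm = pi * np.array([gauss_density(x, mu[c], Sigma[c]) for c in range(2)])
    return unnorm / unnorm.sum()

post = qda_posterior(np.array([0.0, 0.0]))  # a point at class 0's mean
```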
If all classes share the same covariance matrix, $\Sigma_c = \Sigma$, the quadratic term $x^\top \Sigma^{-1} x$ is common to every class and cancels, leaving a log-posterior that is linear in $x$:
$$\log\big(\pi_c\, p(x \mid y = c)\big) = \mu_c^\top \Sigma^{-1} x - \tfrac{1}{2} \mu_c^\top \Sigma^{-1} \mu_c + \log \pi_c + \text{const}$$
Writing $\beta_c = \Sigma^{-1} \mu_c$ and $\gamma_c = -\tfrac{1}{2} \mu_c^\top \Sigma^{-1} \mu_c + \log \pi_c$, we get
$$p(y = c \mid x, \theta) = \frac{e^{\beta_c^\top x + \gamma_c}}{\sum_{c'=1}^{C} e^{\beta_{c'}^\top x + \gamma_{c'}}} = \operatorname{softmax}\big([\beta_1^\top x + \gamma_1, \cdots, \beta_C^\top x + \gamma_C]\big)_c$$
This is linear discriminant analysis (LDA).
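One can check numerically that the softmax of the linear scores matches the full Gaussian posterior when the covariance is shared (all parameter values below are invented):

```python
import numpy as np

pi = np.array([0.3, 0.7])
mu = np.array([[0.0, 0.0],
               [2.0, 1.0]])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])        # shared covariance (LDA)
Sigma_inv = np.linalg.inv(Sigma)

beta = mu @ Sigma_inv                 # rows are beta_c = Sigma^{-1} mu_c
gamma = -0.5 * np.sum(mu @ Sigma_inv * mu, axis=1) + np.log(pi)

def softmax(z):
    e = np.exp(z - z.max())           # subtract max for numerical stability
    return e / e.sum()

x = np.array([1.0, -0.5])
post_linear = softmax(beta @ x + gamma)

# Same posterior via the Gaussian densities; the shared normaliser and the
# x^T Sigma^{-1} x term cancel in the ratio.
unnorm = pi * np.array([np.exp(-0.5 * (x - m) @ Sigma_inv @ (x - m)) for m in mu])
post_gauss = unnorm / unnorm.sum()
```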
For two classes ($y \in \{0, 1\}$):
$$p(y = 1 \mid x, \theta) = \frac{e^{\beta_1^\top x + \gamma_1}}{e^{\beta_1^\top x + \gamma_1} + e^{\beta_0^\top x + \gamma_0}} = \sigma\big((\beta_1 - \beta_0)^\top x + (\gamma_1 - \gamma_0)\big)$$
where $\sigma(t) = \frac{1}{1 + e^{-t}}$ is the sigmoid function.
[Figure: the sigmoid $\sigma(t)$ plotted for $t \in [-4, 4]$, rising from 0 towards 1]
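A quick numerical check of this softmax-to-sigmoid reduction (the two scores are arbitrary illustrative values):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

s0, s1 = 0.4, 1.9   # illustrative values of beta_0^T x + gamma_0, beta_1^T x + gamma_1
softmax_p1 = np.exp(s1) / (np.exp(s0) + np.exp(s1))
sigmoid_p1 = sigmoid(s1 - s0)   # equal to softmax_p1
```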
Maximum likelihood estimation: given data $\{(x_i, y_i)\}_{i=1}^{N}$, the MLEs for Gaussian discriminant analysis are:
$$\hat{\pi}_c = \frac{N_c}{N}, \qquad \hat{\mu}_c = \frac{1}{N_c} \sum_{i : y_i = c} x_i, \qquad \hat{\Sigma}_c = \frac{1}{N_c} \sum_{i : y_i = c} (x_i - \hat{\mu}_c)(x_i - \hat{\mu}_c)^\top$$
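A sketch of these estimators on a tiny invented dataset:

```python
import numpy as np

X = np.array([[0.0, 0.0], [2.0, 0.0],               # class 0
              [5.0, 5.0], [7.0, 5.0], [6.0, 8.0]])  # class 1
y = np.array([0, 0, 1, 1, 1])
N, C = len(y), 2

pi_hat, mu_hat, Sigma_hat = [], [], []
for c in range(C):
    Xc = X[y == c]
    Nc = len(Xc)
    pi_hat.append(Nc / N)                 # pi_hat_c = N_c / N
    m = Xc.mean(axis=0)                   # mu_hat_c
    mu_hat.append(m)
    Dc = Xc - m
    Sigma_hat.append(Dc.T @ Dc / Nc)      # Sigma_hat_c (MLE: divide by N_c)
```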
Avoiding overfitting:
◮ The number of parameters in the model is roughly $C \cdot D^2$
◮ In high dimensions this can lead to overfitting
◮ Use diagonal covariance matrices (basically naïve Bayes)
◮ Use weight tying, a.k.a. parameter sharing (LDA vs QDA)
◮ Bayesian approaches
◮ Use a discriminative classifier (+ regularise if needed)