Information Theory and Feature Selection
(Joint Informativeness and Tractability)
Leonidas Lefakis
Zalando Research Labs
Feature Construction

◮ Construction
  ◮ Principal Component Analysis (PCA)
  ◮ Linear Discriminant Analysis (LDA)
  ◮ Autoencoders (Neural Networks)
Feature Selection

◮ Selection
  ◮ Wrappers
  ◮ Embedded methods
  ◮ Filters
Feature Selection
◮ Selection
◮ Wrappers
Features are selected relative to the performance of a specific predictor. Example: RFE-SVM.
Feature Selection
◮ Selection
◮ Embedded Methods
Features are selected internally while optimizing the predictor. Example: Decision Trees.
Feature Selection
◮ Selection
◮ Filters
Features are assessed by a classifier-agnostic goodness-of-fit function Φ. Example: Correlation.
◮ Entropy
◮ Joint Entropy
◮ Conditional Entropy
◮ Relative Entropy (Kullback-Leibler Divergence)
◮ Mutual Information
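For reference, the standard discrete-case definitions of these quantities (the slide equations did not survive) are:

```latex
H(X) = -\sum_{x} p(x)\log p(x)
\qquad
H(X,Y) = -\sum_{x,y} p(x,y)\log p(x,y)
\qquad
H(Y \mid X) = H(X,Y) - H(X)
```

```latex
D_{\mathrm{KL}}(p \,\|\, q) = \sum_{x} p(x)\log\frac{p(x)}{q(x)}
\qquad
I(X;Y) = H(Y) - H(Y \mid X) = D_{\mathrm{KL}}\big(p(x,y)\,\|\,p(x)\,p(y)\big)
```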
Classification
Naive

The objective is to find the jointly most informative subset of size K:

S* = argmax_{S', |S'| = K} I(S'; Y)

The naive approach keeps the K features with the largest individual scores I(X_j; Y):

◮ Considers the Relevance of each variable individually
◮ Does not consider Redundancy
◮ Does not consider Joint Informativeness
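As a concrete illustration of the naive filter (a minimal sketch; the discrete-MI estimator and the toy data are illustrative, not the talk's implementation):

```python
# Naive filter: rank features by individual mutual information I(X_j; Y).
import math
from collections import Counter

def mutual_information(xs, ys):
    """I(X;Y) in nats for two equal-length sequences of discrete values."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def naive_select(features, labels, k):
    """Indices of the top-k features by I(X_j; Y) alone.
    Ignores redundancy and joint informativeness, as the slide notes."""
    scores = [mutual_information(col, labels) for col in features]
    return sorted(range(len(features)), key=lambda j: -scores[j])[:k]

# Toy data: feature 0 copies the label, feature 1 is constant (useless),
# feature 2 is weakly informative.
labels = [0, 0, 1, 1, 0, 1, 0, 1]
features = [labels[:], [1] * 8, [0, 1, 0, 1, 0, 1, 0, 1]]
print(naive_select(features, labels, 2))  # → [0, 2]
```

Note that if features 0 and 2 were duplicates of each other, this ranking would still pick both: exactly the redundancy failure the slide points out.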
Two Popular Solutions
◮ mRMR: "Feature Selection Based on Mutual Information: Criteria of Max-Dependency, Max-Relevance, and Min-Redundancy" (Peng et al.)
◮ CMIM: "Fast Binary Feature Selection with Conditional Mutual Information" (Fleuret)
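The greedy criteria these two methods optimize are commonly written as follows (standard forms; S denotes the already-selected set):

```latex
% mRMR: maximize relevance minus average redundancy
X^{*} = \arg\max_{X_j \in F \setminus S} \; I(X_j;Y) - \frac{1}{|S|}\sum_{X_i \in S} I(X_j;X_i)
```

```latex
% CMIM: a feature helps only if informative beyond every already-picked feature
X^{*} = \arg\max_{X_j \in F \setminus S} \; \min_{X_i \in S} \; I(X_j;Y \mid X_i)
```

Both avoid the naive method's redundancy failure, but each still measures interactions only pairwise, not jointly.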
Greedy Forward Selection

S_0 ← ∅
for k = 1, ..., K:
  X* ← argmax_{X_j ∈ F \ S_{k-1}} Φ(S_{k-1} ∪ {X_j})
  S_k ← S_{k-1} ∪ {X*}
How do we calculate I(·; ·)?

◮ Discretize
◮ Distributions approximated using Parzen windows
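One way to realize the Parzen-window option is a leave-one-out entropy estimate Ĥ = −(1/N) Σ_i log p̂(x_i); a minimal 1-D sketch (bandwidth h and the synthetic data are illustrative assumptions):

```python
# Parzen-window (KDE) entropy estimate in 1-D: H ≈ -(1/N) Σ_i log p̂(x_i),
# where p̂ is a Gaussian-kernel density estimate, leave-one-out at each x_i.
import numpy as np

def parzen_entropy(x, h=0.3):
    """Leave-one-out Parzen estimate of differential entropy (nats)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Pairwise kernel matrix K[i, j] = N(x_i - x_j; 0, h^2)
    d = x[:, None] - x[None, :]
    k = np.exp(-0.5 * (d / h) ** 2) / (h * np.sqrt(2 * np.pi))
    np.fill_diagonal(k, 0.0)        # leave-one-out: drop the self-term
    p = k.sum(axis=1) / (n - 1)     # density estimate at each sample
    return -np.mean(np.log(p))

rng = np.random.default_rng(0)
x = rng.standard_normal(2000)
# True differential entropy of N(0,1) is 0.5*log(2*pi*e) ≈ 1.4189 nats
print(parzen_entropy(x))
```

For I(X; Y) with discrete Y, the same estimator gives H(X) − Σ_y p(y) H(X | Y = y); the catch, taken up next, is that this scales badly once X is a joint vector of many features.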
Markov Blanket

The Markov blanket of a variable A (its parents, its children, and its children's other parents in a Bayesian network) renders A conditionally independent of all remaining variables. [Figure: Markov blanket of variable A; source: Wiki Commons]
Back to the objective:

S* = argmax_{S', |S'| = K} I(S'; Y)
◮ Differential Entropy
◮ Relative Entropy (Kullback-Leibler Divergence)
◮ Mutual Information
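For a d-dimensional Gaussian N(μ, Σ), which is what the remainder of the talk leans on, differential entropy has a closed form, and conditional entropies reduce to conditional variances:

```latex
h(X) = \tfrac{1}{2}\log\big((2\pi e)^{d}\det\Sigma\big),
\qquad
h(X_j \mid S) = \tfrac{1}{2}\log\big(2\pi e\,\sigma^2_{j\mid S}\big)
```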
For even moderately sized subsets S′, estimating the joint distributions needed to evaluate I(S′; Y) directly is intractable.
[Figure: Entropy vs. mean difference for f, f*, Disjoint, and GC approximations]
Gaussian Approximation

Return to S* = argmax_{S', |S'| = K} I(S'; Y), built greedily via S'_k ← S_{k-1} ∪ X_j.

Model each class-conditional distribution as a Gaussian, X | Y = y ~ N(μ_y, Σ_y): every entropy then reduces to the log-determinant of a covariance matrix.
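Under this model a subset score can be sketched as I(X_S; Y) ≈ H(X_S) − Σ_y p(y) H(X_S | Y = y), approximating the marginal of X_S by a single Gaussian as well. A minimal sketch (function and data names are illustrative, not the talk's implementation):

```python
# Gaussian approximation of I(X_S; Y): model X_S | Y=y as N(mu_y, Sigma_y),
# and approximate the marginal of X_S by one Gaussian, so that
#   I ≈ 0.5*logdet(Sigma_S) - sum_y p(y) * 0.5*logdet(Sigma_{S|y})   (nats)
import numpy as np

def gaussian_mi_score(X, y, subset):
    """Approximate I(X_subset; Y) under class-conditional Gaussians."""
    Xs = X[:, subset]
    _, logdet_marg = np.linalg.slogdet(np.cov(Xs, rowvar=False))
    score = 0.5 * logdet_marg
    for c in np.unique(y):
        Xc = Xs[y == c]
        p_c = len(Xc) / len(Xs)
        _, logdet_c = np.linalg.slogdet(np.cov(Xc, rowvar=False))
        score -= p_c * 0.5 * logdet_c
    return score

rng = np.random.default_rng(1)
y = rng.integers(0, 2, 1000)
X = rng.standard_normal((1000, 3))
X[:, 0] += 3.0 * y        # feature 0 is informative; features 1, 2 are noise
print(gaussian_mi_score(X, y, [0, 1]), gaussian_mi_score(X, y, [1, 2]))
```

The informative subset scores well above zero; the pure-noise subset scores near zero, since its marginal and class-conditional covariances coincide.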
Greedy selection under the Gaussian model: at round k each candidate X_j is scored through S'_k ← S_{k-1} ∪ X_j, which requires the cross-covariance vectors Σ_{j,S_{k-1}} between the candidate and the selected set.
By the chain rule,

I(S'_k; Y) = I(X_j; Y | S_{k-1}) + I(S_{k-1}; Y)

and since I(S_{k-1}; Y) is the same for every candidate, it cancels from the maximization:

argmax_{X_j ∈ F \ S_{k-1}} I(S'_k; Y) = argmax_{X_j ∈ F \ S_{k-1}} I(X_j; Y | S_{k-1})
Under the Gaussian model the conditional entropy of a candidate reduces (up to constants) to log σ²_{j|S_{k-1}}, with

σ²_{j|S_{k-1}} = σ²_j − Σᵀ_{j,S_{k-1}} Σ⁻¹_{S_{k-1}} Σ_{j,S_{k-1}}
Incremental Updates

When S_{n-1} = S_{n-2} ∪ {X_i}, set u = Σ⁻¹_{S_{n-2}} Σ_{i,S_{n-2}} and β = 1 / σ²_{i|S_{n-2}}. The bordered-inverse formula gives

Σ⁻¹_{S_{n-1}} = [ Σ⁻¹_{S_{n-2}} + β u uᵀ    −β u ]
                [ −β uᵀ                     β   ]

so the inverse never has to be recomputed from scratch. Plugging this block form into the quadratic form, the conditional variance of every remaining candidate X_j updates cheaply from the previous round:

σ²_{j|S_{n-1}} = σ²_{j|S_{n-2}} − β (uᵀ Σ_{j,S_{n-2}} − Σ_{ji})²
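This incremental conditional-variance update, σ²_{j|S∪{i}} = σ²_{j|S} − β (uᵀ Σ_{j,S} − Σ_{ji})² with u = Σ⁻¹_S Σ_{i,S} and β = 1/σ²_{i|S}, can be checked numerically. A sketch (assuming that reconstruction of the slide's formula; all names illustrative):

```python
# Compare the incremental conditional-variance update against a direct
# from-scratch computation on a random positive-definite covariance.
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 6))
Sigma = A @ A.T + 6 * np.eye(6)   # random positive-definite covariance

S = [0, 1, 2]                     # already-selected features
i, j = 3, 4                       # i: feature just added, j: candidate scored

def cond_var(Sigma, j, S):
    """sigma2_{j|S} = sigma2_j - Sigma_{j,S}^T Sigma_S^{-1} Sigma_{j,S}."""
    b = Sigma[np.ix_([j], S)].ravel()
    return Sigma[j, j] - b @ np.linalg.solve(Sigma[np.ix_(S, S)], b)

u = np.linalg.solve(Sigma[np.ix_(S, S)], Sigma[np.ix_([i], S)].ravel())
beta = 1.0 / cond_var(Sigma, i, S)
b_j = Sigma[np.ix_([j], S)].ravel()
updated = cond_var(Sigma, j, S) - beta * (u @ b_j - Sigma[j, i]) ** 2

direct = cond_var(Sigma, j, S + [i])
print(updated, direct)            # the two agree to machine precision
```

The update reuses u and β, which are computed once per round and shared by all remaining candidates, rather than re-solving a k × k system per candidate.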
Bounding Candidates

At round k each candidate's score again goes through S'_k ← S_{k-1} ∪ X_j and

σ²_{j|S_{k-1}} = σ²_j − Σᵀ_{j,S_{k-1}} Σ⁻¹_{S_{k-1}} Σ_{j,S_{k-1}}

Writing Σ⁻¹_{S_{k-1}} = U Λ Uᵀ, the quadratic form is sandwiched by the extreme eigenvalues:

‖Σ_{j,S_{k-1}}‖²₂ min_i λ_i  ≤  Σᵀ_{j,S_{k-1}} Σ⁻¹_{S_{k-1}} Σ_{j,S_{k-1}}  ≤  ‖Σ_{j,S_{k-1}}‖²₂ max_i λ_i

so cheap bounds on σ²_{j|S_{k-1}} are available from ‖Σ_{j,S_{k-1}}‖₂ alone.
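A quick numerical sanity check of that eigenvalue sandwich (illustrative sketch with a random positive-definite Σ):

```python
# b^T Sigma^{-1} b is sandwiched by ||b||^2 times the extreme eigenvalues of
# Sigma^{-1} (write Sigma^{-1} = U Lambda U^T and bound the diagonal Lambda).
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 5))
Sigma = A @ A.T + 5 * np.eye(5)   # random positive-definite matrix
b = rng.standard_normal(5)

Sigma_inv = np.linalg.inv(Sigma)
quad = b @ Sigma_inv @ b
lams = np.linalg.eigvalsh(Sigma_inv)
nb2 = b @ b
print(bool(nb2 * lams.min() <= quad <= nb2 * lams.max()))  # True
```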
Eigenvalue Updates

Adding a feature changes Σ_S by a rank-one term, so the spectrum need not be recomputed from scratch each round. If Σ' = Σ + ρ z zᵀ with Σ = U Λ Uᵀ (eigenvalues λ_1, ..., λ_n) and U'U'ᵀ = I, the new eigenvalues λ'_i are the roots of

∏_j (λ_j − λ) + ρ Σ_i z̃_i² ∏_{j≠i} (λ_j − λ) = 0,   where z̃ = Uᵀ z.

When z̃_i = 0 the two matrices share that eigenvalue (deflation); otherwise, for ρ > 0, the new eigenvalues interlace the old ones, λ_i < λ'_i < λ_{i+1}, which brackets each root for the solver.
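The interlacing claim is easy to verify numerically (sketch with a random symmetric matrix; numpy's `eigvalsh` stands in for the update routine):

```python
# Rank-one update with a positive term: eigenvalues of Sigma + z z^T interlace
# the old ones, lam_i <= lam'_i <= lam_{i+1}.
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((6, 6))
Sigma = A @ A.T                  # random symmetric positive semi-definite
z = rng.standard_normal(6)

old = np.linalg.eigvalsh(Sigma)                    # ascending order
new = np.linalg.eigvalsh(Sigma + np.outer(z, z))   # after the rank-one update

interlaced = all(old[i] <= new[i] + 1e-10 for i in range(6)) and \
             all(new[i] <= old[i + 1] + 1e-10 for i in range(5))
print(interlaced)  # → True
```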
[Figure: CPU time in secs vs. matrix size (200-2000), LAPACK from scratch vs. the update. Caption: Comparison between scratch and update]
[Figure: max |λ_update − λ_Lapack| / λ_Lapack (on the order of 10⁻¹⁰) vs. matrix size. Caption: Numerical stability (eigenvalues)]
Results

Test accuracy (%) with an SVMLin classifier on CIFAR, STL, and INRIA, selecting 10 / 50 / 100 features. Rows follow the source's ordering; a dash marks an entry missing in the source.

Method   | CIFAR 10 / 50 / 100   | STL 10 / 50 / 100     | INRIA 10 / 50 / 100
Fisher   | 25.19 / 39.47 / 48.12 | 26.09 / 34.63 / 38.02 | 92.55 / 94.03 / 94.68
FCBF     | 33.65 / 47.77 / 54.97 | 31.74 / 38.11 / 40.66 | 94.14 / 96.03 / 96.03
MRMR     | 27.94 / 37.78 / 43.63 | 28.26 / 31.16 / 33.12 | 86.03 / 86.77 / 86.72
SBMLR    | 30.43 / 51.41 / 56.81 | 32.29 / 43.29 / 47.15 | 85.92 / 88.57 / 88.64
tTest    | 25.69 / 40.17 / 45.12 | 26.72 / 36.23 / 39.14 | 80.01 / 87.64 / 89.23
InfoGain | 24.79 / 37.98 / 47.37 | 27.17 / 33.70 / 37.84 | 92.35 / 93.75 / 94.68
—        | 17.19 / 32.78 / 42.60 | 18.91 / 32.65 / 38.24 | 92.67 / 93.64 / 94.44
RelieFF  | 24.56 / 38.17 / 46.51 | 29.16 / 38.05 / 42.94 | 90.99 / 95.97 / 96.36
CFS      | 31.49 / 42.17 / 51.70 | 28.63 / 38.54 / 41.88 | 88.64 / 96.11 / 97.53
CMTF     | 21.10 / 40.39 / 47.71 | 27.61 / 38.99 / 42.32 | 79.09 / 89.49 / 93.01
BAHSIC   | —                     | —                     | —
GC.E     | 32.45 / 50.15 / 55.06 | 31.20 / 43.31 / 49.75 | 87.73 / 91.96 / 93.13
—        | 36.47 / 51.44 / 55.39 | 32.50 / 44.15 / 48.88 | 89.76 / 95.71 / 96.45
GKL.E    | 37.51 / 52.11 / 56.41 | 33.44 / 44.27 / 50.54 | 85.31 / 92.05 / 96.36
—        | 33.71 / 47.17 / 51.12 | 32.16 / 44.87 / 47.96 | 85.66 / 92.14 / 95.16
How reliable is the empirical covariance Σ̂ estimated from N samples? The estimate concentrates around Σ.²

²with probability at least 1 − 2e^{−ct²}

[Table/Figure: conditioning of Σ̂⁻¹Σ (equal to the identity when Σ̂ = Σ) for d = 2048 and various values of N]
[Figure: test accuracy vs. number of samples per class (500-2500), for 10, 25, and 50 features. Caption: Effect of sample size on performance when using the Gaussian Approximation for the CIFAR dataset.]