
slide-1
SLIDE 1

Information Theory and Feature Selection

(Joint Informativeness and Tractability)

Leonidas Lefakis

Zalando Research Labs

1 / 66

slide-2
SLIDE 2

Dimensionality Reduction

Feature Construction

◮ Construction

X1, . . . , XD → f1(X1, . . . , XD), . . . , fk(X1, . . . , XD)

2 / 66

slide-3
SLIDE 3

Dimensionality Reduction

Feature Construction

◮ Construction

X1, . . . , XD → f1(X1, . . . , XD), . . . , fk(X1, . . . , XD)

Examples

◮ Principal Component Analysis (PCA)
◮ Linear Discriminant Analysis (LDA)
◮ Autoencoders (Neural Networks)

2 / 66

slide-4
SLIDE 4

Dimensionality Reduction

Feature Selection

◮ Selection

X1, . . . , XD → Xs1, . . . , Xsk

Approaches

◮ Wrappers
◮ Embedded methods
◮ Filters

3 / 66

slide-5
SLIDE 5

Dimensionality Reduction

Feature Selection

◮ Selection

X1, . . . , XD → Xs1, . . . , Xsk

◮ Wrappers

Features are selected relative to the performance of a specific predictor. Example: RFE-SVM.

4 / 66

slide-6
SLIDE 6

Dimensionality Reduction

Feature Selection

◮ Selection

X1, . . . , XD → Xs1, . . . , Xsk

◮ Embedded Methods

Features are selected internally while optimizing the predictor. Example: Decision Trees.

5 / 66

slide-7
SLIDE 7

Dimensionality Reduction

Feature Selection

◮ Selection

X1, . . . , XD → Xs1, . . . , Xsk

◮ Filters

Features are assessed based on a goodness-of-fit function Φ that is classifier-agnostic. Example: Correlation.

6 / 66

slide-8
SLIDE 8

Information Theory

◮ Entropy

H(X) = − Σ_{x∈X} p(x) log p(x)

◮ Joint Entropy

H(X, Y) = − Σ_{x∈X} Σ_{y∈Y} p(x, y) log p(x, y)

◮ Conditional Entropy

H(Y|X) = Σ_{x∈X} p(x) H(Y|X = x)

7 / 66
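A minimal numpy sketch of these plug-in quantities on a toy discrete joint table (the table pxy and the helper names are illustrative assumptions, not from the deck):

```python
# Sketch, assuming a small discrete joint law given as a table pxy[x, y].
import numpy as np

def entropy(p):
    """H = -sum p log p over the non-zero cells of a probability vector."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def joint_entropy(pxy):
    """H(X, Y) from the joint table."""
    return entropy(pxy.ravel())

def conditional_entropy(pxy):
    """H(Y | X) = sum_x p(x) H(Y | X = x)."""
    px = pxy.sum(axis=1)
    return sum(px[i] * entropy(pxy[i] / px[i]) for i in range(len(px)) if px[i] > 0)

pxy = np.array([[0.3, 0.1],
                [0.2, 0.4]])              # toy joint law over X, Y in {0, 1}
print(entropy(pxy.sum(axis=1)))           # H(X)
print(joint_entropy(pxy))                 # H(X, Y)
print(conditional_entropy(pxy))           # H(Y | X) = H(X, Y) - H(X)
```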

slide-9
SLIDE 9

Information Theory

◮ Relative Entropy (Kullback–Leibler Divergence)

D(p‖q) = Σ_{x∈X} p(x) log [ p(x) / q(x) ]

◮ Mutual Information

I(X; Y) = Σ_{x∈X} Σ_{y∈Y} p(x, y) log [ p(x, y) / (p(x) p(y)) ] = D_KL( p(x, y) ‖ p(x) p(y) )

8 / 66

slide-10
SLIDE 10

Mutual Information

I(X; Y) = Σ_{x∈X} Σ_{y∈Y} p(x, y) log [ p(x, y) / (p(x) p(y)) ] = D_KL( p(x, y) ‖ p(x) p(y) )

I(X; Y) = H(Y) − H(Y|X): the reduction in uncertainty of Y if X is known.

X ⊥⊥ Y ⇒ I(X; Y) = 0
Y = f(X) ⇒ I(X; Y) = H(Y)

9 / 66
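The same toy setting can be used to check the decomposition and the two limiting cases; again a minimal sketch with illustrative names, not part of the original slides:

```python
# Sketch: I(X;Y) from a joint table, checked against H(Y) - H(Y|X).
import numpy as np

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def mutual_information(pxy):
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return np.sum(pxy[mask] * np.log(pxy[mask] / (px @ py)[mask]))

pxy = np.array([[0.3, 0.1],
                [0.2, 0.4]])
H_Y_given_X = H(pxy.ravel()) - H(pxy.sum(axis=1))             # H(Y|X) = H(X,Y) - H(X)
print(mutual_information(pxy), H(pxy.sum(axis=0)) - H_Y_given_X)   # equal

# X independent of Y -> I = 0;  Y a deterministic function of X -> I = H(Y)
print(mutual_information(np.outer([0.5, 0.5], [0.6, 0.4])))        # ~ 0
print(mutual_information(np.array([[0.5, 0.0], [0.0, 0.5]])))      # = log 2
```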

slide-11
SLIDE 11

Feature Selection

Classification

We will look into feature selection in the context of classification. Given real-valued features X1, . . . , XD and a class variable Y, we wish to select a subset S of size K ≪ D such that a predictor f : R^K → Y trained in this projected subspace generalizes well.

10 / 66

slide-12
SLIDE 12

Entropy & Prediction

Given (X, Y) ∈ R^D × {1, . . . , C} and a predictor f(X) = Ŷ, we define the error variable

E = 0 if Ŷ = Y,   E = 1 if Ŷ ≠ Y

11 / 66

slide-13
SLIDE 13

Entropy & Prediction

H(E, Y | Ŷ) = H(Y | Ŷ) + H(E | Y, Ŷ)      (by (1); the second term is 0)

            = H(E | Ŷ) + H(Y | E, Ŷ)      (by (1); H(E | Ŷ) ≤ H(E))

(1)  H(A, B) = H(A) + H(B|A)

12 / 66

slide-14
SLIDE 14

Entropy & Prediction

H(Y | Ŷ) = H(E, Y | Ŷ) ≤ H(E) + H(Y | E, Ŷ) ≤ 1 + Pe log(|Y| − 1)

H(E) = H(B(1, Pe)) ≤ H(B(1, 1/2)) = 1

H(Y | E, Ŷ) = (1 − Pe) H(Y | E = 0, Ŷ) + Pe H(Y | E = 1, Ŷ)
              with H(Y | E = 0, Ŷ) = 0 and H(Y | E = 1, Ŷ) ≤ log(|Y| − 1)

13 / 66

slide-15
SLIDE 15

Fano’s Inequality

H(Y | Ŷ) ≤ 1 + Pe log(|Y| − 1)

⇒ H(Y) − I(Y; Ŷ) ≤ 1 + Pe log(|Y| − 1)        (by (2))

⇒ Pe ≥ [ H(Y) − I(Y; Ŷ) − 1 ] / log(|Y| − 1)

⇒ Pe ≥ [ H(Y) − 1 − I(X; Y) ] / log(|Y| − 1)   (by (3))

(2)  I(A; B) = H(A) − H(A|B)
(3)  Y → X → Z ⇒ I(Y; X) ≥ I(Y; Z)

14 / 66
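As a quick numerical illustration of the bound (a sketch with made-up numbers; entropies in bits so that H(E) ≤ 1 holds):

```python
# Sketch: Fano lower bound Pe >= (H(Y) - 1 - I(X;Y)) / log2(|Y| - 1), in bits.
import numpy as np

def fano_lower_bound(H_Y, I_XY, n_classes):
    """Lower bound on the error of any predictor that only sees X."""
    return (H_Y - 1.0 - I_XY) / np.log2(n_classes - 1)

# 16 equiprobable classes (H(Y) = 4 bits). If the selected features carry only
# 1.5 bits about Y, no classifier trained on them can do better than ~38% error.
print(fano_lower_bound(H_Y=4.0, I_XY=1.5, n_classes=16))
```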


slide-17
SLIDE 17

Data Processing Inequality

H(Y | Ŷ) ≤ 1 + Pe log(|Y| − 1)

⇒ H(Y) − I(Y; Ŷ) ≤ 1 + Pe log(|Y| − 1)        (by (2))

⇒ Pe ≥ [ H(Y) − I(Y; Ŷ) − 1 ] / log(|Y| − 1)

⇒ Pe ≥ [ H(Y) − 1 − I(X; Y) ] / log(|Y| − 1)   (by (3))

(2)  I(A; B) = H(A) − H(A|B)
(3)  Y → X → Z ⇒ I(Y; X) ≥ I(Y; Z)

14 / 66


slide-19
SLIDE 19

Objective

S = argmax_{S′, |S′|=K} I(X_S′(1), X_S′(2), . . . , X_S′(K); Y)

15 / 66

slide-20
SLIDE 20

Objective

S = argmax_{S′, |S′|=K} I(X_S′(1), X_S′(2), . . . , X_S′(K); Y)

NP-HARD

15 / 66

slide-21
SLIDE 21

Feature Selection

Naive

S = argmax_{S′, |S′|=K} Σ_{k=1}^{K} I(X_S′(k); Y)

◮ Considers the Relevance of each variable individually
◮ Does not consider Redundancy
◮ Does not consider Joint Informativeness

16 / 66
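A sketch of this naive criterion on discretized features with a plug-in MI estimate; the toy data at the end shows the redundancy failure (two identical copies of an informative feature are both kept). The estimator and names are assumptions, not from the deck:

```python
# Sketch: rank features by marginal I(Xd; Y) and keep the top K.
import numpy as np

def mi_discrete(x, y):
    """Plug-in estimate of I(X; Y) for discrete vectors x, y."""
    mi = 0.0
    for a in np.unique(x):
        for b in np.unique(y):
            pxy = np.mean((x == a) & (y == b))
            if pxy > 0:
                mi += pxy * np.log(pxy / (np.mean(x == a) * np.mean(y == b)))
    return mi

def naive_selection(X, y, K):
    """X: (N, D) discretized features. Ignores redundancy entirely."""
    scores = np.array([mi_discrete(X[:, d], y) for d in range(X.shape[1])])
    return np.argsort(scores)[::-1][:K]

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 500)
X = np.column_stack([y, y, rng.integers(0, 2, 500)])   # two identical copies of y
print(naive_selection(X, y, K=2))                      # picks both redundant copies
```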

slide-22
SLIDE 22

Feature Selection

Two Popular Solutions

◮ mRMR: "Feature Selection Based on Mutual Information: Criteria of Max-Dependency, Max-Relevance, and Min-Redundancy", H. Peng et al.

◮ CMIM: "Fast Binary Feature Selection with Conditional Mutual Information", F. Fleuret

17 / 66

slide-23
SLIDE 23

mRMR

Relevance:   D(S, Y) = (1/|S|) Σ_{Xd∈S} I(Xd; Y)

Redundancy:  R(S) = (1/|S|²) Σ_{Xd∈S} Σ_{Xl∈S} I(Xd; Xl)

mRMR:        Φ(S, Y) = D(S, Y) − R(S)

18 / 66
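A sketch of the mRMR score of a candidate subset, reusing the same kind of plug-in MI estimate (an illustrative assumption; how MI is actually estimated is discussed on the following slides):

```python
# Sketch: Phi(S, Y) = D(S, Y) - R(S) for the columns of a candidate subset.
import numpy as np

def mi_discrete(x, y):
    mi = 0.0
    for a in np.unique(x):
        for b in np.unique(y):
            pxy = np.mean((x == a) & (y == b))
            if pxy > 0:
                mi += pxy * np.log(pxy / (np.mean(x == a) * np.mean(y == b)))
    return mi

def mrmr_score(X_S, y):
    """X_S: (N, |S|) matrix holding the candidate subset S."""
    k = X_S.shape[1]
    relevance = np.mean([mi_discrete(X_S[:, d], y) for d in range(k)])
    redundancy = np.mean([mi_discrete(X_S[:, d], X_S[:, l])
                          for d in range(k) for l in range(k)])
    return relevance - redundancy
```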

slide-24
SLIDE 24

Forward Selection

S0 ← ∅
for k = 1 . . . K do
    z∗ ← 0
    for Xj ∈ F \ Sk−1 do
        S′ ← Sk−1 ∪ {Xj}
        z ← Φ(S′, Y)
        if z > z∗ then
            z∗ ← z
            S∗ ← S′
        end if
    end for
    Sk ← S∗
end for
return SK

19 / 66
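A direct Python transcription of this greedy loop (a sketch; the set-scoring function phi is passed in, e.g. the mRMR score above):

```python
# Sketch: greedy forward selection with a pluggable set-scoring function phi.
def forward_selection(n_features, y, X, K, phi):
    """phi(cols, y, X) -> score of the subset given by column indices cols."""
    selected = []
    for _ in range(K):
        best_score, best_j = float("-inf"), None
        for j in range(n_features):
            if j in selected:
                continue
            score = phi(selected + [j], y, X)
            if score > best_score:
                best_score, best_j = score, j
        selected.append(best_j)
    return selected

# e.g.: forward_selection(X.shape[1], y, X, K=10,
#                         phi=lambda cols, y, X: mrmr_score(X[:, cols], y))
```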

slide-25
SLIDE 25

mRMR

How do we calculate I(·; ·)?

◮ Discretize
◮ Distributions approximated using Parzen windows

p̂(x) = (1/N) Σ_{i=1}^{N} δ(x − x(i), h),   δ(dx, h) = (1/√((2π)^D)) exp( −dxᵀ Σ⁻¹ dx / (2h²) )

◮ Parametric model

20 / 66
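A sketch of a Parzen-window estimate with an isotropic Gaussian kernel (a simplification of the Σ-weighted kernel on the slide; all names are illustrative):

```python
# Sketch: p_hat(x) = (1/N) sum_i K_h(x - x_i) with an isotropic Gaussian kernel.
import numpy as np

def parzen_density(x, samples, h):
    d = x - samples                                   # (N, D) differences
    D = samples.shape[1]
    norm = (2.0 * np.pi * h * h) ** (-D / 2.0)
    return np.mean(norm * np.exp(-np.sum(d * d, axis=1) / (2.0 * h * h)))

rng = np.random.default_rng(0)
samples = rng.normal(size=(1000, 2))                  # toy data ~ N(0, I)
print(parzen_density(np.zeros(2), samples, h=0.3))    # close to 1 / (2*pi) ~ 0.159
```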

slide-26
SLIDE 26

CMIM (Binary Features)

Markov Blanket

[Figure: Markov Blanket of variable A (source: Wiki Commons)]

21 / 66

slide-27
SLIDE 27

CMIM (Binary Features)

Markov Blanket

For a set M of variables that form a Markov Blanket of Xi we have

p(F \ {Xi, M}, Y | Xi, M) = p(F \ {Xi, M}, Y | M)

I(F \ {Xi, M}, Y ; Xi | M) = 0

22 / 66

slide-28
SLIDE 28

CMIM (Binary Features)

Markov Blanket

For a set M of variables that form a Markov Blanket of Xi we have

p(F \ {Xi, M}, Y | Xi, M) = p(F \ {Xi, M}, Y | M)

I(F \ {Xi, M}, Y ; Xi | M) = 0

For Feature Selection this is too strong → require only I(Y ; Xi | M) = 0

22 / 66

slide-29
SLIDE 29

CMIM (Binary Features)

I(X1, . . . , XK; Y) = H(Y) − H(Y | X1, . . . , XK)   ← intractable

23 / 66

slide-30
SLIDE 30

CMIM (Binary Features)

I(X1, . . . , XK; Y) = H(Y) − H(Y | X1, . . . , XK)   ← intractable

I(Xi; Y | Xj) = H(Y | Xj) − H(Y | Xi, Xj)   ← distributions over triplets of variables

23 / 66

slide-31
SLIDE 31

CMIM (Binary Features)

I(X1, . . . , XK; Y) = H(Y) − H(Y | X1, . . . , XK)   ← intractable

I(Xi; Y | Xj) = H(Y | Xj) − H(Y | Xi, Xj)   ← distributions over triplets of variables

S1 ← argmax_d Î(Xd; Y)

Sk ← argmax_d { min_{l<k} Î(Y; Xd | X_Sl) }

23 / 66
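A plain (slow) sketch of this CMIM rule on discretized/binary features, with a plug-in conditional-MI estimate; everything here is an illustrative assumption, not the published implementation:

```python
# Sketch: CMIM -- greedily pick the feature whose worst-case I(Xd; Y | Xs),
# over the already-selected Xs, is largest.
import numpy as np

def cond_mi(x, y, z):
    """Plug-in I(X; Y | Z) for discrete vectors, conditioning on each value of z."""
    total = 0.0
    for c in np.unique(z):
        m = z == c
        pz, xs, ys = np.mean(m), x[m], y[m]
        for a in np.unique(xs):
            for b in np.unique(ys):
                pab = np.mean((xs == a) & (ys == b))
                if pab > 0:
                    total += pz * pab * np.log(pab / (np.mean(xs == a) * np.mean(ys == b)))
    return total

def cmim(X, y, K):
    N, D = X.shape
    const = np.zeros(N, dtype=int)     # conditioning on a constant = marginal MI
    selected = [int(np.argmax([cond_mi(X[:, d], y, const) for d in range(D)]))]
    while len(selected) < K:
        scores = [-np.inf if d in selected else
                  min(cond_mi(X[:, d], y, X[:, s]) for s in selected)
                  for d in range(D)]
        selected.append(int(np.argmax(scores)))
    return selected
```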

slide-32
SLIDE 32

Fast CMIM

Sk ← argmax_d { min_{l<k} Î(Y; Xd | X_Sl) }

The inner score min_{l<k} Î(Y; Xd | X_Sl) can only decrease as k grows.

24 / 66

slide-33
SLIDE 33

Fast CMIM

Sk ← argmax_d { min_{l<k} Î(Y; Xd | X_Sl) }   (the inner score can only decrease)

[Figure: per-candidate partial scores ps[n] and the count m[n] of selected features already accounted for]

ps[n] = min_{l<m(n)} Î(Y; Xd | X_Sl)

24 / 66
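A sketch of the lazy-evaluation version: each candidate keeps a partial score ps[d] (which can only decrease) and is only brought up to date while that stale score still beats the current best. The estimator cond_mi is assumed to be the one sketched above, passed in as an argument; m[d] counts how many selected features ps[d] accounts for:

```python
# Sketch: fast CMIM with lazily-updated partial scores.
import numpy as np

def cmim_fast(X, y, K, cond_mi):
    N, D = X.shape
    const = np.zeros(N, dtype=int)
    ps = np.array([cond_mi(X[:, d], y, const) for d in range(D)])   # partial scores
    m = np.zeros(D, dtype=int)          # number of selected features accounted for
    selected = [int(np.argmax(ps))]
    for _ in range(1, K):
        best, best_d = -np.inf, None
        for d in range(D):
            if d in selected:
                continue
            while ps[d] > best and m[d] < len(selected):             # refresh only if promising
                ps[d] = min(ps[d], cond_mi(X[:, d], y, X[:, selected[m[d]]]))
                m[d] += 1
            if ps[d] > best:
                best, best_d = ps[d], d
        selected.append(best_d)
    return selected
```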

slide-34
SLIDE 34

Objective

S = argmax_{S′, |S′|=K} I(X_S′(1), X_S′(2), . . . , X_S′(K); Y)

CMIM, mRMR (and others) compromise Joint Informativeness for Tractability.

25 / 66

slide-35
SLIDE 35

Problem

X1, . . . , XD ∈ R,   Y ∈ {1, . . . , C}

S = argmax_{S′, |S′|=K} I(X_S′(1), X_S′(2), . . . , X_S′(K); Y)

Calculate I(X; Y) using a joint law over the selected features.

26 / 66

slide-36
SLIDE 36

Information Theory

◮ Differential Entropy

h(X) = − ∫_X p(x) log(p(x)) dx

◮ Relative Entropy (Kullback–Leibler Divergence)

D(p‖q) = ∫_X p(x) log [ p(x) / q(x) ] dx

◮ Mutual Information

I(X; Y) = ∫_X ∫_Y p(x, y) log [ p(x, y) / (p(x) p(y)) ] dx dy

27 / 66

slide-37
SLIDE 37

Addressing Intractability

I(X; Y) = H(X) − H(X|Y) = H(X) − Σ_y P(Y = y) H(X | Y = y)

Both H(X) and the terms H(X | Y = y) are intractable.

28 / 66

slide-38
SLIDE 38

Addressing Intractability

I(X; Y) = H(X) − H(X|Y) = H(X) − Σ_y P(Y = y) H(X | Y = y)

Both H(X) and the terms H(X | Y = y) are intractable.

Parametric model for pX|Y

28 / 66

slide-39
SLIDE 39

Addressing Intractability

I(X; Y) = H(X) − H(X|Y) = H(X) − Σ_y P(Y = y) H(X | Y = y)

Both H(X) and the terms H(X | Y = y) are intractable.

Parametric model: pX|Y = N(µy, Σy)

28 / 66

slide-40
SLIDE 40

Addressing Intractability

I(X; Y) = H(X) − H(X|Y) = H(X) − Σ_y P(Y = y) H(X | Y = y)

H(X) remains intractable, but the terms H(X | Y = y) become tractable.

Parametric model: pX|Y = N(µy, Σy)

H(X | Y = y) = (1/2) log(|Σy|) + (n/2)(log 2π + 1)

28 / 66

slide-41
SLIDE 41

Maximum Entropy Distribution

Given E[x] = µ and E[(x − µ)(x − µ)ᵀ] = Σ, the multivariate Normal x ∼ N(µ, Σ) is the Maximum Entropy Distribution.

29 / 66

slide-42
SLIDE 42

Addressing Intractability

I(X; Y) = H(X) − H(X|Y) = H(X) − Σ_y P(Y = y) H(X | Y = y)

H(X) remains intractable, but the terms H(X | Y = y) become tractable.

Parametric model: pX|Y = N(µy, Σy)

H(X | Y = y) = (1/2) log(|Σy|) + (n/2)(log 2π + 1)

30 / 66
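A sketch of the tractable part: fitting one Gaussian per class and evaluating Σ_y P(Y=y) H(X | Y=y) with the closed-form entropy above (function and variable names are assumptions):

```python
# Sketch: H(X | Y = y) = 1/2 log|Sigma_y| + n/2 (log 2*pi + 1) for fitted Gaussians.
import numpy as np

def gaussian_entropy(cov):
    cov = np.atleast_2d(cov)
    n = cov.shape[0]
    return 0.5 * np.linalg.slogdet(cov)[1] + 0.5 * n * (np.log(2 * np.pi) + 1)

def expected_conditional_entropy(X, y):
    """sum_y P(Y = y) H(X | Y = y), one Gaussian fitted per class."""
    total = 0.0
    for c in np.unique(y):
        Xc = X[y == c]
        total += (len(Xc) / len(X)) * gaussian_entropy(np.cov(Xc, rowvar=False))
    return total
```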

slide-43
SLIDE 43

Addressing Intractability

I(X; Y) = H(X) − H(X|Y) = H(X) − Σ_y P(Y = y) H(X | Y = y)

with H(X) intractable and H(X | Y = y) tractable.

H(X) = H( Σ_y πy N(µy, Σy) )

31 / 66
slide-44
SLIDE 44

Addressing Intractability

I(X; Y) = H(X) − H(X|Y) = H(X) − Σ_y P(Y = y) H(X | Y = y)

with H(X) intractable and H(X | Y = y) tractable.

H(X) = H( Σ_y πy N(µy, Σy) )  → Entropy of a Mixture of Gaussians: no analytical solution.

31 / 66

slide-45
SLIDE 45

Addressing Intractability

I(X; Y) = H(X) − H(X|Y) = H(X) − Σ_y P(Y = y) H(X | Y = y)

with H(X) intractable and H(X | Y = y) tractable.

H(X) = H( Σ_y πy N(µy, Σy) )  → Entropy of a Mixture of Gaussians: no analytical solution.

→ Upper bound or approximate H( Σ_y πy N(µy, Σy) ).

31 / 66
slide-46
SLIDE 46

A Normal Upper Bound

pX = Σ_y πy pX|Y

32 / 66

slide-47
SLIDE 47

A Normal Upper Bound

pX = Σ_y πy pX|Y

Let p∗ = N(µ∗, Σ∗) match the mean and covariance of pX.  By maxEnt:  H(pX) ≤ H(p∗)

32 / 66

slide-48
SLIDE 48

A Normal Upper Bound

pX = Σ_y πy pX|Y

p∗ = N(µ∗, Σ∗),  maxEnt ⇒ H(pX) ≤ H(p∗)

I(X; Y) ≤ H(p∗) − Σ_y Py H(X | Y = y)

32 / 66

slide-49
SLIDE 49

A Normal Upper Bound

pX = Σ_y πy pX|Y

p∗ = N(µ∗, Σ∗),  maxEnt ⇒ H(pX) ≤ H(p∗)

I(X; Y) ≤ H(p∗) − Σ_y Py H(X | Y = y)

I(X; Y) ≤ H(Y) = − Σ_y py log py

32 / 66

slide-50
SLIDE 50

A Normal Upper Bound

pX = Σ_y πy pX|Y

I(X; Y) ≤ H(p∗) − Σ_y Py H(X | Y = y)

I(X; Y) ≤ H(Y) = − Σ_y py log py

33 / 66

slide-51
SLIDE 51

A Normal Upper Bound

pX = Σ_y πy pX|Y

I(X; Y) ≤ H(p∗) − Σ_y Py H(X | Y = y)

I(X; Y) ≤ H(Y) = − Σ_y py log py

I(X; Y) ≤ min( H(p∗), Σ_y Py ( H(X | Y = y) − log Py ) ) − Σ_y Py H(X | Y = y)

33 / 66

slide-52
SLIDE 52

[Figure: Entropy vs. mean difference for f, f∗, Disjoint, and GC]

34 / 66

slide-53
SLIDE 53

An approximation

pX = Σ_y πy pX|Y

Under mild assumptions (∀y, H(p∗) > H(pX|Y=y)) we can use an approximation to I(X; Y):

Ĩ(X; Y) = Σ_y min( H(p∗), H(X | Y = y) − log py ) py  −  Σ_y H(X | Y = y) py

35 / 66
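Putting the two pieces together, a sketch of the approximation Ĩ(X; Y) with p∗ the Gaussian matched to the mean and covariance of the full mixture and one Gaussian fitted per class (names are illustrative assumptions):

```python
# Sketch: I_tilde = sum_y min(H(p*), H(X|Y=y) - log p_y) p_y - sum_y H(X|Y=y) p_y
import numpy as np

def gaussian_entropy(cov):
    cov = np.atleast_2d(cov)
    n = cov.shape[0]
    return 0.5 * np.linalg.slogdet(cov)[1] + 0.5 * n * (np.log(2 * np.pi) + 1)

def mi_tilde(X, y):
    classes, counts = np.unique(y, return_counts=True)
    p = counts / len(y)
    H_star = gaussian_entropy(np.cov(X, rowvar=False))     # H(p*), moment-matched Gaussian
    H_cond = np.array([gaussian_entropy(np.cov(X[y == c], rowvar=False))
                       for c in classes])
    return float(np.sum(np.minimum(H_star, H_cond - np.log(p)) * p) - np.sum(H_cond * p))
```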

slide-54
SLIDE 54

Feature Selection Criterion

Mutual Information Approximation

S = argmax_{S′, |S′|=K} Ĩ(X_S′(1), X_S′(2), . . . , X_S′(K); Y)

36 / 66
slide-55
SLIDE 55

Forward Selection

S0 ← ∅
for k = 1 . . . K do
    z∗ ← 0
    for Xj ∈ F \ Sk−1 do
        S′ ← Sk−1 ∪ {Xj}
        z ← Î(S′; Y)
        if z > z∗ then
            z∗ ← z
            S∗ ← S′
        end if
    end for
    Sk ← S∗
end for
return SK

Î(S′; Y) ∝ Σ_y min( log(|Σ∗|), log(|Σy|) − log py ) py  −  Σ_y log(|Σy|) py

37 / 66

slide-56
SLIDE 56

Complexity

At iteration k we need to calculate, ∀j ∈ F \ Sk−1, the determinant |Σ_{Sk−1 ∪ Xj}|.

38 / 66
slide-57
SLIDE 57

Complexity

At iteration k we need to calculate, ∀j ∈ F \ Sk−1, the determinant |Σ_{Sk−1 ∪ Xj}|.

The cost of calculating each determinant is O(k³).

38 / 66

slide-58
SLIDE 58

Forward Selection

S0 ← ∅
for k = 1 . . . K do
    z∗ ← 0
    for Xj ∈ F \ Sk−1 do
        S′ ← Sk−1 ∪ {Xj}
        z ← Ĩ(S′; Y)
        if z > z∗ then
            z∗ ← z
            S∗ ← S′
        end if
    end for
    Sk ← S∗
end for
return SK

Overall Complexity: O(|Y| |F| K⁴)

39 / 66

slide-59
SLIDE 59

Forward Selection

S0 ← ∅
for k = 1 . . . K do
    z∗ ← 0
    for Xj ∈ F \ Sk−1 do
        S′ ← Sk−1 ∪ {Xj}
        z ← Ĩ(S′; Y)
        if z > z∗ then
            z∗ ← z
            S∗ ← S′
        end if
    end for
    Sk ← S∗
end for
return SK

Overall Complexity: O(|Y| |F| K⁴)

However...

39 / 66

slide-60
SLIDE 60

Complexity

Σ_{Sk−1 ∪ Xj} = [ Σ_{Sk−1}      Σ_{j,Sk−1}
                  Σ_{j,Sk−1}ᵀ   σ²_j       ]

40 / 66
slide-61
SLIDE 61

Complexity

Σ_{Sk−1 ∪ Xj} = [ Σ_{Sk−1}      Σ_{j,Sk−1}
                  Σ_{j,Sk−1}ᵀ   σ²_j       ]

We can exploit the matrix determinant lemma (twice)

|Σ + u vᵀ| = ( 1 + vᵀ Σ⁻¹ u ) |Σ|

to compute each determinant in O(n²).

40 / 66
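A quick numerical check of the identity (a sketch on random data, not the paper's implementation):

```python
# Sketch: |Sigma + u v^T| = (1 + v^T Sigma^{-1} u) |Sigma|.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
Sigma = A @ A.T + 5 * np.eye(5)            # a well-conditioned SPD matrix
u, v = rng.normal(size=5), rng.normal(size=5)

lhs = np.linalg.det(Sigma + np.outer(u, v))
rhs = (1 + v @ np.linalg.solve(Sigma, u)) * np.linalg.det(Sigma)
print(lhs, rhs)                            # agree up to floating-point error
```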

slide-62
SLIDE 62

Forward Selection

S0 ← ∅
for k = 1 . . . K do
    z∗ ← 0
    for Xj ∈ F \ Sk−1 do
        S′ ← Sk−1 ∪ {Xj}
        z ← Ĩ(S′; Y)
        if z > z∗ then
            z∗ ← z
            S∗ ← S′
        end if
    end for
    Sk ← S∗
end for
return SK

Overall Complexity: O(|Y| |F| K³)

41 / 66

slide-63
SLIDE 63

Forward Selection

S0 ← ∅
for k = 1 . . . K do
    z∗ ← 0
    for Xj ∈ F \ Sk−1 do
        S′ ← Sk−1 ∪ {Xj}
        z ← Ĩ(S′; Y)
        if z > z∗ then
            z∗ ← z
            S∗ ← S′
        end if
    end for
    Sk ← S∗
end for
return SK

Overall Complexity: O(|Y| |F| K³)

Faster?

41 / 66

slide-64
SLIDE 64

Faster?

I(X, Z; Y) = I(Z; Y) + I(X; Y | Z)

I(Sk; Y) = I(Xj; Y | Sk−1) + I(Sk−1; Y)      ← the second term is common to all candidates

42 / 66

slide-65
SLIDE 65

Faster?

I(X, Z; Y) = I(Z; Y) + I(X; Y | Z)

I(Sk; Y) = I(Xj; Y | Sk−1) + I(Sk−1; Y)      ← the second term is common to all candidates, so it can be dropped

argmax_{Xj ∈ F\Sk−1} I(Xj; Y | Sk−1) = argmax_{Xj ∈ F\Sk−1} ( H(Xj | Sk−1) − H(Xj | Y, Sk−1) )

42 / 66

slide-66
SLIDE 66

Conditional Entropy

H(Xj | Sk−1) = ∫_{R^|Sk−1|} H(Xj | Sk−1 = s) µ_{Sk−1}(s) ds
             = (1/2) log σ²_{j|Sk−1} + (1/2)(log 2π + 1)

σ²_{j|Sk−1} = σ²_j − Σ_{j,Sk−1}ᵀ Σ_{Sk−1}⁻¹ Σ_{j,Sk−1}

43 / 66

slide-67
SLIDE 67

Updating σ²_{j|Sk−1}

σ²_{j|Sk−1} = σ²_j − Σ_{j,Sk−1}ᵀ Σ_{Sk−1}⁻¹ Σ_{j,Sk−1}

Assume Xi was chosen at iteration k − 1:

Σ_{Sk−1} = [ Σ_{Sk−2}      Σ_{i,Sk−2}
             Σ_{i,Sk−2}ᵀ   σ²_i       ]

         = [ Σ_{Sk−2}   0
             0ᵀ         σ²_i ]  +  e Σ_{i,Sk−2}ᵀ  +  Σ_{i,Sk−2} eᵀ

where e is the last standard basis vector and Σ_{i,Sk−2} is implicitly zero-padded to full length.

44 / 66

slide-68
SLIDE 68

Updating Σ_{Sk−1}⁻¹

From the Sherman–Morrison formula:

( Σ + u vᵀ )⁻¹ = Σ⁻¹ − ( Σ⁻¹ u vᵀ Σ⁻¹ ) / ( 1 + vᵀ Σ⁻¹ u )

45 / 66
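Again a quick numerical check (a sketch on random data):

```python
# Sketch: Sherman-Morrison,
# (Sigma + u v^T)^{-1} = Sigma^{-1} - (Sigma^{-1} u v^T Sigma^{-1}) / (1 + v^T Sigma^{-1} u).
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 6))
Sigma = A @ A.T + 6 * np.eye(6)
u, v = rng.normal(size=6), rng.normal(size=6)

Sigma_inv = np.linalg.inv(Sigma)
updated = Sigma_inv - (Sigma_inv @ np.outer(u, v) @ Sigma_inv) / (1 + v @ Sigma_inv @ u)
print(np.max(np.abs(updated - np.linalg.inv(Sigma + np.outer(u, v)))))   # ~ 1e-15
```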

slide-69
SLIDE 69

Updating Σ_{Sk−1}⁻¹

From the Sherman–Morrison formula:

( Σ + u vᵀ )⁻¹ = Σ⁻¹ − ( Σ⁻¹ u vᵀ Σ⁻¹ ) / ( 1 + vᵀ Σ⁻¹ u )

Σ_{Sn−1}⁻¹ = [ Σ_{Sn−2}⁻¹ + (1/(βσ²_i)) u uᵀ     −(1/(βσ²_i)) u
               −(1/(βσ²_i)) uᵀ                    1/(βσ²_i)      ]

with u = Σ_{Sn−2}⁻¹ Σ_{i,Sn−2} and βσ²_i = σ²_i − Σ_{i,Sn−2}ᵀ u the Schur complement.

45 / 66
slide-70
SLIDE 70

Updating Σ_{Sk−1}⁻¹

Σ_{Sn−1}⁻¹ = [ Σ_{Sn−2}⁻¹ + (1/(βσ²_i)) u uᵀ     −(1/(βσ²_i)) u
               −(1/(βσ²_i)) uᵀ                    1/(βσ²_i)      ]

σ²_{j|Sn−1} = σ²_j − Σ_{j,Sn−2}ᵀ Σ_{Sn−2}⁻¹ Σ_{j,Sn−2} − (1/(βσ²_i)) ( uᵀ Σ_{j,Sn−2} − σ_{ji} )²

The first subtracted term is available from the previous round (it would cost O(n²) from scratch); the remaining correction only needs inner products with u and Σ_{j,Sn−2}, i.e. O(n) per candidate feature.

46 / 66

slide-71
SLIDE 71

Faster!

for k = 1 . . . K do
    z∗ ← 0
    for Xj ∈ F \ Sk−1 do
        S′ ← Sk−1 ∪ {Xj}
        z ← Î(S′)
        if z > z∗ then
            z∗ ← z
            S∗ ← S′
        end if
    end for
    Sk ← S∗
end for
return SK

Overall Complexity: O(|Y| |F| K²)

47 / 66

slide-72
SLIDE 72

Faster!

for k = 1 . . . K do
    z∗ ← 0
    for Xj ∈ F \ Sk−1 do
        S′ ← Sk−1 ∪ {Xj}
        z ← Î(S′)
        if z > z∗ then
            z∗ ← z
            S∗ ← S′
        end if
    end for
    Sk ← S∗
end for
return SK

Overall Complexity: O(|Y| |F| K²)

Even Faster?

47 / 66

slide-73
SLIDE 73

Even Faster?

Main bottleneck, O(k) per candidate:

σ²_{j|Sk−1} = σ²_j − Σ_{j,Sk−1}ᵀ Σ_{Sk−1}⁻¹ Σ_{j,Sk−1}

48 / 66

slide-74
SLIDE 74

Even Faster?

Main bottleneck, O(k) per candidate:

σ²_{j|Sk−1} = σ²_j − Σ_{j,Sk−1}ᵀ Σ_{Sk−1}⁻¹ Σ_{j,Sk−1}

Skip non-promising features

48 / 66

slide-75
SLIDE 75

Forward Selection

S0 ← ∅
for k = 1 . . . K do
    z∗ ← 0
    for Xj ∈ F \ Sk−1 do
        S′ ← Sk−1 ∪ {Xj}
        z ← Î(S′)
        if z > z∗ then
            z∗ ← z
            S∗ ← S′
        end if
    end for
    Sk ← S∗
end for
return SK

We want a cheap (O(1)) score c such that c < z∗ implies z < z∗.

49 / 66

slide-76
SLIDE 76

Forward Selection

. . .
z∗ ← 0
for Xj ∈ F \ Sk−1 do
    S′ ← Sk−1 ∪ {Xj}
    c ← ?                    (cheap upper bound on the score)
    if c > z∗ then
        z ← Î(S′)
        if z > z∗ then
            z∗ ← z
            S∗ ← S′
        end if
    end if
end for
. . .

Cheap (O(1)) score c: if c < z∗ then z < z∗.

49 / 66

slide-77
SLIDE 77

An O(1) Bound

σ²_{j|Sk−1} = σ²_j − Σ_{j,Sk−1}ᵀ Σ_{Sk−1}⁻¹ Σ_{j,Sk−1}

Σ_{j,Sk−1}ᵀ Σ_{Sk−1}⁻¹ Σ_{j,Sk−1} = Σ_{j,Sk−1}ᵀ U Λ Uᵀ Σ_{j,Sk−1}

‖Σ_{j,Sk−1}‖₂² max_i λ_i  ≥  Σ_{j,Sk−1}ᵀ Σ_{Sk−1}⁻¹ Σ_{j,Sk−1}  ≥  ‖Σ_{j,Sk−1}‖₂² min_i λ_i      ← O(1)

50 / 66
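A sketch checking the bracketing of the quadratic form by the extreme eigenvalues (random data, illustrative names):

```python
# Sketch: ||s||^2 * min_i lambda_i <= s^T Sigma_S^{-1} s <= ||s||^2 * max_i lambda_i,
# where the lambda_i are the eigenvalues of Sigma_S^{-1} and s = Sigma_{j,S}.
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(8, 8))
Sigma_S = A @ A.T + 8 * np.eye(8)
s = rng.normal(size=8)                      # cross-covariances of Xj with S

Sigma_S_inv = np.linalg.inv(Sigma_S)
quad = s @ Sigma_S_inv @ s
lam = np.linalg.eigvalsh(Sigma_S_inv)
sq_norm = s @ s
print(sq_norm * lam.min(), "<=", quad, "<=", sq_norm * lam.max())
```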


slide-82
SLIDE 82

EigenSystem Update Problem

Given U_n, Λ_n such that Σ_n = U_n Λ_n U_nᵀ

51 / 66

slide-83
SLIDE 83

EigenSystem Update Problem

Given U_n, Λ_n such that Σ_n = U_n Λ_n U_nᵀ, find U_{n+1}, Λ_{n+1} with

Σ_{n+1} = [ Σ_n   v
            vᵀ    1 ] = U_{n+1} Λ_{n+1} U_{n+1}ᵀ

(assume Σ_{n+1} ∈ S⁺⁺_{n+1}, i.e. positive definite)
51 / 66

slide-84
SLIDE 84

EigenSystem Update

U_n = [ u₁ⁿ  u₂ⁿ  . . .  u_nⁿ ],      Λ_n = diag( λ₁ⁿ, . . . , λ_nⁿ )

52 / 66

slide-85
SLIDE 85

EigenSystem Update

U_n = [ u₁ⁿ  u₂ⁿ  . . .  u_nⁿ ],      Λ_n = diag( λ₁ⁿ, . . . , λ_nⁿ )

[ Σ_n   0
  0ᵀ    1 ] u′ᵢⁿ = λᵢⁿ u′ᵢⁿ,      where u′ᵢⁿ = [ uᵢⁿ ; 0 ]

52 / 66

slide-86
SLIDE 86

EigenSystem Update

U_n = [ u₁ⁿ  u₂ⁿ  . . .  u_nⁿ ],      Λ_n = diag( λ₁ⁿ, . . . , λ_nⁿ )

[ Σ_n   0
  0ᵀ    1 ] u′ᵢⁿ = λᵢⁿ u′ᵢⁿ,      where u′ᵢⁿ = [ uᵢⁿ ; 0 ]

[ Σ_n   0
  0ᵀ    1 ] e_{n+1} = 1 · e_{n+1}

52 / 66
slide-87
SLIDE 87

EigenSystem Update

[ Σ_n   0
  0ᵀ    1 ]  =:  Σ′  =  U′ Λ′ U′ᵀ,      U′ = [ u′₁ⁿ . . . u′_nⁿ  e_{n+1} ],   Λ′ = diag( λ₁ⁿ, . . . , λ_nⁿ, 1 )

Σ_{n+1} = Σ′ + e_{n+1} vᵀ + v e_{n+1}ᵀ

( Σ′ + e_{n+1} vᵀ + v e_{n+1}ᵀ ) u^{n+1} = λ^{n+1} u^{n+1}

53 / 66

slide-88
SLIDE 88

EigenSystem Update

( Σ′ + e_{n+1} vᵀ + v e_{n+1}ᵀ ) u^{n+1} = λ^{n+1} u^{n+1}

U′ᵀ ( Σ′ + e_{n+1} vᵀ + v e_{n+1}ᵀ ) U′ U′ᵀ u^{n+1} = λ^{n+1} U′ᵀ u^{n+1}      (using U′U′ᵀ = I)

⇒ ( Λ′ + e_{n+1} qᵀ + q e_{n+1}ᵀ ) U′ᵀ u^{n+1} = λ^{n+1} U′ᵀ u^{n+1}           (by (4), U′ᵀe_{n+1} = e_{n+1}, and q = U′ᵀ v)

(4)  U′ᵀ Σ′ U′ = Λ′

54 / 66

slide-89
SLIDE 89

EigenSystem Update

Σ′′ := Λ′ + e_{n+1} qᵀ + q e_{n+1}ᵀ,      Σ′′ U′ᵀ u^{n+1} = λ^{n+1} U′ᵀ u^{n+1}

→ Σ_{n+1} and Σ′′ share eigenvalues.
→ U_{n+1} = U′ U′′

55 / 66

slide-90
SLIDE 90

EigenSystem Update

|Σ′′ − λI| = ∏_j (λ′_j − λ)  −  Σ_{i<n+1} q²_i ∏_{j≠i, j<n+1} (λ′_j − λ)

f(λ) = λ′_{n+1} − λ − Σ_i q²_i / (λ′_i − λ)

56 / 66

slide-91
SLIDE 91

EigenSystem Update

|Σ′′ − λI| = ∏_j (λ′_j − λ)  −  Σ_{i<n+1} q²_i ∏_{j≠i, j<n+1} (λ′_j − λ)

f(λ) = λ′_{n+1} − λ − Σ_i q²_i / (λ′_i − λ)

∀i:   lim_{λ→λ′_i⁺} f(λ) = +∞,      lim_{λ→λ′_i⁻} f(λ) = −∞

∂f(λ)/∂λ = −1 − Σ_i q²_i / (λ′_i − λ)² ≤ 0

56 / 66
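A sketch of how the new eigenvalues can be recovered from this secular function: f is monotone between consecutive poles λ′_i, so plain bisection finds one root per interval; the roots are checked against numpy's eigenvalues of the arrowhead matrix (all names and the toy data are assumptions):

```python
# Sketch: roots of f(l) = l'_{n+1} - l - sum_i q_i^2 / (l'_i - l) by bisection,
# compared with the interior eigenvalues of the arrowhead matrix Sigma''.
import numpy as np

rng = np.random.default_rng(3)
lam_prime = np.sort(rng.uniform(1.0, 5.0, size=5))   # poles l'_1 < ... < l'_n
q = 0.3 * rng.normal(size=5)                         # couplings
alpha = 6.0                                          # last diagonal entry, l'_{n+1}

def f(x):
    return alpha - x - np.sum(q**2 / (lam_prime - x))

Sig = np.diag(np.append(lam_prime, alpha))           # Sigma'' = Lambda' + e q^T + q e^T
Sig[-1, :-1], Sig[:-1, -1] = q, q

def bisect(lo, hi, tol=1e-12):
    """f is decreasing on (lo, hi) with f(lo) > 0 > f(hi)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) > 0 else (lo, mid)
    return 0.5 * (lo + hi)

eps = 1e-9
roots = [bisect(lam_prime[i] + eps, lam_prime[i + 1] - eps) for i in range(4)]
print(np.round(roots, 6))
print(np.round(np.sort(np.linalg.eigvalsh(Sig))[1:-1], 6))   # interior eigenvalues match
```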

slide-92
SLIDE 92

57 / 66

slide-93
SLIDE 93

[Figure: CPU time in seconds vs. matrix size, comparing a full Lapack eigendecomposition with the update]

Comparison between scratch and update

58 / 66

slide-94
SLIDE 94

[Figure: max |λ_update − λ_Lapack| / λ_Lapack (on the order of 10⁻¹⁰) vs. matrix size]

Numerical stability (eigenvalues)

59 / 66

slide-95
SLIDE 95

Set to go . . .

Now that all the machinery is in place, how did we do?

60 / 66

slide-96
SLIDE 96

Nice Results

Test accuracy (%) with an SVMLin classifier, for 10 / 50 / 100 selected features:

             CIFAR                 STL                   INRIA
             10     50     100     10     50     100     10     50     100
Fisher       25.19  39.47  48.12   26.09  34.63  38.02   92.55  94.03  94.68
FCBF         33.65  47.77  54.97   31.74  38.11  40.66   94.14  96.03  96.03
MRMR         27.94  37.78  43.63   28.26  31.16  33.12   86.03  86.77  86.72
SBMLR        30.43  51.41  56.81   32.29  43.29  47.15   85.92  88.57  88.64
tTest        25.69  40.17  45.12   26.72  36.23  39.14   80.01  87.64  89.23
InfoGain     24.79  37.98  47.37   27.17  33.70  37.84   92.35  93.75  94.68
Spec. Clus.  17.19  32.78  42.6    18.91  32.65  38.24   92.67  93.64  94.44
RelieFF      24.56  38.17  46.51   29.16  38.05  42.94   90.99  95.97  96.36
CFS          31.49  42.17  51.70   28.63  38.54  41.88   88.64  96.11  97.53
CMTF         21.10  40.39  47.71   27.61  38.99  42.32   79.09  89.49  93.01
BAHSIC       –      –      –       28.95  39.05  45.49   78.54  89.77  91.96
GC.E         32.45  50.15  55.06   31.20  43.31  49.75   87.73  91.96  93.13
GC.MI        36.47  51.44  55.39   32.50  44.15  48.88   89.76  95.71  96.45
GKL.E        37.51  52.11  56.41   33.44  44.27  50.54   85.31  92.05  96.36
GKL.MI       33.71  47.17  51.12   32.16  44.87  47.96   85.66  92.14  95.16

GC.MI was the fastest of the more complex algorithms

61 / 66

slide-97
SLIDE 97

Influence of Sample Size

We use estimates Σ̂_N = (1/N) PᵀP.

For (sub)-Gaussian data we have²: if N ≥ C(t/ε)² d then ‖Σ̂_N − Σ‖ ≤ ε.

² with probability at least 1 − 2e^{−ct²}

62 / 66

slide-98
SLIDE 98

Influence of Sample Size

We use estimates Σ̂_N = (1/N) PᵀP.

For (sub)-Gaussian data we have²: if N ≥ C(t/ε)² d then ‖Σ̂_N − Σ‖ ≤ ε.

However, the faster implementations use Σ̂_N⁻¹.

² with probability at least 1 − 2e^{−ct²}

62 / 66

slide-99
SLIDE 99

Influence of Sample Size

The error amplification when inverting is governed by the condition number:

‖Σ⁻¹e‖ / ‖Σ⁻¹b‖  ≤  κ(Σ) ‖e‖ / ‖b‖,      κ(Σ) = ‖Σ⁻¹‖ ‖Σ‖

[Figure: κ(Σ̂_N) for d = 2048 and various values of N]

63 / 66

slide-100
SLIDE 100

[Figure: test accuracy vs. number of samples per class (500–2500), for 10, 25, and 50 selected features]

Effect of sample size on performance when using the Gaussian Approximation for the CIFAR dataset.

64 / 66

slide-101
SLIDE 101

Jointly Informative Feature Selection Made Tractable by Gaussian Modeling
L. Lefakis and F. Fleuret
Journal of Machine Learning Research, 2016

65 / 66

slide-102
SLIDE 102

The End

Thank You!

66 / 66