Order parameters and model selection in Machine Learning: model characterization and feature selection

SLIDE 1

Order parameters and model selection in Machine Learning: model characterization and feature selection

Romaric Gaudel

Advisor: Michèle Sebag; Co-advisor: Antoine Cornuéjols

PhD, December 14, 2010

SLIDE 2

Supervised Machine Learning

Background
- Unknown distribution P(x, y) on X × Y

Objective
- Find h∗ minimizing the generalization error Err(h) = E_{P(x,y)}[ℓ(h(x), y)]
- where ℓ(h(x), y) is the cost of an error on example x

Given
- Training examples L = {(x1, y1), . . . , (xn, yn)}
- where (xi, yi) ∼ P(x, y), i ∈ 1, . . . , n

[Figure: decision boundary h∗(x) = 0 separating the regions h∗(x) > 0 and h∗(x) < 0]

SLIDE 3

Supervised Machine Learning 2

(Vapnik-Chervonenkis; Bottou & Bousquet, 08)

Approximation error (a.k.a. bias)
- The learned hypothesis belongs to H
- h∗_H = argmin_{h ∈ H} Err(h)

Estimation error (a.k.a. variance)
- Err is estimated by the empirical error Err_n(h) = (1/n) Σ_{i=1}^{n} ℓ(h(xi), yi)
- h_n = argmin_{h ∈ H} Err_n(h)

Optimization error
- The learned hypothesis is returned by an optimization algorithm A
- ĥ_n = A(L)

[Figure: h∗ is approximated by h∗_H (approximation), h∗_H by h_n (estimation), and h_n by ĥ_n (optimization)]

SLIDE 6

Focus of the thesis

Combinatorial optimization problems hidden in Machine Learning

Relational representation
⇒ Combinatorial optimization problem
Example: Mutagenesis database

Feature Selection
⇒ Combinatorial optimization problem
Example: Microarray data

SLIDE 7

Outline

1. Relational Kernels
2. Feature Selection

SLIDE 9

Relational Learning / Inductive Logic Programming

Position
- Relational database; X: keys in the database
- Background knowledge; H: set of logical formulas
- Expressive language
- Actual covering test: a Constraint Satisfaction Problem (CSP)

SLIDE 10

CSP consequences within Inductive Logic Programming

Consequences of the Phase Transition

Complexity
- Worst case: NP-hard
- Average case: "easy" except in the Phase Transition region (Cheeseman et al., 91)

Phase Transition in Inductive Logic Programming
- Existence (Giordana & Saitta, 00)
- Impact: learning fails in the Phase Transition region (Botta et al., 03)

SLIDE 11

Multiple Instance Problems

The missing link between Relational and Propositional Learning

Multiple Instance Problems (MIP) (Dietterich et al., 89)
- An example: a set of instances
- An instance: a vector of features
- Target concept: there exists an instance satisfying a predicate P
  pos(x) ⇐⇒ ∃ I ∈ x, P(I)

Example of MIP
- A locked door
- A positive key-ring contains a key which can unlock the door

[Figure: a negative key ring and a positive key ring]

SLIDE 12

Support Vector Machine

A convex optimization problem

argmax_{α ∈ R^n}  Σ_{i=1}^{n} αi − (1/2) Σ_{i,j=1}^{n} αi αj yi yj ⟨xi, xj⟩
s.t.  Σ_{i=1}^{n} αi yi = 0
      0 ≤ αi ≤ C, i = 1, . . . , n

Kernel trick
- ⟨xi, xj⟩ → K(xi, xj)

[Figure: decision function ĥ_n(x) = 0 with margins ĥ_n(x) = ±1 and slack variables ξi = 0, 0 < ξi < 1, ξi > 1]

Kernel-based propositionalization (differs from the RKHS framework)
- Given L = {(x1, y1), . . . , (xn, yn)} and a kernel K
- Φ : x → (K(x1, x), . . . , K(xn, x))
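
A minimal sketch of the kernel-based propositionalization step, assuming the kernel K is available as a plain Python callable (the function name and array layout below are illustrative, not from the slides):

```python
import numpy as np

def propositionalize(K, X_train, X):
    """Kernel-based propositionalization: map each example x to the
    vector Phi(x) = (K(x1, x), ..., K(xn, x)) of its kernel values
    with the n training examples."""
    return np.array([[K(xi, x) for xi in X_train] for x in X])
```

The examples thus become n-dimensional vectors, on which any propositional learner can be run.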

SLIDE 13

SVM and MIP

Averaging-kernel for MIP (Gärtner et al., 02)

Given a kernel k on instances:
K(x, x′) = ( Σ_{xi ∈ x} Σ_{xj ∈ x′} k(xi, xj) ) / ( norm(x) norm(x′) )

Question
- MIP target concept: existential properties
- Averaging-kernel: average properties
- Do averaging-kernels sidestep the limitations of Relational Learning?
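
A sketch of the averaging-kernel, with bags as plain Python sequences of instances; the slide leaves norm(·) unspecified, so the self-kernel normalization used below is an assumption:

```python
import math

def averaging_kernel(bag_a, bag_b, k):
    """Averaging-kernel on bags of instances (Gärtner et al., 02):
    sum the instance kernel k over all cross pairs, then normalize.
    Normalizing by the square roots of the self-sums enforces
    K(x, x) = 1 (an assumed choice for norm(x))."""
    cross = sum(k(a, b) for a in bag_a for b in bag_b)
    norm_a = math.sqrt(sum(k(a, a2) for a in bag_a for a2 in bag_a))
    norm_b = math.sqrt(sum(k(b, b2) for b in bag_b for b2 in bag_b))
    return cross / (norm_a * norm_b)
```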

SLIDE 14

Methodology

Inspired by Phase Transition studies

Usual Phase Transition framework
- Generate data according to control parameters
- Observe the results
- Draw the phase diagram: results w.r.t. order parameters

This study
- Generalized Multiple Instance Problems
- Experimental results of averaging-kernel-based propositionalization

SLIDE 15

Outline

1. Relational Kernels
   - Theoretical failure region
   - Lower bound on the generalization error
   - Empirical failure region
2. Feature Selection

SLIDE 16

Generalized Multiple Instance Problems

Generalized MIP (Weidmann et al., 03)
- An example: a set of instances
- An instance: a vector of features
- Target concept: a conjunction of predicates P1, . . . , Pm
  pos(x) ⇐⇒ ∃ I1, . . . , Im ∈ x, ∧_{i=1}^{m} Pi(Ii)

Example of Generalized MIP
- A molecule: a set of sub-graphs
- Bioactivity: involves several sub-graphs

[Figure: a molecule decomposed into its sub-graphs]

SLIDE 17

Control Parameters

Category    Param.  Definition
Instances   |Σ|     Size of the alphabet Σ, a ∈ Σ (an instance is I = (a, z))
            d       Number of numerical features, z ∈ [0, 1]^d
Examples    M+      Number of instances per positive example
            M−      Number of instances per negative example
            m+      Number of instances in a predicate, for a positive example
            m−      Number of instances in a predicate, for a negative example
            Pm      Number of predicates "missed" by each negative example
Concept     P       Number of predicates
            ε       Radius of each predicate (ε-ball)

[Figure: ε-balls around the instances of a positive example]

SLIDE 18

Limitation of averaging-kernels

Theoretical analysis
- Failure for m+/M+ = m−/M−:
  E_{x∼D+}[K(xi, x)] = E_{x∼D−}[K(xi, x)]

[Figure: scatter plot of K(x+, x) vs K(x−, x) for positive and negative examples]

Empirical approach
- Generate, test, and average empirical results
- Establish a lower bound on the generalization error

SLIDE 20

Efficiency of kernel-based propositionalization

Kernel-based propositionalization H′ (differs from the RKHS framework)
- Given L = {(x1, y1), . . . , (xn, yn)} and a kernel K
- Φ : x → (K(x1, x), . . . , K(xn, x))

Question (Q): separability of the test examples T in H′
∃? αi such that
- Σ_{i=1}^{n} αi yi = 0                                  (SVM constraint)
- 0 ≤ αi ≤ C, i = 1, . . . , n                           (SVM constraint)
- (Σ_{i=1}^{n} αi yi K(xi, x′) + b) · y′ ≥ 1, ∀(x′, y′) ∈ T   (test constraint)

An optimistic criterion
- The test examples are used to define the αi

SLIDE 22

Lower bound on the generalization error

Theorem
For each setting:
- Generate T training (respectively test) datasets L (resp. T)
- Record τ: the proportion of couples (L, T) such that (Q) is satisfiable
Let Err_L be the generalization error when learning from L.
Then, ∀η > 0, with probability at least 1 − exp(−2η²T):
E_{|L|=n}[Err_L] ≥ (1 − (τ + η)) · 1/|T|

Remark
- (Q) is solved using Linear Programming
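
One way (Q) could be checked with an off-the-shelf LP solver: a pure feasibility program over the variables (α1, . . . , αn, b) with a zero objective. A hedged sketch, not the thesis code; the function name and the matrix layout (K_test[t, i] = K(xi, x′t)) are assumptions:

```python
import numpy as np
from scipy.optimize import linprog

def q_satisfiable(K_test, y_train, y_test, C=1.0):
    """Feasibility check for (Q): do alphas and a bias b exist that meet
    the two SVM constraints and separate every test example?
    Inputs are numpy arrays; variables are z = (alpha_1..alpha_n, b)."""
    n, T = len(y_train), len(y_test)
    # Equality constraint: sum_i alpha_i y_i = 0 (b has coefficient 0).
    A_eq = np.append(y_train, 0.0)[None, :]
    # Test constraints: y'_t (sum_i alpha_i y_i K(x_i, x'_t) + b) >= 1,
    # rewritten as A_ub @ z <= -1.
    A_ub = -y_test[:, None] * np.hstack([K_test * y_train, np.ones((T, 1))])
    res = linprog(c=np.zeros(n + 1),            # zero objective: feasibility only
                  A_ub=A_ub, b_ub=-np.ones(T),
                  A_eq=A_eq, b_eq=np.zeros(1),
                  bounds=[(0.0, C)] * n + [(None, None)],
                  method="highs")
    return res.status == 0                      # 0 = a feasible point was found
```

Running this over the 40 couples (L, T) of a setting yields the proportion τ used in the theorem.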

SLIDE 23

Empirical failure region

[Figure: phase diagrams w.r.t. r+ and r−: satisfiability of (Q), and SVM test error]

Control parameters
- Instance space: Σ × [0, 1]^30
- 100 instances per example
- 30 predicates
- 40 couples (L, T) per setting

The averaging kernel fails when
- The training dataset is small (|L| ≤ 100)
- m+/M+ ≈ m−/M−

SLIDE 24

Partial conclusion on Relational Kernels

Contributions
- Theoretical and empirical identification of the limitations of averaging-kernels
- A lower bound on the generalization error

Perspectives
- Failure region for other kernels
- Claim: any kernel computable in polynomial time leads to a failure region
- When is the failure region small enough?


SLIDE 26

Feature Selection

Optimization problem
argmin_{F ⊆ 𝓕} Err(A(F, L))
- 𝓕: set of features
- F: feature subset
- L: training data set
- A: Machine Learning algorithm
- Err: generalization error

Feature Selection (FS)
- Minimize the generalization error
- Decrease the learning/use cost of models
- Lead to more understandable models

Bottlenecks
- Combinatorial optimization problem: find F ⊆ 𝓕
- Unknown objective function: the generalization error

SLIDE 27

Filter approaches for Feature Selection

Score the features, select the best ones

Pro
- Cheap

Cons
- Cannot handle all inter-dependencies between features

Filter approaches
- ANOVA (Analysis of Variance)
- RELIEFF (Kira & Rendell, 92)

SLIDE 28

Embedded approaches for Feature Selection

Exploit the learned hypothesis and/or modify the learning criterion to induce sparsity

Pro
- Based on the relevance of features in the learned model

Cons
- Limited to linear models or a linear combination of kernels
- Possibly misled by feature interdependencies

Embedded approaches
- Lasso (Tibshirani, 94)
- Multiple Kernel Learning (Bach, 08)
- Gini score on Random Forest (Rogers & Gunn, 05)

SLIDE 29

Wrapper approaches for Feature Selection

Test feature subsets: actually address the combinatorial problem

Pro
- Look for the (approximate) best solution

Cons
- Computationally expensive

Wrapper approaches
- Look-ahead (Margaritis, 09)
- Mixed forward/backward search (Zhang, 08)
- Mixed global/local search (Boullé, 07)

SLIDE 30

Proposed Feature Selection framework

Goal: optimal
- Find argmin_{F ⊆ 𝓕} Err(A(F, L))
- Virtually explore the whole lattice

Goal: tractable
- Frugal, unbiased assessment of F (computing Err(A(F, L)) exactly is out of reach)
- Gradually focus the search on the most promising subtrees
- Exploration vs exploitation trade-off

[Figure: lattice of the subsets of {f1, f2, f3}]

SLIDE 32

Outline

1. Relational Kernels
2. Feature Selection
   - Feature Selection through Reinforcement Learning
   - A one-player game with Monte-Carlo Tree Search
   - The FUSE algorithm
   - Experimental validation

SLIDE 33

Feature Selection as a Markov Decision Process

From the lattice of subsets . . .
- Set of features: 𝓕
- Set of candidates: 2^𝓕

. . . to a Markov Decision Process
- Set of states: S = 2^𝓕
- Initial state: ∅
- Set of actions: A = {add f, f ∈ 𝓕}
- Reward function: V : S → [0, 1]
  - Ideally: V(F) = Err(A(F, L))
  - In practice: a fast, unbiased estimate

[Figure: the lattice of feature subsets seen as an MDP, each edge adding one feature]
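
A hedged sketch of this MDP, with states encoded as frozensets of features and one "add f" action per remaining feature (names are illustrative):

```python
def actions(state, features):
    """Actions available in state F: add any feature not yet in F."""
    return [f for f in features if f not in state]

def transition(state, f):
    """Deterministic transition: 'add f' moves from F to F + {f}."""
    return state | frozenset([f])

# The initial state is the empty feature subset:
# s0 = frozenset(); transition(s0, "f1") -> frozenset({"f1"})
```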

SLIDE 34

Optimal Policy

- Policy: π : S → A
- Final state reached by following a policy: F_π
- Optimal policy: π∗ = argmin_π Err(A(F_π, L))

Bellman's optimality principle
π∗(F) = argmin_f V∗(F ∪ {f})
with V∗(F) = Err(A(F, L)) if final(F), and V∗(F) = min_f V∗(F ∪ {f}) otherwise

π∗ is intractable ⇒ approximation using a one-player game approach

[Figure: lattice of feature subsets with the optimal path highlighted]


SLIDE 37

The UCT Monte-Carlo Tree Search

(Kocsis & Szepesvári, 06)

Gradually grow a search tree

Building blocks
- Select the next action (bandit-based phase)
- Add a node (leaf of the search tree)
- Monte-Carlo exploration (random phase)
- Compute the instant reward
- Update the visited nodes

Returned solution
- The path visited most often

[Figure: search tree grown inside the explored tree, showing the bandit-based phase, the new node, and the random phase]
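
A skeletal UCT iteration matching the building blocks above; `select`, `random_policy`, and `reward_fn` stand in for the bandit rule, the random phase, and the reward of the thesis (a sketch under those assumptions, not the actual implementation):

```python
def uct_iteration(tree, root, select, random_policy, reward_fn):
    """One UCT iteration (Kocsis & Szepesvári, 06). `tree` maps each
    known state to its statistics; the root must already be in it."""
    path, state = [root], root
    while state in tree:                       # bandit-based phase
        state = select(tree, state)
        path.append(state)
    tree[state] = {"visits": 0, "value": 0.0}  # add a node (new leaf)
    final = random_policy(state)               # Monte-Carlo random phase
    r = reward_fn(final)                       # instant reward
    for s in path:                             # update the visited nodes
        node = tree[s]
        node["visits"] += 1
        node["value"] += (r - node["value"]) / node["visits"]
    return r
```

After N such iterations, the recommended solution is the most visited path from the root.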

SLIDE 51

Multi-Armed Bandit-based phase

Upper Confidence Bound (UCB1-tuned) (Auer et al., 02)

Exploration vs exploitation trade-off: select

argmax_{a ∈ A}  μ̂_a + sqrt( (ce log(T) / ta) · min(1/4, σ̂²_a + sqrt(ce log(T) / ta)) )

- μ̂_a: empirical average reward of action a
- σ̂²_a: empirical variance of the reward of action a
- T: total number of trials in the current node
- ta: number of trials of action a
- ce: parameter

[Figure: bandit-based phase descending the search tree]
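
The same selection rule in a few lines; the `stats` mapping from each action to its running mean, variance, and trial count is an assumed data layout:

```python
import math

def ucb1_tuned(actions, stats, ce):
    """UCB1-tuned selection (Auer et al., 02): stats[a] = (mean, var, trials)."""
    T = sum(t for _, _, t in stats.values())   # total trials in the node
    def score(a):
        mean, var, t = stats[a]
        if t == 0:
            return float("inf")                # untried actions come first
        bonus = ce * math.log(T) / t
        return mean + math.sqrt(bonus * min(0.25, var + math.sqrt(bonus)))
    return max(actions, key=score)
```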

SLIDE 52

Multi-Armed Bandit-based phase

External information

Mixing UCB with
- Priors on actions (Rolet et al., 09)
- Information learned during the iterations (Gelly & Silver, 07; Auer, 02; Filippi et al., 10)


SLIDE 54

FUSE: bandit-based phase

A many-armed bandit problem

Bottleneck
- UCT degenerates to pure exploration as the number of arms increases (several hundred features)
⇒ Control the number of arms; select the arms

How to control the number of arms?

Continuous heuristics (Gelly & Silver, 07)
- Use a small exploration constant ce (10^−2, 10^−4)

Discrete heuristics (Coulom, 06; Rolet et al., 09)
- Progressive Widening: select a new action whenever ⌊T^b⌋ increases (b = 1/2 in the experiments)

[Figure: number of considered actions as a function of the number of iterations]
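
Progressive widening in two lines (the function name is illustrative): a node tried T times considers at most ⌊T^b⌋ arms, and a new arm is introduced each time this ceiling increases:

```python
import math

def considered_arms(T, b=0.5):
    """Progressive widening (Coulom, 06): with T trials in a node,
    at most floor(T**b) distinct arms are considered; the (k+1)-th
    arm is opened when floor(T**b) first exceeds k."""
    return max(1, math.floor(T ** b))
```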

SLIDE 55

FUSE: bandit-based phase

Sharing information among nodes

How to share information among nodes?

Rapid Action Value Estimation (RAVE) (Gelly & Silver, 07)
- RAVE(f) = average reward of the subsets F with f ∈ F
- g-RAVE(f): global average, over the whole search
- ℓ-RAVE(F, f): local average, restricted to the subtree below F

[Figure: search tree of feature subsets F1, . . . , F11 showing which rewards contribute to ℓ-RAVE and g-RAVE]

SLIDE 60

FUSE: bandit-based phase

Guiding the search with RAVE

Continuous heuristics: generalizing the empirical reward
(1 − α) · μ̂_{F,f} + α · ((1 − β) · ℓ-RAVE(F, f) + β · g-RAVE(f)) + exploration term
with α = c / (c + t_{F,f}) and β = c′ / (c′ + t′_{F,f})
- t_{F,f}: number of trials of feature f from state F
- t′_{F,f}: number of trials of feature f after visiting state F
- c, c′: parameters

Discrete heuristics
New arm = argmax_f (1 − β) · ℓ-RAVE(F, f) + β · g-RAVE(f)
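
The blended value in code form (a sketch; argument names are illustrative):

```python
def rave_score(mu, t, l_rave, t_prime, g_rave, c, c_prime):
    """RAVE-guided value: blend the empirical reward mu with the local
    and global RAVE estimates; the weights alpha and beta fade to 0 as
    the trial counts t and t_prime grow, so the score converges to mu."""
    alpha = c / (c + t)
    beta = c_prime / (c_prime + t_prime)
    rave = (1 - beta) * l_rave + beta * g_rave
    return (1 - alpha) * mu + alpha * rave  # the exploration term is added separately
```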

SLIDE 61

FUSE: random phase

Dealing with an unknown horizon

Bottleneck
- Finite but unknown horizon (= number of relevant features)

Random phase policy
- With probability 1 − q^|F|: stop
- Else: add a uniformly selected feature (|F| ← |F| + 1) and iterate
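
The random phase as a short sampler (function name illustrative):

```python
import random

def random_phase(F, features, q):
    """FUSE random phase: from the subset F reached by the bandit phase,
    keep adding uniformly drawn features; at size |F| the walk goes on
    with probability q**|F|, so it stops almost surely and larger
    subsets stop sooner."""
    F = set(F)
    while random.random() < q ** len(F):       # continue with prob. q**|F|
        remaining = [f for f in features if f not in F]
        if not remaining:
            break
        F.add(random.choice(remaining))
    return F
```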

SLIDE 62

FUSE: reward(F)

Generalization error estimate

Requisites
- Fast (to be computed 10^4 times)
- Unbiased

Proposed reward
- k-NN: strong consistency results (Cover & Hart, 67)
- + AUC criterion*
- Complexity: Õ(mnd)
  - d: number of selected features
  - n: size of the training set
  - m: size of the sub-sample (m ≪ n)

* Mann-Whitney-Wilcoxon statistic:
V(F) = |{((x, y), (x′, y′)) ∈ V², N_{F,k}(x) < N_{F,k}(x′), y < y′}| / |{((x, y), (x′, y′)) ∈ V², y < y′}|

[Figure: AUC of the k-NN score on positive and negative examples]
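
A hedged sketch of such a reward: a leave-one-out k-NN score N_{F,k} on a random sub-sample of size m, turned into an AUC via the Mann-Whitney-Wilcoxon statistic (function and variable names are assumptions, and ties are ignored):

```python
import numpy as np

def knn_auc_reward(X, y, F, k=5, m=100, rng=np.random):
    """Reward of a feature subset F: AUC of a k-NN score computed on a
    random sub-sample of m examples, restricted to the columns in F."""
    idx = rng.choice(len(y), size=min(m, len(y)), replace=False)
    Xs, ys = X[np.ix_(idx, sorted(F))], y[idx]
    # N_{F,k}(x): fraction of positives among the k nearest neighbours of x
    d = np.linalg.norm(Xs[:, None, :] - Xs[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                # leave-one-out
    nn = np.argsort(d, axis=1)[:, :k]
    score = (ys[nn] > 0).mean(axis=1)
    pos, neg = score[ys > 0], score[ys <= 0]
    # Mann-Whitney-Wilcoxon: P(score of a positive > score of a negative)
    return (pos[:, None] > neg[None, :]).mean()
```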

SLIDE 67

FUSE: update

The search explores a graph
⇒ several paths can lead to the same node

Update the followed path only

[Figure: two paths reaching the same node in the search tree; only the followed one is updated]

SLIDE 68

From UCT to Feature Selection to Learning

Algorithm
- N iterations; each iteration:
  1. Follows a path
  2. Evaluates a final node

Output
- FUSE (wrapper approach): return the most visited path of the search tree
- FUSER (filter approach): use the RAVE score to rank the features

End learner
- Any Machine Learning algorithm
- Support Vector Machine with a Gaussian kernel in the experiments


SLIDE 70

Experimental setting

Questions
- FUSE vs FUSER
- Continuous vs discrete exploration heuristics
- FS performance w.r.t. the complexity of the target concept
- Convergence speed

Datasets from the NIPS'03 Feature Selection challenge

DATASET   SAMPLES  FEATURES  PROPERTIES
MADELON   2,600    500       XOR-like
ARCENE    200      10,000    Redundant features
COLON     62       2,000     "Easy"

SLIDE 71

Experimental setting

Baselines
- CFS (Correlation-based Feature Selection) (Hall, 00)
- Random Forest (Rogers & Gunn, 05)
- Lasso (Tibshirani, 94)
- RANDR: RAVE obtained by selecting 20 random features at each iteration

Results averaged over 50 splits (10 × 5-fold cross-validation)
Gaussian SVM, with hyper-parameters optimized by 5-fold cross-validation

SLIDE 72

Results on Madelon after 200,000 iterations

[Figure: test error vs number of used top-ranked features, for D-FUSER, C-FUSER, CFS, Random Forest, Lasso, and RANDR]

Comment: FUSER = best of both worlds
- Removes redundancy (like CFS)
- Keeps conditionally relevant features (like Random Forest)

SLIDE 73

Results on Arcene after 200,000 iterations

[Figure: test error vs number of used top-ranked features, for D-FUSER, C-FUSER, CFS, Random Forest, Lasso, and RANDR]

Comment: FUSER = best of both worlds
- Removes redundancy (like CFS)
- Keeps conditionally relevant features (like Random Forest)

T-test "CFS vs. FUSER" with 100 features: p-value = 0.036

SLIDE 74

Results on Colon after 200,000 iterations

[Figure: test error vs number of used top-ranked features, for D-FUSER, C-FUSER, CFS, Random Forest, Lasso, and RANDR]

Comment
- All methods are equivalent

SLIDE 75

NIPS 2003 Feature Selection challenge

Test error on the NIPS 2003 Feature Selection challenge, on a disjoint test set

DATABASE  ALGORITHM         CHALLENGE ERROR  SUBMITTED FEATURES  IRRELEVANT FEATURES
MADELON   FSPP2 [1]         6.22% (1st)      12
          D-FUSER           6.50% (24th)     18
ARCENE    BAYES-NN-RED [2]  7.20% (1st)      100
          D-FUSER (on all)  8.42% (3rd)      500                 34
          D-FUSER           9.42% (8th)      500                 500

Comment
- Accurate w.r.t. Feature Selection

[1] K. Q. Shen, C. J. Ong, X. P. Li, and E. P. V. Wilder-Smith. Feature selection via sensitivity analysis of SVM probabilistic outputs. Mach. Learn. 2008
[2] R. M. Neal and J. Zhang. High Dimensional Classification with Bayesian Neural Networks and Dirichlet Diffusion Trees. In Feature Extraction: Foundations and Applications, Springer 2006

SLIDE 76

Partial conclusion on Feature Selection

Contributions
- Formalization of Feature Selection as a Markov Decision Process
- Efficient approximation of the optimal policy (based on UCT)
⇒ an any-time algorithm

Experimental results
- State of the art
- High computational cost (45 minutes on Madelon)

Perspectives
- Proof of convergence, including the heuristics (Berthier et al., 10)
- Include other improvements from the Reinforcement Learning community:
  - Function approximation of the Q-value (Melo et al., 08; Auer, 02)
  - Biased random phase (Rimmel & Teytaud, 10)

SLIDE 77

Conclusion

Focus on combinatorial optimization problems hidden in Machine Learning

Relational Learning
- Theoretical and empirical limitations of averaging-kernels

Feature Selection
- Exploration of the feature lattice using a Monte-Carlo Tree Search approach
⇒ refining wrapper approaches through a frugal assessment of candidate subsets

SLIDE 78

Perspective 1: Constructive Induction

Context
- Relational Learning / Inductive Logic Programming

Goal
- Find a relevant set of primitives / queries
⇒ a combinatorial optimization problem

Proposed approach
- Extend FUSE to grammar-structured search spaces (de Mesmay et al., 09)

Motivating applications
- Customer Relationship Management

SLIDE 79

Perspective 2: Feature/Example Selection

Context
- FUSE: Feature Selection based on UCT
- BAAL: Active Learning based on UCT (Rolet et al., 09)
⇒ Can we mix both approaches?

Goal
- Local Feature Selection
- Local Distance Metric Learning (Weinberger & Saul, 09)

[Figure: lattice of feature subsets]

SLIDE 80

Bibliography 1/3

- P. Auer. Using confidence bounds for exploitation-exploration trade-offs. JMLR'02
- P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the Multiarmed Bandit Problem. ML'02
- F. Bach. Exploring large feature spaces with hierarchical Multiple Kernel Learning. NIPS'08
- V. Berthier, H. Doghmen, and O. Teytaud. Consistency Modifications for Automatically Tuned Monte-Carlo Tree Search. CAP'10
- M. Botta, A. Giordana, L. Saitta, and M. Sebag. Relational Learning as search in a critical region. JMLR'03
- L. Bottou and O. Bousquet. The Tradeoffs of Large Scale Learning. NIPS'08
- M. Boullé. Compression-based averaging of selective Naive Bayes classifiers. JMLR'07
- L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen. Classification and regression trees. Taylor & Francis, Inc., 84
- P. Cheeseman, B. Kanefsky, and W. M. Taylor. Where the really hard problems are. IJCAI'91
- R. Coulom. Efficient selectivity and backup operators in Monte-Carlo tree search. Computers and Games 06
- T. Cover and P. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 1967
- R. Féraud, M. Boullé, F. Clérot, F. Fessant, and V. Lemaire. The Orange Customer Analysis Platform. ICDM'10

SLIDE 81

Bibliography 2/3

- S. Filippi, O. Cappé, A. Garivier, and C. Szepesvári. Parametric Bandits: The Generalized Linear Case. NIPS'10
- S. Gelly and D. Silver. Combining online and offline knowledge in UCT. ICML'07
- A. Giordana and L. Saitta. Phase Transitions in Relational Learning. Mach. Learn. 00
- M. A. Hall. Correlation-based Feature Selection for discrete and numeric class Machine Learning. ICML'00
- K. Kira and L. A. Rendell. A practical approach to feature selection. ML'92
- L. Kocsis and C. Szepesvári. Bandit based Monte-Carlo planning. ECML'06
- D. Margaritis. Toward provably correct Feature Selection in arbitrary domains. NIPS'09
- F. Melo, S. Meyn, and I. Ribeiro. An Analysis of Reinforcement Learning with Function Approximation. ICML'08
- F. de Mesmay, A. Rimmel, Y. Voronenko, and M. Püschel. Bandit-based optimization on graphs with application to library performance tuning. ICML'09
- R. M. Neal and J. Zhang. High Dimensional Classification with Bayesian Neural Networks and Dirichlet Diffusion Trees. In Feature Extraction: Foundations and Applications, Springer 2006
- J. Rogers and S. R. Gunn. Identifying feature relevance using a Random Forest. SLSFS'05
- P. Rolet, M. Sebag, and O. Teytaud. Boosting Active Learning to optimality: a tractable Monte-Carlo, Billiard-based algorithm. ECML'09
- K. Q. Shen, C. J. Ong, X. P. Li, and E. P. V. Wilder-Smith. Feature selection via sensitivity analysis of SVM probabilistic outputs. Mach. Learn. 08

SLIDE 82

Bibliography 3/3

- R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, 94
- K. Q. Weinberger and L. K. Saul. Distance Metric Learning for Large Margin Nearest Neighbor Classification. JMLR'09
- T. Zhang. Adaptive forward-backward greedy algorithm for sparse learning with linear models. NIPS'08

SLIDE 83

Relational Kernels (in a slide)

Position
- Relational data; X: keys in a database

Bottlenecks
- H: set of logical formulas; h: a logical formula

Support Vector Machine: the solution?
- A propositionalization
- Uses only the relations between examples
⇒ the argmin is easy to solve

Contribution
- On Multiple-Instance datasets, averaging-kernels miss some concepts
- Theoretical/empirical identification of the failure region + lower bound on the generalization error

SLIDE 84

Feature Selection (in a slide)

Position
- Thousands of features (and only an estimate of Err(A(h, L)))
⇒ overfitting: small error on the training data / large generalization error

Solution: Feature Selection
argmin_{F ⊆ 𝓕} Err(A(F, L))
- 𝓕: set of features; F: feature subset; L: training data set; A: ML algorithm

Bottlenecks
- Combinatorial optimization problem: find F ⊆ 𝓕
- Unknown objective function: the generalization error

Contribution
- Actually handle the combinatorial optimization problem
- Use a Monte-Carlo Tree Search algorithm: UCT (Kocsis & Szepesvári, 06)

SLIDE 85

Relational Kernels

Position
- Relational data / Multiple Instance data; X: keys in a database
- H: set of logical formulas
⇒ Expressive language; value of a hypothesis on an example: NP-hard; number of hypotheses to test: exponential

Support Vector Machine (SVM)
- Only based on the relations between examples; averaging kernel
⇒ Value of a hypothesis on one example: linear in the number of examples; best-hypothesis search ≡ a convex problem

Question
- Does SVM have the same expressiveness as logical formulas?

SLIDE 87

Relational Kernels failure

A Phase Transition-based study
- Identify order parameters
- Generate artificial problems
- Identify the difficult region

Contribution
- On Multiple-Instance data, averaging-kernels miss some concepts
- Theoretical demonstration of the relational-kernel failure region
- A new criterion leading to a lower bound on the generalization error
- Empirical visualization of the failure region

Discussion
- What about other kernels?

[Figure: empirical failure region w.r.t. r+ and r−]

SLIDE 88

Stopping feature

Dealing with an unknown horizon
- Any state can be final or not
- Final(F) = "fs ∈ F", with fs a virtual stopping feature

RAVE(fs)
- g-RAVE(fs^(d)) = average{V(Ft), |Ft| = d + 1}
- V(Ft): reward of the feature subset Ft selected at iteration t
- d: when RAVE(fs) is used, d is set to the number of features in the current state

[Figure: lattice of feature subsets extended with the stopping feature fs]

SLIDE 89

Sensitivity of FUSE to the computational effort

Madelon

[Figure: test error vs iteration, and number of features chosen by FUSE vs iteration, for D-FUSE, D-FUSER, C-FUSE, C-FUSER, and RANDR]

Comments
- FUSE: not enough features
- FUSER: 10 times faster than RANDR

SLIDE 90

FUSE hyperparameters

How to restrict exploration: discrete heuristics (q, b) and continuous heuristics (q, ce, c, c′)

Values tried
- k-NN: 5-NN
- q: 1 − 10^i, i ∈ {−1, −3, −5}
- b: 1/2
- ce: 10^i, i ∈ {−4, −2, 0, 2}
- c, c′: 10^i, i ∈ {−∞, 2, 4}

Best values
- ARCENE: discrete: q = 1 − 10^−1, b = 1/2
          continuous: q = 1 − 10^−1, ce = 10^−2, c/c′: any
                      q = 1 − 10^−3, ce = 10^−4, c/c′: almost any
- MADELON: discrete: q = 1 − 10^−3, b = 1/2
           continuous: q = 1 − 10^−1, ce = 10^−2, (c, c′) ∈ {(10^2, 0), (10^4, 0)}
- COLON: discrete: q = 1 − 10^−5, b = 1/2
         continuous: q = 1 − 10^−5, ce: any, c/c′: almost any