

SLIDE 1

Causal and Non-Causal Feature Selection for Ridge Regression

Gavin Cawley

School of Computing Sciences University of East Anglia Norwich, United Kingdom gcc@cmp.uea.ac.uk

Wednesday 3rd June 2008

SLIDE 2

Introduction

◮ Causal feature selection is useful under covariate shift.
◮ What works best? Compare strategies fairly:
  ◮ Use the same base classifier (ridge regression).
  ◮ Careful (but efficient) optimisation of the ridge parameter.
  ◮ Minimal pre-processing.
◮ Causal feature selection strategies:
  ◮ Markov blanket.
  ◮ Direct causes + direct effects.
  ◮ Direct causes.
◮ For comparison:
  ◮ Non-causal feature selection (BLogReg).
  ◮ No feature selection (regularisation only).
◮ WCCI-2008 Causality and Prediction Challenge.
◮ Solution a little “heuristic”!

SLIDE 3

Ridge Regression

◮ Linear classifier with regularised sum-of-squares loss function:

    ŷ_i = x_i · β   and   L = (1/2) Σ_{i=1}^{ℓ} [y_i − ŷ_i]² + (λ/2) ‖β‖²

◮ Weights found via the “normal equations”:

    (XᵀX + λI) β = Xᵀy

◮ Optimise the regularisation parameter, λ, via virtual leave-one-out (VLOO):

    P(λ) = (1/ℓ) Σ_{i=1}^{ℓ} [ŷ_i^(−i) − y_i]²   where   ŷ_i^(−i) − y_i = (ŷ_i − y_i) / (1 − h_ii)

  and the “hat” matrix is

    H = [h_ij]_{i,j=1}^{ℓ} = X (XᵀX + λI)^{−1} Xᵀ
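The equations above can be sketched in a few lines of NumPy. This is a minimal illustration of the slide's formulas, not the author's code, and the toy data is invented:

```python
import numpy as np

def ridge_vloo(X, y, lam):
    """Ridge weights via the normal equations, plus the virtual
    leave-one-out statistic P(lambda) computed from the hat matrix."""
    n, d = X.shape
    A = X.T @ X + lam * np.eye(d)             # X'X + lambda*I
    beta = np.linalg.solve(A, X.T @ y)        # normal equations
    H = X @ np.linalg.solve(A, X.T)           # hat matrix X (X'X + lam*I)^-1 X'
    resid = (X @ beta) - y                    # yhat_i - y_i
    loo_resid = resid / (1.0 - np.diag(H))    # yhat_i^(-i) - y_i
    return beta, np.mean(loo_resid ** 2)      # P(lambda)

# Toy data, invented for illustration.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(20)
beta, press = ridge_vloo(X, y, lam=1.0)
```

The identity ŷ_i^(−i) − y_i = (ŷ_i − y_i)/(1 − h_ii) means P(λ) costs little more than a single fit, which is what makes careful tuning of λ affordable.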

SLIDE 4

Linear Kernel Ridge Regression

◮ Useful for problems with more features than patterns.
◮ Dual representation of the model:

    ŷ_i = Σ_{j=1}^{ℓ} α_j ⟨x_j, x_i⟩

◮ Model parameters given by a system of linear equations:

    (XXᵀ + λI) α = y

◮ Optimise the regularisation parameter, λ, via VLOO:

    P(λ) = (1/ℓ) Σ_{i=1}^{ℓ} [ŷ_i^(−i) − y_i]²   where   ŷ_i^(−i) − y_i = α_i / C_ii   and   C = (XXᵀ + λI)^{−1}

◮ Computational complexity O(ℓ³) instead of O(d³).
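The dual formulation admits the same kind of sketch. Again this is an illustration of the slide's equations with invented data, not the author's implementation:

```python
import numpy as np

def krr_vloo(X, y, lam):
    """Linear kernel ridge regression in the dual: solve
    (XX' + lam*I) alpha = y and read off the virtual leave-one-out
    residuals as alpha_i / C_ii, where C = (XX' + lam*I)^-1."""
    n = X.shape[0]
    C = np.linalg.inv(X @ X.T + lam * np.eye(n))  # inverse of regularised Gram matrix
    alpha = C @ y                                  # dual parameters
    loo_resid = alpha / np.diag(C)                 # leave-one-out residuals (up to sign)
    return alpha, np.mean(loo_resid ** 2)          # P(lambda)

rng = np.random.default_rng(1)
X = rng.standard_normal((15, 40))   # more features than patterns
y = rng.standard_normal(15)
alpha, press = krr_vloo(X, y, lam=0.5)
```

Predictions at a new point x are ŷ = Σ_j α_j ⟨x_j, x⟩; only an ℓ×ℓ system is solved, hence the O(ℓ³) rather than O(d³) cost.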

SLIDE 5

Optimisation of the Regularisation Parameter

◮ Sneaky trick well known to statisticians!
◮ Eigendecomposition of the covariance matrix: XᵀX = VΛVᵀ.
◮ We can then re-write the normal equations as

    [Λ + λI] α = VᵀXᵀy   where   α = Vᵀβ

◮ Similarly, the “hat” matrix can be written as

    H = XV [Λ + λI]^{−1} VᵀXᵀ

◮ Note that only a diagonal matrix need be inverted.
◮ Performing the eigendecomposition is expensive.
◮ The cost is amortised across the investigation of many values of λ.
◮ Regularisation parameter, λ, optimised via gradient descent.
◮ A similar trick can be implemented for KRR.
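The trick can be sketched as follows. A simple grid search over λ stands in here for the gradient-based optimisation mentioned on the slide, and the data are invented:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((50, 4))
y = X @ rng.standard_normal(4) + 0.1 * rng.standard_normal(50)

# Paid once: eigendecomposition of the covariance matrix, X'X = V Lambda V'.
evals, V = np.linalg.eigh(X.T @ X)
XV = X @ V                    # rotated design matrix, reused for every lambda
rhs = V.T @ (X.T @ y)         # V'X'y, reused for every lambda

best_lam, best_press = None, np.inf
for lam in np.logspace(-3, 3, 25):
    alpha = rhs / (evals + lam)                     # diagonal solve: [Lambda + lam*I] alpha = V'X'y
    resid = XV @ alpha - y                          # yhat - y
    h = np.sum(XV ** 2 / (evals + lam), axis=1)     # diag of H = XV [Lambda + lam*I]^-1 V'X'
    press = np.mean((resid / (1.0 - h)) ** 2)       # virtual leave-one-out criterion
    if press < best_press:
        best_lam, best_press = lam, press
beta = V @ (rhs / (evals + best_lam))               # rotate back to the primal weights
```

Each candidate λ costs only elementwise divisions and a matrix–vector product, so trying many values adds little to the one-off cost of the eigendecomposition.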

SLIDE 6

Feature Selection

◮ Non-causal: logistic regression with a Laplace prior (BLogReg).
  ◮ Regularisation parameter integrated out using a reference prior.
◮ Causal feature selection using Causal Explorer:
  ◮ Selecting the Markov blanket using HITON MB.
  ◮ Direct the edges of the DAG:
    ◮ PC algorithm for problems with continuous features.
    ◮ MMHC algorithm for binary-only problems.
    ◮ Use HITON MB to pre-select features.
◮ Use an ensemble of 100 models:
  ◮ Averages over the variability of feature selection methods.
  ◮ Gives an indication of generalisation performance.
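The ensemble idea on the last bullet can be sketched as below. A simple correlation filter stands in for the BLogReg/HITON MB selectors actually used, and the data, threshold, and ridge parameter are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((100, 30))
# Only the first two features are relevant to the (toy) labels.
y = np.sign(X[:, 0] - X[:, 1] + 0.5 * rng.standard_normal(100))

n_models, n_keep = 100, 5
votes = np.zeros(X.shape[1])        # how often each feature is selected
scores = np.zeros(X.shape[0])       # ensemble-averaged decision values
for _ in range(n_models):
    idx = rng.integers(0, 100, size=100)            # bootstrap resample
    corr = np.abs([np.corrcoef(X[idx, j], y[idx])[0, 1] for j in range(X.shape[1])])
    sel = np.argsort(corr)[-n_keep:]                # keep the strongest features
    votes[sel] += 1
    Xs = X[:, sel]
    A = Xs[idx].T @ Xs[idx] + 1.0 * np.eye(n_keep)  # ridge regression on the selection
    beta = np.linalg.solve(A, Xs[idx].T @ y[idx])
    scores += (Xs @ beta) / n_models                # average over the ensemble
```

`votes` indicates how stably each feature is selected across resamples, while `scores` averages the classifier output over both the resampling and the feature-selection variability.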

SLIDE 7

Results for the REGED Benchmark

◮ Non-causal feature selection works well.

Dataset  Selection         FNUM    FSCORE  DSCORE  TSCORE  AUC
REGED0   None              999.00  0.9204  1.0000  0.9983  0.9612
REGED0   Non-causal         14.69  0.8070  1.0000  0.9997  0.9997
REGED0   Markov blanket     26.86  0.8988  0.9999  0.9997  0.9994
REGED0   Causes & effects    8.60  0.8095  0.9999  0.9996  0.9978
REGED0   Causes only         1.56  0.7143  0.9984  0.9955  0.9346
REGED1   None              999.00  0.9078  1.0000  0.9321  —
REGED1   Non-causal         14.69  0.7798  1.0000  0.9508  —
REGED1   Markov blanket     24.85  0.8438  0.9999  0.9346  —
REGED1   Causes & effects    8.60  0.7822  0.9999  0.9329  —
REGED1   Causes only         1.56  0.7124  0.9984  0.8919  —
REGED2   None              999.00  0.9950  1.0000  0.7184  —
REGED2   Non-causal         14.69  0.9980  1.0000  0.7992  —
REGED2   Markov blanket     24.85  0.9975  0.9999  0.7644  —
REGED2   Causes & effects    8.60  0.9970  0.9999  0.7989  —
REGED2   Causes only         1.56  0.9970  0.9984  0.7653  —

SLIDE 8

Results for SIDO Benchmark

◮ Very large dataset - not all results available.
◮ Best performance achieved without feature selection.

Dataset  Selection        FNUM     FSCORE  DSCORE  TSCORE  AUC
SIDO0    None             4928.00  0.5890  0.9840  0.9427  0.9472
SIDO0    Non-causal         28.96  0.5160  0.9482  0.9294  0.9226
SIDO0    Markov blanket    136.47  0.5818  0.9563  0.9418  0.9356
SIDO1    None             4928.00  0.5314  0.9840  0.7532  —
SIDO1    Non-causal         28.96  0.4909  0.9482  0.6971  —
SIDO1    Markov blanket    136.47  0.5348  0.9563  0.6948  —
SIDO2    None             4928.00  0.5314  0.9840  0.6684  —
SIDO2    Non-causal         28.96  0.4909  0.9482  0.6298  —
SIDO2    Markov blanket    136.47  0.5348  0.9563  0.6298  —

SLIDE 9

Results for CINA Benchmark

◮ Non-causal, Markov blanket & no selection all work well.

Dataset  Selection         FNUM    FSCORE  DSCORE  TSCORE  AUC
CINA0    None              132.00  0.7908  0.9677  0.9674  0.9664
CINA0    Non-causal         29.44  0.5708  0.9682  0.9679  0.9660
CINA0    Markov blanket     55.30  0.7708  0.9669  0.9669  0.9660
CINA0    Causes & effects   21.21  0.6826  0.9654  0.9661  0.9653
CINA0    Causes              1.02  0.5174  0.7923  0.7911  0.5351
CINA1    None              132.00  0.5865  0.9677  0.7953  —
CINA1    Non-causal         29.44  0.6436  0.9682  0.7609  —
CINA1    Markov blanket     55.30  0.5261  0.9669  0.7979  —
CINA1    Causes & effects   21.21  0.5477  0.9654  0.7749  —
CINA1    Causes              1.02  0.5114  0.7923  0.5402  —
CINA2    None              132.00  0.5865  0.9677  0.5502  —
CINA2    Non-causal         29.44  0.6436  0.9682  0.5464  —
CINA2    Markov blanket     55.30  0.5261  0.9669  0.5469  —
CINA2    Causes & effects   21.21  0.5477  0.9654  0.5394  —
CINA2    Causes              1.02  0.5114  0.7923  0.4825  —

SLIDE 10

Pre-processing for the MARTI Benchmark

◮ MARTI has correlated noise.
◮ Use KRR to estimate the noise as a function of the (x, y) co-ordinates of each spot:

    y_i = φ(x_i) · w + ε_i,   where ε_i ∼ N(0, σ_i²)

◮ A radial basis function kernel defines φ(x).
◮ Iteratively re-estimate the noise variance for each spot using the residuals.

[Figure: spot images of the raw data, the estimated noise, and the recovered signal (“Raw”, “Noise”, “Signal”).]
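A toy sketch of the de-noising idea follows. The per-spot variance re-estimation below is one plausible reading of the bullets, not the author's exact procedure; the kernel width, regularisation, grid size, and iteration count are all invented (MARTI itself has 1024 probes, suggesting a 32×32 layout):

```python
import numpy as np

def rbf_kernel(A, B, width):
    """Gaussian RBF kernel between two sets of 2-D coordinates."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * width ** 2))

rng = np.random.default_rng(4)
g = np.arange(8.0)                                  # small grid for illustration
coords = np.stack(np.meshgrid(g, g), axis=-1).reshape(-1, 2)   # (x, y) spot positions
structured = np.sin(coords[:, 0] / 3.0) + np.cos(coords[:, 1] / 4.0)  # smooth correlated "noise"
z = structured + 0.1 * rng.standard_normal(len(coords))        # observed values at the spots

K = rbf_kernel(coords, coords, width=3.0)
w = np.ones(len(z))                                  # per-spot precision (inverse variance)
for _ in range(5):                                   # iteratively re-estimate the variances
    alpha = np.linalg.solve(K + 0.1 * np.diag(1.0 / w), z)   # weighted KRR fit
    noise_est = K @ alpha                            # smooth estimate of the structured noise
    resid = z - noise_est
    w = 1.0 / np.maximum(resid ** 2, 1e-3)           # heteroscedastic variance from residuals
cleaned = z - noise_est                              # subtract the estimated correlated noise
```

Down-weighting spots with large residuals lets genuinely noisy spots deviate from the smooth surface without dragging the noise estimate with them.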

SLIDE 11

Results for MARTI Benchmark

◮ Markov blanket and non-causal selection work well.

Dataset  Selection         FNUM     FSCORE  DSCORE  TSCORE  AUC
MARTI0   None              1024.00  0.7980  1.0000  0.9970  0.9950
MARTI0   Non-causal          15.19  0.8029  0.9998  0.9993  0.9986
MARTI0   Markov blanket      26.86  0.8862  1.0000  0.9994  0.9994
MARTI0   Causes & effects     8.60  0.7894  0.9987  0.9986  0.9978
MARTI0   Causes only          1.56  0.5714  0.9821  0.9775
MARTI1   None              1024.00  0.7923  1.0000  0.9085  —
MARTI1   Non-causal          15.19  0.7752  0.9998  0.9310  —
MARTI1   Markov blanket      26.86  0.8264  1.0000  0.9234  —
MARTI1   Causes & effects     8.60  0.7820  0.9987  0.8929  —
MARTI1   Causes only          1.56  0.5347  0.9821  0.6370  —
MARTI2   None              1024.00  0.9951  1.0000  0.9085  —
MARTI2   Non-causal          15.19  0.9976  0.9998  0.7975  —
MARTI2   Markov blanket      26.86  0.9966  1.0000  0.7740  —
MARTI2   Causes & effects     8.60  0.9956  0.9987  0.7416  —
MARTI2   Causes only          1.56  0.7485  0.9821  0.6607  —

SLIDE 12

Results for Final Submission

◮ Ridge regression provides a satisfactory base classifier.
◮ An ARD/RBF kernel classifier may be better for CINA.
◮ Feature selection beneficial for manipulated datasets.

Dataset  Fnum   Fscore  Dscore  Tscore  Top Ts  Max Ts  Rank
cina0     128   0.5166  0.9737  0.9743  0.9765  0.9788  3
cina1     128   0.5860  0.9737  0.8691  0.8691  0.8977
cina2      64   0.5860  0.9734  0.7031  0.8157  0.8910
marti0    128   0.8697  1.0000  0.9996  0.9996  0.9996  1
marti1     32   0.8064  1.0000  0.9470  0.9470  0.9542
marti2     64   0.9956  0.9998  0.7975  0.7975  0.8273
reged0    128   0.9410  0.9999  0.9997  0.9998  1.0000  2
reged1     32   0.8393  0.9970  0.9787  0.9888  0.9980
reged2      8   0.9985  0.9996  0.8045  0.8600  0.9534
sido0    4928   0.5890  0.9840  0.9427  0.9443  0.9467  1
sido1    4928   0.5314  0.9840  0.7532  0.7532  0.7893
sido2    4928   0.5314  0.9840  0.6684  0.6684  0.7674

(Fnum and Fscore assess causal discovery; Dscore, Tscore, Top Ts and Max Ts assess target prediction; Rank is per benchmark.)

SLIDE 13

Summary

◮ Things that worked well:
  ◮ Regularisation can suppress irrelevant features.
  ◮ Using an ensemble to average over sources of uncertainty.
  ◮ Pre-processing is important (e.g. MARTI).
◮ Things that didn’t work so well:
  ◮ Computational expense: need more efficient tools.
  ◮ Effective non-linear models for large datasets (e.g. CINA).
◮ Challenge makes a convincing case for causal feature selection:
  ◮ Can deal with covariate shift.
  ◮ Rather difficult!
◮ Availability of MATLAB code:
  ◮ http://theoval.cmp.uea.ac.uk/~gcc/cbl/blogreg/
  ◮ http://theoval.cmp.uea.ac.uk/~gcc/projects/gkm/