

SLIDE 1

Causal and Non-Causal Feature Selection for Ridge Regression

Gavin Cawley

School of Computing Sciences University of East Anglia Norwich, United Kingdom gcc@cmp.uea.ac.uk

Wednesday 3rd June 2008

SLIDE 2

Introduction

◮ Causal feature selection is useful under covariate shift.
◮ What works best? Compare strategies fairly:
  ◮ Use the same base classifier (ridge regression).
  ◮ Careful (but efficient) optimisation of the ridge parameter.
  ◮ Minimal pre-processing.
◮ Causal feature selection strategies:
  ◮ Markov blanket.
  ◮ Direct causes + direct effects.
  ◮ Direct causes.
◮ For comparison:
  ◮ Non-causal feature selection (BLogReg).
  ◮ No feature selection (regularisation only).
◮ WCCI-2008 Causality and Prediction Challenge.
◮ Solution a little “heuristic”!

SLIDE 3

Ridge Regression

◮ Linear classifier with regularised sum-of-squares loss function:

    ŷ_i = x_i · β   and   L = (1/2) Σ_{i=1}^{ℓ} [y_i − ŷ_i]² + (λ/2) ‖β‖²

◮ Weights found via the “normal equations”:

    (XᵀX + λI) β = Xᵀy

◮ Optimise the regularisation parameter, λ, via virtual leave-one-out (VLOO):

    P(λ) = (1/ℓ) Σ_{i=1}^{ℓ} [ŷ_i^(−i) − y_i]²   where   ŷ_i^(−i) − y_i = (ŷ_i − y_i) / (1 − h_ii)

  and the “hat” matrix is

    H = [h_ij]_{i,j=1}^{ℓ} = X (XᵀX + λI)^{−1} Xᵀ
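The equations above can be sketched in a few lines of NumPy. This is a minimal illustration of the slide's formulas, not the author's code, and the toy data is invented:

```python
import numpy as np

def ridge_vloo(X, y, lam):
    """Ridge weights via the normal equations, plus the virtual
    leave-one-out statistic P(lambda) computed from the hat matrix."""
    n, d = X.shape
    A = X.T @ X + lam * np.eye(d)             # X'X + lambda*I
    beta = np.linalg.solve(A, X.T @ y)        # normal equations
    H = X @ np.linalg.solve(A, X.T)           # hat matrix X (X'X + lam*I)^-1 X'
    resid = (X @ beta) - y                    # yhat_i - y_i
    loo_resid = resid / (1.0 - np.diag(H))    # yhat_i^(-i) - y_i
    return beta, np.mean(loo_resid ** 2)      # P(lambda)

# Toy data, invented for illustration.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(20)
beta, press = ridge_vloo(X, y, lam=1.0)
```

The identity ŷ_i^(−i) − y_i = (ŷ_i − y_i)/(1 − h_ii) means P(λ) costs little more than a single fit, which is what makes careful tuning of λ affordable.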

SLIDE 4

Linear Kernel Ridge Regression

◮ Useful for problems with more features than patterns.
◮ Dual representation of the model:

    ŷ_i = Σ_{j=1}^{ℓ} α_j ⟨x_j, x_i⟩

◮ Model parameters given by a system of linear equations:

    (XXᵀ + λI) α = y

◮ Optimise the regularisation parameter, λ, via VLOO:

    P(λ) = (1/ℓ) Σ_{i=1}^{ℓ} [ŷ_i^(−i) − y_i]²   where   ŷ_i^(−i) − y_i = α_i / C_ii   and   C = (XXᵀ + λI)^{−1}

◮ Computational complexity O(ℓ³) instead of O(d³).
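The dual formulation admits the same kind of sketch. Again this is an illustration of the slide's equations with invented data, not the author's implementation:

```python
import numpy as np

def krr_vloo(X, y, lam):
    """Linear kernel ridge regression in the dual: solve
    (XX' + lam*I) alpha = y and read off the virtual leave-one-out
    residuals as alpha_i / C_ii, where C = (XX' + lam*I)^-1."""
    n = X.shape[0]
    C = np.linalg.inv(X @ X.T + lam * np.eye(n))  # inverse of regularised Gram matrix
    alpha = C @ y                                  # dual parameters
    loo_resid = alpha / np.diag(C)                 # leave-one-out residuals (up to sign)
    return alpha, np.mean(loo_resid ** 2)          # P(lambda)

rng = np.random.default_rng(1)
X = rng.standard_normal((15, 40))   # more features than patterns
y = rng.standard_normal(15)
alpha, press = krr_vloo(X, y, lam=0.5)
```

Predictions at a new point x are ŷ = Σ_j α_j ⟨x_j, x⟩; only an ℓ×ℓ system is solved, hence the O(ℓ³) rather than O(d³) cost.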

SLIDE 5

Optimisation of the Regularisation Parameter

◮ Sneaky trick well known to statisticians!
◮ Eigendecomposition of the covariance matrix: XᵀX = VΛVᵀ.
◮ We can then re-write the normal equations as

    [Λ + λI] α = VᵀXᵀy   where   α = Vᵀβ

◮ Similarly, the “hat” matrix can be written as

    H = XV [Λ + λI]^{−1} VᵀXᵀ

◮ Note that only a diagonal matrix need be inverted.
◮ Performing the eigendecomposition is expensive.
◮ The cost is amortised across the investigation of many values of λ.
◮ Regularisation parameter, λ, optimised via gradient descent.
◮ A similar trick can be implemented for KRR.
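The trick can be sketched as follows. A simple grid search over λ stands in here for the gradient-based optimisation mentioned on the slide, and the data are invented:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((50, 4))
y = X @ rng.standard_normal(4) + 0.1 * rng.standard_normal(50)

# Paid once: eigendecomposition of the covariance matrix, X'X = V Lambda V'.
evals, V = np.linalg.eigh(X.T @ X)
XV = X @ V                    # rotated design matrix, reused for every lambda
rhs = V.T @ (X.T @ y)         # V'X'y, reused for every lambda

best_lam, best_press = None, np.inf
for lam in np.logspace(-3, 3, 25):
    alpha = rhs / (evals + lam)                     # diagonal solve: [Lambda + lam*I] alpha = V'X'y
    resid = XV @ alpha - y                          # yhat - y
    h = np.sum(XV ** 2 / (evals + lam), axis=1)     # diag of H = XV [Lambda + lam*I]^-1 V'X'
    press = np.mean((resid / (1.0 - h)) ** 2)       # virtual leave-one-out criterion
    if press < best_press:
        best_lam, best_press = lam, press
beta = V @ (rhs / (evals + best_lam))               # rotate back to the primal weights
```

Each candidate λ costs only elementwise divisions and a matrix–vector product, so trying many values adds little to the one-off cost of the eigendecomposition.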

SLIDE 6

Feature Selection

◮ Non-causal: logistic regression with a Laplace prior (BLogReg).
  ◮ Regularisation parameter integrated out using a reference prior.
◮ Causal feature selection using Causal Explorer:
  ◮ Selecting the Markov blanket using HITON MB.
  ◮ Direct the edges of the DAG:
    ◮ PC algorithm for problems with continuous features.
    ◮ MMHC algorithm for binary-only problems.
    ◮ Use HITON MB to pre-select features.
◮ Use an ensemble of 100 models:
  ◮ Averages over the variability of feature selection methods.
  ◮ Gives an indication of generalisation performance.
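The ensemble idea on the last bullet can be sketched as below. A simple correlation filter stands in for the BLogReg/HITON MB selectors actually used, and the data, threshold, and ridge parameter are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((100, 30))
# Only the first two features are relevant to the (toy) labels.
y = np.sign(X[:, 0] - X[:, 1] + 0.5 * rng.standard_normal(100))

n_models, n_keep = 100, 5
votes = np.zeros(X.shape[1])        # how often each feature is selected
scores = np.zeros(X.shape[0])       # ensemble-averaged decision values
for _ in range(n_models):
    idx = rng.integers(0, 100, size=100)            # bootstrap resample
    corr = np.abs([np.corrcoef(X[idx, j], y[idx])[0, 1] for j in range(X.shape[1])])
    sel = np.argsort(corr)[-n_keep:]                # keep the strongest features
    votes[sel] += 1
    Xs = X[:, sel]
    A = Xs[idx].T @ Xs[idx] + 1.0 * np.eye(n_keep)  # ridge regression on the selection
    beta = np.linalg.solve(A, Xs[idx].T @ y[idx])
    scores += (Xs @ beta) / n_models                # average over the ensemble
```

`votes` indicates how stably each feature is selected across resamples, while `scores` averages the classifier output over both the resampling and the feature-selection variability.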

SLIDE 7

Results for the REGED Benchmark

◮ Non-causal feature selection works well.

Dataset  Selection         FNUM    FSCORE  DSCORE  TSCORE  AUC
REGED0   None              999.00  0.9204  1.0000  0.9983  0.9612
REGED0   Non-causal         14.69  0.8070  1.0000  0.9997  0.9997
REGED0   Markov blanket     26.86  0.8988  0.9999  0.9997  0.9994
REGED0   Causes & effects    8.60  0.8095  0.9999  0.9996  0.9978
REGED0   Causes only         1.56  0.7143  0.9984  0.9955  0.9346
REGED1   None              999.00  0.9078  1.0000  0.9321  —
REGED1   Non-causal         14.69  0.7798  1.0000  0.9508  —
REGED1   Markov blanket     24.85  0.8438  0.9999  0.9346  —
REGED1   Causes & effects    8.60  0.7822  0.9999  0.9329  —
REGED1   Causes only         1.56  0.7124  0.9984  0.8919  —
REGED2   None              999.00  0.9950  1.0000  0.7184  —
REGED2   Non-causal         14.69  0.9980  1.0000  0.7992  —
REGED2   Markov blanket     24.85  0.9975  0.9999  0.7644  —
REGED2   Causes & effects    8.60  0.9970  0.9999  0.7989  —
REGED2   Causes only         1.56  0.9970  0.9984  0.7653  —

SLIDE 8

Results for SIDO Benchmark

◮ Very large dataset - not all results available.
◮ Best performance achieved without feature selection.

Dataset  Selection        FNUM     FSCORE  DSCORE  TSCORE  AUC
SIDO0    None             4928.00  0.5890  0.9840  0.9427  0.9472
SIDO0    Non-causal         28.96  0.5160  0.9482  0.9294  0.9226
SIDO0    Markov blanket    136.47  0.5818  0.9563  0.9418  0.9356
SIDO1    None             4928.00  0.5314  0.9840  0.7532  —
SIDO1    Non-causal         28.96  0.4909  0.9482  0.6971  —
SIDO1    Markov blanket    136.47  0.5348  0.9563  0.6948  —
SIDO2    None             4928.00  0.5314  0.9840  0.6684  —
SIDO2    Non-causal         28.96  0.4909  0.9482  0.6298  —
SIDO2    Markov blanket    136.47  0.5348  0.9563  0.6298  —

SLIDE 9

Results for CINA Benchmark

◮ Non-causal, Markov blanket & no selection all work well.

Dataset  Selection         FNUM    FSCORE  DSCORE  TSCORE  AUC
CINA0    None              132.00  0.7908  0.9677  0.9674  0.9664
CINA0    Non-causal         29.44  0.5708  0.9682  0.9679  0.9660
CINA0    Markov blanket     55.30  0.7708  0.9669  0.9669  0.9660
CINA0    Causes & effects   21.21  0.6826  0.9654  0.9661  0.9653
CINA0    Causes              1.02  0.5174  0.7923  0.7911  0.5351
CINA1    None              132.00  0.5865  0.9677  0.7953  —
CINA1    Non-causal         29.44  0.6436  0.9682  0.7609  —
CINA1    Markov blanket     55.30  0.5261  0.9669  0.7979  —
CINA1    Causes & effects   21.21  0.5477  0.9654  0.7749  —
CINA1    Causes              1.02  0.5114  0.7923  0.5402  —
CINA2    None              132.00  0.5865  0.9677  0.5502  —
CINA2    Non-causal         29.44  0.6436  0.9682  0.5464  —
CINA2    Markov blanket     55.30  0.5261  0.9669  0.5469  —
CINA2    Causes & effects   21.21  0.5477  0.9654  0.5394  —
CINA2    Causes              1.02  0.5114  0.7923  0.4825  —

SLIDE 10

Pre-processing for the MARTI Benchmark

◮ MARTI has correlated noise.
◮ Use KRR to estimate the noise as a function of the (x, y) co-ordinates of each spot:

    y_i = φ(x_i) · w + ε_i,   where ε_i ∼ N(0, σ_i²)

◮ A radial basis function kernel defines φ(x).
◮ Iteratively re-estimate the noise variance for each spot using the residuals.

[Figure: spot images of the raw data, the estimated noise, and the recovered signal (“Raw”, “Noise”, “Signal”).]
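A toy sketch of the de-noising idea follows. The per-spot variance re-estimation below is one plausible reading of the bullets, not the author's exact procedure; the kernel width, regularisation, grid size, and iteration count are all invented (MARTI itself has 1024 probes, suggesting a 32×32 layout):

```python
import numpy as np

def rbf_kernel(A, B, width):
    """Gaussian RBF kernel between two sets of 2-D coordinates."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * width ** 2))

rng = np.random.default_rng(4)
g = np.arange(8.0)                                  # small grid for illustration
coords = np.stack(np.meshgrid(g, g), axis=-1).reshape(-1, 2)   # (x, y) spot positions
structured = np.sin(coords[:, 0] / 3.0) + np.cos(coords[:, 1] / 4.0)  # smooth correlated "noise"
z = structured + 0.1 * rng.standard_normal(len(coords))        # observed values at the spots

K = rbf_kernel(coords, coords, width=3.0)
w = np.ones(len(z))                                  # per-spot precision (inverse variance)
for _ in range(5):                                   # iteratively re-estimate the variances
    alpha = np.linalg.solve(K + 0.1 * np.diag(1.0 / w), z)   # weighted KRR fit
    noise_est = K @ alpha                            # smooth estimate of the structured noise
    resid = z - noise_est
    w = 1.0 / np.maximum(resid ** 2, 1e-3)           # heteroscedastic variance from residuals
cleaned = z - noise_est                              # subtract the estimated correlated noise
```

Down-weighting spots with large residuals lets genuinely noisy spots deviate from the smooth surface without dragging the noise estimate with them.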

SLIDE 11

Results for MARTI Benchmark

◮ Markov blanket and non-causal selection work well.

Dataset  Selection         FNUM     FSCORE  DSCORE  TSCORE  AUC
MARTI0   None              1024.00  0.7980  1.0000  0.9970  0.9950
MARTI0   Non-causal          15.19  0.8029  0.9998  0.9993  0.9986
MARTI0   Markov blanket      26.86  0.8862  1.0000  0.9994  0.9994
MARTI0   Causes & effects     8.60  0.7894  0.9987  0.9986  0.9978
MARTI0   Causes only          1.56  0.5714  0.9821  0.9775
MARTI1   None              1024.00  0.7923  1.0000  0.9085  —
MARTI1   Non-causal          15.19  0.7752  0.9998  0.9310  —
MARTI1   Markov blanket      26.86  0.8264  1.0000  0.9234  —
MARTI1   Causes & effects     8.60  0.7820  0.9987  0.8929  —
MARTI1   Causes only          1.56  0.5347  0.9821  0.6370  —
MARTI2   None              1024.00  0.9951  1.0000  0.9085  —
MARTI2   Non-causal          15.19  0.9976  0.9998  0.7975  —
MARTI2   Markov blanket      26.86  0.9966  1.0000  0.7740  —
MARTI2   Causes & effects     8.60  0.9956  0.9987  0.7416  —
MARTI2   Causes only          1.56  0.7485  0.9821  0.6607  —

SLIDE 12

Results for Final Submission

◮ Ridge regression provides a satisfactory base classifier.
◮ An ARD/RBF kernel classifier may be better for CINA.
◮ Feature selection beneficial for manipulated datasets.

Dataset  Fnum   Fscore  Dscore  Tscore  Top Ts  Max Ts  Rank
cina0     128   0.5166  0.9737  0.9743  0.9765  0.9788  3
cina1     128   0.5860  0.9737  0.8691  0.8691  0.8977
cina2      64   0.5860  0.9734  0.7031  0.8157  0.8910
marti0    128   0.8697  1.0000  0.9996  0.9996  0.9996  1
marti1     32   0.8064  1.0000  0.9470  0.9470  0.9542
marti2     64   0.9956  0.9998  0.7975  0.7975  0.8273
reged0    128   0.9410  0.9999  0.9997  0.9998  1.0000  2
reged1     32   0.8393  0.9970  0.9787  0.9888  0.9980
reged2      8   0.9985  0.9996  0.8045  0.8600  0.9534
sido0    4928   0.5890  0.9840  0.9427  0.9443  0.9467  1
sido1    4928   0.5314  0.9840  0.7532  0.7532  0.7893
sido2    4928   0.5314  0.9840  0.6684  0.6684  0.7674

(Fnum and Fscore assess causal discovery; Dscore, Tscore, Top Ts and Max Ts assess target prediction; Rank is per benchmark.)

SLIDE 13

Summary

◮ Things that worked well:
  ◮ Regularisation can suppress irrelevant features.
  ◮ Using an ensemble to average over sources of uncertainty.
  ◮ Pre-processing is important (e.g. MARTI).
◮ Things that didn’t work so well:
  ◮ Computational expense: need more efficient tools.
  ◮ Effective non-linear models for large datasets (e.g. CINA).
◮ Challenge makes a convincing case for causal feature selection:
  ◮ Can deal with covariate shift.
  ◮ Rather difficult!
◮ Availability of MATLAB code:
  ◮ http://theoval.cmp.uea.ac.uk/~gcc/cbl/blogreg/
  ◮ http://theoval.cmp.uea.ac.uk/~gcc/projects/gkm/