Granger-causal Attentive Mixtures of Experts: Learning Important Features with Neural Networks (PowerPoint Presentation)

Patrick Schwab1 (@schwabpa), Djordje Miladinovic2, and Walter Karlen1. 1 Institute of Robotics and Intelligent Systems, ETH Zurich; 2 Department of Computer Science, ETH Zurich


SLIDE 1

Granger-causal Attentive Mixtures of Experts: Learning Important Features with Neural Networks

Patrick Schwab1 (@schwabpa), Djordje Miladinovic2, and Walter Karlen1

1 Institute of Robotics and Intelligent Systems, ETH Zurich; 2 Department of Computer Science, ETH Zurich

SLIDE 2

Motivation

SLIDE 3

Motivation

SLIDE 4

Motivation

[Diagram: inputs: Weight, Blood Pressure, Age]

SLIDE 5

Motivation

[Diagram: inputs (Weight, Blood Pressure, Age) → model → output (Heart Failure Risk)]

SLIDE 6

Motivation

[Diagram: inputs (Weight, Blood Pressure, Age) → model → output (Heart Failure Risk)]

SLIDE 7

Motivation

[Diagram: inputs (Weight, Blood Pressure, Age) → model → output (Heart Failure Risk)]

What was the decision based on?

SLIDE 8

Motivation

[Diagram: inputs (Weight, Blood Pressure, Age) → black-box model → output (Heart Failure Risk)]

SLIDE 9

Motivation

[Diagram: inputs (Weight, Blood Pressure, Age) → model → output (Heart Failure Risk)]

We desire an explanation.

SLIDE 10

The Idea

Can we train a neural network to output both (1) accurate predictions and (2) feature importance scores?

SLIDE 11

Use Cases

Schwab et al. Granger-causal Attentive Mixtures of Experts: Learning Important Features with Neural Networks

  • Model understanding
  • Human-ML cooperation: why was this decision made?
  • Does this decision make sense?
  • Are my model's decisions justifiable?
  • What patterns has my model discovered?
SLIDE 12

Approach

SLIDE 13

Attentive Mixture of Experts (AME)

[Diagram: experts E1, E2, E3 each produce a hidden state h1, h2, h3; attentive gating networks G1, G2, G3 read h_all = (h1, c1, h2, c2, h3, c3) and output attention factors a1, a2, a3; the prediction y is the sum of the attention-weighted expert contributions. The attention is Granger-causally grounded.]

SLIDE 14

Attentive Mixture of Experts (AME)

[AME diagram as on slide 13]

One independent expert per feature / feature group

SLIDE 15

Attentive Mixture of Experts (AME)

[AME diagram as on slide 13]

Attentive gates control expert contributions

SLIDE 16

Attentive Mixture of Experts (AME)

[AME diagram as on slide 13]

Experts can only contribute to y after modulation by ai
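The structure described on these slides can be illustrated in code. The following is a minimal, hypothetical simplification (each expert and gate is a single linear unit, whereas the paper's experts are full neural networks), showing how y is formed only from attention-modulated expert contributions:

```python
import math
import random

random.seed(0)

def softmax(zs):
    """Numerically stable softmax over a list of scores."""
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    s = sum(exps)
    return [e / s for e in exps]

class AME:
    """Sketch of an Attentive Mixture of Experts (illustrative only)."""

    def __init__(self, num_experts):
        # One independent expert per feature: h_i = w_i * x_i + b_i
        self.w = [random.uniform(-1, 1) for _ in range(num_experts)]
        self.b = [0.0] * num_experts
        # Gating weights: each gate G_i scores its expert's contribution
        self.g = [random.uniform(-1, 1) for _ in range(num_experts)]

    def forward(self, x):
        # Each expert sees only its own feature x_i
        h = [w_i * x_i + b_i for w_i, x_i, b_i in zip(self.w, x, self.b)]
        # Attentive gates compute attention factors a_i (they sum to 1)
        a = softmax([g_i * h_i for g_i, h_i in zip(self.g, h)])
        # Experts contribute to y only after modulation by a_i
        y = sum(a_i * h_i for a_i, h_i in zip(a, h))
        return y, a

model = AME(num_experts=3)
y, attention = model.forward([0.5, -1.2, 3.0])
# attention sums to 1 and doubles as the feature importance estimate
```

Because every expert's contribution passes through its attention factor, the attention weights directly quantify how much each feature influenced the prediction.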

SLIDE 17

However, on its own this structure has the same issues as naive soft attention mechanisms:

  • No incentive to learn to output accurate feature importance estimates [1].
  • Often collapses to using only very few experts, or a single expert, early during training [2, 3].

[1] Sundararajan, Taly, and Yan 2017; [2] Bengio et al. 2015; [3] Shazeer et al. 2017

SLIDE 18

Granger-causal Objective

  • Granger (1969) postulated Granger-causality
  • It declares a relationship X → Y if we are better able to predict Y using all available information than if all information apart from X had been used*

* Other assumptions apply that are not relevant in the presented setting.
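In the notation used on the following slides (εX for the prediction error using all features, εX∖{i} for the error with feature i withheld), this criterion can be sketched as:

```latex
% Feature i Granger-causes Y if withholding it increases the error:
\varepsilon_{X \setminus \{i\}} > \varepsilon_{X}
\quad \Longrightarrow \quad x_i \ \text{Granger-causes}\ Y
```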

SLIDE 19

Granger-causal Objective

[Diagram: an auxiliary predictor f_aux uses all experts E1–E4 and attains error εX; an auxiliary predictor f_aux,1 uses all experts except E1 and attains error εX∖{1}]

SLIDE 20

Granger-causal Objective

Error when considering all information: εX

[Diagram as on slide 19]

SLIDE 21

Granger-causal Objective

Error when considering all information: εX. Error when considering all information apart from E1: εX∖{1}.

[Diagram as on slide 19]

SLIDE 22

Granger-causal Objective

We define feature importance as the reduction in prediction error associated with adding that feature.

[Diagram as on slide 19]

SLIDE 23

Granger-causal Objective

We define feature importance as the reduction in prediction error associated with adding that feature.

[Diagram: the error reduction from εX∖{1} to εX determines the attention factor a1]

SLIDE 24

Granger-causal Objective

We define feature importance as the reduction in prediction error associated with adding that feature.

[Diagram: repeated for expert E2, comparing εX with εX∖{2} via f_aux,2]

SLIDE 25

Granger-causal Objective

We define feature importance as the reduction in prediction error associated with adding that feature.

[Diagram: repeated for expert E3, comparing εX with εX∖{3} via f_aux,3]

SLIDE 26

Granger-causal Objective

We define feature importance as the reduction in prediction error associated with adding that feature.

[Diagram: repeated for expert E4, comparing εX with εX∖{4} via f_aux,4]

SLIDE 27

Granger-causal Objective

We define feature importance as the reduction in prediction error associated with adding that feature.

[Diagram: repeated for all experts E1–E4]

We now have a differentiable link between labels (prediction error) and feature importance.
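As an illustration only (a simplified sketch, not the paper's exact training objective), the per-expert error reductions can be turned into normalised importance targets that the attention factors a_i can then be trained to match:

```python
def importance_targets(error_all, errors_without):
    """Turn per-expert error reductions into normalised importance targets.

    error_all      -- eps_X: prediction error when using all experts
    errors_without -- list of eps_X\\{i}: errors with expert i withheld
    """
    # A feature is important if removing it increases the error.
    # Negative reductions (removing the feature helped) are clipped to 0.
    deltas = [max(0.0, e_wo - error_all) for e_wo in errors_without]
    total = sum(deltas)
    if total == 0.0:
        # No expert reduced the error: fall back to uniform importance.
        n = len(deltas)
        return [1.0 / n] * n
    return [d / total for d in deltas]

# Example: withholding expert 1 hurts most, expert 3 not at all.
targets = importance_targets(0.10, [0.40, 0.25, 0.10])
# targets sum to 1 and can supervise the attention factors a_i
```

Because the targets are computed from prediction errors, which are differentiable functions of the labels, the attention distribution receives a training signal tied directly to feature importance.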

SLIDE 28

Evaluation

SLIDE 29

Important Features in Handwritten Digits

SLIDE 30

Important Features in Handwritten Digits

Estimation accuracy comparable to SHAP.

SLIDE 31

Important Features in Handwritten Digits

Orders of magnitude faster at importance estimation

SLIDE 32

Important Features in Handwritten Digits

SLIDE 33

Important Features in Handwritten Digits

Lower MGE (mean Granger-causal error) correlates with better feature importance estimates.

SLIDE 34

Drivers of Medical Prescription Demand

[Bar chart: SMAPE [%], axis 25.00 to 35.00; models RNN, FNN, AME (a = ), AME (a = .4), ARIMA; values shown: 34.98, 33.85, 33.08, 32.87, 32.79]

Slightly lower prediction accuracy when using the AME architecture.

SLIDE 35

Drivers of Medical Prescription Demand

Slightly lower prediction accuracy when using the Granger-causal objective.

[Same SMAPE bar chart as on slide 34]
SLIDE 36

Discriminatory Genes across Cancer Types

[Three heatmaps of per-gene importance scores ai across cancer types (All, BRCA, KIRC, COAD, LUAD, PRAD), one each for AME, SHAP, and LIME; gene rows include e.g. A1BG, A1CF, AADAT, ABCB9, ABCC3, ABCC9]

SLIDE 37

Discriminatory Genes across Cancer Types

[Same heatmaps as on slide 36: AME, SHAP, LIME]

AME discriminates well between (1) cancer types and (2) important and unimportant genes.

SLIDE 38

Discriminatory Genes across Cancer Types

[Bar chart: Recall @ 10, scale 1 to 10; methods AME (a = .5), RF, SHAP, LIME, DeepLIFT, AME (a = ); values shown: 2, 7, 8, 8, 10, 10]

Associations discovered by AMEs are consistent with those reported by domain experts.

SLIDE 39

Discriminatory Genes across Cancer Types

[Same Recall @ 10 bar chart as on slide 38]

The Granger-causal objective is crucial for estimation accuracy.

SLIDE 40

Limitations

  • No information about direction of importance, i.e. negative evidence
  • Large numbers of experts (>200) can become slow at training time
    • Workaround: feature grouping
  • Requires a specific model architecture
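The feature-grouping workaround can be sketched as follows (a hypothetical helper; names and grouping are illustrative): instead of one expert per raw feature, each expert receives a group of related features, which keeps the number of experts manageable:

```python
def group_features(x, groups):
    """Split a flat feature vector into per-expert input groups.

    x      -- flat list of feature values
    groups -- list of index lists, one per expert (assumed disjoint)
    """
    return [[x[i] for i in idx] for idx in groups]

x = [0.5, -1.2, 3.0, 0.7, 2.2]
# 5 features -> 2 experts (e.g. vitals vs. demographics)
expert_inputs = group_features(x, [[0, 1, 2], [3, 4]])
# Each expert now models one group, and the attention factor a_i
# measures the importance of the whole group rather than one feature.
```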
SLIDE 41

Conclusion

SLIDE 42

Conclusion

  • We present a feature importance estimation approach that
    ✔ learns to estimate importance from labelled data
    ✔ produces accurate predictions and importance scores in a single model
    ✔ is orders of magnitude faster at estimating importance than perturbation-based approaches
    ✔ is consistent with associations reported by domain experts

SLIDE 43

Questions?

Patrick Schwab
patrick.schwab@hest.ethz.ch
Institute of Robotics and Intelligent Systems, ETH Zurich
@schwabpa

Schwab, Patrick, Miladinovic, Djordje, and Karlen, Walter. Granger-causal Attentive Mixtures of Experts: Learning Important Features with Neural Networks. AAAI 2019. Source code: github.com/d909b/AME