Granger-causal Attentive Mixtures of Experts: Learning Important Features with Neural Networks (PowerPoint Presentation)

Patrick Schwab1 (@schwabpa), Djordje Miladinovic2, and Walter Karlen1. 1 Institute of Robotics and Intelligent Systems, ETH Zurich; 2 Department of Computer Science, ETH Zurich


SLIDE 1

Granger-causal Attentive Mixtures of Experts: Learning Important Features with Neural Networks

Patrick Schwab1 (@schwabpa), Djordje Miladinovic2, and Walter Karlen1

1 Institute of Robotics and Intelligent Systems, ETH Zurich; 2 Department of Computer Science, ETH Zurich

SLIDE 2

Motivation

SLIDE 3

Motivation

SLIDE 4

Motivation

[Diagram: inputs: Weight, Blood Pressure, Age]

SLIDE 5

Motivation

[Diagram: inputs (Weight, Blood Pressure, Age) → model → output (Heart Failure Risk)]

SLIDE 6

Motivation

[Diagram: inputs (Weight, Blood Pressure, Age) → model → output (Heart Failure Risk)]

SLIDE 7

Motivation

[Diagram: inputs (Weight, Blood Pressure, Age) → model → output (Heart Failure Risk)]

What was the decision based on?

SLIDE 8

Motivation

[Diagram: inputs (Weight, Blood Pressure, Age) → black-box model → output (Heart Failure Risk)]

SLIDE 9

Motivation

[Diagram: inputs (Weight, Blood Pressure, Age) → model → output (Heart Failure Risk)]

We desire an explanation.

SLIDE 10

The Idea

Can we train a neural network to output both (1) accurate predictions and (2) feature importance scores?

SLIDE 11

Use Cases

Schwab et al. Granger-causal Attentive Mixtures of Experts: Learning Important Features with Neural Networks

  • Model understanding
  • Human-ML cooperation: why was this decision made?
  • Does this decision make sense?
  • Are my model's decisions justifiable?
  • What patterns has my model discovered?
SLIDE 12

Approach

SLIDE 13

Attentive Mixture of Experts (AME)

[Diagram: experts E1, E2, E3 each produce a hidden state h1, h2, h3; attentive gating networks G1, G2, G3 read h_all = (h1, c1, h2, c2, h3, c3) and output attention factors a1, a2, a3; the prediction y is the sum of the attention-weighted expert contributions. The attention is Granger-causally grounded.]

SLIDE 14

Attentive Mixture of Experts (AME)

[AME diagram as on slide 13]

One independent expert per feature / feature group

SLIDE 15

Attentive Mixture of Experts (AME)

[AME diagram as on slide 13]

Attentive gates control expert contributions

SLIDE 16

Attentive Mixture of Experts (AME)

[AME diagram as on slide 13]

Experts can only contribute to y after modulation by ai
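The structure described on these slides can be illustrated in code. The following is a minimal, hypothetical simplification (each expert and gate is a single linear unit, whereas the paper's experts are full neural networks), showing how y is formed only from attention-modulated expert contributions:

```python
import math
import random

random.seed(0)

def softmax(zs):
    """Numerically stable softmax over a list of scores."""
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    s = sum(exps)
    return [e / s for e in exps]

class AME:
    """Sketch of an Attentive Mixture of Experts (illustrative only)."""

    def __init__(self, num_experts):
        # One independent expert per feature: h_i = w_i * x_i + b_i
        self.w = [random.uniform(-1, 1) for _ in range(num_experts)]
        self.b = [0.0] * num_experts
        # Gating weights: each gate G_i scores its expert's contribution
        self.g = [random.uniform(-1, 1) for _ in range(num_experts)]

    def forward(self, x):
        # Each expert sees only its own feature x_i
        h = [w_i * x_i + b_i for w_i, x_i, b_i in zip(self.w, x, self.b)]
        # Attentive gates compute attention factors a_i (they sum to 1)
        a = softmax([g_i * h_i for g_i, h_i in zip(self.g, h)])
        # Experts contribute to y only after modulation by a_i
        y = sum(a_i * h_i for a_i, h_i in zip(a, h))
        return y, a

model = AME(num_experts=3)
y, attention = model.forward([0.5, -1.2, 3.0])
# attention sums to 1 and doubles as the feature importance estimate
```

Because every expert's contribution passes through its attention factor, the attention weights directly quantify how much each feature influenced the prediction.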

SLIDE 17

However, on its own this structure has the same issues as naive soft attention mechanisms:

  • No incentive to learn to output accurate feature importance estimates [1].
  • Often collapses to using only very few experts, or a single expert, early during training [2, 3].

[1] Sundararajan, Taly, and Yan 2017; [2] Bengio et al. 2015; [3] Shazeer et al. 2017

SLIDE 18

Granger-causal Objective

  • Granger (1969) postulated Granger-causality
  • It declares a relationship X → Y if we are better able to predict Y using all available information than if all information apart from X had been used*

* Other assumptions apply that are not relevant in the presented setting.
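In the notation used on the following slides (εX for the prediction error using all features, εX∖{i} for the error with feature i withheld), this criterion can be sketched as:

```latex
% Feature i Granger-causes Y if withholding it increases the error:
\varepsilon_{X \setminus \{i\}} > \varepsilon_{X}
\quad \Longrightarrow \quad x_i \ \text{Granger-causes}\ Y
```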

SLIDE 19

Granger-causal Objective

[Diagram: an auxiliary predictor f_aux uses all experts E1–E4 and attains error εX; an auxiliary predictor f_aux,1 uses all experts except E1 and attains error εX∖{1}]

SLIDE 20

Granger-causal Objective

Error when considering all information: εX

[Diagram as on slide 19]

SLIDE 21

Granger-causal Objective

Error when considering all information: εX. Error when considering all information apart from E1: εX∖{1}.

[Diagram as on slide 19]

SLIDE 22

Granger-causal Objective

We define feature importance as the reduction in prediction error associated with adding that feature.

[Diagram as on slide 19]

SLIDE 23

Granger-causal Objective

We define feature importance as the reduction in prediction error associated with adding that feature.

[Diagram: the error reduction from εX∖{1} to εX determines the attention factor a1]

SLIDE 24

Granger-causal Objective

We define feature importance as the reduction in prediction error associated with adding that feature.

[Diagram: repeated for expert E2, comparing εX with εX∖{2} via f_aux,2]

SLIDE 25

Granger-causal Objective

We define feature importance as the reduction in prediction error associated with adding that feature.

[Diagram: repeated for expert E3, comparing εX with εX∖{3} via f_aux,3]

SLIDE 26

Granger-causal Objective

We define feature importance as the reduction in prediction error associated with adding that feature.

[Diagram: repeated for expert E4, comparing εX with εX∖{4} via f_aux,4]

SLIDE 27

Granger-causal Objective

We define feature importance as the reduction in prediction error associated with adding that feature.

[Diagram: repeated for all experts E1–E4]

We now have a differentiable link between labels (prediction error) and feature importance.
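As an illustration only (a simplified sketch, not the paper's exact training objective), the per-expert error reductions can be turned into normalised importance targets that the attention factors a_i can then be trained to match:

```python
def importance_targets(error_all, errors_without):
    """Turn per-expert error reductions into normalised importance targets.

    error_all      -- eps_X: prediction error when using all experts
    errors_without -- list of eps_X\\{i}: errors with expert i withheld
    """
    # A feature is important if removing it increases the error.
    # Negative reductions (removing the feature helped) are clipped to 0.
    deltas = [max(0.0, e_wo - error_all) for e_wo in errors_without]
    total = sum(deltas)
    if total == 0.0:
        # No expert reduced the error: fall back to uniform importance.
        n = len(deltas)
        return [1.0 / n] * n
    return [d / total for d in deltas]

# Example: withholding expert 1 hurts most, expert 3 not at all.
targets = importance_targets(0.10, [0.40, 0.25, 0.10])
# targets sum to 1 and can supervise the attention factors a_i
```

Because the targets are computed from prediction errors, which are differentiable functions of the labels, the attention distribution receives a training signal tied directly to feature importance.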

SLIDE 28

Evaluation

SLIDE 29

Important Features in Handwritten Digits

SLIDE 30

Important Features in Handwritten Digits

Estimation accuracy comparable to SHAP.

SLIDE 31

Important Features in Handwritten Digits

Orders of magnitude faster at importance estimation

SLIDE 32

Important Features in Handwritten Digits

SLIDE 33

Important Features in Handwritten Digits

Lower MGE (mean Granger-causal error) correlates with better feature importance estimates.

SLIDE 34

Drivers of Medical Prescription Demand

[Bar chart: SMAPE [%], axis 25.00 to 35.00; models RNN, FNN, AME (a = ), AME (a = .4), ARIMA; values shown: 34.98, 33.85, 33.08, 32.87, 32.79]

Slightly lower prediction accuracy when using the AME architecture.

SLIDE 35

Drivers of Medical Prescription Demand

Slightly lower prediction accuracy when using the Granger-causal objective.

[Same SMAPE bar chart as on slide 34]
SLIDE 36

Discriminatory Genes across Cancer Types

[Three heatmaps of per-gene importance scores ai across cancer types (All, BRCA, KIRC, COAD, LUAD, PRAD), one each for AME, SHAP, and LIME; gene rows include e.g. A1BG, A1CF, AADAT, ABCB9, ABCC3, ABCC9]

SLIDE 37

Discriminatory Genes across Cancer Types

[Same heatmaps as on slide 36: AME, SHAP, LIME]

AME discriminates well between (1) cancer types and (2) important and unimportant genes.

SLIDE 38

Discriminatory Genes across Cancer Types

[Bar chart: Recall @ 10, scale 1 to 10; methods AME (a = .5), RF, SHAP, LIME, DeepLIFT, AME (a = ); values shown: 2, 7, 8, 8, 10, 10]

Associations discovered by AMEs are consistent with those reported by domain experts.

SLIDE 39

Discriminatory Genes across Cancer Types

[Same Recall @ 10 bar chart as on slide 38]

The Granger-causal objective is crucial for estimation accuracy.

SLIDE 40

Limitations

  • No information about direction of importance, i.e. negative evidence
  • Large numbers of experts (>200) can become slow at training time
    • Workaround: feature grouping
  • Requires a specific model architecture
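The feature-grouping workaround can be sketched as follows (a hypothetical helper; names and grouping are illustrative): instead of one expert per raw feature, each expert receives a group of related features, which keeps the number of experts manageable:

```python
def group_features(x, groups):
    """Split a flat feature vector into per-expert input groups.

    x      -- flat list of feature values
    groups -- list of index lists, one per expert (assumed disjoint)
    """
    return [[x[i] for i in idx] for idx in groups]

x = [0.5, -1.2, 3.0, 0.7, 2.2]
# 5 features -> 2 experts (e.g. vitals vs. demographics)
expert_inputs = group_features(x, [[0, 1, 2], [3, 4]])
# Each expert now models one group, and the attention factor a_i
# measures the importance of the whole group rather than one feature.
```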
SLIDE 41

Conclusion

SLIDE 42

Conclusion

  • We present a feature importance estimation approach that
    ✔ learns to estimate importance from labelled data
    ✔ produces accurate predictions and importance scores in a single model
    ✔ is orders of magnitude faster at estimating importance than perturbation-based approaches
    ✔ is consistent with associations reported by domain experts

SLIDE 43

Questions?

Patrick Schwab
patrick.schwab@hest.ethz.ch
Institute of Robotics and Intelligent Systems, ETH Zurich
@schwabpa

Schwab, Patrick, Miladinovic, Djordje, and Karlen, Walter. Granger-causal Attentive Mixtures of Experts: Learning Important Features with Neural Networks. AAAI 2019. Source code: github.com/d909b/AME