SLIDE 1

Intelligible Models for Classification and Regression

Yin Lou¹, Rich Caruana², Johannes Gehrke¹

¹Department of Computer Science, Cornell University
²Microsoft Research, Microsoft Corporation

Aug. 13, 2012

SLIDE 2

Motivation

Simple Model

Linear regression, logistic regression

Regression: y = β0 + β1x1 + ... + βnxn
Classification: logit(y) = β0 + β1x1 + ... + βnxn

Intelligible but usually less accurate

Figure: a linear regression fit.


SLIDE 4

Motivation

Complex Model

Random forests, SVMs with RBF kernel, etc.

y = f(x1, ..., xn)

Unintelligible but usually more accurate

Figure: a random forest.


SLIDE 6

Motivation

The tradeoff

Figure: the intelligibility-complexity spectrum. Linear regression and logistic regression sit at the intelligible end; SVMs with RBF kernel and random forests sit at the complex end. What fills the gap in between?

Intelligibility is important:
  • Medical applications
  • Domains where we want scientific understanding
  • Efficient model engineering (e.g., impact of features in a ranker)


SLIDE 8

Outline

1. Motivation
2. Towards More Accurate Models
3. Algorithms
4. Experiments
5. Discussion
6. Conclusion


SLIDE 10

Generalized Additive Models

Developed by Hastie and Tibshirani.

Regression: y = f1(x1) + ... + fn(xn)
Classification: logit(y) = f1(x1) + ... + fn(xn)

Each feature is "shaped" by a shape function fi.
Intelligible and accurate.

  • T. Hastie and R. Tibshirani. Generalized Additive Models. Chapman & Hall/CRC, 1990.
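
To make the form concrete, here is a minimal sketch of fitting a GAM in Python with the third-party pyGAM package; the library choice, data, and smoothing settings are assumptions of this sketch, not something the talk prescribes.

```python
# Minimal GAM sketch (assumption: pyGAM, `pip install pygam`; the talk
# does not prescribe a library). One smooth term s(j) per feature gives
# y = f1(x1) + f2(x2) + f3(x3).
import numpy as np
from pygam import LinearGAM, s

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 3))
y = X[:, 0] + np.sin(6 * X[:, 1]) + X[:, 2] ** 2 + rng.normal(0, 0.1, 500)

gam = LinearGAM(s(0) + s(1) + s(2)).fit(X, y)
print(gam.predict(X[:5]))  # additive predictions
```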


SLIDE 11

Example

y = x1 + x2² + √x3 + log x4 + exp(x5) + 2 sin x6 + ε

Figure: shape functions f1(x1), ..., f6(x6) for the synthetic dataset.
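
For reference, a sketch of how such a dataset can be generated; the sampling range and noise scale are assumptions (the slide only gives the formula), with n = 10,000 taken from the datasets table later in the deck.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
# Inputs kept positive so that sqrt and log are defined (an assumption).
X = rng.uniform(0.1, 2.0, size=(n, 6))
eps = rng.normal(0, 0.1, n)  # noise scale is an assumption
y = (X[:, 0] + X[:, 1] ** 2 + np.sqrt(X[:, 2]) + np.log(X[:, 3])
     + np.exp(X[:, 4]) + 2 * np.sin(X[:, 5]) + eps)
```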


SLIDE 12

Model Space

Model                        Form                             Intelligibility  Accuracy
Linear Model                 y = β0 + β1x1 + ... + βnxn       +++              +
Generalized Linear Model     g(y) = β0 + β1x1 + ... + βnxn    +++              +
Additive Model               y = f1(x1) + ... + fn(xn)        ++               ++
Generalized Additive Model   g(y) = f1(x1) + ... + fn(xn)     ++               ++
Full Complexity Model        y = f(x1, ..., xn)               +                +++

Table: From Linear to Additive Models.



SLIDE 14

Fitting GAMs

g(y) = f1(x1) + ... + fn(xn)

Shape Functions:
  • Splines (SP)
  • Single Tree (TR)
  • Bagged Trees (bagTR)
  • Boosted Trees (bstTR)
  • Boosted Bagged Trees (bbTR)

Learning Methods:
  • Penalized Least Squares (P-LS/P-IRLS)
  • Backfitting (BF)
  • Gradient Boosting (BST)


SLIDE 16

Fitting GAMs

g(y) = f1(x1) + ... + fn(xn)

Shape Function: Splines (SP)

fi(xi) = Σ_{k=1..d} βk bk(xi)

Figure: an example spline shape function.
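
A sketch of the idea in Python: expand a feature into spline basis functions bk, then fit the coefficients βk by penalized regression. scikit-learn's SplineTransformer plus ridge is a crude stand-in for the penalized splines (P-LS) the paper actually uses.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import SplineTransformer

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(300, 1))          # one feature
y = np.sin(x[:, 0]) + rng.normal(0, 0.2, 300)  # response

basis = SplineTransformer(degree=3, n_knots=10)  # cubic B-spline basis b_k
B = basis.fit_transform(x)                       # B[i, k] = b_k(x_i)
f_i = Ridge(alpha=1.0).fit(B, y)                 # fits the beta_k
print(f_i.predict(B[:5]))
```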


SLIDE 17

Fitting GAMs

g(y) = f1(x1) + ... + fn(xn)

Shape Function: Single Tree (TR)

fi(xi) = RegressionTree(xi, response)
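
A sketch of a single-tree shape function (an illustration, not the paper's code; the tree size is an assumption):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(300, 1))          # a single feature
y = np.sin(x[:, 0]) + rng.normal(0, 0.2, 300)  # response (or residual)

tree = DecisionTreeRegressor(max_leaf_nodes=8).fit(x, y)
f_i = tree.predict  # piecewise-constant shape function fi(xi)
```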


SLIDE 18

Fitting GAMs

g(y) = f1(x1) + ... + fn(xn)

Shape Function: Bagged Trees (bagTR)

fi(xi) = (1/B) Σ_{j=1..B} RegressionTree(xi, bootstrap sample j)
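
Sketch of a bagged-tree shape function: average B trees, each fit to a bootstrap sample. B and the tree size here are assumptions.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(300, 1))
y = np.sin(x[:, 0]) + rng.normal(0, 0.2, 300)

# Each of the B=100 trees is trained on a bootstrap sample of (x, y);
# predictions are averaged, which mainly reduces variance.
bag = BaggingRegressor(DecisionTreeRegressor(max_leaf_nodes=8),
                       n_estimators=100).fit(x, y)
f_i = bag.predict
```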


SLIDE 19

Fitting GAMs

g(y) = f1(x1) + ... + fn(xn)

Shape Function: Boosted Trees (bstTR)

fi(xi) = Σ_{j=1..B} RegressionTree(xi, residual_j)
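
Sketch of a boosted-tree shape function: each tree fits the residual left by the sum of the previous trees (B and the tree size are assumptions).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(300, 1))
y = np.sin(x[:, 0]) + rng.normal(0, 0.2, 300)

trees, residual = [], y.copy()
for _ in range(100):  # B = 100 boosting steps
    t = DecisionTreeRegressor(max_leaf_nodes=4).fit(x, residual)
    trees.append(t)
    residual -= t.predict(x)  # shrink the remaining error

f_i = lambda xi: sum(t.predict(xi) for t in trees)
```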


SLIDE 20

Fitting GAMs

g(y) = f1(x1) + ... + fn(xn)

Shape Function: Boosted Bagged Trees (bbTR)

fi(xi) = Σ_{j=1..B} BaggedRegressionTree(xi, residual_j)
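
Sketch: boosted bagged trees are the same boosting loop, with each weak learner replaced by a small bagged ensemble (step counts and ensemble sizes are assumptions).

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(300, 1))
y = np.sin(x[:, 0]) + rng.normal(0, 0.2, 300)

ensembles, residual = [], y.copy()
for _ in range(20):  # B = 20 boosting steps, each a 25-tree bag
    e = BaggingRegressor(DecisionTreeRegressor(max_leaf_nodes=4),
                         n_estimators=25).fit(x, residual)
    ensembles.append(e)
    residual -= e.predict(x)

f_i = lambda xi: sum(e.predict(xi) for e in ensembles)
```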


SLIDE 21

Fitting GAMs

g(y) = f1(x1) + ... + fn(xn)

Learning Method: Penalized Least Squares (P-LS/P-IRLS)

Works only with splines (fi(xi) = Σ_{k=1..d} βk bk(xi)).

Converts the optimization problem into fitting linear regression/logistic regression with a different basis.

  • S. Wood. Generalized Additive Models: An Introduction with R. CRC Press, 2006.


SLIDE 22

Fitting GAMs

g(y) = f1(x1) + ... + fn(xn)

Learning Method: Backfitting (BF)

1: fj ← 0 for all j
2: for m = 1 to M do
3:   for j = 1 to n do
4:     R ← {(xij, yi − Σ_{k≠j} fk(xik))}, i = 1, ..., N
5:     Learn shape function S : xj → y using R as training dataset
6:     fj ← S
7:   end for
8: end for

  • T. Hastie and R. Tibshirani. Generalized Additive Models. Chapman & Hall/CRC, 1990.
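
A runnable Python sketch of the pseudocode above, with regression trees as the shape functions (one of the paper's options; M and the tree size are assumptions):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def backfit(X, y, M=10, max_leaf_nodes=8):
    n = X.shape[1]
    f = [np.zeros(len(y)) for _ in range(n)]  # fj evaluated on X
    shapes = [None] * n
    for _ in range(M):
        for j in range(n):
            r = y - (sum(f) - f[j])           # line 4: leave fj out
            S = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes)
            S.fit(X[:, [j]], r)               # line 5
            shapes[j] = S                     # line 6: fj <- S (replace)
            f[j] = S.predict(X[:, [j]])
    return shapes

rng = np.random.default_rng(0)
X = rng.uniform(0.1, 2.0, size=(1000, 3))
y = X[:, 0] + X[:, 1] ** 2 + np.sin(X[:, 2]) + rng.normal(0, 0.1, 1000)
shapes = backfit(X, y)
```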


SLIDE 23

Fitting GAMs

g(y) = f1(x1) + ... + fn(xn)

Learning Method: Gradient Boosting (BST)

1: fj ← 0 for all j
2: for m = 1 to M do
3:   for j = 1 to n do
4:     R ← {(xij, yi − Σk fk(xik))}, i = 1, ..., N
5:     Learn shape function S : xj → y using R as training dataset
6:     fj ← fj + S
7:   end for
8: end for

  • J. Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29:1189–1232, 2001.
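
The same sketch adapted to the boosting variant: the only changes are fitting the full-model residual (line 4) and accumulating rather than replacing fj (line 6). M and the tree size remain assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_gam(X, y, M=50, max_leaf_nodes=4):
    n = X.shape[1]
    f = [np.zeros(len(y)) for _ in range(n)]
    shapes = [[] for _ in range(n)]           # each fj is a sum of trees
    for _ in range(M):
        for j in range(n):
            r = y - sum(f)                    # line 4: full-model residual
            S = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes)
            S.fit(X[:, [j]], r)               # line 5
            shapes[j].append(S)               # line 6: fj <- fj + S
            f[j] += S.predict(X[:, [j]])
    return shapes
```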


SLIDE 24

Contributions

  • First large-scale study that uses trees as shape functions for GAMs
  • Novel methods for using trees as shape functions
  • Largest empirical study of fitting GAMs


SLIDE 26

Datasets

Dataset      Size    Attributes  %Pos

Regression
  Concrete    1030      9         -
  Wine        4898     12         -
  Delta       7192      6         -
  CompAct     8192     22         -
  Music      50000     90         -
  Synthetic  10000      6         -

Classification
  Spambase    4601     58        39.40
  Insurance   9823     86         5.97
  Magic      19020     11        64.84
  Letter     20000     17        49.70
  Adult      46033     9/43      16.62
  Physics    50000     79        49.72


SLIDE 27

Methods

Shape Function         Least Squares   Gradient Boosting   Backfitting
Splines                P-LS/P-IRLS     BST-SP              BF-SP
Single Tree            N/A             BST-TRx             BF-TR
Bagged Trees           N/A             BST-bagTRx          BF-bagTR
Boosted Trees          N/A             BST-TRx             BF-bstTRx
Boosted Bagged Trees   N/A             BST-bagTRx          BF-bbTRx

Table: Notation for learning methods and shape functions.

9 different methods; 5-fold cross-validation for each method.



SLIDE 33

Results

Model            Regression  Classification  Mean
Linear/Logistic     1.68         1.22        1.45
P-LS/P-IRLS         1.00         1.00        1.00
BST-SP              1.04         1.00        1.02
BF-SP               1.00         1.00        1.00
BST-bagTR2          0.96         0.96        0.96
BST-bagTR3          0.97         0.95        0.96
BST-bagTR4          0.99         0.95        0.97
BST-bagTRX          0.95         0.94        0.95
Random Forest       0.88         0.80        0.84

Table: error normalized to P-LS/P-IRLS = 1.00; lower is better.

Observations

  • Two accuracy gaps: shaping and interactions
  • Tree-based shaping methods are more accurate than spline-based methods


SLIDE 35

Bias-Variance Decomposition

Expected Loss = (bias)² + variance + noise

Figure: bias-variance analysis of P-LS, BST-SP, BF-SP, BST-TR2/3/4, BST-bagTR2/3/4, BF-TR, and BF-bagTR on (a) Concrete and (b) Wine.
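
A sketch of how such an analysis can be run on synthetic data, where the true function is known; the bootstrap protocol and learner here are assumptions, not necessarily the paper's exact setup.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 500
X = rng.uniform(0, 2, size=(n, 1))
f_true = np.sin(3 * X[:, 0])
y = f_true + rng.normal(0, 0.3, n)

# Refit the learner on bootstrap resamples and collect its predictions.
preds = []
for _ in range(50):
    idx = rng.integers(0, n, n)
    m = DecisionTreeRegressor().fit(X[idx], y[idx])  # deep tree: low bias
    preds.append(m.predict(X))
preds = np.array(preds)

bias2 = np.mean((preds.mean(axis=0) - f_true) ** 2)
variance = np.mean(preds.var(axis=0))
print(f"bias^2 = {bias2:.3f}, variance = {variance:.3f}")
```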


SLIDE 36

Learned Shape Functions: Splines vs. Trees

Figure: Shapes of five features of the "Concrete" dataset (Blast Furnace Slag, Fly Ash, Superplasticizer, Coarse Aggregate, Fine Aggregate) produced by P-LS (top) and BST-bagTR3 (bottom).



SLIDE 38

Conclusion

  • Generalized additive models are accurate and intelligible
  • Trees have low bias but high variance
  • Bagging reduces variance and makes tree-based methods stand out
  • Bagged shallow trees with gradient boosting are most accurate


SLIDE 39

Future Work

  • Feature selection
  • Scalability
  • Statistical interaction detection


SLIDE 40

Thank You

Questions?
