
SLIDE 1

A Utility-based Approach for Estimating Conditional Probability of Human Lung Cancer From Microarray Data

Craig Friedman (craig.friedman@cims.nyu.edu) Wenbo Cao (wcao@gc.cuny.edu)

SLIDE 2

  • Introduction
  • Measuring model performance and building probabilistic models from a model-user's perspective
  • Probabilistic models from microarray data
    – 2-state conditional probability (NL or AD)
    – 6-state conditional probability (NL, AD, SMCL, SQ, COID, OA)
    – Conditional density (survival models, time permitting)
  • Conclusion
SLIDE 3

Introduction Conditional Probability Models

  • Prob(Y=y|x), where
    – x is a vector of explanatory variables (for example, microarray data),
    – Y is a random variable or vector (for example, Y in {NL, AD}), and
    – y is a particular value of Y (for example, y=AD).
  • Given one (or more) cutoff(s), each conditional probability model can be converted to a classification model.
    – For example, classify as AD if and only if Prob(Y=AD|x) > cutoff.
  • Conditional probability models tell us directly about the probabilities. In general, they provide more information than classification models.
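The cutoff rule above is a one-line transformation; a minimal sketch (the function and variable names are illustrative, not from the slides):

```python
def classify(prob_ad, cutoff=0.5):
    """Convert a conditional probability Prob(Y=AD|x) into a class label
    by thresholding: classify as AD if and only if the probability
    exceeds the cutoff."""
    return "AD" if prob_ad > cutoff else "NL"

print(classify(0.83))              # → AD
print(classify(0.83, cutoff=0.9))  # → NL
```

Sweeping the cutoff from 0 to 1 traces out the ROC curve, which is why one probability model induces a whole family of classifiers.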

SLIDE 4

Introduction Principal References

  • Friedman, C. and Sandow, S., "Model Performance Measures for Expected Utility Maximizing Investors," International Journal of Theoretical and Applied Finance, June 2003.
  • Friedman, C. and Sandow, S., "Learning Probabilistic Models: An Expected Utility Maximization Approach," Journal of Machine Learning Research, July 2003.
  • Friedman, C. and Huang, J., "A Utility-Based Public Firm Default Probability Model," Working Paper, 2003.
  • Friedman, C. and Sandow, S., "Ultimate Recoveries," Risk, August 2003.

SLIDE 5

Introduction Model Performance Measures

  • Develop good conditional probability models.
    – To do so, we must have a way to measure model performance.
    – The models will be used by model-users to make decisions; performance should be measured accordingly.

SLIDE 6

Performance Measures Utility Theory Elements

  • Utility functions assign values (utilities) to random wealth levels (the power-2 utility is used by Morningstar to rank funds).
  • Utility functions characterize the investor's risk aversion.
  • Rational investors maximize their expected utility (from utility theory).
SLIDE 7

Performance Measures: Utility Theory Example

A fair coin is tossed; each game requires a $1 stake.
  – Game 1 pays $1.10 or $0.91.
  – Game 2 pays $2.00 or $0.80.

Maximizing expected utility, the more risk-averse investor chooses Game 1; the less risk-averse investor chooses Game 2.
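The slide states the conclusion without the computation; a sketch using the power (CRRA) utility family, with illustrative risk-aversion levels (the gamma values are assumptions, not from the slides):

```python
import math

def crra_utility(w, gamma):
    """Power (CRRA) utility; gamma is the coefficient of relative risk aversion."""
    if gamma == 1.0:
        return math.log(w)
    return (w ** (1.0 - gamma) - 1.0) / (1.0 - gamma)

def expected_utility(payoffs, probs, gamma):
    return sum(p * crra_utility(w, gamma) for w, p in zip(payoffs, probs))

game1 = [1.10, 0.91]   # payoffs on heads / tails
game2 = [2.00, 0.80]
probs = [0.5, 0.5]     # fair coin

for gamma in (0.5, 10.0):   # less vs. more risk-averse investor
    eu1 = expected_utility(game1, probs, gamma)
    eu2 = expected_utility(game2, probs, gamma)
    print(f"gamma={gamma}: chooses", "Game 1" if eu1 > eu2 else "Game 2")
```

With these payoffs the less risk-averse investor (gamma = 0.5) picks Game 2 and the more risk-averse one (gamma = 10) picks Game 1, matching the slide.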

SLIDE 8

Performance Measures Utility Theory Elements

  • More is preferred to less (the utility function is a strictly increasing function of wealth).
  • The slope of the utility function decreases as wealth increases (a gift of $1 provides more utility when your wealth is low than when it is high).

SLIDE 9

Performance Measures

These model performance measures
  – are natural extensions of the axioms of utility theory,
  – have a widely used practical implementation (log-likelihood),
  – apply in many probabilistic contexts (not just 2-state, e.g., ROC), and
  – are consistent with our approach to model formulation.

SLIDE 10

Performance Measures

  • Assumptions:
    – an investor/model-user with a utility function,
    – a market with an odds ratio for each state (NL and AD have different payoffs), and
    – a model-user who believes the model and makes decisions to maximize expected utility (a consequence of utility theory).
  • Paradigm: we base our model performance measure on an (out-of-sample) estimate of expected utility.
  • Accurate models allow for effective decisions/investment strategies.
  • Inaccurate models induce over-betting and under-betting.
  • Our performance measures have a financial interpretation.
SLIDE 11

Performance Measures: An Important Class of Utilities

  • Interpretations (for a rich, tractable class of utility functions)

– Estimated wealth growth rate pickup (for a certain type of investor) who uses model 2 rather than model 1 – Logarithm of likelihood ratio (our old friend from classical Statistics) – Performance measure that generates an optimal (in the sense of the Neyman-Pearson Lemma) decision surface.
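For the logarithmic member of this utility class, the wealth growth rate pickup reduces to the mean log-likelihood ratio on out-of-sample outcomes; a minimal sketch (the sample probabilities are illustrative):

```python
import math

def ewgrp(p_model2, p_model1):
    """Expected wealth growth rate pickup of model 2 over model 1 for a
    log-utility investor: the mean log-likelihood ratio, where each p is
    the probability a model assigned to the realized outcome."""
    n = len(p_model2)
    return sum(math.log(p2 / p1) for p2, p1 in zip(p_model2, p_model1)) / n

# probabilities each model assigned to the outcome that actually occurred
p2 = [0.90, 0.80, 0.70, 0.95]
p1 = [0.60, 0.50, 0.55, 0.70]
print(ewgrp(p2, p1))  # positive => model 2 outperforms model 1
```

A positive value is the extra wealth growth rate a Kelly-style investor would have earned by betting on model 2's probabilities instead of model 1's.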

SLIDE 12

Performance Measures: An Important Class of Utilities

  • Morningstar uses the power utility with power 2.
  • This is how a member of our family approximates Morningstar's utility function.

SLIDE 13

Performance Measures

The investor has utility function U(W).

SLIDE 14

Performance Measures Our Paradigm

SLIDE 15

Performance Measures Our Paradigm

SLIDE 16

Maximum Expected Utility Models Introduction

  • To find a model, we balance, from an investor/model-user's perspective:
    – consistency with the data (there is a precise measure), and
    – consistency with prior beliefs (there is a precise measure).
  • Result: a one-hyperparameter family of models (an efficient frontier), each of which is associated with a given level of consistency with the data. Each model
    – asymptotically maximizes expected utility over a potentially rich family of models, and
    – is robust: it maximizes outperformance of a benchmark model under the most adverse true measure (this can be made precise).
  • Choose the optimal hyperparameter value by maximizing expected utility on an out-of-sample data set.
SLIDE 17

Maximum Expected Utility Models Formulation

  • We define the notion of dominance (of one model measure over another). We select a measure on the efficient frontier.

SLIDE 18

Maximum Expected Utility Models Formulation

  • The probability simplex for a 2-state model:
SLIDE 19

Maximum Expected Utility Models Formulation

  • The probability simplex for a 3-state model:
SLIDE 20

Maximum Expected Utility Models Formulation

SLIDE 21

MEU Modeling Approach

  • Balance

    – Consistency with the data
    – Consistency with prior beliefs

SLIDE 22

MEU Modeling Approach Dual Problem

  • We solve the dual of the above optimization problem, which amounts to a maximization with respect to a set of Lagrange multipliers.
  • The dual problem can be interpreted as the search for a maximum-likelihood exponential model, with regularization.
  • Choose α to maximize the out-of-sample log-likelihood.
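This is not the authors' MEU solver, but the analogous regularized maximum-likelihood workflow can be sketched with an L2-penalized logistic model whose regularization weight is chosen on a held-out set (all data here are synthetic and purely illustrative):

```python
import numpy as np

def fit_logistic(X, y, alpha, steps=3000, lr=0.5):
    """Gradient ascent on the L2-regularized log-likelihood -- a simple
    stand-in for the regularized dual problem described above."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w += lr * (X.T @ (y - p) / len(y) - alpha * w)
    return w

def mean_log_likelihood(X, y, w):
    p = np.clip(1.0 / (1.0 + np.exp(-X @ w)), 1e-12, 1 - 1e-12)
    return float(np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

# synthetic 2-state data
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
true_w = np.array([2.0, -1.0, 0.5, 0.0, 0.0])
y = (rng.random(300) < 1.0 / (1.0 + np.exp(-X @ true_w))).astype(float)
X_tr, y_tr, X_te, y_te = X[:200], y[:200], X[200:], y[200:]

# choose the regularization hyperparameter by out-of-sample log-likelihood
alphas = [0.001, 0.01, 0.1, 1.0]
best = max(alphas, key=lambda a: mean_log_likelihood(X_te, y_te, fit_logistic(X_tr, y_tr, a)))
print("selected alpha:", best)
```

The grid search over `alphas` plays the role of moving along the efficient frontier: each value trades consistency with the data against the strength of the prior.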
SLIDE 23

MEU Modeling Approach

  • The dual problem is strictly convex.
  • The model produced is theoretically robust (best/worst-case measure).
  • Features allow for flexibility.
  • The approach avoids overfitting.
SLIDE 24

Maximum Expected Utility Models Logistic Regression

The models can be made flexible enough to conform to the data, but avoid overfitting.

Linear logistic regression (and some generalizations) has the axis-alignment property.

SLIDE 25

Toy (2 Variable) Problem

  • To illustrate, we select 2 variables based on data from the Wisconsin Breast Cancer Databases:
    – standard error of texture, and
    – fractal dimension.
  • We want to model the probability of malignancy, given the standard error of texture and the fractal dimension.

SLIDE 26

Toy (2 Variable) Version Data

  • (We have transformed the data so that they lie in the unit square.)

SLIDE 27

Toy (2 Variable) Problem The Model

  • For each explanatory variable pair, we seek the function prob(malignancy | se, fd), where
    – se = standard error of texture, and
    – fd = fractal dimension.

SLIDE 28

We model 2 ways

  • Linear logistic regression
  • MEU methodology
SLIDE 29

Toy Problem Same Data, 2 Different Models

  • Logistic regression
  • MEU

SLIDE 30

Toy Problem Contour Plots

  • The MEU model is clearly more consistent with the data, without overfitting. Linear logit is too stiff to reflect the story told by the data.

SLIDE 31

Model Performance Results Toy (2 Variable) Version

  • Logistic regression:
    – ROC: 0.4646
    – EWGRP: -0.0035
  • MEU model:
    – ROC: 0.6196
    – EWGRP: 0.0223
  (EWGRP = expected wealth growth rate pickup)
  • By adding additional explanatory variables, we can improve model performance.
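The ROC figures above are areas under the ROC curve; a minimal AUC computation via the rank (Mann-Whitney) formulation, with made-up scores:

```python
def auc(scores, labels):
    """Area under the ROC curve: the probability that a randomly chosen
    positive example outscores a randomly chosen negative one (ties count half)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0]))  # → 1.0 (perfect ranking)
```

An AUC of 0.5 corresponds to random ranking, which is why the logistic regression's 0.4646 above is worse than chance on this toy problem.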

SLIDE 32

Model Performance MEU vs Benchmark

  • Overfit linear Logit versus MEU model
SLIDE 33

Applications 2 Categories: Harvard and Michigan Data

Categories:
  1) AD and suspected AD
  2) Normal lung

Results, using 58 common probe-set variables and level-set features:

Training on Harvard data, testing on Michigan data:
  – MEU method: ROC = 1, delta = 0.3341
  – Linear logit: ROC = 0.8186, delta = -22.1663
Training on Michigan data, testing on Harvard data:
  – MEU method: ROC = 1, delta = 0.3276
  – Linear logit: ROC = 0.7245, delta = -1.5982

We note that the MEU model produced perfect classification on the out-of-sample data sets.
SLIDE 34

Applications 2 Categories: Harvard and Michigan Data

  • Training on Harvard, testing on Michigan

– Blue: Normal Lung; Red: AD and Suspected AD (OA)

SLIDE 35

Applications 2 Categories: Harvard and Michigan Data

  • Training on Michigan, testing on Harvard

– Blue: Normal Lung; Red: AD and Suspected AD (OA)

SLIDE 36

Applications Multi-Category: Harvard Data

  • Categories:
    – NL (17 samples): 1 -- (0 0 0 0 0 1), normal lung
    – AD (127 samples): 2 -- (0 0 0 0 1 0), lung adenocarcinomas
    – SMCL (6 samples): 3 -- (0 0 0 1 0 0), small-cell lung carcinomas (SCLC)
    – SQ (21 samples): 4 -- (0 0 1 0 0 0), squamous cell lung carcinomas
    – COID (20 samples): 5 -- (0 1 0 0 0 0), pulmonary carcinoids
    – OA (12 samples): 6 -- (1 0 0 0 0 0), other adenocarcinomas (which were suspected to be extrapulmonary metastases)
  • Variables: the 8 most important variables from the 58 common probe-set variables.
  • Features: linear + quadratic + Gaussian (10 k-means centers for the Gaussian features)
  • ? = 0.4408, ?0 = -10.7712
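The maximum-likelihood exponential model from the dual problem assigns each of the six states a probability via a softmax over per-state scores; a minimal sketch with illustrative scores (the score values are assumptions; the state order follows the one-hot codes above):

```python
import math

STATES = ["OA", "COID", "SQ", "SMCL", "AD", "NL"]  # order of the one-hot codes above

def softmax(scores):
    """Turn per-state scores into a probability distribution (exponential-model form)."""
    m = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

probs = softmax([0.1, -0.5, 0.3, -2.0, 1.8, 0.0])   # illustrative scores only
for state, p in zip(STATES, probs):
    print(f"{state}: {p:.3f}")
```

The scores themselves would come from the fitted linear, quadratic, and Gaussian features; the softmax step is what turns them into the 6-state conditional probabilities plotted on the next slides.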

SLIDE 37

Applications Multi-Category: Harvard Data

Variables (ranked probe sets):
  1. AFFX-BioDn-3_at
  2. AFFX-BioC-3_at
  3. AFFX-BioC-5_st
  4. AFFX-HUMGAPDH/M33197_5_st
  5. AFFX-CreX-5_at
  6. AFFX-BioDn-5_st
  7. AFFX-CreX-3_st
  8. AFFX-DapX-3_at

SLIDE 38

Multi-category Prob(Y=y|patient index)

  • Preliminary Results
  • Point color in the figures:
  • Blue points with symbol "+" : NL
  • Red points with symbol "." : AD
  • Black points with symbol "*" : OA
  • Green points with symbol "*" : SMCL
  • Cyan points with symbol "x" : SQ
  • Magenta points with symbol "+" : COID

SLIDE 39

Multi-category Concentration Measure

Concentration measure: 1 = certainty, 0 = uncertainty. Point colors in the figures:

  • Blue points with symbol "+" : NL
  • Red points with symbol "." : AD
  • Black points with symbol "*" : OA
  • Green points with symbol "*" : SMCL
  • Cyan points with symbol "x" : SQ
  • Magenta points with symbol "+" : COID
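The slides do not spell out the concentration formula; one common choice with exactly this 0-to-1 behavior is one minus the normalized Shannon entropy (an assumption for illustration, not the authors' stated definition):

```python
import math

def concentration(probs):
    """Assumed concentration measure: 1 - normalized Shannon entropy.
    Returns 1.0 for a point mass (full certainty) and 0.0 for the
    uniform distribution (maximal uncertainty)."""
    n = len(probs)
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return 1.0 - entropy / math.log(n)

print(concentration([1.0, 0, 0, 0, 0, 0]))   # → 1.0
print(round(concentration([1/6] * 6), 6))    # → 0.0
```

Any measure with these endpoints lets one read the scatter plots the same way: points near 1 are samples the model classifies with near certainty.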

SLIDE 40

Application Survival Time Density Modeling

  • p(t|x): the probability density of survival time t, conditioned on a vector x of explanatory variables.

SLIDE 41

Survival Time Model Performance

  • Model performance measure, delta: the gain in expected logarithmic utility (wealth growth rate) with respect to a non-informative model for survival time.
  • Maximum expected utility model: delta = 0.3536.

SLIDE 42

Conditional Survival Time Probability Density as a Function of the J04423 E. coli bioD Gene

Other explanatory variables are set to their median values.

SLIDE 43

Conditional Survival Time Probability Density as a Function of Gene Expression

Blue curves correspond to several gene expression levels; other variables are set to their median values.

SLIDE 44

Conclusion

  • We have used model performance measures based on the perspective of an investor/model-user.
  • We have built microarray-data conditional probability models that are approximately optimal with respect to the above performance measures. These models
    – are numerically robust (convex programming),
    – are theoretically robust (best/worst-case measure),
    – are flexible,
    – do not overfit, and
    – perform well in practice.

SLIDE 45

References

  • References available on request: craig.friedman@cims.nyu.edu

SLIDE 46

  • AFFX-BioB-5_at
  • AFFX-BioB-m_at
  • AFFX-BioB-3_at
  • AFFX-BioC-5_at
  • AFFX-BioC-3_at
  • AFFX-BioDn-5_at
  • AFFX-BioDn-3_at
  • AFFX-CreX-5_at
  • AFFX-CreX-3_at
  • AFFX-BioB-5_st
  • AFFX-BioB-m_st
  • AFFX-BioB-3_st
  • AFFX-BioC-5_st
  • AFFX-BioC-3_st
  • AFFX-BioDn-5_st