

SLIDE 1

THE SCIENCE OF RISK SM

Interaction Detection in GLM – a Case Study

Chun Li, PhD ISO Innovative Analytics March 2012

SLIDE 2

Agenda

  • Case study
  • Approaches
      – Proc Genmod, GAM in R, Proc Arbor
  • Details
  • Summary
SLIDE 3

Case Study

  • Personal Auto loss prediction
      – Pure premium prediction (GLM – Tweedie)
      – Inputs:
          • Environment components
          • Vehicle components
          • Driver components
          • Household components
      – Objective is to detect interactions among the components to further improve model performance
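
For reference, a Tweedie GLM for pure premium is usually written as below. This is a sketch under two assumptions the slide does not state: a log link (the deck mentions a log link only for the GAM fit) and a power parameter p between 1 and 2, the compound Poisson-gamma range normally used for pure premium. The symbols x_{ij}, \beta_j, \phi and p are introduced here for illustration.

```latex
\log \mu_i = \beta_0 + \sum_{j} \beta_j x_{ij},
\qquad
\operatorname{Var}(Y_i) = \phi \, \mu_i^{\,p},
\quad 1 < p < 2
```

Here x_{ij} would be the component relativities (environment, vehicle, driver, household) for policy i.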

SLIDE 4

Components

Environment (frequency and severity for each)

  • Traffic density
  • Traffic composition
  • Traffic generators
  • Weather
  • Experience and trend

Vehicle

  • ISO Symbol relativity
  • Price new relativity
  • Model year relativity
  • Body style and dimension
  • Performance and safety
  • Theft
  • Weather
  • Animal
  • Glass
  • All other perils

Driver

  • Driver characteristics (age, gender, marital status, good student, etc.)

  • Violation history
  • Claim history

Household

  • Usage/mileage
  • Household composition
SLIDE 5

Challenges

  • There are many different approaches that can be used to detect interactions
  • The approach we selected was based on our requirements that:
      – interaction detection be completed in a timely manner
          • despite the large number of observations (>1 million) and large number of interaction pairs (>300)
      – all variables in the final model (including interactions) be interpretable
      – the final model (including interactions) be built in the form of a SAS GLM model

SLIDE 6

Approach

Step 0

  • Build main effect model
  • Aim to model the residual using interaction terms

Step I

  • Automated pair-wise selection
  • Based on standalone contribution

Step II

  • Manual selection from Step I results
  • Based on marginal contribution in GLM

Step III

  • Validation/Refinement/Finalization

* We’ll be focusing on Step I

SLIDE 7

Step I - Details

The purpose of Step I is to separate significant interaction pairs from insignificant ones, so that we can focus on those with higher potential. The principle is to add each pair, one at a time, to a model that predicts the residual, measure that pair's contribution, and rank the pairs by contribution.
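
The screening loop described above can be sketched in a few lines. This is a minimal illustration, not the presenter's macro: `rank_pairs`, the component names, and the scores are all hypothetical, and `contribution` stands in for whichever metric a given method uses. With n components there are n(n-1)/2 unordered pairs to score.

```python
from __future__ import annotations

from itertools import combinations
from typing import Callable, Iterable


def rank_pairs(
    components: Iterable[str],
    contribution: Callable[[str, str], float],
) -> list[tuple[tuple[str, str], float]]:
    """Score every unordered pair of components and sort best-first."""
    scored = [((a, b), contribution(a, b)) for a, b in combinations(components, 2)]
    # Assumes a larger contribution means a more promising interaction.
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical component names and made-up scores, for illustration only.
toy_scores = {
    ("driver", "household"): 0.012,
    ("driver", "weather"): 0.001,
    ("household", "weather"): 0.004,
}

ranking = rank_pairs(
    ["driver", "household", "weather"],
    lambda a, b: toy_scores[(a, b)],
)

for pair, score in ranking:
    print(pair, score)
```

Keeping the metric behind a callable means the same loop can serve all three methods; only the scoring function changes.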

SLIDE 8

Step I - Details

Three methods are used:

– Proc Genmod in SAS
– GAM in R
– Proc Arbor (Regression Tree) in SAS

SLIDE 9

Proc Genmod in SAS

  • Use main effect model as offset
  • Add a component pair to the model
  • Use ‘Increase in Gini’ as the performance metric
  • Created SAS macro to loop through all component pairs and output these pairs ranked according to the performance metric
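
The slides do not define the Gini statistic, so the sketch below assumes the common Lorenz-curve version: order policies by predicted loss, accumulate the share of actual loss, and take one minus twice the area under that curve. The function name `gini` and the four-policy data are made up for illustration; 'Increase in Gini' is then the difference between the model with the candidate pair and the main-effect model alone.

```python
def gini(actual, predicted):
    """Gini index of `actual` losses when policies are ordered by `predicted`.

    Assumption: Lorenz-curve definition; the slides do not say which
    variant of the Gini statistic was used.
    """
    n = len(actual)
    total = sum(actual)
    # Order policies from lowest to highest predicted loss.
    order = sorted(range(n), key=lambda i: predicted[i])
    cum = 0.0   # cumulative share of actual loss
    area = 0.0  # area under the Lorenz curve (trapezoid rule)
    for i in order:
        prev = cum
        cum += actual[i] / total
        area += (prev + cum) / (2 * n)
    return 1.0 - 2.0 * area


# Made-up losses and predictions for four policies.
actual = [0.0, 1.0, 0.0, 4.0]
main_only = [2.0, 1.0, 3.0, 4.0]   # main-effect model (used as offset)
with_pair = [1.0, 3.0, 2.0, 4.0]   # after adding one candidate pair

increase = gini(actual, with_pair) - gini(actual, main_only)
print(round(increase, 4))
```

A pair that reorders policies so that actual losses concentrate at the high-prediction end raises the Gini; a pair that adds nothing leaves it unchanged.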

SLIDE 10

Proc Genmod in SAS

  • Interaction terms
      – Both linear
      – Both binned
      – One linear and one binned

The linear assumption rests on how the components are built: each component (or, in some cases, its log transformation) is developed to have a linear relationship with the target.
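
One common way to encode the three forms is sketched below. The slides do not give the exact encoding or the bin boundaries, so the function names and cut points here are hypothetical: a product term for linear by linear, a grid cell label for binned by binned, and one slope column per bin for linear by binned.

```python
from bisect import bisect_right


def bin_index(x, cut_points):
    """Index of the bin containing x, given ascending cut points."""
    return bisect_right(cut_points, x)


def linear_by_linear(x1, x2):
    # One numeric column: a single coefficient on the product.
    return x1 * x2


def binned_by_binned(x1, x2, cuts1, cuts2):
    # One categorical level per cell of the two-way grid.
    return (bin_index(x1, cuts1), bin_index(x2, cuts2))


def linear_by_binned(x1, x2, cuts2):
    # One column per bin of x2; x1 goes in the active bin, zero elsewhere,
    # so each bin of x2 gets its own slope on x1.
    cols = [0.0] * (len(cuts2) + 1)
    cols[bin_index(x2, cuts2)] = x1
    return cols


# Hypothetical relativities and cut points, for illustration only.
cuts = [0.8, 1.2]
x1, x2 = 1.5, 0.7

product = linear_by_linear(x1, x2)
cell = binned_by_binned(x1, x2, cuts, cuts)
slopes = linear_by_binned(x1, x2, cuts)
print(product, cell, slopes)
```

The binned forms cost more parameters than the product term but do not rely on the linear assumption.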

SLIDE 11

GAM in R

GAM = Generalized Additive Model

– In R package: mgcv
– Able to do Tweedie distribution with log link
– Fits splines
– Multi-dimensional smoothing for interactions
  • Smoothing classes: s(a, b)
  • Tensor product smoothing: te(a, b)
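
As background on the second option, a tensor product smooth of two variables is generally built from two one-dimensional spline bases, as sketched below. The symbols B_i, C_j, \beta_{ij}, I and J are introduced here for illustration and do not come from the slides.

```latex
f(a, b) = \sum_{i=1}^{I} \sum_{j=1}^{J} \beta_{ij} \, B_i(a) \, C_j(b)
```

Because each variable keeps its own basis and penalty, this form does not depend on how a and b are scaled relative to each other, whereas an isotropic smooth such as the default s(a, b) treats both directions alike and is better suited to variables measured on a common scale.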
SLIDE 12

[Figure: Illustration of interaction surface te(X1, X2), plotted over X1 and X2]

SLIDE 13

GAM in R

  • Use main effect model as offset
  • Add a component pair to the model
  • Use ‘Decrease in AIC’ as the performance metric
  • Create R process to loop through all possible component pairs and output these pairs ranked according to the performance metric
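
The 'Decrease in AIC' metric can be illustrated with the standard definition AIC = 2k - 2 ln L. The sketch below is not the presenter's R process; the function names and all numbers are made up. For a penalized smooth, effective degrees of freedom usually stand in for the parameter count k, which is why k need not be an integer here.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2.0 * n_params - 2.0 * log_likelihood


def aic_decrease(base_loglik, base_params, pair_loglik, pair_params):
    """How much AIC drops when the candidate pair is added (positive = better)."""
    return aic(base_loglik, base_params) - aic(pair_loglik, pair_params)


# Made-up fit statistics: the offset-only fit versus the fit with one pair.
decrease = aic_decrease(
    base_loglik=-1000.0, base_params=10.0,
    pair_loglik=-990.0, pair_params=14.5,
)
print(decrease)
```

A pair is only rewarded when its gain in log-likelihood outweighs the degrees of freedom its smooth consumes.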

SLIDE 14

Proc Arbor in SAS

– The same algorithm behind the Decision Tree node in SAS Enterprise Miner
– Can be part of a programmable process
  • Loop through component pairs
  • Build model
  • Evaluate model performance
SLIDE 15

Proc Arbor in SAS

– Use residual of main effect model as target
– Build regression tree using a pair of components
– Performance metric
  • sqrt(MSE*Leaf_Count)
– Created SAS macro to loop through all possible component pairs and output these pairs ranked according to the performance metric
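
The tree metric sqrt(MSE*Leaf_Count) is simple to compute once a tree's fitted values and leaf count are known. The sketch below uses made-up residuals and a hypothetical four-leaf tree; `tree_score` is an illustrative name, and the reading that lower is better is an assumption, since the slides do not state the direction.

```python
import math


def tree_score(residuals, fitted, leaf_count):
    """sqrt(MSE * Leaf_Count) for a regression tree fitted to residuals."""
    n = len(residuals)
    mse = sum((r - f) ** 2 for r, f in zip(residuals, fitted)) / n
    return math.sqrt(mse * leaf_count)


# Made-up main-effect residuals and the tree's per-observation predictions.
residuals = [1.0, -1.0, 2.0, -2.0]
fitted = [0.5, -0.5, 1.5, -1.5]

score = tree_score(residuals, fitted, leaf_count=4)
print(score)
```

Multiplying by the leaf count penalizes trees that reduce error only by splitting more finely, so pairs are compared on a size-adjusted basis.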

SLIDE 16

Example – Collision Coverage

Drivers in the low household relativity segment should have their driver relativity adjusted higher; drivers in the high segment should have it adjusted lower.

[Figure: Driver Relativity by Household Relativity. Axes: Combined Relativity and Driver Relativity. Series: Household Relativity low, med, high]

SLIDE 17

Example – Collision Coverage

In locations where the loss experience is low, the weather relativity needs to be adjusted lower; where it is high, the weather relativity needs to be adjusted higher.

[Figure: Weather Relativity by Experience Relativity. Axes: Combined Relativity and Weather Relativity. Series: Experience Relativity low, med, high]

SLIDE 18

Summary

  • Most of the significant pairs are captured by the Proc Genmod method
      – Closest to the final model format
  • Both GAM in R and Proc Arbor detect additional significant interaction pairs
      – Need to convert to the format that Proc Genmod can handle

SLIDE 19

Take away

  • The methodologies described can be applied generally to variable selection processes
      – May need a variable de-correlation step beforehand (e.g., variable clustering)
  • Significantly reduces the time/effort needed for variable selection

SLIDE 20

Q & A

Questions? Contact: cli@iso.com