Data Cleansing for Predictive Models: The Next Level



slide-1
SLIDE 1

Data Cleansing for Predictive Models: The Next Level

Roosevelt C. Mosley, Jr., FCAS, MAAA CAS Ratemaking & Product Management Seminar Philadelphia, PA March 19 – 21, 2012

Experience the Pinnacle Difference!

slide-2
SLIDE 2

Data Cleaning

Data cleansing – the next level

  • Why simple visualization may not tell the whole story

Data homogeneity

  • There are distinct groups in your underlying data

Multivariate data anomalies

  • Certain combinations of variables may point to data issues

slide-3
SLIDE 3

Data Cleansing – The Next Level

slide-4
SLIDE 4

Data Validation – One and Two Way Summaries

slide-5
SLIDE 5

Data Cleansing – the Next Level

  • One- and two-way data summarization and visualization are key to confirming that individual factors are valid
  • In building predictive models, multivariate techniques consider independent variables simultaneously to account for dependencies
  • Data issues don't exist only in one and two dimensions; they can exist in n dimensions (where n is the number of individual elements)
  • Underlying causes: heterogeneity, data anomalies
  • Multivariate data exploration techniques can be used to address these issues
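As a minimal sketch of the point about issues hiding beyond one dimension, the Python example below builds one-way and two-way count summaries from a few made-up policy records. The field names (`construction`, `protection_class`), the records, and the helper `rare_combinations` are hypothetical and not taken from the presentation. Each value looks unremarkable in its one-way summary, while the combined view exposes a pairing that occurs only once.

```python
from collections import Counter

# Hypothetical policy records (illustrative only, not the presentation's data).
records = [
    {"construction": "Frame", "protection_class": 3},
    {"construction": "Frame", "protection_class": 3},
    {"construction": "Frame", "protection_class": 9},
    {"construction": "Masonry", "protection_class": 3},
    {"construction": "Masonry", "protection_class": 3},
    {"construction": "Masonry", "protection_class": 3},
    {"construction": "Frame", "protection_class": 9},
    {"construction": "Masonry", "protection_class": 9},
]


def one_way(rows, field):
    """Count records by the values of a single field."""
    return Counter(row[field] for row in rows)


def two_way(rows, field_a, field_b):
    """Count records by the combination of two fields."""
    return Counter((row[field_a], row[field_b]) for row in rows)


def rare_combinations(rows, fields, max_count=1):
    """Return combinations of any number of fields seen at most max_count times."""
    counts = Counter(tuple(row[f] for f in fields) for row in rows)
    return {combo: n for combo, n in counts.items() if n <= max_count}


construction_counts = one_way(records, "construction")
class_counts = one_way(records, "protection_class")
pair_counts = two_way(records, "construction", "protection_class")
rare = rare_combinations(records, ["construction", "protection_class"])

print(construction_counts)
print(class_counts)
print(pair_counts)
print(rare)
```

Passing more field names to `rare_combinations` extends the same check to three or more dimensions, although the number of possible combinations grows quickly, which is one reason to move to multivariate techniques.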

slide-6
SLIDE 6

Data Homogeneity

slide-7
SLIDE 7

Clustering/Segmentation

  • Unsupervised classification technique
  • Groups data into a set of discrete clusters, or contiguous groups of cases
  • Performs disjoint cluster analysis on the basis of Euclidean distances computed from one or more quantitative input variables and cluster seeds
  • Objects in each cluster tend to be similar; objects in different clusters tend to be dissimilar
  • Can be used as a dimension reduction technique
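The description above (disjoint clusters, Euclidean distances, cluster seeds) can be illustrated with a small k-means style routine. This is a sketch under stated assumptions, not the tool used in the presentation: the function names, the seeds, and the six two-dimensional points (loosely, coverage in $100,000s and age of home in decades) are all made up for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(point, centers):
    """Index of the center closest to point."""
    return min(range(len(centers)), key=lambda i: euclidean(point, centers[i]))


def k_means(points, seeds, max_iter=100):
    """Assign each point to exactly one cluster, starting from the given seeds."""
    centers = [tuple(s) for s in seeds]
    for _ in range(max_iter):
        labels = [nearest(p, centers) for p in points]
        new_centers = []
        for i, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                # Move the center to the mean of its current members.
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(old)  # keep the old center if a cluster is empty
        if new_centers == centers:
            break
        centers = new_centers
    # Final assignment against the final centers.
    labels = [nearest(p, centers) for p in points]
    return labels, centers


# Hypothetical data: (coverage in $100,000s, age of home in decades).
points = [(1.5, 5.5), (1.6, 5.8), (1.4, 5.2), (2.7, 2.4), (2.6, 2.6), (2.8, 2.5)]
seeds = [(1.0, 5.0), (3.0, 2.0)]

labels, centers = k_means(points, seeds)
print(labels)
print(centers)
```

Because the distances are Euclidean, inputs measured on very different scales would normally be rescaled first; otherwise the variable with the largest numeric range dominates the assignment.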

slide-8
SLIDE 8

Example

Homeowners dataset

Ran clustering analysis using key risk characteristics

  • Amount of insurance
  • Age of home
  • Billing option
  • Construction
  • Protection class
  • Deductible
  • Multiline
  • State/territory

Developed a predictive model on each cluster independently
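The risk characteristics listed for this example mix numeric fields (amount of insurance, age of home) with categorical ones (billing option, construction), while Euclidean distance needs numeric inputs. The presentation does not say how the inputs were prepared, so the following is only one common approach, shown as an assumption: standardize the numeric columns and expand a categorical column into 0/1 indicators. The helper names and the four sample policies are hypothetical.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale a numeric column to mean 0 and standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma > 0 else 0.0 for v in values]


def one_hot(values):
    """Turn a categorical column into 0/1 indicator columns, one per level."""
    levels = sorted(set(values))
    return levels, [[1.0 if v == level else 0.0 for level in levels] for v in values]


# Hypothetical policies (illustrative values only).
amount_of_insurance = [150_000, 250_000, 200_000, 400_000]
age_of_home = [55, 20, 40, 5]
construction = ["Frame", "Masonry", "Frame", "Masonry"]

levels, construction_flags = one_hot(construction)

# One numeric feature row per policy: two standardized columns plus indicators.
features = [
    [amt, age, *flags]
    for amt, age, flags in zip(
        standardize(amount_of_insurance),
        standardize(age_of_home),
        construction_flags,
    )
]

print(levels)
for row in features:
    print(row)
```

How much weight the indicator columns should carry relative to the standardized numeric columns is a modeling choice that this sketch leaves open.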

slide-9
SLIDE 9

Cluster Distance Map

slide-10
SLIDE 10

Cluster Characteristics

                                      Cluster 9    Cluster 20   Total
Coverage A                            155,509      267,415      219,585
Age of Home                           56           25           43
Percent without Multiline Discount    35%          15%          25%

slide-11
SLIDE 11

Billing Plan Indications

[Chart: Indicated Relativity by Bill Plan (Monthly, Semi-Annual, Pay in Full, Mortgagee); series: Total, Cluster 9, Cluster 20]

Data labels: 1.112, 1.000, 1.346, 1.076, 1.281, 1.000, 1.407, 1.116, 0.992, 1.000, 1.192, 1.035

slide-12
SLIDE 12

Deductible Indications

[Chart: Indicated Relativity by Deductible (50, 100, 250, 500, 1000, 2500, 5000, 10000); series: Total, Cluster 9, Cluster 20]

slide-13
SLIDE 13

Multi-Line Indications

[Chart: Indicated Relativity for Multi Line (Auto & Home); series: Total, Cluster 9, Cluster 20]

Data labels: 0.942, 0.907, 0.892

slide-14
SLIDE 14

Multivariate Data Anomalies – Back to Cluster 1

                                     Cluster 1     Total
Average Amount of Insurance          $1,109,048    $219,585
Average Age of Home                  19.6 years    42.7 years
Percentage of Deductibles > $2500    19.9%         1.9%

  • Higher value homes
  • Segment of the business that is certainly heterogeneous – will behave differently than the overall population
  • Represents 0.2% of the overall exposures
  • Should we exclude data points such as these?

slide-15
SLIDE 15

Outlier Data Points

  • Midpoint of the cluster: represents an average risk for that cluster
  • Risk that is slightly different from average, but still fits well with that cluster
  • Potential anomaly – data point fits best within this cluster but is actually an outlier for the cluster. This generally means it doesn't fit well anywhere.
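One way to make the "fits best here but is still an outlier" idea concrete is to compare each point's distance to its nearest cluster center with the typical distance in that cluster. The sketch below is an assumption for illustration, not the method used in the presentation: the centers are taken as given from an earlier clustering step, and the function name, the `ratio=3.0` threshold, and the points are all hypothetical.

```python
import math
from statistics import median


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def flag_cluster_outliers(points, centers, ratio=3.0):
    """Flag points far from their nearest center relative to the cluster's median distance."""
    assigned = []
    for p in points:
        dists = [euclidean(p, c) for c in centers]
        label = dists.index(min(dists))  # the cluster this point fits best
        assigned.append((label, dists[label]))

    results = []
    for label, dist in assigned:
        typical = median([d for lab, d in assigned if lab == label])
        is_outlier = typical > 0 and dist > ratio * typical
        results.append((label, dist, is_outlier))
    return results


# Hypothetical centers from a prior clustering step, and hypothetical points.
centers = [(0.0, 0.0), (10.0, 10.0)]
points = [
    (0.0, 1.0), (1.0, 0.0), (0.0, -1.0), (-1.0, 0.0),  # close to center 0
    (4.0, 3.0),                                         # nearest to center 0, but far away
    (10.0, 11.0), (11.0, 10.0), (9.0, 10.0),            # close to center 1
]

results = flag_cluster_outliers(points, centers)
flags = [is_outlier for _, _, is_outlier in results]
print(results)
```

The point `(4.0, 3.0)` is assigned to the first cluster because that center is nearest, yet its distance is five times the median distance for the cluster, so it is flagged for review rather than silently treated as a typical member.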

slide-16
SLIDE 16

Data “Cleanup”

  • Reflect heterogeneity in final product (rating plan adjustments, underwriting, tiering)
  • Data verification
  • Modify data
  • Exclude data