De-anonymization of Insurance Applicants' Sensitive Information (PowerPoint Presentation)



SLIDE 1

De-anonymization of Insurance Applicants' Sensitive Information

Team 3: Jay Lee, Maxim Castaneda, Rosalie Dolor

SLIDE 2

Benefits to Stakeholders

Goal: To enhance insurance companies' best practices by ensuring clients' privacy rights in the information-gathering process.

Dangers: The margin of error in risk predictions might increase instead of decrease.

Success: An increase in the number of applicants.

Benefits by stakeholder:
  • NAIC: the results of the research will provide guidance for new insurance policies.
  • US insurance companies: a higher probability of attracting potential clients, and new knowledge from the data mining outputs about the insurance dataset.
  • Insurance applicants: reduced privacy invasion.

SLIDE 3

Data Mining Problem

To de-anonymize the dataset by predicting sensitive variables using the insurance company dataset.

  • Supervised task and predictive modelling

  • Classification:
    ○ Decision Trees
    ○ Logistic Regression
    ○ Ensembles
  • Regression:
    ○ Decision Trees
    ○ Random Forest

Sensitive variable groups:
  1. Family History (5): categorical and numeric, death-related variables
  2. Employment Information (6): categorical and numeric, income-related variables

Main challenge: the specific variable labels are unknown, so assumptions were made about what each variable represents.
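The classification task above can be sketched as follows. This is a minimal illustration using scikit-learn on synthetic data, since the actual Prudential columns are not reproduced here; the feature matrix, target levels, and tree depth are all placeholder assumptions.

```python
# Minimal sketch: predicting a categorical "sensitive" variable with a
# decision tree classifier. In the project the outcome would be a
# variable such as Family_Hist_1; here it is synthetic.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 5))                        # non-sensitive predictors
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int) + 1  # sensitive variable, levels 1-2

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)
acc = accuracy_score(y_te, clf.predict(X_te))
print(f"held-out accuracy: {acc:.2f}")
```

If a sensitive variable can be recovered this accurately from the rest of the data, deleting its labels alone does not anonymize it.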

SLIDE 4

Data Description

  • Data source: Kaggle competition
  • Size: 59,381 rows and 128 columns (900+ columns after creating dummy variables)

  • Each row is an insurance applicant.
  • Pre-processing and Exploration:

    ○ Fill in missing values
    ○ Correlations
    ○ PCA

  • Partitioning:

    ○ Train & Test: 70%-30%
    ○ 5-fold Cross-validation (parameter tuning)

[Chart: Percentage Distribution of Applicants' Risk Level, across risk levels 1-8]
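The partitioning scheme above (70%/30% train/test split plus 5-fold cross-validation for parameter tuning) can be sketched as below; scikit-learn, the synthetic data, and the candidate depths are illustrative assumptions, not the project's actual setup.

```python
# Sketch of the slide's partitioning: 70%/30% train/test split, with
# 5-fold cross-validation on the training portion for parameter tuning.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=0)

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [3, 5, 8]},   # candidate depths (illustrative)
    cv=5,                                  # 5-fold cross-validation
)
grid.fit(X_tr, y_tr)                       # tunes on the 70% training split
print("best depth:", grid.best_params_["max_depth"])
print("test accuracy:", round(grid.score(X_te, y_te), 2))
```

The held-out 30% is touched only once, after tuning, so the reported accuracy is not inflated by the parameter search.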

SLIDE 5

Methodology (Process Flow)

General flow:
  1. Identify a confidential variable (CV) and set it as the outcome variable.
  2. De-anonymization: predict/classify the CV using the remaining variables in the dataset.
  3. Is the CV predictable?
    ○ Yes: identify which variables are most important in predicting the CV, drop them from the dataset, and return to step 2.
    ○ No: continue to step 4.
  4. Any other CV in the data?
    ○ Yes: return to step 1.
    ○ No: the final result is an anonymized dataset.
  5. Try to predict risk level using the anonymized dataset and evaluate performance.

Applied to Family_Hist_1:
  1. Set Family_Hist_1 (levels 1, 2, 3) as the outcome variable.
  2. Use a Decision Tree to predict Family_Hist_1 using the remaining variables.
  3. Is Family_Hist_1 predictable? Yes: based on the Decision Tree's feature importances, 4 variables are highly predictive of Family_Hist_1. Drop them one by one until it is no longer predictable.
  4. Any other CV in the data? Yes: the employment-related variables.
  5. Try to predict risk level using the anonymized dataset and evaluate performance.
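The process flow above can be sketched as a loop. The random-forest model, the 0.65 "predictable" cutoff, and the synthetic data are assumptions made for illustration, not values from the project.

```python
# Sketch of the de-anonymization loop: while the confidential variable
# (CV) remains predictable, drop its most important predictor and re-fit.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(800, 6))               # the remaining (non-CV) variables
cv_target = (X[:, 0] > 0).astype(int)       # the CV leaks through column 0

cols = list(range(X.shape[1]))              # columns still in the dataset
threshold = 0.65                            # "predictable" cutoff (assumed)
while cols:
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    acc = cross_val_score(clf, X[:, cols], cv_target, cv=5).mean()
    if acc <= threshold:
        break                               # CV is no longer predictable
    clf.fit(X[:, cols], cv_target)
    top = cols[int(np.argmax(clf.feature_importances_))]
    cols.remove(top)                        # drop the most important predictor

print(f"kept {len(cols)} of 6 columns; final CV accuracy {acc:.2f}")
```

Here the loop removes the one leaking column and stops, since the remaining columns are pure noise; on real data several rounds may be needed, as the deck found for Family_Hist_1.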

SLIDE 6

Predicting Family_Hist_1

1. Dimension reduction
  a. PCA vs. Random Forest (RF) feature importances
  b. Using the results of the RF feature importances: a tuned Decision Tree with 20 variables can predict with 77% accuracy
2. Performance metrics for the multi-class classification problem:
  a. Averaged versions of Accuracy, Precision, Recall, and F1-score (the higher, the better)

[Figure: Feature Importances]
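The "averaged" multi-class metrics in point 2 can be computed with macro averaging, which weights each class equally; a small hand-checkable sketch (these labels are made up, not project results):

```python
# Macro-averaged precision/recall/F1 for a multi-class problem:
# compute each metric per class, then take the unweighted mean.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [1, 1, 2, 2, 3, 3, 3, 1]   # actual class levels
y_pred = [1, 2, 2, 2, 3, 1, 3, 1]   # model predictions

acc = accuracy_score(y_true, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
```

Macro averaging is a reasonable choice when the class distribution is skewed (as the risk-level chart suggests), because a majority-class-only model cannot score well on it.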

SLIDE 7

De-anonymizing Family_Hist_1

SLIDE 8

De-anonymizing Employment_Info_1

SLIDE 9

Evaluating the Risk Level Prediction

SLIDE 10

Implementation/Production Considerations

NOTES

  • Assumptions about the variables should be checked with Prudential.
  • Based on the results, dropping the identified sensitive variables (and the important variables related to them) is possible, and it did not significantly affect the risk level prediction.
  • Performance metrics (setting a threshold for de-anonymization) are critical and should be discussed with NAIC.

RECOMMENDATIONS

  • Repeat the algorithm with the remaining identified sensitive variables.
  • Re-evaluate risk level modelling.