De-anonymization of Insurance Applicants' Sensitive Information

Team 3: Jay Lee, Maxim Castaneda, Rosalie Dolor

Goal
To enhance insurance companies' best practices by ensuring clients' privacy rights in the information-gathering process.

Benefits to Stakeholders
● NAIC: The results of the research will provide guidance for new insurance policies.
● US insurance companies: A higher probability of attracting potential clients.
● Insurance applicants: Reduced privacy invasion.

Dangers
● The margin of error in risk predictions might increase instead of decrease.

Success
● To gain new knowledge about the insurance dataset from the data mining outputs.
● Increase in the number of applicants.

Data Mining Problem
To de-anonymize the dataset by predicting sensitive variables using the insurance company dataset.

● Supervised task and predictive modelling
● Classification:
  - Decision Trees
  - Logistic Regression
  - Ensembles
● Regression:
  - Decision Trees
  - Random Forest

Sensitive variables:
1. Family History (5):
  ● Categorical and Numeric
  ● Death-related variables
2. Employment Information (6):
  ● Categorical and Numeric
  ● Income-related variables

Main challenge: Unknown specific variable labels (assumptions were made)

Data Description
● Data source: Kaggle competition
● Size: 59,381 rows and 128 columns (with dummy variables: 900+ columns)
● Each row is an insurance applicant.
● Pre-processing and Exploration:
  ○ Fill in missing values
  ○ Correlations
  ○ PCA
● Partitioning:
  ○ Train & Test: 70%-30%
  ○ 5-fold Cross-validation (parameter-tuning)

[Figure: Percentage Distribution of Applicants' Risk Level; x-axis: Risk Level, 1 to 8]
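The partitioning described on this slide can be sketched with the Python standard library alone. This is a minimal illustration, not the team's code: the function names, the seed, and the use of a plain shuffled split (rather than, say, a split stratified by risk level) are assumptions; only the 59,381 rows, the 70%-30% split, and the 5 folds come from the slide.

```python
import random


def train_test_split_indices(n_rows, train_share_pct=70, seed=0):
    """Shuffle row indices and split them into train and test parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility (assumed)
    n_train = n_rows * train_share_pct // 100
    return indices[:n_train], indices[n_train:]


def k_fold_indices(train_indices, k=5):
    """Yield (fit, validate) index lists for k-fold cross-validation."""
    # Every k-th index goes to the same fold, so fold sizes differ by at most 1.
    folds = [train_indices[i::k] for i in range(k)]
    for i in range(k):
        validate = folds[i]
        fit = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield fit, validate


# 59,381 applicants, as stated on the slide.
train_idx, test_idx = train_test_split_indices(59381)
folds = list(k_fold_indices(train_idx, k=5))
```

Parameter tuning would then fit each candidate model on `fit` and score it on `validate` for every fold, keeping `test_idx` untouched until the final evaluation.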

Methodology (Process Flow)
1. Identify a confidential variable (CV) and set it as the outcome variable.
2. De-anonymization: predict/classify the CV using the remaining variables in the dataset.
3. Is the CV predictable?
  ○ Yes: Identify which variables are most important in predicting the CV, drop them from the dataset, and return to step 2.
  ○ No: Go to step 4.
4. Is there any other CV in the data?
  ○ Yes: Return to step 1 with that variable.
  ○ No: Final result: anonymized dataset.
5. Try to predict risk level using the anonymized dataset. Evaluate performance.

Example with Family_Hist_1:
● Identify the confidential variable: Family_Hist_1 (levels 1, 2, 3).
● De-anonymization: use a Decision Tree to predict Family_Hist_1 using the remaining variables.
● Is Family_Hist_1 predictable? Yes! Based on the Decision Tree's important features, 4 variables are highly predictive of Family_Hist_1. Drop them one by one.
● Is Family_Hist_1 predictable? Not anymore.
● Any other CV in the data? Yes, employment-related variables.
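The drop-and-retest loop in the process flow can be sketched as follows. To stay within the standard library, this sketch swaps the Decision Tree from the slides for a much simpler one-feature majority-vote rule, and it scores on the same rows it learns from rather than on a held-out set. The column names `proxy` and `noise`, the toy rows, and the 0.5 threshold are all hypothetical; the slides leave the actual threshold open.

```python
from collections import Counter, defaultdict


def one_rule_accuracy(rows, feature, target):
    """Accuracy of predicting `target` by the majority class per `feature` value."""
    by_value = defaultdict(Counter)
    for row in rows:
        by_value[row[feature]][row[target]] += 1
    correct = sum(counts.most_common(1)[0][1] for counts in by_value.values())
    return correct / len(rows)


def anonymize(rows, target, threshold):
    """Drop features one by one until none predicts `target` above `threshold`."""
    features = [name for name in rows[0] if name != target]
    dropped = []
    while features:
        scores = {f: one_rule_accuracy(rows, f, target) for f in features}
        best = max(scores, key=scores.get)
        if scores[best] <= threshold:
            break  # the confidential variable is no longer predictable
        features.remove(best)  # drop the most predictive variable, then retest
        dropped.append(best)
    return features, dropped


# Toy data: `proxy` mirrors the confidential variable, `noise` does not.
rows = [
    {"Family_Hist_1": 1, "proxy": "a", "noise": 0},
    {"Family_Hist_1": 1, "proxy": "a", "noise": 1},
    {"Family_Hist_1": 2, "proxy": "b", "noise": 0},
    {"Family_Hist_1": 2, "proxy": "b", "noise": 1},
    {"Family_Hist_1": 3, "proxy": "c", "noise": 0},
    {"Family_Hist_1": 3, "proxy": "c", "noise": 1},
]
kept, dropped = anonymize(rows, "Family_Hist_1", threshold=0.5)
# kept == ["noise"], dropped == ["proxy"]
```

In the full flow, the columns in `kept` (without the confidential variable itself) would then be used to predict the risk level, and the loop would be repeated for each remaining confidential variable.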

Predicting Family_Hist_1
1. Dimension reduction
  a. PCA vs Random Forest (RF) feature importances
  b. Using the result from RF feature importances:
    i. A tuned Decision Tree with 20 variables can predict with 77% accuracy
2. Performance metrics for the multi-class classification problem:
  a. Averaged versions of Accuracy, Precision, Recall and F1-score (the higher, the better)

[Figure: Feature Importances]
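The averaged metrics listed above can be computed as in the sketch below. The slide does not say which averaging scheme was used; this example assumes macro-averaging (an unweighted mean over the classes), and the function name and toy labels are illustrative only.

```python
def macro_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall and F1 for multi-class labels."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    precisions, recalls, f1_scores = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # Undefined ratios (no predictions or no true cases) are scored as 0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1_scores.append(f1)
    n = len(labels)
    return {
        "accuracy": sum(t == p for t, p in pairs) / len(pairs),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1_scores) / n,
    }


# Toy example with the three Family_Hist_1 levels (1, 2, 3).
scores = macro_metrics([1, 1, 2, 2, 3, 3], [1, 1, 2, 3, 3, 3])
```

Macro-averaging treats every level equally regardless of how many applicants fall into it; a weighted average would instead favor the most common level.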

De-anonymizing Family_Hist_1

De-anonymizing Employment_Info_1

Evaluating the Risk Level Prediction

Implementation/Production Considerations
NOTES
● Assumptions about the variables should be checked with Prudential.
● Based on the results, dropping the identified sensitive variables (and the important variables related to them) is possible, and it did not significantly affect the risk level prediction.
● Performance metrics (setting a threshold for de-anonymization) are critical and should be discussed with NAIC.
RECOMMENDATIONS
● Repeat the algorithm with the remaining identified sensitive variables.
● Re-evaluate risk level modelling.
