Privacy-preserving and Authenticated Data Cleaning on Outsourced Databases
Thesis Defense
Boxiang Dong
Thesis Committee:
Advisor: Prof. Wendy Hui Wang
Prof. Yingying Chen
Prof. David Naumann
Prof. Antonio Nicolosi
Department of Computer Science, Stevens Institute of Technology
December 1, 2016

Dirty Data
Real-world datasets, particularly those from multiple sources, tend to be dirty.
• Inaccuracy: multiple records that refer to the same entity
• Inconsistency: violation of integrity constraints
• Incompleteness: missing data values

Name   Street   City  Phone
John   Leonard  NY    518-457-5181
John   Lenard   NY    518-457-5181
Kevin  (empty)  LA    213-974-3211
Mike   Main     Phil  518-457-5181

The ubiquitous dirty data: 40% of companies have suffered losses, problems, or costs due to data of poor quality [Eck02].
2 / 61

Data Cleaning
Data cleaning aims at detecting and removing errors, duplications, missing values, and inconsistencies to improve data quality.
• Data deduplication
• Data inconsistency repair
• Data imputation
Data cleaning is a labor-intensive and complex process. Some data cleaning problems are NP-complete [BFFR05].
3 / 61

Data-Cleaning-as-a-Service
Outsourcing data cleaning to a third-party service provider is a cost-effective option, e.g., OpenRefine (formerly Google Refine), Melissa Data.
• Client (data owner): has limited computational resources; sends the dirty dataset D to the server.
• Server: computationally powerful; returns the clean dataset D′ to the client.
4 / 61

Security Concerns
The third-party server is untrusted.
Result integrity: the server may return an incorrect data cleaning result.
• Software bugs
• Intention to save computational cost
Data privacy: the outsourced data may include sensitive personal information.
• Medical information
• Financial records
5 / 61

My Thesis
Thesis topic: Privacy-preserving and authenticated data cleaning on outsourced databases
[Diagram, built up over slides 6 to 10: the thesis connects Security & Privacy (Privacy, Authentication) with Data Cleaning (Inconsistency Repair, Deduplication). Publications highlighted in successive builds: [BigDataSecurity’16] and [ICDE’17] (Under Review); [CIKM’14]; [IRI’16].]
6-10 / 61

Related Work
Data cleaning
• Data deduplication [GIJ+01, SAA10, YLKG07]
• Data inconsistency repair [PEM+15, BFG+07, BFFR05]
Privacy-preserving outsourced computation
• Encryption [SV10, PRZB12]
• Encoding [EAMY+13, CC04]
• Secure multiparty computation [TOEY11, LZL+15]
• Differential privacy [CMF+11, AHMP15]
Verifiable computing
• General-purpose verifiable computing [SVP+12, PHGR13]
• Function-specific verifiable computing [DLW13, LWM+12]
11 / 61

Outline
1 Introduction
2 Research Results
• Authentication of Outsourced Data Deduplication
  • Verification of Similarity Search Approach (VS²)
  • Embedding-based Verification of Similarity Search Approach (E-VS²)
  • Experiments
• Privacy-preserving Outsourced Data Deduplication
• Privacy-preserving Outsourced Data Inconsistency Repair
3 Research beyond the Thesis
4 Future Plan
5 Conclusion
12 / 61

Authentication of Outsourced Data Deduplication
Boxiang Dong, Wendy Hui Wang. IEEE International Conference on Information Reuse and Integration (IRI), Pittsburgh, PA. July 2016. (Acceptance rate = 25%)
13 / 61

Data Deduplication
Data deduplication: eliminate near-duplicate copies.
• Record matching: detect near-duplicate copies.
Given a dataset D, a query string s_q, and a threshold θ, return { s | s ∈ D, DST(s, s_q) ≤ θ }.
θ: similarity threshold
DST: edit distance
14 / 61

Data Deduplication
Data deduplication: eliminate near-duplicate copies.
• Record matching: detect near-duplicate copies.

RID  Name   Street   City  Age
r1   John   Leonard  NY    45
r2   Kevin  Wicks    LA    31
r3   Mike   Main     Phil  22

s_q = (John, Lenard, NY, 45), θ = 2
Result: { r1 }
15 / 61
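
The record matching example above can be sketched in a few lines of Python. This is only an illustration: the slides do not say how a multi-attribute record is turned into a string, so joining the fields with spaces is an assumption, and `edit_distance` and `record_matching` are hypothetical helper names.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def record_matching(dataset: dict, query: str, theta: int) -> list:
    """Return the IDs of all records within edit distance theta of the query."""
    return [rid for rid, s in dataset.items() if edit_distance(s, query) <= theta]


# Records from the slide, flattened to strings (the flattening is an assumption).
D = {
    "r1": "John Leonard NY 45",
    "r2": "Kevin Wicks LA 31",
    "r3": "Mike Main Phil 22",
}
s_q = "John Lenard NY 45"
matches = record_matching(D, s_q, theta=2)
print(matches)
```

Under this flattening, r1 differs from s_q by a single deleted character, so it is the only record within θ = 2.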

Outsourcing Framework
The client (data owner) outsources the record matching service to the untrusted server.
• The client sends the dataset D and the query (s_q, θ) to the server.
• The server returns R^S = { s | s ∈ D, DST(s, s_q) ≤ θ }.
Assumption: the client is aware of the edit distance metric.
We want to make sure that R^S is both sound and complete.
Soundness: ∀ s ∈ R^S, s ∈ D and DST(s, s_q) ≤ θ.
Completeness: ∀ s ∈ D s.t. DST(s, s_q) ≤ θ, s ∈ R^S.
16 / 61
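
The two definitions translate directly into brute-force checks. The sketch below is a baseline for illustration only: it recomputes every edit distance on the client, which is exactly the cost that outsourcing is meant to avoid. The function names and the sample data (records flattened to strings) are assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def is_sound(result, dataset, query, theta) -> bool:
    """Soundness: every returned string is in D and within distance theta."""
    return all(s in dataset and edit_distance(s, query) <= theta for s in result)


def is_complete(result, dataset, query, theta) -> bool:
    """Completeness: every string of D within distance theta is returned."""
    return all(s in result for s in dataset if edit_distance(s, query) <= theta)


D = ["John Leonard NY 45", "Kevin Wicks LA 31", "Mike Main Phil 22"]
s_q, theta = "John Lenard NY 45", 2

honest = ["John Leonard NY 45"]
padded = honest + ["Kevin Wicks LA 31"]   # contains a dissimilar record
forged = ["John Lenard NY 45"]            # a record that is not in D
empty = []                                # omits the true match

print(is_sound(honest, D, s_q, theta), is_complete(honest, D, s_q, theta))
```

Note that an empty result is trivially sound but not complete, which is why both properties have to be checked.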

Authentication
We aim at an authentication framework that satisfies the following objectives.
• Catches soundness violations: ∃ s ∈ R^S but s ∉ D, or ∃ s ∈ R^S but DST(s, s_q) > θ.
• Catches completeness violations: ∃ s ∈ D s.t. DST(s, s_q) ≤ θ but s ∉ R^S.
• Supports efficient verification.
• Scales well with big data.
17 / 61

Preliminary - Merkle Tree
A Merkle tree is a generalization of hash lists and hash chains.

H_ABCD = Hash(H_AB || H_CD)
H_AB = Hash(H_A || H_B)    H_CD = Hash(H_C || H_D)
H_A = Hash(D_A)    H_B = Hash(D_B)    H_C = Hash(D_C)    H_D = Hash(D_D)

• It allows efficient and secure verification of the contents of large data structures.
• Computing a hash is computationally cheaper than computing an edit distance.
18 / 61
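
The four-leaf tree above can be sketched with Python's `hashlib`. This is a minimal illustration, assuming SHA-256 as the hash function and a power-of-two number of leaves; the helper names (`build_levels`, `prove`, `verify`) are hypothetical and not taken from the thesis.

```python
import hashlib


def h(data: bytes) -> bytes:
    """Hash function for the tree (SHA-256 is an assumption)."""
    return hashlib.sha256(data).digest()


def build_levels(blocks):
    """Return all tree levels, leaves first. Assumes a power-of-two block count."""
    level = [h(b) for b in blocks]
    levels = [level]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def prove(levels, index):
    """Collect the sibling hashes on the path from leaf `index` to the root."""
    proof = []
    for level in levels[:-1]:
        sibling = index ^ 1
        proof.append((level[sibling], sibling < index))  # (hash, sibling_is_left)
        index //= 2
    return proof


def verify(block, proof, root):
    """Recompute the root from one block and its sibling hashes."""
    digest = h(block)
    for sibling, sibling_is_left in proof:
        digest = h(sibling + digest) if sibling_is_left else h(digest + sibling)
    return digest == root


blocks = [b"D_A", b"D_B", b"D_C", b"D_D"]
levels = build_levels(blocks)
root = levels[-1][0]        # H_ABCD
proof_c = prove(levels, 2)  # siblings needed for D_C: H_D, then H_AB
print(verify(b"D_C", proof_c, root))
```

Verifying one block needs only a logarithmic number of sibling hashes, which is what makes the structure attractive for a client with limited resources.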

Preliminary - B^ed-Tree
B^ed-Tree [ZHOS10] is a string indexing structure.

N1 (LCP: Ø): pointers to N2, N3
N2 (LCP: Ø): pointers to N4, N5
N3 (LCP: Ø): pointers to N6, N7
N4 (LCP: "Christi"): Christina, Christine, Christion
N5 (LCP: Ø): Donatello, Elizabeth, Gabrielle
N6 (LCP: "H"): Harrison, Hollands, Huffmann
N7 (LCP: "Jim Gr"): Jim Grace, Jim Grady, Jim Gregg

• Sort the strings in dictionary order.
• Store the longest common prefix (LCP) of the enclosed strings in every node.
19 / 61

Preliminary - B^ed-Tree
B^ed-Tree [ZHOS10] is a string indexing structure.
Query: s_q = "Celestine", θ = 4

MIN_DST(s_q, N.LCP) per node:
N1 (LCP: Ø): 0
N2 (LCP: Ø): 0
N3 (LCP: Ø): 0
N4 (LCP: "Christi"): 3
N5 (LCP: Ø): 0
N6 (LCP: "H"): 1
N7 (LCP: "Jim Gr"): 6

• ∀ N, calculate MIN_DST(s_q, N.LCP).
20 / 61

Preliminary - B^ed-Tree
B^ed-Tree [ZHOS10] is a string indexing structure.
Query: s_q = "Celestine", θ = 4

MIN_DST(s_q, N.LCP) per node:
N1 (LCP: Ø): 0
N2 (LCP: Ø): 0
N3 (LCP: Ø): 0
N4 (LCP: "Christi"): 3
N5 (LCP: Ø): 0
N6 (LCP: "H"): 1
N7 (LCP: "Jim Gr"): 6 (MF-node)

Legend:
• Similar strings
• C-strings: dissimilar strings that are not NC-strings
• NC-strings: dissimilar strings covered by an MF-node

• If MIN_DST(s_q, N.LCP) > θ, then N is an MF-node.
• All strings covered by an MF-node must be dissimilar to s_q.
• Avoids the edit distance calculation for NC-strings.
• Performs well under memory constraints.
21 / 61
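
The pruning rule can be sketched as follows. The slides do not define MIN_DST, so this sketch assumes it is the smallest edit distance between the node's LCP and any prefix of s_q, which lower-bounds the distance from s_q to every string starting with that LCP. Only the four leaf nodes are modeled as a flat dictionary rather than a real tree, and all function names are hypothetical.

```python
import os


def prefix_distance_row(lcp: str, query: str) -> list:
    """Last row of the edit-distance table: entry j is the distance
    between lcp and the length-j prefix of query."""
    row = list(range(len(query) + 1))
    for i, cl in enumerate(lcp, start=1):
        new = [i]
        for j, cq in enumerate(query, start=1):
            new.append(min(row[j] + 1, new[j - 1] + 1, row[j - 1] + (cl != cq)))
        row = new
    return row


def min_dst(query: str, lcp: str) -> int:
    """Assumed lower bound: best match of lcp against any prefix of query."""
    return min(prefix_distance_row(lcp, query))


def edit_distance(a: str, b: str) -> int:
    return prefix_distance_row(a, b)[-1]


def search(leaves: dict, query: str, theta: int):
    """Skip MF-nodes entirely; compute edit distances only for the other leaves."""
    similar, mf_nodes, computed = [], [], 0
    for name, strings in leaves.items():
        if min_dst(query, os.path.commonprefix(strings)) > theta:
            mf_nodes.append(name)  # all enclosed strings are NC-strings
            continue
        for s in strings:
            computed += 1
            if edit_distance(s, query) <= theta:
                similar.append(s)
    return similar, mf_nodes, computed


leaves = {
    "N4": ["Christina", "Christine", "Christion"],
    "N5": ["Donatello", "Elizabeth", "Gabrielle"],
    "N6": ["Harrison", "Hollands", "Huffmann"],
    "N7": ["Jim Grace", "Jim Grady", "Jim Gregg"],
}
s_q, theta = "Celestine", 4
bounds = {n: min_dst(s_q, os.path.commonprefix(strs)) for n, strs in leaves.items()}
similar, mf_nodes, computed = search(leaves, s_q, theta)
print(bounds, mf_nodes, computed)
```

With these inputs the bounds reproduce the per-node values on the slide, and the three strings under N7 are discarded without a single edit distance computation.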

Preliminary - Embedding
Embedding maps strings into Euclidean points in a similarity-preserving way.
[Figure labels: S_1, S_2, S_3]
• Computing the Euclidean distance dst(p_i, p_j) is much cheaper than computing the edit distance DST(s_i, s_j).
• SparseMap [HS] is a contractive embedding approach, i.e., dst(p_i, p_j) ≤ DST(s_i, s_j).
• The complexity is O(cn²), where c is a small constant and n is the number of strings.
22 / 61
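
The contractive property is what makes the embedding usable as a filter: if dst(p, p_q) > θ, then DST(s, s_q) > θ as well, so the edit distance need not be computed. The sketch below does not implement SparseMap. It substitutes a simple Lipschitz-style embedding (distance to the nearest string in each reference set, scaled by 1/√k), which is contractive by the triangle inequality. The reference sets and function names are hypothetical.

```python
import math


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def embed(s: str, reference_sets) -> tuple:
    """Coordinate i is the distance from s to its nearest string in reference
    set i, scaled by 1/sqrt(k) so the Euclidean distance stays contractive."""
    k = len(reference_sets)
    return tuple(min(edit_distance(s, r) for r in refs) / math.sqrt(k)
                 for refs in reference_sets)


def filtered_search(strings, query, theta, reference_sets):
    """Prune with the cheap Euclidean test, then confirm with edit distance."""
    p_q = embed(query, reference_sets)
    result, computed = [], 0
    for s in strings:
        # Small tolerance guards against floating-point round-off.
        if math.dist(embed(s, reference_sets), p_q) > theta + 1e-9:
            continue  # dst > theta implies DST > theta
        computed += 1
        if edit_distance(s, query) <= theta:
            result.append(s)
    return result, computed


strings = ["Christina", "Christine", "Christion", "Donatello", "Elizabeth",
           "Gabrielle", "Harrison", "Hollands", "Huffmann",
           "Jim Grace", "Jim Grady", "Jim Gregg"]
reference_sets = [["Christina"], ["Harrison", "Jim Grace"], ["Elizabeth"]]  # hypothetical
s_q, theta = "Celestine", 4
result, computed = filtered_search(strings, s_q, theta, reference_sets)
brute_force = [s for s in strings if edit_distance(s, s_q) <= theta]
print(result, computed)
```

In a real deployment the points would be computed once and stored, so that each query costs one embedding plus cheap Euclidean comparisons. Here they are recomputed per query to keep the sketch short.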

Solution in a Nutshell
We require the server to construct a verification object (VO) to demonstrate the soundness and completeness of the result.
• Client: σ ← setup(D); the client sends D to the server.
• Client: sends the query (s_q, θ) to the server.
• Server: (R^S, VO) ← search(D, s_q, θ); the server returns R^S and VO.
• Client: (R^S / ⊥) ← verify(R^S, VO, σ).
The client is able to efficiently detect any unsound or incomplete result returned by the server by checking the VO.
23 / 61

