
Privacy-preserving and Authenticated Data Cleaning on Outsourced Databases



  1. Privacy-preserving and Authenticated Data Cleaning on Outsourced Databases. Thesis Defense. Boxiang Dong. Thesis committee: Prof. Wendy Hui Wang (advisor), Prof. Yingying Chen, Prof. David Naumann, Prof. Antonio Nicolosi. Department of Computer Science, Stevens Institute of Technology. December 1, 2016.

  2. Dirty Data
     Real-world datasets, particularly those from multiple sources, tend to be dirty.
     • Inaccuracy: multiple records that refer to the same entity
     • Inconsistency: violation of integrity constraints
     • Incompleteness: missing data values

     Name   Street     City  Phone
     John   Leonard    NY    518-457-5181
     John   Lenard     NY    518-457-5181
     Kevin  (missing)  LA    213-974-3211
     Mike   Main       Phil  518-457-5181

     Dirty data is ubiquitous: 40% of companies have suffered losses, problems, or costs due to data of poor quality [Eck02].

  3. Data Cleaning
     Data cleaning aims to detect and remove errors, duplicates, missing values, and inconsistencies in order to improve data quality.
     • Data deduplication
     • Data inconsistency repair
     • Data imputation
     Data cleaning is a labor-intensive and complex process; it can be NP-complete [BFFR05].

  4. Data-Cleaning-as-a-Service
     Outsourcing the data to a third-party data cleaning service provider is a cost-effective option, e.g., Google’s OpenRefine or Melissa Data.
     [Figure: the client (data owner), who has limited computational resources, sends the dirty dataset D to the computationally powerful server, which returns the clean dataset D′.]

  5. Security Concerns
     The third-party server is untrusted.
     • Result integrity: the server may return an incorrect data cleaning result, because of software bugs or an intention to save computational cost.
     • Data privacy: the outsourced data may include sensitive personal information, such as medical information or financial records.

  6. My Thesis
     Thesis topic: privacy-preserving and authenticated data cleaning on outsourced databases.
     [Figure: diagram relating the thesis to two areas, security and privacy (privacy, authentication) and data cleaning (inconsistency repair, deduplication).]

  7. My Thesis (same figure, annotated with [BigDataSecurity’16] and [ICDE’17] (under review))

  8. My Thesis (same figure, annotated with [CIKM’14])

  9. My Thesis (same figure, annotated with [IRI’16])

  10. My Thesis (same figure, annotated with [IRI’16])

  11. Related Work
     Data cleaning
     • Data deduplication [GIJ+01, SAA10, YLKG07]
     • Data inconsistency repair [PEM+15, BFG+07, BFFR05]
     Privacy-preserving outsourced computation
     • Encryption [SV10, PRZB12]
     • Encoding [EAMY+13, CC04]
     • Secure multiparty computation [TOEY11, LZL+15]
     • Differential privacy [CMF+11, AHMP15]
     Verifiable computing
     • General-purpose verifiable computing [SVP+12, PHGR13]
     • Function-specific verifiable computing [DLW13, LWM+12]

  12. Outline
     1 Introduction
     2 Research Results
     • Authentication of Outsourced Data Deduplication
     • Verification of Similarity Search Approach (VS²)
     • Embedding-based Verification of Similarity Search Approach (E-VS²)
     • Experiments
     • Privacy-preserving Outsourced Data Deduplication
     • Privacy-preserving Outsourced Data Inconsistency Repair
     3 Research beyond the Thesis
     4 Future Plan
     5 Conclusion

  13. Authentication of Outsourced Data Deduplication. Boxiang Dong, Wendy Hui Wang. IEEE International Conference on Information Reuse and Integration (IRI), Pittsburgh, PA, July 2016 (acceptance rate = 25%).

  14. Data Deduplication
     Data deduplication: eliminate near-duplicate copies.
     • Record matching: detect near-duplicate copies. Given a dataset D, a query string s_q, and a similarity threshold θ, return { s | s ∈ D, DST(s, s_q) ≤ θ }, where DST is the edit distance.

  15. Data Deduplication
     Data deduplication: eliminate near-duplicate copies.
     • Record matching: detect near-duplicate copies.

     RID  Name   Street   City  Age
     r1   John   Leonard  NY    45
     r2   Kevin  Wicks    LA    31
     r3   Mike   Main     Phil  22

     For s_q = (John, Lenard, NY, 45) and θ = 2, the result is { r1 }.
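The record-matching example on this slide can be reproduced with a short sketch. The slide does not say how DST is applied to a multi-attribute record, so the code below assumes the fields are joined with single spaces before comparison; the names `edit_distance` and `record_matching` are illustrative, not taken from the thesis.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))          # distances from "" to every prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def record_matching(dataset: dict, query: str, theta: int) -> set:
    """Return the IDs of all records s with DST(s, query) <= theta."""
    return {rid for rid, s in dataset.items() if edit_distance(s, query) <= theta}


# Assumption: each record is flattened into one space-separated string.
D = {
    "r1": "John Leonard NY 45",
    "r2": "Kevin Wicks LA 31",
    "r3": "Mike Main Phil 22",
}
result = record_matching(D, "John Lenard NY 45", theta=2)
print(result)
```

Each comparison costs time proportional to the product of the two string lengths, which is the cost the later slides try to avoid paying for every record.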

  16. Outsourcing Framework
     The client (data owner) outsources the record matching service to the untrusted server. The client sends the dataset D and the query (s_q, θ); the server returns R_S = { s | s ∈ D, DST(s, s_q) ≤ θ }.
     Assumption: the client is aware of the edit distance metric.
     We want to make sure that R_S is both sound and complete.
     • Soundness: ∀ s ∈ R_S, s ∈ D and DST(s, s_q) ≤ θ.
     • Completeness: ∀ s ∈ D s.t. DST(s, s_q) ≤ θ, s ∈ R_S.

  17. Authentication
     We aim at an authentication framework that satisfies the following objectives.
     • It catches soundness violations: ∃ s ∈ R_S but s ∉ D, or ∃ s ∈ R_S but DST(s, s_q) > θ.
     • It catches completeness violations: ∃ s ∈ D s.t. DST(s, s_q) ≤ θ but s ∉ R_S.
     • It supports efficient verification.
     • It scales well with big data.
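To make the three violation types concrete, here is a brute-force checker. It is only an illustration of the definitions: it recomputes every edit distance on the client, which is exactly the cost an authentication framework is meant to avoid. The function names and the sample strings are assumptions chosen for the example.

```python
from functools import lru_cache


def dst(a: str, b: str) -> int:
    """Edit distance via memoized recursion (fine for short strings)."""
    @lru_cache(maxsize=None)
    def go(i: int, j: int) -> int:
        if i == 0:
            return j
        if j == 0:
            return i
        return min(go(i - 1, j) + 1,
                   go(i, j - 1) + 1,
                   go(i - 1, j - 1) + (a[i - 1] != b[j - 1]))
    return go(len(a), len(b))


def find_violations(D, R_S, s_q, theta):
    """Classify what is wrong with a claimed result R_S for query (s_q, theta)."""
    D, R_S = set(D), set(R_S)
    return {
        # soundness: a returned string is not in the dataset at all
        "not_in_dataset": {s for s in R_S if s not in D},
        # soundness: a returned string is farther than theta from the query
        "too_far": {s for s in R_S if dst(s, s_q) > theta},
        # completeness: a similar string in D was left out of the result
        "missing": {s for s in D if dst(s, s_q) <= theta and s not in R_S},
    }


D = {"Christina", "Christine", "Donatello"}
print(find_violations(D, {"Christina", "Christine"}, "Celestine", 4))
print(find_violations(D, {"Christine", "Donatello", "Zed"}, "Celestine", 4))
```

A result is acceptable only when all three sets are empty.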

  18. Preliminary - Merkle Tree
     A Merkle tree is a generalization of hash lists and hash chains.
     [Figure: a Merkle tree over four data blocks D_A, D_B, D_C, D_D.]
     Leaves: H_A = Hash(D_A), H_B = Hash(D_B), H_C = Hash(D_C), H_D = Hash(D_D)
     Internal nodes: H_AB = Hash(H_A || H_B), H_CD = Hash(H_C || H_D)
     Root: H_ABCD = Hash(H_AB || H_CD)
     • It allows efficient and secure verification of the contents of large data structures.
     • Hashing is computationally more efficient than edit distance calculation.
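The four-leaf tree on this slide can be sketched as follows. SHA-256 is an assumption (the slide only says "Hash"), the helper names are illustrative, and the sketch handles only a power-of-two number of blocks to stay short.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_levels(blocks):
    """Return all tree levels, leaves first, root last (power-of-two leaf count)."""
    assert len(blocks) > 0 and len(blocks) & (len(blocks) - 1) == 0
    levels = [[h(b) for b in blocks]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([h(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels


def prove(levels, index):
    """Collect the sibling hashes on the path from leaf `index` to the root."""
    proof = []
    for level in levels[:-1]:
        sibling = index ^ 1
        proof.append((level[sibling], sibling < index))  # (hash, sibling is on the left)
        index //= 2
    return proof


def verify(block, proof, root):
    """Recompute the root from one block and its sibling hashes."""
    node = h(block)
    for sibling, sibling_is_left in proof:
        node = h(sibling + node) if sibling_is_left else h(node + sibling)
    return node == root


blocks = [b"D_A", b"D_B", b"D_C", b"D_D"]
levels = build_levels(blocks)
root = levels[-1][0]                      # H_ABCD
proof = prove(levels, 2)                  # proof for D_C: [H_D, H_AB]
print(verify(b"D_C", proof, root), verify(b"tampered", proof, root))
```

The verifier only needs the root and a number of sibling hashes logarithmic in the number of blocks, which is why the structure suits a client with limited resources.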

  19. Preliminary - B^ed-Tree
     The B^ed-tree [ZHOS10] is a string indexing structure.
     [Figure: a three-level B^ed-tree. The root N1 holds pointers to N2 and N3; N2 holds pointers to the leaves N4 and N5; N3 holds pointers to the leaves N6 and N7.]

     Node  Strings                          LCP
     N4    Christina, Christine, Christion  Christi
     N5    Donatello, Elizabeth, Gabrielle  Ø
     N6    Harrison, Hollands, Huffmann     H
     N7    Jim Grace, Jim Grady, Jim Gregg  Jim Gr

     N1, N2, and N3 have LCP Ø.
     • Sort the strings in dictionary order.
     • Store the longest common prefix (LCP) of the enclosed strings in every node.

  20. Preliminary - B^ed-Tree
     The B^ed-tree [ZHOS10] is a string indexing structure.
     [Figure: the same tree queried with s_q = “Celestine” and θ = 4. Each node is annotated with MIN_DST(s_q, N.LCP): N1 = 0, N2 = 0, N3 = 0, N4 = 3, N5 = 0, N6 = 1, N7 = 6.]
     • ∀ N, calculate MIN_DST(s_q, N.LCP).

  21. Preliminary - B^ed-Tree
     The B^ed-tree [ZHOS10] is a string indexing structure.
     [Figure: the same tree queried with s_q = “Celestine” and θ = 4, with MIN_DST values N1 = 0, N2 = 0, N3 = 0, N4 = 3, N5 = 0, N6 = 1, N7 = 6, and one node marked as an MF-node. Legend: similar strings; C-strings (dissimilar strings that are not NC-strings); NC-strings (dissimilar strings covered by an MF-node).]
     • If MIN_DST(s_q, N.LCP) > θ, then N is an MF-node.
     • All strings covered by an MF-node must be dissimilar to s_q.
     • This avoids the edit distance calculation for NC-strings.
     • It performs well under memory constraints.
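The pruning rule can be sketched under one stated assumption: MIN_DST(s_q, LCP) is taken to be the smallest edit distance between s_q and any string that starts with LCP, which equals the smallest edit distance between LCP and any prefix of s_q. That reading is consistent with the values 0, 1, 3, and 6 that appear on the slides, but the exact definition used in [ZHOS10] may differ. The leaf table and function names are illustrative.

```python
def min_dst(query: str, lcp: str) -> int:
    """Lower bound on DST(query, s) for every string s that starts with lcp."""
    # row[j] = edit distance between the part of lcp seen so far and query[:j]
    row = list(range(len(query) + 1))
    for i, c in enumerate(lcp, start=1):
        new = [i]
        for j, q in enumerate(query, start=1):
            new.append(min(row[j] + 1, new[j - 1] + 1, row[j - 1] + (c != q)))
        row = new
    # the rest of s is free to match the rest of query, so take the best prefix
    return min(row)


# Longest common prefixes of the four leaf nodes from the slides.
LEAF_LCP = {"N4": "Christi", "N5": "", "N6": "H", "N7": "Jim Gr"}


def mf_nodes(query: str, theta: int):
    """Leaves whose strings can all be discarded without any edit distance call."""
    return sorted(n for n, p in LEAF_LCP.items() if min_dst(query, p) > theta)


bounds = {n: min_dst("Celestine", p) for n, p in LEAF_LCP.items()}
print(bounds)
print(mf_nodes("Celestine", 4))
```

With θ = 4 only N7 exceeds the threshold, so the three "Jim Gr…" strings are NC-strings; the strings in the other leaves still need an explicit distance check.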

  22. Preliminary - Embedding
     Embedding maps strings to points in Euclidean space in a similarity-preserving way.
     [Figure: embedding illustration with labels S1, S2, S3.]
     • Computing the Euclidean distance dst(p_i, p_j) is much cheaper than computing the edit distance DST(s_i, s_j).
     • SparseMap [HS] is a contractive embedding approach, i.e., dst(p_i, p_j) ≤ DST(s_i, s_j).
     • The complexity is O(cn²), where c is a small constant and n is the number of strings.
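The contractive property is what makes an embedding usable as a filter: if dst(p, p_q) > θ, then DST(s, s_q) > θ as well, so s can be discarded without computing its edit distance. The sketch below does not implement SparseMap. It swaps in a simple reference-string embedding (coordinate i is the edit distance to reference string i, scaled by 1/√d), which is contractive by the triangle inequality. The reference strings are an arbitrary choice made for the example.

```python
import math


def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


# Hypothetical reference strings; SparseMap chooses its reference sets differently.
REFS = ["Christine", "Jim Grace"]


def embed(s: str, refs=REFS):
    """Map s to a point; each coordinate is a scaled distance to one reference."""
    scale = math.sqrt(len(refs))
    return tuple(edit_distance(s, r) / scale for r in refs)


def euclid(p, q) -> float:
    return math.dist(p, q)


STRINGS = ["Christina", "Christine", "Christion", "Donatello", "Elizabeth",
           "Gabrielle", "Harrison", "Hollands", "Huffmann",
           "Jim Grace", "Jim Grady", "Jim Gregg"]
QUERY, THETA = "Celestine", 4

p_q = embed(QUERY)
# Strings whose embedded distance already exceeds theta are certainly dissimilar.
pruned = [s for s in STRINGS if euclid(embed(s), p_q) > THETA]
candidates = [s for s in STRINGS if s not in pruned]
print(len(pruned), "pruned;", len(candidates), "still need an edit distance check")
```

The filter never discards a truly similar string, but it may keep dissimilar ones, so the remaining candidates still have to be checked with the real edit distance.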

  23. Solution in a Nutshell
     We require the server to construct a verification object (VO) that demonstrates the soundness and completeness of the result.
     • Client: σ ← setup(D); the client outsources D to the server.
     • Client sends (s_q, θ); server: (R_S, VO) ← search(D, s_q, θ).
     • Client: (R_S / ⊥) ← verify(R_S, VO, σ).
     By checking the VO, the client can efficiently detect any unsound or incomplete result returned by the server.

