

SLIDE 1

Privacy-preserving and Authenticated Data Cleaning on Outsourced Databases

Thesis Defense
Boxiang Dong

Thesis Committee:

  • Advisor: Prof. Wendy Hui Wang
  • Prof. Yingying Chen
  • Prof. David Naumann
  • Prof. Antonio Nicolosi

Department of Computer Science
Stevens Institute of Technology

December 1, 2016

slide-2
SLIDE 2

Dirty Data

Real-world datasets, particularly those from multiple sources, tend to be dirty.

  • Inaccuracy: multiple records that refer to the same entity
  • Inconsistency: violation of integrity constraints
  • Incompleteness: missing data values

Name    Street     City    Phone
John    Leonard    NY      518-457-5181
John    Lenard     NY      518-457-5181
Kevin   (missing)  LA      213-974-3211
Mike    Main       Phil    518-457-5181

The ubiquitous dirty data: 40% of companies have suffered losses, problems, or costs due to data of poor quality [Eck02].

SLIDE 3

Data Cleaning

Data cleaning aims at detecting and removing errors, duplications, missing values, and inconsistencies to improve data quality.

  • Data deduplication
  • Data inconsistency repair
  • Data imputation

Data cleaning is a labor-intensive and complex process. Some cleaning tasks are NP-complete, e.g., finding a repair with the minimum cost [BFFR05].

SLIDE 4

Data-Cleaning-as-a-Service

Outsourcing data cleaning to a third-party service provider is a cost-effective option. E.g., OpenRefine (formerly Google Refine), Melissa Data.

  • Client (data owner): has limited computational resources; sends the dirty dataset D to the server.
  • Server: computationally powerful; cleans D and returns the clean dataset D′.

SLIDE 5

Security Concerns

The third-party server is untrusted.

Result integrity: the server may return an incorrect data cleaning result.

  • Software bugs
  • Intention to save computational cost

Data privacy: the outsourced data may include sensitive personal information.

  • Medical information
  • Financial records

SLIDE 6

My Thesis

Thesis topic: Privacy-preserving and authenticated data cleaning on outsourced databases

My Thesis Security & Privacy

Privacy Authentication

Data Cleaning

Inconsistency Repair Deduplication

SLIDE 7

My Thesis

Thesis topic: Privacy-preserving and authenticated data cleaning on outsourced databases

My Thesis Security & Privacy

Privacy Authentication

Data Cleaning

Inconsistency Repair Deduplication

[BigDataSecurity’16] [ICDE’17] (Under Review)

SLIDE 8

My Thesis

Thesis topic: Privacy-preserving and authenticated data cleaning on outsourced databases

My Thesis Security & Privacy

Privacy Authentication

Data Cleaning

Inconsistency Repair Deduplication

[CIKM’14]

SLIDE 9

My Thesis

Thesis topic: Privacy-preserving and authenticated data cleaning on outsourced databases

My Thesis Security & Privacy

Privacy Authentication

Data Cleaning

Inconsistency Repair Deduplication

[IRI’16]

SLIDE 11

Related Work

Data cleaning

  • Data deduplication [GIJ+01, SAA10, YLKG07]
  • Data inconsistency repair [PEM+15, BFG+07, BFFR05]

Privacy-preserving outsourced computation

  • Encryption [SV10, PRZB12]
  • Encoding [EAMY+13, CC04]
  • Secure multiparty computation [TOEY11, LZL+15]
  • Differential privacy [CMF+11, AHMP15]

Verifiable computing

  • General-purpose verifiable computing [SVP+12, PHGR13]
  • Function-specific verifiable computing [DLW13, LWM+12]

SLIDE 12

Outline

1 Introduction
2 Research Results
  • Authentication of Outsourced Data Deduplication
      • Verification of Similarity Search Approach (VS2)
      • Embedding-based Verification of Similarity Search Approach (E-VS2)
      • Experiments
  • Privacy-preserving Outsourced Data Deduplication
  • Privacy-preserving Outsourced Data Inconsistency Repair
3 Research beyond the Thesis
4 Future Plan
5 Conclusion

SLIDE 13

Authentication of Outsourced Data Deduplication

Boxiang Dong, Wendy Hui Wang. IEEE International Conference on Information Reuse and Integration (IRI), Pittsburgh, PA. July 2016. (Acceptance rate = 25%)

SLIDE 14

Data Deduplication

Data deduplication: eliminate near-duplicate copies.

  • Record matching: detect near-duplicate copies.

Given a dataset D, a query string sq, and a similarity threshold θ, record matching returns {s | s ∈ D, DST(s, sq) ≤ θ}, where DST is the edit distance.
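The record matching definition above can be sketched in a few lines. This is a minimal illustration, assuming DST is the standard Levenshtein edit distance; the function names `edit_distance` and `similarity_search` are hypothetical and not taken from the thesis.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))          # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def similarity_search(dataset, query, theta):
    """Return every string in the dataset within edit distance theta of the query."""
    return [s for s in dataset if edit_distance(s, query) <= theta]


matches = similarity_search(["Leonard", "Wicks", "Main"], "Lenard", 2)
```

Every string in D is compared against the query here, which is the cost the indexing and verification techniques in the rest of the talk try to avoid.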

SLIDE 15

Data Deduplication

Record matching example:

RID   Name    Street    City    Age
r1    John    Leonard   NY      45
r2    Kevin   Wicks     LA      31
r3    Mike    Main      Phil    22

With sq = (John, Lenard, NY, 45) and θ = 2, the result is {r1}.

SLIDE 16

Outsourcing Framework

The client (data owner) outsources the record matching service to the untrusted server.

  • Client → Server: the dataset D (outsourced), then each query (sq, θ)
  • Server → Client: RS = {s | s ∈ D, DST(s, sq) ≤ θ}

Assumption: the client is aware of the edit distance metric.

We want to make sure that RS is both sound and complete.

  • Soundness: ∀s ∈ RS, s ∈ D and DST(s, sq) ≤ θ.
  • Completeness: ∀s ∈ D s.t. DST(s, sq) ≤ θ, s ∈ RS.
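The two properties can be written as direct checks. The sketch below is the naive test that needs the whole dataset, not the authenticated approach of the thesis; `check_result` is a hypothetical name, and the distance function is passed in so that the toy example can use absolute difference on integers as a stand-in for edit distance.

```python
def check_result(dataset, result, query, theta, dst):
    """Return (sound, complete) for a claimed result set RS."""
    data, rs = set(dataset), set(result)
    # Soundness: every returned item is in D and is within the threshold.
    sound = all(s in data and dst(s, query) <= theta for s in rs)
    # Completeness: every item of D within the threshold was returned.
    complete = all(s in rs for s in data if dst(s, query) <= theta)
    return sound, complete


def toy_dst(a, b):
    # Stand-in metric for the example only (not edit distance).
    return abs(a - b)


D = [1, 5, 9, 12]
honest = check_result(D, [5, 9], 6, 3, toy_dst)
missing = check_result(D, [5], 6, 3, toy_dst)
forged = check_result(D, [5, 9, 7], 6, 3, toy_dst)
```

A client with limited resources cannot afford this check on a large D, which motivates the verification objects introduced later.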

SLIDE 17

Authentication

We aim at an authentication framework that satisfies the following objectives.

  • Catches soundness violations: ∃s ∈ RS, but s ∉ D; or ∃s ∈ RS, but DST(s, sq) > θ
  • Catches completeness violations: ∃s ∈ D s.t. DST(s, sq) ≤ θ, but s ∉ RS
  • Supports efficient verification
  • Scales well with big data

SLIDE 18

Preliminary - Merkle Tree

A Merkle tree is a generalization of hash lists and hash chains.

Leaves:          HA = Hash(DA), HB = Hash(DB), HC = Hash(DC), HD = Hash(DD)
Internal nodes:  HAB = Hash(HA||HB), HCD = Hash(HC||HD)
Root:            HABCD = Hash(HAB||HCD)

  • It allows efficient and secure verification of the contents of large data structures.
  • Hashing is computationally more efficient than edit distance calculation.
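As an illustration of the four-leaf tree on this slide, the following sketch computes the root hash and checks one data block against it. SHA-256 and the helper names (`h`, `merkle_root`, `verify_block`) are assumptions made for the example; the slide does not fix a hash function.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(blocks):
    """Root hash of a binary Merkle tree; assumes len(blocks) is a power of two."""
    level = [h(b) for b in blocks]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


def verify_block(block, proof, root):
    """proof lists (sibling_hash, sibling_is_left) pairs from the leaf up to the root."""
    node = h(block)
    for sibling, sibling_is_left in proof:
        node = h(sibling + node) if sibling_is_left else h(node + sibling)
    return node == root


blocks = [b"DA", b"DB", b"DC", b"DD"]
root = merkle_root(blocks)
HA, HB, HC, HD = (h(b) for b in blocks)
# To check DC, the verifier only needs HD and HAB, not the other data blocks.
proof_for_dc = [(HD, False), (h(HA + HB), True)]
```

The proof length grows with the tree height rather than with the number of blocks, which is what makes verification efficient.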

SLIDE 19

Preliminary - Bed-Tree

Bed-Tree [ZHOS10] is a string indexing structure.

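Example Bed-tree (each node stores the LCP of its strings; Ø is the empty prefix; pN is a pointer to node N):

N1 (root, LCP = Ø)
  N2 (LCP = Ø)
    N4 (LCP = "Christi"): Christina, Christine, Christion
    N5 (LCP = Ø): Donatello, Elizabeth, Gabrielle
  N3 (LCP = Ø)
    N6 (LCP = "H"): Harrison, Hollands, Huffmann
    N7 (LCP = "Jim Gr"): Jim Grace, Jim Grady, Jim Gregg

  • Sort the strings in dictionary order.
  • Store the longest common prefix (LCP) of the enclosed strings in every node.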

SLIDE 20

Preliminary - Bed-Tree

Bed-Tree [ZHOS10] is a string indexing structure.

[Figure: the same Bed-tree (nodes N1 to N7), annotated with the MIN_DST values 3, 1, and 6 for the query sq = "Celestine" with θ = 4.]

  • ∀N, calculate MIN_DST(sq, N.LCP).

SLIDE 21

Preliminary - Bed-Tree

Bed-Tree [ZHOS10] is a string indexing structure.

[Figure: the same Bed-tree for sq = "Celestine", θ = 4, annotated with the MIN_DST values 3, 1, and 6 and the legend below.]

  • Similar strings: strings s with DST(s, sq) ≤ θ
  • NC-strings: dissimilar strings covered by an MF-node
  • C-strings: dissimilar strings that are not NC-strings

  • If MIN_DST(sq, N.LCP) > θ, then N is an MF-node.
  • All strings covered by an MF-node must be dissimilar to sq.
  • Avoid the edit distance calculation for NC-strings.
  • Performs well under memory constraints.
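The pruning rule can be illustrated with one way of computing a prefix lower bound. The sketch assumes MIN_DST(sq, LCP) is the smallest edit distance between sq and any string that starts with LCP, which equals the minimum over all prefixes of sq of the edit distance to LCP. The function name `min_prefix_dst` is hypothetical, and the Bed-tree paper may compute its bound differently.

```python
def min_prefix_dst(query: str, lcp: str) -> int:
    """Lower bound on DST(query, s) for every string s that starts with lcp."""
    # row[j] = edit distance between the processed part of lcp and query[:j]
    row = list(range(len(query) + 1))
    for i, c in enumerate(lcp, start=1):
        new = [i]
        for j, q in enumerate(query, start=1):
            new.append(min(row[j] + 1, new[j - 1] + 1, row[j - 1] + (c != q)))
        row = new
    # The rest of s is free, so the best case matches lcp against some prefix of query.
    return min(row)


query, theta = "Celestine", 4
bounds = {lcp: min_prefix_dst(query, lcp) for lcp in ["Christi", "H", "Jim Gr"]}
mf_lcps = [lcp for lcp, bound in bounds.items() if bound > theta]
```

Under this definition the three prefixes give 3, 1, and 6, so only the node with LCP "Jim Gr" is pruned for θ = 4, and none of the strings below it needs an edit distance computation.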

SLIDE 22

Preliminary - Embedding

Embedding maps strings into Euclidean points in a similarity-preserving way.

Let pi denote the Euclidean point of string si, and let dst denote the Euclidean distance.

  • Euclidean distance calculation is much more efficient than edit distance computation, i.e., computing dst(pi, pj) costs far less than computing DST(si, sj).
  • SparseMap [HS] is a contractive embedding approach, i.e., dst(pi, pj) ≤ DST(si, sj).
  • The complexity is O(cn²), where c is a small constant and n is the number of strings.
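To make the contractive property concrete, here is a small sketch of a Lipschitz-style embedding. It is not SparseMap itself: each coordinate is the edit distance to the nearest string of a reference set, scaled by 1/sqrt(d), which guarantees dst(pi, pj) ≤ DST(si, sj) by the triangle inequality. The reference sets and names are made up for the example.

```python
import math


def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def embed(s: str, reference_sets) -> list:
    """Coordinate i is the distance from s to the nearest string in reference set i,
    scaled by 1/sqrt(d) so the Euclidean distance never exceeds the edit distance."""
    scale = 1.0 / math.sqrt(len(reference_sets))
    return [scale * min(edit_distance(s, r) for r in refs) for refs in reference_sets]


# Hypothetical reference sets; a real scheme would choose them more carefully.
REFS = [["Christina"], ["Harrison", "Jim Grace"], ["Donatello"]]
strings = ["Christina", "Christine", "Elizabeth", "Huffmann", "Jim Gregg"]
query, theta = "Celestine", 4
pq = embed(query, REFS)

# Pruning rule: dst(p, pq) > theta implies DST(s, sq) > theta.
pruned = [s for s in strings if math.dist(embed(s, REFS), pq) > theta]
```

Strings that are not pruned may still be dissimilar, so they need an exact edit distance check; the embedding only rules strings out, it never rules them in.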

SLIDE 23

Solution in a Nutshell

We require the server to construct a verification object (VO) to demonstrate the soundness and completeness of the result.

  1. Client: σ ← setup(D); the client outsources D to the server.
  2. Client → Server: sq, θ
  3. Server: (RS, VO) ← search(D, sq, θ); the server returns (RS, VO).
  4. Client: (RS / ⊥) ← verify(RS, VO, σ)

The client is able to efficiently detect any unsound or incomplete result returned by the server by checking the VO.

SLIDE 25

VS2 - Setup

We propose an authenticated string indexing structure, named MB-tree (Merkle Bed-tree).

[Figure: MB-tree over strings s1, ..., s12. Leaf nodes: N4 = {s1, s2, s3}, N5 = {s4, s5, s6}, N6 = {s7, s8, s9}, N7 = {s10, s11, s12}. Internal nodes: N2 (children N4, N5) and N3 (children N6, N7). Root: N1. Every node N stores its LCP (LCPN) and a hash value hN; internal nodes also store pointers pN to their children.]

  • Leaf node, e.g., hN4 = h(h(s1)||h(s2)||h(s3)||h(LCPN4))
  • Internal node, e.g., hN2 = h(hN4||hN5||h(LCPN2)) and hN1 = h(hN2||hN3||h(LCPN1))
  • Root signature: Sig(T) = sign(hN1)

  • The client signs the hash value in the root, and only keeps the signature of the MB-tree locally.
  • Hashing is more efficient than edit distance calculation.
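The hash formulas above can be turned into a short setup sketch. SHA-256, the `Node` class, and the use of an HMAC as a stand-in for the digital signature are all assumptions made for this illustration; the thesis may use different primitives.

```python
import hashlib
import hmac
import os
from dataclasses import dataclass, field


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


@dataclass
class Node:
    strings: list = field(default_factory=list)    # filled for leaf nodes
    children: list = field(default_factory=list)   # filled for internal nodes

    @property
    def lcp(self) -> str:
        # LCP of a leaf's strings, or of its children's LCPs for an internal node
        items = self.strings or [c.lcp for c in self.children]
        return os.path.commonprefix(items)

    def digest(self) -> bytes:
        # h(h(s1)||...||h(LCP)) for a leaf, h(hChild1||...||h(LCP)) otherwise
        parts = ([h(s.encode()) for s in self.strings] if self.strings
                 else [c.digest() for c in self.children])
        return h(b"".join(parts) + h(self.lcp.encode()))


def sign(root_hash: bytes, key: bytes) -> bytes:
    # Stand-in for a digital signature: an HMAC under the client's secret key.
    return hmac.new(key, root_hash, hashlib.sha256).digest()


n4 = Node(strings=["Christina", "Christine", "Christion"])
n5 = Node(strings=["Donatello", "Elizabeth", "Gabrielle"])
n6 = Node(strings=["Harrison", "Hollands", "Huffmann"])
n7 = Node(strings=["Jim Grace", "Jim Grady", "Jim Gregg"])
root = Node(children=[Node(children=[n4, n5]), Node(children=[n6, n7])])
signature = sign(root.digest(), b"client-secret-key")
```

After setup the client can discard the tree and keep only `signature`; any change to a string or an LCP changes the root hash and therefore invalidates it.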

SLIDE 26

VS2 - VO Construction

The server searches for the similar strings and constructs the VO by traversing the MB-tree.

[Figure: MB-tree for sq = "Celestine", θ = 4, annotated with the MIN_DST values 3, 1, and 6. N7 is an MF-node; s1 and s2 are similar strings; s3 to s9 are C-strings; s10 to s12 are NC-strings.]

RS = {s1, s2}
VO = {(((s1, s2, s3), (s4, s5, s6)), ((s7, s8, s9), (LCPN7, hN7)))}

  • Include all the C-strings and similar strings in the VO.
  • Substitute the large number of NC-strings with the MF-nodes.
SLIDE 28

VS2 - VO Verification

The client checks the soundness and completeness of RS by verifying the VO.

The verification must catch:

  • Soundness violations: ∃s ∈ RS, but s ∉ D; or ∃s ∈ RS, but DST(s, sq) > θ
  • Completeness violations: ∃s ∈ D s.t. DST(s, sq) ≤ θ, but s ∉ RS

First step: compute Sig(T) from the VO and check if it matches the local copy.

sq = "Celestine", θ = 4
RS = {s1, s2}
VO = {(((s1, s2, s3), (s4, s5, s6)), ((s7, s8, s9), (LCPN7, hN7)))}

SLIDE 29

VS2 - VO Verification

The client checks the soundness and completeness of RS by verifying the VO.

The client performs the following checks:

  1. Compute Sig(T) from the VO.
  2. ∀s ∈ RS, check if DST(s, sq) ≤ θ.
  3. ∀C-string s, check if DST(s, sq) > θ.
  4. ∀MF-node N, check if MIN_DST(N.LCP, sq) > θ.

Example with sq = "Celestine", θ = 4, RS = {s1, s2}, and VO = {(((s1, s2, s3), (s4, s5, s6)), ((s7, s8, s9), (LCPN7, hN7)))}:

  • Similar strings: DST(s1, sq) = 4 ≤ 4, DST(s2, sq) = 3 < 4
  • C-strings: DST(s3, sq) = 5 > 4, DST(s4, sq) = 9 > 4, DST(s5, sq) = 9 > 4, DST(s6, sq) = 8 > 4, DST(s7, sq) = 8 > 4, DST(s8, sq) = 8 > 4, DST(s9, sq) = 8 > 4
  • MF-node: MIN_DST(LCPN7, sq) = 6 > 4

Total: 10 DST calculations (naive approach: 12 DST calculations).
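The checks above can be sketched end to end on the running example. This is an illustrative sketch, not the thesis implementation: it compares the recomputed root hash with a trusted copy instead of verifying a signature, it accepts the (LCP, hash) pair of an MF-node as given, and the names `rebuild`, `verify`, and `MF` as well as the list/tuple encoding of the VO are hypothetical.

```python
import hashlib
import os
from dataclasses import dataclass


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def dp_row(s: str, query: str) -> list:
    """Last row of the edit-distance table: row[j] = DST(s, query[:j])."""
    row = list(range(len(query) + 1))
    for i, c in enumerate(s, start=1):
        new = [i]
        for j, q in enumerate(query, start=1):
            new.append(min(row[j] + 1, new[j - 1] + 1, row[j - 1] + (c != q)))
        row = new
    return row


@dataclass(frozen=True)
class MF:
    """An MF-node inside the VO: only its LCP and hash are sent."""
    lcp: str
    digest: bytes


def rebuild(vo):
    """Return (hash, LCP, strings, MF-nodes) of a VO subtree.
    A list is a leaf with all of its strings; a tuple is an internal node."""
    if isinstance(vo, MF):
        return vo.digest, vo.lcp, [], [vo]
    if isinstance(vo, list):
        lcp = os.path.commonprefix(vo)
        digest = h(b"".join(h(s.encode()) for s in vo) + h(lcp.encode()))
        return digest, lcp, list(vo), []
    parts = [rebuild(child) for child in vo]
    lcp = os.path.commonprefix([p[1] for p in parts])
    digest = h(b"".join(p[0] for p in parts) + h(lcp.encode()))
    return (digest, lcp,
            [s for p in parts for s in p[2]],
            [m for p in parts for m in p[3]])


def verify(vo, rs, query, theta, trusted_root):
    digest, _, strings, mf_nodes = rebuild(vo)
    if digest != trusted_root:          # the VO does not match the authenticated tree
        return False
    if not set(rs) <= set(strings):     # a returned string is not part of D
        return False
    for s in strings:                   # similar strings and C-strings
        if (dp_row(s, query)[-1] <= theta) != (s in rs):
            return False
    # Every MF-node must really be prunable.
    return all(min(dp_row(m.lcp, query)) > theta for m in mf_nodes)


n4 = ["Christina", "Christine", "Christion"]
n5 = ["Donatello", "Elizabeth", "Gabrielle"]
n6 = ["Harrison", "Hollands", "Huffmann"]
n7 = ["Jim Grace", "Jim Grady", "Jim Gregg"]
root_hash = rebuild(((n4, n5), (n6, n7)))[0]           # computed by the client at setup
vo = ((n4, n5), (n6, MF("Jim Gr", rebuild(n7)[0])))     # built by the server
ok = verify(vo, ["Christina", "Christine"], "Celestine", 4, root_hash)
```

For this VO the client runs nine edit distance computations plus one prefix bound, in line with the 10 calculations counted on the slide.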

SLIDE 31

E-VS2 - Setup

  • The client constructs the MB-tree.
  • The client applies SparseMap to embed strings into Euclidean points.

[Figure: the MB-tree over s1, ..., s12 with its hash values and root signature Sig(T) = sign(hN1), together with the Euclidean points of the embedded strings.]

Key idea: let p and pq be the Euclidean points of s and sq. For any C-string s, if dst(p, pq) > θ, it must be true that DST(s, sq) > θ, because the embedding is contractive (dst(p, pq) ≤ DST(s, sq)).

SLIDE 32

E-VS2 - VO Construction

  • Distant Bounding Hyper-rectangle (DBH): a hyper-rectangle R in the Euclidean space is a DBH if min_dst(pq, R) > θ.
  • DBH-string: a C-string s with dst(p, pq) > θ.
  • FP-string: a C-string s with dst(p, pq) ≤ θ.

Key idea:

  • To save the verification cost at the client side, the server should organize the set of DBH-strings into a small number of DBHs.
  • By only checking the Euclidean distance between the query point pq and the DBHs, the client is assured that all DBH-strings are dissimilar to sq.
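The DBH test reduces to a point-to-box distance. The sketch below assumes axis-aligned hyper-rectangles given by a lower and an upper corner, which the slide does not state explicitly; `min_dst_to_rect`, `bounding_rect`, and `contains` are hypothetical names.

```python
import math


def min_dst_to_rect(p, lower, upper):
    """Smallest Euclidean distance from point p to the axis-aligned box [lower, upper]."""
    return math.sqrt(sum(max(lo - x, 0.0, x - hi) ** 2
                         for x, lo, hi in zip(p, lower, upper)))


def bounding_rect(points):
    """Smallest axis-aligned box that encloses all the given points."""
    dims = list(zip(*points))
    return [min(d) for d in dims], [max(d) for d in dims]


def contains(point, lower, upper):
    return all(lo <= x <= hi for x, lo, hi in zip(point, lower, upper))


pq = (0.0, 0.0)
points = [(3.0, 4.0), (5.0, 6.0), (4.0, 5.0)]
lower, upper = bounding_rect(points)
theta = 4
is_dbh = min_dst_to_rect(pq, lower, upper) > theta
```

Every point inside the box is at least `min_dst_to_rect` away from pq, so a single distance computation per box covers all the DBH-strings it encloses.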

SLIDE 34

E-VS2 - VO Construction

[Figure: MB-tree for sq = "Celestine", θ = 4, annotated with the MIN_DST values 3, 1, and 6. Legend: similar strings, C-strings (split into DBH-strings and FP-strings), NC-strings, MF-node. A second panel shows the Euclidean points p1, ..., p12, the query point pq with the threshold θ, and two DBHs, R1 and R2.]

SLIDE 35

E-VS2 - VO Construction

Theorem (NP-Completeness of DBH Construction): Given a query string sq and a set of DBH-strings {s1, ..., st}, let {p1, ..., pt} be their Euclidean points. It is an NP-complete problem to construct a minimum number of rectangles R = {R1, ..., Rk} s.t. (1) ∀i ≠ j, Ri and Rj do not overlap; and (2) ∀pi, there exists an Rj s.t. pi is included in Rj.

  • We design an efficient heuristic algorithm for the server to construct a small number of DBHs.
  • The complexity is cubic in the number of DBH-strings.

SLIDE 36

E-VS2 - VO Construction

The server includes the DBHs in the VO.

[Figure: the MB-tree and the Euclidean points p1, ..., p12 for sq = "Celestine", θ = 4, with the DBHs R1 and R2.]

RS = {s1, s2}
VO = {(((s1, s2, (s3, pR1)), ((s4, pR2), (s5, pR1), (s6, pR1))), (((s7, pR2), (s8, pR1), s9), (LCPN7, hN7))), {R1, R2}}

SLIDE 38

E-VS2 - VO Verification

The client checks the soundness and completeness of RS by verifying the VO.

The client performs the following checks:

  1. Compute Sig(T) from the VO.
  2. ∀s ∈ RS, check if DST(s, sq) ≤ θ.
  3. ∀MF-node N, check if MIN_DST(N.LCP, sq) > θ.
  4. ∀DBH-string (s, pR), check if p ∈ R, and if min_dst(pq, R) > θ.
  5. ∀FP-string s, check if DST(sq, s) > θ.

Example with sq = "Celestine", θ = 4, RS = {s1, s2}, and VO = {(((s1, s2, (s3, pR1)), ((s4, pR2), (s5, pR1), (s6, pR1))), (((s7, pR2), (s8, pR1), s9), (LCPN7, hN7))), {R1, R2}}:

  • Similar strings: DST(s1, sq) = 4 ≤ 4, DST(s2, sq) = 3 < 4
  • MF-node: MIN_DST(LCPN7, sq) = 6 > 4
  • DBH-strings: min_dst(pq, R1) > θ, min_dst(pq, R2) > θ
  • FP-string: DST(s9, sq) = 8 > 4

Total: 4 DST calculations + 2 dst calculations (VS2: 10 DST calculations; naive approach: 12 DST calculations).

SLIDE 39

Complexity Analysis

Phase            Measurement  VS2                       E-VS2
Setup            Time         O(n)                      O(cdn²)
Setup            Space        O(n)                      O(n)
VO Construction  Time         O(n)                      O(n + nDS³)
VO Construction  VO size      (nR + nC)σS + nMF σM      (nR + nC)σS + nMF σM + nDBH σD
VO Verification  Time         O((nR + nMF + nC)CEd)     O((nR + nMF + nFP)CEd + nDBH CEl)

Notation:

  • n: # of strings in D
  • c: a constant in [0, 1]
  • d: # of dimensions of the Euclidean space
  • σS: the average length of a string
  • σM: the average size of an MB-tree node
  • σD: the average size of a DBH
  • nR: # of strings in MS
  • nC: # of C-strings
  • nFP: # of FP-strings
  • nDS: # of DBH-strings
  • nDBH: # of DBHs
  • nMF: # of MF-nodes
  • CEd: the complexity of an edit distance computation
  • CEl: the complexity of a Euclidean distance calculation

  • E-VS2 results in higher VO construction complexity at the server side.
  • E-VS2 dramatically reduces the VO verification cost at the client side.

SLIDE 41

Experiments - Setup

  • Environment
      Language: C++
      Testbed: a Linux machine with a 2.4 GHz CPU and 48 GB RAM
  • Datasets
      Actors [1]: 260,000 last names
      Authors [2]: 1,000,000 full names
  • Evaluation metrics
      • VO construction time
      • VO verification time

[1] http://www.imdb.com/interfaces
[2] http://dblp.uni-trier.de/xml/

SLIDE 42

Experiments - VO Construction Time

[Figure: Time performance of VO construction. x-axis: threshold value (1 to 6); y-axis: time (seconds); series: VS2, E-VS2. (a) The Actors dataset. (b) The Authors dataset.]

  • E-VS2 takes more time at the server side to construct the VO, especially when θ is small.

SLIDE 43

Experiments - VO Verification Time

[Figure: Time performance of VO verification. x-axis: threshold value (1 to 6); y-axis: time (seconds); series: VS2, E-VS2, baseline. (a) The Actors dataset (f = 1,000). (b) The Authors dataset (f = 1,000).]

  • VS2 and E-VS2 are significantly more efficient than the baseline approach in verification cost.
  • The advantage of E-VS2 is large when θ is small.

SLIDE 45

α-Security against Frequency Analysis (FA) Attack 3

Define α-security to limit the success probability of frequency analysis attack.

Experiment $\mathrm{Exp}^{FA}_{\mathcal{A},\Pi}()$:
  $p' \leftarrow \mathcal{A}^{freq(e),\, freq(P)}$
  Return 1 if $p' = \mathrm{Decrypt}(k, e)$
  Return 0 otherwise

$\Pi$ is α-secure against the FA attack if $\Pr[\mathrm{Exp}^{FA}_{\mathcal{A},\Pi}() = 1] \le \alpha$.

[3] Boxiang Dong, Ruilin Liu, Wendy Hui Wang. Prada: Privacy-preserving Data-Deduplication-as-a-Service. International Conference on Information and Knowledge Management, 2014. (Acceptance rate = 20%).
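To see why a bound on the attacker's success probability matters, the toy sketch below runs a frequency analysis attack against a deterministic one-to-one encoding. The encoding table, the data, and the function name `frequency_attack` are invented for the illustration and are not the schemes from the thesis.

```python
from collections import Counter


def frequency_attack(ciphertexts, plaintext_freq):
    """Guess a ciphertext-to-plaintext map by matching frequency ranks."""
    cipher_ranked = [c for c, _ in Counter(ciphertexts).most_common()]
    plain_ranked = sorted(plaintext_freq, key=plaintext_freq.get, reverse=True)
    return dict(zip(cipher_ranked, plain_ranked))


# Toy deterministic encoding: equal plaintexts always map to equal ciphertexts.
toy_cipher = {"NY": "x9", "LA": "q2", "Phil": "k7"}
plaintexts = ["NY"] * 5 + ["LA"] * 3 + ["Phil"] * 1
ciphertexts = [toy_cipher[p] for p in plaintexts]

# The attacker knows the plaintext frequencies, but not the key.
guess = frequency_attack(ciphertexts, Counter(plaintexts))
success_rate = sum(guess[c] == p for c, p in zip(ciphertexts, plaintexts)) / len(plaintexts)
```

With distinct frequencies the attacker recovers every value in this toy setting; an α-secure encoding is meant to cap that success probability at α.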

SLIDE 46

Privacy-preserving Outsourced Data Deduplication 4

We design two approaches to enable data deduplication and defend against the frequency analysis attack.

  • Locality-sensitive Hashing Based Approach (LSHB)
  • Embedding & Homomorphic Substitution Approach (EHS)

[Figure: illustration of the two encodings. LSHB: strings S1(f1), S2(f3), S3(f3) are mapped to LSH values LSH1(f), ..., LSH7(f). EHS: strings S1, S2, S3 are mapped to Euclidean points; ≈ marks similar and ≉ marks dissimilar pairs.]

  • The LSHB approach encodes strings into LSH values that (1) preserve the string similarity; and (2) are of the same frequency groupwise.
  • The EHS approach encodes strings into Euclidean points that (1) preserve the string similarity; and (2) are of uniform frequency.

[4] Boxiang Dong, Ruilin Liu, Wendy Hui Wang. Prada: Privacy-preserving Data-Deduplication-as-a-Service. International Conference on Information and Knowledge Management, 2014. (Acceptance rate = 20%).

SLIDE 47

Privacy-preserving Outsourced Data Deduplication

Experiment Results

[Figure: (a) Time performance: time (seconds) vs. data size (2k to 18k) for EHS and LSHB. (b) Deduplication accuracy: recall and precision (%) vs. data size (2k to 18k) for EHS and LSHB.]

SLIDE 49

Functional Dependency (FD)

Functional dependency (FD): X → Y holds if, for any two records r1 and r2, r1[X] = r2[X] implies r1[Y] = r2[Y].

FDs play a key role in identifying and fixing data inconsistency.

TID  Conference  Year  Country    Capital          City
r1   SIGMOD      2007  China      Beijing          Beijing
r2   ICDM        2014  China      Shanghai         Shenzhen
r3   KDD         2014  U.S.       Washington D.C.  New York City
r4   KDD         2015  Australia  Canberra         Sydney
r5   ICDM        2015  U.S.       New York City    Atlantic City

FD: Country → Capital
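The table above violates Country → Capital for two countries. A minimal sketch of how such violations can be detected is shown below; the function name `fd_violations` and the dictionary encoding of the records are assumptions made for the example.

```python
from collections import defaultdict


def fd_violations(records, lhs, rhs):
    """Group records by the left-hand attribute and report groups with several right-hand values."""
    groups = defaultdict(set)
    for r in records:
        groups[r[lhs]].add(r[rhs])
    return {x: ys for x, ys in groups.items() if len(ys) > 1}


records = [
    {"TID": "r1", "Country": "China", "Capital": "Beijing"},
    {"TID": "r2", "Country": "China", "Capital": "Shanghai"},
    {"TID": "r3", "Country": "U.S.", "Capital": "Washington D.C."},
    {"TID": "r4", "Country": "Australia", "Capital": "Canberra"},
    {"TID": "r5", "Country": "U.S.", "Capital": "New York City"},
]
violations = fd_violations(records, "Country", "Capital")
```

Detecting the violations is the easy part; deciding which value to change in each conflicting group is the repair problem that the thesis outsources.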

SLIDE 50

Indistinguishability against FD-preserving Chosen Plaintext Attack (IND-FCPA)

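Experiment $\mathrm{Exp}^{\mathrm{IND\text{-}FCPA}}_{\mathcal{A},\Pi}(\lambda)$:
  $k \leftarrow \mathrm{KeyGen}(\lambda)$
  $(D_0, D_1) \leftarrow \mathcal{A}^{O\mathrm{Encrypt}(\cdot)}(k)$ s.t. $F_{D_0} = F_{D_1}$ and $|D_0| = |D_1|$
  $b \xleftarrow{\$} \{0, 1\}$
  $b' \leftarrow \mathcal{A}^{O\mathrm{Encrypt}(\cdot)}(k)$
  Return 1 if $b = b'$
  Return 0 otherwise

$\Pi$ is IND-FCPA secure if $\Pr[\mathrm{Exp}^{\mathrm{IND\text{-}FCPA}}_{\mathcal{A},\Pi}(\lambda) = 1] \le \frac{1}{2} + \mathrm{negl}(\lambda)$.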

SLIDE 51

Privacy-preserving Outsourced Data Inconsistency Repair

We consider two scenarios of outsourced data inconsistency repair, and design two encryption/encoding approaches that provide robust privacy guarantees [5].

Scenario 1: Secure data inconsistency repair against FD-attack

  • Adversarial knowledge: FDs
  • Adversarial attack: FD-attack
  • Security setting: partial data

Scenario 2: Secure data inconsistency repair against frequency analysis attack

  • Adversarial knowledge: frequency
  • Adversarial attack: FA-attack
  • Security setting: whole data

[5] Boxiang Dong, Wendy Hui Wang, Jie Yang. Secure Data Outsourcing with Adversarial Data Dependency Constraints. International Conference on Big Data Security on Cloud, 2016. (Acceptance rate = 23%).

SLIDE 53

Research beyond the Thesis

  • Authentication of outsourced data mining computations
      • Association rule mining [DBSec’13, ICDM’13, TSC’15]
      • Outlier mining (under review)
  • Rank aggregation in the crowdsourcing setting (under review)
      • Rank inference
      • Task assignment with data privacy concern
  • Data-as-a-commodity (under review)
      • Budget constraint
      • High quality (low inconsistency)

SLIDE 55

Future Plan

  • Authenticated outsourced data inconsistency repair
      Challenge: it is NP-complete to find a repair with the minimum cost.
      Solution:
        • Convert the strings into Euclidean space.
        • It is the center of mass that results in the smallest repair cost.
  • Authenticated outsourced data imputation
      Challenge: it demands a similarity matrix between all values.
      Solution: create evidence imputation objects to verify the result in a probabilistic way.

SLIDE 56

Conclusion

Privacy-preserving and authenticated data cleaning on outsourced databases.

  • Define two security notions, namely α-security and IND-FCPA.
  • Authentication of outsourced data deduplication.
  • Privacy-preserving outsourced data deduplication.
  • Privacy-preserving outsourced data inconsistency repair.
      • Privacy against FD attack.
      • Privacy against frequency analysis attack.

The suite of encryption, encoding, and authentication schemes addresses the security and privacy concerns in outsourced computing.

SLIDE 57

My Publications

  • [IRI’16] Boxiang Dong, Hui (Wendy) Wang. ARM: Authenticated Approximate Record Matching for Outsourced Databases. IEEE International Conference on Information Reuse and Integration (IRI). Pittsburgh, PA. 2016. (Acceptance rate = 25%).
  • [BigDataSecurity’16] Boxiang Dong, Hui (Wendy) Wang, Jie Yang. Secure Data Outsourcing with Adversarial Data Dependency Constraints. IEEE International Conference on Big Data Security on Cloud (BigDataSecurity). New York. 2016. (Acceptance rate = 23%).
  • [TSC’15] Boxiang Dong, Ruilin Liu, Hui (Wendy) Wang. Trust-but-Verify: Verifying Result Correctness of Outsourced Frequent Itemset Mining. IEEE Transactions on Services Computing. 2015.
  • [CIKM’14] Boxiang Dong, Ruilin Liu, Hui (Wendy) Wang. Prada: Privacy-preserving Data-Deduplication-as-a-Service. ACM International Conference on Information and Knowledge Management (CIKM). Shanghai, China. 2014. (Acceptance rate = 20%).
  • [ICDM’13] Boxiang Dong, Ruilin Liu, Hui (Wendy) Wang. Integrity Verification of Outsourced Frequent Itemset Mining with Deterministic Guarantee. IEEE International Conference on Data Mining (ICDM). Dallas, Texas. 2013. (Acceptance rate = 19.7%).
  • [DBSec’13] Boxiang Dong, Ruilin Liu, Hui (Wendy) Wang. Result Integrity Verification of Outsourced Frequent Itemset Mining. Annual IFIP WG 11.3 Conference on Data and Application Security and Privacy (DBSec). Newark, NJ. 2013.
  • [IJIPM’10] Weifeng Sun, Juanyun Wang, Boxiang Dong, Mingchu Li, Zhenquan Qin. A Mediated RSA-based End Entity Certificates Revocation Mechanism with Secure Concern in Grid. International Journal of Information Processing and Management (IJIPM). 2010.
  • [IIH-MSP’10] Weifeng Sun, Boxiang Dong, Zhenquan Qin, Juanyun Wang, Mingchu Li. A Low-Level Security Solving Method in Grid. International Conference on Intelligent Information Hiding and Multimedia Signal Processing (IIH-MSP). Darmstadt, Germany. 2010.

SLIDE 58

References I

[AHMP15] Tristan Allard, Georges Hébrail, Florent Masseglia, and Esther Pacitti. Chiaroscuro: Transparency and privacy for massive personal time-series clustering. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 779–794, 2015.

[BFFR05] Philip Bohannon, Wenfei Fan, Michael Flaster, and Rajeev Rastogi. A cost-based model and effective heuristic for repairing constraints by value modification. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 143–154, 2005.

[BFG+07] Philip Bohannon, Wenfei Fan, Floris Geerts, Xibei Jia, and Anastasios Kementsietsidis. Conditional functional dependencies for data cleaning. In IEEE International Conference on Data Engineering, pages 746–755, 2007.

[CC04] Tim Churches and Peter Christen. Some methods for blindfolded record linkage. BMC Medical Informatics and Decision Making, 4(1):9, 2004.

[CMF+11] Rui Chen, Noman Mohammed, Benjamin CM Fung, Bipin C Desai, and Li Xiong. Publishing set-valued data via differential privacy. Proceedings of the VLDB Endowment, 4(11):1087–1098, 2011.

[DLW13] Boxiang Dong, Ruilin Liu, and Hui Wendy Wang. Result integrity verification of outsourced frequent itemset mining. In Data and Applications Security and Privacy XXVII, pages 258–265, 2013.

[EAMY+13] Durham E. Ashley, Kantarcioglu M., Xue Y., Kuzu M., and Malin Bradley. Composite bloom filters for secure record linkage. In IEEE Transactions on Knowledge and Data Engineering, 2013.
SLIDE 59

References II

[Eck02] Wayne W Eckerson. Data quality and the bottom line. The Data Warehouse Institute Report, 2002.

[GIJ+01] Luis Gravano, Panagiotis G Ipeirotis, Hosagrahar Visvesvaraya Jagadish, Nick Koudas, Shanmugauelayut Muthukrishnan, Divesh Srivastava, et al. Approximate string joins in a database (almost) for free. In Proceedings of the International Conference on Very Large Data Bases, volume 1, pages 491–500, 2001.

[HS] G Hjaltason and H Samet. Contractive embedding methods for similarity searching in metric spaces. Technical report, Computer Science Department, University of Maryland.

[LWM+12] Ruilin Liu, Hui Wendy Wang, Anna Monreale, Dino Pedreschi, Fosca Giannotti, and Wenge Guo. Audio: An integrity auditing framework of outlier-mining-as-a-service systems. In Machine Learning and Knowledge Discovery in Databases, pages 1–18, 2012.

[LZL+15] An Liu, Kai Zheng, Lu Li, Guanfeng Liu, Lei Zhao, and Xiaofang Zhou. Efficient secure similarity computation on encrypted trajectory data. In IEEE International Conference on Data Engineering, pages 66–77, 2015.

[PEM+15] Thorsten Papenbrock, Jens Ehrlich, Jannik Marten, Tommy Neubert, Jan-Peer Rudolph, Martin Schönberg, Jakob Zwiener, and Felix Naumann. Functional dependency discovery: An experimental evaluation of seven algorithms. Proceedings of the VLDB Endowment, 8(10):1082–1093, 2015.

[PHGR13] Bryan Parno, Jon Howell, Craig Gentry, and Mariana Raykova. Pinocchio: Nearly practical verifiable computation. In IEEE Symposium on Security and Privacy (SP), pages 238–252, 2013.
SLIDE 60

References III

[PRZB12] Raluca Ada Popa, Catherine Redfield, Nickolai Zeldovich, and Hari Balakrishnan. Cryptdb: Processing queries on an encrypted database. Communications of the ACM, 55(9):103–111, 2012.

[SAA10] Yasin N Silva, Walid G Aref, and Mohamed H Ali. The similarity join database operator. In IEEE International Conference on Data Engineering, volume 10, pages 892–903, 2010.

[SV10] Nigel P Smart and Frederik Vercauteren. Fully homomorphic encryption with relatively small key and ciphertext sizes. In Public Key Cryptography–PKC, pages 420–443, 2010.

[SVP+12] Srinath Setty, Victor Vu, Nikhil Panpalia, Benjamin Braun, Andrew J Blumberg, and Michael Walfish. Taking proof-based verified computation a few steps closer to practicality. In The USENIX Security Symposium, pages 253–268, 2012.

[TOEY11] Nilothpal Talukder, Mourad Ouzzani, Ahmed K Elmagarmid, and Mohamed Yakout. Detecting inconsistencies in private data with secure function evaluation. Technical report, Computer Science Department, Purdue University, 2011.

[YLKG07] Su Yan, Dongwon Lee, Min-Yen Kan, and Lee C Giles. Adaptive sorted neighborhood methods for efficient record linkage. In Proceedings of the ACM/IEEE-CS Joint Conference on Digital Libraries, pages 185–194, 2007.

[ZHOS10] Zhenjie Zhang, Marios Hadjieleftheriou, Beng Chin Ooi, and Divesh Srivastava. Bed-tree: an all-purpose index structure for string similarity search based on edit distance. In Proceedings of the International Conference on Management of Data, 2010.
SLIDE 61

Q & A Thank you! Questions?