Truth Inference on Sparse Crowdsourcing Data with Local Differential Privacy - PowerPoint PPT Presentation



  1. Truth Inference on Sparse Crowdsourcing Data with Local Differential Privacy, IEEE BIG DATA ’18. Haipei Sun [1], Boxiang Dong [2], Hui (Wendy) Wang [1], Ting Yu [3], Zhan Qin [4]. [1] Stevens Institute of Technology, Hoboken, NJ; [2] Montclair State University, Montclair, NJ; [3] Qatar Computing Research Institute, Doha, Qatar; [4] The University of Texas at San Antonio, San Antonio, Texas. December 12, 2018

  2. Crowdsourcing [Figure: the data curator releases tasks to the workers] • The data curator releases tasks on a crowdsourcing platform.

  3. Crowdsourcing [Figure: the workers send their answers back to the data curator] • The data curator releases tasks on a crowdsourcing platform. • The workers provide their answers to these tasks in exchange for a reward.

  4. Privacy Concern Collecting answers from individual workers may pose potential privacy risks. • Crowdsourcing-related applications collect sensitive personal information from workers. • By using a sequence of surveys, a data curator (DC) could potentially determine the identities of workers.

  5. Differential Privacy Differential privacy (DP) provides a rigorous privacy guarantee. [Figure: workers send raw answers x_1, x_2, ..., x_m to a trusted data curator, who adds noise ξ and releases (1/m) Σ_{i=1}^{m} x_i + ξ to the public] However, classical DP requires a trusted data curator to publish privatized statistical information.

  6. Local Differential Privacy Local differential privacy (LDP) is the state-of-the-art approach for privacy-preserving data collection. [Figure: each worker locally perturbs x_i into x̂_i = x_i + ξ_i before sending it to the untrusted data curator, who computes f(x̂_1, x̂_2, ..., x̂_m)] Before sending the answer to the data curator, each worker perturbs his/her private data locally.

  7. Challenges I - Data Sparsity • Most workers only provide answers to a very small portion of the tasks. • We use NULL to represent the answer when a worker does not provide a response for a specific task.

Dataset          | # of Workers | # of Tasks | Average Sparsity
Web [1]          | 34           | 177        | 0.705882
AdultContent [2] | 825          | 11,040     | 0.993666

• NULL values should also be protected. • Careless perturbation of NULL values may significantly alter the original answer distribution.
[1] http://dbgroup.cs.tsinghua.edu.cn/ligl/crowddata/
[2] https://github.com/ipeirotis/Get-Another-Label/tree/master/data
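The average-sparsity column in the table above is the fraction of (worker, task) cells that are NULL. A minimal sketch of that computation, assuming NumPy and a NaN encoding for NULL (both illustrative choices, not taken from the paper):

```python
import numpy as np

# Toy worker-by-task answer matrix; NaN encodes a NULL (missing) answer.
answers = np.array([
    [1.0, np.nan, 3.0],     # worker 1 answered tasks 1 and 3
    [np.nan, np.nan, 2.0],  # worker 2 answered only task 3
])

# Average sparsity = fraction of (worker, task) cells that are NULL.
sparsity = np.isnan(answers).mean()
print(f"average sparsity: {sparsity:.6f}")  # 0.500000 for this toy matrix
```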

  8. Challenges II - Data Utility • Truth inference estimates the true results from answers provided by workers of different quality. • Most truth inference algorithms iterate until convergence. • We aim to preserve the accuracy of truth inference on the perturbed worker answers; even a slight amount of initial noise in the worker answers may be amplified as it propagates through the iterations.

  9. Our Contributions Extension to Existing Approaches • Laplace perturbation (LP) approach • Randomized response (RR) approach • Both incur a large expected error in the truth inference results Novel Approach We design a new matrix factorization (MF) perturbation algorithm that satisfies LDP and guarantees a small error.

  10. Outline 1 Introduction 2 Related Work 3 Preliminaries 4 Perturbation Schemes • Laplace Perturbation (LP) • Randomized Response (RR) • Matrix Factorization (MF) 5 Experiments 6 Conclusion

  11. Related Work Local differential privacy • Count, heavy hitters [HILM02, HIM02] • Graph synthesization [QYY+17] • Linear regression [NXY+16] Privacy-preserving crowdsourcing • Mutual information [KOV14] • Truth discovery on complete data [LMS+18] Differentially private recommendation • Perturbation on categories [Can02, SJ14] • Iterative factorization [SKSX18]

  12. Preliminaries - Local Differential Privacy (LDP) Definition (ε-Local Differential Privacy). A randomized privatization mechanism M satisfies ε-local differential privacy (ε-LDP) iff, for any pair of answer vectors a and a′ that differ in one cell, and for all z_p ∈ Range(M): Pr[M(a) = z_p] ≤ e^ε · Pr[M(a′) = z_p], where Range(M) denotes the set of all possible outputs of the algorithm M.

  13. Preliminaries - Truth Inference • Associate each worker with a quality q_i. • For each task, estimate the truth by taking the quality-weighted average of the worker answers: μ̂_j = ( Σ_{W_i ∈ W_j} q_i · a_{i,j} ) / ( Σ_{W_i ∈ W_j} q_i ). • For each worker, estimate the quality by measuring the difference between his/her answers and the estimated truths: q_i ∝ 1/σ_i, where σ_i = sqrt( (1/|T_i|) Σ_{T_j ∈ T_i} (a_{i,j} − μ̂_j)^2 ).

  14. Preliminaries - Truth Inference Iteratively updating the estimated truth and worker quality until convergence [LLG+14].
Algorithm 1 Truth inference
Require: The workers' answers {a_{i,j}}
Ensure: The estimated true answer (i.e., the truth) of tasks {μ̂_j} and the quality of workers {q_i}
1: Initialize worker quality q_i = 1/m for each worker W_i ∈ W;
2: while the convergence condition is not met do
3:   Estimate {μ̂_j};
4:   Estimate {q_i};
5: end while
6: return {μ̂_j} and {q_i};
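A minimal Python sketch of Algorithm 1, assuming a NumPy answer matrix with NaN for NULL; the initial truth guess, the convergence test, and the small stabilizing constants are assumptions for illustration, not the authors' implementation:

```python
import numpy as np

def truth_inference(answers, n_iter=50, tol=1e-6):
    """Iterative truth inference on an m x n answer matrix (NaN = NULL).

    Alternates between (1) quality-weighted averages as truth estimates and
    (2) worker qualities proportional to 1/sigma_i, the standard deviation of
    each worker's deviation from the current truth estimates."""
    m, n = answers.shape
    observed = ~np.isnan(answers)
    filled = np.where(observed, answers, 0.0)   # NULLs as 0, masked by weights below
    quality = np.full(m, 1.0 / m)               # line 1: q_i = 1/m
    truth = np.nanmean(answers, axis=0)         # naive initial truth estimates

    for _ in range(n_iter):                     # line 2: until convergence
        # Line 3: truth of each task = quality-weighted average of its answers.
        w = observed * quality[:, None]
        new_truth = (filled * w).sum(axis=0) / (w.sum(axis=0) + 1e-12)

        # Line 4: quality of each worker = inverse std of deviation from the truth.
        for i in range(m):
            if not observed[i].any():
                continue
            diff = answers[i, observed[i]] - new_truth[observed[i]]
            quality[i] = 1.0 / (np.sqrt(np.mean(diff ** 2)) + 1e-12)

        converged = np.max(np.abs(new_truth - truth)) < tol
        truth = new_truth
        if converged:
            break

    return truth, quality                       # line 6
```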

  15. Preliminaries - Matrix Factorization Given M ∈ R^{m×n}, find U ∈ R^{m×d} and V ∈ R^{n×d} such that L(M, U, V) = Σ_{(i,j) ∈ Ω} (M_{i,j} − u_i^T v_j)^2 is minimized. Each observed entry M_{i,j} can be approximated by the inner product of u_i and v_j, i.e., M_{i,j} ≈ u_i^T v_j.
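A toy stochastic-gradient sketch of this objective: fit U and V so that u_i^T v_j matches M on the observed cells Ω only. The learning rate, rank d, and random initialization are illustrative assumptions:

```python
import numpy as np

def factorize(M, d=3, lr=0.01, n_iter=200, seed=0):
    """Fit U (m x d) and V (n x d) so that U @ V.T approximates M on the
    observed (non-NaN) cells, by stochastic gradient descent on the squared loss."""
    rng = np.random.default_rng(seed)
    m, n = M.shape
    U = rng.normal(scale=0.1, size=(m, d))
    V = rng.normal(scale=0.1, size=(n, d))
    rows, cols = np.where(~np.isnan(M))          # the observed index set Omega

    for _ in range(n_iter):
        for i, j in zip(rows, cols):
            err = M[i, j] - U[i] @ V[j]          # residual on one observed cell
            ui, vj = U[i].copy(), V[j].copy()
            U[i] += lr * err * vj                # gradient step on u_i
            V[j] += lr * err * ui                # gradient step on v_j
    return U, V
```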

  16. Problem Statement Input: A set of workers {W_i}, their answer vectors A = {a_i}, and a privacy parameter ε. Output: The perturbed answer vectors A^P = {M(a_i) | a_i ∈ A}. Requirement • Privacy: A^P satisfies ε-LDP. • Utility: Accurate truth inference results from A^P, i.e., minimize MAE(A^P) = (1/n) Σ_{T_j ∈ T} |μ_j − μ̂_j|.
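A direct NumPy rendering of the utility measure MAE(A^P) (the function name is an illustrative choice):

```python
import numpy as np

def mae(ground_truth, estimated_truth):
    """Mean absolute error between the ground truths {mu_j} and the truths
    {mu_hat_j} inferred from the perturbed answers A^P."""
    ground_truth = np.asarray(ground_truth, dtype=float)
    estimated_truth = np.asarray(estimated_truth, dtype=float)
    return np.mean(np.abs(ground_truth - estimated_truth))
```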

  17. Laplace Perturbation (LP) Step 1: Replace NULL values with some value v in the answer domain Γ: g(a_{i,j}) = v if a_{i,j} = NULL, and g(a_{i,j}) = a_{i,j} if a_{i,j} ≠ NULL. Step 2: Add Laplace noise to each answer: L(a_i) = ( g(a_{i,1}) + Lap(|Γ|/ε), g(a_{i,2}) + Lap(|Γ|/ε), ..., g(a_{i,n}) + Lap(|Γ|/ε) ).
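A minimal sketch of the LP scheme for one worker's answer vector, assuming NumPy, NaN for NULL, a uniformly random choice of the fill-in value v (the slide leaves v unspecified), and |Γ| read as the size of the answer domain:

```python
import numpy as np

def laplace_perturb(a_i, domain, epsilon, rng=None):
    """LP sketch: replace NULLs (NaN) with a random value from the answer
    domain Gamma, then add Laplace noise of scale |Gamma| / epsilon to every answer."""
    rng = np.random.default_rng() if rng is None else rng
    a_i = np.asarray(a_i, dtype=float).copy()
    nulls = np.isnan(a_i)
    a_i[nulls] = rng.choice(domain, size=nulls.sum())       # Step 1: fill NULL values
    scale = len(domain) / epsilon                           # |Gamma| / epsilon
    return a_i + rng.laplace(scale=scale, size=a_i.shape)   # Step 2: Laplace noise
```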

  18. Laplace Perturbation (LP) Theorem 1 (Expected MAE of LP). Given a set of answer vectors A = {a_i}, let A^P = {â_i} be the answer vectors after applying LP on A. Then the expected error E[MAE(A^P)] of the estimated truth on A^P satisfies E[MAE(A^P)] ≤ (1/n) Σ_{j=1}^{n} Σ_{i=1}^{m} (q_i × e^{LP}_{i,j}), where e^{LP}_{i,j} = (1 − s_i)(φ_j + |Γ|/ε) + s_i(√(2/π)·σ_i + |Γ|/ε), μ_j is the ground truth of task T_j, σ_i is the standard deviation of worker W_i's error, s_i is the fraction of the tasks for which W_i returns non-NULL values, and φ_j is the deviation between μ_j and the expected value E(v) of the fill-in value v.

  19. Laplace Perturbation (LP) Simple Setting • q_i = 1/m, σ_i = 1, i.e., all workers have the same quality. • μ_j = 1, i.e., all ground truths are 1. • s_i = 0.1, i.e., 10% of the answers are non-NULL. • |Γ| = 10. • ε = 1. Expected Error: E[MAE(A^P)] ≤ 14.13

  20. Randomized Response (RR) • Add NULL to the answer domain Γ. • For each answer a_{i,j}, apply randomized response: for all y ∈ Γ, Pr[M(a_{i,j}) = y] = e^ε / (|Γ| + e^ε) if y = a_{i,j}, and 1 / (|Γ| + e^ε) if y ≠ a_{i,j}. Each original answer either • remains unchanged with probability e^ε / (|Γ| + e^ε), or • is replaced with a different value with probability 1 / (|Γ| + e^ε).
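A sketch of the RR scheme for a single answer, reading |Γ| as the size of the original answer domain, so that with NULL added there are |Γ| + 1 values and the probabilities above sum to one; the uniform choice among the other values is an assumption for illustration:

```python
import numpy as np

def randomized_response(answer, domain_with_null, epsilon, rng=None):
    """RR sketch for one answer. `domain_with_null` is the answer domain Gamma
    with the NULL symbol already added, i.e. |Gamma| + 1 values in total."""
    rng = np.random.default_rng() if rng is None else rng
    gamma = len(domain_with_null) - 1               # |Gamma|: original domain size
    keep_prob = np.exp(epsilon) / (gamma + np.exp(epsilon))
    if rng.random() < keep_prob:
        return answer                               # kept with prob e^eps / (|Gamma| + e^eps)
    others = [v for v in domain_with_null if v != answer]
    return others[rng.integers(len(others))]        # each with prob 1 / (|Gamma| + e^eps)
```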

  21. Randomized Response (RR) Theorem 2 (Expected MAE of RR). Given a set of answer vectors A = {a_i}, let A^P = {â_i} be the answer vectors after applying RR on A. Then the expected error E[MAE(A^P)] of the estimated truth on A^P satisfies E[MAE(A^P)] ≤ (1/n) Σ_{j=1}^{n} ( Σ_{W_i ∈ W_j} q_i × e^{RR}_{i,j} ) / ( Σ_{W_i ∈ W_j} q_i ), where e^{RR}_{i,j} = (1 − s_i) · |μ_j − (1/(e^ε + |Γ|)) Σ_{y ∈ Γ} y| + s_i · Σ_{x ∈ Γ} N(x; μ_j, σ_i) · |μ_j − Σ_{y ∈ Γ} y·P_{xy}|, s_i is the fraction of tasks for which worker W_i returns non-NULL values, and P_{xy} is the probability that value x is replaced with y.

  22. Randomized Response (RR) Simple Setting • q_i = 1/m, σ_i = 1, i.e., all workers have the same quality. • μ_j = 0, i.e., all ground truths are 0. • s_i = 0.1, i.e., 10% of the answers are non-NULL. • Γ = [0, 9]. • ε = 1. Expected Error: E[MAE(A^P)] ≤ 3.551

  23. Matrix Factorization (MF) • The DC randomly generates the task profile matrix V ∈ R^{n×d} and sends both V and the tasks T to the workers. [Figure: the data curator broadcasts (V, T) to every worker]

  24. Matrix Factorization (MF) • The DC randomly generates the task profile matrix V ∈ R^{n×d} and sends both V and the tasks T to the workers. • Every worker produces his/her answers a_i and returns only the differentially private answer profile vector u_i. [Figure: each worker sends u_i to the data curator, who reconstructs the answers via a_{i,j} ≈ u_i^T v_j]

  25. Matrix Factorization (MF) Instead of directly adding noise to u_i, we design a novel approach based on objective perturbation to reduce the distortion: u_i = argmin_{u_i} L_DP(a_i, u_i, V), where L_DP(a_i, u_i, V) = Σ_{T_j ∈ T_i} (a_{i,j} − u_i^T v_j)^2 + 2·u_i^T η_i, and η_i = ( Lap(|Γ|/ε), ..., Lap(|Γ|/ε) ) is a d-dimensional noise vector.
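Setting the gradient of L_DP to zero gives (Σ_{T_j ∈ T_i} v_j v_j^T) u_i = Σ_{T_j ∈ T_i} a_{i,j} v_j − η_i, so each worker can compute his/her private profile locally in closed form. A minimal worker-side sketch, assuming NumPy, NaN for NULL, and a tiny ridge term added purely for numerical stability (not part of the slide's objective):

```python
import numpy as np

def private_profile(a_i, V, epsilon, gamma_size, rng=None):
    """Worker-side MF perturbation sketch: minimize
        sum_j (a_ij - u.T v_j)^2 + 2 u.T eta
    over the tasks this worker answered, where eta is d-dimensional Laplace noise."""
    rng = np.random.default_rng() if rng is None else rng
    answered = ~np.isnan(a_i)
    V_i = V[answered]                                       # task profiles of answered tasks
    d = V.shape[1]
    eta = rng.laplace(scale=gamma_size / epsilon, size=d)   # Lap(|Gamma| / epsilon) noise
    A = V_i.T @ V_i + 1e-8 * np.eye(d)                      # tiny ridge for a stable solve
    b = V_i.T @ a_i[answered] - eta
    return np.linalg.solve(A, b)                            # the private answer profile u_i
```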
