SLIDE 1 PRIVATE CORE-SETS
FOR FACE IDENTIFICATION
DAN FELDMAN, UNIVERSITY OF HAIFA
JOINT WORK WITH RITA OSADCHY
SOME SLIDES FROM KOBBI NISSIM
[STOC’11] - PRIVATE CORESETS
[OAKLAND’10] - SECURE COMPUTATION OF FACE IDENTIFICATION
SLIDE 2 GOALS OF THIS WORK
Practical and useful implementation of private core-sets by combining:
- Theoretical proofs regarding private coresets [D., Nissim, Fiat, Kaplan]
- Practical system for face identification [Osadchy, Pinkas, Jarrous, Moskovitch]
SLIDE 3 K-ANONYMITY [SS98,S02]
As in ancient crypto:
- 'ehlol owrdl' looked like a good encryption for 'hello world'
- No proofs against a smart attacker or the use of auxiliary data
- In general:
- (1) An algorithm is published against the latest attack mechanism
- (2) A new attack mechanism is published
- (3) Go to 1
SLIDE 4 K-ANONYMITY [SS98,S02]
Prevent re-identification:
Make every individual indistinguishable from at least k-1 other individuals
Original table:
Disease | Sex | Age | ZIP
Heart | Female | 55 | 23456
Heart | Male | 30 | 12345
Heart | Male | 33 | 12346
Breast Cancer | Female | 45 | 13144
Hepatitis | Male | 42 | 13155
Viral | Male | 42 | 23456

Released table:
Disease | Sex | Age | ZIP
Heart | * | ** | 23456
Heart | Male | 3* | 1234*
Heart | Male | 3* | 1234*
Breast Cancer | * | 4* | 131**
Hepatitis | * | 4* | 131**
Viral | * | ** | 23456

- Bugger! I cannot tell which disease the patients from ZIP 23456 have
- Both guys from ZIP 1234* who are in their thirties have heart problems
- My (male) neighbor from ZIP 13155 has hepatitis!
Slide borrowed from Kobbi Nissim
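A minimal sketch of how the released table could be checked, assuming the quasi-identifiers are sex, age and ZIP. The function names `anonymity_level` and `homogeneous_groups` are illustrative and not from the talk. The first reports the largest k for which the table is k-anonymous; the second lists groups in which every member shares the same disease, which is how the second conclusion above leaks despite k-anonymity.

```python
from collections import Counter

# Released (generalized) table from the slide: (disease, sex, age, zip).
released = [
    ("Heart", "*", "**", "23456"),
    ("Heart", "Male", "3*", "1234*"),
    ("Heart", "Male", "3*", "1234*"),
    ("Breast Cancer", "*", "4*", "131**"),
    ("Hepatitis", "*", "4*", "131**"),
    ("Viral", "*", "**", "23456"),
]


def anonymity_level(rows, quasi_identifier_columns):
    """Return the size of the smallest group of rows sharing quasi-identifiers."""
    groups = Counter(
        tuple(row[i] for i in quasi_identifier_columns) for row in rows
    )
    return min(groups.values())


def homogeneous_groups(rows, quasi_identifier_columns, sensitive_column):
    """Return groups whose members all share one sensitive value."""
    seen = {}
    for row in rows:
        key = tuple(row[i] for i in quasi_identifier_columns)
        seen.setdefault(key, set()).add(row[sensitive_column])
    return [key for key, values in seen.items() if len(values) == 1]


k = anonymity_level(released, (1, 2, 3))
leaky = homogeneous_groups(released, (1, 2, 3), 0)
```

Here the table is 2-anonymous, yet the group (Male, 3*, 1234*) still reveals the disease of both of its members.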
SLIDE 5
SEARCH FOR: PRIVACY
SLIDE 6 AOL SEARCH HISTORY RELEASE (2006)
650,000 users, 20 Million queries, 3 months
Goal: provide real query log data that is
based on real users
“It could be used for personalization, query
reformulation or other types of search research”
Privacy? Identifying information replaced with random identifiers
SLIDE 7
User ID | Query | Time | Rank | Clicked URL
4417749 | best dog for older owner | 3/6/2006 11:48:24 | 1 | http://www.canismajor.com
4417749 | best dog for older owner | 3/6/2006 11:48:24 | 5 | http://dogs.about.com
4417749 | landscapers in lilburn ga. | 3/6/2006 18:37:26 | |
4417749 | effects of nicotine | 3/7/2006 19:17:19 | 6 | http://www.nida.nih.gov
4417749 | best retirement in the world | 3/9/2006 21:47:26 | 4 | http://www.escapeartist.com
4417749 | best retirement place in usa | 3/9/2006 21:49:37 | 10 | http://www.clubmarena.com
4417749 | best retirement place in usa | 3/9/2006 21:49:37 | 9 | http://www.committment.com
4417749 | bi polar and heredity | 3/13/2006 20:57:11 | |
4417749 | adventure for the older american | 3/17/2006 21:35:48 | |
4417749 | nicotine effects on the body | 3/26/2006 10:31:15 | 3 | http://www.geocities.com
4417749 | nicotine effects on the body | 3/26/2006 10:31:15 | 2 | http://health.howstuffworks.com
4417749 | wrinkling of the skin | 3/26/2006 10:38:23 | |
4417749 | mini strokes | 3/26/2006 14:56:56 | 1 | http://www.ninds.nih.gov
4417749 | panic disorders | 3/26/2006 14:58:25 | |
4417749 | jarrett t. arnold eugene oregon | 3/23/2006 21:48:01 | 2 | http://www2.eugeneweekly.com
4417749 | jarrett t. arnold eugene oregon | 3/23/2006 21:48:01 | 3 | http://www2.eugeneweekly.com
4417749 | plastic surgeons in gwinnett county | 3/28/2006 15:04:23 | 1 | http://www.wedalert.com
4417749 | plastic surgeons in gwinnett county | 3/28/2006 15:04:23 | 4 | http://www.implantinfo.com
4417749 | plastic surgeons in gwinnett county | 3/28/2006 15:31:00 | |
4417749 | 60 single men | 3/29/2006 20:11:52 | 6 | http://www.adultlovecompass.com
4417749 | 60 single men | 3/29/2006 20:14:14 | |
4417749 | clothes for 60 plus age | 4/19/2006 12:44:03 | |
4417749 | clothes for age 60 | 4/19/2006 12:44:41 | 10 | http://www.news.cornell.edu
4417749 | clothes for age 60 | 4/19/2006 12:45:41 | |
4417749 | lactose intolerant | 4/21/2006 20:53:51 | 2 | http://digestive.niddk.nih.gov
4417749 | lactose intolerant | 4/21/2006 20:53:51 | 10 | http://www.netdoctor.co.uk
4417749 | dog who urinate on everything | 4/28/2006 13:24:07 | 6 | http://www.dogdaysusa.com
4417749 | fingers going numb | 5/2/2006 17:35:47 | |
SLIDE 8
Name: Thelma Arnold
Age: 62
Widow
Residence: Lilburn, GA
SLIDE 9 LINKAGE ATTACKS [SWEENEY 02]
Anonymized GIC data: Ethnicity, Visit date, Diagnosis, Procedure, Medication, Total charge, ZIP, Birth date, Sex
Voter registration: ZIP, Birth date, Sex, Name, Address, Date registered, Party affiliation, Date last voted
Attributes in both: ZIP, Birth date, Sex
GIC (Group Insurance Commission): anonymized patient-specific data (135,000 patients), 100 attributes per encounter
Voter registration: “public records”, open for inspection by anyone
SLIDE 10 LINKAGE ATTACKS [SWEENEY 02]
William Weld (governor of Massachusetts at the time)
According to the Cambridge voter list:
- Six people had his particular birth date
- Of which three were men
- He was the only one in his 5-digit ZIP code!
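The linkage step can be sketched as a join on the shared attributes (ZIP, birth date, sex). All records below are hypothetical toy values, not the real GIC or voter data, and the function name `link` is illustrative. A medical row is re-identified whenever its quasi-identifiers match exactly one voter.

```python
# Hypothetical toy records; none of these values come from the real datasets.
medical = [
    {"zip": "02138", "birth_date": "1950-01-01", "sex": "M", "diagnosis": "A"},
    {"zip": "02139", "birth_date": "1950-01-01", "sex": "M", "diagnosis": "B"},
    {"zip": "02138", "birth_date": "1961-05-05", "sex": "F", "diagnosis": "C"},
]
voters = [
    {"name": "Voter One", "zip": "02138", "birth_date": "1950-01-01", "sex": "M"},
    {"name": "Voter Two", "zip": "02139", "birth_date": "1950-01-01", "sex": "M"},
    {"name": "Voter Three", "zip": "02139", "birth_date": "1950-01-01", "sex": "M"},
    {"name": "Voter Four", "zip": "02138", "birth_date": "1961-05-05", "sex": "F"},
]

QUASI_IDENTIFIERS = ("zip", "birth_date", "sex")


def link(medical_rows, voter_rows):
    """Map voter names to diagnoses where the quasi-identifiers are unique."""
    names_by_key = {}
    for voter in voter_rows:
        key = tuple(voter[c] for c in QUASI_IDENTIFIERS)
        names_by_key.setdefault(key, []).append(voter["name"])

    matches = {}
    for row in medical_rows:
        key = tuple(row[c] for c in QUASI_IDENTIFIERS)
        names = names_by_key.get(key, [])
        if len(names) == 1:  # a unique match re-identifies the patient
            matches[names[0]] = row["diagnosis"]
    return matches


reidentified = link(medical, voters)
```

In this toy data two of the three medical rows are re-identified; the third is protected only because two voters happen to share its quasi-identifiers.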
SLIDE 11 LINKAGE ATTACK ON THE NETFLIX DATASET
- Netflix: online movie rental service
- In October 2006, released real movie ratings of 500,000 subscribers
- 10% of all Netflix users as of late 2005
- Names removed, maybe perturbed
SLIDE 12 THE NETFLIX DATASET
[Table: one row per user (1234, 5678, 2589, 4379, …; 500K users), one column per movie (Movie 1, Movie 2, Movie 3, …); each cell holds a rating and a timestamp]
17K movies – high dimensional!
Average subscriber has 214 dated ratings
SLIDE 13 NETFLIX DATASET: NEAREST NEIGHBOR
Considering just movie names, for 90% of records there isn’t a single other record that is more than 30% similar
[Figure; axis label: similarity]
Slide borrowed from Elaine Shi
SLIDE 14 Threat: deanonymization
Anonymized ratings:
User | Movie | Rating
1234 | Rocky II | 3/5
1234 | The Wizard | 4/5
1234 | The Dark Knight | 5/5
…
1234 | Girls Gone Wild | 5/5

Public ratings:
User | Movie | Rating
dukefan | The Wizard | 8/10
dukefan | The Dark Knight | 10/10
dukefan | Rocky II | 6/10
…

User 1234 is dukefan!
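A rough sketch of the matching idea, assuming ratings are rescaled to [0, 1] and two ratings "agree" when they differ by at most a tolerance. The scoring rule, the tolerance, and the `other_user` profile are assumptions made for illustration; they are not the attack actually used on the Netflix data.

```python
# Ratings rescaled to [0, 1]; the three shared titles are taken from the slide.
anonymized_record = {"Rocky II": 3 / 5, "The Wizard": 4 / 5, "The Dark Knight": 5 / 5}

public_profiles = {
    "dukefan": {"The Wizard": 8 / 10, "The Dark Knight": 10 / 10, "Rocky II": 6 / 10},
    # Hypothetical second profile, added only so there is something to compare against.
    "other_user": {"Rocky II": 10 / 10, "The Wizard": 2 / 10},
}


def match_score(record, profile, tolerance=0.1):
    """Count movies rated by both whose rescaled ratings nearly agree."""
    shared = set(record) & set(profile)
    return sum(1 for movie in shared if abs(record[movie] - profile[movie]) <= tolerance)


def best_match(record, profiles):
    """Return the public user name whose profile agrees with the record most."""
    return max(profiles, key=lambda name: match_score(record, profiles[name]))


candidate = best_match(anonymized_record, public_profiles)
```

Once a record is matched, every other rating in the anonymized row, including ones never posted publicly, is attached to the public identity.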
SLIDE 15 Auditing using differential privacy
[Diagram: an auditor stands between the querier and the data, and keeps a query log q1,…,qi]
- Querier: Here’s a new query: qi+1
- Auditor: Answer is… OR Query denied (answering would cause privacy loss)
SLIDE 16 Database Privacy: The Setting
Government, Businesses, Researchers (or) Malicious adversary
Users
- Database x = (x1,x2, …,xn) (a table of n rows)
- Each element is from some domain D
- D can be numbers, categories, tax forms, etc.
[Diagram: users contribute the rows x1, x2, x3, …, xn-1, xn of database x; algorithm A receives queries and returns answers]
SLIDE 17 DIFFERENTIAL PRIVACY [DMNS06]
x = (x1, x2, x3, …, xn-1, xn)
x’ = (x1, x2’, x3, …, xn-1, xn) (one row modified)
A(x) and A(x’): distributions at “distance” < ε
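One standard way to obtain this kind of guarantee for a counting query is to add Laplace noise scaled to 1/ε, since modifying one row changes a count by at most 1. The sketch below is not taken from the slides; the function name `private_count` and the example data are assumptions for illustration.

```python
import random


def private_count(rows, predicate, epsilon, rng):
    """Return a noisy count of rows satisfying predicate (Laplace mechanism)."""
    true_count = sum(1 for row in rows if predicate(row))
    scale = 1.0 / epsilon  # sensitivity of a counting query is 1
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with that scale.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise


# Ages from the k-anonymity example earlier in the talk.
ages = [55, 30, 33, 45, 42, 42]
rng = random.Random(0)
noisy = private_count(ages, lambda age: age >= 40, epsilon=0.5, rng=rng)
```

Smaller ε means more noise and a stronger guarantee; larger ε gives answers closer to the true count of 4.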
SLIDE 18
DIFFERENTIAL PRIVACY [DMNS06]
SLIDE 19 CONCLUSION
k-Anonymity is practical and easy to
use, but not so safe in theory and practice
Differential privacy is safe but not so
practical to use
SLIDE 20 Main Tool : Coresets
Given data D and an algorithm A for which A(D) is intractable, can we efficiently reduce D to C so that A(C) is fast and A(C) ≈ A(D)?
SLIDE 32
From Big Data to Small Data
SLIDE 36
Delete the pair of original coresets from memory
SLIDE 46
Parallel + Streaming Computation
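The streaming idea (merge a pair of coresets, reduce the result, then delete the pair from memory) can be sketched as a merge-and-reduce tree. The `reduce_coreset` function below uses uniform sampling with reweighting purely as a placeholder; it is not the coreset construction from the talk, and the names `streaming_coreset`, `leaf_size` and `coreset_size` are assumptions for illustration.

```python
import random


def reduce_coreset(weighted_points, size, rng):
    """Placeholder reduction: uniform sample, reweighted to keep total weight."""
    if len(weighted_points) <= size:
        return list(weighted_points)
    total = sum(weight for _, weight in weighted_points)
    sample = rng.sample(weighted_points, size)
    sample_total = sum(weight for _, weight in sample)
    return [(point, weight * total / sample_total) for point, weight in sample]


def streaming_coreset(stream, leaf_size, coreset_size, seed=0):
    """Maintain at most one coreset per tree level while reading the stream."""
    rng = random.Random(seed)
    levels = []  # levels[i] is None or a coreset for leaf_size * 2**i points
    buffer = []

    def push(coreset):
        level = 0
        while level < len(levels) and levels[level] is not None:
            # Merge the pair, reduce it, and drop the two originals from memory.
            coreset = reduce_coreset(levels[level] + coreset, coreset_size, rng)
            levels[level] = None
            level += 1
        if level == len(levels):
            levels.append(None)
        levels[level] = coreset

    for point in stream:
        buffer.append((point, 1.0))
        if len(buffer) == leaf_size:
            push(reduce_coreset(buffer, coreset_size, rng))
            buffer = []

    leftover = buffer + [wp for c in levels if c is not None for wp in c]
    return reduce_coreset(leftover, coreset_size, rng)


summary = streaming_coreset((float(i) for i in range(1000)), 100, 20)
```

Only a logarithmic number of coresets is held in memory at any time, and independent subtrees could be built on different machines, which is what makes the same structure usable for parallel computation.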
SLIDE 47 Good: Coresets reveal little information
[Figure legend: coreset point, original point]
SLIDE 48 Bad: Still, coresets do not preserve privacy
Bad: Coresets are not differentially private
[Figure legend: coreset point, original point]
SLIDE 49
SLIDE 50
(STOC’11)
SLIDE 51
(STOC’11)
SLIDE 52
SLIDE 53 Application: Face Identification
Our user will have ID 72135: his eyes are similar to user No. 7, ears are similar to user No. 2, lips are similar to user No. 1, …
match / no match Operator
Public Database of faces
SLIDE 54 NEW FACE REPRESENTATION: PATCH-BASED FACE REPRESENTATION
A face is represented by a collection of informative patches.
Assume that the face is represented by p patches.
- Patch centers
- Patch size: could vary
SLIDE 55 COMPUTE K-REPRESENTATIVES FROM EACH DATABASE
K-Means – Minimize the sum of squared distances from each point to its nearest center
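As a small illustration of this objective, the function below evaluates the k-means cost of a given set of centers. The points and centers are made-up toy values, and `kmeans_cost` is an illustrative name.

```python
import math


def kmeans_cost(points, centers):
    """Sum over all points of the squared distance to the nearest center."""
    return sum(min(math.dist(p, c) ** 2 for c in centers) for p in points)


# Toy data: two tight pairs of points, one center in the middle of each pair.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers = [(0.0, 1.0), (10.0, 1.0)]
cost = kmeans_cost(points, centers)
```

Every point here is at distance 1 from its nearest center, so the cost is the number of points.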
SLIDE 56
Your Eye ID is the closest center
SLIDE 57 Representing a face
For each of the p patches, store the indices of the closest patches in the dictionary.
V =
Patch 1: 5, 8, 9, 14
Patch 2: 3, 8, 10, 11
Patch 3: 7, 9, 12, 18
…
Patch p: 4, 6, 10, 12
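A minimal sketch of building such a vector V, assuming each patch is a numeric descriptor and each dictionary is a list of center descriptors (for example, k-means centers for eyes, ears, and so on). The descriptors below are toy 2-D values, the number m of stored indices is an assumed parameter, and the function names are illustrative.

```python
import math


def closest_indices(patch, dictionary, m):
    """1-based indices of the m dictionary entries nearest to the patch."""
    order = sorted(
        range(len(dictionary)), key=lambda i: math.dist(patch, dictionary[i])
    )
    return [i + 1 for i in sorted(order[:m])]


def represent_face(patches, dictionaries, m):
    """Build V: one list of dictionary indices per patch of the face."""
    return [
        closest_indices(patch, dictionary, m)
        for patch, dictionary in zip(patches, dictionaries)
    ]


# Toy example: p = 2 patches, each with its own dictionary of 4 centers.
eye_dictionary = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (6.0, 5.0)]
ear_dictionary = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (6.0, 5.0)]
face_patches = [(0.9, 0.1), (5.2, 5.0)]

V = represent_face(face_patches, [eye_dictionary, ear_dictionary], m=2)
```

The face is then described only by indices into the dictionaries, never by the raw patch descriptors.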
SLIDE 58 Privacy
- The same technique can be used for databases of genes, drivers, and customers
- Currently we use a public database: people agreed to have their faces published
- We want to use a private database instead
- You can compute your ID without learning about the actual people in the database
SLIDE 59 Experiment
- Input: Database P of users (faces) that want to
keep their privacy
- Compute Private Coreset for k-means of P
- ah the coreset with its k-means
- Repeat for each database:
– Ears, noses, etc.
- Every user has a public ID based on a private coreset
SLIDE 60 Implementation
- Generation of representations from images:
– Implemented in Matlab, translated to Java using Matlab Java builder.
– ~0.3 sec to compare to an image in the database
– An implementation in C will be much faster
- Private Coreset in Python (by Gilad Levi, Oren Efraimov, Yona Zahi)
SLIDE 61 Results on Face Identification
[Figure: true positive rate vs. false positive rate, one curve per setting: No privacy; d=100, alpha=0.5 leakage; d=20, alpha=0.4 leakage; d=10, alpha=0.3 leakage; d=5, alpha=0.05 leakage; d=3, alpha=0.01 leakage]
SLIDE 62 Future Work
- Compute Private Coreset on the Cloud
– Using Homomorphic Encryption (with Shafi Goldwasser & Daniela Rus)
- Compute coresets with error polynomial in d
– (with Daniela Rus & Kobbi Nissim)
- Fit the error function to Face Identification
– (With Rita Osadchy and Kobbi Nissim?)
- Private coresets for other machine learning
problems
SLIDE 98
Input
A set of n points $P \subset \mathbb{R}^d$, $k \ge 1$.
SLIDE 99
Output
N: a small bicriteria approximation to the k-median of P
SLIDE 100 The Algorithm
1) $t \leftarrow 1$ (counter for iterations)
2) $N \leftarrow \emptyset$ (the output set of centers)
3) Construct a weak $\frac{1}{8k}$-net $N_t$ for $P$
4) $N \leftarrow N \cup N_t$
5) $\forall p$: compute $\mathrm{dist}(p, N_t)$
6) Remove $P_t$: the half of $P$ that is closer to $N_t$
7) $t \leftarrow t + 1$
8) Repeat steps 3 to 6 till there are no more input points.
9) Return $N$
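The loop above can be sketched in a few lines of Python. One loud assumption: the weak net in step 3 is replaced by a uniform random sample of the remaining points, and the default `sample_size` of 8k is a hypothetical choice, not a bound from the talk. The function name `bicriteria_k_median` is also illustrative.

```python
import math
import random


def dist_to_set(p, centers):
    """Distance from point p to its nearest point in centers."""
    return min(math.dist(p, c) for c in centers)


def bicriteria_k_median(points, k, sample_size=None, seed=0):
    """Return (centers, rounds) following steps 1 to 9 of the slide."""
    rng = random.Random(seed)
    remaining = list(points)
    if sample_size is None:
        sample_size = 8 * k  # assumption: stand-in for the net size
    centers = []  # N, the output set
    rounds = 0    # number of completed iterations (t - 1)
    while remaining:
        # Step 3 (stand-in): a random sample plays the role of the net N_t.
        net = rng.sample(remaining, min(sample_size, len(remaining)))
        # Step 4: N <- N united with N_t.
        centers.extend(net)
        # Steps 5 and 6: remove the half of the points closest to N_t.
        remaining.sort(key=lambda p: dist_to_set(p, net))
        remaining = remaining[(len(remaining) + 1) // 2:]
        rounds += 1
    return centers, rounds


toy_points = [(float(i % 37), float(i % 11)) for i in range(1000)]
toy_centers, toy_rounds = bicriteria_k_median(toy_points, k=3)
```

Because the input halves in every round, the loop runs about log2(n) times, so the output has roughly sample_size times log2(n) centers: more than k, which is the sense in which the approximation is bicriteria.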
SLIDE 101
Proof of Correctness (for the non-private case)
SLIDE 102
A point $b \in P$ is bad for $N_t$ if: $\mathrm{dist}(b, N_t) > 2\,\mathrm{dist}(b, N^*)$
SLIDE 103
A point $g \in P$ is good for $N_t$ otherwise: $\mathrm{dist}(g, N_t) \le 2\,\mathrm{dist}(g, N^*)$
SLIDE 104
Main Technical Theorem
We can map every bad point $b \in P_t$ to a distinct good point $g \in P_{t+1}$.
SLIDE 105
$\mathrm{dist}(b, N) \le \mathrm{dist}(b, N_t)$, because $N \supseteq N_t$.
Since $b \in P_t$ and $g \in P_{t+1}$: $\mathrm{dist}(b, N_t) \le \mathrm{dist}(g, N_t)$
Since $g$ is good for $N_t$: $\mathrm{dist}(g, N_t) \le 2\,\mathrm{dist}(g, N^*)$
SLIDE 106
$\mathrm{dist}(b, N) \le \mathrm{dist}(b, N_t)$, because $N \supseteq N_t$.
Since $b \in P_t$ and $g \in P_{t+1}$: $\mathrm{dist}(b, N_t) \le \mathrm{dist}(g, N_t)$
Since $g$ is good for $N_t$: $\mathrm{dist}(g, N_t) \le 2\,\mathrm{dist}(g, N^*)$
Hence: $\mathrm{dist}(b, N) \le 2\,\mathrm{dist}(g, N^*)$
SLIDE 107 Bi-Criteria for k-Median
$$\sum_{p \in P} \mathrm{dist}(p, N) = \sum_{g} \mathrm{dist}(g, N) + \sum_{b} \mathrm{dist}(b, N) \le \sum_{g} 2\,\mathrm{dist}(g, N^*) + \sum_{g} 2\,\mathrm{dist}(g, N^*) \le 4 \sum_{p \in P} \mathrm{dist}(p, N^*)$$
SLIDE 108
SLIDE 109
Bi-Criteria Approximation Algorithm [FFS07]
SLIDE 110
Initialization
1) $t \leftarrow 1$ (counter for iterations)
2) $F \leftarrow \emptyset$ (the output set of j-flats)
SLIDE 111
3) Construct a weak $\frac{1}{8k}$-net $N_t$ for $P$
(t = 1)
SLIDE 112
4) $N \leftarrow N \cup N_t$
(t = 1)
SLIDE 113
5) $\forall p$: compute $\mathrm{dist}(p, N_t)$
(t = 1)
SLIDE 114
6) Remove $P_t$: the half of $P$ that is closer to $N_t$
(t = 1)
SLIDE 115
6) Remove $P_t$: the half of $P$ that is closer to $N_t$
(t = 1)
SLIDE 116
7) $t \leftarrow t + 1$
8) Repeat steps 3 to 6:
SLIDE 117
3) Construct a weak $\frac{1}{k}$-net $N_t$ for $P$
(t = 2)
SLIDE 118
4) $N \leftarrow N \cup N_t$
(t = 2)
SLIDE 119
5) $\forall p$: compute $\mathrm{dist}(p, N_t)$
(t = 2)
SLIDE 120
6) Remove $P_t$: the half of $P$ that is closer to $N_t$
(t = 2)
SLIDE 121
6) Remove $P_t$: the half of $P$ that is closer to $N_t$
(t = 2)
SLIDE 122
6) Remove $P_t$: the half of $P$ that is closer to $N_t$
(t = 2)
SLIDE 123
7) $t \leftarrow t + 1$
8) Repeat steps 3 to 6 till there are no more input points.
9) Return $N$
SLIDE 124
Let $N^*$ be any set of $k$ points in $\mathbb{R}^d$.
SLIDE 125
Let $N^*$ be any set of $k$ points in $\mathbb{R}^d$.
SLIDE 126
Let $N^*$ be any set of $k$ points in $\mathbb{R}^d$.
Consider $N_t$ that is constructed during the $t$-th iteration.
SLIDE 127
A point $b \in P$ is bad for $N_t$ if: $\mathrm{dist}(b, N_t) > 2\,\mathrm{dist}(b, N^*)$
SLIDE 128
A point $g \in P$ is good for $N_t$ otherwise: $\mathrm{dist}(g, N_t) \le 2\,\mathrm{dist}(g, N^*)$
SLIDE 129
Main Technical Theorem
We can map every bad point $b \in P_t$ to a distinct good point $g \in P_{t+1}$.
SLIDE 130
$\mathrm{dist}(b, N) \le \mathrm{dist}(b, N_t)$, because $N \supseteq N_t$.
Since $b \in P_t$ and $g \in P_{t+1}$: $\mathrm{dist}(b, N_t) \le \mathrm{dist}(g, N_t)$
Since $g$ is good for $N_t$: $\mathrm{dist}(g, N_t) \le 2\,\mathrm{dist}(g, N^*)$
SLIDE 131
$\mathrm{dist}(b, N) \le \mathrm{dist}(b, N_t)$, because $N \supseteq N_t$.
Since $b \in P_t$ and $g \in P_{t+1}$: $\mathrm{dist}(b, N_t) \le \mathrm{dist}(g, N_t)$
Since $g$ is good for $N_t$: $\mathrm{dist}(g, N_t) \le 2\,\mathrm{dist}(g, N^*)$
Hence: $\mathrm{dist}(b, N) \le 2\,\mathrm{dist}(g, N^*)$
SLIDE 132 Bi-Criteria for k-Median
$$\sum_{p \in P} \mathrm{dist}(p, N) = \sum_{g} \mathrm{dist}(g, N) + \sum_{b} \mathrm{dist}(b, N) \le \sum_{g} 2\,\mathrm{dist}(g, N^*) + \sum_{g} 2\,\mathrm{dist}(g, N^*) \le 4 \sum_{p \in P} \mathrm{dist}(p, N^*)$$
SLIDE 133
Proof of the Technical Theorem
- The number of bad points is at most $|B| = \frac{|P_t|}{8}$
- $|P_{t+1}| = \frac{|P_t|}{2}$
The number of good points in $P_{t+1}$ is at least $|P_{t+1}| - |B| \ge \frac{|P_t|}{2} - \frac{|P_t|}{8} \ge |B|$
SLIDE 134
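Claim: Only $|B'| = \frac{|P_t|}{8k}$ points are bad for $q \in N_t$
$\mathrm{dist}(p, q) \le 2\,\mathrm{dist}(p, q^*)$
[Figure labels: q, q*, p]
SLIDE 135
$B'$: the $\frac{|P_t|}{8k}$ closest points to $q^*$
SLIDE 136
$B'$: the $\frac{|P_t|}{8k}$ closest points to $q^*$
$B'$ contains $q \in N_t$ (since $N_t$ is a $\frac{1}{8k}$-net)
SLIDE 137
For every yellow point $p \in P \setminus B'$:
$\mathrm{dist}(p, q) \le \mathrm{dist}(p, q^*) + \mathrm{dist}(q^*, q) \le 2\,\mathrm{dist}(p, q^*)$
The second inequality holds because $q$ is in $B'$ while $p$ is not, so $\mathrm{dist}(q^*, q) \le \mathrm{dist}(p, q^*)$.
SLIDE 138
$\mathrm{dist}(p, q) \le 2\,\mathrm{dist}(p, q^*)$
All the yellow points are good for $N_t$
SLIDE 139
$|B'| = \frac{|P_t|}{8}$
Only the black points $B'$ are bad for $N_t$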