Would that it were so simple: Yet another theory of privacy
John Mitchell (Stanford); Avradip Mandal, Hart Montgomery, Arnab Roy (Fujitsu)
It seems Allegra’s a no‐show, which is simply a bore, but I’ll partner you in bridge.
Would that it were so simple
Yet another …
- YAAF – … Application Framework
- Yabasic – … BASIC
- Yacc – … compiler compiler
- Yacas – … computer algebra system
- YaDICs – … Digital Image Correlation Software
- YADIFA – … DNS Implementation For All
- YafaRay – … free Ray tracer
- Yafc – … FTP client
- YAFFS – … Flash File System
- YAP – Yet Another Previewer
- YAPC – … Perl Conference
- YARN – … Resource Negotiator
- YARP – … Robot Platform
- YARV – … Ruby VM
- Yasca – … Source Code Analyzer
- Y.A.S.U. – … SecuROM Utility
- Yate – … Telephony Engine
- YAWC – … Wersion of Citadel
- YAWL – … Workflow Language
- Yaws – … web server
Hard to “anonymize” data
- De‐identifying data does not necessarily achieve anonymity; it can often be re‐identified:
[Diagram: Medical Data fields (ethnicity, visit date, diagnosis, procedure, medication, total bill) and Voter List fields (name, address, date registered, party, date last voted) overlap on ZIP, birth date, and sex. Source: Latanya Sweeney]
Date of birth, gender + 5‐digit ZIP uniquely identifies 87.1% of U.S. population
- ZIP 60623 has 112,167 people, but only 11% are uniquely identified: too few residents over 55 live there. (Source: Latanya Sweeney; figures are for one ZIP code.)
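The 87.1% figure can be reproduced in miniature: count the records that are the only ones with their (ZIP, birth date, sex) combination. A minimal sketch, using a hypothetical toy population rather than real census data:

```python
from collections import Counter

def unique_fraction(records, quasi_ids):
    """Fraction of records whose quasi-identifier combination is unique."""
    combos = Counter(tuple(r[k] for k in quasi_ids) for r in records)
    unique = sum(1 for r in records
                 if combos[tuple(r[k] for k in quasi_ids)] == 1)
    return unique / len(records)

# Toy population: (ZIP, birth date, sex) acts as the quasi-identifier.
people = [
    {"zip": "60623", "dob": "1950-01-02", "sex": "F", "name": "A"},
    {"zip": "60623", "dob": "1950-01-02", "sex": "F", "name": "B"},
    {"zip": "60623", "dob": "1985-07-19", "sex": "M", "name": "C"},
    {"zip": "94305", "dob": "1990-03-30", "sex": "F", "name": "D"},
]
print(unique_fraction(people, ("zip", "dob", "sex")))  # -> 0.5
```

With only ZIP as the quasi-identifier the unique fraction drops to 0.25, which is the point of generalizing attributes.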
Privacy example 1: US Census
- Raw data: information about every US household
– Who, where; age, gender, racial, income and educational data
- Why released: determine representation, planning
- How anonymized: aggregated to geographic areas (Zip code)
– Broken down by various combinations of dimensions
– Released in full after 72 years
- Attacks: no reports of successful deanonymization
- Consequences: greater understanding of US population
– Affects funding of civil projects
– Rich source of data for future historians, etc.
Anonymized Data: Generation, Models, Usage – Cormode & Srivastava
Privacy example 2: AOL Search Data
- Raw data: 20M search queries for 650K users from 2006
- Why released: allow researchers to understand search patterns
- How anonymized: user identifiers removed
– All searches from same user linked by an arbitrary identifier
- Attacks: many successful attacks identified individual users
– Ego‐surfers: people typed in their own names
– Zip codes and town names identify an area
– NY Times identified user 4417749 as a 62‐year‐old Georgia widow
- Consequences: CTO resigned, two researchers fired
– Well‐intentioned effort failed due to inadequate anonymization
Privacy example 3: Netflix Prize
- Raw data: 100M dated ratings from 480K users to 18K movies
- Why released: improve predicting ratings of unlabeled examples
- How anonymized: exact details not described by Netflix
– All direct customer information removed
– Only a subset of the full data; dates modified; some ratings deleted
– Movie title and year published in full
- Attacks: [Narayanan, Shmatikov 08]
– Attack links data to IMDb, where the same users also rated movies
– Find matches based on similar ratings or dates in both
k‐Anonymity
- Make individuals “blend into the crowd”
– Suppress or generalize attributes in a database so that the identifying characteristics in each row match at least k‐1 other rows
k‐Anonymity
- Advantage of this concept
– Does not involve probability
- Disadvantages
– Does not involve probability
– Depends on absence of additional info
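The k‐anonymity condition above is mechanical to verify: every combination of quasi‐identifier values must occur in at least k rows. A small sketch; the table and its generalizations (bucketed ages, truncated ZIPs) are hypothetical:

```python
from collections import Counter

def is_k_anonymous(rows, quasi_ids, k):
    """True iff every quasi-identifier combination appears at least k times."""
    counts = Counter(tuple(r[q] for q in quasi_ids) for r in rows)
    return all(c >= k for c in counts.values())

# Generalized table: ages bucketed, ZIPs truncated (made-up data).
table = [
    {"age": "50-59", "zip": "606**", "diagnosis": "flu"},
    {"age": "50-59", "zip": "606**", "diagnosis": "asthma"},
    {"age": "20-29", "zip": "943**", "diagnosis": "flu"},
    {"age": "20-29", "zip": "943**", "diagnosis": "cold"},
]
print(is_k_anonymous(table, ("age", "zip"), 2))  # -> True
print(is_k_anonymous(table, ("age", "zip"), 3))  # -> False
```

Note the check says nothing about the sensitive column: both rows in a group may share a diagnosis, which is exactly the "no probability" weakness listed above.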
Two rigorous theories of privacy
- Contextual integrity
– Normative framework for evaluating the flow of information between agents
– Agents act in roles within social contexts
– Principles of transmission: confidentiality, reciprocity, desert, etc.
- Differential privacy
[Diagram (credit: Adam Smith): a sanitizer San maps neighboring databases DB and DB′ to outputs S and S′; differential privacy requires the distance between the two output distributions to be small.]
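The standard way to realize the bounded-distance condition for numeric queries is the Laplace mechanism: add noise scaled to the query's sensitivity over epsilon. A minimal sketch; the counting query and parameter values are illustrative, not taken from the talk:

```python
import math
import random

def laplace_mechanism(true_value, sensitivity, epsilon):
    """Release true_value plus Laplace(sensitivity/epsilon) noise.

    Gives epsilon-differential privacy for a query whose output changes
    by at most `sensitivity` when any one row of the database changes.
    """
    u = random.random() - 0.5                       # uniform on [-0.5, 0.5)
    scale = sensitivity / epsilon
    # Inverse-CDF sampling of the Laplace distribution.
    noise = -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))
    return true_value + noise

# Counting query: one person changes a count by at most 1, so sensitivity = 1.
random.seed(0)
private_count = laplace_mechanism(112167, sensitivity=1, epsilon=0.5)
print(round(private_count))
```

Smaller epsilon means larger noise: the usual privacy/utility dial.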
Yet Another Theory of Privacy
- Formulate privacy and utility around
– Private database accessed via privacy mechanism
– Possibly linkable to public information
– In presence of prior distribution about population
- Measure privacy for user and utility using entropy
– Privacy loss: information gain about user, identified by name and address or other public identifier
– Utility gain: information about “anonymized” identifier that is used to access private data
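The entropy-based measures above can be illustrated numerically: privacy loss is the drop in the adversary's Shannon entropy about the user's data after observing the mechanism's output. A sketch with hypothetical prior and posterior distributions:

```python
import math

def entropy(dist):
    """Shannon entropy (in bits) of a probability distribution."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

# Adversary's prior over a user's trait (hypothetical, 4 equally likely values).
prior = [0.25, 0.25, 0.25, 0.25]          # 2 bits of uncertainty
# Posterior after observing the mechanism's output for that user.
posterior = [0.70, 0.10, 0.10, 0.10]

privacy_loss = entropy(prior) - entropy(posterior)   # information gained
print(round(entropy(prior), 3), round(privacy_loss, 3))  # -> 2.0 0.643
```

The utility gain is measured the same way, but with respect to the supercookie-indexed row rather than the named user.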
The Targeted Advertising Ecosystem
- Consumers: people who browse the web
- Publishers: websites that publish content
- Ad Exchanges & Ad Auctions: services that match people to targeted ads
- Ad Networks: services that manage ad campaigns and target users
- Companies (AdNexus, Rubicon, Rocketfuel): people trying to sell you stuff
(Slide credit: Guevara Noubir)
Tracking data is stored and exchanged amongst these companies
Yet Another Theory of Privacy
- Assume advertising data provider
– Maintains database of user interest and behavior, indexed by a supercookie
– Supplies rows of this database to ad networks
- Assume Ad Networks
– Use this information to bid $$ on ad impression
- Threat model
– Attacker has access to
- Rows of database indexed by supercookie
- Prior distribution of traits within population
- Public information in external databases (e.g., FB, Yelp)
– What can attacker learn about real individuals?
Another theory of privacy
- Database
– Mapping from users U to rows (from some set X)
– Mapping from users to their supercookies
– Prior distribution P on databases, known to adversary
- Privacy mechanism
– Transformation M of database
– Advertiser gets supercookie‐based access to transformed database
- Privacy for user
– Decrease in uncertainty about the user and her data, when provided supercookie‐based access to transformed database
- Utility per user
– Decrease in uncertainty about the supercookie and user data, when provided supercookie‐based access to transformed database
More Details
- Database
– Mapping from users U to rows (from some set X)
– Mapping from users to their supercookies
– Prior distribution P on databases, known to adversary
- Privacy mechanism
– Transformation M of database
– Function maps user to supercookie
- Privacy for user
– Decrease in uncertainty about the user and her data, when provided supercookie‐based access to transformed database
- Utility per user
– Decrease in uncertainty about the supercookie and user data, when provided supercookie‐based access to transformed database
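The objects defined above — a user-to-row mapping, a user-to-supercookie mapping, and a transformation M — can be sketched concretely. Every name and parameter here is a hypothetical illustration, with Bernoulli bit-flipping standing in for M:

```python
import random

# Hypothetical instantiation: users U, rows from X = {0,1}^3,
# one supercookie per user, and a mechanism M.
users = ["alice", "bob", "carol"]
rows = {"alice": (1, 0, 1), "bob": (0, 1, 1), "carol": (1, 1, 0)}
supercookie = {u: f"sc-{i}" for i, u in enumerate(users)}

def mechanism(db, flip_p=0.2, seed=0):
    """M: flip each bit of each row independently with probability flip_p."""
    rng = random.Random(seed)
    return {u: tuple(b ^ (rng.random() < flip_p) for b in row)
            for u, row in db.items()}

# The advertiser sees the transformed database indexed by supercookie only.
view = {supercookie[u]: row for u, row in mechanism(rows).items()}
print(sorted(view))  # -> ['sc-0', 'sc-1', 'sc-2']
```

Privacy asks how much `view` reduces uncertainty about a named user; utility asks how much it reduces uncertainty about the data behind a given supercookie.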
More Details
- Privacy mechanism
– Transformation M of database – Function maps user to supercookie
- Privacy for user: decrease from uncertainty about the user and her data to uncertainty when provided “private” data access
- Utility per user: decrease from uncertainty about the supercookie and associated data to uncertainty when provided “private” data access
Some insight from the model
- Privacy is still mathematically complicated
– Not easy to prove interesting entropy relationships
- Lower bound
– Generalize the Netflix example: a sparse database where # of users << exp(# of columns), e.g., columns = movies
– Even adding randomly sampled Bernoulli noise, there is a level of noise where privacy loss is still catastrophic and utility insufficient for most practical applications
- Upper bound
– Coarse‐grained database with # users >> exp(# of columns)
– Intuition: restaurant recommendations based on user category preferences; privacy even with Yelp as side information
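The lower-bound intuition can be demonstrated: in a sparse 0/1 database with many columns, a row perturbed by Bernoulli noise still links back to its owner in a public copy of the data (the Netflix/IMDb pattern). A toy sketch with made-up parameters:

```python
import random

def similarity(a, b):
    """Fraction of columns on which two 0/1 rows agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def link(noisy_row, public_db):
    """Index of the public row most similar to the noisy private row."""
    return max(range(len(public_db)),
               key=lambda i: similarity(noisy_row, public_db[i]))

random.seed(42)
n_users, n_cols = 50, 200          # sparse: 50 users << 2**200 possible rows
public = [[random.randint(0, 1) for _ in range(n_cols)]
          for _ in range(n_users)]

# "Anonymize" user 7's row by flipping each bit with probability 0.1.
noisy = [b ^ (random.random() < 0.1) for b in public[7]]
print(link(noisy, public))  # -> 7: the noisy row still links back to user 7
```

With 200 columns, user 7's noisy row agrees with its original on about 90% of columns while unrelated rows agree on about 50%, so linkage succeeds despite the noise; enough noise to break this also destroys utility.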
Example
Can an advertiser connect preference data with public reviews to estimate user uᵢ's private preferences?
Example
- Under a conservative assumption
- Can calculate privacy as a function of n and …
Complicated Expression
Summary
Provable quantitative privacy for individualized data, used coarsely, in the presence of public side information.