Would that it were so simple: Yet another theory of privacy

John Mitchell (Stanford); Avradip Mandal, Hart Montgomery, Arnab Roy (Fujitsu)


SLIDE 1

Would that it were so simple: Yet another theory of privacy

John Mitchell (Stanford) Avradip Mandal, Hart Montgomery, Arnab Roy (Fujitsu)

SLIDE 2

It seems Allegra’s a no‐show, which is simply a bore, but I’ll partner you in bridge.

SLIDE 3

Would that it were so simple

SLIDE 4

Yet another …

  • YAAF – … Application Framework
  • Yabasic – … BASIC
  • Yacc – … compiler compiler
  • Yacas – … computer algebra system
  • YaDICs – … Digital Image Correlation Software
  • YADIFA – … DNS Implementation For All
  • YafaRay – … free Ray tracer
  • Yafc – … FTP client
  • YAFFS – … Flash File System
  • YAP – Yet Another Previewer
  • YAPC – … Perl Conference
  • YARN – … Resource Negotiator
  • YARP – … Robot Platform
  • YARV – … Ruby VM
  • Yasca – … Source Code Analyzer
  • Y.A.S.U. – … SecuROM Utility
  • Yate – … Telephony Engine
  • YAWC – … Wersion of Citadel
  • YAWL – … Workflow Language
  • Yaws – … web server

SLIDE 5

Hard to “anonymize” data

  • De‐identifying data does not necessarily achieve anonymity; it can often be re‐identified by linking on shared attributes:

[Diagram: two overlapping datasets]
Medical Data: Ethnicity, Visit date, Diagnosis, Procedure, Medication, Total bill
Shared attributes: ZIP, Birth date, Sex
Voter Lists: Name, Address, Date registered, Party, Date last voted

Source: Latanya Sweeney
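The linkage above can be sketched in a few lines. This is a minimal illustration with made-up records and column names, not Sweeney's actual data: joining a de‐identified medical table to a public voter list on the shared attributes re‐attaches names to diagnoses.

```python
# Sketch of a Sweeney-style linkage attack on made-up records:
# the medical table carries no names, but the quasi-identifiers
# (ZIP, birth date, sex) it shares with a public voter list
# are enough to re-identify its rows.

medical = [
    {"zip": "02138", "dob": "1945-07-31", "sex": "F", "diagnosis": "asthma"},
    {"zip": "02139", "dob": "1972-01-02", "sex": "M", "diagnosis": "flu"},
]
voters = [
    {"name": "A. Smith", "zip": "02138", "dob": "1945-07-31", "sex": "F"},
    {"name": "B. Jones", "zip": "02139", "dob": "1972-01-02", "sex": "M"},
]

def link(medical, voters):
    """Join on (zip, dob, sex), re-attaching voter names to medical rows."""
    by_quasi_id = {(v["zip"], v["dob"], v["sex"]): v["name"] for v in voters}
    matches = []
    for row in medical:
        key = (row["zip"], row["dob"], row["sex"])
        if key in by_quasi_id:
            matches.append({"name": by_quasi_id[key], **row})
    return matches

reidentified = link(medical, voters)  # every row regains a name
```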

SLIDE 6

Date of birth, gender + 5‐digit ZIP uniquely identifies 87.1% of the U.S. population

  • ZIP 60623 has 112,167 people; only 11% are uniquely identified (insufficient number over 55 living there)

Source: Latanya Sweeney (map legend: each mark = one ZIP code)
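The uniqueness statistic behind these percentages is a simple count: bucket the population by the (birth date, sex, ZIP) triple and see who lands in a bucket of size one. A toy sketch with invented records:

```python
# Count how many people are uniquely identified by the combination
# (birth date, sex, ZIP): anyone whose combination appears exactly
# once in the population is unique. Records below are invented.
from collections import Counter

population = [
    ("1945-07-31", "F", "02138"),
    ("1972-01-02", "M", "60623"),
    ("1972-01-02", "M", "60623"),  # two people share this combination
    ("1980-05-05", "F", "60623"),
]

counts = Counter(population)
unique = sum(1 for person in population if counts[person] == 1)
fraction_unique = unique / len(population)  # 2 of 4 people -> 0.5
```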

SLIDE 7

Privacy example 1: US Census

  • Raw data: information about every US household

– Who, where; age, gender, racial, income and educational data

  • Why released: determine representation, planning
  • How anonymized: aggregated to geographic areas (Zip code)

– Broken down by various combinations of dimensions
– Released in full after 72 years

  • Attacks: no reports of successful deanonymization
  • Consequences: greater understanding of US population

– Affects funding of civil projects
– Rich source of data for future historians, etc.

Anonymized Data: Generation, Models, Usage – Cormode & Srivastava

SLIDE 8

Privacy example 2: AOL Search Data

  • Raw data: 20M search queries for 650K users from 2006
  • Why released: allow researchers to understand search patterns
  • How anonymized: user identifiers removed

– All searches from same user linked by an arbitrary identifier

  • Attacks: many successful attacks identified individual users

– Ego‐surfers: people typed in their own names
– Zip codes and town names identify an area
– NY Times identified user 4417749 as a 62‐year‐old GA widow

  • Consequences: CTO resigned, two researchers fired

– Well‐intentioned effort failed due to inadequate anonymization


SLIDE 9

Privacy example 3: Netflix Prize

  • Raw data: 100M dated ratings from 480K users to 18K movies
  • Why released: improve predicting ratings of unlabeled examples
  • How anonymized: exact details not described by Netflix

– All direct customer information removed
– Only a subset of the full data; dates modified; some ratings deleted
– Movie title and year published in full

  • Attacks: [Narayanan, Shmatikov 08]

– Attack links data to IMDb, where the same users also rated movies
– Find matches based on similar ratings or dates in both


SLIDE 10

k‐Anonymity

  • Make individuals “blend into the crowd”

– Suppress or generalize attributes in a database so that the identifying characteristics in each row match at least k‐1 other rows
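The matching condition is easy to state as code. A minimal sketch (an assumed formulation for illustration, not code from the talk): a table is k‐anonymous with respect to its quasi‐identifier columns if every combination of their values occurs in at least k rows.

```python
# Check k-anonymity: every quasi-identifier combination must appear
# in at least k rows. Table and column names are illustrative only.
from collections import Counter

def is_k_anonymous(rows, quasi_ids, k):
    counts = Counter(tuple(row[c] for c in quasi_ids) for row in rows)
    return all(n >= k for n in counts.values())

# ZIP and age have already been generalized ("021**", "40-49"),
# so all three rows share one quasi-identifier combination.
rows = [
    {"zip": "021**", "age": "40-49", "diagnosis": "asthma"},
    {"zip": "021**", "age": "40-49", "diagnosis": "flu"},
    {"zip": "021**", "age": "40-49", "diagnosis": "flu"},
]

print(is_k_anonymous(rows, ["zip", "age"], k=3))  # True: each row matches 2 others
print(is_k_anonymous(rows, ["zip", "age"], k=4))  # False
```

Note that the sensitive column (diagnosis) plays no role in the check, which is exactly why k‐anonymity says nothing when all k rows share the same sensitive value.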

SLIDE 11

k‐Anonymity

  • Advantage of this concept

– Does not involve probability

  • Disadvantages

– Does not involve probability
– Depends on absence of additional info

SLIDE 12

Two rigorous theories of privacy

  • Contextual integrity

– Normative framework for evaluating the flow of information between agents
– Agents act in roles within social contexts
– Principles of transmission

  • Confidentiality, reciprocity, desert, etc.
  • Differential privacy

[Figure: a database DB = S and a neighboring database DB′ = S′ each pass through the sanitizer San; the two output distributions are within distance ε]

Credit: Adam Smith
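For contrast with the entropy-based approach developed next, here is a minimal sketch of the standard Laplace mechanism for ε-differential privacy (the textbook construction, not anything specific to these slides): a counting query has sensitivity 1, so adding Laplace(1/ε) noise makes neighboring databases DB and DB′ produce statistically close outputs.

```python
# Laplace mechanism sketch: release a count plus Laplace(1/epsilon)
# noise. A counting query has sensitivity 1 (one person's presence
# changes it by at most 1), which yields epsilon-differential privacy.
import math
import random

def noisy_count(db, predicate, epsilon):
    true_count = sum(1 for x in db if predicate(x))
    # Inverse-CDF sample from Laplace(0, 1/epsilon).
    u = random.random() - 0.5
    noise = -(1.0 / epsilon) * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_count + noise

ages = [23, 45, 67, 31, 58]  # hypothetical database
released = noisy_count(ages, lambda age: age > 40, epsilon=0.5)
```

Each individual release is noisy, but averaging many independent releases concentrates around the true count of 3; the smaller ε is, the wider the noise and the stronger the privacy guarantee.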

SLIDE 13

Yet Another Theory of Privacy

  • Formulate privacy and utility around

– Private database accessed via a privacy mechanism
– Possibly linkable to public information
– In the presence of a prior distribution about the population

  • Measure privacy for user and utility using entropy

– Privacy loss: information gain about the user, identified by name and address or another public identifier
– Utility gain: information about the “anonymized” identifier that is used to access private data
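The entropy bookkeeping can be made concrete with a small sketch (hypothetical numbers; the talk's actual definitions are more general): privacy loss for a user is the drop in Shannon entropy of the adversary's distribution over candidate identities once released data is linked to side information.

```python
# Privacy loss as information gain: entropy of the adversary's prior
# over candidate identities minus entropy of the posterior after
# linking released data to public information. Numbers are invented.
import math

def entropy_bits(dist):
    """Shannon entropy (bits) of a probability distribution given as a dict."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

prior = {"alice": 0.25, "bob": 0.25, "carol": 0.25, "dave": 0.25}
# After linking, only two candidates remain plausible.
posterior = {"alice": 0.5, "bob": 0.5, "carol": 0.0, "dave": 0.0}

privacy_loss = entropy_bits(prior) - entropy_bits(posterior)  # 2.0 - 1.0 = 1.0 bit
```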

SLIDE 14

The Targeted Advertising Ecosystem

[Diagram: Consumers (people who browse the web) → Publishers (websites that publish content) → Ad Exchanges & Ad Auctions (AdNexus, Rubicon: services that match people to targeted ads) → Ad Networks (Rocketfuel: services that manage ad campaigns and target users) → Companies (people trying to sell you stuff); Ads flow back to Consumers]

Slide credit: Guevara Noubir

SLIDE 15

The Targeted Advertising Ecosystem

[Diagram: the targeted advertising ecosystem, as on Slide 14, with $$$ flowing between the parties]

Tracking data is stored and exchanged amongst these companies.

SLIDE 16

Yet Another Theory of Privacy

  • Assume advertising data provider

– Maintains a database of user interest and behavior, indexed by a supercookie
– Supplies rows of this database to ad networks

  • Assume Ad Networks

– Use this information to bid $$ on ad impressions

  • Threat model

– Attacker has access to

  • Rows of database indexed by supercookie
  • Prior distribution of traits within population
  • Public information in external databases (e.g., FB, Yelp)

– What can attacker learn about real individuals?


SLIDE 17

The Targeted Advertising Ecosystem

[Diagram: the targeted advertising ecosystem, as on Slide 14]

Slide credit: Guevara Noubir

SLIDE 18

Another theory of privacy

  • Database

– Mapping from users U to rows (from some set X)
– Mapping from users to their supercookies
– Prior distribution P on databases, known to the adversary

  • Privacy mechanism

– Transformation M of the database
– Advertiser gets supercookie‐based access to the transformed database

  • Privacy for user

– Decrease in uncertainty about the user and her data, when provided supercookie‐based access to transformed database

  • Utility per user

– Decrease in uncertainty about the supercookie and user data, when provided supercookie‐based access to transformed database


SLIDE 19

More Details

  • Database

– Mapping from users U to rows (from some set X)
– Mapping from users to their supercookies
– Prior distribution P on databases, known to the adversary

  • Privacy mechanism

– Transformation M of the database
– Function mapping each user to her supercookie

  • Privacy for user

– Decrease in uncertainty about the user and her data, when provided supercookie‐based access to transformed database

  • Utility per user

– Decrease in uncertainty about the supercookie and user data, when provided supercookie‐based access to transformed database


SLIDE 20

More Details

  • Privacy mechanism

– Transformation M of the database
– Function mapping each user to her supercookie

  • Privacy for user = (uncertainty about the user and her data) − (uncertainty when provided “private” data access)
  • Utility per user = (uncertainty about the supercookie and associated data) − (uncertainty when provided “private” data access)
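These two differences can be sketched in entropy notation. The symbols below are assumptions for illustration (the original symbols did not survive extraction): u is a user, D(u) her row, σ(u) her supercookie, and M(D) the transformed database the advertiser can query through σ.

```latex
% Hypothetical formalization of the slide's privacy and utility measures.
\begin{align*}
\mathrm{Privacy}(u) &= H\bigl(u, D(u)\bigr)
  - H\bigl(u, D(u) \mid \text{access to } M(D) \text{ via } \sigma\bigr)\\
\mathrm{Utility}(u) &= H\bigl(\sigma(u), D(u)\bigr)
  - H\bigl(\sigma(u), D(u) \mid \text{access to } M(D) \text{ via } \sigma\bigr)
\end{align*}
```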

SLIDE 21

Some insight from the model

  • Privacy is still mathematically complicated

– Not easy to prove interesting entropy relationships

  • Lower bound

– Generalize the Netflix example: a sparse database where #users << exp(#columns), e.g., columns = movies
– Even adding randomly sampled Bernoulli noise, there is a level of noise where privacy loss is still catastrophic and utility is insufficient for most practical applications

  • Upper bound

– Coarse‐grained database with #users >> exp(#columns)
– Intuition: restaurant recommendations based on user category preferences; privacy even with Yelp as side information
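The sparse lower-bound intuition, Netflix-style, in miniature (rows and known entries are invented): even when one of the adversary's known entries is flipped by Bernoulli noise, the target row still agrees with the auxiliary information far better than any other row.

```python
# Toy Narayanan-Shmatikov-style matching in a sparse 0/1 database:
# the adversary knows four of the target's entries, one of them
# flipped by noise, and still picks out the target row.

db = [
    [1, 0, 0, 1, 0, 0, 0, 1],  # user 0
    [0, 1, 0, 0, 1, 0, 1, 0],  # user 1
    [0, 0, 1, 0, 0, 1, 0, 0],  # user 2  <- the target
    [1, 1, 0, 0, 0, 0, 0, 1],  # user 3
]
# Auxiliary info: the target's values at columns 0, 2, 5, 7,
# with the value at column 7 flipped (0 -> 1) to model noise.
aux = {0: 0, 2: 1, 5: 1, 7: 1}

def best_match(db, aux):
    """Return the index of the row agreeing with aux most often."""
    scores = [sum(row[c] == v for c, v in aux.items()) for row in db]
    return scores.index(max(scores))

guess = best_match(db, aux)  # -> 2: the target, despite the noisy entry
```

Sparsity is what makes this work: with many columns and few users, even a handful of (noisy) known entries is almost certainly unique to one row.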


SLIDE 22

Example

Can the advertiser connect preference data with public reviews to estimate user u_i’s private preferences?

SLIDE 23

Example

  • Under a conservative assumption
  • Can calculate privacy as a function of n and …

SLIDE 24

Complicated Expression


SLIDE 25

Summary

[Diagram: the targeted advertising ecosystem, as on Slide 14]

Slide credit: Guevara Noubir

SLIDE 26

Summary

[Diagram: the targeted advertising ecosystem, as on Slide 14]

Slide credit: Guevara Noubir

Provable quantitative privacy for individualized data, used coarsely, in the presence of public side information.