Privacy & Fairness in Data Science, CS848 Fall 2019


slide-1
SLIDE 1

Privacy & Fairness in Data Science

CS848 Fall 2019

slide-2
SLIDE 2

Instructor

Xi He:

  • Research interest: privacy and fairness for big-data management and analysis
  • CS848, Fall 2019:

– Tue: 3:00pm - 5:50pm (DC2568)

2

slide-3
SLIDE 3

Tell me …

3

… why do you want to do this course?

slide-4
SLIDE 4

Personalization …

4

slide-5
SLIDE 5

5

In perspective: ~90% of Google’s revenue comes from online ads (as of 2015)

Online Advertising

slide-6
SLIDE 6

6

In perspective: ~90% of Google’s revenue comes from online ads (as of 2015)

Online Advertising

slide-7
SLIDE 7

7

slide-8
SLIDE 8

Health

8

Detecting influenza epidemics using search engine query data

http://www.nature.com/nature/journal/v457/n7232/full/nature07634.html

Red: official weekly numbers from the Centers for Disease Control and Prevention. Black: daily (potentially instantaneous) estimates based on Google search logs.

slide-9
SLIDE 9

Medicine

9

https://www.nature.com/news/personalized-medicine-time-for-one-person-trials-1.17411

slide-10
SLIDE 10

Precision Medicine

10

Source: forbes.com

slide-11
SLIDE 11

Predictive Policing

11

slide-12
SLIDE 12

Predictive Policing

12

slide-13
SLIDE 13

The dark side of the force…

13

http://ragekg.deviantart.com/art/The-Dark-Side-of-the-Force-174559980

slide-14
SLIDE 14

39% of the experts agree…

Thanks to many changes, including the building of “the Internet of Things,” human and machine analysis of Big Data will cause more problems than it solves by 2020. The existence of huge data sets for analysis will engender false confidence in our predictive powers and will lead many to make significant and hurtful mistakes. Moreover, analysis of Big Data will be misused by powerful people and institutions with selfish agendas who manipulate findings to make the case for what they want. And the advent of Big Data has a harmful impact because it serves the majority (at times inaccurately) while diminishing the minority and ignoring important outliers. Overall, the rise of Big Data is a big negative for society in nearly all respects.

— 2012 Pew Research Center Report

http://pewinternet.org/Reports/2012/Future-of-Big-Data/Overview.aspx

14

slide-15
SLIDE 15

Harm due to personalized data analytics …

  • Privacy
  • Fairness

15

slide-16
SLIDE 16

Where is the data coming from?

16

slide-17
SLIDE 17

Where is the data coming from?

  • Census surveys
  • IRS Records
  • Medical records
  • Insurance records
  • Search logs
  • Browse logs
  • Shopping histories
  • Photos
  • Videos
  • Smart phone Sensors
  • Mobility trajectories

17

Very sensitive information

slide-18
SLIDE 18

How is this data collected?

18

http://graphicsweb.wsj.com/documents/divSlider/media/ecosystem100730.png

slide-19
SLIDE 19

Isn’t my data anonymous?

19

slide-20
SLIDE 20

Device Fingerprinting

20

slide-21
SLIDE 21

21

https://panopticlick.eff.org/
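Panopticlick’s point is that seemingly harmless attributes add up: when attributes are independent, their bits of identifying information sum. A rough sketch of that arithmetic, with made-up frequencies (the numbers below are hypothetical, not EFF’s measurements):

```python
import math

# Hypothetical fraction of browsers sharing each attribute value.
attribute_freq = {
    "user_agent": 0.01,        # 1 in 100 browsers report this UA string
    "screen_size": 0.05,
    "timezone": 0.2,
    "installed_fonts": 0.001,  # font lists are often highly distinctive
}

def surprisal_bits(freqs):
    """Total bits of identifying information, assuming independence."""
    return sum(-math.log2(p) for p in freqs.values())

bits = surprisal_bits(attribute_freq)   # about 23 bits for this mix
# ~33 bits would be enough to single out one browser among 2**33 (~8.6e9).
```

With enough such attributes, a browser can be unique without any name or cookie attached.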

slide-22
SLIDE 22

Let’s get rid of unique identifiers …

22

slide-23
SLIDE 23

The Massachusetts Governor Privacy Breach [Sweeney IJUFKS 2002]

Medical Data:
  • Name
  • SSN
  • Visit Date
  • Diagnosis
  • Procedure
  • Medication
  • Total Charge
  • Zip
  • Birth date
  • Sex
23

slide-24
SLIDE 24

The Massachusetts Governor Privacy Breach [Sweeney IJUFKS 2002]

Medical Data:
  • Name
  • SSN
  • Visit Date
  • Diagnosis
  • Procedure
  • Medication
  • Total Charge

Voter List:
  • Name
  • Address
  • Date Registered
  • Party affiliation
  • Date last voted

In both tables:
  • Zip
  • Birth date
  • Sex

24

slide-25
SLIDE 25

The Massachusetts Governor Privacy Breach [Sweeney IJUFKS 2002]

Medical Data:
  • Name
  • SSN
  • Visit Date
  • Diagnosis
  • Procedure
  • Medication
  • Total Charge

Voter List:
  • Name
  • Address
  • Date Registered
  • Party affiliation
  • Date last voted

In both tables:
  • Zip
  • Birth date
  • Sex

  • Governor of MA uniquely identified using Zip code, birth date, and sex; name linked to diagnosis.

25

slide-26
SLIDE 26

The Massachusetts Governor Privacy Breach [Sweeney IJUFKS 2002]

Medical Data:
  • Name
  • SSN
  • Visit Date
  • Diagnosis
  • Procedure
  • Medication
  • Total Charge

Voter List:
  • Name
  • Address
  • Date Registered
  • Party affiliation
  • Date last voted

In both tables:
  • Zip
  • Birth date
  • Sex

  • Governor of MA uniquely identified using Zip code, birth date, and sex.

Quasi-identifier: (Zip, birth date, sex) is unique for 87% of the US population.

26
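The attack above is just a join on the quasi-identifier. A minimal sketch with toy records (the rows below are made up; only the attack pattern follows Sweeney’s study):

```python
# Toy, made-up rows: neither table alone links a name to a diagnosis.
medical = [  # "de-identified" medical data: no name, no SSN
    {"zip": "02138", "birth": "1945-07-31", "sex": "M", "diagnosis": "X"},
    {"zip": "02139", "birth": "1962-01-10", "sex": "F", "diagnosis": "Y"},
]
voters = [  # public voter list: names plus the same three attributes
    {"name": "W. Weld", "zip": "02138", "birth": "1945-07-31", "sex": "M"},
    {"name": "A. Smith", "zip": "02139", "birth": "1980-03-02", "sex": "F"},
]

def link(medical, voters):
    """Join the two tables on the quasi-identifier (zip, birth, sex)."""
    qid = lambda r: (r["zip"], r["birth"], r["sex"])
    names = {qid(v): v["name"] for v in voters}
    return [(names[qid(m)], m["diagnosis"])
            for m in medical if qid(m) in names]

print(link(medical, voters))  # [('W. Weld', 'X')]: name linked to diagnosis
```

Stripping names and SSNs was not enough: any public table sharing the quasi-identifier re-attaches identities.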

slide-27
SLIDE 27

AOL data publishing fiasco

27

slide-28
SLIDE 28

AOL data publishing fiasco …

28

User       | Query
Xi222      | Uefa cup
Xi222      | Uefa champions league
Xi222      | Champions league final
Xi222      | Champions league final 2013
Abel156    | exchangeability
Abel156    | Proof of de Finetti's theorem
Jane12345  | Zombie games
Jane12345  | Warcraft
Jane12345  | Beatles anthology
Jane12345  | Ubuntu breeze
Bob222     | Python in thought
Bob222     | Enthought Canopy

slide-29
SLIDE 29

User IDs replaced with random numbers

29

AnonID     | Query
865712345  | Uefa cup
865712345  | Uefa champions league
865712345  | Champions league final
865712345  | Champions league final 2013
236712909  | exchangeability
236712909  | Proof of de Finetti's theorem
112765410  | Zombie games
112765410  | Warcraft
112765410  | Beatles anthology
112765410  | Ubuntu breeze
865712345  | Python in thought
865712345  | Enthought Canopy
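Replacing user names with consistent random numbers, as above, does not break linkability: all of one user’s queries still share a pseudonym. A minimal sketch (the helper and IDs below are made up, not AOL’s procedure):

```python
import random

def pseudonymize(log, rng=random.Random(7)):
    """Replace each user ID with a fresh random number, consistently."""
    mapping = {}
    out = []
    for user, query in log:
        if user not in mapping:
            n = rng.randrange(10**8, 10**9)
            while n in mapping.values():   # avoid accidental collisions
                n = rng.randrange(10**8, 10**9)
            mapping[user] = n
        out.append((mapping[user], query))
    return out

log = [("Xi222", "Uefa cup"),
       ("Xi222", "Champions league final"),
       ("Jane12345", "Zombie games")]
pseudo = pseudonymize(log)
# Both of Xi222's queries carry the same pseudonym, so one identifying
# query (a vanity search, a home address) exposes the whole history.
```

This is exactly what happened in 2006: AOL user 4417749 was re-identified from her query history alone.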

slide-30
SLIDE 30

Privacy Breach

30

[NYTimes 2006]

slide-31
SLIDE 31

Machine learning models can reveal sensitive information

31

[Korolova JPC 2011]

Facebook Profile + Online Data
Number of impressions: ads targeting “Who are interested in Men” vs. “Who are interested in Women”

Facebook’s learning algorithm uses private information to predict match to ad

slide-32
SLIDE 32

Genome wide association studies

32

Did Bob participate in the study?
  • Results of a GWAS study
  • High-density SNP profile of Bob

[Homer et al., PLOS Genetics 08]

slide-33
SLIDE 33

Harm due to personalized data analytics …

  • Privacy
  • Fairness

33

slide-34
SLIDE 34

The red side of learning

  • Redlining: the practice of denying, or charging more for, services such as banking, insurance, access to health care, or even supermarkets, or denying jobs to residents in particular, often racially determined, areas.

34

slide-35
SLIDE 35

Predictive Policing

  • Predictive policing systems use machine learning algorithms to predict crime.
  • But … the algorithms learn patterns not about crime, per se, but about how police record crime.
  • This can amplify existing biases.

35

slide-36
SLIDE 36

36

https://www.nytimes.com/2015/07/10/upshot/ when-algorithms-discriminate.html

slide-37
SLIDE 37

37

slide-38
SLIDE 38

Deep Learning

Incredibly powerful tool for …

  • Extracting regularities from a given data set
  • Amplifying bias!

38

slide-39
SLIDE 39

39

http://slides.com/simonescardapane/the-dark-side-of-deep-learning

slide-40
SLIDE 40

40

http://slides.com/simonescardapane/the-dark-side-of-deep-learning

slide-41
SLIDE 41

Deep Learning

Incredibly powerful tool for …

  • Extracting regularities from a given data set
  • Amplifying privacy concerns!

41

slide-42
SLIDE 42

42

slide-43
SLIDE 43

This course:

43

http://www.webvisionsevent.com/userfiles/lightsabercrop_large_verge_medium_landscape.jpg

Learn to combat the dark side

slide-44
SLIDE 44

You will …

  • mathematically formulate privacy.
  • mathematically formulate fairness.

44

slide-45
SLIDE 45

Differential Privacy

45

For every pair of inputs D1 and D2 that differ in one row, and for every output O, the adversary should not be able to distinguish between D1 and D2 based on O:

log( Pr[A(D1) = O] / Pr[A(D2) = O] ) < ε,  where ε > 0
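One standard way to satisfy this definition for counting queries is the Laplace mechanism. A minimal sketch (a textbook construction, not code from the course; the function name is mine):

```python
import math
import random

def laplace_mechanism(true_count, epsilon, rng=random.Random(0)):
    """Release a count plus Laplace(1/epsilon) noise.

    A counting query changes by at most 1 when one row of the input
    changes (sensitivity 1), so Laplace noise with scale 1/epsilon keeps
    log(Pr[A(D1) = O] / Pr[A(D2) = O]) below epsilon for neighboring
    D1, D2, matching the definition above.
    """
    u = rng.random() - 0.5                  # uniform on [-0.5, 0.5)
    scale = 1.0 / epsilon
    # Inverse-CDF sampling of the Laplace distribution.
    noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_count + noise

noisy = laplace_mechanism(42, epsilon=0.5)  # true count 42, noisy release
```

Smaller ε means more noise and stronger privacy: the output distributions on neighboring databases stay within a factor of e^ε of each other.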

slide-46
SLIDE 46

You will …

  • mathematically formulate privacy.
  • mathematically formulate fairness.
  • design algorithms to ensure privacy
  • design algorithms to ensure fairness

46
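One common way to “mathematically formulate fairness” is demographic parity: the rate of positive decisions should not depend on group membership. A minimal sketch of checking it (an illustrative choice; it is only one of several definitions in the fairness literature):

```python
def demographic_parity_gap(decisions, groups):
    """Max difference in positive-decision rates across groups.

    Demographic parity asks Pr[decision = 1 | group = a] to be (near)
    equal for every group a; a gap of 0 means the acceptance rate is
    independent of group membership.
    """
    rates = {}
    for g in set(groups):
        member_decisions = [d for d, gg in zip(decisions, groups) if gg == g]
        rates[g] = sum(member_decisions) / len(member_decisions)
    return max(rates.values()) - min(rates.values())

# Group "a" is accepted 3/4 of the time, group "b" only 1/4: gap = 0.5.
gap = demographic_parity_gap([1, 1, 1, 0, 1, 0, 0, 0],
                             ["a", "a", "a", "a", "b", "b", "b", "b"])
```

Designing algorithms that keep this gap small, while still being accurate, is one instance of the “design algorithms to ensure fairness” goal above.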

slide-47
SLIDE 47

Differential Privacy in practice

47

OnTheMap [ICDE 2008] [CCS 2014] [Apple WWDC 2016]

slide-48
SLIDE 48

You will …

  • mathematically formulate privacy.
  • mathematically formulate fairness.
  • design algorithms to ensure privacy
  • design algorithms to ensure fairness
  • do research into the interplay between privacy and fairness.

48

slide-49
SLIDE 49

Course Format

  • Module 1: Intro to Privacy
  • Module 2: Intro to Fairness
  • Module 3: Paper Reading by Topics

– privacy vs. fairness
– private machine learning
– deployments of DP
– sources of bias
– fairness mechanisms

49

Components: Lectures, In-class Exercise, In-class Mini-project, Read papers, Mini-critiques, Research Project

slide-50
SLIDE 50

50

slide-51
SLIDE 51

What we expect you to know …

  • Strong background in

– Probability
– Proof techniques

  • Some knowledge of

– Programming with Python
– Machine learning
– Statistics
– Algorithms

51

slide-52
SLIDE 52
  • Misc. course info
  • Website: https://cs.uwaterloo.ca/~xihe/cs848

– Schedule (with links to lecture slides, readings, projects, etc.)

  • Grading

– In-class mini-projects: 10% x 2
– Mini-critiques: 10%
– Class participation and presentation: 20% (attending class!)
– Project: 50%

  • LEARN for submission and grades:

– https://learn.uwaterloo.ca/d2l/home/492027

52

slide-53
SLIDE 53

Academic Integrity

  • See course website
  • Mini-project reports and paper critiques are individual work and submission.
  • Group discussion okay (and encouraged), but

– Acknowledge help you receive from others
– Make sure you “own” your solution

  • All suspected cases of violation will be aggressively pursued

53

slide-54
SLIDE 54

Reference

  • Course materials are adapted from: https://sites.duke.edu/cs590f18privacyfairness/

54