slide-1
SLIDE 1

Releasing a Differentially Private Password Frequency Corpus from 70 Million Yahoo! Passwords

Jeremiah Blocki

Purdue University

DIMACS/Northeast Big Data Hub Workshop on Overcoming Barriers to Data Sharing including Privacy and Fairness

slide-4
SLIDE 4

What is a Password Frequency List?

Password Dataset (N users): password, 12345, abc123, abc123

Histogram: abc123 → 2, password → 1, 12345 → 1

Frequency List: f = (2, 1, 1)

Formally: $f \in \wp(N)$

A password frequency list is just an integer partition of N.
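
As a minimal sketch of the definition above, the frequency list is the multiset of histogram counts sorted in non-increasing order, with the passwords themselves dropped. The function name `frequency_list` is illustrative and does not come from the talk.

```python
from collections import Counter


def frequency_list(passwords):
    """Return the password counts sorted in non-increasing order."""
    histogram = Counter(passwords)  # password -> number of users who chose it
    return sorted(histogram.values(), reverse=True)


dataset = ["password", "12345", "abc123", "abc123"]
f = frequency_list(dataset)
print(f)       # [2, 1, 1]
print(sum(f))  # 4, the number of users N
```

Because the counts sum to N, the result is an integer partition of N.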

slide-5
SLIDE 5

Password Frequency List (Example Use)

$\lambda_\beta = \sum_{i=1}^{\beta} f_i$

Estimate the number of accounts compromised by an attacker with $\beta$ guesses per user

  • Online Attacker ($\beta$ small)
  • Offline Attacker ($\beta$ large)

Password Frequency Lists allow us to estimate

  • Marginal Guessing Cost (MGC)
  • Marginal Benefit (MB)
  • Rational Adversary: MGC = MB
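
A short sketch of the estimate on this slide, assuming the frequency list is already sorted in non-increasing order: the number of accounts cracked with β guesses per user is the sum of the β largest frequencies. The name `lambda_beta` and the toy list are illustrative only.

```python
def lambda_beta(f, beta):
    """Accounts compromised by guessing the beta most popular passwords."""
    return sum(f[:beta])  # f must be sorted in non-increasing order


f = [2, 1, 1]  # toy frequency list for N = 4 users
n = sum(f)
print(lambda_beta(f, 1))      # 2 accounts with a single guess per user
print(lambda_beta(f, 2) / n)  # 0.75 of all accounts with two guesses
```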
slide-6
SLIDE 6

Available Password Frequency Lists (2015)

Site           #User Accounts (N)   How Released
RockYou        32.6 Million         Data Breach*
LinkedIn       6                    Data Breach*
…              …                    …
Yahoo! [B12]   70 Million           With Permission**

* entire frequency list available due to improper password storage

** frequency list perturbed slightly to preserve differential privacy.

Yahoo! frequency data is now available online at: https://figshare.com/articles/Yahoo_Password_Frequency_Corpus/2057937

slide-7
SLIDE 7

Yahoo! Password Frequency List

  • Collected by Joseph Bonneau in 2011 (with permission from Yahoo!)
  • Store H(s|pwd)
  • Secret salt value s (same for all users)
  • Discarded after data-collection
  • ≈ 70 million Yahoo! users
  • Yahoo! Legal gave permission to publish analysis of the frequency list
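
The collection idea in the bullets above can be sketched as follows. This is not the original collection code: SHA-256, the 32-byte salt, and the function name are assumptions made for illustration. The point is only that counting H(s|pwd) values under one secret salt yields the frequency list without retaining any password.

```python
import hashlib
import secrets
from collections import Counter


def collect_frequency_list(passwords):
    """Count salted hashes H(s|pwd) and return only the sorted counts."""
    salt = secrets.token_bytes(32)  # secret s, the same for every user (assumed size)
    counts = Counter(
        hashlib.sha256(salt + pwd.encode("utf-8")).digest()  # H(s|pwd), SHA-256 assumed
        for pwd in passwords
    )
    del salt  # stand-in for discarding s after data collection
    return sorted(counts.values(), reverse=True)


print(collect_frequency_list(["password", "12345", "abc123", "abc123"]))  # [2, 1, 1]
```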
slide-8
SLIDE 8

Project Origin

Would it be possible to access the Yahoo! data? I am working on a cool new research project and the password frequency data would be very useful.

slide-9
SLIDE 9

Project Origin

I would love to make the data public, but Yahoo! Legal has concerns about security and privacy. They won't let me release it.

slide-12
SLIDE 12

Yahoo! Frequency Corpus

Largest publicly available frequency corpus that was not the result of a data breach

Source ≠ Dark Web

slide-13
SLIDE 13

Why not just publish the original frequency lists?

  • Heuristic Approaches to Data Privacy often break down when the adversary has background knowledge
  • Netflix Prize Dataset [NS08]
  • Background Knowledge: IMDB
  • Massachusetts Group Insurance Medical Encounter Database [SS98]
  • Background Knowledge: Voter Registration Record
  • Many other attacks [BDK07,…]
  • In the absence of provable privacy guarantees Yahoo! was understandably reluctant to release these password frequency lists.

slide-14
SLIDE 14

Security Risks (Example)

[Figure: a password dataset with one unknown password (???) next to 12345, abc123, abc123, which are labeled Adversary Background Knowledge]

slide-15
SLIDE 15

Security Risks (Example)

[Figure: the same dataset (???, 12345, abc123, abc123) shown alongside histograms over abc123, 12345, and other]

slide-16
SLIDE 16

Differential Privacy (Dwork et al)

$\|f - f'\|_1 \stackrel{\text{def}}{=} \sum_i |f_i - f'_i|$

Definition: A (randomized) algorithm $A$ preserves $(\epsilon, \delta)$-differential privacy if for any subset $S \subseteq \mathrm{Range}(A)$ of possible outcomes we have $\Pr[A(f) \in S] \le e^{\epsilon} \Pr[A(f') \in S] + \delta$ for any pair of adjacent password frequency lists $f$ and $f'$, i.e., $\|f - f'\|_1 = 1$.
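
To make the inequality concrete, here is a small sketch that checks it for two finite output distributions. It relies on a standard fact that is not on the slide: over all subsets S, the gap Pr[A(f) ∈ S] − e^ε Pr[A(f′) ∈ S] is largest when S contains exactly the outcomes where the first term exceeds the second, so summing the positive parts is enough. The distributions and function names are made up for illustration.

```python
import math


def dp_slack(p, q, eps):
    """Max over subsets S of Pr_p[S] - e^eps * Pr_q[S] for finite distributions."""
    outcomes = set(p) | set(q)
    return sum(
        max(0.0, p.get(x, 0.0) - math.exp(eps) * q.get(x, 0.0)) for x in outcomes
    )


def satisfies_dp(p, q, eps, delta):
    """Check the (eps, delta) inequality in both directions."""
    return dp_slack(p, q, eps) <= delta and dp_slack(q, p, eps) <= delta


# Hypothetical output distributions of a mechanism on adjacent inputs f and f'.
out_f = {"a": 0.6, "b": 0.4}
out_f_prime = {"a": 0.4, "b": 0.6}
print(satisfies_dp(out_f, out_f_prime, eps=0.5, delta=0.0))  # True
print(satisfies_dp(out_f, out_f_prime, eps=0.1, delta=0.0))  # False
```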

slide-19
SLIDE 19

Differential Privacy (Dwork et al)

Definition: A (randomized) algorithm $A$ preserves $(\epsilon, \delta)$-differential privacy if for any subset $S \subseteq \mathrm{Range}(A)$ of possible outcomes we have $\Pr[A(f) \in S] \le e^{\epsilon} \Pr[A(f') \in S] + \delta$ for any pair of adjacent password frequency lists $f$ and $f'$, i.e., $\|f - f'\|_1 = 1$.

Small Constant (e.g., $\epsilon = 0.5$)

Negligibly Small Value (e.g., $\delta = 2^{-100}$)

$f$: original password frequency list

$f'$: $f$ with Alice's password removed from the dataset

slide-20
SLIDE 20

Differential Privacy (Example)

[Figure: $f$ minus Alice's password $= f'$; within the set of Outcomes, $S$ is the subset of all potentially harmful outcomes to Alice]

slide-21
SLIDE 21

Differential Privacy (Example)

$\Pr[A(f) \in S] \le e^{\epsilon} \Pr[A(f') \in S] + \delta$

[Figure: $f$ minus Alice's password $= f'$]

Subset S of all potentially harmful outcomes to Alice

slide-22
SLIDE 22

Differential Privacy (Example)

Intuition: Alice won't be harmed because her password was included in the dataset.

$\Pr[A(f) \in S] \le e^{\epsilon} \Pr[A(f') \in S] + \delta$

slide-23
SLIDE 23

Main Technical Result

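Theorem: There is a computationally efficient algorithm $A: \wp \times \wp \to \wp$ such that $A$ preserves $(\epsilon, \delta)$-differential privacy and, except with probability $\delta$, $A(f)$ outputs $\tilde{f}$ s.t.

$\frac{\|f - \tilde{f}\|_1}{N} \le O\left(\frac{1}{\epsilon\sqrt{N}} + \frac{\ln(1/\delta)}{\epsilon N}\right)$.

$\mathrm{Time}(A) = O\left(\frac{N\sqrt{N} + N\ln(1/\delta)}{\epsilon}\right) = \mathrm{Space}(A)$

$\wp = \bigcup_{n=1}^{\infty} \wp(n)$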

slide-25
SLIDE 25

Main Tool: Exponential Mechanism [MT07]

Input: $f$

Output: $\tilde{f}$ with $\Pr[\mathcal{E}_\epsilon(f) = \tilde{f}] \propto e^{-\epsilon \|f - \tilde{f}\|_1 / 2}$

Theorem [MT07]: The exponential mechanism preserves $(\epsilon, 0)$-differential privacy.

slide-27
SLIDE 27

Analysis: Exponential Mechanism

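Input: $f$

Output: $\tilde{f}$ with $\Pr[\mathcal{E}_\epsilon(f) = \tilde{f}] \propto e^{-\epsilon \|f - \tilde{f}\|_1 / 2}$

Theorem [HR18]: There are $e^{O(\sqrt{N})}$ partitions of the integer $N$.

Assigns very small probability to inaccurate outcomes.

Union Bound → $\|f - \tilde{f}\|_1 \le O\left(\frac{\sqrt{N}}{\epsilon}\right)$ with high probability when $\frac{1}{\epsilon} = O(\sqrt{N})$.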

slide-28
SLIDE 28

Analysis: Exponential Mechanism

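Input: $f$

Output: $\tilde{f}$ with $\Pr[\mathcal{E}_\epsilon(f) = \tilde{f}] \propto e^{-\epsilon \|f - \tilde{f}\|_1 / 2}$

Theorem: $\frac{\|f - \tilde{f}\|_1}{N} \le O\left(\frac{1}{\epsilon\sqrt{N}}\right)$ with high probability.

Assigns very small probability to inaccurate outcomes.

Theorem [MT07]: The exponential mechanism preserves $(\epsilon, 0)$-differential privacy.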
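
The partition-counting fact used in this analysis can be checked numerically. The sketch below counts integer partitions with a standard dynamic program and compares ln p(N) with π·sqrt(2N/3). That explicit constant is the classical Hardy-Ramanujan constant and is an addition here; the slides only state the count as exponential in the square root of N.

```python
import math


def partition_count(n):
    """Number of integer partitions of n (coin-change style dynamic program)."""
    ways = [1] + [0] * n
    for part in range(1, n + 1):
        for total in range(part, n + 1):
            ways[total] += ways[total - part]
    return ways[n]


for n in (10, 100, 1000):
    log_count = math.log(partition_count(n))
    bound = math.pi * math.sqrt(2 * n / 3)  # classical bound on ln p(n)
    print(n, round(log_count, 1), round(bound, 1))
```

The count grows like the exponential of a square root, which is what lets a union bound over all candidate outputs succeed once the distance threshold is on the order of sqrt(N)/ε.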

slide-29
SLIDE 29

(e.g., [U13])

slide-30
SLIDE 30

But, we did run the exponential mechanism

Theorem: There is an efficient algorithm $A$ to sample from a distribution that is $\delta$-close to the exponential mechanism $\mathcal{E}$ over integer partitions. The algorithm uses time and space $O\left(\frac{N\sqrt{N} + N\ln(1/\delta)}{\epsilon}\right)$.

Key Intuition:

$e^{-\epsilon \sum_i |f_i - \tilde{f}_i|} = e^{-\epsilon \sum_{i \le t} |f_i - \tilde{f}_i|} \times e^{-\epsilon \sum_{i > t} |f_i - \tilde{f}_i|}$

Suggests Potential Recurrence Relationships

slide-31
SLIDE 31

But, we did run the exponential mechanism

Theorem: There is an efficient algorithm $A$ to sample from a distribution that is $\delta$-close to the exponential mechanism $\mathcal{E}$ over integer partitions. The algorithm uses time and space $O\left(\frac{N\sqrt{N} + N\ln(1/\delta)}{\epsilon}\right)$.

Key Idea 1: Novel dynamic programming algorithm to compute weights $W_{i,k}$ such that

$\Pr[\tilde{f}_i = k \mid \tilde{f}_{i-1}] = \frac{W_{i,k}}{\sum_{t=0}^{\tilde{f}_{i-1}} W_{i,t}}$.
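
The following toy sketch illustrates the kind of recurrence this slide describes. It is not the algorithm from the talk. It assumes a truncated output domain (at most `max_parts` parts, each at most `max_value`, both hypothetical parameters), uses plain floating point, and uses a weight of exp(−ε|f_i − k|/2) per position. Here W[i][k] is the total weight of all non-increasing tails that start with value k at position i, so sampling position by position reproduces the exponential mechanism on the truncated domain exactly.

```python
import math
import random


def suffix_weights(f, eps, max_parts, max_value):
    """W[i][k]: total weight of non-increasing tails x_i = k >= x_{i+1} >= ..."""
    target = list(f) + [0] * (max_parts - len(f))  # assumes len(f) <= max_parts
    W = [[0.0] * (max_value + 1) for _ in range(max_parts + 1)]
    W[max_parts][0] = 1.0  # past the last position every part is 0
    for i in range(max_parts - 1, -1, -1):
        running = 0.0  # sum of W[i + 1][t] over t <= k
        for k in range(max_value + 1):
            running += W[i + 1][k]
            W[i][k] = math.exp(-eps * abs(target[i] - k) / 2) * running
    return W


def sample_partition(f, eps, max_parts, max_value, rng=random):
    """Sample parts left to right, each conditioned on the previous part."""
    W = suffix_weights(f, eps, max_parts, max_value)
    parts, prev = [], max_value
    for i in range(max_parts):
        k = rng.choices(range(prev + 1), weights=W[i][: prev + 1])[0]
        parts.append(k)
        prev = k
    return [k for k in parts if k > 0]


print(sample_partition([2, 1, 1], eps=0.5, max_parts=6, max_value=6))
```

The default `random` module is used only to keep the sketch short. A real release would want a cryptographic source such as `random.SystemRandom()`, which can be passed as `rng`.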

slide-32
SLIDE 32

But, we did run the exponential mechanism

Theorem: There is an efficient algorithm $A$ to sample from a distribution that is $\delta$-close to the exponential mechanism $\mathcal{E}$ over integer partitions. The algorithm uses time and space $O\left(\frac{N\sqrt{N} + N\ln(1/\delta)}{\epsilon}\right)$.

Key Idea 1: Novel dynamic programming algorithm to compute weights $W_{i,t}$

Key Idea 2: Allow $A$ to ignore a partition $\tilde{f}$ if $\|f - \tilde{f}\|_1$ is very large.

slide-33
SLIDE 33

Practical Challenge #1

  • Space is Limiting Factor: N = 70 million, $\epsilon = 0.02$: $\frac{N\sqrt{N} + N\ln(1/\delta)}{\epsilon}$ × (8 bytes) ≈ 200 TB
  • Workaround: Initial pruning phase to identify relevant subset of DP table for sampling.
  • Running Time: ≈ 12 hours on this laptop
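
A quick back-of-the-envelope check of the space estimate above, treating the hidden constant in the O(·) bound as 1 and assuming δ = 2^−100 (the value the talk reports for the released data). Both choices are assumptions made for this sketch.

```python
import math

N = 70_000_000        # users, from the slide
eps = 0.02            # privacy parameter, from the slide
delta = 2.0 ** -100   # assumed: the delta reported for the release
BYTES_PER_ENTRY = 8   # from the slide

entries = (N * math.sqrt(N) + N * math.log(1 / delta)) / eps
terabytes = entries * BYTES_PER_ENTRY / 1e12
print(f"{terabytes:.0f} TB")
```

The result lands in the low hundreds of terabytes, the same order of magnitude as the figure quoted on the slide, and the N·sqrt(N) term dominates.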
slide-34
SLIDE 34

Practical Challenge #2

  • $W_{i,k}$ can get very large (too big for native floating point types in C#)
  • Workaround: Store $\log W_{i,k}$ instead of $W_{i,k}$.
  • Important Implementation Question: Where do your random bits come from?
  • Default random number generator is much easier for developer to use.
  • Example: Rand.NextDouble() vs CryptoRand.NextBytes()
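
Both workarounds can be sketched in Python as an analogue of the C# setting mentioned above. Weights are kept as logarithms and normalized with the log-sum-exp trick, and the sampler takes its randomness from `random.SystemRandom`, which draws from the operating system's entropy source. The helper names are illustrative.

```python
import math
import random


def log_sum_exp(log_weights):
    """log(sum(exp(w))) computed without overflowing."""
    m = max(log_weights)
    return m + math.log(sum(math.exp(w - m) for w in log_weights))


def sample_index(log_weights, rng):
    """Sample index i with probability proportional to exp(log_weights[i])."""
    total = log_sum_exp(log_weights)
    u = rng.random()
    acc = 0.0
    for i, w in enumerate(log_weights):
        acc += math.exp(w - total)
        if u < acc:
            return i
    return len(log_weights) - 1  # guard against floating point rounding


secure_rng = random.SystemRandom()  # OS entropy instead of the default generator
log_w = [1000.0, 1001.0, 999.0]     # math.exp(1000.0) would overflow a float
print(sample_index(log_w, secure_rng))
```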
slide-35
SLIDE 35

Practical Challenge #3

Does Yahoo! have any preference about the privacy parameter $\epsilon$?

slide-36
SLIDE 36

Practical Challenge #3

Are there standardized guidelines to select $\epsilon$?

slide-37
SLIDE 37

Practical Challenge #3

No, I was thinking $\epsilon = \frac{1}{2}$ would be reasonable….

slide-38
SLIDE 38

Practical Challenge #3

Yahoo! is fine with $\epsilon = \frac{1}{2}$

Risk: Industry deployments become de facto standard for selecting $\epsilon$?

Suggested Dinner Discussion Topic: What role should academia play in influencing these standards?

slide-40
SLIDE 40

Yahoo! Results

                          Original Data [B12]                                     Sanitized Data [BDB16]
                          N            log₂(N/λ₁)  log₂(N/λ₁₀₀)  log₂(G₀.₅)       N            log₂(N/λ₁)  log₂(N/λ₁₀₀)  log₂(G₀.₅)
All                       69,301,337   6.5         11.4          21.6             69,299,074   6.5         11.4          21.6
gender (self-reported)
  Female                  30,545,765   6.9         11.5          21.1             30,545,765   6.9         11.5          21.1
  Male                    38,624,554   6.3         11.3          21.8             38,624,554   6.3         11.3          21.8
  …                       …            …           …             …                …            …           …             …
language preference
  Chinese                 1,564,364    6.5         11.1          22.0             1,571,348    6.5         11.1          21.8
  …                       …            …           …             …                …            …           …             …

Yahoo! frequency data is now available online at: https://figshare.com/articles/Yahoo_Password_Frequency_Corpus/2057937

slide-41
SLIDE 41

Yahoo! Results (Selecting Epsilon)

$\epsilon = \epsilon_{all} + 22\epsilon'$

Any individual participates in at most 23 groups (including All)

slide-42
SLIDE 42

Yahoo! Results (Selecting Epsilon)

$\epsilon = \epsilon_{all} + 22\epsilon'$

$\epsilon_{all} = 0.25$

$\epsilon' = \frac{\epsilon_{all}}{22}$
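
Plugging in the values on this slide gives the overall budget. The worked arithmetic below assumes the per-group costs simply add up (basic composition), which is what the sum on the slide suggests: one release for the All group at $\epsilon_{all}$ plus up to 22 further group releases at $\epsilon'$ each.

```latex
\epsilon = \epsilon_{all} + 22\,\epsilon'
         = 0.25 + 22 \cdot \frac{0.25}{22}
         = 0.25 + 0.25
         = 0.5,
\qquad \epsilon' = \frac{0.25}{22} \approx 0.0114
```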

slide-43
SLIDE 43

Yahoo! Results (Selecting Epsilon)

$\epsilon = 0.5$

slide-44
SLIDE 44

Yahoo! Results (Selecting Epsilon)

$\epsilon = 0.5$, $\delta = 2^{-100}$

slide-45
SLIDE 45

An Open Problem

Application to Social Networks: Degree Distribution with Node Privacy

Conjecture: For $\frac{1}{\epsilon} = O(\sqrt[3]{n})$, $E\left[\|\mathcal{E}_\epsilon(f) - f\|_1\right] \le O\left(\sqrt{\frac{n}{\epsilon}}\right)$

slide-46
SLIDE 46

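Lower Bounds on L1 Error

$E\left[\|A(f) - f\|_1\right] = \Omega\left(\sqrt{\frac{N}{\epsilon}}\right)$ [AS16,B16]

$E\left[\|A(f) - f\|_1\right] = \Omega\left(\frac{1}{\epsilon^2}\right)$, relevant when $\frac{1}{\epsilon} = \Omega(\sqrt{N})$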

slide-47
SLIDE 47

Empirical Evidence

[Plot: L1 error versus $\epsilon^{-1}$ (100 samples), n = 32.6 million users]

slide-48
SLIDE 48

More Empirical Evidence

[Plot: average L1 error (200 samples) versus $1/\sqrt{\epsilon}$ for the Bitcoin Trust Network, with $\epsilon \approx 1/\sqrt[3]{N}$ marked]

slide-49
SLIDE 49

More Empirical Evidence

[Plot: average L1 error (200 samples) versus $1/\epsilon^2$ for the Bitcoin Trust Network, with $\epsilon \approx 1/\sqrt[3]{N}$ and $\epsilon \approx 1/\sqrt{N}$ marked]

slide-50
SLIDE 50

Comparison with Prior Techniques

[Plot: mean squared error (200 samples) versus $1/\epsilon^2$ for the Bitcoin Trust Network; series: Laplace (Post Process) and Exponential Mechanism]

slide-51
SLIDE 51

Conclusions

  • Differential Privacy Enables Analysis of Sensitive Data
  • The exponential mechanism is not always intractable
  • integer partitions
  • Other practical settings?
  • Applications to Social Networks?
slide-52
SLIDE 52

Thanks for Listening

Anupam Datta (CMU)

Joseph Bonneau (NYU)