Frequency Lists Jeremiah Blocki Anupam Datta Joseph Bonneau - - PowerPoint PPT Presentation


slide-1
SLIDE 1

Differentially Private Password Frequency Lists

Jeremiah Blocki (MSR/Purdue), Anupam Datta (CMU), Joseph Bonneau (Stanford/EFF)

slide-2
SLIDE 2

Or, How to release statistics from 70 million passwords (on purpose)


slide-3
SLIDE 3

Outline

  • Password Frequency List
  • Potential Security Concerns
  • Differential Privacy
  • A DP Algorithm with Minimal Distortion
  • Released Yahoo! Frequency List
slide-6
SLIDE 6

What is a Password Frequency List?

Password Dataset (N users): password, 12345, abc123, abc123

Histogram: password = 1, 12345 = 1, abc123 = 2

Frequency List: 2, 1, 1

Formal Notation: $f = (f_1, \ldots, f_N)$ such that

  • $f_1 \ge f_2 \ge \cdots \ge f_N \ge 0$
  • $N = \sum_{i=1}^{N} f_i$
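
The step from a password dataset to a frequency list can be sketched in a few lines of Python. This is only an illustration of the definition on this slide, not the authors' code; the function name `frequency_list` is an assumption introduced here.

```python
from collections import Counter


def frequency_list(passwords):
    """Return the password counts sorted in non-increasing order.

    The passwords themselves are discarded; only the counts remain.
    """
    histogram = Counter(passwords)  # password -> number of users choosing it
    return sorted(histogram.values(), reverse=True)


dataset = ["password", "12345", "abc123", "abc123"]
f = frequency_list(dataset)
print(f)  # [2, 1, 1]
```

The counts always sum to the number of users N, which is why a frequency list can be viewed as an integer partition of N.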

slide-7
SLIDE 7

Password Frequency List (Application 1)

$\lambda_\beta = \sum_{i=1}^{\beta} f_i$

Estimate #accounts compromised by attacker with $\beta$ guesses per user

  • Online Attacker ($\beta$ small)
  • Offline Attacker ($\beta$ large)
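
As a small illustration of this estimate, the sum over the most common passwords can be computed directly from a frequency list. The helper name `accounts_compromised` is hypothetical, and the input is assumed to be sorted in non-increasing order already.

```python
def accounts_compromised(freq_list, beta):
    """Number of accounts covered by the beta most common passwords.

    Assumes freq_list is sorted in non-increasing order, so the first
    beta entries are the attacker's best beta guesses.
    """
    return sum(freq_list[:beta])


f = [2, 1, 1]
print(accounts_compromised(f, 1))  # 2
print(accounts_compromised(f, 3))  # 4
```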
slide-8
SLIDE 8

Password Frequency List (Application 2)

Quantify Benefits from Key-Stretching

Halting Condition (Rational Offline Adversary):

  • Marginal Guessing Cost ≥ Marginal Benefit

Password Frequency Lists allow us to estimate

  • Marginal Guessing Cost (MGC)
  • Marginal Benefit (MB)
  • Rational Adversary: MGC = MB

Can estimate when the offline adversary will give up.
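
One way to make the halting condition concrete is sketched below. The slide does not define MGC or MB, so the model here is an assumption: each cracked account is worth `value`, each guess costs `cost_per_guess`, the marginal benefit of the i-th guess is `value * f_i / N`, and the marginal cost is `cost_per_guess` times the probability that the account has not been cracked yet. All names are hypothetical.

```python
def stopping_point(freq_list, value, cost_per_guess):
    """Number of guesses a rational attacker makes before giving up.

    Assumed model (not taken from the slides):
      marginal benefit of guess i = value * f_i / N
      marginal cost of guess i    = cost_per_guess * Pr[not cracked so far]
    The attacker halts at the first guess where cost >= benefit.
    """
    n = sum(freq_list)
    cracked = 0  # accounts covered by the guesses made so far
    for guesses_made, f_i in enumerate(freq_list):
        marginal_benefit = value * f_i / n
        marginal_cost = cost_per_guess * (1 - cracked / n)
        if marginal_cost >= marginal_benefit:
            return guesses_made
        cracked += f_i
    return len(freq_list)


print(stopping_point([5, 1, 1, 1, 1, 1], value=1.0, cost_per_guess=0.45))  # 1
```

Raising `cost_per_guess` (for example through key-stretching) moves the stopping point earlier, which is the effect the slide proposes to quantify.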

slide-9
SLIDE 9

Available Password Frequency Lists

Site            #User Accounts (N)   How Released
RockYou         32.6 Million         Data Breach*
LinkedIn        6                    Data Breach*
…               …                    …
Yahoo! [B12]    70 Million           With Permission**

* entire frequency list available due to improper password storage
** frequency list perturbed slightly to preserve differential privacy.

Yahoo! Frequency data is now available online at: https://figshare.com/articles/Yahoo_Password_Frequency_Corpus/2057937

slide-10
SLIDE 10

How the project started

Would it be possible to access the Yahoo! data? I am working on a cool new research project and the password frequency data would be very useful.

slide-11
SLIDE 11

How the project started

I would love to make the data public, but Yahoo! Legal has concerns about security and privacy. They won't let me release it.

slide-16
SLIDE 16

Why not just publish the original frequency lists?

  • Heuristic approaches to data privacy often break down when the adversary has background knowledge
    • Netflix Prize Dataset [NS08]
      • Background Knowledge: IMDB
    • Massachusetts Group Insurance Medical Encounter Database [SS98]
      • Background Knowledge: Voter Registration Record
    • Many other attacks [BDK07,…]
  • In the absence of provable privacy guarantees, Yahoo! was understandably reluctant to release these password frequency lists.

slide-17
SLIDE 17

Security Risks (Example)

[Figure: password dataset ???, 12345, abc123, abc123, with the label "Adversary Background Knowledge".]

slide-18
SLIDE 18

Security Risks (Example)

[Figure: password dataset ???, 12345, abc123, abc123, with histograms over the categories abc123, 12345, and other.]

slide-19
SLIDE 19

Differential Privacy (Dwork et al)

𝑔 βˆ’ 𝑔′ 1 ≝

𝑗

𝑔

𝑗 βˆ’ 𝑔 𝑗′

Definition: An (randomized) algorithm A preserves 𝜁, πœ€ -differential privacy if for any subset SβŠ† π‘†π‘π‘œπ‘•π‘“(𝐡) of possible outcomes and any we have Pr 𝐡(𝑔) ∈ S ≀ π‘“πœPr 𝐡(𝑔′) ∈ S + πœ€ for any pair of adjacent password frequency lists f and f’, 𝑔 βˆ’ 𝑔′ 1 = 1.
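
The adjacency notion in this definition can be checked on small examples. The sketch below is illustrative only: `l1_distance` pads the shorter list with zeros before comparing, and `remove_one` (a hypothetical helper) models one user leaving the dataset and then re-sorts the counts.

```python
def l1_distance(f, g):
    """Sum of |f_i - g_i|, treating missing entries as 0."""
    length = max(len(f), len(g))
    f = list(f) + [0] * (length - len(f))
    g = list(g) + [0] * (length - len(g))
    return sum(abs(a - b) for a, b in zip(f, g))


def remove_one(freq_list, index):
    """Frequency list after one user of the password at rank `index` leaves."""
    counts = list(freq_list)
    counts[index] -= 1
    return sorted((c for c in counts if c > 0), reverse=True)


f = [2, 1, 1]
f_prime = remove_one(f, 2)  # a user with a unique password leaves
print(f_prime)                  # [2, 1]
print(l1_distance(f, f_prime))  # 1
```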

slide-22
SLIDE 22

Differential Privacy (Dwork et al)

Definition: A (randomized) algorithm A preserves $(\varepsilon, \delta)$-differential privacy if for any subset $S \subseteq \mathrm{Range}(A)$ of possible outcomes and for any pair of adjacent password frequency lists f and f′, $\|f - f'\|_1 = 1$, we have

$\Pr[A(f) \in S] \le e^{\varepsilon} \Pr[A(f') \in S] + \delta$.

  • $\varepsilon$: small constant (e.g., $\varepsilon = 0.5$)
  • $\delta$: negligibly small value (e.g., $\delta = 2^{-100}$)
  • f: original password frequency list
  • f′: frequency list after removing Alice's password from the dataset
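
To get a feel for what the guarantee means numerically, the right-hand side of the inequality can be evaluated for the example parameters. This is a plain arithmetic sketch; the function name and the starting probability 0.01 are assumptions chosen for illustration.

```python
import math


def worst_case_probability(p_without, epsilon, delta):
    """Upper bound on Pr[A(f) in S] given Pr[A(f') in S] = p_without."""
    return min(1.0, math.exp(epsilon) * p_without + delta)


# If a harmful outcome has probability 1% when Alice's password is absent,
# including it raises that probability by at most a factor of e^0.5
# (about 1.65), plus the negligible additive term delta.
bound = worst_case_probability(0.01, epsilon=0.5, delta=2.0 ** -100)
print(bound)
```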

slide-23
SLIDE 23

Differential Privacy (Example)

[Figure: frequency list f minus Alice's password = frequency list f′. Within the set of outcomes, S is the subset of all potentially harmful outcomes to Alice.]

slide-24
SLIDE 24

Differential Privacy (Example)

$\Pr[A(f) \in S] \le e^{\varepsilon} \Pr[A(f') \in S] + \delta$

[Figure: frequency list f minus Alice's password = frequency list f′.]

Subset S of all potentially harmful outcomes to Alice

slide-25
SLIDE 25

Differential Privacy (Example)

$\Pr[A(f) \in S] \le e^{\varepsilon} \Pr[A(f') \in S] + \delta$

Intuition: Alice will not be harmed because her password was included in the dataset.

slide-26
SLIDE 26

Main Technical Result

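Theorem: There is a computationally efficient algorithm $\tilde{f} \leftarrow A(f)$ such that A preserves $(\varepsilon, \delta)$-differential privacy and, except with probability $\delta$, outputs $\tilde{f}$ s.t. $\frac{\|f - \tilde{f}\|_1}{N} \le O\left(\frac{1}{\varepsilon\sqrt{N}} + \frac{\ln(1/\delta)}{\varepsilon N}\right)$.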
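
To see the scale of this bound, the expression inside the O(·) can be evaluated for a Yahoo!-sized dataset. The constant hidden by the O-notation is not given on the slide, so the parameter `c` below is a placeholder set to 1; the result shows the order of magnitude only, not an actual error figure.

```python
import math


def distortion_bound(n, epsilon, delta, c=1.0):
    """c * (1/(eps*sqrt(N)) + ln(1/delta)/(eps*N)); c is an assumed constant."""
    return c * (1.0 / (epsilon * math.sqrt(n))
                + math.log(1.0 / delta) / (epsilon * n))


b = distortion_bound(70_000_000, epsilon=0.5, delta=2.0 ** -100)
print(b)  # on the order of 2.4e-4 when c = 1
```

For N this large the first term dominates, so the normalized distortion shrinks roughly like 1/√N.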

slide-28
SLIDE 28

Main Tool: Exponential Mechanism [MT07]

Input: f

Output: $\Pr[\mathcal{E}^{\varepsilon}(f) = \tilde{f}] \propto e^{-\|f - \tilde{f}\|_1 / 2\varepsilon}$

Assigns very small probability to inaccurate outcomes.

Theorem [MT07]: The exponential mechanism preserves $(\varepsilon, 0)$-differential privacy.
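
For very small inputs the sampling structure of the exponential mechanism can be shown by brute force. The sketch below is not the authors' algorithm: it enumerates only partitions of integers up to a hypothetical cap `max_total` (the real output space is unbounded), and it leaves the constant in the exponent as a generic parameter `rate` rather than fixing it.

```python
import math
import random


def partitions(n, max_part=None):
    """Yield every partition of n as a non-increasing tuple."""
    if max_part is None or max_part > n:
        max_part = n
    if n == 0:
        yield ()
        return
    for first in range(max_part, 0, -1):
        for rest in partitions(n - first, first):
            yield (first,) + rest


def l1(f, g):
    """L1 distance with zero padding."""
    length = max(len(f), len(g))
    f = tuple(f) + (0,) * (length - len(f))
    g = tuple(g) + (0,) * (length - len(g))
    return sum(abs(a - b) for a, b in zip(f, g))


def exponential_mechanism(f, rate, max_total, rng):
    """Sample a partition with probability proportional to exp(-rate * L1)."""
    candidates = [p for n in range(max_total + 1) for p in partitions(n)]
    weights = [math.exp(-rate * l1(f, p)) for p in candidates]
    total = sum(weights)
    probs = {p: w / total for p, w in zip(candidates, weights)}
    sample = rng.choices(candidates, weights=weights)[0]
    return sample, probs


sample, probs = exponential_mechanism((2, 1, 1), rate=1.0, max_total=6,
                                      rng=random.Random(1))
print(sample, round(probs[(2, 1, 1)], 3))
```

Enumerating every candidate is only feasible for toy sizes, which is the efficiency problem the later slides address.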

slide-30
SLIDE 30

Analysis: Exponential Mechanism

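Input: f

Output: $\Pr[\mathcal{E}^{\varepsilon}(f) = \tilde{f}] \propto e^{-\|f - \tilde{f}\|_1 / 2\varepsilon}$

Assigns very small probability to inaccurate outcomes.

Theorem [HR18]: There are $e^{O(\sqrt{N})}$ partitions of the integer N.

Union Bound ⇒ $\|f - \tilde{f}\|_1 \le O\left(\frac{\sqrt{N}}{\varepsilon}\right)$ with high probability.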
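
The partition-counting fact used here can be checked numerically for small N. The sketch below counts partitions with a standard dynamic program and compares the count against the classical upper bound $e^{\pi\sqrt{2N/3}}$, which is one concrete form of $e^{O(\sqrt{N})}$; the function name is an assumption.

```python
import math


def partition_count(n):
    """Number of integer partitions of n (coin-change style DP)."""
    ways = [1] + [0] * n  # ways[t] = partitions of t using parts seen so far
    for part in range(1, n + 1):
        for total in range(part, n + 1):
            ways[total] += ways[total - part]
    return ways[n]


for n in (10, 100):
    bound = math.exp(math.pi * math.sqrt(2 * n / 3))
    print(n, partition_count(n), partition_count(n) <= bound)
```

Because the number of candidates grows only like $e^{O(\sqrt{N})}$ while the weights decay exponentially in the L1 distance, a union bound over all partitions can rule out far-away outputs.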

slide-31
SLIDE 31

Analysis: Exponential Mechanism

Input: f

Output: $\Pr[\mathcal{E}^{\varepsilon}(f) = \tilde{f}] \propto e^{-\|f - \tilde{f}\|_1 / 2\varepsilon}$

Assigns very small probability to inaccurate outcomes.

Theorem: $\frac{\|f - \tilde{f}\|_1}{N} \le O\left(\frac{1}{\varepsilon\sqrt{N}}\right)$ with high probability.

Theorem [MT07]: The exponential mechanism preserves $(\varepsilon, 0)$-differential privacy.

slide-32
SLIDE 32

The Challenge --- Efficiency

Strong Evidence: Sampling from the exponential mechanism is computationally intractable in general (e.g., [U13]).

Naïve Implementation: Exponential time (distribution assigns weights to infinitely many integer partitions)

slide-34
SLIDE 34

Good News

Theorem: There is an efficient algorithm A to sample from a distribution that is $\delta$-close to the exponential mechanism $\mathcal{E}$ over integer partitions. The algorithm uses time and space $O\left(\frac{N\sqrt{N} + N\ln(1/\delta)}{\varepsilon}\right)$.

Key Idea 1: Novel dynamic programming algorithm to compute weights $W_{i,k}$ such that

$\Pr[\tilde{f}_i = k \mid \tilde{f}_{i-1}] = \frac{W_{i,k}}{\sum_{t=0}^{\tilde{f}_{i-1}} W_{i,t}}$.
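
One plausible reading of this idea is sketched below; it is not the authors' algorithm. The assumptions are: the output is truncated to a fixed number of positions `length` with entries at most `max_value`, each position contributes a factor `exp(-rate * |f_i - k|)` with a generic `rate`, and `W[i][k]` is the total weight of all non-increasing tails starting with value k at position i. Entries are then sampled left to right using the conditional probabilities from the slide.

```python
import math
import random


def suffix_weights(f, rate, length, max_value):
    """W[i][k] = total weight of non-increasing tails x_i = k >= x_{i+1} >= ...

    A tail's weight is the product of exp(-rate * |f_j - x_j|) over its
    positions. Requires length >= len(f); f is padded with zeros.
    """
    padded = list(f) + [0] * (length - len(f))
    W = [[0.0] * (max_value + 1) for _ in range(length)]
    for i in range(length - 1, -1, -1):
        running = 0.0  # sum of W[i + 1][t] for t <= k
        for k in range(max_value + 1):
            if i == length - 1:
                tail = 1.0
            else:
                running += W[i + 1][k]
                tail = running
            W[i][k] = math.exp(-rate * abs(padded[i] - k)) * tail
    return W


def sample_partition(f, rate, length, max_value, rng):
    """Draw entries one at a time: Pr[x_i = k | x_{i-1}] = W[i][k] / sum_t W[i][t]."""
    W = suffix_weights(f, rate, length, max_value)
    previous = max_value
    entries = []
    for i in range(length):
        choices = list(range(previous + 1))
        k = rng.choices(choices, weights=[W[i][t] for t in choices])[0]
        entries.append(k)
        previous = k
    return [x for x in entries if x > 0]


print(sample_partition([2, 1], rate=0.7, length=3, max_value=3,
                       rng=random.Random(0)))
```

The table has `length × (max_value + 1)` entries and each is filled in constant time, so sampling never enumerates partitions explicitly.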

slide-35
SLIDE 35

Good News

Theorem: There is an efficient algorithm A to sample from a distribution that is $\delta$-close to the exponential mechanism $\mathcal{E}$ over integer partitions. The algorithm uses time and space $O\left(\frac{N\sqrt{N} + N\ln(1/\delta)}{\varepsilon}\right)$.

Key Idea 1: Novel dynamic programming algorithm to compute weights $W_{i,t}$.

Key Idea 2: Allow A to ignore a partition $\tilde{f}$ if $\|f - \tilde{f}\|_1$ is very large.

slide-36
SLIDE 36

RockYou Experiments

slide-37
SLIDE 37

Yahoo! Results

                          Original Data                                           Sanitized Data
Group                     N            log2(N/λ1)  log2(N/λ100)  log2(G0.5)     N            log2(N/λ1)  log2(N/λ100)  log2(G0.5)
All                       69,301,337   6.5         11.4          21.6           69,299,074   6.5         11.4          21.6
gender (self-reported)
  Female                  30,545,765   6.9         11.5          21.1           30,545,765   6.9         11.5          21.1
  Male                    38,624,554   6.3         11.3          21.8           38,624,554   6.3         11.3          21.8
  …                       …            …           …             …              …            …           …             …
language preference
  Chinese                 1,564,364    6.5         11.1          22.0           1,571,348    6.5         11.1          21.8
  …                       …            …           …             …              …            …           …             …

slide-39
SLIDE 39

Yahoo! Results (Selecting Epsilon)

$\varepsilon = \varepsilon_{all} + 22\varepsilon'$

Any individual participates in at most 23 groups (including All)

slide-40
SLIDE 40

Yahoo! Results (Selecting Epsilon)

$\varepsilon = \varepsilon_{all} + 22\varepsilon'$

$\varepsilon_{all} = 0.25$

$\varepsilon' = \frac{\varepsilon_{all}}{22}$
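
The budget split can be verified with a line of arithmetic. The sketch assumes basic sequential composition, where the privacy parameters of separate releases involving the same individual add up: one release for the All group plus up to 22 further group releases. The function and variable names are hypothetical.

```python
def total_epsilon(eps_all, eps_group, other_groups=22):
    """Overall epsilon when one person appears in All plus other_groups lists."""
    return eps_all + other_groups * eps_group


eps_all = 0.25
eps_group = eps_all / 22  # per-group budget for the remaining 22 groups
total = total_epsilon(eps_all, eps_group)
print(total)  # approximately 0.5
```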

slide-42
SLIDE 42

Yahoo! Results (Selecting Epsilon)

$\varepsilon = 0.5$, $\delta = 2^{-100}$

slide-43
SLIDE 43

Conclusions

  • Novel differentially private algorithm for integer partitions
  • Password Frequency Lists
  • Degree Distribution in a Social Network?
  • Other applications?
  • The Yahoo! Frequency data is now available
  • Search: "Yahoo! Password Frequency Corpus"
  • What exciting things can we do with it?
  • Hope for other organizations to imitate Yahoo!