Differentially Private Password Frequency Lists
Jeremiah Blocki MSR/Purdue Anupam Datta CMU Joseph Bonneau Stanford/EFF
Or, How to release statistics from 70 million passwords (on purpose)
Password Dataset (N users): password, 12345, abc123, abc123
Histogram: password: 1, 12345: 1, abc123: 2
Frequency List: 2, 1, 1
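The step from a password dataset to its histogram and frequency list can be sketched in a few lines of Python. This is a minimal illustration on the four-password example; the function name `frequency_list` is an assumption and does not come from the slides.

```python
from collections import Counter

def frequency_list(passwords):
    """Return (histogram, frequency list) for a list of user passwords.

    The histogram maps each password to its count; the frequency list keeps
    only the counts, sorted from most to least popular.
    """
    histogram = Counter(passwords)
    freq = sorted(histogram.values(), reverse=True)
    return histogram, freq

passwords = ["password", "12345", "abc123", "abc123"]
histogram, f = frequency_list(passwords)
print(dict(histogram))
print(f)
```

The frequency list drops the passwords themselves and keeps only how often each one was chosen.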
Formal Notation: $f = f_1, \ldots, f_N$ such that $f_i \ge f_{i+1} \ge 0$ and $\sum_i f_i = N$.
$\lambda_K = \sum_{i=1}^{K} f_i$
Estimate #accounts compromised by attacker with $K$ guesses per user
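As a rough sketch, the number of accounts compromised by an attacker with a fixed number of guesses per user is the sum of the largest entries of the frequency list. The function name `accounts_compromised` and the parameter `k` are illustrative assumptions.

```python
def accounts_compromised(f, k):
    """Accounts cracked by an attacker who tries the k most popular passwords.

    f is a frequency list (counts per distinct password); sorting first makes
    the function safe to call on an unsorted list of counts as well.
    """
    return sum(sorted(f, reverse=True)[:k])

f = [2, 1, 1]          # frequency list of the four-user example
N = sum(f)             # number of users
for k in range(4):
    cracked = accounts_compromised(f, k)
    print(k, cracked, cracked / N)
```

Dividing by N gives the fraction of accounts compromised rather than the count.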
Halting Condition (Rational Offline Adversary): password frequency lists allow us to estimate when the offline adversary will give up.
Yahoo! frequency data is now available online at: https://figshare.com/articles/Yahoo_Password_Frequency_Corpus/2057937
* entire frequency list available due to improper password storage
** frequency list perturbed slightly to preserve differential privacy.
Would it be possible to access the Yahoo! data? I am working on a cool new research project and the password frequency data would be very useful.
I would love to make the data public, but Yahoo! Legal has concerns about security and privacy. They won't let me release it.
adversary has background knowledge
understandably reluctant to release these password frequency lists.
[Figure: example password datasets (12345, abc123, abc123, ...) and their frequency lists]
$\|f - f'\|_1 \triangleq \sum_i |f_i - f'_i|$
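The distance between two frequency lists can be sketched as follows, assuming lists of different lengths are padded with zeros before comparison. The helper names `l1_distance` and `adjacent` are illustrative, not from the slides.

```python
from itertools import zip_longest

def l1_distance(f, g):
    """L1 distance between two frequency lists, padding the shorter with zeros."""
    return sum(abs(a - b) for a, b in zip_longest(f, g, fillvalue=0))

def adjacent(f, g):
    """Two frequency lists are treated as adjacent when their L1 distance is 1."""
    return l1_distance(f, g) == 1

f = [2, 1, 1]            # password, 12345, abc123, abc123
without_12345 = [2, 1]   # the user who chose 12345 is removed
without_abc123 = [1, 1, 1]  # one of the abc123 users is removed
print(l1_distance(f, without_12345), adjacent(f, without_12345))
print(l1_distance(f, without_abc123), adjacent(f, without_abc123))
```

Removing any single user changes exactly one count by one, so the resulting list is at distance 1 from the original.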
Definition: A (randomized) algorithm $A$ preserves $(\epsilon, \delta)$-differential privacy if for any subset $S \subseteq \mathrm{Range}(A)$ of possible outcomes we have $\Pr[A(f) \in S] \le e^{\epsilon} \Pr[A(f') \in S] + \delta$ for any pair of adjacent password frequency lists $f$ and $f'$, i.e., $\|f - f'\|_1 = 1$.
$\epsilon$: Small Constant (e.g., $\epsilon = 0.5$)
$\delta$: Negligibly Small Value (e.g., $\delta = 2^{-100}$)
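For a mechanism with finitely many outcomes, the inequality in the definition can be checked directly. The worst-case subset S collects exactly the outcomes whose probability under one input exceeds $e^{\epsilon}$ times its probability under the other. The sketch below uses two made-up output distributions `p` and `q`; they are toy values for illustration, not data from the talk.

```python
import math

def dp_violation(p, q, eps):
    """Largest value of p(S) - e^eps * q(S) over all subsets S of outcomes.

    p and q map outcomes to probabilities. The maximum is attained by the set
    of outcomes where p exceeds e^eps * q, so summing the positive gaps suffices.
    """
    outcomes = set(p) | set(q)
    return sum(max(0.0, p.get(o, 0.0) - math.exp(eps) * q.get(o, 0.0))
               for o in outcomes)

def satisfies_dp(p, q, eps, delta):
    """Check the (eps, delta) inequality in both directions for one adjacent pair."""
    return dp_violation(p, q, eps) <= delta and dp_violation(q, p, eps) <= delta

# Toy output distributions of a mechanism on two adjacent inputs (hypothetical).
p = {"a": 0.6, "b": 0.4}
q = {"a": 0.4, "b": 0.6}
print(satisfies_dp(p, q, eps=0.5, delta=0.0))
print(satisfies_dp(p, q, eps=0.1, delta=0.0))
print(satisfies_dp(p, q, eps=0.1, delta=0.2))
```

A full privacy proof has to cover every adjacent pair of inputs; this check covers only the single pair it is given.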
[Figure: frequency lists $f$ and $f'$, where $f'$ is $f$ minus one user's password]
Subset S of all potentially harmful outcomes to Alice
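Theorem: There is a computationally efficient algorithm $\tilde{f} \leftarrow A(f)$ such that $A$ preserves $(\epsilon, \delta)$-differential privacy and, except with probability $\delta$, outputs $\tilde{f}$ s.t. $\frac{\|f - \tilde{f}\|_1}{N} \le O\left(\frac{1}{\epsilon\sqrt{N}} + \frac{\ln(1/\delta)}{\epsilon N}\right)$.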
Input: $f$
Output: $\tilde{f}$ with $\Pr[\mathcal{E}(f) = \tilde{f}] \propto e^{-\epsilon \|f - \tilde{f}\|_1 / 2}$
Assigns very small probability to inaccurate outcomes.
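For very small inputs the exponential mechanism can be written out by brute force. The sketch below assumes the standard form in which a candidate list gets weight $e^{-\epsilon \cdot d / 2}$, with $d$ its L1 distance to the input. It also truncates the candidate set to partitions of the integers 0 through `max_total` so that enumeration terminates. Both the truncation and all function names are assumptions made for illustration; this is not the efficient sampler from the talk.

```python
import math
import random
from itertools import zip_longest

def partitions(n, max_part=None):
    """Yield every integer partition of n as a non-increasing tuple."""
    if max_part is None:
        max_part = n
    if n == 0:
        yield ()
        return
    for first in range(min(n, max_part), 0, -1):
        for rest in partitions(n - first, first):
            yield (first,) + rest

def l1_distance(f, g):
    """L1 distance between two frequency lists, padding the shorter with zeros."""
    return sum(abs(a - b) for a, b in zip_longest(f, g, fillvalue=0))

def exponential_mechanism(f, eps, max_total):
    """Return {candidate: probability}, weight exp(-eps * distance / 2).

    Candidates are all partitions of 0..max_total (a truncation assumed here
    so that brute-force enumeration is finite).
    """
    candidates = [p for n in range(max_total + 1) for p in partitions(n)]
    weights = [math.exp(-eps * l1_distance(f, c) / 2) for c in candidates]
    total = sum(weights)
    return {c: w / total for c, w in zip(candidates, weights)}

f = (2, 1, 1)
dist = exponential_mechanism(f, eps=0.5, max_total=8)
sample = random.choices(list(dist), weights=list(dist.values()), k=1)[0]
print(sample)
```

Because the candidate set is the same for every input, the probability of any output changes by a factor of at most $e^{\epsilon}$ between adjacent inputs. The cost of enumeration grows very quickly with N, which is why a direct implementation is impractical.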
Theorem [MT07]: The exponential mechanism preserves $(\epsilon, 0)$-differential privacy.
Theorem [HR18]: There are $e^{O(\sqrt{N})}$ partitions of the integer $N$.
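The growth rate of the number of integer partitions can be checked numerically. The sketch below counts partitions with a standard dynamic program and compares the logarithm of the count with $\pi\sqrt{2N/3}$, the exponent in the Hardy-Ramanujan asymptotic formula. The function name `partition_count` is an illustrative choice.

```python
import math

def partition_count(n):
    """Count the integer partitions of n (coin-change style dynamic program)."""
    ways = [1] + [0] * n
    for part in range(1, n + 1):
        for total in range(part, n + 1):
            ways[total] += ways[total - part]
    return ways[n]

for n in (10, 100, 1000):
    count = partition_count(n)
    # log of the exact count vs. the exponent pi * sqrt(2n/3)
    print(n, math.log(count), math.pi * math.sqrt(2 * n / 3))
```

The logarithm of the count grows on the order of $\sqrt{N}$, which is what makes a union bound over all partitions affordable.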
Union Bound $\Rightarrow$ $\|f - \tilde{f}\|_1 \le O\left(\frac{\sqrt{N}}{\epsilon}\right)$ with high probability.
Theorem: $\frac{\|f - \tilde{f}\|_1}{N} \le O\left(\frac{1}{\epsilon\sqrt{N}}\right)$ with high probability.
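To get a feel for the scale of this bound, the expression inside the $O(\cdot)$ can be evaluated for a dataset of the size discussed in this talk. The hidden constant is not given in the slides, so the number below is only an order-of-magnitude indication. The function name is an illustrative assumption.

```python
import math

def l1_error_rate_scale(n, eps):
    """Evaluate 1 / (eps * sqrt(n)), the expression inside the O(.) bound.

    The constant factor hidden by the O-notation is unknown and omitted.
    """
    return 1.0 / (eps * math.sqrt(n))

n = 69_301_337   # number of users in the full Yahoo! dataset
eps = 0.5        # the example value of epsilon given in the definition
scale = l1_error_rate_scale(n, eps)
print(scale)
```

At this size the expression is roughly 2.4e-4, so the normalized distortion shrinks as the dataset grows.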
Strong Evidence: Sampling from the exponential mechanism is computationally intractable in general (e.g., [U13]).
Naïve Implementation: Exponential time (distribution assigns weights to infinitely many integer partitions)
Theorem: There is an efficient algorithm $A$ to sample from a distribution that is $\delta$-close to the exponential mechanism $\mathcal{E}$ over integer partitions. The algorithm uses time and space $O\left(\frac{N\sqrt{N} + N \ln(1/\delta)}{\epsilon}\right)$.
Key Idea 1: Novel dynamic programming algorithm to compute weights $W_{i,k}$ such that $\ldots = \sum_{t=0}^{\ldots} W_{i,t}$.
Key Idea 2: Allow $A$ to ignore a partition $\tilde{f}$ if $\|f - \tilde{f}\|_1$ is very large.
                          Original Data                        Sanitized Data
Group                     N            …_10   …_1000  …_0.5    N            …_10   …_1000  …_0.5
All                       69,301,337   6.5    11.4    21.6     69,299,074   6.5    11.4    21.6
gender (self-reported)
  Female                  30,545,765   6.9    11.5    21.1     30,545,765   6.9    11.5    21.1
  Male                    38,624,554   6.3    11.3    21.8     38,624,554   6.3    11.3    21.8
  …
language preference
  Chinese                 1,564,364    6.5    11.1    22.0     1,571,348    6.5    11.1    21.8
  …
Any individual participates in at most 23 groups (including All)
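Since one individual can appear in up to 23 released frequency lists, the privacy parameters of those releases add up under basic composition: the sum of the per-release $\epsilon$ values and the sum of the per-release $\delta$ values bound the total. The sketch below splits a total budget evenly across 23 releases. The even split and the budget values are hypothetical, and the slides do not state how the budget was actually divided.

```python
def total_privacy_budget(per_release):
    """Basic composition: epsilons and deltas add up across releases."""
    eps_total = sum(eps for eps, _ in per_release)
    delta_total = sum(delta for _, delta in per_release)
    return eps_total, delta_total

def even_split(eps_total, delta_total, groups):
    """Hypothetical allocation: give every group the same share of the budget."""
    return [(eps_total / groups, delta_total / groups)] * groups

# Assumed total budget, reusing the example values of epsilon and delta.
releases = even_split(0.5, 2 ** -100, 23)
eps_sum, delta_sum = total_privacy_budget(releases)
print(releases[0])
print(eps_sum, delta_sum)
```

An uneven split, for example giving more budget to the largest group, composes the same way as long as the per-release parameters sum to the intended total.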