Releasing a Differentially Private Password Frequency Corpus from 70 Million Yahoo! Passwords
Jeremiah Blocki
Purdue University
DIMACS/Northeast Big Data Hub Workshop on Overcoming Barriers to Data Sharing including Privacy and Fairness
Password Dataset (N users): password, 12345, abc123, abc123
Histogram: password → 1, 12345 → 1, abc123 → 2
Frequency List: f = (2, 1, 1)
Formally: $f \in \mathcal{P}(N)$
Password Frequency List is just an integer partition.
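As a minimal sketch (the helper name `frequency_list` is illustrative and not from the talk), the frequency list of the toy dataset above can be obtained by counting each password and sorting the counts in non-increasing order. The passwords themselves are discarded; only the counts remain.

```python
from collections import Counter


def frequency_list(passwords):
    """Return the password counts sorted in non-increasing order."""
    histogram = Counter(passwords)          # password -> number of users
    return sorted(histogram.values(), reverse=True)


dataset = ["password", "12345", "abc123", "abc123"]
f = frequency_list(dataset)
print(f)       # [2, 1, 1]
print(sum(f))  # 4, the number of users N
```

Because the counts sum to N and are sorted, the result is an integer partition of N.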
Password Frequency Lists allow us to estimate
$\lambda_\beta = \sum_{i=1}^{\beta} f_i$
Estimate #accounts compromised by attacker with $\beta$ guesses per user
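A minimal sketch of this estimate, assuming the frequency list is a plain Python list and using the hypothetical names `lambda_beta` and `beta`: an attacker who tries the `beta` most common passwords against every account compromises the sum of the `beta` largest counts.

```python
def lambda_beta(f, beta):
    """Accounts compromised by guessing the `beta` most common passwords."""
    ranked = sorted(f, reverse=True)
    return sum(ranked[:beta])


f = [2, 1, 1]   # toy frequency list
N = sum(f)
for beta in (1, 2, 3):
    cracked = lambda_beta(f, beta)
    print(beta, cracked, cracked / N)
```

Dividing by N turns the count into the fraction of accounts compromised.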
Yahoo! Frequency data is now available online at:
https://figshare.com/articles/Yahoo_Password_Frequency_Corpus/2057937
* entire frequency list available due to improper password storage
** frequency list perturbed slightly to preserve differential privacy.
Would it be possible to access the Yahoo! data? I am working on a cool new research project and the password frequency data would be very useful.
I would love to make the data public, but Yahoo! Legal has concerns about security and privacy. They won't let me release it.
adversary has background knowledge
understandably reluctant to release these password frequency lists.
$\|f - f'\|_1 \triangleq \sum_i |f_i - f'_i|$
Definition: A (randomized) algorithm $A$ preserves $(\varepsilon, \delta)$-differential privacy if for any subset $S \subseteq \mathrm{Range}(A)$ of possible outcomes and any pair of adjacent password frequency lists $f$ and $f'$ (i.e., $\|f - f'\|_1 = 1$) we have
$\Pr[A(f) \in S] \le e^{\varepsilon} \Pr[A(f') \in S] + \delta$
$\varepsilon$: Small Constant (e.g., $\varepsilon = 0.5$)
$\delta$: Negligibly Small Value (e.g., $\delta = 2^{-100}$)
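Adjacency can be made concrete with a short sketch. Assuming frequency lists are compared position by position after padding the shorter one with zeros (the helper names and the user "Alice" below are illustrative), removing a single user changes the sorted list by exactly 1 in L1 distance:

```python
from collections import Counter
from itertools import zip_longest


def frequency_list(passwords):
    """Password counts sorted in non-increasing order."""
    return sorted(Counter(passwords).values(), reverse=True)


def l1_distance(f, g):
    """L1 distance between two frequency lists, padding with zeros."""
    return sum(abs(a - b) for a, b in zip_longest(f, g, fillvalue=0))


with_alice = ["password", "12345", "abc123", "abc123"]
without_alice = with_alice[:-1]            # drop one user (hypothetical "Alice")

f = frequency_list(with_alice)             # [2, 1, 1]
f_prime = frequency_list(without_alice)    # [1, 1, 1]
print(l1_distance(f, f_prime))             # 1
```

The definition requires the output distribution to be nearly the same on any such pair, so nothing an observer sees can depend much on whether Alice's password was included.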
Figure: adjacent frequency lists f and f′; subset S of all potentially harmful outcomes to Alice
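Theorem: There is a computationally efficient algorithm $A$ such that $A$ preserves $(\varepsilon, \delta)$-differential privacy and, except with probability $\delta$, $A(f)$ outputs $\tilde{f}$ s.t.
$\frac{\|f - \tilde{f}\|_1}{N} \le O\left(\frac{1}{\varepsilon\sqrt{N}} + \frac{\ln(1/\delta)}{\varepsilon N}\right)$.
$\mathrm{Time}(A) = O\left(\frac{N\sqrt{N} + N\ln(1/\delta)}{\varepsilon}\right) = \mathrm{Space}(A)$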
$\mathcal{P} = \bigcup_{n=0}^{\infty} \mathcal{P}(n)$
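To get a feel for the accuracy guarantee, the sketch below evaluates the two terms $1/(\varepsilon\sqrt{N})$ and $\ln(1/\delta)/(\varepsilon N)$ for the Yahoo! "All" group size reported in the talk's results table, with $\varepsilon = 0.5$ and $\delta = 2^{-100}$ as on the definition slide. The hidden constant of the $O(\cdot)$ is not given in the talk, so the numbers indicate orders of magnitude only; the function name is illustrative.

```python
import math


def accuracy_terms(n, eps, delta):
    """Return the two terms of the normalized L1 error bound (constants omitted)."""
    sampling_term = 1.0 / (eps * math.sqrt(n))
    tail_term = math.log(1.0 / delta) / (eps * n)
    return sampling_term, tail_term


n = 69_301_337        # size of the "All" group in the results table
eps = 0.5
delta = 2.0 ** -100

sampling_term, tail_term = accuracy_terms(n, eps, delta)
print(f"1/(eps*sqrt(N))     = {sampling_term:.3e}")
print(f"ln(1/delta)/(eps*N) = {tail_term:.3e}")
```

For a dataset of this size the first term dominates, so the normalized error is driven by $1/(\varepsilon\sqrt{N})$.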
Input: $f$
Output: $\tilde{f}$ with $\Pr[\mathcal{E}(f) = \tilde{f}] \propto e^{-\varepsilon \|f - \tilde{f}\|_1 / 2}$
Assigns very small probability to inaccurate outcomes.
Theorem [MT07]: The exponential mechanism preserves $(\varepsilon, 0)$-differential privacy.
Theorem [HR18]: There are $e^{O(\sqrt{N})}$ partitions of the integer $N$.
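The growth rate can be checked numerically. The sketch below counts partitions with a standard dynamic program and compares the counts against the Hardy-Ramanujan asymptotic $p(N) \sim e^{\pi\sqrt{2N/3}} / (4N\sqrt{3})$. The function names are illustrative.

```python
import math


def count_partitions(n):
    """Return a list p where p[m] is the number of partitions of m, for 0 <= m <= n."""
    p = [1] + [0] * n
    for part in range(1, n + 1):            # allow parts of size `part`
        for total in range(part, n + 1):
            p[total] += p[total - part]
    return p


def hardy_ramanujan(n):
    """Asymptotic estimate exp(pi * sqrt(2n/3)) / (4 * n * sqrt(3))."""
    return math.exp(math.pi * math.sqrt(2 * n / 3)) / (4 * n * math.sqrt(3))


p = count_partitions(200)
for n in (10, 50, 100, 200):
    print(n, p[n], f"ratio to estimate: {p[n] / hardy_ramanujan(n):.3f}")
```

Since the number of candidate outputs is only exponential in $\sqrt{N}$, a union bound over all of them costs an additive $O(\sqrt{N})$ in the exponent.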
Union Bound ⇒ $\|f - \tilde{f}\|_1 \le O\left(\frac{\sqrt{N}}{\varepsilon}\right)$ with high probability when $\frac{1}{\varepsilon} = O(\sqrt{N})$.
Theorem: $\frac{\|f - \tilde{f}\|_1}{N} \le O\left(\frac{1}{\varepsilon\sqrt{N}}\right)$ with high probability.
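For intuition only, here is a brute-force sketch of an exponential mechanism over integer partitions. It assumes weights of the form $e^{-\varepsilon \|f - \tilde{f}\|_1 / 2}$ and, as a simplification that is not part of the talk, restricts the candidates to partitions of integers up to a fixed cap `max_total` chosen independently of the data. Enumerating every candidate is only feasible for tiny inputs, which is exactly why an efficient sampler is needed. All names are illustrative.

```python
import math
import random
from itertools import zip_longest


def partitions(n, max_part=None):
    """Yield all partitions of n as non-increasing tuples."""
    if max_part is None or max_part > n:
        max_part = n
    if n == 0:
        yield ()
        return
    for first in range(max_part, 0, -1):
        for rest in partitions(n - first, first):
            yield (first,) + rest


def l1_distance(f, g):
    """L1 distance between two frequency lists, padding with zeros."""
    return sum(abs(a - b) for a, b in zip_longest(f, g, fillvalue=0))


def exponential_mechanism(f, eps, max_total, rng=random):
    """Sample a partition with probability proportional to exp(-eps * L1 / 2)."""
    candidates = [p for n in range(max_total + 1) for p in partitions(n)]
    weights = [math.exp(-eps * l1_distance(f, p) / 2) for p in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


f = (2, 1, 1)
rng = random.Random(0)   # fixed seed so the demo is repeatable
print(exponential_mechanism(f, eps=0.5, max_total=8, rng=rng))
```

Candidates far from `f` in L1 distance receive exponentially smaller weight, which is the sense in which inaccurate outcomes are unlikely.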
Theorem: There is an efficient algorithm $A$ to sample from a distribution that is $\delta$-close to the exponential mechanism $\mathcal{E}$ over integer partitions. The algorithm uses time and space $O\left(\frac{N\sqrt{N} + N\ln(1/\delta)}{\varepsilon}\right)$.
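Key Intuition: for any index $k$, the weight of $\tilde{f}$ splits into a prefix factor and a suffix factor (constant scaling of the exponent omitted):
$w_{\tilde{f}} = e^{-\sum_{i \le k} |f_i - \tilde{f}_i|} \times e^{-\sum_{i > k} |f_i - \tilde{f}_i|}$
Suggests Potential Recurrence Relationships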
Key Idea 1: Novel dynamic programming algorithm to compute weights Wi,k such that
πβ1 =
t=0
ππβ1 Wi,t
.
Key Idea 2: Allow $A$ to ignore a partition $\tilde{f}$ if $\|f - \tilde{f}\|_1$ is very large.
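The dynamic-programming idea can be illustrated on a simplified problem. The sketch below is not the algorithm from the talk: it assumes outputs are non-increasing sequences of a fixed `length` with entries at most `max_value`, and weights of the form $e^{-\varepsilon \|f - \tilde{f}\|_1 / 2}$. The table entry `W[i][k]` holds the total weight of all valid suffixes that start at position `i` with value `k`, so a sample can be drawn one position at a time without enumerating partitions. All names are hypothetical.

```python
import math
import random


def suffix_weights(f, eps, length, max_value):
    """W[i][k]: total weight of non-increasing suffixes starting at i with value k."""
    if len(f) > length:
        raise ValueError("length must cover every entry of f")
    c = eps / 2
    padded = list(f) + [0] * (length - len(f))
    W = [[0.0] * (max_value + 1) for _ in range(length)]
    for i in range(length - 1, -1, -1):
        tail = 0.0
        for k in range(max_value + 1):
            if i == length - 1:
                tail = 1.0               # nothing follows the last position
            else:
                tail += W[i + 1][k]      # running sum of W[i+1][t] for t <= k
            W[i][k] = math.exp(-c * abs(padded[i] - k)) * tail
    return W


def sample_partition(f, eps, length, max_value, rng=random):
    """Draw a sequence position by position, proportionally to its total weight."""
    W = suffix_weights(f, eps, length, max_value)
    out, cap = [], max_value
    for i in range(length):
        k = rng.choices(range(cap + 1), weights=W[i][:cap + 1], k=1)[0]
        out.append(k)
        cap = k                          # later entries may not exceed this one
    return tuple(v for v in out if v > 0)


rng = random.Random(0)
print(sample_partition((2, 1, 1), eps=0.5, length=6, max_value=6, rng=rng))
```

The table has `length * (max_value + 1)` entries, so the cost is polynomial even though the number of candidate sequences grows exponentially. Bounding the table size for the full, untruncated space is where the second key idea (ignoring partitions that are very far from $f$) comes in.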
$\frac{N\sqrt{N} + N\ln(1/\delta)}{\varepsilon} \times (8 \text{ bytes}) \approx 200$ TB table for sampling.
come from?
Does Yahoo! have any preference about the privacy parameter $\varepsilon$?
Are there standardized guidelines to select $\varepsilon$?
No, I was thinking $\varepsilon = \frac{1}{2}$ would be reasonable…
Yahoo! is fine with $\varepsilon = \frac{1}{2}$
Risk: Industry deployments become de facto standard for selecting $\varepsilon$?
Suggested Dinner Discussion Topic: What role should academia play in influencing these standards?
Original Data [B12] Sanitized Data [BDB16] N π¦π©π‘π πΆ ππ π¦π©π‘π πΆ ππππ π¦π©π‘π π―π.π πΆ π¦π©π‘π πΆ ππ π¦π©π‘π πΆ ππππ π¦π©π‘π π―π.π All 69,301,337 6.5 11.4 21.6 69,299,074 6.5 11.4 21.6 gender (self-reported) Female 30,545,765 6.9 11.5 21.1 30,545,765 6.9 11.5 21.1 Male 38,624,554 6.3 11.3 21.8 38,624,554 6.3 11.3 21.8 β¦ β¦ β¦ β¦ β¦ β¦ β¦ β¦ β¦ language preference Chinese 1,564,364 6.5 11.1 22.0 1,571,348 6.5 11.1 21.8 β¦ β¦ β¦ β¦ β¦ β¦ β¦ β¦ β¦
Any individual participates in at most 23 groups (including All)
ππππ
Application to Social Networks: Degree Distribution with Node Privacy
3 π
π π
1 π2
1 π = Ξ©
Figure (Bitcoin Trust Network): Average L1 Error (200 Samples) vs. $1/\sqrt{\varepsilon}$
Figure (Bitcoin Trust Network): Average L1 Error (200 Samples) vs. $1/\varepsilon^2$
Figure (Bitcoin Trust Network): Mean Squared Error (200 Samples), Laplace (Post Process) vs. Exponential Mechanism
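For readers unfamiliar with the Laplace baseline, the following is a minimal sketch of one common variant, not necessarily the one plotted: add independent Laplace noise with scale $1/\varepsilon$ to each entry of a zero-padded list, then post-process by rounding, clipping at zero, and re-sorting. The fixed padding `length` and the specific post-processing steps are assumptions made for illustration, and the function names are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Laplace(0, scale) sample as the difference of two exponential samples."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def laplace_release(f, eps, length, rng=random):
    """Noisy copy of f, post-processed back into a non-increasing integer list."""
    if len(f) > length:
        raise ValueError("length must cover every entry of f")
    padded = list(f) + [0] * (length - len(f))
    noisy = [x + laplace_noise(1.0 / eps, rng) for x in padded]
    # Post-processing does not touch the raw data, so it costs no extra privacy.
    cleaned = sorted((max(0, round(x)) for x in noisy), reverse=True)
    return [x for x in cleaned if x > 0]


rng = random.Random(0)
print(laplace_release([5, 3, 1], eps=0.5, length=6, rng=rng))
```

Noise is added to the zero-padded positions as well, which is one reason the total error of this approach can grow with the length of the list.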
Anupam Datta (CMU)
Joseph Bonneau (NYU)