Matthias Grossglauser, EPFL, CTW 2013
AOL search log excerpt (anonymized user ID, query, timestamp):
4417749  care packages  2006-03-02 09:19:32
4417749  movies for dogs  2006-03-02 09:24:14
4417749  blue book  2006-03-03 11:48:52
4417749  best dog for older owner  2006-03-06 11:48:24
4417749  best dog for older owner  2006-03-06 11:48:24
4417749  rescue of older dogs  2006-03-06 11:55:25
4417749  school supplies for the iraq children  2006-03-06 13:36:33
4417749  school supplies for the iraq children  2006-03-06 13:36:33
4417749  pine straw lilburn delivery  2006-03-06 18:35:02
4417749  pine straw delivery in gwinnett county  2006-03-06 18:36:35
4417749  landscapers in lilburn ga.  2006-03-06 18:37:26
4417749  pne straw in lilburn ga.  2006-03-06 18:38:19
4417749  pine straw in lilburn ga.  2006-03-06 18:38:27
4417749  gwinnett county yellow pages  2006-03-06 18:42:08
...
Searches: “landscapers in Lilburn, Ga”, “homes sold in shadow lake subdivision gwinnett county georgia”, “jarrett t. arnold”, “jack t. arnold”
4417749 = Thelma Arnold: 62-year-old widow and dog owner; home: Lilburn, GA
AOL press release: “There was no personally identifiable data provided by AOL with those records, but search queries themselves can sometimes include such information.”
Heads had to roll… AOL CTO Maureen Govern (+2 others) fired
Personally identifiable information (PII): “information that can be used to uniquely identify, contact, or locate a single person or can be used with other sources to uniquely identify a single individual” (Wikipedia)
Name     Home  Work
Adam     A     EPFL
Barbara  B     EPFL
Carlos   A     UNIL
[Figure: homes A, B and workplaces UNIL, EPFL]
Adversary has:
Anonymized network = unlabeled graph
Side information: subgraph; statistics on certain nodes; noisy version of whole network; …
[Figure: anonymized social network next to side information with labeled nodes Adam, Barbara, Carlos]
Other applications:
Find overlap in networks: social networks from different domains & time slots
Identify viruses by function-call patterns
Computer vision: matching segment graphs for different viewing angles
…
[Figure: matching nodes (e.g. peter.muster@epfl.ch, 021-693-1233) by structure only]
Fundamental feasibility w/o side information, but with ∞ time and memory
Is it fundamentally hard or easy to match similar graphs by structure?
Fundamental = information-theoretic: ignore computational & memory cost
Hard: in addition to second graph, no other side information
Demanding: want to match every vertex
Random graph model $G(n, p(n))$: first published 1959 by Erdős & Rényi
Focus on existence results
Large $n$ asymptotics and phase transitions:
Connectivity
Existence of subgraphs
Giant component
Chromatic number
Automorphism group
…
Threshold for asymmetry: $p = \log n / n$
[Figure: an asymmetric graph with AuG = 1 and a symmetric graph with AuG = 12]
AuG = size of automorphism group
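For very small graphs, the size of the automorphism group can be checked by brute force. The sketch below is only an illustration: the function name and both example graphs are hypothetical and are not the graphs drawn on the slide.

```python
from itertools import permutations

def automorphism_count(n, edges):
    """Count permutations of range(n) that map every edge onto an edge."""
    edge_set = {frozenset(e) for e in edges}
    count = 0
    for pi in permutations(range(n)):
        # A permutation is injective on edges, so mapping all edges into the
        # edge set is enough for it to preserve the whole graph.
        if all(frozenset((pi[u], pi[v])) in edge_set for u, v in edges):
            count += 1
    return count

# Hypothetical examples: a 6-cycle, and a 6-node graph with no symmetry.
hexagon = [(i, (i + 1) % 6) for i in range(6)]
lopsided = [(0, 1), (1, 2), (2, 3), (3, 4), (1, 5), (2, 5)]
print(automorphism_count(6, hexagon), automorphism_count(6, lopsided))
```

The 6-cycle has 12 automorphisms (its rotations and reflections), while the second graph has only the identity. The loop visits all n! permutations, so this is usable only for a handful of nodes.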
Generator $G = G(n, p)$: “real” social ties
Each edge of $G$ is either sampled (probability $s$) or not sampled (probability $1 - s$) in each observed graph, e.g. phone calls, emails
$s$ measures similarity
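A minimal sketch of this sampling model, assuming that each observed graph keeps every edge of the generator independently with probability s. The function name `sample_gnps` and the edge-set representation are assumptions made for illustration.

```python
import random
from itertools import combinations

def sample_gnps(n, p, s, seed=0):
    """Return (G, G1, G2) as sets of edges (i, j) with i < j."""
    rng = random.Random(seed)
    # Generator graph G(n, p): each of the n*(n-1)/2 pairs is an edge w.p. p.
    g = [e for e in combinations(range(n), 2) if rng.random() < p]
    # Each observed graph keeps every edge of G independently w.p. s.
    g1 = {e for e in g if rng.random() < s}
    g2 = {e for e in g if rng.random() < s}
    return set(g), g1, g2

g, g1, g2 = sample_gnps(n=200, p=0.1, s=0.8)
# Edge counts of G, G1, G2 and of the overlap between G1 and G2.
print(len(g), len(g1), len(g2), len(g1 & g2))
```

With s = 1 the two observed graphs coincide with the generator; as s decreases, their overlap shrinks, which is the sense in which s measures similarity.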
$n!$ possible mappings!
[Figure: two example mappings with $\Delta_{\pi_0} = 0$ and $\Delta_{\pi} = 2$]
Assumption: Attacker has infinite computational power
Can try all possible mappings $\pi$ and compute edge mismatch function $\Delta(\pi)$
Question: Are there conditions on $p$, $s$ such that
$$P\big(\pi_0 = \text{unique min of } \Delta(\pi)\big) \to 1\,?$$
If yes: adversary would be able to match vertex sets only through the structure of the two networks!
Note: $G(n, p; s)$ model: statistically uniform, low clustering, degree distribution not skewed -> conjecture: harder than real networks
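The exhaustive attack can be sketched for toy sizes. This illustration assumes that the mismatch counts the edges of the mapped first graph that are missing from the second graph, plus the edges of the second graph that are missing from the mapped first graph; the function names and the 6-node example are hypothetical.

```python
from itertools import permutations

def mismatch(e1, e2, pi):
    """Assumed edge mismatch: size of the symmetric difference of pi(E1) and E2."""
    mapped = {frozenset((pi[u], pi[v])) for u, v in e1}
    target = {frozenset(e) for e in e2}
    return len(mapped ^ target)

def best_mappings(n, e1, e2):
    """Try all n! mappings; return (minimum mismatch, list of minimizers)."""
    best, argmin = None, []
    for pi in permutations(range(n)):
        d = mismatch(e1, e2, pi)
        if best is None or d < best:
            best, argmin = d, [pi]
        elif d == best:
            argmin.append(pi)
    return best, argmin

# Toy case: G2 is a small asymmetric graph, G1 is the same graph relabeled.
e2 = [(0, 1), (1, 2), (2, 3), (3, 4), (1, 5), (2, 5)]
hidden = (2, 0, 3, 5, 1, 4)                 # node u of G1 is node hidden[u] of G2
inverse = {v: u for u, v in enumerate(hidden)}
e1 = [(inverse[a], inverse[b]) for a, b in e2]
best, argmin = best_mappings(6, e1, e2)
print(best, argmin)
```

Because the toy graph has no non-trivial automorphism and the two copies are identical, the hidden relabeling is the only mapping with zero mismatch. The slide's question is whether the same holds with high probability when the two graphs are only similar.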
Theorem: For the $G(n, p; s)$ matching problem, if
$$nps \cdot \frac{s^2}{2 - s} = 8 \log n + \omega(1)$$
then the identity permutation minimizes $\Delta(\cdot)$ a.a.s.
Reading the condition: $nps$ is E[degree] of $G_{1,2}$; $8 \log n$ relates to the threshold for aut(G) = 1; $\frac{s^2}{2 - s}$ is the penalty for the difference $G_1$-$G_2$; $\omega(1)$ is “growing slowly”
Interpretation: two pieces of bad/good news
Surprisingly weak condition: degree growing faster than ~ $\log n$ is enough to break anonymity
Decrease with $s$ only quadratic
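To get a feel for the condition, the sketch below reads it as nps · s²/(2 − s) = 8 log n + ω(1), drops the slowly growing ω(1) term, and solves for the mean degree nps. The natural logarithm and the function name are assumptions.

```python
import math

def min_mean_degree(n, s):
    """Leading-order mean degree n*p*s solving n*p*s * s^2 / (2 - s) = 8 log n.

    The omega(1) term of the condition is ignored, and log is taken as the
    natural logarithm (an assumption).
    """
    return 8 * math.log(n) * (2 - s) / s**2

for s in (1.0, 0.75, 0.5):
    print(s, round(min_mean_degree(10**6, s), 1))
```

For s = 1 this reduces to 8 log n, and halving s to 0.5 multiplies the requirement by 6, which is the roughly quadratic penalty in s noted in the interpretation.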
Fix a particular map $\pi$
$V_\pi$: set of mismatched nodes under $\pi$
[Figure: $G_1$ mapped to $G_2$ by π ∈ Π 11; labels: transposition, invariant edge]
$E_\pi = V \times V_\pi$: all the edges modified under $\pi$
$V_\pi$: $k$ nodes; the other $n - k$ nodes
$\Delta_0$: each edge contributes Bernoulli($2ps(1 - s)$): sampling errors
$\Delta_\pi$: each pair of edges contributes Bernoulli($2ps(1 - ps)$): matching errors
[Figure: nodes 1, 2, 3, 4, 5, …, n and the node pairs 12, 13, 14, 15, 23, 24, 25, 34, 35, 45, …, 1n, 2n in both graphs]
$G(n, p; s, t)$ matching problem
Result:
Dependence on $n$ still the same: $nps = c(s, t) \log n + \omega(1)$
Dependence on $s$ and $t$ less intuitive
Interpretation: Node mismatch does not help/hurt too much either
Phase transition, and an efficient & tractable matching algorithm…
INPUT: Seed map of known pairs
Propagate the map to “similar” neighbors on left and right
[A. Narayanan, V. Shmatikov, “De-anonymizing social networks”, IEEE Symp. on Security and Privacy, 2009]
Similarity metric: $\mathrm{sim}(A, B)$ between sets $A$ and $B$
Find max sim(u, v)
Continue until done… or blocked
How many seeds are needed? Is there a phase transition?
How efficiently can we match?
Tuning parameters?
[A. Narayanan, V. Shmatikov, “De-anonymizing social networks”, IEEE Symp. on Security and Privacy, 2009]
[Figure: graphs $G_1$ and $G_2$]
If $\ge r$ matched neighbors: match
[Figure: $G_1$ and $G_2$, with a matching error]
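The threshold rule can be sketched as follows: every processed matched pair gives one mark to each candidate pair formed by its neighbors, and a candidate pair is matched once it collects r marks. The data layout, the helper `build_adj`, the processing order, and the choice to match a pair as soon as it reaches r marks are all assumptions of this sketch, not details taken from the slides.

```python
from collections import defaultdict, deque

def build_adj(n, edges):
    """Adjacency sets for an undirected graph on nodes 0..n-1."""
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def percolation_match(adj1, adj2, seeds, r):
    """Grow a map G1 node -> G2 node from the seed map using threshold r."""
    match = dict(seeds)            # current map
    used2 = set(match.values())    # G2 nodes that are already matched
    marks = defaultdict(int)       # candidate pair -> matched neighbor pairs seen
    queue = deque(match.items())   # matched pairs not yet propagated
    while queue:
        u, v = queue.popleft()
        for a in adj1[u]:
            for b in adj2[v]:
                if a in match or b in used2:
                    continue
                marks[(a, b)] += 1
                if marks[(a, b)] >= r:
                    match[a] = b
                    used2.add(b)
                    queue.append((a, b))
    return match

edges = [(0, 2), (1, 2), (0, 3), (2, 3), (1, 4), (3, 4), (2, 5), (4, 5)]
adj = build_adj(6, edges)
print(percolation_match(adj, adj, {0: 0, 1: 1}, r=2))
print(percolation_match(adj, adj, {0: 0, 1: 1}, r=3))
```

In this toy run both graphs are identical. With r = 2 the two seeds are enough to reach every node, while with r = 3 no candidate pair ever collects enough marks and the process stays blocked at the seeds.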
$G(n, p)$
Extinction prob. of branching process (failure rate)
$np < 1$: consumption > production
$np > 1$: production > consumption
[Plot: probability labels P( ) = 1 and P( ) = 0]
Activation from $r$ neighbors
[S. Janson, T. Luczak, T. Turova, T. Vallier, “Bootstrap Percolation on the Random Graph $G(n, p)$”, Annals Applied Prob., 22(5), 2012]
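A small simulation of activation from r neighbors, as a sketch: a inactive-to-active rule is applied on a freshly sampled G(n, p), starting from a seed nodes. The function name, the choice of nodes 0..a-1 as seeds, and the processing order are assumptions.

```python
import random

def bootstrap_percolation(n, p, r, a, seed=0):
    """Return the final number of active nodes, starting from a active seeds."""
    rng = random.Random(seed)
    adj = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                adj[i].append(j)
                adj[j].append(i)
    active = set(range(a))   # in G(n, p) all nodes are exchangeable, so take the first a
    credits = [0] * n        # number of processed active neighbors of each node
    unused = list(active)    # active nodes that have not yet given credits
    while unused:
        u = unused.pop()
        for v in adj[u]:
            if v in active:
                continue
            credits[v] += 1
            if credits[v] >= r:
                active.add(v)
                unused.append(v)
    return len(active)

for a in (5, 20, 80):
    print(a, bootstrap_percolation(n=2000, p=0.01, r=3, a=a))
```

Each processed node is consumed once, and each newly activated node is a unit of production, which is the bookkeeping behind the consumption versus production picture.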
[Plot for $np = \omega(1)$: regimes “consumption > production” and “production > consumption”; labels P( ) = 1, P( ) = 0, $t_c$, $a_c$]
Theorem: phase transition in # seeds
For $n^{-1} \ll ps \ll s\,n^{-\frac{1}{2} - \frac{3}{2r}}$:
If $a / a_c \to \alpha < 1$, final map is $o(n)$ w.h.p.
If $a / a_c > \alpha > 1$, final map is $n - o(n)$ w.h.p.
Seed set size threshold:
$$a_c = \left(1 - r^{-1}\right) t_c, \qquad t_c = \left( \frac{(r-1)!}{n\,(ps^2)^r} \right)^{1/(r-1)}$$
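The seed threshold can be evaluated numerically. The sketch reads the slide's formulas as t_c = ((r − 1)! / (n (ps²)^r))^{1/(r − 1)} and a_c = (1 − 1/r) t_c; the function name is an assumption, and the code does not check whether the chosen parameters fall inside the regime required by the theorem.

```python
import math

def seed_threshold(n, p, s, r):
    """Return (t_c, a_c) for threshold r, with q = p * s^2 in place of p."""
    q = p * s**2
    t_c = (math.factorial(r - 1) / (n * q**r)) ** (1 / (r - 1))
    a_c = (1 - 1 / r) * t_c
    return t_c, a_c

for r in (2, 3, 4):
    t_c, a_c = seed_threshold(n=10**6, p=1e-4, s=0.8, r=r)
    print(r, round(t_c, 1), round(a_c, 1))
```

Since q enters as q^r, the threshold is very sensitive to the sampling probability s.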
Bootstrap percolation in $G(n, p)$:
# credits of node $i$ at time $t$: i.i.d. Binomials
Percolation graph matching in $G(n, p; s)$:
# credits of pair $(i, j)$ at time $t$: dependent, different Binomials
As long as no matching error so far, increments at $t$:
Different: $(i, i) \sim \mathrm{Ber}(ps^2)$, $(i, j) \sim \mathrm{Ber}((ps)^2)$
Dependent: for $i, i', j$ all different: $P\big((i, j){+}{+}\big) = (ps)^2$, but $P\big((i, j){+}{+} \mid (i', j){+}{+}\big) = ps$
[Figure: $G$, $G_1$, $G_2$]
Approach:
Focus on regime where $X$ = no bad pair $(i, j)$ gets enough credits ($r$) to be potentially matched
True for $ps \ll n^{-\frac{1}{2} - \frac{3}{2r}}$
Need to choose $r$ large enough (sparse graphs: $r \ge 4$, otherwise higher)
Conditional on $X$, only need to focus on good pairs $(i, i)$
Equivalence with bootstrap problem: does it percolate?
Need to have $n^{-1} \ll ps$
Need to have seed set size large enough: $a > a_c$
How to get started in practice
Question: Can similar idea inform algorithm design?
Wishlist:
Cold-start: how to match without seeds?
Sparse graphs: how to avoid blocking?
Error propagation: how to correct mismatches?
[Figure: graph with two seeds (seed1, seed2) and four example node fingerprints]
Fingerprint: (deg=3, dist(seed1)=3, dist(seed2)=1)
Fingerprint: (deg=4, dist(seed1)=1, dist(seed2)=3)
Fingerprint: (deg=3, dist(seed1)=1, dist(seed2)=3)
Fingerprint: (deg=1, dist(seed1)=4, dist(seed2)=2)
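Fingerprints of this form, a degree followed by the hop distance to each seed, can be computed with one breadth-first search per seed. This is a sketch only: the function names and the toy graph are hypothetical, and the slides do not say how the fingerprints are compared afterwards.

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def fingerprints(adj, seeds):
    """Map each node to (degree, dist(seed1), dist(seed2), ...).

    Distances to seeds in another component are reported as None.
    """
    per_seed = [bfs_distances(adj, s) for s in seeds]
    return {v: (len(adj[v]),) + tuple(d.get(v) for d in per_seed) for v in adj}

# Toy graph: a path 0-1-2-3 plus an isolated node 4; the seeds are the path ends.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}, 4: set()}
fp = fingerprints(adj, seeds=[0, 3])
for node in sorted(fp):
    print(node, fp[node])
```

In the toy graph, nodes 1 and 2 have the same degree but different distance vectors, so the seeds break the tie that the degree alone leaves open.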