SLIDE 1 Matthias Grossglauser, EPFL CTW 2013
SLIDE 2
4417749  care packages                           2006-03-02 09:19:32
4417749  movies for dogs                         2006-03-02 09:24:14
4417749  blue book                               2006-03-03 11:48:52
4417749  best dog for older owner                2006-03-06 11:48:24
4417749  best dog for older owner                2006-03-06 11:48:24
4417749  rescue of older dogs                    2006-03-06 11:55:25
4417749  school supplies for the iraq children   2006-03-06 13:36:33
4417749  school supplies for the iraq children   2006-03-06 13:36:33
4417749  pine straw lilburn delivery             2006-03-06 18:35:02
4417749  pine straw delivery in gwinnett county  2006-03-06 18:36:35
4417749  landscapers in lilburn ga.              2006-03-06 18:37:26
4417749  pne straw in lilburn ga.                2006-03-06 18:38:19
4417749  pine straw in lilburn ga.               2006-03-06 18:38:27
4417749  gwinnett county yellow pages            2006-03-06 18:42:08
...
First column: anonymized user ID
SLIDE 3
Searches:
- “landscapers in Lilburn, Ga”
- “homes sold in shadow lake subdivision gwinnett county georgia”
- “jarrett t. arnold”, “jack t. arnold”
4417749 = Thelma Arnold
years
widow and dog
Lilburn, GA
Press release:
“There was no personally identifiable data provided by AOL with those records, but search queries themselves can sometimes include such information.”
Heads had to roll…
CTO Maureen Govern (+2
fired
SLIDE 4
Personally identifiable information (PII):
“Information that can be used to uniquely identify, contact, or locate a single person or can be used with other sources to uniquely identify a single individual” (Wikipedia)
Name     Home  Work
Adam     A     EPFL
Barbara  B     EPFL
Carlos   A     UNIL
SLIDE 5
Adversary has:
Anonymized network = unlabeled graph
Side information: subgraph; statistics of certain nodes; noisy version of whole network; …
[Figure: anonymized social network next to side information with named nodes Adam, Barbara, Carlos]
SLIDE 6
Other applications:
in networks:
networks from different domains & time slots
viruses by function-call patterns
Computer vision: matching segment graphs for different viewing angles
021-693-1233 peter.muster@epfl.ch matching nodes by structure only
SLIDE 7 Fundamental feasibility w/o side information, but with ∞ time and memory
SLIDE 10
Is it fundamentally hard or easy to match similar graphs by structure?
Fundamental =
ignore computational & memory cost
in addition to second graph, no side information
want to match every vertex
SLIDE 11
Published 1959 by Erdős & Rényi
existence results
n asymptotics and phase transitions
subgraphs
component
number
group
$G(n, p(n))$
Threshold for asymmetry: $p = \log n / n$
SLIDE 12
Symmetric: |Aut(G)| = 12
Asymmetric: |Aut(G)| = 1
|Aut(G)| = size of automorphism group
SLIDE 13
Generator $G = G(n, p)$: “real” social ties
Observed graphs (phone calls, emails): each edge of G is sampled ($s$) or not sampled ($1 - s$)
$s$ measures similarity
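The sampling model on this slide can be sketched in a few lines. This is a minimal illustration, assuming each observed graph keeps every edge of the generator independently with probability s; the function name `sample_pair` and the parameter values are hypothetical.

```python
import random
from itertools import combinations

def sample_pair(n, p, s, seed=0):
    """Draw a generator G ~ G(n, p), then build two observed graphs
    by keeping each edge of G independently with probability s."""
    rng = random.Random(seed)
    g = {e for e in combinations(range(n), 2) if rng.random() < p}
    g1 = {e for e in g if rng.random() < s}   # e.g. phone calls
    g2 = {e for e in g if rng.random() < s}   # e.g. emails
    return g, g1, g2

g, g1, g2 = sample_pair(n=200, p=0.1, s=0.8)
# Fraction of observed edges that appear in both graphs: grows with s.
overlap = len(g1 & g2) / max(1, len(g1 | g2))
print(len(g), len(g1), len(g2), round(overlap, 2))
```

With s = 1 both observed graphs equal the generator; as s decreases the two edge sets drift apart, which is the sense in which s measures similarity.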
SLIDE 14
$\Delta(\pi_0) = 0$, $\Delta(\pi) = 2$
$n!$ possible mappings!
SLIDE 15
Adversary has infinite computational power
try all possible mappings π and compute edge mismatch function Δ(π)
Question:
Are there conditions on p, s such that $P(\text{unique } \min_\pi \Delta(\pi) \text{ at the identity}) \to 1$?
If yes: adversary would be able to match vertex sets only through the structure of the two networks!
Note:
statistically uniform, low clustering, degree distribution not skewed
harder than real networks
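The exhaustive search described on this slide can be sketched for tiny graphs. This is an illustration only, assuming Δ(π) counts the edges present in exactly one of π(G1) and G2; the names `mismatch` and `brute_force_match` are hypothetical, and the n! cost makes it usable only for very small n.

```python
from itertools import permutations

def mismatch(e1, e2, pi):
    """Delta(pi): number of edges present in exactly one of pi(G1) and G2."""
    mapped = {frozenset((pi[u], pi[v])) for u, v in e1}
    target = {frozenset(e) for e in e2}
    return len(mapped ^ target)

def brute_force_match(n, e1, e2):
    """Try all n! mappings and return one that minimizes Delta."""
    best = min(permutations(range(n)), key=lambda pi: mismatch(e1, e2, pi))
    return best, mismatch(e1, e2, best)

e1 = [(0, 1), (1, 2)]            # path 0 - 1 - 2
e2 = [(2, 0), (0, 1)]            # same path with relabeled vertices
pi, delta = brute_force_match(3, e1, e2)
print(pi, delta)
```

Because a path is symmetric, several mappings reach Δ = 0 here; the question on the slide is when the minimizer is unique and correct.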
SLIDE 16
Theorem:
For the G(n,p;s) matching problem, if
$nps \cdot \frac{s^2}{2-s} = 8 \log n + \omega(1)$
then the identity permutation minimizes Δ(·) a.a.s.
$nps$: E[degree] of G1,2
$\frac{s^2}{2-s}$: penalty for difference G1-G2
$8 \log n$: threshold for aut(G)=1
$\omega(1)$: “growing slowly”
Interpretation: two pieces of bad/good news
weak condition: degree growing faster than ~log n enough to break anonymity
Scaling with s only quadratic
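To get a feel for the numbers, the leading term of the condition can be evaluated directly. This is a rough sketch, assuming the condition has the form nps · s²/(2 − s) = 8 log n + ω(1) and ignoring the ω(1) slack; the helper name `min_sampled_degree` is hypothetical.

```python
from math import log

def min_sampled_degree(n, s):
    """Leading-order value of nps (mean degree of G1, G2) required by the
    assumed condition nps * s^2 / (2 - s) = 8 log n, ignoring omega(1)."""
    if not 0.0 < s <= 1.0:
        raise ValueError("s must lie in (0, 1]")
    penalty = s * s / (2.0 - s)   # penalty for the difference between G1 and G2
    return 8.0 * log(n) / penalty

for s in (1.0, 0.8, 0.5):
    print(s, round(min_sampled_degree(10**6, s), 1))
```

The required mean degree grows only logarithmically in n, while lowering s raises it through the penalty factor.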
SLIDE 17
Consider a particular map π
G1 G2
π є Π11
Vπ: set of mismatched nodes under π
Transposition invariant edge
SLIDE 18
Vπ: k nodes; remaining n − k nodes
[Figure: nodes 1, 2, 3, 4, 5, …, n and node pairs 12, 13, 14, 15, 23, 24, 25, 34, 35, 45, …, 1n, 2n]
Δ0: each edge contributes Bernoulli(2ps(1 − s)): sampling errors
Δπ: each pair of edges contributes Bernoulli(2ps(1 − ps)): matching errors
Eπ = V × Vπ: all the edges modified under π
SLIDE 19
$G(n, p; s, t)$ matching problem
SLIDE 20
Result:
Scaling with n still the same:
$nps = c(s, t) \log n + \omega(1)$
- Dependence on s and t less intuitive
- Interpretation:
mismatch does not help/hurt too much either
SLIDE 21 Phase transition, and an efficient & tractable matching algorithm…
SLIDE 22
INPUT: Seed map
known pairs
Propagate the map to “similar” neighbors
left and right
[A. Narayanan, V. Shmatikov, "De-anonymizing social networks", IEEE Symposium on Security and Privacy, 2009]
SLIDE 23
Similarity metric:
B A B A B A sim ) , (
SLIDE 24
Find max sim(u,v)
Continue until done…
…or blocked
SLIDE 25
How many seeds are needed?
Is there a phase transition?
How efficiently can we match?
Parameters?
[A. Narayanan, V. Shmatikov, "De-anonymizing social networks", IEEE Symposium on Security and Privacy, 2009]
SLIDE 27
G1, G2
If ≥ r matched neighbors: match
matching error
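The rule "match a pair once it has at least r matched neighbors" can be sketched as a credit-spreading loop. This is a minimal illustration under stated assumptions, not necessarily the exact procedure analyzed in the talk: every newly matched pair (i, j) gives one credit to each candidate pair (u, v) with u a neighbor of i in G1 and v a neighbor of j in G2, and a pair is matched as soon as it holds r credits. The function and variable names are hypothetical.

```python
from collections import defaultdict, deque

def percolation_match(adj1, adj2, seeds, r):
    """adj1, adj2: dict node -> set of neighbors; seeds: dict G1 node -> G2 node."""
    matched = dict(seeds)            # G1 node -> G2 node
    used = set(seeds.values())       # G2 nodes that are already matched
    credits = defaultdict(int)       # candidate pair -> matched neighbor pairs seen
    queue = deque(seeds.items())     # matched pairs that have not spread credits yet
    while queue:
        i, j = queue.popleft()
        for u in adj1[i]:
            if u in matched:
                continue
            for v in adj2[j]:
                if v in used:
                    continue
                credits[(u, v)] += 1
                if credits[(u, v)] >= r:
                    matched[u] = v
                    used.add(v)
                    queue.append((u, v))
                    break            # u is matched now; move on to the next u
    return matched

# Two identical copies of a small graph, two seeds, threshold r = 2.
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (2, 4), (3, 4)]
adj = defaultdict(set)
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)
print(percolation_match(adj, adj, {0: 0, 1: 1}, r=2))
```

If the queue empties before all nodes are matched, the process is blocked; a wrong pair that collects r credits yields a matching error.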
SLIDE 29
branching process (failure rate)
np < 1: consumption > production
np > 1: production > consumption
Extinction
P()=1 P()=0
SLIDE 30
Activation from r neighbors
[S. Janson, T. Łuczak, T. Turova, T. Vallier, "Bootstrap Percolation on the Random Graph G(n, p)", Annals of Applied Probability, 22(5), 2012]
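Bootstrap percolation with activation threshold r is easy to simulate on a small G(n, p). The sketch below is an illustration, assuming a uniformly random seed set and that a node becomes active once r of its neighbors are active; the function name and parameter values are hypothetical.

```python
import random
from itertools import combinations

def bootstrap_percolation(n, p, r, num_seeds, seed=0):
    """Return the final number of active nodes on one sample of G(n, p)."""
    rng = random.Random(seed)
    adj = [set() for _ in range(n)]
    for u, v in combinations(range(n), 2):
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    active = set(rng.sample(range(n), num_seeds))
    frontier = list(active)          # active nodes whose credits are not spread yet
    credits = [0] * n                # number of active neighbors seen so far
    while frontier:
        u = frontier.pop()
        for v in adj[u]:
            if v in active:
                continue
            credits[v] += 1
            if credits[v] >= r:      # activation from r neighbors
                active.add(v)
                frontier.append(v)
    return len(active)

for a in (5, 20, 80):
    print(a, bootstrap_percolation(n=2000, p=0.01, r=3, num_seeds=a))
```

Sweeping the seed count in such a simulation is one way to observe the sharp change between a process that dies out and one that activates almost every node.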
SLIDE 31
consumption > production | production > consumption
$a_c$, $t_c$
P()=1 P()=0
$np = \omega(1)$
SLIDE 32
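Theorem: phase transition in # seeds
For $n^{-1} \ll ps \ll s\,n^{-\frac{1}{2}-\frac{3}{2r}}$:
- $a/a_c \to \alpha < 1$: final map is $o(n)$ w.h.p.
- $a/a_c > \alpha > 1$: final map is $n - o(n)$ w.h.p.
Seed set size threshold:
$a_c = \left(1-\frac{1}{r}\right)\left(\frac{(r-1)!}{n\,(ps^2)^r}\right)^{1/(r-1)}$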
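The seed threshold can be evaluated numerically. The sketch below assumes the threshold has the bootstrap-percolation form of Janson et al. with p replaced by ps², that is a_c = (1 − 1/r) ((r − 1)! / (n (ps²)^r))^{1/(r−1)}; the (1 − 1/r) prefactor is taken from that reference, and the function name and parameter values are hypothetical.

```python
from math import factorial

def critical_seed_count(n, p, s, r):
    """Assumed threshold a_c = (1 - 1/r) * ((r-1)! / (n (p s^2)^r))^(1/(r-1))."""
    if r < 2:
        raise ValueError("r must be at least 2")
    t_c = (factorial(r - 1) / (n * (p * s * s) ** r)) ** (1.0 / (r - 1))
    return (1.0 - 1.0 / r) * t_c

for s in (1.0, 0.9, 0.7):
    print(s, round(critical_seed_count(n=10**6, p=1e-4, s=s, r=4), 1))
```

Under this formula the threshold is very sensitive to s, since s enters with exponent 2r/(r − 1).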
SLIDE 33
Bootstrap percolation in $G(n, p)$:
credits of node $i$ at time $t$: i.i.d. Binomials
Percolation graph matching in $G(n, p; s)$:
credits of pair $(i, j)$ at time $t$: dependent, different Binomials
As long as no matching error so far, increments at $t$:
$(i, i) \sim \mathrm{Ber}(ps^2)$, $(i, j) \sim \mathrm{Ber}((ps)^2)$
for $i, i', j$ all different:
$P((i, j){+}{+}) = (ps)^2$
$P((i, j){+}{+} \mid (i', j){+}{+}) = ps$
𝐻1 𝐻2 𝐻
SLIDE 34
Approach:
Find regime where $X$ = no bad pair $(i, j)$ gets enough credits ($r$) to be potentially matched
True for $ps \ll n^{-\frac{1}{2}-\frac{3}{2r}}$
Need to choose $r$ large enough (sparse graphs: $r \ge 4$; higher otherwise)
Under $X$, only need to focus on good pairs $(i, i)$
Analogy with bootstrap problem: does it percolate?
Need to have $n^{-1} \ll ps$
Need to have seed set size $a > a_c$ large enough
SLIDE 39 How to get started in practice
SLIDE 40
Question:
Can similar ideas inform algorithm design?
Wish list:
Cold start: how to match without seeds?
Sparse graphs: how to avoid blocking?
Error propagation: how to correct mismatches?
SLIDE 41
seed1 seed2
Fingerprint: (deg=4, dist(seed1)=1, dist(seed2)=3)
Fingerprint: (deg=1, dist(seed1)=4, dist(seed2)=2)
Fingerprint: (deg=3, dist(seed1)=3, dist(seed2)=1)
Fingerprint: (deg=3, dist(seed1)=1, dist(seed2)=3)
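Fingerprints of this kind (degree plus hop distance to each seed) can be computed with one breadth-first search per seed. This is a small sketch on a hypothetical toy graph; the function names are illustrative.

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def fingerprints(adj, seeds):
    """Map each node to (degree, dist to seed 1, dist to seed 2, ...).
    Unreachable seeds give None in the corresponding position."""
    dists = [bfs_distances(adj, s) for s in seeds]
    return {u: (len(adj[u]),) + tuple(d.get(u) for d in dists) for u in adj}

# Toy path graph 0 - 1 - 2 - 3 with the two end nodes as seeds.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
fp = fingerprints(adj, seeds=[0, 3])
print(fp)
```

Nodes with identical fingerprints, such as the two deg=3 nodes that differ only in their seed distances on the slide, are told apart only when enough seeds are available.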
SLIDE 42
?
Fingerprint: (deg=4, dist(seed1)=1, dist(seed2)=3)
Fingerprint: (deg=1, dist(seed1)=4, dist(seed2)=2)
Fingerprint: (deg=3, dist(seed1)=3, dist(seed2)=1)
Fingerprint: (deg=3, dist(seed1)=1, dist(seed2)=3)
Network sampling model: P(fp1, fp2 | matched correctly), P(fp1, fp2 | matched wrong)
Jointly MAP matching: Best bipartite matching π s.t. max P(all matched correctly | all fingerprints)
Single-pair posterior: P(matched correctly | fp1, fp2)
SLIDE 43
Phase 1: 2 candidates
Phase 2: 4 candidates
Phase 3: 8 candidates
1 distance anchor, 2 distance anchors
Problem: Mapping error leads to distance error in next phase
Solution: Prior (phase i + 1) = posterior (phase i)
SLIDE 45
Graph Matching:
as noisy graph isomorphism problem
How much information in network structure?
Information-theoretic:
is quite easy, benign growth of mean degree
no a-priori structure
Percolation Graph Matching from seeds
Phase transition in size of seed set
hard to control, tune, predict
works very well in practice; parsimonious (r)
seeds
framework & heuristics
idea: exploit known “couples” as references for new candidate pairs
SLIDE 46 CTW 2013 Collaborators: Daniel
Pedram Pedarsani, Lyudmila Yartseva