Matthias Grossglauser, EPFL CTW 2013



slide-1
SLIDE 1

Matthias Grossglauser, EPFL CTW 2013


slide-2
SLIDE 2

Columns: anonymized user ID, query, date and time

4417749  care packages  2006-03-02 09:19:32
4417749  movies for dogs  2006-03-02 09:24:14
4417749  blue book  2006-03-03 11:48:52
4417749  best dog for older owner  2006-03-06 11:48:24
4417749  best dog for older owner  2006-03-06 11:48:24
4417749  rescue of older dogs  2006-03-06 11:55:25
4417749  school supplies for the iraq children  2006-03-06 13:36:33
4417749  school supplies for the iraq children  2006-03-06 13:36:33
4417749  pine straw lilburn delivery  2006-03-06 18:35:02
4417749  pine straw delivery in gwinnett county  2006-03-06 18:36:35
4417749  landscapers in lilburn ga.  2006-03-06 18:37:26
4417749  pne straw in lilburn ga.  2006-03-06 18:38:19
4417749  pine straw in lilburn ga.  2006-03-06 18:38:27
4417749  gwinnett county yellow pages  2006-03-06 18:42:08
...

slide-3
SLIDE 3
  • Searches:
      • “landscapers in Lilburn, Ga”
      • “homes sold in shadow lake subdivision gwinnett county georgia”
      • “jarrett t. arnold”, “jack t. arnold”
  • 4417749 = Thelma Arnold
      • 62-year-old widow and dog owner
      • home: Lilburn, GA
  • AOL press release:
      • “There was no personally identifiable data provided by AOL with those records, but search queries themselves can sometimes include such information.”
  • Heads had to roll…
      • AOL CTO Maureen Govern (+2 others) fired

slide-4
SLIDE 4
  • Personally identifiable information (PII):
      • “information that can be used to uniquely identify, contact, or locate a single person or can be used with other sources to uniquely identify a single individual” (Wikipedia)

Name     Home  Work
Adam     A     EPFL
Barbara  B     EPFL
Carlos   A     UNIL

[Figure labels: A, B, EPFL, UNIL]

slide-5
SLIDE 5
  • Adversary has:
      • Anonymized network = unlabeled graph
      • Side information: subgraph; statistics on certain nodes; noisy version of whole network; …

[Figure: anonymized social network; side information (Adam, Barbara, Carlos)]

slide-6
SLIDE 6
  • Other applications:
      • Find overlap in networks:
          • Social networks from different domains & time slots
      • Identify viruses by function-call patterns
      • Computer vision: matching segment graphs for different viewing angles

[Figure: 021-693-1233 and peter.muster@epfl.ch; matching nodes by structure only]

slide-7
SLIDE 7

Fundamental feasibility w/o side information, but with ∞ time and memory


slide-8
SLIDE 8


slide-9
SLIDE 9


slide-10
SLIDE 10
  • Is it fundamentally hard or easy to match similar graphs by structure?
  • Fundamental =
      • Information-theoretic: ignore computational & memory cost
      • Hard: in addition to second graph, no other side information
      • Demanding: want to match every vertex

slide-11
SLIDE 11
  • First published 1959 by Erdős & Rényi
  • Focus on existence results
  • Large n asymptotics and phase transitions
      • Connectivity
      • Existence of subgraphs
      • Giant component
      • Chromatic number
      • Automorphism group

G(n, p(n))

Threshold for asymmetry: p = log n / n

slide-12
SLIDE 12

Symmetric: Aut(G) = 12
Asymmetric: Aut(G) = 1
Aut(G) = size of automorphism group

slide-13
SLIDE 13

Generator G = G(n,p): “real” social ties
Each edge: sampled (s), not sampled (1 − s)
Sampled graphs: phone calls, emails
s measures similarity
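The sampling model on this slide can be sketched in a few lines. This is a minimal illustration, assuming that each of the two observed graphs keeps every edge of the generator G(n,p) independently with probability s. The function name `sample_gnps` and the edge-set representation are choices made for this example, not part of the talk.

```python
import random
from itertools import combinations


def sample_gnps(n, p, s, seed=0):
    """Return (G, G1, G2) as sets of edges (u, v) with u < v.

    Assumption: G1 and G2 are independent edge samples of the same
    generator graph G = G(n, p), each edge kept with probability s.
    """
    rng = random.Random(seed)
    # Generator graph: every vertex pair is an edge with probability p.
    g = {e for e in combinations(range(n), 2) if rng.random() < p}
    # Two observed graphs: each keeps an edge of G with probability s.
    g1 = {e for e in g if rng.random() < s}
    g2 = {e for e in g if rng.random() < s}
    return g, g1, g2
```

In an actual de-anonymization setting the vertex labels of one of the two graphs would additionally be hidden (for example by a random permutation); the sketch keeps the labels aligned so that the identity mapping is the ground truth.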

slide-14
SLIDE 14

Δ(π0) = 0
Δ(π) = 2

n! possible mappings!

slide-15
SLIDE 15
  • Assumption:
      • Attacker has infinite computational power
      • Can try all possible mappings π and compute edge mismatch function Δ(π)
  • Question:
      • Are there conditions on p, s such that P(identity mapping π0 is the unique min of Δ(π)) → 1?
      • If yes: adversary would be able to match vertex sets only through the structure of the two networks!
  • Note:
      • G(n,p; s) model: statistically uniform, low clustering, degree distribution not skewed
      • → conjecture: harder than real networks
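To make the brute-force attacker concrete, here is a small sketch that tries all n! mappings and records which ones minimize the edge mismatch. It assumes Δ(π) counts the vertex pairs that are an edge in exactly one of π(G1) and G2; the function names are hypothetical, and the approach is only feasible for very small n.

```python
from itertools import permutations


def edge_mismatch(g1, g2, pi):
    """Δ(π): number of vertex pairs that are an edge in exactly one of π(G1), G2.

    g1, g2: sets of edges (u, v) with u < v; pi: tuple with pi[u] = image of u.
    """
    mapped = {tuple(sorted((pi[u], pi[v]))) for u, v in g1}
    return len(mapped ^ g2)  # symmetric difference = mismatched edges


def best_mappings(n, g1, g2):
    """Try all n! mappings; return (min Δ, list of all minimizing mappings)."""
    best, argmin = None, []
    for pi in permutations(range(n)):
        d = edge_mismatch(g1, g2, pi)
        if best is None or d < best:
            best, argmin = d, [pi]
        elif d == best:
            argmin.append(pi)
    return best, argmin
```

If the returned list contains a single mapping, the structure alone identifies every vertex; several minimizers mean the attacker cannot tell them apart.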

slide-16
SLIDE 16
  • Theorem:
      • For the G(n,p;s) matching problem, if $nps \cdot \frac{s^2}{2 - s} = 8 \log n + \omega(1)$, then the identity permutation minimizes Δ(.) a.a.s.
          • $nps$: E[degree] of G1, G2
          • $\frac{s^2}{2 - s}$: penalty for difference G1-G2
          • $8 \log n$: threshold for Aut(G) = 1
          • $\omega(1)$: “growing slowly”
  • Interpretation: two pieces of bad/good news
      • Surprisingly weak condition: degree growing faster than ~log n enough to break anonymity
      • Decrease with s only quadratic
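As a rough numerical reading of the theorem's condition, the sketch below solves nps · s²/(2 − s) ≥ 8 log n for the expected degree nps. It ignores the slowly growing ω(1) term and assumes the natural logarithm; the function name is made up for this example.

```python
import math


def min_expected_degree(n, s):
    """Expected degree nps at which nps * s^2 / (2 - s) reaches 8 log n.

    Assumptions: natural logarithm, and the additive ω(1) term is dropped,
    so this is an order-of-magnitude guide rather than a sharp bound.
    """
    if not 0 < s <= 1:
        raise ValueError("s must lie in (0, 1]")
    return 8 * math.log(n) * (2 - s) / s ** 2
```

For s = 1 (identical graphs) this reduces to 8 log n; halving s to 0.5 multiplies the required degree by (2 − 0.5)/0.5² = 6.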

slide-17
SLIDE 17
  • Fix a particular map π

[Figure: G1 and G2 under a map π ∈ Π11; Vπ: set of mismatched nodes under π; labels: transposition, invariant edge]

slide-18
SLIDE 18

[Figure: Vπ (k nodes) and the remaining n − k nodes; nodes labeled 1, 2, 3, 4, 5, …, n; vertex pairs 12, 13, 14, 15, 23, 24, 25, 34, 35, 45, 1n, 2n, …]

Δ0: each edge contributes Bernoulli(2ps(1 − s)): sampling errors
Δπ: each pair of edges contributes Bernoulli(2ps(1 − ps)): matching errors
Eπ = V × Vπ: all the edges modified under π

slide-19
SLIDE 19

G(n, p; s, t) matching problem

slide-20
SLIDE 20
  • Result:
      • Dependence on n still the same: $nps = c(s, t) \log n + \omega(1)$
      • Dependence on s and t less intuitive
  • Interpretation:
      • Node mismatch does not help/hurt too much either

slide-21
SLIDE 21

Phase transition, and an efficient & tractable matching algorithm…


slide-22
SLIDE 22

INPUT: Seed map of known pairs
Propagate the map to “similar” neighbors on left and right

[A. Narayanan, V. Shmatikov, "De-anonymizing social networks", IEEE Symp. on Security and Privacy, 2009]

slide-23
SLIDE 23

Similarity metric: sim(A, B)

[Equation: definition of sim(A, B) in terms of A and B; not recoverable from the extracted text]

slide-24
SLIDE 24

Find max sim(u,v)
Continue until done…
…or blocked

slide-25
SLIDE 25
  • How many seeds are needed?
  • Is there a phase transition?
  • How efficiently can we match?
  • Tuning parameters?

[A. Narayanan, V. Shmatikov, "De-anonymizing social networks", IEEE Symp. on Security and Privacy, 2009]

slide-26
SLIDE 26

[Figure: graphs G1 and G2]

slide-27
SLIDE 27

[Figure: graphs G1 and G2; label: matching error]

If ≥ r matched neighbors → match
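The matching rule above (match a candidate pair once it has at least r already-matched neighboring pairs) can be sketched as a simple credit-passing loop. This is an illustrative reading under stated assumptions, not the talk's exact procedure: graphs are given as adjacency dicts, each matched pair spreads one credit to every unmatched neighboring pair, and ties are resolved by processing order. The function name is hypothetical.

```python
from collections import defaultdict


def percolation_graph_matching(adj1, adj2, seeds, r):
    """Grow a map from seed pairs; adj1/adj2 map node -> set of neighbors.

    A pair (i, j) is matched as soon as it has collected r credits,
    i.e. r matched pairs (u, v) with i adjacent to u and j adjacent to v.
    Returns a dict from nodes of the first graph to nodes of the second.
    """
    matched1, matched2 = {}, {}
    credits = defaultdict(int)
    queue = []
    for u, v in seeds:
        matched1[u], matched2[v] = v, u
        queue.append((u, v))
    while queue:
        u, v = queue.pop()  # each matched pair spreads credits exactly once
        for i in adj1[u]:
            if i in matched1:
                continue
            for j in adj2[v]:
                if j in matched2:
                    continue
                credits[(i, j)] += 1
                if credits[(i, j)] >= r:
                    matched1[i], matched2[j] = j, i
                    queue.append((i, j))
                    break  # i is now matched; move on to the next neighbor
    return matched1
```

When the queue empties before all nodes are matched, the process is blocked: no remaining pair has reached r credits.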

slide-28
SLIDE 28

G(n, p)

slide-29
SLIDE 29

[Figure: extinction prob. of branching process (failure rate); labels P()=1 and P()=0]

np < 1: consumption > production
np > 1: production > consumption

slide-30
SLIDE 30

Activation from r neighbors

[S. Janson, T. Luczak, T. Turova, T. Vallier, Bootstrap Percolation on the Random Graph G(n, p), Annals Applied Prob., 22(5), 2012]
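Bootstrap percolation with activation threshold r is easy to simulate on a small G(n, p). The sketch below is a toy simulation written for this transcript, with hypothetical names and a uniformly random seed set as an assumption; it returns the final number of active nodes so the effect of the seed set size can be explored empirically.

```python
import random


def bootstrap_percolation(n, p, r, num_seeds, seed=0):
    """Simulate bootstrap percolation on G(n, p); return the final active count.

    A node becomes active once at least r of its neighbors are active.
    Assumption: the initial active set is num_seeds nodes chosen uniformly.
    """
    rng = random.Random(seed)
    adj = [set() for _ in range(n)]
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    active = set(rng.sample(range(n), num_seeds))
    credits = [0] * n           # number of active neighbors already "used"
    frontier = list(active)     # active nodes whose credits are not yet spread
    while frontier:
        u = frontier.pop()
        for v in adj[u]:
            if v in active:
                continue
            credits[v] += 1
            if credits[v] >= r:
                active.add(v)
                frontier.append(v)
    return len(active)
```

Running it for a range of `num_seeds` values at fixed n, p, r gives a quick empirical view of whether the process dies out or spreads to most of the graph.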

slide-31
SLIDE 31

[Figure: regions “consumption > production” and “production > consumption”, with a_c and t_c marked; labels P()=1 and P()=0]

np = ω(1)

slide-32
SLIDE 32
  • Theorem: phase transition in # seeds
      • For $n^{-1} \ll ps \ll s\,n^{-\frac{1}{2} - \frac{3}{2r}}$:
          • If $a / a_c \to \alpha < 1$, final map is $o(n)$ w.h.p.
          • If $a / a_c > \alpha > 1$, final map is $n - o(n)$ w.h.p.
  • Seed set size threshold:
      • $a_c = (1 - r^{-1})\, t_c$
      • $t_c = \left( \frac{(r-1)!}{n\,(ps^2)^r} \right)^{1/(r-1)}$
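The threshold formulas can be evaluated directly. The helper below is a plain transcription of a_c = (1 − 1/r) t_c and t_c = ((r − 1)! / (n (ps²)^r))^(1/(r−1)) as read from this slide; the function name is hypothetical, and the result is only meaningful in the regime of p and s stated in the theorem.

```python
import math


def seed_thresholds(n, p, s, r):
    """Return (a_c, t_c) for the seed set size threshold.

    t_c = ((r - 1)! / (n * (p * s^2)^r))^(1 / (r - 1))
    a_c = (1 - 1/r) * t_c
    """
    if r < 2:
        raise ValueError("r must be at least 2")
    t_c = (math.factorial(r - 1) / (n * (p * s ** 2) ** r)) ** (1 / (r - 1))
    a_c = (1 - 1 / r) * t_c
    return a_c, t_c
```

Since t_c scales as (ps²)^(−r/(r−1)), a modest drop in the edge overlap s raises the number of seeds needed quite sharply.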

slide-33
SLIDE 33
  • Bootstrap percolation in G(n, p):
      • # credits of node $i$ at time $t$: i.i.d. Binomials
  • Percolation graph matching in G(n, p; s)
      • # credits of pair $(i, j)$ at time $t$: dependent, different Binomials
  • As long as no matching error so far, increments at $t$
      • Different: $(i, i) \sim \mathrm{Ber}(ps^2)$, $(i, j) \sim \mathrm{Ber}((ps)^2)$
      • Dependent: for $i, i', j$ all different:
          • $P\big((i, j){+}{+}\big) = (ps)^2$
          • $P\big((i, j){+}{+} \mid (i', j){+}{+}\big) = ps$

[Figure: G1, G2, G]

slide-34
SLIDE 34
  • Approach:
      • Focus on regime where $X$ = no bad pair $(i, j)$ gets enough credits ($r$) to be potentially matched
          • True for $ps \ll n^{-\frac{1}{2} - \frac{3}{2r}}$
          • Need to choose $r$ large enough (sparse graphs: $r \ge 4$, otherwise higher)
      • Conditional on $X$, only need to focus on good pairs $(i, i)$
          • Equivalence with bootstrap problem → does it percolate?
          • Need to have $n^{-1} \ll ps$
          • Need to have seed set size $a > a_c$ large enough

slide-35
SLIDE 35


slide-36
SLIDE 36


slide-37
SLIDE 37


slide-38
SLIDE 38


slide-39
SLIDE 39

How to get started in practice


slide-40
SLIDE 40
  • Question:
      • Can similar idea inform algorithm design?
  • Wishlist:
      • Cold-start: how to match without seeds?
      • Sparse graphs: how to avoid blocking?
      • Error propagation: how to correct mismatches?

slide-41
SLIDE 41

[Figure: graph with two seed nodes, seed1 and seed2]

Fingerprint: (deg=4, dist(seed1)=1, dist(seed2)=3)
Fingerprint: (deg=1, dist(seed1)=4, dist(seed2)=2)
Fingerprint: (deg=3, dist(seed1)=3, dist(seed2)=1)
Fingerprint: (deg=3, dist(seed1)=1, dist(seed2)=3)
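Fingerprints of the form (deg, dist(seed1), dist(seed2)) can be computed with one breadth-first search per seed. The following sketch assumes an unweighted, undirected graph stored as an adjacency dict; the function names are chosen for this example, and unreachable nodes get `None` as their distance.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def fingerprints(adj, seeds):
    """Map each non-seed node to (degree, dist to seed 1, dist to seed 2, ...)."""
    dists = [bfs_distances(adj, seed) for seed in seeds]
    return {
        u: (len(adj[u]),) + tuple(d.get(u) for d in dists)
        for u in adj
        if u not in seeds
    }
```

Nodes in the two networks whose fingerprints agree, or nearly agree, are the natural candidates for a match.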

slide-42
SLIDE 42

Fingerprint: (deg=4, dist(seed1)=1, dist(seed2)=3)
Fingerprint: (deg=1, dist(seed1)=4, dist(seed2)=2)
Fingerprint: (deg=3, dist(seed1)=3, dist(seed2)=1)
Fingerprint: (deg=3, dist(seed1)=1, dist(seed2)=3)

Network sampling model: P(fp1, fp2 | matched correctly), P(fp1, fp2 | matched wrong)
Jointly MAP matching: best bipartite matching π s.t. max P(all matched correctly | all fingerprints)
Single-pair posterior: P(matched correctly | fp1, fp2)

slide-43
SLIDE 43

Phase 1: 2 candidates
Phase 2: 4 candidates
Phase 3: 8 candidates

[Figure labels: 1 distance anchor; 2 distance anchors]

Problem: Mapping error → distance error in next phase
Solution: Prior (phase i + 1) = posterior (phase i)
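The rule "prior (phase i + 1) = posterior (phase i)" is ordinary Bayesian updating for a single candidate pair. The sketch below applies Bayes' rule to the two likelihoods P(fp1, fp2 | matched correctly) and P(fp1, fp2 | matched wrong); the function name and the numeric values in the usage note are illustrative assumptions, not figures from the talk.

```python
def match_posterior(lik_correct, lik_wrong, prior):
    """P(matched correctly | fp1, fp2) by Bayes' rule.

    lik_correct: P(fp1, fp2 | matched correctly)
    lik_wrong:   P(fp1, fp2 | matched wrong)
    prior:       P(matched correctly) before seeing the fingerprints
    """
    numerator = lik_correct * prior
    evidence = numerator + lik_wrong * (1.0 - prior)
    if evidence == 0.0:
        return 0.0  # fingerprints impossible under both hypotheses
    return numerator / evidence


def chain_phases(likelihoods, prior):
    """Feed each phase's posterior in as the next phase's prior."""
    for lik_correct, lik_wrong in likelihoods:
        prior = match_posterior(lik_correct, lik_wrong, prior)
    return prior
```

For instance, with a prior of 0.5 and likelihoods 0.8 versus 0.2, one phase gives a posterior of 0.8, and a second phase with the same likelihoods raises it to 16/17.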

slide-44
SLIDE 44


slide-45
SLIDE 45
  • Graph Matching:
      • Model as noisy graph isomorphism problem
      • How much information in network structure?
  • Information-theoretic:
      • Matching is quite easy, benign growth of mean degree
      • G(n,p; s) model: no a-priori structure
  • Percolation Graph Matching from seeds
      • Phase transition in size of seed set → hard to control, tune, predict
      • Actually works very well in practice; parsimonious (r)
  • Finding seeds
      • Bayesian framework & heuristics
      • Key idea: exploit known “couples” as references for new candidate pairs

slide-46
SLIDE 46

CTW 2013

Collaborators: Daniel R. Figueiredo, Pedram Pedarsani, Lyudmila Yartseva