An Efficient Reconciliation Algorithm for Social Networks



slide-1
SLIDE 1

An Efficient reconciliation algorithm for social networks

Silvio Lattanzi (Google Research NY)
Joint work with: Nitish Korula (Google Research NY)
ICERM, Stochastic Graph Models

slide-2
SLIDE 2

Outline

Stochastic Graph Models, ICERM

  • Graph reconciliation: model and theoretical results
  • Experimental results: from theory to practice
  • Open problems and future directions

slide-3
SLIDE 3

Graph reconciliation


slide-6
SLIDE 6

Real world motivations

[Figure: intra-language network and inter-language network]

slide-9
SLIDE 9

Real world motivations

Can we use intra-language information to improve the inter-language graph?

slide-14
SLIDE 14

Graph reconciliation problem

Given two networks, identify as many users as possible across them. Applications:

  • social networks
  • ontology reconciliation

slide-18
SLIDE 18

Previous work

The reconciliation problem was introduced by Novak et al. Two main approaches:

  • ML on user profile features (name, location, image)
  • ML on neighborhood topology

Limitations:

slide-25
SLIDE 25

Previous work

Very rich literature in de-anonymization. Two relevant works:

  • Backstrom et al. propose an active and a passive attack
  • Narayanan and Shmatikov: successful de-anonymization attack

slide-28
SLIDE 28

Narayanan and Shmatikov experiment

Ground truth: 24000 matchings across the two social networks.

80 me-links.

They could re-identify 30.8% of the mappings.

slide-35
SLIDE 35

Narayanan and Shmatikov experiment

Algorithm: [figure]

Why? Is it necessary to have high-degree me-links?

slide-36
SLIDE 36

Abstraction

Input: two graphs and a set of trusted matches. We want to maximize the number of final matches.

slide-39
SLIDE 39

Is the problem tractable?

The problem is similar to graph isomorphism. It seems even harder, because we want to detect similar structure.

slide-42
SLIDE 42

Abstraction

Formalization of the problem:

  • Start from an underlying social network
  • Delete the edges independently to obtain two networks, with parameters $p_1$ and $p_2$
  • Initial matchings are given
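The formalization above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: $p_1$ and $p_2$ are read as the probabilities that an edge survives in the first and second copy (consistent with the expectations given later in the talk), each node is taken to be an initial match independently with probability $l$ (the symbol $l$ appears on later slides), and the function name and edge-list representation are choices made for this example.

```python
import random

def sample_copies(edges, nodes, p1, p2, l, seed=None):
    """Draw two noisy copies of an underlying graph plus a set of seed nodes.

    Assumptions (not stated explicitly on the slide): p1 and p2 are the
    probabilities that an edge is kept in copy 1 and copy 2, and l is the
    probability that a node is an initial (trusted) match.
    """
    rng = random.Random(seed)
    # Each copy keeps every underlying edge independently.
    edges1 = [e for e in edges if rng.random() < p1]
    edges2 = [e for e in edges if rng.random() < p2]
    # Seeds are nodes whose identity across the two copies is known up front.
    seeds = {v for v in nodes if rng.random() < l}
    return edges1, edges2, seeds
```

With `p1 = p2 = 1` both copies equal the underlying graph, and with `l = 1` every node is a seed, which makes these extreme settings handy sanity checks.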

slide-43
SLIDE 43

Questions

Having a constant fraction of me-links, can we reconcile the entire network?

If we have k me-links, what fraction of the network can we reconcile?

slide-45
SLIDE 45

Underlying social network

Without additional assumptions on the underlying network, the problem still seems very hard. We study two different models for social networks:

  • G(n,p)
  • Preferential attachment

slide-46
SLIDE 46

Our algorithm

Algorithm: Narayanan-Shmatikov + degree bucketing + acceptance threshold
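The slide names the three ingredients but gives no pseudocode, so the following is a hedged sketch of how they could fit together, not the authors' implementation. It assumes that the similarity score of a candidate pair is the number of already matched neighbor pairs they share, that buckets are powers of two processed from high degree to low, and that a pair is accepted only when its score reaches the threshold and is the unique best score for both endpoints. The names `reconcile`, `threshold` and `rounds` are hypothetical.

```python
from collections import defaultdict

def reconcile(adj1, adj2, seeds, threshold=2, rounds=2):
    """Expand a seed matching between two graphs.

    adj1, adj2: dict mapping every node to the set of its neighbours.
    seeds: dict mapping a node of the first graph to its match in the second.
    """
    links = dict(seeds)
    max_deg = max([len(nb) for nb in adj1.values()] +
                  [len(nb) for nb in adj2.values()] + [1])
    for _ in range(rounds):
        # Degree bucketing: consider high-degree nodes first, then lower buckets.
        for j in range(max_deg.bit_length() - 1, -1, -1):
            min_deg = 2 ** j
            matched2 = set(links.values())
            scores = defaultdict(int)
            # Each linked pair is a witness for every pair of its unmatched neighbours.
            for w1, w2 in links.items():
                for u in adj1[w1]:
                    if u in links or len(adj1[u]) < min_deg:
                        continue
                    for v in adj2[w2]:
                        if v not in matched2 and len(adj2[v]) >= min_deg:
                            scores[(u, v)] += 1
            best1, best2 = defaultdict(int), defaultdict(int)
            for (u, v), s in scores.items():
                best1[u] = max(best1[u], s)
                best2[v] = max(best2[v], s)
            ties1, ties2 = defaultdict(int), defaultdict(int)
            for (u, v), s in scores.items():
                ties1[u] += (s == best1[u])
                ties2[v] += (s == best2[v])
            # Acceptance threshold: enough witnesses and a unique best on both sides.
            for (u, v), s in scores.items():
                if (s >= threshold and s == best1[u] == best2[v]
                        and ties1[u] == 1 and ties2[v] == 1):
                    links[u] = v
    return links
```

Requiring a unique best score on both sides guarantees that the pairs accepted in one bucket never conflict with each other, at the cost of leaving tied candidates unmatched until more witnesses appear.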

slide-48
SLIDE 48

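G(n,p)

Does the technique work if the underlying graph is random?

[Figure: underlying graph with parameter $p$ and two copies with parameters $p_1$ and $p_2$]

For the same node $u$ in both copies:

$$E\big[|N_{G_1}(u) \cap N_{G_2}(u)|\big] = (n-1)\,p\,p_1 p_2$$

For two distinct nodes $u \neq v$:

$$E\big[|N_{G_1}(u) \cap N_{G_2}(v)|\big] = (n-2)\,p^2 p_1 p_2$$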
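To see how far apart the two expectations are, they can be evaluated for concrete numbers. The helper below is a small illustration; the function name and the sample values are chosen for this example and do not come from the talk.

```python
def expected_common_neighbours(n, p, p1, p2):
    """Return the two expectations from the slide as a pair (same, different)."""
    same = (n - 1) * p * p1 * p2            # one node seen in both copies
    different = (n - 2) * p ** 2 * p1 * p2  # two distinct nodes
    return same, different

same, different = expected_common_neighbours(n=1002, p=0.1, p1=0.5, p2=0.5)
print(same, different)
```

The ratio between the two values is $(n-1)/((n-2)p)$, roughly $1/p$, which is why a correct pair stands out when $p$ is bounded away from 1.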

slide-49
SLIDE 49

Concentration

We assume $\frac{c \log n}{n} \le p \le \frac{1}{6}$ and $l, p_1, p_2 \in O(1)$. Two cases:

  • $n p p_1 p_2 l \ge 24 \log n$: Chernoff bound is enough
  • $n p p_1 p_2 l \le 24 \log n$: we never make an error

$$x = (n-2)\,p^2 p_1 p_2$$

$$P\left[\sum_{i=1}^{n} B_i \le 2\right] = (1-x)^n + n x (1-x)^{n-1} + \binom{n}{2} x^2 (1-x)^{n-2} = 1 - n^3 x^3 - o(n^3 x^3)$$
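The three-term sum on this slide is the probability that a binomial variable with $n$ trials and success probability $x$ takes a value of at most 2. A short check in Python, with a function name chosen for this example, evaluates it exactly; its complement is the probability of three or more successes.

```python
from math import comb

def prob_at_most_two(n, x):
    """P[B_1 + ... + B_n <= 2] for independent Bernoulli(x) variables B_i."""
    return sum(comb(n, k) * x ** k * (1 - x) ** (n - k) for k in range(3))

# For small n*x the probability of three or more successes is tiny.
print(1 - prob_at_most_two(1000, 1e-5))
```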

slide-50
SLIDE 50

More realistic model

Preferential attachment:

  • $G^m_1$ is a single node with $m$ self-loops
  • $G^m_n$ is obtained by adding a node to $G^m_{n-1}$ and $m$ edges with probability proportional to the current degrees
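A minimal generator for this process is sketched below. It simplifies the model in one respect, stated here as an assumption: all $m$ targets of a new node are drawn proportionally to the degrees as they were before the node arrived, so a new node never links to itself. The function name and the edge-list output are choices made for this example.

```python
import random

def preferential_attachment(n, m, seed=None):
    """Return the edge list of an n-node preferential attachment graph."""
    rng = random.Random(seed)
    edges = [(0, 0)] * m   # first graph: a single node with m self-loops
    urn = [0] * (2 * m)    # node v appears deg(v) times in the urn
    for v in range(1, n):
        # Drawing uniformly from the urn picks a node proportionally to its degree.
        targets = [rng.choice(urn) for _ in range(m)]
        for t in targets:
            edges.append((v, t))
            urn.extend((v, t))
    return edges
```

Keeping one urn entry per edge endpoint makes each draw constant time, which is the usual trick for sampling proportionally to degree.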

slide-51
SLIDE 51

Preferential attachment

A bit harder:

  • There are several nodes of constant degree, so we need a cascade
  • The objective is to reconcile a constant fraction of the network

slide-54
SLIDE 54

Sketch of the proof

  • For high-degree nodes we can use concentration results.
  • Different nodes of intermediate degree do not share many neighbors.
  • High-degree nodes help to detect intermediate-degree nodes, which in turn help to detect small-degree nodes.

slide-57
SLIDE 57

PA structural lemmas

High degree nodes are early birds.

Nodes inserted after time $\phi n$, for constant $\phi$, have degree in $o(\log^2 n)$.

The rich get richer.

For nodes of degree greater than $\log^2 n$, a constant fraction of their neighbors has been inserted after time $\epsilon n$, for constant $\epsilon$.

First-mover advantage.

All nodes inserted before time $n^{0.3}$ have degree at least $\log^3 n$.

slide-61
SLIDE 61

High degree nodes are early birds

[Figure: insertion timeline from $G^m_1$ to $G^m_n$, marked at $\phi n$ and $\lambda n$]

Let $d_i$ be the degree at the beginning of a phase.

The probability that a node increases its degree is dominated by the probability of a head in a toss of a biased coin that gives head with probability $\frac{3 d_i}{\phi n}$.

slide-63
SLIDE 63

The rich get richer

[Figure: insertion timeline from $G^m_1$ to $G^m_n$, marked at $\epsilon n$]

If at time $\epsilon n$ the node has degree less than $\frac{1}{2} d$, we are done.

The probability that the node increases its degree is dominated by the probability of a head in a toss of a biased coin that gives head with probability $\frac{d}{2nm}$.

slide-64
SLIDE 64

First-mover advantage

From Cooper and Frieze's result on the cover time of PA graphs:

$$D_k = d_{nm}(v_1) + d_{nm}(v_2) + \cdots + d_{nm}(v_k)$$

$$\Pr\left( \left| D_k - 2\sqrt{2kn} \right| \ge 3\sqrt{mn \log mn} \right) \le (mn)^{-2}$$

$$\Pr\left( d_n(v_{k+1}) = d + 1 \mid D_k - 2k = s \right) \le \frac{s + d}{2N - 2k - s - d}$$

Playing a bit with the algebra, we get the final result.

slide-69
SLIDE 69

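Matching high degree nodes

[Figure: insertion timeline from $G^m_1$ to $G^m_n$, marked at $\epsilon n$]

For a correct pair (the same node $v$ in both networks):

$$E\big[|N_{G_1}(v) \cap N_{G_2}(v)|\big] = d(v)\,p_1 p_2 l$$

By Chernoff, w.h.p.

$$|N_{G_1}(v) \cap N_{G_2}(v)| \ge \tfrac{7}{8}\, d(v)\,p_1 p_2 l$$

A node has degree at most $\tilde{O}(\sqrt{n})$, so the probability of connecting to it is $o(1)$. For an incorrect pair:

$$|N_{G_1}(\ast) \cap N_{G_2}(\ast)| \le \left(\tfrac{2}{3} + \epsilon\right) d(v)\,p_1 p_2 l + o(d(v))$$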

slide-72
SLIDE 72

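Bound the mismatch score

[Figure: insertion timeline from $G^m_1$ to $G^m_n$, split into phases at $n^{0.3}$, $n^{\frac{4}{3} \cdot 0.3}$, $n^{(\frac{4}{3})^2 \cdot 0.3}$, $n^{(\frac{4}{3})^3 \cdot 0.3}$, with $n_a = n^{0.3}$ and $n_b = n^{\frac{4}{3} \cdot 0.3}$; a second set of marks at $n^{(\frac{3}{2} - \epsilon) \cdot 0.3}$, $n^{(\frac{3}{2} - \epsilon)^2 \cdot 0.3}$, $n^{(\frac{3}{2} - \epsilon)^3 \cdot 0.3}$]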

slide-73
SLIDE 73

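Bound the mismatch score

With $n_a = n^{0.3}$ and $n_b = n^{\frac{4}{3} \cdot 0.3}$, the probability that 3 nodes arriving between $n_a$ and $n_b$ all point to the same pair of nodes is bounded, over at most $n_b^2$ pairs, by

$$n_b^2 \sum_{i=n_a}^{n_b} \sum_{j=n_a}^{n_b} \sum_{k=n_a}^{n_b} \left( \frac{\log^3 n}{i-1} \right)^2 \left( \frac{\log^3 n}{j-1} \right)^2 \left( \frac{\log^3 n}{k-1} \right)^2 \approx n^{2b - 3a} \in o(1)$$

where $a$ and $b$ are the exponents of $n_a$ and $n_b$.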

slide-78
SLIDE 78

Cascade

[Figure: insertion timelines from $G^m_1$ to $G^m_n$, marked at $n^{0.3}$ and $n^{0.25}$, before and after one phase]

In each phase we fail to identify a small fraction of the nodes; in total we lose a small constant fraction.

slide-80
SLIDE 80

Results

Theorem 1

If the underlying network is a G(n,p) graph, it is possible to reconcile it completely.

Theorem 2

If the underlying network is a PA graph, it is possible to reconcile a large fraction of it.
slide-81
SLIDE 81

Experimental results


slide-82
SLIDE 82

Experiments

Experiments on different graphs:

slide-83
SLIDE 83

PA experiment

Are our theoretical results robust?

slide-84
SLIDE 84

Scalability

How does the algorithm scale with the size of the graph?

slide-86
SLIDE 86

Facebook experiment

How does the algorithm perform if the underlying graph is a social network?

80% recall. Can we explain it in theory?

slide-87
SLIDE 87

Facebook cascade experiment

What happens if we generate the underlying network using a cascade process?

We recover almost all of the graph in the intersection. Can we explain it in theory?

slide-88
SLIDE 88

Affiliation network model

What happens if we delete all the edges inside a subset of the communities?

More than 80% recall. Can we explain it in theory?

slide-91
SLIDE 91

Reconcile different graphs

DBLP: we generate two co-authorship graphs, one considering only publications in even years and the other only publications in odd years.

Gowalla: we generate two co-checkin graphs, one considering only checkins in even years and the other only checkins in odd years.

German/French Wikipedia: we crawl the inter-language links, use a few of them as seeds, and check how many links we can recover.
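The even/odd split used for DBLP and Gowalla can be sketched as follows. The record format, a list of `(year, participants)` tuples, and the function name are assumptions made for this illustration; the talk does not describe the data layout.

```python
from itertools import combinations

def split_by_year_parity(records):
    """Build two co-occurrence graphs from (year, participants) records.

    Returns (even_edges, odd_edges), each a set of sorted node pairs.
    The record format is hypothetical, chosen only for this sketch.
    """
    even, odd = set(), set()
    for year, people in records:
        target = even if year % 2 == 0 else odd
        # Every pair appearing together in one record becomes an edge.
        for a, b in combinations(sorted(set(people)), 2):
            target.add((a, b))
    return even, odd

records = [(2010, ["a", "b", "c"]), (2011, ["a", "b"])]
even_edges, odd_edges = split_by_year_parity(records)
```

Both graphs share the same node identities, so the true matching is known and can be used to score a reconciliation run.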

slide-92
SLIDE 92

Reconcile different graphs

Recall for Wikipedia: ~30%
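The slides report recall without defining it. One common reading, assumed here, is the fraction of ground-truth links outside the seed set that the algorithm recovers, with precision as the fraction of newly proposed links that are correct. The function below is a sketch under that assumption; its name and signature are hypothetical.

```python
def precision_recall(predicted, truth, seeds=()):
    """Score a matching against ground truth, ignoring seed nodes.

    predicted, truth: dict mapping a node of the first graph to a node of
    the second graph. seeds: nodes of the first graph given as input.
    """
    seeds = set(seeds)
    new = {u: v for u, v in predicted.items() if u not in seeds}
    target = {u: v for u, v in truth.items() if u not in seeds}
    correct = sum(1 for u, v in new.items() if u in target and target[u] == v)
    precision = correct / len(new) if new else 1.0
    recall = correct / len(target) if target else 1.0
    return precision, recall
```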

slide-93
SLIDE 93

Reconcile different graphs

We have really good performance for high-degree nodes.

slide-94
SLIDE 94

Open problems and future directions


slide-95
SLIDE 95

Extensions

  • Other models of underlying graphs
  • Other models of network generation
  • Adversarial underlying network, errors in seed links

slide-96
SLIDE 96

Limitation of the current model

A user's degree varies across different social networks.

How can we model this more general setting?

slide-97
SLIDE 97

Better algorithm

Currently we explore only the direct neighborhood.

Can we design better algorithms?

slide-98
SLIDE 98

Thanks!
