Eight Friends are Enough: Social Graph Approximation via Public Listings



slide-1
SLIDE 1

Eight Friends are Enough: Social Graph Approximation via Public Listings

Joseph Bonneau, Jonathan Anderson, Ross Anderson, Frank Stajano University of Cambridge Computer Laboratory

slide-2
SLIDE 2

Facebook Features & Privacy Backlashes

  • News Feed (Sep 2006)
  • Beacon (Nov 2007)
  • “New Facebook” (Sep 2008)
  • Terms of Use (Feb 2009)
  • New Product Pages (Mar 2009)
slide-3
SLIDE 3

A Quietly Introduced Feature...

Public Search Listings, Sep 2007

slide-4
SLIDE 4

Public Search Listings

  • Unprotected against crawling
  • Indexed by search engines
  • Opt out—but most users don't know it exists!
slide-5
SLIDE 5

Utility

Entity Resolution

slide-6
SLIDE 6

Utility

Promotion via Network Effects

slide-7
SLIDE 7

Legal Status

“Your name, network names, and profile picture thumbnail will be available in search results across the Facebook network and those limited pieces of information may be made available to third party search engines. This is primarily so your friends can find you and send a friend request.”

  • Facebook Privacy Policy
slide-8
SLIDE 8

Legal Status

Much More Info Now Included...

slide-9
SLIDE 9

Legal Status

Public Group Pages Recently Added

slide-10
SLIDE 10

Obvious Attack

  • Initially returned a new friend set on each refresh
  • Can find all n friends in O(n·log n) queries
  • The Coupon Collector's Problem
  • For 100 friends, need about 65 page refreshes
  • As of Jan 2009, friends fixed per IP address
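
The figure of 65 refreshes can be checked with the coupon collector's expectation n·H_n. The sketch below assumes each listing shows 8 friends (suggested by the talk's title, not stated on this slide) and treats every displayed friend as an independent uniform draw. That is a simplification, since one page shows distinct friends. The function names are illustrative.

```python
from fractions import Fraction


def expected_draws(n):
    """Expected number of single uniform draws needed to see all n items."""
    harmonic = sum(Fraction(1, i) for i in range(1, n + 1))  # H_n, exact
    return float(n * harmonic)


def expected_refreshes(n_friends, per_page=8):
    """Approximate page loads needed when each load reveals per_page friends."""
    return expected_draws(n_friends) / per_page


draws_100 = expected_draws(100)          # about 518.7 single draws
refreshes_100 = expected_refreshes(100)  # about 64.8 page loads
```

Under these assumptions the estimate rounds to the 65 refreshes quoted on the slide.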
slide-11
SLIDE 11

Fun with Tor

UK Germany USA Australia

slide-12
SLIDE 12

Attack Scenario

  • Spider all public listings
  • Our experiments crawled 250 k users daily
  • Implies ~800 CPU-days to recover all users
  • Compute functions on sampled graph
slide-13
SLIDE 13

Abstraction

  • Take a graph G = <V,E>
  • Randomly select k out-edges from each node
  • Result is a sampled graph G_k = <V, E_k>
  • Try to approximate f(G) ≈ f_approx(G_k)
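
A minimal sketch of the sampling step, assuming the graph is held as a dictionary mapping each node to the set of its neighbours. The function name, the `seed` parameter and the toy graph are illustrative and do not come from the talk.

```python
import random


def sample_out_edges(graph, k, seed=None):
    """Keep at most k randomly chosen out-edges per node.

    graph maps each node to the set of its neighbours (the full edge set E).
    The result maps each node to the out-edges that survive sampling (E_k).
    """
    rng = random.Random(seed)
    sampled = {}
    for node, neighbours in graph.items():
        ordered = sorted(neighbours)  # fixed order so a given seed is reproducible
        if len(ordered) <= k:
            sampled[node] = set(ordered)  # low-degree nodes reveal every edge
        else:
            sampled[node] = set(rng.sample(ordered, k))
    return sampled


# Hypothetical toy graph, not data from the talk.
toy = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2, 4}, 4: {1, 3}}
toy_k = sample_out_edges(toy, k=2, seed=0)
```

The sampled edges are directed: an edge u → v in the result means v appeared in u's listing, which says nothing about whether u appeared in v's.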
slide-14
SLIDE 14

Approximable Functions

  • Node Degree
  • Dominating Set
  • Betweenness Centrality
  • Path Length
  • Community Structure

slide-15
SLIDE 15

Experimental Data

  • Crawled networks for Stanford, Harvard universities
  • Representative sub-networks

            # Users   Mean d   Median d
Stanford      15043      125         90
Harvard       18273      116         76

slide-16
SLIDE 16

Stanford Histogram

slide-17
SLIDE 17

Harvard Histogram

slide-18
SLIDE 18

Comparison

Networks have very similar structure

[Plot legend: Stanford, Harvard]

slide-19
SLIDE 19

Stanford Log-Log plot

slide-20
SLIDE 20

Harvard Log-Log plot

slide-21
SLIDE 21

Back To Our Abstraction

  • Take a graph G = <V,E>
  • Randomly select k out-edges from each node
  • Result is a sampled graph G_k = <V, E_k>
  • Try to approximate f(G) ≈ f_approx(G_k)
slide-22
SLIDE 22

Estimating Degrees

  • Convert sampled graph into a directed graph
      • Edges originate at the node where they were seen
  • Learn exact degree for nodes with degree < k
      • These nodes show fewer than k out-edges
  • Get random sample for nodes with degree ≥ k
      • Many have more than k in-edges
slide-23
SLIDE 23

Estimating Degrees

3 3 3 4 4 2 1 2 6

Average Degree: 3.5

slide-24
SLIDE 24

Estimating Degrees

3 3 3 4 4 2 1 2 6

Sampled with k=2

slide-25
SLIDE 25

Estimating Degrees

? ? ? ? ? ? 1 ? ?

Degree known exactly for one node

slide-26
SLIDE 26

Estimating Degrees

3.5 3.5 1.75 3.5 5.25 1.75 1 1.75 7

Naïve approach: Multiply in-degree by average degree / k

slide-27
SLIDE 27

Estimating Degrees

3.5 3.5 2 3.5 5.25 2 1 2 7

Raise estimates which are less than k
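
The naïve estimate and the floor at k from the last two slides can be sketched as below. This assumes the sampled graph is a dictionary of out-edge sets in which every node appears as a key, and that the average degree is supplied by the caller. The function name and the toy data are illustrative.

```python
def naive_degree_estimates(sampled, k, avg_degree):
    """First-pass degree estimates from a k-sampled directed graph.

    sampled maps each node to the set of out-edges seen in its listing.
    """
    in_degree = {v: 0 for v in sampled}
    for outs in sampled.values():
        for v in outs:
            in_degree[v] += 1

    estimates = {}
    for v, outs in sampled.items():
        if len(outs) < k:
            # Fewer than k out-edges: the listing showed every friend.
            estimates[v] = float(len(outs))
        else:
            # Scale in-degree by avg_degree / k, but never go below k,
            # because a node showing k out-edges has at least k friends.
            estimates[v] = max(float(k), in_degree[v] * avg_degree / k)
    return estimates


# Hypothetical sampled graph with k = 2 (not the graph drawn on the slides).
toy_sampled = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"c"},
    "e": {"a", "c"},
}
toy_estimates = naive_degree_estimates(toy_sampled, k=2, avg_degree=3.5)
```

With k = 2 and an average degree of 3.5, in-degrees of 2, 3 and 4 map to 3.5, 5.25 and 7, the same values that appear on the slides.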

slide-28
SLIDE 28

Estimating Degrees

3.5 3.5 2 3.5 5.25 2 1 2 7

Nodes with high-degree neighbors underestimated

slide-29
SLIDE 29

Estimating Degrees

3.5 3.5 3.5 3.5 5.25 2 1 2 7

Iteratively scale by current estimate / k in each step

slide-30
SLIDE 30

Estimating Degrees

2.75 2.75 3.5 3.63 5.5 2 1 2 5.5

After 1 iteration

slide-31
SLIDE 31

Estimating Degrees

2.68 2.68 3.41 3.53 5.35 2 1 2 5.35

Normalise to estimated total degree

slide-32
SLIDE 32

Estimating Degrees

2.48 2.83 3.04 3.64 5.09 2 1 2 5.91

Convergence after n > 10 iterations
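
The slides do not spell out the update rule, so the following is only one plausible reading of the iterative step. Each observed in-edge from u counts for (current estimate of u) / k, exactly known degrees stay fixed, estimates are floored at k, and the remaining estimates are rescaled so the total matches a target total degree. The function name, the `total_degree` parameter and the toy data are assumptions made for illustration.

```python
def refine_degree_estimates(sampled, k, estimates, total_degree, rounds=10):
    """Iteratively re-weight in-edges by the current estimate of their source."""
    exact = {v for v, outs in sampled.items() if len(outs) < k}
    in_edges = {v: [] for v in sampled}
    for u, outs in sampled.items():
        for v in outs:
            in_edges[v].append(u)

    est = dict(estimates)
    for _ in range(rounds):
        new = {}
        for v in sampled:
            if v in exact:
                new[v] = float(len(sampled[v]))  # degree known exactly
            else:
                weighted = sum(est[u] / k for u in in_edges[v])
                new[v] = max(float(k), weighted)  # at least k friends were seen
        # Rescale only the estimates that are neither exact nor at the floor.
        pinned = {v for v in new if v in exact or new[v] <= k}
        free_sum = sum(new[v] for v in new if v not in pinned)
        budget = total_degree - sum(new[v] for v in pinned)
        if free_sum > 0 and budget > 0:
            scale = budget / free_sum
            for v in new:
                if v not in pinned:
                    new[v] = max(float(k), new[v] * scale)
        est = new
    return est


# Hypothetical sampled graph (k = 2) and first-pass estimates.
toy_sampled = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"c"},
    "e": {"a", "c"},
}
start = {"a": 5.25, "b": 3.5, "c": 7.0, "d": 1.0, "e": 2.0}
refined = refine_degree_estimates(toy_sampled, 2, start, total_degree=17.5)
```

Here `total_degree` is taken as the assumed average degree times the number of nodes (3.5 × 5).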

slide-33
SLIDE 33

Estimating Degrees

  • Converges fast, typically after 10 iterations
  • Absolute error is high: 38% on average
  • Reduced to 23% for nodes with d ≥ 50
  • Can still accurately pick out high-degree nodes
slide-34
SLIDE 34

Aggregate of x highest-degree nodes

slide-35
SLIDE 35

Comparison of sampling parameters

slide-36
SLIDE 36

Dominating Sets

  • Set of nodes D ⊆ V such that

D ∪ Neighbours(D) = V

  • Set allows viewing the entire network
  • Also useful for marketing, trend-setting
slide-37
SLIDE 37

Dominating Sets

3 3 4 4 4 5 3 2 3 1

Trivial Algorithm: Select High-Degree Nodes in Order

slide-38
SLIDE 38

Dominating Sets

3 3 4 4 4 5 3 2 3 1

In fact, finding a minimum dominating set is NP-hard

slide-39
SLIDE 39

Dominating Sets

4 4 5 5 5 6 4 3 4 2

Greedy Algorithm: select for maximal coverage

slide-40
SLIDE 40

Dominating Sets

1 1 2 4 3 2

Greedy Algorithm: select for maximal coverage
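
A minimal sketch of the greedy heuristic: repeatedly pick the node whose closed neighbourhood covers the most still-uncovered nodes. The talk gives no pseudocode, so the function names and the tie-breaking rule (smallest label wins) are assumptions. The edge list is read off the distance-1 entries of the matrix on the first Floyd-Warshall slide.

```python
# Edges of the ten-node example graph (read from the Floyd-Warshall slides).
EDGES = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 5), (3, 4), (3, 5), (3, 6),
         (4, 6), (5, 6), (5, 7), (6, 7), (7, 8), (7, 10), (8, 9), (8, 10)]


def build_graph(edges):
    """Undirected adjacency sets from an edge list."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def greedy_dominating_set(graph):
    """Pick nodes by maximal marginal coverage until every node is covered."""
    uncovered = set(graph)
    chosen = []
    while uncovered:
        # Coverage of v = uncovered nodes among v and its neighbours.
        best = max(sorted(graph),
                   key=lambda v: len(({v} | graph[v]) & uncovered))
        chosen.append(best)
        uncovered -= {best} | graph[best]
    return chosen


def is_dominating(graph, nodes):
    """Check that D together with Neighbours(D) equals V."""
    covered = set(nodes)
    for v in nodes:
        covered |= graph[v]
    return covered == set(graph)


example = build_graph(EDGES)
dominators = greedy_dominating_set(example)
```

On this graph the heuristic first takes node 3 (which covers six nodes) and then node 8, so two nodes suffice.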

slide-41
SLIDE 41

Dominating Sets

Shown to perform adequately in practice

slide-42
SLIDE 42

Works Well on Sampled Graph

slide-43
SLIDE 43

Insensitive to Sampling Parameter!

Surprising: Even k = 1 performs quite well

slide-44
SLIDE 44

Shortest Paths

  • Social networks shown to be “small world”
  • Short paths should exist, even for large graphs
  • Short paths can be used for social engineering
slide-45
SLIDE 45

Floyd-Warshall Algorithm

  • Finds shortest distance between all pairs of nodes
  • Dynamic programming: O(|V|³) time over |V|² node pairs
  • Think Dijkstra, but for all vertices
slide-46
SLIDE 46

Floyd-Warshall Algorithm

[Figure: example graph on nodes 1 to 10]

        1   2   3   4   5   6   7   8   9  10
    1   -   1   1   1   ∞   ∞   ∞   ∞   ∞   ∞
    2   1   -   1   ∞   1   ∞   ∞   ∞   ∞   ∞
    3   1   1   -   1   1   1   ∞   ∞   ∞   ∞
    4   1   ∞   1   -   ∞   1   ∞   ∞   ∞   ∞
    5   ∞   1   1   ∞   -   1   1   ∞   ∞   ∞
    6   ∞   ∞   1   1   1   -   1   ∞   ∞   ∞
    7   ∞   ∞   ∞   ∞   1   1   -   1   ∞   1
    8   ∞   ∞   ∞   ∞   ∞   ∞   1   -   1   1
    9   ∞   ∞   ∞   ∞   ∞   ∞   ∞   1   -   ∞
   10   ∞   ∞   ∞   ∞   ∞   ∞   1   1   ∞   -

slide-47
SLIDE 47

Floyd-Warshall Algorithm

[Figure: example graph on nodes 1 to 10]

        1   2   3   4   5   6   7   8   9  10
    1   -   1   1   1   2   2   ∞   ∞   ∞   ∞
    2   1   -   1   2   1   2   2   ∞   ∞   ∞
    3   1   1   -   1   1   1   2   ∞   ∞   ∞
    4   1   2   1   -   2   1   2   ∞   ∞   ∞
    5   2   1   1   2   -   1   1   2   ∞   2
    6   2   2   1   1   1   -   1   2   ∞   2
    7   ∞   2   2   2   1   1   -   1   2   1
    8   ∞   ∞   ∞   ∞   2   2   1   -   1   1
    9   ∞   ∞   ∞   ∞   ∞   ∞   2   1   -   2
   10   ∞   ∞   ∞   ∞   2   2   1   1   2   -

slide-48
SLIDE 48

Floyd-Warshall Algorithm

[Figure: example graph on nodes 1 to 10]

        1   2   3   4   5   6   7   8   9  10
    1   -   1   1   1   2   2   3   4   5   4
    2   1   -   1   2   1   2   2   3   4   3
    3   1   1   -   1   1   1   2   3   4   3
    4   1   2   1   -   2   1   2   3   4   3
    5   2   1   1   2   -   1   1   2   3   2
    6   2   2   1   1   1   -   1   2   3   2
    7   3   2   2   2   1   1   -   1   2   1
    8   4   3   3   3   2   2   1   -   1   1
    9   5   4   4   4   3   3   2   1   -   2
   10   4   3   3   3   2   2   1   1   2   -
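
The matrices on these slides can be reproduced with a short Floyd-Warshall sketch. It assumes unit edge weights and takes its edge list from the distance-1 entries of the first matrix. The function name is illustrative, and the diagonal is set to 0 here although the slides leave it blank.

```python
import math

# Edges of the ten-node example graph (distance-1 entries of the first matrix).
EDGES = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 5), (3, 4), (3, 5), (3, 6),
         (4, 6), (5, 6), (5, 7), (6, 7), (7, 8), (7, 10), (8, 9), (8, 10)]


def floyd_warshall(nodes, edges):
    """All-pairs shortest path lengths for an undirected, unweighted graph."""
    dist = {u: {v: (0 if u == v else math.inf) for v in nodes} for u in nodes}
    for u, v in edges:
        dist[u][v] = dist[v][u] = 1
    # Allow each node in turn to act as an intermediate hop.
    for mid in nodes:
        for u in nodes:
            for v in nodes:
                via = dist[u][mid] + dist[mid][v]
                if via < dist[u][v]:
                    dist[u][v] = via
    return dist


dist = floyd_warshall(range(1, 11), EDGES)
```

The result agrees with the final matrix: for example, node 1 is 5 hops from node 9 and node 2 is 3 hops from node 8.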

slide-49
SLIDE 49

Short Paths Still Exist in Sampled Graph

slide-50
SLIDE 50

Centrality

  • A measure of a node's importance
  • Betweenness centrality:

C_B(v) = \sum_{s \neq v \neq t \in V} \sigma_{st}(v) / \sigma_{st}

  • Measures the share of shortest paths in the graph that a particular vertex is part of
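
A brute-force sketch of this sum, using the standard reading in which σ_st is the number of shortest paths from s to t and σ_st(v) is the number of those that pass through v. It counts ordered pairs (s, t), so every undirected pair contributes twice. That convention is an assumption, as is the function naming. The edge list is the ten-node example graph from the Floyd-Warshall slides.

```python
from collections import deque

EDGES = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 5), (3, 4), (3, 5), (3, 6),
         (4, 6), (5, 6), (5, 7), (6, 7), (7, 8), (7, 10), (8, 9), (8, 10)]


def build_graph(edges):
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def bfs_counts(graph, source):
    """Distances from source and the number of shortest paths to each node."""
    dist, sigma = {source: 0}, {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in graph[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:
                sigma[w] += sigma[u]  # every shortest path to u extends to w
    return dist, sigma


def betweenness(graph):
    """C_B(v) summed over ordered pairs (s, t) with s, t and v all distinct."""
    info = {s: bfs_counts(graph, s) for s in graph}
    score = {v: 0.0 for v in graph}
    for s in graph:
        dist_s, sigma_s = info[s]
        for t in dist_s:
            if t == s:
                continue
            for v in dist_s:
                if v == s or v == t:
                    continue
                dist_v, sigma_v = info[v]
                # v lies on a shortest s-t path iff the distances add up.
                if t in dist_v and dist_s[v] + dist_v[t] == dist_s[t]:
                    score[v] += sigma_s[v] * sigma_v[t] / sigma_s[t]
    return score


centrality = betweenness(build_graph(EDGES))
```

This runs in roughly cubic time, which is fine for a toy graph. Brandes' algorithm is the usual choice for larger networks.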

slide-51
SLIDE 51

Centrality

1 4 5 6 7 3 8 10 2 9

CBv7=?

slide-52
SLIDE 52

1 4 5 6 7 3 8 10 2 9

CBv7= 0 1

Centrality

slide-53
SLIDE 53

Centrality

1 4 5 6 7 3 8 10 2 9

CBv7= 0 10 2

slide-54
SLIDE 54

1 4 5 6 7 3 8 10 2 9

CBv7= 0 10 24 4 

Centrality

slide-55
SLIDE 55

Message Interception Scenario

  • Messages sent via shortest (least-cost) paths
  • Adversary can compromise x nodes
  • How much traffic can s/he intercept?

pinterceptvs,vd=CBv ∣V∣

2

slide-56
SLIDE 56

Message Interception

slide-57
SLIDE 57

Community Detection

  • Goal: Find highly-connected sub-groups
  • Measure success by high modularity:
      • Fraction of intra-community edges, compared with the fraction expected at random
      • Normalised to be between -1 and 1
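
A sketch of the modularity score under the usual Newman-Girvan definition, which the slide does not spell out and is therefore assumed here. For each community, take the fraction of edges that fall inside it and subtract the squared fraction of edge endpoints attached to it. The function name and the toy graph are illustrative.

```python
def modularity(edges, community):
    """Modularity Q of a partition of an undirected, unweighted graph.

    edges is a list of (u, v) pairs; community maps each node to a label.
    """
    m = len(edges)
    intra = {}       # edges with both endpoints in the community
    endpoints = {}   # edge endpoints attached to the community
    for u, v in edges:
        cu, cv = community[u], community[v]
        endpoints[cu] = endpoints.get(cu, 0) + 1
        endpoints[cv] = endpoints.get(cv, 0) + 1
        if cu == cv:
            intra[cu] = intra.get(cu, 0) + 1
    return sum(intra.get(c, 0) / m - (endpoints[c] / (2 * m)) ** 2
               for c in endpoints)


# Hypothetical toy graph: two triangles joined by the single edge (3, 4).
toy_edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
split = {1: "left", 2: "left", 3: "left", 4: "right", 5: "right", 6: "right"}
q_split = modularity(toy_edges, split)                          # 5/14
q_single = modularity(toy_edges, {v: "all" for v in range(1, 7)})  # 0
```

Splitting the toy graph into its two triangles scores 5/14, while placing every node in one community scores 0.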
slide-58
SLIDE 58

Community Detection

2 2 3 4 4 2 2 1

0.01 0.04 0.035 0.03 0.03 0.035 0.02 0.03 0.03 0.01 0.04

  • Clauset et al. 2004: find maximal modularity in O(n lg² n)
  • Track marginal modularity, update neighbours on each merge
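
A naive sketch of the greedy merging idea: start with every node in its own community, repeatedly merge the connected pair with the largest modularity gain, and update only the merged community's neighbours. It rescans all pairs on every step, so it does not achieve the quoted running time, which relies on heap-based bookkeeping. The function name and the toy graph are illustrative, and the input is assumed to be a simple graph with comparable node labels.

```python
def greedy_modularity(edges):
    """Agglomerative community detection by greedy modularity gain."""
    m = len(edges)
    members, a, e = {}, {}, {}
    for u, v in edges:
        for x in (u, v):
            members.setdefault(x, {x})
            e.setdefault(x, {})
            a[x] = a.get(x, 0.0) + 1 / (2 * m)  # share of edge endpoints
        e[u][v] = e[u].get(v, 0.0) + 1 / (2 * m)
        e[v][u] = e[v].get(u, 0.0) + 1 / (2 * m)
    q = -sum(x * x for x in a.values())  # modularity of all-singletons

    while True:
        best = None
        for i in e:
            for j, e_ij in e[i].items():
                gain = 2 * (e_ij - a[i] * a[j])  # marginal modularity
                if i < j and (best is None or gain > best[0]):
                    best = (gain, i, j)
        if best is None or best[0] <= 0:
            break
        gain, i, j = best
        # Merge j into i and update the neighbours of j.
        for k, e_jk in e.pop(j).items():
            del e[k][j]
            if k != i:
                e[i][k] = e[i].get(k, 0.0) + e_jk
                e[k][i] = e[k].get(i, 0.0) + e_jk
        a[i] += a.pop(j)
        members[i] |= members.pop(j)
        q += gain
    return list(members.values()), q


# Hypothetical toy graph: two triangles joined by the single edge (3, 4).
toy_edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
communities, q_final = greedy_modularity(toy_edges)
```

On the toy graph the merges stop once the two triangles have formed, because joining them would lower Q.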
slide-59
SLIDE 59

Community Detection

Q=0.04

2 2 3 4 4 2 2 1

0.04 0.035 0.025 0.03 0.03 0.035 0.0125 0.04 0.03

slide-60
SLIDE 60

Community Detection

Q=0.08

2 2 3 4 4 2 2 1

0.04 0.035 0.025 0.03 0.06 0.035 0.0125 0.06 0.04

slide-61
SLIDE 61

Community Detection

Q=0.14

2 2 3 4 4 2 2 1

  • 0.11

0.10 0.035 0.025 0.01 0.035 0.0125 0.04

slide-62
SLIDE 62

Community Detection

Q=0.175

2 2 3 4 4 2 2 1

  • 0.11

0.10 0.035 0.0375 0.01 0.025 0.0375 0.04

slide-63
SLIDE 63

Community Detection

Q=0.2125

2 2 3 4 4 2 2 1

  • 0.15

0.10 0.1125 0.01

slide-64
SLIDE 64

Community Detection

Q=0.2225

2 2 3 4 4 2 2 1

  • 0.15

0.11 0.1125

  • 0.15
slide-65
SLIDE 65

Community Detection

slide-66
SLIDE 66

Conclusions

  • Social graph is fragile to partial disclosure
      • Consistent with Danezis/Wittneben, Nagaraja results
  • Public listings leak too much
      • Dominating sets, centrality, communities in particular
  • SNS operators need a dedicated privacy review team
      • Comparable to security audit & penetration testing
slide-67
SLIDE 67

Questions?

jcb82@cl.cam.ac.uk jra40@cl.cam.ac.uk