Social Processes, Information Flow, and Anonymized Network Data

SLIDE 1

Social Processes, Information Flow, and Anonymized Network Data

Jon Kleinberg

Cornell University

Including joint work with Lars Backstrom, Cynthia Dwork, and David Liben-Nowell

Jon Kleinberg Social Processes and Anonymized Network Data

SLIDE 2

Social Network Analysis

[Figures: high-school dating network (Bearman-Moody-Stovel 2004); karate club network (Zachary 1977)]

Social network data
Active research area in sociology, social psychology, anthropology for the past half-century.

Today: Convergence of social and technological networks
Computing and info. systems with intrinsic social structure.

What can the different fields learn from each other?

SLIDE 3

Mining Social Network Data

Mining social networks also has a long history in the social sciences.
E.g. Wayne Zachary’s Ph.D. work (1970-72): observe social ties and rivalries in a university karate club.

SLIDE 4

Mining Social Network Data

Mining social networks also has a long history in the social sciences.
E.g. Wayne Zachary’s Ph.D. work (1970-72): observe social ties and rivalries in a university karate club.
During his observation, conflicts intensified and the group split.

SLIDE 5

Mining Social Network Data

Mining social networks also has a long history in the social sciences.
E.g. Wayne Zachary’s Ph.D. work (1970-72): observe social ties and rivalries in a university karate club.
During his observation, conflicts intensified and the group split.
The split could be explained by a minimum cut in the social network.

SLIDE 6

A Matter of Scale

Social network data spans many orders of magnitude:
436-node network of e-mail exchange over 3 months at a corporate research lab (Adamic-Adar 2003)
43,553-node network of e-mail exchange over 2 years at a large university (Kossinets-Watts 2006)
4.4-million-node network of declared friendships on blogging community LiveJournal (Liben-Nowell et al. 2005, Backstrom et al. 2006)
240-million-node network of all IM communication over one month on Microsoft Instant Messenger (Leskovec-Horvitz 2007)

SLIDE 7

Not Just a Matter of Scale

How does massive network data compare to small-scale studies?
Currently, massive network datasets give you both more and less:
More: can observe global phenomena that are genuine, but literally invisible at smaller scales.
Less: Don’t really know what any one node or link means. Easy to measure things; hard to pose nuanced questions.
Goal: Find the point where the lines of research converge.

SLIDE 8

Outline

Several core computing ideas come into play:
Working with network data that is much messier than just nodes and edges.
Algorithmic models as a basic vocabulary for expressing complex social-science questions on complex network data.
Understanding social networks as datasets: privacy implications and other concerns.

Plan for the talk:
Algorithmic models for cascading behavior in social networks: formulating some fundamental unresolved questions.
Evaluating anonymization as a standard approach for protecting privacy in social network data.

SLIDE 9

Diffusion in Social Networks

[Figures: book recommendations (Leskovec et al. 2006); contagion of TB (Andre et al. 2006)]

Behaviors that cascade from node to node like an epidemic.
News, opinions, beliefs, rumors, fads, ...
Diffusion of innovations [Coleman-Katz-Menzel, Rogers]
Viral marketing [Domingos-Richardson 2001]
Localized collective action: riots, walkouts

Modeling via
biological epidemics [Berger-Borgs-Chayes-Saberi 2005]
coordination games [Blume 1993, Ellison 1993, Jackson-Yariv 2005]

SLIDE 10

Chain-Letter Petitions

Chain-letter petitions as “tracers” through global social network

[Liben-Nowell & Kleinberg 2008]

Dear All, The US Congress has authorised the President of the US to go to war against Iraq. Please consider this an urgent request. UN Petition for Peace: [...] Please COPY (rather than Forward) this e-mail in a new message, sign at the end of the list, and send it to all the people whom you know. If you receive this list with more than 500 names signed, please send a copy of the message to: usa@un.int president@whitehouse.gov

SLIDE 11

Networks of Documents, Networks of People

"Wholly new forms of encyclopedias will appear, ready made with a mesh of associative trails running through them ... There is a new profession of trail blazers, those who find delight in the task of establishing useful trails through the enormous mass of the common record." (Bush, 1945)

The chain-letter is a dual process:
A person blazing trails through a network of documents, vs.
A document blazing trails through a network of people.

SLIDE 12

How Information Spreads (Traditional Picture)

Adam

SLIDE 13

How Information Spreads (Traditional Picture)

Adam Bob Cathy Dan

SLIDE 14

How Information Spreads (Traditional Picture)

Adam Bob Cathy Dan

Eva Fred Geri Hal Iris Justine Ken Larry Mia

SLIDE 15

How Information Spreads (Traditional Picture)

Adam Bob Cathy Dan

Eva Fred Geri Hal Iris Justine Ken Larry Mia

SLIDE 16

Assembling a Chain-Letter Tree

The full tree is unobservable.

Adam Bob Cathy Dan

Eva Fred Geri Hal Iris Justine Ken Larry Mia

But hundreds of copies with distinct recipient lists have been posted to mailing lists. We can obtain these by Web searches and then assemble a partial tree.

SLIDE 17

Assembling a Chain-Letter Tree

A B C D E F G H

SLIDE 18

Assembling a Chain-Letter Tree

A B C D E F G H

A B C D E F G H

SLIDE 19

Assembling a Chain-Letter Tree

A B C D E F G H

A B C D E F G H A B C D E F I J

SLIDE 20

Assembling a Chain-Letter Tree

A B C D E F G H I J

A B C D E F G H A B C D E F I J

SLIDE 21

Assembling a Chain-Letter Tree

A B C D E F G H I J

A B C D E F G H A B C D E F I J A B C D K L M

SLIDE 22

Assembling a Chain-Letter Tree

A B C D E F G H I J M L K

A B C D E F G H A B C D E F I J A B C D K L M

SLIDE 23

Assembling a Chain-Letter Tree

A B C D E F G H I J M L K

A B C D E F G H A B C D E F I J A B C D K L M A X C D E F G H

SLIDE 24

Assembling a Chain-Letter Tree

A B C D E F G H I J M L K X

A B C D E F G H A B C D E F I J A B C D K L M A X C D E F G H

SLIDE 25

Assembling a Chain-Letter Tree

A B C D E F G H I J M L K X

A B C D E F G H A B C D E F I J A B C D K L M A X C D E F G H A E F G H

SLIDE 26

Assembling a Chain-Letter Tree

A B C D E F G H I J M L K X

A B C D E F G H A B C D E F I J A B C D K L M A X C D E F G H A E F G H

SLIDE 27

Assembling a Chain-Letter Tree

A B C D E F G H I J M L K X

A B C D E F G H A B C D E F I J A B C D K L M A X C D E F G H A E F G H

1 1 1 1 1 1 1 1 2 2 3 3 4 3 3

SLIDE 28

Assembling a Chain-Letter Tree

A B C D E F G H I J M L K
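The assembly steps in the preceding slides can be sketched in a few lines of Python. This is only an illustrative simplification, not the procedure used in the cited work: the function name `assemble_tree` and the majority-vote rule (each signer's parent is the name it most often follows across copies) are assumptions made for this sketch. The five signature lists are the ones shown on the slides, including the two corrupted copies.

```python
from collections import Counter, defaultdict

def assemble_tree(signature_lists):
    """For each signer, pick the predecessor it most often follows across all copies."""
    votes = defaultdict(Counter)
    for names in signature_lists:
        for prev, cur in zip(names, names[1:]):
            votes[cur][prev] += 1
    # Signers that never follow anyone are roots.
    parent = {names[0]: None for names in signature_lists if names[0] not in votes}
    for cur, counter in votes.items():
        parent[cur] = counter.most_common(1)[0][0]  # hypothetical majority-vote rule
    return parent

def depth(parent, node):
    """Number of edges from node up to the root."""
    d = 0
    while parent[node] is not None:
        node = parent[node]
        d += 1
    return d

copies = [
    "A B C D E F G H".split(),
    "A B C D E F I J".split(),
    "A B C D K L M".split(),
    "A X C D E F G H".split(),   # corrupted copy: B replaced by X
    "A E F G H".split(),         # corrupted copy: B, C, D deleted
]
parent = assemble_tree(copies)
for node in sorted(parent):
    print(node, "<-", parent[node], "depth", depth(parent, node))
```

Under this rule the corrupted copies are outvoted: C stays attached to B and E stays attached to D, while X ends up as a stray leaf under A.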

SLIDE 29

SLIDE 30

SLIDE 31

SLIDE 32

SLIDE 33

Modeling the Structure of the Tree

We’re all a few steps apart in the social network (“six degrees”), but the tree is very deep and narrow.
Trees for other chain letters have very similar structure.
Modeling non-participation and missing data doesn’t account for this.

Some plausible models that can produce trees of this shape:
(1) Based on temporal ideas: people act on messages at very different speeds.
(2) Based on spatial ideas: social networks are geographically clustered.

SLIDE 34

Why is the Tree So Deep and Narrow?

It looks like a depth-first search tree. But why? Possible model based on timing.

Assume nodes act on messages according to a delay distribution.

A C B D F E

In simulations on 4.4-million-node LiveJournal friendship network, a generalization produces trees with height, depth, and width close to Iraq-war chain letter.

SLIDE 35

Why is the Tree So Deep and Narrow?

It looks like a depth-first search tree. But why? Possible model based on timing.

Assume nodes act on messages according to a delay distribution.

A C B D F E

1pm 2pm 3pm 4pm 5pm 6pm

In simulations on 4.4-million-node LiveJournal friendship network, a generalization produces trees with height, depth, and width close to Iraq-war chain letter.

SLIDE 36

Why is the Tree So Deep and Narrow?

It looks like a depth-first search tree. But why? Possible model based on timing.

Assume nodes act on messages according to a delay distribution.

A C B D F E

In simulations on 4.4-million-node LiveJournal friendship network, a generalization produces trees with height, depth, and width close to Iraq-war chain letter.

SLIDE 37

Why is the Tree So Deep and Narrow?

It looks like a depth-first search tree. But why? Possible model based on timing.

Assume nodes act on messages according to a delay distribution.

A C B D F E

In simulations on 4.4-million-node LiveJournal friendship network, a generalization produces trees with height, depth, and width close to Iraq-war chain letter.

SLIDE 38

Timing-Based Models for Tree Structure

Adam Bob Cathy Dan

Ken Larry Mia

[Figure annotation: Participate with prob. p. If so, wait time t drawn from $f(t) = t^{-c}$.]

When a node v in the network first gets a copy of the message, v participates in the chain-letter with prob. p.
If so, waits time t before forwarding (t drawn from $f(t) = t^{-c}$).
Produces “elongated” trees when simulated in real networks.
To get depth of real tree, need to let nodes “group-reply.”
Open: Prove this yields asymptotically deeper trees in natural random-graph model.

SLIDE 39

Timing-Based Models for Tree Structure

Adam Bob Cathy Dan

Ken Larry Mia

[Figure annotation: With some prob., group-reply to sender's list, rather than sending to your own friends.]

When a node v in the network first gets a copy of the message, v participates in the chain-letter with prob. p.
If so, waits time t before forwarding (t drawn from $f(t) = t^{-c}$).
Produces “elongated” trees when simulated in real networks.
To get depth of real tree, need to let nodes “group-reply.”
Open: Prove this yields asymptotically deeper trees in natural random-graph model.
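The basic timing model (without group-reply) can be sketched as a small event-driven simulation. This is a minimal illustration under stated assumptions, not the simulation from the cited work: the delay density is taken to be proportional to $t^{-c}$ on $t \ge 1$ with $c > 1$ (the lower cutoff is an assumption needed to make it a proper distribution), a participating node forwards to all of its neighbors, and a node's parent is whoever sent the first copy it received. The names `sample_delay`, `simulate_chain_letter`, and the toy ring graph are hypothetical.

```python
import heapq
import random

def sample_delay(rng, c):
    """Inverse-transform sample from a density proportional to t**(-c) on t >= 1 (c > 1)."""
    u = rng.random()                       # u in [0, 1), so 1 - u is in (0, 1]
    return (1.0 - u) ** (-1.0 / (c - 1.0))

def simulate_chain_letter(neighbors, root, p, c, seed=0):
    """Return {signer: parent signer} for one run of the timing model."""
    rng = random.Random(seed)
    parent = {root: None}
    seen = {root}                          # nodes that have received at least one copy
    events = [(0.0, root)]                 # (time at which the node forwards, node)
    while events:
        time, v = heapq.heappop(events)
        for w in neighbors[v]:
            if w in seen:
                continue                   # only the first copy received matters here
            seen.add(w)
            if rng.random() < p:           # w participates with probability p
                parent[w] = v
                heapq.heappush(events, (time + sample_delay(rng, c), w))
    return parent

def depth(parent, node):
    d = 0
    while parent[node] is not None:
        node = parent[node]
        d += 1
    return d

# Toy network (an assumption): a ring of 20 nodes, each linked to its two nearest neighbors.
n = 20
ring = {v: [(v - 1) % n, (v + 1) % n] for v in range(n)}
tree = simulate_chain_letter(ring, root=0, p=1.0, c=2.0)
print("signers:", len(tree), "max depth:", max(depth(tree, v) for v in tree))
```

Swapping the toy ring for a real friendship graph and adding a group-reply step would be the natural next experiments; neither is attempted here.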

SLIDE 40

Spatial Clustering and Thresholds

A second class of theories based on spatial ideas.

long-range link

Even in on-line social networks, most friends are geographically (and demographically) similar to you

[McPherson et al. 2001, Liben-Nowell et al. 2005]

Decision rules for acting may involve thresholds: e.g., you may need to see multiple friends advocating a cause before signing on [Granovetter 1978, Schelling 1978]

SLIDE 41

Spatial Clustering and Thresholds

Interaction of local structure and thresholds

[Centola-Macy 2007]

Suppose people needed two stimuli to be willing to participate.

you are here long-range link

Non-trivial thresholds make it hard to use long-range links.

SLIDE 42

Spatial Clustering and Thresholds

Interaction of local structure and thresholds

[Centola-Macy 2007]

Suppose people needed two stimuli to be willing to participate.

you are here long-range link

Non-trivial thresholds make it hard to use long-range links.

SLIDE 43

Spatial Clustering and Thresholds

Interaction of local structure and thresholds

[Centola-Macy 2007]

Suppose people needed two stimuli to be willing to participate.

you are here long-range link

Non-trivial thresholds make it hard to use long-range links.

SLIDE 44

Spatial Clustering and Thresholds

Interaction of local structure and thresholds

[Centola-Macy 2007]

Suppose people needed two stimuli to be willing to participate.

you are here long-range link

Non-trivial thresholds make it hard to use long-range links.

SLIDE 45

Spatial Clustering and Thresholds

Interaction of local structure and thresholds

[Centola-Macy 2007]

Suppose people needed two stimuli to be willing to participate.

you are here long-range link

Non-trivial thresholds make it hard to use long-range links.
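A small simulation makes the point about thresholds and long-range links concrete. It is a sketch under assumptions that are not in the slides: the network is a ring in which each node is linked to its two nearest neighbors on each side, there is a single long-range link between nodes 0 and 10, updates are synchronous, and the helper names `ring_with_shortcut` and `cascade` are hypothetical.

```python
def ring_with_shortcut(n, shortcut):
    """Ring where each node links to its 2 nearest neighbors per side, plus one long-range link."""
    nbrs = {v: set() for v in range(n)}
    for v in range(n):
        for d in (1, 2):
            nbrs[v].add((v + d) % n)
            nbrs[(v + d) % n].add(v)
    a, b = shortcut
    nbrs[a].add(b)
    nbrs[b].add(a)
    return nbrs

def cascade(nbrs, seeds, threshold):
    """Synchronous threshold dynamics; returns {node: round in which it became active}."""
    active_round = {s: 0 for s in seeds}
    rnd = 0
    while True:
        rnd += 1
        # A node activates once at least `threshold` of its neighbors are active.
        newly = [v for v in nbrs if v not in active_round
                 and sum(w in active_round for w in nbrs[v]) >= threshold]
        if not newly:
            return active_round
        for v in newly:
            active_round[v] = rnd

nbrs = ring_with_shortcut(20, shortcut=(0, 10))
one_stimulus = cascade(nbrs, seeds={0, 1}, threshold=1)
two_stimuli = cascade(nbrs, seeds={0, 1}, threshold=2)
print("node 10 activates in round", one_stimulus[10], "with threshold 1")
print("node 10 activates in round", two_stimuli[10], "with threshold 2")
```

With threshold 1 the far end of the long-range link activates immediately; with threshold 2 the single long-range stimulus is not enough, so node 10 has to wait for the cascade to arrive through local links.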

SLIDE 46

Protecting Privacy in Social Network Data

Many large datasets based on communication (e-mail, IM, voice) where users have strong privacy expectations.
Current safeguards based on anonymization: replace node names with random IDs.

With more detailed data, anonymization has run into trouble:
Identifying on-line pseudonyms by textual analysis [Novak-Raghavan-Tomkins 2004]
De-anonymizing Netflix ratings via time series [Narayanan-Shmatikov 2006]
Search engine query logs: identifying users from their queries.

Our setting is much starker; does this make things safer?
E.g. no text, time-stamps, or node attributes.
Just a graph with nodes numbered 1, 2, 3, . . . , n.

SLIDE 47

An Example of What Can Go Wrong

SLIDE 48

An Example of What Can Go Wrong

Node 32 can find himself: only node of degree 6 connected to both leaders.

SLIDE 49

An Example of What Can Go Wrong

Node 32 can find himself: only node of degree 6 connected to both leaders. Node 4 can find herself: only node of degree 6 connected to defecting leader but not original leader.

SLIDE 50

Attacking an Anonymized Network

What we learn from this:
Attacker may have extra power if they are part of the system.
In large e-mail/IM network, can easily add yourself to system.
But “finding yourself” when there are 100 million nodes is going to be more subtle than when there are 34 nodes.

Template for an active attack on an anonymized network [Backstrom-Dwork-Kleinberg 2007]
Attacker can create (before the data is released)
nodes (e.g. by registering an e-mail account)
edges incident to these nodes (by sending mail)

Privacy breach: learning whether there is an edge between two existing nodes in the network.
Note: attacker’s actions are completely “innocuous.”
Main result: active attacks can easily compromise privacy.
Idea is to exploit the incredible richness of link structures.

SLIDE 51

An Attack

100M nodes

Scenario: Suppose an organization were going to release an anonymized communication graph on 100 million users.

SLIDE 52

An Attack

100M nodes

An attacker chooses a small set of user accounts to “target”: Goal is to learn edge relations among them.

SLIDE 53

An Attack

100M nodes

Before dataset is released: Create a small set of new accounts, with links among them, forming a subgraph H. Attach this new subgraph H to targeted accounts.

SLIDE 54

An Attack

100M+12 nodes

When anonymized dataset is released, need to find H. Why couldn’t there be many copies of H in the dataset? (We don’t even know what the network will look like ... ) Why wouldn’t it be computationally hard to find H?

SLIDE 55

An Attack

100M+12 nodes

In fact, Theorem: small random graphs H will likely be unique and efficiently findable. Random graph: each edge present with prob. 1/2.

SLIDE 56

An Attack

100M+12 nodes

Once H is found: Can easily find the targeted nodes by following edges from H.

SLIDE 57

Specifics of the Attack

First version of the attack:
Create random H on (2 + ε) log n nodes.
In experiments on 4.4-million-node LiveJournal graph, 7-node graph H can compromise 70 targeted nodes (and hence ∼ 2400 edge relations).

Second version of the attack:
Logarithmic size is not optimal.
Can begin breaching privacy with H of size ∼ √log n.

Passive attacks:
In LiveJournal graph: with reasonable probability, you and 6 of your friends chosen at random can carry out the first attack, compromising about 10 users.

SLIDE 58

Why is H Unique? A Ramsey-Theoretic Calculation

Basic calculation at the foundation:

Theorem (Erdős, 1947): There exists an n-node graph with no clique and no independent set of size > 2 log n.
Quantitative bound for Ramsey’s Theorem; one of the earliest uses of random graphs.

[Figure: a clique; an independent set]

The calculation:
Build random n-node graph, include each edge with prob. 1/2.
There are < $n^k$ sets of k nodes; each is a clique or independent set with probability $2^{-\binom{k}{2}} \approx 2^{-k^2/2}$.
Product $n^k \cdot 2^{-k^2/2}$ upper-bounds probability of any clique or indep. set; it drops below 1 once k exceeds ≈ 2 log n.
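The threshold in the calculation above is easy to check numerically. The sketch below assumes logarithms base 2 and uses the exact exponent $\binom{k}{2} = k(k-1)/2$ rather than the $k^2/2$ approximation; the function name `smallest_k` is hypothetical. It works in log space so that large n does not overflow.

```python
import math

def smallest_k(n):
    """Smallest k >= 2 with n**k * 2**(-k*(k-1)/2) < 1, evaluated via base-2 logarithms."""
    log_n = math.log2(n)
    k = 2
    # log2 of the union bound is k*log2(n) - k*(k-1)/2; stop once it turns negative.
    while k * log_n - k * (k - 1) / 2 >= 0:
        k += 1
    return k

for n in (1024, 4_400_000, 100_000_000):
    print(n, smallest_k(n), round(2 * math.log2(n), 1))
```

For n = 1024 the bound first drops below 1 at k = 22, just above 2 log n = 20, matching the slide's "once k exceeds ≈ 2 log n".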

SLIDE 59

Why is H Unique? A Ramsey-Theoretic Calculation

Erdős: Graph is random, subgraph is non-random.

Our case: Subgraph (H) is random, graph is non-random. But main calculation starts from same premise.

[Figure: subgraph H with numbered nodes, attached to the targeted nodes]

Analysis is greatly complicated because H is plugged into full graph. New copies of H could partly overlap original copy of H.

SLIDE 60

Finding the subgraph H

To find H:
Can assume there is a path through nodes 1, 2, . . . , k.
Start search at all possible nodes in G.
Prune search path at depth j if edges back from node j don’t match, or if degree of j doesn’t match.
Probability of a spurious path surviving to depth j is ≈ $2^{-j^2/2}$ (modulo overlap worries).
Overall size of search tree slightly more than linear in n.
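The plant-then-search procedure can be sketched end to end on a toy graph. Everything concrete here is an assumption made for illustration: the background graph is a sparse random graph standing in for real data, each targeted node is attached to exactly one node of H (a simplification of how targets are linked), and the names `build_graph`, `anonymize`, `find_h`, and `recover_targets` are hypothetical. The search itself follows the slide: extend a path node by node, pruning on degree and on the pattern of edges back to earlier path nodes.

```python
import random
from itertools import combinations

def build_graph(n, k, targets, seed=0):
    """Background graph on nodes 0..n-1 plus a planted k-node subgraph H."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n + k)}

    def add_edge(u, v):
        adj[u].add(v)
        adj[v].add(u)

    for _ in range(3 * n):                     # toy background: sparse random edges
        u, v = rng.sample(range(n), 2)
        add_edge(u, v)
    h_nodes = list(range(n, n + k))
    h_edges = set()                            # index pairs (i, j) with i < j
    for i, j in combinations(range(k), 2):
        if j == i + 1 or rng.random() < 0.5:   # path through H, other edges with prob. 1/2
            h_edges.add((i, j))
            add_edge(h_nodes[i], h_nodes[j])
    for i, t in enumerate(targets):            # simplification: H node i watches target i
        add_edge(h_nodes[i], t)
    return adj, h_nodes, h_edges

def anonymize(adj, seed=1):
    """Relabel nodes by a random permutation; return (new_adj, old_to_new)."""
    rng = random.Random(seed)
    labels = list(adj)
    rng.shuffle(labels)
    old_to_new = dict(zip(adj, labels))
    new_adj = {old_to_new[v]: {old_to_new[w] for w in nbrs} for v, nbrs in adj.items()}
    return new_adj, old_to_new

def find_h(adj, h_edges, h_degrees):
    """Depth-first search for paths x_0..x_{k-1} whose degrees and back-edges match H."""
    k, matches = len(h_degrees), []

    def extend(path):
        j = len(path)
        if j == k:
            matches.append(list(path))
            return
        candidates = adj[path[-1]] if path else adj
        for x in candidates:
            if x in path or len(adj[x]) != h_degrees[j]:
                continue                        # prune: degree mismatch
            if all((x in adj[path[i]]) == ((i, j) in h_edges) for i in range(j)):
                path.append(x)                  # back-edges match, go deeper
                extend(path)
                path.pop()

    extend([])
    return matches

def recover_targets(adj, match, num_targets):
    """Each of the first H nodes has exactly one neighbor outside H: its target."""
    inside = set(match)
    return [next(iter(adj[match[i]] - inside)) for i in range(num_targets)]

n, k, targets = 300, 8, [5, 17, 42]
adj, h_nodes, h_edges = build_graph(n, k, targets)
h_degrees = [len(adj[v]) for v in h_nodes]      # attacker knows these before release
anon, old_to_new = anonymize(adj)
matches = find_h(anon, h_edges, h_degrees)
true_image = [old_to_new[v] for v in h_nodes]
print("candidate copies of H found:", len(matches))
```

The attacker uses only what they created (the edges and degrees of H); the true copy of H is always among the matches, and the theorem on the earlier slide is what suggests it will usually be the only one.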

SLIDE 61

Stronger Theoretical Bound

Variant on construction breaches privacy with H of size ∼ √log n.
Construct H as before on k nodes, but connect to $b = k/3$ targeted nodes.
With high prob., min. internal cut in H exceeds b = cut to rest of graph.

SLIDE 62

Stronger Theoretical Bound

Recovery: Break graph up along cuts of size ≤ b.
Uses Gomory-Hu tree computation (e.g. Flake et al. 2004).
Can prove that H will be one of the components after this decomposition.

Uniqueness of H:
After breaking apart the graph, there are ≤ $n/k$ size-k components other than H.
Each is isomorphic to H with probability ≈ $2^{-k^2/2}$.
Now $2^{-k^2/2}$ only has to cancel $n/k$, not $n^k$, so k ≈ √log n is enough.
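The comparison between the two bounds can be written out as a short worked calculation. This is a sketch that assumes base-2 logarithms and ignores constant factors, as the slides do.

```latex
\begin{align*}
\text{First attack:}\quad
  n^k \cdot 2^{-k^2/2} < 1
  &\iff k \log_2 n < \tfrac{k^2}{2}
  \iff k > 2 \log_2 n, \\
\text{Cut-based attack:}\quad
  \frac{n}{k} \cdot 2^{-k^2/2} < 1
  &\iff \tfrac{k^2}{2} > \log_2 n - \log_2 k
  \;\Longleftarrow\; k \ge \sqrt{2 \log_2 n}.
\end{align*}
```

The factor $n^k$ counts all ordered choices of k nodes, while after the cut decomposition only about $n/k$ disjoint candidate components remain, which is why the required size of H drops from logarithmic to roughly the square root of the logarithm.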

SLIDE 63

Passive Attacks and the Richness of Local Subgraphs

In 4.4-million-node LiveJournal network, once you have 10 neighbors, the subgraph on these neighbors is likely to be unique.
Friendship structures act like unique signatures.
Passive attacks feasible with even smaller sets, using numbers of external neighbors in addition to internal network structure.

Most of us have laid the groundwork for a privacy-breaching attack without realizing it.

SLIDE 64

The Perils of Anonymized Data

General release of an anonymized social network? Many potential dangers.

Note: earlier datasets additionally protected by legal/contractual/IRB/employment safeguards.

Fundamental question: privacy-preserving mechanisms for making social network data accessible.

Interesting connections to issues in Sofya’s talk: May be difficult to obfuscate network effectively; Interactive mechanisms for network data may be possible. (See also [Dinur-Nissim 2003, Dwork-McSherry-Talwar 2007])

Recent proposals specifically aimed at
framework for safe public release [Blum-Ligett-Roth ’08]
social networks [Hay et al. ’07, Zheleva et al. ’07, Korolova et al. ’08]

Further issues

Even without overt attacks, increasingly refined pictures of individuals begin to emerge.

SLIDE 65

Final Reflections: Glimpses into Massive Networks

Simultaneous opportunities and challenges.

How do we build deeper models of the processes at work inside large-scale social networks?
A stronger vocabulary for analyzing operational models consistent with observed data.
Understanding how much can be predicted.
Social computing applications produce a new set of design constraints.

How do we make data available without compromising privacy?
A need for guarantees: the dangers can be unexpected.

Algorithmic and mathematical models will be crucial to understanding all these developments.
