Information Diffusion in Social Networks

SLIDE 1

Information Diffusion in Social Networks

Research Promotion Workshop, BESU, Shibpur, 15th March 2013

Amitabha Bagchi, Computer Science and Engineering, IIT Delhi

SLIDE 2

Online Social Networks

  • OSNs like Facebook and Twitter are ubiquitous.
  • In fact some of you are probably updating your Facebook status even as I speak.
  • "Stuck in boring talk about research, think I'll take a nap....LOL"

Researchers from various disciplines are waking up to the possibilities.

SLIDE 3

Research aspects of OSNs

  • Sociologists have studied human social networks from the dawn of their discipline.
  • Physicists are interested in social networks as a complex system of interacting agents.
  • Mathematicians see stochastic processes.
  • Economists apply game theory.

Computer Scientists built these systems. And we are building the systems that can analyze the data these systems generate.

SLIDE 4

Information diffusion on OSNs

Question: How do particular topics or pieces of content become popular on OSNs?

The answer to this question is tremendously important to a variety of stakeholders: businesses, political scientists, sociologists, etc.

SLIDE 5

Two aspects: Macro and Micro

Micro: What are individual users doing?

Macro: What are the large-scale phenomena that are observed in this system?

Synthesis: Can we deduce the nature of the large-scale phenomena from a knowledge of what individual users are doing?

SLIDE 6

Example: The SIR model

Given a graph G and a special vertex s that has a certain message (rumor).

  • Each node is in one of three states: Susceptible (S), Infected (I), Removed (R). Initially s is Infected and everyone else is Susceptible.
  • At each time step an edge (u,v) is chosen at random and if u is Infected it sends the message to v.
  • If v is S, it becomes I. If v is I, it becomes R.
  • If v is R then u becomes R.
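
The rules above can be tried out directly. The following is a minimal sketch, assuming a complete graph on n nodes, with node 0 as the special vertex and each time step drawing an ordered pair (u, v) uniformly at random. The function name `simulate_sir` and the `max_steps` safety cap are illustrative choices, not part of the talk.

```python
import random


def simulate_sir(n, seed=0, max_steps=10_000_000):
    """Run the three-state rumor process on the complete graph K_n (n >= 2).

    Returns the number of nodes that are still Susceptible when the
    process stops, i.e. the nodes that never received the message.
    """
    rng = random.Random(seed)
    state = ["S"] * n
    state[0] = "I"  # node 0 plays the role of the special vertex
    infected = 1
    steps = 0
    while infected > 0 and steps < max_steps:
        steps += 1
        # Draw an ordered pair (u, v) of distinct nodes uniformly at random;
        # on a complete graph this is a uniformly random directed edge.
        u = rng.randrange(n)
        v = rng.randrange(n - 1)
        if v >= u:
            v += 1
        if state[u] != "I":
            continue  # only Infected nodes send the message
        if state[v] == "S":
            state[v] = "I"
            infected += 1
        elif state[v] == "I":
            state[v] = "R"
            infected -= 1
        else:
            # v is Removed, so the sender u becomes Removed
            state[u] = "R"
            infected -= 1
    return state.count("S")
```

Running `simulate_sir(n)` for several seeds and counting how often the result is greater than zero gives an empirical feel for whether every node gets the message.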
SLIDE 7

SIR: The Macro question

Clearly, as long as there are infected nodes the process continues.

Question: Will all the nodes have been infected for at least some time before the process ends?

Ans: (Probably) depends on the topology. For a complete graph the answer is no (Sudbury, J. Appl. Prob., 1985).
SLIDE 8

The way of Physics

Observe the macro and theorize about the micro to better understand the universe.

SLIDE 9

The way of Engineering

Use the observation of the micro and the theory of the micro to build better systems and make more money...

...thereby helping pay for Physics research

SLIDE 10

Outline

  • Refine the micro question.
  • Define a stochastic model of the micro.
  • Simulate and observe the behaviour of the macro.

  • Compare with data.
SLIDE 11

Refining the question

The Attribution problem: Why do users do what they do?

  • Did you share that photo because you like what's in it or because you are a big fan of the person who posted it?
  • You just heard on TV that Sehwag has been cut from the Indian team. Do you want to share your opinion on Twitter?
  • Everyone is talking about Kolaveri. Do you want to check it out?

SLIDE 12

Building the model

The model comes from (possible) answers to the questions.

  • People are influenced by what their friends are talking about. (Endogenous).
  • People monitor broadcast media also and often respond to it on OSNs. (Exogenous).
  • People respond to themes that are getting popular on OSNs. (Somewhere in between).

SLIDE 13

The Model I

  • Users form an undirected small-world network.
  • Each user “tweets” from time to time. A “tweet” is an event in time that has a “topic” associated with it.
  • The user's choice of topic at time t is drawn from the set of topics that have been seen until time t.
  • The user differentiates between “global” topics and “local” topics.
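
The slides do not say how the small-world network is generated. As one possible stand-in, the sketch below builds an undirected Watts-Strogatz graph: a ring lattice in which each edge is rewired with probability p. The choice of Watts-Strogatz, the function name `watts_strogatz` and the parameters `n`, `k`, `p` are all assumptions made for illustration.

```python
import random


def watts_strogatz(n, k, p, rng):
    """Undirected Watts-Strogatz graph as a dict: node -> set of neighbours.

    n: number of nodes; k: even lattice degree with k < n;
    p: probability of rewiring each lattice edge.
    """
    adj = {i: set() for i in range(n)}
    # Start from a ring lattice: each node is joined to k/2 nodes on each side.
    for i in range(n):
        for j in range(1, k // 2 + 1):
            adj[i].add((i + j) % n)
            adj[(i + j) % n].add(i)
    # Visit each lattice edge (i, i + j) once and rewire it with probability p.
    for j in range(1, k // 2 + 1):
        for i in range(n):
            old = (i + j) % n
            if rng.random() < p and old in adj[i]:
                # New endpoint: any node that is neither i nor already a neighbour.
                candidates = [w for w in range(n) if w != i and w not in adj[i]]
                if candidates:
                    new = rng.choice(candidates)
                    adj[i].discard(old)
                    adj[old].discard(i)
                    adj[i].add(new)
                    adj[new].add(i)
    return adj
```

Rewiring keeps the number of edges fixed at n·k/2, so p only changes how "local" the network is, not how dense it is.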

SLIDE 14

The Model II

  • There is a “global list” in which “global tweets” arrive at rate λ1 (as a Poisson point process). Each of these brings a new topic.
  • Each user has a “local list” into which tweets are written at rate λ2 (as a Poisson point process). The topic of a user’s tweet is chosen randomly out of the topics in the global list and the local lists of its neighbours in the network.
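
Both arrival streams are Poisson point processes, which can be sampled by adding up independent exponential gaps. The sketch below shows this for a single stream; the function name `poisson_arrivals` and the `horizon` parameter are illustrative and not taken from the talk.

```python
import random


def poisson_arrivals(rate, horizon, rng):
    """Arrival times of a homogeneous Poisson process on [0, horizon).

    Gaps between consecutive arrivals are independent exponential
    random variables with mean 1 / rate.
    """
    times = []
    t = rng.expovariate(rate)
    while t < horizon:
        times.append(t)
        t += rng.expovariate(rate)
    return times
```

Calling it once with rate λ1 gives the global tweet times, and once per user with rate λ2 gives that user's tweet times.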

SLIDE 15

The Model III

  • Each global tweet has a weight A on arrival in the global list.
  • This weight decreases exponentially with time with parameter α, i.e. it is A e^{-αt} at time t if the topic arrived at time 0.
  • When a user tweets then that tweet is placed in its local list with weight B.
  • This weight decreases exponentially with time with parameter β, i.e. it is B e^{-βt} at time t if the tweet arrived at time 0.

SLIDE 16

The Model IV

A new tweet has two kinds of candidates it can copy its topic from:

  • Global tweets.
  • Local tweets from one of its neighbours’ lists.

A new tweet has the same topic as a candidate tweet with probability proportional to the candidate’s weight.
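
Model III and Model IV together amount to a weighted random choice over all visible tweets. The following is a minimal sketch, assuming each list stores (arrival_time, topic) pairs; the names `decayed_weight` and `choose_topic` and this data layout are illustrative assumptions, not the authors' implementation.

```python
import math
import random


def decayed_weight(initial, rate, age):
    """Weight of a tweet of the given age: initial * exp(-rate * age)."""
    return initial * math.exp(-rate * age)


def choose_topic(now, global_list, neighbour_lists, A, alpha, B, beta, rng):
    """Pick the topic of a new tweet made at time `now`.

    global_list: (arrival_time, topic) pairs in the global list.
    neighbour_lists: one list of (arrival_time, topic) pairs per neighbour.
    Each candidate is chosen with probability proportional to its
    current decayed weight. Returns None if there are no candidates.
    """
    topics, weights = [], []
    for t0, topic in global_list:
        topics.append(topic)
        weights.append(decayed_weight(A, alpha, now - t0))
    for local in neighbour_lists:
        for t0, topic in local:
            topics.append(topic)
            weights.append(decayed_weight(B, beta, now - t0))
    if not topics:
        return None
    # random.choices raises ValueError if every weight has underflowed to 0.
    return rng.choices(topics, weights=weights, k=1)[0]
```

In a long simulation, very old tweets contribute negligible weight and can be pruned from the lists to keep each choice cheap.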

SLIDE 17

A reality check

  • The total weight seen by any node is finite with probability 1.
  • Additionally since this is an ergodic Markov process there is a stationary distribution, hence the total weight seen by node v converges to a stationary value C(v) with E[C(v)] = λ1A/α + kλ2B/β, where k is the number of neighbours of v.
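
One way to see where the expectation comes from, sketched under the assumptions that the process has run long enough to be stationary and that the k neighbours tweet independently at rate λ2: a tweet of age s contributes A e^{-αs} (global) or B e^{-βs} (local), and for a Poisson process the expected total is the rate times the integral of the contribution over all ages (Campbell's formula).

```latex
\begin{align*}
\mathbb{E}[C(v)]
  &= \int_0^\infty \lambda_1 A e^{-\alpha s}\,ds
   + k \int_0^\infty \lambda_2 B e^{-\beta s}\,ds \\
  &= \frac{\lambda_1 A}{\alpha} + \frac{k \lambda_2 B}{\beta}.
\end{align*}
```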

SLIDE 18

Three parameter regimes

Varying the parameters gives us three kinds of behaviours.

  • Sub-viral regime: No topic dominates. Well described by mean-field approximation.
  • Super-viral regime: Each new topic goes viral and dies quickly.

  • Viral regime
SLIDE 19

Evolution in the viral regime

The simulation resembles real-world topic evolution.

[Figure: Number of Nodes vs. time]

SLIDE 20

Viral regime characteristics

Power law-like distributions are seen for macro properties like peak volume, spread and lifetime.

[Figure: two panels, Maximum Spread vs. Rank and Maximum Peak height vs. Rank]
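
Rank plots like these can be produced from any list of per-topic values. The sketch below sorts the values by rank and fits a straight line in log-log coordinates; a roughly constant slope is what "power law-like" looks like on such a plot. The helper names are illustrative, and a least-squares fit is only a rough diagnostic, not a rigorous test for a power law.

```python
import math


def rank_size(values):
    """Return (rank, value) pairs with the largest value at rank 1."""
    ordered = sorted(values, reverse=True)
    return list(enumerate(ordered, start=1))


def loglog_slope(pairs):
    """Least-squares slope of log(value) against log(rank).

    Pairs with non-positive values are skipped; at least two usable
    pairs are needed.
    """
    xs = [math.log(r) for r, v in pairs if v > 0]
    ys = [math.log(v) for r, v in pairs if v > 0]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```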

SLIDE 21

Live longer, go further

Longer lived topics spread further. (Or is it the other way around?)

[Figure: Maximum Spread vs. Lifetime]

SLIDE 22

Studying topology empirically

We define topic-based graphs for each topic:

  • Lifetime graph: The subgraph induced by all users who have ever tweeted on the topic.
  • Evolving graphs: The sequence of graphs induced by the users who tweet on the topic on a given day.
  • Cumulative evolving graph: There is an edge from u to v if u follows v and u tweets on the topic a day after v tweets on day t and
SLIDE 23

Topological study: Viral topics

For a viral topic clusters merge into one as it rises in popularity. (Evolving graph)

[Figure: Max Cluster Sizes / Evolution and Number of Clusters vs. Time; series: Max Cluster Size, 2nd max Cluster Size, 3rd max Cluster Size, Evolution, No. of Clusters]
SLIDE 24

Topological study: Non-viral topics

Non-viral topics see many small clusters. (Evolving graph)

[Figure: Max Cluster Size / Evolution and Number of Clusters vs. Time; series: Max Cluster Size, 2nd max Cluster Size, 3rd max Cluster Size, Evolution, No. of Clusters]
SLIDE 25

Empirical cross-verification: Setup

  • We used a data set containing approximately 200 million tweets from 9 million users crawled from Twitter in 2009.
  • We augmented the data set by crawling follower-following relationships and geolocating the users where possible.
  • Further we used NLP tools to tag tweets with topics (since hashtags were very sparse).

SLIDE 26

Large cluster formation: Empirical

For non-viral topics, the largest component of the cumulative evolving graph contains a small fraction of all nodes.

[Figure: two panels, Users count (Popularity) and Fraction of nodes in Giant component vs. Day]

SLIDE 27

Large clusters in viral topics

In viral topics the largest component takes up a significant fraction of the graph, growing in size as the topic rises in popularity.

[Figure: two panels, Users count (Popularity) and Fraction of nodes in Giant component vs. Day]

SLIDE 28

Cluster merging in the model

The ratio of the largest to the second largest component in the evolving graph tells a story.

[Figure: two panels, Max/2ndMax Cluster Size and Evolution vs. Time]
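
The ratio of the largest to the second largest component can be computed from the users active on a given day and the underlying network. The sketch below assumes the network is given as an undirected adjacency dict and treats clusters as connected components of the induced subgraph. The function names and this data layout are illustrative assumptions.

```python
from collections import deque


def component_sizes(adjacency, active):
    """Component sizes of the subgraph induced by `active`, largest first.

    adjacency: dict mapping node -> iterable of neighbours (symmetric).
    active: the users who tweeted on the topic in the chosen time window.
    """
    active = set(active)
    seen = set()
    sizes = []
    for start in active:
        if start in seen:
            continue
        # Breadth-first search restricted to active users.
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            u = queue.popleft()
            size += 1
            for w in adjacency.get(u, ()):
                if w in active and w not in seen:
                    seen.add(w)
                    queue.append(w)
        sizes.append(size)
    return sorted(sizes, reverse=True)


def max_to_second_ratio(sizes):
    """Largest over second largest component size, or None if undefined."""
    if len(sizes) < 2:
        return None
    return sizes[0] / sizes[1]
```

Evaluating this once per day gives one point of the Max/2ndMax curve; a sharp rise indicates that smaller clusters have merged into one dominant cluster.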

SLIDE 29

The real data also has geography

Viral topics cross regional/national boundaries in the cumulative evolving graph.

[Figure: two panels, Topic users count and Fraction of edges across countries vs. Day; series: Temporal, Geographical]
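
The plotted quantity, the fraction of edges across countries, is simple to compute once users are geolocated. The sketch below assumes a dict mapping each geolocated user to a country code and skips edges with an endpoint whose country is unknown. That skipping rule and the names used are assumptions, since the slides only say users were geolocated "where possible".

```python
def cross_country_fraction(edges, country):
    """Fraction of edges whose two endpoints are in different countries.

    edges: iterable of (u, v) pairs.
    country: dict mapping a geolocated user to a country code.
    Edges with a non-geolocated endpoint are ignored (an assumption).
    Returns None if no edge has both endpoints geolocated.
    """
    known = [(u, v) for u, v in edges if u in country and v in country]
    if not known:
        return None
    crossing = sum(1 for u, v in known if country[u] != country[v])
    return crossing / len(known)
```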

SLIDE 30

That was the trailer…

  • Ruhela et al. Towards the use of Online Social Networks for Efficient Internet Content Distribution, in Proc. ANTS 2011.
  • Ardon et al. Spatio-Temporal Analysis of Topic Popularity in Twitter, arXiv:1111.2904v1 [cs.SI].
  • Rajyalakshmi et al. Topic Diffusion and Emergence of Virality in Social Networks, arXiv:1202.2215v1 [cs.SI].

www.cse.iitd.ernet.in/~bagchi

SLIDE 31

The emerging science of big data

  • Huge amounts of data being generated from all kinds of sources.
  • “Smart cities”, genome sequencing, telescopes, networked systems.
  • A growing awareness that the science of data is the new frontier of technology.

Think of it as IT’s steam engine moment. Its turn to shine as a force in human affairs.

SLIDE 32

Challenges

  • Modelling
  • Domain knowledge required
  • But understanding of what data can reveal also required.

  • Data analytics
  • Algorithms
  • Data structures
  • Databases
  • System Architecture
SLIDE 33

The research horizon…

  • …is unlimited.
  • Only CS fundamentals will matter.
  • Everything else will become obsolete before the exam papers are returned.

SLIDE 34

Thanks for listening