http://cs224w.stanford.edu (1) New problem: Outbreak detection (2) - - PowerPoint PPT Presentation



slide-1
SLIDE 1

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

http://cs224w.stanford.edu

slide-2
SLIDE 2

 (1) New problem: Outbreak detection

 (2) Develop an approximation algorithm

  • It is a submodular optimization problem!

 (3) Speed up greedy hill-climbing

  • Valid for optimizing general submodular functions (i.e., also works for influence maximization)

 (4) Prove a new “data-dependent” bound on the solution quality

  • Valid for optimizing general submodular functions (i.e., also works for influence maximization)

10/24/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu

slide-3
SLIDE 3

 Given a real city water distribution network

 And data on how contaminants spread in the network

 Detect the contaminant as quickly as possible

 Problem posed by the US Environmental Protection Agency

[Leskovec et al., KDD ’07]

slide-4
SLIDE 4

[Figure: blogs and their posts over time, connected by time-ordered hyperlinks that form an information cascade]

Which blogs should one read to detect cascades as effectively as possible?

[Leskovec et al., KDD ’07]

slide-5
SLIDE 5


Detect all stories but late.

Want to read things before others do.

Detect blue & yellow soon but miss red.

[Leskovec et al., KDD ’07]

slide-6
SLIDE 6

 Both of these are instances of the same underlying problem!

 Given a dynamic process spreading over a network

 We want to select a set of nodes to detect the process effectively

 Many other applications:

  • Epidemics
  • Influence propagation
  • Network security

slide-7
SLIDE 7

 Utility of placing sensors:

  • Water flow dynamics, demands of households, …

 For each subset S ⊆ V compute utility f(S)

[Figure: two placements of sensors S1 to S4 on a network with high-, medium-, and low-impact outbreak locations. One placement has high sensing quality, f(S) = 0.9; the other has low sensing quality, f(S) = 0.01. V is the set of all network junctions.]

Sensor reduces impact through early detection!

slide-8
SLIDE 8

Given:

 Graph $G(V, E)$

 Data on how outbreaks spread over $G$:

  • For each outbreak $i$ we know the time $T(i, u)$ when outbreak $i$ contaminates node $u$

[Leskovec et al., KDD ’07]

Water distribution network (physical pipes and junctions)

Simulator of water consumption and flow (built by mechanical engineering people)

We simulate the contamination spread for every possible location.

slide-9
SLIDE 9

Given:

 Graph $G(V, E)$

 Data on how outbreaks spread over $G$:

  • For each outbreak $i$ we know the time $T(i, u)$ when outbreak $i$ contaminates node $u$

[Leskovec et al., KDD ’07]

The network of the blogosphere

Traces of the information flow

Collect lots of blog posts and trace hyperlinks to obtain data about information flow from a given blog.

slide-10
SLIDE 10

Given:

 Graph $G(V, E)$

 Data on how outbreaks spread over $G$:

  • For each outbreak $i$ we know the time $T(i, u)$ when outbreak $i$ contaminates node $u$

 Goal: Select a subset of nodes $S$ that maximizes the expected reward:

$$\max_{S \subseteq V} f(S) = \sum_i P(i)\, f_i(S) \quad \text{subject to: } \mathrm{cost}(S) < B$$

  • Expected reward for detecting outbreak $i$: $P(i)\, f_i(S)$
  • $f_i(S)$ is the penalty reduction: $f_i(S) = \pi_i(\emptyset) - \pi_i(S)$

[Leskovec et al., KDD ’07]
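The objective can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the detection times `T`, the probabilities `P`, and the cap `T_MAX` are made-up toy values, the time-to-detection penalty is used for $\pi_i$, and the penalty with no sensors, $\pi_i(\emptyset)$, is taken to be the penalty at time infinity.

```python
import math

# Hypothetical toy data: T[i][u] = time at which outbreak i reaches node u.
# A node missing from T[i] is never contaminated by outbreak i.
T = {
    "i1": {"a": 1.0, "b": 4.0},
    "i2": {"b": 2.0, "c": 3.0},
}
P = {"i1": 0.5, "i2": 0.5}  # assumed outbreak probabilities P(i)
T_MAX = 10.0                # assumed cap on the detection-time penalty


def detection_time(S, i):
    """Earliest time any sensor in S sees outbreak i (infinity if none does)."""
    return min((T[i][u] for u in S if u in T[i]), default=math.inf)


def penalty(t):
    """Time-to-detection penalty: min(t, T_MAX)."""
    return min(t, T_MAX)


def f(S):
    """Expected penalty reduction: sum_i P(i) * (penalty(inf) - penalty(T(S, i)))."""
    return sum(P[i] * (penalty(math.inf) - penalty(detection_time(S, i))) for i in T)


print(f(set()), f({"a"}), f({"b"}), f({"a", "b"}))
```

With no sensors the reward is zero, and each added sensor can only lower the detection time, so the reward never decreases.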

slide-11
SLIDE 11

 Reward

  • (1) Minimize time to detection
  • (2) Maximize number of detected propagations
  • (3) Minimize number of infected people

 Cost (node/location dependent):

  • Reading big blogs is more time consuming
  • Placing a sensor in a remote location is expensive

[Figure: spread of outbreak i; reward f(S)]

Monitoring the blue node saves more people than monitoring the green node.

slide-12
SLIDE 12

 Objective functions:

  • 1) Time to detection (DT)
  • How long does it take to detect a contamination?
  • Penalty: $\pi_i(t) = \min\{t, T_{\max}\}$
  • 2) Detection likelihood (DL)
  • How many contaminations do we detect?
  • We incur a penalty if we don’t detect: $\pi_i(t) = 0$, $\pi_i(\infty) = 1$
  • 3) Population affected (PA)
  • How many people drank contaminated water?
  • $\pi_i(t)$ = {# of blogs in cascade $i$ at time $t$}

 Observation:

In all cases detecting sooner does not hurt!
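A minimal sketch of the three penalties, written as functions of the detection time. The horizon `T_MAX` and the list of infection times are hypothetical values chosen for illustration; the final loop checks that each penalty is non-decreasing in time, which is the property the observation relies on.

```python
import math

T_MAX = 10.0  # assumed time horizon


def penalty_dt(t):
    """Time to detection: min(t, T_MAX)."""
    return min(t, T_MAX)


def penalty_dl(t):
    """Detection likelihood: 0 if detected at a finite time, 1 if never detected."""
    return 0.0 if math.isfinite(t) else 1.0


def make_penalty_pa(infection_times):
    """Population affected: number of nodes contaminated up to time t."""
    def penalty_pa(t):
        return float(sum(1 for x in infection_times if x <= t))
    return penalty_pa


# Hypothetical cascade in which four nodes get infected at these times.
penalty_pa = make_penalty_pa([0.0, 1.0, 2.5, 4.0])

# Each penalty is non-decreasing in t, so detecting sooner never hurts.
times = [0.0, 1.0, 3.0, 8.0, math.inf]
for pen in (penalty_dt, penalty_dl, penalty_pa):
    values = [pen(t) for t in times]
    assert values == sorted(values)
```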


slide-13
SLIDE 13

 Observation: Diminishing returns

[Figure, left: placement S = {s1, s2}. Adding a new sensor s’ helps a lot.]

[Figure, right: placement S’ = {s1, s2, s3, s4}. Adding s’ helps very little.]

[Leskovec et al., KDD ’07]

slide-14
SLIDE 14

 Claim: For all $A \subseteq B \subseteq V$ and sensors $s \in V \setminus B$:

$$f(A \cup \{s\}) - f(A) \ge f(B \cup \{s\}) - f(B)$$

 Proof:

  • Fix cascade $i$
  • Show $f_i(A) = \pi_i(\infty) - \pi_i(T(A, i))$ is submodular, where $T(A, i) = \min_{u \in A} T(u, i)$ is the time at which placement $A$ detects cascade $i$
  • Consider $A \subseteq B \subseteq V$ and sensor $s \in V \setminus B$
  • When does node $s$ detect cascade $i$? 3 cases:
  • (1) $T(s, i) \ge T(A, i)$: then $f_i(A \cup \{s\}) = f_i(A)$ and $f_i(B \cup \{s\}) = f_i(B)$, and so

$$f_i(A \cup \{s\}) - f_i(A) = 0 = f_i(B \cup \{s\}) - f_i(B)$$

  • Since $s$ detects too late, nobody benefits

slide-15
SLIDE 15

 Proof (contd.):

  • 3 cases:
  • (2) $T(B, i) \le T(s, i) < T(A, i)$: then

$$f_i(A \cup \{s\}) - f_i(A) \ge 0 = f_i(B \cup \{s\}) - f_i(B)$$

  • $s$ detects sooner than any node in $A$, but no sooner than the earliest node in $B$. So $s$ only helps improve the solution $A$.
  • (3) $T(s, i) < T(B, i)$: then

$$f_i(A \cup \{s\}) - f_i(A) = [\pi_i(\infty) - \pi_i(T(s, i))] - f_i(A) \ge [\pi_i(\infty) - \pi_i(T(s, i))] - f_i(B) = f_i(B \cup \{s\}) - f_i(B)$$

  • The inequality is due to the non-decreasingness of $f_i(\cdot)$, i.e., $f_i(A) \le f_i(B)$
  • So, $f_i(\cdot)$ is submodular!

 So, $f(\cdot)$ is also submodular, since it is a non-negative weighted sum of the $f_i$:

$$f(S) = \sum_i P(i)\, f_i(S)$$
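The claim can be checked by brute force on a small instance. The sketch below assumes toy detection times `T`, probabilities `P`, and a cap `T_MAX` (all hypothetical), and uses the time-to-detection penalty. It tests the diminishing-returns inequality for every pair A ⊆ B and every s outside B; this only works for tiny ground sets, because the number of subsets grows exponentially.

```python
import math
from itertools import combinations

# Hypothetical toy data: T[i][u] = time at which cascade i reaches node u.
T = {"i1": {"a": 1.0, "b": 4.0}, "i2": {"b": 2.0, "c": 3.0}}
P = {"i1": 0.5, "i2": 0.5}
T_MAX = 10.0


def f(S):
    """Expected penalty reduction under the time-to-detection penalty."""
    total = 0.0
    for i in T:
        t = min((T[i][u] for u in S if u in T[i]), default=math.inf)
        total += P[i] * (T_MAX - min(t, T_MAX))
    return total


def subsets(items):
    """All subsets of items, as a list of sets."""
    items = list(items)
    return [set(c) for r in range(len(items) + 1) for c in combinations(items, r)]


def is_submodular(func, V, tol=1e-12):
    """Check func(A+s) - func(A) >= func(B+s) - func(B) for all A <= B, s not in B."""
    V = set(V)
    for B in subsets(V):
        for A in subsets(B):
            for s in V - B:
                if func(A | {s}) - func(A) < func(B | {s}) - func(B) - tol:
                    return False
    return True


print(is_submodular(f, {"a", "b", "c"}))                    # reward function
print(is_submodular(lambda S: len(S) ** 2, {"a", "b", "c"}))  # gains grow with |S|
```

The second call uses a function whose marginal gains grow with the set size, so the check rejects it.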

slide-16
SLIDE 16

 What do we know about optimizing submodular functions?

  • Hill-climbing (i.e., greedy) is near optimal: it achieves at least $(1 - \frac{1}{e}) \cdot OPT$

 But:

  • (1) This only works for the unit-cost case! (each sensor costs the same)
  • For us, each sensor $s$ has cost $c(s)$
  • (2) The hill-climbing algorithm is slow
  • At each iteration we need to re-evaluate the marginal gains of all nodes
  • Runtime $O(|V| \cdot K)$ for placing $K$ sensors

[Figure: hill-climbing over candidate sensors a, b, c, d, e ranked by reward; add the sensor with the highest marginal gain]

slide-17
SLIDE 17

 Consider:

Hill-climbing that ignores cost

  • Ignore sensor cost
  • Repeatedly select sensor with highest marginal gain
  • Do this until the budget is exhausted

 How well does this work?


[Leskovec et al., KDD ’07]

slide-18
SLIDE 18

 Bad example:

  • $n$ sensors, budget $B$
  • $s_1$: reward $r$, cost $B$
  • $s_2, \dots, s_n$: reward $r - \varepsilon$, cost 1
  • Hill-climbing always prefers the more expensive sensor $s_1$ with reward $r$ (and exhausts the budget). It never selects the cheaper sensors with reward $r - \varepsilon$

→ For variable cost it can fail arbitrarily badly!

 Idea: What if we optimize the benefit-cost ratio?

$$s_i = \arg\max_{s \in V} \frac{f(A_{i-1} \cup \{s\}) - f(A_{i-1})}{c(s)}$$

Greedily pick the sensor $s_i$ that maximizes the benefit-to-cost ratio.

[Leskovec et al., KDD ’07]

slide-19
SLIDE 19

 Benefit-cost ratio can also fail arbitrarily badly!

 Consider: budget $B$:

  • 2 sensors $s_1$ and $s_2$:
  • Costs: $c(s_1) = \varepsilon$, $c(s_2) = B$
  • Only 1 cascade: $f(s_1) = 2\varepsilon$, $f(s_2) = B$
  • Then the benefit-cost ratios are $f(s_1)/c(s_1) = 2$ and $f(s_2)/c(s_2) = 1$
  • So, we first select $s_1$ and then cannot afford $s_2$

→ We get reward $2\varepsilon$ instead of $B$. Now send $\varepsilon \to 0$ and we get an arbitrarily bad solution!
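Both failure modes can be reproduced numerically. The sketch below makes two simplifying assumptions that are not on the slides: rewards are additive (each sensor contributes a fixed reward), and a sensor is affordable when the total cost stays at or below the budget. The concrete numbers (n = 5, B = 4 or 8, ε = 0.125) are arbitrary.

```python
def budgeted_greedy(reward, cost, budget, by_ratio):
    """Greedy selection under a budget; ranks by reward or by reward/cost."""
    chosen, spent = [], 0.0
    while True:
        affordable = [s for s in reward
                      if s not in chosen and spent + cost[s] <= budget]
        if not affordable:
            return chosen
        if by_ratio:
            best = max(affordable, key=lambda s: reward[s] / cost[s])
        else:
            best = max(affordable, key=lambda s: reward[s])
        chosen.append(best)
        spent += cost[best]


def total(reward, chosen):
    return sum(reward[s] for s in chosen)


eps = 0.125

# Example 1: cost-blind greedy takes the expensive sensor and exhausts the budget.
reward1 = {"s1": 1.0, "s2": 1.0 - eps, "s3": 1.0 - eps, "s4": 1.0 - eps, "s5": 1.0 - eps}
cost1 = {"s1": 4.0, "s2": 1.0, "s3": 1.0, "s4": 1.0, "s5": 1.0}
print(total(reward1, budgeted_greedy(reward1, cost1, 4.0, by_ratio=False)))  # reward r
print(total(reward1, budgeted_greedy(reward1, cost1, 4.0, by_ratio=True)))   # 4 * (r - eps)

# Example 2: benefit-cost greedy takes the tiny sensor and cannot afford the big one.
reward2 = {"s1": 2 * eps, "s2": 8.0}
cost2 = {"s1": eps, "s2": 8.0}
print(total(reward2, budgeted_greedy(reward2, cost2, 8.0, by_ratio=True)))   # 2 * eps
print(total(reward2, budgeted_greedy(reward2, cost2, 8.0, by_ratio=False)))  # B
```

Each rule wins on one example and loses on the other, which is the motivation for running both.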


[Leskovec et al., KDD ’07]

slide-20
SLIDE 20

 CELF (cost-effective lazy forward-selection)

A two-pass greedy algorithm:

  • Set (solution) $S'$: use benefit-cost greedy
  • Set (solution) $S''$: use unit-cost greedy
  • Final solution: $S = \arg\max(f(S'), f(S''))$

 How far is CELF from the (unknown) optimal solution?

 Theorem: CELF is near optimal

  • CELF achieves a $\frac{1}{2}(1 - 1/e)$ factor approximation!
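The two-pass structure can be sketched as follows. This shows only the "run both greedy rules and keep the better" part, without the lazy evaluation; the function names, the toy additive reward, and the affordability rule (total cost at most the budget) are assumptions made for illustration.

```python
def greedy(f, cost, budget, by_ratio):
    """Budgeted greedy on a set function f; ranks by gain or by gain/cost."""
    S, spent = set(), 0.0
    while True:
        best, best_score = None, 0.0
        for s in cost:
            if s in S or spent + cost[s] > budget:
                continue
            gain = f(S | {s}) - f(S)
            score = gain / cost[s] if by_ratio else gain
            if score > best_score:
                best, best_score = s, score
        if best is None:  # nothing affordable with positive gain
            return S
        S.add(best)
        spent += cost[best]


def celf_two_pass(f, cost, budget):
    """Run benefit-cost greedy and unit-cost greedy; return the better solution."""
    s_ratio = greedy(f, cost, budget, by_ratio=True)   # S'
    s_unit = greedy(f, cost, budget, by_ratio=False)   # S''
    return max(s_ratio, s_unit, key=f)


# Toy instance in which benefit-cost greedy alone fails.
reward = {"s1": 0.25, "s2": 8.0}
cost = {"s1": 0.125, "s2": 8.0}


def f(S):
    return sum(reward[s] for s in S)


print(celf_two_pass(f, cost, budget=8.0))
```

On this instance the benefit-cost pass returns only the cheap sensor, the unit-cost pass returns the expensive one, and the final comparison keeps the latter.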


[Leskovec et al., KDD ’07]

slide-21
SLIDE 21

 What do we know about optimizing submodular functions?

  • Hill-climbing (i.e., greedy) is near optimal: it achieves at least $(1 - \frac{1}{e}) \cdot OPT$

 But:

  • (2) The hill-climbing algorithm is slow!
  • At each iteration we need to re-evaluate the marginal gains of all nodes
  • Runtime $O(|V| \cdot K)$ for placing $K$ sensors

[Figure: hill-climbing over candidate sensors a, b, c, d, e ranked by reward; add the sensor with the highest marginal gain]

slide-22
SLIDE 22

 In round $i + 1$: so far we have picked $S_i = \{s_1, \dots, s_i\}$

  • Now pick $s_{i+1} = \arg\max_u f(S_i \cup \{u\}) - f(S_i)$
  • This is our old friend, the greedy hill-climbing algorithm. It maximizes the “marginal benefit” $\delta_u(S_i) = f(S_i \cup \{u\}) - f(S_i)$

 By the submodularity property:

$$f(S_i \cup \{u\}) - f(S_i) \ge f(S_j \cup \{u\}) - f(S_j) \quad \text{for } i < j$$

 Observation: By submodularity, for every $u$:

$$\delta_u(S_i) \ge \delta_u(S_j) \quad \text{for } i \le j, \text{ since } S_i \subseteq S_j$$

Marginal benefits $\delta_u$ only shrink (as $S$ grows)! Activating node $u$ in step $i$ helps at least as much as activating it at step $j$ ($j > i$).

[Leskovec et al., KDD ’07]

slide-23
SLIDE 23

 Idea:

  • Use $\delta_i$ as an upper bound on $\delta_j$ ($j > i$)

 Lazy hill-climbing:

  • Keep an ordered list of marginal benefits $\delta_i$ from the previous iteration
  • Re-evaluate $\delta_i$ only for the top node
  • Re-sort and prune

$$f(S \cup \{u\}) - f(S) \ge f(T \cup \{u\}) - f(T) \quad \text{for } S \subseteq T$$

[Figure: marginal gains of nodes a, b, c, d, e in sorted order; S1 = {a}]

[Leskovec et al., KDD ’07]

slide-24
SLIDE 24

[Figure: lazy hill-climbing, next animation step. Same idea and steps as on the previous slide; the marginal-gain list for nodes a, b, c, d, e is re-sorted after re-evaluating the top node. S1 = {a}]

[Leskovec et al., KDD ’07]

slide-25
SLIDE 25

[Figure: lazy hill-climbing, final animation step. Same idea and steps as on the previous two slides; after re-sorting the marginal-gain list the second node is selected. S1 = {a}, S2 = {a,b}]

[Leskovec et al., KDD ’07]
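Lazy hill-climbing is commonly implemented with a priority queue of stale marginal gains. The sketch below is one such implementation for the unit-cost case, not necessarily the one used in the paper: `lazy_greedy`, the heap layout, and the toy coverage function are assumptions. Each heap entry stores the size of the solution at the time its gain was computed; an entry that reaches the top while still current is selected without re-evaluating the others, because by submodularity every stale gain is an upper bound.

```python
import heapq


def lazy_greedy(f, V, k):
    """Pick up to k elements; returns (selection order, number of gain evaluations)."""
    S, order = set(), []
    f_S = f(S)
    # Entries are (-gain, element, size of S when the gain was computed).
    heap = [(-(f({u}) - f_S), u, 0) for u in V]
    heapq.heapify(heap)
    evaluations = len(heap)
    while heap and len(order) < k:
        neg_gain, u, stamp = heapq.heappop(heap)
        if stamp == len(order):
            # Gain is current, and all stale gains below it are upper bounds.
            S.add(u)
            order.append(u)
            f_S = f(S)
        else:
            # Stale entry: re-evaluate only this node and push it back.
            gain = f(S | {u}) - f_S
            evaluations += 1
            heapq.heappush(heap, (-gain, u, len(order)))
    return order, evaluations


# Toy coverage function: f(S) = number of distinct cascades seen by S.
cover = {"a": {1, 2, 3, 4}, "b": {3, 4, 5}, "c": {5, 6}, "d": {1}, "e": {2}}


def f(S):
    return len(set().union(*(cover[u] for u in S)))


order, evaluations = lazy_greedy(f, sorted(cover), k=2)
print(order, evaluations)
```

On this toy instance the second round re-evaluates only b and c, so 7 gain evaluations are made in total, where plain hill-climbing would make 9.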

slide-26
SLIDE 26

 Back to the solution quality!

 The (1-1/e) bound for submodular functions is the worst-case bound (worst over all possible inputs)

 Data-dependent bound:

  • Value of the bound depends on the input data
  • On “easy” data, hill-climbing may do better than 63%
  • Can we say something about the solution quality when we know the input data?

slide-27
SLIDE 27

 Suppose $S$ is some solution to $f(S)$ s.t. $|S| \le k$

  • $f(S)$ is monotone & submodular

 Let $T = \{t_1, \dots, t_k\}$ be the OPT solution

 For each $u \notin S$ let $\delta_u = f(S \cup \{u\}) - f(S)$

 Order the $\delta_u$ so that $\delta_1 \ge \delta_2 \ge \dots \ge \delta_{n-|S|}$

 Then: $f(T) \le f(S) + \sum_{i=1}^{k} \delta_i$

  • Note:
  • This is a data-dependent bound ($\delta_u$ depend on the input data)
  • Bound holds for any algorithm
  • Makes no assumption about how $S$ is computed
  • For some inputs it can be very “loose” (worse than 63%)

slide-28
SLIDE 28

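 Claim:

  • For each $u \notin S$ let $\delta_u = f(S \cup \{u\}) - f(S)$
  • Order the $\delta_u$ so that $\delta_1 \ge \delta_2 \ge \dots \ge \delta_{n-|S|}$
  • Then: $f(T) \le f(S) + \sum_{i=1}^{k} \delta_i$

 Proof:

$$\begin{aligned}
f(T) &\le f(T \cup S) = f(S) + \sum_{i=1}^{k} \left[ f(S \cup \{t_1, \dots, t_i\}) - f(S \cup \{t_1, \dots, t_{i-1}\}) \right] \\
&\le f(S) + \sum_{i=1}^{k} \left[ f(S \cup \{t_i\}) - f(S) \right] \\
&= f(S) + \sum_{i=1}^{k} \delta_{t_i} \\
&\le f(S) + \sum_{i=1}^{k} \delta_i
\end{aligned}$$

  • The first inequality uses monotonicity; the second uses submodularity (we proved this last time)
  • Last step: instead of taking $t_i \in T$ (of benefit $\delta_{t_i}$), we take the best possible element ($\delta_i$)

$$\Rightarrow f(T) \le f(S) + \sum_{i=1}^{k} \delta_i$$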
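Computing the bound takes one pass over the remaining nodes. The sketch below assumes a toy coverage function and a solution S of size k = 2 (both hypothetical), and compares the bound with the true optimum found by brute force, which is feasible only because the instance is tiny.

```python
from itertools import combinations

# Toy coverage function: f(S) = number of distinct cascades seen by S.
cover = {"a": {1, 2, 3, 4}, "b": {3, 4, 5}, "c": {5, 6}, "d": {1, 7}, "e": {2, 8}}


def f(S):
    return len(set().union(*(cover[u] for u in S)))


def data_dependent_bound(func, V, S, k):
    """Upper bound on the best size-k value: func(S) + sum of the k largest gains."""
    f_S = func(S)
    gains = sorted((func(S | {u}) - f_S for u in V if u not in S), reverse=True)
    return f_S + sum(gains[:k])


V = set(cover)
S = {"a", "c"}  # some solution with |S| <= k, e.g. produced by hill-climbing
k = 2

bound = data_dependent_bound(f, V, S, k)
opt = max(f(set(c)) for c in combinations(sorted(V), k))  # brute force
print(f(S), opt, bound)
```

Here f(S) = 6 and the bound is 8, so S is certified to be within 75% of optimal without knowing OPT, which is tighter than the 63% worst-case guarantee; the brute-force search shows that S is in fact optimal.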

slide-29
SLIDE 29

 Real metropolitan area water network

  • V = 21,000 nodes
  • E = 25,000 pipes

 Use a cluster of 50 machines for a month

 Simulate 3.6 million epidemic scenarios (152 GB of epidemic data)

 By exploiting sparsity we fit it into main memory (16 GB)

[Leskovec et al., KDD ’07]

slide-30
SLIDE 30

Data-dependent bound is much tighter (gives a more accurate estimate of algorithm performance)

[Figure: solution quality F(A) (higher is better) vs. number of sensors placed. Curves: the “offline” (1-1/e) bound, the data-dependent bound, and hill-climbing.]

slide-31
SLIDE 31

 Placement heuristics perform much worse

Battle of Water Sensor Networks competition:

Author             Score
CELF               26
Sandia             21
U Exeter           20
Bentley systems    19
Technion (1)       14
Bordeaux           12
U Cyprus           11
U Guelph            7
U Michigan          4
Michigan Tech U     3
Malcolm             2
Proteo              2
Technion (2)        1

[w/ Ostfeld et al., J. of Water Resource Planning]

slide-32
SLIDE 32

 Different objective functions give different sensor placements

[Figure: sensor placements under two objectives: population affected and detection likelihood]

slide-33
SLIDE 33

 CELF is 10 times faster than greedy hill-climbing!

slide-34
SLIDE 34

I have 10 minutes. Which blogs should I read to be most up to date?

Who are the most influential bloggers?

slide-35
SLIDE 35


Detect all stories but late.

Want to read things before others do.

Detect blue & yellow soon but miss red.

slide-36
SLIDE 36

 Online bound is much tighter!

  • 13% instead of 37%

[Figure: curves for the old bound, our bound, and CELF]

[Leskovec et al., KDD ’07]

slide-37
SLIDE 37

 Heuristics perform much worse!

 One really needs to perform the optimization

[Leskovec et al., KDD ’07]

slide-38
SLIDE 38

 CELF has 2 sub-algorithms. Which wins?

 Unit cost:

  • CELF picks large popular blogs: instapundit.com, michellemalkin.com

 Cost-benefit:

  • Cost proportional to the number of posts

 We can do much better when considering costs

[Leskovec et al., KDD ’07]

slide-39
SLIDE 39

 Problem: Then CELF picks lots of small blogs that participate in few cascades

 We pick the best solution that interpolates between the costs

 We can get good solutions with few blogs and few posts

[Figure: each curve represents a set of solutions S with the same final reward f(S); curves shown for score f(S) = 0.2, 0.3, and 0.4]

[Leskovec et al., KDD ’07]

slide-40
SLIDE 40

 We want to generalize well to future (unknown) cascades

 Limiting selection to bigger blogs improves generalization!

slide-41
SLIDE 41

 CELF runs 700 times faster than the simple hill-climbing algorithm

[Leskovec et al., KDD ’07]

slide-42
SLIDE 42

Observations

  • Small diameter, Edge clustering
  • Patterns of signed edge creation
  • Viral Marketing, Blogosphere, Memetracking
  • Scale-Free
  • Densification power law, Shrinking diameters
  • Strength of weak ties, Core-periphery

Models

  • Erdös-Renyi model, Small-world model
  • Structural balance, Theory of status
  • Independent cascade model, Game theoretic model
  • Preferential attachment, Copying model
  • Microscopic model of evolving networks
  • Kronecker Graphs

Algorithms

  • Decentralized search
  • Models for predicting edge signs
  • Influence maximization, Outbreak detection, LIM
  • PageRank, Hubs and authorities
  • Link prediction, Supervised random walks
  • Community detection: Girvan-Newman, Modularity