CS224W: Machine Learning with Graphs
Jure Leskovec, Stanford University
http://cs224w.stanford.edu
¡ (1) New problem: Outbreak detection
¡ (2) Develop an approximation algorithm
§ It is a submodular optimization problem!
¡ (3) Speed up greedy hill-climbing
§ Valid for optimizing general submodular functions (i.e., it also works for influence maximization)
¡ (4) Prove a new "data-dependent" bound on the solution quality
§ Valid for optimizing any submodular function (i.e., it also works for influence maximization)
¡ Given a real city water distribution network
¡ And data on how contaminants spread in the network
¡ Detect the contaminant as quickly as possible
¡ Problem posed by the US Environmental Protection Agency
[Figure: users/blogs and their posts, connected by time-ordered hyperlinks, form an information cascade]
Which users/news sites should one follow to detect cascades as effectively as possible?
[Figure: the trade-off between placements: one detects all stories but late; another detects the blue & yellow stories soon but misses the red story. We want to read things before others do.]
¡ Both of these are instances of the same underlying problem!
¡ Given a dynamic process spreading over a network, we want to select a set of nodes to detect the process effectively
¡ Many other applications:
§ Epidemics
§ Influence propagation
§ Network security
¡ Utility of placing sensors:
§ Water flow dynamics, demands of households, …
¡ For each subset S ⊆ V compute the utility f(S)
[Figure: contamination spreading through the set V of all network junctions. One placement has high sensing "quality" (e.g., f(S) = 0.9), another low sensing "quality" (e.g., f(S) = 0.01), illustrated on high-, medium-, and low-impact outbreaks. A sensor reduces impact through early detection!]
Given:
¡ Graph G(V, E)
¡ Data about how outbreaks spread over G:
§ For each outbreak i we know the time T(u, i) when outbreak i contaminates node u
Water setting: the water distribution network (physical pipes and junctions), plus a simulator of water consumption & flow (built by mechanical engineers); we simulate the contamination spread from every possible location.
Blog setting: the network of news media; we collect lots of articles and trace them to obtain data about the information flow from a given news site, identifying influence sets.
¡ Goal: Select a subset of nodes S that maximizes the expected reward

max_{S ⊆ V} f(S) = Σ_i P(i) · f_i(S)   subject to: cost(S) < B

where P(i) is the probability of outbreak i occurring, and f_i(S) is the expected reward for detecting outbreak i using sensors S.
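To make the objective concrete, here is a minimal sketch in Python. The data layout (a `scenarios` dict mapping each outbreak id to its probability P(i), and a user-supplied per-outbreak reward `f_i`) is an assumption for illustration, not from the lecture. The brute-force baseline shows why exhaustive search is hopeless and greedy methods are needed.

```python
from itertools import combinations

def expected_reward(S, scenarios, f_i):
    """f(S) = sum_i P(i) * f_i(S): reward weighted by outbreak probability."""
    return sum(P * f_i(frozenset(S), i) for i, P in scenarios.items())

def brute_force(V, cost, budget, scenarios, f_i):
    """Exact optimum by enumeration; only feasible for tiny toy instances.
    The exponential blow-up here motivates the greedy algorithms below."""
    best_S, best_val = frozenset(), 0.0
    for r in range(len(V) + 1):
        for S in combinations(V, r):
            if sum(cost[s] for s in S) < budget:  # cost(S) < B
                val = expected_reward(S, scenarios, f_i)
                if val > best_val:
                    best_S, best_val = frozenset(S), val
    return best_S, best_val
```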
¡ Reward (one of the following three):
§ (1) Minimize the time to detection
§ (2) Maximize the number of detected propagations
§ (3) Minimize the number of infected people
¡ Cost (context dependent):
§ Reading big blogs is more time consuming
§ Placing a sensor in a remote location is expensive
[Figure: outbreak i spreading over the network, with reward f(S); monitoring the blue node saves more people than monitoring the green node]
¡ Penalty π_i(t) for detecting outbreak i at time t:
§ (1) Time to detection (DT)
§ How long does it take to detect a contamination?
§ Penalty for detecting at time t: π_i(t) = t
§ (2) Detection likelihood (DL)
§ How many contaminations do we detect?
§ Penalty for detecting at time t: π_i(t) = 0, π_i(∞) = 1
§ Note this is a binary outcome: we either detect or we don't
§ (3) Population affected (PA)
§ How many people drank contaminated water?
§ Penalty for detecting at time t: π_i(t) = {# of infected nodes in outbreak i by time t}
¡ Observation: In all cases, detecting sooner does not hurt!
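The three penalties translate directly into code. A sketch under stated assumptions: each outbreak is a dict mapping node to its contamination time T(u, i), and a finite simulation horizon T_MAX stands in for π_i(∞) so the DT penalty stays bounded; the last function anticipates the penalty-reduction objective f_i(S) defined just below.

```python
INF = float("inf")
T_MAX = 100.0  # assumed simulation horizon, standing in for "infinity"

def penalty_dt(t, outbreak):
    # (1) Time to detection: pi_i(t) = t, truncated at the horizon.
    return min(t, T_MAX)

def penalty_dl(t, outbreak):
    # (2) Detection likelihood: pi_i(t) = 0 if detected, pi_i(inf) = 1.
    return 0.0 if t < INF else 1.0

def penalty_pa(t, outbreak):
    # (3) Population affected: # of nodes outbreak i infects by time t.
    return sum(1 for t_u in outbreak.values() if t_u <= min(t, T_MAX))

def detection_time(S, outbreak):
    # T(S, i): earliest time any sensor in S is contaminated by outbreak i.
    return min((outbreak[s] for s in S if s in outbreak), default=INF)

def penalty_reduction(S, outbreak, penalty):
    # f_i(S) = pi_i(inf) - pi_i(T(S, i)); detecting sooner never hurts.
    return penalty(INF, outbreak) - penalty(detection_time(S, outbreak), outbreak)
```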
¡ Observation: Diminishing returns

[Figure: with placement S = {x1, x2}, adding a new sensor x′ helps a lot; with placement S′ = {x1, x2, x3, x4}, adding x′ helps very little]
¡ We define f_i(S) as the penalty reduction: f_i(S) = π_i(∞) − π_i(T(S, i))
¡ Claim: For all A ⊆ B ⊆ V and sensors x ∈ V\B:
f(A ∪ {x}) − f(A) ≥ f(B ∪ {x}) − f(B)
¡ Proof: All our objectives are submodular
§ Fix outbreak i and show that f_i(A) = π_i(∞) − π_i(T(A, i)) is submodular
§ Consider A ⊆ B ⊆ V and a sensor x ∈ V\B
§ When does sensor x detect outbreak i? We analyze 3 cases, based on when x detects outbreak i:
§ (1) T(B, i) ≤ T(A, i) < T(x, i): x detects late, nobody benefits:
f_i(A ∪ {x}) = f_i(A), and also f_i(B ∪ {x}) = f_i(B), so
f_i(A ∪ {x}) − f_i(A) = 0 = f_i(B ∪ {x}) − f_i(B)
¡ Proof (contd.):
§ (2) T(B, i) ≤ T(x, i) ≤ T(A, i): x detects after B but before A.
x detects sooner than any node in A but after all nodes in B, so x only helps improve the solution A (but not B):
f_i(A ∪ {x}) − f_i(A) ≥ 0 = f_i(B ∪ {x}) − f_i(B)
§ (3) T(x, i) < T(B, i) ≤ T(A, i): x detects early:
f_i(A ∪ {x}) − f_i(A) = [π_i(∞) − π_i(T(x, i))] − f_i(A)
≥ [π_i(∞) − π_i(T(x, i))] − f_i(B) = f_i(B ∪ {x}) − f_i(B)
§ The inequality holds because f_i(⋅) is non-decreasing and A ⊆ B, i.e., f_i(A) ≤ f_i(B)
§ So, f_i(⋅) is submodular!
¡ So, f(S) = Σ_i P(i) · f_i(S) is also submodular (a non-negative weighted sum of submodular functions)
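The claim can also be sanity-checked numerically on a toy instance. A minimal sketch, assuming synthetic contamination times and a DT-style penalty with horizon 10 (none of this is course code):

```python
import itertools
import random

random.seed(0)
INF = float("inf")
V = list(range(6))
# One synthetic outbreak: node -> contamination time T(u, i).
outbreak = {u: random.uniform(0, 10) for u in V}

def f(S):
    # DT-style penalty reduction with horizon 10: f(S) = pi(inf) - pi(T(S, i)).
    t = min((outbreak[s] for s in S), default=INF)
    return 10.0 - min(t, 10.0)

def subsets(xs, max_r=None):
    rs = range((len(xs) if max_r is None else max_r) + 1)
    return map(frozenset, itertools.chain.from_iterable(
        itertools.combinations(sorted(xs), r) for r in rs))

# Verify f(A + x) - f(A) >= f(B + x) - f(B) for all A subset of B, x outside B.
for B in subsets(V, max_r=3):
    for A in subsets(B):
        for x in set(V) - B:
            assert f(A | {x}) - f(A) >= f(B | {x}) - f(B) - 1e-9
print("Diminishing returns holds on this instance.")
```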
¡ What do we know about optimizing submodular functions?
§ Hill-climbing (i.e., greedy) is near-optimal: it achieves (1 − 1/e) · OPT
¡ But:
§ (1) This only works for the unit-cost case! (each sensor costs the same)
§ For us, each sensor s has a cost c(s)
§ (2) The hill-climbing algorithm is slow
§ At each iteration we need to re-evaluate the marginal gains of all nodes
§ Runtime O(|V| · k) for placing k sensors
[Figure: hill-climbing: repeatedly add the sensor with the highest marginal gain in reward]
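For reference, a minimal sketch of this unit-cost hill-climbing loop (names are illustrative); the inner `max` is exactly the re-evaluation of every node that makes each round cost O(|V|):

```python
def greedy(V, f, k):
    """Unit-cost greedy hill-climbing: k rounds, each re-evaluating the
    marginal gain of every remaining node, i.e. O(|V| * k) calls to f."""
    S = set()
    for _ in range(k):
        u_best = max((u for u in V if u not in S),
                     key=lambda u: f(S | {u}) - f(S),
                     default=None)
        if u_best is None:
            break
        S.add(u_best)
    return S
```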
¡ Consider the following algorithm for the outbreak detection problem: hill-climbing that ignores cost
§ Ignore the sensor cost c(s)
§ Repeatedly select the sensor with the highest marginal gain
§ Do this until the budget is exhausted
¡ Q: How well does this work?
¡ A: It can fail arbitrarily badly!
§ There exists a problem setting where the hill-climbing solution is arbitrarily far from OPT
§ Next we construct such an example
¡ Bad example when we ignore cost:
§ n sensors, budget B
§ s1: reward r, cost B
§ s2 … sn: reward r − ε, cost 1
§ Hill-climbing always prefers the more expensive sensor s1 with reward r (and exhausts the budget). It never selects the cheaper sensors with reward r − ε → for variable costs it can fail arbitrarily badly!
¡ Idea: What if we optimize the benefit-cost ratio?
Greedily pick the sensor s_i that maximizes the benefit-to-cost ratio:

s_i = arg max_{s ∈ V \ A_{i−1}} [f(A_{i−1} ∪ {s}) − f(A_{i−1})] / c(s)
¡ The benefit-cost ratio can also fail arbitrarily badly!
¡ Consider: budget B and 2 sensors s1 and s2:
§ Costs: c(s1) = ε, c(s2) = B
§ Benefit (only 1 cascade): f(s1) = 2ε, f(s2) = B
§ Then the benefit-cost ratios are: f(s1)/c(s1) = 2 and f(s2)/c(s2) = 1
§ So, we first select s1 and then cannot afford s2 → we get reward 2ε instead of B! Now send ε → 0 and we get an arbitrarily bad solution!
This algorithm incentivizes choosing nodes with very low cost, even when slightly more expensive ones can lead to much better global results.
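Both counterexamples are easy to instantiate. A sketch assuming additive rewards, so that a single pass in sorted order coincides with iterative greedy; the sensor names and numbers below just encode the two examples above:

```python
def greedy_by(sensors, budget, key):
    """One greedy pass: scan sensors in order of `key`, keep what fits the
    budget. sensors: {name: (reward, cost)}; rewards are assumed additive."""
    chosen, spent = [], 0.0
    for name, (r, c) in sorted(sensors.items(), key=key, reverse=True):
        if spent + c <= budget:
            chosen.append(name)
            spent += c
    return chosen

B, eps = 10.0, 0.01

# Counterexample 1: ignoring cost. Greedy grabs s1 (reward 1.0, cost B) and
# exhausts the budget; the 10 cheap sensors would have given ~9.9 total.
ex1 = {"s1": (1.0, B), **{f"s{i}": (1.0 - eps, 1.0) for i in range(2, 12)}}
print(greedy_by(ex1, B, key=lambda kv: kv[1][0]))               # -> ['s1']

# Counterexample 2: benefit-cost ratio. The ratio prefers the cheap sensor
# (2*eps/eps = 2 beats 1.0/1.0 = 1), after which the good one no longer fits.
ex2 = {"cheap": (2 * eps, eps), "good": (1.0, 1.0)}
print(greedy_by(ex2, 1.0, key=lambda kv: kv[1][0] / kv[1][1]))  # -> ['cheap']
```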
¡ CELF (Cost-Effective Lazy Forward-selection): a two-pass greedy algorithm:
§ Solution set S′: use benefit-cost greedy
§ Solution set S″: use unit-cost greedy
§ Final solution: S = arg max(f(S′), f(S″))
¡ How far is CELF from the (unknown) optimal solution?
¡ Theorem: CELF is near-optimal [Krause & Guestrin, '05]
§ CELF achieves a ½(1 − 1/e) factor approximation!
This is surprising: we have two clearly suboptimal solutions, but taking the better of the two is guaranteed to give a near-optimal solution.
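The outer structure of CELF is then simply "run both greedy passes and keep the better one". A sketch using a plain (non-lazy) budgeted greedy for brevity; `use_ratio` switches between the two variants:

```python
def budgeted_greedy(V, f, cost, budget, use_ratio):
    """Greedy forward-selection under a budget; `use_ratio` switches between
    unit-cost gains and benefit-cost (gain / cost) gains."""
    S = set()
    while True:
        spent = sum(cost[s] for s in S)
        fits = [u for u in V if u not in S and spent + cost[u] <= budget]
        if not fits:
            return S
        S = S | {max(fits, key=lambda u: (f(S | {u}) - f(S))
                                         / (cost[u] if use_ratio else 1.0))}

def celf_select(V, f, cost, budget):
    """CELF: take the better of the two greedy passes.
    Krause & Guestrin '05: this guarantees >= 1/2 * (1 - 1/e) of OPT."""
    return max(budgeted_greedy(V, f, cost, budget, use_ratio=False),
               budgeted_greedy(V, f, cost, budget, use_ratio=True),
               key=f)
```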
¡ What do we know about optimizing submodular functions?
§ Hill-climbing (i.e., greedy) is near-optimal: (1 − 1/e) · OPT
¡ But:
§ (2) The hill-climbing algorithm is slow!
§ At each iteration we need to re-evaluate the marginal gains of all nodes
§ Runtime O(|V| · k) for placing k sensors

[Figure: hill-climbing: repeatedly add the sensor with the highest marginal gain in reward]
¡ In round i + 1: so far we have picked S_i = {s_1, …, s_i}
§ Now pick s_{i+1} = arg max_u f(S_i ∪ {u}) − f(S_i)
§ This is our old friend, greedy hill-climbing. It maximizes the "marginal gain" δ_i(u) = f(S_i ∪ {u}) − f(S_i)
¡ By submodularity, for every u:
f(S_i ∪ {u}) − f(S_i) ≥ f(S_j ∪ {u}) − f(S_j) for i < j
¡ Observation: By submodularity, δ_i(u) ≥ δ_j(u) for i < j, since S_i ⊆ S_j
¡ Marginal benefits δ_i(u) only shrink as i grows! Activating node u in step i helps more than activating it in step j (j > i).
¡ Idea:
§ Use δ_i as an upper bound on δ_j (j > i)
¡ Lazy hill-climbing:
§ Keep an ordered list of the marginal benefits δ_i from the previous iteration
§ Re-evaluate δ_i only for the top node
§ Re-order and prune

[Figure: nodes a through e sorted by cached marginal gain across rounds S1 = {a}, S2 = {a, b}; by submodularity, f(S ∪ {u}) − f(S) ≥ f(T ∪ {u}) − f(T) for S ⊆ T, so the cached gains are valid upper bounds]
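A minimal sketch of lazy hill-climbing with a priority queue (unit-cost version, illustrative names). Each heap entry remembers the round in which its gain was computed; a popped entry that is fresh for the current round must be the true argmax, because every other cached gain is an upper bound on its true gain:

```python
import heapq

def lazy_greedy(V, f, k):
    """Lazy hill-climbing: cached marginal gains only shrink (submodularity),
    so they are valid upper bounds and only the heap top needs re-evaluation."""
    S, f_S = set(), f(set())
    # Max-heap via negated gains; entries are (-gain, node, round_computed).
    heap = [(-(f({u}) - f_S), u, 0) for u in V]
    heapq.heapify(heap)
    rnd = 0
    while len(S) < k and heap:
        neg_gain, u, computed_in = heapq.heappop(heap)
        if computed_in == rnd:
            S.add(u)                 # gain is fresh: u is the true argmax
            f_S = f(S)
            rnd += 1
        else:
            # Stale upper bound: recompute against the current S and push back.
            heapq.heappush(heap, (-(f(S | {u}) - f_S), u, rnd))
    return S
```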
¡ CELF (using lazy evaluation) runs 700 times faster than the greedy hill-climbing algorithm

[Figure: running times. "CELF" is raw CELF; "CELF + bounds" is CELF together with computing the data-dependent solution-quality bound]
¡ Back to the solution quality!
¡ The (1 − 1/e) bound for submodular functions is a worst-case bound (worst over all possible inputs)
¡ Data-dependent bound:
§ The value of the bound depends on the input data
§ On "easy" data, hill-climbing may do much better than 63%
§ Can we say something about the solution quality when we know the input data?
¡ Suppose S is some solution to max f(S) s.t. |S| ≤ k
§ f is monotone & submodular
¡ Let OPT = {t_1, …, t_k} be the optimal solution
¡ For each u, let δ(u) = f(S ∪ {u}) − f(S)
¡ Order the δ(u) so that δ(1) ≥ δ(2) ≥ ⋯
¡ Then: f(OPT) ≤ f(S) + Σ_{i=1…k} δ(i)
§ Note:
§ This is a data-dependent bound (the δ(i) depend on the input data)
§ The bound holds for any algorithm; it makes no assumption about how S was computed
§ For some inputs it can be very "loose" (worse than 63%)
¡ Claim:
§ For each u, let δ(u) = f(S ∪ {u}) − f(S)
§ Order the δ(u) so that δ(1) ≥ δ(2) ≥ ⋯
§ Then: f(OPT) ≤ f(S) + Σ_{i=1…k} δ(i)
¡ Proof:
§ f(OPT) ≤ f(OPT ∪ S)
§ = f(S) + [f(OPT ∪ S) − f(S)]
§ ≤ f(S) + Σ_{i=1…k} [f(S ∪ {t_i}) − f(S)]   (we proved this last time)
§ = f(S) + Σ_{i=1…k} δ(t_i)
§ ≤ f(S) + Σ_{i=1…k} δ(i)
§ ⇒ f(OPT) ≤ f(S) + Σ_{i=1…k} δ(i)
§ In the last step, instead of taking t_i ∈ OPT (of benefit δ(t_i)), we take the best possible elements (the sorted δ(i))
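Evaluating the bound costs one sweep of marginal gains. A sketch for the unit-cost setting with |S| ≤ k:

```python
def data_dependent_bound(S, V, f, k):
    """Upper bound on f(OPT): f(S) plus the k largest marginal gains of S.
    Holds for any S, however it was computed, as long as f is monotone
    and submodular."""
    f_S = f(S)
    deltas = sorted((f(S | {u}) - f_S for u in V if u not in S), reverse=True)
    return f_S + sum(deltas[:k])
```

For example, if f(S) equals 87% of the value this function returns on some input, then S is provably within 87% of optimal on that input, even though the worst-case guarantee is only about 63%; this is how the tighter bounds reported below are obtained.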
¡ Real metropolitan-area water network:
§ V = 21,000 nodes
§ E = 25,000 pipes
¡ We used a cluster of 50 machines for a month
¡ We simulated 3.6 million epidemic scenarios (random locations, random days, random times of the day)
The data-dependent bound is much tighter (it gives a more accurate estimate of the algorithm's performance).

[Figure: solution quality F(A) (higher is better) vs. the number of sensors placed, comparing hill-climbing against the "offline" (1 − 1/e) bound and the data-dependent bound]
¡ Placement heuristics perform much worse

Battle of Water Sensor Networks competition [w/ Ostfeld et al., J. of Water Resource Planning]:

Author            Score
CELF                 26
Sandia               21
U Exeter             20
Bentley Systems      19
Technion (1)         14
Bordeaux             12
U Cyprus             11
U Guelph              7
U Michigan            4
Michigan Tech U       3
Malcolm               2
Proteo                2
Technion (2)          1
¡ Different objective functions give different sensor placements

[Figure: placements optimized for Population affected vs. Detection likelihood]

¡ Here CELF is much faster than greedy hill-climbing!
§ (But there might be datasets/inputs where CELF has the same running time as greedy hill-climbing)
¡ I have 10 minutes. Which news sites should I read to be most up to date?
¡ Who are the most influential news sites?
[Figure (repeated from above): one placement detects all stories but late; another detects the blue & yellow stories soon but misses the red one. We want to read things before others do.]
¡ Crawled 45,000 blogs for 1 year
¡ Obtained 10 million news posts
¡ Identified 350,000 cascades
¡ The cost of a blog is the number of posts it has
¡ The online (data-dependent) bound turns out to be much tighter!
§ Based on the plot: 87% instead of 32.5%

[Figure: solution quality of CELF vs. the old (offline) bound and our (data-dependent) bound]

¡ Heuristics perform much worse!
¡ One really needs to perform the optimization
¡ CELF has 2 sub-algorithms. Which wins?
¡ Unit cost:
§ CELF picks large, popular blogs
¡ Cost-benefit:
§ Cost is proportional to the number of posts
¡ We can do much better when considering costs
¡ Problem: then CELF picks lots of small blogs that participate in few cascades
¡ We pick the best solution that interpolates between the costs
¡ This way we get good solutions with few blogs and few posts

[Figure: score vs. cost; each curve represents a set of solutions S with the same final reward f(S), e.g., f(S) = 0.2, 0.3, 0.4]
¡ We want to generalize well to future (unknown) cascades
¡ Limiting the selection to bigger blogs improves generalization!

[Figure: generalization to future cascades for solutions built from small blogs vs. big blogs]

¡ CELF runs 700 times faster than the simple hill-climbing algorithm
[Leskovec et al., KDD ’07]
¡ The outbreak detection problem in networks
¡ Different ways to formalize the objective function
§ All are submodular
¡ Lazy-Greedy algorithm for optimizing submodular functions
¡ CELF algorithm that combines 2 versions of Lazy-Greedy
¡ Data-dependent bound on the solution quality