CS224W: Social and Information Network Analysis
Jure Leskovec, Stanford University
http://cs224w.stanford.edu
(1) New problem: Outbreak detection
(2) Develop an approximation algorithm
- It is a submodular optimization problem!
(3) Speed up greedy hill-climbing
- Valid for optimizing general submodular functions
(i.e., also works for influence maximization)
(4) Prove a new “data dependent” bound on the solution quality
- Valid for optimizing general submodular functions
(i.e., also works for influence maximization)
10/24/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 2
Given a real city water
distribution network
And data on how
contaminants spread in the network
Detect the
contaminant as quickly as possible
Problem posed by the
US Environmental Protection Agency
[Leskovec et al., KDD ’07]
[Figure: blogs and their posts over time; time-ordered hyperlinks between posts form an information cascade]
Which blogs should one read to detect cascades as effectively as possible?
[Leskovec et al., KDD ’07]
Detect all stories but late.
Want to read things before others do.
Detect blue & yellow soon but miss red.
[Leskovec et al., KDD ’07]
Both of these are instances of the same underlying problem!
Given a dynamic process spreading over
a network
We want to select a set of nodes to detect
the process effectively
Many other applications:
- Epidemics
- Influence propagation
- Network security
Utility of placing sensors:
- Water flow dynamics, demands of households, …
For each subset S ⊆ V compute utility f(S)
[Figure: two placements of sensors S1 to S4 in a network with high-, medium-, and low-impact outbreaks. One placement has high sensing quality, f(S) = 0.9; the other has low sensing quality, f(S) = 0.01.]
Sensor reduces impact through early detection!
Set V of all network junctions
Given:
Graph G(V, E)
Data on how outbreaks spread over G:
- For each outbreak i we know the time T(i, u) when outbreak i contaminates node u
[Leskovec et al., KDD ’07]
Water distribution network (physical pipes and junctions)
Simulator of water consumption and flow (built by mechanical engineers)
We simulate the contamination spread for every possible location.
[Leskovec et al., KDD ’07]
The network of the blogosphere Traces of the information flow
Collect lots of blogs posts and trace hyperlinks to obtain data about information flow from a given blog.
Given:
Graph G(V, E)
Data on how outbreaks spread over G:
- For each outbreak i we know the time T(i, u) when outbreak i contaminates node u
Goal: Select a subset of nodes S that maximizes the expected reward:
$$\max_{S \subseteq V} f(S) = \sum_i P(i)\, f_i(S)$$
subject to: cost(S) < B
- Each term of the sum is the expected reward for detecting outbreak i
- $f_i(S)$ is the penalty reduction: $f_i(S) = \pi_i(\emptyset) - \pi_i(S)$
[Leskovec et al., KDD ’07]
Reward
- (1) Minimize time to detection
- (2) Maximize number of detected propagations
- (3) Minimize number of infected people
Cost (node/location dependent):
- Reading big blogs is more time consuming
- Placing a sensor in a remote location is expensive
[Figure: outbreak i spreading through the network, with reward f(S). Monitoring the blue node saves more people than monitoring the green node.]
Objective functions:
- 1) Time to detection (DT)
- How long does it take to detect a contamination?
- Penalty: $\pi_i(t) = \min\{t, T_{\max}\}$
- 2) Detection likelihood (DL)
- How many contaminations do we detect?
- We incur a penalty if we don’t detect: $\pi_i(t) = 0$, $\pi_i(\infty) = 1$
- 3) Population affected (PA)
- How many people drank contaminated water?
- $\pi_i(t)$ = {# of blogs in cascade $i$ at time $t$}
Observation:
In all cases detecting sooner does not hurt!
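The penalty functions and the expected reward can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the horizon `T_MAX`, the helper names, and the toy outbreak data are all made up here, and the population-affected penalty is omitted because it needs per-outbreak population data.

```python
import math

T_MAX = 10.0  # assumed time horizon (hypothetical value)

def penalty_dt(t):
    """Time to detection (DT): pi_i(t) = min(t, T_max)."""
    return min(t, T_MAX)

def penalty_dl(t):
    """Detection likelihood (DL): 0 if detected, 1 if never detected."""
    return 1.0 if math.isinf(t) else 0.0

def detection_time(times, sensors):
    """Earliest time any sensor sees the outbreak (inf if none does)."""
    return min((times[s] for s in sensors if s in times), default=math.inf)

def expected_reward(outbreaks, probs, sensors, penalty):
    """f(S) = sum_i P(i) * (pi_i(inf) - pi_i(T(S, i)))."""
    return sum(
        p * (penalty(math.inf) - penalty(detection_time(times, sensors)))
        for times, p in zip(outbreaks, probs)
    )

# Toy data (made up): two equally likely outbreaks; each dict maps
# node -> time at which that outbreak reaches the node.
outbreaks = [{"a": 1.0, "b": 3.0}, {"b": 2.0}]
probs = [0.5, 0.5]

reward_a = expected_reward(outbreaks, probs, {"a"}, penalty_dt)
reward_b = expected_reward(outbreaks, probs, {"b"}, penalty_dt)
```

In this toy instance node "b" is reached later than "a" in the first outbreak but is the only node that sees the second one, so it earns the higher expected reward under the DT penalty.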
Observation: Diminishing returns
[Figure: with placement S = {s1, s2}, adding a new sensor s’ helps a lot; with placement S’ = {s1, s2, s3, s4}, adding s’ helps very little]
[Leskovec et al., KDD ’07]
Claim: For all $A \subseteq B \subseteq V$ and sensors $s \in V \setminus B$:
$$f(A \cup \{s\}) - f(A) \ge f(B \cup \{s\}) - f(B)$$
Proof:
- Fix cascade $i$
- Show $f_i(A) = \pi_i(\infty) - \pi_i(T(A, i))$ is submodular
- Consider $A \subseteq B \subseteq V$ and sensor $s \in V \setminus B$
- When does node $s$ detect cascade $i$? 3 cases:
- (1) $T(s, i) \ge T(A, i)$ (and hence $T(s, i) \ge T(B, i)$, since $B \supseteq A$ detects no later than $A$): then $f_i(A \cup \{s\}) = f_i(A)$ and $f_i(B \cup \{s\}) = f_i(B)$, and so
$$f_i(A \cup \{s\}) - f_i(A) = 0 = f_i(B \cup \{s\}) - f_i(B)$$
- Since $s$ detects too late, nobody benefits
- (2) $T(B, i) \le T(s, i) < T(A, i)$: then
$$f_i(A \cup \{s\}) - f_i(A) \ge 0 = f_i(B \cup \{s\}) - f_i(B)$$
- $s$ detects sooner than any node in $A$ but no sooner than the earliest node in $B$. So $s$ only helps improve the solution $A$.
- (3) $T(s, i) < T(B, i)$: then
$$f_i(A \cup \{s\}) - f_i(A) = [\pi_i(\infty) - \pi_i(T(s, i))] - f_i(A) \ge [\pi_i(\infty) - \pi_i(T(s, i))] - f_i(B) = f_i(B \cup \{s\}) - f_i(B)$$
- The inequality is due to the non-decreasingness of $f_i(\cdot)$, i.e., $f_i(A) \le f_i(B)$
- So, $f_i(\cdot)$ is submodular!
So, $f(\cdot)$ is also submodular, since it is a nonnegative weighted sum of the $f_i$:
$$f(S) = \sum_i P(i)\, f_i(S)$$
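The diminishing-returns inequality can be checked by brute force on a tiny instance. The sketch below is not part of the lecture: the function names, the horizon `t_max`, and the toy detection times are assumptions, and the check enumerates all pairs of subsets, so it is only practical for a handful of nodes.

```python
import math
from itertools import combinations

def f_i(times, A, t_max=10.0):
    """Penalty reduction for one cascade under the time-to-detection penalty."""
    t = min((times[s] for s in A if s in times), default=math.inf)
    return t_max - min(t, t_max)

def is_submodular(f, ground):
    """Check f(A+s) - f(A) >= f(B+s) - f(B) for all A <= B and s outside B."""
    subsets = [frozenset(c) for r in range(len(ground) + 1)
               for c in combinations(sorted(ground), r)]
    for A in subsets:
        for B in subsets:
            if not A <= B:
                continue
            for s in ground - B:
                gain_a = f(A | {s}) - f(A)
                gain_b = f(B | {s}) - f(B)
                if gain_a < gain_b - 1e-12:
                    return False
    return True

# Toy cascade (made up): node "d" is never reached.
ground = frozenset("abcd")
times = {"a": 1.0, "b": 4.0, "c": 2.0}

detection_ok = is_submodular(lambda A: f_i(times, A), ground)
# A function with increasing returns fails the check.
square_ok = is_submodular(lambda A: len(A) ** 2, ground)
```

Here `detection_ok` comes out true, matching the proof, while `square_ok` is false because the gain of adding an element to a larger set is bigger for |A|^2.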
What do we know about optimizing submodular functions?
- Hill-climbing (i.e., greedy) is near optimal: $(1 - 1/e) \cdot OPT$
But:
- (1) This only works for the unit-cost case! (each sensor costs the same)
- For us each sensor $s$ has cost $c(s)$
- (2) Hill-climbing algorithm is slow
- At each iteration we need to re-evaluate marginal gains of all nodes
- Runtime $O(|V| \cdot K)$ for placing $K$ sensors
[Figure: hill-climbing over nodes a to e plotted by reward: add the sensor with the highest marginal gain]
Consider:
Hill-climbing that ignores cost
- Ignore sensor cost
- Repeatedly select sensor with highest marginal gain
- Do this until the budget is exhausted
How well does this work?
[Leskovec et al., KDD ’07]
Bad example:
- $n$ sensors, budget $B$
- $s_1$: reward $r$, cost $B$
- $s_2, \dots, s_n$: reward $r - \varepsilon$, cost 1
- Hill-climbing always prefers the more expensive sensor $s_1$ with reward $r$ (and exhausts the budget)
- It never selects the cheaper sensors with reward $r - \varepsilon$
→ For variable cost it can fail arbitrarily badly!
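The bad example can be reproduced numerically. The sketch below assumes rewards are additive across sensors (the slide does not say so explicitly), in which case the marginal gain of a sensor equals its own reward; the concrete values of n, B, r, and ε are made up.

```python
def greedy_ignore_cost(reward, cost, budget):
    """Pick sensors by highest reward; cost only decides what still fits."""
    chosen, spent = [], 0.0
    # Highest reward first; sensor costs play no role in the ordering.
    for s in sorted(reward, key=reward.get, reverse=True):
        if spent + cost[s] <= budget:
            chosen.append(s)
            spent += cost[s]
    return chosen

# Assumed instance: n = 6 sensors, budget B = 5, r = 1.0, eps = 0.01.
n, budget, r, eps = 6, 5.0, 1.0, 0.01
reward = {"s1": r}
cost = {"s1": budget}
for j in range(2, n + 1):
    reward[f"s{j}"] = r - eps
    cost[f"s{j}"] = 1.0

picked = greedy_ignore_cost(reward, cost, budget)
greedy_total = sum(reward[s] for s in picked)
# Spending the same budget on the five cheap sensors instead:
cheap_total = sum(reward[f"s{j}"] for j in range(2, n + 1))
```

With these numbers the cost-blind greedy ends up with only `s1`, while the five cheap sensors together are worth almost five times as much; the gap grows with the budget.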
Idea: What if we optimize benefit-cost ratio?
[Leskovec et al., KDD ’07]
Greedily pick the sensor $s_i$ that maximizes the benefit-to-cost ratio:
$$s_i = \arg\max_{s \in V} \frac{f(A_{i-1} \cup \{s\}) - f(A_{i-1})}{c(s)}$$
Benefit-cost ratio can also fail arbitrarily badly!
Consider: budget $B$:
- 2 sensors $s_1$ and $s_2$:
- Costs: $c(s_1) = \varepsilon$, $c(s_2) = B$
- Only 1 cascade: $f(s_1) = 2\varepsilon$, $f(s_2) = B$
- Then the benefit-cost ratios are $f(s_1)/c(s_1) = 2$ and $f(s_2)/c(s_2) = 1$
- So, we first select $s_1$ and then cannot afford $s_2$
→ We get reward $2\varepsilon$ instead of $B$. Now send $\varepsilon \to 0$ and we get an arbitrarily bad solution!
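A small numeric sketch of this failure case follows. The function name and the values of B and ε are assumptions; because there is only one cascade, the reward of a set is modeled here as the best single reward in it, which is also an assumption made for illustration.

```python
def benefit_cost_greedy(f, cost, budget):
    """Repeatedly add the affordable sensor with the best gain/cost ratio."""
    S, spent = set(), 0.0
    while True:
        affordable = [s for s in cost
                      if s not in S and spent + cost[s] <= budget]
        if not affordable:
            return S
        best = max(affordable, key=lambda s: (f(S | {s}) - f(S)) / cost[s])
        S.add(best)
        spent += cost[best]

# Assumed instance: budget B = 10, eps = 0.001.
budget, eps = 10.0, 0.001
cost = {"s1": eps, "s2": budget}
single_reward = {"s1": 2 * eps, "s2": budget}

def f(S):
    # One cascade: only the best detector in S matters (assumption).
    return max((single_reward[s] for s in S), default=0.0)

picked = benefit_cost_greedy(f, cost, budget)
```

The ratio rule takes `s1` first (ratio 2 versus 1), after which `s2` no longer fits in the budget, so the final reward is 2ε rather than B.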
[Leskovec et al., KDD ’07]
CELF (cost-effective lazy forward-selection)
A two-pass greedy algorithm:
- Set (solution) $S'$: use benefit-cost greedy
- Set (solution) $S''$: use unit-cost greedy
- Final solution: $S = \arg\max\big(f(S'), f(S'')\big)$
How far is CELF from (unknown) optimal
solution?
Theorem: CELF is near optimal
- CELF achieves ½(1-1/e) factor approximation!
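The two-pass idea can be sketched as below. This is a simplified illustration, not the authors' implementation: it omits the lazy evaluation that gives CELF its speed, the function names are hypothetical, and both toy instances (one where each greedy variant fails on its own) use made-up numbers.

```python
def budgeted_greedy(f, cost, budget, use_ratio):
    """Greedy under a budget; ranks by gain/cost if use_ratio, else by gain."""
    S, spent = set(), 0.0
    while True:
        affordable = [s for s in cost
                      if s not in S and spent + cost[s] <= budget]
        if not affordable:
            return S

        def score(s):
            gain = f(S | {s}) - f(S)
            return gain / cost[s] if use_ratio else gain

        best = max(affordable, key=score)
        S.add(best)
        spent += cost[best]

def two_pass_greedy(f, cost, budget):
    """Run both greedy variants and keep the solution with the higher reward."""
    s_ratio = budgeted_greedy(f, cost, budget, use_ratio=True)
    s_unit = budgeted_greedy(f, cost, budget, use_ratio=False)
    return max(s_ratio, s_unit, key=f)

# Instance 1 (assumed): single cascade, benefit-cost greedy alone fails.
cost1 = {"s1": 0.001, "s2": 10.0}
best1 = {"s1": 0.002, "s2": 10.0}
f1 = lambda S: max((best1[s] for s in S), default=0.0)
result1 = two_pass_greedy(f1, cost1, 10.0)

# Instance 2 (assumed): additive rewards, unit-cost greedy alone fails.
cost2 = {"s1": 5.0, "c1": 1.0, "c2": 1.0, "c3": 1.0, "c4": 1.0, "c5": 1.0}
rew2 = {"s1": 1.0, "c1": 0.99, "c2": 0.99, "c3": 0.99, "c4": 0.99, "c5": 0.99}
f2 = lambda S: sum(rew2[s] for s in S)
result2 = two_pass_greedy(f2, cost2, 5.0)
```

Taking the better of the two passes recovers the good solution in both instances, which is the intuition behind the combined guarantee.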
[Leskovec et al., KDD ’07]
What do we know about optimizing submodular functions?
- Hill-climbing (i.e., greedy) is near optimal: $(1 - 1/e) \cdot OPT$
But:
- (2) Hill-climbing algorithm is slow!
- At each iteration we need to re-evaluate marginal gains of all nodes
- Runtime $O(|V| \cdot K)$ for placing $K$ sensors
[Figure: hill-climbing over nodes a to e plotted by reward: add the sensor with the highest marginal gain]
In round $i + 1$: so far we picked $S_i = \{s_1, \dots, s_i\}$
- Now pick $s_{i+1} = \arg\max_u f(S_i \cup \{u\}) - f(S_i)$
- This is our old friend, the greedy hill-climbing algorithm. It maximizes the “marginal benefit” $\delta_u(S_i) = f(S_i \cup \{u\}) - f(S_i)$
By the submodularity property:
$$f(S_i \cup \{u\}) - f(S_i) \ge f(S_j \cup \{u\}) - f(S_j) \quad \text{for } i < j$$
Observation: By submodularity, for every $u$: $\delta_u(S_i) \ge \delta_u(S_j)$ for $i \le j$, since $S_i \subseteq S_j$
Marginal benefits $\delta_u$ only shrink (as $S$ grows)!
Activating node $u$ in step $i$ helps at least as much as activating it at step $j$ ($j > i$) [Leskovec et al., KDD ’07]
Idea:
- Use δi as upper-bound on δj (j>i)
Lazy hill-climbing:
- Keep an ordered list of marginal
benefits δi from previous iteration
- Re-evaluate δi only for top node
- Re-sort and prune
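Lazy hill-climbing is commonly implemented with a priority queue of stale marginal gains. The sketch below handles the unit-cost case only; the function name, the evaluation counter, and the toy coverage function are assumptions for illustration. An entry is accepted only when its gain was computed against the current set, otherwise it is re-evaluated and pushed back.

```python
import heapq

def lazy_greedy(f, ground, k):
    """Select k elements; re-evaluate a gain only when it reaches the top."""
    S = set()
    f_S = f(S)
    # Max-heap via negated gains. Each entry: (-gain, node, |S| when computed).
    heap = [(-(f({u}) - f_S), u, 0) for u in ground]
    heapq.heapify(heap)
    evaluations = len(heap)
    while heap and len(S) < k:
        neg_gain, u, stamp = heapq.heappop(heap)
        if stamp == len(S):
            # Gain is fresh and tops all (upper-bound) gains: take u.
            S.add(u)
            f_S = f(S)
        else:
            # Stale upper bound: recompute against the current S and re-insert.
            gain = f(S | {u}) - f_S
            evaluations += 1
            heapq.heappush(heap, (-gain, u, len(S)))
    return S, evaluations

# Toy coverage function (made up): f(S) = number of items covered by S.
covers = {"a": {1, 2, 3, 4}, "b": {3, 4, 5}, "c": {5, 6}, "d": {1}, "e": {6}}

def coverage(S):
    return len(set().union(*(covers[u] for u in S)))

selected, evaluations = lazy_greedy(coverage, covers, 2)
```

On this toy instance the lazy variant needs 7 marginal-gain evaluations to pick two nodes, while plain hill-climbing would evaluate all 5 nodes in the first round and the remaining 4 in the second.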
[Figure: three steps of lazy hill-climbing on nodes a to e, kept sorted by marginal gain and re-sorted after each re-evaluation, going from S1 = {a} to S2 = {a, b}. Throughout, f(S ∪ {u}) - f(S) ≥ f(T ∪ {u}) - f(T) for S ⊆ T.]
[Leskovec et al., KDD ’07]
Back to the solution quality! The (1-1/e) bound for submodular functions
is the worst case bound (worst over all possible inputs)
Data dependent bound:
- Value of the bound depends on the input data
- On “easy” data, hill climbing may do better than 63%
- Can we say something about the solution
quality when we know the input data?
Suppose $S$ is some solution to $f(S)$ s.t. $|S| \le k$
- $f(S)$ is monotone & submodular
Let $T = \{t_1, \dots, t_k\}$ be the OPT solution
For each $u \notin S$ let $\delta_u = f(S \cup \{u\}) - f(S)$
Order $\delta_u$ so that $\delta_1 \ge \delta_2 \ge \dots \ge \delta_{n-|S|}$
Then:
$$f(T) \le f(S) + \sum_{i=1}^{k} \delta_i$$
- Note:
- This is a data dependent bound (δu depend on input data)
- Bound holds for any algorithm
- Makes no assumption about how 𝑇 is computed
- For some inputs it can be very “loose” (worse than 63%)
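Computing the bound takes one pass over the nodes outside the solution. The sketch below uses a hypothetical helper name and a made-up coverage function; it works for any solution `S`, however that solution was obtained.

```python
def data_dependent_bound(f, ground, S, k):
    """Upper bound on OPT: f(S) plus the k largest marginal gains w.r.t. S."""
    f_S = f(S)
    gains = sorted((f(S | {u}) - f_S for u in ground if u not in S),
                   reverse=True)
    return f_S + sum(gains[:k])

# Toy coverage function (made up): f(S) = number of items covered by S.
covers = {"a": {1, 2, 3, 4}, "b": {3, 4, 5}, "c": {5, 6}, "d": {1}, "e": {6}}

def coverage(S):
    return len(set().union(*(covers[u] for u in S)))

# A good solution: the bound certifies that it is optimal for k = 2.
bound_good = data_dependent_bound(coverage, covers, {"a", "c"}, 2)
# A poor solution: the bound still holds but is loose.
bound_poor = data_dependent_bound(coverage, covers, {"d", "e"}, 2)
```

For the solution {a, c} every remaining marginal gain is zero, so the bound equals f(S) = 6 and the solution is provably optimal; for {d, e} the bound is 8, which is valid but looser than the true optimum of 6.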
Claim:
- For each $u \notin S$ let $\delta_u = f(S \cup \{u\}) - f(S)$
- Order $\delta_u$ so that $\delta_1 \ge \delta_2 \ge \dots \ge \delta_{n-|S|}$
- Then: $f(T) \le f(S) + \sum_{i=1}^{k} \delta_i$
Proof:
$$f(T) \le f(T \cup S) = f(S) + \sum_{i=1}^{k} \big[ f(S \cup \{t_1, \dots, t_i\}) - f(S \cup \{t_1, \dots, t_{i-1}\}) \big]$$
$$\le f(S) + \sum_{i=1}^{k} \big[ f(S \cup \{t_i\}) - f(S) \big] = f(S) + \sum_{i=1}^{k} \delta_{t_i} \le f(S) + \sum_{i=1}^{k} \delta_i$$
- The first inequality uses monotonicity, the second uses submodularity
- Last step: instead of taking $t_i \in T$ (of benefit $\delta_{t_i}$), we take the best possible element ($\delta_i$) (we proved this last time)
$$\Rightarrow f(T) \le f(S) + \sum_{i=1}^{k} \delta_i$$
Real metropolitan area
water network
- V = 21,000 nodes
- E = 25,000 pipes
Use a cluster of 50 machines for a month
Simulate 3.6 million epidemic scenarios (152 GB of epidemic data)
By exploiting sparsity we fit it into main memory (16GB)
[Leskovec et al., KDD ’07]
Data-dependent bound is much tighter
(gives more accurate estimate of alg. performance)
[Figure: solution quality F(A) (higher is better) versus number of sensors placed, comparing hill climbing with the “offline” (1-1/e) bound and the data-dependent bound]
Placement heuristics perform much worse
Battle of Water Sensor Networks competition
[w/ Ostfeld et al., J. of Water Resource Planning]
Author             Score
CELF               26
Sandia             21
U Exeter           20
Bentley systems    19
Technion (1)       14
Bordeaux           12
U Cyprus           11
U Guelph           7
U Michigan         4
Michigan Tech U    3
Malcolm            2
Proteo             2
Technion (2)       1
Different objective functions give different
sensor placements
[Figure: sensor placements under two objectives: population affected and detection likelihood]
CELF is 10 times faster than greedy
hill-climbing!
I have 10 minutes. Which blogs should I read to be most up to date?
Who are the most influential bloggers?
Detect all stories but late.
Want to read things before others do.
Detect blue & yellow soon but miss red.
Online bound is much tighter!
- 13% instead of 37%
[Figure: CELF solution quality compared with the old bound and our bound]
[Leskovec et al., KDD ’07]
Heuristics perform much worse!
One really needs to perform the optimization
[Leskovec et al., KDD ’07]
CELF has 2 sub-algorithms. Which wins?
Unit cost:
- CELF picks large popular blogs: instapundit.com, michellemalkin.com
Cost-benefit:
- Cost proportional to the number of posts
We can do much better when considering costs
[Leskovec et al., KDD ’07]
Problem: Then CELF picks lots of small blogs that participate in few cascades
We pick the best solution that interpolates between the costs
We can get good solutions with few blogs and few posts
Each curve represents a set of solutions S with the same final reward f(S)
[Leskovec et al., KDD ’07]
[Figure: curves of equal score, f(S) = 0.4, f(S) = 0.3, and f(S) = 0.2]
We want to generalize well to future (unknown)
cascades
Limiting selection to bigger blogs improves
generalization!
CELF runs 700 times faster than the simple hill-climbing algorithm
[Leskovec et al., KDD ’07]
Observations
- Small diameter, Edge clustering
- Patterns of signed edge creation
- Viral Marketing, Blogosphere, Memetracking
- Scale-Free
- Densification power law, Shrinking diameters
- Strength of weak ties, Core-periphery
Models
- Erdös-Renyi model, Small-world model
- Structural balance, Theory of status
- Independent cascade model, Game theoretic model
- Preferential attachment, Copying model
- Microscopic model of evolving networks
- Kronecker Graphs
Algorithms
- Decentralized search
- Models for predicting edge signs
- Influence maximization, Outbreak detection, LIM
- PageRank, Hubs and authorities
- Link prediction, Supervised random walks
- Community detection: Girvan-Newman, Modularity