Rumours in Graphs
Jilles Vreeken
24 July 2015
Service Announcement #1
The Exam.
1) 20 minutes per person
2) Questions can be on any topic covered in
   1) the lectures
   2) the required reading
   3) the assignments (1 topic per assignment, your choice)
3) Grade will be based on your performance in the exam, minus any Bonus points you may have acquired.
4) Timeslots will be mailed today.
Introduction Patterns Correlation and Causation Graphs Wrap-up + <ask-me-anything> (Subjective) Interestingness
Introduction Patterns Correlation and Causation Graphs Wrap-up + <ask-us-anything> (Subjective) Interestingness
* preferably related to TADA, data mining, machine learning, science, the world, etc.
ice cream
(Prakash, Vreeken & Faloutsos, ICDM 2012)
Susceptible-Infected (SI) Model
[AJPH 2007] CDC data: Visualization of the first 35 tuberculosis (TB) patients and their 1039 contacts
Shah and Zaman, IEEE TIT, 2011
One seed. Provably finds the MLE seed for the SI process on d-regular trees
Lappas et al., KDD, 2010.
k seeds (takes k as input)
Infected graph assumed to be in steady-state
IC model
1) use MDL for number of seeds
2) for a given number:
exoneration = centrality + penalty
Running time linear! (in edges and nodes)
Induction by Compression
Related to Bayesian approaches
MDL = Model + Data
Cost of a Model:
scoring the seed-set
Number of possible |S|-sized sets
Encoding integer |S|
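The two model-cost terms named above (the bits for the size of the seed set, plus the bits to pick one set of that size) can be made concrete. The following is a minimal sketch, not the authors' implementation: it assumes Rissanen's universal code for integers for the first term, which is a common MDL choice but is not stated on the slide, and the function names are illustrative.

```python
from math import comb, log2


def universal_integer_bits(n: int) -> float:
    """Code length in bits of an integer n >= 1 under Rissanen's universal code.

    Assumption: the slide only says "encoding integer"; this particular
    code (log* n plus a normalising constant) is our choice.
    """
    if n < 1:
        raise ValueError("n must be >= 1")
    bits = log2(2.865064)  # normalising constant c0
    term = log2(n)
    while term > 0:  # add log2(n) + log2(log2(n)) + ... while positive
        bits += term
        term = log2(term)
    return bits


def seed_set_bits(num_nodes: int, num_seeds: int) -> float:
    """Bits for the seed-set size plus bits to pick one set of that size."""
    choose_set = log2(comb(num_nodes, num_seeds))
    return universal_integer_bits(num_seeds) + choose_set
```

For example, `seed_set_bits(1000, 3)` is larger than `seed_set_bits(1000, 1)`, so extra seeds are only worth their cost if they make the data cheaper to describe.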
Encoding the Data: Propagation Ripples
Figure panels: Original Graph, Infected Snapshot, Ripple R1, Ripple R2
How the ‘frontier’ advances
How long is the ripple
Ripple R
Given k, quickly identify high-quality set S
Given set S, optimize the ripple R
exoneration
eigenvector of the smallest eigenvalue of the
Laplacian sub-matrix
analyze a Constrained SI epidemic
Exonerate neighbors
Repeat
Optimizing R
Get the MLE ripple!
Finally use MDL score to tell us the best set
NETSLEUTH: Linear running time in nodes and edges
Ripple R
MDL based Overlap based
(JD = Jaccard distance)
Figure panels: True vs. NETSLEUTH seeds, for One Seed, Two Seeds, Three Seeds
Given: Graph and Infections
Find: Best ‘Culprits’
Two-step solution
use MDL for number of seeds
for a given number: exoneration
NetSleuth:
Linear running time in nodes and edges
Epidemiology
Public-health surveillance
Surveillance Pyramid [Nishiura+, PLoS ONE 2011]
Pyramid labels: CNN headlines, CDC, Lab, Hospital, Not sure, Not sure
Each level has a certain probability of missing some truly infected people
Twitter: due to the uniform samples [Morstatter+ 2013],
the relevant ‘infected’ tweets may be missed
Figure labels: Tweets, Sampling, Sampled Tweets, Missing
(Sundareisan et al. SDM’15)
GIVEN:
Graph G(V, E) from historical data
Infected set D ⊂ V, sampled (p%) and incomplete
Infectivity β of the virus (assumed to follow the SI model)
FIND:
Seed set S, i.e. patient zeros/culprits
Set C− (the missing infected nodes)
Ripple R (the order of infections)
Costenbader & Valente 2003; Kossinets 2006, Borgatti et al. 2006
study the effect of sampling on macro level network statistics
Adiga et al. 2013
sensitivity of total infections to noise in network structure
Sadikov et al., WSDM, 2011
correct for sampling for macro level cascade statistics
Motivation---Introduction
Problem Definition
Our Approach
  MDL
  Decoupling
  Finding S given C
  Finding C given S
Experiments
Conclusion
Sender: Graph G(V, E), Infectivity (β), Sampling (p), Seeds (S), Infected set (D), Ripple (R), Missing nodes (C−)
Receiver: Graph G(V, E), Infectivity (β), Sampling (p)
The Model
Data given model
How to score a seed set (S)
How to score the ripple?
Number of possible |S|-sized sets
Encoding integer |S|
Scoring a ripple (R)
Figure panels: Original Graph, Infected Snapshot, Ripple R1, Ripple R2
Ripple cost
How the ‘frontier’ advances
How long is the ripple
Ripple R
Now you know too much – for you to know what was D we need to transmit which are the missed nodes C− (green nodes)
Detail: γ = 1 – p, i.e. the probability that an infected node is missed
Finally, we have
Our problem is now to find those S, R, C− that minimize it
Our Approach: Decoupling
The two problems are
1) finding the seeds and ripple (S, R)
2) finding the missing nodes (C−)
Can we decouple these problems?
Finding seeds depends on missing nodes.
NETSLEUTH: no missing nodes as input, no missing nodes as output
NETFILL: correctly fills in the nodes missing from input
Legend: Missing nodes, Seed, Infected node
Finding missing nodes also depends on seeds.
Figure: nodes A and B, seed S. Legend: Not Infected, Infected, Seed
Most probably A was missed
Our Approach: Finding S given C
1) Suppose an oracle gives us the missing nodes (C−)
2) We have the complete infected set (D ∪ C−)
3) Apply NETSLEUTH directly
NO SAMPLING INVOLVED
And it will give us the seed set!
Figure: Applying NetSleuth* on Oracle’s Answer. Legend: Missing nodes, Seed, Infected node
Our Approach: Finding C given S
Oracle gives us S, find C−
The Naïve Approach:
Find all possible C−
Pick the best one according to MDL
Sadly, this is infeasible in practice, as the number of candidate sets we would have to score is exponential in |V ∖ D|
Sub-problem 1
|Seeds| = 1 |Missing nodes| = 1
Sub-problem 2
Finding the right number of missing nodes.
Sub-problem 3
|Seeds| > 1
The best node is the one that makes the seed S most likely
we use empirical risk as the measure
Sanity Check: ideally risk should be 0
So, the best hidden hazard is
Using some results in Prakash et al. 2012, we can rewrite it as
u1 is the eigenvector of the smallest eigenvalue of the Laplacian submatrix of D
Laplacian = Deg(G) – A(G)
LD = take only the rows and columns for nodes in D (Laplacian submatrix!)
u1 (eigenvector of the smallest eigenvalue)
Figure: Degree − Adjacency = Laplacian; keeping only the nodes in D gives the Laplacian Submatrix; its smallest eigenvalue λ gives the Eigenvector
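The recipe above (degree matrix minus adjacency matrix, restricted to the infected nodes, then the eigenvector of the smallest eigenvalue) can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes an undirected graph given as a dict of neighbour lists, a non-empty infected set with sortable node ids, and it uses plain shifted power iteration in place of a library eigensolver. The function name is hypothetical.

```python
from math import sqrt


def smallest_eigvec_laplacian_submatrix(neighbors, infected, iters=1000, tol=1e-12):
    """Return (eigenvalue, {node: entry}) for the smallest eigenpair of the
    Laplacian restricted to the infected nodes.

    Degrees are taken in the full graph, so edges leaving the infected set
    still count on the diagonal.
    """
    nodes = sorted(infected)
    pos = {n: k for k, n in enumerate(nodes)}
    deg = {n: len(neighbors[n]) for n in nodes}
    # shift*I - L is positive definite (Gershgorin: eigenvalues of L <= 2*max degree),
    # so its dominant eigenvector is the smallest eigenvector of L.
    shift = 2.0 * max(deg.values()) + 1.0
    u = [1.0 / sqrt(len(nodes))] * len(nodes)
    for _ in range(iters):
        v = []
        for n in nodes:
            acc = (shift - deg[n]) * u[pos[n]]
            acc += sum(u[pos[m]] for m in neighbors[n] if m in pos)
            v.append(acc)
        norm = sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        converged = max(abs(a - b) for a, b in zip(u, v)) < tol
        u = v
        if converged:
            break
    # Rayleigh quotient u^T L u gives the eigenvalue estimate.
    lam = sum(
        u[pos[n]] * (deg[n] * u[pos[n]] - sum(u[pos[m]] for m in neighbors[n] if m in pos))
        for n in nodes
    )
    return lam, dict(zip(nodes, u))
```

On the path 0-1-2-3-4 with nodes 1, 2, 3 infected, the middle node receives the largest entry, matching the intuition that it is the most central infected node.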
But, how to solve this quickly?
Choose n* such that
$n^* = \arg\max_{n} \sum_{i \in nb(n)} u_1(i)$
this measures how connected a node n is to centrally located infected nodes w.r.t. S in D
depends on seed S as well as the structure of the graph (!)
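The selection rule above can be sketched as follows: every node outside the infected set is scored by summing the eigenvector entries of its infected neighbours, and the highest-scoring node is chosen. This is a hedged illustration; restricting candidates to uninfected nodes, the dict-based inputs, and the function names are assumptions made here, not details from the slides.

```python
def z_scores(neighbors, infected, u1):
    """Score each uninfected node by the summed u1 entries of its infected neighbours.

    neighbors: dict node -> iterable of adjacent nodes
    infected:  set of observed infected nodes
    u1:        dict infected node -> eigenvector entry
    """
    scores = {}
    for node, nbrs in neighbors.items():
        if node in infected:
            continue  # only nodes not observed as infected can be "missing"
        scores[node] = sum(u1[i] for i in nbrs if i in infected)
    return scores


def best_missing_node(neighbors, infected, u1):
    """Return the candidate with the highest score (first one wins on ties)."""
    scores = z_scores(neighbors, infected, u1)
    return max(scores, key=scores.get)
```

A node touching two moderately central infected nodes can outrank a node touching a single infected node, which is the sense in which the score depends on the structure of the graph.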
MDL! Add nodes based on Z-scores till MDL increases.
but, but, MDL is not convex! yes, yes, but it has convex like behavior!
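The stopping rule described here (keep adding candidates in score order until the description length goes up) can be sketched generically. This is only an illustration: `cost` stands in for the MDL score, which is not spelled out in these slides, and the function name is hypothetical. Because the search stops at the first increase, it relies on the convex-like behaviour mentioned above.

```python
def add_until_cost_increases(ranked_candidates, cost):
    """Greedily extend the chosen set while the cost does not increase.

    ranked_candidates: candidates sorted from most to least promising
    cost:              callable mapping a list of chosen nodes to a number
                       (hypothetical stand-in for the total MDL cost)
    """
    chosen = []
    best = cost(chosen)
    for node in ranked_candidates:
        trial = cost(chosen + [node])
        if trial > best:  # the cost went up: stop at the previous set
            break
        chosen.append(node)
        best = trial
    return chosen, best
```

With a toy cost such as `lambda c: (len(c) - 2) ** 2 + 10`, the loop keeps the first two candidates and stops when the third one makes the cost rise.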
Using z-scores: Missing nodes near single seed Ideal: Missing nodes near both seeds
Exonerate previous seeds
consider previous seeds uninfected and re-calculate u1
the blame is transferred to the locality of the older seed
$\text{complete Z-score}(n) = \max_{\text{over all seeds}} \text{Z-score}(n)$
maximum as we need high quality missing nodes
take nodes with top-k complete Z-scores
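Combining the per-seed scores can be sketched as below: each node keeps the maximum of its scores over all seeds, and the highest-ranked nodes are returned. This is a minimal sketch that assumes the per-seed scores have already been computed (one exoneration round per seed); the input layout and function names are illustrative.

```python
def complete_z_scores(per_seed_scores):
    """Take, for every node, the maximum score it received over all seeds.

    per_seed_scores: dict seed -> {node: score}, one entry per exoneration round
    """
    complete = {}
    for scores in per_seed_scores.values():
        for node, z in scores.items():
            complete[node] = max(z, complete.get(node, float("-inf")))
    return complete


def top_nodes(complete, k):
    """Return the k nodes with the highest complete score, best first."""
    return sorted(complete, key=complete.get, reverse=True)[:k]
```

Taking the maximum rather than the sum lets a node that is strongly tied to one seed outrank a node that is weakly tied to all of them.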
Running time: sub-quadratic in practice
Experiments
Real and Synthetic graphs
Real and Simulated cascades
Graphs
GRID
AS-OREGON
FLIXSTER
a friendship network with movie ratings
cascade: the same movie rating from friends
MEME-TRACKER
hl-mt and hl-hl Webpage Meme Time Citation (Gomez-Rodriguez et al. KDD 2010)
NETSLEUTH Simulation
Simulate the SI process till we reach D
Seeds = Input.
Missing nodes = I ∖ D
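A discrete-time SI simulation of the kind this baseline relies on can be sketched as follows. This is a hedged illustration, not the experimental code: it assumes that in every step each infected node independently infects each susceptible neighbour with the given infectivity, and that the run stops once at least the target number of nodes is infected (so the last step may overshoot). All names are illustrative.

```python
import random


def simulate_si(neighbors, seeds, beta, target_size, rng=None):
    """Run a discrete-time SI process from the seeds.

    Returns (infected, ripple), where ripple lists the newly infected
    nodes per step, starting with the seeds themselves.
    """
    if not 0.0 < beta <= 1.0:
        raise ValueError("beta must be in (0, 1]")
    rng = rng or random.Random()
    infected = set(seeds)
    ripple = [set(seeds)]
    while len(infected) < target_size:
        at_risk = {n for i in infected for n in neighbors[i]} - infected
        if not at_risk:
            break  # nothing left to infect in the reachable component
        newly = set()
        for node in infected:
            for nbr in neighbors[node]:
                if nbr not in infected and rng.random() < beta:
                    newly.add(nbr)
        if newly:
            infected |= newly
            ripple.append(newly)
    return infected, ripple
```

Given the simulated infected set and the observed infections, the baseline's missing nodes are then simply the set difference, e.g. `simulated - observed`.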
Frontier
Nodes “next in line” to be infected.
those at the boundary (frontier) of infected set
Seeds = Find seeds given missing nodes (NETSLEUTH on D + Frontier)
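The frontier used by this baseline, i.e. the uninfected nodes adjacent to at least one infected node, can be computed directly from the neighbour lists. A minimal sketch, with an illustrative function name and a dict-of-neighbours graph assumed:

```python
def frontier(neighbors, infected):
    """Return the uninfected nodes that have at least one infected neighbour."""
    return {
        nbr
        for node in infected
        for nbr in neighbors.get(node, ())
        if nbr not in infected
    }
```

On the path 0-1-2-3-4 with nodes 1 and 2 infected, the frontier is {0, 3}; the baseline treats exactly these nodes as the missing ones.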
Figure panels: SIMULATION, NETSLEUTH, FRONTIER, NETFILL (each showing Seeds and Missing nodes). Legend: Correct, FP, FN, Seeds, Infected
For the accuracy of C− (missing nodes)
Jaccard, precision, recall, f-measure do not consider TN.
MCC: Matthews correlation coefficient
$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$
Closer to 1 the better
Confusion matrix
Prediction   Truly missing   Truly not missing
missing      TP              FP
found        FN              TN
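The Matthews correlation coefficient can be computed from the four confusion-matrix counts as sketched below. Returning 0 when the denominator is zero is a common convention assumed here, and the set-based helper and its argument names are illustrative rather than taken from the slides.

```python
from math import sqrt


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # convention (assumed): undefined cases score 0
    return (tp * tn - fp * fn) / denom


def mcc_from_sets(predicted_missing, truly_missing, candidates):
    """Build the confusion matrix over all candidate nodes, then score it."""
    tp = len(predicted_missing & truly_missing)
    fp = len(predicted_missing - truly_missing)
    fn = len(truly_missing - predicted_missing)
    tn = len(candidates - predicted_missing - truly_missing)
    return mcc(tp, fp, fn, tn)
```

Unlike Jaccard or the F-measure, the true negatives enter the score, which matters here because most candidate nodes are not missing.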
For seeds (S) and ripple (R)
Q-score
$Q = \frac{L(D, S_{algorithm}, R_{algorithm})}{L(D, S_{true}, R_{true})}$
from ‘literature’ (aka our previous paper), the closer to 1 the better
See paper for more experiments e.g. scalability, robustness etc.
96,000 node graph for the meme “State of the economy”
What did we find? Truly missing websites!
Examples include “www.nbcbayarea.com”, “chicagotribune.com” and some blog posts.
Conclusion
Given: Graph and sampled infections
Find: missing infections and culprits
Formulated the problem
Using MDL
Two-stage alternating
Find best seeds given missing nodes
Find best missing nodes given seeds
NetFill
Subquadratic (near-linear in many cases)
Outperforms baselines in real and synthetic data
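The two-stage alternation summarised above can be sketched as a generic loop. This is only a schematic: the three callables are hypothetical stand-ins for the seed finder, the missing-node finder and the MDL score, and starting from an empty set of missing nodes and stopping at the first round without improvement are assumptions made for the illustration.

```python
def alternate(find_seeds, find_missing, cost, observed, max_rounds=20):
    """Alternate between the two sub-problems until the cost stops improving.

    find_seeds(infected)          -> set of seeds         (hypothetical)
    find_missing(seeds, observed) -> set of missing nodes (hypothetical)
    cost(seeds, missing)          -> number, lower is better (hypothetical)
    Returns (best_cost, seeds, missing).
    """
    missing = set()
    best = None
    for _ in range(max_rounds):
        seeds = find_seeds(observed | missing)       # stage 1: seeds given missing nodes
        new_missing = find_missing(seeds, observed)  # stage 2: missing nodes given seeds
        score = cost(seeds, new_missing)
        if best is not None and score >= best[0]:
            break  # no improvement: keep the previous solution
        best = (score, seeds, new_missing)
        missing = new_missing
    return best
```

Each stage treats the other stage's latest answer as if it came from an oracle, which is what makes the decoupled sub-problems usable together.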
NetFill on a grid
solutions are typically very ad hoc, very heuristic
offers a clean and principled way to define solutions
we can identify multiple sources and missing nodes
further extensions currently underway