 
              Rumours in Graphs Jilles Vreeken 24 July 2015
Service Announcement #1: The Exam
  1) 20 minutes per person
  2) Questions can be on any topic covered in
     1) the lectures
     2) the required reading
     3) the assignments (1 topic per assignment, your choice)
  3) Grade will be based on your performance in the exam, minus any bonus points you may have acquired.
  4) Timeslots will be mailed today.
Service Announcement #2: Introduction, Patterns, Correlation and Causation, (Subjective) Interestingness, Graphs, Wrap-up + <ask-me-anything>
Service Announcement #2: <ask-me-anything>?
  Yes! Prepare questions on anything* you've always wanted to ask me.
  Mail them to me in advance, or have me answer on the spot.
  * preferably related to TADA, data mining, machine learning, science, the world, etc.
  Schedule: Introduction, Patterns, Correlation and Causation, (Subjective) Interestingness, Graphs, Wrap-up + <ask-us-anything>
Service Announcement #3: Next week there is a high chance of chocolate or, if the weather permits, ice cream
Who are the Culprits? B. Aditya Prakash, Jilles Vreeken, Christos Faloutsos
First question of the day: How can we find the number and location of starting points for epidemics in graphs? (Prakash, Vreeken & Faloutsos, ICDM 2012)
First question of the day: Who are the culprits? (Prakash, Vreeken & Faloutsos, ICDM 2012)
Virus Propagation: Susceptible-Infected (SI) Model. Diseases over contact networks. [AJPH 2007] CDC data: visualization of the first 35 tuberculosis (TB) patients and their 1039 contacts
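To make the SI model concrete, here is a minimal sketch of a discrete-time SI process on a 2d grid. It assumes each infected node infects each susceptible neighbour independently with a fixed probability per step and never recovers; the names `simulate_si`, `grid_graph` and `beta` are illustrative and do not come from the slides.

```python
import random


def grid_graph(n):
    """Build an n x n grid as an adjacency dict keyed by (row, col)."""
    adj = {(r, c): [] for r in range(n) for c in range(n)}
    for (r, c) in adj:
        for dr, dc in ((1, 0), (0, 1)):
            nb = (r + dr, c + dc)
            if nb in adj:
                adj[(r, c)].append(nb)
                adj[nb].append((r, c))
    return adj


def simulate_si(adj, seeds, beta, steps, rng=None):
    """Run a discrete-time SI process and return the infected set per step.

    Assumption: in every step each infected node infects each susceptible
    neighbour independently with probability beta; nobody recovers.
    """
    rng = rng or random.Random()
    infected = set(seeds)
    history = [set(infected)]
    for _ in range(steps):
        newly = set()
        for u in infected:
            for v in adj[u]:
                if v not in infected and rng.random() < beta:
                    newly.add(v)
        infected |= newly
        history.append(set(infected))
    return history


if __name__ == "__main__":
    grid = grid_graph(5)
    snapshots = simulate_si(grid, [(2, 2)], beta=0.5, steps=3,
                            rng=random.Random(0))
    print([len(s) for s in snapshots])
```

The last snapshot plays the role of the infected graph that the culprit-finding question starts from: only the final infected set is observed, not the seed or the order of infections.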
Culprits: Problem definition. [Figure: infections on a 2d grid] Question: Who started it?
Related Work: Culprits (Partial)
  • Shah and Zaman, IEEE TIT, 2011: one seed; provably finds the MLE seed for d-regular trees; SI process
  • Lappas et al., KDD, 2010: k seeds (takes k as input); infected graph assumed to be in steady state; IC model
Culprits: Exoneration
Who are the culprits? NetSleuth
  Two-step solution
  1) use MDL for the number of seeds
  2) for a given number: exoneration = centrality + penalty
  Running time linear! (in edges and nodes)
Modeling using MDL
  Minimum Description Length principle: induction by compression; related to Bayesian approaches
  MDL = Model + Data
  Cost of a model: scoring the seed set S = (cost of encoding the integer |S|) + (cost of picking one of the possible |S|-sized sets)
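The two parts of the seed-set cost can be sketched as follows. This is an assumption-laden illustration: the extracted slide text does not include the formula, so the sketch assumes the integer is encoded with Rissanen's universal code for integers and the set itself is identified by its index among all equally sized subsets. The function names are hypothetical.

```python
from math import comb, log2


def universal_int_cost(n):
    """Bits needed for an integer n >= 1 under Rissanen's universal code.

    log2(c0) + log2(n) + log2(log2(n)) + ..., keeping only positive terms,
    with c0 ~ 2.865064 (assumed encoding, not taken from the slides).
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    cost = log2(2.865064)
    term = log2(n)
    while term > 0:
        cost += term
        term = log2(term)
    return cost


def seed_set_cost(num_nodes, k):
    """Bits to encode the size k plus which k of num_nodes nodes are seeds."""
    return universal_int_cost(k) + log2(comb(num_nodes, k))


if __name__ == "__main__":
    for k in (1, 2, 3):
        print(k, round(seed_set_cost(1000, k), 2))
```

Larger seed sets cost more bits, so an extra seed only pays off if it shortens the description of the data by more than it adds to the model.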
Modeling using MDL. Encoding the data: propagation ripples. [Figure: original graph, infected snapshot, and two example ripples R1 and R2]
Modeling using MDL. Ripple cost of a ripple R: how long the ripple is, plus how the 'frontier' advances. Total MDL cost: model cost (seed set) + data cost (ripple)
How to optimize the score? Two-step process
  • Given k, quickly identify a high-quality set S
  • Given set S, optimize the ripple R
Optimizing the score: high-quality k-seed-set via exoneration
  Best single seed:
  • eigenvector of the smallest eigenvalue of the Laplacian sub-matrix
  • analyze a constrained SI epidemic
  Exonerate neighbors
  Repeat
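The seed-picking step above can be sketched with NumPy. The slide does not spell out the details, so this sketch assumes that the Laplacian sub-matrix is the graph Laplacian restricted to the infected nodes (with degrees counted in the full graph), that the best single seed is the node with the largest entry in the eigenvector of the smallest eigenvalue, and that exoneration means removing a chosen seed from the sub-matrix before repeating. `laplacian_submatrix` and `pick_seeds` are hypothetical names, and the dense eigendecomposition is for illustration only; it is not linear-time.

```python
import numpy as np


def laplacian_submatrix(adj, nodes):
    """Laplacian restricted to `nodes`; degrees are taken from the full graph."""
    idx = {v: i for i, v in enumerate(nodes)}
    lap = np.zeros((len(nodes), len(nodes)))
    for v in nodes:
        lap[idx[v], idx[v]] = len(adj[v])
        for w in adj[v]:
            if w in idx:
                lap[idx[v], idx[w]] = -1.0
    return lap


def pick_seeds(adj, infected, k):
    """Greedily pick k seeds, removing each chosen seed before the next round.

    Removing a seed leaves its neighbours with a diagonal entry larger than
    their remaining row sum, which lowers their score (assumed exoneration).
    """
    remaining = sorted(infected)
    seeds = []
    for _ in range(k):
        lap = laplacian_submatrix(adj, remaining)
        _, vecs = np.linalg.eigh(lap)      # eigenvalues in ascending order
        scores = np.abs(vecs[:, 0])        # eigenvector of smallest eigenvalue
        best = remaining[int(np.argmax(scores))]
        seeds.append(best)
        remaining.remove(best)
    return seeds


if __name__ == "__main__":
    # Path 0-1-...-6 where nodes 1..5 are infected and the ends are not.
    path = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 6] for i in range(7)}
    print(pick_seeds(path, {1, 2, 3, 4, 5}, 1))
```

On the path example the middle node is selected, which matches the intuition that the most central infected node is the most plausible single culprit.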
Optimizing the score: optimizing R
  • Get the MLE ripple!
  Finally, use the MDL score to tell us the best set
  NetSleuth: linear running time in nodes and edges
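Using the MDL score to choose among seed sets of different sizes can be sketched as a simple outer loop. This is only an illustration under stated assumptions: `seeds_for_k` and `total_cost` are hypothetical callables standing in for the seed-picking and scoring steps, and stopping at the first size that no longer lowers the cost is an assumed stopping rule, not one quoted from the slides.

```python
def choose_seed_set(seeds_for_k, total_cost, max_k):
    """Grow the seed set while the total description length keeps dropping.

    seeds_for_k(k) -> candidate seed set of size k (hypothetical callable)
    total_cost(seeds) -> description length in bits (hypothetical callable)
    """
    best_cost, best_seeds = None, None
    for k in range(1, max_k + 1):
        seeds = seeds_for_k(k)
        cost = total_cost(seeds)
        if best_cost is not None and cost >= best_cost:
            break  # assumed stopping rule: first k that does not improve
        best_cost, best_seeds = cost, seeds
    return best_seeds, best_cost


if __name__ == "__main__":
    # Toy cost that is minimal for two seeds.
    toy_seeds = lambda k: list(range(k))
    toy_cost = lambda s: (len(s) - 2) ** 2 + 1
    print(choose_seed_set(toy_seeds, toy_cost, max_k=5))
```

The point of the design is that no seed count has to be given as input: the score itself decides when an additional seed stops paying for itself.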
Experiments: How far are they?
  Evaluation functions (the closer to 1, the better):
  • MDL based
  • Overlap based (JD = Jaccard distance)
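For reference, the overlap measure named on this slide can be computed as below. The slide text does not say exactly which sets are compared, so the example sets are purely illustrative and the function names are hypothetical.

```python
def jaccard_similarity(a, b):
    """|A intersect B| / |A union B|, defined as 1.0 when both sets are empty."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def jaccard_distance(a, b):
    """Jaccard distance, i.e. one minus the Jaccard similarity."""
    return 1.0 - jaccard_similarity(a, b)


if __name__ == "__main__":
    print(jaccard_distance({1, 2, 3}, {2, 3, 4}))
```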
Experiments: # of Seeds [Figure panels: One Seed, Two Seeds, Three Seeds]
Experiments: Quality (MDL and JD) [Figure panels: One Seed, Two Seeds, Three Seeds; ideal = 1]
Experiments: Quality (Jaccard Scores) [Figure panels: One Seed, Two Seeds, Three Seeds; axes: NetSleuth vs. True; the closer to the diagonal, the better]
Experiments: Scalability
Intermediate Conclusion
  Given: Graph and Infections
  Find: Best ‘Culprits’
  Two-step solution
  • use MDL for the number of seeds
  • for a given number: exoneration = centrality + penalty
  NetSleuth: linear running time in nodes and edges
Hidden Hazards: Sashidar Sundareisan, Jilles Vreeken, B. Aditya Prakash
But: Real data is noisy!
  We don’t know exactly who is infected
  • Epidemiology
  • Public-health surveillance
  [Figure: Surveillance Pyramid; labels: CDC, Lab, Hospital, CNN headlines, Not sure] [Nishiura+, PLoS ONE 2011]
  Each level has a certain probability of missing some truly infected people
Real data is noisy!
  Correcting missing data is by itself very important
  • Social media
  • Twitter: due to uniform sampling [Morstatter+ 2013], the relevant ‘infected’ tweets may be missed
  [Figure: tweets, sampling, sampled tweets, missing tweets]
Third question of the day: Given a sample of the infectees, how can we find out the number and location of starting points of the epidemic, as well as the missing nodes? (Sundareisan et al. SDM’15)
The Problem
  GIVEN:
  • Graph G(V, E) from historical data
  • Infected set D ⊂ V, sampled (p%) and incomplete
  • Infectivity β of the virus (assumed to follow the SI model)
  FIND:
  • Seed set, i.e. patient zeros/culprits
  • Set C⁻ (the missing infected nodes)
  • Ripple R (the order of infections)
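The "sampled and incomplete" part of the input can be illustrated with a small sketch. It assumes every truly infected node is observed independently with the same probability, which is one simple reading of uniform sampling; `sample_observed` and `keep_prob` are hypothetical names.

```python
import random


def sample_observed(infected, keep_prob, rng):
    """Split the truly infected nodes into observed and missing ones.

    Assumption: each infected node is kept independently with probability
    keep_prob (given as a fraction, e.g. 0.8 for 80%).
    """
    observed = {v for v in sorted(infected) if rng.random() < keep_prob}
    missing = set(infected) - observed
    return observed, missing


if __name__ == "__main__":
    truly_infected = set(range(20))
    observed, missing = sample_observed(truly_infected, 0.8, random.Random(7))
    print(len(observed), len(missing))
```

The task on this slide runs in the opposite direction: only the observed set is given, and the missing set has to be inferred together with the seeds and the ripple.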
Related Work: Missing Nodes (Partial)
  • Costenbader & Valente 2003; Kossinets 2006; Borgatti et al. 2006: study the effect of sampling on macro-level network statistics
  • Adiga et al. 2013: sensitivity of total infections to noise in network structure
  • Sadikov et al., WSDM, 2011: correct for sampling for macro-level cascade statistics
Outline
  • Motivation and Introduction
  • Problem Definition
  • Our Approach (current)
     • MDL
     • Decoupling
     • Finding S given C
     • Finding C given S
  • Experiments
  • Conclusion
MDL Encoding For Our Problem
  Sender and receiver share: graph G(V, E), infectivity (β), sampling (p)
  The model: seeds (S), ripple (R), missing nodes (C⁻)
  Data given model: infected set (D)
Model (S, R) Cost
  How to score a seed set (S): cost of encoding the integer |S| + cost of picking one of the possible |S|-sized sets
  How to score the ripple?
Model (S, R) Cost: scoring a ripple (R). [Figure: original graph, infected snapshot, and two example ripples R1 and R2]
Model (S, R) Cost: ripple cost of a ripple R = how long the ripple is + how the ‘frontier’ advances
Cost of the data (C⁻). Now you know too much: for you to know what D was, we need to transmit which nodes were missed, C⁻ (green nodes). Detail: γ = 1 − p, i.e. the probability that a node is truly missing
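One plausible way to put a number on this cost is sketched below. The slide's own formula did not survive extraction, so this is an assumption rather than the authors' encoding: it treats each truly infected node as independently missed with a fixed probability and charges the usual negative log-likelihood in bits. `missing_nodes_cost` and `miss_prob` are hypothetical names.

```python
from math import log2


def missing_nodes_cost(num_observed, num_missing, miss_prob):
    """Bits to state which truly infected nodes were missed.

    Assumed model: every truly infected node is missed independently with
    probability miss_prob, so the cost is the negative log2-likelihood of
    num_missing misses and num_observed non-misses.
    """
    if not 0.0 < miss_prob < 1.0:
        raise ValueError("miss_prob must lie strictly between 0 and 1")
    return -(num_missing * log2(miss_prob)
             + num_observed * log2(1.0 - miss_prob))


if __name__ == "__main__":
    print(missing_nodes_cost(num_observed=3, num_missing=1, miss_prob=0.5))
```

Under this reading, a small miss probability makes every additional hypothesised missing node expensive, which keeps the score from explaining away the data by declaring many nodes missing.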
Total MDL Cost. Finally, we have L(D, S, R) = L(S) + L(R | S) + L(D | S, R). Our problem is now to find those S, R, C⁻ that minimize it
Outline. Now: Our Approach, Decoupling
Our Approach: Decoupling. The two problems are 1) finding the seeds and ripple (S, R) 2) finding the missing nodes (C⁻). Can we decouple these problems?
Decoupling the problems (contd.)
  Finding seeds depends on missing nodes.
  [Figure legend: missing nodes, seed, infected node]
  NetFill: correctly fills in the nodes missing from input
  NetSleuth: no missing nodes as input, no missing nodes as output
Decoupling the problems (contd.)
  Finding missing nodes also depends on seeds.
  [Figure: seed S and nodes A and B; legend: infected, not infected] Most probably A was missed
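Since each sub-problem depends on the answer to the other, one natural way to combine them is to alternate between the two until the total cost stops improving. The slides do not state this loop explicitly, so the sketch below is an assumption; `find_seeds`, `find_missing` and `total_cost` are hypothetical callables standing in for the two solvers and the MDL score.

```python
def alternate(observed, find_seeds, find_missing, total_cost, max_rounds=20):
    """Alternate between seed finding and missing-node finding.

    find_seeds(infected) -> seed set for a given (completed) infected set
    find_missing(seeds, observed) -> missing nodes for given seeds
    total_cost(seeds, missing) -> description length to minimise
    """
    missing = set()
    seeds = find_seeds(observed | missing)
    best = total_cost(seeds, missing)
    for _ in range(max_rounds):
        new_missing = find_missing(seeds, observed)
        new_seeds = find_seeds(observed | new_missing)
        cost = total_cost(new_seeds, new_missing)
        if cost >= best:
            break  # no improvement: keep the previous answer
        seeds, missing, best = new_seeds, new_missing, cost
    return seeds, missing, best


if __name__ == "__main__":
    # Toy instance on a path: infections should form a contiguous interval.
    observed = {1, 2, 4, 5}

    def toy_seeds(infected):
        ordered = sorted(infected)
        return {ordered[len(ordered) // 2]}

    def toy_missing(seeds, obs):
        return set(range(min(obs), max(obs) + 1)) - obs

    def toy_cost(seeds, missing):
        full = observed | missing
        holes = set(range(min(full), max(full) + 1)) - full
        return 10 * len(holes) + len(missing)

    print(alternate(observed, toy_seeds, toy_missing, toy_cost))
```

In the toy run the first round fills in node 3 and moves the seed to the middle of the interval; the second round changes nothing, so the loop stops.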
Outline. Now: Our Approach, Finding S given C
Finding missing nodes (C⁻) and culprits (S)
  1) Suppose an oracle gives us the missing nodes (C⁻)
  2) We then have the complete infected set (D ∪ C⁻)
  3) Apply NetSleuth directly (NO SAMPLING INVOLVED), which gives us the seed set!
  [Figure: applying NetSleuth* on the oracle’s answer; legend: missing nodes, seed, infected node]
Outline. Now: Our Approach, Finding C given S
Missing Nodes (C⁻) given Seeds (S)
  Oracle gives us S, find C⁻
  The Naïve Approach:
  • Find all possible C⁻
  • Pick the best one according to MDL
  Sadly, this is infeasible in practice, as we would have to score every candidate subset of V ∖ D
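The naïve approach is easy to write down, which also makes its blow-up visible: with n candidate nodes there are 2^n subsets to score. The sketch below is illustrative only; `naive_best_missing` is a hypothetical name and `cost` stands in for the MDL score.

```python
from itertools import combinations


def naive_best_missing(candidates, cost):
    """Score every subset of the candidates and return the cheapest one.

    Returns (best_subset, best_cost, number_of_subsets_scored).
    """
    candidates = list(candidates)
    best_set, best_cost, scored = None, None, 0
    for size in range(len(candidates) + 1):
        for subset in combinations(candidates, size):
            scored += 1
            c = cost(set(subset))
            if best_cost is None or c < best_cost:
                best_set, best_cost = set(subset), c
    return best_set, best_cost, scored


if __name__ == "__main__":
    # Toy cost: distance to a hidden target set.
    target = {2, 3}
    print(naive_best_missing([1, 2, 3], lambda s: len(s ^ target)))
    print(2 ** 30)  # subsets to score with only 30 candidate nodes
```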
Our Approach
  Sub-problem 1: |Seeds| = 1, |Missing nodes| = 1
  Sub-problem 2: finding the right number of missing nodes
  Sub-problem 3: |Seeds| > 1
Sub-problem 1: Best hidden hazard given one seed
  The best node is the one that makes the seed S most likely
  • we use empirical risk as the measure
  Sanity check: ideally risk should be 0
  So, the best hidden hazard is the node with the lowest empirical risk