[PPT] - Rumours in Graphs Jilles Vreeken 24 July 2015 Service Announcement PowerPoint Presentation

SLIDE 1

Rumours in Graphs

Jilles Vreeken

24 July 2015

SLIDE 2

Service Announcement #1

The Exam.

1) 20 minutes per person
2) Questions can be on any topic covered in
   1) the lectures
   2) the required reading
   3) the assignments (1 topic per assignment, your choice)
3) Grade will be based on your performance in the exam, minus any Bonus points you may have acquired.
4) Timeslots will be mailed today.

SLIDE 3

Service Announcement #2

Introduction Patterns Correlation and Causation Graphs Wrap-up + <ask-me-anything> (Subjective) Interestingness

SLIDE 4

Service Announcement #2

Introduction Patterns Correlation and Causation Graphs Wrap-up + <ask-us-anything> (Subjective) Interestingness

<ask-me-anything>? Yes! Prepare questions on anything* you’ve always wanted to ask me. Mail them to me in advance, or have me answer on the spot.

* preferably related to TADA, data mining, machine learning, science, the world, etc.

SLIDE 5

Service Announcement #3

Next week there is a high chance of chocolate or, if the weather permits, ice cream

SLIDE 6

Who are the Culprits?

B. Aditya Prakash

Jilles Vreeken

Christos Faloutsos

SLIDE 7

First question of the day

How can we find the number and location of starting points for epidemics in graphs?

(Prakash, Vreeken & Faloutsos, ICDM 2012)

SLIDE 8

First question of the day

Who are the culprits?

(Prakash, Vreeken & Faloutsos, ICDM 2012)

SLIDE 9

Virus Propagation

Susceptible-Infected (SI) Model

[AJPH 2007] CDC data: Visualization of the first 35 tuberculosis (TB) patients and their 1039 contacts

Diseases over contact networks
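As a rough illustration of the Susceptible-Infected model named on this slide, the sketch below runs a discrete-time SI process on a small 2d grid. The function names (`simulate_si`, `grid_graph`), the synchronous update and the infectivity parameter `beta` are assumptions made for this example; the slides do not prescribe an implementation.

```python
import random

def simulate_si(adj, seeds, beta, steps, rng):
    """Run a discrete-time SI epidemic.

    Returns the final infected set and the list of newly infected
    node sets, one set per time step.
    """
    infected = set(seeds)
    ripple = []
    for _ in range(steps):
        newly = set()
        # sorted() keeps the order of random draws reproducible
        for node in sorted(infected):
            for nb in adj[node]:
                # every infected neighbour attacks independently with prob. beta
                if nb not in infected and rng.random() < beta:
                    newly.add(nb)
        infected |= newly  # infected nodes never recover in SI
        ripple.append(newly)
    return infected, ripple

def grid_graph(n):
    """Adjacency dict of an n x n 2d grid with 4-neighbourhoods."""
    adj = {}
    for r in range(n):
        for c in range(n):
            adj[(r, c)] = [(r + dr, c + dc)
                           for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
                           if 0 <= r + dr < n and 0 <= c + dc < n]
    return adj

adj = grid_graph(5)
infected, ripple = simulate_si(adj, seeds=[(2, 2)], beta=1.0, steps=2,
                               rng=random.Random(0))
```

With `beta=1.0` every attack succeeds, so after two steps the infected set is exactly the grid cells within Manhattan distance 2 of the seed; smaller values of `beta` give irregular snapshots of the kind the culprit-finding question starts from.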

SLIDE 10

Culprits: Problem definition

[Figure: 2d grid]

Question: Who started it?

SLIDE 11

Related Work – Culprits (Partial)

• Shah and Zaman, IEEE TIT, 2011
  • One seed
  • Provably finds MLE seed for d-regular trees
  • SI process

• Lappas et al., KDD, 2010
  • k seeds (takes k as input)
  • Infected graph assumed to be in steady-state
  • IC model

SLIDE 12

Culprits: Problem definition

[Figure: 2d grid]

Question: Who started it?

SLIDE 13

Culprits: Exoneration

SLIDE 14

Culprits: Exoneration

SLIDE 15

Who are the culprits

Two-step solution

1) use MDL for number of seeds
2) for a given number: exoneration = centrality + penalty

Running time linear! (in edges and nodes)

NETSLEUTH

SLIDE 16

Modeling using MDL

Minimum Description Length principle

Induction by Compression

Related to Bayesian approaches

MDL = Model + Data

Cost of a Model:

scoring the seed-set

• Number of possible |𝑇|-sized sets
• Encoding integer |𝑇|
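The two terms named on this slide can be sketched as a small cost function. This is a hedged example: it assumes the seed-set size is encoded with Rissanen's universal code for integers and that the choice of set is encoded as an index among all sets of that size. The slide only names the two terms, and the function names here are hypothetical.

```python
from math import comb, log2

def universal_int_cost(z):
    """Code length in bits of Rissanen's universal code for an integer z >= 1."""
    if z < 1:
        raise ValueError("z must be >= 1")
    cost = log2(2.865064)  # normalising constant of the universal code
    term = log2(z)
    while term > 0:  # log*(z): sum only the positive iterated logarithms
        cost += term
        term = log2(term)
    return cost

def seed_set_cost(num_nodes, num_seeds):
    """Bits to state how many seeds there are, plus bits to state which ones."""
    bits_for_size = universal_int_cost(num_seeds)
    bits_for_choice = log2(comb(num_nodes, num_seeds))
    return bits_for_size + bits_for_choice

cost_one_seed = seed_set_cost(100, 1)
cost_three_seeds = seed_set_cost(100, 3)
```

Larger seed sets cost more bits under this encoding, so the total score only prefers extra seeds when they make the data cheaper to describe.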

SLIDE 17

Modeling using MDL

Encoding the Data: Propagation Ripples

[Figure: Original Graph; Infected Snapshot; Ripple R1; Ripple R2]

SLIDE 18

Modeling using MDL

Ripple cost

• How the ‘frontier’ advances
• How long is the ripple

Total MDL cost

[Figure: Ripple R]
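One way to read "how the frontier advances" is as a negative log-likelihood: a ripple is cheap to describe when the SI model makes it likely. The sketch below is a simplification under that reading, not the encoding from the paper. It scores each frontier node independently, ignores the bits for the ripple length, and assumes an infectivity `beta` strictly between 0 and 1.

```python
from math import log2

def ripple_cost(adj, seeds, ripple, beta):
    """Bits to describe `ripple` under a discrete-time SI model.

    `ripple` is a list of sets of newly infected nodes, one set per step.
    """
    infected = set(seeds)
    bits = 0.0
    for newly in ripple:
        # attack degree: number of infected neighbours of each frontier node
        attacks = {}
        for node in infected:
            for nb in adj[node]:
                if nb not in infected:
                    attacks[nb] = attacks.get(nb, 0) + 1
        if not newly.issubset(attacks):
            raise ValueError("ripple infects a node that is not on the frontier")
        for node, degree in attacks.items():
            p_infect = 1.0 - (1.0 - beta) ** degree
            # pay for what happened to every frontier node in this step
            bits -= log2(p_infect) if node in newly else log2(1.0 - p_infect)
        infected |= newly
    return bits

path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
bits_fast = ripple_cost(path, ["a"], [{"b"}, {"c"}], beta=0.5)
```

On the three-node path with `beta=0.5`, each step has a single frontier node that gets infected with probability 0.5, so the two-step ripple costs one bit per step.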

SLIDE 19

How to optimize the score?

Two-step process

• Given k, quickly identify a high-quality set S
• Given the set S, optimize the ripple R

SLIDE 20

Optimizing the score

High-quality k-seed-set

• exoneration

Best single seed:

• smallest eigenvector of the Laplacian sub-matrix
• analyze a Constrained SI epidemic

Exonerate neighbors

Repeat
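The pick-and-exonerate loop can be sketched in plain Python. Several choices here are assumptions for illustration: the smallest eigenvector is approximated by power iteration on a shifted matrix, "exoneration" is read as treating an already chosen seed as uninfected before recomputing (as the later slides on multiple seeds describe), and every infected node is assumed to have at least one edge. The names `smallest_eigenvector` and `pick_seeds` are hypothetical.

```python
def smallest_eigenvector(adj, infected, iters=2000):
    """Approximate the eigenvector of the smallest eigenvalue of the
    Laplacian submatrix restricted to `infected`."""
    nodes = sorted(infected)
    # shift >= largest eigenvalue, so the smallest eigenvalue of the submatrix
    # becomes the largest eigenvalue of (shift * I - submatrix)
    shift = 2 * max(len(adj[n]) for n in nodes)
    vec = {n: 1.0 for n in nodes}
    for _ in range(iters):
        nxt = {}
        for n in nodes:
            # diagonal entry is the full degree; off-diagonal entries are -1
            nxt[n] = (shift - len(adj[n])) * vec[n] + sum(
                vec[m] for m in adj[n] if m in vec)
        norm = sum(x * x for x in nxt.values()) ** 0.5
        vec = {n: x / norm for n, x in nxt.items()}
    return vec

def pick_seeds(adj, infected, k):
    """Greedily pick k seeds, exonerating each chosen seed before the next pick."""
    remaining = set(infected)
    seeds = []
    for _ in range(k):
        u1 = smallest_eigenvector(adj, remaining)
        best = max(sorted(u1), key=u1.get)
        seeds.append(best)
        remaining.discard(best)  # exonerate: treat the chosen seed as uninfected
    return seeds

# Path graph 0-1-...-8 in which nodes 1..7 are infected.
path = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 8] for i in range(9)}
single_seed = pick_seeds(path, set(range(1, 8)), 1)
```

On the path example the most central infected node, node 4, gets the largest eigenvector entry and is returned as the single seed.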

SLIDE 21

Optimizing the score

Optimizing R

• Get the MLE ripple!

Finally, use the MDL score to tell us the best set

NETSLEUTH: Linear running time in nodes and edges

[Figure: Ripple R]

SLIDE 22

Experiments

Evaluation functions:

• MDL based
• Overlap based (JD = Jaccard distance)

Closer to 1 the better

How far are they?

SLIDE 23

Experiments: # of Seeds

[Figure panels: One Seed; Two Seeds; Three Seeds]

SLIDE 24

Experiments: Quality (MDL and JD)

Ideal = 1

[Figure panels: One Seed; Two Seeds; Three Seeds]

SLIDE 25

Experiments: Quality (Jaccard Scores)

Closer to diagonal, the better

[Figure axes: True vs. NETSLEUTH; panels: One Seed; Two Seeds; Three Seeds]

SLIDE 26

Experiments: Scalability

SLIDE 27

Intermediate Conclusion

Given: Graph and Infections

Find: Best ‘Culprits’

Two-step solution

• use MDL for number of seeds
• for a given number: exoneration = centrality + penalty

NetSleuth:

• Linear running time in nodes and edges

SLIDE 28

Hidden Hazards

Shashidhar Sundareisan

Jilles Vreeken

B. Aditya Prakash

SLIDE 29

But: Real data is noisy!

• Epidemiology
• Public-health surveillance

We don’t know exactly who is infected

Each level has a certain probability to miss some truly infected people

[Figure: Surveillance Pyramid with levels CDC, Lab, Hospital; nodes marked ‘?’ (Not sure); CNN headlines] [Nishiura+, PLoS ONE 2011]

SLIDE 30

Real data is noisy!

Social Media

• Twitter: due to the uniform samples [Morstatter+ 2013], the relevant ‘infected’ tweets may be missed

Correcting missing data is by itself very important

[Figure: Tweets → Sampling → Sampled Tweets; missing tweets marked ‘?’]

SLIDE 31

Third question of the day

Given a sample of the infectees, how can we find out the number and location of starting points of the epidemic, as well as the missing nodes?

(Sundareisan et al. SDM’15)

SLIDE 32

The Problem

GIVEN:

• Graph 𝐻(𝑊, 𝐹) from historical data
• Infected set 𝐸 ⊂ 𝑊, sampled (𝑞%) and incomplete
• Infectivity 𝛾 of the virus (assumed to follow the SI model)

FIND:

• Seed set 𝒯, i.e. patient zeros/culprits
• Set 𝐷− (the missing infected nodes)
• Ripple 𝑆 (the order of infections)

SLIDE 33

Related Work – Missing Nodes (Partial)

Costenbader & Valente 2003; Kossinets 2006; Borgatti et al. 2006

• study the effect of sampling on macro level network statistics

Adiga et al. 2013

• sensitivity of total infections to noise in network structure

Sadikov et al., WSDM, 2011

• correct for sampling for macro level cascade statistics

SLIDE 34

Outline

• Motivation---Introduction
• Problem Definition
• Our Approach
  • MDL
  • Decoupling
  • Finding 𝒯 given 𝐷
  • Finding 𝐷 given 𝒯
• Experiments
• Conclusion

SLIDE 35

MDL Encoding For Our Problem

Sender knows: Graph 𝐻(𝑊, 𝐹), Infectivity (𝛾), Sampling (𝑞), Seeds (𝒯), Infected set (𝐸), Ripple (𝑆), Missing nodes (𝐷−)

Receiver knows: Graph 𝐻(𝑊, 𝐹), Infectivity (𝛾), Sampling (𝑞)

To be transmitted:

• The Model: Seeds (𝒯), Ripple (𝑆)
• Data given model: Missing nodes (𝐷−)

SLIDE 36

Model (𝑇, 𝑆) Cost

How to score a seed set (𝒯)

• Number of possible |𝒯|-sized sets
• Encoding integer |𝒯|

How to score the ripple?

SLIDE 37

Model (𝑇, 𝑆) Cost

Scoring a ripple (𝑆)

[Figure: Original Graph; Infected Snapshot; Ripple 𝑆1; Ripple 𝑆2]

SLIDE 38

Model (𝑇, 𝑆) Cost

Ripple cost

• How the ‘frontier’ advances
• How long is the ripple

[Figure: a ripple]

SLIDE 39

Cost of the data (𝐷−)

Now you know too much: for you to know what 𝐸 was, we need to transmit which nodes were missed, 𝐷− (green nodes)

Detail: 𝛿 = 1 – 𝑞, i.e. the probability of a node to be truly missing
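A minimal way to put a number on "which nodes were missed" is a Bernoulli code, sketched below. It assumes each truly infected node is dropped from the sample independently with probability 1 − q, with q given as a fraction strictly between 0 and 1. This is an illustrative assumption; the paper's exact encoding may differ, and the function name is hypothetical.

```python
from math import log2

def missing_nodes_cost(num_observed, num_missing, q):
    """Bits to tell the receiver which infected nodes were missed.

    Each infected node is assumed to be missing independently with
    probability delta = 1 - q, where q is the sampling rate as a fraction.
    """
    delta = 1.0 - q
    if not 0.0 < delta < 1.0:
        raise ValueError("q must be strictly between 0 and 1")
    # missing nodes pay -log2(delta); observed nodes pay -log2(1 - delta)
    return -(num_missing * log2(delta) + num_observed * log2(1.0 - delta))

bits = missing_nodes_cost(num_observed=3, num_missing=2, q=0.5)
```

With a sampling rate of 0.5 every node costs exactly one bit. With a high sampling rate, claiming many missing nodes becomes expensive, which keeps the search from filling in nodes freely.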

SLIDE 40

Total MDL Cost

Finally, we have

𝑀(𝐸, 𝒯, 𝑆) = 𝑀(𝒯) + 𝑀(𝑆 ∣ 𝒯) + 𝑀(𝐸 ∣ 𝒯, 𝑆)

Our problem is now to find those 𝒯, 𝑆, 𝐷− that minimize it

SLIDE 41

Outline

• Motivation---Introduction
• Problem Definition
• Our Approach
  • MDL
  • Decoupling
  • Finding 𝒯 given 𝐷
  • Finding 𝐷 given 𝒯
• Experiments
• Conclusion

SLIDE 42

Our Approach: Decoupling

The two problems are

1) finding the seeds and ripple (𝒯, 𝑆) 2) finding the missing nodes (𝐷−)

Can we decouple these problems?

SLIDE 43

Decoupling the problems (contd.)

Finding seeds depends on missing nodes.

NETSLEUTH: no missing nodes as input, no missing nodes as output

NETFILL: correctly fills in the nodes missing from input

[Figure legend: Missing nodes; Seed; Infected node]

SLIDE 44

Decoupling the problems (cont.)

Finding missing nodes also depends on seeds.

[Figure: nodes A and B and seed S; legend: Not Infected; Infected; Seed]

Most probably A was missed

SLIDE 45

Outline

• Motivation---Introduction
• Problem Definition
• Our Approach
  • MDL
  • Decoupling
  • Finding 𝒯 given 𝐷
  • Finding 𝐷 given 𝒯
• Experiments
• Conclusion

SLIDE 46

Finding missing nodes (𝐷−) and culprits (𝒯)

1) Suppose an oracle gives us the missing nodes (𝐷−)
2) We have the complete infected set (𝐸 ∪ 𝐷−)
3) Apply NETSLEUTH directly: NO SAMPLING INVOLVED

And it will give us the seed set!

[Figure: Applying NetSleuth* on Oracle’s Answer; legend: Missing nodes; Seed; Infected node]

SLIDE 47

Outline

• Motivation---Introduction
• Problem Definition
• Our Approach
  • MDL
  • Decoupling
  • Finding 𝒯 given 𝐷
  • Finding 𝐷 given 𝒯
• Experiments
• Conclusion

SLIDE 48

Missing Nodes (𝐷−) given Seeds (𝒯)

Oracle gives us 𝒯, find 𝐷−

The Naïve Approach:

• Find all possible 𝐷−
• Pick the best one according to MDL

Sadly, this is infeasible in practice, as we would have to score 2^|𝑊 ∖ 𝐸| sets

SLIDE 49

Our Approach

Sub-problem 1

• |Seeds| = 1
• |Missing nodes| = 1

Sub-problem 2

• Finding the right number of missing nodes.

Sub-problem 3

• |Seeds| > 1

SLIDE 50

Sub-Problem 1: Best hidden hazard given one seed

The best node is the one that makes the seed 𝑇 most likely

• we use empirical risk as the measure

Sanity Check: ideally risk should be 0

So, the best hidden hazard is

SLIDE 51

Sub-Problem 1: Best Hidden Hazard

Using some results in Prakash et al. 2012, we can rewrite it as

u1 is the eigenvector corresponding to the smallest eigenvalue of the Laplacian submatrix of 𝐸

SLIDE 52

Detour: Laplacian Submatrix

Laplacian = Deg(𝐻) – A(𝐻)

LD = take only the rows (and columns) for nodes in 𝐸 (Laplacian submatrix!)

u1 = eigenvector for the smallest eigenvalue of LD

[Figure: Degree − Adjacency = Laplacian; Laplacian restricted to the infected nodes = Laplacian Submatrix; eigenvalue ƛ and its Eigenvector]
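The construction on this slide can be sketched with plain lists. The function name and the adjacency-dict input format are assumptions for this example. The detail worth noticing is that the diagonal keeps each node's degree in the full graph, including edges to uninfected nodes, while only the rows and columns of infected nodes are kept.

```python
def laplacian_submatrix(adj, infected):
    """Rows and columns of (Degree - Adjacency) for the infected nodes only."""
    nodes = sorted(infected)
    index = {n: i for i, n in enumerate(nodes)}
    sub = [[0] * len(nodes) for _ in nodes]
    for n in nodes:
        # degree counts ALL neighbours, infected or not
        sub[index[n]][index[n]] = len(adj[n])
        for m in adj[n]:
            if m in index:
                sub[index[n]][index[m]] = -1
    return nodes, sub

# Path graph 0-1-2-3 in which nodes 1 and 2 are infected.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
nodes, sub = laplacian_submatrix(path, {1, 2})
```

If NumPy is available, `numpy.linalg.eigh(sub)` returns eigenvalues in ascending order, so the first column of the returned eigenvector matrix would play the role of u1 (up to sign).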

SLIDE 53

Okay

But, how to solve this quickly?

SLIDE 54

Best hidden hazard

Choose n* such that

n* = arg max_n Σ_{i ∈ nb(n)} u1(i)

this measures

• how connected a node n is to centrally located infected nodes w.r.t. 𝑇 in 𝐸
• depends on seed 𝑇 as well as the structure of the graph (!)
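The arg max rule on this slide can be sketched as follows, assuming the score of an uninfected node is the sum of the u1 entries of its infected neighbours and that u1 is supplied as a dict computed elsewhere. The function name and the toy values of u1 are made up for illustration.

```python
def best_hidden_hazard(adj, infected, u1):
    """Return the uninfected node whose infected neighbours carry the most
    u1 mass, together with the scores of all candidate nodes."""
    scores = {}
    for node, neighbours in adj.items():
        if node in infected:
            continue
        score = sum(u1[nb] for nb in neighbours if nb in infected)
        if score > 0:  # only nodes adjacent to the infected set are candidates
            scores[node] = score
    best = max(sorted(scores), key=scores.get) if scores else None
    return best, scores

adj = {
    "a": ["b", "y"],
    "b": ["a", "c", "x"],
    "c": ["b", "x"],
    "x": ["c", "b"],
    "y": ["a"],
}
u1 = {"a": 0.2, "b": 0.9, "c": 0.4}  # hypothetical eigenvector entries
best, scores = best_hidden_hazard(adj, {"a", "b", "c"}, u1)
```

Node `x` touches the two most central infected nodes and therefore outranks `y`, which only touches the peripheral node `a`.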

SLIDE 55

Sub-Problem 2: How many missing nodes?

MDL! Add nodes based on Z-scores till MDL increases.

• but, but, MDL is not convex!
• yes, yes, but it has convex-like behavior!
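The stopping rule can be sketched as a greedy loop, assuming candidates are tried in descending z-score order and that the total MDL cost is available as a callable. Both `z_score` and `total_cost` are stand-ins supplied by the caller, not functions defined in the slides.

```python
def add_until_cost_rises(candidates, z_score, total_cost):
    """Add candidate missing nodes in descending z-score order and stop at
    the first node whose addition makes the total cost go up."""
    ranked = sorted(candidates, key=z_score, reverse=True)
    chosen = []
    best_cost = total_cost(chosen)
    for node in ranked:
        cost = total_cost(chosen + [node])
        if cost > best_cost:
            break  # first increase: stop, relying on the convex-like behaviour
        chosen.append(node)
        best_cost = cost
    return chosen, best_cost

z = {"a": 4.0, "b": 3.0, "c": 2.0, "d": 1.0}

def toy_cost(nodes):
    # toy stand-in with its minimum at two missing nodes
    return (len(nodes) - 2) ** 2

chosen, cost = add_until_cost_rises(z, z.get, toy_cost)
```

Because the real score is not convex, stopping at the first increase is a heuristic; it can miss a lower cost further along the ranking.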

SLIDE 56

Sub-Problem 3: What if |Seeds| > 1

[Figure panels: Using z-scores: Missing nodes near single seed; Ideal: Missing nodes near both seeds]

SLIDE 57

Sub-Problem 3: What if |Seeds| > 1

Exonerate previous seeds

• consider previous seeds uninfected and re-calculate u1
• the blame is transferred to the locality of the older seed
• complete Z-score(n) = max over all seeds of z-score(n)
• maximum as we need high quality missing nodes
• take nodes with top-k complete Z-scores
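Combining per-seed scores by their maximum can be sketched as below. The input format (one dict of z-scores per seed) and the alphabetical tie-break are assumptions for this example.

```python
def top_missing_nodes(z_scores_per_seed, k):
    """Combine per-seed z-scores by taking the maximum for every node, then
    return the k nodes with the highest complete z-score."""
    complete = {}
    for scores in z_scores_per_seed.values():
        for node, z in scores.items():
            complete[node] = max(z, complete.get(node, float("-inf")))
    # sort by descending score; break ties by node label for reproducibility
    ranked = sorted(complete, key=lambda n: (-complete[n], n))
    return ranked[:k], complete

per_seed = {
    "seed1": {"x": 0.9, "y": 0.1},
    "seed2": {"y": 0.8, "z": 0.3},
}
top, complete = top_missing_nodes(per_seed, k=2)
```

Node `y` scores poorly for the first seed but well for the second, so the maximum keeps it in the top two; an average would have penalised it for being far from one of the seeds.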

SLIDE 58

Sub-Problem 3: What if |Seeds| > 1

Exonerate previous seeds

• consider previous seeds uninfected and re-calculate u1

[Figure panels: Using z-scores: Missing nodes near single seed; Ideal: Missing nodes near both seeds]

SLIDE 59

Finding missing nodes given seeds

Phew!

SLIDE 60

The complete algorithm – NETFILL (Outline)

Running time: sub-quadratic in practice
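The algorithm figure did not survive extraction, so the sketch below only illustrates the alternating structure described on the surrounding slides: find seeds given the missing nodes, find missing nodes given the seeds, and stop when the total cost no longer improves. All three callables are toy stand-ins invented for this example; this is not the NETFILL pseudocode.

```python
def alternate(observed, find_seeds, find_missing, total_cost, max_rounds=20):
    """Alternate the two sub-problems until the total cost stops improving."""
    missing = set()
    seeds = find_seeds(observed | missing)
    best = total_cost(seeds, missing)
    for _ in range(max_rounds):
        new_missing = find_missing(seeds, observed)
        new_seeds = find_seeds(observed | new_missing)
        cost = total_cost(new_seeds, new_missing)
        if cost >= best:
            break  # no improvement: keep the previous solution
        seeds, missing, best = new_seeds, new_missing, cost
    return seeds, missing, best

# Toy stand-ins on a path graph 1-2-3-4-5 where node 3 was not observed.
def toy_find_seeds(infected):
    return {sorted(infected)[len(infected) // 2]}  # the middle node

def toy_find_missing(seeds, observed):
    return {n for n in range(min(observed), max(observed) + 1)
            if n not in observed}  # fill the gaps in the path

def toy_total_cost(seeds, missing):
    return 10 - len(missing)  # arbitrary cost that rewards filling gaps

seeds, missing, cost = alternate({1, 2, 4, 5}, toy_find_seeds,
                                 toy_find_missing, toy_total_cost)
```

In a real setting `find_seeds` would be a NETSLEUTH-style routine, `find_missing` the z-score procedure from the previous slides, and `total_cost` the total MDL cost.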

SLIDE 61

Outline

• Motivation---Introduction
• Problem Definition
• Our Approach
• Experiments
• Conclusion

SLIDE 62

Datasets

Real and Synthetic graphs

Real and Simulated cascades

Graphs

• GRID
• AS-OREGON
• FLIXSTER
  • a friendship network with movie ratings
  • cascade: the same movie rating from friends
• MEME-TRACKER
  • hl-mt and hl-hl (Gomez-Rodriguez et al. KDD 2010)

[Figure labels: Webpage; Meme; Time; Citation]

SLIDE 63

Baselines

• NETSLEUTH
• Simulation
  • Simulate the SI process till we reach 𝐸
  • Seeds = Input.
  • Missing nodes = 𝐽 ∖ 𝐸
• Frontier
  • Nodes “next in line” to be infected: those at the boundary (frontier) of the infected set
  • Seeds = Find seeds given missing nodes (NETSLEUTH on 𝐸 + Frontier)
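The Frontier baseline's candidate set is easy to make concrete: it is every uninfected node with at least one infected neighbour. A minimal sketch, assuming the graph is an adjacency dict (the function name is hypothetical):

```python
def frontier(adj, infected):
    """Uninfected nodes with at least one infected neighbour."""
    return {nb for node in infected for nb in adj[node] if nb not in infected}

# Path graph 0-1-2-3-4 in which nodes 1 and 2 are observed as infected.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
boundary = frontier(path, {1, 2})
```

The baseline declares the whole boundary missing, which is why it tends to produce many false positives compared with a method that ranks the candidates.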

SLIDE 64

Visualizing Performance (Grid connected)

[Figure panels: SIMULATION; FRONTIER; NetSleuth; NETFILL, each showing seeds and missing nodes; legend: Correct; FP; FN; Seeds; Infected]

SLIDE 65

Automatically finding the correct number of missing nodes

SLIDE 66

Evaluation Metrics (Subtleties)

For the accuracy of 𝐷− (missing nodes)

• Jaccard, precision, recall, f-measure do not consider TN.
• MCC: Matthews correlation coefficient

MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))

−1 <= MCC <= 1

Closer to 1 the better

Confusion matrix

Prediction    Truly missing    Truly not missing
missing       TP               FP
found         FN               TN
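The Matthews correlation coefficient is computed from the four confusion-matrix counts as sketched below. Returning 0 when the denominator vanishes is a common convention, used here as an assumption; the slides do not say how that case is handled.

```python
from math import sqrt

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom

perfect = mcc(tp=5, fp=0, fn=0, tn=5)
inverted = mcc(tp=0, fp=5, fn=5, tn=0)
partial = mcc(tp=2, fp=1, fn=1, tn=2)
```

Unlike Jaccard or the F-measure, the true negatives enter the numerator, so a method is rewarded for correctly leaving nodes out of the missing set.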

SLIDE 67

Evaluation Metrics (contd.)

For seeds (𝒯) and ripple (𝑆)

• Q-score

Q = 𝑀(𝐸, 𝒯_algorithm, 𝑆_algorithm) / 𝑀(𝐸, 𝒯_true, 𝑆_true)

from ‘literature’ (aka our previous paper), the closer to 1 the better

SLIDE 68

Grid-connected (Synthetic Graph, Synthetic Cascades)

Closer to 1 the better

SLIDE 69

AS-Oregon (Real Graph, Synthetic Cascades)

Closer to 1 the better

SLIDE 70

Meme-Tracker HL-MT (Real Graph, Real Cascades)

See paper for more experiments e.g. scalability, robustness etc.

Closer to 1 the better

SLIDE 71

Meme-Tracker– case study

96,000 node graph for the meme “State of the economy”

What did we find? Truly missing websites!

Examples include “www.nbcbayarea.com”, “chicagotribune.com” and some blog posts.

SLIDE 72

Outline

• Motivation---Introduction
• Problem Definition
• Our Approach
• Experiments
• Conclusion

SLIDE 73

Conclusions

Given: Graph and sampled infections

Find: missing infections and culprits

• Formulated the problem
  • Using MDL
• Two-stage alternating optimization
  • Find best seeds given missing nodes
  • Find best missing nodes given seeds
• NetFill
  • Subquadratic (near-linear in many cases)
  • Outperforms baselines in real and synthetic data

[Figure: NetFill on a grid]

SLIDE 74

Conclusions

Graph problems are often difficult

• solutions are typically very ad hoc, very heuristic

Information theory

• offers a clean and principled way to define solutions

Identifying Infection Sources

• we can identify multiple sources and missing nodes

further extensions currently underway
