Rumours in Graphs, Jilles Vreeken, 24 July 2015



slide-1
SLIDE 1

Rumours in Graphs

Jilles Vreeken

24 July 2015

slide-2
SLIDE 2

Service Announcement #1

The Exam.

1) 20 minutes per person
2) Questions can be on any topic covered in
   1) the lectures
   2) the required reading
   3) the assignments (1 topic per assignment, your choice)
3) Grade will be based on your performance in the exam, minus any bonus points you may have acquired.
4) Timeslots will be mailed today.

slide-3
SLIDE 3

Service Announcement #2

Introduction, Patterns, Correlation and Causation, Graphs, Wrap-up + <ask-me-anything>, (Subjective) Interestingness

slide-4
SLIDE 4

Service Announcement #2

Introduction, Patterns, Correlation and Causation, Graphs, Wrap-up + <ask-us-anything>, (Subjective) Interestingness

<ask-me-anything>? Yes! Prepare questions on anything* you’ve always wanted to ask me. Mail them to me in advance, or have me answer on the spot.

* preferably related to TADA, data mining, machine learning, science, the world, etc.

slide-5
SLIDE 5

Service Announcement #3

Next week there is a high chance of chocolate or, if the weather permits, ice cream

slide-6
SLIDE 6

Who are the Culprits?

B. Aditya Prakash, Jilles Vreeken, Christos Faloutsos

slide-7
SLIDE 7

First question of the day

How can we find the number and location of starting points for epidemics in graphs?

(Prakash, Vreeken & Faloutsos, ICDM 2012)

slide-8
SLIDE 8

First question of the day

Who are the culprits?

(Prakash, Vreeken & Faloutsos, ICDM 2012)

slide-9
SLIDE 9

Virus Propagation

Susceptible-Infected (SI) Model

[AJPH 2007] CDC data: Visualization of the first 35 tuberculosis (TB) patients and their 1039 contacts

Diseases over contact networks
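
As a rough illustration of the Susceptible-Infected (SI) model named on this slide, the sketch below assumes a discrete-time variant: infected nodes never recover, and in every step each infected node infects each susceptible neighbour independently with probability `beta`. The function name, the adjacency-dict representation and the parameter names are choices made for this example and do not come from the slides.

```python
import random


def simulate_si(adj, seeds, beta, steps, rng=None):
    """Run a discrete-time SI process (assumed variant, see lead-in).

    adj   : dict mapping each node to the set of its neighbours
    seeds : nodes infected at time 0
    beta  : per-step, per-edge infection probability
    Returns the final infected set and the list of newly infected
    sets per time step (a 'ripple' in the terminology of the talk).
    """
    rng = rng or random.Random(0)
    infected = set(seeds)
    ripple = [set(seeds)]
    for _ in range(steps):
        newly = set()
        for u in infected:
            for v in adj[u]:
                if v not in infected and rng.random() < beta:
                    newly.add(v)
        infected |= newly  # SI: once infected, always infected
        ripple.append(newly)
    return infected, ripple


# A path graph 0-1-2-3-4 with node 0 as the only seed.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
final, ripple = simulate_si(path, {0}, beta=1.0, steps=2)
```

With `beta=1.0` the infection advances exactly one hop per step, which makes the ripple easy to inspect by hand.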

slide-10
SLIDE 10

Culprits: Problem definition

[Figure: 2d grid]

Question: Who started it?

slide-11
SLIDE 11

Related Work – Culprits (Partial)

• Shah and Zaman, IEEE TIT, 2011
   • One seed
   • Provably finds the MLE seed for d-regular trees
   • SI process
• Lappas et al., KDD, 2010
   • k seeds (takes k as input)
   • Infected graph assumed to be in steady state
   • IC model

slide-12
SLIDE 12

Culprits: Problem definition

[Figure: 2d grid]

Question: Who started it?

slide-13
SLIDE 13

Culprits: Exoneration

slide-14
SLIDE 14

Culprits: Exoneration

slide-15
SLIDE 15

Who are the culprits

Two-step solution

1) use MDL for the number of seeds
2) for a given number: exoneration = centrality + penalty

Running time linear! (in edges and nodes)

NETSLEUTH

slide-16
SLIDE 16

Modeling using MDL

Minimum Description Length principle

Induction by Compression

Related to Bayesian approaches

MDL = Model + Data

Cost of a model: scoring the seed set S
• encoding the integer |S|
• number of possible |S|-sized sets
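
The two labels on this slide suggest a seed-set cost with two parts: the bits needed to state how many seeds there are, and the bits needed to say which set of that size was chosen. The sketch below is one way to compute such a cost, assuming Rissanen's universal code for integers for the first part and the log of a binomial coefficient for the second. The function names and the choice of integer code are assumptions of this example, since the slide's formula itself was not preserved.

```python
import math


def universal_int_bits(n: int) -> float:
    """Bits for a positive integer under Rissanen's universal code.

    Sums the positive terms log2(n) + log2(log2(n)) + ... and adds
    log2 of the normalising constant 2.865064.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    bits = math.log2(2.865064)
    term = math.log2(n)
    while term > 0:
        bits += term
        term = math.log2(term)
    return bits


def seed_set_bits(num_nodes: int, num_seeds: int) -> float:
    """Assumed seed-set cost: encode |S|, then pick one of the C(N, |S|) sets."""
    which_set = math.log2(math.comb(num_nodes, num_seeds))
    return universal_int_bits(num_seeds) + which_set


one_seed = seed_set_bits(100, 1)
two_seeds = seed_set_bits(100, 2)
```

Larger seed sets cost more bits, so under this kind of score an extra seed only pays off if it makes the rest of the description sufficiently shorter.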

slide-17
SLIDE 17

Modeling using MDL

Encoding the Data: Propagation Ripples

[Figure: original graph, infected snapshot, and two candidate ripples R1 and R2]

slide-18
SLIDE 18

Modeling using MDL

Ripple cost: how long the ripple is, and how the ‘frontier’ advances

Total MDL cost: seed-set cost plus ripple cost

[Figure: ripple R]

slide-19
SLIDE 19

How to optimize the score?

Two-step process

• Given k, quickly identify a high-quality set S
• Given set S, optimize the ripple R

slide-20
SLIDE 20

Optimizing the score

High-quality k-seed-set
• exoneration

Best single seed:
• eigenvector of the smallest eigenvalue of the Laplacian submatrix
• analyze a constrained SI epidemic

Exonerate neighbors

Repeat
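
The loop described on this slide can be sketched roughly as follows. This is a simplified illustration, not the NETSLEUTH implementation: it assumes the graph is an adjacency dict, takes the Laplacian submatrix over the infected nodes with degrees counted in the full graph, finds the eigenvector of the smallest eigenvalue by shifted power iteration, and treats exoneration as simply removing the chosen seed from the infected set before repeating. All function names are hypothetical.

```python
def laplacian_submatrix(adj, infected):
    """Rows and columns of L = Deg(G) - A(G) restricted to infected nodes."""
    nodes = sorted(infected)
    sub = [[0.0] * len(nodes) for _ in nodes]
    for r, u in enumerate(nodes):
        for c, v in enumerate(nodes):
            if u == v:
                sub[r][c] = float(len(adj[u]))  # degree in the full graph
            elif v in adj[u]:
                sub[r][c] = -1.0
    return nodes, sub


def smallest_eigenvector(matrix, iterations=2000):
    """Power iteration on (shift * I - matrix).

    The shift exceeds the largest eigenvalue (Gershgorin bound), so the
    dominant eigenvector of the shifted matrix is the eigenvector of the
    smallest eigenvalue of the original matrix.
    """
    n = len(matrix)
    shift = 2.0 * max(matrix[i][i] for i in range(n)) + 1.0
    vec = [1.0] * n
    for _ in range(iterations):
        nxt = [shift * vec[i] - sum(matrix[i][j] * vec[j] for j in range(n))
               for i in range(n)]
        norm = sum(x * x for x in nxt) ** 0.5
        vec = [x / norm for x in nxt]
    return vec


def pick_seeds(adj, infected, k):
    """Pick k seeds: best single seed, exonerate it, repeat."""
    remaining = set(infected)
    seeds = []
    while len(seeds) < k and remaining:
        nodes, sub = laplacian_submatrix(adj, remaining)
        u1 = smallest_eigenvector(sub)
        best = max(range(len(nodes)), key=lambda i: u1[i])
        seeds.append(nodes[best])
        remaining.discard(nodes[best])  # treat the seed as uninfected next round
    return seeds


# Path graph 0-1-...-6 where the middle five nodes are infected.
path = {i: {j for j in (i - 1, i + 1) if 0 <= j <= 6} for i in range(7)}
first = pick_seeds(path, {1, 2, 3, 4, 5}, k=1)
both = pick_seeds(path, {1, 2, 3, 4, 5}, k=2)
```

On the path example the first seed is the central infected node, which matches the intuition that the eigenvector entry acts as a centrality score within the infected region.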

slide-21
SLIDE 21

Optimizing the score

Optimizing R

• Get the MLE ripple!

Finally, use the MDL score to tell us the best set

NETSLEUTH: linear running time in nodes and edges

[Figure: ripple R]

slide-22
SLIDE 22

Experiments

Evaluation functions:

• MDL based
• Overlap based

(JD = Jaccard distance)

Closer to 1 the better

How far are they?

slide-23
SLIDE 23

Experiments: # of Seeds

[Figure panels: One Seed, Two Seeds, Three Seeds]

slide-24
SLIDE 24

Experiments: Quality (MDL and JD)

Ideal = 1

[Figure panels: One Seed, Two Seeds, Three Seeds]

slide-25
SLIDE 25

Experiments: Quality (Jaccard Scores)

Closer to diagonal, the better

[Figure panels: One Seed, Two Seeds, Three Seeds; axes: True vs. NETSLEUTH]

slide-26
SLIDE 26

Experiments: Scalability

slide-27
SLIDE 27

Intermediate Conclusion

Given: Graph and infections

Find: Best ‘culprits’

Two-step solution
• use MDL for the number of seeds
• for a given number: exoneration = centrality + penalty

NetSleuth:
• Linear running time in nodes and edges

slide-28
SLIDE 28

Hidden Hazards

Shashidhar Sundareisan, Jilles Vreeken, B. Aditya Prakash
slide-29
SLIDE 29

But: Real data is noisy!

• Epidemiology
• Public-health surveillance

We don’t know exactly who is infected

Each level has a certain probability of missing some truly infected people

[Figure: surveillance pyramid with levels Hospital, Lab, CDC and CNN headlines; uncertain cases are marked "Not sure"] [Nishiura+, PLoS ONE 2011]

slide-30
SLIDE 30

Real data is noisy!

Social Media

• Twitter: due to the uniform samples [Morstatter+ 2013], the relevant ‘infected’ tweets may be missed

Correcting missing data is by itself very important

[Figure: tweets pass through sampling to give sampled tweets; tweets that were dropped are marked "Missing"]

slide-31
SLIDE 31

Third question of the day

Given a sample of the infectees, how can we find out the number and location of starting points of the epidemic, as well as the missing nodes?

(Sundareisan et al. SDM’15)

slide-32
SLIDE 32

The Problem

GIVEN:
• Graph G(V, E) from historical data
• Infected set D ⊂ V, sampled (p%) and incomplete
• Infectivity β of the virus (assumed to follow the SI model)

FIND:
• Seed set, i.e. patient zeros/culprits
• Set C− (the missing infected nodes)
• Ripple R (the order of infections)

slide-33
SLIDE 33

Related Work – Missing Nodes (Partial)

Costenbader & Valente 2003; Kossinets 2006; Borgatti et al. 2006
• study the effect of sampling on macro-level network statistics

Adiga et al. 2013
• sensitivity of total infections to noise in network structure

Sadikov et al., WSDM, 2011
• correct for sampling for macro-level cascade statistics

slide-34
SLIDE 34

Outline

• Motivation and Introduction
• Problem Definition
• Our Approach (current)
   • MDL
   • Decoupling
   • Finding S given C
   • Finding C given S
• Experiments
• Conclusion

slide-35
SLIDE 35

MDL Encoding For Our Problem

Sender knows: graph G(V, E), infectivity (β), sampling (p), seeds (S), infected set (D), ripple (R), missing nodes (C−)

Receiver knows: graph G(V, E), infectivity (β), sampling (p)

The model: seeds (S), ripple (R)

Data given model: missing nodes (C−)

slide-36
SLIDE 36

Model (S, R) Cost

How to score a seed set (S): encoding the integer |S|, and the number of possible |S|-sized sets

How to score the ripple?

slide-37
SLIDE 37

Model (S, R) Cost

Scoring a ripple (R)

[Figure: original graph, infected snapshot, and two candidate ripples R1 and R2]

slide-38
SLIDE 38

Model (S, R) Cost

Ripple cost: how long the ripple is, and how the ‘frontier’ advances

[Figure: ripple R]

slide-39
SLIDE 39

Cost of the data (C-)

Now you know too much: for you to know what D was, we need to transmit which nodes are the missed nodes C− (green nodes)

Detail: γ = 1 − p, i.e. the probability of a node being truly missing
slide-40
SLIDE 40

Total MDL Cost

Finally, we have

$L(D, S, R) = L(S) + L(R \mid S) + L(D \mid S, R)$

Our problem is now to find those S, R, C− that minimize it

slide-41
SLIDE 41

Outline

• Motivation and Introduction
• Problem Definition
• Our Approach
   • MDL
   • Decoupling (current)
   • Finding S given C
   • Finding C given S
• Experiments
• Conclusion

slide-42
SLIDE 42

Our Approach: Decoupling

The two problems are

1) finding the seeds and ripple (S, R)
2) finding the missing nodes (C−)

Can we decouple these problems?

slide-43
SLIDE 43

Decoupling the problems (contd.)

Finding seeds depends on missing nodes.

NETSLEUTH: no missing nodes as input, no missing nodes as output

NETFILL: correctly fills in the nodes missing from the input

[Figure legend: missing nodes, seed, infected node]

slide-44
SLIDE 44

Decoupling the problems (cont.)

Finding missing nodes also depends on seeds.

[Figure: nodes A and B and seed S; legend: not infected, infected, seed]

Most probably A was missed

slide-45
SLIDE 45

Outline

• Motivation and Introduction
• Problem Definition
• Our Approach
   • MDL
   • Decoupling
   • Finding S given C (current)
   • Finding C given S
• Experiments
• Conclusion

slide-46
SLIDE 46

Finding missing nodes (C−) and culprits (S)

1) Suppose an oracle gives us the missing nodes (C−)
2) We then have the complete infected set (D ∪ C−)
3) Apply NETSLEUTH directly: no sampling involved, and it will give us the seed set!

[Figure: applying NetSleuth* to the oracle's answer; legend: missing nodes, seed, infected node]

slide-47
SLIDE 47

Outline

• Motivation and Introduction
• Problem Definition
• Our Approach
   • MDL
   • Decoupling
   • Finding S given C
   • Finding C given S (current)
• Experiments
• Conclusion

slide-48
SLIDE 48

Missing Nodes (C−) given (S)

Oracle gives us S, find C−

The Naïve Approach:
• Find all possible C−
• Pick the best one according to MDL

Sadly, this is infeasible in practice, as we would have to score every candidate subset of V ∖ D, which is exponentially many sets

slide-49
SLIDE 49

Our Approach

Sub-problem 1
• |Seeds| = 1
• |Missing nodes| = 1

Sub-problem 2
• Finding the right number of missing nodes

Sub-problem 3
• |Seeds| > 1

slide-50
SLIDE 50

Sub Problem 1: Best hidden hazard given one seed

The best node is the one that makes the seed S most likely
• we use empirical risk as the measure

Sanity check: ideally the risk should be 0

So, the best hidden hazard is the node that minimizes this risk

slide-51
SLIDE 51

Sub-Problem 1: Best Hidden Hazard

Using some results from Prakash et al. 2012, we can rewrite it in terms of u1, the eigenvector corresponding to the smallest eigenvalue of the Laplacian submatrix of D

slide-52
SLIDE 52

Detour: Laplacian Submatrix

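Laplacian = Deg(G) − A(G), i.e. the degree matrix minus the adjacency matrix

L_D = take only the rows and columns for nodes in D (Laplacian submatrix!)

u1 = eigenvector of the smallest eigenvalue λ of L_D

[Figure: degree and adjacency matrices, the Laplacian, the Laplacian submatrix for D, and its eigenvector]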

slide-53
SLIDE 53

Okay

But, how to solve this quickly?

slide-54
SLIDE 54

Best hidden hazard

Choose n* such that

$n^* = \arg\max_{n} \sum_{i \in nb(n)} u_1(i)$

This measures
• how connected a node n is to centrally located infected nodes w.r.t. S in D
• depends on seed S as well as the structure of the graph (!)
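
One reading of the selection rule on this slide is: score every uninfected node by the sum of the eigenvector entries of its infected neighbours, then take the node with the highest score. The sketch below follows that reading. The eigenvector values in the usage example are made up for illustration, and the function name and data layout are assumptions of this example.

```python
def best_hidden_hazard(adj, infected, u1):
    """Return the uninfected node with the largest neighbour-sum score.

    adj      : dict mapping each node to the set of its neighbours
    infected : set of observed infected nodes
    u1       : dict mapping each infected node to its eigenvector entry
    """
    scores = {}
    for node in adj:
        if node in infected:
            continue
        # Only infected neighbours carry an eigenvector entry.
        scores[node] = sum(u1[i] for i in adj[node] if i in infected)
    best = max(scores, key=scores.get)
    return best, scores


# Toy graph: nodes 1 and 2 are infected, with illustrative u1 entries.
graph = {0: {1}, 1: {0, 2, 3}, 2: {1, 3, 4}, 3: {1, 2}, 4: {2}}
u1_toy = {1: 0.8, 2: 0.6}
best, scores = best_hidden_hazard(graph, {1, 2}, u1_toy)
```

Node 3 wins here because it touches both infected nodes, which is the "connected to centrally located infected nodes" intuition in miniature.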

slide-55
SLIDE 55

Sub-Problem 2: How many missing nodes?

MDL! Add nodes based on Z-scores till MDL increases.

• but, but, MDL is not convex!
• yes, yes, but it has convex-like behavior!
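
The stopping rule on this slide can be sketched as a simple greedy loop: walk down the candidates in order of decreasing Z-score and stop as soon as the description length no longer decreases. The `cost` callable below stands in for the real MDL score, and `toy_cost` is a made-up function used only to make the example runnable. Both names are hypothetical.

```python
def add_until_cost_rises(ranked_nodes, cost):
    """Greedily add candidates while the total cost keeps decreasing.

    ranked_nodes : candidates sorted from highest to lowest Z-score
    cost         : callable taking a list of chosen nodes, returning bits
    """
    chosen = []
    best_cost = cost(chosen)
    for node in ranked_nodes:
        trial_cost = cost(chosen + [node])
        if trial_cost >= best_cost:
            break  # first non-improvement: rely on convex-like behaviour
        chosen.append(node)
        best_cost = trial_cost
    return chosen, best_cost


def toy_cost(missing):
    """Stand-in score with its minimum at three missing nodes."""
    return 10.0 + (len(missing) - 3) ** 2


chosen, final_cost = add_until_cost_rises(list("abcdef"), toy_cost)
```

Stopping at the first non-improvement is only safe if the score behaves roughly like a convex function of the number of added nodes, which is the caveat the slide raises.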

slide-56
SLIDE 56

Sub-Problem 3: What if |Seeds| > 1

Using z-scores: missing nodes near a single seed

Ideal: missing nodes near both seeds

slide-57
SLIDE 57

Sub problem 3: What if |Seeds| > 1

Exonerate previous seeds
• consider previous seeds uninfected and re-calculate u1
• the blame is transferred to the locality of the older seed
• $\text{complete Z-score}(n) = \max_{\text{over all seeds}} \text{z-score}(n)$
• maximum, as we need high-quality missing nodes
• take the nodes with the top-k complete Z-scores
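
A minimal sketch of the combination step described above, assuming one dictionary of z-scores per seed has already been computed: the complete score of a node is its maximum over all seeds, and the top-k nodes by complete score are kept. The function names, the tie-breaking rule and the numbers in the example are assumptions made for illustration.

```python
def complete_z_scores(per_seed_scores):
    """Combine per-seed z-scores by taking the maximum for each node."""
    combined = {}
    for scores in per_seed_scores:
        for node, z in scores.items():
            combined[node] = max(z, combined.get(node, float("-inf")))
    return combined


def top_k(combined, k):
    """Nodes with the k highest complete scores (ties broken by node id)."""
    return sorted(combined, key=lambda n: (-combined[n], n))[:k]


# Illustrative scores: node "a" is close to seed 1, node "b" to seed 2.
scores_seed_1 = {"a": 0.9, "b": 0.1, "c": 0.2}
scores_seed_2 = {"a": 0.1, "b": 0.8, "c": 0.3}
combined = complete_z_scores([scores_seed_1, scores_seed_2])
picked = top_k(combined, 2)
```

Taking the maximum rather than the sum keeps a node that is strongly tied to one seed ahead of a node that is weakly tied to several.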

slide-58
SLIDE 58

Sub problem 3: What if |Seeds| > 1

Exonerate previous seeds
• consider previous seeds uninfected and re-calculate u1

Using z-scores: missing nodes near a single seed

Ideal: missing nodes near both seeds

slide-59
SLIDE 59

Finding missing nodes given seeds

Phew!

slide-60
SLIDE 60

The complete algorithm – NETFILL (Outline)

Running time: sub-quadratic in practice

slide-61
SLIDE 61

Outline

• Motivation and Introduction
• Problem Definition
• Our Approach
• Experiments (current)
• Conclusion

slide-62
SLIDE 62

Datasets

Real and synthetic graphs; real and simulated cascades

Graphs:
• GRID
• AS-OREGON
• FLIXSTER: a friendship network with movie ratings (cascade: the same movie rating from friends)
• MEME-TRACKER: hl-mt and hl-hl (Gomez-Rodriguez et al. KDD 2010)

[Figure labels: webpage, meme, time, citation]

slide-63
SLIDE 63

Baselines

• NETSLEUTH
• Simulation
   • Simulate the SI process till we reach D
   • Seeds = input
   • Missing nodes = I ∖ D
• Frontier
   • Nodes “next in line” to be infected: those at the boundary (frontier) of the infected set
   • Seeds = find seeds given missing nodes (NETSLEUTH on D + Frontier)
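
The frontier used by this baseline can be computed in one pass over the infected nodes. The sketch below assumes the graph is an adjacency dict; the function name is a choice made for this example.

```python
def frontier(adj, infected):
    """Uninfected nodes that have at least one infected neighbour."""
    return {v for u in infected for v in adj[u] if v not in infected}


# Path graph 0-1-2-3-4 with nodes 1 and 2 observed as infected.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
boundary = frontier(path, {1, 2})
augmented = {1, 2} | boundary  # the set a seed finder would then be run on
```

The baseline treats every frontier node as a missing infection, so its guess grows with the boundary of the infected region rather than with any evidence about individual nodes.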

slide-64
SLIDE 64

Visualizing Performance (Grid connected)

[Figure panels: NETSLEUTH, SIMULATION, FRONTIER and NETFILL, each showing seeds and missing nodes. Legend: correct, FP, FN, seeds, infected]

slide-65
SLIDE 65

Automatically finding the correct number of missing nodes

slide-66
SLIDE 66

Evaluation Metrics (Subtleties)

For the accuracy of C− (missing nodes)

• Jaccard, precision, recall, F-measure do not consider TN.
• MCC: Matthews correlation coefficient

$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$

−1 <= MCC <= 1

Closer to 1 the better

Confusion matrix:

Prediction   Truly missing   Truly not missing
missing      TP              FP
found        FN              TN
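
For reference, the Matthews correlation coefficient can be computed directly from the four confusion-matrix counts. The sketch below uses the standard formula; returning 0.0 when the denominator is zero is a common convention adopted here as an assumption, not something stated on the slide.

```python
import math


def mcc(tp: int, fp: int, fn: int, tn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # assumed convention for degenerate confusion matrices
    return (tp * tn - fp * fn) / denom


perfect = mcc(tp=5, fp=0, fn=0, tn=5)
inverted = mcc(tp=0, fp=5, fn=5, tn=0)
chance = mcc(tp=1, fp=1, fn=1, tn=1)
```

Unlike precision or recall, the score changes when true negatives change, which matters here because most nodes in the graph are correctly left out of the missing set.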

slide-67
SLIDE 67

Evaluation Metrics (contd.)

For seeds (S) and ripple (R)

• Q-score

$Q = \frac{L(D, S_{\mathrm{algorithm}}, R_{\mathrm{algorithm}})}{L(D, S_{\mathrm{true}}, R_{\mathrm{true}})}$

from ‘literature’ (aka our previous paper); the closer to 1 the better

slide-68
SLIDE 68

Grid-connected (Synthetic Graph, Synthetic Cascades)

Closer to 1 the better

slide-69
SLIDE 69

AS-Oregon (Real Graph, Synthetic Cascades)

Closer to 1 the better

slide-70
SLIDE 70

Meme-Tracker HL-MT (Real Graph, Real Cascades)

See paper for more experiments e.g. scalability, robustness etc.

Closer to 1 the better

slide-71
SLIDE 71

Meme-Tracker– case study

96,000-node graph for the meme “State of the economy”

What did we find? Truly missing websites!

Examples include “www.nbcbayarea.com”, “chicagotribune.com” and some blog posts.

slide-72
SLIDE 72

Outline

• Motivation and Introduction
• Problem Definition
• Our Approach
• Experiments
• Conclusion (current)
slide-73
SLIDE 73

Conclusions

Given: Graph and sampled infections

Find: missing infections and culprits

Formulated the problem
• Using MDL

Two-stage alternating optimization
• Find best seeds given missing nodes
• Find best missing nodes given seeds

NetFill
• Subquadratic (near-linear in many cases)
• Outperforms baselines in real and synthetic data

[Figure: NetFill on a grid]

slide-74
SLIDE 74

Conclusions

Graph problems are often difficult
• solutions are typically very ad hoc, very heuristic

Information theory
• offers a clean and principled way to define solutions

Identifying Infection Sources
• we can identify multiple sources and missing nodes
• further extensions currently underway

slide-75
SLIDE 75


Thank you!