Culprits and Islands - Jilles Vreeken - PowerPoint PPT Presentation



SLIDE 1

Culprits and Islands

Jilles Vreeken

4 July 2014 (TADA)

SLIDE 2

Service Announcement #1

Introduction

  • Is DM science?
  • DM in action

Tensors

  • Introduction to tensors
  • Tensors in DM
  • Special topics in tensors

Information Theory

  • MDL + patterns
  • Entropy + correlation
  • MaxEnt + iterative DM

Mixed Grill

  • Influence Propagation
  • Redescription Mining
  • <special request>

SLIDE 3

Service Announcement #1

Introduction

  • Is DM science?
  • DM in action

Tensors

  • Introduction to tensors
  • Tensors in DM
  • Special topics in tensors

Information Theory

  • MDL + patterns
  • Entropy + correlation
  • MaxEnt + iterative DM

Mixed Grill

  • Influence Propagation
  • Redescription Mining
  • <special request>

<special request>? Let us know (asap, mail) what topic you would like to see discussed

SLIDE 4

Who are the Culprits?

  • B. Aditya Prakash
  • Jilles Vreeken
  • Christos Faloutsos

4 July 2014 (TADA)

SLIDE 5

First question of the day

How can we find the number and location of starting points for epidemics in graphs?

– or –

Who are the culprits?

SLIDE 6

Virus Propagation

Susceptible-Infected (SI) Model

[AJPH 2007] CDC data: visualization of the first 35 tuberculosis (TB) patients and their 1039 contacts

Diseases over contact networks
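
To make the SI model concrete, here is a minimal sketch of a discrete-time SI epidemic on a contact graph. The slide names only the model; the per-step infection probability `beta`, the helper names `grid_graph` and `simulate_si`, and the 2d-grid test graph are assumptions made for illustration.

```python
import random


def grid_graph(side):
    """Adjacency lists for a side x side 2d grid with 4-neighbour contacts."""
    adjacency = {}
    for row in range(side):
        for col in range(side):
            neighbours = []
            for d_row, d_col in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                r, c = row + d_row, col + d_col
                if 0 <= r < side and 0 <= c < side:
                    neighbours.append((r, c))
            adjacency[(row, col)] = neighbours
    return adjacency


def simulate_si(adjacency, seeds, beta, steps, rng=None):
    """Run a discrete-time SI epidemic and return the set of infected nodes.

    Assumption: in every step, each infected node independently infects each
    susceptible neighbour with probability beta. Infected nodes never recover.
    """
    rng = rng if rng is not None else random.Random(0)
    infected = set(seeds)
    for _ in range(steps):
        newly_infected = set()
        for node in infected:
            for neighbour in adjacency[node]:
                if neighbour not in infected and rng.random() < beta:
                    newly_infected.add(neighbour)
        # Nodes infected in this step start spreading in the next step.
        infected |= newly_infected
    return infected
```

The culprit-finding question is the inverse of this simulation: given only the final infected set, recover `seeds`.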

SLIDE 7

Culprits: Problem definition

2d grid

Question: Who started it?

SLIDE 8

Culprits: Problem definition

Prior work: [Lappas et al. 2010, Shah et al. 2011]

2d grid

Question: Who started it?

SLIDE 9

Culprits: Exoneration

SLIDE 10

Culprits: Exoneration

SLIDE 11

Who are the culprits

Two-step solution

  1) use MDL for number of seeds
  2) for a given number: exoneration = centrality + penalty

Running time linear! (in edges and nodes)

NetSleuth

SLIDE 12

Modeling using MDL

Minimum Description Length principle

Induction by Compression

Related to Bayesian approaches

MDL = Model + Data

Cost of a Model: scoring the seed-set

  • Encoding integer |T|
  • Number of possible |T|-sized sets
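
The two labels above suggest a model cost with two parts: the bits to state the integer |T|, plus the bits to pick one of the possible |T|-sized node sets. The formula is not in the extracted text, so the sketch below is an assumption. It uses Rissanen's universal code for integers and a binomial coefficient, both standard MDL choices. The function names are hypothetical.

```python
from math import comb, log2


def universal_integer_code_length(n):
    """Bits to encode an integer n >= 1 with Rissanen's universal code.

    L_N(n) = log2(c0) + log2(n) + log2(log2(n)) + ..., keeping only the
    positive terms, with c0 ~= 2.865064.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    bits = log2(2.865064)
    term = log2(n)
    while term > 0:
        bits += term
        term = log2(term)
    return bits


def seed_set_cost(num_nodes, num_seeds):
    """Assumed model cost: encode |T|, then which |T|-sized subset it is."""
    count_bits = universal_integer_code_length(num_seeds)
    choice_bits = log2(comb(num_nodes, num_seeds))
    return count_bits + choice_bits
```

Under this cost, every extra seed has to pay for itself by making the data (the infected snapshot) cheaper to describe.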

SLIDE 13

Modeling using MDL

Encoding the Data: Propagation Ripples

[Figure panels: Original Graph, Infected Snapshot, Ripple R1, Ripple R2]

SLIDE 14

Modeling using MDL

Ripple cost

  • How the ‘frontier’ advances
  • How long is the ripple

Total MDL cost

Ripple R

Prakash, Vreeken, Faloutsos 2012

SLIDE 15

How to optimize the score?

Two-step process

  • Given k, quickly identify a high-quality set S
  • Given set S, optimize the ripple R

SLIDE 16

Optimizing the score

High-quality k-seed-set

  • exoneration

Best single seed:

  • smallest eigenvector of Laplacian sub-matrix
  • analyze a Constrained SI epidemic

Exonerate neighbors

Repeat
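
One way to read the recipe above is as a greedy loop. Score the infected nodes by the eigenvector of the Laplacian sub-matrix with the smallest eigenvalue, take the top node as a seed, drop it from the infected set so that its neighbours lose score, and repeat. The sketch below follows that reading in plain Python. The power-iteration solver, the shift constant, and all function names are assumptions for illustration, not the NetSleuth implementation.

```python
def laplacian_submatrix(adjacency, infected):
    """Rows and columns of the graph Laplacian restricted to infected nodes.

    Diagonal entries keep each node's degree in the full graph, so nodes
    with many non-infected neighbours are penalised.
    """
    nodes = sorted(infected)
    index = {node: i for i, node in enumerate(nodes)}
    matrix = [[0.0] * len(nodes) for _ in nodes]
    for node in nodes:
        i = index[node]
        matrix[i][i] = float(len(adjacency[node]))
        for neighbour in adjacency[node]:
            if neighbour in index:
                matrix[i][index[neighbour]] = -1.0
    return nodes, matrix


def smallest_eigenvector(matrix, iterations=500):
    """Eigenvector of the smallest eigenvalue, via shifted power iteration.

    The dominant eigenvector of (shift * I - matrix) is the eigenvector of
    `matrix` with the smallest eigenvalue, provided shift exceeds every
    eigenvalue; twice the largest degree plus one is enough (Gershgorin).
    """
    size = len(matrix)
    shift = 2.0 * max(matrix[i][i] for i in range(size)) + 1.0
    vector = [1.0] * size
    for _ in range(iterations):
        product = [
            shift * vector[i] - sum(matrix[i][j] * vector[j] for j in range(size))
            for i in range(size)
        ]
        norm = sum(x * x for x in product) ** 0.5
        vector = [x / norm for x in product]
    return vector


def pick_seeds(adjacency, infected, k):
    """Greedily pick up to k seeds, exonerating around each chosen seed."""
    remaining = set(infected)
    seeds = []
    for _ in range(k):
        if not remaining:
            break
        nodes, matrix = laplacian_submatrix(adjacency, remaining)
        scores = smallest_eigenvector(matrix)
        best = max(range(len(nodes)), key=lambda i: abs(scores[i]))
        seeds.append(nodes[best])
        # Dropping the seed makes it count as a non-infected neighbour in
        # the next round, which lowers the scores of the nodes next to it.
        remaining.discard(nodes[best])
    return seeds
```

On a path graph whose middle stretch is infected, the first pick is the centre of that stretch, which matches the "centrality + penalty" intuition. This dense-matrix sketch is not linear-time; it only illustrates the scoring idea.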

SLIDE 17

Optimizing the score

Optimizing R

  • Get the MLE ripple!

Finally use MDL score to tell us the best set

NETSLEUTH: Linear running time in nodes and edges

Ripple R

SLIDE 18

Experiments

Evaluation functions:

  • MDL based
  • Overlap based (JD = Jaccard distance)

Closer to 1 the better

How far are they?
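
The slide does not spell out the overlap score, so the sketch below shows only the textbook Jaccard distance between a recovered seed set and the true one. The function name and the convention for two empty sets are assumptions.

```python
def jaccard_distance(found, true):
    """Textbook Jaccard distance: 1 - |found & true| / |found | true|."""
    found, true = set(found), set(true)
    union = found | true
    if not union:
        # Assumed convention: two empty sets are identical.
        return 0.0
    return 1.0 - len(found & true) / len(union)
```

This exact-match version gives no credit to a recovered seed that sits one hop away from a true seed, so a practical overlap score may need to be more forgiving.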

SLIDE 19

Experiments: # of Seeds

[Figure panels: One Seed, Two Seeds, Three Seeds]

SLIDE 20

Experiments: Quality (MDL and JD)

Prakash, Vreeken, Faloutsos 2012

Ideal = 1

[Figure panels: One Seed, Two Seeds, Three Seeds]

SLIDE 21

Experiments: Quality (Jaccard Scores)

Closer to diagonal, the better

[Figure axes: True, NETSLEUTH; panels: One Seed, Two Seeds, Three Seeds]

SLIDE 22

Experiments: Scalability

SLIDE 23

Conclusion

Given: Graph and Infections

Find: Best ‘Culprits’

Two-step solution

  • use MDL for number of seeds
  • for a given number: exoneration = centrality + penalty

NetSleuth:

  • Linear running time in nodes and edges

SLIDE 24

Connection Pathways

Leman Akoglu, Jilles Vreeken, Hanghang Tong, Polo Chau, Nikolaj Tatti, Christos Faloutsos

(Akoglu et al. SDM’13)

SLIDE 25

Question at hand

How can we use a graph to explain a few selected nodes?

SLIDE 26

Given a ‘list’ of authors…

What can we say?

  • let’s use relational information

Brad A. Myers, Bonnie E. John, James A. Landay, Hector Garcia Molina, David J. DeWitt, H. V. Jagadish, Christos Faloutsos, Scott E. Hudson, Shumin Zhai, Abigail Sellen, Steve Benford, Ravin Balakrishnan, Surajit Chaudhuri, William Buxton, Hiroshi Ishii, Raghu Ramakrishnan, Rakesh Agrawal, Jeffrey F. Naughton, Gerhard Weikum, Michael J. Carey

SLIDE 27

Given a ‘list’ of authors…

What can we say?

  • let’s use relational information

SLIDE 28

Using the co-authorship graph…

Any structure?

  • too cluttered

SLIDE 29

The Problem

Given

  • a large graph G
  • a handful of nodes S marked by an external process

What can we say about S?

  • are they close by?
  • are they segregated?
  • do they form groups?

Can we connect them?

  • with simple paths?
  • maybe using a few connectors?

SLIDE 30

Our approach

Use the network structure to explain S

Partition S into groups of nodes, such that

  • “simple” paths in G connect the nodes in each group,
  • nodes in different groups are “not easily reachable”

Use MDL to decide ‘simple’ and ‘best’ partitioning

SLIDE 31

Example

Simple connection pathways

  • good connectors
  • better sensemaking

[Figure labels: VLDB, CHI]

SLIDE 32

Applications

1. Graph anomaly description/summarization

  • Summarize top-k node anomalies by groups
  • Find connections/connectors among groups

Top-k anomalies

e.g. Gene interaction network

SLIDE 33

Applications

2. Query summarization

  • Summarize top-k query pages by groups
  • Find connections/connectors among groups

Top-ranked pages

e.g. Web network

SLIDE 34

Applications

3. Understanding dynamic events in graphs

  • Event spread within groups explained by the network
  • Event spread between groups due to external influence

Affected people

e.g. Social network

SLIDE 35

Applications

4. Understanding semantic coherence

  • Summarize words by semantically coherent groups
  • Find connectors (other relevant words) per group

Set of words

e.g. Ontology network

SLIDE 36

Applications

5. Understanding segregation (social science)

  • Summarize students by their social “circles”
  • Study groups (and groups within groups)

Students with attributes of interest

e.g. school-children friendship network

SLIDE 37

Problem: Formally

Problem Definition

Given a graph G = (V, E) and a set of marked nodes M ⊆ V

Problem 1. Optimal partitioning

  • Find a coherent partitioning P of M.
  • Find the optimal number of partitions |P|.

Problem 2. Optimal connection subgraphs

  • Efficiently find the minimum cost set of subgraphs connecting the nodes in each part.

SLIDE 38

Objective: Informally

Our key idea is to use information theory

Imagine a sender and a receiver.

  • both sender and receiver know the graph structure G,
  • only the sender knows the set of marked nodes M,
  • goal: transmit M using as few bits as possible.

Why would this work?

  • naïve: encode the ID of each marked node with log |V| bits
  • better: exploit “close-by” nodes, restart for farther nodes
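
A small sketch can show why exploiting close-by nodes saves bits. It compares naming every marked node by its ID against naming the first node by ID and then hopping along edges. The per-hop cost of log2(degree) bits is an assumption made here for illustration, as are the function names.

```python
from math import log2


def naive_bits(num_nodes, num_marked):
    """Name every marked node by its ID: log2 |V| bits each."""
    return num_marked * log2(num_nodes)


def walk_bits(adjacency, walk):
    """Name the first node by ID, then hop along edges to the rest.

    Assumption: picking the next node among the current node's neighbours
    costs log2(degree) bits.
    """
    bits = log2(len(adjacency))
    for current, following in zip(walk, walk[1:]):
        if following not in adjacency[current]:
            raise ValueError(f"{following} is not a neighbour of {current}")
        bits += log2(len(adjacency[current]))
    return bits
```

On a ring of 1024 nodes, four consecutive marked nodes cost 40 bits when named individually but only 13 bits when reached by hopping. The gap shrinks once the marked nodes are far apart, which is where restarting becomes the cheaper option.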

SLIDE 39

Objective: Intuition

We think of encoding as

  • hopping from node to node to encode close-by nodes
  • and flying to a new node to encode farther nodes
  • until all marked nodes are identified

Simplicity of connection tree T is determined by:

  • the number of flights we make across the graph;
  • ease of identifying the edges to follow next;
  • ease of identifying the marked nodes in our tour.

SLIDE 40

Objective: Formally

Minimize over P, T_i:

  • encode #partitions
  • encode each part
  • encoding of tree per part

Terms of the encoding:

  • root node
  • number of marked nodes in p_i
  • identities of marked nodes
  • spanning tree t of p_i
  • #branches of node t
  • identities of branch nodes
  • recursively encode all tree nodes

SLIDE 41

Solution: Intuition

The problem is NP-hard

  • NP-hardness follows from the directed Steiner tree problem

Hence, we resort to heuristics…

The general idea:

  • transform G into a directed weighted graph G’
  • chop G’ into sub-graphs
  • find low-cost minimal spanning trees per sub-graph

(we give 4 efficient algorithms)

SLIDE 42

Solution: Preliminaries

Graph transformation

  • given undirected unweighted G
  • we transform it into directed weighted G’, where … and … [edge-weight formulas missing from the extracted slide]

Given G’, the problem becomes: find the set of trees with minimum total cost on the marked nodes.

Finding bounded-length paths

  • (multiple) short paths of length up to … between marked nodes in G’
  • employ BFS-like expansion
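
The BFS-like expansion can be sketched as a breadth-first search that stops after a fixed number of hops. This simplified version ignores edge weights and keeps one hop distance per pair rather than multiple paths. The parameter `max_hops` and the function names are assumptions.

```python
from collections import deque


def bounded_bfs(adjacency, source, max_hops):
    """Hop distance from source to every node within max_hops."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop bound
        for neighbour in adjacency[node]:
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return distance


def close_marked_pairs(adjacency, marked, max_hops):
    """Map each ordered pair of marked nodes within max_hops to its distance."""
    pairs = {}
    for source in marked:
        reachable = bounded_bfs(adjacency, source, max_hops)
        for target in marked:
            if target != source and target in reachable:
                pairs[(source, target)] = reachable[target]
    return pairs
```

Bounding the search keeps the work per marked node proportional to its local neighbourhood, at the price of missing connections that need longer paths.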

SLIDE 43

Algorithms

1) Connected components (CC)

  • find induced subgraph(s) on marked nodes in G’
  • find minimum cost directed tree(s)

2) Minimum arborescence (ARB)

  • construct transitive closure graph CG (with bounded paths)
  • add universal node u with out-edges
  • find minimum cost directed tree(s), remove u, re-expand paths
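
The first step of the CC heuristic, grouping the marked nodes by the induced subgraph, can be sketched as a depth-first search that only walks over marked nodes. The sketch treats edges as undirected and stops before the minimum-cost tree step. The function name is an assumption.

```python
def marked_components(adjacency, marked):
    """Connected components of the subgraph induced on the marked nodes."""
    marked = set(marked)
    seen = set()
    components = []
    for start in sorted(marked):
        if start in seen:
            continue
        component = []
        stack = [start]
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in adjacency[node]:
                # Only marked nodes belong to the induced subgraph.
                if neighbour in marked and neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(sorted(component))
    return components
```

Two marked nodes separated by a single unmarked node land in different components here, which is the case the closure-graph variants (ARB, L1, Lk) address by allowing connectors.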

SLIDE 44

Algorithms

3) Level-1 trees (L1)

  • find minimum cost depth-1 trees in CG
  • expand paths

4) Level-k trees (Lk)

  • refine level-(k-1) trees by finding intermediate nodes v
  • minimizing total cost, i.e. the sum of the cost to each v and its subtrees

SLIDE 45

Experiments

Synthetic examples

SLIDE 46

Experiments

  • Case studies on DBLP

DBLP: RECOMB vs. KDD

SLIDE 47

Experiments

DBLP: NIPS vs. PODS

SLIDE 48

Experiments

GScholar: ‘large graphs’ vs. ‘visual’

SLIDE 49

Intermediate Conclusions

Dot2Dot

  • principled approach to describe sets of marked nodes using structure of the graph
  • automatically finds good connectors
  • automatically determines number of groups

New problem, but many applications in the wild

SLIDE 50

Conclusions

Graph problems are often difficult

  • solutions are typically very ad hoc, very heuristic

Information theory

  • offers a clean and principled way to define solutions

Identifying Infection Sources

  • first to identify multiple sources – extensions currently underway

Explaining Node Sets

  • first to define the problem – many applications in the wild

SLIDE 51

Graph problems are often difficult

  • solutions are typically very ad hoc, very heuristic

Information theory

  • offers a clean and principled way to define solutions

Identifying Infection Sources

  • first to identify multiple sources – extensions currently underway

Explaining Node Sets

  • first to define the problem – many applications in the wild

Thank you!