Models and algorithms for network immunization
Aris Gionis
Basic Research Unit, HIIT
University of Helsinki
a brief introduction...
- ...originally from Greece
- BS, University of Athens, Greece
- MS and PhD, Stanford University, USA
- PhD adviser: Rajeev Motwani
- Thesis title: “Algorithms for similarity search and clustering in large data sets”, July 2003
- in Basic Research Unit, HIIT, Finland, since August 2003
Estonia CS theory days, 29 Oct, 2005 2
Basic Research Unit, HIIT
- research
– Heikki Mannila
– Panayiotis Tsaparas
– Niina Haiminen, Evimaria Terzi
– external collaborators: Foto Afrati, Christos Faloutsos, Spiros Papadimitriou, Alex Hinneburg, . . .
- co-supervising students
– Niina Haiminen, Evimaria Terzi
- teaching courses
– data mining, approximation algorithms, computational complexity, spectral methods for data mining
Research paradigm in BRU
- develop novel data analysis techniques for use in other sciences
- combine basic research in computer science with applications
– look at data analysis problems arising in practice
– abstract new computational concepts from them
– analyze, develop new computational methods
– take the results into practice
⇒ theoretical work in algorithms and foundations of data analysis can have fast impact in the application areas
⇐ the applications feed interesting novel questions to theoretical research
Recent projects
- sequence analysis
– biology, genetics, physics, telecommunications
- analysis of spatial data
– biology, ecology
- ordering problems
– paleontology
- clustering
- analysis of 0–1 matrices
...rest of the talk...
Models and algorithms for network immunization
joint work with George Giakkoupis, Evimaria Terzi, and Panayiotis Tsaparas
Genome segmentations
joint work with Niina Haiminen, Evimaria Terzi, Heikki Mannila
Motivation
- many natural or man-made systems are organized as networks
– internet, web, social networks, protein networks, etc.
- operation is threatened by the propagation of a harmful entity through the network
– diseases in social networks
– gossip or panic in social networks
– failures in power grids
– computer viruses on the internet
- can we restrict the spread of the virus in the network?
Virus spread
Restrain the spread
Naive virus injection
General framework
- network G = (V, E) over which the virus propagates
- virus-propagation model (can be probabilistic)
- adversary who injects copies of the virus in the network
– blind
– adaptive
⇒ immunization algorithm: given a network, a budget k, and a virus-propagation model, find k nodes to immunize so that the spread is minimized
What is the spread?
- network G = (V, E)
- adversary plants r viruses (blindly or adaptively)
- $N_r \subseteq V$: set of nodes selected by the adversary
- expected number of infected nodes: $S(N_r, G)$
- spread: $S_r(G) = \max_{N_r} S(N_r, G)$
- expected spread: $\bar{S}_r(G) = \mathbb{E}_{N_r}[S(N_r, G)]$
Example of immunization algorithms
- immunize a random node
- immunize the node with the largest degree
Virus-propagation models
- problem as stated above is too general
– e.g., no formal specification language for all possible virus-propagation models
- concentrate on two specific virus-propagation models:
– independent cascade, and
– dynamic propagation,
...but similar ideas can be applied to other models, too
Some background models on epidemics
- Susceptible-Infected-Removed (SIR)
– susceptible (healthy) nodes do not have the virus but they can catch it if exposed to somebody who does
– infected nodes have the virus and they can pass it on
– removed (or recovered) nodes have immunity, cannot catch the virus again, and cannot pass it on
- Susceptible-Infected-Susceptible (SIS)
– susceptible nodes
– infected nodes can be healed and become susceptible again
Epidemics background
- traditional studies do not take into account the network structure
– nodes become infected or recovered with uniform probabilities
- modern studies do take into account network topology
- epidemic threshold
– β: infection rate, δ: healing rate, λ = β/δ: effective spreading rate
– there exists a threshold $\lambda_c$ such that
– if $\lambda \ge \lambda_c$, a non-zero fraction of nodes becomes infected (SIR)
– if $\lambda \ge \lambda_c$, the virus spreads and becomes persistent (SIS)
– if $\lambda < \lambda_c$, the virus dies out exponentially fast (SIS)
Epidemics background
- many studies of special cases
- power-law networks do not have (non-zero) epidemic thresholds
- studies of immunizing the highest degree nodes
- immunization in the case of unknown network topology
– immunizing the adjacent node of a random node works well for skewed-degree networks
- . . .
Our approach
- algorithmic approach to the immunization problem
- extensive experimentation
- virus-propagation models considered:
– independent cascade, and
– dynamic propagation
Independent cascade
- initially the adversary plants r viruses in the network
- assume node u becomes infected for the first time at time t:
– u attempts to infect each currently uninfected neighbor v
– it succeeds with probability p
– if u succeeds then v becomes infected
– otherwise u never attempts to infect v again
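The process above can be simulated directly. The following is a minimal sketch, not the authors' code: the function name `independent_cascade`, the adjacency-list input, and the toy edges between q, w, u, v are all assumptions made for illustration. It processes infected nodes in breadth-first order, which yields the same distribution over the final infected set as stepping through time.

```python
import random
from collections import deque

def independent_cascade(adj, seeds, p, rng=None):
    """Simulate one run of the independent-cascade model.

    adj   : dict mapping each node to a list of neighbors (hypothetical input format)
    seeds : nodes where the adversary plants the virus
    p     : probability that an infection attempt succeeds
    Returns the set of nodes infected when the process stops.
    """
    rng = rng or random.Random()
    infected = set(seeds)
    frontier = deque(seeds)
    while frontier:
        u = frontier.popleft()
        for v in adj[u]:
            # u gets exactly one attempt on each neighbor that is still uninfected
            if v not in infected and rng.random() < p:
                infected.add(v)
                frontier.append(v)
    return infected

# Toy graph on four nodes; the edge set is an assumption for illustration.
adj = {"q": ["w", "u"], "w": ["q", "v"], "u": ["q", "v"], "v": ["w", "u"]}
print(sorted(independent_cascade(adj, ["q"], p=0.5, rng=random.Random(0))))
```

Averaging the size of the returned set over many runs gives an estimate of the expected number of infected nodes for a fixed seed set.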
Independent cascade – example
[Figure: cascade among nodes q, w, u, v at Time 1, Time 2, and Time 3]
Independent cascade
- given a sampling of the network links with probability p
– $S_1(G)$ is the size of the largest connected component (adaptive)
– $\bar{S}_1(G)$ is the average connected-component size (blind)
- immunization problem:
– remove k nodes from the network in order to minimize
– the size of the r largest connected components, or
– the average size of a connected component, respectively
- minimizing either $S_r(G)$ or $\bar{S}_r(G)$ is NP-hard
Algorithm for the independent-cascade model
- greedy, i.e., immunize nodes one by one
- for the adaptive-adversary case:
– at each iteration find the node that minimizes the expected size of the largest connected component in the resulting network
- for the blind-adversary case:
– at each iteration find the node that minimizes the expected size of the average connected component in the resulting network
Computing the expectations
- sample many graphs from the $2^{|E|}$ possible graphs
– in each sample graph, edge (u, v) exists with probability p
⇒ in each sampled graph, for each node u, find the size of the largest/average connected component in the graph resulting from removing (immunizing) u
⇒ select the node that minimizes the expectation (largest/average)
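The greedy procedure with sampled graphs can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation used in the experiments: the names `components` and `greedy_immunize`, the edge-list input, the default number of samples, and the choice to reuse one set of sampled graphs across all candidates are assumptions. The blind-adversary score uses the plain mean of component sizes, which is one literal reading of "average connected component".

```python
import random

def components(nodes, edges):
    """Return the sizes of the connected components induced on `nodes` (union-find)."""
    parent = {u: u for u in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        if u in parent and v in parent:  # skip edges touching immunized nodes
            parent[find(u)] = find(v)
    sizes = {}
    for u in nodes:
        r = find(u)
        sizes[r] = sizes.get(r, 0) + 1
    return list(sizes.values())

def greedy_immunize(nodes, edges, p, k, adaptive=True, n_samples=200, seed=0):
    """Greedily pick k nodes to immunize, estimating expectations by sampling."""
    rng = random.Random(seed)
    # Each sample keeps every edge independently with probability p.
    samples = [[e for e in edges if rng.random() < p] for _ in range(n_samples)]
    remaining, chosen = set(nodes), []
    for _ in range(k):
        def score(u):
            rest = remaining - {u}
            total = 0.0
            for s in samples:
                sizes = components(rest, s) or [0]
                # adaptive adversary: largest component; blind: mean component size
                total += max(sizes) if adaptive else sum(sizes) / len(sizes)
            return total / n_samples

        best = min(sorted(remaining), key=score)
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Star graph: immunizing the center disconnects everything.
star_nodes = list(range(6))
star_edges = [(0, i) for i in range(1, 6)]
print(greedy_immunize(star_nodes, star_edges, p=0.8, k=1))
```

Each greedy step evaluates every remaining node on every sample, so this sketch is only practical for small graphs.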
Dynamic-propagation
- a dynamic birth-death process that evolves over time
- virus propagates from node u to neighbor node v with probability β
- at each point in time, a node u that is infected heals with probability δ
Epidemic-threshold property
- Theorem. Consider a network G with adjacency matrix M, propagation probability β, and healing probability δ. If $\beta/\delta < 1/\lambda_1(M)$, where $\lambda_1(M)$ is the largest eigenvalue of M, the expected time until the virus dies out is logarithmic in the number of nodes in the network, against an adaptive adversary
Epidemic threshold (cont.)
- what if β/δ is large?
- notice that the virus will eventually die out
- the dynamical model is hard to analyze because of nonlinearities
- recent work by Ganesh et al. 2005 shows that if $\beta/\delta > 1/\eta(G)$, where η(G) is the isoperimetric constant of the graph, then the expected time until the virus dies out is exponential in the size of the network
Multiple-copies model
- each node can have multiple copies of a virus
- infection probability refers to receiving one more copy
- healing probability refers to removing one copy
- more pessimistic than the single-copy model
- easier to analyze
Multiple-copies model
- at time t, node i has $v_i^t$ copies
- $v^t = [v_1^t, \ldots, v_n^t]$: vector of nodes’ copies
- $\bar{v}^t$: expected value of $v^t$
- then $\bar{v}^{t+1} = \Delta \bar{v}^t$, where $\Delta = \beta M + \mathrm{diag}(1 - \delta, \ldots, 1 - \delta)$
- Theorem. In the multiple-copies model the expected time until the virus dies out is logarithmic if $\beta/\delta < 1/\lambda_1(M)$ and it is unbounded if $\beta/\delta > 1/\lambda_1(M)$
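The threshold in the theorem matches the linear recurrence: since Δ = βM + (1 − δ)I, its largest eigenvalue is βλ1(M) + 1 − δ, which is below 1 exactly when β/δ < 1/λ1(M). A small numerical sketch, assuming a symmetric adjacency matrix and using the hypothetical helper names `system_matrix` and `dies_out`:

```python
import numpy as np

def system_matrix(M, beta, delta):
    """Delta = beta * M + diag(1 - delta, ..., 1 - delta)."""
    return beta * M + (1.0 - delta) * np.eye(M.shape[0])

def dies_out(M, beta, delta):
    """Check the threshold condition beta/delta < 1/lambda_1(M) for symmetric M."""
    lam1 = np.max(np.linalg.eigvalsh(M))
    return beta / delta < 1.0 / lam1

# 4-cycle (an illustrative choice): lambda_1(M) = 2, so the threshold is beta/delta < 0.5.
M = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)

v = np.ones(4)                               # one expected copy per node at time 0
D = system_matrix(M, beta=0.1, delta=0.5)    # beta/delta = 0.2, below the threshold
for _ in range(50):
    v = D @ v                                # expected copies after one more step
print(dies_out(M, 0.1, 0.5), v.sum())
```

Here the expected number of copies shrinks by a factor of 0.7 per step, so it decays geometrically.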
Immunization problem for the dynamic model
- given network G and effective infection rate β/δ, immunize the minimum number of nodes in G such that $\beta/\delta < 1/\lambda_1(M')$, where $M'$ is the adjacency matrix of the immunized network
- we would like to use a greedy approach
- the problem becomes finding the node to immunize so that the largest eigenvalue of the adjacency matrix drops as much as possible
EIG algorithm for dynamic propagation
- B ← M
- while $\beta/\delta > 1/\lambda_1(B)$
– compute $w_1$, the eigenvector of B that corresponds to $\lambda_1(B)$
– find the node u with the maximum value in $w_1$
– remove u from the graph and form the new matrix B′
– B ← B′
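The loop above translates almost line by line into NumPy. The sketch below assumes a symmetric 0/1 adjacency matrix; the function name `eig_immunize` is hypothetical, taking absolute values of the eigenvector entries is a choice made here to sidestep the sign ambiguity of numerical eigenvectors, and the loop also continues at equality so that it stops only once β/δ < 1/λ1(B) holds.

```python
import numpy as np

def eig_immunize(M, beta, delta):
    """Remove nodes until beta/delta < 1/lambda_1(B); return removed node indices."""
    B = np.array(M, dtype=float)
    alive = list(range(B.shape[0]))   # original index of each remaining row
    removed = []
    while alive:
        vals, vecs = np.linalg.eigh(B)            # eigenvalues in ascending order
        lam1, w1 = vals[-1], np.abs(vecs[:, -1])  # principal eigenpair
        if lam1 <= 0 or beta / delta < 1.0 / lam1:
            break                                 # below the epidemic threshold
        i = int(np.argmax(w1))                    # node with the largest entry in w1
        removed.append(alive.pop(i))
        B = np.delete(np.delete(B, i, axis=0), i, axis=1)
    return removed

# Star graph with 5 leaves: lambda_1 = sqrt(5), the center dominates w1.
star = np.zeros((6, 6))
star[0, 1:] = 1
star[1:, 0] = 1
print(eig_immunize(star, beta=0.5, delta=1.0))
```

Recomputing a full eigendecomposition in every iteration keeps the sketch short; for large sparse graphs one would compute only the leading eigenpair.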
Intuition behind the EIG algorithm
- suppose that the “susceptibility” of node i is captured by $w_i$
- probability of virus propagation between i and j: $p_{ij} = w_i w_j$
- healing probability of node i is $1 - w_i^2$
- system matrix $\Delta = w w^T$
- then $\lambda_1(\Delta) = \|w\|^2$ with corresponding eigenvector w (normalized)
- consider $\Delta'$ after immunizing node i (zeroing the i-th row and column of $\Delta$)
- now $\lambda_1(\Delta') = \|w\|^2 - w_i^2$
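The rank-one argument can be checked numerically. In this small sketch the vector `w` is an arbitrary illustrative choice, not data from the talk; zeroing the row and column of the node with the largest entry produces the largest drop in the principal eigenvalue.

```python
import numpy as np

w = np.array([0.9, 0.5, 0.3, 0.1])      # hypothetical susceptibilities
D = np.outer(w, w)                       # system matrix Delta = w w^T
lam1 = np.linalg.eigvalsh(D)[-1]         # equals ||w||^2 for a rank-one matrix

i = int(np.argmax(w))                    # node picked by the EIG rule
D2 = D.copy()
D2[i, :] = 0                             # immunize node i: zero its row ...
D2[:, i] = 0                             # ... and its column
lam1_after = np.linalg.eigvalsh(D2)[-1]  # equals ||w||^2 - w_i^2

print(lam1, lam1_after)
```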
Intuition behind the EIG algorithm
- the principal eigenvalue gives an indication of the connectivity of the network
- a large eigenvalue corresponds to a densely connected network
- the nodes with the maximum value in the first eigenvector are the ones that are most tightly interconnected
- removing these nodes reduces the graph connectivity
- in general EIG selects nodes with high degree, but not always (it takes a more global view)
Experimental setup – algorithms
- compare the performance of the algorithms against other strategies
– MaxDegree
– MaxDegreeIt
– Random
Experimental setup – datasets
- synthetic datasets:
– random graphs (Erdős-Rényi)
– scale-free graphs (Barabási and Albert)
– small-world graphs (Watts; Watts and Strogatz)
- real datasets:
– co-authorship graphs (representing social networks)
– autonomous systems (internet topology)
– power-grid (networks of electricity transfer)
Scale-free graphs (Barabási and Albert)
- preferential attachment
- nodes join the network sequentially
- each new node comes with m edges
- it connects its m edges to existing nodes, which are selected with probability proportional to their degrees
- simulates the rich-get-richer effect
- results in power-law graphs with exponent 3
Small-world graphs
- networks with high clustering coefficient and small average path length
Small-world graphs – Watts model
- generated using a parameter α
- intuitively α controls the probability that two nodes will be connected given the number of their common neighbors
[Figure, two panels: clustering coefficient and characteristic path length as functions of parameter α]
Small-world graphs – Watts-Strogatz model
- the generation process is governed by a parameter q
- initially all nodes are on a ring lattice
- each node has degree k
- each node is rewired to another random node with probability q
Independent cascade
synthetic dataset – scale-free graphs
[Figure: Scale-free graphs (p=0.8). Expected epidemic spread vs. # of immunized nodes for Greedy, Sort, MaxDegree, MaxDegreeIt, and Random]
Independent cascade
synthetic dataset – small-world graphs
[Figure, two panels: Watts-Strogatz model (q=0.01, p=0.8) and Watts model (alpha=6, p=0.8). Expected epidemic spread vs. # of immunized nodes for Greedy, Sort, MaxDegree, MaxDegreeIt, and Random]
Independent cascade
synthetic datasets – small-world graphs
[Figure, two panels: Watts graph with a=3, p=0.6 and Watts graph with a=14, p=0.6. Expected number of infected nodes vs. # of immunized nodes for Greedy, Sort, MaxDegree, MaxDegreeIt, and Random]
Independent cascade
real datasets
[Figure, two panels: Co-authors graph (p=0.8) and Power-grid network (p=0.8). Expected epidemic spread vs. # of immunized nodes for Greedy, Sort, MaxDegree, MaxDegreeIt, and Random]
Dynamic propagation
synthetic datasets
[Figure, three panels: scale-free graph with 4000 nodes; Watts-Strogatz with 4000 nodes, p=0.01; Watts graph with 4000 nodes, a=6. lambda’/lambda vs. number of removed nodes for Eig, Batch, MaxDegree, and MaxDegreeIt]
Dynamic propagation
real datasets
[Figure, two panels: co-authors graph and power-grid graph. lambda’/lambda vs. number of removed nodes for Eig, Batch, MaxDegree, and MaxDegreeIt]
Conclusions
- network immunization problem under different virus-propagation models
- greedy algorithms work well in practice
- applications in epidemiology and security of computer networks
- many open problems
– can we do better than the greedy?
– which node to remove in order to obtain the largest drop in the eigenvalue?
...complete change of topic...
Genome segmentations
joint work with Niina Haiminen, Evimaria Terzi, Heikki Mannila
(k,h)-segmentation
- [Gionis and Mannila 03]
- given sequence $S = a_1, a_2, \ldots, a_n$
- we want to find k segments
- but only h < k different segment types are allowed
- each of the k segments should be assigned to one of the h types
- find the best segmentation into k segments, the h types, and the assignment of each segment to one type
(k,h)-segmentation: problem definition
- assume piecewise constant representation and $L_2^2$ error
- given sequence $S = a_1, a_2, \ldots, a_n$
- we want to find
– a partition of S into k segments $S_1, \ldots, S_k$,
– h levels $l_1, \ldots, l_h$
– an assignment of each segment j to a level $l_j \in \{l_1, \ldots, l_h\}$
in order to minimize the total error
$$R[n, k, h] = \sum_{j=1}^{k} \sum_{i=b_j}^{e_j} (a_i - l_j)^2$$
where $b_j$ and $e_j$ are the first and last positions of segment $S_j$
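For a concrete candidate solution the objective is straightforward to evaluate. In the sketch below the function name `kh_error` and the 0-based inclusive index pairs are conventions chosen here for illustration, not notation from the talk.

```python
def kh_error(a, boundaries, assignment, levels):
    """Total squared error of a (k,h)-segmentation.

    a          : the sequence a_1..a_n as a Python list
    boundaries : list of k (b_j, e_j) pairs, 0-based and inclusive
    assignment : for each segment, the index of its level in `levels`
    levels     : the h level values
    """
    total = 0.0
    for (b, e), t in zip(boundaries, assignment):
        total += sum((x - levels[t]) ** 2 for x in a[b:e + 1])
    return total

a = [1, 1, 5, 5, 1, 1]
# k = 3 segments sharing h = 2 levels: the first and third segment use level 0.
print(kh_error(a, [(0, 1), (2, 3), (4, 5)], [0, 1, 0], [1.0, 5.0]))  # 0.0
```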
Example
Example: k = 3 and h = 3
Example: k = 3 and h = 2
Some facts about the (k, h)-segmentation problem
- NP-complete problem for multidimensional data (d > 1), w.r.t. $L_1$ and $L_2$ (contrast with k-segmentation, which is polynomial)
- generalizes k-segmentation and clustering
– k-segmentation: h = k
– clustering: k = n
- simple approximation algorithms that combine the above two subproblems
– d = 1: 3-approximation for $L_1$, 5-approximation for $L_2^2$
– d > 1: $(3 + \epsilon)$-approx. for $L_1$, $(1 + 4\alpha^2)$-approx. for $L_2^2$, where α is the best approximation factor for k-means
ClusterSegments algorithm
- solve k-segmentation problem and obtain segments S1, . . . , Sk
- consider the representative $c_j$ for each segment $S_j$ (mean, median, etc.)
- map segment $S_j$ to a weighted point with value $c_j$ and weight $w_j = |S_j|$
- cluster those k weighted points to h centers $L = \{l_1, \ldots, l_h\}$
- assign each segment to its closest center in L
- running time is $O(n^2 k)$ (from dynamic programming)
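The steps above can be sketched for one-dimensional data with squared error. Everything named here (`k_segmentation`, `weighted_kmeans_1d`, `nearest`, `cluster_segments`) is hypothetical, segment means are used as representatives, and the clustering step is a plain Lloyd-style weighted k-means with a simple deterministic initialization, which is an assumption: the slides only say "cluster".

```python
def k_segmentation(a, k):
    """Optimal squared-error k-segmentation by dynamic programming, O(n^2 k)."""
    n = len(a)
    ps, ps2 = [0.0], [0.0]                       # prefix sums of a and a^2
    for x in a:
        ps.append(ps[-1] + x)
        ps2.append(ps2[-1] + x * x)

    def cost(i, j):                              # error of a[i:j] around its mean
        s = ps[j] - ps[i]
        return ps2[j] - ps2[i] - s * s / (j - i)

    INF = float("inf")
    E = [[INF] * (n + 1) for _ in range(k + 1)]  # E[s][j]: best error, s segments, prefix j
    back = [[0] * (n + 1) for _ in range(k + 1)]
    E[0][0] = 0.0
    for s in range(1, k + 1):
        for j in range(s, n + 1):
            for i in range(s - 1, j):
                c = E[s - 1][i] + cost(i, j)
                if c < E[s][j]:
                    E[s][j], back[s][j] = c, i
    segs, j = [], n
    for s in range(k, 0, -1):                    # walk back through the boundaries
        i = back[s][j]
        segs.append((i, j - 1))
        j = i
    return segs[::-1]

def nearest(p, centers):
    """Index of the center closest to p."""
    return min(range(len(centers)), key=lambda c: (p - centers[c]) ** 2)

def weighted_kmeans_1d(points, weights, h, iters=100):
    """Lloyd iterations on weighted 1-D points (assumed clustering routine)."""
    pts = sorted(set(points))
    centers = [pts[(2 * i + 1) * len(pts) // (2 * h)] for i in range(h)]
    for _ in range(iters):
        assign = [nearest(p, centers) for p in points]
        new = list(centers)
        for c in range(h):
            wsum = sum(w for w, t in zip(weights, assign) if t == c)
            if wsum:                             # keep the old center if the cluster is empty
                new[c] = sum(w * p for p, w, t in zip(points, weights, assign) if t == c) / wsum
        if new == centers:
            break
        centers = new
    return centers

def cluster_segments(a, k, h):
    segs = k_segmentation(a, k)
    reps = [sum(a[b:e + 1]) / (e - b + 1) for b, e in segs]  # representatives c_j (means)
    wts = [e - b + 1 for b, e in segs]                       # weights w_j = |S_j|
    levels = weighted_kmeans_1d(reps, wts, h)
    return segs, levels, [nearest(c, levels) for c in reps]

print(cluster_segments([1, 1, 1, 5, 5, 5, 1, 1, 1], k=3, h=2))
```

The dynamic program dominates the running time; the clustering step touches only k weighted points.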
ClusterSegments example, k = 3, h = 2
Iterative algorithm
- if we know the k best segments, we can find the h best levels
- if we know the h best levels, we can find the k best segments
- start from an initial solution, e.g., the one produced by the previous algorithm
- iterate:
– keep segment boundaries fixed, find levels
– keep levels fixed, find boundaries
- EM-style, fast convergence, good results
DNA segmentation
- segmentation: a powerful concept for examining the large-scale organization of DNA
- many examples of segments in DNA
– (telomere, main-sequence, centromere)
– (gene-rich, junk DNA)*
– (regulatory region, gene, regulatory region, junk DNA)*
– (microbial insert | viral insert | ancient mammalian)*
- goal is to understand the complexity of the genome organization based on segments and recurrent sources
DNA segmentation
- existing approaches with top-down segmentation and greedy identification of similar segments [Bernaola-Galván et al. 96, Bernaola-Galván et al. 00, Li 01, Azad et al. 02]
- here we describe some of our own experiments with (k, h)-segmentation [Haiminen et al. 05]
Distinguishing genomes of different species
- create many “semi-synthetic” datasets $H_iS_j$ by concatenating
– $H_i$: the i-th chromosome of human with
– $S_j$: the j-th chromosome of another species S
- apply (k, h)-segmentation and compare with the ground truth segmentation
- let $L = \{l_1, \ldots, l_h\}$ be the discovered sources in the concatenated sequence, and $L_H$ and $L_S$ be the distributions of lengths of sources of L in chromosomes H and S, resp.
- compare the variational distance between the two distributions
– 0: identical distributions, 1: completely distinct distributions
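A sketch of the comparison step, assuming the variational distance is the usual total variation distance, i.e. half the L1 distance between the normalized distributions, which matches the 0-to-1 range stated above. The function name and the input format (total length covered by each source in each chromosome) are assumptions for illustration.

```python
def variational_distance(len_h, len_s):
    """Half the L1 distance between two length distributions over the same h sources.

    len_h[i], len_s[i]: total length assigned to source i in chromosome H and S.
    """
    sh, ss = sum(len_h), sum(len_s)
    return 0.5 * sum(abs(x / sh - y / ss) for x, y in zip(len_h, len_s))

# Sources used in the same proportions in both chromosomes vs. disjoint sources.
print(variational_distance([2, 2], [1, 1]))
print(variational_distance([3, 1, 0], [0, 0, 4]))
```

A distance close to 1 means the segmentation assigns the two chromosomes to nearly disjoint sets of sources, so the species are well separated.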
Genomes of different species – sample segmentations
[Figures: ground truth vs. (k,h)-segmentation for human 8 vs. chicken 8, human 8 vs. zebra fish 8, human 8 vs. dog 8, human 8 vs. mouse 8, human 8 vs. chimp 8, and human 8 vs. human 8]
Genomes of different species – variational distances
[Figures: variational distance for Human-Zbfish mixes (HxZy), Human-Chicken mixes (HxCy), Human-Mouse mixes (HxMy), and Human-Chimp mixes (HxCMPy); curves: data, random]
Distinguishing coding from non-coding regions
- Rickettsia bacterium region that includes 13 genes and the non-coding regions in between
- 10 bp non-overlapping windows
- in each window, features that capture the existence of codons
Distinguishing coding from non-coding regions
[Figure: segmentations with (k,h) = (20,2) and (k,h) = (27,3) compared with the ground truth, GT: (25,3); horizontal axis in kbp]
DNA segmentations — conclusions
- segmentation is a promising tool for analyzing genomic sequences
- fascinating problem of understanding the structure of DNA
Thank you!
- for your attention
- Helger Lipmaa and Tarmo Uustalu for the invitation
- hope to learn more about CS research and theory in Estonia...
- ...hope to enjoy the weekend, too!