Models and algorithms for network immunization Aris Gionis Basic - - PowerPoint PPT Presentation

SLIDE 1

Models and algorithms for network immunization

Aris Gionis Basic Research Unit, HIIT University of Helsinki

SLIDE 2

a brief introduction...

  • ...originally from Greece
  • BS, University of Athens, Greece
  • MS and PhD, Stanford University, USA
  • PhD adviser: Rajeev Motwani
  • Thesis title: “Algorithms for similarity search and clustering in large data sets”, July 2003

  • in Basic Research Unit, HIIT, Finland, since August 2003

Estonia CS theory days, 29 Oct, 2005 2

SLIDE 3

Basic Research Unit, HIIT

  • research

    – Heikki Mannila
    – Panayiotis Tsaparas
    – Niina Haiminen, Evimaria Terzi
    – external collaborators: Foto Afrati, Christos Faloutsos, Spiros Papadimitriou, Alex Hinneburg, ...

  • co-supervising students

– Niina Haiminen, Evimaria Terzi

  • teaching courses

– data mining, approximation algorithms, computational complexity, spectral methods for data mining

SLIDE 4

Research paradigm in BRU

  • develop novel data analysis techniques for use in other sciences
  • combine basic research in computer science with applications

    – look at data analysis problems arising in practice
    – abstract new computational concepts from them
    – analyze, develop new computational methods
    – take the results into practice

⇒ theoretical work in algorithms and foundations of data analysis can have fast impact in the application areas
⇐ the applications feed interesting novel questions to theoretical research

SLIDE 5

Recent projects

  • sequence analysis

– biology, genetics, physics, telecommunications

  • analysis of spatial data

– biology, ecology

  • ordering problems

– paleontology

  • clustering
  • analysis of 0–1 matrices

SLIDE 6

...rest of the talk...

Models and algorithms for network immunization

joint work with George Giakkoupis, Evimaria Terzi, and Panayiotis Tsaparas

Genome segmentations

joint work with Niina Haiminen, Evimaria Terzi, Heikki Mannila

SLIDE 7

Motivation

  • many natural or man-made systems are organized as networks

– internet, web, social networks, protein networks, etc.

  • operation is threatened by the propagation of a harmful entity through the network
    – diseases in social networks
    – gossip or panic in social networks
    – failures in power grids
    – computer viruses on the internet

  • can we restrict the spread of the virus in the network?

SLIDES 8-15

Virus spread (figure sequence)

SLIDES 16-19

Restrain the spread (figure sequence)

SLIDE 20

Naive virus injection

SLIDE 21

General framework

  • network G = (V, E) over which the virus propagates
  • virus-propagation model (can be probabilistic)
  • adversary who injects copies of the virus in the network

    – blind
    – adaptive

⇒ immunization algorithm: given a network, a budget k, and a virus-propagation model, find k nodes to immunize so that the spread is minimized

SLIDE 22

What is the spread?

  • network G = (V, E)
  • adversary plants r viruses (blindly or adaptively)
  • $N_r \subseteq V$: set of nodes selected by the adversary
  • expected number of infected nodes: $S(N_r, G)$
  • spread (adaptive adversary): $S_r(G) = \max_{N_r} S(N_r, G)$
  • expected spread (blind adversary): $\bar{S}_r(G) = \mathbb{E}_{N_r}[S(N_r, G)]$

SLIDE 23

Examples of immunization algorithms

  • immunize a random node
  • immunize the node with the largest degree
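
The two example strategies can be sketched in a few lines. This is only an illustration, assuming the network is stored as a dictionary that maps each node to the set of its neighbors; the function names `immunize_random`, `immunize_max_degree`, and `remove_node` are hypothetical and do not come from the talk.

```python
import random


def immunize_random(graph, rng=random):
    """Return a uniformly random node of `graph` (dict: node -> set of neighbors)."""
    return rng.choice(sorted(graph))


def immunize_max_degree(graph):
    """Return a node with the largest degree (ties go to the smallest label)."""
    return max(sorted(graph), key=lambda u: len(graph[u]))


def remove_node(graph, u):
    """Return a copy of `graph` without node `u` and its incident edges."""
    return {v: nbrs - {u} for v, nbrs in graph.items() if v != u}


# A star centred at node 0, plus one extra edge (1, 2).
star = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
hub = immunize_max_degree(star)
immunized = remove_node(star, hub)
```

Immunizing a node is modeled here as deleting it from the graph, which matches the way the later slides describe immunization as node removal.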

SLIDE 24

Virus-propagation models

  • problem as stated above is too general

– e.g., no formal specification language for all possible virus-propagation models

  • concentrate on two specific virus-propagation models:

    – independent cascade, and
    – dynamic propagation,

...but similar ideas can be applied to other models, too

SLIDE 25

Some background models on epidemics

  • Susceptible-Infected-Removed (SIR)

    – susceptible (healthy) nodes do not have the virus, but they can catch it if exposed to somebody who does
    – infected nodes have the virus and can pass it on
    – removed (or recovered) nodes have immunity: they cannot catch the virus again and cannot pass it on

  • Susceptible-Infected-Susceptible (SIS)

    – susceptible nodes
    – infected nodes can be healed and become susceptible again

SLIDE 26

Epidemics background

  • traditional studies do not take into account the network structure
    – nodes become infected or recovered with uniform probabilities
  • modern studies do take into account network topology
  • epidemic threshold
    – β: infection rate, δ: healing rate, λ = β/δ: effective spreading rate
    – ∃ λc such that
        – if λ ≥ λc, a non-zero fraction of nodes becomes infected (SIR)
        – if λ ≥ λc, the virus spreads and becomes persistent (SIS)
        – if λ < λc, the virus dies out exponentially fast (SIS)

SLIDE 27

Epidemics background

  • many studies of special cases
  • power-law networks do not have (non-zero) epidemic thresholds
  • studies of immunizing the highest degree nodes
  • immunization in the case of unknown network topology

– immunizing the adjacent node of a random node works well for skewed-degree networks

  • . . .

SLIDE 28

Our approach

  • algorithmic approach to the immunization problem
  • extensive experimentation
  • virus-propagation models considered:

    – independent cascade, and
    – dynamic propagation

SLIDE 29

Independent cascade

  • initially the adversary plants r viruses in the network
  • assume node u becomes infected for the first time at time t:
    – u attempts to infect each currently uninfected neighbor v
    – each attempt succeeds with probability p
    – if u succeeds, then v becomes infected
    – otherwise, u never attempts to infect v again
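
The process above can be simulated directly. The following is a minimal sketch, assuming an undirected graph stored as a dictionary of neighbor sets; the names `independent_cascade` and `estimate_spread` are illustrative, and the Monte Carlo average is just one simple way to approximate the expected number of infected nodes.

```python
import random


def independent_cascade(graph, seeds, p, rng):
    """Simulate one run of the independent-cascade model.

    graph: dict mapping node -> set of neighbors (undirected)
    seeds: nodes where the adversary plants the virus
    p:     probability that a single infection attempt succeeds
    Returns the set of nodes infected when the process stops.
    """
    infected = set(seeds)
    frontier = list(seeds)  # nodes infected in the previous time step
    while frontier:
        next_frontier = []
        for u in frontier:
            # u gets exactly one attempt per uninfected neighbor
            for v in sorted(graph[u]):
                if v not in infected and rng.random() < p:
                    infected.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return infected


def estimate_spread(graph, seeds, p, runs=2000, seed=0):
    """Monte Carlo estimate of the expected number of infected nodes."""
    rng = random.Random(seed)
    total = sum(len(independent_cascade(graph, seeds, p, rng)) for _ in range(runs))
    return total / runs


# Path 0 - 1 - 2 - 3 with the virus planted at node 0.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
spread = estimate_spread(path, [0], p=0.5)
```

On this path the exact expectation is 1 + 0.5 + 0.25 + 0.125 = 1.875, so the estimate should land close to that value.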

SLIDES 30-33

Independent cascade: example (figure sequence: nodes q, w, u, v at Time 1, Time 2, and Time 3)

SLIDES 34-36

Independent cascade (figure sequence)

SLIDE 37

Independent cascade

  • given a sample of the network links, each kept with probability p:
    – $S_1(G)$ is the size of the largest connected component (adaptive adversary)
    – $\bar{S}_1(G)$ is the average connected-component size (blind adversary)
  • immunization problem: remove k nodes from the network in order to minimize
    – the size of the r largest connected components, or
    – the average size of a connected component, respectively
  • both the $S_r(G)$ and the $\bar{S}_r(G)$ variants are NP-hard

SLIDE 38

Algorithm for the independent-cascade model

  • greedy, i.e., immunize nodes one by one
  • for the adaptive-adversary case:

– at each iteration find the node that minimizes the expected size of the largest connected component in the resulting network

  • for the blind-adversary case:

– at each iteration find the node that minimizes the expected average connected-component size in the resulting network

SLIDE 39

Computing the expectations

  • sample many graphs from the $2^{|E|}$ possible subgraphs
    – in each sampled graph, edge (u, v) exists with probability p

⇒ in each sampled graph, for each node u, find the size of the largest/average connected component in the graph that results from removing (immunizing) u
⇒ select the node that minimizes the expectation (largest/average)
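
The sampling procedure and the greedy loop can be put together as follows. This is a sketch under stated assumptions rather than the authors' implementation: the function names, the union-find helper, the number of samples, and the choice to draw fresh samples at every greedy step are all illustrative.

```python
import random


def sample_edges(edges, p, rng):
    """Keep each edge independently with probability p."""
    return [e for e in edges if rng.random() < p]


def component_sizes(nodes, edges):
    """Sizes of the connected components of (nodes, edges), via union-find."""
    parent = {u: u for u in nodes}

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    for u, v in edges:
        if u in parent and v in parent:  # skip edges that touch removed nodes
            parent[find(u)] = find(v)
    sizes = {}
    for u in nodes:
        root = find(u)
        sizes[root] = sizes.get(root, 0) + 1
    return list(sizes.values())


def greedy_immunize(nodes, edges, k, p, adversary="adaptive", samples=200, seed=0):
    """Greedily pick k nodes; each step removes the node with the smallest estimate."""
    rng = random.Random(seed)
    # adaptive adversary: largest component; blind adversary: average component size
    score = max if adversary == "adaptive" else (lambda s: sum(s) / len(s))
    remaining, chosen = set(nodes), []
    for _ in range(k):
        sampled = [sample_edges(edges, p, rng) for _ in range(samples)]

        def expected(u):
            rest = remaining - {u}
            if not rest:
                return 0.0
            return sum(score(component_sizes(rest, g)) for g in sampled) / samples

        best = min(sorted(remaining), key=expected)
        chosen.append(best)
        remaining.remove(best)
    return chosen


# Two triangles joined through the bridge node 3.
nodes = list(range(7))
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (5, 6), (4, 6)]
first_pick = greedy_immunize(nodes, edges, k=1, p=1.0, samples=20)
```

With p = 1 every sampled graph equals the input graph, so against an adaptive adversary the first pick is the bridge node, which splits the network into two components of size 3.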

SLIDE 40

Dynamic-propagation

  • a dynamic birth-death process that evolves over time
  • virus propagates from node u to neighbor node v with probability β
  • at each point in time, a node u that is infected heals with probability δ
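
One time step of this process might be simulated as below. The slide does not fix the order of events within a step, so this sketch assumes that infection attempts happen first and healing second, both based on the set of nodes infected at the start of the step; the name `dynamic_step` is hypothetical.

```python
import random


def dynamic_step(graph, infected, beta, delta, rng):
    """One synchronous step of the dynamic-propagation model.

    Each infected node passes the virus to each healthy neighbor with
    probability beta, and each node infected at the start of the step
    heals with probability delta (assumed order: infect, then heal).
    """
    newly = {v for u in sorted(infected) for v in sorted(graph[u])
             if v not in infected and rng.random() < beta}
    healed = {u for u in sorted(infected) if rng.random() < delta}
    return (infected - healed) | newly


path = {0: {1}, 1: {0, 2}, 2: {1}}
rng = random.Random(0)
after_spread = dynamic_step(path, {0}, beta=1.0, delta=0.0, rng=rng)
after_heal = dynamic_step(path, {0}, beta=0.0, delta=1.0, rng=rng)
```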

SLIDE 41

Epidemic-threshold property

  • Theorem. Consider a network G with adjacency matrix M, propagation probability β, and healing probability δ. If $\beta/\delta < 1/\lambda_1(M)$, where $\lambda_1(M)$ is the largest eigenvalue of M, then the expected time until the virus dies out is logarithmic in the number of nodes in the network, against an adaptive adversary.

SLIDE 42

Epidemic threshold (cont.)

  • what if β/δ is large?
  • notice that the virus will eventually die out
  • the dynamical model is hard to analyze because of nonlinearities
  • recent work by Ganesh et al. 2005 shows that if β/δ > 1/η(G), where η(G) is the isoperimetric constant of the graph, then the expected time until the virus dies out is exponential in the size of the network

SLIDE 43

Multiple-copies model

  • each node can have multiple copies of a virus
  • infection probability refers to receiving one more copy
  • healing probability refers to removing one copy
  • more pessimistic than the single-copy model
  • easier to analyze

SLIDE 44

Multiple-copies model

  • at time t, node i has $v_i^t$ copies
  • $v^t = [v_1^t, \ldots, v_n^t]$: vector of the nodes’ copies; $\bar{v}^t$: expected value of $v^t$
  • then $\bar{v}^{t+1} = \Delta \bar{v}^t$, where $\Delta = \beta M + \mathrm{diag}(1 - \delta, \ldots, 1 - \delta)$
  • Theorem. In the multiple-copies model the expected time until the virus dies out is logarithmic if $\beta/\delta < 1/\lambda_1(M)$ and it is unbounded if $\beta/\delta > 1/\lambda_1(M)$
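
The linear recurrence makes the threshold easy to check numerically: the largest eigenvalue of Δ is βλ1(M) + 1 − δ, which is below 1 exactly when β/δ < 1/λ1(M). The sketch below iterates the expected copy counts on a triangle; the helper names and the shifted power iteration used to estimate λ1 are choices made for this illustration and assume a symmetric, non-negative matrix.

```python
def mat_vec(A, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def largest_eigenvalue(A, iters=1000):
    """Estimate lambda_1 of a symmetric non-negative matrix.

    Power iteration is run on A + I (same eigenvectors, eigenvalues shifted
    by 1) so that bipartite graphs also converge; the Rayleigh quotient of
    the resulting vector with respect to A is returned.
    """
    x = [1.0] * len(A)
    for _ in range(iters):
        y = [yi + xi for yi, xi in zip(mat_vec(A, x), x)]
        norm = sum(c * c for c in y) ** 0.5
        x = [c / norm for c in y]
    return sum(a * b for a, b in zip(x, mat_vec(A, x)))


def expected_copies(M, beta, delta, v0, steps):
    """Iterate v <- Delta v with Delta = beta * M + diag(1 - delta)."""
    n = len(M)
    Delta = [[beta * M[i][j] + ((1.0 - delta) if i == j else 0.0)
              for j in range(n)] for i in range(n)]
    v = list(v0)
    for _ in range(steps):
        v = mat_vec(Delta, v)
    return v


triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
lam1 = largest_eigenvalue(triangle)  # equals 2 for a triangle
below = expected_copies(triangle, beta=0.1, delta=0.4, v0=[1, 0, 0], steps=50)
above = expected_copies(triangle, beta=0.3, delta=0.4, v0=[1, 0, 0], steps=50)
```

Here β/δ = 0.25 is below the threshold 1/λ1 = 0.5, so the expected number of copies shrinks by a factor 0.8 per step, while β/δ = 0.75 is above it and the expected count grows by a factor 1.2 per step.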

SLIDE 45

Immunization problem for the dynamic model

  • given network G and effective infection rate β/δ, immunize the minimum number of nodes in G such that β/δ < 1/λ1(M′), where M′ is the adjacency matrix of the immunized network
  • we would like to use a greedy approach
  • the problem becomes finding the node to immunize so that the largest eigenvalue of the adjacency matrix drops as much as possible

SLIDE 46

EIG algorithm for dynamic propagation

  • B ← M
  • while β/δ > 1/λ1(B)
    – compute w1, the eigenvector of B that corresponds to λ1(B)
    – find the node u with the maximum value in w1
    – remove u from the graph and form the new matrix B′
    – B ← B′
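
A direct transcription of this loop is sketched below. It assumes a symmetric 0/1 adjacency matrix given as a list of rows and estimates the principal eigenpair with a shifted power iteration, which is a stand-in chosen for this example (a library eigensolver would normally be used); the names `principal_eigenpair` and `eig_immunize` are hypothetical.

```python
def principal_eigenpair(B, iters=1000):
    """Largest eigenvalue and eigenvector of a symmetric non-negative matrix.

    Power iteration on B + I, so that bipartite graphs converge as well.
    """
    n = len(B)
    x = [1.0 / n ** 0.5] * n
    for _ in range(iters):
        y = [sum(b * c for b, c in zip(row, x)) + xi for row, xi in zip(B, x)]
        norm = sum(c * c for c in y) ** 0.5
        x = [c / norm for c in y]
    lam = sum(xi * sum(b * c for b, c in zip(row, x)) for row, xi in zip(B, x))
    return lam, x


def eig_immunize(M, beta, delta):
    """Remove nodes until beta/delta <= 1/lambda_1; returns the removed nodes."""
    alive = list(range(len(M)))
    removed = []
    while alive:
        B = [[M[i][j] for j in alive] for i in alive]
        lam, w = principal_eigenpair(B)
        if lam <= 1e-12 or beta / delta <= 1.0 / lam:
            break  # no edges left, or already below the epidemic threshold
        idx = max(range(len(alive)), key=lambda t: abs(w[t]))
        removed.append(alive.pop(idx))
    return removed


# Star with centre 0 and four leaves: lambda_1 = 2.
star = [[0, 1, 1, 1, 1],
        [1, 0, 0, 0, 0],
        [1, 0, 0, 0, 0],
        [1, 0, 0, 0, 0],
        [1, 0, 0, 0, 0]]
picked = eig_immunize(star, beta=0.3, delta=0.5)
```

For the star, β/δ = 0.6 exceeds 1/λ1 = 0.5, the centre has the largest eigenvector entry, and removing it leaves no edges, so the loop stops after one removal.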

SLIDE 47

Intuition behind the EIG algorithm

  • suppose that the “susceptibility” of node i is captured by $w_i$
  • probability of virus propagation between i and j: $p_{ij} = w_i w_j$
  • healing probability of node i is $1 - w_i^2$
  • system matrix $\Delta = w w^T$
  • then $\lambda_1(\Delta) = \|w\|^2$, with corresponding eigenvector w (normalized)
  • consider $\Delta'$ after immunizing node i (zeroing the i-th row and column of $\Delta$)
  • now $\lambda_1(\Delta') = \|w\|^2 - w_i^2$
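
The two eigenvalue claims follow in one line each from the rank-one form of the system matrix. In the derivation below, $w'$ is notation introduced here for the vector $w$ with its i-th entry set to zero.

```latex
\Delta w = (w w^{T})\, w = w\,(w^{T} w) = \|w\|^{2}\, w
\quad\Longrightarrow\quad \lambda_1(\Delta) = \|w\|^{2}.
% Zeroing row i and column i of w w^T gives another rank-one matrix:
\Delta' = w' w'^{T}, \qquad
\lambda_1(\Delta') = \|w'\|^{2} = \sum_{j \neq i} w_j^{2} = \|w\|^{2} - w_i^{2}.
```

The drop in the eigenvalue is therefore $w_i^2$, which is largest for the node with the largest entry of the principal eigenvector. In this idealized setting that is exactly the node EIG picks.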

SLIDE 48

Intuition behind the EIG algorithm

  • the principal eigenvalue gives an indication of the connectivity of the network
  • large eigenvalue corresponds to a densely connected network
  • the nodes with the maximum value in the first eigenvector are the ones that are most tightly interconnected
  • removing these nodes reduces the graph connectivity
  • in general EIG selects nodes with high degree, but not always (more global view)

SLIDE 49

Experimental setup – algorithms

  • compare the performance of the algorithms against other strategies
    – MaxDegree
    – MaxDegreeIt
    – Random

SLIDE 50

Experimental setup – datasets

  • synthetic datasets:
    – random graphs (Erdős-Rényi)
    – scale-free graphs (Barabási and Albert)
    – small-world graphs (Watts, Watts and Strogatz)
  • real datasets:
    – co-authorship graphs (representing social networks)
    – autonomous systems (internet topology)
    – power grid (networks of electricity transfer)

SLIDE 51

Scale-free graphs (Barabási and Albert)

  • preferential attachment
  • nodes join the network sequentially
  • each new node comes with m edges
  • it connects its m edges to existing nodes, which are selected with probability proportional to their degrees
  • simulates the rich-get-richer effect
  • results in power-law graphs with exponent 3
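
A small generator along these lines is sketched below. The starting graph (a clique on m + 1 nodes) and the rule that a new node's m targets must be distinct are assumptions made for this example; the slide does not specify either detail.

```python
import random


def barabasi_albert(n, m, seed=0):
    """Grow a graph by preferential attachment.

    Returns a set of undirected edges (u, v) with u < v.
    Assumes n > m >= 1.
    """
    rng = random.Random(seed)
    edges = set()
    # Assumed seed graph: a clique on m + 1 nodes, so every degree is positive.
    for u in range(m + 1):
        for v in range(u + 1, m + 1):
            edges.add((u, v))
    # `stubs` lists every node once per incident edge, so a uniform draw
    # from it selects a node with probability proportional to its degree.
    stubs = [x for e in sorted(edges) for x in e]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(stubs))
        for t in sorted(targets):
            edges.add((t, new))
            stubs.extend((t, new))
    return edges


ba_edges = barabasi_albert(50, 2)
degree = {}
for u, v in ba_edges:
    degree[u] = degree.get(u, 0) + 1
    degree[v] = degree.get(v, 0) + 1
```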

SLIDE 52

Small-world graphs

  • networks with high clustering coefficient and small average path length

SLIDE 53

Small-world graphs – Watts model

  • generated using a parameter α
  • intuitively, α controls the probability that two nodes will be connected given the number of their common neighbors

[Figure: clustering coefficient and characteristic path length as functions of the parameter α]

SLIDE 54

Small-world graphs – Watts-Strogatz model

  • the generation process is governed by a rewiring probability q
  • initially all nodes are on a ring lattice
  • each node has degree k
  • each edge is rewired to a random node with probability q
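
The construction can be sketched as follows, assuming k is even and n > k. Keeping one endpoint of a rewired edge fixed and forbidding self-loops and duplicate edges are conventions chosen for this example; the function name `watts_strogatz` is illustrative.

```python
import random


def watts_strogatz(n, k, q, seed=0):
    """Ring lattice with degree k (k even), each edge rewired with probability q.

    Returns a dict mapping node -> set of neighbors.
    """
    rng = random.Random(seed)
    adj = {u: set() for u in range(n)}
    # Ring lattice: every node is linked to its k/2 nearest neighbors on each side.
    for u in range(n):
        for step in range(1, k // 2 + 1):
            v = (u + step) % n
            adj[u].add(v)
            adj[v].add(u)
    # Rewire: replace edge (u, v) by (u, w) for a random non-neighbor w.
    for step in range(1, k // 2 + 1):
        for u in range(n):
            v = (u + step) % n
            if v in adj[u] and rng.random() < q:
                candidates = [w for w in range(n) if w != u and w not in adj[u]]
                if candidates:
                    w = rng.choice(candidates)
                    adj[u].discard(v)
                    adj[v].discard(u)
                    adj[u].add(w)
                    adj[w].add(u)
    return adj


lattice = watts_strogatz(20, 4, q=0.0)
rewired = watts_strogatz(20, 4, q=0.5)
```

Rewiring moves edges without creating or deleting any, so the total number of edges stays at nk/2 for every value of q.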

SLIDE 55

Independent cascade

synthetic dataset – scale-free graphs

[Figure: Scale-Free Graphs (p=0.8). x-axis: # of immunized nodes; y-axis: expected epidemic spread; curves: Greedy, Sort, MaxDegree, MaxDegreeIt, Random]

SLIDE 56

Independent cascade

synthetic dataset – small-world graphs

[Figure, panel 1: Watts-Strogatz Model (q=0.01, p=0.8); panel 2: Watts Model (alpha=6, p=0.8). x-axis: # of immunized nodes; y-axis: expected epidemic spread; curves: Greedy, Sort, MaxDegree, MaxDegreeIt, Random]

SLIDE 57

Independent cascade

synthetic datasets – small-world graphs

[Figure, panel 1: Watts graph, a=3, p=0.6; panel 2: Watts graph, a=14, p=0.6. x-axis: # of immunized nodes; y-axis: expected number of infected nodes; curves: Greedy, Sort, MaxDegree, MaxDegreeIt, Random]

SLIDE 58

Independent cascade

real datasets

[Figure, panel 1: Co-authors Graph (p=0.8); panel 2: Power-grid Network (p=0.8). x-axis: # of immunized nodes; y-axis: expected epidemic spread; curves: Greedy, Sort, MaxDegree, MaxDegreeIt, Random]

SLIDE 59

Dynamic propagation

synthetic datasets

[Figure, panel 1: scale-free graph with 4000 nodes; panel 2: Watts-Strogatz with 4000 nodes, p=0.01; panel 3: Watts graph with 4000 nodes, a=6. x-axis: number of removed nodes; y-axis: lambda'/lambda; curves: Eig, Batch, MaxDegree, MaxDegreeIt]

SLIDE 60

Dynamic propagation

real datasets

[Figure, panel 1: Co-authors graph; panel 2: Power-grid graph. x-axis: number of removed nodes; y-axis: lambda'/lambda; curves: Eig, Batch, MaxDegree, MaxDegreeIt]

SLIDE 61

Conclusions

  • network immunization problem under different virus-propagation models
  • greedy algorithms work well in practice
  • applications in epidemiology and security of computer networks
  • many open problems
    – can we do better than the greedy algorithms?
    – which node should be removed in order to obtain the largest drop in the eigenvalue?

SLIDE 62

...complete change of topic...

Genome segmentations

joint work with Niina Haiminen, Evimaria Terzi, Heikki Mannila

SLIDE 63

(k,h)-segmentation

  • [Gionis and Mannila 03]
  • given sequence S = a1, a2, . . . , an
  • we want to find k segments
  • but only h < k different segment types are allowed
  • each of the k segments should be assigned to one of the h types
  • find the best segmentation into k segments, the h types, and the assignment of each segment to one type

SLIDE 64

(k,h)-segmentation: problem definition

  • assume piecewise constant representation, and the $L_2^2$ error
  • given sequence $S = a_1, a_2, \ldots, a_n$
  • we want to find
    – a partition of S into k segments $S_1, \ldots, S_k$,
    – h levels $l_1, \ldots, l_h$,
    – an assignment of each segment j to a level $l_j \in \{l_1, \ldots, l_h\}$

in order to minimize the total error

$$R[n, k, h] = \sum_{j=1}^{k} \sum_{i=b_j}^{e_j} (a_i - l_j)^2$$

where $b_j$ and $e_j$ denote the first and last positions of segment $S_j$
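
The objective can be evaluated for any candidate solution with a few lines of code. In this sketch, segments are given as inclusive 0-based index pairs and `assignment[j]` is the index of the level used by segment j; both conventions, and the name `total_error`, are assumptions made for the example.

```python
def total_error(seq, boundaries, levels, assignment):
    """Sum of squared differences between each point and its segment's level.

    boundaries: list of (b_j, e_j) pairs, 0-based and inclusive
    levels:     list of the h level values
    assignment: assignment[j] is the index in `levels` used by segment j
    """
    return sum((seq[i] - levels[assignment[j]]) ** 2
               for j, (b, e) in enumerate(boundaries)
               for i in range(b, e + 1))


seq = [1, 1, 5, 5, 1, 1]
boundaries = [(0, 1), (2, 3), (4, 5)]  # k = 3 segments
perfect = total_error(seq, boundaries, levels=[1, 5], assignment=[0, 1, 0])
shifted = total_error(seq, boundaries, levels=[2, 5], assignment=[0, 1, 0])
```

With k = 3 and h = 2 the first and third segments share a level, which is the situation the (k,h)-segmentation problem is designed to capture.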

SLIDE 65

Example

SLIDE 66

Example: k = 3 and h = 3

SLIDE 67

Example: k = 3 and h = 2

SLIDE 68

Some facts about the (k, h)-segmentation problem

  • NP-complete problem for multidimensional data (d > 1), w.r.t. $L_1$ and $L_2$ (contrast with k-segmentation, which is polynomial)
  • generalizes k-segmentation and clustering
    – k-segmentation: h = k
    – clustering: k = n
  • simple approximation algorithms that combine the above two subproblems
    – d = 1: 3-approximation for $L_1$, 5-approximation for $L_2^2$
    – d > 1: $(3 + \epsilon)$-approximation for $L_1$, $(1 + 4\alpha^2)$-approximation for $L_2^2$, where α is the best approximation factor for k-means

SLIDE 69

ClusterSegments algorithm

  • solve the k-segmentation problem and obtain segments $S_1, \ldots, S_k$
  • consider the representative $c_j$ of each segment $S_j$ (mean, median, etc.)
  • map segment $S_j$ to a weighted point with value $c_j$ and weight $w_j = |S_j|$
  • cluster those k weighted points to h centers $L = \{l_1, \ldots, l_h\}$
  • assign each segment to its closest center in L
  • running time is $O(n^2 k)$ (from dynamic programming)
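
The steps above can be sketched for one-dimensional data with squared error. This is an illustration under assumptions: segment means are used as representatives, segments are end-exclusive index pairs, and the clustering step reuses the same dynamic program on the sorted weighted points (the slide does not say which clustering method is used). The names `segment` and `cluster_segments` are hypothetical.

```python
def segment(values, weights, k):
    """Split `values` into k contiguous pieces minimizing weighted squared error.

    Returns (start, end) index pairs with `end` exclusive; O(n^2 k) time.
    """
    n = len(values)
    # Prefix sums of w, w*x and w*x^2 give the cost of any piece in O(1).
    W, S, Q = [0.0], [0.0], [0.0]
    for x, w in zip(values, weights):
        W.append(W[-1] + w)
        S.append(S[-1] + w * x)
        Q.append(Q[-1] + w * x * x)

    def cost(i, j):  # squared error of values[i:j] around its weighted mean
        s = S[j] - S[i]
        return (Q[j] - Q[i]) - s * s / (W[j] - W[i])

    INF = float("inf")
    best = [[INF] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for c in range(1, k + 1):
        for j in range(c, n + 1):
            for i in range(c - 1, j):
                cand = best[c - 1][i] + cost(i, j)
                if cand < best[c][j]:
                    best[c][j], back[c][j] = cand, i
    pieces, j = [], n
    for c in range(k, 0, -1):
        i = back[c][j]
        pieces.append((i, j))
        j = i
    return pieces[::-1]


def cluster_segments(seq, k, h):
    """k-segmentation, then clustering of the segment means into h levels."""
    segs = segment(seq, [1.0] * len(seq), k)
    reps = [(sum(seq[b:e]) / (e - b), float(e - b)) for b, e in segs]  # (c_j, w_j)
    order = sorted(range(k), key=lambda j: reps[j][0])
    vals = [reps[j][0] for j in order]
    wts = [reps[j][1] for j in order]
    groups = segment(vals, wts, h)  # 1-D clusters are contiguous in sorted order
    levels = [sum(v * w for v, w in zip(vals[b:e], wts[b:e])) / sum(wts[b:e])
              for b, e in groups]
    assignment = [min(range(h), key=lambda c: abs(reps[j][0] - levels[c]))
                  for j in range(k)]
    return segs, levels, assignment


segs, levels, assignment = cluster_segments([1, 1, 1, 5, 5, 5, 1, 1, 1], k=3, h=2)
```

On this toy sequence the three segments are recovered exactly, and the first and last segments are mapped to the same level.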

SLIDES 70-74

ClusterSegments example, k = 3, h = 2 (figure sequence)

SLIDE 75

Iterative algorithm

  • if we know the k best segments, we can find the h best levels
  • if we know the h best levels, we can find the k best segments
  • start from an initial solution, e.g., the one produced by the previous algorithm
  • iterate:
    – keep segment boundaries fixed, find levels
    – keep levels fixed, find boundaries

  • EM-style, fast convergence, good results
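
The alternation can be sketched for one-dimensional data with squared error as follows. The level update here is a simplification chosen for the example: each segment is assigned to the level nearest its mean, and each level then moves to the mean of its points (one Lloyd-style step), rather than re-solving the clustering exactly. Function names and the end-exclusive segment convention are assumptions.

```python
def best_boundaries(seq, levels, k):
    """Levels fixed: optimal split into k segments, each charged to its best level."""
    n = len(seq)
    # pre[c][i] = squared error of seq[:i] with respect to levels[c]
    pre = []
    for l in levels:
        acc = [0.0]
        for x in seq:
            acc.append(acc[-1] + (x - l) ** 2)
        pre.append(acc)

    def cost(i, j):
        return min(p[j] - p[i] for p in pre)

    INF = float("inf")
    best = [[INF] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for c in range(1, k + 1):
        for j in range(c, n + 1):
            for i in range(c - 1, j):
                cand = best[c - 1][i] + cost(i, j)
                if cand < best[c][j]:
                    best[c][j], back[c][j] = cand, i
    pieces, j = [], n
    for c in range(k, 0, -1):
        i = back[c][j]
        pieces.append((i, j))
        j = i
    return pieces[::-1]


def update_levels(seq, pieces, levels):
    """Boundaries fixed: assign segments to nearest levels, then recenter the levels."""
    sums = [0.0] * len(levels)
    counts = [0] * len(levels)
    for b, e in pieces:
        mean = sum(seq[b:e]) / (e - b)
        c = min(range(len(levels)), key=lambda t: abs(mean - levels[t]))
        sums[c] += sum(seq[b:e])
        counts[c] += e - b
    return [sums[c] / counts[c] if counts[c] else levels[c]
            for c in range(len(levels))]


def iterate(seq, pieces, levels, k, rounds=10):
    """Alternate the two steps for a fixed number of rounds."""
    for _ in range(rounds):
        levels = update_levels(seq, pieces, levels)
        pieces = best_boundaries(seq, levels, k)
    return pieces, levels


data = [1, 1, 1, 5, 5, 5, 1, 1, 1]
pieces, levels = iterate(data, [(0, 2), (2, 7), (7, 9)], [0.0, 4.0], k=3)
```

Starting from deliberately poor boundaries and levels, two rounds are enough on this toy input to reach the segmentation with zero error.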

SLIDE 76

DNA segmentation

  • segmentation: a powerful concept for examining the large-scale organization of DNA
  • many examples of segments in DNA
    – (telomere, main-sequence, centromere)
    – (gene-rich, junk DNA)*
    – (regulatory region, gene, regulatory region, junk DNA)*
    – (microbial insert | viral insert | ancient mammalian)*
  • goal is to understand the complexity of the genome organization based on segments and recurrent sources

SLIDE 77

DNA segmentation

  • existing approaches with top-down segmentation and greedy identification of similar segments [Bernaola-Galván et al. 96, Bernaola-Galván et al. 00, Li 01, Azad et al. 02]
  • here we describe some of our own experiments with (k, h)-segmentation [Haiminen et al. 05]

SLIDE 78

Distinguishing genomes of different species

  • create many “semi-synthetic” datasets $H_iS_j$ by concatenating
    – $H_i$: the i-th chromosome of human, with
    – $S_j$: the j-th chromosome of another species S
  • apply (k, h)-segmentation and compare with the ground-truth segmentation
  • let $L = \{l_1, \ldots, l_h\}$ be the discovered sources in the concatenated sequence, and let $L_H$ and $L_S$ be the distributions of lengths of the sources of L in chromosomes H and S, respectively
  • compare the variational distance between the two distributions
    – 0: identical distributions, 1: completely distinct distributions
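
The comparison can be sketched as below. The sketch assumes the usual definition of variational (total variation) distance as half the L1 distance, which is consistent with the 0-to-1 range quoted above; the helper `length_distribution` and the toy numbers are made up for illustration and are not data from the experiments.

```python
def length_distribution(source_lengths):
    """Normalize the total length covered by each source into a distribution."""
    total = sum(source_lengths)
    return [x / total for x in source_lengths]


def variational_distance(p, q):
    """Half the L1 distance between two distributions over the same sources."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Toy example with h = 3 sources: total length of each source in H and in S.
L_H = length_distribution([60, 40, 0])
L_S = length_distribution([0, 10, 90])
distance = variational_distance(L_H, L_S)
```

A distance near 1 means the sources found in one chromosome are almost never used in the other, i.e. the segmentation separates the two species well.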

SLIDE 79

Genomes of different species — sample segmentations

[Figure: ground-truth segmentation vs. (k,h)-segmentation; panel 1: human 8 vs. chicken 8; panel 2: human 8 vs. zebra fish 8]

SLIDE 80

Genomes of different species — sample segmentations

[Figure: ground-truth segmentation vs. (k,h)-segmentation; panel 1: human 8 vs. dog 8; panel 2: human 8 vs. mouse 8]

SLIDE 81

Genomes of different species — sample segmentations

[Figure: ground-truth segmentation vs. (k,h)-segmentation; panel 1: human 8 vs. chimp 8; panel 2: human 8 vs. human 8]

SLIDE 82

Genomes of different species — variational distances

[Figure: variational distance, data vs. random; panel 1: Human-Zbfish mixes (HxZy), human vs. zebrafish; panel 2: Human-Chicken mixes (HxCy), human vs. chicken]

SLIDE 83

Genomes of different species — variational distances

[Figure: variational distance, data vs. random; panel 1: Human-Mouse mixes (HxMy), human vs. mouse; panel 2: Human-Chimp mixes (HxCMPy), human vs. chimp]

SLIDE 84

Distinguishing coding from non-coding regions

  • Rickettsia bacterium region that includes 13 genes and the non-coding regions in between

  • 10 bp non-overlapping windows
  • in each window features that capture the existence of codons

SLIDE 85

Distinguishing coding from non-coding regions

[Figure: segmentations for (k,h) = (20,2) and (k,h) = (27,3), compared with the ground truth GT: (25,3); x-axis in kbp]

SLIDE 86

DNA segmentations — conclusions

  • segmentation is a promising tool for analyzing genomic sequences
  • fascinating problem of understanding the structure of DNA

SLIDE 87

Thank you!

  • for your attention
  • Helger Lipmaa and Tarmo Uustalu for the invitation
  • hope to learn more about CS research and theory in Estonia...
  • ...hope to enjoy the weekend, too!
