Searching for Connected/Functional Motifs in Biological Networks
Stéphane Vialette
LIGM, Université Paris-Est Marne-la-Vallée, France
ENS - 7 September 2010
Networks in Biology
Our environment is a combination of tightly interlinked complex systems at various levels of magnitude.
◮ Gene expression in cells: Gene regulation networks.
◮ Large-scale approach: Protein interaction networks.
◮ Metabolites and enzymes: Metabolic networks.
◮ Evolutionary relationships between organisms: Phylogenetic networks.
◮ Collecting high-throughput data: Correlation networks.
◮ . . .
Protein-Protein Interaction (PPI) PPI networks
◮ Proteins are vertices.
◮ Interactions are (weighted) edges.
Gene or PPI databases
◮ BioGRID - A Database of Genetic and Physical Interactions
◮ DIP - Database of Interacting Proteins
◮ MINT - A Molecular Interactions Database
◮ IntAct - EMBL-EBI Protein Interaction
◮ MIPS - Comprehensive Yeast Protein-Protein interactions
◮ Yeast Protein Interactions - Yeast two-hybrid results from Fields’ group
◮ PathCalling - A yeast protein interaction database by Curagen
◮ SPiD - Bacillus subtilis Protein Interaction Database
◮ AllFuse - Functional Associations of Proteins in Complete Genomes
◮ BRITE - Biomolecular Relations in Information Transmission and Expression
◮ ProMesh - A Protein-Protein Interaction Database
◮ The PIM Database - by Hybrigenics
◮ Mouse Protein-Protein interactions
◮ Human herpesvirus 1 Protein-Protein interactions
◮ Human Protein Reference Database
◮ BOND - The Biomolecular Object Network Databank. Former BIND
◮ MDSP - Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry
◮ Protcom - Database of protein-protein complexes enriched with the domain-domain structures
◮ Proteins that interact with GroEL and factors that affect their release
◮ YPD™ - Yeast Proteome Database by Incyte
◮ . . .
Network querying
Definition
Given a small network (corresponding to a known pathway or a complex of interest), the network querying problem is to identify similar instances in a large target network.
Remarks
◮ Similarity is usually measured in terms of sequence and interaction patterns.
◮ Approximate occurrences: insertions and deletions.
◮ Topology-based approach.
Topology-based approach
PathBlast (http://www.pathblast.org)
A server for querying linear pathways within PPI networks (UC San Diego, UC Berkeley, Tel Aviv University, Whitehead Institute).
Topology-based approach
NetMatch (http://baderlab.org/Software/NetMatch)
A Cytoscape plugin to query networks for patterns [FERRO et al., 08].
Topology-based approach
Netgrep (http://genomics.princeton.edu/netgrep)
Fast network schema searches in interactomes [BANKS, NABIEVA, PETERSON, AND SINGH, 08].
From topology-based to topology-free motifs
Views
Roughly speaking, there are now two views of graph (or network) motifs:
◮ The older is the topological view where one basically ends up with certain subgraph isomorphism problems.
◮ The recent view on graph motifs takes a more functional approach. Here topology is of lesser importance but the functionalities of network vertices form the governing principle [LACROIX, FERNANDES, AND SAGOT, 05].
Remarks
The functional approach
◮ does not require information on the interconnections,
◮ is applicable in broader scenarios: complexes or pathways whose topologies are not completely known, querying from species for which no PPI information is available, . . .
GRAPH MOTIF
Definition (GRAPH MOTIF)
Input: A set of colors C, a motif M over C (a multiset M with underlying set C), a graph G = (V, E), and a mapping λ : V → C.
Task: Find an occurrence of M in G, i.e., a subset V′ ⊆ V such that
◮ λ(V′) = M, and
◮ G[V′] is connected.
Remarks
◮ Introduced in [LACROIX, FERNANDES, AND SAGOT, 05].
◮ The motif M is said to be colorful if it is a set.
◮ The multiplicity of a color c ∈ C in G is the number of vertices u ∈ V such that λ(u) = c.
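The definition can be made concrete with a small brute-force sketch. It is not an algorithm from the talk: the adjacency-dict encoding and the names `is_occurrence` and `find_occurrence` are assumptions chosen for illustration, and the search is exponential, so it is only meant for toy inputs.

```python
from collections import Counter
from itertools import combinations


def is_occurrence(adj, color, motif, subset):
    """Return True if `subset` is an occurrence of the multiset `motif`."""
    subset = set(subset)
    # Condition 1: the multiset of colors of the subset equals the motif.
    if Counter(color[v] for v in subset) != Counter(motif):
        return False
    if not subset:
        return True
    # Condition 2: the induced subgraph G[subset] is connected (DFS inside subset).
    start = next(iter(subset))
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for u in adj[v]:
            if u in subset and u not in seen:
                seen.add(u)
                stack.append(u)
    return seen == subset


def find_occurrence(adj, color, motif):
    """Try every vertex subset of size |motif| (hypothetical helper, tiny inputs only)."""
    for candidate in combinations(list(adj), len(motif)):
        if is_occurrence(adj, color, motif, candidate):
            return set(candidate)
    return None


# Path 1 - 2 - 3 - 4 colored red, green, red, blue.
adj = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
color = {1: "red", 2: "green", 3: "red", 4: "blue"}
print(find_occurrence(adj, color, ["red", "red", "green"]))
```

Here the motif is a list so that repeated colors (a multiset) can be expressed; a colorful motif is simply a list without repetitions.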
GRAPH MOTIF
Example
[Figure: example motif M]
GRAPH MOTIF: Preliminary results
Theorem (LACROIX, FERNANDES, AND SAGOT, 06)
GRAPH MOTIF is NP-complete even if G is a tree.
Remarks
◮ The proof does not hold for colorful motifs.
◮ Exponential exact algorithm for the general case.
GRAPH MOTIF: A sudden jump in complexity
Theorem (FELLOWS, FERTIN, HERMELIN, AND V., 07)
GRAPH MOTIF is NP-complete even if
◮ G is a tree with maximum degree 4 and color multiplicity 3 and M is colorful, or
◮ G is a bipartite graph and M is built over 2 colors.
Theorem (FELLOWS, FERTIN, HERMELIN, AND V., 07)
GRAPH MOTIF is polynomial-time solvable if G is a tree with color multiplicity 2.
GRAPH MOTIF: Coping with hardness
Some lines of thought
◮ One may reasonably assume that motifs tend to be small in practice (compared to the target graph).
◮ It would be nice to design an algorithm whose running time is polynomial in the size of the target graph and exponential only in the size of the motif.
◮ It would be even nicer to design an algorithm whose running time is polynomial in the size of the target graph and exponential only in the number of distinct colors that occur in the motif.
◮ Parameterized complexity is a branch of computational complexity theory that focuses on classifying computational problems according to their inherent difficulty with respect to multiple parameters of the input.
Parameterized complexity
Definition (Parameterized problem)
A parameterized problem is a language L ⊆ Σ∗ × Σ∗, where Σ is a finite alphabet. The second component is called the parameter of the problem.
Definition (Fixed-parameter tractability)
A parameterized problem L is fixed-parameter tractable if it can be determined in $f(k) \cdot n^{O(1)}$ time whether (x, k) ∈ L, where n = |x| and f is a computable function depending only on k. The corresponding complexity class is called FPT.
Definition (Parameterized hierarchy)
FPT ⊆ W[1] ⊆ W[2] ⊆ . . . ⊆ W[sat] ⊆ W[P] ⊆ XP.
Parameterized complexity
In a nutshell . . .
◮ Problems that admit a fixed-parameter tractable algorithm can be solved efficiently for small values of the parameter.
◮ W[1] is the class of decision problems of the form (x, k), with k a parameter, that are fixed-parameter reducible to WEIGHTED 3SAT: Given a 3SAT formula, does it have a satisfying assignment of Hamming weight k?
◮ W[1] is the first class of the hierarchy that contains problems believed not to be in FPT.
◮ If FPT = W[1] then NP is contained in DTIME($2^{o(n)}$).
GRAPH MOTIF: Small enough motifs
Theorem (LACROIX, FERNANDES, AND SAGOT, 06)
GRAPH MOTIF for trees is fixed-parameter tractable w.r.t. |M|.
Remarks
◮ The fixed-parameter tractability proof does not hold for general graphs.
◮ Pure combinatorial enumeration algorithm.
GRAPH MOTIF: Small enough motifs
Theorem (FELLOWS, FERTIN, HERMELIN, AND V., 07)
GRAPH MOTIF is solvable in $2^{O(k)} n^2 \log n$ time, where k = |M| and n = |V|.
Theorem (BETZLER, FELLOWS, KOMUSIEWICZ, AND NIEDERMEIER, 08)
GRAPH MOTIF is solvable with error probability ε in $O(4.32^k k^2 |\log \varepsilon| m)$ time, where k = |M| and m = |E|.
GRAPH MOTIF: Small enough motifs
Key elements of the algorithm of BETZLER, FELLOWS, KOMUSIEWICZ, AND NIEDERMEIER, 08
◮ GRAPH MOTIF for colorful motifs.
◮ Color coding and recoloring procedure.
◮ Fast subset convolution (BJÖRKLUND, HUSFELDT, AND KASKI, 07).
◮ Algorithm engineering for color-coding (HÜFFNER, WERNICKE, AND ZICHNER, 07).
GRAPH MOTIF: colorful motifs
Theorem
GRAPH MOTIF for colorful motifs is solvable in $O(3^k m)$ time, where k = |M| and m = |E|.
Key elements
Dynamic programming approach: $D_{v,M'}$ is the minimum score of a color set M′ ⊆ M for a vertex v ∈ V.
$$D_{v,M'} = \begin{cases} 0 & \text{if } M' = col(v) \\ 1 & \text{otherwise} \end{cases}$$
$$D_{v,M'} = \min_{u \in N(v);\; M'' \subseteq M'} \left\{ D_{u,\,M' \setminus col(v)},\;\; D_{v,\,M'' \cup col(v)} + D_{v,\,(M' \setminus M'') \cup col(v)} \right\}$$
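A Boolean variant of this dynamic program can be sketched as follows. It is an illustration in the spirit of the recurrence, not the exact table from the slide: a color set is stored for a vertex when some connected colorful vertex set containing that vertex realises it. The bitmask encoding and the name `colorful_motif_occurs` are assumptions.

```python
def colorful_motif_occurs(adj, color, motif):
    """Decide GRAPH MOTIF for a colorful motif (a set of distinct colors)."""
    index = {c: i for i, c in enumerate(motif)}
    full = (1 << len(index)) - 1
    # Vertices whose color is not in the motif can never be used.
    verts = [v for v in adj if color[v] in index]
    bit = {v: 1 << index[color[v]] for v in verts}
    # ok[v] = color sets (bitmasks) realised by a connected colorful set containing v.
    ok = {v: {bit[v]} for v in verts}
    # Process color sets by increasing size: larger sets are built from smaller ones.
    for mask in sorted(range(1, full + 1), key=lambda s: bin(s).count("1")):
        for v in verts:
            if not mask & bit[v] or mask == bit[v]:
                continue
            rest = mask ^ bit[v]
            sub = rest
            while sub:  # enumerate the non-empty subsets of rest
                # sub: colors of a part attached through a neighbor u of v;
                # the remaining colors (plus v's own color) stay with v.
                if ((rest ^ sub) | bit[v]) in ok[v] and any(
                    u in ok and sub in ok[u] for u in adj[v]
                ):
                    ok[v].add(mask)
                    break
                sub = (sub - 1) & rest
    return any(full in ok[v] for v in verts)


# Star centered at 1 with three distinct colors.
star = {1: [2, 3], 2: [1], 3: [1]}
star_color = {1: "x", 2: "y", 3: "z"}
print(colorful_motif_occurs(star, star_color, {"x", "y", "z"}))
```

Because the motif is colorful, two parts with disjoint color sets are automatically vertex-disjoint, which is what makes merging them safe. Enumerating all subsets of all color sets gives the 3^k factor of the theorem.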
GRAPH MOTIF: Color coding
Color-coding
◮ ALON, YUSTER, AND ZWICK, 95.
◮ Method to derive (randomized) fixed-parameter algorithms for several subgraph isomorphism problems.
◮ Best explained by example . . .
LONGEST PATH
Input: A graph G = (V, E) and a non-negative integer k.
Task: Find a simple path in G that contains k vertices.
Color coding: k-path
Key idea
1. Randomly color the vertices of the graph with k colors.
2. Find a colorful path of k vertices in G (dynamic programming step).
[Figure: vertices s, v, u]
Color coding: k-path
Theorem
Let G = (V, E) be a graph and f : V → {1, 2, . . . , k} be a coloring of G. Then a colorful path of k vertices can be found (if it exists) in $2^{O(k)} m$ time, where m = |E|.
Theorem
LONGEST PATH is solvable in $2^{O(k)} m$ expected time, where m = |E|.
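The two steps of color coding for k-path can be sketched as below. This is a minimal illustration under stated assumptions: the graph is an adjacency dict, k ≥ 1, colors are numbered 0 to k-1, and the default number of trials (about 3·e^k) is a choice made here, not a value from the talk.

```python
import math
import random


def has_colorful_path(adj, coloring, k):
    """Dynamic programming step: is there a path on k vertices with k distinct colors?"""
    # reach[v] = color sets (bitmasks) of colorful paths that end at v.
    reach = {v: {1 << coloring[v]} for v in adj}
    for _ in range(k - 1):
        longer = {v: set() for v in adj}
        for u in adj:
            for mask in reach[u]:
                for v in adj[u]:
                    b = 1 << coloring[v]
                    if not mask & b:  # v's color is new, hence v is a new vertex
                        longer[v].add(mask | b)
        reach = longer
    return any(reach[v] for v in adj)


def has_k_path(adj, k, trials=None, seed=0):
    """Randomized test with one-sided error: True is always correct."""
    rng = random.Random(seed)
    if trials is None:
        trials = math.ceil(3 * math.e ** k)
    for _ in range(trials):
        coloring = {v: rng.randrange(k) for v in adj}
        if has_colorful_path(adj, coloring, k):
            return True
    return False


path5 = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(has_k_path(path5, 4))
```

A fixed path on k vertices becomes colorful with probability k!/k^k > e^{-k}, which is why on the order of e^k random colorings suffice.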
Color coding: toward deterministic algorithms
Definition (k-perfect family of hash functions)
A k-perfect family of hash functions is a family H of functions from {1, 2, . . . , n} to {1, 2, . . . , k} such that for each S ⊆ {1, 2, . . . , n} with |S| = k there exists an h ∈ H such that h is one-to-one when restricted to S.
Theorem
One can construct a k-perfect family of hash functions from {1, 2, . . . , n} to {1, 2, . . . , k} which consists of $2^{O(k)} \log n$ functions. For such a hash function h, the value h(i), 1 ≤ i ≤ n, can be computed in linear time.
Theorem
LONGEST PATH is solvable in $2^{O(k)} m \log n$ time, where m = |E| and n = |V|.
GRAPH MOTIF: Recoloring procedure
[Figures: the motif M and the colored graph (G, λ); the recolored motif M′ and the recolored graph (G, λ′)]
GRAPH MOTIF: Recoloring procedure
Let V′ be an occurrence of M in G.
◮ ∀c ∈ C, let $P_c$ be the probability that the m(c) vertices in V′ that have color c receive a colorful recoloring:
$$P_c = \frac{m(c)!}{m(c)^{m(c)}} > e^{-m(c)} \sqrt{2\pi\, m(c)}$$
◮ ∀c, c′ ∈ C, $P_c$ and $P_{c'}$ are independent, and hence
$$P_{c \wedge c'} > e^{-(m(c)+m(c'))} \sqrt{2\pi\,(m(c)+m(c'))}$$
◮ Let $P_M$ be the probability that the occurrence V′ is (recoloring) colorful:
$$P_M = \prod_{c \in C} P_c > e^{-k}$$
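The bound can be checked numerically on small motifs. The sketch below assumes that m(c) is the multiplicity of color c in M, that k = |M|, and that each vertex of color c independently receives one of m(c) fresh colors uniformly at random. The function names and the repetition count formula are illustrative assumptions.

```python
import math
from collections import Counter


def colorful_probability(motif):
    """P_M: product over colors c of m(c)! / m(c)^m(c)."""
    p = 1.0
    for m in Counter(motif).values():
        p *= math.factorial(m) / m ** m
    return p


def trials_needed(motif, eps):
    """Independent recolorings so that a fixed occurrence is missed with probability <= eps."""
    p = colorful_probability(motif)
    if p >= 1.0:
        return 1  # a colorful motif needs no recoloring luck
    return math.ceil(math.log(eps) / math.log(1.0 - p))


motif = ["a", "a", "b"]
print(colorful_probability(motif), math.exp(-len(motif)))
print(trials_needed(motif, 0.01))
```

For this motif only the two vertices of color a must receive distinct fresh colors, so the probability is 2!/2² = 1/2, well above e^{-3}.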
GRAPH MOTIF is not always in FPT
Theorem (FELLOWS, FERTIN, HERMELIN, AND V., 07)
GRAPH MOTIF is in XP when parameterized by both the number of colors in the motif and the treewidth of the input graph.
Theorem (FELLOWS, FERTIN, HERMELIN, AND V., 07)
GRAPH MOTIF is W[1]-hard when parameterized by the number of colors in M, even when the target graph is a tree.
GRAPH MOTIF: Going further
◮ Allow multiple colors per vertex to model multiple functionalities of one element.
◮ Asking for somewhat more robust motifs:
  ◮ Biconnected motifs.
  ◮ Bridge-connectivity.
◮ The GRAPH MOTIF problem is too stringent: measurement errors might result in no occurrence of M in G whereas “good” solutions do exist. Hence the idea of turning GRAPH MOTIF into an optimization problem:
  ◮ Matching the whole motif at the price of losing connectivity: the occurrence is no longer required to be connected.
  ◮ Maintaining connectivity at the price of losing some elements: the occurrence has to be connected but may miss some elements of the motif.
GRAPH MOTIF: Robust motifs
Definition
A vertex u is called a cut vertex if there are two distinct vertices v and w, u ≠ v and u ≠ w, such that every path from v to w contains u.
Definition
A graph is biconnected if it is connected and has no cut vertex.
Definition
A graph is 2-edge-connected (or bridge-connected) if it cannot be disconnected by the deletion of a single edge.
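These definitions translate directly into a naive check. The sketch below assumes a connected input graph given as an adjacency dict (for a connected graph, u is a cut vertex exactly when removing u disconnects it) and deliberately uses the quadratic definition-based test rather than a linear-time algorithm. All function names are illustrative.

```python
def is_connected(adj, skip=None):
    """Is the graph connected once the vertex `skip` (if any) is removed?"""
    verts = [v for v in adj if v != skip]
    if not verts:
        return True
    seen, stack = {verts[0]}, [verts[0]]
    while stack:
        v = stack.pop()
        for u in adj[v]:
            if u != skip and u not in seen:
                seen.add(u)
                stack.append(u)
    return len(seen) == len(verts)


def cut_vertices(adj):
    """Cut vertices of a connected graph: vertices whose removal disconnects it."""
    return {u for u in adj if not is_connected(adj, skip=u)}


def is_biconnected(adj):
    return is_connected(adj) and not cut_vertices(adj)


cycle = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
bowtie = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3, 4], 3: [2, 4], 4: [2, 3]}
print(is_biconnected(cycle), cut_vertices(bowtie))
```

The bowtie (two triangles sharing vertex 2) is connected but not biconnected, which is the kind of fragile occurrence the robust variants rule out.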
GRAPH MOTIF: Robust motifs
Theorem (BETZLER, FELLOWS, KOMUSIEWICZ, AND NIEDERMEIER, 08)
BICONNECTED GRAPH MOTIF is W[1]-hard when parameterized by the number of elements in M.
Remarks
◮ A stronger result actually holds: Finding a biconnected subgraph of size k is W[1]-hard.
◮ Recall that finding a biconnected subgraph of size at least k is solvable in linear time [TARJAN, 72].
◮ The above theorem still holds if we replace biconnectivity by bridge-connectivity.
GRAPH MOTIF as an optimization problem
Definition (MIN CC)
Input: A set of colors C, a motif M over C (a multiset M with underlying set C), a graph G = (V, E), and a mapping λ : V → C.
Solution: A subset V′ ⊆ V such that λ(V′) = M.
Measure: The number of connected components in the induced subgraph G[V′].
Remarks
◮ Contains GRAPH MOTIF as a special case, i.e., the occurrence consists of one connected component.
◮ Minimization problem.
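The measure of MIN CC can be illustrated with a brute-force sketch. It is exponential and not one of the algorithms discussed in the talk; the adjacency-dict encoding and the names `count_components` and `min_cc_bruteforce` are assumptions.

```python
from collections import Counter
from itertools import combinations


def count_components(adj, subset):
    """Number of connected components of the induced subgraph G[subset]."""
    subset = set(subset)
    seen = set()
    components = 0
    for start in subset:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        stack = [start]
        while stack:
            v = stack.pop()
            for u in adj[v]:
                if u in subset and u not in seen:
                    seen.add(u)
                    stack.append(u)
    return components


def min_cc_bruteforce(adj, color, motif):
    """Return (number of components, vertex set) of a best solution, or None."""
    best = None
    for cand in combinations(list(adj), len(motif)):
        if Counter(color[v] for v in cand) == Counter(motif):
            cc = count_components(adj, cand)
            if best is None or cc < best[0]:
                best = (cc, set(cand))
    return best


# Path 1 - 2 - 3 - 4 colored red, green, red, blue.
adj = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
color = {1: "red", 2: "green", 3: "red", 4: "blue"}
print(min_cc_bruteforce(adj, color, ["red", "blue"]))
```

A returned measure of 1 means the solution is also an occurrence in the GRAPH MOTIF sense.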
GRAPH MOTIF as an optimization problem
Definition (MAX GRAPH MOTIF)
Input: A set of colors C, a motif M over C (a multiset M with underlying set C), a graph G = (V, E), and a mapping λ : V → C.
Solution: A subset V′ ⊆ V such that
◮ λ(V′) ⊆ M, and
◮ G[V′] is connected.
Measure: The size of the occurrence, i.e., |V′|.
Remarks
◮ Contains GRAPH MOTIF as a special case, i.e., the occurrence uses all the colors of the motif.
◮ Maximization problem.
MIN CC: Some bad news
Theorem (DONDI, FERTIN, AND V., 07)
MIN CC is APX-hard, even when the target graph is a path and M is colorful.
Theorem (DONDI, FERTIN, V.)
MIN CC is not approximable within ratio c log n, even when the target graph is a tree and M is colorful.
From MIN CC to SET COVER
[Figure: tree with root r, vertices S′_1, . . . , S′_m and S_1, . . . , S_m, and element vertices e_1(S_i), e_2(S_i), . . . , e_{t_i}(S_i) for each set S_i]
MIN-CC: Bad and good news
Theorem (DONDI, FERTIN, AND V., 07)
MIN CC is W[2]-hard when parameterized by the size of the solution, even when the target graph is a tree and M is colorful.
Theorem (BETZLER, FELLOWS, KOMUSIEWICZ, AND NIEDERMEIER, 08)
MIN CC is W[1]-hard when parameterized by the size of the solution, even when the target graph is a path.
Theorem (DONDI, FERTIN, AND V., 07)
MIN CC is in FPT when parameterized by the size of M. More precisely, it is solvable in $2^{O(k)} n^3 \log n$ time. The complexity reduces to O(n2k2(q+2)) time if G is a tree.
MIN-CC: Going further . . .
◮ The fixed-parameter algorithms are still not practical.
◮ What about approximating MIN CC for paths?
◮ No efficient exponential-time algorithm is known so far.
◮ Designing algorithms that focus on the number of distinct colors that occur in the motif.
MAX GRAPH MOTIF: Some bad news . . . again
Theorem (DONDI, FERTIN, V., 09)
MAX GRAPH MOTIF is APX-hard, even when the target graph is a tree T of degree 3, M is colorful, and each color occurs at most twice in T.
Theorem (DONDI, FERTIN, V., 09)
MAX GRAPH MOTIF is not in APX, even when the target graph is a tree T and M is colorful.
MAX GRAPH MOTIF: Some bad news . . . again
Proof that MAX GRAPH MOTIF is not in APX (key elements)
The proof is by the self-improvement technique.
◮ Given two instances $I_1 = (T_1, M_1)$ and $I_2 = (T_2, M_2)$ of MAX COLORS, we need to define the product $I_{1,2} = I_1 \times I_2 = (T_{1,2}, M_{1,2})$. Informally, $T_{1,2}$ is obtained by replacing each vertex $v_i \in V_1$ by a copy of $T_2$, connecting these copies through their roots. If $v_i \in V_1$ is colored $c_i$ and $v_j \in V_2$ is colored $c_j$, then vertex $v_i(v_j) \in T_{1,2}$ is colored $c_i(c_j)$.
◮ Self-product: $I^k = I^{k-1} \times I$.
MAX GRAPH MOTIF: Some bad news . . . again
Proof (key elements, continued)
◮ If $T_k$ is a solution for MAX COLORS over instance $I^k$, then there exists a solution T for MAX COLORS over instance I such that $|T|^k \ge |T_k|$.
◮ If T is a solution for MAX COLORS over instance I, then there exists a solution $T_k$ for MAX COLORS over instance $I^k$ such that $|T_k| \ge |T|^k$.
◮ For any constant δ < 1, MAXIMUM LEVEL MOTIF cannot be approximated within ratio $2^{\log^{\delta} n}$ in polynomial time unless NP ⊆ DTIME[$2^{\mathrm{poly} \log n}$].
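The way the two size inequalities rule out a constant ratio can be sketched as follows. This is the standard self-improvement argument written under assumptions not stated on the slide: opt(·) denotes the optimum value, ρ > 1 is a hypothetical constant approximation ratio, k is a constant (so that the self-product has polynomial size), and the solution T can be extracted from T_k in polynomial time.

```latex
\begin{align*}
\mathrm{opt}(I^k) &= \mathrm{opt}(I)^k
  && \text{(both inequalities applied to optimal solutions)} \\
|T_k| &\ge \frac{\mathrm{opt}(I^k)}{\rho} = \frac{\mathrm{opt}(I)^k}{\rho}
  && \text{($\rho$-approximation run on } I^k\text{)} \\
|T| &\ge |T_k|^{1/k} \ge \frac{\mathrm{opt}(I)}{\rho^{1/k}}
  && \text{(first inequality, } |T|^k \ge |T_k|\text{)}
\end{align*}
```

Since ρ^{1/k} tends to 1 as k grows, any constant-ratio algorithm would yield ratios arbitrarily close to 1, which conflicts with APX-hardness unless P = NP.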
MAX GRAPH MOTIF: Some (not so) good news
Theorem (DONDI, FERTIN, V., 09)
MAX GRAPH MOTIF is in FPT when parameterized by the size of the solution.
Remarks
◮ The idea is to combine two perfect families of hash functions with dynamic programming.
◮ The time complexity is, however, still not practical: $4^{O(k)} k n^2 \log^2 n$ time for graphs and $2^{O(k)} n^3 \log n$ time for trees.
MAX GRAPH MOTIF: Some (not so) good news
Theorem (DONDI, FERTIN, V., 09)
MAX GRAPH MOTIF for trees of size n can be solved in $O^*(1.62^n)$ time. In case the motif is colorful, the time complexity reduces to $O^*(1.33^n)$.
GRAPH MOTIF and variants: practical issues
Algorithmic solutions
◮ Torque [BRUCKNER, HÜFFNER, KARP, SHAMIR, AND SHARAN, 2009].
  ◮ Web server.
  ◮ Currently supports queries of up to 20–25 proteins.
◮ GraMoFoNe [BLIN, SIKORA, AND V., 10].
  ◮ Cytoscape plugin.
  ◮ Currently supports queries of up to 20–25 proteins.
Torque
http://www.cs.tau.ac.il/~bnet/torque.html
GraMoFoNe is a Cytoscape plugin
◮ Open-source Java platform:
  ◮ import/export in numerous formats
  ◮ visualisation
  ◮ network annotations
◮ Popular in bioinformatics
◮ Active community
GraMoFoNe
Main features
◮ Uses a Pseudo-Boolean programming engine.
◮ Databases (native support).
◮ Deals with both colorful and multiset motifs.
◮ Can report all solutions.
◮ Deals with approximate solutions:
  ◮ insertions,
  ◮ deletions,
  ◮ list coloring.