SLIDE 1 A Parallel Approximation Hitting Set Algorithm for Gene Expression Analysis
Universidade de São Paulo
SLIDE 2 Gene Expression Analysis
- Given an experiment where the expression levels of thousands of genes are measured.
- We consider the problem of determining which genes affect the expression level of a given gene.
SLIDE 3 Our Problem
- Given an experiment with n genes forming a set $E = \{a_0, a_1, \ldots, a_{n-1}\}$, whose expression levels are measured in a time series of m measurements (typically $n \gg m$). We have a total of nm values, each 0 or 1.
- Our algorithm (based on Ideker et al. [ITK00]) receives an m × n matrix of such values and determines, for a given gene $a_{n-1}$, which other genes are responsible for the expression level of $a_{n-1}$.
[Figure: example binary matrix M, labeled with $x_0, x_1, x_2, x_3$ and $p_0, p_1, p_2, p_3, p_4$]
SLIDE 4 Example of Execution of the Algorithm
Infer the truth table for $a_3$ from the matrix M shown.
[Figure: example binary matrix M, labeled with $x_0, x_1, x_2, x_3$ and $p_0, p_1, p_2, p_3, p_4$]
(1) In step (1), the expression levels of $a_3$ differ in the row pairs (0,1), (0,3), (1,2) and (2,3). We find:
- for (0,1), $S_{01} = \{a_0, a_2\}$, containing all the other genes whose expression levels also differ between rows $p_0$ and $p_1$.
- the same is done for (0,3): $S_{03} = \{a_2\}$.
- for (1,2), $S_{12} = \{a_0, a_1\}$.
- for (2,3), $S_{23} = \{a_1\}$.
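Step (1) can be sketched in a few lines of Python. This is only an illustration: the function name `difference_sets` and the 4 × 4 matrix `M` below are assumptions, with `M` chosen so that the output reproduces the four sets listed above. It is not the matrix from the slide.

```python
from itertools import combinations

def difference_sets(matrix, target):
    """Map each row pair (i, j) where the target gene differs to the set of
    other genes whose values also differ between rows i and j."""
    n_genes = len(matrix[0])
    sets = {}
    for i, j in combinations(range(len(matrix)), 2):
        if matrix[i][target] != matrix[j][target]:
            sets[(i, j)] = {
                g for g in range(n_genes)
                if g != target and matrix[i][g] != matrix[j][g]
            }
    return sets

# Hypothetical matrix (rows = measurements, columns = genes a0..a3),
# picked only so that the result matches the sets in the example.
M = [
    [1, 1, 1, 1],
    [0, 1, 0, 0],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
]

S = difference_sets(M, target=3)
for (i, j), genes in sorted(S.items()):
    print(f"S{i}{j} =", sorted(f"a{g}" for g in genes))
```

Each row pair is examined once, so the sketch does $O(m^2)$ pair comparisons with $O(n)$ work per pair.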
SLIDE 5
Result of Step 1
$S_{01} = \{a_0, a_2\}$, $S_{03} = \{a_2\}$, $S_{12} = \{a_0, a_1\}$, $S_{23} = \{a_1\}$.
(2) In step (2), find $S_{min} = \{a_1, a_2\}$, the smallest set that contains at least one element of each of the sets $S_{ij}$ from the previous step.
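For an instance this small, $S_{min}$ can be checked by exhaustive search. The sketch below is an assumption made for illustration (the name `minimum_hitting_set` is hypothetical). It tries candidate subsets in order of increasing size, which is exponential in general and is not the approach used for large inputs.

```python
from itertools import combinations

def minimum_hitting_set(universe, sets):
    """Smallest subset of `universe` that intersects every set (exhaustive search)."""
    universe = sorted(universe)
    for size in range(len(universe) + 1):
        for candidate in combinations(universe, size):
            if all(set(candidate) & s for s in sets):
                return set(candidate)
    return None  # reached only if some set cannot be hit, e.g. an empty set

sets = [{"a0", "a2"}, {"a2"}, {"a0", "a1"}, {"a1"}]
s_min = minimum_hitting_set({"a0", "a1", "a2"}, sets)
print(sorted(s_min))
```

The singleton sets $\{a_2\}$ and $\{a_1\}$ force both elements into any hitting set, which is why no smaller answer exists.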
SLIDE 6 The Hitting Set Problem
- Given a finite set E and a finite collection $S = \{S_1, \ldots, S_w\}$ of subsets of E, find a subset $A \subseteq E$ of the smallest size such that $A \cap S_i \neq \emptyset$ for all $i = 1, \ldots, w$.
SLIDE 7
The Hitting Set Problem
[Figure: a ground set $E = \{1, \ldots, 9\}$, a collection S of subsets of E, and a hitting set A]
SLIDE 8 The Hitting Set Problem Primal-Dual Approximation Algorithm [FMCF01]
- Due to Bar-Yehuda and Even [BYE81]; it was originally conceived for the minimum set cover problem.
- It is an α-approximation algorithm, where $\alpha = \max_{i=1}^{w} |S_i|$.
- $\max_{i=1}^{w} |S_i| = O(n)$.
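A minimal sketch of a primal-dual scheme of this kind is given below. It follows the standard textbook formulation for hitting set rather than any code from the cited works: the function name `primal_dual_hitting_set`, the optional `weight` argument, and the default of unit weights are all assumptions. For each set that is not yet hit, the dual variable is raised until some element becomes tight, and every tight element joins the solution.

```python
def primal_dual_hitting_set(sets, weight=None):
    """Primal-dual sketch: raise the dual of each unhit set until an element is tight.

    Assumes every set is non-empty and weights are positive integers
    (or Fractions), so the equality test below is exact.
    """
    elements = set().union(*sets)
    # Residual weight = weight minus the dual values already charged to the element.
    residual = {e: (weight[e] if weight else 1) for e in elements}
    chosen = set()
    for s in sets:
        if chosen & s:
            continue  # this set is already hit
        delta = min(residual[e] for e in s)  # largest feasible dual increase
        for e in s:
            residual[e] -= delta
            if residual[e] == 0:
                chosen.add(e)  # the constraint of e became tight
    return chosen

sets = [{"a0", "a2"}, {"a2"}, {"a0", "a1"}, {"a1"}]
A = primal_dual_hitting_set(sets)
print(sorted(A))
```

With unit weights every element of an unhit set becomes tight at once, so the solution can be up to $\max_i |S_i|$ times larger than the optimum, which is the ratio α stated above.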
SLIDE 9 The Hitting Set Problem Greedy Approximation Algorithm [J74]
- The strategy is to construct the set A by repeatedly choosing the element that occurs most often in the subsets of S not yet hit.
- The approximation ratio is $\ln |S| + 1$.
- $\ln |S| + 1 = O(\log m^2)$.
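The greedy strategy can be sketched as follows. The function name `greedy_hitting_set` and the tie-breaking rule (lowest element in sorted order) are assumptions made for this illustration. The sketch also assumes that every set is non-empty.

```python
from collections import Counter

def greedy_hitting_set(sets):
    """Repeatedly pick the element that occurs in the most sets not yet hit."""
    remaining = [set(s) for s in sets]
    chosen = []
    while remaining:
        counts = Counter(e for s in remaining for e in s)
        # Ties are broken by sorted element order (an arbitrary choice for this sketch).
        best = max(sorted(counts), key=counts.__getitem__)
        chosen.append(best)
        remaining = [s for s in remaining if best not in s]
    return chosen

sets = [{"a0", "a2"}, {"a2"}, {"a0", "a1"}, {"a1"}]
result = greedy_hitting_set(sets)
print(result)
```

On these four sets all three elements are tied in the first round, so the size of the result depends on the tie-breaking rule: starting with a0 leads to three elements, while starting with a1 or a2 leads to the optimal two.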
SLIDE 10
The Hitting Set Problem Greedy Approximation Algorithm
[Figure: the example with $E = \{1, \ldots, 9\}$ and the collection S, before any element is chosen for A]
SLIDE 11
The Hitting Set Problem Greedy Approximation Algorithm
[Figure: the same example after the first greedy choice, with $A = \{1\}$]
SLIDE 12
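The Hitting Set Problem Greedy Approximation Algorithm
[Figure: the gene example with $E = \{a_0, a_1, a_2, a_3\}$, $S = \{\{a_0, a_2\}, \{a_2\}, \{a_0, a_1\}, \{a_1\}\}$ and hitting set $A = \{a_1, a_2\}$]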
SLIDE 13 The Sequential Algorithm
[Figure: data structures of the sequential algorithm for the example matrix M: the gene vector, the set vector with fields i1, j1 and covered, and the lists linking them]
SLIDE 14 The Sequential Algorithm
[Figure: the gene vector and the set vector (fields i1, j1, covered) with their lists for the example matrix M; every set starts with covered = false]
SLIDE 15 The Sequential Algorithm
[Figure: the gene vector and the set vector after the first element is added to the hitting set (HS: {0}); the sets it hits are marked covered = true]
SLIDE 16 The Sequential Algorithm Time and Space Complexities
- To construct the data structures: $O(m^2 n)$.
- Let k be the size of the hitting set. We have to find, k times, the element with the largest number of occurrences. This takes $O(kn)$ time.
- For each such element, we have to update the data structures in $O(m^2 n)$ time. Since we have k elements, the total time to update the data structures is $O(k m^2 n)$.
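The three phases counted above (construction, selection, update) can be sketched with a simplified gene vector and set vector. This is an assumption-laden illustration, not the authors' implementation: plain Python lists stand in for the linked lists, the names `sequential_hitting_set` and `uncovered` are hypothetical, and the matrix `M` is an invented example.

```python
from itertools import combinations

def sequential_hitting_set(matrix, target):
    """Greedy hitting set over the sets S_ij, using a gene vector and a set vector."""
    m, n = len(matrix), len(matrix[0])
    set_vector = []                        # one entry [i, j, covered] per set S_ij
    gene_vector = [[] for _ in range(n)]   # per gene: indices of the sets containing it
    # Construction: O(m^2) row pairs, O(n) genes per pair.
    for i, j in combinations(range(m), 2):
        if matrix[i][target] == matrix[j][target]:
            continue
        idx = len(set_vector)
        set_vector.append([i, j, False])
        for g in range(n):
            if g != target and matrix[i][g] != matrix[j][g]:
                gene_vector[g].append(idx)

    def uncovered(g):
        # Number of still-uncovered sets that gene g would hit.
        return sum(not set_vector[s][2] for s in gene_vector[g])

    hitting_set = []
    while any(not covered for _, _, covered in set_vector):
        # This sketch recounts on every selection; ties go to the lowest gene index.
        best = max(range(n), key=uncovered)
        if uncovered(best) == 0:
            raise ValueError("an empty set S_ij cannot be hit")
        hitting_set.append(best)
        for s in gene_vector[best]:
            set_vector[s][2] = True        # update: mark the sets hit by `best`
    return hitting_set

# Hypothetical 4 x 4 matrix; gene a3 (column 3) is the gene under study.
M = [[1, 1, 1, 1], [0, 1, 0, 0], [1, 0, 0, 1], [1, 1, 0, 0]]
hs = sequential_hitting_set(M, target=3)
print(hs)
```

The loop body runs once per element of the hitting set, so the number of iterations is the k that appears in the bounds above.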
SLIDE 17 The Sequential Algorithm Time and Space Complexities
- The total time complexity is therefore $O(m^2 n) + O(kn) + O(k m^2 n) = O(k m^2 n)$.
- The size k of the hitting set is $O(m^2)$. Therefore, the time complexity of the algorithm can be expressed as $O(m^4 n)$.
- The space complexity is $O(m^2 n)$.
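The bound on k can be spelled out as a short derivation. It assumes, as in the greedy algorithm, that every chosen element hits at least one set that was not hit before, so k cannot exceed the number w of sets, and that there is at most one set per pair of rows.

```latex
\begin{align*}
k \le w \le \binom{m}{2} = \frac{m(m-1)}{2} &= O(m^2), \\
O(k\, m^2 n) = O(m^2 \cdot m^2 n) &= O(m^4 n).
\end{align*}
```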
SLIDE 18 The Parallel Algorithm
- The input matrix M is partitioned vertically (by columns), and each processor stores one block.
- Example of the partitioning:
[Figure: the matrix $M = (x_{i,j})$, with rows 0 to m−1 and columns $a_0, \ldots, a_8$, split into vertical blocks assigned to processor 0, processor 1 and processor 2]
SLIDE 19 The Parallel Algorithm
- Each processor reads a piece of the input of size $m \times \frac{n-1}{p}$.
- All the processors store a vector v, corresponding to the expression levels of the gene under study, $a_{n-1}$.
- Each processor $p_i$ also stores a gene vector, with information about the genes for which it is responsible. The gene vector in each processor has size $O(m^2 n / p)$.
- Each processor also has a set vector, in which the list of each set $S_{ij}$ contains only the elements for which that processor is responsible.
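The column partitioning can be sketched as below. The function name `partition_columns`, the dictionary layout, and the use of contiguous blocks of size $\lceil (n-1)/p \rceil$ are assumptions for illustration. The slides do not state how columns are assigned when $n-1$ is not divisible by p.

```python
def partition_columns(matrix, target, p):
    """Split the non-target columns into p contiguous blocks; every block keeps v."""
    genes = [g for g in range(len(matrix[0])) if g != target]
    v = [row[target] for row in matrix]     # gene under study, replicated everywhere
    size = -(-len(genes) // p)              # ceil((n - 1) / p)
    blocks = []
    for rank in range(p):
        mine = genes[rank * size:(rank + 1) * size]
        local = [[row[g] for g in mine] for row in matrix]
        blocks.append({"genes": mine, "local": local, "v": v})
    return blocks

# Hypothetical 4 x 4 matrix; gene a3 (column 3) is the gene under study.
M = [[1, 1, 1, 1], [0, 1, 0, 0], [1, 0, 0, 1], [1, 1, 0, 0]]
blocks = partition_columns(M, target=3, p=2)
for rank, block in enumerate(blocks):
    print(rank, block["genes"], block["local"])
```

Because every block carries v, each processor can decide locally which row pairs define a set $S_{ij}$ and then fill in only the genes it owns.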
SLIDE 20 The Parallel Algorithm
[Figure: the example matrix split between two processors; Processor 0 and Processor 1 each hold a gene vector, a set vector with fields i1, j1 and covered, and their lists]
SLIDE 21 The Parallel Algorithm Time and Space Complexities
- Local computation time: $O(m^4 n / p)$.
- Requires O(k) communication rounds, where k is the size of the hitting set. In terms of m, this is $O(m^2)$.
- Requires $O(m^2 n / p)$ space.
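The round structure can be illustrated with a single-process simulation. Everything here is an assumption made for the sketch: there is no real message passing, the global reduction is modeled by `max` over the per-processor proposals, the broadcast by a shared `covered` set, and the names `parallel_hitting_set`, `owned` and `local` are hypothetical.

```python
from itertools import combinations

def parallel_hitting_set(matrix, target, p):
    """Simulate p processors; returns (hitting set, number of communication rounds)."""
    m, n = len(matrix), len(matrix[0])
    genes = [g for g in range(n) if g != target]
    size = -(-len(genes) // p)                       # ceil((n - 1) / p)
    owned = [genes[r * size:(r + 1) * size] for r in range(p)]
    # Every processor holds v, so all of them derive the same list of sets S_ij.
    pairs = [(i, j) for i, j in combinations(range(m), 2)
             if matrix[i][target] != matrix[j][target]]
    # local[r][g]: indices of the sets containing gene g, for genes owned by r only.
    local = [{g: {s for s, (i, j) in enumerate(pairs)
                  if matrix[i][g] != matrix[j][g]}
              for g in owned[r]} for r in range(p)]
    covered, hitting_set, rounds = set(), [], 0
    while len(covered) < len(pairs):
        # Local step: each processor proposes (count, -gene, rank) for its best gene.
        proposals = [max(((len(local[r][g] - covered), -g, r) for g in owned[r]),
                         default=(0, 0, r)) for r in range(p)]
        # Communication round: global maximum, then the owner announces the sets hit.
        count, neg_gene, owner = max(proposals)
        if count == 0:
            raise ValueError("an empty set S_ij cannot be hit")
        hitting_set.append(-neg_gene)
        covered |= local[owner][-neg_gene]
        rounds += 1
    return hitting_set, rounds

# Hypothetical 4 x 4 matrix; gene a3 (column 3) is the gene under study.
M = [[1, 1, 1, 1], [0, 1, 0, 0], [1, 0, 0, 1], [1, 1, 0, 0]]
hs, rounds = parallel_hitting_set(M, target=3, p=2)
print(hs, rounds)
```

One element joins the hitting set per round, so the number of rounds equals k, in line with the O(k) bound above.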
SLIDE 22
[Figure: running time in seconds for a 20x4096 input; horizontal axis from 2 to 8, vertical axis from 0.01 to 0.05]
SLIDE 23 Bibliographical References
[BYE81] R. Bar-Yehuda and S. Even. A linear time approximation algorithm for the weighted vertex cover problem. Journal of Algorithms, 2:198-203, 1981.
[FMCF01] C. G. Fernandes, F. K. Miyazawa, M. Cerioli, and P. Feofiloff. Uma introdução sucinta a algoritmos de aproximação. 23º Colóquio Brasileiro de Matemática, 2001.
[ITK00] T. E. Ideker, V. Thorsson, and R. Karp. Discovery of regulatory interactions through perturbation: inference and experimental design. Pacific Symposium on Biocomputing, 5:302-313, 2000.
[J74] D. S. Johnson. Approximation algorithms for combinatorial problems. Journal of Computer and System Sciences, 9:256-278, 1974.