  1. Engineering Motif Search for Large Motifs. Petteri Kaski (1), Juho Lauri (2), Suhas Thejaswi (1). (1) Department of Computer Science, Aalto University, Espoo, Finland; (2) Nokia Bell Labs, Dublin, Ireland. Symposium on Experimental Algorithms (SEA 2018), L'Aquila, Italy, Friday 29 June 2018

  2. Summit — Oak Ridge National Laboratory (200M): 27,648 × NVIDIA GV100 GPUs, 141,557,760 cores, 15 MW

  3. NVIDIA DGX-1 — Aalto University (100K): 8 × NVIDIA GV100 GPUs, 40,960 cores, 3 kW

  4. • ~2400 GHz of GF(2^8) field multiplications (2400 GHz = 2.4 trillion multiplications/sec) • ~6 terabytes/sec of memory bandwidth — at 1 bit = 1 cm^2, 6 terabytes cover 4800 km^2, while the Rome metropolitan city area is 5,352 km^2 (source: Wikipedia)

  5. Outline • Background on motif search • Engineering a practical implementation of constrained multilinear sieving for massively vector-parallel microarchitectures (shared-memory multi-GPU systems) • Experiments. What do we want? • Vector parallelization • Saturating memory bandwidth • Offloading to multiple GPUs

  6. Motif search problem. Data: a vertex-colored graph H (the host graph). Query: a multiset M of colors (the motif). Question: does H contain a connected subgraph whose multiset of vertex colors equals M?
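A minimal host-side illustration of the matching condition, assuming a trivial integer encoding of colors (the function and data below are hypothetical, not from the slides or the reference implementation): a candidate connected subgraph matches the motif exactly when the multiset of its vertex colors equals M.

    // Hypothetical illustration: does the multiset of colors of a candidate
    // connected subgraph equal the motif multiset? Colors are small integers.
    #include <algorithm>
    #include <cstdio>
    #include <vector>

    bool colors_match_motif(std::vector<int> candidate_colors,
                            std::vector<int> motif) {
        // Two multisets of integers are equal iff their sorted forms agree.
        std::sort(candidate_colors.begin(), candidate_colors.end());
        std::sort(motif.begin(), motif.end());
        return candidate_colors == motif;
    }

    int main() {
        std::vector<int> motif = {0, 0, 1};      // two "red" vertices, one "blue"
        std::vector<int> candidate = {1, 0, 0};  // colors of one connected subgraph of H
        std::printf("match: %s\n", colors_match_motif(candidate, motif) ? "yes" : "no");
        return 0;
    }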

  7. Data, query, and one match

  8. Complexity • NP -complete if M has at least two colors • Fixed-parameter tractable (FPT) • Solvable in linear time in the size of H ( exponential in the size of M ) Shown to be FPT by Fellows, Fertin, Hermelin, Vialette, ICALP 2007

  9. FPT race (n = number of vertices, m = number of edges, k = motif size)
     Fellows et al. (ICALP 2007):       O(~87^k · poly(n, m))
     Betzler et al. (CPM 2008):         O(4.32^k · poly(n, m))
     Guillemot & Sikora (MFCS 2010):    O(4^k · poly(n, m))
     Koutis (IPL 2012):                 O(2.54^k · poly(n, m))
     Björklund et al. (STACS 2013):     O(2^k k^2 m)
     An O*((2 − ε)^k) algorithm for motif search would imply an O*((2 − δ)^n) algorithm for set cover.
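For a sense of scale (an illustrative calculation, not on the slide): at motif size k = 20, the 2^k factor is about 10^6, while 2.54^k ≈ 10^8 and 4^k ≈ 10^12, so lowering the base of the exponential is exactly what makes large motifs reachable in practice.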

  10. Algorithm

  11.–14. Constrained multilinear sieving — converting a combinatorial problem into an algebraic one: detecting a multilinear monomial in a multivariate polynomial (Björklund, Kaski, and Kowalik, STACS 2013 / Algorithmica 2016). • Randomized decision algorithm (YES/NO) • A YES answer is always correct — no false positives • A NO answer is a false negative with probability at most k · 2^{−b+1} • Arithmetic over GF(2^b)
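For concreteness, a scalar sketch of the GF(2^b) arithmetic the sieve relies on, instantiated at b = 8 (the field mentioned on slide 4). Addition is XOR and multiplication is carry-less polynomial multiplication followed by reduction; the modulus x^8 + x^4 + x^3 + x + 1 (0x11b) is chosen here only for illustration, and the actual implementation may use a different modulus and a larger b.

    #include <cstdint>
    #include <cstdio>

    // Scalar GF(2^8) arithmetic sketch. Addition is XOR; multiplication is
    // polynomial multiplication over GF(2) reduced modulo an irreducible
    // polynomial -- here x^8 + x^4 + x^3 + x + 1 (0x11b), for illustration only.
    static inline uint8_t gf8_add(uint8_t a, uint8_t b) { return a ^ b; }

    static inline uint8_t gf8_mul(uint8_t a, uint8_t b) {
        uint16_t acc = 0;
        uint16_t aa = a;
        for (int bit = 0; bit < 8; bit++)           // schoolbook multiply
            if (b & (1u << bit)) acc ^= aa << bit;  // add shifted copy of a
        for (int deg = 14; deg >= 8; deg--)         // reduce terms of degree >= 8
            if (acc & (1u << deg)) acc ^= 0x11b << (deg - 8);
        return (uint8_t)acc;
    }

    int main() {
        std::printf("0x53 + 0xCA = 0x%02x\n", gf8_add(0x53, 0xCA));
        std::printf("0x53 * 0xCA = 0x%02x\n", gf8_mul(0x53, 0xCA));  // 0x01 for this modulus
        return 0;
    }

The production code replaces such a scalar routine with bit-sliced arithmetic, sketched after slide 21 below.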

  15.–18. High-level algorithm (Björklund, Kaski, Kowalik). Output YES if and only if the sum of 2^k evaluations of a multivariate polynomial P(x, y) is nonzero. • 2^k evaluations: the 2^k points (x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(2^k), y^(2^k)) depend on the motif M and on random bits • Multivariate polynomial: defined by the host graph H; it has n + 2(k − 1)m variables and degree 2k − 1 • One evaluation takes O(k^2 m M(b)) time, where M(b) is the complexity of multiplication in GF(2^b) • Overall runtime: O(2^k k^2 m M(b))
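A skeleton of the decision procedure implied by the slide, with the per-point evaluation stubbed out (evaluate_P is a hypothetical placeholder; only the sum-and-test structure comes from the slides): sum the 2^k evaluations in GF(2^b), where addition is XOR, and answer YES iff the sum is nonzero.

    #include <cstdint>
    #include <cstdio>

    // Skeleton of the decision procedure: evaluate P at 2^k points that depend
    // on the motif and on random bits, sum the values in GF(2^b) (addition is
    // XOR), and answer YES iff the sum is nonzero.
    using gf_t = uint64_t;  // one machine word per GF(2^b) element, b <= 64 (assumption)

    static gf_t evaluate_P(uint64_t point_index) {
        // Placeholder standing in for the O(k^2 m M(b))-time vertex-localized
        // evaluation; the real routine depends on H, M, and random field elements.
        (void)point_index;
        return 0;
    }

    // Returns true (YES) iff the sum of the 2^k evaluations is nonzero.
    // A NO answer is wrong with probability at most k * 2^{-b+1}.
    static bool motif_present(int k) {
        gf_t sum = 0;
        for (uint64_t L = 0; L < (1ull << k); L++)  // 2^k evaluation points
            sum ^= evaluate_P(L);                   // GF(2^b) addition is XOR
        return sum != 0;
    }

    int main() {
        std::printf("decision: %s\n", motif_present(10) ? "YES" : "NO");
        return 0;
    }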

  19. CPU implementation (ALENEX 2015) — handles large graphs, but how about large motifs? The running time is exponential in the motif size. Open source: https://github.com/pkaski/motif-search

  20. Design considerations — shared-memory multi-GPU systems. Positives: • high arithmetic throughput and memory bandwidth • massive vector parallelism — for example, the NVIDIA DGX-1 has ~40,000 cores. Negatives: • high memory latency — bandwidth comes at the cost of latency • no hardware support for finite-field arithmetic — on CPUs, the PCLMULQDQ instruction speeds up finite-field arithmetic

  21. Design considerations — shared-memory multi-GPU systems. Using the available bandwidth: • keep the pipeline busy — issue memory accesses and arithmetic operations simultaneously, hiding latency with enough instructions "in flight" • coalesced memory access — access memory in blocks • bit-sliced finite-field arithmetic — to overcome the lack of hardware support (a sketch follows below)
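As a minimal sketch of what bit-sliced finite-field arithmetic means here, again assuming GF(2^8) with the illustrative modulus x^8 + x^4 + x^3 + x + 1: bit t of 32 field elements is packed into one 32-bit word, so a single sequence of word-wide AND/XOR operations multiplies 32 element pairs at once, with no table lookups and no carry-less-multiply hardware. The field size, modulus, and word width in the actual implementation may differ.

    #include <cstdint>
    #include <cstdio>

    // Bit-sliced GF(2^8) multiplication sketch: a[t] holds bit t of 32 field
    // elements, one element per bit lane of the 32-bit word. One call multiplies
    // 32 independent element pairs lane-wise using only AND/XOR.
    static void gf8_mul_bitsliced(const uint32_t a[8], const uint32_t b[8],
                                  uint32_t r[8]) {
        uint32_t t[15] = {0};
        for (int i = 0; i < 8; i++)          // schoolbook product, 15 partial planes
            for (int j = 0; j < 8; j++)
                t[i + j] ^= a[i] & b[j];
        for (int d = 14; d >= 8; d--) {      // reduce modulo x^8 + x^4 + x^3 + x + 1
            t[d - 4] ^= t[d];
            t[d - 5] ^= t[d];
            t[d - 7] ^= t[d];
            t[d - 8] ^= t[d];
        }
        for (int i = 0; i < 8; i++) r[i] = t[i];
    }

    int main() {
        // Lane 0 multiplies 0x53 by 0xCA (product 0x01 under this modulus);
        // the other 31 lanes multiply 0 by 0.
        uint32_t a[8], b[8], r[8];
        for (int i = 0; i < 8; i++) {
            a[i] = (0x53 >> i) & 1;
            b[i] = (0xCA >> i) & 1;
        }
        gf8_mul_bitsliced(a, b, r);
        unsigned prod = 0;
        for (int i = 0; i < 8; i++) prod |= (r[i] & 1u) << i;
        std::printf("lane-0 product: 0x%02x\n", prod);
        return 0;
    }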

  22. Vertex-localized sieve. Base case, for all i ∈ [n] and L ⊆ [k]: P_{i,1}(ζ^L, α) = ζ^L_i. For each s = 2, 3, ..., k, i ∈ [n], and L ⊆ [k]:
      P_{i,s}(ζ^L, α) = Σ_{s1+s2=s, s1,s2 ≥ 1} Σ_{j ∈ Γ_H(i)} P_{i,s1}(ζ^L, α) · P_{j,s2}(ζ^L, α) · α_{s,(i,j)}
      Finally, sum at each vertex: Q_{i,k}(μ, ν, α) = Σ_{L ⊆ [k]} P_{i,k}(ζ^L, α). Parallelization over vertices i ∈ [n] (n threads) and subsets L ⊆ [k] (2^k threads); parallelization over L vectorizes up to 2^k threads.
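To make the recurrence concrete, a sequential reference sketch over GF(2^8) for a single evaluation point and a single subset L; the graph, the ζ values, and the α values below are placeholders (in the real algorithm the field elements are random and the field is typically larger), and the production code runs the i- and L-dimensions in parallel with bit-sliced arithmetic.

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Sequential reference sketch of the vertex-localized recurrence over
    // GF(2^8) for ONE evaluation point and ONE subset L (placeholder data).
    // The full algorithm repeats this for all L ⊆ [k] and sums P_{i,k}.
    static uint8_t gf8_mul(uint8_t a, uint8_t b) {   // illustrative modulus 0x11b
        uint16_t acc = 0;
        for (int t = 0; t < 8; t++) if (b & (1u << t)) acc ^= (uint16_t)a << t;
        for (int d = 14; d >= 8; d--) if (acc & (1u << d)) acc ^= 0x11bu << (d - 8);
        return (uint8_t)acc;
    }

    int main() {
        const int n = 4, k = 3;
        // Placeholder host graph: adjacency lists Gamma_H(i) for a path on 4 vertices.
        std::vector<std::vector<int>> adj = {{1}, {0, 2}, {1, 3}, {2}};
        // Placeholder field elements: zeta[i] stands for zeta^L_i,
        // alpha(s, i, j) for alpha_{s,(i,j)} (random in the real run).
        std::vector<uint8_t> zeta = {0x03, 0x07, 0x0b, 0x0d};
        auto alpha = [](int s, int i, int j) -> uint8_t {
            return (uint8_t)(17 * s + 5 * i + 3 * j + 1);  // arbitrary placeholder
        };

        // P[s][i] = P_{i,s}(zeta^L, alpha); base case P_{i,1} = zeta^L_i.
        std::vector<std::vector<uint8_t>> P(k + 1, std::vector<uint8_t>(n, 0));
        for (int i = 0; i < n; i++) P[1][i] = zeta[i];

        for (int s = 2; s <= k; s++)
            for (int i = 0; i < n; i++) {
                uint8_t acc = 0;
                for (int s1 = 1; s1 < s; s1++) {      // s1 + s2 = s, s1, s2 >= 1
                    int s2 = s - s1;
                    for (int j : adj[i])              // j in Gamma_H(i)
                        acc ^= gf8_mul(gf8_mul(P[s1][i], P[s2][j]), alpha(s, i, j));
                }
                P[s][i] = acc;
            }

        for (int i = 0; i < n; i++)
            std::printf("P_{%d,%d} = 0x%02x\n", i, k, (unsigned)P[k][i]);
        return 0;
    }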

  23. Workloads and uniformity. [Figure: projects (vertices) assigned to workers — CPU threads on the left, GPU threads on the right.] D workers (threads) work on each project (vertex), and D divides 2^k. On a CPU, execution in each thread is mostly independent; on a GPU, all threads in a warp (typically 32) execute the same instructions.

  24. Workloads and uniformity. [Figure: vertices 1, 2, ..., n, each assigned D worker threads.] The workload has shape n × D on a single GPU and M × n × D across M GPUs; a hypothetical launch sketch follows below. Each project (vertex) has a different completion time.
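A hypothetical launch sketch for a workload of shape M × n × D: one n × D slice per GPU, with D consecutive threads forming the workers of one vertex. The kernel body, grid dimensions, and slicing policy here are placeholder assumptions, not the actual launch code.

    #include <cuda_runtime.h>
    #include <cstdio>

    // Launch sketch for a workload of shape M x n x D: each GPU gets an n x D
    // slice; within a GPU, D consecutive threads (the "workers") cooperate on
    // one vertex, so a warp of workers touches adjacent data.
    __global__ void sieve_round(int n, int D, int dev) {
        long tid = (long)blockIdx.x * blockDim.x + threadIdx.x;
        if (tid >= (long)n * D) return;
        int i = (int)(tid / D);   // vertex (project) handled by this thread
        int d = (int)(tid % D);   // worker index within the vertex's platoon
        (void)i; (void)d; (void)dev;
        // ... one round s of the vertex-localized sieve would go here ...
    }

    int main() {
        int M = 0;
        cudaGetDeviceCount(&M);                // number of GPUs available
        const int n = 1 << 20, D = 32;         // placeholder workload shape
        const int threads = 256;
        const long total = (long)n * D;
        const int blocks = (int)((total + threads - 1) / threads);
        for (int dev = 0; dev < M; dev++) {    // one n x D slice per device
            cudaSetDevice(dev);
            sieve_round<<<blocks, threads>>>(n, D, dev);
        }
        for (int dev = 0; dev < M; dev++) {    // wait for all devices
            cudaSetDevice(dev);
            cudaDeviceSynchronize();
        }
        std::printf("launched %d device slice(s)\n", M);
        return 0;
    }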

  25. Memory layout and coalescence. Coalescence — each thread should be doing the same boring work. Private L computes P_{i,s}(ζ^L, α) and private L′ computes P_{i,s}(ζ^{L′}, α), both via the recurrence P_{i,s}(ζ^L, α) = Σ_{s1+s2=s, s1,s2 ≥ 1} Σ_{j ∈ Γ_H(i)} P_{i,s1}(ζ^L, α) · P_{j,s2}(ζ^L, α) · α_{s,(i,j)}. Privates L and L′ work on the same vertex i; each thread executes the same set of instructions on different data.

  26. Memory layout and coalescence. Platoon C computes P_{i,s}(ζ^L, α) and platoon C′ computes P_{i′,s}(ζ^L, α) via the same recurrence, summing over the neighborhoods Γ_H(i) and Γ_H(i′), respectively. Platoons C and C′ work on vertices i and i′, and each platoon has D privates.

  27. Memory layout and coalescence. Per private (worker): X resources and A words of space accessed each iteration; S resources and U words of space in total (resources = scalars, space = memory words). With n × D workers, the memory layout has shape U_A × n × D × A, where the per-worker space is split into lines of A words; each load/store accesses A words of data.
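One way to make such a layout concrete is an index function over the shape U_A × n × D × A (slowest to fastest dimension), so that the A words of one access are contiguous and the D workers of a vertex sit next to each other; the helper below is only an illustration of this kind of indexing, not the library's actual LINE_IDX macro.

    #include <cstdio>

    // Illustrative index into a buffer of shape U_A x n x D x A (slowest to
    // fastest dimension). The innermost A words accessed by one worker are
    // contiguous, and the D workers of one vertex are adjacent, so a warp of
    // workers together reads a contiguous block of D * A words.
    static long layout_index(long u, long i, long d, long a,
                             long n, long D, long A) {
        return ((u * n + i) * D + d) * A + a;
    }

    int main() {
        const long n = 1000, D = 32, A = 2;     // placeholder shape parameters
        // Workers d = 0..3 of vertex i = 7 at line u = 5, word a = 0:
        for (long d = 0; d < 4; d++)
            std::printf("u=5 i=7 d=%ld a=0 -> offset %ld\n",
                        d, layout_index(5, 7, d, 0, n, D, A));
        return 0;
    }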

  28. Inner loop in CUDA. For each s = 2, 3, ..., k, i ∈ [n], and L ⊆ [k]:
      P_{i,s}(ζ^L, α) = Σ_{s1+s2=s, s1,s2 ≥ 1} Σ_{j ∈ Γ_H(i)} P_{i,s1}(ζ^L, α) · P_{j,s2}(ζ^L, α) · α_{s,(i,j)}

    for(index_t s1 = 1; s1 < s; s1++) {
        index_t s2 = s - s1;
        index_t s1i = LINE_IDX(n, gl, s1, i, a);
        line_t p_s1i;
        LINE_LOAD(p_s1i, d_s, seg, s1i);        /* Load the line P_{i,s1} */
        index_t s2j = LINE_IDX(n, gl, s2, j, a);
        line_t p_s2j;
        LINE_LOAD(p_s2j, d_s, seg, s2j);        /* Load the line P_{j,s2} */
        line_t p_s1i_s2j;
        LINE_MUL(p_s1i_s2j, p_s1i, p_s2j);      /* Line multiplication P_{i,s1} * P_{j,s2} */
        LINE_ADD(p_sij, p_sij, p_s1i_s2j);      /* Accumulate into p_sij */
    }

  29. Experiments

  30. Hardware configurations • CPU compute node: 2 × 2.6-GHz Intel Xeon E5-2690 v3 (Haswell microarchitecture), 12 cores/CPU (24 cores in total), 30 MiB L3 cache, 128 GiB main memory (8 × 16 GiB DDR4-2133) • NVIDIA DGX-1: 8 × 1312-MHz NVIDIA GV100 GPUs (Volta microarchitecture), 5120 cores/GPU (40,960 cores in total), 128 GiB of on-device memory (8 × 16 GiB 4096-bit HBM2)
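For context (a back-of-the-envelope check, not from the slides): assuming the commonly quoted ~900 GB/s of HBM2 bandwidth per GV100, the eight GPUs of the DGX-1 offer roughly 7.2 TB/s of aggregate peak memory bandwidth, which is consistent with the ~6 terabytes/sec sustained figure quoted on slide 4.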
