Multi-level parallelism for high performance combinatorics

1 of 26 Multi-level parallelism for high performance combinatorics Florent Hivert LRI / Université Paris Sud 11 / CNRS SPLS / June 2018

2 of 26 Goal
Present some experiments, lessons learned, and challenges around parallel (algebraic) combinatorics computations.
What I learned: by following these optimization steps
- Micro data-structure optimization
- Work-stealing parallelization
- Careful memory management
we can achieve surprisingly (at least for me) large speedups.

Background: Enumerative and Algebraic Combinatorics 3 of 26
Some classical algebraic/combinatorics objects
Multivariate polynomials: $x_1^3 x_4 x_6 + 5 x_2^3 x_5^4 x_8^2 - 12 x_4^8$
Number of monomials, $v$ variables, degree $d$: $M(v, d) = \binom{v + d - 1}{v}$
M(5, 5) = 126, M(5, 10) = 2002
M(10, 10) = 92378, M(10, 20) = $2 \cdot 10^{7}$
M(16, 16) = 300 540 195, M(16, 32) = $1.5 \cdot 10^{12}$
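The counts on this slide can be reproduced with exact integer arithmetic. The sketch below takes the slide's formula literally, $M(v, d) = \binom{v + d - 1}{v}$; the names `binomial` and `monomial_count` are illustrative and not taken from the talk.

```cpp
#include <cstdint>

// Binomial coefficient C(n, k) in exact 64-bit integer arithmetic.
// After step i the running value equals C(n - k + i, i), so every
// division below is exact.
std::uint64_t binomial(std::uint64_t n, std::uint64_t k) {
    if (k > n) return 0;
    if (k > n - k) k = n - k;  // use the symmetry C(n, k) = C(n, n - k)
    std::uint64_t result = 1;
    for (std::uint64_t i = 1; i <= k; ++i) {
        result = result * (n - k + i) / i;
    }
    return result;
}

// M(v, d) exactly as written on the slide: C(v + d - 1, v).
// Assumes v + d >= 1.
std::uint64_t monomial_count(std::uint64_t v, std::uint64_t d) {
    return binomial(v + d - 1, v);
}
```

For instance, `monomial_count(16, 16)` evaluates $\binom{31}{16}$. All values in the range shown on the slide fit comfortably in 64 bits.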

Background: Enumerative and Algebraic Combinatorics 4 of 26
Some classical algebraic/combinatorics objects
(Fully) Symmetric polynomials:
$m_{(2,1)} = x_0^2 x_1 + x_0 x_1^2 + x_0^2 x_2 + x_1^2 x_2 + x_0 x_2^2 + x_1 x_2^2 + x_0^2 x_3 + x_1^2 x_3 + x_2^2 x_3 + x_0 x_3^2 + x_1 x_3^2 + x_2 x_3^2$
$m_{(2,2,1)} = x_0^2 x_1^2 x_2 + x_0^2 x_1 x_2^2 + x_0 x_1^2 x_2^2 + x_0^2 x_1^2 x_3 + x_0^2 x_2^2 x_3 + x_1^2 x_2^2 x_3 + x_0^2 x_1 x_3^2 + x_0 x_1^2 x_3^2 + x_0^2 x_2 x_3^2 + x_1^2 x_2 x_3^2 + x_0 x_2^2 x_3^2 + x_1 x_2^2 x_3^2$
Index: integer partitions: (5), (4,1), (3,2), (3,1,1), (2,2,1), (2,1,1,1), (1,1,1,1,1)
n      1   2   4   8    10   16    20    50       100             256
p(n)   1   2   5   22   42   231   627   204226   $2 \cdot 10^{8}$   $3.7 \cdot 10^{14}$
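The partition numbers p(n) in the table can be checked with a short dynamic program. This is a minimal sketch of the standard "add one allowed part size at a time" recurrence, not something shown in the talk; the name `partition_count` is illustrative.

```cpp
#include <cstdint>
#include <vector>

// Number of integer partitions of n.
// ways[t] counts partitions of t using only the part sizes processed so far;
// allowing part size `part` adds the partitions of t - part (same restriction).
std::uint64_t partition_count(unsigned n) {
    std::vector<std::uint64_t> ways(n + 1, 0);
    ways[0] = 1;  // the empty partition
    for (unsigned part = 1; part <= n; ++part) {
        for (unsigned total = part; total <= n; ++total) {
            ways[total] += ways[total - part];
        }
    }
    return ways[n];
}
```

`partition_count(5)` counts the seven partitions listed on the slide, and the values up to n = 256 fit in a 64-bit integer.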

Background: Enumerative and Algebraic Combinatorics 5 of 26
Group algebra
Linear combination of permutations: [1,2,3,4,5] + 2 [1,2,3,5,4] + 3 [1,2,4,3,5] + [5,1,2,3,4]
Product: composition of permutations.
The number of permutations grows very fast: 15! = 1 307 674 368 000 ≈ $1.3 \cdot 10^{12}$
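A group-algebra element can be stored sparsely as a map from permutations to coefficients, with the product extended bilinearly from composition. The sketch below is only an illustration: the types `Perm` and `Element`, the function names, and the composition convention (apply the right factor first) are all assumptions, not the talk's data structures.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

using Perm = std::vector<int>;         // one-line notation, values 1..n
using Element = std::map<Perm, long>;  // sparse: permutation -> coefficient

// (p o q)(i) = p(q(i)): q is applied first (assumed convention).
Perm compose(const Perm& p, const Perm& q) {
    assert(p.size() == q.size());
    Perm r(q.size());
    for (std::size_t i = 0; i < q.size(); ++i) {
        r[i] = p[q[i] - 1];  // values are 1-based, indices 0-based
    }
    return r;
}

// Bilinear product: multiply coefficients, compose permutations.
Element multiply(const Element& x, const Element& y) {
    Element result;
    for (const auto& xt : x) {
        for (const auto& yt : y) {
            result[compose(xt.first, yt.first)] += xt.second * yt.second;
        }
    }
    // Keep the representation sparse by dropping cancelled terms.
    for (auto it = result.begin(); it != result.end();) {
        if (it->second == 0) {
            it = result.erase(it);
        } else {
            ++it;
        }
    }
    return result;
}
```

With e = [1,2,3] and s = [2,1,3], squaring e + 2s gives 5e + 4s, since s composed with itself is the identity.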

Background: Enumerative and Algebraic Combinatorics 6 of 26
Nested higher order directional derivative
Directional derivative, first and higher order: $\nabla_{\Xi_1} A$ and $\nabla^3_{\Xi_1 \otimes \Xi_2 \otimes \Xi_3} A = \nabla^3 (\Xi_1, \Xi_2, \Xi_3) A$
Chain rule for directional derivative:
$\nabla_\xi \nabla^k_{\Xi_1 \otimes \cdots \otimes \Xi_k} A = \nabla^{k+1}_{\xi \otimes \Xi_1 \otimes \cdots \otimes \Xi_k} A + \sum_{j=1}^{k} \nabla^k_{\Xi_1 \otimes \cdots \otimes \nabla_\xi \Xi_j \otimes \cdots \otimes \Xi_k} A$
[Figure: $\nabla_\xi$ applied to a labelled diagram, expanded as a sum of four labelled diagrams]

Background: Enumerative and Algebraic Combinatorics 7 of 26
Algebraic combinatorics: Summary
Note: Dealing with (formal) linear combinations of objects whose set cardinality grows exponentially fast.
Corollary: sparse linear algebra; small objects are usually sufficient!

Small combinatorial objects 8 of 26
Small combinatorial objects (i.e. monomials)
Very often, small combinatorial objects can be encoded into small sequences of small integers!
Permutations: $\begin{pmatrix} 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 \\ 1 & 6 & 9 & 4 & 8 & 2 & 7 & 3 & 5 \end{pmatrix} = [1, 6, 9, 4, 8, 2, 7, 3, 5]$
Integer partitions: 10 = 5 + 2 + 2 + 1 = 4 + 3 + 1 + 1 + 1
Set partitions: {{1,4,8}, {2,3}, {5,6,7}}
Young tableaux:
5
2 6 9
1 3 4 7 8
Dyck (well bracketed) word: 1101101001100011010

Small combinatorial objects 9 of 26
Integer Vector Instruction
Register: epi8, epu8: 128 bits = 16 bytes
Even more: AVX, AVX2, AVX512
Arithmetic/logic operations: and, or, add, sub, min, max, abs, cmp
Bit finding, scanning: popcount, bsfd
But more crucial for me:
- Array manipulation: blend, broadcast, shuffle
- String comparison: cmpistr (lex, find)
Very efficient manipulations!

Small combinatorial objects 10 of 26 Example: Sorting network Knuth AoCP3 Fig. 51 p. 229:


Small combinatorial objects 11 of 26
// Sorting network Knuth AoCP3 Fig. 51 p 229.
// Each round is an involution: lane i is compared with lane round[i].
static const array<Perm16, 9> rounds = {{
    { 1, 0, 3, 2, 5, 4, 7, 6, 9, 8,11,10,13,12,15,14},
    { 2, 3, 0, 1, 6, 7, 4, 5,10,11, 8, 9,14,15,12,13},
    ...
}};

perm sort(perm a) {
  for (perm round : rounds) {
    perm minab, maxab, mask;
    perm b = _mm_shuffle_epi8(a, round);   // b[i] = a[round[i]]
    mask = _mm_cmplt_epi8(round, permid);  // set where round[i] < i (permid: identity)
    minab = _mm_min_epi8(a, b);
    maxab = _mm_max_epi8(a, b);
    a = _mm_blendv_epi8(minab, maxab, mask);  // max where mask is set, min elsewhere
  }
  return a;
}
Compared to std::sort, speedup = 22.3
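To make the shuffle/min/max/blend pattern of the sorting routine above concrete, here is a scalar model that runs without SSE. Because the slide elides most of Knuth's nine rounds, this sketch swaps in a different, simpler network: odd-even transposition sort, which needs 16 rounds for 16 lanes. The names `Vec16`, `compare_exchange_round`, `transposition_partners` and `sort16` are illustrative, and the model compares bytes as unsigned values, whereas `_mm_min_epi8` and `_mm_max_epi8` are signed.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>

using Vec16 = std::array<std::uint8_t, 16>;

// One network round. partner must be an involution (partner[partner[i]] == i).
// Lane i keeps the max if its partner has a smaller index, the min otherwise.
Vec16 compare_exchange_round(const Vec16& a, const Vec16& partner) {
    Vec16 out{};
    for (std::size_t i = 0; i < 16; ++i) {
        std::uint8_t b = a[partner[i]];       // models the byte shuffle
        std::uint8_t lo = std::min(a[i], b);  // models the lane-wise min
        std::uint8_t hi = std::max(a[i], b);  // models the lane-wise max
        out[i] = (partner[i] < i) ? hi : lo;  // models compare + blend
    }
    return out;
}

// Partner table for odd-even transposition sort (NOT Knuth's Fig. 51 network).
// Even rounds pair (0,1),(2,3),...; odd rounds pair (1,2),(3,4),...,(13,14).
Vec16 transposition_partners(bool odd_round) {
    Vec16 partner{};
    for (std::size_t i = 0; i < 16; ++i) {
        std::size_t j = i;  // lanes 0 and 15 are unpaired in odd rounds
        if (!odd_round) {
            j = i ^ 1;
        } else if (i >= 1 && i <= 14) {
            j = (i % 2 == 1) ? i + 1 : i - 1;
        }
        partner[i] = static_cast<std::uint8_t>(j);
    }
    return partner;
}

// Sixteen alternating rounds sort any 16 values.
Vec16 sort16(Vec16 a) {
    for (int r = 0; r < 16; ++r) {
        a = compare_exchange_round(a, transposition_partners(r % 2 == 1));
    }
    return a;
}
```

An unpaired lane has `partner[i] == i`, so it takes `min(a[i], a[i])` and is left unchanged. The point of the vectorised version is that each round costs a handful of instructions regardless of the data, with no branches.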

Small combinatorial objects 12 of 26
Disjoint-set (Union-Find) data structure
Set partition of {1, 2, ..., 9}:
P = {{6}, {1,5}, {7,2,3,8}, {9,4}} = {{1,5}, {2,3,7,8}, {4,9}, {6}}
Note: Union-Find data structure:
- Choose a canonical representative for each class (e.g. the smallest element).
- Find: returns the canonical representative of some element.
- Union: combines two parts.
Union(P, 5, 3) = {{1,2,3,5,7,8}, {4,9}, {6}}
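The Find and Union operations described above can be sketched as a classical pointer-based structure in which the canonical representative is the smallest element of each class. This is a textbook version written for illustration, not the vectorised representation used later in the talk; the class name `DisjointSet` and its methods are hypothetical.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Disjoint sets over {1, ..., n}; index 0 is unused.
// Invariant: parent_[x] <= x, so every root is the minimum of its class.
class DisjointSet {
public:
    explicit DisjointSet(int n) : parent_(n + 1) {
        std::iota(parent_.begin(), parent_.end(), 0);  // all singletons
    }

    // Canonical representative of x, with path halving.
    int find(int x) {
        while (parent_[x] != x) {
            parent_[x] = parent_[parent_[x]];
            x = parent_[x];
        }
        return x;
    }

    // Merge the classes of a and b; the smaller root stays canonical.
    void unite(int a, int b) {
        int ra = find(a);
        int rb = find(b);
        if (ra == rb) return;
        parent_[std::max(ra, rb)] = std::min(ra, rb);
    }

private:
    std::vector<int> parent_;
};
```

Building P with `unite(1, 5)`, `unite(7, 2)`, `unite(2, 3)`, `unite(3, 8)`, `unite(9, 4)` and then calling `unite(5, 3)` reproduces the slide's Union(P, 5, 3): elements 1, 2, 3, 5, 7, 8 all report representative 1.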

Small combinatorial objects 13 of 26
Disjoint-set (Union-Find) of two set-partitions
P = {{1,5}, {2,3,7,8}, {4,9}, {6}}
Q = {{1}, {3}, {2,4}, {5,6}, {7,8}, {9}}
Then P ∪ Q = {{1,5,6}, {2,3,4,7,8,9}}

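Small combinatorial objects 14 of 26
Disjoint-set (Union-Find) of two set-partitions
Store a partition P as a function Can_P:
i       1 2 3 4 5 6 7 8 9
Can_P   1 2 2 4 1 6 2 2 4
Lemma: $\mathrm{Can}_{P \cup Q} = (\mathrm{Can}_P \circ \mathrm{Can}_Q)^{\circ\, n/2}$
// Named setpart_union because `union` is a reserved word in C++.
setpart16 setpart_union(setpart16 p, setpart16 q) {
  setpart16 res = _mm_shuffle_epi8(p, q);  // res[i] = p[q[i]]
  res = _mm_shuffle_epi8(res, res);        // each self-shuffle squares the map
  res = _mm_shuffle_epi8(res, res);
  return _mm_shuffle_epi8(res, res);       // 8 = 16/2 compositions in total
}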
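The repeated-shuffle routine above can be followed on a scalar model in which a shuffle is plain function composition. The sketch below is an illustration with hypothetical names (`SetPart16`, `compose`, `iterate_union`); it uses 0-based points with unused points stored as singletons, and it only replays the iteration. It is not a proof of the lemma and makes no claim beyond the example tried.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Entry i holds the canonical (smallest) representative of the class of i.
// Points are 0-based here; the slide numbers them from 1.
using SetPart16 = std::array<std::uint8_t, 16>;

// Scalar model of a 16-byte shuffle: result[i] = f[g[i]].
SetPart16 compose(const SetPart16& f, const SetPart16& g) {
    SetPart16 r{};
    for (std::size_t i = 0; i < 16; ++i) {
        r[i] = f[g[i]];
    }
    return r;
}

// Start from Can_P o Can_Q, then square three times: 8 = 16/2 compositions.
SetPart16 iterate_union(const SetPart16& p, const SetPart16& q) {
    SetPart16 res = compose(p, q);
    for (int step = 0; step < 3; ++step) {
        res = compose(res, res);
    }
    return res;
}
```

On the slide's P and Q (shifted to 0-based), the first composition already maps every point except 9 to its final representative, and one squaring sends 9 to 2 as well, giving the classes {1,5,6} and {2,3,4,7,8,9}.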

Small combinatorial objects 15 of 26
Some more examples and speedup
Operation                                Speedup
Sorting a list of bytes                  21.3
Number of cycles of a permutation        41.5
Cycle type of a permutation              8.94
Number of inversions of a permutation    9.39
Inverting a permutation                  2.02
Problems:
- missing primitives (e.g. inverting a permutation)
- AVX2 and AVX512 work in parallel on 2 or 4 registers of 128 bits; the shuffle instruction doesn't cross the 128-bit barriers
- no support from the compiler
- need to rethink all the algorithms!

Large set enumeration: the challenging example of numerical monoids 16 of 26
Examples of recursively enumerated sets
Binary words: generation tree
[ ]
[0] [1]
[0,0] [0,1] [1,0] [1,1]
[0,0,0] [0,0,1] [0,1,0] [0,1,1] [1,0,0] [1,0,1] [1,1,0] [1,1,1]

Large set enumeration: the challenging example of numerical monoids 17 of 26
Now that we know how to deal with each small object, how do we generate them? Generation trees!
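A generation tree can be walked depth first with a single buffer that is extended on the way down and restored on the way up. The sketch below does this for the binary-word tree, where the children of a word are obtained by appending 0 or 1; the names `Word`, `visit` and `binary_words` are illustrative and not from the talk.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

using Word = std::vector<int>;

// Depth-first walk of the generation tree rooted at `word`.
// Each node is reported once, before its children.
void visit(Word& word, std::size_t max_length,
           const std::function<void(const Word&)>& report) {
    report(word);
    if (word.size() == max_length) return;  // leaf of the truncated tree
    for (int bit = 0; bit <= 1; ++bit) {
        word.push_back(bit);                // go down to the child
        visit(word, max_length, report);
        word.pop_back();                    // come back up
    }
}

// Collect every binary word of length at most max_length.
std::vector<Word> binary_words(std::size_t max_length) {
    std::vector<Word> result;
    Word root;  // the empty word [ ]
    visit(root, max_length, [&result](const Word& w) { result.push_back(w); });
    return result;
}
```

`binary_words(3)` visits the 15 nodes of the three-level tree. The two subtrees below any node share no state once the buffer is copied, which is the kind of independence a work-stealing scheduler could exploit; that parallel step is not shown here.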

