Array-of-Struct particles for iPic3D on MIC

Alec Johnson and Giovanni Lapenta
Centre for mathematical Plasma Astrophysics, Mathematics Department, KU Leuven, Belgium
EASC2014, Stockholm, Sweden, April 3, 2014

Abstract: We are porting iPic3D to the MIC for particle processing. iPic3D advances both the electromagnetic field and the particles implicitly, typically requiring 100-200 iterations of the field advance and 3-5 iterations of the particle advance for each cycle. We use particle subcycling to limit particle motion to one cell per cycle, which improves accuracy and simplifies sorting. To accelerate sorting, we represent particles in AoS format in double precision so that particle data exactly fits the cache line width. To vectorize particle calculations, we process particles in blocks: a fast 8x8 matrix transpose implemented in intrinsics converts each 8-particle block between SoA and AoS representation.

Johnson and Lapenta (KU Leuven), AoS for particles on MIC, April 3, 2014

Porting iPic3D to MIC

Goal: efficiently converged multiscale simulation of plasma
Tool: iPic3D, an implicit particle-in-cell code
Task: port to Xeon + Phi (MIC):
  - improve MPI
  - use OMP threads
  - vectorize
Key issue: data layout of particles
  Ordering:
    - SoA for vectorization (push and sum)
    - AoS for localization (sorting)
  Granularity of particles:
    - grouped by cell: vectorization efficiency
    - grouped by thread subdomain: cache efficiency

Agenda: justify choices

The purpose of this presentation is to justify four algorithm choices:

Two fundamental determinations:
1. Subcycle particles:
   - for each particle, break the time step into substeps
   - move the particle at most one cell per substep
   - motivation: accurate simulation of fast-moving particles
   - benefit: simpler sorting
2. Use Array-of-Structs (AoS) for particles.
   - motivation: fast sorting
   - can still vectorize via fast transpose/intrinsics

Two secondary determinations:
1. Use double precision for particles.
   - Vlasov solver via resampling.
   - no mixed precision.
   - particle exactly fits cache line.
2. Use AoS field to push particles.
   - motivation: better localization of field data access
   - justification: one transpose per cycle is justified by numerous particle iterations and amortized by many iterations of SoA field solver.

Outline

1. iPic3D algorithm
2. Algorithm choices

Equations of iPic3D

iPic3D simulates charged particles interacting with the electromagnetic field. It solves the following equations:

Fields:
$$\partial_t B(x) + \nabla \times E(x) = 0, \qquad \partial_t E(x) - c^2 \nabla \times B(x) = -J(x)/\epsilon_0.$$

Particles:
$$\partial_t v_p = \frac{q_p}{m_p}\left(E'(x'_p) + v_p \times B'(x'_p)\right), \qquad \partial_t x_p = v_p.$$

Moments (10):
(1) $\sigma(x) := \sum_p S(x - x_p)\, q_p$
(3) $J(x) := \sum_p S(x - x_p)\, q_p v_p$
(6) $P(x) := \sum_p S(x - x_p)\, q_p v_p v_p$

The Implicit Moment Method uses these 10 moments (with E and B) to estimate $\bar{J}$.

Discretization:
$$\partial_t X := \frac{X^{n+1} - X^n}{\Delta t}, \qquad \bar{X} = \tfrac{1}{2} X^{n+1} + \tfrac{1}{2} X^n.$$

Current evolution:
$$\partial_t J_s + \nabla \cdot P_s = \frac{q_s}{m_s}\left(\sigma'_s E + J_s \times B'\right).$$

Average current responds linearly to electric field:
$$\bar{J} = \hat{J} + A \cdot E,$$
where
$$\hat{J} := \sum_s \hat{J}_s, \qquad \hat{J}_s := \Pi_s \cdot \left(J_s^n - \frac{\Delta t}{2} \nabla \cdot P_s\right), \qquad A := \sum_s \beta_s \sigma'_s \Pi_s,$$
$$\Pi_s := \frac{I - \tilde{B}_s \times I + \tilde{B}_s \tilde{B}_s}{1 + |\tilde{B}_s|^2}, \qquad \tilde{B}_s := \beta_s B', \qquad \beta_s := \frac{q_s \Delta t}{2 m_s}.$$

Implicit particle advance:
$$\bar{v}_p = \frac{I - \tilde{B}_p \times I + \tilde{B}_p \tilde{B}_p}{1 + |\tilde{B}_p|^2} \cdot \left(v_p^n + \beta_s E'(\bar{x}_p)\right),$$
where
$$\tilde{B}_p := \frac{q_p \Delta t}{2 m_p} B'(\bar{x}_p) \qquad \text{and} \qquad \bar{x}_p = x_p^0 + \frac{\Delta t}{2} \bar{v}_p.$$
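The implicit particle advance can be checked numerically. The following is a minimal sketch, not iPic3D code: the function names are hypothetical, the fields E and B are assumed to be already sampled at the particle's average position, and plain tuples stand in for vectors. It evaluates the closed-form average velocity and the residual of the implicit relation it is meant to solve, v_bar = v_n + beta (E + v_bar x B).

```python
def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    """Dot product of two 3-vectors."""
    return sum(x * y for x, y in zip(a, b))


def average_velocity(v_n, E, B, q, m, dt):
    """Closed-form v_bar = (I - Bt x I + Bt Bt) . (v_n + beta E) / (1 + |Bt|^2).

    Bt = beta * B is the scaled magnetic field, beta = q dt / (2 m).
    E and B are assumed to be sampled at the average position already.
    """
    beta = q * dt / (2.0 * m)
    Bt = tuple(beta * b for b in B)
    vt = tuple(v + beta * e for v, e in zip(v_n, E))
    vt_x_Bt = cross(vt, Bt)          # equals -(Bt x vt)
    vt_dot_Bt = dot(vt, Bt)
    denom = 1.0 + dot(Bt, Bt)
    return tuple((vt[i] + vt_x_Bt[i] + vt_dot_Bt * Bt[i]) / denom
                 for i in range(3))


def implicit_residual(v_bar, v_n, E, B, q, m, dt):
    """Residual of v_bar = v_n + beta * (E + v_bar x B); zero for an exact solve."""
    beta = q * dt / (2.0 * m)
    v_x_B = cross(v_bar, B)
    return tuple(v_bar[i] - v_n[i] - beta * (E[i] + v_x_B[i]) for i in range(3))
```

With the discretization rule for averages, the end-of-step velocity would follow as 2 * v_bar - v_n.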

iPic3D cycle

iPic3D cycles through three tasks:
1. fields.advance(moments)
2. particles[s].move(fields)
3. moments[s].sum(particles[s])

Moving particles consists of pushing and sorting, e.g.:

    foreach subcycle c:
      foreach particle:
        particle.push(field(cell(particle)))
      particle.sort()
      particles.communicate()

Outline

1. iPic3D algorithm
2. Algorithm choices

Mapping data to architecture

Balance two goals:
1. flexibility where architectures/algorithms differ
2. best particulars where architectures/algorithms agree

Architecture key attributes:
1. Width of cache line: 8 doubles = 512 bits (fairly universal)
2. Width of vector unit:
   - 8 doubles = 512 bits for MIC
   - 4 doubles = 256 bits for Xeon with AVX
   - 2 doubles = 128 bits for SSE2

Data of algorithm:
1. fields: 6 doubles (two vectors) per mesh cell, plus two correction potentials:
   1. B_x magnetic field
   2. B_y magnetic field
   3. B_z magnetic field
   4. ψ (B correction potential)
   5. E_x electric field
   6. E_y electric field
   7. E_z electric field
   8. φ (E correction potential)
2. 100s of particles per mesh cell; 8 doubles (2 vectors + 2 scalars) per particle:
   1. u velocity
   2. v velocity
   3. w velocity
   4. q charge (or particle ID)
   5. x position
   6. y position
   7. z position
   8. t subcycle time
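The size arithmetic above can be made concrete. This is an illustrative sketch using Python's standard `struct` module, not the iPic3D particle class: the names `PARTICLE`, `pack_particle`, and `unpack_particle` are hypothetical, and the component order is the one listed on the slide. It shows that eight doubles pack into exactly one 64-byte (512-bit) cache line and that each listed vector width divides that size.

```python
import struct

# Component order taken from the slide: velocity, charge, position, subcycle time.
PARTICLE_FIELDS = ("u", "v", "w", "q", "x", "y", "z", "t")

# Eight little-endian doubles with no padding: one particle record.
PARTICLE = struct.Struct("<8d")

CACHE_LINE_BYTES = 64                                      # 512 bits
VECTOR_WIDTH_BYTES = {"MIC": 64, "AVX": 32, "SSE2": 16}    # widths from the slide


def pack_particle(particle):
    """Pack a dict with keys PARTICLE_FIELDS into a 64-byte record."""
    return PARTICLE.pack(*(particle[name] for name in PARTICLE_FIELDS))


def unpack_particle(raw):
    """Unpack a 64-byte record back into a dict keyed by component name."""
    return dict(zip(PARTICLE_FIELDS, PARTICLE.unpack(raw)))
```

Because every vector width divides the record size, a particle can be moved with whole vector loads and stores and no partial lanes.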

(1) Why subcycle?

Traditionally the implicit moment method moves all particles with the same time step. We are implementing subcycling:
- For each particle, the global time step is partitioned into substeps.
- Substeps stop particles at cell boundaries.

Benefits of subcycling:
1. Simplifies sorting:
   - SoA vectorization requires sorting particles by mesh cell.
   - Subcycling guarantees that particles move only one mesh cell per subcycle.
   - Without subcycling, particles can move arbitrarily far between sorts.
   - Without subcycling, particles must be sorted with every iteration of the implicit mover.
   - Without subcycling, sorted particle data must include average position data and no longer fits in a single cache line.
2. Subcycling is needed to resolve fast particles accurately:
   - Maxwell's equations need time-averaged current.
   - Subcycling is needed to get correct time-averaged current of fast particles.
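The partition of a time step into substeps that stop at cell boundaries can be sketched in one dimension. This is a simplified illustration under loud assumptions: the velocity is held constant (a real mover would re-evaluate it from the fields in each substep), positions are measured in units of the cell width, and the function name `subcycle_1d` is hypothetical.

```python
import math


def subcycle_1d(x, v, dt):
    """Split one step of length dt into substeps ending at cell faces.

    x is the position in units of the cell width, v the (constant) velocity.
    Returns a list of (substep_length, position_after, cell_after) tuples.
    Each substep ends either at the next cell face or at the end of the step,
    so the particle crosses at most one face per substep.
    """
    cell = math.floor(x)
    t_left = dt
    substeps = []
    while t_left > 0.0:
        if v > 0.0:
            face, next_cell = cell + 1, cell + 1   # upper face of current cell
            t_face = (face - x) / v
        elif v < 0.0:
            face, next_cell = cell, cell - 1       # lower face of current cell
            t_face = (face - x) / v
        else:
            t_face = math.inf                      # a particle at rest never hits a face
        if t_face <= t_left:
            # Stop exactly on the face and hand the particle to the neighbor cell.
            tau, x, cell = t_face, float(face), next_cell
        else:
            # The rest of the step fits inside the current cell.
            tau, x = t_left, x + v * t_left
        t_left -= tau
        substeps.append((tau, x, cell))
    return substeps
```

For example, `subcycle_1d(0.25, 2.0, 1.0)` splits the step into three substeps whose lengths sum to the full step, with the cell index changing by at most one each time.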

(2) AoS particle vectorization

Usually SoA is preferred for vectorization. But AoS particles can still be vectorized in one of two ways:

Fast matrix transpose:
- 8-component particles (subcycled case):
  - Represent as AoS
  - Process in 8-particle blocks
  - Convert blocks to/from SoA using fast 8x8 matrix blocked transpose (28-36 8-wide vector instructions)
- 12-component particles (non-subcycled case):
  - Consider padding extra components to 8 (faster sort); otherwise:
  - first 8 components handled like 8-component particles
  - last 4 components handled like 4-component particles using fast 4x8 ↔ 8x4 transpose (16 8-wide vector instructions).

Physical vectors (intrinsics-heavy):
- MIC: process 2 particles at a time
  - concatenate velocity vectors: [u1, v1, w1, q1, u2, v2, w2, q2]
  - concatenate position vectors: [x1, y1, z1, t1, x2, y2, z2, t2]
  - Use physical vector operations (use swizzle for cross-product)
- Xeon: process 1 particle at a time (or 2 at a time for single precision)
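The blocked-transpose approach can be illustrated at the data-layout level. The sketch below is plain Python, not the intrinsics implementation described on the slide: `aos_block_to_soa`, `soa_block_to_aos`, and `push_block` are hypothetical names, and the push is a free-streaming update chosen only to show where component-wise (vectorizable) arithmetic would happen between the two transposes.

```python
BLOCK = 8  # particles per block and components per particle


def aos_block_to_soa(block):
    """Transpose 8 particles x 8 components into 8 component rows (SoA)."""
    assert len(block) == BLOCK and all(len(p) == BLOCK for p in block)
    return [list(column) for column in zip(*block)]


def soa_block_to_aos(rows):
    """Inverse transpose: 8 component rows back into 8 particle records."""
    assert len(rows) == BLOCK and all(len(r) == BLOCK for r in rows)
    return [list(particle) for particle in zip(*rows)]


def push_block(block, dt):
    """Free-streaming push of one 8-particle block (illustrative only).

    Component order is [u, v, w, q, x, y, z, t]. In SoA form each update
    acts on a whole row, which is the shape a vector unit processes.
    """
    u, v, w, q, x, y, z, t = aos_block_to_soa(block)
    x = [xi + ui * dt for xi, ui in zip(x, u)]
    y = [yi + vi * dt for yi, vi in zip(y, v)]
    z = [zi + wi * dt for zi, wi in zip(z, w)]
    t = [ti + dt for ti in t]
    return soa_block_to_aos([u, v, w, q, x, y, z, t])
```

The particles stay in AoS form in memory; only the block being processed is held in SoA form, so sorting still moves whole cache-line-sized records.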

Pusher times on Xeon Phi [feasibility studies]

Pusher times in iPic3D:

    time         pusher
    =========    ========
    0.102 sec    SoA (but also need to sort each iteration)
    0.202 sec    AoS_intr (no sort required, but helps cache)
    0.259 sec    AoS (no sort required, but helps cache)

Pusher times for a single iteration:

    time           pusher
    ===========    ========
    .07 Mcycles    SoA
    .13 Mcycles    AoS_tran (8-pcl blocks via fast 8x8 transpose)
    .21 Mcycles    AoS_intr (2-pcl blocks with intrinsics mover)

Pusher times for 4 iterations stopping at cell boundary:

    time             pusher
    ===========      ========
    .36 Mcycles      SoA
    .40 Mcycles      AoS_tran (8-pcl blocks via fast 8x8 transpose)
    [unimplemented]  AoS_intr (no need to sort with each subcycle)

Sorting efficiently

Cache-line-sized particles facilitate sorting:
- can transfer particles directly to memory destination with no-read writes
- no cache contention
- vector unit divides cache line size, so fully utilized

Sort particles by:
1. process subdomain (for MPI),
2. thread subdomain, and
3. mesh cell (for vectorization)

To hide communication latency, overlap process-level communication with thread-level sorting.

General sort:
- send exiting particles
- sort particles in process
- wait on incoming particles
- sort incoming particles

Subcycle sort (moving ≤ 1 cell per subcycle):
- move particles in boundary cells
- send particles in ghost cells
- move particles in interior cells
- move incoming particles
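Sorting by mesh cell with one direct write per particle can be sketched as a counting sort. This is an assumed illustration rather than the sort used in iPic3D: `sort_by_cell` and the `cell_of` callback are hypothetical, and the MPI and thread levels of the sort are left out. Each particle record is read once and written once to its final slot, which is the access pattern that whole-cache-line records make cheap.

```python
def sort_by_cell(particles, cell_of, num_cells):
    """Counting sort of particle records by mesh cell index.

    particles: list of records (for example 8-component lists).
    cell_of:   function mapping a record to its cell index in [0, num_cells).
    Returns (sorted_particles, offsets), where the particles of cell c occupy
    sorted_particles[offsets[c]:offsets[c + 1]].
    """
    # Pass 1: count the particles in each cell.
    counts = [0] * num_cells
    for p in particles:
        counts[cell_of(p)] += 1

    # Prefix sum gives the first destination slot of each cell.
    offsets = [0] * (num_cells + 1)
    for c in range(num_cells):
        offsets[c + 1] = offsets[c] + counts[c]

    # Pass 2: write each record straight to its destination slot.
    cursor = offsets[:-1]            # slicing copies, so offsets stays intact
    out = [None] * len(particles)
    for p in particles:
        c = cell_of(p)
        out[cursor[c]] = p
        cursor[c] += 1
    return out, offsets
```

The returned offsets delimit the per-cell groups that a blocked, vectorized pusher would then walk through.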

Using AoS fields and moments in particle solver

In the field solver we represent fields and moments in SoA format. This allows better vectorization of the implicit solver.

In the particle solver, we represent fields and moments with AoS format:
- AoS gives better localization of random access.
- SoA fields and moments offer no benefit to vectorization of particle processing.
- The transpose is done only once per cycle.
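The once-per-cycle conversion of fields from SoA to AoS can be sketched as follows. This is an assumption-laden illustration: the names `fields_soa_to_aos` and `field_at` are hypothetical, cells are indexed by a single flat index, and the component order is the one from the data-mapping slide (B_x, B_y, B_z, ψ, E_x, E_y, E_z, φ).

```python
FIELD_COMPONENTS = ("Bx", "By", "Bz", "psi", "Ex", "Ey", "Ez", "phi")


def fields_soa_to_aos(soa):
    """Interleave per-component arrays into one 8-value record per cell.

    soa: dict mapping each name in FIELD_COMPONENTS to a list over cells.
    Returns a list with one 8-tuple per cell, in FIELD_COMPONENTS order.
    """
    return list(zip(*(soa[name] for name in FIELD_COMPONENTS)))


def field_at(aos, cell):
    """Fetch all eight field values of one cell from a single record."""
    return dict(zip(FIELD_COMPONENTS, aos[cell]))
```

In the SoA layout, a particle in a given cell would touch eight separate arrays to gather its field data; in the AoS layout the same data sits in one contiguous record.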
