SLIDE 1

Solutions for Efficient Memory Access for Cubic Lattices and Random Number Algorithms

GTC 2015 Speaker: Dr. Matteo Lulli

Prof. M. Bernaschi and Prof. G. Parisi

March the 19th, 2015

Matteo Lulli - arXiv: 1114.0127 - lullimat.org Solutions for Efficient Memory Access for Cubic Lattices and Random Number Algorithms 1 / 31

SLIDE 2

Outline

1. Cubic stencils
2. PRNGs
3. Multi-GPU and MPI
4. Results
5. Conclusions & Perspectives

SLIDE 7

Motivations

Phase transitions in disordered systems
Equilibrium Monte Carlo analysis works well for non-disordered systems
Disordered systems are very hard to equilibrate
A large number of disorder realizations (samples) is required
3D Ising spin glass: at most L = 40 equilibrated so far (Janus, a dedicated FPGA machine)
Robust out-of-equilibrium methods for the study of phase transitions would be very useful

Why GPUs

Even with out-of-equilibrium methods, ordinary CPUs are not powerful enough to obtain good estimates

SLIDE 8

Section 1: Cubic stencils

SLIDE 13

Standard checkerboard pattern

Nearest-neighbour problems in 3D
Cubic lattice of linear size L = 2n: checkerboard colouring
Each lattice site has nearest neighbours of the other colour

Allocation choices

Unified allocation: one array
Separated allocation: two arrays

[Figure: cubic stencil of site i with its six nearest neighbours spx, smx, spy, smy, spz, smz]

The parity, $(-1)^{x_i + y_i + z_i}$, of each lattice site has to be taken into account
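
A minimal C sketch of why the parity enters the standard scheme, assuming a lexicographic site order and a separated (two-array) allocation in which x is halved. The helper names (`site_parity`, `half_index`, `xplus_offset`, `check_neighbour_colours`) and this particular packing are illustrative assumptions, not the code used in the talk.

```c
#include <assert.h>

/* Colour of site (x, y, z): 0 if x+y+z is even, 1 if odd. */
static int site_parity(int x, int y, int z) { return (x + y + z) & 1; }

/* Position of a site inside the array of its own colour.
 * Each row along x holds L/2 sites of each colour, so x is halved. */
static int half_index(int x, int y, int z, int L) {
    assert(L > 0 && L % 2 == 0);
    return x / 2 + (L / 2) * (y + L * z);
}

/* Difference between the position of the +x neighbour (in the other
 * colour's array) and the position of the site (in its own array).
 * Wrap-around is excluded here to keep the example short. */
static int xplus_offset(int x, int y, int z, int L) {
    assert(x >= 0 && x < L - 1);
    return half_index(x + 1, y, z, L) - half_index(x, y, z, L);
}

/* Returns 1 if every site has six neighbours of the opposite colour
 * (periodic boundaries, even L). */
static int check_neighbour_colours(int L) {
    static const int d[6][3] = {{1, 0, 0}, {-1, 0, 0}, {0, 1, 0},
                                {0, -1, 0}, {0, 0, 1}, {0, 0, -1}};
    assert(L > 0 && L % 2 == 0);
    for (int z = 0; z < L; ++z)
        for (int y = 0; y < L; ++y)
            for (int x = 0; x < L; ++x)
                for (int n = 0; n < 6; ++n) {
                    int nx = (x + d[n][0] + L) % L;
                    int ny = (y + d[n][1] + L) % L;
                    int nz = (z + d[n][2] + L) % L;
                    if (site_parity(nx, ny, nz) == site_parity(x, y, z))
                        return 0;
                }
    return 1;
}
```

In this packing, two sites of the same colour, for example (0, 0, 0) and (1, 1, 0), see their +x neighbour at different offsets, so the kernel has to know the parity of y + z before it can address a neighbour.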

SLIDE 14

Sliced checkerboard pattern: definition

It is always possible to remap the lattice sites
Yavors’kii et al.: Heisenberg spin glass, ‘snake-like’ pattern, unified allocation
Ferrero et al.: 2D Potts model, separated allocation

Sliced scheme

[Figure: animation in which sites from the planes z = 0, 1, 2, 3 of the original lattice (x, y, z) are gathered, step by step, into the slice z′ = 0 of the remapped lattice (x′, y′, z′)]

Periodic boundary conditions are necessary

SLIDE 31

Sliced checkerboard pattern: geometric interpretation

Sites of planes orthogonal to n = (+1, −1, +1) are one-coloured.
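
The geometric statement can be checked numerically. The sketch below assumes that a slice is labelled by s = (x − y + z) mod L, which is one natural reading of "planes orthogonal to n = (+1, −1, +1)" under periodic boundary conditions. The function names are hypothetical, and the actual (x′, y′, z′) remapping used in the talk is not reproduced here.

```c
#include <assert.h>

/* Slice label of site (x, y, z): projection on n = (+1, -1, +1),
 * wrapped periodically. An even L keeps the parity consistent
 * across the wrap. */
static int slice_index(int x, int y, int z, int L) {
    assert(L > 0 && L % 2 == 0);
    return ((x - y + z) % L + L) % L; /* non-negative remainder */
}

/* Colour of site (x, y, z): 0 if x+y+z is even, 1 if odd. */
static int site_parity(int x, int y, int z) { return (x + y + z) & 1; }

/* Returns 1 if the colour of every site equals the parity of its
 * slice label, i.e. every slice is one-coloured. */
static int slices_are_one_coloured(int L) {
    for (int z = 0; z < L; ++z)
        for (int y = 0; y < L; ++y)
            for (int x = 0; x < L; ++x)
                if (site_parity(x, y, z) != (slice_index(x, y, z, L) & 1))
                    return 0;
    return 1;
}

/* Number of sites carrying slice label s. */
static int slice_population(int s, int L) {
    int count = 0;
    for (int z = 0; z < L; ++z)
        for (int y = 0; y < L; ++y)
            for (int x = 0; x < L; ++x)
                if (slice_index(x, y, z, L) == s)
                    ++count;
    return count;
}
```

Under this labelling, even slices hold one colour and odd slices the other, and each of the L slices contains L × L sites, the same as a plane of the original lattice.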

SLIDE 40

Sliced checkerboard pattern: cubic stencils

[Figure: two drawings of the cubic stencil of site i with its six nearest neighbours spx, smx, spy, smy, spz, smz]

Variables belonging to one slice are decoupled
No site parity is involved
The spz, smy, spx nearest neighbours belong to the upper slice
The smz, spy, smx nearest neighbours belong to the lower slice
The method can be generalized to any dimension
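
The upper/lower split of the stencil can be verified with a few lines of C. As an assumption, slices are labelled by s = (x − y + z) mod L, so that "upper" means s + 1 and "lower" means s − 1; the helper names are made up for this sketch.

```c
#include <assert.h>

/* Non-negative remainder of v modulo L. */
static int wrap(int v, int L) { return (v % L + L) % L; }

/* Slice label assumed in this sketch: s = (x - y + z) mod L. */
static int slice_index(int x, int y, int z, int L) {
    assert(L > 0 && L % 2 == 0);
    return wrap(x - y + z, L);
}

/* Returns 1 if, for every site, the spx, smy, spz neighbours lie in
 * slice s+1 and the smx, spy, smz neighbours lie in slice s-1. */
static int neighbours_in_adjacent_slices(int L) {
    for (int z = 0; z < L; ++z)
        for (int y = 0; y < L; ++y)
            for (int x = 0; x < L; ++x) {
                int s = slice_index(x, y, z, L);
                int up = wrap(s + 1, L);
                int down = wrap(s - 1, L);
                if (slice_index(wrap(x + 1, L), y, z, L) != up) return 0;   /* spx */
                if (slice_index(x, wrap(y - 1, L), z, L) != up) return 0;   /* smy */
                if (slice_index(x, y, wrap(z + 1, L), L) != up) return 0;   /* spz */
                if (slice_index(wrap(x - 1, L), y, z, L) != down) return 0; /* smx */
                if (slice_index(x, wrap(y + 1, L), z, L) != down) return 0; /* spy */
                if (slice_index(x, y, wrap(z - 1, L), L) != down) return 0; /* smz */
            }
    return 1;
}
```

Because no neighbour of a site shares its slice, all sites of a slice can be updated concurrently without consulting a per-site parity.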

SLIDE 41

Section 2: PRNGs

SLIDE 45

Linear Congruential PRNG

Lehmer-Park-Miller MINSTD: $R_{n+1} = (16807\, R_n) \bmod (2^{31} - 1)$
D. Carta's implementation: no modulus, and overflow handled with 32-bit integers only
Easy implementation: one-word state
Coalescence is obtained by definition
[Figure: array R[i] with consecutive entries 1, 2, 3, 4, 5, ...]
More than one instance: read once, store once
Low-quality random numbers, but still useful in some cases

MINSTD overflow handling

No 64-bit variables and no 'mod' involved

    RNGT lo = 16807 * (seed & 0xffff);
    RNGT hi = 16807 * (seed >> 16);
    lo += (hi & 0x7fff) << 16;
    lo += hi >> 15;
    if (lo > 0x7fffffff) lo -= 0x7fffffff;
    return lo;

Avoiding the 'if' statement

We need to avoid branching

    lo -= ((-((lo & 0x80000000) >> 31)) & 0x7fffffff);
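
The two snippets above can be combined into a self-contained C function and checked against a straightforward 64-bit implementation. This is a sketch under the assumption that RNGT is an unsigned 32-bit type; the names `minstd_carta`, `minstd_reference`, `carta_matches_reference` and `minstd_nth` are introduced here for illustration, and the 64-bit reference is used only for verification.

```c
#include <assert.h>
#include <stdint.h>

/* One MINSTD step with Carta's folding: 32-bit arithmetic only, no '%'.
 * Valid for 0 < seed < 2^31 - 1. */
static uint32_t minstd_carta(uint32_t seed) {
    assert(seed > 0u && seed < 0x7fffffffu);
    uint32_t lo = 16807u * (seed & 0xffffu);
    uint32_t hi = 16807u * (seed >> 16);
    lo += (hi & 0x7fffu) << 16;
    lo += hi >> 15;
    /* Branch-free form of: if (lo > 0x7fffffff) lo -= 0x7fffffff; */
    lo -= (0u - ((lo & 0x80000000u) >> 31)) & 0x7fffffffu;
    return lo;
}

/* Reference step with a 64-bit product and an explicit modulus. */
static uint32_t minstd_reference(uint32_t seed) {
    return (uint32_t)((16807ull * seed) % 0x7fffffffull);
}

/* Returns 1 if both versions agree for n consecutive steps. */
static int carta_matches_reference(uint32_t seed, int n) {
    uint32_t a = seed, b = seed;
    for (int i = 0; i < n; ++i) {
        a = minstd_carta(a);
        b = minstd_reference(b);
        if (a != b) return 0;
    }
    return 1;
}

/* Value of the sequence after n steps from the given seed. */
static uint32_t minstd_nth(uint32_t seed, int n) {
    for (int i = 0; i < n; ++i) seed = minstd_carta(seed);
    return seed;
}
```

Starting from seed 1, the 10000th value should be 1043618065, the standard check value for this generator.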

SLIDE 48

Lagged-Fibonacci-like PRNGs: Parisi-Rapuano

Definition

Modified lagged-Fibonacci PRNG

    ira[i] = ira[i - 24] + ira[i - 55];
    R = ira[i] ^ ira[i - 61];

The state needs at least 62 entries
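
A minimal sequential sketch of this recurrence in C, assuming 32-bit words and a ring buffer of 64 entries (the smallest power of two that covers the 62 required entries, so that indices wrap with a mask). The type `pr_state`, the function names and the LCG used for seeding are placeholders chosen for this example, not part of the original definition.

```c
#include <assert.h>
#include <stdint.h>

#define PR_RING 64u /* power of two >= 62 */

typedef struct {
    uint32_t ira[PR_RING];
    uint32_t i; /* logical index of the next entry to be written */
} pr_state;

/* Fill the state with a simple 32-bit LCG (placeholder seeding only). */
static void pr_seed(pr_state *st, uint32_t seed) {
    assert(st != 0);
    for (uint32_t k = 0; k < PR_RING; ++k) {
        seed = 1664525u * seed + 1013904223u;
        st->ira[k] = seed;
    }
    st->i = 0u;
}

/* One step: ira[i] = ira[i-24] + ira[i-55]; R = ira[i] ^ ira[i-61]. */
static uint32_t pr_next(pr_state *st) {
    const uint32_t m = PR_RING - 1u;
    uint32_t i = st->i;
    uint32_t v = st->ira[(i - 24u) & m] + st->ira[(i - 55u) & m];
    st->ira[i & m] = v;
    st->i = i + 1u;
    return v ^ st->ira[(i - 61u) & m];
}

/* Returns 1 if the ring buffer reproduces the recurrence evaluated
 * on a plain, ever-growing array for n steps (n <= 1000). */
static int pr_matches_linear(uint32_t seed, int n) {
    enum { MAXN = 1000 };
    static uint32_t lin[MAXN + 64];
    pr_state st;
    assert(n >= 0 && n <= MAXN);
    pr_seed(&st, seed);
    for (int p = 0; p < 64; ++p) lin[p] = st.ira[p];
    for (int t = 64; t < 64 + n; ++t) {
        lin[t] = lin[t - 24] + lin[t - 55];
        if (pr_next(&st) != (lin[t] ^ lin[t - 61])) return 0;
    }
    return 1;
}
```

Writing entry i overwrites entry i − 64, which is safe because the largest lag is 61.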

GPU implementation

Common practice: load one or more states in shared memory
Lags can be used for threads to work together. However, these lags are not suitable (Weigel 2012)
Good trade-off between efficiency and quality

Lagged-Fibonacci-like PRNGs: new memory access

One state per thread with coalescing

    seed = ira[(i - 24) * threads + threadId] + ira[(i - 55) * threads + threadId];
    ira[i * threads + threadId] = seed;
    seed ^= ira[(i - 61) * threads + threadId];

The method is general and can be applied to the well-known Mersenne Twister (MT19937)
Huge memory consumption ∝ $N_{threads} \times N_{state}$
Parisi-Rapuano: $N_{PR} = 62$
Mersenne Twister MT19937: $N_{MT} = 624$
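
The interleaved layout can be emulated on the CPU to confirm that it only changes where the words live, not the numbers produced. In the sketch below, threads are simulated by a loop, each per-thread state is a ring of 64 entries (an assumption, chosen as a power of two above the 62 required), and all function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define NSTATE 64u /* entries per thread: power of two >= 62 (sketch choice) */

/* One step for thread `tid`: slot j of every thread is contiguous,
 * ira[j * threads + tid], so neighbouring threads touch neighbouring words. */
static uint32_t pr_step_interleaved(uint32_t *ira, uint32_t i,
                                    uint32_t threads, uint32_t tid) {
    const uint32_t m = NSTATE - 1u;
    assert(tid < threads);
    uint32_t seed = ira[((i - 24u) & m) * threads + tid]
                  + ira[((i - 55u) & m) * threads + tid];
    ira[(i & m) * threads + tid] = seed;
    seed ^= ira[((i - 61u) & m) * threads + tid];
    return seed;
}

/* Same step on a private, contiguous state (one array per thread). */
static uint32_t pr_step_private(uint32_t *st, uint32_t i) {
    const uint32_t m = NSTATE - 1u;
    uint32_t seed = st[(i - 24u) & m] + st[(i - 55u) & m];
    st[i & m] = seed;
    return seed ^ st[(i - 61u) & m];
}

/* Returns 1 if both layouts give identical streams for every thread. */
static int layouts_agree(uint32_t threads, uint32_t steps) {
    uint32_t *inter = malloc(NSTATE * threads * sizeof *inter);
    uint32_t *priv = malloc(NSTATE * threads * sizeof *priv);
    uint32_t s = 2015u; /* arbitrary seed for the placeholder LCG below */
    int ok = 1;
    if (!inter || !priv) { free(inter); free(priv); return 0; }
    for (uint32_t tid = 0; tid < threads; ++tid)
        for (uint32_t j = 0; j < NSTATE; ++j) {
            s = 1664525u * s + 1013904223u;
            priv[tid * NSTATE + j] = s;
            inter[j * threads + tid] = s;
        }
    for (uint32_t i = 0; i < steps && ok; ++i)
        for (uint32_t tid = 0; tid < threads; ++tid)
            if (pr_step_interleaved(inter, i, threads, tid) !=
                pr_step_private(priv + tid * NSTATE, i))
                ok = 0;
    free(inter);
    free(priv);
    return ok;
}
```

Both buffers hold threads × NSTATE words, which is the memory cost noted above; the interleaved one simply orders them so that a warp reads one contiguous run per lag.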

SLIDE 51

Benchmarks: host API

PRAND benchmark results using the cuRAND host API
Filling an array of $2^{29}$ single-precision floating-point variables
$t_{INIT}$: time needed to initialize the PRNG
$t_{INIT}^{urand}$: time needed to initialize the PRNG using /dev/urandom
$t_{GEN}$: generation time

[Figure: initialization times on Tesla M2090, Tesla K20X and GTX Titan. cuRAND host API panel: $t_{INIT}$ (s) for MTGP32, XORWOW and MT19937. "My Host API" panel: $t_{INIT}^{urand}$ (s) for MT19937, PR and MINSTD.]

[Figure: generation times $t_{GEN}$ (s) on Tesla M2090, Tesla K20X and GTX Titan. cuRAND host API panel: MTGP32, XORWOW and MT19937. "My Host API" panel: MT19937, PR and MINSTD.]

SLIDE 53

Benchmarks: device API

Number of instances per second
Launch configuration: 64 blocks of 256 threads
Each thread produces $2^{15}$ instances, repeated 10 times

[Figure: instances per second (in units of $10^9$) on Tesla M2090, Tesla K20X, GTX Titan and GTX 680. cuRAND device API panel: MTGP32. "My Device API" panel: MT19937 and PR.]

SLIDE 56

Dieharder Tests

Let us test the generators initialized via /dev/urandom against Dieharder

[Figure: three panels (MT19937, LCG32, 32-bit Parisi-Rapuano) showing the fraction of Dieharder tests rated PASSED, WEAK and FAILED as a function of the number of threads, from 1 to $2^{20}$ for MT19937 and from 1 to $2^{22}$ for LCG32 and Parisi-Rapuano]

SLIDE 57

Section 3: Multi-GPU and MPI

SLIDE 60

CUDA streams

Separated update: all reds and then all blues
Communication handled by MPI
Slicing of the system along the z direction: bulk + boundary

Strategy

Overlap the update of the boundaries and their transfer with the update of the bulk
One stream each

[Figure: timeline (t) of two CUDA streams, Stream 0 and Stream 1, showing the boundary update, the D2H copy, the MPI exchange and the H2D copy overlapped with the bulk update]

SLIDE 61

Sliced checkerboard and MPI

Standard checkerboard boundaries are two-coloured
Sliced checkerboard boundaries are one-coloured
Sliced scheme: communication between nodes is one-directional only

[Figure: the n-th system (bulk plus two boundaries) exchanging data through MPI buffers with the bulk and boundary of the (n − 1)-th and (n + 1)-th systems, drawn once for the standard scheme and once for the sliced scheme]

SLIDE 65

Section 4: Results

SLIDE 66

Three-dimensional Edwards-Anderson model

Three-dimensional spin glass model: $H = -\sum_{ik} J_{ik}\, \sigma_i \sigma_k$, with $\sigma_i \in \{-1, +1\}$ and $J_{ik} \in \{-1, +1\}$
Bimodal disorder: $P(J_{ik}) = \frac{1}{2}\left[\delta_K(J_{ik} + 1) + \delta_K(J_{ik} - 1)\right]$
Two-valued variables: one variable, one bit
Asynchronous MultiSpin Coding (AMSC), avoiding 'if' statements
Bit encoding: $J_{ik} = -1 \to 1$, $J_{ik} = +1 \to 0$, $\sigma_i = +1 \to 1$, $\sigma_i = -1 \to 0$
Bond energy bit: e_ik = σ_i ^ σ_k ^ J_ik

[Figure: the words σ_i, σ_k and J_ik, with bits labelled 0, 1, 2, are combined with XOR into the energy bits e_ik0, e_ik1, e_ik2; the sum of the e_ik over the bonds ik is stored bit by bit in the words s0_i, s1_i, s2_i]
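
The bit encoding lends itself to a branch-free count of unsatisfied bonds. The C sketch below assumes 32-bit words whose 32 bits belong to 32 independent samples, and accumulates the six bond bits of a site into three bit-plane words, mirroring the s0, s1, s2 words of the figure. The function names and the ripple-carry accumulation are choices made for this example and may differ from the implementation presented in the talk.

```c
#include <assert.h>
#include <stdint.h>

/* Bond bits for 32 samples at once: a set bit marks an unsatisfied bond. */
static uint32_t bond_bits(uint32_t si, uint32_t sk, uint32_t jik) {
    return si ^ sk ^ jik;
}

/* Add the one-bit values in e to the 3-bit counters held bit-plane-wise
 * in s[0] (least significant), s[1], s[2]. No branches are needed. */
static void add_bits(uint32_t e, uint32_t s[3]) {
    uint32_t c0 = s[0] & e;  /* carry out of plane 0 */
    s[0] ^= e;
    uint32_t c1 = s[1] & c0; /* carry out of plane 1 */
    s[1] ^= c0;
    s[2] ^= c1;              /* the total never exceeds 6, so 3 bits suffice */
}

/* Count the unsatisfied bonds of site i over its six neighbours. */
static void local_counts(uint32_t si, const uint32_t sk[6],
                         const uint32_t jik[6], uint32_t s[3]) {
    s[0] = s[1] = s[2] = 0u;
    for (int b = 0; b < 6; ++b)
        add_bits(bond_bits(si, sk[b], jik[b]), s);
}

/* Returns 1 if the bit-plane counters match a plain per-sample count
 * on pseudo-random inputs. */
static int amsc_self_check(uint32_t seed, int trials) {
    assert(trials > 0);
    for (int t = 0; t < trials; ++t) {
        uint32_t w[13], s[3];
        for (int k = 0; k < 13; ++k) {
            seed = 1664525u * seed + 1013904223u; /* test data only */
            w[k] = seed ^ (seed >> 16);
        }
        local_counts(w[0], &w[1], &w[7], s);
        for (int bit = 0; bit < 32; ++bit) {
            int count = 0;
            for (int b = 0; b < 6; ++b)
                count += (int)((bond_bits(w[0], w[1 + b], w[7 + b]) >> bit) & 1u);
            int packed = (int)((s[0] >> bit) & 1u)
                       | (int)(((s[1] >> bit) & 1u) << 1)
                       | (int)(((s[2] >> bit) & 1u) << 2);
            if (count != packed) return 0;
        }
    }
    return 1;
}
```

With the encoding above, two aligned spins joined by a coupling $J_{ik} = +1$ give a zero bond bit, that is, a satisfied bond.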

SLIDE 67

Single- & Multi-GPU Results

The smaller the lattice, the more disorder realizations are needed to attain saturation
Let k be the number of coded disorder realizations, and define
$\mathrm{psFlip}_{n,\mathrm{mstd}}(L, k) = t_{sw} \cdot n \cdot \left[32 \cdot k \cdot 4 \cdot L^{3}\right]^{-1}$

[Figure: $t_{sw}$ (s) and $\mathrm{psFlip}_{1,\mathrm{mstd}}$ as functions of k, from $2^0$ to $2^{14}$, for L = 8, 16, 32, 64]

SLIDE 68

Single- & Multi-GPU Results

Values of $\mathrm{psFlip}_{1,\mathrm{mstd}}$ for different values of L = 4k
Sliced (red), Standard (green), Standard-Bitwise (blue)

[Figure: $\mathrm{psFlip}_{1,\mathrm{mstd}}$ versus L, from 8 to 256, on Tesla M2090 and Tesla K20X]

SLIDE 69

Single- & Multi-GPU Results

Values of $\mathrm{psFlip}_{1,\mathrm{mstd}}$ for different values of L = 4k
Sliced (red), Standard (green), Standard-Bitwise (blue)

[Figure: $\mathrm{psFlip}_{1,\mathrm{mstd}}$ versus L, for L from 8 to 256. Panels: GTX 680, GTX Titan.]

SLIDE 70

Single- & Multi-GPU Results

Bandwidth for different values of L = 4k
Sliced (red), Standard (green), Standard-Bitwise (blue)

[Figure: bandwidth (GB/s) versus L, for L from 8 to 256. Panels: Tesla M2090, GTX 680, GTX Titan.]

SLIDE 71

Single- & Multi-GPU Results

Values of $\mathrm{psFlip}_{1,x}$ for different random number generators and L = 4k
Sliced (red), Standard (green), Standard-Bitwise (blue)

[Figure: $\mathrm{psFlip}_{1,x}$ versus L, for L from 8 to 256. Panels: Parisi-Rapuano, MT19937, cuRAND XORWOW, cuRAND MTGP32.]

SLIDE 72

Single- & Multi-GPU Results (CSCS Piz Daint)

Strong-scaling efficiency

$$\eta_{SC} = \frac{\mathrm{psFlip}_{1,x}}{\mathrm{psFlip}_{N,x}}$$

[Figure: $\eta_{SC}$ versus L, for L from 32 to 256, with $N_{GPU}$ = 2, 4, 8.]
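Since the psFlip definition given earlier in the Results section is proportional to t_sw · n, the efficiency ratio at fixed L and k reduces to $\eta_{SC} = t_{sw}^{(1)} / (N \, t_{sw}^{(N)})$. The following C sketch computes it from two measured sweep times; the function name and the argument names are hypothetical, and the reduction assumes both runs use the same L and k.

```c
#include <assert.h>

/* Strong-scaling efficiency at fixed L and k from measured sweep times.
 * With psFlip_N proportional to t_sw(N) * N, the ratio psFlip_1 / psFlip_N
 * reduces to t_sw(1) / (N * t_sw(N)). */
static double strong_scaling_efficiency(double t_sw_1, double t_sw_n, int n_gpu)
{
    assert(t_sw_1 > 0.0 && t_sw_n > 0.0 && n_gpu >= 1);
    return t_sw_1 / ((double)n_gpu * t_sw_n);
}
```

A value of 1 means that N GPUs complete a sweep exactly N times faster than one GPU; communication overhead pushes the value below 1.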

SLIDE 73

Single- & Multi-GPU Results (CSCS Piz Daint)

[Figure: $\mathrm{psFlip}_{N,\mathrm{mstd}}/N$ versus L, for L from 8 to 1024 on logarithmic axes, with $N_{GPU}$ = 1, 2, 4, 8, 16, 32.]

SLIDE 74

Single- & Multi-GPU Results (CSCS Piz Daint)

Data scaling

$$\frac{L}{N}\,\mathrm{psFlip}_{N,x}(L, k) \propto t_{sw} \left(\frac{L}{N}\right)^{-2}, \qquad x = \frac{L}{N}$$

[Figure: $x \cdot \mathrm{psFlip}_{N,\mathrm{mstd}}$ versus $x$, for N = 2, 4, 8, 16, 32.]

SLIDE 75

Physics results

This GPU implementation, together with a new theoretical approach, allowed us to obtain cutting-edge estimates of the critical parameters of the 3D Ising Spin Glass phase transition.
3.1 years of one GTX Titan: ∼20 times faster than high-end CPU implementations and comparable with FPGA (Janus 2, 2013)

                    Tc          ν          η            ω         z         Years
Janus 2013          1.1019(29)  2.562(42)  −0.3900(36)  1.12(10)            75 (FPGA)
Our Work (1)        1.099(5)    2.47(10)   −0.39(1)     1.3(2)    6.80(15)  3.1 (GPU)
Haus. et al. 2008   1.109(10)   2.45(15)   −0.375(10)   1.0(1)              40 (CPU)

The use of GPUs was essential to prove the validity and effectiveness of the new theoretical approach.

(1) M. Lulli, G. Parisi & A. Pelissetto, in preparation

SLIDE 76

Outline for section 5

1. Cubic stencils
2. PRNGs
3. Multi-GPU and MPI
4. Results
5. Conclusions & Perspectives

SLIDE 77

Conclusions & Perspectives

Conclusions

- A new sliced cubic stencil access pattern which in some cases performs better than highly tuned solutions
- A new access pattern for lagged-Fibonacci-like PRNGs which performs about 2× better than its analogues in CURAND
- 3D Ising Spin Glass on a single GPU: performance is stable over a large interval of L (the largest so far); the MT19937 PRNG is only ∼10% slower than the Parisi-Rapuano (PR) generator
- The sliced multi-GPU version shows very good strong-scaling efficiency and competitive speeds for dynamic studies
- Together with a new off-equilibrium finite-size scaling approach, we obtained the second most precise estimates so far of the EA3D critical parameters

Perspectives

- Implement Parallel Tempering dynamics for equilibrium simulations

SLIDE 78

matteo.lulli@gmail.com ...and please give your feedback!


SLIDE 79

Even and Odd subsequences for L = 2n

Values of $\mathrm{psFlip}_{1,\mathrm{mstd}}$ for different values of L = 2n
Sliced (red), Standard (green), Standard-Bitwise (blue)

[Figure: $\mathrm{psFlip}_{1,\mathrm{mstd}}$ versus L, for L from 8 to 256. Panels: Tesla M2090, Tesla K20X.]

SLIDE 80

Even and Odd subsequences for L = 2n

Values of $\mathrm{psFlip}_{1,\mathrm{mstd}}$ for different values of L = 2n
Sliced (red), Standard (green), Standard-Bitwise (blue)

[Figure: $\mathrm{psFlip}_{1,\mathrm{mstd}}$ versus L, for L from 8 to 256. Panels: GTX 680, GTX Titan.]

SLIDE 81

Even and Odd subsequences for L = 2n

What is happening?

$$L_0 = 2(2m), \qquad \frac{V_0}{2} = \frac{(4m)^3}{2} = 32 m^3, \qquad (32 m^3) \bmod 32 = 0 \quad \forall\, m \in \mathbb{N}$$

$$L_1 = 2(2m + 1), \qquad \frac{V_1}{2} = \frac{[2(2m + 1)]^3}{2} = 4(2m + 1)^3, \qquad [4(2m + 1)^3] \bmod 32 \neq 0 \quad \forall\, m \in \mathbb{N}$$

The odd subsequence is never commensurate with the actual warp size: for every sample there is one branching warp.

[Figure: branch/k versus L, for L from 8 to 256.]
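The commensurability argument on this slide can be checked numerically. The sketch below computes the remainder of the half-volume V/2 = L^3/2 modulo a warp size of 32, which is the value used in the slide's modular arithmetic. The function name is hypothetical, and reading the remainder as the number of threads in a partially filled last warp assumes one thread per site of a half lattice, which the slide does not state explicitly.

```c
#include <assert.h>

enum { WARP_SIZE = 32 };

/* Remainder of the half-volume V/2 = L^3/2 modulo the warp size, for even L.
 * A non-zero result means V/2 is not a whole number of warps. */
static unsigned long half_volume_remainder(unsigned long L)
{
    assert(L % 2ul == 0ul);
    return (L * L * L / 2ul) % WARP_SIZE;
}
```

For L = 4m the remainder is always 0, while for L = 4m + 2 it equals 4(2m + 1)^3 mod 32, which is never 0 because (2m + 1)^3 is odd and 4 times an odd number is not divisible by 32.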

SLIDE 82

More on MultiSpin-Coding

Two-valued variables: one variable-one bit


SLIDE 83

More on MultiSpin-Coding

Two-valued variables: one variable-one bit
Asynchronous MultiSpin-Coding (AMSC)

SLIDE 84

More on MultiSpin-Coding

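Two-valued variables: one variable-one bit
Asynchronous MultiSpin-Coding (AMSC)

Bit encoding: $J_{ik} = -1 \to 1$, $J_{ik} = +1 \to 0$, $\sigma_i = +1 \to 1$, $\sigma_i = -1 \to 0$

Bond bits, one sample per bit position: $e_{ik} = \sigma_i \oplus \sigma_k \oplus J_{ik}$ ($\oplus$ is the bitwise XOR, ^ in C), that is

$$(\sigma_i^0, \sigma_i^1, \sigma_i^2) \oplus (\sigma_k^0, \sigma_k^1, \sigma_k^2) \oplus (J_{ik}^0, J_{ik}^1, J_{ik}^2) = (e_{ik}^0, e_{ik}^1, e_{ik}^2)$$

Sum over bonds, stored as three bit-planes:

$$\sum_{ik} (e_{ik}^0, e_{ik}^1, e_{ik}^2) = \begin{pmatrix} s2_i^0 & s2_i^1 & s2_i^2 \\ s1_i^0 & s1_i^1 & s1_i^2 \\ s0_i^0 & s0_i^1 & s0_i^2 \end{pmatrix}$$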

SLIDE 85

More on MultiSpin-Coding

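Two-valued variables: one variable-one bit
Asynchronous MultiSpin-Coding (AMSC)

Bit encoding: $J_{ik} = -1 \to 1$, $J_{ik} = +1 \to 0$, $\sigma_i = +1 \to 1$, $\sigma_i = -1 \to 0$

Bond bits, one sample per bit position: $e_{ik} = \sigma_i \oplus \sigma_k \oplus J_{ik}$ ($\oplus$ is the bitwise XOR, ^ in C), that is

$$(\sigma_i^0, \sigma_i^1, \sigma_i^2) \oplus (\sigma_k^0, \sigma_k^1, \sigma_k^2) \oplus (J_{ik}^0, J_{ik}^1, J_{ik}^2) = (e_{ik}^0, e_{ik}^1, e_{ik}^2)$$

Sum over bonds, stored as three bit-planes:

$$\sum_{ik} (e_{ik}^0, e_{ik}^1, e_{ik}^2) = \begin{pmatrix} s2_i^0 & s2_i^1 & s2_i^2 \\ s1_i^0 & s1_i^1 & s1_i^2 \\ s0_i^0 & s0_i^1 & s0_i^2 \end{pmatrix}$$

Metropolis Dynamics - Avoiding 'if' statements

$$\Delta E = H[\{\sigma_{i \neq a}, -\sigma_a\}] - H[\{\sigma_{i \neq a}, \sigma_a\}] \in \{-12, -8, -4, 0, 4, 8, 12\}$$

The acceptance probability

$$P_{flip}(\Delta E) = \begin{cases} 1, & \Delta E \leq 0 \\ e^{-\beta \Delta E}, & \Delta E > 0 \end{cases}$$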

SLIDE 86

More on MultiSpin-Coding

Two-valued variables: one variable-one bit
Asynchronous MultiSpin-Coding (AMSC)

Bit encoding: $J_{ik} = -1 \to 1$, $J_{ik} = +1 \to 0$, $\sigma_i = +1 \to 1$, $\sigma_i = -1 \to 0$

Bond bits, one sample per bit position: $e_{ik} = \sigma_i \oplus \sigma_k \oplus J_{ik}$ ($\oplus$ is the bitwise XOR, ^ in C), that is

$$(\sigma_i^0, \sigma_i^1, \sigma_i^2) \oplus (\sigma_k^0, \sigma_k^1, \sigma_k^2) \oplus (J_{ik}^0, J_{ik}^1, J_{ik}^2) = (e_{ik}^0, e_{ik}^1, e_{ik}^2)$$

Sum over bonds, stored as three bit-planes:

$$\sum_{ik} (e_{ik}^0, e_{ik}^1, e_{ik}^2) = \begin{pmatrix} s2_i^0 & s2_i^1 & s2_i^2 \\ s1_i^0 & s1_i^1 & s1_i^2 \\ s0_i^0 & s0_i^1 & s0_i^2 \end{pmatrix}$$

Metropolis Dynamics - Avoiding 'if' statements

(0, 0, 0) = 0 → ∆E = −12
(0, 0, 1) = 1 → ∆E = −8
(0, 1, 0) = 2 → ∆E = −4
(0, 1, 1) = 3 → ∆E = 0
(1, 0, 0) = 4 → ∆E = 4
(1, 0, 1) = 5 → ∆E = 8
(1, 1, 0) = 6 → ∆E = 12

SLIDE 87

More on MultiSpin-Coding

Two-valued variables: one variable-one bit
Asynchronous MultiSpin-Coding (AMSC)

Bit encoding: $J_{ik} = -1 \to 1$, $J_{ik} = +1 \to 0$, $\sigma_i = +1 \to 1$, $\sigma_i = -1 \to 0$

Bond bits, one sample per bit position: $e_{ik} = \sigma_i \oplus \sigma_k \oplus J_{ik}$ ($\oplus$ is the bitwise XOR, ^ in C), that is

$$(\sigma_i^0, \sigma_i^1, \sigma_i^2) \oplus (\sigma_k^0, \sigma_k^1, \sigma_k^2) \oplus (J_{ik}^0, J_{ik}^1, J_{ik}^2) = (e_{ik}^0, e_{ik}^1, e_{ik}^2)$$

Sum over bonds, stored as three bit-planes:

$$\sum_{ik} (e_{ik}^0, e_{ik}^1, e_{ik}^2) = \begin{pmatrix} s2_i^0 & s2_i^1 & s2_i^2 \\ s1_i^0 & s1_i^1 & s1_i^2 \\ s0_i^0 & s0_i^1 & s0_i^2 \end{pmatrix}$$

Metropolis Dynamics - Avoiding 'if' statements

Normalized transition probabilities & random number comparison:
$R_{max} \exp(-\beta \Delta E)$ = EXP12, EXP8, EXP4

cond12 = -(R < EXP12);
cond8 = -(R < EXP8);
cond4 = -(R < EXP4);
mask = cond12 | (~sum2) | ((sum2 & (sum2^sum1)) & (cond8 | (cond4 & (~sum0))));

Spin flip:
spin = mask ^ spin;
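The mask expression on this slide can be wrapped in a small C function and compared against a version with explicit branches. This is a sketch under stated assumptions: 32-bit words with one sample per bit, a single random integer shared by all samples of a word, and integer thresholds computed beforehand as Rmax · exp(−β∆E). The names flip_mask and flips_reference are hypothetical, and the reference uses the sum-to-∆E table from the preceding slide.

```c
#include <assert.h>
#include <stdint.h>

/* Branch-free Metropolis mask for 32 samples stored one per bit.
 * sum2, sum1, sum0 are the bit-planes of the per-sample sum (0..6);
 * r is one random integer shared by all samples in the word;
 * exp12 <= exp8 <= exp4 are precomputed thresholds Rmax * exp(-beta * dE). */
static uint32_t flip_mask(uint32_t sum2, uint32_t sum1, uint32_t sum0,
                          uint32_t r, uint32_t exp12, uint32_t exp8, uint32_t exp4)
{
    assert(exp12 <= exp8 && exp8 <= exp4);
    /* (r < x) is 0 or 1; negating it as unsigned gives 0 or all ones. */
    uint32_t cond12 = -(uint32_t)(r < exp12);
    uint32_t cond8  = -(uint32_t)(r < exp8);
    uint32_t cond4  = -(uint32_t)(r < exp4);
    return cond12 | (~sum2)
         | ((sum2 & (sum2 ^ sum1)) & (cond8 | (cond4 & (~sum0))));
}

/* Reference for a single sample, written with branches (for checking only). */
static int flips_reference(unsigned sum, uint32_t r,
                           uint32_t exp12, uint32_t exp8, uint32_t exp4)
{
    int dE = 4 * (int)sum - 12;   /* slide table: sum 0..6 -> dE -12..12 */
    if (dE <= 0) return 1;
    if (dE == 4) return r < exp4;
    if (dE == 8) return r < exp8;
    return r < exp12;
}
```

The spins are then updated with spin = mask ^ spin, so every sample whose mask bit is 1 is flipped and the others are left unchanged.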

SLIDE 88

Tiny differences - Calligraphy

Sliced Addressing

x = i%d_L;
y = (i/d_L)%d_L;
z = i/d_A;
smz = i + (SM(z - 1, d_hL) - z)*d_A;
spy = smz + (SP(y + 1, d_L) - y)*d_L;
smy = i + (SM(y - 1, d_L) - y)*d_L;
smx = spy - x + SM(x - 1, d_L);
spx = smy - x + SP(x + 1, d_L);

Standard Addressing

x = i%d_hL;
y = (i/d_hL)%d_L;
z = i/d_hA;
par = (y^z)&1;
spx = i - x + SP(x + 1 - (par^1), d_hL);
smx = i - x + SM(x - 1 + par, d_hL);
spy = i + (SP(y + 1, d_L) - y)*d_hL;
smy = i + (SM(y - 1, d_L) - y)*d_hL;
spz = i + (SP(z + 1, d_L) - z)*d_hA;
smz = i + (SM(z - 1, d_L) - z)*d_hA;

For the sliced scheme spz = i, and parity is not involved.
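The Standard Addressing lines can be turned into a self-contained C function for experimentation. Several pieces below are assumptions, since the slide does not define them: SP and SM are taken to be periodic wrap helpers for a +1 and a −1 step, d_hL is read as half the lattice side, d_hA as the number of sites in one z-plane of the half lattice, and the Neighbours struct and function name are hypothetical. The index arithmetic itself is transcribed from the slide.

```c
#include <assert.h>

/* Assumed periodic wrap helpers; the slide uses SP and SM without defining them. */
#define SP(a, n) (((a) == (n)) ? 0 : (a))        /* wrap after a +1 step */
#define SM(a, n) (((a) < 0) ? ((n) - 1) : (a))   /* wrap after a -1 step */

typedef struct { int spx, smx, spy, smy, spz, smz; } Neighbours;  /* hypothetical */

/* Neighbour indices of site i in a checkerboard half lattice of hL x L x L sites
 * (d_L -> L, d_hL -> hL, d_hA -> hA in the slide's notation). */
static Neighbours standard_neighbours(int i, int L)
{
    assert(L > 0 && L % 2 == 0);
    const int hL = L / 2;      /* half side along x (assumption) */
    const int hA = hL * L;     /* sites per z-plane of the half lattice (assumption) */
    assert(i >= 0 && i < hA * L);
    const int x = i % hL, y = (i / hL) % L, z = i / hA;
    const int par = (y ^ z) & 1;   /* parity selects which x neighbour shares index x */
    Neighbours nb;
    nb.spx = i - x + SP(x + 1 - (par ^ 1), hL);
    nb.smx = i - x + SM(x - 1 + par, hL);
    nb.spy = i + (SP(y + 1, L) - y) * hL;
    nb.smy = i + (SM(y - 1, L) - y) * hL;
    nb.spz = i + (SP(z + 1, L) - z) * hA;
    nb.smz = i + (SM(z - 1, L) - z) * hA;
    return nb;
}
```

With L = 4 the half lattice has 32 sites; site 0 then has y neighbours 2 and 6 and z neighbours 8 and 24, which shows the periodic wrap in both directions.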