SLIDE 1

Efficient Correlation-Free Many-States Lattice Monte Carlo on GPUs

Jeffrey Kelling, Géza Ódor, Martin Weigel, Sibylle Gemming 8th May 2017

Jeffrey Kelling, Géza Ódor, Martin Weigel, Sibylle Gemming | FWIO | http://www.hzdr.de

Member of the Helmholtz Association

SLIDE 2

1 Introduction: What is this talk about?

  • surface growth, physical aging (and non-equilibrium systems)
  • lattice Monte Carlo

[Figure: lattice sketch with labels x, y, p, q]

2 Trivial parallelism vs. SIMT

SLIDE 3

Applications for Monte Carlo: Stochastic Processes

Image sources:

  • http://hubblesite.org/newscenter/archive/releases/2007/17/image/a
  • Müller, T., Heinig, K.-H. et al. Appl. Phys. Lett. 85 2373 (2004)
  • http://en.wikipedia.org/wiki/File:Rub_al_Khali_002.JPG
  • https://www.hzdr.de/db/Cms?pOid=24344&pNid=2707

Further fields:

  • game theory, e.g.: Perc, Matjaž Eur. J. Phys. 38(4) 045801 (2017)
  • sociology
  • finance
  • ...

SLIDE 4

Non-Equilibrium vs Equilibrium

Out-of-Equilibrium:

  • kinetics of interest
  • example: 8-states Potts model, J/(k_B T) = 5
  • optimal algorithm reproduces physical evolution

Equilibrium Properties:

  • only final state relevant
  • example: 8-states Potts model, disordered state or ordered state?
  • optimal algorithm reaches equilibrium quickly

SLIDE 5

Non-Equilibrium Systems

W^2(L, t) = \frac{1}{L^2} \sum_{i}^{L^2} h_i^2(t) - \left( \frac{1}{L^2} \sum_{i}^{L^2} h_i(t) \right)^2

  • L = lateral system size
  • h_i = surface height at site i

[Figure: interface roughness W^2 versus time t in Monte Carlo steps (MCS), log-log axes; surface snapshots at 0.1, 0.6, 20.5 and 150.05 MMCS]
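A minimal sketch of how the interface roughness W^2 could be evaluated for one surface, assuming the heights h_i are stored as a flat list of L^2 numbers. The function name and the example surfaces are made up for illustration and are not taken from the authors' code.

```python
def roughness_squared(heights):
    """Return W^2 = <h^2> - <h>^2 for a flat list of surface heights h_i."""
    n = len(heights)
    mean_h = sum(heights) / n
    mean_h2 = sum(h * h for h in heights) / n
    return mean_h2 - mean_h * mean_h


# Hypothetical 2 x 2 surfaces (L = 2, so L^2 = 4 sites), flattened row by row.
flat_surface = [3, 3, 3, 3]
rough_surface = [0, 2, 0, 2]
w2_flat = roughness_squared(flat_surface)    # perfectly flat: no roughness
w2_rough = roughness_squared(rough_surface)  # heights alternate around mean 1
```

A flat surface gives W^2 = 0; any height fluctuation around the mean makes it positive.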

SLIDE 6

Domain Decomposition

Random Sequential (RS)

  • on GPU: domain decomposition
  + uncorrelated updates
  − < 48 kB per domain in smem (shared memory)

[Figure: lattice tiled into domains, each split into sub-domains labeled 1 to 4]

Stochastic Cellular Automaton (SCA)

  • update odd/even sublattice
  • update probability p < 1
  + linear memory access ⇒ fast
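A rough sketch of one SCA half-sweep, assuming a square lattice stored row by row: every site of one checkerboard sublattice is visited in memory order and attempted with probability p. The update rule `increment` is a placeholder, not a physical model, and all names are hypothetical.

```python
import random


def sca_sweep(state, parity, p, rng, try_update):
    """Attempt updates on all sites of one checkerboard sublattice.

    state: L x L list of lists; parity: 0 (even) or 1 (odd) sublattice;
    p: probability that a visited site is actually attempted.
    """
    size = len(state)
    attempted = 0
    for y in range(size):            # row by row, i.e. linear memory order
        for x in range(size):
            if (x + y) % 2 != parity:
                continue             # site belongs to the other sublattice
            if rng.random() < p:
                try_update(state, x, y)
                attempted += 1
    return attempted


def increment(state, x, y):
    """Placeholder update rule (assumption): count visits per site."""
    state[y][x] += 1


rng = random.Random(1)
lattice = [[0] * 4 for _ in range(4)]
n_even = sca_sweep(lattice, 0, 0.5, rng, increment)
n_odd = sca_sweep(lattice, 1, 0.5, rng, increment)
```

Sites of the same sublattice are never neighbors of each other on the square lattice, which is what allows them to be updated concurrently.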

SLIDE 7

Parallel random sequential updates are hard. Why should we care for them?

SLIDE 8

Auto-Correlation of a Lattice Gas

[Figure: rescaled auto-correlation C(t, s) · s^0.76 versus t/s, log-log axes; curve: Random Sequential]

C(t, s) = \langle \phi(t) \phi(s) \rangle - \langle \phi(t) \rangle \langle \phi(s) \rangle

t, s: time, waiting time
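A small sketch of an estimator for C(t, s), assuming φ is a per-site variable and the averages run over sites (and samples). The input lists and the function name are hypothetical.

```python
def autocorrelation(phi_t, phi_s):
    """Estimate C(t, s) = <phi(t) phi(s)> - <phi(t)> <phi(s)>.

    phi_t[k] and phi_s[k] must belong to the same site of the same sample.
    """
    n = len(phi_t)
    mean_ts = sum(a * b for a, b in zip(phi_t, phi_s)) / n
    mean_t = sum(phi_t) / n
    mean_s = sum(phi_s) / n
    return mean_ts - mean_t * mean_s


# Hypothetical occupation numbers (0 or 1) of four lattice-gas sites at time s.
phi_at_s = [1, 0, 1, 0]
unchanged = autocorrelation(phi_at_s, phi_at_s)       # state fully remembered
inverted = autocorrelation([0, 1, 0, 1], phi_at_s)    # state anti-correlated
```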

SLIDE 9

Auto-Correlation of a Lattice Gas

[Figure: rescaled auto-correlation C(t, s) · s^0.76 versus t/s, log-log axes; curves: SCA − limit (correction), Checkerboard, SCA, Random Sequential]

SLIDE 10

KPZ–Equation for Surface Growth

[Figure: interface roughness W^2 versus t in Monte Carlo steps (MCS), log-log axes, slope 2β_eff; inset: lattice sketch with labels x, y, p, q]

2 + 1D octahedron model

Ódor, G., Liedke, B., Heinig, K.-H. Phys. Rev. E 79 021125 (2009)

Kardar–Parisi–Zhang stochastic differential equation:

\partial_t h(x, t) = v + \sigma_2 \nabla^2 h(x, t) + \lambda [\nabla h(x, t)]^2 + \eta(x, t)

  • v: mean growth vel.
  • \sigma_2 \nabla^2 h(x, t): surface tension
  • \lambda [\nabla h(x, t)]^2: local growth vel.
  • \eta(x, t): noise

Kardar, M., Parisi, G., Zhang, Y.-C. Phys. Rev. Lett. 56 889 (1986)

SLIDE 11

β and the Kim–Kosterlitz Hypothesis

β = 1/4?

Kim, J. M., Kosterlitz, J. M. Phys. Rev. Lett. 62 2289 (1989)

  • octahedron model: ∆h = ±1, β < 1/4
    [Figure: β_eff versus 1/t^{1/2}; curves labeled 12, 13, 16, 17]
    Kelling, J., Ódor, G. Phys. Rev. E 84 061150 (2011)
  • restricted solid-on-solid model: ∆h ≤ N, β ≈ 1/4 for N > 1?
    Kim, J. M. J. Korean Phys. Soc. 67(9) 1529 (2015)

We need more states.
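For orientation, a sketch of how an effective exponent β_eff can be read off roughness data, assuming the growth regime W^2 ∝ t^{2β} and taking the local slope between two times. The estimator in the cited papers may differ in detail; the data below are synthetic, not simulation results.

```python
import math


def beta_eff(t1, w2_1, t2, w2_2):
    """Local slope of log W versus log t, given W^2 at two times."""
    return 0.5 * math.log(w2_2 / w2_1) / math.log(t2 / t1)


# Synthetic data following W^2 = t^(2 * 0.25) exactly (not measured values).
times = [100.0, 200.0, 400.0]
w2 = [t ** 0.5 for t in times]
slopes = [beta_eff(times[i], w2[i], times[i + 1], w2[i + 1]) for i in range(2)]
```

For a pure power law the local slope is constant; corrections to scaling show up as a drift of β_eff with t.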

SLIDE 12

Part 2: Trivial parallelism vs. SIMT

Handling more states.

SLIDE 13

Trivial parallelism vs. SIMT

efficient simulation of independent copies . . .

  • vector of 32, . . . , 128, 256, . . . layers, depending on application

⇒ “random” accesses to vectors in global memory
⇒ no caching of simulation state required
⇒ very efficient use of GPUs
⇒ (vector processors/data parallelism)

Ito, N., Kanada, Y. Supercomputer 3(25) 1988
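A sketch of the memory layout this implies, assuming the layers of one lattice site are stored contiguously (index = site · n_layers + layer). A randomly chosen site then maps to one contiguous vector, which on a GPU would be read by neighboring threads. Names and the acceptance rule are hypothetical.

```python
def layer_vector(state, site, n_layers):
    """Return the contiguous slice holding all layers (copies) of one site."""
    start = site * n_layers
    return state[start:start + n_layers]


def update_site_in_all_layers(state, site, n_layers, rule):
    """Apply `rule` to the same site in every independent copy."""
    start = site * n_layers
    for layer in range(n_layers):    # on a GPU: one thread per layer
        state[start + layer] = rule(state[start + layer], layer)


n_sites, n_layers = 8, 32
state = [0] * (n_sites * n_layers)
# Hypothetical rule: only even-numbered layers accept the update.
update_site_in_all_layers(state, 5, n_layers, lambda v, k: v + (k % 2 == 0))
```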

SLIDE 14

Trivial parallelism vs. SIMT

efficient simulation of independent copies . . .

Trivially parallel → Multi-Surface

→ large samples ⇒ good statistics
→ large parameter studies
→ large sets of initial conditions
+ random site-selection

Ito, N., Kanada, Y. Supercomputer 3(25) 1988

SLIDE 15

Multi-Surface Approach for GPUs

  • double-tiling at device layer ... with random origin
  • Multi-Surface at block layer

[Figure: double-tiled lattice with sub-domains labeled 1 to 4, mapped onto the GPU hierarchy: global memory; multi-processors 1 ... N, each with up to 48 kB of shared memory and threads 1 ... M; sync markers between update steps]

SLIDE 16

Decorrelating Samples

  • random site-selection is about introducing uncorrelated noise
  • we want to average over independent samples

domain growth, phase ordering: structure evolution

  • random initial conditions
  • independent random update acceptance (Boltzmann factors exp(−∆E/(k_B T)))
  • (quenched disorder)
  ⇒ no problem

surface growth

  • flat initial conditions ⇒ all simulations with identical site-selection would be identical
  • randomly discard every 2nd update
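A toy illustration of the surface-growth case, assuming "randomly discard every 2nd update" means that each sample skips each update independently with probability 1/2. The growth rule (plain deposition) and all names are placeholders, not the model used in the talk.

```python
import random


def grow(n_samples, n_sites, n_updates, discard, seed=0):
    """Grow several surfaces from flat initial conditions.

    All samples share one site sequence; with `discard` set, each sample
    independently skips every update with probability 1/2.
    """
    site_rng = random.Random(seed)
    accept_rngs = [random.Random(seed + 1 + k) for k in range(n_samples)]
    heights = [[0] * n_sites for _ in range(n_samples)]
    for _ in range(n_updates):
        site = site_rng.randrange(n_sites)       # shared by all samples
        for k in range(n_samples):
            if discard and accept_rngs[k].random() < 0.5:
                continue                         # this sample skips the update
            heights[k][site] += 1                # toy rule: plain deposition
    return heights


identical = grow(4, 16, 1000, discard=False)
decorrelated = grow(4, 16, 1000, discard=True)
```

Without discarding, all four surfaces stay bit-identical; with it, they drift apart even though the site sequence is shared.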

SLIDE 17

Not Decorrelating Samples

Cases where identical noise across samples is desirable:

  • sampling initial conditions
  • calculating response functions
  • parallel annealing

SLIDE 18

RSOS Multi-Surface

  • 8 bits per lattice-site are enough ⇒ process 4 packed samples per thread
  • 4 bits per height-difference

word 0 ≡ thread 0: sample 0 (x, y) | sample 1 (x, y) | sample 2 (x, y) | sample 3 (x, y)
word 1 ≡ thread 1: sample 4 (x, y) | sample 5 (x, y) | sample 6 (x, y) | sample 7 (x, y)
. . .

  • randomly select 2 out of 4 samples for each thread ⇒ no idle threads
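A sketch of the packing, assuming a 32-bit word per thread that holds four samples of 8 bits each, with two 4-bit height differences (x and y direction) per sample stored as unsigned fields. The field order inside the word and the helper names are assumptions, not the authors' layout.

```python
SAMPLES_PER_WORD = 4      # 4 samples x 8 bits = one 32-bit word per thread
BITS_PER_SITE = 8
BITS_PER_DIFF = 4         # two height differences (x and y) per site


def get_diff(word, sample, direction):
    """Read the 4-bit field of `sample` (0..3); direction 0 = x, 1 = y."""
    shift = sample * BITS_PER_SITE + direction * BITS_PER_DIFF
    return (word >> shift) & 0xF


def set_diff(word, sample, direction, value):
    """Return `word` with one 4-bit field replaced by `value` (0..15)."""
    shift = sample * BITS_PER_SITE + direction * BITS_PER_DIFF
    cleared = word & ~(0xF << shift) & 0xFFFFFFFF
    return cleared | ((value & 0xF) << shift)


word = 0
word = set_diff(word, 0, 0, 3)    # sample 0, x-difference
word = set_diff(word, 2, 1, 9)    # sample 2, y-difference
word = set_diff(word, 3, 1, 15)   # highest field of the word
```

Updating one sample touches only its own 8 bits, so a thread can select any 2 of its 4 samples without disturbing the others.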

SLIDE 19

Collective Generation of Random Coordinates

  • all threads access the same coordinate for each update ⇒ pre-compute list of update coordinates in shared memory
  • each thread computes one component:
      1 generate random number
      2 apply transformations (origin shift, periodic boundary conditions)
  • collectively refill list when used up
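The steps above can be sketched serially as follows, assuming a 2D lattice and a fixed-size coordinate buffer. On the GPU each component would be produced by its own thread and the buffer would live in shared memory; the class and attribute names are hypothetical.

```python
import random


class CoordinateList:
    """Buffer of pre-computed update coordinates shared by all samples."""

    def __init__(self, size_x, size_y, capacity, seed=0):
        self.size = (size_x, size_y)
        self.capacity = capacity
        self.rng = random.Random(seed)
        self.origin = (0, 0)
        self.buffer = []
        self.refills = 0

    def refill(self):
        """Regenerate the whole list (on a GPU: one component per thread)."""
        self.buffer = []
        for _ in range(self.capacity):
            coord = []
            for axis in range(2):
                r = self.rng.randrange(self.size[axis])   # 1: random number
                shifted = r + self.origin[axis]           # 2: origin shift
                coord.append(shifted % self.size[axis])   #    periodic wrap
            self.buffer.append(tuple(coord))
        self.refills += 1

    def next(self):
        """Return the next coordinate, refilling when the list is used up."""
        if not self.buffer:
            self.refill()
        return self.buffer.pop()


coords = CoordinateList(size_x=8, size_y=4, capacity=16)
coords.origin = (5, 3)
drawn = [coords.next() for _ in range(40)]
```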

SLIDE 20

Performance

[Figure: bar chart of update attempts/ns (axis ticks at 100 and 200) for Octahedron RS; Octahedron SCA p = 0.95; Octahedron SCA p = 0.5; RSOS RS; Potts RS; Potts RS Kawasaki. Bar labels: 9, 11, 50, 229, 7, 4.5]

  • bit-coded: number of states 4, large systems
  • multi-surface: any number of states, large samples

SLIDE 21

Memory Limits: RSOS

single-GPU implementations

  • 64 threads per block ⇒ 256 samples
  ⇒ 256 B / MS lattice site
  ⇒ 2^12 × 2^12 sites need 4 GB of gmem
  + random number generator states
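The arithmetic behind these numbers, written out as a short check. It assumes 4 packed samples per thread and 8 bits per sample and lattice site, as stated on the RSOS Multi-Surface slide, and counts 1 GB as 2^30 bytes; random number generator states are not included.

```python
threads_per_block = 64
samples_per_thread = 4                      # packed samples per 32-bit word
bits_per_sample_site = 8

samples = threads_per_block * samples_per_thread
bytes_per_site = samples * bits_per_sample_site // 8   # per multi-surface site
sites = 2**12 * 2**12
total_bytes = sites * bytes_per_site
total_gib = total_bytes / 2**30
```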

SLIDE 22

Memory Limits: Beyond

consider: 2^16 × 2^16 lattice sites, 2 bits per site ⇒ 1 GB per sample

  • efficient code would run > 1024 samples
  • akin to SCA: Kelling, J., Ódor, G., Gemming, S. INES ’16, IEEE (2016)
  • our work actually needs this many samples
  • spreading the lattice across multiple GPUs is more efficient than trivial multi-GPU use

[Figure: β_eff versus 1/t^{1/2}; curves labeled 12, 13, 16, 17]

Kelling, J., Ódor, G. Phys. Rev. E 84 061150 (2011)

SLIDE 23

How did the β-thing turn out?

β = 0.241(1) for all N

Kelling, J., Ódor, G., Gemming, S. Phys. Rev. E 94 022107 (2016)

SLIDE 24

Conclusions and Outlook

  • Think about vectorizing your trivial parallelism.
  • We are developing a framework for N-dimensional applications in C++11.
    @NVIDIA: we would really appreciate some C++14 in CUDA
  • multi-GPU in the making ...
  • code will be made available after restructuring

SLIDE 25

Acknowledgements

Artur Erbe, Jörg Schuster, Peter Zahn, Henrik Schulz, Nils Schmeißer, Michael Bussmann, Guido Juckeland, my other colleagues

computing time at ZIH Dresden, NIIF Hungary, HZDR Computing Center

J.Kelling@HZDR.de

Thank You.

This work has received funding from the Erasmus+ program via the Leonardo-Büro Sachsen and Coventry University.
