
SLIDE 1

HPC for numerical probability

High performance computing for numerical probability

Jérôme Lelong

Université Grenoble Alpes – Ensimag / LJK

Thursday 15 October 2015

J. Lelong (Univ. Grenoble Alpes) — 15/10/2015 — 1 / 31

SLIDE 2

1. Introduction to parallel computing
   ◮ Why we need parallel computing
   ◮ Parallel architectures
2. Generating random numbers in parallel
3. How HPC can help numerical probability
   ◮ Monte Carlo
   ◮ PDE methods
   ◮ Tree methods
   ◮ Non linear problems


SLIDE 4

Why we need HPC (1)

◮ We want to handle larger problems (more data or more variables).
◮ We want programs to run faster.

Sequential programming has reached its limits:

◮ Moore's law has hit physical limits (broken since 2004): it is no longer possible to double the transistor density every 18 months (Gordon Moore, 1965).
◮ Increasing the clock frequency increases both energy consumption and heat dissipation.
◮ As a result, frequencies stay almost constant while the number of cores increases.

SLIDE 5

Why we need HPC (2)

◮ Memory is a real bottleneck.
◮ Since the 70s, CPU frequency has increased much faster than memory frequency: processors keep waiting for data.
◮ Example: computing a[i] = b[i] + c[i] on an Intel Core i7-6700HQ.
  ◮ Memory bandwidth: 34 GB/s over 2 channels.
  ◮ Maximum turbo frequency: 3.5 GHz.
  ◮ 4 adds per cycle.
  ◮ Transferring the data takes 3 × 8 / (34e9 / 2) / (1 / (3.5e9 × 4)) ≈ 19.7 times longer than computing the sum (amount of data divided by per-channel bandwidth, over the time of one add).
◮ Parallel computing gives access to more memory and more memory channels.
  ◮ When all the cores of a processor are computing, the clock speed of each core decreases a bit, but all the memory channels can be used: memory becomes less of a bottleneck.
◮ Memory bandwidth increases very slowly.
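The back-of-the-envelope ratio above can be checked numerically. The figures (34 GB/s over 2 channels, 3.5 GHz, 4 adds per cycle) are the ones quoted on the slide; each element of a[i] = b[i] + c[i] moves three 8-byte doubles:

```python
# Ratio of data-transfer time to compute time for a[i] = b[i] + c[i],
# using the Intel Core i7-6700HQ figures quoted on the slide.
bytes_per_elem = 3 * 8                       # a[i], b[i], c[i]: three 8-byte doubles
bandwidth = 34e9 / 2                         # 34 GB/s shared over 2 channels
transfer_time = bytes_per_elem / bandwidth   # seconds to move the data

freq = 3.5e9                                 # maximum turbo frequency (Hz)
adds_per_cycle = 4
add_time = 1 / (freq * adds_per_cycle)       # seconds for one addition

ratio = transfer_time / add_time
print(ratio)                                 # ≈ 19.76 (the slide rounds to 19.7)
```
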

SLIDE 6

Main architectures (1)

◮ Multi-core units: shared memory. All the cores share the same global memory.
  ◮ Scaling is often hard, mainly because of cache synchronization.
  ◮ Programming is pretty easy at first sight (OpenMP).
  ◮ No need for message passing, but concurrent accesses must be handled.
◮ Clusters: distributed memory. An aggregation of processing units connected through a high-speed network. Each unit has its own local memory; there is no global memory access.
  ◮ Scaling is better.
  ◮ A specific communication protocol is needed to exchange data between the processing units.
  ◮ Programs must handle data passing explicitly.
  ◮ Optimizing the communication/computation ratio requires careful design.

SLIDE 7

Main architectures (2)

◮ Clusters of multi-cores: two levels of parallelism. Distributed memory between the nodes (first level) and shared memory inside each node (second level).
  ◮ Programming is delicate: two different parallel paradigms coexist.
  ◮ Optimizing is complex.
  ◮ But this setup can achieve better performance.

SLIDE 8

Main architectures (3)

[slide contains only a figure]

SLIDE 9

Main architectures (4)

◮ Grids: heterogeneous processing units linked through a low-speed network.
  ◮ Slow network.
  ◮ Only suitable for applications with very little communication.
  ◮ Startup latencies.
◮ Accelerators:
  ◮ GPU: requires a dedicated programming language.
  ◮ Intel MIC (Xeon Phi): 60 cores with 8 GB of RAM; pragma-based programming.

SLIDE 10

Accelerators

Excerpt from the Top 500 highlights (June 2015):

◮ The No. 1 and No. 7 systems use Intel Xeon Phi processors to speed up their computational rate. The No. 2 and No. 6 systems use NVIDIA GPUs to accelerate computation.
◮ A total of 90 systems on the list use accelerator/co-processor technology, up from 75 in November 2014. Of these, 52 use NVIDIA chips, 4 use ATI Radeon, and there are now 35 systems with Intel MIC technology (Xeon Phi); 4 systems use a combination of NVIDIA and Intel Xeon Phi accelerators/co-processors.
◮ 97% of the systems use processors with six or more cores and 87.8% use eight or more cores.


SLIDE 12

Shared memory (1)

◮ Global memory space for all processing units, with local caches.
◮ All units see the same memory; memory changes are viewed "instantaneously". But maintaining memory coherence costs system time and may cause concurrent accesses.
◮ Two kinds of shared memory:
  ◮ SMP (Symmetric Multi-Processing): all units are plugged on a unique memory bus.

[diagram: CPU CPU CPU CPU all attached to one MEMORY]

SLIDE 13

Shared memory (2)

◮ NUMA (Non-Uniform Memory Access): the memory topology ensures faster access to closer memory.

[diagram: four nodes, each with CPU CPU CPU CPU and its own MEMORY, linked together]

SLIDE 14

Distributed memory

◮ Each processor has its own memory; there is no global memory space.
◮ Accessing another processor's memory requires message passing through an interconnect network. No concurrent-access problems, but communication is explicit.

[diagram: CPU + MEMORY pairs connected by an INTERCONNECT NETWORK]


SLIDE 16

Random number generators

◮ A random number generator is a deterministic recurrent sequence with statistical properties close to those of an i.i.d. sample from the uniform distribution.
◮ It is fully determined by its initial state and its transition function, e.g.

    x_{n+1} = (a_0 x_n + a_1 x_{n-1} + ... + a_k x_{n-k} + b) mod m.

◮ What do we expect from parallel random number generators?
  ◮ Reproducibility: the ability to rerun a scenario (at least with the same number of processing units).
  ◮ Independence of the numbers generated on different processing units and within one unit.
  ◮ No communication between the generators (at least after startup).
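As a minimal illustration of such a recurrence (the case k = 0, i.e. a plain linear congruential generator; the MINSTD constants below are a classical choice, not taken from the slides):

```python
# Minimal linear congruential generator: x_{n+1} = (A*x_n + B) mod M.
# MINSTD (Park-Miller) parameters, used here purely for illustration.
A, B, M = 16807, 0, 2**31 - 1

def lcg(seed, n):
    """Return n uniform-looking numbers in [0, 1) from the recurrence."""
    x = seed
    out = []
    for _ in range(n):
        x = (A * x + B) % M   # transition function
        out.append(x / M)     # rescale the state to [0, 1)
    return out

print(lcg(12345, 3))
```

The whole sequence is determined by the seed, which is exactly what makes reproducibility possible and what makes parallel use delicate.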

SLIDE 17

Parallel random number generators

Two possible approaches:

◮ Splitting: split a single generator into several independent streams, each of them keeping the same properties as the initial generator but with a shorter period.
◮ Parametrization: each stream uses different parameters for its transition function (usually the multipliers of the recurrence) in order to achieve both independence and a maximum period.

SLIDE 18

Splitting (1)

◮ Blocking: assign each processing unit a contiguous block of the sequential generator.
  ◮ It requires being able to jump ahead at startup (usually only possible for a jump of size 2^k, see [L'Ecuyer and Côté, 91]).
  ◮ Shorter period.
  ◮ The way the sequential samples are spread over the processing units may depend on the number of units.
  ◮ Long-range correlation may occur at short scale.

[diagram: the sequence cut into five consecutive blocks of length M, one per unit (unit 1 ... unit 5)]
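For an LCG the jump-ahead needed by blocking has a closed form: iterating x → (A x + B) mod M for k steps gives x → (A^k x + B (A^k − 1)/(A − 1)) mod M, computable in O(log k). A sketch with illustrative parameters (not from the slides):

```python
# Blocking for an LCG x_{n+1} = (A*x_n + B) mod M: unit j starts its stream
# at position j*block_len of the sequential generator, using the closed-form
# jump x_{n+k} = A^k x_n + B*(A^k - 1)/(A - 1)  (mod M).
A, B, M = 1103515245, 12345, 2**31   # illustrative glibc-style parameters

def step(x):
    return (A * x + B) % M

def jump(x, k):
    """Advance the state by k steps without generating the intermediate ones."""
    ak = pow(A, k, M)                          # A^k mod M by fast exponentiation
    # (A^k - 1)/(A - 1) is the exact integer 1 + A + ... + A^{k-1}; Python's
    # arbitrary-precision integers let us compute it directly, then reduce mod M.
    geo = (pow(A, k) - 1) // (A - 1) % M
    return (ak * x + B * geo) % M

x0 = 42
block_len = 10
starts = [jump(x0, j * block_len) for j in range(4)]   # unit j's first state

# Check: jumping matches stepping one by one.
x = x0
for _ in range(block_len):
    x = step(x)
assert x == starts[1]
```

For very large k one would also compute the geometric sum by fast doubling instead of materializing A^k, but the structure of the jump is the same.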

SLIDE 19

Splitting (2)

◮ Leap-frog method: assign each processing unit an equally spaced subsequence of the original sequence.
  ◮ It requires making a jump of size p at each call as quickly as getting the next value in the sequence. This is usually possible for linear congruential generators.
  ◮ The numbers used by each processing unit change with the number of units.
  ◮ The elements of a sub-stream are equally spaced in the original sequence. Such sub-streams may show high correlations, see [De Matteis and Pagnutti, 84].

[diagram: x1 x2 x3 ... x26, with every p-th term assigned to the same unit]
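Leap-frogging an LCG amounts to running, on each unit, another LCG whose multiplier is A^p mod M (with the increment adjusted accordingly), so each call really does jump p steps at the cost of one step. The parameters below are illustrative, not from the slides:

```python
# Leap frog over P units for x_{n+1} = (A*x_n + B) mod M: unit j consumes
# x_{j+1}, x_{j+1+P}, x_{j+1+2P}, ...  The sub-stream is itself an LCG with
# multiplier A^P mod M and increment B*(A^P - 1)/(A - 1) mod M.
A, B, M = 1103515245, 12345, 2**31   # illustrative parameters
P = 4                                # number of processing units

AP = pow(A, P, M)
BP = (pow(A, P) - 1) // (A - 1) * B % M

def substream(seed, unit, n):
    """First n numbers seen by `unit` (0-based) out of P leap-frogged units."""
    x = seed
    for _ in range(unit + 1):        # advance to this unit's first draw
        x = (A * x + B) % M
    out = [x]
    for _ in range(n - 1):
        x = (AP * x + BP) % M        # jump P steps with a single multiply-add
        out.append(x)
    return out

# The interleaving of all units reproduces the sequential stream.
seq, x = [], 7
for _ in range(12):
    x = (A * x + B) % M
    seq.append(x)
assert substream(7, 0, 3) == seq[0::P]
assert substream(7, 1, 3) == seq[1::P]
```
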

SLIDE 20

Splitting (3)

◮ Splitting the Mersenne Twister generator, see [Haramoto et al., 08] and [Haramoto, 08].
◮ Nice, because of the huge period N = 2^19937, with negligible long-range correlations. With 10^5 processors, the period is still 2^19920 ≈ 10^5996.
◮ Hard to implement, because of the computation of the n-th power of the multiplier.

SLIDE 21

Parametrization

◮ People often use different seeds to obtain their streams. There is no proof of independence, see [Knuth, 81].
◮ Use different multipliers for each stream.
  ◮ SPRNG (see [Mascagni and Srinivasan, 00] and the references therein): a library of parallel random number generators based on parametrization.
  ◮ See [Matsumoto and Nishimura, 00] for the parametrization of Mersenne Twister generators using the multipliers.
  ◮ How to choose the multipliers so that each stream has good properties: usually a random search.
  ◮ Initialization can take a while.
  ◮ Part of the bits of the multiplier is used to encode a unique id.


SLIDE 23

Monte Carlo: an embarrassingly parallel approach

◮ Compute

    (1/M) Σ_{i=1}^{M} f(X_i),   X_i ∼ X i.i.d.

◮ M has to be quite large to ensure good convergence (CLT).
◮ All iterations are independent: no communication is required except for the final reduction.
◮ If the evaluation of f is complex enough, the efficiency of the parallelization is close to 1.
◮ Only requirement: a good parallel random number generator.
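A minimal parallel Monte Carlo sketch. It uses NumPy's SeedSequence.spawn to derive statistically independent per-worker streams (a modern, library-provided answer to the stream-splitting problem above, not the specific generators discussed on the previous slides), plus a final reduction. The target f(x) = x^2 with X standard normal, so the expectation is 1:

```python
import numpy as np
from multiprocessing import Pool

def partial_mean(args):
    """One worker's contribution: the mean of f(X_i) over its own chunk."""
    seed_seq, n = args
    rng = np.random.default_rng(seed_seq)   # independent stream per worker
    x = rng.standard_normal(n)
    return np.mean(x * x)                   # f(x) = x^2, so E[f(X)] = 1

def monte_carlo(n_total, n_workers, root_seed=0):
    chunks = [n_total // n_workers] * n_workers
    seeds = np.random.SeedSequence(root_seed).spawn(n_workers)  # independent streams
    with Pool(n_workers) as pool:
        partials = pool.map(partial_mean, list(zip(seeds, chunks)))
    # Final reduction: the only communication in the whole computation.
    return sum(p * n for p, n in zip(partials, chunks)) / sum(chunks)

if __name__ == "__main__":
    print(monte_carlo(400_000, 4))   # close to 1, the second moment of N(0, 1)
```

Fixing the root seed and the number of workers makes the run reproducible, which matches the reproducibility requirement stated earlier.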

SLIDE 24

PDE methods: a deterministic approach

◮ Discretize the linear parabolic PDE

    ∂_t u + Lu = f   on [0, T] × Ω

  with initial conditions.
◮ Two levels:
  ◮ Use a parallel linear algebra library to solve the (usually sparse) linear systems coming from the discretization. OK for a small number of shared memory units.
  ◮ Use a clever domain decomposition to solve smaller problems, but then smoothness must be ensured when crossing inner borders.
◮ In any case, ask people from numerical analysis, who have been using HPC for a long time.

SLIDE 25

Tree methods (1)

    X_N = Z_N
    X_i = op(Z_i, E[X_{i+1} | F_i]),   0 ≤ i ≤ N − 1

where (Z_n)_n is a discrete recombining tree.

[diagram: recombining binomial tree from time 0 to time 7]
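The backward induction above, sketched on a recombining binomial tree. The payoff, the probability and the choice of `op` (here the plain conditional expectation) are illustrative, not from the slides:

```python
# Backward induction X_i = op(Z_i, E[X_{i+1} | F_i]) on a recombining
# binomial tree: node (i, j) has children (i+1, j) and (i+1, j+1).
def tree_value(z, p, op):
    """z[i][j]: payoff Z at node (i, j); p: up probability; op(z_ij, cond_exp)."""
    n = len(z) - 1
    x = list(z[n])                   # X_N = Z_N at the terminal layer
    for i in range(n - 1, -1, -1):   # walk backwards in time
        x = [op(z[i][j], p * x[j + 1] + (1 - p) * x[j])
             for j in range(i + 1)]
    return x[0]

# Illustrative use: Z constant equal to 1, op = conditional expectation only.
N = 7
z = [[1.0] * (i + 1) for i in range(N + 1)]
print(tree_value(z, 0.5, lambda zi, ce: ce))   # -> 1.0
```

Each time layer depends only on the next one, which is why parallelization happens inside a layer; the recombining structure keeps layer i down to i + 1 nodes.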

SLIDE 26

Tree methods (2)

[slide contains only a figure]

SLIDE 27

Tree methods (3)

◮ See [Zubair and Mukkamala, 08], [Zhang et al., 11] for the 1d case.
◮ Only interesting on shared memory architectures.
◮ Easy to visualize in 1d, but the geometry becomes trickier in higher dimensions. However, the real interest of the method would show up in multi-dimensional problems (number of nodes ≈ N^d).

SLIDE 28

Tree methods (4)

◮ Theoretical speed-up with two cores:
  ◮ sequential time: proportional to N(N + 1)/2.
  ◮ parallel time: proportional to 3/4 × N(N + 1)/2.
◮ Results for a tree of depth 20,000 on a Core i5 with 4 cores (2 physical cores + 2 virtual cores):

  Nb of cores   Wall time
  sequential    3.1
  2             2.42
  4             1.70

SLIDE 29

Non linear problems (1)

◮ Optimal stopping: dynamic programming equation

    X_N = Z_N
    X_i = max(Z_i, E[X_{i+1} | F_i]),   0 ≤ i ≤ N − 1

  It is only possible to parallelize inside each time step.
◮ OK for a small number of shared memory units, see [Avramidis et al., 99].
◮ Many GPU implementations, see [Abbas-Turki et al., 14].
◮ The real difficulty comes from the regression part.
◮ Classification approach, see [Ibanez and Zapatero, 04]; [Dung Doan et al., 10].

Only shared memory solutions have been investigated so far. What about scalability?
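The dynamic programming equation above, instantiated on a binomial tree for an American put (a standard illustration of optimal stopping; all numerical parameters below are made-up examples, not from the slides):

```python
import math

# American put by dynamic programming on a recombining binomial tree:
# X_N = Z_N and X_i = max(Z_i, E[X_{i+1} | F_i]).
def american_put(s0, strike, r, sigma, T, n):
    dt = T / n
    u = math.exp(sigma * math.sqrt(dt))      # up factor (CRR parametrization)
    d = 1 / u
    disc = math.exp(-r * dt)
    p = (math.exp(r * dt) - d) / (u - d)     # risk-neutral up probability

    spot = lambda i, j: s0 * u**j * d**(i - j)
    payoff = lambda i, j: max(strike - spot(i, j), 0.0)

    x = [payoff(n, j) for j in range(n + 1)]     # X_N = Z_N
    for i in range(n - 1, -1, -1):               # backward in time
        x = [max(payoff(i, j),                   # exercise now ...
                 disc * (p * x[j + 1] + (1 - p) * x[j]))   # ... or continue
             for j in range(i + 1)]
    return x[0]

price = american_put(s0=100, strike=100, r=0.05, sigma=0.2, T=1.0, n=200)
print(round(price, 2))
```

The max couples the whole layer to the current payoff, so the only exploitable parallelism is across the nodes of one time step, exactly as the slide says; in Monte Carlo variants this max sits behind a regression, which is the hard part to distribute.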

SLIDE 30

Non linear problems (2)

◮ Stochastic control problems: a very detailed study of a hybrid two-level approach, see [Vialle and Warin, 10].
◮ BSDEs:
  ◮ Dynamic programming approach.
  ◮ Picard's iterations: a distributed memory approach, see [Labart and L., 13].
  ◮ Stratified regression, see [Gobet et al., 15].

SLIDE 31

Conclusion

◮ Monte Carlo is embarrassingly parallel.
◮ Non linear problems are hard to run in parallel, especially on distributed memory architectures.
◮ To write an efficient parallel algorithm, you have to think parallel.
◮ Parallel codes aim at high dimensional problems: the algorithm may be slow on low dimensional problems.

SLIDE 32

References

◮ L. Abbas-Turki, S. Vialle, B. Lapeyre, and P. Mercier. Pricing derivatives on graphics processing units using Monte Carlo simulation. Concurrency and Computation: Practice and Experience, 26(9):1679–1697, 2014.
◮ T. Avramidis, Y. Zinchenko, T. F. Coleman, and A. Verma. Efficiency improvements for pricing American options with a stochastic mesh. In Proceedings of the 1999 Winter Simulation Conference, pages 344–350. IEEE Press, 1999.
◮ G. Marsaglia. Diehard. ftp://stat.fsu.edu/pub/diehard.
◮ V. Dung Doan, A. Gaiwad, M. Bossy, F. Baude, and I. Stokes-Rees. Parallel pricing algorithms for multidimensional Bermudan/American options using Monte Carlo methods. Mathematics and Computers in Simulation, 81(3):568–577, 2010.
◮ E. Gobet, J. Lopez-Salas, P. Turkedjiev, and C. Vasquez. Stratified regression Monte-Carlo scheme for semilinear PDEs and BSDEs with large scale parallelization on GPUs. To appear, 2015.
◮ A. Ibanez and F. Zapatero. Monte Carlo valuation of American options through computation of the optimal exercise frontier. Journal of Financial and Quantitative Analysis, 39:253–275, 2004.
◮ D. E. Knuth. The Art of Computer Programming, Vol. 2: Seminumerical Algorithms, second edition. Addison-Wesley, Reading, Massachusetts, 1981.
◮ P. L'Ecuyer. Efficient and portable combined random number generators. Communications of the ACM, 31:742–774, 1988.
◮ P. L'Ecuyer and S. Côté. Implementing a random number package with splitting facilities. ACM Transactions on Mathematical Software, 17(1):98–111, 1991.
◮ C. Labart and J. Lelong. A parallel algorithm for solving BSDEs. Monte Carlo Methods Appl., to appear, January 2013.
◮ M. Mascagni and A. Srinivasan. Algorithm 806: SPRNG: a scalable library for pseudorandom number generation. ACM Trans. Math. Softw., 26:436–461, September 2000.
◮ A. De Matteis and A. Pagnutti. Long-range correlations in linear and non-linear random number generators. Parallel Computing, 1:175–180, 1984.
◮ M. Matsumoto and T. Nishimura. Dynamic creation of pseudorandom number generators. In Monte Carlo and Quasi-Monte Carlo Methods 1998, Springer, 2000, pages 56–69.
◮ H. Haramoto, M. Matsumoto, T. Nishimura, F. Panneton, and P. L'Ecuyer. Efficient jump ahead for F2-linear random number generators. INFORMS J. on Computing, 20:385–390, July 2008.
◮ H. Haramoto, M. Matsumoto, and P. L'Ecuyer. A fast jump ahead algorithm for linear recurrences in a polynomial space. In Proceedings of the 5th International Conference on Sequences and Their Applications, SETA '08, pages 290–298, Berlin, Heidelberg, 2008. Springer-Verlag.
◮ X. Warin and S. Vialle. Design and experimentation of a large scale distributed stochastic control algorithm applied to energy management problems. In C. Myers, editor, Stochastic Control, chapter 7, pages 103–124. Sciyo, August 2010.
◮ N. Zhang, A. Roux, and T. Zastawniak. Parallel binomial valuation of American options with proportional transaction costs. In O. Temam, P.-C. Yew, and B. Zang, editors, Advanced Parallel Processing Technologies, volume 6965 of Lecture Notes in Computer Science, pages 88–97. Springer Berlin Heidelberg, 2011.
◮ M. Zubair and R. Mukkamala. High performance implementation of binomial option pricing. In O. Gervasi, B. Murgante, A. Laganà, D. Taniar, Y. Mun, and M. Gavrilova, editors, Computational Science and Its Applications – ICCSA 2008, volume 5072 of Lecture Notes in Computer Science, pages 852–866. Springer Berlin Heidelberg, 2008.