SLIDE 1

Using the Global Arrays Toolkit to Reimplement NumPy for Distributed Computation

Jeff Daily, Pacific Northwest National Laboratory (jeff.daily@pnnl.gov)
Robert R. Lewis, Washington State University (bobl@tricity.wsu.edu)

SciPy, July 13, 2011

SLIDE 2

Motivation

- Lots of NumPy applications
- NumPy (and Python) are, for the most part, single-threaded, so resources go underutilized
- Computers have multiple cores; academic and business clusters are common
- Lots of parallel libraries and programming languages: Message Passing Interface (MPI), Global Arrays (GA), X10, Co-Array Fortran, OpenMP, Unified Parallel C, Chapel, Titanium, Cilk
- Can we transparently parallelize NumPy?

SLIDE 3

Background – Parallel Programming

Single Program, Multiple Data (SPMD):
- Each process runs the same copy of the program
- Different branches of code are run by different threads


    if my_id == 0:
        foo()
    else:
        bar()

SLIDE 4

Background – Message Passing Interface

- Each process is assigned a rank, starting from 0
- Excellent Python bindings: mpi4py
- Two models of communication:
  - Two-sided, i.e. message passing (MPI-1 standard)
  - One-sided (MPI-2 standard)


    from mpi4py import MPI

    if MPI.COMM_WORLD.rank == 0:
        foo()
    else:
        bar()

SLIDE 5

Background – Communication Models

[Diagram: two communication models. Message passing (MPI): P1 issues a send and P0 issues a matching receive. One-sided communication (SHMEM, ARMCI, MPI-2 one-sided): P1 issues a put directly into P0's memory.]

Message passing: a message requires cooperation on both sides. The processor sending the message (P1) and the processor receiving the message (P0) must both participate.

One-sided communication: once the message is initiated on the sending processor (P1), the sending processor can continue computation. The receiving processor (P0) is not involved; data is copied directly from the switch into memory on P0.
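For concreteness, here is a minimal mpi4py sketch of the two models (not taken from the slides; the buffer size and tag are illustrative). The two-sided exchange needs both ranks to call into MPI, while the one-sided put only involves the origin rank once the window has been set up.

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    data = np.zeros(10)

    # Two-sided (message passing): both ranks must participate.
    if rank == 1:
        comm.Send(np.arange(10.0), dest=0, tag=7)
    elif rank == 0:
        comm.Recv(data, source=1, tag=7)

    # One-sided: every rank exposes a window over its buffer; rank 1 then
    # puts data directly into rank 0's memory without rank 0 making a call.
    win = MPI.Win.Create(data, comm=comm)
    win.Fence()
    if rank == 1:
        win.Put(np.arange(10.0), target_rank=0)
    win.Fence()
    win.Free()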

SLIDE 6

Background – Global Arrays

- Distributed dense arrays that can be accessed through a shared-memory-like style
- Single, shared data structure with global indexing
  - e.g., ga.get(a, (3,2)) rather than buf[6] on process 1
- Local array portions can be ga.access()'d

[Diagram: physically distributed data presented as a single global address space across processes.]

SLIDE 7

Remote Data Access in GA vs MPI

Message Passing:

    identify size and location of data blocks
    loop over processors:
        if (me = P_N) then
            pack data in local message buffer
            send block of data to message buffer on P0
        else if (me = P0) then
            receive block of data from P_N in message buffer
            unpack data from message buffer to local buffer
        endif
    end loop
    copy local data on P0 to local buffer

Global Arrays:

    buf = ga.get(g_a, lo=None, hi=None, buffer=None)

where:
- g_a: Global Array handle
- lo, hi: global lower and upper indices of the data patch
- buffer: local ndarray buffer

[Diagram: the requested patch spans data owned by processes P0-P3.]
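As a concrete companion to the annotated call above, a minimal sketch using the GA Python bindings shipped with the toolkit; ga.create, ga.put, ga.nodeid, ga.sync, and ga.destroy are taken from the GA API, but treat the exact signatures as assumptions rather than a definitive reference.

    import numpy as np
    import ga   # Python bindings from the Global Arrays toolkit

    # Create a 6x12 global array of doubles, distributed across all processes.
    g_a = ga.create(ga.C_DBL, (6, 12))
    if ga.nodeid() == 0:
        # One process fills it; the put is one-sided.
        ga.put(g_a, np.arange(72, dtype=np.float64).reshape(6, 12))
    ga.sync()

    # Any process can fetch any patch by global indices, regardless of
    # which process actually owns that data.
    patch = ga.get(g_a, lo=(1, 1), hi=(5, 11))   # local ndarray copy of the patch
    ga.destroy(g_a)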

SLIDE 8

Background – Global Arrays

- Shared data model in the context of distributed dense arrays
- Much simpler than message passing for many applications
- Complete environment for parallel code development
- Compatible with MPI
- Data locality control similar to the distributed memory / message passing model
- Extensible
- Scalable

SLIDE 9

Previous Work to Parallelize NumPy

- Star-P
- Global Arrays Meets MATLAB (yes, it's not NumPy, but…)
- IPython
- gpupy
- Co-Array Python

SLIDE 10

Design for Global Arrays in NumPy (GAiN)

- All documented NumPy functions are collective
- GAiN programs run in SPMD fashion
- Not all arrays should be distributed
- GAiN operations should allow mixed NumPy/GAiN inputs
- Reuse as much of NumPy as possible (obviously)
- The distributed nature of arrays should be transparent to the user
- Use the owner-computes rule to attempt data-locality optimizations

SLIDE 11

Why Subclassing numpy.ndarray Fails

The hooks:

- __new__()
- __array_prepare__()
- __array_finalize__()
- __array_priority__

The first hook, __array_prepare__(), is called only after the output array has been created:

- There is no means of intercepting array creation
- The array is allocated on each process, not distributed
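A minimal sketch of the NumPy 1.x subclassing machinery the slide refers to (the class name is made up for illustration); it shows why the hooks fire too late for GAiN: by the time __array_prepare__() runs, the ufunc's output has already been allocated as an ordinary, process-local ndarray.

    import numpy as np

    class DistArray(np.ndarray):        # hypothetical subclass, for illustration only
        def __new__(cls, shape):
            # Local memory for the array is allocated right here.
            return super(DistArray, cls).__new__(cls, shape)

        def __array_prepare__(self, out_arr, context=None):
            # Called at the start of a ufunc, but only AFTER out_arr has
            # already been allocated on this process; too late to make it
            # a distributed array.
            return out_arr

        def __array_finalize__(self, obj):
            # Called whenever a new instance is created (construction,
            # view, or template).
            pass

    a = DistArray((6, 12))
    b = np.add(a, 1)    # __array_prepare__ only sees an already-allocated output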

SLIDE 12

The gain.ndarray in a Nutshell

- Global shape and P local shapes
- Memory allocated from the Global Arrays library, wrapped in a local numpy.ndarray
- The memory distribution is static
- Views and array operations query the current global_slice


[Diagram: a (6,12) global array [0:6,0:12] distributed as eight (3,3) local blocks: [0:3,0:3], [0:3,3:6], [0:3,6:9], [0:3,9:12], [3:6,0:3], [3:6,3:6], [3:6,6:9], [3:6,9:12].]

SLIDE 13

Example: Slice Arithmetic

Observation: In both cases shown here, Array b could be created either using the standard notation (top) or the “canonical” form (bottom)


Case 1 (standard notation above, canonical form below):

    a = ndarray(6,12)
    b = a[1:-1,1:-1]
    c = b[1:-1,1:-1]

    b = a[slice(1,5,1), slice(1,11,1)]
    c = a[slice(2,4,1), slice(2,10,1)]

Case 2 (standard notation above, canonical form below):

    a = ndarray(6,12)
    b = a[::2,::3]
    c = b[1,:]

    b = a[slice(0,6,2), slice(0,12,3)]
    c = a[2, slice(0,12,3)]
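The same compositions can be checked directly with plain NumPy; a small verification sketch (the array contents are illustrative):

    import numpy as np

    a = np.arange(72).reshape(6, 12)

    # Case 1 from above
    b = a[1:-1, 1:-1]
    c = b[1:-1, 1:-1]
    assert np.array_equal(b, a[slice(1, 5, 1), slice(1, 11, 1)])
    assert np.array_equal(c, a[slice(2, 4, 1), slice(2, 10, 1)])

    # Case 2 from above
    b = a[::2, ::3]
    c = b[1, :]
    assert np.array_equal(b, a[slice(0, 6, 2), slice(0, 12, 3)])
    assert np.array_equal(c, a[2, slice(0, 12, 3)])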

SLIDE 14

Example: Binary Ufunc

- The owner-computes rule means the owner of the output array portion does the work
- ga.access() the other input arrays' portions, since all distributions and shapes are the same
- Call the original NumPy ufunc on the local pieces (a sketch follows below)

[Diagram: element-wise addition of two identically distributed arrays.]
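In outline, the aligned case might look like the following sketch. This is not the actual GAiN implementation; ga.access, ga.release, ga.release_update, and ga.sync are taken from the GA API, and the function name add_aligned is made up.

    import numpy as np
    import ga

    def add_aligned(g_a, g_b, g_c):
        # c = a + b, where a, b, and c all share the same distribution.
        # Each process only touches the block of c that it owns.
        c_local = ga.access(g_c)    # direct view of my block of c
        a_local = ga.access(g_a)    # same distribution, so my blocks of a
        b_local = ga.access(g_b)    # and b cover the same global indices
        np.add(a_local, b_local, out=c_local)   # original NumPy ufunc
        ga.release(g_a)
        ga.release(g_b)
        ga.release_update(g_c)      # flush my updates back to the global array
        ga.sync()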

SLIDE 15

Example: Binary Ufunc with Sliced Arrays

[Diagram: element-wise addition of arrays whose distributed patches are not aligned.]

- The owner-computes rule means the owner of the output array portion does the work
- ga.get() the other input arrays' portions, since the arrays are not aligned
- Call the original NumPy ufunc on the local pieces (a sketch follows below)
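A corresponding sketch for the misaligned case, with the same caveats as before; lo and hi stand for the global bounds of the patch of the output that this process owns.

    import numpy as np
    import ga

    def add_patch(g_a, g_b, g_c, lo, hi):
        # c[lo:hi] = a[lo:hi] + b[lo:hi], where only c's block is local.
        c_local = ga.access(g_c)         # my block of the output
        a_patch = ga.get(g_a, lo, hi)    # remote copy of the matching a patch
        b_patch = ga.get(g_b, lo, hi)    # remote copy of the matching b patch
        np.add(a_patch, b_patch, out=c_local)
        ga.release_update(g_c)
        ga.sync()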

SLIDE 16

Example: Binary Ufunc

- Broadcasting works too
- Not all arrays are distributed

[Diagram: addition example illustrating broadcasting with a non-distributed operand.]

SLIDE 17

How to Use GAiN

Ideally, change one line in your script:

    #import numpy
    import ga.gain as numpy

Run using the MPI process manager:

    $ mpiexec -np 4 python script.py

SLIDE 18

Live Demo: laplace.py

2D Laplace equation solved with an iterative finite-difference scheme (four-point averaging; Gauss-Seidel or Gauss-Jordan). I'll now show you how to use GAiN. (This is not the "pretty pictures" part of the presentation; there's nothing pretty about raw computation.)
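A hedged sketch of the kind of kernel laplace.py runs (the grid size, iteration count, and variable names are illustrative, not taken from the demo script); the only GAiN-specific change is the import, exactly as on the previous slide.

    #import numpy
    import ga.gain as numpy

    def timestep(u, dx2, dy2):
        # Four-point averaging over the interior of the grid.
        u[1:-1, 1:-1] = ((u[2:, 1:-1] + u[:-2, 1:-1]) * dy2 +
                         (u[1:-1, 2:] + u[1:-1, :-2]) * dx2) / (2.0 * (dx2 + dy2))

    nx, ny = 1000, 1000
    u = numpy.zeros((nx, ny))
    u[0, :] = 1.0                  # boundary condition along one edge
    for _ in range(100):
        timestep(u, 0.01 ** 2, 0.01 ** 2)

Run it with the MPI process manager, e.g. $ mpiexec -np 4 python laplace.py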

SLIDE 19

laplace.py Again, but Bigger

SLIDE 20

GAiN is Not Complete (yet)


What’s finished:

- Ufuncs (all of them, except reduceat and outer)
- ndarray (mostly)
- flatiter
- numpy dtypes are reused!
- Various array creation and other functions:
  - zeros, zeros_like, ones, ones_like, empty, empty_like
  - eye, identity, fromfunction, arange, linspace, logspace
  - dot, diag, clip, asarray

Everything else doesn't exist yet, including the order= keyword. GAiN is here to stay: it's officially supported by the GA project (me!)

SLIDE 21

Thanks! Time for Questions


Using the Global Arrays Toolkit to Reimplement NumPy for Distributed Computation

Jeff Daily, Pacific Northwest National Laboratory (jeff.daily@pnnl.gov)
Robert R. Lewis, Washington State University (bobl@tricity.wsu.edu)

Where to get the code until the pnl.gov domain is restored: https://github.com/jeffdaily/Global-Arrays-Scipy-2011
Where to get the code, usually: https://svn.pnl.gov/svn/hpctools/trunk/ga
Website (documentation, download releases, etc.): http://www.emsl.pnl.gov/docs/global