Jussi Enkovaara Martti Louhivuori Python in High-Performance - - PDF document

jussi enkovaara martti louhivuori python in high
SMART_READER_LITE
LIVE PREVIEW

Jussi Enkovaara Martti Louhivuori Python in High-Performance - - PDF document

Jussi Enkovaara Martti Louhivuori Python in High-Performance Computing CSC IT Center for Science Ltd, Finland January 29-31, 2018 import sys, os try: from Bio.PDB import PDBParser __biopython_installed__ = True except ImportError:


slide-1
SLIDE 1

import sys, os try: from Bio.PDB import PDBParser __biopython_installed__ = True except ImportError: __biopython_installed__ = False __default_bfactor__ = 0.0 # default B-factor __default_occupancy__ = 1.0 # default occupancy level __default_segid__ = '' # empty segment ID class EOF(Exception): def __init__(self): pass class FileCrawler: """ Crawl through a file reading back and forth without loading anything to memory. """ def __init__(self, filename): try: self.__fp__ = open(filename) except IOError: raise ValueError, "Couldn't open file '%s' for reading." % filename self.tell = self.__fp__.tell self.seek = self.__fp__.seek def prevline(self): try: self.prev()

Python in High-Performance Computing Jussi Enkovaara Martti Louhivuori

CSC – IT Center for Science Ltd, Finland January 29-31, 2018

slide-2
SLIDE 2

All material (C) 2018 by the authors. This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License, http://creativecommons.org/licenses/by-nc-sa/3.0/

slide-3
SLIDE 3

PRACTICALITIES

Computing servers

We will use either the classroom computers or CSC’s computing server Taito for the

  • exercises. You can use also your laptop if all the necessary Python modules (numpy,

cython, mpi4py etc.) are installed. To log in to Taito, use the provided tnrgXX username and password, e.g. % ssh –X trng10@taito.csc.fi For editing files you can use e.g. nano editor: nano example.py Also other popular editors (vim, emacs) are available.

Python environment at CSC

Python is available on all computing servers (taito, sisu), in order to utilize scientific computing packages use the module command to load correct Python environment. Default Python version is 2.7, in this course we will be using Python 3.4, which can be loaded in Taito as % module load python-env/3.4.5 In the classroom computers module load is not needed.

Exercise material

All the exercise material (skeleton codes, input files, model solutions, etc.) is available in github at https://github.com/csc-training/hpc-python The material can be downloaded (cloned) as % git clone https://github.com/csc-training/hpc-python.git Each exercise is contained in its own subdirectory, i.e. numpy/reference-copy numpy/one-dimensional-arrays …

slide-4
SLIDE 4

General exercise instructions

Simple exercises can be carried out directly in the interactive interpreter. For more complex ones it is recommended to write the program into a .py file. Still, it is useful to keep an interactive interpreter open for testing! Some exercises contain references to functions/modules which are not addressed in actual lectures. In these cases Python's interactive help (and google) are useful, e.g. In [4]: help(numpy) It is not necessary to complete all the exercises, instead you may leave some for further study at home. Also, some Bonus exercises are provided in the end of the sheet.

Visualisation

In some exercises it might be convenient to do visualisations with matplotlib Python

  • package. Interactive plotting is most convenient with the IPython enhanced
  • interpreter. For enabling interactive plotting, start IPython with --matplotlib

argument: % ipython –matplotlib Simple x-y plots can then be done as: In [1]: import matplotlib.pyplot as plt … In [6]: plt.plot(x,y) # line In [7]: plt.plot(x,y, ’ro’) # individual points in red Look matplotlib documentation for additional information for visualisation.

Parallel calculations

In class room workstations, one needs to load the MPI environment before using mpi4py: % module load mpi/openmpi-x86_64 After that MPI parallel Python programs can be launched with mpirun, e.g. to run with 4 MPI tasks one issues % mpirun –np 4 python3 example.py

slide-5
SLIDE 5

In Taito one can launch interactive MPI programs with srun: % srun -n4 python3 hello.py Note that for real production calculations in Taito one should use batch job scripts, see https://research.csc.fi/taito-user-guide

slide-6
SLIDE 6

Basic array manipulation

1. Reference vs. copy

Investigate the behavior of the statements below by looking at the values of the arrays a and b after assignments: a = np.arange(5) b = a b[2] = -1 b = a[:] b[1] = -1 b = a.copy() b[0] = -1

2. One dimensional arrays

Start from a Python list containing both integers and floating point values, and construct then a NumPy array from the list. Generate a 1D NumPy array containing all numbers from -2.0 to 2.0 with a spacing of 0.2. Use optional start and step arguments of np.arange() function. Generate another 1D NumPy array containing 11 equally spaced values between 0.5 and 1.5. Extract every second element of the array.

3. Two dimensional arrays and slicing

First, create a 4x4 array with arbitrary values, then a) Extract every element from the second row. b) Extract every element from the third column. c) Assign a value of 0.21 to upper left 2x2 subarray. Next, create a 8x8 array with checkerboard pattern, i.e. alternating zeros and ones: ( 1 1 … 1 … 1 1 … … … … … )

slide-7
SLIDE 7

4. Splitting and combining arrays

Continue with the previous 4x4 array a) Use np.split() function for splitting the array into two new 2x4 arrays. Reconstruct the original 4x4 array by using np.concatenate(). b) Repeat the above exercise but create now 4x2 subarrays and then combine them.

5. Subdiagonal matrices

Create a 6x6 matrix with 1’s above and below the diagonal and zeros otherwise: ( 1 1 1 1 1 1 1 1 1 1 0) Use the numpy.eye() –function.

slide-8
SLIDE 8

NumPy tools

6. Input and output

File xy_coordinates.dat contains a list of (x,y) value pairs. Read the data with numpy.loadtxt(). Add then 2.5 to all y values and write the new data into a file using numpy.savetxt().

7. Polynomials

Fit a second order polynomial to the data of previous exercises by using numpy.polyfit(). Construct the values of the polynomial in the interval [-6,6] with numpy.polyval(). You can use matplotlib for plotting both the original points and the fitted polynomial.

8. Random numbers

Generate a one dimensional 1000 element array of uniformly distributed random numbers using numpy.random module. Calculate the mean and standard deviation of the array using numpy.mean() and numpy.std(). Choose some other random distribution and calculate its mean and standard deviation. You can visualize the random distributions with matplotlib’s hist() function.

9. Linear algebra

Construct two symmetric 2x2 matrices A and B. (hint: a symmetric matrix can be constructed easily as Asym = A + AT) Calculate the matrix product C=A*B using numpy.dot(). Calculate the eigenvalues of matrix C with numpy.linalg.eigvals().

slide-9
SLIDE 9

Advanced NumPy

  • 10. Advanced indexing

Start with 10x10 array of uniformly distribute random numbers (np.random.random). Find all the elements larger than 0.5 by using Boolean mask. Find also the indices of elements larger than 0.5. You can use the above mask and numpy.nonzero() for finding the indices. Check that the values obtained by using the index array are the same as with the earlier direct Boolean indexing.

  • 11. Translation with broadcasting

File points_circle.dat contains x, y coordinates along a circle. Translate all the coordinates with some vector e.g. (2.1, 1.1). Plot both the original and translated points in order to see the effect of translation.

  • 12. Finite-difference

Derivatives can be calculated numerically with the finite-difference method as: Construct 1D Numpy array containing the values of xi in the interval [0,π/2] with spacing Δx=0.1. Evaluate numerically the derivative of sin in this interval (excluding the end points) using the above formula. Try to avoid for loops. Compare the result to function cos in the same interval.

  • 13. Numerical integration

A simple method for evaluating integrals numerically is by the middle Riemann sum

slide-10
SLIDE 10

with x’i = (xi + xi-1)/2. Use the same interval as in the first exercise and investigate how much the Riemann sum of sin differs from 1.0. Avoid for loops. Investigate also how the results changes with the choice of Δx.

  • 14. Temporary arrays

Try different NumPy array expressions and investigate how much memory is used in temporary arrays. You can use the Unix command ‘/usr/bin/time –v ‘ for finding out the maximum memory usage of program, or utilize the maxmem() function in demos/memory_usage.py

  • 15. Numexpr

Try different array expressions and investigate how much numexpr can speed them

  • up. Try to vary also the number of threads used by numexpr. IPython and %timeit

magic can be useful in testing.

slide-11
SLIDE 11

Performance analysis

  • 16. Using cProfile

The file heat_simple.py contains (very inefficient) implementation of the two dimensional heat equation. Use cProfile for investigating where the time is spent in the program. You can try to profile also the more efficient model solution of numpy/heat-equation

slide-12
SLIDE 12

Optimising with Cython

  • 17. Type declarations

Create a setup.py file in order to speed up the heat equation solver with Cython. Based on the profile in the previous exercise, insert static type declarations for a proper locations. Investigate the effect on performance. You can use applications

  • wn timers and/or timeit. Annotated HTML-report with “cython –a …” can be

useful when tuning performance

  • 18. C-functions, fast indexing and directives

Continue optimising the code by reducing function call overhead, utilizing fast array indexing and including compiler directives. When finished with the optimisation, compare performance to Python/NumPy model solution of bonus_heat.py which uses array operations. You can play around also with larger input data as provided in bottle_medium.dat and bottle_large.dat.

slide-13
SLIDE 13

Interfacing with libraries

  • 19. Interfacing with C

a) The files evolve.h and evolve.c contain a pure C implementation of the single time step in heat equation. The C implemention can be built into a shared library with the provided Makefile by executing make command. Use cffi for utilizing the library function instead of the Python function heat_simple.py . Compare the performance to Cython implementation b) Repeat the previous exercise, but use now Cython for interfacing.

  • 20. Interfacing with Fortran

The file evolve.f90 contain a pure Fortran implementation of the single time step in heat equation. Use f2py for creating Python interface, and utilize the Fortran implementation in heat_simple.py . You may need to insert f2py attributes into the Fortran file.

slide-14
SLIDE 14

Multiprocessing

  • 21. Simple calculation

Calculate the square of all numbers 1...10 using a separate Process to calculate the square of each number and print out the result.

  • 22. Work distribution

Read the atom coordinates of the Zika virus (5ire.pdb) using the simple parser provided in pdb.py (see comments in the file for examples on how to use it). In center-of-coords.py is a skeleton code you may also use as a starting point. Calculate the center of coordinates in parallel by dividing the coordinates in smaller chunks (e.g. 100 coordinates) to separate tasks. After completing the tasks in worker processes, combine the results in the master process. If needed, figure out also how to deal with any coordinates left out of the neat uniform-sized chunks. a) Use a Pool of Workers to distribute the tasks to worker processes. Hint: if you return also the weight of each chunk, you can process also non-uniform sized chunks. OR b) Use a Queue to distribute the tasks to worker processes. Store the results for each chunk in a separate queue.

slide-15
SLIDE 15

Parallel programming with mpi4py

  • 23. Parallel “Hello World”

Create a parallel Python program which prints out the number of processes and the rank of each process.

  • 24. Simple message exchange

a) Write a simple program where two processes send and receive a message to/from each other using send and recv. The message content is a dictionary with a key {‘rank’ : myrank} where myrank is the rank of the sending process. After receiving a message, each process should print out the rank of the process and the value in the received dictionary. b) In each process, construct a 100 000 element NumPy array which is initialized to the rank of process. Send and receive the array using Send and Recv and after receiving print out the rank of process together with the first element of the received array. Investigate what happens when reordering the send and receive calls in one of the processes.

  • 25. Message chain

Write a simple program where every processor sends data to the next one. Let ntasks be the number of the tasks, and myid the rank of the current process. Your program should work as follows:  Every task with a rank less than ntasks-1 sends a message to task myid+1. For example, task 0 sends a message to task 1.  The message content is an integer array where each element is initialized to myid.  The message tag is the receiver’s id number.  The sender prints out the number of elements it sends and the tag number.  All tasks with rank ≥ 1 receive messages.  Each receiver prints out their myid, and the first element in the received array. a) Implement the program described above using Send and Recv. b) Use Sendrecv instead of Send and Recv. c) Can the code be simplified using MPI.PROC_NULL?

slide-16
SLIDE 16
  • 26. Non-blocking communication

Go back to “Message chain” exercise and implement it using non-blocking communication

  • 27. Collective operations

In this exercise we test different routines for collective communication. a) First, write a program where rank 0 sends and array containing numbers from 0 to 7 to all the other ranks using collective communication. Next, let us continue with four processes with following data vectors: Task 0: 1 2 3 4 5 6 7 Task 1: 8 9 10 11 12 13 14 15 Task 2: 16 17 18 19 20 21 22 23 Task 3: 24 25 26 27 28 29 30 31 In addition, each task has a receive buffer for eight elements and the values in the buffer are initialized to -1. Implement a program that sends and receives values from the data vectors to receive buffers using a single collective communication routine for each case, so that the receive buffers will have the following values: b) Task 0: 1

  • 1
  • 1
  • 1
  • 1
  • 1
  • 1

Task 1: 2 3

  • 1
  • 1
  • 1
  • 1
  • 1
  • 1

Task 2: 4 5

  • 1
  • 1
  • 1
  • 1
  • 1
  • 1

Task 3: 6 7

  • 1
  • 1
  • 1
  • 1
  • 1
  • 1

c) Task 0:

  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1

Task 1: 1 8 9 16 17 24 25 Task 2:

  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1

Task 3:

  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1

d) Task 0: 8 10 12 14 16 18 20 22 Task 1:

  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1

Task 2: 40 42 44 46 48 50 52 54 Task 3:

  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1

You can start from scratch or use a skeleton file collectives.py. Tip: create two communicators

slide-17
SLIDE 17
slide-18
SLIDE 18

Bonus exercises

  • B1. Game of life

Game of life is a cellular automaton devised by John Conway in 70's: http://en.wikipedia.org/wiki/Conway's_Game_of_Life The game consists of two dimensional orthogonal grid of cells. Cells are in two possible states, alive or dead. Each cell interacts with its eight neighbours, and at each time step the following transitions occur:

  • Any live cell with fewer than two live neighbours dies, as if caused by

underpopulation

  • Any live cell with more than three live neighbours dies, as if by overcrowding
  • Any live cell with two or three live neighbours lives on to the next generation
  • Any dead cell with exactly three live neighbours becomes a live cell

The initial pattern constitutes the seed of the system, and the system is left to evolve according to rules. Deaths and births happen simultaneously. Implement the Game of Life using Numpy. Try first 32x32 square grid and cross- shaped initial pattern: Try also other grids and initial patterns (e.g. random pattern).

slide-19
SLIDE 19

For visualization you can use Matplotlib: import matplotlib.pyplot as plt plt.imshow(array)

  • B2. Rotation with broadcasting

Two dimensional coordinates can be rotated by my multiplying the 2-element vector

  • f coordinates by 2x2 rotation matrix

R=(cos (𝜄) −sin (𝜄) sin (𝜄) cos (𝜄) ) where 𝜄 is the angle of rotation (in radians). Start from the x-y coordinates in the file points_circle.dat and rotate them by 90°. Utilize broadcasting for performing the rotation with a single np.dot call. (Hint: you may need to create additional dimensions). Plot both the original and rotated points in order to see the effect of rotation.

  • B3. Two dimensional heat equation

Heat (or diffusion) equation is where u(x, y, t) is the temperature field that varies in space and time, and α is thermal diffusivity constant. The two dimensional Laplacian can be discretized with finite differences as Given an initial condition (u(t=0) = u0) one can follow the time dependence of the temperature field with explicit time evolution method: (Note that algorithm is stable only when )

𝜖𝑣 𝜖𝑢 = 𝛽𝛼2𝑣

𝛼2𝑣(𝑗, 𝑘) → 𝑣(𝑗 − 1, 𝑘) − 2𝑣(𝑗,𝑘) + 𝑣(𝑗 + 1,𝑘) (𝛦𝑦)2 + 𝑣(𝑗,𝑘 − 1) − 2𝑣(𝑗, 𝑘) + 𝑣(𝑗, 𝑘 + 1) (𝛦𝑧)2 𝑣𝑛+1(𝑗,𝑘) = 𝑣𝑛(𝑗, 𝑘) + ∆𝑢 𝛽 𝛼2𝑣𝑛(𝑗, 𝑘)

∆𝑢 < 1 2 𝛽 (𝛦𝑦𝛦𝑧)2 (𝛦𝑦)2 + (𝛦𝑧)2

slide-20
SLIDE 20

Implement two dimensional heat equation with NumPy using the initial temperature field in the file bottle.dat (the file consists of a header and 200 x 200 data array). As a boundary condition use fixed values as given in the initial field. You can start from the skeleton code in the file bonus_heat.py

  • B4. Parallel heat equation

Parallelize the whole heat equation program with MPI, by dividing the grid in rows and assigning one row block to one task. A domain decomposition, that is. The tasks are able to update the grid independently everywhere else than on the row boundaries – there the communication of a single row with the nearest neighbor is

  • needed. This is realized by having additional ghost layers that contain the boundary

data of the neighboring tasks. As the system is aperiodic, the outermost ranks communicate with single neighbor, and the inner ranks with the two neighbors. Insert the proper MPI routines into skeleton code heat_mpi.py (search for“TODO”s). A schematic representation of decomposition looks like (row wise composition is

  • btained just by rotating 90 degrees):

portion grid Rank #0 Rank #1 Rank #2 Rank #3 Local

  • f the

Each rank has additional columns in both ends of the local grid (colored columns). The first actual column of Rank #2 is communicated to “rightmost” ghostlayer of Rank #1.

Remember to update all ghost layers at each iteration.