CS240A: Parallelism in CSE Applications
Tao Yang Slides revised from James Demmel and Kathy Yelick www.cs.berkeley.edu/~demmel/cs267_Spr11
CS267 Lecture 4 2
[Figure: categories of CSE simulation applications, arranged from discrete to continuous]
every point in space along a wire, just endpoints
Star Wars: The Force Unleashed
Analysis, Terminator 3: Rise of the Machines
variable, setup equations.
Then solve these equations
[Figure: straight-line approximation of y(x) over steps of width h at x_1, x_2, x_3]
$$y_{n+1} = y_n + h\,y'_n + O(h^2)$$
Thus, starting from an initial value $y_0$, each step computes $y_{n+1} = y_n + h\,y'_n$.
The tabulated values below are consistent with $y' = x + y$, $y(0) = 1$, $h = 0.02$.

x_n    y_n      y'_n     h*y'_n   Exact solution
0      1.00000  1.00000  0.02000  1.00000
0.02   1.02000  1.04000  0.02080  1.02040
0.04   1.04080  1.08080  0.02162  1.04162
0.06   1.06242  1.12242  0.02245  1.06367
0.08   1.08486  1.16486  0.02330  1.08657
0.1    1.10816  1.20816  0.02416  1.11034
0.12   1.13232  1.25232  0.02505  1.13499
0.14   1.15737  1.29737  0.02595  1.16055
0.16   1.18332  1.34332  0.02687  1.18702
0.18   1.21019  1.39019  0.02780  1.21443
0.2    1.23799  1.43799  0.02876  1.24281
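The table can be reproduced with a few lines of code. The following is a minimal sketch, assuming the test problem is $y' = x + y$ with $y(0) = 1$ and $h = 0.02$ (inferred from the tabulated values, not stated on the slide). The names `euler`, `rhs`, and `exact` are illustrative.

```python
import math

def euler(f, x0, y0, h, steps):
    """Forward Euler: return the list of (x_n, y_n) pairs for n = 0..steps."""
    points = [(x0, y0)]
    y = y0
    for n in range(steps):
        x = x0 + n * h                 # recompute x_n to avoid accumulated rounding
        y = y + h * f(x, y)            # y_{n+1} = y_n + h * y'_n
        points.append((x0 + (n + 1) * h, y))
    return points

def rhs(x, y):
    # Assumed right-hand side y' = x + y (inferred from the table)
    return x + y

def exact(x):
    # Closed-form solution of y' = x + y with y(0) = 1
    return 2.0 * math.exp(x) - x - 1.0

table = euler(rhs, 0.0, 1.0, 0.02, 10)
for x, y in table:
    # columns: x_n, Euler y_n, exact solution, error
    print(f"{x:4.2f}  {y:.5f}  {exact(x):.5f}  {exact(x) - y:.5f}")
```

The error grows with each step because every straight-line step starts from an already approximate value.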
http://numericalmethods.eng.usf.edu 9
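$$\frac{d^2u}{dr^2} + \frac{1}{r}\frac{du}{dr} - \frac{u}{r^2} = 0, \qquad a = 5,\quad b = 8$$

$$\frac{d^2y}{dx^2} \approx \frac{y_{i+1} - 2y_i + y_{i-1}}{(\Delta x)^2}$$

Using the approximation of
$$\frac{d^2u}{dr^2} \approx \frac{u_{i+1} - 2u_i + u_{i-1}}{(\Delta r)^2}$$
and
$$\frac{du}{dr} \approx \frac{u_{i+1} - u_{i-1}}{2\,\Delta r}$$
Gives you
$$\frac{u_{i+1} - 2u_i + u_{i-1}}{(\Delta r)^2} + \frac{1}{r_i}\,\frac{u_{i+1} - u_{i-1}}{2\,\Delta r} - \frac{u_i}{r_i^2} = 0$$
$$\left(\frac{1}{(\Delta r)^2} - \frac{1}{2 r_i \Delta r}\right) u_{i-1} + \left(-\frac{2}{(\Delta r)^2} - \frac{1}{r_i^2}\right) u_i + \left(\frac{1}{(\Delta r)^2} + \frac{1}{2 r_i \Delta r}\right) u_{i+1} = 0$$

Step 1 At node $i = 0$, $r_0 = a = 5$:
$$u_0 = 0.0038731$$

Step 2 At node $i = 1$, $r_1 = r_0 + \Delta r = 5 + 0.6 = 5.6$:
$$\left(\frac{1}{0.6^2} - \frac{1}{2(5.6)(0.6)}\right) u_0 + \left(-\frac{2}{0.6^2} - \frac{1}{5.6^2}\right) u_1 + \left(\frac{1}{0.6^2} + \frac{1}{2(5.6)(0.6)}\right) u_2 = 0$$
$$2.6290\,u_0 - 5.5874\,u_1 + 2.9266\,u_2 = 0$$

Step 3 At node $i = 2$, $r_2 = r_1 + \Delta r = 5.6 + 0.6 = 6.2$:
$$\left(\frac{1}{0.6^2} - \frac{1}{2(6.2)(0.6)}\right) u_1 + \left(-\frac{2}{0.6^2} - \frac{1}{6.2^2}\right) u_2 + \left(\frac{1}{0.6^2} + \frac{1}{2(6.2)(0.6)}\right) u_3 = 0$$
$$2.6434\,u_1 - 5.5816\,u_2 + 2.9122\,u_3 = 0$$

Step 4 At node $i = 3$, $r_3 = r_2 + \Delta r = 6.2 + 0.6 = 6.8$:
$$\left(\frac{1}{0.6^2} - \frac{1}{2(6.8)(0.6)}\right) u_2 + \left(-\frac{2}{0.6^2} - \frac{1}{6.8^2}\right) u_3 + \left(\frac{1}{0.6^2} + \frac{1}{2(6.8)(0.6)}\right) u_4 = 0$$
$$2.6552\,u_2 - 5.5772\,u_3 + 2.9003\,u_4 = 0$$

Step 5 At node $i = 4$, $r_4 = r_3 + \Delta r = 6.8 + 0.6 = 7.4$:
$$\left(\frac{1}{0.6^2} - \frac{1}{2(7.4)(0.6)}\right) u_3 + \left(-\frac{2}{0.6^2} - \frac{1}{7.4^2}\right) u_4 + \left(\frac{1}{0.6^2} + \frac{1}{2(7.4)(0.6)}\right) u_5 = 0$$
$$2.6651\,u_3 - 5.5738\,u_4 + 2.8903\,u_5 = 0$$

Step 6 At node $i = 5$, $r_5 = r_4 + \Delta r = 7.4 + 0.6 = 8$:
$$u_5 = u\big|_{r=b} = 0.0030769$$

$$\begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 \\
2.6290 & -5.5874 & 2.9266 & 0 & 0 & 0 \\
0 & 2.6434 & -5.5816 & 2.9122 & 0 & 0 \\
0 & 0 & 2.6552 & -5.5772 & 2.9003 & 0 \\
0 & 0 & 0 & 2.6651 & -5.5738 & 2.8903 \\
0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix}
\begin{bmatrix} u_0 \\ u_1 \\ u_2 \\ u_3 \\ u_4 \\ u_5 \end{bmatrix}
=
\begin{bmatrix} 0.0038731 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0.0030769 \end{bmatrix}$$

$$u_0 = 0.0038731,\quad u_1 = 0.0036115,\quad u_2 = 0.0034159,\quad u_3 = 0.0032689,\quad u_4 = 0.0031586,\quad u_5 = 0.0030769$$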
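The six nodal equations form a tridiagonal system, so they can be solved in linear time. The sketch below assumes the central-difference discretization of $u'' + u'/r - u/r^2 = 0$ on $5 \le r \le 8$ with $\Delta r = 0.6$ and the two boundary values from the example. `solve_tridiagonal` is an illustrative helper (the Thomas algorithm without pivoting), not something named on the slides.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm: lower[0] and upper[-1] are ignored."""
    n = len(diag)
    c = [0.0] * n                      # modified super-diagonal
    d = [0.0] * n                      # modified right-hand side
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):              # forward elimination
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):     # back substitution
        x[i] = d[i] - c[i] * x[i + 1]
    return x

a, b, n = 5.0, 8.0, 5                  # inner radius, outer radius, intervals
dr = (b - a) / n                       # 0.6
u_a, u_b = 0.0038731, 0.0030769        # boundary values from the example
r = [a + i * dr for i in range(n + 1)]

lower = [0.0] * (n + 1)
diag = [1.0] * (n + 1)                 # boundary rows are identity rows
upper = [0.0] * (n + 1)
rhs = [0.0] * (n + 1)
rhs[0], rhs[n] = u_a, u_b
for i in range(1, n):                  # interior nodes
    lower[i] = 1 / dr**2 - 1 / (2 * r[i] * dr)
    diag[i] = -2 / dr**2 - 1 / r[i]**2
    upper[i] = 1 / dr**2 + 1 / (2 * r[i] * dr)

u = solve_tridiagonal(lower, diag, upper, rhs)
for ri, ui in zip(r, u):
    print(f"r = {ri:.1f}  u = {ui:.7f}")
```

Only three arrays of length n+1 are stored, which is the same idea as storing only the nonzeros of a sparse matrix.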
Graph and “stencil”
Matrix-vector multiply kernel: y(i) ← y(i) + A(i,j)×x(j)
for each row i
    for k = ptr[i] to ptr[i+1]-1 do
        y[i] = y[i] + val[k]*x[ind[k]]
[Figure: y = A*x, with the representation of A as the arrays ptr, ind, and val]
SpMV: y = y + A*x; store and do arithmetic only on the nonzero entries
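As a concrete instance of the kernel, here is a minimal sketch using 0-based arrays `ptr`, `ind`, and `val` in compressed sparse row layout. The 3 × 3 example matrix is made up for illustration.

```python
def spmv_csr(ptr, ind, val, x, y):
    """Compute y = y + A*x for A stored row by row in (ptr, ind, val)."""
    for i in range(len(ptr) - 1):              # for each row i
        for k in range(ptr[i], ptr[i + 1]):    # nonzeros of row i
            y[i] += val[k] * x[ind[k]]         # ind[k] is the column index
    return y

# Illustrative matrix:
# [2 0 1]
# [0 3 0]
# [4 0 5]
ptr = [0, 2, 3, 5]          # row i occupies val[ptr[i]:ptr[i+1]]
ind = [0, 2, 1, 0, 2]       # column index of each stored nonzero
val = [2.0, 1.0, 3.0, 4.0, 5.0]

x = [1.0, 2.0, 3.0]
result = spmv_csr(ptr, ind, val, x, [0.0, 0.0, 0.0])
print(result)
```

The indirect access `x[ind[k]]` is what makes locality and, in parallel, communication depend on the sparsity pattern.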
y[i] = (row i of A) * x … a sparse dot product
[Figure: vectors x and y partitioned across processors P1-P4]
May require communication
02/09/2010 CS267 Lecture 7 17
[Figure: 6 × 6 sparse matrix with rows and columns labeled 1 to 6, and the corresponding graph on nodes 1 to 6]
[Figure: matrix-vector product with the matrix and both vectors partitioned across processors P0-P4]
[Figure: 6 × 6 sparse matrix and its graph on nodes 1 to 6, split into two partitions]
blocks, which represent the 7 (bidirectional) edges
loaded into cache or registers may be reused (temporal locality)
elements in the cache may be used (spatial locality)
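[Equation: general linear second-order PDE with terms in $u_{xx}$, $u_{xy}$, $u_{yy}$, $u_x$, $u_y$]
where A - H are functions of x and y only
Elliptic (steady state heat equations)
Parabolic (heat transfer equations)
Hyperbolic (wave equations)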
[Figure: temperature distribution in a plate with steam on three sides and an ice bath on the fourth; legend bands 0-20, 20-40, 40-60, 60-80, 80-100]
at particular (x, y) location in plate
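$$u_{xx} + u_{yy} = 0$$
[Figure: plate discretized on a grid with both axes numbered 1 to 10]
$$u_{xx} \approx \frac{1}{h^2}\,(u_{i+1,j} - 2u_{i,j} + u_{i-1,j}), \qquad u_{yy} \approx \frac{1}{h^2}\,(u_{i,j+1} - 2u_{i,j} + u_{i,j-1})$$
$$4u_{i,j} - u_{i-1,j} - u_{i+1,j} - u_{i,j-1} - u_{i,j+1} = 0$$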
02/01/2011 CS267 Lecture 5 29
[Figure: matrix L for the 2D mesh, with 4 on the diagonal and -1 for each mesh neighbor]
Graph and “5 point stencil”
3D case is analogous (7 point stencil)
For i = 1 to n
    for j = 1 to n
        w[i][j] = (u[i-1][j] + u[i+1][j] + u[i][j-1] + u[i][j+1]) / 4.0;
Swap w and u
[Figure: stencil in which w(i,j) is computed from u(i-1,j), u(i+1,j), u(i,j-1), u(i,j+1)]
Start with initial values. Iteratively update variables based on equations.
For i = 1, n
    For j = 1, n
        u[i][j] = (u[i-1][j] + u[i+1][j] + u[i][j-1] + u[i][j+1]) / 4.0;
[Figure: 5-point stencil around point (i,j), with u updated in place]
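The two loops above differ only in where the new value is written: the first writes into a second array w and swaps (a Jacobi sweep), the second overwrites u in place (a Gauss-Seidel sweep). A minimal sketch of one sweep of each follows, assuming a square grid stored as a list of lists whose outermost rows and columns hold fixed boundary values. The function names and the 4 × 4 example grid are illustrative.

```python
def jacobi_sweep(u):
    """One sweep that reads only old values; returns a new grid w."""
    n = len(u) - 2                       # interior is 1..n in each direction
    w = [row[:] for row in u]            # copy keeps the boundary values
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            w[i][j] = (u[i-1][j] + u[i+1][j] + u[i][j-1] + u[i][j+1]) / 4.0
    return w

def gauss_seidel_sweep(u):
    """One in-place sweep; later points see values already updated."""
    n = len(u) - 2
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            u[i][j] = (u[i-1][j] + u[i+1][j] + u[i][j-1] + u[i][j+1]) / 4.0
    return u

def make_grid():
    # 4 x 4 grid: boundary fixed at 1.0, 2 x 2 interior starts at 0.0
    g = [[1.0] * 4 for _ in range(4)]
    for i in (1, 2):
        for j in (1, 2):
            g[i][j] = 0.0
    return g

jac = jacobi_sweep(make_grid())
gs = gauss_seidel_sweep(make_grid())
print([row[1:3] for row in jac[1:3]])
print([row[1:3] for row in gs[1:3]])
```

Every point of the Jacobi sweep can be computed independently, while the in-place sweep carries a dependence from each point to the next in loop order.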
(a) 2D dependence graph (b) After red/black variable reordering
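Red/black reordering can be sketched as an in-place sweep done in two phases. The code below assumes that "red" means i + j even and "black" means i + j odd (the choice of which color is which is arbitrary), and uses the same averaging update as the 5-point stencil. All names are illustrative.

```python
def red_black_sweep(u):
    """Update all red points, then all black points, in place."""
    n = len(u) - 2
    for color in (0, 1):                 # 0 = red (i+j even), 1 = black (i+j odd)
        for i in range(1, n + 1):
            for j in range(1, n + 1):
                if (i + j) % 2 == color:
                    # all four neighbors have the opposite color
                    u[i][j] = (u[i-1][j] + u[i+1][j] + u[i][j-1] + u[i][j+1]) / 4.0
    return u

# 4 x 4 grid: boundary fixed at 1.0, 2 x 2 interior starts at 0.0
grid = [[1.0] * 4 for _ in range(4)]
for i in (1, 2):
    for j in (1, 2):
        grid[i][j] = 0.0

red_black_sweep(grid)
print([row[1:3] for row in grid[1:3]])
```

Within one color no point reads another point of the same color, so each phase can be executed in parallel without changing the result.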
neighboring processors.
Implemented using “ghost” regions. Adds memory overhead
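To make the ghost-region idea concrete, here is a minimal serial sketch on a 1D mesh (a simplification of the 2D case) split into two parts. Each part stores one extra ghost cell holding a copy of its neighbor's boundary value, and the ghost cells are refreshed after every sweep in place of real message passing. The function names are illustrative.

```python
def jacobi_1d(u, sweeps):
    """Reference: Jacobi sweeps on the whole 1D mesh, end points fixed."""
    for _ in range(sweeps):
        u = [u[0]] + [(u[i-1] + u[i+1]) / 2.0 for i in range(1, len(u) - 1)] + [u[-1]]
    return u

def sweep_local(part):
    # Update every cell except the first and last (boundary or ghost cells)
    return [part[0]] + [(part[i-1] + part[i+1]) / 2.0
                        for i in range(1, len(part) - 1)] + [part[-1]]

def jacobi_1d_two_parts(u, sweeps):
    """Same computation with the mesh split in two, using ghost cells."""
    mid = len(u) // 2
    left = u[:mid] + [u[mid]]            # owned cells + right ghost cell
    right = [u[mid - 1]] + u[mid:]       # left ghost cell + owned cells
    for _ in range(sweeps):
        left, right = sweep_local(left), sweep_local(right)
        # "Communication": refresh each ghost cell from the neighbor's owned cell
        left[-1] = right[1]
        right[0] = left[-2]
    return left[:-1] + right[1:]         # drop ghost cells when gathering

mesh = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
serial = jacobi_1d(mesh, 5)
split = jacobi_1d_two_parts(mesh, 5)
print(serial == split)
```

The extra storage is one cell per part here; in 2D or 3D it is a full layer of cells along each partition boundary, which is the memory overhead mentioned above.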
Example of Matrix Reordering Application
02/01/2011 CS267 Lecture 7 37
When performing Gaussian elimination:
Zeros can be filled in
Matrix can be reordered to reduce this fill
But it's not the same ordering as for parallelism
02/01/2011 CS267 Lecture 9 39
Computing
problems on irregular meshes
depending on the other variables.
functions; also called a state machine.
inputs change, based on an “event” from another part of the system; also called event driven simulation.
[Figure: domain partitioned among processors P1-P9]
Repeat
    compute locally to update local system
    barrier()
    exchange state info with neighbors
until done simulating
n*(p-1) edge crossings
2*n*(p^(1/2) - 1) edge crossings
minimizing “surface to volume ratio” of partition
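The two edge-crossing counts can be compared numerically. This sketch assumes an n × n mesh divided among p processors, either as p strips or as a √p × √p grid of square blocks (with p a perfect square). The function names and the sample values n = 1024, p = 16 are illustrative.

```python
import math

def crossings_strips(n, p):
    """Edges cut when an n x n mesh is split into p strips: n*(p-1)."""
    return n * (p - 1)

def crossings_blocks(n, p):
    """Edges cut when it is split into sqrt(p) x sqrt(p) blocks: 2*n*(sqrt(p)-1)."""
    q = math.isqrt(p)
    if q * q != p:
        raise ValueError("p must be a perfect square for a block partition")
    return 2 * n * (q - 1)

n, p = 1024, 16
strips = crossings_strips(n, p)
blocks = crossings_blocks(n, p)
print(strips, blocks)
```

Square blocks have less boundary per unit of area than strips, so the gap between the two counts widens as p grows.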
edge crossings = 6 edge crossings = 10
better
when an event arrives from another component:
empty ponds).
changing).
do you know when to execute a “receive”?
engine
force = external_force + nearby_force + far_field_force
% fishp = array of initial fish positions (stored as complex numbers)
% fishv = array of initial fish velocities (stored as complex numbers)
% fishm = array of masses of fish
% tfinal = final time for simulation (0 = initial time)
dt = .01; t = 0;
% loop over time steps
while t < tfinal,
    t = t + dt;
    fishp = fishp + dt*fishv;
    accel = current(fishp)./fishm;   % current depends on position
    fishv = fishv + dt*accel;
    % update time step (small enough to be accurate, but not too small)
    dt = min(.1*max(abs(fishv))/max(abs(accel)),1);
end
step, other data
communication.
physical region in which particles are located
boundary:
processors.
Communicate particles in boundary region to neighbors
Need to check for collisions between regions
Example: each square contains at most 3 particles
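A subdivision in which each square holds at most 3 particles can be built with a quadtree. The following is a minimal sketch, assuming 2D points inside a square region. The class name `QuadTree`, the depth limit, and the sample points are illustrative assumptions rather than anything specified on the slide.

```python
class QuadTree:
    """Square cell that splits into four children when it holds too many points."""

    def __init__(self, x0, y0, size, capacity=3, depth=0, max_depth=16):
        self.x0, self.y0, self.size = x0, y0, size
        self.capacity = capacity
        self.depth, self.max_depth = depth, max_depth
        self.points = []
        self.children = None              # None while this cell is a leaf

    def insert(self, p):
        if self.children is not None:
            self._child_for(p).insert(p)
            return
        self.points.append(p)
        # Split once the cell exceeds its capacity (depth limit guards
        # against endless splitting on coincident points)
        if len(self.points) > self.capacity and self.depth < self.max_depth:
            self._split()

    def _split(self):
        half = self.size / 2.0
        self.children = [
            QuadTree(self.x0 + dx * half, self.y0 + dy * half, half,
                     self.capacity, self.depth + 1, self.max_depth)
            for dy in (0, 1) for dx in (0, 1)
        ]
        old, self.points = self.points, []
        for q in old:                     # push stored points down
            self._child_for(q).insert(q)

    def _child_for(self, p):
        half = self.size / 2.0
        ix = 1 if p[0] >= self.x0 + half else 0
        iy = 1 if p[1] >= self.y0 + half else 0
        return self.children[2 * iy + ix]

    def leaves(self):
        if self.children is None:
            return [self]
        return [leaf for c in self.children for leaf in c.leaves()]

particles = [(0.1, 0.1), (0.2, 0.15), (0.15, 0.3), (0.3, 0.2),
             (0.8, 0.8), (0.7, 0.2), (0.2, 0.9), (0.6, 0.6)]
tree = QuadTree(0.0, 0.0, 1.0, capacity=3)
for pt in particles:
    tree.insert(pt)
leaf_sizes = [len(leaf.points) for leaf in tree.leaves()]
print(leaf_sizes)
```

Dense regions end up with small squares and sparse regions with large ones, which is what lets the work per square stay bounded.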
communication.
needs to “visit” every other particle.
Implement by rotating particle sets.
particles
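Rotating particle sets can be sketched serially as a ring: each of p simulated processors keeps its own block of particles and, at every step, receives a traveling block from its neighbor, so that after p steps every particle has "visited" every other one. The sketch assumes a 1D inverse-square pair force chosen only for illustration. `pair_force`, `forces_direct`, and `forces_ring` are hypothetical names.

```python
def pair_force(xi, mi, xj, mj):
    # Illustrative 1D inverse-square attraction of particle j on particle i
    d = xj - xi
    return mi * mj * d / abs(d) ** 3

def forces_direct(x, m):
    """Reference: every particle interacts with every other particle."""
    n = len(x)
    return [sum(pair_force(x[i], m[i], x[j], m[j]) for j in range(n) if j != i)
            for i in range(n)]

def forces_ring(x, m, p):
    """Same forces, computed by rotating blocks of particles around a ring."""
    n = len(x)
    assert n % p == 0, "sketch assumes equal-sized blocks"
    b = n // p
    home = [list(range(k * b, (k + 1) * b)) for k in range(p)]
    travel = [blk[:] for blk in home]      # traveling copies start at home
    f = [0.0] * n
    for _ in range(p):
        for k in range(p):                 # each "processor" works locally
            for i in home[k]:
                for j in travel[k]:
                    if i != j:
                        f[i] += pair_force(x[i], m[i], x[j], m[j])
        # Pass each traveling block to the next processor in the ring
        travel = travel[-1:] + travel[:-1]
    return f

xs = [0.0, 1.0, 2.5, 4.0, 5.5, 7.0, 8.0, 9.5]
ms = [1.0, 2.0, 1.5, 1.0, 3.0, 0.5, 2.0, 1.0]
direct = forces_direct(xs, ms)
ring = forces_ring(xs, ms, 4)
print(max(abs(a - c) for a, c in zip(direct, ring)))
```

Each step only moves data between ring neighbors, so the total work is still all pairs but the communication pattern stays nearest-neighbor.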
solve on a regular mesh:
1) Particles are moved to nearby mesh points (scatter)
2) Solve mesh problem
3) Forces are interpolated at particles from mesh points (gather)
resembles a single large particle.
regular mesh, where it is easier to compute forces
as a group, when far away