SLIDE 1

Lixia Liu, Zhiyuan Li
Purdue University, USA
PPOPP 2010, January 2010

Work supported in part by NSF through grants ST-HEC-0444285, CPA-0702245 and CPA-0811587, and by a Google Fellowship.

SLIDE 2

• Multicore architecture
  ▪ Multiple cores per chip
  ▪ Modest on-chip caches
  ▪ Memory bandwidth issue
    • Increasing gap between CPU speed and off-chip memory bandwidth
    • Increasing bandwidth consumption by aggressive hardware prefetching
• Software
  ▪ Many optimizations increase the memory bandwidth requirement
    • Parallelization, software prefetching, ILP
  ▪ Some optimizations reduce the memory bandwidth requirement
    • Array contraction, index compression
  ▪ Loop transformations to improve data locality
    • Loop tiling, loop fusion and others
    • Restricted by data/control dependences

SLIDE 3

Loop tiling is used to increase data locality. Example program: a PDE iterative solver.

The base implementation:

do t = 1, itmax
  call update(a, n, f)
  ! Compute residual and perform convergence test
  error = residual(a, n)
  if (error .le. tol) then
    exit
  endif
end do
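For concreteness, a minimal sketch of what the update step might look like for a 5-point Jacobi sweep on an n-by-n grid. This is an illustrative assumption, not the authors' code: the scratch array anew and the convention that f already carries the h**2 mesh factor are both choices made here.

    subroutine update(a, n, f)
      implicit none
      integer, intent(in)    :: n
      real(8), intent(inout) :: a(n, n)
      real(8), intent(in)    :: f(n, n)
      real(8) :: anew(n, n)
      integer :: i, j
      ! Synchronous Jacobi sweep: every new value is computed only from
      ! values of the previous time step, hence the scratch array.
      do j = 2, n - 1
        do i = 2, n - 1
          anew(i, j) = 0.25d0 * (a(i-1, j) + a(i+1, j) &
                               + a(i, j-1) + a(i, j+1) - f(i, j))
        end do
      end do
      a(2:n-1, 2:n-1) = anew(2:n-1, 2:n-1)
    end subroutine update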

SLIDE 4

Tiling is skewed to satisfy data dependences.

After tiling, parallelism only exists within a tile, due to data dependences between tiles.

SLIDE 5

The tiled version with speculated execution:

do t = 1, itmax/M + 1
  ! Save the old result into a buffer as checkpoint
  oldbuf(1:n, 1:n) = a(1:n, 1:n)
  ! Execute a chunk of M iterations after tiling
  call update_tile(a, n, f, M)
  ! Compute residual and perform convergence test
  error = residual(a, n)
  if (error .le. tol) then
    call recovery(oldbuf, a, n, f)
    exit
  end if
end do

Questions:
1. How to select the chunk size?
2. Is the recovery overhead necessary?
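The slides do not spell out recovery. One plausible sketch, assuming it rolls back to the checkpoint and re-executes one iteration at a time with a per-iteration convergence test, so the loop stops at the exact iteration where the base algorithm would have exited (tol is hard-coded here for brevity; update and residual are the routines used elsewhere in the deck):

    subroutine recovery(oldbuf, a, n, f)
      implicit none
      integer, intent(in)    :: n
      real(8), intent(in)    :: oldbuf(n, n)
      real(8), intent(inout) :: a(n, n)
      real(8), intent(in)    :: f(n, n)
      real(8), external      :: residual
      real(8), parameter     :: tol = 1.0d-6  ! assumed; in practice passed in
      real(8) :: error
      ! Roll back to the checkpoint taken before the speculative chunk ...
      a = oldbuf
      ! ... then redo the iterations one at a time, testing after each one.
      do
        call update(a, n, f)
        error = residual(a, n)
        if (error .le. tol) exit
      end do
    end subroutine recovery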

SLIDE 6

Mitigate the memory bandwidth problem:

Apply data locality optimizations to challenging cases

Relax restrictions imposed by data/control dependences

SLIDE 7

Basic idea: allow the use of old neighboring values in the computation, while still converging (as sketched below).

Originally proposed to reduce communication cost and synchronization overhead.

Convergence rate of asynchronous algorithms [1]:
  ▪ May slow down the convergence rate

Our contribution is to use the asynchronous model to improve parallelism and locality simultaneously:
  ▪ Relax dependences
  ▪ Monotone exit condition

[1] Frommer, A. and Szyld, D. B. Asynchronous two-stage iterative methods. Numer. Math. 69, 2, Dec. 1994.
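To make the relaxation concrete, a minimal sketch of an asynchronous sweep; the routine name update_async and the OpenMP parallelization are assumptions for illustration. The grid is updated in place, so a neighboring value read by one thread may be old or already updated in the current sweep; the asynchronous model admits both, which is what frees the computation to be reordered, tiled, and parallelized:

    subroutine update_async(a, n, f)
      implicit none
      integer, intent(in)    :: n
      real(8), intent(inout) :: a(n, n)
      real(8), intent(in)    :: f(n, n)
      integer :: i, j
      ! In-place update: a(i-1,j) etc. may hold a stale value or one
      ! written earlier in this sweep; no second array, no sweep barrier.
      !$omp parallel do private(i)
      do j = 2, n - 1
        do i = 2, n - 1
          a(i, j) = 0.25d0 * (a(i-1, j) + a(i+1, j) &
                            + a(i, j-1) + a(i, j+1) - f(i, j))
        end do
      end do
      !$omp end parallel do
    end subroutine update_async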
SLIDE 8

The tiled version without recovery:

do t = 1, itmax/M + 1
  ! Execute a chunk of M iterations after tiling
  call update_tile(a, n, f, M)
  ! Compute residual and perform convergence test
  error = residual(a, n)
  if (error .le. tol) then
    exit
  end if
end do

With a monotone exit condition, the exit test stays satisfied once it is met, so iterating past the convergence point still yields a converged result; the checkpoint and recovery of the speculative version are therefore unnecessary.

SLIDE 9

Achieve parallelism across the grid
  ▪ Not just within a tile

Apply loop tiling to improve data locality
  ▪ Requires a partition of the time steps into chunks

Eliminate the recovery overhead

SLIDE 10

Chunk size: the number of iterations executed speculatively in the tiled code.

Ideal if we could predict the exact number of iterations needed to converge; however, it is unknown until convergence happens.

Too large a chunk, and we pay overshooting overhead; too small, and we get poor data reuse and poor data locality.

SLIDE 11

Poor solutions:
  ▪ Use a constant chunk size (randomly picked)
  ▪ Estimate based on the theoretical convergence rate

A better solution: adaptive chunk size
  ▪ Use the latest convergence progress to predict how many more iterations are required to converge, where r_i is the residual error of the i-th round of tiled code (a sketch follows below).
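The prediction formula itself is not shown in this extraction. A minimal sketch under a geometric-convergence assumption (the function next_chunk and its arguments are hypothetical names): estimate the per-iteration residual reduction factor from the last chunk, solve for the number of iterations still needed to reach the tolerance, and clamp the result to a lower bound.

    function next_chunk(r_prev, r_cur, m_cur, tol, m_min) result(m_next)
      implicit none
      real(8), intent(in) :: r_prev, r_cur  ! residuals r_{i-1} and r_i
      integer, intent(in) :: m_cur          ! iterations in the last chunk
      real(8), intent(in) :: tol            ! convergence threshold
      integer, intent(in) :: m_min          ! lower bound on the chunk size
      integer :: m_next
      real(8) :: rho
      ! Assume the residual shrinks geometrically: r_i = r_{i-1} * rho**m_cur.
      rho = (r_cur / r_prev) ** (1.0d0 / dble(m_cur))
      if (rho .ge. 1.0d0) then
        m_next = m_cur   ! no measurable progress; keep the previous size
      else
        ! Solve r_cur * rho**m <= tol for m and round up.
        m_next = max(m_min, ceiling(log(tol / r_cur) / log(rho)))
      end if
    end function next_chunk

A lower bound of 1 or 8 here would correspond to the adaptive-1 and adaptive-8 variants evaluated on slide 18.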

SLIDE 12

Platforms for experiments:
  Intel Q6600, AMD8350 Barcelona, Intel E5530 Nehalem

Evaluated numerical methods: Jacobi, GS, SOR

Performance results:
  Synchronous model vs. asynchronous model with the best chunk size
  Original code vs. loop tiling
  Impact of the chunk size
  Adaptive chunk selection vs. the ideal chunk size

SLIDE 13

Peak bandwidth of our platforms:

Machine | Model   | Cores | L1            | L2            | L3           | BW (GB/s) | SBW (GB/s)
A       | AMD8350 | 4x4   | 64KB private  | 512KB private | 4x2MB shared | 21.6      | 18.9
B       | Q6600   | 1x4   | 32KB private  | 2x4MB shared  | N/A          | 8.5       | 4.79
C       | E5530   | 2x4   | 256KB private | 1MB private   | 2x8MB shared | 51        | 31.5

SLIDE 14

Machine A

[Figure: speedup vs. chunk size on Machine A for the tiled, tiled-norec, async-base, and async-tiled versions]

Performance varies with the chunk size! The async-tiled version is the best!

Machine       | Kernel | parallel | tiled | tiled-norec | async-base | async-tiled
A (16 cores)  | Jacobi | 5.95     | 16.76 | 27.24       | 5.47       | 39.11

SLIDE 15

Machine B

[Figure: speedup vs. chunk size on Machine B for the tiled, tiled-norec, async-base, and async-tiled versions]

Poor performance without tiling (async-base and parallel)!

Machine      | Kernel | parallel | tiled | tiled-norec | async-base | async-tiled
B (4 cores)  | Jacobi | 1.01     | 2.55  | 3.44        | 1.01       | 3.67

SLIDE 16

Machine C

[Figure: speedup vs. chunk size (20 to 160) on Machine C for the tiled, tiled-norec, async-base, and async-tiled versions]

Machine      | Kernel | parallel | tiled | tiled-norec | async-base | async-tiled
C (8 cores)  | Jacobi | 3.73     | 8.53  | 12.69       | 3.76       | 13.39

SLIDE 17

Machine | Kernel | parallel | tiled | tiled-norec | async-base | async-tiled
A       | GS     | 5.49     | 12.76 | 22.02       | 26.19      | 30.09
B       | GS     | 0.68     | 5.69  | 9.25        | 4.90       | 14.72
C       | GS     | 3.54     | 8.20  | 11.86       | 11.00      | 19.56
A       | SOR    | 4.50     | 11.99 | 21.25       | 29.08      | 31.42
B       | SOR    | 0.65     | 5.24  | 8.54        | 7.34       | 14.87
C       | SOR    | 3.84     | 7.53  | 11.51       | 11.68      | 19.10

  • The asynchronous tiled version performs better than the synchronous tiled version (even without the recovery cost)
  • The asynchronous baseline suffers more on machine B due to less available memory bandwidth

SLIDE 18

adaptive-1: the lower bound of the chunk size is 1
adaptive-8: the lower bound of the chunk size is 8

[Figure: speedup vs. initial chunk size (20 to 100) for Jacobi and GS, comparing the async-tiled, adaptive-1, and adaptive-8 versions]

SLIDE 19
• Showed how to benefit from the asynchronous model by relaxing data and control dependences
  ▪ Improves parallelism and data locality (via loop tiling) at the same time
• An adaptive method to determine the chunk size
  ▪ Needed because the iteration count is usually unknown in practice
• Good performance enhancement when tested on three well-known numerical kernels on three different multicore systems

SLIDE 20

Thank you!
