  1. Multi-Core Computing: Optimization Techniques
     Instructor: Hamid Sarbazi-Azad, Department of Computer Engineering, Sharif University of Technology, Fall 2014.
     Some slides come from Dr. Cristina Amza (http://www.eecg.toronto.edu/~amza/) and Professor Daniel Etiemble (http://www.lri.fr/~de/).

  2. Returning to Sequential vs. Parallel
     - Sequential execution time: t seconds.
     - Startup overhead of parallel execution: t_st seconds (depends on the architecture).
     - (Ideal) parallel execution time: t/p + t_st.
     - If t/p + t_st > t, there is no gain. (A small numeric example follows this slide.)

     General Idea
     - Parallelism is limited by dependences.
     - Restructure code to eliminate or reduce dependences.
     - This is sometimes possible by the compiler, but it is good to know how to do it by hand.
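     To illustrate the break-even point above, with numbers chosen purely for illustration: with t = 10 s, p = 4, and t_st = 0.5 s, the ideal parallel time is 10/4 + 0.5 = 3 s, a speedup of about 3.3. With a startup overhead of t_st = 8 s, the parallel time would be 10.5 s > t, and parallelization loses.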

  3. Optimizations: Example

         for (i = 0; i < 100000; i++)
             a[i + 1000] = a[i] + 1;

     This loop cannot be parallelized as is. It may be parallelized by applying certain code transformations (one possible transformation is sketched after this slide).

     Reorganize code such that:
     - dependences are removed or reduced
     - large pieces of parallel work emerge
     - loop bounds become known
     - ...
     Code can become messy, and there is a point of diminishing returns.
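     One possible transformation, shown here as a sketch rather than the slides' own solution: the loop carries a dependence of distance 1000 (iteration i writes a[i + 1000], which iteration i + 1000 reads), so it can be strip-mined into blocks of 1000 iterations. The blocks must still run in order, but the iterations inside each block are independent and can be parallelized, e.g. with OpenMP:

         /* Sketch: strip-mining the loop above along its dependence
          * distance of 1000.  The outer loop over blocks stays sequential
          * because block k+1 reads values written by block k; the inner
          * loop has no dependence shorter than 1000, so its iterations
          * are independent.  Assumes a[] has at least 101000 elements,
          * as in the original loop. */
         for (int k = 0; k < 100000; k += 1000) {
             #pragma omp parallel for
             for (int i = k; i < k + 1000; i++)
                 a[i + 1000] = a[i] + 1;
         }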

  4. Factors that Determine Speedup
     - Characteristics of the parallel code:
       - granularity
       - load balance
       - locality
     - Synchronization and communication.

     Granularity
     - Granularity = the size of the program unit that is executed by a single processor.
     - It may be a single loop iteration, a set of loop iterations, etc.
     - Fine granularity leads to:
       - (positive) the ability to use lots of processors
       - (positive) finer-grain load balancing
       - (negative) increased overhead

  5. Granularity and Critical Sections
     Small granularity => more processors involved => more critical-section accesses => more contention overhead => lower performance!

     Load Balance
     - Load imbalance = differences in execution time between processors between barriers.
     - Is the execution time predictable?
       - regular data parallel: yes
       - irregular data parallel or pipeline: perhaps

  6. Static Load Balancing
     - Block:
       - best locality
       - possibly poor load balance
     - Cyclic:
       - better load balance
       - worse locality
     - Block-cyclic:
       - (mostly) the load-balancing advantages of cyclic
       - better locality

     Dynamic Load Balancing
     - Centralized: a single task queue.
       - easy to program
       - excellent load balance
     - Distributed: one task queue per processor.
       - less communication/synchronization
     (An OpenMP sketch of these distributions follows this slide.)
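     As a concrete reference point, and only as a sketch (OpenMP is the notation these slides use later; work() is a hypothetical stand-in for a per-iteration computation), the static and dynamic strategies above map roughly onto OpenMP's schedule clause:

         #include <omp.h>

         /* hypothetical stand-in for the per-iteration computation */
         static double work(double v) { return v * v; }

         void distribute(double *x, int n) {
             #pragma omp parallel for schedule(static)     /* block: one contiguous chunk per thread */
             for (int i = 0; i < n; i++) x[i] = work(x[i]);

             #pragma omp parallel for schedule(static, 1)  /* cyclic: iterations dealt out round-robin */
             for (int i = 0; i < n; i++) x[i] = work(x[i]);

             #pragma omp parallel for schedule(static, 4)  /* block-cyclic: round-robin chunks (of 4 here) */
             for (int i = 0; i < n; i++) x[i] = work(x[i]);

             #pragma omp parallel for schedule(dynamic, 4) /* centralized queue: idle threads grab the next chunk */
             for (int i = 0; i < n; i++) x[i] = work(x[i]);
         }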

  7. Dynamic Load Balancing (cont.)
     - Task stealing:
       - Processes normally remove and insert tasks from their own queue.
       - When their queue is empty, they remove (steal) task(s) from other queues.
       - Extra overhead and programming difficulty.
       - Better load balancing.

     Semi-static Load Balancing
     - Measure the cost of program parts.
     - Use the measurements to partition the computation.
     - This can be done once, every iteration, or every n iterations.

  8. Example: Molecular Dynamics (MD)
     - Simulation of a set of bodies under the influence of physical laws.
     - Atoms, molecules, ...
     - These simulations have the same basic structure.

     Molecular Dynamics (Skeleton)

         for some number of timesteps {
             for all molecules i
                 for all other molecules j
                     force[i] += f( loc[i], loc[j] );
             for all molecules i
                 loc[i] = g( loc[i], force[i] );
         }

  9. Molecular Dynamics (cont.)
     To reduce the amount of computation, account only for interactions with nearby molecules:

         for some number of timesteps {
             for all molecules i
                 for all nearby molecules j
                     force[i] += f( loc[i], loc[j] );
             for all molecules i
                 loc[i] = g( loc[i], force[i] );
         }

  10. Molecular Dynamics (cont.)
      For each molecule i, keep:
      - the number of nearby molecules: count[i]
      - an array of the indices of its nearby molecules: index[j], for 0 <= j < count[i]

          for some number of timesteps {
              for( i=0; i<num_mol; i++ )
                  for( j=0; j<count[i]; j++ )
                      force[i] += f( loc[i], loc[index[j]] );
              for( i=0; i<num_mol; i++ )
                  loc[i] = g( loc[i], force[i] );
          }

  11. Molecular Dynamics (simple)

          for some number of timesteps {
              parallel for( i=0; i<num_mol; i++ )
                  for( j=0; j<count[i]; j++ )
                      force[i] += f( loc[i], loc[index[j]] );
              parallel for( i=0; i<num_mol; i++ )
                  loc[i] = g( loc[i], force[i] );
          }

      - Simple to program (an OpenMP rendering is sketched after this slide).
      - Possibly poor load balance:
        - a block distribution of the i iterations (molecules) could lead to an uneven neighbor distribution
        - cyclic does not help
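      A sketch of the simple version above in OpenMP, written as one timestep. The element type, the per-molecule neighbor lists index[i][j], and the helpers f() and g() are assumptions made only to keep the example self-contained:

          /* assumed helpers, not defined in the slides */
          double f(double loc_i, double loc_j);    /* pairwise force contribution */
          double g(double loc_i, double force_i);  /* position update */

          void md_step(int num_mol, double loc[], double force[],
                       const int count[], int *const index[])
          {
              /* each iteration i writes only force[i], so the i iterations
                 are independent */
              #pragma omp parallel for schedule(static)
              for (int i = 0; i < num_mol; i++)
                  for (int j = 0; j < count[i]; j++)
                      force[i] += f(loc[i], loc[index[i][j]]);

              /* again exactly one independent write (loc[i]) per iteration */
              #pragma omp parallel for schedule(static)
              for (int i = 0; i < num_mol; i++)
                  loc[i] = g(loc[i], force[i]);
          }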

  12. Better Load Balance
      - Assign iterations such that each processor gets roughly the same number of neighbors.
      - Keep an array of "assign records":
        - size: the number of processors
        - two elements per record:
          - beginning i value (molecule)
          - ending i value (molecule)
      - Recompute the partition periodically. (A sketch of computing this partition follows this slide.)

      Frequency of Balancing
      - Every time the neighbor list is recomputed:
        - once, during initialization
        - every iteration
        - every n iterations
      - Trade-off: extra overhead vs. a better approximation and better load balance.
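      One way to compute those assign records, shown as a sketch only (the struct and variable names are assumptions, not the slides' code): walk the per-molecule neighbor counts and cut the range of molecules so that each of the P processors receives roughly an equal share of the total neighbor count.

          struct assign_record { int begin; int end; };  /* molecules [begin, end) */

          void partition(const int count[], int num_mol, int P,
                         struct assign_record assign[])
          {
              long long total = 0;
              for (int i = 0; i < num_mol; i++) total += count[i];

              int i = 0;
              long long done = 0;
              for (int p = 0; p < P; p++) {
                  assign[p].begin = i;
                  /* after processor p, roughly (p+1)/P of the total work
                     should be covered */
                  long long target = total * (p + 1) / P;
                  while (i < num_mol && done < target) done += count[i++];
                  assign[p].end = i;
              }
              assign[P - 1].end = num_mol;  /* make sure every molecule is assigned */
          }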

  13. Some Hints for Vectorization and SIMDization
      Using pointers prevents vectorization:

          /* array indexing: easy for the compiler to vectorize */
          int a[100];
          for (i = 0; i < 100; i++)
              a[i] = i;

          /* the same loop through a pointer: harder to vectorize */
          int a[100];
          int *p;
          p = a;
          for (i = 0; i < 100; i++)
              *p++ = i;

      Loop-Carried Dependencies

          S1:  A[i]   = A[i] + B[i];
          S2:  B[i+1] = C[i] + D[i];

      can be reordered (aligned) into

          S2:  B[i+1] = C[i]   + D[i];
          S1*: A[i+1] = A[i+1] + B[i+1];
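      To make the reordering above concrete, here is a sketch (not from the slides, assuming n >= 1 and that the arrays do not alias) of the alignment transformation applied to a full loop: S1 is peeled for the first iteration and S2 for the last, after which the remaining loop body carries no cross-iteration dependence and can be vectorized.

          void original(double *A, double *B, const double *C, const double *D, int n)
          {
              /* loop-carried dependence: B[i+1] written by S2 is read by S1
                 in the next iteration */
              for (int i = 0; i < n; i++) {
                  A[i]   = A[i] + B[i];      /* S1 */
                  B[i+1] = C[i] + D[i];      /* S2 */
              }
          }

          void aligned(double *A, double *B, const double *C, const double *D, int n)
          {
              A[0] = A[0] + B[0];                    /* peeled S1, i = 0 */
              for (int i = 0; i < n - 1; i++) {
                  B[i+1] = C[i]   + D[i];            /* S2 */
                  A[i+1] = A[i+1] + B[i+1];          /* S1 shifted by one iteration */
              }
              B[n] = C[n-1] + D[n-1];                /* peeled S2, i = n-1 */
              /* each iteration now touches only its own A[i+1] and B[i+1],
                 so the loop body can be vectorized */
          }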

  14. Dependencies do not parallelize!
      Dependencies imply sequentiality. They must be broken, if possible, in order to be able to parallelize.

          1. A <- B + C
          2. D <- A * B
          3. E <- C - D

      (Statement 2 depends on 1, and 3 depends on 2.)

      Privatization

          do i = 1, N
            P: A    = ...
            Q: X(i) = A + ...
          end do

      In the example above, Q depends on P (through the shared scalar A), and because of this the loop cannot be parallelized. Assuming there is no circular dependence of P on Q, the privatization method breaks this dependence:

          pardo i = 1, N
            P: A(i) = ...
            Q: X(i) = A(i) + ...
          end pardo

  15. Dependencies do not parallelize!
      In OpenMP, if explicit privatization (expanding A into an array) is used:

          #pragma omp parallel for
          for( i=0; i<N; i++ ) {
              A[i] = ... ;
              X[i] = A[i] + ... ;
          }

      Similar results can be achieved if A is instead declared private:

          #pragma omp parallel for private(A)
          for( i=0; i<N; i++ ) {
              A    = ... ;
              X[i] = A + ... ;
          }

  16. Dependencies do not parallelize!
      Reduction

          do i = 1, N
            P: X(i) = ...
            Q: Sum  = Sum + X(i)
          end do

      Statement Q depends on itself, since the sum is built up sequentially. This type of calculation can nevertheless be parallelized, depending on the underlying system. For example, on a shared-memory system the sum can be computed in log2(N) time, provided there are enough processors to carry out the additions in parallel.

          pardo i = 1, N
            P: X(i) = ...
            Q: Sum  = sum_reduce(X(i))
          end pardo
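      In OpenMP the same pattern is usually written with the reduction clause. A minimal sketch (the helper compute() is a hypothetical stand-in for statement P):

          #include <omp.h>

          /* hypothetical stand-in for "X(i) = ..." */
          double compute(int i) { return (double)i; }

          double reduce_example(double *X, int N)
          {
              double sum = 0.0;
              /* each thread accumulates a private partial sum; OpenMP
                 combines the partial sums when the loop ends */
              #pragma omp parallel for reduction(+:sum)
              for (int i = 0; i < N; i++) {
                  X[i] = compute(i);   /* statement P */
                  sum += X[i];         /* statement Q, now a reduction */
              }
              return sum;
          }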

  17. Dependencies do not parallelize!
      Induction
      If a loop contains a recurrence on one of the variables, e.g.

          x(i) = x(i-1) + y(i)

      one can use carry generation and propagation techniques (i.e., solve the recurrence) in order to parallelize the code. This method is called induction. (A prefix-sum sketch follows this slide.)

      Memory Access Pattern
      When all the elements of a cache line are used before the next line is referenced, the access pattern is often referred to as "unit stride."

          /* unit stride: vectorizable (V) */
          for (int i=0; i<n; i++)
              for (int j=0; j<n; j++)
                  sum += a[i][j];

          /* non-unit stride: not vectorizable as is (NV) */
          for (int j=0; j<n; j++)
              for (int i=0; i<n; i++)
                  sum += a[i][j];
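      A sketch of the induction idea (not the slides' code; the names and the two-pass blocked-scan structure are my assumptions): the recurrence x(i) = x(i-1) + y(i) has the closed form x(i) = x(0) + (y(1) + ... + y(i)), i.e. a prefix sum, which parallelizes with one local scan per thread plus a short scan over the per-block totals.

          #include <omp.h>
          #include <stdlib.h>

          /* x[i] = x[i-1] + y[i] for 1 <= i < n, computed as a prefix sum
             of y added to x[0]; assumes n >= 1 and x[0] already holds its
             initial value. */
          void prefix_induction(double *x, const double *y, int n)
          {
              int maxt = omp_get_max_threads();
              double *block_sum = calloc((size_t)maxt + 1, sizeof *block_sum);

              #pragma omp parallel
              {
                  int T  = omp_get_num_threads();   /* actual team size */
                  int t  = omp_get_thread_num();
                  int lo = 1 + (int)((long long)(n - 1) * t / T);
                  int hi = 1 + (int)((long long)(n - 1) * (t + 1) / T);

                  /* pass 1: each thread scans its own contiguous block of y */
                  double s = 0.0;
                  for (int i = lo; i < hi; i++) { s += y[i]; x[i] = s; }
                  block_sum[t + 1] = s;
                  #pragma omp barrier

                  /* short sequential scan over the per-block totals */
                  #pragma omp single
                  for (int b = 1; b <= T; b++)
                      block_sum[b] += block_sum[b - 1];
                  /* implicit barrier at the end of single */

                  /* pass 2: add the carry-in from x[0] and all earlier blocks */
                  double carry = x[0] + block_sum[t];
                  for (int i = lo; i < hi; i++) x[i] += carry;
              }
              free(block_sum);
          }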
