ManyCore Computing: The Impact on Numerical Software for Linear Algebra Libraries


SLIDE 1

ManyCore Computing: The Impact on Numerical Software for Linear Algebra Libraries

Jack Dongarra

INNOVATIVE COMPUTING LABORATORY

University of Tennessee, Oak Ridge National Laboratory, University of Manchester

11/20/2007

SLIDE 2

Performance Projection, Top500 Data

[Chart: projected Top500 performance, 1993-2009, for SUM, N=1, and N=500, spanning 100 MF/s to 100 PF/s]

SLIDE 3

What Will a Petascale System Look Like?

Possible petascale system:

  • 1. Number of cores per node: 10 - 100 cores
  • 2. Performance per node: 100 - 1,000 GFlop/s
  • 3. Number of nodes: 1,000 - 10,000 nodes
  • 4. Latency inter-nodes: 1 μsec
  • 5. Bandwidth inter-nodes: 10 GB/s
  • 6. Memory per node: 10 GB

  • Part I: First rule in linear algebra: have an efficient DGEMM
    Motivation in: 2. performance per node, 5. bandwidth inter-nodes, 6. memory per node
  • Part II: Algorithms for multicore and latency-avoiding algorithms for LU, QR, …
    Motivation in: 1. number of cores per node, 2. performance per node, 4. latency inter-nodes
  • Part III: Algorithms for fault tolerance
    Motivation in: 1. number of cores per node, 3. number of nodes
SLIDE 4

Major Changes to Software

  • Must rethink the design of our software
    Another disruptive technology
    Similar to what happened with cluster computing and message passing
    Rethink and rewrite the applications, algorithms, and software
  • Numerical libraries, for example, will change
    Both LAPACK and ScaLAPACK will undergo major changes to accommodate this

SLIDE 5

Coding for an Abstract Multicore

Parallel software for multicores should have two characteristics:

  • Fine granularity: a high level of parallelism is needed, and cores will probably be associated with relatively small local memories. This requires splitting an operation into tasks that operate on small portions of data in order to reduce bus traffic and improve data locality.
  • Asynchronicity: as the degree of thread-level parallelism (TLP) grows and the granularity of the operations becomes smaller, the presence of synchronization points in a parallel execution seriously affects the efficiency of an algorithm.

SLIDE 6

ManyCore: Parallelism for the Masses

  • We are looking at the following concepts in designing the next numerical library implementation:

    Dynamic data-driven execution
    Self-adapting
    Block data layout
    Mixed precision in the algorithm
    Exploit hybrid architectures
    Fault tolerant methods

SLIDE 7

A New Generation of Software

Algorithms follow hardware evolution in time:

  • LINPACK (70's): relies on Level-1 BLAS operations (vector operations)
  • LAPACK (80's): relies on Level-3 BLAS operations (blocking, cache friendly)
  • ScaLAPACK (90's): relies on PBLAS and message passing (distributed memory)
  • PLASMA (00's): relies on new algorithms (many-core friendly): a DAG/scheduler, block data layout, some extra kernels

These new algorithms
  • have a very low granularity and scale very well (multicore, petascale computing, …)
  • remove a lot of dependencies among the tasks (multicore, distributed computing)
  • avoid latency (distributed computing, out-of-core)
  • rely on fast kernels

These new algorithms need new kernels and rely on efficient scheduling algorithms.

SLIDE 8

A New Generation of Software: Parallel Linear Algebra Software for Multicore Architectures (PLASMA)

Algorithms follow hardware evolution in time:

  • LINPACK (70's): relies on Level-1 BLAS operations (vector operations)
  • LAPACK (80's): relies on Level-3 BLAS operations (blocking, cache friendly)
  • ScaLAPACK (90's): relies on PBLAS and message passing (distributed memory)
  • PLASMA (00's): relies on new algorithms (many-core friendly): a DAG/scheduler, block data layout

These new algorithms
  • have a very low granularity and scale very well (multicore, petascale computing, …)
  • remove a lot of dependencies among the tasks (multicore, distributed computing)
  • avoid latency (distributed computing, out-of-core)
  • rely on fast kernels

These new algorithms need new kernels and rely on efficient scheduling algorithms.

SLIDE 9

Developing Parallel Algorithms

[Diagram: two ways of obtaining parallelism — LAPACK on top of a threaded BLAS (parallelism inside the BLAS), versus parallelism expressed above the library with PThreads/OpenMP calling a sequential BLAS]

SLIDE 10

Steps in the LAPACK LU

  • DGETF2 (LAPACK): factor a panel
  • DLASWP (LAPACK): backward swap
  • DLASWP (LAPACK): forward swap
  • DTRSM (BLAS): triangular solve
  • DGEMM (BLAS): matrix multiply
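For concreteness, here is a sketch of how those kernels combine in a right-looking blocked LU in the spirit of LAPACK's DGETRF. It assumes a column-major matrix and a LAPACKE/CBLAS installation that exposes the dgetf2 and dlaswp wrappers; it is an illustration, not the LAPACK source.

    #include <stddef.h>
    #include <lapacke.h>
    #include <cblas.h>

    /* Right-looking blocked LU with partial pivoting, spelled out with the kernels
       named on the slide. Column-major A (m x n, leading dimension lda), pivots in
       ipiv, block size nb. */
    void blocked_lu(lapack_int m, lapack_int n, double *A, lapack_int lda,
                    lapack_int *ipiv, lapack_int nb)
    {
        lapack_int minmn = (m < n) ? m : n;
        for (lapack_int j = 0; j < minmn; j += nb) {
            lapack_int jb = (nb < minmn - j) ? nb : minmn - j;

            /* DGETF2: unblocked factorization of the panel A(j:m-1, j:j+jb-1) */
            LAPACKE_dgetf2(LAPACK_COL_MAJOR, m - j, jb,
                           &A[j + (size_t)j * lda], lda, &ipiv[j]);
            for (lapack_int i = j; i < j + jb; i++)
                ipiv[i] += j;                   /* make the panel's pivot indices global */

            /* DLASWP: apply the panel's row swaps to the columns left and right of it */
            if (j > 0)
                LAPACKE_dlaswp(LAPACK_COL_MAJOR, j, A, lda, j + 1, j + jb, ipiv, 1);
            if (j + jb < n)
                LAPACKE_dlaswp(LAPACK_COL_MAJOR, n - j - jb, &A[(size_t)(j + jb) * lda],
                               lda, j + 1, j + jb, ipiv, 1);

            if (j + jb < n) {
                /* DTRSM: triangular solve producing the block row of U */
                cblas_dtrsm(CblasColMajor, CblasLeft, CblasLower, CblasNoTrans, CblasUnit,
                            jb, n - j - jb, 1.0, &A[j + (size_t)j * lda], lda,
                            &A[j + (size_t)(j + jb) * lda], lda);
                /* DGEMM: rank-jb update of the trailing submatrix, A22 -= A21 * U12 */
                if (j + jb < m)
                    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                                m - j - jb, n - j - jb, jb, -1.0,
                                &A[(j + jb) + (size_t)j * lda], lda,
                                &A[j + (size_t)(j + jb) * lda], lda, 1.0,
                                &A[(j + jb) + (size_t)(j + jb) * lda], lda);
            }
        }
    }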

SLIDE 11

LU Timing Profile (4-core system)

[Chart: time spent in each component — DGETF2, DLASWP(L), DLASWP(R), DTRSM, DGEMM — for a threaded run with no lookahead; execution proceeds in bulk-synchronous phases]

SLIDE 12

Adaptive Lookahead: Dynamic

Event-driven multithreading: reorganizing algorithms to use this approach.

SLIDE 13

Fork-Join vs. Dynamic Execution

[Diagram: fork-join execution over time with a parallel BLAS; threads synchronize between tasks]

Experiments on Intel's quad-core Clovertown with 2 sockets (8 threads)

SLIDE 14

Fork-Join vs. Dynamic Execution

[Diagram: fork-join execution with a parallel BLAS versus DAG-based dynamic scheduling over time; the dynamic schedule shows the time saved]

Experiments on Intel's quad-core Clovertown with 2 sockets (8 threads)

SLIDE 15

Achieving Asynchronicity

The matrix factorization can be represented as a DAG:

  • nodes: tasks that operate on "tiles"
  • edges: dependencies among tasks

Tasks can be scheduled asynchronously and in any order as long as dependencies are not violated.
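As an illustration of that principle, the sketch below expresses a tile Cholesky factorization as a DAG of tasks, with OpenMP 4 task dependences standing in for the dynamic scheduler described in the talk (this is not PLASMA code). It assumes an SPD matrix whose order is divisible by the tile size and a LAPACKE/CBLAS build; POTRF/TRSM/SYRK/GEMM are the per-tile kernels.

    #include <lapacke.h>
    #include <cblas.h>
    #include <omp.h>

    /* Tile Cholesky (lower) of a column-major n x n matrix with tile size nb.
       Each kernel call is a task; depend clauses encode the DAG edges, and the
       runtime executes tasks as soon as their dependencies are satisfied. */
    void tile_cholesky(int n, double *A, int lda, int nb)
    {
        int nt = n / nb;                       /* number of tile rows/columns */
        #pragma omp parallel
        #pragma omp single
        for (int k = 0; k < nt; k++) {
            double *Akk = &A[k*nb + (size_t)k*nb*lda];
            #pragma omp task depend(inout: Akk[0])
            LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', nb, Akk, lda);      /* POTRF */

            for (int i = k + 1; i < nt; i++) {
                double *Aik = &A[i*nb + (size_t)k*nb*lda];
                #pragma omp task depend(in: Akk[0]) depend(inout: Aik[0])
                cblas_dtrsm(CblasColMajor, CblasRight, CblasLower, CblasTrans,
                            CblasNonUnit, nb, nb, 1.0, Akk, lda, Aik, lda); /* TRSM */
            }
            for (int i = k + 1; i < nt; i++) {
                double *Aik = &A[i*nb + (size_t)k*nb*lda];
                double *Aii = &A[i*nb + (size_t)i*nb*lda];
                #pragma omp task depend(in: Aik[0]) depend(inout: Aii[0])
                cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans,
                            nb, nb, -1.0, Aik, lda, 1.0, Aii, lda);         /* SYRK */
                for (int j = k + 1; j < i; j++) {
                    double *Ajk = &A[j*nb + (size_t)k*nb*lda];
                    double *Aij = &A[i*nb + (size_t)j*nb*lda];
                    #pragma omp task depend(in: Aik[0], Ajk[0]) depend(inout: Aij[0])
                    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans,
                                nb, nb, nb, -1.0, Aik, lda, Ajk, lda,
                                1.0, Aij, lda);                              /* GEMM */
                }
            }
        }
    }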

SLIDE 16

Achieving Asynchronicity

A critical path can be defined as the shortest path that connects all the nodes with the highest number of outgoing edges.

Priorities: [DAG figure]

SLIDE 17

Achieving Asynchronicity

  • Very fine granularity
  • Few dependencies, i.e., high flexibility for the scheduling of tasks
  • Asynchronous scheduling
  • No idle times
  • Some degree of adaptivity
  • Better locality thanks to block data layout

SLIDE 18

Cholesky Factorization: DAG-based Dependency Tracking

[Diagram: tiles labeled by index pairs (1:1, 1:2, 2:2, …, 4:4) with the dependencies among them]

Dependencies expressed by the DAG are enforced on a tile basis:
  • fine-grained parallelization
  • flexible scheduling

SLIDE 19

Cholesky on the IBM Cell

  • Pipelining: between loop iterations.
  • Double buffering: within BLAS, between BLAS, between loop iterations.
  • Result: minimum load imbalance, minimum dependency stalls, minimum memory stalls (no waiting for data).

Achieves 174 Gflop/s; 85% of peak in SP.
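The double-buffering pattern mentioned above can be sketched in plain C: while the kernel works on one local buffer, the next tile is fetched into the other. The dma_get_async/dma_wait/compute_tile helpers below are hypothetical stand-ins (on a real SPE they would be MFC DMA intrinsics and an SGEMM tile kernel); only the overlap structure is the point.

    #include <stddef.h>
    #include <string.h>

    #define TILE_ELEMS 1024                         /* tile size; must fit in local store */

    /* Hypothetical stand-ins for the SPE's DMA calls and BLAS kernel. */
    static void dma_get_async(double *local, const double *remote, size_t bytes, int tag)
    { (void)tag; memcpy(local, remote, bytes); }    /* a real MFC get would be asynchronous */
    static void dma_wait(int tag) { (void)tag; }
    static void compute_tile(double *tile) { for (int i = 0; i < TILE_ELEMS; i++) tile[i] *= 2.0; }

    void process_stream(const double *remote_tiles, int ntiles)
    {
        static double buf[2][TILE_ELEMS];           /* two local-store buffers */
        int cur = 0;

        dma_get_async(buf[cur], remote_tiles, sizeof buf[cur], cur);
        for (int t = 0; t < ntiles; t++) {
            int nxt = 1 - cur;
            if (t + 1 < ntiles)                     /* prefetch the next tile into the other buffer */
                dma_get_async(buf[nxt], remote_tiles + (size_t)(t + 1) * TILE_ELEMS,
                              sizeof buf[nxt], nxt);
            dma_wait(cur);                          /* ensure the current tile has arrived */
            compute_tile(buf[cur]);                 /* compute overlaps the in-flight transfer */
            cur = nxt;
        }
    }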

SLIDE 20

Cholesky: Using 2 Cell Chips

[Performance chart]

SLIDE 21

Parallelism in LAPACK: Blocked Storage

[Diagram: column-major storage]

SLIDE 22

Parallelism in LAPACK: Blocked Storage

[Diagram: column-major vs. blocked storage]

SLIDE 23

Parallelism in LAPACK: Blocked Storage

[Diagram: column-major vs. blocked storage]

SLIDE 24

Parallelism in LAPACK: Blocked Storage

The use of blocked storage can significantly improve performance.

[Chart: blocking speedup of DGEMM and DTRSM, roughly 0.8x to 2x, for block sizes 64, 128, and 256]
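One way to picture the blocked (tile) storage behind that speedup: repack the column-major matrix so that each nb x nb tile is contiguous in memory, which is what gives the locality benefit measured above. A minimal sketch, assuming n is divisible by nb (the function name is ours, not a LAPACK routine):

    #include <stdlib.h>
    #include <string.h>

    /* Copy a column-major n x n matrix A (leading dimension lda) into block data
       layout: tile (i,j) occupies nb*nb contiguous doubles, tiles stored column
       by column of tiles. Returns a newly allocated buffer. */
    double *to_block_layout(int n, int nb, const double *A, int lda)
    {
        int nt = n / nb;
        double *T = malloc((size_t)n * n * sizeof(double));
        for (int j = 0; j < nt; j++)
            for (int i = 0; i < nt; i++) {
                double *tile = T + ((size_t)j * nt + i) * nb * nb;
                for (int jj = 0; jj < nb; jj++)       /* copy one tile column at a time */
                    memcpy(tile + (size_t)jj * nb,
                           A + (size_t)(j * nb + jj) * lda + i * nb,
                           nb * sizeof(double));
            }
        return T;
    }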

SLIDE 25

Multicore Friendly Algorithms

[Chart: QR factorization on a 2-socket Clovertown (peak 85.12 Gflop/s); Gflop/s vs. problem size (2000-14000) for the DAG-based tiled algorithm, Intel MKL, and LAPACK with BLAS threading]

SLIDE 26

Intel's Clovertown Quad Core

Three implementations of LU factorization on a quad-core, 2-socket board (8 threads):
  • 1. LAPACK (BLAS fork-join parallelism)
  • 2. ScaLAPACK (message passing using memory copy)
  • 3. DAG-based (dynamic scheduling)

[Chart: Mflop/s vs. problem size (1000-15000) for the three implementations; 8-core experiments]

SLIDE 27

With the Hype on Cell & PS3 We Became Interested

  • The PlayStation 3's CPU is based on a "Cell" processor
  • Each Cell contains a PowerPC processor and 8 SPEs (an SPE is a processing unit: SPU + DMA engine)
  • An SPE is a self-contained vector processor that acts independently from the others
    4-way SIMD floating point units capable of a total of 25.6 Gflop/s @ 3.2 GHz per SPE
    204.8 Gflop/s peak!
  • The catch is that this is for 32-bit floating point (single precision, SP)
  • 64-bit floating point runs at 14.6 Gflop/s total for all 8 SPEs!
    Divide the SP peak by 14: a factor of 2 because of DP and 7 because of latency issues

SLIDE 28

Moving Data Around on the Cell

  • 256 KB local store per SPE; 25.6 GB/s injection bandwidth
  • Worst case, memory-bound operations (no reuse of data): 3 data movements (2 in and 1 out) with 2 ops (SAXPY)
  • For the Cell this would be roughly 4.3 Gflop/s (25.6 GB/s x 2 ops / 12 B)
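The memory-bound figure follows directly from the bandwidth and the bytes moved per flop; as a quick check for single-precision SAXPY (3 movements of 4 bytes per 2 flops):

    25.6 GB/s / (3 x 4 B) x 2 flops ≈ 4.3 Gflop/s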

SLIDE 29

32 or 64 bit Floating Point Precision?

  • A long time ago 32-bit floating point was used
    Still used in scientific apps, but limited
  • Most apps use 64-bit floating point
    Accumulation of round-off error: a 10 TFlop/s computer running for 4 hours performs > 1 Exaflop (10^18) ops
    Ill-conditioned problems: IEEE SP exponent bits are too few (8 bits, 10^±38)
    Critical sections need higher precision
  • Sometimes need extended precision (128-bit floating point)
  • However, some can get by with 32-bit floating point in some parts
  • Mixed precision is a possibility
    Approximate in lower precision and then refine or improve the solution to high precision
SLIDE 30

Idea Goes Something Like This…

  • Exploit 32-bit floating point as much as possible
    Especially for the bulk of the computation
  • Correct or update the solution with selective use of 64-bit floating point to provide a refined result
  • Intuitively:
    Compute a 32-bit result,
    Calculate a correction to the 32-bit result using selected higher precision, and
    Perform the update of the 32-bit result with the correction using high precision.

SLIDE 31

Mixed-Precision Iterative Refinement

  • Iterative refinement for dense systems, Ax = b, can work this way:

    L U = lu(A)                  SINGLE   O(n^3)
    x = L\(U\b)                  SINGLE   O(n^2)
    r = b - Ax                   DOUBLE   O(n^2)
    WHILE || r || not small enough
        z = L\(U\r)              SINGLE   O(n^2)
        x = x + z                DOUBLE   O(n)
        r = b - Ax               DOUBLE   O(n^2)
    END

  Wilkinson, Moler, Stewart, & Higham provide error bounds for SP floating-point results when using DP floating point.

SLIDE 32

Mixed-Precision Iterative Refinement

  • Iterative refinement for dense systems, Ax = b, can work this way:

    L U = lu(A)                  SINGLE   O(n^3)
    x = L\(U\b)                  SINGLE   O(n^2)
    r = b - Ax                   DOUBLE   O(n^2)
    WHILE || r || not small enough
        z = L\(U\r)              SINGLE   O(n^2)
        x = x + z                DOUBLE   O(n)
        r = b - Ax               DOUBLE   O(n^2)
    END

  Wilkinson, Moler, Stewart, & Higham provide error bounds for SP floating-point results when using DP floating point. It can be shown that using this approach we can compute the solution to 64-bit floating point precision.

  • Requires extra storage; total is 1.5 times normal
  • O(n^3) work is done in lower precision
  • O(n^2) work is done in high precision
  • Problems if the matrix is ill-conditioned in SP, condition number ~ O(10^8)
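LAPACK provides this scheme as the DSGESV driver (the routine plotted on slide 36). A minimal calling sketch, assuming a LAPACKE installation that exposes the dsgesv wrapper; the test matrix is made diagonally dominant so it is well conditioned in single precision:

    #include <stdio.h>
    #include <stdlib.h>
    #include <lapacke.h>

    /* Solve Ax = b with the mixed-precision driver: single-precision LU plus
       double-precision iterative refinement. DSGESV falls back to a full DP
       solve internally if refinement does not converge (iter < 0 on return). */
    int main(void)
    {
        const lapack_int n = 1000, nrhs = 1;
        double *A = malloc((size_t)n * n * sizeof *A);
        double *b = malloc((size_t)n * nrhs * sizeof *b);
        double *x = malloc((size_t)n * nrhs * sizeof *x);
        lapack_int *ipiv = malloc((size_t)n * sizeof *ipiv);
        lapack_int iter, info;

        for (lapack_int j = 0; j < n; j++) {
            for (lapack_int i = 0; i < n; i++)
                A[i + (size_t)j * n] = (i == j) ? 2.0 * n : 1.0;
            b[j] = 1.0;
        }

        info = LAPACKE_dsgesv(LAPACK_COL_MAJOR, n, nrhs, A, n, ipiv,
                              b, n, x, n, &iter);
        printf("info = %d, refinement iterations = %d\n", (int)info, (int)iter);

        free(A); free(b); free(x); free(ipiv);
        return (int)info;
    }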
SLIDE 33

Results for Mixed Precision Iterative Refinement for Dense Ax = b

Architectures (BLAS):
  1. Intel Pentium III Coppermine (Goto)
  2. Intel Pentium III Katmai (Goto)
  3. Sun UltraSPARC IIe (Sunperf)
  4. Intel Pentium IV Prescott (Goto)
  5. Intel Pentium IV-M Northwood (Goto)
  6. AMD Opteron (Goto)
  7. Cray X1 (libsci)
  8. IBM PowerPC G5 (2.7 GHz) (VecLib)
  9. Compaq Alpha EV6 (CXML)
  10. IBM SP Power3 (ESSL)
  11. SGI Octane (ATLAS)

  • Single precision is faster than DP because:
    • Higher parallelism within vector units: 4 ops/cycle (usually) instead of 2 ops/cycle
    • Reduced data motion: 32-bit data instead of 64-bit data
    • Higher locality in cache: more data items in cache

SLIDE 34

Results for Mixed Precision Iterative Refinement for Dense Ax = b

Architectures (BLAS):
  1. Intel Pentium III Coppermine (Goto)
  2. Intel Pentium III Katmai (Goto)
  3. Sun UltraSPARC IIe (Sunperf)
  4. Intel Pentium IV Prescott (Goto)
  5. Intel Pentium IV-M Northwood (Goto)
  6. AMD Opteron (Goto)
  7. Cray X1 (libsci)
  8. IBM PowerPC G5 (2.7 GHz) (VecLib)
  9. Compaq Alpha EV6 (CXML)
  10. IBM SP Power3 (ESSL)
  11. SGI Octane (ATLAS)

Architecture (BLAS-MPI)           # procs      n      DP Solve / SP Solve   DP Solve / Iter Ref   # iter
AMD Opteron (Goto, OpenMPI MX)       32      22627          1.85                  1.79               6
AMD Opteron (Goto, OpenMPI MX)       64      32000          1.90                  1.83               6

  • Single precision is faster than DP because:
    • Higher parallelism within vector units: 4 ops/cycle (usually) instead of 2 ops/cycle
    • Reduced data motion: 32-bit data instead of 64-bit data
    • Higher locality in cache: more data items in cache

SLIDE 35

IBM Cell 3.2 GHz, Ax = b

[Chart: Gflop/s vs. matrix size (500-4500) — SP peak (204 Gflop/s), SGEMM (embarrassingly parallel), SP Ax=b (0.30 s), DP peak (15 Gflop/s), DP Ax=b (3.9 s)]

SLIDE 36

IBM Cell 3.2 GHz, Ax = b

[Chart: as on slide 35, adding DSGESV (mixed precision, 0.47 s) — an 8.3x speedup over the DP solve]

SLIDE 37

Cholesky on the Cell, Ax = b, A = A^T, x^T A x > 0

[Chart: single precision performance, mixed precision performance using iterative refinement, and the method achieving 64-bit accuracy]

For the SPEs: standard C code and C language SIMD extensions (intrinsics)

SLIDE 38

Sparse Linear Algebra

  • Computational speed doesn't matter
    Peak 204 Gflop/s
  • Memory bus matters
    25 GB/s = 12 Gflop/s, assuming the matrix is read from memory
  • In practice ~6 Gflop/s in SP using 8 SPEs

[Chart: aggregate memory bandwidth (GB/s) vs. number of Synergistic Processing Elements (1-8) for 64-, 128-, 256-, and 512-byte transfers]
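One way to arrive at the 12 Gflop/s figure, counting only the 4-byte single-precision matrix values and the 2 flops performed per nonzero (index and vector traffic, which roughly halve this in practice, are ignored):

    25 GB/s / 4 B per value x 2 flops ≈ 12 Gflop/s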

SLIDE 39

What About That PS3?

[Diagram: STI Cell processor — PowerPC core at 3.2 GHz plus 8 SPEs at 25.6 Gflop/s each]

  • 25 GB/s injection bandwidth
  • 200 GB/s between SPEs
  • 32-bit peak performance: 8 x 25.6 Gflop/s = 204.8 Gflop/s peak
  • 64-bit peak performance: 8 x 1.8 Gflop/s ≈ 14.6 Gflop/s peak
  • 512 MiB memory

SLIDE 40

PS3 Hardware Overview

[Diagram: the PS3's Cell — PowerPC core at 3.2 GHz plus 6 usable SPEs at 25.6 Gflop/s each; one SPE disabled/broken for yield issues and one reserved by the GameOS hypervisor]

  • 25 GB/s injection bandwidth
  • 200 GB/s between SPEs
  • 32-bit peak performance: 6 x 25.6 Gflop/s = 153.6 Gflop/s peak
  • 64-bit peak performance: 6 x 1.8 Gflop/s = 10.8 Gflop/s peak
  • 1 Gb/s NIC
  • 256 MiB memory

SLIDE 41

HPC in the Living Room

SLIDE 42

Matrix Multiply on a 4-Node PlayStation3 Cluster

What's good:
  • Very cheap: ~$4 per Gflop/s (with 32-bit floating point theoretical peak)
  • Fast local computations between SPEs
  • Perfect overlap between communications and computations is possible (Open MPI running): the PPE does communication via MPI, the SPEs do computation via SGEMMs

What's bad:
  • Gigabit network card: 1 Gb/s is too little for such computational power (150 Gflop/s per node)
  • Linux can only run on top of GameOS (hypervisor)
  • Extremely high network access latencies (120 μsec)
  • Low bandwidth (600 Mb/s)
  • Only 256 MB local memory
  • Only 6 SPEs

[Execution trace — gold: computation, 8 ms; blue: communication, 20 ms]

SLIDE 43

SUMMA on a 2x2 PlayStation3 Cluster

[Chart: SUMMA, model vs. measurements with 1 SPE — Gflop/s (35-100) vs. problem size (2000-8000)]

SLIDE 44

SUMMA on a 2x2 PlayStation3 Cluster

[Chart: SUMMA, model vs. measurements with 1 SPE — Gflop/s (35-100) vs. problem size (2000-8000)]

SLIDE 45

Users Guide for SC on PS3

  • SCOP3: A Rough Guide to Scientific Computing on the PlayStation 3
  • See webpage for details

SLIDE 46

How to Deal with Complexity?

  • Adaptivity is the key for applications to effectively use available resources whose complexity is exponentially increasing
  • Goal: automatically bridge the gap between the application and computers that are rapidly changing and getting more and more complex

SLIDE 47

Self-Adapting Software

  • Variation: many different algorithm implementations are generated automatically and tested for performance
  • Selection: the best performing implementation is sought by optimization

SLIDE 48

Self-Adapting Software

  • Huge search space (algorithms, parameters, ...)
  • Generate + Adapt (once per target) → Use (often)
  • Variation + Selection: automatic performance tuning

SLIDE 49

Self-Adapting Software

Automatically generated hardware-adapted libraries
  • Large sections of straight-line code produced

Examples:
  • Numerical linear algebra: ATLAS, OSKI
  • Discrete Fourier transforms: FFTW
  • Digital signal processing: SPIRAL
  • MPI collectives (UCB, UTK): FT-MPI

SLIDE 50

Generic Code Optimization

  • Can ATLAS-like techniques be applied to arbitrary code?
  • What do we mean by ATLAS-like techniques?
    Blocking, loop unrolling, data prefetch, functional unit scheduling, etc.
  • Referred to as empirical optimization
    Generate many variations and pick the best implementation by measuring the performance
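A toy version of that generate-and-measure loop, with a plain blocked matrix multiply standing in for an ATLAS-generated kernel and the candidate block sizes chosen arbitrarily:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Parameterized kernel: a simple blocked matrix multiply, C += A*B (column-major). */
    static void mm_blocked(int n, int nb, const double *A, const double *B, double *C)
    {
        for (int jj = 0; jj < n; jj += nb)
            for (int kk = 0; kk < n; kk += nb)
                for (int i = 0; i < n; i++)
                    for (int j = jj; j < jj + nb && j < n; j++) {
                        double s = C[i + (size_t)j * n];
                        for (int k = kk; k < kk + nb && k < n; k++)
                            s += A[i + (size_t)k * n] * B[k + (size_t)j * n];
                        C[i + (size_t)j * n] = s;
                    }
    }

    int main(void)
    {
        int n = 512, candidates[] = {16, 32, 64, 128}, best_nb = 0;
        double best_time = 1e30;
        double *A = calloc((size_t)n * n, sizeof *A);
        double *B = calloc((size_t)n * n, sizeof *B);
        double *C = calloc((size_t)n * n, sizeof *C);

        for (int c = 0; c < 4; c++) {               /* generate many variations ... */
            clock_t t0 = clock();
            mm_blocked(n, candidates[c], A, B, C);
            double t = (double)(clock() - t0) / CLOCKS_PER_SEC;
            printf("nb = %3d : %.3f s\n", candidates[c], t);
            if (t < best_time) {                    /* ... pick the best by measuring */
                best_time = t;
                best_nb = candidates[c];
            }
        }
        printf("selected block size: %d\n", best_nb);
        free(A); free(B); free(C);
        return 0;
    }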

SLIDE 51

Applying Self-Adapting Software

  • Numerical and non-numerical applications
    BLAS-like ops / message passing collectives
  • Static or dynamic determination of the code to be used
    Performed at make time / every time invoked
  • Independent of or dependent on the data presented
    Same on each data set / depends on properties of the data

SLIDE 52

Multi, Many, …, Many-More

  • Parallelism for the masses
  • Multi, Many, Many-MoreCore are here and coming fast
  • Our approach for numerical libraries:
    Use dynamic DAG-based scheduling
    Minimize sync: non-blocking communication
    Maximize locality: block data layout
  • Autotuners should take on a larger, or at least complementary, role to compilers in translating parallel programs.
  • What's needed is a long-term, balanced investment in hardware, software, algorithms, and applications in the HPC ecosystem.

SLIDE 53

Collaborators / Support

Alfredo Buttari, UTK
Julien Langou, UColorado
Julie Langou, UTK
Piotr Luszczek, MathWorks
Jakub Kurzak, UTK
Stan Tomov, UTK