Linpack Evaluation on a Supercomputer with Heterogeneous Accelerators

Toshio Endo, Akira Nukada, Satoshi Matsuoka, Naoya Maruyama
Tokyo Institute of Technology, Japan
IPDPS 2010, Atlanta

GPU/Accelerators for High Performance Computing
- In HPC systems, power consumption has been and will remain a major concern
- GPUs and accelerators are promising for their excellent Flops/Watt ratio
             ClearSpeed X620   NVIDIA GeForce GTX285   ATI Radeon HD 4870
Speed (SP)   -                 1063 GFlops             1200 GFlops
Speed (DP)   80 GFlops         88 GFlops               240 GFlops
Memory BW    6.4 GB/s          159 GB/s                115 GB/s
Power        25 W              183 W                   160 W
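The Flops/Watt claim can be checked with simple arithmetic on this slide's double-precision and power figures. The sketch below is only an illustration; the function name `gflops_per_watt` and the dictionary layout are hypothetical, and the numbers are the peak values quoted on the slide, not measurements.

```python
# Double-precision peak (GFlops) and power (W) as quoted on the slide.
ACCELERATORS = {
    "ClearSpeed X620": (80.0, 25.0),
    "NVIDIA GeForce GTX285": (88.0, 183.0),
    "ATI Radeon HD 4870": (240.0, 160.0),
}


def gflops_per_watt(peak_gflops, watts):
    """Peak double-precision GFlops delivered per watt."""
    return peak_gflops / watts


if __name__ == "__main__":
    for name, (gflops, watts) in ACCELERATORS.items():
        print(f"{name}: {gflops_per_watt(gflops, watts):.2f} GFlops/W (DP)")
```

By these peak figures the ClearSpeed board has the highest double-precision Flops/Watt of the three, although its memory bandwidth is far lower.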

Heterogeneous Systems
Heterogeneous architectures that combine general-purpose CPUs and accelerators will be attractive for:
- Generality, provided by general-purpose CPUs
  • Typically x86/x86-64 CPUs
- Higher Flops/Watt ratio, provided by accelerators
  • GPUs, Cell processor, ClearSpeed…
Examples:
  • LANL RoadRunner: 1.4PF with 12240 PowerXCell 8i
  • NUDT Tianhe-1: 1.2PF with 5120 Radeon HD 4870
  • TokyoTech TSUBAME: 160TF with 680 Tesla S1070 GPUs + 648 ClearSpeed

Our Contribution
- Demonstrated scalability of a heterogeneous system, TSUBAME
- A Linpack implementation that cooperatively uses:
  • 10,368 Opteron cores
  • 612 Tesla GPUs
  • 648 ClearSpeed accelerators
  • 640 Xeon
- A different strategy than on Roadrunner or Tianhe-1 is required
- 87.01TFlops, #56 in Top500 ranking

LANL RoadRunner (2008)
The largest heterogeneous system
The first PetaFlops machine in the world!
- 6120 dual-core Opterons and 12240 PowerXCell 8i
- IBM blades
Peak performance is 1.4PFlops
- >90% comes from Cell
Linpack performance is 1.042PFlops
#2 in Top500 ranking

Tokyo-Tech TSUBAME Supercomputer
Tokyo-Tech Supercomputer and UBiquitously Accessible Mass-storage Environment
燕 “TSUBAME” also means “swallow”, the symbol mark of Tokyo-Tech

TSUBAME Basic Data
- 655-node Linux cluster
  • Sun Fire X4600
  • 8 dual-core Opteron 880 (= 16 cores) per node
  • 32GB DDR memory per node
  • And Tesla S1070 GPU and ClearSpeed accelerators
- ~1.1MW power consumption, 350 m² footprint
- SUSE Linux Enterprise 10
- Jobs are managed by a batch scheduler
  • A customized version of Sun N1 Grid Engine
- A production system used by >1,500 users

Accelerators Installed (1): NVIDIA Tesla S1070
- 4 GPUs in a 1U box
  • 800 watts/box
- Each GPU has:
  • 30 multiprocessors x 8 stream processors
  • 86GFlops (double prec)
  • 4GB GDDR3 memory
  • 102GB/s memory bandwidth
- Connected with hosts via external PCI-Express cables
  • 2 GPUs hang on a cable
- Programmed with the CUDA programming language
- 320 of the 655 TSUBAME nodes are each connected to 2 GPUs
  • ‘Inter-node’ heterogeneity
[Figure: Tesla S1070 box]

Accelerators Installed (2): ClearSpeed X620 Accelerator
- PCI-X board
  • 2 CSX600 x 96 SIMD cores
  • 80GFlops (double prec)
  • 1GB DDR memory
  • 6.4GB/s memory bandwidth
  • 25 watts/board
- Programmed with the ClearSpeed Cn programming language
- Each TSUBAME node has a board

TSUBAME Node with Hybrid Accelerators
[Figure: block diagram of a SunFire X4600 node]
- 8 dual-core Opteron CPUs (16 cores), 32GB memory
- ClearSpeed attached via PCI-X (1GB/s)
- 2 GPUs of Tesla attached via PCI-e gen1 x8 (2GB/s)
- SDR InfiniBand (1GB/s x 2) to other nodes

History of TSUBAME in Top500
Top500 list:         Jun06, Nov06, Jun07, Nov07, Jun08, Nov08, Jun09, Nov09
Rank:                7, 9, 14, 16, 24, 29, 41, 56
Linpack speed (TF):  38.18, 47.38, 48.88, 56.43, 67.70, 77.48, 87.01
Processors used (chart labels): Opteron, CS x 360, CS x 648, Xeon, Tesla
- The 3rd-ranked heterogeneous system
  • From Nov06 to Nov07, it was the 1st
- Continuous improvement, 7 times

What is Linpack?
- A numerical benchmark used in the Top500 supercomputer ranking (www.top500.org)
  • Solves a dense linear equation Ax = b of order N
  • A direct solver; total computation cost is O(N^3)
  • Users can configure N; in TSUBAME, N~1,000,000
- HPL (High-performance Linpack) by A. Petitet
  • A famous MPI parallel implementation, designed for uniform systems
  • Based on blocked LU-decomposition, with partial pivoting
  • The most time consuming part is matrix-multiplication (DGEMM)
  • Used as a basis of our implementation
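To give the O(N^3) cost a sense of scale, the sketch below estimates the run time of one solve. It assumes the standard leading-order operation count for an LU-based dense solve, (2/3)N^3, and ignores lower-order terms. The function names are hypothetical. Pairing N = 1,000,000 with the 87.01 TFlops result is an assumption for illustration only, since the slides give N only approximately.

```python
def lu_flops(n):
    """Leading-order operation count of an LU-based dense solve: (2/3) * n^3."""
    return 2.0 * n ** 3 / 3.0


def run_time_seconds(n, sustained_flops):
    """Time for one solve of order n at a given sustained speed (flop/s)."""
    return lu_flops(n) / sustained_flops


if __name__ == "__main__":
    n = 1_000_000                 # order of magnitude quoted on the slide
    speed = 87.01e12              # 87.01 TFlops, the reported Linpack result
    hours = run_time_seconds(n, speed) / 3600.0
    print(f"N = {n}: about {hours:.1f} hours at {speed / 1e12:.2f} TFlops")
```

Under these assumptions a single run takes on the order of two hours, which is why the efficiency of the DGEMM kernel dominates the result.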

HPL Algorithm
LU decomposition of N × N matrix A
for (k = 0; k < N; k += B)
    Panel factorization with partial pivoting to obtain L
    Broadcast L
    Row exchange, and compute U
    Update the rest part of matrix: A' = A' − L × U
DGEMM (the update of A') is the most time consuming
[Figure: matrix A of order N with block size B, showing the panel L, the block row U, and the trailing matrix A' at successive steps]
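The loop above can be sketched on a single process as follows. This is a minimal pure-Python illustration of blocked right-looking LU with partial pivoting, not HPL itself: there is no MPI, no broadcast, and rows are exchanged across the whole matrix immediately rather than after a broadcast. The name `blocked_lu` and its return convention are hypothetical.

```python
def blocked_lu(a, block):
    """Blocked right-looking LU with partial pivoting.

    Returns (lu, perm). lu stores the unit-lower factor L below the diagonal
    and U on and above it; row i of L*U equals row perm[i] of the input.
    """
    lu = [[float(x) for x in row] for row in a]
    n = len(lu)
    perm = list(range(n))
    for k in range(0, n, block):
        end = min(k + block, n)
        # Panel factorization with partial pivoting (columns k .. end-1).
        for j in range(k, end):
            p = max(range(j, n), key=lambda r: abs(lu[r][j]))
            if lu[p][j] == 0.0:
                raise ValueError("matrix is singular")
            lu[j], lu[p] = lu[p], lu[j]            # row exchange
            perm[j], perm[p] = perm[p], perm[j]
            for i in range(j + 1, n):
                lu[i][j] /= lu[j][j]
                for c in range(j + 1, end):
                    lu[i][c] -= lu[i][j] * lu[j][c]
        # Compute the block row of U by forward substitution (DTRSM).
        for j in range(k, end):
            for i in range(j + 1, end):
                for c in range(end, n):
                    lu[i][c] -= lu[i][j] * lu[j][c]
        # Update the rest of the matrix: A' = A' - L x U (DGEMM).
        for i in range(end, n):
            for j in range(k, end):
                lij = lu[i][j]
                for c in range(end, n):
                    lu[i][c] -= lij * lu[j][c]
    return lu, perm


if __name__ == "__main__":
    matrix = [[2, 1, 1, 0], [4, 3, 3, 1], [8, 7, 9, 5], [6, 7, 9, 8]]
    factors, order = blocked_lu(matrix, block=2)
    print(order)
```

The last triple loop is the DGEMM step; for large N it accounts for almost all of the arithmetic, which is why the rest of the talk concentrates on where and how DGEMM runs.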

Data Decomposition in HPL
- Matrix A is uniformly distributed with 2D block-cyclic distribution among processes
- Each process has a “partial-matrix”
[Figure: matrix distribution on 6 (=2x3) processes; each process applies A' = A' − L × U to its own blocks of the N × N matrix with block size B]
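A 2D block-cyclic distribution can be expressed as a simple index mapping. The sketch below shows the usual formula, assuming square blocks and a process grid whose coordinates start at (0, 0); the function names `owner` and `local_block_count` are hypothetical and are not HPL routines.

```python
def owner(i, j, block, p_rows, q_cols):
    """Process-grid coordinates (row, col) that own global element (i, j)."""
    return ((i // block) % p_rows, (j // block) % q_cols)


def local_block_count(n_blocks, coord, n_procs):
    """Number of blocks along one dimension held by grid coordinate `coord`."""
    return (n_blocks - coord + n_procs - 1) // n_procs


if __name__ == "__main__":
    # The 2 x 3 process grid from the slide, with a hypothetical block size of 2.
    for i in range(0, 8, 2):
        print([owner(i, j, 2, 2, 3) for j in range(0, 12, 2)])
```

Because blocks are dealt out cyclically, every process keeps a share of the trailing matrix A' until the end of the factorization, which balances the DGEMM work on a uniform system.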

Design Issues on Heterogeneous Systems
- Who computes?
  • Kernel (DGEMM, DTRSM): accelerators? Both CPU and accelerators?
  • Non-kernel
- Where are matrix data placed?
  • Host memory? Accelerator memory?
- Strategies depend on system architecture
  • We compare our decision with that on Roadrunner [PPoPP09]
  • More challenging on TSUBAME

Who Computes?
- Non-kernel
  • Only CPUs are used for MPI communication, pivoting…
- Kernel functions
  • On Roadrunner, Cells contribute 96% of performance
  • Ratio of CPUs is 4% ⇒ Only Cells are used
  • On TSUBAME, CPUs contribute 35%
  • Omitting any type of processors heavily degrades performance ⇒ All of CPUs, GPUs, ClearSpeed are used
[Figure: breakdown of peak performance (DP) per processor type]
             Roadrunner   TSUBAME
Opteron      46.7 TF      49.8 TF
Xeon         -            7.3 TF
ClearSpeed   -            52.2 TF
Tesla        -            53.9 TF
Cell         1410 TF      -
Total        1457 TF      163.2 TF

Where are matrix data placed? (1)
[Figure: memory capacities of one node]
                        A RR node        A TSUBAME node
Host memory             16GB             32GB
Accelerators            Cell             Tesla, ClearSpeed
Device memory           4GB              4GB (Tesla), 1GB (ClearSpeed)
Host mem : Device mem   16GB = 4GB x 4   32GB > 4GB x 2 + 1GB

Where are matrix data placed? (2)
- In Linpack, a larger matrix yields higher speed in Flops
  • ⇒ the matrix should be as large as host memory allows
- On RR,
  • (1) Device memory = Host memory
  • (2) Kernel computation is done only by Cells
  ⇒ Matrix data are on Cell device memory
- On TSUBAME,
  • Device memory < Host memory
  ⇒ Matrix data are usually on host memory
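The link between memory capacity and matrix size is simple arithmetic: an N × N double-precision matrix needs 8N² bytes. The sketch below estimates the largest N that fits, using the node count and per-node capacities from the slides. The function name and the 80% usable fraction are assumptions for illustration, not values from the talk.

```python
import math


def max_problem_size(nodes, bytes_per_node, fraction=0.8, bytes_per_element=8):
    """Largest N such that an N x N matrix fits in `fraction` of total memory."""
    usable = nodes * bytes_per_node * fraction
    return math.isqrt(int(usable // bytes_per_element))


if __name__ == "__main__":
    gib = 2 ** 30
    host_n = max_problem_size(655, 32 * gib)             # 32GB host memory per node
    device_n = max_problem_size(655, (2 * 4 + 1) * gib)  # 2 x 4GB Tesla + 1GB ClearSpeed
    print(f"matrix on host memory:   N up to about {host_n}")
    print(f"matrix on device memory: N up to about {device_n}")
```

The device-memory estimate is optimistic, since only part of the nodes have GPUs. Even so it allows a much smaller N than host memory does, which is the reason the matrix is kept on the host on TSUBAME.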

Executing Kernel Functions on Accelerators
- Matrix data is on host memory when the DGEMM function is called
- Pipelined DGEMM execution:
  (1) A part of the input data is moved from host to device
  (2) DGEMM is computed on accelerators
  (3) The results are moved back to host, then the steps repeat for the next partial matrix
More frequent and larger PCI-e/PCI-X communications are required than on RR
[Figure: host and device connected via PCI-e/PCI-X; (1) input data, (2) calc DGEMM(), (3) output data]
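A rough cost model shows why the PCI traffic matters and what pipelining buys. The sketch below compares compute time with transfer time for one DGEMM call and estimates the total time with and without overlap. It is an idealized model: it assumes the three stages overlap fully and that each chunk costs the same, which the slides do not claim. All function names are hypothetical.

```python
def dgemm_costs(m, n, k, peak_flops, link_bytes_per_s, elem=8):
    """Return (compute seconds, transfer seconds) for C(m x n) -= A(m x k) * B(k x n)."""
    flops = 2.0 * m * n * k
    moved = elem * (m * k + k * n + 2 * m * n)   # A and B in, C in and out
    return flops / peak_flops, moved / link_bytes_per_s


def serial_time(chunks, t_in, t_compute, t_out):
    """No overlap: every chunk pays for all three stages in turn."""
    return chunks * (t_in + t_compute + t_out)


def pipelined_time(chunks, t_in, t_compute, t_out):
    """Ideal 3-stage pipeline: fill once, then one chunk per slowest stage."""
    return t_in + t_compute + t_out + (chunks - 1) * max(t_in, t_compute, t_out)


if __name__ == "__main__":
    # Slide figures: 86 GFlops per Tesla GPU, 2 GB/s over PCI-e gen1 x8.
    # The matrix shape (m, n, k) is a hypothetical example.
    t_comp, t_xfer = dgemm_costs(8192, 8192, 1024, 86e9, 2e9)
    print(f"compute {t_comp:.2f} s, transfer {t_xfer:.2f} s")
```

When the transfer time is comparable to the compute time, as it can be for a small inner dimension k, the pipeline is limited by the PCI link rather than by the accelerator.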

Challenging Issues on TSUBAME
- Intra-node heterogeneity:
  • CPU/GPU/ClearSpeed are used for kernel
  • On RR, using only Cell is sufficient
- Inter-node heterogeneity:
  • Half the nodes have GPUs, while others don’t
  • On RR, nodes are uniform
- Frequent PCI-e/PCI-X communication:
  • The whole input/output is moved via PCI
  • On RR, matrix data always resides in Cell device memory
How can we run HPL, originally designed for uniform systems, efficiently?

Coping with Intra-node Heterogeneity
- We ‘virtualize’ heterogeneous processors at the BLAS layer
- Processors are providers of DGEMM performance
- We control the mapping between processes and processors
  • An MPI process divides its own sub-matrix with a proper ratio and throws DGEMM tasks to CPUs and accelerators
  • All processes should be mapped to processors of similar performance
[Figure: example of mapping between processes and processors during DGEMM]
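Dividing a sub-matrix "with a proper ratio" can be sketched as a proportional split of its columns. The example below divides columns according to each processor's DGEMM rate, using largest-remainder rounding so that every column is assigned. The function name, the rounding rule, and the CPU rate are assumptions; the slides do not say how the ratio is chosen.

```python
def split_columns(n_cols, rates):
    """Split n_cols among processors in proportion to their DGEMM rates."""
    total = sum(rates.values())
    shares = {name: n_cols * rate / total for name, rate in rates.items()}
    counts = {name: int(share) for name, share in shares.items()}
    leftover = n_cols - sum(counts.values())
    # Hand the remaining columns to the largest fractional parts.
    by_remainder = sorted(shares, key=lambda k: shares[k] - counts[k], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


if __name__ == "__main__":
    # 86 and 80 GFlops are the per-device peaks from the slides;
    # the CPU rate is a hypothetical placeholder.
    rates = {"tesla_gpu": 86.0, "clearspeed": 80.0, "cpu_cores": 40.0}
    print(split_columns(1000, rates))
```

In practice the rates would be measured DGEMM speeds including PCI transfer time rather than peak values, so that all processors finish their share at about the same time.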
