The Era of Heterogeneous Compute: Challenges and Opportunities

Sudhakar Yalamanchili
Computer Architecture and Systems Laboratory
Center for Experimental Research in Computer Systems
School of Electrical and Computer Engineering
Georgia Institute of Technology
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY

System Diversity
Heterogeneity is Mainstream
• Amazon EC2 GPU Instances
• Mobile Platforms
• Tianhe-1A
• Keeneland System

Outline
• Drivers and Evolution to Heterogeneous Computing
• The Ocelot Dynamic Execution Environment
• Dynamic Translation for Execution Models
• Dynamic Instrumentation of Kernels
• Related Projects

Evolution to Multicore
Power Wall: $P = \alpha C V_{dd}^{2} f + V_{dd} I_{st} + V_{dd} I_{leak}$
Performance over time (figure):
• 1980's: Pipelining (RISC)
• 1990's: Frequency Scaling (Instruction Level Parallelism)
• 2000 onward: Core Scaling (Multicore)
Examples: Intel Nehalem-EX: 8 cores; Tilera: 64 cores; NVIDIA Fermi: 480 cores
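A small numeric sketch can make the power wall concrete. It reads the expression as dynamic switching power plus short-circuit power plus leakage power, P = alpha*C*Vdd^2*f + Vdd*Ist + Vdd*Ileak. All numeric values below are made-up illustrative inputs, not figures from the talk, and the premise that supply voltage is raised together with frequency is an assumption stated only for the example.

```python
def dynamic_power(alpha: float, c: float, v_dd: float, f: float) -> float:
    """Switching power: activity factor * capacitance * Vdd^2 * frequency."""
    return alpha * c * v_dd ** 2 * f


def total_power(alpha: float, c: float, v_dd: float, f: float,
                i_st: float, i_leak: float) -> float:
    """Dynamic + short-circuit + leakage power, all in watts."""
    return dynamic_power(alpha, c, v_dd, f) + v_dd * i_st + v_dd * i_leak


# Illustrative (assumed) operating point: 1.0 V, 2 GHz.
ALPHA, C = 0.2, 1e-9          # activity factor, switched capacitance in farads
V0, F0 = 1.0, 2e9             # supply voltage in volts, clock frequency in hertz
I_ST, I_LEAK = 0.05, 0.10     # short-circuit and leakage currents in amperes

SCALE = 1.2                   # assumed: raise both Vdd and f by 20%
base_dyn = dynamic_power(ALPHA, C, V0, F0)
fast_dyn = dynamic_power(ALPHA, C, V0 * SCALE, F0 * SCALE)
ratio = fast_dyn / base_dyn   # equals SCALE ** 3, since P_dyn ~ Vdd^2 * f

print(f"baseline total power: {total_power(ALPHA, C, V0, F0, I_ST, I_LEAK):.3f} W")
print(f"dynamic power grows by a factor of {ratio:.3f}")
```

Under these assumptions a 20% gain in frequency costs roughly 73% more dynamic power, which is the kind of trade-off that favors adding cores over raising clock rates.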

Consolidation on Chip
Multiple Models of Computation; Multi-ISA
Intel Sandy Bridge: Vector Extensions; AES Instructions; Programmable Pipeline (GEN6)
Intel Knights Corner: Programmable Accelerator
PowerEN: 16 PowerPC cores; Accelerators:
• Crypto Engine
• RegEx Engine
• XML Engine
• Compression Engine

Major Customization Trends
Uniform ISA (Asymmetric), e.g., Knights Corner
• Minimal disruption to the software ecosystems
• Limited customization?
Multi-ISA (Heterogeneous), e.g., PowerEN
• Disruptive impact on the software stack?
• Higher degree of customization

Asymmetry vs. Heterogeneity
Categories (figure of tiles and memory controllers): Performance Asymmetry, Functional Asymmetry, Heterogeneous
Spectrum shown beneath the figure: Uniform ISA to Multi-ISA
• Complex cores and simple cores
• Shared instruction set architecture (ISA)
• Subset ISA
• Distinct microarchitecture
• Fault and migrate model of operation [1]
• Multiple voltage and frequency islands
• Different memory technologies: STT-RAM, PCM, Flash
• Multi-ISA
• Microarchitecture
• Memory & Interconnect hierarchy
[1] T. Li et al., “Operating system support for shared ISA asymmetric multi-core architectures,” in WIOSCA, 2008.

HPC Systems: Keeneland
Courtesy J. Vetter (GT/ORNL)
• 201 TFLOPS in 7 racks (90 sq ft incl. service area)
• 677 MFLOPS per watt on HPL (#9 on Green500, Nov 2010)
• Final delivery system planned for early 2012
• Integrated with NICS Datacenter GPFS and TG
• Full PCIe X16 bandwidth to all GPUs
System hierarchy with peak performance (figure):
• Xeon 5660: 67 GFLOPS
• M2070: 515 GFLOPS
• ProLiant SL390s G7 (2 CPUs, 3 GPUs), 24/18 GB: 1679 GFLOPS
• S6500 Chassis (4 Nodes): 6718 GFLOPS
• Rack (6 Chassis): 40306 GFLOPS
• Keeneland System (7 Racks), 12000-Series Director Switch: 201528 GFLOPS
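The node-level figure on this slide can be checked with simple arithmetic. The sketch below assumes that 67 GFLOPS belongs to each Xeon 5660 and 515 GFLOPS to each M2070, a pairing inferred from the flattened slide text rather than stated outright; the variable names are illustrative.

```python
# Per-device peak figures as read from the slide (pairing is an assumption).
XEON_GFLOPS = 67
M2070_GFLOPS = 515

CPUS_PER_NODE = 2      # ProLiant SL390s G7: 2 CPUs
GPUS_PER_NODE = 3      # ProLiant SL390s G7: 3 GPUs
NODES_PER_CHASSIS = 4  # S6500 chassis: 4 nodes

node_gflops = CPUS_PER_NODE * XEON_GFLOPS + GPUS_PER_NODE * M2070_GFLOPS
chassis_gflops = NODES_PER_CHASSIS * node_gflops
gpu_share = GPUS_PER_NODE * M2070_GFLOPS / node_gflops

print(f"node peak:    {node_gflops} GFLOPS")
print(f"chassis peak: {chassis_gflops} GFLOPS")
print(f"GPU share of node peak: {gpu_share:.0%}")
```

The node total reproduces the 1679 GFLOPS on the slide. The chassis product comes to 6716 rather than the 6718 shown, a gap that is presumably due to rounding in the per-device figures. Roughly nine tenths of each node's peak comes from the GPUs.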

A Data Rich World
• Mixed Modalities and levels of parallelism
• Irregular, Unstructured Computations and Data
Examples shown: Large Graphs, Pharma, Trend analysis
Images from math.nist.gov, blog.thefuturescompany.com, melihsozdinler.blogspot.com, conventioninsider.com, Waterexchange.com, topnews.net.tz

Enterprise: Amazon EC2 GPU Instance
Amazon EC2 GPU Instances (NVIDIA Tesla)

Elements   Characteristics
OS         CentOS 5.5
CPU        2 x Intel Xeon X5570 (quad-core "Nehalem" arch, 2.93 GHz)
GPU        2 x NVIDIA Tesla "Fermi" M2050 GPU; Nvidia GPU driver and CUDA toolkit 3.1
Memory     22 GB
Storage    1690 GB
I/O        10 GigE
Price      $2.10/hour

Impact on Software
At System Scale
• We need ISA level stability
• Commercially, it is infeasible to constantly re-factor and re-optimize applications
• Avoid software “silos”
• Performance portability
• New architectures need new algorithms
At Chip Scale
• What about our existing software?

Will Heterogeneity Survive?
Will We See Killer AMPs (Asymmetric Multicore Processors)?

System Software Challenges of Heterogeneity
• Execution Portability
– Systems evolve over time
– New systems
• Performance Optimization
– New algorithms
– Introspection
– Productivity tools
• Application Migration
– Protect investments in existing code bases
Emerging Software Stacks (figure): Language Front-End, Run-Time, Dynamic Optimizations, OS/VM, Device interfaces, Productivity Tools
Images from esd.lbl.gov, Sandia.gov

Ocelot: Project Goals
• Encourage proliferation of GPU computing
• Lower the barriers to entry for researchers and developers
• Establish links to industry standards, e.g., OpenCL
• Understand performance behavior of massively parallel, data intensive applications across multiple processor architecture types
• Develop the next generation of translation, optimization, and execution technologies for large scale, asymmetric and heterogeneous architectures
http://code.google.com/p/gpuocelot/

Key Philosophy
• Start with an explicitly parallel internal representation
• Auto-serialization vs. auto-parallelization
• Proliferation of domain specific languages and explicitly parallel language extensions like CUDA, OpenCL, and others
Kernel level model: bulk synchronous processing (BSP)
Kernel-Level Model: NVIDIA’s Parallel Thread Execution (PTX)

NVIDIA’s Compute Unified Device Architecture (CUDA)
• Bulk synchronous execution model
• For access to CUDA tutorials: http://developer.nvidia.com/cuda-education-training
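The bulk synchronous model and the idea of auto-serialization can be illustrated with a toy executor that runs a kernel's logical threads one after another on a single core, switching threads only at barriers. This is a minimal sketch under stated assumptions: `reverse_kernel` and `run_serialized` are hypothetical names, each `yield` stands in for a barrier, and nothing here is the CUDA API or Ocelot's actual translation scheme.

```python
from typing import Callable, Generator, List

# A kernel is a generator function; each `yield` marks a barrier that
# separates two phases of a bulk synchronous computation.
Kernel = Callable[[int, int, List[int], List[int]], Generator[None, None, None]]


def reverse_kernel(tid: int, n: int, data: List[int], shared: List[int]):
    """Reverse `data` in place by staging it through a shared scratch buffer."""
    shared[tid] = data[tid]          # phase 1: each thread stages one element
    yield                            # barrier: all writes to `shared` are done
    data[tid] = shared[n - 1 - tid]  # phase 2: read an element staged by a peer


def run_serialized(kernel: Kernel, n: int, data: List[int]) -> None:
    """Run `n` logical threads serially, one barrier-delimited phase at a time."""
    shared = [0] * n
    live = [kernel(tid, n, data, shared) for tid in range(n)]
    while live:
        still_running = []
        for thread in live:
            try:
                next(thread)                  # run up to the next barrier
                still_running.append(thread)
            except StopIteration:
                pass                          # this thread finished the kernel
        live = still_running


values = [1, 2, 3, 4]
run_serialized(reverse_kernel, len(values), values)
print(values)  # prints [4, 3, 2, 1]
```

Because every thread reaches the barrier before any thread continues past it, the serial schedule gives the same result a parallel one would, which is what lets an explicitly parallel kernel be mapped onto a processor with fewer hardware threads.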

Need for Execution Model Translation
• Languages: Designed for Productivity (CUDA, Haskell, C++AMP, C/C++, Datalog, OpenCL)
• Execution Models (EM): Dynamic Translation of EMs to bridge this gap (Tools: Compiler, Run Time)
• Hardware Architectures: Design under speed, cost, and energy constraints

Ocelot Vision: Multiplatform Dynamic Compilation
Figure labels: Language Front-End, Data Parallel IR (images from esd.lbl.gov; R. Domingo & D. Kaeli (NEU))
• Just-in-time code generation and optimization for data intensive applications
• Environment for i) compiler research, ii) architecture research, and iii) productivity tools

Ocelot CUDA Runtime Overview
• A complete reimplementation of the CUDA Runtime API
• Compatible with existing applications
• Link against libocelot.so instead of libcudart
• Ocelot API Extensions
• Device switching
• Kernels execute anywhere
• Key to portability!
Figure credit: R. Domingo & D. Kaeli (NEU)

Remote Device Layer
• Remote procedure call layer for Ocelot device calls
• Execute local applications that run kernels remotely
• Multi-GPU applications can become multi-node

Ocelot Internal Structure [1]
Flow shown in figure: CUDA Application, nvcc, PTX Kernel
• Ocelot is built with nvcc and the LLVM backend
• Structured around PTX IR → LLVM IR Translator
• Compile stock CUDA applications without modification
• Other front-ends in progress: OpenCL and Datalog
[1] G. Diamos, A. Kerr, S. Yalamanchili, and N. Clark, “Ocelot: A Dynamic Optimizing Compiler for Bulk Synchronous Applications in Heterogeneous Systems,” PACT, September 2010.

For Compiler Researchers
• Pass Manager orchestrates analysis and transformation passes
• Analysis Passes generate meta-data
– E.g., Data-flow graph, Dominator and Post-dominator trees, Thread frontiers
– Meta-data consumed by transformations
• Transformation Passes modify the IR
– E.g., Dead code elimination, Instrumentation, etc.
Figure: Pass Manager, Analysis Pass, Metadata, Transformation Pass
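The division of labor between the pass manager, analysis passes, and transformation passes can be sketched in a few lines. Everything below is hypothetical and for illustration only: the `Instr` toy IR, the `PassManager` class, and the pass names are assumptions, not Ocelot's classes or its PTX IR. The analysis is a backward liveness sweep over straight-line code, and the transformation is a dead code elimination that consumes its result.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Set, Tuple


@dataclass(frozen=True)
class Instr:
    op: str                        # opcode name, e.g. "ld", "add", "st"
    dest: Optional[str]            # register written, or None
    srcs: Tuple[str, ...] = ()     # registers read
    side_effect: bool = False      # True for stores and similar


Kernel = List[Instr]
Metadata = Dict[str, object]


def live_instructions(kernel: Kernel) -> Set[int]:
    """Analysis pass: indices of instructions whose results are observable."""
    live_regs: Set[Optional[str]] = set()
    live: Set[int] = set()
    for idx in range(len(kernel) - 1, -1, -1):   # backward sweep
        ins = kernel[idx]
        if ins.side_effect or ins.dest in live_regs:
            live.add(idx)
            live_regs.discard(ins.dest)          # definition kills the register
            live_regs.update(ins.srcs)           # operands become live
    return live


def dead_code_elimination(kernel: Kernel, metadata: Metadata) -> Kernel:
    """Transformation pass: drop instructions the analysis marked as dead."""
    live = metadata["live"]
    return [ins for idx, ins in enumerate(kernel) if idx in live]


class PassManager:
    """Runs each transformation after recomputing the analyses it requires."""

    def __init__(self) -> None:
        self.analyses: Dict[str, Callable[[Kernel], object]] = {}
        self.transforms: List[Tuple[Callable[[Kernel, Metadata], Kernel], List[str]]] = []

    def add_analysis(self, name: str, fn: Callable[[Kernel], object]) -> None:
        self.analyses[name] = fn

    def add_transform(self, fn: Callable[[Kernel, Metadata], Kernel],
                      requires: List[str]) -> None:
        self.transforms.append((fn, requires))

    def run(self, kernel: Kernel) -> Kernel:
        for fn, requires in self.transforms:
            # Earlier transformations may have invalidated old metadata.
            metadata = {name: self.analyses[name](kernel) for name in requires}
            kernel = fn(kernel, metadata)
        return kernel


kernel = [
    Instr("ld", "a"),
    Instr("ld", "b"),
    Instr("add", "c", ("a", "b")),
    Instr("mul", "d", ("a", "a")),               # result never used: dead
    Instr("st", None, ("c",), side_effect=True),
]

pm = PassManager()
pm.add_analysis("live", live_instructions)
pm.add_transform(dead_code_elimination, requires=["live"])
optimized = pm.run(kernel)
print([ins.op for ins in optimized])  # prints ['ld', 'ld', 'add', 'st']
```

Recomputing the required analyses before each transformation is the simplest way to keep metadata valid; a production pass manager would more likely cache results and invalidate them selectively.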
