The Era of Heterogeneous Compute: Challenges and Opportunities - PowerPoint PPT Presentation


  1. The Era of Heterogeneous Compute: Challenges and Opportunities
     Sudhakar Yalamanchili
     Computer Architecture and Systems Laboratory
     Center for Experimental Research in Computer Systems
     School of Electrical and Computer Engineering
     Georgia Institute of Technology
     SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY

  2. System Diversity
     Heterogeneity is Mainstream
     • Amazon EC2 GPU Instances
     • Mobile Platforms
     • Tianhe-1A
     • Keeneland System

  3. Outline
     • Drivers and Evolution to Heterogeneous Computing
     • The Ocelot Dynamic Execution Environment
     • Dynamic Translation for Execution Models
     • Dynamic Instrumentation of Kernels
     • Related Projects

  4. Evolution to Multicore
     Power Wall: $P = \alpha C V_{dd}^{2} f + V_{dd} I_{st} + V_{dd} I_{leak}$
     [Figure: performance over time (1980's, 1990's, 2000 onward)]
     • Pipelining (RISC)
     • Frequency Scaling (Instruction Level Parallelism)
     • Core Scaling (Multicore)
     Examples:
     • Intel Nehalem-EX: 8 cores
     • Tilera: 64 cores
     • NVIDIA Fermi: 480 cores
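
     A short worked reading of the power-wall equation may help here. This is a sketch under an assumption the slide does not state: that supply voltage must scale roughly linearly with clock frequency, a common first-order approximation. Only the dynamic term is considered, and the two-core comparison further assumes a perfectly parallel workload.

     ```latex
     % Dynamic (switching) term of the slide's power equation
     P_{dyn} = \alpha C V_{dd}^{2} f

     % Assumption (not from the slide): V_{dd} \propto f
     P_{dyn} \propto \alpha C f^{3}

     % One core at frequency f versus two cores at f/2 (same aggregate throughput
     % if the work parallelizes perfectly):
     \frac{P_{dyn}(\text{2 cores at } f/2)}{P_{dyn}(\text{1 core at } f)}
       = \frac{2 \cdot (f/2)^{3}}{f^{3}} = \frac{1}{4}
     ```

     Under these assumptions, core scaling delivers the same nominal throughput at about a quarter of the dynamic power, which is one way to read the shift from frequency scaling to multicore on this slide.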

  5. Consolidation on Chip
     Intel Sandy Bridge
     • Vector Extensions
     • AES Instructions
     • Programmable Pipeline (GEN6)
     • Programmable Accelerator
     Intel Knights Corner
     PowerEN
     • 16 PowerPC cores
     • Accelerators: Crypto Engine, RegEx Engine, XML Engine, Compress Engine
     Callouts: Multiple Models of Computation; Multi-ISA

  6. Major Customization Trends
     Uniform ISA: Asymmetric (Knights Corner)
     Multi-ISA: Heterogeneous (PowerEN)
     • Disruptive impact on the software stack?
     • Higher degree of customization
     • Minimal disruption to the software ecosystems
     • Limited customization?

  7. Asymmetry vs. Heterogeneity
     Columns: Performance Asymmetry | Functional Asymmetry | Heterogeneous
     Axis: Uniform ISA to Multi-ISA
     [Figure: tiled multicore layouts (Tile) with memory controllers (MC)]
     • Complex cores and simple cores
     • Multiple voltage and frequency islands
     • Shared instruction set architecture (ISA)
     • Different memory technologies: STT-RAM, PCM, Flash
     • Subset ISA
     • Distinct microarchitecture
     • Fault and migrate model of operation [1]
     • Multi-ISA
     • Microarchitecture
     • Memory & Interconnect hierarchy
     [1] T. Li et al., “Operating system support for shared ISA asymmetric multi-core architectures,” in WIOSCA, 2008.

  8. HPC Systems: Keeneland
     Courtesy J. Vetter (GT/ORNL)
     • 201 TFLOPS in 7 racks (90 sq ft incl. service area)
     • 677 MFLOPS per watt on HPL (#9 on Green500, Nov 2010)
     • Final delivery system planned for early 2012
     System hierarchy (peak GFLOPS):
       Component                                   GFLOPS
       Xeon 5660                                       67
       M2070                                          515
       ProLiant SL390s G7 (2 CPUs, 3 GPUs)           1679   (24/18 GB)
       S6500 Chassis (4 Nodes)                       6718
       Rack (6 Chassis)                             40306
       Keeneland System (7 Racks)                  201528
     • 12000-Series Director Switch
     • Integrated with NICS Datacenter GPFS and TG
     • Full PCIe X16 bandwidth to all GPUs
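
     The node-level figure on this slide can be checked from the component figures. The sketch below is only an illustration of that arithmetic; the constant and function names are made up for this example, and the per-component values are the rounded ones printed on the slide. The chassis and rack products come out slightly below the slide's 6718 and 40306, presumably because the slide rolls up unrounded component values.

     ```python
     # Peak GFLOPS per component, as printed on the slide (rounded values).
     GPU_M2070_GFLOPS = 515
     CPU_XEON_5660_GFLOPS = 67

     # Composition from the slide: each ProLiant SL390s G7 node has 2 CPUs and 3 GPUs,
     # each S6500 chassis holds 4 nodes, each rack holds 6 chassis.
     CPUS_PER_NODE = 2
     GPUS_PER_NODE = 3
     NODES_PER_CHASSIS = 4
     CHASSIS_PER_RACK = 6


     def node_peak_gflops() -> int:
         """Sum the peak rates of every CPU and GPU in one node."""
         return (GPUS_PER_NODE * GPU_M2070_GFLOPS
                 + CPUS_PER_NODE * CPU_XEON_5660_GFLOPS)


     def gpu_share_of_node() -> float:
         """Fraction of a node's peak rate that comes from its GPUs."""
         return GPUS_PER_NODE * GPU_M2070_GFLOPS / node_peak_gflops()


     node = node_peak_gflops()
     chassis = NODES_PER_CHASSIS * node
     rack = CHASSIS_PER_RACK * chassis

     if __name__ == "__main__":
         print(f"node    : {node} GFLOPS")
         print(f"chassis : {chassis} GFLOPS")
         print(f"rack    : {rack} GFLOPS")
         print(f"GPU share of node peak: {gpu_share_of_node():.1%}")
     ```

     With these inputs the GPUs account for roughly nine tenths of a node's peak rate, which is why the accelerators dominate the system's headline number.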

  9. A Data Rich World
     • Large Graphs
     • Pharma
     • Trend analysis
     • Mixed modalities and levels of parallelism
     • Irregular, unstructured computations and data
     Images from math.nist.gov, blog.thefuturescompany.com, melihsozdinler.blogspot.com, conventioninsider.com, Waterexchange.com, topnews.net.tz

  10. Enterprise: Amazon EC2 GPU Instance
     NVIDIA Tesla; Amazon EC2 GPU Instances
       Elements   Characteristics
       OS         CentOS 5.5
       CPU        2 x Intel Xeon X5570 (quad-core "Nehalem" arch, 2.93GHz)
       GPU        2 x NVIDIA Tesla "Fermi" M2050 GPU
                  Nvidia GPU driver and CUDA toolkit 3.1
       Memory     22 GB
       Storage    1690 GB
       I/O        10 GigE
       Price      $2.10/hour

  11. Impact on Software
     At System Scale
     • We need ISA level stability
     • Commercially, it is infeasible to constantly re-factor and re-optimize applications
     • Avoid software “silos”
     • Performance portability
     • New architectures need new algorithms
     At Chip Scale
     • What about our existing software?

  12. Will Heterogeneity Survive?
     Will We See Killer AMPs (Asymmetric Multicore Processors)?

  13. System Software Challenges of Heterogeneity
     • Execution Portability
       – Systems evolve over time
       – New systems
     • Performance Optimization
     • New algorithms
     • Introspection
     • Productivity tools
     • Application Migration
       – Protect investments in existing code bases
     [Figure: Emerging Software Stacks: Language Front-End, Run-Time, Dynamic Optimizations, OS/VM, Device interfaces, Productivity Tools. Images from esd.lbl.gov, Sandia.gov]

  14. Outline
     • Drivers and Evolution to Heterogeneous Computing
     • The Ocelot Dynamic Execution Environment
     • Dynamic Translation for Execution Models
     • Dynamic Instrumentation of Kernels
     • Related Projects

  15. Ocelot: Project Goals
     • Encourage proliferation of GPU computing
     • Lower the barriers to entry for researchers and developers
     • Establish links to industry standards, e.g., OpenCL
     • Understand performance behavior of massively parallel, data intensive applications across multiple processor architecture types
     • Develop the next generation of translation, optimization, and execution technologies for large scale, asymmetric and heterogeneous architectures
     http://code.google.com/p/gpuocelot/

  16. Key Philosophy
     • Start with an explicitly parallel internal representation
     • Auto-serialization vs. auto-parallelization
     • Proliferation of domain specific languages and explicitly parallel language extensions like CUDA, OpenCL, and others
     Kernel level model: bulk synchronous processing (BSP)
     Kernel-Level Model: NVIDIA’s Parallel Thread Execution (PTX)
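
     To make "auto-serialization" concrete, here is a minimal sketch of running a bulk synchronous kernel on a single sequential core. It is not Ocelot's implementation. It assumes a kernel has already been split into phases at its barriers, so each phase can be run as an ordinary loop over thread IDs. All names (`run_block_serialized`, `make_reduction_phases`) are hypothetical.

     ```python
     from typing import Callable, List

     # A phase is the code one thread runs between two barriers.
     # Signature: (thread_id, block_size, shared_memory).
     Phase = Callable[[int, int, List[int]], None]


     def run_block_serialized(phases: List[Phase], block_size: int,
                              shared: List[int]) -> List[int]:
         """Run one thread block sequentially.

         Finishing the inner loop before starting the next phase plays the
         role of the barrier: every thread has completed phase k before any
         thread starts phase k + 1.
         """
         for phase in phases:
             for tid in range(block_size):
                 phase(tid, block_size, shared)
         return shared


     def make_reduction_phases(data: List[int]) -> List[Phase]:
         """Build the phases of a tree-style sum reduction.

         Assumes len(data) is a power of two. Within a phase, threads only
         read indices >= stride and write indices < stride, so the order in
         which the serialized threads run does not change the result.
         """
         def load(tid: int, block_size: int, shared: List[int]) -> None:
             shared[tid] = data[tid]

         phases: List[Phase] = [load]
         stride = len(data) // 2
         while stride >= 1:
             def step(tid: int, block_size: int, shared: List[int],
                      stride: int = stride) -> None:
                 if tid < stride:
                     shared[tid] += shared[tid + stride]
             phases.append(step)
             stride //= 2
         return phases


     def block_sum(data: List[int]) -> int:
         shared = [0] * len(data)
         run_block_serialized(make_reduction_phases(data), len(data), shared)
         return shared[0]


     if __name__ == "__main__":
         print(block_sum([1, 2, 3, 4, 5, 6, 7, 8]))
     ```

     The point of the sketch is the direction of the transformation: the program starts out explicitly parallel, and mapping it to fewer hardware threads only requires preserving barrier order, which is generally easier than discovering parallelism in sequential code.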

  17. NVIDIA’s Compute Unified Device Architecture (CUDA)
     Bulk synchronous execution model
     • For access to CUDA tutorials: http://developer.nvidia.com/cuda-education-training

  18. Need for Execution Model Translation
     Languages: Designed for Productivity
     • CUDA, Haskell, C++AMP, C/C++, Datalog, OpenCL
     Execution Models (EM): Dynamic Translation of EMs to bridge this gap
     • Compiler, Tools, Run Time
     Hardware Architectures – Design under speed, cost, and energy constraints

  19. Ocelot Vision: Multiplatform Dynamic Compilation
     [Figure: Language Front-End, Data Parallel IR. Images from esd.lbl.gov; R. Domingo & D. Kaeli (NEU)]
     Just-in-time code generation and optimization for data intensive applications
     • Environment for i) compiler research, ii) architecture research, and iii) productivity tools

  20. Ocelot CUDA Runtime Overview
     • A complete reimplementation of the CUDA Runtime API
       – Compatible with existing applications
       – Link against libocelot.so instead of libcudart
     • Ocelot API Extensions
       – Device switching
     Kernels execute anywhere
     • Key to portability!
     [Figure credit: R. Domingo & D. Kaeli (NEU)]

  21. Remote Device Layer
     • Remote procedure call layer for Ocelot device calls
     • Execute local applications that run kernels remotely
     • Multi-GPU applications can become multi-node

  22. Ocelot Internal Structure [1]
     [Figure: CUDA Application, nvcc, PTX Kernel]
     • Ocelot is built with nvcc and the LLVM backend
     • Structured around PTX IR → LLVM IR Translator
     • Compile stock CUDA applications without modification
     • Other front-ends in progress: OpenCL and Datalog
     [1] G. Diamos, A. Kerr, S. Yalamanchili, and N. Clark, “Ocelot: A Dynamic Optimizing Compiler for Bulk Synchronous Applications in Heterogeneous Systems,” PACT, September 2010.

  23. For Compiler Researchers
     • Pass Manager orchestrates analysis and transformation passes
     • Analysis Passes generate meta-data
       – E.g., Data-flow graph, Dominator and Post-dominator trees, Thread frontiers
     • Meta-data consumed by transformations
     • Transformation Passes modify the IR
       – E.g., Dead code elimination, Instrumentation, etc.
     [Figure: Pass Manager, Analysis Pass, Transformation Pass, Metadata]
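
     The pass structure described on this slide can be sketched as follows. This is a toy model, not Ocelot's pass manager API. The IR is a straight-line list of instructions, the single analysis is a backward liveness walk, and the single transformation is dead code elimination. Every class and field name here is hypothetical.

     ```python
     from dataclasses import dataclass
     from typing import Dict, List, Set


     @dataclass
     class Instr:
         dest: str          # register written by this instruction
         op: str            # opcode mnemonic (only used for display)
         srcs: List[str]    # registers or parameters read


     @dataclass
     class Kernel:
         instrs: List[Instr]
         outputs: Set[str]  # registers whose values must survive


     class NeededInstrAnalysis:
         """Analysis pass: produces meta-data and leaves the IR untouched."""
         kind = "analysis"
         name = "needed"

         def analyze(self, kernel: Kernel) -> Set[int]:
             live = set(kernel.outputs)
             needed: Set[int] = set()
             # Walk backwards: an instruction is needed if it defines a live value.
             for i in reversed(range(len(kernel.instrs))):
                 ins = kernel.instrs[i]
                 if ins.dest in live:
                     needed.add(i)
                     live.discard(ins.dest)
                     live.update(ins.srcs)
             return needed


     class DeadCodeElimination:
         """Transformation pass: consumes meta-data and modifies the IR."""
         kind = "transformation"
         name = "dce"
         requires = ["needed"]

         def transform(self, kernel: Kernel, metadata: Dict[str, object]) -> None:
             needed = metadata["needed"]
             kernel.instrs = [ins for i, ins in enumerate(kernel.instrs)
                              if i in needed]


     class PassManager:
         """Runs passes in order and hands analysis results to transformations."""

         def __init__(self) -> None:
             self._passes: list = []

         def add(self, p) -> "PassManager":
             self._passes.append(p)
             return self

         def run(self, kernel: Kernel) -> Kernel:
             metadata: Dict[str, object] = {}
             for p in self._passes:
                 if p.kind == "analysis":
                     metadata[p.name] = p.analyze(kernel)
                 else:
                     missing = [r for r in p.requires if r not in metadata]
                     if missing:
                         raise RuntimeError(f"{p.name} needs {missing}")
                     p.transform(kernel, metadata)
                     metadata.clear()  # the IR changed, so old results are stale
             return kernel


     def example_kernel() -> Kernel:
         return Kernel(
             instrs=[
                 Instr("a", "ld", ["x"]),
                 Instr("b", "ld", ["y"]),
                 Instr("c", "add", ["a", "b"]),
                 Instr("d", "mul", ["a", "a"]),  # result never used
             ],
             outputs={"c"},
         )


     if __name__ == "__main__":
         pm = PassManager().add(NeededInstrAnalysis()).add(DeadCodeElimination())
         for ins in pm.run(example_kernel()).instrs:
             print(ins.dest, "=", ins.op, ", ".join(ins.srcs))
     ```

     Keeping analyses separate from transformations lets several transformations share one analysis result, and it gives the manager a single place to invalidate meta-data once the IR has been modified.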
