Experiences Using Tegra K1 and X1 for Highly Energy Efficient Computing - PowerPoint PPT Presentation


  1. Experiences Using Tegra K1 and X1 for Highly Energy Efficient Computing
     Gaurav Mitra, Andrew Haigh, Luke Angove, Anish Varghese, Eric McCreath, Alistair P. Rendell
     Research School of Computer Science, Australian National University, Canberra, Australia
     April 07, 2016

  2. Overview
     1 Introduction & Background
     2 Power Measurement Environment
     3 Experimental Platforms
     4 Approach
     5 Results & Analysis
     6 Conclusion

  3. Introduction & Background: Use of low-powered SoCs for HPC
     - Nvidia Jetson TK1: ARM + GPU SoC
     - Nvidia Jetson TX1: ARM + GPU SoC
     - TI Keystone II: ARM + DSP SoC
     - Adapteva Parallella: ARM + 64-core NoC
     - TI BeagleBoard: ARM + DSP SoC
     - Terasic DE1: ARM + FPGA SoC
     - Rockchip Firefly: ARM + GPU SoC
     - Freescale Wandboard: ARM + GPU SoC
     - Cubieboard4: ARM + GPU SoC
     http://cs.anu.edu.au/systems

  4. Introduction & Background: Use of low-powered SoCs for HPC
     In order for SoC processors to be considered viable exascale building blocks, important factors to explore include:
     - Absolute performance
     - Balancing use of different on-chip devices
     - Understanding the performance-energy trade-off

  5. Introduction & Background: Contributions
     - An environment for monitoring and collecting high-resolution power measurements for SoC systems
     - Understanding the benefits of exploiting both the host CPU and accelerator GPU cores simultaneously for critical HPC kernels
     - Performance and energy comparisons with conventional HPC systems: Intel Xeon CPUs and NVIDIA K20 and K80 GPUs

  6. Power Measurement Environment: Measurement Requirements
     - SoC systems generally consume very low power, on the order of a few Watts
     - Subtle differences in energy consumption are triggered by different factors, such as the use of CPU or on-chip GPU cores
     - Changes in the DC current supplied to SoC system boards must be reliably measured
     - Current draw ranges from microamps to a few amps, so a very high-precision ammeter must be used to measure subtle changes

  7. Power Measurement Environment: Measurement Apparatus
     - µCurrent Gold: a high-precision ammeter for measuring low currents (https://www.eevblog.com/projects/ucurrent/)
     - An mbed LPC1768 micro-controller with a 12-bit ADC (0-3.3 V) is used to measure the analog output signals from the µCurrent Gold (https://developer.mbed.org/)
     - The ADC has a resolution of 0.81 ± 0.40 mV, which corresponds to 0.81 mA, i.e. 9.7 ± 4.8 mW at 12 V
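     To make that conversion concrete, here is a minimal sketch of the count-to-power arithmetic, assuming the µCurrent Gold's 1 mV/mA output range and a fixed 12 V supply; the constants and function names are illustrative, not from the talk.

        /* Count-to-power conversion for the measurement chain above.
           Assumes the uCurrent Gold is on its 1 mV/mA range and the
           SoC board is supplied at a fixed 12 V. */
        #include <stdio.h>

        #define ADC_BITS    12
        #define ADC_VREF_MV 3300.0   /* 0-3.3 V input range         */
        #define MV_PER_MA   1.0      /* uCurrent Gold mA-range gain */
        #define SUPPLY_V    12.0     /* DC supply to the SoC board  */

        /* Convert a raw 12-bit ADC reading to instantaneous power (mW). */
        static double adc_to_mw(unsigned raw)
        {
            double mv = raw * ADC_VREF_MV / (1 << ADC_BITS); /* ~0.81 mV per count */
            double ma = mv / MV_PER_MA;                      /* ~0.81 mA per count */
            return ma * SUPPLY_V;                            /* ~9.7 mW per count  */
        }

        int main(void)
        {
            printf("one ADC count = %.2f mW\n", adc_to_mw(1));   /* resolution */
            printf("raw 1024      = %.1f mW\n", adc_to_mw(1024));
            return 0;
        }

     One ADC count works out to 3300/4096 ≈ 0.81 mV, hence 0.81 mA and about 9.7 mW at 12 V, matching the resolution quoted on the slide.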

  8. Power Measurement Environment
     [Figure: schematic of the power measurement environment]

  9. Experimental Platforms

                   TK1              TX1              SANDY            HASWELL
     CPU           ARM Cortex-A15   ARM Cortex-A57   Xeon E5-2665     Xeon E5-2670 v3
     CPU Cores     4                4                2 × 8            2 × 12
     CPU Freq.     2.3 GHz          2.2 GHz          2.4 GHz          2.3 GHz
     RAM           2GB LPDDR3       3GB LPDDR4       128GB DDR3       128GB DDR3
     GPU           GK20A            GM20B            K20m (GK110)     K80 (GK210)
     GPU Cores     192              256              2496             2496
     GPU Freq.     852 MHz          998 MHz          706 MHz          875 MHz
     GPU RAM       Shared           Shared           5GB              12GB
     CUDA          v6.5             v7.0             v7.0             v7.5

  10. Approach: Evaluation Kernel
     C = A × B, with B and C partitioned column-wise:
         [ C1 | C2 ] = A × [ B1 | B2 ]
     so that C1 = A × B1 is computed on the CPU while C2 = A × B2 is computed on the GPU.
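     This column split maps naturally onto one host BLAS call and one cuBLAS call running concurrently. Below is a minimal sketch, not the authors' code: it assumes column-major storage, a host cblas_sgemm, and buffers visible to both devices (e.g. allocated with cudaMallocManaged on Tegra); split_sgemm and n_cpu are illustrative names.

        /* C = A*B (MxK * KxN), giving the first n_cpu columns of C to
           the CPU and the remaining N - n_cpu columns to the GPU. */
        #include <cblas.h>
        #include <cublas_v2.h>
        #include <cuda_runtime.h>

        void split_sgemm(cublasHandle_t h, int M, int N, int K, int n_cpu,
                         const float *A, const float *B, float *C)
        {
            const float one = 1.0f, zero = 0.0f;
            int n_gpu = N - n_cpu;

            /* GPU part first; cublasSgemm returns once queued: C2 = A*B2 */
            if (n_gpu > 0)
                cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, M, n_gpu, K,
                            &one, A, M, B + (size_t)n_cpu * K, K,
                            &zero, C + (size_t)n_cpu * M, M);

            /* CPU part overlaps with the GPU kernel: C1 = A*B1 */
            if (n_cpu > 0)
                cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                            M, n_cpu, K, 1.0f, A, M, B, K, 0.0f, C, M);

            cudaDeviceSynchronize();  /* wait for the GPU half */
        }

     Because B1 and B2 occupy disjoint column ranges, the two calls never write to the same elements of C and need no further coordination beyond the final synchronize.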

  11. Approach: Approaches
     - Traditional methods: assign all work to the GPU or the CPU
     - Static partitioning: partition work between GPU and CPU based on a priori information
       (Beaumont et al., Matrix Multiplication on Heterogeneous Platforms; C. Yang et al., Adaptive Optimization for Petascale Heterogeneous CPU/GPU Computing; Donfack et al., Dynamically Balanced Synchronization-Avoiding LU Factorization with Multicore and GPUs)
     - Dynamic partitioning
       (Papadrakakis et al., A New Era in Scientific Computing: Domain Decomposition Methods in Hybrid CPU-GPU Architectures)
     → Existing approaches do not consider the use of shared physical memory or the implications for energy efficiency

  12. Approach: Our approach
     Static partitioning:
     - Guess a partition based on experimentally measured peak performances of CPU and GPU
     - Use the achieved peaks to refine the partition
     - Repeat until convergence
     - Suitable for repeated calculations of the same size
     Dynamic partitioning:
     - CPU and GPU remove chunks of matrix columns from a workqueue
     - Chunk size must be sufficient to occupy the CPU and GPU fully
     - On traditional discrete GPU systems, copies have to be carefully scheduled
     Implemented using OpenMP:
     - Two threads, one each for CPU and GPU, taking work off a master queue
     - The GPU thread executes at the expense of doing productive work on the CPU cores
     Use of shared memory on SoC systems:
     - The CUDA driver automatically protects CUDA-allocated memory during the kernel execution phase
     - We circumvent this by immediately unprotecting the memory using mprotect() after initiating a kernel execution (see the sketch below)
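     A compact sketch of the dynamic scheme follows. This is not the authors' exact code: the chunk size, the helper names, and the decision to unprotect only the (page-aligned) output buffer are illustrative assumptions, and the buffers are assumed to be managed memory visible to both devices.

        /* Two OpenMP threads pull column chunks off a shared queue: one
           feeds the GPU via cuBLAS, the other runs the host SGEMM.
           mprotect() re-opens the shared output buffer to the CPU right
           after each asynchronous GPU launch. */
        #include <omp.h>
        #include <sys/mman.h>
        #include <cblas.h>
        #include <cublas_v2.h>
        #include <cuda_runtime.h>

        #define CHUNK 256  /* columns per chunk; must keep both devices busy */

        void dyn_sgemm(cublasHandle_t h, int M, int N, int K,
                       float *A, float *B, float *C, size_t bytes_C)
        {
            const float one = 1.0f, zero = 0.0f;
            int next = 0;  /* first unclaimed column */

            #pragma omp parallel num_threads(2)
            {
                int gpu = (omp_get_thread_num() == 0);
                for (;;) {
                    int col;
                    #pragma omp critical      /* the master work queue */
                    { col = next; next += CHUNK; }
                    if (col >= N) break;
                    int n = (col + CHUNK > N) ? N - col : CHUNK;

                    if (gpu) {
                        cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, M, n, K,
                                    &one, A, M, B + (size_t)col * K, K,
                                    &zero, C + (size_t)col * M, M);
                        /* The driver protects the buffer while the kernel
                           is in flight; undo that so the CPU thread can
                           keep writing its own columns concurrently. */
                        mprotect(C, bytes_C, PROT_READ | PROT_WRITE);
                        cudaStreamSynchronize(0); /* before taking more work */
                    } else {
                        cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                                    M, n, K, 1.0f, A, M,
                                    B + (size_t)col * K, K, 0.0f,
                                    C + (size_t)col * M, M);
                    }
                }
            }
        }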

  13. Results & Analysis: Best split performance

              Platform   Matrix Size   CPU GFLOPS   GPU GFLOPS   CPU SPLIT COLS   SPLIT GFLOPS
     DGEMM    TK1        4096          14           12           2176             26
              TX1        4096          18           9            2608             25
              SANDY      8192          311          836          2128             1099
              HASWELL    16384         804          1124         6912             1870
     SGEMM    TK1        4096          34           205          448              227
              TX1        4096          38           391          128              399
              SANDY      16384         643          2318         3392             2887
              HASWELL    16384         1753         2526         6896             4109
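     The CPU SPLIT COLS column can be read against the proportional first guess from slide 12: give the CPU a share of the N columns equal to its share of the combined measured peak. A small sketch using the table's numbers (cpu_cols is an illustrative helper, not from the talk):

        #include <stdio.h>

        /* First-guess static partition: CPU columns proportional to its
           share of the combined measured peak performance. */
        static int cpu_cols(int N, double cpu_gflops, double gpu_gflops)
        {
            return (int)(N * cpu_gflops / (cpu_gflops + gpu_gflops) + 0.5);
        }

        int main(void)
        {
            /* TK1 DGEMM, N = 4096: 14 vs 12 GFLOPS -> ~2206 columns,
               close to the experimentally best split of 2176. */
            printf("TK1 DGEMM: %d cols\n", cpu_cols(4096, 14, 12));
            /* TX1 SGEMM, N = 4096: 38 vs 391 GFLOPS -> ~363 columns; the
               measured best (128) shows why refinement is needed. */
            printf("TX1 SGEMM: %d cols\n", cpu_cols(4096, 38, 391));
            return 0;
        }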

  14. Results & Analysis: Best Split Search, Tegra K1/X1
     [Figure: GFLOPS and Joules for DGEMM (top) and SGEMM (bottom) on the TK1 and TX1, plotted against the split size given to the CPU (0 to 4,000 columns)]

  15. Results & Analysis: Best Split Search, Intel + NVIDIA GPUs
     [Figure: GFLOPS and Joules for DGEMM (top) and SGEMM (bottom) on SANDY and HASWELL, plotted against the split size given to the CPU]

  16. Results & Analysis: Performance Scaling, TK1
     [Figure: DGEMM (top) and SGEMM (bottom) GFLOPS for CPU, GPU, SPLIT, DYNAMIC, and TBALANCE against the combined CPU+GPU peak, for matrix dimensions M=N=K from 16 to 4096]

  17. Results & Analysis: Performance Scaling, TX1
     [Figure: DGEMM (top) and SGEMM (bottom) GFLOPS for CPU, GPU, SPLIT, DYNAMIC, and TBALANCE against the combined CPU+GPU peak, for matrix dimensions M=N=K from 16 to 4096]

  18. Results & Analysis: Energy Efficiency, TX1, SGEMM
     [Figure: Joules/FLOP (SP), log scale, for CPU, GPU, SPLIT, TBALANCE, and DYNAMIC, for matrix dimensions M=N=K from 128 to 4096; marked values 4.22 × 10^-10 and 3.75 × 10^-11 J/FLOP]
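     For reference, the Joules/FLOP values plotted here combine the power trace from the measurement environment with the 2·M·N·K operation count of a GEMM. A hedged sketch of that bookkeeping (the power samples below are made up; only the 3.75 × 10^-11 J/FLOP endpoint is from the slide):

        #include <stdio.h>

        /* Energy in Joules from power samples (Watts) taken at a fixed
           interval dt, integrated with the trapezoidal rule. */
        static double energy_j(const double *p_w, int n, double dt_s)
        {
            double e = 0.0;
            for (int i = 1; i < n; i++)
                e += 0.5 * (p_w[i - 1] + p_w[i]) * dt_s;
            return e;
        }

        int main(void)
        {
            /* SGEMM with M = N = K = 4096 does 2 * 4096^3 ~ 1.37e11 FLOPs;
               at the best observed 3.75e-11 J/FLOP that is ~5.2 J total. */
            double flops = 2.0 * 4096.0 * 4096.0 * 4096.0;
            double p[] = { 10.2, 10.5, 10.4 };   /* made-up Watts trace */
            double e = energy_j(p, 3, 0.1);      /* 10 Hz sampling      */
            printf("J/FLOP = %.3e\n", e / flops);
            return 0;
        }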

  19. Results & Analysis: Energy Efficiency, Haswell, SGEMM
     [Figure: Joules/FLOP (SP), log scale, for CPU, GPU, SPLIT, TBALANCE, and DYNAMIC, for matrix dimensions M=N=K from 512 to 16384; marked values 1.76 × 10^-10 and 8.24 × 10^-11 J/FLOP]

  20. Conclusion
     - The high-accuracy, high-resolution energy measurement system introduced here enables tuning algorithms for optimal energy usage. This would allow libraries like ATLAS to produce both best-performance and best-energy optimized builds.
     - How might a running application use information on energy usage to dynamically change its behaviour?
     - Use of shared physical memory on SoC systems eliminates transfer overhead.
     - Under some circumstances (e.g. TX1 DGEMM), an energy benefit was observed from exploiting both CPU and GPU together.
     - The best energy efficiency observed on the SoC systems was 37.5 pJ/FLOP (SGEMM on the TX1), while on conventional systems the best was 82.4 pJ/FLOP (SGEMM on the K80).
     Contact: Alistair.Rendell@anu.edu.au
     https://www.linkedin.com/in/alistair-rendell-6230b72
