Experiences Using Tegra K1 and X1 for Highly Energy Efficient Computing
Gaurav Mitra, Andrew Haigh, Luke Angove, Anish Varghese, Eric McCreath, Alistair P. Rendell
Research School of Computer Science, Australian National University, Canberra, Australia
GTC 2016, San Francisco, April 7, 2016
Overview
1. Introduction & Background
2. Power Measurement Environment
3. Experimental Platforms
4. Approach
5. Results & Analysis
6. Conclusion
Introduction & Background
Use of low-powered SoCs for HPC
- Nvidia Jetson TK1: ARM + GPU SoC
- Nvidia Jetson TX1: ARM + GPU SoC
- TI Keystone II: ARM + DSP SoC
- Adapteva Parallella: ARM + 64-core NoC
- TI BeagleBoard: ARM + DSP SoC
- Terasic DE1: ARM + FPGA SoC
- Rockchip Firefly: ARM + GPU SoC
- Freescale Wandboard: ARM + GPU SoC
- Cubieboard4: ARM + GPU SoC
http://cs.anu.edu.au/systems
In order for SoC processors to be considered viable exascale building blocks, important factors to explore include:
- Absolute performance
- Balancing use of different on-chip devices
- Understanding the performance-energy trade-off
Contributions
- Environment for monitoring and collecting high-resolution power measurements for SoC systems
- Understanding the benefits of exploiting both the host CPU and accelerator GPU cores simultaneously for critical HPC kernels
- Performance and energy comparisons with conventional HPC systems: Intel Xeon CPUs and NVIDIA K20 and K80 GPUs
Power Measurement Environment
Measurement Requirements
- SoC systems generally consume very low power, ∼ a few Watts
- Subtle differences in energy consumption are triggered by different factors, such as the use of CPU or on-chip GPU cores
- Changes in DC current supplied to SoC system boards must be reliably measured
- Current use ranges from µAmps to a few Amps, so a very high-precision ammeter must be used to measure subtle changes
Measurement Apparatus
- µCurrent Gold: a high-precision ammeter for measuring low currents
- An mbed LPC1768 micro-controller with a 12-bit ADC (0-3.3V) is used to measure the analog output signal from the µCurrent Gold
- The ADC has a resolution of 0.81±0.40mV, which corresponds to 0.81mA, or 9.7±4.8mW at 12V
https://www.eevblog.com/projects/ucurrent/
https://developer.mbed.org/
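To make the conversion concrete, here is a minimal sketch of turning a raw ADC reading into watts. It assumes the µCurrent Gold's 1 mV-per-mA output range (implied by 0.81 mV corresponding to 0.81 mA above) and the 12 V supply quoted on this slide; the constants and function names are illustrative, not from the talk.

    #include <cstdio>

    // Sketch: convert a raw 12-bit mbed ADC reading of the uCurrent Gold's
    // analog output into instantaneous power drawn by the SoC board.
    const double ADC_VREF  = 3.3;    // ADC reference voltage (V)
    const int    ADC_STEPS = 4096;   // 12-bit ADC
    const double MV_PER_MA = 1.0;    // uCurrent Gold: 1 mV output per mA measured
    const double SUPPLY_V  = 12.0;   // DC supply voltage to the board

    double watts_from_adc(unsigned raw) {
        double mv   = raw * ADC_VREF * 1000.0 / ADC_STEPS;  // counts -> millivolts
        double amps = (mv / MV_PER_MA) / 1000.0;            // millivolts -> amps
        return amps * SUPPLY_V;                             // P = I * V
    }

    int main() {
        // One ADC step: 3.3V/4096 = 0.81 mV -> 0.81 mA -> roughly 9.7 mW at 12 V
        printf("1 LSB corresponds to %.4f W\n", watts_from_adc(1));
        return 0;
    }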
[Figure: the power measurement environment setup]
Experimental Platforms
           TK1              TX1              SANDY            HASWELL
CPU        ARM Cortex-A15   ARM Cortex-A57   Xeon E5-2665     Xeon E5-2670 v3
CPU Cores  4                4                2×8              2×12
CPU Freq.  2.3 GHz          2.2 GHz          2.4 GHz          2.3 GHz
RAM        2GB LPDDR3       3GB LPDDR4       128GB DDR3       128GB DDR3
GPU        GK20A            GM20B            K20m (GK110)     K80 (GK210)
GPU Cores  192              256              2496             2496
GPU Freq.  852 MHz          998 MHz          706 MHz          875 MHz
GPU RAM    Shared           Shared           5GB              12GB
CUDA       v6.5             v7.0             v7.0             v7.5
Approach
Evaluation Kernel
The evaluation kernel is matrix multiplication, C = A × B. The output is partitioned column-wise,

    [C1 | C2] = A × [B1 | B2]

so that C1 = A × B1 is computed on the CPU while C2 = A × B2 is computed on the GPU.
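A minimal sketch of this column split, with the GPU half issued first so the CPU half overlaps it. The use of cuBLAS plus a host cblas_sgemm, and the separate host/device copies of the operands, are assumptions for a discrete-GPU setting; the slides do not name the exact library calls.

    #include <cblas.h>
    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    // Sketch: C = A * B split column-wise (column-major BLAS convention).
    // The first cpu_cols columns of B and C go to the CPU (C1 = A * B1),
    // the remaining gpu_cols columns to the GPU (C2 = A * B2).
    void split_sgemm(cublasHandle_t h, int m, int n, int k, int cpu_cols,
                     const float *A, const float *B, float *C,        // host copies
                     const float *dA, const float *dB, float *dC) {   // device copies
        const float one = 1.0f, zero = 0.0f;
        int gpu_cols = n - cpu_cols;

        // GPU half first; the call returns while the kernel runs: C2 = A * B2
        cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, m, gpu_cols, k,
                    &one, dA, m, dB + (size_t)k * cpu_cols, k,
                    &zero, dC + (size_t)m * cpu_cols, m);

        // CPU half overlaps with the GPU kernel: C1 = A * B1
        cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    m, cpu_cols, k, 1.0f, A, m, B, k, 0.0f, C, m);

        cudaDeviceSynchronize();   // wait for the GPU half to finish
    }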
Approaches
Traditional methods: assign all work to the GPU or the CPU.

Static Partitioning: partition work between GPU and CPU based on a priori information.
- Beaumont et al., Matrix Multiplication on Heterogeneous Platforms
- C. Yang et al., Adaptive Optimization for Petascale Heterogeneous CPU/GPU Computing
- Donfack et al., Dynamically Balanced Synchronization-Avoiding LU Factorization with Multicore and GPUs

Dynamic Partitioning:
- Papadrakakis et al., A New Era in Scientific Computing: Domain Decomposition Methods in Hybrid CPU-GPU Architectures

→ Existing approaches do not consider the use of shared physical memory or the implications for energy efficiency.
Our Approach
Static partitioning:
- Guess a partition based on experimentally measured peak performances of the CPU and GPU
- Use the achieved peaks to refine the partition (see the sketch after this list)
- Repeat until convergence
- Suitable for repeated calculations of the same size
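The refinement can be a short loop: rebalance so each device's share is proportional to the rate it achieved on the previous run. The run_split helper and the convergence tolerance below are hypothetical; the slides do not give the exact update rule.

    #include <cstdlib>

    // Hypothetical helper: run the split GEMM with cpu_cols columns on the
    // CPU and report each device's achieved GFLOP/s.
    void run_split(int n, int cpu_cols, double *cpu_rate, double *gpu_rate);

    // Sketch: iterate until both devices finish at (nearly) the same time.
    int refine_split(int n, int cpu_cols) {
        for (int iter = 0; iter < 10; ++iter) {
            double cpu_rate, gpu_rate;
            run_split(n, cpu_cols, &cpu_rate, &gpu_rate);

            // Give the CPU a share proportional to its achieved rate.
            int new_cols = (int)(n * cpu_rate / (cpu_rate + gpu_rate));
            if (abs(new_cols - cpu_cols) <= 8)   // arbitrary tolerance
                return new_cols;
            cpu_cols = new_cols;
        }
        return cpu_cols;
    }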
Use of shared memory on SoC systems:
- The CUDA driver automatically protects CUDA-allocated memory during the kernel execution phase
- We circumvent this by unprotecting the memory with mprotect() immediately after initiating a kernel execution (see the sketch below)
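A minimal sketch of the trick on a shared-memory Tegra, assuming a page-aligned CUDA allocation mapped into the host address space; gpu_part and cpu_part are illustrative stand-ins for the GPU and CPU halves of the GEMM.

    #include <cuda_runtime.h>
    #include <sys/mman.h>

    __global__ void gpu_part(float *C, int cols);   // illustrative GPU half (C2)
    void cpu_part(float *C, int cols);              // illustrative CPU half (C1)

    void overlapped_gemm(float *C, size_t bytes, int gpu_cols, int cpu_cols) {
        // Kernel launches are asynchronous: control returns immediately.
        gpu_part<<<256, 256>>>(C, gpu_cols);

        // The driver write-protects the shared allocation while the kernel is
        // in flight; re-enable host access so the CPU can work on its columns.
        mprotect(C, bytes, PROT_READ | PROT_WRITE);

        cpu_part(C, cpu_cols);     // CPU computes C1 while the GPU computes C2
        cudaDeviceSynchronize();   // wait for the GPU half
    }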
Dynamic partitioning:
- The CPU and GPU remove chunks of matrix columns from a work queue
- The chunk size must be sufficient to occupy the CPU and GPU fully
- On traditional discrete-GPU systems, copies have to be carefully scheduled
- Implemented using OpenMP: two threads, one each for the CPU and GPU, take work off a master queue (see the sketch below)
- The GPU management thread executes at the expense of doing productive work on the CPU cores
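A minimal sketch of the two-thread scheme, assuming hypothetical do_cpu_chunk/do_gpu_chunk helpers and an atomic counter as the master queue:

    #include <omp.h>

    void do_gpu_chunk(int first_col, int cols);  // dispatch to GPU and wait
    void do_cpu_chunk(int first_col, int cols);  // GEMM on the remaining CPU cores

    // Sketch: two OpenMP threads pull column chunks off a shared queue.
    // Thread 0 feeds the GPU (and so cannot do productive CPU work itself);
    // thread 1 computes chunks on the CPU.
    void dynamic_gemm(int n, int chunk) {
        int next = 0;   // head of the master work queue, in columns

        #pragma omp parallel num_threads(2)
        {
            int me = omp_get_thread_num();
            for (;;) {
                int col;
                #pragma omp atomic capture
                { col = next; next += chunk; }   // take one chunk atomically
                if (col >= n) break;

                int cols = (col + chunk <= n) ? chunk : n - col;
                if (me == 0) do_gpu_chunk(col, cols);
                else         do_cpu_chunk(col, cols);
            }
        }
    }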
Results & Analysis
Results: Best split performance
          Matrix  CPU     GPU     CPU Split  Split
Platform  Size    GFLOPS  GFLOPS  Cols       GFLOPS
DGEMM
TK1       4096    14      12      2176       26
TX1       4096    18      9       2608       25
SANDY     8192    311     836     2128       1099
HASWELL   16384   804     1124    6912       1870
SGEMM
TK1       4096    34      205     448        227
TX1       4096    38      391     128        399
SANDY     16384   643     2318    3392       2887
HASWELL   16384   1753    2526    6896       4109
Best Split Search - Tegra K1/X1
[Figure: DGEMM and SGEMM GFLOPS, and energy in Joules, for the TK1 and TX1 versus the split size given to the CPU]
Best Split Search - Intel + NVIDIA GPUs
[Figure: DGEMM and SGEMM GFLOPS, and energy in Joules, for SANDY and HASWELL versus the split size given to the CPU]
Performance Scaling - TK1
[Figure: TK1 DGEMM and SGEMM GFLOPS versus matrix dimension (M=N=K, 16 to 4096) for CPU, GPU, SPLIT, DYNAMIC, TBALANCE, and PEAK (CPU+GPU)]
Performance Scaling - TX1
[Figure: TX1 DGEMM and SGEMM GFLOPS versus matrix dimension (M=N=K, 16 to 4096) for CPU, GPU, SPLIT, DYNAMIC, TBALANCE, and PEAK (CPU+GPU)]
Energy Efficiency - TX1 - SGEMM
[Figure: TX1 SGEMM Joules/FLOP (SP) versus matrix dimension (M=N=K, 128 to 4096) for CPU, GPU, SPLIT, TBALANCE, and DYNAMIC; annotated values 4.22·10⁻¹⁰ and 3.75·10⁻¹¹ J/FLOP]
Energy Efficiency - Haswell - SGEMM
[Figure: Haswell SGEMM Joules/FLOP (SP) versus matrix dimension (M=N=K, 512 to 16384) for CPU, GPU, SPLIT, TBALANCE, and DYNAMIC; annotated values 1.76·10⁻¹⁰ and 8.24·10⁻¹¹ J/FLOP]
Conclusion
- The high-accuracy, high-resolution energy measurement system introduced here enables tuning algorithms for optimal energy usage. This would allow libraries like ATLAS to tune for and produce both best-performance and best-energy optimized libraries.
- How might a running application use information on energy usage to dynamically change its behaviour?
- Use of shared physical memory on SoC systems eliminates transfer overhead.
- In some circumstances (e.g. DGEMM on the TX1), an energy benefit was observed from exploiting both the CPU and GPU together.
- The best energy efficiency observed on a SoC system was 37.5 pJ/FLOP for SGEMM on the TX1, while the best observed on a conventional system was 82.4 pJ/FLOP for SGEMM on the K80.
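For reference, the metric is simply measured energy over the GEMM operation count of 2·M·N·K FLOPs. A minimal sketch of the arithmetic (the 5.15 J input is back-derived from the quoted 37.5 pJ/FLOP at N=4096, not a number reported in the talk):

    #include <cstdio>

    // Energy efficiency of an M=N=K GEMM: joules per floating-point operation.
    double joules_per_flop(double joules, double n) {
        return joules / (2.0 * n * n * n);   // GEMM performs 2*N^3 FLOPs
    }

    int main() {
        // 5.15 J at N=4096 gives ~3.75e-11 J/FLOP, i.e. 37.5 pJ/FLOP.
        printf("%.3g J/FLOP\n", joules_per_flop(5.15, 4096.0));
        return 0;
    }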
Contact: Alistair.Rendell@anu.edu.au https://www.linkedin.com/in/alistair-rendell-6230b72