Experiences Using Tegra K1 and X1 for Highly Energy Efficient Computing


SLIDE 1

Experiences Using Tegra K1 and X1 for Highly Energy Efficient Computing

Gaurav Mitra, Andrew Haigh, Luke Angove, Anish Varghese, Eric McCreath, Alistair P. Rendell

Research School of Computer Science Australian National University Canberra, Australia

April 07, 2016

SLIDE 2

Introduction & Background

Overview

1. Introduction & Background
2. Power Measurement Environment
3. Experimental Platforms
4. Approach
5. Results & Analysis
6. Conclusion

Mitra et al. (ANU), GTC 2016, San Francisco, April 07, 2016

SLIDE 3

Introduction & Background

Use of low-powered SoCs for HPC

  • Nvidia Jetson TK1: ARM + GPU SoC
  • Nvidia Jetson TX1: ARM + GPU SoC
  • TI Keystone II: ARM + DSP SoC
  • Adapteva Parallella: ARM + 64-core NoC
  • TI BeagleBoard: ARM + DSP SoC
  • Terasic DE1: ARM + FPGA SoC
  • Rockchip Firefly: ARM + GPU SoC
  • Freescale Wandboard: ARM + GPU SoC
  • Cubieboard4: ARM + GPU SoC

http://cs.anu.edu.au/systems


SLIDE 4

Introduction & Background

Use of low-powered SoCs for HPC

For SoC processors to be considered viable exascale building blocks, important factors to explore include:

  • Absolute performance
  • Balancing use of the different on-chip devices
  • Understanding the performance-energy trade-off


SLIDE 5

Introduction & Background

Contributions

  • An environment for monitoring and collecting high-resolution power measurements for SoC systems
  • Understanding the benefits of exploiting both the host CPU and accelerator GPU cores simultaneously for critical HPC kernels
  • Performance and energy comparisons with conventional HPC systems: Intel Xeon CPUs and NVIDIA K20 and K80 GPUs


SLIDE 6

Power Measurement Environment

Measurement Requirements

  • SoC systems generally consume very low power: on the order of a few Watts
  • Subtle differences in energy consumption are triggered by different factors, such as the use of CPU versus on-chip GPU cores
  • Changes in the DC current supplied to SoC system boards must be reliably measured
  • Current draw ranges from µAmps to a few Amps, so a very high-precision ammeter is needed to capture subtle changes


SLIDE 7

Power Measurement Environment

Measurement Apparatus

  • µCurrent Gold: a high-precision ammeter for measuring low currents
  • An mbed LPC1768 micro-controller with a 12-bit ADC (0-3.3V) is used to measure the analog output signal from the µCurrent Gold
  • The ADC has a resolution of 0.81±0.40mV, which corresponds to 0.81mA; this is 9.7±4.8mW at 12V
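The quoted resolutions follow directly from the ADC parameters; a quick arithmetic check, assuming the µCurrent Gold is on its 1 mV/mA output range (so each millivolt at the ADC corresponds to a milliamp of measured current):

```python
# Sanity-check the quoted measurement resolution.
# Assumption: uCurrent Gold on its 1 mV/mA range (1 mV out per 1 mA measured).
adc_bits = 12
vref = 3.3                         # ADC full-scale voltage (V)
lsb_v = vref / 2**adc_bits         # one ADC step in volts

current_res_a = lsb_v * 1.0        # 1 mV/mA gain: volts map 1:1 to amps
power_res_w = current_res_a * 12.0 # power step at a 12 V supply rail

print(f"ADC step: {lsb_v * 1e3:.2f} mV")                      # ~0.81 mV
print(f"Current resolution: {current_res_a * 1e3:.2f} mA")    # ~0.81 mA
print(f"Power resolution: {power_res_w * 1e3:.1f} mW @ 12 V") # ~9.7 mW
```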

https://www.eevblog.com/projects/ucurrent/
https://developer.mbed.org/

SLIDE 8

Power Measurement Environment

Power Measurement Environment

(Figure: the power measurement setup.)

SLIDE 9

Experimental Platforms

Experimental Platforms

             TK1              TX1              SANDY            HASWELL
CPU          ARM Cortex-A15   ARM Cortex-A57   Xeon E5-2665     Xeon E5-2670 v3
CPU Cores    4                4                2×8              2×12
CPU Freq.    2.3 GHz          2.2 GHz          2.4 GHz          2.3 GHz
RAM          2GB LPDDR3       3GB LPDDR4       128GB DDR3       128GB DDR3
GPU          GK20A            GM20B            K20m (GK110)     K80 (GK210)
GPU Cores    192              256              2496             2496
GPU Freq.    852 MHz          998 MHz          706 MHz          875 MHz
GPU RAM      Shared           Shared           5GB              12GB
CUDA         v6.5             v7.0             v7.0             v7.5


SLIDE 10

Approach

Evaluation Kernel

C = A × B

Partition B (and hence C) by columns:

  [ C1 | C2 ] = A × [ B1 | B2 ]

so the two halves are independent:

  C1 = A × B1   (CPU)
  C2 = A × B2   (GPU)
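The column split can be sketched in a few lines. Here both halves are computed on the host purely for illustration; on the real systems one block would go to cuBLAS on the GPU and the other to a CPU BLAS:

```python
# Column-wise split of C = A x B: one worker computes C1 = A x B1,
# the other C2 = A x B2, where [B1 | B2] partitions B's columns.
def matmul(A, B):
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

def split_cols(B, s):
    """Partition B's columns into B1 = B[:, :s] and B2 = B[:, s:]."""
    return [row[:s] for row in B], [row[s:] for row in B]

A = [[1, 2], [3, 4]]
B = [[5, 6, 7], [8, 9, 10]]
B1, B2 = split_cols(B, 1)                # "CPU" gets 1 column, "GPU" gets 2
C1, C2 = matmul(A, B1), matmul(A, B2)    # the two halves never interact
C = [r1 + r2 for r1, r2 in zip(C1, C2)]  # glue the column blocks back together
assert C == matmul(A, B)
```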


SLIDE 11

Approach

Approaches

Traditional methods: assign all work to the GPU or the CPU.

Static partitioning: partition work between GPU and CPU based on a priori information.

  • Beaumont et al., Matrix Multiplication on Heterogeneous Platforms
  • C. Yang et al., Adaptive Optimization for Petascale Heterogeneous CPU/GPU Computing
  • Donfack et al., Dynamically Balanced Synchronization-Avoiding LU Factorization with Multicore and GPUs

Dynamic Partitioning:

Papadrakakis et al., A New Era in Scientific Computing: Domain Decomposition Methods in Hybrid CPU-GPU Architectures

→ Existing approaches do not consider the use of shared physical memory or the implications for energy efficiency


SLIDE 12

Approach

Our approach

Static partitioning:

  • Guess a partition based on experimentally measured peak performances of the CPU and GPU
  • Use the achieved rates to refine the partition
  • Repeat until convergence
  • Suitable for repeated calculations of the same size
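The refinement loop can be sketched as follows. The `measure` callback is a stand-in for an experimentally measured (GFLOPS_cpu, GFLOPS_gpu) pair at the current split; the constant rates used below are hypothetical, chosen to resemble the TK1 DGEMM peaks:

```python
# Iterative static-partition search: start from a guess, re-split the
# columns in proportion to the achieved rates, repeat until convergence.
def refine_split(n_cols, measure, max_iter=20):
    """measure(cpu_cols) -> (cpu_gflops, gpu_gflops), assumed to come
    from an actual benchmark run at that split (stubbed out below)."""
    cpu_cols = n_cols // 2
    for _ in range(max_iter):
        r_cpu, r_gpu = measure(cpu_cols)
        new_cols = round(n_cols * r_cpu / (r_cpu + r_gpu))
        if new_cols == cpu_cols:     # converged: the split stopped moving
            return cpu_cols
        cpu_cols = new_cols
    return cpu_cols

def fake_measure(cpu_cols):
    # Hypothetical device model: CPU ~14 GFLOPS, GPU ~12 (TK1 DGEMM-like),
    # independent of the split, to keep the sketch deterministic.
    return 14.0, 12.0

split = refine_split(4096, fake_measure)
print(split)   # 4096 * 14 / 26 rounds to 2206 columns for the CPU
```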

Use of shared memory on SoC systems:

  • The CUDA driver automatically protects CUDA-allocated memory during the kernel execution phase
  • We circumvent this by immediately unprotecting the memory with mprotect() after initiating a kernel execution

Dynamic partitioning:

  • The CPU and GPU remove chunks of matrix columns from a work queue
  • The chunk size must be sufficient to occupy both the CPU and GPU fully
  • On traditional discrete GPU systems, copies have to be carefully scheduled
  • Implemented using OpenMP: two threads, one each for the CPU and GPU, taking work off a master queue
  • The GPU thread executes at the expense of doing productive work on the CPU cores
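The two-thread scheme can be sketched with a shared queue. The slides implement this with OpenMP; below is a plain-threads sketch of the same idea, where the "cpu" and "gpu" workers stand in for the CPU worker and the GPU-driver thread, and the 256-column chunk grain is a hypothetical choice:

```python
import queue
import threading

# Dynamic partitioning sketch: two workers pull column chunks from a
# shared work queue until it is drained.
N_COLS, CHUNK = 4096, 256
work = queue.Queue()
for start in range(0, N_COLS, CHUNK):
    work.put((start, start + CHUNK))

done = {"cpu": [], "gpu": []}

def worker(name):
    while True:
        try:
            lo, hi = work.get_nowait()   # grab the next chunk of columns
        except queue.Empty:
            return                       # queue drained: this worker is done
        # ... compute C[:, lo:hi] = A x B[:, lo:hi] on this device ...
        done[name].append((lo, hi))

threads = [threading.Thread(target=worker, args=(n,)) for n in ("cpu", "gpu")]
for t in threads: t.start()
for t in threads: t.join()

# Every chunk is processed exactly once, regardless of who grabbed it.
covered = sorted(done["cpu"] + done["gpu"])
assert covered == [(s, s + CHUNK) for s in range(0, N_COLS, CHUNK)]
```

A faster worker naturally takes more chunks, which is the load-balancing property the dynamic scheme relies on.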


SLIDE 13

Results & Analysis

Results: Best split performance

        Platform   Matrix Size   CPU GFLOPS   GPU GFLOPS   CPU SPLIT COLS   SPLIT GFLOPS
DGEMM   TK1        4096          14           12           2176             26
        TX1        4096          18           9            2608             25
        SANDY      8192          311          836          2128             1099
        HASWELL    16384         804          1124         6912             1870
SGEMM   TK1        4096          34           205          448              227
        TX1        4096          38           391          128              399
        SANDY      16384         643          2318         3392             2887
        HASWELL    16384         1753         2526         6896             4109


SLIDE 14

Results & Analysis

Best Split Search - Tegra K1/X1

(Figure: four panels plotting DGEMM/SGEMM GFLOPS and energy in Joules against the split size given to the CPU, for the TK1 and TX1.)

SLIDE 15

Results & Analysis

Best Split Search - Intel + NVIDIA GPUs

(Figure: four panels plotting DGEMM/SGEMM GFLOPS and energy in Joules against the split size given to the CPU, for the SANDY and HASWELL systems.)

SLIDE 16

Results & Analysis

Performance Scaling - TK1

(Figure: DGEMM and SGEMM GFLOPS versus matrix dimension M=N=K on the TK1, comparing CPU, GPU, SPLIT, DYNAMIC, TBALANCE, and the combined CPU+GPU peak.)

SLIDE 17

Results & Analysis

Performance Scaling - TX1

(Figure: DGEMM and SGEMM GFLOPS versus matrix dimension M=N=K on the TX1, comparing CPU, GPU, SPLIT, DYNAMIC, TBALANCE, and the combined CPU+GPU peak.)

SLIDE 18

Results & Analysis

Energy Efficiency - TX1 - SGEMM

(Figure: Joules/FLOP (single precision) versus matrix dimension M=N=K on the TX1, log scale, comparing CPU, GPU, SPLIT, TBALANCE, and DYNAMIC; annotated values: 4.22·10⁻¹⁰ and 3.75·10⁻¹¹ J/FLOP.)

SLIDE 19

Results & Analysis

Energy Efficiency - Haswell - SGEMM

(Figure: Joules/FLOP (single precision) versus matrix dimension M=N=K on HASWELL, log scale, comparing CPU, GPU, SPLIT, TBALANCE, and DYNAMIC; annotated values: 1.76·10⁻¹⁰ and 8.24·10⁻¹¹ J/FLOP.)

SLIDE 20

Conclusion

Conclusion

  • The high-accuracy, high-resolution energy measurement system introduced here enables tuning algorithms for optimal energy usage
  • This would allow libraries like ATLAS to produce both best-performance and best-energy optimized builds
  • Open question: how might a running application use information on energy usage to dynamically change its behaviour?
  • Use of shared physical memory on SoC systems eliminates transfer overhead
  • In some circumstances (e.g. TX1 DGEMM), an energy benefit was observed from exploiting both the CPU and GPU together
  • The best energy efficiency observed on the SoC systems was 37.5 pJ/FLOP (SGEMM on the TX1); on the conventional systems, the best was 82.4 pJ/FLOP (SGEMM on the K80)
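The pJ/FLOP figures follow directly from measured energy and the GEMM operation count (2N³ for an N×N multiply). As an illustration, backing out the energy implied by the quoted TX1 best at a hypothetical N = 4096 run:

```python
# Energy efficiency in J/FLOP for an N x N GEMM: flop count = 2 * N^3.
def joules_per_flop(joules, n):
    return joules / (2 * n**3)

# Hypothetical problem size; 37.5 pJ/FLOP at N = 4096 implies ~5.15 J
# of energy for the whole multiply.
n = 4096
energy_j = 37.5e-12 * 2 * n**3              # back out the implied energy
print(f"{energy_j:.2f} J")                  # prints 5.15 J
print(f"{joules_per_flop(energy_j, n):.3e} J/FLOP")  # prints 3.750e-11
```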

Contact: Alistair.Rendell@anu.edu.au https://www.linkedin.com/in/alistair-rendell-6230b72
