Accelerating Exascale: How the End of Moore's Law Scaling is Changing the Machines You Use, the Way You Code, and the Algorithms You Use
Steve Oberlin, CTO, Accelerated Computing, NVIDIA
Mar 2, 2014. NVIDIA Confidential
G. Bell, History of Supercomputers, LLNL, April 2013
Justin Rattner, Micro2000 presentation, 1990
Maxed out power budget
Then: leakage was negligible, and supply voltage scaled down with feature size.
Now: leakage current puts a floor under the threshold voltage, largely ending voltage scaling.
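The contrast is visible in the classic dynamic-power relation P = C V^2 f. A minimal sketch of the arithmetic (the shrink factor k and the unit baselines are illustrative, not from the deck):

```python
# Sketch: Dennard-era vs. post-Dennard scaling of dynamic power density.
# Dynamic switching power per transistor: P = C * V^2 * f.
def dynamic_power(c, v, f):
    return c * v * v * f

k = 0.7  # assumed linear shrink per process generation

# Dennard era: capacitance and voltage shrink with k, frequency rises by 1/k.
p0 = dynamic_power(c=1.0, v=1.0, f=1.0)
p1 = dynamic_power(c=1.0 * k, v=1.0 * k, f=1.0 / k)
# Transistor count per area grows by 1/k^2, so power density is p1 / k^2.
print(f"Dennard power density ratio:      {p1 / k**2 / p0:.2f}")  # ~1.0, flat

# Post-Dennard: leakage pins V (no more voltage scaling), C and f still scale.
p2 = dynamic_power(c=1.0 * k, v=1.0, f=1.0 / k)
print(f"Post-Dennard power density ratio: {p2 / k**2 / p0:.2f}")  # ~2.0, rising
```

With voltage scaling, power density stays flat generation over generation; without it, power density roughly doubles per shrink, which is why the power budget maxed out.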
C. Moore, Data Processing in ExaScale-Class Computer Systems, Salishan, April 2011
[Charts: accelerated systems by type (NVIDIA GPUs, Intel Phi, others), 2011-2013, and the percentage of HPC customers with accelerators, 2010-2013, growing from 44% to 77%.]
Sources: Intersect360 Research HPC User Site Census: Systems, July 2013; IDC HPC End-User MSC Study, 2013
[Chart: SGEMM/W (normalized), 2008-2016, across GPU architectures: Tesla, Fermi, Kepler, Maxwell, Pascal.]
Energy efficiency ranking (MFLOPS/W):
 1  4,503.17  GSIC Center, Tokyo Tech
 2  3,631.86  Cambridge University
 3  3,517.84  University of Tsukuba
 4  3,185.91  Swiss National Supercomputing Centre (CSCS)
 5  3,130.95  ROMEO HPC Center
 6  3,068.71  GSIC Center, Tokyo Tech
 7  2,702.16  University of Arizona
 8  2,629.10  Max Planck
 9  2,629.10  (Financial Institution)
10  2,358.69  CSIRO
37  1,959.90  Intel Endeavor (top Xeon Phi cluster)
49  1,247.57  Météo France (top CPU cluster)
Today: 20 PF at 10 MW, i.e. 2 GFLOPS/W
Exascale target: 1,000 PF (50x) at 20 MW (2x), i.e. 50 GFLOPS/W (25x), on LINPACK
Energy efficiency improvements due to process alone:

Year                                      2013-14  2016   2020
Process node                              28nm     16nm   7nm
Logic energy scaling factor (0.70x)       1.00     0.70   0.49
Wire energy scaling factor (0.90x)        1.00     0.85   0.72
VDD (V)                                   0.90     0.80   0.75
Total power (W, 70% logic / 30% wires)    100      58     38
Energy efficiency improvement             1.00     1.70   2.57
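The power row follows from a simple model: blend the logic and wire energy factors 70/30, then scale by the square of the VDD ratio. A sketch reproducing the arithmetic (the blending model is inferred from the table, not stated in the deck; results land within rounding of the table's 58 W / 38 W and 1.70x / 2.57x):

```python
# Reproduce the process-scaling table: power per node generation under a
# 70% logic / 30% wires split, with dynamic power scaling as VDD^2.
nodes = ["28nm (2013-14)", "16nm (2016)", "7nm (2020)"]
logic = [1.00, 0.70, 0.49]   # logic energy scaling factor per node
wires = [1.00, 0.85, 0.72]   # wire energy scaling factor per node
vdd   = [0.90, 0.80, 0.75]   # supply voltage (V)

base_power = 100.0  # watts at 28nm
for i, node in enumerate(nodes):
    mix = 0.70 * logic[i] + 0.30 * wires[i]   # blended energy factor
    v_sq = (vdd[i] / vdd[0]) ** 2             # voltage contribution
    power = base_power * mix * v_sq
    print(f"{node}: {power:.0f} W, efficiency x{base_power / power:.2f}")
```

The takeaway is the gap: process alone delivers roughly 2.6x by 2020, while the exascale target needs 25x, so the remaining ~10x must come from architecture and circuits.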
[Diagram: GPU package surrounded by eight HBM stacks.]
Bandwidth, DDR memory vs. stacked memory:
- DDR4: 50-75 GB/s
- NVLink: 80 GB/s
- HBM: 1 Terabyte/s
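To put these interface speeds in compute terms, a sketch of bytes-per-flop ratios (the 3 TFLOP/s accelerator rate and the DDR4 midpoint are illustrative assumptions, not figures from the deck):

```python
# Bytes-per-flop delivered by each interface for a hypothetical
# 3 TFLOP/s accelerator (assumed rate; DDR4 uses the 50-75 GB/s midpoint).
interfaces = {
    "DDR4":   62.5e9,   # bytes/s
    "NVLink": 80e9,     # bytes/s
    "HBM":    1e12,     # bytes/s
}
flops = 3e12  # assumed peak rate

for name, bw in interfaces.items():
    print(f"{name:6s}: {bw / 1e9:6.1f} GB/s -> {bw / flops:.4f} bytes/flop")
```

Only stacked memory gets within an order of magnitude of feeding the compute, which is why HBM sits next to the GPU while DDR and NVLink serve capacity and host traffic.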
[Diagram: moving 256 bits across a 20 mm die.]
180 W continuous power and a 25x25 mm die. Bi-section bandwidth = 6 TBytes/s; data is moved an average of 15 mm.
[Diagram: compute + global signaling.]
nTECH'13, John Wilson
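A back-of-envelope check shows why that average distance matters. The wire energy figure below is an illustrative assumption for on-chip signaling, not a number from the deck:

```python
# Power consumed just moving the bi-section traffic across the die.
# 0.1 pJ/bit/mm is an assumed on-chip wire energy, for illustration only.
bisection_bw_bits = 6e12 * 8     # 6 TBytes/s from the slide, in bits/s
avg_distance_mm = 15             # average data movement, from the slide
wire_energy_j = 0.1e-12          # joules per bit per mm (assumption)

watts = bisection_bw_bits * avg_distance_mm * wire_energy_j
print(f"{watts:.0f} W spent on data movement")  # 72 W
```

Under that assumption, data movement alone eats 72 W of the 180 W budget, which motivates the question on the next slide: what if there were no long wires?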
[What if there were no long wires?]
Mobile super chip: Tegra K1, with 192 Kepler cores.
[Chart: GPU architecture (Tesla, Fermi, Kepler, Maxwell) converging with mobile architecture (Tegra 3, Tegra 4, Tegra K1).]
[Diagram: straw-man exascale node. Throughput-optimized cores (TOC0.., each with cores C0..Cn and an L2) run parallel code; latency-optimized cores (LOC 0..LOC 7, with a CPU L2) run the OS and pointer-chasing code. A NoC/bus joins them to memory controllers (MC) serving stacked, bulk, and NVRAM memory (DRAM stacks, DRAM, NVRAM), and to a NIC. Links from each of ~100K nodes feed the system interconnect.]
Oreste Villa, Scaling the Power Wall: A Path to Exascale
forall molecule in molecules {                 // doubly nested parallel loops
    forall neighbor in molecule.neighbors {
        molecule.force = reduce_sum(force(molecule, neighbor))
    }
}
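A minimal sequential Python rendering of the doubly nested forall, to make the semantics concrete. The Molecule fields and the pairwise force model are illustrative assumptions; the forall only promises that the iterations are independent and can be mapped anywhere:

```python
# Sketch of the forall/reduce_sum pattern: each molecule's force is the
# reduction of a pairwise force over its neighbors.
from dataclasses import dataclass, field

@dataclass
class Molecule:
    position: float                              # 1-D position (illustrative)
    neighbors: list = field(default_factory=list)
    force: float = 0.0

def pair_force(m, n):
    # Placeholder pairwise force model (assumption, not the deck's).
    return n.position - m.position

molecules = [Molecule(0.0), Molecule(1.0), Molecule(3.0)]
molecules[0].neighbors = [molecules[1], molecules[2]]
molecules[1].neighbors = [molecules[0], molecules[2]]
molecules[2].neighbors = [molecules[0], molecules[1]]

# forall molecule { forall neighbor { reduce_sum(...) } }, run sequentially here;
# a compiler is free to map both loops across cores, GPUs, or nodes.
for m in molecules:
    m.force = sum(pair_force(m, n) for n in m.neighbors)

print([m.force for m in molecules])  # → [4.0, 1.0, -5.0]
```

Because no iteration reads another molecule's force, the mapping decisions listed below (time, space, memories, staging) are all legal transformations of this loop nest.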
- Map foralls in time and space
- Map molecules across memories
- Stage data up/down the hierarchy
- Select mechanisms: exposed storage hierarchy, fast comm/sync/thread mechanisms