High Performance and Energy Efficient Machine Learning Accelerators and Variable Precision Circuits for Sub-10nm Microprocessors
(Invited Paper)
Ram Krishnamurthy, Senior Principal Engineer & Director of High Performance and Low Voltage
Teraflops of performance operating on Terabytes of data
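[Figure: workload evolution chart. Dataset size scale: Kilobytes, Megabytes, Gigabytes, Terabytes. Performance scale: KIPS, MIPS, GIPS, TIPS. Workload labels: Text; Multi-Media; 3D & Video; Model-based Apps (Recognition, Mining, Synthesis). Platform labels: Single-core; Multi-core. Application examples: Personal Media Creation and Management; Entertainment, learning and virtual travel; Health; Financial Analytics.]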
[Figure: multi-core processor block diagrams. Labels: Scalable On-die Interconnect Fabric; Graphics, Video, Special Purpose Engines; Integrated Memory Controllers; Off Die interconnect; Cache; Last Level Cache.]
Dynamic V/F control
Independent V/F control regions
Workload-based core activation & shutdown
Scenario-based power allocation
Maximize performance & efficiency
More, better transistors
More cores
Continued benefits from Moore’s Law
[Figure: scaling trend on a log axis (10^3, 10^5, 10^7, 10^9), from 45nm (2007) to 14nm Trigate (2014).]
Source: Intel
Single Core Plateau
Single Thread Performance
1996 – 2004: Increased 28x
2004 – 2012: Increased 4.6x
IPC gains now at ~3%/gen
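The two cumulative gains above are easier to compare as per-year rates. The sketch below converts a total gain over a span of years into a compound annual rate. It assumes each period is exactly 8 years, and the helper name `annualized_growth` is illustrative, not from the slides.

```python
def annualized_growth(total_gain, years):
    """Compound annual growth rate implied by a cumulative gain over `years`."""
    if total_gain <= 0 or years <= 0:
        raise ValueError("total_gain and years must be positive")
    return total_gain ** (1.0 / years) - 1.0


if __name__ == "__main__":
    # Figures taken from the slide; the 8-year spans are an assumption.
    early = annualized_growth(28.0, 8)   # 1996 to 2004
    late = annualized_growth(4.6, 8)     # 2004 to 2012
    print(f"1996-2004: {early:.1%} per year")
    print(f"2004-2012: {late:.1%} per year")
```

Under these assumptions the first period works out to roughly 52% per year and the second to roughly 21% per year.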
[Figure: # Cores (64 to 448) vs. Process Node (45, 32, 22, 14, 10, 7). Curves: # Cores to achieve 90% max performance (Amdahl’s Law); Cores available by Process Scaling.]
Multicore Scalability Gap
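The "# Cores to achieve 90% max performance" curve can be related to Amdahl's Law with a short calculation. For a parallel fraction p, the speedup on n cores is 1 / ((1 - p) + p / n) and the limit is 1 / (1 - p). Solving for the n that reaches a fraction f of that limit gives n = f p / ((1 - f)(1 - p)), which is 9p / (1 - p) for f = 0.9. The sketch below is a minimal illustration; the parallel fractions are hypothetical examples, not values from the slide.

```python
def amdahl_speedup(p, n):
    """Speedup on n cores for a workload with parallel fraction p (Amdahl's Law)."""
    return 1.0 / ((1.0 - p) + p / n)


def max_speedup(p):
    """Limit of the speedup as the core count grows without bound."""
    return 1.0 / (1.0 - p)


def cores_for_fraction_of_max(p, fraction=0.9):
    """Core count (as a real number) at which speedup reaches fraction * max_speedup(p)."""
    if not (0.0 < p < 1.0) or not (0.0 < fraction < 1.0):
        raise ValueError("p and fraction must lie strictly between 0 and 1")
    return fraction * p / ((1.0 - fraction) * (1.0 - p))


if __name__ == "__main__":
    # Hypothetical parallel fractions, chosen only for illustration.
    for p in (0.90, 0.95, 0.99):
        n = cores_for_fraction_of_max(p)
        print(f"p={p:.2f}: about {n:.0f} cores for 90% of the {max_speedup(p):.0f}x limit")
```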
Dark Silicon
System Integration
Neuromorphic Computing
Biological form, “Intelligent” Applications
[Figure: Brain Inspired Computing vs. Standard Computing. Labels: CPU, MEM, "if X then … else …", 01100 11010 00100.]
Good for efficiency, but problematic for SW and System complexity.
FPGA
iq.intel.com
THEORY
HARDWARE / SOFTWARE / SIMULATION
APPLICATIONS / SOLUTIONS
Neurithmic
…and more…
[Figure: kNN accelerator block diagram. Labels: Query Object Vector (Q); Reference Vector 0 through Reference Vector 127 (1024b); per reference vector: Reference Vector Storage, Partial Distance Compute, Accumulator, Local Control, with outputs psum0/valid0 through psum127/valid127; Minimum Sort Network with outputs {minaddr, minprecise, minvalid, minpsum}; Global Control.]
[Figure: distance computation between Query Vector (q) and Reference Vector (r), each 128x8b. Other labels: ×n, ×n-1.]
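As a software point of reference for the query/reference distance computation, the sketch below is a plain brute-force k nearest neighbor search over 128 reference vectors of 128 8-bit elements, using either squared Euclidean or Manhattan distance. It is a baseline model only and does not describe the accelerator's circuits. The function names and the random test data are assumptions made for illustration.

```python
import random

VECTOR_LEN = 128   # elements per vector, from the "128x8b" label
NUM_REFS = 128     # Reference Vector 0..127 in the block diagram


def squared_euclidean(q, r):
    """Sum of squared element differences (ranks the same as Euclidean distance)."""
    return sum((a - b) ** 2 for a, b in zip(q, r))


def manhattan(q, r):
    """Sum of absolute element differences."""
    return sum(abs(a - b) for a, b in zip(q, r))


def knn(query, refs, k, distance=squared_euclidean):
    """Indices of the k references closest to query; ties go to the lower index."""
    ranked = sorted(range(len(refs)), key=lambda i: (distance(query, refs[i]), i))
    return ranked[:k]


if __name__ == "__main__":
    rng = random.Random(0)  # hypothetical data, 8-bit values in 0..255
    refs = [[rng.randrange(256) for _ in range(VECTOR_LEN)] for _ in range(NUM_REFS)]
    query = [rng.randrange(256) for _ in range(VECTOR_LEN)]
    print("Euclidean:", knn(query, refs, 3))
    print("Manhattan:", knn(query, refs, 3, distance=manhattan))
```

Squared Euclidean distance is used instead of its square root because the square root does not change the ranking of neighbors.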
[Figure: Example kNN Operation (Euclidean). Search Space (Valid Vectors), 32 to 128, vs. Sort Iteration, 1 to 26, with markers where the kth NN is found.]
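One well-known software technique that shrinks the kNN search space using partial sums is partial distance search: a candidate is dropped as soon as its running sum reaches the distance of the current kth best neighbor. The sketch below illustrates that general idea only. It is a swapped-in software analogue and is not claimed to be the elimination scheme used by the accelerator. All names are hypothetical.

```python
import heapq


def knn_brute_force(query, refs, k):
    """Reference result: indices of the k nearest vectors by squared Euclidean distance."""
    def dist(i):
        return sum((a - b) ** 2 for a, b in zip(query, refs[i]))
    return sorted(range(len(refs)), key=lambda i: (dist(i), i))[:k]


def knn_partial_distance(query, refs, k):
    """kNN with early abandonment; returns (neighbor indices, element operations used)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    heap = []   # entries are (-distance, -index); heap[0] is the worst neighbor kept so far
    ops = 0
    for i, ref in enumerate(refs):
        bound = -heap[0][0] if len(heap) == k else None
        psum = 0
        abandoned = False
        for a, b in zip(query, ref):
            psum += (a - b) ** 2
            ops += 1
            # The partial sum only grows, so this candidate can no longer enter the top k.
            if bound is not None and psum >= bound:
                abandoned = True
                break
        if abandoned:
            continue
        if len(heap) < k:
            heapq.heappush(heap, (-psum, -i))
        else:
            heapq.heapreplace(heap, (-psum, -i))
    neighbors = sorted((-neg_d, -neg_i) for neg_d, neg_i in heap)
    return [i for _, i in neighbors], ops
```

Comparing the returned operation count with `len(refs) * len(query)` gives a rough software measure of how much work the early abandonment saves.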
[Figures: two plots against k Nearest Neighbors; other axis labels not recoverable.]
14nm CMOS, 338MHz, 750mV, 25°C
[Figure: plot vs. Supply Voltage (mV), 350 to 850 mV; other axis labels not recoverable.]
14nm CMOS, 25°C
[Figure: plot vs. Supply Voltage (mV), 350 to 850 mV; other axis labels not recoverable.]
2 W - 100 GigaFLOPS
20 MW - ExaFLOPS
10 year goal: ~300X improvement in energy efficiency
Equal to 20 pJ/FLOP at the system level
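The 20 pJ figure follows directly from dividing power by throughput, since watts divided by FLOP per second gives joules per FLOP. The sketch below checks both operating points from the slide. The implied starting point (300 times the target) is an inference from the "~300X" goal, not a number stated in the slides, and the helper name is illustrative.

```python
def energy_per_flop_pj(power_watts, flops_per_second):
    """Energy per floating-point operation in picojoules."""
    if flops_per_second <= 0:
        raise ValueError("flops_per_second must be positive")
    return power_watts / flops_per_second * 1e12


if __name__ == "__main__":
    small = energy_per_flop_pj(2.0, 100e9)   # 2 W at 100 GigaFLOPS
    large = energy_per_flop_pj(20e6, 1e18)   # 20 MW at 1 ExaFLOPS
    print(f"2 W / 100 GFLOPS  = {small:.1f} pJ/FLOP")
    print(f"20 MW / 1 EFLOPS  = {large:.1f} pJ/FLOP")
    # Assumption: the ~300X goal is measured against the 20 pJ/FLOP target.
    print(f"Implied starting point: about {300 * small / 1000:.0f} nJ/FLOP")
```

Both operating points come out to the same 20 pJ/FLOP, which is why the slide pairs them.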
[Figure: Search Space Reduction (×) vs. k Nearest Neighbor, for Euclidean and Manhattan distance.]