Exploring Tradeoffs between Programmability and Efficiency in Data-Parallel Accelerators (PowerPoint PPT Presentation)



SLIDE 1

Parallel Computing Laboratory
EECS Electrical Engineering and Computer Sciences
Berkeley Par Lab

Exploring Tradeoffs between Programmability and Efficiency in Data-Parallel Accelerators

Yunsup Lee [1], Rimas Avizienis [1], Alex Bishara [1], Richard Xia [1], Derek Lockhart [2], Christopher Batten [2], Krste Asanovic [1]

[1] The Parallel Computing Lab, UC Berkeley   [2] Computer Systems Lab, Cornell University

SLIDE 2

Yunsup Lee / UC Berkeley Par Lab

DLP Kernels Dominate Many Computational Workloads

Graphics Rendering, Computer Vision, Audio Processing, Physical Simulation

SLIDE 3

DLP Accelerators are Getting Popular

Sandy Bridge, Tegra, Knights Ferry, Fermi

SLIDE 4

Important Metrics when Comparing DLP Accelerator Architectures

  • Performance per Unit Area
  • Energy per Task
  • Flexibility (What can it run well?)
  • Programmability (How hard is it to write code?)

SLIDE 5

Efficiency vs. Programmability: It’s a tradeoff

[Charts: Programmability vs. Efficiency for MIMD and Vector designs, shown separately for regular DLP and irregular DLP]

SLIDE 6

Maven Provides Both Greater Efficiency and Easier Programmability

[Charts: Programmability vs. Efficiency for regular and irregular DLP, with Maven/Vector-Thread added alongside MIMD and Vector]

SLIDE 7

Where does the GPU/SIMT fit in this picture?

[Charts: Programmability vs. Efficiency for regular and irregular DLP, asking where GPU/SIMT falls relative to MIMD, Vector, and Maven/Vector-Thread]

SLIDE 8

Outline

§ Data-Parallel Architectural Design Patterns
  § MIMD, Vector-SIMD, Subword-SIMD, SIMT, Maven/Vector-Thread
§ Microarchitectural Components
§ Evaluation Framework
§ Evaluation Results

SLIDE 9

DLP Pattern #1: MIMD

Programmer’s Logical View

[Figure: pseudocode applying a per-element FILTER and OP across the data]

SLIDE 10

DLP Pattern #1: MIMD

Programmer’s Logical View / Typical Microarchitecture
Examples: Tilera, Rigel
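The MIMD pattern can be made concrete with a small sketch (not from the deck; `mimd_map` and its loop body are hypothetical): each core runs an ordinary scalar thread over its own chunk, so data-dependent control flow costs nothing beyond what a scalar core already pays.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical irregular DLP loop mapped onto the MIMD pattern: each
// "core" is a scalar thread that walks its own chunk, and the per-element
// data-dependent filter is just an ordinary branch.
void mimd_map(const std::vector<int>& in, std::vector<int>& out,
              unsigned num_threads) {
  std::vector<std::thread> pool;
  std::size_t chunk = (in.size() + num_threads - 1) / num_threads;
  for (unsigned t = 0; t < num_threads; ++t) {
    pool.emplace_back([&, t] {
      std::size_t lo = t * chunk;
      std::size_t hi = std::min(in.size(), lo + chunk);
      for (std::size_t i = lo; i < hi; ++i)
        out[i] = (in[i] > 0) ? in[i] * 2 : 0;  // filter + op, per element
    });
  }
  for (auto& th : pool) th.join();
}
```

The flexibility comes at the cost of duplicating full scalar front-ends per core, which is what the later area and energy results quantify.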

SLIDE 11

DLP Pattern #2: Vector-SIMD

Programmer’s Logical View

SLIDE 12

DLP Pattern #2: Vector-SIMD

Programmer’s Logical View / Typical Microarchitecture
Examples: T0, Cray-1
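The programming methodology later in the deck leans on GCC's built-in vectorizer for simple regular DLP; a hypothetical vvadd-style loop in the vectorizer-friendly form (countable trip count, unit stride, no data-dependent control flow) might look like:

```cpp
#include <cstddef>

// Regular DLP written so an auto-vectorizer can map it to vector-SIMD:
// on a vector machine this becomes unit-stride vector loads, a vector
// add, and a unit-stride vector store, strip-mined over the vector length.
void vvadd(const float* a, const float* b, float* c, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i)
    c[i] = a[i] + b[i];
}
```

Anything less regular than this (the deck's "more complicated code") fell back to inline assembly in their flow.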

SLIDE 13

DLP Pattern #3: Subword-SIMD

Programmer’s Logical View / Typical Microarchitecture
Examples: AVX/SSE

SLIDE 14

DLP Pattern #4: GPU/SIMT

Programmer’s Logical View

SLIDE 15

DLP Pattern #4: GPU/SIMT

Programmer’s Logical View / Typical Microarchitecture
Example: Fermi

SLIDE 16

DLP Pattern #5: Vector-Thread (VT)

Programmer’s Logical View

SLIDE 17

DLP Pattern #5: Vector-Thread (VT)

Programmer’s Logical View / Typical Microarchitecture
Examples: Scale, Maven
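Real Maven code uses vector-fetch instructions reached through C++ macros and inline assembly (per the programming-methodology slide); the following is a purely scalar C++ emulation of the logical model, with hypothetical names, just to show the division of labor between the control thread and the microthreads:

```cpp
#include <functional>

// Scalar emulation of the vector-thread logical model (illustrative only).
// The control thread sets a vector length and "vector-fetches" a function;
// each microthread (uT) runs it with its own index and may branch
// independently; on real hardware the PVFB reconverges those branches.
struct VTControlThread {
  int vlen = 0;
  void setvl(int n) { vlen = n; }                    // configure vector length
  void vfetch(const std::function<void(int)>& ut_body) {
    for (int ut = 0; ut < vlen; ++ut) ut_body(ut);   // hardware runs uTs on lanes
  }
};
```

The point of the model is that the fetched body may contain arbitrary branches, which a vector-SIMD machine would instead have to express with flag registers.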

SLIDE 18

Outline

§ Data-Parallel Architectural Design Patterns
§ Microarchitectural Components
§ Evaluation Framework
§ Evaluation Results

SLIDE 19

Focus on the Tile

MIMD Tile, Vector Tile with Four Single-Lane Cores, Vector Tile with One Four-Lane Core

SLIDE 20

uArchitecture

§ Developed a library of parameterized synthesizable RTL components

SLIDE 21

Retimable Long-latency Functional Units

§ 32-bit integer multiplier, divider
§ Single-precision floating-point add, multiply, divide, square root

SLIDE 22

5-stage Multi-threaded Scalar Core

§ Change number of entries in register file (32, 64, 128, 256) to vary degree of multi-threading (1, 2, 4, 8 threads)

SLIDE 23

Vector Lanes

§ Vector registers and ALUs
§ Density-time execution
§ Replicate the lanes and execute in lock step for higher throughput
§ Vector-SIMD: flag registers

SLIDE 24

Vector Issue Unit

§ Vector-SIMD: VIU only handles scheduling; data-dependent control is done by flag registers
§ Maven: VIU fetches instructions; PVFB handles uT branches and does control-flow convergence

SLIDE 25

Vector Memory Unit

§ VMU handles unit-stride and constant-stride vector memory operations
§ Vector-SIMD: VMU handles scatter, gather
§ Maven: VMU handles uT loads and stores

SLIDE 26

Blocking, Non-blocking Caches

§ Access port width
§ Refill port width
§ Cache line size
§ Total capacity
§ Associativity

Only for non-blocking caches:
§ # MSHRs
§ # secondary misses per MSHR

SLIDE 27

A Big Design Space …

§ Number of entries in scalar register file: 32, 64, 128, 256 (1, 2, 4, 8 threads)
§ Number of entries in vector register file: 32, 64, 128, 256
§ Architecture of vector register file: 6r3w unified register file, 4x 2r1w banked register file
§ Per-bank integer ALU
§ Density-time execution
§ Pending Vector Fragment Buffer (PVFB): FIFO, 1-stack, 2-stack
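Multiplying the options out shows why this is called a big design space. A throwaway sketch (names hypothetical; note that not every combination is meaningful in the study, e.g. a per-bank ALU presupposes the banked register file):

```cpp
#include <string>
#include <vector>

// Enumerate the Cartesian product of the design-space axes listed above:
// 4 scalar-RF sizes x 4 vector-RF sizes x 2 RF organizations
// x 2 (per-bank ALU) x 2 (density-time) x 3 PVFB schemes = 384 points.
std::vector<std::string> enumerate_designs() {
  std::vector<std::string> designs;
  for (int sregs : {32, 64, 128, 256})
    for (int vregs : {32, 64, 128, 256})
      for (const char* rf : {"6r3w-unified", "4x2r1w-banked"})
        for (bool bank_alu : {false, true})
          for (bool density_time : {false, true})
            for (const char* pvfb : {"FIFO", "1-stack", "2-stack"})
              designs.push_back(std::to_string(sregs) + "/" +
                                std::to_string(vregs) + "/" + rf +
                                (bank_alu ? "+bi" : "") +
                                (density_time ? "+dt" : "") + "/" + pvfb);
  return designs;
}
```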

SLIDE 28

Outline

§ Data-Parallel Architectural Design Patterns
§ Microarchitectural Components
§ Evaluation Framework
§ Evaluation Results

SLIDE 29

Programming Methodology

§ Use GCC C++ cross compiler (which we ported)
§ MIMD
  § Custom application-scheduled lightweight threading lib
§ Vector-SIMD
  § Leverage built-in GCC vectorizer for mapping very simple regular DLP code
  § Use GCC's inline assembly extensions for more complicated code
§ Maven
  § Use C++ macros with special library, which glues the control thread and microthreads
  § Automatic vector register allocation added to GCC

SLIDE 30

Microbenchmarks & Application Kernels

Microbenchmarks:

  Name         Explanation                            Irregularity
  vvadd        1000-element FP vector-vector add      Regular
  bsearch      1000 look-ups into a sorted array      Very Irregular
  bsearch-cmv  inner loop rewritten with cond. mov    Somewhat Irregular

Application Kernels:

  Name         Explanation                            Irregularity
  viterbi      Decode frames using Viterbi alg.       Regular
  rsort        Radix sort on an array of integers     Slightly Irregular
  kmeans       K-means clustering algorithm           Slightly Irregular
  dither       Floyd-Steinberg dithering              Somewhat Irregular
  physics      Newtonian physics simulation           Very Irregular
  strsearch    Knuth-Morris-Pratt algorithm           Very Irregular
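The bsearch-cmv entry rewrites the binary-search inner loop with conditional moves. A hypothetical branchless form of the idea (not the deck's actual code) looks like this:

```cpp
// Sketch of the bsearch-cmv idea: the branchy inner loop of binary search
// is rewritten with conditional selects, turning "very irregular" control
// flow into data flow that vector hardware handles well. Each ternary
// compiles to a conditional move (or vector select), so every microthread
// executes the same instruction sequence regardless of its data.
int bsearch_cmv(const int* a, int n, int key) {
  int lo = 0, len = n;
  while (len > 1) {
    int half = len / 2;
    int mid = lo + half;
    lo = (a[mid] <= key) ? mid : lo;   // cond. move instead of a branch
    len -= half;
  }
  return (a[lo] == key) ? lo : -1;     // index of key, or -1 if absent
}
```

Only the loop bound remains control flow, and it is the same for every look-up, which is why the table downgrades the kernel from "very irregular" to "somewhat irregular".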

SLIDE 31

Evaluation Methodology

SLIDE 32

Three Example Layouts

[Layout figures: MIMD Tile, 1-Core x 4-Lane Maven Tile, and 4-Core x 1-Lane Maven Tile, each with its I$ and D$]

SLIDE 33

Need Gate-level Activity for Accurate Energy Numbers

  Configuration                Post-Place&Route Statistical (mW)   Simulated Gate-level Activity (mW)
  MIMD 1                       149                                 137-181
  MIMD 2                       216                                 130-247
  MIMD 3                       242                                 124-261
  MIMD 4                       299                                 221-298
  Multi-core Vector-SIMD       396                                 213-331
  Multi-lane Vector-SIMD       224                                 137-252
  Multi-core Vector-Thread 1   428                                 162-318
  Multi-core Vector-Thread 2   404                                 147-271
  Multi-core Vector-Thread 3   445                                 172-298
  Multi-core Vector-Thread 4   409                                 225-304
  Multi-core Vector-Thread 5   410                                 168-300
  Multi-lane Vector-Thread 1   205                                 111-167
  Multi-lane Vector-Thread 2   223                                 118-173

SLIDE 34

Outline

§ Data-Parallel Architectural Design Patterns
§ Microarchitectural Components
§ Evaluation Framework
§ Evaluation Results

SLIDE 35

Efficiency vs. Number of uTs running bsearch-cmv

[Plot: Normalized Energy/Task vs. Normalized Tasks/Sec for mimd-c4 (r32), plus absolute Energy/Task (uJ) broken down into ctrl, reg, mem, fp, int, cp, i$, d$, and leakage]

SLIDE 36

Efficiency vs. Number of uTs running bsearch-cmv

[Same plot, annotated with the directions of improvement: Faster (tasks/sec) and Lower Energy (energy/task)]

SLIDE 37

Efficiency vs. Number of uTs running bsearch-cmv

[Plot: mimd-c4 with r32 and r64 configurations]

SLIDE 38

Efficiency vs. Number of uTs running bsearch-cmv

[Plot: mimd-c4 with r32, r64, r128, and r256 configurations]

SLIDE 39

Efficiency vs. Number of uTs running bsearch-cmv

[Plot: mimd-c4 and vt-c4v1, each with r32, r64, r128, and r256 configurations]

SLIDE 40

6r3w Vector Register File is Area Inefficient

[Area chart: normalized area breakdown (ctrl, reg, mem, fp, int, cp, i$, d$) for the MIMD Tile and Vector-Thread Tile across r32-r256]

SLIDE 41

Efficiency vs. Number of uTs with Banking running bsearch-cmv

[Plot: mimd-c4, vt-c4v1, and vt-c4v1+b (banked vector register file, r128 and r256)]

SLIDE 42

Efficiency vs. Number of uTs with Per-Bank Integer ALU running bsearch-cmv

[Plot: mimd-c4, vt-c4v1, vt-c4v1+b, and vt-c4v1+bi (per-bank integer ALU, r128 and r256)]

SLIDE 43

Banked Vector Register File and Per-Bank Integer ALUs

[Area chart: normalized area for the MIMD Tile and Vector-Thread Tile (r32-r256), with banking (+b) and per-bank integer ALUs (+bi)]

SLIDE 44

Results running bsearch compared to bsearch-cmv

[Plot: normalized energy/task vs. tasks/sec for the PVFB design-space exploration: FIFO, 1-stack, and 2-stack, each with and without density-time execution (+dt), plus cmv variants. Takeaways: apply density-time execution; use the 2-stack PVFB as the convergence scheme]

SLIDE 45

Results Running Application Kernels

[Plots: normalized energy/task, tasks/second, and tasks/second/area for viterbi, rsort, kmeans, dither, physics, and strsearch]

SLIDE 46

Results Running Application Kernels

[Same plots, labeling the rows: Performance (tasks/second) and Performance per Unit Area, for viterbi, rsort, kmeans, dither, physics, and strsearch]

SLIDE 47

Results Running Application Kernels

[Same plots, with the kernels ordered left to right by increasing irregularity: viterbi, rsort, kmeans, dither, physics, strsearch]

SLIDE 48

Multi-threading is not Effective on DLP Code

[Plots: MIMD multi-threading results across viterbi, rsort, kmeans, dither, physics, and strsearch]

SLIDE 49

Vector-SIMD is Faster and/or More Efficient than MIMD

[Plots: MIMD (r32) vs. multi-lane Vector-SIMD (mlane) across the six kernels]

No Vector-SIMD implementation for the most irregular kernels: too hard to map

SLIDE 50

Maven Vector-Thread is More Efficient than Vector-SIMD

[Plots: Vector-SIMD vs. Maven vector-thread (mlane) across viterbi, rsort, kmeans, dither, physics, and strsearch]

SLIDE 51

Multi-Lane Tiles are More Efficient than Multi-Core Tiles

[Plots: multi-core (mcore) vs. multi-lane (mlane) tiles across viterbi, rsort, kmeans, dither, physics, and strsearch]

SLIDE 52

Comparing vector load/stores vs. uT load/stores running vvadd

[Plot: normalized energy/task vs. tasks/sec for vvadd using vector loads/stores]

SLIDE 53

uT load/stores are Inefficient

[Plot: vvadd with vector ld/st vs. uT ld/st]

uT ld/st: 9x slower, 5x more energy

SLIDE 54

Memory Coalescing Helps, but Still Far Off

[Plot: vvadd with vector ld/st, uT ld/st, and uT ld/st + memory coalescing]
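A toy model (assumed, not from the deck) of what coalescing buys: merging the per-uT addresses of one vector step into distinct cache-line requests.

```cpp
#include <cstddef>
#include <set>
#include <vector>

// Toy coalescing model: count the distinct cache lines touched by the
// per-microthread addresses issued in one step. With unit-stride 4-byte
// accesses and 32-byte lines, 8 uT loads collapse into 1 memory request;
// without coalescing, each uT would issue its own request.
std::size_t coalesced_requests(const std::vector<std::size_t>& addrs,
                               std::size_t line_bytes) {
  std::set<std::size_t> lines;
  for (std::size_t a : addrs) lines.insert(a / line_bytes);
  return lines.size();
}
```

This is the best case; the slide's point is that even with coalescing, per-uT address generation and tag checks keep uT accesses well behind a single unit-stride vector memory instruction.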

SLIDE 55

Conclusions

§ Vector architectures are more area- and energy-efficient than MIMD architectures on regular DLP and (surprisingly) on irregular DLP
§ The Maven vector-thread architecture is a promising alternative to traditional vector-SIMD architectures, providing greater efficiency and easier programmability
§ Using real RTL implementations and a standard ASIC toolflow is necessary to compare energy-optimized future architectures

This work was supported in part by Microsoft (Award #024263) and Intel (Award #024894, equipment donations) funding and by matching funding from U.C. Discovery (Award #DIG07-10227).