The Why, Where and How of Multicore
Anant Agarwal, MIT and Tilera Corp.



SLIDE 1

Anant Agarwal MIT and Tilera Corp.

The Why, Where and How of Multicore

SLIDE 2

What is Multicore?

Whatever’s “Inside”?

SLIDE 3

What is Multicore?

Whatever’s “Inside”?

Seriously, multicore satisfies three properties:
• Single chip
• Multiple distinct processing engines
• Multiple, independent threads of control (or program counters – MIMD)

[Figure: example multicore organizations: a tiled mesh of processor (p), memory (m), and switch tiles; a bus-based multicore with per-core caches sharing an L2 cache; and a heterogeneous bus-based design with RISC and DSP cores, cache, memory, and SRAM.]

SLIDE 4

Outline

The why
The where
The how

SLIDE 5

Outline

The why
The where
The how

SLIDE 6

The “Moore’s Gap”

[Figure: performance (GOPS, log scale) vs. time, 1992–2010: transistor counts keep climbing while delivered performance flattens, opening the “Moore’s Gap”. “Houston, we have a problem…” Single-CPU mechanisms marked along the curve: pipelining, superscalar, OOO, SMT/FGMT/CGMT.]

Three causes:
1. Diminishing returns from single-CPU mechanisms (pipelining, caching, etc.)
2. Wire delays
3. Power envelopes

SLIDE 7

The Moore’s Gap – Example

Pentium 3: year 2000, 0.18 micron, 1 GHz, 28M transistors, 343 (SPECint 2000)
Pentium 4: year 2000, 0.18 micron, 1.4 GHz, 42M transistors, 393 (SPECint 2000)

Transistor count increased by 50%, yet performance increased by only 15%.
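A quick check of those percentages (a minimal sketch; the transistor counts and SPECint 2000 scores are the ones quoted above):

```python
# Moore's Gap arithmetic: transistor growth vs. delivered performance growth.
p3_transistors, p4_transistors = 28e6, 42e6
p3_specint, p4_specint = 343, 393

print(f"transistors: +{p4_transistors / p3_transistors - 1:.0%}")  # +50%
print(f"performance: +{p4_specint / p3_specint - 1:.0%}")          # +15%
```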

SLIDE 8

Closing Moore’s Gap Today

Two things have changed:
1. Applications: today’s applications have ample parallelism (and they are not PowerPoint and Word!)
2. Technology: on-chip integration of multiple cores is now possible

SLIDE 9

Parallelism is Everywhere

Super Computing
Graphics, Gaming
Video, Imaging: settops, TVs (e.g., H.264)
Communications: cellphones (e.g., FIRs)
Security: firewalls (e.g., AES)
Wireless (e.g., Viterbi)
Networking (e.g., IP forwarding)
General Purpose: databases, webservers, multiple tasks

SLIDE 10

Integration is Efficient

[Figure: four cores on a shared bus (discrete chips) vs. a mesh of processor/cache/switch tiles (multicore).]

Discrete chips: bandwidth 2 GBps, latency 60 ns, energy > 500 pJ
Multicore (on-chip)*: bandwidth > 40 GBps, latency < 3 ns, energy < 5 pJ
*90nm, 32 bits, 1mm

Parallelism and interconnect efficiency enable harnessing the “power of n”: n cores yield an n-fold increase in performance. This fact yields the multicore opportunity.

SLIDE 11

Why Multicore?

Let’s look at the opportunity from two viewpoints:
1. Performance
2. Power efficiency

SLIDE 12

The Performance Opportunity

Smaller CPI (cycles per instruction) is better. Assume a base CPI of 1, a 1% cache miss rate, and a 100-cycle miss penalty.

90nm baseline (processor 1x, cache 1x): CPI = 1 + 0.01 × 100 = 2

65nm, push single core (processor 1x, cache 3x, miss rate falls to 0.6%): CPI = 1 + 0.006 × 100 = 1.6

65nm, go multicore (two copies of processor 1x + cache 1x): effective CPI = (1 + 0.01 × 100) / 2 = 1

Single-processor mechanisms are yielding diminishing returns; prefer to build two smaller structures than one very big one.
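A minimal sketch of the CPI arithmetic above (the miss rates, penalty, and the halving for two cores come from the slide; the function name is ours):

```python
# CPI model: base CPI plus cache-miss stalls. For the multicore case the
# slide divides by 2 because two cores each run half the instruction stream.

def cpi(base_cpi, miss_rate, miss_penalty):
    return base_cpi + miss_rate * miss_penalty

baseline_90nm  = cpi(1.0, 0.010, 100)       # 2.0
bigger_cache   = cpi(1.0, 0.006, 100)       # 1.6  (3x cache lowers the miss rate)
multicore_65nm = cpi(1.0, 0.010, 100) / 2   # 1.0  (work split across two cores)

print(baseline_90nm, bigger_cache, multicore_65nm)
```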

SLIDE 13

Multicore Example: MIT’s Raw Processor

16 cores
Year 2002, 0.18 micron, 425 MHz
IBM SA27E std. cell
6.8 GOPS
Google “MIT Raw”

SLIDE 14

Raw’s Multicore Performance

[Figure: speedup of Raw (425 MHz, 0.18μm) over a Pentium 3 (600 MHz, 0.18μm) across the application space; results from ISCA04.]

SLIDE 15

The Power Cost of Frequency

Synthesized multiplier power versus frequency (90nm)

[Figure: power (normalized to Mul32 @ 250 MHz) vs. frequency, 250 MHz to 1150 MHz, comparing two ways to reach a target frequency: increase area vs. increase voltage.]

Frequency ∝ V, and Power ∝ V²F ∝ V³. So for a 1% increase in frequency, we suffer roughly a 3% increase in power.

SLIDE 16

Multicore’s Opportunity for Power Efficiency

                 Superscalar   “New” Superscalar   Multicore
Freq             1             1.5X                0.75X
V                1             1.5X                0.75X
Perf             1             1.5X                1.5X
Power            1             3.3X                0.8X
Cores            1             1X                  2X
PE (Bops/watt)   1             0.45X               1.88X

(Bigger # is better)

Multicore delivers 50% more performance with 20% less power: it is preferable to use multiple slower devices than one superfast device.
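A minimal sketch of where these numbers come from, assuming (as on the previous slide) that frequency scales with voltage and dynamic power scales as V²F, and taking performance as frequency × cores:

```python
# Derive the table's power-efficiency numbers from simple scaling rules:
#   freq ∝ V,  power ∝ V^2 * F * cores,  perf ∝ freq * cores,
#   PE (performance per watt) = perf / power.

def design_point(freq, cores):
    voltage = freq                      # freq ∝ V (both normalized to 1)
    power = voltage**2 * freq * cores   # V^2 * F per core, times core count
    perf = freq * cores
    return perf, power, perf / power

for name, freq, cores in [("Superscalar", 1.0, 1),
                          ('"New" Superscalar', 1.5, 1),
                          ("Multicore", 0.75, 2)]:
    perf, power, pe = design_point(freq, cores)
    print(f'{name:20s} perf={perf:.2f}x power={power:.2f}x PE={pe:.2f}x')
# Multicore comes out at perf 1.5x, power ~0.84x, PE ~1.8x, which the slide
# rounds to 1.5X, 0.8X, and 1.88X respectively.
```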

SLIDE 17

Outline

The why
The where
The how

SLIDE 18

The Future of Multicore

Number of cores will double every 18 months

Year       ’02   ’05   ’08   ’11   ’14
Academia    16    64   256  1024  4096
Industry     4    16    64   256  1024

But, wait a minute… Need to create the “1K multicore” research program ASAP!
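A minimal sketch of the projection behind that table, assuming core counts double every 18 months from 2002 starting points of 16 cores (academia, e.g., Raw) and 4 cores (industry):

```python
# Project core counts under "the number of cores doubles every 18 months".
def cores(year, base_cores, base_year=2002):
    return base_cores * 2 ** ((year - base_year) / 1.5)

for year in (2002, 2005, 2008, 2011, 2014):
    print(year, round(cores(year, 16)), round(cores(year, 4)))
# -> 2014: 4096 cores (academia), 1024 cores (industry)
```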

SLIDE 19

Outline

The why
The where
The how

SLIDE 20

Multicore Challenges: The 3 P’s

1. Performance challenge
2. Power efficiency challenge
3. Programming challenge

SLIDE 21

Performance Challenge

The interconnect is the new mechanism, and it is not well understood. Current systems rely on buses or rings, which are not scalable and will become the performance bottleneck. Bandwidth and latency are the two issues.

[Figure: bus-based multicore with four processor/cache pairs sharing an L2 cache over a bus.]

SLIDE 22

Interconnect Options

[Figure: three interconnect options. Bus multicore: processor/cache pairs share a single bus. Ring multicore: processor/cache pairs attach to switches connected in a ring. Mesh multicore: processor/cache/switch tiles connected in a 2D mesh.]

SLIDE 23

Imagine This City…

SLIDE 24

Interconnect Bandwidth

Cores 2-4: bus
Cores 4-8: ring
Cores > 8: mesh

SLIDE 25

Communication Latency

Communication latency is not an interconnect problem; it is a “last mile” issue. Latency comes from coherence protocols or software overhead.

[Figure: end-to-end latency (cycles) vs. message size (words) for rMPI, on log-log axes.]

Highly optimized MPI implementation on the Raw Multicore processor

Challenge: reduce the overhead to a few cycles. Avoid memory accesses, provide direct access to the interconnect, and eliminate protocols.

SLIDE 26

rMPI vs Native Messages

rMPI percentage overhead (cycles) compared to native GDN: Jacobi

[Figure: rMPI overhead (0% to 500%) vs. problem size (N=16 through N=2048) for 2, 4, 8, and 16 tiles.]

SLIDE 27

Power Efficiency Challenge

Existing CPUs run at 100 watts, so 100 such CPU cores would burn 10 kilowatts! We need to rethink CPU architecture.

SLIDE 28

The Potential Exists

Processor    Power   Perf     Power Efficiency
Itanium 2    100W    1        1
RISC*        1/2W    1/8X**   25X

Assuming 130nm.  *90’s RISC at 425MHz  **e.g., Timberwolf (SpecInt)

SLIDE 29

Area Equates to Power

[Die photo: Madison Itanium 2, 0.13µm. Most of the die is L3 cache; less than 4% goes to ALUs and FPUs.]

Photo courtesy Intel Corp.

SLIDE 30

Less is More

The “KILL Rule” for multicore: Kill If Less than Linear.

A resource’s size must not be increased unless the resulting percentage increase in performance is at least as large as the percentage increase in area (power). In other words, increase a resource only if every 1% increase in area buys at least a 1% increase in performance.

Remember the power of n: going from n to 2n cores doubles performance, and 2n cores have 2X the area (power), so adding cores just meets the rule.
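A minimal sketch of the KILL rule as a decision procedure (the function name and example numbers are ours; the rule itself is the slide’s):

```python
# "KILL Rule": Kill If Less than Linear.
# Grow a structure only if the % gain in performance is at least
# as large as the % growth in area (a proxy for power).

def worth_growing(area_increase_pct, perf_increase_pct):
    return perf_increase_pct >= area_increase_pct

print(worth_growing(100, 15))    # False: doubling a cache for +15% perf -> kill it
print(worth_growing(100, 100))   # True: a second core on a parallel workload is linear
```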

SLIDE 31

Communication Cheaper than Memory Access

Action                  Energy
ALU add                 2 pJ
Network transfer (1mm)  3 pJ
32KB cache read         50 pJ
Off-chip memory read    500 pJ

(90nm, 32b)

Migrate from memory-oriented computation models to communication-centric models.
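A minimal sketch using the per-action energies above; the scenario of handing one 32-bit operand to a neighboring core is our illustration, not the slide’s:

```python
# Energy per action at 90nm, 32 bits (picojoules, from the table above).
ENERGY_PJ = {
    "alu_add": 2,
    "network_transfer_1mm": 3,
    "cache_read_32kb": 50,
    "offchip_memory_read": 500,
}

# Memory-oriented: producer pushes the value out to off-chip memory,
# consumer reads it back in through its cache.
via_memory = ENERGY_PJ["offchip_memory_read"] + ENERGY_PJ["cache_read_32kb"]

# Communication-centric: send the value a few millimeters over the on-chip network.
via_network = 3 * ENERGY_PJ["network_transfer_1mm"]

print(via_memory, via_network)   # 550 pJ vs 9 pJ: communication is far cheaper
```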

SLIDE 32

Multicore Programming Challenge

Traditional cluster-computing programming methods squander the multicore opportunity:
– Message passing or shared memory, e.g., MPI, OpenMP
– Both were designed assuming high-overhead communication
  • They need big chunks of work to minimize communication, and huge caches

Multicore is different:
– Low-overhead communication that is cheaper than memory access
  • Results in smaller per-core memories

Programming models must allow specifying parallelism at any granularity, and favor communication over memory.

SLIDE 33

Stream Programming Approach

An ASIC-like concept: read a value from the network, compute, send the value out. This avoids memory-access instructions, synchronization, and address arithmetic.

[Figure: a pixel data stream flows from Core A to Core B (e.g., an FIR) to Core C (e.g., an FIR) over channels, each with a send and a receive port.]

e.g., Streamit, StreamC
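A minimal sketch of this style in ordinary Python, with channels modeled as queues (Raw’s hardware network ports and the StreamIt/StreamC languages are not shown; this only illustrates the read-compute-send structure):

```python
# Stream pipeline: each stage reads from its input channel, computes, and
# writes to its output channel, with no shared memory or address arithmetic.
import threading, queue

def fir_stage(taps, inp, out):
    window = [0.0] * len(taps)
    while True:
        x = inp.get()
        if x is None:                 # end-of-stream marker
            out.put(None)
            return
        window = [x] + window[:-1]
        out.put(sum(t * w for t, w in zip(taps, window)))

a_to_b, b_to_c, c_out = queue.Queue(), queue.Queue(), queue.Queue()
threading.Thread(target=fir_stage, args=([0.5, 0.5], a_to_b, b_to_c)).start()
threading.Thread(target=fir_stage, args=([0.25, 0.75], b_to_c, c_out)).start()

for pixel in [1, 2, 3, 4]:            # "Core A" produces a pixel stream
    a_to_b.put(pixel)
a_to_b.put(None)

while (y := c_out.get()) is not None:
    print(y)
```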

SLIDE 34

Conclusion

Multicore can close the “Moore’s Gap”.

The four biggest myths of multicore:
– Existing CPUs make good cores
– Bigger caches are better
– Interconnect latency comes from wire delay
– Cluster computing programming models are just fine

For multicore to succeed we need new research:
– Create new architectural approaches, e.g., the “Kill Rule” for cores
– Replace memory access with communication
– Create new interconnects
– Develop innovative programming APIs and standards

SLIDE 35

Vision for the Future

The ‘core’ is the LUT (lookup table) of the 21st century. If we solve the 3 challenges, multicore could replace all hardware in the future.

[Figure: a very large array of processor/memory/switch (p/m/s) tiles.]