

SLIDE 1

The Future is not what it used to be...

Erik Hagersten

SLIDE 2

Dept of Information Technology | www.it.uu.se | Erik Hagersten | user.it.uu.se/~eh

AVDARK 2012

Then... ENIAC 1946 (”5 kHz”)

ENIAC

18 000 vacuum tubes, programmed with patch cords, ”5 kHz”

SLIDE 3


Then (in Sweden)

 BARK (~1950)

 8 000 relays
 80 km cables

 BESK (~1953)

 2 400 vacuum tubes
 ”20 kHz” (world record)

SLIDE 4


“Recently”: APZ 212, 1983

Ericsson’s Supercomputer (“5 MHz”)

SLIDE 5


APZ 212

Marketing brochure quotes:

 ”Very compact”

 6 times the performance
 1/6th the size
 1/5 the power consumption

 ”A breakthrough in computer science”
 ”Why more CPU power?”
 ”All the power needed for future development”
 ”…800,000 BHCA, should that ever be needed”
 ”SPC computer science at its most elegance”
 ”Using 64 kbit memory chips”
 ”1500 W power consumption”

SLIDE 6


65 years of “improvements”

 Speed
 Size
 Price
 Price/performance
 Reliability
 Predictability
 Energy
 Safety
 Usability…

SLIDE 7


”Moore’s Law”

Popular version: performance doubles every 18-24 months

[Figure: performance (log scale, 1-1000) vs. year; the single-core curve flattens around 2006 and multicore takes over]

SLIDE 8


Ray Kurzweil pictures: www.KurzweilAI.net/pps/WorldHealthCongress/

SLIDE 9


Ray Kurzweil pictures: www.KurzweilAI.net/pps/WorldHealthCongress/

SLIDE 10


Ray Kurzweil pictures: www.KurzweilAI.net/pps/WorldHealthCongress/

SLIDE 11


Exponential growth: doubling/halving times

(according to Kurzweil)

Dynamic RAM Memory (bits per dollar): 1.5 years
Average Transistor Price: 1.6 years
Microprocessor Cost per Transistor Cycle: 1.1 years
Total Bits Shipped: 1.1 years
Processor Performance in MIPS: 1.8 years
Transistors in Intel Microprocessors: 2.0 years

[Figure: exponential growth on a log scale, 1 to 1000 over time]

SLIDE 12


Ray Kurzweil pictures: www.KurzweilAI.net/pps/WorldHealthCongress/

SLIDE 13


Linear scale, 1940-2017 (2× performance every 18 months)

[Figure: performance vs. year on a linear scale; the curve is indistinguishable from zero until the final years]

Doubling every 18 months since 1940

SLIDE 14


Exponential growth

Example: doubling every 2 years. How long does it take to reach a 1000× improvement? Example: doubling every 18 months. How long does it take to reach a 1000× improvement?

[Figure: the same exponential growth plotted on a log scale (a straight line) and on a linear scale]
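The arithmetic behind these questions is simple: a 1000× improvement needs log2(1000) ≈ 9.97 doublings. A quick sketch (Python; `years_to_improve` is a made-up helper name):

```python
import math

def years_to_improve(factor, doubling_years):
    """Years needed to reach a given improvement factor
    when performance doubles every `doubling_years` years."""
    return math.log2(factor) * doubling_years

print(years_to_improve(1000, 2.0))   # ~19.9 years at a 2-year doubling time
print(years_to_improve(1000, 1.5))   # ~14.9 years at an 18-month doubling time
```

So the apparently small difference between "every 2 years" and "every 18 months" shaves roughly five years off the wait for 1000×.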

SLIDE 15


Looking Forward

Three rules of common wisdom:

 Do not bet against exponential trends
 Do not bet against exponential trends
 Do not bet against exponential trends

But is it possible to continue ”Moore’s Law”?

  • Are there show-stoppers?
  • Can we utilize an exponential growth of #cores?

SLIDE 16


[Figure: throughput (normalized to 1.0 on one core) vs. number of cores used, 1-4]

Not everything scales as fast!

Example: 470.LBM, the "Lattice Boltzmann Method", simulating incompressible fluids in 3D. Throughput (as defined by SPEC): the amount of work performed per time unit when several instances of the application are executed simultaneously. Our TP study: compare the throughput improvement when going from 1 core to 4 cores.
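The point of the study can be captured with a toy model (Python; the per-core rate and the bandwidth cap are made-up numbers, not SPEC measurements): throughput grows with the number of cores only until a shared resource, here memory bandwidth, saturates.

```python
def throughput(n_cores, per_core_work=1.0, mem_bw_limit=2.0):
    """Toy throughput model: each extra core adds per_core_work
    of throughput until the shared memory bandwidth saturates.
    All numbers are illustrative."""
    return min(n_cores * per_core_work, mem_bw_limit)

for n in range(1, 5):
    print(n, throughput(n))   # 1.0, 2.0, 2.0, 2.0: flat after 2 cores
```

A memory-bound code like 470.LBM stops scaling once the bandwidth limit is reached, however many cores are added.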


SLIDE 17


Nerd Curve: 470.LBM

[Figure: cache miss rate vs. cache size. Curves: miss rate (excluding HW prefetch effects); utilization, i.e., the fraction of cache data actually used (scale to the right); the possible miss rate if the utilization problem were fixed; running one thread (3.5% miss rate); running four threads (5.0% miss rate)]

 Less work per memory byte moved with four threads

SLIDE 18


[Diagram: four CPUs sharing one DRAM]

Remember: it is getting worse!

From Karlsson and Hagersten, "Conserving Memory Bandwidth in Chip Multiprocessors with Runahead Execution", IPDPS, March 2007. [Graph updated with more recent data.]

Computation vs Bandwidth

[Figure: (#transistors × transistor frequency) / (#pins × pin frequency) vs. year, 2007-2015; the ratio keeps growing]

Source: International Technology Roadmap for Semiconductors (ITRS)

#Cores ~ #Transistors

HPCwire, Feb 2011 [cites Linley Gwennap and Justin Rattner]: "Without Silicon Photonics, Moore's Law Won't Matter". HPCwire, Feb 2011: "Growing Data Deluge Prompts Processor Redesign".

#Pins

SLIDE 19


Case study: Limited by bandwidth

SLIDE 20


Nerd Curve (again)

[Figure: cache miss rate vs. cache size, running four threads. Curves: miss rate (excluding HW prefetch effects); utilization, i.e., the fraction of cache data actually used (scale to the right); the possible miss rate if the utilization problem were fixed; original application (5.0% miss rate); optimized application (2.5% miss rate)]

 Twice the amount of work per memory byte moved
SLIDE 21


[Figure: throughput vs. number of cores used, 1-4; the optimized code keeps scaling while the original saturates]

 Better Memory Usage!

Example: 470.LBM, modified to promote better cache utilization, vs. the original code


SLIDE 22


Example 2: A Scalable Parallel Application

App: Cigar

[Figure: performance vs. #cores, 1-4]

Looks like a perfectly scalable application! Are we done?

SLIDE 23


Example 2: The Same Application Optimized

App: Cigar

[Figure: performance vs. #cores, 1-4, original vs. optimized; the optimized version is 7.3× faster]

 Duplicate one data structure

SLIDE 24

Implementation Trends

SLIDE 25


Predicting the future is hard

Predicting: “Chip Multiprocessor” aka Multicores

[from PARA Bergen 2000]

[Diagram: Chip Multiprocessor (CMP): several simple, fast CPUs, each with an L1 cache ($1), sharing an L2 cache, a memory interface, and an external interface; t threads]

  • Many open questions

SLIDE 26


Multi-CMPs

[from PARA Bergen 2000]

[Diagram: c CMP chips, each with local memory, connected by an interconnect. Explicit parallelism: #chips × #threads/chip]

  • Global shared memory
  • Global/local comm cost >10
  • Gotta’ explore small caches
  • Gotta’ explore locality!
  • OS scalability ?
  • Application scalability ?
SLIDE 27


Why Multicores Now?

-- How is ”Moore’s Law” doing? --

1. Not enough ILP/MLP to get payoff from using more transistors
2. Signal propagation delay » transistor delay
3. Power consumption: Pdyn ~ C · f · V²

[Figure: performance (log scale) vs. time; the single-core trend gives way to multicore around 2007]
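A rough illustration of why the power relation forced the multicore turn: if voltage must scale with frequency, dynamic power grows roughly as f³, so two cores at 80% frequency deliver more aggregate work than one full-speed core at about the same power. A toy calculation (Python; the linear voltage-frequency scaling is an assumption, not process data):

```python
def dynamic_power(c, f, v):
    """Dynamic power: P_dyn ~ C * f * V^2."""
    return c * f * v * v

# One core at full frequency and voltage:
baseline = dynamic_power(c=1.0, f=1.0, v=1.0)

# One core at 80% frequency (and, by assumption, 80% voltage)...
one_slow = dynamic_power(1.0, 0.8, 0.8)
# ...so two such cores give 1.6x the throughput (if the work is parallel):
two_slow = 2 * one_slow

print(baseline, two_slow)  # 1.0 vs ~1.02: similar power, 1.6x the work
```

This is the arithmetic behind item 3: frequency scaling hit a power wall, and the remaining transistors went into more cores instead.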

SLIDE 28


Darling, I shrunk the computer

[Diagram: Mainframes, then Super Minis, then the Microprocessor, then the Chip Multiprocessor (CMP): a multiprocessor on a chip!]

Sequential execution (≈one program)

Need TLP to make one chip run fast: a paradigm shift

SLIDE 29


HPC in the Rear-View Mirror...

1980 1990 2000 2010 ????

Nifty Parallel Vector

† Not general † Expensive † Hard to use † No standards

Killer Micro SMPs Beowulf x86 Linux Clusters MC Clusters MC + Accelerators

* Forced by technology † High cost, bad scaling * Promise of performance † COTS perf management * Scalability (naive view) * UNIX, commercial computing * COTS cost convergence † ???? † ????

SLIDE 30


Parallelism can be used to hide memory latency

 Intel ”Hyper-Threading”
 T1 Niagara, MIC, … (4 threads per core)
 GPUs
 ...
 Is this a good idea?
 It cannot hide the need for bandwidth!
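A back-of-the-envelope model of that last point (Python; the miss rate, line size, and clock are illustrative assumptions): extra threads keep the core busy during misses, so latency is hidden, but every miss still moves a full cache line, so the bandwidth demand grows linearly with the thread count.

```python
def demanded_bandwidth(threads, miss_rate_per_cycle, line_bytes, freq_hz):
    """Bytes/s the threads ask of the memory system (toy model).
    Each miss fetches one cache line regardless of how well the
    core's stall cycles are overlapped."""
    return threads * miss_rate_per_cycle * line_bytes * freq_hz

bw_1 = demanded_bandwidth(1, 0.01, 64, 2e9)   # one thread
bw_4 = demanded_bandwidth(4, 0.01, 64, 2e9)   # four threads

print(bw_1 / 1e9, "GB/s")  # 1.28 GB/s
print(bw_4 / 1e9, "GB/s")  # 5.12 GB/s: 4x the bandwidth demand
```

Multithreading trades latency tolerance for bandwidth pressure; if the memory system cannot supply the extra bytes, the added threads just queue up.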

SLIDE 31


Parallelism is a Hard Currency

[Figure: speedup vs. parallelism]

Remember Amdahl’s Law?
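As a refresher, Amdahl’s Law caps the speedup of a program whose serial fraction is s at 1/s, no matter how many processors are thrown at it. A minimal sketch in Python (illustrative numbers):

```python
def amdahl_speedup(p, serial_fraction):
    """Amdahl's Law: speedup on p processors when a fraction
    of the work is inherently serial."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / p)

# Even 1000 cores cannot beat 1/s: with 5% serial work,
# the speedup is capped at 20x.
print(amdahl_speedup(1000, 0.05))   # ~19.6
print(amdahl_speedup(10**9, 0.05))  # approaches the 20x cap
```

This is why parallelism is a hard currency: each extra core buys less and less speedup unless the serial fraction is also driven down.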

SLIDE 32


Do you have 1000 threads to spare?

SIMD rears its ugly head again

 512 ”cores” (C)
 16 C per Stream Processor (SP)
 SP is SIMD-ish (sort of)
 Full DP-FP IEEE support
 64 kB L1 cache / SP
 768 kB global shared cache (less than the sum of the L1s)
 Atomic instructions
 ECC correction for DRAM
 Debugging support
 Synch within an SP is efficient
 Giant chip / high power
 ...

This is SIMD-ish (aka Vector)

SLIDE 33


NVIDIA Fermi (Coh. Mem.):

  • Special language...
  • Topology matter...
  • User-managed memory

I/O bus. Common research papers: ”How to get 100X speedup”. Debunking of those results is starting to appear [ISCA 2010, IBM Journal 201

SIMD and CPUs?

Reminds me of:

† Hard to use † No standards * Scalability

SLIDE 34

[Pic from Michael Wulf, PGI]

Intel’s Knights Ferry [MIC] (topology like Sandy Bridge); vector instructions. Other efforts:

  • AMD Fusion (x86 + GPU)
  • ARM + NVIDIA collaboration (project Denver)

SIMD and CPUs?

Coh. Mem.

SLIDE 35


Trends for 2016

 No major revolution of the Multicore magnitude
 Challenge: Will the number of cores double every 2 years?
 Moving towards MIMD+SIMD “fusion”
 Architecture complexity grows
 Bumpy memory/communication costs
 Heterogeneous architectures (e.g., ARM big.LITTLE)
 Memory bandwidth the bottleneck
 Energy is a first-class citizen
 Users are getting less computer-savvy (and ideally should not have to be)

SLIDE 36


Implications

 One size will not “fit all”
 SIMD parallelism will be more prominent, but the jury is still out on how this will be done
 More heterogeneous architectures (size, mem, ISA)
 More parallelism needed, but ... memory/power will become the bottleneck anyhow
 Diversity: different applications will need different “heterogeneous configurations”
 Even harder to use resources efficiently

SLIDE 37


HiPEAC Roadmap -- High Performance Embedded Architecture and Compilers