The Future is not what it used to be...
Erik Hagersten
Dept of Information Technology | www.it.uu.se | Erik Hagersten | user.it.uu.se/~eh
AVDARK 2012
Then... ENIAC, 1946 ("5 kHz")
18,000 vacuum tubes, patch-cord programmed, "5 kHz"
Then (in Sweden)
BARK (~1950): 8,000 relays, 80 km of cables
BESK (~1953): 2,400 vacuum tubes, "20 kHz" (world record)
"Recently": APZ 212, 1983
Ericsson's Supercomputer ("5 MHz")
APZ 212
Marketing brochure quotes:
- "Very compact": 6 times the performance, 1/6th the size, 1/5th the power consumption
- "A breakthrough in computer science"
- "Why more CPU power? All the power needed for future development"
- "…800,000 BHCA, should that ever be needed"
- "SPC computer science at its most elegance"
- "Using 64 kbit memory chips"
- "1500 W power consumption"
65 years of "improvements"
Speed, size, price, price/performance, reliability, predictability, energy, safety, usability...
"Moore's Law"
Popular version: performance doubles every 18-24 months.
[Graph: performance (log scale, 1-1000) vs. year; the single-core curve hands over to multicore around 2006]
Ray Kurzweil pictures: www.KurzweilAI.net/pps/WorldHealthCongress/
Exponential growth: doubling/halving times
(according to Kurzweil)
- Dynamic RAM memory (bits per dollar): 1.5 years
- Average transistor price: 1.6 years
- Microprocessor cost per transistor cycle: 1.1 years
- Total bits shipped: 1.1 years
- Processor performance in MIPS: 1.8 years
- Transistors in Intel microprocessors: 2.0 years
[Graph: the same trends on a log scale vs. time]
Linear scale, 1940-2017 (2x performance every 18 months)
[Graph: performance vs. year on a linear scale, reaching roughly 3x10^15 by 2017]
Doubling every 18 months since 1940
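The number on the linear-scale axis can be sanity-checked with a few lines of arithmetic; a minimal sketch, assuming a normalized starting value of 1 in 1940:

```python
# Sanity check of the slide's claim: doubling every 18 months from
# 1940 to 2017 yields roughly the 3e15 shown on the linear-scale axis.
def growth_factor(years, doubling_months):
    """Total improvement after `years`, doubling every `doubling_months`."""
    return 2.0 ** (years * 12.0 / doubling_months)

print(f"{growth_factor(2017 - 1940, 18):.2e}")  # ~2.8e15
```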
Exponential growth
Example: doubling every 2 years. How long does it take to improve 1000x?
Example: doubling every 18 months. How long does it take to improve 1000x?
[Graphs: the same growth on a log scale and on a linear scale]
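The questions above have a one-line answer: the time to a given improvement factor is the doubling time multiplied by the base-2 logarithm of the factor (just under 10 doublings for 1000x):

```python
import math

# Time to reach a given improvement factor at a fixed doubling rate:
# (doubling time) * log2(factor).
def years_for_factor(factor, doubling_years):
    return doubling_years * math.log2(factor)

print(round(years_for_factor(1000, 2.0), 1))   # 19.9 years at 2-year doubling
print(round(years_for_factor(1000, 1.5), 1))   # 14.9 years at 18-month doubling
```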
Looking Forward
Three rules of common wisdom:
1. Do not bet against exponential trends
2. Do not bet against exponential trends
3. Do not bet against exponential trends
But is it possible to continue "Moore's Law"?
- Are there show-stoppers?
- Can we utilize an exponential growth of #cores?
Not everything scales as fast!
Example: 470.lbm, the "Lattice Boltzmann Method" for simulating incompressible fluids in 3D.
Throughput (as defined by SPEC): the amount of work performed per time unit when several instances of the application are executed simultaneously.
Our TP study: compare the throughput improvement when going from 1 core to 4 cores.
[Graph: throughput (normalized to 1.0) vs. number of cores used, 1-4]
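The SPEC-style throughput metric just described can be sketched in a few lines. The run times below are hypothetical numbers chosen for illustration, not measured 470.lbm data:

```python
# SPEC-style throughput ("rate") metric: work completed per unit time
# when several instances run at once. Times below are hypothetical,
# for illustration only -- not measured 470.lbm results.
def throughput(instances, wall_time_s):
    return instances / wall_time_s

tp_1 = throughput(1, 100.0)   # one instance alone takes 100 s
tp_4 = throughput(4, 250.0)   # four concurrent instances finish after 250 s
print(round(tp_4 / tp_1, 2))  # 1.6: far below the ideal 4x scaling
```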
Nerd Curve: 470.lbm
[Graph: cache miss rate vs. cache size, running one thread vs. four threads. Curves show the miss rate (excluding HW prefetch effects), the utilization, i.e., the fraction of cache data used (scale to the right), and the possible miss rate if the utilization problem were fixed. The miss rate rises from 3.5% to 5.0% with four threads: less work per memory byte moved.]
Remember: it is getting worse!
[Diagram: four CPUs sharing a single DRAM]
From Karlsson and Hagersten, "Conserving Memory Bandwidth in Chip Multiprocessors with Runahead Execution," IPDPS, March 2007. [Graph updated with more recent data.]
Computation vs. Bandwidth
[Graph, 2007-2015: the ratio (#transistors × transistor frequency) / (#pins × pin frequency) grows steadily year by year]
Source: International Technology Roadmap for Semiconductors (ITRS)
#Cores ~ #Transistors
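The shape of the ITRS trend can be illustrated with a toy model: on-chip compute capacity (#transistors × frequency) doubles much faster than off-chip bandwidth (#pins × pin frequency). The doubling times below are assumptions chosen for illustration, not actual ITRS figures:

```python
# Toy model of the compute-vs-bandwidth gap: transistors double every
# ~2 years, while pin counts double far more slowly (assumed 6 years
# here, for illustration only).
def compute_to_bandwidth(years, compute_doubling=2.0, bandwidth_doubling=6.0):
    compute = 2.0 ** (years / compute_doubling)      # on-chip capacity
    bandwidth = 2.0 ** (years / bandwidth_doubling)  # off-chip bandwidth
    return compute / bandwidth

print(round(compute_to_bandwidth(8), 1))  # the gap widens ~6.3x in 8 years
```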
HPCwire, Feb 2011 [cites Linley Gwennap and Justin Rattner]: "Without Silicon Photonics, Moore's Law Won't Matter"
HPCwire, Feb 2011: "Growing Data Deluge Prompts Processor Redesign"
#Pins
Case study: Limited by bandwidth
Nerd Curve (again)
[Graph: cache miss rate vs. cache size for the original vs. the optimized application, running four threads. Curves show the miss rate (excluding HW prefetch effects), the utilization, i.e., the fraction of cache data used (scale to the right), and the possible miss rate if the utilization problem were fixed. The miss rate drops from 5.0% to 2.5%: twice the amount of work per memory byte moved.]
Better Memory Usage!
Example: 470.lbm, modified to promote better cache utilization, vs. the original code.
[Graph: throughput vs. number of cores used, 1-4, for both versions]
Example 2: A Scalable Parallel Application
App: Cigar
[Graph: performance vs. # cores, 1-4]
Looks like a perfectly scalable application! Are we done?
Example 2: The Same Application Optimized
App: Cigar
It looked like a perfectly scalable application. Were we done? No: duplicating one data structure...
[Graph: performance vs. # cores, 1-4, original vs. optimized; the optimized version reaches 7.3x]
Implementation Trends
Predicting the future is hard
Predicting "Chip Multiprocessors", aka multicores [from PARA, Bergen, 2000]
[Diagram: Chip Multiprocessor (CMP): simple, fast CPUs, each with its own L1 cache, sharing an L2 cache, a memory interface, and an external interface; # threads]
- many open questions
Multi-CMPs [from PARA, Bergen, 2000]
[Diagram: several CMP chips, each with local memory, connected by an interconnect]
Explicit parallelism: # chips × # threads/chip
- Global shared memory
- Global/local communication cost > 10
- Gotta explore small caches
- Gotta explore locality!
- OS scalability?
- Application scalability?
Why Multicores Now?
-- How is "Moore's Law" doing? --
1. Not enough ILP/MLP to get payoff from using more transistors
2. Signal propagation delay >> transistor delay
3. Power consumption: Pdyn ~ C · f · V²
[Graph: performance (log) vs. time; the single-core curve hands over to multicore around 2007]
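Point 3 is why multicores win on power: a worked example of Pdyn ~ C · f · V², with illustrative values (not any specific chip's), shows that halving both frequency and voltage cuts dynamic power by 8x:

```python
# Worked example of P_dyn ~ C * f * V^2: halving both frequency and
# supply voltage cuts dynamic power to one eighth -- the power argument
# for several slower cores over one fast core. C, f and V values are
# illustrative only; C (switched capacitance) cancels in the ratio.
def p_dyn(c, f, v):
    return c * f * v ** 2

fast = p_dyn(1.0, 3.0e9, 1.2)   # one fast, high-voltage core
slow = p_dyn(1.0, 1.5e9, 0.6)   # half the frequency at half the voltage
print(fast / slow)              # 8.0
```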
Darling, I Shrunk the Computer
Mainframes → super minis → microprocessors → the Chip Multiprocessor (CMP): a multiprocessor on a chip!
Sequential execution (≈ one program) no longer suffices: we need TLP to make one chip run fast. A paradigm shift.
HPC in the Rear Mirror...
1980 → 1990 → 2000 → 2010 → ????
Nifty parallel vector machines → killer-micro SMPs → Beowulf x86 Linux clusters → MC clusters → MC + accelerators
Strengths (*) and weaknesses (†) along the way: † not general, † expensive, † hard to use, † no standards; * forced by technology, † high cost, † bad scaling; * promise of performance, † COTS performance management; * scalability (naive view), * UNIX, * commercial computing, * COTS cost convergence; † ????, † ????
Parallelism can be used to hide memory latency
Examples: Intel "Hyper-Threading", Niagara T1 (4 threads per core), MIC, GPUs, ...
Is this a good idea? It cannot hide the need for bandwidth!
Parallelism is a Hard Currency
Remember Amdahl's Law?
[Graph: speedup vs. parallelism]
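Amdahl's Law caps the payoff from parallelism no matter how many threads you can spare; a short sketch (the 5% serial fraction is an illustrative assumption):

```python
# Amdahl's Law: with a serial fraction s, speedup on p cores is
# 1 / (s + (1 - s) / p). The serial part caps the payoff from adding
# cores; the 5% serial fraction below is an illustrative assumption.
def amdahl(serial_fraction, cores):
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)

print(round(amdahl(0.05, 4), 2))      # 3.48 on 4 cores with 5% serial work
print(round(amdahl(0.05, 1000), 1))   # 19.6: even 1000 cores give <20x
```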
Do you have 1000 threads to spare?
SIMD rears its ugly head again
- 512 "cores" (C)
- 16 C per StreamProcessor (SP)
- An SP is SIMD-ish (sort of)
- Full DP-FP IEEE support
- 64 kB L1 cache per SP
- 768 kB global shared cache (less than the sum of the L1s)
- Atomic instructions
- ECC correction for DRAM
- Debugging support
- Efficient synchronization within an SP
- Giant chip / high power
- ...
[Diagram: L1 and L2 caches]
This is SIMD-ish (aka vector)
NVIDIA Fermi (coherent memory, attached via the I/O bus):
- Special language...
- Topology matters...
- User-managed memory
Common research papers: "How to get 100X speedup". Debunkings of those results are starting to appear [ISCA 2010, IBM Journal 201...]
SIMD and CPUs?
Reminds me of: † hard to use, no standards; * scalability
[Pic from Michael Wulf, PGI]
SIMD and CPUs?
Intel's Knights Ferry [MIC] (topology like Sandy Bridge), with vector instructions and coherent memory.
Other efforts:
- AMD Fusion (x86 + GPU)
- ARM + NVIDIA collaboration (Project Denver)
Trends for 2016
- No major revolution of the multicore magnitude. Challenge: will the number of cores double every 2 years?
- Moving towards MIMD+SIMD "fusion"
- Architecture complexity grows
- Bumpy memory/communication costs
- Heterogeneous architectures (e.g., ARM big.LITTLE)
- Memory bandwidth is the bottleneck
- Energy is a first-class citizen
- Users are getting less computer-savvy (and ideally should not have to be)
Implications
- One size will not "fit all"
- SIMD parallelism will be more prominent, but the jury is still out on how this will be done
- More heterogeneous architectures (size, memory, ISA)
- More parallelism needed, but... memory/power will become the bottleneck anyhow
- Diversity: different applications will need different "heterogeneous configurations"
- Even harder to use resources efficiently