Preparing for a Post Moores Law World Todd Austin University of - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Preparing for a Post Moore’s Law World

Todd Austin University of Michigan

slide-2
SLIDE 2

Perspectives on Scaling

  • C-FAR: Center for Future Architectures Research
  • Focused on scaling in 2020-2030 silicon
  • Performance, power and cost
  • 27 faculty at 14 universities, 92 students
  • Why is C-FAR’s mission important?
  • The promise… tomorrow’s applications need powerful systems
  • Why is C-FAR’s mission challenging?
  • The threats… slowing innovation and degrading silicon

Promise: Computer Vision, Machine Learning, Big Data Analytics
Threats: End of Dennard Scaling, Many Idle Cores, Silicon Defects

All of the work presented in this talk is that of C-FAR faculty.

slide-3
SLIDE 3

Moore’s Law Performance Gap

Today, gap is cresting 10x
  • Lack of perceived value
  • Dark silicon
  • Diminished ILP

slide-4
SLIDE 4

[Figure: Technology Node (nm) on a log scale from 1 to 1000, with nodes at 180, 130, 90, 65, 45, 32, 22, 14, 10 and 7 nm. Annotations: 14nm slips by 2 quarters; 10nm slips by 5-6 quarters; 7nm by end 2020?]

Is Density Still Scaling?

Street Dates for Intel’s Lead Generation Products

Courtesy David Brooks @ Harvard

slide-5
SLIDE 5

But, the technology scaling component has left us.

What Does This All Mean to Architects?


Today, value = scalability (performance, power, cost).

slide-6
SLIDE 6

Remedy #1: Chip Multiprocessors


slide-7
SLIDE 7

CMP Performance Scaling for the Highly Parallel PARSEC Benchmarks

From “Dark Silicon and the End of Multicore Scaling,” by Esmaeilzadeh et al.

slide-8
SLIDE 8

What Does the Press Think?


slide-9
SLIDE 9

We Investigate: Who’s to Blame?


?

Programmers

slide-10
SLIDE 10

Largest NA Bitcoin Miner

  • GPGPU-based system
  • Fills 2000 sq.ft. warehouse
  • Computes 1 petahash/s
  • Reportedly generates $8M in Bitcoins per month
  • Unfortunately soon to be obsolete as Bitcoin difficulty continues to scale

slide-11
SLIDE 11

We Investigate: Who’s to Blame?


?

Programmers Educators

slide-12
SLIDE 12

CS Education is Booming

  • CS enrollment on a fast-rising trajectory for a decade
  • Parallel programming at UM
  • EECS 381, Object-Oriented and Advanced Programming
  • EECS 482, Operating Systems
  • EECS 570, Parallel Computer Architecture
  • EECS 587, Parallel Computing
  • EECS 591, Distributed Systems
  • EECS 598, Ubiquitous Parallelism
  • I have been teaching and developing CS in Ethiopia
  • Nearly 600 students in the CS program
  • 2nd most popular major in the university

[Figure: UM EECS Enrollment, with series CS, EE and CE]

slide-13
SLIDE 13

We Investigate: Who’s to Blame?


?

Programmers Educators The Transistor

slide-14
SLIDE 14

The Dark Silicon Dilemma

Courtesy Michael Taylor @ UCSD

slide-15
SLIDE 15

The Dark Silicon Dilemma

Courtesy Michael Taylor @ UCSD

slide-16
SLIDE 16

The Dark Silicon Dilemma

Courtesy Michael Taylor @ UCSD

slide-17
SLIDE 17

We Investigate: Who’s to Blame?


?

Programmers Educators Architects The Transistor

slide-18
SLIDE 18

The Tyranny of Amdahl’s Law

[Figure: speedup (S) as a function of parallel fraction (P) and number of processors (N)]

Where we need to be today! (10x)
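
For reference, Amdahl's Law in its standard form gives the speedup S on N processors when a fraction P of the work parallelizes perfectly: S = 1 / ((1 - P) + P / N). The sketch below is an illustration only. The function names are made up here and the slide's own plot data is not reproduced. It shows how large P has to be before the 10x target is reachable.

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Speedup S for parallel fraction p (0..1) on n processors."""
    if not 0.0 <= p <= 1.0 or n < 1:
        raise ValueError("need 0 <= p <= 1 and n >= 1")
    return 1.0 / ((1.0 - p) + p / n)


def required_parallel_fraction(target: float, n: int) -> float:
    """Smallest p that reaches `target` speedup on n processors.

    Solving S = 1 / ((1 - p) + p / n) for p gives
    p = (1 - 1/S) / (1 - 1/n).
    """
    if n < 2 or target < 1.0 or target > n:
        raise ValueError("target must lie in [1, n] with n >= 2")
    return (1.0 - 1.0 / target) / (1.0 - 1.0 / n)


if __name__ == "__main__":
    for n in (16, 64, 256):
        p = required_parallel_fraction(10.0, n)
        print(f"N={n:4d}: need P >= {p:.3f} for 10x "
              f"(check: S = {amdahl_speedup(p, n):.2f})")
```

With 16 processors, 96% of the work must parallelize to reach 10x. The requirement falls toward 90% as N grows but never drops below it, which is one way to read the "tyranny" in the slide title.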

slide-19
SLIDE 19

We Investigate: Who’s to Blame?


?

Programmers Educators Architects The Transistor What is the solution?

slide-20
SLIDE 20

A Story about Jason and His Two Advisors


slide-21
SLIDE 21

EVA: Embedded Vision Architecture

  • Application-specific Functional Units
  • Heterogeneous Multicore
  • Customized Memory System
  • EVA Functional Units: Monopoly Compare, Dot Product Unit, Vector Max, Decision Tree Compare

Initial EVA design: 90x greater efficiency for computer vision algorithms

slide-22
SLIDE 22

Where We Need to Focus

Parallelism, Customization

Heterogeneous parallel systems overcome dark silicon and the tyranny of Amdahl’s Law.
slide-23
SLIDE 23

Why These Ideas Will Likely Fail, Unless We Make a Change…

  • The Good: Hetero-parallel systems can close the Moore’s Law gap
  • The Bad: Dennard scaling has stopped, Moore’s Law is slowing, leaving a growing gap
  • The Ugly: Hetero-parallel designs needed to close the gap will be too expensive to afford
  • We must make design much cheaper!

slide-24
SLIDE 24

What I Want You to Remember

  • Successfully bridging the Moore’s Law performance gap is less about “How” to do it and more about “How Much” does it cost!
  • My claim: if we can effect a 100x reduction in the cost to bring a design to market, innovation will flourish and scaling challenges will be overcome.
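
A back-of-the-envelope sketch of why the cost claim matters: the one-time design cost has to be recovered over the units sold, so a design that is 100x cheaper to bring to market breaks even at 100x lower volume. All figures below are hypothetical placeholders chosen for illustration, not data from the talk, and `break_even_units` is a name made up for this sketch.

```python
def break_even_units(design_cost: float, margin_per_unit: float) -> float:
    """Units that must be sold before the design cost is recovered."""
    if margin_per_unit <= 0:
        raise ValueError("margin_per_unit must be positive")
    return design_cost / margin_per_unit


if __name__ == "__main__":
    # Hypothetical numbers chosen only for illustration.
    design_cost = 100e6       # dollars to bring a design to market
    margin_per_unit = 10.0    # dollars of margin earned per chip sold

    today = break_even_units(design_cost, margin_per_unit)
    cheaper = break_even_units(design_cost / 100, margin_per_unit)
    print(f"today:        {today:,.0f} units to break even")
    print(f"100x cheaper: {cheaper:,.0f} units to break even")
```

Under these assumed numbers, a market of a hundred thousand units could justify its own custom chip, where before it would have needed ten million.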


slide-25
SLIDE 25

Design Costs Are Skyrocketing

[Figure: Cost to Market ($ million, axis 0 to 140) by Silicon Technology Node (0.5u, 0.35u, 0.25u, 0.18u, 0.13u, 90nm, 65nm, 45nm, 28nm, 20nm), broken down into Mask Costs, S/W Development and Testing, and H/W Design and Verification. Chart annotations: $88M, $120M, $500K. Source: International Business Strategies]

slide-26
SLIDE 26

Outcome: “Nanodiversity” is Dwindling

[Figure: Total ASIC Starts by Year, 1995 to 2014 (axis 2000 to 12000). Source: Gartner Group]

slide-27
SLIDE 27

Inexpensive “Design” Promotes Innovation and Adaptation

  • Don’t Believe Me? Ask Mother Nature!
  • r/K selection theory is a biological mechanism that organisms use to better adapt to their environment
  • In unstable environments, r-selection predominates as the ability to reproduce quickly is crucial
  • In stable environments, K-selection predominates as the ability to compete successfully for limited resources is crucial

slide-28
SLIDE 28

The Remedy: Scale Innovation

  • Ultimate goal: accelerate system architecture innovation and make it sufficiently inexpensive that anyone can do it anywhere

  • Approach #1: Expect more from architectural innovation
  • Approach #2: Reduce the cost to design custom hardware
  • Approach #3: Embrace open-source concepts
  • Approach #4: Widen the applicability of custom hardware
  • Approach #5: Reduce the cost of manufacturing custom H/W


slide-29
SLIDE 29

1) Expect more from architectural innovation

“Give me 15% speedup and I’ll accept your paper”
“I need 1% speedup for 1% area”
“Your idea needs to deliver 2x or more, or someone else should fund it”

slide-30
SLIDE 30

HELIX-UP Unleashed Parallelization

  • Traditional parallelizing compilers must honor possible dependencies
  • HELIX-UP manufactures parallelism by profiling which dependencies do not exist and which are not needed
  • Based on user-supplied output distortion function
  • Big step for parallelization
  • 2x speedup over parallelizing compilers, 6x over serial, < 7% distortion
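
The idea of trading a dependency for bounded output distortion can be shown with a toy loop. This is not HELIX-UP itself, which is a compiler; it is a hand-written sketch under stated assumptions. The smoothing kernel, the chunking scheme, the function names, and the particular distortion metric are all invented for illustration. Only the 7% budget echoes the figure on the slide.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def smooth(xs: Sequence[float], alpha: float, start: float) -> List[float]:
    """Exponential smoothing; each output depends on the previous one."""
    out, prev = [], start
    for x in xs:
        prev = alpha * x + (1.0 - alpha) * prev
        out.append(prev)
    return out


def smooth_relaxed(xs: Sequence[float], alpha: float, chunks: int) -> List[float]:
    """Drop the dependency between chunks: each chunk restarts from its own
    first input instead of waiting for the previous chunk's final value."""
    if not xs or chunks < 1:
        raise ValueError("need a non-empty input and chunks >= 1")
    size = -(-len(xs) // chunks)  # ceiling division
    parts = [xs[i:i + size] for i in range(0, len(xs), size)]
    # Threads are used only to show that the chunks are now independent;
    # CPython threads will not speed up this pure-Python loop.
    with ThreadPoolExecutor(max_workers=chunks) as pool:
        results = list(pool.map(lambda p: smooth(p, alpha, p[0]), parts))
    return [y for part in results for y in part]


def distortion(exact: Sequence[float], approx: Sequence[float]) -> float:
    """Example user-supplied metric: mean absolute error, relative to the
    mean magnitude of the exact output."""
    scale = sum(abs(e) for e in exact) / len(exact) or 1.0
    total = sum(abs(e - a) for e, a in zip(exact, approx))
    return total / (len(exact) * scale)


if __name__ == "__main__":
    xs = [float((i * 37) % 11) for i in range(1000)]
    exact = smooth(xs, 0.5, xs[0])
    approx = smooth_relaxed(xs, 0.5, chunks=4)
    d = distortion(exact, approx)
    budget = 0.07  # hypothetical tolerance, echoing the slide's "< 7%"
    print(f"distortion = {d:.4%}, accepted = {d <= budget}")
```

The relaxed version is wrong only near chunk boundaries, and the error there decays geometrically, so the user-supplied metric decides whether the extra parallelism is worth it.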

[Figure: Iteration 0 and Iteration 1 spread across Thread 0 to Thread 3, with data passed between threads]

David Brooks @ Harvard

Nehalem: 6 cores, 2 threads per core

slide-31
SLIDE 31

Association Rule Mining with the Automata Processor

  • Micron’s Automata processor
  • Implements FSMs at memory
  • Massively parallel with accelerators
  • Mapped data-mining association rule mining (ARM) rules to memory-based FSMs
  • ARM algorithms identify relationships between data elements
  • Implementations are often memory bottlenecked

  • Big-data sets had big speedups
  • 90x+ over single CPU performance
  • 2-9x+ speedups over CMPs and GPUs
  • Joint effort with UVA and Micron
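
As background on what association rule mining computes, here is a minimal CPU-side sketch of the two standard measures, support and confidence. It has nothing to do with the Automata Processor mapping described above. The basket data and function names are invented for illustration.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set


def support(itemset: FrozenSet[str], transactions: List[Set[str]]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent: FrozenSet[str], consequent: FrozenSet[str],
               transactions: List[Set[str]]) -> float:
    """Estimate of P(consequent | antecedent) from the transactions."""
    base = support(antecedent, transactions)
    if base == 0.0:
        return 0.0
    return support(antecedent | consequent, transactions) / base


def frequent_pairs(transactions: List[Set[str]],
                   min_support: float) -> Dict[FrozenSet[str], float]:
    """All item pairs whose support is at least `min_support`."""
    items = sorted(set().union(*transactions))
    result = {}
    for pair in combinations(items, 2):
        s = support(frozenset(pair), transactions)  # full scan per candidate
        if s >= min_support:
            result[frozenset(pair)] = s
    return result


if __name__ == "__main__":
    baskets = [
        {"bread", "milk"},
        {"bread", "butter"},
        {"bread", "milk", "butter"},
        {"milk"},
    ]
    for pair, s in frequent_pairs(baskets, 0.5).items():
        print(sorted(pair), f"support={s:.2f}")
    c = confidence(frozenset({"bread"}), frozenset({"milk"}), baskets)
    print(f"confidence(bread -> milk) = {c:.2f}")
```

Every candidate itemset triggers another pass over the transaction data. That repeated scanning is plausibly the memory-bound pattern the slide refers to.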


Kevin Skadron @ UVA

slide-32
SLIDE 32

2) Reduce the cost to design custom hardware

  • Better tools and infrastructure
  • Scalable accelerator synthesis and compilation: generate code and H/W for highly reusable accelerators
  • Composable design space exploration: enables efficient exploration of highly complex design spaces

  • Well put-together benchmark suites to drive development efforts
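
To make "design space exploration" concrete, the sketch below sweeps two accelerator parameters of the kind named on this slide (number of functional units and memory bandwidth) and keeps the Pareto-optimal points. The cost model is entirely hypothetical, and an exhaustive sweep is the simplest possible strategy. It is not the tooling the slide describes.

```python
from itertools import product
from typing import List, NamedTuple


class Design(NamedTuple):
    num_fu: int
    mem_bw: int
    latency: float
    area: float


def evaluate(num_fu: int, mem_bw: int) -> Design:
    """Hypothetical cost model: latency is set by the slower of compute and
    memory; area grows linearly with both resources."""
    work, traffic = 1024.0, 512.0  # made-up operations and bytes per kernel
    latency = max(work / num_fu, traffic / mem_bw)
    area = 1.0 * num_fu + 0.5 * mem_bw
    return Design(num_fu, mem_bw, latency, area)


def pareto_front(designs: List[Design]) -> List[Design]:
    """Keep designs that no other design beats on both latency and area."""
    def dominated(d: Design) -> bool:
        return any(
            o.latency <= d.latency and o.area <= d.area
            and (o.latency < d.latency or o.area < d.area)
            for o in designs
        )
    return [d for d in designs if not dominated(d)]


if __name__ == "__main__":
    sweep = [evaluate(f, b) for f, b in product((1, 2, 4, 8, 16), (1, 2, 4, 8))]
    for d in sorted(pareto_front(sweep), key=lambda d: d.area):
        print(f"FUs={d.num_fu:2d} BW={d.mem_bw} "
              f"latency={d.latency:7.1f} area={d.area:5.1f}")
```

Even this toy sweep shows why exploration cost matters: the number of points grows multiplicatively with each added parameter, so exhaustive evaluation stops being practical quickly.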

[Figure labels: Unmodified C-Code; Accelerator Design Parameters (e.g., # FU, mem. BW); Accelerator Specific Datapath; Private L1/Scratchpad; Shared Memory/Interconnect Models]

David Brooks @ Harvard

slide-33
SLIDE 33

[Figure labels: Feature Tracking, Disparity Map, Image Stitch, Image Segmentation, Robot Localization, Texture Synthesis, SIFT, Support Vector Machines]

CortexSuite: A Synthetic Brain Benchmark Suite

Michael Taylor @ UCSD


slide-34
SLIDE 34

3) Embrace Open-Source Concepts

  • Thought experiment: let’s design the next great smartphone

Red = non-free IP, Green = free IP

slide-35
SLIDE 35

3) Embrace Open-Source Concepts


As a community, we need to consider: How much of our basic technology should be free?

Red = non-free IP, Green = free IP

slide-36
SLIDE 36

Open-Source H/W is Growing


slide-37
SLIDE 37

4) Widen the Applicability of Customized H/W


  • ESP: Ensembles of Specialized Processors
  • Ensembles are algorithm-specific processors optimized for code “patterns”
  • Approach uses composable customization to deliver speed and efficiency that is widely applicable to general purpose programs
  • Grand challenges remain: what are the components and how are they connected?

[Figure labels: Applications: Multimedia Analysis, Computer Vision, Machine Learning; Computational Patterns: Dense, Graph, Sparse; ESP Code: Glue Code, Dense Code, Sparse Code, Graph Code; ESP Core: ILP Engine, Dense Engine, Sparse Engine, Graph Engine; Specializers with custom implementations and autotuning]

Krste Asanovic @ UC-Berkeley

slide-38
SLIDE 38

5) Reduce the cost of manufacturing customized H/W

  • Brick-and-mortar silicon explores assembly-time customization, i.e., MCMs + 3D + FPGA interconnect
  • Diversity via brick ecosystem & interconnect flexibility
  • Brick design costs amortized across all designs
  • Robust interconnect and custom bricks rival ASIC speeds
  • Another thought experiment: what if building a house were like fabricating a chip?

Brick-and-mortar silicon design flow:
  1) Assemble brick layer
  2) Connect with mortar layer
  3) Package assembly
  4) Deploy software

[Figure label: H/W brick]

Martha Kim @ Columbia

slide-39
SLIDE 39

Conclusions

  • Heterogeneous design could continue Moore’s law perf. scaling via innovation alone
  • But, it requires a diverse hardware ecosystem with affordable customization
  • Effective and affordable customization won’t happen without our help
    1. Expect more from architectural innovation
    2. Reduce the cost to design customized hardware
    3. Embrace open-source concepts
    4. Widen the applicability of customization
    5. Reduce the cost of custom manufacturing
  • Increasing “nanodiversity” is a good thing
  • More jobs, companies, and students
  • More competition and scalable innovation

slide-40
SLIDE 40

Questions?