Hardware–Software Co-Design: Not Just a Cliché. Adrian Sampson, James Bornholt, Luis Ceze (PowerPoint PPT presentation)



SLIDE 1

Adrian Sampson, James Bornholt, Luis Ceze
SAMPA group, University of Washington
SNAPL 2015

Hardware–Software Co-Design: Not Just a Cliché

SLIDE 2

timeline (not to scale): time immemorial → 2005 → 2015

SLIDE 3

timeline (not to scale): time immemorial → 2005 → 2015

free lunch

exponential single-threaded performance scaling!

SLIDE 4

[Figure: clock frequency (MHz, log scale from 10 to 10,000) versus year of introduction, 1985–2020; the surviving caption fragment refers to the period from 1986 to 2008 as measured by benchmarks.]

SLIDE 5

timeline: time immemorial → 2005 → 2015

free lunch → multicore era

we’ll scale the number of cores instead

SLIDE 6

The multicore transition was a stopgap, not a panacea.

SLIDE 7

timeline: time immemorial → 2005 → 2015

free lunch → multicore era → ? ? ? ? (who knows?)

SLIDE 8

Application / Language / Architecture / Circuits

SLIDE 9

Application / Language / Architecture / Circuits
hardware–software abstraction boundary

parallelism · data movement · guard bands · energy costs

SLIDE 10

Application / Language / Architecture / Circuits
hardware–software abstraction boundary

parallelism · data movement · guard bands · energy costs

SLIDE 11

lessons learned from Approximate Computing

New Opportunities for hardware–software co-design

SLIDE 12

lessons learned from Approximate Computing

New Opportunities for hardware–software co-design

SLIDE 13

Application / Language / Architecture / Circuits
new abstractions for incorrectness

SLIDE 14
SLIDE 15
SLIDE 16
SLIDE 17

Application / Language / Architecture / Circuits
new abstractions for incorrectness

type systems · debuggers · probabilistic guarantees · auto-tuning · flaky functional units · lossy cache compression · neural acceleration · drowsy SRAMs
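The software-side items in that list are easiest to make concrete with a type system. Below is a minimal sketch in the spirit of approximate type systems such as EnerJ; the class and function names are mine, not any real system's API. The key rule: approximate data may not flow into precise data unless the programmer explicitly endorses it.

```python
# Hedged sketch (not the EnerJ implementation): a wrapper that enforces the
# central typing rule of approximate type systems -- approximate data may
# not flow into precise storage without an explicit endorsement.

class Approx:
    """A value that may have been computed unreliably."""
    def __init__(self, value):
        self.value = value

    def __add__(self, other):
        # Arithmetic involving approximate data stays approximate.
        other_v = other.value if isinstance(other, Approx) else other
        return Approx(self.value + other_v)

def endorse(x):
    """Explicitly cast approximate data back to precise (the unsafe escape hatch)."""
    return x.value if isinstance(x, Approx) else x

def precise_store(dest, key, x):
    """Model a precise assignment: reject unendorsed approximate data."""
    if isinstance(x, Approx):
        raise TypeError("approximate value flows into precise storage; endorse() it first")
    dest[key] = x

pixels = Approx(100) + 20                         # approximate arithmetic: still Approx
state = {}
precise_store(state, "total", endorse(pixels))    # OK: explicit endorsement
```

An approximate value stays approximate through arithmetic, so one annotation at the source taints every result derived from it, and `endorse` is the single, auditable escape hatch.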

SLIDE 18

The von Neumann curse

useful work

other crud we don’t care about and can’t fix

SLIDE 19

Hardware design costs sanity & well-being

Thierry Moreau, FPGA design champion

[Moreau et al.; HPCA 2015]

SLIDE 20

Trust your compiler

[Esmaeilzadeh, Sampson, Ceze, Burger; ASPLOS 2012]

approximate cache

SLIDE 21

Trust your compiler

st   r1 x
st.a r2 y
ld   x r3
ld.a y r4

[Esmaeilzadeh, Sampson, Ceze, Burger; ASPLOS 2012]

approximate cache

SLIDE 22

Trust your compiler

st   r1 x
st.a r2 y
ld   x r3
ld.a y r4

[Esmaeilzadeh, Sampson, Ceze, Burger; ASPLOS 2012]

approximate cache
1 1 1
line state bits?

SLIDE 23

Trust your compiler

st   r1 x
st.a r2 y
ld   x r3
ld.a y r4

[Esmaeilzadeh, Sampson, Ceze, Burger; ASPLOS 2012]

approximate cache
line state bits?
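The question on that slide can be made concrete with a toy model (my Python sketch, not the mechanism of the cited ASPLOS 2012 design): give each cache line one extra state bit recording whether the compiler stored approximate data there, set and checked by the `st.a`/`ld.a` instructions.

```python
# Hedged sketch of the "line state bits?" question: a cache where one extra
# bit per line records whether the line holds approximate data. The
# compiler-emitted st.a/ld.a set and check the bit.

class ApproxCache:
    def __init__(self):
        self.lines = {}   # address -> (value, approx_bit)

    def st(self, addr, value):            # precise store: clear the bit
        self.lines[addr] = (value, False)

    def st_a(self, addr, value):          # approximate store: tag the line
        self.lines[addr] = (value, True)

    def ld(self, addr):                   # precise load: must not read a tagged line
        value, approx = self.lines[addr]
        if approx:
            raise RuntimeError("precise load from approximate line")
        return value

    def ld_a(self, addr):                 # approximate load: either is fine
        return self.lines[addr][0]

cache = ApproxCache()
cache.st("x", 1)      # st   r1 x
cache.st_a("y", 2)    # st.a r2 y
r3 = cache.ld("x")    # ld   x r3  -> precise line, allowed
r4 = cache.ld_a("y")  # ld.a y r4  -> approximate line, allowed
```

One bit per line is the cheapest encoding; the open design question the slide points at is what happens when precise and approximate words share a line.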

SLIDE 24

lessons learned from Approximate Computing

New Opportunities for hardware–software co-design

SLIDE 25

More hardware flexibility that humans can actually program

SLIDE 26

More hardware flexibility that humans can actually program

FPGA

SLIDE 27

More hardware flexibility that humans can actually program

FPGA: explicit data movement · explicit memory blocks · explicit physical routing · explicit clock frequency · explicit ILP · explicit numeric bit width
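One knob from that list, explicit numeric bit width, can be illustrated directly (the example and function name are mine): on an FPGA the designer picks exactly how many bits each datapath carries, and a quick model of an n-fractional-bit fixed-point value shows the precision/width trade-off a language for this hardware would have to expose.

```python
# Hedged illustration of explicit numeric bit width: quantize a value to a
# fixed-point representation with a chosen number of fractional bits, the
# way an FPGA datapath of that width would.

def quantize(x, frac_bits):
    """Round x to the nearest fixed-point value with frac_bits fractional bits."""
    scale = 1 << frac_bits
    return round(x * scale) / scale

# Fewer bits -> cheaper hardware, coarser values.
pi_8  = quantize(3.14159265, 8)   # 3.140625
pi_16 = quantize(3.14159265, 16)  # much closer to pi
assert abs(pi_16 - 3.14159265) < abs(pi_8 - 3.14159265)
```

Narrower datapaths cost less area and energy on the fabric, which is exactly why the width wants to be programmer-visible rather than fixed at 32 or 64 bits.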

SLIDE 28

More hardware flexibility that humans can actually program

A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services

Andrew Putnam, Adrian M. Caulfield, Eric S. Chung, Derek Chiou, Kypros Constantinides, John Demme, Hadi Esmaeilzadeh, Jeremy Fowers, Gopi Prashanth Gopal, Jan Gray, Michael Haselman, Scott Hauck, Stephen Heil, Amir Hormati, Joo-Young Kim, Sitaram Lanka, James Larus, Eric Peterson, Simon Pope, Aaron Smith, Jason Thong, Phillip Yi Xiao, Doug Burger (Microsoft). Abstract:

Datacenter workloads demand high computational capabilities, flexibility, power efficiency, and low cost. It is challenging to improve all of these factors simultaneously. To advance datacenter capabilities beyond what commodity server designs can provide, we have designed and built a composable, reconfigurable fabric to accelerate portions of large-scale software services. Each instantiation of the fabric consists of a 6x8 2-D torus of high-end Stratix V FPGAs embedded into a half-rack of 48 machines. One FPGA is placed into each server, accessible through PCIe, and wired directly to other FPGAs with pairs of 10 Gb SAS cables. In this paper, we describe a medium-scale deployment of this fabric on a bed of 1,632 servers, and measure its efficacy in accelerating the Bing web search engine. We describe the requirements and architecture of the system, detail the …

… desirable to reduce management issues and to provide a consistent platform that applications can rely on. Second, datacenter services evolve extremely rapidly, making non-programmable hardware features impractical. Thus, datacenter providers are faced with a conundrum: they need continued improvements in performance and efficiency, but cannot obtain those improvements from general-purpose systems. Reconfigurable chips, such as Field Programmable Gate Arrays (FPGAs), offer the potential for flexible acceleration of many workloads. However, as of this writing, FPGAs have not been widely deployed as compute accelerators in either datacenter infrastructure or in client devices. One challenge traditionally associated with FPGAs is the need to fit the accelerated function into the available reconfigurable area. One could virtualize the FPGA by reconfiguring it at run-time to support more functions than could fit into a single device. However, current reconfiguration times for standard FPGAs …

SLIDE 29

More hardware flexibility that humans can actually program

A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services

23 authors!

SLIDE 30

Trust, but formally verify

useful work

SLIDE 31

Trust, but formally verify

useful work
checking that software doesn’t do anything crazy

SLIDE 32

Trust, but formally verify

Application / Language / Architecture / Circuits

verified properties

e.g., [Hunt and Larus; OSR April 2007]
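A minimal sketch of what "verified properties" buys (my framing, not the mechanism of the cited Singularity work): if the language layer can prove a property, e.g., that every index is in bounds, the layers below can stop re-checking it on every access.

```python
# Hedged sketch: the same computation written with per-access run-time
# checking (what hardware and runtimes do for untrusted code) and without
# it (what a compiler-verified property permits).

def checked_sum(buf):
    # Untrusted path: every access pays an explicit bounds check.
    total = 0
    for i in range(len(buf)):
        if not (0 <= i < len(buf)):
            raise IndexError(i)
        total += buf[i]
    return total

def verified_sum(buf):
    # Verified path: a static proof that i stays in 0..len(buf)-1
    # discharges the check once, at compile time, so none executes here.
    total = 0
    for x in buf:
        total += x
    return total

assert checked_sum([3, 1, 4]) == verified_sum([3, 1, 4]) == 8
```

Singularity applies the same move at the process-isolation level: verified type-safe code lets the system drop hardware memory protection between processes.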

SLIDE 33

Hardware beyond core computation

power supply & battery · mobile display & backlight · new memory technologies · software-defined networking · CPU · GPU · FPGA · accelerators

SLIDE 34

timeline: time immemorial → 2005 → 2015

free lunch → multicore era → the era of language co-design?

SLIDE 35