Towards Accelerator-Rich Architectures and Systems, Zhenman Fang (PowerPoint PPT Presentation)


SLIDE 1

Towards Accelerator-Rich Architectures and Systems

Zhenman Fang, Postdoc

Computer Science Department, UCLA Center for Domain-Specific Computing Center for Future Architectures Research https://sites.google.com/site/fangzhenman/

SLIDE 2

The Trend of Accelerator-Rich Chips

Fixed-function accelerators (ASICs: application-specific integrated circuits) are used instead of general-purpose processors for specialized tasks, e.g., audio, video, face detection, imaging, DSP, …

Die photo of the Apple A8 SoC, showing CPU, GPU, and specialized accelerator blocks [www.anandtech.com/show/8562/chipworks-a8]

Chart: Increasing # of IP blocks in Apple SoCs (estimated), from roughly 5 in the A4 (2010) to roughly 40 in the A10 (2016), per Maltiel Consulting's and Harvard's estimates [Shao, IEEE Micro'15]

SLIDE 3

The Trend of Accelerator-Rich Cloud

Cloud service providers have begun to deploy FPGAs in their datacenters.

Field-Programmable Gate Array (FPGA) accelerators:
- Reconfigurable commodity hardware
- Energy-efficient: a high-end board consumes only ~25 W

Figure: FPGA-equipped servers achieve a 2x throughput improvement [Putnam, ISCA'14]

SLIDE 4

The Trend of Accelerator-Rich Cloud

Cloud service providers have begun to deploy FPGAs in their datacenters; accelerators are becoming first-class citizens.

- Intel expects 30% of datacenter nodes to include FPGAs by 2020, following its $16.7 billion acquisition of Altera

Figure: FPGA-equipped servers achieve a 2x throughput improvement [Putnam, ISCA'14]

SLIDE 5

Post-Moore Era: Potential for Customized Accelerators

Accelerators promise 10x to 1000x gains in performance per watt by trading off flexibility for efficiency.

Chart: energy efficiency of ASICs and FPGAs versus general-purpose processors as Moore's law ends (source: Bob Brodersen, Berkeley Wireless group)

SLIDE 6

Challenges in Making Accelerator-Rich Architectures and Systems Mainstream

"Extended" Amdahl's law:

overall_speedup = 1 / ( kernel% / acc_speedup + (1 − kernel%) + integration_overhead )

where kernel% is the fraction of time spent in the accelerated kernel, acc_speedup is the accelerator's speedup on that kernel, and integration_overhead is the CPU-accelerator integration overhead.

How to characterize and accelerate killer applications? How to efficiently integrate accelerators into future chips?

- E.g., a naïve integration achieves only 12% of ideal performance [HPCA'17]
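The impact of integration overhead in the extended law is easy to see by plugging in numbers; below is a minimal Python sketch of the formula above (variable names are ours, not the slide's):

```python
def overall_speedup(kernel_frac, acc_speedup, integration_overhead):
    """'Extended' Amdahl's law: the kernel portion shrinks by acc_speedup,
    the serial remainder stays, and the integration overhead (normalized
    to the original runtime) is added back."""
    return 1.0 / (kernel_frac / acc_speedup
                  + (1.0 - kernel_frac)
                  + integration_overhead)

# A 90% kernel with a 100x accelerator: integration overhead dominates.
print(overall_speedup(0.9, 100, 0.0))   # ~9.2x with no overhead
print(overall_speedup(0.9, 100, 0.1))   # ~4.8x with 10% overhead
```

Even a modest 10% integration overhead halves the achievable speedup, which is why the integration questions below matter.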

SLIDE 7

How to characterize and accelerate killer applications?

How to efficiently integrate accelerators into future chips?

- E.g., a naïve integration achieves only 12% of ideal performance [HPCA'17]

How to deploy commodity accelerators in big data systems?

- E.g., a naïve integration may lead to a 1000x slowdown [HotCloud'16]

How to program such architectures and systems?

Challenges in Making Accelerator-Rich Architectures and Systems Mainstream

SLIDE 8

Overview of My Research

1. Application Drivers: workload characterization and acceleration
2. Accelerator-Rich Architectures (ARA): modeling and optimizing CPU-accelerator interaction
3. Accelerator-Rich Systems: Accelerator-as-a-Service (AaaS) in cloud deployment
4. Compiler Support: from many-core to accelerator-rich architectures
SLIDE 9

Dimension #1: Application Drivers

Image processing [ISPASS'11]:
- Analysis and combination of task, pipeline, and data parallelism
- 13x speedup on a 16-core CPU; 46x speedup on a GPU

Deep learning [ICCAD'16]:
- Caffeine: FPGA engine for Caffe
- 1.46 TOPS for the 8-bit Conv layer; 100x speedup for the FCN layer; 5.7x energy savings over a GPU

Genomics [D&T'17]:
- 2.6x speedup for in-memory genome sort (SAMtools)
- Record 9.6 GB/s throughput for genome compression on Intel-Altera HARPv2; 50x speedup over zlib

How do accelerators achieve such speedups?

SLIDE 10

The convolutional layer computes:

Out[m][r][c] = Σ_{n=0}^{N} Σ_{i=0}^{K1} Σ_{j=0}^{K2} W[m][n][i][j] ∗ In[n][S1∗r+i][S2∗c+j]

Dimension #1: Application Drivers

image processing [ISPASS'11] deep learning [ICCAD'16] genomics [D&T'17]

Diagram: the accelerator datapath multiplies input/weight pairs and reduces them through an adder tree to produce each output.

E.g., a convolutional accelerator keeps Weight, Input, and Output buffers on chip, fed from DRAM. The kernel is convolutional matrix multiplication, and the main optimizations are: #1 caching, #2 customized pipeline, #3 parallelization, #4 double buffering, #5 DRAM re-organization, #6 precision customization.
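The kernel above can be stated as a plain loop nest; the following is a minimal NumPy reference sketch of that computation (the FPGA version applies the six optimizations listed above, which this naive code deliberately omits):

```python
import numpy as np

def conv_layer(In, W, S1=1, S2=1):
    """Reference loop nest for the convolutional kernel:
    Out[m][r][c] = sum_{n,i,j} W[m][n][i][j] * In[n][S1*r+i][S2*c+j]."""
    M, N, K1, K2 = W.shape
    R = (In.shape[1] - K1) // S1 + 1   # output rows
    C = (In.shape[2] - K2) // S2 + 1   # output columns
    Out = np.zeros((M, R, C))
    for m in range(M):                 # output feature maps
        for r in range(R):
            for c in range(C):
                for n in range(N):     # input feature maps
                    for i in range(K1):
                        for j in range(K2):
                            Out[m][r][c] += W[m][n][i][j] * In[n][S1 * r + i][S2 * c + j]
    return Out
```

Each of the slide's optimizations targets a different inefficiency of this loop nest, e.g., double buffering overlaps the DRAM transfers of In/W tiles with the multiply-accumulate work.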

SLIDE 11

Dimension #1: Application Drivers

Chart: performance of the deep learning accelerator versus baselines, from a few GFLOPS up to 1.46 TOPS on the FPGA.
- Programmed in Xilinx High-Level Synthesis (HLS)
- Results collected on an Alpha Data PCIe-7v3 FPGA board

SLIDE 12

Dimension #2: Accelerator-Rich Architectures (ARA)

Overview of an accelerator-rich architecture:
- Accelerators Acc1…Accn, each with a ScratchPad Memory (SPM) and a DMA engine
- Cores C1…Cm with private L1 caches
- Global Accelerator Manager (GAM) with ISA extensions to manage accelerators
- A customizable network-on-chip connecting cores, accelerators, the TLB, the shared LLC, and the DRAM controller

SLIDE 13

Dimension #2: Accelerator-Rich Architectures (ARA)

ARA modeling:
- PARADE simulator: gem5 + HLS [ICCAD'15]
- Fast ARAPrototyper flow on FPGA SoCs [arXiv'16]

Multicore modeling:
- Transformer simulator [DAC'12, LCTES'12]

ARA optimization:
- Sources of accelerator gains [FCCM'16]
- CPU-accelerator co-design: address translation for a unified memory space; 7.6x speedup, 6.4% gap to ideal [HPCA'17 best paper nominee]
- AIM: near-memory acceleration gives another 4x speedup [MemSys'17]

More information in the ISCA'15 & MICRO'16 tutorials: http://accelerator.eecs.harvard.edu/micro16tutorial/

PARADE is open source: http://vast.cs.ucla.edu/software/parade-ara-simulator

SLIDE 14

Dimension #3: Accelerator-Rich Systems

Three parties: the accelerator designer (e.g., FPGA), the cloud service provider with an accelerator-enabled cloud, and the big data application developer (e.g., Spark).

Accelerator-as-a-Service:
- Easy accelerator registration into the cloud
- Easy and efficient accelerator invocation and sharing

Blaze prototype: 1 server with an FPGA ≈ 3 CPU servers [HotCloud'16, ACM SOCC'16]. Blaze works with Spark and YARN and is open source: https://github.com/UCLA-VAST/blaze

CPU-FPGA platform choice [DAC'16]: 1) mainstream PCIe, 2) coherent PCIe (CAPI), or 3) Intel-Altera HARP (coherent, in one package)

SLIDE 15

Dimension #4: Compiler Support

Memory system improvement [ICS'14, TACO'15, ongoing]:
- Source-to-source compiler for coordinated data prefetching: 1.5x speedup on the Xeon Phi many-core processor

Future work: compiler support for accelerator-rich architectures & systems

SLIDE 16

Overview of My Research

- Application drivers: image processing [ISPASS'11], deep learning [ICCAD'16], genomics [D&T'17]
- Compiler support: memory system improvement [ICS'14, TACO'15, ongoing]
- Accelerator-rich architectures: PARADE [ICCAD'15], ARAPrototyper [arXiv'16], Transformer [DAC'12, LCTES'12], sources of gains [FCCM'16], CPU-Acc address translation [HPCA'17 best paper nominee], near-memory acceleration [MemSys'17]
- Accelerator-rich systems: CPU-FPGA platform choice (PCIe or QPI?) [DAC'16]; AaaS on FPGA: Blaze [HotCloud'16, ACM SOCC'16]
- Tool: system-level automation; tutorials [ISCA'15 & MICRO'16]

SLIDE 17

Chip-Level CPU-Accelerator Co-design:

Address Translation for Unified Memory Space

[HPCA'17 Best Paper Nominee]

Better programmability and performance

SLIDE 18

Virtual Memory and Address Translation 101

Virtual memory and its benefits:
- Shared memory across processes
- Memory isolation for security
- Conceptually more memory than is physically available

Key components (Core -> MMU/TLB -> Memory):
- Memory Management Unit (MMU): performs virtual-to-physical address translation
- Translation Lookaside Buffer (TLB): caches address translation results
- Page table: maps per-process virtual memory to physical memory at page granularity

SLIDE 19

Inefficiency in Today's ARA Address Translation

Today's ARAs translate accelerator addresses through an IOMMU with a small IOTLB (e.g., 32 entries), alongside the per-core MMU/TLBs; accelerators access memory via DMA from their scratchpads.

#1 Inefficient TLB support: TLBs are not specialized to provide low latency and capture accelerator page locality.

#2 High page walk latency: on an IOTLB miss, 4 main memory accesses are required to walk the page table.

Chart: performance relative to ideal address translation across medical imaging (Deblur, Denoise, Regist., Segment.), commercial (Black., Stream., Swapt.), vision (DispMap, LPCIP), and navigation (EKFSLAM, RobLoc) benchmarks. On average (gmean), the IOMMU achieves only 12% of ideal performance.

SLIDE 20

Accelerator Performance Is Highly Sensitive to Address Translation Latency

Chart: gmean performance relative to ideal address translation as translation latency is swept from 1 to 1024 cycles; performance drops steeply as latency grows.

We must provide efficient address translation support.

SLIDE 21

Characteristic #1: Regular Bulk Transfer of Consecutive Data (Pages)

The TLB miss behavior of the BlackScholes benchmark shows accesses to consecutive pages issued as one large memory reference.

This opens opportunities for relatively simple TLB and page walker designs.
SLIDE 22

Characteristic #2: Impact of Data Tiling: Breaking a Page Across Multiple Accelerators

Original: a 32 x 32 x 32 data array; rectangular tiling yields 16 x 16 x 16 tiles. Each tile is mapped to a different accelerator for parallel processing, but one page is then split across 4 accelerators (the diagram shows pages 0, 1, 15, 16, and 31 straddling tile boundaries).

A shared TLB can be very helpful here.
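This sharing pattern can be verified in a few lines of Python. Assuming a row-major 32x32x32 array of 4-byte elements and 4 KB pages (our assumptions, chosen to match the slide's numbers), every page is touched by exactly 4 tiles:

```python
# Count how many accelerators (tiles) each 4 KB page of a 32x32x32
# 4-byte-element array touches under 16x16x16 rectangular tiling.
DIM, TILE, PAGE_BYTES, ELEM_BYTES = 32, 16, 4096, 4

def page_of(x, y, z):
    linear = (x * DIM + y) * DIM + z          # row-major element index
    return (linear * ELEM_BYTES) // PAGE_BYTES

def tile_of(x, y, z):
    return (x // TILE, y // TILE, z // TILE)  # the accelerator owning (x,y,z)

pages = {}
for x in range(DIM):
    for y in range(DIM):
        for z in range(DIM):
            pages.setdefault(page_of(x, y, z), set()).add(tile_of(x, y, z))

sharing = {len(tiles) for tiles in pages.values()}
print(len(pages), sharing)  # 32 pages, each shared by 4 accelerators
```

With these sizes one page holds exactly one x-plane of the array, which is cut into 4 pieces by the tiling in y and z, so a private per-accelerator TLB would cache the same translation 4 times.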

SLIDE 23

Our Two-Level TLB Design

Each accelerator has a private TLB; misses go to a shared TLB and then to the IOMMU.
- 32-entry private TLB per accelerator
- 512-entry shared TLB (the utilization wall limits the number of simultaneously powered accelerators, so a modest shared TLB suffices)

Chart: performance relative to ideal address translation for the IOMMU, private TLBs, and the two-level TLB. The two-level TLB still achieves only about half of ideal performance => we need to improve the page walker design.
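Functionally, the lookup path can be sketched as a toy LRU model with the slide's sizes (the real design is hardware and also models latency, which this sketch ignores):

```python
from collections import OrderedDict

class TLB:
    """Toy LRU TLB: maps virtual page numbers (VPNs) to physical ones."""
    def __init__(self, entries):
        self.entries, self.map = entries, OrderedDict()
    def lookup(self, vpn):
        if vpn in self.map:
            self.map.move_to_end(vpn)      # refresh LRU position
            return self.map[vpn]
        return None
    def fill(self, vpn, ppn):
        self.map[vpn] = ppn
        self.map.move_to_end(vpn)
        if len(self.map) > self.entries:   # evict least recently used
            self.map.popitem(last=False)

class TwoLevelTLB:
    """32-entry private TLB per accelerator, backed by a 512-entry shared
    TLB; misses in both levels fall through to a page walk (here a dict)."""
    def __init__(self, n_accels, page_table):
        self.private = [TLB(32) for _ in range(n_accels)]
        self.shared = TLB(512)
        self.page_table = page_table
    def translate(self, accel_id, vpn):
        ppn = self.private[accel_id].lookup(vpn)
        if ppn is None:
            ppn = self.shared.lookup(vpn)
            if ppn is None:
                ppn = self.page_table[vpn]   # page walk
                self.shared.fill(vpn, ppn)
            self.private[accel_id].fill(vpn, ppn)
        return ppn
```

The shared level is what captures the tiling pattern from the previous slide: once one accelerator walks a page, its three neighbors hit in the shared TLB instead of walking again.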

SLIDE 24

Page Walker Design Alternatives

#1 Improve the IOMMU design to reduce page walk latency
- Requires a more complex IOMMU, e.g., a GPU MMU with parallel page walkers [Power, HPCA'14]

#2 Leverage the host core MMU that launches the accelerators
- Very simple and efficient, since the host core already has an MMU cache and a data cache
- A 64-bit virtual address is translated by a 4-level page walk (L4, L3, L2, L1 indices plus the page offset), starting from the page table base address in CR3; fetching one data cache line of page table entries effectively prefetches translations for 8 consecutive pages
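For concreteness, the standard x86-64 split of a virtual address into four 9-bit page table indices plus a 12-bit page offset can be sketched as follows; it also shows why one 64-byte cache line of 8-byte PTEs covers 8 neighboring pages:

```python
def walk_indices(vaddr):
    """Split a 48-bit virtual address into [L4, L3, L2, L1] page table
    indices (9 bits each) and the 12-bit page offset. Each index selects
    one entry at its level, so a full walk costs 4 memory accesses."""
    offset = vaddr & 0xFFF
    idx = [(vaddr >> (12 + 9 * lvl)) & 0x1FF for lvl in (3, 2, 1, 0)]
    return idx, offset

# 8 consecutive 4 KB pages differ only in the low 3 bits of the L1 index,
# so their 8-byte PTEs sit in the same 64-byte cache line.
lines = {walk_indices(p << 12)[0][3] // 8 for p in range(8)}
print(lines)  # a single cache line
```

This is the property the host page walk exploits: the host core's data cache keeps that PTE line resident, turning the walks for a bulk transfer of consecutive pages into cache hits.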

SLIDE 25

Final Proposal: Two-Level TLB + Host Page Walk [HPCA'17 Best Paper Nominee]

Accelerator TLB misses traverse the shared TLB and are then serviced by the host core's MMU, which performs the page walk.

Chart: performance relative to ideal address translation for the IOMMU, private TLB, two-level TLB, and two-level TLB + host page walk designs, across the medical imaging, commercial, vision, and navigation benchmarks. On average: 7.6x speedup over the naïve IOMMU design, with only a 6.4% gap to ideal translation.

SLIDE 26

Datacenter-Level: Deploying FPGA Accelerators at Cloud Scale

SLIDE 27

Deploying Accelerators in Datacenters

Accelerator designers (e.g., FPGA), cloud service providers, and big data application developers (e.g., Spark) each face questions: How to install my accelerators? How to acquire accelerator resources? How to program with your accelerators?

Programming challenges:
- Java/Scala vs. OpenCL/C/C++
- Explicit accelerator sharing by multiple threads & apps

Performance challenges:
- JVM-to-accelerator communication overhead
- FPGA reconfiguration overhead
SLIDE 28

Blaze Proposal: Accelerator-as-a-Service [ACM SOCC'16, C-FAR Best Demo Award 3/49]

Blaze builds on YARN (RM: Resource Manager, NM: Node Manager, AM: Application Master) and adds:
- GAM (Global Accelerator Manager): accelerator-centric scheduling
- NAM (Node Accelerator Manager): local accelerator service on each FPGA node

Big data applications (e.g., Spark programs) use Blaze programming APIs. Blaze works with Apache Spark and YARN; open source: https://github.com/UCLA-VAST/blaze

SLIDE 29

Blaze Programming Overview

Flow: a big data application (e.g., a Spark program) submits ACC labels to the Global ACC Manager and receives containers; the Node ACC Manager uses the registered ACC info to invoke the accelerator (FPGA, GPU, ACCx) with the input data and return the output data.

Register accelerators:
- APIs to add an accelerator service to the corresponding nodes

Request accelerators:
- APIs to invoke accelerators through an acc_id
- The GAM allocates the corresponding nodes to applications

SLIDE 30

Transparent and Efficient Accelerator Sharing

An application scheduler and a task scheduler manage per-application task queues and per-platform queues (e.g., FPGA1, FPGA2).

#1 Overlapping (pipelining) computation and communication from multiple threads
#2 Data caching in FPGA device memory
#3 Delayed scheduling: tasks with the same logic are scheduled to the same FPGA to avoid reprogramming
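The idea behind #3 can be sketched as a small policy function in the spirit of Blaze's delayed scheduling; the names and the wait threshold are illustrative assumptions, not Blaze's actual API:

```python
def schedule(task_logic, fpgas, waited, max_wait=3):
    """Delayed scheduling sketch: prefer an FPGA already programmed with
    task_logic; otherwise let the task wait up to max_wait rounds before
    paying the reprogramming cost. Returns (fpga_id or None, reprogrammed?).
    fpgas maps fpga_id -> currently loaded accelerator logic."""
    for fid, loaded in fpgas.items():
        if loaded == task_logic:
            return fid, False            # same logic already loaded: run now
    if waited < max_wait:
        return None, False               # delay the task, hope for a match
    fid = next(iter(fpgas))              # waited long enough: reprogram one
    fpgas[fid] = task_logic
    return fid, True
```

The trade-off is the same one the slide describes: a short delay is usually cheaper than an FPGA reconfiguration, so matching tasks to already-programmed FPGAs keeps throughput high under mixed workloads.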

SLIDE 31

CDSC FPGA-Accelerated Cluster

A 22-node cluster with FPGA-based accelerators: 1 master/driver, 20 workers, and 1 file server, connected by a 10GbE switch.

Each node:
1. Two Xeon processors
2. One FPGA PCIe card (Alpha Data)
3. 64 GB RAM
4. 10GbE NIC

Alpha Data board:
1. Virtex-7 FPGA
2. 16 GB on-board RAM

Software:
- Spark: computation framework (in-memory MapReduce system)
- HDFS: distributed storage framework

SLIDE 32

Programming Efforts Reduction with Blaze

Measured in lines of code (LOC) reduction for accelerator management:

Application                                  LOC Reduction
Logistic Regression                          325
K-Means                                      364
Genome Sequence Alignment [HotCloud'16]      896
Genome Compression                           360

(The last two are computational genomics workloads.)

SLIDE 33

Performance of Single-Accelerator with Blaze

Chart: speedup of FPGA tasks over CPU (up to 4.3x) for Logistic Regression, K-Means, Genome Sequence Alignment, and Genome Compression, comparing naive integration, pipelining, and pipelining + caching.

With Blaze, a server with an FPGA can replace 1.7 to 4.3 CPU servers while providing the same throughput.

SLIDE 34

Performance of Multi-Accelerator Scheduling with Blaze

Chart: normalized throughput versus the ratio of LR in mixed Logistic Regression & K-Means workloads (1.0 down to 0.2), comparing the theoretical optimum, static partitioning (half the nodes for LR, half for KM), CPU-style sharing (the default CPU scheduling policy), and Blaze-GAM (accelerator-centric delayed scheduling).

Static partitioning and CPU-style sharing cannot handle dynamic workload distributions; Blaze-GAM performs well in most cases.

SLIDE 35

Summary So Far

Great promise of accelerator-rich architectures and systems:
- Orders-of-magnitude performance and energy gains from customized chips
- Several-fold consolidation of datacenter size with commodity FPGAs

My contributions to chip-level accelerator-rich architectures:
- Developed the open-source ARA simulator PARADE
- Analyzed the sources of performance gains of customized accelerators
- Proposed an efficient and unified address translation scheme for ARAs

My contributions to datacenter-level accelerator deployment:
- Proposed accelerator-as-a-service in the cloud
- Contributed the open-source Blaze system

Lots of opportunities remain to be explored.

SLIDE 36

When Internet-of-Things (IoT) Marries Accelerators

- IoT devices are very sensitive to power/energy consumption
- The IoT cloud handles big data for real-time analytics
- Customizable chips and customized datacenters address a trillions-of-dollars market
- Communication costs more energy than computation in IoT, especially after acceleration => Communication-Efficient Accelerator-Rich IoT (CearIoT)

SLIDE 37

Communication-Efficient Accelerator-Rich IoT (CearIoT)

- IoT devices: local low-power accelerators to preprocess data (e.g., filtering, compression)
- Regional edge devices: simple processing & data aggregation (e.g., genome to variants, image to neural bits, request aggregation)
- Cloud: large-scale data processing with customized datacenters; near-memory/storage computing for big data

Research needs: #1 architecture support, #2 programming support, #3 runtime support, #4 security support. Lots of opportunities…