Towards Accelerator-Rich Architectures and Systems
Zhenman Fang, Postdoc
Computer Science Department, UCLA Center for Domain-Specific Computing Center for Future Architectures Research https://sites.google.com/site/fangzhenman/
Specialized Accelerators,
e.g., audio, video, face, imaging, DSP, …
[Die photo of Apple A8 SoC, with the CPU and GPU blocks highlighted; www.anandtech.com/show/8562/chipworks-a8]
The Trend of Accelerator-Rich Chips
Fixed-function accelerators (ASICs: Application-Specific Integrated Circuits) are used instead of general-purpose processors
[Chart: Increasing # of Accelerators in Apple SoCs (estimated), from the A4 (2010) through the A10 (2016); y-axis: # of IP blocks, from 5 to 40; estimates from Maltiel Consulting and Harvard [Shao, IEEE Micro'15]]
Cloud service providers begin to deploy FPGAs in their datacenters
The Trend of Accelerator-Rich Cloud
2x throughput improvement! [Putnam, ISCA'14]
Field-Programmable Gate Array (FPGA) accelerators:
✓ Reconfigurable commodity hardware
✓ Energy-efficient: a high-end board consumes only ~25W
Accelerators are becoming 1st-class citizens
§ Intel expectation: 30% of datacenter nodes with FPGAs by 2020, after the $16.7 billion acquisition of Altera
Post-Moore Era: Potential for Customized Accelerators
[Chart: energy efficiency of ASICs and FPGAs versus general-purpose processors. Source: Bob Broderson, Berkeley Wireless group]
Accelerators promise 10x to 1000x gains in performance per watt by trading off flexibility for performance!
Challenges in Making Accelerator-Rich Architectures and Systems Mainstream
“Extended” Amdahl’s law:

overall_speedup = 1 / ( kernel% / acc_speedup + (1 - kernel%) + integration )

The three terms correspond to the accelerated kernel, the remaining CPU portion, and the integration overhead.

How to characterize and accelerate killer applications? How to efficiently integrate accelerators into future chips?
§ E.g., a naïve integration only achieves 12% of ideal performance [HPCA'17]
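The extended Amdahl's law above can be sketched numerically. The snippet below is my own illustration, treating the integration overhead as an extra fraction of the original runtime; it shows how quickly that overhead erodes the accelerator's benefit:

```python
def overall_speedup(kernel_frac, acc_speedup, integration_overhead):
    """Extended Amdahl's law: accelerated kernel, remaining CPU portion,
    and a fixed integration overhead, all as fractions of the original
    runtime."""
    return 1.0 / (kernel_frac / acc_speedup
                  + (1.0 - kernel_frac)
                  + integration_overhead)

# With no integration overhead, a 90% kernel at 100x acceleration:
print(round(overall_speedup(0.9, 100, 0.0), 1))   # → 9.2
# The same kernel when integration overhead rivals the residual CPU time:
print(round(overall_speedup(0.9, 100, 0.6), 2))   # → 1.41
```

Even a modest integration overhead can collapse a two-order-of-magnitude kernel speedup to almost nothing, which is why the naive integration above reaches only 12% of ideal.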
How to deploy commodity accelerators in big data systems?
§ E.g., a naïve integration may lead to 1000x slowdown [HotCloud'16]
How to program such architectures and systems?
Overview of My Research
[Diagram: four research dimensions: (1) application drivers, (2) accelerator-rich architectures, (3) accelerator-rich systems, (4) compiler support]
Dimension #1: Application Drivers
Image processing [ISPASS'11]:
✓ Analysis and combination of task, pipeline, and data parallelism
✓ 13x speedup on a 16-core CPU; 46x speedup on a GPU
Deep learning [ICCAD'16]:
✓ Caffeine: an FPGA engine for Caffe
✓ 1.46 TOPS for the 8-bit convolutional layers; 100x speedup for the FCN layers; 5.7x energy savings over a GPU
Genomics [D&T'17]:
✓ 2.6x speedup for in-memory genome sort (Samtools)
✓ Record 9.6 GB/s throughput for genome compression, a 50x speedup over zlib
How do accelerators achieve such speedup?
Out[m][r][c] = Σ_{n=0}^{N} Σ_{i=0}^{K1} Σ_{j=0}^{K2} W[m][n][i][j] * In[n][S1*r+i][S2*c+j]
E.g., a convolutional accelerator (kernel: convolutional matrix multiplication):
[Diagram: weights and inputs streamed from DRAM into on-chip buffers feeding a multiply-accumulate tree that produces the outputs]
#1: Caching
#2: Customized pipeline
#3: Parallelization
#4: Double buffering
#5: DRAM re-organization
#6: Precision customization
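As a reference point, the convolution equation above can be written as a naive software loop nest. This is a pure-Python sketch of the kernel before any of the hardware optimizations are applied:

```python
def conv_reference(In, W, S1, S2):
    """Direct implementation of the slide's loop nest:
    Out[m][r][c] = sum over n, i, j of
        W[m][n][i][j] * In[n][S1*r + i][S2*c + j]
    In: N x H x W input feature maps (nested lists)
    W:  M x N x K1 x K2 weights; S1, S2: strides."""
    M, N = len(W), len(W[0])
    K1, K2 = len(W[0][0]), len(W[0][0][0])
    R = (len(In[0]) - K1) // S1 + 1       # output rows
    C = (len(In[0][0]) - K2) // S2 + 1    # output columns
    Out = [[[0.0] * C for _ in range(R)] for _ in range(M)]
    for m in range(M):
        for r in range(R):
            for c in range(C):
                for n in range(N):
                    for i in range(K1):
                        for j in range(K2):
                            Out[m][r][c] += W[m][n][i][j] * In[n][S1*r + i][S2*c + j]
    return Out

ones3 = [[[1.0] * 3] * 3]      # one 3x3 input channel
w = [[[[1.0] * 2] * 2]]        # one 2x2 filter (M = N = 1)
print(conv_reference(ones3, w, 1, 1))   # → [[[4.0, 4.0], [4.0, 4.0]]]
```

The listed optimizations restructure exactly this nest: caching and double buffering hide the DRAM traffic for In and W, while pipelining and parallelization unroll the inner loops into the multiply-accumulate tree.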
[Chart: throughput of successive designs ranging from 0.005 to 100 GFLOPS, with the final design reaching 1.46 TOPS]
✓ Programmed in Xilinx High-Level Synthesis (HLS)
✓ Results collected on an Alpha Data PCIe-7v3 FPGA board
Dimension #2: Accelerator-Rich Architectures (ARA)
Overview of Accelerator-Rich Architecture (GAM: Global Accelerator Manager; SPM: ScratchPad Memory; plus ISA extensions)
[Diagram: cores C1..Cm with private L1 caches and accelerators Acc1..Accn, each accelerator with its own SPM and DMA engine, connected by a customizable network-on-chip to the GAM, a shared TLB, the shared last-level cache (LLC), and the DRAM controller]
ARA Modeling:
✓ PARADE simulator: gem5 + HLS [ICCAD'15]
✓ Fast ARAPrototyper flow
Multicore Modeling:
✓ Transformer simulator [DAC'12, LCTES'12]
ARA Optimization:
✓ Sources of accelerator gains [FCCM'16]
✓ CPU-Acc co-design: address translation for a unified memory space, 7.6x speedup, 6.4% gap to ideal [HPCA'17 best paper nominee]
✓ AIM: near-memory acceleration gives another 4x speedup [Memsys'17]
More information in the ISCA'15 & MICRO'16 tutorials: http://accelerator.eecs.harvard.edu/micro16tutorial/
PARADE is open source: http://vast.cs.ucla.edu/software/parade-ara-simulator
Dimension #3: Accelerator-Rich Systems
Three parties: accelerator designers (e.g., FPGA), cloud service providers with an accelerator-enabled cloud, and big data application developers (e.g., Spark)
§ Easy accelerator registration into the cloud
§ Easy and efficient accelerator invocation and sharing
Accelerator-as-a-service: the Blaze prototype shows that 1 server with an FPGA ~= 3 CPU servers [HotCloud'16, ACM SOCC'16]
Blaze works with Spark and YARN and is open source: https://github.com/UCLA-VAST/blaze
CPU-FPGA platform choice [DAC'16]:
1) mainstream PCIe, or 2) coherent PCIe (CAPI), or 3) Intel-Altera HARP (coherent, one package)
Dimension #4: Compiler Support
Memory system improvement: a source-to-source compiler for coordinated data prefetching achieves 1.5x speedup on the Xeon Phi many-core processor [ICS'14, TACO'15, ongoing]
Future work: compiler support for accelerator-rich architectures & systems
Overview of My Research
§ Application drivers: image processing [ISPASS'11], deep learning [ICCAD'16], genomics [D&T'17]
§ Accelerator-rich architectures: PARADE [ICCAD'15], ARAPrototyper [arXiv'16], Transformer [DAC'12, LCTES'12]; sources of gains [FCCM'16]; CPU-Acc address translation [HPCA'17 best paper nominee]; near-memory acceleration [Memsys'17]; tutorials [ISCA'15 & MICRO'16]
§ Accelerator-rich systems: AaaS with Blaze on FPGAs [HotCloud'16, ACM SOCC'16]; CPU-FPGA platform choice: PCIe or QPI? [DAC'16]
§ Compiler support: memory system improvement [ICS'14, TACO'15, ongoing]
§ Tool: system-level automation
Chip-Level CPU-Accelerator Co-design:
Address Translation for Unified Memory Space
[HPCA'17 Best Paper Nominee]
Better programmability and performance
Virtual Memory and Address Translation 101
Virtual memory and its benefits:
§ Shared memory for multiple processes
§ Memory isolation for security
§ Conceptually more memory than is physically available
[Diagram: core with MMU and TLB translating per-process virtual memory to physical memory through the page table]
Memory Management Unit (MMU): performs virtual-to-physical address translation
Translation Lookaside Buffer (TLB): caches address translation results
Page table: virtual-to-physical address mappings at page granularity
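The MMU/TLB/page-table interplay can be sketched in a few lines. This is an illustrative model with 4 KB pages and a simple dict-based TLB, not any real hardware's logic:

```python
PAGE_SIZE = 4096   # 4 KB pages, as on x86-64

def translate(vaddr, page_table, tlb):
    """Translate a virtual address at page granularity: split it into a
    virtual page number (VPN) and an offset, then map VPN -> physical
    frame number (PFN) via the TLB or, on a miss, the page table."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn in tlb:                    # TLB hit: no page walk needed
        pfn = tlb[vpn]
    else:                             # TLB miss: walk the page table
        pfn = page_table[vpn]
        tlb[vpn] = pfn                # cache the translation
    return pfn * PAGE_SIZE + offset

page_table = {0: 8, 1: 3}             # per-process VPN -> PFN mapping
tlb = {}
print(hex(translate(0x1ABC, page_table, tlb)))   # → 0x3abc
print(tlb)                                       # → {1: 3}
```

The whole point of the TLB is that the second access to the same page skips the page-table lookup entirely; the next slides show why that matters so much for accelerators.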
Accelerator-Rich Architecture (ARA)
[Diagram: ARA with cores (each with an MMU and TLB) and accelerators (each with a scratchpad, DMA engine, and datapath) on an interconnect, reaching main memory through an IOMMU with an IOTLB]
#1 Inefficient TLB support: TLBs are not specialized to provide low latency and capture accelerators' page locality
#2 High page walk latency: on an IOTLB miss, 4 main-memory accesses are required to walk the page table
Inefficiency in Today's ARA Address Translation
Today's ARAs perform address translation through an IOMMU with a small IOTLB (e.g., 32 entries)
[Chart: performance relative to ideal address translation with the IOMMU, across medical imaging (Deblur, Denoise, Regist., Segment.), commercial (Black., Stream., Swapt.), vision (DispMap, LPCIP), and navigation (EKFSLAM, RobLoc) benchmarks, plus gmean]
IOMMU only achieves 12% performance of ideal address translation
Must provide efficient address translation support
[Chart: gmean performance relative to ideal address translation as translation latency grows from 1 to 1024 cycles]
Accelerator Performance Is Highly Sensitive to Address Translation Latency
Opportunities for relatively simple TLB and page walker designs
Characteristic #1: Regular Bulk Transfer
[Figure: TLB miss behavior of the BlackScholes benchmark, showing accesses sweeping consecutive pages]
A shared TLB can be very helpful
Characteristic #2: Impact of Data Tiling – Breaking a Page into Multiple Accelerators
Original: a 32 × 32 × 32 data array; rectangular tiling yields 16 × 16 × 16 tiles
[Diagram: the array's pages 0 to 31, with tile boundaries cutting across page boundaries]
Each tile is mapped to a different accelerator for parallel processing, but one page ends up split across 4 accelerators!
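The 4-accelerators-per-page arithmetic can be checked directly. The sketch below assumes 4-byte elements and a row-major layout (my assumptions; the slide does not state them):

```python
# 32x32x32 array of 4-byte elements, row-major; 16x16x16 tiles,
# one tile per accelerator.
ELEM = 4
DIM, TILE, PAGE = 32, 16, 4096

def page_of(x, y, z):
    addr = ((x * DIM + y) * DIM + z) * ELEM   # row-major flat address
    return addr // PAGE

def accel_of(x, y, z):
    # 2x2x2 = 8 tiles, each handled by a different accelerator
    return (x // TILE, y // TILE, z // TILE)

# Count how many distinct accelerators touch page 0
accels = {accel_of(x, y, z)
          for x in range(DIM) for y in range(DIM) for z in range(DIM)
          if page_of(x, y, z) == 0}
print(len(accels))   # → 4
```

Under these assumptions one x-plane of 32 × 32 elements is exactly one 4 KB page, and tiling splits that plane into four 16 × 16 quadrants, so four accelerators contend for the same translation: exactly the behavior a shared TLB exploits.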
Our Two-Level TLB Design
[Diagram: each accelerator has a private TLB, backed by a shared TLB that forwards misses to the IOMMU]
✓ 32-entry private TLB per accelerator (the utilization wall limits the number of simultaneously powered accelerators)
✓ 512-entry shared TLB
[Chart: performance relative to ideal address translation for the IOMMU, private TLB, and two-level TLB configurations across the benchmarks]
Still only achieves half the ideal performance => need to improve the page walker design
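The two-level lookup can be sketched as follows. This is an illustrative model (LruTlb and translate are my own names); the private TLB is shrunk to 1 entry so shared-TLB hits show up in a short trace, whereas the proposed design uses 32-entry private and 512-entry shared TLBs:

```python
from collections import OrderedDict

class LruTlb:
    """Fixed-capacity TLB with LRU replacement, keyed by VPN."""
    def __init__(self, entries):
        self.entries, self.map = entries, OrderedDict()

    def lookup(self, vpn):
        if vpn in self.map:
            self.map.move_to_end(vpn)       # refresh LRU position
            return True
        return False

    def fill(self, vpn):
        if vpn not in self.map and len(self.map) >= self.entries:
            self.map.popitem(last=False)    # evict least-recently used
        self.map[vpn] = None

def translate(vpn, private, shared, stats):
    """Private TLB first, then the shared TLB, then a (counted) page walk."""
    if private.lookup(vpn):
        stats['private'] += 1
    elif shared.lookup(vpn):
        stats['shared'] += 1
        private.fill(vpn)                   # promote into the private TLB
    else:
        stats['walk'] += 1                  # falls through to the page walker
        shared.fill(vpn)
        private.fill(vpn)

stats = {'private': 0, 'shared': 0, 'walk': 0}
priv, shar = LruTlb(1), LruTlb(512)         # tiny private TLB for the demo
for vpn in [0, 1, 0, 1, 0]:
    translate(vpn, priv, shar, stats)
print(stats)   # → {'private': 0, 'shared': 3, 'walk': 2}
```

The trace alternates between two pages that thrash the tiny private TLB, yet the shared TLB absorbs every re-reference, mirroring the tiling case where several accelerators touch the same page.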
#1 Improve the IOMMU design to reduce page walk latency
§ Need to design a more complex IOMMU, e.g., GPU MMU with parallel page walker [Power, HPCA'14]
#2 Leverage the MMU of the host core that launches the accelerators
§ Very simple and efficient, since the host core already has an MMU cache & data cache
Page Walker Design Alternatives
[Diagram: 4-level page walk of a 64-bit virtual address: the L4, L3, L2, and L1 indices plus the page offset, starting from the page table base address in CR3, with an MMU cache accelerating the upper levels; one data cache line holds page-table entries for 8 consecutive pages, enabling prefetch]
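The index arithmetic of the 4-level walk, and the 8-consecutive-pages prefetch opportunity, can be sketched as below (standard x86-64 layout; walk_indices is my own helper):

```python
# A 48-bit virtual address splits into four 9-bit page-table indices
# (L4..L1, the walk starting from the base address in CR3) plus a
# 12-bit page offset.
def walk_indices(vaddr):
    offset = vaddr & 0xFFF
    idx = [(vaddr >> (12 + 9 * lvl)) & 0x1FF for lvl in (3, 2, 1, 0)]
    return idx, offset                       # [L4, L3, L2, L1], offset

v = (3 << 39) | (5 << 30) | (7 << 21) | (9 << 12) | 0x123
print(walk_indices(v))                       # → ([3, 5, 7, 9], 291)

# Each page-table entry is 8 bytes, so one 64-byte data cache line read
# during the walk holds the L1 entries for 8 consecutive virtual pages;
# this is why prefetching translations for bulk transfers is cheap.
print(64 // 8)                               # → 8
```

Because accelerator transfers sweep consecutive pages (characteristic #1), a single cache-line fetch on the host core's walk effectively pre-translates the next seven pages for free.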
Final Proposal: Two-Level TLB + Host Page Walk
[Diagram: the accelerators' two-level TLB forwards misses to the host core's MMU, which performs the page walks]
[Chart: performance relative to ideal address translation for the IOMMU, private TLB, two-level TLB, and two-level TLB + host page walk configurations across the benchmarks]
On average: 7.6x speedup over the naïve IOMMU design, with only a 6.4% gap from ideal translation
[HPCA'17 Best Paper Nominee]
Datacenter-Level: Deploying FPGA Accelerators at Cloud Scale
Deploying Accelerators in Datacenters
Accelerator designers (e.g., FPGA), cloud service providers, and big data application developers (e.g., Spark) each face questions: How do I install my accelerators? How do I acquire accelerator resources? How do I program with your accelerators?
Programming challenges:
✓ Java/Scala vs. OpenCL/C/C++
✓ Explicit accelerator sharing by multiple threads & apps
Performance challenges:
✓ JVM-to-accelerator communication overhead
✓ FPGA reconfiguration
[Diagram: a client submits to the YARN Resource Manager; Application Masters run in containers on nodes managed by Node Managers; the GAM performs accelerator-centric scheduling, and NAMs provide the local accelerator service for the FPGAs]
GAM: Global Accelerator Manager (accelerator-centric scheduling)
NAM: Node Accelerator Manager (local accelerator service)
RM: Resource Manager; NM: Node Manager; AM: Application Master
Blaze works with Apache Spark and YARN, Open source link: https://github.com/UCLA-VAST/blaze Big data applications, e.g., Spark programs
Blaze Proposal: Accelerator-as-a-Service
[ACM SOCC'16, C-FAR Best Demo Award 3/49]
Blaze Programming Overview
[Diagram: big data applications (e.g., Spark programs) send ACC labels to the Global ACC Manager, which returns container allocations; each Node ACC Manager holds the ACC info and invokes the registered accelerators (FPGA, GPU, ACCx) with the input and output data]
Register Accelerators
§ APIs to add accelerator service to corresponding nodes
Request Accelerators
§ APIs to invoke accelerators through an acc_id
§ The GAM allocates corresponding nodes to applications
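The register/request flow might look like the following. This is a hypothetical, heavily simplified Python sketch; Blaze's real APIs are in Scala/C++, and none of these names (AccService, register_acc, request) come from it:

```python
class AccService:
    """Toy accelerator-as-a-service registry and dispatcher."""
    def __init__(self):
        self.registry = {}                      # acc_id -> list of nodes

    def register_acc(self, acc_id, nodes):
        """Registration side: add an accelerator service to some nodes."""
        self.registry[acc_id] = list(nodes)

    def request(self, acc_id, data, cpu_fallback):
        """Application side: invoke by acc_id; if no node offers the
        accelerator, transparently fall back to the CPU implementation."""
        nodes = self.registry.get(acc_id)
        if not nodes:
            return [cpu_fallback(x) for x in data]
        node = nodes[0]                         # a GAM would pick the node
        return [self._run_on_acc(node, x) for x in data]

    def _run_on_acc(self, node, x):
        return x * x                            # stand-in for an FPGA kernel

square = lambda x: x * x                        # CPU version of the kernel
svc = AccService()
print(svc.request('sq', [1, 2, 3], square))     # → [1, 4, 9] (CPU fallback)
svc.register_acc('sq', ['node0'])
print(svc.request('sq', [1, 2, 3], square))     # → [1, 4, 9] (on 'node0')
```

The key property illustrated is transparency: the application calls the same request path whether or not an accelerator is registered, which is what lets the framework fall back to CPUs without code changes.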
Transparent and Efficient Accelerator Sharing
[Diagram: an application scheduler with per-application task queues feeds a task scheduler with per-platform queues for FPGA1 and FPGA2]
#1 Overlapping (pipelining) computation and communication from multiple threads
#2 Data caching on FPGA device memory
#3 Delayed scheduling: tasks with the same logical function are scheduled to the same FPGA to avoid reprogramming
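The payoff of policy #3 can be illustrated with a toy cost model (REPROGRAM_COST, RUN_COST, and the task mix below are made-up numbers; only the relative behavior matters):

```python
import itertools

REPROGRAM_COST, RUN_COST = 10, 1   # made-up relative costs

def run(tasks, fpgas, pick):
    """Charge a reprogram whenever the chosen FPGA holds a different acc."""
    loaded, cost = {f: None for f in fpgas}, 0
    for acc in tasks:
        f = pick(acc, loaded)
        if loaded[f] != acc:
            cost += REPROGRAM_COST
            loaded[f] = acc
        cost += RUN_COST
    return cost

fpgas = ['fpga1', 'fpga2']
tasks = ['LR', 'LR', 'KM', 'KM', 'LR', 'LR']

rr = itertools.cycle(fpgas)
naive = lambda acc, loaded: next(rr)    # round-robin, ignores what's loaded

def affinity(acc, loaded):
    for f, l in loaded.items():
        if l == acc:
            return f                    # same logical task, same FPGA
    return next((f for f, l in loaded.items() if l is None), fpgas[0])

print(run(tasks, fpgas, naive))         # → 66 (every task reprograms)
print(run(tasks, fpgas, affinity))      # → 26 (only 2 reprograms)
```

With reprogramming an order of magnitude more expensive than running a task, keeping logical tasks pinned to an already-programmed FPGA dominates any load-balancing benefit of spreading them around.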
CDSC FPGA-Accelerated Cluster
A 22-node cluster with FPGA-based accelerators: 20 workers and 1 master/driver, plus 1 file server and 1 10GbE switch; each node hosts an Alpha Data FPGA board, with Spark as the processing framework and HDFS for storage
Programming Effort Reduction with Blaze
Measured as lines of code (LOC) reduction for accelerator management:
  Application                               LOC reduction
  Logistic Regression                       325
  K-Means                                   364
  Genome Sequence Alignment [HotCloud'16]   896
  Genome Compression                        360
[Chart: speedup of the FPGA task over CPU for Logistic Regression, K-Means, Genome Sequence Alignment, and Genome Compression (up to 4.3x), comparing CPU, naive offloading, pipelining, and pipelining + caching]
Performance of Single-Accelerator with Blaze
With Blaze, a server with an FPGA can replace 1.7~4.3 CPU servers, while providing the same throughput
Performance of Multi-Accelerator Scheduling with Blaze
[Chart: normalized throughput vs. the ratio of LR in mixed Logistic Regression & K-Means workloads (1, 0.8, 0.6, 0.5, 0.4, 0.2), comparing the theoretical optimum, static partitioning (half the nodes for LR, half for KM), CPU-style sharing (the default CPU scheduling policy), and Blaze-GAM (accelerator-centric delayed scheduling)]
Static partitioning and CPU-style sharing cannot handle dynamic workload distributions; Blaze-GAM performs well in most cases.
Great promise of accelerator-rich architectures and systems
§ Orders-of-magnitude performance and energy gains from customized chips
§ Several-fold consolidation of datacenter size with commodity FPGAs
My contributions for chip-level Accelerator-Rich Architectures
§ Developed the open-source ARA simulator PARADE § Analyzed sources of performance gains for customized accelerators § Proposed an efficient and unified address translation scheme for ARA
My contributions for datacenter-level accelerator deployment
§ Proposed accelerator-as-a-service in the cloud § Contributed the open-source Blaze system
Lots of opportunities to be explored…
Summary So Far
When Internet-of-Things (IoT) Marries Accelerator
✓ IoT devices are very sensitive to power/energy consumption
✓ The IoT cloud handles big data for real-time analytics
✓ Customizable chips and customized datacenters: a trillions-of-dollars market
Communication costs more energy than computation in IoT, especially after acceleration => Communication-Efficient Accelerator-Rich IoT (CearIoT)
§ IoT devices: local low-power accelerators to preprocess data (e.g., filtering, compression)
§ Regional edge devices: simple processing & data aggregation (e.g., genome to variants, image to neural bits, request aggregation)
§ Cloud: large-scale data processing with customized datacenters; near-memory/storage computing for big data
Communication-Efficient Accelerator-Rich IoT (CearIoT)
#1 Architecture support · #2 Programming support · #3 Runtime support · #4 Security support
Lots of opportunities…