

SLIDE 1

Data Centre Acceleration

Monica Qin Li Aaron Chelvan Sijun Zhu

SLIDE 2

Background

  • 2019:
    ○ Data centre traffic will reach 10.4 zettabytes, an annual growth rate of ~25%
    ○ 83% of traffic will come from the cloud
    ○ 80% of workloads will be processed in the cloud
  • Failure of Dennard scaling
  • Up to 50-80% of the chip may be kept powered down in order to comply with thermal constraints

SLIDE 3

Solution: Application-specific accelerators

  • Can significantly increase the performance of data centres within a fixed power budget
  • Can either be used as a coprocessor or as a complete replacement
  • Studies have shown FPGA-based acceleration achieving:
    ○ 25x better performance per watt
    ○ 50-75x latency improvement

SLIDE 4

Cloud Applications

  • Two types of cloud applications:
    ○ Offline batch processing: high volumes of data, involving several complicated processes
      ■ Large amounts of data are offloaded to the FPGA, minimising the overhead of communication between the FPGA and the processor.
    ○ Online streaming processing: smaller volumes of streaming data, involving simpler processing. Packet processing on the network interface card has the highest computational complexity.
      ■ FPGAs can be used to offload both the NIC and the actual processing of data packets.

SLIDE 5

Issues to Overcome for FPGA-based Accelerators

  • A heterogeneous system increases programming complexity. Main issues:
    ○ Virtualisation and partitioning of FPGAs
    ○ Configuration of FPGAs
    ○ Scheduling of hardware accelerators

SLIDE 6

Frameworks Presented for FPGA Accelerators

SLIDE 7

Virtualised Hardware Accelerators (University of Toronto)

  • Aims to ‘virtualise’ FPGAs and enable them as a cloud resource.
  • The FPGA is split into several reconfigurable regions, with each region viewed as a single resource (a Virtualised FPGA Resource, or VFR).
  • VFRs are offered to users via OpenStack
SLIDE 8

Framework

  • The hardware accelerator is loaded across multiple FPGAs
  • Instead of a single bitstream, a collection of partial bitstreams is passed to the agent

[Diagram: the “VM as resource” and “FPGA as resource” models]

SLIDE 9

Reconfigurable Cloud Computing Environment (Technical University of Dresden)

  • Users implement and execute their own hardware designs on virtual FPGAs.
  • They can allocate either a complete physical FPGA or a portion of one as a vFPGA.
  • A hypervisor has access to a database containing all physical and virtual FPGA devices and their allocation status.

SLIDE 10

FPGAs in Hyperscale Data Centers (IBM Zurich)

  • Users can build their own programmable fabrics of vFPGAs on the cloud
  • They rent only the FPGAs they require
SLIDE 11

Implemented Hardware Accelerators: Ryft ONE

SLIDE 12

Commercial Product - Ryft ONE

  • Simultaneously analyze up to 48TB of batch and streaming data.
  • Can achieve up to 100x speedup while reducing costs by 70%.
  • Functionality includes commonly used tasks, e.g. term frequency and fuzzy search

SLIDE 13

Implemented Hardware Accelerators: Microsoft Catapult v1

SLIDE 14

Board Design

  • Hardware acceleration applied to a group of 1632 servers
  • 1 Altera Stratix V D5 FPGA per server, connected via PCIe
  • FPGAs are interconnected so that resources can be shared - a “reconfigurable fabric”
  • Requirement: no jumper cables (for power or signalling)
    ○ Limiting power draw to under 25W lets the PCIe bus provide all necessary power.
    ○ Limiting it to under 20W keeps the increase in power consumption below 10%.
  • Each FPGA has 8GB of local DRAM, since SRAM is too expensive
  • Industrial-grade components allow the FPGA to operate at up to 100°C
  • Electromagnetic-interference shielding is added to the FPGAs
SLIDE 15

PC to FPGA interface

  • Requirements:
    ○ Low latency (< 10 μs to transfer 16 KB)
    ○ Safe for multithreading
  • Custom PCIe interface with DMA support
  • Low latency: avoid using system calls to transfer data
    ○ 1 input buffer and 1 output buffer in user-level memory
    ○ The FPGA is given base pointers to those buffers
  • Thread safety: divide the buffers into 64 equal sections (sketched below)
    ○ Give each thread exclusive access to 1 or more sections
    ○ Each section is 64 KB
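A minimal C sketch of this thread-safe split, assuming a hypothetical doorbell call to notify the FPGA (the buffer layout follows the slide; the real Catapult driver interface is not shown here):

    #include <stdint.h>
    #include <string.h>

    #define NUM_SLOTS  64           /* one section per thread, as on the slide */
    #define SLOT_BYTES (64 * 1024)  /* each section is 64 KB */

    /* The input buffer lives in user-level memory; the FPGA is handed its
     * base pointer once, so no system call is needed per transfer. */
    static uint8_t in_buf[NUM_SLOTS][SLOT_BYTES];

    /* Hypothetical stand-in for the real driver's doorbell register write. */
    static void fpga_ring_doorbell(int slot, size_t len) { (void)slot; (void)len; }

    /* Each thread owns one slot exclusively, so no locks are needed. */
    static void send_to_fpga(int thread_slot, const void *data, size_t len)
    {
        if (len > SLOT_BYTES) return;          /* caller must chunk larger transfers */
        memcpy(in_buf[thread_slot], data, len);
        fpga_ring_doorbell(thread_slot, len);  /* tell the FPGA the slot is ready */
    }

    int main(void)
    {
        const char msg[] = "query";
        send_to_fpga(0, msg, sizeof msg);      /* thread 0 uses slot 0 */
        return 0;
    }

Because each thread only ever writes to its own 64 KB section, the transfer path needs no locking at all.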

SLIDE 16

Network Design

  • FPGAs are connected together in a network
  • Low latency, high bandwidth
  • 6 x 8 2D torus network topology
    ○ Balances routability with cabling complexity
  • 20Gb/s bidirectional bandwidth at <1 microsecond latency

Source: https://en.wikipedia.org/wiki/Torus_interconnect
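As a small illustration of why a torus keeps routing simple, the four neighbours of any node in a 6 x 8 torus fall out of wrap-around arithmetic (the dimensions are from the slide; the function names and the printed example are ours):

    #include <stdio.h>

    #define ROWS 6
    #define COLS 8

    /* Node id for (row, col) in the 6 x 8 torus, with wrap-around. */
    static int node_id(int r, int c)
    {
        return ((r + ROWS) % ROWS) * COLS + ((c + COLS) % COLS);
    }

    int main(void)
    {
        int r = 0, c = 0; /* even a corner node has four links thanks to wrap-around */
        printf("north %d south %d west %d east %d\n",
               node_id(r - 1, c), node_id(r + 1, c),
               node_id(r, c - 1), node_id(r, c + 1));
        return 0;
    }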

SLIDE 17

Overview of Bing Search

[Flow diagram] Check the front-end cache → on a cache miss, go to the Top-level Aggregator (TLA) → Selection service (finds documents that match the search query) → Ranking service (ranks the documents) → return the search results

SLIDE 18

Search Result Ranking

  • 3 stages:
    ○ Feature Extraction (FE)
    ○ Free Form Expressions (FFE)
    ○ Document Scoring
  • 8 FPGAs are arranged in a chain
  • A queue manager passes documents from the selection service through the chain

SLIDE 19
Search Result Ranking - 1. Feature Extraction

  • Search each document for features related to the search query
  • Assign each feature a score
  • Hardware allows multiple feature-extraction engines to run simultaneously
    ○ Multiple instruction, single data (MISD); see the sketch below
  • Stream Preprocessing FSM: splits the input into control and data signals
  • Feature Gathering Network: groups the features and sends them onwards
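A software analogy of the MISD arrangement, with toy extractors standing in for the real feature engines (everything here is illustrative; in hardware the engines run in parallel over the same document stream):

    #include <stddef.h>
    #include <stdio.h>
    #include <string.h>

    typedef int (*feature_fn)(const char *doc, size_t len);

    /* Two toy extractors standing in for real feature engines. */
    static int doc_length(const char *doc, size_t len) { (void)doc; return (int)len; }
    static int count_a(const char *doc, size_t len)
    {
        int n = 0;
        for (size_t i = 0; i < len; i++) n += (doc[i] == 'a');
        return n;
    }

    /* MISD-style dispatch: every engine applies a different "instruction
     * stream" (feature extractor) to the same document data. This loop is
     * the sequential analogy of the parallel hardware. */
    static void extract_features(const char *doc, size_t len,
                                 feature_fn *engines, int *scores, int n_engines)
    {
        for (int i = 0; i < n_engines; i++)
            scores[i] = engines[i](doc, len);
    }

    int main(void)
    {
        feature_fn engines[] = { doc_length, count_a };
        int scores[2];
        const char doc[] = "data centre acceleration";
        extract_features(doc, strlen(doc), engines, scores, 2);
        printf("%d %d\n", scores[0], scores[1]);
        return 0;
    }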

SLIDE 20
Search Result Ranking - 2. Free Form Expressions

  • Mathematical combinations of features
  • Involves complex math with large floats
  • A custom core was designed for creating FFEs
  • Can fit 60 cores on a single D5 FPGA
  • Characteristics (see the scheduling sketch below):
    1. Each core supports 4 threads
    2. Threads are prioritised based on expected latency
    3. Long-latency operations can be split between multiple FPGAs
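A minimal sketch of latency-based thread selection for one FFE core. The slide only says threads are prioritised by expected latency; the longest-first direction of the comparison is our assumption (so slow expressions start early and finish with the rest of the batch):

    #include <stdio.h>

    #define THREADS_PER_CORE 4  /* each FFE core supports 4 threads */

    typedef struct {
        int ready;            /* thread has an expression pending */
        int expected_latency; /* estimated cycles for that expression */
    } ffe_thread;

    /* Pick the next thread to issue: among ready threads, choose the one
     * with the longest expected latency (assumed priority rule). */
    static int pick_thread(const ffe_thread t[THREADS_PER_CORE])
    {
        int best = -1;
        for (int i = 0; i < THREADS_PER_CORE; i++)
            if (t[i].ready &&
                (best < 0 || t[i].expected_latency > t[best].expected_latency))
                best = i;
        return best; /* -1 when no thread is ready */
    }

    int main(void)
    {
        ffe_thread t[THREADS_PER_CORE] = {
            {1, 120}, {0, 900}, {1, 430}, {1, 75}
        };
        printf("issue thread %d\n", pick_thread(t)); /* thread 2 (430 cycles) */
        return 0;
    }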
SLIDE 21

Search Result Ranking - 3. Document Scoring

[Diagram: features and free form expressions feed a machine learning model, which outputs a floating-point score]

  • Search results are ranked in order of document score.
  • Documents are compressed to 64 KB before being passed to the ranking service (the software implementation does not do this)
    ○ Compression has minimal impact
SLIDE 22

Error Handling and Recovery

  • Health Monitor
    ○ Queries servers to check their status
    ○ If unresponsive: soft boot → hard boot → flag for manual service (escalation sketched below)
    ○ If responsive, returns info about the FPGA:
      ■ PCIe errors
      ■ Errors in inter-FPGA network communication
      ■ DRAM status
      ■ Whether a temperature shutdown occurred
  • Mapping Manager
    ○ Manages the roles of the FPGAs
    ○ Performs reconfigurations
    ○ An FPGA being reconfigured may send garbage data, so a “TX halt” signal is sent to its neighbours, telling them to temporarily ignore any data received
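The health monitor's escalation policy from the slide, written out as a tiny C sketch (the function and type names are ours, not Catapult's):

    #include <stdio.h>

    typedef enum { SOFT_BOOT, HARD_BOOT, MANUAL_SERVICE } action_t;

    /* Escalation from the slide: soft boot first, then hard boot,
     * then flag the machine for manual service. */
    static action_t next_action(int failed_recoveries)
    {
        switch (failed_recoveries) {
        case 0:  return SOFT_BOOT;
        case 1:  return HARD_BOOT;
        default: return MANUAL_SERVICE;
        }
    }

    int main(void)
    {
        for (int fails = 0; fails < 3; fails++)
            printf("after %d failed attempts -> action %d\n",
                   fails, next_action(fails));
        return 0;
    }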

SLIDE 23

Deployment Results

  • Deployed to 34 pods (each pod is a 6 x 8 torus of 48 servers) → 34 × 48 = 1632 servers
  • Increased ranking throughput by 95% at latency similar to the software-only solution
  • The increase in power consumption was below 10%
  • The increase in cost was below 30%
SLIDE 24

A Cloud-Scale Acceleration Architecture: Microsoft v2

SLIDE 25

Overview

  • Work by Microsoft towards the Catapult v2 system
  • Aims to fix various issues that came with the Catapult v1 system
  • Implemented on an Altera Stratix V D5 FPGA board that supports a dual-port 40Gbps LAN connection
  • Tested on 5760 servers with synthetic and mirrored production data
  • Achieved noticeable improvements in:
    ○ Local acceleration
    ○ Network acceleration
    ○ Remote acceleration

SLIDE 26

Issues with Catapult v1 System

1. The secondary network (6 x 8 torus) required expensive and complex cabling.
2. Each FPGA needs full awareness of the physical location of other machines.
3. Failure handling in the torus requires complex re-routing of traffic to neighbouring nodes:
   a. Performance loss
   b. Isolation of nodes
4. Limited scalability for direct communication:
   a. One rack → 48 nodes
5. The FPGA can be used for accelerating applications, but has limited ability to enhance datacenter infrastructure.

SLIDE 27

Proposed Solution (bump-in-the-wire)

  • Couple the FPGA with the network interface
    ○ The FPGA shares the same network topology as the server itself
  • All network traffic is routed through the FPGA
    ○ Allows it to accelerate high-bandwidth network flows
  • The FPGA uses a PCIe connection to the host
    ○ Gives it the capability for local acceleration
  • FPGAs are able to generate and consume their own data packets, and communicate using LTL (Lightweight Transport Layer)
    ○ Every FPGA can reach every other one within a small number of microseconds, even at hyperscale

SLIDE 28

Proposed Solution (bump-in-the-wire)

  • Discrete NIC (Network Interface Card)
    ○ Allows for simple bypassing, rather than wasting FPGA resources on implementing NIC logic.
  • Possible drawbacks of this design:
    ○ A buggy application may cut off the network coming into its server
    ○ A failure in one server does not affect the others
    ○ Servers have automatic health checks; an unresponsive server will be rebooted, so the proper FPGA image (the golden image) will be loaded again.

SLIDE 29

Shell Architecture

  • Role: application logic
  • Shell: I/O and board-specific logic
  • Elastic Router: intra-FPGA communication interface for roles
  • LTL engine: inter-FPGA communication interface for roles

SLIDE 30

Resource Usage

44% of the FPGA is used to support all shell functions, leaving enough space for the role(s) to provide a large speedup.

SLIDE 31

Evaluation

  • Local Acceleration
  • Network Acceleration
  • Remote Acceleration
SLIDE 32

Local Acceleration (Bing Search Page Ranking)

  • Implemented critical functions while supporting the Elastic Router and LTL
    ○ All server communication passes through the FPGA while the FPGA is accelerating
    ○ No interaction between pass-through traffic and acceleration
  • Running on a full production bed consisting of thousands of servers
  • The production latency target can be met even when throughput is increased by 2.25x
    ○ i.e. more than half of the servers can be freed for other uses

SLIDE 33

Local Acceleration (Bing Search Page Ranking)

  • When running with production data, FPGA query latency is observed to be lower and more stable
  • The FPGA executes queries at a latency that never exceeds that of the software datacenter at any load

SLIDE 34

Network Acceleration (encryption/decryption)

  • Encryption and decryption can be done by Intel CPUs; however, reaching a 40Gbps rate for the AES-CBC-128-SHA1 algorithm takes up to 15 cores
  • These cores could instead have been used for generating revenue (see the rough estimate below)
  • The FPGA implementation takes 11us, while Intel’s best number is 4us; however, FPGA processing can be more easily pipelined
  • Intel’s number also does not account for the disturbance the crypto workload causes to the cores’ caches
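As a rough sanity check (our arithmetic, not stated on the slide): 40 Gbps ÷ 15 cores ≈ 2.7 Gbps of AES-CBC-128-SHA1 throughput per core, so each core the FPGA frees recovers roughly that much crypto capacity for revenue-generating work.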
SLIDE 35

Remote Acceleration

  • An unused FPGA can be used by other servers, communicating through LTL (sketched below)
    ○ Since the FPGA is tightly coupled to the network, it can react quickly
    ○ LTL uses ACK/NACK to avoid unnecessary timeouts in the event of re-ordering
    ○ LTL shows very promising performance
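A minimal sketch of the ACK/NACK idea: the receiver ACKs in-order packets and NACKs on a gap, so the sender can retransmit immediately instead of waiting for a timeout. The sequence-number handling is our assumption; the real LTL wire format is not described on the slide.

    #include <stdint.h>
    #include <stdio.h>

    typedef enum { SEND_ACK, SEND_NACK } rx_reply;

    /* Receiver side: ACK an in-order packet, NACK anything else so the
     * sender retransmits right away rather than waiting for a timeout. */
    static rx_reply on_packet(uint32_t seq, uint32_t *expected)
    {
        if (seq == *expected) {
            (*expected)++;
            return SEND_ACK;
        }
        return SEND_NACK; /* gap or re-ordering detected */
    }

    int main(void)
    {
        uint32_t expected = 0;
        rx_reply a = on_packet(0, &expected); /* ACK */
        rx_reply b = on_packet(2, &expected); /* NACK: packet 1 is missing */
        rx_reply c = on_packet(1, &expected); /* ACK after retransmission */
        printf("%d %d %d\n", a, b, c);
        return 0;
    }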

SLIDE 36

Remote Acceleration

  • Running the same page-ranking algorithm as before
  • Remote acceleration has no major impact on the acceleration speedup
  • Therefore it is a viable choice

SLIDE 37

Remote Acceleration (Over-subscription)

  • Runs a latency-sensitive DNN over the network
  • The FPGA has sufficient throughput to sustain 2-2.5 software clients (each running at multiple times the production rate) before latency spikes

SLIDE 38

HaaS (Hardware as a Service)

  • SaaS (Software) → IaaS (Infrastructure) → HaaS (Hardware)
  • Utilise unused FPGAs to their full potential without affecting the performance of the datacenter
  • The details are out of scope for this paper

SLIDE 39

Comparison of Reconfigurable Accelerators for Cloud Computing

SLIDE 40

System speedup: defined as the speedup achieved for a specific task using the hardware accelerator, including the communication overhead of transferring data to the FPGA and back (see the formula below).

Energy efficiency: the energy consumption of the hardware accelerator compared to executing in software
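Written as a formula (the symbols are ours; the slide gives only the verbal definition), with T_sw the software execution time, T_acc the accelerator execution time, and T_comm the data-transfer overhead:

    \text{System speedup} = \frac{T_{\text{sw}}}{T_{\text{acc}} + T_{\text{comm}}}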

SLIDE 41

Overall, it is shown that hardware accelerators can achieve an order of magnitude better energy efficiency compared to typical server processors.

[Chart: architectures that did not report energy efficiency, grouped by application]

SLIDE 42

Thanks for listening! Questions?