Datacentre Acceleration
Ted Zhang (z5019142) Meisi Li (z5119623) Han Zhao(z5099931) Nattamon Paiboon (z5176930)
Background - Data Centres ❖ Centralised computing and network infrastructure ❖ Cloud applications for storage and computation
❖ End of Moore's Law ❖ Mainstream hardware unable to keep up with growing demand
❖ Lots of designed redundancy in data centres ➢ Peak load handling ❖ Growing attention towards sustainability ➢ Environmental and cost concerns
❖ 25x better performance per watt ❖ 50-75x latency improvement
Two broad categories of cloud applications:
❖ Offline - process large quantities of data with complex computation
➢ Big data, MapReduce
❖ Online - data streaming and delivery
➢ Search engines, video streaming
FPGAs in the Cloud (IBM) ❏ Abstracts portions of FPGAs as a pool of resources ❏ Predefined functions such as encryption and hashing Virtualised Hardware Accelerators (University of Toronto) ❏ Abstracts reconfigurable FPGA accelerators as Virtual Machines ❏ Openstack for resource control and allocation
FPGAs in Hyperscale Data Centres (IBM) ❏ Direct user allocation of FPGA partitions Virtualised FPGA Accelerators (University of Warwick) ❏ Integration within server machines ❏ Usually provides a library of operations that run faster on the FPGA
❖ Performance speedup ➢ Kernel speedup - execution of specific task ➢ System speedup - execution of entire application ❖ Energy efficiency - fraction multiplier
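To make the two speedup metrics concrete, here is a minimal sketch (not from the presentation) relating kernel speedup to system speedup under the usual Amdahl's-law assumption that only a fraction of the application runtime is accelerated; the function name and example numbers are illustrative only.

```python
# Minimal sketch (assumption, not from the slides): relating kernel speedup
# to system speedup with Amdahl's law, where `f` is the fraction of total
# application runtime spent in the accelerated kernel.

def system_speedup(kernel_speedup: float, f: float) -> float:
    """Overall application speedup when only a fraction f of the runtime
    is accelerated by `kernel_speedup` (Amdahl's law)."""
    return 1.0 / ((1.0 - f) + f / kernel_speedup)

# Example: a 10x kernel speedup on a kernel that is 80% of the runtime
# yields only about a 3.6x system speedup.
print(system_speedup(10.0, 0.8))  # ~3.57
```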
Reconfigurable MapReduce Accelerator - University of Athens ❏ V1 - Map done by standard processors, reduce moved to FPGA ❏ V2 - Map moved to FPGA, reconfigurable by HLS
Speedup 4.3x Efficiency 33x
Memcached Acceleration - HP ❏ Two distinct accelerator blocks ❏ Network accelerator ❏ Memcached accelerator ❏ Example of streaming acceleration
Speedup 1x Efficiency 10.9x
Microsoft Catapult - Bing Search Engine ❏ Altera FPGA PCIe board installed inside standard server machines ❏ Aids machine-learning-based page ranking
Speedup 1.95x Efficiency
Communication between the FPGA and the host CPU is a key design concern. The software infrastructure needs to configure FPGAs and handle failures across many systems. Two services are introduced for these tasks: a Mapping Manager, which loads the correct application image onto each FPGA, and a Health Monitor, which is invoked when a machine is suspected of failure.
FPGA reconfiguration may cause instability in the system. Reason: while an FPGA is being reconfigured, its PCIe and inter-FPGA links can carry spurious data that neighbouring nodes may interpret as valid traffic. Solution: neighbouring FPGAs and the host driver are told to ignore (halt) traffic from the node being reconfigured, and the links are re-enabled only once reconfiguration completes.
Used in Bing's ranking engine. Overview: the ranking service is mapped across a group of eight machines, whose FPGAs form the pipeline stages; each query passes through the whole service along the chain:
○ 1 FPGA for feature extraction
○ 2 FPGAs for free-form expressions
○ 1 FPGA for compression
○ 3 FPGAs to hold the machine learning models
○ 1 FPGA as a spare in case of machine failure
○ Feature Extraction (FE) ○ Free Form Expressions (FFE) ○ Document Scoring
Each ranking request contains a compressed document, truncated to 64 KB, together with the search query.
Feature Extraction (FE): many feature-extraction engines run in parallel, all working on the same input stream (MISD computation). The document is streamed past every engine, allowing all of the engines to run simultaneously; the stream is split into control and data messages, and a gathering stage collects the resulting feature and value pairs and forwards them to the next pipeline stage.
Free Form Expressions (FFE): expressions are evaluated on multi-threaded soft processor cores, many of them with long-latency floating point operations.
○ Each core supports 4 threads
○ Threads are prioritised based on expected latency
○ Long-latency operations can be split between multiple FPGAs
Document Scoring: the machine-learned model combines the computed features into a single floating-point score for each document-query pair.
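As a structural summary of the ranking pipeline above, here is a small sketch (an assumption for illustration, not code from Catapult): it only models the per-stage FPGA allocation on one eight-FPGA ring, and all names are hypothetical.

```python
# Minimal structural sketch (assumption, not from the slides): the
# eight-FPGA Catapult v1 ranking ring with the per-stage FPGA counts
# listed above. Class and stage names are hypothetical.
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    fpgas: int  # FPGAs dedicated to this stage in one ring

PIPELINE = [
    Stage("feature_extraction", 1),     # FE
    Stage("free_form_expressions", 2),  # FFE
    Stage("compression", 1),
    Stage("scoring_models", 3),         # machine-learned models
]
SPARE = 1  # one spare FPGA per ring for fail-over

# A request (document truncated to 64 KB, plus the query) visits each stage
# in order along the chain; the last stage returns one floating-point score.
assert sum(s.fpgas for s in PIPELINE) + SPARE == 8
```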
Limitations of Catapult v1: the secondary torus network required rerouting traffic around failed nodes, causing both performance loss and isolation of nodes under certain failure patterns, and the design could not accelerate datacentre infrastructure such as network and storage flows.
This architecture eliminates all of the limitations listed above with a single design. The architecture has been, and is being, deployed in the majority of new servers in Microsoft's production data centers across more than 15 countries and 5 continents.
The FPGA is connected by PCIe to the local host CPU (for local acceleration) and is placed as a bump in the wire between the NIC (network interface card) and the network, using QSFP (Quad Small Form-factor Pluggable) ports.
By enabling the FPGAs to generate and consume their own networking packets independent of the hosts, each and every FPGA in the datacenter can reach every other one.
LTL (Lightweight Transport Layer): a protocol for low-latency communication between pairs of FPGAs. Every host can use remote FPGA resources; FPGAs that are not running work for their host are donated to a global pool.
Datacenter accelerators must be highly manageable, which means having few hardware variations or versions in deployment. They must be highly flexible at the system level, in addition to being programmable, to justify deployment across a hyperscale infrastructure.
Divided into three parts:
1. Local acceleration handles high value scenarios such as search ranking acceleration where every server can benefit from having its own FPGA. 2. Network acceleration can support services such as intrusion detection, deep packet inspection and network encryption. 3. Global acceleration permits accelerators unused by their host servers to be made available for large-scale applications.
An Elastic Router (ER) with virtual channel support allows multiple Roles to access the network, and a Lightweight Transport Layer (LTL) engine enables inter-FPGA communication.
A bed of 5,760 servers containing this accelerator architecture was placed into a production datacenter, with 3,081 of these machines using the FPGA for local compute acceleration. Only two FPGAs had hard failures, and there was a low number of soft errors, all of which were correctable.
Bing Search page ranking: with a single local FPGA, at the target 99th-percentile latency, throughput can be safely increased by 2.25x, which means that fewer than half as many servers would be needed to sustain the target throughput at the required latency.
Intel AES-GCM-128 on Haswell takes 1.26 cycles per byte for encryption and for decryption. Thus, at a 2.4 GHz clock frequency, 40 Gb/s encryption plus decryption consumes five cores. Other standards, such as 256-bit keys or CBC mode, are significantly slower: AES-CBC-128-SHA1 needs fifteen cores to achieve 40 Gb/s full duplex, whereas the FPGA supports full 40 Gb/s encryption and decryption. The worst-case half-duplex FPGA crypto latency for AES-CBC-128-SHA1 is 11 µs for a 1500 B packet, from first flit to first flit; in software, based on the Intel numbers, it is approximately 4 µs. AES-CBC-SHA1 is, however, especially difficult for hardware due to tight dependencies: the AES-CBC implementation processes 33 packets at a time, taking only 128 bits from a single packet once every 33 cycles.
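A quick back-of-the-envelope check of the five-core figure, assuming full duplex means 40 Gb/s encrypted plus 40 Gb/s decrypted at 1.26 cycles per byte each on a 2.4 GHz core:

```python
# Back-of-the-envelope check (assumption: encrypt and decrypt both run at
# the full 40 Gb/s line rate) of the "five cores" figure quoted above.
line_rate_bytes = 40e9 / 8          # 40 Gb/s in bytes per second
cycles_per_byte = 1.26              # Intel AES-GCM-128 on Haswell, each direction
clock_hz = 2.4e9                    # 2.4 GHz core clock

cycles_needed = line_rate_bytes * cycles_per_byte * 2   # encrypt + decrypt
cores = cycles_needed / clock_hz
print(f"{cores:.2f} cores")         # ~5.25, i.e. about five cores
```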
Communication among FPGAs is important in order to treat the acceleration hardware as a global resource and to deploy services that consume more than one FPGA. FPGAs can be used by other servers on the network: when an FPGA is not being used by its local server, it is placed into a shared pool and can be used by any other server that needs additional FPGAs. Two major functions are implemented on the FPGA for this: 1. inter-FPGA communication (LTL) 2. an intra-FPGA router (ER)
LTL is an inter-FPGA network protocol that uses UDP for frame encapsulation and IP for routing packets across the datacenter network.
❖ Lossless traffic classes
➢ avoid packet drops and reorders
❖ FPGAs are tightly coupled to the network
➢ react quickly and efficiently to notifications, and back off when needed to reduce dropped packets
❖ Uses statically allocated, persistent connections
➢ low-latency communication once a connection is established
❖ Uses ACK/NACK-based retransmission schemes
➢ guarantee reliability over the datacenter network
➢ avoid waiting for a timeout when retransmission is needed
❖ Implements Priority Flow Control and a congestion control scheme
➢ to safely insert and remove packets from the network
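To illustrate the ACK/NACK idea above, here is a minimal software sketch; the real LTL is implemented in FPGA logic and its frame format is not given here, so the header layout, message types, and function are hypothetical.

```python
# Minimal software sketch (hypothetical; the real LTL runs in FPGA logic):
# UDP-encapsulated frames carrying a sequence number, with ACK/NACK-driven
# retransmission so the sender never waits for a long timeout on loss.
import socket, struct

HDR = struct.Struct("!BI")   # 1-byte message type, 4-byte sequence number (assumed layout)
DATA, ACK, NACK = 0, 1, 2

def send_reliable(sock: socket.socket, peer, payload: bytes, seq: int) -> None:
    """Send one frame, then retransmit immediately on NACK instead of
    waiting for a timeout; an ACK completes the transfer."""
    sock.sendto(HDR.pack(DATA, seq) + payload, peer)
    while True:
        kind, acked = HDR.unpack(sock.recv(HDR.size))
        if kind == ACK and acked == seq:
            return                                             # receiver got the frame
        if kind == NACK and acked == seq:
            sock.sendto(HDR.pack(DATA, seq) + payload, peer)   # retransmit now
```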
The Elastic Router (ER) is an on-chip, input-buffered crossbar switch designed to support efficient communication between multiple endpoints on an FPGA across multiple virtual channels. It was developed to support both intra- and inter-FPGA communication, the latter via the LTL. The ER supports multiple virtual channels and allows a pool of flits to be shared among the VCs, which reduces the aggregate flit buffering requirement.
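A toy illustration (an assumption, not the ER hardware design) of the shared flit pool: virtual channels draw buffer slots from one common pool rather than each reserving a worst-case private buffer, which is what reduces aggregate buffering.

```python
# Toy illustration (not the actual ER hardware): several virtual channels
# share one pool of flit buffer slots, so total buffering can be smaller
# than giving every VC its own worst-case buffer. Sizes are made up.
from collections import deque

class SharedFlitBuffer:
    def __init__(self, total_slots: int, num_vcs: int):
        self.free_slots = total_slots              # one shared pool of slots
        self.queues = [deque() for _ in range(num_vcs)]

    def enqueue(self, vc: int, flit) -> bool:
        """Accept a flit on a VC only if the shared pool has a free slot."""
        if self.free_slots == 0:
            return False                           # back-pressure this VC
        self.free_slots -= 1
        self.queues[vc].append(flit)
        return True

    def dequeue(self, vc: int):
        """Forward one flit from a VC and return its slot to the pool."""
        flit = self.queues[vc].popleft()
        self.free_slots += 1
        return flit

buf = SharedFlitBuffer(total_slots=8, num_vcs=4)   # 8 shared slots vs 4x8 private
```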
LTL and ER are crucial to allow FPGAs to 1. be organized into multi-FPGA services and 2. be remotely managed for use as a remote accelerator when not in use by their host.
[Slide figure: shell versions and FPGA acceleration resources]
The L2 latency distribution varies significantly due to:
○ physical distance
○ cabling
○ transient background traffic
○ switch internal implementation
6x8 Torus (Catapult v1): expensive and complex to cable and maintain.
○ Traffic must be dynamically rerouted around a faulty FPGA
○ Rerouting costs extra hops and latency
LTL over the datacenter network (Catapult v2): reaches thousands of hosts/FPGAs in a fixed number of hops
○ Spare nodes/FPGAs remain accessible
Results measure the throughput of a single accelerator when accessed remotely; the latency of remote accesses is minimal.
Minimal impact on the host server:
○ Servers can donate their FPGA to the global pool while it is used by a remote service.
○ LTL has bandwidth limiting to prevent the FPGA from exceeding a set limit.
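The slides do not say how LTL enforces the bandwidth limit; one common mechanism is a token bucket, sketched below purely as an illustration (all names and numbers are hypothetical).

```python
# Hypothetical illustration of bandwidth limiting (not how LTL actually
# implements it): a token bucket that only admits a packet onto the
# network when enough byte-tokens have accumulated.
import time

class TokenBucket:
    def __init__(self, rate_bytes_per_s: float, burst_bytes: float):
        self.rate = rate_bytes_per_s      # sustained limit
        self.capacity = burst_bytes       # maximum burst
        self.tokens = burst_bytes
        self.last = time.monotonic()

    def allow(self, packet_bytes: int) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= packet_bytes:
            self.tokens -= packet_bytes   # admit the packet
            return True
        return False                      # hold the packet; limit reached

limiter = TokenBucket(rate_bytes_per_s=1.25e9, burst_bytes=64_000)  # ~10 Gb/s cap
```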
Deploy latency-sensitive Deep Neural Network (DNN) accelerators shared by multiple software clients
Increasing the ratio of software clients to accelerators measures the impact of sharing: when an FPGA reaches its peak throughput, latencies spike. A single FPGA has enough throughput to sustain 2-2.5 software clients before the latency spike.
A concept of grouping FPGAs into service pools, coordinated by:
❖ Resource Manager (RM)
❖ Service Manager (SM)
❖ FPGA Manager (FM)