

SLIDE 1

Data Centre Acceleration

Monica Qin Li Aaron Chelvan Sijun Zhu

SLIDE 2

Background

  • 2019:
    ○ Data centre traffic will reach 10.4 zettabytes, an annual growth rate of ~25%
    ○ 83% of traffic will come from the cloud
    ○ 80% of workloads will be processed in the cloud
  • Failure of Dennard scaling
  • Up to 50-80% of the chip may be kept powered down in order to comply with thermal constraints

SLIDE 3

Solution: Application-specific accelerators

  • Can significantly increase the performance of data centres within a fixed power budget
  • Can either be used as a coprocessor or as a complete replacement
  • Studies have shown FPGA-based acceleration achieving:
    ○ 25x better performance per watt
    ○ 50-75x latency improvement

SLIDE 4

Cloud Applications

  • Two types of cloud applications:
    ○ Offline batch processing: high volumes of data, involving several complicated processes
      ■ Large amounts of data are offloaded to the FPGA, minimising the overhead of communication between the FPGA and the processor.
    ○ Online streaming processing: smaller volumes of streaming data, involving simpler processing. Packet processing on the network interface card has the highest computational complexity.
      ■ FPGAs can be used to offload both the NIC and the actual processing of data packets.

SLIDE 5

Issues to Overcome for FPGA-based Accelerators

  • A heterogeneous system increases programming complexity. Main issues:
    ○ Virtualisation and partitioning of FPGAs
    ○ Configuration of FPGAs
    ○ Scheduling of hardware accelerators

SLIDE 6

Frameworks Presented for FPGA Accelerators

SLIDE 7

Virtualised Hardware Accelerators (University of Toronto)

  • Aims to ‘virtualise’ FPGAs and enable them as a cloud resource.
  • The FPGA is split into several reconfigurable regions, with each region viewed as a single resource (a Virtualised FPGA Resource, or VFR).
  • VFRs are offered to users via OpenStack
SLIDE 8

Framework

  • The hardware accelerator is loaded across multiple FPGAs
  • Instead of a single bitstream, a collection of partial bitstreams is passed to the agent

[Diagram: the “VM as resource” and “FPGA as resource” models]

SLIDE 9

Reconfigurable Cloud Computing Environment (Technical University of Dresden)

  • Users implement and execute their own hardware designs on virtual FPGAs.
  • They can allocate either a complete physical FPGA or a portion of one as a vFPGA.
  • A hypervisor has access to a database containing all physical and virtual FPGA devices and their allocation status.

SLIDE 10

FPGAs in Hyperscale Data Centers (IBM Zurich)

  • Users can build their own programmable fabrics of vFPGAs on the cloud
  • They rent only the FPGAs they require
SLIDE 11

Implemented Hardware Accelerators: Ryft ONE

SLIDE 12

Commercial Product - Ryft ONE

  • Simultaneously analyze up to 48TB of batch and streaming data.
  • Can achieve up to 100x speedup while reducing costs by 70%.
  • Functionality includes commonly used tasks, e.g. term frequency and fuzzy search

SLIDE 13

Implemented Hardware Accelerators: Microsoft Catapult v1

SLIDE 14

Board Design

  • Hardware acceleration applied to a group of 1632 servers
  • 1 Altera Stratix V D5 FPGA per server, connected via PCIe
  • FPGAs are interconnected so that resources can be shared - a “reconfigurable fabric”
  • Requirement: no jumper cables (for power or signalling)
    ○ Limiting power draw to under 25W lets the PCIe bus provide all necessary power.
    ○ Limiting it to under 20W keeps the increase in power consumption below 10%.
  • Each FPGA has 8GB of local DRAM, since SRAM is too expensive
  • Industrial-grade components allow the FPGA to operate at up to 100°C
  • Electromagnetic-interference shielding is added to the FPGAs
SLIDE 15

PC to FPGA interface

  • Requirements:
    ○ Low latency (< 10 μs to transfer 16 KB)
    ○ Safe for multithreading
  • Custom PCIe interface with DMA support
  • Low latency: avoid using system calls to transfer data
    ○ 1 input buffer and 1 output buffer in user-level memory
    ○ The FPGA is given base pointers to those buffers
  • Thread safety: divide the buffers into 64 equal sections (sketched below)
    ○ Give each thread exclusive access to 1 or more sections
    ○ Each section is 64 KB
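A minimal C sketch of this thread-safe split, assuming a hypothetical doorbell call to notify the FPGA (the buffer layout follows the slide; the real Catapult driver interface is not shown here):

    #include <stdint.h>
    #include <string.h>

    #define NUM_SLOTS  64           /* one section per thread, as on the slide */
    #define SLOT_BYTES (64 * 1024)  /* each section is 64 KB */

    /* The input buffer lives in user-level memory; the FPGA is handed its
     * base pointer once, so no system call is needed per transfer. */
    static uint8_t in_buf[NUM_SLOTS][SLOT_BYTES];

    /* Hypothetical stand-in for the real driver's doorbell register write. */
    static void fpga_ring_doorbell(int slot, size_t len) { (void)slot; (void)len; }

    /* Each thread owns one slot exclusively, so no locks are needed. */
    static void send_to_fpga(int thread_slot, const void *data, size_t len)
    {
        if (len > SLOT_BYTES) return;          /* caller must chunk larger transfers */
        memcpy(in_buf[thread_slot], data, len);
        fpga_ring_doorbell(thread_slot, len);  /* tell the FPGA the slot is ready */
    }

    int main(void)
    {
        const char msg[] = "query";
        send_to_fpga(0, msg, sizeof msg);      /* thread 0 uses slot 0 */
        return 0;
    }

Because each thread only ever writes to its own 64 KB section, the transfer path needs no locking at all.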

SLIDE 16

Network Design

  • FPGAs are connected together in a network
  • Low latency, high bandwidth
  • 6 x 8 2D torus network topology
    ○ Balances routability with cabling complexity
  • 20Gb/s bidirectional bandwidth at <1 microsecond latency

Source: https://en.wikipedia.org/wiki/Torus_interconnect
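As a small illustration of why a torus keeps routing simple, the four neighbours of any node in a 6 x 8 torus fall out of wrap-around arithmetic (the dimensions are from the slide; the function names and the printed example are ours):

    #include <stdio.h>

    #define ROWS 6
    #define COLS 8

    /* Node id for (row, col) in the 6 x 8 torus, with wrap-around. */
    static int node_id(int r, int c)
    {
        return ((r + ROWS) % ROWS) * COLS + ((c + COLS) % COLS);
    }

    int main(void)
    {
        int r = 0, c = 0; /* even a corner node has four links thanks to wrap-around */
        printf("north %d south %d west %d east %d\n",
               node_id(r - 1, c), node_id(r + 1, c),
               node_id(r, c - 1), node_id(r, c + 1));
        return 0;
    }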

SLIDE 17

Overview of Bing Search

[Flow diagram] Check the front-end cache → on a cache miss, go to the Top-level Aggregator (TLA) → Selection service (finds documents that match the search query) → Ranking service (ranks the documents) → return the search results

SLIDE 18

Search Result Ranking

  • 3 stages:
    ○ Feature Extraction (FE)
    ○ Free Form Expressions (FFE)
    ○ Document Scoring
  • 8 FPGAs are arranged in a chain
  • A queue manager passes documents from the selection service through the chain

SLIDE 19
Search Result Ranking - 1. Feature Extraction

  • Search each document for features related to the search query
  • Assign each feature a score
  • Hardware allows multiple feature-extraction engines to run simultaneously
    ○ Multiple instruction, single data (MISD); see the sketch below
  • Stream Preprocessing FSM: splits the input into control and data signals
  • Feature Gathering Network: groups the features and sends them onwards
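A software analogy of the MISD arrangement, with toy extractors standing in for the real feature engines (everything here is illustrative; in hardware the engines run in parallel over the same document stream):

    #include <stddef.h>
    #include <stdio.h>
    #include <string.h>

    typedef int (*feature_fn)(const char *doc, size_t len);

    /* Two toy extractors standing in for real feature engines. */
    static int doc_length(const char *doc, size_t len) { (void)doc; return (int)len; }
    static int count_a(const char *doc, size_t len)
    {
        int n = 0;
        for (size_t i = 0; i < len; i++) n += (doc[i] == 'a');
        return n;
    }

    /* MISD-style dispatch: every engine applies a different "instruction
     * stream" (feature extractor) to the same document data. This loop is
     * the sequential analogy of the parallel hardware. */
    static void extract_features(const char *doc, size_t len,
                                 feature_fn *engines, int *scores, int n_engines)
    {
        for (int i = 0; i < n_engines; i++)
            scores[i] = engines[i](doc, len);
    }

    int main(void)
    {
        feature_fn engines[] = { doc_length, count_a };
        int scores[2];
        const char doc[] = "data centre acceleration";
        extract_features(doc, strlen(doc), engines, scores, 2);
        printf("%d %d\n", scores[0], scores[1]);
        return 0;
    }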

SLIDE 20
Search Result Ranking - 2. Free Form Expressions

  • Mathematical combinations of features
  • Involves complex math with large floats
  • A custom core was designed for creating FFEs
  • Can fit 60 cores on a single D5 FPGA
  • Characteristics (see the scheduling sketch below):
    1. Each core supports 4 threads
    2. Threads are prioritised based on expected latency
    3. Long-latency operations can be split between multiple FPGAs
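A minimal sketch of latency-based thread selection for one FFE core. The slide only says threads are prioritised by expected latency; the longest-first direction of the comparison is our assumption (so slow expressions start early and finish with the rest of the batch):

    #include <stdio.h>

    #define THREADS_PER_CORE 4  /* each FFE core supports 4 threads */

    typedef struct {
        int ready;            /* thread has an expression pending */
        int expected_latency; /* estimated cycles for that expression */
    } ffe_thread;

    /* Pick the next thread to issue: among ready threads, choose the one
     * with the longest expected latency (assumed priority rule). */
    static int pick_thread(const ffe_thread t[THREADS_PER_CORE])
    {
        int best = -1;
        for (int i = 0; i < THREADS_PER_CORE; i++)
            if (t[i].ready &&
                (best < 0 || t[i].expected_latency > t[best].expected_latency))
                best = i;
        return best; /* -1 when no thread is ready */
    }

    int main(void)
    {
        ffe_thread t[THREADS_PER_CORE] = {
            {1, 120}, {0, 900}, {1, 430}, {1, 75}
        };
        printf("issue thread %d\n", pick_thread(t)); /* thread 2 (430 cycles) */
        return 0;
    }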
SLIDE 21

Search Result Ranking - 3. Document Scoring

[Diagram: features and free form expressions feed a machine learning model, which outputs a floating-point score]

  • Search results are ranked in order of document score.
  • Documents are compressed to 64 KB before being passed to the ranking service (the software implementation does not do this)
    ○ Compression has minimal impact
SLIDE 22

Error Handling and Recovery

  • Health Monitor
    ○ Queries servers to check their status
    ○ If unresponsive: soft boot → hard boot → flag for manual service (escalation sketched below)
    ○ If responsive, returns info about the FPGA:
      ■ PCIe errors
      ■ Errors in inter-FPGA network communication
      ■ DRAM status
      ■ Whether a temperature shutdown occurred
  • Mapping Manager
    ○ Manages the roles of the FPGAs
    ○ Performs reconfigurations
    ○ An FPGA being reconfigured may send garbage data, so a “TX halt” signal is sent to its neighbours, telling them to temporarily ignore any data received
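The health monitor's escalation policy from the slide, written out as a tiny C sketch (the function and type names are ours, not Catapult's):

    #include <stdio.h>

    typedef enum { SOFT_BOOT, HARD_BOOT, MANUAL_SERVICE } action_t;

    /* Escalation from the slide: soft boot first, then hard boot,
     * then flag the machine for manual service. */
    static action_t next_action(int failed_recoveries)
    {
        switch (failed_recoveries) {
        case 0:  return SOFT_BOOT;
        case 1:  return HARD_BOOT;
        default: return MANUAL_SERVICE;
        }
    }

    int main(void)
    {
        for (int fails = 0; fails < 3; fails++)
            printf("after %d failed attempts -> action %d\n",
                   fails, next_action(fails));
        return 0;
    }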

SLIDE 23

Deployment Results

  • Deployed to 34 pods (each pod is a 6 x 8 torus of 48 servers) → 34 × 48 = 1632 servers
  • Increased ranking throughput by 95% at latency similar to the software-only solution
  • The increase in power consumption was below 10%
  • The increase in cost was below 30%
SLIDE 24

A Cloud-Scale Acceleration Architecture: Microsoft v2

SLIDE 25

Overview

  • Work by Microsoft towards the Catapult v2 system
  • Aims to fix various issues that came with the Catapult v1 system
  • Implemented on an Altera Stratix V D5 FPGA board that supports a dual-port 40Gbps LAN connection
  • Tested on 5760 servers with synthetic and mirrored production data
  • Achieved noticeable improvements in:
    ○ Local acceleration
    ○ Network acceleration
    ○ Remote acceleration

SLIDE 26

Issues with Catapult v1 System

1. The secondary network (6 x 8 torus) required expensive and complex cabling.
2. Each FPGA needs full awareness of the physical location of other machines.
3. Failure handling in the torus requires complex re-routing of traffic to neighbouring nodes:
   a. Performance loss
   b. Isolation of nodes
4. Limited scalability for direct communication:
   a. One rack → 48 nodes
5. The FPGA can be used for accelerating applications, but has limited ability to enhance datacenter infrastructure.

SLIDE 27

Proposed Solution (bump-in-the-wire)

  • Couple the FPGA with the network interface
    ○ The FPGA shares the same network topology as the server itself
  • All network traffic is routed through the FPGA
    ○ Allows it to accelerate high-bandwidth network flows
  • The FPGA uses a PCIe connection to the host
    ○ Gives it the capability for local acceleration
  • FPGAs are able to generate and consume their own data packets, and communicate using LTL (Lightweight Transport Layer)
    ○ Every FPGA can reach every other one within a small number of microseconds, even at hyperscale

SLIDE 28

Proposed Solution (bump-in-the-wire)

  • Discrete NIC (Network Interface Card)
    ○ Allows for simple bypassing, rather than wasting FPGA resources on implementing NIC logic.
  • Possible drawbacks of this design:
    ○ A buggy application may cut off the network coming into its server
    ○ A failure in one server does not affect the others
    ○ Servers have automatic health checks; an unresponsive server will be rebooted, so the proper FPGA image (the golden image) will be loaded again.

SLIDE 29

Shell Architecture

  • Role: application logic
  • Shell: I/O and board-specific logic
  • Elastic Router: intra-FPGA communication interface for roles
  • LTL engine: inter-FPGA communication interface for roles

SLIDE 30

Resource Usage

44% of the FPGA is used to support all shell functions, leaving enough space for the role(s) to provide a large speedup.

SLIDE 31

Evaluation

  • Local Acceleration
  • Network Acceleration
  • Remote Acceleration
SLIDE 32

Local Acceleration (Bing Search Page Ranking)

  • Implemented critical functions while supporting the Elastic Router and LTL
    ○ All server communication passes through the FPGA while the FPGA is accelerating
    ○ No interaction between pass-through traffic and acceleration
  • Running on a full production bed consisting of thousands of servers
  • The production latency target can be met even when throughput is increased by 2.25x
    ○ i.e. more than half of the servers can be freed for other uses

SLIDE 33

Local Acceleration (Bing Search Page Ranking)

  • When running with production data, FPGA query latency is observed to be lower and more stable
  • The FPGA executes queries at a latency that never exceeds that of the software datacenter at any load

SLIDE 34

Network Acceleration (encryption/decryption)

  • Encryption and decryption can be done by Intel CPUs; however, reaching a 40Gbps rate for the AES-CBC-128-SHA1 algorithm takes up to 15 cores
  • These cores could instead have been used for generating revenue (see the rough estimate below)
  • The FPGA implementation takes 11us, while Intel’s best number is 4us; however, FPGA processing can be more easily pipelined
  • Intel’s number also does not account for the disturbance the crypto workload causes to the cores’ caches
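As a rough sanity check (our arithmetic, not stated on the slide): 40 Gbps ÷ 15 cores ≈ 2.7 Gbps of AES-CBC-128-SHA1 throughput per core, so each core the FPGA frees recovers roughly that much crypto capacity for revenue-generating work.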
SLIDE 35

Remote Acceleration

  • An unused FPGA can be used by other servers, communicating through LTL (sketched below)
    ○ Since the FPGA is tightly coupled to the network, it can react quickly
    ○ LTL uses ACK/NACK to avoid unnecessary timeouts in the event of re-ordering
    ○ LTL shows very promising performance
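A minimal sketch of the ACK/NACK idea: the receiver ACKs in-order packets and NACKs on a gap, so the sender can retransmit immediately instead of waiting for a timeout. The sequence-number handling is our assumption; the real LTL wire format is not described on the slide.

    #include <stdint.h>
    #include <stdio.h>

    typedef enum { SEND_ACK, SEND_NACK } rx_reply;

    /* Receiver side: ACK an in-order packet, NACK anything else so the
     * sender retransmits right away rather than waiting for a timeout. */
    static rx_reply on_packet(uint32_t seq, uint32_t *expected)
    {
        if (seq == *expected) {
            (*expected)++;
            return SEND_ACK;
        }
        return SEND_NACK; /* gap or re-ordering detected */
    }

    int main(void)
    {
        uint32_t expected = 0;
        rx_reply a = on_packet(0, &expected); /* ACK */
        rx_reply b = on_packet(2, &expected); /* NACK: packet 1 is missing */
        rx_reply c = on_packet(1, &expected); /* ACK after retransmission */
        printf("%d %d %d\n", a, b, c);
        return 0;
    }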

SLIDE 36

Remote Acceleration

  • Running the same page-ranking algorithm as before
  • Remote acceleration has no major impact on the acceleration speedup
  • Therefore it is a viable choice

SLIDE 37

Remote Acceleration (Over-subscription)

  • Runs a latency-sensitive DNN over the network
  • The FPGA has sufficient throughput to sustain 2-2.5 software clients (each running at multiple times the production rate) before latency spikes

SLIDE 38

HaaS (Hardware as a Service)

  • SaaS (Software) → IaaS (Infrastructure) → HaaS (Hardware)
  • Utilise unused FPGAs to their full potential without affecting the performance of the datacenter
  • The details are out of scope for this paper

SLIDE 39

Comparison of Reconfigurable Accelerators for Cloud Computing

SLIDE 40

System speedup: defined as the speedup achieved for a specific task using the hardware accelerator, including the communication overhead of transferring data to the FPGA and back (see the formula below).

Energy efficiency: the energy consumption of the hardware accelerator compared to executing in software
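Written as a formula (the symbols are ours; the slide gives only the verbal definition), with T_sw the software execution time, T_acc the accelerator execution time, and T_comm the data-transfer overhead:

    \text{System speedup} = \frac{T_{\text{sw}}}{T_{\text{acc}} + T_{\text{comm}}}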

SLIDE 41

Overall, it is shown that hardware accelerators can achieve an order of magnitude better energy efficiency compared to typical server processors.

[Chart: architectures that did not report energy efficiency, grouped by application]

SLIDE 42

Thanks for listening! Questions?