SLIDE 1

Datacentre Acceleration

Ted Zhang (z5019142) Meisi Li (z5119623) Han Zhao (z5099931) Nattamon Paiboon (z5176930)

SLIDE 2

Background

SLIDE 3

Data Centres

❖ Centralised computing and network infrastructure
❖ Cloud applications for storage and computation

SLIDE 4

Current Issues - Performance

❖ End of Moore's Law
❖ Mainstream hardware unable to keep up with growing demand

SLIDE 5

Current Issues - Power Efficiency

❖ Lots of designed redundancy in data centres
  ➢ Peak load handling
❖ Growing attention towards sustainability
  ➢ Environmental and cost concerns

SLIDE 6

Solution?

Application specific hardware acceleration

❖ 25x better performance per watt
❖ 50-75x latency improvement

SLIDE 7

Cloud Computing Characteristics

Two broad categories of cloud applications:

❖ Offline - process large quantities of data, complex operations
  ➢ Big data, MapReduce
❖ Online - data streaming and delivery
  ➢ Search engine, video streaming

SLIDE 8

Implementation Frameworks

SLIDE 9

Accelerator Frameworks

FPGAs in the Cloud (IBM)
❏ Abstracts portions of FPGAs as a pool of resources
❏ Predefined functions such as encryption and hashing

Virtualised Hardware Accelerators (University of Toronto)
❏ Abstracts reconfigurable FPGA accelerators as Virtual Machines
❏ OpenStack for resource control and allocation

SLIDE 10

Accelerator Frameworks

FPGAs in Hyperscale Data Centres (IBM)
❏ Direct user allocation of FPGA partitions

Virtualised FPGA Accelerators (University of Warwick)
❏ Integration within server machines
❏ Usually provide a library of operations which are faster on FPGA

SLIDE 11

Implementations and Evaluation

SLIDE 12

Speedup Metrics

❖ Performance speedup
  ➢ Kernel speedup - execution of a specific task
  ➢ System speedup - execution of the entire application (see the sketch below)
❖ Energy efficiency - improvement expressed as a multiplier
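The difference between kernel and system speedup follows Amdahl's law: the overall gain depends on the fraction of runtime spent in the accelerated kernel. A minimal sketch with illustrative numbers (not taken from the surveyed papers):

```python
def system_speedup(kernel_fraction, kernel_speedup):
    """Amdahl's law: overall (system) speedup when only `kernel_fraction`
    of the runtime is accelerated by `kernel_speedup`."""
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup)

# Illustrative only: a 10x kernel speedup on a kernel that is 60% of runtime
# yields roughly a 2.2x system speedup.
print(system_speedup(kernel_fraction=0.6, kernel_speedup=10.0))  # ~2.17
```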

SLIDE 13

Successful Implementations

Reconfigurable MapReduce Accelerator - University of Athens
❏ V1 - Map done by standard processors, Reduce moved to FPGA
❏ V2 - Map moved to FPGA, reconfigurable by HLS

Speedup: 4.3x | Efficiency: 33x

Memcached Acceleration - HP
❏ Two distinct accelerator blocks
  ❏ Network accelerator
  ❏ Memcached accelerator
❏ Example of streaming acceleration

Speedup: 1x | Efficiency: 10.9x

SLIDE 14

Successful Implementations

Microsoft Catapult - Bing Search Engine
❏ Altera FPGA PCIe board installed inside standard server machines
❏ Aids machine-learning page ranking

Speedup 1.95x Efficiency

SLIDE 15

Implemented Hardware Accelerators: Microsoft Catapult v1

SLIDE 16

Design

  • 6 x 8 2D torus embedded into a half-rack of 48 servers
  • 1,632 servers
  • 1 Altera Stratix V D5 FPGA and local DRAM per server
  • Attached to the host over PCI Express (PCIe)
  • Each FPGA has 8 GB of local DRAM
  • 20 Gb/s of bidirectional bandwidth using only passive copper cables
SLIDE 17

Software Interface

Design of the communication between the FPGA and the host CPU:

  • Interface via PCIe
  • Interface must incur low latency
  • Interface must be multi-threading safe
  • The FPGA is provided a pointer to a user-space buffer.
  • Buffer space is divided into 64 slots.
  • Each thread is statically assigned exclusive access to one or more slots (see the sketch below)
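A minimal sketch of the slot scheme described above, with hypothetical names and sizes (the slides only state that the buffer is split into 64 slots and that each thread owns some of them exclusively):

```python
NUM_SLOTS = 64  # the user-space buffer is divided into 64 slots

def assign_slots(thread_ids, slots_per_thread=1):
    """Statically give each thread exclusive ownership of one or more slots,
    so threads never contend for the same region of the shared buffer."""
    if len(thread_ids) * slots_per_thread > NUM_SLOTS:
        raise ValueError("not enough slots for all threads")
    assignment, next_slot = {}, 0
    for tid in thread_ids:
        assignment[tid] = list(range(next_slot, next_slot + slots_per_thread))
        next_slot += slots_per_thread
    return assignment

# Example: 16 threads with 4 exclusive slots each.
print(assign_slots(list(range(16)), slots_per_thread=4))
```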
SLIDE 18

Software Infrastructure

The software infrastructure needs to

  • Ensure correct operation
  • Detect failures and recover

Two services are introduced for these tasks:

  • Mapping manager: Configures FPGAs with correct application images
  • Health monitor: Is invoked when there is a suspected failure in one or more systems

SLIDE 19

Correct Operation

FPGA reconfiguration may cause instability in the system. Reasons:

  • It can appear as a failed PCIe device. This raises a non-maskable interrupt
  • It may corrupt its neighbors by randomly sending traffic that appears valid

Solution:

  • The driver behind the reprogramming must disable non-maskable interrupts
  • Send a "TX Halt" message, meaning all messages are ignored until the link is re-established
SLIDE 20

Failure Detection and Recovery

  • A monitoring server notices unresponsive servers
  • The health monitor contacts each machine to get its status
  • A healthy service sends the status of the local FPGA
  • The health monitor updates the list of failed servers
  • The mapping manager moves the application (sketched below)
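A hedged sketch of that recovery flow in software, using hypothetical callbacks (the slides do not describe the actual health monitor or mapping manager APIs):

```python
def recover_failures(suspect_servers, get_status, mark_failed, move_application):
    """Illustrative recovery loop: poll each suspect server; if it does not
    reply or reports an unhealthy FPGA, record the failure and ask the
    mapping manager to relocate the application."""
    for server in suspect_servers:
        status = get_status(server)          # health monitor contacts the machine
        if status is None or not status.get("fpga_healthy", False):
            mark_failed(server)              # update the failed-server list
            move_application(server)         # mapping manager moves the application

# Example with trivial stand-in callbacks:
recover_failures(
    suspect_servers=["node-01", "node-02"],
    get_status=lambda s: {"fpga_healthy": s != "node-02"},
    mark_failed=lambda s: print("failed:", s),
    move_application=lambda s: print("re-mapping work off", s),
)
```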
SLIDE 21

Application

Used in Bing's ranking engine. Overview:

  • If possible, the query is served from a front-end cache
  • The TLA (Top-Level Aggregator) sends the query to a large number of machines
  • These machines find matching documents
  • The documents are sent to the machines running the ranking service
  • The search results are returned
SLIDE 22

Macropipeline

  • The processing pipeline is divided into macro-pipeline stages
  • The time limit for each macro-pipeline stage is 8 microseconds
  • That is 1,600 FPGA clock cycles (checked below)
  • A queue manager passes documents from the selection service through the chain
  • Tasks are distributed in this fashion:
    ○ 1 FPGA for feature extraction
    ○ 2 FPGAs for free-form expressions
    ○ 1 FPGA for compression
    ○ 3 FPGAs to hold machine learning models
    ○ 1 FPGA as a spare in case of machine failure
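The two budget figures above are mutually consistent and imply a 200 MHz FPGA clock; a quick check:

```python
stage_budget_s = 8e-6   # 8 microseconds per macro-pipeline stage
budget_cycles = 1600    # stated cycle budget for the same stage

implied_clock_hz = budget_cycles / stage_budget_s
print(implied_clock_hz / 1e6, "MHz")  # 200.0 MHz
```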

SLIDE 23

Workload

  • 3 stages:
    ○ Feature Extraction (FE)
    ○ Free-Form Expressions (FFE)
    ○ Document Scoring
  • Documents are only transmitted in compressed form to save bandwidth
  • Due to the slot-based communication interface, the compressed documents are truncated to 64 KB

SLIDE 24

Feature Extraction

  • Searches each document for features related to the search query
  • The hardware allows multiple feature extraction engines to run simultaneously
  • Each engine works in parallel on the same input stream - a multiple instruction, single data (MISD) computation
  • Stream Preprocessing FSM: produces a series of control and data messages
  • Feature Gathering Network: collects the generated feature and value pairs and forwards them to the next pipeline stage

SLIDE 25

Free Form Expressions

  • Custom multicore processor that is efficient at processing thousands of threads with long-latency floating point operations
  • 60 cores on a single FPGA
  • Characteristics:
    ○ Each core supports 4 threads
    ○ Threads are prioritised based on expected latency
    ○ Long-latency operations can be split between multiple FPGAs

SLIDE 26

Document Scoring

  • Takes the features and FFE results as inputs and produces a single floating-point score
  • The resulting scores are sorted
SLIDE 27

Cloud-scale Acceleration

SLIDE 28

Limitations of Catapult V1.0

  • Secondary network complex and expensive
  • Failure handling of the torus required complex re-routing of traffic to neighboring nodes, causing both performance loss and isolation of nodes under certain failure patterns
  • Number of FPGAs that could communicate with each other is limited to a single rack
  • Application-scale accelerators could not influence the whole datacenter infrastructure, such as network and storage flow

SLIDE 29

A new cloud-scale FPGA-based Architecture

This architecture eliminates all of the limitations listed above with a single design. The architecture has been, and is being, deployed in the majority of new servers in Microsoft's production data centers across more than 15 countries and 5 continents. We could call it Catapult V2.0.
SLIDE 30

Network Topology of the architecture

  • PCIe to the local host CPU (local accelerator)
  • The FPGA is placed between the NIC and the network switch ("bump-in-the-wire")
  • NIC (network interface card)
  • QSFP (Quad Small Form-factor Pluggable)

SLIDE 31

Flexibility of the model

By enabling the FPGAs to generate and consume their own networking packets independently of the hosts, each and every FPGA in the datacenter can reach every other one.

The LTL (Lightweight Transport Layer) protocol is used for low-latency communication between pairs of FPGAs. Every host can use remote FPGA resources, and FPGAs not in use are donated to a global pool by their host.

SLIDE 32

Hardware Architecture

Datacenter accelerators must be highly manageable, which means having few variations or versions. A single design must provide positive value across an extremely large, homogeneous deployment. They must be highly flexible at the system level, in addition to being programmable, to justify deployment across a hyperscale infrastructure.

Divided into three parts:

1. Local acceleration handles high-value scenarios such as search ranking acceleration, where every server can benefit from having its own FPGA.
2. Network acceleration can support services such as intrusion detection, deep packet inspection and network encryption.
3. Global acceleration permits accelerators unused by their host servers to be made available for large-scale applications.

SLIDE 33

Board Design

SLIDE 34

Shell architecture

The shell includes an Elastic Router (ER) with virtual channel support for allowing multiple Roles access to the network, and a Lightweight Transport Layer (LTL) engine used for enabling inter-FPGA communication.

SLIDE 35

Datacenter deployment

5,760 servers containing this accelerator architecture were placed into a production datacenter, with 3,081 of these machines using the FPGA for local compute acceleration. Only two FPGAs had hard failures, and there was a low number of soft errors, all of which were correctable.

SLIDE 36

Local Acceleration

Bing search page ranking. With a single local FPGA, at the target 99th-percentile latency, throughput can be safely increased by 2.25x, which means that fewer than half as many servers would be needed to sustain the target throughput at the required latency (see the check below).
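The "fewer than half as many servers" claim follows directly from the 2.25x throughput figure:

```python
throughput_gain = 2.25
servers_needed = 1.0 / throughput_gain
print(round(servers_needed, 2))  # ~0.44, i.e. fewer than half the original servers
```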

SLIDE 37

Network Acceleration

  • Intel AES-GCM-128 on Haswell takes 1.26 cycles per byte for encrypt and for decrypt. Thus, at a 2.4 GHz clock frequency, 40 Gb/s of encryption plus decryption consumes five cores (sanity-checked below).
  • Different standards, such as 256-bit keys or CBC mode, are significantly slower: AES-CBC-128-SHA1 needs fifteen cores to achieve 40 Gb/s full duplex.
  • The FPGA supports full 40 Gb/s encryption and decryption.
  • The worst-case half-duplex FPGA crypto latency for AES-CBC-128-SHA1 is 11 µs for a 1500 B packet, from first flit to first flit. In software, based on the Intel numbers, it is approximately 4 µs.
  • AES-CBC-SHA1 is, however, especially difficult for hardware due to tight dependencies. For example, AES-CBC requires processing 33 packets at a time in our implementation, taking only 128 bits from a single packet once every 33 cycles.
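The five-core figure can be sanity-checked from the numbers on this slide (a rough estimate; the original accounting may differ slightly):

```python
line_rate_bytes = 40e9 / 8   # 40 Gb/s expressed in bytes per second
cycles_per_byte = 1.26       # AES-GCM-128 on Haswell, per direction
core_clock_hz = 2.4e9        # Haswell clock frequency

cores_one_direction = line_rate_bytes * cycles_per_byte / core_clock_hz
cores_both_directions = 2 * cores_one_direction  # encrypt and decrypt
print(cores_one_direction)    # ~2.6 cores for one direction
print(cores_both_directions)  # ~5.25, i.e. roughly the five cores quoted
```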

SLIDE 38

Remote acceleration

Communication among FPGAs is important to treat the acceleration hardware as a global resource and to deploy services that consume more than one FPGA. FPGAs can be used by other servers on the network: when an FPGA is not being used by its local server, it is grouped into a pool and can be used by other servers that need an additional FPGA.

Two major functions must be implemented on the FPGA:
1. inter-FPGA communication
2. an intra-FPGA router

SLIDE 39

Lightweight Transport Layer (LTL)

LTL is an inter-FPGA network protocol that uses UDP for frame encapsulation and IP for routing packets across the datacenter network (a toy software sketch follows this list).

❖ Lossless traffic classes
  ➢ avoid packet drops and reorders
❖ FPGAs are tightly coupled to the network
  ➢ react quickly and efficiently to congestion notifications and back off when needed to reduce dropped packets
❖ Uses statically allocated and persistent connections
  ➢ low-latency communication once a connection is established
❖ Uses ACK/NACK-based retransmission schemes
  ➢ guarantees reliable delivery over the datacenter network
  ➢ avoids waiting for a timeout when retransmission is needed
❖ Implements Priority Flow Control and a congestion control scheme
  ➢ safely insert and remove packets from the network
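LTL itself is implemented in FPGA logic; the toy sketch below only illustrates the ACK-with-retransmission idea over UDP in software, with hypothetical framing that is not the real LTL packet format:

```python
import socket
import struct

def ltl_like_send(sock, dest, seq, payload, max_tries=5, timeout=0.01):
    """Toy reliable send: frame = (sequence number, payload) over UDP; wait
    for an ACK carrying the same sequence number and retransmit on loss.
    Illustrative only; real LTL adds flow control and congestion control."""
    frame = struct.pack("!I", seq) + payload
    sock.settimeout(timeout)
    for _ in range(max_tries):
        sock.sendto(frame, dest)
        try:
            reply, _ = sock.recvfrom(64)
        except socket.timeout:
            continue                             # lost; retransmit
        kind, acked = struct.unpack("!4sI", reply[:8])
        if kind == b"ACK " and acked == seq:     # made-up 4-byte reply tag
            return True                          # delivered
        # anything else is treated as a NACK: retransmit immediately
    return False
```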

SLIDE 40

Elastic Router

The Elastic Router is an on-chip, input-buffered crossbar switch designed to support efficient communication between multiple endpoints on an FPGA across multiple virtual channels. It was developed to support both intra- and inter-FPGA communication:

  • intra-FPGA communication between Roles on the same FPGA
  • inter-FPGA communication between Roles running on other FPGAs, through the LTL

The ER supports multiple virtual channels and allows a pool of flits to be shared among multiple VCs, which reduces the aggregate flit buffering requirement (see the sketch below).
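A small sketch of that buffering idea: per-virtual-channel queues draw flits from one shared pool, so the total buffer can be smaller than a per-VC worst case times the number of VCs (hypothetical sizes and names):

```python
from collections import deque

class SharedFlitBuffer:
    """Toy model of one flit pool shared among several virtual channels."""
    def __init__(self, num_vcs=4, total_flits=32):
        self.free = total_flits                        # shared credit pool
        self.queues = [deque() for _ in range(num_vcs)]

    def enqueue(self, vc, flit):
        """Accept a flit on a VC only if a shared buffer slot is free."""
        if self.free == 0:
            return False                               # back-pressure this VC
        self.free -= 1
        self.queues[vc].append(flit)
        return True

    def dequeue(self, vc):
        """Forward the oldest flit on a VC and release its buffer slot."""
        if not self.queues[vc]:
            return None
        self.free += 1
        return self.queues[vc].popleft()
```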

SLIDE 41

Lightweight Transport Layer and Elastic Router

LTL and ER are crucial to allow FPGAs to:
1. be organized into multi-FPGA services
2. be remotely managed for use as a remote accelerator when not in use by their host

Shell versions:
  • without the LTL block → local FPGA use only
  • with the LTL block → multi-FPGA acceleration

SLIDE 42

LTL communication Evaluation

The L2 latency distribution varies significantly:

  • a large number of hosts are connected
  • latency is affected by
    ○ physical distance and cabling
    ○ transient background traffic
    ○ switch internal implementation

6x8 Torus (Catapult v1)

  • Limited to a group of 48 FPGAs
  • A separate inter-FPGA network is expensive and complex to cable and maintain
  • Failure handling is hard and impacts latency
    ○ traffic is dynamically rerouted around faulty FPGAs
    ○ costing extra hops and latency

LTL

  • The network allows access to hundreds of thousands of hosts/FPGAs in a fixed number of hops
  • Failure handling is simple
    ○ spare nodes/FPGAs remain accessible

SLIDE 43

Remote Acceleration Evaluation

The result is the throughput of a single accelerator when accessed remotely; the latency of remote accesses is minimal.

  • Remote acceleration has minimal impact on the host server
    ○ Servers can donate their FPGA to the global pool
  • Available network bandwidth can be reduced by the remote service
    ○ LTL has bandwidth limiting to prevent an FPGA from exceeding its limit

SLIDE 44

Oversubscription Evaluation

Latency-sensitive Deep Neural Network (DNN) accelerators are deployed on the network and shared by multiple software clients.

  • The ratio of software clients to accelerators is increased to measure the impact
  • When an FPGA reaches its peak throughput, latencies spike
  • Each FPGA has sufficient throughput to sustain 2 - 2.5 software clients before the latency spike

SLIDE 45

Hardware-as-a-Service Model (HaaS)

A concept of grouping FPGAs into a service pool (see the sketch after this list).

Resource Manager (RM)

  • track FPGA resources
  • provide APIs for SM

Service Manager (SM)

  • Manages service-level tasks

FPGA manager (FM)

  • provide configuration
  • status monitoring
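A skeleton of that split of responsibilities, with hypothetical method names (the slide only names the three managers and their roles):

```python
class ResourceManager:
    """Tracks FPGA resources in the pool and exposes allocation APIs to the SM."""
    def __init__(self):
        self.free_fpgas = set()

    def register(self, fpga_id):
        self.free_fpgas.add(fpga_id)

    def allocate(self, count):
        count = min(count, len(self.free_fpgas))
        return [self.free_fpgas.pop() for _ in range(count)]


class FPGAManager:
    """Configures FPGAs with a service image and monitors their status."""
    def configure(self, fpga_id, image):
        print(f"loading {image} onto {fpga_id}")

    def status(self, fpga_id):
        return {"fpga": fpga_id, "healthy": True}


class ServiceManager:
    """Handles service-level tasks: obtains FPGAs from the RM and uses the FM
    to configure them for the requested service."""
    def __init__(self, rm, fm):
        self.rm, self.fm = rm, fm

    def deploy(self, image, count):
        fpgas = self.rm.allocate(count)
        for fpga in fpgas:
            self.fm.configure(fpga, image)
        return fpgas
```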
SLIDE 46

Thank you!