Datacentre Acceleration
Ted Zhang (z5019142) Meisi Li (z5119623) Han Zhao(z5099931) Nattamon Paiboon (z5176930)
Background - Data Centres ❖ Centralised computing and network infrastructure ❖ Cloud applications for storage and computation
❖ End of Moore's Law ❖ Mainstream hardware unable to keep up with growing demand
❖ Lots of designed redundancy in data centres ➢ Peak load handling ❖ Growing attention towards sustainability ➢ Environmental and cost concerns
❖ 25x better performance per watt ❖ 50-75x latency improvement
Two broad categories of cloud applications:
❖ Offline - process large quantities of data with complex computation
➢ Big data, MapReduce
❖ Online - data streaming and delivery
➢ Search engines, video streaming
FPGAs in the Cloud (IBM) ❏ Abstracts portions of FPGAs as a pool of resources ❏ Predefined functions such as encryption and hashing Virtualised Hardware Accelerators (University of Toronto) ❏ Abstracts reconfigurable FPGA accelerators as Virtual Machines ❏ Openstack for resource control and allocation
FPGAs in Hyperscale Data Centres (IBM) ❏ Direct user allocation of FPGA partitions Virtualised FPGA Accelerators (University of Warwick) ❏ Integration within server machines ❏ Usually provides a library of operations that run faster on the FPGA
❖ Performance speedup ➢ Kernel speedup - execution of specific task ➢ System speedup - execution of entire application ❖ Energy efficiency - fraction multiplier
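To make the two speedup metrics concrete, here is a minimal sketch (not from the presentation) relating kernel speedup to system speedup under the usual Amdahl's-law assumption that only a fraction of the application runtime is accelerated; the function name and example numbers are illustrative only.

```python
# Minimal sketch (assumption, not from the slides): relating kernel speedup
# to system speedup with Amdahl's law, where `f` is the fraction of total
# application runtime spent in the accelerated kernel.

def system_speedup(kernel_speedup: float, f: float) -> float:
    """Overall application speedup when only a fraction f of the runtime
    is accelerated by `kernel_speedup` (Amdahl's law)."""
    return 1.0 / ((1.0 - f) + f / kernel_speedup)

# Example: a 10x kernel speedup on a kernel that is 80% of the runtime
# yields only about a 3.6x system speedup.
print(system_speedup(10.0, 0.8))  # ~3.57
```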
Reconfigurable MapReduce Accelerator - University of Athens ❏ V1 - Map done by standard processors, reduce moved to FPGA ❏ V2 - Map moved to FPGA, reconfigurable by HLS
Speedup 4.3x Efficiency 33x
Memcached Acceleration - HP ❏ Two distinct accelerator blocks ❏ Network accelerator ❏ Memcached accelerator ❏ Example of streaming acceleration
Speedup 1x Efficiency 10.9x
Microsoft Catapult - Bing Search Engine ❏ Altera FPGA PCIe board installed inside standard server machines ❏ Aids machine-learning-based page ranking
Speedup 1.95x Efficiency
Communication between the FPGA and the host CPU is a key design concern. The software infrastructure needs to configure FPGAs and handle failures across many systems. Two services are introduced for these tasks: a Mapping Manager, which loads the correct application image onto each FPGA, and a Health Monitor, which is invoked when a machine is suspected of failure.
FPGA reconfiguration may cause instability in the system. Reason: while an FPGA is being reconfigured, its PCIe and inter-FPGA links can carry spurious data that neighbouring nodes may interpret as valid traffic. Solution: neighbouring FPGAs and the host driver are told to ignore (halt) traffic from the node being reconfigured, and the links are re-enabled only once reconfiguration completes.
Used in Bing's ranking engine. Overview: the ranking service is mapped across a group of eight machines, whose FPGAs form the pipeline stages; each query passes through the whole service along the chain:
○ 1 FPGA for feature extraction
○ 2 FPGAs for free-form expressions
○ 1 FPGA for compression
○ 3 FPGAs to hold the machine learning models
○ 1 FPGA as a spare in case of machine failure
○ Feature Extraction (FE) ○ Free Form Expressions (FFE) ○ Document Scoring
Each ranking request contains a compressed document, truncated to 64 KB, together with the search query.
Feature Extraction (FE): many feature-extraction engines run in parallel, all working on the same input stream (MISD computation). The document is streamed past every engine, allowing all of the engines to run simultaneously; the stream is split into control and data messages, and a gathering stage collects the resulting feature and value pairs and forwards them to the next pipeline stage.
Free Form Expressions (FFE): expressions are evaluated on multi-threaded soft processor cores, many of them with long-latency floating point operations.
○ Each core supports 4 threads
○ Threads are prioritised based on expected latency
○ Long-latency operations can be split between multiple FPGAs
Document Scoring: the machine-learned model combines the computed features into a single floating-point score for each document-query pair.
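As a structural summary of the ranking pipeline above, here is a small sketch (an assumption for illustration, not code from Catapult): it only models the per-stage FPGA allocation on one eight-FPGA ring, and all names are hypothetical.

```python
# Minimal structural sketch (assumption, not from the slides): the
# eight-FPGA Catapult v1 ranking ring with the per-stage FPGA counts
# listed above. Class and stage names are hypothetical.
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    fpgas: int  # FPGAs dedicated to this stage in one ring

PIPELINE = [
    Stage("feature_extraction", 1),     # FE
    Stage("free_form_expressions", 2),  # FFE
    Stage("compression", 1),
    Stage("scoring_models", 3),         # machine-learned models
]
SPARE = 1  # one spare FPGA per ring for fail-over

# A request (document truncated to 64 KB, plus the query) visits each stage
# in order along the chain; the last stage returns one floating-point score.
assert sum(s.fpgas for s in PIPELINE) + SPARE == 8
```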
Limitations of Catapult v1: the secondary torus network required rerouting traffic around failed nodes, causing both performance loss and isolation of nodes under certain failure patterns, and the design could not accelerate datacentre infrastructure such as network and storage flows.
This architecture eliminates all of the limitations listed above with a single design. The architecture has been, and is being, deployed in the majority of new servers in Microsoft's production data centers across more than 15 countries and 5 continents.
The FPGA is connected by PCIe to the local host CPU (for local acceleration) and is placed as a bump in the wire between the NIC (network interface card) and the network, using QSFP (Quad Small Form-factor Pluggable) ports.
By enabling the FPGAs to generate and consume their own networking packets independent of the hosts, each and every FPGA in the datacenter can reach every other one.
LTL (Lightweight Transport Layer): a protocol for low-latency communication between pairs of FPGAs. Every host can use remote FPGA resources; FPGAs that are not running work for their host are donated to a global pool.
Datacenter accelerators must be highly manageable, which means having few hardware variations or versions in deployment. They must be highly flexible at the system level, in addition to being programmable, to justify deployment across a hyperscale infrastructure.
Divided into three parts:
1. Local acceleration handles high value scenarios such as search ranking acceleration where every server can benefit from having its own FPGA. 2. Network acceleration can support services such as intrusion detection, deep packet inspection and network encryption. 3. Global acceleration permits accelerators unused by their host servers to be made available for large-scale applications.
An Elastic Router (ER) with virtual channel support allows multiple Roles to access the network, and a Lightweight Transport Layer (LTL) engine enables inter-FPGA communication.
A bed of 5,760 servers containing this accelerator architecture was placed into a production datacenter, with 3,081 of these machines using the FPGA for local compute acceleration. Only two FPGAs had hard failures, and there was a low number of soft errors, all of which were correctable.
Bing Search page ranking: with a single local FPGA, at the target 99th-percentile latency, throughput can be safely increased by 2.25x, which means that fewer than half as many servers would be needed to sustain the target throughput at the required latency.
Intel AES-GCM-128 on Haswell takes 1.26 cycles per byte for encryption and for decryption. Thus, at a 2.4 GHz clock frequency, 40 Gb/s encryption plus decryption consumes five cores. Other standards, such as 256-bit keys or CBC mode, are significantly slower: AES-CBC-128-SHA1 needs fifteen cores to achieve 40 Gb/s full duplex, whereas the FPGA supports full 40 Gb/s encryption and decryption. The worst-case half-duplex FPGA crypto latency for AES-CBC-128-SHA1 is 11 µs for a 1500 B packet, from first flit to first flit; in software, based on the Intel numbers, it is approximately 4 µs. AES-CBC-SHA1 is, however, especially difficult for hardware due to tight dependencies: the AES-CBC implementation processes 33 packets at a time, taking only 128 bits from a single packet once every 33 cycles.
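A quick back-of-the-envelope check of the five-core figure, assuming full duplex means 40 Gb/s encrypted plus 40 Gb/s decrypted at 1.26 cycles per byte each on a 2.4 GHz core:

```python
# Back-of-the-envelope check (assumption: encrypt and decrypt both run at
# the full 40 Gb/s line rate) of the "five cores" figure quoted above.
line_rate_bytes = 40e9 / 8          # 40 Gb/s in bytes per second
cycles_per_byte = 1.26              # Intel AES-GCM-128 on Haswell, each direction
clock_hz = 2.4e9                    # 2.4 GHz core clock

cycles_needed = line_rate_bytes * cycles_per_byte * 2   # encrypt + decrypt
cores = cycles_needed / clock_hz
print(f"{cores:.2f} cores")         # ~5.25, i.e. about five cores
```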
Communication among FPGAs is important in order to treat the acceleration hardware as a global resource and to deploy services that consume more than one FPGA. FPGAs can be used by other servers on the network: when an FPGA is not being used by its local server, it is placed into a shared pool and can be used by any other server that needs additional FPGAs. Two major functions are implemented on the FPGA for this: 1. inter-FPGA communication (LTL) 2. an intra-FPGA router (ER)
LTL is an inter-FPGA network protocol that uses UDP for frame encapsulation and IP for routing packets across the datacenter network.
❖ Lossless traffic classes
➢ avoid packet drops and reorders
❖ FPGAs are tightly coupled to the network
➢ react quickly and efficiently to notifications, and back off when needed to reduce dropped packets
❖ Uses statically allocated, persistent connections
➢ low-latency communication once a connection is established
❖ Uses ACK/NACK-based retransmission schemes
➢ guarantee reliability over the datacenter network
➢ avoid waiting for a timeout when retransmission is needed
❖ Implements Priority Flow Control and a congestion control scheme
➢ to safely insert and remove packets from the network
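To illustrate the ACK/NACK idea above, here is a minimal software sketch; the real LTL is implemented in FPGA logic and its frame format is not given here, so the header layout, message types, and function are hypothetical.

```python
# Minimal software sketch (hypothetical; the real LTL runs in FPGA logic):
# UDP-encapsulated frames carrying a sequence number, with ACK/NACK-driven
# retransmission so the sender never waits for a long timeout on loss.
import socket, struct

HDR = struct.Struct("!BI")   # 1-byte message type, 4-byte sequence number (assumed layout)
DATA, ACK, NACK = 0, 1, 2

def send_reliable(sock: socket.socket, peer, payload: bytes, seq: int) -> None:
    """Send one frame, then retransmit immediately on NACK instead of
    waiting for a timeout; an ACK completes the transfer."""
    sock.sendto(HDR.pack(DATA, seq) + payload, peer)
    while True:
        kind, acked = HDR.unpack(sock.recv(HDR.size))
        if kind == ACK and acked == seq:
            return                                             # receiver got the frame
        if kind == NACK and acked == seq:
            sock.sendto(HDR.pack(DATA, seq) + payload, peer)   # retransmit now
```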
The Elastic Router (ER) is an on-chip, input-buffered crossbar switch designed to support efficient communication between multiple endpoints on an FPGA across multiple virtual channels. It was developed to support both intra- and inter-FPGA communication, the latter via the LTL. The ER supports multiple virtual channels and allows a pool of flits to be shared among the VCs, which reduces the aggregate flit buffering requirement.
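A toy illustration (an assumption, not the ER hardware design) of the shared flit pool: virtual channels draw buffer slots from one common pool rather than each reserving a worst-case private buffer, which is what reduces aggregate buffering.

```python
# Toy illustration (not the actual ER hardware): several virtual channels
# share one pool of flit buffer slots, so total buffering can be smaller
# than giving every VC its own worst-case buffer. Sizes are made up.
from collections import deque

class SharedFlitBuffer:
    def __init__(self, total_slots: int, num_vcs: int):
        self.free_slots = total_slots              # one shared pool of slots
        self.queues = [deque() for _ in range(num_vcs)]

    def enqueue(self, vc: int, flit) -> bool:
        """Accept a flit on a VC only if the shared pool has a free slot."""
        if self.free_slots == 0:
            return False                           # back-pressure this VC
        self.free_slots -= 1
        self.queues[vc].append(flit)
        return True

    def dequeue(self, vc: int):
        """Forward one flit from a VC and return its slot to the pool."""
        flit = self.queues[vc].popleft()
        self.free_slots += 1
        return flit

buf = SharedFlitBuffer(total_slots=8, num_vcs=4)   # 8 shared slots vs 4x8 private
```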
LTL and ER are crucial to allow FPGAs to 1. be organized into multi-FPGA services and 2. be remotely managed for use as a remote accelerator when not in use by their host.
[Slide figure: shell versions and FPGA acceleration resources]
The L2 latency distribution varies significantly due to:
○ physical distance
○ cabling
○ transient background traffic
○ switch internal implementation
6x8 Torus (Catapult v1): expensive and complex to cable and maintain.
○ Traffic must be dynamically rerouted around a faulty FPGA
○ Rerouting costs extra hops and latency
LTL over the datacenter network (Catapult v2): reaches thousands of hosts/FPGAs in a fixed number of hops
○ Spare nodes/FPGAs remain accessible
Results measure the throughput of a single accelerator when accessed remotely; the latency of remote accesses is minimal.
Minimal impact on the host server:
○ Servers can donate their FPGA to the global pool while it is used by a remote service.
○ LTL has bandwidth limiting to prevent the FPGA from exceeding a set limit.
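The slides do not say how LTL enforces the bandwidth limit; one common mechanism is a token bucket, sketched below purely as an illustration (all names and numbers are hypothetical).

```python
# Hypothetical illustration of bandwidth limiting (not how LTL actually
# implements it): a token bucket that only admits a packet onto the
# network when enough byte-tokens have accumulated.
import time

class TokenBucket:
    def __init__(self, rate_bytes_per_s: float, burst_bytes: float):
        self.rate = rate_bytes_per_s      # sustained limit
        self.capacity = burst_bytes       # maximum burst
        self.tokens = burst_bytes
        self.last = time.monotonic()

    def allow(self, packet_bytes: int) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= packet_bytes:
            self.tokens -= packet_bytes   # admit the packet
            return True
        return False                      # hold the packet; limit reached

limiter = TokenBucket(rate_bytes_per_s=1.25e9, burst_bytes=64_000)  # ~10 Gb/s cap
```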
Deploy latency-sensitive Deep Neural Network (DNN) accelerators shared by multiple software clients
Increasing the ratio of software clients to accelerators measures the impact of sharing: when an FPGA reaches its peak throughput, latencies spike. A single FPGA has enough throughput to sustain 2-2.5 software clients before the latency spike.
A concept of grouping FPGAs into service pools, coordinated by:
❖ Resource Manager (RM)
❖ Service Manager (SM)
❖ FPGA Manager (FM)