Data Centre Acceleration
Monica Qin Li Aaron Chelvan Sijun Zhu
Background
○ 2019: Data centre traffic will reach 10.4 zettabytes (annual growth rate of ~25%)
○ 83% of traffic will come from the cloud
○ 80% of workloads will be processed in the cloud
○ CPU clock speeds must be scaled down in order to comply with thermal constraints
○ 25× better performance per watt
○ 50-75× latency improvement
○ Offline batch processing: high volumes of data, involving several complicated processes
  ■ Large amounts of data are offloaded to the FPGA, minimising the overhead of communication between the FPGA and the processor.
○ Online streaming processing: smaller volumes of streaming data, involving simpler processing
  ■ FPGAs can be used to offload both the NIC and the actual processing of data packets.
Issues to Overcome for FPGA-based Accelerators
○ Virtualisation and partitioning of FPGAs
○ Configuration of FPGAs
○ Scheduling of hardware accelerators
Virtualised Hardware Accelerators (University of Toronto)
○ The FPGA fabric is partitioned into regions (Virtualised FPGA Resource - VFR).
○ Resources can be allocated across multiple FPGAs
○ A collection of partial bitstreams is passed to the agent
○ VM as resource
○ FPGA as resource
Reconfigurable Cloud Computing Environment (Technical University of Dresden)
○ Tracks the FPGAs and their allocation status
○ Supports both batch and streaming data
while reducing costs by 70%.
○ Provides commonly used tasks, e.g. term frequency, fuzzy search
○ Limit power draw to under 25 W → PCIe bus provided all necessary power
○ Limit to under 20 W → keeps the increase in power consumption below 10%
○ Low latency (< 10 μs to transfer 16 KB)
○ Safe for multithreading
○ 1 input buffer and 1 output buffer in user-level memory
○ The FPGA is given base pointers to those buffers
○ Give each thread exclusive access to 1 or more sections
○ Each section is 64 KB
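The section scheme above can be sketched in software; a minimal toy (class and method names are ours, and the real interface hands the FPGA base pointers into these buffers rather than Python views):

```python
import threading

SECTION_SIZE = 64 * 1024  # 64 KB per section, as on the slide

class BufferPool:
    """One large user-level buffer split into 64 KB sections,
    each claimed exclusively by a single thread."""

    def __init__(self, num_sections):
        self.buf = bytearray(num_sections * SECTION_SIZE)
        self.free = list(range(num_sections))
        self.owner = {}               # section index -> owning thread id
        self.lock = threading.Lock()  # makes claiming safe for multithreading

    def claim(self, thread_id, count=1):
        """Give `thread_id` exclusive access to `count` sections."""
        with self.lock:
            sections = [self.free.pop() for _ in range(count)]
            for s in sections:
                self.owner[s] = thread_id
        # Each view is a non-overlapping 64 KB window into the shared buffer.
        return [memoryview(self.buf)[s * SECTION_SIZE:(s + 1) * SECTION_SIZE]
                for s in sections]
```

Because no two threads ever own the same section, they can fill their windows without further locking.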
○ Balances routability with cabling complexity
○ Microsecond-scale latency between FPGAs
Source: https://en.wikipedia.org/wiki/Torus_interconnect
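In a torus, each row and column wraps around, so every node has exactly four links; a quick sketch for the 6×8 case (the function name is ours):

```python
ROWS, COLS = 6, 8  # one rack: a 6x8 torus of FPGAs

def neighbours(r, c):
    """The four torus neighbours of node (r, c); edges wrap around,
    which keeps link count constant but makes cabling non-trivial."""
    return [((r - 1) % ROWS, c), ((r + 1) % ROWS, c),
            (r, (c - 1) % COLS), (r, (c + 1) % COLS)]
```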
○ Check the front-end cache
○ On a cache miss, the query is sent to a Top-Level Aggregator (TLA)
○ Selection service: finds documents that match the search query
○ Ranking service: ranks the documents
○ Return the search results
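The query path above can be modelled as a short pipeline; a toy sketch (the `selection` and `ranking` callables stand in for the real services):

```python
def handle_query(query, cache, selection, ranking):
    """Toy model of the query path: cache, then selection, then ranking."""
    if query in cache:                 # front-end cache hit
        return cache[query]
    docs = selection(query)            # selection: find matching documents
    results = sorted(docs, key=lambda d: ranking(query, d), reverse=True)
    cache[query] = results             # fill the cache for repeat queries
    return results
```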
○ Feature Extraction (FE)
○ Free Form Expressions (FFE)
○ Document Scoring
○ Documents flow from the selection service through the chain of FPGAs
○ Many cores process the same document simultaneously
○ Multiple instruction, single data (MISD)
○ The incoming document stream is split into control and data signals
○ Features are extracted and sent onwards
FFEs
1. Each core supports 4 threads
2. Threads are prioritised based on expected latency
3. Long-latency operations can be split between multiple FPGAs
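The thread arrangement above can be toy-modelled as a priority queue keyed on expected latency (the slide does not say which direction the priority runs; this sketch arbitrarily dispatches the longest-expected-latency thread first, and the names are ours):

```python
import heapq

THREADS_PER_CORE = 4  # each core supports 4 threads (point 1 above)

class FfeScheduler:
    """Toy dispatcher: ready threads are popped by expected latency."""

    def __init__(self):
        self.ready = []  # max-heap via negated latency

    def submit(self, expected_latency, thread_id):
        heapq.heappush(self.ready, (-expected_latency, thread_id))

    def next_thread(self):
        """Dispatch the thread with the longest expected latency."""
        return heapq.heappop(self.ready)[1]
```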
Document scoring: features and free form expressions feed a machine learning model, which produces a floating-point score.
○ (the software implementation does not do this)
○ Compression has minimal impact
○ Queries servers to check their status
○ If unresponsive: soft boot → hard boot → flag for manual service
○ If responsive: return info about the FPGA:
  ■ PCIe errors
  ■ Errors for inter-FPGA network communication
  ■ DRAM status
  ■ Whether a temperature shutdown occurred
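The escalation policy above (soft boot, then hard boot, then flag for manual service) can be sketched as follows; the `server` interface here is hypothetical:

```python
def check_server(server):
    """Escalate until the server responds or all reboots are exhausted."""
    if server.responsive():
        return server.fpga_status()   # PCIe, network, DRAM, thermal info
    for reboot in (server.soft_boot, server.hard_boot):
        reboot()
        if server.responsive():
            return server.fpga_status()
    return "flag-for-manual-service"
```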
○ Manages the roles of the FPGAs
○ Performs reconfigurations
○ An FPGA being reconfigured may send garbage data, so a "TX halt" signal is sent to its neighbours, telling them to temporarily ignore any data received
solution
v2 System
v1 System
Supports dual 40 Gbps LAN ports
production data.
○ Local Acceleration
○ Network Acceleration
○ Remote Acceleration
1. The secondary network (6×8 torus) required expensive and complex cabling.
2. Each FPGA needs full awareness of the physical location of the other FPGAs.
3. Failure handling of the torus requires complex re-routing
   a. Performance loss
   b. Isolation of nodes
4. Limited scalability for direct communication
   a. One rack → 48 nodes
5. The FPGA can be used for accelerating applications, but has limited ability to enhance datacenter infrastructure.
   a. FPGA devices share the same network topology as the server itself
a. Allows it to accelerate high-bandwidth network flows
a. Gives it the capability for local acceleration
packets, and communicates using LTL (Lightweight Transport Layer)
a. Every FPGA can reach every other one within a small number of microseconds, even at hyperscale
○ Allows for simple bypassing, rather than wasting FPGA resources on implementing NIC logic
○ A buggy application may cut off the network coming into its own server
○ A failure in one server does not affect the other servers
○ Servers have automatic health checks; an unresponsive server will be rebooted, so the proper FPGA image (the golden image) is restored
Shell Architecture
○ Board-specific logic
○ Intra-communication interface for roles
○ Inter-communication interface for roles
Resource Usage
44% of the FPGA is used to support all shell functions, leaving enough space for the role(s) to provide a large speedup.
supporting ER and LTL
○ All server communication passes through the FPGA while the FPGA is accelerating
○ No interaction between pass-through traffic and acceleration traffic
thousands of servers.
when throughput is increased by 2.25×
○ I.e. more than half of the servers can be freed for other uses
FPGA query latency is observed to be lower and more stable.
never exceeds the software datacenter at any load.
Matching line rate for the AES-CBC-128-SHA1 algorithm takes up to 15 CPU cores.
However, FPGA processing can be more easily pipelined.
○ Since the FPGA is tightly coupled to the network, it can react quickly
○ LTL uses ACK/NACK to avoid unnecessary timeouts in the event of re-ordering
○ LTL shows very promising performance
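The ACK/NACK idea can be sketched as a sender that rewinds immediately on a NACK instead of waiting for a timeout; this is a toy model under our own assumptions (LTL's real frame format and semantics are not given on the slides):

```python
def ltl_send(frames, channel):
    """Send frames in order; `channel(seq, frame)` returns 'ACK',
    or a ('NACK', expected_seq) tuple naming the next expected frame."""
    seq = 0
    while seq < len(frames):
        reply = channel(seq, frames[seq])
        if reply == 'ACK':
            seq += 1            # advance on ACK
        else:
            seq = reply[1]      # rewind at once; no timeout needed
```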
Remote Acceleration
Uses the same page ranking algorithm as before.
acceleration has no major impact
acceleration speed up.
viable choice
network
○ An FPGA can serve 2-2.5 software clients (running at high speed, multiple times production speed) before latency spikes
potential without affecting the performance of the datacenter
paper
System speedup: defined as the speedup achieved for a specific task using the hardware accelerator, including communication overhead
Energy efficiency: the energy consumption of the hardware accelerator compared to executing the same task in software
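The two metrics can be written out explicitly; a small sketch (variable names are ours):

```python
def system_speedup(t_software, t_accelerator, t_communication):
    """Speedup for a task, counting accelerator-host communication time."""
    return t_software / (t_accelerator + t_communication)

def energy_efficiency_gain(e_software, e_accelerator):
    """How many times less energy the accelerator uses for the same task."""
    return e_software / e_accelerator
```

For example, a task that takes 100 ms in software and 8 ms on the accelerator with 2 ms of transfer overhead has a system speedup of 10×.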
Overall, it is shown that the hardware accelerators can achieve an order of magnitude better energy efficiency compared to the typical server processors.
Architectures that did not report energy efficiency
(grouped by application)