
A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services

1. Overview
2. Challenges and Solution
3. Introduction to FPGA
4. Requirement and Architecture
5. Infrastructure and Platform Architecture
   5.1 Debugging Support
   5.2 Failure Detection and Recovery
   5.3 Correct Operation
   5.4 Software Infrastructure
6. Application Case Study
   6.1 Macro-pipeline
   6.2 Queue Manager and Model Reload
   6.3 Feature Extraction
   6.4 Free Form Expression
7. Evaluation

o Demands for datacenter workloads:
  o High computation capabilities
  o Flexibility
  o Power efficiency
  o Low cost
o CHALLENGE: Hard to improve all factors simultaneously.

o Composable, reconfigurable fabric to accelerate portions of large-scale software services.
o One fabric consists of:
  o (a) A 6x8 2-D torus of high-end Stratix V FPGAs.
  o (b) Embedded into a half-rack of 48 machines.
  o (c) Each server has one FPGA.
  o (d) Each FPGA is wired to other FPGAs with a pair of 10 Gb SAS cables.
  o (e) Each FPGA is accessed through PCIe.
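The wrap-around wiring of a 6x8 2-D torus can be sketched in Python as below. This is a minimal illustration only: the (row, column) addressing, the choice of 6 rows by 8 columns, and the name `torus_neighbors` are assumptions, not details from the presentation.

```python
ROWS, COLS = 6, 8  # assumed orientation of the 6x8 torus: 48 FPGAs, one per server


def torus_neighbors(row, col, rows=ROWS, cols=COLS):
    """Return the four neighbors of (row, col), wrapping around at the edges."""
    return {
        "north": ((row - 1) % rows, col),
        "south": ((row + 1) % rows, col),
        "west": (row, (col - 1) % cols),
        "east": (row, (col + 1) % cols),
    }
```

Because of the modulo wrap, a corner node such as (0, 0) still has four neighbors, which is what distinguishes a torus from a plain mesh.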

o An FPGA is a universal chip.
o Initially it does not have any intended logic.
o An FPGA can be configured to act as a microcontroller or a digital signal processor.
o Components:
  o Contains a large number of configurable logic blocks (CLBs).
  o A CLB can implement any basic logic function.

o Components:
  o Multiple CLBs can be configured together to perform a complex digital function.
  o Each CLB contains flip-flops and lookup tables.
  o Input/output blocks can be programmed to act as input or output ports.
  o Input/output blocks can be connected to the internal interconnect matrix.

o Larger datacenters need homogeneity to reduce management issues.
o Datacenters evolve rapidly.
o Non-programmable hardware is not sufficient.
o SOLUTION:
  o Field Programmable Gate Arrays (FPGAs)
  o Use FPGAs as compute accelerators.

Requirement And Architecture
o Challenges with FPGAs:
  • Standard FPGA reconfiguration is slow at run-time.
  • Multiple FPGAs per server cost more and consume more power.
  • A single FPGA per server limits how much of a workload can be accelerated.

Requirement And Architecture
o Architecture:
  o Each half-rack consists of 48 servers.
  o A medium-size FPGA and local DRAM for each server.
  o FPGAs are directly wired to each other.

o Robust software stack for failure detection.
o Three categories of infrastructure:
  o API for interfacing software with the FPGA.
  o Interface between FPGA application logic and board-level functions.
  o Support for resilience and debugging.

o Flight Data Recorder
  o Captures important information about the FPGA at run-time.
  o Data is initially stored in on-chip memory.
  o During health checks, it is streamed out.
  o Circular buffer: records head and tail flits of network packets.
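The circular-buffer behavior of the flight data recorder can be sketched in Python as follows. The class and method names are hypothetical, and clearing the buffer on stream-out is an assumption made for the sketch; the presentation only states that data is kept on-chip and streamed out during health checks.

```python
from collections import deque


class FlightDataRecorder:
    """Keep only the most recent events, like an on-chip circular buffer."""

    def __init__(self, capacity):
        # A bounded deque silently drops the oldest entry when full.
        self._events = deque(maxlen=capacity)

    def record(self, event):
        """Store one event, e.g. the head or tail flit of a packet."""
        self._events.append(event)

    def stream_out(self):
        """Return buffered events oldest-first, then clear (assumed behavior)."""
        events = list(self._events)
        self._events.clear()
        return events
```

With a capacity of 3, recording five events leaves only the last three available at the next health check, which is the property that makes the recorder cheap enough to run all the time.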

Debugging support
o Useful to debug:
  o Rare deadlock events.
  o Untested inputs resulting in hangs.
  o Server reboots.
  o Unreliable SL3 links.

o Design goals for communication between the FPGA and the host CPU:
  o The interface must incur low latency.
  o The interface must be safe for multithreading.
o The FPGA is given a pointer to a user-space buffer.
o The buffer space is divided into 64 slots.
o Each thread has exclusive access to its slots.
o To send data to the FPGA, a thread fills a slot and sets a flag.
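The slot scheme above can be modeled in software as in the Python sketch below. It is only a model: `SlotBuffer`, `claim`, `send` and `fpga_poll` are hypothetical names, and the FPGA side is simulated by a plain method rather than real DMA. The point it illustrates is that exclusive slot ownership lets threads send without taking a lock on the data path.

```python
import threading

NUM_SLOTS = 64  # from the slide: the buffer space is divided into 64 slots


class SlotBuffer:
    def __init__(self, num_slots=NUM_SLOTS):
        self.data = [None] * num_slots
        self.full = [False] * num_slots   # per-slot flag read by the FPGA side
        self._owner = {}                  # slot index -> owning thread id
        self._lock = threading.Lock()     # guards only the ownership table

    def claim(self, slot):
        """Give the calling thread exclusive ownership of a slot."""
        with self._lock:
            if slot in self._owner:
                raise ValueError(f"slot {slot} is already owned")
            self._owner[slot] = threading.get_ident()

    def send(self, slot, payload):
        """Fill an owned slot, then set its flag; no lock is needed here."""
        if self._owner.get(slot) != threading.get_ident():
            raise PermissionError(f"slot {slot} is not owned by this thread")
        if self.full[slot]:
            raise RuntimeError(f"slot {slot} has not been consumed yet")
        self.data[slot] = payload
        self.full[slot] = True  # set the flag last, after the data is in place

    def fpga_poll(self):
        """Stand-in for the FPGA: consume every full slot and clear its flag."""
        consumed = []
        for index, is_full in enumerate(self.full):
            if is_full:
                consumed.append((index, self.data[index]))
                self.full[index] = False
        return consumed
```

A thread would call `claim` once at start-up and then `send` repeatedly on its own slots.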

o Monitoring servers notice unresponsive servers.
o The health monitor contacts each machine to get its status.
o For an unresponsive machine, it executes a sequence of soft reboot, hard reboot, then manual intervention.
o A healthy server sends back the status of its local FPGA.

Failure Detection And Recovery
o The health monitor updates the list of failed machines.
o The mapping manager moves the application.
o The move is based on the location and type of failure.
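The escalation sequence used by the health monitor can be sketched in Python as below. The function `recover` and its two callback parameters are hypothetical; the presentation names only the three steps and their order.

```python
RECOVERY_STEPS = ["soft reboot", "hard reboot", "manual intervention"]


def recover(machine, is_responsive, apply_step):
    """Escalate through the recovery steps until the machine responds.

    is_responsive(machine) -> bool and apply_step(machine, step) are
    caller-supplied callbacks (assumed for this sketch).
    Returns the list of steps that were applied.
    """
    applied = []
    for step in RECOVERY_STEPS:
        if is_responsive(machine):
            break  # machine is back; stop escalating
        apply_step(machine, step)
        applied.append(step)
    return applied
```

A machine that comes back after the hard reboot would yield `["soft reboot", "hard reboot"]`, and a machine that is already healthy yields an empty list.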

o FPGA reconfiguration may cause instability in the system.
o Reasons:
  o A reconfiguring FPGA can appear as a failed PCIe device. This triggers a non-maskable interrupt, bringing instability.
  o A reconfiguring FPGA can send random traffic to its neighbors. This traffic may appear valid.

Correct operation
o Solution:
  o Disable non-maskable interrupts for the specific PCIe device.
  o Send a "TX Halt" message, meaning: ignore all messages until the link is re-established.
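The effect of "TX Halt" on a neighboring FPGA can be sketched in Python as a small receiver model. The message strings `TX_HALT` and `LINK_UP` and the class name are assumptions for illustration; the presentation states only that messages are ignored until the link is re-established.

```python
class LinkReceiver:
    """Receiver side of one inter-FPGA link (simplified model)."""

    def __init__(self):
        self.halted = False
        self.accepted = []

    def on_message(self, msg):
        if msg == "TX_HALT":
            self.halted = True    # neighbor is about to reconfigure
        elif msg == "LINK_UP":
            self.halted = False   # link re-established, resume normal traffic
        elif not self.halted:
            self.accepted.append(msg)
        # Anything else that arrives while halted is dropped, since it may
        # be random traffic from a reconfiguring neighbor.
```

Traffic that arrives between the halt and the link coming back up is discarded even if it happens to look valid.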

o Apart from the application logic, the developer needs to write:
  o Host-to-FPGA communication.
  o Functions required for data marshaling.
o Challenges:
  o Significant burden on the developer.
  o This code also needs to be portable.
o Solution: Partition all programmable logic into two parts:
  o (a) Shell
  o (b) Role

o Solution:
  o Shell
    o Programmable logic common across all applications.
    o The shell consumes 23% of the FPGA.
    o Features:
      o Single-bit error correction and double-bit error detection in the DRAM controller.
      o A scrubber runs continuously to remove configuration errors.

o Software works at the datacenter level and the server level.
o It needs to provide:
  o Correct operation.
  o Failure detection.
  o Recovery and debugging.
o Solution:
  o Mapping Manager
  o Health Monitor

o Used in Bing's ranking engine.
o Overview:
  o If possible, a query is served from the front-end cache.
  o Otherwise, the TLA (top-level aggregator) sends the query to a large number of machines.
  o These machines find matching documents.
  o They send the documents to machines running the ranking service.

Application
o Overview:
  o The ranking service assigns a score to each document.
  o The TLA sorts the scores and generates the result.
o Features: e.g., the number of times a query word occurs in each document.

Application
o Similarly, many features are sent to the machine-learning model.
o The model generates a score.
o FPGAs perform feature computation and machine-learning model evaluation.

o The processing pipeline is divided into macro-pipeline stages.
o The time limit for each macro-pipeline stage is 8 microseconds.
o This is 1,600 FPGA clock cycles.
o Tasks are distributed as follows:
  o 1 FPGA for feature extraction.
  o 2 FPGAs for free form expressions.
  o 1 FPGA for compression.
  o 3 FPGAs to hold machine-learning models.
  o 1 FPGA is a spare in case of machine failure.
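The two stage budgets quoted above imply a clock frequency, and the task distribution implies a pipeline size. The short Python sketch below works both out from the slide's numbers; the variable names are illustrative.

```python
STAGE_BUDGET_US = 8          # time limit per stage, from the slide
STAGE_BUDGET_CYCLES = 1600   # the same limit in FPGA clock cycles, from the slide

# Cycles per microsecond is numerically the clock frequency in MHz.
clock_mhz = STAGE_BUDGET_CYCLES / STAGE_BUDGET_US

fpga_roles = {
    "feature extraction": 1,
    "free form expression": 2,
    "compression": 1,
    "machine-learning models": 3,
    "spare": 1,
}
total_fpgas = sum(fpga_roles.values())

print(f"implied clock: {clock_mhz:.0f} MHz, FPGAs per pipeline: {total_fpgas}")
```

So the budget corresponds to a 200 MHz clock, and one pipeline occupies 8 FPGAs including the spare.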

o Multiple models.
o A model can be selected based on query type, language, etc.
o DRAM holds a queue per model containing all queries for that model.
o The Queue Manager selects a queue and reads its queries.
o It switches queues when the current queue is empty.

Queue Manager and Model Reload
o On switching queues, the Queue Manager sends a "Model Reload" command.
o Model Reload takes less than 250 microseconds.
o This is slow relative to document processing time.
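Because a Model Reload is slow relative to scoring one document, the queue manager drains a queue before switching so that each reload is shared by many queries. The Python sketch below models that policy; `drain_queues` and the action tuples are hypothetical, and the visiting order of the queues is an assumption.

```python
from collections import deque


def drain_queues(queues):
    """Process all queued queries, reloading the model only on a queue switch.

    `queues` maps a model name to a deque of queries for that model.
    Returns the ordered actions taken, either ("reload", model) or
    ("score", model, query).
    """
    actions = []
    for model, queue in queues.items():
        if not queue:
            continue  # nothing waiting for this model, so no reload needed
        actions.append(("reload", model))  # "Model Reload" on switching queue
        while queue:
            actions.append(("score", model, queue.popleft()))
    return actions
```

For three queries spread over two models, this performs two reloads rather than one per query.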

o On the FPGA accelerator, feature extraction runs in parallel.
o Implemented as feature extraction state machines.
o Supports running many state machines in parallel on the same input data.
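The idea of several state machines consuming the same input stream can be sketched in Python as below. The feature used here, the number of times a query word occurs in a document, is the one named earlier in the presentation; the class and function names are hypothetical, and software can only simulate the parallelism with an inner loop.

```python
class TermCounter:
    """Tiny state machine that counts occurrences of one query term."""

    def __init__(self, term):
        self.term = term
        self.count = 0

    def step(self, token):
        if token == self.term:
            self.count += 1


def extract_features(document_tokens, query_terms):
    """Feed every token to every state machine in a single pass."""
    machines = [TermCounter(term) for term in query_terms]
    for token in document_tokens:   # one pass over the shared input stream
        for machine in machines:    # on an FPGA these would run concurrently
            machine.step(token)
    return {machine.term: machine.count for machine in machines}
```

The document is streamed once regardless of how many features are computed, which is the property that makes the hardware version efficient.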

o Mathematical combinations of features.
o Example: adding two features.
o Example: can include complex floating-point operations.
o Custom multicore processor with heavy multithreading support.

Free Form Expression
o Implemented on the FPGA.
o Long-latency expressions are split across multiple FPGAs.
o A single complex FPGA block handles ln, fpdiv, exp and float-to-int.
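A free form expression is simply arithmetic over previously extracted features. The Python sketch below shows an invented expression that exercises the four operations named above (ln, floating-point divide, exp, float-to-int); the formula itself is hypothetical and is not taken from the ranking model.

```python
import math


def free_form_expression(f1, f2):
    """Combine two feature values (hypothetical formula, for illustration)."""
    total = f1 + f2      # simple case: adding two features
    ratio = f1 / f2      # floating-point divide
    value = math.log(total) + math.exp(-ratio)  # ln and exp
    return int(value)    # float-to-int (truncates toward zero)
```

In hardware, the expensive operations in such an expression would go through the shared complex block, while the addition runs on the simple per-core datapath.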

o Node-level experiment:
  o Significant variation in throughput across the stages.
  o Throughput is limited by FE (feature extraction).

o Power consumption compared to GPU is much more than TPUs.
o The same observation was made for datacenters using FPGAs.
o The maximum power overhead of an FPGA on our server is 22.7 W.
