(Re)Configurable Clouds and the Dawn of a New Era
Doug Burger @ Microsoft Research NExT FPL Keynote August 30, 2016
            Client    Cloud
Training    Humans    GPUs
Inference   ASICs     ?
2.4+ million emails per day
1+ billion customers · 20+ million businesses · 90+ markets worldwide
5.8+ billion worldwide queries each month
1 in 4 enterprise customers
50+ billion minutes of connections handled each month
48+ million users in 41 markets
50+ million active users
400+ million active accounts
250+ million active users
8.6+ trillion (storage)
Efficiency (ASICS) Homogeneity
Ethernet
8
cores (2 sockets)
Complex ALU Ln, ÷, div Basic Tile Basic Tile Basic Tile Basic Tile
Registers Constants FFE 1 Inst. FFE n Inst. Compression Thresholds…
Local ALU DSP DSP Scheduling Logic Distribution latches Control/Data Tokens Feature Transmission Network Stream Preprocessing FSM
FE FFE DTS
FE0: 89 Non-BodyBlock Features, 34 State Machines, 55% Utilization
FE1: 55 BodyBlock Features, 20 State Machines, 45% Utilization
[Figure: tile layout, DTT [0..3][0..11] (48 tiles) and FFE [0..1][0..3] (8 tiles)]
FFE: 64 cores / chip, 256-512 threads
DTT: 48 DTT tiles / chip, 240 tree processors, 2880 trees / chip
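The per-chip totals above are mutually consistent, which a quick arithmetic check makes visible. This is only a sketch: the totals come from the slide, while the per-core, per-tile, and per-processor ratios are inferred here by division and are not stated in the deck.

```python
# Totals as stated on the slide.
FFE_CORES_PER_CHIP = 64
FFE_THREADS_RANGE = (256, 512)
DTT_TILES_PER_CHIP = 48
TREE_PROCESSORS_PER_CHIP = 240
TREES_PER_CHIP = 2880

# Derived ratios (assumptions obtained by division, not from the slide).
threads_per_core = tuple(t // FFE_CORES_PER_CHIP for t in FFE_THREADS_RANGE)
processors_per_tile = TREE_PROCESSORS_PER_CHIP // DTT_TILES_PER_CHIP
trees_per_processor = TREES_PER_CHIP // TREE_PROCESSORS_PER_CHIP
trees_per_tile = TREES_PER_CHIP // DTT_TILES_PER_CHIP

if __name__ == "__main__":
    print("threads per FFE core:", threads_per_core)
    print("tree processors per DTT tile:", processors_per_tile)
    print("trees per tree processor:", trees_per_processor)
    print("trees per DTT tile:", trees_per_tile)
```

All four divisions come out exact, so the totals are at least internally coherent.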
Stratix V · 8GB DDR3 · PCIe Gen3 x8
memory across server boundaries
[Figure: Top Of Rack Switch (TOR) with Server 1 through Server 48, each server with its own FPGA]
[Figure labels: DTWS (x12); S0 (x6), S1 (x2), S2 (x4)]
10Gb Ethernet Links FPGA Torus
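To make the torus wiring concrete, here is a minimal sketch of neighbor and hop-count calculations for the 48 FPGAs in a rack. It assumes a 6 x 8 two-dimensional torus with 0-based, row-major node numbering; neither the dimensions nor the numbering appear on this slide, and the function names are hypothetical.

```python
ROWS, COLS = 6, 8  # assumed layout: 6 * 8 = 48 FPGAs per rack


def torus_neighbors(node, rows=ROWS, cols=COLS):
    """Return the four wrap-around neighbors of a node in a 2D torus."""
    r, c = divmod(node, cols)
    return {
        "north": ((r - 1) % rows) * cols + c,
        "south": ((r + 1) % rows) * cols + c,
        "west": r * cols + (c - 1) % cols,
        "east": r * cols + (c + 1) % cols,
    }


def torus_hops(a, b, rows=ROWS, cols=COLS):
    """Minimum hop count between two nodes, using wrap-around links."""
    ra, ca = divmod(a, cols)
    rb, cb = divmod(b, cols)
    dr = abs(ra - rb)
    dc = abs(ca - cb)
    return min(dr, rows - dr) + min(dc, cols - dc)


if __name__ == "__main__":
    print(torus_neighbors(0))
    print(torus_hops(0, 47))
```

Under these assumptions every FPGA has exactly four links, and opposite corners of the grid are only two hops apart because both dimensions wrap around.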
CPU CPU FPGA NIC DRAM DRAM DRAM
WCS 2.0 Server Blade Catapult V2
Gen3 2x8 Gen3 x8 QPI Switch QSFP QSFP QSFP 40Gb/s 40Gb/s
WCS Gen4.1 Blade with Mellanox NIC and Catapult FPGA
Pikes Peak WCS Tray Backplane
Option Card Mezzanine Connectors
Catapult v2 Mezzanine card
Shell
Health & Config DRAM PCIe DMA Network
Shell Package
HEX LTL Encryption JPEG compression Queue Manager Rams FIFO Flight Recorder
Hardware Acceleration Platform Development HW Library Package(s) Shell
Libs
Catapult Kernel Driver Software Dev Kit & Runtime Library SDK Package Role HW API
Catapult Runtime Library
Operation Factory & Integration Test Suites FPGA Watchdog
(AutoPilot Integration)
DTS LZMA
Driver Deployment
CloudBuild FPGA Flow
License Servers
Product SW CSI: Board & SKU Qualification Process
OpenCL BSP
Golden Image SVC Verification Team OpenCL/HLS Compiler
[Figure: 99.9% Query Latency versus Queries/sec; series: software, FPGA]
[Figure: HW vs. SW Latency and Load; series: average software load, 99.9% software latency, 99.9% FPGA latency, average FPGA query load]
SLB Decap SLB NAT VNET ACL Metering
Rule Action Rule Action Rule Action Rule Action Rule Action Rule Action Decap * DNAT * Rewrite * Allow * Meter *
SmartNIC
VFP
VMSwitch VM
SR-IOV (Host Bypass)
50G
QoS
Crypto RDMA
Flow: 1.2.3.1->1.3.4.1, 62362->80
Action: Decap, DNAT, Rewrite, Meter
GFT
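The rule layers and the single flow entry above suggest a match-action pipeline whose composed result is cached per flow. A minimal sketch, assuming the first packet of a flow is evaluated against every layer and the resulting action list is then stored under the flow key. The `Layer` and `FlowTable` names are hypothetical and are not the VFP or GFT API; dropping "Allow" from the cached list is also an assumption, made only to mirror the entry shown on the slide.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# (src_ip, dst_ip, src_port, dst_port); an assumed flow key shape.
FlowKey = Tuple[str, str, int, int]


@dataclass
class Layer:
    name: str
    match: Callable[[FlowKey], bool]  # the "Rule" column; "*" matches all
    action: str                       # the "Action" column


def wildcard(_key: FlowKey) -> bool:
    return True


LAYERS = [
    Layer("SLB Decap", wildcard, "Decap"),
    Layer("SLB NAT", wildcard, "DNAT"),
    Layer("VNET", wildcard, "Rewrite"),
    Layer("ACL", wildcard, "Allow"),
    Layer("Metering", wildcard, "Meter"),
]


class FlowTable:
    """Caches the composed action list per flow (hypothetical sketch)."""

    def __init__(self, layers: List[Layer]):
        self.layers = layers
        self.cache: Dict[FlowKey, List[str]] = {}
        self.slow_path_lookups = 0

    def lookup(self, key: FlowKey) -> List[str]:
        actions = self.cache.get(key)
        if actions is None:
            # Slow path: walk every layer once for a new flow.
            self.slow_path_lookups += 1
            matched = [l.action for l in self.layers if l.match(key)]
            # "Allow" is treated as a verdict, not a transformation.
            actions = [a for a in matched if a != "Allow"]
            self.cache[key] = actions
        return actions


if __name__ == "__main__":
    table = FlowTable(LAYERS)
    flow = ("1.2.3.1", "1.3.4.1", 62362, 80)
    print(table.lookup(flow))  # first packet: slow path
    print(table.lookup(flow))  # later packets: cached entry
```

Only the first lookup for a flow walks the layers; later packets reuse the cached entry, which is the part that would be a natural candidate for offload.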
[Figure: switches SP0 to SP3, CS0 to CS3, and two ToRs; each ToR sits above four FPGA / NIC / Server stacks]
their own UDP packets
communication (LTL)
network primitives
up other opportunities
L0 L1/L2
ToR ToR CS CS ToR ToR Bing Ranking SW
HPC
Bing Ranking HW
Speech to text Large-scale deep learning
many FPGAs as needed (up to thousands)
layer model
FPGA execution
[Figure: ToR (x4), CS (x4), ToR (x4)]
CPUs FPGAs DRAM